Infrastructure Case Study — May 7, 2026

When Educational Partners International came to Blue Oak Interactive looking for a long-term home for its growing portfolio of web applications, the answer wasn’t adding more servers to a mounting stack of one-off hosting arrangements. The ask was clear: a single infrastructure environment capable of running multiple production applications side-by-side, surviving the loss of an entire AWS data center without human intervention, and giving the team the operational control to deploy, scale, and audit individual properties independently.

The platform we designed and built has been running continuously in production since July 2020, quietly handling failures, routine deploys, and the natural growth of the portfolio it hosts—without scheduled downtime, without rebuilds, and without surprise. This case study walks through the decisions, the architecture, and what five years of production operation looks like.


The Challenge

EPI’s situation is common among organizations that have grown their digital presence organically: a portfolio of web applications that started life on separate servers, each managed separately, backed up separately, and demanding its own attention when something goes wrong. As the portfolio grows, so does the operational overhead.

The requirements were straightforward to name but harder to engineer:

  • No single points of failure. A failed server, a failed data center, or a failed deployment should never wake anyone up. The platform has to absorb the failure and recover on its own.
  • True isolation between applications. Each application should have its own credentials, its own storage, its own security boundaries—sharing infrastructure shouldn’t mean sharing risk.
  • Everything is reproducible. No hand-configured servers that exist only in someone’s memory. Every piece of infrastructure should be defined in code that can be reviewed, versioned, and rebuilt.
  • Adding a new application is routine. Not a migration project. A documented process that takes hours, not weeks.

The Architecture

[Architecture diagram] Multi-AZ architecture: a public edge layer, three availability zones each running control-plane and application servers, and a shared data services tier spanning all three zones.

The environment lives inside a single AWS cloud network, striped across three separate AWS data centers (availability zones) in the same region. This design ensures that any single data center can go offline—for maintenance, for an outage, for any reason—and the applications keep running.
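
To make that concrete, here is a minimal sketch of what a network striped across three availability zones can look like when expressed as code. It uses the AWS CDK in Python purely for illustration; the case study does not name the tooling, so the construct names and subnet layout below are assumptions rather than EPI's actual configuration.

```python
# Illustrative sketch only: one VPC spanning three availability zones,
# with public subnets for the edge layer and private subnets for the
# compute and data tiers. Names and layout are assumptions.
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class PlatformNetworkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.vpc = ec2.Vpc(
            self, "PlatformVpc",
            max_azs=3,  # stripe the network across three data centers
            subnet_configuration=[
                ec2.SubnetConfiguration(name="edge", subnet_type=ec2.SubnetType.PUBLIC),
                ec2.SubnetConfiguration(name="apps", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
                ec2.SubnetConfiguration(name="data", subnet_type=ec2.SubnetType.PRIVATE_ISOLATED),
            ],
        )

app = App()
PlatformNetworkStack(app, "platform-network")
app.synth()
```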

The public-facing layer handles incoming web traffic through AWS load balancers. Every application sits behind this layer, and the load balancer continuously health-checks every running instance. If an application becomes unhealthy, the load balancer stops sending it traffic immediately.
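
As an illustration of that health-checking behavior, the sketch below configures a per-application target group with boto3. The endpoint path, interval, and thresholds are assumptions for the example, not the platform's actual settings.

```python
# Hedged sketch: one target group per application; the load balancer routes
# only to targets passing these checks and stops routing the moment they fail.
import boto3

elbv2 = boto3.client("elbv2")

response = elbv2.create_target_group(
    Name="app-a",                              # hypothetical application name
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC id
    TargetType="ip",
    HealthCheckPath="/healthz",                # hypothetical health endpoint
    HealthCheckIntervalSeconds=15,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)
print(response["TargetGroups"][0]["TargetGroupArn"])
```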

The compute layer uses an orchestrator to schedule and run each application as an isolated container. Applications don’t share a server in the traditional sense—they share a managed pool of compute capacity, but each application runs in its own isolated environment with its own configuration.
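
The case study deliberately does not name the orchestrator, but the idea is the same across them: each application is described by a small, declarative definition that carries its own image, resources, and credentials. The sketch below expresses that with an Amazon ECS task definition purely as an example; every identifier in it is hypothetical.

```python
# Illustrative only: a per-application container definition with its own
# image, resource limits, and secret references. ECS is an assumption here,
# not necessarily the orchestrator this platform runs.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="app-a",
    networkMode="awsvpc",
    requiresCompatibilities=["EC2"],
    executionRoleArn="arn:aws:iam::123456789012:role/app-a-execution",  # hypothetical
    containerDefinitions=[{
        "name": "app-a",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app-a:1.4.2",  # hypothetical
        "cpu": 512,
        "memory": 1024,
        "essential": True,
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        # Credentials come from the application's own secret store, never
        # from a shared location.
        "secrets": [{
            "name": "DATABASE_URL",
            "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:app-a/database-url",
        }],
    }],
)
```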

All infrastructure is defined in code. There are no snowflake servers, no configuration that lives only in one engineer’s memory, no manual changes that need to be documented somewhere and eventually aren’t. The entire environment can be rebuilt from the code that defines it.


Self-Healing: What Fault Tolerance Looks Like in Practice

The most valuable property of this architecture is one you hopefully never notice: when something breaks, it fixes itself.

When a server in one availability zone fails, the orchestrator detects this within its health-check interval—typically seconds—and reschedules any affected applications onto healthy servers in other availability zones. The load balancer stops routing to the failed server as soon as health checks fail. The applications keep serving traffic.
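
Continuing the illustrative ECS framing from above, the service definition below asks for three copies of an application spread across availability zones, which is what makes rescheduling after a failure automatic rather than a manual response. The counts and names are assumptions.

```python
# Hedged sketch: run several replicas and spread them across availability
# zones so the loss of any one zone leaves healthy copies serving traffic.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="platform",                # hypothetical cluster name
    serviceName="app-a",
    taskDefinition="app-a",
    desiredCount=3,
    launchType="EC2",
    placementStrategy=[
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
    ],
)
```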

There is no failover script. There is no on-call playbook for “a server went down.” The redundancy is structural: coordinating servers spread across three data centers, multiple compute nodes per data center each capable of running any tenant’s application, load balancers health-checking across all three data centers, and a database with a synchronous standby in a second data center.
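
The database side of that structural redundancy can also be expressed as code. The sketch below uses the AWS CDK and Amazon RDS for illustration; a single multi_az flag is the kind of declaration that provisions the synchronous standby described above, and the retention value is an assumption.

```python
# Illustrative sketch: a managed database with a synchronous standby in a
# second availability zone. Engine, sizes, and retention are assumptions.
from aws_cdk import App, Stack, Duration
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_rds as rds
from constructs import Construct

class PlatformDataStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # In practice this would reuse the platform VPC from the earlier sketch.
        vpc = ec2.Vpc(self, "Vpc", max_azs=3)
        rds.DatabaseInstance(
            self, "PlatformDb",
            engine=rds.DatabaseInstanceEngine.postgres(
                version=rds.PostgresEngineVersion.VER_15),
            vpc=vpc,
            multi_az=True,                        # synchronous standby, automatic failover
            backup_retention=Duration.days(30),   # assumed retention, declared as code
        )

app = App()
PlatformDataStack(app, "platform-data")
app.synth()
```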

“Fault-tolerant” is a term that often gets used loosely. On this platform it means something specific: EPI has never needed to initiate a failover. The platform has handled every instance failure, every availability zone disruption, and every routine maintenance event on its own.


Application Isolation and Security

Each application on the platform gets its own credentials, its own secret store, and its own security policy. A misconfiguration or compromise in one application cannot expose the credentials or data of any other application sharing the same infrastructure.

Adding a new application to the platform follows a documented, repeatable process:

  1. Create a dedicated secret store for the application’s credentials and configuration.
  2. Write a security policy that restricts access to only that application’s store.
  3. Define the application’s resource requirements, health checks, and deployment configuration.
  4. Load the application’s credentials into its dedicated store.
  5. Provision the application’s database and grant a dedicated account.

That’s it. No new servers. No new network plumbing. No new load balancer configuration. The application joins the existing pool and starts receiving traffic. When an application is decommissioned, the inverse process removes it cleanly and the compute capacity it freed is reclaimed automatically.
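
To give a feel for the first, second, and fourth steps of the process above, here is a sketch using AWS Secrets Manager and IAM purely for illustration; the platform's actual secret store and policy engine are not named in this case study, and every name and value below is hypothetical.

```python
# Hedged sketch of onboarding a new application: a dedicated secret store,
# plus a policy scoped to that store and nothing else.
import json
import boto3

secrets = boto3.client("secretsmanager")
iam = boto3.client("iam")

# Step 1 and 4: create the application's dedicated store and load its credentials.
secret = secrets.create_secret(
    Name="app-b/production",   # hypothetical naming scheme
    SecretString=json.dumps({
        "DATABASE_URL": "postgres://app_b:example@db.internal:5432/app_b",  # placeholder
    }),
)

# Step 2: a policy that can read this application's secrets and no others.
iam.create_policy(
    PolicyName="app-b-secrets-read",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": secret["ARN"],
        }],
    }),
)
```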

This kind of isolation is what separates a shared hosting environment from a true multi-tenant platform. It’s the reason EPI can run wholly distinct properties—each with its own data, its own credentials, its own operational profile—on the same infrastructure without any of them touching the others.


Day-to-Day Operations

A platform is only as good as its operational reality. A few characteristics of how this one runs:

Deployments are non-events. Updates to any application are rolled out without downtime. The orchestrator drains connections from the old version before removing it, ensuring users are never mid-session when a new version goes live.
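
As a sketch of what a zero-downtime rollout looks like in that style, again using ECS and an application load balancer purely for illustration (the service names, revision number, and drain timeout are hypothetical):

```python
# Hedged sketch: roll out a new revision while keeping the current version
# fully in service, and let the load balancer drain in-flight requests
# before old containers are removed.
import boto3

ecs = boto3.client("ecs")
elbv2 = boto3.client("elbv2")

ecs.update_service(
    cluster="platform",
    service="app-a",
    taskDefinition="app-a:42",   # hypothetical new revision
    deploymentConfiguration={"minimumHealthyPercent": 100, "maximumPercent": 200},
)

elbv2.modify_target_group_attributes(
    TargetGroupArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/app-a/0123456789abcdef"   # hypothetical target group
    ),
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "30"}],
)
```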

Infrastructure upgrades are routine. Updating the underlying server software follows a documented, scripted process: take a server out of rotation, upgrade it, bring it back. Because every application is rescheduled automatically, this is invisible to users.
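
The drain step is a single call on most orchestrators; the sketch below shows it for ECS, again purely as an illustration with hypothetical names.

```python
# Hedged sketch: take a compute node out of rotation before maintenance,
# then return it afterwards. The orchestrator reschedules its containers
# onto healthy nodes while it is draining.
import boto3

ecs = boto3.client("ecs")

def drain(instance_arn: str) -> None:
    ecs.update_container_instances_state(
        cluster="platform", containerInstances=[instance_arn], status="DRAINING")

def reactivate(instance_arn: str) -> None:
    ecs.update_container_instances_state(
        cluster="platform", containerInstances=[instance_arn], status="ACTIVE")
```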

Backups are automated and auditable. Database backup schedules are defined in the same infrastructure code that builds the rest of the environment. Retention policies, restore tests, and snapshot schedules are managed as configuration, not manual tasks.
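
The “auditable” half is worth showing: because retention is declared in code (as in the backup_retention line of the earlier database sketch), a small check can confirm the live environment still matches what the code says. The 30-day figure below is an assumption, not EPI’s actual policy.

```python
# Illustrative drift check: confirm every database instance still carries
# at least the retention the infrastructure code declares.
import boto3

rds = boto3.client("rds")

for db in rds.describe_db_instances()["DBInstances"]:
    retention = db["BackupRetentionPeriod"]
    status = "ok" if retention >= 30 else "DRIFT"
    print(f'{db["DBInstanceIdentifier"]}: {retention} days of automated backups [{status}]')
```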

Access is controlled and auditable. There is one documented path for operator access to infrastructure. Adding or revoking access is a configuration change, not a server edit.
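
A sketch of what “access as configuration” can look like, assuming IAM groups purely for illustration; the group name and roster below are hypothetical.

```python
# Hedged sketch: the operator roster lives in version-controlled code, and a
# reconciliation pass makes group membership match it. Adding or revoking
# access is a reviewed one-line change to the roster, not a server edit.
import boto3

iam = boto3.client("iam")

GROUP = "platform-operators"        # hypothetical group
DESIRED = {"alice", "bob"}          # the reviewed, version-controlled roster

current = {u["UserName"] for u in iam.get_group(GroupName=GROUP)["Users"]}
for user in DESIRED - current:
    iam.add_user_to_group(GroupName=GROUP, UserName=user)
for user in current - DESIRED:
    iam.remove_user_from_group(GroupName=GROUP, UserName=user)
```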


What Five Years of Production Operation Taught Us

Looking back over five years of continuous operation, a few architectural decisions have aged particularly well:

  • Treating fault tolerance as structural, not a feature. Redundancy built into the architecture from the first deployment meant that every subsequent change—adding a new application, upgrading infrastructure components—inherited that redundancy automatically. There was no retrofit work because there was nothing to retrofit.
  • Strong per-application isolation from day one. Giving each application its own credentials, its own secret store, and its own security boundary makes the platform genuinely multi-tenant rather than just shared hosting with extra steps. This is what lets EPI run wholly distinct applications on the same infrastructure with confidence.
  • Infrastructure as code, without exceptions. Every piece of this environment can be rebuilt from the configuration that defines it. There are no configuration surprises during maintenance windows. There are no “I wonder what’s actually running on that server” moments when someone is on call.

The platform Blue Oak Interactive built for Educational Partners International has been running continuously since July 2020. It has absorbed instance failures, data center disruptions, version upgrades, and the natural growth of the portfolio it hosts—without scheduled downtime, without manual rescue operations, and without surprise. That’s what fault-tolerant, multi-tenant infrastructure is supposed to feel like: like nothing at all.


Blue Oak Interactive designs and builds cloud infrastructure that stays out of the way. If you’re evaluating hosting architecture or want a second opinion on your current setup, get in touch.