On a Sandy Beach: Cell Boundaries: Defining the Scope of a Cell in Cell-based Architecture

Mike's Notes

Pipi uses several novel cell-like structures in its architecture, which differ from the one described in this article. However, there are some good ideas here.

I went with embracing chaos and complexity. The trick is making it work.

Last Updated

30/03/2025

Cell Boundaries: Defining the Scope of a Cell in Cell-based Architecture

By: Benjamin Cane

Medium: 2/03/2025

Builder of payments systems & open-source contributor. Writing mostly micro-posts on Medium. https://github.com/madflojo

One of the hardest questions when adopting Cell-based Architecture is defining what should and shouldn’t be within a Cell.

Defining a Cell’s boundary is crucial to ensure you’ve encapsulated everything you need while balancing its size and complexity.

Today, I will share my thought process when defining a Cell’s boundary.

Recap on Cell-based Architecture

For those who may have missed my previous posts, the TL;DR on Cell-based Architecture is that instead of building one massive system, split it into isolated groups called Cells.

Doing so will improve the system’s resilience, performance, and scalability.

To catch up, check out my recent Cell-based Architecture post.

Avoiding a Massive Cell

Ideally, every service a single request may traverse or require as a dependency should be within a Cell.

However, in a complex enterprise, a single request may trigger multiple downstream events and rely on numerous systems.

Since every component within a Cell should be tested, deployed, and failed over together, managing such a massive system is impractical. Breaking up this customer journey across multiple Cells is better but requires thought and strategy.

Defining some Terms

Before discussing how boundaries are defined, let’s define some terms that will help ensure we are all speaking the same language.

Platform:

For this post, a Platform is a collection of sub-platforms and/or monolith systems that provide a set of functionalities for a customer journey (e.g., payment processing).

Sub-platform:

For this post, a Sub-platform is a collection of microservices that provide a clearly defined capability (e.g., account or request validations).

Region and Availability Zones:

We will also use the public cloud definitions of Regions and Availability Zones. A Region is a geographical area separated from other Regions by a significant distance. Availability Zones are isolated hosting environments within a Region that are close to each other but do not share resources such as power, physical location, or network.

Principles

When defining boundaries, I apply a set of principles made up of rules and guidelines. While these might sound similar, the difference between a rule and a guideline is how willing I am to break it.

I do not break the rules; they guide my decisions. A guideline is a bias, something I’m willing to ignore if it makes sense and I have good reasons (which should be written down and captured).

Rules

A single Cell does not span regions.

The core concept of Cell-based Architecture is to build multiple isolated systems that are carbon copies of each other rather than a single system that gains its resiliency by spreading across numerous regions.

To ensure availability, we must deploy those carbon copies in multiple Regions, with each Cell acting independently and isolated at the Region level.

A Cell might span multiple Availability Zones within a Region, but it is not required.

Keeping cells within a single availability zone may be prudent if you need low latency.

Of course, this deployment approach leads to more Cells, and therefore more operational overhead, as you still need to ensure you have carbon copies in other regions. But this decision is one of those infamous trade-off moments.

Sub-platforms do not span multiple Cells.

Aside from the performance and failover complexities incurred by a sub-platform spanning multiple cells, keeping a sub-platform within a single cell helps solidify its responsibilities and boundaries.

The idea of a sub-platform is that external systems don’t know the internal workings of the sub-platform, only the external facing interfaces.

Is the sub-platform a microservices design, a monolith, or macroservices? It does not matter.

A sub-platform has a contract, and its owners decide its internal workings.

Extending that internal flexibility across multiple Cells becomes very complex and will likely violate other rules.
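The first two rules can be checked mechanically against a deployment inventory. The following is a minimal sketch, assuming a hypothetical inventory format: the `Cell` class, the `check_layout` function, and the zone and service names are all illustrative and not from any real tool. It treats a sub-platform as "split" when a Cell contains some of its services but not all of them, which still allows carbon copies of the whole sub-platform in other Cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical inventory record for one Cell."""
    name: str
    region: str                    # a Cell belongs to exactly one Region
    availability_zones: frozenset  # one or more zones the Cell runs in
    services: frozenset            # microservices deployed inside the Cell


def check_layout(cells, az_region, sub_platforms):
    """Return a list of rule violations; an empty list means the layout passes.

    az_region maps each Availability Zone to its Region.
    sub_platforms maps each sub-platform name to the set of its services.
    """
    problems = []
    for cell in cells:
        # Rule: a single Cell does not span Regions.
        for az in sorted(cell.availability_zones):
            if az_region.get(az) != cell.region:
                problems.append(f"{cell.name}: zone {az} is not in {cell.region}")
        # Rule: a sub-platform does not span multiple Cells, so a Cell
        # holds either all of a sub-platform's services or none of them.
        for sp_name, members in sorted(sub_platforms.items()):
            present = members & cell.services
            if present and present != members:
                missing = ", ".join(sorted(members - present))
                problems.append(
                    f"{cell.name}: sub-platform {sp_name} is split (missing {missing})"
                )
    return problems


if __name__ == "__main__":
    zones = {"east-1a": "east", "east-1b": "east", "west-1a": "west"}
    platforms = {"validation": frozenset({"account-check", "request-check"})}
    cell = Cell("cell-east", "east", frozenset({"east-1a", "west-1a"}),
                frozenset({"account-check"}))
    for problem in check_layout([cell], zones, platforms):
        print(problem)
```

A check like this could run in a deployment pipeline so that a boundary violation is caught before a new Cell is established.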

Components within a Cell must be tested together.

Cells might be a collection of multiple sub-platforms (more on this later), but the idea of a Cell is that it’s a single isolated unit of processing.

Whenever you deploy new capabilities or establish a new Cell, you want to ensure everything works together as expected. The best way to accomplish this is to test the entire customer journey at the Cell level.

If a customer journey spans multiple cells, testing them together may be a good idea, but ideally, the boundaries should be clear enough that cells are independently testable.

Failover is at a whole cell level.

In a traditional architecture where a platform is a single entity across regions, you’ll see scenarios where a single microservice fails, and traffic to that microservice is routed to another region to service new requests.

While this approach sounds excellent from an availability perspective, it’s problematic.

When a single microservice fails, the latency of a single cross-region call may not break the system, but what happens when multiple services fail? Now, a single customer request may traverse regions numerous times.

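Not only is this a performance killer, but every time you traverse regions, the chances of packet loss, network failures, etc., increase: a classic “death by 1000 microservices” example. Failing over the whole Cell avoids this, because a request that moves to another Cell stays inside that Cell for all of its internal calls.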
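Whole-cell failover can be sketched as a routing decision made once per request, rather than once per microservice call. The sketch below is illustrative only: the function names and the health-map shape are hypothetical, and treating any single unhealthy component as a reason to fail over the entire Cell is an assumption, since the right trigger depends on your system.

```python
def cell_is_healthy(components):
    """A Cell counts as healthy only if it has components and all of them are up.

    components maps a component name to True (up) or False (down).
    Assumption: one failed component takes the whole Cell out of rotation.
    """
    return bool(components) and all(components.values())


def choose_cell(preferred_order, health):
    """Return the first fully healthy Cell so a request never mixes Cells.

    preferred_order lists Cell names, nearest or cheapest first.
    health maps a Cell name to its component health map.
    """
    for cell_name in preferred_order:
        if cell_is_healthy(health.get(cell_name, {})):
            return cell_name
    raise RuntimeError("no healthy Cell available")


if __name__ == "__main__":
    health = {
        "cell-east": {"api": True, "validation": False},  # one service is down
        "cell-west": {"api": True, "validation": True},
    }
    # All traffic moves to cell-west; nothing is routed per microservice.
    print(choose_cell(["cell-east", "cell-west"], health))
```

The design choice here is that the router never sends part of a request to one Cell and part to another, which keeps internal calls local.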

Cell-to-cell communications must use fixed contracts and protocols.

Interactions between sub-platforms must follow fixed contracts and protocols, whether a REST API, a Message Queue, gRPC, or a file.

Changes within a sub-platform are okay; typically (but not always), a single team can implement them across the sub-platform.

However, we should always assume that different teams manage different sub-platforms. Changes between sub-platforms require coordination and communication across multiple teams. The same applies to Cells.

Changes between Cells always require extra coordination and testing, so we should manage them like any other API change (with versioning!).
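As one illustration of a versioned contract, the receiving Cell can require an explicit version on every message and reject anything it does not recognise. This is a minimal sketch under stated assumptions: the `PaymentAccepted` message, its field names, the `schema_version` key, and the single-currency default for version 1 are all hypothetical and chosen only to show the pattern.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentAccepted:
    """Internal representation used by the receiving Cell (hypothetical)."""
    payment_id: str
    amount_minor: int  # amount in minor units, e.g. cents
    currency: str


def decode_payment_accepted(raw):
    """Decode a JSON message, accepting only contract versions we know."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version == 1:
        # Hypothetical v1: no currency field; assume the sender used USD.
        return PaymentAccepted(doc["payment_id"], int(doc["amount"]), "USD")
    if version == 2:
        # Hypothetical v2: field renamed and currency made explicit.
        return PaymentAccepted(
            doc["payment_id"], int(doc["amount_minor"]), doc["currency"]
        )
    # Unknown versions fail loudly instead of being guessed at.
    raise ValueError(f"unsupported schema_version: {version!r}")


if __name__ == "__main__":
    old = '{"schema_version": 1, "payment_id": "p-1", "amount": 500}'
    print(decode_payment_accepted(old))
```

Supporting the old and new versions side by side lets the sending and receiving Cells be released independently while both teams coordinate the cut-over.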

Cell-to-Cell communications can cross Regions, but Cell-internal communications never should.

When defining a Cell boundary, it’s best to assume that calls between Cells may traverse regions. Cell-based Architecture acknowledges that failures will happen. A Cell will need to fail over; when this happens, a request might traverse regions.

While it might be optimal to ensure Cells talk locally within a region, this is an optimization and should not be relied on for performance or resiliency.

When you assume cell-to-cell calls traverse Regions, a new set of non-functional requirements must be considered, such as retries, circuit breakers, latency, etc.
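To make retries and circuit breakers concrete, here is a small sketch of a wrapper for calls that leave the Cell. The class and function names, the thresholds, and the delays are illustrative assumptions, not recommendations; a production system would more likely use an established resilience library or a service mesh feature.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a Cell that keeps failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("remote Cell is failing; call skipped")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(breaker, func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry a cross-Cell call with exponential backoff, respecting the breaker."""
    last_error = None
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep retrying a Cell the breaker has given up on
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A caller would create one breaker per remote Cell and pass the actual request as `func`, for example `call_with_retries(breaker, lambda: client.send(message))`, where `client` is whatever transport the contract between the Cells specifies.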

Internal Cell calls should always be local because failover is at the Cell level, not the microservice level. The whole point of Cell-based architecture is to create isolation; if internal communications traverse regions, it breaks the entire design.

Guidelines

Avoid Dependencies between Cells.

Some customer journeys are expansive, so encapsulating everything into a single cell can be challenging.

Building dependencies between Cells is okay, but as described above, Cell-to-Cell calls may span regions and bring more non-functional concerns, such as increased latency and failures.

It’s easier to manage Cells that do not depend on others, but it’s not always practical.

A Cell should contain a single sub-platform.

Some may argue that this guideline should be a rule, but I don’t see it as that simple.

If you can draw cell boundaries around each sub-platform, that will simplify a lot of your design. However, the practicality of a cell per sub-platform depends on your system’s use case and requirements.

If your customer journey requires low latency, having each sub-platform act as a Cell can impact your performance.

Having too many Cells also creates operational complexity, which can cause just as many issues as you are trying to prevent with Cell-based Architecture.

It’s okay to have a cell contain multiple sub-platforms; you must design it accordingly.

Deploy all components within a Cell together.

I firmly believe that a sub-platform’s components should be deployed together, as this reduces the complexity of releases and testing. Still, I don’t prescribe forcing Cells with multiple sub-platforms to follow this approach.

Deploying the whole Cell as one is a great practice, but it is also okay to do this at a sub-platform level.

Use discretion; this one comes down to operational overhead, tooling, and team maturity.

Use Natural Boundaries to define Cell Boundaries.

A natural boundary is when a customer journey transitions from one type of architecture pattern to another, like a real-time system (REST APIs) to a batch or event-based system (Message Broker).

A real-time system takes a different approach and requires different resilience and performance from an event-based system. These changes in requirements and flow are great places to define Cell boundaries.

This guideline is not a rule, though. Sometimes, you may want to bring multiple sub-platforms and workloads into a single cell, which may mean not defining a boundary at these transition points.

By Example

Let’s explore an example system to help understand the above approach to defining Cells.

In this example, a simple backend API calls multiple microservices (using an orchestrated microservice pattern) and triggers events to a set of event-driven microservices.

Following our guideline of natural boundaries, the most apparent boundary is where the customer journey goes from API calls to event messages.

The event-driven systems that perform post-processing can be one Cell, while the APIs can be another.

Benefits:

Microservice-to-microservice calls are local within each Cell, which improves latency and reduces failure points.
APIs and Event-based systems can have different failover mechanisms.
Establishing these as two sub-platforms makes testing and releasing easier as there is a clear contract, and we can test them independently.

Why it works:

APIs do not have a hard dependency on event-based systems to finalize processing.
The event-based system has different needs and SLAs.
The Cells do not share the same database (a rule in microservices that applies to Cells).
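The split in this example can be sketched in a few lines. The code below is an illustration under stated assumptions: Python's in-process `queue.Queue` stands in for a real message broker, and the function names, event shape, and field names are hypothetical. What it shows is that the API Cell answers the caller as soon as it has published the event, without waiting for the event Cell.

```python
import queue


def handle_api_request(payment_id, events):
    """API Cell: do the synchronous work, publish an event, respond at once."""
    response = {"payment_id": payment_id, "status": "accepted"}
    # The only thing crossing the Cell boundary is this message.
    events.put({"type": "payment.accepted", "payment_id": payment_id})
    return response


def drain_events(events):
    """Event Cell: process whatever is waiting, on its own schedule."""
    processed = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        processed.append(event["payment_id"])  # stand-in for post-processing
    return processed


if __name__ == "__main__":
    broker = queue.Queue()  # stand-in for a message broker between the Cells
    print(handle_api_request("p-1", broker))  # returns before any post-processing
    print(drain_events(broker))
```

Because the API Cell only depends on being able to publish, the event Cell can be slow, paused, or failed over without affecting the API response.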

Final Thoughts

While I covered many principles for defining boundaries, remember that each situation and use case is different. Guidelines are not the same as rules; they can be broken when reasonable.

However, remember the core concepts behind cell-based architecture when ignoring any guidelines.

By creating independent Cells that operate without reliance on a central dependency, you can isolate and reduce the impact and frequency of failures.

Remember these core concepts while focusing on reducing the number of times a request crosses cells, keeping cells small and manageable, and using natural boundaries.

If you do, you’ll have a resilient and performant architecture.