Three Hundred Years Later, a Tool from Isaac Newton Gets an Update

Mike's Notes

A fascinating article published this week in Quanta Magazine about a mathematical technique used to solve complex problems.

Resources

References


Repository

  • Home > Ajabbi Research > Library > Subscriptions > Quanta Magazine
  • Home > Ajabbi Research > Library > Mathematics

Last Updated

31/03/2025

Three Hundred Years Later, a Tool from Isaac Newton Gets an Update

By: Kevin Hartnett
Quanta Magazine: 24/03/2025

Kevin Hartnett was the senior writer at Quanta Magazine covering mathematics and computer science. His work has been collected in multiple volumes of the “Best Writing on Mathematics” series. From 2013 to 2016 he wrote “Brainiac,” a weekly column for the Boston Globe’s Ideas section.

A simple, widely used mathematical technique can finally be applied to boundlessly complex problems.

Every day, researchers search for optimal solutions. They might want to figure out where to build a major airline hub. Or to determine how to maximize return while minimizing risk in an investment portfolio. Or to develop self-driving cars that can distinguish between traffic lights and stop signs.

Mathematically, these problems get translated into a search for the minimum values of functions. But in all these scenarios, the functions are too complicated to assess directly. Researchers have to approximate the minimal values instead.

It turns out that one of the best ways to do this is by using an algorithm that Isaac Newton developed over 300 years ago. This algorithm is fairly simple. It’s a little like searching, blindfolded, for the lowest point in an unfamiliar landscape. As you put one foot in front of the other, the only information you need is whether you’re going uphill or downhill, and whether the grade is increasing or decreasing. Using that information, you can get a good approximation of the minimum relatively quickly.

Although enormously powerful — centuries later, Newton’s method is still crucial for solving present-day problems in logistics, finance, computer vision and even pure math — it also has a significant shortcoming. It doesn’t work well on all functions. So mathematicians have continued to study the technique, figuring out different ways to broaden its scope without sacrificing efficiency.

Last summer, three researchers announced the latest improvement to Newton’s method. Amir Ali Ahmadi of Princeton University, along with his former students Abraar Chaudhry (now at the Georgia Institute of Technology) and Jeffrey Zhang (now at Yale University), extended Newton’s method to work efficiently on the broadest class of functions yet.

“Newton’s method has 1,000 different applications in optimization,” Ahmadi said. “Potentially our algorithm can replace it.”


In the 1680s, Isaac Newton developed an algorithm for finding optimal solutions. Three centuries later, mathematicians are still using and honing his method.

Godfrey Kneller/Public Domain

A Centuries-Old Technique

Mathematical functions transform inputs into outputs. Often, the most important feature of a function is its minimum value — the combination of inputs that produces the smallest possible output.

But finding the minimum is hard. Functions can have dozens of variables raised to high powers, defying formulaic analysis; graphs of their solutions form high-dimensional landscapes that are impossible to explore from a bird’s-eye view. In those higher-dimensional landscapes, said Coralia Cartis of the University of Oxford, “We want to find a valley. Some are local valleys; others are the lowest point. You’re trying to find these things, and the question is: What info do you have to guide you to that?”

In the 1680s, Newton recognized that even when you’re dealing with a very complicated function, you’ll still always have access to at least two pieces of information to help you find its deepest valley. First, you can calculate the function’s so-called first derivative, or slope: the steepness of the function at a given point. Second, you can compute the rate at which the slope itself is changing (the function’s second derivative).

Amir Ali Ahmadi sees optimization problems everywhere he looks.

Archives of the Mathematisches Forschungsinstitut Oberwolfach

Say you’re trying to find the minimum of some complicated function. First, choose a point on the function that you think might be close to the true minimum. Compute the function’s first and second derivatives at that point. These derivatives can be used to construct a special quadratic equation — a parabola if your function lives in a 2D plane, and a cuplike shape called a paraboloid if your function is higher dimensional. This quadratic equation, which is called a Taylor approximation, roughly resembles your function at the point you chose.

Now calculate the minimum of the quadratic equation instead of the original — something you can do easily, using a well-known formula. (That’s because quadratic equations are simple; it’s when equations get more complicated that calculating the minimum becomes prohibitive.) You’ll get a point. Then plug the coordinates of that point back into your original function, and you’ll get a new point on the function that is, hopefully, closer to its true minimum. Start the entire process again.
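To make the procedure concrete, here is a minimal one-dimensional sketch in Python; the test function, its derivatives, and the starting point are illustrative assumptions, not taken from the article. Minimizing the local quadratic Taylor model at each step reduces to dividing the first derivative by the second:

def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=50):
    # Classical Newton iteration for one-dimensional minimization:
    # at each step, build the quadratic Taylor model around x and jump
    # to its minimum, i.e. x_next = x - f'(x) / f''(x).
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Illustrative test function: f(x) = x**4 - 3*x**3 + 2, with a minimum at x = 2.25.
f_prime = lambda x: 4 * x**3 - 9 * x**2
f_double_prime = lambda x: 12 * x**2 - 18 * x
print(newton_minimize(f_prime, f_double_prime, x0=3.0))  # converges to roughly 2.25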

Newton proved that if you keep on repeating this process, you’ll eventually home in on the minimum value of the original, more complicated function. The method doesn’t always work, especially if you start at a point that’s too far away from the true minimum. But for the most part, it does. And it has some desirable attributes.


Mark Belan/Quanta Magazine; Source: arXiv:2305.07512

Other iterative methods, like gradient descent — the algorithm used in today’s machine learning models — converge toward the true minimum at a linear rate. Newton’s method converges toward it much faster: at a “quadratic” rate. In other words, it can identify the minimum value in fewer iterations than gradient descent. (Each iteration of Newton’s method is more computationally expensive than an iteration of gradient descent, which is why researchers prefer gradient descent for certain applications, like training neural networks. But Newton’s method is still enormously efficient, making it useful in all sorts of contexts.)
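To see the difference in iteration counts, here is a small comparison on a simple convex test function; the function, fixed step size, and tolerance are assumptions made purely for illustration:

import math

# f(x) = x**2 + exp(x) is convex; both methods only need its derivatives.
f_prime = lambda x: 2 * x + math.exp(x)
f_double_prime = lambda x: 2 + math.exp(x)

# Gradient descent: slope only, fixed step size, linear convergence.
x, gd_steps = 1.0, 0
while abs(f_prime(x)) > 1e-10 and gd_steps < 100000:
    x -= 0.1 * f_prime(x)
    gd_steps += 1

# Newton's method: slope and curvature, quadratic convergence.
x, newton_steps = 1.0, 0
while abs(f_prime(x)) > 1e-10 and newton_steps < 100:
    x -= f_prime(x) / f_double_prime(x)
    newton_steps += 1

print(gd_steps, newton_steps)  # Newton typically reaches the tolerance in far fewer iterations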

Newton could have written his method to converge toward the true minimum value even faster if, instead of taking just the first and second derivatives at each point, he had also taken, say, the third and fourth derivatives. That would have given him more complicated Taylor approximations, with exponents greater than 2. But the whole crux of his strategy was to transform a complicated function into a simpler one. These more complicated Taylor equations were more than Newton could handle mathematically.

Jeffrey Zhang and his co-authors wiggled functions in just the right way, allowing them to broaden the scope of a powerful optimization technique.

Courtesy of Jeffrey Zhang

“Newton did it for degree 2. He did that because nobody knew how to minimize higher-order polynomials,” Ahmadi said.

In the centuries since, mathematicians have worked to extend his method, to probe how much information they can squeeze out of more complicated Taylor approximations of their functions.

In the 19th century, for instance, the Russian mathematician Pafnuty Chebyshev proposed a version of Newton’s method that approximated functions with cubic equations (which have an exponent of 3). But his algorithm didn’t work when the original function involved multiple variables. Much more recently, in 2021, Yurii Nesterov (now at Corvinus University of Budapest) demonstrated how to approximate functions of any number of variables efficiently with cubic equations. But his method couldn’t be extended to approximate functions using quartic equations, quintics and so on without losing its efficiency. Nevertheless, the proof was a major breakthrough in the field.

Now Ahmadi, Chaudhry and Zhang have taken Nesterov’s result another step further. Their algorithm works for any number of variables and arbitrarily many derivatives. Moreover, it remains efficient for all these cases — something that until now wasn’t possible.

But first, they had to find a way to make a hard math problem a lot easier.

Finding Wiggle Room

There is no fast, general purpose method for finding the minima of functions raised to high exponents. That’s always been the main limitation of Newton’s method. But there are certain types of functions that have characteristics that make them easy to minimize. In the new work, Ahmadi, Chaudhry and Zhang prove that it’s always possible to find approximating equations that have these characteristics. They then show how to adapt these equations to run Newton’s method efficiently.

What properties make an equation easy to minimize? Two things: The first is that the equation should be bowl-shaped, or “convex.” Rather than having many valleys, it has just one — meaning that when you try to minimize it, you don’t have to worry about mistaking an arbitrary valley for the lowest one.

Abraar Chaudhry and two colleagues recently found a way to improve a centuries-old method for finding the minima of functions.

Camille Carpenter Henriquez

The second property is that the equation can be written as a sum of squares. For example, 5x² + 16x + 13 can be written as the sum (x + 2)² + (2x + 3)². In recent years, mathematicians have developed techniques for minimizing equations with arbitrarily large exponents so long as they are both convex and a sum of squares. However, those techniques were of little help when it came to Newton’s method. Most of the time, the Taylor approximation you use won’t have these nice properties.
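As a quick sanity check of that sum-of-squares identity, a short sympy snippet (added here for illustration, not part of the research) confirms the two expressions are the same polynomial:

import sympy as sp

x = sp.symbols("x")
original = 5 * x**2 + 16 * x + 13
sum_of_squares = (x + 2)**2 + (2 * x + 3)**2
print(sp.expand(sum_of_squares - original))  # prints 0, so the two forms agree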

But Ahmadi, Chaudhry and Zhang figured out how to use a technique called semidefinite programming to wiggle the Taylor approximation just enough to make it both a sum of squares and convex, though not so much that it became unmoored from the original function it was supposed to resemble.

They essentially added a fudge factor to the Taylor expansion, turning it into an equation that had the two desired properties. “We can change the Taylor expansion a bit to make it simpler to minimize. Think of the Taylor expansion, but modified a little bit,” Ahmadi said. He and his colleagues then showed that, using this modified version of the Taylor expansion — which involved arbitrarily many derivatives — their algorithm would still converge on the true minimum of the original function. Moreover, the rate of convergence would scale with the number of derivatives used: Just as using two derivatives allowed Newton to approach the true minimum at a quadratic rate, using three derivatives enabled the researchers to approach it at a cubic rate, and so on.

Ahmadi, Chaudhry and Zhang had created a more powerful version of Newton’s method that could reach the true minimum value of a function in fewer iterations than previous techniques.

Like the original version of Newton’s method, each iteration of this new algorithm is still computationally more expensive than methods such as gradient descent. As a result, for the moment, the new work won’t change the way self-driving cars, machine learning algorithms or air traffic control systems work. The best bet in these cases is still gradient descent.

“Many ideas in optimization take years before they are made fully practical,” said Jason Altschuler of the University of Pennsylvania. “But this seems like a fresh perspective.”

If, over time, the underlying computational technology needed to run Newton’s method becomes more efficient — making each iteration less computationally expensive — then the algorithm developed by Ahmadi, Chaudhry and Zhang could eventually surpass gradient descent for all sorts of applications, including machine learning.

“Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.

Correction: March 25, 2025

The graphic in this article has been updated.

Cell Boundaries: Defining the Scope of a Cell in Cell-based Architecture

Mike's Notes

Pipi uses several different novel cell-like structures in its architecture, unlike the one described in this article. However, there are some good ideas here.

I went with embracing chaos and complexity. The trick is making it work.

Resources

References


Repository

  • Home > 

Last Updated

30/03/2025

Cell Boundaries: Defining the Scope of a Cell in Cell-based Architecture

By: Benjamin Cane
Medium: 2/03/2025

Builder of payments systems & open-source contributor. Writing mostly micro-posts on Medium. https://github.com/madflojo

One of the hardest questions when adopting Cell-based Architecture is defining what should and shouldn’t be within a Cell.

Defining a Cell’s boundary is crucial to ensure you’ve encapsulated everything you need while balancing its size and complexity.

Today, I will share my thought process when defining a Cell’s boundary.

Recap on Cell-based Architecture

For those who may have missed my previous posts, the TL;DR on Cell-based Architecture is that instead of building one massive system, split it into isolated groups called Cells.

Doing so will improve the system’s resilience, performance, and scalability.

To catch up, check out my recent Cell-based Architecture post.

Avoiding a Massive Cell

Ideally, every service a single request may traverse or require as a dependency should be within a Cell.

However, in a complex enterprise, a single request may trigger multiple downstream events and rely on numerous systems.

Since every component within a Cell should be tested, deployed, and failed-over together, managing such a massive system is impractical. Breaking up this customer journey across multiple Cells is better but requires thought and strategy.

Defining some Terms

Before discussing how boundaries are defined, let’s define some terms that will help ensure we are all speaking the same language.

Platform:

For this post, a Platform is a collection of sub-platforms and/or monolith systems that provide a set of functionalities for a customer journey (e.g., payment processing).

Sub-platform:

For this post, a Sub-platform is a collection of microservices that provide a clearly defined capability (e.g., account or request validations).

Region and Availability Zones:

We will also use the public cloud definitions of Regions and Availability Zones: a Region is a geographical area separated from other Regions by significant distance, and Availability Zones are isolated hosting environments within a Region that sit close to one another but do not share resources such as power, physical facilities, or network.

Principles

When defining boundaries, I often apply a set of principles made of rules and guidelines. While these might sound similar, the difference between a rule and a guideline is how much I’m willing to break the guidance.

I do not break rules; they are firm constraints on my decisions. A guideline is a bias, something I’m willing to ignore if it makes sense and I have good reasons (which should be written down and captured).

Rules

  • A single Cell does not span regions.

The core concept of Cell-based Architecture is to build multiple isolated systems that are carbon copies of each other rather than a single system that gains its resiliency by spreading across numerous regions.

To ensure availability, we must deploy those carbon copies in multiple Regions, with each Cell acting independently and isolated at the Region level.

A Cell might span multiple Availability Zones within a Region, but it doesn’t have to.

Keeping cells within a single availability zone may be prudent if you need low latency.

Of course, this deployment approach adds operational overhead and leads to more Cells, since you still need carbon copies in other Regions. But this decision is one of those infamous trade-off moments.

  • Sub-platforms do not span multiple Cells.

Aside from the performance and failover complexities incurred by a sub-platform spanning multiple cells, keeping a sub-platform within a single cell helps solidify its responsibilities and boundaries.

The idea of a sub-platform is that external systems don’t know the internal workings of the sub-platform, only the external facing interfaces.

Is the sub-platform a microservices design, a monolith, or macroservices? It does not matter.

A sub-platform has a contract, and its owners decide its internal workings.

Extending that internal flexibility across multiple Cells becomes very complex and will likely violate other rules.

  • Components within a Cell must be tested together.

Cells might be a collection of multiple sub-platforms (more on this later), but the idea of a Cell is that it’s a single isolated unit of processing.

Whenever you deploy new capabilities or establish a new Cell, you want to ensure everything works together as expected. The best way to accomplish this is to test the entire customer journey at the Cell level.

If a customer journey spans multiple cells, testing them together may be a good idea, but ideally, the boundaries should be clear enough that cells are independently testable.

  • Failover is at a whole cell level.

In a traditional architecture where a platform is a single entity across regions, you’ll see scenarios where a single microservice fails, and traffic to that microservice is routed to another region to service new requests.

While this approach sounds excellent from an availability perspective, it’s problematic.

When a single microservice fails, the latency of a single cross-region call may not break the system, but what happens when multiple services fail? Now, a single customer request may traverse regions numerous times.

Not only is this a performance killer, but every time you traverse regions, the chances of packet loss, network failures, etc., increase — a classic “death by 1000 microservices” example.

  • Cell-to-cell communications must use fixed contracts and protocols.

Interactions between sub-platforms must follow fixed contracts and protocols, whether a REST API, a Message Queue, gRPC, or a file.

Changes within a sub-platform are okay; typically (but not always), a single team can implement them across the sub-platform.

However, we should always assume that different teams manage different sub-platforms. Changes between sub-platforms require coordination and communication across multiple teams. The same applies to Cells.

Changes between Cells always require extra coordination and testing, so we should manage them like any other API change (with versioning!).

  • Cell-to-Cell communications can cross Regions, but Cell-internal communications never should.

When defining a Cell boundary, it’s best to assume that calls between Cells may traverse Regions. Cell-based Architecture acknowledges that failures will happen; a Cell will need to fail over, and when it does, a request might traverse Regions.

While it might be optimal to ensure Cells talk locally within a region, this is an optimization and should not be relied on for performance or resiliency.

When you assume cell-to-cell calls traverse Regions, a new set of non-functional requirements must be considered, such as retries, circuit breakers, latency, etc.

Internal Cell calls should always be local because failover is at the Cell level, not the microservice level. The whole point of Cell-based architecture is to create isolation; if internal communications traverse regions, it breaks the entire design.

Guidelines

  • Avoid Dependencies between Cells.

Some customer journeys are expansive, so encapsulating everything into a single cell can be challenging.

Building dependencies between Cells is okay, but as described above, Cell-to-Cell calls will span regions and have more non-functional concerns such as increased latency and failures.

It’s easier to manage Cells that do not depend on others, but it’s not always practical.

  • A Cell should contain a single sub-platform.

Some may argue that this guideline should be a rule, but I don’t see it as that simple.

If you can draw cell boundaries around each sub-platform, that will simplify a lot of your design. However, the practicality of a cell per sub-platform depends on your system’s use case and requirements.

If your customer journey requires low latency, having each sub-platform act as a Cell can impact your performance.

Having too many Cells also creates operational complexity, which can cause just as many issues as you are trying to prevent with Cell-based Architecture.

It’s okay for a Cell to contain multiple sub-platforms, but you must design it accordingly.

  • Deploy all components within a Cell together.

I firmly believe that a sub-platform’s components should be deployed together, as this reduces the complexity of releases and testing. Still, I don’t prescribe forcing Cells with multiple sub-platforms to follow this approach.

Deploying the whole Cell as one is a great practice, but it is also okay to do this at a sub-platform level.

Use discretion; this one comes down to operational overhead, tooling, and team maturity.

  • Use Natural Boundaries to define Cell Boundaries.

A natural boundary is when a customer journey transitions from one type of architecture pattern to another, like a real-time system (REST APIs) to a batch or event-based system (Message Broker).

A real-time system takes a different approach and has different resilience and performance requirements than an event-based system. These changes in requirements and flow are great places to define Cell boundaries.

This guideline is not a rule, though. Sometimes, you may want to bring multiple sub-platforms and workloads into a single cell, which may mean not defining a boundary at these transition points.

By Example

Let’s explore an example system to help understand the above approach to defining Cells.

In this example, a simple backend API calls multiple microservices (using an orchestrated microservice pattern) and emits events to a set of event-driven microservices.

Following our guideline of natural boundaries, the most apparent boundary is where the customer journey moves from API calls to event messages.

The event-driven systems that perform post-processing can be one Cell, while the APIs can be another.
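As a rough sketch of where that boundary falls in code (the handler, database, and broker interfaces below are hypothetical, not taken from the article), the API Cell finishes its synchronous work and hands off to the event-driven Cell only through a versioned event contract:

import json
import uuid

def place_order(request, db, broker):
    # Synchronous path inside the API Cell: validate, persist, respond.
    order_id = str(uuid.uuid4())
    db.save_order(order_id, request["items"])      # stays local to the API Cell
    broker.publish(
        topic="orders.placed.v1",                  # fixed, versioned contract at the boundary
        key=order_id,
        value=json.dumps({"order_id": order_id, "status": "PLACED"}),
    )
    return {"order_id": order_id}

def on_order_placed(event, db):
    # Consumer inside the event-driven Cell: post-processing at its own pace.
    payload = json.loads(event["value"])
    db.record_fulfillment_task(payload["order_id"])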

Benefits:

  • Microservice-to-microservice calls are local within each Cell, which improves latency and reduces failure points.
  • APIs and Event-based systems can have different failover mechanisms.
  • Establishing these as two sub-platforms makes testing and releasing easier as there is a clear contract, and we can test them independently.

Why it works:

  • APIs do not have a hard dependency on event-based systems to finalize processing.
  • The event-based system has different needs and SLAs.
  • The Cells do not share the same database (a rule in microservices that applies to Cells).

Final Thoughts

While I covered many principles for defining boundaries, remember that each situation and use case is different. Guidelines are not the same as rules; they can be broken when reasonable.

However, remember the core concepts behind cell-based architecture when ignoring any guidelines.

By creating independent Cells that operate without reliance on a central dependency, you can isolate and reduce the impact and frequency of failures.

Remember these core concepts while focusing on reducing the number of times a request crosses cells, keeping cells small and manageable, and using natural boundaries.

If you do, you’ll have a resilient and performant architecture.

Beyond Trends: A Practical Guide to Choosing the Right Message Broker

Mike's Notes

This article about messaging brokers appeared today in The Software Architects' Newsletter, March 2025, published by InfoQ.

Pipi 4 had its own internal messaging system in 2004, 6 years before Kafka arrived.

Resources

References


Repository

  • Home > pipiWiki > Engines > Messaging

Last Updated

29/03/2025

Beyond Trends: A Practical Guide to Choosing the Right Message Broker

By: Nehme Bilal
InfoQ: 19/03/2025

Nehme Bilal is a Senior Staff Engineer and L0 Architect at EarnIn, where he has been shaping software development for the past six years. He previously worked at Amazon and Microsoft and holds a PhD in software engineering. Specializing in distributed systems, software development best practices, and architectural design, Nehme is passionate about establishing processes and guidelines that enhance team efficiency and drive the creation of high-quality software.

Key Takeaways

  • Message brokers can be broadly categorized as either stream-based or queue-based, each offering unique strengths and trade-offs.
  • Messages in a stream are managed using offsets, allowing consumers to efficiently commit large batches in a single network call and replay messages by rewinding the offset. In contrast, queues have limited batching support and typically do not allow message replay, as messages are removed once consumed.
  • Streams rely on rigid physical partitions for scaling, which creates challenges in handling poison pills and limits their ability to dynamically auto-scale consumers with fluctuating traffic. Queues, such as Amazon SQS and FIFO SQS, use low-cardinality logical partitions (that are ordered), enabling seamless auto-scaling and effective isolation of poison pills.
  • Streams are ideal for data replication scenarios because they enable efficient batching and are generally less susceptible to poison pills.
  • When batch replication is not required, queues like Amazon SQS or FIFO SQS are often the better choice, as they support auto-scaling, isolate poison pills, and provide FIFO ordering when needed.
  • Combining streams and queues allows organizations to standardize on a single stream solution for producing messages while giving consumers the flexibility to either consume directly from the stream or route messages to a queue based on the messaging pattern.

Messaging solutions play a vital role in modern distributed systems. They enable reliable communication, support asynchronous processing, and provide loose coupling between components. Additionally, they improve application availability and help protect systems from traffic spikes. The available options range from stream-based to queue-based services, each offering unique strengths and trade-offs.

In my experience working with various engineering teams, selecting a message broker is generally not approached with a clear methodology. Decisions are often influenced by trends, personal preference, or ease of access to a particular technology, rather than by the specific needs of the application. Selecting the right broker should instead focus on aligning its key characteristics with the application’s requirements - this is the central focus of this article.

We will examine two of the most popular messaging solutions: Apache Kafka (stream-based) and Amazon SQS (queue-based), which are also the main message brokers we use at EarnIn. By discussing how their characteristics align (or don’t) with common messaging patterns, this article aims to provide insights that will help you make more informed decisions. With this understanding, you’ll be better equipped to evaluate other messaging scenarios and brokers, ultimately choosing the one that best suits your application’s needs.

Message Brokers

In this section, we will examine popular message brokers and compare their key characteristics. By understanding these differences, we can evaluate which brokers are best suited for common messaging patterns in modern applications. While this article does not provide an in-depth description of each broker, readers unfamiliar with these technologies are encouraged to refer to their official documentation for more detailed information.

Amazon SQS (Simple Queue Service)

Amazon SQS is a fully managed message queue service that simplifies communication between decoupled components in distributed systems. It ensures reliable message delivery while abstracting complexities such as infrastructure management, scalability, and error handling. Below are some of the key properties of Amazon SQS.

Message Lifecycle Management: In SQS, the message lifecycle is managed either individually or in small batches of up to 10 messages. Each message can be received, processed, deleted, or even delayed based on the application's needs. Typically, an application receives a message, processes it, and then deletes it from the queue, which ensures that messages are reliably processed.
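A minimal sketch of that receive-process-delete lifecycle with boto3 might look like the following; the queue URL and the process handler are placeholders:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # SQS batches are capped at 10 messages
    WaitTimeSeconds=20,      # long polling
)

for message in response.get("Messages", []):
    process(message["Body"])  # application-specific handler, assumed to exist
    sqs.delete_message(       # delete only after successful processing
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )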

Best-effort Ordering: Standard SQS queues deliver messages in the order they were sent but do not guarantee strict ordering, particularly during retries or parallel consumption. This allows for higher throughput when strict message order isn't necessary. For use cases that require strict ordering, FIFO SQS (First-In-First-Out) can be used to ensure that messages are processed in a certain order (more on FIFO SQS below).

Built-in Dead Letter Queue (DLQ): SQS includes built-in support for Dead Letter Queues (DLQs), which help isolate unprocessable messages.

Write and Read Throughput: SQS supports effectively unlimited read and write throughput, which makes it well-suited for high-volume applications where the ability to handle large message traffic efficiently is essential.

Autoscaling Consumers: SQS supports auto-scaling compute resources (such as AWS Lambda, EC2, or ECS services) based on the number of messages in the queue (see official documentation). Consumers can dynamically scale to handle increased traffic and scale back down when the load decreases. This auto-scaling capability ensures that applications can process varying workloads without manual intervention, which is invaluable for managing unpredictable traffic patterns.

Pub-Sub Support: SQS does not natively support pub-sub, as it is designed for point-to-point messaging where each message is consumed by a single receiver. However, you can achieve a pub-sub architecture by integrating SQS with Amazon Simple Notification Service (SNS). SNS allows messages to be published to a topic, which can then fan out to multiple SQS queues subscribed to that topic. This enables multiple consumers to receive and process the same message independently, effectively implementing a pub-sub system using AWS services.

Amazon FIFO SQS

FIFO SQS extends the capabilities of Standard SQS by guaranteeing strict message ordering within logical partitions called message groups. It is ideal for workflows that require the sequential processing of related events, such as user-specific notifications, financial transactions, or any scenario where maintaining the exact order of messages is crucial. Below are some of the key properties of FIFO SQS.

Message Grouping as Logical Partitions: In FIFO SQS, each message has a MessageGroupId, which is used to define logical partitions within the queue. A message group allows messages that share the same MessageGroupId to be processed sequentially. This ensures that the order of messages within a particular group is strictly maintained, while messages belonging to different message groups can be processed in parallel by different consumers. For example, imagine a scenario where each user’s messages need to be processed in order (e.g., a sequence of notifications or actions triggered by a user).

By assigning each user a unique MessageGroupId, SQS ensures that all messages related to a specific user are processed sequentially, regardless of when the messages are added to the queue. Messages from other users (with different MessageGroupIds) can be processed in parallel, maintaining efficient throughput without affecting the order for any individual user. This is a major benefit for FIFO SQS in comparison to standard SQS or stream based message brokers such as Apache Kafka and Amazon Kinesis.
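As an illustration (the queue URL and payload are assumptions), a producer might key each message by user so that ordering is preserved per user while different users are processed in parallel:

import json
import boto3

sqs = boto3.client("sqs")
fifo_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications.fifo"  # placeholder

def publish_user_event(user_id, event):
    sqs.send_message(
        QueueUrl=fifo_queue_url,
        MessageBody=json.dumps(event),
        MessageGroupId=user_id,                    # logical partition: strict order per user
        MessageDeduplicationId=event["event_id"],  # or enable content-based deduplication
    )

publish_user_event("user-42", {"event_id": "evt-1", "type": "notification.sent"})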

Dead Letter Queue (DLQ): FIFO SQS provides built-in support for Dead Letter Queues (DLQs), but their use requires careful consideration as they can disrupt the strict ordering of messages within a message group. For example, if two messages - message1 and message2 - belong to the same MessageGroupId (e.g., groupA), and message1 fails and is moved to the DLQ, message2 could still be successfully processed. This breaks the intended message order within the group, defeating the primary purpose of FIFO processing.

Poison Pills Isolation: When a DLQ is not used, FIFO SQS will continue retrying the delivery of a failed message indefinitely. While this ensures strict message ordering, it can also create a bottleneck, blocking the processing of all subsequent messages within the same message group until the failed message is successfully processed or deleted.

Messages that repeatedly fail to process are known as poison pills. In some messaging systems, poison pills can block an entire queue or shard, preventing any subsequent messages from being processed. However, in FIFO SQS, the impact is limited to the specific message group (logical partition) the message belongs to. This isolation significantly mitigates broader failures, provided message groups are thoughtfully designed.

To minimize disruption, it’s crucial to choose the MessageGroupId in a way that keeps logical partitions small while ensuring that ordered messages remain within the same partition. For example, in a multi-user application, using a user ID as the MessageGroupId ensures that failures only affect that specific user’s messages. Similarly, in an e-commerce application, using an order ID as the MessageGroupId ensures that a failed order message does not impact orders from other customers.

To illustrate the impact of this isolation, consider a poison pill scenario:

  • Without isolation (or shard-level isolation), a poison pill could block all orders in an entire region (e.g., all Amazon.com orders in a country).
  • With FIFO SQS isolation, only a single user’s order would be affected, while others continue processing as expected.

Thus, poison pill isolation is a highly impactful feature of FIFO SQS, significantly improving fault tolerance in distributed messaging systems.

Throughput: FIFO SQS has a default throughput limit of 300 messages per second. However, by enabling high-throughput mode, this can be increased to 9,000 messages per second. Achieving this high throughput requires careful design of message groups to ensure sufficient parallelism.

Autoscaling Consumers: Similar to Standard SQS, FIFO SQS supports auto-scaling compute resources based on the number of messages in the queue. While FIFO SQS scalability is not truly unlimited, it is influenced by the number of message groups (logical partitions), which can be designed to be very high (e.g. a message group per user).

Pub-Sub Support: Just like with Standard SQS, pub-sub can be achieved by pairing FIFO SQS with SNS, which offers support for FIFO topics.

Apache Kafka

Apache Kafka is an open-source, distributed streaming platform designed for real-time event streaming and high-throughput applications. Unlike traditional message queues like SQS, Kafka operates as a stream-based platform where messages are consumed based on offsets. In Kafka, consumers track their progress by moving their offset forward (or backward for replay), allowing multiple messages to be committed at once. This offset-based approach is a key distinction between Kafka and traditional message queues, where each message is processed and acknowledged independently. Below are some of Kafka's key properties.

Physical Partitions (shards): Kafka topics are divided into physical partitions (also known as shards) at the time of topic creation. Each partition maintains its own offset and manages message ordering independently. While partitions can be added, this may disrupt ordering and requires careful handling. On the other hand, reducing partitions is even more complex and generally avoided, as it affects data distribution and consumer load balancing. Because partitioning affects scalability and performance, it should be carefully planned from the start.

Pub-Sub Support: Kafka supports a publish-subscribe model natively. This allows multiple consumer groups to independently process the same topic, enabling different applications or services to consume the same data without interfering with each other. Each consumer group gets its own view of the topic, allowing for flexible scaling of both producers and consumers.

High Throughput and Batch Processing: Kafka is optimized for high-throughput use cases, enabling the efficient processing of large volumes of data. Consumers can process large batches of messages, minimizing the number of reads and writes to Kafka. For instance, a consumer can process up to 10,000 messages, save them to a database in a single operation, and then commit the offset in one step, significantly reducing overhead. This is a key differentiator of streams from queues where messages are managed individually or in small batches.
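Using the kafka-python client, that batch-then-commit flow might be sketched as follows; the topic name, consumer group, and bulk_insert helper are assumptions:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.events.v1",                  # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="order-replicator",
    enable_auto_commit=False,            # commit offsets only after the batch is persisted
)

while True:
    batch = consumer.poll(timeout_ms=1000, max_records=10000)
    records = [r for partition_records in batch.values() for r in partition_records]
    if not records:
        continue
    bulk_insert([r.value for r in records])  # single database write (assumed helper)
    consumer.commit()                        # one commit advances the offsets for everything consumed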

Replay Capability: Kafka retains messages for a configurable retention period (default is 7 days), allowing consumers to rewind and replay messages. This is particularly useful for debugging, reprocessing historical data, or recovering from application errors. Consumers can process data at their own pace and retry messages if necessary, making Kafka an excellent choice for use cases that require durability and fault tolerance.
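Rewinding is equally direct; a consumer can be pointed back to an earlier offset (the topic, partition, and handler here are illustrative):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="order-replayer",
    enable_auto_commit=False,
)
partition = TopicPartition("orders.events.v1", 0)  # placeholder topic and partition
consumer.assign([partition])
consumer.seek_to_beginning(partition)              # or consumer.seek(partition, some_offset)

for record in consumer:
    reprocess(record.value)  # assumed handler for replayed historical events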

Handling Poison Pills: In Kafka, poison pills can block the entire physical partition they reside in, delaying the processing of all subsequent messages within that partition. This can have serious consequences on an application. For example, in an e-commerce application where each region's orders are processed through a dedicated Kafka shard, a single poison pill could block all orders for that region, leading to significant business disruptions. This limitation highlights a key drawback of strict physical partitioning compared to logical partitioning available in queues such as FIFO SQS, where failures are isolated within smaller message groups rather than affecting an entire shard.

If strict ordering is not required, using a Dead Letter Queue can help mitigate the impact by isolating poison pills, preventing them from blocking further message processing.

Autoscaling Limitations: Kafka’s scaling is constrained by its partition model, where each shard (partition) maintains strict ordering and can be processed by only one compute node at a time. This means that adding more compute nodes than the number of partitions does not improve throughput, as the extra nodes will remain idle. As a result, Kafka does not pair well with auto-scaling consumers, since the number of active consumers is effectively limited by the number of partitions. This makes Kafka less flexible in dynamic scaling scenarios compared to messaging systems like FIFO SQS, where logical partitioning allows for more granular consumer scaling.

Comparison of Messaging Brokers

Feature | Standard SQS | FIFO SQS | Apache Kafka
Message Retention | Up to 14 days | Up to 14 days | Configurable (default: 7 days)
Pub-Sub Support | Via SNS | Via SNS | Native via consumer groups
Message Ordering | Best-effort ordering | Guaranteed within a message group | Guaranteed within a physical partition (shard)
Batch Processing | Batches of up to 10 messages | Batches of up to 10 messages | Efficient large-batch commits
Write Throughput | Effectively unlimited | 300 messages/second per message group | Scalable via physical partitions (millions of messages/second achievable)
Read Throughput | Unlimited | 300 messages/second per message group | Scalable via physical partitions (millions of messages/second achievable)
DLQ Support | Built-in | Built-in, but can disrupt ordering | Supported via connectors, but can disrupt ordering of a physical partition
Poison Pill Isolation | Isolated to individual messages | Isolated to message groups | Can block an entire physical partition
Replay Capability | Not supported | Not supported | Supported with offset rewinding
Autoscaling Consumers | Unlimited | Limited by the number of message groups (i.e., nearly unlimited in practice) | Limited by the number of physical partitions (shards)

Messaging Patterns and Their Influence on Broker Selection

In distributed systems, messaging patterns define how services communicate and process information. Each pattern comes with unique requirements, such as ordering, scalability, error handling, or parallelism, which guide the selection of an appropriate message broker. This discussion focuses on three common messaging patterns: Command Pattern, Event-Carried State Transfer (ECST), and Event Notification Pattern, and examines how their characteristics align with the capabilities of popular brokers like Amazon SQS and Apache Kafka. This framework can also be applied to evaluate other messaging patterns and determine the best-fit message broker for specific use cases.

The Command Pattern

The Command Pattern is a design approach where requests or actions are encapsulated as standalone command objects. These commands are sent to a message broker for asynchronous processing, allowing the sender to continue operating without waiting for a response.

This pattern enhances reliability, as commands can be persisted and retried upon failure. It also improves the availability of the producer, enabling it to operate even when consumers are unavailable. Additionally, it helps protect consumers from traffic spikes, as they can process commands at their own pace.

Since command processing often involves complex business logic, database operations, and API calls, successful implementation requires reliability, parallel processing, auto-scaling, and effective handling of poison pills.

Key Characteristics

Multiple Sources, Single Destination: A command can be produced by one or more services but is typically consumed by a single service. Each command is usually processed only once, with multiple consumer nodes competing for commands. As a result, pub/sub support is unnecessary for commands.

High Throughput: Commands may be generated at a high rate by multiple producers, requiring the selected message broker to support high throughput with low latency. This ensures that producing commands does not become a bottleneck for upstream services.

Autoscaling Consumers: On the consumer side, command processing often involves time-consuming tasks such as database writes and external API calls. To prevent contention, parallel processing of commands is essential. The selected message broker should enable consumers to retrieve commands in parallel and process them independently, without being constrained by a small number of parallel workstreams (such as physical partitions). This allows for horizontal scaling to handle fluctuations in command throughput, ensuring the system can meet peak demands by adding consumers and scale back during low activity periods to optimize resource usage.

Risk of Poison Pills: Command processing often involves complex workflows and network calls, increasing the likelihood of failures that can result in poison pills. To mitigate this, the message broker must support high cardinality poison pill isolation, ensuring that failed messages affect only a small subset of commands rather than disrupting the entire system. By isolating poison pills within distinct message groups or partitions, the system can maintain reliability and continue processing unaffected commands efficiently.

Broker Alignment

Given the requirements for parallel consumption, autoscaling, and poison pill isolation, Kafka is not well-suited for processing commands. As previously discussed, Kafka’s rigid number of physical partitions cannot be scaled dynamically. Furthermore, a poison pill can block an entire physical partition, potentially disrupting a large number of the application's users.

If ordering is not a requirement, standard SQS is an excellent choice for consuming and processing commands. It supports parallel consumption with unlimited throughput, dynamic scaling, and the ability to isolate poison pills using a Dead Letter Queue (DLQ).

For scenarios where ordering is required and can be distributed across multiple logical partitions, FIFO SQS is the ideal solution. By strategically selecting the message group ID to create numerous small logical partitions, the system can achieve near-unlimited parallelism and throughput. Moreover, any poison pill will only affect a single logical partition (e.g., one user of the application), ensuring that its impact is isolated and minimal.

Event-carried State Transfer (ECST)

The Event-Carried State Transfer (ECST) pattern is a design approach used in distributed systems to enable data replication and decentralized processing. In this pattern, events act as the primary mechanism for transferring state changes between services or systems. Each event includes all the necessary information (state) required for other components to update their local state without relying on synchronous calls to the originating service.

By decoupling services and reducing the need for real-time communication, ECST enhances system resilience, allowing components to operate independently even when parts of the system are temporarily unavailable. Additionally, ECST alleviates the load on the source system by replicating data to where it is needed. Services can rely on their local state copies rather than making repeated API calls to the source. This pattern is particularly useful in event-driven architectures and scenarios where eventual consistency is acceptable.

Key Characteristics

Single Source, Multiple Destinations: In ECST, events are published by the owner of the state and consumed by multiple domains or services interested in replicating the state. This requires a message broker that supports the publish-subscribe (pub-sub) pattern.

Low Likelihood of Poison Pills: Since ECST involves minimal business logic and typically avoids API calls to other services, the risk of poison pills is negligible. As a result, the use of a Dead Letter Queue (DLQ) is generally unnecessary in this pattern.

Batch Processing: As a data-replication pattern, ECST benefits significantly from batch processing. Replicating data in large batches improves performance and reduces costs, especially when the target database supports bulk inserts in a single operation. A message broker that supports efficient large-batch commits, combined with a database optimized for batching, can dramatically enhance application performance.

Strict Ordering: Strict message ordering is often essential in ECST to ensure that the state of a domain entity is replicated in the correct sequence. This prevents older versions of an entity from overwriting newer ones. Ordering is particularly critical when events carry deltas (e.g., "set property X"), as out-of-order events cannot simply be discarded. A message broker that supports strict ordering can greatly simplify event consumption and ensure data integrity.

Broker Alignment

Given the requirements for pub-sub, strict ordering, and batch processing, along with the low likelihood of poison pills, Apache Kafka is a great fit for the ECST pattern.

Kafka allows consumers to process large batches of messages and commit offsets in a single operation. For example, 10,000 events can be processed, written to the database in a single batch (assuming the database supports it), and committed with one network call, making Kafka significantly more efficient than Amazon SQS in such scenarios. Furthermore, the minimal risk of poison pills eliminates the need for DLQs, simplifying error handling. In addition to its batching capabilities, Kafka’s partitioning mechanism enables increased throughput by distributing events across multiple shards.

However, if the target database does not support batching, writing data to the database may become the bottleneck, rendering Kafka's batch-commit advantage less relevant. For such scenarios, funneling messages from Kafka into FIFO SQS or using FIFO SNS/SQS without Kafka can be more effective. As discussed earlier, FIFO SQS allows for fine-grained logical partitions, enabling parallel processing while maintaining message order. This design supports dynamic scaling by increasing the number of consumer nodes to handle traffic spikes, ensuring efficient processing even under heavy workloads.

Event Notification Pattern

The Event Notification Pattern enables services to notify other services of significant events occurring within a system. Notifications are lightweight and typically include just enough information (e.g., an identifier) to describe the event. To process a notification, consumers often need to fetch additional details from the source (and/or other services) by making API calls. Furthermore, consumers may need to make database updates, create commands or publish notifications for other systems to consume. This pattern promotes loose coupling and real-time responsiveness in distributed architectures. However, given the potential complexity of processing notifications (e.g. API calls, database updates and publishing events), scalability and robust error handling are essential considerations.

Key Characteristics

The characteristics of the Event Notification Pattern overlap significantly with those of the Command Pattern, especially when processing notifications involves complex and time-consuming tasks. In these scenarios, implementing this pattern requires support for parallel consumption, autoscaling consumers, and isolation of poison pills to ensure reliable and efficient processing. Moreover, the Event Notification Pattern necessitates pub-sub support to facilitate one-to-many distribution of events.

There are cases where processing notifications involves simpler workflows, such as updating a database or publishing events to downstream systems. In such cases, the characteristics of this pattern align more closely with those of the ECST pattern.

It should also be noted that different consumers of the same notification may process notifications differently. It’s possible that one consumer needs to apply complex processing while another is performing very simple tasks that are unlikely to ever fail.

Broker Alignment

When the characteristics of the notifications consumer align with those of consuming commands, SQS (or FIFO SQS) is the obvious choice. However, if a consumer only needs to perform simple database updates, consuming notifications from Kafka may be more efficient because of the ability to process notifications in batches and Kafka’s ability to perform large batch commits.

The challenge with notifications is that it’s not always possible to predict the consumption patterns in advance, which makes it difficult to choose between SNS and Kafka when producing notifications.

To gain more flexibility, at EarnIn we decided to use Kafka as the sole broker for publishing notifications. If a consumer requires SQS properties for consumption, it can funnel messages from Kafka to SQS using Amazon EventBridge. If a consumer doesn’t require SQS properties, it can consume directly from Kafka and benefit from its efficient batching capabilities. Moreover, using Kafka instead of SNS for publishing notifications also gives consumers the ability to leverage Kafka’s replay capability, even when messages are funneled to SQS for consumption.

Furthermore, given that Kafka is also a good fit for the ECST pattern and that the Command Pattern doesn’t require pub-sub, we had no reason left to use SNS. This allowed us to standardize on Kafka as the sole pub-sub broker, which significantly simplifies our workflows. In fact, with all events flowing through Kafka, we were able to build tooling that replicates Kafka events to a data lake, which can be leveraged for debugging, analytics, replay/backfilling, and more.

Conclusion

Selecting the right message broker for your application requires understanding the characteristics of the available options and the messaging pattern you are using. Key factors to consider include traffic patterns, auto-scaling capabilities, tolerance to poison pills, batch processing needs, and ordering requirements.

While this article focused on Amazon SQS and Apache Kafka, the broader decision often comes down to choosing between a queue and a stream. However, it is also possible to leverage the strengths of both by combining them.

Standardizing on a single broker for producing events allows your company to focus on building tooling, replication, and observability for one system, reducing maintenance costs. Consumers can then route messages to the appropriate broker for consumption using services like EventBridge, ensuring flexibility while maintaining operational efficiency.

The Shift Left Data Manifesto

Mike's Notes

The Shift Left Data Manifesto is written by Chad Sanderson, CEO & Co-Founder of Gable.ai

It is reproduced below. Thoughtful ideas.

Resources

References


Repository

  • Home > 

Last Updated

27/03/2025

The Shift Left Data Manifesto

By: Chad Sanderson
Gable.ai Blog: 25/03/2025

Chad Sanderson is the CEO & Co-Founder of Gable.ai

A core idea behind shifting Data Left is simple but often overlooked: data is code. Or more accurately—data is produced by code. It’s not just some downstream artifact that lives in tables and gets piped into dashboards and spreadsheets. Every record, event, or log starts somewhere—created, updated, or deleted by a line of code. And just like DevOps demonstrated, if you want to manage something well, you start at the point of creation.

Data Management for Software Engineering Teams

Hello everyone, my name is Chad Sanderson. I am the author of the blog Data Products and the CEO/Co-Founder of Gable.ai. Over the past few years, I’ve written quite a bit on data management, data quality, and data contracts. In 2024 I spent a little less time writing and more time implementing. I’ve worked with dozens of enterprises during that time and watched the evolution of data contracts from a nascent idea stemming from a few LinkedIn posts to driving real change for some of the largest companies in the world.

The result of that experience is that I have become a bit of a data extremist. I believe there is a completely new domain of data management over the horizon, one that will altogether change how we think about the discipline, rewrite most or all of our common best practices, and bring the various stakeholders into a cohesive lifecycle of data management. This manifesto is a mix of opinions combined with a description of the cutting edge: truly game-changing companies that are pioneering how data management is done, often from unexpected places.

Over the course of this manifesto, I will try to convince you of a few things I strongly believe:

  1. Most engineering teams are federated or becoming so
  2. The way we manage data is designed for centralized environments
  3. Data strategies will almost always fail, due to points 1 and 2
  4. Federated data management is possible, but requires a different approach
  5. That approach has been historically successful in other engineering disciplines

Is this manifesto about data contracts? No. But they do feature prominently. Data contracts are a component of shifting left, among dozens of other components. These components work together to create an entirely new dynamic of data management, completely inverting the processes, tools, strategies, and adoption rates of data quality and data governance. There is so much content to cover, across such a wide variety of topics, that I’m splitting the manifesto into two parts for my own sanity. The first section covers where we are today, why I believe the state of data management is fundamentally flawed, why “culture shift” is almost always impossible without technology and working solutions, and what we can learn from other industries that have solved the same type of problems. Let’s jump into it.

Conway’s Law

Conway’s Law is the observation that organizations tend to design systems that mirror their communication structure. A product designed by a three-person organization will likely have three components. If it is designed by a single team, it will likely all be built within a single large service.

A media company with separate teams for video encoding, recommendation algorithms, and user interfaces might build a streaming platform where these components are loosely coupled, reflecting the team's structure. Hospitals and insurance companies have separate IT systems due to distinct legal and compliance teams. This results in fragmented medical records across different providers, forcing patients to manually transfer records or redo tests.

There are three primary stakeholders in the data management value chain:

  1. Producers: The teams generating the data
  2. Platforms: The teams maintaining the data infrastructure
  3. Consumers: The teams leveraging the data to accomplish tasks

Conway's Law would dictate that the data management, governance, and quality systems implemented in a company will reflect how these various groups work together.

In most businesses, data producers have no idea who their consumers are or why they need the data in the first place. They are unaware of which data is important for AI/BI, nor do they understand what it should look like. Platform teams are rarely informed about how their infrastructure is being leveraged and have little knowledge of the business context surrounding data, while consumers have business context but don’t know where the data is coming from or whether it is high quality.

Is it any wonder that data management programs are a complete, disjointed mess?

The other side of the coin from Conway’s Law is the law of unintended consequences in systems, summarized as “The purpose of a system is what it does” (POSIWID), a phrase coined by Stafford Beer, a cybernetics researcher. The rule means that what a technology actually does is a better indicator of its goal than any stated intent.

For example, suppose data pipelines are consistently breaking and the data is always low quality. In that case, the point of your data ecosystem is not to produce high-quality, trusted data; it is actually to enable teams to move fast, ship without accountability, and tolerate breakages as an acceptable trade-off.

Your data ecosystem is optimized for speed over reliability, manual firefighting over prevention, and short-term fixes over long-term quality. If high-quality data were truly the goal, the system would have built-in schema enforcement, automated validation, and clear ownership—but since those don’t exist (or are routinely bypassed), the real function of the system is to allow chaotic, ad-hoc data handling that prioritizes short-term delivery over long-term trust.

If Conway’s Law helps explain how data got into such a sorry state, POSIWID explains why: the broader organization optimizes for manual effort over proactive, automated, comprehensive solutions.

Federation Ate the World

The early 2000s marked a fundamental shift in how software engineering teams were structured. As technology companies scaled, they recognized that high-quality software required rapid iteration and continuous delivery. Research on software development lifecycles—such as the work popularized in the "Accelerate" book by Forsgren, Humble, and Kim—demonstrated that teams capable of shipping frequently could identify and fix defects faster, improve reliability, and respond to business needs with greater agility. The more frequently features were pushed into production, the sooner feedback loops closed, leading to better user experiences and stronger business outcomes.

To facilitate this velocity, companies embraced Agile methodologies, which dismantled the traditional, slow-moving hierarchical structures and replaced them with small, autonomous, cross-functional teams. Rather than requiring months or years of deliberation, these teams operated with localized decision-making authority, allowing them to experiment, iterate, and ship software much faster. This shift not only optimized for speed but also reduced the coordination overhead that had historically slowed large engineering organizations.

Out of this decentralized model emerged federated engineering structures and the adoption of microservices architectures. Instead of monolithic applications where multiple teams shared responsibility for a single massive codebase, companies transitioned to a world where individual teams owned their services, databases, infrastructure, and deployment pipelines. Each team was empowered to make locally optimal decisions—choosing their own programming languages, data models, and release schedules, all in the name of speed.

The trade-off, however, was that many centralized cost centers—teams and functions designed for a monolithic, tightly controlled architecture—struggled to adapt. Operations teams, for example, were historically responsible for managing deployments in a centralized, controlled manner. With a monolithic system, they could plan releases, monitor performance, and enforce best practices through well-established governance processes. But in a federated world, visibility disappeared. Now, hundreds of teams were shipping thousands of changes independently, overwhelming centralized ops teams who could no longer track, validate, or mitigate risk effectively.

This same dynamic played out in the world of data. Historically, data teams had ownership over the organization’s entire data architecture—curating data models, defining schema governance, and managing a centralized data warehouse. But as engineering teams began making independent decisions about which events to log, what databases to use, and how to structure data, the once-cohesive data ecosystem fragmented overnight.

Without centralized oversight, engineering teams optimized for their immediate needs rather than long-term data quality. Events were collected inconsistently, naming conventions varied wildly, and different teams structured their data models based on what was most convenient for their service, rather than what was best for the organization as a whole. This led to massive data silos, duplicated efforts, and an overall decline in data consistency.

The response from data teams? The Data Lake.

Instead of trying to force governance onto hundreds of independent teams, companies adopted a "dump now, analyze later" approach, instructing engineering teams to send all their raw data into a centralized cold storage repository. This led to the rise of data engineering, a discipline that emerged to clean, transform, and organize this messy, unstructured data into something usable. Data engineers became reactive firefighters, constantly wrangling broken schemas, cleaning up unexpected transformations, and trying to reconstruct meaning from fragmented event logs.

This model was deemed acceptable by business leaders because it allowed engineering teams to move quickly, even though it meant the data team was perpetually stuck in a reactive mode.

In the early days of the cloud, this reactive data engineering model was sufficient. Most organizations primarily used data for dashboarding and reporting, where occasional inconsistencies could be tolerated. But as the industry evolved, the stakes for data reliability grew exponentially.

  • Machine Learning & AI: With AI-driven decision-making, poor data quality no longer just caused bad reports—it directly impacted product functionality and user experience. A mislabeled dataset could lead to a faulty recommendation algorithm, an unreliable fraud detection system, or an inaccurate pricing model.
  • Data as a Revenue-Generating Product: Companies began monetizing data directly—either by selling insights, building customer-facing analytics, or enabling real-time personalization. Inaccurate data now had a direct impact on revenue.
  • Regulatory Compliance & Risk: As GDPR, CCPA, and other data protection regulations took effect, bad data practices became a legal liability. A single oversight—such as failing to properly delete user data upon request—could result in multimillion-dollar fines.

With these shifts, the consequences of reactive data engineering became untenable. Data teams could no longer afford to be downstream janitors, constantly cleaning up after engineering decisions made without governance. Instead, something fundamental had to change.

The federated model of software engineering isn’t going away—if anything, it has only expanded. However, as we’ve seen, the decentralization of engineering cannot come at the cost of operational visibility, data integrity, and compliance. Organizations now face a critical inflection point:

  • How do we reintroduce governance without reintroducing bottlenecks?
  • How do we enable engineering speed while ensuring data correctness and compliance?
  • How do we prevent reactive firefighting and create proactive, self-service data management?

Just as DevOps introduced infrastructure as code to solve the challenges of federated operations, the next era of data engineering must be proactive, automated, and deeply integrated into the software development lifecycle. Federation ate the world. But now we must decide how to rebuild it—this time, with sustainability, accountability, and resilience at its core.

Shifting Left

As the cloud, microservices, and decoupled engineering teams grew, the centralized model of cost center management became harder to maintain and justify. The concept of Shifting Left emerged as a mechanism for driving ownership across a decoupled engineering organization and ultimately became the go-to solution for developers. In our context, shifting left is designed to help data teams overcome the people, process, and cultural challenges created by gaps in communication around data management. Instead of data management being solely the responsibility of downstream data organizations, the treatment of data becomes a shared responsibility across producers, data platform teams, and consumers.

Simply put: Shifting Left means moving ownership, accountability, quality, and governance from reactive downstream teams to proactive upstream teams.

While Shifting Left may sound too good to be true, this pattern has played out on three notable occasions in software engineering: first with DevOps, then with DevSecOps, and most recently with Feature Management.

DevOps first emerged as a concept between 2007 and 2008, meant to address the growing gap between development teams (Dev) and operations (Ops). Before DevOps, software development and IT operations worked in silos. Developers would write code and pass it to operations teams, who were responsible for deployment and maintenance. This led to:

  • Slow releases due to hand-offs and bottlenecks.
  • Frequent deployment failures caused by differences between development and production environments.
  • Blame culture, where development blamed operations for slow deployments, and operations blamed development for unstable code.

With DevOps, IT teams have become more agile, automated, and collaborative. Development and operations now work closely together, using continuous integration and continuous deployment pipelines to automate testing and deployment, reducing human errors and accelerating release cycles. Infrastructure as Code (IaC) allows teams to manage infrastructure programmatically, ensuring consistency and scalability. Monitoring, logging, and observability tools provide real-time insights, enabling proactive issue resolution rather than reactive firefighting.

These days it is incredibly rare to see an engineering organization operating at a meaningful scale without a DevOps function. Most developers in a company are responsible for writing their own unit and integration tests. Teams rally around version control platforms like GitHub and GitLab for collaboration, code review, auditing, and more.

Roughly a decade later, around 2015, we saw a similar pattern with DevSecOps. Security teams were reactive, dealing with fraud and hacking after the fact rather than taking proactive, preventative steps to ensure software was designed with security in mind. Like Ops teams, the Security organization was siloed and disconnected from value and, as a cost center, suffered from the same problems as operations.

DevSecOps is more complex than simply integrating security into existing DevOps workflows because it requires security to be automated, continuous, and developer-friendly—something traditional security practices were not designed for. Unlike traditional security, which was often applied as a final step before deployment, DevSecOps embeds security checks throughout the entire software development lifecycle. This shift introduces several challenges:

  • Shift-Left Security Requires Developer Buy-In: Security teams traditionally operated as gatekeepers, reviewing code and infrastructure late in the process. DevSecOps requires developers to take ownership of security much earlier, meaning they need security tools that are easy to use, fast, and integrated into their existing workflows. However, many security tools were designed for security experts, not developers.
  • Balancing Security and Speed: DevOps emphasizes fast, frequent releases, while security traditionally slows things down with rigorous reviews and manual testing. DevSecOps must balance both, requiring automation that can enforce security without blocking deployments. Achieving this requires integrating automated security scanning, policy enforcement, and runtime protection into CI/CD pipelines without causing excessive friction.
  • Automated Security Testing at Scale: Traditional security relied on periodic manual testing (e.g., penetration testing, compliance audits). DevSecOps requires continuous security testing (a minimal sketch of one such check follows this list), including:
    • Static Application Security Testing for code vulnerabilities.
    • Software Composition Analysis for third-party dependencies.
    • Dynamic Application Security Testing for runtime security risks.
    • Infrastructure as Code Scanning to prevent misconfiguration.
    • Secrets and Credential Scanning to detect exposed sensitive data.
    Integrating these into DevOps pipelines without overwhelming teams with false positives is a major challenge.
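
To make that last item concrete, here is a deliberately naive sketch of what a secrets-and-credential scan might look like as a pre-merge CI step. The patterns, file filtering, and exit-code convention are illustrative assumptions; production scanners ship far richer rule sets and tuning to keep false positives manageable.

    # secrets_scan.py - a deliberately naive pre-merge secrets scan (illustrative only)
    import re
    import sys
    from pathlib import Path

    # A few illustrative patterns; real scanners ship hundreds of rules.
    PATTERNS = {
        "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
        "generic_api_key": re.compile(r"(?i)(api[_-]?key|secret)\s*[:=]\s*['\"][A-Za-z0-9/+]{16,}['\"]"),
        "private_key_header": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    }

    def scan(root: str) -> list[str]:
        findings = []
        for path in Path(root).rglob("*"):
            if not path.is_file() or path.suffix in {".png", ".jpg", ".zip"}:
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for name, pattern in PATTERNS.items():
                for match in pattern.finditer(text):
                    findings.append(f"{path}: possible {name}: {match.group(0)[:12]}...")
        return findings

    if __name__ == "__main__":
        hits = scan(sys.argv[1] if len(sys.argv) > 1 else ".")
        for hit in hits:
            print(hit)
        sys.exit(1 if hits else 0)  # a non-zero exit fails the CI job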

The overall takeaway? The more complex and multi-component a cost center’s workflows are, the more sophistication is required to shift left effectively while balancing developer expectations, speed, accuracy, and scale.

And finally, the Shift Left has happened with Feature Management. Traditionally, feature rollouts, experiments, and instrumentation were handled late in the development cycle—often by downstream teams like product, analytics, or growth. This led to:

  • Engineers shipping features without proper instrumentation or user tracking.
  • Product teams struggling to get clean data on feature performance post-launch.
  • A/B testing requiring significant engineering support, slowing experimentation velocity.
  • Limited ability to control or roll back features without a full redeploy.

By shifting Feature Management into the software development lifecycle, teams can build observability, experimentation, and rollout controls directly into the feature itself. Feature Flags allow engineers to ship code behind toggles, enabling controlled rollouts and fast reversions without redeployments. Instrumentation and product analytics are now added as part of the development process, not as a follow-up task. And experimentation frameworks are increasingly embedded into the codebase, letting product teams test and iterate without waiting on engineering.
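
As a rough illustration of the mechanism, the sketch below shows a hand-rolled feature flag: the rollout percentage lives in configuration, so a feature can be ramped up or rolled back without a redeploy. The flag name and hashing-based bucketing are assumptions for the example; real feature management platforms layer targeting rules, audit trails, and experiment assignment on top of this.

    # feature_flags.py - a minimal, hand-rolled feature flag (illustrative only)
    import hashlib

    # In practice this configuration would come from a flag service or config store,
    # so it can change at runtime without a redeploy.
    FLAGS = {
        "new_checkout_flow": {"enabled": True, "rollout_percent": 25},
    }

    def is_enabled(flag_name: str, user_id: str) -> bool:
        """Deterministically bucket a user into a percentage rollout."""
        flag = FLAGS.get(flag_name)
        if not flag or not flag["enabled"]:
            return False
        # Hash the (flag, user) pair so each user lands in a stable bucket in [0, 100).
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < flag["rollout_percent"]

    # Usage: the new code path ships dark and is ramped via configuration.
    def checkout(user_id: str) -> str:
        if is_enabled("new_checkout_flow", user_id):
            return "new flow"   # instrumented, gradually rolled out
        return "old flow"       # fast reversion: set "enabled" to False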

Just as DevOps brought deployment and infrastructure closer to development, Feature Management brings experimentation, rollout control, and measurement upstream—making feature delivery safer, faster, and more data-driven.

All three of these disciplines follow the same pattern: a critical business function is siloed downstream (Ops, Security, Product/Growth). Pushing tools and methodologies to the left is not an incremental change in value but an inversion of how the job is done. Experimentation becomes something that happens for every new feature deployment by default. Systems are secure by default. While every team is at its own point of maturity on these paths, and some are far more sophisticated in shifting left than others, this is not just theory. It has already happened, and we in the data space should stop talking about what might happen and start talking about what can.

Shifting Data Left

Unlike engineering and security, data was the last frontier in cloud migration. While cloud-native infrastructure transformed application development and security in the early 2010s, data teams lagged behind, facing unique and more complex challenges that made cloud adoption far more difficult.

The primary reason for this delay lies in the inherently complex, multi-faceted nature of data, which makes the shift left costly and difficult to manage. Security, despite its wide operational scope, primarily deals with permissions, monitoring, and compliance enforcement within a codebase or infrastructure environment. Engineering, too, could migrate by lifting application workloads into cloud-hosted services. The product discipline had the easiest transition, given that front-end/full-stack engineers were already adding monitoring and instrumentation to their services with or without product managers asking for it. However, data does not exist in a single place, nor does it follow a single lifecycle. It moves across repositories, services, and storage technologies, often passing through multiple transformations before it can be used.

A typical data workflow spans the following stages (a compact sketch follows this list):

  1. Source data ingestion (from application databases, logs, APIs, and event streams).
  2. Storage across multiple environments (operational databases, data lakes, warehouses, object storage).
  3. ETL (Extract, Transform, Load) and ELT processes that modify and refine data.
  4. Aggregation into analytical databases or data warehouses.
  5. Further transformations inside the warehouse to clean, normalize, and structure data.
  6. Downstream consumption by dashboards, machine learning models, or data products.
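
To make the shape of that pipeline concrete, here is a highly simplified sketch of the stages as composable functions. Every name in it (the sample event, the pass-through storage step, the summed metric) is hypothetical and stands in for an entire system in a real stack.

    # pipeline_sketch.py - the six stages above, compressed into stub functions (illustrative only)

    def ingest() -> list[dict]:
        # 1. Source ingestion: application events, logs, API payloads.
        return [{"user_id": "u1", "event": "purchase", "amount_usd": 42.0}]

    def store_raw(events: list[dict]) -> list[dict]:
        # 2. Landing in a lake or object storage (here: a simple pass-through).
        return events

    def transform(events: list[dict]) -> list[dict]:
        # 3-5. ETL/ELT, aggregation, and in-warehouse cleanup, collapsed into one step.
        return [e for e in events if e.get("amount_usd", 0) >= 0]

    def consume(rows: list[dict]) -> float:
        # 6. Downstream consumption: a dashboard metric or a model feature.
        return sum(r["amount_usd"] for r in rows)

    if __name__ == "__main__":
        print(consume(transform(store_raw(ingest()))))  # -> 42.0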

This multi-stage pipeline meant that migrating data to the cloud required far more than simply moving databases—it demanded rebuilding the entire data infrastructure stack from ingestion to transformation, storage, governance, and consumption. The cost, complexity, and dependencies across teams slowed down cloud adoption significantly.

Now, 20 years into the cloud era, data teams are encountering the same organizational and technical bottlenecks that operations teams faced in the mid-2000s. Back then, software engineering moved to a decentralized, service-based model, which broke traditional operations workflows and required a complete rethink of deployment and monitoring strategies. Today, data teams face the same disaggregation problem—modern software development creates silos that fragment data, leading to inefficiencies and bottlenecks that restrict its flow and value within the organization.

In a highly federated engineering environment, individual teams often manage their own databases without centralized coordination, emit event streams without consistent schema or semantic governance, and choose storage solutions optimized for local needs rather than global usability. These teams also tend to create ad hoc transformations that may duplicate or overwrite critical business logic.

The result? Data integrity, consistency, and discoverability suffer. Much like operations before the rise of DevOps, data engineering has become a reactive cost center—constantly fixing inconsistent schemas and data drift, resolving duplicate or contradictory transformations, debugging downstream breakages caused by upstream changes, and responding to compliance incidents like untracked PII exposure.

Just as DevOps emerged to address the chaos of decentralized operations, we now need a similar movement for data—one that rethinks how we approach governance, engineering, and automation. A Shift Left approach to data requires embedding quality, governance, and security at the source, not just patching issues downstream.

This new data paradigm must deliver on several fronts:

  • Schema and contract enforcement at ingestion, to prevent breakages by validating structure at the point of creation (see the sketch after this list).
  • Versioning and change management, applying DevOps principles to schema evolution and business logic to ensure traceability and control.
  • End-to-end lineage tracking, giving teams visibility into how data transforms across systems, helping them understand and reduce the blast radius of change.
  • Automated compliance enforcement, detecting and tagging sensitive data like PII or financial records at the source.
  • Observability and real-time monitoring, to catch anomalies and schema drift before they impact analytics or AI.
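
As a minimal sketch of the first point, the example below validates an event against a contract at the moment it is produced, checking both structure and one simple semantic rule. The event shape, field names, and use of the jsonschema library are assumptions for illustration; a real data contract would typically also carry ownership, SLAs, and a compatibility policy.

    # produce_order_event.py - validate against a contract at the point of creation (illustrative only)
    from jsonschema import validate, ValidationError  # pip install jsonschema

    # A hypothetical contract for an "order_placed" event, versioned alongside the producer's code.
    ORDER_PLACED_V1 = {
        "type": "object",
        "required": ["order_id", "customer_id", "total_usd", "placed_at"],
        "properties": {
            "order_id": {"type": "string"},
            "customer_id": {"type": "string"},
            "total_usd": {"type": "number", "minimum": 0},  # a semantic rule, not just shape
            "placed_at": {"type": "string"},
        },
        "additionalProperties": False,  # new fields require a contract change, not a silent addition
    }

    def emit(event: dict) -> None:
        """Refuse to publish anything that violates the contract."""
        try:
            validate(instance=event, schema=ORDER_PLACED_V1)
        except ValidationError as err:
            # In CI this fails the build; at runtime it could route to a dead-letter queue.
            raise RuntimeError(f"order_placed contract violation: {err.message}") from err
        print("publishing", event)  # stand-in for the real producer (Kafka, a queue, etc.)

    emit({"order_id": "o-1", "customer_id": "c-9", "total_usd": 18.5,
          "placed_at": "2025-03-31T12:00:00Z"})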

Operationally, shifting data left also changes how teams work:

  • Engineering teams must own data quality, just as they own application reliability.
  • Governance must become proactive, driven by automation and policy enforcement, not just documentation.
  • Data contracts should be standardized to reduce fragmentation and ensure consistent expectations across teams.
  • Compliance checks must be embedded into CI/CD pipelines, mirroring how DevSecOps integrates security into the development lifecycle.

The lessons from other shift-left approaches are clear: embedding quality, security, and governance at the source is the only scalable approach. The same applies to data. The future of data engineering is not about building bigger, more sophisticated reactive teams—it is about pushing responsibility upstream, empowering engineering teams, and enforcing quality at the point of data creation.

Much like operations and security before it, data must shift left. Organizations that fail to adapt will face the same problems they always have, except this time the excuse of “it’s someone else’s problem” won’t cut it when the success of a company’s AI initiative is on the line. Those that succeed will transform data into a true production-grade asset, enabling faster decision-making, higher reliability, and greater business value.

Data as Code

A core idea behind shifting Data Left is simple but often overlooked: data is code. Or more accurately—data is produced by code. It’s not just some downstream artifact that lives in tables and gets piped into dashboards and spreadsheets. Every record, event, or log starts somewhere—created, updated, or deleted by a line of code. And just like DevOps demonstrated, if you want to manage something well, you start at the point of creation.

Imagine you're in charge of a machine that produces a high-value product every day. Your job is to ensure quality. If something starts breaking, you don’t sit around analyzing boxes of defective products—you inspect the machine. Code is the machine. It runs on rules, inputs, and constraints that result in some form of data—a CRM entry, an API payload, a Kafka event, a database write. Managing that data means managing the system that generates it.

This is where it’s useful to separate data management into three different but interconnected layers:

  • DevOps is focused on the software development lifecycle—code, pipelines, and deployment.
  • Observability is about the data that’s already been produced—monitoring records, metrics, and aggregates.
  • Business glossaries operate at a higher level—covering domains, policies, compliance, and internal processes.

Each of these layers has value, but they serve different personas and purposes. DevOps is the proactive software engineering layer; observability is the reactive data team layer; business glossaries are organizational scaffolding. If one of these layers is missing, it becomes incredibly difficult to connect the dots.

For example: without data lineage, business processes can’t be tied back to any actual systems or datasets. Without code lineage, your data lineage is blind outside the warehouse—you have no idea where upstream data is coming from or what’s generating it.

This is why data management patterns—catalogs, contracts, lineage, and monitors—shouldn’t be thought of as individual tools. They’re cross-cutting patterns that apply across personas:

  • Engineers need catalogs of code assets, source systems, and event producers.
  • Data teams need catalogs of tables, metrics, and dashboards.
  • Business teams need catalogs of domains, data products, and process workflows.

Same goes for lineage, contracts, and monitors. It’s not enough to do these things in isolation—we need contract enforcement in CI/CD, not just in downstream pipelines. We need code-level lineage, not just column-level lineage. We need contract monitors that can detect schema and semantic breakage as early as possible.
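
One way to read "detect schema and semantic breakage as early as possible" is as a backward-compatibility check that runs in CI whenever a contract file changes. The sketch below is a simplified, hypothetical version of such a check: it only flags removed fields and changed types, whereas a real compatibility policy would also cover renames, narrowed enums, and semantic constraints.

    # contract_diff.py - flag backward-incompatible changes between contract versions (illustrative only)

    OLD = {"order_id": "string", "total_usd": "number", "placed_at": "string"}
    NEW = {"order_id": "string", "total_usd": "string"}  # type change plus a removed field

    def breaking_changes(old: dict, new: dict) -> list[str]:
        problems = []
        for field, old_type in old.items():
            if field not in new:
                problems.append(f"removed field '{field}' that consumers may rely on")
            elif new[field] != old_type:
                problems.append(f"changed type of '{field}' from {old_type} to {new[field]}")
        return problems  # adding new optional fields is non-breaking, so it is not checked here

    if __name__ == "__main__":
        issues = breaking_changes(OLD, NEW)
        for issue in issues:
            print("BREAKING:", issue)
        raise SystemExit(1 if issues else 0)  # a non-zero exit blocks the pull request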

But it’s not just about capabilities—it’s about making sure those capabilities actually work for the right people. Effective data management requires alignment across three key groups: producers, consumers, and business teams. And each of these groups has very different needs.

Before the rise of the modern data stack, we had legacy catalogs—manual systems designed for centralized data stewardship. These catalogs were maintained by data stewards who filled in process docs, definitions, and ownership by hand. That model worked when data governance was owned by a few people and everything moved slowly.

Then modern data catalogs showed up—tools that connected directly to warehouses like Snowflake and Databricks and scanned tables, dashboards, and metrics. They gave data teams a lot more visibility into the artifacts they worked with day-to-day. But they also ran into friction with software engineering teams. The problem? These catalogs weren’t built for engineers. They didn’t expose anything about the code that produces data, the services that emit events, or the systems that control data generation. And when a tool doesn’t map to your responsibilities, it doesn’t get adopted.

Engineers want to understand how the code they own creates data, where that data goes, and what depends on it. They care about breaking changes, contract violations, and runtime errors—but none of that is visible in warehouse-first tooling. A table in Snowflake doesn’t tell you what GitHub repo created it or what line of code owns the transformation logic. So engineers are left in the dark, and downstream teams are left managing the fallout.

To bridge that gap, we need to bring in techniques that have worked elsewhere—DevOps, CI/CD, and even security. Software composition analysis gives us a model for understanding dependencies. Dataflow analysis gives us insight into how code transforms data. CI pipelines can block bad changes before they reach production. Contract tests can catch mismatches between producers and consumers early. The point is: we already know how to solve this in software—we just need to apply those lessons to data.
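
Contract tests in particular translate almost directly from the API world to data. Below is a hedged sketch of what one might look like in a producer's test suite using pytest: the consumers' expectations are asserted against a sample of what the producer's code actually emits. The build_order_event function and the expectation table are hypothetical stand-ins.

    # test_order_contract.py - a consumer-driven contract test in the producer's CI (illustrative only)
    # Run with: pytest test_order_contract.py

    # Hypothetical producer code under test.
    def build_order_event(order: dict) -> dict:
        return {"order_id": order["id"], "customer_id": order["customer"], "total_usd": order["total"]}

    # Expectations contributed by downstream consumers (analytics, ML, finance reporting).
    CONSUMER_EXPECTATIONS = {
        "order_id": str,
        "customer_id": str,
        "total_usd": (int, float),
    }

    def test_order_event_satisfies_consumer_contract():
        event = build_order_event({"id": "o-1", "customer": "c-9", "total": 18.5})
        for field, expected_type in CONSUMER_EXPECTATIONS.items():
            assert field in event, f"missing field consumers depend on: {field}"
            assert isinstance(event[field], expected_type), f"unexpected type for {field}"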

Until we do, data will remain something engineers generate but don’t own—leaving the rest of the organization to clean up the mess downstream.

The number one question I get is: “Chad, this all sounds great, and conceptually we’re on board, but how? Where do we start?”

The answer: find allies on the software engineering team—specifically, those who already think in terms of quality, contracts, and validation. One of the best places to start? QA engineers and automated testing teams. They’re on a shift-left journey of their own, working to push testing and validation closer to the source of truth: the code. They already use a familiar concept—contract testing—to enforce expectations between APIs. This process defines the structure and behavior of communication between systems before runtime. Sound familiar? That’s a data contract.

This concept can—and should—be extended to cover all data ingress and egress points: where data enters a system, and where it leaves. Just like APIs, data systems need contracts at the edges. These contracts should validate everything from schema shape to semantic meaning. But more importantly, they shouldn’t just observe the data after it's been produced—they should observe the code that produces it. We need systems that catch breaking changes at the source, during development, not days later in a downstream dashboard.

A system like that doesn’t just help QA—it scales to every engineer. It defines clear ownership for producers and gives data teams hooks into the creation process instead of chasing quality issues downstream. It embeds data quality inside code quality.

And here’s the mindset shift: most data quality problems are code quality problems. There are really two types.

  1. A lapse in judgment—an engineer skips writing a test and a bug slips through.
  2. A broken dependency—an engineer unknowingly changes something a downstream team relied on.

The second category is where things get interesting. Think of a backend engineer changing an API that silently breaks the frontend. That’s not just a code quality issue—it’s a data quality issue. A schema changed. Expectations weren’t communicated. If engineers adopted data contracts to protect themselves, they’d also protect everyone else: analysts, ML teams, finance reporting, compliance—you name it.

And this approach isn’t limited to traditional pipelines. AI is about to turn this problem up to eleven.

Autonomous agents making code changes might work when isolated to a single codebase. But LLMs struggle with system-wide context—how one service depends on another, how a change in a producer might cascade across APIs, databases, and pipelines. As AI starts making changes across systems, data becomes the medium of communication—not just between humans and machines, but between AIs themselves. And with it comes a combinatorial explosion of data dependencies and breakages. Without contracts and shift-left enforcement, we’ll be flying blind into that complexity.

What Happens When We Get This Right

1. Data teams move upstream

When data quality enforcement happens downstream, data teams are left cleaning up issues they didn’t create and don’t control. But when contracts, lineage, and validation happen at the code level, data teams become active participants in the creation of data—not just its consumers. They can influence modeling decisions, track where data originates, and establish shared accountability with engineering. Instead of writing Slack messages about broken dashboards, they’re writing rules that prevent them from breaking in the first place.

2. Compliance becomes code

Policies don’t scale when they live in wikis or checklists. But they do scale when they’re encoded directly into CI/CD. Contracts as code let organizations define standards—PII tagging, schema validation, retention rules—and then automatically enforce them wherever data is produced. This moves compliance from something reactive and manual to something continuous and automated. No more chasing teams down before audits—violations get caught at the pull request.
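
A hedged sketch of what "contracts as code" can mean for compliance: a CI step that refuses to merge a contract in which fields that look like personal data are not explicitly tagged. The field-name heuristics and tag format are assumptions; a real policy would be richer and tied to the organization's own classification scheme.

    # check_pii_tags.py - fail the pull request if likely-PII fields are untagged (illustrative only)

    LIKELY_PII = {"email", "phone", "ssn", "date_of_birth", "ip_address"}

    # A hypothetical contract definition checked into the producer's repository.
    CONTRACT_FIELDS = {
        "order_id": {"type": "string", "tags": []},
        "email": {"type": "string", "tags": []},            # missing "pii" tag -> violation
        "ip_address": {"type": "string", "tags": ["pii"]},  # correctly tagged
    }

    def untagged_pii(fields: dict) -> list[str]:
        return [
            name for name, spec in fields.items()
            if name in LIKELY_PII and "pii" not in spec.get("tags", [])
        ]

    if __name__ == "__main__":
        violations = untagged_pii(CONTRACT_FIELDS)
        for name in violations:
            print(f"policy violation: field '{name}' looks like PII but is not tagged")
        raise SystemExit(1 if violations else 0)  # caught at the pull request, not at audit time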

3. Engineers get early feedback

Software engineers are used to fast feedback loops. They expect test failures, linting errors, and contract violations to be caught before they merge code—not after something explodes in production. Data should work the same way. If a change to a schema or data payload is going to break an ML model or a reporting pipeline, the engineer should know before they hit merge. That kind of signal creates trust—and makes it easier for engineering to take ownership of data quality.

4. Quality becomes a shared responsibility

Right now, data quality is everyone’s problem and no one’s responsibility. But when data contracts are treated like API contracts, and quality issues are caught in dev, the responsibility naturally shifts to the teams closest to the cause. Engineers own their breakages, data teams own their transformations, and business teams finally get transparency into what’s reliable. Everyone’s incentives align. Quality isn’t something you inspect later—it’s something you build from the start.

5. The language changes

When data issues are framed as code issues, the conversation changes. Engineers don’t have to learn new tools or vocabulary—they just see tests, contracts, and CI checks like they do in the rest of their codebase. And when the language is familiar, adoption skyrockets. Suddenly, “data quality” isn’t a data team problem—it’s a software engineering best practice. This is how data quality becomes something everyone owns—because it's finally framed in a language software teams understand.

Conclusion

You made it to the end. Thanks for the read, I know it was long. This is a subject I’m passionate about, and I believe that shifting left unlocks such a wide range of solutions and utility for data teams that it is hard to describe succinctly. In my opinion, this movement will be as impactful for software development as the advent of DevOps. And in the age of AI, where data matters more than ever, developing the systems and culture to manage data as code is critical.

You may have noticed this article was light on details. That was intentional, for my own sanity. In the next article, The Engineering Guide to Shifting Data Left, we’ll get more hands-on with real-world examples, use cases, and implementations. Take care until then, and good luck.

-Chad

Chad Sanderson
Gable.ai
CEO & Co-Founder