The Coming Crisis: Fingers of Instability

Mike's Notes

An excellent description of the critical state using a real-world example: the global financial system.

I think social systems, trees, planets, cars, and even Capitalism, like everything else in the universe, have a birth, an existence, and a death. The laws of science apply to everything.

I have read the first two excellent books in the references. I must also read the book by Didier Sornette.

Resources

References

  • Ubiquity: Why Catastrophes Happen, by Mark Buchanan.
  • Antifragile, by Nassim Taleb.
  • Why Stock Markets Crash, by Didier Sornette.

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Thoughts from the Frontline
  • Home > Handbook > 

Last Updated

02/03/2026

The Coming Crisis: Fingers of Instability

By: John Mauldin
Thoughts from the Frontline: 28/02/2026

John Mauldin is Co-Founder of Mauldin Economics.

This letter is a little different. I am working on my book about what I believe is a coming crisis, reviewing five different cycle theories. Each arrives at a similar scenario from a different point of view, and for different reasons, but they all suggest a crisis occurring sometime around the end of this decade or shortly thereafter. One background element ties them together, and it is the subject of today’s letter.

This is essentially a shortened first chapter, with a lot of edits and additions. For long-time readers, that background connection is our old friend: sandpiles and fingers of instability. Jumping in…

Ubiquity, Complexity Theory, and Sandpiles

With five different views about the coming crisis, which one is right? Do they conflict or reinforce each other? The correct answer is that they’re all connected, but not in obvious ways. In the end, it makes no difference which one is “more” right; the results will be the same. Understanding this below-the-radar connection is key to making sure you, your family, your community, and your country all get through this to the inevitable positive conclusion, even if it is a very bumpy ride.

We are going to start our exploration with excerpts from an important book by Mark Buchanan, called Why Catastrophes Happen. I HIGHLY recommend it to those of you who, like me, are trying to understand the complexity of the markets, economy and politics/society. The book is about chaos theory, complexity theory and critical states. It is written in layman’s terms. There are no equations, just easy-to-grasp, well-written stories and analogies. But it gives us an essential framework to understand the coming storms.

As kids, we all had the fun of going to the beach and playing in the sand. Remember taking your plastic buckets and making sand piles? Slowly pouring the sand into an ever-bigger pile, until one side of the pile started an avalanche?

Imagine, Buchanan says, dropping one grain of sand after another onto a table. A pile soon develops. Eventually, just one grain starts an avalanche. Usually it’s a small one, but sometimes it builds on itself and an entire side of the pile seems to collapse. Why?

Well, in 1987 three physicists named Per Bak, Chao Tang, and Kurt Wiesenfeld began to play the sandpile game in their lab at Brookhaven National Laboratory in New York. Now, piling one grain of sand at a time is a slow process, so they wrote a computer program to do it. Not as much fun, but a whole lot faster. Not that they really cared about sandpiles. They were interested in what are called nonequilibrium systems.

They learned some interesting things. What is the typical size of an avalanche? After a huge number of tests with millions of grains of sand, they found there is no typical size. "Some involved a single grain; others, ten, a hundred or a thousand. Still others were pile-wide cataclysms involving millions that brought nearly the whole mountain down. At any time, literally anything, it seemed, might be just about to occur." The piles were chaotic in their unpredictability.

Now, let’s read this next paragraph from Buchanan slowly. It is important, as it creates a mental image that may help us understand the organization of financial markets, the world economy and society (emphasis mine).

"To find out why (such unpredictability) should show up in their sandpile game, Bak and colleagues next played a trick with their computer. Imagine peering down on the pile from above, and coloring it in according to its steepness. Where it is relatively flat and stable, color it green; where steep and, in avalanche terms, ‘ready to go,’ color it red. What do you see? They found that at the outset the pile looked mostly green, but that, as the pile grew, the green became infiltrated with ever more red. With more grains, the scattering of red danger spots grew until a dense skeleton of instability ran through the pile. Here then was a clue to its peculiar behavior."

The Critical State

Something only a math nerd could love? Scientists refer to this as a “critical state.” The term can mean the point at which water turns to ice or steam, or the moment at which a critical mass triggers a nuclear reaction, etc. It is the point at which something triggers a change in the basic nature or character of the object or group. Thus (and very casually for all you physicists), we refer to something being in a critical state (or use the term critical mass) when there is the opportunity for significant change.

"But to physicists, [the critical state] has always been seen as a kind of theoretical freak sideshow, a devilishly unstable and unusual condition that arises only under the most exceptional circumstances [in highly controlled experiments]… In the sandpile game, however, a critical state seemed to arise naturally through the mindless sprinkling of grains."

Thus, they asked themselves, could this phenomenon show up elsewhere? In the earth’s crust, triggering earthquakes, or as wholesale changes in an ecosystem – or as a stock market crash?

"Could the special organization of the critical state explain why the world at large seems so susceptible to unpredictable upheavals?" Could it help us understand not just earthquakes, but why cartoons in a third-rate paper in Denmark could cause world-wide riots?

Buchanan concludes in his opening chapter:

"There are many subtleties and twists in the story … but the basic message, roughly speaking, is simple: The peculiar and exceptionally unstable organization of the critical state does indeed seem to be ubiquitous in our world. Researchers in the past few years have found its mathematical fingerprints in the workings of all the upheavals I’ve mentioned so far [earthquakes, eco-disasters, market crashes], as well as in the spreading of epidemics, the flaring of traffic jams, the patterns by which instructions trickle down from managers to workers in the office, and in many other things. At the heart of our story, then, lies the discovery that networks of things of all kinds – atoms, molecules, species, people, and even ideas – have a marked tendency to organize themselves along similar lines. On the basis of this insight, scientists are finally beginning to fathom what lies behind tumultuous events of all sorts, and to see patterns at work where they have never seen them before."

Going back to the sandpile game, you find that as you double the number of grains of sand involved in an avalanche, an avalanche of that size becomes 2.14 times less likely. We find something similar in earthquakes. In terms of energy, the data indicate that earthquakes become four times less likely each time you double the energy they release. Mathematicians refer to this as a "power law," a special mathematical pattern that stands out in contrast to the overall complexity of the earthquake process.
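The sandpile game the physicists ran is simple enough to sketch in a few lines of code. The following is an illustrative toy version of the Bak-Tang-Wiesenfeld model, not their actual program; the grid size, grain count, and toppling threshold of four are conventional textbook choices, and running it shows the same wild spread of avalanche sizes Buchanan describes:

```python
import random

def simulate(size=20, grains=5000, seed=1):
    """Drop grains one at a time on a size x size grid.

    A cell holding 4 or more grains "topples", sending one grain to each
    of its four neighbours; grains falling off the edge are lost. Returns
    the avalanche size (number of topplings) triggered by each grain.
    """
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    avalanche_sizes = []
    for _ in range(grains):
        x, y = rng.randrange(size), rng.randrange(size)
        grid[y][x] += 1
        topples = 0
        unstable = [(x, y)] if grid[y][x] >= 4 else []
        while unstable:
            cx, cy = unstable.pop()
            if grid[cy][cx] < 4:
                continue  # already relaxed by an earlier toppling
            grid[cy][cx] -= 4
            topples += 1
            for nx, ny in ((cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)):
                if 0 <= nx < size and 0 <= ny < size:
                    grid[ny][nx] += 1
                    if grid[ny][nx] >= 4:
                        unstable.append((nx, ny))
        avalanche_sizes.append(topples)
    return avalanche_sizes

sizes = simulate()
print("largest avalanche:", max(sizes), "| grains causing none:", sizes.count(0))
```

Most grains cause nothing at all, yet every so often one grain sets off a cascade hundreds of topplings long, with no "typical" size in between.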

Fingers of Instability

So, what happens in our game?

"…after the pile evolves into a critical state, many grains rest just on the verge of tumbling, and these grains link up into ‘fingers of instability’ of all possible lengths. While many are short, others slice through the pile from one end to the other. The chain reaction triggered by a single grain might lead to an avalanche of any size whatsoever, depending on whether that grain fell on a short, intermediate or long finger of instability."

Now, we come to a critical point in our discussion of the critical state. Again, read this with not just markets but our entire society in mind:

"In this simplified setting of the sandpile, the power law also points to something else: the surprising conclusion that even the greatest of events have no special or exceptional causes. After all, every avalanche, large or small, starts out the same way, when a single grain falls and makes the pile just slightly too steep at one point. What makes one avalanche much larger than another has nothing to do with its original cause, and nothing to do with some special situation in the pile just before it starts. Rather, it has to do with the perpetually unstable organization of the critical state, which makes it always possible for the next grain to trigger an avalanche of any size."

This concept applies not just to financial markets, but to how we organize our political systems, generational differences, geopolitics and war, the over-production of elites, and even how information is interpreted. They ALL connect. The Great Recession was a financial crisis. COVID-19 was a health crisis that brought on a financial crisis and added political crises, further dividing a fractious world.

We all see pressures building up in many different aspects of society. They each create their own fingers of instability. But in the sandpile of life, they are connected. 

Now, let’s couple this idea with a few other concepts. First, Hyman Minsky (who should have been a Nobel laureate) points out that stability leads to instability. The more comfortable we get with a given condition or trend, the longer it will persist, and the more dramatic the correction when the trend finally fails.

The problem with long-term macroeconomic stability is that it tends to produce unstable financial arrangements, just as long-term geopolitical or social stability will eventually produce a critical state. If we believe that tomorrow and next year will be the same as last week and last year, we are more willing to add debt or postpone savings in favor of current consumption. Or ignore any of a number of societal crises. Thus, says Minsky, the longer the period of stability, the higher the potential risk for even greater instability when market participants or a country’s citizens must change their behavior.

Relating this to our sandpile, the longer a critical state builds up in an economy, or in other words, the more "fingers of instability" are allowed to develop connections to other fingers of instability, the greater the potential for a serious "avalanche."

Therefore (and ironically), the longer a crisis takes to arrive, the bigger the repercussions. One of the conclusions at the end of the book will be that we simply don’t know when the avalanche will be triggered. Because the US is such a large and wealthy country, and many of the other shirts in the global laundry are just as (or even more) dirty, global money may keep coming to the US as a safe haven, prolonging our “stability” even as the sandpile grows to an ever more critical state.

We Are Managing Uncertainty

Or, maybe, a series of smaller shocks lessens the long reach of the fingers of instability, paradoxically giving rise to even more apparent stability. This is the thrust of Nassim Taleb’s book, Antifragile.

“People often think that the opposite of fragility is durability. If something is fragile, that means it’s easily broken. Therefore, if something isn’t easily broken, logically that should mean it’s the opposite of fragile. However, there’s another step beyond. Since there isn’t an established English word for such a thing, [Nassim] calls it antifragility—not just the lack of fragility, but its true opposite.

“We live in an unpredictable world. The models and theories we use to try to predict the future invariably fall apart as unforeseen events prove them wrong and, in turn, destroy the plans we made based on those models. Clearly, systems based on such flawed models are bound to be fragile—easily broken.

“The solution to this problem is antifragility. Instead of a never-ending search for more accurate models and better predictions, all we need to do is make sure that we’re in a position to benefit from uncertainty and volatility instead of being harmed by it.

“This is hardly a new concept; nature exhibits antifragility in almost everything she creates. An organism can strengthen itself through minor damage in the form of exercise. In a similar sense, a species can strengthen itself through minor damage in the form of natural selection, which leads to evolution.

“However, unlike nature, humans try to control the world through models and rules. We think we can perfectly predict the future and avoid any shocks that would cause our fragile systems to fall apart. We think we can outsmart millions of years of evolution and antifragility, and we’re almost invariably wrong.

“Instead of trying to predict the future, we should assume that there will be major events we can’t see coming—because, sooner or later, there will be. If we’re prepared for them, using the methods and practices explained in this book, we can make sure that such events work to our advantage instead of hurting us. By avoiding fragility and embracing antifragility wherever possible, we can set ourselves up to thrive in an uncertain world.”

Another way to think about it is the way Didier Sornette, a French geophysicist, has described financial crashes in his wonderful book, Why Stock Markets Crash (the math, though, was far beyond me!). He wrote:

"[T]he specific manner by which prices collapsed is not the most important problem: a crash occurs because the market has entered an unstable phase and any small disturbance or process may have triggered the instability. Think of a ruler held up vertically on your finger: this very unstable position will lead eventually to its collapse, as a result of a small (or an absence of adequate) motion of your hand or due to any tiny whiff of air… the instantaneous cause of the collapse is secondary."

When things are unstable, it isn’t the last grain of sand that causes the pile to collapse or the slight breeze that causes the ruler on your fingertip to fall. Those are the "proximate" causes. They’re the closest reasons at hand for the collapse. The real reason, though, is the "remote" cause, the farthest reason. The farthest reason is the underlying instability of the system itself.

This is one reason we get "fat tails" in financial markets. In theory, returns on investment should look like a smooth bell curve, with the ends tapering off into nothing. According to the theoretical distribution, events that deviate from the mean by five or more standard deviations ("5-sigma events") are extremely rare, with 10 or more sigma being practically impossible – at least in theory.
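For intuition, the tail probabilities of an ideal bell curve can be computed directly from the complementary error function; this is a standard statistical result, not something from the letter:

```python
import math

def normal_tail_prob(sigma):
    # Two-sided probability that a normally distributed variable lands
    # more than `sigma` standard deviations away from its mean.
    return math.erfc(sigma / math.sqrt(2))

for s in (3, 5, 10):
    print(f"{s}-sigma: {normal_tail_prob(s):.2e}")
# roughly 2.7e-03, 5.7e-07, and 1.5e-23 respectively
```

A 10-sigma day "should" essentially never happen in the lifetime of the universe, which is exactly why observing such moves tells us the bell curve is the wrong model.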

However, under certain circumstances, such events are more common than expected; 15-sigma or even rarer events have happened in the world of investing. Examples include Long-Term Capital Management in the late 1990s and any of a dozen bubbles in history. Because high-sigma events are far more common in the real world than theory suggests, the distribution is "fatter" at the extremes ("tails") than one would expect.

This holds true in geopolitics, too. The unthinkable sometimes happens. Before World War I began, no one thought it would come to war. Peace had been the rule for 40 years. Surely, mankind had evolved. Until…

Thus, the build-up of critical states, those fingers of instability, is perpetuated even as, and precisely because, we hedge risks. We try to "stabilize" the risks we see, shoring them up with derivatives, emergency plans, insurance, treaties, alliances, political change and all manner of risk-control procedures. And by doing so, the economic and social systems can absorb body blows that would have been severe only a few decades ago. We distribute the risks, and their effects, throughout the system.

Yet as we reduce the known risks, we sow the seeds for the next 10-sigma event. It is the improbable, unseen risks that will create the next real crisis. It is not that the fingers of instability have been removed from the equation, it is that they lurk in different places, not yet visible.

A Stable Disequilibrium

We end up in a critical state that Paul McCulley calls "stable disequilibrium." It has "players" all over the world, tied inextricably together in a vast dance through investment, debt, derivatives, trade, globalization, international business and finance. Each player works hard to maximize their own personal outcome and reduce their exposure to "fingers of instability."

The longer we go on, asserts Minsky, the more likely and violent any "avalanche" is. The more the fingers of instability can build, the more that state of stable disequilibrium can go critical on us.

It's all connected. We are building an unstable sandpile and it will come crashing down at some point. Then we will have to dig our way out.

The good news is we have seen this movie before. And after the crisis, a new period of stability and growth follows, for at least another 50-80 years. In my upcoming book we will look for ways to get through to that happier future.

Scottsdale, Houston, Los Angeles, West Palm Beach, Boston and New York

Next week I fly to Houston where I am on an economic advisory board for the Rice University economics department. Then I will be in LA meeting with the Inner Circle, exploring several companies that are literally changing the technology landscape of defense and energy. We will be opening clinics in West Palm Beach and the DC area, hopefully in early April. Construction has begun. Then NYC and Boston.

I finish this from Scottsdale where Dr. Roizen and I are attending the 2026 Functional Longevity Summit, along with 300-400 doctors. The organizers have asked us to talk about Therapeutic Plasma Exchange. For those interested in staying healthy for longer, Mike and many experts now believe the first part of your journey should begin with therapeutic plasma exchange. Seriously. You can learn more at Lifespan-Edge.com (note the dash). If you haven’t already, you really need to read our main research report. The research and other information can make a real difference in your life. You can set up a discovery call to talk with our doctors about the procedure, see if it is right for you, and look at a lot more research.

And with that, I will hit the send button. Have a great week.

Your thinking how to make my body antifragile analyst,

Small steps add up

Mike's Notes

A great book about how to use a "systems approach" to grow a healthy, mentally strong attitude.

1 per cent better every day is also my game plan for Pipi 9. It compounds.

Resources

References

  • Atomic Habits by James Clear. Penguin.
  • The Atomic Habits Workbook by James Clear, published by Penguin, $45.

Repository

  • Home > Ajabbi Research > Library > Subscriptions > 3-2-1 Newsletter
  • Home > Handbook > 

Last Updated

01/03/2026

Small steps add up

By: James Clear
Stuff: 12/01/2026

I’ve been writing at JamesClear.com about habits, decision making, and continuous improvement since 2012. I’m the author of the #1 New York Times bestseller, Atomic Habits, which has sold more than 25 million copies worldwide and has been translated into more than 60 languages. I'm also known for my popular 3-2-1 newsletter, which is sent out each week to more than 3 million subscribers.

At the start of a new year, it’s tempting to aim for big, dramatic changes that promise overnight transformation and shiny, instant results. But according to James Clear, the best-selling author of Atomic Habits, lasting success rarely arrives in a single heroic jolt. Instead, it’s built quietly, steadily, one small choice at a time.

In a new workbook, he explains the power of consistent improvements based around the idea that if you get just 1 per cent better each day, those little gains compound into remarkable results. Small habits are easy to overlook in the moment, but they’re the very building blocks of long-term change.

So, as we step into a fresh year and all the possibility it holds, this feels like the perfect reminder: you don’t need to overhaul your life to make progress.

You don’t need huge motivation or massive willpower. You just need to start small, stay consistent and trust that every tiny step is moving you somewhere bigger.

1 per cent better every day

The typical approach to self-improvement is to set a large goal, then try to take big leaps in order to accomplish it in as little time as possible. Too often, we convince ourselves that change is meaningful only if there is some large, visible outcome associated with it. Whether it is getting stronger, building a business, travelling the world or any number of goals, we put pressure on ourselves to make some earth-shattering improvement that will awe everyone around us.

While this may sound good in theory, it often ends in burnout, frustration and failure. And yet, while improving by just 1 per cent every day isn’t notable (and sometimes isn’t even noticeable), it can be just as meaningful, especially in the long run.

It is so easy to dismiss the value of making slightly better decisions on a daily basis. Sticking with the fundamentals is not impressive. Falling in love with boredom is not exciting. Getting 1 per cent better isn’t going to make headlines.

There is one thing about it though – it works.

In the beginning, there is basically no difference between making a choice that is 1 per cent better or not. (In other words, it won’t impact you very much today.) But as time goes on, these small improvements compound, and you suddenly find a very big gap between people who make slightly better decisions on a daily basis and those who don’t.

Here’s the punch line: if you get 1 per cent better each day for one year, you’ll end up 37 times better by the time you’re done. That’s probably a more massive result than you would ever expect, even from a one-time heroic leap, and yet it’s achievable through just one tiny change a day.

This is why small choices don’t make much of a difference at the time but add up over the long term.

[IMG]

Photo: Edited extract from The Atomic Habits Workbook by James Clear, published by Penguin, $45.

But here’s the thing: if positive compounding is true, then so is the inverse. If you get 1 per cent worse each day for one year, you’ll decline nearly down to zero. The lesson is that what starts as a small win, or a minor setback, grows into something much greater. This is why the first important concept when it comes to behaviour change is the key role of continuous self-improvement. Just one tiny shift can change everything.
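The arithmetic behind both curves is a one-liner to verify: 365 daily steps of plus or minus 1 per cent, compounded.

```python
better = 1.01 ** 365  # 1 per cent better every day for a year
worse = 0.99 ** 365   # 1 per cent worse every day for a year
print(f"{better:.2f}x better, {worse:.4f}x of where you started")
# roughly 37.78x better, versus a decline to about 3% of the start
```

The asymmetry is striking: the same-sized daily step compounds into a 37-fold gain in one direction and a near-total erosion in the other.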

If you want to predict where you’ll end up in life, all you have to do is follow the curve of tiny gains and losses and see how your daily choices will compound 10 or 20 years down the line.

This is why it doesn’t matter how successful or unsuccessful you are right now. What matters is whether your habits are putting you on the path toward success. Focus on your current trajectory, not your current results. It’s a much better indicator of where you’re headed.

So stop obsessing over the big and start focusing on the small – it’s the key to building the life you want. Success is the product of daily habits – not once-in-a-lifetime transformations.

Edited extract from The Atomic Habits Workbook by James Clear, published by Penguin, $45.

OpenTelemetry Collector vs. Grafana Alloy: Which Should You Choose?

Mike's Notes

Good to know for when I build Mission Control for the Pipi Data Centre later this year.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

28/02/2026

OpenTelemetry Collector vs. Grafana Alloy: Which Should You Choose?

By: Nick Flewitt
Fusion Reactor Blog: 25/02/2026

Nick lives in the UK and is the Marketing Owner at Intergral GmbH - makers of observability software FusionReactor, OpsPilot and OpenTrace.

In the world of modern observability, the “Sorting Office” of your telemetry data is just as important as the data itself. Whether you are shipping traces, metrics, or logs to FusionReactor Cloud, you generally have two main choices: the industry-standard OpenTelemetry Collector or the newcomer Grafana Alloy.

While both tools do the same basic job—receiving, processing, and exporting OTLP data—they cater to very different workflows.

The OpenTelemetry Collector (Contrib)

The “OG” of the space, the OTel Collector is a vendor-agnostic proxy that serves as the central hub of an observability pipeline.

Pros:

  • Industry Standard: It is the default choice for the CNCF ecosystem and has the widest community support.
  • YAML-Based: If your team is comfortable with Kubernetes and standard infrastructure-as-code, the indentation-based YAML configuration will feel like home.
  • Stability: Features such as the memory_limiter ensure the collector remains healthy even during massive data spikes.

Cons:

  • Visibility: Troubleshooting is done primarily through terminal logs. There is no native dashboard to see “where” data is getting stuck.
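To make the YAML point concrete, here is a minimal, hypothetical Collector pipeline. The exporter endpoint, header name, and environment variable are placeholders, not FusionReactor's actual values; check your vendor's documentation for the real ones:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:        # keeps the collector healthy during data spikes
    check_interval: 1s
    limit_mib: 512
  batch: {}

exporters:
  otlphttp:
    endpoint: https://example-otlp-endpoint.com   # placeholder
    headers:
      api-key: ${env:FR_API_KEY}                  # placeholder header name

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

Note that memory_limiter is deliberately placed first in the processor chain, so back-pressure kicks in before the batch processor buffers data.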

Grafana Alloy

Grafana Alloy is a “big tent” distribution that natively supports OpenTelemetry, Prometheus, and Loki formats in a single agent.

Pros:

  • The UI Dashboard: Alloy’s standout feature is a local web UI (usually at port 12345) that provides a live, visual flowchart of your pipeline.
  • Flow Configuration: Uses a programmable, component-based language (Alloy Flow) that is modular and easier to scale for complex routing.
  • Unified Agent: It handles Prometheus scraping more natively than the standard OTel collector.

Cons:

  • Learning Curve: You have to learn a new configuration syntax (Alloy’s HCL-inspired configuration language) rather than staying with standard YAML.
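For comparison, a roughly equivalent traces pipeline in Alloy's component syntax might look like the sketch below. Component names follow Alloy's otelcol.* convention; the endpoint is a placeholder:

```alloy
otelcol.receiver.otlp "default" {
  grpc { }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces = [otelcol.exporter.otlphttp.backend.input]
  }
}

otelcol.exporter.otlphttp "backend" {
  client {
    endpoint = "https://example-otlp-endpoint.com" // placeholder
  }
}
```

Each component's output explicitly wires into the next component's input, which is what makes the live flowchart in the UI possible.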

The Verdict: Which is “Best”?


  • "Set it and Forget it" – OTel Collector: A standard, reliable pipeline that uses the same YAML format as the rest of your stack.
  • Complex Data Routing – Grafana Alloy: Transform data in complex ways or scrape multiple sources (like Prometheus) simultaneously.
  • High-Visibility Needs – Grafana Alloy: See exactly how your data is moving (or failing) through a visual UI.

Final Thought

The best part about the modern observability landscape is that both tools use the OTLP standard. This means you can start with the OTel Collector today and switch to Alloy tomorrow (or vice versa) without changing a single line of code in your applications.

As long as you have your FusionReactor API Key ready, your data will find its home in the cloud regardless of which “middleman” you choose.

Having a Data Centre changes the roadmap

Mike's Notes

This is the revised Pipi roadmap now that the data centre is running. The recent Ajabbi Research report, "The Workspace Issue," has also been revised to reflect these changes.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

01/03/2026

Having a Data Centre changes the roadmap

By: Mike Peters
On a Sandy Beach: 27/02/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Having a separate Pipi Core Data Centre changes everything. This has a direct impact on what is possible and on the best path forward. 

This is the revised working Deployment Roadmap for Pipi. The sequence is correct; the timing is a guess. Start at Option A.

Option A

  • February 2026 – Data Centre migration: Separate networks for the isolated Pipi Core Data Centre and the Ajabbi Office.
  • March 2026 – Data Centre configuration: Pipi running autonomously 24x7x365, with a x10 increase in productivity.
  • April 2026 – Workspace testing 2: Workspaces containing 15,000 web pages are rendered by the CMS Engine with no errors.
  • May 2026 – Workspace testing 3: Workspace UI Menu rendered without errors.
  • June 2026 – Workspace testing 4: Draft in-context help & learning material generated for each workspace without errors.
  • July 2026 – Workspace testing 5: Module-based UI forms and data grids rendered with sample data without errors.
  • Workspace testing 6: Complete HTML workspace demo with connected initial developer documentation rendered without errors.
  • Workspaces for Pipi & Developers: Workspaces are deployed in the Data Centre for production use by System Admin and DevOps, with a x10 increase in productivity.
  • Mission Control: Telemetry on the Data Centre.
  • IaC: Deploy infrastructure on the GCP Free Tier.

If Ajabbi then gets into a Google Startup program with lots of credits, go to Option B; otherwise, go to Option C.

Option B

  • IaC: Deploy infrastructure on every non-free part of GCP.
  • IaC: BoxLang VM containing Pipi Open-Source, deployed to GCP via IaC.
  • MCP, A2A: Pipi > Gemini DeepMind > Pipi.
  • Working Demo: Fully working demo workspaces available on GCP for customers and developers to try out.
  • Customer Deployments: Ajabbi Personal, Developer, and Enterprise account Workspaces are available on GCP.
  • Security: Google Threat Intelligence + Mandiant deployed.
  • TPU: Pipi deploys to TPU (experimental).
  • Customer Deployments: Ajabbi Researcher and SME account Workspaces are available on GCP.

Try Pipi out on every available cloud platform using the free tier, waiting for any door to open and being ready to move fast.

Option C

  • IaC: Deploy infrastructure on the AWS Free Tier.
  • IaC: Deploy infrastructure on the Azure Free Tier.
  • IaC: Deploy infrastructure on the Digital Ocean Free Tier.
  • IaC: Deploy infrastructure on the IBM Free Tier.
  • IaC: Deploy infrastructure on the Oracle Free Tier.
  • IaC: Deploy infrastructure on the Wasabi Free Tier.
...

At a certain threshold, Pipi 10 will come into being, creating many more possibilities.

Supporting ChatGPT on PostgreSQL in Azure

Mike's Notes

Interesting discussion from a team at Microsoft on congestion algorithms for PostgreSQL. Pipi Core uses PostgreSQL.

I would be curious to understand the differences between the PostgreSQL standard and the offerings from Azure, GCP, etc.

Ajabbi enterprise customers can choose which SQL database to use on a cloud platform such as AWS, Azure or GCP.

  • MSSQL
  • MySQL
  • Oracle
  • PostgreSQL
  • etc

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

26/02/2026

Supporting ChatGPT on PostgreSQL in Azure

By: Affan Dar, Adam Prout, Panagiotis Antonopoulos
Microsoft Blog for PostgreSQL: 29/01/2026

Affan Dar: Vice President of Engineering, PostgreSQL at Microsoft

Adam Prout: Partner Architect, PostgreSQL at Microsoft

Panagiotis Antonopoulos: Distinguished Engineer, PostgreSQL at Microsoft.

How we scaled OpenAI’s mission-critical workload on Azure Database for PostgreSQL flexible server.

The OpenAI engineering team recently published a blog post describing how they scaled their databases by 10x over the past year, to support 800 million monthly users. To do so, OpenAI relied on Azure Database for PostgreSQL to support important services like ChatGPT and the Developer API. Collaborating with a customer experiencing rapid user growth has been a remarkable journey. 

One key observation is that PostgreSQL works out of the box at very large scale. As many in the public domain have noted, ChatGPT grew to 800M+ users before OpenAI started moving new and shardable workloads to Azure Cosmos DB.

Nevertheless, supporting the growth of one of the largest Postgres deployments was a great learning experience for both of our teams. Our OpenAI friends did an incredible job at reacting fast and adjusting their systems to handle the growth. Similarly, the Postgres team at Azure worked to further tune the service to support the increasing OpenAI workload. The changes we made were not limited to OpenAI, hence all our Azure Database for PostgreSQL customers with demanding workloads have benefited. 

A few of the enhancements and the work that led to these are listed below. 

Changing the network congestion protocol to reduce replication lag 

Azure Database for PostgreSQL used the default CUBIC congestion control algorithm for replication traffic to replicas both within and outside the region. Leading up to one of the OpenAI launch events, we observed that several geo-distributed read replicas occasionally experienced replication lag. Replication from the primary server to the read replicas would typically operate without issues; however, at times, the replicas would unexpectedly begin falling behind the primary for reasons that were not immediately clear. 

This lag would not recover on its own and would grow to the point where, eventually, automation would restart the read replica. Once restarted, the read replica would catch up again, only to repeat the cycle within a day or less.

After an extensive debugging effort, we traced the root cause to how the TCP congestion control algorithm handled a higher rate of packet drops. These drops were largely a result of high point-to-point traffic between the primary server and its replicas, compounded by the existing TCP window settings. Packet drops across regions are not unexpected; however, the default congestion control algorithm (CUBIC) treats packet loss as a sign of congestion and backs off aggressively. In comparison, the Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion control algorithm is less sensitive to packet drops. Switching to BBR, adding SKU-specific TCP window settings, and switching to the fair queuing network discipline (which can control pacing of outgoing packets at the hardware level) resolved this issue. We'll also note that one of our seasoned PostgreSQL committers provided invaluable insights during this process, helping us pinpoint the issue more effectively.
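The contrast in loss response can be seen in a toy model. The update rules and numbers below are illustrative only, not the real CUBIC or BBR algorithms: the point is that a loss-based controller shrinks its window on every drop, while a controller driven by a bandwidth estimate is largely unmoved by isolated, non-congestion losses.

```python
# Toy sketch (not the real algorithms): a loss-based controller backs off
# on every packet drop, while a bandwidth-model controller ignores stray
# drops that do not change its bandwidth estimate.

def cubic_style(window: float, loss: bool) -> float:
    """Multiplicative decrease on any loss, otherwise slow additive growth."""
    if loss:
        return window * 0.7   # back off hard, as if the drop meant congestion
    return window + 1.0       # grow slowly between losses

def bbr_style(window: float, loss: bool, bandwidth_estimate: float) -> float:
    """Window tracks the estimated bandwidth-delay product; an isolated
    loss does not change the bandwidth model."""
    return bandwidth_estimate

cubic_w, bbr_w = 100.0, 100.0
for rtt in range(100):
    loss = (rtt % 10 == 0)    # an occasional drop not caused by congestion
    cubic_w = cubic_style(cubic_w, loss)
    bbr_w = bbr_style(bbr_w, loss, bandwidth_estimate=100.0)

print(f"loss-based window after 100 RTTs:      {cubic_w:.1f}")
print(f"bandwidth-model window after 100 RTTs: {bbr_w:.1f}")
```

Even though the path can sustain a window of 100, the loss-based controller ends up far below it, which is the kind of persistent underuse that shows up as replication lag on a lossy cross-region link.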

Scaling out with Read replicas 

PostgreSQL primaries, if configured properly, work amazingly well in supporting a large number of read replicas. In fact, as noted in the OpenAI engineering blog, a single primary has been able to power 50+ replicas across multiple regions. However, going beyond this increases the chance of impacting the primary. For this reason, we added cascading replica support to scale out reads even further. But this brings in a number of additional failure modes that need to be handled. The system must carefully orchestrate repairs around lagging and failing intermediary nodes, safely repointing replicas to new intermediary nodes while performing catch-up or rewind in a mission-critical setup.

Furthermore, disaster recovery (DR) scenarios can require a fast rebuild of a replica, and as data movement across regions is a costly and time-consuming operation, we developed the ability to create a geo replica from a snapshot of another replica in the same region. This feature avoids the traditional full data copy, which may take hours or even days depending on the size of the data, by leveraging data for that cluster that already exists in that region. This feature will soon be available for all our customers as well.

Scaling out Writes 

These improvements solved the read replica lag problems and read scale but did not help address the growing write scale for OpenAI. At some point, the balance tipped and it was obvious that the IOPS limits of a single PostgreSQL primary instance would not cut it anymore. As a result, OpenAI decided to move new and shardable workloads to Azure Cosmos DB, which is our default recommended NoSQL store for fully elastic workloads. However, some workloads, as noted in the OpenAI blog, are much harder to shard.

While OpenAI is using Azure Database for PostgreSQL flexible server, several of the write scaling requirements that came up have been baked into our new Azure HorizonDB offering, which entered private preview in November 2025. Some of the architectural innovations are described in the following sections.

Azure HorizonDB scalability design 

To better support more demanding workloads, Azure HorizonDB introduces a new storage layer for Postgres that delivers significant performance and reliability enhancements: 

  • More efficient read scale-out. Postgres read replicas no longer need to maintain their own copy of the data; they can read pages from the single copy maintained by the storage layer. 
  • Lower latency Write-Ahead Logging (WAL) writes and higher throughput page reads via two purpose-built storage services designed for WAL storage and Page storage. 
  • Durability and high availability responsibilities are shifted from the Postgres primary to the storage layer, allowing Postgres to dedicate more resources to executing transactions and queries. 
  • Postgres failovers are faster and more reliable. 

To understand how Azure HorizonDB delivers these capabilities, let's look at its high-level architecture as shown in Figure 1. It follows a log-centric storage model, where the PostgreSQL write-ahead log (WAL) is the sole mechanism used to durably persist changes to storage. PostgreSQL compute nodes never write data pages to storage directly in Azure HorizonDB. Instead, pages and other on-disk structures are treated as derived state and are reconstructed and updated from WAL records by the data storage fleet.
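The log-centric idea, that pages are derived state replayable from the WAL, can be sketched in a few lines. The record layout and names below are illustrative assumptions, not HorizonDB's actual on-disk structures:

```python
# Sketch of a log-centric store: the WAL is the source of truth, and a
# page server can materialize any page by replaying the WAL records that
# touch it, in log order. Record format here is illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class WalRecord:
    lsn: int        # log sequence number: position in the WAL stream
    page_id: int    # which page this record modifies
    offset: int     # byte offset within the page
    data: bytes     # bytes written at that offset

PAGE_SIZE = 8192    # PostgreSQL's default page size

def materialize_page(page_id: int, wal: list) -> bytearray:
    """Rebuild a page purely from the log: start from a zeroed page and
    apply every record for this page in LSN order."""
    page = bytearray(PAGE_SIZE)
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.page_id == page_id:
            page[rec.offset:rec.offset + len(rec.data)] = rec.data
    return page

wal = [
    WalRecord(lsn=1, page_id=7, offset=0, data=b"hello"),
    WalRecord(lsn=3, page_id=7, offset=0, data=b"HELLO"),  # later write wins
    WalRecord(lsn=2, page_id=9, offset=0, data=b"other page"),
]

page7 = materialize_page(7, wal)
print(bytes(page7[:5]))   # latest bytes on page 7, derived from the log
```

Because any page can be rebuilt this way, the compute node never needs to ship pages to storage; it only needs to ship the log.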

Azure HorizonDB storage uses two separate storage services for WAL and data pages. This separation allows each to be designed and optimized for the very different patterns of reads and writes PostgreSQL does against WAL files in contrast to data pages.  The WAL server is optimized for very low latency writes to the tail of a sequential WAL stream and the Page server is designed for random reads and writes across potentially many terabytes of pages.


Figure 1 - Azure HorizonDB Architecture

These two separate services work together to enable Postgres to handle IO-intensive OLTP workloads like OpenAI's. The WAL server can durably write a transaction across 3 availability zones using a single network hop. The typical PostgreSQL replication setup with a hot standby (Figure 2) requires 4 hops to do the same work. Each hop is a component that can potentially fail or slow down and delay a commit. The Azure HorizonDB page service can scale out page reads to many hundreds of thousands of IOPS for each Postgres instance. It does this by sharding the data in Postgres data files across a fleet of page servers, which spreads the reads across many high-performance NVMe disks on each page server.

Figure 2 - WAL Writes in HorizonDB
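The fan-out idea behind the page service can be sketched as follows. The fleet size and the modulo placement below are illustrative assumptions, not HorizonDB's actual placement scheme:

```python
# Sketch of spreading page reads across a page-server fleet: each page is
# owned by one server, so concurrent reads for different pages land on
# different servers' disks. Modulo placement is purely illustrative; a
# real service would use a more careful scheme (and handle rebalancing).

from collections import Counter

NUM_PAGE_SERVERS = 8

def page_server_for(page_id: int) -> int:
    """Assign a page to one server in the fleet."""
    return page_id % NUM_PAGE_SERVERS

# 10,000 page reads fan out evenly across the fleet.
load = Counter(page_server_for(p) for p in range(10_000))
print(dict(load))
```

With reads spread this way, aggregate IOPS grows with the size of the fleet rather than being capped by a single machine's disks.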

Another key design principle for Azure HorizonDB was to move durability and high availability work off PostgreSQL compute, allowing it to operate as a stateless compute engine for queries and transactions. This approach gives Postgres more CPU, disk and network to run your application's business logic. Table 1 summarizes the tasks that community PostgreSQL has to do which Azure HorizonDB moves to its storage layer. Work like dirty page writing and checkpointing is no longer done by a Postgres primary. The work of sending WAL files to read replicas is also moved off the primary and into the storage layer; having many read replicas puts no load on the Postgres primary in Azure HorizonDB. Backups are handled by Azure Storage via snapshots; Postgres isn't involved.

Task | Resource Savings | Postgres Process Moved
WAL sending to Postgres replicas | Disk IO, Network IO | walsender
WAL archiving to blob storage | Disk IO, Network IO | archiver
WAL filtering | CPU, Network IO | shared storage specific (*)
Dirty page writing | Disk IO | background writer
Checkpointing | Disk IO | checkpointer
PostgreSQL WAL recovery | Disk IO, CPU | startup recovering
PostgreSQL read replica redo | Disk IO, CPU | startup recovering
PostgreSQL read replica shared storage | Disk IO | background, checkpointer
Backups | Disk IO | pg_dump, pg_basebackup, pg_backup_start, pg_backup_stop
Full page writes | Disk IO | backends doing WAL writing
Hot standby feedback | Vacuum accuracy | walreceiver

Table 1 - Summary of work that the Azure HorizonDB storage layer takes over from PostgreSQL 

The shared storage architecture of Azure HorizonDB is the fundamental building block for delivering exceptional read scalability and elasticity which are critical for many workloads. Users can spin up read replicas instantly without requiring any data copies. Page Servers are able to scale and serve requests from all replicas without any additional storage costs. Since WAL replication is entirely handled by the storage service, the primary’s performance is not impacted as the number of replicas changes. Each read replica can scale independently to serve different workloads, allowing for workload isolation. 

Finally, this architecture allows Azure HorizonDB to substantially improve the overall experience around high availability (HA). HA replicas can now be added without any data copying or storage costs. Since the data is shared between the replicas and continuously updated by Page Servers, secondary replicas only replay a portion of the WAL and can easily keep up with the primary, reducing failover times. The shared storage also guarantees that there is a single source of truth and that the old primary never diverges after a failover. This avoids the need for expensive reconciliation using pg_rewind or other techniques, and further improves availability.

Azure HorizonDB was designed from the ground up with learnings from large scale customers, to meet the requirements of the most demanding workloads. The improved performance, scalability and availability of the Azure HorizonDB architecture make Azure a great destination for Postgres workloads.

Pipi Data Centre Operational

Mike's Notes

Very good news for Pipi.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

26/02/2026

Pipi Data Centre Operational

By: Mike Peters
On a Sandy Beach: 25/02/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

The long-planned migration of Pipi 9 to its own data centre has been completed. It took 2 weeks to execute. The existing setup was split into an office network connected to the internet and an isolated data centre that is not connected to the internet.

Starting from zero

The initial data centre consists of a single 45U rack and some other shelving, with mainly older equipment. It will do for a start and can grow as more racks and servers are added and equipment is upgraded.

External hard drives being used in the shift

Issues

  • Terabytes of data on backup hard drives to shift
  • Clean reinstalls of many operating systems
  • 14 machines to configure
  • Adobe CS4 does not like Windows 11
  • Making do with what is available now
  • Go slow, think twice and get there faster

Opportunities

  • Pipi on 24x7x365
  • All systems can be turned on using multiple servers
  • DevOps automation is now possible
  • The development cycle will speed up 10x
  • The road to Pipi 10 with BoxLang is now open

What's next

  • Seat-of-the-pants experimenting to tune the setup
  • Stress test to build resiliency and reliability

Waimumu Field Days - Background to Ajabbi and Pipi

Mike's Notes

I recently attended the 3-day Southern Field Days held every two years at Waimumu, near Gore, Southland, New Zealand (NZ). It's an event for farmers and is very popular.

If you have never been, go next time. Farmers are clever people, and they feed us.

This year, I visited every company at the Field Days that has a software product for farmers to discuss the integration and scaling challenges they were facing and introduce Pipi.

Pipi could be used by developers to build SaaS for farm and forestry management.

I invited them to give me feedback by testing a demo workspace for agriculture later this year. I got quite a bit of interest and was given many business cards. I found it much easier to talk in person than over Zoom. I also learned that next time I should show them a live demo on a tablet.

This is a copy of the personalised email text I sent afterwards to establish contact. To be followed up on in a few months with individual invites to have a closer look, test, give me feedback, etc.

This message is being updated as I learn what works.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

25/02/2026

Waimumu Field Days - Background to Ajabbi and Pipi

By: Mike Peters
On a Sandy Beach: 24/02/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Waimumu site



Tractors and diggers

Fencing @ Waimumu

Hi xxx

I may have met you in person at the SI Field Days (or one of your people gave me your card) for a chat about an enterprise AI I am the architect of, which could be used for agritech. It has open integration.

There is not much to see publicly yet, but here is some background information for you. I will get in touch in a few months as beta testing proceeds.

Ajabbi.

Ajabbi is a community-driven, pre-revenue bootstrap startup with a first teaching customer and a main product, Pipi, which has a closed core and open-source applications. Pipi is optimised for building self-managing, no-code enterprise platforms for critical and socially useful infrastructure (health, agriculture, transport, nature conservation, arts & culture, built infrastructure, utilities). Customer fees are based on usage, with no moats.

There will be several parts to Ajabbi

    1. Ajabbi.com for SaaS operations and to handle income and cloud expenses. A percentage of income will support R&D. Net profit will go to the Ajabbi Foundation.
    2. Ajabbi Research for R&D on Pipi.
    3. Ajabbi Foundation will fund other open-source products, Pipi user groups, industry conferences, books, and science, among other areas.

Pipi

Started in 1997 as version 1. By version 4 (2005-2007), it was a SaaS platform supporting ecological restoration projects in NZ (govt valuation: NZ$ 3 M).

Now, version 9 is a multi-agent world-model AI (not an LLM) that shares similarities with a Godel Machine, able to learn, evolve, self-organise, and reproduce. Constrained by published ontologies and laws of physics.

It is a big system and highly configurable. Implementing it for your enterprise will require a developer team.

Currently, the community is beta-testing an open-source multi-workspace UI and generating 20,000 pages of developer documentation. Expect to be in production late 2026.

Finding out more.

    • https://www.blog.ajabbi.com/ (daily engineering blog)
    • Contact me for a chat, etc

The main 26 websites (developer, learn, handbook, design, pipiWiki, etc.) at ajabbi.com are currently hidden from search results and change frequently during testing.

Dive deeper with some reading

Any questions, feel free to ask

Carlos Gershenson: On the Limits of the Scientific Study of Complex Systems

Mike's Notes

A talk recorded on Vimeo by Carlos Gershenson.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Complexity Digest
  • Home > Handbook > 

Last Updated

23/02/2026

Carlos Gershenson: On the Limits of the Scientific Study of Complex Systems

By: Carlos Gershenson
Vimeo: 28/01/2026

Carlos Gershenson (Systems Science and Industrial Engineering, Binghamton University).

Binghamton Centre of Complex Systems (CoCo) Seminar: January 28, 2026

“On the Limits of the Scientific Study of Complex Systems”

Vimeo 58:17

gRPC Clearly Explained

Mike's Notes

Good big picture explanation of gRPC. Pipi will have a dedicated gRPC Engine (grpc).

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Level Up Coding System Design
  • Home > Handbook > 

Last Updated

22/02/2026

gRPC Clearly Explained

By: Nikki Siapno
Level Up Coding System Design: 18/01/2026

Founder LUC | Eng Manager | ex-Canva | 400k+ audience | Helping you become a great engineer and leader.

What gRPC actually is, and the real reasons teams adopt it beyond “it’s faster”.

gRPC: What It Is and Why Teams Use It

Most teams reach for REST by habit.

It works, it’s everywhere, and every tool knows how to talk HTTP+JSON.

But once you have dozens of microservices calling each other thousands of times per request, REST can quietly become the bottleneck.

That’s where gRPC shows up with a very different model: “call a function on another service as if it were local,” and make it fast.

What gRPC (and RPC) really is

Before gRPC, there’s RPC.

Remote Procedure Call (RPC) is a model where you call a function that runs on another machine, but it feels local to the caller.

You call a method, pass arguments, and get a result back. The network exists, but it’s intentionally hidden so developers can think in terms of functions instead of sockets and packets.

gRPC is a modern, open-source RPC framework released by Google in 2015. It takes the RPC idea and standardizes how services define methods, exchange data, and communicate efficiently over the network.

Two building blocks drive almost everything you feel in practice:

  • HTTP/2 → The transport protocol. It keeps a long-lived connection and supports multiplexed streams.
  • Protocol Buffers (Protobuf) → The data format. It’s a compact binary serialization with a schema (a defined shape).

Together, these turn “call a service” into a strongly typed operation instead of a loosely structured document exchange.
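The size difference between a schema'd binary encoding and a field-name-carrying text format is easy to see. The sketch below uses Python's struct module as a stand-in; it is not the actual Protobuf wire format, and the field layout is an assumption agreed on by "both sides" in advance, which is exactly the role the schema plays:

```python
# Rough illustration of why schema'd binary encodings are compact: the
# field names and types live in the shared schema, so only values travel.
# struct is a stand-in here; this is NOT the real Protobuf wire format.

import json
import struct

# Shared "schema": user_id is a uint32, score a float64, active a bool.
record = {"user_id": 42, "score": 98.6, "active": True}

json_bytes = json.dumps(record).encode()   # field names travel with the data
binary_bytes = struct.pack("<Id?", record["user_id"],
                           record["score"], record["active"])

print(len(json_bytes), "bytes as JSON")
print(len(binary_bytes), "bytes as schema'd binary")
```

The binary form carries only 13 bytes of values; the JSON form carries the field names and punctuation on every message, which adds up quickly at high call volumes.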

How gRPC actually works

gRPC starts with a service definition. You describe your API in a .proto file by defining services (methods) and messages (request and response shapes).

This file is the contract. Both client and server are built from it.
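For illustration, a minimal .proto contract might look like this (the service, method, and message names are hypothetical):

```protobuf
// Hypothetical contract: all names here are illustrative.
syntax = "proto3";

service UserService {
  // Unary call: one request, one response.
  rpc GetUser (GetUserRequest) returns (GetUserResponse);

  // Server-streaming call: one request, many responses.
  rpc WatchUser (GetUserRequest) returns (stream GetUserResponse);
}

message GetUserRequest {
  int64 id = 1;
}

message GetUserResponse {
  int64 id = 1;
  string name = 2;
}
```

Running this file through the gRPC code generator produces the client stub and server interface in each language your teams use.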

From that contract, gRPC generates code:

  • Client stubs → Methods you call like local functions.
  • Server interfaces → Methods you implement with your business logic.

When a client calls a gRPC method, the flow is straightforward:

  1. Serialize → The request is encoded into Protobuf’s binary format.
  2. Send → The message travels over an existing HTTP/2 stream.
  3. Dispatch → The server routes it to the correct method.
  4. Execute → Your code runs.
  5. Respond → The result is serialized and streamed back.
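The five steps above can be sketched as a toy in-process round trip. JSON stands in for Protobuf and a function call stands in for the HTTP/2 stream; every name below is illustrative:

```python
# Toy end-to-end sketch of the call flow: serialize, send, dispatch,
# execute, respond. Real gRPC uses Protobuf over HTTP/2 with generated
# stubs; JSON and an in-process "wire" stand in for both here.

import json

# --- Server side: a method table built from the "contract". ---
def get_user(req: dict) -> dict:
    return {"id": req["id"], "name": "ada"}

HANDLERS = {"UserService/GetUser": get_user}

def server_receive(wire_bytes: bytes) -> bytes:
    envelope = json.loads(wire_bytes)          # 3. dispatch to the method
    handler = HANDLERS[envelope["method"]]
    result = handler(envelope["body"])         # 4. execute business logic
    return json.dumps(result).encode()         # 5. respond (re-serialize)

# --- Client side: a stub that feels like a local function call. ---
def call(method: str, body: dict) -> dict:
    wire = json.dumps({"method": method, "body": body}).encode()  # 1. serialize
    reply = server_receive(wire)               # 2. "send" over the wire
    return json.loads(reply)

print(call("UserService/GetUser", {"id": 7}))
```

The caller never touches sockets or encoding; that hiding of the network behind a function call is the RPC model in miniature.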

Because gRPC uses HTTP/2, many calls share one connection and run in parallel. A slow response doesn’t block faster ones behind it, which keeps tail latency under control.

Streaming uses the same mechanism:

  • Server-streaming → One request, many responses over time.
  • Client-streaming → Many requests, one final response.
  • Bidirectional streaming → Both sides send messages independently.

Backpressure is built in. If one side slows down, gRPC slows the stream instead of piling up memory or threads.
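Pull-based backpressure of this kind can be modelled with a plain generator: nothing is produced until the consumer asks for it, so a slow reader automatically paces the producer. This is an analogy for the behaviour, not gRPC's actual flow-control machinery:

```python
# Sketch of pull-based backpressure: the consumer controls the pace, so
# a slow reader slows the producer instead of letting messages pile up
# in memory. Python generators give this behaviour for free.

def server_stream():
    """Produces messages lazily; nothing is generated until pulled."""
    n = 0
    while True:
        n += 1
        yield f"update-{n}"

stream = server_stream()
received = [next(stream) for _ in range(3)]  # consumer pulls 3 messages
print(received)
```

However many updates the stream could produce, only the three that were pulled ever exist; gRPC's HTTP/2 flow control achieves the same effect across the network.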

The core idea is simple: define a strict contract, generate code from it, and move typed messages efficiently over a shared connection.

Where gRPC pays off

gRPC is a strong fit when you control both ends and you care about efficiency.

  • High call volume microservices → Lower per-call overhead adds up fast at scale.
  • Latency-sensitive call graphs → Multiplexing and smaller payloads reduce tail latency pressure.
  • Polyglot stacks → One .proto contract generates stubs across languages, reducing “JSON drift.”
  • Service mesh environments → gRPC routes cleanly through modern proxies and is common in mesh control-plane protocols.

Tradeoffs you feel immediately

gRPC’s downsides are predictable, and they usually show up early on.

  • Browser calls are not native → You often need gRPC-Web or a REST/JSON gateway for front-end use.
  • Debugging is less “curl-friendly” → Binary payloads require tooling like grpcurl or GUI clients with schema access.
  • Contracts tighten coupling → Clients must update generated code as schemas evolve, so versioning discipline matters.
  • Infra must support HTTP/2 well → Some proxies and firewalls need explicit support or configuration.

How to Decide: Is gRPC a Fit for Your System?

Use gRPC when most of the following are true:

  • You control both client and server → Internal microservices inside the same organization.
  • You’re performance-sensitive → Many small calls per request, or very high QPS between services.
  • You’re polyglot → Multiple languages across teams and services.
  • You need streaming → Real-time updates, telemetry, chat, or continuous feeds.
  • You want strict contracts → You care about compile-time guarantees and explicit schemas.

Stick to REST (or layer a REST gateway in front of gRPC) when:

  • You expose public APIs to unknown clients.
  • You want easy browser and curl-based experimentation.
  • Your main pain is clarity and discoverability, not raw latency or throughput.

In practice, most teams don’t choose one or the other; they split the responsibility.

  • gRPC inside your network → Service-to-service, behind an API gateway or service mesh.
  • REST/JSON at the edge → For browsers, partners, and mobile apps that prefer HTTP+JSON.

Recap

gRPC is not “REST but faster.”

It’s a different model: remote procedure calls over HTTP/2, using Protobuf contracts, with first-class support for streaming and strong typing.

That makes it excellent for internal microservices, high-performance backends, and real-time systems, as long as you’re willing to invest in schemas, tooling, and a slightly steeper learning curve.

If you’re hitting the limits of REST inside your system (too many chatty JSON calls, tricky real-time updates, or a messy polyglot codebase) gRPC is worth a serious look.