On a Sandy Beach: refactoring

Recent work on system messaging

Mike's Notes

The work this week.

I found a problem with the existing Pipi 6 era Messaging Engine (msg). It needed to be replaced to support the richer Pipi 9 internal environment of autonomous agents that can self-organise and move.

I spent days staring into space, and then the solution became very clear while daydreaming 😴 between morning coffee and ice cream summer afternoons 😎. Slow but getting there.

Resources

References

Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions (Addison-Wesley Signature Series), by Gregor Hohpe and Bobby Woolf.
Integration Patterns (Patterns & Practices). Microsoft.

Last Updated

25/01/2026

Recent work on system messaging

By: Mike Peters

On a Sandy Beach: 22/01/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Pipi 4 (2005-2008)

A simple internal messaging system was invented and built for Pipi 4 around 2005. It had a messaging store and handled interactions between different internal systems. It was not called a messaging system; it was an unnamed part of the Metadata Repository. It was built in CFML code and ran on MS SQL Server.

Pipi 6 (2017-2019)

After reading about Apache Kafka and RabbitMQ, I realised that Pipi 4 had contained a primitive internal messaging system, 5 years before Kafka. A new, quick-and-dirty broker-based messaging system, similar to RabbitMQ, was built as a CFML module to send messages across hundreds of modules when Pipi 4 was rebuilt from memory as Pipi 6. The work of Gregor Hohpe and Bobby Woolf was an inspiration. Martin Fowler was an enormous influence on everything.

Pipi 9 (2023-)

It has taken all week to figure out a solution. I ended up splitting the role of the engine:

The existing Pipi 6 messaging module is now being fully refactored into the Pipi 9 Messaging Engine (msg). Previously, it used a wrapper to make it appear to be an agent. Some minor changes are being made to the Namespace Engine (nsp).
The new Messaging Engine (msg) is used to create the local messaging systems via the Factory Engine (fac). The Factory Engine (fac) will place a local message store using a separate embedded database inside each agent to connect to its Messaging Endpoint. Each system (made up of many agents) has its own messaging store attached to the system router.

As agents dynamically self-assemble in response to events, the messaging system always works.

The new Engine closely follows the patterns in Enterprise Integration Patterns and even uses the same icons, aka "Gregorgrams".

Summary of changes underway

Internal system messaging between hundreds of autonomous agents.
Possible because of the Namespace Engine (nsp) built into Pipi 8.
External messaging between Pipi and applications, such as databases, cloud platforms and containers, will be handled using an open-source messaging system, such as Kafka or RabbitMQ.
The two messaging systems need to interconnect. I don't know how yet! 😇
Event messaging and Pub/Sub are the most common; others include CQRS, Dead Letter, and Point-to-Point.
Each agent has a state.
Essential for robust Pipi self-management.
The pipiWiki will automatically document the configuration of each engine, including messaging (Gregorgrams) diagrams. See link to the Primitive Engine (prm) mockup above.
The messaging configuration can change dynamically.
The Messaging Engine will be accessible for configuration via the coming msg module in the Agent Workspace.
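
As a rough illustration of two of the patterns named above, Publish-Subscribe and Dead Letter, here is a minimal Python sketch. It is not Pipi code: the `Router` class, the topic names, and the handlers are all hypothetical, and a plain in-memory list stands in for a message store.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class Router:
    """Hypothetical in-memory router; not the Pipi Messaging Engine."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)
        # Dead Letter channel: (topic, message, reason) for failed deliveries.
        self.dead_letters: List[Tuple[str, dict, str]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber; return the number of successful deliveries."""
        handlers = self.subscribers.get(topic, [])
        if not handlers:
            self.dead_letters.append((topic, message, "no subscribers"))
            return 0
        delivered = 0
        for handler in handlers:
            try:
                handler(message)
                delivered += 1
            except Exception as exc:
                # A failing subscriber must not block the others.
                self.dead_letters.append((topic, message, repr(exc)))
        return delivered


# Usage with made-up topic names.
received: List[dict] = []
router = Router()
router.subscribe("agent.moved", received.append)
router.publish("agent.moved", {"agent": "a1"})
router.publish("agent.lost", {"agent": "a2"})  # nobody listens, so it is dead-lettered
print(received, router.dead_letters)
```

Keeping undeliverable messages rather than dropping them is what makes the failure visible later, which matters when the set of subscribers changes at runtime.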

Updated Pipi 9 to 10 plan

Mike's Notes

Notes on a successful meeting on Tuesday morning with Luis and Cristobal from Ortus Solutions. This is the living plan.

Last Updated

20/06/2025

Updated Pipi 9 to 10 plan

By: Mike Peters

On a Sandy Beach: 20/06/2025

I had an excellent remote meeting on Tuesday at 7am with Luis Majano and Cristobal Escobar from Ortus Solutions. It was to discuss:

The migration of Pipi to run on the BoxLang platform
Support plans
Sponsorship

Details

Pipi 9

Going into production now
Not designed to run on BoxLang, but 90% can be run in compatibility mode.
Written in CFML code
Expected to take 12 months
For enterprise critical infrastructure in any human language or writing system.
Ajabbi

runs on Pipi (dogfooding)
Has a first customer
Bootstrapping
Growing community
Hybrid closed-source/open-source
Not-for-profit foundation to be created

Pipi 10

Will be built by Pipi 9
Expected to take 12+ months
Designed to run on BoxLang
Can generate the bx code
Written in CFML code
Use BoxLang to enable customers to use Python, Go, PHP, CFML, Java, and Ruby.
Ajabbi

Paid dedicated support by Ortus
Sponsorship of the Ortus open-source
Open-source Pipi community on GitHub

Migration

BoxLang is very new
Ortus has a crack team that has built BoxLang, CommandBox, WireBox, etc. over the decades.
Pipi 9 and Pipi 10 will run alongside each other till Pipi 10 can take over. Production will remain continuous.
Mike has spent 20,000 hours as the architect on Pipi since 1997 (Pipi 1). Necessary to plan migration to be efficient.
Mike is to do a crash course in Ortus "box" products.
Pipi 10 takes autonomous control of BoxLang via? (Several techniques are possible.)
Further meetings with Ortus

Real-world Engineering Challenges #8: Breaking up a Monolith

Mike's Notes

This detailed article by Gergely Orosz is partially copied from his newsletter, The Pragmatic Engineer.

Last Updated

05/05/2025

Real-world Engineering Challenges #8: Breaking up a Monolith

By: Gergely Orosz

The Pragmatic Engineer: 08/03/2023

Writing The Pragmatic Engineer. Previously at Uber, Skype, Microsoft. Author of The Software Engineer's Guidebook.

‘Real world engineering challenges’ is a series in which I interpret interesting software engineering and engineering management case studies by tech companies. You might learn something new in these articles, as we dive into the concepts they contain.

In most real-world engineering challenges articles, we look at several interesting case studies from tech companies, but today’s issue is different. We’re covering a single case study, but in more detail than previously in this series.

Specifically, we’re diving into a massive migration project by Khan Academy, involving moving one million lines of Python code and splitting them across more than 40 services, mostly in Go, as part of a migration that took 3.5 years and involved around 100 software engineers.

Khan Academy has an interesting engineering blog where I came across several posts with details about how this migration progressed. I got interested in learning more about this journey and so reached out to engineering manager Brian Genisio and former principal software architect Kevin Dangoor. Both Brian and Kevin played key roles in this migration and they have generously shared their inside takes on how the migration played out.

In today’s issue, we reveal new details about this migration, covering:

Context and motivation. Why decide to migrate a very large monolith that is powering a site with heavy traffic? The trigger was the end of Python 2, but there were other reasons, too.
Kickoff and phasing. Includes the concept of MVE (minimum viable experience).
The migration phase. The migration approach to each of the fields and APIs.
Things that worked well. Incremental shipping, the side-by-side migration approach, and treating the project as fixed scope, fixed timeline.
Choices worth reconsidering. Was Go the best choice? It’s one of the questions the team still thinks about.
Learnings from working on a 3.5 year-long migration. The extra motivation which hard deadlines provide.

As with all case studies, I write about a given company because I find the topic interesting. I have no financial motivation for this: I don’t get paid, directly or indirectly. In the very rare case when there’s a potential conflict of interest – like being an investor – I always disclose my interest. There is no such conflict here; I strive to be independent in my viewpoints. For more details, see my ethics statement.

1. Context and motivation

Khan Academy is a US-based non-profit education provider, which teaches students about math, art, computing and many other topics for free, with videos and interactive learning guides. Courses are aimed at middle school to high school students, and also college students. The site is one of the most popular free education sites, with millions of users.

Khan Academy stands out from many other non-profits not just for its strong funding (for example, grants from the Bill and Melinda Gates Foundation and other philanthropic organizations), but also for its strong engineering team. John Resig, creator of jQuery, is its chief software architect.

Back in 2017, an idea to come up with a better architecture started being raised more often within the engineering team. Back then, Khan Academy’s backend consisted of a Python monolith exposing REST endpoints. It was during that year that the team started experimenting with GraphQL for their API. By the end of 2017, the team realized that GraphQL had many benefits and decided to deprecate the REST interface and migrate existing endpoints to GraphQL.

It was also around then that the team started to feel the pain of maintaining a Python monolith whose performance was not great, which meant more hardware resources were needed to run it.

Python 2’s end-of-life announcement was the final spur to start the project. In 2019, the Python Software Foundation announced Python 2’s end-of-life date as January 2020. It was this which kicked off the rewrite project because it meant several goals needed to be achieved:

Get off Python 2 as soon as possible. With Python 2 reaching end of life, delay was not an option.
Replace the REST API with GraphQL.
Break the Python monolith into services.

2. Kickoff and phasing

The rewrite project got underway in June 2019. The team started by choosing which language to move over to: Python 3? Kotlin? Go? They considered all options and settled on Go, largely for its first-class support in the Google App Engine, and the simplicity and consistency of its language and performance. As Kevin Dangoor summarized:

“Moving to Kotlin was an appealing alternative. While we were at it, we decided to dig deeper into other options. Looking at the languages with first-class support in the Google App Engine, another serious contender appeared: Go. Kotlin is a very expressive language with an impressive set of features. Go, on the other hand, offers simplicity and consistency. The Go team is focused on making a language which helps teams reliably ship software over the long term.

As individuals writing code, we can iterate faster due to Go’s lightning quick compile times. Also, members of our team have years of experience and muscle memory built around many different editors. Go is better supported than Kotlin by a broad range of editors.

Finally, we ran a bunch of tests around performance and found that Go and Kotlin (on the Java Virtual Machine) perform similarly, with Kotlin being perhaps a few percent ahead. Go, however, uses a lot less memory, which means that it can scale down to smaller instances.

We still like Python, but the dramatic performance difference which Go brings us is too big to ignore, and we think we’ll be able to better support a system running on Go over the years. Moving to Go will undeniably be more effort than moving to Python 3, but the performance win alone makes it worth it.”

The scope of the project was very clear from the start: migrate 100% of Python 2 code to Go, have only GraphQL endpoints, and use GraphQL federation. The biggest task was agreeing on architectural strategies.

The team ended up converging on a federated GraphQL hub. The main difference from a REST-based “API gateway” is that with REST, a request is often directed to a single service. With a GraphQL gateway, a query plan is generated that includes data from multiple backend services.

The proposed architecture looked like this, at a high-level:

A high-level overview of the new architecture at Khan Academy, based on federated GraphQL. The actual number of services is a lot higher at around 40, but the architecture is very similar. Diagram source: Khan Academy’s engineering blog.

How do you estimate how long it takes to migrate 1 million lines of code? I asked Brian, who said they used pretty straightforward heuristics and refined them later.

“As far as estimates go, we started by estimating the project using heuristics. We estimated the lines of code the average developer would likely port per day. We then used that with the number of lines of Python we had, in order to figure out how much work each service would take.

This was a good first pass, but when we got started working on APIs in the services, we had to break it down further. Some things were a lot more complex than that heuristic could manage. So the estimates changed when we dissected the work into smaller parts.”
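
The first-pass heuristic Brian describes is simple arithmetic: lines of Python divided by the lines one developer ports per day. A minimal sketch follows. Every number in it (the porting rate, the service names and their line counts, the team size) is a made-up placeholder, not a Khan Academy figure, and `estimate_days` is a hypothetical helper.

```python
import math


def estimate_days(python_lines: int, lines_per_dev_day: int, developers: int) -> int:
    """Working days to port a service with a given team, rounded up."""
    dev_days = python_lines / lines_per_dev_day
    return math.ceil(dev_days / developers)


# Hypothetical inputs, for illustration only.
services = {"users": 120_000, "content": 300_000}
for name, lines in services.items():
    print(name, estimate_days(lines, lines_per_dev_day=200, developers=5))
```

As the quote notes, a flat rate like this ignores complexity, so it only serves as a starting point before the work is dissected into smaller parts.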

The project was split into two parts:

Phase 1: Minimum Viable Experience. It’s common to call the first phase an MVP (Minimum Viable Product), referring to a barebones but usable milestone. However, the team did not want to call this phase an MVP, because they already had a product. The MVE answered the question of which key features, if removed, would materially alter Khan Academy’s identity. The MVE scope was mostly defined by product managers looking at the experience from a product point of view.

These MVE features were things like content publishing, content delivery, progress tracking, user management, and more.

This phase alone took about 2 years, finishing in August 2021. At this point, about 95% of traffic was flowing through the new code and 32 services had been built.

Phase 2: endgame. The second phase was everything else. Although it might have seemed like the project was largely complete, with 95% of traffic going through the new services, in fact there was much to do. On top of rewriting the remainder of the Python monolith still serving traffic, all internal tools needed to be rebuilt from Python to Go, which was a sizable undertaking in itself.

This second phase took another 1.5 years to complete. Part of why it took so long was that Phase 2 was staffed with fewer engineers than Phase 1 had been. Also, in this phase, new features were built alongside the migration of existing ones. Another 5 new services were created and a few developer-only services were added, bringing the service count to just over 40.

A big part of the endgame phase was adding non-MVE behaviors to existing services.

Here are some statistics on how each phase went:

3. The migration phase

The team wanted to avoid a “big bang” migration in which everything is rewritten at once. Instead, they chose a “field by field” approach. As Brian shared:

“We knew a ‘big bang’ rewrite would be fraught with pain, so we chose the opposite: not at the service level, the feature level, or the model level, but at the field level. Literally, we'd move User.FirstName over to the Go-based User Service while User.LastName was still in the Python Monolith. We did this with GraphQL federation and some side-by-side testing magic which we added to our federation service.”
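
To make the field-level idea concrete, here is a minimal sketch of how a gateway could split one query across backends by field ownership. This is not Khan Academy's federation service or any real GraphQL library: `FIELD_OWNERS`, `plan_query`, `execute`, the backend names, and the two stand-in backends are hypothetical, and the returned values are dummy data.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Which backend currently serves each field (hypothetical ownership table).
FIELD_OWNERS: Dict[str, str] = {
    "User.FirstName": "go-user-service",
    "User.LastName": "python-monolith",
}


def plan_query(fields: List[str], owners: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the requested fields by the backend that owns them."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for field in fields:
        plan[owners[field]].append(field)
    return dict(plan)


def execute(
    fields: List[str],
    owners: Dict[str, str],
    backends: Dict[str, Callable[[List[str]], Dict[str, str]]],
) -> Dict[str, str]:
    """Call each backend once and merge the partial results."""
    result: Dict[str, str] = {}
    for backend_name, backend_fields in plan_query(fields, owners).items():
        result.update(backends[backend_name](backend_fields))
    return result


# Stand-in backends that return dummy values tagged with their origin.
backends = {
    "go-user-service": lambda fs: {f: "from Go" for f in fs},
    "python-monolith": lambda fs: {f: "from Python" for f in fs},
}

query = ["User.FirstName", "User.LastName"]
print(execute(query, FIELD_OWNERS, backends))

# In this sketch, migrating one more field is a one-line change to the table.
migrated = {**FIELD_OWNERS, "User.LastName": "go-user-service"}
print(plan_query(query, migrated))
```

The caller sends the same query before and after a field moves; only the ownership table changes, which is what lets the unit of migration be this small.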

Before doing any migrating, the team needed to build basic infrastructure for the new service. This meant putting the GraphQL federation service in place.

The first migrated service was the simplest: a service hosting a single field. It answered one question: what is the oldest version of the mobile client that can be supported? When this first service migration was complete, the GraphQL federation service was in place, and one Go service was serving real – if very low – levels of traffic, while the majority of traffic flowed through the existing monolith.

Khan Academy’s system after migrating the first and smallest service

The migration strategy applied a similar side-by-side testing approach to all services.

Step 1: optional shadowing of traffic on the new Go service. The Python result is returned. The Go service only gets the request if developers set a parameter to also direct traffic to their new service. If no parameter was set, then no traffic was sent to the new service. Playing with this parameter allowed engineers to test the new service as they built it, without any risks to the production system.
Step 2: side-by-side testing of GraphQL services. Both the Python and Go services are called and any differences in responses logged, and the Python service’s result is returned.
Step 2.5: canary. A configurable percentage of traffic starts to return responses from the Go service, while still in a side-by-side testing phase. Side-by-side testing works for GraphQL queries (fetching data), but it is not a good option for GraphQL mutations (modifying server-side data). The canary was used for those mutation use cases.
Step 3: migration. Only the Go service gets the traffic, but the Python service is still present, in case it needs to return as the primary.
Step 4: remove the Python code and service. Migration complete!
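
The steps above can be sketched as a single routing function. This is a simplified illustration, not Khan Academy's federation code: the `Stage` names, the `route` function, the `shadow` flag, and the `canary_fraction` parameter are hypothetical, and both backends are modelled as plain callables. In this sketch the canary calls only one backend per request, which is an assumption suited to mutations.

```python
import random
from enum import Enum
from typing import Any, Callable, List, Tuple


class Stage(Enum):
    SHADOW = 1        # Step 1: Go is called only when a developer opts in
    SIDE_BY_SIDE = 2  # Step 2: call both, log differences, return Python
    CANARY = 3        # Step 2.5: return Go for a fraction of requests
    MIGRATED = 4      # Step 3: Go only; Python is kept around as a fallback


def route(
    request: Any,
    stage: Stage,
    python_service: Callable[[Any], Any],
    go_service: Callable[[Any], Any],
    diffs: List[Tuple[Any, Any, Any]],
    shadow: bool = False,
    canary_fraction: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> Any:
    if stage is Stage.MIGRATED:
        return go_service(request)
    if stage is Stage.CANARY:
        # Exactly one backend handles the request.
        if rng() < canary_fraction:
            return go_service(request)
        return python_service(request)

    python_result = python_service(request)
    if stage is Stage.SHADOW:
        if shadow:
            go_service(request)  # result discarded; Python stays authoritative
        return python_result

    # SIDE_BY_SIDE: compare both answers, but still return the Python one.
    go_result = go_service(request)
    if go_result != python_result:
        diffs.append((request, python_result, go_result))
    return python_result
```

Step 4 has no branch here because it is the point at which `python_service` and the routing logic for that field can be deleted altogether.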

The automated testing approach was interesting. With any big rewrite and migration, having a robust automated test suite can be an additional safety net. At the same time, porting over the unit or integration test suite can also be lots of work. So how did Khan Academy proceed? Brian explains:

“Our unit test coverage in Python varied: some areas were well-covered, while others were not. When it made sense, we ported over our Python tests to Go, and used them to build out the Go behavior. That worked well for simple, stateless functions. But in reality, most of the code was structured differently in Go, so the python tests didn't help too much.

We wrote a LOT of tests in Go, however. Our test coverage is much better today because of it. We also took a more “behavioral” approach to testing. We wanted the APIs to be ported identically. In doing so, we knew that for any input in the Python and Go code, we'd want the Go code to be exactly the same as the Python implementation. This is where "side-by-side" testing came in as part of the migration.”

The fact the migration took a long time was tough on the team. I’ve experienced first-hand how draining year-long migrations can be, so I asked Brian how he and the team handled their migration, in which just the first phase took more than two years. Brian was candid. He shared that it was rough:

“Most people who come to work at a non-profit like Khan Academy do so for the mission, not for the technology. People want to push forward our feature sets, expose more content to our learners/teachers. The technology tends to be second.

Engineers saw this migration as an existential imperative that they could contribute to, so this viewpoint helped. We also had some REALLY interesting technical problems to solve during the migration and these challenges resonated with many of them. However, each and every one of us were ready to finish the migration after it had been going on for 3.5 years.

Such a long migration was more challenging to stomach from the wider organization's perspective. There were big time frames during this project during which engineering did not build any new features because the migration took up all our time! On one end, this gave product managers and designers time to make plans for what features we’d build in a post-Goliath Khan Academy. Still, during the migration, both Product and Design faced a lot of pushback from engineering, and unfortunately we saw a lot of attrition in these functions, perhaps also as a result of less progress in building new things during this time.

By the end of the project, all of us: engineering, product, design and the rest of the business, just wanted the migration to be over so we could move into the post-Goliath world. Now finally we’ve arrived and the renewed sense of energy and excitement is very clear. We did it, and we’re ready to get back to building!”

4. Things that worked well

The engineering team warmed to Go during the project. Khan Academy started the rewrite with around 5-10 engineers. As the MVE phase progressed, more engineers joined in and at the end of the MVE phase, the whole engineering organization of about 100 people worked on it. For the endgame, the project started with around 50 engineers and gradually ramped down to 4 engineers in the final days of the project.

Before the rewrite, few engineers had used Go in production, so it was interesting to hear the team’s impressions of it. Kevin Dangoor collected learnings two years into the project, after half a million lines had been migrated. He shared these learnings in a blog post:

Engineers liked Go. Some liked the ease of reading and writing, others praised the documentation, while the tooling and compiler speed all scored points with the team. A software engineer who came from the .NET world initially found it strange that Go doesn’t do exception-like error handling. Later, this engineer said of the fact that Go’s errors are values: “Being able to call a function that doesn’t return an error and know for sure that it must succeed, is really nice.”
Performance was excellent. Compared to Python, Go’s runtime performance was much faster. Talking with the team, I’m told they compared the service hour cost of operating the same code on Python and Go. The difference was an order of magnitude in favor of Go, which was up to 10x cheaper for certain types of requests.
The lack of generics was the biggest complaint. Generics refers to the ability to write a function that could work with a variety of data types and have the compiler automatically create versions for more specific types at compile time. This is a language feature that’s widespread in many modern languages like TypeScript, Java, C#, Python and Swift, which all support generics at some level. After years of planning, Go added generics in March 2022, in Go version 1.18. Google App Engine only added support for this version of Go in December last year, so the team has not yet embraced this feature, but will do soon!

Keep shipping incrementally. Kevin thought that setting the rhythm of shipping to be incremental but continuous was the difference between success and failure. The team very quickly shipped their first migrated service – even though it was a tiny one! – and then kept up the pace of always shipping and always migrating as well.

Kevin wrote about how taking small steps was harder at first but became easier as the project progressed:

“At the beginning of the project, we stressed the importance of working on small slices as much as possible. In those days, this was tricky because the goal might be to switch over one GraphQL field, but that field might depend on a variety of other machinery running inside the monolith. As time progressed, more of the other required parts had been ported, making further porting for the same service smoother.”

One service to “own” a piece of data. Kevin shared an early decision they took which made the migration much clearer: only one service would “own” a piece of data.

The team put a firm rule in place that only one service could write a given piece of data. All other services had to call this “owner” service via the API in order to make changes to the data. Kevin suggested that without this rule, figuring out how and why data changes happened would have been extremely difficult.
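
A minimal sketch of the single-writer rule, assuming a hypothetical `OwnedStore` wrapper. The class, the service names, and the choice to raise `PermissionError` are illustrative; the source only states the rule, not how it was enforced.

```python
from typing import Any, Dict


class OwnedStore:
    """Key-value store where each key may be written by one owner service only."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._data: Dict[str, Any] = {}

    def register(self, key: str, owner: str) -> None:
        """Record the owner of a key; a second, different owner is rejected."""
        existing = self._owners.setdefault(key, owner)
        if existing != owner:
            raise ValueError(f"{key} is already owned by {existing}")

    def write(self, key: str, value: Any, caller: str) -> None:
        owner = self._owners.get(key)
        if caller != owner:
            # Non-owners must go through the owner's API instead.
            raise PermissionError(f"{caller} must ask {owner} to change {key}")
        self._data[key] = value

    def read(self, key: str) -> Any:
        # Reads are unrestricted in this sketch.
        return self._data.get(key)


store = OwnedStore()
store.register("user.progress", "progress-service")
store.write("user.progress", 10, caller="progress-service")
print(store.read("user.progress"))
```

With one writer per key, every change to a piece of data can be traced to a single service, which is the debugging benefit Kevin describes.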

The side-by-side migration approach worked extremely well. When I asked Brian for an approach he’d use again for similar migrations, he selected the side-by-side migration approach as one. This was because the project felt like rebuilding an aeroplane while in mid-air.

With the side-by-side approach in place, the team got to scrutinize the differences and problems that showed up across thousands of individual fields. They got a very real sense of progress by inspecting the level of traffic their services received. Day by day, this traffic grew from 0% up to 100%.

One benefit of a side-by-side migration is that you can track the percentage of traffic served from the new system. Above is a rough illustration of the migration’s progress, as pieced together from talking with Brian and Kevin.

Treating this migration as a fixed scope, fixed timeline project was the right choice, in hindsight. When talking about the project management approach, Brian revealed something that initially surprised me: they did not follow an “agile” approach on this project as usual:

“Although I have been a proponent of agile development practices for most of my career, we mostly treated this project migration as a waterfall one. The only "agile" bits had to do with borderless engineering, and how we prioritized the work. “Borderless engineering” is what we call the practice of individuals floating from team to team for short periods of time – a few days to a few months – to help out on work when it’s the sensible thing to do.

The migration was a fixed-scope, fixed-timeline project. We had a massive burndown chart that always gave us a good understanding of how we were doing. When a team fell behind, we moved engineers around. In the end, we finished 4 days before our fixed deadline (January 31, 2023.)

Looking back, treating this type of work as a fixed scope, fixed timeline project was exactly what we needed.”

My initial surprise derived from my experience of treating most projects as “agile”: build something quickly, get feedback, then reassess your plan and build whatever makes the most sense. This agile approach works well when you’re innovating or discovering a problem space. However, with this migration the problem space was a given and the scope of the work was well understood. So, it’s no wonder that sticking to the original plan worked well.

5. Choices worth reconsidering

I asked both Brian and Kevin which choices they might reconsider with the benefit of hindsight.

Switching to a brand-new language for the rewrite. Brian ran me through his reflections:

“Was switching to Go worth it in the end? Go is demonstrably faster than Python, therefore it is cheaper to run, which affects our bottom line.

However, don’t forget that nobody on the team knew Go in-depth when we started this project. So we had approximately 100 developers who all needed to ramp up on the technology. And, sure enough, we made mistakes along the way and those mistakes slowed us down.

Here’s something I don’t yet know: how long it will take for us to “reclaim” the cost of the ramp-up to Go: the time the team spent mastering this technology versus cheaper cloud costs. In hindsight, I would likely do an analysis of this tradeoff. Looking back, there is a possibility that Python 3 would have been more prudent for getting the project done, faster.”

The team played loose with the “port things exactly as they are” approach when it came to internal tools. Brian says:

“We also had a general rule for this project: we move the behavior over exactly. If there's a bug in the python code, we port the bug (most times.)

However, for our internal tools, we were looser with this rule. Instead of porting these tools one-on-one from Python to Go, we wrote new tools. This meant we couldn't easily use the side-by-side system. Building these new tools likely slowed us down compared to porting them. I'd revisit that as well.”

I’m glad Brian shared these learnings, as neither is something that we engineers talk about much, even though we should. The reality of moving to a new language is that there will be a lot of time wasted on learning the new technology and on mistakes.

However, looking at the other side of the coin; this is time invested in engineers learning a new and interesting technology, and it’s motivating to work at places which support such investment. Also, companies that budget for engineers to invest in learning and using modern tools tend to have an easier time recruiting curious devs.

The same goes for internal tools. Sure, the prudent approach would have been to port the tools one-on-one. It would have been faster because little thinking is required.

But then again, what is the point of porting internal tools if they won’t be improved upon? This is just speculation, but it’s probable the migration was “dry” enough for engineers to channel their creativity and desire to play around with Go into internal tools, like many engineers do.

Both the decision on whether to change languages, and whether to stick strictly to the scope for developer tooling, are common challenges on many projects. The project might have been completed faster by being stricter about technology choices and sticking to the original scope (even with internal tools). However, such strictness could have harmed the company’s engineering culture.

Would a stricter approach have resulted in a less fun and interesting place to work, and perhaps higher engineering attrition? This is a question I can’t answer, but as an engineering leader it’s worth considering, alongside the budget and timeline of a project.

6. Learnings from working on a 3.5 year-long project

Kevin Dangoor was part of the leadership team for the vast majority of the rewrite. Responding to my question about his learnings from it, he shared a few things.

Defining the “minimum viable experience” paid off handsomely. Even though there was no “minimum viable product” to discuss, defining what a “minimum” port looked like helped the team to focus and prioritize the right things.

Doing “as direct a port as possible” meant estimates slipped very little. At the kickoff of the project in summer 2019, the team estimated the MVE phase would take about 2 years, based on the number of fields to port and lines of code to move. In August 2021, the team completed the MVE, almost exactly as per their original estimate two years previously.

If you build software, you’ll know it’s hard enough to estimate for even a month ahead, so accurately doing this on a timescale of years was an impressive feat. Kevin said he thinks the estimate was accurate because they tried to do as direct a port as possible for the new system, not expanding its scope.

A project that “only” goes from monolith to microservices would likely have less complexity. Kevin said he thinks their migration added complexity in several ways:

Changing languages from Python to Go. For instance, this meant Python libraries could not be used in Go, so the team needed to find the best replacements they could.
Moving to new versions of Google Cloud APIs. The Google Cloud APIs evolved shortly before the migration started, and so moving to the newer Google Cloud APIs was part of it.
Moving to services meant splitting control of the data. In a monolith, all data was in one place. With lots of services, workflows became more complex and data flows changed.

A sense of dedication to finish the work and feeling pride when it was done. Kevin has been in the tech industry for 20+ years, at places like Mozilla, Adobe, and currently GitHub. Nonetheless, when I asked him how he feels about this project, he shared:

“The migration was a huge undertaking. I’m proud of having had a part to play in it. It was unlike anything I worked in my career, so far. Of course, I worked with great engineers and seeing their dedication to this project was very important.”

Brian took over leading Part 2 of the project, the Endgame phase. I asked him about his learnings. Here’s what he shared.

Hard deadlines can be motivational. Brian was working on the migration project at first as a software engineer. For the final phase, he took on the role of managing and coordinating the whole project. He shared an interesting observation about hard deadlines:

“It’s interesting to see how a hard deadline can be both motivating, and help people to organize around it. Our directive was clear: we will not slip in completing the migration by our date. If we need extra people, we'll get extra people.

Coordinating all the teams to get to that deadline was complicated; if they all finished their work by the deadline, the org wouldn't make the deadline because so many teams would be blocked by other teams. Having a hard deadline forced us to align in creative ways to ensure a ‘critical path.’ To make sure we finished in time, we had mini-deadlines for the last 6 months of the project. If X didn't happen by Y date, the project was at risk.”

Just because you have services, you cannot ignore the broader ecosystem. Another learning from Brian is that moving to services does not free you to ignore the “monolith” or how the services work holistically. If you take a few steps back, the collection of services also appears as a “monolith,” although a loosely coupled one.

However, these services use shared resources. For example, if one service makes heavy use of Redis, other services that use Redis will be affected unless multiple instances of Redis are created to mitigate this, which is what Khan Academy ended up doing. Caching of data between co-dependent services can cause thundering herds when caches expire. "Smarter" cache expiry is necessary to maintain a system that can truly scale.
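The post does not say what "smarter" cache expiry looked like at Khan Academy. One common approach, shown here only as an illustrative sketch, is to add random jitter to each time-to-live so that entries written at the same moment do not all expire at the same moment. The function name `jittered_ttl` and its parameters are assumptions made for this example, not anything from the migration itself.

```python
import random
from typing import Optional


def jittered_ttl(
    base_ttl_seconds: float,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the base TTL shifted by a random amount within +/- jitter_fraction.

    Hypothetical helper: spreading expiry times means dependent services
    refill their caches at slightly different moments instead of all at once.
    """
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    spread = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + rng.uniform(-spread, spread)


# Example: a nominal 5-minute TTL lands somewhere between 270 and 330 seconds.
ttl = jittered_ttl(300, jitter_fraction=0.1)
```

The resulting value would then be passed as the expiry when writing to the cache. Jitter only spreads the load; it does not replace the separate Redis instances mentioned above.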

Takeaways

Many thanks to Brian and Kevin for candidly sharing their experience of a long and challenging migration. They’ve each been generous in giving their observations and I recommend following them on LinkedIn. In a comment below, Brian mentioned how Khan Academy is hiring for those who are interested in joining the company after this migration.

I’ve read plenty of engineering blog posts on teams celebrating the hitting of a milestone in a migration, or announcing a migration is done, which often are triumphant in tone. However, having been close to a multi-year migration at Uber, my experience is these projects certainly don’t feel like triumphs while they’re taking place!

Long-running migrations often feel thankless, never-ending and frustrating. Both Brian and Kevin were good sports in sharing the positives and I appreciate Brian being candid about how rough the project felt at times; not just for the engineering team, but also for product and design.

Even though Khan Academy has what feels like a strong engineering culture, Brian mentioned that people at the company “want to push forward our feature sets, expose more content to our learners/teachers; the technology tends to be second.”

In any company where most engineers are builders, migrations will feel like a drag. However, the more a company builds and the bigger it grows, the more often it needs to migrate to new systems. Although counter-intuitive, I suggest that to be a great product engineer, it’s worth familiarizing yourself with how to do migrations, so you can do them more efficiently and reliably.

Additionally, there are few better ways to learn this than by working on and helping out with challenging migrations.

I hope this deep dive has painted a realistic picture of what a complex, multi-year migration looks like, and offers approaches that are useful, should you embark on a similarly challenging project.

A Plea for more Mikado

Mike's Notes

Here is an article that I discovered in the latest Amazing CTO newsletter.

Resources

References

Reference

Repository

Home > Ajabbi Research > Library > Subscriptions > Amazing CTO
Home > Handbook >

Last Updated

10/03/2025

A Plea for more Mikado

By: Damien Mathieu

dmathieu.com: Monday, August 21, 2023

One of the books that has most impacted my career is probably The Mikado Method. I read it almost 10 years ago, and I don’t practice it explicitly. But I think of the method almost every day, and it has been impacting how I work ever since.

And yet, it has remained something quite obscure. Whenever folks suggest must-read computer science books, it’s never there. So let’s try to explain it a bit more, and how it can be used every day in the life of a programmer.

What is the Mikado Method?

If you have ever worked on a large refactoring project, library switch or upgrade, you may have ended up working in a branch for weeks (or months). You need to regularly rebase against the main branch (or have everybody working on the same branch), and may end up spending more time fixing conflicts than actually working on the change.

But somehow you move forward, and one day you are ready to ship that huge change. The biggest bet still lies ahead of you though: will there be performance changes? Did we miss something? Were there unknown bugs? In my experience, every big bang change ends up in at least one cycle of reverting and going back to the PR, and a non-trivial number of them were not shipped at all.

The Mikado Method is a framework to make that kind of refactoring manageable.

The idea is to split things into atomic changes. Each of these changes will be shipped right away, on its own.

Let’s say you’re working on a Ruby on Rails application which hasn’t been upgraded in several years. So you need to go from Rails 4 to Rails 7 (wow!).

Let’s do it with some mikado!

The first step will be to locally upgrade the rails dependency in your Gemfile to the final version you want to run on. On a piece of paper, draw a rectangle (or a circle, anything) and write down a couple of words about the task you’ve just done, such as upgrade rails in Gemfile.

Now, run your unit test suite. Obviously, there will be lots of failures.

Go through each failure, and for each of them write a new rectangle on the paper, with a (very) short description of what you would have to do to fix that issue. If the cause is unknown at that point, you can also write down the failure itself, to be investigated. Link each rectangle to the parent one, as can be seen in the example image below.

Then, revert your changes. Delete everything! And I really mean revert, not move to a new branch or squash. If you feel this change took you too long to just be deleted, it means it wasn’t atomic enough and you need to split it.

Now, pick one of the failures you wrote down, any of them, and try to fix it in the current codebase, without the original upgrade. Doing so may require some refactoring or more changes. In that case, don’t do them. Write them on your paper, delete everything and start implementing them. Similarly, if after fixing the problem, there are still failures, write them down, link them to the issue you were just trying to fix and delete everything.

And iterate from there against every failure, refactoring or change you need. If you discover a new issue, write it down and delete everything.

At some point, you will get a fix which actually works and for which all your tests pass. Ship that change!

And move on to the next failure.

Over time, you will get more and more actual fixes, and fewer and fewer reverts. Until all there is left to do is to make the change where you actually change the content of your Gemfile to upgrade the dependency version.

At that point, your application supports both versions, making that change very small and trivial to ship. Do it of course!
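The paper drawing described above is a small dependency graph, and it can be kept in code just as well. The following is a minimal sketch, not something from the book or from this post: the names `MikadoNode`, `add_prerequisite` and `next_tasks` are made up for illustration, and the failure descriptions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MikadoNode:
    """One rectangle on the paper: a task plus the tasks it depends on."""
    description: str
    done: bool = False
    prerequisites: List["MikadoNode"] = field(default_factory=list)

    def add_prerequisite(self, description: str) -> "MikadoNode":
        """Draw a new rectangle and link it to this one."""
        child = MikadoNode(description)
        self.prerequisites.append(child)
        return child


def next_tasks(node: MikadoNode) -> List[MikadoNode]:
    """Return the unfinished tasks whose prerequisites are all done.

    These are the leaves of the graph: the only changes worth attempting
    now, because nothing else blocks them.
    """
    if node.done:
        return []
    pending = [p for p in node.prerequisites if not p.done]
    if not pending:
        return [node]
    ready: List[MikadoNode] = []
    for prerequisite in pending:
        ready.extend(next_tasks(prerequisite))
    return ready


# The goal is the root; each failing test adds a rectangle below it.
goal = MikadoNode("upgrade rails in Gemfile")
failure_a = goal.add_prerequisite("fix failure A (placeholder)")
failure_b = goal.add_prerequisite("fix failure B (placeholder)")
refactor = failure_a.add_prerequisite("refactoring needed by A (placeholder)")

print([task.description for task in next_tasks(goal)])
# prints ['refactoring needed by A (placeholder)', 'fix failure B (placeholder)']
```

Marking a node as `done` after shipping it moves its parent into the list of next tasks, until only the root (the Gemfile change itself) remains.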

Obviously, the Mikado Method cannot work if you don’t have a good and highly reliable automated test suite.

Wow dude, this is too much

It absolutely is. And I haven’t heard of anyone following this process to the letter.

But processes aren’t meant to be followed to the letter. They are meant to provide a frame. Once that process is fully understood, getting out of it can be beneficial, to adapt it to your own needs, while retaining the core ideas and goals of that process.

In the case of the Mikado method, I think the biggest takeaways are to ship atomic changes, and not be afraid to drop things if they derail.

Atomic Everything

There’s nothing worse (well …) than seeing a Pull Request describing something, but where other unrelated (yet relevant) changes crept in.

Whenever I am working on something, and I notice something else in the same bit of the codebase which should be changed or refactored, I take a note of it, and come back to it once my original change is ready for review. I see this as a lighter way of doing Mikado. And yet, everything in a PR is related to the same thing, making its review much easier.

One way to cheat about this would be to name the PR “do this and that”. Well, don’t! If your PR includes an and, there should be two of them (the same goes for issues).

The gist of it is: split everything you do into the smallest bit possible, and ship all those bits independently.

A failure of an example

Here is an example of why thinking about everything atomically is safer. At $PREVIOUS_EMPLOYER, we wanted to migrate from OpenTracing to OpenTelemetry.

Both libraries are quite similar, but we had some heavy internal things that couldn’t work exactly the same between both of them, so we wanted to ensure there was no performance regression with the change. Hence we decided to do a big bang PR to be able to run performance tests.

I worked for over a month just making the appropriate changes, the PR was huge, and then I worked for another month just on the benchmarks. Until we were ready to ship the change.

Due to errors unseen before and uncaught by unit tests, we shipped and reverted 3 times before deciding to drop a quarter of work and restart from scratch with small PRs we could ship daily.

To be fair, this quarter wasn’t entirely lost, since it brought us benchmarks we wouldn’t have had this soon were it not for a big bang change. But the frustration was there anyway. And I am sure that if we had decided to keep on trying to ship that big bang PR, we would have ended up reverting more than 10 times.

Delete your WIP code

I am sometimes stuck in a fix that seems daunting. The more I fix things, the more there are to fix, and it seems like I’m never going to get over it. Well, this is exactly the kind of moment where deleting everything and starting from scratch again is highly beneficial.

Once again, I do mean delete. Not squash or branch off of. There is a real psychological value in deleting a change where you’re stuck to start fresh.

However, when you do that, you should start working on the new fix right away. Don’t wait for a couple of days. You don’t have the code available, but your mind is still there. That’s what’s going to allow you to get back there much more quickly, and do better, than you did the first time.

Put like this, the need to delete WIP code may seem pretty exceptional. I’ve personally grown to make it quite standard. Whenever I spend more than 15-20 minutes stuck on something, I’m usually going to delete it and start fresh.

This can only work because I am a bit extreme about making everything atomic. So I also very often have something that works and I can commit. When that happens, I only delete whatever’s not been committed yet. Not all the unpushed commits I made earlier. Every one of those atomic commits must have a green local test run, of course.
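The post does not name the commands used to delete uncommitted work, so the following is only one possible sketch of the "commit if green, otherwise delete" habit. It assumes a Git repository; `ship_or_discard`, `run_git` and `DISCARD_COMMANDS` are hypothetical names, and `tests_pass` stands in for whatever runs your local test suite. Be aware that `git reset --hard` and `git clean -fd` really do destroy uncommitted changes, which is the point here.

```python
import subprocess
from typing import Callable, List

# Assumed commands: throw away uncommitted work but keep earlier commits.
DISCARD_COMMANDS: List[List[str]] = [
    ["git", "reset", "--hard", "HEAD"],  # drop staged and unstaged edits to tracked files
    ["git", "clean", "-fd"],             # remove untracked files and directories
]


def run_git(command: List[str]) -> None:
    """Run one git command and raise if it fails."""
    subprocess.run(command, check=True)


def ship_or_discard(
    tests_pass: Callable[[], bool],
    message: str,
    run: Callable[[List[str]], None] = run_git,
) -> bool:
    """Commit the working tree if the tests are green, otherwise delete it.

    Returns True when a commit was made and False when the work was discarded.
    """
    if tests_pass():
        run(["git", "add", "-A"])
        run(["git", "commit", "-m", message])
        return True
    for command in DISCARD_COMMANDS:
        run(command)
    return False
```

Passing `run` as a parameter keeps the decision logic separate from Git itself, so the function can be tried out with a stand-in that just records the commands before pointing it at a real repository.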

Conclusion

Mikado is much like the agile method. It’s something everybody should apply to some degree, but not follow to the letter. But working with it in mind provides a very good base to ship code (whether it be a small bugfix, or a very large refactoring) in a safe and reliable way.

It’s probably not something everybody should do as described in the book (though if you do try it for a large enough project, I’d be happy to hear about it). But I am convinced that having some experience of it will make anyone a better developer!

Growing the development forest - with Martin Fowler

Mike's Notes

This interview with Martin Fowler was in a recent Refactoring Newsletter.

Resources

References

Reference

Repository

Home > Ajabbi Research > Library > Subscriptions > Refactoring
Home > Handbook >

Last Updated

17/05/2025

Growing the development forest - with Martin Fowler

By: Luca Rossi

Refactoring: 24/01/2024

Martin is chief scientist at ThoughtWorks. He is one of the original signatories of the Agile Manifesto and author of several legendary books, among which there is Refactoring, which shares the name with this podcast and this newsletter.

With Martin, we talked about the impact of AI on software development: from the development process, to how human learning and understanding change, up to the future of software engineering jobs.

Then we explored the technical debt metaphor, why it has been so successful, and Martin's own advice on dealing with it. And finally, we talked about the state of Agile, the resistance that still exists today towards many Agile practices and how to measure engineering effectiveness.

(03:29) Introduction

(05:20) Development cycle with AI

(08:36) Less control and reduced learning

(13:11) Splitting task between Human and AI

(14:48) The skills shift

(20:17) Betting on new technologies

(27:22) Martin's Refactoring and technical debt

(29:24) Accumulating "cruft"

(33:14) Dealing with "cruft"

(37:24) The financial value of refactoring

(42:04) Measuring performances

(46:19) Why the "forest" didn't spread

(56:11) Make the forest appealing

Show notes / useful links:

Cruft, Tech Debt, and High quality software is cheaper:
https://martinfowler.com/articles/is-quality-worth-cost.html
Measuring Developer Productivity with qualitative stuff too:
https://martinfowler.com/articles/measuring-developer-productivity-humans.html
Code review isn't just pre-commit:
https://martinfowler.com/bliki/RefinementCodeReview.html
Thoughtworks's Haiven tool:
https://github.com/tw-haiven/haiven
Building Boba AI:
https://martinfowler.com/articles/building-boba.html

Martin Fowler on Agile

Mike's Notes

Martin Fowler is a prolific writer, a colourful character and a superb conference speaker.

Agile Manifesto co-author
Well-thought-out comments to make about software development.

Resources

http://martinfowler.com/

References

Reference

Repository

Home > Ajabbi Research > Library > Authors > Martin Fowler
Home > Handbook >

Last Updated

11/05/2025

Article

By: Mike Peters

On a Sandy Beach: 03/01/2019

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

"I am an author, speaker… essentially a loud-mouthed pundit on the topic of software development. I work for ThoughtWorks, a software delivery company, where I have the exceedingly inappropriate title of “Chief Scientist”. I’ve written half-a-dozen books on software development, including Refactoring and Patterns of Enterprise Application Architecture. I write regularly about software development on martinfowler.com.

My main interest is to understand how to design software systems, so as to maximize the productivity of development teams. In doing this I’ve looked to understand the patterns of good software design, and also the processes that support software design. I’ve become a big fan of agile approaches and the resulting focus on evolutionary software design. I don’t come up with original ideas, but do a pretty good job of recognizing and packaging the ideas of others, or as Brian Foote describes me “an intellectual jackal with good taste in carrion”." - martinfowler.com

Presentations

https://martinfowler.com/videos.html
YouTube Search

Books

Amazon author page
