Developer access to Pipi is coming

Mike's Notes

The data is clear on this one. Unfortunately, the current developer interest in NZ and Australia is 3% and 0%, respectively. I also can't find anyone in NZ who has the slightest technical understanding of what I'm doing. But there are plenty overseas, especially in MLOps. We speak the same language, even if the architecture is radically different. Also, top-grade mathematicians seem to get it. Internally, Pipi 9 uses a lot of maths.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

02/12/2025

Developer access to Pipi is coming

By: Mike Peters
On a Sandy Beach: 02/12/2025

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

The problem

To launch Pipi 9 with limited resources in 2026, the effort needs to be highly focused on developers who build large enterprise systems and have experienced failure, cost overruns, and staggering complexity. This is for them.

The staggering global cost of annual IT failures on big projects is around US$3 trillion.

  • 15% succeed
  • 25% make no difference
  • 60% fail

Web traffic stats

Web traffic statistics show steadily growing public interest in Pipi 9, and the pattern is becoming very clear.

Since early 2019, the total traffic stats by country are:

  • Singapore 19%
  • United States 17%
  • Hong Kong 13%
  • Brazil 12%

Developer Accounts

The initial paid Developer Accounts will be restricted to those four countries. That also affects the language, currency, hours of support availability, etc. Later, as interest and resources grow, that list of countries can be expanded.

Personal Accounts

The free Personal Accounts will initially use an English interface and will not be restricted by country of residence. They will get community support.

Enterprise Accounts

The initial paid Enterprise Accounts will be supported by their associated Developer Accounts, who can charge them for that service. They will initially use an English interface and will not be restricted by country of residence. Developer Accounts will be able to translate UI and documentation into any language and writing system.

Pipi 9 is in hiding

This engineering blog and the many other Ajabbi documentation websites are deliberately hidden from search engines. People are visiting because they are curious, as I write notes to myself and build, learning as I go. It is not easy to find the technical documentation unless you are interested and very determined. That has helped me find some early, keen technical fans who provide testing and feedback. It has also protected me from being overwhelmed by enquiries.

Communication constraints

I am a very slow writer who uses assistive technology, I have hearing problems, and I prefer video chats with people who speak clear English. I don't speak any other language apart from tiny bits of Māori, French and Spanish.

SEO and GEO

The SEO/GEO settings will be fixed when:

  • Pipi 9 matures and becomes ready for public use
  • Community support is in place
  • Bugs are fixed
  • There is enough self-help documentation to help people get started

Developer Account waitlist

There will be a signup queue to control demand, so scaling is steady with a positive resource feedback loop to solve the chicken-and-egg problem. The small queue is growing now. I will pick the best candidates with the highest chance of success. They will gain a first-mover advantage in building large, custom enterprise systems faster and at a much lower cost. The first ones will get free unlimited support. 

Relying on word-of-mouth recommendations

There will be no marketing or sales, just good, clear documentation, live demos, and bookable office hours (NZ daytime) for having a chat.

Inflexion point

In the future, as workspaces mature and Pipi 10 becomes even easier to work with, resource constraints will disappear, and an inflexion point will be reached. 

Workspaces for Research

Mike's Notes

This is where I will keep detailed working notes on creating Workspaces for Research. Eventually, these will become permanent, better-written documentation stored elsewhere. Hopefully, someone will come up with a better name than this working title.

This replaces coverage in Industry Workspace written on 13/10/2025.

Testing

The current online mockup is version 1 and will be updated frequently. If you are helping with testing, please remember to delete your browser cache so you see the daily changes. Eventually, a live demo version will be available for field trials.

Learning

Initially, Pipi 4 had a module called EcoTrack. It mirrored various existing paper tools for biodiversity sampling and measurement in NZ ecosystems. Basically, it provided data storage of observations for any Ecological Restoration Site:

  • Trapping
  • Soil
  • Plant growth
  • Water quality
  • Microorganisms
  • Climate
  • Photo records
  • Etc

It was planned to join this data with ESRI mapping to help visualise it. ESRI provided a NZ$600,000 software grant for this project. The grant came after I gave a live demo at a NZ GIS User Conference where the ESRI head of programming was in the audience. The software arrived on two pallets two weeks later.

Landcare Research, DOC and Eagle Technology were helping with this project.

Then there was a change in government, funding dried up, and the Christchurch earthquakes caused havoc. Pipi 4 died.

In 2016, when rebuilding Pipi from memory as Pipi 6, I was greatly influenced by David C. Hay's work on data models for laboratory tests. I then figured out how to add the Business Model testing experiments of Steve Blank and Alexander Osterwalder. Plus, I have a home lab for looking at bugs, running chemistry experiments, and making useful concoctions for art projects. Alex helped with how to provide literature references. Add in a catalogued reference library and seminars. So that's the origin story, starting at a basic level and slowly growing over time.

Why

Ajabbi Research will be the first user of this workspace to organise the research needed to support and improve Pipi for people to use. The workspace will also be used for testing the Researcher Account. Eventually, this workspace will be available to anyone.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

28/11/2025

Workspaces for Research

By: Mike Peters
On a Sandy Beach: 28/11/2025

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Open-source

This open-source SaaS cloud system will be shared on GitHub and GitLab.

Dedication

This workspace is dedicated to the life and work of Jocelyn Bell Burnell, who discovered pulsars at the age of 24.

Bell Burnell in 2009

Source: https://en.wikipedia.org/wiki/Jocelyn_Bell_Burnell#/media/File:Launch_of_IYA_2009,_Paris_-_Grygar,_Bell_Burnell_cropped.jpg

"Dame Susan Jocelyn Bell Burnell (/bɜːrˈnɛl/; née Bell; born 15 July 1943) is a Northern Irish physicist who, while conducting research for her doctorate, discovered the first radio pulsars in 1967. This discovery later earned the Nobel Prize in Physics in 1974, but she was not among the awardees.

Bell Burnell was president of the Royal Astronomical Society from 2002 to 2004, president of the Institute of Physics from October 2008 until October 2010, and interim president of the Institute following the death of her successor, Marshall Stoneham, in early 2011. She was Chancellor of the University of Dundee from 2018 to 2023.

In 2018, she was awarded the Special Breakthrough Prize in Fundamental Physics. Following the announcement of the award, she decided to use the $3 million (£2.3 million) prize money to establish a fund to help female, minority and refugee students to become research physicists. The fund is administered by the Institute of Physics.

In 2021, Bell Burnell became the second female recipient (after Dorothy Hodgkin in 1976) of the Copley Medal. In 2025, Bell Burnell's image was included on an An Post stamp celebrating women in STEM." - Wikipedia

Change Log

Ver 1 includes research.

Existing products


Features

This is a basic comparison of features in research software.

[TABLE]

Data Model

words

Database Entities

  • Facility
  • Party
  • etc

Standards

The workspace needs to comply with all relevant international standards.

  • (To come)

Workspace navigation menu

This default outline needs a lot of work. The outline can be easily customised by future users using drag-and-drop and tick boxes to turn features off and on.

  • Enterprise Account
    • Applications
      • Facility
        • Accounts
        • Maintenance
        • Supplies
      • Library
        • Borrower
        • Collection
        • Loan
      • Publish
        • Presentation
        • Website
          • Blog
          • Wiki
      • Research Program
        • Experiment
        • Theory
          • Customer (v2)
            • Bookmarks
              • (To come)
            • Support
              • Contact
              • Forum
              • Live Chat
              • Office Hours
              • Requests
              • Tickets
            • (To come)
              • Feature Vote
              • Feedback
              • Surveys
            • Learning
              • Explanation
              • How to Guide
              • Reference
              • Tutorial
            • Settings (v3)
              • Account
              • Billing
              • Deployments
                • Workspaces
                  • Modules
                  • Plugins
                  • Templates
                    • Institute
                    • Lab
                    • Student
                  • Users

          "You Don't Need Kafka, Just Use Postgres" Considered Harmful

          Mike's Notes

          Note

          Resources

          References

          • Reference

          Repository

          • Home > Ajabbi Research > Library >
          • Home > Handbook > 

          Last Updated

          01/12/2025

          "You Don't Need Kafka, Just Use Postgres" Considered Harmful

          By: Gunnar Morling
          Random Musings on All Things Software Engineering: 03/11/2025

          Gunnar Morling is an open-source software engineer in the Java and data streaming space. He currently works as a Technologist for Confluent. In his past role at Decodable he focused on developer outreach and helped them build their stream processing platform based on Apache Flink. Prior to that, he spent ten years at Red Hat, where he led the Debezium project, a platform for change data capture.

          Looking to make it to the front page of HackerNews? Then writing a post arguing that "Postgres is enough", or why "you don’t need Kafka at your scale" is a pretty failsafe way of achieving exactly that. No matter how often it has been discussed before, this topic is always doing well. And sure, what’s not to love about that? I mean, it has it all: Postgres, everybody’s most favorite RDBMS—​check! Keeping things lean and easy—​sure, count me in! A somewhat spicy take—​bring it on!

          The thing is, I feel all these articles kinda miss the point; Postgres and Kafka are tools designed for very different purposes, and naturally, which tool to use depends very much on the problem you actually want to solve. To me, the advice "You Don’t Need Kafka, Just Use Postgres" is doing more harm than good, leading to systems built in a less than ideal way, and I’d like to discuss why this is in more detail in this post. Before getting started though, let me get one thing out of the way really quick: this is not an anti-Postgres post. I enjoy working with Postgres as much as the next person (for those use cases it is meant for). I’ve used it in past jobs, and I’ve written about it on this blog before. No, this is a pro-"use the right tool for the job" post.

          So what’s the argument of the "You Don’t Need Kafka, Just Use Postgres" posts? Typically, they argue that Kafka is hard to run or expensive to run, or a combination thereof. When you don’t have "big data", this cost may not be justified. And if you already have Postgres as a database in your tech stack, why not keep using this, instead of adding yet another technology?

          Usually, these posts then go on to show how to use SELECT ... FOR UPDATE SKIP LOCKED for building a… job queue. Which is where things already start to make a bit less sense to me. The reason being that queuing just is not a typical use case for Kafka to begin with. It requires message-level consumer parallelism, as well as the ability to acknowledge individual messages, something Kafka historically has not supported. Now, the Kafka community actually is working towards queue support via KIP-932, but this is not quite ready for primetime yet (I took a look at that KIP earlier this year). Until then, the argument boils down to not using Kafka for something it has not been designed for in the first place. Hm, yeah, ok?

          That being said, building a robust queue on top of Postgres is actually harder than it may sound. Long-running transactions by queue consumers can cause MVCC bloat and WAL pile-up; Postgres' vacuum process not being able to keep up with the rate of changes can quickly become a problem for this use case. So if you want to go down that path, make sure to run representative performance tests, for a sustained period of time. You won’t find out about issues like this by running two minute tests.
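          For reference, here is a minimal sketch (not from the original post) of the SELECT ... FOR UPDATE SKIP LOCKED pattern those articles typically show, written in Python with psycopg2 and assuming a hypothetical jobs table with id, payload and status columns; handle() is a hypothetical worker function:

          import psycopg2

          conn = psycopg2.connect("dbname=app user=worker")  # placeholder DSN

          def claim_and_process_one_job() -> bool:
              with conn:  # one transaction per job; the row lock is held until commit
                  with conn.cursor() as cur:
                      cur.execute(
                          """
                          SELECT id, payload
                          FROM jobs
                          WHERE status = 'pending'
                          ORDER BY id
                          FOR UPDATE SKIP LOCKED
                          LIMIT 1
                          """
                      )
                      row = cur.fetchone()
                      if row is None:
                          return False          # queue empty (or every pending row is locked)
                      job_id, payload = row
                      handle(payload)           # hypothetical worker function
                      cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,))
                      return True

          Note that keeping the transaction open for the whole duration of handle() is exactly the kind of long-running consumer transaction the paragraph above warns about.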

          So let’s actually take a closer look at the "small scale" argument, as in "with such a low data volume, you just can use Postgres". But to use it for what exactly? What is the problem you are trying to solve? After all, Postgres and Kafka are tools designed for addressing specific use cases. One is a database, the other is an event streaming platform. Without knowing and talking about what one actually wants to achieve, the conversation boils down to "I like this tool better than that" and is pretty meaningless.

          Kafka enables a wide range of use cases such as microservices communication and data exchange, ingesting IoT sensor data, click streams, or metrics, log processing and aggregation, low-latency data pipelines between operational databases and data lakes/warehouses, and realtime stream processing, for instance for fraud detection and recommendation systems.

          So if you have one of those use cases, but at a small scale (low volume of data), could you then use Postgres instead of Kafka? And if so, does it make sense? To answer this, you need to consider the capabilities and features you get from Kafka which make it such a good fit for these applications. And while scalability indeed is one of Kafka’s core characteristics, it has many other traits which make it very attractive for event streaming applications:

          • Log semantics: At its core, Kafka is a persistent ordered event log. Records are not deleted after processing, instead they are subject to time-based retention policies or key-based compaction, or they could be retained indefinitely. Consumers can replay a topic from a given offset, or from the very beginning. If needed, consumers can work with exactly-once semantics. This goes way beyond simple queue semantics and replicating it on top of Postgres will be a substantial undertaking.
          • Fault tolerance and high availability (HA): Kafka workloads are scaled out in clusters running on multiple compute nodes. This is done for two reasons: increasing the throughput the system can handle (not relevant at small scale) and increasing reliability (very much relevant also at small scale). By replicating the data to multiple nodes, instance failures can be easily tolerated. Each node in the cluster can be a leader for a topic partition (i.e., receive writes), with another node taking over if the previous leader becomes unavailable.
            With Postgres in contrast, all writes go to a single node, while replicas only support read requests. A broker failover in Kafka will affect (in the form of increased latencies) only those partitions it is the leader for, whereas the failure of the Postgres primary node in a cluster is going to affect all writers. While Kafka broker failovers happen automatically, manual intervention is required in order to promote a Postgres replica to primary, or an external coordinator such as Patroni must be used. Alternatively, you might consider Postgres-compatible distributed databases such as CockroachDB, but then the conversation shifts quite a bit away from "Just use Postgres".
          • Consumer groups: One of the strengths of the Kafka protocol is its support for organizing consumers in groups. Multiple clients can distribute the load of reading the messages from a given topic, making sure that each message is processed by exactly one member of the group. Also when handling only a low volume of messages, this is very useful. For instance, consider a microservice which receives messages from another service. For the purposes of fault-tolerance, the service is scaled out to multiple instances. By configuring a Kafka consumer group for all the service instances, the incoming messages will be distributed amongst them.
            How would the same look when using Postgres? Considering the "small scale" scenario, you could decide that only one of the service instances should read all the messages. But which one do you select? What happens if that node fails? Some kind of leader election would be required. Ok, so let’s make each member of the application cluster consume from the topic then? For this you need to think about how to distribute the messages from the Postgres-based topic, how to handle client failures, etc. So your job now essentially is to re-implement Kafka’s consumer rebalance protocol. This is far from trivial and it certainly goes against the initial goal of keeping things simple. (A minimal consumer-group sketch in Python follows after this list.)
          • Low latency: Let’s talk about latency, i.e. the time it takes from sending a message to a topic until it gets processed by a consumer. Having a low data volume doesn’t necessarily imply that you do not want low latency. Think about fraud detection, for example. Also when processing only a handful of transactions per second, you want to be able to spot fraudulent patterns very quickly and take action accordingly. Or a data pipeline from your operational data store to a search index. For a good user experience, search results should be based on the latest data as much as possible. With Kafka, latencies in the milli-second range can be achieved for use cases like this. Trying to do the same with Postgres would be really tough, if possible at all. You don’t want to hammer your database with queries from a herd of poll-based queue clients too often, while LISTEN/NOTIFY is known to suffer from heavy lock contention problems.
          • Connectors: One important aspect which is usually omitted from all the "Just use Postgres" posts is connectivity. When implementing data pipelines and ETL use cases, you need to get data out of your data source and put it into Kafka. From there, it needs to be propagated into all kinds of data sinks, with the same dataset oftentimes flowing into multiple sinks at once, such as a search index and a data lake. Via Kafka Connect, Kafka has a vast ecosystem of source and sink connectors, which can be combined, mix-and-match style. Taking data from MySQL into Iceberg? Easy. Going from Salesforce to Snowflake? Sure. There’s ready-made connectors for pretty much every data system under the sun.
            Now, what would this look like when using Postgres instead? There’s no connector ecosystem for Postgres like there is for Kafka. This makes sense, as Postgres never has been meant to be a data integration platform, but it means you’ll have to implement bespoke source and sink connectors for all the systems you want to integrate with.
          • Clients, schemas, developer experience: One last thing I want to address is the general programming model of a "Just use Postgres" event streaming solution. You might think of using SQL as the primary interface for producing and consuming messages. That sounds easy enough, but it’s also very low level. Building some sort of client will probably make sense. You may need consumer group support, as discussed above. You’ll need support for metrics and observability ("What’s my consumer lag?"). How do you actually go about converting your events into a persistent format? Some kind of serializer/deserializer infrastructure will be needed, and while at it, you probably should have support for schema management and evolution, too. What about DLQ support? With Kafka and its ecosystem, you get battle-proven clients and tooling, which will help you with all that, for all kinds of programming languages. You could rebuild all this, of course, but it would take a long time and essentially equate to recreating large parts of Kafka and its ecosystem.
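          To make the consumer group point above concrete, here is a minimal sketch (not from the original post) in Python using confluent-kafka; the broker address, topic name, group id and handle() function are assumptions for illustration:

          from confluent_kafka import Consumer

          consumer = Consumer({
              "bootstrap.servers": "localhost:9092",   # placeholder broker address
              "group.id": "order-service",             # every instance of the service uses the same group id
              "auto.offset.reset": "earliest",
          })
          consumer.subscribe(["orders"])               # hypothetical topic

          while True:
              msg = consumer.poll(1.0)
              if msg is None or msg.error():
                  continue
              handle(msg.value())                      # hypothetical handler; each message goes to exactly one group member

          Kafka's rebalance protocol spreads the topic's partitions across all live members of the group and reassigns them automatically when an instance dies; that is the machinery the consumer groups point above says you would otherwise have to rebuild on top of Postgres.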

          So where does all that leave us? Should you use Postgres as a job queue then? I mean, why not, if it fits the bill for you, go for it. Don’t build it yourself though, use an existing extension like pgmq. And make sure to understand the potential implications on MVCC bloat and vacuuming discussed above.

          Now, when it comes to using Postgres instead of Kafka as an event streaming platform, this proposition just doesn’t make an awful lot of sense to me, no matter what the volume of the data is going to be. There’s so much more to event streaming than what’s typically discussed in the "Just use Postgres" posts; while you might be able to punt some of the challenges for some time, you’ll eventually find yourself in the business of rebuilding your own version of Kafka, on top of Postgres. But what’s the point of recreating and maintaining the work already done by hundreds of contributors in the course of many years? What starts as an effort to "keep things simple" actually creates a substantial amount of unnecessary complexity. Solving this challenge might sound like a lot of fun purely from an engineering perspective, but for most organizations out there, it’s probably just not the right problem they should focus on.

          Another problem of the "small scale" argument is that what’s a low data volume today may be a much bigger volume next week. This is a trade-off, of course, but a common piece of advice is to build your systems for the current and the next order of magnitude of load: you should be able to sustain 10x of your current load and data volume as your business grows. This will be easily doable with Kafka which has been designed with scalability at its core, but it may be much harder for a queue implementation based on Postgres. It is single-writer as discussed above, so you’d have to look at scaling up, which becomes really expensive really quickly. So you might decide to migrate to Kafka eventually, which will be a substantial effort when thinking of migrating data, moving your applications from your home-grown clients to Kafka, etc.

          In the end, it all comes down to choosing the right tool for the job. Use Postgres if you want to manage and query a relational data set. Use Kafka if you need to implement realtime event streaming use cases. Which means, yes, oftentimes, it actually makes sense to work with both tools as part of your overall solution: Postgres for managing a service’s internal state, and Kafka for exchanging data and events with other services. Rather than trying to emulate one with the other, use each one for its specific strengths. How to keep both Postgres and Kafka in sync in this scenario? Change data capture, and in particular the outbox pattern can help there. So if there is a place for "Postgres over Kafka", it is actually here: for many cases it makes sense to write to Kafka not directly, but through your database, and then to emit events to Kafka via CDC, using tools such as Debezium. That way, both resources are (eventually) consistent, keeping things very simple from an application developer perspective.
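          To illustrate the outbox pattern mentioned above, here is a minimal sketch (not from the original post) in Python with psycopg2; the orders and outbox tables are assumptions for illustration:

          import json
          import uuid
          import psycopg2

          conn = psycopg2.connect("dbname=app user=service")  # placeholder DSN

          def place_order(order: dict) -> None:
              # One local transaction updates the service's own state AND records the event.
              with conn:
                  with conn.cursor() as cur:
                      cur.execute(
                          "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
                          (order["id"], order["customer_id"], order["total"]),
                      )
                      cur.execute(
                          "INSERT INTO outbox (id, aggregate_id, type, payload) VALUES (%s, %s, %s, %s)",
                          (str(uuid.uuid4()), order["id"], "OrderPlaced", json.dumps(order)),
                      )
              # A CDC tool such as Debezium tails the outbox table and publishes each row to Kafka,
              # so the database and the event stream become eventually consistent without dual writes.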

          This approach also has the benefit of decoupling (and protecting) your operational datastore from the potential impact of downstream event consumers. You probably don’t want to be at the risk of increased tail latencies of your operational REST API because there’s a data lake ingest process, perhaps owned by another team, which happens to reread an entire topic from a table in your service’s database at the wrong time. Adhering to the idea of the synchrony budget, it makes sense to separate the systems for addressing these different concerns.

          What about the operational overhead then? While this definitely warrants consideration, I believe that oftentimes that concern is overblown. Running Kafka for small data sets really isn’t that hard. With the move from ZooKeeper to KRaft mode, running a single Kafka instance is trivial for scenarios not requiring fault tolerance. Managed services make running Kafka a very uneventful experience (pun intended) and should be the first choice, in particular when setting out with low scale use cases. Cost will be manageable kinda by definition by virtue of having a low volume of data. Plus, the time and effort for solving all the issues with a custom implementation discussed above should be part of the TCO consideration to be useful.

          So yes, if you want to make it to the front page of HackerNews, arguing that "Postgres is enough" may get you there; but if you actually want to solve your real-world problems in an effective and robust way, make sure to understand the sweet spots and limitations of your tools and use the right one for the job.

          Pipi Engines to build, deploy and manage in the cloud

          Mike's Notes

          Finally got onto this job. Should be fun. 😊

          Resources

          References

          • Reference

          Repository

          • Home > Ajabbi Research > Library >
          • Home > Handbook > 

          Last Updated

          2/12/2025

          Pipi Engines to build, deploy and manage in the cloud

          By: Mike Peters
          On a Sandy Beach: 30/11/2025

          Mike is the inventor and architect of Pipi and the founder of Ajabbi.

          The problem

          The open-source workspaces under development are designed to be shared on GitHub/GitLab and hosted in production on various Cloud Platforms. Eventually, private cloud and on-prem will be included, as long as Pipi 9 can get secure access. In that case, hybrid clouds should also be fine.

          Pipi 9 needs to be able to automatically build, deploy, and manage this process, either directly or via third-party tools such as GitHub Actions.
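          As a rough illustration of the GitHub Actions route (not Pipi code; the repository, workflow file and input names are hypothetical), an agent could trigger a deployment workflow through the standard workflow_dispatch REST endpoint:

          import os
          import requests

          def trigger_deploy(owner: str, repo: str, workflow_file: str, target: str) -> None:
              """Ask GitHub Actions to run a deployment workflow via a workflow_dispatch event."""
              url = f"https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches"
              resp = requests.post(
                  url,
                  headers={
                      "Accept": "application/vnd.github+json",
                      "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
                  },
                  json={"ref": "main", "inputs": {"target": target}},  # 'target' is a made-up workflow input
              )
              resp.raise_for_status()  # GitHub returns 204 No Content on success

          # e.g. trigger_deploy("ajabbi", "workspace-research", "deploy.yml", "gcp-free-tier")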

          Agent Engines

          An early Pipi 6 module from 2016 that catalogued cloud services was converted to a Pipi 7 microservice in 2018. Yesterday, this was imported into Pipi 9 and is being used to create these agents, which act as autonomous engines.

          • Platform Engine (plt) - this one is the commander on the battlefield. I will get this finished first.

          A dedicated agent engine has been created for each cloud platform. They have yet to be differentiated.

          • Apple Engine (ale)
          • AWS Engine (aws)
          • Azure Engine (azu)
          • Digital Ocean Engine (dgo)
          • Google Cloud Engine (ggc)
          • IBM Engine (ibm)
          • Meta Engine (met)
          • Oracle Engine (ora)
          • (More will be added later; all are welcome)

          Then I remembered: Pipi 9 also has some self-deployment capacity, so generalising the existing capacity for building, deploying, and managing to share with the other engines makes sense.

          • Pipi Engine (pip) - for deploying to the closed data centre in a Boxlang/JRE host environment.

          How

          Most agents start like a Stem Cell. They are then modified to perform a specific job and can evolve over time. That's what I'm doing now.

          I started last night with the GCP console, looking at how to reverse-engineer the APIs and make a data model to drive the API calls. Looks straightforward. Gemini 3 chat is a big help and is saving a lot of time.

          But wait, there's more

          Each agent engine is complex and can incorporate other agents like LEGO bricks. Example: API Engine (api) and YAML Engine (yml). Just as in a living biological cell, everything is structured, in flux and self-regulating in response to its environment and internal processes. Other agent types, like primitives, are not complex.

          Free-tier experiments

          The Engines will play in the free tier of the different cloud providers. I will probably try GitHub Actions first, using the sample code I have found, and go from there.

          Known available free tiers (more to come)

          • Alibaba Cloud
          • AWS
          • Azure
          • Cloudflare
          • Container Hosting Service
          • DigitalOcean
          • Google Cloud
          • Hetzner Cloud
          • IBM Cloud
          • Linode
          • Netlify
          • OpenShift
          • Oracle Cloud
          • OVHcloud
          • Render
          • Salesforce
          • Tencent Cloud
          • Vercel
          • Wasabi
          • Zeabur

          Cost $$$$$$$ 😎😎

          The free-tier usage limits need to be locked to prevent Pipi from burning through lots of cash.
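          A minimal sketch of the kind of guard meant here (not Pipi code; the cap figures are illustrative placeholders, not real provider limits):

          FREE_TIER_CAPS = {
              # Placeholder figures only; each provider's actual free-tier terms must be checked.
              "gcp": {"compute_hours": 700, "storage_gb": 5},
              "aws": {"compute_hours": 700, "storage_gb": 5},
          }

          def may_provision(provider: str, used_hours: float, used_gb: float) -> bool:
              """Refuse any new deployment once the recorded usage reaches the free-tier cap."""
              cap = FREE_TIER_CAPS[provider]
              return used_hours < cap["compute_hours"] and used_gb < cap["storage_gb"]

          if not may_provision("gcp", used_hours=650.0, used_gb=2.0):
              raise RuntimeError("Free-tier allowance exhausted; refusing to deploy")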

          Cloud credits

          If Ajabbi can get some cloud credits, then playing with the more expensive stuff would be possible to make sure everything works for future customers. It would enable customers to choose their preferred cloud provider without barriers.

          No Series B

          Ajabbi is a bootstrapped start-up for public good (with a future foundation) and will have no investors, so there will be no Series B. Unfortunately, these cloud providers are obsessed with giving more credits only to Series B start-ups. Go figure.

          Stock numbers

          Use more agent engines as the workload increases. So if, for example, IBM Engine (ibm) can handle 1,000 enterprise customers, and 10,000 enterprise customers want IBM cloud setups, then Platform Engine (plt) can get the Factory Engine (fac) to breed more IBM Engines (ibm) to nibble away at the work. I won't know the actual stocking ratio until field testing under load. But whatever it is, it won't be a problem. And it may turn out that some very large customers need a dedicated agent engine or two each. All of this is possible.
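          The arithmetic behind that stocking decision is simple; a tiny sketch, using the hypothetical numbers above:

          import math

          def engines_needed(customers: int, capacity_per_engine: int) -> int:
              """How many agent engines the Factory Engine would have to breed for a given load."""
              return math.ceil(customers / capacity_per_engine)

          print(engines_needed(10_000, 1_000))   # -> 10 IBM Engine (ibm) instances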

          Developer Accounts

          The Workspaces for Developers, currently under development, will enable developers to help configure these agent engines and keep them up to date. This will also enable any platform to add itself by submitting a request for a dedicated agent engine and its devs, helping with the config and user documentation.

          Welcome to FreeTechBooks

          Mike's Notes

          A useful resource.

          Resources

          References

          • Reference

          Repository

          • Home > Ajabbi Research > Library >
          • Home > Handbook > 

          Last Updated

          30/11/2025

          Welcome to FreeTechBooks

          By: FreeTechBooks

          FreeTechBooks: 30/11/2025

          What’s Inside?

          This site lists free online computer science, engineering and programming books, textbooks and lecture notes, all of which are legally and freely available over the Internet.

          Throughout this site, other terms are used to refer to a book, such as ebook, text, document, monograph or notes.

          What’s the Catch?

          NONE. All the books listed in this site are freely available, as they are hosted on websites that belong to the authors or the publishers.

          In other words, we don't host the books. We simply provide links to the books in PDF or HTML format available on the authors' or publishers' websites.

          Please note that (a) we do not host pirated books and (b) we do not link to sites that host pirated books and (c) we do not even link to sites that link to sites that host pirated books.

          Each author and publisher has their own terms and conditions in the forms of free / open licenses, public domain or other specific ones.

          You are allowed to view, download and, with very few exceptions, print the books for your own private use at no charge. In fact, you are encouraged to tell others about the books.

          Feedback and Suggestions

          Any feedback and suggestions are most welcome. Please use the Contact Us form.

          Dead Links

          With so many links to external websites, dead links are bound to happen from time to time. We regularly check for dead links and remove them, but sometimes a few slip through. If you happen to see books with dead links, please let us know via the Disqus comments.

          ECLASS Release 16.0

          Mike's Notes

          This is used by many manufacturers, including Rittal. This standard will be used by Pipi 9 for some of its internal models.

          "ECLASS (formerly styled as eCl@ss) is a data standard for the classification of products and services using standardized ISO-compliant properties. The ECLASS Standard enables the digital exchange of product master data across industries, countries, languages or organizations. Its use as a standardized basis for a product group structure or with product-describing properties of master data is particularly widespread in ERP systems.

          As an ISO-compliant and the world's only property-based classification standard, ECLASS also serves as a "language" for Industry 4.0 (IOTS)." 

          - Wikipedia 

          Resources

          References

          • ECLASS

          Repository

          • Home > Ajabbi Research > Library > Standards > ECLASS
          • Home > Handbook > 

          Last Updated

          29/11/2025

          ECLASS Release 16.0

          By: 
          ECLASS Newsletter: 28/11/2025

          License for ECLASS Release 16.0 in all available languages and all export formats. ECLASS 16.0 was released on Nov 27, 2025.

          Product description

          You purchase a license for ECLASS Release 16.0 in all available languages in the export formats BASIC (csv and XML) and ADVANCED (XML). Most language versions are fully translated, some are partially translated. In partly translated language versions, missing language content is filled in with the English original. You can receive language versions in 12 languages as default. All other language versions available from ECLASS can be ordered via our contact form.

          The ECLASS Release 16.0 is an enhancement and modification of the ECLASS Release 15.0.

          Technical innovations in ECLASS 16.0

          No changes in the data model or the XML schema.

          New content

          Compared to Release 15.0, ECLASS 16.0 comprises:

          • 995 new classes (CC, AS, BL, AC), of which 137 are new classification classes (CCs)
          • 1,253 new properties
          • 985 new values
          • 126 new value lists

          All available downloads at the ECLASS Shop generally contain a complete ECLASS version in csv format or XML format (for initial implementation). For information on the data structure, please refer to the enclosed README-file.

          The content has been extensively expanded, particularly in the following segments: 

          • Segment 23 "Machine element, fastener, fixing, mounting" 
            • New commodity classes in 23-30 "Linear motion technology, Rotary systems", e.g. 
              • 23-30-17-00 "Linear motion module, Linear motion axis"
              • 23-30-18-00 "Electromechanical cylinder"
              • 23-30-24-00 "Electromechanical rotary actuator"
            • And new commodity classes on third level, e.g. 
              • 23-01-04-00 "Hand wheel / Crank handle (machine handle)"
              • 23-01-06-00 "Quick release fastener"
              • 23-01-07-00 "Leveling feet, leveling mount"
          • Segment 27 "Electric engineering, automation, process control engineering" 
            • New classes and enhancement of these with properties in 27-38-03-00 "Gripper (electric)", e.g.
              • 27-38-03-01 "Parallel gripper (electric)"
              • 27-38-03-02 "Angular gripper (electric)"
              • 27-38-03-03 "Rotary gripper (electric)"
            • New classes due to the ETIM Harmonisation, e.g. 
              • 27-21-07-17 "Immersion thermostat"
              • 27-21-07-18 "Fancoil thermostat"
              • 27-21-07-19 "Zone controller"
          • Segment 36 "Machine, apparatus" 
            • 36-64-14-00 "Machine Vision System"
          • Segment 50 "Interior furnishing", e.g. 
            • 50-11-09-25 "Rolling pin"
          • Segment 51 "Fluid power" 
            • New main group 51-58 "Electronics and software (hydraulics)", with new classification classes, e.g. 
              • 51-58-01-01 "Valve amplifier with feedback (hydraulics)"
              • 51-58-01-02 "Valve amplifier without feedback (hydraulics)" 
          • Extension of the "Material Declaration" aspect, which is attached to all classes
            • Includes information on critical ingredients and has been added for all classification classes (except classes describing services)
            • Enhancement with legal or regulatory requirement to disclose the specific components, compounds, or chemical substances that make up a manufactured item—ensuring transparency for safety, environmental compliance, or consumer awareness (to be found in Block 0173-1#01-AKA421#001)
          • Around 23,300 definitions of classification classes were generated using ChatGPT. If a definition has been generated by AI, this is indicated in the attribute "source of definition" with "machine generated by GPT-4o mini".

          The product

          With Release 7.0, an improved structure according to the underlying ISO13584 data model and an ISO-standard XML format (OntoML) were introduced.

          Starting with ECLASS Release 7.0, there are two different versions of ECLASS which contain the same classification classes but differ in the product description based on properties and values. Since Release 13.0, the ECLASS standard has included extensions for the Asset Administration Shell (AAS). Regarding the data model, in addition to BASIC and ADVANCED a new application class of the type "Asset" has been introduced.

          BASIC (in csv or XML format)

          The BASIC version contains only the content that could be represented in the csv format, which was the only export format used before 7.0. Therefore, it does not contain property block structures or dynamic elements as in the ADVANCED version. BASIC does contain all classes of ADVANCED, but the product description with the help of properties and values is structured much more simply. BASIC is therefore a subset of ADVANCED and only includes properties that are flagged as "Basic relevant".

          ADVANCED (only in XML format)

          The ADVANCED version is the leading version in the database and built based on the data model ISO13584. It contains all structural elements of the ECLASS classification system including property blocks, dynamic elements such as reference properties, polymorphism, and cardinality blocks. The description of these extended structural components as well as additional information can be found under ADVANCED Version in our Technical Support.

          Note: Each classification class refers to an ADVANCED, a BASIC and an ASSET application class (AC), which contains the product description with the help of properties and values. The ADVANCED version contains the complete content of ECLASS and is therefore an extension of the BASIC version.

          In our Technical Support you will find an overview (a matrix of functionalities) which makes it possible for you to compare the two versions (BASIC and ADVANCED) and their capabilities. A provided reference table (XML-File: Mapping BASIC_ADVANCED_ADVANCED) enables users of the ADVANCED version to exchange technical data with users of the BASIC version. The necessary know-how must be provided by the user of the ADVANCED version.

          ASSET (only in XML format)

          In Release 13.0, Asset Application Classes were created for all classes on the fourth level in the segments 17, 18, 19, 21, 23, 27, 28, 32, 33, 36, 49, 50 and 51. In addition, the so-called "submodel templates" for the AAS were created as aspects at the content level and were added to the Asset AC in these classes. In addition to the BASIC and ADVANCED representation, ECLASS e.V. publishes a derivation of the type "Asset". This contains all Asset Application Classes including the submodel template aspects.

          ECLASS 16.0 (Asset) is free of charge for all users; no order is required. You can activate the product yourself:  

          After entering the discount code "asset-158" under "My profile", ECLASS 16.0 (Asset) is immediately available to all registered ECLASS users free of charge in the personal download area.

          Translations

          For release 15.0, 14 additional languages have been added, meaning that ECLASS now offers a total of 31 languages in version 15.0. This means that ECLASS now has almost all official languages of the European Union in its portfolio and also covers many other international languages. The language content of the new language versions has been translated using the automatic translation tool DeepL. For the first time, complete translations of the content (‘preferred names’) can be provided in 29 of the 31 languages available in ECLASS.

          Content

          ECLASS 16.0 BASIC (csv)

          • classes file (complete)
          • properties file (contains properties marked as "Basic relevant")
          • keywords and synonyms file
          • values file (complete)
          • unit file (complete)
          • class-property relations file
          • value lists (restricted property-value relations file, only BOOLEAN properties)
          • proposal lists (suggested class-property-value relations file incl. constraints)
          • README file ECLASS standard

          ECLASS 16.0 BASIC (XML) or ADVANCED (XML)

          • Dictionaries: comprise 39 XML files, one XML file per segment containing the complete content of the relevant segment (classes, keywords, properties, synonyms, values, value lists, units and all relations) (complete)
          • Templates: Since Release 10.0.1, the "Templates" folder known from previous Releases no longer contains any content. The reason for this is the future handling of templates, in which only "default" templates with substantial content will be delivered and published. This may change for future Releases. Templates contain a Data Requirement Statement for the data exchange, in which, for example, sequences or optional and mandatory fields can be defined between the data transmitter and the receiver. You can find more information in our Technical Support under Templates.
          • ECLASS units ("UnitsML" in XML format) 
          • Only ADVANCED: Additionally for the ADVANCED user the mapping file BASIC - ADVANCED is included 
          • README-file 

          In the following, you will find the structure of the XML files. The placeholder "xy" stands for the different language codes.

          Structure of ECLASS 16.0 BASIC (XML)


          Structure of ECLASS 16.0 ADVANCED (XML)

          Building a Resilient Data Platform with Write-Ahead Log at Netflix

          Mike's Notes

          Lots of useful ideas and examples here about using write-ahead logs when scaling big.

          Resources

          References

          • Reference

          Repository

          • Home > Ajabbi Research > Library >
          • Home > Handbook > 

          Last Updated

          29/11/2025

          Building a Resilient Data Platform with Write-Ahead Log at Netflix

          By: Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, John Lu
          Netflix TechBlog: 26/09/2025

          Introduction

          Netflix operates at a massive scale, serving hundreds of millions of users with diverse content and features. Behind the scenes, ensuring data consistency, reliability, and efficient operations across various services presents a continuous challenge. At the heart of many critical functions lies the concept of a Write-Ahead Log (WAL) abstraction. At Netflix scale, every challenge gets amplified. Some of the key challenges we encountered include:

          • Accidental data loss and data corruption in databases
          • System entropy across different datastores (e.g., writing to Cassandra and Elasticsearch)
          • Handling updates to multiple partitions (e.g., building secondary indices on top of a NoSQL database)
          • Data replication (in-region and across regions)
          • Reliable retry mechanisms for real time data pipeline at scale
          • Bulk deletes to database causing OOM on the Key-Value nodes

          All the above challenges either resulted in production incidents or outages, consumed significant engineering resources, or led to bespoke solutions and technical debt. During one particular incident, a developer issued an ALTER TABLE command that led to data corruption. Fortunately, the data was fronted by a cache, so the ability to extend cache TTL quickly together with the app writing the mutations to Kafka allowed us to recover. Absent the resilience features on the application, there would have been permanent data loss. As the data platform team, we needed to provide resilience and guarantees to protect not just this application, but all the critical applications we have at Netflix.

          Regarding the retry mechanisms for real time data pipelines, Netflix operates at a massive scale where failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput.

          With these problems in mind, we decided to build a system that would solve all the aforementioned issues and continue to serve the future needs of Netflix in the online data platform space. Our Write-Ahead Log (WAL) is a distributed system that captures data changes, provides strong durability guarantees, and reliably delivers these changes to downstream consumers. This blog post dives into how Netflix is building a generic WAL solution to address common data challenges, enhance developer efficiency, and power high-leverage capabilities like secondary indices, enable cross-region replication for non-replicated storage engines, and support widely used patterns like delayed queues.

          API

          Our API is intentionally simple, exposing just the essential parameters. WAL has one main API endpoint, WriteToLog, abstracting away the internal implementation and ensuring that users can onboard easily.

          rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse) {...}
          /**
            * WAL request message
            * namespace: Identifier for a particular WAL
            * lifecycle: How much delay to set and original write time 
            * payload: Payload of the message
            * target: Details of where to send the payload 
            */
          message WriteToLogRequest {
            string namespace = 1;
            Lifecycle lifecycle = 2;
            bytes payload = 3;
            Target target = 4;
          }
          /**
            * WAL response message
            * durable: Whether the request succeeded, failed, or unknown
            * message: Reason for failure
            */
          message WriteToLogResponse {
            Trilean durable = 1;
            string message = 2;
          }

          A namespace defines where and how data is stored, providing logical separation while abstracting the underlying storage systems. Each namespace can be configured to use different queues: Kafka, SQS, or combinations of multiple. Namespace also serves as a central configuration of settings, such as backoff multiplier or maximum number of retry attempts, and more. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs.
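          As a rough usage sketch (not from the original post; the gRPC service name, generated module names and endpoint are assumptions, since only the rpc and message definitions are shown above), a Python client call might look like this:

          import grpc
          import wal_pb2        # hypothetical generated code from the .proto above
          import wal_pb2_grpc   # hypothetical generated stubs

          channel = grpc.insecure_channel("wal.example.internal:8980")   # placeholder endpoint
          stub = wal_pb2_grpc.WALStub(channel)                           # assumes the service is named WAL

          request = wal_pb2.WriteToLogRequest(
              namespace="pds",                        # selects queue backend, retry and delay behaviour
              payload=b'{"event": "item_updated"}',   # opaque bytes; WAL does not interpret them
          )
          response = stub.WriteToLog(request)
          print(response.durable, response.message)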

          WAL can assume different personas depending on the namespace configuration.

          Persona #1 (Delayed Queues)

          In the example configuration below, the Product Data Systems (PDS) namespace uses SQS as the underlying message queue, enabling delayed messages. PDS uses Kafka extensively, and failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput. That’s when PDS started leveraging WAL for delayed messages.

          "persistenceConfigurations": {
            "persistenceConfiguration": [
            {
              "physicalStorage": {
                "type": "SQS"
              },
              "config": {
                "wal-queue": [
                  "dgwwal-dq-pds"
                ],
                "wal-dlq-queue": [
                  "dgwwal-dlq-pds"
                ],
                "queue.poll-interval.secs": 10,
                "queue.max-messages-per-poll": 100
              }
            }
            ]
          }

          Persona #2 (Generic Cross-Region Replication)

          Below is the namespace configuration for cross-region replication of EVCache using WAL, which replicates messages from a source region to multiple destinations. It uses Kafka under the hood.

          "persistence_configurations": {
            "persistence_configuration": [
            {
              "physical_storage": {
                "type": "KAFKA"
              },
              "config": {
                "consumer_stack": "consumer",
                "context": "This is for cross region replication for evcache_foobar",
                "target": {
                  "euwest1": "dgwwal.foobar.cluster.eu-west-1.netflix.net",
                  "type": "evc-replication",
                  "useast1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
                  "useast2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
                  "uswest2": "dgwwal.foobar.cluster.us-west-2.netflix.net"
                },
                "wal-kafka-dlq-topics": [],
                "wal-kafka-topics": [
                  "evcache_foobar"
                ],
                "wal.kafka.bootstrap.servers.prefix": "kafka-foobar"
              }
            }
            ]
          }

          Persona #3 (Handling multi-partition mutations)

          Below is the namespace configuration for supporting mutateItems API in Key-Value, where multiple write requests can go to different partitions and have to be eventually consistent. A key detail in the below configuration is the presence of Kafka and durable_storage. These data stores are required to facilitate two phase commit semantics, which we will discuss in detail below.

          "persistence_configurations": {
            "persistence_configuration": [
            {
              "physical_storage": {
                "type": "KAFKA"
              },
              "config": {
                "consumer_stack": "consumer",
                "contacts": "unknown",
                "context": "WAL to support multi-id/namespace mutations for dgwkv.foobar",
                "durable_storage": {
                  "namespace": "foobar_wal_type",
                  "shard": "walfoobar",
                  "type": "kv"
                },
                "target": {},
                "wal-kafka-dlq-topics": [
                  "foobar_kv_multi_id-dlq"
                ],
                "wal-kafka-topics": [
                  "foobar_kv_multi_id"
                ],
                "wal.kafka.bootstrap.servers.prefix": "kaas_kafka-dgwwal_foobar7102"
              }
            }
            ]
          }

          An important note is that requests to WAL support at-least-once semantics due to the underlying implementation.

          Under the Hood

          The core architecture consists of several key components working together.

          Message Producer and Message Consumer separation: The message producer receives incoming messages from client applications and adds them into the queue, while the message consumer processes messages from the queue and sends them to the targets. Because of this separation, other systems can bring their own pluggable producers or consumers, depending on their use cases. WAL’s control plane allows for a pluggable model, which, depending on the use-case, allows us to switch between different message queues.

          SQS and Kafka with a dead letter queue by default: Every WAL namespace has its own message queue and gets a dead letter queue (DLQ) by default, because there can be transient errors and hard errors. Application teams using Key-Value abstraction simply need to toggle a flag to enable WAL and get all this functionality without needing to understand the underlying complexity.

          • Kafka-backed namespaces: handle standard message processing
          • SQS-backed namespaces: support delayed queue semantics (we added custom logic to go beyond the standard defaults enforced in terms of delay, size limits, etc)
          • Complex multi-partition scenarios: use queues and durable storage
          • Target Flexibility: The messages added to WAL are pushed to the target datastores. Targets can be Cassandra databases, Memcached caches, Kafka queues, or upstream applications. Users can specify the target via namespace configuration and in the API itself.


          Architecture of WAL

          Deployment Model

          WAL is deployed using the Data Gateway infrastructure. This means that WAL deployments automatically come with mTLS, connection management, authentication, runtime and deployment configurations out of the box.

          Each data gateway abstraction (including WAL) is deployed as a shard. A shard is a physical concept describing a group of hardware instances. Each use case of WAL is usually deployed as a separate shard. For example, the Ads Events service will send requests to WAL shard A, while the Gaming Catalog service will send requests to WAL shard B, allowing for separation of concerns and avoiding noisy neighbour problems.

          Each shard of WAL can have multiple namespaces. A namespace is a logical concept describing a configuration. Each request to WAL has to specify its namespace so that WAL can apply the correct configuration to the request. Each namespace has its own configuration of queues to ensure isolation per use case. If the underlying queue of a WAL namespace becomes the bottleneck of throughput, the operators can choose to add more queues on the fly by modifying the namespace configurations. The concept of shards and namespaces is shared across all Data Gateway Abstractions, including Key-Value, Counter, Timeseries, etc. The namespace configurations are stored in a globally replicated Relational SQL database to ensure availability and consistency.


          Deployment model of WAL

          Based on certain CPU and network thresholds, the Producer group and the Consumer group of each shard will (separately) automatically scale up the number of instances to ensure the service has low latency, high throughput and high availability. WAL, along with other abstractions, also uses the Netflix adaptive load shedding libraries and Envoy to automatically shed requests beyond a certain limit. WAL can be deployed to multiple regions, so each region will deploy its own group of instances.

          Solving different flavors of problems with no change to the core architecture

          The WAL addresses multiple data reliability challenges with no changes to the core architecture:

          • Data Loss Prevention: In case of database downtime, WAL can continue to hold the incoming mutations. When the database becomes available again, WAL replays the mutations back to the database. The tradeoff is eventual consistency rather than immediate consistency, but no data loss.
          • Generic Data Replication: For systems like EVCache (using Memcached) and RocksDB that do not support replication by default, WAL provides systematic replication (both in-region and across-region). The target can be another application, another WAL, or another queue — it’s completely pluggable through configuration.
          • System Entropy and Multi-Partition Solutions: Whether dealing with writes across two databases (like Cassandra and Elasticsearch) or mutations across multiple partitions in one database, the solution is the same — write to WAL first, then let the WAL consumer handle the mutations. No more asynchronous repairs needed; WAL handles retries and backoff automatically.
          • Data Corruption Recovery: In case of DB corruptions, restore to the last known good backup, then replay mutations from WAL omitting the offending write/mutation.
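
          A minimal sketch of that replay pattern, assuming a queue of WAL entries and a stand-in target datastore: transient errors are retried with exponential backoff, and hard failures are dead-lettered. None of the classes or function names below come from the real abstractions.

          import time

          # Stand-in target; real targets are Cassandra, EVCache, Kafka, or applications.
          class FlakyDatastore:
              def __init__(self):
                  self.calls = 0

              def apply(self, mutation):
                  self.calls += 1
                  if self.calls < 3:
                      raise ConnectionError("datastore unavailable")   # transient error
                  print("applied:", mutation)

          def send_to_dlq(mutation):
              print("dead-lettered:", mutation)       # hard failure -> dead letter queue

          def replay(wal_entries, target, max_attempts=5, base_backoff=0.1):
              """Replay WAL mutations with exponential backoff on transient errors."""
              for mutation in wal_entries:
                  for attempt in range(max_attempts):
                      try:
                          target.apply(mutation)
                          break
                      except ConnectionError:
                          time.sleep(base_backoff * (2 ** attempt))
                  else:
                      send_to_dlq(mutation)

          replay([{"op": "put", "id": "user:1", "value": "a"}], FlakyDatastore())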

          There are some major differences between using WAL and directly using Kafka/SQS. WAL is an abstraction on the underlying queues, so the underlying technology can be swapped out depending on use cases with no code changes. WAL emphasizes an easy yet effective API that saves users from complicated setups and configurations. We leverage the control plane to pivot technologies behind WAL when needed without app or client intervention.

          WAL usage at Netflix

          Delay Queue

          The most common use case for WAL is as a Delay Queue. If an application needs to send a request at a certain time in the future, it can offload the request to WAL, which guarantees that the request will land after the specified delay.

          Netflix’s Live Origin processes and delivers Netflix live stream video chunks, storing its video data in a Key-Value abstraction backed by Cassandra and EVCache. When Live Origin decides to delete certain video data after an event is completed, it issues delete requests to the Key-Value abstraction. However, the large number of delete requests arriving in a short burst interferes with the more important real-time read/write requests, causing performance issues in Cassandra and timeouts for the incoming live traffic. To get around this, Key-Value issues the delete requests to WAL first, with a random delay and jitter set for each delete request. After the delay, WAL sends the delete requests back to Key-Value. Since the deletes now arrive as a flatter curve of requests over time, Key-Value can send them to the datastore with no issues.

          Requests being spread out over time through delayed requests
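
          A rough sketch of the jittered-delay idea follows: instead of issuing all deletes at once, each delete is tagged with a randomized delay so the load arrives as a flatter curve. The wal_enqueue function and the specific window and jitter values are hypothetical stand-ins, not the real WAL client API.

          import random

          # Hypothetical stand-in for the WAL client call.
          def wal_enqueue(request, delay_seconds):
              print(f"enqueue {request} to WAL, deliver after ~{delay_seconds:.1f}s")

          # Spread a burst of deletes over a window by attaching a random delay plus jitter.
          def schedule_deletes(keys, window_seconds=3600, jitter_seconds=60):
              for key in keys:
                  delay = random.uniform(0, window_seconds) + random.uniform(0, jitter_seconds)
                  wal_enqueue({"op": "delete", "key": key}, delay)

          schedule_deletes(["chunk-001", "chunk-002", "chunk-003"])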

          Additionally, WAL is used by many services that utilize Kafka to stream events, including Ads, Gaming, Product Data Systems, etc. Whenever Kafka requests fail for any reason, the client apps send WAL a request to retry the Kafka request with a delay. This abstracts away the backoff-and-retry layer of Kafka for many teams, increasing developer efficiency.

          Backoff and delayed retries for clients producing to Kafka


          Backoff and delayed retries for clients consuming from Kafka
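
          The sketch below shows the shape of this pattern: when a Kafka produce fails, the client hands the record to WAL with a delay instead of running its own retry loop. Both client classes are hypothetical placeholders, not real library APIs.

          # Illustrative pattern: fall back to WAL when a Kafka produce fails.
          class HypotheticalKafkaProducer:
              def send(self, topic, value):
                  raise TimeoutError("broker unavailable")          # simulate a failure

          class HypotheticalWalClient:
              def write(self, namespace, payload, delay_seconds):
                  print(f"WAL will retry produce to {payload['topic']} in {delay_seconds}s")

          def produce_with_wal_fallback(producer, wal, topic, value, retry_delay=30):
              try:
                  producer.send(topic, value)
              except Exception:
                  # WAL owns backoff and retry; the app does not need its own retry loop.
                  wal.write("kafka_retries", {"topic": topic, "value": value}, retry_delay)

          produce_with_wal_fallback(HypotheticalKafkaProducer(), HypotheticalWalClient(),
                                    "ads.impressions", b"evt-42")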

          Cross-Region Replication

          WAL is also used for global cross-region replication. The architecture of WAL is generic and allows any datastore or application to onboard for cross-region replication. Currently, the largest use case is EVCache, and we are working to onboard other storage engines.

          EVCache is deployed as clusters of Memcached instances across multiple regions, where each cluster in each region holds the same data. Each region’s client apps write, read, or delete data from the EVCache cluster in the same region. To ensure global consistency, the EVCache client of one region replicates write and delete requests to all other regions. To implement this, the EVCache client that originated the request sends the request to a WAL corresponding to the EVCache cluster and region.

          Since the EVCache client acts as the message producer group in this case, WAL only needs to deploy the message consumer groups. A consumer group is set up for each target region; each group reads from the Kafka topic and sends the replicated write or delete requests to a Writer group in its target region. The Writer group then replicates the request to the EVCache server in the same region.


          EVCache Global Cross-Region Replication Implemented through WAL
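
          A simplified sketch of the fan-out described above: the originating region writes locally and hands the mutation to WAL, and one consumer per target region forwards it to that region’s writer. The region names and function interfaces are illustrative.

          # Simplified cross-region replication fan-out (illustrative only).
          REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

          def local_write(region, key, value):
              print(f"[{region}] wrote {key}={value} to local EVCache")

          def replicate(origin_region, key, value):
              local_write(origin_region, key, value)
              mutation = {"key": key, "value": value, "origin": origin_region}
              # One consumer group per target region reads the queued mutation and
              # hands it to the Writer group in its own region.
              for target in REGIONS:
                  if target != origin_region:
                      print(f"[consumer->{target}] replaying {mutation['key']}")
                      local_write(target, key, value)

          replicate("us-east-1", "profile:42", "dark-mode")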

          The biggest benefit of this approach, compared to our legacy architecture, is the ability to migrate from a multi-tenant architecture to a single-tenant architecture for the most latency-sensitive applications. For example, Live Origin has its own dedicated Message Consumer and Writer groups, while a less latency-sensitive service can remain multi-tenant. This reduces the blast radius of incidents and prevents noisy neighbor issues.

          Multi-Table Mutations

          WAL is used by the Key-Value service to build the MutateItems API. WAL enables the API’s multi-table and multi-id mutations by implementing 2-phase commit semantics under the hood. For this discussion, we can assume that the Key-Value service is backed by Cassandra and that each of its namespaces represents a table in a Cassandra database.

          When a Key-Value client issues a MutateItems request to the Key-Value server, the request can contain multiple PutItems or DeleteItems requests. Each of those requests can go to different ids and namespaces, that is, different Cassandra tables.

          message MutateItemsRequest {
            repeated MutationRequest mutations = 1;

            message MutationRequest {
              oneof mutation {
                PutItemsRequest put = 1;
                DeleteItemsRequest delete = 2;
              }
            }
          }

          The MutateItems request operates on an eventually consistent model. When the Key-Value server returns a success response, it guarantees that every operation within the MutateItemsRequest will eventually complete successfully. Individual put or delete operations may be partitioned into smaller chunks based on request size, meaning a single operation could spawn multiple chunk requests that must be processed in a specific sequence.
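
          To illustrate the chunking behavior, the sketch below splits one oversized put into ordered chunk requests that carry a sequence index for reassembly. The chunk size and field names are assumptions made for this example.

          # Split a large value into ordered chunks, each tagged for ordered reassembly.
          # The tiny 3-byte chunk size is only for demonstration.
          def chunk_put(item_id, value: bytes, chunk_size=3):
              chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
              return [
                  {"id": item_id, "chunk_index": i, "total_chunks": len(chunks), "data": c}
                  for i, c in enumerate(chunks)
              ]

          for request in chunk_put("item-7", b"abcdefgh"):
              print(request)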

          Two approaches exist to ensure Key-Value client requests achieve success. The synchronous approach relies on client-side retries until all mutations complete. However, this method introduces significant challenges: the datastores might not natively support transactions, so there is no guarantee that the entire request succeeds; when more than one replica set is involved in a request, latency can spike in unexpected ways and the entire request chain must be retried; and partial failures can leave the database in an inconsistent state if some mutations succeed while others fail, requiring complex rollback mechanisms or compromising data integrity. The asynchronous approach was ultimately adopted to address these performance and consistency concerns.

          Given Key-Value’s stateless architecture, the service cannot maintain the mutation success state or guarantee order internally. Instead, it leverages a Write-Ahead Log (WAL) to guarantee mutation completion. For each MutateItems request, Key-Value forwards individual put or delete operations to WAL as they arrive, with each operation tagged with a sequence number to preserve ordering. After transmitting all mutations, Key-Value sends a completion marker indicating the full request has been submitted.

          The WAL producer receives these messages and persists the content, state, and ordering information to durable storage. The producer then forwards only the completion marker to the message queue. The message consumer retrieves these markers from the queue and reconstructs the complete mutation set by reading the stored state and content data, ordering operations according to their designated sequence numbers. Failed mutations trigger re-queuing of the completion marker for subsequent retry attempts.

          Architecture of Multi-Table Mutations through WAL


          Sequence diagram for Multi-Table Mutations through WAL
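
          The toy model below mimics that flow in miniature: operations are staged in durable storage with sequence numbers, only a completion marker goes onto the queue, and the consumer reconstructs the ordered set and re-queues the marker on failure. It is a sketch of the behavior described above, not the real implementation.

          from collections import defaultdict

          durable_store = defaultdict(list)     # request_id -> staged, sequenced mutations
          marker_queue = []                     # carries only completion markers

          def wal_produce(request_id, mutations):
              for seq, mutation in enumerate(mutations):
                  durable_store[request_id].append({"seq": seq, **mutation})
              marker_queue.append({"complete": request_id})        # completion marker

          def wal_consume(apply_fn):
              marker = marker_queue.pop(0)
              request_id = marker["complete"]
              ordered = sorted(durable_store[request_id], key=lambda m: m["seq"])
              try:
                  for mutation in ordered:
                      apply_fn(mutation)
              except Exception:
                  marker_queue.append(marker)    # re-queue the marker for a later retry

          wal_produce("req-1", [{"op": "put", "table": "a"}, {"op": "delete", "table": "b"}])
          wal_consume(lambda m: print("applied", m))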

          Closing Thoughts

          Building Netflix’s generic Write-Ahead Log system has taught us several key lessons that guided our design decisions:

          • Pluggable Architecture is Core: The ability to support different targets, whether databases, caches, queues, or upstream applications, through configuration rather than code changes has been fundamental to WAL’s success across diverse use cases.
          • Leverage Existing Building Blocks: We had control plane infrastructure, Key-Value abstractions, and other components already in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve.
          • Separation of Concerns Enables Scale: By separating message processing from consumption and allowing independent scaling of each component, we can handle traffic surges and failures more gracefully.
          • Systems Fail — Consider Tradeoffs Carefully: WAL itself has failure modes, including traffic surges, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to handle these, but the tradeoffs must be understood.

          Future work

          • We are planning to add secondary indices to the Key-Value service, leveraging WAL.
          • WAL can also be used by a service to guarantee that a request is sent to multiple datastores at the same time, for example a database and a backup, or a database and a queue.

          Acknowledgements

          Launching WAL was a collaborative effort involving multiple teams at Netflix, and we are grateful to everyone who contributed to making this idea a reality. We would like to thank the following teams for their roles in this launch.

          • Caching team — Additional thanks to Shih-Hao Yeh and Akashdeep Goel for contributing to cross-region replication for KV, EVCache, etc., and for owning this service.
          • Product Data System team — Carlos Matias Herrero and Brandon Bremen for contributing to the delay queue design, being early adopters of WAL, and giving valuable feedback.
          • KeyValue and Composite abstractions team — Raj Ummadisetty for feedback on API design and mutateItems design discussions. Rajiv Shringi for feedback on API design.
          • Kafka and Real Time Data Infrastructure teams — Nick Mahilani for feedback and inputs on integrating the WAL client into Kafka client. Sundaram Ananthanarayan for design discussions around the possibility of leveraging Flink for some of the WAL use cases.
          • Joseph Lynch for providing strategic direction and organizational support for this project.