How AWS S3 serves 1 petabyte per second on top of slow HDDs

Mike's Notes

Fascinating.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Big Data Stream
  • Home > Handbook > 

Last Updated

01/10/2025

How AWS S3 serves 1 petabyte per second on top of slow HDDs

By: Stanislav Kozlovski
Big Data Stream: 24/09/2025

Seven years of experience with Kafka; committer; writes concisely about Kafka and big data engineering.

Learn how Amazon built the backbone of the modern web that scales to 1 PB/s and 150M QPS on commodity hard drives

Everyone knows what AWS S3 is, but few comprehend the massive scale it operates at, nor what it took to get there.

In essence - it’s a scalable multi-tenant storage service with APIs to store and retrieve objects, offering extremely high availability [1] and durability [2] at a relatively low cost [3].

Scale

  • 400+ trillion [4] objects
  • 150 million requests a second (150,000,000/s)
  • > 1 PB/s of peak traffic
  • tens of millions of disks

Behind It All?

Hard drives.

How S3 achieves this scale is an engineering marvel. To understand and appreciate the system, we first must appreciate its core building block - the hard drive.

Hard Disk Drives (HDDs) are an old, somewhat out-of-favor technology largely superseded by SSDs. They are physically fragile, constrained for IOPS and high in latency.

But they nailed something flash still hasn’t: dirt cheap commodity economics:

How HDD prices have cratered over the last decades (src: https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?time=earliest..2023)

Over their lifetime, HDDs have seen exponential improvement:

  • price: 6,000,000,000x cheaper per byte (inflation-adjusted)
  • capacity: increased 7,200,000x
  • size: decreased 5,000x
  • weight: decreased 1,235x

But one issue has consistently persisted - they’re constrained for IOPS. They have been stuck at 120 IOPS for the last 30 years.

Latency also hasn't kept pace with the rest.

This means that per byte, HDDs are becoming slower.

Why are HDDs slow?

HDDs are slow because of physics.

They require real-world mechanical movement to read data (unlike SSDs, which use electrical signals travelling at ~50% of the speed of light). Here is a good visualization:

src: https://animagraffs.com/hard-disk-drive/ (although it seems broken as of writing)

The platter spins around the spindle at about 7200 revolutions per minute (RPM) [5].

The mechanical arm (actuator) with its read/write head physically moves across the platter and waits for it to rotate until it reaches the precise LBA (logical block address) where the data resides.

Accessing data from the disk therefore involves two mechanical operations and one electrical.

The physical movements are:

  • seek - the act of the actuator moving left or right to the correct track on the platter
    • full-platter seek time: ~25ms [6]
    • half-platter seek time (avg): ~8-9ms [7]
  • rotation - waiting for the spindle to spin the disk until it matches the precise address on the platter’s track
    • full rotational latency: ~8.3ms [8]
    • half rotational latency (avg): ~4ms

And then the electrical one:

  • transfer rate - the act of the head shoving bits off the platter across the bus into memory (the drive’s internal cache)
    • reading 0.5MB: ~2.5ms on average [9]

Sequential I/O

Hard Drives are optimized for sequential access patterns.

Reading/writing bytes that are laid out consecutively on the disk is fast. The natural rotation of the platter cycles through the block of bytes and no excessive seeks need to be performed (the actuator stays still).

The easiest and most popular data structure with sequential access patterns is the Log. Popular distributed systems like Apache Kafka are built on top of it and through sequential access patterns squeeze out great performance off cheap hardware.

It is no surprise that S3’s storage backend - ShardStore - is based on a log-structured merge tree (LSM) itself.

In essence, writes are easy for S3. Because they are written sequentially to the disk, they take advantage of the HDD's performance. (Similar to Kafka, I bet they batch pending PUTs so as to squeeze more sequential throughput out of the disk via appends to the log.) [10]
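To make that concrete, here is a minimal Python sketch of the general batching idea - buffering pending writes and flushing them as one sequential append to a log file. It is my own illustration of the technique, not S3's (or Kafka's) actual code:

```python
import threading

class AppendLog:
    """Toy append-only log that batches buffered writes into one sequential append."""

    def __init__(self, path):
        self._file = open(path, "ab")   # append-only: all writes are sequential
        self._pending = []              # buffered PUT payloads
        self._lock = threading.Lock()

    def put(self, payload: bytes):
        with self._lock:
            self._pending.append(payload)

    def flush(self):
        """Write every buffered payload as a single contiguous append."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._file.write(b"".join(batch))   # one sequential write for the whole batch
            self._file.flush()

# A background thread could call flush() every few milliseconds,
# trading a little latency for much better sequential throughput.
log = AppendLog("/tmp/append-log-sketch.bin")
log.put(b"object-a")
log.put(b"object-b")
log.flush()
```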

Reads, however, are trickier. AWS can’t control what files the user requests - so they have to jump around the drive when serving them.

Random I/O

In the average case, a read on a random part of the drive would involve half of the full physical movement.

The average read latency is the sum of both average physical movements plus the transfer rate. Overall, you’re looking at ~16ms on average to read 0.5 MB of random I/O from a drive. That’s very slow.

Since a second has 1000 milliseconds, you’d only achieve ~32MB/s of random I/O from a single drive.
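Here is the back-of-the-envelope arithmetic behind those two numbers, using the average seek, rotation, and transfer figures quoted above:

```python
# Average random read of 0.5 MB, using the figures quoted above.
avg_seek_ms = 9.0       # ~8-9 ms average seek
avg_rotation_ms = 4.15  # half of the ~8.3 ms full rotation at 7200 RPM
transfer_ms = 2.5       # 0.5 MB at ~200 MB/s

read_latency_ms = avg_seek_ms + avg_rotation_ms + transfer_ms   # ~15.7 ms, i.e. roughly 16 ms
reads_per_second = 1000 / read_latency_ms                       # ~64 random reads per second
random_throughput_mb_s = reads_per_second * 0.5                 # ~32 MB/s per drive

print(f"{read_latency_ms:.1f} ms per read, ~{random_throughput_mb_s:.0f} MB/s of random I/O")
```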

Because physical movements are a bottleneck - disks have been stuck at this same random I/O latency for the better part of 30 years.

They are simply not efficient under random access patterns. That’s when you’d opt for SSDs. But if you have to store massive amounts of data - SSDs become unaffordable. [11]

This becomes a pickle when you are S3 - a random access system [12] that also stores massive amounts of data.

Yet, S3 found a way to do it - it delivers tolerable latency [13] and outstanding [14] throughput while working around the physical limitations.

Need for Parallelism

S3 solves this problem through massive parallelism.

They spread the data out in many (many! [15]) hard drives so they can achieve massive read throughput by utilizing each drive in parallel.

  • Storing a 1 TB file on a single HDD limits your read rate to that single drive's max throughput (~300 MB/s [16]).
  • Splitting that same 1 TB file across 20,000 different HDDs means you can read it in parallel at the sum of all HDDs’ throughput (TB/s).

They do this via Erasure Coding.

Erasure Coding

Redundancy schemes are common practice in storage systems.

They are most often associated with data durability - protecting against data loss when hardware fails.

S3 uses Erasure Coding (EC). It breaks data into K shards with M redundant “parity” shards. EC allows you to reconstruct the data from any K shards out of the total K+M shards.

The S3 team shares that they use a 5-of-9 scheme. They shard each object into 9 pieces - 5 Regular Shards (K) and 4 Parity Shards (M).

This approach tolerates up to 4 lost shards. To reconstruct the object, they need any 5 of the 9 shards.


This scheme helps S3 strike a balance - it doesn't take much extra disk capacity, yet it still provides flexible I/O.

  • EC makes them store 1.8x the original data.
    • A naive alternative like 3-way replication would result in 3x the data. That extra 1.2x starts to matter when we’re talking hundreds of exabytes.
  • EC gives them 5-of-9 possible read sources - an ample hedge against node bottlenecks
    • 3-way replication would only give them 3 sources, meaning it can tolerate at most 2 straggler nodes. If all 3 nodes are hot, performance suffers. 5-of-9 EC tolerates up to 4 straggler nodes (2x more).
  • EC’s 5-of-9 sources also offer much more burst write/read I/O due to parallelism and the actual sharding of the data [17]

An under-appreciated aspect of EC is precisely its ability to distribute load. Such schemes spread the hot spots of a system out and give it the flexibility to steer read traffic in a balanced way. And since shards are small, firing off hedge requests [18] to dodge stragglers is far cheaper than with full replicas.
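Here is a minimal sketch of the hedge-request idea: fire the primary shard read, and if it hasn't completed within the usual p95 latency, fire one more to a different source and take whichever finishes first. The fetch_shard function, the source names, and the threshold are hypothetical stand-ins, not S3 client code:

```python
import concurrent.futures as cf
import random
import time

P95_SECONDS = 0.08  # hypothetical p95 latency threshold

def fetch_shard(source, shard_id):
    """Hypothetical stand-in for a network read of one shard."""
    time.sleep(random.uniform(0.01, 0.15))  # simulate variable latency
    return f"shard {shard_id} from {source}"

def hedged_fetch(sources, shard_id):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch_shard, sources[0], shard_id)]
        done, _ = cf.wait(futures, timeout=P95_SECONDS, return_when=cf.FIRST_COMPLETED)
        if not done:  # primary is slow: hedge to a second source
            futures.append(pool.submit(fetch_shard, sources[1], shard_id))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        # (the pool shutdown still waits for the straggler; fine for a sketch)
        return next(iter(done)).result()

print(hedged_fetch(["disk-a", "disk-b"], shard_id=3))
```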

Parallelism in Action

S3 leverages parallelism in three main ways:

  1. From the user’s perspective - upload/download the file in chunks.
  2. From the client’s perspective - send requests to multiple different front-end servers.
  3. From the server’s perspective - store an object in multiple storage servers.

Any part of the end-to-end path can become a bottleneck, so it’s important to optimize everything.

1. Across Front-end Servers

Instead of requesting all the files through one connection to one S3 endpoint, users are encouraged to open as many connections as necessary. This happens behind the scenes in the library code through an internal HTTP connection pool.

This approach utilizes many different endpoints of the distributed system, ensuring no single point in the infrastructure (e.g. front-end proxies, caches, etc.) becomes too hot.

2. Across Hard Drives

Instead of storing the data on a single hard drive, the system breaks it into shards via EC and spreads them across multiple storage backends.


copyright: AWS; from this re:Invent presentation.

3. Across PUT/GET Operations

Instead of sending one request through a single thread and HTTP connection, the client chunks it into 10 parts and uploads each in parallel. [19]

  • PUT requests support multipart upload, which AWS recommends in order to maximize throughput by leveraging multiple threads.
  • GET requests similarly support an HTTP header denoting that you read only a particular range of the object (a byte-ranged GET). AWS again recommends this for achieving higher aggregate throughput than a single whole-object read request.

Uploading at 1 GB/s to a single server may be difficult, but uploading 100 chunks at 10 MB/s each to 100 different servers is very practical.
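As an illustration, here is a sketch of parallel byte-ranged GETs using boto3. The bucket and key names are placeholders, and retries and error handling are omitted:

```python
import concurrent.futures as cf

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big-object.bin"   # placeholder names
PART_SIZE = 8 * 1024 * 1024                   # fetch the object in 8 MiB ranges

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + PART_SIZE, size) - 1)
          for start in range(0, size, PART_SIZE)]

def fetch(byte_range):
    start, end = byte_range
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

# Each ranged GET can be served by a different front-end host and a different set of shards.
with cf.ThreadPoolExecutor(max_workers=16) as pool:
    parts = dict(pool.map(fetch, ranges))

data = b"".join(parts[start] for start, _ in ranges)
```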

This simple idea goes a long way.

Avoiding Hot Spots

S3 now finds itself with a difficult problem. They have tens of millions of drives, hundreds of millions of parallel requests per second and hundreds of millions of EC shards to persist per second.

How do they spread this load around effectively so as to avoid certain nodes/disks overheating?

As we said earlier, a single disk can do only around ~32 MB/s of random I/O. It seems trivial to hit that bottleneck. Not to mention that any additional system maintenance work, like rebalancing data around for more efficient spreading, also takes valuable I/O away from the disk.

Forming hot spots in a distributed system is dangerous because it can easily trigger a domino-like spiral into system-wide degradation [20].

Needless to say, S3 is very careful in trying to spread data around. Their solution is again deceptively simple:

  1. randomize where you place data on ingestion
  2. continuously rebalance it
  3. scale & chill

Shuffle Sharding & Power of Two

Where you place data initially is key to performance. Moving it later is more expensive.

Unfortunately, at write time you have no good way of knowing whether the data you’re about to persist is going to be accessed frequently or not.

Knowing the perfect (least-loaded) HDD to place new data on is also impossible at this scale. You can't keep a synchronous globally-consistent view when you are serving hundreds of millions of requests per second across tens of millions of drives. This approach would also risk load correlation - placing similar workloads together and having them burst together at once.

A key realization is that picking at random works better in this scenario. 💡

It’s how AWS intentionally engineers decorrelation into their system:

  1. A given PUT picks a random set of drives
  2. The next PUT, even if it's targeting the same key/bucket, picks a different set of near-random drives.

The way they do it is through the so-called Power of Two Random Choices:

Power of Two Random Choices: a well-studied result in load balancing which says that picking the least-loaded of two randomly chosen nodes yields much better results than picking a single node at random.
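A toy simulation (my own, not AWS code) shows why this works: placing shards on the least-loaded of two randomly sampled drives keeps the hottest drive far less loaded than purely random placement:

```python
import random

def place_shards(num_drives=1_000, num_shards=100_000, choices=1):
    load = [0] * num_drives
    for _ in range(num_shards):
        candidates = random.sample(range(num_drives), choices)
        target = min(candidates, key=lambda d: load[d])  # least loaded of the sampled drives
        load[target] += 1
    return max(load)  # load on the hottest drive (the mean is 100 shards per drive)

random.seed(42)
print("one random choice, hottest drive:", place_shards(choices=1))
print("best of two choices, hottest drive:", place_shards(choices=2))
# With two choices the hottest drive carries far less excess load,
# so no single disk's limited random I/O becomes the bottleneck.
```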

Rebalancing

Another key realization is that newer data chunks are hotter than older ones.

Fresh data is accessed more frequently. As it grows older, it gets accessed less.

All hard drives therefore eventually cool off in usage as they get filled with data and that data ages. The result is drives that are full on storage capacity but have ample spare I/O capacity.

AWS has to proactively rebalance cold data out of some drives (so as to free up space) and into others (so as to make use of the free I/O).

Data rebalances are also needed when new racks of disks are added to S3. Each rack contains 20 PB of capacity [21], and every disk in there is completely empty. The system needs to proactively spread the load around the new capacity.

Suffice to say - S3 constantly rebalances data around.

Chill@Scale

The last realization is perhaps the least intuitive: the larger the system becomes, the more predictable it is.

AWS experienced so-called workload decorrelation as S3 grew. That is the phenomenon of load smoothing out once it's aggregated at a large enough scale. While their peak demand keeps growing in size, their peak-to-mean delta is collapsing.

This is because storage workloads are inherently very bursty - they demand a lot at once, and then may remain idle for a long time (months).

Because independent workloads do not burst together, the more workloads you cram together - the more those idle spots get filled up and the more predictable the system becomes in aggregate. 💡
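A toy simulation (with assumed burst probabilities, not AWS data) illustrates the effect: aggregate more independent bursty workloads and the peak-to-mean ratio of the total load keeps shrinking:

```python
import random

def peak_to_mean(num_workloads, periods=1_000, burst_prob=0.05, burst_io=100):
    """Aggregate independent on/off workloads and report peak load over mean load."""
    totals = []
    for _ in range(periods):
        # each workload bursts rarely and independently; otherwise it sits idle
        total = sum(burst_io for _ in range(num_workloads) if random.random() < burst_prob)
        totals.append(total)
    return max(totals) / (sum(totals) / len(totals))

random.seed(7)
for n in (1, 10, 100, 1_000):
    print(f"{n:>5} workloads -> peak/mean ~ {peak_to_mean(n):.1f}")
# The aggregate load gets smoother as more independent workloads share the same fleet.
```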


copyright: AWS; from this re:Invent presentation.

Summary

AWS S3 is a massively multi-tenant storage service. It’s a gigantic distributed system consisting of many individually slow nodes that on aggregate allow you to access data faster than any single node can provide. S3 achieves this through:

  • massive parallelization across the end-to-end path (user, client, server)
  • neat load-balancing tricks like the power of two random choices
  • spreading out data via erasure coding
  • lowering tail latency via hedge requests
  • the economies of multi-tenancy at world scale

It started as a service optimized for backups and for video and image storage for e-commerce websites - but it eventually grew into the main storage system used for analytics and machine learning on massive data lakes.

Nowadays, the growing trend is for entire data infrastructure projects to be based on top of S3. This gives them the benefits of stateless nodes (easy scaling, less management) while outsourcing difficult durability, replication and load-balancing problems to S3. And get this - it also reduces cloud costs. [22]

References

S3 has a lot of other goodies up its sleeve, including:

  • shuffle sharding at the DNS level
  • client library hedging requests by cancelling slow requests that pass the p95 threshold and sending new ones to a different host
  • software updates done erasure-coding-style, including rolling out their brand-new ShardStore storage system without any impact to their fleet
  • conway’s law and how it shapes S3’s architecture (consisting of 300+ microservices)
  • their durability culture, including continuous detection, durable chain of custody, a design process that includes durability threat modelling and formal verification

These are generally shared in their annual S3 Deep Dive at re:Invent:

  • 2022 (video)
  • 2023 (video)
  • 2024 (video)

Building and operating a pretty big storage system called S3 (article)

Thank you to the S3 team for sharing what they’ve built, and thank you for reading!

  1. S3 has never been down for more than 5 hours in its entire existence. And that incident was 8 years ago, in just one region (out of 38) in AWS. It was considered one of AWS' most impactful outages of all time.
  2. S3 markets itself as being designed for 11 nines of durability. Careful with the wording - they don’t legally promise 99.999999999% durability. In fact, Amazon does not legally provide any SLA for durability.
  3. By relatively low cost, I mean relative to the other storage you can buy on AWS. S3 is still ~$21.5-$23.5 per TB of storage. In fact, S3 hasn't lowered its prices in 8 years despite HDD prices falling 60% since. When I ran back-of-the-napkin maths for what it'd cost for me to build my own S3 bare metal, the cost came out to $0.875 per TB of storage (25x cheaper). Alternatively, hosting it on Hetzner would be around $5.73 per TB.
  4. 400,000,000,000,000
  5. The first 7200 rpm drive was the Seagate Barracuda released in 1992. Today, rpm has largely remained unchanged. Larger 15k-ish rpm drives exist, but aren’t super common.
  6. depends a lot on the drive, 20-30ms is the range for a 7.2k RPM drive
  7. notice this isn’t half the max seek time, it’s closer to 1/3rd. The actual number is around 0.32-0.35 (interesting paper on the matter)
  8. 7200 rotations per minute == 7200 rotations per 60000 milliseconds == 8.33ms per rotation
  9. HDDs on average have a ~170-200 MB/s transfer rate; 200 MB / 1000 ms == 0.2 MB/ms, so 0.5 MB takes ~2.5 ms. They're simply not optimized for random access like this.
  10. Kafka loves batching entries. It batches on the client (by waiting), it batches in the protocol (by merging entries) and it batches on the server (by storing in page cache and utilizing the OS’ async flush). It’s such an obvious perf gain that S3 must do something similar in their back-end and storage system.
  11. although this is slowly but surely beginning to change for certain data thresholds. SSDs have massively deflated in price in just the last 15 years.
  12. In aggregate, S3 exhibits random access. You as a tenant can PUT/GET any blobs of any size. An average S3 disk would therefore consist of blobs from thousands of tenants. If they all attempt to access their data simultaneously, the drive simply cannot serve every request at once.
  13. Not a lot of benchmarks actually exist here. Testing 0.5MB files, I got writes at ~140ms p99 and 26ms p50, reads at 86ms p99. Larger files allegedly get larger p99, and they vary throughout the week.
  14. At least one public number is Anthropic driving tens of terabytes per second. There are likely much larger single customer workloads out there - S3 does more than a petabyte a second!
  15. AWS shares that tens of thousands of their customers have their data spread over 1,000,000 disks. This is a great example of how multi-tenancy at scale can convert the financially impossible into the affordable. It would be prohibitively expensive for any single tenant to deploy a million HDDs themselves, but when shared behind a multi-tenant system - it becomes surprisingly cheap.
  16. e.g. a modern cheap 20TB HDD maxes out at around 291 MB/s of data transfer: https://www.westerndigital.com/products/internal-drives/wd-gold-sata-hdd?sku=WD203KRYZ; note these are marketing numbers too
  17. Let me explain in detail. Assume you have a 100MB object and you want to write/read it in 1 second (for simplicity). 3x replication means you need 3 nodes that give you 100MB/s reads or writes. 5-of-9 EC means you need 9 nodes that give you 20MB/s reads or writes. (each shard is 20MB (100MB/5) because the data is split into 5 regular shards, the other 4 are parity “copies” each 20MB too)
  18. The concept of a hedge request was popularized by this Google paper “The Tail at Scale”. It essentially talks about how fanout requests (where a root request results in many sub-requests, e.g like S3’s GETs requesting multiple shards) can significantly reduce their tail latency by speculatively sending extra requests (i.e if you need 5 sub-requests to build an object - send 6). This extra request is sent only once one of the sub-requests surpasses the usual p95 latency. S3’s client libraries also utilize this concept.
  19. An interesting detail is that each part of the multi-part upload must be getting Erasure Coded 5-of-9 too. So a single object uploaded through multipart upload can consist of hundreds of shards.
  20. If too many requests hit the same disk at the same point in time, the disk starts to stall because its limited I/O is exhausted. This accumulates tail latency to requests that depend on the drive. This delay impacts other operations like writes. It also gets amplified up the stack in other components beyond the drive. If left unchecked, it can cause a cascade that significantly slows down the whole system.
  21. As someone with no data center experience, I find it super cool when Amazon shares pictures of what the physical disks look like. Here is an example of one such rack of disks. It consists of 1000 drives - 20TB each. It’s said this rack weighs more than a car, and Amazon had to reinforce the flooring in their data centers to support it.


  22. Apache Kafka (what I am most familiar with) has been seeing the so-called “Diskless” trend where the write path uses S3 instead of local disks. This trades off higher latency for lower costs (by 90% [!]). Similar projects exist - Turbopuffer (Vector DB built on S3), SlateDB (embedded LSM on S3), Nixiesearch (Lucene on S3). In general, every data infra project seems to be offloading as much as possible to object storage. (Clickhouse, OpenSearch, Elastic). Before Diskless, Kafka similarly used a two-tier approach where cold data was offloaded to S3 (for a 10x storage cost saving)

Agents in Production - MLOps x Prosus

Mike's Notes

I'm attending this to learn. Pipi 9 is composed of multiple agents, and many Pipis can form swarms that interact. Always room for improvement.

Maybe I should join the MLOps Community. Done.

Resources

  • Resource

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

14/11/2025

Agents in Production - MLOps x Prosus

By: MLOps Community
MLOps Community: 14/11/2025

Free Virtual Conference – November 18, 2025

After a year of hype, we’re hearing from the people deploying agents at scale.

Join practitioners from Meta, OpenAI, and Google as they share how agentic AI is being deployed in real systems. What’s working, what’s breaking, and what’s next. We have over 25 speakers sharing their hard-earned lessons on deploying agents at scale.

Free Registration

Some talk highlights:

  • Aditya Gautam (Meta): Building multi-agent systems to detect, correct, and contain misinformation.
  • Teodora Musatoiu (OpenAI): Lessons from shipping enterprise-scale AI and the new patterns emerging from real deployments.
  • Jasleen Singh (Google): Bringing AI agents together through a common communication framework.
  • Plus, a panel featuring Arushi Jain (Microsoft), Swati Bhatia (Google), and Julia Rose (Inworld AI) discussing how to harden agents for e-commerce scale – from RL alignment to system reliability in production.

Expect technical depth, real implementation stories, and a format that keeps your attention between sessions.


The MLOps Community is where machine learning practitioners come together to define and implement MLOps. Our global community is the default hub for MLOps practitioners to meet other MLOps industry professionals, share their real-world experience and challenges, learn skills and best practices, and collaborate on projects and employment opportunities. We are the world's largest community dedicated to addressing the unique technical and operational challenges of production machine learning systems.

Workspace testing offline status

Mike's Notes

Oops

Resources

  • Resource

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

14/11/2025

Workspace testing offline status

By: Mike Peters
On a Sandy Beach: 14/11/2025

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Testing Status

If you are testing the workspaces, note that they are currently not working. The current online mockup is getting a breaking version change. The help menu is being replaced to provide better in-context help, support, and self-learning. It should be back online in a few days. I will send out another message when it is good to go again.

What is "good taste" in software engineering?

Mike's Notes

Reminds me of code smell. A lot of code stinks.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Author > Martin Fowler
  • Home > Handbook > 

Last Updated

14/11/2025

What is "good taste" in software engineering?

By: Sean Goedecke
sean goedecke: 28/09/2025

Hi! I'm Sean Goedecke, an Australian software engineer. I mostly write about AI and large-company dynamics.

Technical taste is different from technical skill. You can be technically strong but have bad taste, or technically weak with good taste. Like taste in general, technical taste sometimes runs ahead of your ability: just like you can tell good food from bad without being able to cook, you can know what kind of software you like before you’ve got the ability to build it. You can develop technical ability by study and repetition, but good taste is developed in a more mysterious way.

Here are some indicators of software taste:

  • What kind of code “looks good” to you? What kind of code “looks ugly”?
  • Which design decisions do you feel really good about, and which ones are just fine?
  • Which software problems really bother you, to the point where you're worrying about them outside of work? Which problems can you just brush off?

I think taste is the ability to adopt the set of engineering values that fit your current project.

Why taste is different from skill

Aren’t the indicators above just a part of skill? For instance, doesn’t code look good if it’s good code? I don’t think so.

Let’s take an example. Personally, I feel like code that uses map and filter looks nicer than using a for loop. It’s tempting to think that this is a case of me being straightforwardly correct about a point of engineering. For instance, map and filter typically involve pure functions, which are easier to reason about, and they avoid an entire class of off-by-one iterator bugs. It feels to me like this isn’t a matter of taste, but a case where I’m right and other engineers are wrong.

But of course it’s more complicated than that. Languages like Golang don’t contain map and filter at all, for principled reasons. Iterating with a for loop is easier to reason about from a performance perspective, and is more straightforward to extend to other iteration strategies (like taking two items at a time). I don’t care about these reasons as much as I care about the reasons in favour of map and filter - that’s why I don’t write a lot of for loops - but it would be far too arrogant for me to say that engineers who prefer for loops are simply less skilled. In many cases, they have technical capabilities that I don’t have. They just care about different things.

In other words, our disagreement comes down to a difference in values. I wrote about this point in I don’t know how to build software and you don’t either. Even if the big technical debates do have definite answers, no working software engineer is ever in a position to know what those answers are, because you can only fit so much experience into one career. We are all at least partly relying on our own personal experience: on our particular set of engineering values.

What engineering taste actually is

Almost every decision in software engineering is a tradeoff. You’re rarely picking between two options where one is strictly better. Instead, each option has its own benefits and downsides. Often you have to make hard tradeoffs between engineering values: past a certain point, you cannot easily increase performance without harming readability, for instance [1].

Really understanding this point is (in my view) the biggest indicator of maturity in software engineering. Immature engineers are rigid about their decisions. They think it’s always better to do X or Y. Mature engineers are usually willing to consider both sides of a decision, because they know that both sides come with different benefits. The trick is not deciding if technology X is better than Y, but whether the benefits of X outweigh Y in this particular case.

In other words, immature engineers are too inflexible about their taste. They know what they like, but they mistake that liking for a principled engineering position. What defines a particular engineer’s taste?

In my view, your engineering taste is composed of the set of engineering values you find most important. For instance:

  • Resiliency. If an infrastructure component fails (a service dies, a network connection becomes unavailable), does the system remain functional? Can it recover without human intervention?
  • Speed. How fast is the software, compared to the theoretical limit? Is work being done in the hot path that isn’t strictly necessary?
  • Readability. Is the software easy to take in at a glance and to onboard new engineers to? Are functions relatively short and named well? Is the system well-documented?
  • Correctness. Is it possible to represent an invalid state in the system? How locked-down is the system with tests, types, and asserts? Do the tests use techniques like fuzzing? In the extreme case, has the program been proven correct by formal methods like Alloy?
  • Flexibility. Can the system be trivially extended? How easy is it to make a change? If I need to change something, how many different parts of the program do I need to touch in order to do so?
  • Portability. Is the system tied down to a particular operational environment (say, Microsoft Windows, or AWS)? If the system needs to be redeployed elsewhere, can that happen without a lot of engineering work?
  • Scalability. If traffic goes up 10x, will the system fall over? What about 100x? Does the system have to be over-provisioned or can it scale automatically? What bottlenecks will require engineering intervention?
  • Development speed. If I need to extend the system, how fast can it be done? Can most engineers work on it, or does it require a domain expert?

There are many other engineering values: elegance, modern-ness, use of open source, monetary cost of keeping the system running, and so on. All of these are important, but no engineer cares equally about all of these things. Your taste is determined by which of these values you rank highest. For instance, if you value speed and correctness more than development speed, you are likely to prefer Rust over Python. If you value scalability over portability, you are likely to argue for a heavy investment in your host’s (e.g. AWS) particular quirks and tooling. If you value resiliency over speed, you are likely to want to split your traffic between different regions. And so on [2].

It’s possible to break these values down in a more fine-grained way. Two engineers who both deeply care about readability could disagree because one values short functions and the other values short call-stacks. Two engineers who both care about correctness could disagree because one values exhaustive test suites and the other values formal methods. But the principle is the same - there are lots of possible engineering values to care about, and because they are often in tension, each engineer is forced to take some more seriously than others.

How to identify bad taste

I’ve said that all of these values are important. Despite that, it’s possible to have bad taste. In the context of software engineering, bad taste means that your preferred values are not a good fit for the project you’re working on.

Most of us have worked with engineers like this. They come onto your project evangelizing about something - formal methods, rewriting in Golang, Ruby meta-programming, cross-region deployment, or whatever - because it’s worked well for them in the past. Whether it’s a good fit for your project or not, they’re going to argue for it, because it’s what they like. Before you know it, you’re making sure your internal metrics dashboard has five nines of reliability, at the cost of making it impossible for any junior engineer to understand.

In other words, most bad taste comes from inflexibility. I will always distrust engineers who justify decisions by saying “it’s best practice”. No engineering decision is “best practice” in all contexts! You have to make the right decision for the specific problem you’re facing.

One interesting consequence of this is that engineers with bad taste are like broken compasses. If you’re in the right spot, a broken compass will still point north. It’s only when you start moving around that the broken compass will steer you wrong. Likewise, many engineers with bad taste can be quite effective in the particular niche where their preferences line up with what the project needs. But when they’re moved between projects or jobs, or when the nature of the project changes, the wheels immediately come off. No job stays the same for long, particularly in these troubled post-2021 times.

How to identify good taste

Good taste is a lot more elusive than technical ability. That’s because, unlike technical ability, good taste is the ability to select the right set of engineering values for the particular technical problem you’re facing. It’s thus much harder to identify if someone has good taste: you can’t test it with toy problems, or by asking about technical facts. You need there to be a real problem, with all of its messy real-world context.

You can tell you have good taste if the projects you’re working on succeed. If you’re not meaningfully contributing to the design of a project (maybe you’re just doing ticket-work), you can tell you have good taste if the projects where you agree with the design decisions succeed, and the projects where you disagree are rocky. Importantly, you need a set of different kinds of projects. If it’s just the one project, or the same kind of project over again, you might just be a good fit for that. Even if you go through many different kinds of projects, that’s no guarantee that you have good taste in domains you’re less familiar with [3].

How do you develop good taste? It’s hard to say, but I’d recommend working on a variety of things, paying close attention to which projects (or which parts of the project) are easy and which parts are hard. You should focus on flexibility: try not to acquire strong universal opinions about the right way to write software. What good taste I have I acquired pretty slowly. Still, I don’t see why you couldn’t acquire it fast. I’m sure there are prodigies with taste beyond their experience in programming, just as there are prodigies in other domains.

edit: this post got quite a few comments on Hacker News. There was some quite interesting discussion of how good taste applies to working with beginners, and how you have to mirror their taste a little in order to help them learn in a way they’ll understand. Some commenters disagreed that “taste” has any role in software engineering: they believe that all decisions have a single correct solution that engineers ought to arrive at analytically. I find this perspective pretty baffling. It seems obvious to me that there are many possible acceptable solutions for any given engineering problem, and that at some point the choice comes down to personal preference. Other commenters pointed out that I didn’t write about the customer or the business - fair enough, but I was more interested in writing about how even highly technical decisions are still influenced by taste.

  1. Of course this isn’t always true. There are win-win changes where you can improve several usually-opposing values at the same time. But mostly we’re not in that position.
  2. Like I said above, different projects will obviously demand a different set of values. But the engineers working on those projects will still have to draw the line somewhere, and they’ll rely on their own taste to do that.
  3. That said, I do think good taste is somewhat transferable. I don’t have much personal experience with this so I’m leaving it in a footnote, but if you’re flexible and attentive to the details in domain A, you’ll probably be flexible and attentive to the details in domain B.

Selling to the Enterprise

Mike's Notes

Me trying to understand their mindset.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

13/11/2025

Selling to the Enterprise

By: 
Stay SaaSy: 26/08/2020

Writing about scaling enterprise SaaS product and engineering teams from $0 to IPO and beyond.

The enterprise SaaS industry is booming. Much has been written about how to scale B2C tech, as consumer technology companies such as Google, Facebook, or Netflix enjoy near-monopolies, employ hordes of people, and command appropriately dominant mindshare. A relative lack of material has been written about building enterprise software, particularly from a product & engineering point of view.

In this post series we’ll describe techniques on building an easily sellable SaaS product. To set the stage, we’ll define “enterprise buyers” as companies that:

  • Employ 500+ people
  • Are prepared to pay around $100,000 or (much) more for strategic products
  • Regularly make long-term, multi-year software buying commitments

Enterprise Sales is Like Marriage

SaaS covers a wide spectrum of businesses: from enterprise products that cost tens of millions of dollars over multiple years, to freemium products that start at $10/month. The strategies that help you sell one will not (typically) work for the other. The process of settling on the former product is like getting married: months to years building the relationship, multiple suitors, and meetings with the parents, all with the intention of making a decades long commitment. Buying a $150/year utility SaaS, by comparison, is like a Tinder date that ends with getting busy in the back of a Prius.

Once you see the similarities between enterprise buying and marriage, many dimensions of how to sell high-value SaaS become more clear:

  • Like a marriage, it’s important to have integrity. Don’t oversell – put your best foot forward (brush your teeth before the first date!) but have integrity in how you sell yourself (don’t lie about your job, your height, or where you went to school).
  • You need to offer something differentiated. Nobody wants to settle – you need to be the best catch in at least one, and preferably multiple important dimensions.
  • There are a lot of stakeholders. “Meeting the parents” is a key step in serious dating; meeting your buyer’s Procurement and Infosec teams can be a (disturbingly) similar experience.
  • You need to be the right partners for each other. Enterprise sales relies upon product/customer fit – you should solve a need for your buyer, and you should be able to actually deliver. A common example: over-selling to a Fortune 500 company as a 40-person company, when you don’t have the stability, maturity, or headcount to deliver the service that they expect. Maybe you all are just meeting at the wrong moment in your lives.

When I began my career I didn’t understand this dynamic at all. I imagined that enterprise sales was like trying to meet someone at a bar: look as hot as possible, go to a few steak dinners, dance poorly and hope for the best. Superficial stuff. In reality, building a product that can sell to big-company buyers requires scoring well across a broad constellation of factors: the equivalents of getting a good education, having a steady job, and demonstrating that you know how to clean your apartment. Enterprise buyers expect to be courted, and as a builder you have the tools to put your best foot forward.

The Anatomy of an Enterprise Sale

These posts on how to sell to the enterprise are rooted in the core drivers that impact huge software purchases. Understanding these drivers as a product builder can help you navigate the arcane “dating” process behind closing your first (or next) enterprise logo:

  • Spending millions of dollars on a software solution is necessarily a complex process due to the sheer amount of “stuff” going on at a large company.
  • Enterprise software deployments have a complex web of stakeholders. The buyer of enterprise software is often different from the user, as software is (often) bought by executives but used by their teams. Decisions on which vendors to use can take on a political dimension, as they can make or break careers.
  • Change is extremely expensive to enterprise buyers because they’re massive and built for momentum over agility. Like a battleship, they can carry a lot of people very far but are a pain to turn. Enterprise companies value stability, and they want to commit upfront to a long-term partner.

In this series, we’ll discuss strategies for building a product that will sell to the enterprise, and that will keep them happy for years. Following posts will break down strategies we’ve found helpful, and hopefully provide some ideas as you develop your own products.

Takeaways

  • Building a product that can sell to the enterprise is difficult but rewarding.
  • Enterprise sales is like dating with intent to marry, with many of the same dynamics.
  • Enterprise buyers’ decision-making is influenced by several important drivers: the complexity of enterprise businesses, complicated webs of stakeholders, and the need to plan on long time horizons.

Workspaces for Aviation

Mike's Notes

This is where I will keep detailed working notes on creating Workspaces for Aviation. Eventually, these will become permanent, better-written documentation stored elsewhere. Hopefully, someone will come up with a better name than this working title.

This replaces coverage in Industry Workspace written on 13/10/2025.

Testing

The current online mockup is version 3 and will be updated frequently. If you are helping with testing, please remember to delete your browser cache so you see the daily changes. Eventually, a live demo version will be available for field trials.

Learning

(To come)

Why

(To come)

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

14/11/2025

Workspaces for Aviation

By: Mike Peters
On a Sandy Beach: 12/11/2025

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

Open-source

This open-source SaaS cloud system will be shared on GitHub and GitLab.

Dedication

This workspace is dedicated to the life and work of ..

Change Log

Ver 3 includes aircraft, airport, airspace, and flight.

Existing products

Features

This is a basic comparison of features found in aviation software.
[TABLE]

Data Model

words

Database Entities

  • Facility
  • Party
  • Party Relationship
  • Party Role
  • etc

Geodatabase

Airport Mapping Database (AMDB)

"An AMDB is a Geographic Information System (GIS) database of an airport describing:
  • the spatial layout of an airport;
  • the geometry of features (e.g. runways, taxiways, buildings) described as points, lines and polygons;
  • further information characterising the features and their functions which are stored as attributes (e.g. surface type, name/object identifier, runway slope).
  • AMDB are used in a wide variety of applications but mostly in on-board applications such as Electronic Flight Bags (EFBs). These applications are intended primarily to improve the user’s situational awareness and/or to supplement surface navigation, thereby increasing safety margins and operational efficiency.
Multiple user groups, such as pilots, controllers, aerodrome managers, aerodrome emergency/security personnel, etc., can benefit from using AMDBs."

Further Reading
  • ICAO Annex 15 Chapter 11
  • ICAO Annex 14 Chapter 2.1
  • EUROCAE ED-99C (RTCA DO-272C) — User Requirements for Aerodrome Mapping Information
  • EUROCAE ED-119B (RTCA DO-291B) — Interchange Standards for Terrain, Obstacle, and Aerodrome Mapping Data
  • Eurocontrol AMDB website

Standards

The workspace needs to comply with all international standards.

  • (To come)

API

  • (To come)

Support

There will be extensive free documentation sets tailored for users, developers, and data scientists.

Ajabbi will provide free support to developers with a paid DevOps Account who are supporting end users of Workspaces for Aviation.

Workspace navigation menu

This default outline needs a lot of work. The outline can be easily customised by future users using drag-and-drop and tick boxes to turn features off and on.

  • Enterprise Account
    • Applications
      • Aviation (v.3)
        • Aircraft
          • Aircraft Records
            • Aircraft Registration
            • Certificate of Registration
            • Airframe
            • Engine Cycles
            • Engine Hours
          • Certification
          • Fuel & Oil
          • Inventory
            • Orders
            • Parts Request
          • MRO (Maintenance, Repair, and Overhaul)
            • Fixed Wing
            • Helicopter
          • Personnel
            • Engineer
            • Pilot
          • Scheduling
          • Status & Readiness
          • Airport
            • Aerodrome
              • Aerodrome Plate
              • AIP
              • Aircraft Stand
              • Apron
              • Helipad
              • Runway
              • Taxiway
            • Air Traffic Control
              • Frequencies
                • Tower
                • Navigational
              • Slot
            • Contact
            • Emergency
            • Facility 
              • Flight information display system (FIDS)
              • Checkin Counter
                • Baggage
              • Gates
                • Boarding
              • Cafe
              • Parking
              • Shop
            • Security
            • Weather
              • ATIS
              • METAR
              • NOTAM
              • TAF
          • Airspace
            • Map
          • Flight
            • Cargo
            • Flight Number
            • Flight Plan
            • Passenger
      • Customer (v2)
        • Bookmarks
          • (To come)
        • Support
          • Contact
          • Forum
          • Live Chat
          • Office Hours
          • Requests
          • Tickets
        • (To come)
          • Feature Vote
          • Feedback
          • Surveys
        • Learning
          • Explanation
          • How to Guide
          • Reference
          • Tutorial
      • Settings (v3)
        • Account
        • Billing
        • Deployments
          • Workspaces
            • Modules
            • Plugins
            • Templates
              • (To come)
            • Users

    Inside the race to build agent-native databases

    Mike's Notes

    I was curious about this article by Ben Lorica because Pipi 9 uses database-driven agents. So what does Ben have to say?

    Resources

    References

    • Reference

    Repository

    • Home > Ajabbi Research > Library >
    • Home > Handbook > 

    Last Updated

    11/11/2025

    Inside the race to build agent-native databases

    By: Ben Lorica
    Gradient Flow: 29/10/2025

    Ben Lorica edits the Gradient Flow newsletter and hosts the Data Exchange podcast. He helps organize the AI Conference, the AI Agent Conference, the Applied AI Summit, while also serving as the Strategic Content Chair for AI at the Linux Foundation. You can follow him on Linkedin, X, Mastodon, Reddit, Bluesky, YouTube, or TikTok. This newsletter is produced by Gradient Flow.

    In a recent piece, I explored the growing mismatch between our existing data infrastructure and the demands of emerging AI agents. Since then, I have had the opportunity to speak with some founders and engineering leaders who are tackling this challenge directly. Their work confirms that the rise of agentic AI is not just an application-layer phenomenon; it is forcing a fundamental reconsideration of the database itself. This article examines four distinct initiatives that are reimagining what a database should be in an era where software, not just humans, will be its primary user.

    AgentDB: The Database as a Disposable File

    AgentDB reimagines the database by treating it not as persistent, heavy infrastructure but as a lightweight, disposable artifact, akin to a file. Its core premise is that creating a database should be as simple as generating a unique ID; doing so instantly provisions a new, isolated database. This serverless approach, which can utilize embedded engines like SQLite and DuckDB, is designed for the high-velocity, ephemeral needs of agentic workflows, where an agent might spin up a database for a single task and discard it upon completion.

    The initiative assumes that a significant portion of agentic tasks do not require the complexity of a traditional relational database. Its target use cases include developers building simple AI applications, agents needing a temporary “scratchpad” to process information, or even non-technical users who want to turn a data file, like a CSV of personal expenses, into an interactive chat application. Its primary limitation is that it is not designed for complex, high-throughput transactional systems with thousands of interconnected tables, such as an enterprise resource planning (ERP) system. AgentDB is currently live and accessible, with a focus on empowering developers to quickly integrate data persistence into their AI applications with minimal friction.

    Postgres for Agents: Evolving a Classic for AI

    Tiger Data’s “Postgres for Agents” takes an evolutionary, rather than revolutionary, approach. Instead of building a new database from scratch, it enhances PostgreSQL, the popular open-source database, with capabilities tailored for agents. The cornerstone of this initiative is a new storage layer that enables “zero-copy forking.” This allows a developer or an agent to create an instantaneous, isolated branch of a production database. This fork can be used as a safe sandbox to test schema changes, run experiments, or validate new code without impacting the live system.

    This approach is built on the assumption that the reliability, maturity, and rich ecosystem of Postgres are too valuable to discard. The target user is any developer building applications with AI, who can now instruct an AI coding assistant to safely test database migrations on a full-scale copy of production data. It also serves AI applications that require a robust and stateful backend. The platform is now available via Tiger Data’s cloud service, which includes a free tier. While the core forking technology is currently proprietary, the company is signaling a long-term commitment to the open Postgres ecosystem.

    Databricks Lakebase: Unifying Transactions and Analytics

    The Databricks Lakebase represents a broad architectural vision aimed at dissolving the long-standing wall between operational and analytical data systems. It proposes a new category of database — a “lakebase” — that embeds transactional capabilities directly within a data lakehouse architecture. Built on open standards like Postgres, it is designed to be serverless, separate storage from compute for elastic scaling, and support modern developer workflows like instantaneous branching.

    The core assumption of the Lakebase is that intelligent agents require seamless access to both real-time operational data and historical analytical insights to perform complex tasks. For example, an inventory management agent needs to check current stock levels (a transactional query) while also considering predictive demand models (an analytical query). The Lakebase is targeted at organizations, particularly those already invested in a lakehouse architecture, that want to build AI-native applications without the cost and complexity of maintaining separate databases and data pipelines. This is currently a strategic roadmap for Databricks, accelerated by its recent acquisition of companies like Mooncake Labs, and represents a long-term effort to create a single, unified platform for all data workloads.

    Bauplan Labs: A Safety-First Approach for Agents

    Bauplan Labs approaches the problem from the perspective of safety and reliability, motivated by the principle that modern data engineering requires the same rigor as software engineering. Their work focuses on creating a “programmable lakehouse,” an environment where every data operation is managed through code-based abstractions. This provides a secure and auditable foundation for AI agents to perform sensitive tasks. The central concept is a rigorously defined “Git-for-data” model, which allows agents to work on isolated branches of production data. Crucially, it introduces a “verify-then-merge” workflow. Before an agent’s changes are integrated, they must pass a series of automated correctness checks.

    This framework assumes that for agents to be trusted with mission-critical systems, their actions must be verifiable and their potential for error contained. The target use cases are high-stakes scenarios, such as an agent tasked with repairing a broken data pipeline or safely querying financial data through a controlled API, where a mistake could have significant consequences. Bauplan is building its platform on a formal blueprint for safe, agent-driven data systems, an approach already being validated by early customers. While the company offers open-source tooling on GitHub, its focus is on providing a commercial-grade framework for high-stakes, agent-driven applications that will influence the design of future platforms.
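As a purely conceptual illustration (the classes and function names below are hypothetical, not Bauplan's actual API), a verify-then-merge loop for an agent's change might look like this:

```python
class Branch:
    """Isolated, copy-on-write view of the data; edits stay here until merged."""
    def __init__(self, tables):
        self.tables = dict(tables)

class Repo:
    """Toy stand-in for a 'Git-for-data' repository."""
    def __init__(self):
        self.tables = {"orders": [100, 200, 300]}
    def create_branch(self):
        return Branch(self.tables)
    def merge(self, branch):
        self.tables = branch.tables

def verify_then_merge(repo, change, checks):
    branch = repo.create_branch()                 # the agent works in isolation
    change(branch)                                # apply the proposed change to the branch
    if all(check(branch) for check in checks):    # automated correctness checks
        repo.merge(branch)                        # only verified changes reach "production"
        return "merged"
    return "rejected"                             # otherwise the blast radius stays on the branch

repo = Repo()
result = verify_then_merge(
    repo,
    change=lambda b: b.tables.update(orders=[100, 200, 300, 400]),
    checks=[lambda b: all(amount > 0 for amount in b.tables["orders"])],
)
print(result, repo.tables["orders"])  # merged [100, 200, 300, 400]
```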

    The Broader Infrastructure Shift

    These four initiatives, from AgentDB’s file-like simplicity to the ambitious unification of the Databricks Lakebase, highlight a clear trend: databases are being reshaped to serve machines. Whether by evolving the trusted foundation of Postgres or by designing safety-first frameworks like Bauplan’s, the data community is moving toward systems that are more ephemeral, isolated, and context-aware. As outlined in my earlier thoughts, databases are becoming more than just repositories of information; they are the operational state stores and external memory that provide agents with the traceability, determinism, and auditable history needed to function reliably.

    Of course, the database is just one piece of the puzzle. As agents become more integrated into our workflows, other components of the technology stack also require reimagination. Search APIs, traditionally designed to return ten blue links for a human, must be adapted to deliver comprehensive, structured information for a machine. Development environments and IDEs are already evolving to become collaborative spaces for humans and AI coding assistants. The entire infrastructure, from headless browsers that allow agents to interact with the web to the observability tools that monitor their behavior, is being rebuilt for an agent-native world.

    Quick Takes