How function diversity scales, from cells to companies

Mike's Notes

Fascinating work. Something to test Pipi against using long-cycle simulations.

Resources

References

  • Yang, V.C., Holehouse, J., Youn, H., Arroyo, J.I., Redner, S., West, G.B., & Kempes, C.P. (2025). Scaling laws for function diversity and specialization across socioeconomic and biological complex systems. PNAS, February 12, 2025. DOI: 10.1073/pnas.2509729123

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Parallax
  • Home > Handbook > 

Last Updated

04/05/2026

How function diversity scales, from cells to companies

By: Santa Fe Institute
Parallax: 18/02/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

A mystery novel, a history book, and a fantasy epic may have little in common in plot or style. But count the words inside them and a strange regularity appears: many new words show up early, then fewer and fewer as the author reuses what has already been introduced.

That pattern, known as Heaps’ law, turns out not to belong to books alone. A new study in PNAS finds that the same rule also describes the growth patterns in many complex systems, from living cells and corporations to universities and government agencies — and could even be used to predict how they will change in the future.

The study, led by scientists at the Santa Fe Institute and MIT, doesn’t just document this regularity; it introduces a mathematical model that quantifies how different systems diversify and specialize. It finds that, while systems vary in how much they invest in creating entirely new functions, once those functions exist, their subsequent growth follows a remarkably universal rich-get-richer process.

“What’s striking is that these systems weren’t designed to follow the same rules,” says SFI Program Postdoctoral Fellow James Holehouse, who co-led the study with Vicky Chuqiao Yang, a former SFI Omidyar Fellow now at MIT. “Yet when you look at how they grow, you see the same trade-off between adding something new and building on what already exists.”

In the study, researchers focus on what they call “distinct functions” — the different kinds of work a system performs. In a cell, that might mean different proteins. In an organization, it could mean different kinds of jobs. As systems grow, they do add new kinds of work, but they do so more and more slowly over time.

Using their model, the team analyzed dozens of bacterial and other microbial cells, more than a hundred U.S. federal agencies, thousands of companies and universities, and hundreds of metropolitan areas. Across most of these cases, the same pattern appeared: as systems got bigger, the pace at which they added new functions steadily slowed, growing sublinearly.

In practical terms, sublinear growth means that doubling the size of a system does not double the number of functions inside it. Instead, growth increasingly comes from expanding what already exists. A growing organization hires more people into established jobs before creating new titles. A cell produces more of the proteins it already uses instead of evolving entirely new ones.
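
In its usual form, Heaps' law says the number of distinct elements grows as a sublinear power of total size, roughly V(N) ≈ K·N^β with β < 1. A toy simulation shows the shape. The sketch below is a generic rich-get-richer process with a decaying innovation rate (a Pitman-Yor, "Chinese restaurant" style process), not the paper's actual model:

// Toy rich-get-richer process: the chance of creating a brand-new function type
// decays as the system grows, while existing types grow in proportion to their
// size. The distinct-type count then scales sublinearly, roughly K ~ n^alpha.
function distinctFunctions(n: number, alpha = 0.6, theta = 1): number {
  const counts: number[] = [];
  for (let i = 0; i < n; i++) {
    const pNew = (theta + alpha * counts.length) / (theta + i);
    if (Math.random() < pNew) {
      counts.push(1); // invest in a brand-new function
    } else {
      // grow an existing function, chosen proportionally to its current size
      let r = Math.random() * (i - alpha * counts.length);
      let k = 0;
      while (k < counts.length - 1 && (r -= counts[k] - alpha) > 0) k++;
      counts[k]++;
    }
  }
  return counts.length;
}

// Doubling total size far less than doubles the number of distinct functions:
for (const n of [1_000, 2_000, 4_000, 8_000]) console.log(n, distinctFunctions(n));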

“It is remarkable that cells, bureaucracies, and companies, despite obvious differences, all grow their function repertoire with a similar pattern,” says Yang, an assistant professor at MIT Sloan and the Institute for Data, Systems, and Society. “This suggests that the regularity discovered in Heaps’ law applies not only to what humans create, like books, but also to human organizations themselves.”

Cities, however, follow a different version of the same trend. They still add new kinds of jobs as they grow, but they do so much more slowly, following a logarithmic pattern rather than the power-law pattern seen in other systems. Even as populations soar, genuinely new job types become increasingly rare.

That difference reflects a deeper structural divide. Cells, firms, and agencies behave like organisms, with clear boundaries and unified goals. Cities, by contrast, resemble ecosystems shaped by the independent choices of individuals rather than centralized control.

Geoffrey West, a co-author and Santa Fe Institute Shannan Distinguished Professor, adds, “There are underlying regularities shaping how complexity builds, even in systems that look completely different on the surface.”

This material is based upon work supported by the U.S. National Science Foundation under Award No. 2526746.

The Neural Harness: The new CPU

Mike's Notes

Some deep insights here from Will Schenk. Asking more questions than he answers. Especially deterministic vs probabilistic. Where does emergence emerge? 😎😎

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > The Focus AI
  • Home > Handbook > 

Last Updated

03/05/2026

The Neural Harness: The new CPU

By: Will Schenk
The Focus AI: 01/05/2026

I am a father, entrepreneur, technologist and aspiring woodsman.

My wife Ksenia and I live in the woods of Northwest Connecticut with our four boys and one baby girl. I have a lumber mill and all the kids love using the tractor.

I’m currently building The Focus AI, Umwelten, and Cornwall Market.

"Coding agents will build their own tools and their own agents. Agents will be used by non-engineers to manage other agents to manage parts of the org chart."


I'm on my second Claude Max plan. That's in addition to Cursor, Codex, Gemini, and a healthy Amp habit. Not to mention a Jetson AGX Thor I'm about to plug in at the office — more on that one later.

Overnight jobs parsing financial deal structures, ops stuff, research, monitoring logs, responding to events, all the little background things. The first plan tapped out, I added another, that one tapped out too, and now I'm provisioning a third the way you'd add a build runner. Mundane.

A new entry in an old list

Look at the paragraph I just wrote. Overnight jobs parsing financial deal structures, ops stuff, research, monitoring logs, responding to events. Half of those words are themselves names of native units of computing. Logs — log aggregators. Events — event streams. Research — search indexes. Ops — schedulers, orchestrators, deployment systems. Jobs — queues. The lede is already a list of older units I'm wiring into.

Computing has been accreting native units forever, and the way you build the next layer is by composing the units underneath it.

You combine logic gates and clocks to make registers. You combine adders and accumulators to make a CPU. You combine CPUs and memory and a bus to make a machine. You combine Boolean functions and a process model to make an operating system. You combine lexers and parsers and code generators to make a compiler. You combine source files and a compiler to make a program. You combine programs and a network stack to make a service. You combine services and a database to make an application. You combine applications and a queue to make a pipeline. You combine pipelines and a stream processor to make a real-time system. You combine streams and a log aggregator to make observability. You combine logs and a metric and an anomaly model to make a monitor. You combine all of it and a scheduler and you have a system that runs without you watching it.

[Image: flat-color treemap of the computing stack — small blocks for adders, clocks, registers, CPU, memory, and bus growing up and to the right through machine, OS, compiler, program, service, application, pipeline, stream, observability, and monitor, culminating in a large block for the scheduler.]

Each layer is just the layer below, composed. That's what a native unit is — the thing you stop writing yourself, the thing you wire to. You don't write a compiler. You don't write a Postgres. You don't write a Kafka or a Kubernetes or a Lucene or a git. You pick the unit, you combine it with other units, you build on top.

Now look at that list again. Everything on it is sitting on top of Boolean logic. Silicon, gates, arithmetic, state machines, if/then. Numbers, types, queries, schedules, indexes — all of it is deterministic logic resolving down to ones and zeros. You can climb that stack pretty high, but you don't get out of it.

[Image: 19th-century-style geological cross-section — layered strata of fossilized circuit traces (gates, clocks, registers, CPU, OS, compiler, program, service, application) opening into a newly excavated neural floor below, neuron tendrils threading up into the rock. The new floor under the old stack.]

Neural nets aren't more of that. They're a different kind of logic. Pattern, association, similarity, fuzzy matching, generation. The thing silicon-and-Boolean was bad at, that we kept failing to solve with cleverer rules, the neural net does natively. We added a new floor — GPUs, TPUs, the Cerebras inference fabric, the Jetson on my desk — and a new kind of computation running on it that doesn't reduce to if A and B then C.

By themselves these things predict tokens. They don't loop, they don't read files, they don't remember. To get computation out of one you wrap it. A loop, some tools, file access, a shell, a way to manage context. That wrapper is the harness. The harness is the unit that turns "predicts the next token" into "does the work" — and lets the new kind of logic compose with the old kind.
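
To make that concrete, here is a minimal harness sketch. Everything in it is illustrative: callModel assumes any OpenAI-compatible chat endpoint (a local Ollama server here, with a placeholder model name), and a single shell tool stands in for a real toolset:

import { execSync } from 'node:child_process';

type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Illustrative stand-in for any OpenAI-compatible chat endpoint.
async function callModel(messages: Msg[]): Promise<string> {
  const res = await fetch('http://localhost:11434/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'qwen2.5-coder', messages }),
  });
  const json = (await res.json()) as { choices: { message: Msg }[] };
  return json.choices[0].message.content;
}

// The harness: a loop, one tool, and crude context management.
async function harness(task: string): Promise<string> {
  const messages: Msg[] = [
    {
      role: 'system',
      content:
        'To run a shell command, reply with JSON only: {"tool":"shell","command":"..."}. ' +
        'Otherwise reply with your final answer as plain text.',
    },
    { role: 'user', content: task },
  ];
  for (let step = 0; step < 50; step++) {
    const reply = await callModel(messages);
    messages.push({ role: 'assistant', content: reply });
    let request: { tool?: string; command?: string };
    try {
      request = JSON.parse(reply);
    } catch {
      return reply; // plain text means the model is done
    }
    if (request.tool === 'shell' && request.command) {
      let output: string;
      try {
        output = execSync(request.command, { timeout: 60_000 }).toString();
      } catch (err) {
        output = String(err); // failures go back into context as observations too
      }
      // Truncate so the context window doesn't fill up: crude but effective.
      messages.push({ role: 'user', content: `shell output:\n${output.slice(0, 8_000)}` });
    }
  }
  return 'step limit reached';
}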

The neural harness is to neural nets what the compiler was to source code. New entry on the list, joining the family rather than replacing it. The work I'm running on these two-going-on-three Max plans is mostly the harness wiring into the older units — tailing logs, querying state, watching streams, kicking off jobs, hitting indexes. New unit, old units, composed.

That's why the second Max plan isn't weird. The bill scales with how much work you're doing in the new unit. I'm doing a lot of work in the new unit.

How it shows up in a day

It really has stopped being a tool I reach for; it's just the tool.

When I'm coding, I'm in a harness. When I'm reading a PDF I needed to read anyway, the harness is the thing reading it. Operations folder — SOWs, invoices, content ideas, project status — that's a harness. Parsing 20 financial deal docs and writing me a summary while I sleep — harness. Family infographics, fasting tracker, Oura ring trends — harness, harness, harness. Different work, same unit.

[Image: small-multiples grid — the same harness icon repeated across sixteen everyday domains, from code editor, PDF reading, and invoices to Oura trends, calendar, and ops dashboard. One unit, many domains.]

Coding was just the first place this paid off, because the feedback loop is tightest. Compile or don't, test or don't, the world tells you you're wrong inside a second. So that's where the harness got tuned first. That's why the unit is called a "coding agent" right now. But "coding" is vestigial. The thing isn't a coding agent. It's a harness around a model, and what runs in it is whatever you have tools for.

Rick Blalock said it at AI Engineering Miami — coding agent as universal software primitive. A 60-year-old in Texas replaced a $10k/month HubSpot bill by pointing one of these at the problem for three months. A 24-year-old window cleaner in Florida runs marketing, sales, and estimating off the same primitive. Both of them bought Mac Minis. Tim Cook didn't have that on his bingo card.

The model question is below the harness question

Here's something I noticed about my own behavior: I'm mainly on Claude. Have been for months. I dip in and out of GPT and Grok and Gemini, but just sort of end up back here. Not because I reasoned out a model strategy — because Claude Code defaults to it and now I'm on Opus all day every day. Amp has its opinion and I try to set Cursor to super max mode, but really the model picked itself by way of the harness picking it for me.

So the perennial "Opus vs GPT-5 vs Gemini 3" argument is pitched one floor below where the action is. It's not model-vs-model. It's harness-with-default-model vs other-harness-with-default-model. The harness drives the model choice, often without telling you.

And underneath that, there's a whole zoo. Frontier reasoning models. Cheap fast models. Code-specific fine-tunes. Local models that run on the GPU you already own. Cerebras-fast inference at 1,200 tokens/sec, a different regime entirely. And the inside-the-harness thing: Tejas Bhakta at Miami called it "everything is models" — a compaction model running every two seconds, a code-search model at 80k tokens/sec, a frontier model doing only the heavy reasoning, all stitched together. Software 3.5, he called it. The harness picks all of that for you, or doesn't, depending on which harness.

[Image: Da Vinci-style anatomical plate — a single mechanical harness apparatus labeled HARNESSIS (UNITAS SUPERIOR) with four tool attachments (LEGERE, SCRIBERE, IMPERARE, ITERARE), above a labeled menagerie of seven model "species": Frontier, Velox, Codicis, Localis, Compactionis, Quaerens, Cerebras.]

Which means the harness is a model strategy. Picking a harness on purpose means picking which models do which jobs inside it.
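
A toy sketch of what "picking which models do which jobs" can look like inside a harness. The task taxonomy and model names here are made up for illustration, not any vendor's actual lineup:

// Toy model router inside a harness: cheap models for constant background work,
// the frontier model only where heavy reasoning pays for itself.
type Task = { kind: 'compaction' | 'code-search' | 'edit' | 'plan'; contextTokens: number };

function pickModel(task: Task): string {
  switch (task.kind) {
    case 'compaction':
      return 'small-local'; // runs constantly, must be cheap and fast
    case 'code-search':
      return 'code-tuned-fast'; // high throughput, narrow job
    case 'edit':
      return task.contextTokens > 50_000 ? 'frontier' : 'mid-tier';
    case 'plan':
      return 'frontier';
  }
}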

So which harness?

A separate post is coming soon — each harness deserves its own treatment, and the conversation moves week to week. The shape of it:

  • You can build your own in a weekend. About 50 lines gets you the loop. Highly recommend it, even if you never use it.
  • Claude Code is the one everyone uses, and — by Anthropic's own model on Anthropic's own benchmark — the worst Claude harness on offer. (Niels Rogge posted Terminal-Bench 2: same Opus 4.6, Claude Code last, ForgeCode and Capy at 70-75%. Twenty-five points of accuracy from picking a different harness.)
  • Picode is Mario Zechner's minimal, self-modifying one — four tools, the agent writes its own extensions, hot-reloads in the session. The most fun one to play with right now.
  • Amp is the one I'm most fascinated with — though to be clear, I'm editing this post in Cursor. The multimodel thing actually works now. In January I wrote that Amp "should be better, but, you know, isn't." Four months later: it is.

[Chart: Tufte-style horizontal bar chart of Terminal-Bench 2 scores on the same Opus 4.6 — ForgeCode 74%, Capy 71%, Picode 62%, Amp 55%, Claude Code 49%. A 25-point gap from picking a different harness.]

The point of this post is the unit, not the catalog.

What I'm still circling

[Image: Da Vinci-style notebook spread — a Jetson AGX Thor on a workbench labeled MACHINA LOCALIS, a brass token-cost gauge labeled STIPENDIUM TOKEN, a half-configured harness labeled HARNESSIS CONFIGURATA, and a small rising-line chart labeled LINEA NOVA IN STATU FINANCIALI.]

What's the unit of shipping? Ben Davis's claim in Miami was that it's becoming a directory of skill files plus a coding-agent runtime. That feels right. But the runtime is also moving — Picode's bet is that it should be malleable inside the session, so you can't pin it. Maybe the unit is even smaller. Maybe the unit is the harness, configured.

What about the Jetson on my desk? The other thing the bill is about to teach us is that some of this work shouldn't be paying for a subscription at all. Local models on local hardware — gpt-oss, Qwen, MiniMax, whatever's frontier-enough for the job — running on the GPU you already own, or the Jetson, or the laptop. Cheap as electricity. No data leaving the building. The harness doesn't care which model it's calling. The bill cares a lot. I think a real chunk of what's running on the second Max plan ends up local by the end of the year.

When the bill becomes a real line item — and it will — what does that conversation sound like? "Cloud spend" took ten years to become its own column on the financial statement. "Token spend" might take less. We're paying for a unit of computation, not for software. Different shape entirely.

I'll get the third Max plan tomorrow. There's another job.

Into the Box 2026 - Keynote - Day 1

Mike's Notes

A video of the keynote speech at Into the Box 2026. Very impressive BoxLang progress, including these new features:

  • Rust VM
  • BoxLang Administrator
  • BoxLang Desktop
  • AI Administrator
  • etc
I will put up all the videos, day 1, day 2, etc.

Pipi 10 (2027-2028)

Pipi 10 will run on BoxLang, continue to use CFML, and also support, through its multi-parser architecture, BoxLang, COBOL, Groovy, Java, PHP, Python, Ruby, and Rust. More languages are coming.
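
As a purely illustrative sketch (not Pipi's actual design), a multi-parser front end can be as small as a dispatch table keyed by file extension, with a real parser per language behind each entry:

// Illustrative only: dispatch-by-extension multi-parser front end.
// The stub parsers stand in for real per-language parsers.
type Parser = (source: string) => unknown;

const parsers: Record<string, Parser> = {
  '.bx': (src) => ({ lang: 'BoxLang', src }),
  '.cfm': (src) => ({ lang: 'CFML', src }),
  '.py': (src) => ({ lang: 'Python', src }),
  '.rs': (src) => ({ lang: 'Rust', src }),
};

function parse(path: string, source: string): unknown {
  const ext = path.slice(path.lastIndexOf('.'));
  const parser = parsers[ext];
  if (!parser) throw new Error(`no parser registered for ${ext}`);
  return parser(source);
}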

Pipi Codebase

Over time, Pipi will self-migrate its existing CFML codebase to other languages based on performance, feature set, and other factors. All code languages are great; some are better for some jobs.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

02/05/2026

Into the Box 2026 - Keynote - Day 1

By: Luis Majano
Ortus: 01/05/2026

Luis Majano, CEO of Ortus Solutions, has been programming since the age of nine. A Salvadoran innovator, Luis has created some of the most widely used tools in the ColdFusion ecosystem, including ColdBox, ContentBox, and WireBox. His expertise in scalable and efficient JVM solutions drives the BoxLang Runtimes, ensuring they meet the needs of modern development.

Luis is not just a coder; he’s a leader dedicated to fostering a strong community around BoxLang. His hands-on approach and commitment to open-source projects have made him a trusted voice in the industry.

Join Ortus Solutions live from Into the Box 2026 as we kick off Day 1 with a keynote focused on Modernization in Motion. Discover the latest innovations across the Ortus ecosystem—including BoxLang, ColdBox, CommandBox, and TestBox—and see how developers are building faster, modernizing legacy systems, and scaling with confidence.

This session dives into the future of development: cloud-native architectures, AI-driven workflows, and the tools you need to adapt and thrive in a rapidly evolving landscape.

Day 1

Day 2

Join Ortus Solutions for the Day 2 keynote as we continue our journey of Modernization in Motion. Building on the momentum from Day 1, this session dives deeper into the evolution of the Ortus ecosystem—featuring BoxLang, ColdBox, CommandBox, TestBox, and the innovations shaping the future of development.

Explore what’s next: from advanced tooling and performance breakthroughs to cloud-native strategies, AI-driven development, and real-world success stories from the community.

Day 2 is all about going further—turning ideas into execution and equipping you with the tools to build what’s next.

Anthropic Mythos -- We've Opened Pandora's Box

Mike's Notes

This is why Pipi Core is in its own data centre, physically isolated from the internet, to ensure 100% security and protect people's privacy.

I endorse Steve Blank's conclusion. The risks are enormous and growing.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Steve Blank
  • Home > Handbook > 

Last Updated

01/05/2026

Anthropic Mythos -- We've Opened Pandora's Box

By: Steve Blank
The Cipher Brief: 23/04/2026

Adjunct Professor at Stanford and Co-founder of the Gordian Knot Center for National Security Innovation

Steve Blank is an adjunct professor at Stanford and co-founder of the Gordian Knot Center for National Security Innovation. His book, The Four Steps to the Epiphany, is credited with launching the Lean Startup movement. He created the curriculum for the National Science Foundation Innovation Corps. At Stanford, he co-created the Department of Defense Hacking for Defense and Department of State Hacking for Diplomacy curriculums. He is co-author of The Startup Owner's Manual.

EXPERT OPINION

For a decade, the cybersecurity community predicted a cyber apocalypse tied to a single event - the day a Cryptographically Relevant Quantum Computer could run Shor’s algorithm and break the public-key cryptography most of the internet runs on. We braced for a one-time shock we would absorb and adapt to. The National Institute of Standards and Technology (NIST) has already published standards for the first set of post-quantum cryptography algorithms.

It’s possible that the first cybersecurity apocalypse has come early. Anthropic Mythos now tilts the odds in the cybersecurity arms race in favor of attackers - and the math of why it tilts, and how long it stays tilted, is different from anything our institutions were built to handle.

In 2013, Edward Snowden changed what people knew

In 2013, Edward Snowden changed what people understood about nation-state cyber capabilities. In the decade that followed, disclosures and leaks of nation-state cyber tools reduced uncertainty and accelerated the diffusion of cyber tradecraft.

The defensive playbook that followed - compartmentalization, need-to-know, leak-surface reduction, clearance reform - “worked” because the Snowden leaks and those that followed were one-time disclosures, absorbed over a decade, with the system returning to something like equilibrium.

We got good at responding to the shocks of disclosures. It became doctrine. It was the right doctrine for the wrong future.

Pandora's Box

In 2026, Anthropic Mythos (and similar AI systems) is changing what people can do. Mythos found zero-day vulnerabilities and thousands of “bugs” that were not publicly known to exist (a must-read article here). Many of these were not just run-of-the-mill stack-smashing exploits but sophisticated attacks that required exploiting subtle race conditions, KASLR (Kernel Address Space Layout Randomization) bypasses, memory corruption vulnerabilities, logic flaws in cryptographic libraries, and bugs in TLS, AES-GCM, and SSH.

The reality is that a number of these were not “bugs.” They were nation-state exploits built over decades.

What this means is that Anthropic Mythos, and the tools that will certainly follow, have exposed hacking tools previously available only to nation-states and transformed them into tools that script kiddies will have within a few months (and certainly within a year). No expertise will be required to apply that tradecraft, compressing both the learning curve and the execution barrier.

All Governments Will Scramble

When Mythos-class systems are used to analyze the code in critical infrastructure and systems, the hidden, sophisticated zero-day exploits already in use (including ones nation-states have been sitting on for years) will be found and patched. That means intelligence agencies’ collection sources will go dark as companies and governments patch these vulnerabilities.

Every serious intelligence service will scramble, likely with their own AI, to find new access before the visibility gap costs them something they cannot replace. A new generation of AI-driven exploits will rise to replace the ones that have been burned, building an arms race between AI-driven offense and defense. Whichever side sustains faster AI adoption - not just “procures” it, but ships it into operational systems - holds a widening advantage measured in powers of two every four months.

The binding constraint is not budget. Not authority. Not access to models. It is institutional capacity for change - the rate at which a defender organization can actually change what it deploys.

The Long Tail Will Not Be Patched

Anthropic has given companies early access to secure the world’s most critical software. That will help Fortune 100 companies. But the Fortune 100 is just a small part of the software attack surface.

The attack surface includes the unpatched county water utility, the regional hospital, the third-tier defense supplier, the school district, the state Department of Motor Vehicles, the municipal 911 system, and the small-town electric co-op. Tens of thousands of systems running software nobody has time to patch, maintained by teams that have never heard of KASLR.

Every one of those systems is now exposed to nation-state-grade tradecraft, wielded by attackers with no expertise required. Mythos-class hardening at the top of the pyramid does not trickle down. The long tail will stay unpatched for years.

Attackers’ Advantage - For Now

Under continuous exponential growth of AI-designed cyberattacks, a cyber defender using traditional tools can’t respond just once and stabilize their systems. A one-time defensive shock like compartmentalization might work against a sudden attack, but it will fail against sustained exponential pressure because there’s no stable equilibrium to return to. The defender’s investment rate has to track the offense’s growth rate.

Ultimately, and hopefully, the next generation of AI-driven cyber-defense tools will create a new equilibrium.

What We Need to Do

Mythos and its follow-ons will change how we think about cyber-defense. We can’t just build a set of features to catch every exploit x or y. We need to build cyber systems that can match or exceed the attackers’ rate of capability growth.

Here are the three tools governments and cyber defense companies need to build now:

  1. Measure the Gap Between Attackers and Defenders. We need to know the gap between what the attackers can do and what we can defend against. We need to develop instrumented red/blue exercises (a simulation of a cyberattack, where two teams - the red team and the blue team - are pitted against each other) to estimate the rate of new vulnerabilities versus the rate of cyber-defense mitigation. (This can be built in six months, with a small team.)
  2. Measure the Defender Response Time. For each corporate or government mission system, measure how long it takes to implement a change from identification to production deployment. Treat each organizational obstacle as equivalent to technical debt that needs to be remediated.
  3. Specify Speed, Not Features. Any new cyber-defense tools and architecture - including the next-generation cloud-native systems sitting in review right now - should have explicit “rate” requirements. “Our product delivers X capability” is now the wrong specification. “Closes the detection gap at a rate greater than or equal to the offense growth rate” is the right one (see the sketch below).
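
That kind of rate requirement is checkable. A toy sketch of what the check could look like - all names and numbers are illustrative, assuming the four-month doubling cadence described earlier:

// Toy rate check for requirement #3. Assumes offense capability doubles every
// four months; defense keeps pace only if mitigations grow at least that fast.
type MonthlySample = { exploitsObserved: number; mitigationsShipped: number };

const OFFENSE_MONTHLY_GROWTH = 2 ** (1 / 4); // doubling every four months

function defenseKeepsPace(history: MonthlySample[]): boolean {
  for (let i = 1; i < history.length; i++) {
    const growth =
      history[i].mitigationsShipped / Math.max(1, history[i - 1].mitigationsShipped);
    if (growth < OFFENSE_MONTHLY_GROWTH) return false; // the gap widened this month
  }
  return true;
}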

Buckle up. It’s going to be a wild ride - for companies, for defense, and for government agencies.

Mythos is a sea change. It requires a different response than what the current cybersecurity ecosystem was built for - one the current system is not built to produce. We are not behind yet. The gap between Mythos and what we can build to defend is small enough today that a serious response can still match it. A year from now, the same response will be eight times too slow. Two years, sixty-four.
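
The arithmetic behind those multipliers, assuming the doubling-every-four-months cadence above:

\[
\text{gap}(t) = 2^{t/4},\qquad \text{gap}(12\ \text{mo}) = 2^{3} = 8,\qquad \text{gap}(24\ \text{mo}) = 2^{6} = 64.
\]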

By the way, the only thing left in Pandora’s Box was hope.

Vibing, Harness and OODA loop

Mike's Notes

Wise words from Oskar Dudycz. Subscribe to Architecture Weekly, it's awesome. 😎

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Architecture Weekly
  • Home > Handbook > 

Last Updated

30/04/2026

Vibing, Harness and OODA loop

By: Oskar Dudycz
Architecture Weekly: 27/04/2026

Software Architect & Consultant / Helping fellow humans build Event-Driven systems / Blogger at https://event-driven.io/ / OSS contributor: https://github.com/oskardudycz / Proud owner of an Amiga 500.

On why Vibing and Harness are not new and why feedback loops are important.

“Hey, have a look at what I made during the weekend. I had some time, grabbed a beer, turned on the computer, and tried to code this feature. If I could do so much during the weekend, how much could you and your team do with it in 2 weeks?”

It’s almost a 1:1 quote of what I heard from the startup founder I worked with over 10 years ago. I’m sure that you’ve heard similar phrases from people you worked with. We all know the annoying type of person who doesn’t code anymore but thinks, “I still got it!”. Then they throw a piece of stuff at you to “just fine-tune it a bit and do final touches”. Then they’re the first ones to ask “Why so long?“.

Nowadays, the Internet is full of such people. They shout about what they did with Claude or how much progress LLM tools have made. Some even predict the end of coding. I already wrote that this is the wrong perspective. I won’t repeat that, but I want to say that…

Vibing isn’t new and isn’t always an issue.

I’m saying that LLM tools reward ignorance. The less we know about the topic we’re working with, the better the outcomes look to us. And that, by itself, is not always bad, as there’s power in ignorance if we focus on getting things done with the simplest tools we have.

Still, this can be terrible if we fall in love too much with what we’ve vibed.

To understand why that “weekend beer” energy is both a superpower and a liability, we need to look at the OODA Loop.


Disclaimer: it’s not a competitor to the Ralph Wiggum Loop. It’s much older and more generic.

Military strategist John Boyd developed the OODA loop (Observe, Orient, Decide, Act) for fighter pilots. In a dogfight, the pilot who cycles through these four stages the fastest and most accurately survives.

In software, the “dogfight” is the gap between your intent and the production-ready feature.

The OODA loop is built from four steps:

  1. Observe - This is the intake of raw, unfiltered information. In our world, this means looking at the state of the system.
  2. Orient - This is the most critical and difficult stage. It’s where you filter your observations through your experience, culture, and technical knowledge.
  3. Decide - Based on your orientation, you formulate a hypothesis.
  4. Act - You execute.

Getting back to my favourite founder and LLM-based tools.

The reason the founder could build a PoC in a weekend while the team needed more than two weeks is that he bypassed the Observe and Orient phases. He went straight from a vague idea to Act.

If we skip or brush past the observation step, it feels like lightning speed. If the fancy UI grid is there and it does something we wanted, we move on. We’ve outsourced Orientation to our own ego. It’s too easy to assume that because we wrote it, it works.

Observation is the intake of raw data. In a professional environment, our eyes aren’t enough. We need a Harness. If we don’t have automations, tests, integration tests, and pristine traces, we aren’t observing the system; we’re just looking at it. If the inputs are messy, our observation is clouded.

But real engineering, the kind that takes those “two weeks”, is about closing the loop properly. That’s also where we need different perspectives and knowledge sharing.

Orientation is where you process those observations. This is the part where LLMs make us feel smarter than we are. If we don’t understand how a database handles concurrent connections, our “orientation” of a generated script will be shallow. We’ll see code that “looks” right, decide it’s fine, and act by deploying it.

The “I still got it” crowd loves the Decide and Act phases because that’s where the visible progress happens. LLM tools have made these phases nearly instantaneous. We can decide to build a feature and have the code for it in ten seconds.

The problem is that the faster we Act, the faster we need to Observe. If our “Act” phase takes seconds but our “Observe” phase requires a manual weekend of clicking around and drinking beer, our OODA loop is broken. We’re just generating a pile of stuff that we haven’t actually verified.

That’s why the team usually needs more than an imaginary “two weeks”. They are not “fine-tuning” the single-brilliant-dude masterpiece. They are building the infrastructure required to make the OODA loop sustainable.

And to make that possible, they need to run the full loop: Observe, Orient, Decide, Act. And do it multiple times. That takes time, but it’s required to assess the direction, automate what needs to be automated, and ensure they can iterate further and run this loop sustainably. That’s critical for delivering the outcome at the expected pace.

Of course, there’s a danger here: overfocusing on Orient and Decide can lead to overengineering, building stuff we don’t need. That’s where ignorance can be blissful, especially when we connect it with humility. Being humble about what we don’t know and trying things the easiest way, then learning and making enhancements. Still, humility fails under deadline pressure. The harness doesn’t.

Let me give you…

The example

I’m adding proper Observability and OpenTelemetry to Emmett right now. I spent some time working on it and instrumented the first component: Command Handling.

Of course, I had tests to prove it works, but I don’t trust them enough, and I wanted to try it on a real sample, since you never know until you run it. Even the best test suite won’t tell you everything.

So I decided to plug it into the sample to see if it works, how ergonomic the API is, and how it fits the conventions in this area.

To do it, I decided to use the Grafana stack and set it up with Docker Compose. A stable, boring stack. Not going to lie, I vibed the config. Not that there are no docs; I intentionally wanted to see the typical config people use.

If someone says LLM-based tools are great at proofs of concept, they probably don’t run the stuff they vibed. If I had based my observation on the initial config, the oriented decision would have been that it doesn’t work. Of course, then I did the typical back-and-forth, with the LLM tool doing some Linux command Voodoo to make it work. Once. Then, if you try to repeat it, you won’t know how to do it without doing the Voodoo again.

Again, that’s not much different from other stuff we do. I’m sure you’ve seen multiple cases where someone didn’t use Continuous Deployment tools but clicked through the Azure, AWS, or GCP portal, deployed the stack, and then there was no trace of how to set it up again (e.g. to have a different environment for testing or demos for customers).

So we need a harness, a leash to keep our process on track.

How do we build the harness? My advice is to start simple. We may ask LLMs to give us shell scripts, and we may ask them to run the scripts multiple times. We also need experience and knowledge of what we want to achieve and the tools we use. It’s fine not to remember all the YAML config for setting up the Grafana stack, but it’s not fine not to understand why you even use it, how it fits together, and how to set it up.

Still, our first loop can close on the first working solution, even a manually vibed one. But that’s not even a PoC. We need to automate it.

I asked the LLM to take notes on what issues it had and how it solved them. Then, based on that, I asked it to research how to code it in TypeScript, and to use tools I know and have used in the past, validating whether there are newer, more modern alternatives. For instance, I was a big fan of Gulp.js and Bullseye in the past, but they’re mostly dead. I wanted something in the same spirit, using native, maintained tooling.

I ended up with the following tools:

  • execa for running shell commands,
  • native fetch for calling HTTP endpoints,
  • native Node.js test tools for checking if the stack works as expected.

Then I asked it to create a script automating the shell Voodoo it had done to make the Grafana stack and Docker Compose work.

Essentially, it should:

  1. Run the Docker Compose script, starting up the services (Grafana, Prometheus, Loki, Tempo, PostgreSQL, etc.).
  2. Wait until they’re ready (it usually takes some time).
  3. Start the application and make a request.
  4. Check that the predefined dashboard with Emmett metrics appears and shows the expected traces and metrics.

The initial diagnostic tools looked like this:

// Diagnostic helpers. URLS and COMPOSE are defined in the configuration block below.
import { execa } from 'execa';

async function fetchWithDiag(label: string, url: string, init?: RequestInit) {
  const res = await fetch(url, init);
  if (!res.ok) {
    const body = await res.text().catch(() => '(could not read body)');
    console.error(`\n  ✗ ${label} → HTTP ${res.status}\n  body: ${body}\n`);
  }
  return res;
}

async function diagnoseCollector() {
  const text = await fetch(URLS.otelCollectorMetrics)
    .then((r) => r.text())
    .catch(() => 'unreachable');
  const emmett = text
    .split('\n')
    .filter((l) => l.startsWith('emmett_') && !l.startsWith('#'))
    .slice(0, 5);
  console.log(
    emmett.length
      ? `\n  collector /metrics (emmett lines):\n  ${emmett.join('\n  ')}`
      : '\n  collector /metrics: no emmett_* lines found',
  );
}

async function diagnosePrometheus() {
  const json = await fetch(
    `${URLS.prometheus}/api/v1/label/__name__/values`,
  )
    .then((r) => r.json() as Promise<{ data: string[] }>)
    .catch(() => ({ data: [] as string[] }));
  const emmett = json.data.filter((n) => n.startsWith('emmett_'));
  console.log(
    emmett.length
      ? `\n  Prometheus emmett_* metrics: ${emmett.join(', ')}`
      : '\n  Prometheus: no emmett_* metrics found yet',
  );
}

async function diagnoseLoki() {
  const labels = await fetch(`${URLS.loki}/loki/api/v1/labels`)
    .then((r) => r.json() as Promise<{ data?: string[] }>)
    .catch(() => ({ data: [] as string[] }));
  console.log(`\n  Loki labels: ${(labels.data ?? []).join(', ') || '(none)'}`);
}

async function diagnoseDockerLogs(service: string, lines = 10) {
  const { stdout } = await execa('docker', [
    ...COMPOSE,
    'logs',
    '--tail',
    String(lines),
    service,
  ]).catch(() => ({ stdout: '(could not get logs)' }));
  console.log(`\n  docker logs ${service} (last ${lines}):\n  ${stdout.split('\n').join('\n  ')}`);
}

Are they pretty? No. Can they be improved? Yes. Do they have to be improved at this specific moment? No.

The setup uses the test infrastructure:

// Setup excerpt. waitFor and checkUrl are small polling helpers defined elsewhere
// in the same file; `app` holds the child process handle for `npm start`.
import { before, after, test } from 'node:test';
import assert from 'node:assert/strict';
import { randomUUID } from 'node:crypto';
import { execa } from 'execa';

let app: ReturnType<typeof execa> | undefined;

const CLEANUP = process.env['CLEANUP'] === '1' || process.env['CLEANUP'] === 'true';
const CLEANUP_AFTER = process.env['CLEANUP_AFTER'] === '1' || process.env['CLEANUP_AFTER'] === 'true';
const NO_START = process.env['NO_START'] === '1' || process.env['NO_START'] === 'true';

// ─── configuration ───────────────────────────────────────────────────────────
const COMPOSE = ['compose', '-f', 'docker-compose.yml', '--profile', 'observability'];
const URLS = {
  app: 'http://localhost:3000',
  prometheus: 'http://localhost:9090',
  tempo: 'http://localhost:3200',
  loki: 'http://localhost:3100',
  grafana: 'http://localhost:3001',
  otelCollectorMetrics: 'http://localhost:8889/metrics',
};
// Fresh client per run — avoids stale cart state from previous runs.
const SERVICE_NAME = 'expressjs-with-postgresql';
const CLIENT_ID = randomUUID();
const CART_ENDPOINT = `${URLS.app}/clients/${CLIENT_ID}/shopping-carts/current/product-items`;
const CONFIRM_ENDPOINT = `${URLS.app}/clients/${CLIENT_ID}/shopping-carts/current/confirm`;
// Matches the .http file — unitPrice is resolved server-side.
const ADD_PRODUCT_BODY = JSON.stringify({ productId: randomUUID(), quantity: 10 });

before(async () => {
  console.log(`\n▶ client ID for this run: ${CLIENT_ID}\n`);
  if (NO_START) {
    console.log('▶ --no-start: skipping docker compose and app startup');
    return;
  }
  if (CLEANUP) {
    console.log('▶ --cleanup: killing port 3000 and tearing down stack (down -v)…');
    await execa('bash', ['-c', 'fuser -k 3000/tcp 2>/dev/null || true']).catch(() => {});
    await new Promise((r) => setTimeout(r, 500));
    await execa('docker', [...COMPOSE, 'down', '-v', '--remove-orphans'], {
      stdio: 'inherit',
    });
  }
  const stackReady = await fetch(`${URLS.prometheus}/-/ready`)
    .then((r) => r.ok)
    .catch(() => false);
  if (stackReady) {
    console.log('▶ observability stack already up — skipping docker compose up');
  } else {
    console.log('▶ starting observability stack…');
    await execa('docker', [...COMPOSE, 'up', '-d'], { stdio: 'inherit' });
  }
  console.log('▶ waiting for backends…');
  await waitFor(() => checkUrl('Prometheus', `${URLS.prometheus}/-/ready`), {
    timeout: 90_000, label: 'Prometheus',
  });
  await waitFor(() => checkUrl('Grafana', `${URLS.grafana}/api/health`), {
    timeout: 90_000, label: 'Grafana',
  });
  await waitFor(() => checkUrl('Tempo', `${URLS.tempo}/ready`), {
    timeout: 90_000, label: 'Tempo',
  });
  await waitFor(() => checkUrl('Loki', `${URLS.loki}/ready`), {
    timeout: 90_000, label: 'Loki',
  });
  // /health returns { status: 'ok', service: 'expressjs-with-postgresql' } —
  // checking service name lets us distinguish our app from other processes on :3000.
  const checkOurApp = () =>
    checkUrl('app /health', `${URLS.app}/health`, async (res) => {
      const json = (await res.json().catch(() => ({}))) as { service?: string };
      if (json.service !== SERVICE_NAME) {
        console.log(
          `    app /health: service="${json.service ?? '(missing)'}", expected="${SERVICE_NAME}"`,
        );
        return false;
      }
      return true;
    });
  const appIsOurs = stackReady && (await checkOurApp());
  if (appIsOurs) {
    console.log('▶ app already running and healthy — skipping npm start');
  } else {
    const portTaken = await fetch(URLS.app).then(() => true).catch(() => false);
    if (portTaken) {
      // Port is occupied but not by our app — stale process or unrelated service.
      console.error(
        '\n  ✗ Port 3000 is occupied by a process that is not this app.\n' +
          '  It may be a stale version of this app (connected to a wiped database)\n' +
          '  or a completely different service.\n' +
          '  Fix: run  npm run verify:observability:cleanup  to kill it and restart,\n' +
          '  or manually free port 3000.\n',
      );
      process.exit(1);
    }
    console.log('▶ starting app…');
    app = execa('npm', ['start'], { stdio: 'inherit' });
    await waitFor(checkOurApp, { timeout: 60_000, label: 'app /health' });
  }
  console.log('▶ setup complete\n');
});

As you can see, nothing fancy; the cleanup is even simpler:

after(async () => {
  if (app) {
    console.log('\n▶ stopping app…');
    app.kill('SIGTERM');
    await app.catch(() => {});
    console.log('▶ app stopped');
  }
  if (CLEANUP_AFTER) {
    console.log('▶ tearing down stack (down -v)…');
    await execa('docker', [...COMPOSE, 'down', '-v', '--remove-orphans'], {
      stdio: 'inherit',
    });
    console.log('▶ stack torn down');
  } else {
    console.log('▶ stack is still running');
    console.log('▶ to clean up: npm run verify:observability:cleanup');
  }
});

Having that, we can run the tests:

test('successful command returns x-trace-id header', async () => {
  const res = await fetchWithDiag('POST add product', CART_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: ADD_PRODUCT_BODY,
  });
  assert.equal(res.status, 204, `Expected 204 — body logged above`);
  const header = res.headers.get('x-trace-id');
  if (!header) {
    console.error(
      '  ✗ x-trace-id missing — verify the wrapper app in src/index.ts ' +
        'adds it via @opentelemetry/api before mounting the emmett app',
    );
  }
  assert.ok(header, 'x-trace-id header missing');
  assert.match(header, /^[0-9a-f]{32}$/, `"${header}" is not a 32-hex trace ID`);
  traceId = header;
  console.log(`  trace ID: ${traceId}`);
});
test('OTel collector exposes Emmett metrics on port 8889', async () => {
  // Send a few more requests so metrics are definitely recorded.
  for (let i = 0; i < 5; i++) {
    await fetch(CART_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: ADD_PRODUCT_BODY,
    });
  }
  try {
    await waitFor(
      async () => {
        let text: string;
        try {
          const res = await fetch(URLS.otelCollectorMetrics);
          text = await res.text();
        } catch {
          console.log('    collector :8889: connection refused');
          return false;
        }
        const emmettLines = text.split('\n').filter((l) => l.startsWith('emmett_') && !l.startsWith('#'));
        if (emmettLines.length === 0) {
          const allFamilies = [...new Set(text.split('\n').filter((l) => !l.startsWith('#') && l).map((l) => l.split('{')[0]))].slice(0, 5);
          console.log(`    collector :8889: no emmett_* metrics yet. Present: ${allFamilies.join(', ') || '(none)'}`);
          return false;
        }
        return true;
      },
      { timeout: 90_000, interval: 5_000, label: 'emmett metrics on collector :8889' },
    );
  } catch (err) {
    await diagnoseCollector();
    await diagnoseDockerLogs('otel-collector');
    throw err;
  }
});
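
The excerpts above lean on two small helpers, waitFor and checkUrl, that aren’t shown in the post. A minimal sketch of what they could look like, reconstructed from the call sites (the actual implementations in the Emmett repo may differ):

// Hypothetical helpers, reconstructed from the call sites above.
async function waitFor(
  check: () => Promise<boolean>,
  { timeout, interval = 2_000, label }: { timeout: number; interval?: number; label: string },
): Promise<void> {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await check()) return;
    await new Promise((r) => setTimeout(r, interval));
  }
  throw new Error(`Timed out after ${timeout}ms waiting for ${label}`);
}

async function checkUrl(
  label: string,
  url: string,
  validate?: (res: Response) => Promise<boolean>,
): Promise<boolean> {
  try {
    const res = await fetch(url);
    if (!res.ok) return false;
    return validate ? await validate(res) : true;
  } catch {
    console.log(`    ${label}: not reachable yet`);
    return false;
  }
}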

I put it into a single file that can be run as a regular Node.js script.

It already showed me (and Claude) that what it initially did didn’t work when run multiple times. It also showed that doing a full cleanup and rebuild, and making it reproducible, needs more work.

Is it done? Not yet; it takes too much time and resources to run continuously in the pipeline. The code is a bit messy, so it needs to be organised. But it’s segmented into blocks, includes basic automation and tests, and has already survived a few failures along the way.

Could I do it better? Sure, and I will improve it, but that’s not the point. I wanted to show you my findings from weekend vibing (without beer tho): the real, unpolished iteration, before I run the next one.

The main idea behind OODA loops is not to be perfect, but to iterate quickly, gather feedback as soon as possible, learn from it, develop another theory, and verify it through action.

It’s not about vibing, but it’s also not about analysis paralysis.

I hope you’re now better equipped to think about when vibing — with beer or without, with LLMs or without — actually helps, and when it doesn’t.

Vibe coding is just high-frequency steering. It only works if you have a Harness: a mechanical way to observe and orient, so you don’t steer the whole project into a wall.

Act takes seconds now. Observe takes as long as it always did. Without a harness, you’re not going faster; you’re just making more stuff you haven’t checked.

The harness is not magic, a new discipline, or the next buzzword; I hope this article showed you a bit of what it may look like.

So iterate fast, but wisely, remembering to run the full loop. It’s great that LLMs can help us make Acting faster, but we should not skip the other steps. We should aim for a fast feedback loop so we iterate in the right direction and achieve continuous improvement, delivering proper value.

Just like Vibing isn’t new, we shouldn’t abandon old engineering practices. We should also not replace collaboration with solitary self-high fives.

Check also:

  • Emmett Pull Request with mentioned changes
  • Interactive Rubber Ducking with GenAI
  • The End of Coding? Wrong Question
  • A few tricks on how to set up related Docker images with docker-compose
  • Docker Compose Profiles, one of the most useful and underrated features

Cheers!

Oskar