Building High-Performance Teams in 2025: Beyond the Location Debate

Mike's Notes

A thoughtful article about in-office versus remote from Leah Brown, Managing Editor at IT Revolution. Pipi 9 has been built specifically for high-performance DevOps teams.

Resources

Building High-Performance Teams in 2025: Beyond the Location Debate

By: Leah Brown
IT Revolution: January 6 2025

The debate over in-office versus remote work misses a fundamental truth: high-performing teams succeed based on how they’re organized, not where they sit. Through extensive research across industries, Gene Kim and Dr. Steven J. Spear found that three key mechanisms consistently enable team excellence: slowing down to speed up, breaking down complexity, and amplifying problems early.

As they explain in their award-winning book Wiring the Winning Organization, the leaders of the highest-performing teams will use these three simple mechanisms to “wire their organization for success instead of mediocrity.” 

1. Slow Down to Speed Up

High-performing teams in 2025 must prioritize solving problems in controlled environments before they appear in production, a practice Kim and Spear term “slowification.” To do this, teams should look to:

  • Move complex problem-solving offline instead of firefighting during execution.
  • Create dedicated spaces for experimentation and learning.
  • Build standard approaches based on validated solutions.
  • Test new processes in low-stakes environments.

Toyota exemplifies this approach using careful preparation and practice to achieve industry-leading performance. Known as the Toyota Production System, this method of slowing down to solve problems has long been proven to help the highest-performing teams succeed. And it will continue to be a differentiator for high-performing teams in 2025 and beyond.

2. Break Down Complexity 

High-performing organizations like Amazon have transformed their performance by making complex work manageable through what Kim and Spear term “simplification.”

Simplification is the process of making complex work more manageable by:

  • Creating small, self-contained teams that own complete workflows.
  • Defining clear handoffs between specialized functions.
  • Implementing changes incrementally rather than all at once.
  • Designing linear processes with obvious next steps.

Amazon has used these principles to evolve from making only twenty software deployments per year to over 136,000 daily deployments. They achieved this by breaking down monolithic systems into smaller, independent services with clear interfaces.

3. Amplify Problems Early

Drawing from their research of high-performing organizations in manufacturing, healthcare, and technology, Kim and Spear found that great organizations create mechanisms to detect and respond to small issues before they become major disruptions. This “amplification,” as they call it, requires teams to maintain reserve capacity to swarm problems when they occur and share solutions across teams to prevent recurrence down the road.

In other words, high-performing teams:

  • Make problems visible immediately when they occur.
  • Create rapid feedback loops between dependent teams.
  • Maintain reserve capacity to swarm and contain issues.
  • Share solutions across teams to prevent recurrence.

Leading the High-Performing Team

To create and lead your high-performing teams, Kim and Spear recommend starting with what they call a “model line”—a small segment where new approaches can be tested. Their research shows three phases of implementing a model line in any organization:

  • Start Small: Choose one critical workflow, form an initial cross-functional team, and implement basic performance metrics.
  • Expand Thoughtfully: Add supporting capabilities, establish clear team interactions, and build knowledge-sharing mechanisms.
  • Optimize Continuously: Refine team boundaries and interfaces while maintaining focus on outcomes.

The organizations that thrive in 2025 and beyond will be those that create what Kim and Spear call effective “social circuitry”—the processes and norms that enable great collaboration. When teams have well-defined boundaries, clear visibility into work, and mechanisms to coordinate when needed, location becomes irrelevant.

The future belongs to organizations that focus on creating the right conditions for teams to excel, whether in a physical, remote, or hybrid environment. By implementing the three key mechanisms of great social circuitry, leaders can build high-performing teams that consistently deliver exceptional results, regardless of where they sit. 

The evidence presented in Wiring the Winning Organization makes this clear: excellence comes from organizational design, not office design.

Curiosity, Open Source, and Timing: The Formula Behind DeepSeek’s Phenomenal Success

Mike's Notes

This week's Turing Post had an excellent summary of DeepSeek and some valuable links.

The original post is on Turing Post, and a longer version is on HuggingFace. The missing links on this page can be found in the original post.

Turing Post is worth subscribing to.

LM Studio is free for personal use and can run DeepSeek and other LLMs. It runs on Mac, Windows, and Linux; Windows requires 16 GB of RAM.

Resources

Curiosity, Open Source, and Timing: The Formula Behind DeepSeek’s Phenomenal Success

By: Ksenia Se
Turing Post: #85 January 27, 2025

How an open-source mindset, relentless curiosity, and strategic calculation are rewriting the rules in AI and challenging Western companies, plus an excellent reading list and curated research collection

When we first covered DeepSeek models in August 2024 (we are opening that article for everyone, do read it), it didn’t gain much traction. That surprised me! Back then, DeepSeek was already one of the most exciting examples of curiosity-driven research in AI, committed to open-sourcing its discoveries. They also employed an intriguing approach: unlike many others racing to beat benchmarks, DeepSeek pivoted to addressing specific challenges, fostering innovation that extended beyond conventional metrics. Even then, they demonstrated significant cost reductions.

“What’s behind DeepSeek-Coder-V2 that makes it so special it outperforms GPT-4 Turbo, Claude-3 Opus, Gemini 1.5 Pro, Llama 3-70B, and Codestral in coding and math?

DeepSeek-Coder-V2, costing 20–50x less than other models, represents a major upgrade over the original DeepSeek-Coder. It features more extensive training data, larger and more efficient models, improved context handling, and advanced techniques like Fill-In-The-Middle and Reinforcement Learning.” (Inside DeepSeek Models)

Although DeepSeek was making waves in the research community, it remained largely unnoticed by the broader public. But then they released R1-Zero and R1.

With that release they crushed industry benchmarks and disrupted the market by training their models at a fraction of the typical cost. But do you know what else they did? Not only did they prove that reinforcement learning (RL) is all you need in reasoning (R1 stands as solid proof of how well RL works), but they also embraced a trial-and-error approach – fundamental to RL – for their own business strategies. Previously overlooked, they calculated this release of R1 meticulously. Did you catch the timing? It was a strategic earthquake that shook the market and left everyone reeling:

  1. As ChinaTalk noticed: “R1's release during President Trump’s inauguration last week was clearly intended to rattle public confidence in the United States’ AI leadership at a pivotal moment in US policy, mirroring Huawei's product launch during former Secretary Raimondo's China visit. After all, the benchmark results of an R1 preview had already been public since November.”
  2. The release happened just one week before the Chinese Lunar New Year (this year on January 29), which typically lasts 15 days. However, the week leading up to the holiday is often quiet, giving them a perfect window to outshine other Chinese companies and maximize their PR impact.

So, while the DeepSeek family of models serves as a case study in the power of open-source development paired with relentless curiosity (from an interview with Liang Wenfeng, DeepSeek’s CEO: “Many might think there's an undisclosed business logic behind this, but in reality, it's primarily driven by curiosity.”), it’s also an example of cold-blooded calculation and triumph of reinforcement learning applied to both models and humans :). DeepSeek has shown a deep understanding of how to play Western games and excel at them. Of course, today’s market downturn, though concerning to many, will likely recover soon. However, if DeepSeek can achieve such outstanding results, Western companies need to reassess their strategies quickly and clarify their actual competitive moats.

Worries about NVIDIA

Of course, we’ll still need a lot of compute – everyone is hungry for it. Here’s a quote from Liang Wenfeng, DeepSeek’s CEO: “For researchers, the thirst for computational power is insatiable. After conducting small-scale experiments, there's always a desire to conduct larger ones. Since then, we've consciously deployed as much computational power as possible.”

So, let’s not count NVIDIA out. What we can count on is Jensen Huang’s knack for staying ahead and finding ways to stay relevant (NVIDIA wasn’t started as an AI company, if you remember). But the rise of innovators like DeepSeek could push NVIDIA to double down on openness. Beyond the technical benefits, an aggressive push toward open-sourcing could serve as a powerful PR boost, reinforcing NVIDIA’s centrality in the ever-expanding AI ecosystem.

As I was writing these words about NVIDIA, they sent a statement regarding DeepSeek: “DeepSeek is an excellent AI advancement and a perfect example of Test Time Scaling. DeepSeek’s work illustrates how new models can be created using that technique, leveraging widely-available models and compute that is fully export control compliant. Inference requires significant numbers of NVIDIA GPUs and high-performance networking. We now have three scaling laws: pre-training and post-training, which continue, and new test-time scaling.”

So – to wrap up – the main takeaways from the DeepSeek breakthrough are:

  • open-source and decentralize
  • stay curiosity-driven
  • apply reinforcement learning to everything

For DeepSeek, this is just the beginning. As curiosity continues to drive its efforts, it has proven that breakthroughs come not from hoarding innovation but from sharing it. As we move forward, it’s these principles that will shape the future of AI.

We are reading (it’s all about 🐳)

Here is a collection of superb articles covering everything you need to know about DeepSeek:

Curated Collections

7 Open-source Methods to Improve Video Generation and Understanding

Weekly recommendation from an AI practitioner 👍🏼

To run DeepSeek models offline using LM Studio:

  1. Install LM Studio: Download the appropriate version for your operating system from the LM Studio website and follow the installation instructions provided.
  2. Download the DeepSeek Model: Open LM Studio and navigate to the "Discover" tab. Search for "DeepSeek" and select your desired model. Click "Download" to save the model locally.
  3. Run the Model Offline: Once downloaded, go to the "Local Models" section. Select the DeepSeek model and click "Load." You can interact with the model directly within LM Studio without an internet connection.

News from The Usual Suspects ©

  • Data Center News
    $500B Stargate AI Venture by OpenAI, Oracle, and SoftBank
    With plans to build massive data centers and energy facilities in Texas, Stargate aims to bolster U.S. AI dominance. Partners like NVIDIA and Microsoft bring muscle to this high-stakes competition with China. Trump supports it; Musk trashes it.
  • Meta's Manhattan-Sized AI Leap
    Mark Zuckerberg’s AI ambitions come on a smaller scale (haha) – $65 billion for a data center so vast it could envelop Manhattan. With 1.3 million GPUs powering this, Meta aims to revolutionize its ecosystem and rival America’s AI heavyweights. The era of AI megaprojects is here.
  • Mistral’s IPO Plans: Vive la Résistance
    French AI startup Mistral isn’t selling out. With €1 billion raised, CEO Arthur Mensch eyes an IPO while doubling down on open-source LLMs. Positioned as a European powerhouse, Mistral’s independence signals Europe’s readiness to play hardball in the global AI race.
  • SmolVLM: Hugging Face Goes Tiny
    Hugging Face introduces SmolVLM, two of the smallest foundation models yet. This open-source release proves size doesn’t matter when efficiency leads the charge, setting new standards for compact AI development.
  • OpenAI's Agent Takes the Wheel
    CUA (Computer-Using Agent) redefines multitasking with Operator, seamlessly interacting with GUIs like a digital power user. From downloading PDFs to complex web tasks, it’s the closest we’ve come to a universal assistant. CUA is now in Operator's research preview for Pro users. Blog. System Card.
  • Google DeepMind: A Year in Gemini’s Orbit
    They just published an overview of 2024. From Gemini 2.0's breakthroughs in multimodal AI to Willow chip’s quantum strides, innovation soared. Med-Gemini aced medical exams, AlphaFold 3 advanced molecular science, and ALOHA redefined robotics. With disaster readiness, educational tools, and responsible AI initiatives, DeepMind balanced cutting-edge tech with global impact. A Nobel-worthy streak indeed.
  • Google DeepMind: Cost-Cutting AI with "Light Chips"
    Demis Hassabis unveils Google's next move – custom "light chips" designed to slash AI model costs while boosting efficiency. These chips power Gemini 2.0 Flash, with multimodal AI, 1M-token memory, and a "world model" vision for AGI. DeepMind’s edge? Owning every layer of the AI stack, from chips to algorithms.

Top models to pay attention to

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning – Enhance reasoning in LLMs with multi-stage reinforcement learning, outperforming competitors in benchmarks like AIME 2024 and MATH-500.
  • Kimi K1.5: Scaling Reinforcement Learning with LLMs – Scale reasoning capabilities with efficient reinforcement learning methods, optimizing token usage for both long- and short-chain-of-thought tasks.
  • VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding – Advance image and video understanding with multimodal integration, achieving top results in temporal reasoning and long-video tasks.
  • Qwen2.5-1M Series – Support 1M-token contexts with open-source models, leveraging sparse attention and lightning-fast inference frameworks for long-context tasks.

The freshest research papers, categorized for your convenience

There were quite a few TOP research papers this week; we will mark them with 🌟 in each section.

Specialized Architectures and Techniques

  • 🌟 Demons in the Detail: Introduces load-balancing loss for training Mixture-of-Experts models.
  • 🌟 Autonomy-of-Experts Models: Proposes expert self-selection to improve Mixture-of-Experts efficiency and scalability.
  • O1-Pruner: Length-Harmonizing Fine-Tuning: Reduces inference overhead in reasoning models through reinforcement learning-based pruning.

Language Model Reasoning and Decision-Making

  • 🌟 Evolving Deeper LLM Thinking: Explores genetic search methods to enhance natural language inference for planning tasks, achieving superior accuracy.
  • 🌟 Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training: Develops a framework for LLMs to self-correct using Monte Carlo Tree Search and iterative refinement.
  • 🌟 Reasoning Language Models: A Blueprint: Proposes a modular framework integrating reasoning methods to democratize reasoning capabilities.
  • Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback: Enhances mathematical reasoning with stepwise binary feedback for more accurate LLM outputs.
  • Test-Time Preference Optimization: Introduces a framework for aligning LLM outputs to human preferences during inference without retraining.

Multi-Agent Systems and Coordination

  • SRMT: Shared Memory for Multi-Agent Lifelong Pathfinding: Demonstrates shared memory use for enhanced coordination in multi-agent systems.
  • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks: Develops a hierarchical agent framework for mobile assistants with self-evolution capabilities.

Generative and Retrieval-Augmented Models

  • Chain-of-Retrieval Augmented Generation: Presents a stepwise query and reasoning framework for retrieval-augmented generation.
  • Can We Generate Images with CoT?: Integrates Chain-of-Thought reasoning for compositional and iterative image generation.

Multi-Modal and GUI Systems

  • UI-TARS: Pioneering Automated GUI Interaction: Advances vision-based agents for human-like GUI task performance.
  • InternLM-XComposer2.5-Reward: Improves multi-modal reward modeling for text, image, and video alignment.

Robustness, Adaptability, and Uncertainty

  • Trading Inference-Time Compute for Adversarial Robustness: Examines inference-time compute scaling to improve robustness against adversarial attacks.
  • Evolution and the Knightian Blindspot of Machine Learning: Advocates integrating evolutionary principles into machine learning for resilience to uncertainty.

Planning and Execution in AI

  • LLMs Can Plan Only If We Tell Them: Proposes structured state tracking to enhance planning capabilities in LLMs.
  • Debate Helps Weak-to-Strong Generalization: Leverages debate methods to improve model generalization and alignment.

Social and Cognitive Insights

  • Multiple Predictions of Others’ Actions in the Human Brain: Examines neural mechanisms for predicting social behaviors under ambiguity.

AI Infrastructure and Hardware

  • Good Things Come in Small Packages: Advocates Lite-GPUs for scalable and cost-effective AI infrastructure.

Nine ways to shoot yourself in the foot with PostgreSQL

Mike's Notes

This handy post by Phil Booth on his website has some excellent warnings about misusing PostgreSQL.

There are a lot more in-article links in the original post.

Phil has a great website full of interesting articles.

Resources

Nine ways to shoot yourself in the foot with PostgreSQL

By: Phil Booth
PhilBooth.com: April 23 2023

Previously for Extreme Learning, I discussed all the ways I've broken production using healthchecks. In this post I'll do the same for PostgreSQL.

The common thread linking most of these gotchas is scalability. They're things that won't affect you while your database is small. But if one day you want your database not to be small, it pays to think about them in advance. Otherwise they'll come back and bite you later, potentially when it's least convenient. Plus, in many cases it's less work to do the right thing from the start than it is to change a working system to do the right thing later on.

1. Keep the default value for work_mem

The biggest mistake I made the first time I deployed Postgres in prod was not updating the default value for work_mem. This setting governs how much memory is available to each query operation before it must start writing data to temporary files on disk, and can have a huge impact on performance.

It's an easy trap to fall into if you're not aware of it, because all your queries in local development will typically run perfectly. In production, too, you'll probably have no issues at first. But as your application grows, the volume of data and the complexity of your queries both increase. It's only then that you'll start to encounter problems: a textbook "but it worked on my machine" scenario.

When work_mem becomes over-utilised, you'll see latency spikes as data is paged in and out, causing hash table and sorting operations to run much slower. The performance degradation is extreme and, depending on the composition of your application infrastructure, can even turn into full-blown outages.

A good value depends on multiple factors: the size of your Postgres instance, the frequency and complexity of your queries, the number of concurrent connections. So it's really something you should always keep an eye on.

Running your logs through pgbadger is one way to look for warning signs. Another way is to use an automated 3rd-party system that alerts you before it becomes an issue, such as pganalyze (disclosure: I have no affiliation to pganalyze, but am a very happy customer).

At this point, you might be asking if there's a magic formula to help you pick the correct value for work_mem. It's not my invention but this one was handed down to me by the greybeards:

work_mem = ($YOUR_INSTANCE_MEMORY * 0.8 - shared_buffers) / $YOUR_ACTIVE_CONNECTION_COUNT

EDIT: Thanks to afhammad for pointing out you can also override work_mem on a per-transaction basis using SET LOCAL work_mem.
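
To make the formula concrete, here is a minimal sketch of both ways of applying it; the instance size (16 GB), shared_buffers (4 GB) and active connection count (100) are assumptions used only for the arithmetic, not recommendations, so substitute your own numbers.

-- Worked example of the formula above (all numbers are assumptions):
--   work_mem = (16 GB * 0.8 - 4 GB shared_buffers) / 100 active connections
--            = (12.8 GB - 4 GB) / 100
--            ≈ 88 MB
ALTER SYSTEM SET work_mem = '88MB';  -- persists the new default
SELECT pg_reload_conf();             -- picks it up without a restart

-- Per-transaction override, as mentioned in the edit above:
BEGIN;
SET LOCAL work_mem = '256MB';        -- applies only inside this transaction
-- ... run the heavy sorting/hashing query here ...
COMMIT;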

2. Push all your application logic into Postgres functions and procedures

Postgres has some nice abstractions for procedural code and it can be tempting to push lots or even all of your application logic down into the db layer. After all, doing that eliminates latency between your code and the data, which should mean lower latency for your users, right? Well, nope.

Functions and procedures in Postgres are not zero-cost abstractions, they're deducted from your performance budget. When you spend memory and CPU to manage a call stack, less of it is available to actually run queries. In severe cases that can manifest in some surprising ways, like unexplained latency spikes and replication lag.

Simple functions are okay, especially if you can mark them IMMUTABLE or STABLE. But any time you're assembling data structures in memory or you have nested functions or recursion, you should think carefully about whether that logic can be moved back to your application layer. There's no TCO in Postgres!

And of course, it's far easier to scale application nodes than it is to scale your database. You probably want to postpone thinking about database scaling for as long as possible, which means being conservative about resource usage.

3. Use lots of triggers

Triggers are another feature that can be misused.

Firstly, they're less efficient than some of the alternatives. Requirements that can be implemented using generated columns or materialized views should use those abstractions, as they're better optimised by Postgres internally.

Secondly, there's a hidden gotcha lurking in how triggers tend to encourage event-oriented thinking. As you know, it's good practice in SQL to batch related INSERT or UPDATE queries together, so that you lock a table once and write all the data in one shot. You probably do this in your application code automatically, without even needing to think about it. But triggers can be a blindspot.

The temptation is to view each trigger function as a discrete, composable unit. As programmers we value separation of concerns and there's an attractive elegance to the idea of independent updates cascading through your model. If you feel yourself pulled in that direction, remember to view the graph in its entirety and look for parts that can be optimised by batching queries together.

A useful discipline here is to restrict yourself to a single BEFORE trigger and a single AFTER trigger on each table. Give your trigger functions generic names like before_foo and after_foo, then keep all the logic inline inside one function. Use TG_OP to distinguish the trigger operation. If the function gets long, break it up with some comments but don't be tempted to refactor to smaller functions. This way it's easier to ensure writes are implemented efficiently, plus it also limits the overhead of managing an extended call stack in Postgres.
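
Here is a minimal sketch of that discipline, following the before_foo naming pattern for a hypothetical orders table; the timestamp columns are assumptions, but TG_OP and the trigger syntax are standard Postgres.

-- Single BEFORE trigger per table, all logic inline, TG_OP used to branch.
CREATE FUNCTION before_orders() RETURNS trigger AS $$
BEGIN
  -- INSERT: stamp the creation time (assumes a created_at column)
  IF TG_OP = 'INSERT' THEN
    NEW.created_at := now();

  -- UPDATE: stamp the modification time (assumes an updated_at column)
  ELSIF TG_OP = 'UPDATE' THEN
    NEW.updated_at := now();
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER before_orders
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION before_orders();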

4. Use NOTIFY heavily

Using NOTIFY, you can extend the reach of triggers into your application layer. That's handy if you don't have the time or the inclination to manage a dedicated message queue, but once again it's not a cost-free abstraction.

If you're generating lots of events, the resources spent on notifying listeners will not be available elsewhere. This problem can be exacerbated if your listeners need to read further data to handle event payloads. Then you're paying for every NOTIFY event plus every consequential read in the handler logic. Just as with triggers, this can be a blindspot that hides opportunities to batch those reads together and reduce load on your database.

Instead of NOTIFY, consider writing events to a FIFO table and then consuming them in batches at a regular cadence. The right cadence depends on your application: maybe it's a few seconds, or perhaps you can get away with a few minutes. Either way it will reduce the load, leaving more CPU and memory available for other things.

A possible schema for your event queue table might look like this:

CREATE TABLE event_queue (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  type text NOT NULL,
  data jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  occurred_at timestamptz NOT NULL,
  acquired_at timestamptz,
  failed_at timestamptz
);

With that in place you could acquire events from the queue like so:

UPDATE event_queue
SET acquired_at = now()
WHERE id IN (
  SELECT id
  FROM event_queue
  WHERE acquired_at IS NULL
  ORDER BY occurred_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1000 -- Set this limit according to your usage
)
RETURNING *;

Setting acquired_at on read and using FOR UPDATE SKIP LOCKED guarantees each event is handled only once. After they've been handled, you can then delete the acquired events in batches too (there are better options than Postgres for permanently storing historical event data of unbounded size).
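
A clean-up query for handled events might look like the sketch below; the one-hour retention window and the batch size are arbitrary assumptions.

-- Delete events that were acquired more than an hour ago, in bounded
-- batches, so no single delete holds locks for too long.
DELETE FROM event_queue
WHERE id IN (
  SELECT id
  FROM event_queue
  WHERE acquired_at IS NOT NULL
    AND acquired_at < now() - interval '1 hour'
  LIMIT 1000
);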

EDIT: Thanks to labatteg, notfancy, reese_john, xnickb and Mavvie for pointing out the missing FOR UPDATE SKIP LOCKED in this section.

5. Don't use EXPLAIN ANALYZE on real data

EXPLAIN is a core tool in every backend engineer's kit. I'm sure you diligently check your query plans for the dreaded Seq Scan already. But Postgres can return more accurate plan data if you use EXPLAIN ANALYZE, because that actually executes the query. Of course, you don't want to do that in production. So to use EXPLAIN ANALYZE well, there are a few steps you should take first.

Any query plan is only as good as the data you run it against. There's no point running EXPLAIN against a local database that has a few rows in each table. Maybe you're fortunate enough to have a comprehensive seed script that populates your local instance with realistic data, but even then there's a better option.

It's really helpful to set up a dedicated sandbox instance alongside your production infrastructure, regularly restored with a recent backup from prod, specifically for the purpose of running EXPLAIN ANALYZE on any new queries that are in development. Make the sandbox instance smaller than your production one, so it's more constrained than prod. Now EXPLAIN ANALYZE can give you confidence about how your queries are expected to perform after they've been deployed. If they look good on the sandbox, there should be no surprises waiting for you when they reach production.
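
Running a query through EXPLAIN ANALYZE on that sandbox is then straightforward; the query below is just a placeholder against hypothetical orders and customers tables.

-- ANALYZE actually executes the query, so keep this on the sandbox;
-- BUFFERS adds I/O detail to the reported plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EU'
ORDER BY o.created_at DESC
LIMIT 50;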

6. Prefer CTEs over subqueries

If you're regularly using EXPLAIN this one probably won't catch you out, but it's caught me out before so I want to mention it explicitly.

Many engineers are bottom-up thinkers and CTEs (i.e. WITH queries) are a natural way to express bottom-up thinking. But they may not be the most performant way.

Instead I've found that subqueries will often execute much faster. Of course it depends entirely on the specific query, so I make no sweeping generalisations other than to suggest you should EXPLAIN both approaches for your own complex queries.

There's a discussion of the underlying reasons in the "CTE Materialization" section of the docs, which describes the performance tradeoffs more definitively. It's a good summary, so I won't waste your time trying to paraphrase it here. Go and read that if you want to know more.

EDIT: Thanks to Randommaggy and Ecksters for pointing out the subquery suggestion in this section is outdated. Since version 12, Postgres has been much better at optimising CTEs and will often just replace the CTE with a subquery anyway. I've left the section in place as the broader point about comparing approaches with EXPLAIN still stands and the "CTE Materialization" docs remain a worthwhile read. But bear in mind the comment thread linked above!
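
If you want to compare the two forms yourself, a sketch like this works; the orders table is hypothetical, and since version 12 you can also pin the planner's behaviour explicitly with MATERIALIZED or NOT MATERIALIZED.

-- CTE form; since Postgres 12 the planner may inline it anyway.
EXPLAIN ANALYZE
WITH recent AS NOT MATERIALIZED (  -- force inlining; MATERIALIZED forces the optimisation fence
  SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT customer_id, count(*) FROM recent GROUP BY customer_id;

-- Equivalent subquery form, for comparison with the plan above.
EXPLAIN ANALYZE
SELECT customer_id, count(*)
FROM (
  SELECT * FROM orders WHERE created_at > now() - interval '7 days'
) AS recent
GROUP BY customer_id;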

7. Use recursive CTEs for time-critical queries

If your data model is a graph, your first instinct will naturally be to traverse it recursively. Postgres provides recursive CTEs for this and they work nicely, even allowing you to handle self-referential/infinitely-recursive loops gracefully. But as elegant as they are, they're not fast. And as your graph grows, performance will decline.

A useful trick here is to think about how your application traffic stacks up in terms of reads versus writes. It's common for there to be many more reads than writes and in that case, you should consider denormalising your graph to a materialized view or table that's better optimised for reading. If you can store each queryable subgraph on its own row, including all the pertinent columns needed by your queries, reading becomes a simple (and fast) SELECT. The cost of that is write performance of course, but it's often worth it for the payoff.
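
As a rough illustration under assumed names (a categories table with id and parent_id columns forming an acyclic hierarchy), the recursive query and a read-optimised, denormalised alternative might look like this:

-- Recursive traversal: elegant, but cost grows with the size of the graph.
WITH RECURSIVE subtree AS (
  SELECT id, parent_id, name
  FROM categories
  WHERE id = 42                     -- root of the subgraph we want

  UNION ALL

  SELECT c.id, c.parent_id, c.name
  FROM categories c
  JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;

-- Read-optimised alternative: precompute one row per queryable subgraph
-- (paid for at write/refresh time), so reads become a plain SELECT.
CREATE MATERIALIZED VIEW category_subtrees AS
WITH RECURSIVE subtree AS (
  SELECT id AS root_id, id, parent_id FROM categories
  UNION ALL
  SELECT s.root_id, c.id, c.parent_id
  FROM categories c
  JOIN subtree s ON c.parent_id = s.id
)
SELECT root_id, array_agg(id) AS node_ids
FROM subtree
GROUP BY root_id;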

8. Don't add indexes to your foreign keys

Postgres doesn't automatically create indexes for foreign keys. This may come as a surprise if you're more familiar with MySQL, so pay attention to the implications as it can hurt you in a few ways.

The most obvious fallout from it is the performance of joins that use a foreign key. But those are easily spotted using EXPLAIN, so are unlikely to catch you out.

Less obvious perhaps is the performance of ON DELETE and ON UPDATE behaviours. If your schema relies on cascading deletes, you might find some big performance gains by adding indexes on foreign keys.
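
The fix is simply to create the missing index yourself; the table and column names below are hypothetical.

-- Postgres indexes the referenced primary key automatically, but not the
-- referencing column, so add that index to speed up joins and cascades.
CREATE INDEX CONCURRENTLY idx_orders_customer_id
  ON orders (customer_id);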

9. Compare indexed columns with IS NOT DISTINCT FROM

When you use regular comparison operators with NULL, the result is also NULL instead of the boolean value you might expect. One way round this is to replace <> with IS DISTINCT FROM and replace = with IS NOT DISTINCT FROM. These operators treat NULL as a regular value and will always return booleans.

However, whereas = will typically cause the query planner to use an index if one is available, IS NOT DISTINCT FROM bypasses the index and will likely do a Seq Scan instead. This can be confusing the first time you notice it in the output from EXPLAIN.

If that happens and you want to force a query to use the index instead, you can make the null check explicit and then use = for the not-null case.

In other words, if you have a query that looks like this:

SELECT * FROM foo
WHERE bar IS NOT DISTINCT FROM baz;

You can do this instead:

SELECT * FROM foo
WHERE (bar IS NULL AND baz IS NULL)
OR bar = baz;

Design Token-Based UI Architecture

Mike's Notes

Pipi 9 has an existing design system engine, one of its many parts. This engine describes the CSS files but does not yet automate or generate code.

Design Tokens have significantly matured, and the draft standard has recently improved.

The article below is copied from Martin Fowler's website. It describes how ThoughtWorks uses Design Tokens for code generation, which I will use as a starting point. It should work well with the existing design system engine.

Resources

Design Token-Based UI Architecture

By: Andreas Kutschmann
MartinFowler.com: December 12 2024

Design tokens are design decisions as data and serve as a single source of truth for design and engineering. Utilizing deployment pipelines, they enable automated code generation across platforms, allowing for faster updates and improved consistency in design. Organizing tokens in layers—progressing from available options to tokens that capture how they are applied—ensures scalability and a better developer experience. Keeping option tokens (e.g. color palettes) private reduces file size and supports non-breaking changes. These benefits make design tokens particularly well-suited for organizations with large-scale projects, multi-platform environments or frequent design changes.

Contents

  • Role of design tokens
    • What are design tokens?
  • Establishing a single source of truth
  • Automated design token distribution
    • Fully automated pipeline
    • Pipeline including manual approval
  • Organizing tokens in layers
    • Option tokens: defining what design options are provided
    • Decision tokens: defining how styles are applied
    • Component tokens: defining where styles are applied
    • How many layers shall I use?
  • Token scope
    • File-based scope
    • A more flexible approach
  • Should I use design tokens?
    • When to use design tokens
    • When design tokens might not be necessary

Design tokens, or “tokens”, are fundamental design decisions represented as data. They are the foundational building blocks of design systems.

Since the release of the second editor’s draft of the design token specification in 2022 and the call for tool makers to start implementing and providing feedback, the landscape of design token tools has evolved rapidly. Tools like code generators, documentation systems, and UI design software are now better equipped to support design tokens, underscoring their growing importance in modern UI architecture.

In this article, I'll explain what design tokens are, when they are useful and how to apply them effectively. We'll focus on key architectural decisions that are often difficult to change later, including:

  1. How to organize design tokens in layers to balance scalability, maintainability and developer experience.
  2. Whether all tokens should be made available to product teams or just a subset.
  3. How to automate the distribution process of tokens across teams.

Role of design tokens

Around 2017, I was involved in a large project that used the Micro Frontend Architecture to scale development teams. In this setup, different teams were responsible for different parts of the user interface, which could even be on the same page. Each team could deploy its micro-frontend independently.

There were various cases where components would be displayed on top of each other (such as dialogs or toasts appearing on top of content areas), which were not part of the same micro frontend. Teams used the CSS property z-index to control the stacking order, often relying on magic numbers—arbitrary values that weren’t documented or standardized. This approach did not scale as the project grew. It led to issues that took effort to fix, as cross-team collaboration was needed.

The issue was eventually addressed with design tokens, and I think it makes a good example for introducing the concept. The respective token file might have looked similar to this:

{
  "z-index": {
    "$type": "number",
    "default": {
      "$value": 1
    },
    "sticky": {
      "$value": 100
    },
    "navigation": {
      "$value": 200
    },
    "spinner": {
      "$value": 300
    },
    "toast": {
      "$value": 400
    },
    "modal": {
      "$value": 500
    }
  }
}

The design tokens above represent the set of z-index values that can be used in the application and the name gives developers a good idea of where to use them. A token file like this can be integrated into the designers’ workflow and also be used to generate code, in a format that each team requires. For example, in this case, the token file might have been used to generate CSS or SCSS variables:

css

:root {
    --z-index-default: 1;
    --z-index-sticky: 100;
    --z-index-navigation: 200;
    --z-index-spinner: 300;
    --z-index-toast: 400;
    --z-index-modal: 500;
  }

scss

$z-index-default: 1;
$z-index-sticky: 100;
$z-index-navigation: 200;
$z-index-spinner: 300;
$z-index-toast: 400;
$z-index-modal: 500;

What are design tokens?

Salesforce originally introduced design tokens to streamline design updates to multiple platforms.

The Design Tokens Community Group describes design tokens as “a methodology for expressing design decisions in a platform-agnostic way so that they can be shared across different disciplines, tools, and technologies”.

Let’s break this down:

Cross-Disciplinary Collaboration: Design tokens act as a common language that aligns designers, developers, product managers, and other disciplines. By offering a single source of truth for design decisions, they ensure that everyone involved in the product life cycle is on the same page, leading to more efficient workflows.

Tool integration: Design tokens can be integrated into various design and development tools, including UI design software, token editors, translation tools (code generators), and documentation systems. This enables design updates to be quickly reflected in the code base and are synchronized across teams.

Technology adaptability: Design tokens can be translated into different technologies like CSS, SASS, and JavaScript for the web, and even used on native platforms like Android and iOS. This flexibility enables design consistency across a variety of platforms and devices.

Establishing a single source of truth

A key benefit of design tokens is their ability to serve as a single source of truth for both design and engineering teams. This ensures that multiple products or services maintain visual and functional consistency.

A translation tool takes one or more design token files as input and generates platform-specific code as output. Some translation tools can also produce documentation for the design tokens in the form of HTML. At the time of writing, popular translation tools include Style Dictionary, Theo, Diez or Specify App.


Figure 1: Translation tool

Automated design token distribution

In this section, we’ll explore how to automate the distribution of design tokens to product teams.

Let’s assume our goal is to provide teams with updated, tech-specific design tokens immediately after a designer makes a change. To achieve this, we can automate the translation and distribution process using a deployment pipeline for design tokens. Besides platform-specific code artifacts (like CSS for the web, XML for Android etc.), the pipeline might also deploy the documentation for the design tokens.

One crucial requirement is keeping design tokens under version control. Thankfully, plugins for popular design tools like Figma already integrate with Git providers like GitHub. It's considered best practice to use the Git repository as the single source of truth for design tokens—not the design tool itself. However, this requires the plugin to support syncing both ways between the repository and the design tool, which not all plugins do. As of now, Tokens Studio is a plugin that offers this bidirectional syncing. For detailed guidance on integrating Tokens Studio with different Git providers, please refer to their documentation. The tool enables you to configure a target branch and supports a trunk-based as well as a pull-request-based workflow.

Once the tokens are under version control, we can set up a deployment pipeline to build and deploy the artifacts needed by the product teams, which include platform-specific source code and documentation. The source code is typically packaged as a library and distributed via an artifact registry. This approach gives product teams control over the upgrade cycle. They can adopt updated styles by simply updating their dependencies. These updates may also be applied indirectly through updates of component libraries that use the token-based styles.


Figure 2: Automated design token distribution

This overall setup has allowed teams at Thoughtworks to roll out smaller design changes across multiple front-ends and teams in a single day.

Fully automated pipeline

The most straightforward way to design the pipeline would be a fully automated trunk-based workflow. In this setup, all changes pushed to the main branch will be immediately deployed as long as they pass the automated quality gates.

Such a pipeline might consist of the following jobs:

  • Check: Validate the design token files using a design token validator or a JSON validator.
  • Build: Use a translation tool like Style Dictionary to convert design token files into platform-specific formats. This job might also build the docs using the translation tool or by integrating a dedicated documentation tool.
  • Test: This job is highly dependent on the testing strategy. Although some tests can be done using the design token file directly (like checking the color contrast), a common approach is to test the generated code using a documentation tool such as Storybook. Storybook has excellent test support for visual regression tests, accessibility tests, interaction tests, and other test types.
  • Publish: Publish updated tokens to a package manager (for example, npm). The release process and versioning can be fully automated with a package publishing tool that is based on Conventional Commits like semantic-release. semantic-release also allows the deployment of packages to multiple platforms. The publish job might also deploy documentation for the design tokens.
  • Notify: Inform teams of the new token version via email or chat, so that they can update their dependencies.

Figure 3: Fully automated deployment pipeline

Pipeline including manual approval

Sometimes fully automated quality gates are not sufficient. If a manual review is required before publishing, a common approach is to deploy an updated version of the documentation with the latest design token to a preview environment (a temporary environment).

If a tool like Storybook is used, this preview might contain not only the design tokens but also show them integrated with the components used in the application.

An approval process can be implemented via a pull-request workflow. Or, it can be a manual approval / deployment step in the pipeline.


Figure 4: Deployment pipeline with manual approval

Organizing tokens in layers

As discussed earlier, design tokens represent design decisions as data. However, not all decisions operate at the same level of detail. Instead, ideally, general design decisions guide more specific ones. Organizing tokens (or design decisions) into layers allows designers to make decisions at the right level of abstraction, supporting consistency and scalability.

For instance, making individual color choices for every new component isn’t practical. Instead, it’s more efficient to define a foundational color palette and then decide how and where those colors are applied. This approach reduces the number of decisions while maintaining a consistent look and feel.

There are three key types of design decisions for which design tokens are used. They build on top of one another:

  • What design options are available to use?
  • How are those styles applied across the user interface?
  • Where exactly are those styles applied (in which components)?

There are various names for these three types of tokens (as usual, naming is the hard part). In this article, we’ll use the terms proposed by Samantha Gordashko: option tokens, decision tokens and component tokens.

Let’s use our color example to illustrate how design tokens can answer the three questions above.

Option tokens: defining what design options are provided

Option tokens (also called primitive tokens, base tokens, core tokens, foundation tokens or reference tokens) define what styles can be used in the application. They define things like color palettes, spacing/sizing scales or font families. Not all of them are necessarily used in the application, but they present reasonable options.

Using our example, let’s assume we have a color palette with 9 shades for each color, ranging from very light to highly saturated. Below, we define the blue tones and grey tones as option-tokens:

{
  "color": {
    "$type": "color",
    "options": {
      "blue-100": {"$value": "#e0f2ff"},
      "blue-200": {"$value": "#cae8ff"},
      "blue-300": {"$value": "#b5deff"},
      "blue-400": {"$value": "#96cefd"},
      "blue-500": {"$value": "#78bbfa"},
      "blue-600": {"$value": "#59a7f6"},
      "blue-700": {"$value": "#3892f3"},
      "blue-800": {"$value": "#147af3"},
      "blue-900": {"$value": "#0265dc"},
      "grey-100": {"$value": "#f8f8f8"},
      "grey-200": {"$value": "#e6e6e6"},
      "grey-300": {"$value": "#d5d5d5"},
      "grey-400": {"$value": "#b1b1b1"},
      "grey-500": {"$value": "#909090"},
      "grey-600": {"$value": "#6d6d6d"},
      "grey-700": {"$value": "#464646"},
      "grey-800": {"$value": "#222222"},
      "grey-900": {"$value": "#000000"},
      "white": {"$value": "#ffffff"}
    }
  }
}

Although it’s highly useful to have reasonable options, option tokens fall short of being sufficient for guiding developers on how and where to apply them.

Decision tokens: defining how styles are applied

Decision tokens (also called semantic tokens or system tokens) specify how those style options should be applied contextually across the UI.

In the context of our color example, they might include decisions like the following:

  • grey-100 is used as a surface color.
  • grey-200 is used for the background of disabled elements.
  • grey-400 is used for the text of disabled elements.
  • grey-900 is used as a default color for text.
  • blue-900 is used as an accent color.
  • white is used for text on accent color backgrounds.

The corresponding decision token file would look like this:

{
  "color": {
    "$type": "color",
    "decisions": {
      "surface": {
        "$value": "{color.options.grey-100}",
        "description": "Used as a surface color."
      },
      "background-disabled": {
        "$value": "{color.options.grey-200}",
        "description":"Used for the background of disabled elements."
      },
      "text-disabled": {
        "$value": "{color.options.grey-400}",
        "description": "Used for the text of disabled elements."
      },
      "text": {
        "$value": "{color.options.grey-900}",
        "description": "Used as default text color."
      },
      "accent": {
        "$value": "{color.options.blue-900}",
        "description": "Used as an accent color."
      },
      "text-on-accent": {
        "$value": "{color.options.white}",
        "description": "Used for text on accent color backgrounds."
      }
    }
  }
}

As a developer, I would mostly be interested in the decisions, not the options. For example, color tokens typically contain a long list of options (a color palette), while very few of those options are actually used in the application. The tokens that are actually relevant when deciding which styles to apply would usually be the decision tokens.

Decision tokens use references to the option tokens. I think of organizing tokens this way as a layered architecture. In other articles, I have often seen the term tier being used, but I think layer is the better word, as there is no physical separation implied. The diagram below visualizes the two layers we talked about so far:


Figure 5: 2-layer pattern

Component tokens: defining where styles are applied

Component tokens (or component-specific tokens) map the decision tokens to specific parts of the UI. They show where styles are applied.

The term component in the context of design tokens does not always map to the technical term component. For example, a button might be implemented as a UI component in some applications, while other applications just use the button HTML element instead. Component tokens could be used in both cases.

Component tokens can be organised in a group referencing multiple decision tokens. In our example, these references might include text and background colors for different variants of the button (primary, secondary) as well as disabled buttons. They might also include references to tokens of other types (spacing/sizing, borders etc.), which I'll omit in the following example:

{
  "button": {
    "primary": {
      "background": {
        "$value": "{color.decisions.accent}"
      },
      "text": {
        "$value": "{color.decisions.text-on-accent}"
      }
    },
    "secondary": {
      "background": {
        "$value": "{color.decisions.surface}"
      },
      "text": {
        "$value": "{color.decisions.text}"
      }
    },
    "background-disabled": {
      "$value": "{color.decisions.background-disabled}"
    },
    "text-disabled": {
      "$value": "{color.decisions.text-disabled}"
    }
  }
}

To some degree, component tokens are simply the result of applying decisions to specific components. However, as this example shows, this process isn’t always straightforward—especially for developers without design experience. While decision tokens can offer a general sense of which styles to use in a given context, component tokens provide additional clarity.


Figure 6: 3-layer pattern

Note: there may be “snowflake” situations where layers are skipped. For example, it might not be possible to define a general decision for every single component token, or those decisions might not have been made yet (for example at the beginning of a project).

How many layers shall I use?

Two or three layers are quite common amongst the bigger design systems.

However, even a single layer of design tokens already greatly limits the day-to-day decisions that need to be made. For example, just deciding what units to use for spacing and sizing became a somewhat nontrivial task with up to 43 units for length implemented in some browsers (if I counted correctly).

A three-layer architecture should offer the best developer experience. However, it also increases maintenance effort and token count, as new tokens are introduced with each new component. This can result in a larger code base and heavier package size.

Starting with two layers (option and decision tokens) can be a good idea for projects where the major design decisions are already in place and/or relatively stable. A third layer can still be added if there is a clear need.

An additional component layer makes it easier for designers to change decisions later or let them evolve over time. This flexibility could be a driving force for a three-layer architecture. In some cases, it might even make sense to start with component tokens and to add the other layers later on.

Ultimately, the number of layers depends on your project's needs and how much flexibility and scalability are required.

Token scope

I already mentioned that while option tokens are very helpful to designers, they might not be relevant for application developers using the platform-specific code artifacts. Application developers will typically be more interested in the decision/component tokens.

Although token scope is not yet included in the design token spec, some design systems already separate tokens into private (also called internal) and public (also called global) tokens. For example, the Salesforce Lightning Design System introduced a flag for each token. There are various reasons why this can be a good idea:

  • it guides developers on which tokens to use
  • fewer options provide a better developer experience
  • it reduces the file size as not all tokens need to be included
  • private/internal tokens can be changed or removed without breaking changes

A downside of making option tokens private is that developers would rely on designers to always make those styles available as decision or component tokens. This could become an issue in case of limited availability of the designers or if not all decisions are available, for example at the start of a project.

Unfortunately, there is no standardized solution yet for implementing scope for design tokens. So the approach depends on the tool-chain of the project and will most likely need some custom code.

File-based scope

Using Style Dictionary, we can use a filter to expose only public tokens. The most straightforward approach would be to filter on the file ending. If we use different file endings for component, decision and option tokens, we can use a filter on the file path, for example, to make the option tokens layer private.

Style Dictionary config

const styleDictionary = new StyleDictionary({
    "source": ["color.options.json", "color.decisions.json"],
    "platforms": {
      "css": {
        "transformGroup": "css",
        "files": [
          {
            "destination": "variables.css",
            "filter": token => !token.filePath.endsWith('options.json'),
            "format": "css/variables"
          }
        ]
      }
    }
  });

The resulting CSS variables would contain only these decision tokens, and not the option tokens.

Generated CSS variables

:root {
    --color-decisions-surface: #f8f8f8;
    --color-decisions-background-disabled: #e6e6e6;
    --color-decisions-text-disabled: #b1b1b1;
    --color-decisions-text: #000000;
    --color-decisions-accent: #0265dc;
    --color-decisions-text-on-accent: #ffffff;
  }

A more flexible approach

If more flexibility is needed, it might be preferable to add a scope flag to each token and to filter based on this flag:

Style Dictionary config

 const styleDictionary = new StyleDictionary({
    "source": ["color.options.json", "color.decisions.json"],
    "platforms": {
      "css": {
        "transformGroup": "css",
        "files": [
          {
            "destination": "variables.css",
            "filter": {
              "public": true
            },
            "format": "css/variables"
          }
        ]
      }
    }
  });

If we then add the flag to the decision tokens, the resulting CSS would be the same as above:

Tokens with scope flag

 {
    "color": {
      "$type": "color",
      "decisions": {
        "surface": {
          "$value": "{color.options.grey-100}",
          "description": "Used as a surface color.",
          "public": true
        },
        "background-disabled": {
          "$value": "{color.options.grey-200}",
          "description":"Used for the background of disabled elements.",
          "public": true
        },
        "text-disabled": {
          "$value": "{color.options.grey-400}",
          "description": "Used for the text of disabled elements.",
          "public": true
        },
        "text": {
          "$value": "{color.options.grey-900}",
          "description": "Used as default text color.",
          "public": true
        },
        "accent": {
          "$value": "{color.options.blue-900}",
          "description": "Used as an accent color.",
          "public": true
        },
        "text-on-accent": {
          "$value": "{color.options.white}",
          "description": "Used for text on accent color backgrounds.",
          "public": true
        }
      }
    }
  }

Generated CSS variables

:root {
    --color-decisions-surface: #f8f8f8;
    --color-decisions-background-disabled: #e6e6e6;
    --color-decisions-text-disabled: #b1b1b1;
    --color-decisions-text: #000000;
    --color-decisions-accent: #0265dc;
    --color-decisions-text-on-accent: #ffffff;
  }

Such flags can now also be set through the Figma UI (if using Figma variables as a source of truth for design tokens), where this is available as the hiddenFromPublishing flag via the Plugins API.

Should I use design tokens?

Design tokens offer significant benefits for modern UI architecture, but they may not be the right fit for every project.

Benefits include:

  • Improved lead time for design changes
  • Consistent design language and UI architecture across platforms and technologies
  • Design tokens being relatively lightweight from an implementation point of view

Drawbacks include:

  • Initial effort for automation
  • Designers might have to (to some degree) interact with Git
  • Standardization is still in progress

Consider the following when deciding whether to adopt design tokens:

When to use design tokens

  1. Multi-Platform or Multi-Application Environments: When working across multiple platforms (web, iOS, Android…) or maintaining several applications or frontends, design tokens ensure a consistent design language across all of them.
  2. Frequent Design Changes: For environments with regular design updates, design tokens provide a structured way to manage and propagate changes efficiently.
  3. Large Teams: For teams with many designers and developers, design tokens facilitate collaboration.
  4. Automated Workflows: If you’re familiar with CI/CD pipelines, the effort to add a design token pipeline is relatively low. There are also commercial offerings.

When design tokens might not be necessary

  1. Small projects: For smaller projects with limited scope and minimal design complexity, the overhead of managing design tokens might not be worth the effort.
  2. No issues with design changes: If the speed of design changes, consistency, and collaboration between design and engineering are not pain points, then you might not need design tokens either.

Acknowledgments

Thanks to Berni Ruoff—I don't think I would have written this article without all the great discussions we had about design systems and design tokens (and for giving feedback on the first draft). Thanks to Shawn Lukas, Jeen Suratriyanont, Mansab Uppal and of course Martin for all the feedback on the subsequent drafts.

Growing the development forest - with Martin Fowler

Mike's Notes

This interview with Martin Fowler was in a recent Refactoring Newsletter.

Resources

Growing the development forest - with Martin Fowler

By: Luca Rossi
24/01/2024

Martin is chief scientist at ThoughtWorks. He is one of the original signatories of the Agile Manifesto and the author of several legendary books, among them Refactoring, which shares its name with this podcast and this newsletter.

With Martin, we talked about the impact of AI on software development: from the development process, to how human learning and understanding change, to the future of software engineering jobs.

Then we explored the technical debt metaphor, why it has been so successful, and Martin's own advice on dealing with it. And finally, we talked about the state of Agile, the resistance that still exists today towards many Agile practices and how to measure engineering effectiveness.

(03:29) Introduction
(05:20) Development cycle with AI
(08:36) Less control and reduced learning
(13:11) Splitting task between Human and AI
(14:48) The skills shift
(20:17) Betting on new technologies
(27:22) Martin's Refactoring and technical debt
(29:24) Accumulating "cruft"
(33:14) Dealing with "cruft"
(37:24) The financial value of refactoring
(42:04) Measuring performances
(46:19) Why the "forest" didn't spread
(56:11) Make the forest appealing

Show notes / useful links:

Feature Flags Transform Your Product Development Workflow

Mike's Notes

Ben Nadel wrote a great book on Feature Flags. He has now made the online version free to read.

Ben is also very generous in sharing his CFML code, clearly explaining how it works, and answering questions. I learn a lot from Ben.

There is also a playground demo to play with feature flags.

He also has a lot of helpful YouTube videos in which he explains a lot of the code.

Resources

Feature Flags Playground Demo


The four kinds of optimisation

Mike's Notes

This is an excerpt from an article referenced in a recent issue of Quastor.

It's a thoughtful article about ways to improve software performance.

Laurence Tratt is a programmer and the Shopify / Royal Academy of Engineering Research Chair in Language Engineering in the Department of Informatics at King’s College London, where he leads the Software Development Team.

Resources

Four Kinds of Optimisation

By: Laurence Tratt

Blog Post: 14/11/2023

Premature optimisation might be the root of all evil, but overdue optimisation is the root of all frustration. No matter how fast hardware becomes, we find it easy to write programs which run too slow. Often this is not immediately apparent. Users can go for years without considering a program’s performance to be an issue before it suddenly becomes so — often in the space of a single working day.

I have devoted more of my life to optimisation than I care to think about, and that experience has led me to make two observations:

  • Human optimism leads us to believe that we can easily know where a program spends most of its time.
  • Human optimism leads us to believe that we can easily know how to make the slow parts of a program run faster.

You will not be surprised to learn that I think both forms of optimism misplaced. Partly this is because, as hardware and software have become more sophisticated, it has become harder to understand their effects on performance. But, perhaps more fundamentally, we tend to overestimate how much we know about the software we’re working on. We overemphasise the parts of the system we’ve personally worked on, particularly those we’ve most recently worked on. We downplay other parts of the system, including the impact of dependencies (e.g. libraries).

The solution to the first of these observations is fairly widely known — one should rigorously profile a program before assuming one knows where it is spending the majority of its time. I deliberately say “rigorously profile” because people often confuse “I have profiled a program once” with “I have built up a good model of a program’s performance in a variety of situations”. Sometimes, a quick profiling job is adequate, but it can also mislead. Often it is necessary to profile a program with different inputs, sometimes on different machines or network configurations, and to use a variety of sampling and non-sampling approaches [1].
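
As a minimal sketch of a single deterministic profiling pass (the function and data names here are made up for illustration; sampling profilers such as py-spy give a complementary view), Python's built-in cProfile can be a starting point:

import cProfile
import pstats
import random

def work(data):
  # Hypothetical hot function standing in for the real program under study.
  return sorted(data)

data = [random.random() for _ in range(100_000)]
cProfile.run("work(data)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)

A single run like this is only one data point; the "rigorous" picture comes from repeating it across inputs, machines, and profiler types.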

However, the multiple solutions to the second observation, and their inevitable trade-offs, are, I believe, underappreciated. I tend to think that there are four main solutions:

  • Use a better algorithm.
  • Use a better data-structure.
  • Use a lower-level system.
  • Accept a less precise solution.

In the rest of this post I’m going to go through each of these and give some suggestions for the trade-offs involved.

Use a better algorithm

Let’s imagine – and I’ve genuinely seen this happen! – that after careful profiling of a Python program, I find that I’m spending most of my time in a function which looks like this:

def f1(l):
  while True:
    c = False
    for i in range(0, len(l) - 1):
      if l[i+1] < l[i]:
        t = l[i]
        l[i] = l[i+1]
        l[i+1] = t
        c = True
    if not c: return l

It’s a bubble sort! At this point, many people will start guffawing, because it’s an obviously slow way of sorting elements. However, bubble sort has an often-forgotten advantage over many “better” algorithms: it runs in constant memory [2]. I could gamble that my program doesn’t need to use constant memory, but if I’m unsure, I can use an alternative algorithm which preserves this property. Let’s try a selection sort:

def f2(l):
  for i in range(0, len(l) - 1):
    m = i
    for j in range(i + 1, len(l)):
      if l[j] < l[m]: m = j
    if m != i:
      t = l[i]
      l[i] = l[m]
      l[m] = t
  return l

If I use this quick testing code:

import random, time
l = [random.random() for _ in range(1000)]
before = time.time()
l1 = f1(l[:])
print(time.time() - before)
before = time.time()
l2 = f2(l[:])
print(time.time() - before)

and run it on CPython 3.11 on a Linux server I consistently get timings along the lines of:

0.0643463134765625
0.020025014877319336

In other words, selection sort is about three times faster than bubble sort in my test.

You don’t need me to tell you that selection sort isn’t the fastest possible sorting algorithm, but “fastest” is a more slippery concept than it first appears. For example, the selection sort algorithm above is faster than the bubble sort for random data, but the bubble sort is much faster for sorted data [3]. The relationship between inputs and algorithmic performance can be subtle. Famously, if you choose an unfortunate “pivot” when implementing quicksort, you’ll find that it is very non-quick (e.g. you can make it as slow on already-sorted data as the selection sort above).
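
To see this in practice, the same harness can be rerun on already-sorted input (a quick sketch reusing f1 and f2 from above): the early-exit flag in f1 means it finishes after a single pass, while f2 still performs its full quadratic scan.

import random, time
l = sorted(random.random() for _ in range(1000))  # already-sorted input
for f in (f1, f2):
  before = time.time()
  f(l[:])
  print(f.__name__, time.time() - before)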

We can generalise from this that “use a better algorithm” requires understanding the wider context of your system and the nature of the algorithm you’re thinking of using. For example, I’ve often seen people conflate an algorithm’s best-case, average-case, and worst-case performance — but the differences between those three pieces of information can be vital when I’m optimising a program. Sometimes I might know something about my program (e.g. the nature of its inputs) that makes me confident that the worst case can’t happen, or I don’t consider the worst case to be a problem (e.g. it’s a batch job and no-one will notice occasional latency). But, generally, I care more about the worst case than the best case, and I select algorithms accordingly.

It’s also not uncommon that algorithms that have good theoretical performance have poor real-world performance (big O notation can hide many sins). If in doubt, I try gradually more test data until I feel I have truly understood the practical consequences of different choices.

It’s also easy to overlook complexity. Fundamentally, faster algorithms are faster because they observe that some steps in a calculation can be side-stepped. I can still remember the first time I read the description for timsort: the beauty of its algorithmic observations has stayed with me ever since. But verifying those observations is harder than we imagine — even timsort, created by one of the greatest programmers I have ever come across, had a subtle bug in it [4].

When we mortals implement faster algorithms, they are often slightly incorrect, particularly when newly implemented, either producing wrong results or not having the expected performance characteristics [5]. For example, parallelising an algorithm can often lead to huge speedups, particularly as CPUs gain more cores, but how many of us understand the C11 memory model well enough to feel confident of the consequences of parallelisation?

The combination of (in)correctness and the difficulty in understanding the context in which an algorithm is fast means that I frequently encourage people to start with a simple algorithm and only move to something “faster” if they really find they need to. Picking (and, if necessary, implementing) the right algorithm for the task at hand is a surprisingly difficult skill!

Use a better data-structure

Let’s imagine that I profile another program and find that I spend most of my time in the following function:

def f3(l, e):
  for x in l:
    if x == e: return True
  return False

It’s an existence check function! Optimising these can be quite interesting, because my choices will depend on how the lists passed to this function are used. I could change the list to a binary tree, for example. But if I can tell, as is not uncommon, that we repeatedly check for the existence of elements in a list that is never mutated after initial creation, I might be able to get away with a very simple data-structure: a sorted list. That might sound odd, because “sorted list” doesn’t sound like much of a data-structure, but that then allows me to do a binary search. For anything but the smallest lists [6], binary search is much quicker than the linear search above.
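
As a sketch of what that looks like in Python (not code from the original post; f4 simply continues the post's naming), the standard library's bisect module does the binary search for us, provided the list really is kept sorted:

import bisect

def f4(sorted_l, e):
  # Existence check on a list that is sorted once and never mutated afterwards.
  i = bisect.bisect_left(sorted_l, e)
  return i < len(sorted_l) and sorted_l[i] == e

l = sorted([9, 3, 7, 1])   # pay the sorting cost once, at creation time
print(f4(l, 7), f4(l, 8))  # True False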

Just as with “use a better algorithm”, “use a better data-structure” requires careful thought and measurement [7]. In general, while I often find it necessary to implement my own “better algorithms”, I rarely find it necessary to implement my own “better data-structures”. Partly this is laziness on my part, but it’s mostly because data-structures are more easily packaged in a library than better algorithms [8].

There is an important tactical variant on “better data-structures” that is perhaps best thought of as “put your structs/classes on a diet”. If a program is allocating vast numbers of a given struct/class, the size of that struct/class in bytes can become a significant cost in its own right. When I was working on error recovery in grmtools, I found that simply reducing the most commonly allocated struct by 8 bytes in size improved total program performance by 5% — a trick that, from memory, I repeated twice!
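
The grmtools change itself is Rust-specific, but the "diet" idea has a rough Python analogue (a sketch, with a made-up Edge class): declaring __slots__ drops the per-instance __dict__, shrinking every allocated object.

import sys

class Edge:  # hypothetical commonly-allocated class
  def __init__(self, src, dst):
    self.src = src
    self.dst = dst

class SlimEdge:
  __slots__ = ("src", "dst")  # no per-instance __dict__
  def __init__(self, src, dst):
    self.src = src
    self.dst = dst

e, s = Edge(0, 1), SlimEdge(0, 1)
print(sys.getsizeof(e) + sys.getsizeof(e.__dict__))  # object plus its attribute dict
print(sys.getsizeof(s))                              # noticeably smaller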

There are many similar tactics to this, for example reducing “pointer chasing” (typically by folding multiple structs/classes into one), encouraging memory locality and so on. However, while it’s easy to measure the size of a struct/class and how often it’s allocated, it’s difficult to measure the indirect impact of things like memory locality — I have heard such factors blamed for poor performance much more often than I have seen such factors proven as responsible for poor performance. In general, I only look to such factors when I’m getting desperate.

Use a lower-level system

A time-honoured tradition is to rewrite parts of a program in a lower-level programming language. Let’s rewrite our Python bubble sort into Rust:

use std::cmp::PartialOrd;
fn f1<T: PartialOrd + Copy>(l: &mut Vec<T>) {
  loop {
    let mut c = false;
    for i in 0..l.len() - 1 {
      if l[i + 1] < l[i] {
        let t = l[i];
        l[i] = l[i + 1];
        l[i + 1] = t;
        c = true;
      }
    }
    if !c {
      return;
    }
  }
}

I mildly adapted my Python program from earlier to save out 1000 random floating point numbers, and added this testing code in Rust:

use std::{env::args, fs::read_to_string, time::Instant};
fn main() {
  let mut l = read_to_string(args().nth(1).unwrap())
    .unwrap()
    .lines()
    .map(|x| x.parse::<f64>().unwrap())
    .collect::<Vec<f64>>();
  let before = Instant::now();
  f1(&mut l);
  println!("{}", (Instant::now() - before).as_secs_f64());
}

My Rust bubble sort runs in 0.001s, about 60x faster than the Python version. This looks like a great success for “rewrite in a lower-level programming language” — but you may have noticed that I titled this section “Use a lower-level system”.

Instead of spending 15 minutes writing the Rust code, it would have been smarter of me to recognise that my Python bubble sort is likely to emphasise CPython’s (the most common implementation of Python) weaknesses. In particular, CPython will represent what I conceptually thought of as a list of floating point numbers as an array of pointers to individually heap-allocated Python objects. That representation has the virtue of generality but not efficiency.
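
The cost of that representation is easy to make visible from within CPython itself; a rough sketch (not from the original post) comparing a list of boxed float objects with the array module's packed doubles:

import array, random, sys

floats = [random.random() for _ in range(1000)]
packed = array.array("d", floats)  # 8 bytes per double, stored inline

list_bytes = sys.getsizeof(floats) + sum(sys.getsizeof(x) for x in floats)
print(list_bytes)             # the pointer array plus one heap object per float
print(sys.getsizeof(packed))  # roughly 8 * 1000 bytes plus a small header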

Although it’s often forgotten, CPython isn’t the only implementation of Python. Amongst the alternatives is PyPy, which just so happens to represent lists of floats as efficiently as Rust. Simply typing pypy instead of python speeds my bubble sort up by 4x! There are few changes I can make that give me such a big performance improvement for such little effort. That’s not to say that PyPy runs my program as fast as Rust (PyPy is still about 15x slower) but it may well be fast enough, which is what really matters.

I have seen multiple organisations make the mistake of trying to solve performance problems by rewriting their software in lower-level programming languages, when they would have got sufficient benefit from working out how to run their existing software a little faster. There are often multiple things one can do here, from using different language implementations, to checking that you’ve got compiler optimisations turned on [9], to using faster libraries or databases, and so on. Sometimes rewriting in a lower-level programming language really is the right thing to do, but it is rarely a quick job, and it inevitably introduces a period of instability while bugs are shaken out of the new version.

Accept a less precise solution

A common problem we face is that we have n elements of something and we want to understand the best subset or ordering of those for our situation. Let’s imagine that I’ve implemented a compiler and 30 separate optimisation passes. I know that some optimisation passes are more effective if they run after other optimisation passes, but I don’t know what the most effective ordering of all the passes is.

I could write a program to enumerate all the permutations of those 30 passes, run them against a benchmark suite I possess, and then select the fastest permutation. But if my benchmark suite takes 1 second to run then it will take roughly 2⁸² years to evaluate all the possibilities — which is rather longer than the current age of the universe. Clearly I can’t wait that long for an answer: I can only run a subset of all the permutations. In situations such as this, I have to accept that I’ll never be able to know for sure what the best possible answer is: but, that said, I can at least make sure I end up with a better answer than not trying anything at all.
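
The arithmetic behind that estimate is quick to check (a sketch, not part of the original post):

import math

orderings = math.factorial(30)            # ~2.65e32 permutations of 30 passes
years = orderings / (60 * 60 * 24 * 365)  # at one second per benchmark run
print(f"{years:.2e} years")               # ~8e24 years; the universe is ~1.4e10 years old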

There are various ways of tackling this but most boil down to local search. In essence, we define a metric (in our running example, how fast our benchmark suite runs) that allows us to compare two solutions (in our case, faster is better) and discard the worst. We then need a way of generating a neighbour solution to the one we already have, at which point we recalculate the metric and discard the worse of the old and new solution. After either a fixed time-limit, or if we can’t find solutions which improve our metric, we return the best solution we’ve found. The effectiveness of this simple technique (the core algorithm is a few lines of code) tends to stun newcomers, since the obvious problem of local optima seems like it should undermine the whole idea.
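
A sketch of that core loop (illustrative only; run_benchmarks is an assumed function returning the suite's runtime for a given pass ordering, and a fixed iteration budget stands in for a time limit):

import random

def local_search(passes, run_benchmarks, iterations=1000):
  best = list(passes)
  best_score = run_benchmarks(best)          # lower runtime is better
  for _ in range(iterations):
    neighbour = best[:]
    i, j = random.sample(range(len(neighbour)), 2)
    neighbour[i], neighbour[j] = neighbour[j], neighbour[i]  # swap two passes
    score = run_benchmarks(neighbour)
    if score < best_score:                   # keep the better ordering, discard the worse
      best, best_score = neighbour, score
  return best, best_score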

As typically implemented, local search as I’ve outlined it above produces correct but possibly non-optimal solutions. Sometimes, however, we’re prepared to accept an answer which is less precise in the sense that it is possibly “incorrect”. By this I don’t mean that the program is buggy, but that the program may deliberately produce outputs that do not fully match what we would consider the “full and proper” answer.

Exactly what constitutes “correct” varies from one situation to another. For example, fast inverse square root approximates the reciprocal of a square root (1/√x): for situations such as games, its fast nearly-correct answer is a better trade-off than a slow definitely-correct answer. A Bloom filter can give false positives: accepting that possibility allows it to be exceptionally frugal with memory. JPEG image compression deliberately throws away some of an image’s fine details in order to make the image more compressible. Unlike lossless image compression approaches, I cannot recover the original image perfectly from a JPEG, but by forgoing a little bit of image quality, I end up with much smaller files to transmit.
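
As a toy illustration of the Bloom filter trade-off (a minimal sketch, not production code): membership answers may be falsely positive, but the whole set fits in a small fixed-size bit array.

import hashlib

class BloomFilter:
  def __init__(self, size_bits=8192, hashes=4):
    self.size, self.hashes = size_bits, hashes
    self.bits = bytearray(size_bits // 8)

  def _positions(self, item):
    for i in range(self.hashes):
      h = hashlib.sha256(f"{i}:{item}".encode()).digest()
      yield int.from_bytes(h[:8], "big") % self.size

  def add(self, item):
    for p in self._positions(item):
      self.bits[p // 8] |= 1 << (p % 8)

  def __contains__(self, item):
    # "not in" is definite; "in" may (rarely) be a false positive.
    return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("apple")
print("apple" in bf, "pear" in bf)  # True, (almost certainly) False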

I think that, in general, most programmers struggle to accept that correctness can sometimes be traded off — personally, it offends a deep internal conviction of mine that programs should be correct. Probably because of that, I think the technique is used less often than it should be.

Recently, though, we’ve become much more willing to accept incorrect answers thanks to the explosion of ML (Machine Learning). Whereas local search requires us to explicitly state how to create new solutions, ML is trained on previous data, and then generates new solutions from that data. This can be a very powerful technique, but ML’s inevitable “hallucinations” are really just a form of incorrectness.

We can thus see that there are two different ways of accepting imprecise solutions: possibly non-optimal; and possibly incorrect. I’ve come to realise that many people think they’re the same thing, but possible incorrectness more often causes problems. I might be happy trading off a bit of image-quality for better compression, but if an ML system rewrites my code and leaves off a “not” I’m unhappy. My rule of thumb is that unless you are convinced you can tolerate incorrectness, you’re best off assuming that you can’t.

Summary

I’ve listed the four optimisation approaches above in order of the frequency with which I’ve seen them used (from most to least used).

It will probably not surprise you that my least favourite approach is “rewrite in a lower-level programming language”, in the sense that it tends to offer the poorest ratio of improvement/cost. That doesn’t mean that it’s always the wrong approach, but we tend to reach for it before we’ve adequately considered cheaper alternatives. In contrast, I think that until recently we have too rarely reached for “accept a less precise solution”, though the ML explosion has rapidly changed that.

Personally, when I’m trying to optimise a program I tend to reach for the simplest tricks first. One thing that I’ve found surprises people is how often my first attempt at optimisation will be to hunt for places to use hashmaps — only rarely do I go hunting for exotic data-structures to use. I less often turn to clever algorithms. Of those clever algorithms I tend to implement myself, I suspect that binary search is the one I use the most often, and I probably do so at most once or twice a year — each time I implement it, I have to look up the correct way to do so [10]!
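
For a flavour of what "hunt for places to use hashmaps" means in practice, a sketch: replacing repeated membership tests on a list with a set (Python's hashed container) is often the entire change.

import random, time

haystack = [random.random() for _ in range(10_000)]
needles = haystack[:1000]

before = time.time()
hits = sum(1 for n in needles if n in haystack)      # linear scan per lookup
print("list:", time.time() - before)

haystack_set = set(haystack)                         # build the hash table once
before = time.time()
hits = sum(1 for n in needles if n in haystack_set)  # O(1) average per lookup
print("set: ", time.time() - before)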

Ultimately, having written this post, I’ve come to realise that there are three lessons that cut across all of the approaches.

First, when correctness can be sacrificed for performance, it’s a powerful technique — but we often sacrifice correctness for performance unintentionally. When we need to optimise a program, it’s best to use the least complex optimisation that will give us the performance we want, because that’s likely to introduce the fewest bugs.

Second, human time matters. Because we programmers enjoy complexity so much, it’s tempting for us to reach for the complex optimisations too soon. Even if they succeed in improving performance – which they often don’t! – they tend to consume much more time than is necessary for the performance improvement we needed.

Third, I think that breadth of optimisation knowledge is more important than depth of optimisation knowledge. Within each of the approaches I’ve listed in this post I have a couple of tricks that I regularly deploy. That has helped give me a reasonable intuition about what the most appropriate overall approach to my current performance woes might be, even if I don’t know the specifics.

Acknowledgements: Thanks to Carl Friedrich Bolz-Tereick and Jake Hughes for comments.

Update (2023-11-14): My original phrasing of a Bloom filter could be read in a way that seemed to be a contradiction. I’ve tweaked the phrasing to avoid this.