21 Frontend System Design Concepts for Software Engineers

Mike's Notes

An excellent list of frontend system options.

Pipi generates a static frontend as a thin wrapper.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Subscriptions > System Design
  • Home > Handbook > 

Last Updated

28/12/2025

21 Frontend System Design Concepts for Software Engineers

By: Neo Kim & Shefali Jangid
System Design: 11/11/2025

Neo: I Teach You System Design

Shefali: I write about CSS, JavaScript, web dev resources, and solo projects.

If you’re coming from the backend, you probably think the frontend is just “HTML, CSS, maybe some JavaScript.” But honestly? Modern frontend engineering has grown into something much closer to backend system design.

Just like your APIs need to be fast, scalable, and reliable, frontend apps also have to handle millions of users, load content quickly, and stay observable and secure.

This newsletter is a quick introduction to frontend system design.

We’ll take concepts you already know from the backend, like caching, deployment pipelines, observability, and security, and see how they apply in the browser.

By the end, you’ll see that the frontend isn’t just about buttons and forms. It’s about building systems that run right in the user’s browser.

Onward.

I want to introduce Shefali Jangid as a guest author.

She’s a web developer, technical writer, and content creator with a love for frontend architecture and building things that scale.

Check out her work and socials:

  • Shefali.dev
  • GitHub
  • Twitter

You’ll often find her writing about web development, sharing UI tips, and building tools that make developers’ lives easier.

Rendering & Delivery Models

One of the first things to understand is how webpages reach your users.

The way you build and load them affects how fast, reliable, and smooth your site feels. You can pre-build pages, render them on the server, build them in the browser, or mix these approaches.

Serving web pages works much like serving API responses: the trade-offs change depending on when and where the HTML gets generated.

Let’s start with pre-built pages and move to fully dynamic ones. We’ll see how each affects speed, scalability, and content freshness.

1 Static Site Generation (SSG)

Before SSG, websites worked in two fundamental ways. The server either built the page for every request, or the browser built it on the client side. That means:

  • Every request needed work to generate the page.
  • Pages could get slow if many people visit at once.
  • Caching was tricky, so scaling was hard.

SSG solves this by pre-building the HTML when you deploy your site. The system can fetch data during the build process, even for pages with dynamic content, which means all content is baked into static HTML files before any user visits them.

During the build process, the framework executes data-fetching code, queries your database, and generates complete HTML files for each route. The framework then uploads them to your CDN or hosting provider.

When users request a page, they receive a fully formed HTML document immediately, without waiting for server-side processing or client-side data fetching.

This makes SSG super fast for users because there’s no rendering delay. The trade-off is that if your content changes, you’ll need to rebuild and redeploy to update the static files, which is why SSG works best for content that doesn’t change frequently.

It’s like preparing API responses in advance; the hard work is done before anyone asks.
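
To make that concrete, here is a minimal sketch of what an SSG build step does. The routes and data are hypothetical; real frameworks like Next.js or Astro handle this for you.

```javascript
// Hypothetical routes, each with its own data-fetching function.
const routes = {
  "/": () => ({ title: "Home", body: "Welcome!" }),
  "/about": () => ({ title: "About", body: "Who we are." }),
};

// Run every route's data-fetching code once, at build time, and bake the
// result into a complete HTML document.
function buildSite() {
  const pages = {};
  for (const [path, fetchData] of Object.entries(routes)) {
    const data = fetchData(); // would hit a database or CMS in a real build
    pages[path] =
      `<html><head><title>${data.title}</title></head>` +
      `<body><h1>${data.title}</h1><p>${data.body}</p></body></html>`;
  }
  return pages; // a real build writes these out as files and uploads them to a CDN
}
```

After the build, requests never touch this code again; users just download the finished files.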

Why it matters:

  • Pages load super fast.
  • Easy to handle millions of users.
  • SEO is better because pages get fully rendered from the start.

Use case:

Documentation sites, marketing landing pages, or personal blogs where content updates happen through deployments, not user actions.

2 Incremental Static Regeneration (ISR)

Static Site Generation (SSG) is fast, but what if your content changes frequently? Rebuilding the whole site every time would be a pain.

That’s where Incremental Static Regeneration (ISR) comes in.

Pages are still pre-built, but they can update automatically without a full redeploy.

You just set a revalidation time; after that period, the next visitor triggers a background rebuild of that specific page on the server, not a full deployment. The old version loads instantly, so users don’t wait. After regeneration, the new version replaces the cached one. This occurs per page, not site-wide, allowing you to set different revalidation intervals for individual pages.

It’s like cached API responses with an expiry timer; users might glimpse an older version until it’s refreshed, but the update happens quietly behind the scenes.
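
The core logic can be sketched in a few lines. This is a simplification of what frameworks do internally; the clock is injected so the behaviour is easy to follow.

```javascript
// ISR-style serving: return the cached page instantly, and if it is older
// than `revalidateMs`, rebuild it in the background for the next visitor.
function createIsrCache(renderPage, revalidateMs, now = () => Date.now()) {
  const cache = new Map(); // path -> { html, builtAt }

  return function serve(path) {
    const entry = cache.get(path);
    if (!entry) {
      // First visit: build now (a real framework may pre-build at deploy time).
      const fresh = { html: renderPage(path), builtAt: now() };
      cache.set(path, fresh);
      return fresh.html;
    }
    if (now() - entry.builtAt > revalidateMs) {
      // Stale: kick off a background rebuild, but don't make this user wait.
      Promise.resolve().then(() => {
        cache.set(path, { html: renderPage(path), builtAt: now() });
      });
    }
    return entry.html; // the old version loads instantly
  };
}
```

Note how the stale copy is still served immediately; only a later visitor sees the regenerated page.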

Why it matters:

  • Just as fast as SSG, but the content stays fresh.
  • Perfect for dynamic websites like blogs or e-commerce sites.
  • Works with CDNs, so updates happen without downtime.

Use case:

E-commerce product pages where most content (e.g., descriptions or images) is static, but some parts, like prices or stock info, update occasionally.

ISR keeps the page fresh without full redeploys, while real-time data, such as live prices, can come from APIs.

3 Server-Side Rendering (SSR)

Server-Side Rendering (SSR) works the other way around: the server builds the page for each request. It fetches the data, generates the HTML, and sends it to the user.

Unlike Static Site Generation (SSG), which is great for mostly static pages, SSR is useful when content needs to stay fresh or personalised. For example, dashboards, user profiles, or live feeds. Because the system generates pages in real time, they always display the latest data instead of relying on pre-built files.

Think of it like a regular API endpoint; everything gets computed on demand.

Why it matters:

  • Keeps content fresh and easy to personalise.
  • Perfect for pages that need real-time data or personalised content.

Note: Under heavy traffic, SSR can slow down because it builds each page on demand, but caching can help balance load and speed.

Use case:

Social media feeds, admin dashboards, or user-specific pages where content varies by session.

4 Client-Side Rendering (CSR)

CSR means the browser does most of the work instead of the server. The server sends only a basic HTML page and some JavaScript. The browser then loads the data and builds the page on the fly.

This approach is useful when you need rich interactivity, real-time updates, or pages that change often based on user actions - things that static or server-rendered pages can’t handle easily.

Think of it like sending raw JSON and letting the client put it together.

Note: The first page load can be slower because the browser needs to download and run JavaScript before showing the content. Since pages get built in the browser, search engines might not see them immediately, so you might need extra setup like pre-rendering or server-side rendering for better SEO.

Why it matters:

  • Reduces pressure on the server.
  • Makes the app more interactive and responsive.
  • Works best for apps people use for a long time, like dashboards or editors.

Use case:

Complex apps like Figma, Notion, or Google Docs, where the app is highly interactive and users stay on the page for extended sessions.

5 Hybrid Rendering

Sometimes, one approach just isn’t enough.

Different parts of your app might have different needs. For example, some pages stay mostly the same, while others need fresh or personalised data. That’s where hybrid rendering comes in.

It mixes different strategies:

  • Server-side rendering (SSR) for pages that need live or personalised content,
  • Static site generation (SSG) for pages that rarely change,
  • And client-side rendering (CSR) for sections with lots of interactivity.

Think of it like combining pre-computed API responses with on-demand endpoints - all in the same system.

Why it matters:

  • You get the best of everything: speed, fresh content, and interactivity.
  • Allows you to choose the right approach for each page or component.
  • Reduces overloading the server while keeping content dynamic where needed.

Use case:

Large-scale apps like e-commerce platforms often combine different rendering strategies:

  • The homepage and category pages use static generation for speed.
  • Product pages use incremental static regeneration to keep content fresh.
  • User account pages use server-side rendering for personalised data.
  • The shopping cart uses client-side rendering for real-time updates without page reloads.

6 Content Delivery Networks (CDNs) & Edge Delivery

No matter which rendering method you choose, serving content efficiently is super important. CDNs keep copies of your static files on servers worldwide. This lets users download them from a nearby location instead of your main server.

This is especially useful for global audiences. For example, when someone in India visits a site hosted in the US, the CDN delivers the content from a local server, making it load much faster.

Edge rendering takes this idea a step further. Instead of just serving static files, it can actually run code or build pages at the edge, closer to the user, which reduces latency even more.

Think of it like having caches and compute nodes near your users, so requests go to a nearby server instead of your main database.

Why it matters:

  • Faster load times everywhere.
  • Easy to scale to millions of users.
  • Works perfectly with SSG, ISR, SSR, or hybrid setups.

Use case:

Any globally distributed application. Media sites like The New York Times use CDNs to serve articles instantly worldwide.

Performance & Optimisation

Now that you understand how your pages get rendered, the next obvious question is, “How quickly do they actually load?”

Even the most beautiful app can be frustrating if it takes too long to open or lags while being used. In frontend system design, speed really matters.

Let’s dive in!

7 Web Performance Metrics

To really understand your app’s speed, there are a few key metrics you should watch closely:

  • TTFB (Time to First Byte): The time it takes for your browser to get the first piece of data back from the server or CDN after making a request.
  • FCP (First Contentful Paint): The moment when something first appears on the screen, like text, an image, or a button, so the user knows the page is loading.
  • LCP (Largest Contentful Paint): The time it takes for the main part of the page, like a large image or headline, to fully appear on the screen.
  • CLS (Cumulative Layout Shift): How much the page layout jumps around while loading, like when text or buttons suddenly shift because images or ads are still loading.

These are basically the frontend versions of response time, throughput, and latency in backend systems. It’s important to keep a close eye on them; users can notice even minor delays of a few hundred milliseconds.

Why it matters:

  • You can spot slow pages before users even notice.
  • Improves engagement and reduces bounce rates.
  • Helps guide your optimisations for a smoother experience.

Use case:

E-commerce sites must optimise for LCP (product images) and CLS (avoid layout shifts during checkout). News sites focus on FCP to show headlines quickly.

8 Lazy Loading

Of course, fast pages aren’t just about metrics; they’re also about smart resource management.

Not everything on a page needs to load immediately. Lazy loading means loading heavy assets, like images, videos, or big components, only when they’re actually needed.

This works by using techniques like the Intersection Observer API or conditional imports, which tell the browser to fetch those resources only when they come into view or are triggered by user interaction.

It’s like fetching extra data from an API only when the user asks for it.
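
The decision at the heart of lazy loading can be expressed as a small helper. In a browser you would let IntersectionObserver (with a `rootMargin` like `"200px"`) make this call for you; here the geometry is passed in explicitly as a sketch.

```javascript
// Should we start loading an element yet? Load once it is within one
// preload margin below the visible area, so it's ready by the time the
// user scrolls to it.
function shouldLoad(elementTop, scrollY, viewportHeight, preloadMargin = 200) {
  return elementTop < scrollY + viewportHeight + preloadMargin;
}

// Browser wiring (illustrative):
//   new IntersectionObserver(entries => { /* swap in the real image src */ },
//                            { rootMargin: "200px" });
```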

Why it matters:

  • Cuts down the initial load time.
  • Makes the pages feel faster and smoother.
  • Saves bandwidth for users who don’t need everything immediately.

Use case:

Image-heavy sites like Pinterest or Instagram use lazy loading extensively; images below the fold don’t load until you scroll.

9 Service Workers & Caching

Once you’ve optimised loading, you can make your app faster and more reliable using service workers and caching.

Service workers are background scripts that run in a separate thread from your main web page. They can intercept network requests and cache important files or data, helping your app load faster and even work offline.

Think of them as a smart middle layer between the browser and the network; if something is already cached, it’s served instantly instead of being fetched again.
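
The most common pattern is "cache first". A real service worker does this with the Cache Storage API inside a `fetch` event handler; the sketch below captures the same logic with a plain Map and an injected fetch function.

```javascript
// Cache-first strategy: serve from cache when possible, otherwise hit the
// network and remember the result for next time.
async function cacheFirst(url, cache, fetchFn) {
  const cached = cache.get(url);
  if (cached !== undefined) return cached; // served instantly, no network
  const response = await fetchFn(url);     // fall back to the network
  cache.set(url, response);                // cache for repeat visits
  return response;
}
```

A second request for the same URL never touches the network, which is exactly why repeat visits feel instant and offline use becomes possible.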

Why it matters:

  • Speeds up repeat visits.
  • Reduces the load on servers.
  • Keeps apps usable even with poor or no internet connection.

Use case:

Progressive Web Apps like Twitter Lite or Starbucks PWA, which cache core UI and recent content, so users can browse even on unstable mobile networks.

Data & State Management

Once your UI loads quickly, the next step is to think about the data behind it.

In real apps, this data (also called state) can come from different places:

  • Some live inside a single component (a reusable piece of the UI, like a button),
  • Some are shared across the app,
  • And others come from APIs.

How you manage this state can make or break your app’s speed, reliability, and scalability.

10 State Management (Local, Global, Server Cache)

  • Local state: data that lives inside a single component, used for things like toggles, forms, or small interactions. It’s simple to manage and doesn’t add much complexity.
  • Global state: data that’s shared across multiple components or pages, like user info or theme settings. Tools like Redux, Zustand, or React Context help manage it.
  • Server cache: stores frequently used API data on the client so the app doesn’t have to fetch it again and again, making it faster and reducing server load.

Think of it like database caching: by deciding where data should live, you can make your app more responsive, reliable, and easier to scale.
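
The "global state" idea is simpler than it sounds. Below is a minimal store sketch showing the pattern behind tools like Zustand or Redux, greatly simplified: one shared state object plus subscriptions.

```javascript
// A tiny global store: components subscribe, and every state change
// notifies them so they can re-render.
function createStore(initialState) {
  let state = initialState;
  const listeners = new Set();
  return {
    getState: () => state,
    setState(partial) {
      state = { ...state, ...partial };     // shallow-merge the update
      listeners.forEach((fn) => fn(state)); // notify subscribed components
    },
    subscribe(fn) {
      listeners.add(fn);
      return () => listeners.delete(fn);    // returns an unsubscribe handle
    },
  };
}
```

Real libraries add selectors, middleware, and React integration on top, but this is the core contract.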

Why it matters:

  • Keeps your app responsive.
  • Reduces unnecessary API calls.
  • Makes scaling smoother as your app grows.

Use case:

Local state for a modal’s open/closed status. Global state for theme preference (dark mode) that affects every component. Server cache for user profile data displayed by multiple components.

11 API Caching with Expiration

Caching doesn’t stop at the component level. You can store API responses in memory, IndexedDB (a browser database for larger data), or localStorage (for smaller key-value data), and set expiration rules to make sure data stays fresh.

It’s like having a Redis cache server, but right in the browser instead of on your server.
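
A sketch of the idea, using an in-memory Map (the same pattern works with localStorage or IndexedDB for persistence; the clock is injected to keep the logic testable):

```javascript
// Client-side API cache with expiry: entries past their TTL are treated
// as misses, so the next read triggers a fresh fetch.
function createApiCache(ttlMs, now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const e = entries.get(key);
      if (!e) return undefined;
      if (now() > e.expiresAt) { // expired: drop it and report a miss
        entries.delete(key);
        return undefined;
      }
      return e.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}
```

Your data-fetching layer checks the cache first, and only calls the API on a miss.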

Why it matters:

  • Keeps data up-to-date for users.
  • Reduces repeated server requests.
  • Makes your app feel faster.

Use case:

A news app might cache articles for a few minutes so users can read offline, while comments refresh more often to stay up to date. Similarly, a SaaS dashboard could cache chart data while the user is on the page, then refresh it when they come back later.

12 GraphQL vs REST (Reducing Over/Under-Fetching)

How you fetch data also affects performance.

  • REST: Endpoints return fixed data shapes, so the client sometimes gets more data than it needs (over-fetching) or has to make additional requests to get everything (under-fetching).
  • GraphQL: A query language for APIs that lets the client ask for exactly the data it needs in a single request, avoiding both over-fetching and under-fetching.

It’s like how you optimise database queries on the backend to make them faster and use less bandwidth, but this happens on the frontend.

GraphQL sits between the client and the server as one endpoint. The client asks for exactly the data it needs, and the server’s GraphQL layer collects that data from databases or other APIs, then sends back a clean, organised response.

This way, you make one flexible request instead of several REST calls, making it faster and more data-efficient.
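
To make the contrast concrete, here is an illustrative query held in a JavaScript string. The field names are hypothetical, loosely modelled on a GitHub-style schema, and the REST paths in the comment are placeholders.

```javascript
// One GraphQL request replacing what might otherwise be three REST calls
// (e.g. GET /pulls/42, GET /pulls/42/comments, GET /users/<author>).
const pullRequestQuery = `
  query {
    pullRequest(number: 42) {
      title
      author { login }
      comments(first: 10) { body }
    }
  }
`;
// The client POSTs this to the single /graphql endpoint and gets back
// exactly these fields, nothing more.
```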

Why it matters:

  • Saves bandwidth, especially on mobile networks.
  • Reduces unnecessary requests.
  • Simplifies client-side data handling.

Use case:

GraphQL works best for complex apps that need data from many places at once, like GitHub. One GraphQL query can get a pull request, comments, and author info in a single request instead of several REST calls. REST, by contrast, is simpler and a great fit for apps with stable data, like blogs or public APIs that rely on caching.

13 Pagination Strategies (Cursor vs Offset)

Loading large lists or tables all at once can be heavy. Pagination helps break the data into manageable chunks.

  • Offset pagination: Uses page numbers or record counts (like ?page=2 or ?offset=20) to fetch data. It’s simple and works well for lists that don’t change often. But the list order shifts if new items are added or old ones are removed. This can make the same offset return different items, leading to duplicates or missing entries.
  • Cursor pagination: Uses a pointer to mark where the last item ended, so the next request starts right after it. It’s more reliable for live or frequently updated data (social feeds or chat messages) because it keeps track of the exact position in the dataset. That means even if new items are added or removed while you’re scrolling, you won’t see duplicates or miss entries.
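
The difference shows up clearly in code. A sketch over an in-memory, newest-first list, where each item's unique id doubles as the cursor:

```javascript
// Offset pagination: position is a raw index, so inserts shift the pages.
function offsetPage(items, offset, limit) {
  return items.slice(offset, offset + limit);
}

// Cursor pagination: position is "right after the last item the client saw",
// which stays correct even when new items are inserted at the front.
function cursorPage(items, afterId, limit) {
  const start =
    afterId == null ? 0 : items.findIndex((i) => i.id === afterId) + 1;
  return items.slice(start, start + limit);
}
```

If a new item lands at the top of the feed between page 1 and page 2, the offset version re-serves an item the user already saw, while the cursor version picks up exactly where they left off.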

Why it matters:

  • Handles large datasets efficiently.
  • Prevents slowdowns and performance bottlenecks.
  • Keeps dynamic lists reliable and consistent.

Use case:

  • Offset pagination: best for data tables with stable data and clear page numbers, such as admin panels or product catalogs.
  • Cursor pagination: ideal for infinite scroll feeds like social media timelines, notification lists, or any real-time list where items are frequently added or removed.

14 Real-Time Data & Networking (WebSockets, SSE, Polling)

Finally, some apps need live updates, like chat apps, dashboards, or notifications. How you handle real-time data matters.

  • WebSockets: Let the client and server send messages to each other in real time, both ways, without constantly asking for updates.
  • Server-Sent Events (SSE): The server can push updates to the client in real time, but communication only goes one way, from server to client.
  • Polling: The client regularly asks the server for updates. It’s simple to set up, but it can put more load on the server.

It’s like building event-driven systems on the backend, but here it happens in the browser.

Why it matters:

  • Supports live dashboards, chat, and notifications.
  • Improves interactivity and user engagement.
  • Allows you to choose the right strategy for your app’s needs.

Use case:

  • WebSockets: chat apps (Slack), multiplayer games, collaborative editing (Google Docs).
  • SSE: live notifications, stock tickers, server logs streaming to a dashboard.
  • Polling: simple use cases like checking for new emails or status updates.

Architecture & Scalability

As your app grows, managing complexity becomes just as important as writing features. Frontend architecture isn’t just about code; it’s about building systems that are maintainable, scalable, and predictable.

15 Micro Frontends

When multiple teams work on the same app, things can get messy fast.

Micro frontends let each team build and deploy their part of the app separately. For example, one team handles the dashboard while another builds the settings page. Technically, the app is divided into smaller frontend projects that are combined at runtime to work as one seamless app.

Module Federation (a feature of bundlers such as Webpack) lets these separate projects share code (like components or utilities) directly in the browser, without rebuilding or duplicating code across projects.

Why it matters:

  • Teams can develop features faster and in parallel.
  • Reduces duplicated code across bundles.
  • Supports independent deployment cycles, so updates don’t block each other.

Use case:

Large enterprises with multiple teams working on different product areas. For example, big companies like Zalando, IKEA, DAZN, and Spotify use micro-frontends so each team can build and release their part of the app on their own.

16 Component-Based Architecture & Design Systems

Components are the building blocks of your app. A design system ensures these components stay consistent and reusable across teams and projects.

It’s like having reusable backend modules or libraries, but for your UI.

Why it matters:

  • Makes the UI predictable and easier to maintain.
  • Encourages code reuse across pages and projects.
  • Helps teams scale efficiently without creating chaos.

Use case:

  • Used by companies with many products or teams to keep design consistent, like Shopify’s Polaris or IBM’s Carbon, which are open-source design systems containing ready-to-use UI components, styles, and guidelines.
  • Even small startups benefit: a shared set of 10–20 components (like buttons and modals) helps teams build faster and keep the UI consistent.

17 Build & Deployment Pipelines (CI/CD for Frontend)

Frontend apps also benefit from CI/CD (Continuous Integration and Continuous Deployment) pipelines, just like backend services. These pipelines automatically handle steps like building the app, running tests, and deploying updates.

In simple terms, every time you push code, CI/CD tools check that nothing breaks and then safely release the latest version, making deployments faster, more reliable, and less manual.

Why it matters:

  • Minimises human errors during deployment.
  • Enables fast, reliable releases.
  • Makes scaling and frequent updates much smoother.

Use case:

Works for any app with regular updates, from small teams auto-deploying to Vercel to big companies like Netflix releasing thousands of times a day. It keeps updates fast, safe, and reliable.

User Experience & Reliability

Your users don’t care about your architectures or caching strategies; they just want the app to be fast, reliable, and easy to use.

18 Accessibility (a11y) & Mobile-First Design

Accessibility and mobile-first design aren’t just design principles; they’re system-level considerations. Accessibility ensures your app’s UI and code structure work for everyone, including people using assistive technologies.

Mobile-first design forces you to build efficient layouts, load lighter assets, and prioritize key features, all of which influence performance, scalability, and overall frontend architecture.

Why it matters:

  • Reaches more users.
  • Makes your app easier and more pleasant to use.
  • Ensures a consistent experience across devices.

Use case:

Government sites (accessibility is legally required in many countries), e-commerce, and content platforms. Mobile-first is essential for apps in developing markets where mobile is the main or only device.

19 Progressive Web Apps (PWAs) & Offline-First

Progressive Web Apps (PWAs) are web apps that behave like native apps. They can work offline, send notifications, and even be installed on a device.

They use a few key technologies:

  • Service workers run in the background to cache important files like HTML, CSS, and API responses.
  • A web app manifest defines how the app looks and behaves when installed.
  • And HTTPS keeps everything secure.

Together, these make the app fast, reliable, and installable.

Why it matters:

  • Users can access your app anywhere.
  • Reduces load on servers.
  • Improves reliability and user trust.

Use case:

Apps where offline access is valuable: Twitter Lite, Starbucks PWA, field service apps, and news apps.

20 Security Basics (XSS, CSRF, CSP, Authentication)

Speed means nothing without security. Frontend isn’t just about the UI; it’s also the first line of defence for your app.

  • XSS (Cross-Site Scripting): Stop attackers from injecting malicious scripts into your app.
  • CSRF (Cross-Site Request Forgery): Protect your forms and actions that change data from being triggered by attackers without the user’s consent.
  • CSP (Content Security Policy): A rule set that helps prevent malicious scripts from running in your app.
  • Authentication: Make sure user tokens and sessions are stored and handled securely in the browser.
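
As a taste of what a CSP looks like in practice, here is an illustrative policy built as a string (the CDN domain is a placeholder; tune the directives to your own app). The server sends it as the `Content-Security-Policy` response header.

```javascript
// An example CSP: only run scripts from our own origin and one trusted CDN,
// block plugins entirely, and pin <base> to our origin.
const csp = [
  "default-src 'self'",
  "script-src 'self' https://cdn.example.com",
  "object-src 'none'",
  "base-uri 'self'",
].join("; ");
// e.g. response.setHeader("Content-Security-Policy", csp);
```

With this in place, an injected inline `<script>` simply refuses to run, which blunts many XSS attacks even when a template bug slips through.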

Why it matters:

  • Protects your users and their data.
  • Prevents common attacks before they reach the backend.
  • Builds trust and helps with compliance.

Use case:

Any app handling sensitive data. Financial apps need strict CSP and token handling. Social platforms must prevent XSS to avoid account takeovers. E-commerce sites need CSRF protection on checkout to prevent unauthorised purchases.

21 Observability & Error Monitoring (Client-Side)

Even if everything works well, things can still break in production. That’s why observability is important.

Frontend errors are just like 500 errors in your backend; they happen. Monitoring tools like Sentry or LogRocket help you track:

  • JS exceptions: errors that happen in your JavaScript code while the app is running.
  • Performance bottlenecks: parts of your app that slow it down or make it lag.
  • User interactions leading to errors: actions by users that trigger bugs or crashes in your app.

These tools add a small script to your app. When something breaks, it collects information like the error message, what the user was doing, and browser details. Then it sends that data to the tool’s server, where you can see and fix the issue from your dashboard.

Why it matters:

  • Detects and resolves issues faster.
  • Keeps your app stable and performant.
  • Improves the overall user experience and trust.

Use case:

Used in production apps with real users. SaaS teams track errors right after deployment, e-commerce sites watch checkout issues, and session replay tools help support teams see what confused users without extra bug reports.

Conclusion

Frontend system design is basically backend system design, just happening in the user’s browser.

Every choice you make, like rendering method, caching strategy, state management, architecture, and security, affects speed, scalability, and reliability.

So next time you’re building a frontend, ask yourself:

  • Where should computation happen? On the server, in the client’s browser, or at the edge?
  • When does the data need to be up-to-date? Prebuilt, cached, or real-time?
  • How can we keep the app fast and reliable? Lazy loading, smart caching, or micro frontends?
  • How do we scale this? Can the architecture handle 10x traffic? 100x?
  • How do we maintain this? Will new developers understand the architecture? Can teams work independently?

Think of your frontend as a distributed system. Treat it that way, and your users will get an app that’s fast, smooth, and seamless, exactly what they expect.

👋 I’d like to thank Shefali for writing this newsletter!

Plus, don’t forget to check out her work and socials:

  • Shefali.dev
  • GitHub
  • Twitter

You’ll often find her writing about web development, sharing UI tips, and building tools that make developers’ lives easier.

Ten Laws That Govern Enterprise Architecture

Mike's Notes

Still relevant today. Does anyone know if Roger Sessions is still around? I would love to meet him.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Authors > Roger Sessions
  • Home > Handbook > 

Last Updated

27/12/2025

Ten Laws That Govern Enterprise Architecture

By: Roger Sessions
LinkedIn: 24/06/2016

Lead Architect of the IT Simplification Initiative (ITSI), leveraging mathematics to build simpler IT.

Every engineering discipline is governed by specific mathematical laws. Bridge designers must understand the laws of Tension and Compression. Any bridge designed in violation of these laws will collapse. Rocket ship designers must understand Newton’s Laws of Motion. Any rocket ship designed in violation of Newton’s laws will destruct. Aqueduct designers must understand the Laws of Hydraulics. Any aqueduct designed in violation of these laws will block.

How many people would knowingly drive on bridges that violate the laws of Tension and Compression? How many would ride a rocket ship that ignores Newton’s Laws of Motion? How many would hook up their toilet to aqueducts designed by people who pooh-pooh the laws of Hydraulics?

Like these engineering disciplines, enterprise architecture is governed by specific mathematical laws. However, in marked contrast to these other disciplines, few enterprise architects understand the laws that govern their field. Even fewer executives demand that the high-cost designs they are funding take into account the most fundamental laws that will determine their success.

This article is a Call to Action. It is a call to enterprise architects to start designing systems that conform to those laws rather than flouting them. It is a call to executives to start demanding that all designs be subject to a proof of conformity to these laws. To design a large IT system that violates the Laws of Complexity is every bit as negligent as designing a bridge that violates the Laws of Tension and Compression.

As a starting point for this critical discussion, here are what I believe are the ten most important Laws of Complexity:

  1. Complexity = Functional Complexity + Dependency Complexity
  2. Viability = c / Complexity
  3. Value = Useful Functionality / Complexity
  4. Complexity increases exponentially
  5. Capacity to Manage Complexity increases linearly
  6. When partitioning independent elements, partition complexity is driven by subset size.
  7. When partitioning dependent elements, partition complexity is driven by element assignment.
  8. |Non-Optimal Partitions (NOPs)| >> |Optimal Partitions (OPs)|
  9. Complexity (NOPs) >> Complexity (OPs)
  10. OPs can only be found with directed methodologies.

I will write about these laws in more detail in upcoming articles. Stay tuned!

Why Software Fails

Mike's Notes

One of the classics. Still true today. Very wise words from Robert N. Charette.

Resources

References

  • Reference

Repository

  • Home > Ajabbi Research > Library > Authors > Roger Sessions
  • Home > Ajabbi Research > Library > Authors > Robert N. Charette
  • Home > Handbook > 

Last Updated

26/12/2025

Why Software Fails

By: Robert N. Charette
IEEE Spectrum: 01/09/2005

Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.

We waste billions of dollars each year on entirely preventable mistakes

Have you heard the one about the disappearing warehouse? One day, it vanished—not from physical view, but from the watchful eyes of a well-known retailer’s automated distribution system. A software glitch had somehow erased the warehouse’s existence, so that goods destined for the warehouse were rerouted elsewhere, while goods at the warehouse languished. Because the company was in financial trouble and had been shuttering other warehouses to save money, the employees at the “missing” warehouse kept quiet. For three years, nothing arrived or left. Employees were still getting their paychecks, however, because a different computer system handled the payroll. When the software glitch finally came to light, the merchandise in the warehouse was sold off, and upper management told employees to say nothing about the episode.

MARKET CRASH: After its new automated supply-chain management system failed last October, leaving merchandise stuck in company warehouses, British food retailer Sainsbury's had to hire 3000 additional clerks to stock its shelves. GRAHAM BARCLAY/BLOOMBERG NEWS/LANDOV

This story has been floating around the information technology industry for 20-some years. It’s probably apocryphal, but for those of us in the business, it’s entirely plausible. Why? Because episodes like this happen all the time. Last October, for instance, the giant British food retailer J Sainsbury PLC had to write off its US $526 million investment in an automated supply-chain management system. It seems that merchandise was stuck in the company’s depots and warehouses and was not getting through to many of its stores. Sainsbury was forced to hire about 3000 additional clerks to stock its shelves manually [see photo above, "Market Crash"].

Software Hall of Shame

List of software failures from 1992-2005 with costs and business consequences.

Sources: Business Week, CEO Magazine, Computerworld, InfoWeek, Fortune, The New York Times, Time, and The Wall Street Journal. * Converted to U.S. dollars using current exchange rates as of press time. † Converted to U.S. dollars using exchange rates for the year cited, according to the International Trade Administration, U.S. Department of Commerce. ** Converted to U.S. dollars using exchange rates for the year cited, according to the Statistical Abstract of the United States, 1996.

This is only one of the latest in a long, dismal history of IT projects gone awry [see table above, "Software Hall of Shame" for other notable fiascoes]. Most IT experts agree that such failures occur far more often than they should. What’s more, the failures are universally unprejudiced: they happen in every country; to large companies and small; in commercial, nonprofit, and governmental organizations; and without regard to status or reputation. The business and societal costs of these failures—in terms of wasted taxpayer and shareholder dollars as well as investments that can’t be made—are now well into the billions of dollars a year.

The problem only gets worse as IT grows ubiquitous. This year, organizations and governments will spend an estimated $1 trillion on IT hardware, software, and services worldwide. Of the IT projects that are initiated, from 5 to 15 percent will be abandoned before or shortly after delivery as hopelessly inadequate. Many others will arrive late and over budget or require massive reworking. Few IT projects, in other words, truly succeed.

The biggest tragedy is that software failure is for the most part predictable and avoidable. Unfortunately, most organizations don’t see preventing failure as an urgent matter, even though that view risks harming the organization and maybe even destroying it. Understanding why this attitude persists is not just an academic exercise; it has tremendous implications for business and society.

SOFTWARE IS EVERYWHERE. It’s what lets us get cash from an ATM, make a phone call, and drive our cars. A typical cellphone now contains 2 million lines of software code; by 2010 it will likely have 10 times as many. General Motors Corp. estimates that by then its cars will each have 100 million lines of code.

The average company spends about 4 to 5 percent of revenue on information technology, with those that are highly IT dependent—such as financial and telecommunications companies—spending more than 10 percent on it. In other words, IT is now one of the largest corporate expenses outside employee costs. Much of that money goes into hardware and software upgrades, software license fees, and so forth, but a big chunk is for new software projects meant to create a better future for the organization and its customers.

Governments, too, are big consumers of software. In 2003, the United Kingdom had more than 100 major government IT projects under way that totaled $20.3 billion. In 2004, the U.S. government cataloged 1200 civilian IT projects costing more than $60 billion, plus another $16 billion for military software.

Any one of these projects can cost over $1 billion. To take two current examples, the computer modernization effort at the U.S. Department of Veterans Affairs is projected to run $3.5 billion, while automating the health records of the UK’s National Health Service is likely to cost more than $14.3 billion for development and another $50.8 billion for deployment.

Such megasoftware projects, once rare, are now much more common, as smaller IT operations are joined into “systems of systems.” Air traffic control is a prime example, because it relies on connections among dozens of networks that provide communications, weather, navigation, and other data. But the trick of integration has stymied many an IT developer, to the point where academic researchers increasingly believe that computer science itself may need to be rethought in light of these massively complex systems.

When a project fails , it jeopardizes an organization’s prospects. If the failure is large enough, it can steal the company’s entire future. In one stellar meltdown, a poorly implemented resource planning system led FoxMeyer Drug Co., a $5 billion wholesale drug distribution company in Carrollton, Texas, to plummet into bankruptcy in 1996.

IT failure in government can imperil national security, as the FBI’s Virtual Case File debacle has shown. The $170 million VCF system, a searchable database intended to allow agents to “connect the dots” and follow up on disparate pieces of intelligence, instead ended five months ago without any system’s being deployed [see “Who Killed the Virtual Case File?" in this issue].

IT failures can also stunt economic growth and quality of life. Back in 1981, the U.S. Federal Aviation Administration began looking into upgrading its antiquated air-traffic-control system, but the effort to build a replacement soon became riddled with problems [see photo, "Air Jam," at top of this article]. By 1994, when the agency finally gave up on the project, the predicted cost had tripled, more than $2.6 billion had been spent, and the expected delivery date had slipped by several years. Every airplane passenger who is delayed because of gridlocked skyways still feels this cancellation; the cumulative economic impact of all those delays on just the U.S. airlines (never mind the passengers) approaches $50 billion.

Worldwide, it’s hard to say how many software projects fail or how much money is wasted as a result. If you define failure as the total abandonment of a project before or shortly after it is delivered, and if you accept a conservative failure rate of 5 percent, then billions of dollars are wasted each year on bad software.

For example, in 2004, the U.S. government spent $60 billion on software (not counting the embedded software in weapons systems); a 5 percent failure rate means $3 billion was probably wasted. However, after several decades as an IT consultant, I am convinced that the failure rate is 15 to 20 percent for projects that have budgets of $10 million or more. Looking at the total investment in new software projects—both government and corporate—over the last five years, I estimate that project failures have likely cost the U.S. economy at least $25 billion and maybe as much as $75 billion.
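
The back-of-the-envelope arithmetic behind these waste figures is easy to reproduce. A minimal sketch using only the article's own numbers (the $60 billion government spend and the 5 percent conservative failure rate):

```python
# Reproduce the article's conservative waste estimate for 2004:
# $60 billion of U.S. government software spending, 5% failure rate.
gov_spend = 60e9          # annual government software spend, dollars
conservative_rate = 0.05  # conservative outright-failure rate

wasted = gov_spend * conservative_rate
print(f"${wasted / 1e9:.0f} billion likely wasted")  # $3 billion
```

At the author's higher estimate of 15 to 20 percent for projects over $10 million, the same spend implies $9 billion to $12 billion, which is why the five-year economy-wide estimate reaches tens of billions.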

Of course, that $75 billion doesn’t reflect projects that exceed their budgets—which most projects do. Nor does it reflect projects delivered late—which the majority are. It also fails to account for the opportunity costs of having to start over once a project is abandoned or the costs of bug-ridden systems that have to be repeatedly reworked.

Then, too, there’s the cost of litigation from irate customers suing suppliers for poorly implemented systems. When you add up all these extra costs, the yearly tab for failed and troubled software conservatively runs somewhere from $60 billion to $70 billion in the United States alone. For that money, you could launch the space shuttle 100 times, build and deploy the entire 24-satellite Global Positioning System, and develop the Boeing 777 from scratch—and still have a few billion left over.

Why do projects fail so often?

Among the most common factors:

  • Unrealistic or unarticulated project goals
  • Inaccurate estimates of needed resources
  • Badly defined system requirements
  • Poor reporting of the project’s status
  • Unmanaged risks
  • Poor communication among customers, developers, and users
  • Use of immature technology
  • Inability to handle the project’s complexity
  • Sloppy development practices
  • Poor project management
  • Stakeholder politics
  • Commercial pressures

Of course, IT projects rarely fail for just one or two reasons. The FBI’s VCF project suffered from many of the problems listed above. Most failures, in fact, can be traced to a combination of technical, project management, and business decisions. Each dimension interacts with the others in complicated ways that exacerbate project risks and problems and increase the likelihood of failure.

Consider a simple software chore: a purchasing system that automates the ordering, billing, and shipping of parts, so that a salesperson can input a customer’s order, have it automatically checked against pricing and contract requirements, and arrange to have the parts and invoice sent to the customer from the warehouse.

The requirements for the system specify four basic steps. First, there’s the sales process, which creates a bill of sale. That bill is then sent through a legal process, which reviews the contractual terms and conditions of the potential sale and approves them. Third in line is the provision process, which sends out the parts contracted for, followed by the finance process, which sends out an invoice.

Let’s say that as the first process, for sales, is being written, the programmers treat every order as if it were placed in the company’s main location, even though the company has branches in several states and countries. That mistake, in turn, affects how tax is calculated, what kind of contract is issued, and so on.

The sooner the omission is detected and corrected, the better. It’s kind of like knitting a sweater. If you spot a missed stitch right after you make it, you can simply unravel a bit of yarn and move on. But if you don’t catch the mistake until the end, you may need to unravel the whole sweater just to redo that one stitch.

If the software coders don’t catch their omission until final system testing—or worse, until after the system has been rolled out—the costs incurred to correct the error will likely be many times greater than if they’d caught the mistake while they were still working on the initial sales process.

And unlike a missed stitch in a sweater, this problem is much harder to pinpoint; the programmers will see only that errors are appearing, and these might have several causes. Even after the original error is corrected, they’ll need to change other calculations and documentation and then retest every step.

In fact, studies have shown that software specialists spend about 40 to 50 percent of their time on avoidable rework rather than on what they call value-added work, which is basically work that’s done right the first time. Once a piece of software makes it into the field, the cost of fixing an error can be 100 times as high as it would have been during the development stage.
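
This cost-escalation claim is often summarized as a phase-multiplier table. Only the roughly 100× field-fix figure comes from the article; the intermediate multipliers and the base cost below are assumptions for illustration:

```python
# Illustrative relative cost of fixing the same defect, by the phase
# in which it is found. Only the ~100x "in the field" multiplier is
# from the article; the intermediate values are assumed.
fix_cost_multiplier = {
    "requirements": 1,
    "design": 5,
    "coding": 10,
    "system test": 50,
    "in the field": 100,
}

base_fix_cost = 500  # hypothetical dollars to fix during requirements
for phase, mult in fix_cost_multiplier.items():
    print(f"{phase:>13}: ${base_fix_cost * mult:,}")
```

The shape of the curve, not the exact numbers, is the point: each phase a defect survives multiplies the eventual repair bill.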

If errors abound, then rework can start to swamp a project, like a dinghy in a storm. What’s worse, attempts to fix an error often introduce new ones. It’s like you’re bailing out that dinghy, but you’re also creating leaks. If too many errors are produced, the cost and time needed to complete the system become so great that going on doesn’t make sense.

In the simplest terms, an IT project usually fails when the rework exceeds the value-added work that’s been budgeted for. This is what happened to Sydney Water Corp., the largest water provider in Australia, when it attempted to introduce an automated customer information and billing system in 2002 [see box, "Case Study #2"]. According to an investigation by the Australian Auditor General, among the factors that doomed the project were inadequate planning and specifications, which in turn led to numerous change requests and significant added costs and delays. Sydney Water aborted the project midway, after spending AU $61 million (US $33.2 million).

All of which leads us to the obvious question: why do so many errors occur?

Software project failures have a lot in common with airplane crashes. Just as pilots never intend to crash, software developers don’t aim to fail. When a commercial plane crashes, investigators look at many factors, such as the weather, maintenance records, the pilot’s disposition and training, and cultural factors within the airline. Similarly, we need to look at the business environment, technical management, project management, and organizational culture to get to the roots of software failures.

Chief among the business factors are competition and the need to cut costs. Increasingly, senior managers expect IT departments to do more with less and do it faster than before; they view software projects not as investments but as pure costs that must be controlled.

Political exigencies can also wreak havoc on an IT project’s schedule, cost, and quality. When Denver International Airport attempted to roll out its automated baggage-handling system, state and local political leaders held the project to one unrealistic schedule after another. The failure to deliver the system on time delayed the 1995 opening of the airport (then the largest in the United States), which compounded the financial impact manyfold.

Even after the system was completed, it never worked reliably: it chewed up baggage, and the carts used to shuttle luggage around frequently derailed. Eventually, United Airlines, the airport’s main tenant, sued the system contractor, and the episode became a testament to the dangers of political expediency.

A lack of upper-management support can also damn an IT undertaking. This runs the gamut from failing to allocate enough money and manpower to not clearly establishing the IT project’s relationship to the organization’s business. In 2000, retailer Kmart Corp., in Troy, Mich., launched a $1.4 billion IT modernization effort aimed at linking its sales, marketing, supply, and logistics systems, to better compete with rival Wal-Mart Corp., in Bentonville, Ark. Wal-Mart proved too formidable, though, and 18 months later, cash-strapped Kmart cut back on modernization, writing off the $130 million it had already invested in IT. Four months later, it declared bankruptcy; the company continues to struggle today.

Frequently, IT project managers eager to get funded resort to a form of liar’s poker, overpromising what their project will do, how much it will cost, and when it will be completed. Many, if not most, software projects start off with budgets that are too small. When that happens, the developers have to make up for the shortfall somehow, typically by trying to increase productivity, reducing the scope of the effort, or taking risky shortcuts in the review and testing phases. These all increase the likelihood of error and, ultimately, failure.

A state-of-the-art travel reservation system spearheaded by a consortium of Budget Rent-A-Car, Hilton Hotels, Marriott, and AMR, the parent of American Airlines, is a case in point. In 1992, three and a half years and $165 million into the project, the group abandoned it, citing two main reasons: an overly optimistic development schedule and an underestimation of the technical difficulties involved. This was the same group that had earlier built the hugely successful Sabre reservation system, proving that past performance is no guarantee of future results.

After crash investigators consider the weather as a factor in a plane crash, they look at the airplane itself. Was there something in the plane’s design that caused the crash? Was it carrying too much weight?

In IT project failures, similar questions invariably come up regarding the project’s technical components: the hardware and software used to develop the system and the development practices themselves. Organizations are often seduced by the siren song of the technological imperative—the uncontrollable urge to use the latest technology in hopes of gaining a competitive edge. With technology changing fast and promising fantastic new capabilities, it is easy to succumb. But using immature or untested technology is a sure route to failure.

In 1997, after spending $40 million, the state of Washington shut down an IT project that would have processed driver’s licenses and vehicle registrations. Motor vehicle officials admitted that they got caught up in chasing technology instead of concentrating on implementing a system that met their requirements. The IT debacle that brought down FoxMeyer Drug a year earlier also stemmed from adopting a state-of-the-art resource-planning system and then pushing it beyond what it could feasibly do.

A project’s sheer size is a fountainhead of failure. Studies indicate that large-scale projects fail three to five times more often than small ones. The larger the project, the more complexity there is in both its static elements (the discrete pieces of software, hardware, and so on) and its dynamic elements (the couplings and interactions among hardware, software, and users; connections to other systems; and so on). Greater complexity increases the possibility of errors, because no one really understands all the interacting parts of the whole or has the ability to test them.

Sobering but true: it’s impossible to thoroughly test an IT system of any real size. Roger S. Pressman pointed out in his book Software Engineering, one of the classic texts in the field, that “exhaustive testing presents certain logistical problems. ... Even a small 100-line program with some nested paths and a single loop executing less than twenty times may require 10¹⁴ possible paths to be executed.” To test all of those 100 trillion paths, he noted, assuming each could be evaluated in a millisecond, would take 3170 years.
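
Pressman's figure checks out with simple arithmetic, as a quick sketch shows:

```python
# Verify Pressman's exhaustive-testing estimate:
# 10^14 paths, one millisecond to evaluate each.
paths = 10**14
seconds = paths * 0.001             # 1 ms per path -> 1e11 seconds
seconds_per_year = 3600 * 24 * 365  # ignoring leap years

years = seconds / seconds_per_year
print(f"{years:.0f} years")  # ~3170 years, as Pressman noted
```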

All IT systems are intrinsically fragile. In a large brick building, you’d have to remove hundreds of strategically placed bricks to make a wall collapse. But in a 100 000-line software program, it takes only one or two bad lines to produce major problems. In 1991, a portion of AT&T’s telephone network went out, leaving 12 million subscribers without service, all because of a single mistyped character in one line of code.

Sloppy development practices are a rich source of failure, and they can cause errors at any stage of an IT project. To help organizations assess their software-development practices, the U.S. Software Engineering Institute, in Pittsburgh, created the Capability Maturity Model, or CMM. It rates a company’s practices against five levels of increasing maturity. Level 1 means the organization is using ad hoc and possibly chaotic development practices. Level 3 means the company has characterized its practices and now understands them. Level 5 means the organization quantitatively understands the variations in the processes and practices it applies.

As of January, nearly 2000 government and commercial organizations had voluntarily reported CMM levels. Over half acknowledged being at either level 1 or 2, 30 percent were at level 3, and only 17 percent had reached level 4 or 5. The percentages are even more dismal when you realize that this is a self-selected group; obviously, companies with the worst IT practices won’t subject themselves to a CMM evaluation. (The CMM is being superseded by the CMM-Integration, which aims for a broader assessment of an organization’s ability to create software-intensive systems.)

Immature IT practices doomed the U.S. Internal Revenue Service’s $4 billion modernization effort in 1997, and they have continued to plague the IRS’s current $8 billion modernization. It may just be intrinsically impossible to translate the tax code into software code—tax law is complex and based on often-vague legislation, and it changes all the time. From an IT developer’s standpoint, it’s a requirements nightmare. But the IRS hasn’t been helped by open hostility between in-house and outside programmers, a laughable underestimation of the work involved, and many other bad practices.

THE PILOT’S ACTIONS JUST BEFORE a plane crashes are always of great interest to investigators. That’s because the pilot is the ultimate decision-maker, responsible for the safe operation of the craft. Similarly, project managers play a crucial role in software projects and can be a major source of errors that lead to failure.

Back in 1986, the London Stock Exchange decided to automate its system for settling stock transactions. Seven years later, after spending $600 million, it scrapped the Taurus system’s development, not only because the design was excessively complex and cumbersome but also because the management of the project was, to use the word of one of its own senior managers, “delusional.” As investigations revealed, no one seemed to want to know the true status of the project, even as more and more problems appeared, deadlines were missed, and costs soared [see box, "Case Study #3"].

The most important function of the IT project manager is to allocate resources to various activities. Beyond that, the project manager is responsible for project planning and estimation, control, organization, contract management, quality management, risk management, communications, and human resource management.

Bad decisions by project managers are probably the single greatest cause of software failures today. Poor technical management, by contrast, can lead to technical errors, but those can generally be isolated and fixed. However, a bad project management decision—such as hiring too few programmers or picking the wrong type of contract—can wreak havoc. For example, the developers of the doomed travel reservation system claim that they were hobbled in part by the use of a fixed-price contract. Such a contract assumes that the work will be routine; the reservation system turned out to be anything but.

Project management decisions are often tricky precisely because they involve tradeoffs based on fuzzy or incomplete knowledge. Estimating how much an IT project will cost and how long it will take is as much art as science. The larger or more novel the project, the less accurate the estimates. It’s a running joke in the industry that IT project estimates are at best within 25 percent of their true value 75 percent of the time.

There are other ways that poor project management can hasten a software project’s demise. A study by the Project Management Institute, in Newtown Square, Pa., showed that risk management is the least practiced of all project management disciplines across all industry sectors, and nowhere is it more infrequently applied than in the IT industry. Without effective risk management, software developers have little insight into what may go wrong, why it may go wrong, and what can be done to eliminate or mitigate the risks. Nor is there a way to determine what risks are acceptable, in turn making project decisions regarding tradeoffs almost impossible.

Poor project management takes many other forms, including bad communication, which creates an inhospitable atmosphere that increases turnover; not investing in staff training; and not reviewing the project’s progress at regular intervals. Any of these can help derail a software project.

The last area that investigators look into after a plane crash is the organizational environment. Does the airline have a strong safety culture, or does it emphasize meeting the flight schedule above all? In IT projects, an organization that values openness, honesty, communication, and collaboration is more apt to find and resolve mistakes early enough that rework doesn’t become overwhelming.

If there’s a theme that runs through the tortured history of bad software, it’s a failure to confront reality. On numerous occasions, the U.S. Department of Justice’s inspector general, an outside panel of experts, and others told the head of the FBI that the VCF system was impossible as defined, and yet the project continued anyway. The same attitudes existed among those responsible for the travel reservation system, the London Stock Exchange’s Taurus system, and the FAA’s air-traffic-control project—all indicative of organizational cultures driven by fear and arrogance.

A recent report by the National Audit Office in the UK found numerous cases of government IT projects’ being recommended not to go forward yet continuing anyway. The UK even has a government department charged with preventing IT failures, but as the report noted, more than half of the agencies the department oversees routinely ignore its advice. I call this type of behavior irrational project escalation—the inability to stop a project even after it’s obvious that the likelihood of success is rapidly approaching zero. Sadly, such behavior is in no way unique.

In the final analysis , big software failures tend to resemble the worst conceivable airplane crash, where the pilot was inexperienced but exceedingly rash, flew into an ice storm in an untested aircraft, and worked for an airline that gave lip service to safety while cutting back on training and maintenance. If you read the investigator’s report afterward, you’d be shaking your head and asking, “Wasn’t such a crash inevitable?”

So, too, the reasons that software projects fail are well known and have been amply documented in countless articles, reports, and books [see sidebar, To Probe Further]. And yet, failures, near-failures, and plain old bad software continue to plague us, while practices known to avert mistakes are shunned. It would appear that getting quality software on time and within budget is not an urgent priority at most organizations.

It didn’t seem to be at Oxford Health Plans Inc., in Trumbull, Conn., in 1997. The company’s automated billing system was vital to its bottom line, and yet senior managers there were more interested in expanding Oxford’s business than in ensuring that its billing system could meet its current needs [see box, "Case Study #1"]. Even as problems arose, such as invoices’ being sent out months late, managers paid little attention. When the billing system effectively collapsed, the company lost tens of millions of dollars, and its stock dropped from $68 to $26 per share in one day, wiping out $3.4 billion in corporate value. Shareholders brought lawsuits, and several government agencies investigated the company, which was eventually fined $3 million for regulatory violations.

Even organizations that get burned by bad software experiences seem unable or unwilling to learn from their mistakes. In a 2000 report, the U.S. Defense Science Board, an advisory body to the Department of Defense, noted that various studies commissioned by the DOD had made 134 recommendations for improving its software development, but only 21 of those recommendations had been acted on. The other 113 were still valid, the board noted, but were being ignored, even as the DOD complained about the poor state of defense software development!

Some organizations do care about software quality, as the experience of the software development firm Praxis High Integrity Systems, in Bath, England, proves. Praxis demands that its customers be committed to the project, not only financially, but as active participants in the IT system’s creation. The company also spends a tremendous amount of time understanding and defining the customer’s requirements, and it challenges customers to explain what they want and why. Before a single line of code is written, both the customer and Praxis agree on what is desired, what is feasible, and what risks are involved, given the available resources.

After that, Praxis applies a rigorous development approach that limits the number of errors. One of the great advantages of this model is that it filters out the many would-be clients unwilling to accept the responsibility of articulating their IT requirements and spending the time and money to implement them properly. [See “The Exterminators," in this issue.]

Some level of software failure will always be with us. Indeed, we need true failures—as opposed to avoidable blunders—to keep making technical and economic progress. But too many of the failures that occur today are avoidable. And as our society comes to rely on IT systems that are ever larger, more integrated, and more expensive, the cost of failure may become disastrously high.

Even now, it’s possible to take bets on where the next great software debacle will occur. One of my leading candidates is the IT systems that will result from the U.S. government’s American Health Information Community, a public-private collaboration that seeks to define data standards for electronic medical records. The idea is that once standards are defined, IT systems will be built to let medical professionals across the country enter patient records digitally, giving doctors, hospitals, insurers, and other health-care specialists instant access to a patient’s complete medical history. Health-care experts believe such a system of systems will improve patient care, cut costs by an estimated $78 billion per year, and reduce medical errors, saving tens of thousands of lives.

But this approach is a mere pipe dream if software practices and failure rates remain as they are today. Even by the most optimistic estimates, to create an electronic medical record system will require 10 years of effort, $320 billion in development costs, and $20 billion per year in operating expenses—assuming that there are no failures, overruns, schedule slips, security issues, or shoddy software. This is hardly a realistic scenario, especially because most IT experts consider the medical community to be the least computer-savvy of all professional enterprises.

Patients and taxpayers will ultimately pay the price for the development, or the failure, of boondoggles like this. Given today’s IT practices, failure is a distinct possibility, and it would be a loss of unprecedented magnitude. But then, countries throughout the world are contemplating or already at work on many initiatives of similar size and impact—in aviation, national security, and the military, among other arenas.

Like electricity, water, transportation, and other critical parts of our infrastructure, IT is fast becoming intrinsic to our daily existence. In a few decades, a large-scale IT failure will become more than just an expensive inconvenience: it will put our way of life at risk. In the absence of the kind of industrywide changes that will mitigate software failures, how much of our future are we willing to gamble on these enormously costly and complex systems?

We already know how to do software well. It may finally be time to act on what we know.