A Critique of Iceberg REST Catalog: A Classic Case of Why Semantic Spec Fails

Mike's Notes

I like the specification and hope to enable its use via a future Pipi plugin. Here is an article by Ananth Packkildurai about some flaws that need fixing. His weekly newsletter is worth subscribing to.

References

  • Iceberg REST Catalog Specification.
  • Designing Data-Intensive Applications, by Martin Kleppmann.

Repository

  • Home > Ajabbi Research > Library > Subscriptions > Data Engineering Weekly
  • Home > Handbook > 

Last Updated

19/01/2026

A Critique of Iceberg REST Catalog: A Classic Case of Why Semantic Spec Fails

By: Ananth Packkildurai
Data Engineering Weekly: 9/01/2026

Ananth Packkildurai is a data engineering leader, writer, and author of Data Engineering Weekly, sharing insights on modern data platforms, large-scale pipelines, and AI-driven architectures.

How a Semantically Correct API Becomes Operationally Unreliable at Scale.

“Latency is not just a performance characteristic; it is a fundamental part of correctness.” — Designing Data-Intensive Applications

In Designing Data-Intensive Applications, Martin Kleppmann makes a subtle but critical point: the CAP theorem omits latency, yet in real systems, latency often determines whether a system is usable at all. A system that is correct but slow is, in practice, incorrect.

This observation is directly applicable to the Apache Iceberg REST Catalog specification. While the specification achieves semantic clarity, it fails to define the operational realities that enable distributed systems to remain predictable at scale. The result is a standard that is formally correct, yet operationally fragile.

Semantic Interoperability Without Predictability

Over the past two years, the Iceberg REST Catalog specification has emerged as the de facto standard for metadata access in the Iceberg ecosystem; we have even seen an outright catalog war break out around the REST spec. It promises a universal interface that allows engines such as Trino, Spark, Flink, and StarRocks to interact with Iceberg tables via a common REST abstraction, independent of the underlying catalog implementation.

At the semantic level, this promise largely holds. The specification rigorously defines metadata structures: tables, schemas, snapshots, and namespace operations. A LoadTable or CreateNamespace request looks identical across implementations. This semantic interoperability has been critical to Iceberg’s rapid ecosystem adoption.

However, semantic interoperability alone is insufficient. The specification defines what metadata operations mean, but it avoids specifying how they must behave in real-world conditions, such as concurrency, latency sensitivity, and cross-catalog synchronization.

This gap—between semantic interoperability and operational interoperability—is where systems begin to fail in production.

The Core Problem: No Operational SLA, No Predictability

The Iceberg REST Catalog specification is intentionally silent on performance guarantees. There are no latency expectations, no throughput baselines, and no service-level objectives. While this flexibility lowers the barrier to implementation, it creates an ecosystem where:

  • Two catalogs can both be “compliant” yet differ by orders of magnitude in response time.
  • Clients cannot reason about metadata latency during query planning.
  • Synchronization behavior across catalogs becomes unpredictable.

In distributed data systems, predictability matters more than raw performance. Without a strict operational SLA—or at least defined behavioral constraints—clients are forced into defensive, retry-heavy designs that amplify load and increase tail latency.
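
As a sketch of what that defensive posture looks like, consider a minimal Python client wrapper. The endpoint, timeout, and retry budget here are all invented, which is precisely the point: the spec gives clients no principled basis for choosing them.

    import time

    import requests

    CATALOG = "https://catalog.example.com"  # hypothetical endpoint

    def load_table(namespace: str, table: str, timeout_s: float = 10.0,
                   max_retries: int = 3) -> dict:
        # The defensive wrapper a missing SLA forces every client to write.
        # Nothing in the spec says what "too slow" means, so every client
        # picks its own numbers -- and every retry adds load to a catalog
        # that may already be struggling.
        url = f"{CATALOG}/v1/namespaces/{namespace}/tables/{table}"
        for attempt in range(max_retries):
            try:
                resp = requests.get(url, timeout=timeout_s)
                resp.raise_for_status()
                return resp.json()
            except requests.exceptions.Timeout:
                # Slow, overloaded, or down? The protocol gives no way to
                # tell, so we back off blindly and try again.
                time.sleep(2 ** attempt)
        raise TimeoutError(f"catalog did not answer within budget for {table}")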

The “List Tables” Problem: Cross-Catalog Sync Failure

The ListTables endpoint (GET /v1/namespaces/{namespace}/tables) is semantically straightforward. It allows clients to enumerate tables within a namespace and supports pagination through pageSize and pageToken.
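
In code, that pagination contract looks roughly like the sketch below (Python with the requests library; the catalog URL is a placeholder and authentication is omitted). The spec defines the token chain, but it bounds neither per-page latency nor total page count.

    import requests

    CATALOG = "https://catalog.example.com"  # hypothetical endpoint

    def list_tables(namespace: str, page_size: int = 100) -> list[dict]:
        # Follow the pageToken chain until the catalog stops returning one.
        # Whether the listing stays stable while tables are concurrently
        # created and dropped is left undefined by the spec.
        url = f"{CATALOG}/v1/namespaces/{namespace}/tables"
        tables, token = [], None
        while True:
            params = {"pageSize": page_size}
            if token:
                params["pageToken"] = token
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            tables.extend(body.get("identifiers", []))
            token = body.get("next-page-token")
            if not token:
                return tables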

The primary issue is not pagination itself. The real failure emerges when the same Iceberg tables are registered in multiple catalogs, a pattern that is increasingly common in hybrid and multi-platform deployments.

A Realistic Scenario

  • An Iceberg table is registered in both Catalog A and Catalog B.
  • Both catalogs point to the same underlying metadata and object storage.
  • One catalog is used by ingestion and streaming workloads.
  • The other is used by analytics engines and BI tools.

The Sync Pathology

When a client connects to Catalog B and issues a metadata discovery operation—such as listing tables or syncing namespace state—the catalog must:

  1. Enumerate all tables.
  2. Resolve metadata pointers.
  3. Validate access permissions.
  4. Reconcile the state with the underlying storage.

Because the REST specification defines no operational expectations:

  • There is no SLA for how long this sync should take.
  • There is no distinction between a “lightweight” listing and a fully validated listing.
  • There is no mechanism to express intent (e.g., names only, no ACL validation).

As table counts grow into the tens of thousands, synchronization latency grows non-linearly. In practice, sync operations can take minutes—or fail—causing engines to stall, time out, or repeatedly retry.
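
A back-of-envelope model shows the floor of the problem. Assume, purely for illustration, two serial round trips per table (pointer resolution plus ACL check) at 20 ms each:

    # Hypothetical cost model for a full namespace sync; the per-call
    # latency is illustrative, not measured.
    per_table_calls = 2          # resolve metadata pointer + validate ACL
    latency_per_call_s = 0.020   # 20 ms per round trip (optimistic)

    for n_tables in (1_000, 10_000, 50_000):
        serial_sync_s = n_tables * per_table_calls * latency_per_call_s
        print(f"{n_tables:>6} tables -> {serial_sync_s / 60:5.1f} minutes")

    # Output:
    #   1000 tables ->   0.7 minutes
    #  10000 tables ->   6.7 minutes
    #  50000 tables ->  33.3 minutes

And this is the linear best case; timeouts, retries, and pagination overhead compound it further.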

The result is not merely slow metadata access. It is system-wide unpredictability. Query engines cannot determine whether a delay is transient, systemic, or catastrophic.

Latency Is Treated as an Implementation Detail—But It Is a Contract

The REST Catalog specification implicitly treats latency as an implementation concern. From a standards perspective, this is understandable. But in data-intensive systems, latency is part of the correctness contract.

The specification does not define:

  • Upper bounds on metadata retrieval latency
  • Maximum metadata payload sizes
  • Limits on metadata fan-out operations
  • The number of round trips required to plan a query

As a result, a compliant catalog may require megabytes of JSON metadata and dozens of HTTP calls just to validate a single query plan. Engines appear slow and unstable, even though the root cause lies in an underspecified protocol.
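
To see how the round trips accumulate, consider a hypothetical tally for planning a join over N tables. No particular engine works exactly this way; the point is that the spec bounds none of these counts.

    # Hypothetical round-trip tally for planning a join over n tables.
    # Real engines batch and cache differently; this is only a sketch.
    def planning_round_trips(n_tables: int) -> int:
        config = 1                         # GET /v1/config
        namespace_check = 1                # GET /v1/namespaces/{ns}
        load_tables = n_tables             # GET .../tables/{table}, per table
        refresh_after_conflict = n_tables  # re-load on stale snapshots
        return config + namespace_check + load_tables + refresh_after_conflict

    print(planning_round_trips(10))        # -> 22 calls before reading any data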

This is precisely the class of problem Kleppmann warns about: correctness without latency guarantees is operationally meaningless.

Commit Semantics Under Contention: Undefined and Unfair

Iceberg relies on optimistic concurrency control. When multiple writers attempt to commit simultaneously, conflicts are expected and resolved through retries.

The REST specification defines the 409 Conflict response, but stops there. It does not define:

  • Backoff expectations
  • Retry fairness
  • Starvation prevention

In a multi-engine environment, this creates asymmetric outcomes. A high-frequency streaming writer with aggressive retries can permanently starve batch compaction jobs that follow conservative retry policies. Over time, table health degrades due to file explosion and unbounded metadata growth.
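
Standardizing even a simple retry discipline would narrow this asymmetry. The sketch below shows exponential backoff with full jitter on 409 Conflict; the parameters are illustrative, and today each client invents its own.

    import random
    import time

    def commit_with_backoff(commit_fn, base_s: float = 0.5,
                            cap_s: float = 30.0, max_attempts: int = 8):
        # commit_fn performs one commit attempt and returns the HTTP
        # response. If every writer followed the same jittered backoff,
        # an aggressive streaming writer could not so easily starve a
        # conservative batch job; the spec mandates no such discipline.
        for attempt in range(max_attempts):
            resp = commit_fn()
            if resp.status_code != 409:
                resp.raise_for_status()
                return resp
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)).
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
        raise RuntimeError("commit lost the conflict race too many times")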

Once again, the issue is not semantic correctness. It is the absence of operational guarantees.

Caching Without a Freshness Model

While HTTP caching is permitted, it is not part of the correctness model. Support for conditional requests, ETags, or freshness validation is optional.

This forces clients into a pessimistic stance: always re-fetch, always revalidate, always assume staleness. The REST protocol degenerates into a chatty, high-latency control plane that negates its own architectural benefits.

Without a standardized freshness contract, caching becomes a gamble rather than a reliability tool.
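
Standard HTTP already provides the primitive a freshness contract needs. The sketch below layers a conditional LoadTable on ETag and If-None-Match; because honoring ETags is optional in practice, clients cannot rely on the cheap 304 path today.

    import requests

    CATALOG = "https://catalog.example.com"  # hypothetical endpoint
    _cache: dict[str, tuple[str, dict]] = {}  # table key -> (etag, metadata)

    def load_table_cached(namespace: str, table: str) -> dict:
        key = f"{namespace}.{table}"
        headers = {}
        if key in _cache:
            headers["If-None-Match"] = _cache[key][0]
        url = f"{CATALOG}/v1/namespaces/{namespace}/tables/{table}"
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return _cache[key][1]         # metadata unchanged: cache hit
        resp.raise_for_status()
        etag = resp.headers.get("ETag")
        if etag:                          # ETag support is optional today
            _cache[key] = (etag, resp.json())
            return _cache[key][1]
        return resp.json()                # no ETag: no safe way to cache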

Behavioral Conformance Is Missing

The Iceberg ecosystem has strong conformance testing for table formats. It lacks an equivalent for catalog behavior.

Today, “REST Catalog compliant” means:

  • The endpoints exist.
  • The JSON schema is correct.
  • The happy path works.

It does not mean:

  • Predictable latency under load
  • Stable pagination during concurrent updates
  • Graceful overload signaling
  • Bounded retry amplification

Without behavioral conformance tests, compliance guarantees syntax, not operability.
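
What might a behavioral conformance check look like? A sketch: assert a p99 latency bound on ListTables under concurrent load. The budget and concurrency level are invented for illustration.

    import concurrent.futures
    import statistics
    import time

    import requests

    CATALOG = "https://catalog.example.com"  # hypothetical catalog under test
    P99_BUDGET_S = 2.0                       # invented threshold

    def timed_list(namespace: str) -> float:
        t0 = time.monotonic()
        requests.get(f"{CATALOG}/v1/namespaces/{namespace}/tables",
                     timeout=60).raise_for_status()
        return time.monotonic() - t0

    def test_list_tables_p99_under_load():
        # Fire 200 concurrent listings and check the tail. A syntactic
        # compliance suite never exercises this path; a behavioral one must.
        with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
            latencies = list(pool.map(timed_list, ["analytics"] * 200))
        p99 = statistics.quantiles(latencies, n=100)[98]
        assert p99 <= P99_BUDGET_S, f"p99 {p99:.2f}s exceeds budget"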

Underspecification Is Still a Design Decision

The absence of operational constraints is not accidental. It reflects a deliberate choice to prioritize adoption and flexibility.

However, in distributed systems, underspecification pushes complexity downstream. It burdens clients, operators, and platform teams with the need to implement compensating logic. As Iceberg becomes core infrastructure rather than experimental tooling, this trade-off increasingly limits its reliability.

Semantic agreement without behavioral agreement leads to fragile systems.

Toward Operational Interoperability

Operational interoperability does not require rigid SLAs or centralized control. It requires acknowledging that latency, retries, and fairness are part of the interface.

Concrete improvements could include:

  • Defined operational profiles with minimum latency and concurrency expectations (see the sketch after this list)
  • Lightweight metadata views to avoid synchronization amplification
  • Standardized retry and backoff semantics for conflict scenarios
  • Explicit freshness and caching contracts
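
To make the first item concrete: an operational profile might look like the sketch below. None of these fields exist in the current specification; this is a hypothetical shape for what a catalog could advertise, for example alongside GET /v1/config.

    # Entirely hypothetical: an operational profile a catalog could publish
    # so clients can plan around known bounds. No such fields exist in the
    # current specification.
    OPERATIONAL_PROFILE = {
        "profile": "interactive",             # vs. "batch", "bulk-metadata"
        "list_tables_p99_ms": 500,            # latency bound per page
        "load_table_p99_ms": 200,
        "max_metadata_payload_bytes": 1_048_576,
        "commit_retry": {                     # standardized conflict discipline
            "strategy": "exponential-jitter",
            "base_ms": 500,
            "cap_ms": 30_000,
        },
        "supports_names_only_listing": True,  # lightweight metadata view
        "etag_freshness": "required",         # explicit caching contract
    }

A client that reads such a profile could size its timeouts and retry budgets from the contract instead of guessing.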

Semantic interoperability enabled Iceberg’s success. Operational interoperability will determine whether it remains dependable at scale.

Until then, the Iceberg REST Catalog remains a textbook example of why semantic specifications alone are not enough.
