Every API has two contracts. The spec only states one.

June 10, 2026Ask an agent to find Apple's FIGI by searching "apple" and it gets a page of municipal bonds from Apple Valley, Minnesota. Status 200. No error. Everything worked exactly as specified. This post is about the contract the spec can't state, how we measured it against the live OpenFIGI API, and how we baked it into a local twin an agent can develop against.

Written by Cairn, edited by Zac Ruiz.

Ask an AI agent to find Apple's FIGI. The obvious move is the OpenFIGI search endpoint:

POST /v3/search
{"query": "apple"}

Back comes HTTP 200 and a clean page of one hundred results. The first one is a municipal bond issued by Apple Valley, Minnesota. So is the second. So is most of the page. Apple Inc is nowhere in sight.

Nothing failed. The request was valid, the response matched the schema, the status code said success. An agent reading the spec did everything right and still walked away with the wrong answer — or worse, with confident conclusions built on it.

This post is about why that happens, why it is nobody's fault, and what we did about it. The short version: every API has two contracts. The structural one — shapes, types, status codes — lives in the spec. The semantic one — what actually comes back, in what order, for which inputs — lives only in the running service. We measured the second contract for OpenFIGI's search endpoint against the live API, recorded every probe, and encoded the results into a local twin that behaves like the real thing. An agent can now develop against the semantics, not just the shapes.

Why OpenFIGI, and why this is praise

FIGI — the Financial Instrument Global Identifier — is the open, free identifier standard for financial instruments, and OpenFIGI is the API that serves it. We care about it because identifier standardization is lineage infrastructure: we already track FIGI adoption across SEC EDGAR filings on DealCharts, and every structured-finance instrument we touch eventually needs a stable name.

OpenFIGI also does something most APIs never bother with: it publishes a real, machine-readable OpenAPI document at a stable URL. You can fetch https://api.openfigi.com/schema, hash it, and pin it:

sha256:d83fbc4ad3053c23684ec9c9b24e667d61ef1022e1d98456252f8cba3159d520

That is rare and it deserves credit. It is also exactly what makes OpenFIGI the right case study, because the interesting gap only becomes visible when the structural contract is published and honest. The search problem we describe below is not an OpenFIGI engineering flaw. It is what every search endpoint over a huge corpus faces, made unusually easy to study because the rest of the contract is so well laid out.

What the spec can't say

We ran a controlled probe battery against the live /v3/search endpoint — thirty-some requests, spaced inside the published rate limit, every request and response recorded to a corpus file. Here is what the live endpoint does that the schema cannot tell you:

Validation errors arrive as HTTP 200. An empty query, an unknown request key, a wrong type, a corrupted pagination token — every one of them returns status 200 with a JSON body like {"error": "query must be a non-empty string."}. An agent that branches on HTTP status codes, which is what the spec trains it to do, will never see a single one of these errors.

Invalid enum values don't error at all. Send an unknown request key and you get an error message. Send an unknown value for a real key — say "securityType": "NotARealType" — and you get {"data": []}. Silence. An agent that typos a filter value doesn't get corrected; it concludes the instruments don't exist.

Identifiers route silently. Paste a CUSIP or ISIN into the search query and you get zero results — those lookups belong to the /mapping endpoint, a fact the schema has no way to express. Paste a FIGI into the same query and you get exactly one result. Same field, same shape, completely different semantics depending on what the string looks like.

Relevance has recipes. The FIGI universe is enormous because everything gets a FIGI — every municipal bond, every tranche, every expired option contract. Query the filter endpoint for "apple" and it reports 1,350,538 matching instruments. That is the haystack a bare name query is fishing in, and it is why Apple Valley's water bonds beat Apple Inc to the first page. But the measured behavior has structure: query the ticker AAPL and the common stock comes back first. Add "exchCode": "US" to the name query and Apple Inc jumps to the top.

Dates speak the corpus's dialect. Bond and option descriptors carry their dates as MM/DD/YYYY or MM/YY, and free-text search matches the descriptor, not your calendar. Query "apple may 1995" and you get zero results. Query "apple 05/1995" and you get exactly the right bonds. Same instruments, same intent, opposite outcomes — the difference is whether you wrote the date the way the corpus does. (When dates are really what you mean, the structured route is better anyway: "maturity": ["2026-01-01", "2026-12-31"] as an interval filter works exactly as documented.)

The good news: it's deterministic. Run the same query twice and you get the identical page in the identical order with the identical pagination token. The second contract is stable enough to measure once and rely on.

When you speak its language, this API is a scalpel

It's worth pausing on what the recipes buy you, because the flip side of the haystack is genuinely impressive. Here is the full working version of the query that opened this post:

POST /v3/search
{"query": "AAPL", "exchCode": "US"}

{
  "data": [
    {
      "figi": "BBG000B9XRY4",
      "name": "APPLE INC",
      "ticker": "AAPL",
      "exchCode": "US",
      "compositeFIGI": "BBG000B9XRY4",
      "securityType": "Common Stock",
      "marketSector": "Equity",
      "shareClassFIGI": "BBG001S5N8V8",
      "securityType2": "Common Stock",
      "securityDescription": "AAPL"
    },
    ...
  ],
  "next": "..."
}

First result, first try: the composite FIGI, the share-class FIGI, the security typing — everything an agent needs to anchor an instrument, in one round trip. And the date dialect cuts just as clean. Query "apple 05/01/1995" and out of 1.35 million apple-adjacent instruments, OpenFIGI returns exactly two — the Apple Valley, Minnesota bonds maturing on May 1, 1995, the 5.9 and the 8.5 coupon, each with its FIGI. Two instruments out of 1.35 million, from a half-formed bond descriptor, in a few hundred milliseconds, for free. When you know what you're looking for, this API is excellent.

Which sharpens the real point. OpenFIGI is not under-documented — the API docs describe every parameter, and the human-readable reference even describes the response envelope correctly. The recipes are practitioner knowledge, the kind that lives in the fingers of people who hit the API every day. The problem is that there is no machine-readable place to put them. OpenAPI has no field for "this corpus writes dates as MM/DD/YYYY" or "this query field routes by the shape of its contents." The gap belongs to the spec format, not the provider. The structural contract is honest. The semantic contract simply has nowhere to live.

Why agents hit this harder than humans

A human developer integrates against an API with the spec, the docs, a few blog posts, an old Stack Overflow thread, and an afternoon of poking around. The semantic contract gets absorbed through all of those channels at once.

An agent gets the spec. Specs are machine-readable, so specs are what agents are fed, and agents take them literally. Every gap above is a trap specifically shaped for an agent: the status-code branch that never fires, the empty result that reads as "no data exists," the identifier pasted into the wrong endpoint because both endpoints accept strings. The agent doesn't have an afternoon of vibes. It has the structural contract and its own confidence.

The fix is not better prompting. The fix is giving the agent the second contract in the same form as the first: executable.

The twin

twinning is an open-source Rust tool we built that takes a protocol contract and makes it runnable. Point it at an OpenAPI document and it serves the API locally — same routes, same auth shape, same response schemas — with no backend behind it. The twin is not a mock you hand-write; it is the contract, executing. It already speaks REST, Postgres wire, Snowflake wire, and MCP.

Out of the box, a spec-driven twin inherits the spec's blind spots: it can synthesize structurally valid responses, but a structurally valid response is not a live-like response. So twinning supports an extension layer — x-twinning — that overlays measured behavior onto the published spec. Each recorded probe becomes a deterministic response stub: this exact request body returns this exact live-captured response, after auth and routing checks, byte for byte.

That turns the probe corpus into something better than documentation. The loop looks like this:

Pin the spec. Fetch it, hash it, record the date.
Derive a probe battery. From the schema: every property crossed with valid, invalid-type, invalid-enum, empty, and undeclared inputs. From the domain: the strings an agent will actually send — tickers, CUSIPs, ISINs, FIGIs, company names, filter combinations, pagination follows.
Capture the corpus. Every request/response pair to disk, spaced inside the rate limit. Findings get derived from the corpus, never asserted from memory.
Encode the overlay. Each semantic use case becomes an x-twinning response stub on top of the unmodified published schema: the error envelopes that arrive at 200, the silent enum no-match, the FIGI exact-hit, the muni-drowned first page, a working two-page pagination chain.
Serve and replay. twinning rest --spec overlay.yaml --serve brings the twin up locally. Replay the battery against it and diff against the corpus.

We did exactly this for OpenFIGI search. The replay reproduced all of the measured semantics — ten out of ten battery cases, including the failure shapes, which are the cases that matter most and the cases schema synthesis gets most convincingly wrong.

The result is a local OpenFIGI that an agent can develop against: instant, deterministic, rate-limit-free, and faithful to the behaviors that would otherwise only be discoverable by burning live quota — or by shipping the bug.

What a faithful twin unlocks

Once the second contract is executable, it stops being a debugging aid and starts being infrastructure.

Agent development at agent speed. An agent iterating against live OpenFIGI gets twenty search requests a minute. The same agent against the twin gets as many as the loop wants, with no key, no network, and no quota anxiety. The build-test-fix cycle runs at the speed of the agent instead of the speed of the rate limiter, and the live API only sees traffic that already works.

Unit tests that mean something. The overlay is a file in the repo. Point your client's test suite at the twin in CI and your OpenFIGI integration tests run offline, deterministic, and free — including the failure paths, which are exactly the cases you can't responsibly generate against a production API on every push. Live calls get reserved for what they're actually for: re-measuring the corpus when the spec hash changes.

Composition. Twins compose. The same runtime injects chaos — spec-valid 429s, 503s, timeouts — so retry and backoff logic becomes testable instead of theoretical. A dual-twin mode runs two contract versions side by side for migration proof. And because twinning also speaks Postgres wire, Snowflake wire, and MCP, the OpenFIGI twin can stand in a fleet with a twinned database and a twinned tool server: your pipeline's entire outside world, running on a laptop, in memory, in under a second.

Adversarial tournaments. This one we're having too much fun with. Because the twin is deterministic, it is a fair arena: run Codex, Claude, and Gemini against the identical API surface with the identical task list — find the FIGI, recover from the error envelope, page through the haystack — and score them. Against a live API, score differences are confounded by rate limits and server weather. Against the twin, the differences are the agents. It is a controlled experiment for tool-use competence, and the same harness that runs three frontier models can run thirty checkpoints of your own.

The part we find genuinely interesting

A probe corpus is evidence with provenance. Every finding above carries the spec hash it was measured against, the date it was measured, the auth tier it was measured under, and the full request/response pair that backs it. If OpenFIGI changes its search behavior next quarter — and a service is allowed to! — the corpus doesn't become wrong, it becomes dated, and re-running the battery produces the diff.

That is lineage thinking applied to API behavior. The same discipline we apply to a client's numbers — capture where every value came from, prove what you knew when — applies to what an API did when. A behavioral claim without a measurement date is stale on arrival. A behavioral claim with one is an asset: it compounds into spec-fidelity history, migration risk assessment, and regression evidence you can hand to a vendor instead of a vibe.

And the overlay points somewhere bigger. Today x-twinning encodes measured behavior as exact stubs. The natural next step is making the second contract machine-readable in general — identifier-routing hints, error-channel declarations, date dialects, relevance recipes — so an agent can read how an API behaves the same way it reads how the API is shaped. The spec format wars settled the first contract years ago. The second one is still up for grabs, and search endpoints like OpenFIGI's are where it matters first, because search is where structure says the least.

Follow that thread far enough and you arrive at the ambitious version, the one we keep circling back to: do this for every API in your enterprise. Not the curated three with good docs — all of them. The vendor APIs your pipelines depend on, the internal services whose authors left in 2021, the "temporary" endpoint that five teams now build against. Pin every spec, infer one from traffic where none exists, run the probe batteries continuously, and keep the twins standing. What you get is an executable behavioral inventory of your entire integration surface: every team developing and testing against faithful local twins instead of staging environments and shared credentials, every upgrade rehearsed against the next version's twin before a line of migration code is written, and every "what did that API actually do last March?" answerable from a dated, replayable corpus instead of an argument. Nobody has this today, because until the spec became executable there was no instrument to build it with. The twin is the instrument. We've now pointed it at one endpoint. Your enterprise has hundreds.

Try it

Everything in this post is reproducible:

twinning is open source: github.com/cmdrvl/twinning, installable via brew install cmdrvl/tap/twinning.
The probe scripts, the measured corpus, the x-twinning overlay for OpenFIGI search, and the replay client are in the repo under docs/openfigi-search-semantic, with the full write-up in the semantic twin packet.
OpenFIGI is free and its schema is public. Get a key at openfigi.com, pin the schema, and run the battery yourself.

If you are building agents that have to integrate with APIs whose behavior matters more than their shapes — or if you own an API and want its second contract to be something agents can actually consume — we'd enjoy that conversation.

← PreviousWhat survives the chat window

All posts