What we found building an OTel sink for LLM telemetry

We built an OpenTelemetry sink for Scorable. The pitch is simple: point your existing OTel exporter at us, and we get the LLM data we need to run evaluations on top. No custom SDK to install. Standard wire format, one line of config, done.

That is the goal, anyway. Reality is a bit messier.

A spec! How convenient.

LLM telemetry has all the usual operational dimensions: latency, cost, model, token counts. Plus the LLM-specific ones: prompts, responses, turn contents, tool calls. Run-of-the-mill SaaS observability covers the first half. The second half is the new bit, and fortunately there is a spec for it: the OpenTelemetry GenAI Semantic Conventions.

That is what we based our work on. With a spec in hand we can make assumptions about which attributes carry which meaning, highlight the right ones in the sink's UI, and treat anything else as bonus.

We dogfooded against Pydantic AI first, since it is what we ship internally for our own Evaluator Factory. Pydantic AI follows the spec well. Key attributes flow in with the names the spec uses. The few extras it emits are clearly Pydantic AI specific, so easy to bucket. Great.

Then we pointed the sink at other producers, and the picture changed.

Producers do not follow the spec

Claude Code, via the Anthropic Agent SDK, does not follow it. Claude Code has its own attribute hierarchy. The Agent SDK has zero gen_ai.* attributes in first-party code.

LangChain does not follow it either. Its first-party attribute family is its own, emitted when talking to LangSmith. Outside that integration you either install third-party instrumentation libraries or hand-roll OTel spans yourself.

Part of why emissions look like this is that the OTel GenAI Semantic Conventions are not the only spec in town. There is a competing one from Arize called OpenInference, and it predates the OTel version in production use by about three months (Phoenix adopted OpenInference on 2024-01-09; the first gen_ai.* attributes merged upstream on 2024-04-16). Arize ships a long list of instrumentation libraries for it, and OpenInference is also what Agno, BeeAI, and SmolAgents emit. So when frameworks "have their own thing," sometimes it is actually OpenInference under the hood.

The gen_ai.* attributes that do show up in production traces today mostly come from third-party instrumentation libraries (OpenLLMetry, OTel's experimental GenAI instrumentations) monkey-patching frameworks from the outside. The frameworks themselves are mostly not in on it.

Surely the consumers all agreed?

We figured the picture would look better on the consumer side. The observability tools have to deal with the chaos anyway, so surely they have all aligned on the spec internally.

Spoiler: they have not. We took a static-analysis tour through MLflow, Phoenix, Langfuse, and Opik on their current default branches.

| Tool | Ingest gen_ai.* | Emit gen_ai.* | Internal taxonomy |
| --- | --- | --- | --- |
| MLflow | Translates most attrs | Opt-in only (MLFLOW_ENABLE_OTEL_GENAI_SEMCONV=False by default) | mlflow.* namespace |
| Phoenix | Stores verbatim, ignores in UI / token aggregation / cost | No | OpenInference (llm.*, openinference.*) |
| Langfuse | Translates all 10 attrs plus ~10 vendor dialects | No | langfuse.observation.* over a ClickHouse schema |
| Opik | Translates all 10 via a 16-rule mapping table | No | Opik native (model / provider / usage) |

Phoenix deserves a closer look. Their MIGRATION.md says it twice, across v3 and v5: "Phoenix now exclusively uses OpenInference for instrumentation." The frontend folder is literally named

. They picked a side and committed.

A flavorful detail in Langfuse: the OTLP ingest layer has explicit branches for more than ten producer conventions: Vercel AI SDK, Genkit, TraceLoop, Pydantic AI, Logfire, LiveKit, MLflow, SmolAgents, OpenInference, and gen_ai.* itself, plus partial branches for Pipecat, LlamaIndex, and Google ADK. All in one ~3000-line ingestion file. The OTel spec is one dialect on a long list.
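To make the "one dialect on a long list" point concrete, here is a minimal sketch of prefix-based dialect detection. It is not Langfuse's code. The prefixes are the namespaces named in this post; the dialect labels, the table order, and the function name are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Mapping

# Hypothetical prefix table. The prefixes are the namespaces mentioned in this
# post; the labels and their order are illustrative assumptions.
DIALECT_PREFIXES = (
    ("openinference", ("openinference.", "llm.")),
    ("mlflow", ("mlflow.",)),
    ("langfuse", ("langfuse.",)),
    ("otel_genai", ("gen_ai.",)),
)


def detect_dialects(attributes: Mapping[str, object]) -> list[str]:
    """Return every dialect whose prefix appears on the span, in table order."""
    found = []
    for dialect, prefixes in DIALECT_PREFIXES:
        # str.startswith accepts a tuple and matches if any prefix matches.
        if any(key.startswith(prefixes) for key in attributes):
            found.append(dialect)
    return found
```

A span can legitimately match more than one dialect, for example when a third-party instrumentation library is layered on top of a framework that already emits its own attributes, so the sketch returns a list rather than a single label.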

Newer spec namespaces fare worse. The evaluation namespace exists in the registry, but zero of the four tools touch it. Each ships its own scoring system instead. (Including Scorable, of course. Yay, dogfooding!)

Why is it like this?

Four plausible reasons.

The schemas predate the spec. Every tool had a working LLM data model before April 2024. For them the spec arrives as a migration ask, not a green-field choice. Phoenix's whole product is built on OpenInference. Langfuse has its Generation/Observation/Score model in ClickHouse. MLflow has its own trace schema. Opik has its Java DTOs. Internal schemas are foundational: you do not rewrite them on a maybe.

OpenInference is genuinely better suited for span-first backends. It puts message content directly on the span, which is what tools like Tempo or Jaeger can actually render. The OTel GenAI spec pushes content off into opt-in events that span-first UIs cannot trivially join back. The OTel spans page itself concedes the problem: "Structured attributes work best on events/logs rather than spans currently."
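Having content directly on the span means a consumer can rebuild the conversation from span attributes alone. The sketch below assumes OpenInference's flattened, indexed keys of the form llm.input_messages.<i>.message.role and llm.input_messages.<i>.message.content. The function name is hypothetical, and only the role and content fields are handled.

```python
import re
from typing import Mapping

# Matches flattened keys such as "llm.input_messages.0.message.role".
_MESSAGE_KEY = re.compile(r"^llm\.input_messages\.(\d+)\.message\.(role|content)$")


def collect_input_messages(attributes: Mapping[str, object]) -> list:
    """Rebuild an ordered message list from flattened span attributes."""
    by_index = {}
    for key, value in attributes.items():
        match = _MESSAGE_KEY.match(key)
        if match is None:
            continue  # Not a message attribute; ignore it here.
        index, field = int(match.group(1)), match.group(2)
        by_index.setdefault(index, {})[field] = value
    # Attribute order on a span is not guaranteed, so sort by the embedded index.
    return [by_index[i] for i in sorted(by_index)]
```

No join against a separate event stream is needed, which is the property span-first backends rely on.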

The spec is still marked Development. Nothing is stable. In the last twelve months alone, two attributes got renamed and a third was removed entirely (content moved to events, then the event shape itself got refactored). Hard to bake a moving target into your internal schema.
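One way to absorb renames at ingest is an alias table that upgrades deprecated names before anything else reads them. This is a sketch under assumptions: the pairs below are deprecations in the GenAI conventions as best understood here, not necessarily the ones referred to above, and they should be checked against the current registry. The function name is hypothetical.

```python
from typing import Mapping

# Assumed deprecated -> current pairs; verify against the current registry.
DEPRECATED_ALIASES = {
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
    "gen_ai.system": "gen_ai.provider.name",
}


def upgrade_attribute_names(attributes: Mapping[str, object]) -> dict:
    """Rewrite deprecated attribute names to their current equivalents."""
    # Start with everything that is not a deprecated name.
    upgraded = {k: v for k, v in attributes.items() if k not in DEPRECATED_ALIASES}
    for old, new in DEPRECATED_ALIASES.items():
        if old in attributes:
            # If the producer sent both names, the current one wins.
            upgraded.setdefault(new, attributes[old])
    return upgraded
```

The table only grows while the spec is unstable, which is part of the cost of tracking it.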

The spec has gaps for things production actually cares about. Cost is the obvious one. There is no canonical attribute for token cost. So three of the four tools just invented their own (one attribute in Langfuse, another in MLflow, custom routes in Opik). Same story for evaluation, as mentioned above.
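Without a canonical cost attribute, a consumer typically derives cost from token counts and a price table it maintains itself. A minimal sketch follows; the model name, the prices, and the function name are all made up for illustration and do not reflect any tool's actual implementation or any vendor's real pricing.

```python
from decimal import Decimal
from typing import Optional

# Hypothetical prices in USD per million tokens. Not real pricing.
PRICES_PER_MILLION = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Optional[Decimal]:
    """Estimate request cost from token counts, or None for unknown models."""
    price = PRICES_PER_MILLION.get(model)
    if price is None:
        return None  # Better to report nothing than to guess a price.
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)
```

Decimal is used instead of float so that small per-request amounts sum without rounding drift.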

Where we landed

It is not like we can just drop the custom attributes when they show up. We have to keep them and map to the spec where it matches, and to our own domain model where it does not. Sure, we have our own domain model too, but we would rather chase spec churn than ship a fourth taxonomy nobody asked for.

So our sink follows the spec where it stabilizes, falls back gracefully when producers emit something else (OpenInference, LangSmith-shaped LangChain data, Anthropic Agent SDK events, etc.), and keeps the custom attributes around as metadata. We accept that we live in a translation layer for now. Everyone else does.
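The shape of such a translation layer can be sketched in a few lines. This is an illustration, not Scorable's implementation: the mapping covers only three OpenInference attributes, and the function name and the output layout (a "spec" dict plus a "metadata" dict) are assumptions.

```python
from typing import Mapping

# Illustrative subset of an OpenInference -> gen_ai.* mapping.
OPENINFERENCE_TO_GENAI = {
    "llm.model_name": "gen_ai.request.model",
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.token_count.completion": "gen_ai.usage.output_tokens",
}


def normalize_span_attributes(attributes: Mapping[str, object]) -> dict:
    """Split attributes into spec-shaped fields and preserved metadata."""
    # Spec-native attributes pass through untouched and take precedence.
    spec = {k: v for k, v in attributes.items() if k.startswith("gen_ai.")}
    metadata = {}
    for key, value in attributes.items():
        if key.startswith("gen_ai."):
            continue
        target = OPENINFERENCE_TO_GENAI.get(key)
        if target is not None:
            # Fill in only where the producer did not already send the spec name.
            spec.setdefault(target, value)
        # Every non-spec attribute is kept, mapped or not.
        metadata[key] = value
    return {"spec": spec, "metadata": metadata}
```

Keeping the original attributes in metadata even after mapping them means a mapping mistake loses nothing: the raw producer data is still there to re-translate later.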