Updated: 2026-03-19 By: Ari Heljakka
Short answer
An LLMOps workflow is not a single tool; it is a composition of category slots, each of which has multiple credible open-source options. The slots are tracing and observability, evaluation libraries, prompt and experiment management, model serving and inference, orchestration and tool use, and vector stores and retrieval. The right question is not "which tool" but "which categories does my workflow require, and how do they hand off to one another." Open-source coverage of every slot is now mature enough that a team can build a defensible pipeline without committing to any single proprietary platform; the trade-offs sit in operational ownership rather than in capability.
Key facts
- Definition: An LLMOps workflow is the end-to-end engineering practice that takes an LLM application from prototype to operated production: instrumentation, evaluation, prompt iteration, deployment, monitoring.
- When to use: Any team operating an LLM application past prototype, where consistency, reproducibility, and cost visibility matter.
- Limitations: Composition is the real work; integrating four open-source tools is rarely as smooth as one platform; operational ownership costs do not disappear when license costs do.
- Example: A team builds an LLMOps stack from open-source components in six weeks: OpenTelemetry for traces, an open-source eval library for offline scoring, an inference server for hosted models, an orchestration library for the agent loop, and a vector store for retrieval.
Key takeaways
- LLMOps is a composition of categories, not a single product. Reason about the slots before reasoning about the names.
- Open-source coverage of each slot is mature. The constraint is composition, not capability.
- Tracing is the foundational slot. Every downstream slot consumes traces, directly or indirectly.
- Evaluation libraries and prompt management belong together. A prompt change without an evaluation gate is a vibe-driven deployment.
- Operational ownership costs (upgrades, integration, on-call) do not go away when you replace a SaaS with an OSS stack; they redistribute.
Definition
LLMOps is the operational discipline of building, deploying, evaluating, and monitoring LLM applications, including agents. It is to LLM systems what MLOps is to classical ML pipelines and what DevOps is to general software systems: a set of practices and a set of tools that together let a team ship and operate the system continuously.
An LLMOps workflow decomposes into a small number of category slots. The slots are mostly orthogonal; a tool that excels in one slot is usually neutral in the others. The composition (which slot hands off to which, on what schema) is the real architectural decision.
When this matters
An open-source LLMOps stack matters most when:
- The team has the operational appetite to run its own infrastructure.
- Vendor lock-in is a strategic concern (regulated industry, sovereignty requirement, multi-cloud).
- The team wants to extend or modify tooling beyond what a SaaS exposes.
- Cost projections at scale make the SaaS line item unattractive.
- The team's data cannot leave its own perimeter for compliance reasons.
Teams without operational appetite or with small workloads often find that a hosted stack pays for itself. The comparison only holds if the operational ownership cost of the open-source stack is counted alongside the licence cost it replaces.
How it works
Each section below describes one slot in the LLMOps workflow: what it does, what the property to aim for is, and the open-source landscape that fills it.
Slot 1, tracing and observability
The foundational slot. Tracing captures the structured execution of an LLM application as a tree of spans: model calls, tool calls, retrievals, prompt versions, latencies. Without trace fidelity, every downstream slot guesses at structure.
The property to aim for: trace schema with first-class span types for the operations that matter (LLM calls, tool calls, retrievals), span attributes that capture prompt and model versions, and a propagating trace identifier so the same trace can be inspected in development and in production.
The OpenTelemetry GenAI semantic conventions are the de facto schema. General-purpose open-source tracing backends consume them; LLM-specific tracing libraries layer LLM-aware views on top.
Slot 2, evaluation libraries
Evaluation libraries provide the primitives for scoring outputs: rubric-based judges, deterministic checks, regression suites against a versioned dataset. The library is not the evaluator; it is the framework in which the evaluator is defined and run.
The property to aim for: evaluators expressible as code or as a versioned prompt, with deterministic graders, model-based judges, and human-in-the-loop hooks all available; dataset abstractions that treat the dataset as a versioned artifact; scoring output that normalizes to a 0 to 1 range so dimensions can be composed.
Mature open-source eval libraries cover the field, with sub-specializations for RAG dimensions, prompt-level testing, and safety evaluations. The composition with the tracing slot is what makes evaluations operational rather than ad hoc.
Slot 3, prompt and experiment management
Prompts and model configurations are versioned artifacts. The slot manages versions, runs experiments against datasets, records per-version per-dimension scores, and supports gradual rollouts.
The property to aim for: every prompt change has a recorded version with the dimensions it was scored on, the agreement metrics for the judges, and a clear audit trail from prompt-v3 to prompt-v4. Experiments are reproducible because the prompt, dataset, judges, and model pin are all referenceable.
Open-source options range from general-purpose MLOps platforms with prompt and experiment tracking to lighter-weight libraries that focus on prompt versioning alone. The right choice depends on how much classical MLOps practice already lives in the team.
Slot 4, model serving and inference
For teams running open-weights models, the slot handles model loading, batching, KV-cache management, and exposing an OpenAI-compatible HTTP interface so the rest of the stack does not care whether the model is hosted by a vendor or by the team.
The property to aim for: throughput and latency predictable enough to plan against, OpenAI-compatible API surface for portability, and instrumentation that flows into the same tracing slot the rest of the stack uses.
Mature open-source inference servers cover most workloads, with options optimized for GPU throughput and others optimized for CPU and edge deployments. The team's choice depends on the model family, the hardware budget, and the latency requirements.
Slot 5, orchestration and agent frameworks
The orchestration slot composes calls: chains, retrieval-augmented generation flows, multi-step agents with tools, multi-agent topologies. It is the slot most prone to over-investment; the framework is often less critical than the prompts and the evaluators around it.
The property to aim for: explicit control flow with structured tool calls, observable steps that emit trace spans into the tracing slot, and the ability to swap the framework without rewriting the application logic.
Open-source frameworks cover everything from heavy abstractions to bare metal, with minimalist alternatives that come down to writing the loop directly. Treat the orchestrator as fungible; the durable assets are the prompts, the evaluators, and the dataset.
Slot 6, vector stores and retrieval
The slot handles embedding storage, similarity search, and the retrieval index that feeds RAG pipelines. The retrieval quality determines a large share of end-to-end correctness, so this slot deserves its own evaluation discipline.
The property to aim for: an index that scales with the corpus, retrieval that returns calibrated relevance scores, and an evaluation hook so retrieval quality (recall, precision, MRR) can be scored independently from generation quality.
Open-source vector stores cover the field, with dedicated databases for large corpora, Postgres extensions for teams already on Postgres, and lighter-weight stores for smaller workloads. The slot decision is operational (where does it run, how is it backed up, how is it scaled) more than functional.
How the slots compose
The slots compose through three contracts:
- The trace schema (slot 1) is consumed by slot 3 for experiment tracking and by slot 2 for production scoring.
- The dataset abstraction (slot 2) is consumed by slot 3 for experiment runs and by CI/CD for gating.
- The OpenAI-compatible API (slot 4) is consumed by slot 5 for model calls, so the orchestrator does not care where the model runs.
A workflow that gets the three contracts right can swap any individual tool without rewriting the application. A workflow that does not is locked into the tools regardless of the license.
Example
A team stands up an LLMOps stack in six weeks. The composition:
- Slot 1, tracing. OpenTelemetry instrumentation in the application; spans exported to Tempo for storage and to an LLM-aware viewer for inspection.
- Slot 2, evaluation. An open-source eval library wraps the dataset (200 sessions, versioned in git) and the judges (one deterministic, three LLM-as-judge with calibrated agreement). Per-dimension scores normalize to 0 to 1; composition rules live in the eval config.
- Slot 3, prompt management. Prompts live in git alongside the application. A lightweight experiment runner records per-version scores in a structured log; promotion to production requires a passing gate run.
- Slot 4, serving. An open-source inference server hosts the open-weights model behind an OpenAI-compatible API. The same API is used by the application and by the eval suite.
- Slot 5, orchestration. A minimal orchestration library composes the agent loop; tool calls emit structured spans into slot 1.
- Slot 6, retrieval. A vector store holds the document embeddings; the retrieval pipeline emits its own spans and is scored independently for recall and precision.
The result is a stack the team owns end to end. The operational cost is real: upgrades, version drift, on-call coverage. The benefit is a measurable, portable workflow that does not depend on any single vendor.
Limitations
Caveats worth flagging up front:
- Composition is the work. Picking the tool in each slot is the easy part. Making the slots interoperate (trace schema, dataset abstraction, API contract) is where the time goes.
- Operational ownership costs are real. Open-source license cost is zero; operational cost is not. Upgrades, security patches, on-call coverage, and integration drift all consume engineering time.
- OSS quality varies. Some slots have multiple mature options; some are still consolidating. Evaluate the project's release cadence, issue responsiveness, and dependency tree as part of the choice.
- Hybrid stacks are common. Many teams run open-source tracing and evaluation while keeping a SaaS for model serving, or vice versa. The slot decomposition makes hybrid practical.
- Lock-in moves, it does not disappear. Replacing vendor lock-in with framework lock-in is a real risk. Treat the orchestrator as fungible; the durable assets are the prompts, evaluators, and dataset.
Evidence and sources
- GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace schema that the observability slot consumes.
- Open-source inference server documentation, https://docs.vllm.ai/, for an inference server with OpenAI-compatible API.
- Test and evaluate documentation, https://docs.anthropic.com/en/docs/test-and-evaluate/, for the versioned-rubric and judge-calibration discipline that the evaluation slot implements.
FAQ
Should I pick one orchestration framework and commit to it? No. The orchestrator is the most fungible slot in the stack. The durable assets (prompts, evaluators, dataset, trace schema) are framework-independent. Pick what fits the team's idioms; expect to swap it as the field consolidates.
How do I evaluate an open-source project's fitness? Release cadence, maintainer responsiveness on issues, dependency tree (transitive risks), the project's own test coverage, and whether the project ships its own evaluation suite. The last is the strongest signal.
Is OpenTelemetry mature enough for LLM tracing? The GenAI semantic conventions are stable enough to commit to. LLM-aware viewers on top of OTEL are mature; LLM-aware backends for storage are converging. Building against OTEL today is a defensible choice.
Where does prompt management belong: in git or in a UI? In git for code-resident workflows; in a UI when product or domain experts edit prompts. The two are compatible: a UI that emits prompt versions back into git gives you both. The pivotal property is that every prompt has a version and a score history.
How do I budget for the operational cost? A rule of thumb: budget 10 to 20 percent of an engineer's time for ongoing maintenance of the stack. Treat upgrades as scheduled work, not as fire drills.