How do you transfer an LLM across domains?

Updated: 2026-01-29 By: Ari Heljakka

Short answer

Cross-domain model transfer is the practice of moving a pretrained model into a new domain (legal, medical, support, code, finance) without destroying its general capabilities. The core risks are catastrophic forgetting, semantic gap in embeddings, and distribution mismatch between training and target data. The reliable solutions are parameter-efficient fine-tuning (LoRA, adapters), retrieval-augmented generation, pseudo-labeling with confidence thresholds, and hybrid strategies. None of these matter without a measurement loop: a versioned evaluation suite that scores the target domain and the source domain on every change, with calibrated judges, so regressions surface before they reach production.

Key facts

  • Definition: A set of techniques for adapting a general-purpose model to a specific domain while preserving as much of its original capability as the task allows.
  • When to use: Any deployment where the base model's training distribution differs meaningfully from the target task (specialized vocabularies, regulated content, customer-specific data, multilingual surfaces).
  • Limitations: Adaptation is bounded by the quality of target-domain data and the evaluation discipline that detects regressions. No technique substitutes for a calibrated test set.
  • Example: A team adapts a general model to medical claims review using LoRA on 4,000 curated examples, runs a retrieval layer over its policy corpus, and gates each weight update against a versioned faithfulness and refusal-correctness scorecard.

Key takeaways

  • Cross-domain transfer is not one technique; it is a portfolio (LoRA, adapters, RAG, pseudo-labeling, hybrid).
  • Catastrophic forgetting is the dominant failure mode; measure source-task performance after every adaptation step.
  • Parameter-efficient methods (LoRA, QLoRA, adapters) deliver most of the gains of full fine-tuning at a small fraction of the cost.
  • Retrieval is the right answer when the domain changes faster than weights can be retrained.
  • The evaluation suite is the contract; it must outlive any specific model or prompt version and gate every deployment.

Definition

Cross-domain model transfer is the structured adaptation of a pretrained model to a target domain whose distribution, vocabulary, or task structure differs from the training corpus. The unit of transfer is not a model; it is a tuple of artifacts: the base model, the adaptation method, the training data, the evaluation suite, and the rubric that defines success.

Three risks recur across every implementation.

  • Catastrophic forgetting. Adapting weights toward a new domain overwrites parameters that governed prior capabilities. Source-task performance can drop sharply after a few hundred fine-tuning steps.
  • Semantic gap in embeddings. Domain-specific terms fragment into many subword tokens during tokenization, producing weak internal representations until the model sees enough in-domain text to align them.
  • Distribution mismatch. Gradient signals from a narrow target domain pull representations away from the broader manifold the model was trained on, degrading performance on adjacent tasks.

Every technique below trades off cost, coverage, and risk against these three failure modes.

When this matters

Transfer becomes a deliberate engineering concern when at least one of the following holds.

  • Vocabulary divergence. Medical, legal, regulatory, scientific, or internal-product vocabularies are underrepresented in general pretraining corpora.
  • Style or policy adherence. The target domain demands a specific tone, citation format, refusal pattern, or compliance check that the base model does not honor out of the box.
  • Latency or cost ceilings. A smaller adapted model can replace a larger general model at a fraction of the inference cost, provided it scores at parity on the target evaluation suite.
  • Data sensitivity. Retrieval against private corpora is cheaper, more controllable, and more auditable than retraining on the same data.
  • Frequent domain updates. When the knowledge base changes daily or weekly, model weights cannot keep up; retrieval is the right adaptation mechanism, not retraining.

How it works

A defensible transfer pipeline has five stages, each with its own measurement gate.

Stage 1: Define the evaluation contract before changing weights

Before any adaptation, write down the target-domain evaluation suite. It has three layers.

  • Deterministic checks. Schema, length, banned phrases, citation format. Run on 100 percent of evaluation traffic, near-zero cost.
  • Calibrated LLM judges. Faithfulness, refusal correctness, tone, domain accuracy. Each judge is versioned and calibrated against a human-labeled ground-truth set; agreement is itself a metric.
  • Source-task regression suite. A held-out probe of the model's prior capabilities. Every adaptation step must pass this suite, not just the target-domain suite.

Without all three layers, every adaptation is a guess.
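The deterministic layer is the easiest of the three to make concrete. The following is a minimal sketch, assuming the model emits a JSON object; the required keys, banned phrases, length limit, and `POLICY-<number>` citation pattern are all hypothetical placeholders for whatever your own rubric specifies.

```python
import json
import re

# Hypothetical rules for illustration; real values come from your own rubric.
REQUIRED_KEYS = {"decision", "rationale", "citations"}
BANNED_PHRASES = ("as an ai language model", "i cannot help with that")
MAX_CHARS = 2000
CITATION_PATTERN = re.compile(r"^POLICY-\d+(\.\d+)*$")  # e.g. POLICY-12.3


def deterministic_checks(raw_output: str) -> list[str]:
    """Return a list of failure names; an empty list means every check passed."""
    failures = []

    if len(raw_output) > MAX_CHARS:
        failures.append("length")

    lowered = raw_output.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        failures.append("banned_phrase")

    # Schema check: the output must parse as a JSON object with the required keys.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        failures.append("schema")
        return failures
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        failures.append("schema")
        return failures

    # Citation format check: every citation must match the assumed ID pattern.
    citations = payload["citations"]
    if not isinstance(citations, list) or not all(
        isinstance(c, str) and CITATION_PATTERN.match(c) for c in citations
    ):
        failures.append("citation_format")

    return failures
```

Because these checks need no model call, they can run on every evaluation example and act as a cheap first filter before the judges score anything.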

Stage 2: Choose the adaptation strategy by cost and risk

The strategies fall into a small portfolio.

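  • LoRA and QLoRA. Low-rank matrices added to attention and feed-forward layers while the base weights stay frozen; typically fewer than one percent of parameters are trained. QLoRA applies the same idea on top of a quantized base model to reduce memory further. Cheap, fast, easy to roll back, low forgetting risk.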
  • Adapters. Small bottleneck modules inserted between transformer layers. Slightly larger than LoRA, slightly better on multi-task transfer, swappable at inference time.
  • Full fine-tuning. Updates every parameter. Highest cost, highest forgetting risk, occasionally necessary for deep behavioral shifts.
  • Retrieval-augmented generation. No weight changes; the model reads target-domain documents at inference. Ideal when the knowledge base changes faster than weights.
  • Pseudo-labeling and self-training. Use the model's own high-confidence outputs as additional training data. Effective when labeled data is scarce; dangerous without strict confidence thresholds (typically above 0.95) to prevent error amplification.
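The "fewer than one percent" figure for LoRA follows from simple arithmetic. The sketch below counts trainable parameters for a single dense layer; the 4096 x 4096 layer size is an assumed example, and the whole-model fraction depends on which layers receive adapters.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a dense layer's parameters that a LoRA update of this rank trains.

    The dense weight W has d_out * d_in entries. LoRA freezes W and learns
    B (d_out x rank) and A (rank x d_in), so the update B @ A adds
    rank * (d_in + d_out) trainable parameters.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora / full


# Assumed layer size for illustration: a 4096 x 4096 projection at rank 16.
fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=16)
print(f"{fraction:.3%}")  # 0.781%
```

Doubling the rank doubles the trainable share, so rank is the main knob for trading capacity against cost and forgetting risk.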

The right choice depends on three variables: how much labeled data exists, how often the domain changes, and how strict the source-task regression budget is.
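For the pseudo-labeling strategy in the portfolio above, the confidence threshold is the main safeguard. A minimal sketch of the filtering step follows, assuming each prediction carries a calibrated probability for its label; the `Prediction` type and the sample data are hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    text: str
    label: str
    confidence: float  # model's probability for `label`, in [0, 1]


def select_pseudo_labels(
    predictions: list[Prediction], threshold: float = 0.95
) -> list[Prediction]:
    """Keep only predictions whose confidence is strictly above the threshold."""
    return [p for p in predictions if p.confidence > threshold]


unlabeled_predictions = [
    Prediction("claim A", "approve", 0.99),
    Prediction("claim B", "deny", 0.95),  # not strictly above 0.95: dropped
    Prediction("claim C", "approve", 0.61),
]
kept = select_pseudo_labels(unlabeled_predictions)
print([p.text for p in kept])  # ['claim A']
```

The filter only helps if the confidences are meaningful, which is why the accepted examples should still be spot-checked against a clean human-labeled validation set.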

Stage 3: Train against the evaluation contract

Adaptation runs in short cycles. Each cycle produces a candidate tuple (model version, adapter version, training data version) and scores it against the full evaluation suite. The cycle gate is explicit:

  • Target-domain dimensions move forward by at least a calibrated threshold.
  • Source-task regression suite does not drop below a configured floor.
  • Calibration agreement on every judge is above the operational threshold (a Matthews correlation coefficient of 0.6 or a rank correlation of 0.7 is a common minimum).

Candidates that fail the gate are not deployed. The cycle records the failure cause; the next iteration adjusts.
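The three gate conditions can be written as one function. This is a sketch under stated assumptions: the threshold values in `GateConfig` are illustrative, every target dimension is held to the same minimum gain, and the source-task suite is summarized as a single score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # All thresholds are illustrative assumptions, not recommended values.
    min_target_gain: float = 0.02     # required improvement per target dimension
    max_source_drop: float = 0.05     # allowed drop on the source-task suite
    min_judge_agreement: float = 0.6  # e.g. Matthews correlation vs. human labels


def gate_failures(
    baseline_target: dict[str, float],
    candidate_target: dict[str, float],
    baseline_source: float,
    candidate_source: float,
    judge_agreement: dict[str, float],
    config: GateConfig = GateConfig(),
) -> list[str]:
    """Return the reasons a candidate fails the gate; empty means it passes."""
    failures = []
    # Condition 1: each target-domain dimension improves by the minimum gain.
    for name, before in baseline_target.items():
        if candidate_target[name] - before < config.min_target_gain:
            failures.append(f"target:{name}")
    # Condition 2: the source-task score stays within the regression budget.
    if baseline_source - candidate_source > config.max_source_drop:
        failures.append("source_regression")
    # Condition 3: every judge still agrees with human labels.
    for judge, agreement in judge_agreement.items():
        if agreement < config.min_judge_agreement:
            failures.append(f"calibration:{judge}")
    return failures
```

Returning the list of reasons, rather than a bare pass or fail, gives the cycle something concrete to record when a candidate is rejected.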

Stage 4: Layer retrieval where weights cannot keep up

For knowledge-heavy domains (policy, product catalog, current events, internal documentation), retrieval is the right layer for whatever changes daily. Adapt weights for behavioral concerns (tone, refusal patterns, format), retrieve for factual concerns (which policy applies, what the current pricing is).

The retrieval layer has its own evaluation dimensions: retrieval precision and recall on a calibrated query set, citation faithfulness, hallucination rate when the retriever returns nothing relevant. These dimensions are independent of the adaptation pipeline and must be scored separately.
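Retrieval precision and recall can be computed per query and then averaged over the query set. A minimal sketch, assuming each query has a human-labeled set of relevant document IDs and the retriever returns unique IDs in ranked order; the clause IDs are made up.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved document IDs for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query from a labeled query set.
retrieved_ids = ["clause-7", "clause-2", "clause-9"]
relevant_ids = {"clause-2", "clause-7", "clause-11", "clause-4"}
precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Queries with an empty relevant set are worth keeping in the query set, since they are the cases where the hallucination rate under empty retrieval is measured.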

Stage 5: Monitor drift on every dimension after deployment

Adaptation does not end at deployment. Three drift signals deserve continuous tracking.

  • Score drift. A target-domain dimension regresses week-over-week without any model change. Investigate distribution shift in user inputs.
  • Calibration drift. Judge agreement against humans decays. Trigger immediate recalibration; do not trust dimension scores until agreement is back above threshold.
  • Source-task drift. The model's general capabilities slowly erode under continued fine-tuning. The source-task regression suite catches this only if it runs on every adaptation cycle.

The same evaluation contract defined in stage one drives all three monitors. The contract outlives every specific model version.
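A score-drift monitor can be as simple as comparing the two most recent weekly means for each dimension. The sketch below assumes weekly aggregation and a fixed allowed drop of 0.03; both are illustrative choices, and a production monitor would usually also account for sample size.

```python
def score_drift_alerts(
    weekly_scores: dict[str, list[float]], max_weekly_drop: float = 0.03
) -> dict[str, float]:
    """Flag dimensions whose latest weekly mean fell by more than the allowed drop.

    `weekly_scores` maps a dimension name to its weekly mean scores, oldest first.
    Returns {dimension: drop} for each flagged dimension.
    """
    alerts = {}
    for dimension, scores in weekly_scores.items():
        if len(scores) < 2:
            continue  # not enough history to compare
        drop = scores[-2] - scores[-1]
        if drop > max_weekly_drop:
            alerts[dimension] = round(drop, 4)
    return alerts


history = {
    "faithfulness": [0.85, 0.85, 0.78],
    "tone": [0.90, 0.91, 0.90],
}
print(score_drift_alerts(history))  # {'faithfulness': 0.07}
```

The same function works for calibration drift if the tracked series is judge agreement against human labels rather than a dimension score.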

Example

A team adapts a general 7B base model to a medical claims review task.

Evaluation contract. Four dimensions: claim classification accuracy, citation faithfulness to policy text, refusal correctness on out-of-scope claims, tone (professional, non-diagnostic). Two deterministic checks: output schema and required disclaimer presence. Three LLM judges, each calibrated against an 80-example human-labeled set; current agreement is Matthews 0.71, 0.69, and 0.74. A 400-example source-task regression probe drawn from general benchmarks.
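The agreement figures quoted for the judges are Matthews correlation coefficients. As a sketch of how such a number could be computed, assuming each judge emits a binary pass/fail verdict that is compared with a human verdict on the same example:

```python
import math


def matthews_corrcoef(human: list[bool], judge: list[bool]) -> float:
    """Matthews correlation coefficient between binary human and judge verdicts."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    fp = sum((not h) and j for h, j in pairs)
    fn = sum(h and (not j) for h, j in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when one class is absent from a list
    return (tp * tn - fp * fn) / denominator


# Toy data: 10 examples, the judge disagrees with the human on 2 of them.
human_labels = [True] * 5 + [False] * 5
judge_labels = [True] * 4 + [False] + [False] * 4 + [True]
print(round(matthews_corrcoef(human_labels, judge_labels), 2))  # 0.6
```

Unlike raw accuracy, this coefficient stays near zero for a judge that always returns the majority verdict, which is why it suits imbalanced pass/fail data.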

Adaptation choice. LoRA on attention and feed-forward layers, rank 16. Training data: 4,200 anonymized claim reviews with policy citations, labeled by domain experts.

Cycle one. Target-domain accuracy moves from 0.62 to 0.81; faithfulness moves from 0.58 to 0.79; refusal correctness moves from 0.71 to 0.76. The source-task score drops by 0.04, inside the 0.05 regression budget. Calibration agreement holds. The cycle passes the gate; the candidate is promoted to staging.

Retrieval layer. A policy retrieval module returns the top three policy clauses for every claim. Retrieval precision against a calibrated query set: 0.84. Citation faithfulness improves another 0.06 once retrieval is on; tone is unaffected.

Drift event. Three weeks after deployment, faithfulness drops from 0.85 to 0.78 on a specific customer slice. Root cause: the customer added a new claim type the retriever does not yet cover. Fix: expand the policy index and the calibrated query set; faithfulness returns to 0.84 within a day.

Every change is recorded as a tuple (base model, LoRA version, retrieval index version, judge version, dataset version). Every deployment is reproducible. Every regression is investigated against the same evaluation contract.
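One way to make that tuple concrete is an immutable record with a stable fingerprint. This is a sketch rather than a prescribed schema: the field names mirror the tuple above, the version strings are placeholders, and the 12-character SHA-256 prefix is an arbitrary choice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseTuple:
    base_model: str
    lora_version: str
    retrieval_index_version: str
    judge_version: str
    dataset_version: str

    def fingerprint(self) -> str:
        """Stable short hash of the tuple, usable as a deployment ID."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Version strings below are placeholders, not values from the example.
release = ReleaseTuple(
    base_model="base-7b",
    lora_version="lora-v3",
    retrieval_index_version="index-2026-01-20",
    judge_version="judges-v2",
    dataset_version="claims-v5",
)
print(release.fingerprint())
```

Attaching the fingerprint to every evaluation run and every deployment log line lets a regression be traced back to the exact combination of artifacts that produced it.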

Limitations

  • Catastrophic forgetting is not eliminated, only bounded. Source-task regression budgets are guardrails, not guarantees. Some adaptation always costs general capability.
  • Pseudo-labeling can amplify errors. Without strict confidence thresholds and a clean human-labeled validation set, the model trains on its own mistakes.
  • Retrieval depends on indexing discipline. A stale or incomplete index produces confident wrong answers; retrieval quality must be measured continuously.
  • Calibration is a recurring cost. Every judge model update, every rubric change, and every meaningful distribution shift triggers recalibration work.
  • Cost-cutting can hide regressions. Sampled evaluation runs save money but lose statistical power on rare failure modes. Bias sampling toward anomalies rather than sampling uniformly at random.
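Anomaly-biased sampling, as suggested in the last limitation, can be sketched as "keep every flagged record, plus a random share of the rest". The `anomalous` flag and how it is computed are assumptions here; a low judge score, a user complaint, or a previously unseen input type would all be plausible triggers.

```python
import random


def sample_for_evaluation(
    records: list[dict], base_rate: float, seed: int = 0
) -> list[dict]:
    """Keep every record flagged as anomalous, plus a random share of the rest.

    Each record is a dict with a boolean "anomalous" key.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sampled = []
    for record in records:
        if record["anomalous"] or rng.random() < base_rate:
            sampled.append(record)
    return sampled


# Hypothetical traffic: 100 records, every 20th one flagged as anomalous.
traffic = [{"id": i, "anomalous": i % 20 == 0} for i in range(100)]
sample = sample_for_evaluation(traffic, base_rate=0.1)
```

Because anomalies are over-represented in the sample, aggregate scores computed on it are not directly comparable with scores from a uniform sample unless the records are reweighted.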

FAQ

How much labeled target-domain data do I need? It depends on the task. Classification often works with 100 to 500 examples; structured extraction needs 200 to 1,000; behavioral or stylistic adaptation usually needs 1,000 to 5,000. The right number is whatever your evaluation suite says is enough; the suite, not a heuristic, is the answer.

LoRA or full fine-tuning? Start with LoRA. Move to full fine-tuning only when LoRA plateaus below the target evaluation threshold and you have a budget for the additional forgetting risk.

Where does retrieval beat fine-tuning? When the underlying knowledge changes faster than you can retrain, or when the data is too sensitive or too dynamic to bake into weights. Retrieval also makes auditability easier; each output cites the source it used.

How do I detect catastrophic forgetting? Run a source-task regression suite (a held-out probe of general capability) on every adaptation cycle. If it drops below your configured floor, the adaptation has overcorrected.

Can I combine techniques? Yes, and most production systems do. LoRA for behavioral adaptation plus retrieval for factual grounding is the most common combination. The evaluation suite treats them as independent dimensions and scores each separately.
