Execution-Grounded Fine-Tuning

The standard recipe for "AI that writes software" is one large frontier model emitting free-form code. Masar takes the opposite stance: decompose the judgment, specialize the pieces, and ground the training signal in real execution. This page explains the general ideas.

Judgment-level decomposition

In our pipeline the hard structural decisions — state machines, circuits, event wiring, schema topology — are already settled by the typed behavior library and the Orb compiler. What is left for a model is closed-vocabulary selection: which behavior fits this intent, which of its typed knobs the user named, and what value each named knob should take.

So instead of delegating authoring to one black box, we factor it into a typed contract and place a small, specialized component behind each decision:

Behavior selection — no training. An embedding router matches the intent against each behavior's description and synonyms by cosine similarity. Frozen encoder, zero learned parameters, no drift.
Presence — a tiny classifier. Which knobs did the user actually name? A small multi-label probe over the same embeddings.
Value picking — the fine-tuned model. For each named knob, choose a value from its declared, typed vocabulary.
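As a rough illustration of the routing step, the sketch below scores an intent against each behavior's description and synonyms by cosine similarity and returns the best match. The `embed` function is a toy bag-of-words stand-in for the frozen encoder, and the catalog entries and behavior names are hypothetical; only the compare-and-argmax shape is meant to carry over.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a frozen sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical catalog: behavior name -> description plus synonyms.
CATALOG = {
    "rate_limiter": "limit request rate throttle requests per second quota",
    "retry_policy": "retry failed calls backoff attempts on error",
    "cache_layer": "cache responses store results ttl expiry",
}

# Embeddings are computed once; nothing here is trained or updated.
CATALOG_VECS = {name: embed(text) for name, text in CATALOG.items()}

def route(intent: str) -> tuple[str, float]:
    """Return the best-matching behavior and its similarity score."""
    query = embed(intent)
    scored = ((name, cosine(query, vec)) for name, vec in CATALOG_VECS.items())
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    print(route("throttle requests to 100 per second"))
```

Because the router has no learned parameters, adding a behavior only means adding one more catalog entry and embedding it.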

Because each step only ever chooses among declared, verified options, a small model is sufficient. The model never invents structure; it selects within a contract the compiler will check.
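A minimal sketch of what "selecting within a contract" could look like for a single knob is shown here. The `Knob` type, the per-candidate `scores` dictionary (standing in for whatever the model emits), and the fallback to the default are all assumptions made for illustration, not the actual Masar interfaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Knob:
    """Hypothetical typed knob with a closed, declared vocabulary."""
    name: str
    choices: tuple[str, ...]
    default: str

def pick_value(knob: Knob, scores: dict[str, float]) -> str:
    """Choose the best-scoring declared option.

    Candidates outside the declared vocabulary are ignored, so the result
    is always a value the contract allows. Falling back to the default when
    no declared option was scored is an assumption of this sketch.
    """
    allowed = {c: scores[c] for c in knob.choices if c in scores}
    if not allowed:
        return knob.default
    return max(allowed, key=allowed.get)

if __name__ == "__main__":
    backoff = Knob("backoff", ("fixed", "linear", "exponential"), "fixed")
    # "quadratic" scores highest but is not declared, so it cannot be chosen.
    print(pick_value(backoff, {"exponential": 0.7, "linear": 0.2, "quadratic": 0.9}))
```

The point of the design is that an out-of-vocabulary answer is structurally impossible rather than merely unlikely.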

One small base, one multi-task adapter

The value picker (and the tool-calling "subagent" that wraps repair turns) is a Qwen2.5-1.5B base with a single LoRA adapter spanning the related tasks, served on a multi-LoRA-capable runtime (vLLM, PagedAttention — Kwon et al., 2023, arXiv:2309.06180) and scaled to zero when idle.

We keep one adapter rather than many, following the multi-task LoRA literature — Align-Don't-Divide (Liu et al., 2025, arXiv:2508.05078), LoRI (Zhang et al., 2025, arXiv:2504.07448), MeTA-LoRA (Cheng et al., 2025, arXiv:2510.11598) — and split to a second adapter only if measured negative transfer appears, never preemptively. The base technique is LoRA (Hu et al., 2021, arXiv:2106.09685); training is 4-bit QLoRA on a single mid-range GPU.
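For readers unfamiliar with the base technique, the LoRA formulation from Hu et al. (2021) freezes the pretrained weight and learns only a low-rank correction. The symbols below follow that paper; nothing here is specific to the adapter described on this page.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so each adapted matrix adds $r(d + k)$ trainable parameters instead of the $dk$ that full fine-tuning would update.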

The training signal comes from the compiler, not from labels

The defining choice is how the training data is made and filtered. Every example is generated and checked by the deterministic compiler and behavior factory; we keep only programs that dispatch and validate green. The supervision signal is a real execution outcome, not a human label or a learned reward model.

The loop:

Sample a (behavior, knob-value) tuple from the catalog's typed schema — every knob is typed, every default is hand-authored.
Synthesize natural-language intent for it by paraphrasing the behavior's description, synonyms, and per-knob descriptions.
Verify by dispatching through the factory and running orbital validate (and the runtime verifier). Drop anything that fails.
Assemble the surviving (intent, behavior, values, verified program) tuples into the training corpus.
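The four steps above can be sketched as a small filter loop. Everything in this example is a stand-in: the catalog is hypothetical, `synthesize_intent` uses a fixed template rather than paraphrasing, and `verify` is a simple schema check in place of the real factory dispatch and validation run. The sketch only shows the shape of the loop, in which an example enters the corpus only if verification succeeds.

```python
import random

# Hypothetical typed catalog: behavior -> knob -> allowed values.
CATALOG = {
    "retry_policy": {"attempts": [1, 3, 5], "backoff": ["fixed", "exponential"]},
    "cache_layer": {"ttl_seconds": [60, 300, 3600]},
}

def sample(rng: random.Random):
    """Step 1: draw a (behavior, knob-value) tuple from the typed schema."""
    behavior = rng.choice(sorted(CATALOG))
    values = {knob: rng.choice(opts) for knob, opts in CATALOG[behavior].items()}
    return behavior, values

def synthesize_intent(behavior: str, values: dict) -> str:
    """Step 2: stand-in for paraphrasing descriptions and synonyms."""
    parts = ", ".join(f"{k} set to {v}" for k, v in values.items())
    return f"I need a {behavior.replace('_', ' ')} with {parts}"

def verify(behavior: str, values: dict):
    """Step 3: stand-in for dispatch plus validation; returns a program or None."""
    schema = CATALOG.get(behavior)
    if schema is None:
        return None
    if any(values.get(knob) not in opts for knob, opts in schema.items()):
        return None
    return {"behavior": behavior, "config": dict(values)}

def build_corpus(n: int, seed: int = 0) -> list[dict]:
    """Step 4: keep only the tuples whose program verified."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n):
        behavior, values = sample(rng)
        intent = synthesize_intent(behavior, values)
        program = verify(behavior, values)
        if program is None:
            continue  # drop anything that fails verification
        corpus.append({"intent": intent, "behavior": behavior,
                       "values": values, "program": program})
    return corpus

if __name__ == "__main__":
    for example in build_corpus(3):
        print(example["intent"])
```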

This gives three properties that a label-based or reward-model approach lacks: correctness by construction (the compiler is the oracle), full coverage (the catalog can be enumerated, whereas hand-written specs cannot), and cold-start support for new behaviors (a newly added atom is picked up on the next training run, with no new annotation required).
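To make the coverage claim concrete, the sketch below enumerates every (behavior, knob-value) tuple of a small hypothetical catalog. It assumes every knob has a finite list of options; numeric knobs would presumably need to be discretized first, which is an assumption of this example rather than something the page states.

```python
from itertools import product

# Hypothetical typed catalog: behavior -> knob -> allowed values.
CATALOG = {
    "retry_policy": {"attempts": [1, 3, 5], "backoff": ["fixed", "exponential"]},
    "cache_layer": {"ttl_seconds": [60, 300, 3600]},
}

def enumerate_tuples(catalog: dict):
    """Yield every (behavior, values) combination the schema allows."""
    for behavior, knobs in catalog.items():
        names = list(knobs)
        for combo in product(*(knobs[name] for name in names)):
            yield behavior, dict(zip(names, combo))

if __name__ == "__main__":
    tuples = list(enumerate_tuples(CATALOG))
    print(len(tuples))  # 3 * 2 combinations for retry_policy plus 3 for cache_layer
```

Adding a new behavior to the catalog extends this enumeration automatically, which is the mechanism behind picking up new atoms on the next training run.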

Honest status

Picker — trained, deployed, validated (narrow). The value picker is trained and served, and clears its bucket-tolerant gate on service-pattern behaviors. It is not yet general: it covers domain knobs for a handful of service patterns, while real app builds also turn presentation/config knobs. Extending coverage to app behaviors is the open work.
Subagent — trained, currently dormant. The tool-calling subagent trains cleanly (low eval loss, high token accuracy on held-out trajectories from the same synthesized distribution). That proves it reproduces the call-and-repair format; it does not prove end-to-end build success — that needs the live battery eval. In practice the coordinator already supplies parameters through a deterministic fast path, so the model's inference seam is rarely invoked today.
Why a small local model is the goal. Correctness is decided by the compiler and the dual verifiers, so the model can be a self-hosted 1.5B adapter instead of a cloud frontier model — cheaper, private, and runnable offline.

The honest summary: the method (decompose → specialize → ground in execution) is sound and partly in production; making the specialized picker general across all behaviors is the next milestone.
