Structured outputs 2

🧠 What “structured output” means

Large Language Models (LLMs) such as GPT, LLaMA, or Mistral normally generate free-form text, much as a human would write a paragraph. But sometimes we want the model to output well-structured data, such as:

{
  "name": "Alice",
  "age": 25,
  "likes": ["hiking", "cats"]
}

That’s called structured output — the output must follow a specific format, schema, or grammar. This is useful because structured outputs are easy for programs to parse and use.
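The payoff is exactly that parseability. Once the output is valid JSON, one library call turns it into a native data structure (a minimal Python sketch):

```python
import json

# Raw text returned by the model (assumed here to be valid JSON).
model_output = '{"name": "Alice", "age": 25, "likes": ["hiking", "cats"]}'

# One call turns the text into a dict the rest of the program can use.
user = json.loads(model_output)
print(user["name"])      # Alice
print(user["likes"][0])  # hiking
```

If the model had emitted even one stray character outside the braces, `json.loads` would raise an error instead — which is what motivates the two approaches below.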


⚙️ Two ways to get structured output

There are two general approaches:

1. Prompt-only approach (soft enforcement)

You ask nicely in the prompt:

“Please output a JSON object with keys name, age, and likes.”

The model tries to follow the format, but it can still make mistakes (e.g., forget a comma, add extra text). This works for simple use cases but isn’t reliable when the format must be 100% valid.


2. Constrained decoding (hard enforcement)

Instead of just asking, the server or inference engine actually restricts which tokens (words/pieces of words) the model is allowed to output.

It uses your schema or grammar to mask invalid tokens in real time, ensuring every token generated keeps the output valid.

So:

  • The model predicts probabilities for the next token.
  • The inference engine zeroes out tokens that would break the structure.
  • The model picks the next valid token.
  • Repeat until done.

This guarantees that the final output follows the defined structure.
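The loop above can be sketched with a toy setup. Here the "model" is just a fixed preference table over single characters, and the "engine" only allows characters that keep the output extendable to "yes" or "no". The model would rather say "maybe", but the mask makes that impossible. (Illustrative only — real engines do this over vocabularies of tens of thousands of tokens.)

```python
# Force the output to be exactly "yes" or "no", one character per step.
CHOICES = ["yes", "no"]

def valid_next_chars(prefix: str) -> set:
    """Characters that keep the prefix extendable to some allowed choice."""
    return {c[len(prefix)] for c in CHOICES
            if c.startswith(prefix) and len(c) > len(prefix)}

def model_probs(prefix: str) -> dict:
    """Stand-in for the LLM: fixed preferences (it 'wants' to start 'maybe')."""
    return {"m": 0.4, "y": 0.25, "n": 0.2, "e": 0.1, "s": 0.03, "o": 0.02}

prefix = ""
while prefix not in CHOICES:               # repeat until done
    probs = model_probs(prefix)            # 1. model predicts probabilities
    allowed = valid_next_chars(prefix)     # 2. engine computes valid tokens
    masked = {c: p for c, p in probs.items() if c in allowed}  # 3. mask
    prefix += max(masked, key=masked.get)  # 4. pick the best valid token
print(prefix)  # yes
```

"m" is never chosen, because no allowed choice starts with it — the mask wins over the model's raw preference at every step.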


🧩 Who actually enforces the structure?

  • The model itself (like LLaMA or Mistral) just predicts the next token based on context. It doesn’t know about JSON, regex, or your schema.
  • The server/inference engine (like vLLM or OpenAI’s API) is the one that enforces the structure during token generation.

So — structured output enforcement is handled by the server, not by the model’s neural weights.


🚀 Does every model support structured output?

No — but here’s the nuance:

  • Any model can be prompted to produce structured text (soft approach).
  • Only some serving engines (like OpenAI’s API or vLLM) can enforce structure strictly (hard approach).

To check if structured output is truly supported:

  • See if the API or serving engine offers parameters like response_format, response_schema, guided_json, etc.
  • If it does, that means the serving layer supports constrained decoding.
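As a concrete illustration, vLLM's OpenAI-compatible server accepts extra request fields for this. A hedged sketch of such a request body (field names follow vLLM's documentation; the model name is a placeholder, and other servers use different fields):

```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [
    {"role": "user", "content": "Is the sky blue? Answer yes or no."}
  ],
  "guided_choice": ["yes", "no"]
}
```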

⚡ What about vLLM?

vLLM is an open-source inference engine — it runs LLMs efficiently, similar to how a web server runs code.

✅ vLLM supports structured output

vLLM added a feature called Structured Outputs with several backends:

  • guided_json → Enforces JSON structure
  • guided_regex → Enforces regex pattern
  • guided_choice → Limits outputs to a fixed list (like “yes” or “no”)
  • guided_grammar → Enforces custom grammars

Example (conceptual; recent vLLM versions expose this through GuidedDecodingParams, exact names vary by version, and the model name below is a placeholder):

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {"type": "object",
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
response = llm.generate("Extract the user's details as JSON.", params)

Internally, vLLM tracks which tokens are valid at each step and blocks any that break the schema.
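In practice the blocking happens at the logit level: before sampling, the engine sets disallowed tokens' scores to negative infinity, so after softmax their probability is exactly zero. A minimal sketch in plain Python (the tiny vocabulary is invented for illustration):

```python
import math

# Raw scores (logits) from the model for a 4-token vocabulary.
vocab = ["{", "hello", '"name"', "}"]
logits = [1.0, 3.0, 2.0, 0.5]

# Suppose the schema says the next token must open a JSON object.
valid_ids = {0}  # only "{" is allowed here

# Mask: push invalid logits to -inf so softmax assigns them probability 0.
masked = [l if i in valid_ids else float("-inf") for i, l in enumerate(logits)]

# Softmax over the masked logits.
exps = [math.exp(l) for l in masked]
total = sum(exps)
probs = [e / total for e in exps]
print(probs)  # [1.0, 0.0, 0.0, 0.0]
```

Note that "hello" had the highest raw score, but the mask removes it entirely: the model proposes, the engine disposes.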

🧩 Does vLLM check if the model supports it?

Not really — vLLM doesn’t ask the model if it supports structured output. Instead, it wraps the model’s generation with its own structured-output logic. As long as the model works with vLLM’s token interface, vLLM can enforce the schema externally.


🧭 In short

  • Free text generation → handled by the model → "The weather is sunny."
  • Prompted structure → the model (tries to follow the format) → "Please output JSON."
  • Enforced structure → the server / engine (like vLLM) → guarantees valid JSON

💡 Summary

  • LLMs predict text; they don’t natively enforce formats.
  • Structured output means restricting output to a schema, grammar, or pattern.
  • Enforcement happens at the server/inference layer (not inside the model).
  • vLLM supports structured output via guided decoding, by dynamically masking invalid tokens.
  • vLLM doesn’t depend on the model’s awareness — it enforces the structure externally.