Inside XGrammar Por dentro de XGrammar

How constrained decoding actually works Cómo funciona realmente la decodificación restringida

90-minute lecture · Advanced UG / Early Grad CS Clase de 90 minutos · CS avanzado / posgrado inicial

Salvador Hinojosa · Tec de Monterrey

HookAnzuelo One bad token = broken pipeline Un mal token rompe todo el pipeline

Your LLM was asked to return JSON. It returned: Le pediste JSON a tu LLM. Devolvió esto:


{
  "user": "Ada",
  "score": 99,
  "tags": ["math", "logic"]
  "active": true,
}

Three ways out: Tres salidas:

Retry until valid — unbounded latency. Reintentar hasta que sea válido — latencia no acotada.
Fine-tune for format — works most of the time. Hacer fine-tuning para el formato — funciona casi siempre.
Constrained decoding Decodificación restringida — make the bad token unsamplable. — hacer que el token malo no sea muestreable.

XGrammar picks option 3 and pays near zero extra latency for it. XGrammar elige la opción 3 y paga casi cero latencia extra por ella.

Learning objectives Objetivos de aprendizaje

By the end of this class you will be able to: Al terminar esta clase podrás:

DefineDefinir constrained decoding and why it beats post-hoc validation. la decodificación restringida y por qué es mejor que validar después.
TraceTrazar how a CFG becomes a byte-level pushdown automaton. cómo una GLC se convierte en un autómata de pila a nivel de bytes.
DistinguishDistinguir context-independent from context-dependent tokens. tokens contexto-independientes de los contexto-dependientes.
DescribeDescribir the per-step runtime algorithm that produces a vocabulary-sized bitmask in microseconds. el algoritmo por paso que genera una bitmask del tamaño del vocabulario en microsegundos.
ExplainExplicar why XGrammar overlaps CPU mask generation with GPU inference. por qué XGrammar solapa la generación de máscara en CPU con la inferencia en GPU.

Roadmap Mapa de ruta

BlockBloque	TopicTema	TimeTiempo
A	Hook & objectivesAnzuelo y objetivos	10 min
B	Constrained decoding 101Decodificación restringida 101	15 min
C	Why naïve doesn't scalePor qué lo ingenuo no escala	10 min
D	CFG → byte-level PDAGLC → APD a nivel de bytes	20 min
E	Adaptive token mask cacheCaché adaptativa de máscaras	20 min
F	Persistent stack + CPU/GPU overlapPila persistente + solapamiento CPU/GPU	10 min
G	Activity + exit ticketActividad + ticket de salida	5 min

Block BBloque B

Constrained decoding 101 Decodificación restringida 101

The fundamentals Los fundamentos

The decode loop, with a mask El ciclo de decodificación, con máscara


for _ in range(max_new_tokens):
    logits = model(input_ids)                   # GPU forward → (seq, |V|)
    logits[-1] += mask                          # mask only the next-token row: {0, -inf}^|V|
    probs    = softmax(logits[-1])
    next_tok = sample(probs)                    # top-k / nucleus / argmax
    input_ids.append(next_tok)

Adding the mask before softmax kills illegal tokens (-inf → 0 probability) and preserves the relative ranking of legal ones. The model's preferences are honored within the legal set. Sumar la máscara antes del softmax mata los tokens ilegales (-inf → 0 de probabilidad) y preserva el ranking relativo de los legales. Las preferencias del modelo se respetan dentro del conjunto legal.

The mask is a bit per token La máscara es un bit por token

For Llama-3 with |V|=128k: Para Llama-3 con |V|=128k:

Shape:Forma: (batch, ⌈|V|/32⌉)
Dtype:Tipo: int32 bitset bitset
Size:Tamaño: ~4 KB per request per step ~4 KB por solicitud por paso
Lives on CPU, gets shipped to GPU before the apply step. Vive en CPU, se envía a GPU antes del paso de aplicación.

Toy 32-token vocab, one decoding step: Vocabulario de juguete de 32 tokens, un paso:

allowedpermitido blockedbloqueado

Why we can't just "run the parser per token" Por qué no podemos solo "correr el parser por token"

Per decoding step, naïvely: Por paso de decodificación, de forma ingenua:

|V| = 128,000 token candidates |V| = 128,000 candidatos de token
Clone the parser, feed token bytes, see if it advances Clonar el parser, alimentarle los bytes del token, ver si avanza
If parser advance ≈ 1 µs/token → 128 ms / step on the mask alone Si avanzar el parser ≈ 1 µs/token → 128 ms / paso solo para la máscara
GPU forward (70B model) ≈ 25 ms/step Forward en GPU (modelo de 70B) ≈ 25 ms/paso

The mask is 5× slower than the model. Unacceptable. La máscara es 5× más lenta que el modelo. Inaceptable.

Saving graces: Lo que nos salva: the mask depends only on past tokens — so does the forward. They can run in parallel if we make the mask cheap enough. la máscara solo depende de los tokens previos — el forward también. Pueden correr en paralelo si abaratamos la máscara lo suficiente.

Check #1Pregunta #1 Quick check Pregunta rápida

Q:P: Why do we add the mask before softmax instead of after? ¿Por qué sumamos la máscara antes del softmax y no después?

A. Because softmax is faster on masked inputs. A. Porque el softmax es más rápido sobre entradas enmascaradas.
B. Because masking before softmax preserves the relative probabilities of legal tokens. B. Porque enmascarar antes del softmax preserva las probabilidades relativas de los tokens legales.
C. Because the GPU only supports masked softmax. C. Porque la GPU solo soporta softmax enmascarado.
D. Because -inf + softmax produces NaN we want to detect. D. Porque -inf + softmax produce NaN que queremos detectar.

Answer: B.Respuesta: B. Masking with -inf sends those tokens to probability 0 after softmax while the rest keep their relative ranking — the model's preferences are honored, only narrowed to the legal set. Enmascarar con -inf manda esos tokens a probabilidad 0 tras el softmax, mientras los demás conservan su ranking relativo — las preferencias del modelo se respetan, solo se acotan al conjunto legal.

Block DBloque D

CFG → byte-level PDA GLC → APD a nivel de bytes

From context-free grammar to pushdown automaton De la gramática libre de contexto al autómata de pila

Why a pushdown automaton? ¿Por qué un autómata de pila?

FSA / AF

Recognizes regular languages Reconoce lenguajes regulares
CannotNo puede handle balanced delimiters: { { } } manejar delimitadores balanceados: { { } }
JSON, arithmetic, code → not regular (need a CFG) JSON, aritmética, código → no regulares (necesitan una GLC)

PDA / APD

FSA + a stack AF + una pila
Recognizes context-free languages Reconoce lenguajes libres de contexto
Stack push/pop = enter/leave a nested rule Push/pop = entrar/salir de una regla anidada

XGrammar's twist: El giro de XGrammar: the PDA edges are labeled by bytes, not characters. This handles BPE tokens that contain partial UTF-8 codepoints. las aristas del APD se etiquetan con bytes, no con caracteres. Esto maneja tokens BPE que contienen codepoints UTF-8 parciales.

Construction from EBNF Construcción desde EBNF


expr ::= "(" expr ")"
       | "a"

One FSA per rule. Edges are either: Un AF por regla. Las aristas son:

Character edgesAristas de carácter — consume bytes — consumen bytes
Rule referencesReferencias a regla — push return state, jump to start of referenced rule — apilan el estado de retorno y saltan al inicio de la regla referenciada

Stack trace on input `(a)` Traza de pila sobre la entrada `(a)`

One level of recursion. Each fragment is one byte (or one auto-pop after accept). Un nivel de recursión. Cada fragmento es un byte (o un auto-pop tras aceptación).

Key insight to verbalize: consuming "a" actually does three things in one byte — push inner expr via the rule-ref edge, take the "a" branch, and auto-pop because we're at accept. The animation separates the auto-pop into fragment 3 so students see it. After step 3, ask: what if the input had been "]"? Answer: no transition, parser rejects. Motivates the mask. Idea clave para decir en voz alta: consumir "a" hace tres cosas en un byte — apila el expr interno vía la referencia a regla, toma la rama "a", y auto-desapila porque está en aceptación. La animación separa el auto-pop en el fragmento 3 para que los alumnos lo vean. Tras el paso 3 pregunta: ¿qué pasaría si la entrada hubiera sido "]"? Respuesta: sin transición, el parser rechaza. Motiva la máscara.

Parallel stacks Pilas paralelas

Real grammars are ambiguous — multiple productions match the same prefix. Las gramáticas reales son ambiguas — varias producciones casan con el mismo prefijo.
The matcher tracks a set of stacks, one per live parse. El matcher mantiene un conjunto de pilas, una por cada parseo vivo.
A token is accepted if any stack accepts it. Un token se acepta si cualquiera de las pilas lo acepta.
Masks across stacks are merged by union. Las máscaras se combinan por unión entre pilas.

This is why we need rollback to be cheap: we test 100k+ tokens per step against multiple stacks. Block F shows the trick. Por eso necesitamos rollback barato: probamos 100k+ tokens por paso contra varias pilas. El bloque F muestra el truco.

Check #2Pregunta #2 Quick check Pregunta rápida

Q:P: Why does XGrammar label PDA edges with bytes rather than characters? ¿Por qué XGrammar etiqueta las aristas del APD con bytes y no con caracteres?

A. Bytes use less memory than characters. A. Los bytes usan menos memoria que los caracteres.
B. BPE tokens can contain partial UTF-8 codepoints, so byte-level edges let one token cleanly walk the FSA byte-by-byte. B. Los tokens BPE pueden contener codepoints UTF-8 parciales, así que las aristas a nivel de bytes dejan que un token recorra el AF byte por byte limpiamente.
C. C++ doesn't have a Unicode type. C. C++ no tiene un tipo Unicode.
D. The paper says CFGs only operate on bytes. D. El paper dice que las GLC solo operan sobre bytes.

Answer: B.Respuesta: B. A single BPE token may carry the first half of a multibyte codepoint. Byte-level edges sidestep that — the matcher walks bytes inside the token. Un solo token BPE puede llevar la primera mitad de un codepoint multibyte. Las aristas a nivel de bytes esquivan eso — el matcher recorre bytes dentro del token.

Block EBloque E

The adaptive token mask cache La caché adaptativa de máscaras

The heart of XGrammar El corazón de XGrammar

Key observation Observación clave

For most tokens, validity depends only on the stack-top node, not the rest of the stack. Para la mayoría de los tokens, la validez depende solo del nodo cima de la pila, no del resto.

Token stays in current rule, or pushes a child rule → top-only check works. El token sigue en la regla actual o apila una regla hija → basta mirar la cima.
Token finishes the current rule and pops → need the full stack. El token termina la regla actual y desapila → necesita toda la pila.

Context-independent Contexto-independientes
Decided by stack-top node alone.
→ precompute at compile time. Decididos solo con el nodo cima.
→ precalcular en tiempo de compilación.

Context-dependent Contexto-dependientes
Need the entire stack.
→ check at runtime only. Necesitan toda la pila.
→ revisar solo en tiempo de ejecución.

How rare is "context-dependent"? ¿Qué tan raros son los "contexto-dependientes"?

~1%

of vocabulary is context-dependent at a typical PDA node del vocabulario es contexto-dependiente en un nodo típico del APD

1,134 / 128k

tokens, measured on Llama-3.1 with JSON tokens, medido en Llama-3.1 con JSON

After context expansion: Tras la expansión de contexto: drops to ~120 tokens per step. The runtime only has to actually run the PDA for roughly a hundred tokens out of 128 thousand. baja a ~120 tokens por paso. El runtime solo necesita ejecutar realmente el APD sobre unos cien tokens de 128 mil.

Compile time vs. runtime Tiempo de compilación vs. tiempo de ejecución

At compile time (once) En compilación (una vez)

Build PDA from CFG Construir el APD desde la GLC
For every PDA node n, partition V into
accepted_CI(n) ∪ rejected_CI(n) ∪ CD(n) Para cada nodo n, partir V en
aceptados_CI(n) ∪ rechazados_CI(n) ∪ CD(n)
Store in the adaptive token mask cache Guardar en la caché adaptativa

At runtime (every step) En ejecución (cada paso)

Look up cache for current node(s) Consultar la caché para el/los nodo(s) actuales
Actually run the PDA only on the ~120 CD tokens Ejecutar realmente el APD solo sobre los ~120 tokens CD
Union the masks across parallel stacks Unir las máscaras de las pilas paralelas


# Pseudo-code: one runtime step
def fill_next_token_bitmask(stacks):
    mask = empty_bitmask(|V|)
    for top in {s.top for s in stacks}:
        ci_accept, ci_reject, cd_set = cache[top]
        mask |= ci_accept
        for tok in cd_set:
            if pda_advance_ok(stacks, tok):
                mask.set(tok)
    return mask

"Adaptive" storage — three options Almacenamiento "adaptativo" — tres opciones

CaseCaso	StoreAlmacenar	WhyPor qué
Mostly acceptedMayoría aceptados	`rejected_CI ∪ CD`	Negative set is tiny El conjunto negativo es diminuto
Mostly rejectedMayoría rechazados	`accepted_CI ∪ CD`	Positive set is tiny El conjunto positivo es diminuto
BalancedEquilibrado	`\|V\|`-bit bitset	Sparse sets would be bigger than a bitmap Los conjuntos dispersos serían más grandes que un bitmap

~160 MB

naïve per-node bitmasks bitmasks ingenuas por nodo

0.46 MB

with adaptive storage con almacenamiento adaptativo

Llama-3.1 + JSON. ~350× smaller. Source: XGrammar paper §3. Llama-3.1 + JSON. ~350× más pequeño. Fuente: paper de XGrammar §3.

Worked example — classify the tokens Ejemplo guiado — clasifica los tokens


list  ::= "[" elems "]"
elems ::= digit ("," digit)*
digit ::= "0" | "1" | "2"

You are at the PDA node right after consuming [. Classify: Estás en el nodo del APD justo después de consumir [. Clasifica:

Tokens (left→right): Tokens (izq→der): [ ] 0 1 2 , [0 0] 00 X

Accepted CI:CI aceptados: 0 1 2 — advance within digit — avanzan dentro de digit
Rejected CI:CI rechazados: [ ] , [0 00 X — no valid transition from this node — sin transición válida desde este nodo
Context-dependent:Contexto-dependiente: 0] — second byte triggers a pop into the parent list rule — el segundo byte dispara un pop hacia la regla padre list

Check #3Pregunta #3 Quick check Pregunta rápida

Q:P: Why isn't the cache simply a |V|-bit mask per PDA node, always? ¿Por qué la caché no es siempre, simplemente, una máscara de |V| bits por nodo del APD?

A. It would be too slow to look up. A. Sería demasiado lento consultarla.
B. It would use too much memory — most nodes have very skewed accept/reject ratios, so storing the smaller subset is cheaper. B. Usaría demasiada memoria — la mayoría de los nodos tienen ratios aceptar/rechazar muy sesgados, así que guardar el subconjunto más pequeño sale más barato.
C. Bitmasks can't represent "context-dependent". C. Las bitmasks no pueden representar "contexto-dependiente".
D. Lookup needs a hash for sparse cases. D. La consulta necesita un hash para casos dispersos.

Answer: B.Respuesta: B. 160 MB → 0.46 MB compression comes from storing only the smaller of {accepted, rejected} per node, plus the tiny CD set. The bitmap is used only when they're roughly balanced. La compresión de 160 MB → 0.46 MB viene de guardar solo el menor de {aceptados, rechazados} por nodo, más el diminuto conjunto CD. El bitmap solo se usa cuando están equilibrados.

Block FBloque F

Persistent stack + CPU/GPU overlap Pila persistente + solapamiento CPU/GPU

How the remaining cost disappears Cómo desaparece el costo restante

Persistent execution stack Pila de ejecución persistente

We test ~120 CD tokens per step. Naïvely copying the stack each time → quadratic. Probamos ~120 tokens CD por paso. Copiar la pila cada vez de forma ingenua → cuadrático.

TrickTruco (Driscoll, Sarnak, Sleator, Tarjan, 1989): store all stacks as a tree. A "stack" is just a pointer to a tree node; the path to the root reconstructs the stack. (Driscoll, Sarnak, Sleator, Tarjan, 1989): guardar todas las pilas como un árbol. Una "pila" es solo un puntero a un nodo del árbol; el camino a la raíz reconstruye la pila.

ForkingBifurcar = one pointer write = una escritura de puntero O(1)

RollbackRevertir = swap the pointer back = intercambiar el puntero de vuelta O(1)

Plus:Además: sort candidate tokens lexicographically and roll back only to the longest common prefix between successive tests. Drops per-step byte work to ~30%. ordena los tokens candidatos lexicográficamente y revierte solo hasta el prefijo común más largo entre pruebas sucesivas. Reduce el trabajo de bytes por paso al ~30%.

The CPU / GPU swimlane Los carriles CPU / GPU

Mask generation needs only past tokens. Transformer forward needs only past tokens. They can run in parallel and only synchronize at the sampler. Generar la máscara solo necesita los tokens previos. El forward del transformer también. Pueden correr en paralelo y solo se sincronizan en el sampler.

GPU

forward(t)

sample(t)

forward(t+1)

sample(t+1)

forward(t+2)

CPU

mask(t)

accept(t)

mask(t+1)

accept(t+1)

mask(t+2)

t → wall-clock time tiempo real

Mask generation is well under the per-step GPU budget → disappears from the critical path. End-to-end overhead measured at ~0% with vLLM / MLC-LLM. Generar la máscara está muy por debajo del presupuesto GPU por paso → desaparece del camino crítico. Sobrecarga end-to-end medida en ~0% con vLLM / MLC-LLM.

Impact in practice Impacto en la práctica

100×

lower per-token mask latency vs. prior libraries menor latencia de máscara por token vs. librerías previas

~0%

end-to-end overhead with fast inference engines sobrecarga end-to-end con motores de inferencia rápidos

Default backend in vLLM, SGLang, TensorRT-LLM, and MLC-LLM. Backend por defecto en vLLM, SGLang, TensorRT-LLM y MLC-LLM.

Block GBloque G

Wrap-up & exit ticket Cierre y ticket de salida

Five takeaways Cinco ideas para llevar

Constrained decoding = mask the logits before softmax. 100% structural correctness, no model surgery. Decodificación restringida = enmascarar los logits antes del softmax. 100% de corrección estructural, sin tocar el modelo.
A byte-level PDA handles CFGs and BPE tokens cleanly. Un APD a nivel de bytes maneja GLCs y tokens BPE de forma limpia.
~99% of tokens at any PDA node are context-independent and can be cached. El ~99% de los tokens en cualquier nodo del APD son contexto-independientes y pueden cachearse.
Adaptive storage (smaller of accepted/rejected) shrinks the cache by ~350×. El almacenamiento adaptativo (el menor entre aceptados/rechazados) reduce la caché ~350×.
Persistent stacks + CPU/GPU overlap hide the remaining cost behind the forward pass. Las pilas persistentes + el solapamiento CPU/GPU esconden el costo restante tras el forward.

Exit ticket Ticket de salida

Define context-independent vs context-dependent tokens in one sentence each. Define en una oración cada uno: tokens contexto-independientes vs contexto-dependientes.
In the example grammar, after consuming [0, is , accepted, rejected, or context-dependent? Justify. En la gramática del ejemplo, tras consumir [0, ¿el token , es aceptado, rechazado o contexto-dependiente? Justifica.
In two sentences, why doesn't CPU mask generation add latency to the decode loop in practice? En dos oraciones, ¿por qué la generación de máscara en CPU no añade latencia al ciclo de decodificación en la práctica?

Hand in on an index card on your way out. Entrega en una tarjeta al salir.

References Referencias

Dong et al. (2024). XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. arXiv:2411.15100
MLC blog (Nov 2024). Achieving Efficient, Flexible, and Portable Structured Generation with XGrammar. blog.mlc.ai
XGrammar docs — Workflow: Docs de XGrammar — Workflow: xgrammar.mlc.ai/docs
GitHub: mlc-ai/xgrammar
Driscoll, Sarnak, Sleator, Tarjan (1989). Making Data Structures Persistent.

Companion lesson plan: Plan de clase complementario: xgrammar_lesson_plan_90min.md

Block HBloque H

Hands-on lab: programmatic use & integrations Laboratorio: uso programático e integraciones

30 minutes · open the notebook now 30 minutos · abran el notebook ahora

Meet the tools you'll see Conoce las herramientas que verás

A 30-second intro to each. All four use XGrammar under the hood. Una intro de 30 segundos a cada una. Las cuatro usan XGrammar por debajo.

Library HF Transformers

The de-facto Python library for transformer models — load any HuggingFace model in one line. A LogitsProcessor is a callback that runs inside model.generate() and can edit the logits before sampling. XGrammar ships one (xgr.contrib.hf.LogitsProcessor) that applies the bitmask. Easiest way to try constrained decoding. La librería estándar de Python para modelos transformer — carga cualquier modelo de HuggingFace en una línea. Un LogitsProcessor es un callback que corre dentro de model.generate() y puede editar los logits antes del muestreo. XGrammar trae uno (xgr.contrib.hf.LogitsProcessor) que aplica la bitmask. La vía más fácil para probar decodificación restringida.

Server vLLM

High-throughput LLM serving library from UC Berkeley. Famous for PagedAttention (efficient KV-cache memory). Most popular open-source inference server in production. Exposes an OpenAI-compatible HTTP API. XGrammar is its default structured-output backend. Librería de serving de LLMs de alto throughput de UC Berkeley. Famosa por PagedAttention (memoria de KV-cache eficiente). El servidor de inferencia open-source más popular en producción. Expone una API HTTP compatible con OpenAI. XGrammar es su backend por defecto para salidas estructuradas.

Framework SGLang

LLM serving runtime plus a frontend DSL for orchestrating multi-step prompts. Built by the LMSYS team (the same group behind Chatbot Arena). Optimized for structured tasks, agents, and tool-calling. XGrammar is the default grammar_backend. Runtime de serving de LLMs más un DSL frontend para orquestar prompts multi-paso. Hecho por el equipo de LMSYS (los mismos detrás de Chatbot Arena). Optimizado para tareas estructuradas, agentes y tool-calling. XGrammar es el grammar_backend por defecto.

Compiler MLC-LLM

"Machine Learning Compilation" for LLMs, from CMU and collaborators — same authors as XGrammar. Compiles models to Apple Silicon (Metal), WebGPU (in-browser via WebLLM), Android, iOS. The only path here for offline-first or mobile deployment. "Machine Learning Compilation" para LLMs, de CMU y colaboradores — mismos autores que XGrammar. Compila modelos a Apple Silicon (Metal), WebGPU (en el navegador vía WebLLM), Android, iOS. La única vía aquí para despliegue offline-first o móvil.

💡 Tip: any dotted-underlined word like softmax opens a definition. 💡 Tip: cualquier palabra subrayada como softmax abre una definición.

The public API in one slide La API pública en una diapositiva

Four classes for compile time, one for runtime — exactly mirroring Block E. Cuatro clases para compilación, una para ejecución — espejo exacto del bloque E.


Grammar              # WHAT you want    (JSON, JSON schema, regex, EBNF)
TokenizerInfo        # WHO you serve    (vocab + chat template)
GrammarCompiler      # the COMPILE step (builds PDA + adaptive cache)
CompiledGrammar      # the RESULT       (pickleable, reusable)
GrammarMatcher       # the RUNTIME      (accept_token, fill_next_token_bitmask)

First four = once per grammar. Only GrammarMatcher runs per decoding step. Las primeras cuatro = una sola vez por gramática. Solo GrammarMatcher corre por paso de decodificación.

The easy path: HuggingFace Transformers El camino fácil: HuggingFace Transformers


pip install "xgrammar>=0.1.30" transformers torch pydantic


import xgrammar as xgr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=config.vocab_size)
compiler = xgr.GrammarCompiler(tokenizer_info)
compiled = compiler.compile_builtin_json_grammar()  # or .compile_json_schema(...)

logits_processor = xgr.contrib.hf.LogitsProcessor(compiled)
out = model.generate(**inputs, max_new_tokens=256, logits_processor=[logits_processor])
print(tokenizer.decode(out[0], skip_special_tokens=True))

The mask + apply happens invisibly inside LogitsProcessor. Compile once, reuse forever. El cálculo y la aplicación de la máscara ocurren invisiblemente dentro de LogitsProcessor. Compila una vez, reutiliza siempre.

⚠️ Colab gotchasTrampas en Colab:

meta-llama/Llama-3.2-1B-Instruct is a gated repo. Either log in with huggingface_hub.login(token) first, or swap in an open model like gpt2-medium. meta-llama/Llama-3.2-1B-Instruct es un repo gated. Loguéate con huggingface_hub.login(token) o cambia a un modelo abierto como gpt2-medium.
Base models (like gpt2-medium) have no chat template. Replace tokenizer.apply_chat_template(...) with a manual prompt: "User: … \nAssistant:". Los modelos base (como gpt2-medium) no tienen chat template. Reemplaza tokenizer.apply_chat_template(...) con un prompt manual: "User: … \nAssistant:".
xgrammar doesn't expose __version__. Use pip show xgrammar from a shell cell if you need the version. xgrammar no expone __version__. Usa pip show xgrammar desde una celda de shell si necesitas la versión.

The transparent path: the manual matcher loop El camino transparente: el loop manual del matcher

Same algorithm as Block E slide 19 — now in real Python. El mismo algoritmo de la diapositiva 19 del bloque E — ahora en Python real.


matcher = xgr.GrammarMatcher(compiled)
token_bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)

while not matcher.is_terminated():
    logits = model(torch.tensor([input_ids], device=model.device)).logits
    matcher.fill_next_token_bitmask(token_bitmask)            # cache lookup + CD check
    xgr.apply_token_bitmask_inplace(logits[0, -1, :], token_bitmask.to(logits.device))
    next_id = torch.argmax(torch.softmax(logits[0, -1, :], dim=-1)).item()
    matcher.accept_token(next_id)                              # advances persistent stack
    input_ids.append(next_id)

Three lines of real code map to three theoretical concepts from the first 90 minutes. Tres líneas de código real mapean a tres conceptos teóricos de los primeros 90 minutos.

Production serving: vLLM Serving en producción: vLLM

OfflineSin servidor


from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

guided = GuidedDecodingParams(
    json=schema, backend="xgrammar"
)
sp = SamplingParams(
    temperature=0.7, max_tokens=128,
    guided_decoding=guided,
)
out = llm.generate(["Describe Ada as JSON."], sp)
print(out[0].outputs[0].text)

OpenAI-compatible server Servidor compatible con OpenAI


vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --guided-decoding-backend xgrammar


from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1",
                api_key="-")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user",
               "content": "Ada as JSON."}],
    extra_body={
        "guided_json": schema,
        "guided_decoding_backend": "xgrammar",
    },
)

Keys: guided_json, guided_regex, guided_choice, guided_grammar (full EBNF — even SQL). Claves: guided_json, guided_regex, guided_choice, guided_grammar (EBNF completo — hasta SQL).

SGLang


import sglang as sgl, json
from pydantic import BaseModel

class Capital(BaseModel):
    name: str; population: int

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct",
                 grammar_backend="xgrammar")  # default anyway

sampling_params = {
    "temperature": 0.1, "top_p": 0.95,
    "json_schema": json.dumps(Capital.model_json_schema()),
    # or: "regex": "(France|England)"
    # or: "ebnf": "root ::= 'yes' | 'no'"
}
outputs = llm.generate(["Capital of France, JSON please."], sampling_params)
print(outputs[0]["text"])

SGLang exposes regex, ebnf, and structural_tag as first-class sampling params. Best fit for agentic tool-calling patterns. SGLang expone regex, ebnf y structural_tag como parámetros de muestreo de primera clase. Ideal para patrones agénticos con tool-calling.

MLC-LLM — Apple Silicon & browser MLC-LLM — Apple Silicon y navegador


from mlc_llm import MLCEngine

engine = MLCEngine("HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC")

resp = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Describe Ada as JSON."}],
    response_format={"type": "json_object", "schema": schema},
)
print(resp.choices[0].message.content)
engine.terminate()

Same authors as XGrammar — tightest integration. Mismos autores que XGrammar — la integración más estrecha.
Only engine in this list that runs in-browser via WebLLM. El único motor de esta lista que corre en el navegador vía WebLLM.
Pick this for offline-first or mobile apps. Elígelo para apps offline-first o móviles.

Which engine when? ¿Qué motor cuándo?

If you want…Si quieres…	Reach forUsa
Minimum-friction educational integrationIntegración educativa con mínima fricción	HF Transformers + `xgr.contrib.hf.LogitsProcessor`
Production serving with the OpenAI APIServing en producción con la API de OpenAI	vLLM
Offline batch + structural tags / agentsBatch offline + structural tags / agentes	SGLang
Apple Silicon / browser / mobileApple Silicon / navegador / móvil	MLC-LLM
To understand the internalsEntender las tripas	manual `GrammarMatcher` looploop manual de `GrammarMatcher`

All four use XGrammar under the hood. Only the surface differs — the algorithm you spent 90 minutes learning is the same. Los cuatro usan XGrammar por debajo. Solo cambia la superficie — el algoritmo que aprendieron en 90 minutos es el mismo.

Q & A Preguntas

Thanks! ¡Gracias!

Inside XGrammar Por dentro de XGrammar

How constrained decoding actually works Cómo funciona realmente la decodificación restringida

HookAnzuelo One bad token = broken pipeline Un mal token rompe todo el pipeline

Learning objectives Objetivos de aprendizaje

Roadmap Mapa de ruta

Constrained decoding 101 Decodificación restringida 101

The decode loop, with a mask El ciclo de decodificación, con máscara

The mask is a bit per token La máscara es un bit por token

Why we can't just "run the parser per token" Por qué no podemos solo "correr el parser por token"

Check #1Pregunta #1 Quick check Pregunta rápida

CFG → byte-level PDA GLC → APD a nivel de bytes

Why a pushdown automaton? ¿Por qué un autómata de pila?

Construction from EBNF Construcción desde EBNF

Stack trace on input (a) Traza de pila sobre la entrada (a)

Parallel stacks Pilas paralelas

Check #2Pregunta #2 Quick check Pregunta rápida

The adaptive token mask cache La caché adaptativa de máscaras

Key observation Observación clave

How rare is "context-dependent"? ¿Qué tan raros son los "contexto-dependientes"?

Compile time vs. runtime Tiempo de compilación vs. tiempo de ejecución

At compile time (once) En compilación (una vez)

At runtime (every step) En ejecución (cada paso)

"Adaptive" storage — three options Almacenamiento "adaptativo" — tres opciones

Worked example — classify the tokens Ejemplo guiado — clasifica los tokens

Check #3Pregunta #3 Quick check Pregunta rápida

Persistent stack + CPU/GPU overlap Pila persistente + solapamiento CPU/GPU

Persistent execution stack Pila de ejecución persistente

The CPU / GPU swimlane Los carriles CPU / GPU

Impact in practice Impacto en la práctica

Wrap-up & exit ticket Cierre y ticket de salida

Five takeaways Cinco ideas para llevar

Exit ticket Ticket de salida

References Referencias

Hands-on lab: programmatic use & integrations Laboratorio: uso programático e integraciones

Meet the tools you'll see Conoce las herramientas que verás

The public API in one slide La API pública en una diapositiva

The easy path: HuggingFace Transformers El camino fácil: HuggingFace Transformers

The transparent path: the manual matcher loop El camino transparente: el loop manual del matcher

Production serving: vLLM Serving en producción: vLLM

OfflineSin servidor

OpenAI-compatible server Servidor compatible con OpenAI

SGLang

MLC-LLM — Apple Silicon & browser MLC-LLM — Apple Silicon y navegador

Which engine when? ¿Qué motor cuándo?

Q & A Preguntas

Stack trace on input `(a)` Traza de pila sobre la entrada `(a)`