Reflection loops are one of the most powerful patterns in LLM agent architecture. The idea: don't trust the first answer. Have the agent evaluate its own output and iterate if quality is insufficient.
LangGraph makes this pattern explicit and debuggable through its graph-based state model. Here's how to implement it correctly.
Summary
A reflection loop in LangGraph consists of three components: a generation node that produces initial output, a reflection node that evaluates that output using a structured rubric, and a conditional edge that routes back to generation (with feedback) or forward to finalization. The key is structured reflection output — not free-text critique — so the feedback can reliably drive the conditional routing.
The Core Pattern
Generate
   │
   ▼
Reflect ──→ [quality < threshold] ──→ Generate (with feedback)
   │
   │ [quality >= threshold]
   ▼
Finalize
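Stripped of any framework, the diagram above is a bounded loop. The following is a minimal sketch of that control flow, not the LangGraph implementation; `run_reflection_loop`, `fake_generate`, and `fake_reflect` are hypothetical names, and the two stubs stand in for LLM calls.

```python
from typing import Callable, List, Tuple


def run_reflection_loop(
    task: str,
    generate: Callable[[str, List[str]], str],
    reflect: Callable[[str, str], Tuple[float, List[str]]],
    threshold: float = 0.75,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Plain-Python rendering of Generate -> Reflect -> (loop | Finalize)."""
    feedback: List[str] = []
    draft, score, iteration = "", 0.0, 0
    for iteration in range(1, max_iterations + 1):
        draft = generate(task, feedback)        # Generate (with feedback after round 1)
        score, feedback = reflect(task, draft)  # Reflect
        if score >= threshold:                  # Good enough: stop looping
            break
    return draft, score, iteration              # Finalize


# Stub callables standing in for LLM calls (hypothetical, illustration only).
def fake_generate(task: str, feedback: List[str]) -> str:
    return f"{task} v{len(feedback) + 1}"


def fake_reflect(task: str, draft: str) -> Tuple[float, List[str]]:
    # Pretend the second version is good enough.
    return (0.9, []) if draft.endswith("v2") else (0.4, ["too vague"])


result = run_reflection_loop("Explain caching", fake_generate, fake_reflect)
```

The rest of this article maps each piece of this loop onto a graph node or edge, which is what makes the intermediate state inspectable.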
State Design
State is the contract between nodes. Design it explicitly:
from pydantic import BaseModel
from typing import Optional

class ReflectionState(BaseModel):
    # The task
    task: str

    # Generation
    draft: Optional[str] = None

    # Reflection output
    reflection_score: float = 0.0
    reflection_issues: list[str] = []
    reflection_suggestions: list[str] = []

    # Loop control
    iteration: int = 0
    max_iterations: int = 3

    # Final output
    final_output: Optional[str] = None
Generation Node
The generation node must incorporate reflection feedback from previous iterations:
# Assumes `openai_client` is an `openai.AsyncOpenAI()` instance and
# `GENERATION_SYSTEM_PROMPT` is a string defined elsewhere in the module.
async def generate_node(state: ReflectionState) -> ReflectionState:
    """
    Generate or regenerate output, incorporating reflection feedback
    from previous iterations if available.
    """
    # Build messages: include feedback from previous reflection if any
    messages = [{"role": "system", "content": GENERATION_SYSTEM_PROMPT}]

    user_content = f"Task: {state.task}"

    if state.draft and state.reflection_issues:
        # Include the previous draft and specific feedback
        user_content += f"""
Previous attempt:
{state.draft}
Issues identified:
{chr(10).join(f"- {issue}" for issue in state.reflection_issues)}
Suggestions:
{chr(10).join(f"- {s}" for s in state.reflection_suggestions)}
Please revise to address these issues.
"""

    messages.append({"role": "user", "content": user_content})

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )

    # Return a copy of the state with the new draft and a bumped counter
    return state.model_copy(update={
        "draft": response.choices[0].message.content,
        "iteration": state.iteration + 1,
    })
Reflection Node
The reflection node is where most implementations go wrong. Free-text reflection fails because you can't reliably parse it for routing decisions. Use structured output:
from pydantic import BaseModel, Field

class ReflectionResult(BaseModel):
    """Structured output for reflection evaluation."""
    score: float = Field(ge=0.0, le=1.0, description="Overall quality 0-1")
    is_complete: bool = Field(description="Does it fully address the task?")
    is_accurate: bool = Field(description="Are all claims verifiable and accurate?")
    issues: list[str] = Field(description="Specific problems found")
    suggestions: list[str] = Field(description="Specific improvements needed")

async def reflect_node(state: ReflectionState) -> ReflectionState:
    """
    Evaluate the current draft against quality criteria.
    Returns structured feedback for routing and next-iteration improvement.
    """
    response = await openai_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a quality evaluator. Assess the draft against the task.
Be specific about issues; vague feedback is useless for improvement."""
            },
            {
                "role": "user",
                "content": f"""
Task: {state.task}
Draft to evaluate:
{state.draft}
Evaluate on: completeness, accuracy, specificity, and clarity.
"""
            }
        ],
        response_format=ReflectionResult,
    )

    result = response.choices[0].message.parsed
    if result is None:
        # `parsed` is empty when the model refuses or returns no structured output
        raise ValueError("Reflection call returned no structured result")

    return state.model_copy(update={
        "reflection_score": result.score,
        "reflection_issues": result.issues,
        "reflection_suggestions": result.suggestions,
    })
Conditional Edge
The routing function determines whether to loop or continue:
import logging

logger = logging.getLogger(__name__)

QUALITY_THRESHOLD = 0.75  # Minimum reflection score that counts as good enough

def should_continue(state: ReflectionState) -> str:
    """
    Route based on reflection score and iteration count.
    Never loop infinitely: respect max_iterations.
    """
    if state.reflection_score >= QUALITY_THRESHOLD:
        return "finalize"

    if state.iteration >= state.max_iterations:
        # Force completion even though quality is below the threshold,
        # and log it for monitoring
        logger.warning(
            "Max iterations reached with low quality score: %.2f",
            state.reflection_score,
        )
        return "finalize"

    return "generate"  # Loop back with feedback
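Because the router is a pure function of state, it can be checked without calling a model. The sketch below applies the same routing rules to a hypothetical stand-in state (`RouterState`) and a hypothetical `route` function that takes the threshold as a parameter; neither name comes from LangGraph.

```python
from dataclasses import dataclass


@dataclass
class RouterState:
    """Hypothetical stand-in exposing only the fields the router reads."""
    reflection_score: float
    iteration: int
    max_iterations: int = 3


def route(state: RouterState, threshold: float = 0.75) -> str:
    """Same routing rules as above, with the threshold passed in."""
    if state.iteration >= state.max_iterations:
        return "finalize"  # Iteration cap reached: always escape
    if state.reflection_score >= threshold:
        return "finalize"  # Quality is good enough
    return "generate"      # Loop back with feedback


# One case per branch: low score, passing score, cap reached with low score.
cases = [
    (RouterState(reflection_score=0.40, iteration=1), "generate"),
    (RouterState(reflection_score=0.75, iteration=1), "finalize"),
    (RouterState(reflection_score=0.40, iteration=3), "finalize"),
]
results = [route(state) for state, _expected in cases]
```

Covering the cap-reached branch explicitly is the cheap way to catch a regression that would reintroduce an unbounded loop.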
Graph Assembly
from langgraph.graph import StateGraph, END

def build_reflection_graph():
    graph = StateGraph(ReflectionState)

    # finalize_node is assumed to be defined alongside the other nodes
    graph.add_node("generate", generate_node)
    graph.add_node("reflect", reflect_node)
    graph.add_node("finalize", finalize_node)

    graph.set_entry_point("generate")
    graph.add_edge("generate", "reflect")

    # Conditional routing from reflect
    graph.add_conditional_edges(
        "reflect",
        should_continue,
        {
            "generate": "generate",  # Loop
            "finalize": "finalize",  # Continue
        }
    )

    graph.add_edge("finalize", END)

    return graph.compile()
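The graph registers a `finalize_node` whose body is not shown. A minimal sketch, assuming finalization only needs to promote the last draft to `final_output`: it returns a partial update as a dict, which LangGraph merges into the state. The `DraftState` stand-in is hypothetical and exists only so the snippet runs on its own; in the real graph the node would receive a `ReflectionState`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftState:
    """Hypothetical stand-in exposing only the fields finalize_node touches."""
    draft: Optional[str] = None
    final_output: Optional[str] = None


def finalize_node(state) -> dict:
    """Promote the most recent draft to the final output."""
    if state.draft is None:
        # Reaching finalize without a draft means the graph was wired incorrectly
        raise ValueError("finalize reached without a draft")
    return {"final_output": state.draft}


update = finalize_node(DraftState(draft="Final answer text"))
```

A real finalizer might also trim whitespace or attach the final score; that is left out here because the article does not specify it.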
Anti-Patterns to Avoid
1. Free-text reflection output
# Wrong: can't parse this reliably
reflection_text = (
    "The response is pretty good but could be more specific "
    "about the technical details and maybe add more examples."
)

# Right: structured output with numeric score and lists
result = ReflectionResult(score=0.62, issues=["lacks technical specificity"], ...)
2. Unbounded loops
# Wrong: no iteration cap
def should_continue(state):
    if state.reflection_score < 0.8:
        return "generate"  # Can run forever
    return "finalize"

# Right: explicit max_iterations cap (log forced exits, as in the routing function above)
def should_continue(state):
    if state.iteration >= state.max_iterations:
        return "finalize"  # Always escape
    if state.reflection_score >= 0.75:
        return "finalize"
    return "generate"
3. Reflection without feedback injection
# Wrong: reflect, then regenerate without telling the model what was wrong
# The model has no idea what to fix
# Right: pass reflection.issues and reflection.suggestions into
# the next generation prompt explicitly
4. Same prompt for first and subsequent iterations
The generation node should detect whether it's the first iteration or a revision:
is_revision = state.iteration > 0 and bool(state.reflection_issues)
if is_revision:
    ...  # Include explicit "what to fix" context
else:
    ...  # Clean initial generation
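One way to keep the two cases apart is to build the user prompt in a small pure function, which also makes it testable without a model call. This is a sketch under that assumption; `build_user_prompt` is a hypothetical helper, and the section labels are taken from the generation node earlier in the article.

```python
from typing import Optional, Sequence


def build_user_prompt(
    task: str,
    draft: Optional[str],
    issues: Sequence[str],
    suggestions: Sequence[str],
) -> str:
    """Return a clean prompt on the first pass and a revision prompt afterwards."""
    prompt = f"Task: {task}"
    if draft and issues:
        # Revision: show the previous draft and exactly what to fix
        issue_lines = "\n".join(f"- {issue}" for issue in issues)
        suggestion_lines = "\n".join(f"- {s}" for s in suggestions)
        prompt += (
            f"\n\nPrevious attempt:\n{draft}"
            f"\n\nIssues identified:\n{issue_lines}"
            f"\n\nSuggestions:\n{suggestion_lines}"
            "\n\nPlease revise to address these issues."
        )
    return prompt


first = build_user_prompt("Summarize the RFC", None, [], [])
revision = build_user_prompt(
    "Summarize the RFC",
    "It is about HTTP.",
    ["no section references"],
    ["cite section numbers"],
)
```

The generation node can then call this helper and stay focused on the model call itself.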
Monitoring Reflection Quality
Instrument your reflection loops in production:
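from dataclasses import dataclass

@dataclass
class ReflectionMetrics:
    task_id: str
    total_iterations: int
    final_score: float
    iteration_scores: list[float]
    reached_max_iterations: bool

# After graph execution. `task_id`, `collected_scores`, and `metrics_store`
# come from your own application code. Depending on the LangGraph version,
# the graph may return a plain dict rather than a ReflectionState; if so,
# rebuild it first with ReflectionState(**result).
metrics = ReflectionMetrics(
    task_id=task_id,
    total_iterations=final_state.iteration,
    final_score=final_state.reflection_score,
    iteration_scores=collected_scores,
    reached_max_iterations=final_state.iteration >= final_state.max_iterations,
)
await metrics_store.record(metrics)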
Track the percentage of tasks that hit max_iterations without reaching the quality threshold. If it exceeds 10%, either lower the threshold or improve the generation prompt.
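The check described above can be computed from recorded runs. A minimal sketch, assuming each run is summarized by its final score and whether it hit the cap; `RunSummary` and `forced_exit_rate` are hypothetical names, the sample scores are made up, and the 0.75 threshold matches the routing function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSummary:
    """Hypothetical per-task record with the two fields the check needs."""
    final_score: float
    reached_max_iterations: bool


def forced_exit_rate(runs: List[RunSummary], threshold: float = 0.75) -> float:
    """Fraction of runs that hit the cap while still below the threshold."""
    if not runs:
        return 0.0
    forced = sum(
        1 for run in runs
        if run.reached_max_iterations and run.final_score < threshold
    )
    return forced / len(runs)


# Illustrative data only: one of the four runs hit the cap with a low score.
runs = [
    RunSummary(final_score=0.82, reached_max_iterations=False),
    RunSummary(final_score=0.61, reached_max_iterations=True),
    RunSummary(final_score=0.90, reached_max_iterations=True),
    RunSummary(final_score=0.78, reached_max_iterations=False),
]
rate = forced_exit_rate(runs)
needs_tuning = rate > 0.10
```

Note that a run which passes the threshold on its last allowed iteration reaches the cap but is not a forced exit, so the two conditions are checked together.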
Key Takeaways
- Structured reflection output (Pydantic + OpenAI Structured Outputs) is required — free-text critique can't drive reliable routing
- Max iteration cap is not optional — agents can loop forever without it
- Pass specific issues and suggestions from reflection into the next generation prompt; without them the model has no signal about what to fix, and the loop is unlikely to converge
- Monitor max_iterations hits in production — they indicate your quality threshold or generation quality needs tuning
- LangGraph's conditional edges make reflection loops explicit and debuggable; prefer it over implicit looping in agent frameworks