Designing Reflection Loops in LangGraph

Reflection loops are a simple, effective pattern in LLM agent architecture. The idea: don't trust the first answer. Have the agent evaluate its own output and iterate if the quality is insufficient.

LangGraph makes this pattern explicit and debuggable through its graph-based state model. Here's how to implement it correctly.

Summary

A reflection loop in LangGraph consists of three components: a generation node that produces initial output, a reflection node that evaluates that output using a structured rubric, and a conditional edge that routes back to generation (with feedback) or forward to finalization. The key is structured reflection output — not free-text critique — so the feedback can reliably drive the conditional routing.

The Core Pattern

Generate
    │
    ▼
Reflect ──→ [quality < threshold] ──→ Generate (with feedback)
    │
    │ [quality >= threshold]
    ▼
Finalize
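
The diagram reduces to a bounded loop. The sketch below expresses the same control flow in plain Python, with no LangGraph and no model calls, so it runs as-is. The `run_reflection_loop` name, the `generate` and `reflect` callables, the stub implementations, and the 0.75 threshold are illustrative assumptions standing in for the nodes defined later.

```python
from typing import Callable, Optional

def run_reflection_loop(
    task: str,
    generate: Callable[[str, Optional[str], list[str]], str],
    reflect: Callable[[str, str], tuple[float, list[str]]],
    threshold: float = 0.75,
    max_iterations: int = 3,
) -> tuple[Optional[str], float, int]:
    """Generate, score, and regenerate until the score clears the threshold
    or the iteration cap is hit. Returns (draft, score, iterations used)."""
    draft: Optional[str] = None
    issues: list[str] = []
    score = 0.0
    iteration = 0
    while iteration < max_iterations:
        draft = generate(task, draft, issues)  # Generate (with feedback after round 1)
        iteration += 1
        score, issues = reflect(task, draft)   # Reflect
        if score >= threshold:                 # quality >= threshold: Finalize
            break
    return draft, score, iteration


# Stub callables so the sketch runs without an LLM: each revision appends text,
# and the stub evaluator scores drafts purely by length.
def fake_generate(task, previous, issues):
    return task if previous is None else previous + " (revised)"

def fake_reflect(task, draft):
    score = min(1.0, len(draft) / 30)
    return score, ([] if score >= 0.75 else ["too short"])

final_draft, final_score, rounds = run_reflection_loop(
    "Explain reflection", fake_generate, fake_reflect
)
```

The graph version adds what this loop lacks: named nodes, inspectable state between steps, and routing that can be tested in isolation.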

State Design

State is the contract between nodes. Design it explicitly:

from pydantic import BaseModel
from typing import Optional

class ReflectionState(BaseModel):
    # The task
    task: str
    
    # Generation
    draft: Optional[str] = None
    
    # Reflection output
    reflection_score: float = 0.0
    reflection_issues: list[str] = []
    reflection_suggestions: list[str] = []
    
    # Loop control
    iteration: int = 0
    max_iterations: int = 3
    
    # Final output
    final_output: Optional[str] = None
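
One way to read the "contract" idea: each node reads some fields, writes others, and never mutates the state it was handed. The sketch below uses a frozen stdlib dataclass as a stand-in so it runs without Pydantic; `dataclasses.replace` plays the role that `model_copy(update=...)` plays in this article's examples. The `LoopState` name and the sample values are hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

@dataclass(frozen=True)
class LoopState:
    task: str
    draft: Optional[str] = None
    reflection_score: float = 0.0
    # default_factory gives every instance its own list
    reflection_issues: list[str] = field(default_factory=list)
    iteration: int = 0
    max_iterations: int = 3

initial = LoopState(task="Summarize the report")

# Each step returns a new state with only the fields it owns changed.
after_generate = replace(initial, draft="First draft", iteration=initial.iteration + 1)
after_reflect = replace(
    after_generate,
    reflection_score=0.6,
    reflection_issues=["missing conclusion"],
)
```

Keeping earlier states intact makes it easy to compare the state before and after a node when debugging a loop that fails to converge.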

Generation Node

The generation node must incorporate reflection feedback from previous iterations:

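from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompt; substitute your own
GENERATION_SYSTEM_PROMPT = "You are a careful writer. Complete the task precisely."

async def generate_node(state: ReflectionState) -> ReflectionState: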
    """
    Generate or regenerate output, incorporating reflection feedback
    from previous iterations if available.
    """
    # Build messages: include feedback from previous reflection if any
    messages = [{"role": "system", "content": GENERATION_SYSTEM_PROMPT}]
    
    user_content = f"Task: {state.task}"
    
    if state.draft and state.reflection_issues:
        # Include the previous draft and specific feedback
        user_content += f"""

Previous attempt:
{state.draft}

Issues identified:
{chr(10).join(f"- {issue}" for issue in state.reflection_issues)}

Suggestions:
{chr(10).join(f"- {s}" for s in state.reflection_suggestions)}

Please revise to address these issues.
"""
    
    messages.append({"role": "user", "content": user_content})
    
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    
    return state.model_copy(update={
        "draft": response.choices[0].message.content,
        "iteration": state.iteration + 1,
    })
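
Pulling the prompt assembly out into a pure function makes the feedback injection testable without calling the model. The following is a minimal sketch under that assumption: `build_user_content` is a hypothetical helper, not part of LangGraph or the OpenAI SDK, and it mirrors the layout used in the node above.

```python
from typing import Optional, Sequence

def build_user_content(
    task: str,
    draft: Optional[str] = None,
    issues: Sequence[str] = (),
    suggestions: Sequence[str] = (),
) -> str:
    """Return the user message: the bare task on the first pass, or the task
    plus the previous draft and reviewer feedback on a revision pass."""
    content = f"Task: {task}"
    if not (draft and issues):
        return content  # first iteration: nothing to revise yet

    issue_lines = "\n".join(f"- {issue}" for issue in issues)
    sections = [
        content,
        f"Previous attempt:\n{draft}",
        f"Issues identified:\n{issue_lines}",
    ]
    if suggestions:  # skip the heading when there is nothing to list
        suggestion_lines = "\n".join(f"- {s}" for s in suggestions)
        sections.append(f"Suggestions:\n{suggestion_lines}")
    sections.append("Please revise to address these issues.")
    return "\n\n".join(sections)

first_pass = build_user_content("Explain CRDTs")
revision = build_user_content(
    "Explain CRDTs",
    draft="CRDTs are data structures.",
    issues=["no example given"],
    suggestions=["add a counter example"],
)
```

The node would then call `build_user_content(state.task, state.draft, state.reflection_issues, state.reflection_suggestions)` and pass the result as the user message.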

Reflection Node

The reflection node is where most implementations go wrong. Free-text reflection fails because you can't reliably parse it for routing decisions. Use structured output:

from pydantic import BaseModel, Field

class ReflectionResult(BaseModel):
    """Structured output for reflection evaluation."""
    score: float = Field(ge=0.0, le=1.0, description="Overall quality 0-1")
    is_complete: bool = Field(description="Does it fully address the task?")
    is_accurate: bool = Field(description="Are all claims verifiable and accurate?")
    issues: list[str] = Field(description="Specific problems found")
    suggestions: list[str] = Field(description="Specific improvements needed")

async def reflect_node(state: ReflectionState) -> ReflectionState:
    """
    Evaluate the current draft against quality criteria.
    Returns structured feedback for routing and next-iteration improvement.
    """
    response = await openai_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a quality evaluator. Assess the draft against the task.
                Be specific about issues — vague feedback is useless for improvement."""
            },
            {
                "role": "user",
                "content": f"""
Task: {state.task}

Draft to evaluate:
{state.draft}

Evaluate on: completeness, accuracy, specificity, and clarity.
"""
            }
        ],
        response_format=ReflectionResult,
    )
    
    result = response.choices[0].message.parsed
    
    return state.model_copy(update={
        "reflection_score": result.score,
        "reflection_issues": result.issues,
        "reflection_suggestions": result.suggestions,
    })
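
The node above assumes the parsed result is always present. A defensive variant might fail closed when it is missing, for example if the model refuses or the response cannot be parsed. The sketch below is one possible approach, using a stdlib dataclass (`Review`) as a stand-in for `ReflectionResult` so that it runs without Pydantic; `reflection_update` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Review:
    """Stand-in for the parsed ReflectionResult (stdlib only)."""
    score: float
    issues: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)

def reflection_update(parsed: Optional[Review]) -> dict:
    """Turn a parsed review into state fields, failing closed when absent."""
    if parsed is None:
        # No usable evaluation: a score of 0.0 forces another pass (or a
        # logged forced exit at the iteration cap) instead of silent approval.
        # Issues stay empty so no made-up feedback reaches the next prompt.
        return {
            "reflection_score": 0.0,
            "reflection_issues": [],
            "reflection_suggestions": [],
        }
    return {
        # Clamp defensively in case the score falls outside the 0-1 range.
        "reflection_score": max(0.0, min(1.0, parsed.score)),
        "reflection_issues": list(parsed.issues),
        "reflection_suggestions": list(parsed.suggestions),
    }

ok = reflection_update(Review(score=0.82, suggestions=["tighten intro"]))
missing = reflection_update(None)
```

Treating a missing evaluation as a failing score is a design choice: it costs an extra iteration but never lets an unevaluated draft through as approved.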

Conditional Edge

The routing function determines whether to loop or continue:

import logging

logger = logging.getLogger(__name__)

QUALITY_THRESHOLD = 0.75  # minimum reflection score accepted without another pass

def should_continue(state: ReflectionState) -> str:
    """
    Route based on reflection score and iteration count.
    Never loop infinitely: respect max_iterations.
    """
    if state.iteration >= state.max_iterations:
        # Force completion even if quality is low, and log it for monitoring
        if state.reflection_score < QUALITY_THRESHOLD:
            logger.warning(
                "Max iterations reached with low quality score: %.2f",
                state.reflection_score,
            )
        return "finalize"
    
    if state.reflection_score >= QUALITY_THRESHOLD:
        return "finalize"
    
    return "generate"  # Loop back with feedback
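
Because the routing function is pure, its boundary cases are cheap to check. The sketch below restates the same decision on plain values (`route` is a hypothetical name, and 0.75 and 3 are the threshold and cap used in this article) and exercises the edges of both conditions.

```python
def route(score: float, iteration: int, max_iterations: int = 3,
          threshold: float = 0.75) -> str:
    """Same decision as should_continue, on plain values."""
    if iteration >= max_iterations:
        return "finalize"   # cap reached: always exit
    if score >= threshold:
        return "finalize"   # good enough
    return "generate"       # loop back with feedback

cases = [
    # (score, iteration, expected)
    (0.74, 1, "generate"),   # just under the threshold
    (0.75, 1, "finalize"),   # threshold is inclusive
    (0.10, 3, "finalize"),   # low score, but the cap wins
    (0.10, 2, "generate"),   # one pass still available
]
results = [route(score, iteration) for score, iteration, _ in cases]
```

Comparing `results` against the expected column in a unit test catches off-by-one errors in the cap, which are easy to introduce because `iteration` is incremented in the generation node rather than in the router.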

Graph Assembly

from langgraph.graph import StateGraph, END

def build_reflection_graph():
    graph = StateGraph(ReflectionState)
    
    graph.add_node("generate", generate_node)
    graph.add_node("reflect", reflect_node)
    graph.add_node("finalize", finalize_node)
    
    graph.set_entry_point("generate")
    graph.add_edge("generate", "reflect")
    
    # Conditional routing from reflect
    graph.add_conditional_edges(
        "reflect",
        should_continue,
        {
            "generate": "generate",  # Loop
            "finalize": "finalize",  # Continue
        }
    )
    
    graph.add_edge("finalize", END)
    
    return graph.compile()
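
The graph references `finalize_node`, which is not defined in this article. A minimal sketch of what it might do, assuming finalization only promotes the last draft to `final_output`, is shown below. It returns a dict containing just the field it changes, a partial update that LangGraph merges into the state; the `SimpleNamespace` object is a stand-in for a `ReflectionState` instance so the example runs without Pydantic.

```python
from types import SimpleNamespace

def finalize_node(state) -> dict:
    """Promote the last draft to the final output.

    Returns only the field it changes (a partial state update).
    """
    draft = state.draft or ""
    return {"final_output": draft.strip()}

# Stand-in for a ReflectionState instance.
demo_state = SimpleNamespace(draft="  Final answer.  ", reflection_score=0.8)
update = finalize_node(demo_state)
```

A real finalization step could also apply formatting or attach the last reflection score; that depends on what downstream consumers expect.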

Anti-Patterns to Avoid

1. Free-text reflection output

# Wrong: can't parse this reliably
reflection_text = (
    "The response is pretty good but could be more specific "
    "about the technical details and maybe add more examples."
)

# Right: structured output with numeric score and lists
result = ReflectionResult(score=0.62, issues=["lacks technical specificity"], ...)

2. Unbounded loops

# Wrong: no iteration cap
def should_continue(state):
    if state.reflection_score < 0.8:
        return "generate"  # Can run forever
    return "finalize"

# Right: explicit max_iterations cap (log forced exits, as in the routing function above)
def should_continue(state):
    if state.iteration >= state.max_iterations:
        return "finalize"  # Always escape
    if state.reflection_score >= 0.75:
        return "finalize"
    return "generate"

3. Reflection without feedback injection

# Wrong: reflect, then regenerate without telling the model what was wrong
# The model has no idea what to fix

# Right: pass reflection.issues and reflection.suggestions into
# the next generation prompt explicitly

4. Same prompt for first and subsequent iterations

The generation node should detect whether it's the first iteration or a revision:

is_revision = state.iteration > 0 and bool(state.reflection_issues)
if is_revision:
    ...  # Include explicit "what to fix" context
else:
    ...  # Clean initial generation

Monitoring Reflection Quality

Instrument your reflection loops in production:

from dataclasses import dataclass

@dataclass
class ReflectionMetrics:
    task_id: str
    total_iterations: int
    final_score: float
    iteration_scores: list[float]
    reached_max_iterations: bool

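# After graph execution. task_id, collected_scores, and metrics_store come from
# your application code. If the graph returns a plain dict rather than a
# ReflectionState, rebuild it first with ReflectionState.model_validate(result).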
metrics = ReflectionMetrics(
    task_id=task_id,
    total_iterations=final_state.iteration,
    final_score=final_state.reflection_score,
    iteration_scores=collected_scores,
    reached_max_iterations=final_state.iteration >= final_state.max_iterations,
)
await metrics_store.record(metrics)

Track the percentage of tasks that hit max_iterations without reaching the quality threshold. If it exceeds 10%, either lower the threshold or improve the generation prompt.
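
The rate described above can be computed from the recorded metrics. The sketch below assumes each run keeps at least its final score and whether it hit the cap; `RunSummary` and `forced_exit_rate` are hypothetical names, the sample runs are made up for illustration, and 0.75 is the threshold used in the routing function.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    """Per-task fields needed for the rate (a subset of ReflectionMetrics)."""
    final_score: float
    reached_max_iterations: bool

def forced_exit_rate(runs: list[RunSummary], threshold: float = 0.75) -> float:
    """Fraction of runs that hit the cap while still below the threshold."""
    if not runs:
        return 0.0
    forced = sum(
        1 for r in runs
        if r.reached_max_iterations and r.final_score < threshold
    )
    return forced / len(runs)

runs = [
    RunSummary(final_score=0.90, reached_max_iterations=False),
    RunSummary(final_score=0.80, reached_max_iterations=True),   # passed on the last try
    RunSummary(final_score=0.55, reached_max_iterations=True),   # forced exit
    RunSummary(final_score=0.78, reached_max_iterations=False),
]
rate = forced_exit_rate(runs)
needs_tuning = rate > 0.10
```

Checking the score as well as the cap matters: a run that passes on its final allowed iteration reached max_iterations but is not a quality failure.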

Key Takeaways

Structured reflection output (Pydantic + OpenAI Structured Outputs) is required — free-text critique can't drive reliable routing
Max iteration cap is not optional — agents can loop forever without it
Pass specific issues and suggestions from reflection into the next generation prompt — without this, the loop doesn't converge
Monitor max_iterations hits in production — they indicate your quality threshold or generation quality needs tuning
LangGraph's conditional edges make reflection loops explicit and debuggable; prefer them over implicit looping in agent frameworks