Overview

An e-commerce marketplace needed its search to understand what users mean, not just what they type. A query such as "something ergonomic for long work sessions" should surface ergonomic chairs and accessories, not merely products whose text happens to contain the word "ergonomic."

This system replaces pure keyword matching with a two-stage retrieve-and-rerank pipeline: fast hybrid retrieval (BM25 plus vector search over Google Embedding API embeddings), followed by selective Gemini LLM reranking for ambiguous queries. The backend is Django end-to-end, connected to a Next.js frontend at MrSolvo.com.

Problem

The existing search was BM25 keyword matching on product titles and descriptions. Three failure modes made it unreliable:

Vocabulary mismatch — users describe products in natural language; catalog uses vendor terminology
Zero results on near-misses — a slightly imprecise query returns nothing instead of the closest match
No intent understanding — contextual queries like "gift for a 5-year-old who loves dinosaurs" had no path to success

Search abandonment was measurable and hurting conversion.

Constraints

Latency budget: 220ms total end-to-end including network
Cost: LLM reranking can't fire on every query — only when keyword confidence is low
Infrastructure: Existing Django + PostgreSQL + OpenSearch setup; no new managed services
Scale: Must handle spikes during sale events without degradation

Architecture

User Query (natural language)
    │
    ▼
Django Search View
    │  - Normalize and classify query intent
    │  - Check Redis cache (hash-keyed per query)
    │
    ▼
Dual Retrieval
    ├── OpenSearch BM25 (keyword, fast baseline)
    └── OpenSearch k-NN (vector similarity via Google Embedding API)
    │
    ▼
Reciprocal Rank Fusion
    │  - Merge and deduplicate top-20 candidates
    │
    ▼
Selective Reranking (Gemini 1.5 Flash)
    │  - Only triggered when BM25 confidence < threshold
    │  - Scores candidates against original query intent
    │  - Returns ranked top 10
    │
    ▼
Final Results → Next.js Frontend

Technology Decisions

Decision	Choice	Why
Embedding model	Google Embedding API	Already in the GCP ecosystem; good multilingual support; cost-effective at scale
Vector store	OpenSearch k-NN	Already in stack — avoids new infrastructure, native BM25 hybrid out of the box
Reranker	Gemini 1.5 Flash	Fast, cheap per-call, good instruction-following for relevance scoring
Rerank strategy	Selective (threshold-gated)	LLM only fires on ambiguous queries — saves ~60% of rerank cost
Cache	Redis	Query hash → result cache; hits whenever repeated queries normalize to the same string
Backend	Django	Existing stack; easy integration with PostgreSQL catalog and auth

Implementation

Embedding Pipeline

Product embeddings are generated offline during catalog indexing — not at query time. This offloads the expensive work to batch jobs:

def embed_product(product: Product) -> list[float]:
    """
    Generate embedding for a product using concatenated
    title + category + trimmed description.
    Stored in OpenSearch k-NN index during catalog sync.
    """
    text = f"{product.title}. {product.category}. {product.description[:200]}"
    response = google_embedding_client.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type="RETRIEVAL_DOCUMENT",
    )
    return response["embedding"]

Query Embedding at Search Time

def embed_query(query: str) -> list[float]:
    response = google_embedding_client.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="RETRIEVAL_QUERY",
    )
    return response["embedding"]

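The RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT task types matter: the Google Embedding API optimizes the two sides of an asymmetric search pair differently, so short queries and longer product texts should each be embedded with their own task type, using the same model on both sides.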
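
To make the vector half of retrieval concrete, here is a minimal sketch of how a k-NN search body for OpenSearch could be built from a query embedding. The helper name `build_knn_query` and the field name `embedding` are assumptions for illustration; the real index mapping is not shown in this write-up.

```python
from typing import Any


def build_knn_query(vector: list[float], k: int = 20,
                    field: str = "embedding") -> dict[str, Any]:
    """Build an OpenSearch k-NN search body for a query embedding.

    `field` is a hypothetical name for the knn_vector field in the index.
    """
    if not vector:
        raise ValueError("query vector must not be empty")
    return {
        "size": k,
        # Standard OpenSearch k-NN query clause: nearest k vectors on `field`.
        "query": {"knn": {field: {"vector": vector, "k": k}}},
        # Skip returning the stored vectors; they are large and unused here.
        "_source": {"excludes": [field]},
    }
```

The resulting dict would be passed as the request body of a search call against the product index, with `vector` coming from `embed_query`.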

Selective Reranking

The core cost-saving insight: not every query needs LLM reranking. High-confidence keyword matches (exact brand, product code, specific model) are served directly. Only low-confidence or ambiguous semantic queries trigger Gemini:

import hashlib
import json

def search(query: str) -> list[Product]:
    # Normalize before hashing so casing and whitespace variants share a key
    normalized = " ".join(query.lower().split())
    cache_key = hashlib.md5(normalized.encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Stage 1: Dual retrieval
    bm25_results = opensearch.keyword_search(query, size=20)
    vector_results = opensearch.vector_search(embed_query(query), size=20)

    # Stage 2: Merge
    candidates = reciprocal_rank_fusion(bm25_results, vector_results)

    # Stage 3: Selective reranking, gated on the raw BM25 hits.
    # The top fused candidate may come from vector search only,
    # so it cannot be used to read BM25 confidence.
    if needs_reranking(bm25_results):
        results = gemini_rerank(query, candidates[:20])
    else:
        results = candidates[:10]

    # Results must be JSON-serializable at this point (TTL: 300 seconds)
    redis_client.setex(cache_key, 300, json.dumps(results))
    return results

def needs_reranking(bm25_results: list) -> bool:
    """Rerank only when the top BM25 hit is low-confidence or missing."""
    top_score = bm25_results[0].bm25_score if bm25_results else 0
    return top_score < CONFIDENCE_THRESHOLD
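
The `reciprocal_rank_fusion` helper called above is not shown. A minimal sketch of the standard technique follows, assuming each ranking is reduced to a list of product IDs in rank order; the constant `k = 60` is the conventional default from the RRF literature, not a value confirmed for this system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    # Using a dict keyed by ID also deduplicates items found by both retrievers.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ids = ["a", "b", "c"]
vector_ids = ["b", "d", "a"]
print(reciprocal_rank_fusion(bm25_ids, vector_ids))  # ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and vector similarities live on different scales.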

Gemini Reranking

The reranking prompt is structured for consistent, parseable JSON output:

RERANK_PROMPT = """
You are evaluating product search relevance.

User query: {query}

Rate each product's relevance from 0-10 and return JSON:
{{"rankings": [{{"product_id": "...", "score": 8, "reason": "..."}}]}}

Products:
{products}
"""

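LLM output is not guaranteed to be valid JSON, so the response needs defensive handling. The following is a sketch under assumptions, not the production parser: `apply_rankings` is a hypothetical helper that takes the raw model text and the fused candidate IDs, and falls back to the fused order if the response cannot be parsed.

```python
import json


def apply_rankings(raw: str, candidate_ids: list[str], top_n: int = 10) -> list[str]:
    """Order candidates by model score; keep the input order on bad output."""
    try:
        rankings = json.loads(raw)["rankings"]
        scores = {str(r["product_id"]): float(r["score"]) for r in rankings}
    except (ValueError, KeyError, TypeError):
        # Malformed or unexpected response: serve the fused order unchanged.
        return candidate_ids[:top_n]
    # IDs the model invented are ignored; candidates it skipped sort last.
    # sorted() is stable, so equal scores keep their fused order.
    ordered = sorted(candidate_ids, key=lambda pid: -scores.get(pid, -1.0))
    return ordered[:top_n]
```

Falling back to the fused order means a bad model response degrades relevance slightly rather than failing the search request.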
What Didn't Work

Always-on reranking: The first version sent every query to Gemini. Latency spiked and cost ran to 4× budget. Gating the reranker behind the BM25 confidence threshold was the fix.

Nightly full re-embedding: Re-embedding the entire catalog every night caused a 2-hour degraded window. Switched to incremental embedding — only products with title/description changes get re-embedded.

No query normalization: Typos, trailing spaces, and casing differences were all cache misses. Added pre-processing normalization to improve cache hit rate.
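
A normalization step of this kind might look like the sketch below. The function names and the `search:v1:` key prefix are assumptions for illustration. The sketch covers Unicode form, whitespace, and casing only; typo correction would need a separate spelling step that is not shown here.

```python
import hashlib
import re
import unicodedata


def normalize_query(query: str) -> str:
    """Canonicalize a query so trivially different inputs compare equal."""
    text = unicodedata.normalize("NFKC", query)   # unify width/compat forms
    text = re.sub(r"\s+", " ", text).strip()      # collapse and trim whitespace
    return text.casefold()                        # aggressive lowercasing


def cache_key(query: str, prefix: str = "search:v1:") -> str:
    """Derive a Redis key from the normalized query (prefix is hypothetical)."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return prefix + digest


assert cache_key("Desk Lamp") == cache_key("  desk   lamp\n")
```

A versioned prefix makes it possible to invalidate all cached results at once after a ranking change, simply by bumping the version.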

Results

Query latency held well under the 220ms budget even with selective LLM reranking
Search abandonment dropped measurably after deployment
Zero-result rate dropped significantly for long-tail and natural language queries
~60% of queries skip LLM reranking entirely — served directly from BM25 + vector merge

Key Takeaways

Using Google Embedding API's task_type distinction (RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT) improves asymmetric search quality
Selective reranking — not always-on LLM — is what makes the economics work in production
Offline catalog embedding + Redis caching keeps query-time latency predictable
Django's ORM + OpenSearch integration kept the stack coherent without adding new services