Overview
An e-commerce marketplace needed its search to understand what users mean, not just what they type. A user searching "something ergonomic for long work sessions" should surface ergonomic chairs and accessories, not merely products whose text happens to contain the word "ergonomic."
This system replaces pure keyword matching with a two-stage retrieve-and-rerank pipeline: fast semantic retrieval via Google Embedding API, followed by selective Gemini LLM reranking for ambiguous queries. The backend is Django end-to-end, connected to a Next.js frontend at MrSolvo.com.
Problem
The existing search was BM25 keyword matching on product titles and descriptions. Three failure modes made it unreliable:
- Vocabulary mismatch — users describe products in natural language; catalog uses vendor terminology
- Zero results on near-misses — a slightly imprecise query returns nothing instead of the closest match
- No intent understanding — contextual queries like "gift for a 5-year-old who loves dinosaurs" had no path to success
Search abandonment was measurable and hurting conversion.
Constraints
- Latency budget: 220ms total end-to-end including network
- Cost: LLM reranking can't fire on every query — only when keyword confidence is low
- Infrastructure: Existing Django + PostgreSQL + OpenSearch setup; no new managed services
- Scale: Must handle spikes during sale events without degradation
Architecture
User Query (natural language)
│
▼
Django Search View
│ - Normalize and classify query intent
│ - Check Redis cache (hash-keyed per query)
│
▼
Dual Retrieval
├── OpenSearch BM25 (keyword, fast baseline)
└── OpenSearch k-NN (vector similarity via Google Embedding API)
│
▼
Reciprocal Rank Fusion
│ - Merge and deduplicate top-20 candidates
│
▼
Selective Reranking (Gemini 1.5 Flash)
│ - Only triggered when BM25 confidence < threshold
│ - Scores candidates against original query intent
│ - Returns ranked top 10
│
▼
Final Results → Next.js Frontend
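The Reciprocal Rank Fusion step in the diagram can be sketched as follows. This is a minimal illustration, not the production code: it assumes each retriever returns an ordered list of product IDs, and it uses the conventional smoothing constant k = 60. The signature, the `top_n` parameter, and the sample IDs are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60, top_n: int = 20) -> list[str]:
    """Fuse several ranked lists of product IDs into one deduplicated ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, product_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank). An ID returned by both
            # retrievers accumulates score under one key, which deduplicates it.
            scores[product_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical IDs: "chair-1" and "mat-3" appear in both lists.
bm25_ids = ["chair-1", "desk-7", "mat-3"]
vector_ids = ["cushion-9", "chair-1", "mat-3"]
fused = reciprocal_rank_fusion(bm25_ids, vector_ids)
```

Because the fusion works on ranks rather than raw scores, BM25 scores and vector similarities never need to be put on a common scale. Products found by both retrievers rise above products found by only one.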
Technology Decisions
| Decision | Choice | Why |
|---|---|---|
| Embedding model | Google Embedding API | Already in the GCP ecosystem; good multilingual support; cost-effective at scale |
| Vector store | OpenSearch k-NN | Already in stack — avoids new infrastructure, native BM25 hybrid out of the box |
| Reranker | Gemini 1.5 Flash | Fast, cheap per-call, good instruction-following for relevance scoring |
| Rerank strategy | Selective (threshold-gated) | LLM only fires on ambiguous queries — saves ~60% of rerank cost |
| Cache | Redis | Query hash → result cache; meaningful hit rate on similar phrasings |
| Backend | Django | Existing stack; easy integration with PostgreSQL catalog and auth |
Implementation
Embedding Pipeline
Product embeddings are generated offline during catalog indexing — not at query time. This offloads the expensive work to batch jobs:
def embed_product(product: Product) -> list[float]:
    """
    Generate embedding for a product using concatenated
    title + category + trimmed description.
    Stored in OpenSearch k-NN index during catalog sync.
    """
    text = f"{product.title}. {product.category}. {product.description[:200]}"
    response = google_embedding_client.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type="RETRIEVAL_DOCUMENT",
    )
    return response["embedding"]
Query Embedding at Search Time
def embed_query(query: str) -> list[float]:
    response = google_embedding_client.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="RETRIEVAL_QUERY",
    )
    return response["embedding"]
The choice between the RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT task types matters: search here is asymmetric (short queries matched against longer product text), and the Google Embedding API optimizes the two sides of that pair differently.
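Both kinds of embedding meet in the OpenSearch k-NN index. The sketch below shows what the index mapping and the vector query body might look like. The field names (`title`, `description`, `embedding`) and the helper names are assumptions for illustration, and the dimension of 768 assumes the default output size of text-embedding-004.

```python
EMBEDDING_DIM = 768  # assumed: default output size of text-embedding-004


def knn_index_body(dim: int = EMBEDDING_DIM) -> dict:
    """Index settings and mapping with one k-NN vector field per product."""
    return {
        "settings": {"index": {"knn": True}},  # enable the k-NN plugin for this index
        "mappings": {
            "properties": {
                "title": {"type": "text"},        # searched by BM25
                "description": {"type": "text"},  # searched by BM25
                "embedding": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def knn_query_body(query_vector: list[float], k: int = 20) -> dict:
    """Approximate nearest-neighbour query against the embedding field."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }
```

With the opensearch-py client, these bodies would typically be passed to `client.indices.create(index=..., body=knn_index_body())` and `client.search(index=..., body=knn_query_body(vector))`. Keeping text fields and the vector field in the same index is what lets one cluster serve both the BM25 and the k-NN leg of retrieval.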
Selective Reranking
The core cost-saving insight: not every query needs LLM reranking. High-confidence keyword matches (exact brand, product code, specific model) are served directly. Only low-confidence or ambiguous semantic queries trigger Gemini:
import hashlib
import json


def search(query: str) -> list[Product]:
    # Check cache first; strip and lowercase so trivial variants share a key
    cache_key = hashlib.md5(query.strip().lower().encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Stage 1: Dual retrieval
    bm25_results = opensearch.keyword_search(query, size=20)
    vector_results = opensearch.vector_search(embed_query(query), size=20)

    # Stage 2: Merge
    candidates = reciprocal_rank_fusion(bm25_results, vector_results)

    # Stage 3: Selective reranking
    if needs_reranking(query, candidates):
        results = gemini_rerank(query, candidates[:20])
    else:
        results = candidates[:10]

    # Cache for 5 minutes; results must be JSON-serializable
    redis_client.setex(cache_key, 300, json.dumps(results))
    return results


def needs_reranking(query: str, candidates: list) -> bool:
    """Skip LLM reranking when the top BM25 result is high-confidence."""
    # After fusion the first candidate may come from vector search only,
    # so take the best BM25 score across all candidates.
    top_score = max((getattr(c, "bm25_score", 0) or 0 for c in candidates), default=0)
    return top_score < CONFIDENCE_THRESHOLD
Gemini Reranking
The reranking prompt is structured for consistent, parseable JSON output:
RERANK_PROMPT = """
You are evaluating product search relevance.
User query: {query}
Rate each product's relevance from 0-10 and return JSON:
{{"rankings": [{{"product_id": "...", "score": 8, "reason": "..."}}]}}
Products:
{products}
"""
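A prompt like this still needs a defensive parser on the way back, since the model can wrap the JSON in extra text or return something malformed. The following is a minimal sketch of one way to turn the response into a final ordering. The function name `parse_rankings`, its ID-based interface, and the fallback behaviour are assumptions, not the production code.

```python
import json


def parse_rankings(raw: str, candidate_ids: list[str], top_n: int = 10) -> list[str]:
    """Order candidate IDs by the model's scores; fall back to the input order."""
    text = raw.strip()
    # The model may surround the JSON with prose or a fence; keep only the object.
    start, end = text.find("{"), text.rfind("}")
    try:
        rankings = json.loads(text[start:end + 1])["rankings"]
        scores = {str(r["product_id"]): float(r["score"]) for r in rankings}
    except (ValueError, KeyError, TypeError):
        # Unparseable output: keep the fused order rather than fail the search.
        return candidate_ids[:top_n]
    # IDs the model invented are ignored; unscored candidates sort last.
    ordered = sorted(candidate_ids, key=lambda pid: scores.get(pid, -1.0), reverse=True)
    return ordered[:top_n]
```

Sorting the known candidates by looked-up score, rather than trusting the order of the returned list, means a hallucinated `product_id` can never reach the results page. Falling back to the fused order keeps search available when the rerank call misbehaves.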
What Didn't Work
All-or-nothing reranking: The first version sent every query to Gemini. Latency spiked and cost was 4× budget. The selective threshold was the fix.
Nightly full re-embedding: Re-embedding the entire catalog every night caused a 2-hour degraded window. Switched to incremental embedding — only products with title/description changes get re-embedded.
No query normalization: Typos, trailing spaces, and casing differences were all cache misses. Added pre-processing normalization to improve cache hit rate.
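The normalization fix can be sketched as below. This is an illustrative version under stated assumptions: the function names and the `search:` key prefix are hypothetical, and it covers only casing, whitespace, and Unicode form. Typo correction would need a separate step, such as a spell-checker or fuzzy matching, that is not shown here.

```python
import hashlib
import re
import unicodedata


def normalize_query(query: str) -> str:
    """Map trivially different spellings of a query to one canonical string."""
    text = unicodedata.normalize("NFKC", query)  # unify full-width and composed forms
    text = text.casefold()                       # case-insensitive comparison
    return re.sub(r"\s+", " ", text).strip()     # collapse and trim whitespace


def cache_key(query: str, prefix: str = "search:") -> str:
    """Hash the normalized query so equivalent inputs share one cache entry."""
    digest = hashlib.md5(normalize_query(query).encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"
```

With this in place, `"Ergonomic Chair "` and `"ergonomic   chair"` resolve to the same Redis key, so the second request is served from cache instead of repeating retrieval.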
Results
- Query latency held well under the 220ms budget even with selective LLM reranking
- Search abandonment dropped measurably after deployment
- Zero-result rate dropped significantly for long-tail and natural language queries
- ~60% of queries skip LLM reranking entirely — served directly from BM25 + vector merge
Key Takeaways
- Using Google Embedding API's task_type distinction (RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT) improves asymmetric search quality
- Selective reranking — not always-on LLM — is what makes the economics work in production
- Offline catalog embedding + Redis caching keeps query-time latency predictable
- Django's ORM + OpenSearch integration kept the stack coherent without adding new services