Hybrid Retrieval: Why the Stack Won

Learn how BM25, vectors, RRF, and rerankers work together in production RAG. See patterns, tradeoffs, and real examples inside.

Ilia Ilinskii
Rephrase · June 5, 2026

Prompt engineering · 9 min read


You can build a decent demo with vectors alone. Production is where that idea usually dies. Once real users start asking about clause numbers, model IDs, dates, and half-remembered phrases, sparse and dense retrieval stop being alternatives and start being teammates.

Key Takeaways

BM25 still matters because exact tokens are often the thing users care about most.
Vector retrieval fills in the semantic gaps that keyword search misses.
RRF is popular because it merges rankings safely without score calibration.
Rerankers are the quality gate that turns "pretty good recall" into usable precision.
In production, the winning pattern is usually hybrid retrieval first, reranking second.

Why did BM25 + vectors become the default?

Hybrid retrieval became the default because each retriever fails in a different way. BM25 is strong at exact matching, especially for names, IDs, error codes, and technical terms, while dense vectors handle paraphrases and conceptual similarity. Published research and production reports increasingly show that combining the two tends to beat either one alone [1][2].

The interesting part is not that hybrid works. It's that the failure modes are so complementary that ignoring one feels reckless once you've shipped a real system. A user doesn't care that your embedding model is elegant if it misses "SKU-7829."

How does BM25 cover what vectors miss?

BM25 covers the literal stuff. If the user asks for "clause 4.2.1," "bge-reranker-v2-m3," or a product code, BM25 usually nails it where dense retrieval can wobble. That's because BM25 weights each matching term by how rare it is across the corpus, so a distinctive token like a product code can dominate the score, and those exact overlaps are often the strongest signals in enterprise search [1].

That's why teams keep BM25 in the stack even after they adopt embeddings. Dense retrieval is great at "similar meaning," but exact terms are a different game. When the question is anchored to a token, keyword search is still the most reliable first pass.
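
To make the "rare terms" point concrete, here is a minimal BM25 sketch in plain Python. It is an illustration, not a production index: the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    # Naive whitespace tokenizer (assumption); a real analyzer may split tokens differently.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rarer terms get a larger IDF weight (smoothed so it is never negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up corpus for illustration.
docs = [
    "refund policy for item SKU-7829 and related orders",
    "refund policy for all standard orders",
    "shipping times for standard orders",
]
scores = bm25_scores("refund SKU-7829", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In this toy corpus the document containing the product code ranks first: "sku-7829" appears in only one document, so it carries more weight than "refund," which appears in two.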

Why do vectors still matter?

Vectors matter because users rarely phrase things the same way your documents do. They ask for "refund policy," but the doc says "return eligibility." They ask for "how to reduce token costs," but the doc says "prompt compression and cache reuse." Dense retrieval catches those semantic matches in a way BM25 cannot [2].

This is also why pure vector search feels magical in demos and frustrating in production. It's good at meaning, but meaning isn't enough when the user's question contains a specific anchor. The hybrid pattern solves that by letting vectors recover intent while BM25 keeps the system grounded.
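
Mechanically, dense retrieval ranks documents by how close their embedding is to the query embedding, typically with cosine similarity. The sketch below shows only that ranking step. The three-dimensional vectors are hand-written placeholders, not the output of any real embedding model, and `cosine` is an illustrative helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only.
doc_vectors = {
    "return eligibility": [0.9, 0.1, 0.0],
    "prompt compression and cache reuse": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of the query "refund policy".
query_vector = [0.8, 0.2, 0.1]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine(query_vector, doc_vectors[doc]),
    reverse=True,
)
print(ranked)
```

The query and the top document share no words at all. The match comes entirely from the vectors pointing in a similar direction, which is the behavior keyword search cannot reproduce.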

Why is RRF the glue everybody uses?

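RRF (Reciprocal Rank Fusion) became the glue because it's simple and robust. Instead of trying to compare incompatible raw scores from sparse and dense retrievers, it merges their ranked lists using rank positions alone: each document earns 1/(k + rank) from every list it appears in, where k is a small smoothing constant, and the summed values decide the final order. That avoids score-calibration headaches and tends to reward documents that both systems agree on [1][2].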

Here's the practical advantage: you do not have to make BM25 scores and embedding scores speak the same numerical language. RRF sidesteps the problem entirely. That's one reason it shows up so often in production architectures. It's boring in the best possible way.
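
A minimal sketch of that rank-based merge, assuming each retriever returns an ordered list of document IDs. The function name `rrf_merge`, the document IDs, and the alphabetical tie-break are illustrative; `k=60` is a commonly used default, not a requirement.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of the two first-stage retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_d", "doc_b", "doc_a"]

fused = rrf_merge([bm25_ranking, vector_ranking])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Here `doc_a` and `doc_b` rise to the top because both lists contain them, and at no point is a BM25 score compared with a similarity score.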

What does the reranker actually fix?

The reranker fixes precision. First-stage retrieval is about recall: get the right neighborhood. The reranker is the final editor: sort the neighborhood and move the best evidence to the top. Cross-encoder rerankers are expensive, but they see the query and document together, which lets them catch subtle relevance signals that bi-encoders miss [2].

That's the production logic in one sentence: retrieve broadly, then rerank narrowly. If you try to use the reranker as your first-stage retriever, latency gets ugly fast. If you skip reranking, your top results are often "close enough" in a way that still breaks answer quality.
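
The retrieve-then-rerank split can be sketched as a function that takes a small candidate list and a pairwise scorer. In a real system the scorer would be a cross-encoder model; here `overlap_score` is a deliberately crude, hypothetical stand-in so the example runs without any model, and the candidate passages are made up.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query tokens found in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

# Pretend these came out of the first-stage retrieval and fusion step.
candidates = [
    "chargebacks are covered in section 4.2.1 of the agreement",
    "section 4.2.1 lists general payment terms",
    "our refund policy covers returns within 30 days",
]
top = rerank("section 4.2.1 chargebacks", candidates, overlap_score, top_n=2)
for doc, score in top:
    print(round(score, 2), doc)
```

The cost of this stage grows with the number of candidates, since every pair is scored separately. That is the reason to run it on a short fused list rather than on the whole corpus.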

What does the production pipeline look like?

In most systems I'd call mature, the flow is basically: query rewrite, BM25 search, vector search, RRF merge, rerank, answer. Some stacks add routing or caching before retrieval, but the retrieval spine stays the same. Recent enterprise RAG work explicitly describes this layered approach as a response to precision, hallucination, and latency problems [1][3].

The nice thing is that this pipeline scales with complexity. Simple queries can skip heavy steps. Hard queries can use the full stack. That's why I think hybrid retrieval is less a single technique and more a production posture: use every signal you trust, but only as much as the query deserves.
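
One way to express that spine in code is a function that wires the stages together and lets the caller skip the expensive one. Everything here is a hypothetical sketch: `hybrid_search`, its parameters, and the stub retrievers and reranker are placeholders for whatever your stack provides.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]

def hybrid_search(
    query: str,
    rewrite: Callable[[str], str],
    retrievers: List[Retriever],
    fuse: Callable[[List[List[str]]], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    candidate_k: int = 50,
    final_k: int = 5,
    use_reranker: bool = True,
) -> List[str]:
    """Run rewrite -> retrieve -> fuse -> (optional) rerank and return document IDs."""
    rewritten = rewrite(query)
    # Each retriever (e.g. BM25, vector search) returns its own ranked list.
    rankings = [retriever(rewritten) for retriever in retrievers]
    candidates = fuse(rankings)[:candidate_k]
    if use_reranker:
        candidates = rerank(rewritten, candidates)
    return candidates[:final_k]

def rrf(rankings, k=60):
    """Small Reciprocal Rank Fusion helper used as the fuse step."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Stub stages (assumptions): fixed retriever outputs and a reverse-alphabetical
# "reranker" whose only purpose is to make the reranking step visible.
stages = dict(
    rewrite=str.strip,
    retrievers=[lambda q: ["d1", "d2"], lambda q: ["d2", "d3"]],
    fuse=rrf,
    rerank=lambda q, docs: sorted(docs, reverse=True),
    final_k=2,
)
with_rerank = hybrid_search("  section 4.2.1 chargebacks ", **stages)
without_rerank = hybrid_search("  refund policy ", use_reranker=False, **stages)
print(with_rerank, without_rerank)
```

Passing the stages in as callables keeps the spine fixed while each component can be swapped or skipped per query.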

What does a practical before/after look like?

Here's the pattern I keep seeing.

Approach	Strength	Weakness	Best use
BM25 only	Exact terms, IDs, names	Misses paraphrases	Docs, APIs, legal text
Vector only	Semantic similarity	Misses anchors and rare tokens	Concept search, FAQ
BM25 + Vector + RRF	Balanced candidate set	Still needs reranking	General production RAG
Hybrid + Reranker	Best precision at top	More latency and cost	Customer support, enterprise search

A simple query like "what's our refund policy?" might work fine with vector search alone. But "what does section 4.2.1 say about chargebacks?" usually needs BM25 to catch the anchor, vectors to widen recall, RRF to merge the lists, and a reranker to clean up the top of the stack.
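
A rough way to tell those two queries apart is to look for anchor-like tokens before choosing a route. The regular expression and the route names below are assumptions for illustration; a real router would need patterns tuned to its own corpus.

```python
import re

# Hypothetical heuristics: dotted section numbers (4.2.1) and hyphenated IDs (SKU-7829).
ANCHOR_PATTERN = re.compile(r"\b\d+(?:\.\d+)+\b|\b[A-Za-z]+-\d+\b")

def choose_route(query: str) -> str:
    """Return 'full_hybrid' when the query contains an exact-match anchor, else 'vector_only'."""
    return "full_hybrid" if ANCHOR_PATTERN.search(query) else "vector_only"

print(choose_route("what's our refund policy?"))
print(choose_route("what does section 4.2.1 say about chargebacks?"))
```

The first query has no anchor and takes the light route, while the second one triggers the full stack. A heuristic like this will miss some anchors, so falling back to the full pipeline when in doubt is the safer default.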

What's the real production lesson?

The real lesson is that retrieval is now a layered engineering problem, not a single model choice. The systems that win are the ones that combine lexical precision, semantic recall, rank fusion, and final reranking without pretending one component can do everything [1][2][3].

That's also why prompt quality still matters. A sloppy user query can poison retrieval before the model ever answers. I've seen teams recover a surprising amount of quality by rewriting queries before search, which is exactly the kind of step tools like Rephrase can automate in two seconds. If you want more practical breakdowns like this, the Rephrase blog has a growing set of prompt engineering posts.

The bottom line: BM25 did not get replaced. Vectors did not win outright. RRF did not become famous because it was fancy. The production default emerged because each layer covers a weakness of the one before it.

References

Documentation & Research

1. Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing, and CRAG - arXiv cs.CL
2. Cognis: Context-Aware Memory for Conversational AI Agents - arXiv cs.CL
3. Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods - SIGIR 2009

Community Examples

4. Your RAG system isn't failing because of the LLM. It's failing because of how you split your documents. - r/PromptEngineering

Frequently asked

What is hybrid retrieval in RAG?

Hybrid retrieval combines sparse keyword search like BM25 with dense vector search. BM25 catches exact terms, while vectors catch semantic matches, so the two cover each other's blind spots.

What does a reranker do after retrieval?

A reranker re-scores the top candidates with a cross-encoder or similar model. It is slower than first-stage retrieval, but it usually improves precision at the top of the list.

How do I improve my production RAG search?

Start with chunking, then add hybrid retrieval, then fuse with RRF, and finally rerank the top hits. Systems like [Rephrase](https://rephrase-it.com) can help you turn rough prompts into clearer retrieval queries.