Learn how BM25, vectors, RRF, and rerankers work together in production RAG. See patterns, tradeoffs, and real examples inside.
You can build a decent demo with vectors alone. Production is where that idea usually dies. Once real users start asking about clause numbers, model IDs, dates, and half-remembered phrases, sparse and dense retrieval stop being alternatives and start being teammates.
Hybrid retrieval became the default because each retriever fails in a different way. BM25 is strong at exact matching, especially for names, IDs, error codes, and technical terms, while dense vectors handle paraphrases and conceptual similarity. Both research and production reports show that combining the two tends to beat either one alone [1][2].
The interesting part is not that hybrid works. It's that the failure modes are so complementary that ignoring one feels reckless once you've shipped a real system. A user doesn't care that your embedding model is elegant if it misses "SKU-7829."
BM25 covers the literal stuff. If the user asks for "clause 4.2.1," "bge-reranker-v2-m3," or a product code, BM25 usually nails it where dense retrieval can wobble. That's because BM25 rewards rare terms and exact overlaps, which are often the strongest signals in enterprise search [1].
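To make the "rewards rare terms and exact overlaps" point concrete, here is a minimal sketch of Okapi-style BM25 scoring in plain Python. The function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the `k1`/`b` values are all assumptions for illustration, not the setup of any particular search engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact overlap, no contribution
            # Rare terms get a larger IDF, so exact hits on them dominate.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy corpus (made up for this example).
docs = [
    "refund policy for orders and returns",
    "clause 4.2.1 covers chargebacks for orders",
    "general policy for orders and shipping",
]
scores = bm25_scores("clause 4.2.1 orders", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The common word "orders" appears in every document and barely moves the score, while the rare token "4.2.1" does most of the work. That is the behavior that makes BM25 dependable for IDs and clause numbers.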
That's why teams keep BM25 in the stack even after they adopt embeddings. Dense retrieval is great at "similar meaning," but exact terms are a different game. When the question is anchored to a token, keyword search is still the most reliable first pass.
Vectors matter because users rarely phrase things the same way your documents do. They ask for "refund policy," but the doc says "return eligibility." They ask for "how to reduce token costs," but the doc says "prompt compression and cache reuse." Dense retrieval catches those semantic matches in a way BM25 cannot [2].
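The mechanics behind that semantic match are just vector similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings. The numbers are invented for illustration, and a real system would get them from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" (assumption): a real model would produce these.
doc_vectors = {
    "return eligibility": [0.9, 0.1, 0.0],
    "prompt compression and cache reuse": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for the query "refund policy"

ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine(query_vector, doc_vectors[doc]),
    reverse=True,
)
print(ranked[0])
```

The query shares no words with "return eligibility", yet it ranks first because the vectors point in nearly the same direction. A lexical scorer would give that pair a score of zero.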
This is also why pure vector search feels magical in demos and frustrating in production. It's good at meaning, but meaning isn't enough when the user's question contains a specific anchor. The hybrid pattern solves that by letting vectors recover intent while BM25 keeps the system grounded.
RRF became the glue because it's simple and robust. Instead of trying to compare incompatible raw scores from sparse and dense retrievers, it merges their ranked lists using rank positions. That avoids score-calibration headaches and tends to reward documents that both systems agree on [1][2].
Here's the practical advantage: you do not have to make BM25 scores and embedding scores speak the same numerical language. RRF sidesteps the problem entirely. That's one reason it shows up so often in production architectures. It's boring in the best possible way.
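For readers who want to see how little code that takes, here is a minimal RRF sketch. The function name `rrf_merge` and the document IDs are hypothetical. The constant `k=60` is a commonly used default, not something this article prescribes.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank position matters; raw retriever scores are never compared.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 lists from each retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = rrf_merge([bm25_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (`doc_b`, `doc_a`) rise above documents that only one retriever found, which is the "agreement" effect described above.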
The reranker fixes precision. First-stage retrieval is about recall: get the right neighborhood. The reranker is the final editor: sort the neighborhood and move the best evidence to the top. Cross-encoder rerankers are expensive, but they see the query and document together, which lets them catch subtle relevance signals that bi-encoders miss [2].
That's the production logic in one sentence: retrieve broadly, then rerank narrowly. If you try to use the reranker as your first-stage retriever, latency gets ugly fast. If you skip reranking, your top results are often "close enough" in a way that still breaks answer quality.
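A rough sketch of "retrieve broadly, then rerank narrowly" looks like this. The expensive scorer is passed in as a callable, so `score_pair` can be a cross-encoder in a real system. Here it is replaced by `toy_overlap_score`, a trivial word-overlap function invented for this example so the code runs without any model. The names `rerank`, `candidate_limit`, and `top_k` are assumptions too.

```python
def rerank(query, candidates, score_pair, candidate_limit=50, top_k=3):
    """Re-score a shortlist of first-stage candidates and keep the best few."""
    # Cap the shortlist so the expensive scorer only sees a bounded number of docs.
    shortlist = candidates[:candidate_limit]
    scored = [(score_pair(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_overlap_score(query, doc):
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

# Hypothetical output of the first-stage (hybrid) retriever.
candidates = [
    "shipping times for orders",
    "chargebacks are covered in section 4.2.1",
    "section 9 covers privacy",
]
top = rerank("section 4.2.1 chargebacks", candidates, toy_overlap_score, top_k=2)
print(top)
```

The latency argument is visible in the shape of the function: the scorer runs once per shortlisted candidate, so its cost scales with `candidate_limit`, not with the size of the corpus.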
In most systems I'd call mature, the flow is basically: query rewrite, BM25 search, vector search, RRF merge, rerank, answer. Some stacks add routing or caching before retrieval, but the retrieval spine stays the same. Recent enterprise RAG work explicitly describes this layered approach as a response to precision, hallucination, and latency problems [1][3].
The nice thing is that this pipeline scales with complexity. Simple queries can skip heavy steps. Hard queries can use the full stack. That's why I think hybrid retrieval is less a single technique and more a production posture: use every signal you trust, but only as much as the query deserves.
Here's the pattern I keep seeing.
| Approach | Strength | Weakness | Best use |
|---|---|---|---|
| BM25 only | Exact terms, IDs, names | Misses paraphrases | Docs, APIs, legal text |
| Vector only | Semantic similarity | Misses anchors and rare tokens | Concept search, FAQ |
| BM25 + Vector + RRF | Balanced candidate set | Still needs reranking | General production RAG |
| Hybrid + Reranker | Best precision at top | More latency and cost | Customer support, enterprise search |
A simple query like "what's our refund policy?" might work fine with vector search alone. But "what does section 4.2.1 say about chargebacks?" usually needs BM25 to catch the anchor, vectors to widen recall, RRF to merge the lists, and a reranker to clean up the top of the stack.
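One way to act on that difference is a small router that decides how much of the stack a query deserves. The sketch below is a heuristic under stated assumptions: `ANCHOR_PATTERN` and `plan_retrieval` are hypothetical names, and the regular expression only recognizes dotted section numbers and letter-dash-digit codes. A production router would need broader rules or a learned classifier.

```python
import re

# Heuristic anchor detector (assumption): dotted numbers and ID-like codes.
ANCHOR_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)+\b"    # dotted numbers such as 4.2.1
    r"|\b[A-Za-z]+-\d+\b"   # codes such as SKU-7829
)

def plan_retrieval(query):
    """Return the retrieval steps to run for this query."""
    if ANCHOR_PATTERN.search(query):
        # The query is anchored to a literal token, so use the full stack.
        return ["bm25", "vector", "rrf", "rerank"]
    # No obvious anchor: the cheap path may be enough.
    return ["vector"]

print(plan_retrieval("what's our refund policy?"))
print(plan_retrieval("what does section 4.2.1 say about chargebacks?"))
```

The first query takes the vector-only path and the second triggers all four steps, matching the two examples above.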
The real lesson is that retrieval is now a layered engineering problem, not a single model choice. The systems that win are the ones that combine lexical precision, semantic recall, rank fusion, and final reranking without pretending one component can do everything [1][2][3].
That's also why prompt quality still matters. A sloppy user query can poison retrieval before the model ever answers. I've seen teams recover a surprising amount of quality by rewriting queries before search, which is exactly the kind of step tools like Rephrase can automate in two seconds. If you want more practical breakdowns like this, the Rephrase blog has a growing set of prompt engineering posts.
The bottom line: BM25 did not get replaced. Vectors did not win outright. RRF did not become famous because it was fancy. The production default emerged because each layer covers a weakness of the one before it.
Documentation & Research
Community Examples
4. Your RAG system isn't failing because of the LLM. It's failing because of how you split your documents. - r/PromptEngineering
Hybrid retrieval combines sparse keyword search like BM25 with dense vector search. BM25 catches exact terms, while vectors catch semantic matches, so the two cover each other's blind spots.
A reranker re-scores the top candidates with a cross-encoder or similar model. It is slower than first-stage retrieval, but it usually improves precision at the top of the list.
Start with chunking, then add hybrid retrieval, then fuse with RRF, and finally rerank the top hits. Systems like [Rephrase](https://rephrase-it.com) can help you turn rough prompts into clearer retrieval queries.