Vector Orchestrated Tool Retrieval for Scalable Multi-Agent Systems
A large multi-agent system can know about thousands of MCP tools, but a model should not have to read every schema before deciding what to do. VOTR turns that problem into retrieval: find the right tools first, then hand only a compact candidate set to the agent.
A 309-server MCP deployment with 2,806 tools would require ~262,487 tokens to inject all schemas into a model's context — before any reasoning begins. Full injection degrades reasoning quality even when the window is large enough. VOTR inserts a retrieval stage before model invocation.
A route request carries four fields:

- `server_intent` — domain/capability hint for server-level scoring
- `tool_intent` — specific operation for tool-level scoring
- `session_id` — enables multi-turn memory (TTL 24h, lazy GC)
- `skip_session_filter` — bypass session tool suppression

VOTR is a three-tier service: Agent/Client → Router Core (FastAPI) → Registry + Execution Proxy. Embedding the query (OpenAI text-embedding-3-large, 3072-dim) dominates latency at ~200–280ms; all local computation runs in <20ms.
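A sketch of what a route request might look like as a payload. The field names come from the contract above; the endpoint URL and example values are illustrative assumptions, not VOTR's documented defaults:

```python
# Hypothetical route request payload. Field names match the contract
# above; the values and the URL in the comment are illustrative only.
route_request = {
    "server_intent": "GitHub repository operations",   # server-level domain hint
    "tool_intent": "list open pull requests",          # tool-level action
    "session_id": "sess-1234",                         # multi-turn memory (TTL 24h)
    "skip_session_filter": False,                      # keep session suppression on
}

# A client would POST this to the router, e.g.:
# requests.post("http://localhost:8000/route", json=route_request)
```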
The dual-field query contract is `server_intent` + `tool_intent` + optional `session_id`. `server_intent` navigates the domain of servers (e.g. "GitHub repository operations"); `tool_intent` describes the specific action (e.g. "list open pull requests"). The decomposition is intentional: VOTR applies hierarchical scoring, server-level first, then tool-level within candidate servers.

The compressed handoff format is `[server: Name] tool_fn(param: type, opt?: type) → Short description`. It retains the three pieces the model needs: server name (routing), function signature with types (arg construction), and description (intent verification). Measured with tiktoken cl100k_base, a route call at k≈2.9 averages ~76 schema tokens + ~154 routing metadata ≈ 230 tokens total.

| Aspect | MCP-Zero | VOTR |
|---|---|---|
| Storage | 326MB JSON | meta.json + .npy shards |
| Rebuild | Full manual | Hot registration |
| Embedding storage | Inline JSON | Binary arrays |
| Memory | Entire file in RAM | Memory-mapped |
| Top-k | Fixed | Adaptive (1/3/5) |
| Live MCP | None | stdio + SSE |
Dense embedding, BM25, and SPLADE-lite run in parallel and capture different signals. The key insight: BM25-only achieves 99.0% top-1 on single-tool queries (vs 93.8% dense-only) because MCP tool names are strongly lexical, but BM25 drops to 80.0% on paraphrastic multi-hop chains where dense semantic matching is essential.
| Stage | What it does | Key settings from paper/code | Why it matters |
|---|---|---|---|
| Hierarchical dense | Ranks servers first using embeddings from server summary + description, then ranks tools inside the selected servers. | Embedding model: text-embedding-3-large (3072 dims); hierarchical server-to-tool scoring. | Semantic recall under paraphrases; avoids flat brute-force over entire catalog. |
| BM25 sparse | Builds one lexical “document” per tool by concatenating server text + tool text. | BM25 over concatenated fields: server name/summary/description + tool name/description. | Strong exact-name/id precision (tool/server keywords). |
| SPLADE-lite | TF-IDF style sparse expansion with unigram + bigram features (SPLADE-lite approximation). | SPLADE-lite fusion weight w_splade = 0.35 (dense = 1.0, BM25 = 1.0). | Recovers compound/rare term matches without the full SPLADE model cost. |
| Weighted RRF fusion | Merges dense/BM25/SPLADE ranked lists with reciprocal-rank voting. | RRF(j) = sum over lists [ w_l / (60 + rank_l(j)) ]. | Robust to score-scale differences; rewards tools that rank well across multiple lenses. |
| Field-aware rerank | Adds structured overlap bonus on top candidates. | Sørensen-Dice overlap across structured fields (server name/summary, tool name/description, parameters). Applied near the head of the fused list (head = 24) with a small score window (0.003). | Breaks near ties RRF alone cannot resolve. |
| Intent disambiguation | Adjusts ranking for singular destructive vs plural/batch intent. | Bulk penalty = 0.82 for singular destructive; bulk boost = 1.08 for plural/batch. | Prevents wrong single-vs-bulk tool selection. |
| Confidence-gated handoff | Selects candidate count by uncertainty tier. | Adaptive handoff size k in {1, 3, 5}, chosen from a non-conformity score (conformal calibration) or gap thresholds. | Keeps prompt small when clear, widens safely when ambiguous. |
| Retrieval guardrails | Applies abstention/null-route and overlap-aware expansion. | If query/tool support is too low, null-route returns an empty candidate list (abstention). If overlap is detected, it expands an overlap cluster and may downgrade confidence; if overlap handling is triggered too aggressively, it can widen prompts toward similar/colliding tools and hurt precision. | Avoids high-confidence wrong tools when support is weak or capabilities overlap. |
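The hierarchical dense stage in the table above can be sketched in plain Python. The two-level structure (rank servers, then rank tools only inside the top servers) is from the paper; the cosine helper, toy vectors, and server/tool names below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hierarchical_rank(query_vec, servers, top_servers=2):
    # Stage 1: rank servers by similarity of the query to the server embedding.
    ranked = sorted(servers, key=lambda s: cosine(query_vec, s["vec"]), reverse=True)
    # Stage 2: score tools only inside the top-ranked servers.
    tools = [(s["name"], t["name"], cosine(query_vec, t["vec"]))
             for s in ranked[:top_servers] for t in s["tools"]]
    return sorted(tools, key=lambda x: x[2], reverse=True)
```

Restricting stage 2 to a few candidate servers is what avoids a flat brute-force scan over the full 2,806-tool catalog.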
After three ranked lists are produced, RRF merges them without requiring score normalisation. Each tool's fused score is the sum of weighted reciprocal ranks across all lists. k=60 prevents position-1 domination. A tool appearing at rank #1 across all three lists wins decisively; one appearing at rank #4 in just one list barely contributes.
VOTR combines the three ranked lists (Dense, BM25, SPLADE-lite) with weighted Reciprocal Rank Fusion (RRF) (paper Eq. 5 / Algorithm 1). For each tool j, the fused score is the sum of per-list contributions: w · 1 / (k + rank), where k = 60 smooths rank differences. If a tool is absent from a list, its rank is treated as infinity, so that list contributes 0.
With k = 60, ranks close to the top (roughly 1–5) contribute almost the same amount. That means RRF rewards agreement across retrievers: a tool that appears near the top in multiple lists can beat a tool that only gets one perfect #1. In this demo the list weights match the paper: dense=1.0, BM25=1.0, SPLADE-lite=0.35.
The SPLADE-lite weight is set lower because SPLADE-lite is a lightweight approximation of full SPLADE (TF-IDF with unigram + bigram features and sublinear term-frequency scaling), so it’s used as an extra lexical cue, especially for compound/phrase-like matches—rather than letting it override the stronger dense and BM25 evidence.
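The fusion rule above can be written in a few lines. This is a minimal sketch: k=60 and the weights (dense=1.0, BM25=1.0, SPLADE-lite=0.35) match the text, while the list contents are invented:

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each list l contributes
    w_l / (k + rank_l(tool)); tools absent from a list contribute 0."""
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense  = ["x", "a", "b"]   # "x" is a lone #1 in one list
bm25   = ["a", "b", "c"]
splade = ["a", "b", "c"]
fused = weighted_rrf([dense, bm25, splade], [1.0, 1.0, 0.35])
# "a", near the top of all three lists, outranks "x", the single perfect #1
```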
After RRF fusion, the top-H=24 candidates within a score window δ=0.003 of the leader receive a structured field bonus. The reranker decomposes the query into server, action, and constraint fields, then computes token-level Sørensen-Dice overlap against each tool's metadata fields.
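The overlap measure itself is straightforward; the sketch below shows token-level Sørensen-Dice, with the query decomposition and bonus weighting omitted:

```python
def dice_overlap(a_tokens, b_tokens):
    """Sørensen-Dice coefficient over token sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# e.g. query action tokens vs a tool description (example strings invented)
score = dice_overlap("list open pull requests".split(),
                     "list pull requests in a repository".split())
```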
VOTR uses a conformal prediction-inspired non-conformity score to adaptively choose how many tools to return. Thresholds τ₁ and τ₃ are calibrated on held-out data. On the large suite, 49% of queries get k=1 (98.4% accuracy), 6.6% get k=3 (100%), and 44.4% get k=5 (93.7% top-1, 97.7% handoff@k).
A lower non-conformity score nc means higher confidence: a score gap of 0.001 maps to f=0.0 (confident), while a gap of 0.0001 maps to f=1.0 (uncertain). The ratio term g provides a secondary signal when the absolute gap is small but the relative separation is still informative.
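A hedged sketch of the confidence gate: the threshold values τ₁ and τ₃ below are placeholders (the real ones are calibrated on held-out data), but the mapping from non-conformity score to k ∈ {1, 3, 5} follows the text:

```python
def choose_handoff_k(nc, tau1=0.2, tau3=0.5):
    """Map a non-conformity score to a handoff size k in {1, 3, 5}.
    Lower nc = higher confidence = smaller candidate set.
    tau1/tau3 here are illustrative; VOTR calibrates them on held-out data."""
    if nc <= tau1:
        return 1   # high-confidence tier: single tool
    if nc <= tau3:
        return 3   # medium tier
    return 5       # low-confidence tier: widen the handoff
```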
| Tier | Count | % | Top-1 | Handoff@k | Avg k |
|---|---|---|---|---|---|
| High | 245 | 49% | 98.4% | 98.4% | 1.0 |
| Medium | 33 | 6.6% | 100% | 100% | 3.0 |
| Low | 222 | 44.4% | 93.7% | 97.7% | 5.0 |
| Overall | 500 | 100% | 96.4% | 98.2% | 2.906 |
| Tier | Misses | Steps | Error rate | % of misses |
|---|---|---|---|---|
| High | 4 | 245 | 1.6% | 22% |
| Medium | 0 | 33 | 0.0% | 0% |
| Low | 14 | 222 | 6.3% | 78% |
| Total | 18 | 500 | 3.6% | 100% |
VOTR replaces MCP-Zero's full JSON-Schema blocks with a compact one-liner. The format retains exactly what the model needs: server name (routing), function signature with types (arg construction), short description (intent verification). Measured with tiktoken cl100k_base on the 2,806-tool index.
| Format | Tokens/tool | At k=3 | vs full catalog |
|---|---|---|---|
| MCP-Zero paper (their index) | ~143 | ~429 | baseline |
| MCP-Zero-style (our 2,806 tools) | 93.5 | 281 | 262,487 total |
| VOTR compressed (same index) | 26.2 | 79 | 72% reduction |
| VOTR full route avg (k≈2.9) | — | 230.3 | 99.91% reduction |
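The compressed one-liner can be produced by a formatter along these lines. This is a sketch: the function name and parameter-tuple shape are hypothetical, and VOTR's actual field selection and truncation rules may differ:

```python
def compress_tool(server, fn, params, desc):
    """Render '[server: Name] tool_fn(param: type, opt?: type) → desc'.
    params: list of (name, type, required) tuples. Hypothetical helper."""
    sig = ", ".join(f"{n}{'' if req else '?'}: {t}" for n, t, req in params)
    return f"[server: {server}] {fn}({sig}) → {desc}"

line = compress_tool("GitHub", "list_pull_requests",
                     [("repo", "str", True), ("state", "str", False)],
                     "List pull requests for a repository.")
```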
New MCP servers can be added without restarting the service or rebuilding embeddings. The registration is atomic: the new ToolIndex and HybridRetriever are fully constructed before any live state is replaced — under CPython's GIL, the reference swap is atomic with respect to concurrent request threads. In end-to-end agent runs, this also means VOTR-Orchestrator (the execution layer) can keep running while it calls the router for each server_intent/tool_intent hop—newly discovered servers become available to tool routing immediately.
Registering a new server s with tools {t₁,...,tₘ} into index I proceeds as:

1. Validate uniqueness of s.name in I
2. e_desc ← E(s.description) [OpenAI API]
3. e_sum ← E(s.summary) [OpenAI API]
4. for ti in {t₁,...,tₘ}:
e_i ← E(ti.name ∥ ti.description)
5. Append e_desc, e_sum to server arrays
6. Append {e₁,...,eₘ} to tool array
7. Update server/tool index mapping
8. Rebuild BM25 + SPLADE-lite indices
9. Persist updated .npy shards to disk
10. Atomically swap the engine.index reference (GIL-atomic)
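The build-then-swap pattern in steps 1–10 can be sketched as follows. The class and attribute names are invented; the point is that the replacement index is fully constructed off to the side before the single reference assignment, which is atomic under CPython's GIL:

```python
import threading

class Engine:
    def __init__(self, index):
        self.index = index           # read by concurrent request threads
        self._reg_lock = threading.Lock()

    def register_server(self, new_server):
        # Serialise writers only; readers never block and never see a
        # half-built index.
        with self._reg_lock:
            new_index = dict(self.index)
            new_index[new_server["name"]] = new_server["tools"]
            # Publish with one atomic reference assignment (GIL-atomic).
            self.index = new_index
```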
Evaluated on a 309-server / 2,806-tool corpus across single-tool, multi-hop, and multi-tool task types at small (n=100), medium (n=250), and large (n=500) routing-step scales. Multi-hop and multi-tool suites use 5 steps/case for compound chain metrics.
Compound Success@k requires EVERY hop in the chain to succeed. At k=5, VOTR achieves 100% chain success on all chain lengths except the two adversarial suites (0%).
| Dataset | Hops | Hop@1 | Chain@k |
|---|---|---|---|
| 10-hop | 10 | 90.0% | 100% |
| 20-hop | 20 | 95.0% | 100% |
| 25-hop strict | 25 | 92.0% | 100% |
| 50-hop unique servers | 50 | 100% | 100% |
| 50-hop realistic hard | 50 | 100% | 100% |
| 50-hop adversarial valid | 50 | 58.0% | 0% |
| 50-hop adversarial pure | 50 | 0% | 0% |
LiveMCPBench reveals VOTR's sensitivity to input format. Under the native dual-field protocol, performance is strong; forcing raw step-queries degrades it significantly.
| Protocol | n | R@1 | nDCG@5 |
|---|---|---|---|
| Native dual-field (policy payloads) | 268 | 87.7% | 90.9% |
| Paper-faithful tool-to-agent | 82 | 70.7% | 79.9% |
| Full95 reconstructed stepwise | 268 | 52.4% | 59.9% |
| Router format exact subset | 82 | 46.3% | 62.2% |
| Tool-to-Agent Retrieval [7] | — | 61.0% | — |
VOTR is not optimised for a single metric. BM25-only wins on single-tool lexical queries (99.0%) but collapses on multi-hop chains (80.0%). The full-stack configuration is a multi-objective operating point balancing strict Top-1, chain success, abstention safety, and token budget.
| Profile | Top-1 | Top-3 | Top-5 | Handoff@k | Notes |
|---|---|---|---|---|---|