VOTR

Vector Orchestrated Tool Retrieval for Scalable Multi-Agent Systems

A large multi-agent system can know about thousands of MCP tools, but a model should not have to read every schema before deciding what to do. VOTR turns that problem into retrieval: find the right tools first, then hand only a compact candidate set to the agent.

309 MCP servers indexed · 2,806 tool schemas · 96.4% top-1 accuracy (large) · 99.91% token reduction
01

The Catalog Problem

A 309-server MCP deployment with 2,806 tools would require ~262,487 tokens to inject all schemas into a model's context — before any reasoning begins. Full injection degrades reasoning quality even when the window is large enough. VOTR inserts a retrieval stage before model invocation.

CONTEXT PRESSURE SIMULATOR — 2,806 tools
262,487 full-inject tokens · 230 VOTR avg tokens · 99.91% reduction
VOTR compressed route (26.2 tokens/tool, avg k≈2.9)
Schema injection comparison: Normal full-catalog injection adds about 262,487 schema tokens on the 2,806-tool index, while VOTR injects only the routed set (about 230 tokens total per route on average).
Why “230”?: VOTR’s compact format averages ~26.2 tokens per returned tool line and returns ~k≈2.9 tools per route, so schema-only tokens are ~26.2×2.9≈76, plus formatting/routing metadata brings the total to ~230 tokens per route call.
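The arithmetic is easy to sanity-check. A minimal sketch using only the averages quoted above; the constant names are illustrative, not VOTR identifiers:

    # Back-of-envelope check of the "230 tokens per route" figure.
    TOKENS_PER_TOOL_LINE = 26.2     # avg compressed line
    AVG_K = 2.9                     # avg tools returned per route
    ROUTING_METADATA_TOKENS = 154   # formatting + routing metadata, per the text
    FULL_INJECT_TOKENS = 262_487    # all 2,806 schemas

    schema_tokens = TOKENS_PER_TOOL_LINE * AVG_K            # ≈ 76
    route_tokens = schema_tokens + ROUTING_METADATA_TOKENS  # ≈ 230
    reduction = 1 - route_tokens / FULL_INJECT_TOKENS       # ≈ 0.9991

    print(f"schema-only: {schema_tokens:.0f} tokens")
    print(f"per route:   {route_tokens:.0f} tokens")
    print(f"reduction:   {reduction:.2%}")                  # -> 99.91%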
REQUEST ANATOMY — POST /route
server_intent — domain/capability hint for server-level scoring
tool_intent — specific operation for tool-level scoring
session_id — enables multi-turn memory (TTL 24h, lazy GC)
skip_session_filter — bypass session tool suppression
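Putting the four fields together, a minimal sketch of a /route call. Only the field names come from the anatomy above; the localhost URL, example intents, and response handling are assumptions about a local deployment:

    # Hedged sketch of a /route request, assuming the requests package and a
    # VOTR router listening on localhost:8000.
    import requests

    payload = {
        "server_intent": "GitHub repository operations",  # domain/capability hint
        "tool_intent": "list open pull requests",         # specific operation
        "session_id": "demo-session-1",                   # enables multi-turn memory
        "skip_session_filter": False,                     # keep session suppression on
    }

    resp = requests.post("http://localhost:8000/route", json=payload, timeout=10)
    resp.raise_for_status()
    print(resp.json())  # routed candidate set: compressed schema lines + metadata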
02

System Architecture & Pipeline

VOTR is a three-tier service: Agent/Client → Router Core (FastAPI) → Registry + Execution Proxy. Click each pipeline step to see what happens. Embedding the query (OpenAI text-embedding-3-large, 3072-dim) dominates latency at ~200–280ms; all local computation runs in <20ms.

ROUTING PIPELINE — click steps
① Intents In
server_intent + tool_intent + optional session_id
Two natural-language fields split the routing signal. server_intent names the server domain (e.g. "GitHub repository operations"); tool_intent describes the specific action (e.g. "list open pull requests"). The decomposition is intentional: VOTR scores hierarchically, server-level first, then tool-level within candidate servers.
② Embedding (OpenAI)
text-embedding-3-large → q_s ∈ ℝ³⁰⁷², q_t ∈ ℝ³⁰⁷² (~200–280ms)
Both intent strings are embedded separately. q_s is used for server-level cosine scoring: σᵢ = max(cos(q_s, e_desc_i), cos(q_s, e_sum_i)). q_t is used for tool-level scoring within candidate servers. Embeddings are L2-normalised and stored as .npy shards (not a monolithic JSON like MCP-Zero). The round-trip dominates total latency.
③ Hybrid Retrieval (parallel)
Dense (hierarchical) + BM25 + SPLADE-lite run in parallel
Dense: Server scoring selects the top-N=8 servers; tool scoring within those yields L_dense. The hierarchical score τⱼ = σ_π(j) · tⱼ · max(σ_π(j), tⱼ) penalises tools that live on the wrong server (see the numpy sketch after the pipeline).
BM25: Each tool = doc concatenating server name + summary + description + tool name + description. Exact lexical matches (e.g. "search_repositories") rank with high precision. BM25-only achieves 99.0% top-1 — higher than dense-only (93.8%) — but degrades to 80.0% on paraphrastic multi-hop queries.
SPLADE-lite: TF-IDF with bigram features + sublinear TF scaling. Captures compound terms ("send message", "list branches") without a full SPLADE model. Weight w=0.35 vs 1.0 for dense and BM25.
④ Weighted RRF Fusion
RRF(j) = Σ wₗ/(k+rₗ(j)), k=60, weights [1.0, 1.0, 0.35]
The three ranked lists vote into a single merged ranking. k=60 (standard RRF smoothing constant) prevents top-1 over-dominance. Tools absent from a list get rₗ(j)=∞ (contribute 0). Weights: dense=1.0, BM25=1.0, SPLADE-lite=0.35. RRF is robust to score-scale mismatches between sparse and dense retrievers — no normalisation needed.
⑤ Field-Aware Rerank
Sørensen-Dice overlap on top-H=24 candidates within δ=0.003 window
Computes token-level Sørensen-Dice coefficient across 5 structured fields: tool name (w=0.35), server name (w=0.25), tool desc (w=0.20), params (w=0.12), server summary (w=0.08). Explicit server name match bonus: +0.22. Final score: τ̂ⱼ = RRF(j) + 0.0015 · Bⱼ. Resolves near-tie ambiguities that global retrieval signals cannot distinguish (e.g. github.list_pull_requests vs gitlab.list_merge_requests).
⑥ Intent Disambiguation
Singular/destructive → bulk penalty ×0.82 · Plural/batch → bulk boost ×1.08
Detects two patterns: (1) Singular destructive intent (delete/remove/clear without plural qualifiers) → penalises bulk-operation tools by 0.82. (2) Plural/batch intent (all/batch/every/multiple) → boosts bulk-capable tools by 1.08. Tiebreaker only — small adjustments to avoid wrong bulk-vs-single matching.
⑦ Confidence & Handoff-k
Non-conformity score → tier {high/med/low} → k ∈ {1, 3, 5}
nc(q) = -log₁₀(gap) + 0.5·(s₂/s₁ - 0.975) - 0.3·[explicit server match]. Three tiers: high (k=1, nc≤τ₁), medium (k=3), low (k=5, nc>τ₃). Thresholds calibrated on held-out data for target coverage. Overlap detector flags ambiguous cross-server capability collisions. Abstention guard returns empty list if query support < θ_null = 0.213.
⑧ Compressed Schemas Out
26.2 tokens/tool avg vs 93.5 for MCP-Zero JSON blocks
Format: [server: Name] tool_fn(param: type, opt?: type) → Short description. Retains the three pieces the model needs: server name (routing), function signature with types (arg construction), description (intent verification). tiktoken cl100k_base measured. At k≈2.9, ~76 schema tokens + ~154 routing metadata = ~230 tokens total per route call.
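As referenced in step ③, a minimal numpy sketch of the hierarchical dense scoring. Only the formulas σᵢ and τⱼ come from the pipeline above; the array names, shapes, and random placeholder embeddings are illustrative:

    import numpy as np

    D, N_SERVERS, TOP_N = 3072, 309, 8

    rng = np.random.default_rng(0)
    def unit(x):  # embeddings are L2-normalised, so cosine = dot product
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q_s = unit(rng.normal(size=D))                  # embedded server_intent
    q_t = unit(rng.normal(size=D))                  # embedded tool_intent
    e_desc = unit(rng.normal(size=(N_SERVERS, D)))  # server description embeddings
    e_sum  = unit(rng.normal(size=(N_SERVERS, D)))  # server summary embeddings

    # Server level: sigma_i = max(cos(q_s, e_desc_i), cos(q_s, e_sum_i))
    sigma = np.maximum(e_desc @ q_s, e_sum @ q_s)
    candidate_servers = np.argsort(-sigma)[:TOP_N]  # top-N = 8 servers

    # Tool level, inside candidate servers only:
    # tau_j = sigma_pi(j) * t_j * max(sigma_pi(j), t_j)
    # penalises tools that score well on the wrong server.
    def hierarchical_score(sigma_server: float, t_tool: float) -> float:
        return sigma_server * t_tool * max(sigma_server, t_tool)

    print(hierarchical_score(0.82, 0.74))  # strong server + strong tool -> high tau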
LATENCY BREAKDOWN
OpenAI embed API: ~200–280ms
Dense retrieval (matvec): <5ms
BM25 + SPLADE-lite: <8ms
RRF + rerank + confidence: <7ms
P50: 312ms (large) · P95: 393ms
MCP-ZERO vs VOTR
Aspect | MCP-Zero | VOTR
Storage | 326MB JSON | meta.json + .npy shards
Rebuild | Full manual | Hot registration
Embedding storage | Inline JSON | Binary arrays
Memory | Entire file in RAM | Memory-mapped
Top-k | Fixed | Adaptive (1/3/5)
Live MCP | None | stdio + SSE
03

Hybrid Retrieval: Three Lenses

Dense embedding, BM25, and SPLADE-lite run in parallel and capture different signals. The key insight: BM25-only achieves 99.0% top-1 on single-tool queries (vs 93.8% dense-only) because MCP tool names are strongly lexical, but BM25 drops to 80.0% on paraphrastic multi-hop chains where dense semantic matching is essential.

RETRIEVAL PIPELINE SUMMARY
Stage | What it does | Key settings (paper/code) | Why it matters
Hierarchical dense | Ranks servers first using embeddings from server summary + description, then ranks tools inside the selected servers | text-embedding-3-large (3072 dims); hierarchical server-to-tool scoring | Semantic recall under paraphrases; avoids flat brute force over the entire catalog
BM25 sparse | Builds one lexical "document" per tool by concatenating server text + tool text (see the sketch below the table) | BM25 over concatenated fields: server name/summary/description + tool name/description | Strong exact-name/ID precision (tool/server keywords)
SPLADE-lite | TF-IDF-style sparse expansion with unigram + bigram features (SPLADE-lite approximation) | Fusion weight w_splade = 0.35 (dense = 1.0, BM25 = 1.0) | Recovers compound/rare-term matches without the full SPLADE model cost
Weighted RRF fusion | Merges dense/BM25/SPLADE ranked lists with reciprocal-rank voting | RRF(j) = Σₗ wₗ / (60 + rₗ(j)) | Robust to score-scale differences; rewards tools that rank well across multiple lenses
Field-aware rerank | Adds a structured overlap bonus on top candidates | Sørensen-Dice overlap across structured fields (server name/summary, tool name/description, parameters), applied near the head of the fused list (H = 24) within a small score window (δ = 0.003) | Breaks near ties RRF alone cannot resolve
Intent disambiguation | Adjusts ranking for singular-destructive vs plural/batch intent | Bulk penalty ×0.82 for singular destructive; bulk boost ×1.08 for plural/batch | Prevents wrong single-vs-bulk tool selection
Confidence-gated handoff | Selects candidate count by uncertainty tier | Adaptive handoff size k ∈ {1, 3, 5}, chosen from a non-conformity score (conformal calibration) or gap thresholds | Keeps the prompt small when clear, widens safely when ambiguous
Retrieval guardrails | Applies abstention/null-route and overlap-aware expansion | If query/tool support is too low, the null-route returns an empty candidate list (abstention); if capability overlap is detected, an overlap cluster is expanded and confidence may be downgraded (over-aggressive triggering can widen prompts toward colliding tools and hurt precision) | Avoids high-confidence wrong tools when support is weak or capabilities overlap
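As referenced in the BM25 row above, a minimal sketch of the "one lexical document per tool" construction, using the third-party rank_bm25 package as a stand-in for VOTR's own index; the two example tools are hypothetical:

    from rank_bm25 import BM25Okapi

    tools = [
        {"server": "github", "summary": "GitHub API", "desc": "Repos, PRs, issues",
         "name": "search_repositories", "tool_desc": "Search for repositories"},
        {"server": "gitlab", "summary": "GitLab API", "desc": "Repos and MRs",
         "name": "list_merge_requests", "tool_desc": "List open merge requests"},
    ]

    def to_doc(t):
        # Concatenate server name + summary + description + tool name + description;
        # underscore splitting keeps exact tool-name tokens matchable.
        text = " ".join([t["server"], t["summary"], t["desc"], t["name"], t["tool_desc"]])
        return text.lower().replace("_", " ").split()

    bm25 = BM25Okapi([to_doc(t) for t in tools])
    scores = bm25.get_scores("search repositories".split())
    print(scores)  # the exact lexical match ranks the github tool first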
LIVE RETRIEVAL COMPARISON — try different queries
04

Weighted Reciprocal Rank Fusion

After the three ranked lists are produced, RRF merges them without requiring score normalisation: each tool's fused score is the sum of weighted reciprocal ranks across all lists. With k=60, ranks near the top contribute nearly equal amounts, so position #1 cannot dominate; what separates tools is how many lists place them near the top. A tool at rank #1 across all three lists wins decisively, while a tool that appears in only one list contributes just a single reciprocal-rank term.

RRF CALCULATOR
RRF(j) = Σ wₗ / (60 + rₗ(j)) — absent tools get rₗ(j) = ∞ (0 contribution)
DENSE (w=1.0)
BM25 (w=1.0)
SPLADE-lite (w=0.35)
FUSED OUTPUT → sorted by RRF score
RRF intuition (why agreement matters)

VOTR combines the three ranked lists (Dense, BM25, SPLADE-lite) with weighted Reciprocal Rank Fusion (RRF) (paper Eq. 5 / Algorithm 1). For each tool j, the fused score is the sum of per-list contributions: w · 1 / (k + rank), where k = 60 smooths rank differences. If a tool is absent from a list, its rank is treated as infinity, so that list contributes 0.

With k = 60, ranks close to the top (roughly 1–5) contribute almost the same amount. That means RRF rewards agreement across retrievers: a tool that appears near the top in multiple lists can beat a tool that only gets one perfect #1. In this demo the list weights match the paper: dense=1.0, BM25=1.0, SPLADE-lite=0.35.

The SPLADE-lite weight is set lower because SPLADE-lite is a lightweight approximation of full SPLADE (TF-IDF with unigram + bigram features and sublinear term-frequency scaling). It serves as an extra lexical cue, especially for compound or phrase-like matches, rather than being allowed to override the stronger dense and BM25 evidence.
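The fusion step is small enough to show in full. A minimal sketch matching Eq. 5, with k=60 and the paper's weights [1.0, 1.0, 0.35]; the tool IDs are made up:

    from collections import defaultdict

    def weighted_rrf(ranked_lists, weights, k=60):
        # RRF(j) = sum over lists of w_l / (k + rank_l(j));
        # a tool absent from a list simply contributes no term (rank = infinity).
        fused = defaultdict(float)
        for ranking, w in zip(ranked_lists, weights):
            for rank, tool in enumerate(ranking, start=1):
                fused[tool] += w / (k + rank)
        return sorted(fused.items(), key=lambda kv: -kv[1])

    dense  = ["gh.list_pull_requests", "gl.list_merge_requests", "gh.get_pull"]
    bm25   = ["gh.list_pull_requests", "gh.get_pull"]
    splade = ["gl.list_merge_requests", "gh.list_pull_requests"]

    for tool, score in weighted_rrf([dense, bm25, splade], [1.0, 1.0, 0.35]):
        print(f"{score:.5f}  {tool}")
    # gh.list_pull_requests wins: near-top agreement across all three lists beats
    # a single good rank in one list.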

WEIGHT SENSITIVITY — w_splade slider: 0 (ignored) · 0.35 · 1.0 (equal)
05

Field-Aware Reranking

After RRF fusion, the top-H=24 candidates within a score window δ=0.003 of the leader receive a structured field bonus. The reranker decomposes the query into server, action, and constraint fields, then computes token-level Sørensen-Dice overlap against each tool's metadata fields.

FIELD WEIGHTS (Sørensen-Dice): tool_name 0.35 · server_name 0.25 · tool_desc 0.20 · params 0.12 · server_summary 0.08 · explicit server match +0.22
B_j = Σ_f w_f · Dice(Q_f, T_jf) + 0.22·[server_match]
τ̂_j = RRF(j) + 0.0015 · B_j
Dice(A,B) = 2|A∩B| / (|A|+|B|) on camelCase-split, underscore-normalised tokens. λ=0.0015 keeps field bonus proportional to RRF score range.
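A minimal sketch of the bonus computation, assuming a simplified tokeniser (camelCase split + underscore normalisation) and a query already decomposed into the same five fields; only the weights, the +0.22 server bonus, and λ=0.0015 come from the text above:

    import re

    FIELD_WEIGHTS = {"tool_name": 0.35, "server_name": 0.25, "tool_desc": 0.20,
                     "params": 0.12, "server_summary": 0.08}
    SERVER_MATCH_BONUS, LAMBDA = 0.22, 0.0015

    def tokens(s: str) -> set:
        s = re.sub(r"([a-z])([A-Z])", r"\1 \2", s)       # split camelCase
        return set(re.split(r"[\s_]+", s.lower())) - {""}  # normalise underscores

    def dice(a: set, b: set) -> float:
        # Dice(A,B) = 2|A∩B| / (|A| + |B|)
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

    def rerank_score(rrf_score, query_fields, tool_fields, server_match=False):
        bonus = sum(w * dice(tokens(query_fields.get(f, "")),
                             tokens(tool_fields.get(f, "")))
                    for f, w in FIELD_WEIGHTS.items())
        if server_match:
            bonus += SERVER_MATCH_BONUS
        return rrf_score + LAMBDA * bonus   # tau-hat_j = RRF(j) + 0.0015 * B_j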
RERANK SIMULATOR
06

Confidence-Gated Handoff Policy

VOTR uses a conformal prediction-inspired non-conformity score to adaptively choose how many tools to return. Thresholds τ₁ and τ₃ are calibrated on held-out data. On the large suite, 49% of queries get k=1 (98.4% accuracy), 6.6% get k=3 (100%), and 44.4% get k=5 (93.7% top-1, 97.7% handoff@k).

NON-CONFORMITY SCORE CALCULATOR
nc(q) = -log₁₀(gap) ← gap signal f(δ)
      + 0.5 · (s₂/s₁ - 0.975) ← ratio signal g(s₁,s₂)
      - 0.3 · [explicit server match] ← structural h(ŝ)

Lower nc → higher confidence. A wider gap shrinks the −log₁₀(gap) term (gap 0.0023 → 2.64, as in the worked example below; gap 0.0001 → 4.0), while the ratio term g supplies a secondary signal when the absolute gap is small but the relative separation s₂/s₁ is still informative.

Worked example (calculator defaults): gap = 0.0023 · s₂/s₁ = 0.942
s₁ (top-1) = 0.0241 · s₂ (top-2) = 0.0218
gap term = 2.64 · ratio term = −0.02 · server term = 0.00
NON-CONFORMITY SCORE = 2.62 → LOW CONFIDENCE → Handoff k = 5 tools
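A sketch of the gate end-to-end. The formula terms come from the calculator above; the tier thresholds τ₁/τ₃ are placeholders here, since the calibrated values are not given:

    import math

    TAU_1, TAU_3 = 1.0, 2.0   # ASSUMED thresholds, for illustration only

    def non_conformity(s1, s2, explicit_server_match=False):
        gap = s1 - s2
        nc = -math.log10(gap)              # gap signal f(delta)
        nc += 0.5 * (s2 / s1 - 0.975)      # ratio signal g(s1, s2)
        nc -= 0.3 * explicit_server_match  # structural signal h
        return nc

    def handoff_k(nc):
        if nc <= TAU_1: return 1   # high confidence
        if nc <= TAU_3: return 3   # medium
        return 5                   # low

    nc = non_conformity(s1=0.0241, s2=0.0218)       # inputs from the example above
    print(f"nc = {nc:.2f} -> k = {handoff_k(nc)}")  # ≈ 2.60 here (widget shows 2.62) -> k = 5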
CALIBRATION RESULTS (large suite, n=500)
Tier | Count | % | Top-1 | Handoff@k | Avg k
High | 245 | 49% | 98.4% | 98.4% | 1.0
Medium | 33 | 6.6% | 100% | 100% | 3.0
Low | 222 | 44.4% | 93.7% | 97.7% | 5.0
Overall | 500 | 100% | 96.4% | 98.2% | 2.906
78% of all misses occur in the low-confidence tier (14/18). The policy correctly identifies hard queries — Handoff@k recovers 97.7% vs 93.7% strict Top-1 in this tier.
MISS ACCOUNTING BY TIER
Tier | Misses | Steps | Error rate | % of misses
High | 4 | 245 | 1.6% | 22%
Medium | 0 | 33 | 0.0% | 0%
Low | 14 | 222 | 6.3% | 78%
Total | 18 | 500 | 3.6% | 100%
4 of 18 misses are alias/equivalence issues (recovered by equivalence-aware labels → 97.2% Top-1, 91% compound). Not genuine retrieval failures.
07

Compressed Schema Injection

VOTR replaces MCP-Zero's full JSON-Schema blocks with a compact one-liner. The format retains exactly what the model needs: server name (routing), function signature with types (arg construction), short description (intent verification). Measured with tiktoken cl100k_base on the 2,806-tool index.

FORMAT COMPARISON — interactive
MCP-Zero JSON block: ~93.5 tokens
VOTR compressed line: ~26.2 tokens
Format | Tokens/tool | At k=3 | vs full catalog
MCP-Zero paper (their index) | ~143 | ~429 | baseline
MCP-Zero-style (our 2,806 tools) | 93.5 | 281 | 262,487 total
VOTR compressed (same index) | 26.2 | 79 | 72% reduction
VOTR full route avg (k≈2.9) | — | 230.3 total/route | 99.91% reduction
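A minimal sketch of rendering the compressed one-liner described above and measuring it with tiktoken's cl100k_base encoding; the GitHub tool and its parameters are hypothetical:

    import tiktoken

    def compress_schema(server, fn, params, description):
        # Format: [server: Name] tool_fn(param: type, opt?: type) → Short description
        sig = ", ".join(f"{name}{'?' if optional else ''}: {ptype}"
                        for name, ptype, optional in params)
        return f"[server: {server}] {fn}({sig}) → {description}"

    line = compress_schema(
        server="GitHub",
        fn="list_pull_requests",
        params=[("repo", "string", False), ("state", "string", True)],
        description="List pull requests for a repository",
    )
    print(line)

    enc = tiktoken.get_encoding("cl100k_base")
    print(len(enc.encode(line)))  # token count, in the ballpark of the 26.2 average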
HANDOFF PAYLOAD — k slider (k = 3 shown)
k=1 (high conf) · k=3 (medium) · k=5 (low conf)
08

Dynamic Registry & Hot Registration

New MCP servers can be added without restarting the service or re-embedding the existing catalog. Registration is atomic: the new ToolIndex and HybridRetriever are fully constructed before any live state is replaced, and under CPython's GIL the final reference swap is atomic with respect to concurrent request threads. In end-to-end agent runs this means VOTR-Orchestrator (the execution layer) keeps running while it calls the router for each server_intent/tool_intent hop; newly discovered servers become available to tool routing immediately.

REGISTRATION SIMULATOR
Simulates the hot registration flow for a new server. Watch the index state update atomically.
VOTR Orchestrator
1. Decomposes your request into server_intent/tool_intent hops.
2. Calls the VOTR router (/route) to retrieve candidate tools.
3. Wraps routed MCP tools and runs a multi-step tool-calling loop (with session memory).
Hot registration impact: while the router rebuilds indices, the orchestrator can keep executing; once the router swaps the new ToolIndex/HybridRetriever, the next hop can immediately use tools from the newly registered server.
REGISTRATION ALGORITHM
1. Validate uniqueness of s.name in I
2. e_desc ← E(s.description)       [OpenAI API]
3. e_sum  ← E(s.summary)           [OpenAI API]
4. for ti in {t₁,...,tₘ}:
   e_i ← E(ti.name ∥ ti.description)
5. Append e_desc, e_sum to server arrays
6. Append {e₁,...,eₘ} to tool array
7. Update server/tool index mapping
8. Rebuild BM25 + SPLADE-lite indices
9. Persist updated .npy shards to disk
10. Atomically swap engine.index ref  ← GIL
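A sketch of step 10's swap, assuming an Engine object that holds the live index; copy_with() and search() are hypothetical method names. The point is structural: build the new index completely, then rebind one reference.

    class Engine:
        def __init__(self, index):
            self.index = index  # live ToolIndex + HybridRetriever

        def register_server(self, server):
            # Steps 1-8: embed, append, rebuild sparse indices, fully off to the side.
            new_index = self.index.copy_with(server)
            # Step 9: persist updated .npy shards to disk here.
            self.index = new_index  # step 10: one atomic reference rebind (GIL)

    def handle_route(engine, query):
        index = engine.index        # snapshot one consistent reference
        return index.search(query)  # sees old or new index, never a half-built one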
TRANSPORT DISCOVERY
POST /register/discover
Spawns stdio subprocess → MCP initialize handshake → tools/list → normalize → register. Timeout: 2–120s (default 20s).
POST /register/discover/sse
Same JSON-RPC over HTTP/SSE endpoint. Suitable for remote/containerised servers.
POST /session/clear
Purges session tool history. Session memory uses TTL=24h with lazy GC on writes.
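A minimal sketch of the TTL-24h, lazy-GC session memory behind /session/clear; the class and method names are assumptions:

    import time

    TTL_SECONDS = 24 * 3600

    class SessionStore:
        def __init__(self):
            self._sessions = {}   # session_id -> (last_write_ts, used_tools)

        def record(self, session_id, tool_name):
            self._gc()            # lazy GC: expired entries purged on writes only
            ts, tools = self._sessions.get(session_id, (0.0, set()))
            tools.add(tool_name)
            self._sessions[session_id] = (time.time(), tools)

        def used_tools(self, session_id):
            entry = self._sessions.get(session_id)
            if entry is None or time.time() - entry[0] > TTL_SECONDS:
                return set()      # expired sessions read as empty
            return entry[1]

        def clear(self, session_id):  # backs POST /session/clear
            self._sessions.pop(session_id, None)

        def _gc(self):
            now = time.time()
            expired = [sid for sid, (ts, _) in self._sessions.items()
                       if now - ts > TTL_SECONDS]
            for sid in expired:
                del self._sessions[sid]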
Zero-downtime verified: Control Hit@1 = 1.000 before and after registration in all experiments, including Bloomberg cold-start case.
09

Evaluation Results

Evaluated on a 309-server / 2,806-tool corpus across single-tool, multi-hop, and multi-tool task types at small (n=100), medium (n=250), and large (n=500) routing-step scales. Multi-hop and multi-tool suites use 5 steps/case for compound chain metrics.

FULL-STACK FUNCTIONAL CORRECTNESS
MULTI-HOP CHAIN SCALING

Compound Success@k requires EVERY hop in the chain to succeed. At k=5, VOTR achieves 100% chain success on all suites except the two adversarial ones (0%).

Dataset | Hops | Hop@1 | Chain@k
10-hop | 10 | 90.0% | 100%
20-hop | 20 | 95.0% | 100%
25-hop strict | 25 | 92.0% | 100%
50-hop unique servers | 50 | 100% | 100%
50-hop realistic hard | 50 | 100% | 100%
50-hop adversarial valid | 50 | 58.0% | 0%
50-hop adversarial pure | 50 | 0% | 0%
Key limitation: Coordinated adversarial server names + paraphrased intents cause complete chain failure. No adversarial training layer yet.
LIVEMCPBENCH BENCHMARK

LiveMCPBench reveals VOTR's sensitivity to input format: under the native dual-field protocol performance is strong, while forcing raw step-queries degrades it significantly.

Protocol | n | R@1 | nDCG@5
Native dual-field (policy payloads) | 268 | 87.7% | 90.9%
Paper-faithful tool-to-agent | 82 | 70.7% | 79.9%
Full95 reconstructed stepwise | 268 | 52.4% | 59.9%
Router format exact subset | 82 | 46.3% | 62.2%
Tool-to-Agent Retrieval [7] | — | 61.0% | —
Key: The gap is driven by VOTR+orchestrator interface sensitivity, not a retriever-only ceiling. VOTR’s router emits a structured “policy payload” (dual fields like server_intent + tool_intent), along with a confidence-gated handoff candidate set (adaptive k∈{1,3,5}) and guardrails (overlap-aware expansion + abstention/null-route). The orchestrator then uses this contract to format the agent step consistently (including compressed schema injection), so the downstream model calls the right tool interface. When you force raw step-queries or truncate the router payload, the orchestrator can’t reliably reconstruct that intent/prompt contract, so Recall@1 drops; with the native payload contract, Recall@1 rises from 52% to 88%.
10

Ablation Study

VOTR is not optimised for a single metric. BM25-only wins on single-tool lexical queries (99.0%) but collapses on multi-hop chains (80.0%). The full-stack configuration is a multi-objective operating point balancing strict Top-1, chain success, abstention safety, and token budget.

ABLATION COMPARISON — interactive
Profile | Top-1 | Top-3 | Top-5 | Handoff@k | Notes