VOTR

Vector Orchestrated Tool Retrieval for Scalable Multi-Agent Systems

A large multi-agent system can know about thousands of MCP tools, but a model should not have to read every schema before deciding what to do. VOTR turns that problem into retrieval: find the right tools first, then hand only a compact candidate set to the agent.

309 MCP servers indexed · 2,806 tool schemas · 96.4% top-1 accuracy (large) · 99.91% token reduction
01

The Catalog Problem

A 309-server MCP deployment with 2,806 tools would require ~262,487 tokens to inject all schemas into a model's context — before any reasoning begins. Full injection degrades reasoning quality even when the window is large enough. VOTR inserts a retrieval stage before model invocation.

CONTEXT PRESSURE SIMULATOR — 2,806 tools
262,487 full-inject tokens · 230 VOTR avg tokens · 99.91% reduction
VOTR compressed route (26.2 tokens/tool, avg k≈2.9)
Schema injection comparison: Normal full-catalog injection adds about 262,487 schema tokens on the 2,806-tool index, while VOTR injects only the routed set (about 230 tokens total per route on average).
Why “230”?: VOTR’s compact format averages ~26.2 tokens per returned tool line and returns ~k≈2.9 tools per route, so schema-only tokens are ~26.2×2.9≈76, plus formatting/routing metadata brings the total to ~230 tokens per route call.
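The arithmetic is easy to sanity-check. A minimal sketch using only the averages quoted above; the constant names are illustrative, not VOTR identifiers:

    # Back-of-envelope check of the "230 tokens per route" figure.
    TOKENS_PER_TOOL_LINE = 26.2     # avg compressed line
    AVG_K = 2.9                     # avg tools returned per route
    ROUTING_METADATA_TOKENS = 154   # formatting + routing metadata, per the text
    FULL_INJECT_TOKENS = 262_487    # all 2,806 schemas

    schema_tokens = TOKENS_PER_TOOL_LINE * AVG_K            # ≈ 76
    route_tokens = schema_tokens + ROUTING_METADATA_TOKENS  # ≈ 230
    reduction = 1 - route_tokens / FULL_INJECT_TOKENS       # ≈ 0.9991

    print(f"schema-only: {schema_tokens:.0f} tokens")
    print(f"per route:   {route_tokens:.0f} tokens")
    print(f"reduction:   {reduction:.2%}")                  # -> 99.91%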
REQUEST ANATOMY — POST /route
server_intent — domain/capability hint for server-level scoring
tool_intent — specific operation for tool-level scoring
session_id — enables multi-turn memory (TTL 24h, lazy GC)
skip_session_filter — bypass session tool suppression
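Putting the four fields together, a minimal sketch of a /route call. Only the field names come from the anatomy above; the localhost URL, example intents, and response handling are assumptions about a local deployment:

    # Hedged sketch of a /route request, assuming the requests package and a
    # VOTR router listening on localhost:8000.
    import requests

    payload = {
        "server_intent": "GitHub repository operations",  # domain/capability hint
        "tool_intent": "list open pull requests",         # specific operation
        "session_id": "demo-session-1",                   # enables multi-turn memory
        "skip_session_filter": False,                     # keep session suppression on
    }

    resp = requests.post("http://localhost:8000/route", json=payload, timeout=10)
    resp.raise_for_status()
    print(resp.json())  # routed candidate set: compressed schema lines + metadata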
02

System Architecture & Pipeline

VOTR is a three-tier service: Agent/Client → Router Core (FastAPI) → Registry + Execution Proxy. Click each pipeline step to see what happens. Embedding the query (OpenAI text-embedding-3-large, 3072-dim) dominates latency at ~200–280ms; all local computation runs in <20ms.

ROUTING PIPELINE — click steps
① Intents In
server_intent + tool_intent + optional session_id
Two natural-language fields split the routing signal. server_intent names the server domain (e.g. "GitHub repository operations"); tool_intent describes the specific action (e.g. "list open pull requests"). The decomposition is intentional: VOTR scores hierarchically, server-level first, then tool-level within candidate servers.
② Embedding (OpenAI)
text-embedding-3-large → q_s ∈ ℝ³⁰⁷², q_t ∈ ℝ³⁰⁷² (~200–280ms)
Both intent strings are embedded separately. q_s is used for server-level cosine scoring: σᵢ = max(cos(q_s, e_desc_i), cos(q_s, e_sum_i)). q_t is used for tool-level scoring within candidate servers. Embeddings are L2-normalised and stored as .npy shards (not a monolithic JSON like MCP-Zero). The round-trip dominates total latency.
③ Hybrid Retrieval (parallel)
Dense (hierarchical) + BM25 + SPLADE-lite run in parallel
Dense: Server scoring selects the top-N=8 servers; tool scoring within those yields L_dense. The hierarchical score τⱼ = σ_π(j) · tⱼ · max(σ_π(j), tⱼ) penalises tools that live on the wrong server (see the numpy sketch after the pipeline).
BM25: Each tool = doc concatenating server name + summary + description + tool name + description. Exact lexical matches (e.g. "search_repositories") rank with high precision. BM25-only achieves 99.0% top-1 — higher than dense-only (93.8%) — but degrades to 80.0% on paraphrastic multi-hop queries.
SPLADE-lite: TF-IDF with bigram features + sublinear TF scaling. Captures compound terms ("send message", "list branches") without a full SPLADE model. Weight w=0.35 vs 1.0 for dense and BM25.
④ Weighted RRF Fusion
RRF(j) = Σ wₗ/(k+rₗ(j)), k=60, weights [1.0, 1.0, 0.35]
The three ranked lists vote into a single merged ranking. k=60 (standard RRF smoothing constant) prevents top-1 over-dominance. Tools absent from a list get rₗ(j)=∞ (contribute 0). Weights: dense=1.0, BM25=1.0, SPLADE-lite=0.35. RRF is robust to score-scale mismatches between sparse and dense retrievers — no normalisation needed.
⑤ Field-Aware Rerank
Sørensen-Dice overlap on top-H=24 candidates within δ=0.003 window
Computes token-level Sørensen-Dice coefficient across 5 structured fields: tool name (w=0.35), server name (w=0.25), tool desc (w=0.20), params (w=0.12), server summary (w=0.08). Explicit server name match bonus: +0.22. Final score: τ̂ⱼ = RRF(j) + 0.0015 · Bⱼ. Resolves near-tie ambiguities that global retrieval signals cannot distinguish (e.g. github.list_pull_requests vs gitlab.list_merge_requests).
⑥ Intent Disambiguation
Singular/destructive → bulk penalty ×0.82 · Plural/batch → bulk boost ×1.08
Detects two patterns: (1) Singular destructive intent (delete/remove/clear without plural qualifiers) → penalises bulk-operation tools by 0.82. (2) Plural/batch intent (all/batch/every/multiple) → boosts bulk-capable tools by 1.08. Tiebreaker only — small adjustments to avoid wrong bulk-vs-single matching.
⑦ Confidence & Handoff-k
Non-conformity score → tier {high/med/low} → k ∈ {1, 3, 5}
nc(q) = -log₁₀(gap) + 0.5·(s₂/s₁ - 0.975) - 0.3·[explicit server match]. Three tiers: high (k=1, nc≤τ₁), medium (k=3), low (k=5, nc>τ₃). Thresholds calibrated on held-out data for target coverage. Overlap detector flags ambiguous cross-server capability collisions. Abstention guard returns empty list if query support < θ_null = 0.213.
⑧ Compressed Schemas Out
26.2 tokens/tool avg vs 93.5 for MCP-Zero JSON blocks
Format: [server: Name] tool_fn(param: type, opt?: type) → Short description. Retains the three pieces the model needs: server name (routing), function signature with types (arg construction), description (intent verification). tiktoken cl100k_base measured. At k≈2.9, ~76 schema tokens + ~154 routing metadata = ~230 tokens total per route call.
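As referenced in step ③, a minimal numpy sketch of the hierarchical dense scoring. Only the formulas σᵢ and τⱼ come from the pipeline above; the array names, shapes, and random placeholder embeddings are illustrative:

    import numpy as np

    D, N_SERVERS, TOP_N = 3072, 309, 8

    rng = np.random.default_rng(0)
    def unit(x):  # embeddings are L2-normalised, so cosine = dot product
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q_s = unit(rng.normal(size=D))                  # embedded server_intent
    q_t = unit(rng.normal(size=D))                  # embedded tool_intent
    e_desc = unit(rng.normal(size=(N_SERVERS, D)))  # server description embeddings
    e_sum  = unit(rng.normal(size=(N_SERVERS, D)))  # server summary embeddings

    # Server level: sigma_i = max(cos(q_s, e_desc_i), cos(q_s, e_sum_i))
    sigma = np.maximum(e_desc @ q_s, e_sum @ q_s)
    candidate_servers = np.argsort(-sigma)[:TOP_N]  # top-N = 8 servers

    # Tool level, inside candidate servers only:
    # tau_j = sigma_pi(j) * t_j * max(sigma_pi(j), t_j)
    # penalises tools that score well on the wrong server.
    def hierarchical_score(sigma_server: float, t_tool: float) -> float:
        return sigma_server * t_tool * max(sigma_server, t_tool)

    print(hierarchical_score(0.82, 0.74))  # strong server + strong tool -> high tau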
LATENCY BREAKDOWN
OpenAI embed API: ~200–280ms
Dense retrieval (matvec): <5ms
BM25 + SPLADE-lite: <8ms
RRF + rerank + confidence: <7ms
P50: 312ms (large) · P95: 393ms
MCP-ZERO vs VOTR
Aspect | MCP-Zero | VOTR
Storage | 326MB JSON | meta.json + .npy shards
Rebuild | Full manual | Hot registration
Embedding storage | Inline JSON | Binary arrays
Memory | Entire file in RAM | Memory-mapped
Top-k | Fixed | Adaptive (1/3/5)
Live MCP | None | stdio + SSE
03

Hybrid Retrieval: Three Lenses

Dense embedding, BM25, and SPLADE-lite run in parallel and capture different signals. The key insight: BM25-only achieves 99.0% top-1 on single-tool queries (vs 93.8% dense-only) because MCP tool names are strongly lexical, but BM25 drops to 80.0% on paraphrastic multi-hop chains where dense semantic matching is essential.

RETRIEVAL PIPELINE SUMMARY
Stage | What it does | Key settings (paper/code) | Why it matters
Hierarchical dense | Ranks servers first using embeddings from server summary + description, then ranks tools inside the selected servers | text-embedding-3-large (3072 dims); hierarchical server-to-tool scoring | Semantic recall under paraphrases; avoids flat brute force over the entire catalog
BM25 sparse | Builds one lexical "document" per tool by concatenating server text + tool text (see the sketch below the table) | BM25 over concatenated fields: server name/summary/description + tool name/description | Strong exact-name/ID precision (tool/server keywords)
SPLADE-lite | TF-IDF-style sparse expansion with unigram + bigram features (SPLADE-lite approximation) | Fusion weight w_splade = 0.35 (dense = 1.0, BM25 = 1.0) | Recovers compound/rare-term matches without the full SPLADE model cost
Weighted RRF fusion | Merges dense/BM25/SPLADE ranked lists with reciprocal-rank voting | RRF(j) = Σₗ wₗ / (60 + rₗ(j)) | Robust to score-scale differences; rewards tools that rank well across multiple lenses
Field-aware rerank | Adds a structured overlap bonus on top candidates | Sørensen-Dice overlap across structured fields (server name/summary, tool name/description, parameters), applied near the head of the fused list (H = 24) within a small score window (δ = 0.003) | Breaks near ties RRF alone cannot resolve
Intent disambiguation | Adjusts ranking for singular-destructive vs plural/batch intent | Bulk penalty ×0.82 for singular destructive; bulk boost ×1.08 for plural/batch | Prevents wrong single-vs-bulk tool selection
Confidence-gated handoff | Selects candidate count by uncertainty tier | Adaptive handoff size k ∈ {1, 3, 5}, chosen from a non-conformity score (conformal calibration) or gap thresholds | Keeps the prompt small when clear, widens safely when ambiguous
Retrieval guardrails | Applies abstention/null-route and overlap-aware expansion | If query/tool support is too low, the null-route returns an empty candidate list (abstention); if capability overlap is detected, an overlap cluster is expanded and confidence may be downgraded (over-aggressive triggering can widen prompts toward colliding tools and hurt precision) | Avoids high-confidence wrong tools when support is weak or capabilities overlap
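As referenced in the BM25 row above, a minimal sketch of the "one lexical document per tool" construction, using the third-party rank_bm25 package as a stand-in for VOTR's own index; the two example tools are hypothetical:

    from rank_bm25 import BM25Okapi

    tools = [
        {"server": "github", "summary": "GitHub API", "desc": "Repos, PRs, issues",
         "name": "search_repositories", "tool_desc": "Search for repositories"},
        {"server": "gitlab", "summary": "GitLab API", "desc": "Repos and MRs",
         "name": "list_merge_requests", "tool_desc": "List open merge requests"},
    ]

    def to_doc(t):
        # Concatenate server name + summary + description + tool name + description;
        # underscore splitting keeps exact tool-name tokens matchable.
        text = " ".join([t["server"], t["summary"], t["desc"], t["name"], t["tool_desc"]])
        return text.lower().replace("_", " ").split()

    bm25 = BM25Okapi([to_doc(t) for t in tools])
    scores = bm25.get_scores("search repositories".split())
    print(scores)  # the exact lexical match ranks the github tool first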
LIVE RETRIEVAL COMPARISON — try different queries
04

Weighted Reciprocal Rank Fusion

After the three ranked lists are produced, RRF merges them without requiring score normalisation: each tool's fused score is the sum of weighted reciprocal ranks across all lists. With k=60, ranks near the top contribute nearly equal amounts, so position #1 cannot dominate; what separates tools is how many lists place them near the top. A tool at rank #1 across all three lists wins decisively, while a tool that appears in only one list contributes just a single reciprocal-rank term.

RRF CALCULATOR
RRF(j) = Σ wₗ / (60 + rₗ(j)) — absent tools get rₗ(j) = ∞ (0 contribution)
DENSE (w=1.0)
BM25 (w=1.0)
SPLADE-lite (w=0.35)
FUSED OUTPUT → sorted by RRF score
RRF intuition (why agreement matters)

VOTR combines the three ranked lists (Dense, BM25, SPLADE-lite) with weighted Reciprocal Rank Fusion (RRF) (paper Eq. 5 / Algorithm 1). For each tool j, the fused score is the sum of per-list contributions: w · 1 / (k + rank), where k = 60 smooths rank differences. If a tool is absent from a list, its rank is treated as infinity, so that list contributes 0.

With k = 60, ranks close to the top (roughly 1–5) contribute almost the same amount. That means RRF rewards agreement across retrievers: a tool that appears near the top in multiple lists can beat a tool that only gets one perfect #1. In this demo the list weights match the paper: dense=1.0, BM25=1.0, SPLADE-lite=0.35.

The SPLADE-lite weight is set lower because SPLADE-lite is a lightweight approximation of full SPLADE (TF-IDF with unigram + bigram features and sublinear term-frequency scaling). It serves as an extra lexical cue, especially for compound or phrase-like matches, rather than being allowed to override the stronger dense and BM25 evidence.
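The fusion step is small enough to show in full. A minimal sketch matching Eq. 5, with k=60 and the paper's weights [1.0, 1.0, 0.35]; the tool IDs are made up:

    from collections import defaultdict

    def weighted_rrf(ranked_lists, weights, k=60):
        # RRF(j) = sum over lists of w_l / (k + rank_l(j));
        # a tool absent from a list simply contributes no term (rank = infinity).
        fused = defaultdict(float)
        for ranking, w in zip(ranked_lists, weights):
            for rank, tool in enumerate(ranking, start=1):
                fused[tool] += w / (k + rank)
        return sorted(fused.items(), key=lambda kv: -kv[1])

    dense  = ["gh.list_pull_requests", "gl.list_merge_requests", "gh.get_pull"]
    bm25   = ["gh.list_pull_requests", "gh.get_pull"]
    splade = ["gl.list_merge_requests", "gh.list_pull_requests"]

    for tool, score in weighted_rrf([dense, bm25, splade], [1.0, 1.0, 0.35]):
        print(f"{score:.5f}  {tool}")
    # gh.list_pull_requests wins: near-top agreement across all three lists beats
    # a single good rank in one list.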

WEIGHT SENSITIVITY — w_splade slider: 0 (ignored) · 0.35 · 1.0 (equal)
05

Field-Aware Reranking

After RRF fusion, the top-H=24 candidates within a score window δ=0.003 of the leader receive a structured field bonus. The reranker decomposes the query into server, action, and constraint fields, then computes token-level Sørensen-Dice overlap against each tool's metadata fields.

FIELD WEIGHTS (Sørensen-Dice): tool_name 0.35 · server_name 0.25 · tool_desc 0.20 · params 0.12 · server_summary 0.08 · explicit server match +0.22
B_j = Σ_f w_f · Dice(Q_f, T_jf) + 0.22·[server_match]
τ̂_j = RRF(j) + 0.0015 · B_j
Dice(A,B) = 2|A∩B| / (|A|+|B|) on camelCase-split, underscore-normalised tokens. λ=0.0015 keeps field bonus proportional to RRF score range.
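A minimal sketch of the bonus computation, assuming a simplified tokeniser (camelCase split + underscore normalisation) and a query already decomposed into the same five fields; only the weights, the +0.22 server bonus, and λ=0.0015 come from the text above:

    import re

    FIELD_WEIGHTS = {"tool_name": 0.35, "server_name": 0.25, "tool_desc": 0.20,
                     "params": 0.12, "server_summary": 0.08}
    SERVER_MATCH_BONUS, LAMBDA = 0.22, 0.0015

    def tokens(s: str) -> set:
        s = re.sub(r"([a-z])([A-Z])", r"\1 \2", s)       # split camelCase
        return set(re.split(r"[\s_]+", s.lower())) - {""}  # normalise underscores

    def dice(a: set, b: set) -> float:
        # Dice(A,B) = 2|A∩B| / (|A| + |B|)
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

    def rerank_score(rrf_score, query_fields, tool_fields, server_match=False):
        bonus = sum(w * dice(tokens(query_fields.get(f, "")),
                             tokens(tool_fields.get(f, "")))
                    for f, w in FIELD_WEIGHTS.items())
        if server_match:
            bonus += SERVER_MATCH_BONUS
        return rrf_score + LAMBDA * bonus   # tau-hat_j = RRF(j) + 0.0015 * B_j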
RERANK SIMULATOR
06

Confidence-Gated Handoff Policy

VOTR uses a conformal prediction-inspired non-conformity score to adaptively choose how many tools to return. Thresholds τ₁ and τ₃ are calibrated on held-out data. On the large suite, 49% of queries get k=1 (98.4% accuracy), 6.6% get k=3 (100%), and 44.4% get k=5 (93.7% top-1, 97.7% handoff@k).

NON-CONFORMITY SCORE CALCULATOR
nc(q) = -log₁₀(gap) ← gap signal f(δ)
      + 0.5 · (s₂/s₁ - 0.975) ← ratio signal g(s₁,s₂)
      - 0.3 · [explicit server match] ← structural h(ŝ)

Lower nc → higher confidence. A wider gap shrinks the −log₁₀(gap) term (gap 0.0023 → 2.64, as in the worked example below; gap 0.0001 → 4.0), while the ratio term g supplies a secondary signal when the absolute gap is small but the relative separation s₂/s₁ is still informative.

Worked example (calculator defaults): gap = 0.0023 · s₂/s₁ = 0.942
s₁ (top-1) = 0.0241 · s₂ (top-2) = 0.0218
gap term = 2.64 · ratio term = −0.02 · server term = 0.00
NON-CONFORMITY SCORE = 2.62 → LOW CONFIDENCE → Handoff k = 5 tools
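A sketch of the gate end-to-end. The formula terms come from the calculator above; the tier thresholds τ₁/τ₃ are placeholders here, since the calibrated values are not given:

    import math

    TAU_1, TAU_3 = 1.0, 2.0   # ASSUMED thresholds, for illustration only

    def non_conformity(s1, s2, explicit_server_match=False):
        gap = s1 - s2
        nc = -math.log10(gap)              # gap signal f(delta)
        nc += 0.5 * (s2 / s1 - 0.975)      # ratio signal g(s1, s2)
        nc -= 0.3 * explicit_server_match  # structural signal h
        return nc

    def handoff_k(nc):
        if nc <= TAU_1: return 1   # high confidence
        if nc <= TAU_3: return 3   # medium
        return 5                   # low

    nc = non_conformity(s1=0.0241, s2=0.0218)       # inputs from the example above
    print(f"nc = {nc:.2f} -> k = {handoff_k(nc)}")  # ≈ 2.60 here (widget shows 2.62) -> k = 5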
CALIBRATION RESULTS (large suite, n=500)
Tier | Count | % | Top-1 | Handoff@k | Avg k
High | 245 | 49% | 98.4% | 98.4% | 1.0
Medium | 33 | 6.6% | 100% | 100% | 3.0
Low | 222 | 44.4% | 93.7% | 97.7% | 5.0
Overall | 500 | 100% | 96.4% | 98.2% | 2.906
78% of all misses occur in the low-confidence tier (14/18). The policy correctly identifies hard queries — Handoff@k recovers 97.7% vs 93.7% strict Top-1 in this tier.
MISS ACCOUNTING BY TIER
Tier | Misses | Steps | Error rate | % of misses
High | 4 | 245 | 1.6% | 22%
Medium | 0 | 33 | 0.0% | 0%
Low | 14 | 222 | 6.3% | 78%
Total | 18 | 500 | 3.6% | 100%
4 of 18 misses are alias/equivalence issues (recovered by equivalence-aware labels → 97.2% Top-1, 91% compound). Not genuine retrieval failures.
07

Compressed Schema Injection

VOTR replaces MCP-Zero's full JSON-Schema blocks with a compact one-liner. The format retains exactly what the model needs: server name (routing), function signature with types (arg construction), short description (intent verification). Measured with tiktoken cl100k_base on the 2,806-tool index.

FORMAT COMPARISON — interactive
MCP-Zero JSON block: ~93.5 tokens
VOTR compressed line: ~26.2 tokens
Format | Tokens/tool | At k=3 | vs full catalog
MCP-Zero paper (their index) | ~143 | ~429 | baseline
MCP-Zero-style (our 2,806 tools) | 93.5 | 281 | 262,487 total
VOTR compressed (same index) | 26.2 | 79 | 72% reduction
VOTR full route avg (k≈2.9) | — | 230.3 total/route | 99.91% reduction
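A minimal sketch of rendering the compressed one-liner described above and measuring it with tiktoken's cl100k_base encoding; the GitHub tool and its parameters are hypothetical:

    import tiktoken

    def compress_schema(server, fn, params, description):
        # Format: [server: Name] tool_fn(param: type, opt?: type) → Short description
        sig = ", ".join(f"{name}{'?' if optional else ''}: {ptype}"
                        for name, ptype, optional in params)
        return f"[server: {server}] {fn}({sig}) → {description}"

    line = compress_schema(
        server="GitHub",
        fn="list_pull_requests",
        params=[("repo", "string", False), ("state", "string", True)],
        description="List pull requests for a repository",
    )
    print(line)

    enc = tiktoken.get_encoding("cl100k_base")
    print(len(enc.encode(line)))  # token count, in the ballpark of the 26.2 average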
HANDOFF PAYLOAD — k slider (k = 3 shown)
k=1 (high conf) · k=3 (medium) · k=5 (low conf)
08

Dynamic Registry & Hot Registration

New MCP servers can be added without restarting the service or re-embedding the existing catalog. Registration is atomic: the new ToolIndex and HybridRetriever are fully constructed before any live state is replaced, and under CPython's GIL the final reference swap is atomic with respect to concurrent request threads. In end-to-end agent runs this means VOTR-Orchestrator (the execution layer) keeps running while it calls the router for each server_intent/tool_intent hop; newly discovered servers become available to tool routing immediately.

REGISTRATION SIMULATOR
Simulates the hot registration flow for a new server. Watch the index state update atomically.
VOTR Orchestrator
1. Decomposes your request into server_intent/tool_intent hops.
2. Calls the VOTR router (/route) to retrieve candidate tools.
3. Wraps routed MCP tools and runs a multi-step tool-calling loop (with session memory).
Hot registration impact: while the router rebuilds indices, the orchestrator can keep executing; once the router swaps the new ToolIndex/HybridRetriever, the next hop can immediately use tools from the newly registered server.
REGISTRATION ALGORITHM
1. Validate uniqueness of s.name in I
2. e_desc ← E(s.description)       [OpenAI API]
3. e_sum  ← E(s.summary)           [OpenAI API]
4. for ti in {t₁,...,tₘ}:
   e_i ← E(ti.name ∥ ti.description)
5. Append e_desc, e_sum to server arrays
6. Append {e₁,...,eₘ} to tool array
7. Update server/tool index mapping
8. Rebuild BM25 + SPLADE-lite indices
9. Persist updated .npy shards to disk
10. Atomically swap engine.index ref  ← GIL
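A sketch of step 10's swap, assuming an Engine object that holds the live index; copy_with() and search() are hypothetical method names. The point is structural: build the new index completely, then rebind one reference.

    class Engine:
        def __init__(self, index):
            self.index = index  # live ToolIndex + HybridRetriever

        def register_server(self, server):
            # Steps 1-8: embed, append, rebuild sparse indices, fully off to the side.
            new_index = self.index.copy_with(server)
            # Step 9: persist updated .npy shards to disk here.
            self.index = new_index  # step 10: one atomic reference rebind (GIL)

    def handle_route(engine, query):
        index = engine.index        # snapshot one consistent reference
        return index.search(query)  # sees old or new index, never a half-built one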
TRANSPORT DISCOVERY
POST /register/discover
Spawns stdio subprocess → MCP initialize handshake → tools/list → normalize → register. Timeout: 2–120s (default 20s).
POST /register/discover/sse
Same JSON-RPC over HTTP/SSE endpoint. Suitable for remote/containerised servers.
POST /session/clear
Purges session tool history. Session memory uses TTL=24h with lazy GC on writes.
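A minimal sketch of the TTL-24h, lazy-GC session memory behind /session/clear; the class and method names are assumptions:

    import time

    TTL_SECONDS = 24 * 3600

    class SessionStore:
        def __init__(self):
            self._sessions = {}   # session_id -> (last_write_ts, used_tools)

        def record(self, session_id, tool_name):
            self._gc()            # lazy GC: expired entries purged on writes only
            ts, tools = self._sessions.get(session_id, (0.0, set()))
            tools.add(tool_name)
            self._sessions[session_id] = (time.time(), tools)

        def used_tools(self, session_id):
            entry = self._sessions.get(session_id)
            if entry is None or time.time() - entry[0] > TTL_SECONDS:
                return set()      # expired sessions read as empty
            return entry[1]

        def clear(self, session_id):  # backs POST /session/clear
            self._sessions.pop(session_id, None)

        def _gc(self):
            now = time.time()
            expired = [sid for sid, (ts, _) in self._sessions.items()
                       if now - ts > TTL_SECONDS]
            for sid in expired:
                del self._sessions[sid]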
Zero-downtime verified: Control Hit@1 = 1.000 before and after registration in all experiments, including Bloomberg cold-start case.
09

Evaluation Results

Evaluated on a 309-server / 2,806-tool corpus across single-tool, multi-hop, and multi-tool task types at small (n=100), medium (n=250), and large (n=500) routing-step scales. Multi-hop and multi-tool suites use 5 steps/case for compound chain metrics.

FULL-STACK FUNCTIONAL CORRECTNESS
MULTI-HOP CHAIN SCALING

Compound Success@k requires EVERY hop in the chain to succeed. At k=5, VOTR achieves 100% chain success on all suites except the two adversarial ones (0%).

Dataset | Hops | Hop@1 | Chain@k
10-hop | 10 | 90.0% | 100%
20-hop | 20 | 95.0% | 100%
25-hop strict | 25 | 92.0% | 100%
50-hop unique servers | 50 | 100% | 100%
50-hop realistic hard | 50 | 100% | 100%
50-hop adversarial valid | 50 | 58.0% | 0%
50-hop adversarial pure | 50 | 0% | 0%
Key limitation: Coordinated adversarial server names + paraphrased intents cause complete chain failure. No adversarial training layer yet.
LIVEMCPBENCH BENCHMARK

LiveMCPBench reveals VOTR's sensitivity to input format: under the native dual-field protocol performance is strong, while forcing raw step-queries degrades it significantly.

Protocol | n | R@1 | nDCG@5
Native dual-field (policy payloads) | 268 | 87.7% | 90.9%
Paper-faithful tool-to-agent | 82 | 70.7% | 79.9%
Full95 reconstructed stepwise | 268 | 52.4% | 59.9%
Router format exact subset | 82 | 46.3% | 62.2%
Tool-to-Agent Retrieval [7] | — | 61.0% | —
Key: The gap is driven by VOTR+orchestrator interface sensitivity, not a retriever-only ceiling. VOTR’s router emits a structured “policy payload” (dual fields like server_intent + tool_intent), along with a confidence-gated handoff candidate set (adaptive k∈{1,3,5}) and guardrails (overlap-aware expansion + abstention/null-route). The orchestrator then uses this contract to format the agent step consistently (including compressed schema injection), so the downstream model calls the right tool interface. When you force raw step-queries or truncate the router payload, the orchestrator can’t reliably reconstruct that intent/prompt contract, so Recall@1 drops; with the native payload contract, Recall@1 rises from 52% to 88%.
10

Ablation Study

VOTR is not optimised for a single metric. BM25-only wins on single-tool lexical queries (99.0%) but collapses on multi-hop chains (80.0%). The full-stack configuration is a multi-objective operating point balancing strict Top-1, chain success, abstention safety, and token budget.

ABLATION COMPARISON — interactive
Profile | Top-1 | Top-3 | Top-5 | Handoff@k | Notes