Evals for recruiting agents: what we measure after ship.

The week an agent ships is not the week it starts earning. The real work is what happens in weeks 2 through 12 — and if you're not measuring it, the agent will drift, the client will stop using it, and you'll be reading a renewal cancellation email wondering what happened.

We've deployed enough recruiting agents now to have a clear view of which metrics predict renewal and which are theater. The shortlist below is what we instrument on day one of every deploy.

Response rate is a trap

Everyone defaults to response rate because it's easy to count. But a 40% response rate on a poorly segmented list is worse than 12% on a tight one. An agent that gets more replies by being pushier burns your brand. We stopped looking at this number in isolation in month two.

If a metric can be juiced by making the agent more aggressive, it's not a renewal metric. It's a vanity metric.

The six we actually track

Metric	What	Why it matters
Ownership Rate	% messages the recruiter would've sent themselves, unedited	The only metric that predicts renewal. Measured monthly via blind review.
Escalation Precision	% of threads flagged for a human where the human agrees with the flag	Agent calibrated to your sense of "hot".
Kill-Switch Rate	% of threads closed by agent without human intervention	Agent knows when to stop. Absence of this = brand damage.
Cost / Useful Outcome	$ (API + seat + time) per placement-eligible candidate surfaced	Proves retainer ROI to your client's CFO.
Drift Score	diff between week-1 output distribution and week-N	LLM providers ship silent updates. This catches them.
Latency p95	time from trigger to message out	Voice/chat agents die on this. Boring but non-negotiable.
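
Four of these rows reduce to simple arithmetic. Here is a minimal sketch, assuming a hypothetical per-thread record: the `Thread` fields and the function names are illustrative, not a production schema, and nearest-rank is just one common way to define p95.

```python
import math
from dataclasses import dataclass


@dataclass
class Thread:
    flagged_for_human: bool   # agent escalated this thread to a recruiter
    human_agreed: bool        # recruiter agreed with the flag (only read when flagged)
    closed_by_agent: bool     # agent ended the thread with no human intervention
    latency_seconds: float    # time from trigger to message out


def escalation_precision(threads):
    """Of the threads flagged for a human, the share the human agreed with."""
    flagged = [t for t in threads if t.flagged_for_human]
    if not flagged:
        return None
    return sum(t.human_agreed for t in flagged) / len(flagged)


def kill_switch_rate(threads):
    """Share of all threads the agent closed on its own."""
    if not threads:
        return None
    return sum(t.closed_by_agent for t in threads) / len(threads)


def cost_per_useful_outcome(api_cost, seat_cost, time_cost, candidates_surfaced):
    """Total spend divided by placement-eligible candidates surfaced."""
    if candidates_surfaced == 0:
        return None
    return (api_cost + seat_cost + time_cost) / candidates_surfaced


def latency_p95(threads):
    """Nearest-rank 95th percentile of trigger-to-send latency."""
    values = sorted(t.latency_seconds for t in threads)
    if not values:
        return None
    rank = math.ceil(95 * len(values) / 100)  # 1-based rank
    return values[rank - 1]
```

Each function returns `None` rather than zero when there is nothing to measure, so an empty week does not read as a perfect or a failing one.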

The one metric that predicts renewal

Ownership rate. Every month we pull a random sample of 50 messages the agent sent and ask the recruiter in charge: would you have sent this, unedited, to this contact? The answer is a binary yes or no. We compute the percentage.
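
The computation itself is a proportion over yes/no verdicts. The sketch below adds one thing the process above does not mention: a Wilson score interval, included as an assumption about how you might gauge the noise in a 50-message sample. Function names are illustrative.

```python
import math


def ownership_rate(verdicts):
    """verdicts: list of bools, True = 'I would have sent this unedited'."""
    if not verdicts:
        raise ValueError("need at least one reviewed message")
    return sum(verdicts) / len(verdicts)


def wilson_interval(yes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = yes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Example: 35 "yes" verdicts out of a 50-message monthly sample.
verdicts = [True] * 35 + [False] * 15
rate = ownership_rate(verdicts)
low, high = wilson_interval(sum(verdicts), len(verdicts))
print(f"ownership rate {rate:.0%}, interval {low:.0%} to {high:.0%}")
```

At 50 messages the interval is wide, so a single month sitting near a threshold is worth reading alongside the trend over several months.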

Fig 01 · Ownership rate vs. client renewal (n=14 deploys, 12 months)

Every agent above 70% ownership renewed. Both churns came from the 55–70% band, where 3 of 5 deploys renewed. No deploy in this sample finished below 55%.

Renewed · above 70%: 9
Renewed · 55–70%: 3
Churned · 55–70%: 2
Churned · below 55%: 0

The chart is the whole argument. If a recruiter wouldn't send the message themselves, they're going to edit it every time — which means the agent is saving nobody's time. They'll cancel the retainer and be right to.

How we instrument it

Every outbound message gets logged with the full prompt, the model, and a hash of the contact's CRM state. Every week we surface 20 random messages to the recruiter through a tiny Slackbot, /review-agent, where they click ship or edit. Those clicks become the dataset.
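
To make the shape of that log concrete, here is a minimal, library-free sketch. It is written in Python for brevity and is not the 90-line version mentioned below. The record fields, the SHA-256 over canonical JSON, and all function names are assumptions for illustration.

```python
import hashlib
import json
import random
from dataclasses import dataclass


def crm_state_hash(crm_state: dict) -> str:
    """Hash a contact's CRM state; sorted keys make the hash order-independent."""
    canonical = json.dumps(crm_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class OutboundLog:
    message_id: str
    prompt: str      # full prompt sent to the model
    model: str       # model identifier used for this message
    crm_hash: str    # hash of the contact's CRM state at send time
    body: str        # the message the agent actually sent


def log_outbound(message_id, prompt, model, crm_state, body) -> OutboundLog:
    """Build the log record for one outbound message."""
    return OutboundLog(message_id, prompt, model, crm_state_hash(crm_state), body)


def weekly_review_batch(logs, k=20, seed=None):
    """Pick up to k random messages to show the recruiter this week."""
    rng = random.Random(seed)
    return rng.sample(logs, min(k, len(logs)))


def ship_rate(clicks: dict) -> float:
    """clicks maps message_id -> 'ship' or 'edit'; returns the share shipped."""
    if not clicks:
        raise ValueError("no review clicks recorded")
    bad = set(clicks.values()) - {"ship", "edit"}
    if bad:
        raise ValueError(f"unexpected click values: {bad}")
    return sum(v == "ship" for v in clicks.values()) / len(clicks)
```

Storing a hash rather than the CRM record itself keeps contact data out of the eval log while still letting you tell whether two messages were written against the same contact state.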

Want to see the code? It's 90 lines, and we'll write it up properly next month. Short version: Vercel AI SDK for logging, Slack Bolt for the review UI, Upstash for the queue.

Drift is the quiet killer

In July 2025 we watched three agents quietly get worse over two weeks. No code changes. The provider had rolled out an update to the underlying model that subtly shifted tone — the responses got longer, more formal, less like our recruiters. We only caught it because drift score alerted.

Since then, we pin model versions for every shipped agent and run a weekly diff. The upgrade happens deliberately, not by accident. This is one of the things a retainer pays for — someone whose job is to watch for this so you don't wake up to angry candidates.
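
One way a weekly diff like this could be scored is sketched below. How drift score is actually computed is not specified above, so everything here is an assumption: it compares word-count histograms of week-1 and week-N messages using Jensen-Shannon divergence, and the bin edges are hypothetical.

```python
import math

# Hypothetical word-count bin edges: <25, 25-49, 50-99, 100-199, 200+
BIN_EDGES = (25, 50, 100, 200)


def length_histogram(messages, edges=BIN_EDGES):
    """Normalized histogram of message lengths, measured in words."""
    if not messages:
        raise ValueError("need at least one message")
    counts = [0] * (len(edges) + 1)
    for text in messages:
        words = len(text.split())
        counts[sum(words >= edge for edge in edges)] += 1
    return [c / len(messages) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def drift_score(baseline_messages, current_messages):
    """Compare week-1 output lengths against the current week's."""
    return js_divergence(
        length_histogram(baseline_messages),
        length_histogram(current_messages),
    )
```

Length alone would have caught the "responses got longer" half of the July incident. Catching "more formal" would need additional features, and the alert threshold has to be tuned per agent.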

What we're still figuring out

The hardest metric we haven't cracked: long-tail brand damage. A candidate who never replies but quietly thinks less of your agency is invisible in every dashboard. Our only proxy right now is sentiment analysis on replies — which is a bad proxy because the unhappy ones don't reply. Open question. If you've solved this, I want to hear from you.

Meantime, if you're running agents without ownership-rate reviews, start there. It's the cheapest eval you can do and the one most correlated with your ability to keep shipping.