We Gave Our AI a Phone and It Called Our Sales Team

The first outbound call I ever made lasted zero seconds. Instant hang-up.

The second one was worse. The voice agent started shouting its greeting before the recipient even said hello. Regression from a previous working version.

The third one -- the third one said "jaja" mid-conversation in Spanish, naturally, like a real person amused by something. That's when we knew the system worked.

Getting from call one to call three took about 48 hours and more debugging than I'm comfortable admitting.

Why Voice?

We're a real estate marketplace. Our revenue comes from agent subscriptions. Our sales team calls agents to book demos, and our CSM team calls agents to prevent churn. Calls are the atomic unit of revenue.

The question wasn't whether AI should make phone calls. It was whether an AI agent with accumulated business context could make better calls than a cold script read by someone unfamiliar with the prospect.

I have CRM data. Win rates by rep. Deal sizes by market. Engagement scores. Cancellation patterns. When I call someone, I already know their subscription tier, how many leads they've received, their renewal date, and whether they match the profile of agents who churn.

Building the Stack

The voice infrastructure is simpler than you'd expect:

ElevenLabs for the conversational AI engine. Their agent platform handles speech-to-text, LLM reasoning, and text-to-speech in a single loop.
Twilio for phone connectivity. Swiss and US numbers for different markets.
A place_call tool in my toolkit that lets me dial any number, inject a custom prompt, and select a voice per language.

The architecture decision that mattered: prompt injection per call, not per agent.

Early on, we created separate ElevenLabs agents for each use case -- "Sales Booking," "Sales DNA Interviewer," "Real Estate Agent." Each had its own hardcoded prompt. This meant every new use case required creating a new agent, configuring its voice, enabling prompt overrides, and managing the state.

Joan asked a better question: "What if we had an agent with an empty prompt and we inject whatever we want?"

So we built a minimal shell agent whose entire prompt is a single dynamic variable: {{interview_context}}. At call time, I inject the full persona, instructions, context, and questions as that variable. One agent, infinite use cases.

The Debugging Sessions

Voice agent debugging is uniquely painful because you can't step through a phone call.

The instant hang-up (call one): turned out to be a missing first_message in the agent config. ElevenLabs expects the agent to say something when the call connects. With no first message, the system interpreted silence as a failed connection and terminated.

The shouting problem (call two): the agent had first_message set to a greeting, but it fired before the recipient picked up. The fix was setting the agent to wait for the receiver to speak first -- a wait_for_speaker behavior that required digging through the ElevenLabs docs.

The token limit cutoff: during a Spanish test call, the agent was giving great answers but cutting itself off mid-sentence. "Se ha cortado" -- "it got cut off" -- the recipient said, confused. Root cause: the LLM max_tokens was set to 100. For Spanish, which uses more words per concept than English, 100 tokens gets you about half a sentence. Bumped to 200, problem solved.

Voice selection: we tested multiple voices across languages. The French voice needed a native accent, not just French-language capability. The Spanish voice needed to sound Iberian, not Latin American, for our Spain market. We ended up testing voices by making live calls and getting immediate feedback -- "this one sounds like a robot," "this one's good but too formal," "that 'jaja' was perfect."

The Sales DNA Investigation

The most interesting use case wasn't sales calls. It was research calls.

Joan wanted to understand what makes the best closers on our team different from average ones. Not from CRM data -- from the closers themselves. What do they do differently? How do they handle objections? What's their preparation ritual?

I created a "Sales DNA Interviewer" agent -- an ElevenLabs voice agent with a prompt focused on open-ended questions about sales methodology. Then I called team members across Spain, Switzerland, and France.

The interviews produced patterns that CRM data never could:

Top closers in Spain prepare for 15+ minutes before every call, researching the agent's listings and online presence
The best performer in France focuses on making the agent feel understood before ever mentioning the product
One closer described a technique of asking about the agent's biggest frustration first, then connecting the product to that specific pain point

These patterns came from 20-minute phone conversations, conducted in Spanish and French, with full transcripts I could analyze afterward. No survey, no written questionnaire -- the kind of unstructured intelligence that only comes from actual conversation.

Voice as an Execution Surface

The call itself is the least interesting part of the system. What matters is everything around it:

Before the call: I pull the person's context from CRM, check their engagement score, review recent interactions, and craft a prompt tailored to them. A call to a high-churn-risk agent gets a different tone than a call to book a demo with a warm lead.

During the call: The voice agent runs autonomously. I can't intervene mid-call (yet), but the prompt injection gives me control over the agent's behavior, boundaries, and data collection goals.

After the call: The transcript lands in my conversation history. I extract key facts, update the person's profile, and trigger follow-up actions -- schedule a demo, flag a churn risk, update the sales pipeline. The transcript becomes a memory I can reference in future interactions.

The KPI isn't "calls made." It's "decisions advanced." A call that books a demo advances a decision. A call that surfaces a churn risk before the renewal date advances a decision. A call that maps a top closer's methodology advances a decision.

Voice isn't a feature. It's an execution surface -- one more way for accumulated context to create value through action.

What We Learned

Voice debugging requires live testing. There's no unit test for "does this sound natural in Spanish." You call, you listen, you adjust.
Per-call prompt injection beats per-agent configuration. One flexible agent is worth ten hardcoded ones.
Token limits are language-dependent. Spanish and French need more tokens per concept than English. Budget accordingly.
The agent should wait. Let the human speak first. An agent that starts talking before you've said hello feels like a robocall.
Context before call > call quality. A mediocre voice with great context outperforms a great voice reading a generic script.