WTF Voice — Agentic Voice Studio

Ten thousand calls. Zero humans. One voice.

A visual studio to design, deploy, and supervise a fleet of autonomous voice agents — agents with memory, tools, and knowledge that book appointments, collect payments, and escalate to humans. Meet Ananya: Indian English with natural Hinglish, self-improving call by call via Fitty intelligence.


● LIVE: Ananya is on a call right now · 10,000+ calls / day · ≤ 800ms voice-to-voice · ₹6–10 per call · 60+ gyms live

[ 01 ] What WTF Voice is

Not IVR. Not a bot. An actual voice workforce.

WTF Voice is a fleet of real-time conversational AI agents running on streaming speech-to-text → LLM → text-to-speech pipelines with sub-second perceived latency. It handles natural interruptions, detects when the human has finished speaking, and responds like a person — not a menu.
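The streaming idea in the paragraph above can be sketched as three chained async generators, where each stage consumes the previous stage's output as it arrives. This is a minimal illustration, not the production pipeline: `fake_stt`, `fake_llm`, `fake_tts`, and `microphone` are hypothetical stand-ins that pass text around instead of calling real speech or language models.

```python
import asyncio
from typing import AsyncIterator


async def microphone() -> AsyncIterator[bytes]:
    """Hypothetical audio source: each 'frame' is just an encoded word."""
    for word in ("renew", "my", "membership"):
        yield word.encode("utf-8")


async def fake_stt(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in STT stage: emits one word per incoming frame."""
    async for frame in frames:
        yield frame.decode("utf-8")


async def fake_llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in LLM stage: reads the whole utterance, then streams reply tokens."""
    utterance = " ".join([word async for word in words])
    for token in f"You said: {utterance}".split():
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS stage: turns each token into a chunk as soon as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_call() -> list[bytes]:
    """Wire the stages together and collect the outgoing 'audio' chunks."""
    chunks = []
    async for chunk in fake_tts(fake_llm(fake_stt(microphone()))):
        chunks.append(chunk)
    return chunks


audio_out = asyncio.run(run_call())
```

Because the TTS stage starts emitting on the first reply token rather than waiting for the full reply, time to first audio depends on the first token, not on the length of the response.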

It handles both inbound calls (prospects, members, payments, complaints) and outbound campaigns (lead qualification, renewals, win-backs, reminders) — on local +91 numbers, in production, across 60+ WTF gym locations.

The default persona, Ananya, speaks Indian English with natural code-mixed Hinglish — because that's how members actually talk. She's backed by Sarvam's saarika:v2.5 for speech recognition and bulbul:v2 for voice synthesis, with VoxCPM voice-cloning for brand-specific personas.

Real-time streaming STT→LLM→TTS · Barge-in & turn detection · Natural Hinglish · Inbound + Outbound · Local +91 DIDs · Live in production

● LIVE CALL — ANANYA

LATENCY BUDGET

Transport (network): 50–150ms
Voice activity detection: 10–50ms
STT (saarika:v2.5): 100–250ms
LLM TTFT: 300–500ms
TTS first-audio (bulbul:v2): 100–200ms
Perceived voice-to-voice: ≤ 800ms
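As a quick arithmetic check on the budget, the stage ranges can be added up against the 800ms target. The sketch below assumes the stages run strictly one after another, which is a simplification made here for illustration: in a streaming pipeline the stages overlap, so the serial sum overstates what a caller perceives.

```python
# Stage ranges in milliseconds, copied from the latency budget above.
BUDGET_MS = {
    "transport": (50, 150),
    "vad": (10, 50),
    "stt": (100, 250),
    "llm_ttft": (300, 500),
    "tts_first_audio": (100, 200),
}
TARGET_MS = 800

# Serial sums: every stage at its lower bound, then every stage at its upper bound.
best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())

# Slack left under the target when every stage hits its lower bound.
headroom = TARGET_MS - best_case

print(best_case, worst_case, headroom)  # 560 1150 240
```

Read this way, the lower bounds leave 240ms of slack, while the upper bounds added serially come to 1150ms. Staying at or under 800ms therefore depends on stages overlapping through streaming and on most stages running well inside their ranges.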

SARVAM · bulbul:v2

[ 02 ] A call, in real time

It calls.
It closes.
It never sleeps.

Every conversation is a live streaming pipeline. The moment a member finishes speaking, the VAD fires, the transcription streams, the LLM generates, and audio begins playing back, so the member hears no more than about 800 milliseconds of silence.

Barge-in detection means Ananya yields the moment a member starts speaking mid-sentence — no robotic wait-your-turn. It feels like a person because the architecture demands it.
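The barge-in behaviour described above can be pictured as a small state machine driven by per-frame VAD decisions. This is a sketch under stated assumptions: `TurnManager` and its event names are hypothetical and are not taken from Pipecat or the WTF Voice codebase.

```python
from dataclasses import dataclass, field


@dataclass
class TurnManager:
    """Hypothetical turn-taking state for one call."""
    agent_speaking: bool = False
    caller_speaking: bool = False
    events: list[str] = field(default_factory=list)

    def start_agent_speech(self) -> None:
        self.agent_speaking = True
        self.events.append("agent_started")

    def on_vad(self, speech_detected: bool) -> None:
        """Handle one VAD decision on the caller's incoming audio."""
        if speech_detected and not self.caller_speaking:
            self.caller_speaking = True
            if self.agent_speaking:
                # Barge-in: stop playback at once and yield the turn.
                self.agent_speaking = False
                self.events.append("agent_interrupted")
            self.events.append("caller_started")
        elif not speech_detected and self.caller_speaking:
            # Caller went quiet: treat it as the end of their turn.
            self.caller_speaking = False
            self.events.append("caller_finished")


turns = TurnManager()
turns.start_agent_speech()
for frame_has_speech in (False, True, True, False):
    turns.on_vad(frame_has_speech)
```

A real detector would normally wait for a short run of silent frames before declaring the turn over, so that a breath or a brief pause is not mistaken for the end of a sentence. That smoothing is left out here to keep the sketch short.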

↯

Sub-second response

Streaming STT → LLM → TTS with WebRTC transport. No polling, no buffering, no pauses.

⇄

Natural turn-taking

VAD-based barge-in lets members interrupt naturally. Ananya yields, listens, and re-engages.

◉

Persona fidelity

VoxCPM voice-clone + Sarvam bulbul:v2 deliver a consistent, brand-calibrated voice every call.

[ 03 ] The voice workforce, by the numbers

10,000+ AI calls placed every single day

≤ 800ms perceived voice-to-voice latency

Connect rate on local +91 numbers

₹6–10 per call vs ₹50+ for a human agent

Callers who rate it "sounds human"

Renewal uplift vs human dialers

Average call QA score

60+ gyms live on the engine today

[ 04 ] Capabilities

Everything a voice team does. Automated.

// 01 · ENGINE

Real-time conversational engine

Streaming STT → LLM → TTS pipeline powered by Pipecat. Barge-in, turn detection, and latency-budget discipline baked in. Built to a ≤ 800ms perceived voice-to-voice budget.

// 02 · BUILDER

Drag-and-drop agent builder

A React Flow canvas where any operator can wire conversation nodes, branch conditions, and tool calls visually. First working bot live in under two minutes — no code required.
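One way to picture what the canvas produces is a plain graph of nodes with branch conditions and optional tool calls. The structure below is an assumption made for illustration: the node names, the `branches` and `tool` keys, and the `"*"` wildcard are hypothetical and do not describe the actual export format.

```python
# Hypothetical exported flow: each node has a line to say, an optional tool call,
# and branches keyed by the caller intent that selects the next node.
FLOW = {
    "greet": {
        "say": "Is now a good time to talk about your renewal?",
        "branches": {"yes": "offer_link", "no": "goodbye"},
    },
    "offer_link": {
        "say": "I am sending a payment link now.",
        "tool": "send_payment_link",
        "branches": {"*": "goodbye"},  # "*" matches any intent
    },
    "goodbye": {"say": "Thanks for your time.", "branches": {}},
}


def walk(flow: dict, start: str, intents: list[str]) -> list[str]:
    """Follow the branches for a sequence of detected intents; return visited nodes."""
    path = [start]
    node = start
    for intent in intents:
        branches = flow[node]["branches"]
        next_node = branches.get(intent, branches.get("*"))
        if next_node is None:
            break  # no matching branch: the conversation ends at this node
        path.append(next_node)
        node = next_node
    return path


renewal_path = walk(FLOW, "greet", ["yes", "done"])
decline_path = walk(FLOW, "greet", ["no"])
```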

// 03 · OUTBOUND

Outbound campaign dialer

Pacing engine, answering-machine detection, automatic retries and callbacks. DND, DLT, TRAI, and consent pre-flight before any dial. Manages 10K+ calls a day without a single human in the loop.
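The automatic retries mentioned above could follow a simple backoff schedule. The sketch below is illustrative only: `retry_schedule`, the three-attempt cap, and the 30-minute base delay are assumptions, not the dialer's real settings.

```python
from datetime import datetime, timedelta


def retry_schedule(first_attempt: datetime,
                   max_attempts: int = 3,
                   base_delay: timedelta = timedelta(minutes=30)) -> list[datetime]:
    """Return attempt times where the gap doubles after each unanswered call."""
    times = [first_attempt]
    delay = base_delay
    for _ in range(max_attempts - 1):
        times.append(times[-1] + delay)
        delay *= 2
    return times


attempts = retry_schedule(datetime(2025, 1, 6, 10, 0))
# Gaps of 30 and then 60 minutes: 10:00, 10:30, 11:30
```

Doubling the gap spreads retries out so an unanswered number is not called again and again within a short window.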

// 04 · INBOUND

Inbound routing on +91 DIDs

Local numbers across every WTF brand answered instantly by published agent workflows. No hold music, no IVR trees — straight to a live conversation with Ananya or any configured persona.

// 05 · INTELLIGENCE

"Fitty" intelligence layer

Ranks the daily call list by next-best-action probability. Audits and scores every completed call automatically. Learns from outcomes to sharpen the list tomorrow.
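Ranking by next-best-action probability can be sketched as "keep each member's highest-scoring action, then sort". The `Candidate` type, the action names, and the scores below are made-up placeholders; how Fitty actually computes its probabilities is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    member_id: str
    action: str
    probability: float  # hypothetical model score in [0, 1]


def rank_call_list(candidates: list[Candidate], limit: int) -> list[Candidate]:
    """Keep the best action per member, then return the top `limit` by probability."""
    best: dict[str, Candidate] = {}
    for candidate in candidates:
        current = best.get(candidate.member_id)
        if current is None or candidate.probability > current.probability:
            best[candidate.member_id] = candidate
    return sorted(best.values(), key=lambda c: c.probability, reverse=True)[:limit]


scored = [
    Candidate("m1", "renewal", 0.62),
    Candidate("m2", "win_back", 0.35),
    Candidate("m1", "payment_reminder", 0.80),
    Candidate("m3", "visit_reminder", 0.50),
]
call_list = rank_call_list(scored, limit=2)
```

Deduplicating per member before sorting means nobody is dialled twice in one day for two different reasons.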

// 06 · INTEGRATIONS

Knowledge base, payments & escalation

RAG over pgvector for live Q&A. Razorpay payment links sent mid-call. Seamless human-agent escalation with full call context, transcript, and sentiment handed off in real time.
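Retrieval over pgvector comes down to ordering stored chunks by vector distance to the query embedding. In pgvector the `<=>` operator is cosine distance, so a query might look like `SELECT content FROM kb_chunks ORDER BY embedding <=> %s LIMIT 3`, where the table and column names are hypothetical. The pure-Python sketch below reproduces that ordering on toy three-dimensional vectors so it runs without a database; the chunk texts and embeddings are placeholders.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Placeholder knowledge-base chunks with toy embeddings (not real model output).
KB = {
    "chunk about membership freeze": [0.9, 0.1, 0.0],
    "chunk about payment links": [0.1, 0.9, 0.1],
    "chunk about personal training": [0.0, 0.2, 0.9],
}


def top_k(query_embedding: list[float], k: int) -> list[str]:
    """Return the k chunk texts closest to the query embedding."""
    ranked = sorted(KB, key=lambda text: cosine_distance(KB[text], query_embedding))
    return ranked[:k]


hits = top_k([0.2, 0.8, 0.0], k=2)
```

The retrieved chunks would then be placed in the LLM prompt so the agent answers from the knowledge base instead of from memory.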

[ 05 ] What it runs

Every touchpoint in the member lifecycle. Handled.

WTF Voice doesn't replace a telecaller for one use case — it replaces the entire inbound and outbound calling function across every stage of the member journey.

ACQUISITION

Lead qualification & sales

Calls new enquiries back within seconds, qualifies interest, pitches the right plan, and converts to a trial or paid membership.

RETENTION

Renewals & payment collection

Proactive renewal calls before expiry, overdue payment collection with live Razorpay link delivery, and confirmation follow-ups.

ENGAGEMENT

Reminders & check-ins

Visit reminders, BMI & re-test check-in calls, class schedule confirmations, and personalised health nudges based on member data.

RE-ACTIVATION

Win-back & birthday calls

Churned member win-back campaigns with personalised offers, birthday call sequences, and lapsed-visit re-engagement flows.

[ 06 ] Built different

The voice-agent platform built to outrun Vapi & Retell.

No per-minute markup, no black box, no lock-in. Proprietary and owned end-to-end, and compliance hard-railed in code — not a settings panel.

// 01

Owned end-to-end

The orchestration layer (agents, workflows, data, and infrastructure) is proprietary and built in-house, and third-party speech models and carriers stay swappable. No black boxes we don't control, no per-seat tax.

// 02

India-first, Hinglish-native

Sarvam Indic speech models, local +91 numbers, code-mixed scripts — purpose-built for a billion-member market, not retrofitted.

// 03

Compliance hard-railed

TRAI TCCCPR 2018 (with the Feb 2025 amendment), DLT registration, DND scrubbing, and DPDP Act 2023 consent gates are enforced in code before any dial fires.
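"Enforced in code" can be read as a gate function that every dial request must pass. The sketch below is a deliberately simplified assumption: `Lead`, `preflight`, and the three boolean checks are hypothetical, and the real regulations involve preference categories and consent records that a few booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lead:
    phone: str
    on_dnd_registry: bool  # hypothetical result of a DND scrub
    has_consent: bool      # hypothetical recorded-consent flag


def preflight(lead: Lead, sender_registered_on_dlt: bool) -> list[str]:
    """Return the reasons a dial must be blocked; an empty list means clear to dial."""
    blocks = []
    if lead.on_dnd_registry:
        blocks.append("dnd")
    if not lead.has_consent:
        blocks.append("no_consent")
    if not sender_registered_on_dlt:
        blocks.append("dlt_unregistered")
    return blocks


def dial(lead: Lead, sender_registered_on_dlt: bool) -> str:
    """Refuse to place the call unless every pre-flight check passes."""
    blocks = preflight(lead, sender_registered_on_dlt)
    if blocks:
        return "blocked: " + ", ".join(blocks)
    return "dialing " + lead.phone


ok = dial(Lead("+910000000000", on_dnd_registry=False, has_consent=True), True)
refused = dial(Lead("+910000000001", on_dnd_registry=True, has_consent=False), False)
```

Putting the checks inside `dial` itself, rather than behind a setting, means there is no code path that places a call without running them.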

// 04

Model-agnostic

Swap STT, LLM, or TTS providers in one config line. Sarvam, ElevenLabs, or any Pipecat-compatible model — the router adapts as the leaderboard shifts.
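A "one config line" swap usually implies a registry that maps provider names to interchangeable implementations. The sketch below assumes that shape: `register_tts`, `TTS_PROVIDERS`, and `CONFIG` are hypothetical names, and the two provider functions are stubs that return tagged bytes rather than calling the Sarvam or ElevenLabs APIs.

```python
from typing import Callable

# Hypothetical registry of TTS backends, keyed by the name used in config.
TTS_PROVIDERS: dict[str, Callable[[str], bytes]] = {}


def register_tts(name: str):
    """Decorator that adds a synthesis function to the registry under `name`."""
    def decorator(fn: Callable[[str], bytes]) -> Callable[[str], bytes]:
        TTS_PROVIDERS[name] = fn
        return fn
    return decorator


@register_tts("sarvam")
def sarvam_tts(text: str) -> bytes:
    return b"sarvam:" + text.encode("utf-8")  # stub, no network call


@register_tts("elevenlabs")
def elevenlabs_tts(text: str) -> bytes:
    return b"elevenlabs:" + text.encode("utf-8")  # stub, no network call


CONFIG = {"tts": "sarvam"}  # the single line an operator would change


def synthesize(text: str, config: dict = CONFIG) -> bytes:
    """Look up the configured provider and synthesize with it."""
    return TTS_PROVIDERS[config["tts"]](text)


default_audio = synthesize("hi")
swapped_audio = synthesize("hi", {"tts": "elevenlabs"})
```

The rest of the pipeline only ever calls `synthesize`, so changing the provider never touches the call flow.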

// 05

Multi-provider telephony

Twilio, Vonage, Plivo, IVR Solutions — all supported. Bring your own numbers, your own SIP trunk, your own redundancy strategy.

// 06

Production, not a demo

Versioned, CI-deployed, runbook-backed. 60+ gyms live, 10K+ calls a day, shipping weekly. This is not a proof of concept.

[ 07 ] The stack

Every component chosen for latency, cost-efficiency, and India-scale. Model-agnostic at every layer — swap without touching the pipeline.

Pipecat · FastAPI · Next.js 15 · React Flow · Sarvam saarika:v2.5 STT · Sarvam bulbul:v2 TTS · ElevenLabs · VoxCPM voice-clone · pgvector RAG · PostgreSQL · Redis · WebRTC · Twilio · Vonage · Plivo · IVR Solutions · Razorpay (payment links) · MCP server · Python SDK · TypeScript SDK · Private cloud or on-prem

[ 08 ] Part of the autonomous stack

Voice is one part of a compounding loop.

Every call surfaces intent. That intent feeds messaging, which feeds video, which drives the leads that fill tomorrow's call list. One loop. No leakage.

02 / MESSAGING