Self.

From 10 test calls to 52,000 in production

How we deployed AI voice support for Self Financial across ~75,000 monthly calls, 24 hours a day, in under six months. Now processing ~2,400 calls daily with write actions that close accounts and replace cards.

Production calls
~52k
22 days of 24/7 (4/21--5/12)
Accuracy
~99%+
Voice pipeline + unsupported facts
Performance
~95%
From 50% in Dec 2025
Live traffic
100%
24/7 since Apr 20

1 About Self Financial
Self helps millions of Americans build credit and savings through products designed for people underserved by traditional banking. Headquartered in Austin, TX.
Credit Builder Account (CBA)
A savings-backed installment loan that reports to all three credit bureaus. Customers make fixed monthly payments; at maturity, they receive their savings minus fees. Self's flagship product.
Self Visa Credit Card (SCC)
A secured credit card funded by the customer's CBA savings. Designed as a next step after building initial credit history through the CBA.
Unsecured Credit Card (UCC)
An unsecured card for customers who have graduated from the secured card. Represents the final step in Self's credit-building ladder.
Monthly call volume
~75k
~2,500/day weekdays, ~1,400 weekends
Ticketing system
Salesforce
Via SAIS integration layer
Voice platform
Twilio
IVR + Flex queue

"We want to completely replace the existing support program and build AI that can actually resolve issues, not just route them."

Elizabeth O'Connor, SVP Operations, Self Financial

2 What we built
Not a chatbot layer on top of an IVR. A fully integrated system that reads accounts, takes actions, and handles regulated financial conversations end-to-end.
Live workflows
~35
Every major support topic
API integrations
~18
Direct SAIS + Salesforce read/write
Knowledge articles
~250
Curated + daily-scraped FAQs
Account lookup
Balance, status, payment history, payout info across CBA and SCC products via SAIS APIs
Write actions
CBA account closure, card replacement, Salesforce case creation -- all executed via API
Knowledge base
~40 curated articles plus ~210 daily-scraped FAQs for comprehensive coverage

3 The quality curve
23 test batches plus 21 days of production data, December 2025 through May 2026. Performance nearly doubled; accuracy converged to near-perfect.
50%
Dec 23 performance
94%
Apr 21 performance

Performance pass rate -- Dec 2025 to May 2026

Agent repetition, dead air, workflow execution, coverage. Test batches through 4/21, then production samples through 5/11.

Accuracy pass rate -- Dec 2025 to May 2026

Voice pipeline + unsupported facts. Zero tolerance checks.

4 Scaling with confidence
We started with 10 calls. Each batch grew as confidence increased. By April, we were testing thousands of calls per batch while maintaining quality.

Test volume over time

Calls reviewed per batch -- reflecting growing confidence in the system

Response time pass rate

Agent response time check -- after manual review overrides for false positives

5 The launch
From first test call to 100% of traffic in under five months.
Sep 2025
Project kickoff
Initial scoping, SAIS API integration design, workflow architecture
Dec 2025
First test calls
10-call batch, 50% performance pass rate. Baseline established.
Jan 2026
API integrations complete
CBA Payouts, CBA Payments, SCC Payouts live. KB scraper running daily with ~210 FAQs.
Mar 2026
Quality crosses 95%
Performance pass rate hits 97% on 3/27 batch. Scale testing begins at 200+ calls.
Apr 9, 2026
1,000-call validation
Largest structured test batch. 94.9% performance, 99.9% accuracy. Green light for full traffic.
Apr 13, 2026
100% of voice traffic
All voice calls routed through Lorikeet. Business hours initially.
Apr 20, 2026
24/7 operation
Extended to round-the-clock coverage. Kill switch in place for emergency failover.
Apr 21, 2026
3,558-call batch
Full 24-hour production test. 94% performance, 99.9% accuracy at true production scale.
Now
Phase 2: Pushing deflection
52k+ production calls processed. Driving AI resolution toward 40--60% target. Expanding write actions, reducing premature escalations.

6 What got us here
Six operational practices that made the difference between a pilot and a production deployment.
Step 1
Define what "good" means upfront
We split quality into two distinct tracks: accuracy (unsupported facts, voice pipeline errors) with zero tolerance, and performance (response time, agent repetition, workflow execution) with separate thresholds. This gave us a shared language with Self for what to fix vs what to watch.
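The two-track split can be sketched in a few lines. This is a minimal illustration, not the scoring code used on the project: the check names mirror the ones listed above, while `CallReview`, `batch_report`, the "fix"/"watch" mapping, and the 0.95 performance threshold are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Check names come from the text; everything else here is illustrative.
ACCURACY_CHECKS = {"unsupported_facts", "voice_pipeline"}
PERFORMANCE_CHECKS = {"response_time", "agent_repetition", "workflow_execution"}


@dataclass
class CallReview:
    """Hypothetical record of one reviewed call: the checks that failed."""
    call_id: str
    failed_checks: set = field(default_factory=set)


def pass_rate(reviews, checks):
    """Share of calls with no failed check in the given track."""
    if not reviews:
        raise ValueError("no reviews in batch")
    passed = sum(1 for r in reviews if not (r.failed_checks & checks))
    return passed / len(reviews)


def batch_report(reviews, performance_threshold=0.95):
    """Score a batch on both tracks.

    Accuracy is zero tolerance: any failure marks the track as "fix".
    Performance is compared to a threshold (0.95 is an assumed value)
    and marked "watch" when it falls short.
    """
    accuracy = pass_rate(reviews, ACCURACY_CHECKS)
    performance = pass_rate(reviews, PERFORMANCE_CHECKS)
    return {
        "accuracy": accuracy,
        "performance": performance,
        "accuracy_action": "fix" if accuracy < 1.0 else "ok",
        "performance_action": "watch" if performance < performance_threshold else "ok",
    }
```

Keeping the two rates separate means a slow but factually correct call never hides behind, or gets confused with, a call that stated something unsupported.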
Step 2
Accountability gates before every scale-up
EPD and FD both had to sign off before the next live batch. No unilateral "ship it". Each gate reviewed the previous batch results, open issues, and risk. This forced honest conversations about readiness instead of optimistic timelines.
Step 3
Map the full issue cycle, not just the filing
Every issue followed a transparency chain: discovery, root cause, fix, then validation that the fix actually worked. Filing a ticket did not count as fixing the problem. This closed the loop between QA findings and production improvements.
Step 4
Output transforms (the breakthrough)
Adding data field definitions and clear instructions for the LLM on how to interpret API responses was the single biggest quality lever. Raw API data is ambiguous. Telling the model what each field means and how to present it to customers eliminated an entire class of hallucination.
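As a rough sketch of what an output transform might look like: the field names below (`acct_stat`, `pmt_due_cents`) and the `transform_output` helper are hypothetical, since the real SAIS response schema is not described here. Dropping undefined fields is also an assumption of this example, not a documented design decision.

```python
import json

# Hypothetical field definitions; placeholders, not the real SAIS schema.
FIELD_DEFINITIONS = {
    "acct_stat": {
        "meaning": "Account status code",
        "values": {"A": "active", "C": "closed", "D": "delinquent"},
        "instruction": "Say the status in plain words; never read the code aloud.",
    },
    "pmt_due_cents": {
        "meaning": "Next payment amount in US cents",
        "instruction": "Convert to dollars before stating the amount.",
    },
}


def transform_output(raw):
    """Turn a raw API response into annotated JSON text for the LLM prompt.

    Fields without a definition are dropped rather than passed through,
    so the model never has to guess what an unlabeled value means.
    """
    annotated = []
    for name, value in raw.items():
        spec = FIELD_DEFINITIONS.get(name)
        if spec is None:
            continue  # undefined field: do not expose it to the model
        entry = {
            "field": name,
            "value": value,
            "meaning": spec["meaning"],
            "instruction": spec["instruction"],
        }
        if "values" in spec:
            # Decode enumerated codes so the model sees the plain-language value.
            entry["decoded"] = spec["values"].get(value, "unknown")
        annotated.append(entry)
    return json.dumps(annotated, indent=2)


print(transform_output({"acct_stat": "A", "pmt_due_cents": 2500, "x_internal": 1}))
```

The point of the pattern is that ambiguity is resolved once, in the definitions, instead of being left to the model on every call.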
Step 5
Lorikeet-owned QA, subscriber audits a subset
We own the QA process and review every batch. Self independently audits a subset. This flips the typical vendor model -- the subscriber validates rather than drives QA. When Self's audit of 113 tickets found only 3 accuracy issues and 3 minor performance issues that we had missed, it built confidence rather than eroding it.
Step 6
Canonical simulation set for regression testing
We built a simulation suite covering ~80% of Self's business logic. Every workflow change runs against it before going live. This catches regressions before they reach production and gives us the confidence to ship changes quickly without breaking what already works.
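A canonical simulation set can be as simple as a list of scenarios with expected outcomes that every change must still satisfy. The sketch below assumes a hypothetical scenario format and a toy `candidate_route` function standing in for the changed workflow; the actual suite and its outcome names are not described in this document.

```python
# Hypothetical scenarios: a caller utterance plus the outcome the current
# production behavior is expected to produce.
SCENARIOS = [
    {"name": "lost_card", "utterance": "I lost my card", "expected": "card_replacement"},
    {"name": "close_cba", "utterance": "Please close my account", "expected": "cba_closure"},
    {"name": "unknown", "utterance": "What's the weather?", "expected": "escalate"},
]


def run_regression(route, scenarios=SCENARIOS):
    """Run every scenario through `route` and return the names that failed."""
    return [s["name"] for s in scenarios if route(s["utterance"]) != s["expected"]]


def candidate_route(utterance):
    """Toy stand-in for a changed workflow; a real one would call the agent."""
    text = utterance.lower()
    if "card" in text:
        return "card_replacement"
    if "close" in text:
        return "cba_closure"
    return "escalate"


failures = run_regression(candidate_route)
print("safe to ship" if not failures else f"blocked by: {failures}")
```

Because the scenario list is fixed, a change that fixes one workflow but breaks another shows up as a named failure before it reaches live calls.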

7 Quality infrastructure
Financial services means zero tolerance for hallucinated data. We built monitoring at every layer.
Daily pulse monitoring
Automated 4-day rolling metrics posted to Slack with statistical significance testing
Automated feedback triage
Bad-rated tickets auto-diagnosed, classified, and routed to Linear with deduplication
Self QA audit alignment
Self independently audited 113 tickets. The audit found 3 accuracy issues and 3 minor performance issues that our own QA had not caught.
Kill switch
Instant escalation number for emergency failover to Twilio Flex human agent queue
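The daily pulse check listed above (4-day rolling metrics with significance testing) could be approximated as follows. Which statistical test is actually used is not stated here; this sketch assumes a two-sided two-proportion z-test comparing the latest 4-day window with the one before it, and the function names and sample numbers are illustrative only. Posting the result to Slack is omitted.

```python
import math


def rolling_window(daily, days=4):
    """Sum (passed, total) over the most recent `days` entries."""
    window = daily[-days:]
    return sum(p for p, _ in window), sum(t for _, t in window)


def two_proportion_z(p1, n1, p2, n2):
    """Two-sided two-proportion z-test. Returns (z, p_value)."""
    pooled = (p1 + p2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both windows are all-pass or all-fail: no detectable difference.
        return 0.0, 1.0
    z = (p1 / n1 - p2 / n2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only: (passed, total) per day, oldest first.
daily = [(95, 100)] * 4 + [(85, 100)] * 4
previous = rolling_window(daily[:-4])
current = rolling_window(daily)
z, p = two_proportion_z(*current, *previous)
if p < 0.05:
    print(f"pass rate shift is significant (z={z:.2f}, p={p:.4g})")
```

Testing for significance keeps the daily post from raising alarms over ordinary day-to-day noise in small samples.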