At 9:47 AM Eastern on Black Friday 2025, our network dashboard hit a number we’d never seen in production: 1,000,000 concurrent calls. By 11:23 AM, it was 1.5 million. At 2:14 PM, we peaked at 2,143,891 simultaneous voice connections across 15 data centers.
No calls dropped. No quality degradation. Call quality held at 4.4 MOS throughout. Our on-call engineers watched the dashboards and — honestly — just watched. The auto-scaling handled everything.
This is the story of what happened, what almost went wrong, and what we learned.
The Weeks Before
Black Friday doesn’t surprise anyone. We started preparing in October.
Capacity planning: We analyzed our previous peak (1.2M concurrent on Cyber Monday 2024) and projected 1.8M for Black Friday 2025 based on customer growth curves and historical seasonality. We provisioned for 2.5M, about 40% headroom above projection. (The arithmetic is sketched at the end of this list.)
Pre-scaling: Starting Wednesday night, we pre-warmed additional compute instances across all 15 data centers. Kubernetes pods that normally scale dynamically were pre-deployed to avoid cold-start latency under load.
Runbook updates: We updated our incident response runbooks with Black Friday-specific playbooks. What to do if one DC drops. What to do if SIP proxy latency exceeds 50ms. What to do if auto-scaling falls behind. Every scenario had a documented response.
War room: Our infrastructure team set up a shared monitoring room starting 6 AM Eastern. Twelve engineers across three time zones. Grafana dashboards on every screen. Slack channel open. PagerDuty armed.
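For the curious, here's that sizing as a back-of-the-envelope sketch in Go. The projection, headroom factor, and data-center count are the real figures above; the per-pod call capacity is an illustrative placeholder, not our actual per-instance number:

```go
package main

import "fmt"

// Projection, headroom, and DC count are from this post; the per-pod
// capacity is an illustrative assumption, not a real VestaCall figure.
const (
	projectedPeak    = 1_800_000 // projected peak concurrent calls
	headroomFactor   = 1.4       // ~40% headroom above projection
	callsPerProxyPod = 2_500     // assumed concurrent calls per SIP proxy pod
	dataCenters      = 15
)

func main() {
	provisioned := int(float64(projectedPeak) * headroomFactor) // ~2.5M
	totalPods := (provisioned + callsPerProxyPod - 1) / callsPerProxyPod
	perDC := (totalPods + dataCenters - 1) / dataCenters

	fmt.Printf("provision for %d concurrent calls\n", provisioned)
	fmt.Printf("pre-deploy ~%d proxy pods (~%d per data center)\n", totalPods, perDC)
}
```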
We were ready for war. We got a parade instead.
Timeline: Black Friday 2025
6:00 AM ET — Baseline
Normal traffic. 340,000 concurrent calls. All systems green. Engineers caffeinated.
7:30 AM — First ramp
East Coast retail hours begin. Concurrent calls climb to 520,000. Auto-scaling triggers for the first time: 12 additional SIP proxy pods deploy across US-East-1 and US-East-2. Response time: 47 seconds from trigger to active.
9:47 AM — One million
We cross 1,000,000 concurrent. This is a milestone we'd talked about internally but never hit. Someone posts a screenshot in the Slack channel. Brief celebration, then back to watching.
Auto-scaling is running continuously now. New pods deploying every 3-4 minutes across 8 data centers. Total compute at 2.1x baseline.
10:30 AM — The only scare
US-East-1 SIP proxy latency spikes to 45ms (our alarm threshold is 50ms). Root cause: a burst of new call setups. Not sustained load, but a sudden wave in which roughly 40,000 calls initiated within about 3 seconds, north of 13,000 setup attempts per second. The SIP proxies queue briefly. Latency returns to normal within 90 seconds as new instances absorb the load.
This was the moment that made us nervous. Not because 45ms is bad — it’s still within SLA — but because the spike was sharper than our models predicted. We made a note to investigate.
11:23 AM — 1.5 million
Total compute at 2.7x baseline. Call quality steady at 4.4 MOS. Jitter averaging 14ms across all data centers. Packet loss: 0.002%. These are better than our normal daily metrics because we'd pre-scaled aggressively.
2:14 PM — Peak: 2,143,891
The highest concurrent call count in VestaCall's history. Retail e-commerce support lines, insurance claim centers, and outbound sales teams all hitting peak simultaneously.
The dashboard looked like a mountain range. But every metric was green. Auto-scaling had expanded us to 3.2x baseline capacity. We had headroom to spare.
5:00 PM — Descent
Traffic begins dropping as East Coast business hours end. By 8 PM, we're back to 800,000 concurrent. Auto-scaling starts reclaiming resources.
11:59 PM — All clear
Black Friday officially over. Peak survived. Zero dropped calls due to infrastructure. Zero customer-impacting outages. Measured uptime: 100% across all 15 data centers.
What Almost Broke
Nothing broke. But some things bent.
SIP Proxy Burst Handling
That 10:30 AM latency spike revealed a gap in our architecture. Our SIP proxies handle call setup (the initial handshake when a call starts). They're designed for steady-state throughput: a predictable number of new calls per second, sustained. What they're less optimized for is burst patterns, like 40,000 new calls in 3 seconds followed by normal traffic.
The proxies queued briefly. No calls failed, but setup time increased from 200ms to 450ms for about 90 seconds. Imperceptible to callers, but visible in our metrics.
What we fixed afterward: We implemented burst-aware auto-scaling that monitors the rate of change in call setup volume, not just absolute volume. If new call attempts spike faster than 10,000/second, additional SIP proxy capacity deploys proactively. Shipped in January 2026.
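Conceptually, the detector watches the slope of setup volume over a short sliding window rather than its level. Here's a simplified sketch in Go. The 10,000/second trigger is the real threshold mentioned above; the window length, simulated traffic, and scale-out hook are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// BurstDetector tracks call-setup attempts per second and fires on the
// rate of change, not the absolute level.
type BurstDetector struct {
	window    []int // setup attempts counted in each 1-second bucket
	size      int   // sliding-window length in seconds
	threshold int   // attempts/sec of growth that triggers scale-out
}

func NewBurstDetector(windowSeconds, threshold int) *BurstDetector {
	return &BurstDetector{size: windowSeconds, threshold: threshold}
}

// Observe records the latest one-second count and reports whether average
// growth across the window exceeds the burst threshold.
func (b *BurstDetector) Observe(attemptsThisSecond int) bool {
	b.window = append(b.window, attemptsThisSecond)
	if len(b.window) > b.size {
		b.window = b.window[1:]
	}
	if len(b.window) < 2 {
		return false
	}
	growth := (b.window[len(b.window)-1] - b.window[0]) / (len(b.window) - 1)
	return growth > b.threshold
}

func main() {
	det := NewBurstDetector(3, 10_000) // 10k/sec growth trigger, per the post

	// Simulated feed: steady traffic, then a sharp 3-second burst.
	feed := []int{4_000, 4_200, 4_100, 18_000, 34_000, 41_000, 5_000}
	for i, n := range feed {
		if det.Observe(n) {
			// In production this would call the autoscaler API;
			// here we just log the proactive scale-out.
			fmt.Printf("t=%ds: burst detected (%d attempts/s), pre-deploying proxy pods\n", i, n)
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the 1s sampling tick
	}
}
```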
Database Write Contention
Our call detail record (CDR) database handles metadata writes for every call — start time, duration, participants, quality metrics. At 2.1M concurrent, the write volume exceeded what our primary database cluster could handle without queuing.
We’d anticipated this and had a write buffer in place — CDRs queue in memory and flush in batches every 5 seconds. The buffer worked perfectly. But it grew larger than projected — 340MB at peak versus our modeled 200MB. Not dangerous, but closer to the buffer limit than we’d like.
What we fixed afterward: Increased buffer capacity to 1GB and added a secondary write path that spills to a message queue (Kafka) if the buffer exceeds 500MB. Even if the database falls behind, no CDR data is lost.
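Here's the shape of that write path as a minimal Go sketch. The 500MB high-water mark is the real figure; the CDR structure, per-record size estimate, and sink wiring are stand-ins (the real spill path is a Kafka producer):

```go
package main

import (
	"fmt"
	"sync"
)

// CDR is a stand-in call detail record; the real fields differ.
type CDR struct{ CallID string }

// Sink abstracts a write path: the primary database or the Kafka spill topic.
type Sink func(batch []CDR) error

const (
	highWater = 500 << 20 // 500MB spill threshold from the post
	cdrBytes  = 512       // assumed average in-memory size of one CDR
)

type CDRBuffer struct {
	mu      sync.Mutex
	pending []CDR
	db      Sink
	spill   Sink
}

// Add buffers one record. If the buffer crosses the high-water mark, the
// whole batch spills to the queue so nothing is lost if the DB falls behind.
func (b *CDRBuffer) Add(r CDR) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, r)
	if len(b.pending)*cdrBytes > highWater {
		batch := b.pending
		b.pending = nil
		go b.spill(batch)
	}
}

// Flush drains the buffer to the primary DB (run from a 5-second ticker in
// production), falling back to the spill path on error.
func (b *CDRBuffer) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return
	}
	if err := b.db(batch); err != nil {
		b.spill(batch)
	}
}

func main() {
	buf := &CDRBuffer{
		db:    func(batch []CDR) error { fmt.Printf("db: wrote %d CDRs\n", len(batch)); return nil },
		spill: func(batch []CDR) error { fmt.Printf("kafka: spilled %d CDRs\n", len(batch)); return nil },
	}
	buf.Add(CDR{CallID: "call-1"})
	buf.Add(CDR{CallID: "call-2"})
	buf.Flush() // prints "db: wrote 2 CDRs"
}
```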
Monitoring Dashboard Lag
Ironic: our monitoring infrastructure struggled more than our voice infrastructure. Grafana dashboards lagged by 15-20 seconds during peak because the metrics ingestion pipeline (Prometheus → Thanos → Grafana) couldn’t keep up with the telemetry volume from 3.2x baseline pods.
Engineers were making decisions on 20-second-old data. For most situations, that’s fine. For an active incident, it’s uncomfortable.
What we fixed afterward: Deployed a dedicated metrics pipeline for critical real-time signals (concurrent calls, latency, error rate) that bypasses the main Thanos aggregation layer. Response time: sub-2-second for critical metrics.
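In outline, the fast path is just a handful of atomically updated gauges served from a dedicated endpoint, skipping the aggregation layer entirely. A minimal sketch in Go, with illustrative gauge names, port, and endpoint:

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// The signals that must stay real-time, per the post: concurrent calls,
// latency, error rate. Names and units here are illustrative.
var (
	concurrentCalls atomic.Int64
	p99LatencyUS    atomic.Int64 // SIP proxy p99 latency, microseconds
	errorRatePPM    atomic.Int64 // errors per million call attempts
)

// The voice stack updates these gauges directly on its hot path instead of
// waiting on the Prometheus -> Thanos -> Grafana pipeline.
func recordCallStarted() { concurrentCalls.Add(1) }
func recordCallEnded()   { concurrentCalls.Add(-1) }

func main() {
	// A dedicated endpoint the war-room dashboard polls every second,
	// bypassing the main aggregation layer entirely.
	http.HandleFunc("/critical", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]int64{
			"concurrent_calls": concurrentCalls.Load(),
			"p99_latency_us":   p99LatencyUS.Load(),
			"error_rate_ppm":   errorRatePPM.Load(),
		})
	})
	http.ListenAndServe(":9091", nil)
}
```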
The Numbers
| Metric | Black Friday 2025 | Normal Day |
|---|---|---|
| Peak concurrent calls | 2,143,891 | ~600,000 |
| Call quality (MOS) | 4.4 | 4.4 |
| Avg latency | 18ms (22ms peak at US-East-1) | 18ms |
| Jitter | 14ms avg | 16ms avg |
| Packet loss | 0.002% | 0.003% |
| Compute scaling | 3.2x baseline | 1.0x |
| Auto-scale events | 847 | ~50 |
| Dropped calls (infra) | 0 | 0 |
| Uptime | 100% | 99.9993% (trailing) |
| Engineers on-call | 12 | 3 |
The quality metrics being slightly better than normal isn’t random — it’s because pre-scaling meant every data center had excess capacity. Fewer calls per server = less CPU contention = marginally better audio processing.
What We Learned
1. Pre-scaling beats reactive scaling
Auto-scaling works. But there’s always a lag between demand increase and new capacity being available. For predictable events like Black Friday, pre-deploying capacity eliminates that lag entirely. We’ll pre-scale for every major retail event going forward.
2. Burst patterns need different handling than sustained load
A steady climb to 2M calls is easy. A sudden spike of 40K calls in 3 seconds is hard — even if total load is lower. Our AI-powered IVR and routing systems handle sustained load gracefully, but burst patterns require dedicated handling at the SIP proxy layer.
3. Monitor your monitoring
If your observability stack can’t keep up with your production stack, you’re blind during the moments that matter most. We now treat monitoring infrastructure as a Tier 1 service with its own scaling strategy.
4. Boring infrastructure is good infrastructure
The most satisfying outcome of Black Friday 2025: nothing dramatic happened. Engineers sat in a room for 18 hours watching green dashboards. That’s the goal. Drama means something went wrong. The best infrastructure stories are the ones where the exciting thing is that nothing exciting happened.
5. Owning the network matters
We own our voice network — 15 data centers, our own SIP proxies, our own media servers. We’re not reselling carrier infrastructure. On Black Friday, that meant we could monitor end-to-end, scale any component independently, and troubleshoot without calling a third party. Providers that resell infrastructure can’t do this. When their upstream carrier has issues, they wait.
Why This Matters for Your Business
You probably don’t have 2.1 million concurrent calls. But the same infrastructure that handles that peak handles your 50-person contact center during Monday morning rush hour.
The question isn’t “can my VoIP provider handle Black Friday?” It’s “can my VoIP provider handle MY peak?” — the day your marketing campaign goes viral, the day a product recall hits, the day weather cancels flights and every customer calls at once.
VestaCall customers report an average 47% cost reduction when switching from legacy systems — based on 2,000+ migrations. But cost savings mean nothing if the system can’t handle the moment that matters most.
We handled 2.1 million concurrent calls with zero dropped calls and zero quality degradation. Your 200-person team isn’t going to stress us.
If you’re evaluating VoIP providers, ask them about their peak concurrent call capacity. Ask if they own their network or resell. Ask for measured uptime data, not just SLA promises. The answers tell you everything about whether you can trust them on your worst day.
For the full picture of how VoIP security intersects with reliability, see our VoIP security guide. And our call center KPI guide covers the metrics that matter during high-volume periods.
Start a free trial and test it yourself. We won’t ask you to generate 2 million calls — but we’ll handle whatever you throw at us.
What’s the highest call volume your current system has handled — and did it hold up?