Engineering

We Handled 2.1 Million Concurrent Calls on Black Friday — Here's What We Learned

By Sarah Chen · April 10, 2026

At 9:47 AM Eastern on Black Friday 2025, our network dashboard hit a number we’d never seen in production: 1,000,000 concurrent calls. By 11:23 AM, it was 1.5 million. At 2:14 PM, we peaked at 2,143,891 simultaneous voice connections across 15 data centers.

No calls dropped. No quality degradation. Call quality held at 4.4 MOS throughout. Our on-call engineers watched the dashboards and — honestly — just watched. The auto-scaling handled everything.

This is the story of what happened, what almost went wrong, and what we learned.

The Weeks Before

Black Friday doesn’t surprise anyone. We started preparing in October.

Capacity planning: We analyzed our previous peak (1.2M concurrent on Cyber Monday 2024) and projected 1.8M for Black Friday 2025 based on customer growth curves and historical seasonality. We provisioned for 2.5M — about 40% headroom above projection.
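The headroom arithmetic is simple enough to check in a few lines. The figures come from the post itself; the projection method is not something we know, so treat this as a back-of-the-envelope sketch rather than the actual planning model:

```python
# Capacity planning sketch using the figures from the post.
previous_peak = 1_200_000   # Cyber Monday 2024 concurrent calls
projected_peak = 1_800_000  # Black Friday 2025 projection
provisioned = 2_500_000     # capacity actually provisioned

# Headroom above projection: (2.5M - 1.8M) / 1.8M ≈ 39%,
# which the post rounds to "about 40%".
headroom = (provisioned - projected_peak) / projected_peak
print(f"Headroom above projection: {headroom:.0%}")  # → 39%
```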

Pre-scaling: Starting Wednesday night, we pre-warmed additional compute instances across all 15 data centers. Kubernetes pods that normally scale dynamically were pre-deployed to avoid cold-start latency under load.

Runbook updates: We updated our incident response runbooks with Black Friday-specific playbooks. What to do if one DC drops. What to do if SIP proxy latency exceeds 50ms. What to do if auto-scaling falls behind. Every scenario had a documented response.

War room: Our infrastructure team set up a shared monitoring room starting 6 AM Eastern. Twelve engineers across three time zones. Grafana dashboards on every screen. Slack channel open. PagerDuty armed.

We were ready for war. We got a parade instead.

Timeline: Black Friday 2025

6:00 AM ET — Baseline

Normal traffic. 340,000 concurrent calls. All systems green. Engineers caffeinated.

7:30 AM — First ramp

East Coast retail hours begin. Concurrent calls climb to 520,000. Auto-scaling triggers for the first time — 12 additional SIP proxy pods deploy across US-East-1 and US-East-2. Response time: 47 seconds from trigger to active.

9:47 AM — One million

We cross 1,000,000 concurrent. This is a milestone we’d talked about internally but never hit. Someone posts a screenshot in the Slack channel. Brief celebration, then back to watching.

Auto-scaling is running continuously now. New pods deploying every 3-4 minutes across 8 data centers. Total compute at 2.1x baseline.

10:30 AM — The only scare

US-East-1 SIP proxy latency spikes to 45ms (our alarm threshold is 50ms). Root cause: a burst of new call setups — not sustained load, but a sudden 30-second wave where 40,000 calls initiated simultaneously. The SIP proxies queue briefly. Latency returns to normal within 90 seconds as new instances absorb the load.

This was the moment that made us nervous. Not because 45ms is bad — it’s still within SLA — but because the spike was sharper than our models predicted. We made a note to investigate.

11:23 AM — 1.5 million

Total compute at 2.7x baseline. Call quality steady at 4.4 MOS. Jitter averaging 14ms across all data centers. Packet loss: 0.002%. These are better than our normal daily metrics because we’d pre-scaled aggressively.

2:14 PM — Peak: 2,143,891

The highest concurrent call count in VestaCall’s history. Retail e-commerce support lines, insurance claim centers, and outbound sales teams all hitting peak simultaneously.

The dashboard looked like a mountain range. But every metric was green. Auto-scaling had expanded us to 3.2x baseline capacity. We had headroom to spare.

5:00 PM — Descent

Traffic begins dropping as East Coast business hours end. By 8 PM, we’re back to 800,000 concurrent. Auto-scaling starts reclaiming resources.

11:59 PM — All clear

Black Friday officially over. Peak survived. Zero dropped calls due to infrastructure. Zero customer-impacting outages. Measured uptime: 100% across all 15 data centers.

What Almost Broke

Nothing broke. But some things bent.

SIP Proxy Burst Handling

That 10:30 AM latency spike revealed a gap in our architecture. Our SIP proxies handle call setup (the initial handshake when a call starts). They’re designed for steady-state throughput — X new calls per second sustained. What they’re less optimized for is burst patterns — 40,000 new calls in 3 seconds followed by normal traffic.

The proxies queued briefly. No calls failed, but setup time increased from 200ms to 450ms for about 90 seconds. Imperceptible to callers, but visible in our metrics.

What we fixed afterward: We implemented burst-aware auto-scaling that monitors the rate of change in call setup volume, not just absolute volume. If new call attempts spike faster than 10,000/second, additional SIP proxy capacity deploys proactively. Shipped in January 2026.
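A rate-of-change trigger like the one described can be sketched as follows. This is a simplified illustration, not VestaCall's implementation: the class name and the 3-second observation window are assumptions, while the 10,000/second threshold comes from the post.

```python
from collections import deque

class BurstDetector:
    """Watches the *rate* of new call setups, not absolute volume.

    If call attempts arrive faster than the threshold (10,000/sec,
    per the post), the caller should deploy extra SIP proxy capacity
    proactively. Illustrative sketch only; interfaces are hypothetical.
    """

    def __init__(self, rate_threshold=10_000, window_secs=3):
        self.rate_threshold = rate_threshold
        self.window_secs = window_secs
        self.samples = deque()  # (timestamp, cumulative_setup_count)

    def record(self, timestamp, cumulative_setups):
        self.samples.append((timestamp, cumulative_setups))
        # Drop samples older than the observation window.
        while self.samples and timestamp - self.samples[0][0] > self.window_secs:
            self.samples.popleft()

    def burst_detected(self):
        if len(self.samples) < 2:
            return False
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 == t0:
            return False
        rate = (c1 - c0) / (t1 - t0)  # new setups per second
        return rate > self.rate_threshold

# The 10:30 AM scare: 40,000 setups in 3 seconds ≈ 13,333/sec.
detector = BurstDetector()
detector.record(0.0, 0)
detector.record(3.0, 40_000)
assert detector.burst_detected()  # above 10k/sec → pre-scale proxies
```

The point of keying on the derivative is that absolute-volume triggers would have seen 40,000 calls as well within capacity; only the arrival rate reveals the queueing risk.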

Database Write Contention

Our call detail record (CDR) database handles metadata writes for every call — start time, duration, participants, quality metrics. At 2.1M concurrent, the write volume exceeded what our primary database cluster could handle without queuing.

We’d anticipated this and had a write buffer in place — CDRs queue in memory and flush in batches every 5 seconds. The buffer worked perfectly. But it grew larger than projected — 340MB at peak versus our modeled 200MB. Not dangerous, but closer to the buffer limit than we’d like.

What we fixed afterward: Increased buffer capacity to 1GB and added a secondary write path that spills to a message queue (Kafka) if the buffer exceeds 500MB. Even if the database falls behind, no CDR data is lost.
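The buffer-plus-spill design can be modeled roughly like this. It is an illustrative sketch, not the production code: the class, the stub sinks, and the byte accounting are assumptions, while the 5-second flush cadence and 500MB spill threshold come from the post.

```python
class CdrBuffer:
    """In-memory CDR write buffer with a durable spill path.

    Records batch in memory and flush periodically; if the buffer
    grows past the spill threshold (500MB in the post's design),
    overflow goes to a durable queue (e.g. Kafka) instead of being
    dropped, so no CDR data is lost even if the database falls behind.
    """

    def __init__(self, flush_to_db, spill_to_queue,
                 spill_threshold_bytes=500 * 1024**2):
        self.flush_to_db = flush_to_db        # e.g. batched INSERTs
        self.spill_to_queue = spill_to_queue  # e.g. a Kafka producer
        self.spill_threshold = spill_threshold_bytes
        self.buffer = []
        self.buffered_bytes = 0

    def add(self, record_bytes):
        if self.buffered_bytes + len(record_bytes) > self.spill_threshold:
            # Database is behind; divert to the durable queue.
            self.spill_to_queue(record_bytes)
        else:
            self.buffer.append(record_bytes)
            self.buffered_bytes += len(record_bytes)

    def flush(self):
        # Called on a timer (every 5 seconds in the post's design).
        batch, self.buffer, self.buffered_bytes = self.buffer, [], 0
        if batch:
            self.flush_to_db(batch)

# Usage with stub sinks and a tiny threshold to show the spill path:
db_batches, queued = [], []
buf = CdrBuffer(db_batches.append, queued.append, spill_threshold_bytes=10)
buf.add(b"12345678")  # 8 bytes: fits in the buffer
buf.add(b"1234")      # would exceed 10 bytes → spills to the queue
buf.flush()           # db_batches gets one batch; queued holds the spilled record
```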

Monitoring Dashboard Lag

Ironic: our monitoring infrastructure struggled more than our voice infrastructure. Grafana dashboards lagged by 15-20 seconds during peak because the metrics ingestion pipeline (Prometheus → Thanos → Grafana) couldn’t keep up with the telemetry volume from 3.2x baseline pods.

Engineers were making decisions on 20-second-old data. For most situations, that’s fine. For an active incident, it’s uncomfortable.

What we fixed afterward: Deployed a dedicated metrics pipeline for critical real-time signals (concurrent calls, latency, error rate) that bypasses the main Thanos aggregation layer. Response time: sub-2-second for critical metrics.
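The split can be modeled as a simple routing decision at ingestion time. This is a hypothetical sketch; the metric names and pipeline callables are stand-ins, not real pipeline clients.

```python
# Critical real-time signals bypass the main aggregation layer.
CRITICAL_METRICS = {"concurrent_calls", "sip_proxy_latency_ms", "error_rate"}

def route_metric(name, value, fast_path, slow_path):
    """Dispatch a metric sample to the appropriate pipeline.

    fast_path: dedicated low-latency ingestion (sub-2-second, per the post)
    slow_path: the main Prometheus -> Thanos -> Grafana aggregation
    Both callables are hypothetical stand-ins for real pipeline clients.
    """
    if name in CRITICAL_METRICS:
        fast_path(name, value)
    else:
        slow_path(name, value)

# Usage with stub pipelines:
fast, slow = [], []
route_metric("concurrent_calls", 2_143_891,
             lambda n, v: fast.append((n, v)),
             lambda n, v: slow.append((n, v)))
route_metric("disk_free_gb", 512,
             lambda n, v: fast.append((n, v)),
             lambda n, v: slow.append((n, v)))
# fast holds the concurrent-call sample; slow holds everything else
```

The design trade-off is duplication: critical metrics flow through two systems, but an outage or lag in the heavyweight aggregation layer can no longer blind you to the handful of signals that matter during an incident.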

The Numbers

| Metric | Black Friday 2025 | Normal Day |
|---|---|---|
| Peak concurrent calls | 2,143,891 | ~600,000 |
| Call quality (MOS) | 4.4 | 4.4 |
| Avg latency | 18ms (22ms peak at US-East-1) | 18ms |
| Jitter | 14ms avg | 16ms avg |
| Packet loss | 0.002% | 0.003% |
| Compute scaling | 3.2x baseline | 1.0x |
| Auto-scale events | 847 | ~50 |
| Dropped calls (infra) | 0 | 0 |
| Uptime | 100% | 99.9993% trailing |
| Engineers on-call | 12 | 3 |

The quality metrics being slightly better than normal isn’t random — it’s because pre-scaling meant every data center had excess capacity. Fewer calls per server = less CPU contention = marginally better audio processing.

What We Learned

1. Pre-scaling beats reactive scaling

Auto-scaling works. But there’s always a lag between demand increase and new capacity being available. For predictable events like Black Friday, pre-deploying capacity eliminates that lag entirely. We’ll pre-scale for every major retail event going forward.

2. Burst patterns need different handling than sustained load

A steady climb to 2M calls is easy. A sudden spike of 40K calls in 3 seconds is hard — even if total load is lower. Our AI-powered IVR and routing systems handle sustained load gracefully, but burst patterns require dedicated handling at the SIP proxy layer.

3. Monitor your monitoring

If your observability stack can’t keep up with your production stack, you’re blind during the moments that matter most. We now treat monitoring infrastructure as a Tier 1 service with its own scaling strategy.

4. Boring infrastructure is good infrastructure

The most satisfying outcome of Black Friday 2025: nothing dramatic happened. Engineers sat in a room for 18 hours watching green dashboards. That’s the goal. Drama means something went wrong. The best infrastructure stories are the ones where the exciting thing is that nothing exciting happened.

5. Owning the network matters

We own our voice network — 15 data centers, our own SIP proxies, our own media servers. We’re not reselling carrier infrastructure. On Black Friday, that meant we could monitor end-to-end, scale any component independently, and troubleshoot without calling a third party. Providers that resell infrastructure can’t do this. When their upstream carrier has issues, they wait.

Why This Matters for Your Business

You probably don’t have 2.1 million concurrent calls. But the same infrastructure that handles that peak handles your 50-person contact center during Monday morning rush hour.

The question isn’t “can my VoIP provider handle Black Friday?” It’s “can my VoIP provider handle MY peak?” — the day your marketing campaign goes viral, the day a product recall hits, the day weather cancels flights and every customer calls at once.

VestaCall customers report an average 47% cost reduction when switching from legacy systems — based on 2,000+ migrations. But cost savings mean nothing if the system can’t handle the moment that matters most.

We handled 2.1 million concurrent calls with zero dropped calls and zero quality degradation. Your 200-person team isn’t going to stress us.

If you’re evaluating VoIP providers, ask them about their peak concurrent call capacity. Ask if they own their network or resell. Ask for measured uptime data, not just SLA promises. The answers tell you everything about whether you can trust them on your worst day.

For the full VoIP security picture and how it intersects with reliability, see our VoIP security guide. And our call center KPI guide covers the metrics that matter during high-volume periods.

Start a free trial and test it yourself. We won’t ask you to generate 2 million calls — but we’ll handle whatever you throw at us.

What’s the highest call volume your current system has handled — and did it hold up?

Sarah Chen

Head of Product, VestaCall

Frequently Asked Questions

How many concurrent calls can VestaCall handle?

Our tested peak is 2.1 million concurrent calls, achieved on Black Friday 2025. The platform auto-scales beyond this — 2.1M wasn't a hard ceiling, it was the highest demand we've seen. Our architecture distributes calls across 15 global data centers with automatic load balancing and failover. The practical limit for any single customer account is effectively unlimited — we've never had a customer hit a concurrency cap.

Did VestaCall have any outages or quality issues during Black Friday 2025?

No. Measured uptime for the Black Friday 2025 period (November 27-30) was 100% across all 15 data centers. Call quality remained at 4.4 MOS throughout the peak. We did experience elevated latency (22ms vs our normal 18ms) on one US East data center for about 40 minutes, but this was within SLA parameters and imperceptible to callers. No calls were dropped due to infrastructure limitations.

What infrastructure does VestaCall run on?

VestaCall operates 15 geographically distributed data centers across North America, Europe, and Asia-Pacific. We own our voice network — we're not reselling carrier infrastructure. This gives us direct control over call routing, quality, and failover. The platform uses containerized microservices with Kubernetes orchestration, auto-scaling based on real-time demand metrics, and redundant SIP proxies at every data center.

How does the platform scale to handle traffic spikes?

Auto-scaling. Our platform monitors concurrent call count, CPU utilization, and SIP proxy queue depth in real-time. When any metric exceeds 60% of current capacity, new compute instances spin up automatically within 90 seconds. On Black Friday, we scaled from our baseline capacity to 3.2x baseline over 4 hours — entirely automatically. No human intervention required. The scaling is also bidirectional — capacity shrinks when demand drops, so we're not paying for idle infrastructure.

What happens if a traffic surge outruns auto-scaling?

If auto-scaling can't keep up with an unprecedented surge, our load balancers implement graceful degradation. New calls get a brief queuing delay (2-5 seconds of ring time) rather than a failure. Calls in progress are never dropped — they're anchored to their original server. In 12 months of operation, we've never had a spike that outran our auto-scaling. But the fallback exists because infrastructure engineering is about preparing for scenarios you haven't seen yet.
