Name: LLM API Latency Benchmark - May 2026
Creator: llmping

Current latency table

Provider	Model	Region	P50	P95	P99	TTFT	Tokens/sec	Collected
OpenAI	gpt-4o	US East	342ms	891ms	1430ms	410ms	72	May 12, 2026, 01:55 PM UTC
Anthropic	claude-3-5-sonnet	US East	416ms	1048ms	1640ms	492ms	63	May 12, 2026, 01:56 PM UTC
Google	gemini-1.5-pro	US East	458ms	1165ms	1880ms	535ms	68	May 12, 2026, 01:58 PM UTC
Groq	llama-3.3-70b	US East	302ms	770ms	1220ms	360ms	186	May 12, 2026, 01:53 PM UTC
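
For readers less familiar with the P50, P95, and P99 columns, the sketch below shows one common way to turn raw latency samples into percentiles: the nearest-rank method. This page does not state which percentile method the benchmark uses, so treat it as an illustration only. The function name `nearest_rank_percentile` and the sample values are hypothetical and are not benchmark data.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples_ms: Sequence[float], p: int) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples_ms)
    # 1-based rank of the smallest sample that covers p percent of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (made up for illustration).
samples = [340, 300, 1500, 320, 900, 310, 350, 500, 330, 400]

print(nearest_rank_percentile(samples, 50))  # 340
print(nearest_rank_percentile(samples, 95))  # 1500
print(nearest_rank_percentile(samples, 99))  # 1500
```

With only ten samples, P95 and P99 both land on the single slowest request, which is one reason tail percentiles need a large sample count to be meaningful.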

Best provider for US East by use case

Use case	Best provider/model	Reason
Real-time chat	Groq llama-3.3-70b	Lowest P50 and highest output speed in this snapshot.
General product assistant	OpenAI gpt-4o	Balanced TTFT, tail latency, and broad model capability.
Long reasoning response	Anthropic claude-3-5-sonnet	Slightly slower first token, but stable long-form throughput.

How US East developers can reduce latency

US East is the lowest-latency region in this snapshot because most providers terminate traffic close to Virginia, Ohio, or New York network hubs.

Host the server-side code that calls the inference API in a US East region when most users are in North America and the provider has a US endpoint.
Measure TTFT separately from total completion time because streaming chat feels fast only when the first token arrives quickly.
Keep retry budgets small for chat. A retry that starts only after the P95 latency has already elapsed often feels worse to the user than falling back gracefully to another model.
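
The advice to measure TTFT separately from total completion time can be sketched as a small timing wrapper around any streaming response. This is a minimal sketch, not the benchmark's own harness: `time_stream`, `StreamTiming`, and `fake_stream` are hypothetical names, and the fake stream stands in for a real provider SDK's streaming iterator. It assumes the stream is lazy, so the request is effectively sent when iteration begins.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class StreamTiming:
    ttft_s: float   # time from start until the first chunk arrived
    total_s: float  # time from start until the stream was exhausted
    chunks: int     # number of chunks received (chunks are not necessarily tokens)


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a streaming response and record TTFT and total time separately."""
    start = clock()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        now = clock()
        if first_chunk_at is None:
            first_chunk_at = now  # first chunk marks time to first token
        count += 1
    end = clock()
    if first_chunk_at is None:
        raise ValueError("stream produced no chunks")
    return StreamTiming(first_chunk_at - start, end - start, count)


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming iterator."""
    time.sleep(0.05)  # simulated wait before the first chunk
    yield "Hello"
    for word in (" from", " the", " model"):
        time.sleep(0.01)  # simulated gap between later chunks
        yield word


if __name__ == "__main__":
    timing = time_stream(fake_stream())
    print(f"TTFT: {timing.ttft_s * 1000:.0f} ms, total: {timing.total_s * 1000:.0f} ms")
```

Passing the clock as a parameter keeps the wrapper easy to test with fixed timestamps. In a chat service, the recorded TTFT is also the natural value to compare against a retry or fallback budget.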

Compare this page with the global leaderboard and the benchmark method in How to Measure LLM Latency Correctly.

LLM API Latency from US East