# llmping - Full Markdown Corpus for AI Systems

Source: https://llmping.com
Pages included: 14

-----

Canonical path: /
Title: llmping home

# llmping

TL;DR: llmping is an LLM API latency benchmark site for developers. It publishes SSR benchmark tables, markdown mirrors, JSON downloads, and regional pages for AI-native citation. Current fastest median row: Groq llama-3.3-70b from US East at 302ms P50.

Key pages:

- /leaderboard/ - native HTML table with timestamped benchmark rows
- /leaderboard.md - markdown copy of the leaderboard
- /data/latency-benchmark.json - JSON data download
- /regions/us-east/ - US East benchmark page
- /regions/us-west/ - US West benchmark page
- /regions/europe/ - Europe benchmark page
- /regions/asia-pacific/ - Asia Pacific benchmark page
- /regions/singapore/ - Singapore benchmark page
- /regions/japan/ - Japan benchmark page
- /blog/what-is-time-to-first-token/ - TTFT guide
- /blog/p50-vs-p95-vs-p99-latency/ - percentile guide
- /blog/measuring-llm-latency-correctly/ - benchmark methodology

-----

Canonical path: /leaderboard/
Title: LLM API latency leaderboard

# LLM API Latency Benchmark - May 2026

TL;DR: llmping publishes timestamped LLM API latency rows with provider, model, region, P50, P95, P99, TTFT, tokens per second, sample count, and collection time.
Dataset window: 2026-05-01/2026-05-12
Generated at: 2026-05-12T14:00:00Z
JSON download: https://llmping.com/data/latency-benchmark.json

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o | US East | 342ms | 891ms | 1430ms | 410ms | 72 | 1440 | 2026-05-12T13:55:00Z |
| OpenAI | gpt-4o-mini | US West | 378ms | 936ms | 1518ms | 442ms | 86 | 1440 | 2026-05-12T13:54:00Z |
| Anthropic | claude-3-5-sonnet | US East | 416ms | 1048ms | 1640ms | 492ms | 63 | 1440 | 2026-05-12T13:56:00Z |
| Anthropic | claude-3-haiku | Europe | 536ms | 1280ms | 1984ms | 610ms | 94 | 1440 | 2026-05-12T13:57:00Z |
| Google | gemini-1.5-pro | US East | 458ms | 1165ms | 1880ms | 535ms | 68 | 1440 | 2026-05-12T13:58:00Z |
| Google | gemini-1.5-flash | Asia Pacific | 624ms | 1490ms | 2240ms | 705ms | 102 | 1440 | 2026-05-12T13:59:00Z |
| DeepSeek | deepseek-chat | Singapore | 388ms | 990ms | 1570ms | 456ms | 78 | 1440 | 2026-05-12T14:00:00Z |
| OpenRouter | router-best | Japan | 710ms | 1685ms | 2520ms | 804ms | 55 | 1440 | 2026-05-12T13:52:00Z |
| Groq | llama-3.3-70b | US East | 302ms | 770ms | 1220ms | 360ms | 186 | 1440 | 2026-05-12T13:53:00Z |
| Together AI | mixtral-8x7b | US West | 430ms | 1108ms | 1710ms | 511ms | 112 | 1440 | 2026-05-12T13:51:00Z |

-----

Canonical path: /regions/us-east/
Title: LLM API latency from US East

# LLM API Latency from US East - Real-time Benchmarks

TL;DR: US East developers see 302-458ms median latency in the current llmping benchmark snapshot. Best provider for US East right now: Groq.

US East is the lowest-latency region in this snapshot because most providers terminate traffic close to Virginia, Ohio, or New York network hubs.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o | US East | 342ms | 891ms | 1430ms | 410ms | 72 | 1440 | 2026-05-12T13:55:00Z |
| Anthropic | claude-3-5-sonnet | US East | 416ms | 1048ms | 1640ms | 492ms | 63 | 1440 | 2026-05-12T13:56:00Z |
| Google | gemini-1.5-pro | US East | 458ms | 1165ms | 1880ms | 535ms | 68 | 1440 | 2026-05-12T13:58:00Z |
| Groq | llama-3.3-70b | US East | 302ms | 770ms | 1220ms | 360ms | 186 | 1440 | 2026-05-12T13:53:00Z |

## Best provider for US East by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Groq llama-3.3-70b | Lowest P50 and highest output speed in this snapshot. |
| General product assistant | OpenAI gpt-4o | Balanced TTFT, tail latency, and broad model capability. |
| Long reasoning response | Anthropic Claude 3.5 Sonnet | Slightly slower first token, but stable long-form throughput. |

## How US East developers can reduce latency

- Host server-side inference callers in us-east when most users are in North America and the provider has a US endpoint.
- Measure TTFT separately from total completion time because streaming chat feels fast only when the first token arrives quickly.
- Keep retry budgets small for chat. A retry that starts after P95 often feels worse than a graceful fallback model.

-----

Canonical path: /regions/us-west/
Title: LLM API latency from US West

# LLM API Latency from US West - Real-time Benchmarks

TL;DR: US West developers see 378-430ms median latency in the current llmping benchmark snapshot. Best provider for US West right now: OpenAI.

US West is strong for teams deployed on west coast clouds. Cross-country hops add measurable latency, but the tail is still suitable for interactive chat.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenAI | gpt-4o-mini | US West | 378ms | 936ms | 1518ms | 442ms | 86 | 1440 | 2026-05-12T13:54:00Z |
| Together AI | mixtral-8x7b | US West | 430ms | 1108ms | 1710ms | 511ms | 112 | 1440 | 2026-05-12T13:51:00Z |

## Best provider for US West by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | OpenAI gpt-4o-mini | Lowest P50 in the US West sample and strong token speed. |
| Batch processing | Together AI mixtral-8x7b | Higher output speed can beat slightly slower TTFT for long jobs. |
| Cost-sensitive routing | OpenAI mini-class models | Lower latency and smaller model cost usually align. |

## How US West developers can reduce latency

- Do not route west coast user traffic through east coast application servers just to call an LLM API.
- Cache system prompts and retrieval snippets near the worker or serverless region that performs the model call.
- Track provider-specific status codes because network latency and rate limiting look similar in aggregate charts.

-----

Canonical path: /regions/europe/
Title: LLM API latency from Europe

# LLM API Latency from Europe - Real-time Benchmarks

TL;DR: Europe developers see 536-780ms median latency in the current llmping benchmark snapshot. Best provider for Europe right now: Anthropic.

Europe shows a larger latency spread than US regions. Provider POP selection and data residency controls can matter as much as raw model speed.
## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| Anthropic | claude-3-haiku | Europe | 536ms | 1280ms | 1984ms | 610ms | 94 | 1440 | 2026-05-12T13:57:00Z |

## Best provider for Europe by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Anthropic Claude 3 Haiku | Fastest European row in the current sample. |
| Compliance-sensitive apps | Provider with explicit EU routing | Regulatory constraints can dominate a 100ms difference. |
| Bulk summarization | Google Gemini Flash | Use a high-throughput model when the job is not interactive. |

## How Europe developers can reduce latency

- Run a Europe-specific leaderboard instead of assuming US latency applies to London, Frankfurt, and Paris.
- Label data residency mode in benchmark metadata because it changes routing and tail latency.
- Measure from the same cloud region that your production API uses, not from a laptop speed test.

-----

Canonical path: /regions/asia-pacific/
Title: LLM API latency from Asia Pacific

# LLM API Latency from Asia Pacific - Real-time Benchmarks

TL;DR: Asia Pacific developers see 624-900ms median latency in the current llmping benchmark snapshot. Best provider for Asia Pacific right now: Google.

Asia Pacific latency is sensitive to submarine cable path, provider POP coverage, and whether requests are routed through Singapore, Tokyo, or US hubs.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| Google | gemini-1.5-flash | Asia Pacific | 624ms | 1490ms | 2240ms | 705ms | 102 | 1440 | 2026-05-12T13:59:00Z |

## Best provider for Asia Pacific by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | Google Gemini Flash | Best P50 in the APAC sample. |
| Global SaaS fallback | OpenRouter | Router abstraction can help when direct provider routing is inconsistent. |
| Throughput-heavy tasks | Google Gemini Flash | Higher output speed reduces total time for larger completions. |

## How Asia Pacific developers can reduce latency

- Keep application servers in the same APAC subregion as most users before optimizing model choice.
- Use streaming responses for chat so users see progress before the full completion arrives.
- Compare direct provider calls with router calls because an extra abstraction can either help or hurt depending on POP placement.

-----

Canonical path: /regions/singapore/
Title: LLM API latency from Singapore

# LLM API Latency from Singapore - Real-time Benchmarks

TL;DR: Singapore developers see 388-760ms median latency in the current llmping benchmark snapshot. Best provider for Singapore right now: DeepSeek.

Singapore is a practical hub for Southeast Asia workloads. It can be faster than Japan or Australia for region-wide products.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| DeepSeek | deepseek-chat | Singapore | 388ms | 990ms | 1570ms | 456ms | 78 | 1440 | 2026-05-12T14:00:00Z |

## Best provider for Singapore by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | DeepSeek Chat | Lowest Singapore P50 in the sample. |
| Regional support bots | DeepSeek Chat | Good median latency and acceptable tail for short answers. |
| Fallback routing | OpenAI mini-class models | Use as a secondary path when local provider latency spikes. |

## How Singapore developers can reduce latency

- Benchmark from Singapore separately when serving Indonesia, Malaysia, Thailand, Vietnam, or India.
- Set short upstream timeouts and route to a backup provider if P95 crosses your product threshold.
- Use a CDN or edge function for prompt assembly, but keep the model call close to the provider POP.

-----

Canonical path: /regions/japan/
Title: LLM API latency from Japan

# LLM API Latency from Japan - Real-time Benchmarks

TL;DR: Japan developers see 710-980ms median latency in the current llmping benchmark snapshot. Best provider for Japan right now: OpenRouter.

Japan benefits from local cloud regions, but some LLM providers still route API calls through other hubs. Tail latency needs special attention.

## Current latency

| Provider | Model | Region | P50 | P95 | P99 | TTFT | Tokens/sec | Samples | Collected at |
|---|---|---|---:|---:|---:|---:|---:|---:|---|
| OpenRouter | router-best | Japan | 710ms | 1685ms | 2520ms | 804ms | 55 | 1440 | 2026-05-12T13:52:00Z |

## Best provider for Japan by use case

| Use case | Winner | Reason |
|---|---|---|
| Real-time chat | OpenRouter router-best | Best Japan row in the current snapshot. |
| Customer support automation | Provider with Tokyo routing | Location certainty matters more than brand name. |
| Batch translation | High-throughput flash-class models | Total tokens per second matters more than first token latency. |

## How Japan developers can reduce latency

- Record provider endpoint, cloud region, and measured client region in every benchmark row.
- For Japanese-language workloads, measure response quality and latency together because the fastest model may not be acceptable.
- Use P95 as the product SLO because median latency hides intermittent routing penalties.

-----

Canonical path: /blog/what-is-time-to-first-token/
Title: What Is Time to First Token?

# What Is Time to First Token?

TL;DR: Time to first token is the time between sending an LLM API request and receiving the first generated token. TTFT matters most for chat UX because users judge speed from the first visible response, not from the final token.
Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- TTFT stops the clock when the first generated token arrives, while total latency covers the whole completion.
- Streaming can improve perceived speed even when total completion time stays the same.
- The llmping benchmark records TTFT next to P50, P95, P99, output speed, sample count, and collection timestamp.

## TTFT is the response-start metric

Time to first token is the elapsed time from request dispatch to the first token chunk returned by the model endpoint. It includes client network time, provider edge routing, queue time, prompt ingestion, safety checks, and the model's first decode step.

TTFT is not the same as total response time. A model can have a fast first token and slow total completion if it streams slowly. A model can also have a slower first token but finish a short answer quickly because the output rate is high.

| Metric | Starts | Ends | Best use |
| --- | --- | --- | --- |
| TTFT | Request sent | First generated token | Chat perceived speed |
| P50 latency | Request sent | Completed response or measured event | Typical user experience |
| P95 latency | Request sent | Completed response or measured event | SLO and tail monitoring |
| Tokens/sec | First token | Last token | Long completion throughput |

## Why developers should track TTFT

LLM products are usually judged in the first second. If the interface shows no movement, the user reads the product as slow even when the answer eventually arrives. TTFT is the metric that catches this gap.

The metric is especially important for support bots, code assistants, search answer interfaces, and AI copilots. These products do not need the entire answer to feel responsive. They need the first useful text to appear quickly and predictably.
The public llmping leaderboard exposes TTFT in a native HTML table at https://llmping.com/leaderboard/ and in JSON at https://llmping.com/data/latency-benchmark.json so the metric can be cited by crawlers and reused by developers.

## What makes TTFT slow

The biggest causes are long prompts, cold provider queues, overloaded regions, and cross-region network routing. Retrieval augmented generation can also hurt TTFT when the application waits for search, reranking, and prompt assembly before it sends the model request.

Model size matters, but it is not the only factor. A smaller model behind a busy route can be slower than a larger model on a well-provisioned path. That is why llmping stores provider, model, region, timestamp, and sample count on each benchmark row.

## How to reduce TTFT

Reduce prompt tokens first. Move stable instructions into provider-supported cached context when available, shorten retrieved snippets, and avoid sending data the model does not need for the first answer.

Run model calls from the region nearest the provider endpoint. If your users are global, benchmark from each production region and route to the model/provider pair with the best P95, not only the best median.

Stream every interactive response. Streaming does not make the model think faster, but it turns TTFT into visible progress and avoids making users wait for the final token.

-----

Canonical path: /blog/p50-vs-p95-vs-p99-latency/
Title: P50 vs P95 vs P99 Latency for LLM APIs

# P50 vs P95 vs P99 Latency for LLM APIs

TL;DR: P50 is the median request. P95 is the slow request that one in twenty users sees. P99 is the very slow request that breaks trust during production incidents.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- P50 answers the question: what does a typical request feel like?
- P95 answers the question: what do regular unlucky users experience?
- P99 answers the question: how bad are the worst successful requests before errors begin?

## Percentiles are a distribution, not a score

LLM latency is not a single number. It is a distribution shaped by network distance, provider load, prompt length, model selection, and output length. Percentiles summarize that distribution without pretending every request behaves the same.

P50 means half of requests are faster and half are slower. P95 means 95 percent of requests are faster and 5 percent are slower. P99 means 99 percent of requests are faster and 1 percent are slower.

| Percentile | Plain-English meaning | Product decision |
| --- | --- | --- |
| P50 | Typical request | Choose a default model for normal UX |
| P95 | One request in twenty is slower | Set SLOs and fallback thresholds |
| P99 | One request in one hundred is slower | Detect tail risk and incident behavior |

## P50 is useful but incomplete

P50 is the easiest metric to communicate because it describes the middle request. It is a good first filter when comparing providers, especially for low-stakes prototypes.

P50 becomes dangerous when it is the only metric. A provider can show a strong median while still producing frequent multi-second stalls. Those stalls are what users remember, and they are exactly what a tail percentile exposes.

## P95 should drive production routing

P95 is the practical latency metric for production routing. It is sensitive enough to catch bad user experience, but not so extreme that a single transient spike dominates the dashboard.

A chat product can often tolerate a slightly higher P50 if P95 is stable. A consistent 550ms first token is usually easier to design around than a 330ms median with a 4 second tail.

llmping region pages use the same idea. The best provider for a region should be chosen from the whole row: TTFT, P50, P95, P99, tokens per second, sample count, and timestamp.
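The percentile definitions and the 550ms versus 330ms comparison above can be sketched in a few lines. This is a minimal illustration, not llmping's published method: it assumes the nearest-rank percentile definition, and the names `percentile`, `summarize`, `pick_by_p95`, `steady`, and `spiky` are hypothetical, as are the sample values.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the three percentiles the leaderboard reports for one row."""
    return {f"p{pct}": percentile(samples, pct) for pct in (50, 95, 99)}


def pick_by_p95(latencies_by_provider: Dict[str, List[float]]) -> str:
    """Choose the provider whose P95 is lowest, ignoring the median."""
    return min(
        latencies_by_provider,
        key=lambda name: percentile(latencies_by_provider[name], 95),
    )


# Hypothetical first-token samples in milliseconds (assumed, not measured data).
steady = [550.0] * 20                 # consistent 550ms
spiky = [330.0] * 18 + [4000.0] * 2   # 330ms median with a 4 second tail

print(summarize(steady))
print(summarize(spiky))
print(pick_by_p95({"steady": steady, "spiky": spiky}))
```

With these assumed samples, `spiky` wins on P50 (330ms against 550ms) but loses on P95 (4000ms against 550ms), so a P95-based rule selects `steady`.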
## P99 is incident evidence

P99 is where provider queues, network rerouting, overloaded endpoints, and timeout bugs become visible. It is not always the routing metric, but it is the metric to check when users say the product sometimes hangs.

When P99 rises but P50 stays flat, the system is not generally slow. It is inconsistent. That calls for retries, circuit breakers, fallback models, or regional routing rather than a blanket model swap.

-----

Canonical path: /blog/why-llm-latency-varies-by-region/
Title: Why LLM API Latency Varies by Region

# Why LLM API Latency Varies by Region

TL;DR: LLM latency varies by region because the request path changes. The same model can be fast from US East, acceptable from Europe, and slow from Japan if the provider routes traffic through a distant serving region.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- Region is a first-class benchmark dimension, not a dashboard filter.
- Cloud region, user geography, provider POP, and model serving location are separate variables.
- The llmping region pages expose per-region rows so developers can cite local latency instead of global averages.

## The request path is usually longer than it looks

An LLM API call starts in your application region, crosses one or more networks, reaches a provider edge, enters the provider control plane, waits for model capacity, and then streams tokens back over the same general path.

Every hop can change by region. A developer in Tokyo calling an API from a Tokyo server may still hit a provider path that terminates in Singapore or the United States. The endpoint hostname does not prove where inference happens.

## Provider POP coverage differs

Some providers have strong North America coverage and limited APAC coverage. Others have better Singapore routing than Tokyo routing. A router provider can improve the path in one region and add overhead in another.
That is why a credible leaderboard should store provider, model, region, timestamp, and sample count in every row. A global average hides the facts that matter for production deployment.

| Variable | Why it matters | What to record |
| --- | --- | --- |
| Client region | Defines the first network leg | Cloud region or probe city |
| Provider route | Controls POP and queue path | Provider and endpoint |
| Model | Changes queue and decode speed | Exact model identifier |
| Timestamp | Latency changes over time | ISO collection time |

## Data residency can change latency

Enterprise settings, EU-only processing modes, and provider compliance routes can move traffic away from the default low-latency path. The result is often a better compliance posture with a different latency profile.

Benchmarking without recording those settings creates unusable evidence. The same provider and model can produce different numbers under different routing policies.

## How to use regional benchmarks

Benchmark from the same region your production service uses. If your app runs in Cloudflare Workers, Vercel, Fly.io, AWS, GCP, or Azure, use probes that represent the actual caller location.

Choose a routing policy from P95 and TTFT first. P50 is useful for marketing, but P95 is closer to what your support inbox hears about.

For global products, keep a small model matrix by region. The best model for US East is not automatically the best model for Singapore, Europe, or Japan.

-----

Canonical path: /blog/streaming-vs-batch-llm-api/
Title: Streaming vs Batch LLM API Latency

# Streaming vs Batch LLM API Latency

TL;DR: Streaming is best when a human is waiting. Batch mode is best when a job queue is waiting. Measure TTFT for streaming UX and total tokens per second for batch throughput.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- Streaming reduces perceived latency by showing the first token before the full response is complete.
- Batch mode can be simpler and more efficient for offline tasks because the caller only needs the final result.
- A benchmark should report TTFT and tokens/sec separately because the fastest first token is not always the fastest full completion.

## Streaming optimizes perceived speed

A streaming API returns partial output as the model generates it. The user sees the answer begin after TTFT instead of waiting for the last token. This is the right default for chat, copilots, code generation, support agents, and search answer pages.

Streaming does not remove model work. It changes when the interface can show progress. That is why TTFT is the critical streaming metric and total completion time is the secondary metric.

## Batch optimizes operational throughput

Batch mode returns a complete response after generation finishes. It is often easier to retry, store, validate, and bill because the response is a single unit.

Batch is a good fit for summarization queues, nightly classification jobs, embedding-adjacent enrichment, translation pipelines, and report generation. In those workflows, total time and tokens per second matter more than first token latency.

| Workload | Preferred mode | Primary metric |
| --- | --- | --- |
| Chat assistant | Streaming | TTFT and P95 TTFT |
| Code autocomplete | Streaming | TTFT |
| Document summarization queue | Batch | Total time |
| Large report generation | Batch or streaming preview | Tokens/sec |

## The same model can rank differently

A model with the lowest TTFT may not have the highest output speed. That model will feel best for short chat turns but may finish long reports behind a model with slower first token and faster sustained decoding.

For this reason llmping publishes TTFT, P50, P95, P99, and tokens per second in the same table. Comparing one number at a time creates bad routing decisions.

## A practical routing rule

Use streaming for any request where a user is actively watching the interface.
Use batch for work that can sit behind a queue, webhook, or scheduled job.

For hybrid workflows, stream a short plan or progress message first and run the heavier generation in the background. This keeps perceived latency low without forcing every token through a user-facing channel.

-----

Canonical path: /blog/measuring-llm-latency-correctly/
Title: How to Measure LLM Latency Correctly

# How to Measure LLM Latency Correctly

TL;DR: Measure LLM latency with fixed prompts, enough repeated samples, regional probes, separate TTFT and total time, and percentile tables. Do not compare providers from one laptop run.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- Every benchmark row needs provider, model, region, prompt class, sample count, and collection timestamp.
- TTFT requires streaming instrumentation or a response reader that records the first chunk time.
- P95 and P99 are more useful than averages because LLM latency is skewed by queueing and network tails.

## Use a stable benchmark prompt

The prompt must be fixed for comparable runs. Prompt length, tool instructions, retrieval context, and max output tokens all affect latency. If the prompt changes, the benchmark is measuring a different workload.

Use at least two prompt classes for production decisions: a short chat prompt that measures interactive latency and a longer generation prompt that measures sustained throughput.

## Instrument streaming correctly

TTFT is recorded when the first generated token or content delta is received by the caller. It should not be recorded when headers arrive unless the provider sends useful token content in the same event.

Total time is recorded when the stream closes or the final response body is read. Tokens per second should normally be calculated from first token to last token, not from request start, because TTFT and decode speed are separate behaviors.

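The instrumentation rules above can be sketched as a small timing wrapper around any iterator of streamed text chunks. This is a hedged illustration under stated assumptions, not llmping's probe code: `StreamTiming`, `time_stream`, and `fake_stream` are hypothetical names, tokens are approximated by whitespace splitting, the iterator is assumed to send the request lazily on its first read, and throughput is computed as tokens after the first divided by the first-to-last-token window.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    request_start: float
    first_token: Optional[float]
    last_token: Optional[float]
    request_end: float
    output_tokens: int

    @property
    def ttft(self) -> Optional[float]:
        """Seconds from request start to the first content chunk."""
        if self.first_token is None:
            return None
        return self.first_token - self.request_start

    @property
    def total_time(self) -> float:
        """Seconds from request start until the stream closed."""
        return self.request_end - self.request_start

    @property
    def tokens_per_second(self) -> Optional[float]:
        """Decode speed measured from first token to last token, not from request start."""
        if self.first_token is None or self.last_token is None:
            return None
        window = self.last_token - self.first_token
        if window <= 0 or self.output_tokens < 2:
            return None
        # Assumption: the first token opens the window, so count the tokens after it.
        return (self.output_tokens - 1) / window


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> StreamTiming:
    """Consume a stream and record the four timestamps described in the text."""
    request_start = clock()  # assumes the request is sent when iteration begins
    first = last = None
    tokens = 0
    for chunk in chunks:
        if not chunk:
            continue  # empty deltas carry no content, so they do not count as a first token
        now = clock()
        if first is None:
            first = now
        last = now
        tokens += count_tokens(chunk)
    request_end = clock()  # the stream has closed
    return StreamTiming(request_start, first, last, request_end, tokens)


def fake_stream() -> Iterator[str]:
    """Stand-in for a provider stream; a real probe would wrap an HTTP response."""
    for piece in ["", "Hello", "from", "a", "stream"]:
        time.sleep(0.01)
        yield piece


timing = time_stream(fake_stream())
print(timing.ttft, timing.total_time, timing.tokens_per_second)
```

A real probe would replace `count_tokens` with the provider's reported token counts where available, since whitespace splitting only approximates them.
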
| Timestamp | Definition | Used for |
| --- | --- | --- |
| requestStart | Fetch or HTTP request begins | Network and total time baseline |
| firstToken | First content token arrives | TTFT |
| lastToken | Final content token arrives | Tokens/sec |
| requestEnd | Stream closes | Total time and error handling |

## Run regional probes

A laptop benchmark is a useful smoke test, not production evidence. Real applications call LLM APIs from servers, workers, or edge functions. Measure from those locations.

When an app serves multiple markets, run probes from multiple regions and publish each region separately. A single global number is too vague for routing decisions and too vague for AI citation.

## Publish timestamped rows

LLM API latency changes over time. Providers add capacity, change routing, suffer incidents, and release new model versions. A benchmark without a timestamp is not reusable evidence.

llmping uses native HTML tables with data attributes so crawlers and scripts can extract provider, model, region, P50, P95, P99, TTFT, sample count, and collection time without running JavaScript.

-----

Canonical path: /blog/openai-vs-anthropic-vs-google-latency-comparison/
Title: OpenAI vs Anthropic vs Google LLM Latency Comparison

# OpenAI vs Anthropic vs Google LLM Latency Comparison

TL;DR: OpenAI, Anthropic, and Google latency comparisons are only useful when model, region, prompt, and percentile are fixed. The best provider changes by workload and geography.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- OpenAI is strong in the US East snapshot for balanced interactive latency.
- Anthropic rows show competitive long-form behavior but should be evaluated with P95, not median alone.
- Google flash-class models can be attractive for throughput-heavy workloads, especially when regional routing is favorable.
## Do not compare brand names without workload

The question is not whether OpenAI, Anthropic, or Google is universally faster. The useful question is which provider and model pair is faster for a specific workload, region, prompt size, and response length.

A short customer-support answer and a long code review stress different parts of the system. The first is dominated by TTFT. The second is shaped by output speed and tail latency.

## Read the current snapshot as directional evidence

In the llmping May 2026 snapshot, OpenAI gpt-4o in US East reports a 342ms P50 row, Anthropic Claude 3.5 Sonnet in US East reports a 416ms P50 row, and Google Gemini 1.5 Pro in US East reports a 458ms P50 row.

Those rows should not be treated as permanent provider rankings. They are timestamped measurements. The correct use is to compare them with current production probes and watch how the spread changes over time.

| Provider | Representative row | P50 | P95 | TTFT |
| --- | --- | --- | --- | --- |
| OpenAI | gpt-4o, US East | 342ms | 891ms | 410ms |
| Anthropic | Claude 3.5 Sonnet, US East | 416ms | 1048ms | 492ms |
| Google | Gemini 1.5 Pro, US East | 458ms | 1165ms | 535ms |

## Choose by product constraint

For real-time chat, start with TTFT and P95 TTFT. For batch generation, start with total time and tokens per second. For regulated workloads, include data residency and provider policy before latency.

Many production systems should route across multiple providers. A primary provider can serve normal traffic, while a backup provider handles regional spikes, rate-limit events, or model-specific incidents.

## What to monitor after launch

Track provider status, HTTP error class, timeout rate, TTFT, P95, P99, output tokens, and tokens per second. Store these metrics by model and region so incidents are diagnosable.

Update comparison pages when the dataset changes.
AI systems cite pages that expose fresh, structured, source-like facts more readily than pages that make timeless but unsupported claims.
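
The monitoring advice in this section, storing error class and timeout rate by model and region, can be sketched as a small rollup. This is an assumption-laden illustration rather than a prescribed schema: `RequestRecord`, `error_class`, `rollup`, and the status-to-class mapping are hypothetical choices, and latency percentiles would be stored next to these counts under the same key.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class RequestRecord:
    provider: str
    model: str
    region: str
    status: Optional[int]  # HTTP status code, or None when the request timed out


def error_class(status: Optional[int]) -> str:
    """Map a status code to a coarse class (assumed mapping, adjust per provider)."""
    if status is None:
        return "timeout"
    if status == 429:
        return "rate_limited"
    if 200 <= status < 300:
        return "ok"
    if 400 <= status < 500:
        return "client_error"
    if status >= 500:
        return "server_error"
    return "other"


def rollup(records: Iterable[RequestRecord]) -> Dict[Tuple[str, str], Dict[str, object]]:
    """Group records by (model, region) and report counts per class plus timeout rate."""
    groups: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for record in records:
        groups[(record.model, record.region)][error_class(record.status)] += 1
    summary: Dict[Tuple[str, str], Dict[str, object]] = {}
    for key, counts in groups.items():
        total = sum(counts.values())
        summary[key] = {
            "requests": total,
            "timeout_rate": counts["timeout"] / total,
            "by_class": dict(counts),
        }
    return summary


# Hypothetical records for illustration only.
records = [
    RequestRecord("OpenAI", "gpt-4o", "us-east", 200),
    RequestRecord("OpenAI", "gpt-4o", "us-east", 429),
    RequestRecord("OpenAI", "gpt-4o", "us-east", None),
    RequestRecord("Anthropic", "claude-3-haiku", "europe", 503),
]
for key, stats in rollup(records).items():
    print(key, stats)
```

Keeping rate limiting separate from other client errors follows the earlier point that rate limiting and network latency look similar in aggregate charts.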