# How to Measure LLM Latency Correctly

TL;DR: Measure LLM latency with fixed prompts, enough repeated samples to estimate tail percentiles, regional probes, separate TTFT and total time, and percentile tables. Do not compare providers from one laptop run.

Published: 2026-05-12
Updated: 2026-05-12
Source dataset: https://llmping.com/leaderboard/

## Key facts

- Every benchmark row needs provider, model, region, prompt class, sample count, and collection timestamp.
- TTFT requires streaming instrumentation or a response reader that records the first chunk time.
- P95 and P99 are more useful than averages because LLM latency is right-skewed: queueing and network tails produce a few very slow requests that an average hides.
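
To illustrate the last point, here is a minimal sketch that compares the mean with nearest-rank percentiles. The latency values are invented for the example, and `percentile` is a hypothetical helper, not part of any benchmark tool.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical total latencies in milliseconds: 18 typical requests and 2 slow ones.
latencies_ms = [400] * 18 + [3000, 5000]

mean_ms = statistics.mean(latencies_ms)  # pulled upward by the two slow requests
p50_ms = percentile(latencies_ms, 50)    # what a typical request looks like
p95_ms = percentile(latencies_ms, 95)    # what the slow tail looks like
p99_ms = percentile(latencies_ms, 99)    # with only 20 samples this is simply the maximum

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

The mean (760 ms) matches no request that actually happened, while P50 and P95 describe the typical and slow cases separately. With 20 samples, P99 collapses to the single slowest request, which is why the sample count belongs in every row.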

## Use a stable benchmark prompt

The prompt must be fixed for comparable runs. Prompt length, tool instructions, retrieval context, and max output tokens all affect latency. If the prompt changes, the benchmark is measuring a different workload.

Use at least two prompt classes for production decisions: a short chat prompt that measures interactive latency and a longer generation prompt that measures sustained throughput.
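
One way to keep runs comparable is to pin each prompt class in code and store a fingerprint of it with every result. The sketch below assumes two hypothetical prompt classes; the names, prompt text, and token limits are placeholders, not llmping's actual workloads.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptClass:
    """A fixed workload definition. Any change produces a different fingerprint."""
    name: str
    prompt: str
    max_output_tokens: int

    def fingerprint(self) -> str:
        # Hash every field that affects latency so a silent edit is detectable.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical prompt classes for illustration only.
SHORT_CHAT = PromptClass("short_chat", "Reply with a one-sentence greeting.", 64)
LONG_GENERATION = PromptClass(
    "long_generation", "Write a 500-word summary of how HTTP caching works.", 1024
)

for prompt_class in (SHORT_CHAT, LONG_GENERATION):
    print(prompt_class.name, prompt_class.fingerprint())
```

Two result sets are comparable only when their fingerprints match. If tool instructions or retrieval context are part of the workload, they would need to be added as fields so they are hashed too.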

## Instrument streaming correctly

TTFT is recorded when the first generated token or content delta is received by the caller. It should not be recorded when headers arrive unless the provider sends useful token content in the same event.

Total time is recorded when the stream closes or the final response body is read. Tokens per second should normally be calculated from first token to last token, not from request start, because TTFT and decode speed are separate behaviors.

| Timestamp | Definition | Used for |
| --- | --- | --- |
| requestStart | Fetch or HTTP request begins | Baseline for TTFT and total time |
| firstToken | First content token arrives | TTFT; start of the decode window |
| lastToken | Final content token arrives | Tokens/sec (end of the decode window) |
| requestEnd | Stream closes or the final body is read | Total time and error handling |
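
The four timestamps can be captured with a small wrapper around any chunk iterator. This is a minimal sketch, not a provider SDK: `time_stream`, `StreamTiming`, and the `open_stream` callable are hypothetical names, and it assumes each non-empty chunk counts as one token. Real token counts should come from the provider's usage data or a tokenizer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    request_start: float          # requestStart in the table above
    first_token: Optional[float]  # firstToken
    last_token: Optional[float]   # lastToken
    request_end: float            # requestEnd
    token_count: int

    @property
    def ttft(self) -> Optional[float]:
        if self.first_token is None:
            return None
        return self.first_token - self.request_start

    @property
    def total_time(self) -> float:
        return self.request_end - self.request_start

    @property
    def tokens_per_second(self) -> Optional[float]:
        # Decode speed only: N tokens span N - 1 intervals between first and last token.
        if self.token_count < 2 or self.last_token == self.first_token:
            return None
        return (self.token_count - 1) / (self.last_token - self.first_token)


def time_stream(
    open_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Start the request, consume its chunks, and record the four timestamps."""
    request_start = clock()
    first_token = None
    last_token = None
    token_count = 0
    for chunk in open_stream():
        if not chunk:
            continue  # headers, keep-alives, or empty deltas do not count as content
        now = clock()
        if first_token is None:
            first_token = now
        last_token = now
        token_count += 1
    request_end = clock()
    return StreamTiming(request_start, first_token, last_token, request_end, token_count)
```

In practice `open_stream` would wrap the streaming call of whichever client is in use and yield only content deltas. Passing `clock` as a parameter keeps the timing logic testable with a fake clock.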

## Run regional probes

A laptop benchmark is a useful smoke test, not production evidence. Real applications call LLM APIs from servers, workers, or edge functions. Measure from those locations.

When an app serves multiple markets, run probes from multiple regions and publish each region separately. A single global number is too vague for routing decisions and for AI citation.
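
The sketch below shows why a pooled number misleads. It groups hypothetical probe results by region and compares per-region percentiles with the pooled ones. The region names and latencies are invented for illustration.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile over a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    return {"samples": len(samples), "p50": percentile(samples, 50), "p95": percentile(samples, 95)}


# Hypothetical probe results: (region, total latency in ms).
probes = [
    ("us-east", 300), ("us-east", 320), ("us-east", 340), ("us-east", 360), ("us-east", 900),
    ("ap-southeast", 800), ("ap-southeast", 850), ("ap-southeast", 900),
    ("ap-southeast", 950), ("ap-southeast", 2100),
]

by_region = defaultdict(list)
for region, latency_ms in probes:
    by_region[region].append(latency_ms)

per_region = {region: summarize(samples) for region, samples in by_region.items()}
pooled = summarize([latency_ms for _, latency_ms in probes])

for region, stats in per_region.items():
    print(region, stats)
print("pooled", pooled)
```

Here the pooled P50 is 800 ms, while the per-region P50s are 340 ms and 900 ms. The global figure describes neither region, so it cannot support a routing decision.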

## Publish timestamped rows

LLM API latency changes over time. Providers add capacity, change routing, suffer incidents, and release new model versions. A benchmark without a timestamp is not reusable evidence.
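
A row type can enforce this at write time. The following sketch assumes a hypothetical `BenchmarkRow` record carrying the fields listed under Key facts, and it rejects rows whose timestamp has no timezone. The field names are illustrative, not a published schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BenchmarkRow:
    provider: str
    model: str
    region: str
    prompt_class: str
    sample_count: int
    collected_at: datetime
    p50_ms: float
    p95_ms: float
    p99_ms: float

    def __post_init__(self):
        # A naive timestamp is ambiguous across regions, so refuse it.
        if self.collected_at.tzinfo is None:
            raise ValueError("collected_at must be timezone-aware")
        if self.sample_count <= 0:
            raise ValueError("sample_count must be positive")

    def age_days(self, now: datetime) -> float:
        """How old the measurement is, so consumers can discard stale rows."""
        return (now - self.collected_at).total_seconds() / 86400


# Hypothetical values for illustration only.
row = BenchmarkRow(
    provider="example-provider",
    model="example-model",
    region="us-east",
    prompt_class="short_chat",
    sample_count=200,
    collected_at=datetime(2026, 5, 12, tzinfo=timezone.utc),
    p50_ms=410.0,
    p95_ms=980.0,
    p99_ms=1500.0,
)
print(row.age_days(datetime(2026, 5, 14, tzinfo=timezone.utc)))
```

A consumer can then filter on `age_days` before trusting a row, for example ignoring anything collected before a known provider incident or model release.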

llmping uses native HTML tables with data attributes so crawlers and scripts can extract provider, model, region, P50, P95, P99, TTFT, sample count, and collection time without running JavaScript.
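
As a rough illustration of that kind of extraction, the sketch below reads `data-*` attributes from table rows using only the Python standard library. The attribute names and the sample markup are assumptions made for this example and may not match llmping's actual pages.

```python
from html.parser import HTMLParser

# Hypothetical markup: the attribute names here are assumed, not taken from llmping.
SAMPLE_HTML = """
<table>
  <tr><th>Provider</th><th>Model</th><th>P95</th></tr>
  <tr data-provider="example-provider" data-model="example-model" data-region="us-east"
      data-p50-ms="410" data-p95-ms="980" data-samples="200"
      data-collected-at="2026-05-12T00:00:00Z">
    <td>example-provider</td><td>example-model</td><td>980 ms</td>
  </tr>
</table>
"""


class RowExtractor(HTMLParser):
    """Collects the data-* attributes of every <tr> that has any."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        data = {
            name[len("data-"):]: value
            for name, value in attrs
            if name.startswith("data-")
        }
        if data:  # header rows carry no data attributes and are skipped
            self.rows.append(data)


parser = RowExtractor()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Because the values sit in attributes of static HTML, no JavaScript execution or headless browser is needed. Values arrive as strings, so numeric fields still have to be converted by the consumer.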
