Browser test runtime is one of those metrics that looks simple until you try to compare it across CI providers. A pipeline that finishes in 6 minutes on one platform and 8 minutes on another does not automatically mean one provider is slower. The difference may come from runner provisioning, image pull time, browser startup, cache warmth, network locality, parallel slot allocation, or a single flaky test that retried on one run and not the other.

If you want a useful browser test runtime benchmark, you need a measurement model that separates the parts of the run. Otherwise you end up comparing cold starts to steady-state execution, or setup overhead to actual browser automation speed. That kind of comparison is noisy, easy to misread, and hard to defend in front of engineering or finance.

This article lays out a lab-style framework for measuring browser test runtime across CI providers, with enough detail to help QA leads, SDETs, engineering managers, and DevOps teams produce numbers they can trust. The goal is not to crown a universal winner. The goal is to understand where time goes, what can be optimized, and how to compare providers fairly.

What you are actually measuring

Before collecting any data, define the metric. “Test runtime” can mean several different things:

Runner startup time, the time from job scheduling to the environment being ready for your test command.
Browser launch time, the time required for Chrome, Firefox, or WebKit to become usable in the test environment.
Test execution time, the time from test framework start to completion of the suite itself, excluding infrastructure setup.
End-to-end pipeline time, the full CI job duration including checkout, install, build, cache restore, test, upload, and teardown.

A CI provider can look slow because it provisions runners slowly, even if the browser tests themselves are fast. If you do not split the stages, your benchmark will mostly measure platform overhead.

A good benchmark usually reports at least three numbers:

Provision-to-ready time, from the job being queued (or from job start, if your provider does not expose queue time) to test command start.
Ready-to-first-test time, which captures browser launch and framework boot.
Test command duration, which is the closest proxy for actual browser automation speed.

If your team also cares about developer feedback time, include the full pipeline number too. Just do not confuse it with the speed of the browser automation layer.

Why CI provider performance is hard to compare

The phrase CI provider performance sounds like a single attribute, but it is really a stack of independent behaviors. Two providers can be identical in one layer and very different in another.

1. Provisioning is often the biggest variable

Some providers keep warm capacity around and hand you a runner quickly. Others need longer to provision a container or VM. The difference can dwarf the time spent actually running tests, especially for short suites.

2. Shared infrastructure introduces variance

Even if average times are close, one provider may have more jitter. For benchmark work, variability matters as much as mean runtime. A team that values predictable feedback may prefer a slightly slower provider with tighter spread over a faster but erratic one.

3. Caches are not comparable unless you make them comparable

Dependency caches, browser caches, build caches, and Docker layer caches all influence runtime. If one provider restores caches on every run and another starts from scratch, the benchmark has already become a comparison of cache configurations, not of CI providers.

4. Browser behavior changes under different resource limits

Browser automation speed is affected by CPU share, memory pressure, disk I/O, and whether the browser is headless or headed. A browser suite that is CPU-bound on one runner may become network-bound on another if the environment changes the timing of downloads, auth flows, or fixture setup.

5. Test data and parallelization can distort the result

If one provider gives you 2 parallel slots and another gives you 8, then the runtime comparison may really be a concurrency comparison. That can still be useful, but it needs to be labeled clearly.

Define a benchmark model before you run anything

The easiest way to get misleading results is to start with a real production workflow and call it a benchmark. Benchmarks need constraints.

A practical model for browser CI benchmarks has four isolated segments:

Segment A, runner provisioning

Measure from job queued to the moment the test container or VM is available.

Segment B, environment preparation

Measure install, cache restore, build, test fixture seeding, and any setup scripts before the browser starts.

Segment C, browser and framework boot

Measure the time from starting the test command to the first browser page ready event or equivalent marker.

Segment D, suite execution

Measure the actual test body, including waits, assertions, navigation, DOM interaction, and test-level retries.

If possible, record each segment separately, then also capture a total. That gives you both the component diagnosis and the executive summary.
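The four segments above can be sketched as simple differences between boundary timestamps. This is a minimal illustration, not a prescribed schema: the `RunTimestamps` fields and the example values are assumptions made up for this sketch, and your provider may expose these boundaries under different names or not at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTimestamps:
    """Boundary timestamps for one CI run, in seconds (hypothetical field names)."""
    job_queued: float
    runner_ready: float
    test_command_start: float
    first_page_ready: float
    test_command_end: float


def segment_durations(ts: RunTimestamps) -> dict[str, float]:
    """Split one run into segments A to D plus the total, in seconds."""
    return {
        "A_provisioning": ts.runner_ready - ts.job_queued,
        "B_environment_prep": ts.test_command_start - ts.runner_ready,
        "C_browser_boot": ts.first_page_ready - ts.test_command_start,
        "D_suite_execution": ts.test_command_end - ts.first_page_ready,
        "total": ts.test_command_end - ts.job_queued,
    }


# Made-up timestamps for illustration only.
example = RunTimestamps(
    job_queued=0.0,
    runner_ready=42.0,
    test_command_start=130.0,
    first_page_ready=136.5,
    test_command_end=310.0,
)
for name, seconds in segment_durations(example).items():
    print(f"{name}: {seconds:.1f}s")
```

Storing the raw boundaries rather than precomputed durations makes it easier to redefine a segment later without re-running the benchmark.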

Choose a representative suite, not a toy demo

Benchmarking with a trivial login test can hide the real cost profile of your project. Benchmarking with your full production suite can make the experiment too noisy to interpret. The middle path is to use one or more representative slices:

A smoke slice, 5 to 15 tests that cover common flows.
A workflow slice, a realistic user journey or two that include navigation, form entry, and assertion timing.
A load slice, a moderately sized subset that exercises fixtures, parallelism, and browser churn.

Try to include test types that reflect your actual mix:

DOM-heavy checks
API-assisted setup
Visual assertions if they are part of your workflow
Authentication flows
Cross-browser variants, if your CI matrix includes them

Do not benchmark just the easiest tests, unless your real goal is to understand best-case runtime. If your production suite spends 80 percent of its time in setup and 20 percent in assertions, the benchmark should reflect that balance.

Stabilize the test environment

A benchmark is only useful if the environment is controlled enough to repeat. That does not mean you must eliminate every source of variation, but you should standardize the obvious ones.

Pin the software stack

Use the same versions across providers whenever possible:

Browser version
Test framework version
Node.js, Python, or Java runtime
OS image or base container
Browser driver version, if applicable

Keep the test code identical

Use the same branch or commit, the same environment variables, and the same fixture data. If your tests depend on seed data, seed it in the same way on every run.

Normalize parallelism

If you compare serial runs on one provider and parallel runs on another, the numbers are not comparable. Pick a consistent parallelism level, or report runtime per shard and total wall-clock time separately.
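One way to report per-shard runtime and wall-clock time separately is sketched below. It assumes all shards start at the same moment, so wall-clock time equals the slowest shard. That assumption often does not hold when slots are scarce, and the shard durations here are made up.

```python
def shard_summary(shard_seconds: list[float]) -> dict[str, float]:
    """Summarize one sharded run. Assumes all shards start at the same time."""
    total = sum(shard_seconds)
    return {
        "shards": len(shard_seconds),
        "wall_clock_s": max(shard_seconds),  # the slowest shard gates the job
        "total_compute_s": total,            # summed runner time across shards
        "mean_per_shard_s": total / len(shard_seconds),
    }


# Made-up shard durations in seconds, for illustration only.
two_slots = shard_summary([310.0, 290.0])
eight_slots = shard_summary([80.0, 75.0, 78.0, 82.0, 76.0, 79.0, 74.0, 76.0])

for label, summary in (("2 slots", two_slots), ("8 slots", eight_slots)):
    print(label, summary)
```

In this made-up data the total compute time is nearly identical, while wall-clock time differs by almost a factor of four. Reporting both numbers keeps a concurrency difference from being read as a speed difference.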

Avoid accidental cache advantage

There are three common cache states:

Cold, nothing is cached.
Warm within the same provider, caches persist between runs.
Warm from local re-use, which is often not available in real CI.

For fairness, run both cold and warm scenarios, and label them clearly. Cold tells you the first-run penalty. Warm tells you steady-state performance.

If your developers regularly re-run the same pipeline on a branch with warm caches, steady-state numbers matter. If your main pain is every PR’s first run, cold-start cost matters more.

Instrument the benchmark with explicit timestamps

You cannot infer reliable sub-stage runtime from a single “job duration” value. Add timestamps at the boundaries you care about.

Here is a simple Playwright-style example that records elapsed time from the moment the test file is loaded to the start of the test and to the first page load marker:

import { test, expect } from '@playwright/test';

// Reference point: set when the test runner loads this file.
const started = Date.now();

test('login flow', async ({ page }) => {
  // One JSON object per line keeps the markers easy to parse from CI logs.
  console.log(JSON.stringify({ phase: 'test_start', ms: Date.now() - started }));

  await page.goto('https://example.com/login');
  console.log(JSON.stringify({ phase: 'page_loaded', ms: Date.now() - started }));

  await expect(page.getByRole('heading', { name: 'Sign in' })).toBeVisible();
});

This is not a full measurement system, but it shows the principle. Emit structured timestamps at key transitions so you can separate browser boot from suite work.

For Selenium, you can do the same thing with Python and simple logging:

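import json
import time

# monotonic() is not affected by system clock adjustments during the run
start = time.monotonic()
print(json.dumps({"phase": "test_start", "ms": 0}))

# navigate, wait, assert

elapsed_ms = round((time.monotonic() - start) * 1000)
print(json.dumps({"phase": "after_login_page", "ms": elapsed_ms}))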

The important thing is not the language. It is the discipline of marking the boundaries.
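Once markers are in the log, a small parser can pull them back out. The sketch below assumes each marker is a single JSON object on its own line with `phase` and `ms` keys, as in the Playwright example. The sample log text is invented for illustration, and `parse_markers` is a hypothetical helper name.

```python
import json


def parse_markers(log_text: str) -> dict[str, int]:
    """Collect {"phase": ..., "ms": ...} markers from mixed log output."""
    markers: dict[str, int] = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ordinary log line, not a marker
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not
        if isinstance(record, dict) and "phase" in record and "ms" in record:
            markers[record["phase"]] = record["ms"]
    return markers


# Invented log excerpt, for illustration only.
sample_log = """\
Running 1 test using 1 worker
{"phase":"test_start","ms":812}
{"phase":"page_loaded","ms":2140}
  1 passed (3.1s)
"""
markers = parse_markers(sample_log)
# Milliseconds between the two markers
print(markers["page_loaded"] - markers["test_start"])
```

If a phase can be emitted more than once per run, for example on retries, this sketch keeps only the last value. A real harness would probably want to keep all of them.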

Collect the right CI metadata

A runtime number without context is difficult to trust. Alongside the test times, capture metadata about the run:

CI provider name
Runner type or machine class
Region or availability zone, if available
CPU and memory limits
Container image digest or VM image version
Browser version and framework version
Commit SHA and branch name
Cache hit or miss status
Number of parallel shards
Retry count

This metadata lets you answer useful questions later, such as whether a slowdown came from a specific runner class or only from a certain region.

If your provider exposes job logs or an API, make a habit of exporting them into the benchmark dataset. Raw logs often contain the only reliable evidence of whether the job was delayed before the test even started.
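A metadata record can be assembled at the start of each run with a few lines of code. In this sketch the `BENCH_*` environment variable names are hypothetical placeholders. Each CI provider exposes its own variables, so the mapping is something you would fill in per provider.

```python
import json
import os
import platform
import uuid


def collect_run_metadata(env: dict[str, str]) -> dict[str, object]:
    """Build one metadata record for a benchmark run.

    The BENCH_* variable names are hypothetical; map them to whatever
    your CI provider actually exposes.
    """
    return {
        "run_id": uuid.uuid4().hex,
        "provider": env.get("BENCH_PROVIDER", "unknown"),
        "runner_class": env.get("BENCH_RUNNER_CLASS", "unknown"),
        "region": env.get("BENCH_REGION", "unknown"),
        "commit_sha": env.get("BENCH_COMMIT_SHA", "unknown"),
        "cache_hit": env.get("BENCH_CACHE_HIT", "unknown"),
        "shards": env.get("BENCH_SHARDS", "unknown"),
        "os": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),  # logical CPUs visible to the process
    }


print(json.dumps(collect_run_metadata(dict(os.environ)), indent=2))
```

Recording "unknown" explicitly, rather than omitting the key, makes gaps in the dataset visible when you analyze it later.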

Separate cold starts from real slowdowns

This is the core of the benchmark design.

A cold start is usually infrastructure-related. Examples include:

Runner provisioning delay
Docker image pull time
Dependency install from scratch
Browser download or first-time cache fill
DNS or network setup lag

A real slowdown is usually execution-related. Examples include:

Slower navigation because of remote test data or backend latency
Longer waits due to rendering or selector instability
Increased assertion time from DOM churn
More retries from flaky timing assumptions
Resource contention inside the runner

The tricky part is that these can mask each other. A cold start may make a suite look slow even if the suite body is unchanged. A real slowdown may be hidden by a warm cache, making the environment look healthier than it is.

The cleanest approach is to run two benchmark modes:

Cold mode, no reused workspace, no prebuilt artifacts, and cache cleared or explicitly unavailable.
Warm mode, with cache policy held constant and a defined number of repeat runs.

Then compare both sets. If the provider differs mainly in cold starts, that tells you something actionable about developer experience and branch feedback. If it differs in warm execution, that points more directly to browser automation speed or environment performance.
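The cold-versus-warm comparison can be reduced to a small report per provider. The sketch below uses made-up totals and hypothetical provider names. The "cold penalty" is simply the difference between the cold and warm medians, which is one reasonable definition among several.

```python
from statistics import median

# Made-up total durations in seconds, keyed by (provider, mode).
runs = {
    ("provider-a", "cold"): [410.0, 395.0, 430.0],
    ("provider-a", "warm"): [250.0, 245.0, 260.0],
    ("provider-b", "cold"): [330.0, 340.0, 325.0],
    ("provider-b", "warm"): [255.0, 250.0, 262.0],
}


def mode_report(data: dict[tuple[str, str], list[float]], provider: str) -> dict[str, float]:
    """Median per mode, plus the extra time a cold run costs over a warm one."""
    cold = median(data[(provider, "cold")])
    warm = median(data[(provider, "warm")])
    return {"cold_median_s": cold, "warm_median_s": warm, "cold_penalty_s": cold - warm}


for provider in ("provider-a", "provider-b"):
    print(provider, mode_report(runs, provider))
```

In this invented data the two providers are nearly identical when warm and differ mainly in cold-start penalty, which is the pattern that points at provisioning and caching rather than browser automation speed.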

Use repeated runs and summarize variance

A single run tells you almost nothing. CI noise is real.

For each provider and each scenario, run enough repetitions to see spread, not just average. You do not need an academic sample size, but you do need more than one data point. In practice, 10 to 20 runs per condition is often enough to spot obvious differences in variance and medians, especially if the suite is short.

When summarizing, prefer:

Median runtime
95th percentile runtime
Min and max, with caution
Standard deviation or interquartile range

Avoid relying only on mean runtime. One unusually slow provisioning event can distort the average badly on short jobs.
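These summary statistics are available in Python's standard `statistics` module. The sketch below uses ten made-up durations, one of which is a slow provisioning outlier, to show how far the mean drifts from the median. The "inclusive" quantile method is one of two the module offers, and the choice between them is a judgment call for small samples.

```python
from statistics import mean, median, quantiles, stdev


def summarize(durations: list[float]) -> dict[str, float]:
    """Summary statistics for one provider/mode condition (needs 2+ runs)."""
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(durations, n=20, method="inclusive")[-1]
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    return {
        "median": median(durations),
        "p95": p95,
        "iqr": q3 - q1,
        "stdev": stdev(durations),
        "mean": mean(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Ten made-up run durations in seconds; the last one hit a slow provisioning event.
sample = [60.0, 61.0, 62.0, 60.0, 63.0, 61.0, 62.0, 60.0, 61.0, 180.0]
summary = summarize(sample)
print(f"median={summary['median']}s mean={summary['mean']}s")
```

With this sample the median is 61 seconds and the mean is 73 seconds. A single outlier moves the mean by 12 seconds while leaving the median untouched.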

A simple way to structure the summary table is:

Provider	Mode	Median total	Median cold start	Median test execution	p95 total
Provider A	cold	…	…	…	…
Provider A	warm	…	…	…	…
Provider B	cold	…	…	…	…
Provider B	warm	…	…	…	…

You do not need perfect statistical formalism to make this useful. You do need consistency.

Build a benchmark harness that does not change the test itself

One common mistake is to add too much instrumentation inside the test logic, then accidentally slow down one provider more than another. Keep the benchmark harness lightweight.

Useful harness duties include:

Printing timestamps at known boundaries
Capturing environment metadata
Preserving raw logs and screenshots on failure
Exposing cache hit or miss state
Tagging each run with a unique ID

Try to avoid heavy in-test logging every few milliseconds. That can change the runtime profile and make the benchmark more about logging overhead than actual browser automation speed.
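One way to keep instrumentation out of the test logic is to wrap the test command from the outside. The sketch below is a minimal illustration using a hypothetical helper, `run_with_markers`. It measures only wall-clock time around the command, and the stand-in command exists purely so the example runs anywhere.

```python
import json
import subprocess
import sys
import time
import uuid


def run_with_markers(command: list[str]) -> dict[str, object]:
    """Run a test command unchanged and record wall-clock time around it."""
    run_id = uuid.uuid4().hex  # unique tag for this run
    started = time.monotonic()
    completed = subprocess.run(command, check=False)
    elapsed_ms = round((time.monotonic() - started) * 1000)
    return {
        "run_id": run_id,
        "command": command,
        "exit_code": completed.returncode,
        "test_command_ms": elapsed_ms,
    }


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with your
    # real test command, for example ["npm", "test"].
    record = run_with_markers([sys.executable, "-c", "print('tests ran')"])
    print(json.dumps(record))
```

Because the wrapper adds two clock reads and one process spawn per run, its overhead should be the same on every provider, which is the property the harness needs.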

Here is a small GitHub Actions example that shows how to make the benchmark repeatable across runs:

name: browser-runtime-benchmark
on:
  workflow_dispatch:

jobs:
  run:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repeat: [1, 2, 3, 4, 5]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test -- --grep "smoke"

This is intentionally simple. In a real benchmark, you would also export logs and runtime markers, but the idea is the same: one controlled execution per repeat.

Treat browser choice as a separate variable

Chrome, Firefox, and WebKit do not behave identically in CI. If you benchmark all three at once without labeling the browser, your results will be difficult to interpret.

Browser differences can affect:

Launch time
Memory consumption
Page render timing
Selector stability
Video or screenshot overhead

If your organization only cares about Chrome in production, benchmark Chrome first. If cross-browser testing is a requirement, split the matrix and report browser-specific numbers. Do not collapse them into a single average unless you also report the distribution by browser.
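Reporting browser-specific numbers is mostly a matter of grouping before summarizing. The sketch below uses made-up durations and an assumed record shape with `browser` and `seconds` keys.

```python
from collections import defaultdict
from statistics import median

# Made-up suite durations in seconds, for illustration only.
records = [
    {"browser": "chromium", "seconds": 100.0},
    {"browser": "chromium", "seconds": 104.0},
    {"browser": "chromium", "seconds": 102.0},
    {"browser": "firefox", "seconds": 130.0},
    {"browser": "firefox", "seconds": 128.0},
    {"browser": "firefox", "seconds": 135.0},
    {"browser": "webkit", "seconds": 150.0},
    {"browser": "webkit", "seconds": 160.0},
    {"browser": "webkit", "seconds": 155.0},
]


def median_by_browser(rows: list[dict]) -> dict[str, float]:
    """Group run durations by browser and take the median of each group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        grouped[row["browser"]].append(row["seconds"])
    return {browser: median(values) for browser, values in grouped.items()}


per_browser = median_by_browser(records)
collapsed = median(row["seconds"] for row in records)
print(per_browser, collapsed)
```

Here the collapsed median happens to equal the Firefox median and says nothing about the roughly 50-second gap between the fastest and slowest browser.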

Include failure paths, but do not mix them into the main runtime metric

Failed tests often take longer because they produce traces, screenshots, and retries. That is useful operationally, but it should not pollute your main runtime benchmark unless you are explicitly measuring “time to diagnose failure.”

A good practice is to run two related measurements:

Happy-path runtime, where tests are expected to pass.
Failure overhead, where one controlled failure path is exercised to measure trace collection and reporting costs.

This is especially important if your CI provider or automation tool uploads large artifacts. Artifact handling can dominate time on failing jobs.

When the benchmark reveals something actionable

A well-built benchmark should help you decide among specific options, such as:

Whether to move from container runners to dedicated VMs
Whether to warm dependencies differently
Whether to reduce setup in each shard
Whether browser downloads should be baked into the image
Whether your suite should be split into smaller groups
Whether a provider’s variance is acceptable for your team’s release cadence

The benchmark is also useful for identifying which layer is the real bottleneck. For example, if cold start dominates, optimizing selectors will not move the total much. If test execution dominates, then browser interaction and fixture design deserve attention.

A practical decision framework

When comparing CI providers, ask these questions in order:

1. Is the provider’s cold-start overhead acceptable for our branch feedback expectations?
2. Is the warm test execution time competitive for our suite size and browser mix?
3. Is variance low enough that developers can trust the feedback?
4. Are caches, parallel slots, and runner classes comparable to our production usage?
5. Does artifact upload or reporting slow the pipeline more than the browser itself?

If the answer to number 1 is no, you may need a provider with better provisioning or a different job strategy. If number 2 is no, the environment may be underpowered. If number 3 is no, the benchmark should be repeated before making a platform decision.
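The first three questions lend themselves to a simple ordered check against thresholds your team agrees on. Everything in this sketch is an assumption: the class names, the threshold values, and the use of a p95-to-median ratio as a stand-in for "variance low enough". The last two questions need human judgment and are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderSummary:
    cold_start_median_s: float
    warm_execution_median_s: float
    warm_execution_p95_s: float


@dataclass(frozen=True)
class Thresholds:
    """Team-specific limits; the values used below are placeholders."""
    max_cold_start_s: float
    max_warm_execution_s: float
    max_p95_to_median_ratio: float


def first_failed_question(summary: ProviderSummary, limits: Thresholds) -> Optional[int]:
    """Return the number of the first question answered "no", or None."""
    if summary.cold_start_median_s > limits.max_cold_start_s:
        return 1  # cold-start overhead too high
    if summary.warm_execution_median_s > limits.max_warm_execution_s:
        return 2  # warm execution not competitive
    ratio = summary.warm_execution_p95_s / summary.warm_execution_median_s
    if ratio > limits.max_p95_to_median_ratio:
        return 3  # spread too wide to trust the feedback
    return None


limits = Thresholds(max_cold_start_s=90.0, max_warm_execution_s=300.0, max_p95_to_median_ratio=1.5)
candidate = ProviderSummary(cold_start_median_s=60.0, warm_execution_median_s=240.0, warm_execution_p95_s=420.0)
print(first_failed_question(candidate, limits))
```

Checking in order matters because a provider that fails on cold start may never need its warm numbers examined.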

Common mistakes that make browser runtime benchmarks useless

Here are the most common ways these benchmarks go wrong:

Comparing one cold run to one warm run
Changing test code between providers
Leaving cache policy undocumented
Mixing browser matrix results together
Using full pipeline time as if it were test runtime
Ignoring retries and flaky test behavior
Running too few repetitions
Comparing a container runner with a prewarmed VM without labeling the difference
Letting artifact upload time obscure execution time

If any of these are happening, your benchmark may still be informative, but it should not be used as a final decision input.

Where Endtest can fit into the comparison

If you are evaluating browser automation setups alongside traditional code-first frameworks, it can be useful to include one low-code or agentic AI reference point as a baseline for reporting and execution workflow. For example, Endtest offers agentic AI and low-code/no-code test creation, along with CI/CD integrations, parallel testing, and reporting features that can be benchmarked for execution workflow alongside other browser automation options.

For a fair comparison, do not benchmark Endtest as if it were a source-code library. Instead, measure the same things you would measure for any browser automation setup: queue time, startup time, execution duration, artifact handling, and result visibility. The question is not whether one tool “feels faster,” but where the time goes and how quickly teams can get a trustworthy result.

A benchmark report template you can reuse

When you publish your internal benchmark, use a structure like this:

Scope, what suites, browsers, and providers were tested
Environment, runner specs, image versions, caches, and parallelism
Method, what timestamps were recorded and how repeats were run
Results, median, p95, and variance for cold and warm modes
Interpretation, what appears to be infrastructure cost versus test cost
Limitations, what was not controlled or measured
Recommendation, which provider or setup fits which use case

That structure keeps the report useful long after the first decision is made.

Final takeaway

A browser test runtime benchmark is only trustworthy when it separates cold starts from real slowdowns. Once you split runner provisioning, browser boot, and actual suite execution, CI provider performance becomes much easier to understand. You can then compare providers on the dimensions that matter: latency, consistency, caching behavior, and the real cost of browser automation in your workflow.

If your team is planning a provider switch, or just trying to reduce PR feedback time, start with a small benchmark matrix, instrument the boundaries, and run enough repeats to see variance. The result will be less dramatic than a one-off timing screenshot, but far more useful.

For teams also evaluating execution and reporting workflows in a broader benchmark set, it can be worth comparing a few platforms side by side, including agentic AI and low-code options like Endtest, especially when you care about how quickly teams can create, run, and review browser tests.