How to Benchmark Browser Startup Overhead in CI Before You Blame Your Test Suite

When a browser-based test suite feels slow, the instinct is often to blame specs, selectors, or flaky waits. Sometimes that is correct. Very often, though, the real cost sits before the first assertion ever runs, in container boot, dependency installation, test runner startup latency, browser launch time in CI, authentication setup, and the first page navigation.

If you do not measure those phases separately, you can spend hours optimizing the wrong layer. A browser startup overhead benchmark gives you a way to see where the time actually goes, and to decide whether the fix belongs in infrastructure, test design, parallelization, or the browser automation framework itself.

This article lays out a practical benchmarking plan for teams running Playwright, Selenium, Cypress, or similar tools in CI. The goal is not to produce a single magic number, because startup overhead is not one thing. The goal is to build a repeatable experiment that splits the pipeline into measurable stages.

If your suite is slow only in CI, start by measuring startup phases separately before rewriting specs. Many “test suite” problems turn out to be environment problems.

What browser startup overhead actually includes

In a local dev loop, a browser test may feel simple: start the runner, open a browser, visit a page, assert something. In CI, that same sequence often includes multiple hidden steps:

Container or VM boot.
Dependency restore, package install, and cache hydration.
Test runner process startup.
Browser binary launch.
Auth bootstrap, fixture loading, or session setup.
First navigation to the app or staging environment.
Any browser context creation, tracing, video, or coverage instrumentation.

A browser startup overhead benchmark should measure these layers separately whenever possible. If you only time the whole job, you will know the suite is slow, but not which knob to turn.

For background, the general ideas of software testing, test automation, and continuous integration are useful, but this topic is more specific. The question is not whether automated tests are valuable. It is whether the startup path is dominated by browser launch time, environment boot, or application initialization.

The benchmark question you should answer first

Before collecting numbers, define the question precisely. A good benchmark question sounds like this:

How much of our CI job time is spent before the first test assertion?
How much slower is browser launch in CI compared with local or containerized runs?
How much startup cost comes from auth setup versus first-page navigation?
Which parts vary the most between runs, and which are stable enough to optimize?

A weak question sounds like this:

Is Playwright faster than Cypress?
Are our tests slow because the browser is slow?
Should we rewrite the suite?

Those questions mix tool choice, suite design, environment setup, and application behavior. A usable browser startup overhead benchmark isolates one variable at a time.

Split the run into measurable phases

The simplest useful model is four phases:

1. Container or machine startup

This covers the time from job scheduling to the shell being ready. In hosted CI, this includes runner allocation and image start. In self-hosted environments, it may include VM wake-up or autoscaling.

2. Test runner startup latency

This is the time from invoking the test command to the runner being ready to execute the first test. It can include Node.js process startup, framework initialization, loading config, transpilation, and browser driver wiring.

3. Browser launch time in CI

This is the time from the runner asking for a browser to the browser being usable. It includes downloading or locating the binary, sandbox setup, GPU or headless configuration, profile creation, and any remote control handshake.

4. First navigation and auth setup

This is the first real interaction with your app. It may include logging in, setting cookies, waiting for client-side hydration, and navigating to a route that your tests depend on.

The first page load is often the most misleading number in the suite. It is not just browser startup, it is browser startup plus app startup plus network cost.

If you can time each phase independently, you can stop attributing all startup time to the test framework.

Build a benchmark harness, not just a test

A benchmark should be runnable in CI and locally, but it should also be intentionally small. Do not measure the full production suite first. Create a dedicated harness that can run repeatedly with minimal test logic.

A practical harness should support three modes:

cold environment, with caches cleared or disabled,
warm environment, with normal CI caching,
isolated phase timing, with timestamps around each step.

For example, with Playwright you can instrument startup timestamps directly in the test process.

import { test } from '@playwright/test';

// Captured when the worker loads this file, so it excludes CI boot and dependency install.
const t0 = Date.now();

test.beforeAll(async ({ browser }) => {
  // The browser fixture has already launched by the time this hook runs.
  console.log(`runner_ready_ms=${Date.now() - t0}`);
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(`first_navigation_ms=${Date.now() - t0}`);
  await page.close();
});

test('benchmark placeholder', async () => {
  console.log(`test_body_ready_ms=${Date.now() - t0}`);
});

This is not a full benchmark by itself, but it gives you structured timestamps. In CI, structured logs are easier to parse than ad hoc console output.

For Selenium, a similar idea works if you log timestamps around driver creation and navigation.

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# perf_counter is monotonic, which makes it safer for intervals than time.time()
start = time.perf_counter()
opts = Options()
opts.add_argument('--headless=new')

driver = webdriver.Chrome(options=opts)
print(f'driver_ready_ms={(time.perf_counter() - start) * 1000:.0f}')
driver.get('https://example.com')
print(f'first_navigation_ms={(time.perf_counter() - start) * 1000:.0f}')
driver.quit()

The harness should be small enough that you can reason about every millisecond it reports.

Decide what to hold constant

A browser startup overhead benchmark is only useful if the inputs stay controlled. Pick stable values for:

CI image or container base,
browser version,
test runner version,
network target,
auth mechanism,
CPU and memory limits,
concurrency level,
tracing, video, and coverage settings.

If you are comparing launch times across tools, do not silently change the runtime environment too. For example, one tool might be running in a Docker image with the browser preinstalled, while another downloads the browser on each run. That is not a tool comparison, it is a packaging comparison.

A few common confounders are worth calling out:

Browser cache state: downloaded binaries can dominate the first run.
Package manager cache state: Node, Python, or Java dependency restore can obscure runner startup.
Headed vs headless mode: some CI images behave differently depending on display requirements.
Sandbox and security flags: browser launch time can change if the container is privileged or unprivileged.
Remote grid latency: Selenium grids and cloud browsers add network hops that local runners do not.

Measure cold start and warm start separately

Teams often optimize for the wrong scenario because they only inspect the first run after a cache clear or only inspect warm runs after the runner has been reused.

A useful browser startup overhead benchmark records both:

Cold start: first run on a fresh runner or after cache purge.
Warm start: repeated run with caches and browser binaries already present.

Cold start tells you what new jobs pay. Warm start tells you what most pull request runs pay if the same runner image or cache is reused.

If cold start is slow but warm start is acceptable, the fix may be prebuilding images, improving cache keys, or moving browser binaries into the runner image. If warm start is still slow, the issue may be the browser itself, the test runner, or app bootstrap.
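The cold versus warm reasoning above can be sketched as a small comparison helper. This is only an illustration: `cold_warm_summary` and the `acceptable_ms` budget are hypothetical names, and the budget is a number your team has to choose, not something the benchmark gives you.

```python
from statistics import median


def cold_warm_summary(cold_ms, warm_ms, acceptable_ms):
    """Compare cold and warm startup samples against a team-chosen budget."""
    cold = median(cold_ms)
    warm = median(warm_ms)
    if warm > acceptable_ms:
        # Caches are already warm, so caching alone will not fix this.
        hint = "warm start is slow: look at the browser, the runner, or app bootstrap"
    elif cold > acceptable_ms:
        hint = "only cold start is slow: look at image prebuilds, cache keys, or baked-in browsers"
    else:
        hint = "both within budget"
    return {
        "cold_median_ms": cold,
        "warm_median_ms": warm,
        "cache_benefit_ms": cold - warm,
        "hint": hint,
    }


# Sample numbers are made up for illustration.
print(cold_warm_summary([9000, 10000, 11000], [2000, 2500, 3000], acceptable_ms=4000))
```

The `cache_benefit_ms` value is a rough estimate of what caching already saves you, which helps when deciding whether image prebuilds are worth the maintenance cost.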

Time the phases with explicit markers

Do not rely only on the total job duration. Add explicit markers around the major events. A simple set of markers might look like this:

job_start
deps_ready
runner_ready
browser_ready
auth_ready
first_navigation_complete
first_assertion_complete

You can emit those as JSON lines from your test process.

const mark = (name: string, t0: number) =>
  console.log(JSON.stringify({ name, ms: Date.now() - t0 }));

Then in CI, export the logs and aggregate them later. The value is not just the measurement, it is the phase boundary. Once the team agrees on boundaries, you can discuss specific bottlenecks without talking past each other.
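One way to aggregate those logs is to convert the cumulative marks into per-phase durations. The sketch below assumes each mark is a JSON line with `name` and `ms` fields measured from a shared start time; `phase_durations` is a hypothetical helper and the sample values are invented.

```python
import json


def phase_durations(lines):
    """Turn cumulative marks ({"name": ..., "ms": ...}) into per-phase durations."""
    marks = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip ordinary log output mixed into the stream
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if "name" in record and "ms" in record:
            marks.append((record["name"], record["ms"]))

    marks.sort(key=lambda mark: mark[1])
    durations = {}
    previous_ms = 0
    for name, ms in marks:
        # Each phase is the gap between this mark and the one before it.
        durations[name] = ms - previous_ms
        previous_ms = ms
    return durations


sample = [
    '{"name": "runner_ready", "ms": 1200}',
    "some unrelated log line",
    '{"name": "browser_ready", "ms": 2100}',
    '{"name": "first_navigation_complete", "ms": 3900}',
]
print(phase_durations(sample))
```

Each duration is attributed to the mark that ends the phase, so `browser_ready` here means the time between the runner being ready and the browser being ready.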

A simple comparison matrix for tool and environment experiments

When you benchmark browser startup overhead, compare tool behavior only after fixing the environment. A small matrix is usually enough:

Dimension	Example choices	Why it matters
Runner	Playwright, Selenium, Cypress	Different startup models and process trees
Browser mode	Headless, headed	Affects launch behavior and resource usage
Browser source	Preinstalled image, downloaded at runtime	Can dominate cold start
Auth method	UI login, token injection, cookie restore	Adds very different startup cost
Network target	Local mock, staging, real backend	First navigation variance
Isolation	Fresh container, reused runner	Changes cache behavior
Instrumentation	None, tracing, video, coverage	Extra startup and I/O overhead

This matrix helps you determine whether a slowdown is caused by the browser launch itself or by everything around it.
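Running every combination of the matrix is rarely necessary. A minimal sketch, assuming you pick one baseline configuration and vary a single dimension per experiment, could look like this. The dimension names, choices, and the `one_factor_at_a_time` helper are illustrative.

```python
def one_factor_at_a_time(matrix, baseline):
    """Return the baseline plus every config that changes exactly one dimension."""
    configs = [dict(baseline)]
    for dimension, choices in matrix.items():
        for choice in choices:
            if choice != baseline[dimension]:
                configs.append({**baseline, dimension: choice})
    return configs


# Dimension names and choices are examples; trim them to what you actually run.
matrix = {
    "runner": ["playwright", "selenium", "cypress"],
    "browser_mode": ["headless", "headed"],
    "browser_source": ["preinstalled", "downloaded"],
}
baseline = {
    "runner": "playwright",
    "browser_mode": "headless",
    "browser_source": "preinstalled",
}

for config in one_factor_at_a_time(matrix, baseline):
    print(config)
```

Because every configuration differs from the baseline in at most one dimension, any timing difference can be attributed to that dimension instead of an interaction between several.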

Use CI job telemetry alongside test logs

The benchmark should not live only inside the test code. CI platform telemetry can reveal hidden startup cost outside the browser process, such as queue time, image pull time, or machine provisioning delay.

Track at least these numbers:

queue time until job starts,
time to first shell command,
time to dependency restore completion,
time to browser process start,
time to first URL loaded,
time to first test assertion.

If your CI system exposes step timing, combine it with test logs. The difference between job_start and runner_ready is often the overhead you can reduce with caching or image changes. The difference between browser_ready and first_navigation_complete may point to auth or application bottlenecks.

Know when auth setup is the real problem

Authentication is one of the most common sources of startup inflation. A suite may launch the browser quickly, then spend most of its time logging in through the UI. That is not browser launch cost, but it is still startup overhead from the perspective of your test.

Compare these three patterns:

UI login on every test: highest realism, highest startup cost.
Session reuse via cookies or storage state: lower cost, lower setup noise.
Token injection or API-authenticated state: lowest cost, best for many non-login flows.

The right choice depends on what you are validating. If a specific test exists to verify the login flow, you should measure it separately. If the login is just a prerequisite for unrelated tests, benchmark with a reusable authenticated state so the login path does not mask browser startup cost.
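As a rough illustration of session reuse with Selenium, the sketch below saves cookies after one login and restores them into a fresh browser. `save_session` and `restore_session` are hypothetical helper names, and the approach assumes your app's session lives entirely in cookies, which is not true for apps that also rely on local storage.

```python
import json
from pathlib import Path


def save_session(driver, path):
    """Persist the browser's cookies after a single login."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def restore_session(driver, path, origin_url):
    """Load saved cookies into a fresh browser instead of logging in again."""
    # Cookies can only be added for the domain that is currently loaded.
    driver.get(origin_url)
    for cookie in json.loads(Path(path).read_text()):
        driver.add_cookie(cookie)
    # Reload so the app sees the restored cookies.
    driver.refresh()
```

In a benchmark, you would time `restore_session` as its own phase, so the cost of session reuse stays visible next to the cost of a full UI login.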

Measure first-page navigation separately

Many teams stop timing at browser.newPage() or webdriver.Chrome(). That misses the next major cost, which is usually navigation.

The first navigation may include:

DNS resolution,
TLS handshake,
API requests,
JavaScript bundle download,
hydration or client-side rendering,
redirects,
feature flag evaluation,
cookies and consent checks.

If your benchmark only measures browser launch time in CI, you may miss the fact that the browser is fast but the app entry point is expensive. In that case, the right fix might be a lightweight test landing page, a fixture endpoint, or a pre-authenticated route for setup.
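To see which part of the first navigation is expensive, you can read the browser's Navigation Timing entry and split it into coarse phases. The sketch below is one possible breakdown: `navigation_breakdown` and its phase labels are assumptions, while the field names come from the standard `PerformanceNavigationTiming` entry. It does not capture work that happens after the load event, such as client-side hydration or later API calls.

```python
def navigation_breakdown(entry):
    """Split a PerformanceNavigationTiming entry (as a dict) into coarse phases, in ms."""

    def span(start, end):
        # Clamp at zero because unfinished or skipped phases report 0.
        return max(entry.get(end, 0) - entry.get(start, 0), 0)

    # secureConnectionStart is 0 for plain HTTP or reused connections.
    tls_start = entry.get("secureConnectionStart", 0)
    tls = max(entry.get("connectEnd", 0) - tls_start, 0) if tls_start else 0

    return {
        "redirect": span("redirectStart", "redirectEnd"),
        "dns": span("domainLookupStart", "domainLookupEnd"),
        "connect": span("connectStart", "connectEnd"),
        "tls": tls,  # already included in "connect"
        "ttfb": span("requestStart", "responseStart"),
        "download": span("responseStart", "responseEnd"),
        "dom_ready": span("responseEnd", "domContentLoadedEventEnd"),
        "load": span("domContentLoadedEventEnd", "loadEventEnd"),
    }


# With Selenium, the entry could be fetched like this (not executed here):
# entry = driver.execute_script(
#     "return performance.getEntriesByType('navigation')[0].toJSON()")
```

If `dns`, `connect`, and `ttfb` are small but `dom_ready` and `load` are large, the cost is more likely in the app's bundle and rendering than in the network or the browser.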

Avoid benchmark traps that create fake conclusions

A browser startup overhead benchmark is easy to distort. Watch out for these traps.

Measuring only once

Single runs are useful for smoke testing, not for conclusions. Startup timing varies because of container scheduling, warm caches, network jitter, and browser process behavior. Run enough iterations to see the spread.

Comparing different amounts of work

If one tool downloads browsers, creates users, and enables tracing, while another does none of those things, the benchmark is comparing setup policies, not core speed.

Ignoring browser and runner versions

A version bump can change startup behavior. Record the exact versions in the benchmark output.
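A minimal sketch of recording versions alongside the timings, assuming a Python-based harness: `environment_record` is a hypothetical helper, and the default package names are only examples of what you might track.

```python
import json
import platform
from importlib import metadata


def environment_record(packages=("selenium", "playwright")):
    """Collect interpreter, OS, and package versions for the benchmark output."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Emit once per run so every timing can be traced back to exact versions.
print(json.dumps({"name": "environment", **environment_record()}))

# With Selenium, the browser version is usually available from the session
# capabilities, for example driver.capabilities["browserVersion"].
```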

Mixing test suite overhead with application startup

If your test logs in, seeds data, opens a dashboard, and waits for realtime updates, you are no longer measuring browser startup alone. That may still be useful, but label it honestly.

Running on one CI provider only

Different runners have different CPU scheduling, filesystem speed, and image cache behavior. A benchmark on one provider is still valid, but it only describes that provider.

Benchmark labels matter. If a number includes auth and app boot, call it that. Otherwise people will optimize the wrong layer.

A minimal CI job that captures startup stages

Here is a compact GitHub Actions example that runs a browser benchmark and preserves logs for later analysis.

name: browser-startup-benchmark
on:
  workflow_dispatch:
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run benchmark:startup | tee startup.log
      - uses: actions/upload-artifact@v4
        with:
          name: startup-log
          path: startup.log

This job does not solve the measurement problem by itself, but it creates a repeatable artifact. From there, you can parse the timestamps into a spreadsheet, a dashboard, or a lightweight script.
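As one example of a lightweight script, the sketch below pulls `name_ms=value` markers out of a log such as startup.log and writes CSV rows that a spreadsheet can open. `log_to_csv` and the `run_id` column are assumptions; in GitHub Actions the run identifier could come from the `GITHUB_RUN_ID` environment variable.

```python
import csv
import io
import re

# Matches markers such as "runner_ready_ms=1200" anywhere in a line.
MARK = re.compile(r"\b(\w+_ms)=(\d+(?:\.\d+)?)")


def log_to_csv(log_text, run_id):
    """Extract *_ms markers from raw log text and return them as CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["run_id", "marker", "ms"])
    for marker, value in MARK.findall(log_text):
        writer.writerow([run_id, marker, value])
    return out.getvalue()


sample_log = "runner_ready_ms=1200\nunrelated output\nfirst_navigation_ms=3900\n"
print(log_to_csv(sample_log, run_id="run-1"))
```

Appending the rows from many runs into one file gives you the raw data for medians and spread without any dashboard tooling.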

Interpreting results without overfitting

Once you have data, resist the urge to treat every difference as a problem. Look for patterns.

If cold start is slow but warm start is acceptable

Focus on image prebuilds, browser binary caching, and dependency caching. Pre-baking browsers into the runner image often helps more than micro-optimizing test code.

If both cold and warm starts are slow

Look at the runner process, browser launch flags, and auth path. Also check whether tracing, screenshots, or video are enabled by default in every run.

If browser launch is fast but first navigation is slow

The bottleneck is probably app bootstrap, environment dependency, or network. Measure the target route directly, and consider whether tests can start from a more stable fixture or pre-authenticated state.

If times vary a lot between runs

The issue may be shared runner noise, resource contention, or flaky backend dependencies. Median time is useful, but variation matters just as much. A benchmark that swings wildly is hard to optimize safely.
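To look at variation as well as the median, a short summary over repeated runs is usually enough. This sketch uses the interquartile range as the spread measure, which is one reasonable choice among several; `summarize` is a hypothetical helper and needs at least two samples.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Report median and spread for one phase across repeated runs."""
    ordered = sorted(samples_ms)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "n": len(ordered),
        "median": median(ordered),
        "iqr": q3 - q1,  # middle 50% of runs, robust to a single outlier
        "min": ordered[0],
        "max": ordered[-1],
        "relative_spread": (q3 - q1) / q2 if q2 else None,
    }


# Invented samples: four similar runs and one outlier.
print(summarize([1000, 1100, 1200, 1300, 5000]))
```

A large gap between `max` and the median with a small `iqr` suggests occasional outliers such as a noisy shared runner, while a large `iqr` suggests the phase is unstable on most runs.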

Practical optimization levers, in the order to check them

When the benchmark shows overhead, tackle the biggest controllable contributor first.

Preinstall browsers in the CI image if binary download dominates.
Cache dependencies correctly if package install dominates.
Reuse authenticated state if login dominates.
Trim instrumentation if tracing or video adds significant startup cost.
Simplify first navigation if app boot dominates.
Split setup from spec execution if a heavy before-all hook delays everything.
Parallelize carefully if single-worker startup is fine but total throughput is not.

The order matters because infrastructure fixes usually have broader impact than spec refactors.
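Picking the biggest contributor can be made mechanical once you have per-phase durations. The sketch below ranks phases by share of total startup time and attaches the matching lever from the list above. The phase names in `LEVERS` are hypothetical labels and would need to match whatever markers your harness emits.

```python
# Hypothetical phase names mapped to the levers listed above.
LEVERS = {
    "browser_download": "preinstall browsers in the CI image",
    "dependency_install": "cache dependencies correctly",
    "login": "reuse authenticated state",
    "instrumentation": "trim tracing or video",
    "first_navigation": "simplify first navigation",
    "before_all_setup": "split setup from spec execution",
}


def rank_levers(phase_ms):
    """Rank phases by share of total startup time, largest first."""
    total = sum(phase_ms.values())
    if total <= 0:
        return []
    ranked = sorted(phase_ms.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, round(100 * ms / total), LEVERS.get(name, "no lever mapped"))
        for name, ms in ranked
    ]


# Invented durations for illustration.
for name, percent, lever in rank_levers(
    {"browser_download": 6000, "login": 3000, "first_navigation": 1000}
):
    print(f"{name}: {percent}% -> {lever}")
```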

What a good benchmark report should contain

A useful report is not just a chart. It should answer the questions the team actually needs.

Include:

tool and version,
browser and version,
CI provider and runner type,
container image or VM spec,
cold versus warm run definition,
number of iterations,
median and spread,
phase timing table,
notes on auth and instrumentation.

A good report lets another engineer reproduce the conditions without guessing.

A decision framework for teams

If you are trying to decide whether to invest in startup optimization, use this simple rule set:

If startup is a small fraction of total job time, do not spend weeks chasing it.
If startup dominates PR feedback time, benchmark the phases before refactoring tests.
If the same slowdown appears across multiple suites, suspect infrastructure first.
If the slowdown appears only in one suite, suspect auth, navigation, or suite setup.
If the slowdown appears only on one CI provider, compare runner behavior before touching test code.

This is where a browser startup overhead benchmark pays off. It gives the team evidence for where the time is spent, and it prevents speculative tuning.

A repeatable benchmarking checklist

Use this checklist when you set up the experiment:

Closing thought

A browser test suite can be slow for many reasons, but startup overhead is the easiest place to waste optimization effort because it hides behind the first few seconds of a run. Once you measure container boot, test runner startup latency, browser launch time in CI, auth setup, and first-page navigation separately, the conversation becomes concrete.

You are no longer asking, “Why are the tests slow?” You are asking, “Which phase is slow, under which conditions, and what can we actually change?” That is a much better place to be.