A theme toggle looks trivial until it is not. A single button can drive a surprising amount of state, CSS branching, browser storage, server defaults, hydration behavior, accessibility output, and cross-session persistence. That makes theme switching a useful stress test for a browser automation suite, especially if your goal is not just to verify that dark mode exists, but to measure how stable your tests are when the application remembers user preferences in local state, cookies, or profile storage.

This article lays out a practical plan for benchmarking browser tests against theme switching. The focus is not on ranking tools by raw speed, but on measuring repeatability, state isolation, locator resilience, and failure reproducibility across realistic UI conditions. If you work in QA, SDET, frontend engineering, or DevOps, this is the kind of benchmark that can expose whether your automation stack is robust enough for modern applications that personalize the UI.

Why theme switching is a good benchmark topic

Theme and color-mode features are often implemented with a mix of mechanisms:

  • a local UI state variable,
  • localStorage or sessionStorage,
  • cookies for server-rendered defaults,
  • OS preference detection through prefers-color-scheme,
  • a persisted user profile on the backend,
  • hydration logic in a single-page app.

That mix creates failure modes that are easy to miss in ordinary smoke tests. A test might pass once, then fail on the second run because a previous execution left a cookie behind. Another might pass in a clean browser context, then fail in a persistent profile because the application restores a theme before the test has finished loading. This is exactly the kind of problem a benchmark should surface.

A good benchmark does not only ask, “Did the test pass?” It asks, “Did it pass for the right reason, under controlled state, and can we reproduce the failure when it does not?”

The objective is to measure two things at the same time:

  1. Functional correctness: does the UI switch modes and remember the choice?
  2. Automation robustness: does the test keep working when the app stores or replays state across sessions?

What to measure in the benchmark

A useful benchmark for UI state should include metrics that reflect both application behavior and test behavior. For theme switching, the most valuable metrics are usually these:

1. State transition success rate

This is the percentage of runs in which the theme toggle produces the expected UI mode change.

Example questions:

  • Does clicking the theme control apply dark mode immediately?
  • Does the DOM or computed style change as expected?
  • Does the app expose a persistent indicator, such as data-theme="dark"?

2. Persistence success rate

After switching the theme, does a fresh session keep the preference?

Test this across session boundaries:

  • new tab in the same browser context,
  • new browser context with the same profile,
  • full browser restart,
  • fresh profile, which should reset to default.

3. Failure reproducibility rate

If a run fails, can you reproduce the same failure on retry with the same initial state?

This metric matters because flaky state bugs often disappear when rerun with a clean browser, hiding the real issue. Record whether the failure reproduces with:

  • same profile, same cookies, same storage,
  • same browser, different profile,
  • different browser engine.
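The three rates described so far can be derived from plain per-run records. The following is a minimal sketch, assuming a hypothetical `RunRecord` shape; the field names are illustrative and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """One benchmark run. All field names are illustrative assumptions."""
    toggled_ok: bool                            # theme changed after the click
    persisted_ok: bool                          # preference survived the tested boundary
    reproduced_on_retry: Optional[bool] = None  # set only when a failed run was retried


def rate(hits: int, total: int) -> float:
    """Return hits/total as a fraction, or 0.0 when there is nothing to count."""
    return hits / total if total else 0.0


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Compute the three rates from a list of run records."""
    failed = [r for r in runs if not (r.toggled_ok and r.persisted_ok)]
    # Only failed runs that were actually retried count toward reproducibility.
    retried = [r for r in failed if r.reproduced_on_retry is not None]
    return {
        "transition_success": rate(sum(r.toggled_ok for r in runs), len(runs)),
        "persistence_success": rate(sum(r.persisted_ok for r in runs), len(runs)),
        "failure_reproducibility": rate(
            sum(bool(r.reproduced_on_retry) for r in retried), len(retried)
        ),
    }
```

Keeping reproducibility separate from the success rates makes it visible when failures vanish on retry, which is the signal this metric exists to capture.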

4. Locator stability under theme variants

Theme changes often alter class names, visibility, icon sets, or text contrast. Measure whether your locators still work after the UI switches modes.

5. Assertion quality

A brittle test may pass by checking only a button click result. A stronger benchmark verifies the actual rendered state, such as:

  • CSS custom properties,
  • body class or data-theme,
  • computed color values,
  • accessibility tree labels,
  • screenshot diffs, if you use visual assertions.
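As a rough illustration of checking more than one signal, the sketch below reads the `data-theme` marker and the computed body background in a single script call, then reports every mismatch. It assumes a Selenium-style driver exposing `execute_script`; the helper names and the expected color are hypothetical and would come from your own design tokens.

```python
# JavaScript evaluated in the page: reads the semantic marker and one computed value.
READ_THEME_JS = """
return {
  marker: document.documentElement.getAttribute('data-theme'),
  background: getComputedStyle(document.body).backgroundColor
};
"""


def read_theme_signals(driver) -> dict:
    """Fetch the theme marker and computed background from the live page."""
    return driver.execute_script(READ_THEME_JS)


def check_dark(signals: dict, expected_background: str) -> list[str]:
    """Return a list of readable problems; an empty list means all signals agree."""
    problems = []
    if signals.get("marker") != "dark":
        problems.append(f"data-theme is {signals.get('marker')!r}, expected 'dark'")
    if signals.get("background") != expected_background:
        problems.append(
            f"background is {signals.get('background')!r}, "
            f"expected {expected_background!r}"
        )
    return problems
```

Returning a list of problems instead of failing on the first mismatch lets a report show whether the marker, the rendering, or both were wrong.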

Define the benchmark scope carefully

Theme switching can be benchmarked at several layers. If you do not define the scope, you will end up comparing tools on different tasks.

App types to include

Choose one or more app patterns:

  • a static documentation site with a theme toggle,
  • a React/Vue/Angular app using client-side persistence,
  • a server-rendered app with cookie-backed preference,
  • a hybrid app that hydrates from server defaults and then updates client-side state.

These variants matter because they introduce different timing and storage behaviors. A benchmark that only tests a static theme toggle will miss hydration and persistence issues.

Storage mechanisms to exercise

A realistic benchmark should test the same UI feature across multiple storage models:

  • localStorage, common for client-side preference storage,
  • cookies, common when the server should render the correct theme immediately,
  • profile storage, when browser context persists state across sessions,
  • URL parameters, if your app allows theme overrides for sharing or debugging.

State lifecycle to test

At minimum, benchmark these scenarios:

  1. default load with no prior preference,
  2. switch to dark mode and verify immediate effect,
  3. reload and verify persistence,
  4. close and reopen to verify session survival or reset behavior,
  5. clear storage and verify the default returns,
  6. start in a dark OS preference and verify initialization logic.

A benchmark matrix that keeps the results meaningful

A benchmark is only useful if you can compare runs under controlled permutations. For theme switching, the most practical matrix includes browser engine, storage type, and session model.

| Dimension | Example values | Why it matters |
| --- | --- | --- |
| Browser engine | Chromium, Firefox, WebKit | Different storage, rendering, and timing behavior |
| Session model | fresh context, persistent profile | Reveals leakage across runs |
| Storage mechanism | localStorage, cookie, server profile | Shows whether persistence is tied to the right layer |
| Initial preference | light, dark, OS-preferred | Catches bootstrapping issues |
| Assertion type | DOM, computed style, screenshot | Measures robustness of validation |

You do not need every combination in every build. Start with a manageable subset, then expand. The benchmark should fit into CI without being so large that teams stop running it.
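One way to keep the matrix explicit is to generate it from the dimensions and then filter it down for the fast path. This is a sketch only: the dimension values mirror the examples above, and the choice of what counts as the smoke subset is an assumption you would adjust.

```python
from itertools import product

# Dimension names and values are illustrative, taken from the examples above.
DIMENSIONS = {
    "engine": ["chromium", "firefox", "webkit"],
    "session": ["fresh", "persistent"],
    "storage": ["localStorage", "cookie", "server-profile"],
    "initial": ["light", "dark", "os-preferred"],
}


def full_matrix(dimensions: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand the dimensions into every combination, one dict per cell."""
    keys = list(dimensions)
    return [dict(zip(keys, combo)) for combo in product(*dimensions.values())]


def smoke_subset(matrix: list[dict[str, str]]) -> list[dict[str, str]]:
    """Assumed fast-path filter: a single engine and the default initial preference."""
    return [
        cell for cell in matrix
        if cell["engine"] == "chromium" and cell["initial"] == "light"
    ]
```

With these values the full matrix has 54 cells and the smoke subset has 6, which gives a concrete sense of how quickly the combinations grow.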

Decide what a “pass” means

For a theme benchmark, a pass should be stricter than “the toggle was clicked.” A credible pass definition usually includes three parts:

  • the action was performed successfully,
  • the visual or semantic state changed as expected,
  • the state persisted or reset according to the scenario.

For example, if a user clicks “Dark mode” and the app writes theme=dark into localStorage, then a pass might require:

  • the toggle is activated,
  • the <html> element has data-theme="dark",
  • the next reload restores dark mode,
  • a fresh profile returns to light mode.

If the app also supports server-side rendering, then a stronger pass should ensure the first paint is correct before hydration completes. This helps detect flashes of the wrong theme.
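The pass definition can be written down as named checks so that a failure says which part broke. The sketch below assumes a hypothetical `Observation` record filled in by your test; it covers the four localStorage criteria from the example and leaves the first-paint check out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What the test observed. Field names are illustrative assumptions."""
    clicked: bool
    marker_after_click: Optional[str]
    marker_after_reload: Optional[str]
    marker_in_fresh_profile: Optional[str]


def evaluate(obs: Observation, default: str = "light") -> dict[str, bool]:
    """Map each part of the pass definition to a boolean result."""
    return {
        "action": obs.clicked,
        "state_changed": obs.marker_after_click == "dark",
        "persisted": obs.marker_after_reload == "dark",
        "reset_in_fresh_profile": obs.marker_in_fresh_profile == default,
    }


def passed(obs: Observation) -> bool:
    """A run passes only when every individual check holds."""
    return all(evaluate(obs).values())
```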

Suggested benchmark scenarios

Below is a practical set of scenarios that expose common failures without overengineering the benchmark.

Scenario 1: Fresh session, default theme

Start with a clean browser context and no storage.

Verify:

  • default theme is applied,
  • no stale cookie or localStorage item exists,
  • theme toggle is visible and usable.

This scenario establishes a clean baseline.

Scenario 2: Switch theme once, verify live update

Click the theme toggle and confirm the UI changes immediately.

Check:

  • the DOM marker changes,
  • text and icons remain accessible,
  • contrast does not break layout.

A lot of bugs appear here because components use different theme-dependent styles, especially third-party widgets.

Scenario 3: Reload and verify persistence

After switching to dark mode, reload the page.

Check whether the preference survives according to the design:

  • localStorage persists through reload,
  • cookies persist based on expiry,
  • sessionStorage survives a reload but is scoped to the same tab and is cleared when that tab closes.

The point is not to force one storage model, but to verify that the chosen model works consistently.

Scenario 4: New browser context with same profile

If your test framework supports persistent profiles, reopen the app using the same user data directory.

This scenario catches state that should survive restarts. It is especially useful for desktop-like browser tests and long-lived profiles.

Scenario 5: New profile, verify reset behavior

Start with a brand-new browser profile.

You should see the default theme again unless your app uses an OS-level preference or server-side user profile.

If the theme is still dark, some state is leaking outside the intended storage mechanism.

Scenario 6: OS preference override

Set the browser or OS emulation to prefer dark mode, then verify initial render.

For apps that follow prefers-color-scheme, this can reveal timing bugs, because the app may briefly render light mode before applying the correct preference.
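On Chromium-based drivers, one possible way to emulate the OS preference from Selenium is the Chrome DevTools Protocol command `Emulation.setEmulatedMedia`. The sketch below assumes a driver that exposes `execute_cdp_cmd`, which Selenium's Chromium drivers do; other engines need a different mechanism. The helper names are hypothetical.

```python
def emulate_color_scheme(driver, scheme: str) -> None:
    """Ask a Chromium-based driver to report the given prefers-color-scheme value."""
    if scheme not in ("light", "dark"):
        raise ValueError(f"unsupported color scheme: {scheme!r}")
    driver.execute_cdp_cmd(
        "Emulation.setEmulatedMedia",
        {"features": [{"name": "prefers-color-scheme", "value": scheme}]},
    )


def page_prefers_dark(driver) -> bool:
    """Confirm from inside the page that the emulation took effect."""
    return bool(driver.execute_script(
        "return window.matchMedia('(prefers-color-scheme: dark)').matches;"
    ))
```

Applying the emulation before the first navigation, and checking `page_prefers_dark` afterwards, helps separate "the emulation was not applied" from "the app ignored the preference".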

Implementation details that make the benchmark trustworthy

Use deterministic selectors

Theme benchmark tests should not depend on pixel-perfect matching of arbitrary elements. Prefer selectors that are built into the app for testability:

  • data-testid="theme-toggle",
  • data-theme="dark",
  • aria-label="Switch to dark mode".

Avoid brittle selectors like long CSS chains or icon-specific classes. If the theme changes SVG icons or wrapper classes, the test should not break for unrelated reasons.

Assert state, not just appearance

Visual checks are useful, but they are not enough by themselves. A dark background might render correctly while the underlying theme state remains unset, which means the next page transition could revert unexpectedly.

A better pattern is to check both:

  • a semantic marker, such as data-theme, and
  • a computed style or screenshot-based confirmation.

Capture storage before and after

When a test fails, you want to know whether the application wrote the correct preference.

Record:

  • localStorage keys relevant to theme,
  • cookies related to preference,
  • sessionStorage entries, if used,
  • current URL and query parameters.

This is the difference between a flaky test report and a useful debugging artifact.
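A before-and-after snapshot might be captured along these lines. This is a sketch assuming a Selenium-style driver (`execute_script`, `get_cookies`, `current_url`); the `snapshot` and `diff` helpers are hypothetical names, and in practice you would probably filter to the theme-related keys.

```python
# JavaScript evaluated in the page: dumps both Web Storage areas as plain objects.
DUMP_STORAGE_JS = """
const dump = (s) => Object.fromEntries(Object.keys(s).map((k) => [k, s.getItem(k)]));
return {local: dump(window.localStorage), session: dump(window.sessionStorage)};
"""


def snapshot(driver) -> dict:
    """Collect URL, Web Storage contents, and cookies for the current page."""
    storage = driver.execute_script(DUMP_STORAGE_JS)
    return {
        "url": driver.current_url,
        "localStorage": storage["local"],
        "sessionStorage": storage["session"],
        "cookies": {c["name"]: c["value"] for c in driver.get_cookies()},
    }


def diff(before: dict, after: dict) -> dict:
    """Return {section: {key: (old, new)}} for every stored value that changed."""
    changes = {}
    for section in ("localStorage", "sessionStorage", "cookies"):
        old, new = before[section], after[section]
        changed = {
            key: (old.get(key), new.get(key))
            for key in old.keys() | new.keys()
            if old.get(key) != new.get(key)
        }
        if changed:
            changes[section] = changed
    return changes
```

Attaching the diff to a failed run shows at a glance whether the click wrote the preference at all, and to which layer.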

Example benchmark test with Playwright

This is a compact example of a theme persistence check. It is not the full benchmark harness, but it shows the kind of structure that keeps the test reproducible.

import { test, expect } from '@playwright/test';

test('theme persists after reload', async ({ page }) => {
  await page.goto('https://example-app.local');

  // Switch to dark mode and confirm the root marker updates.
  await page.getByTestId('theme-toggle').click();
  await expect(page.locator('html')).toHaveAttribute('data-theme', 'dark');

  // Reload and confirm the preference was restored.
  await page.reload();
  await expect(page.locator('html')).toHaveAttribute('data-theme', 'dark');
});

To make this a benchmark rather than a one-off test, wrap it in a runner that records environment details, storage state, and retry behavior.

Example of a more robust setup in Selenium Python

Some teams still prefer Selenium for broader grid support or legacy coverage. In that case, the benchmark should still focus on state validation and controlled profile setup.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Reuse a fixed profile directory so state can persist across runs.
options = webdriver.ChromeOptions()
options.add_argument('--user-data-dir=/tmp/theme-bench-profile')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example-app.local')

    # Switch to dark mode and check the root marker.
    driver.find_element(By.CSS_SELECTOR, '[data-testid="theme-toggle"]').click()
    assert driver.find_element(By.TAG_NAME, 'html').get_attribute('data-theme') == 'dark'

    # Reload and check that the preference was restored.
    driver.refresh()
    assert driver.find_element(By.TAG_NAME, 'html').get_attribute('data-theme') == 'dark'
finally:
    driver.quit()  # always release the browser, even when an assertion fails

The key benchmarking point is not the framework, it is whether your setup isolates runs and preserves only the state you intended to preserve.
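Profile isolation can be made explicit with a small context manager that either creates a throwaway profile directory or reuses a named one. This is a sketch using only the standard library; `profile_dir` is a hypothetical helper, and the resulting path would be passed to the browser, for example as a `--user-data-dir` argument.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Optional


@contextmanager
def profile_dir(persistent: Optional[Path] = None):
    """Yield a browser profile directory.

    With no argument, create a throwaway directory and delete it afterwards.
    With a path, reuse that directory and leave it in place for the next run.
    """
    if persistent is not None:
        persistent.mkdir(parents=True, exist_ok=True)
        yield persistent
        return
    path = Path(tempfile.mkdtemp(prefix="theme-bench-"))
    try:
        yield path
    finally:
        # Remove the fresh profile so nothing leaks into later runs.
        shutil.rmtree(path, ignore_errors=True)
```

Making the fresh case the default means a test has to opt in to persistence, so leaked state is a deliberate choice and not an accident of setup.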

Handle flakes caused by hydration and first paint

Theme bugs often appear during hydration, especially in apps that render server-side and then reconcile client state.

Common symptoms include:

  • the page flashes light mode before dark mode loads,
  • the toggle state is correct but the document root class is late,
  • tests fail because the assertion runs before hydration settles,
  • screenshots differ depending on CPU speed.

To reduce these failures, define a stable readiness signal. For example, the app can set window.__themeReady = true after it has applied persisted preference. Your benchmark can wait for that signal before asserting.

// The cast avoids a TypeScript error for the app-defined window property.
await page.waitForFunction(() => (window as any).__themeReady === true);
await expect(page.locator('html')).toHaveAttribute('data-theme', 'dark');

This is a better benchmark than relying on arbitrary sleeps, because sleeps hide instability instead of measuring it.
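For harnesses without a built-in wait, the same idea can be sketched as a polling helper that returns how long the condition took. The code below uses only the standard library; `wait_until` and `theme_ready` are hypothetical names, and Selenium users may prefer the library's own `WebDriverWait` for the polling part.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> float:
    """Poll predicate until it is truthy and return the seconds waited.

    Raises TimeoutError when the condition is not met within the timeout.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


def theme_ready(driver) -> bool:
    """Check the app-defined readiness flag through a Selenium-style driver."""
    return driver.execute_script("return window.__themeReady === true;") is True
```

A call such as `wait_until(lambda: theme_ready(driver))` yields a readiness latency that can be recorded per run, so slow hydration shows up as data instead of being masked.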

Include negative cases on purpose

A benchmark is stronger when it includes deliberate failure cases. You want to know whether the test suite can distinguish between a correct and incorrect state.

Good negative cases include:

  • clearing localStorage before reload, expecting the default theme,
  • blocking cookie persistence, expecting a reset,
  • forcing a stale cookie and checking whether server preference wins,
  • corrupting the stored value, expecting fallback handling.

These scenarios help you measure whether the application fails gracefully and whether your tests can explain the failure instead of just reporting a mismatch.
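Negative cases are easier to report on when the benchmark carries its own oracle for what the app should do with a bad value. The sketch below assumes the app falls back to a light default for anything other than `light` or `dark`; the case list and helper names are illustrative.

```python
from typing import Optional

VALID_THEMES = {"light", "dark"}
DEFAULT_THEME = "light"  # assumption: the app's default theme is light


def expected_theme(stored: Optional[str]) -> str:
    """Benchmark oracle: the theme the app should show for a stored value."""
    return stored if stored in VALID_THEMES else DEFAULT_THEME


# (case name, value placed in storage before the reload)
NEGATIVE_CASES = [
    ("storage cleared", None),
    ("corrupted value", "drak"),
    ("empty string", ""),
    ("serialized object", '{"theme": true}'),
]


def explain(case: str, stored: Optional[str], observed: str) -> Optional[str]:
    """Return None when the app behaved as expected, else a readable reason."""
    want = expected_theme(stored)
    if observed == want:
        return None
    return f"{case}: stored={stored!r}, expected {want!r} but page shows {observed!r}"
```

The message from `explain` names the injected value and the expected fallback, which is the difference between reporting a mismatch and explaining one.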

Failure reproducibility needs a standard artifact set

If your benchmark finds a bug, you should be able to reproduce it later with the same initial conditions. That means collecting enough context on every run.

A good artifact set includes:

  • browser name and version,
  • engine and platform,
  • profile type, fresh or persistent,
  • initial storage snapshot,
  • test timestamps,
  • screenshot or video, if available,
  • console logs,
  • network traces when relevant.

When you are comparing tools or approaches, reproducibility often matters more than raw execution speed. A slightly slower tool that makes failures easy to replay is usually more valuable than a fast one that obscures state.
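The artifact set can be kept consistent by giving it a fixed shape and serializing it with every run. This is a minimal sketch with a hypothetical `RunArtifacts` record; the fields follow the list above, and paths to screenshots or traces would be filled in by your harness.

```python
import json
import platform
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RunArtifacts:
    """Context saved with each run. Field names are illustrative assumptions."""
    browser_name: str
    browser_version: str
    profile_type: str                 # "fresh" or "persistent"
    initial_storage: dict             # storage snapshot taken before the test acts
    os_platform: str = field(default_factory=platform.platform)
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    console_logs: list = field(default_factory=list)
    screenshot_path: Optional[str] = None

    def to_json(self) -> str:
        """Serialize with sorted keys so two runs are easy to diff."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```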

How to score the benchmark

For a practical browser automation benchmark, I recommend a small scorecard instead of a single composite score. Separate the concerns so teams can see tradeoffs clearly.

Suggested scoring dimensions

  • Persistence correctness: did the theme survive the intended boundary?
  • Isolation quality: did fresh sessions start cleanly?
  • Locator resilience: did the test survive UI differences caused by mode changes?
  • Reproducibility: could the failure be replayed with the same artifacts?
  • Debuggability: did the run produce enough context to explain the result?

A tool that wins on speed but loses on isolation may be a poor fit for this benchmark class. Likewise, a framework that works well only when state is manually reset is not a strong answer for teams that need repeatable CI coverage.
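A scorecard that keeps the dimensions separate could look like the sketch below. The `Scorecard` type is hypothetical, each score is assumed to be a fraction between 0 and 1, and there is deliberately no method that folds the dimensions into one number.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scorecard:
    """One score per dimension, each assumed to be a fraction between 0 and 1."""
    persistence: float
    isolation: float
    locator_resilience: float
    reproducibility: float
    debuggability: float

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

    def below(self, threshold: float) -> list[str]:
        """Names of all dimensions scoring under the threshold, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name) < threshold]
```

For instance, with placeholder values that are not measurements, `Scorecard(0.98, 0.71, 0.95, 0.80, 0.90).below(0.85)` returns `["isolation", "reproducibility"]`, which points at the tradeoff directly instead of hiding it in an average.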

Integrate the benchmark into CI without making it noisy

Theme benchmarks can become noisy if you run every variant on every commit. A practical CI strategy is to split the workload:

  • per-commit smoke path: one fresh-session scenario and one persistence scenario,
  • nightly path: full matrix across browsers and profiles,
  • diagnostic path: rerun failures with preserved artifacts.

If you use continuous integration, keep the fast path small and deterministic, and push the broader browser matrix into scheduled runs.

A simple GitHub Actions job might look like this:

name: ui-state-benchmark
on: [push, workflow_dispatch]

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Download the browsers Playwright needs on a clean runner.
      - run: npx playwright install --with-deps
      - run: npx playwright test tests/theme-benchmark.spec.ts

This kind of pipeline is especially useful when you want the benchmark to function as a regression guard, not just a one-time audit.

Common mistakes to avoid

Testing only one browser profile

If every run starts with a brand-new profile, you will miss persistence bugs. If every run reuses the same profile, you will miss leakage and cleanup bugs. You need both.

Asserting color values without checking state

A style may look correct while the underlying preference is missing. When the app navigates or hydrates, the state can reset.

Forgetting accessibility implications

Dark mode is not only visual. Icon contrast, aria labels, and focus rings can change with theme. A benchmark that ignores these aspects can miss the most user-visible regressions.

Relying on arbitrary sleeps

This is one of the fastest ways to create a flaky benchmark. Wait for application readiness, a specific DOM marker, or a network condition instead.

Treating persistence as purely client-side

Many apps store theme on the server for logged-in users. If your benchmark only checks localStorage, you may misclassify a correct server-backed implementation as broken, or the reverse.

When to extend the benchmark beyond themes

Theme switching is a good entry point, but the same benchmark structure can be reused for other persisted UI state:

  • sidebar collapsed or expanded,
  • table density settings,
  • language selection,
  • last viewed tab,
  • dismissed banners,
  • filter chips or sort order.

These are all forms of UI memory. Once you have a benchmark harness that handles storage, reloads, and profile boundaries correctly, you can reuse it across a wider class of stateful UI features.

Final checklist for a usable benchmark

Before you call the benchmark complete, make sure it answers these questions:

  • Can it start from a truly clean state?
  • Can it verify theme changes after the click?
  • Can it verify persistence after reload and restart?
  • Can it distinguish local storage from cookies and profile storage?
  • Can it reproduce failures with the same initial artifacts?
  • Can it run in CI without excessive noise?
  • Can it explain failures with enough context to debug them?

If the answer is yes, you do not just have a test. You have a meaningful benchmark for persisted UI state.

Closing thought

Theme switching is deceptively small, which is why it makes such a strong benchmark target. It exercises state storage, browser context boundaries, hydration timing, and UI validation in one compact workflow. A solid benchmark in this area gives teams a repeatable way to compare browser automation approaches and to detect when a test suite is too brittle to trust.

For more background on the discipline behind these checks, it can help to revisit software testing and test automation as engineering practices rather than just tooling choices. The best browser automation benchmark is the one that tells you, with evidence, whether your UI state handling is stable enough to ship.