June 1, 2026
How to Reduce Flaky Tests in GitHub Actions Without Hiding Real Failures
A practical guide to reducing flaky tests in GitHub Actions with retries, isolation, logging, and CI failure triage without masking real product defects.
Flaky tests are one of the easiest ways to waste engineering time in CI. A test passes locally, fails once in GitHub Actions, passes on retry, and then everyone loses confidence in the pipeline. The danger is not just noise; it is the slow erosion of trust. Once a team starts assuming failures are probably random, real regressions can linger longer than they should.
The goal is not to eliminate every intermittent failure at any cost. The goal is to reduce flaky tests in GitHub Actions while keeping true defects visible, actionable, and hard to ignore. That means treating flakiness as a system problem, not just a test problem. It often involves the test itself, the app under test, the CI environment, network boundaries, timing, data isolation, and how failures are reported.
This guide is a practical workflow for DevOps engineers, SDETs, QA leads, and release engineers who need to make GitHub Actions more reliable without hiding product failures behind aggressive retries.
What flaky tests actually are
A flaky test is a test that sometimes passes and sometimes fails without a meaningful change in the product behavior. That sounds simple, but in CI there are several distinct sources of intermittency:
- Timing issues, such as UI elements appearing later than the test expects
- Shared state, such as database records or files left by previous jobs
- Infrastructure noise, such as CPU starvation or transient network errors
- Test order dependence, where one test changes state another test assumes
- Environment drift, such as different browser versions, locale, timezone, or node dependencies
- App instability, where the product itself is producing nondeterministic behavior
Not every intermittent failure is a flaky test. Some are signals that your system is underprovisioned, your setup is non-deterministic, or your application has a real race condition.
That distinction matters because the response should differ. A genuine product race condition should fail loudly. A bad wait strategy should be fixed in the test. A noisy CI host should be stabilized or isolated. A retry policy can help in some cases, but retries alone are not a strategy.
Start with failure classification, not retries
If you want to reduce flaky tests in GitHub Actions, the first step is to classify failures into buckets. That gives you a policy for what to retry, what to quarantine, and what to escalate.
A useful working taxonomy is:
1. Product defect
The test failed because the application behaved incorrectly. Examples:
- Validation error not shown when required
- API returned malformed data
- UI state did not match the expected business rule
Do not retry these away. They should fail immediately and be visible in release gating.
2. Test defect
The test is poorly written or too tightly coupled to timing, DOM structure, or shared state. Examples:
- Hard-coded sleep instead of waiting for a specific condition
- Unstable selector tied to CSS layout classes
- Cleanup omitted after test data creation
These should be fixed in the test suite, not masked.
3. Environment or infrastructure noise
The test depends on CI conditions outside the app code. Examples:
- Browser startup failed because the runner is overloaded
- A temporary DNS error occurred
- A service container was not ready yet
Some of these can be retried, but only if you can distinguish them from true application failures.
4. Data or dependency instability
External services, test data, or fixtures are inconsistent. Examples:
- Shared test account state changes unexpectedly
- Third-party API rate limiting causes occasional failure
- Seed data differs between jobs
The fix is usually isolation, better mocks, contract tests, or test environment control.
5. Observability gap
The test may be failing for a real reason, but the failure does not contain enough context to know why. In practice, this is where many teams get stuck.
If you cannot see the app logs, request payload, browser console errors, network failures, screenshots, or timing data, you are guessing. Good observability is what makes CI failure triage possible.
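One way to make this taxonomy actionable is to write it down as data. The sketch below is only an illustration, not a prescribed schema: the names `FailureClass`, `FailurePolicy`, and `policyFor` are hypothetical, and the `nextStep` wording paraphrases the guidance above. Treating only infrastructure noise as retryable is an assumption you may want to adjust.

```typescript
// Hypothetical encoding of the five buckets above; all names are illustrative.
type FailureClass =
  | "product-defect"
  | "test-defect"
  | "infrastructure-noise"
  | "data-instability"
  | "observability-gap";

interface FailurePolicy {
  retryable: boolean; // may CI retry this class automatically?
  nextStep: string;   // what a human should do next
}

const POLICIES: Record<FailureClass, FailurePolicy> = {
  "product-defect": { retryable: false, nextStep: "fail the gate and escalate" },
  "test-defect": { retryable: false, nextStep: "fix the test" },
  // Assumption: only infrastructure noise is eligible for automatic retry.
  "infrastructure-noise": { retryable: true, nextStep: "retry within limits, track frequency" },
  "data-instability": { retryable: false, nextStep: "isolate data or mock the dependency" },
  "observability-gap": { retryable: false, nextStep: "add logs and artifacts, then reclassify" },
};

function policyFor(failureClass: FailureClass): FailurePolicy {
  return POLICIES[failureClass];
}
```

Keeping the mapping in one reviewed place makes it harder for a new retry rule to slip in unnoticed.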
Build a retry policy that is narrow, explicit, and informative
Retries can be useful, but they must be targeted. A blanket retry of every job hides defects and increases pipeline time. A better approach is to retry only the classes of failure that are plausibly transient.
Use retries at the right layer
There are three common levels:
- Step-level retries, such as rerunning a flaky browser step
- Job-level retries, such as rerunning the full test job
- Workflow-level retries, such as rerunning the entire GitHub Actions workflow
Step-level retries are usually preferable for isolated transient operations, like a single network call or browser navigation. Job-level retries are reasonable when the environment can be dirty after a failure. Workflow-level retries should be rare because they are expensive and can blur root cause analysis.
Keep retries limited
A practical policy is one or two retries for explicitly transient classes, then fail. The point is to reduce false failures, not to turn CI into a lottery with more tickets.
A good retry policy should answer:
- Which failures are eligible?
- How many attempts are allowed?
- Is there a backoff between attempts?
- What evidence do we collect from each attempt?
- When do we stop retrying and fail the build?
If your policy does not answer these questions, it is probably hiding signal.
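The questions above map fairly directly onto a small helper. The following is a minimal sketch, assuming the retried operation can be wrapped in an async function. `RetryPolicy`, `AttemptRecord`, `shouldRetry`, and `runWithRetry` are hypothetical names, and the fixed backoff is a simplification.

```typescript
// All names here are hypothetical; this is a sketch, not a library API.
interface RetryPolicy {
  maxAttempts: number;                     // total attempts, including the first
  backoffMs: number;                       // fixed delay between attempts
  isEligible: (error: unknown) => boolean; // which failures may be retried
}

interface AttemptRecord {
  attempt: number;
  error: string;
}

// Pure decision: retry only eligible failures, and only while attempts remain.
function shouldRetry(attempt: number, error: unknown, policy: RetryPolicy): boolean {
  return attempt < policy.maxAttempts && policy.isEligible(error);
}

async function runWithRetry<T>(
  task: () => Promise<T>,
  policy: RetryPolicy,
  evidence: AttemptRecord[] = [],
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Keep evidence from every attempt, not just the last one.
      evidence.push({ attempt, error: String(error) });
      // Ineligible failure or attempts exhausted: stop and fail the build.
      if (!shouldRetry(attempt, error, policy)) break;
      await new Promise<void>((resolve) => setTimeout(resolve, policy.backoffMs));
    }
  }
  throw lastError;
}
```

Because `isEligible` is explicit, an assertion failure is rethrown on the first attempt, while a failure you have deliberately listed as transient gets a bounded second chance.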
Example: GitHub Actions step-level retry for transient conditions
GitHub Actions does not provide a built-in universal retry for all jobs, but you can implement a controlled retry pattern in a step or use a dedicated action with care. A simple shell-based retry is often enough for a flaky external dependency check or a narrowly scoped setup step.
name: test
on: [push, pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Run tests with limited retry
        run: |
          for attempt in 1 2; do
            npm test && exit 0
            echo "Attempt $attempt failed"
            sleep 5
          done
          exit 1
This is not suitable for everything. If npm test includes real product assertions, you may end up masking a regression. Use retries only when the command boundaries are clear and the failure mode is known.
Make tests deterministic before you tune CI
Many flaky tests are not CI problems at all. They are nondeterministic tests running in a deterministic environment. Before you optimize GitHub Actions, eliminate obvious sources of variability.
Avoid hard sleeps
A sleep 5 may make a test pass more often, but it does not prove readiness. Replace sleep-based waits with explicit conditions. In browser automation, wait for the element, network call, or state transition you actually need.
typescript
await page.goto('https://app.example.com/dashboard');
await page.getByRole('button', { name: 'Create report' }).waitFor({ state: 'visible' });
await page.getByRole('button', { name: 'Create report' }).click();
Use stable selectors
Selectors based on layout classes or generated DOM fragments tend to be brittle. Prefer accessibility roles, stable data attributes, or semantic IDs. This reduces false failures when UI structure changes without changing behavior.
typescript
await page.locator('[data-testid="save-settings"]').click();
Isolate test data
Each test should own the data it creates. If multiple CI jobs share the same user, same account, or same database records, they will eventually step on one another.
Good isolation patterns include:
- Unique namespaces per job or branch
- Ephemeral databases per workflow
- Per-test seed data
- Cleanup hooks that run even on failure
- Read-only fixtures for shared reference data
If you use parallel runs in GitHub Actions, data isolation becomes even more important because the number of possible collisions grows with the number of concurrent jobs.
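A unique namespace per job can be derived from the default environment variables that GitHub Actions sets. This is a minimal sketch: `testNamespace` and the `ci` prefix are hypothetical, and the environment is passed in as a parameter so the function stays easy to test. In a workflow you would likely pass `process.env`.

```typescript
// Hypothetical helper: builds a database/schema/bucket prefix unique to one job attempt.
// GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, and GITHUB_JOB are default GitHub Actions variables.
function testNamespace(
  env: Record<string, string | undefined>,
  prefix = "ci",
): string {
  const runId = env.GITHUB_RUN_ID ?? "local";
  const attempt = env.GITHUB_RUN_ATTEMPT ?? "1";
  // Normalize the job id so the result is safe as an identifier.
  const job = (env.GITHUB_JOB ?? "job").toLowerCase().replace(/[^a-z0-9]+/g, "_");
  return `${prefix}_${runId}_${attempt}_${job}`;
}

// Example: a job with id "E2E-Tests" in run 42, second attempt.
const exampleNamespace = testNamespace({
  GITHUB_RUN_ID: "42",
  GITHUB_RUN_ATTEMPT: "2",
  GITHUB_JOB: "E2E-Tests",
}); // "ci_42_2_e2e_tests"
```

Matrix jobs share the same job id, so if you run a matrix or shards you would need to append the matrix or shard index yourself.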
Make time and randomness controllable
Tests that depend on Date.now(), random IDs, or time zones can fail differently across runners and locales. Freeze time, seed random generators, and standardize locale settings where practical.
typescript
process.env.TZ = 'UTC';
When the app itself depends on time, use explicit test fixtures or a time abstraction so the suite does not depend on the wall clock.
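A time abstraction and a seeded random source can both be quite small. The sketch below assumes your code can accept a clock as a parameter. `Clock`, `fixedClock`, `seededRandom`, and `isExpired` are hypothetical names, and the generator follows the widely circulated mulberry32 recipe, which is fine for test data but not for anything security-related.

```typescript
// Hypothetical time abstraction: production code uses systemClock, tests use fixedClock.
interface Clock {
  now(): Date;
}

const systemClock: Clock = { now: () => new Date() };

function fixedClock(iso: string): Clock {
  const instant = new Date(iso);
  return { now: () => new Date(instant.getTime()) };
}

// Seeded PRNG in the style of mulberry32: same seed, same sequence, values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Example of code under test that reads the injected clock, not the wall clock.
function isExpired(expiresAt: Date, clock: Clock): boolean {
  return clock.now().getTime() >= expiresAt.getTime();
}
```

Logging the seed on failure lets you replay the exact sequence that produced the failure.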
Improve test observability so triage is fast
Reducing flaky tests is easier when every failure leaves enough evidence behind. The best CI systems do not just say “failed,” they explain what happened.
Capture the right artifacts
For browser and integration tests, collect:
- Screenshots on failure
- Video for selected workflows
- Browser console logs
- Network logs or HAR files
- Application logs from the test window
- Server logs from the test environment
- The exact command, commit SHA, and runner metadata
For API and backend tests, capture:
- Request and response payloads, with secrets redacted
- Correlation IDs
- Upstream dependency errors
- Container logs
- DB migration state
Example: upload artifacts in GitHub Actions
- name: Run Playwright tests
  run: npm run test:e2e
- name: Upload test artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: playwright-artifacts
    path: |
      playwright-report/
      test-results/
Artifact retention is not the whole story. The artifacts also need to be useful, which means consistent naming, stable paths, and enough context to compare attempts.
Record failure metadata
At minimum, include:
- Branch and commit SHA
- Workflow name and job name
- Runner OS and image version
- Browser or runtime version
- Attempt number if retried
- Test name and file path
- Timestamp and duration
This metadata turns a one-off failure into a pattern. That pattern is often what exposes whether a problem is environment-specific, branch-specific, or test-specific.
If you cannot answer whether a failure happened on the same test, same runner, same browser, and same commit, you do not have enough data to triage it well.
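The metadata list above can be captured as one record per failure, which then makes clustering a one-line query. This is a sketch under assumptions: `FailureRecord` and `countBy` are hypothetical names, and where you store the records (artifact, database, log line) is left open.

```typescript
// Hypothetical record shape mirroring the metadata list above.
interface FailureRecord {
  branch: string;
  commitSha: string;
  workflow: string;
  job: string;
  runnerImage: string;    // runner OS and image version
  runtimeVersion: string; // browser or runtime version
  attempt: number;
  testName: string;
  filePath: string;
  timestamp: string;      // ISO 8601
  durationMs: number;
}

// Count failures per value of one field to see where they cluster.
function countBy(records: FailureRecord[], field: keyof FailureRecord): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    const key = String(record[field]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Calling `countBy(records, "runnerImage")` or `countBy(records, "testName")` on a week of failures is often enough to show whether a problem follows the environment or the test.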
Separate infrastructure noise from product defects
One of the hardest parts of CI failure triage is deciding whether a failure is transient infrastructure noise or a real defect. The difference determines whether you rerun, investigate, or block release.
Create explicit failure signatures
Some errors are strong signals of infrastructure instability, such as:
- Connection reset by peer
- DNS resolution errors
- Browser process failed to start
- Service container unhealthy during startup
- Timeout while waiting for a dependent service to become ready
Others are more likely product failures:
- Expected text not found after the UI rendered
- API returned a business rule violation
- Form submission succeeded with invalid data
A classifier can be as simple as matching known error signatures in logs. The key is to keep the list narrow and auditable. If the list grows too broad, it becomes another hiding place for bugs.
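A signature matcher along those lines might look like the sketch below. The list is deliberately short. The identifiers and `matchTransientSignature` are hypothetical, and the patterns are examples only: `EAI_AGAIN` and `ENOTFOUND` are Node.js DNS error codes, while the browser and service-container patterns are placeholders you would replace with the exact messages your tools emit.

```typescript
// Hypothetical, auditable list of transient signatures. Keep it short and reviewed.
const TRANSIENT_SIGNATURES: { id: string; pattern: RegExp }[] = [
  { id: "connection-reset", pattern: /connection reset by peer/i },
  { id: "dns", pattern: /EAI_AGAIN|ENOTFOUND|could not resolve host/i },
  // Placeholder wording: adjust to the real messages from your browser driver and runner.
  { id: "browser-launch", pattern: /browser process failed to start/i },
  { id: "service-unhealthy", pattern: /service container .* unhealthy/i },
];

// Returns the id of the first matching signature, or null when nothing matches.
// A null result means "do not retry": unknown failures are treated as real.
function matchTransientSignature(log: string): string | null {
  const hit = TRANSIENT_SIGNATURES.find((signature) => signature.pattern.test(log));
  return hit ? hit.id : null;
}
```

Defaulting to `null` keeps the classifier conservative, so a new kind of error fails the build until someone decides it belongs on the list.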
Treat readiness as part of the test contract
In GitHub Actions, a lot of apparent flakiness is actually startup ordering. The test begins before a service is ready. If your workflow spins up Postgres, Redis, local APIs, or browser services, verify readiness before the test starts.
- name: Wait for API
  run: |
    for i in 1 2 3 4 5; do
      curl -fsS http://localhost:3000/health && exit 0
      sleep 3
    done
    exit 1
This is better than hoping the app comes up fast enough. For more complex systems, use a real health endpoint that verifies the dependencies the test actually needs.
Keep environment setup consistent
GitHub-hosted runners are convenient, but they are not identical to local laptops. Small differences in Node version, browser version, memory pressure, or file system behavior can expose bad assumptions.
Use pinned tool versions where possible, not floating defaults. For example, pin the Node version and browser channel explicitly, and install from a committed lockfile. The official GitHub Actions documentation is a good reference for runner images and workflow syntax, especially when you need to understand job dependencies and environment controls.
Use parallelism carefully
Parallelism can make CI faster, but it can also make flaky tests worse if your suite was already dependent on shared state.
Good candidates for parallel execution
- Pure unit tests with no shared mutable state
- API tests using isolated test data
- Browser tests with ephemeral environments per worker
Bad candidates for parallel execution without additional isolation
- Tests sharing a single account
- Tests manipulating the same records
- Tests that assume execution order
- Tests that depend on fixed ports or shared local resources
If you shard tests across jobs in GitHub Actions, make the shard assignment deterministic. Then when a failure happens, you can reproduce the same shard locally or in a debug job.
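A simple pattern is to use test framework sharding or file-based partitioning. In Playwright, for example, the --shard option splits the suite across jobs, while the workers setting controls parallelism within a job and keeps each worker isolated.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 4,
  // 'on-first-retry' records a trace only when a test is retried,
  // so at least one retry must be enabled for traces to appear.
  retries: 1,
  use: { trace: 'on-first-retry' },
});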
Trace collection on first retry is particularly useful because it gives you extra evidence only when needed, without turning every run into a storage problem.
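If your framework has no built-in sharding, deterministic file-based partitioning can be sketched in a few lines. The example below hashes each test file path with a 32-bit FNV-1a variant (computed over UTF-16 code units) so that the same file always lands in the same shard. `fnv1a`, `shardFor`, and `filesForShard` are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units: stable across machines and runs.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash;
}

// Zero-based shard index for a test file.
function shardFor(filePath: string, shardCount: number): number {
  return fnv1a(filePath) % shardCount;
}

// Files assigned to one shard, sorted so the run order is also reproducible.
function filesForShard(files: string[], shardIndex: number, shardCount: number): string[] {
  return files.filter((file) => shardFor(file, shardCount) === shardIndex).sort();
}
```

Hash-based assignment does not balance run time across shards, but it does mean that adding or removing one test file never moves the others, which keeps failures reproducible by shard number.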
Add quarantine, but make it temporary and visible
Quarantining flaky tests is sometimes necessary, especially when a known issue blocks delivery but the team needs the pipeline to stay useful. The danger is that quarantine becomes a permanent hiding place.
A good quarantine policy should require:
- An owner
- A reason code
- An expiration date or review date
- A link to the tracking issue
- A separate report that shows quarantined tests clearly
Do not silently exclude tests from CI. If a flaky test is not part of the release gate, make that decision visible in the workflow and in the test dashboard.
Practical quarantine patterns
- Mark the test as allowed to fail in a non-blocking job
- Run quarantined tests in a separate workflow that still reports status
- Send flaky tests to a dedicated nightly suite while the fix is in progress
The important part is that quarantine is not amnesty. The suite should still remind you that quality debt exists.
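The required quarantine fields can live in a small, checked-in list, with a check that flags entries past their review date. This is a sketch only: `QuarantineEntry` and `overdueEntries` are hypothetical names, and the date is passed in so the check itself stays deterministic.

```typescript
// Hypothetical quarantine record with the fields the policy above requires.
interface QuarantineEntry {
  testName: string;
  owner: string;
  reason: string;
  issueUrl: string;
  reviewBy: string; // ISO 8601 date, for example "2026-07-01"
}

// Entries whose review date has passed, or cannot be parsed, need attention.
function overdueEntries(entries: QuarantineEntry[], today: Date): QuarantineEntry[] {
  return entries.filter((entry) => {
    const reviewTime = Date.parse(entry.reviewBy);
    return Number.isNaN(reviewTime) || reviewTime < today.getTime();
  });
}
```

Running a check like this in a scheduled workflow and failing it when the result is non-empty is one way to keep quarantine time-bound and visible.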
Design a GitHub Actions workflow for triage, not just pass/fail
A pipeline that only says pass or fail is not enough when you are trying to reduce flaky tests in GitHub Actions. You need a workflow that helps you decide what happened.
A useful workflow shape
- Run fast checks first, such as linting and unit tests
- Run integration or browser tests in isolated jobs
- Capture artifacts on failure
- Retry only known transient steps, not the whole suite
- Separate non-blocking flaky-test tracking from release gates
- Publish a summary with failure metadata
Example workflow structure
name: ci
on: [push, pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run start:test &
      - name: Wait for app
        run: npx wait-on http://localhost:3000
      - name: Run E2E
        run: npm run test:e2e
      - name: Upload artifacts on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-failure-artifacts
          path: |
            test-results/
            playwright-report/
This structure does not solve flakiness by itself, but it creates a cleaner failure boundary. You can then observe whether failures cluster around startup, the browser layer, or the app logic.
Add summaries that humans can scan
GitHub Actions job summaries are useful for high-signal failure reports. Include links to artifacts, retry count, and a short error classification. The goal is to make the first review of a failure faster than opening the raw logs and guessing.
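Job summaries accept Markdown appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable. The sketch below only builds the text and leaves the file write to your script. `SummaryRow`, `cell`, and `buildSummary` are hypothetical names, and the column set is an assumption based on the fields suggested above.

```typescript
// Hypothetical row shape: one line per failed test.
interface SummaryRow {
  test: string;
  classification: string;
  attempts: number;
  artifactUrl: string;
}

// Escape pipes so a test name cannot break the Markdown table.
function cell(text: string): string {
  return text.replace(/\|/g, "\\|");
}

// Builds a Markdown table suitable for appending to the job summary file.
function buildSummary(title: string, rows: SummaryRow[]): string {
  const lines = [
    `## ${title}`,
    "",
    "| Test | Classification | Attempts | Artifacts |",
    "| --- | --- | --- | --- |",
    ...rows.map(
      (r) => `| ${cell(r.test)} | ${cell(r.classification)} | ${r.attempts} | [artifacts](${r.artifactUrl}) |`,
    ),
  ];
  return lines.join("\n") + "\n";
}
```

A reviewer can then see the classification and attempt count on the run page before deciding whether the raw logs are worth opening.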
Triage process, a simple decision tree
When a test fails in GitHub Actions, use a consistent triage flow.
Step 1, did the test fail deterministically on rerun?
If yes, treat it as a real failure until proven otherwise.
Step 2, does the error match a known transient signature?
If yes, inspect artifacts and retry policy. If the same transient signature appears repeatedly, there may be an infrastructure issue worth fixing.
Step 3, is the failure tied to one environment, browser, or runner image?
If yes, compare environment drift. Something may have changed outside the app.
Step 4, does the test rely on shared data, timing, or ordering?
If yes, improve isolation.
Step 5, is the app itself inconsistent under the same inputs?
If yes, it is likely a product bug or a race condition in the product code.
A small triage checklist like this prevents every failure from becoming a debate.
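The five steps can also be written as a function, which is handy if you want a bot or summary step to suggest a first classification. This is a sketch: `TriageInput`, `TriageOutcome`, and `triage` are hypothetical names, the booleans are assumed to come from your own rerun and log analysis, and the order of checks simply follows the steps above.

```typescript
// Hypothetical inputs, one per step of the decision tree above.
interface TriageInput {
  failsOnRerun: boolean;                 // step 1
  matchesTransientSignature: boolean;    // step 2
  tiedToOneEnvironment: boolean;         // step 3
  reliesOnSharedState: boolean;          // step 4: shared data, timing, or ordering
  appInconsistentForSameInputs: boolean; // step 5
}

type TriageOutcome =
  | "real-failure"
  | "transient-infrastructure"
  | "environment-drift"
  | "isolation-problem"
  | "product-race-condition"
  | "needs-more-evidence";

// Walks the steps in order and returns the first outcome that applies.
function triage(input: TriageInput): TriageOutcome {
  if (input.failsOnRerun) return "real-failure";
  if (input.matchesTransientSignature) return "transient-infrastructure";
  if (input.tiedToOneEnvironment) return "environment-drift";
  if (input.reliesOnSharedState) return "isolation-problem";
  if (input.appInconsistentForSameInputs) return "product-race-condition";
  return "needs-more-evidence";
}
```

The fallback outcome is deliberate: when none of the steps applies, the honest answer is that the failure lacks evidence, not that it was random.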
Common anti-patterns that increase flakiness
Some patterns show up again and again in unreliable CI suites.
Overusing global setup
If global setup creates shared state for all tests, failures can cascade. Prefer per-suite or per-test setup when the cost is acceptable.
Retrying assertions without fixing readiness
If a test waits for an unstable selector and retries the assertion three times, it still depends on luck. Fix the wait condition.
Ignoring logs until a failure becomes widespread
Logging and artifacts are far cheaper to add before the fire than during it.
Allowing test code to mutate production-like shared resources
Shared buckets, shared queues, and shared accounts almost always create hidden coupling.
Mixing smoke, integration, and end-to-end tests without boundaries
When every failure looks the same, triage becomes slow. Keep test layers distinct and label them clearly in CI.
A pragmatic checklist for a more reliable pipeline
If you want a short implementation plan, start here:
- Pin runtime and browser versions in GitHub Actions
- Replace sleeps with explicit waits
- Use stable selectors and deterministic test data
- Isolate databases, files, accounts, and queues per job or test
- Capture artifacts on failure, including screenshots, logs, and traces
- Retry only known transient steps, not the entire release gate
- Publish clear failure metadata, including commit, runner, and attempt number
- Keep quarantined tests visible and time-bound
- Separate infrastructure failures from product failures with explicit rules
- Review flake trends regularly, not just individual failures
If you do only one thing, improve observability first. Better logs, traces, and artifacts make every other improvement easier to validate.
A final rule of thumb
The best way to reduce flaky tests in GitHub Actions is not to suppress failures. It is to make failures more truthful.
A reliable pipeline should do three things well:
- Fail when the product is broken
- Resist unnecessary noise from the environment
- Provide enough evidence to explain the difference
That balance takes some effort, but it pays off quickly. Once your team trusts CI again, merges are faster, triage is calmer, and real regressions are much easier to spot.