The 4 stages of flakiness (part 1/3): denial, anger, depression and acceptance

Mickael Meausoone · Published in HMH Engineering
6 min read · May 26, 2022

Photo by Alexander Schimmeck on Unsplash

With a monorepo growing as fast as we can code, one day something happened: the unit and integration tests we thought we could rely on became… flaky. Also known as one of the scary stories developers tell each other after midnight!

Whatwazat? In case you are not familiar with the term, we call a test flaky when it passes, locally and in our Jenkins CI, but once in a while fails for unknown reasons. Well, not completely unknown: while some bugs are reproducible, and hence easily fixable, flaky means we can’t reliably reproduce the problem and thus can’t really be sure we fixed it, until enough time has passed without any new occurrences.

Warning: this article is mostly an introduction and there’s nearly no code here (gasp). But check out the next parts of this series for your daily dose:

  1. The 4 stages of flakiness (you are here)
  2. Log failed tests with a Jest custom reporter
  3. Retrying failed tests in Jenkins

1. Denial

Nope, not possible. I mean, we are talking about unit tests, right? In a controlled environment with mocked service responses, mocked data,
a simulated browser, a virtual machine, what could go wrong?

Well, a lot. Usually you don’t get much flakiness in simple and straightforward tests: checking additions, helpers. Vanilla code.
But the deal changes when we start adding React components, service calls, user clicks, waiting for things to happen with state updates and useEffect, etc.
Now we have seriously complex interactions, and that includes our dependencies: JSDOM, Jest, React Testing Library, Mock Service Worker.

Not only can our code have faults and our tests be imperfect, but the tools we are using are also open to mistakes and bugs! And indeed, they get fixes and updates. And don’t get me started on Jenkins running our code in a Docker image on another platform, creating even more potential for failure (slowness, for starters).

2. Anger

OK, so now we have failed tests. But when I run it locally it passes, OK?!
It’s not my fault, something is wrong with the way we run it in the CI!
When more tests started occasionally failing, we turned against each other (I’m being purposely dramatic; we are friendly people here at HMH). Who wrote that test? Is there something wrong here?

When switching to Jest and Testing Library there were definitely some mistakes and, to be honest, we are still learning.
Additionally, as the number of tests grows, hard-to-diagnose performance issues start to weigh on test stability and eventually result in failing tests.

So what can be done?

2.1. Limit CI Jest command line to X workers (+ some tweaking)
The Jest settings and the limitations of the container running the code are possible causes of flakiness. When in doubt, check what your CI can actually handle!
In our case we chose to limit the number of workers Jest uses to run test files in parallel:
yarn jest --maxWorkers=1
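
If you prefer configuration over CLI flags, the same limit can live in the Jest config file; a minimal sketch (maxWorkers also accepts a percentage):

    // jest.config.js: equivalent to the --maxWorkers=1 flag above
    module.exports = {
      // run test files serially; a value like '50%' uses half the available cores
      maxWorkers: 1,
    };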

2.2. Reduce coverage duplication
Do we need to test this particular use case for both the parent and the children components? Can we do this preparation only once? Maybe by using the “pure” entry point of Testing Library (which disables the automatic cleanup after each test block) to avoid repetition? See the sketch below.

2.3. Split the bigger tests into multiple files
Smaller files helped both developers and the CI, so this was a good way to reduce problems. And if you are testing more than one use case in an integration test, e.g. one for each user role, maybe make a file for each role.

2.4. Determine the good practices, document them, enforce them
This was a necessary step: we wrote a testing doc and listed our findings on potential memory leak sources and flakiness causes, all with fix suggestions. When possible, new ESLint rules helped limit mistakes.
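
As an illustration, a sketch of the kind of rules we mean, assuming eslint-plugin-jest and eslint-plugin-testing-library are installed (exact rule names vary between plugin versions):

    // .eslintrc.js: a sketch, not our actual configuration
    module.exports = {
      plugins: ['jest', 'testing-library'],
      rules: {
        'jest/no-focused-tests': 'error', // no stray .only left behind
        'testing-library/await-async-query': 'error', // findBy* must be awaited
        'testing-library/no-render-in-setup': 'error', // render inside it(), not beforeEach
      },
    };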

2.5. Keep dependencies up to date
Sometimes it was a non-negligible effort, but definitely worthwhile: be mindful of library updates. Maybe don’t jump on the latest version (always wait for the .1 release), but definitely avoid too large a gap. It is also a good idea to avoid updating only parts of a whole: some updates trigger a need for fixes elsewhere (Jest 27, we’re looking at you!).

2.6. Skip overly flaky tests
Not a real solution, but it can help when desperate!
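
Jest makes this a one-keyword change (the test name here is hypothetical):

    // temporarily disabled while we investigate the flakiness
    it.skip('updates the cart badge after adding an item', async () => {
      // ...
    });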

2.7. Update tests to fix common mistakes
Examples of those are:

  1. Wait for asynchronous results in tests: maybe you’ve seen a “not wrapped in act(...)” error in the console, one of the visible symptoms. Locally, and when running a single test, it’s harder to tell, but sometimes there’s still code running when a test is over. It can even make Jest crash (we saw a “createEvent on undefined” error in the CI console that terminated Jest with a silent fail).
  2. Don’t forget to await your promises (or use done() appropriately). This one should be straightforward: it is the same as point 1, but for selectors and test code.
  3. Make sure render is called in Jest it() or test() blocks:
    rendering in beforeAll or beforeEach is a bad practice. Those functions should be used for test preparation, and render should be part of the test, mostly because rendering a component can trigger asynchronous code whose result should be awaited. And tested.
  4. Keep test blocks small: waitFor and Promise-based selectors like findBy have a timeout, but it() blocks also have one. So the more that happens in a block, the more likely you are to reach that limit.
  5. Be mindful of selectors
    Using ByRole is a good practice but can be very slow: we measured it at 600ms in some tests, and that is not a maximum. More than half a second, on a strong machine, locally! We recommend coupling it with within and ByTestId to enjoy both speed AND relevant selectors. This contributes to point 4 as well. A sketch applying these points follows this list.

Those are all definitely good things, but like a tire with too much pressure, fix one hole and another appears: this was not enough and never would be. We kept adding stuff, you see?

3. Depression

The consequences of flakiness are pretty bad and underestimated. Taken independently, each failure looks annoying at most. But as a whole it accumulates and starts to matter.

3.1. The snowball effect
Imagine having a PR (Pull Request, code you want to merge into the mother-of-all monorepos), and the only thing between you and finishing that part of your work is a passing CI, represented by a green “Merge” button.

Then the CI tests fail and the coverage isn’t right. You can’t merge (we have a system that requires passing tests, solid coverage and a lot of other good things, all of which are seriously impacted by flakiness).
Other developers are also impacted, adding to the pressure, and we all have deadlines. So when one merge is delayed an afternoon, it’s really 3, 4 PRs or more waiting behind it, pushing everything back. And this accumulates.

3.2. No way out
The only option is rerunning the full CI (which can sometimes mean an additional 1–2 hours) and hoping nobody merges in the meantime, resetting your CI (despite our system having a queue and notifications in our discussion channels, it happens; add 1–2 hours again).

But like playing roulette and hoping to win, this exposes you to exactly the same outcome: failures, maybe different ones, but still no merge. Finishing is now completely uncertain. You lose confidence in the system.
* Sad developer noises *

4. Acceptance

So all those solutions we implemented helped, and are still really necessary, but at some point you have to accept the truth: flakiness isn’t going away, we will have to deal with it. Differently.

I read a 2016 article mentioning that even at Google they have flaky tests. At that time they estimated that more than 1 in 7 tests could be flaky, and that is a LOT of potentially failing tests. Are those guys noobs? Nope.
That was when I realised we had to find another way. And it was in front of us from the start!

5. A solution

When we rerun the tests, they pass, right? OK, then let’s do just that!

We are using Jenkins to run the CI tests, and though we do use jest-circus to rerun single test blocks in a file, this doesn’t work well with our integration tests, which require preparation and several steps. And unfortunately, it just doesn’t solve everything.
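
For reference, that per-block rerun relies on jest.retryTimes, which requires the jest-circus runner (the default since Jest 27); a minimal sketch:

    // retry each failed test block in this file up to 2 extra times
    jest.retryTimes(2);

    it('sometimes needs a second chance', async () => {
      // ...
    });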

After working on a Jenkins pipeline for something completely unrelated, a way to automatically retry the tests that failed started emerging:
- a Jest custom reporter can collect the list of files that failed
- a shell script can use that list to rerun only those files, not only saving time (instead of rerunning everything) but also removing the false positives from the final report.
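
As a teaser, a minimal sketch of what such a reporter could look like (the file and output names are assumptions; the detailed version comes in Part 2):

    // failedTestsReporter.js: a sketch, the real implementation comes in Part 2
    const fs = require('fs');

    class FailedTestsReporter {
      // Jest calls this once, after the whole test run is complete
      onRunComplete(_contexts, results) {
        const failedFiles = results.testResults
          .filter((suite) => suite.numFailingTests > 0 || suite.testExecError)
          .map((suite) => suite.testFilePath);
        fs.writeFileSync('failed-tests.txt', failedFiles.join('\n'));
      }
    }

    module.exports = FailedTestsReporter;

It would then be registered in the Jest config via reporters: ['default', './failedTestsReporter.js'].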

Watch out for Part 2 and the detailed article on the custom Jest reporter!

Thanks for reading!

Mickael Meausoone, Staff Software Engineer, passionate about JavaScript and web tech in general, curious and likes solving interesting problems. Will code for chocolate.
