Beyond coverage

It looks like there are a lot of opinions or assumptions about unit tests and code coverage, most of them confusing or biased in several ways. For example, I’ve heard or read things like “this is fine, it has X% coverage”, or “checking for coverage on pull requests doesn’t help”, or “dropping the coverage level is not an issue”, and many more of the like.

This article aims to shed some light on the issue of unit testing and code coverage. Hopefully by the end of it, we’ll have an idea of which of the previous statements are right and which are wrong (spoiler alert: they’re all wrong, but for different reasons).

It’s not a metric

I pretty much agree with Martin Fowler here 1: test coverage is not a metric that indicates how well we are doing in terms of testing, but just a reference for identifying parts of the code that need to be tested.

This leads me to the idea that coverage is not a goal, or an ending point, but just the starting point.

What I mean is this: imagine that you find in the coverage report that some lines are not being exercised; for example, two functions lack testing. Good head start, now we can write tests for them. You go on and write a test for the first one. After that, you run the tests again, and the coverage increased: the first function is now covered, so those lines no longer appear as missing in the report. You might be thinking of moving on to writing tests for the second function. That’s the trap. Testing should not stop there. What about the rest of the test scenarios for that first function? What about testing with more inputs, different combinations of parameters, side effects, and more? That’s why it’s not a goal: we shouldn’t stop testing once we’ve satisfied the missing lines in the report.
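A small sketch of that trap, with a hypothetical function (the name and cases are made up for illustration): a single test is enough to make every line appear as covered, yet several scenarios remain unverified.

```python
# Hypothetical function: every line can be "covered" by very few tests,
# even though several scenarios remain untested.
def discount(price, percent):
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price - price * percent / 100


# These two cases exercise every line (happy path plus guard clause),
# so the coverage report shows nothing missing...
assert discount(100, 10) == 90.0
try:
    discount(100, 150)
except ValueError:
    pass

# ...but scenarios like percent=0, percent=100, or a zero price were
# still unverified until we write them. Coverage said "done"; testing wasn't.
assert discount(100, 0) == 100
assert discount(100, 100) == 0
assert discount(0, 50) == 0
```

The point is that the report went green after the first two cases; the remaining asserts only exist because we thought in terms of scenarios, not lines.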

Another reason why reaching 100% coverage isn’t a valid goal is that sometimes it’s unachievable. There are parts of the code that respond to defensive programming 5 and have statements like assert(0) for unreachable conditions, which, logically, if the code works correctly, will never run (actually, the fact that the code doesn’t fail by reaching those lines when the tests run works as a way of making the code “self-testable”, so to speak).
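A minimal sketch of such a defensive guard, in Python (hypothetical function; the assert(0) idea maps to raising an AssertionError):

```python
# Hypothetical function: the final guard is logically unreachable if all
# callers are correct, so a coverage report will always flag it as missed.
# That missed line is expected, and pushing for 100% here makes no sense.
def parse_state(value):
    if value == "on":
        return True
    if value == "off":
        return False
    # Defensive programming: reaching this line means a bug upstream.
    # The fact that tests never trip it is itself a (weak) self-test.
    raise AssertionError(f"unreachable: unexpected state {value!r}")
```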

Necessity and sufficiency

And even if we reach that unrealistic goal: what does it mean to have 100% code coverage on unit tests? Can we rely on that to say that the code is fully tested, and there are no bugs? Absolutely not.

Here I am not even talking about path coverage (sometimes referred to as multiple condition coverage). To clarify: it’s known that covering all branches does not mean the program will run just fine, even with functional (manual or automated) testing. Suppose we’re completely sure all paths are covered, and the testing team has checked everything, so we’re sure all the logic is sound. It still doesn’t mean the program is correct. There are runtime considerations to be taken into account: what if there is a race condition (something hard to reproduce)? What if the server is under heavy load (with a high load average)? What if malloc() at some point returns NULL? What if the disk is full? Or if there is latency? What about security? You get the point: the list of possible failure scenarios is infinite.
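As a side note on why branch coverage is weaker than path coverage, here is a classic sketch (hypothetical function): two tests reach 100% line and branch coverage, yet the one path combination that crashes is never executed.

```python
# Hypothetical function with a bug that only appears on one *combination*
# of branches, not on any single branch in isolation.
def scale(a, b):
    x = 1
    if a:
        x = 0
    if b:
        return 10 / x  # divides by zero only when a AND b are both true
    return x


# These two tests cover every line and every branch outcome:
assert scale(True, False) == 0    # takes the first branch only
assert scale(False, True) == 10.0 # takes the second branch only

# But the path (a=True, b=True) was never run, and it raises
# ZeroDivisionError. Branch coverage is green; the bug is still there.
```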

Putting those considerations aside, the crux question is: can unit tests prove the logic (again, not the behaviour at runtime, just the logic) to be correct? No, because even with all statements analysed, there is still the possibility that things are left out.

To put it another way: if the coverage is lower than total, then assume there are things that will go south (the famous “code without tests is broken by design”). If everything is covered, then for interpreted languages (like Python) it means something like “it compiles”. It’s syntactically correct, which doesn’t mean it’s semantically correct. For compiled languages I see little gain here, except for the mere fact that it checks, at a very basic level, that the code will run.
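A quick sketch of what “it compiles” means for Python (hypothetical function with a deliberate typo): errors like a misspelled name only surface when the faulty line actually executes, so covering a line proves little more than that it can run.

```python
# Hypothetical function with a typo that Python only detects at runtime,
# when the broken line is actually executed.
def greet(formal):
    if formal:
        return "Good day, " + naem  # NameError: 'naem' is not defined
    return "hi"


# A suite that merely reaches both branches will trip over the typo:
assert greet(False) == "hi"
try:
    greet(True)  # executing the line exposes the NameError
except NameError:
    pass
```

So full coverage in Python buys you roughly what compilation buys you elsewhere: the lines can execute, nothing more.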

A high coverage is not enough

There is another interesting idea about coverage, nicely illustrated in the paper “How to misuse test coverage” 2, which is that code coverage can only tell us about the code that is there. Therefore, it can’t tell us anything about potential bugs due to missing code: it can’t detect faults of omission.

On the other hand, if instead of being guided solely by test coverage we actually think about the test scenarios that are relevant for a unit of code, we’ll start thinking of new possibilities, inputs, and combinations that will logically lead to these faults being discovered, and as a result the corrective code will be included. This is the key point: not to settle for high coverage, but to have a battery of meaningful tests that cover relevant scenarios, instead of lines of code.
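A small sketch of a fault of omission (hypothetical function): all the code that *is* there gets 100% coverage, while the missing guard stays invisible until someone asks the right “what if” question.

```python
# Hypothetical function with a fault of omission: it never checks for an
# empty sequence, and coverage cannot flag code that doesn't exist.
def average(numbers):
    return sum(numbers) / len(numbers)


# One test yields full coverage; the report is green:
assert average([2, 4, 6]) == 4.0

# Thinking in scenarios ("what if the list is empty?") surfaces the
# omission; only then does the corrective code get written:
def average_fixed(numbers):
    if not numbers:
        raise ValueError("cannot average an empty sequence")
    return sum(numbers) / len(numbers)
```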


Cover scenarios, not lines of code.

The truth is that software is complex. Really complex. There are a lot of things that can go wrong, which is why tests are a fundamental tool to ensure at least a degree of quality. It is logical to think that for each line of code there should be many more lines of tests. This applies to all projects, in all programming languages. Now, if each function should have many tests exercising it, you’ll quickly get the picture that the relation between productive code and testing code should be a ratio of 1:N. Having 100% coverage (at best) can only mean a 1:1 ratio: it could be the case of a single test covering the function, but not with sufficient cases.
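The 1:1 versus 1:N point can be sketched with a trivial hypothetical function: one line of productive code, one assert for full coverage, but several more asserts before the behaviour is actually pinned down.

```python
# One line of productive code (hypothetical helper):
def clamp(value, low, high):
    return max(low, min(value, high))


# A single assert already yields 100% line coverage -- a 1:1 ratio:
assert clamp(5, 0, 10) == 5

# A meaningful suite needs many more cases per function -- the 1:N ratio:
assert clamp(-1, 0, 10) == 0    # below the range
assert clamp(11, 0, 10) == 10   # above the range
assert clamp(0, 0, 10) == 0     # boundary: low
assert clamp(10, 0, 10) == 10   # boundary: high
```

Even this toy example needed five test lines for one productive line; real functions, with side effects and error paths, need far more.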

Relation between tests and main code

Let’s take a look at SQLite, a project that seems to have a reasonable level of testing 3. According to the document that explains its testing strategy, it has many more lines of test code than main code in the library.

To quote the document itself: the library contains roughly 122 KLOC 4, whereas the tests are about 91,596.1 KLOC (~90M LOC). The ratio is an impressive 745x.

In my opinion, this relation does not apply only to C projects; it’s something general to all programming languages. It’s just the reality of software: this is what it takes to build reliable software.

Now, with this idea in mind, knowing that we must have many more lines of testing code than productive code, because each function can have multiple outcomes and has to be exercised under multiple scenarios (validation of input, combinations of its internal conditions, and more), it becomes clear that coverage does not mean the code is thoughtfully tested at all. It then becomes evident that coverage is not the end, but the beginning of testing: once we’ve identified the lines that need to be checked, the tests shouldn’t stop once those lines have been covered; they should stop once all possible scenarios have been properly verified. It also becomes evident that we should expect many times more testing lines than main ones.


Don’t rely on coverage. Rely on thoughtful testing.


This idea was presented in a lightning talk at EuroPython 2017, on Monday 10 July. Here are the slides.

“How to misuse test coverage” - Brian Marick, http://www.exampler.com/testing-com/writings/coverage.pdf. This is an excellent paper that discusses some important points about test coverage.

1 KLOC means 1000 lines of code