Beyond coverage

There seem to be a lot of opinions and assumptions about unit tests and code coverage, most of them confusing or biased in several ways. For example, I’ve heard or read things like “this is fine, it has X% coverage”, or “checking for coverage on pull requests doesn’t help”, or “dropping the coverage level is not an issue”, and many more of the like.

This article aims to shed some light on the issue of unit testing and code coverage. Hopefully by the end of it, we’ll get an idea of which of the previous statements are right and which are wrong (spoiler alert: they’re all wrong, but for different reasons).

It’s not a metric

I pretty much agree with Martin Fowler here [1]: test coverage is not a metric that indicates how well we are doing in terms of testing, but just a reference for identifying parts of the code that need to be tested.

This leads me to the idea that coverage is not a goal, or an ending point, but just the starting point.

What I mean by this is the following. Imagine that the coverage report shows that some lines are not being exercised; for example, there are two functions that lack testing. Good head start, now we can write tests for them. You go on and write a test for the first one. After that, you run the tests again, and the coverage has increased: the first function is now covered, so those lines no longer appear as missing in the report. Now you might be thinking of moving on to writing tests for the second function. That’s the trap. Testing should not stop there. What about the rest of the test scenarios for that first function? What about testing with more inputs, different combinations of parameters, side effects, and more? That’s why it’s not a goal: we shouldn’t stop testing once we’ve satisfied the missing lines on the report.
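To make the trap concrete, here is a minimal sketch. The function `is_adult`, its rule, and the extra cases are all hypothetical, made up for illustration. A single test executes every line of the function, so the report would show it as covered, yet a boundary bug survives until more scenarios are tried.

```python
def is_adult(age):
    """Hypothetical rule: 18 and above counts as an adult."""
    return age > 18  # bug on purpose: the comparison should be >=


def test_is_adult_typical_age():
    # Executes the only line of is_adult, so the report marks it as covered.
    assert is_adult(30) is True


# Further scenarios for the same, already "covered", function.
MORE_CASES = [(17, False), (18, True), (19, True)]


def failing_cases():
    """Return the (age, expected) pairs that is_adult gets wrong."""
    return [
        (age, expected)
        for age, expected in MORE_CASES
        if is_adult(age) is not expected
    ]
```

The coverage number is the same before and after adding the extra cases; only the extra cases reveal that the boundary value 18 is handled incorrectly.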

Another reason why reaching 100% coverage is not a valid goal is that sometimes it is unachievable. There are parts of the code that respond to defensive programming [5], and have statements like assert(0) for unreachable conditions which, if the code works correctly, will never run (actually, the fact that the tests run without failing on those lines works as a way of making the code “self-testing”, so to speak).
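A Python flavour of the same idea could look like the sketch below. The `Status` enum and `can_log_in` are hypothetical names, and the trailing comment assumes coverage.py is the measuring tool, since `# pragma: no cover` is the marker it excludes from the report by default.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"


def can_log_in(status):
    """Decide whether an account in the given state may log in."""
    if status is Status.ACTIVE:
        return True
    if status is Status.SUSPENDED:
        return False
    # Defensive guard: every member of Status is handled above, so this
    # line should never run. If it ever does, the code has found a bug.
    raise AssertionError(f"unhandled status: {status!r}")  # pragma: no cover
```

No well-behaved test can reach the last line, so without the exclusion marker this function could never show 100% coverage, even though the guard makes the code safer.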

Necessity and sufficiency

And even if we reach that unrealistic goal: what does it mean to have 100% code coverage on unit tests? Can we rely on that to say that the code is fully tested, and there are no bugs? Absolutely not.

Here I am not even talking about path coverage (sometimes referred to as multiple condition coverage). To clarify, it’s known that covering all branches does not mean the program will run just fine, even with functional (manual or automated) testing. Suppose that we’re completely sure all paths are covered, and the testing team has checked everything, so we’re sure all the logic is sound. It still doesn’t mean the program is correct. There are runtime considerations to take into account: what if there is a race condition (something hard to reproduce)? What if the server is under heavy load (with a high load average)? What if malloc() at some point returns NULL? What if the disk is full? Or if there is latency? What about security? You get the point: the list of possible failure scenarios is endless.
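The gap between covering all branches and covering all paths can be sketched with a small, hypothetical function. Two tests are enough to take every branch in both directions, but the combination that neither test tries is the one that fails.

```python
def compute(a, b):
    """Hypothetical function with two independent conditions."""
    x = 1
    if a:
        x = 0
    if b:
        return 10 / x
    return x


def test_first_branch_taken():
    # a is true, b is false
    assert compute(True, False) == 0


def test_second_branch_taken():
    # a is false, b is true
    assert compute(False, True) == 10.0
```

Together the two tests take both outcomes of both `if` statements, so a branch coverage report would be complete. The path where both conditions are true, `compute(True, True)`, was never executed, and it divides by zero.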

Putting those considerations aside, the crucial question is: can unit tests prove the logic (again, not the behaviour at runtime, just the logic) to be correct? No, because even with all statements exercised, there is still the possibility that things are left out.

To put it another way: if the coverage is lower than total, then assume there are things that will go south (the famous “code without tests is broken by design”). If everything is covered, then for interpreted languages (like Python) it means something like “it compiles”. It’s syntactically correct, which doesn’t mean it’s semantically correct. For compiled languages I see little gain, except for the mere fact that it checks, at a very basic level, that the code will run.
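As a rough illustration of “covered only means it ran”, consider this sketch, in which all names are hypothetical. The first helper mimics a test that calls the code without checking anything; it produces full coverage and always passes, even though the function is wrong.

```python
def is_even(number):
    """Buggy on purpose: the comparison is inverted."""
    return number % 2 == 1


def exercise_only():
    """Mimics a test that runs the code but asserts nothing."""
    is_even(4)
    return "passed"


def exercise_and_verify():
    """Mimics a test that also checks the result."""
    return "passed" if is_even(4) is True else "failed"
```

Both helpers execute exactly the same lines, so a coverage tool cannot tell them apart. Only the second one notices that `is_even(4)` gives the wrong answer.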

A high coverage is not enough

There is another interesting idea about coverage, nicely illustrated in the paper “How to Misuse Code Coverage” [2], which is that code coverage can only tell us about the code that is there. Therefore, it can’t tell us anything about potential bugs that are due to missing code. It can’t detect faults of omission.
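A fault of omission might look like the following sketch. The function `parse_age` and the requirement that ages cannot be negative are assumptions made up for the example. The test covers every line that exists; the bug lives in a line that was never written.

```python
def parse_age(text):
    """Convert user input to an age (hypothetical example).

    Nothing here rejects negative numbers: that check is the missing code.
    """
    return int(text.strip())


def test_parse_age():
    # Covers 100% of parse_age as it is written.
    assert parse_age(" 42 ") == 42
```

With this test the report is fully green, and yet `parse_age("-5")` happily returns `-5`. There is no uncovered line to point at, because the validation that should exist is simply not there.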

On the other hand, if instead of being guided only by the test coverage we actually think about the test scenarios that are relevant for a unit of code, we’ll start thinking of new possibilities, inputs, and combinations that will logically lead to these faults being discovered and, as a result, to the corrective code being included. This is the key point: not to settle for high coverage, but to aim for a battery of meaningful tests that cover relevant scenarios, instead of lines of code.

Tip

Cover scenarios, not lines of code.
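One possible way to organise tests around scenarios is a table of named cases, sketched here with the standard library’s `unittest` and its `subTest` context manager. The function `clamp` and the list of scenarios are hypothetical and only meant to show the shape.

```python
import unittest


def clamp(value, low, high):
    """Limit value to the inclusive range [low, high] (hypothetical example)."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


class ClampScenarios(unittest.TestCase):
    # Each entry is a scenario worth checking, not a line to be covered:
    # (name, value, low, high, expected)
    SCENARIOS = [
        ("inside the range", 5, 0, 10, 5),
        ("below the range", -3, 0, 10, 0),
        ("above the range", 42, 0, 10, 10),
        ("on the lower bound", 0, 0, 10, 0),
        ("on the upper bound", 10, 0, 10, 10),
        ("degenerate range", 7, 3, 3, 3),
    ]

    def test_scenarios(self):
        for name, value, low, high, expected in self.SCENARIOS:
            with self.subTest(name):
                self.assertEqual(clamp(value, low, high), expected)

    def test_inverted_range_is_rejected(self):
        with self.assertRaises(ValueError):
            clamp(1, 10, 0)
```

The first scenario plus the inverted-range test already execute every line of `clamp`. Everything after that adds nothing to the coverage figure, which is exactly why the figure is a poor place to stop.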

The truth is that software is complex. Really complex. There are a lot of things that can go wrong. Therefore, tests are a fundamental tool to ensure at least a degree of quality. It is logical to think that for each line of code there should be many more lines of tests. This applies to all projects, in all programming languages. Now, if for each function we should have many more functions just testing it, you’ll quickly get the picture that the relation between productive code and testing code should be in the ratio 1:N. Having 100% coverage can, at best, only guarantee a 1:1 ratio. It could be the case of a single test, covering the function, but not with sufficient cases.

Relation between tests and main code

Let’s take a look at SQLite, which is a project that seems to have a reasonable level of testing [3]. According to the document that explains its testing strategy, it has many more lines of test code than main code in the library.

To quote the document itself: the library contains roughly 122 KLOC [4], whereas the tests are about 91,596.1 KLOC (~90M LOC). The ratio is an impressive 745x.

In my opinion, this relation does not only apply to C projects; it’s something general to all programming languages. It’s just the reality of software. This is what it takes to build reliable software.

Now, with this idea in mind, knowing that we must have many more lines of testing code than productive code, because each function can have multiple outcomes and has to be exercised under multiple scenarios (validation of input, combinations of its internal conditions, and more), it becomes clear that coverage does not mean that the code is thoughtfully tested at all. It then becomes evident that coverage is not the end, but the beginning of testing: once we’ve identified the lines that need to be checked, testing shouldn’t stop once they’ve been covered; it should stop once all possible scenarios have been properly verified. It also becomes evident that we should expect to have many times more lines of testing code than of main code.

Tip

Don’t rely on coverage. Rely on thoughtful testing.

Slides

This idea was presented in a lightning talk at EuroPython 2017, on Monday, 10 July. Here are the slides.

[1] https://martinfowler.com/bliki/TestCoverage.html
[2] “How to Misuse Code Coverage” - Brian Marick, http://www.exampler.com/testing-com/writings/coverage.pdf. An excellent paper that discusses some important points about test coverage.
[3] https://sqlite.org/testing.html
[4] 1 KLOC means 1000 lines of code
[5] https://en.wikipedia.org/wiki/Defensive_programming