Beyond coverage

It looks like there are a lot of opinions and assumptions about unit tests and code coverage, most of them confusing or biased in several ways. For example, I’ve heard or read things like “this is fine, it has X% coverage”, or “checking for coverage on pull requests doesn’t help”, or “dropping the coverage level is not an issue”, and many more of the like.

This article aims to shed some light on the issue of unit testing and code coverage. Hopefully, by the end of it, we’ll have an idea of which of the previous statements are right and which are wrong (spoiler alert: they’re all wrong, but for different reasons).

It’s not a metric

I pretty much agree with Martin Fowler here [1]: test coverage is not a metric that indicates how well we are doing in terms of testing, but just a reference for identifying parts of the code that need to be tested.

This leads me to the idea that coverage is not a goal, or an ending point, but just the starting point.

What I mean by this is the following. Imagine that you find in the coverage report that some lines are not being exercised; for example, there are two functions that lack testing. Good head start, now we can write tests for them. You go on and write a test for the first one. After that, you run the tests again, and the coverage has increased: the first function is now covered, so those lines no longer appear as missing in the report. Now you might be thinking of moving on to writing tests for the second function. That’s the trap. Testing should not stop there. What about the rest of the test scenarios for that first function? What about testing with more inputs, different combinations of parameters, side effects, and more? That’s why it’s not a goal: we shouldn’t stop testing once we’ve satisfied the missing lines on the report.
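To make the trap concrete, here is a minimal sketch. The function normalize_username and all the test names are hypothetical, made up only for illustration. The first test alone is enough to make the coverage report look complete; the remaining ones add no coverage at all, yet each checks a scenario the first test says nothing about.

```python
def normalize_username(name):
    """Hypothetical helper: trim surrounding whitespace and lowercase."""
    return name.strip().lower()


# This single test already executes every line of the function,
# so a coverage report would show nothing missing.
def test_trims_and_lowercases():
    assert normalize_username("  Alice ") == "alice"


# The scenarios below do not increase coverage, but each one pins
# down behaviour that the first test leaves unspecified.
def test_already_normalized_is_unchanged():
    assert normalize_username("alice") == "alice"


def test_empty_string_stays_empty():
    assert normalize_username("") == ""


def test_inner_whitespace_is_preserved():
    assert normalize_username(" Al Ice ") == "al ice"


def test_tabs_and_newlines_are_trimmed():
    assert normalize_username("\tAlice\n") == "alice"
```

The tests are written as plain functions with assert statements, so they can be called directly or collected by a runner such as pytest.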

Another reason why reaching 100% coverage is not a valid goal is that sometimes it is unachievable. There are parts of the code that respond to defensive programming [5], and have statements like assert(0) for unreachable conditions which, logically, will never run if the code works correctly (actually, the fact that the tests don’t fail by reaching those lines works as a way of making the code “self-testing”, so to speak).
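As a rough illustration of such a defensive line in Python, consider the sketch below. The Color enum and the to_hex function are hypothetical. The trailing comment assumes the coverage tool is coverage.py, whose default configuration excludes lines marked with "# pragma: no cover" from the report; other tools may need a different marker.

```python
from enum import Enum


class Color(Enum):
    RED = "red"
    GREEN = "green"


def to_hex(color):
    """Hypothetical function mapping each Color member to a hex code."""
    if color is Color.RED:
        return "#ff0000"
    if color is Color.GREEN:
        return "#00ff00"
    # Unreachable as long as every Color member is handled above.
    # No test can execute this line unless the code is already broken.
    raise AssertionError(f"unhandled color: {color!r}")  # pragma: no cover
```

If someone later adds a new member to Color and forgets to handle it, any test that passes that member will fail loudly on the last line, which is the “self-testing” effect described above.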

Necessity and sufficiency

And even if we reach that unrealistic goal: what does it mean to have 100% code coverage on unit tests? Can we rely on that to say that the code is fully tested, and there are no bugs? Absolutely not.

Here I am not even talking about stricter criteria such as path coverage or multiple condition coverage. To clarify, it’s known that covering all branches does not mean the program will run just fine, even with functional (manual or automated) testing. Suppose that we’re completely sure all paths are covered, and the testing team has checked everything, so we’re sure all the logic is sound. It still doesn’t mean the program is correct. There are runtime considerations to take into account. What if there is a race condition (something hard to reproduce)? What if the server is under heavy load (with a high load average)? What if malloc() at some point returns NULL? What if the disk is full? Or if there is latency? What about security? You get the point: the list of possible failure scenarios is infinite.
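Some of these runtime conditions can be simulated in a unit test, but only the ones somebody thought of in advance. The sketch below simulates one of them, a full disk, using unittest.mock from the standard library. The save_report function and its policy of returning False are assumptions made up for the example.

```python
import errno
from unittest import mock


def save_report(path, text):
    """Hypothetical function: write text to path, return False if the disk is full."""
    try:
        with open(path, "w") as handle:
            handle.write(text)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise
    return True


def simulate_full_disk():
    """Call save_report while open() pretends no space is left on the device."""
    full_disk = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=full_disk):
        return save_report("report.txt", "some text")
```

Calling simulate_full_disk() exercises the error-handling branch without touching the real file system. It says nothing about race conditions, load, or latency, which would each need their own scenario.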

Putting those considerations aside, the crux of the question is: can unit tests prove the logic (again, not the behaviour at runtime, just the logic) to be correct? No, because even with all statements exercised, there is still the possibility that things are left out.

To put it another way: if the coverage is lower than total, then assume there are things that will go south (the famous “code without tests is broken by design”). If everything is covered, then for interpreted languages (like Python), it means something like “it compiles”. It’s syntactically correct, which doesn’t mean it’s semantically correct. For compiled languages I see little gain here, except for the mere fact that it checks, at a very basic level, that the code will run.
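A small sketch of “covered, runs, and still wrong”. The is_adult function and the rule it is supposed to implement are hypothetical. The test executes every line, so it would catch a misspelled name or a broken import, but it does not notice that the comparison is off by one.

```python
def is_adult(age):
    """Hypothetical rule: a person is an adult from the age of 18."""
    return age > 18  # Bug: the rule above requires >= 18.


def test_is_adult():
    # Executes every line of is_adult, so coverage reports 100%.
    assert is_adult(30)
```

The test passes and the report is complete, yet is_adult(18) returns False, which contradicts the stated rule. Only a test written with the boundary scenario in mind would reveal it.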

A high coverage is not enough

There is another interesting idea about coverage, nicely illustrated in the paper “How to Misuse Code Coverage” [2], which is that code coverage can only tell us about the code that is there. Therefore, it can’t tell us anything about potential bugs that are due to missing code. It can’t detect faults of omission.
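A fault of omission could look like the following sketch, where parse_port and parse_port_checked are hypothetical names. The first version is fully covered by its test, and the report has no way of flagging the range check that was never written.

```python
def parse_port(text):
    """Hypothetical parser: every line is covered by the test below."""
    return int(text)


def test_parse_port():
    assert parse_port("8080") == 8080


def parse_port_checked(text):
    """The same parser after asking: what about values that are not valid ports?"""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

With the first version, parse_port("70000") happily returns 70000 while coverage stays at 100%. The missing check only appears once somebody thinks about the scenario and writes a test for it.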

On the other hand, if instead of being guided just by the test coverage, we actually think about the test scenarios that are relevant for a unit of code, we’ll start thinking of new possibilities, inputs, and combinations that will logically lead to these faults being discovered and, as a result, to the corrective code being added. This is the key point: not to settle for high coverage, but to aim for a battery of meaningful tests that cover relevant scenarios, instead of lines of code.

Tip

Cover scenarios, not lines of code.

The truth is that software is complex. Really complex. There are a lot of things that can go wrong. Therefore, tests are a fundamental tool to ensure at least a degree of quality. It is logical to think that for each line of code there should be many more lines of tests. This applies to all projects, in all programming languages. Now, if for each function we should have many more functions just testing it, you’ll quickly get the picture that the relation between productive code and testing code should be in the ratio 1:N. Having 100% coverage may, at best, mean no more than a 1:1 ratio. It could be the case of a single test covering the function, but not with sufficient cases.
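The 1:N relation shows up even for a one-line function. In the sketch below, clamp, the scenario table, and the run_scenarios helper are all hypothetical; the table-driven style with subTest from the standard unittest module is just one possible way to write it. Any single row would give 100% coverage of clamp, and the other rows are what actually describe its behaviour.

```python
import unittest


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(value, high))


class ClampScenarios(unittest.TestCase):
    # (label, value, low, high, expected): one row per scenario.
    CASES = [
        ("inside the range", 5, 0, 10, 5),
        ("below the range", -3, 0, 10, 0),
        ("above the range", 42, 0, 10, 10),
        ("on the lower bound", 0, 0, 10, 0),
        ("on the upper bound", 10, 0, 10, 10),
        ("range with a single value", 7, 3, 3, 3),
    ]

    def test_scenarios(self):
        for label, value, low, high, expected in self.CASES:
            with self.subTest(label):
                self.assertEqual(clamp(value, low, high), expected)


def run_scenarios():
    """Run the scenario table and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampScenarios)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Here one line of productive code is matched by six scenarios, and the list is far from exhaustive (floats, a range where low is greater than high, and so on).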

Relation between tests and main code

Let’s take a look at SQLite, which is a project that seems to have a reasonable level of testing [3]. According to the document that explains its testing strategy, it has many more lines of test code than main code in the library.

To cite the figures from the document itself: the library contains roughly 122 KLOC [4], whereas the tests are about 91,596.1 KLOC (~91.6M LOC). The ratio is an impressive 745x.

In my opinion, this relation does not only apply to C projects; it’s something general to all programming languages. It’s just the reality of software. This is what it takes to build reliable software.

Now, with this idea in mind, knowing that we must have many more lines of testing code than productive code, because each function can have multiple outcomes and has to be exercised under multiple scenarios (validation of input, combination of its internal conditions, and more), it becomes clear that coverage does not mean that the code is thoughtfully tested at all. It then becomes evident that coverage is not the end, but the beginning of testing: once we’ve identified the lines that need to be checked, the tests won’t stop once they’ve been covered; they should stop once all relevant scenarios have been properly verified. It also becomes evident that we should expect to have many times more lines of tests than of main code.

Tip

Don’t rely on coverage. Rely on thoughtful testing.

Slides

This idea was presented in a lightning talk at EuroPython 2017, on Monday, 10 July. Here are the slides.

[1] https://martinfowler.com/bliki/TestCoverage.html
[2] “How to Misuse Code Coverage” - Brian Marick http://www.exampler.com/testing-com/writings/coverage.pdf This is an excellent paper that discusses some important points about test coverage.
[3] https://sqlite.org/testing.html
[4] 1 KLOC means 1000 lines of code
[5] https://en.wikipedia.org/wiki/Defensive_programming