Unit testing – you’re measuring it wrong

I’ve been having this problem ever since I started teaching (and preaching) about Scrum, clean code, unit testing, TDD and every other software development practice.
When implementing a change - how do you measure its success?
For that matter – how can you measure a developer’s productivity?
[Image: “Faccia is watching” by Our Hero, on Flickr – Creative Commons Attribution-Noncommercial-Share Alike 2.0 Generic License]

A few years ago I worked with a “brilliant” manager who tried to measure developers’ productivity by counting the number of bugs each developer fixed on a given day. Being my argumentative old self, I explained that no two bugs are created equal and that, besides, we should aspire not to write those bugs in the first place.
My comment was not well received – I was branded a subversive element – he felt that I wanted to prevent him from assessing developers’ work (which I didn’t).
Over the years several attempts have been made to measure developers’ productivity – with similarly poor results – such as counting lines of code:
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”
- Bill Gates
And it only gets trickier when faced with the need to measure unit testing efforts.

How to measure unit testing effectiveness

The problem is that it’s impossible to measure how much time was saved or how many bugs were prevented. The only viable way would be to have two developers write the same feature and compare the results – and I’ve yet to find a company willing to throw money down the drain.
A few years ago I found a study (which I cite all the time) that did something similar: two teams worked on the same project, one using TDD and one not; their coding time was measured and bug density was checked afterwards (guess who had fewer bugs). Unfortunately I cannot reproduce that experiment at every single company I work with in order to show that unit testing/TDD improved the way they do things.
There are two “metrics” that seem to interest managers where unit testing is concerned:
  1. Test count (i.e. how many tests we have)
  2. Test coverage (i.e. what percentage of the code was tested)
Both are inherently flawed.
Test count does not tell me anything about the project’s state, and it’s impossible to specify a number of tests per feature – how many tests are “enough”? 10, 100, 1,000 or maybe a million?
As for code coverage – it’s a tool, a very effective tool that helps me daily to find code I still need to test and to understand how healthy my unit testing suite is. To put it in the words of a former boss of mine:
What do you mean 20% code coverage? we had 75% three months ago…
I know that I need an excellent reason to have less than 50% code coverage for a component, but how much coverage is enough coverage? 80%, 95% or 100%?
And just like any other tool, code coverage can be gamed and easily abused to show good results for very bad code and tests.
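To see how easily coverage can be gamed, here’s a minimal sketch (C# with NUnit; the PriceCalculator class and its bug are made up for illustration): a “test” that executes every line of the method, so a coverage tool happily reports 100% for it, yet it asserts nothing and will never catch the bug.

```csharp
using NUnit.Framework;

// Hypothetical production code with a subtle bug: the discount is applied twice.
public class PriceCalculator
{
    public decimal ApplyDiscount(decimal price, decimal discountPercent)
    {
        var discounted = price - (price * discountPercent / 100);
        return discounted - (discounted * discountPercent / 100); // bug: discount applied a second time
    }
}

[TestFixture]
public class PriceCalculatorTests
{
    // This "test" runs every line of ApplyDiscount, so coverage tools
    // report 100% for the method - yet there is no assertion, so the
    // double-discount bug sails right through.
    [Test]
    public void ApplyDiscount_Executes()
    {
        var calculator = new PriceCalculator();
        calculator.ApplyDiscount(100m, 10m);
    }
}
```

The coverage report looks perfect; the test verifies nothing.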
[Image: “Crossroads: Success or Failure” by StockMonkeys.com, on Flickr – Creative Commons Attribution 2.0 Generic License]

Both code coverage and test count are good at showing quantity, not quality. They can be used to measure progress to some extent (we had 100 tests and 50% coverage a week ago; now we have 1,000 tests and 85% coverage), but both should be handled with care.

Why do you want to know?

There’s always a reason behind the need to measure or predict success. Usually when I’m asked about measuring unit tests it’s because of a need to estimate future effort (which we still don’t know how to do). Someone up the food chain needs to fill in reports and Gantt charts, approve expenses, and know what will change in the next X days/weeks/months. Perhaps the person who wanted to start using unit tests needs to explain it to their superiors or investors.
The problem is that any metric I give would be inaccurate – I can show you two different projects with roughly the same number of unit tests and 90% code coverage, where one is a success and the other an utter and complete failure.
From time to time I’m forced to provide these guesstimates – people need to know where they’re heading and how much work is planned. When doing this I do my best to explain that these are educated guesses and that the real benefits go beyond these numbers but are hard to predict and quantify – but no one ever listens…
It all comes back to the fact that I cannot show the time saved or the bugs that were never written thanks to the change – the two factors I find most interesting.

Success is measurable – in retrospect

I promise you one thing – after a short while the team will feel the change, for better or worse (usually for the better). Adding features becomes a breeze, fewer bugs are introduced with each new feature, shipping a working product is easier and faster, and developers tend to be more effective in their work (and usually happier).
[Image: “Success” by aloshbennett, on Flickr – Creative Commons Attribution 2.0 Generic License]

I usually use NDepend to show the current code complexity and how various coding violations disappeared after a few tests were written and some refactoring was done.
Looking back after a few weeks, I can show real, tangible improvement that I couldn’t possibly have predicted with an accurate number.
And so I keep collecting these numbers in the hope that someday, when asked to predict or set measurable targets for development, I’ll have a good answer…
