Downstream Bugs as a Quality Measure

I did something dumb yesterday. I got into an argument with someone on Twitter. I usually try to stay away from that; it never ends well. The other person usually walks away thinking I’m dumb, and I leave not thinking much better of them. This time, the argument started with the claim that software development teams doing estimation with story points and task hours have 250% better quality.

Martin is referencing the information presented in this white paper (ImpactofAgileQuantified2015), where quality is defined in terms of the number of bugs found downstream. The whitepaper was behind an email wall, and I balked all over Twitter about the claim without having read it first. Skin in the game is important. I registered, downloaded my copy, and will talk about their working definition of quality with some guidance from material I have read on measurement, as well as the work of Dr. Kaner.

Reliability and Validity

Meaningful measures have what we call reliability and validity. A measure is reliable to the extent that it can be performed many times by different people, and each person will get the same result. There are three types of validity; I am concerned with construct validity right now. Construct validity is the extent to which a measure (example: 250 downstream bugs reported over a period of 1 week) corresponds to a theory or idea like quality.
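To make reliability concrete, here is a minimal Python sketch. The reviewers, reports, and classifications are invented for illustration; the point is that when two people triage the same reports differently, the resulting bug count depends on who did the counting.

```python
# Hypothetical illustration: two reviewers triage the same ten reports.
# If their classifications disagree, the resulting bug count is not a
# reliable measure -- it depends on who did the counting.

reviewer_a = ["bug", "bug", "feature", "bug", "not-a-bug",
              "bug", "bug", "cannot-reproduce", "bug", "bug"]
reviewer_b = ["bug", "feature", "feature", "bug", "bug",
              "bug", "not-a-bug", "cannot-reproduce", "bug", "feature"]

agreements = sum(a == b for a, b in zip(reviewer_a, reviewer_b))

print(f"Percent agreement: {agreements / len(reviewer_a):.0%}")  # 60%
print(f"Reviewer A counts {reviewer_a.count('bug')} bugs; "
      f"reviewer B counts {reviewer_b.count('bug')}.")           # 7 vs. 5
```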

Any kind of bug count as a measure of quality, including downstream bugs found per increment of time, has problems of both reliability and validity.

Reliability Example

Let’s say that I release a software product to the market, customers use it for a period of two weeks, and over those two weeks 45 bugs are logged into a tracking system by the people using the product. Members of the development staff review those bugs over a few days and categorize 6 of them as feature requests, 3 of them as unable to reproduce, and 4 as not a bug. The remaining 32 issues are categorized by priority (how important it is to fix them) and severity (how bad the failure is).

The varying reports in this scenario point to unreliability. The customers feel that they experienced 45 different problems while using the product. The development group, after reviewing those issues, feels that only 32 of those issues are bugs. Which party is right? There are also a few hidden questions to consider here (a sketch of the scenario’s arithmetic follows the list):

  1. Did the customer experience bugs and not report them?
  2. Did every bug reported get documented and tracked? Maybe some got lost in email threads.
  3. If a different customer found more bugs over the same period of time, does that mean product quality is worse?
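Setting those questions aside, the scenario’s own arithmetic shows the problem. A minimal sketch, using the counts from the example above:

```python
# The scenario above, in arithmetic. The counts come straight from the
# example; the triage policy is where the unreliability hides.

reported = 45
triaged_out = {"feature request": 6, "cannot reproduce": 3, "not a bug": 4}

bugs_per_customers = reported                                 # 45
bugs_per_development = reported - sum(triaged_out.values())   # 45 - 13 = 32

print(f"Customer view:    {bugs_per_customers} problems")
print(f"Development view: {bugs_per_development} bugs")
# Same product, same two weeks, two different quality numbers.
```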

Validity Example

Assume we have a product that is being used by 10 people over a period of two weeks. During that time, those 10 people report 20 bugs. In another experiment, that same product is used by 1000 people for two weeks, and that group of people reports 200 bugs. If the first group that used the product was happy and wanted to continue to be paying customers despite the fact that the second group found 200 bugs, does the product have good quality or bad quality?
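Normalizing those counts shows how slippery the number is. A small sketch with the figures from the example (the framing as bugs per user is my own, not a standard measure):

```python
# Hypothetical normalization of the two experiments above. "Bugs per
# user" is my own framing for illustration, not a standard measure.

experiments = [
    {"name": "group 1", "users": 10,   "bugs_reported": 20},
    {"name": "group 2", "users": 1000, "bugs_reported": 200},
]

for exp in experiments:
    rate = exp["bugs_reported"] / exp["users"]
    print(f"{exp['name']}: {exp['bugs_reported']} bugs, "
          f"{rate:.1f} bugs per user")
# group 1: 20 bugs, 2.0 bugs per user
# group 2: 200 bugs, 0.2 bugs per user
```

Raw counts make the second experiment look ten times worse; per-user rates make it look ten times better. Neither number answers whether anyone was happy enough to keep paying.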

Quality and Bugs Are Social

This points to the idea that quality is a social construct; it is a judgement people make based on their personal value system. A product can be valuable for one person and terrible for another at the same time. The concept of a bug is also social, and there is no standard definition for what it means. If you have ever taken part in a bug triage meeting where reported issues are getting routinely re-categorized as features, or working as designed, then you have experienced this first hand.

Bugs are also not equal things. A server crash is not the same as a form failing to submit because of a special character, and those are not the same as a performance problem. Counting bugs and making a conversion from number to quality usually pretends that bugs are consistent things. If they were, we wouldn’t need descriptors like severity or priority to help the business make decisions about what to fix and what to leave be.
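To see how much the “bugs are consistent things” assumption hides, here is a sketch that weights bugs by severity. The categories and weights are made up for illustration; any real weighting would itself be a judgement call.

```python
# A sketch of why raw counts pretend bugs are interchangeable. The
# severity weights are made up for illustration; any real weighting
# would itself be a judgement call.

SEVERITY_WEIGHT = {"crash": 10, "data loss": 10, "performance": 4,
                   "form failure": 2, "cosmetic": 1}

release_a = ["crash", "crash", "data loss"]
release_b = ["cosmetic", "cosmetic", "form failure",
             "cosmetic", "performance", "cosmetic"]

def weighted_score(bugs):
    """Sum the severity weights for a list of bug categories."""
    return sum(SEVERITY_WEIGHT[bug] for bug in bugs)

print(f"Release A: {len(release_a)} bugs, weighted score {weighted_score(release_a)}")
print(f"Release B: {len(release_b)} bugs, weighted score {weighted_score(release_b)}")
# Release A: 3 bugs, weighted score 30
# Release B: 6 bugs, weighted score 10
# The raw count says A is better; the weighted view says the opposite.
```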

Numbers are intoxicating. I can look at them and they seem to tell a clear, simple story. There isn’t one story there, though; there are several. And when I look at data, I am creating a story around what I see. That story may or may not represent a reality for the people using the product. I think it was Edwin Boring who said measurements are used to construct a reality for people who were not there to observe it. Whose reality are you constructing?

Inquiry or Control

I get that managers, directors, and C-level people have a need to understand how a project is going. They cannot be there to understand the reality of building software, because they are busy tending to other aspects of the business. Sometimes measurement is how they get that feeling. I think using measurement as a place to begin asking questions is healthy. If a team I am working on puts a new product version into production for two weeks and gets 5 bug reports, and then two weeks later releases another new version and gets 100 bug reports, the right thing to do there is ask, “What’s going on here?” The change is probably a signal that something is happening with the development team, or something is happening with the customer, or maybe both. But if we don’t start with a question, we’ll never know.
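As a sketch of measurement-as-inquiry, here is what that might look like in code: the numbers trigger a question, not a verdict. The 3x threshold is an assumption I made up, not a standard.

```python
# Measurement as a place to begin asking questions: flag a large swing
# in downstream reports as a prompt for a conversation, not a verdict
# on quality. The 3x threshold is an assumption, not a standard.

def worth_asking_about(previous: int, current: int, factor: float = 3.0) -> bool:
    """Return True when report volume changed enough to warrant a question."""
    if previous == 0 or current == 0:
        return previous != current
    return current / previous >= factor or previous / current >= factor

previous_release, current_release = 5, 100
if worth_asking_about(previous_release, current_release):
    print(f"Reports went from {previous_release} to {current_release} "
          f"-- what's going on here?")
```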

So, back to the moral of the story: it is not possible to measure a 250% improvement in quality based on a reduction in downstream bugs. It probably isn’t possible to measure a 250% quality improvement, period. If there is some way, I suspect it would be a very complicated (and nuanced) anthropology experiment that would be too expensive for any company to bother performing.

Here are some of the books I have read (incompletely) that shaped the way I think about measurement.

[Photo: a stack of books on measurement]

6 thoughts on “Downstream Bugs as a Quality Measure”

  1. Justin,
    Well done, mate. This post should enlighten some of those who were throwing somewhat unintelligent arguments on Twitter y’day. You have explained reliability, validity, and the reification problem very well.
    I did not see Kirk & Miller in the list of books?
    I did not see Kirk & Miller in the list of books?

    Cheers
    Rajesh

  2. I agree in general, and generally think more KPIs should be used as indicative measures rather than targets. But I intensely dislike the pervasive idea throughout testing that it is impossible to measure quality and impossible to show some sort of quantitative improvement. Very often you won’t get the chance to deliver a qualitative story to a CXO – they want to see KPIs and graphs going in positive directions, regardless of whether they make any sense or not. That might be the entire 5 seconds of attention you get at that level. So it’s incumbent on the test community to come up with a way of doing that which approximates truth.

    It’s worth bearing in mind that a lot of what is reported represents a model rather than the actual truth. GDP measures do not necessarily accurately assess the entire value represented in a country – reporting can be flawed, things like unpaid housework aren’t counted, the black market remains black. There are multiple inflation measures, none of which are perfect, and they can be endlessly argued over. But they have utility because they are the best approximations we have, and they are applied relatively consistently. The soft sciences in general use a number of statistics to try and convert qualitative behaviour into quantitative figures. You don’t get a pass on doing it just because it is hard.

    By using downstream bugs, what you are really doing is constructing a model. You are saying that there is a significant correlation between the number of downstream bugs and the quality of the product. That is a flawed measure, as you point out. But it is almost certainly better than no model at all – if you have a high number of downstream bugs, then it is probable that more of your customers are going to have a worse experience. You also have a measure that can be communicated upwards in a suitable form to important stakeholders, as opposed to nothing or, worse, whining.

    The answer, I think, is to build a better model. You could assign weights to defects based on criteria that matter to you from the range of options you’ve presented. You could use completely different measures, or mix and match. The key is that you should be able to show a 250% improvement in quality for your context at a particular point in time. There might be an overarching framework, but no universal measures that work for all companies at all times.

    The question then becomes not “Is this a bad measure of quality?” but “Does all or any of this apply to my product, company, and team?”

    • Ah yes, I have read that one a couple of times. I left it out of the stack because there is no title on the spine.

