Downstream Bugs as a Quality Measure

I did something dumb yesterday: I got into an argument with someone on Twitter. I usually try to stay away from that; it never ends well. The other person usually walks away thinking I'm dumb, and I leave not thinking much better of them. This time, the argument started with the claim that software development teams doing estimation with story points and task hours have 250% better quality.

Martin is referencing the information presented in this white paper — ImpactofAgileQuantified2015 — where quality is defined in terms of the number of bugs found downstream. The white paper was behind an email wall, and I balked all over Twitter about the claim without having read it first. Skin in the game is important. I registered, downloaded my copy, and will talk about their working definition of quality with some guidance from material I have read on measurement, as well as the work of Dr. Kaner.

Reliability and Validity

Meaningful measures have what we call reliability and validity. A measure is reliable to the extent that it can be performed many times by different people, and each person will get the same result. There are three types of validity; I am concerned with construct validity right now. Construct validity is the extent to which a measure (example: 250 downstream bugs reported over a period of one week) actually corresponds to a theory or idea like quality.

Any kind of bug count as a measure of quality, including downstream bugs found per increment of time, has problems of both reliability and validity.

Reliability Example

Let's say that I release a software product to the market, customers use it for a period of two weeks, and over those two weeks 45 bugs are logged into a tracking system by the people using the product. Members of the development staff review those bugs over a few days and categorize 6 of them as feature requests, 3 as unable to reproduce, and 4 as not a bug. The remaining 32 issues are categorized by priority (how important it is to fix them) and severity (how bad the failure is).
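To make that concrete, here is a minimal sketch in Python (the category names and counts are just the ones from the scenario above) showing how the "bug count" you end up reporting depends on which triage categories get excluded:

```python
# Illustrative only: the 45 reports from the scenario above, grouped by the
# categories the development staff assigned during review.
reports_by_category = {
    "bug": 32,
    "feature_request": 6,
    "unable_to_reproduce": 3,
    "not_a_bug": 4,
}

total_reported = sum(reports_by_category.values())  # what customers experienced: 45
counted_as_bugs = reports_by_category["bug"]        # what the team will report: 32

print(f"Customer view: {total_reported} problems over two weeks")
print(f"Team view after triage: {counted_as_bugs} bugs over two weeks")

# Every triage decision moves the final number. Different reviewers can produce
# different counts from the same two weeks of customer experience, which is the
# reliability problem in miniature.
```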

The varying reports in this scenario point to unreliability. The customer(s) feel that they experienced 45 different problems while using the product. The development group, after reviewing those issues, feels that only 32 of those issues are bugs. Which party is right? There are also a few hidden questions to consider here:

  1. Did the customer experience bugs and not report them?
  2. Did every bug reported get documented and tracked? Maybe some got lost in email threads.
  3. If a different customer found more bugs over the same period of time, does that mean product quality is worse?

Validity Example

Assume we have a product that is being used by 10 people over a period of two weeks. During that time, those 10 people report 20 bugs. In another experiment, that same product is used by 1,000 people for two weeks, and that group reports 200 bugs. If the first group that used the product was happy and wanted to continue to be paying customers, despite the fact that group 2 found 200 bugs, does the product have good quality or bad quality?
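Here is the same arithmetic as a minimal sketch in Python, using only the numbers from the example above; raw counts and per-user rates tell opposite stories, and neither answers the question of whether anyone would keep paying:

```python
# Illustrative only: the two trial groups from the validity example above.
groups = {
    "group_1": {"users": 10, "bugs_reported": 20},
    "group_2": {"users": 1000, "bugs_reported": 200},
}

for name, g in groups.items():
    rate = g["bugs_reported"] / g["users"]  # bugs per user over two weeks
    print(f"{name}: {g['bugs_reported']} bugs total, {rate:.1f} bugs per user")

# Raw counts make group_2's release look ten times worse (200 vs 20 bugs).
# Per-user rates make group_1's release look ten times worse (2.0 vs 0.2).
# Neither number says whether either group was happy enough to keep paying,
# which is what the word "quality" is supposed to capture.
```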

Quality and Bugs Are Social

This points to the idea that quality is a social construct; it is a judgement people make based on their personal value system. A product can be valuable for one person and terrible for another at the same time. The concept of a bug is also social, and there is no standard definition for what it means. If you have ever taken part in a bug triage meeting where reported issues are getting routinely re-categorized as features, or as working as designed, then you have experienced this first hand. Bugs are also not equal things. A server crash is not the same as a form failing to submit because of a special character, and neither is the same as a performance problem. Counting bugs and converting that number into quality usually pretends that bugs are consistent things. If they were, we wouldn't need descriptors like severity or priority to help the business decide what to fix and what to leave be.

Numbers are intoxicating. I can look at them and they seem to tell a clear, simple story. There isn't one story there though; there are several. And when I look at data, I am creating a story around what I see. That story may or may not represent a reality for the people using the product. I think it was Edwin Boring who said measurements are used to construct a reality for people who were not there to observe it. Whose reality are you constructing?

Inquiry or Control

I get that managers, directors, and C-level people have a need to understand how a project is going. They cannot be there to understand the reality of building software, because they are busy tending to other aspects of the business. Sometimes measurement is how they get that feeling. I think using measurement as a place to begin asking questions is healthy. If a team I am working on puts a new product version into production for two weeks and gets 5 bug reports, and then two weeks later releases another new version and gets 100 bug reports, the right thing to do there is ask, "What's going on here?" The change is probably a signal that something is happening with the development team, or something is happening with the customer, or maybe both. But if we don't start with a question, we'll never know.

So, back to the moral of the story: it is not possible to measure a 250% improvement in quality based on a reduction in downstream bugs. It probably isn't possible to measure a 250% quality improvement, period. If there is some way, I suspect it would be a very complicated (and nuanced) anthropology experiment that would be too expensive for any company to bother performing.

Here are some of the books I have read (incompletely) that shaped the way I think about measurement.

[Image: measurement books]

I’m speaking at CAST2014

There it is: I'll be speaking at CAST2014 in NYC this year. I've mentioned on Twitter that I was accepted to speak, but I haven't actually written about it yet.


Most simplistic measures for software productivity and quality fail, for reasons you don't need a conference talk to explain. The problem is how to do better than that – how to "plus one" software measurement, or at least to choose measures and frame them in a way that will do more good than harm. Studying a little social science, specifically how social scientists do qualitative research and handle measurement problems, can help. Justin will talk about the development of qualitative research as a field of study, common problems with measurement in the software world, and some ideas from Lean. You will take back some tools to help you tell a more meaningful story to your business.

I'm in the session group before the last keynote on the last day. This feels a little ominous. If you include the tutorials and TestRetreat before that, CAST is an intense five-day marathon of deep discussions on testing and, for me, a bit of introspection. I will have to be sure to reserve some energy for the talk, especially since this is my first talk at a real conference, and since it will be broadcast live over YouTube. I have given talks for the local testers group and facilitated events and whatnot, but for me this is the big time. CAST is the place.

The theme of this year's conference is the art and science of software testing. My talk is themed around measurement: mainly how it has traditionally been used in our craft, some of how it is used in the social sciences, and a bit on how we can make measurement a useful thing for software delivery and for delivering value to the folks that pay for it. Measurement is a difficult problem, but I feel like talking about problems without offering alternative ideas to explore, or solid solutions, is not all that helpful. I'm hoping we can leave the room with some ideas on making the tester's life a little bit better.

Hope to see you there!