Monday, February 26, 2007

Mark Twain Was Right

Everyone knows that old saw by Mark Twain that goes, “There are three kinds of lies: lies, damned lies and statistics.” Truer words were never spoken. It's not so much that people who use statistics are trying to lie. Instead, in many cases statistics are used inappropriately and lend too much gravitas and certainty to an unknown situation. Unlike scientists, business people rarely make inquiries as to research methods and models. If there are numbers in front of us, it's much easier to believe whatever point someone is trying to make. We tend not to question the statistics in the same way we would financial information.

So why the rant? The trigger was a rather good article I read in InfoStor. The article examined the shift from older storage protocols such as parallel SCSI and ATA to more modern ones such as SAS and SATA. I absolutely agree with the premise that we are shifting away from parallel protocols - a vestige of the ancient world in computing terms - to serial ones. I also agree that it is happening rapidly. So far, I'm completely on board.

What caused me to get seasick was not the article per se but the graphics that accompanied it. They had two sets of pie charts, showing changes in the market share of the different storage protocols from 2006 through 2009. What was jarring was how different they were.

Now, I can understand perfectly how the 2008 and 2009 numbers could be very different. Different forecast models will yield different results. That is why it is important to understand the model used in quantitative research before using the information to make a decision. A forecast is a prediction that looks more precise because it uses numbers instead of words. If I said that by 2009 SAS and SATA would comprise the majority of the storage protocol market, it would be easy to discount it as my opinion, no matter how informed that opinion was. If I told you that “Research shows that SAS will encompass 45% of the market and the next largest share will be SATA at 30%” it would sound much more believable. They are both predictions, but the latter sounds more credible because I used some numbers. The assumptions underlying those numbers are rarely questioned. If you don't understand the model, then you have no way of knowing whether the numbers are just a guess or how good a guess they are.

What caught my eye was the difference in the 2006 numbers. For example, the parallel SCSI share percentages differed by roughly 6 percentage points (43% vs. 49%). This is a substantial gap, especially when it's not a forecast anymore. Now, I know how this works. How you count impacts the numbers. If you tally up the number of chips sold versus the number of ports sold, you will get different amounts. The former actually includes some future “ports” or other uses. One may overcount and the other undercount. It's also the case that these numbers are still part forecast. The last quarter for many companies has yet to be reported. Again, this is why understanding how these numbers are arrived at is tremendously important for decision makers.
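To make the counting point concrete, here is a minimal sketch in Python. Every figure in it is invented for illustration and comes from neither the InfoStor charts nor any research firm. The names `market_share`, `ports_shipped` and `chips_sold` are hypothetical too, and so is the assumption that chip sales for the newer protocols run ahead of shipped ports.

```python
# All unit counts below are hypothetical, made up to illustrate the point.
def market_share(counts):
    """Return each protocol's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# The same imaginary market in the same year, tallied two different ways.
ports_shipped = {"parallel SCSI": 490, "SAS": 210, "SATA": 300}
# Assumption: chip sales include parts destined for future ports,
# which inflates the counts for the newer serial protocols.
chips_sold = {"parallel SCSI": 500, "SAS": 330, "SATA": 370}

by_ports = market_share(ports_shipped)
by_chips = market_share(chips_sold)

for name in ports_shipped:
    print(f"{name}: {by_ports[name]}% by ports, {by_chips[name]}% by chips")
```

In this made-up case parallel SCSI holds 49.0% of the market counted by ports and 41.7% counted by chips. Neither figure is wrong. They simply answer different questions, which is why the counting method needs to be stated alongside the chart.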

In case someone may be thinking that this is a case of professional envy, put that out of your mind. I purposely don't do quantitative research for this very reason. People should accept my opinions for what they are: informed and insightful analysis, but not certainty. Too many people accept relatively simple statistics as a sure thing, as truth. They're not, and they should be examined with a critical eye. Besides, the reported statistics are unusually simple, and we tend to think that's enough. Basing any conclusion on a frequency distribution alone would be laughable in modern social science research.

I have nothing against statistics or the analysts who use them. In most cases, I have great respect for their insights. That doesn't mean that their methods should not be scrutinized by a consumer. As these charts show, not everything is what it seems and nothing should be taken at face value.

