Tom Petrocelli's take on technology. Tom is the author of the book "Data Protection and Information Lifecycle Management" and a natural technology curmudgeon. This blog represents only my own views and not those of my employer, Enterprise Strategy Group. Frankly, mine are more amusing.

Monday, February 26, 2007

Mark Twain Was Right

Everyone knows that old saw attributed to Mark Twain: "There are three kinds of lies: lies, damned lies, and statistics." Truer words were never spoken. It's not so much that people who use statistics are trying to lie. Instead, in many cases statistics are used inappropriately and lend too much gravitas and certainty to an unknown situation. Unlike scientists, business people rarely make inquiries into research methods and models. If there are numbers in front of us, it's much easier to believe whatever point someone is trying to make. We tend not to question statistics the same way we would question financial information.

So why the rant? The trigger was a rather good article I read in InfoStor. The article examined the shift from older storage protocols such as parallel SCSI and ATA to more modern ones such as SAS and SATA. I absolutely agree with the premise that we are shifting away from parallel protocols - a vestige of the ancient world in computing terms - to serial ones. I also agree that it is happening rapidly. So far, I'm completely on board.

What caused me to get seasick was not the article per se but the graphics that accompanied it. There were two sets of pie charts showing changes in the market share of the different storage protocols from 2006 through 2009. What was jarring was how different the two sets were.

Now, I can understand perfectly how the 2008 and 2009 numbers could be very different. Different forecast models will yield different results, which is why it is important to understand the model used in quantitative research before using the information to make a decision. A forecast only looks more precise because it uses numbers instead of words. If I said that by 2009 SAS and SATA would comprise the majority of the storage protocol market, it would be easy to discount it as my opinion, no matter how informed that opinion was. If I told you that “research shows that SAS will encompass 45% of the market and the next largest share will be SATA at 30%,” it would sound much more believable. Both are predictions, but the latter sounds more credible because I used some numbers. The assumptions underlying those numbers are rarely questioned. If you don't understand the model, you have no way of knowing whether the numbers are just a guess, or how good a guess they are.
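To make that concrete, here's a toy sketch in Python. Every figure is invented for illustration (none of it comes from the InfoStor article); the point is only that two defensible models, fed the same base year, hand you two different "facts" about 2009:

    # Two forecast models, same starting point, different answers.
    # All figures are invented for illustration.
    base_share_2006 = 0.20  # hypothetical SAS share in 2006

    # Model A: linear growth, +5 percentage points per year
    linear_2009 = base_share_2006 + 3 * 0.05

    # Model B: compound growth, 35% per year
    compound_2009 = base_share_2006 * 1.35 ** 3

    print(f"Model A's 2009 share: {linear_2009:.0%}")    # 35%
    print(f"Model B's 2009 share: {compound_2009:.0%}")  # 49%

Neither printout is more "true" than the other. The entire difference lives in the growth assumption, which is exactly the part nobody asks about.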

What caught my eye was the difference in the 2006 numbers. For example, the parallel SCSI share percentages differed by roughly six percentage points (43% vs. 49%). This is a substantial gap, especially for a year that is no longer a forecast. Now, I know how this works. How you count impacts the numbers. If you tally up the number of chips sold versus the number of ports sold, you will get different totals. The former actually includes some future “ports” or other uses. One may overcount and the other undercount. It's also the case that these numbers are still part forecast; the last quarter for many companies has yet to be reported. Again, this is why understanding how these numbers are arrived at is tremendously important for decision makers.
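Here is the same counting problem as a sketch, with unit counts invented purely to reproduce a 43% versus 49% style gap:

    # How the counting basis changes "market share". Counts are made up.
    units = {
        "parallel SCSI": {"chips": 430, "ports": 245},
        "SAS":           {"chips": 270, "ports": 90},
        "SATA":          {"chips": 300, "ports": 165},
    }

    for basis in ("chips", "ports"):
        total = sum(counts[basis] for counts in units.values())
        scsi_share = units["parallel SCSI"][basis] / total
        print(f"Counting {basis}: parallel SCSI holds {scsi_share:.0%}")
    # Counting chips: parallel SCSI holds 43%
    # Counting ports: parallel SCSI holds 49%

Same market, same shipments, two different numbers, depending entirely on what a "unit" is.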

In case anyone is thinking that this is a case of professional envy, put that out of your mind. I purposefully don't do quantitative research for this very reason. People should accept my opinions for what they are: informed and insightful analysis, but not certainty. Too many people accept relatively simple statistics as a sure thing, as truth. They're not, and they should be examined with a critical eye. Besides, the reported statistics are usually quite simple, and we tend to think that's enough. Basing any conclusion on a frequency distribution alone would be laughable in modern social science research.

I have nothing against statistics or the analysts who use them. In most cases, I have great respect for their insights. That doesn't mean their methods should not be scrutinized by the consumer. As these charts show, not everything is what it seems, and nothing should be taken at face value.

Friday, February 23, 2007

Back To The Future with Google Apps

I am generally a proponent of Software as a Service (SaaS). To me, the advantages are clear. With central control, you eliminate the costs associated with updating and distributing client applications. It's also much easier to share information that resides in a central repository. Even client-server applications, where data is stored and managed in a central database, have the problem of client distribution. Strip away the GUI, make it a web page, and you achieve real cost savings. SaaS makes the most sense for enterprise applications such as MRP, CRM, and other three-letter-acronym systems. It also works well for social networking. What I don't get is SaaS for office productivity applications, such as word processors and spreadsheets. I'm not even convinced of web-based e-mail, except as an adjunct to a client-server environment or as a low-end, free consumer product.

Google has been rolling out free versions of its word processor and spreadsheet products for months. It hopes to entice corporations, big and small, to use this service instead of buying standalone productivity applications such as Microsoft Office. Google, like Yahoo and Microsoft, has been selling premium on-line e-mail packages for quite some time and thinks office applications dovetail nicely into this business. The e-mail services have been popular with individuals and small businesses because of their low or no cost. The fact that they lack the features of Thunderbird or Outlook matters very little to the technophobic, low-usage public that doesn't want to install and, more importantly, configure an e-mail application. Installation and configuration, however, are not a problem for a business of more than a few people.

What Google seems to be recreating is the ancient IBM PROFS suite. PROFS was developed by IBM in the 1970s and deployed on mainframes throughout the US government. It had an integrated e-mail and calendaring system and was often used for word processing as well. PROFS got killed by client-server e-mail systems in the 1980s, though it lingered on under the name OfficeVision for quite some time until it was finally replaced by Lotus Notes and Domino.

PROFS became irrelevant for the same reasons that Google Docs and Spreadsheets are a tough sell for me. They can't create the kind of user experience you need in a word processor or offer the range of features desired in a spreadsheet. Even with all the new techniques for enhancing the web user experience, these applications are slow, quirky, and lacking in features when compared to established office suites. Even more importantly, you need to be tethered to a high-speed network connection. That eliminates the ability to work on an airplane, on a secluded beach, or anywhere else you can't get a broadband connection - a secure and reliable broadband connection at that. Add to that all the normal problems of network applications, such as network congestion and overloaded servers - again, the problems of mainframe applications - and you have to wonder why we seem to be going backwards in time.

There are also some special problems associated with SaaS for office applications. For starters, you have to feel comfortable having your intellectual property and trade secrets housed off-site by a different company. That concern pretty much killed the storage service provider market five years back; it scared the pants off corporate security analysts to have someone else control many classes of corporate data. Will spreadsheets somehow be okay? What about privacy? Will there be problems if I write a performance review or termination letter in Google Docs?

A key argument in the sales pitch for office SaaS is that the cost of applications such as Microsoft Office is high. True enough. The latest version of Office is very expensive not only to buy but to deploy. You have to believe that the benefits of the revamped Office interface will pay off enough to overcome the costs of retraining and supporting confused workers.

There are alternatives that don't involve a feature-poor service. You can avoid the high purchase costs of upgrades, or even new deployments, by using open source applications such as Thunderbird for e-mail and the OpenOffice.org productivity suite. There are a number of lower-cost commercial suites as well, such as the storied WordPerfect. All of these are perfectly good for the modern office. Besides, no one says you have to update your word processor just because Microsoft has a new version. With on-line applications, you may not get that choice.

Ultimately, SaaS makes sense when there is a natural advantage to being connected. In that case, you are going to be networked anyway. Intranet portals, order entry, customer management, workflow management, and group calendaring all make sense as services. To be useful, they have to have access to a central data repository, so you might as well make the whole system centralized. Productivity applications, by contrast, are designed around individual work and need to be available when there is no network.

Google thinks it can take us back to the days of PROFS. Nice idea, but I doubt it will work. Centralized mainframe office applications died for a reason, and those reasons haven't changed. While Google may get a bunch of consumers to use these applications and sell ads around them (nothing wrong with that), it won't get a sufficient number of corporate clients to make the service viable as a Word or Excel killer. If I were Microsoft, I would be more worried about the open source community than Google's on-line apps.

Wednesday, February 21, 2007

Managing Metadata

Everywhere I turn, I hear more about metadata. It seems that everyone is jumping on the metadata bandwagon. For those of you who have never heard the term, it is data about data. More precisely, metadata describes data, providing the context that allows us to make it useful.

Bringing Structure to Chaos

Organizing data is something we do in our heads but that computers are pretty poor at. It is natural for a human being to develop schema and categories for the endless streams of data that invade our consciousness every moment that we are awake. We can file things away for later use, delete them as unimportant, or connect them with other data to form relationships. It is an innate human capability.

Not so with computers. They are literal in nature and driven by the commands that we humans give them. No matter how smart we think computers are, compared to us organics, they are as dumb as rocks.

Metadata is an attempt to give computers a brain boost. By describing data, we are able to automate the categorization and presentation of data in order to make it more meaningful. In other words, we can build schema out of unstructured data. Databases do this by imposing a rigid structure on the data. This works fine for data that is naturally organized into neat little arrangements. For sloppy situations, say 90% of the data in our lives, databases are not so useful.
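As a minimal sketch of the idea (the field names here are illustrative, not any particular metadata standard), this is what a few lines of metadata attached to an otherwise unstructured document might look like:

    # Unstructured content plus a few descriptive fields a program can act on.
    # Field names are illustrative, not any particular metadata standard.
    document = {
        "content": "Q3 budget review notes... (free-form text)",
        "metadata": {
            "title": "Q3 Budget Review",
            "author": "J. Smith",
            "created": "2007-02-21",
            "tags": ["finance", "budget", "internal"],
        },
    }

    # With even this much context, a dumb-as-rocks computer can categorize:
    if "finance" in document["metadata"]["tags"]:
        print("File under: finance")

The content itself stays a sloppy blob; the metadata is the neat little arrangement a computer can actually work with.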

Metadata Is All Around Us

We are already swimming in metadata. All those music files clogging up our hard drives have important metadata associated with them. That's why your iPod can display the name, artist and other important information when you play a song and iTunes can build playlists automatically. Your digital camera places metadata into all of those pictures of your kids. Because of metadata, you can attach titles and other information to them and have them be available to all kinds of software. Internet services use metadata extensively to provide those cool tag clouds, relevant search responses, and social networking links.
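For instance, here is a rough sketch of how a player could build playlists automatically from song metadata, in the spirit of what iTunes does (the tracks and fields are invented):

    # Group tracks into playlists by their genre metadata.
    from collections import defaultdict

    tracks = [
        {"title": "Track A", "artist": "Artist One", "genre": "Jazz"},
        {"title": "Track B", "artist": "Artist Two", "genre": "Rock"},
        {"title": "Track C", "artist": "Artist One", "genre": "Jazz"},
    ]

    playlists = defaultdict(list)
    for track in tracks:
        playlists[track["genre"]].append(track["title"])

    for genre, titles in playlists.items():
        print(genre, "->", titles)  # Jazz -> ['Track A', 'Track C'] ...

No one typed in a Jazz playlist; it falls out of the metadata for free.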

Businesses have a keen need for metadata. With so many word processor documents, presentations, graphics, and spreadsheets strewn about corporate servers, there needs to be a good way to organize and manage them. Information Lifecycle Management assumes the ability to generate and use metadata. Advanced backup and recovery also uses metadata. Companies are trying to make sense out of the vast stores of unstructured data in their clutches. Whether it's to help find, manage, or protect data, organizations are increasingly turning to metadata approaches to do so.

Dragged Down By The Boat

So, we turn to metadata to keep us from drowning in data. Unfortunately, we are starting to find ourselves drowning in metadata too. A lot of metadata is unmanaged, and managing metadata sounds a lot like watching the watchers. If we don't start to do a better job of managing metadata, we are going to find out an ugly truth about it - it can quickly become meaningless. Just check out the tag clouds on on-line services such as Technorati or Flickr. They are so huge that they're practically useless. I'm a big fan of tag clouds when they are done right. The ability to associate well-thought-out words and phrases with a piece of data makes it much easier to find what you want and attach meaning to whatever the data represents.

The important phrase here is “well thought out”. A lot of metadata is impulsive. Like a three-year-old with a tendency to say whatever silly thought comes into their head, a lot of tags are meaningless and transient. Whereas the purpose of metadata is to impart some extended meaning to the data, a lot of metadata does the opposite. It creates a confused jumble of words that shines no light on the meaning of the data.

The solution is to start managing the metadata. That means (and I know this is heresy that I speak) rules. Rules about what words can be used in what circumstances. Rules about the number of tags associated with any piece of data. Rules about the rules, basically. It makes my stomach hurt, but it is necessary.
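What might that look like in practice? One simple approach - the vocabulary and limit below are invented placeholders; a real policy would come from your own guidelines - is a controlled vocabulary plus a cap on tags per item:

    # Enforce tag rules: a controlled vocabulary and a per-item tag limit.
    # The vocabulary and limit are invented placeholders.
    ALLOWED_TAGS = {"finance", "hr", "engineering", "draft", "final", "confidential"}
    MAX_TAGS = 5

    def validate_tags(tags):
        """Return a list of rule violations; an empty list means the tags pass."""
        problems = []
        if len(tags) > MAX_TAGS:
            problems.append(f"too many tags ({len(tags)} > {MAX_TAGS})")
        for tag in tags:
            if tag.lower() not in ALLOWED_TAGS:
                problems.append(f"'{tag}' is not in the controlled vocabulary")
        return problems

    print(validate_tags(["finance", "draft"]))          # [] - passes
    print(validate_tags(["finance", "stuff", "misc"]))  # two violations

Dull and bureaucratic, which is exactly the point: rules only help if something actually enforces them.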

I don't expect this discipline from Internet services. It runs counter to the “one happy and equal family” attitude that draws people to these services. For companies, though, it is necessary as they implement metadata-driven solutions. Unfortunately, it means guidelines (tagged as “information”, “guidelines”, “metadata”, or something) and some person with a horrible, bureaucratic personality to enforce them. Think of it as a necessary evil, like lawmen in the old West.

For the most part, companies will probably start to manage metadata when it is already too late, when they are already drowning in the stuff. There is still an opportunity to avoid that scenario, though. Metadata-based management of unstructured data is still pretty new. Set up the rules and guidelines now. Enforce the rules, and review the tags and how they are used regularly. Eventually, there will be metadata analysis software to assist you, but in the meantime, put the framework in place. The opportunity is there to do it right from the beginning and avoid making a big mistake. Use metadata to create value from your data rather than to confuse things further.