Tom Petrocelli's take on technology. Tom is the author of the book "Data Protection and Information Lifecycle Management" and a natural technology curmudgeon. This blog represents only my own views and not those of my employer, Enterprise Strategy Group. Frankly, mine are more amusing.

Monday, January 11, 2010

What's On My Mind

A little career downtime can sometimes be a good thing. Whether your sabbatical is planned or, as in my case, unexpected, it represents a rare opportunity to delve into new areas of interest. Academics and clergy do this regularly as a way of expanding their skills, working on projects that they can never get to, or simply as a way of recharging their psychological batteries. This is not an extended vacation or time to simply relax. Career downtime has to be used to expand your horizons.

Consistent with my beliefs about downtime, I've been using my current “sabbatical” to embark on areas of discovery that I previously hadn't time for. As my Twitter followers know, some of my time was used to explore the biotech industry. On the more geeky side I've been going back to my software roots to look at things that have fascinated me for a long time. Namely, how to manage unusual or difficult data stores in different ways.

More precisely I've been looking into:

  • Managing large unstructured data sets, a constant problem in certain industries;

  • Applications driven by relationships between entities more than their structure; and

  • How to make smaller applications by embedding data management into them.

This journey of discovery has led me to a number of software technologies that I find very interesting. Let me share with you what's on my mind these days.

Managing Metadata

One of the major bugaboos of the last decade or so has been dealing with the explosion of unstructured data. The simple solution is to wrap metadata around the real data. This descriptive information adds the machine-manipulable context that unstructured information lacks. However, managing the metadata itself has become a big problem. Individual metadata is generally small and many data management tools, such as relational databases, are overkill. Implementing a full-blown SQL database to manage metadata is like hunting deer with a tank. Expensive and more than you need to get the job done.

XML has helped but managing XML text files is difficult when there are a lot of them. You might need to open and close lots of small files constantly, straining a lot of file systems. The other option is to process one giant XML file, which can be processor intensive and slow. Worse yet, these are decisions you have to live with and are hard to change once an application is underway. XML is not ideal for dealing with relationships between entities either.

Social Media Does Change Everything

Okay, I don't believe social media changes everything. What it does do is address the fact that humans are social creatures. We view the world as a series of relationships. Relationships between ourselves and the world around us, between each other, and between everything that makes up the world. The natural schema for a human is relationship based.

Computers don't always reflect that. They tend to be concerned with structures more than the quality of a relationship. This is one of the problems with SQL. It is damn hard to code reciprocal relationships, the strength of a relationship, and the ways entities may interact. A good DBA can tell you how to do it but it gets complicated quickly, especially when modeling real human type relationships.

Small Applications That Can Grow

Enterprise applications have gotten huge. Worse yet, they require boatloads of infrastructure. This is why enterprise application developers always talk in stacks. LAMP stack, WAMP stack, .Net stack. Developers can declare that their apps aren't really as big as they are because they assume that a stack is in place.

There are a lot of negatives to relying on these stacks. For one, you are at the mercy of whoever is designing the pieces of the stack. Applications also use different versions of the programs that make up these stacks, leading to compatibility problems. Not to mention finding an application that you love on a stack you don't support.

The biggest problem with stack-based enterprise applications is that they are not compact and simple. They don't port to small systems and devices easily. Try implementing a single user version of most enterprise apps. Who is going to install and maintain an Apache web server for one or two people? Cloud businesses like this since they provide an alternative but not all applications lend themselves to the cloud.

What Am I Looking At?

In thinking about these issues, I have come across technologies that address some or all of these problems. I especially like embedded data management tools, in particular Derby, SQLite, Lucene, and Neo. Derby and SQLite are, respectively, open source and public domain RDBMSs that allow developers to embed a SQL database in an application. Derby has a server version as well, allowing applications to be small and compact or to scale up to large enterprise size. Derby is from the Apache Software Foundation and is Java-based. This allows it to integrate nicely with Java applications and object mapping frameworks like Hibernate. SQLite is written in C, making it excellent for embedded applications, and is extensively used by the Mozilla Foundation. Being a Java geek, I'm planning on spending more time with Derby.
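To make the idea of an embedded SQL database concrete, here is a minimal sketch of a metadata store that lives inside the application. It uses Python's built-in sqlite3 module rather than Derby, purely because it runs with no extra installation. The table layout and the function names (open_metadata_store, set_metadata, get_metadata) are my own assumptions for illustration, not anything prescribed by SQLite.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Open (or create) an embedded metadata database.

    No server process is involved: the database is a single file,
    or lives purely in memory when path is ":memory:".
    """
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per (document, key) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        " doc_id TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (doc_id, key))"
    )
    return conn


def set_metadata(conn, doc_id, key, value):
    """Insert or overwrite one metadata value for a document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO metadata (doc_id, key, value)"
            " VALUES (?, ?, ?)",
            (doc_id, key, value),
        )


def get_metadata(conn, doc_id):
    """Return all metadata for a document as a plain dict."""
    rows = conn.execute(
        "SELECT key, value FROM metadata WHERE doc_id = ?", (doc_id,)
    )
    return dict(rows.fetchall())


store = open_metadata_store()
set_metadata(store, "report.pdf", "author", "Tom")
set_metadata(store, "report.pdf", "type", "whitepaper")
print(get_metadata(store, "report.pdf"))
```

Passing a file name instead of ":memory:" would persist the same store to disk, which is the whole appeal: the application carries its data management with it instead of assuming a stack.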

Another Apache project, Lucene, embeds a search engine in an application. With Lucene, a developer is able to manage large amounts of unstructured text using methods familiar to everyone. Lucene also works well with other types of data management tools to add search functionality to all kinds of data.
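For a rough feel of what an embedded search engine does, here is a toy inverted index. To be clear, this is not Lucene's API and leaves out nearly everything Lucene actually provides (ranking, analyzers, on-disk segments). The class name TinyIndex and its methods are hypothetical, and the sketch only shows the basic idea of mapping terms to the documents that contain them.

```python
import re
from collections import defaultdict


class TinyIndex:
    """A toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index a piece of unstructured text under the given id."""
        for term in self._tokens(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = self._tokens(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self._postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result


index = TinyIndex()
index.add("memo-1", "Quarterly storage budget review")
index.add("memo-2", "Storage array firmware notes")
print(index.search("storage budget"))
```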

One of the other technologies that Lucene works well with is Neo. Neo is a graphing or network database (there is debate as to what the difference is between the two). Graphing databases view data a bit differently than an RDBMS. Data is stored as key-value pairs called properties in an interconnected network of nodes. Data is found through its relationships with other nodes. With Neo, information is stored and retrieved the way humans organize information, by its relationship to other data. This fits in well when modeling people or other entities that rely on interactions with others. Some examples are biological ontologies, proteins, and documents. At the moment I'm experimenting with Neo and Document and Content Management.
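The node-and-property model is easier to see in a few lines of code. The sketch below is a tiny in-memory graph, not Neo's API. The class TinyGraph, its method names, and the sample people and documents are all assumptions made up for illustration. It shows nodes holding key-value properties, relationships that can be reciprocal and carry their own properties (such as a strength), and lookup by following relationships rather than by joining tables.

```python
from collections import defaultdict


class TinyGraph:
    """A toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}                   # node_id -> properties dict
        self._edges = defaultdict(list)   # node_id -> [(rel_type, other_id, props)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = dict(properties)

    def relate(self, a, rel_type, b, reciprocal=False, **properties):
        """Link a to b; with reciprocal=True the link is stored both ways."""
        self._edges[a].append((rel_type, b, dict(properties)))
        if reciprocal:
            self._edges[b].append((rel_type, a, dict(properties)))

    def neighbors(self, node_id, rel_type=None):
        """Nodes directly related to node_id, optionally filtered by type."""
        return [
            other
            for rtype, other, _ in self._edges.get(node_id, [])
            if rel_type is None or rtype == rel_type
        ]

    def reachable(self, start, rel_type, max_depth):
        """Nodes reachable from start within max_depth hops of rel_type."""
        seen = {start}
        frontier = [start]
        for _ in range(max_depth):
            next_frontier = []
            for node in frontier:
                for other in self.neighbors(node, rel_type):
                    if other not in seen:
                        seen.add(other)
                        next_frontier.append(other)
            frontier = next_frontier
        seen.discard(start)
        return seen


g = TinyGraph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("carol", kind="person")
g.add_node("spec.doc", kind="document")
g.relate("alice", "KNOWS", "bob", reciprocal=True, strength=0.9)
g.relate("bob", "KNOWS", "carol", reciprocal=True, strength=0.4)
g.relate("bob", "AUTHORED", "spec.doc")
print(g.reachable("alice", "KNOWS", max_depth=2))
```

The "friends of friends" question is a short traversal here, whereas in SQL it typically means self-joins on a relationship table, one more for every extra hop.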

While there are a lot of things that stink about career downtime, if used effectively it can be a transformative experience. Discovery of this type almost always leads to something good. If nothing else, it helps us to grow as professionals and people. It's also better than sitting around watching television.
