My alternative title for this post was “Money for Nothing,” which is along the same lines. I have been engaged in discussions regarding Big Data, which has become a bit of a buzz phrase of late in both business and government. Under the current drive to maximize the value of existing data, every data source, stream, lake, and repository (and the list goes on) has been subsumed by this concept. So, at the risk of being a killjoy, let me point out that not all large collections of data are “Big Data.” Furthermore, once a category of data gets tagged as Big Data, the more one seems to depart from the world of reality in determining how to approach and use it. So for those of you who find yourselves in this situation, let’s take a collective deep breath and engage our critical thinking skills.
So what exactly is Big Data? Quite simply, as Gil Press notes in this article in Forbes, the term is a relative one; a McKinsey study defines it as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” This subjective definition is a purposeful one, since Moore’s Law tends to change what is viewed as merely digital data as opposed to big data. I would add some characteristics to assist in defining the term based on present challenges. Big data at first approach tends to be unstructured, variable in format, and does not adhere to a schema. Thus, not only is size a criterion for the definition, but also the chaotic nature of the data that makes it hard to approach. For once we find a standard means of normalizing, rationalizing, or converting digital data, it is no longer beyond the ability of standard database tools to use effectively. Furthermore, the very process of taming it renders it non-big data, or, if an exceedingly large dataset, perhaps “small big data.”
Thus, having defined our terms and the attributes of the challenge we are engaging, we can now eliminate many of the suppositions that are floating around in organizations. For example, there is a meme I have come upon that asserts that disparate application file data can simply be broken down into its elements and placed into database tables for easy access by analytical solutions to derive useful metrics. This is true in some ways, but both wrong and dangerous in its apparent simplicity, for there are many steps missing in this process.
Let’s take the least complex case: structured data submitted as proprietary files. On its surface this is an easy challenge to solve. Once someone begins breaking the data into its constituent parts, however, greater complexity is found, since the indexing inherent to data interrelationships and structures is necessary for its effective use. Furthermore, there will be corruption and non-standard use of user-defined and custom fields, especially in data that has not undergone domain scrutiny. The originating third-party software is pre-wired to extract this data properly. Setting aside the burden of using and learning multiple proprietary applications, with their concomitant idiosyncrasies, sustainability issues, and overhead, such a multivariate approach defeats the goal of establishing a data repository in the first place: it keeps the data in silos and prevents integration. The indexing across, say, financial systems and planning systems is different. So how do we solve this issue?
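To make the silo problem concrete, here is a minimal sketch of the kind of mapping layer that problem forces on you. The system names and field mappings below are invented for illustration: two hypothetical systems export the “same” cost data under different field names and key structures, and nothing can be integrated until someone defines how each maps to a common schema.

```python
# Hypothetical field mappings: native field name -> common schema name.
# These are assumptions for illustration, not a real standard.
FIELD_MAPS = {
    "finance_sys": {"acct_no": "account", "amt": "amount", "per": "period"},
    "planning_sys": {"wbs_element": "account", "planned_cost": "amount",
                     "fiscal_month": "period"},
}

def normalize(record: dict, source: str) -> dict:
    """Translate a source-specific record into the common schema."""
    mapping = FIELD_MAPS[source]
    return {common: record[native] for native, common in mapping.items()}

rows = [
    normalize({"acct_no": "1010", "amt": 2500.0, "per": "2015-06"},
              "finance_sys"),
    normalize({"wbs_element": "1.2.3", "planned_cost": 1800.0,
               "fiscal_month": "2015-06"}, "planning_sys"),
]
```

Note that the mapping table itself encodes domain knowledge; someone who understands both systems still has to write it, which is exactly the hidden labor the meme glosses over.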
In approaching big data, or small big data, or datasets from disparate sources, the core concept in realizing return on investment and finding new insights, is known as Knowledge Discovery in Databases or KDD. This was all the rage about 20 years ago, but its tenets are solid and proven and have evolved with advances in technology. Back then, the means of extracting KDD from existing databases was the use of data mining.
The necessary first step in the data mining approach is pre-processing of the data. That is, once you get the data into tables it is all flat. Every piece of data is the same–it is all noise. We must add significance and structure to that data. Keep in mind that we live in this universe, so there is a cost to every effort, known as entropy. Computing is as close as you’ll get to defeating entropy, but only because it has shifted the burden somewhere else. For large datasets that burden is pushed to pre-processing, either manual or automated. In the brute-force world of data mining, we hire data scientists to pre-process the data, find commonalities, and index it. So let’s review this “automated” process: we take a lot of data and then add a labor-intensive manual effort to it in order to derive knowledge. Hmmm. There may be ROI there, or there may not be.
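A small sketch of what the automated side of pre-processing looks like, under invented heuristics: flat string values get coerced into richer types, and an index is built over a chosen key field so later queries carry significance. The field names and type-guessing rules are assumptions for illustration only.

```python
from collections import defaultdict

def coerce(value: str):
    """Guess a richer type for a flat string value (illustrative heuristic)."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave as text

def preprocess(flat_rows, key_field):
    """Type every value and build an index on key_field."""
    index = defaultdict(list)
    typed = []
    for row in flat_rows:
        typed_row = {k: coerce(v) for k, v in row.items()}
        typed.append(typed_row)
        index[typed_row[key_field]].append(typed_row)
    return typed, index

rows = [{"task": "A", "hours": "40"}, {"task": "A", "hours": "12.5"},
        {"task": "B", "hours": "8"}]
typed, by_task = preprocess(rows, "task")
```

Even this toy version shows where the human labor hides: deciding *which* field is the key, and whether “12.5” is hours or a version number, is the judgment work the data scientists are hired to do.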
But twenty years is a long time and we do have alternatives, especially in using Fourth Generation software that is focused on data usage without the limitations of hard-coded “tools.” These alternatives apply when using data in existing databases, even disparate databases, or file data structured under a schema with well-defined data exchange instructions that allow for a consistent manner of posting that data to database tables. The approach in this case is to use APIs. The API, like OLE DB or the older ODBC, can be used to read and leverage the relative indexing of the data. It will still require some code to point it in the right place and “tell” the solution how to use and structure the data, and its interrelationship to everything else. But at least we have a means of reducing the cost associated with pre-processing. Note that we are, in effect, still pre-processing data. We just let the CPU do the grunt work for us, oftentimes very quickly, while giving us control over the decision of relative significance.
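As a sketch of letting the CPU do the grunt work, the snippet below uses Python’s built-in sqlite3 module to stand in for an ODBC/OLE DB-style driver; the table and its contents are invented for illustration. The point is that the API exposes the schema itself, so code, rather than an analyst, discovers the structure and turns rows into usable records.

```python
import sqlite3

# An in-memory database stands in for an existing source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT, pct_complete REAL)")
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("1.1", 0.5), ("1.2", 1.0)])

# The driver exposes the schema: column names come from the API,
# not from a human reverse-engineering a proprietary file.
columns = [d[1] for d in conn.execute("PRAGMA table_info(tasks)")]
records = [dict(zip(columns, row))
           for row in conn.execute("SELECT * FROM tasks ORDER BY task_id")]
```

We still wrote code to “point it in the right place” (the table name, the ordering), but the structural discovery was done by the database engine, not by manual pre-processing.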
So now let’s take the meme that I described above and add greater complexity to it. You have all kinds of data coming into the stream in all kinds of formats, including specialized XML, open data, black-boxed data, and closed proprietary files. This data is unstructured. It is then processed and “dumped” into a non-relational (NoSQL) database. How do we approach this data? The answer has been to return to a hybrid of pre-processing, data mining, and the use of APIs. But note that there is no silver bullet here. These efforts are long-term and extremely labor-intensive at this point. There is no magic. I have heard time and again from decision makers the question: “Why can’t we just dump the data into a database to solve all our problems?” No, you can’t, unless you’re ready for a significant programmatic investment in data scientists, database engineers, and other IT personnel. And in the end, what they deploy, when it gets deployed, may very well be obsolete and have wasted a good deal of money.
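A minimal sketch of what that hybrid looks like in practice, with invented formats and field names: documents arrive in mixed shapes (an XML fragment and a plain dict standing in for a NoSQL document), and a per-format extractor pulls out the few common fields the analytics actually need. Every extractor here is an assumption someone had to research and write, which is why these efforts stay labor-intensive.

```python
import xml.etree.ElementTree as ET

def from_xml(doc: str) -> dict:
    """Extract common fields from a (hypothetical) XML record format."""
    root = ET.fromstring(doc)
    return {"id": root.findtext("id"), "value": float(root.findtext("value"))}

def from_document(doc: dict) -> dict:
    """Extract common fields from a (hypothetical) document-store record."""
    # Field names vary by source; this mapping is an assumption.
    return {"id": doc["record_id"], "value": float(doc["val"])}

stream = [
    ("xml", "<rec><id>42</id><value>3.5</value></rec>"),
    ("doc", {"record_id": "43", "val": "7"}),
]
EXTRACTORS = {"xml": from_xml, "doc": from_document}
rows = [EXTRACTORS[kind](payload) for kind, payload in stream]
```

Each new source format means a new extractor, written by someone who understands that format; multiply by dozens of sources and you have the “significant programmatic investment” described above.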
So, once again, what are the proper alternatives? In my experience we need to get back to first principles. Each business and industry has commonalities that transcend proprietary software limitations by virtue of the professions and disciplines that comprise them. Thus, it is domain expertise specific to the business that drives the solution. For example, in program and project management (you knew I was going to come back there) a schedule is a schedule, EVM is EVM, financial management is financial management.
Software manufacturers will, apart from issues regarding relative ease of use, scalability, flexibility, and functionality, attempt to defend their space by establishing proprietary lexicons and data structures. Not being open, while not serving the needs of customers, helps incumbents avoid disruption from new entrants. But there often comes a time when it is apparent that these proprietary definitions are only euphemisms for a well-understood concept in a discipline or profession. Cat = Feline. Dog = Canine.
For a cohesive and well-defined industry, the solution is to make all data within particular domains open. This is accomplished through the acceptance and establishment of a standard schema. For less cohesive industries, where incumbents have nevertheless created a de facto schema through the use of common principles, APIs are the way to extract this data for use in analytics. This approach has been applied on a broader basis for the incorporation of machine data and signatures in social networks. For closed or black-boxed data, the business or industry will need to perform a gap analysis in order to decide whether database access to such legacy data is truly essential to its business, or whether specifying a more open standard from “time-now” forward will eventually work the suboptimization out of the data.
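To illustrate what enforcing a standard schema buys you at the point of intake, here is a minimal sketch. The schema itself is invented for illustration (real industries would publish one, along the lines of schedule or EVM data-exchange standards); the mechanism, validate-on-intake and reject what does not conform, is the point.

```python
# Hypothetical standard schema: field name -> required Python type.
SCHEMA = {"task_id": str, "start": str, "finish": str, "pct_complete": float}

def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, ftype in SCHEMA.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"{field} is not {ftype.__name__}")
    return problems

good = {"task_id": "1.1", "start": "2015-06-01", "finish": "2015-06-30",
        "pct_complete": 0.25}
bad = {"task_id": "1.2", "pct_complete": "25%"}
```

Once data conforms to an open schema on the way in, the downstream pre-processing, mapping, and extractor-writing described earlier largely disappears; that is the return-on-investment argument for standards.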
Most important of all, and in the end, our results must provide metrics and visualizations that are understandable, valid, important, material, and, above all, correct.