Thursday, August 21, 2014

Information Architecture:  Big Data "Data Lake" and Data Warehouse - a case for melding the two with Enterprise Information Integration and Semantic Technologies

Data Lake
"In broad terms, data lakes are marketed as enterprise wide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, research director at Gartner, in an article published on July 28, 2014 (http://www.gartner.com/newsroom/id/2809117).

The concept of a data lake, as the article goes on to describe, is to store varied and disparate data (both structured and unstructured) in one central location. Nick uses a great metaphor for Big Data in this article. So what does this “data lake” do? The so-called “solution” of pouring all the disparate data, in its native structure and constructs, into one big reservoir or repository allows the data to be researched and analyzed by sophisticated professionals who have the right tools and know how to manipulate and analyze data.
So it’s smooth swimming for the data scientists, if I can extend the metaphor a bit. For the rest of us, however, it’s a big muddy swamp that is difficult to wade through.



Lake Superior

So what is lacking in the big data lake?
While it is true that data lakes make data available in a centralized location faster, the reality is that they simply provide a larger volume of data closer to real time. The quality of the information that can be gleaned from that data is a separate matter, and it can elude us if the insights are hard to find.

Inconsistency of the data coming into the data lake is one reason its quality is questionable. Even data for a common domain, when it comes from disparate sources, can have disparate data structures and architecture, and in a fast-moving system the underlying model tends to be in constant flux. The reality in large and small enterprises alike is that the quality of the resulting insights varies, and varies often.

Data built on such loosely governed models, from sources that use different master data in their native formats, regularly requires an understanding of the source, the source's master data and the corresponding data model, not just at a single point in time but over a period of time. Only then, and with the right amount of 'tuning' of the data, will good results ensue.
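To make the 'tuning' concrete, here is a minimal sketch in Python of aligning two sources that describe the same customer domain with different field names and different master data. Everything in it is hypothetical and chosen only for illustration: the source names `crm` and `billing`, the field names, the sample records and the country-code table.

```python
# Hypothetical records for the same "customer" domain from two sources.
crm_record = {"cust_id": "C-1001", "country": "US", "full_name": "Ada Lovelace"}
billing_record = {"CustomerNumber": "C-1001", "CountryCode": "840", "Name": "Ada Lovelace"}

# Hypothetical per-source mapping of native field names onto a shared vocabulary.
FIELD_MAP = {
    "crm": {"cust_id": "customer_id", "country": "country", "full_name": "name"},
    "billing": {"CustomerNumber": "customer_id", "CountryCode": "country", "Name": "name"},
}

# Hypothetical master-data translation: here billing stores numeric country codes
# while crm stores two-letter codes, so billing values must be translated.
COUNTRY_MASTER = {"billing": {"840": "US", "826": "GB"}}


def normalize(source, record):
    """Rename native fields to the shared vocabulary and translate master data."""
    mapping = FIELD_MAP[source]
    # Fields with no mapping are dropped rather than guessed at.
    out = {mapping[field]: value for field, value in record.items() if field in mapping}
    translation = COUNTRY_MASTER.get(source, {})
    if "country" in out:
        # Unknown codes are passed through unchanged so they can be reviewed later.
        out["country"] = translation.get(out["country"], out["country"])
    return out


aligned_crm = normalize("crm", crm_record)
aligned_billing = normalize("billing", billing_record)
print(aligned_crm == aligned_billing)
```

Note that the mappings themselves are the fragile part: whenever a source changes its model or its master data, `FIELD_MAP` and `COUNTRY_MASTER` have to change with it, which is exactly the understanding "over a period of time" described above.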

Another issue is the ‘big ball of mud’ that accumulates in the data lake. Way back in 1997, Brian Foote and Joseph Yoder came up with the term “Big Ball of Mud” for a certain kind of system. They define the term thus: “A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle”. For most of us, nothing describes a big data lake more aptly than a “big ball of mud”, even though some Big Data practitioners may disagree. Depending on the level of disparity and variety in the sources, the data retrieved from such a data lake is more noise and less information.

Business Data Warehouse
A Data Warehouse is at the other end of the spectrum from a Data Lake when it comes to organizing data. One is very informal and supports both structured and unstructured data; the other is mostly formal, with well-defined dimensions and marts for structured data. Data Warehouses have been around for several decades, ever since enterprises recognized that individual data silos were not serving the business well and that there was a need to centralize.

The premise of a Data Warehouse is that data mining techniques can be used on current and historical data to discover previously unknown relationships in the data, as well as to provide centralized reporting and analytics. IBM, which first coined the term Business Data Warehouse (BDW), originally called this use of data from the central system 'business analysis'. Later on, the term morphed into 'Business Intelligence' or 'Analytics'. The motivating question is simple: what good is it to design product features that markets do not care about, and what good is a marketing campaign if it does not promote the most profitable products? Business Intelligence capabilities were quickly built into BDWs to provide a better understanding of products, pricing, profitability, operations research, quality, capacity and customer satisfaction, as well as to support initiatives in logistics and manufacturing.

Data Warehouses certainly solve quite a few essential problems of quality and consistency, and they help in functions such as resource planning and analytics by providing a single view of the "truth" - truth as envisioned by the Data Warehouse designers at a certain point in time, then validated and accepted by consumers on the strength of the results it produces. Over their more than two-decade history, Data Warehouses have delivered on the promise of mining and centralizing disparate data.

Any practitioner who has built one, however, will tell you that they are costly to stand up and costlier to maintain. One may ask: how difficult can it be to build a BDW for two data sources, based on understanding and aligning two data models, two sets of master data, two data governance models and two sets of underlying design and functional constructs? Creating a single version of the truth from these two data sources, while not very difficult, is certainly not trivial. Now let’s say you add a third data source to the mix: you are now doubling the complexity of the solution compared to the first two sources. Add one more and the complexity is quadrupled compared to the first two data sources. Each additional source adds to this exponentially growing complexity puzzle. The complexity is also an ongoing issue with Data Warehouses: every time a data source’s model, functional usage, master data or governance is updated, the alignment of that data source with the rest of the data sources in the BDW needs to be re-analyzed.
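The growth described above can be put into rough numbers. The sketch below, in Python, tabulates two simple models side by side: the doubling rule of thumb from the paragraph above, and, as an alternative assumption of mine rather than anything measured, a count of the source pairs that must be aligned with each other. Both function names are hypothetical.

```python
from math import comb


def doubling_complexity(n_sources):
    """Relative complexity under the rule of thumb that each source beyond two doubles it."""
    if n_sources < 2:
        raise ValueError("the comparison starts at two data sources")
    return 2 ** (n_sources - 2)


def pairwise_alignments(n_sources):
    """Number of source pairs to align, assuming every source is aligned with every other."""
    return comb(n_sources, 2)


# Columns: number of sources, doubling rule, pairwise alignments.
for n in range(2, 7):
    print(n, doubling_complexity(n), pairwise_alignments(n))
```

The two models disagree on the exact shape of the curve (the pairwise count grows quadratically rather than exponentially), but under either assumption the alignment work grows faster than the number of sources, which is the practical point.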

Additionally, as one builds the BDW, the data must undergo a certain amount of standardization and change from its original source form. The metamorphosis needed to align with the BDW requirements can change the data permanently, occasionally in ways that make the original source form irretrievable. This change is well understood and manageable initially, for a few sources and over short periods of time, but in the long run, with many disparate sources, that understanding can be lost. Additional insights needed later from this permanently changed data can prove elusive. To top it all, the data has to be highly structured.

Melding the two ends of the spectrum with Enterprise Information Integration and Semantic Technology
Big Data is now being discussed in most enterprises, although survey results still suggest that production-level implementation stands at less than 10%. The insights that Big Data can provide to various departments of the business are critical to the success of enterprises, and they are the reason Big Data is implemented.

Watch out for an explosion of software in this arena that supports Big Data constructs alongside traditional Enterprise Information Integration processes. The level of dynamic integration on offer, tuning traditional structured data in real time and processing Big Data to extract business insights, will be the 'call to duty' of the data insight tools of the near future.

Semantic technologies are now providing extra lift in the data architecture. Internet standards for describing data and constructs consistently provide the necessary common-speak for the data, and they are being promoted by W3C, the World Wide Web Consortium, which is the web standards body. W3C is working with industries ranging from health care to eGovernment and energy to improve collaboration and innovation adoption through Semantic Web technology. Linked data architecture is being built to support consistent consumption and understanding of the collective data in the various industries. This common semantic construct can help build and understand competitive insights.
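As a rough illustration of what this common-speak buys, the sketch below models linked data in plain Python as subject-predicate-object triples, the basic shape used by Semantic Web standards such as RDF. It deliberately uses no RDF library, and every URI, vocabulary term and data value in it is hypothetical. The assumption is that a warehouse and a data lake both describe the same product with the same identifier, so their facts can simply be pooled and queried together.

```python
# Hypothetical shared vocabulary and identifiers (example.org is a placeholder).
VOCAB = "http://example.org/vocab/"
PRODUCT = "http://example.org/product/42"
REVIEW = "http://example.org/review/7"

# Facts from a structured source, expressed as (subject, predicate, object) triples.
warehouse_triples = [
    (PRODUCT, VOCAB + "name", "Widget"),
    (PRODUCT, VOCAB + "unitPrice", 9.99),
]

# Facts derived from unstructured data in the lake, using the same identifiers.
lake_triples = [
    (PRODUCT, VOCAB + "mentionedIn", REVIEW),
    (REVIEW, VOCAB + "sentiment", "positive"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every pattern component that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Because both sources share identifiers, merging is just concatenation.
graph = warehouse_triples + lake_triples

product_facts = match(graph, subject=PRODUCT)
sentiments = [t[2] for t in match(graph, predicate=VOCAB + "sentiment")]
print(len(product_facts), sentiments)
```

The design point is that no source-specific schema alignment is needed at query time: agreement on identifiers and vocabulary terms does that work up front.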

Technologies that can determine smart matches to business problems and reveal insights from the 'big' business data, juxtaposed with the industry insights available from the Semantic Web, are on the verge of flooding the data software landscape. The tools in their arsenal are multi-pronged: they meld BDW with Big Data, EII and the Semantic Web. Already, ETL tool manufacturers have embedded technologies to support Big Data/Big Insights using HDFS and MapReduce. Now these new technologies are setting their sights on real-time Enterprise Information Integration, with the Semantic Web providing a uniform representation of information.


The right strategy for extracting information from Big Data will produce the big insights, and it will require some of these new technologies. These insights are the building blocks of the value that forward-looking enterprises will extract from their data initiatives. Savvy companies will have the right mix of these new technologies to support marketing and the business, and they will need to succeed at extracting value from their “data lakes” lest those lakes turn into "data swamps".