Information Architecture: Big Data
"Data Lake" and Data Warehouse - a case for melding the two with
Enterprise Information Integration and Semantic Technologies
Data Lake
"In broad terms, data lakes are marketed
as enterprise wide data management platforms for analyzing disparate sources of
data in its native format," said Nick Heudecker, research director at
Gartner recently in an article published on July 28, 2014 (http://www.gartner.com/newsroom/id/2809117).
The concept of a data lake, as the article goes on to describe, is to store varied and disparate data (both structured and unstructured) in one central location. Heudecker captures Big Data in a great metaphor in this article. So what does this “data lake” do? The so-called “solution” of putting all the disparate data, in its native structures and constructs, into one big reservoir or repository allows research and analysis of the data in the lake by sophisticated professionals with the right tools who know how to manipulate and analyze data.
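To make the idea concrete, here is a minimal sketch of that "native format" ingestion, assuming a hypothetical file-based lake at /data/lake and invented source-system names; a real lake would typically sit on HDFS or object storage rather than a local directory:

```python
# A sketch of the "land it in native format" idea behind a data lake.
# Paths and source names are hypothetical; no schema is imposed at write time.
import shutil
from pathlib import Path

LAKE_ROOT = Path("/data/lake")  # hypothetical central store

def ingest(source_file: str, source_system: str) -> Path:
    """Copy a file into the lake untouched, partitioned by source system."""
    target_dir = LAKE_ROOT / source_system
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # native format preserved as-is
    return target

# Structured and unstructured data land side by side:
# ingest("orders.csv", "erp")
# ingest("clickstream.json", "web")
# ingest("support_calls.mp3", "callcenter")
```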
So it’s smooth swimming for the data scientists, if I can extend the metaphor a bit. To the rest of us, however, it’s a muddy swamp that is difficult to wade through.
[Image: Lake Superior]
So what is lacking in the big data lake?
While it is true that data lakes make data available more quickly in a centralized location, the reality is that they simply provide a greater volume of data in near real time. The quality of information that can be gleaned from them can elude us when the insights are hard to find.
Inconsistency of the data coming into the data lake is one reason the quality is questionable. Even data for a common domain can arrive from disparate sources with disparate structures and architectures, and in a fast-moving system the underlying model tends to be in constant flux. The reality in enterprises large and small is that the quality of the resulting insights varies widely.
Data based on such loosely governed models, drawn from sources that use different master data in their native formats, regularly requires an understanding of each source, its master data and its data model, not just at any given time but over a period of time. Only then, and with the right amount of 'tuning' of the data, will good results ensue.
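A small sketch of that 'tuning', with invented field names and master data, shows why each source needs its own read-time mapping before a common view of the domain emerges:

```python
# Two hypothetical sources describe the same customer domain with
# different structures and different master data, so each needs its
# own mapping at read time before the records can be compared.
def from_crm(row: dict) -> dict:
    return {"customer_id": row["CustID"],
            "country": row["Country"]}  # full country names already

def from_billing(row: dict) -> dict:
    country_codes = {"US": "United States", "DE": "Germany"}  # master-data mapping
    return {"customer_id": str(row["account_no"]),
            "country": country_codes[row["cc"]]}

# The same real-world customer, as each source natively represents it:
crm_row = {"CustID": "1001", "Country": "Germany"}
billing_row = {"account_no": 1001, "cc": "DE"}

# Only after per-source tuning do the two records agree.
assert from_crm(crm_row) == from_billing(billing_row)
```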
Another issue is the ‘big ball of mud’ that accumulates in the data lake. Back in 1997, Brian Foote and Joseph Yoder coined the term “Big Ball of Mud” for such systems. They define the term thus: “A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle”. Nothing describes the big data lake more aptly for most of us than a “big ball of mud”, even though some Big Data practitioners may disagree. For the rest of us, depending on the level of disparity and variety, the data retrieved from such a data lake is more noise than information.
Business Data Warehouse
A Data Warehouse sits at the other end of the spectrum from a Data Lake when it comes to organizing data. One is very informal, supporting structured and unstructured data; the other is mostly formal, with well-defined dimensions and marts for structured data. Data Warehouses have been around for several decades now, ever since enterprises recognized that individual data silos were not serving the business well and that there was a need to centralize.
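To illustrate the formal end of that spectrum, here is a minimal star-schema sketch, using SQLite for convenience and illustrative table names; real warehouses differ in scale, not in shape:

```python
# A tiny star schema: a fact table of sales joined to conformed
# dimension tables. Table and column names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO dim_date VALUES (1, '2014-07-28');
INSERT INTO fact_sales VALUES (1, 1, 19.50), (1, 1, 5.25);
""")

# Reporting becomes a join of facts against well-defined dimensions.
for row in con.execute("""
    SELECT p.name, d.day, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_date d USING (date_id)
    GROUP BY p.name, d.day"""):
    print(row)  # ('Widget', '2014-07-28', 24.75)
```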
The premise of a Data Warehouse is that data mining techniques can be used on current and historical data to discover previously unknown relationships in the data, as well as to provide centralized reporting and analytics. The term Business Data Warehouse (BDW), first coined by IBM, originally described such usage of the data from this central system as 'business analysis'. Later, the term morphed into 'Business Intelligence' or 'Analytics'. The questions behind the premise are these: what good is it to design product features that markets do not care about, and what good is a marketing campaign if it does not promote the most profitable products? Business Intelligence capabilities were quickly built into BDWs to provide a better understanding of products, pricing, profitability, operations research, quality, capacity and customer satisfaction, as well as to support initiatives in logistics and manufacturing. Data Warehouses certainly solve quite a few essential problems of quality and consistency, and they help various functions such as resource planning and analytics by providing a single view of the "truth" - truth as envisioned by the Data Warehouse designers at a certain point in time, and validated and accepted by its consumers through the results it produces. Over the course of their more than two-decade history, Data Warehouses have delivered on the promise of mining and centralizing disparate data.
Any practitioner who has engaged in building them, however, will tell you that they are costly to stand up and costlier to maintain. How difficult can it be, one may ask, to build a BDW for two data sources, based on an understanding and alignment of two data models, two sets of master data, two data governance models and two sets of underlying design and functional constructs? Creating a single version of truth from these two data sources, while not very difficult, is certainly not trivial. Now add a third data source to the mix: there are now three pairwise alignments to maintain instead of one. Add one more and there are six. Every additional source must be aligned with each source already in place, compounding this combinatorial complexity puzzle. The complexity is also an ongoing issue with Data Warehouses: every time there are updates to a data source's model, its functional usage, its master data or its governance, the alignment of that data source with the rest of the data sources in the BDW needs to be revisited.
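The arithmetic behind that growth is easy to check: with n sources, the number of pairwise alignments is n(n-1)/2, as this small snippet shows:

```python
# Pairwise source-alignment count for a warehouse with n sources.
from math import comb

for n in range(2, 7):
    print(f"{n} sources -> {comb(n, 2)} pairwise alignments")
# 2 -> 1, 3 -> 3, 4 -> 6, 5 -> 10, 6 -> 15
```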
Additionally, as one builds the BDW, the data must undergo a certain amount of standardization and change from its original source form. The metamorphosis needed to align with BDW requirements can alter the data for good, occasionally making the original source representation irretrievable. This change is well understood and manageable at first, with few sources and over short periods of time, but in the long run, with many disparate sources, that understanding can be lost. Additional insights needed later from this permanently changed data can then prove elusive. To top it all off, the data has to be highly structured.
Melding the two ends of the spectrum with Enterprise Information Integration and Semantic Technology
Big Data is now being discussed in most enterprises, although survey results suggest that production-level implementations remain below 10%. The insights that Big Data can yield for various parts of the business are critical to the success of enterprises and are the reason Big Data is implemented.
Watch for an explosion in this software arena of tools that support Big Data constructs alongside traditional Enterprise Information Integration processes. The level of dynamic integration they offer - tuning traditional structured data in real time and processing Big Data to extract business insights - will be the 'call to duty' of the data insight tools of the near future.
Semantic technologies are now providing extra lift in the data architecture. Internet standards for describing data and constructs consistently provide the necessary common-speak for the data and are being promoted by the W3C - the World Wide Web Consortium, the web standards body. The W3C is working with industries from Health Care and eGovernment to Energy to improve collaboration and the adoption of innovation through Semantic Web technology. Linked data architectures are being built to support consistent consumption and understanding of the collective data in these industries. This common semantic construct can help build and surface competitive insights.
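As a small illustration of linked data, the sketch below uses the rdflib Python library (one tool among many; the article names no specific one) and the schema.org vocabulary to publish facts that any RDF-aware consumer can interpret:

```python
# A minimal linked-data sketch: two facts about a product expressed as
# RDF triples using a shared vocabulary (schema.org). The enterprise
# namespace below is hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.com/data/")  # hypothetical namespace

g = Graph()
g.bind("schema", SCHEMA)

product = EX["product/42"]
g.add((product, RDF.type, SCHEMA.Product))
g.add((product, SCHEMA.name, Literal("Widget")))
g.add((product, SCHEMA.price, Literal("19.99")))

# Serialize as Turtle: the same triples can be consumed by any
# RDF-aware tool, regardless of which system produced them.
print(g.serialize(format="turtle"))
```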
Technologies that can determine smart matches to business problems and reveal insights by juxtaposing 'big' business data with the industry insights available from the Semantic Web are on the verge of flooding the data software landscape, and the tools in their arsenal are multi-pronged - melding the BDW with Big Data, EII and the Semantic Web. Already, ETL tool vendors have embedded technologies to support Big Data and big insights using HDFS and MapReduce. Now these new technologies are setting their sights on real-time Enterprise Information Integration built on uniform information representation via the Semantic Web.
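For readers unfamiliar with the pattern those vendors embed, here is a single-process sketch of MapReduce with invented records; production jobs run distributed over HDFS, but the map, shuffle and reduce phases work the same way:

```python
# A minimal, in-process sketch of the MapReduce pattern (illustrative
# only; real deployments run distributed on a Hadoop cluster over HDFS).
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs - here, count events per product.
    yield (record["product"], 1)

def reduce_phase(key, values):
    return (key, sum(values))

# Stand-in for records read from files in HDFS.
records = [
    {"product": "widget"},
    {"product": "gadget"},
    {"product": "widget"},
]

# Shuffle: group mapped values by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = [reduce_phase(key, values) for key, values in groups.items()]
print(results)  # [('widget', 2), ('gadget', 1)]
```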
The right strategy for extracting information out of Big Data will produce the big insights, and it will require some of these new technologies. These insights are the foundation blocks of the value that enterprises with a vision for the future will extract from their data initiatives. Savvy companies will have the right mix of these new technologies to support marketing and the business, and they will need to be successful at extracting value out of their “data lakes” lest those lakes turn into “data swamps”.