Expert Advice about Managing Blockchain in a Big Data Environment

Is there a future that incorporates blockchain with data lakes?

Senthil Rajamanickam | January 3, 2017

Attractiveness of Data Lakes

Enterprise IT leaders are increasingly seeing the value of incorporating data lakes into their wider big data management strategy. A legacy IT environment is not the legacy they want to leave their organization, but the alternatives have drawbacks of their own: the potential long-term costs of piecemeal open source projects do not make for an ideal data management strategy. That’s where the data lake comes into play. A data lake can end data silos within your organization, centralize data and give you better access to all the disparate data sources within your business. It can be used in a multitude of ways: capturing a 360-degree view of your customers, analyzing social media data, and storing the volumes of data and insights coming from IoT devices and sensors. For all of these reasons, and those below, the solution becomes attractive due to its:

  1. Low cost and extremely scalable storage solution
  2. Ability to support multiple programming languages, applications, and frameworks while remaining completely data agnostic and providing uncompromised access to all structured and unstructured data
  3. Centralized system that eliminates the need to move data, preventing data silos and encouraging large data archives
  4. Ability to store raw data, thereby providing more insight with undiluted data for exploration and analysis

Challenges of Data Lakes

But while a data lake can be an extremely valuable asset, many enterprise implementations run into serious challenges. We’ve identified a few below, along with ways to overcome them:

  1. Metadata Management – Data becomes more valuable to an organization when it is tagged and catalogued. A well-indexed library management system eases navigation and helps users quickly find the information they want within the data lake. While metadata is certainly key, so is meta-metric information that provides detailed metrics such as the number of data sets, data owners, data origin and date of origination. This helps users determine what data to use from the large catalog associated with the data lake.
  2. Data Governance – If the data within the data lake is not governed, you’ll soon end up in data limbo or risk creating a data swamp. This creates all kinds of issues, including poor data quality, weak metadata management and information security gaps, and can eventually lead to the failure of the data lake. Monitoring the data lake and keeping metrics is critical for spotting irregularities and incrementally improving processes.
  3. Data Preparation – As organizations provide more democratized access to data lakes and self-service becomes more common, finding ways to address data quality and preparation becomes even more critical.
  4. Data Security – Moving all data to a single location like a data lake means security must be top-notch, because even one breach can disrupt your business. An integrated security plan needs to be in place.

Blockchain Meets Data Lake

While challenges arise with data lakes, it’s clear that the benefits outweigh the challenges. But what if we could take the idea of a data lake one step further – a step that might help data lakes overcome their largest challenge?

Let’s introduce blockchain. A blockchain is a database of ever-growing records, grouped into blocks, that is kept secure from tampering and revision by holding the information on multiple storage devices that are not attached to a single processor. Because the data lives in what’s called a “distributed database,” in which the loosely coupled nodes share no physical components, there is no single point for a hacker to attack. Further, because each block is timestamped and linked to the previous block, the chain serves as a public ledger of transactions.
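The timestamped, hash-linked structure described above can be illustrated with a minimal sketch. This is a toy model, not a production ledger: each block records a timestamp, its contents, and the hash of its predecessor, so altering any earlier block invalidates every later link.

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    """Create a timestamped block linked to its predecessor by hash."""
    block = {
        "timestamp": time.time(),
        "records": records,
        "prev_hash": prev_hash,
    }
    # The block's own hash covers its contents and the previous hash,
    # so tampering with any earlier block breaks every later link.
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify_chain(chain):
    """Recompute each block's hash and check the links between blocks."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != block["hash"]:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block(["ledger opened"], prev_hash="0" * 64)
chain = [genesis, make_block(["doc-42 ingested"], genesis["hash"])]
print(verify_chain(chain))   # True
chain[0]["records"] = ["forged entry"]
print(verify_chain(chain))   # False: tampering broke the chain
```

A real deployment would add consensus across nodes; the point here is only the tamper-evident linkage.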

There are quite a few opportunities to integrate data lakes with blockchain. The greatest benefit of such integration is data provenance: identifying the provenance of data entering the lake makes it more genuine, trustworthy and usable, and prevents the lake from turning into a data swamp. But let’s not get ahead of ourselves. This blog is not about how to integrate blockchain with data lakes; let’s table the how-to for a later discussion. It is, however, worth pointing out the benefits of making it happen.

  1. One of the well-proven use cases for blockchain technology is digital content identification and ownership. Blockchain gives organizations a way to digitally sign a document while leveraging a distributed ledger system as third-party validation to legitimize the digital document or content. If we could legitimize every document that went into a data lake, the opportunities to use the data later would be much greater. For that to happen, blockchain technology has to evolve into a corporate-level solution. Enterprise-level blockchain is evolving today through open source tools like Openchain, which provides distributed ledger technology to organizations that want to issue and manage digital blocks associated with their documents, but these tools have yet to gain mainstream adoption.
  2. Data lake adoption currently sits at the peak of Gartner’s hype cycle. As the technology matures, data provenance will need to be addressed by integrating corporate blockchain technology. Users could then check the provenance and authenticity of data in the lake with a validation request to participating nodes in the corporate blockchain environment. Blocks would vouch for data authenticity and legitimacy, helping users establish trust in the data.
  3. Data lakes can quickly ingest tons of documents in raw format, supporting structured and unstructured data sets while systematically storing them with easy NoSQL access. But the system lacks a metadata environment and cannot catalog data sets or provide useful metrics around them. Doing so would help ensure better data usage for end users, guarantee the legitimacy of the data, and prevent dark data buildup.
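The digital content identification idea in the first point above can be sketched as follows. This is a simplified illustration: content fingerprinting with SHA-256 stands in for a full digital signature, and a plain dict stands in for a distributed ledger such as Openchain.

```python
import hashlib

# A plain dict stands in for the distributed ledger; a real deployment
# would use an Openchain-style ledger replicated across nodes.
ledger = {}

def register_document(doc_bytes, owner):
    """Fingerprint a document and record its ownership in the ledger."""
    doc_id = hashlib.sha256(doc_bytes).hexdigest()
    ledger[doc_id] = {"owner": owner}
    return doc_id

def validate_document(doc_bytes):
    """Re-hash the content and look it up in the ledger."""
    doc_id = hashlib.sha256(doc_bytes).hexdigest()
    return ledger.get(doc_id)  # None means unregistered or altered

doc = b"Q3 revenue report"
doc_id = register_document(doc, owner="finance-team")
print(validate_document(doc))              # {'owner': 'finance-team'}
print(validate_document(b"Q3 revenue!"))   # None: content was altered
```

Because the ledger entry is keyed by the content hash, any alteration to the document after registration makes validation fail, which is exactly the legitimacy check the data lake would rely on.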

Capturing and cataloging data metrics, and supporting modern data lakes with metadata, are two key capabilities supported by various vendor groups. To connect them, a trusted common platform must provide a transparent environment linking both sets of tools. Blockchain’s support for data provenance lets it act as a broker between metadata tools and evolving metric-data tools by providing a referenceable common ledger keyed by a unique ID.

  4. It’s no secret that data lakes struggle to support the complex layers of security protocol that are essential in financial services and healthcare. Marrying the two technologies would enable selective access to digital assets stored in the data lake. Access would be granted based on predefined access management and controlled whitelists and blacklists to appease security teams. Blockchain would act as a security guard, granting access only when security controls match. The distributed systems that validate security would generate the encryption keys users need to decrypt a data asset, provided the asset is whitelisted within the blockchain, thereby ensuring a highly trusted, secure data management environment.

Blockchain and Data Lake Roadblocks

Today, blockchain proofs of concept (PoCs) are common across a variety of industries, but the technology has yet to be used in a corporate environment. If you consider corporate-issued laptops as the nodes of a blockchain, information can be passed and accessed using more secure authentication methods such as blockchain IDs, private keys and public keys.

A recent article in Harvard Business Review put the annual cost of bad data at $3.1 trillion. If we could prevent bad data from entering a corporate environment, we could significantly lower that number. Imagine every piece of data entering a corporate environment being placed into a data lake, with blockchain helping to identify its provenance. The data would be stored in the lake with an associated blockchain ID. All corporate users with laptops, or nodes, on the system could then access that data with a public key sent from the data originator. If you use a particular data set over a prolonged period, it can be confirmed as clean, accurate data.

How Data Controls Can Help with Data Quality

But data provenance doesn’t mean you automatically have clean data; it just means you know where the data originated. Automated data controls can help monitor, validate and capture metrics around the data. Extending blockchain capabilities through Openchain can complement data controls by digitally tagging each data asset before it moves into the data lake. Validating and authenticating digital assets by blockchain ID is then handled by distributed ledger systems, reached from enterprise applications through simple application extensions.

Data metrics captured by automated controls can be stored in a meta-metrics repository along with the digital asset’s metadata and its associated blockchain ID.
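An automated control that emits one such meta-metrics record might look like this sketch. The field names and the use of a content hash as the blockchain ID are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def capture_metrics(records, owner):
    """Run simple automated controls over a data set and emit a
    meta-metrics record keyed by a content-derived blockchain ID."""
    payload = json.dumps(records, sort_keys=True).encode()
    return {
        # Content hash standing in for the asset's blockchain ID.
        "blockchain_id": hashlib.sha256(payload).hexdigest(),
        "owner": owner,
        "row_count": len(records),
        # A basic quality control: count rows with missing values.
        "null_rows": sum(1 for r in records if None in r.values()),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }

rows = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": None}]
metrics = capture_metrics(rows, owner="sales-ops")
print(metrics["row_count"], metrics["null_rows"])  # 2 1
```

Stored alongside the asset’s metadata, records like this give users the origin, ownership and quality signals the article argues a data lake needs.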

Today, data lakes are too often used as a glorified staging environment. Marrying data lakes and blockchain is difficult, but to improve the quality and provenance of the data they hold, it’s an integration that can’t be ignored.