Enterprise IT leaders increasingly see the value of incorporating data lakes into their wider big data management strategy. Leaving a legacy IT environment is not the legacy they want to leave their organization. But what’s the alternative? The potential long-term costs of piecing together open source projects make them a poor foundation for a data management strategy. That’s where data lakes come into play. A data lake can end data silos within your organization, centralize data and give you better access to every disparate data source in your business. It can be used in a multitude of ways: capturing a 360-degree view of your customers, analyzing social media data, and collecting volumes of data and insights from IoT devices and sensors. For all of these reasons, and those below, the solution becomes attractive due to its:
But while a data lake can be an extremely valuable asset, many enterprise implementations run into common challenges. We’ve identified a few below, along with ways to overcome them:
While data lakes come with challenges, the benefits clearly outweigh them. But what if we could take the idea of a data lake one step further – a step that might help data lakes overcome their largest challenge?
Let’s introduce blockchain. A blockchain is a database that stores an ever-growing set of records, grouped into blocks, and keeps them secure from tampering and revision by holding the information on multiple storage devices that are not attached to a single processor. Because the data lives in what’s called a “distributed database” – a loosely coupled system that shares no physical components – there is no single point for a hacker to attack. Further, because each block is timestamped and linked to the previous block, the chain serves as a public ledger of transactions.
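The tamper-evidence described above comes from each block’s hash covering both its own contents and the previous block’s hash. Here is a minimal, illustrative sketch (not any production blockchain implementation) of that chaining in Python:

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    """Bundle records into a timestamped block linked to its predecessor."""
    block = {
        "timestamp": time.time(),
        "records": records,
        "prev_hash": prev_hash,
    }
    # The block's hash covers its contents *and* the previous block's hash,
    # so altering any earlier block invalidates every later link.
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    return block

def chain_is_valid(chain):
    """Recompute each block's hash and check every link to its predecessor."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Build a tiny two-block chain, then tamper with the first block.
genesis = make_block(["initial record"], prev_hash="0" * 64)
chain = [genesis, make_block(["second record"], prev_hash=genesis["hash"])]
print(chain_is_valid(chain))   # True
chain[0]["records"] = ["forged record"]
print(chain_is_valid(chain))   # False
```

A real blockchain adds consensus, replication across nodes, and cryptographic signing, but the hash-linking shown here is what makes the ledger revision-resistant.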
There are quite a few opportunities to integrate data lakes with blockchain. The greatest benefit of such integration is data provenance for the information entering the data lake, which prevents the lake from turning into a data swamp. Simply establishing provenance for the data entering the lake makes that data more genuine, trustworthy and usable. But let’s not get ahead of ourselves. This blog is not about how to integrate blockchain with data lakes – let’s table the how-to for a later discussion. However, it’s important to point out the benefits of making it happen.
Capturing and cataloging data metrics, and supporting modern data lakes with metadata, are two key capabilities, each served by a different set of vendor tools. Connecting them requires a trusted common platform that provides a transparent environment spanning both tool sets. Because blockchain supports data provenance, it can act as a broker between metadata tools and evolving data-metrics tools by providing a referenceable common ledger keyed on a unique ID.
Today, blockchain proofs of concept (PoCs) are common across a variety of industries, but the technology has yet to see broad use inside corporate IT environments. If you consider corporate-issued laptops as the nodes of a blockchain, information can be passed and accessed using stronger authentication methods such as a blockchain ID with private and public keys.
A recent article in Harvard Business Review put the annual cost of bad data at $3.1 trillion USD. If we could prevent bad data from entering corporate environments, we could significantly lower that number. Imagine every piece of data, as it enters the corporate environment, being placed into a data lake with blockchain used to establish its provenance. The data would be stored in the lake with an associated blockchain ID, and every corporate user with a laptop – a node – on the system could access it with a public key sent by the data’s originator. If you then use that data set over a prolonged period, it can be confirmed as clean, accurate data.
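The flow described above – ingest, record provenance, verify later – can be sketched in a few lines. This is a toy, in-memory illustration (the `data_lake` and `ledger` stores and all function names are hypothetical, and a real deployment would use asymmetric key signing rather than plain hashes):

```python
import hashlib
import time

# Toy in-memory stand-ins for the data lake and the provenance ledger.
data_lake = {}
ledger = {}

def ingest(payload: bytes, originator: str) -> str:
    """Store a data asset in the lake and record its provenance on the ledger.

    The blockchain ID is derived from the content, the originator, and the
    ingestion time, so identical content from different sources still gets
    a distinct, verifiable ID.
    """
    ts = time.time()
    blockchain_id = hashlib.sha256(
        payload + originator.encode() + repr(ts).encode()
    ).hexdigest()
    data_lake[blockchain_id] = payload
    ledger[blockchain_id] = {
        "originator": originator,
        "timestamp": ts,
        "content_hash": hashlib.sha256(payload).hexdigest(),
    }
    return blockchain_id

def verify(blockchain_id: str) -> bool:
    """Confirm the asset in the lake still matches its ledger entry."""
    entry = ledger.get(blockchain_id)
    payload = data_lake.get(blockchain_id)
    if entry is None or payload is None:
        return False
    return hashlib.sha256(payload).hexdigest() == entry["content_hash"]

asset_id = ingest(b"customer,region\nacme,emea", originator="crm-export")
print(verify(asset_id))          # True
data_lake[asset_id] = b"tampered"
print(verify(asset_id))          # False
```

Even in this simplified form, any consumer of the data set can check, at any point over its lifetime, that what sits in the lake is exactly what the originator registered.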
But data provenance doesn’t automatically mean clean data; it just means you know where the data originated. Automated data controls can help monitor, validate and capture metrics around the data. Extending blockchain capabilities through open chain can complement those controls by digitally tagging each data asset before it moves into the data lake. Digital assets can then be validated and authenticated against their blockchain IDs on the distributed ledger, with simple application extensions connecting the ledger to existing enterprise applications.
Data metrics captured by automated controls can be stored in a meta-metrics data repository along with the digital asset’s metadata and its associated blockchain ID.
Today, data lakes are too often used as a glorified staging environment. Marrying data lakes and blockchain is difficult, but given how much the integration could improve data quality and data provenance, it’s not something that can be ignored.