Data Mutations in Your Data Lake? Explore These Data Quality Strategies

Don’t Compromise Analytic Results – Apply Big Data Quality First

Emily Washington | June 13, 2017


Data Cleanliness Dilemma

If the DNA of your data is mutated, it’s dirty data, and you can predict that any analysis of your big data will yield inaccurate results.

When you run analytics on big data that hasn’t been validated for quality, it’s a safe bet that no one will vouch with high confidence for the accuracy of the analytical outcomes.

When it comes to using big data, we marvel at the possibility of groundbreaking predictions and game-changing insights that will transform the fortunes of the organization and leapfrog the competition. But none of that will come to fruition if we lack confidence that the data is reliable.

Costs of Dark Data

Improved insights come from combining different data sets to make connections that were previously impossible to see. That sounds logical, but in practice much of an organization’s big data remains underutilized because its quality is insufficient. This happens when tools aren’t readily available to quickly profile the data and rate its reliability, and when data quality tools for big data aren’t at one’s fingertips to expedite cleanup. What we need is less dark data and more confidence and trust in the big data we use. What we need is data quality for big data applied before analytics, so that high-impact predictions can be made with conviction.
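As a rough illustration of what quick data profiling can look like, here is a minimal sketch using pandas. The sample data, column names, and the 20% missing-value threshold are all hypothetical; in practice you would profile a sample pulled straight from the lake.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Compute simple per-column reliability indicators."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": df.isna().mean().round(3),            # share of missing values
        "unique_pct": (df.nunique() / len(df)).round(3),  # cardinality relative to row count
    })

# Hypothetical sample; in practice this might come from something like
# pd.read_parquet("s3://my-lake/raw/orders/")
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "amount": [120.0, None, 87.5, None, 42.0],
    "region": ["east", "east", "west", None, "west"],
})

report = profile(df)
print(report)

# Flag columns whose missing-value rate exceeds an arbitrary 20% threshold
suspect = report[report["null_pct"] > 0.20]
print("Columns needing attention:", list(suspect.index))
```

Even a lightweight report like this is enough to rate a data set’s reliability before anyone builds analytics on top of it.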

The Old World: Data Warehouses

Not too long ago, we saw a run on the traditional data warehouse. It seemed the perfect solution for storing all of that data. But as more and more data was produced, bottlenecks were exposed, and many now question whether data warehouses are obsolete (though that’s a whole different blog). While data warehouses can support predictions based on historical data, they fail to harness the power of the predictive analytics that big data can unleash.

To combat these issues, many organizations have jumped to data lakes. A data lake can store endless amounts of data from multiple systems, in any format or structure, all in one place.

For these reasons, and others, many organizations have been moving their data from various warehouses into data lakes. The benefits include the ability to derive value from unlimited types of data, storage for structured and unstructured data alike, unlimited ways to query the data, and overall flexibility.

However, while data storage and access are improved, running analytics to gain insight into what the data means continues to be a major challenge.

Data Lake vs Data Swamp

Many organizations across industries have invested millions of dollars in data lakes. They save money because their data is no longer in silos and is preserved in its native form, but many of those lakes have turned into data dumping grounds, or what some fondly call data swamps. Organizations have dumped data from various sources into the lake, hoping to run analytics on it down the road. IT applauds the move because it no longer has to spend time understanding how information is used, but getting value out of the data is often a problem, especially without determining data quality.

To run a successful analytics program in a data lake, organizations need to ensure their data is of the highest quality. Unlike food or medicine, data has no “born on” date and no expiration date to determine its value. It is important to define where the data came from, how the data will be used, and how it will be consumed.
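One lightweight way to capture those facts is to attach a small provenance record to each data set as it lands in the lake. The sketch below is only illustrative; the `DatasetProvenance` name and its fields are assumptions, not any particular product’s schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class DatasetProvenance:
    """Hypothetical provenance record for a data set landing in the lake."""
    name: str
    source_system: str            # where the data came from
    ingested_at: str              # when it arrived (data has no "born on" date otherwise)
    intended_use: str             # how the data will be used
    consumers: list = field(default_factory=list)  # who will consume it

record = DatasetProvenance(
    name="orders_raw",
    source_system="erp_exports",
    ingested_at=datetime.now(timezone.utc).isoformat(),
    intended_use="monthly revenue forecasting",
    consumers=["finance_bi", "demand_planning"],
)

# Store the record alongside the data set, e.g. as a JSON sidecar file
print(json.dumps(asdict(record), indent=2))
```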

Enterprises can do this by sending in a team of IT professionals to reconcile the data manually, source by source. But reconciling hundreds of sources within a data lake and correcting the errors is tedious and can take a tremendous amount of time.

In addition, data lakes make certain assumptions about the users of the information: that they recognize and understand the contextual bias in how the data was captured, that they know how to merge and reconcile different data sources, and that they understand the incomplete nature of structured and unstructured data sets.

With no restrictions on the cleanliness of the data, errors can still slip through, making the data unreliable and untrustworthy, and eventually hurting business intelligence and the organization’s reputation.

Getting data quality right may take a significant effort, but it does not have to be a manual process. It can be operationalized, saving an organization immense time and money.

Operationalizing Data Quality 

To ensure data quality within your data lake, you need a self-service big data analytics platform designed to handle not just one step but the whole sequence, from data acquisition and preparation to data analysis and operationalization. The platform should enable users to source data from multiple data platforms and applications, including vendor products, external databases, and data lakes.

The platform should sit on top of your data lake and monitor data quality. It should enable users to create automated notifications, manage exception workflows, and develop automated data processing pipelines that feed the results of those quality checks back into operational applications and business processes.
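As a rough sketch of that kind of automation, the example below runs one quality rule against a batch, routes failing rows to an exception queue, and raises a notification when the failure rate crosses a threshold. The rule, the 5% threshold, and the logging-based alert are all placeholders for whatever your platform actually provides.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq-pipeline")

FAILURE_THRESHOLD = 0.05  # arbitrary: alert if more than 5% of rows fail the rule

def rule_valid_amount(df: pd.DataFrame) -> pd.Series:
    """Example rule: order amounts must be present and positive."""
    return df["amount"].notna() & (df["amount"] > 0)

def run_quality_check(df: pd.DataFrame):
    passed = rule_valid_amount(df)
    exceptions = df[~passed]            # rows routed to an exception workflow
    failure_rate = 1 - passed.mean()
    if failure_rate > FAILURE_THRESHOLD:
        # Stand-in for a real notification (email, chat alert, ticket, etc.)
        log.warning("Quality alert: %.1f%% of rows failed validation", failure_rate * 100)
    return df[passed], exceptions

# Hypothetical batch pulled from the lake
batch = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [120.0, None, 87.5, -10.0]})
clean, exceptions = run_quality_check(batch)
print(f"{len(clean)} clean rows, {len(exceptions)} routed to exception handling")
```

Only the clean rows flow on to downstream applications; the exceptions wait in a queue for review rather than silently polluting analytics.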

In addition, the platform should enable users to apply statistical and process controls, as well as machine-learning algorithms for segmentation, classification, recommendation, regression and forecasting. Users should be able to create reports and dashboards to visualize the results and collaborate with other users.
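To make the statistical-control idea concrete, here is a minimal three-sigma control check on a daily metric, a stand-in for the kinds of checks such a platform might automate. The counts and limits are made up for illustration.

```python
import numpy as np

# Hypothetical daily record counts landing in the lake
daily_counts = np.array([10_210, 10_055, 9_980, 10_120, 10_340, 10_015, 4_730])

mean = daily_counts[:-1].mean()          # baseline from prior days
std = daily_counts[:-1].std(ddof=1)
lower, upper = mean - 3 * std, mean + 3 * std  # classic three-sigma control limits

latest = daily_counts[-1]
if not (lower <= latest <= upper):
    print(f"Out of control: {latest} outside [{lower:.0f}, {upper:.0f}]")
else:
    print(f"In control: {latest} within [{lower:.0f}, {upper:.0f}]")
```

A sudden drop like the last value here is exactly the kind of signal that should trigger a notification and an exception workflow before anyone trusts that day’s numbers.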

Always remember the old saying: “Garbage in, garbage out.”

To learn more about data quality for data lakes, download this eBook.

Get Insights

For a deeper dive into this topic, visit our resource center. Here you will find a broad selection of content that represents the compiled wisdom, experience, and advice of our seasoned data experts and thought leaders.

Download eBook