From the moment data enters your enterprise and begins to move, it is vulnerable. Data in motion flows through many systems before it can be analyzed to yield information for better business decisions. Data is at its most vulnerable while in motion, not only because of the nature of the information itself, but because of its continual fluctuation and the uncertainty about how to monitor it properly in transit. This lack of awareness results in company processes built around data at rest, with ad hoc, fragmented solutions for monitoring data in motion. One of the principal challenges of using big data in the enterprise today is the complexity it introduces for data quality. Even organizations with the most rigorous data quality mechanisms in place, which many organizations lack, can easily be overwhelmed by the speed, variety, and volume of big data.
Data quality is a necessity, and it can be particularly challenging to achieve in big data environments. Failure to ensure data quality can render any data, big or otherwise, virtually useless because of inaccuracies and the fundamental unreliability of the insights it yields. In this regard, data quality is a vital prerequisite for any analytical insight or application functionality, a significant component of data preparation, and the foundation of trustworthy data and results.
Organizations that have a mature process in place for data quality agree that it is far less expensive to fix an issue early, before it can cascade into other systems. It is often difficult and costly in time, money, and resources to track down the root cause of an error after the fact. And when data quality affects compliance or customer experience, it often becomes a high-visibility management issue.
Organizations receive, process, produce, store and send an amazing array of information to support and manage their operations, satisfy regulators and make important decisions. They use sophisticated information systems and state-of-the-art information technologies. However, their information environments are especially susceptible to the risk of information errors. The following are five steps to help you master big data quality.
Discover: Critical information flows must be identified in order to develop metric baselines. All data provisioning systems, including external source systems, need to be identified and documented along with their data lineage. In this phase, source and target system owners should jointly establish data quality criteria and measurement metrics for the key data elements. Data profiling is used to set a baseline for those data quality metrics. It is important to remember that this is an ongoing process: as new systems are added or processes change, the discovery phase continues.
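The profiling step above can be sketched in a few lines. This is a minimal, illustrative example, not a production profiler; the column names, sample records, and choice of metrics (completeness and distinct-value counts) are assumptions for illustration.

```python
def profile(records, columns):
    """Compute a baseline per column: completeness (non-null ratio) and distinct values."""
    baseline = {}
    total = len(records)
    for col in columns:
        values = [r.get(col) for r in records]
        non_null = [v for v in values if v not in (None, "")]
        baseline[col] = {
            "completeness": len(non_null) / total if total else 0.0,
            "distinct": len(set(non_null)),
        }
    return baseline

# Hypothetical sample: three customer records, one with a missing email.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
]
baseline = profile(records, ["id", "email"])
```

In practice the baseline would be computed over full production volumes and stored so later monitoring can detect drift against it.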
Define: You must assess data quality risk. This is accomplished by thoroughly defining data quality issues, pain points, and risks. Some may be relevant only to a specific process or organization, while others are tied to industry regulations. Once the risks are evaluated and prioritized, the organization must determine an appropriate response based on a cost-benefit analysis.
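One simple way to make that cost-benefit prioritization concrete is to score each risk by expected impact against remediation cost. The risk names, fields, and weights below are hypothetical; this is a sketch of the idea, not a prescribed methodology.

```python
def prioritize(risks):
    """Rank risks by a simple score: (impact * likelihood) / remediation cost."""
    return sorted(
        risks,
        key=lambda r: (r["impact"] * r["likelihood"]) / r["cost"],
        reverse=True,
    )

# Illustrative risk register with made-up impact (1-10), likelihood (0-1), and cost.
risks = [
    {"name": "duplicate customers", "impact": 5, "likelihood": 0.8, "cost": 2},
    {"name": "stale reference data", "impact": 3, "likelihood": 0.5, "cost": 1},
    {"name": "regulatory field gaps", "impact": 9, "likelihood": 0.3, "cost": 3},
]
ranked = prioritize(risks)
```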
Design: Appropriate information analysis and exception management processes should be designed to address the risks identified in the “define” phase. The analysis, expressed as data quality rules, should be independent of the processes it analyzes. This is critical when dealing with large amounts of data: to analyze 100% of the data instead of sample sets, you will need a solution designed to run natively in Hadoop.
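Keeping rules independent of the process means defining them as standalone checks that any pipeline can apply. A minimal sketch, with hypothetical rule names and record fields:

```python
# Rules live in their own registry, decoupled from any one pipeline.
RULES = {
    "amount_positive": lambda r: r.get("amount", 0) > 0,
    "currency_present": lambda r: bool(r.get("currency")),
}

def check(record, rules=RULES):
    """Return the names of rules the record violates; an empty list means it passes."""
    return [name for name, rule in rules.items() if not rule(record)]

good = {"amount": 10.0, "currency": "USD"}
bad = {"amount": -5.0}
```

Because the rules take plain records and know nothing about the pipeline, the same registry can be evaluated at ingest, in batch, or distributed across a cluster.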
Deploy: Identify and categorize the highest priority risks, and the appropriate controls or actions to be deployed based on criticality. Data quality governance deployment not only includes technology, but the people and processes that can effectively execute the solution. Appropriate workflow should be put in place to take action based on results.
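The deployment step above pairs each rule with an action that matches its criticality. The severity levels and action names below are illustrative assumptions, not part of any particular product:

```python
# Hypothetical mapping from rule name to the action its criticality warrants.
CRITICALITY = {"amount_positive": "block", "currency_present": "warn"}

def dispatch(violations, actions=None):
    """Route each violated rule to its configured action, defaulting to manual review."""
    actions = actions or CRITICALITY
    return {rule: actions.get(rule, "review") for rule in violations}

routed = dispatch(["amount_positive", "unknown_rule"])
```

In a real deployment the actions would feed a workflow system so that people, not just technology, act on the results.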
Monitor: Once appropriate controls are in place, you should monitor the data quality indicators established in the discovery phase. Automated, continuous monitoring solutions provide the most cost-effective approach for data quality oversight and produce the best results for operational communication.
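Automated monitoring can be as simple as comparing current metrics against the baselines set in the discovery phase and alerting on drift. The metric names and tolerance below are assumptions for illustration:

```python
def monitor(baseline, current, tolerance=0.05):
    """Alert on any metric that has dropped more than `tolerance` below its baseline."""
    alerts = []
    for metric, expected in baseline.items():
        observed = current.get(metric, 0.0)
        if observed < expected - tolerance:
            alerts.append(f"{metric}: {observed:.2f} below baseline {expected:.2f}")
    return alerts

# Hypothetical baselines from discovery vs. today's observed values.
baseline = {"email_completeness": 0.98, "id_uniqueness": 1.00}
current = {"email_completeness": 0.90, "id_uniqueness": 1.00}
alerts = monitor(baseline, current)
```

Run on a schedule, a check like this turns quality oversight into a continuous, low-cost control rather than a periodic audit.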
Now you will need a solution that can implement and automate these five steps.
To ensure data quality, you will need an all-inclusive big data quality platform. The platform should continuously monitor your data so that bad data is immediately flagged and stopped before it impacts business operations. It should run high-volume data quality checks such as data profiling, consistency, conformity, completeness, timeliness, and reconciliation, and offer visual data preparation and machine learning to foster end-user trust by verifying the quality of your big data. That trust is conveyed through proper governance. It is important to remember that data quality is an ongoing process: to achieve your goals, the five steps outlined here will be repeated again and again as your information continually evolves.
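One of the checks named above, reconciliation, verifies that what arrived in a target system matches what left the source. A minimal sketch with made-up datasets and field names:

```python
def reconcile(source, target, key="id", field="amount"):
    """Report keys missing from the target and keys whose field value does not match."""
    src = {r[key]: r[field] for r in source}
    tgt = {r[key]: r[field] for r in target}
    missing = sorted(set(src) - set(tgt))
    mismatched = sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k])
    return {"missing": missing, "mismatched": mismatched}

# Hypothetical example: record 3 never arrived, record 2 was altered in transit.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
result = reconcile(source, target)
```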
For a deeper dive into this topic, visit our resource center. Here you will find a broad selection of content that represents the compiled wisdom, experience, and advice of our seasoned data experts and thought leaders.