Chris Reed | August 31, 2017

Check out these big data quality components to set your big data environment straight

Few technological trends have garnered the acclaim that big data has. That's no surprise: used the right way, big data can turn your data into a competitive advantage. One of the most common paths to that advantage is using data analytics to build in-depth knowledge of your customers.

Data analytics can reveal a lot about your customers, such as who they are, what they like, and where they live, so you can deliver a personal, tailored customer experience. The classic example is sending marketing materials targeted to an individual based on their specific tastes. This data-driven approach boosts customer loyalty and improves engagement and visibility across the channel.

Yet big data success isn't quite that easy. The bedrock of data-driven insights is good quality data. Because data arrives so fast, many organizations load it into their big data environment with no information about, or validation of, its quality. If you can't trust the data or validate its quality, what good are the insights? Without quality data in your big data environment, the competitive advantage can be completely lost.

To deliver a personalized data strategy you can trust for reliable insights, data quality in your big data environment is critical. Below are five components you should look for in a big data quality solution.

File Monitoring 

File monitoring within a big data quality solution ensures you receive the data you're supposed to receive, at the right time, which directly affects how complete and current your reporting and analysis are. A proper file monitoring rule compares the files in a landing zone, and the times they were received, against a list of expected files and their deadlines. The end result should be a dashboard listing every file, when it was expected, when it actually arrived, and whether it was on time, late, or missing. This gives users confidence that the data is current.
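To make this concrete, here is a minimal Python sketch of such a rule. It assumes a local landing-zone directory and a hard-coded manifest of expected files and deadlines; in practice both would come from a scheduler or metadata store, and all names and paths here are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical manifest: expected file name -> arrival deadline (UTC).
EXPECTED = {
    "claims_20170831.csv": datetime(2017, 8, 31, 6, 0, tzinfo=timezone.utc),
    "policies_20170831.csv": datetime(2017, 8, 31, 6, 30, tzinfo=timezone.utc),
}

def monitor_landing_zone(landing_zone):
    """Compare files in the landing zone against the expected manifest."""
    report = []
    zone = Path(landing_zone)
    for name, deadline in EXPECTED.items():
        path = zone / name
        if path.exists():
            # Use the file's modification time as its arrival time.
            arrived = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            status = "on time" if arrived <= deadline else "late"
        else:
            arrived, status = None, "missing"
        report.append({"file": name, "expected": deadline,
                       "arrived": arrived, "status": status})
    return report

# Each row feeds the dashboard described above.
for row in monitor_landing_zone("/data/landing"):
    print(row)
```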

Data Quality

Typically, a big data environment is divided into at least two zones: an ingestion zone (or landing zone) where files enter the big data environment, and a consumer zone where the data is staged and prepped for analytics and reporting.

At the consumer zone or preparation stage, the platform should check and profile the quality of the data sets. This means validating the completeness, type conformance, value conformance, and consistency of the data, as described below (a minimal sketch of these checks follows the list):

  • Completeness – Check for null values or empty fields. Verify that there isn’t anything critical missing within the data set.
  • Type conformance – Ensure the data follows a specific pattern and that it is the right type for that specific field.
  • Value conformance – Validate that the data is within a specific range. For example, when you have a premium amount, it should be between $300 and $1,200. If it’s not within that range, it needs to be flagged.
  • Consistency – Verify that related fields make sense together. For example, a policy's end date should not precede its start date.
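Here is a minimal Python sketch of these four checks, using hypothetical insurance policy records. The field names and patterns are made up for illustration; the $300 to $1,200 premium range comes from the example above.

```python
import re

# Hypothetical policy records; in practice these would be read from the consumer zone.
records = [
    {"policy_id": "P-1001", "premium": 850.0, "start": "2017-01-01", "end": "2017-12-31"},
    {"policy_id": None,     "premium": 150.0, "start": "2017-06-01", "end": "2017-03-01"},
]

POLICY_ID = re.compile(r"^P-\d{4}$")  # assumed ID pattern for this example

def profile(record):
    """Return the list of rule violations for one record."""
    issues = []
    # Completeness: no critical field may be null or empty.
    if not record.get("policy_id"):
        issues.append("completeness: policy_id missing")
    # Type conformance: the field must match its expected pattern.
    elif not POLICY_ID.match(record["policy_id"]):
        issues.append("type: policy_id does not match P-NNNN")
    # Value conformance: premium must fall within the allowed range.
    if not (300 <= record["premium"] <= 1200):
        issues.append(f"value: premium {record['premium']} outside $300-$1,200")
    # Consistency: related fields must make sense together.
    if record["end"] < record["start"]:  # ISO dates compare correctly as strings
        issues.append("consistency: end date precedes start date")
    return issues

for r in records:
    print(r.get("policy_id"), profile(r))
```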

Data Accuracy 

Once you have measured data quality, you also need data accuracy. As data moves from the ingestion zone to the consumer zone, it may be enriched for analytics, and records can be dropped or changed along the way. A big data quality solution must therefore validate accuracy by reconciling the data between the two zones, confirming that every record that entered the ingestion zone made it to the consumer zone.
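A minimal sketch of that reconciliation, assuming you can pull a set of record keys from each zone (in a real pipeline these would come from Hive or Spark queries; the keys here are invented):

```python
def reconcile_zones(ingestion_keys, consumer_keys):
    """Check that every record ingested made it to the consumer zone."""
    ingested = set(ingestion_keys)
    consumed = set(consumer_keys)
    return {
        "ingested": len(ingested),
        "consumed": len(consumed),
        "dropped": sorted(ingested - consumed),    # lost between the zones
        "unexpected": sorted(consumed - ingested), # present with no known source
    }

result = reconcile_zones(["A1", "A2", "A3"], ["A1", "A3"])
print(result)  # {'ingested': 3, 'consumed': 2, 'dropped': ['A2'], 'unexpected': []}
```

Comparing keys rather than raw counts catches the case where the totals match but different records were dropped and duplicated.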

Reconciling Data 

After you have good quality data, the platform should let you reconcile it by correlating data in your big data environment back to the originating sources outside it. For example, you might reconcile a company's financial reports against the data in its big data environment. Correlating critical financial details back to the general ledger ensures the data has remained accurate and consistent from origination to its final destination in the big data environment.
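A minimal sketch of that kind of reconciliation, comparing hypothetical general-ledger totals against totals computed in the big data environment, with a small tolerance for rounding (account names and figures are invented):

```python
# Hypothetical aggregates: GL totals vs. totals computed in the big data
# environment, keyed by account. Real figures would come from the GL system
# and from a Spark/Hive aggregation respectively.
ledger_totals = {"4000-revenue": 1_250_000.00, "5000-expenses": 840_000.00}
bigdata_totals = {"4000-revenue": 1_250_000.00, "5000-expenses": 839_250.00}

TOLERANCE = 0.01  # allow for rounding differences

for account in sorted(set(ledger_totals) | set(bigdata_totals)):
    gl = ledger_totals.get(account, 0.0)
    bd = bigdata_totals.get(account, 0.0)
    status = "OK" if abs(gl - bd) <= TOLERANCE else f"MISMATCH (off by {gl - bd:+,.2f})"
    print(f"{account}: GL={gl:,.2f} BigData={bd:,.2f} -> {status}")
```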

Data Governance 

Data governance involves documenting the rules that have been put in place around the data. A good data governance solution records who is responsible for each rule, what happens when the rule passes or fails, and what the rule impacts, so that when something changes you know which parts of the business are affected. Data governance also means tracking the lineage of your data: where it originated, what controls it passed through, and how each was applied. This confirms that the data went through the proper steps, with proper controls, to verify its accuracy. Data that fails these checks should be pushed to a workflow that manages exceptions.
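One way to make such rules and lineage concrete is to store them as structured metadata. Below is a minimal sketch with hypothetical record types and example values; a real solution would persist these in a metadata catalog rather than in code.

```python
from dataclasses import dataclass, field

@dataclass
class GovernedRule:
    """Hypothetical metadata record for one data quality rule."""
    name: str
    owner: str                  # who is responsible for the rule
    on_fail: str                # procedure when the rule fails
    impacts: list = field(default_factory=list)  # downstream assets affected

@dataclass
class LineageEvent:
    """One hop in a data set's lineage: where it came from, what was applied."""
    dataset: str
    source: str
    control: str                # rule or transformation applied at this step

rules = [
    GovernedRule("premium_range", owner="finance-dq@example.com",
                 on_fail="route to exception workflow",
                 impacts=["pricing_dashboard", "renewal_model"]),
]
lineage = [
    LineageEvent("policies", source="/data/landing/policies_20170831.csv",
                 control="premium_range"),
]

# The impacts field answers "what breaks if this rule changes?"; the lineage
# trail answers "which controls has this data set passed through?".
print(rules[0].impacts, lineage[0].control)
```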

With a big data quality solution that covers everything listed above, the end result is accuracy and confidence in your big data environment, and a successful analytics program.
