Kafka Data Storage and the Impact on Data Quality

Jeffery Brown | February 19, 2020

Data management technologies are evolving faster than ever, and organizations are choosing to replace some of their old architectures with the latest and greatest. Just a few years ago, on-premises Hadoop data lakes built on distributions like Hortonworks, Cloudera and MapR were the preferred data storage solution for the vast majority of businesses. Today, while organizations still use on-premises Hadoop data lakes to store massive amounts of data, many companies have migrated to cloud-hosted data systems because they’re less expensive, more efficient and able to scale elastically.

The latest “hot” data management technology gaining traction is event-driven architecture, a streaming software design that models data as a series of discrete changes to information. Such streaming data software quickly communicates events such as updates to customer information, ecommerce transactions and completion of online forms. As a result, organizations can communicate data changes immediately, in real time.

Apache Kafka has emerged as the preferred distributed streaming platform, delivering high-throughput, low-latency real-time streaming, redundancy and scalability. In addition, the platform has flexible data retention capabilities, which is a major game-changer for many companies.

Data Integrity When Leveraging Kafka’s Shorter Data Storage Capabilities

Kafka has the potential to act as a storage layer, housing event-driven data for shorter, configurable periods of time, typically ranging from hours to weeks. Since Kafka’s fault-tolerant features can safely house data, businesses are using the platform to store smaller amounts of data changes from source systems (producers).
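
To make that concrete, here is a minimal sketch of setting a per-topic retention window with the confluent-kafka Python client. The broker address, topic name and seven-day window are illustrative assumptions, not values from this article.

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker

    # retention.ms is a per-topic setting, in milliseconds; seven days is
    # an illustrative window within the typical hours-to-weeks range.
    topic = NewTopic(
        "customer-updates",  # hypothetical topic name
        num_partitions=3,
        replication_factor=3,
        config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )

    for name, future in admin.create_topics([topic]).items():
        future.result()  # raises if the broker rejected the request
        print(f"created topic {name}")

Once the window elapses, the broker is free to delete the oldest log segments, which is what keeps Kafka’s storage footprint bounded.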

By storing data directly in the Kafka platform, consumers of the data have recent history at the ready to be loaded, which is especially useful when a new application needs to obtain a larger historical data dump. Because the platform doesn’t hold massive volumes of historical information, topics typically provide only a shorter timeframe of retained data. However, because Kafka acts as the highway between systems, connecting two points with data transfers, its users expect not only real-time data but also data with the highest level of integrity.
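
For illustration, a new application could replay whatever history a topic still retains by starting from the earliest available offset. This sketch again uses the confluent-kafka Python client; the group id, topic name and process() handler are hypothetical.

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "new-analytics-app",  # a fresh group id has no committed offsets
        "auto.offset.reset": "earliest",  # so reading begins at the start of retention
    })
    consumer.subscribe(["customer-updates"])

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue  # no message within the timeout; keep polling
        if msg.error():
            raise RuntimeError(msg.error())
        process(msg.value())  # hypothetical handler for each retained event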

Establishing High Quality Data for Kafka

As the mantra goes, “Garbage in, garbage out.” As more and more companies invest considerable time, money and resources in streaming data initiatives, data integrity has to be at the forefront of their data strategy. After all, generating and transmitting high volumes of data between systems will help the organization excel, but if you’re working with bad data, you will get bad results.

According to Gartner, companies lose up to $15 million every year due to poor data quality. Low-integrity data breeds distrust among data users and consumes more resources, preventing users from leveraging data, wherever it is stored, to gain critical business insights. Bad data can quickly pollute data streaming processes and turn into a business liability. A company’s focus should be on ensuring data integrity across the entire architecture landscape, from source systems, through the Kafka platform, and down into the target systems.

To protect data quality, companies need data integrity methods and technologies that can validate quality as quickly as their data streams across the platform. With the right solution, organizations can enact in-line data integrity checks to ensure data completeness, conformity, accuracy and consistency. These checks happen almost immediately, and if any data fails a check, it is re-routed for investigation and remediation before it moves on to consumer systems. This is critical for downstream systems, especially those used for financial reporting, customer-facing applications or regulatory purposes.
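
As one sketch of how such a check could sit in-line between topics (the topic names, field names and validation rules below are illustrative assumptions, not a description of any particular product): a small service consumes raw events, validates each one for completeness and conformity, and re-routes failures to a quarantine topic for investigation while clean records flow on.

    import json

    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "dq-validator",  # hypothetical validation service
        "auto.offset.reset": "earliest",
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    consumer.subscribe(["customer-updates"])

    REQUIRED_FIELDS = ("customer_id", "event_type", "timestamp")  # illustrative schema

    def is_valid(payload):
        # Completeness/conformity check: the record must be valid JSON
        # and carry every field downstream systems rely on.
        try:
            record = json.loads(payload)
        except (ValueError, TypeError):
            return False
        return isinstance(record, dict) and all(f in record for f in REQUIRED_FIELDS)

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        if is_valid(msg.value()):
            producer.produce("customer-updates-clean", msg.value())
        else:
            # Failed records are re-routed for remediation instead of
            # moving on to consumer systems.
            producer.produce("customer-updates-quarantine", msg.value())
        producer.poll(0)  # serve delivery callbacks

The same pattern extends to accuracy and consistency checks against reference data; the point is that validation happens in the stream, before bad records can reach financial, customer-facing or regulatory systems.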

Fast-growing event-driven architectures and streaming data platforms like Kafka are gaining popularity, and with that popularity comes accelerating risk. In addition, other modern technologies like IoT and new digital transformation mechanisms continue to change the way businesses collect, store, share, manage and utilize data. Regardless of technological advancements, building a foundation for business initiatives starts with data quality.

Are you looking to solve data quality challenges for streaming data? Check out the eBook below.

Get Insights

For a deeper dive into this topic, visit our Kafka resource center. Here you will find a broad selection of content that represents the compiled wisdom, experience, and advice of our seasoned data experts and thought leaders.

Download the eBook