Defending Data Quality in a Real-Time Data World

How to Ensure Data Reliability in Kafka

Jeffery Brown | September 25, 2019


With a constant stream of data being generated from a diverse set of sources such as social networks, e-commerce, transactions, IoT devices, and web applications, being able to react quickly is critical. The traditional batch processing approach, where large files of data are sent on a scheduled basis from system to system, simply can’t meet the demands of a highly reactive, changing data landscape. And in a world of 24-hour news cycles and an on-demand global economy, data consumers and customers increasingly expect real-time data feeds, access and analysis. Enter event-driven architecture.

Just a decade ago, data was still largely a concern of IT, cloud computing was in its infancy and the term “data lake” had yet to be coined. But as big data moved from novelty to everyday reality, organizations began looking for new ways to store massive amounts of data and run software and applications. That quickly blossomed into looking for new ways to share, synchronize and update that data in real time between systems.

Organizations increasingly recognize the importance of digital transformation to extract more value from their strategic data assets. But digital transformation isn’t just about updating legacy systems, adopting a data strategy and following the latest DataOps principles. It’s also about adjusting to the pace of today’s demand for speed to insights, and giving data consumers real-time access to data.

Event-Driven Architecture and the Rise of Kafka

Event-driven architecture is a software design pattern that models data as a stream of discrete changes to data state. Simply put, it allows companies to communicate changes to data immediately, as they occur (data in motion). This represents a fundamental shift in how organizations send data between systems, and provides an ideal model for real-time API updates. Messaging system software quickly communicates events such as completion of online payments, inventory updates or changes to contact information from point A to point B. Among available messaging options, the open source, distributed streaming platform Apache Kafka has quickly emerged as the preferred choice, delivering high-throughput, low-latency real-time streaming, flexible data retention, redundancy and scalability.

According to Neha Narkhede, co-founder of Confluent and one of Kafka’s original developers, “the reason about 60% of Fortune 100 companies are using [Kafka] is because they’re moving from a world where data was primarily at rest, processed in batches, to a world where software and data are a significant part of their business, and that means treating data in motion.” Kafka quickly sends messages containing new or updated data from the systems or applications that created it (called producers) to the systems or applications that consume it (called consumers), which subscribe to specific topics. These consumers receive all new data related to the topics they’ve subscribed to. Kafka keeps all of these messages for a set period of time, and makes multiple copies of every message to ensure the reliability of the data stored. This retention and redundancy prevent data loss and help provide a fault-tolerant solution.
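To make the producer/consumer model concrete, here is a minimal sketch using the open source confluent-kafka Python client. The broker address, topic name and payload are illustrative assumptions, not part of any specific deployment.

```python
import json
from confluent_kafka import Producer, Consumer

# Producer: publishes an "inventory updated" event to a topic (topic name is hypothetical).
producer = Producer({"bootstrap.servers": "localhost:9092"})
event = {"sku": "A-1001", "quantity": 42, "warehouse": "east"}
producer.produce("inventory-updates", value=json.dumps(event).encode("utf-8"))
producer.flush()  # block until the broker has acknowledged the message

# Consumer: subscribes to the same topic and receives every new event published to it.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inventory-readers",
    "auto.offset.reset": "earliest",  # start from the oldest retained message
})
consumer.subscribe(["inventory-updates"])

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```

Because the broker retains and replicates each message, any number of consumer groups can read the same topic independently, which is what gives the model its fault tolerance.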

Ensuring Streaming Data Quality

As more and more companies explore the possibilities of Kafka, streaming data, or an event-driven architecture, they want to ensure that data integrity is maintained within high-throughput platforms before they invest significant time, money and resources in these initiatives. Without data quality, data quickly turns from asset to liability. The good news is that vendors are coming to market to help companies safeguard data quality and provide validation at a speed and scale that match their data in motion.

Organizations can set in-line data checks to ensure data conformity and completeness, verify counts and amounts, and identify patterns and threshold violations. These checks can happen in near-real time, and data that “fails” can be routed for investigation, resolution and tracking before it moves to consumer systems.
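As an illustration of what an in-line check might look like, the sketch below validates each message for completeness and a simple threshold as it streams through, and routes failures to a quarantine topic for investigation and tracking. The topic names, field names and threshold are hypothetical; a dedicated data quality tool would supply far richer rules.

```python
import json
from confluent_kafka import Consumer, Producer

REQUIRED_FIELDS = {"order_id", "amount", "currency"}  # completeness rule (illustrative)
MAX_AMOUNT = 1_000_000                                # threshold rule (illustrative)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dq-inline-checks",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])                        # hypothetical source topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

def validate(record):
    """Return a list of rule violations for one event."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
    if "amount" in record and record["amount"] > MAX_AMOUNT:
        errors.append("amount exceeds threshold")
    return errors

while True:  # run until interrupted
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    errors = validate(record)
    if errors:
        # Route "failed" data to a quarantine topic before it reaches consumer systems.
        producer.produce("orders-quarantine",
                         value=json.dumps({"record": record, "errors": errors}).encode("utf-8"))
    else:
        producer.produce("orders-validated", value=msg.value())  # pass clean data through
    producer.poll(0)  # serve delivery callbacks without blocking
```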

Other data quality strategies for Kafka include conducting checks on data in batch, where messages are ingested and batch processed for validation or reconciliation. This approach can also provide reconciliations on data within the Kafka “reservoir” of retained messages, which provides insight into aggregated transactions or balances from system to system.
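A batch-style reconciliation can take advantage of Kafka's retention by re-reading a topic from the earliest retained offset and aggregating counts and amounts for comparison against the source system. The sketch below is an assumption-laden illustration: the topic name, the amount field and the expected totals are placeholders.

```python
import json
from confluent_kafka import Consumer

# Re-read the retained "reservoir" of messages from the beginning of the topic.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dq-batch-reconciliation",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,   # a reconciliation run should not move consumer offsets
})
consumer.subscribe(["orders"])     # hypothetical topic

count, total_amount = 0, 0.0
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:                # no more retained messages within the timeout
        break
    if msg.error():
        continue
    record = json.loads(msg.value())
    count += 1
    total_amount += record.get("amount", 0.0)
consumer.close()

# Compare aggregated counts and amounts with the figures reported by the source system.
expected_count, expected_total = 10_000, 4_250_000.00   # placeholder source-system totals
print("count matches:", count == expected_count)
print("amount matches:", abs(total_amount - expected_total) < 0.01)
```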

The popularity of event-driven architecture will continue to grow, and high-velocity digital transformation will continue to alter the ways that organizations collect, store, share, manage and leverage data. But no matter what advancements may occur, data quality will always remain a fundamental component of any successful data strategy. To learn more about how Infogix can help you with your data quality challenges, download our eBook below.

Get Insights

For a deeper dive into this topic, visit our resource center. Here you will find a broad selection of content that represents the compiled wisdom, experience, and advice of our seasoned data experts and thought leaders.

Download eBook