In my last blog, we talked about data lakes, their many benefits, and why organizations should consider building one. We explored the value of data lakes, including the diversity of usage and their intrinsic ability to house data of many types. There's no question: data lakes help overcome scalability and data-duplication issues, resulting in increased information use and sharing, and reduced costs through server and license reduction.
However, while data lakes offer many advantages, they also have their challenges.
Data Quality: Collecting data is an important task for all businesses, and because a data lake typically accepts any data without oversight or governance, companies take a risk when their data contains errors or lacks descriptive metadata. Unreliable data can negatively impact revenue, cause missed opportunities and insights, and lead to bad decision making. It also undermines audit trails and logging, costing the company more to restore its data integrity. Finally, if there is no way to identify who stored the data or to trace the lineage of findings, those tasked with correlating the data find themselves working in disconnected data pools, the very data silos the organization is trying hard to avoid.
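One lightweight way to reduce this risk is to reject data sets that arrive without the descriptive metadata needed to trace them later. The following is a minimal sketch, not a standard; the required field names (`owner`, `source_system`, `description`) and the `validate_ingest` helper are illustrative assumptions.

```python
from datetime import datetime, timezone

# Illustrative set of metadata fields every data set must carry.
REQUIRED_METADATA = {"owner", "source_system", "description"}

def validate_ingest(metadata: dict) -> list:
    """Return a list of problems; an empty list means the data set
    may be admitted to the lake."""
    problems = [f"missing metadata field: {f}"
                for f in sorted(REQUIRED_METADATA - metadata.keys())]
    if not problems:
        # Stamp the ingestion time so lineage questions can be
        # answered later.
        metadata.setdefault(
            "ingested_at", datetime.now(timezone.utc).isoformat())
    return problems

# A record with an owner and a source but no description is flagged:
print(validate_ingest({"owner": "finance", "source_system": "crm"}))
```

A check like this runs in the ingestion pipeline, before data lands in the lake, so that every stored object can answer "who owns this and where did it come from?"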
Data at Risk: Data lake architecture takes a 'store everything' approach, but securing the data inside is still an immature capability. Organizations need to consider how to secure data stored on platforms like Hadoop, as well as identify proper access-control protocols. Without establishing oversight of the content, an organization unknowingly increases its privacy and regulatory risk exposure.
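At its simplest, access control over the lake means deny-by-default checks on paths. The sketch below is a toy illustration only; a real deployment would enforce this with HDFS ACLs or a policy engine rather than an in-memory dict, and the roles and paths shown are invented for the example.

```python
# Illustrative path-to-roles policy; not a real product's format.
ACCESS_POLICY = {
    "/lake/raw/hr": {"hr-analysts"},
    "/lake/raw/sales": {"sales-analysts", "data-science"},
}

def can_read(user_roles: set, path: str) -> bool:
    # Deny by default: a path with no policy grants access to no one.
    allowed = ACCESS_POLICY.get(path, set())
    return bool(user_roles & allowed)

print(can_read({"data-science"}, "/lake/raw/sales"))  # True
print(can_read({"data-science"}, "/lake/raw/hr"))     # False
```

The key design point is the default: in a 'store everything' architecture, unlisted content should be unreadable until someone explicitly decides who may see it.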
Performance Management: For some, the performance and ease of use of a data lake are difficult to achieve. Many find it complex to maintain a separate system for extracting raw data only to put the data back into a relational system, like a data warehouse, for analysis. It's important to remember that data lakes are simply storage repositories and are not a replacement for existing analytical platforms or infrastructure. Instead, they complement existing efforts by storing new and different types of data.
Optimization: Data lakes are housed using highly complex software that's difficult to optimize for performance, a problem that can be compounded when installing the software on unfamiliar cloud infrastructure or in a data center. No one wants highly skilled, expensive data scientists and analysts spending time troubleshooting software when they could be focused on solving the business' analytical challenges. To extract additional value from the data lake, many companies supplement their in-house staff with Data Science as a Service, either to accelerate parts of an existing analytical project or to add resource capacity for additional projects.
Data Governance: The importance of data governance cannot be overstated. Data governance is the process of putting standards, processes, and controls around enterprise data to ensure its availability, usability, integrity, and security. Existing governance policies should reinforce the framework for business drivers to mitigate doubts and second-guessing, and should be extended to cover data ingestion and the evaluation of internal and external sources, third-party information, and specialized user data sets. And because access controls, mobility, and security are of the utmost concern, these technology requirements should be defined up front.
Data Lineage: The design of the data lake can hinder acceptance of the data governance needed to monitor and validate data lineage. To ensure data scientists extract and make the most of the data stored in the data lake, organizations must develop views around usage patterns. It's key for new data users to understand who has used these data sets before them and for what purpose, and to see recommendations on the quality and relevance of the data. Users can also contribute to a reliable feedback and rating mechanism by stating whether they recommend a data set for further analysis.
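The usage-pattern and rating mechanism described above can be sketched as a simple record attached to each data set: who used it, for what, and whether they recommend it. This is an assumed data model for illustration, not a reference to any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetLineage:
    """Per-data-set usage history and user recommendations."""
    name: str
    usages: list = field(default_factory=list)

    def record_usage(self, user: str, purpose: str,
                     recommend: bool, note: str = "") -> None:
        # Each entry answers: who used this data, and for what?
        self.usages.append({"user": user, "purpose": purpose,
                            "recommend": recommend, "note": note})

    def recommendation_rate(self) -> float:
        """Fraction of prior users who recommend the data set."""
        if not self.usages:
            return 0.0
        return sum(u["recommend"] for u in self.usages) / len(self.usages)

sales = DatasetLineage("raw_sales")
sales.record_usage("ana", "churn model", True, "clean after 2015")
sales.record_usage("ben", "revenue forecast", False, "gaps in Q3")
print(sales.recommendation_rate())  # 0.5
```

Even a rudimentary record like this lets a new analyst see at a glance whether a data set has proven useful to colleagues before investing time in it.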
Legacy Data: Many legacy systems contain software patches, Band-Aid solutions, and aging design. As a result, the raw data may provide limited value outside its legacy context. The data lake performs optimally when supplied with unadulterated data from source systems, and rich metadata built on top.
With the Internet of Things (IoT), mobile devices, and wearables, building data lakes for big data is a trend that will grow substantially in the coming months, and their many advantages will have a large impact on organizations of all kinds. There is great value to be had from a data lake that is well planned and created with these challenges in mind. Creating a strategic plan and addressing these challenges and gaps as a top priority will help you draw value from your data lake.
To learn more about maintaining data integrity in data lakes, check out this data sheet.
For a deeper dive into this topic, visit our resource center, where you will find a broad selection of content that represents the compiled wisdom, experience, and advice of our seasoned data experts and thought leaders.