Evaluating Data Lakes and Data Warehouses as Machine Learning Data Repositories

Data is the lifeblood of modern enterprises, whether you’re a retailer, financial service company, or digital advertiser. Across industries, organizations are recognizing the importance of their data for business analytics, machine learning, and AI.

Smart businesses are investing in new ways to extract value from their data: to better understand customer needs and behaviors, tailor new products and services, and make strategic decisions that will deliver competitive advantages in the years to come.

For decades, enterprise data warehouses have been used for all types of business analytics, with a robust ecosystem around SQL and relational databases. Now, a challenger has emerged.

Data lakes were created to store big data for training AI models and predictive analytics. This post covers the pros and cons of each repository: how they are used and, ultimately, which delivers the best outcomes for ML projects.

Key to this puzzle is processing data for AI and ML workflows. AI projects require massive amounts of data for training models and running predictive analytics. Technical teams must evaluate how to capture, process, and store data so that it is scalable, affordable, and readily available.

Banner ad for free e-book entitled Accelerating Apache Spark 3: Leveraging NVIDIA GPUs to Power the Next Generation of Analytics and AI

What’s a data warehouse?

Data warehouses were created in the 1980s to help enterprise companies organize high data volumes for the purpose of making better business decisions. Data warehouses are used with legacy sources such as enterprise resources planning (ERP), customer relationship management (CRM) software, inventory, and point of sale systems.

The primary goal is to provide operational reporting across lines of business, product analytics, and business intelligence.

Data warehouses have used ETL (extract, transform, load) for decades, with a bias for completing transform and clean data before uploading it. Traditional data warehouses have stringent standards for data structure and advance planning to meet schema requirements.

Data is only stored in data warehouses when it has already been processed and refined. ETL processes data by first cleaning data and then uploading into a relational database. The upside is that the data is in good shape and can be used. However, there is processing overhead that you pay up front, which is lost if the data is never used.
Data analysts must create a predetermined data structure and fixed schema before they can run their queries. This blocker is a huge pain point for data scientists, analysts, and other lines of business, as it takes months or longer to be able to run new queries.
Typically, the data in a warehouse is read-only, so it can be difficult to add, update, or delete data files.

Upside: Data quality

With any system, there are tradeoffs. The upside of data warehouses is their data is in good shape at ingest and will likely stay that way due to the discipline of data cleansing and data governance.

Traditional data warehouses excel as ledgers, providing clean, structured, and normalized data that serves as a single source of truth for an organization. Using relational databases, managers and business analysts across the organization can query massive amounts of corporate data quickly and accurately, to guide critical business strategies.

Downside: Schema requirement

Data warehouses are more likely to use ETL for operational analytics and machine learning workloads.

However, traditional data warehouses require a fixed schema for structuring the data, which could take months or years to agree across all teams and lines of business managers. By the time a schema gets implemented, its users have new queries, taking them back to square one.

It’s fair to say that data warehouse schema drove immense interest in data lakes.

Why use a data lake?

In the early 2000s, Apache Hadoop introduced a new paradigm for storing data in distributed file systems (HDFS) so enterprises could more easily mine their data for competitive advantage. The idea of a data lake came from Hadoop, enabling ingest of a wide spectrum of data types stored in low-cost blob or object storage.

Over the last decade, organizations have flocked to data lakes to capture diverse data types from the web, social media, sensors, Internet of Things, weather data, purchased lists, and so on. As big data gets bigger, data lakes became popular to store petabytes of raw data using elastic technologies.

Data lakes have two main draws: the easy ingest of a wide spectrum of data types and ready access to that data for improvised queries.

Using ELT (extract, load, transform), data lakes can ingest most any data type: structured, unstructured, semistructured, and binary for images and video.
Data going into a data lake does not have to be transformed before it is stored. Ingest is efficient, without the overhead of cleansing and normalizing data by type.
Data lakes make it easy to store all types of data (PDFs, audio, JSON documents) without knowing how that data might be used in the future.

Upside: Ad hoc queries

The upside of data lakes is that teams can access diverse data and run arbitrary queries on demand. The need to have data analytics available immediately is the main driver for adoption of data lakes.

Downside: Data degradation over time

Raw data goes bad fast in a data lake. There are few tools to tame raw data, making it hard to do merges, deduplication, and data continuity.

What do data warehouses and data lakes have in common?

Data warehouses and data lakes both function as large data repositories and have common characteristics and drawbacks, especially around cost and complexity.

Scale: Both have the capability to retain massive amounts of data, using both batch and streaming.
High costs: Both are wildly expensive, costing more than a million dollars a year to maintain.
Complexity: Data centers are managing dozens of unique data sources, with rapid volume growth of 50% a year or more. Storage infrastructure is taking more IT person hours, raising storage costs and driving down overall efficiency.
Data processing: Both can use ETL and ELT processing.
Shared use cases: As data scientists prioritize ML techniques to derive new insights from their data, many organizations are now getting the best of both worlds: AI-enabled data analytics and a wide range of diverse data types.

What’s the difference between data warehouses and data lakes?

Comparing data warehouses to data lakes is a bit like comparing apples and oranges. They offer different things.

A data warehouse organizes, cleans, and stores data for analysis.
A data lake stores many data types and transforms them on demand.

As teams become more focused on AI projects, the gaps in functionality, manageability, and data quality issues come to light, causing both approaches to evolve and improve.

Deployment

Data warehouses are more likely to be on-premises or in a hybrid cloud. Data lakes are more likely to be cloud-based to take advantage of more affordable storage options.

Data processing

Data warehouses are more likely to use ETL for operational analytics and machine learning workloads. Data lakes ingest data using ELT pipelines of raw data in case that it’s needed in the future. Data lakes also do not require a schema, so teams can pose improvised queries without significant delay.

Tools

Data lakes lack the robustness of a data warehouse, in terms of a functioning programming model and mature, enterprise-ready software and tools. Data lakes have a myriad of pain points, including no support for transactions, atomicity, or data governance.

Data quality

It’s always a problem. It’s a bigger problem for data lakes. Expect to do a lot of monitoring and maintenance on data in a data lake. If you can’t manage raw data efficiently, you can end up with a data swamp, where performance is poor and storage costs are out of control.

Roughly 85% of data lakes fail, Gartner estimates, due to low-quality data. As the adage says: Data pipelines are only as good as the data that flows through them.

Buy compared to build

Companies like Teradata, Oracle, and IBM can sell you a data warehouse for millions of dollars. Storage is one of the most expensive components, as average companies are seeing data volumes growing more than 50% a year.

To get a data lake, most companies build their own on a free PaaS using open source Apache Spark, Kafka, or Zookeeper. This does not mean that building and maintaining a data lake is less expensive, however.

By one estimate, it can cost upwards of a million dollars each year to hire the people for deploying a production data lake with cloud storage. Standing up a data lake can take 6 months to a year, if you can get the expertise.

What’s best for ML workloads?

The short answer is both. Most companies will use both a data warehouse and a data lake for AI projects. Here’s why.

Data lakes are popular because they can scale up for big data—petabytes or exabytes—without breaking the bank. However, data lakes do not offer an end-to-end solution for ML workloads, due to constraints in its programming model.

Many organizations adopted the Hadoop paradigm, only to find that it was nearly impossible to get highly skilled talent to extract data from a data lake using MapReduce. The introduction and development of Apache Spark has kept data lakes afloat, making it easier to access data.

Still, the Hadoop model has not fulfilled its promise for ML. Data lakes’ ongoing pain points include a lack of atomicity, poor performance, lack of semantic updates, and an evolving Spark engine for SQL.

Compare that to a data warehouse, which is compatible with an entire SQL ecosystem. Any software written for a SQL backend can access enterprise software. The methods range from a WYSIWYG frontend and drag-and-drop interfaces to automatically generated dashboards to fully automated ways to do Kube analysis and hyper Kubes, to name a few.

All the business intelligence and data analytics work of the last 30 years is inherited in SQL databases. None of it works on Hadoop or in data lakes.

More data warehouses are supporting ELT that’s commonly used by data lakes. A primary use case for data lakes is to ingest data into a data warehouse, so that data can be extracted and structured for ML projects. ELT enables data scientists to both define a way to structure data and to query it while keeping raw data as a source of truth.

The promise of a data lakehouse

For data engineers looking for a more robust data solution for their big data needs, a data lakehouse (a combination of a data lake and data warehouse) promises to address the drawbacks of data lakes.

A data lakehouse can offer data integrity and governance, support for transactions, and ongoing high performance, with help from an open-source framework called delta lake.

Hybrid cloud options

If you’re just starting with AI data architectures, companies like Amazon and Google are offering cloud-based data warehouses (Amazon Redshift, Google BigQuery) to help lower storage and deployment costs.

CoreDB is an open-source database service that functions as a data lake as a service under an Apache license.

Conclusion

Data warehouses and data lakes are both useful approaches to tame big data and make steps forward for advanced ML analytics. Data lakes are a recent approach to storing massive amounts in commercial clouds, such as Amazon S3 and Azure Blob.

The definitions of data warehouse and data lakes are evolving. Each approach is experimenting with new data processes and models for novel use cases. Going forward, techniques for optimizing performance will be critical, both for managing costs and for monitoring data hygiene in large repositories.

A data lake offers a more flexible solution for data analytics and can process and store data at a low price. However, the Hadoop data lake paradigm does not offer a fully functional solution for machine learning at scale today. Many organizations are forging new tactics and trying new tools to enable better functionality for both data warehouses and data lakes in the near future.

For more information, see the following resources: