What is a data lake?

A data lake is a type of data repository that stores large and varied sets of raw data in its native format. Data lakes let you keep an unrefined view of your data. They are becoming a more common data management strategy for enterprises that want a holistic, large repository for their data.

Raw data is data that hasn’t yet been processed for a specific purpose. Data in a data lake isn’t defined until it is queried. Data scientists can access the raw data when they need it, using advanced analytics tools or predictive modeling.

All data is kept when using a data lake; none of it is removed or filtered prior to storage. The data might be used for analysis soon, later, or never at all. It can also be used many times for different purposes, whereas data that has been refined for one specific purpose is difficult to reuse in other ways.

Unfiltered and unstructured data

The term "data lake" was introduced by James Dixon, Chief Technology Officer of Pentaho. Describing this type of data repository as a lake makes sense because it stores a pool of data in its natural state, like a body of water that hasn’t been filtered or packaged. Data flows from multiple sources into the lake and is stored in its original format. 

Data in a data lake isn’t transformed until it is needed for analysis; only then is a schema applied so the data can be analyzed. This is called "schema on read," because data is kept raw until it is ready to be used.
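
To make "schema on read" concrete, here is a minimal sketch in Python using pandas; the directory, file, and field names are invented for illustration. Raw records are written to the lake untouched, and types are applied only when the data is read for analysis.

```python
# A hypothetical schema-on-read sketch. In practice the "lake" would be
# object storage such as S3 rather than a local directory.
import json
import os

import pandas as pd

# Ingest: raw events land in the lake exactly as they arrive; no schema
# is enforced at write time.
os.makedirs("lake", exist_ok=True)
raw_events = [
    {"user": "alice", "amount": "19.99", "ts": "2024-05-01T10:00:00"},
    {"user": "bob", "amount": "5.00", "ts": "2024-05-01T10:05:00"},
]
with open("lake/events.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# Read: a schema is applied only now, when an analyst actually needs it.
df = pd.read_json("lake/events.jsonl", lines=True)
df["amount"] = df["amount"].astype(float)  # cast strings to numbers
df["ts"] = pd.to_datetime(df["ts"])        # parse timestamps
print(df.dtypes)
```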

Ad hoc access to data

Data lakes allow users to access and explore data in their own way, without needing to move the data into another system. Insights and reporting obtained from a data lake typically occur on an ad hoc basis, instead of regularly pulling an analytics report from another platform or type of data repository. However, users could apply schema and automation to make it possible to duplicate a report if needed. 
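
As a hypothetical illustration of this ad hoc style, the sketch below uses DuckDB to run SQL directly over Parquet files sitting in a lake directory, with no load step into another system; the glob path and column names are invented.

```python
# A hypothetical sketch: ad hoc SQL over files in place with DuckDB.
# The glob path and column names are invented for illustration.
import duckdb

result = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM 'lake/events/*.parquet'  -- query the raw files directly
    GROUP BY customer_id
    ORDER BY total_spend DESC
""")
print(result)
```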

Data lakes need governance and continual maintenance to keep the data usable and accessible. Without this upkeep, you risk letting your data become junk: inaccessible, unwieldy, expensive, and useless. Data lakes that become inaccessible to their users are referred to as "data swamps."

Storing large and varied sets of raw data in their native format in a data lake offers many advantages for an organization.

  • They’re scalable. Data lakes can handle large volumes of structured, semi-structured, and unstructured data at scale. They store data without the need for a predefined schema, allowing for the ingestion of diverse data types, which can improve computing performance. Modern data lake solutions leverage distributed computing frameworks, enabling efficient processing of large datasets (see the sketch after this list).
  • Data lakes are a cost-effective option for storing vast amounts of data because they typically use low-cost storage solutions, such as cloud-based object storage. Structured as centralized data storage, data lakes reduce the need for maintaining multiple copies of the same data across different systems.
  • Data lakes’ "schema on read" approach offers greater flexibility than traditional data warehouses. By storing data in its native format, data lakes possess greater agility to integrate and analyze diverse datasets.
  • Compared to traditional data warehouses, the central repository that a data lake provides enables a comprehensive view of organizational data. This consolidation of data improves access to data and removes barriers to data sharing and collaboration.
  • Data governance becomes easier with data lakes’ centralized repository. Features for data governance like metadata management, data lineage, and access controls ensure data quality, consistency, and compliance with regulations.
  • All of the previous benefits lead to more innovation. Data lakes act as a sandbox environment for data scientists to explore and experiment with data without affecting production systems. Faster data ingestion and flexible analysis in data lakes accelerate insights, which improves agility and responsiveness to market changes.
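
As a hypothetical illustration of the distributed-processing point above, the sketch below uses PySpark to scan a lake of Parquet files in parallel; the path, column name, and local master setting are assumptions, not part of this article.

```python
# A hypothetical sketch: PySpark scanning a lake of Parquet files.
# Path and column names are invented; a real cluster would replace
# the local[*] master.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # stand-in for a real cluster
    .appName("lake-scan")
    .getOrCreate()
)

# Spark's workers read the lake's files in parallel.
events = spark.read.parquet("lake/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.show()

spark.stop()
```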

Common use cases for data lakes include:

1. Advanced analytics and machine learning: Their ability to store large amounts of data in its native format makes data lakes essential to performing advanced analytics and machine learning. Data lakes can collect and integrate diverse data sources such as customer interactions, sales data, and social media activity. This allows data scientists to develop predictive models and sophisticated AI applications, driving better business insights and decision-making.

2. Real-time data processing: Because data lakes support real-time data ingestion and processing, they are ideal for applications that require immediate insights, such as financial trading, fraud detection, and operational monitoring. A data lake can monitor transaction data in real time, identifying and preventing fraudulent activity as it happens. In manufacturing facilities, real-time data from machinery can be used to detect anomalies and support predictive maintenance, reducing downtime and improving efficiency.

3. Data consolidation and integration: Data lakes can integrate data from multiple sources into a single, unified repository, eliminating data silos. This is particularly useful for creating a comprehensive view of customers. A retail company could combine data from purchase histories, website interactions, and social media to better understand customer behavior and deliver personalized marketing campaigns (a sketch of this kind of consolidation follows this list).

4. Regulatory compliance and data governance: Because data lakes provide a secure and scalable place to store vast amounts of data, they can help organizations comply with regulations like GDPR, HIPAA, and CCPA. This is critical for industries such as healthcare and finance, which must adhere to stringent regulatory requirements for data storage and security.

5. Edge device data management: Edge devices generate enormous amounts of data, and data lakes are equipped to store and process such high volumes and varieties of data. On the edge, this data may include sensor readings, smart meter data, and connected device logs. This capability supports use cases like smart city management, industrial automation, and predictive maintenance.
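
Below is a minimal consolidation sketch in Python with pandas, as referenced in use case 3. The file paths and column names are invented for illustration; a production pipeline would typically read from object storage and run at much larger scale.

```python
# A hypothetical sketch: unify two raw lake sources into one customer
# view. All paths and column names here are invented for illustration.
import pandas as pd

# Raw zone: data sits in the lake in whatever shape it arrived.
purchases = pd.read_csv("lake/raw/purchases.csv")               # customer_id, total
clicks = pd.read_json("lake/raw/web_events.jsonl", lines=True)  # customer_id, page, ts

# Aggregate each source, then join on the shared customer key.
spend = purchases.groupby("customer_id")["total"].sum().rename("total_spend")
visits = clicks.groupby("customer_id").size().rename("site_visits")
customer_view = pd.concat([spend, visits], axis=1).fillna(0)

# Curated zone: write the unified view back to the same lake.
customer_view.to_parquet("lake/curated/customer_view.parquet")
```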

Data lakes provide the agility and adaptability to address many modern use cases for data storage and processing.

Though they are often confused, data lakes and data warehouses are not the same and serve different purposes. Both are data storage repositories for big data, but this is where the similarities end. Many enterprises will use both a data warehouse and a data lake to meet their specific needs and goals. 

The main difference between the two is this: a data warehouse provides a structured data model designed for reporting, while a data lake stores raw data, often unstructured, with no currently defined purpose.

Before data can be loaded into a data warehouse, it must be processed: a schema is defined up front, and decisions are made about what data will or will not be included. This approach is referred to as "schema on write."

The process of refining data before storing it in a data warehouse can be time-consuming and difficult, sometimes taking months or even years, and it prevents you from collecting data right away. With a data lake, you can start collecting data immediately and decide what to do with it later.
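
To make the contrast concrete, here is a minimal sketch (not from the original article) in which SQLite stands in for a warehouse's schema-on-write behavior and a plain append-only file stands in for the lake; all table, file, and field names are hypothetical.

```python
# A hypothetical contrast sketch; SQLite is only a stand-in for a
# warehouse, and the table, file, and field names are invented.
import json
import os
import sqlite3

# Schema on write: the structure must be declared before loading, and
# decisions about what the table holds are made up front.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sku TEXT, qty INTEGER, price REAL)")
warehouse.execute("INSERT INTO sales VALUES (?, ?, ?)", ("A100", 2, 9.99))

# Schema on read: the lake accepts any record immediately; structure is
# the reader's concern later.
os.makedirs("lake", exist_ok=True)
with open("lake/sales.jsonl", "a") as f:
    f.write(json.dumps({"sku": "A100", "qty": 2, "note": "flash sale"}) + "\n")
```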

Because of their structure, data warehouses are more often used by business analysts and other business users who know in advance what data they need for regular reporting. A data lake is more often used by data scientists and analysts, who perform research with the data and apply more advanced filtering and analysis before it becomes useful.

Data lakes and data warehouses also typically use different hardware for storage. Data warehouses can be expensive, while data lakes can remain inexpensive despite their large size because they often use commodity hardware.

Cloud solutions offer scalability and cost-effectiveness, as organizations can pay as they grow. Data lakes built on cloud storage can scale almost without limit because they don’t depend on the hardware an organization has on hand. Along with that scalability, cloud solutions offer performance benefits, since capacity can scale up or down based on demand. Because cloud solutions for data lakes offer flexible infrastructure, they can be more cost-effective than on-premises hardware.

Cloud data lakes offer broader data access than other solutions because they can be reached from anywhere in the world, empowering distributed teams. Additionally, because cloud services are built for integration with other cloud services, cloud data lakes can provide better integration with less effort.

The largest names in cloud computing all offer data lake services. Amazon S3 is the foundation for data lakes on AWS. Microsoft Azure offers Azure Data Lake Storage. Google Cloud Storage provides scalable and secure object storage that serves as the basis for data lakes on Google Cloud Platform. IBM Cloud Object Storage is ideal for building data lakes as it is designed for high durability, security, and data availability, as well as integrating with IBM’s analytics and AI services to provide comprehensive data solutions. 
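
As a small illustration of how raw data typically lands in one of these services, the sketch below uploads a file to Amazon S3 with boto3; the bucket name, local file, and key layout are hypothetical, and the other providers' SDKs offer equivalent operations.

```python
# A hypothetical sketch: land a raw file in an S3-based data lake.
# The bucket, file, and key names are invented for illustration.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/events-2024-05-01.jsonl",  # raw data as it arrived
    Bucket="example-data-lake",                  # hypothetical bucket
    Key="raw/events/2024/05/01/events.jsonl",    # date-partitioned layout
)
```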

A data lake has a flat architecture because the data can be unstructured, semi-structured, or structured, and collected from various sources across the organization, compared to a data warehouse that stores data in files or folders. You can have a data lake on-premises or in the cloud.

Because of their architecture, data lakes offer massive scalability up to the exabyte scale. This is important because when creating a data lake you generally don’t know in advance the volume of data it will need to hold. Traditional data storage systems can’t scale in this way.

This architecture benefits data scientists who are able to mine and explore data from across the enterprise and share and cross-reference data, including heterogeneous data from different fields, to ask questions and find new insights. They can also take advantage of big data analytics and machine learning to analyze the data in a data lake. 

Even though data does not have a fixed schema prior to storage in a data lake, data governance is still important to avoid a data swamp. Data should be tagged with metadata when it is put into the lake to ensure that it is accessible later.
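
One lightweight way to do that tagging, sketched below under the assumption of an S3-based lake, is to attach metadata to each object at ingest with boto3; the tag names and values are hypothetical.

```python
# A hypothetical sketch: attach metadata at ingest so the data stays
# discoverable later. Bucket, key, and tag values are invented.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="exports/readings.csv",
    Bucket="example-data-lake",
    Key="raw/sensors/readings.csv",
    ExtraArgs={"Metadata": {
        "source": "factory-7-sensors",  # where the data came from
        "owner": "data-platform-team",  # who to ask about it
        "ingested": "2024-05-01",       # when it landed
    }},
)
```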

With Red Hat’s open, software-defined storage solutions, you can work more, grow faster, and rest easy knowing that your data—from important financial documents to rich media files—is stored safely and securely.

With scalable, cost-efficient software-defined storage, you can analyze huge lakes of data for better business insights. Red Hat’s software-defined storage solutions are all built on open source, and draw on the innovations of a community of developers, partners, and customers. This gives you control over exactly how your storage is formatted and used—based on your business’ unique workloads, environments, and needs.
