Lambda Architecture-driven Data Lake
In today's technology landscape, many companies are drawn to big data. Early big data systems were built on Hadoop and suffered from high latency. To resolve this problem, a new architecture emerged that can work on large datasets at high velocity.
In this piece, we will try to understand the architecture that makes it simpler to work with big data: the Lambda Architecture, created by Nathan Marz and James Warren.
Big data systems are built to handle variety, velocity, and volume. Traditional big data architectures are built to handle only historical data, in other words static or stale data. These traditional models face challenges today because users want to see real-time data alongside historical data to take business intelligence to the next level.
In this architecture, one can query both fresh and historical data. To gain insight into historical data movements, the information is sent to the data store. The principle of this architecture is based on the Lambda calculus, hence the name Lambda Architecture. The architecture is designed to work with immutable datasets, particularly for functional manipulation, and it also solves the problem of computing arbitrary functions over that data. In general, the problem is segregated into three layers:
- Batch layer
- Speed layer
- Serving layer
Here, the batch layer is the same as the traditional data lake layer, where historical data is collected, used for analytics, and exposed through the serving layer. The speed layer comes into action when real-time data is needed: processing occurs on the fly, and the results are served via the presentation layer alongside the batch data. For the speed layer, one can choose, for example, Apache Storm or Apache Spark Streaming.
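The interplay of the three layers can be sketched as a toy example in Python. This is a hedged illustration with hypothetical names and in-memory data structures; in practice the batch view would be produced by a batch job over the data lake and the speed view by a stream processor such as Storm or Spark Streaming.

```python
from collections import Counter

# Batch view: a precomputed aggregate over the immutable master dataset
# (in a real system, produced periodically by the batch layer).
batch_view = Counter({"page_a": 1000, "page_b": 250})

# Speed view: incremental counts for events that arrived after the last
# batch run (in a real system, maintained by the speed layer on the fly).
speed_view = Counter({"page_a": 7, "page_c": 3})

def query(page: str) -> int:
    """Serving layer: the answer is the batch result plus the real-time delta."""
    return batch_view[page] + speed_view[page]

print(query("page_a"))  # batch count 1000 + real-time delta 7 = 1007
print(query("page_c"))  # seen only in the speed layer so far = 3
```

When the next batch run finishes, its results absorb the events the speed layer was covering, and the speed view is reset; queries stay correct throughout because they always combine both views.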
Data Lake house
Data Lake house is a relatively new term in the world of data science and engineering, and one of the most promising data platform architecture paradigms for the future. A Lake house is a combination of a Data Lake and a Data Warehouse, hence the term. In a Data Lake house, you have processing power on top of data lake storage such as S3, HDFS, Azure Blob Storage, etc.
Data Lake house is a data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data.
Data warehouses have a long history in decision support and business intelligence applications, but they were not suited to, or were too expensive for, handling unstructured data, semi-structured data, and data with high variety, velocity, and volume.
Data lakes then emerged to handle raw data in a variety of formats on cheap storage for data science and machine learning, but they lacked critical features from the world of data warehouses: they do not support transactions, they do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, or batch and streaming jobs.
Data teams consequently stitch these systems together to enable BI and ML across the data in both these systems, resulting in duplicate data, extra infrastructure cost, and security challenges.
Data lake houses are enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse directly on the kind of low-cost storage used for data lakes. Merging them into a single system means data teams can move faster because they can use data without needing to access multiple systems. Data lake houses also ensure that teams have the most complete and up-to-date data available for data science, machine learning, and business analytics projects.
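The core of that design is an ordered transaction log layered over immutable files on cheap storage. The following is a minimal, hypothetical sketch of the idea, not the implementation of any real lakehouse format: readers only see data files recorded in committed log entries, which is what gives warehouse-style transactional reads on top of plain file storage.

```python
import json
import os
import tempfile

class ToyLakehouseTable:
    """Toy table: immutable data files plus an ordered commit log.

    Readers list committed files from the log rather than scanning the
    directory, so a half-written data file is invisible until its commit
    entry lands. Real lakehouse table formats follow the same principle
    with far more machinery (schema, concurrency control, time travel).
    """

    def __init__(self, root: str):
        self.root = root
        self.log_path = os.path.join(root, "_txn_log.json")

    def _read_log(self) -> list:
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as f:
            return json.load(f)

    def commit(self, filename: str, rows: list) -> None:
        # Write the immutable data file first...
        with open(os.path.join(self.root, filename), "w") as f:
            json.dump(rows, f)
        # ...then publish it by appending a new version to the log.
        log = self._read_log()
        log.append({"version": len(log), "file": filename})
        with open(self.log_path, "w") as f:
            json.dump(log, f)

    def read_all(self) -> list:
        rows = []
        for entry in self._read_log():
            with open(os.path.join(self.root, entry["file"])) as f:
                rows.extend(json.load(f))
        return rows

root = tempfile.mkdtemp()
table = ToyLakehouseTable(root)
table.commit("part-0.json", [{"id": 1}])
table.commit("part-1.json", [{"id": 2}])
print(table.read_all())  # [{'id': 1}, {'id': 2}]
```

Because every commit is an ordered log entry over immutable files, batch writers and streaming readers can share the same table without corrupting each other's view, which is exactly the mix of appends and reads that plain data lakes struggle with.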
Advantages of Data Lake house
- Elimination of simple ETL jobs
- Reduced Data Redundancy
- Ease of Data Governance
- Directly connect to BI tools
- Cost reduction
Some examples of tools that support the Data Lake house architecture:
- Google BigQuery
- Delta Lake
- Apache Hudi
- Apache Iceberg
Data Lake houses combined with Lambda architecture are increasing in popularity in the big data world today. Data Science and engineering teams are rapidly employing this architecture to meet the ever-demanding need of processing data faster and at high velocity.
The concept of the Data Lake house is at an early stage, so there are limitations to consider before depending entirely on the architecture, such as query compatibility and data-cleaning complexity. But data engineers can contribute fixes for these issues and limitations to the open-source tools. Bigger companies like Facebook and Amazon have already set the base for the Data Lake house and are open-sourcing the tools they use.
In a global crisis like the Covid-19 pandemic, the medical research community can benefit from having both real-time and historical data available to them in the best form possible and in the most efficient way possible, with reduced latency.