Data Vault Architecture
The data vault is typically described as a hybrid of traditional star schema and third-normal-form (3NF) warehouse architectures. It addresses several shortcomings of 3NF, such as rigidity, a tendency to break when requirements change, and difficulty adding new data sources. Unlike a star schema design, a data vault architecture normalizes and regularizes data, which makes it suitable for data analytics and modern data warehousing, as well as for other applications that require raw or normalized data. One of the key benefits of this approach is support for real-time loading. The data vault model expects conditions to change, and it’s designed to permit rapid and flexible responses to business change.
Data Vault is a method and architecture for delivering a Data Analytics Service to an enterprise, supporting its Business Intelligence, Data Warehousing, Analytics and Data Science requirements. At its core, it is a modern, agile way of designing and building efficient, effective Data Warehouses.
Below is a depiction of a typical ETL / ELT pattern for a data vault architecture.
[Figure: Traditional Star Schema compared with Data Vault Approach]
Data Vault Modelling Concepts
There are three core structures that make up a Data Vault architecture: Hubs, Links, and Satellites.
A Hub is a core business entity, such as a company code, customer, product, or store. Hubs don’t contain any context data or details about the entity. They contain only the defined business key and a few mandated Data Vault fields. A critical attribute of a Hub is that it contains only one row per key.
A Link defines the relationship between business keys from two or more Hubs. Just like the Hub, a Link structure contains no contextual information about the entities. There should also be only one row representing a given relationship between entities. To represent a relationship that no longer exists, we create a Satellite table off the Link table that contains an is_deleted flag; this is known as an Effectivity Satellite.
In my business, data changes very frequently. How can non-volatile contextual tables work for me?
In Data Vault architecture, a Satellite houses all the contextual details regarding an entity. When the data changes, existing rows are not updated; instead, a new row must be inserted with the changed data. These records are differentiated from one another by the hash key and one of the Data Vault mandated fields: the load_date. For a given hash key, the load_date lets us determine which record is the most recent.
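The three structures described above can be illustrated with a small in-memory sketch. It is not a reference implementation: the table names (hub_customer, link_customer_store, sat_customer, eff_sat_customer_store), the helper functions, the use of MD5 for the hash key, and the rule of skipping unchanged Satellite rows are all assumptions made for illustration. Only the ideas of a hash key, a load_date, and an is_deleted flag come from the text.

```python
import hashlib
from datetime import datetime


def hash_key(*business_keys):
    """Derive a deterministic hash key from one or more business keys."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical tables, modelled as plain Python containers.
hub_customer = {}            # customer_hk -> row; one row per business key
link_customer_store = {}     # link_hk -> row; one row per relationship
sat_customer = []            # insert-only context rows for the customer Hub
eff_sat_customer_store = []  # insert-only is_deleted rows for the Link


def latest(rows, key_column, key):
    """Return the row with the greatest load_date for a key, or None."""
    matches = [row for row in rows if row[key_column] == key]
    return max(matches, key=lambda row: row["load_date"], default=None)


def load_customer(customer_id, attributes, load_date):
    hk = hash_key(customer_id)
    # Hub: keep only the first row seen for this business key.
    hub_customer.setdefault(
        hk, {"customer_hk": hk, "customer_id": customer_id, "load_date": load_date}
    )
    # Satellite: append a new row only when the context has changed.
    current = latest(sat_customer, "customer_hk", hk)
    if current is None or current["attributes"] != attributes:
        sat_customer.append(
            {"customer_hk": hk, "load_date": load_date, "attributes": dict(attributes)}
        )
    return hk


def load_relationship(customer_id, store_id, is_deleted, load_date):
    lk = hash_key(customer_id, store_id)
    # Link: one row per relationship, holding only the related hash keys.
    link_customer_store.setdefault(
        lk,
        {
            "link_hk": lk,
            "customer_hk": hash_key(customer_id),
            "store_hk": hash_key(store_id),
            "load_date": load_date,
        },
    )
    # Effectivity Satellite: record whether the relationship still exists.
    eff_sat_customer_store.append(
        {"link_hk": lk, "load_date": load_date, "is_deleted": is_deleted}
    )
    return lk


hk = load_customer("C-100", {"city": "Leeds"}, datetime(2024, 1, 1))
load_customer("C-100", {"city": "York"}, datetime(2024, 2, 1))
load_customer("C-100", {"city": "York"}, datetime(2024, 3, 1))  # unchanged: no new row

lk = load_relationship("C-100", "S-7", False, datetime(2024, 1, 1))
load_relationship("C-100", "S-7", True, datetime(2024, 4, 1))  # relationship ended

print(latest(sat_customer, "customer_hk", hk)["attributes"])
print(latest(eff_sat_customer_store, "link_hk", lk)["is_deleted"])
```

In this sketch, frequent change is absorbed by appending rows: the Hub and Link each stay at one row per key, while the Satellites accumulate history and the most recent load_date identifies the current state.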
With the ever-increasing availability and volume of data from sources such as IoT, mobile devices, and weblogs, there is a growing need, and yes, demand, to move from batch load processes to streaming or “real-time” (RT) loading of data. Businesses are changing rapidly and are becoming more competitive all the time. Those that can harness the value of their data faster to drive better business outcomes will be the ones to prevail.
The Data Vault methodology allows teams to add new data sources quickly and efficiently. Instead of reengineering the model and spending valuable cycles determining the impact of those changes, data from a new source can be ingested into a completely new Satellite table. This speed also enables data engineers to iterate rapidly with business users on the creation of new information marts. Need to integrate completely new business entities into the Data Vault? You can add new Hubs at any time, and you can define new relationships by building new Link tables between Hubs. Because these additions are new tables, the existing Hubs, Links, and Satellites do not need to change.
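As a rough illustration of why adding sources and entities is additive, the sketch below models a vault as a dictionary of tables and then adds a Satellite for a new source, a new Hub, and a new Link. All table names, columns, and the web-analytics source are hypothetical, and the MD5-based hash_key helper is an assumption rather than a prescribed Data Vault function.

```python
import copy
import hashlib


def hash_key(*business_keys):
    """Derive a deterministic hash key from one or more business keys."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


customer_hk = hash_key("C-100")
store_hk = hash_key("S-7")

# Existing model: one Hub and one Satellite fed by a hypothetical CRM source.
vault = {
    "hub_customer": [{"customer_hk": customer_hk, "customer_id": "C-100"}],
    "sat_customer_crm": [
        {"customer_hk": customer_hk, "load_date": "2024-01-01", "city": "Leeds"}
    ],
}
before = copy.deepcopy(vault)

# New source: its context goes into a brand-new Satellite on the existing Hub.
vault["sat_customer_web"] = [
    {"customer_hk": customer_hk, "load_date": "2024-02-01", "last_login": "2024-01-31"}
]

# New business entity: a new Hub, related to customers through a new Link.
vault["hub_store"] = [{"store_hk": store_hk, "store_id": "S-7"}]
vault["link_customer_store"] = [
    {
        "link_hk": hash_key("C-100", "S-7"),
        "customer_hk": customer_hk,
        "store_hk": store_hk,
        "load_date": "2024-02-01",
    }
]

# The tables that existed before the additions are byte-for-byte the same.
unchanged = all(vault[name] == rows for name, rows in before.items())
added = sorted(set(vault) - set(before))
print(unchanged, added)
```

The point of the sketch is only that every step creates a new table keyed by existing hash keys; nothing in the pre-existing tables is rewritten.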