Data Vault 101

Data Vault Architecture

Introduction

The data vault is typically described as a hybrid of traditional star schema and third-normal-form (3NF) warehouse architectures. It addresses several shortcomings of 3NF, such as rigidity, brittleness in the face of change, and the difficulty of adding new data sources. Unlike a star schema design, a data vault architecture normalizes and regularizes data, which makes it suitable for data analytics and modern data warehousing, as well as for other applications that require raw or normalized data. One of the key benefits of this approach is real-time loading. The data vault model expects conditions to change, and it is designed to permit rapid and flexible responses to business change.

Data Vault is a method and architecture for delivering a Data Analytics Service to an enterprise, supporting its Business Intelligence, Data Warehousing, Analytics, and Data Science requirements. At its core, it is a modern, agile way of designing and building efficient, effective Data Warehouses.

Below is a depiction of a typical ETL / ELT pattern for a data vault architecture, contrasted with a traditional star schema.

[Figure: Traditional Star Schema]

[Figure: Data Vault Approach]

Data Vault Modelling Concepts

There are three core structures that make up a Data Vault architecture:

  • Hub
  • Link
  • Satellite

Hub

A Hub represents a core business entity, such as a company code, a customer, a product, or a store. Hubs don't contain any context data or details about the entity; they contain only the defined business key and a few mandated Data Vault fields. A critical attribute of a Hub is that it contains only one row per business key.
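As an illustration only, here is a minimal sketch of what a Hub for a customer entity might look like. The table and column names (hub_customer, hub_customer_hk, customer_id) and the use of a 32-character hash are assumptions for this example, not a prescribed standard.

```sql
-- Minimal sketch of a Hub (illustrative names, not a prescribed standard).
CREATE TABLE hub_customer (
    hub_customer_hk  CHAR(32)      NOT NULL,  -- hash of the business key
    customer_id      VARCHAR(50)   NOT NULL,  -- the business key itself
    load_date        TIMESTAMP     NOT NULL,  -- when the key was first loaded
    record_source    VARCHAR(100)  NOT NULL,  -- system the key came from
    PRIMARY KEY (hub_customer_hk)
);
```

Note that there are no descriptive attributes here: just one row per business key, plus the mandated Data Vault fields.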

Link

A Link defines the relationship between business keys from two or more Hubs. Just like the Hub, a Link structure contains no contextual information about the entities, and there should be only one row representing the relationship between any two entities. To represent a relationship that no longer exists, we create a Satellite table off the Link containing an is_deleted flag; this is known as an Effectivity Satellite. But if data in your business changes very frequently, how can these non-volatile structures capture that changing context? That is the job of the Satellite.
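To make that concrete, here is a hedged sketch of a Link between hypothetical customer and store Hubs, together with an Effectivity Satellite carrying the is_deleted flag mentioned above; all object names are illustrative assumptions.

```sql
-- Sketch of a Link between two Hubs (illustrative names).
CREATE TABLE link_customer_store (
    link_customer_store_hk  CHAR(32)      NOT NULL,  -- hash of the combined business keys
    hub_customer_hk         CHAR(32)      NOT NULL,  -- references hub_customer
    hub_store_hk            CHAR(32)      NOT NULL,  -- references hub_store
    load_date               TIMESTAMP     NOT NULL,
    record_source           VARCHAR(100)  NOT NULL,
    PRIMARY KEY (link_customer_store_hk)
);

-- Effectivity Satellite hanging off the Link: it tracks whether the
-- relationship is still active without ever updating the Link itself.
CREATE TABLE sat_customer_store_eff (
    link_customer_store_hk  CHAR(32)      NOT NULL,
    load_date               TIMESTAMP     NOT NULL,
    is_deleted              BOOLEAN       NOT NULL,
    record_source           VARCHAR(100)  NOT NULL,
    PRIMARY KEY (link_customer_store_hk, load_date)
);
```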

Satellite

In Data Vault architecture, a Satellite houses all the contextual details about an entity. When the data changes, a new row is inserted with the changed values rather than updating the existing one. These records are differentiated from one another by the hash key and one of the Data Vault mandated fields, the load_date. For a given key, the load_date lets us determine which record is the most recent.
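As a rough sketch (the attribute names are assumptions for illustration), a Satellite on the customer Hub, and a query that picks the most recent version of each record by load_date, might look like this:

```sql
-- Sketch of a Satellite holding contextual attributes for customers.
CREATE TABLE sat_customer_details (
    hub_customer_hk  CHAR(32)      NOT NULL,  -- hash key of the parent Hub
    load_date        TIMESTAMP     NOT NULL,  -- insert time of this version
    customer_name    VARCHAR(200),
    customer_email   VARCHAR(200),
    record_source    VARCHAR(100)  NOT NULL,
    PRIMARY KEY (hub_customer_hk, load_date)  -- one row per key per change
);

-- Latest version of each customer, determined by load_date.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY s.hub_customer_hk
               ORDER BY s.load_date DESC
           ) AS rn
    FROM sat_customer_details s
) latest
WHERE rn = 1;
```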

Summary

In this day and age, with the ever-increasing availability and volume of data from sources such as IoT, mobile devices, and weblogs, there is a growing need, and indeed demand, to move from batch load processes to streaming or "real-time" (RT) loading of data. Businesses are changing at a rapid pace and becoming more competitive all the time; those that can harness the value of their data faster to drive better business outcomes will be the ones to prevail. The Data Vault methodology allows teams to add new data sources quickly and efficiently. Instead of re-engineering the model and wasting valuable cycles determining the impact of those changes, data from a new source can be ingested into a completely new Satellite table. This speed also enables data engineers to iterate rapidly with business users on the creation of new information marts. Need to integrate completely new business entities into the Data Vault? You can add new Hubs at any time, and you can define new relationships by building new Link tables between Hubs. This process has zero impact on the existing model.
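For example, onboarding a hypothetical new CRM feed could be as simple as adding one more Satellite on the existing customer Hub; nothing in the sketch below touches the tables already in place (the table name and attributes are assumptions for illustration).

```sql
-- New source, new Satellite: the existing Hub, Link, and Satellite tables
-- are left untouched (illustrative names and attributes).
CREATE TABLE sat_customer_crm (
    hub_customer_hk   CHAR(32)      NOT NULL,  -- same hash key as hub_customer
    load_date         TIMESTAMP     NOT NULL,
    loyalty_tier      VARCHAR(50),             -- attributes assumed to come from the new CRM feed
    marketing_opt_in  BOOLEAN,
    record_source     VARCHAR(100)  NOT NULL,
    PRIMARY KEY (hub_customer_hk, load_date)
);
```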

Published by Narayan Sujay Somasekhar

  • 12+ years of experience leading the build of BI and Cloud Data Platform solutions using cloud technologies such as Snowflake, Azure Synapse, Databricks, and AWS Redshift.
  • 8+ years as a Data Analytics and Engineering practice leader with a demonstrated history of working with management consulting firms across the Tax & Accounting, Finance, and Power & Utility industries.
  • Experience in managing the team roadmap and delivering actionable data insights to sales, product, marketing, and senior leadership.
  • Strong background in Data Technology Solutions delivery and Data Automation for business processes using various tools.
  • Expertise in bringing Data-Driven IT Strategic Planning to align metrics and communicate data changes across reporting, Enterprise Data Warehouses, Data Lakes, and Customer Relationship Management systems.
  • Experienced in working with cross-functional teams, Data Scientists/Analysts, and Business Managers to build Data Science and Data Engineering practices from the ground up.
  • Experienced in designing and implementing NLP solutions with a focus on sentiment analysis, opinion mining, and key phrase extraction using Azure Cognitive Services and Amazon Comprehend.
  • Extensive programming experience with SQL, Python, C#, R, and Scala.
