Data Vault Modeling Design Uses
- Claude Paugh
- May 2
Data Vault is really a design paradigm, not a technology. It can be used on any relational database, or a data lake for that matter. It came about from a desire to find a better way to warehouse data and get away from the star/star-cluster/constellation and snowflake (not the DB company) schema designs that are frequently used in Data Warehouses.
But in doing so, it offers a different pattern that excludes, or at least limits, end users from the equation. It can be difficult for end users to query, and the organization of entities is not intuitive. It can replace a warehouse or operational data store if the primary audience is intended to be engineers rather than end users. It's definitely not a Data Mart replacement.
Why get away from those? Don't they work? Like many things in the technology industry, the answer to whether they "work" or not is use-case driven. What purpose and function do you intend to use it for? The actors that participate in consumer and producer roles generally determine the use case parameters, i.e. where the focus and boundaries are.
In the case of the variations on star schemas that I mentioned above, the focus of the use case is end users. Either they are accessing the data in an ad-hoc fashion using interactive tools and queries, or they are consuming reports produced from it. Just to give you a visual on the above, I have a simple star diagram below. The three tables at the top are the source data for the star schema, highlighted in green.
The variations are depicted in the following diagrams, except for snowflake. Snowflake basically adds non-dimensional reference tables to the designs below, which in most cases are related to dimensions. For example, various type tables that categorize data within dimensions; think of types of shipments (air, ground, sea) and the characteristics of each one.
Star: One Fact with Many Dimensions

Star Cluster: Multiple Facts with Dimensions

Star Constellation: Multiple related Facts and Dimensions

As you can probably infer, all three of these designs are focused on creating and delivering analytics to consumers, whether that's an interactive tool using queries or reporting.
Behind the scenes, from an end user point of view, there is a great deal of data quality scrubbing and transformation needed to write data to the dimension (Dim) tables. That's before you get to populating analytics in the fact (Fact) tables.
Since all of the dimensions and facts use "dumb" or surrogate keys, which is pretty much standard in mart and warehouse design, the possibility of creating duplicates exists. That's because there are use cases where duplicates are valid, and the surrogate key design pattern accommodates that. It comes in particularly useful when time-series data is used, since timestamp values can be the same, but the sequence differentiates them.
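To make that concrete, here is a minimal sketch in Python using SQLite; the table and column names are hypothetical, but it shows how a surrogate key lets two rows with the same timestamp coexist and still be told apart.

```python
import sqlite3

# Hypothetical fact table: the surrogate ("dumb") key is the only primary key,
# so rows with identical timestamps are both allowed and stay distinguishable.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_trade (
    trade_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    trade_ts TEXT NOT NULL,                      -- timestamp, duplicates allowed
    quantity INTEGER NOT NULL
)
""")
conn.executemany(
    "INSERT INTO fact_trade (trade_ts, quantity) VALUES (?, ?)",
    [("2024-05-02T09:30:00", 100), ("2024-05-02T09:30:00", 250)],
)
# The surrogate key sequence (1, 2, ...) is what differentiates the two rows.
print(conn.execute("SELECT trade_sk, trade_ts, quantity FROM fact_trade").fetchall())
```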
Data Quality
Data quality controls need to be precise, otherwise it's easy to have "distortions" infest your data. If you're running ML models on distorted data, they will drift and not give you the results you were planning on. Excessive occurrences of drift usually point to data quality issues or ML model sensitivity. This is, at least partially, one benefit of using Data Vault design patterns: they prevent duplicates and ensure uniqueness for certain classifications of data, which helps reduce the data quality effort. In theory it should also help prevent the drifting problem.
It's only one aspect, uniqueness, and it does not address data content validation specifically. For example, if a column should only have values with 9 alphanumeric characters, you need validation; if a column is restricted to a list of values, you still need to inspect it. The most common types of data quality validation (IMO) are column inspections of values against constraints: length, data type, word count, etc. If all your sources of data are relational databases with constraints applied to columns, the data quality work is much easier. But that's a very rare occurrence.
Data is sourced from structured, semi-structured, and unstructured formats, so you could say that inspection utilities have a long life-span ahead of them. Using regular expressions, or regex for short, in various forms, from Python scripts to SQL, is pretty common for those who need to build a "home grown" solution. There are various Python data validation frameworks and utilities, like pydantic and cerberus to name a couple. Many data quality tools have existed over the years, but they seem to have priced themselves out of adoption and, as a result, were purchased by much larger companies and integrated into their products. Informatica, IBM, SAS, and Ab Initio have been acquirers of data quality and business intelligence products over the last two decades.
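To make the column-inspection idea concrete, here's a minimal "home grown" sketch using Python's built-in re module (pydantic or cerberus could express the same rules with less boilerplate); the column names and rules are hypothetical examples.

```python
import re

# Hypothetical rules: one column must be exactly 9 alphanumeric characters,
# another is restricted to a list of values.
NINE_ALNUM = re.compile(r"^[A-Za-z0-9]{9}$")
ALLOWED_SHIPMENT_TYPES = {"air", "ground", "sea"}

def validate_row(row: dict) -> list[str]:
    """Return a list of data quality errors for a single input record."""
    errors = []
    if not NINE_ALNUM.match(row.get("security_id", "")):
        errors.append("security_id must be 9 alphanumeric characters")
    if row.get("shipment_type") not in ALLOWED_SHIPMENT_TYPES:
        errors.append("shipment_type is not in the allowed list of values")
    return errors

print(validate_row({"security_id": "ABC123XYZ", "shipment_type": "air"}))  # []
print(validate_row({"security_id": "ABC-123", "shipment_type": "rail"}))   # two errors
```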
The other most common data quality checks are record counts: basically, do the inputs and outputs match during loading. It's one of the few generic data quality checks you can do, since once you start transformations on your data, specific counts go out the window. You need a profile of what the input/output ratio is for the transformation so you can calculate what range of counts the transformation results can have.
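As a rough sketch of that kind of count check, assuming you have already profiled an expected input/output ratio for the transformation (the numbers here are made up):

```python
def counts_in_range(input_count: int, output_count: int,
                    expected_ratio: float, tolerance: float = 0.05) -> bool:
    """Check the output record count against a profiled input/output ratio."""
    expected = input_count * expected_ratio
    lower, upper = expected * (1 - tolerance), expected * (1 + tolerance)
    return lower <= output_count <= upper

# A transformation profiled to emit roughly 0.8 output rows per input row.
print(counts_in_range(10_000, 7_950, expected_ratio=0.8))  # True, within 5%
print(counts_in_range(10_000, 6_000, expected_ratio=0.8))  # False, investigate
```

As I mentioned before, one factor in the creation of Data Vault was data quality, but there are several others.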
Data Vault
It's described by its creator, Dan Linstedt, as "a detail oriented, historical tracking, and uniquely linked set of normalized tables that support one or more functional areas. It's a hybrid approach encompassing the best of breed between third normal form (3NF) and star schema." According to Dan, it was inspired by a simplistic view of neurons, dendrites, and synapses.
There are several cited or claimed benefits of Data Vault, including that all relationships are key driven and that the design pattern can scale to multi-petabyte (PB) database sizes. Data Vault is an example of anchor modeling, which is an agile approach to data modeling, so it can capture changes without necessarily having wide-spread impact on the existing model. It's also business process driven, which ideally a star schema and its variations should be as well.
The modeling methodology is similar to creating entity-relationship data models: create a logical model that represents the business process and the data elements involved therein. Data Vault is a highly normalized approach to creating an entity-based model, which provides much of its flexibility, since you're not repeating data elements across the schema.
Another benefit of this modeling practice is that it ensures complete auditability and traceability. It's also SEI CMM Level 5 compliant (repeatable, consistent, redundant architecture) and delivers models that are compliant with Sarbanes-Oxley, HIPAA, and Basel II.

So what is it exactly? Data Vault is a collection of Hubs, Satellites, and Links that represent the business process modeled in the logical data model phase (a pre-requisite). It should be representative of the area(s) you're working on; a hypothetical business case follows below. Its best use cases are operational data stores or data warehouses; it is not a replacement for star schemas.
Below is a hypothetical data flow from an asset manager that trades stocks, bonds, futures, options, etc. Since some financial trading is now 24/7 and a significant portion is algorithmic, the "trader" in the diagram is really software in those cases. Also, the Broker and Market Maker can be a single party (one business), and there can be multiple custodian banks involved in large institutional trades.
Occasionally a custodian bank is a third party whose sole purpose is to relay funds between two or more custodian banks. It tends to be a scenario in international markets. Those nuances just add more parties to the trade, but don't change the process per se.

The below represents an attributed logical data model, based on the data flow diagram above. I added common attributes that stand out to me; this would usually include additional elements that add elaboration to the model, but they might add noise to this example, so I left them out.

Translating the LDM above into a data vault physical format is shown below. Usually the LDM translates to physical with a very similar entity -> table conversion, except for cases where many-to-many logical relationships need to be physically implemented with an associative table. There could be de-normalization in the physical model as well, but that is usually the result of testing cycles that spot performance issues. No pun intended, but de-normalization is not normal practice - you don't start there per se; it's reactive or follows an established pattern.
The data vault implementation of a physical model is very different from the LDM. You will notice that it does not look like a star schema or any of its derivatives.

There are a few core concepts that data vault follows in the design pattern:
- Data Vault designs are not replacements for OLAP implemented in star schemas and their derivatives, and they are not for OLTP; their "sweet spot" is as a data warehouse or operational data store. It's somewhat of a hybrid.
- It's optimized for storage rather than user-driven queries, somewhat due to a strict adherence to third normal form (3NF).
- There are three entity types: hubs, satellites, and links. You can also have static reference tables that support satellites, e.g. country or currency tables, zip/postal codes, weights and measures.
- Hubs are tables that have a low number of changes to the data. They should contain a business key. You can think of them as somewhat analogous to a very slowly changing dimension in some ways.
- Links are associations between business keys, which are usually hubs, but not always. A link is an associative entity that resolves relationships between the keys.
- Satellites are for temporal and descriptive attributes, such as date/time generated data, similar to transaction content. Satellites can have relationships with other satellites, reference data, and links or hubs.
- Links exist between hubs, with the link table absorbing the primary keys of the hubs. Hubs can have satellites as children.
- Links can have satellites as children, but not as parents. The link primary key is absorbed by the satellite.
- Every table has a load date and record source for each row. They should create a unique key when added to the "natural" or "business" key in a table. For tables without a "natural" key, you can use the surrogate key. Use unique key constraints or unique key indexes to enforce this (the last one is my two cents contribution), assuming you're using a relational DBMS; a minimal table sketch follows this list.
- If you're implementing the model in a datalake where there are no constraints or indexes, you could use partition definitions instead, and obviously put the logic in your code to enforce this. Unless you are writing to Apache Iceberg, where constraints can be implemented in its metadata layer.
- Depending upon the document database you're using, you can also use constraints.
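As a concrete reference for the hub/link/satellite structure and the uniqueness rule above, here is a minimal sketch in Python using SQLite. All table and column names are hypothetical, loosely following the trade example; the load_date and record_source columns combined with the business key show one way to enforce uniqueness in a relational DBMS.

```python
import sqlite3

# Hypothetical hub/link/satellite tables for the trade example, sketched in SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_security (
    security_hk   INTEGER PRIMARY KEY,   -- surrogate key
    security_id   TEXT NOT NULL,         -- business ("natural") key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (security_id, load_date, record_source)   -- enforce uniqueness
);

CREATE TABLE hub_broker (
    broker_hk     INTEGER PRIMARY KEY,
    broker_id     TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (broker_id, load_date, record_source)
);

-- The link absorbs the primary keys of the hubs it associates.
CREATE TABLE link_trade (
    trade_hk      INTEGER PRIMARY KEY,
    security_hk   INTEGER NOT NULL REFERENCES hub_security (security_hk),
    broker_hk     INTEGER NOT NULL REFERENCES hub_broker (broker_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (security_hk, broker_hk, load_date, record_source)
);

-- The satellite hangs off the link, absorbing its primary key, and holds
-- the temporal/descriptive attributes of the trade.
CREATE TABLE sat_trade_details (
    trade_hk      INTEGER NOT NULL REFERENCES link_trade (trade_hk),
    trade_ts      TEXT NOT NULL,
    quantity      INTEGER,
    price         REAL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (trade_hk, trade_ts, load_date, record_source)
);
""")
```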
Data Loading Practices
The sequence of loading the tables matters, especially if you're enforcing constraints with keys/indexes. First to load are the hubs, which you can load in parallel, creating new surrogate keys. The next step is the links, loading any attribute values and creating associations between hubs. Last are the satellites, which are on the receiving end of the links or hubs; they can be loaded in parallel with each other.
In order to ensure records are unique (remember that 3NF thing), hashing is commonly used as part of the ETL/ELT processing to compare a hash of the incoming record against hashes of the records currently in the table. Records are never deleted from the data vault.
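Here is a minimal sketch of that hash comparison in Python; the record fields are hypothetical, and in practice the existing hashes would be read from (or stored alongside) the target table.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Hash the full record payload so it can be compared against rows already stored."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hashes of rows already in the satellite (normally read from the table itself).
existing_hashes = {record_hash({"trade_id": "T-1", "quantity": 100, "price": 10.25})}

incoming = [
    {"trade_id": "T-1", "quantity": 100, "price": 10.25},  # duplicate, skipped
    {"trade_id": "T-2", "quantity": 250, "price": 10.30},  # new, loaded
]

# Only records whose hash has not been seen get inserted; nothing is ever deleted.
to_load = [r for r in incoming if record_hash(r) not in existing_hashes]
print(to_load)
```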
Use Cases
The prime use cases for Data Vault designs have been operational data stores (ODS) and enterprise data warehouses (EDW). Both of those cases fit well because of several factors:
- Strict adherence to third normal form (3NF) means fields are not de-normalized or duplicated across tables, which leads to the most efficient storage use.
- Data uniqueness checks are built in to the loading process when hashes are used to confirm whether you have already stored a particular record, so there is no duplication of records, which also reduces storage.
- It can be hard for end users to query. You would need more advanced SQL knowledge than typical end users have to query the tables in a data vault design; it's more suited to engineers accessing the data, i.e. not user friendly. If you want to expose the data to end users, in a traditional RDBMS you can create views to simplify their access (see the sketch after this list).
- Fewer records stored means less data scanned by queries, so an efficient query has less work to do compared to a de-normalized structure.
- Its table design is more suited to engineers. It has a strict focus on normalization, and structures that reflect efficient storage of data rather than a business process flow.
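To illustrate the view idea from the list above, here is a rough, self-contained sketch in Python with SQLite; the stripped-down vault tables and names are hypothetical, and the view simply flattens the hub/link/satellite joins so end users can query one object instead of navigating the vault structure.

```python
import sqlite3

# Minimal, hypothetical vault tables plus a flattened, user-friendly view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_security (security_hk INTEGER PRIMARY KEY, security_id TEXT NOT NULL);
CREATE TABLE hub_broker   (broker_hk   INTEGER PRIMARY KEY, broker_id   TEXT NOT NULL);
CREATE TABLE link_trade   (trade_hk    INTEGER PRIMARY KEY,
                           security_hk INTEGER REFERENCES hub_security (security_hk),
                           broker_hk   INTEGER REFERENCES hub_broker (broker_hk));
CREATE TABLE sat_trade_details (trade_hk INTEGER REFERENCES link_trade (trade_hk),
                                trade_ts TEXT, quantity INTEGER, price REAL);

-- The view hides the hub/link/satellite joins behind a single object.
CREATE VIEW v_trades AS
SELECT s.security_id,
       b.broker_id,
       d.trade_ts,
       d.quantity,
       d.price
FROM link_trade l
JOIN hub_security s      ON s.security_hk = l.security_hk
JOIN hub_broker b        ON b.broker_hk   = l.broker_hk
JOIN sat_trade_details d ON d.trade_hk    = l.trade_hk;
""")
# End users (or their BI tool) query the view like any ordinary table.
print(conn.execute("SELECT * FROM v_trades").fetchall())
```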
I think the Data Vault design pattern definitely has a place, particularly in those cases where you literally want to keep the maximum amount of data possible, whether that's in a datalake or an RDBMS. It's probably most effective at solving that problem, versus any star schema or derivative of a star schema. The ODS case, usually anywhere from a week to a few months of data, is well suited to it as well, because the normalization helps all around. These specialized uses are worth evaluating it against.