2022-12-30

Typical data modeling approaches

Typical data modeling approaches

In the context of data warehouse (DWH), there are three well-known architectures for data modeling:

  • EDW (Enterprize Data Warehouse)
  • Dimensional modeling
  • Data Vault

EDW

EDW, as proposed by Bill Inmon, also known as the father of DWH, loads data into a temporary Landing Area, and then a series of ETL processes load the data into a DWH in third normal form (3NF). Data is then extracted into a data mart for analysis and reporting.

Inmon model

When should I use Data Vault?

Benefits

By normalizing data down to 3 NF and avoiding redundancy as much as possible, business requirements can be clarified and fraudulent data updates can be prevented. In addition, DWH is the SSOT (Single Source Of Truth) for the data mart, ensuring consistency and integrity across the enterprise.

Disadvantages

Important disadvantages of EDWs include:

  • Time to market
    EDW must first integrate data from each source system into a central data repository before it can be used for reporting. This adds time and effort to the project.

  • Complexity and skill
    DWH may need to integrate data from as many as 100 sources. Designing an enterprise-wide data model to support a complex business environment requires highly skilled data modeling professionals.

Use cases

EDW is suitable for stable situations where design time and associated costs can be afforded. It is suited to the insurance industry, where all data is interrelated and it is important to understand the big picture, and to the manufacturing industry, where multiple functions are involved regardless of the associated budget.

Dimensional modeling

Dimensional modeling, introduced by Ralph Kimball, also known as the father of business intelligence, focuses on individual business areas and processes rather than the entire enterprise. Data loaded into a Landing Area is extracted into a dimensional data mart through a series of ETL processes.

Kimball model

When should I use Data Vault?

In dimensional modeling, data is normalized into a form called a star schema. A star schema consists of a fact table with the values to be analyzed and a dimension table with the values that will be the axes (dimensions) of the analysis, as shown in the figure below, in a star shape.

Star schema

Star schema

Benefits

The advantages of dimensional modeling include the following:

  • The structure is simple and easy to understand
    As a structure, it is easy to understand what facts are affected by what. Also, the data analyst can create the target tables using only simple queries.

  • Easy to analyze data
    Because dimension tables are not normalized to 3NF, the number of joins is reduced and query efficiency is increased.

  • Simplified system administration
    Focus on individual business areas and processes, rather than the enterprise as a whole, requires fewer resources in the DB.

Disadvantages

The disadvantages of dimensional modeling are as follows:

  • SSOT is lost
    Cannot ensure consistency and coherence across the enterprise.

  • Weakness to changes in data specifications
    Unnormalized methods make it difficult to change data as business needs change. Also, adding columns to facts expands dimensions and can negatively impact performance.

Use cases

Dimensional modeling is appropriate for situations where a DWH must be built quickly and where the priority is to handle data quickly and more easily by analysts. It is suited to CRM, where the services sold per customer do not need to be linked across departments and are limited in scope, and marketing, which primarily requires only data marts.

Data Vault

Data Vault is a modeling methodology that flexibly maintains a history of changes even if the number or specification of data changes, and allows data from a specific date and time to be retrieved at any time. Data Vault is a hybrid approach that incorporates both 3NF and star schemas.

Data vault

When should I use Data Vault?

The Data Vault architecture has a Raw Vault tier and a Business Vault tier.

  • Raw Vault
    Holds a historical copy of the data. No filters or business transformations are performed, except for storing source-independent data.
  • Business Vault
    Creates a Point in Time (PIT) Table or Bridge Table and avoids the complexity of SQL join processing.

Benefits

Data Vault addresses the problems inherent in both EDW and dimensional modeling in 3NF, combining the best aspects of both into a single hybrid approach. Its advantages include.

  • Flexibility
    Unlike EDW, which lacks flexibility, Data Vault does not require rework when adding sources; Data Vault stores raw data and business-derived data separately and can easily accommodate changes in business rules.

  • Change over time support
    Data Vault separates raw data from business-derived data and supports changes resulting from both source systems and business rules.

  • Lineage and Auditing
    Data Vault includes metadata that identifies the source system, making it easy to support data reneging.

Disadvantages

Data Vault is not a perfect solution for all data warehouses and has some drawbacks that must be considered. Here are some examples.

  • Learning cost
    A reasonable amount of knowledge is required to properly implement the Data Vault.

  • Massive amount of JOIN
    The number of source tables can double or triple. The number of tables, and thus the number of joins, can lead to unwieldy and complex join conditions.

  • Monetary cost
    Large amounts of storage may be required to maintain a complete history of data.

Use cases

Data Vault is best suited for large projects with 30+ source systems that have significant data integration challenges and are ready to adopt the skills and rigor of the new methodology. In contrast, Data Vault may not be suitable for small projects with small teams (less than 10 people), so dimensional modeling may be a good place to start.

References

https://medium.com/@amritha_fernando/types-of-data-warehousing-architecture-9a656443b510
https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
https://www.phdata.io/blog/building-modern-data-platform-with-data-vault/
https://medium.com/analytics-vidhya/theories-of-kimball-and-inmon-about-data-warehouse-design-c16260fab5e9
https://www.techtarget.com/searchdatamanagement/definition/star-schema

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!