Typical data modeling approaches
In the context of data warehouse (DWH), there are three well-known architectures for data modeling:
- EDW (Enterprize Data Warehouse)
- Dimensional modeling
- Data Vault
EDW
EDW, as proposed by Bill Inmon, also known as the father of DWH, loads data into a temporary Landing Area, and then a series of ETL processes load the data into a DWH in third normal form (3NF). Data is then extracted into a data mart for analysis and reporting.
Benefits
By normalizing data down to 3 NF and avoiding redundancy as much as possible, business requirements can be clarified and fraudulent data updates can be prevented. In addition, DWH is the SSOT (Single Source Of Truth) for the data mart, ensuring consistency and integrity across the enterprise.
Disadvantages
Important disadvantages of EDWs include:
-
Time to market
EDW must first integrate data from each source system into a central data repository before it can be used for reporting. This adds time and effort to the project. -
Complexity and skill
DWH may need to integrate data from as many as 100 sources. Designing an enterprise-wide data model to support a complex business environment requires highly skilled data modeling professionals.
Use cases
EDW is suitable for stable situations where design time and associated costs can be afforded. It is suited to the insurance industry, where all data is interrelated and it is important to understand the big picture, and to the manufacturing industry, where multiple functions are involved regardless of the associated budget.
Dimensional modeling
Dimensional modeling, introduced by Ralph Kimball, also known as the father of business intelligence, focuses on individual business areas and processes rather than the entire enterprise. Data loaded into a Landing Area is extracted into a dimensional data mart through a series of ETL processes.
In dimensional modeling, data is normalized into a form called a star schema. A star schema consists of a fact table with the values to be analyzed and a dimension table with the values that will be the axes (dimensions) of the analysis, as shown in the figure below, in a star shape.
Benefits
The advantages of dimensional modeling include the following:
-
The structure is simple and easy to understand
As a structure, it is easy to understand what facts are affected by what. Also, the data analyst can create the target tables using only simple queries. -
Easy to analyze data
Because dimension tables are not normalized to 3NF, the number of joins is reduced and query efficiency is increased. -
Simplified system administration
Focus on individual business areas and processes, rather than the enterprise as a whole, requires fewer resources in the DB.
Disadvantages
The disadvantages of dimensional modeling are as follows:
-
SSOT is lost
Cannot ensure consistency and coherence across the enterprise. -
Weakness to changes in data specifications
Unnormalized methods make it difficult to change data as business needs change. Also, adding columns to facts expands dimensions and can negatively impact performance.
Use cases
Dimensional modeling is appropriate for situations where a DWH must be built quickly and where the priority is to handle data quickly and more easily by analysts. It is suited to CRM, where the services sold per customer do not need to be linked across departments and are limited in scope, and marketing, which primarily requires only data marts.
Data Vault
Data Vault is a modeling methodology that flexibly maintains a history of changes even if the number or specification of data changes, and allows data from a specific date and time to be retrieved at any time. Data Vault is a hybrid approach that incorporates both 3NF and star schemas.
The Data Vault architecture has a Raw Vault tier and a Business Vault tier.
- Raw Vault
Holds a historical copy of the data. No filters or business transformations are performed, except for storing source-independent data. - Business Vault
Creates a Point in Time (PIT) Table or Bridge Table and avoids the complexity of SQL join processing.
Benefits
Data Vault addresses the problems inherent in both EDW and dimensional modeling in 3NF, combining the best aspects of both into a single hybrid approach. Its advantages include.
-
Flexibility
Unlike EDW, which lacks flexibility, Data Vault does not require rework when adding sources; Data Vault stores raw data and business-derived data separately and can easily accommodate changes in business rules. -
Change over time support
Data Vault separates raw data from business-derived data and supports changes resulting from both source systems and business rules. -
Lineage and Auditing
Data Vault includes metadata that identifies the source system, making it easy to support data reneging.
Disadvantages
Data Vault is not a perfect solution for all data warehouses and has some drawbacks that must be considered. Here are some examples.
-
Learning cost
A reasonable amount of knowledge is required to properly implement the Data Vault. -
Massive amount of
JOIN
The number of source tables can double or triple. The number of tables, and thus the number of joins, can lead to unwieldy and complex join conditions. -
Monetary cost
Large amounts of storage may be required to maintain a complete history of data.
Use cases
Data Vault is best suited for large projects with 30+ source systems that have significant data integration challenges and are ready to adopt the skills and rigor of the new methodology. In contrast, Data Vault may not be suitable for small projects with small teams (less than 10 people), so dimensional modeling may be a good place to start.
References