2022-10-15

About data management

Introduction

Digital transformation (DX), the use of digital technology to create new value and drive organizational growth, has become an increasingly common topic in recent years. Data utilization is at the core of DX, and data utilization in turn depends on data management: the practices that ensure the right data is available when it is needed. Good data management makes data easier to use, supports data-driven decision making, and thereby helps realize DX.

Challenges in data utilization

To leverage data, the right data must be available when it is needed. However, many organizations face challenges such as the following, which keep DX from progressing.

  • Data is scattered throughout the organization.
  • Departments and individuals store data in different ways.
  • There is a high wall between the business side and the engineering side.
  • Conflicting interests between organizational units prevent data sharing.
  • No one is sure where the data is, or whether it exists at all.
  • People cannot get the data they want quickly.

In order to solve these problems and create an organization where data is utilized, it is necessary to implement appropriate data management.

DMBOK, the bible of data management

The DMBOK (Data Management Body of Knowledge) is an indispensable reference on data management. It defines data management as follows:

Data Management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.

Data Management coverage

Data management covers two kinds of data: business data and metadata.

Business data

Business data includes both structured and unstructured data.

  • Structured data
    • Data whose structure allows it to be handled as tabular data
  • Unstructured data
    • Data that is difficult to handle as tabular data, such as text files, images, video, and audio

Metadata

Metadata is information about data. Examples of metadata include the following:

  • Data creator
  • Data creation date and time
  • Presence or absence of personal information in the data
  • Data type
  • Who references the data, and for how long
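
To make the list above concrete, here is a minimal sketch of a metadata record in Python. The class name `DatasetMetadata`, its field names, and the sample values are all hypothetical; they simply mirror the bullet points and are not taken from the DMBOK or any particular catalog tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; fields mirror the list above."""
    name: str
    creator: str                   # data creator
    created_at: datetime           # data creation date and time
    contains_personal_info: bool   # presence or absence of personal information
    data_type: str                 # e.g. "table", "text", "image"
    # Who references the data, mapped to the date until which they need it.
    referenced_by: dict = field(default_factory=dict)


# Illustrative values only.
orders_meta = DatasetMetadata(
    name="orders",
    creator="sales-team",
    created_at=datetime(2022, 10, 1, 9, 0),
    contains_personal_info=True,
    data_type="table",
)
orders_meta.referenced_by["marketing-dashboard"] = datetime(2023, 3, 31)
```

Keeping fields such as `contains_personal_info` explicit means a consumer can decide how to handle a dataset without opening it.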

Purpose of managing metadata

The purpose of managing metadata is to reduce the cost of investigating data. The person who created a dataset knows it well and may understand it without metadata, but someone who only consumes it does not know what it contains and cannot easily use it without metadata. Even the creator may have forgotten the data's specifications a year later, or may have left the organization altogether. If metadata is not maintained, a great deal of time can be lost investigating the data.

Data Platform

A data platform is a system for linking, integrating, and utilizing data, and it serves as the infrastructure for data management. It typically consists of three layers: a data lake, a data warehouse, and a data mart. The goal of data management is an organization that makes full use of the data platform to make data-driven decisions.

Data Lake Layer

The data lake layer stores structured, semi-structured (JSON, CSV, etc.), and unstructured data from data sources in unprocessed form. Even if the data contains errors, it is collected as-is, without modification.
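
The "store it as-is" idea can be sketched as follows. This is only an illustration under assumptions: the function `land_raw`, the directory layout `<source>/<date>/`, and the source name `shop_api` are hypothetical, and a local temporary directory stands in for whatever storage a real data lake would use.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str,
             payload: bytes, day: date) -> Path:
    """Write the payload byte-for-byte under <lake_root>/<source>/<YYYY-MM-DD>/."""
    target_dir = lake_root / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, validation, or cleaning
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Deliberately malformed JSON: the lake keeps it unchanged anyway.
    broken_json = b'{"order_id": 1, "amount": '
    path = land_raw(Path(tmp), "shop_api", "orders.json",
                    broken_json, date(2022, 10, 15))
    stored = path.read_bytes()
    partition = path.parent.name
```

Because nothing is parsed on the way in, a malformed file does not block ingestion; interpreting and correcting it is left to later layers.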

Data Warehouse Layer

The data warehouse layer integrates an organization's structured data in chronological order. Analyzing the large volume of accumulated data yields insights that support the organization's decision making.
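
As a rough sketch of this idea, the example below uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse. The table name `sales_history`, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_history ("
    "event_time TEXT, source TEXT, item TEXT, amount REAL)"
)

# Structured records from two hypothetical sources.
web_orders = [("2022-10-02T10:00:00", "web", "shoes", 80.0)]
store_orders = [
    ("2022-10-01T15:30:00", "store", "hat", 20.0),
    ("2022-10-03T09:00:00", "store", "shoes", 75.0),
]

# Integrate both sources into one time-stamped history table.
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?, ?)",
    web_orders + store_orders,
)

# ISO 8601 timestamps sort chronologically as plain text.
timeline = conn.execute(
    "SELECT event_time, source, amount FROM sales_history ORDER BY event_time"
).fetchall()

# A simple analysis over the accumulated data.
total_by_item = dict(
    conn.execute(
        "SELECT item, SUM(amount) FROM sales_history GROUP BY item"
    ).fetchall()
)
```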

Data Mart Layer

The data mart layer is a subset of the data warehouse: a database extracted from the data warehouse layer and stored so that each data mart corresponds one-to-one with a use case. Building a data mart layer to manage data per use case provides the following benefits:

  • Limited scope of impact, since a change affects only one use case
  • Faster SQL response times due to reduced data volume
  • Less time spent searching for the data needed
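
A data mart for a single use case might look like the sketch below. As before, `sqlite3` only stands in for a real warehouse, and the table names `warehouse_sales` and `mart_daily_sales`, the columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "sold_on TEXT, item TEXT, amount REAL, ad_cost REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        ("2022-10-01", "hat", 20.0, 5.0),
        ("2022-10-01", "shoes", 80.0, 10.0),
        ("2022-10-02", "shoes", 75.0, 10.0),
    ],
)

# One mart for one use case: daily sales monitoring.
# It holds only the columns and the granularity that this use case needs.
conn.execute(
    """
    CREATE TABLE mart_daily_sales AS
    SELECT sold_on, SUM(amount) AS total_sales
    FROM warehouse_sales
    GROUP BY sold_on
    """
)

daily_sales = conn.execute(
    "SELECT sold_on, total_sales FROM mart_daily_sales ORDER BY sold_on"
).fetchall()
```

Queries for the dashboard then read the small `mart_daily_sales` table instead of scanning the full warehouse table.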

Define use cases for the data platform

The purpose of building a data platform is to realize use cases. Therefore, before considering a data lake, data warehouse, or data mart, you first need to sort out what you want to achieve with the data platform. For example, a company operating an e-commerce site may want to achieve the following:

  • Monitoring of sales, inventory, advertising costs, etc.
  • Customer review analysis
  • Measuring the benefits of advertising
  • Measuring effectiveness through A/B testing

There are countless other use cases. Keep in mind too that the best tool varies by user: some will want to use Excel, others Jupyter Notebook, and so on. With these considerations in mind, the data platform should be designed with constant attention to connecting the business with the data.
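
To illustrate one of the use cases listed above, here is a minimal sketch of measuring the effect of an A/B test with a two-proportion z-test. The function name `ab_test` and the visitor and conversion counts are invented for illustration; the article does not prescribe a particular statistical method, so this is only one possible approach.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test on conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, z, p_value


# Hypothetical counts: variant A converts 100 of 1000 visitors, B 130 of 1000.
rate_a, rate_b, z, p_value = ab_test(100, 1000, 130, 1000)
```

A small `p_value` suggests that the difference between the two variants is unlikely to be due to chance alone.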

Trade-off between data utilization and security

Data utilization and security are in a trade-off relationship. For example, a database that anyone in the organization can view is ideal from the perspective of data utilization, but it carries a significant security risk. Regulations on the protection of personal information, such as GDPR, are becoming stricter every year, and a leak of personal information would be a serious problem. On the other hand, if security is enforced too rigidly, data utilization will not progress. It is necessary to promote data utilization while appropriately weighing it against security.
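
One common way to balance the two is to mask personal information for most users while keeping the remaining fields usable for analysis. The sketch below assumes a hypothetical set of sensitive fields (`PERSONAL_FIELDS`) and hypothetical role names; a real system would typically enforce this in the database or platform rather than in application code.

```python
# Assumed set of columns holding personal information.
PERSONAL_FIELDS = {"name", "email"}

# Assumed roles that are allowed to see personal information.
PRIVILEGED_ROLES = {"privacy_officer"}


def view_for(role: str, record: dict) -> dict:
    """Return a copy of the record, masking personal fields for most roles."""
    if role in PRIVILEGED_ROLES:
        return dict(record)
    return {
        key: ("***" if key in PERSONAL_FIELDS else value)
        for key, value in record.items()
    }


customer = {
    "customer_id": 42,
    "name": "Alice",
    "email": "alice@example.com",
    "total_spent": 320.0,
}
analyst_view = view_for("analyst", customer)
officer_view = view_for("privacy_officer", customer)
```

Here an analyst can still aggregate `total_spent` by `customer_id`, while names and email addresses stay hidden.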

References

https://www.dama.org/cpages/dmbok-images
https://gdpr-info.eu/
https://www.silect.is/blog/know-your-data-lineage/

Ryusei Kakujo

Focusing on data science for mobility

Bench Press 100kg!