2023-07-10

Data Drift and Concept Drift

Model Degradation

Machine Learning (ML) models experience varying degrees of performance degradation over time, depending on the model and the application environment. The primary causes of model degradation are data drift and concept drift.

  • Data Drift
    The distribution of the input data shifts away from what the model saw during training, so the model performs worse on new data.

  • Concept Drift
    The problem the model is trying to solve (i.e., the relationship between input and output) changes over time, resulting in a decline in the model's performance.

Even when data quality is not an issue, these "drifts" can occur and impact the model's performance.

Data Drift

Data drift is the phenomenon where the statistical distribution of input data for a model changes over time. It's a common issue that ML models encounter in production. Changes in the distribution of specific features can lead to a degradation in the model's performance.
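
One common way to check a single numeric feature for data drift is to compare a training-time sample with a recent production sample using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The following is a minimal standard-library sketch of that idea. The synthetic Gaussian samples and the DRIFT_THRESHOLD value are assumptions made for illustration, not recommended settings.

```python
import bisect
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # training-time sample
prod_same = [rng.gauss(0.0, 1.0) for _ in range(2000)]      # production, no drift
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]   # production, mean shifted

DRIFT_THRESHOLD = 0.1  # hypothetical cut-off chosen for this example

stat_same = ks_statistic(train_feature, prod_same)
stat_shifted = ks_statistic(train_feature, prod_shifted)

print(f"no drift: {stat_same:.3f}  flagged={stat_same > DRIFT_THRESHOLD}")
print(f"shifted:  {stat_shifted:.3f}  flagged={stat_shifted > DRIFT_THRESHOLD}")
```

In practice a library routine such as scipy.stats.ks_2samp is a more convenient choice because it also returns a p-value.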

Example of Data Drift

  • Scenario
    An online news platform operates a machine learning model that recommends articles based on users' past browsing history.

  • Occurrence of Data Drift
    Initially, the model exhibited behavior like "recommend political articles to users who frequently read political news." However, after a major sports event (e.g., Olympics), many users started reading sports-related articles.

  • Cause
    In this case, user interests (the distribution of the input data) shifted temporarily because of the sports event. The browsing histories the model now receives look different from those it was trained on, so behavior learned from the old distribution, such as "recommend political articles to users who frequently read political news," no longer matches what users want.
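
A shift like the one in this scenario can be quantified by comparing the share of reads per article category before and during the event. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration and not something the scenario itself prescribes. The category counts are invented, and the 0.25 alert level is a commonly quoted rule of thumb treated here as an assumption.

```python
import math


def category_shares(counts, categories, eps=1e-6):
    """Convert raw counts to shares, floored at eps so the logarithm is defined."""
    total = sum(counts.get(c, 0) for c in categories)
    return {c: max(counts.get(c, 0) / total, eps) for c in categories}


def population_stability_index(reference_counts, current_counts):
    """PSI = sum over categories of (current - reference) * ln(current / reference)."""
    categories = sorted(set(reference_counts) | set(current_counts))
    ref = category_shares(reference_counts, categories)
    cur = category_shares(current_counts, categories)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in categories)


# Hypothetical article-read counts per category.
before_event = {"politics": 5000, "sports": 2000, "technology": 3000}
during_event = {"politics": 2000, "sports": 6000, "technology": 2000}

PSI_ALERT = 0.25  # assumed rule-of-thumb level for a significant shift

psi = population_stability_index(before_event, during_event)
drift_detected = psi > PSI_ALERT
print(f"PSI = {psi:.3f}, drift detected: {drift_detected}")
```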

Concept Drift

Concept drift refers to the phenomenon where the "concept" of the problem that an ML model is attempting to solve (the relationship between input and output data) changes over time. Unlike data drift, the distribution of the input data may remain the same while the meaning of the data changes: the same input now corresponds to a different output. This change leads to a decline in the model's performance.
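
The distinction can be made concrete with a small simulation. In the sketch below the inputs are always drawn from the same distribution and only the labeling rule changes, which is the pure form of concept drift. The one-feature threshold classifier and the boundary values 0.0 and 1.0 are hypothetical choices made for illustration.

```python
import random


def sample(rng, n, boundary):
    """Inputs always come from N(0, 1); only the labeling rule depends on `boundary`."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [int(x > boundary) for x in xs]
    return xs, ys


def fit_threshold(xs, ys):
    """Fit a one-feature classifier: the midpoint between the two classes."""
    largest_negative = max(x for x, y in zip(xs, ys) if y == 0)
    smallest_positive = min(x for x, y in zip(xs, ys) if y == 1)
    return (largest_negative + smallest_positive) / 2


def accuracy(threshold, xs, ys):
    """Share of examples where the threshold rule matches the label."""
    hits = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(42)

# Train while the concept is "label is 1 when x > 0.0".
threshold = fit_threshold(*sample(rng, 5000, boundary=0.0))

# Same input distribution, evaluated before and after the concept changes.
acc_before = accuracy(threshold, *sample(rng, 5000, boundary=0.0))
acc_after = accuracy(threshold, *sample(rng, 5000, boundary=1.0))

print(f"accuracy before drift: {acc_before:.3f}")
print(f"accuracy after drift:  {acc_after:.3f}")
```

Because the input distribution never changes in this simulation, a check that looks only at the inputs would not notice anything. The drop shows up only when predictions are compared with fresh labels.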

Types of Concept Drift

Concept drift mainly occurs in three forms:

  • Gradual Drift
    This type of drift occurs gradually over time. For instance, purchasing behavior shifts little by little as consumer preferences change.

  • Abrupt Drift
    This type occurs suddenly, such as with the emergence of a new virus or legal changes.

  • Recurring Drift
    This type occurs periodically and then reverts, such as the patterns a medical diagnosis model must handle each time a seasonal winter cold virus returns.
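
The three forms differ in how the underlying concept moves over time. As a rough sketch, suppose the concept is summarized by a single number, such as a decision boundary, that moves between 0.0 and 1.0. The function names, parameters, and time scales below are all hypothetical and only illustrate the shape of each pattern.

```python
import math


def gradual(t, start=0.0, end=1.0, duration=100):
    """The boundary moves linearly from start to end over `duration` steps."""
    progress = min(max(t / duration, 0.0), 1.0)
    return start + (end - start) * progress


def abrupt(t, start=0.0, end=1.0, change_point=50):
    """The boundary jumps from start to end at a single change point."""
    return end if t >= change_point else start


def recurring(t, low=0.0, high=1.0, period=100):
    """The boundary oscillates between low and high with a fixed period."""
    phase = 0.5 * (1.0 - math.cos(2.0 * math.pi * t / period))
    return low + (high - low) * phase


for t in (0, 25, 50, 75, 100):
    print(t, round(gradual(t), 2), abrupt(t), round(recurring(t), 2))
```

Schedules like these can be used to generate synthetic data for testing how quickly a monitoring or retraining setup reacts to each pattern.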

Example of Concept Drift

  • Scenario
    An online banking system operates a model to detect fraudulent transactions.

  • Occurrence of Concept Drift
    Initially, the model accurately detected fraudulent transactions. However, a few months later, fraudsters developed new techniques, and the model began classifying these new fraudulent transactions as legitimate.

  • Cause
    In this case, assuming that the distribution of input data (transaction details, user behavior, etc.) hasn't changed, the cause of the drift is a change in what the "fraudulent transaction" label (the target variable) corresponds to. Because fraudsters have adopted new techniques, transaction patterns that the model learned to treat as "not fraudulent" are now more likely to be fraudulent, so the learned relationship between input and label no longer holds.

Strategies for Addressing Drift

Common strategies to address drift include:

Model Retraining

Regularly retrain the model using new data.

  • Example: Update the model with new data when the click-through rate of online ads changes.
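
A minimal sketch of retraining on a sliding window follows, using the click-through-rate example. The "model" here is deliberately trivial (an estimated click probability), and the window size, the click rates 0.05 and 0.02, and the function name are assumptions chosen for illustration.

```python
import random
from collections import deque

WINDOW_SIZE = 5000  # hypothetical number of recent impressions kept for retraining


def fit_ctr(clicks):
    """'Train' the simplest possible model: the observed click-through rate."""
    return sum(clicks) / len(clicks)


rng = random.Random(1)

# Original training data: the true CTR is 5%.
old_clicks = [int(rng.random() < 0.05) for _ in range(20000)]
stale_ctr = fit_ctr(old_clicks)

# The true CTR later drops to 2%; only the most recent impressions are kept.
window = deque(maxlen=WINDOW_SIZE)
for _ in range(20000):
    window.append(int(rng.random() < 0.02))

# Retraining on the recent window replaces the stale estimate.
retrained_ctr = fit_ctr(window)

print(f"stale model CTR:     {stale_ctr:.4f}")
print(f"retrained model CTR: {retrained_ctr:.4f}")
```

The window size is a trade-off: a short window adapts quickly but gives noisier estimates, while a long window is more stable but slower to reflect the change.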

Monitoring and Alerts

Continuously monitor data quality and model performance, and trigger an alert when a metric crosses its threshold.

  • Example: Receive immediate notifications if the performance of a fraud detection model in financial transactions drops.
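
As a sketch of what such a monitor might look like, the class below tracks accuracy over a rolling window of labeled predictions and signals an alert when it falls below a minimum. The class name, window size, and threshold are hypothetical, and a real system would send a notification where this example only records the step number.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions and flags drops."""

    def __init__(self, window_size=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def update(self, prediction, label):
        """Record one labeled prediction; return True when an alert should fire."""
        self.outcomes.append(prediction == label)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait for a full window before judging
        return self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor(window_size=50, min_accuracy=0.9)

# 50 correct predictions, then the model starts missing fraud (predicts 0, label is 1).
labeled_stream = [(1, 1)] * 50 + [(0, 1)] * 10

alert_steps = []
for step, (prediction, label) in enumerate(labeled_stream, start=1):
    if monitor.update(prediction, label):
        alert_steps.append(step)

print("alerts fired at steps:", alert_steps)
```

This kind of monitor depends on labels arriving reasonably soon after predictions. When labels are delayed, monitoring the input distribution is a common complement.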

Leveraging Domain Knowledge

Adjust models and features based on feedback from domain experts.

  • Example: In a healthcare diagnosis model, use physician expertise to select features.

Online Learning

Update the model incrementally each time new data arrives, instead of retraining it from scratch.

  • Example: Real-time updates of a news recommendation system based on user click behavior.
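
To illustrate the idea, the sketch below trains a one-feature logistic regression with one stochastic gradient step per example and compares it with a frozen copy after the concept changes. The class, the learning rate, and the synthetic stream (inputs from N(0, 1), with the label boundary moving from 0.0 to 1.0) are assumptions made for this example.

```python
import math
import random


class OnlineLogisticRegression:
    """One-feature logistic regression updated one example at a time."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.w * x + self.b
        if z >= 0:  # numerically stable sigmoid
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)

    def learn_one(self, x, y):
        """Take a single gradient step on the log loss for one example."""
        error = self.predict_proba(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def stream(rng, n, boundary):
    """Yield (x, y) pairs; the label is 1 when x exceeds the current boundary."""
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        yield x, int(x > boundary)


def accuracy(model, examples):
    return sum(model.predict(x) == y for x, y in examples) / len(examples)


rng = random.Random(7)
online = OnlineLogisticRegression()
for x, y in stream(rng, 5000, boundary=0.0):
    online.learn_one(x, y)

# Freeze a copy of the model as it was before the concept changed.
frozen = OnlineLogisticRegression()
frozen.w, frozen.b = online.w, online.b

# The concept changes; only the online model keeps learning.
for x, y in stream(rng, 5000, boundary=1.0):
    online.learn_one(x, y)

test_set = list(stream(rng, 5000, boundary=1.0))
frozen_acc = accuracy(frozen, test_set)
online_acc = accuracy(online, test_set)
print(f"frozen model: {frozen_acc:.3f}, online model: {online_acc:.3f}")
```

Online updates adapt quickly, but they also follow noise and mislabeled data just as readily, so the learning rate needs care.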

Feature Engineering

Design new features that capture the source of the drift, so that the model can account for it.

  • Example: In retail affected by seasonality, add features indicating seasons or specific events.
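
One possible way to build such features is sketched below: the month is encoded as a point on a circle, so that December and January end up close together, and a flag marks dates listed in an event calendar. The feature names and the dates in EVENT_DATES are hypothetical.

```python
import math
from datetime import date

# Hypothetical event calendar; a real one would come from the business side.
EVENT_DATES = {date(2023, 11, 24), date(2023, 12, 25)}


def seasonal_features(day):
    """Encode the month cyclically and flag known event days."""
    angle = 2.0 * math.pi * (day.month - 1) / 12
    return {
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        "is_event": int(day in EVENT_DATES),
    }


def seasonal_distance(a, b):
    """Distance between two dates in the cyclical month encoding."""
    fa, fb = seasonal_features(a), seasonal_features(b)
    return math.hypot(fa["month_sin"] - fb["month_sin"],
                      fa["month_cos"] - fb["month_cos"])


december = date(2023, 12, 25)
january = date(2024, 1, 5)
july = date(2023, 7, 10)

print(seasonal_features(december))
print("Dec vs Jan:", round(seasonal_distance(december, january), 3))
print("Jan vs Jul:", round(seasonal_distance(january, july), 3))
```

With a plain month number, December (12) and January (1) would look far apart. The sine and cosine pair avoids that discontinuity.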

References

https://www.evidentlyai.com/blog/machine-learning-monitoring-data-and-concept-drift

Ryusei Kakujo
