What Are Normalization and Standardization?
Normalization and standardization are two popular feature scaling techniques used to address the challenges of working with data of different scales. Both techniques aim to transform the input features to a common scale, but they differ in their approach and assumptions.
Normalization techniques typically rescale the input features to a specific range, such as [0, 1] or [-1, 1]. This is achieved by scaling the data according to its minimum and maximum values, by using L1 or L2 normalization, or by applying transformations such as the log, Box-Cox, or Yeo-Johnson transformation. These techniques are best suited for data with a known or desired range and are particularly useful when dealing with non-Gaussian distributions.
Standardization techniques, on the other hand, transform the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation, effectively centering the distribution around zero. Standardization techniques, such as Z-score standardization, median and median absolute deviation (MAD) standardization, and robust scaling, are suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable.
Normalization Techniques
Min-Max Normalization
Min-Max normalization is a simple and widely used technique that scales the features of a dataset to a specific range, typically [0, 1]. The formula for Min-Max normalization is:
x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
where x is the original feature value, x_{min} is the minimum value of the feature, and x_{max} is its maximum value.
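As a quick illustration, here is a minimal NumPy sketch of Min-Max normalization applied per feature to a small matrix (the values are made up for demonstration); scikit-learn's MinMaxScaler provides an equivalent fit/transform implementation.

```python
import numpy as np

# Toy feature matrix: each column is a feature on a different scale
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Min-Max normalization per feature: (x - min) / (max - min)
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_normalized = (X - x_min) / (x_max - x_min)

print(X_normalized)  # every column now lies in [0, 1]
```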
Pros
- Easy to understand and implement
- Suitable for data with a known or desired range
- Preserves the shape of the original distribution
Cons
- Sensitive to outliers, which can lead to the compression of the majority of the data in a small range
- Not suitable for data with an unknown or infinite range
L1 and L2 Normalization
L1 and L2 normalization are techniques that scale the data based on their L1 or L2 norms, respectively. The L1 norm is the sum of the absolute values of the feature vector, while the L2 norm is the square root of the sum of the squared values of the feature vector. The formulas for L1 and L2 normalization are:
- L1 Normalization: x_{normalized} = \frac{x}{||x||_1}
- L2 Normalization: x_{normalized} = \frac{x}{||x||_2}
where x is the feature vector, ||x||_1 is its L1 norm, and ||x||_2 is its L2 norm.
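As a minimal sketch (illustrative values), the following NumPy snippet normalizes each sample vector (row) by its own L1 or L2 norm; scikit-learn's Normalizer applies the same row-wise scaling.

```python
import numpy as np

# Two sample vectors (illustrative values)
X = np.array([[3.0, 4.0],
              [1.0, -2.0]])

# L1 normalization: divide each row by the sum of its absolute values
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / l1_norms

# L2 normalization: divide each row by its Euclidean norm
l2_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_l2 = X / l2_norms

print(X_l1)  # absolute values in each row sum to 1
print(X_l2)  # each row has unit Euclidean length
```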
Pros
- Less sensitive to outliers compared to Min-Max normalization
- L1 normalization creates sparse feature vectors, which can be useful for feature selection and dimensionality reduction
- L2 normalization is invariant to the scale and rotation of the input data
Cons
- L1 normalization may not be suitable for data with a large number of zero values
- L2 normalization is sensitive to the presence of very large values in the data
Log Transformations
Log transformations are a type of normalization technique that applies a logarithmic function to the input data. This technique can be useful for reducing the impact of outliers and transforming data with a skewed distribution. The formula for log transformation is:
x_{transformed} = \log(x)
where x is the original (strictly positive) value; the natural logarithm is most common, but base 10 or base 2 can also be used.
Log transformations can be easily implemented using popular machine learning libraries or standard programming libraries, which usually provide built-in logarithmic functions.
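For example, here is a minimal NumPy sketch (illustrative values); np.log1p computes \log(1 + x), a common variant when the data contains zeros.

```python
import numpy as np

# Right-skewed, non-negative data (illustrative values)
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log(1 + x) keeps zeros valid while compressing large values
x_log1p = np.log1p(x)

# Plain natural log, valid only for strictly positive values
x_log = np.log(x[1:])

print(x_log1p)
print(x_log)
```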
Pros
- Reduces the impact of outliers
- Transforms skewed distributions to be more symmetric
- Stabilizes the variance of the data
- Can be applied to data with different scales
Cons
- Requires input data to be strictly positive (requires the addition of a constant if not)
- Can be sensitive to the choice of the logarithm base
- May not be suitable for data with a large number of zero values
Box-Cox Transformation
Box-Cox transformation is a family of normalization techniques that can be used to stabilize variance and make data more closely resemble a Gaussian distribution. The formula for the Box-Cox transformation is:
x_{transformed} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}
where x is the original (strictly positive) value and \lambda is the transformation parameter.
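As a sketch, SciPy's scipy.stats.boxcox applies the transformation and, when \lambda is not supplied, estimates it by maximum likelihood (the data below is illustrative); scikit-learn's PowerTransformer(method='box-cox') offers the same transformation for feature matrices.

```python
import numpy as np
from scipy import stats

# Strictly positive, right-skewed data (illustrative values)
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 64.0])

# boxcox returns the transformed data together with the lambda
# estimated by maximum likelihood when lmbda is not specified
x_transformed, fitted_lambda = stats.boxcox(x)

print(fitted_lambda)
print(x_transformed)
```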
Pros
- Can stabilize variance and make data more closely resemble a Gaussian distribution
- Suitable for data with a skewed distribution
- The optimal value of \lambda can be found using well-established optimization techniques
Cons
- Requires input data to be strictly positive
- The transformation may be sensitive to the choice of \lambda
Yeo-Johnson Transformation
The Yeo-Johnson transformation is an extension of the Box-Cox transformation that can be applied to both positive and negative data. The formula for the Yeo-Johnson transformation is:
x_{transformed} = \begin{cases} \frac{(x + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ x \geq 0 \\ \ln(x + 1) & \text{if } \lambda = 0,\ x \geq 0 \\ -\frac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ x < 0 \\ -\ln(1 - x) & \text{if } \lambda = 2,\ x < 0 \end{cases}
where x is the original value and \lambda is the transformation parameter.
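A minimal sketch using scipy.stats.yeojohnson, which accepts negative values and estimates \lambda by maximum likelihood when it is not supplied (illustrative data); scikit-learn's PowerTransformer(method='yeo-johnson') provides the same transformation for feature matrices.

```python
import numpy as np
from scipy import stats

# Data containing negative, zero, and positive values (illustrative)
x = np.array([-5.0, -1.0, 0.0, 1.0, 3.0, 10.0, 40.0])

# yeojohnson returns the transformed data together with the lambda
# estimated by maximum likelihood when lmbda is not specified
x_transformed, fitted_lambda = stats.yeojohnson(x)

print(fitted_lambda)
print(x_transformed)
```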
Pros
- Can be applied to both positive and negative data
- Can stabilize variance and make data more closely resemble a Gaussian distribution
- The optimal value of \lambda can be found using well-established optimization techniques
Cons
- More complex than the Box-Cox transformation
- The transformation may be sensitive to the choice of \lambda
Standardization Techniques
Z-score Standardization
Z-score standardization, also known as standard score normalization, is a technique that transforms the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation. The formula for Z-score standardization is:
x_{standardized} = \frac{x - \mu}{\sigma}
where x is the original value, \mu is the mean of the feature, and \sigma is its standard deviation.
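A minimal NumPy sketch of Z-score standardization applied per feature (illustrative values); scikit-learn's StandardScaler implements the same transformation with a fit/transform interface.

```python
import numpy as np

# Toy feature matrix (illustrative values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Z-score standardization per feature: (x - mean) / std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_standardized = (X - mu) / sigma

print(X_standardized.mean(axis=0))  # approximately 0 for each feature
print(X_standardized.std(axis=0))   # approximately 1 for each feature
```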
Pros
- Centers the distribution around zero and scales it to have unit variance
- Suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable
- Improves the performance and convergence of gradient-based optimization algorithms
Cons
- Sensitive to outliers, which can affect the mean and standard deviation
- Assumes that the data follows a Gaussian distribution
Median and Median Absolute Deviation (MAD) Standardization
Median and Median Absolute Deviation (MAD) standardization is an alternative to Z-score standardization that is more robust to outliers. Instead of using the mean and standard deviation, this technique uses the median and the median absolute deviation. The formula for MAD standardization is:
x_{standardized} = \frac{x - \text{median}(x)}{\text{MAD}(x)}
where \text{median}(x) is the median of the feature and \text{MAD}(x) = \text{median}(|x - \text{median}(x)|) is the median absolute deviation.
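A minimal NumPy sketch (illustrative values, including one outlier) that computes the MAD directly from its definition; note that some libraries additionally multiply the MAD by a consistency constant (about 1.4826) so that it matches the standard deviation for Gaussian data, which is omitted here for simplicity.

```python
import numpy as np

# One feature with an obvious outlier (illustrative values)
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])

# Median and median absolute deviation
median = np.median(x)
mad = np.median(np.abs(x - median))

# MAD standardization: (x - median) / MAD
x_standardized = (x - median) / mad

print(x_standardized)  # the bulk of the data keeps a sensible scale
```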
Pros
- More robust to outliers compared to Z-score standardization
- Suitable for data with non-Gaussian distributions or heavy-tailed distributions
Cons
- Assumes that the data is symmetric around the median
- May be computationally less efficient than Z-score standardization for large datasets, since medians are more expensive to compute than means
Robust Scaling
Robust scaling is a standardization technique that uses the interquartile range (IQR) to scale the data, making it more robust to outliers. The formula for robust scaling is:
x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)}
where \text{median}(x) is the median of the feature and \text{IQR}(x) = Q_3 - Q_1 is the interquartile range, the difference between the 75th and 25th percentiles.
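A minimal NumPy sketch using the 25th and 75th percentiles (illustrative values); scikit-learn's RobustScaler implements the same idea with a fit/transform interface.

```python
import numpy as np

# One feature with an obvious outlier (illustrative values)
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])

# Median and interquartile range (IQR = Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Robust scaling: (x - median) / IQR
x_scaled = (x - median) / iqr

print(x_scaled)
```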
Pros
- More robust to outliers compared to Z-score standardization
- Suitable for data with non-Gaussian distributions or heavy-tailed distributions
- Uses the interquartile range, which is less sensitive to extreme values
Cons
- May not be suitable for data with a strong skew
- Assumes that the data is symmetric around the median