2022-03-22

Statistics

What is Statistics

Statistics is the field of study concerned with collecting, organizing, analyzing, interpreting, and presenting data. In other words, it is the science of learning from data. As data becomes more widely available, statistics matters more: it lets individuals and organizations make data-driven decisions, extract meaningful insights, and formulate evidence-based policies.

Types of Statistics

There are two primary branches of statistics: descriptive and inferential statistics. Descriptive statistics aims to summarize and organize data by providing numerical and graphical measures to describe the data's main features. In contrast, inferential statistics focuses on making generalizations or predictions about a population based on a sample of data. It uses probability theory and other mathematical tools to draw conclusions and assess the uncertainty associated with these conclusions.

Descriptive Statistics

Descriptive statistics is a branch of statistics that focuses on summarizing, organizing, and describing the main features of a dataset. It provides a way to condense large amounts of data into simpler, more comprehensible measures that help us understand the overall pattern and structure of the data. Descriptive statistics involves calculating measures of central tendency, dispersion, and shape, as well as creating graphical representations of the data.

Measures of Central Tendency

Measures of central tendency provide a summary of the central location or average value of a dataset. The three most common measures are mean, median, and mode.

  • Mean
    The mean, or arithmetic average, is the sum of all data values divided by the number of values. It is often taken as the typical value of a dataset, but it is sensitive to extreme values or outliers: a single very large or very small value can pull it away from the bulk of the data.

  • Median
    The median is the middle value of a dataset when the data values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean.

  • Mode
    The mode is the value that appears most frequently in a dataset. A dataset may have no mode, one mode (unimodal), or multiple modes (multimodal). The mode can be useful for analyzing categorical data or identifying the most common value in a dataset.
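
The three measures above can be computed with Python's standard `statistics` module. The following is a minimal sketch; the sample values are hypothetical and were chosen to include one outlier (100) so that the difference between the mean and the median is visible.

```python
import statistics

# Hypothetical sample: nine ordinary values and one outlier (100).
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 100]

mean = statistics.mean(data)        # sum of values / number of values -> 14
median = statistics.median(data)    # average of the two middle values (n is even)
modes = statistics.multimode(data)  # list of every most-frequent value

print("mean:", mean)
print("median:", median)
print("modes:", modes)
```

The outlier pulls the mean up to 14, while the median stays at 5, which is closer to where most of the values lie.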

Measures of Dispersion

Measures of dispersion describe the spread or variability of a dataset. They help us understand the degree to which data values deviate from the central tendency. Key measures of dispersion include range, variance, and standard deviation.

  • Range
    The range is the difference between the maximum and minimum values in a dataset. While it is a simple measure of dispersion, it can be highly influenced by outliers.

  • Variance
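    Variance is the average of the squared differences between each data value and the mean. When the data are a sample rather than the whole population, the sum of squared differences is divided by n - 1 instead of n so that the estimate is unbiased. Variance quantifies the spread of data values around the mean and is useful for comparing the variability of different datasets, but it is expressed in squared units of the original data.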

  • Standard Deviation
    The standard deviation is the square root of the variance. It indicates how far data values typically lie from the mean (strictly, it is the root-mean-square deviation, not the plain average distance). Like the variance, it is useful for comparing the spread of different datasets but has the advantage of being in the same unit as the original data.
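
As a rough illustration of these measures, the sketch below uses the standard `statistics` module on a small hypothetical dataset. It computes both the population variance (divide by n) and the sample variance (divide by n - 1); which one applies depends on whether the data are the whole population or a sample from it.

```python
import statistics

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)      # maximum minus minimum
pop_var = statistics.pvariance(data)    # squared deviations averaged over n
sample_var = statistics.variance(data)  # squared deviations divided by n - 1
pop_sd = statistics.pstdev(data)        # square root of the population variance

print("range:", data_range)
print("population variance:", pop_var)
print("sample variance:", round(sample_var, 3))
print("population standard deviation:", pop_sd)
```

The sample variance is slightly larger than the population variance because it divides the same sum of squared deviations (32 here) by 7 rather than 8.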

Measures of Shape

Measures of shape describe the distribution of data values in a dataset. The two most common measures of shape are skewness and kurtosis.

  • Skewness
    Skewness measures the asymmetry of a dataset's distribution. A positively skewed distribution has a longer tail on the right side, while a negatively skewed distribution has a longer tail on the left side. A symmetric distribution has a skewness of zero.

  • Kurtosis
    Kurtosis measures the "tailedness" of a dataset's distribution. High kurtosis indicates heavier tails, that is, more extreme values or outliers, while low kurtosis indicates lighter tails and fewer extreme values. A normal distribution has a kurtosis of 3; many texts and software packages report excess kurtosis (kurtosis minus 3), which is zero for a normal distribution.
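
Both shape measures can be written in terms of central moments. The sketch below is one possible formulation, assuming the simple population-moment definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); statistical packages often apply additional small-sample corrections, so their results may differ slightly. The function name `moment_shape` and the datasets are hypothetical.

```python
import statistics

def moment_shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]          # mirror image around 3
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

sym_skew, sym_kurt = moment_shape(symmetric)
right_skew, right_kurt = moment_shape(right_skewed)

print("symmetric:", round(sym_skew, 3), round(sym_kurt, 3))
print("right-skewed:", round(right_skew, 3), round(right_kurt, 3))
```

The symmetric dataset has zero skewness, and the dataset with a long right tail has positive skewness.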

Graphical Representations

Graphical representations of data are visual tools that help us explore and understand the structure of a dataset. Some common types of graphs used in descriptive statistics are histograms, box plots, and scatter plots.

  • Histograms
    A histogram is a graphical representation of the frequency distribution of a dataset. It divides the data into intervals, called bins, and represents the frequency of each bin with a vertical bar. Histograms are useful for analyzing the shape, central tendency, and dispersion of a dataset.

  • Box Plots
    A box plot, also known as a box-and-whisker plot, is a graphical representation that displays a dataset's five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The "box" represents the interquartile range (IQR), which is the range between Q1 and Q3, and the "whiskers" extend from the box to the minimum and maximum values. In a common variant, the whiskers stop at the most extreme values within 1.5 × IQR of the box, and points beyond them are drawn individually as outliers. Box plots are useful for identifying outliers, comparing distributions, and visualizing the central tendency and dispersion of a dataset.

  • Scatter Plots
    A scatter plot is a graphical representation that displays the relationship between two continuous variables. Each data point is plotted as a point on a Cartesian coordinate system, with the x-axis representing one variable and the y-axis representing the other. Scatter plots are useful for exploring correlations between variables, identifying trends, and detecting outliers.
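
The numbers behind a histogram and a box plot can be computed without a plotting library. The sketch below is illustrative only: the function names (`five_number_summary`, `tukey_outliers`, `histogram_counts`), the sample values, the "inclusive" quartile method, and the 1.5 × IQR outlier rule are assumptions made for this example. Quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def tukey_outliers(values):
    """Return values lying more than 1.5 * IQR outside the box (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

def histogram_counts(values, bin_width):
    """Count values per bin; each key is the left edge of a bin."""
    counts = {}
    for x in values:
        left_edge = (x // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

summary = five_number_summary(data)
outliers = tukey_outliers(data)
bins = histogram_counts(data, bin_width=5)

print("five-number summary:", summary)
print("outliers:", outliers)
for left_edge, count in bins.items():
    print(f"[{left_edge}, {left_edge + 5}): " + "#" * count)
```

A plotting library would draw the same quantities; computing them directly makes it clear what each element of the chart represents.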

Inferential Statistics

Inferential statistics is a branch of statistics that focuses on making generalizations or predictions about a population based on a sample of data. It uses probability theory and other mathematical tools to estimate population parameters, test hypotheses, and quantify the uncertainty associated with these conclusions. Inferential statistics allows us to make inferences about larger groups based on the information gathered from smaller samples.

Probability and Sampling Distributions

Probability is a fundamental concept in inferential statistics. It quantifies the likelihood of an event or outcome occurring. By understanding probability, we can make informed decisions and predictions based on data.

Sampling distributions describe the probability distribution of a sample statistic, such as the sample mean or sample proportion, obtained from multiple random samples of the same size from a population. The Central Limit Theorem, a cornerstone of inferential statistics, states that, for independent samples drawn from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
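
A small simulation can make the Central Limit Theorem more concrete. The sketch below assumes an exponential population with mean 1 (a strongly right-skewed distribution) and compares the sample means obtained with sample sizes 5 and 50. The helper names, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

def sample_means(sample_size, n_samples, rng):
    """Draw n_samples samples of the given size and return each sample's mean."""
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

def skewness(values):
    """Population-moment skewness: m3 / m2 ** 1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean((x - mean) ** 2 for x in values)
    m3 = statistics.fmean((x - mean) ** 3 for x in values)
    return m3 / m2 ** 1.5

rng = random.Random(42)  # fixed seed so the run is repeatable
means_small = sample_means(5, 5000, rng)
means_large = sample_means(50, 5000, rng)

for label, means in (("n=5", means_small), ("n=50", means_large)):
    print(
        label,
        "mean:", round(statistics.fmean(means), 3),
        "sd:", round(statistics.stdev(means), 3),
        "skewness:", round(skewness(means), 3),
    )
```

With the larger sample size, the sample means cluster more tightly around the population mean and their skewness moves closer to zero, which is what a normal-looking sampling distribution would show.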

Hypothesis Testing

Hypothesis testing is a method used in inferential statistics to make decisions or draw conclusions about a population based on sample data. It involves formulating null and alternative hypotheses, calculating a test statistic, and determining the p-value: the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually obtained. A p-value below a pre-chosen significance level (e.g., 0.05) leads to rejecting the null hypothesis.
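
As one concrete instance of this procedure, the sketch below runs a two-sided one-sample z-test. It assumes the population standard deviation is known (sigma = 3), which is rarely true in practice; when it is unknown, a t-test is the usual choice. The function name `z_test_two_sided`, the sample values, and the hypothesized mean of 50 are all hypothetical.

```python
import math
from statistics import NormalDist, fmean

def z_test_two_sided(sample, mu0, sigma):
    """Return (z, p_value) for H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    # Probability, under H0, of a statistic at least as extreme as |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measurements; H0 says the population mean is 50.
sample = [52, 48, 55, 51, 53, 50, 54, 49, 56, 52]

z, p_value = z_test_two_sided(sample, mu0=50, sigma=3)

print("z statistic:", round(z, 3))
print("p-value:", round(p_value, 4))
print("reject H0 at the 0.05 level:", p_value < 0.05)
```

Here the sample mean is 52, and the resulting p-value falls below 0.05, so under these assumptions the null hypothesis would be rejected at the 5% level.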

Confidence Intervals

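A confidence interval is a range of values, computed from sample data, that is likely to contain the true population parameter at a specified level of confidence (e.g., 95% or 99%). The confidence level describes the procedure: if the sampling were repeated many times, about 95% of the 95% confidence intervals constructed this way would contain the true parameter. Confidence intervals provide an estimate of the uncertainty associated with a sample statistic, taking into account the variability in the sample data.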
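
A confidence interval for a mean can be sketched as follows. This example assumes a normal approximation (mean ± z × standard error), which is reasonable for moderately large samples; for small samples a t-based interval would be wider and more appropriate. The function name `mean_confidence_interval` and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = fmean(sample)
    std_err = stdev(sample) / math.sqrt(n)           # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return mean - z * std_err, mean + z * std_err

# Simulated sample of 40 values (assumed population: mean 100, sd 15).
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(40)]

ci95 = mean_confidence_interval(sample, 0.95)
ci99 = mean_confidence_interval(sample, 0.99)

print("sample mean:", round(fmean(sample), 2))
print("95% CI:", tuple(round(v, 2) for v in ci95))
print("99% CI:", tuple(round(v, 2) for v in ci99))
```

The 99% interval is wider than the 95% interval: a higher confidence level requires a larger range.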

Parametric and Non-Parametric Tests

Parametric tests are statistical tests that assume the data follow a specific probability distribution, such as the normal distribution. These tests often have more statistical power when their assumptions hold, but they require the data to meet those assumptions. Non-parametric tests, on the other hand, make fewer assumptions about the data distribution and are more robust to violations of these assumptions, but they may have less statistical power when the parametric assumptions are in fact satisfied.

Regression Analysis

Regression analysis is a technique used to model the relationship between a dependent variable and one or more independent variables. It helps us understand how changes in the independent variables affect the dependent variable and can be used for prediction, estimation, and hypothesis testing.
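
The simplest case, a straight line fitted to one independent variable by ordinary least squares, can be sketched in a few lines. The function name `fit_line` and the data points are hypothetical; the formulas used are the standard closed-form estimates, slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x).

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical observations that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 6  # predicted y for a new x value of 6

print("slope:", round(slope, 3))
print("intercept:", round(intercept, 3))
print("prediction at x=6:", round(prediction, 3))
```

The slope estimates how much the dependent variable changes for a one-unit increase in the independent variable, and the fitted line can then be used for prediction.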

Ryusei Kakujo
