📊 Causal Inference Unit 1 – Probability and Statistics Fundamentals
Probability and statistics form the foundation of causal inference, providing tools to analyze data and draw meaningful conclusions. This unit covers key concepts like probability distributions, descriptive statistics, hypothesis testing, and regression analysis, essential for understanding causal relationships.
These fundamentals are crucial for interpreting research findings and making informed decisions in various fields. By mastering these concepts, students can critically evaluate statistical evidence and apply appropriate methods to investigate causal effects in real-world scenarios.
Probability: the likelihood of an event occurring, expressed as a number between 0 and 1
Statistics: the collection, analysis, interpretation, and presentation of data
Descriptive statistics: summarize and describe the main features of a data set
Inferential statistics: use sample data to make inferences about a larger population
Random variable: a variable whose value is determined by the outcome of a random event
Discrete random variables: have a countable number of possible values (e.g., the number of heads in 10 coin flips)
Continuous random variables: can take on any value within a specified range (e.g., the height of students in a class)
Distribution: a function that describes the likelihood of different outcomes for a random variable
Hypothesis testing: a statistical method for determining whether there is enough evidence to support a claim about a population parameter
Correlation: a measure of the strength and direction of the linear relationship between two variables
Regression: a statistical method for modeling the relationship between a dependent variable and one or more independent variables
Causal inference: the process of determining whether a causal relationship exists between two variables
Probability Basics
Probability is a measure of the likelihood that an event will occur, ranging from 0 (impossible) to 1 (certain)
The probability of an event A is denoted as P(A)
The sum of the probabilities of all possible outcomes in a sample space is equal to 1
Independent events: the occurrence of one event does not affect the probability of another event occurring (e.g., rolling a die multiple times)
Dependent events: the occurrence of one event affects the probability of another event occurring (e.g., drawing cards from a deck without replacement)
Conditional probability: the probability of an event A occurring given that event B has already occurred, denoted as P(A|B)
Calculated using the formula: P(A|B) = P(A ∩ B) / P(B)
Bayes' Theorem: a formula for calculating conditional probabilities based on prior probabilities and new evidence
P(A|B) = P(B|A) · P(A) / P(B)
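A quick numerical check makes Bayes' theorem concrete. The sketch below works through a hypothetical diagnostic-test example; the prevalence and accuracy numbers are illustrative assumptions, not values from this unit.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical example: probability of disease given a positive test.

p_disease = 0.01            # P(A): prior prevalence (assumed)
p_pos_given_disease = 0.95  # P(B|A): test sensitivity (assumed)
p_pos_given_healthy = 0.05  # false positive rate (assumed)

# Total probability of a positive test, P(B), by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior probability P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.161
```

Even with a 95%-accurate test, the low prior (1%) keeps the posterior around 16%, which is the point of the theorem: new evidence updates, but does not replace, the prior.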
Statistical Distributions
Normal distribution: a symmetric, bell-shaped curve characterized by its mean and standard deviation
Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations
Standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1
Z-score: a measure of how many standard deviations an observation is from the mean of its distribution
Calculated using the formula: z = (x − μ) / σ, where x is the observation, μ is the mean, and σ is the standard deviation
Binomial distribution: the probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success (e.g., flipping a coin 10 times and counting the number of heads)
Poisson distribution: the probability distribution of the number of events occurring in a fixed interval of time or space, given a known average rate (e.g., the number of customers arriving at a store per hour)
Central Limit Theorem: states that the sampling distribution of the sample mean of independent, identically distributed random variables will be normal or nearly normal if the sample size is large enough, regardless of the shape of the underlying distribution
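The Central Limit Theorem is easy to see by simulation. Below is a minimal sketch, assuming NumPy is available: it draws repeated samples from a heavily skewed exponential distribution and checks that the sample means cluster around the true mean with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50               # sample size
n_samples = 10_000   # number of repeated samples

# Exponential(scale=2) is strongly right-skewed, with mean = 2 and sd = 2
sample_means = rng.exponential(scale=2, size=(n_samples, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}  (theory: 2)")
print(f"sd of sample means:   {sample_means.std():.3f}  (theory: {2/np.sqrt(n):.3f})")
# A histogram of sample_means looks approximately bell-shaped,
# even though the underlying distribution is far from normal.
```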
Descriptive Statistics
Measures of central tendency describe the center or typical value of a dataset
Mean: the arithmetic average of a set of numbers
Median: the middle value in a dataset when the values are arranged in order
Mode: the most frequently occurring value in a dataset
Measures of dispersion describe the spread or variability of a dataset
Range: the difference between the largest and smallest values in a dataset
Variance: the average of the squared differences from the mean
Calculated using the formula: σ² = Σᵢ₌₁ⁿ (xᵢ − μ)² / n, where xᵢ is each individual value, μ is the mean, and n is the number of values (this is the population variance; the sample variance divides by n − 1 instead)
Standard deviation: the square root of the variance, expressing dispersion in the same units as the original data (these measures are computed in the sketch at the end of this section)
Skewness: a measure of the asymmetry of a distribution
Positive skew: the tail of the distribution extends to the right (e.g., income distribution)
Negative skew: the tail of the distribution extends to the left (e.g., exam scores on an easy test, where most students score high)
Kurtosis: a measure of the thickness of the tails of a distribution relative to a normal distribution
Leptokurtic distributions have thicker tails than a normal distribution
Platykurtic distributions have thinner tails than a normal distribution
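These summary measures are one-liners in Python. A minimal sketch using the standard library's statistics module on a toy dataset (pvariance and pstdev give the population versions, matching the formula above):

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]  # toy dataset

print("mean:  ", st.mean(data))          # 5.0
print("median:", st.median(data))        # 4.5 (average of two middle values)
print("mode:  ", st.mode(data))          # 4
print("range: ", max(data) - min(data))  # 7
print("variance (population):", st.pvariance(data))  # 4.0
print("std dev (population): ", st.pstdev(data))     # 2.0
```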
Inferential Statistics
Population: the entire group of individuals, objects, or events of interest
Sample: a subset of the population used to make inferences about the population
Parameter: a numerical characteristic of a population (e.g., population mean, population standard deviation)
Statistic: a numerical characteristic of a sample (e.g., sample mean, sample standard deviation)
Sampling distribution: the probability distribution of a statistic obtained from all possible samples of a given size from a population
Standard error: the standard deviation of a sampling distribution
For the sampling distribution of the mean, the standard error is calculated as: SE = σ / √n, where σ is the population standard deviation and n is the sample size
Confidence interval: a range of values that is likely to contain the true population parameter at a specified level of confidence
For a population mean, the confidence interval is calculated as: x̄ ± z* · σ / √n, where x̄ is the sample mean, z* is the critical value from the standard normal distribution, σ is the population standard deviation, and n is the sample size (see the sketch below)
Margin of error: the maximum expected difference between the true population parameter and the sample estimate
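Combining the standard error and confidence interval formulas above, a minimal sketch; the sample summary values are assumed for illustration, and σ is treated as known so the z* critical value applies:

```python
import math

# Hypothetical sample summary (assumed values)
x_bar = 50.0   # sample mean
sigma = 10.0   # known population standard deviation
n = 100        # sample size
z_star = 1.96  # critical value for 95% confidence

se = sigma / math.sqrt(n)  # standard error = σ/√n
margin = z_star * se       # margin of error

print(f"SE = {se:.2f}")                                         # 1.00
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")  # (48.04, 51.96)
```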
Hypothesis Testing
Null hypothesis (H₀): the claim that there is no significant difference or relationship between variables
Alternative hypothesis (Hₐ or H₁): the claim that there is a significant difference or relationship between variables
Type I error: rejecting the null hypothesis when it is actually true (a false positive)
The probability of a Type I error is denoted by α and is typically set at 0.05
Type II error: failing to reject the null hypothesis when it is actually false (a false negative)
The probability of a Type II error is denoted by β
Power: the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true
Calculated as 1 − β
p-value: the probability of obtaining a test statistic as extreme as, or more extreme than, the observed result, assuming the null hypothesis is true
A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis
Test statistic: a value calculated from the sample data used to decide whether to reject the null hypothesis (e.g., z-score, t-score, chi-square)
Critical value: the threshold value of the test statistic that marks the boundary between rejecting and not rejecting the null hypothesis
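To tie these pieces together, here is a minimal sketch of a one-sample t-test, assuming SciPy is available; the simulated data (true mean 0.5) and the null value of 0 are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=30)  # data truly centered at 0.5

# H0: population mean = 0   vs   Ha: population mean != 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

alpha = 0.05  # Type I error rate
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```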
Correlation and Regression
Correlation: a measure of the strength and direction of the linear relationship between two variables
Pearson correlation coefficient (r): a measure of the strength and direction of the linear relationship between two continuous variables
Ranges from −1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation
Scatterplot: a graph that displays the relationship between two continuous variables
Simple linear regression: a statistical method for modeling the linear relationship between a dependent variable and one independent variable
Regression equation: y = β₀ + β₁x + ε, where y is the dependent variable, x is the independent variable, β₀ is the y-intercept, β₁ is the slope, and ε is the error term
Multiple linear regression: a statistical method for modeling the linear relationship between a dependent variable and two or more independent variables
Coefficient of determination (R²): the proportion of variance in the dependent variable that is predictable from the independent variable(s)
Ranges from 0 to 1, with higher values indicating a better fit of the regression model to the data
Residual: the difference between the observed value of the dependent variable and the predicted value from the regression model
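Correlation and simple linear regression take only a few lines in Python. A minimal sketch, assuming NumPy and SciPy are available, fitting synthetic data generated from a known line (β₀ = 3, β₁ = 2) plus noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=100)  # true β0 = 3, β1 = 2

r, _ = stats.pearsonr(x, y)   # Pearson correlation coefficient
fit = stats.linregress(x, y)  # simple linear regression

print(f"r = {r:.3f}")
print(f"intercept (β0) = {fit.intercept:.2f}, slope (β1) = {fit.slope:.2f}")
print(f"R² = {fit.rvalue**2:.3f}")  # coefficient of determination

# Residuals: observed values minus the model's predicted values
residuals = y - (fit.intercept + fit.slope * x)
```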
Applications in Causal Inference
Causal inference: the process of determining whether a causal relationship exists between two variables
Randomized controlled trial (RCT): an experimental design in which participants are randomly assigned to treatment and control groups to estimate the causal effect of an intervention
Observational study: a non-experimental design in which researchers observe and analyze data without manipulating the variables of interest
Confounding: a situation in which the relationship between an exposure and an outcome is distorted by a third variable that is associated with both the exposure and the outcome
Selection bias: a systematic error that occurs when the sample is not representative of the population because of how participants are selected
Propensity score matching: a statistical technique that estimates the causal effect of a treatment by matching treated and untreated individuals on their estimated probability of receiving the treatment
Instrumental variable: a variable that is associated with the exposure but affects the outcome only through that exposure, used to estimate causal effects in the presence of unmeasured confounding
Difference-in-differences: a method for estimating the causal effect of a policy or intervention by comparing the change in outcomes between a treatment group and a control group before and after the intervention (see the sketch below)
Regression discontinuity design: a quasi-experimental design that estimates the causal effect of a treatment by comparing outcomes for individuals just above and below a threshold value of a continuous variable used to assign treatment
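To ground one of these designs, here is the minimal difference-in-differences sketch referenced above, using made-up group means; all numbers are illustrative assumptions rather than real data.

```python
# Difference-in-differences with hypothetical mean outcomes
treat_before, treat_after = 10.0, 16.0      # treatment group means
control_before, control_after = 9.0, 12.0   # control group means

# Change in each group over time
treat_change = treat_after - treat_before        # 6.0
control_change = control_after - control_before  # 3.0

# DiD estimate: treatment group's change net of the common time trend
did = treat_change - control_change
print(f"Estimated causal effect: {did:.1f}")  # 3.0
```

The key identifying assumption is that the two groups would have followed parallel trends in the absence of the intervention.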