📊 Causal Inference Unit 1 – Probability and Statistics Fundamentals
Probability and statistics form the foundation of causal inference, providing tools to analyze data and draw meaningful conclusions. This unit covers key concepts like probability distributions, descriptive statistics, hypothesis testing, and regression analysis, essential for understanding causal relationships.
These fundamentals are crucial for interpreting research findings and making informed decisions in various fields. By mastering these concepts, students can critically evaluate statistical evidence and apply appropriate methods to investigate causal effects in real-world scenarios.
Probability: the likelihood of an event occurring, expressed as a number between 0 and 1
Statistics: the collection, analysis, interpretation, and presentation of data
Descriptive statistics: summarize and describe the main features of a data set
Inferential statistics: use sample data to make inferences about a larger population
Random variable: a variable whose value is determined by the outcome of a random event
Discrete random variables: have a countable number of possible values (e.g., the number of heads in 10 coin flips)
Continuous random variables: can take on any value within a specified range (e.g., the height of students in a class)
Distribution: a function that describes the likelihood of different outcomes for a random variable
Hypothesis testing: a statistical method for determining whether there is enough evidence to support a claim about a population parameter
Correlation: a measure of the strength and direction of the linear relationship between two variables
Regression: a statistical method for modeling the relationship between a dependent variable and one or more independent variables
Causal inference: the process of determining whether a causal relationship exists between two variables
Probability Basics
Probability is a measure of the likelihood that an event will occur, ranging from 0 (impossible) to 1 (certain)
The probability of an event A is denoted as P(A)
The sum of the probabilities of all possible outcomes in a sample space is equal to 1
Independent events: the occurrence of one event does not affect the probability of another event occurring (e.g., rolling a die multiple times)
Dependent events: the occurrence of one event affects the probability of another event occurring (e.g., drawing cards from a deck without replacement)
Conditional probability: the probability of an event A occurring given that event B has already occurred, denoted as P(A|B)
Calculated using the formula: P(A|B) = P(A ∩ B) / P(B)
Bayes' Theorem: a formula for calculating conditional probabilities based on prior probabilities and new evidence
P(A|B) = P(B|A) · P(A) / P(B)
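A quick numerical check makes Bayes' theorem concrete. The sketch below works through a hypothetical diagnostic-test example; the prevalence and accuracy numbers are illustrative assumptions, not values from this unit.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical example: probability of disease given a positive test.

p_disease = 0.01            # P(A): prior prevalence (assumed)
p_pos_given_disease = 0.95  # P(B|A): test sensitivity (assumed)
p_pos_given_healthy = 0.05  # false positive rate (assumed)

# Total probability of a positive test, P(B), by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior probability P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.161
```

Even with a 95%-accurate test, the low prior (1%) keeps the posterior around 16%, which is the point of the theorem: new evidence updates, but does not replace, the prior.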
Statistical Distributions
Normal distribution: a symmetric, bell-shaped curve characterized by its mean and standard deviation
Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations
Standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1
Z-score: a measure of how many standard deviations an observation is from the mean of its distribution
Calculated using the formula: z = (x − μ) / σ, where x is the observation, μ is the mean, and σ is the standard deviation
Binomial distribution: the probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success (e.g., flipping a coin 10 times and counting the number of heads)
Poisson distribution: the probability distribution of the number of events occurring in a fixed interval of time or space, given a known average rate (e.g., the number of customers arriving at a store per hour)
Central Limit Theorem: states that the sampling distribution of the sample mean of independent, identically distributed random variables will be normal or nearly normal if the sample size is large enough, regardless of the shape of the underlying distribution
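The Central Limit Theorem is easy to see by simulation. Below is a minimal sketch, assuming NumPy is available: it draws repeated samples from a heavily skewed exponential distribution and checks that the sample means cluster around the true mean with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50               # sample size
n_samples = 10_000   # number of repeated samples

# Exponential(scale=2) is strongly right-skewed, with mean = 2 and sd = 2
sample_means = rng.exponential(scale=2, size=(n_samples, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}  (theory: 2)")
print(f"sd of sample means:   {sample_means.std():.3f}  (theory: {2/np.sqrt(n):.3f})")
# A histogram of sample_means looks approximately bell-shaped,
# even though the underlying distribution is far from normal.
```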
Descriptive Statistics
Measures of central tendency describe the center or typical value of a dataset
Mean: the arithmetic average of a set of numbers
Median: the middle value in a dataset when the values are arranged in order
Mode: the most frequently occurring value in a dataset
Measures of dispersion describe the spread or variability of a dataset
Range: the difference between the largest and smallest values in a dataset
Variance: the average of the squared differences from the mean
Calculated using the formula: σ² = Σᵢ₌₁ⁿ (xᵢ − μ)² / n, where xᵢ is each individual value, μ is the mean, and n is the number of values (this is the population variance; the sample variance divides by n − 1 instead)
Standard deviation: the square root of the variance, expressing dispersion in the same units as the original data (these measures are computed in the sketch at the end of this section)
Skewness: a measure of the asymmetry of a distribution
Positive skew: the tail of the distribution extends to the right (e.g., income distribution)
Negative skew: the tail of the distribution extends to the left (e.g., exam scores on an easy test, where most students score high)
Kurtosis: a measure of the thickness of the tails of a distribution relative to a normal distribution
Leptokurtic distributions have thicker tails than a normal distribution
Platykurtic distributions have thinner tails than a normal distribution
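These summary measures are one-liners in Python. A minimal sketch using the standard library's statistics module on a toy dataset (pvariance and pstdev give the population versions, matching the formula above):

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]  # toy dataset

print("mean:  ", st.mean(data))          # 5.0
print("median:", st.median(data))        # 4.5 (average of two middle values)
print("mode:  ", st.mode(data))          # 4
print("range: ", max(data) - min(data))  # 7
print("variance (population):", st.pvariance(data))  # 4.0
print("std dev (population): ", st.pstdev(data))     # 2.0
```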
Inferential Statistics
Population: the entire group of individuals, objects, or events of interest
Sample: a subset of the population used to make inferences about the population
Parameter: a numerical characteristic of a population (e.g., population mean, population standard deviation)
Statistic: a numerical characteristic of a sample (e.g., sample mean, sample standard deviation)
Sampling distribution: the probability distribution of a statistic obtained from all possible samples of a given size from a population
Standard error: the standard deviation of a sampling distribution
For the sampling distribution of the mean, the standard error is calculated as: SE = σ / √n, where σ is the population standard deviation and n is the sample size
Confidence interval: a range of values that is likely to contain the true population parameter at a specified level of confidence
For a population mean, the confidence interval is calculated as: x̄ ± z* · σ / √n, where x̄ is the sample mean, z* is the critical value from the standard normal distribution, σ is the population standard deviation, and n is the sample size (see the sketch below)
Margin of error: the maximum expected difference between the true population parameter and the sample estimate
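Combining the standard error and confidence interval formulas above, a minimal sketch; the sample summary values are assumed for illustration, and σ is treated as known so the z* critical value applies:

```python
import math

# Hypothetical sample summary (assumed values)
x_bar = 50.0   # sample mean
sigma = 10.0   # known population standard deviation
n = 100        # sample size
z_star = 1.96  # critical value for 95% confidence

se = sigma / math.sqrt(n)  # standard error = σ/√n
margin = z_star * se       # margin of error

print(f"SE = {se:.2f}")                                         # 1.00
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")  # (48.04, 51.96)
```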
Hypothesis Testing
Null hypothesis (H₀): the claim that there is no significant difference or relationship between variables
Alternative hypothesis (Hₐ or H₁): the claim that there is a significant difference or relationship between variables
Type I error: rejecting the null hypothesis when it is actually true (a false positive)
The probability of a Type I error is denoted by α and is typically set at 0.05
Type II error: failing to reject the null hypothesis when it is actually false (a false negative)
The probability of a Type II error is denoted by β
Power: the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true
Calculated as 1 − β
p-value: the probability of obtaining a test statistic as extreme as, or more extreme than, the observed result, assuming the null hypothesis is true
A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis
Test statistic: a value calculated from the sample data used to decide whether to reject the null hypothesis (e.g., z-score, t-score, chi-square)
Critical value: the threshold value of the test statistic that marks the boundary between rejecting and not rejecting the null hypothesis
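To tie these pieces together, here is a minimal sketch of a one-sample t-test, assuming SciPy is available; the simulated data (true mean 0.5) and the null value of 0 are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=30)  # data truly centered at 0.5

# H0: population mean = 0   vs   Ha: population mean != 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

alpha = 0.05  # Type I error rate
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```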
Correlation and Regression
Correlation: a measure of the strength and direction of the linear relationship between two variables
Pearson correlation coefficient (r): a measure of the strength and direction of the linear relationship between two continuous variables
Ranges from −1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation
Scatterplot: a graph that displays the relationship between two continuous variables
Simple linear regression: a statistical method for modeling the linear relationship between a dependent variable and one independent variable
Regression equation: y = β₀ + β₁x + ε, where y is the dependent variable, x is the independent variable, β₀ is the y-intercept, β₁ is the slope, and ε is the error term
Multiple linear regression: a statistical method for modeling the linear relationship between a dependent variable and two or more independent variables
Coefficient of determination (R²): the proportion of variance in the dependent variable that is predictable from the independent variable(s)
Ranges from 0 to 1, with higher values indicating a better fit of the regression model to the data
Residual: the difference between the observed value of the dependent variable and the predicted value from the regression model
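Correlation and simple linear regression take only a few lines in Python. A minimal sketch, assuming NumPy and SciPy are available, fitting synthetic data generated from a known line (β₀ = 3, β₁ = 2) plus noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=100)  # true β0 = 3, β1 = 2

r, _ = stats.pearsonr(x, y)   # Pearson correlation coefficient
fit = stats.linregress(x, y)  # simple linear regression

print(f"r = {r:.3f}")
print(f"intercept (β0) = {fit.intercept:.2f}, slope (β1) = {fit.slope:.2f}")
print(f"R² = {fit.rvalue**2:.3f}")  # coefficient of determination

# Residuals: observed values minus the model's predicted values
residuals = y - (fit.intercept + fit.slope * x)
```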
Applications in Causal Inference
Causal inference: the process of determining whether a causal relationship exists between two variables
Randomized controlled trial (RCT): an experimental design in which participants are randomly assigned to treatment and control groups to estimate the causal effect of an intervention
Observational study: a non-experimental design in which researchers observe and analyze data without manipulating the variables of interest
Confounding: a situation in which the relationship between an exposure and an outcome is distorted by a third variable that is associated with both the exposure and the outcome
Selection bias: a systematic error that occurs when the sample is not representative of the population because of how participants are selected
Propensity score matching: a statistical technique that estimates the causal effect of a treatment by matching treated and untreated individuals on their estimated probability of receiving the treatment
Instrumental variable: a variable that is associated with the exposure but affects the outcome only through that exposure, used to estimate causal effects in the presence of unmeasured confounding
Difference-in-differences: a method for estimating the causal effect of a policy or intervention by comparing the change in outcomes between a treatment group and a control group before and after the intervention (see the sketch below)
Regression discontinuity design: a quasi-experimental design that estimates the causal effect of a treatment by comparing outcomes for individuals just above and below a threshold value of a continuous variable used to assign treatment
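To ground one of these designs, here is the minimal difference-in-differences sketch referenced above, using made-up group means; all numbers are illustrative assumptions rather than real data.

```python
# Difference-in-differences with hypothetical mean outcomes
treat_before, treat_after = 10.0, 16.0      # treatment group means
control_before, control_after = 9.0, 12.0   # control group means

# Change in each group over time
treat_change = treat_after - treat_before        # 6.0
control_change = control_after - control_before  # 3.0

# DiD estimate: treatment group's change net of the common time trend
did = treat_change - control_change
print(f"Estimated causal effect: {did:.1f}")  # 3.0
```

The key identifying assumption is that the two groups would have followed parallel trends in the absence of the intervention.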