Data, Inference, and Decisions Unit 2 – Probability Theory & Distributions
Probability theory and distributions form the backbone of statistical analysis, providing tools to quantify uncertainty and make informed decisions. This unit covers key concepts like random variables, probability distributions, and expected values, laying the groundwork for understanding complex statistical phenomena.
From basic probability rules to advanced theorems, students learn to model real-world events and draw meaningful conclusions from data. The unit also explores various probability distributions, their applications, and common pitfalls in statistical reasoning, preparing students for practical data analysis tasks.
Probability quantifies the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
Random variables assign numerical values to the outcomes of a random experiment and can be discrete (countable) or continuous (uncountable)
Probability distributions describe the probabilities of different outcomes for a random variable
Discrete distributions (binomial, Poisson)
Continuous distributions (normal, exponential)
Expected value represents the average outcome of a random variable over many trials, calculated as the sum of each outcome multiplied by its probability
Variance and standard deviation measure the spread or dispersion of a probability distribution; higher values indicate greater variability in the outcomes
Independence means the occurrence of one event does not affect the probability of another event occurring
Conditional probability calculates the probability of an event given that another event has already occurred, denoted P(A|B)
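As a concrete sketch of these definitions, the following Python snippet (using an assumed fair six-sided die as the example, not one from the unit) computes an expected value, variance, standard deviation, and a conditional probability directly from a probability distribution:

```python
# Minimal sketch, assuming a fair six-sided die as the random variable X
import numpy as np

outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1 / 6)                                      # every outcome is equally likely

expected_value = np.sum(outcomes * probs)                      # E[X] = sum of x * P(x)
variance = np.sum((outcomes - expected_value) ** 2 * probs)    # Var(X) = E[(X - E[X])^2]
std_dev = np.sqrt(variance)                                    # spread around E[X]
print(expected_value, variance, std_dev)                       # 3.5, ~2.917, ~1.708

# Conditional probability: P(roll is 6 | roll is even) = P(6 and even) / P(even)
p_even = probs[outcomes % 2 == 0].sum()
p_six_and_even = probs[outcomes == 6].sum()
print(p_six_and_even / p_even)                                 # 1/3
```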
Probability Basics
Sample space (S) represents the set of all possible outcomes for a random experiment
An event (E) is a subset of the sample space containing one or more outcomes
Probability of an event P(E) is the sum of the probabilities of all outcomes in that event
Complement of an event (E') includes all outcomes not in the event; P(E') = 1 - P(E)
Mutually exclusive events cannot occur simultaneously; P(A and B) = 0
Exhaustive events cover all possible outcomes in the sample space; their probabilities sum to 1
Probability axioms state that probabilities must be non-negative, the probability of the sample space is 1, and the probability of the union of mutually exclusive events is the sum of their individual probabilities
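A small sketch of these sample-space ideas, assuming two flips of a fair coin as the experiment:

```python
# Minimal sketch, assuming the experiment is two flips of a fair coin
from itertools import product

sample_space = list(product("HT", repeat=2))              # S = {HH, HT, TH, TT}
p = {outcome: 1 / len(sample_space) for outcome in sample_space}

event = [o for o in sample_space if "H" in o]             # E = at least one head
p_event = sum(p[o] for o in event)                        # P(E) = sum of outcome probabilities
p_complement = 1 - p_event                                # P(E') = 1 - P(E)

assert abs(sum(p.values()) - 1) < 1e-12                   # axiom: P(S) = 1
print(p_event, p_complement)                              # 0.75, 0.25
```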
Types of Distributions
Bernoulli distribution models a single trial with two possible outcomes (success or failure) with probability p for success and 1-p for failure
Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials with the same success probability
Characterized by parameters n (number of trials) and p (success probability)
Probability mass function: P(X = k) = C(n, k) · p^k · (1 - p)^(n - k), where C(n, k) = n! / (k!(n - k)!)
Poisson distribution models the number of rare events occurring in a fixed interval of time or space
Characterized by parameter λ (average rate of events)
Probability mass function: P(X = k) = (λ^k · e^(-λ)) / k!
Normal (Gaussian) distribution is a continuous probability distribution with a symmetric bell-shaped curve
Characterized by parameters μ (mean) and σ (standard deviation)
Probability density function: f(x) = (1 / √(2πσ²)) · e^(-(x - μ)² / (2σ²))
Exponential distribution models the time between events in a Poisson process
Characterized by parameter λ (rate)
Probability density function: f(x) = λ · e^(-λx) for x ≥ 0
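The formulas above can be evaluated with scipy.stats; the parameter values below are assumed purely for illustration:

```python
# Minimal sketch, assuming arbitrary parameter values for illustration
from scipy import stats

# Binomial: n = 10 trials, p = 0.3, probability of exactly 4 successes
print(stats.binom.pmf(4, n=10, p=0.3))

# Poisson: average rate lambda = 2 per interval, probability of exactly 3 events
print(stats.poisson.pmf(3, mu=2))

# Normal: mu = 0, sigma = 1, density at x = 1.5
print(stats.norm.pdf(1.5, loc=0, scale=1))

# Exponential: rate lambda = 0.5; scipy parameterizes by scale = 1 / lambda
print(stats.expon.pdf(2, scale=1 / 0.5))
```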
Probability Rules and Theorems
Addition rule calculates the probability of the union of two events: P(A or B) = P(A) + P(B) - P(A and B)
For mutually exclusive events, P(A or B) = P(A) + P(B)
Multiplication rule calculates the probability of the intersection of two events: P(A and B) = P(A) × P(B|A)
For independent events, P(A and B) = P(A) × P(B)
Bayes' theorem updates the probability of an event based on new evidence: P(A|B) = P(B|A) × P(A) / P(B) (see the numerical sketch at the end of this section)
Law of total probability calculates the probability of an event by partitioning the sample space into mutually exclusive and exhaustive events: P(B) = Σᵢ P(Aᵢ) × P(B|Aᵢ)
Central Limit Theorem states that the sum or average of a large number of independent, identically distributed random variables with finite variance will be approximately normally distributed, regardless of the underlying distribution
Chebyshev's inequality provides an upper bound for the probability that a random variable deviates from its mean by more than a certain amount: P(|X - μ| ≥ kσ) ≤ 1/k² for k > 0
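A numerical sketch of Bayes' theorem, the law of total probability, and the Central Limit Theorem; the diagnostic-test numbers below are assumed for illustration:

```python
# Minimal sketch, assuming made-up numbers for a diagnostic-test style problem
import numpy as np

p_d = 0.01              # prior P(D): prevalence of the condition
p_t_given_d = 0.95      # P(T|D): probability of a positive test given the condition
p_t_given_not_d = 0.05  # P(T|not D): false-positive rate

# Law of total probability: P(T) = P(T|D)P(D) + P(T|not D)P(not D)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)
# Bayes' theorem: P(D|T) = P(T|D) * P(D) / P(T)
print(p_t_given_d * p_d / p_t)                   # ~0.161 despite the highly sensitive test

# Central Limit Theorem: means of samples from a skewed (exponential) distribution
# cluster approximately normally around the true mean
rng = np.random.default_rng(0)
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean(), sample_means.std())   # close to 1 and 1/sqrt(50)
```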
Descriptive Statistics
Measures of central tendency summarize the center or typical value of a dataset
Mean (average): the sum of all values divided by the number of observations
Median: the middle value when the dataset is ordered from lowest to highest
Mode: the most frequently occurring value in the dataset
Measures of dispersion quantify the spread or variability of a dataset
Range: the difference between the maximum and minimum values
Variance: the average squared deviation from the mean, computed for a sample as Σᵢ (xᵢ - x̄)² / (n - 1)
Standard deviation: the square root of the variance
Skewness measures the asymmetry of a distribution
Positive skew (right-tailed) has a longer tail on the right side of the distribution
Negative skew (left-tailed) has a longer tail on the left side of the distribution
Kurtosis measures the heaviness of the tails and peakedness of a distribution compared to a normal distribution
Leptokurtic (heavy-tailed) has more outliers and a higher peak than a normal distribution
Platykurtic (light-tailed) has fewer outliers and a lower peak than a normal distribution
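A minimal sketch of these descriptive statistics, assuming a small made-up dataset:

```python
# Minimal sketch, assuming a small made-up dataset
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 5, 5, 5, 7, 8, 12])

print(np.mean(data))                   # mean
print(np.median(data))                 # median
values, counts = np.unique(data, return_counts=True)
print(values[np.argmax(counts)])       # mode: most frequent value (5)
print(np.ptp(data))                    # range = max - min
print(np.var(data, ddof=1))            # sample variance (ddof=1 divides by n - 1)
print(np.std(data, ddof=1))            # sample standard deviation
print(stats.skew(data))                # positive -> right-tailed
print(stats.kurtosis(data))            # excess kurtosis relative to a normal distribution
```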
Inferential Statistics
Hypothesis testing evaluates claims about population parameters using sample data
Null hypothesis (H0) represents the status quo or no effect
Alternative hypothesis (Ha) represents the research claim or expected effect
P-value: the probability of observing the sample data or more extreme results if the null hypothesis is true
Significance level (α): the threshold for rejecting the null hypothesis (commonly 0.05)
Confidence intervals estimate a range of plausible values for a population parameter with a certain level of confidence (e.g., 95%)
Calculated as the sample statistic ± margin of error
Margin of error depends on the sample size, variability, and desired confidence level
Sampling distributions describe the variability of a sample statistic over repeated samples from the same population
Central Limit Theorem implies that the sampling distribution of the mean is approximately normal for large sample sizes
Type I error (false positive) occurs when rejecting a true null hypothesis
Controlled by the significance level (α)
Type II error (false negative) occurs when failing to reject a false null hypothesis
Depends on the sample size, effect size, and power of the test
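A sketch of this workflow using a one-sample t-test and a 95% confidence interval; the data are simulated and α = 0.05 is assumed:

```python
# Minimal sketch, assuming simulated data and a significance level of 0.05
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.4, scale=2.0, size=40)    # simulated sample

# H0: population mean = 10  vs  Ha: population mean != 10
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print(t_stat, p_value)
print("reject H0" if p_value < 0.05 else "fail to reject H0")

# 95% confidence interval: sample mean +/- t critical value * standard error
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print(ci)
```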
Applications in Data Analysis
Probability distributions model real-world phenomena and help make predictions
Binomial distribution models the number of defective items in a manufacturing process
Poisson distribution models the number of customer arrivals in a queue
Normal distribution models the distribution of heights, weights, or IQ scores in a population
Hypothesis testing and confidence intervals inform decision-making and support drawing conclusions from data
A/B testing compares the performance of two versions of a website or app
Clinical trials evaluate the effectiveness and safety of new medical treatments
Quality control ensures that products meet specified standards
Bayesian inference updates prior beliefs about parameters based on observed data
Used in machine learning for classification and regression tasks
Helps quantify uncertainty and make probabilistic predictions
Simulation and resampling methods (e.g., bootstrap) estimate the properties of estimators and test statistics without relying on analytical formulas
Useful when the sampling distribution is unknown or the assumptions are violated
Provides a flexible, though computationally intensive, approach to statistical inference
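A basic percentile bootstrap for the mean, sketched on simulated (assumed) data:

```python
# Minimal sketch, assuming simulated skewed data; percentile bootstrap for the mean
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=3.0, size=60)

n_boot = 10_000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()   # resample with replacement
    for _ in range(n_boot)
])

# 95% percentile bootstrap confidence interval for the mean
print(np.percentile(boot_means, [2.5, 97.5]))
```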
Common Pitfalls and Misconceptions
Confusing probability with odds or likelihood
Probability is a number between 0 and 1, while odds are the ratio of the probability of an event to the probability of its complement (e.g., probability 0.75 corresponds to odds of 3:1)
Likelihood is a function of the parameters given the data, not a probability of the parameters (see the sketch at the end of this section)
Misinterpreting p-values and statistical significance
A small p-value does not necessarily imply practical significance or a large effect size
Failing to reject the null hypothesis does not prove that it is true
Assuming that independence always holds or that correlation implies causation
Many real-world events are dependent or conditionally dependent
Correlation measures the association between variables but does not establish a causal relationship
Neglecting the assumptions of statistical tests or models
Normality, independence, and homogeneity of variance are common assumptions
Violating assumptions can lead to invalid conclusions or biased estimates
Overfitting models to noise in the data or underfitting by ignoring important predictors
Overfitting leads to poor generalization and performance on new data
Underfitting results in high bias and missed patterns in the data
Relying too heavily on point estimates without considering uncertainty or variability
Confidence intervals and credible intervals provide a range of plausible values
Sensitivity analysis explores how the results change under different assumptions or scenarios
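Returning to the probability/odds/likelihood distinction above, a short sketch with assumed numbers:

```python
# Minimal sketch, assuming arbitrary numbers, contrasting probability, odds, and likelihood
from scipy import stats

p = 0.75
odds = p / (1 - p)                   # probability 0.75 corresponds to odds of 3:1
print(odds, odds / (1 + odds))       # 3.0, and back to 0.75

# Likelihood: L(p) = P(observed data | p), viewed as a function of the parameter p;
# here, 7 successes in 10 Bernoulli trials evaluated at two candidate values of p
print(stats.binom.pmf(7, n=10, p=0.5), stats.binom.pmf(7, n=10, p=0.7))
```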