105 cheat sheet 2
wk 5 - wk 12
Bernoulli
A single trial that results in either a success or a failure
X ~ Bernoulli (p) / Ber (p)
~ stands for "follows"
p cannot be exactly 0 or 1 (so 0 < p < 1)
If p were 0 or 1, the outcome would be certain rather than random, so those values are never valid
Related Formulas
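A minimal sketch of a Bernoulli(p) variable, assuming NumPy is available; the value of p is just an example. The PMF is P(X = 1) = p and P(X = 0) = 1 - p.

```python
import numpy as np

p = 0.3                          # assumed success probability, 0 < p < 1
rng = np.random.default_rng(0)

# Simulate 10,000 Bernoulli trials: each draw is 1 (success) or 0 (failure)
samples = rng.binomial(n=1, p=p, size=10_000)

print("Empirical mean (≈ p):", samples.mean())
print("Empirical variance (≈ p(1-p)):", samples.var())
```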
Binomial
A series of independent trials where you observe the number of successes
Independent trials are like flipping a balanced coin several times and recording whether each flip gives heads or tails.
Each flip's result is unrelated to any previous or future flip's result, which is why the trials are called independent.
X ~ Binomial (n,p) or X ~ Bin (n,p)
~ stands for "follows"
Related Formulas
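A small sketch of the Binomial(n, p) PMF, P(X = k) = C(n, k) p^k (1 - p)^(n - k), computed directly with math.comb; n and p are assumed example values (10 fair coin flips).

```python
from math import comb

n, p = 10, 0.5                     # example: 10 flips of a fair coin

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 7 heads in 10 fair flips
print(binom_pmf(7, n, p))                               # ≈ 0.117
# The PMF sums to 1 over k = 0..n
print(sum(binom_pmf(k, n, p) for k in range(n + 1)))    # ≈ 1.0
```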
Geometric
The total number of trials is not fixed; we observe how many trials it takes to get the first success
All trials need to be independent
X ~ Geometric (p) / Geo (p)
~ stands for "follows"
Related Formulas
where p is the probability of success on the final trial and (1-p)^(x-1) is the probability that the first x-1 trials are all failures
Memoryless property
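A sketch of the Geometric(p) PMF, P(X = x) = (1 - p)^(x - 1) p, plus a quick numerical check of the memoryless property P(X > s + t | X > s) = P(X > t); the values of p, s, and t are assumptions for illustration.

```python
p = 0.2                            # assumed success probability per trial

def geom_pmf(x, p):
    """P(X = x): first success on trial x (x = 1, 2, 3, ...)."""
    return (1 - p) ** (x - 1) * p

def geom_tail(x, p):
    """P(X > x): the first x trials are all failures."""
    return (1 - p) ** x

s, t = 3, 5
lhs = geom_tail(s + t, p) / geom_tail(s, p)   # P(X > s + t | X > s)
rhs = geom_tail(t, p)                         # P(X > t)
print(lhs, rhs)                               # equal: memoryless
```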
Uniform Probability
X~ Uniform (a,b) / Unif (a,b)
Related Formulas
Normal/Gaussian Probability
X ~ Normal (μ,σ^2) or N(μ,σ^2)
Range from (-∞,∞)
Related Formula
Related Formulas
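A short sketch of the Uniform(a, b) and Normal(μ, σ²) densities using scipy.stats (assumed available); a, b, μ, and σ are example values.

```python
from scipy.stats import uniform, norm

a, b = 2.0, 5.0                    # Uniform(a, b): density 1/(b - a) on [a, b]
mu, sigma = 0.0, 1.5               # Normal(mu, sigma^2)

# scipy's uniform is parameterised by loc=a and scale=b-a
print(uniform.pdf(3.0, loc=a, scale=b - a))   # 1/(5-2) ≈ 0.333
print(uniform.pdf(6.0, loc=a, scale=b - a))   # 0.0 (outside [a, b])

# Normal density and CDF; standardising gives the same CDF value
x = 1.0
print(norm.pdf(x, loc=mu, scale=sigma))
print(norm.cdf(x, loc=mu, scale=sigma), norm.cdf((x - mu) / sigma))
```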
Variance
Determines how spread out the values of a random variable are
Related Formula
Properties of Variance:
Adding a constant doesn't change the variance: Var(X + c) = Var(X)
If X and Y are independent, Var(X + Y) = Var(X) + Var(Y)
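A quick numerical illustration of these two properties (assuming NumPy), using simulated independent X and Y: adding a constant leaves the variance unchanged, and variances of independent variables add.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 2, size=100_000)     # independent samples
Y = rng.normal(5, 3, size=100_000)
c = 10.0

print(np.var(X + c), np.var(X))              # adding a constant: unchanged
print(np.var(X + Y), np.var(X) + np.var(Y))  # independent: variances add
```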
Covariance
Measures how two random variables change together
Covariance Properties
Related Formulas
The formulas above hold because of this property:
If Cov(Y1, Y2) = 0, the variables may or may not be independent
Independence implies zero covariance, but zero covariance does not imply independence
The largest ρ is 1, and the smallest ρ is -1
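A sketch (assuming NumPy) showing that zero covariance does not imply independence: Y = X² is completely determined by X, yet the covariance and correlation are approximately 0 because X is symmetric around 0.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=200_000)   # symmetric around 0
Y = X ** 2                           # fully dependent on X

cov = np.cov(X, Y)[0, 1]
rho = np.corrcoef(X, Y)[0, 1]
print(cov, rho)                      # both ≈ 0, yet X and Y are clearly dependent
```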
Regression
Regression is a type of supervised learning (trained on labeled data) algorithm used in statistics and machine learning to model the relationship between a dependent variable (also called the target or response) and one or more independent variables (also called predictors or features). The goal of regression is to predict the value of the dependent variable based on the independent variables.
Predicts continuous outcomes, unlike classification, which is used for predicting discrete labels
Least Squares Estimators
Residual sum of squares (SSR)
How far the predicted values are from actual values
Smaller SSR means that your predictions are closer to actual values
If guesses are close, R^2 is high. This means that the model can explain more of the pattern in the data.
R^2 = 1 means that the model is perfect.
Related Formula
R^2 is called the coefficient of determination
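A minimal least-squares sketch (assuming NumPy): fit y ≈ b0 + b1·x, then compute SSR and R² from the residuals. The data are made up for illustration.

```python
import numpy as np

# Made-up data with a roughly linear trend plus noise
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)

# Least squares estimators for intercept b0 and slope b1
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
ssr = np.sum((y - y_hat) ** 2)        # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - ssr / sst                    # coefficient of determination

print(b0, b1, ssr, r2)
```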
MAE & RMSE
RMSE
RMSE is another common metric used to evaluate the difference between values predicted by a model and the actual values.
It gives more weight to larger errors than MAE because the errors are squared before averaging.
Interpretation:
RMSE gives a measure of how spread out the residuals (errors) are. Since the errors are squared, RMSE is more sensitive to large errors compared to MAE. Large differences between predicted and actual values will result in a significantly larger RMSE.
Lower RMSE values indicate better model performance.
RMSE is commonly used when large errors are particularly undesirable and should be penalized more.
MAE
Interpretation:
MAE gives a straightforward understanding of the error in the model, as it shows the average absolute difference between actual and predicted values.
Lower MAE values indicate better model performance.
MAE is not sensitive to outliers (since it only considers absolute differences), meaning that large errors do not disproportionately influence the result.
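A small sketch (assuming NumPy) computing both metrics on the same made-up predictions; note how the single large error inflates RMSE much more than MAE.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.1, 20.0])   # last prediction is a big miss

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # squaring penalises the big miss

print(mae, rmse)                                # RMSE >> MAE because of the outlier
```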
Statistical Inference
Using data to draw conclusions about something unknown (e.g. an unknown parameter)
There are two kinds:
Bayesian inference
Combines:
A prior belief about a parameter
The likelihood of observing the data
To produce a posterior (updated belief)
Frequentist inference
The parameter is fixed in the given scenario, but unknown to us
Through frequentist inference, we use the data to estimate this fixed value and to test assumptions about it.
Bayesian inference
Same posterior, same bayes numerator
Not same numerator, same posterior
Because posterior = numerator / evidence, even if the numerator is the same in two scenarios, if the total of all Bayes numerators (the evidence) is different, then the posteriors will be different.
Conditional Independence
Updating two data points
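A discrete sketch (plain Python) of a Bayesian update with two conditionally independent observations: under conditional independence the Bayes numerator is prior × likelihood of x1 × likelihood of x2, and dividing by the evidence (the total of all numerators) gives the posterior. All priors and likelihoods here are made up.

```python
# Two candidate parameter values (hypotheses) with made-up priors
prior = {"theta_A": 0.6, "theta_B": 0.4}

# Made-up likelihoods of each observation under each hypothesis;
# conditional independence lets us multiply them
lik_x1 = {"theta_A": 0.7, "theta_B": 0.2}
lik_x2 = {"theta_A": 0.5, "theta_B": 0.9}

# Bayes numerator: prior * P(x1 | theta) * P(x2 | theta)
numerator = {h: prior[h] * lik_x1[h] * lik_x2[h] for h in prior}

evidence = sum(numerator.values())            # total of all Bayes numerators
posterior = {h: numerator[h] / evidence for h in numerator}

print(posterior)                              # updated beliefs, sums to 1
```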
Integration
Differentiation
Frequentist Inference
3 ways, but not covering significance testing
[Estimates an exact value] Maximum Likelihood Estimation
[Estimates a range of values] Confidence interval
Significance Testing
Maximum Likelihood Estimation
Used to find the most likely value of a parameter, given the data
Intuitively similar to taking an average of the data, and it serves as the best single estimate of the true parameter.
To solve, maximize p(x | p) or, equivalently, the log-likelihood ln p(x | p)
If there are multiple independent data points, the likelihood factorises: p(x1, x2, x3, ... | p) = p(x1 | p) p(x2 | p) p(x3 | p) ...
Steps:
Find log likelihood
Find the stationary point (set the derivative of the log-likelihood to zero and solve for the parameter)
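A worked sketch (assuming NumPy) of these steps for made-up Bernoulli data: the log-likelihood is k ln p + (n - k) ln(1 - p), and its stationary point is the sample proportion k/n, which is confirmed here numerically over a grid.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # made-up Bernoulli data
k, n = x.sum(), x.size

def log_likelihood(p):
    # ln p(x1,...,xn | p) = k ln p + (n - k) ln(1 - p)
    return k * np.log(p) + (n - k) * np.log(1 - p)

# The stationary point of the log-likelihood is p_hat = k / n
p_grid = np.linspace(0.01, 0.99, 999)
p_numeric = p_grid[np.argmax(log_likelihood(p_grid))]

print(k / n, p_numeric)                        # both ≈ 0.7
```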
Confidence Interval
The range of values in which we believe the true value of the parameter lies, given the data
This tells you how precise your estimate is and gives a range of plausible values for the true average.
E.g. after constructing a 95% CI, you are 95% confident that the average weight loss is between x and y, because if we repeated the experiment 100 times, about 95 of the resulting intervals would contain the true average.
CI does not mean 95% of people lost between x-y value weight.
A CI focuses on the uncertainty about the true value: it gives a range in which the true population mean could lie, not a range in which each individual person's data fall. The confidence level reflects the reliability of the estimation method used to estimate the true population mean, not the probability that the specific interval from one sample contains the true mean.
where 1 - α is the confidence coefficient (confidence level).
CI for normal data with known variance
CI for normal data with unknown variance
Follows a t-distribution
You’re trying to estimate the true average of something (like how much weight people lose after running every day for 3 months). Since you don’t have all the data from everyone, you have to make a guess.
The t-distribution helps you make that guess, but it knows your sample’s small, so it’s a little extra careful.
The degrees of freedom (df) tell you how much you can trust your guess — the bigger the df, the more you can trust it.
As df → ∞, the t-distribution becomes the standard normal
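A sketch of both cases (assuming SciPy): a z-interval when σ is known, and a t-interval with n - 1 degrees of freedom when σ is estimated from the sample. The data, the known σ, and the 95% level (1 - α = 0.95) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0])   # made-up sample
n, xbar = data.size, data.mean()
alpha = 0.05

# Known variance: use the standard normal quantile z_{1 - alpha/2}
sigma = 0.6                                    # assumed known population sd
z = stats.norm.ppf(1 - alpha / 2)
ci_known = (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

# Unknown variance: use the sample sd and the t quantile with n - 1 df
s = data.std(ddof=1)
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_unknown = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

print(ci_known)
print(ci_unknown)                              # wider: the t quantile > z quantile
```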
Classification
Response is categorical
Supervised: the model is trained on labelled data, which we randomly split into training and testing sets
Naive Bayes
Probabilistic classifier based on Bayes' Theorem. It’s used to classify data based on the likelihood of certain features or attributes, under the assumption that these features are independent of each other (feature independence)
To classify a sample, we calculate:
P(y) ⇒ Prior probability of y
Product of P(x_i | y) (a summation when working with log-likelihoods) ⇒ Likelihood of each feature x_i given y
Step 1: Estimate the Prior probability using CHD
Step 2: Estimate Likelihood for categorical features against each variable column
To find the likelihood of each feature, use MLE.
Binomial ⇒ categorical features
Normal ⇒ numerical features
Step 3: Estimate Likelihood for numeric features
Step 4: Calculate Bayes Numerators
Step 5: Make Prediction
If any likelihood is 0, the final score (the Bayes numerator) will be 0, which can unfairly eliminate a class because of a zero probability caused by limited data.
Hence, to solve this, people use smoothing (e.g. Laplace smoothing), which prevents probabilities from being exactly 0.
Laplace smoothing
Adds one to every count of each categorical feature value
Pretends that every possibility has happened at least once, and makes model more realistic
Probabilities are now non-zero
*Watch out for the joint likelihood (see below)
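A tiny Naive Bayes sketch for a single categorical feature with Laplace smoothing; the class labels, feature values, and counts are all made up, and only one feature is shown for brevity. Adding 1 to every count keeps every likelihood non-zero, and working with log probabilities avoids multiplying many small numbers.

```python
import math
from collections import Counter

# Made-up training data: (feature value, class label)
train = [("high", "yes"), ("high", "yes"), ("low", "no"),
         ("low", "no"), ("high", "no"), ("low", "yes")]

labels = [y for _, y in train]
classes = sorted(set(labels))
values = sorted({x for x, _ in train})

# Step 1: prior probability of each class
prior = {c: labels.count(c) / len(labels) for c in classes}

# Step 2: Laplace-smoothed likelihoods, adding 1 to every (value, class) count
counts = Counter(train)
likelihood = {
    c: {v: (counts[(v, c)] + 1) / (labels.count(c) + len(values)) for v in values}
    for c in classes
}

def predict(x):
    # Bayes numerator in log space: log prior + log likelihood
    scores = {c: math.log(prior[c]) + math.log(likelihood[c][x]) for c in classes}
    return max(scores, key=scores.get)

print(predict("high"))   # class with the larger Bayes numerator
```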
Evaluation Metrics
Classification
Accuracy = (TP + TN) / (TP + TN + FP + FN)
To be used in balanced datasets
Precision = TP/(TP+FP)
High precision means few false positives
Good when false positives are costly
Recall = TP / (TP + FN)
High recall = few false negatives
Good when missing a positive is costly
F1-score = 2* (Precision) * (Recall) / (Precision + Recall)
Combination of precision and recall
Balances both FP and FN
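A small sketch computing accuracy, precision, recall, and F1 directly from made-up confusion-matrix counts.

```python
# Made-up confusion-matrix counts
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)                    # few false positives -> high
recall = TP / (TP + FN)                       # few false negatives -> high
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```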
Regression
Measures the difference between the true and predicted set
RMSE
MAE
Classification vs Regression
Response variable ⇒ output
In classification, the response variable is categorical (e.g., "spam" or "not spam").
In regression, the response variable is continuous (e.g., predicting house prices).
Predictor variables ⇒ the variables you feed into the model to obtain the response
Same role in both classification and regression
Eval metrics ⇒ Refer to above.
4 Outcomes
True Positives (TP): The model correctly predicts the positive class (predicted positive, actual positive).
True Negatives (TN): The model correctly predicts the negative class (predicted negative, actual negative).
False Positives (FP): The model incorrectly predicts the positive class (predicted positive, actual negative) — this is also known as a Type I error.
Model overestimates number of positives
Healthy patient is predicted to have a disease.
False Negatives (FN): The model incorrectly predicts the negative class (predicted negative, actual positive) — this is also known as a Type II error.
Model underestimates the number of positives / misses positive cases
Sick patient is predicted to be healthy
Clustering
Unsupervised learning: there are no predefined classes or labels; the algorithm groups the data on its own
K-means clustering
Good for well-separated clusters, where each point belongs to one and ONLY one cluster (a concept called hard clustering)
Hard clustering ⇒ each assignment is deterministic, non-probabilistic
However, real-world clusters may overlap, so cluster assignment is highly uncertain for points that lie between clusters
K-means cannot model this uncertainty
Soft clustering ⇒ each point has a probability of belonging to each cluster, and is modelled as a mixture of clusters
Example
Gaussian mixture models
Instead of assuming a single normal bell curve, a GMM uses multiple overlapping bell curves
GMMs work well with messy real-world data and are considered universal approximators
Universal approximators: Given enough bell curves, it can model any shape of data distribution.
Whether it is a nice smooth curve is a different story.
Soft clusters, hence can handle overlapping clusters.
Probabilistic, adapts to the shape of the data using variance.
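A sketch contrasting hard and soft assignments on made-up overlapping blobs, assuming scikit-learn is installed: KMeans returns exactly one label per point, while GaussianMixture returns a probability for each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Made-up data: two overlapping Gaussian blobs in 2-D
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.5, 1.0, size=(200, 2))])

# Hard clustering: each point gets exactly one cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])

# Soft clustering: each point gets a probability for every cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X[:5]).round(2))       # rows sum to 1
```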