105 cheat sheet 2

wk 5 - wk 12

Bernoulli

  • You either get success or failure

  • X ~ Bernoulli (p) / Ber (p)

    • ~ stands for "follows" (X is distributed as)

  • p cannot be equal to 0 or 1 (0 < p < 1)

    • Because with p = 0 or p = 1 the outcome is certain, so there is nothing random to model

  • Related Formulas

$$\text{Mean } (\mu): E[X] = p$$
$$\text{Variance } (\sigma^2): Var(X) = E[X^2] - \mu^2 = p(1-p)$$

Binomial

  • A series of independent trials where you observe the number of successes

    • Independent trials are like flipping a balanced coin multiple times to observe whether heads or tails is obtained.

      • Each flip's result is not linked to the result of any previous or future flip, so the trials are independent

  • X ~ Binomial (n,p) or X ~ Bin (n,p)

    • ~ stands for "follows" (X is distributed as)

  • Related Formulas

$$P(X = k) = {n \choose k} p^k (1-p)^{n-k}$$
$$\text{Mean } (\mu): E[X] = np$$
$$\text{Variance } (\sigma^2): Var(X) = np(1-p)$$
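As a quick check of the binomial formulas, the sketch below builds the PMF directly from the definition and compares the mean and variance it implies with np and np(1-p). The values n = 10 and p = 0.3 are illustrative assumptions, not from the notes.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values only
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the PMF itself
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2

print(sum(pmf))                   # should be very close to 1
print(mean, n * p)                # both sides of E[X] = np
print(variance, n * p * (1 - p))  # both sides of Var(X) = np(1-p)
```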

Geometric

  • Total number of trials is not fixed; X is the number of trials needed to get the first success

    • All trials need to be independent

  • X ~ Geometric (p) / Geo (p)

    • ~ stands for "follows" (X is distributed as)

  • Related Formulas

$$P(X = x) = (1-p)^{x-1} p$$
  • where p is the probability of the success on trial x and (1-p)^(x-1) is the probability of the x-1 failures before it

$$\text{Mean } (\mu) = E[X] = \frac{1}{p}$$
$$\text{Variance } (\sigma^2) = Var(X) = \frac{1-p}{p^2}$$
$$\text{Standard Deviation } (\sigma) = \sqrt{\frac{1-p}{p^2}}$$
  • Memoryless property

$$\text{Cumulative (tail) probability}: P(X \geq x) = (1-p)^{x-1}$$
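The memoryless property can be checked numerically. The sketch below assumes the usual statement P(X > s + t | X > s) = P(X > t) and uses the tail formula P(X ≥ x) = (1-p)^(x-1); the values of p, s and t are made up for illustration.

```python
def geometric_pmf(x, p):
    """P(X = x): x-1 failures followed by one success."""
    return (1 - p) ** (x - 1) * p

def geometric_tail(x, p):
    """P(X >= x): the first x-1 trials all fail."""
    return (1 - p) ** (x - 1)

p, s, t = 0.2, 3, 4  # illustrative values only

# P(X > k) is the same as P(X >= k + 1)
lhs = geometric_tail(s + t + 1, p) / geometric_tail(s + 1, p)  # P(X > s+t | X > s)
rhs = geometric_tail(t + 1, p)                                 # P(X > t)

# Mean approximated by summing x * P(X = x) over many terms
approx_mean = sum(x * geometric_pmf(x, p) for x in range(1, 2000))

print(lhs, rhs)            # the two sides of the memoryless property
print(approx_mean, 1 / p)  # compare with E[X] = 1/p
```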

Uniform Probability

  • X ~ Uniform (a,b) / Unif (a,b)

  • Related Formulas

$$\text{Mean } (\mu) = E[X] = \frac{a+b}{2}$$
$$\text{Variance } (\sigma^2) = Var(X) = \frac{1}{12}(b-a)^2 = E[X^2] - \mu^2, \quad \text{where } E[X^2] = \int_{a}^{b} x^2 \, f(x) \, dx$$

Normal/Gaussian Probability

  • X ~ Normal (μ,σ^2) or N(μ,σ^2)

  • Range from (-∞,∞)

  • Related Formula

$$\text{Standardisation } (Z) = \frac{X - \mu}{\sigma}$$
$$\text{Top } p \text{ cutoff}: P(Z \geq z) = p$$
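A small sketch of both formulas using `statistics.NormalDist` from the Python standard library. The values of μ, σ, x and the top-5% threshold are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical mean and standard deviation
x = 130.0                # hypothetical observation

# Standardisation: Z = (X - mu) / sigma
z = (x - mu) / sigma
upper_tail = 1 - NormalDist().cdf(z)  # P(Z >= z)

# Top p cutoff: find z with P(Z >= z) = p, then convert back to the X scale
top_p = 0.05
z_cut = NormalDist().inv_cdf(1 - top_p)
x_cut = mu + z_cut * sigma

print(z, upper_tail)
print(z_cut, x_cut)
```

Converting the cutoff back with x = μ + zσ is just the standardisation formula rearranged.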

Joint Probability Mass Function

  • Related Formulas

$$\text{Marginal Probability}: P(Y_1 = y_1) = \sum_{y_2} P(Y_1 = y_1, Y_2 = y_2)$$
$$\text{Calculating the table}: P(Y_1 = y_1, Y_2 = y_2) = \frac{{n_1 \choose y_1}{n_2 \choose y_2}{\text{total} - n_1 - n_2 \choose x - (y_1 + y_2)}}{{\text{total} \choose x}}$$
$$\text{Conditional Probability}: P(Y_1 \mid Y_2) = \frac{P(Y_1, Y_2)}{P(Y_2)}$$
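The sketch below builds a joint PMF table and then reads a marginal and a conditional probability off it. It assumes the table describes drawing x items without replacement from three groups of sizes n1, n2 and total - n1 - n2; the counts are hypothetical. Exact fractions are used so the table visibly sums to 1.

```python
from fractions import Fraction
from math import comb

def table_entry(y1, y2, n1, n2, total, x):
    """P(Y1 = y1, Y2 = y2) when x items are drawn without replacement."""
    rest = x - y1 - y2  # items that must come from the third group
    if rest < 0:
        return Fraction(0)
    ways = comb(n1, y1) * comb(n2, y2) * comb(total - n1 - n2, rest)
    return Fraction(ways, comb(total, x))

n1, n2, total, x = 3, 2, 9, 3  # hypothetical counts
joint = {(y1, y2): table_entry(y1, y2, n1, n2, total, x)
         for y1 in range(n1 + 1) for y2 in range(n2 + 1)}

def marginal_y2(y2):
    """P(Y2 = y2): sum the joint table over every value of y1."""
    return sum(p for (_, b), p in joint.items() if b == y2)

def conditional(y1, y2):
    """P(Y1 = y1 | Y2 = y2) = joint / marginal."""
    return joint[(y1, y2)] / marginal_y2(y2)

print(sum(joint.values()))  # total probability of the table
print(marginal_y2(1))
print(conditional(1, 1))
```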

Variance

  • Determines how spread out the values are

  • Related Formula

$$Var(X) = E[(X - \mu)^2] = E[X^2] - \mu^2$$
  • Properties of Variance:

    • Adding a constant b does not change the variance; multiplying by a constant a scales it by a^2

$$Var(aX + b) = a^2 Var(X)$$
  • If X and Y are independent,

$$Var(X+Y) = Var(X-Y) = Var(X) + Var(Y)$$

Covariance and Independence

  • Covariance

    • How two things change together

  • Covariance Properties

$$Cov(X, X) = Var(X)$$
$$Cov(X, Y) = E[XY] - \mu_X \mu_Y$$
$$Cov(X+Y, Z) = Cov(X, Z) + Cov(Y, Z)$$
$$Cov(aX+b, cY+d) = ac \, Cov(X, Y)$$
$$Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y)$$
  • Related Formulas

$$Cov(Y_1, Y_2) = \rho \sqrt{Var(Y_1)} \sqrt{Var(Y_2)}$$
$$Cov(Y_1, Y_2) = E[Y_1 Y_2] - \mu_{Y_1} \mu_{Y_2} = E[Y_1 Y_2] - E[Y_1] E[Y_2]$$
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
$$E[Y_1 Y_2] = \sum_{y_1} \sum_{y_2} y_1 y_2 \, p(y_1, y_2)$$
$$\rho = \frac{Cov(Y_1, Y_2)}{\sigma_{Y_1} \sigma_{Y_2}}$$

The formulas for ρ above rely on the standard deviation:

$$\text{Standard Deviation of X}: \sigma = \sqrt{Var(X)}$$
  • If Cov(Y1,Y2) = 0, then the variables may be independent, but this is not guaranteed

    • Independence implies zero covariance, but zero covariance does not imply independence

  • The largest ρ is 1, and the smallest ρ is -1
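To see the covariance and correlation formulas in action, the sketch below computes E[Y1Y2], Cov(Y1, Y2) and ρ from a small joint PMF. The table is made up for illustration.

```python
from math import sqrt

# Hypothetical joint PMF p(y1, y2)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

e_y1 = sum(y1 * p for (y1, _), p in joint.items())
e_y2 = sum(y2 * p for (_, y2), p in joint.items())
e_y1y2 = sum(y1 * y2 * p for (y1, y2), p in joint.items())

# Var(Y) = E[Y^2] - mu^2
var_y1 = sum(y1**2 * p for (y1, _), p in joint.items()) - e_y1**2
var_y2 = sum(y2**2 * p for (_, y2), p in joint.items()) - e_y2**2

cov = e_y1y2 - e_y1 * e_y2                  # Cov = E[Y1 Y2] - E[Y1] E[Y2]
rho = cov / (sqrt(var_y1) * sqrt(var_y2))   # rho = Cov / (sigma_1 * sigma_2)

print(cov, rho)
```

Here the two variables tend to take the same value, so the covariance is positive and ρ lies between 0 and 1.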

Regression

  • Regression is a type of supervised learning (trained on labeled data) algorithm used in statistics and machine learning to model the relationship between a dependent variable (also called the target or response) and one or more independent variables (also called predictors or features). The goal of regression is to predict the value of the dependent variable based on the independent variables.

  • Predicts continuous outcomes, unlike classification, which is used for predicting discrete labels

  • Least Squares Estimators

$$S_{xx} = \sum (x - \bar x)^2 = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2$$
$$S_{xy} = \sum (x - \bar x)(y - \bar y) = \sum xy - \frac{1}{n} \sum x \sum y$$
$$\text{Slope}: B_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - n \bar x \bar y}{\sum x^2 - n \bar x^2}$$
$$\text{Constant}: B_0 = \bar y - B_1 \bar x$$
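A minimal sketch of the least squares estimators using the shortcut sums. The five data points are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up predictor values
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up responses
n = len(xs)

# Shortcut forms of S_xx and S_xy
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

b1 = s_xy / s_xx                      # slope
b0 = sum(ys) / n - b1 * sum(xs) / n   # constant: y_bar - B1 * x_bar

def predict(x):
    """Fitted line y_hat = B0 + B1 * x."""
    return b0 + b1 * x

print(s_xx, s_xy, b1, b0)
print(predict(6.0))
```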

Residual SSR

  • How far the predicted values are from actual values

  • Smaller SSR means that your predictions are closer to actual values

  • If guesses are close, R^2 is high. This means that the model can explain more of the pattern in the data.

    • R^2 = 1 means that the model is perfect.

  • Related Formula

    • r^2 (R^2) is called the coefficient of determination

$$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = (1 - r^2) S_{yy}, \quad \text{where } S_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2$$
$$r^2 = 1 - \frac{SSR}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}}$$
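The sketch below fits a least squares line to made-up data and computes r^2 two ways, once from the residual sum of squares and once from the S sums, to show that they agree. It takes SSR to be the sum of squared differences between actual and fitted values.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
predictions = [b0 + b1 * x for x in xs]

# SSR: squared distance between actual and predicted values
ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predictions))

r2_from_ssr = 1 - ssr / s_yy
r2_from_sums = s_xy**2 / (s_xx * s_yy)

print(ssr, r2_from_ssr, r2_from_sums)
```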

MAE & RMSE

RMSE

RMSE is another common metric used to evaluate the difference between values predicted by a model and the actual values.

It gives more weight to larger errors than MAE because the errors are squared before averaging.

Interpretation:

  • RMSE gives a measure of how spread out the residuals (errors) are. Since the errors are squared, RMSE is more sensitive to large errors compared to MAE. Large differences between predicted and actual values will result in a significantly larger RMSE.

  • Lower RMSE values indicate better model performance.

  • RMSE is commonly used when large errors are particularly undesirable and should be penalized more.

MAE

Interpretation:

  • MAE gives a straightforward understanding of the error in the model, as it shows the average absolute difference between actual and predicted values.

  • Lower MAE values indicate better model performance.

  • MAE is less sensitive to outliers than RMSE (since it uses absolute rather than squared differences), meaning that large errors do not disproportionately influence the result.
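A short sketch of both metrics, assuming the standard definitions: MAE is the mean of |actual - predicted| and RMSE is the square root of the mean of (actual - predicted)^2. The numbers are made up; the second prediction list contains one large error to show how much more RMSE reacts.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3.0, 5.0, 2.0, 7.0]         # made-up true values
predicted = [2.0, 5.0, 4.0, 7.0]      # small errors only
with_outlier = [2.0, 5.0, 4.0, 17.0]  # same, but one error of 10

print(mae(actual, predicted), rmse(actual, predicted))
print(mae(actual, with_outlier), rmse(actual, with_outlier))
```

With the outlier, both metrics grow, but RMSE ends up much further above MAE because the error of 10 is squared before averaging.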

Statistical Inference

  • About using data to make conclusions about the unknown

  • There are two kinds:

    • Bayesian inference

      • Combines:

        • A prior belief about a parameter

        • The likelihood of observing the data

      • To produce a posterior (updated belief)

    • Frequentist inference

      • The parameter is fixed in the given scenario, but unknown to us

      • Through frequentist inference, we use data to estimate the fixed value and test assumptions about it.

Bayesian inference

$$P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$$

Same Bayes numerator, same posterior?

  • No: the same numerator does not guarantee the same posterior

  • Because posterior = numerator / evidence, even if the numerator is the same in two scenarios, the posteriors will differ whenever the total of all Bayes numerators (the evidence) differs.
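The point about numerators and evidence can be sketched as follows. The hypothesis names, priors and likelihoods are made up; in both scenarios H1 has the same Bayes numerator (0.5 × 0.2), but the evidence differs.

```python
def posteriors(priors, likelihoods):
    """Posterior for each hypothesis: Bayes numerator divided by the evidence."""
    numerators = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(numerators.values())  # total of all Bayes numerators
    return {h: num / evidence for h, num in numerators.items()}

priors = {"H1": 0.5, "H2": 0.5}  # hypothetical prior beliefs

# Same numerator for H1 in both scenarios, different likelihood for H2
scenario_a = posteriors(priors, {"H1": 0.2, "H2": 0.2})
scenario_b = posteriors(priors, {"H1": 0.2, "H2": 0.6})

print(scenario_a["H1"], scenario_b["H1"])
```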

Conditional Independence

Updating two data points

Integration

$$\int x^n \, dx = \frac{x^{n+1}}{n+1} + C \quad (n \neq -1)$$

Differentiation

$$\frac{d}{dx} x^n = n x^{n-1}$$

Frequentist Inference

3 ways, but not covering significance testing

  • [Estimates an exact value] Maximum Likelihood Estimation

  • [Estimates a range of values] Confidence interval

  • Significance Testing

  • Maximum Likelihood Estimation

    • Used to find the most likely value of a parameter, given the data

    • Often works out to a sample average (e.g. the sample proportion for Bernoulli data), and is used as the point estimate of the true parameter.

    • To solve, maximize p(x|p) or maximize ln p(x|p)

$$\theta^* = \arg\max_{\theta} \, p(x \mid p = \theta) = \arg\max_{\theta} \, \ln p(x \mid p = \theta)$$

Steps:

  1. Find the log likelihood, ln p(x|p)

  2. Find the stationary point: differentiate with respect to p, set the derivative to 0, and solve
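The two steps can be sketched for Bernoulli data, which is an assumed example rather than one taken from the notes. With k successes in n trials, the log likelihood is k ln θ + (n - k) ln(1 - θ), and setting its derivative to zero gives θ* = k/n. The code compares that closed form with a brute-force search over a grid.

```python
from math import log

data = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical Bernoulli outcomes (1 = success)

def log_likelihood(theta, xs):
    """Step 1: ln p(x | theta) for independent Bernoulli trials."""
    k = sum(xs)
    return k * log(theta) + (len(xs) - k) * log(1 - theta)

# Step 2: stationary point, k/theta - (n-k)/(1-theta) = 0  =>  theta = k/n
theta_closed = sum(data) / len(data)

# Numerical check: pick the grid value with the highest log likelihood
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_closed, theta_grid)
```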

  • Confidence Interval

    • Range of values in which we believe the true value of the parameter lies, given the data

    • This tells you how precise your estimate is and gives a range of plausible values for the true average.

      • I.e. after computing a 95% CI, you are 95% confident that the average weight loss is between x and y, because if we repeated the experiment 100 times, about 95 of the resulting intervals would contain the true value.

$$P(\theta_{L} \leq \theta \leq \theta_{U}) = 1 - \alpha$$

where 1-α is the confidence coefficient (confidence level).

  • CI for normal data with known variance

  • CI for normal data with unknown variance

  • Follows a t-distribution

    • You’re trying to estimate the true average of something (like how much weight people lose after running every day for 3 months). Since you don’t have all the data from everyone, you have to make a guess.

    • The t-distribution helps you make that guess, but it knows your sample’s small, so it’s a little extra careful.

    • The degrees of freedom (df) tell you how much you can trust your guess — the bigger the df, the more you can trust it.

      • When df -> infinity, it becomes a standard normal
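For the known-variance case, a minimal sketch of a confidence interval is shown below. It assumes the standard interval x̄ ± z · σ/√n, where z is the standard normal quantile for 1 - α/2; the sample values and σ = 0.6 are hypothetical. The unknown-variance case would swap z for a t quantile with n - 1 degrees of freedom, which is not shown here because the standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """CI for the mean of normal data when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    centre = mean(sample)
    half_width = z * sigma / sqrt(len(sample))
    return centre - half_width, centre + half_width

# Hypothetical weight loss in kg for 9 people
sample = [2.1, 3.4, 1.9, 2.8, 3.0, 2.6, 2.2, 3.1, 2.9]
lower, upper = z_confidence_interval(sample, sigma=0.6)

print(lower, upper)
```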

Classification

  • Response is categorical

  • Supervised, as the model learns from labelled data, which we randomly split into training and testing sets

Naive Bayes

  • Probabilistic classifier based on Bayes' Theorem. It’s used to classify data based on the likelihood of certain features or attributes, under the assumption that these features are independent of each other given the class (feature independence)

To classify a sample, we calculate:

$$P(y \mid \text{features}) \propto P(y) \times \prod_i P(x_i \mid y)$$
  • P(y) ⇒ Prior probability of y

  • Product ⇒ Likelihood of each feature x_i given y, multiplied across all features

exp is the exponential function: exp(x) = e^x

Step 1: Estimate the Prior probability using CHD

Step 2: Estimate Likelihood for categorical features against each variable column

  • To find likelihood of features, use MLE.

    • Binomial - Categorical

    • Normal - Numerical

Step 3: Estimate Likelihood for numeric features

Step 4: Calculate Bayes Numerators

Step 5: Make Prediction

Laplace smoothing

  • Adds one to every count for the non-numerical (categorical) features

    • Pretends that every possibility has happened at least once, so one unseen value cannot zero out the whole Bayes numerator

    • Probabilities are now non-zero
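A minimal sketch of Naive Bayes on categorical features with Laplace smoothing. The tiny spam dataset and the feature names `link` and `offer` are made up for illustration, and the smoothed likelihood assumes the common add-one form (count + 1) / (class count + number of possible values).

```python
from collections import Counter

# Hypothetical training rows: (categorical features, label)
rows = [
    ({"link": "yes", "offer": "yes"}, "spam"),
    ({"link": "yes", "offer": "no"}, "spam"),
    ({"link": "no", "offer": "no"}, "not spam"),
    ({"link": "no", "offer": "no"}, "not spam"),
    ({"link": "yes", "offer": "no"}, "not spam"),
]
feature_values = {"link": ["yes", "no"], "offer": ["yes", "no"]}

class_counts = Counter(label for _, label in rows)
value_counts = Counter((label, f, v) for feats, label in rows for f, v in feats.items())

def likelihood(feature, value, label, smoothing=1):
    """P(x_i | y), with `smoothing` added to every count."""
    count = value_counts[(label, feature, value)]
    k = len(feature_values[feature])
    return (count + smoothing) / (class_counts[label] + smoothing * k)

def bayes_numerator(feats, label, smoothing=1):
    """Prior P(y) times the product of the feature likelihoods."""
    numerator = class_counts[label] / len(rows)
    for f, v in feats.items():
        numerator *= likelihood(f, v, label, smoothing)
    return numerator

def predict(feats, smoothing=1):
    """Pick the class with the largest Bayes numerator."""
    return max(class_counts, key=lambda label: bayes_numerator(feats, label, smoothing))

query = {"link": "yes", "offer": "yes"}
print(likelihood("offer", "yes", "not spam", smoothing=0))  # unsmoothed estimate
print(likelihood("offer", "yes", "not spam"))               # smoothed estimate
print(predict(query))
```

No "not spam" row has `offer = yes`, so the unsmoothed likelihood is zero and would wipe out that class's numerator regardless of the other features; smoothing keeps it small but non-zero.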

*watch out joint likelihood (see below)

Evaluation Metrics

Type
Methods

Classification

  • Accuracy = (No. of correct predictions) / (No. of total predictions)

    • Or, (TP+TN)/(TP+TN+FP+FN)

    • To be used in balanced datasets

  • Precision = TP/(TP+FP)

    • High precision means few false positives

    • Good when false positives are costly

  • Recall = TP/(TP+FN)

    • High recall = few false negatives

    • Good when missing a positive is costly

  • F1-score = 2* (Precision) * (Recall) / (Precision + Recall)

    • Combination of precision and recall

    • Balances both FP and FN

Regression

Measures the difference between the true and predicted set

  • RMSE

  • MAE

  • Classification vs Regression

    • Response variable ⇒ output

      • In classification, the response variable is categorical (e.g., "spam" or "not spam").

      • In regression, the response variable is continuous (e.g., predicting house prices).

    • Predictor variable ⇒ Variables that you feed into model to gain response

      • Same in both classification and regression

    • Eval metrics ⇒ Refer to above.

4 Outcomes

  • True Positives (TP): The model correctly predicts the positive class (predicted positive, actual positive).

  • True Negatives (TN): The model correctly predicts the negative class (predicted negative, actual negative).

  • False Positives (FP): The model incorrectly predicts the positive class (predicted positive, actual negative) — this is also known as a Type I error.

    • Model overestimates number of positives

    • Healthy patient is predicted to have a disease.

  • False Negatives (FN): The model incorrectly predicts the negative class (predicted negative, actual positive) — this is also known as a Type II error.

    • Model underestimates number of positives / misses a positive case

      • Sick patient is predicted to be healthy
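The four outcomes feed directly into the classification metrics. The sketch below counts TP, TN, FP and FN from two made-up label lists (1 = positive, 0 = negative) and then applies the accuracy, precision, recall and F1 formulas.

```python
actual    = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up true labels
predicted = [1, 0, 1, 0, 1, 0, 0, 1]  # made-up model predictions

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # predicted positive, actual positive
tn = sum(a == 0 and p == 0 for a, p in pairs)  # predicted negative, actual negative
fp = sum(a == 0 and p == 1 for a, p in pairs)  # Type I error
fn = sum(a == 1 and p == 0 for a, p in pairs)  # Type II error

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```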

Clustering

  • Unsupervised learning: groups data points without predefined classes

K-means clustering

  • Good for well separated clusters, where each point belongs to one and ONLY one cluster (a concept called hard clusters)

    • Hard clusters ⇒ Each cluster is deterministic, non-probabilistic

  • However, real-world data may overlap each other, causing cluster assignment to have high uncertainty for points in between clusters (soft clusters)

    • K-means cannot model this uncertainty.

    • Soft clusters ⇒ Each point has a probability in each cluster, and is a mixture of clusters.
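A minimal one-dimensional sketch of k-means, assuming the usual two alternating steps (assign each point to its nearest centre, then move each centre to the mean of its points). The points and starting centres are made up, and the function name `k_means_1d` is hypothetical.

```python
def k_means_1d(points, centres, iterations=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    for _ in range(iterations):
        # Assignment step: each point goes to exactly one cluster (hard clustering)
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        # Update step: move each centre to the mean of its points
        # (an empty cluster keeps its old centre)
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]  # two well separated groups
centres, clusters = k_means_1d(points, centres=[0.0, 5.0])

print(centres)
print(clusters)
```

Every point ends up in exactly one list, which is the hard-cluster behaviour described above: a point halfway between two centres would still be forced into one of them.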

[Figure: K-means example, iterations 1 to 4]

Gaussian mixture models

  • Instead of assuming a single normal distribution (one bell curve), a GMM combines multiple overlapping bell curves

  • GMMs work well with messy real-world data, considered as universal approximators

    • Universal approximators: Given enough bell curves, it can model any shape of data distribution.

      • Whether it is a nice smooth curve is a different story.

  • Soft clusters, hence can handle overlapping clusters.

  • Probabilistic, adapts to the shape of the data using variance.
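Soft clustering can be sketched by computing, for a single point, the probability that it belongs to each bell curve. The two-component mixture below (weights, means and standard deviations) is hypothetical and already fitted; estimating those parameters from data is not shown.

```python
from statistics import NormalDist

# Hypothetical 1D mixture: (weight, mean, standard deviation) per component
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

def responsibilities(x):
    """Probability that x belongs to each component (soft assignment)."""
    weighted = [w * NormalDist(mu, sd).pdf(x) for w, mu, sd in components]
    total = sum(weighted)
    return [v / total for v in weighted]

print(responsibilities(0.0))  # close to the first bell curve
print(responsibilities(2.0))  # halfway between the two bell curves
```

A point midway between the two means is split evenly across the clusters, which is the uncertainty that a hard assignment cannot express.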
