105 cheat sheet 2

wk 5 - wk 12

Bernoulli

Each trial has exactly two outcomes: success or failure
X ~ Bernoulli (p) / Ber (p)
- ~ stands for "follows" (is distributed as)
p cannot be 0 or 1, i.e. 0 < p < 1
- Because with p = 0 or p = 1 the outcome is certain, so nothing is random
Related Formulas

Mean (\mu): E[X] = p

Variance (\sigma^2) : Var(X) = E[X^2] - \mu^2 = p(1-p)

Binomial

A series of n independent trials where you count the number of successes
- Independent trials are like flipping a balanced coin multiple times to observe whether heads or tails is obtained.
  - Each flip's result is not linked to any previous or future flip's result, so the trials are called independent
X ~ Binomial (n,p) or X ~ Bin (n,p)
- ~ stands for "follows" (is distributed as)
Related Formulas

P (X=k) = {n\choose k} p^k (1-p)^{(n-k)}

Mean : E[X] = np

Variance (\sigma^2): Var(X) = np(1-p)
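
A minimal Python sketch to check the binomial formulas above numerically, using only the standard library. The values n = 10 and p = 0.3 are made-up illustration values, not from the notes.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # illustrative values (assumption)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the pmf.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2

print(sum(pmf))            # should be (very close to) 1
print(mean, n * p)         # the two values should agree
print(variance, n * p * (1 - p))
```

Setting n = 1 gives the Bernoulli case, where the same sums reduce to p and p(1-p).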

Geometric

Total number of trials is not fixed; X is the number of trials needed to get the first success
- All trials need to be independent
X ~ Geometric (p) / Geo (p)
- ~ stands for "follows" (is distributed as)
Related Formulas

P (X =x) = (1-p)^{x-1} p

where p is the probability of success on trial x and (1-p)^{x-1} is the probability of the x-1 failures before it

Mean (\mu) = E[X] = \frac{1}{p}

Variance (\sigma^2) = Var (X) = \frac{1-p}{p^2}

Standard Deviation ( \sigma) = \sqrt{\frac{1-p}{p^2}} = \frac{\sqrt{1-p}}{p}

Memoryless property

Tail Probability : P (X \geq x) = (1-p)^{x-1}
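
A small sketch of the geometric formulas, assuming p = 0.2 (a made-up value). It also checks the usual statement of the memoryless property, P(X > s + t | X > s) = P(X > t), which the notes name but do not write out.

```python
def geometric_pmf(x, p):
    """P(X = x): x - 1 failures followed by one success."""
    return (1 - p) ** (x - 1) * p

def geometric_tail(x, p):
    """P(X >= x): the first x - 1 trials are all failures."""
    return (1 - p) ** (x - 1)

p = 0.2  # illustrative success probability (assumption)

# The tail formula agrees with summing the pmf from x onwards
# (the infinite sum is truncated at 2000 terms).
x = 4
tail_by_sum = sum(geometric_pmf(k, p) for k in range(x, 2000))

# Memoryless check, using P(X > m) = (1 - p)^m.
s, t = 3, 5
conditional = (1 - p) ** (s + t) / (1 - p) ** s   # P(X > s + t | X > s)
unconditional = (1 - p) ** t                      # P(X > t)

print(tail_by_sum, geometric_tail(x, p))
print(conditional, unconditional)
```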

Uniform Probability

X~ Uniform (a,b) / Unif (a,b)
Related Formulas

Mean (\mu) = E[X] = \frac{a+b}{2}

Variance (\sigma^2) = Var (X) = \frac{1}{12} (b-a)^2 = E[X^2] - \mu^2, where E[X^2] = \int_{a}^{b} x^2 \,f(x) \,dx and f(x) = \frac{1}{b-a}

Normal/Gaussian Probability

X ~ Normal (μ,σ^2) or N(μ,σ^2)
Range from (-∞,∞)
Related Formula

\text {Standardisation } (Z) = \frac {X- \mu}{\sigma}

\text {Top p cutoff} : P(Z \geq z) = p
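
A short sketch of standardisation and the top-p cutoff using `statistics.NormalDist` from the Python standard library. The mean, standard deviation and score below are made-up values.

```python
from statistics import NormalDist

mu, sigma = 70.0, 10.0   # illustrative mean and standard deviation (assumption)
x = 85.0                 # illustrative observation (assumption)

# Standardisation: Z = (X - mu) / sigma
z = (x - mu) / sigma

# P(Z >= z) from the standard normal CDF.
upper_tail = 1 - NormalDist().cdf(z)

# Top-p cutoff: find z_cut with P(Z >= z_cut) = p, then convert back to X.
p = 0.05
z_cut = NormalDist().inv_cdf(1 - p)
x_cut = mu + z_cut * sigma

print(z, upper_tail)
print(z_cut, x_cut)
```

Here `x_cut` is the value a score must exceed to be in the top 5% under the assumed distribution.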

Joint Probability Mass Function

Related Formulas

\text {Marginal Probability} : P(Y_1) = \sum _{y_2} P(Y_1,Y_2)

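\text {Calculating the table : } P(Y_1 = y_1, Y_2 = y_2) = \frac {{n_1 \choose y_1}{n_2 \choose y_2}{total - n_1 - n_2 \choose x - (y_1 +y_2)}}{total \choose x}

where x items are drawn without replacement from total items, of which n_1 are of type 1 and n_2 are of type 2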

\text {Conditional Probability} : P(Y_1|Y_2) = \frac {P(Y_1, Y_2)}{P(Y_2)}
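
A sketch of building a joint table and reading marginal and conditional probabilities from it. It assumes the table comes from drawing x items without replacement from `total` items (n1 of type 1, n2 of type 2, the rest of neither type), so the third binomial coefficient uses total - n1 - n2. All numbers are made up.

```python
from math import comb

def joint_pmf(y1, y2, n1, n2, total, x):
    """P(Y1 = y1, Y2 = y2) under the sampling assumption described above."""
    rest = x - y1 - y2                 # drawn items that are of neither type
    if rest < 0:
        return 0.0
    return (comb(n1, y1) * comb(n2, y2) * comb(total - n1 - n2, rest)
            / comb(total, x))

n1, n2, total, x = 3, 2, 9, 3          # illustrative values (assumption)
table = {(a, b): joint_pmf(a, b, n1, n2, total, x)
         for a in range(n1 + 1) for b in range(n2 + 1)}

def marginal_y1(y1):
    """P(Y1 = y1): sum the joint pmf over every value of y2."""
    return sum(p for (a, _), p in table.items() if a == y1)

def marginal_y2(y2):
    """P(Y2 = y2): sum the joint pmf over every value of y1."""
    return sum(p for (_, b), p in table.items() if b == y2)

def conditional_y1_given_y2(y1, y2):
    """P(Y1 = y1 | Y2 = y2) = P(Y1 = y1, Y2 = y2) / P(Y2 = y2)."""
    return table[(y1, y2)] / marginal_y2(y2)

print(sum(table.values()))             # the whole table sums to 1
print(marginal_y1(1), conditional_y1_given_y2(1, 1))
```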

Variance

Determines how spread out the values are
Related Formula

Var (X) = E[(X- \mu)^2] = E[X^2] - \mu^2

Properties of Variance:
- Adding a constant b does not change the variance; multiplying by a constant a scales it by a^2

Var(aX + b) = a^2 Var(X)

If X and Y are independent,

Var(X+Y) = Var (X-Y) = Var (X) + Var(Y)

Covariance and Independence

Covariance
- How two things change together
Covariance Properties

Cov(X,X) = Var(X)

Cov(X,Y) = E[XY] - \mu_X \mu_Y

Cov(X+Y,Z) = Cov(X,Z) + Cov(Y,Z)

Cov(aX+b, cY+d) = ac \, Cov(X,Y)

Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)

Related Formulas

Cov (Y_1, Y_2) = \rho * \sqrt {Var (Y_1)} * \sqrt {Var(Y_2)}

Cov(Y_1, Y_2) = E[Y_1Y_2] - \mu_{Y_1} \mu_{Y_2} = E[Y_1Y_2] - E[Y_1]E[Y_2]

Cov(X,Y) = E[(X-\mu_X)(Y-\mu _Y)]

E[Y_1Y_2] = \sum_{y_1} \sum_{y_2} y_1 y_2 \, p(y_1,y_2)

\rho = \frac {Cov(Y_1,Y_2)}{\sigma _{Y_1} * \sigma_{Y_2}}

The two formulas involving \rho are equivalent because of this:

\text {Standard Deviation of X : } ( \sigma) = \sqrt {Var(X)}

If Cov(Y1,Y2) = 0, then the variables may or may not be independent
- Independence implies zero covariance, but zero covariance does not imply independence
The largest ρ is 1, and the smallest ρ is -1
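
A minimal sketch that computes E[Y_1Y_2], the covariance and ρ from a joint pmf table. The table values are made up for illustration.

```python
from math import sqrt

# Joint pmf p(y1, y2); the numbers are illustrative (assumption).
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def expect(g):
    """E[g(Y1, Y2)]: sum of g(y1, y2) * p(y1, y2) over the whole table."""
    return sum(g(y1, y2) * p for (y1, y2), p in joint.items())

mu1 = expect(lambda y1, y2: y1)
mu2 = expect(lambda y1, y2: y2)
var1 = expect(lambda y1, y2: y1**2) - mu1**2
var2 = expect(lambda y1, y2: y2**2) - mu2**2

# Cov(Y1, Y2) = E[Y1 Y2] - E[Y1] E[Y2]
cov = expect(lambda y1, y2: y1 * y2) - mu1 * mu2

# rho = Cov / (sigma_Y1 * sigma_Y2)
rho = cov / (sqrt(var1) * sqrt(var2))

print(cov, rho)
```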

Regression

Regression is a supervised learning method (trained on labeled data) used in statistics and machine learning to model the relationship between a dependent variable (also called the target or response) and one or more independent variables (also called predictors or features). The goal is to predict the value of the dependent variable from the independent variables.
It predicts continuous outcomes, unlike classification, which predicts discrete labels.

Least Squares Estimators

S_{xx} = \sum (x -\bar x)^2 = \sum x^2 - \frac {1}{n} ( \sum x)^2

S_{xy} = \sum (x- \bar x) (y- \bar y) = \sum xy - \frac{1}{n} \sum x \sum y

\text {Slope : } B_1 = \frac {S_{xy}}{S_{xx}} =\frac { \sum xy - n \bar x \bar y}{\sum x^2 -n \bar x^2}

\text {Constant : } B_0 = {\bar y} - B_1 \bar x
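
A minimal sketch of the least squares estimators, taking S_xx and S_xy as sums over all data points. The five (x, y) pairs are made up.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data (assumption)
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Shortcut forms of S_xx and S_xy.
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # constant (intercept)

print(b0, b1)               # fitted line: y = b0 + b1 * x
```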

Residual SSR

How far the predicted values are from actual values
Smaller SSR means that your predictions are closer to actual values
If guesses are close, R^2 is high. This means that the model can explain more of the pattern in the data.
- R^2 = 1 means that the model is perfect.
Related Formula
- R^2 is called the coefficient of determination

\text {SSR} = \sum ^n_{i=1} (y_i- \hat y_i) ^2 = (1- r^2) S_{yy}, \text{ where } S_{yy} = \sum ^n_{i=1} (y_i- \bar y) ^2

r^2 = 1- \frac {SSR}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}S_{yy}}
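
A sketch that computes SSR and r^2 for a fitted line, taking SSR to be the sum of squared differences between actual and predicted values. It checks that the two expressions for r^2 agree. The data are made up.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data (assumption)
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

# Least squares line and its predictions.
b1 = s_xy / s_xx
b0 = sum(ys) / n - b1 * sum(xs) / n
predicted = [b0 + b1 * x for x in xs]

# SSR: sum of squared residuals (actual minus predicted).
ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))

r2_from_ssr = 1 - ssr / s_yy
r2_from_sums = s_xy ** 2 / (s_xx * s_yy)

print(ssr, r2_from_ssr, r2_from_sums)
```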

MAE & RMSE

RMSE

RMSE is another common metric used to evaluate the difference between values predicted by a model and the actual values.

It gives more weight to larger errors than MAE because the errors are squared before averaging.

Interpretation:

RMSE gives a measure of how spread out the residuals (errors) are. Since the errors are squared, RMSE is more sensitive to large errors compared to MAE. Large differences between predicted and actual values will result in a significantly larger RMSE.
Lower RMSE values indicate better model performance.
RMSE is commonly used when large errors are particularly undesirable and should be penalized more.

MAE

Interpretation:

MAE gives a straightforward understanding of the error in the model, as it shows the average absolute difference between actual and predicted values.
Lower MAE values indicate better model performance.
MAE is less sensitive to outliers than RMSE (it uses absolute rather than squared differences), meaning that large errors do not disproportionately influence the result.
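
A minimal sketch of both metrics, using the standard definitions (MAE is the mean of the absolute errors, RMSE is the square root of the mean of the squared errors). The actual and predicted values are made up, with one deliberately large error.

```python
from math import sqrt

actual    = [3.0, 5.0, 2.0, 7.0]   # made-up values (assumption)
predicted = [2.0, 5.0, 4.0, 3.0]   # the last prediction is far off

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = sqrt(sum(e * e for e in errors) / len(errors))

print(mae, rmse)   # RMSE is pulled up more by the single large error
```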

Statistical Inference

Using data to draw conclusions about unknown quantities

There are two kinds:
- Bayesian inference
  - Combines:
    A prior belief about a parameter
    The likelihood of observing the data
  - To produce a posterior (updated belief)
- Frequentist inference
  - The parameter is fixed in the given scenario, but unknown to us
  - Through frequentist inference, we use data to estimate and test assumptions about that fixed value.

Bayesian inference

P(H|E) = \frac {P(E|H) P(H)}{P(E)}

Same posterior, same bayes numerator

Not same numerator, same posterior
Because posterior = numerator / evidence, even if the numerator is the same in two scenarios, the posterior will be different if the total of all Bayes numerators (the evidence) is different.
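
A small sketch of this point: two scenarios where hypothesis H1 has the same Bayes numerator but a different posterior, because the total of the numerators differs. The hypothesis names, priors and likelihoods are made up.

```python
def posteriors(priors, likelihoods):
    """Return P(H | E) for every hypothesis H."""
    # Bayes numerator for each hypothesis: P(E | H) * P(H)
    numerators = {h: likelihoods[h] * priors[h] for h in priors}
    # Evidence P(E): the total of all Bayes numerators
    evidence = sum(numerators.values())
    return {h: num / evidence for h, num in numerators.items()}

priors = {"H1": 0.5, "H2": 0.5}   # illustrative priors (assumption)

# In both scenarios the numerator for H1 is 0.8 * 0.5 = 0.4.
scenario_a = posteriors(priors, {"H1": 0.8, "H2": 0.2})
scenario_b = posteriors(priors, {"H1": 0.8, "H2": 0.8})

print(scenario_a["H1"], scenario_b["H1"])
```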

Conditional Independence

Updating two data points

Integration

\int x^n \,dx = \frac {x^{n+1} }{n+1} + C \quad (n \neq -1)

Differentiation

\frac{d}{dx} x^n = {n}x^{n-1}

Frequentist Inference

3 ways, but not covering significance testing

[Estimates an exact value] Maximum Likelihood Estimation
[Estimates a range of values] Confidence interval
Significance Testing

Maximum Likelihood Estimation
- Used to find the most likely value of a parameter, given the data
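- It is the parameter value that makes the observed data most likely; for simple models (e.g. Bernoulli) it works out to be the sample average.
- To solve, maximize p(x|p) or maximize ln p(x|p) (both give the same answer, because ln is increasing)

\theta^* = \text {arg max}_{\theta} \, p(x|p = \theta) = \text {arg max}_{\theta} \, \ln p(x|p = \theta)

If there are multiple independent data points, the likelihood is: p(x1, x2, x3... | p) = p(x1 |p)p(x2 |p)p(x3|p)...

Steps:

Find log likelihood
Find stationary point (set the derivative with respect to the parameter to 0 and solve)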
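
A sketch of the two steps for independent Bernoulli data, where setting the derivative of the log likelihood to zero gives the sample proportion. The data are made up, and the grid search is only a numerical sanity check, not part of the method.

```python
from math import log

data = [1, 0, 1, 1, 0, 1, 1, 0]   # made-up Bernoulli outcomes, 1 = success

def log_likelihood(theta, xs):
    """ln p(x1, ..., xn | theta) for independent Bernoulli(theta) data."""
    return sum(x * log(theta) + (1 - x) * log(1 - theta) for x in xs)

# Step 1: log likelihood = k ln(theta) + (n - k) ln(1 - theta), k = number of successes.
# Step 2: derivative k/theta - (n - k)/(1 - theta) = 0 gives theta = k / n.
theta_closed_form = sum(data) / len(data)

# Sanity check: the best value on a fine grid in (0, 1) should match.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_closed_form, theta_grid)
```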

Confidence Interval
- Range of values in which we believe the true value of the parameter lies, given the data
- This tells you how precise your estimate is and gives a range of plausible values for the true average.
  - Ie. After computing a 95% CI, you are 95% confident that the average weight loss is between x and y, because if we repeated the experiment 100 times, about 95 of the resulting intervals would contain the true average.

CI does not mean 95% of people lost between x and y weight.

CI is about the uncertainty in the true value: it gives a range where the true population mean could be, not where each individual person's data falls. The confidence percentage reflects the reliability of the estimation method; it is not the probability that the specific interval from one sample contains the true mean.

P(\theta_{L} \leq \theta \leq \theta_{U})= 1 - \alpha

where 1-\alpha is the confidence coefficient (confidence level).

CI for normal data with known variance
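
A minimal sketch for this case, using the standard interval x̄ ± z_{α/2} · σ/√n (the formula is not written out in these notes). The sample and the known σ are made-up values.

```python
from math import sqrt
from statistics import NormalDist, mean

sample = [2.1, 1.8, 2.5, 2.0, 2.3, 1.9, 2.4, 2.0]   # made-up measurements
sigma = 0.25     # population standard deviation, assumed known
alpha = 0.05     # 1 - alpha = 95% confidence level

x_bar = mean(sample)
z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96 for 95%
half_width = z * sigma / sqrt(len(sample))

lower, upper = x_bar - half_width, x_bar + half_width
print(lower, upper)
```

With unknown variance the same shape applies, but σ is replaced by the sample standard deviation and z by a t quantile; the Python standard library has no t quantile function, so that case is not sketched here.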

CI for normal data with unknown variance

Follows a t-distribution
- You’re trying to estimate the true average of something (like how much weight people lose after running every day for 3 months). Since you don’t have all the data from everyone, you have to make a guess.
- The t-distribution helps you make that guess, but it knows your sample’s small, so it’s a little extra careful.
- The degrees of freedom (df) tell you how much you can trust your guess — the bigger the df, the more you can trust it.
  - When df -> infinity, it becomes a standard normal

Classification

Response is categorical
Supervised, as the model is trained on data with known labels (the data is randomly split into training and testing sets)

Naive Bayes

Probabilistic classifier based on Bayes' Theorem. It’s used to classify data based on the likelihood of certain features or attributes, under the assumption that these features are independent of each other given the class (feature independence)

To classify a sample, we calculate:

P(y∣features)∝P(y)× ∏_i P(x_i ∣y)

P(y) ⇒ Prior probability of y
Product ⇒ Likelihood of each feature x_i given y, multiplied together

Step 1: Estimate the Prior probability using CHD

Step 2: Estimate Likelihood for categorical features against each variable column

To find likelihood of features, use MLE.
- Binomial - Categorical
- Normal - Numerical

Step 3: Estimate Likelihood for numeric features

Step 4: Calculate Bayes Numerators

Step 5: Make Prediction

If any of the likelihoods is 0, then the final score will be 0, which can unfairly eliminate a class because of a zero count in limited data.

Hence, to solve this, people use smoothing (e.g. Laplace smoothing), which prevents a probability from being exactly 0.

Laplace smoothing

Adds one to the count of every category of the non-numerical (categorical) variables
- Pretends that every possibility has happened at least once, and makes the model more realistic
- Probabilities are now non-zero
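
A minimal sketch of the steps for categorical features only, with add-one (Laplace) smoothing. The training rows, feature names and class labels are hypothetical, and the smoothing denominator (class count plus number of distinct values of the feature) is the usual add-one convention rather than something stated in these notes.

```python
from collections import Counter, defaultdict

# Made-up training rows: (features, label). All names are hypothetical.
rows = [
    ({"smoker": "yes", "exercise": "no"},  "pos"),
    ({"smoker": "yes", "exercise": "yes"}, "pos"),
    ({"smoker": "no",  "exercise": "yes"}, "neg"),
    ({"smoker": "no",  "exercise": "no"},  "neg"),
    ({"smoker": "no",  "exercise": "yes"}, "neg"),
]

class_counts = Counter(label for _, label in rows)
value_counts = defaultdict(Counter)   # (feature, label) -> counts of each value
feature_values = defaultdict(set)     # feature -> distinct values seen
for features, label in rows:
    for name, value in features.items():
        value_counts[(name, label)][value] += 1
        feature_values[name].add(value)

def likelihood(name, value, label):
    """P(x_i = value | y = label) with add-one smoothing."""
    count = value_counts[(name, label)][value]
    return (count + 1) / (class_counts[label] + len(feature_values[name]))

def bayes_numerator(features, label):
    """P(y) times the product of P(x_i | y)."""
    score = class_counts[label] / len(rows)   # prior P(y)
    for name, value in features.items():
        score *= likelihood(name, value, label)
    return score

def predict(features):
    """Pick the class with the largest Bayes numerator."""
    return max(class_counts, key=lambda label: bayes_numerator(features, label))

# No "neg" row has smoker = "yes", yet the smoothed likelihood is not zero.
print(likelihood("smoker", "yes", "neg"))
print(predict({"smoker": "yes", "exercise": "no"}))
```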

*watch out joint likelihood (see below)

Evaluation Metrics

Type

Methods

Classification

Accuracy = $\frac{ \text{No of correct predictions}}{\text{No of total predictions}}$
- Or, (TP+TN)/(TP+TN+FP+FN)
- To be used in balanced datasets
Precision = TP/(TP+FP)
- High precision means few false positives
- Good when false positives are costly
Recall = TP/(TP+FN)
- High recall = few false negatives
- Good when missing a positive is costly
F1-score = 2* (Precision) * (Recall) / (Precision + Recall)
- Combination of precision and recall
- Balances both FP and FN

Regression

Measures the difference between the true and predicted set

RMSE
MAE

Classification vs Regression
- Response variable ⇒ output
  - In classification, the response variable is categorical (e.g., "spam" or "not spam").
  - In regression, the response variable is continuous (e.g., predicting house prices).
- Predictor variable ⇒ Variables that you feed into the model to get the response
  - Same role in both classification and regression
- Eval metrics ⇒ Refer to above.

4 Outcomes

True Positives (TP): The model correctly predicts the positive class (predicted positive, actual positive).
True Negatives (TN): The model correctly predicts the negative class (predicted negative, actual negative).
False Positives (FP): The model incorrectly predicts the positive class (predicted positive, actual negative) — this is also known as a Type I error.
- Model overestimates number of positives
- Healthy patient is predicted to have a disease.
False Negatives (FN): The model incorrectly predicts the negative class (predicted negative, actual positive) — this is also known as a Type II error.
- Model underestimates the number of positives / misses a positive case
  - Sick patient is predicted to be healthy
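
A short sketch that turns the four outcome counts into accuracy, precision, recall and F1-score. The counts are made up.

```python
tp, tn, fp, fn = 40, 45, 5, 10   # made-up confusion-matrix counts

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # of the predicted positives, how many are real
recall = tp / (tp + fn)          # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Note the parentheses in every denominator: writing TP/TP+FN without them would evaluate to 1 + FN.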

Clustering

Unsupervised learning: groups data points without predefined classes (no labels)

K-means clustering

Good for well separated clusters, where each point belongs to one and ONLY one cluster (a concept called hard clusters)
- Hard clusters ⇒ Each cluster assignment is deterministic, non-probabilistic
However, real-world clusters may overlap, so the assignment of points that lie between clusters is highly uncertain (this calls for soft clusters)
- K-means cannot model this uncertainty.
- Soft clusters ⇒ Each point has a probability of belonging to each cluster, i.e. it is a mixture of clusters.
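
A minimal sketch of hard assignment, using the standard K-means loop (assign each point to its nearest center, then move each center to the mean of its points) on 1-D data. The algorithm steps are not spelled out in these notes, and the data and starting centers are made up.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain K-means on 1-D data, starting from the given centers."""
    centers = list(centers)   # work on a copy
    labels = []
    for _ in range(iterations):
        # Assignment step: each point gets exactly one (nearest) center.
        labels = [min(range(len(centers)), key=lambda j: abs(x - centers[j]))
                  for x in points]
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:       # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
    return labels, centers

# The point 2.8 sits between the two groups but is still forced into one cluster.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 2.8]
labels, centers = kmeans_1d(points, centers=[0.0, 6.0])

print(labels, centers)
```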

Example

Gaussian mixture models

Instead of assuming a single normal distribution bell curve, a GMM has multiple overlapping bell curves
GMMs work well with messy real-world data and are considered universal approximators
- Universal approximators: Given enough bell curves, it can model any shape of data distribution.
  - Whether it is a nice smooth curve is a different story.
Soft clusters, hence can handle overlapping clusters.
Probabilistic, adapts to the shape of the data using variance.
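
A minimal sketch of soft assignment with two 1-D Gaussian components whose parameters are already known. The means, standard deviations and weights are made up, and fitting them from data (usually done with the EM algorithm) is not shown.

```python
from statistics import NormalDist

# Two made-up components; the mixing weights sum to 1.
components = [
    {"weight": 0.5, "dist": NormalDist(mu=1.0, sigma=1.0)},
    {"weight": 0.5, "dist": NormalDist(mu=5.0, sigma=1.0)},
]

def responsibilities(x):
    """Probability that x belongs to each component (soft cluster assignment)."""
    weighted = [c["weight"] * c["dist"].pdf(x) for c in components]
    total = sum(weighted)
    return [w / total for w in weighted]

between = responsibilities(3.0)   # midway between the two means
near_first = responsibilities(1.5)

print(between)
print(near_first)
```

A point midway between the two bell curves is shared evenly, while a point near one mean belongs almost entirely to that component.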

