“Machine Learning, Deep Learning, Data Science, etc. are all in the same nexus, which is basically Probability plus Statistics.”

A useful book and a probability reference link.

Hoeffding's Inequality, Chebyshev Inequality

Asymptotic normality

Delta ($\mathrm{\Delta}$) method

Mode of Convergence

Estimator

Probability Redux

**i.i.d.** stands for **independent and identically distributed** .

**r.v.** denotes **random variable**

According to the law of large numbers (LLN), the average of the results obtained from performing the same experiment a large number of times should be close to the **expected value**, and it tends to get closer as $n$ grows.

Let $X_1, \dots, X_n$ be i.i.d. r.v. with mean $\mu$. Then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ converges to $\mu$ (a.s. and in probability) as $n \to \infty$.
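A minimal simulation sketch of the LLN using only the standard library; the Bernoulli parameter $p = 0.3$ is an arbitrary choice for illustration:

```python
import random

random.seed(0)

def sample_mean(n, p=0.3):
    """Average of n Bernoulli(p) draws (p is an illustrative choice)."""
    return sum(random.random() < p for _ in range(n)) / n

# LLN: the sample mean drifts toward the expectation E[X] = p as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```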

CLT establishes that when independent random variables are summed up, their properly normalized sum tends toward a **normal distribution** even if the original variables themselves are not normally distributed:

$\sqrt{n}\,\dfrac{\bar{X}_n - \mu}{\sigma} \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, 1),$

where $\mu = \mathbb{E}[X_1]$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$.

"Standard Gaussian" means that this quantity will be a number in $(-3, 3)$ with overwhelming probability.

**Rule of thumb to apply CLT** - normally, we require $n \ge 30$.
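A quick sketch of the CLT with exponential (decidedly non-Gaussian) variables; the sample size, number of repetitions, and the 95% check are all illustrative choices:

```python
import random, statistics, math

random.seed(1)

def standardized_mean(n, lam=1.0):
    # Exponential(lam) has mean 1/lam and variance 1/lam**2.
    xs = [random.expovariate(lam) for _ in range(n)]
    return math.sqrt(n) * (statistics.fmean(xs) - 1 / lam) / (1 / lam)

# With n = 100 (comfortably above the n >= 30 rule of thumb), the
# standardized mean behaves like a standard Gaussian: roughly 95% of
# draws land inside (-1.96, 1.96).
draws = [standardized_mean(100) for _ in range(2000)]
inside = sum(-1.96 < z < 1.96 for z in draws) / len(draws)
print(inside)
```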

**Asymptotic normality**

Assuming the sequence $X_1, X_2, \dots$ is i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, the CLT gives $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal{N}(0, \sigma^2)$.

Hence, the sample mean $\bar{X}_n$ is **asymptotically normal**.

The continuous mapping theorem (CMT) states that continuous functions preserve limits even if their arguments are sequences of random variables: if $T_n \to T$ and $g$ is a continuous function, then $g(T_n) \to g(T)$.

It applies to all three modes of convergence (a.s., in probability, and in distribution).

But how close is $\bar{X}_n$ to the true mean for a fixed $n$?

What if $n$ is **not large enough** to apply CLT?

For bounded random variables, **Hoeffding's Inequality** holds for any $n$.
Let $X_1, \dots, X_n$ be i.i.d. r.v. with mean $\mu$ and $X_i \in [a, b]$ almost surely. Then for any $\varepsilon > 0$,

$\mathbb{P}\bigl(|\bar{X}_n - \mu| \ge \varepsilon\bigr) \le 2\exp\!\left(-\dfrac{2n\varepsilon^{2}}{(b-a)^{2}}\right).$

Here the random variables must actually be **almost surely bounded**, which rules out, e.g., Gaussian and exponential random variables.
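A simulation sketch comparing the empirical deviation probability against the Hoeffding bound for Bernoulli variables; the values of $n$, $\varepsilon$, $p$, and the number of trials are arbitrary illustrations:

```python
import math, random

random.seed(2)

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Hoeffding: P(|mean - mu| >= eps) <= 2*exp(-2*n*eps^2/(b-a)^2)."""
    return 2 * math.exp(-2 * n * eps**2 / (b - a) ** 2)

def tail_prob(n, eps, p=0.5, trials=2000):
    """Empirical P(|mean - p| >= eps) for n Bernoulli(p) samples."""
    hits = 0
    for _ in range(trials):
        xbar = sum(random.random() < p for _ in range(n)) / n
        hits += abs(xbar - p) >= eps
    return hits / trials

n, eps = 50, 0.1
print(tail_prob(n, eps), "<=", hoeffding_bound(n, eps))
```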

**How to choose $\varepsilon$?**

So let's parse this for a second: if we take $\varepsilon$ of order $1/\sqrt{n}$, the bound stays at a constant level, so the deviations of $\bar{X}_n$ from $\mu$ are of order $1/\sqrt{n}$.

The square root of $n$ sets the natural scale of these fluctuations.

So the conclusion is that the average is a good replacement for the expectation.

**Is this tight?** That's the annoying thing about inequalities.

The probability on the left could actually be much smaller, e.g. $e$ to the minus an exponential of $n$, so the bound need not be tight.

These two inequalities only guarantee upper bounds on the probability that $\bar{X}_n$ deviates from $\mu$; they do not give its exact value.

**Markov inequality**

For a non-negative random variable $X$ and any $t > 0$,

$\mathbb{P}(X \ge t) \le \dfrac{\mathbb{E}[X]}{t}.$

Note that the Markov inequality is restricted to **non-negative random variables**.
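A quick numerical check of the Markov bound on exponential variables (which are non-negative, so the inequality applies); the rate, threshold, and sample size are illustrative:

```python
import random, statistics

random.seed(3)

# Markov: for X >= 0 and t > 0, P(X >= t) <= E[X] / t.
xs = [random.expovariate(1.0) for _ in range(100_000)]  # E[X] = 1
t = 3.0
empirical = sum(x >= t for x in xs) / len(xs)   # true value: e^-3, about 0.05
bound = statistics.fmean(xs) / t                # about 1/3, much looser
print(empirical, "<=", bound)
```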

**Chebyshev inequality**

For a random variable $X$ with mean $\mu$ and variance $\sigma^{2}$, and any $t > 0$,

$\mathbb{P}(|X - \mu| \ge t) \le \dfrac{\sigma^{2}}{t^{2}}.$

**Remark:**
When the Markov inequality is applied to the non-negative random variable $(X - \mu)^{2}$ with threshold $t^{2}$, it yields exactly the Chebyshev inequality.
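A numerical check of the Chebyshev bound on Gaussian variables; the mean, standard deviation, and threshold are illustrative:

```python
import random

random.seed(4)

# Chebyshev: P(|X - mu| >= t) <= sigma^2 / t^2.
xs = [random.gauss(0.0, 2.0) for _ in range(100_000)]  # mu = 0, sigma^2 = 4
t = 4.0
empirical = sum(abs(x) >= t for x in xs) / len(xs)  # true value about 0.0455
bound = 4.0 / t**2                                  # = 0.25, again much looser
print(empirical, "<=", bound)
```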

**Triangle Inequality:** $|x + y| \le |x| + |y|$.

Slutsky's Theorem will be our main tool for **convergence in distribution**.

Let $T_n \xrightarrow[n\to\infty]{(d)} T$ and $U_n \xrightarrow[n\to\infty]{\mathbb{P}} u$, where $u$ is a deterministic constant.

Then,

$T_n + U_n \xrightarrow[n\to\infty]{(d)} T + u$, $T_n U_n \xrightarrow[n\to\infty]{(d)} u\,T$. If in addition $u \ne 0$, then $\dfrac{T_n}{U_n} \xrightarrow[n\to\infty]{(d)} \dfrac{T}{u}$.
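A toy simulation in the spirit of Slutsky's theorem; the sequences `t_n` and `u_n` below are hypothetical constructions (not from the text) chosen so that $T_n \to \mathcal{N}(0,1)$ in distribution and $U_n \to 2$ in probability:

```python
import random, statistics, math

random.seed(5)

# T_n -> N(0, 1) in distribution (by the CLT), U_n -> u = 2 in probability
# (by the LLN), so Slutsky gives T_n * U_n -> 2T, i.e. N(0, 4), in distribution.
def t_n(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]       # mean 0, variance 1/3
    return math.sqrt(n) * statistics.fmean(xs) / math.sqrt(1 / 3)

def u_n(n):
    return 2 + statistics.fmean([random.gauss(0, 1) for _ in range(n)])

products = [t_n(200) * u_n(200) for _ in range(2000)]
print(statistics.stdev(products))  # close to 2, the std. dev. of N(0, 4)
```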

https://en.wikipedia.org/wiki/Delta_method
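The linked Delta method, written out in its standard univariate form:

```latex
\text{If } \sqrt{n}\,(\bar X_n - \mu) \xrightarrow[n\to\infty]{(d)} \mathcal N(0, \sigma^2)
\text{ and } g \text{ is differentiable at } \mu \text{ with } g'(\mu) \neq 0, \text{ then }
\sqrt{n}\,\bigl(g(\bar X_n) - g(\mu)\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal N\bigl(0,\; g'(\mu)^2 \sigma^2\bigr).
```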

Discrete: Probability mass function

Bernoulli, Uniform, Binomial, Geometric

Notation: $X \sim \mathcal{N}(\mu, \sigma^{2})$

Mean and Variance: $\mathbb{E}[X] = \mu$, $\mathrm{Var}(X) = \sigma^{2}$

Probability Density Function:

$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\dfrac{(x - \mu)^{2}}{2\sigma^{2}}\right)$

**Cumulative Distribution Function:** $\Phi(x) = \displaystyle\int_{-\infty}^{x} f(t)\, dt$ (no closed form; read it from tables or a calculator)

**Why do we use the Gaussian distribution so frequently?**

Normally, we use the sample mean as our estimator, and the reason is that the Gaussian distribution is what shows up as the limit in the CLT the minute you start talking about averages.

It is a universality type of result: if you take the average of enough things, then it's going to go to a Gaussian.

**What about extreme values?**

The support of a Gaussian is $(-\infty, +\infty)$.

Yes, extreme values exist, but they never really come into play, because the exponential in the density gets really, really small. The Gaussian effectively lives in a bounded interval.

**Gaussian Probability Tables**

A Gaussian CDF (z-score) calculator.

**Quantiles**

$\alpha$ | 2.5% | 5% | 10%
---|---|---|---
$q_{\alpha}$ | 1.96 | 1.65 | 1.28
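These quantiles can be reproduced with the standard library's `NormalDist` (the table's 1.65 is the common rounding of 1.6449):

```python
from statistics import NormalDist

# Standard Gaussian quantiles q_alpha with P(Z > q_alpha) = alpha.
z = NormalDist()  # mean 0, standard deviation 1
for alpha in (0.025, 0.05, 0.10):
    print(f"alpha = {alpha:.1%}, q = {z.inv_cdf(1 - alpha):.4f}")
```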

The advantage of using the Poisson distribution is that $n$ and $p$ do not need to be known! This can make modelling assumptions much easier.

Notation: $X \sim \mathrm{Poisson}(\lambda)$

Mean and Variance: $\mathbb{E}[X] = \lambda$, $\mathrm{Var}(X) = \lambda$

Prerequisite: $0! = 1$

**Probability Mass Function:** $\mathbb{P}(X = k) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$

**Cumulative distribution function:** $\mathbb{P}(X \le k) = e^{-\lambda} \displaystyle\sum_{i=0}^{k} \frac{\lambda^{i}}{i!}$
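A sketch of the Poisson PMF and CDF from their standard formulas (using $0! = 1$); $\lambda = 3$ is an arbitrary example:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!   (k! uses 0! = 1 for k = 0)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k), a finite sum of the PMF."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 3.0
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
print(mean)  # both the mean and the variance equal lam = 3
```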

sample space:

Mean and Variance:

**Probability Mass Function:**

**Cumulative distribution function:**

Notation:

Parameters:

Mean and Variance:

**Gamma Function:**

**Probability Mass Function:**

**Cumulative distribution function:**

Such as: number of trials until a success.

The Geometric Distribution is either one of the two distributions below:

- The probability distribution of the number $X$ of Bernoulli trials needed to get one success, supported on the set $\{1, 2, 3, \dots\}$;
- The probability distribution of the number $Y = X - 1$ of failures before the first success, supported on the set $\{0, 1, 2, \dots\}$.

Notation: $X \sim \mathrm{Geometric}(p)$

Mean and Variance: $\mathbb{E}[X] = \dfrac{1}{p}$, $\mathrm{Var}(X) = \dfrac{1-p}{p^{2}}$

**Probability Mass Function:** $\mathbb{P}(X = k) = (1-p)^{k-1}\, p, \quad k = 1, 2, 3, \dots$

**Cumulative distribution function:** $\mathbb{P}(X \le k) = 1 - (1-p)^{k}$
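A sketch of the trials-until-success variant from its standard formulas; $p = 0.25$ is an arbitrary example:

```python
def geometric_pmf(k, p):
    """P(X = k): first success on trial k, support {1, 2, 3, ...}."""
    return (1 - p) ** (k - 1) * p

def geometric_cdf(k, p):
    """P(X <= k) = 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

p = 0.25
mean = sum(k * geometric_pmf(k, p) for k in range(1, 10_000))
print(mean)  # E[X] = 1/p = 4; the shifted variant Y = X - 1 has mean (1-p)/p
```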

Notation: $X \sim B(n, p)$

**Mean and Variance:** $\mathbb{E}[X] = np$, $\mathrm{Var}(X) = np(1-p)$

**Probability Mass Function:** $\mathbb{P}(X = k) = \dbinom{n}{k} p^{k} (1-p)^{n-k}, \quad k = 0, 1, \dots, n$

**Cumulative Distribution Function:** $\mathbb{P}(X \le k) = \displaystyle\sum_{i=0}^{k} \binom{n}{i} p^{i} (1-p)^{n-i}$

In other words, there is a finite number of possible outcomes in a binomial distribution, but an infinite number in a normal distribution.

Notation:

Mean and Variance:

**Probability Mass Function:**

The indicator function of a subset $A$ of a set $X$ is the function $\mathbf{1}_{A} : X \to \{0, 1\}$,

defined as $\mathbf{1}_{A}(x) = 1$ if $x \in A$, and $\mathbf{1}_{A}(x) = 0$ otherwise.
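A minimal sketch of the definition for a finite, hypothetical subset $A$:

```python
def indicator(A):
    """Return the function 1_A: x -> 1 if x is in A, else 0."""
    return lambda x: 1 if x in A else 0

one_A = indicator({1, 2, 3})   # A = {1, 2, 3} is an arbitrary example
print(one_A(2), one_A(7))      # prints: 1 0
```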

**Derivative of indicator function**

Take the indicator function of $(0, \infty)$, i.e. the Heaviside step $H(x) = \mathbf{1}_{x > 0}$. Its pointwise derivative is 0 wherever it exists, but in the distributional sense its derivative is the Dirac delta $\delta$.

*δ* is symmetric. *δ* can be thought of as the derivative of the Heaviside function *H*(*x*)=1 for *x*>0, 0 for *x*<0.

https://en.wikipedia.org/wiki/Moment-generating_function

expectation of moment generating function

https://online.stat.psu.edu/stat414/book/export/html/676

mixture distribution moment generating function

Useful…

We take a Gaussian $X \sim \mathcal{N}(\mu, \sigma^{2})$.

**Affine Transformation:** $aX + b \sim \mathcal{N}(a\mu + b, a^{2}\sigma^{2})$

**Standardization:** $Z = \dfrac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$

According to CLT, we may treat the properly normalized sample mean as approximately standard Gaussian.

**Symmetry:** if $X \sim \mathcal{N}(0, \sigma^{2})$, then $-X \sim \mathcal{N}(0, \sigma^{2})$, and $\mathbb{P}(|X| > t) = 2\,\mathbb{P}(X > t)$.
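Both properties can be checked numerically with the standard library's `NormalDist`; the values of $\mu$, $\sigma$, $a$, $b$ below are arbitrary:

```python
from statistics import NormalDist

# Affine transformation: if X ~ N(mu, sigma^2), then aX + b ~ N(a*mu + b, a^2 * sigma^2).
mu, sigma, a, b = 5.0, 2.0, 3.0, -1.0
X = NormalDist(mu, sigma)
Y = NormalDist(a * mu + b, abs(a) * sigma)

# Standardization: Z = (X - mu)/sigma ~ N(0, 1), so probabilities about X
# reduce to the standard Gaussian CDF.
Z = NormalDist()
x = 7.0
print(X.cdf(x), Z.cdf((x - mu) / sigma))  # the two values agree
print(Y.cdf(a * x + b))                    # agrees with both as well
```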

Three types of convergence, going from strong to weak.

${\left({T}_{n}\right)}_{n\ge 1}$ is a sequence of random variables and $T$ is a random variable ($T$ may be deterministic). Some examples are shown [here]

So I created two sequences, and I want this one to converge to that one: the probability that they depart from each other by some fixed amount goes to 0 as $n \to \infty$.

This is just saying I'm going to measure something about this random variable, namely its distribution: $\mathbb{E}[f(T_n)] \to \mathbb{E}[f(T)]$ for all continuous and bounded functions $f$. This is the mode of convergence in the **CLT**: my random variable's probabilities become the same as the probabilities for the limit as $n \to \infty$.

If ${\left({T}_{n}\right)}_{n\ge 1}$ converges **a.s.**, then it also converges in **probability**, and the two limits are equal a.s.

If ${\left({T}_{n}\right)}_{n\ge 1}$ converges in **probability**, then it also converges in **distribution**.

Convergence in distribution implies convergence of probabilities if the limit has a density (e.g. Gaussian):

$\mathbb{P}(a \le T_n \le b) \xrightarrow[n\to\infty]{} \mathbb{P}(a \le T \le b).$

**Addition, Multiplication and Division**
Assume $T_n \xrightarrow[n\to\infty]{\text{a.s.}/\mathbb{P}} T$ and $U_n \xrightarrow[n\to\infty]{\text{a.s.}/\mathbb{P}} U$.

Then,

$T_n + U_n \xrightarrow[n\to\infty]{\text{a.s.}/\mathbb{P}} T + U$, $T_n U_n \xrightarrow[n\to\infty]{\text{a.s.}/\mathbb{P}} T U$. If in addition $U \ne 0$ a.s., then $\dfrac{T_n}{U_n} \xrightarrow[n\to\infty]{\text{a.s.}/\mathbb{P}} \dfrac{T}{U}$.

Warning: in general, these rules do **not** apply to convergence in distribution (unless the limit is deterministic; see Slutsky's theorem).

Normally, we have two approaches to building an estimator:

- Compute the expectation of your random variable;
- Use the Delta method.

How can we decide how many samples are enough (**what is the cutoff**)? Namely, if 60 is enough, how about 59 and 58?

Now we have our first estimator in the *Kissing Example*; we put a hat on everything that is an estimator of something.

For $i = 1, \dots, n$, define $R_i \sim \mathrm{Ber}(p)$: $R_i = 1$ if the $i$th couple turns its head to the right, $R_i = 0$ otherwise.

Our first estimator of $p$ is the sample average:

$\hat{p} = \bar{R}_n = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} R_i.$

And averages of random variables are essentially controlled by two major tools: the LLN and the CLT.
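A simulation sketch of this estimator with a CLT-based 95% confidence interval; the true $p$ and the sample size $n$ below are made-up illustrations, not the study's data:

```python
import random, math

random.seed(6)

# Sample-average estimator p_hat of a Bernoulli parameter p (values illustrative).
p_true, n = 0.65, 124
R = [random.random() < p_true for _ in range(n)]
p_hat = sum(R) / n

# CLT-based 95% interval: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n).
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat = {p_hat:.3f}, 95% CI = ({p_hat - half_width:.3f}, {p_hat + half_width:.3f})")
```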

**What is the accuracy of this estimator?**

**What is the probability that $\hat{p}$ is far from the true $p$?**

We don't even know the true standard deviation of the observations.

Modelling Assumptions:

- Each $R_i$ is a r.v.;
- The ${R}_{i}$ are i.i.d.;
- Each ${R}_{i}$ is Bernoulli($p$).

If we want to estimate the **mean** of a Gaussian, we can compute the expectation, but that does not work for the variance.

You can go into the example and compute the variance, which actually comes from the method of moments.

But it turns out we have a much more powerful method, called the maximum likelihood method, though it is far from trivial.