
Introduction

Probability is the branch of mathematics that studies random events and the likelihood of their occurrence. It is used to model situations involving uncertainty or randomness, and it is applied across fields such as statistics, finance, physics, engineering, and computer science. It is also central to machine learning and artificial intelligence, where it is used to model uncertainty in data and to make predictions.

Random variable

A random variable x denotes an uncertain quantity. It may be the result of a coin flip or the measurement of a temperature. Each time we observe x, it can take a different value $x_i$. Values can repeat themselves, and some appear more frequently than others. This information is captured by the probability distribution Pr(x) of the random variable x. Note that a random variable assigns a number to each outcome. For example, in the die experiment we can assign to the six outcomes $i$ the numbers $10i$: $x(1)=10, x(2)=20, \dots, x(6)=60$, or we can assign the number 1 to even outcomes and 0 to odd outcomes: $x(1)=x(3)=x(5)=0$, $x(2)=x(4)=x(6)=1$.

We can also put it another way: if the experiment is repeated n times and the event A occurs $n_A$ times, then with a high degree of certainty the relative frequency $n_A/n$ of the occurrence of A is close to P(A): $P(A) \approx n_A/n$, provided that n is sufficiently large. In the limit, the probability P(A) of event A can be defined as $P(A)=\lim_{n\to\infty} n_A/n$.
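
As a quick sanity check, here is a minimal simulation sketch (using NumPy; the event A, "the die shows an even number", and the sample sizes are arbitrary choices for illustration) showing the relative frequency approaching P(A):

```python
import numpy as np

# Simulate n rolls of a fair die and track the relative frequency of
# the event A = "the outcome is even", whose true probability is 0.5.
rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)   # outcomes 1..6
    n_A = np.sum(rolls % 2 == 0)         # number of times A occurred
    print(n, n_A / n)                    # relative frequency n_A / n -> 0.5
```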

The random variable can be viewed as a function f(x) whose domain is the set of all experiment outcomes (so that the total probabilities sum to one). Recall that a function is a rule of correspondence between x and y: the values of the independent variable x form a set D called the domain, and the values of the dependent variable y = f(x) form the range set R of the function. In other words, we have two sets of numbers D and R; for every x in D we assign a number y = f(x) belonging to R, and we say f is a function of x. The mapping between x and y can be one-to-one or many-to-one.

There are two types of random variables: discrete and continuous. A discrete variable takes values from a discrete set. This set can be ordered, for example the values of a die roll, ranging from 1 to 6, or unordered, say, the weather outcomes sunny, snowy, rainy, and windy. It can be finite or infinite, and the probability distribution is best shown as a histogram. Each possible outcome has a positive probability, and the sum of all these probabilities is 1. A continuous random variable, on the other hand, takes values in the real numbers. Its range can also be finite or infinite depending on the problem (it can even be infinite but bounded), and the probability distribution is best shown as the graph of the probability density function (pdf). Each outcome has its own density (propensity), and the integral of the pdf is always 1, analogous to the discrete case.


Image: visualization of the probability distribution of a discrete and a continuous variable
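
A small sketch (using NumPy and SciPy; the fair die and the standard normal are arbitrary example distributions) illustrating that a discrete pmf sums to 1 while a continuous pdf integrates to 1:

```python
import numpy as np
from scipy import stats

# Discrete: the probabilities of a fair die sum to 1.
pmf = np.full(6, 1 / 6)
print(pmf.sum())                          # ~1.0

# Continuous: the standard normal pdf integrates to 1 (Riemann sum).
xs = np.linspace(-10, 10, 100_001)
dx = xs[1] - xs[0]
print((stats.norm.pdf(xs) * dx).sum())    # ~1.0
```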

Continuous random variable

Normal (Gaussian) distribution

This is the most widely used distribution. We say x is a normal (or Gaussian) random variable with parameters $\mu$ and $\sigma^2$ if its density function is:

$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}$$

Many natural phenomena follow a Gaussian distribution. For example, Maxwell arrived at the normal distribution for the velocities of molecules, under the assumption that the probability density of molecules with given velocity components depends on their velocity magnitude and not on their direction.
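
As an illustration, a minimal sketch (assuming $\mu = 0$ and $\sigma = 1$, chosen arbitrarily) comparing a histogram of normal samples against the density above:

```python
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=100_000)

# Compare the empirical histogram with the Gaussian density f(x).
hist, edges = np.histogram(samples, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - stats.norm.pdf(centers, mu, sigma))))  # small
```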

Exponential distribution

We say x is exponential with parameter λ if the density function is

$$f(x)=\begin{cases}\lambda e^{-\lambda x}, & x\ge 0\\ 0, & \text{otherwise}\end{cases}$$

Typical exponentially distributed quantities are the waiting times between phone calls or bus arrivals, given that the occurrences of those events are independent.


Image: waiting times at a bus stop or between phone calls, under the exponential distribution assumption
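
A minimal simulation sketch of exponential waiting times (the rate λ = 0.2 arrivals per minute is a made-up value):

```python
import numpy as np

lam = 0.2                                 # assumed arrival rate (per minute)
rng = np.random.default_rng(0)
waits = rng.exponential(scale=1 / lam, size=100_000)

# The mean waiting time of an exponential(lambda) variable is 1 / lambda.
print(waits.mean())                       # ~5.0 minutes
```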

Gamma distribution

We say x is a gamma random variable with parameters $\alpha>0$, $\beta>0$ if

$$f(x)=\begin{cases}\dfrac{x^{\alpha-1}}{\Gamma(\alpha)\,\beta^{\alpha}}\,e^{-x/\beta}, & x\ge 0\\ 0, & \text{otherwise}\end{cases}$$

with $\Gamma(\alpha)=\int_0^{\infty}x^{\alpha-1}e^{-x}\,dx$

The gamma distribution takes on a wide variety of shapes depending on its parameters.
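
To see this, a small sketch (using scipy.stats.gamma; the (α, β) pairs are arbitrary) evaluating the density for a few parameter settings:

```python
import numpy as np
from scipy import stats

xs = np.linspace(0.01, 20, 500)
for alpha, beta in [(1, 2), (2, 2), (5, 1), (9, 0.5)]:
    # scipy parameterizes the gamma density with shape a = alpha and scale = beta,
    # matching f(x) = x^(alpha-1) exp(-x/beta) / (Gamma(alpha) beta^alpha).
    pdf = stats.gamma.pdf(xs, a=alpha, scale=beta)
    print(alpha, beta, xs[np.argmax(pdf)])   # the mode shifts with the parameters
```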

Chi-square distribution

x is said to be $\chi^2$-distributed with n degrees of freedom if

$$f(x)=\begin{cases}\dfrac{x^{n/2-1}}{2^{n/2}\,\Gamma(n/2)}\,e^{-x/2}, & x\ge 0\\ 0, & \text{otherwise}\end{cases}$$

With n = 2 this reduces to the exponential distribution (with $\lambda = 1/2$).

Uniform distribution

x is said to be uniformly distributed in the interval (a, b), $-\infty<a<b<\infty$, if

$$f(x)=\begin{cases}\dfrac{1}{b-a}, & a\le x\le b\\ 0, & \text{otherwise}\end{cases}$$


Image: A uniform distribution

Beta distribution

The random variable x has a beta distribution with positive parameters $\alpha$ and $\beta$ if

$$f(x)=\begin{cases}\dfrac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, & 0<x<1\\ 0, & \text{otherwise}\end{cases}$$

where the beta function is $B(\alpha,\beta)=\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx = 2\int_0^{\pi/2}(\sin\theta)^{2\alpha-1}(\cos\theta)^{2\beta-1}\,d\theta$

Cauchy distribution

$$f(x)=\frac{\alpha/\pi}{(x-\mu)^2+\alpha^2},\qquad |x|<\infty$$

Laplace distribution

$$f(x)=\frac{\alpha}{2}\,e^{-\alpha|x|},\qquad |x|<\infty$$

Maxwell distribution

$$f(x)=\begin{cases}\dfrac{4}{\alpha^{3}\sqrt{\pi}}\,x^{2}e^{-x^{2}/\alpha^{2}}, & x\ge 0\\ 0, & \text{otherwise}\end{cases}$$

Discrete variable

Bernoulli distribution

The Bernoulli distribution describes any experiment with only two possible outcomes: success or failure (head or tail). x is said to be Bernoulli distributed if x takes the values 1 and 0 with $P(x=1) = p$ and $P(x=0) = q = 1 - p$.

Binomial distribution

When we count the successes in n independent Bernoulli trials, we get a binomial random variable. x is said to be a binomial random variable with parameters n and p if $P(x=k)=\binom{n}{k}p^{k}q^{n-k}$ with $p+q=1$ and $k = 0, 1, 2, \dots, n$.

Since the binomial coefficient $\binom{n}{k}=\frac{n!}{(n-k)!\,k!}$ grows rapidly with n, the probability is difficult to compute directly. We can therefore approximate this distribution with the normal approximation and the Poisson approximation.

Let $n\to\infty$ with p fixed. Then for k in a $\sqrt{npq}$ neighborhood of np, we approximate $\binom{n}{k}p^{k}q^{n-k}\approx\frac{1}{\sqrt{2\pi npq}}\,e^{-(k-np)^{2}/2npq}$ with $p+q=1$.
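
A quick numerical check of this normal approximation (a sketch; n = 1000 and p = 0.4 are arbitrary values):

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.4
q = 1 - p
k = np.arange(350, 451)                   # k near np = 400

exact = stats.binom.pmf(k, n, p)
approx = np.exp(-(k - n * p) ** 2 / (2 * n * p * q)) / np.sqrt(2 * np.pi * n * p * q)
print(np.max(np.abs(exact - approx)))     # very small
```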

Let's state the law of large numbers: if an event A with P(A) = p occurs k times in n trials, then $k \approx np$ in the sense that $k/n \to p$. Note, however, that $P(k = np)\approx\frac{1}{\sqrt{2\pi npq}}\to 0$ as $n\to\infty$, so exact equality becomes ever less likely.

For the Poisson approximation: if $n\to\infty$, $p\to 0$ such that $np\to\lambda$, then $\frac{n!}{k!(n-k)!}p^{k}q^{n-k}\to e^{-\lambda}\frac{\lambda^{k}}{k!}$ with k = 0, 1, 2, …
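
And a similar sketch checking the Poisson limit (n = 10000 and p = 0.0003, so λ = np = 3, are made-up values):

```python
import numpy as np
from scipy import stats

n, p = 10_000, 0.0003
lam = n * p                               # lambda = 3
k = np.arange(0, 15)

exact = stats.binom.pmf(k, n, p)
approx = stats.poisson.pmf(k, lam)
print(np.max(np.abs(exact - approx)))     # very small
```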

Poisson distribution

The Poisson distribution models counts such as the number of telephone calls in a fixed period, the number of winning tickets in a large lottery, and the number of printing errors in a book. The events can be rare, but they do happen. x follows a Poisson distribution with parameter $\lambda$ if x takes the values 0, 1, 2, … with $P(x=k)=e^{-\lambda}\frac{\lambda^{k}}{k!}$, k = 0, 1, 2, ...

Geometric distribution

Let x be the number of trials needed to find the first success in repeated Bernoulli trials. Then x follows a geometric distribution.

$P(x=k)=pq^{k-1}$ with k = 1, 2, 3, ...

The probability of the event (x > m) is: $P(x>m)=\sum_{k=m+1}^{\infty}P(x=k)=\sum_{k=m+1}^{\infty}pq^{k-1}=pq^{m}(1+q+\dots)=\frac{pq^{m}}{1-q}=q^{m}$
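
This tail formula is easy to verify by simulation (a sketch; p = 0.3 and m = 5 are arbitrary):

```python
import numpy as np

p, m = 0.3, 5
rng = np.random.default_rng(0)
# rng.geometric returns the number of trials needed for the first success (k = 1, 2, ...).
x = rng.geometric(p, size=1_000_000)
print((x > m).mean())           # empirical P(x > m)
print((1 - p) ** m)             # q^m, the closed-form answer
```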

Negative binomial distribution

x follows a negative binomial distribution with parameters r and p if $P(x=k)=\binom{k-1}{r-1}p^{r}q^{k-r}$ with k = r, r+1, ...

Discrete uniform distribution

$P(x=k) = \frac{1}{N}$ with k = 1, 2, ..., N

Joint probability

The joint probability of variables x and y, written Pr(x, y), is the probability that the two values occur together. The sum of the probabilities over all joint outcomes is still one, as usual. When we consider multiple variables, we write Pr(x, y, z) for the joint probability of x, y, and z, or Pr(x) for the joint probability of all of the elements of the multidimensional variable $x=[x_1, x_2, \dots, x_K]$; similarly for Pr(x, y).

To extract the probability distribution of a single variable from a joint distribution we sum (or integrate) over all other variables:

$Pr(x)=\int Pr(x,y)\,dy$ for continuous y.

$Pr(x)=\sum_{y} Pr(x,y)$ for discrete y.

Pr(x) is called the marginal distribution, and computing it this way is called marginalization.


Image: Joint probability of two continuous variables x and y
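
For discrete variables, marginalization is just a sum over one axis of the joint table. A sketch with a made-up 2×3 joint distribution:

```python
import numpy as np

# A made-up joint distribution Pr(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

pr_x = joint.sum(axis=1)   # marginal Pr(x): sum over y
pr_y = joint.sum(axis=0)   # marginal Pr(y): sum over x
print(pr_x, pr_y)
```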

Conditional probability

The conditional probability is the probability of x conditioned on $y=y^*$, written mathematically as $Pr(x \mid y=y^*)$. However, the values $Pr(x, y=y^*)$ taken from this slice of the joint distribution do not, in general, sum to 1. So we normalize by the sum of all the probabilities in the slice so that the conditional probabilities form a distribution:

$$Pr(x \mid y=y^*)=\frac{Pr(x,y=y^*)}{\int Pr(x,y=y^*)\,dx}=\frac{Pr(x,y=y^*)}{Pr(y=y^*)}$$

The denominator is the marginal probability of $y=y^*$. The above is also written more compactly as:

$$Pr(x \mid y)=\frac{Pr(x,y)}{Pr(y)}$$


Image: Conditional probability of variable x given two values of y
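
Continuing the same made-up joint table from the sketch above, conditioning on a value of y amounts to taking one column and renormalizing it:

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

y_star = 1                                   # condition on y = y* (second column)
slice_ = joint[:, y_star]                    # Pr(x, y = y*), does not sum to 1
pr_x_given_y = slice_ / slice_.sum()         # normalize by Pr(y = y*)
print(pr_x_given_y, pr_x_given_y.sum())      # a proper distribution over x
```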

Bayes’ rule

Since $Pr(x,y)=Pr(y \mid x)Pr(x)$, and likewise $Pr(x,y)=Pr(x \mid y)Pr(y)$, combining them gives $Pr(y \mid x)Pr(x)=Pr(x \mid y)Pr(y)$. Hence:

$$Pr(y \mid x)=\frac{Pr(x \mid y)Pr(y)}{Pr(x)}=\frac{Pr(x \mid y)Pr(y)}{\int Pr(x,y)\,dy}$$

This is called Bayes' rule. $Pr(y \mid x)$ is called the posterior: what we know about y after taking x into account. $Pr(y)$ is the prior: what we know about y before considering x. $Pr(x \mid y)$ is called the likelihood and $Pr(x)$ the evidence. So the posterior equals the likelihood multiplied by the prior, normalized by the evidence.
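
A small numerical illustration of Bayes' rule (a sketch; the prior and likelihood values for a two-state y and a binary observation x are invented):

```python
import numpy as np

prior = np.array([0.8, 0.2])           # Pr(y) for y in {0, 1}
likelihood = np.array([0.1, 0.7])      # Pr(x = 1 | y) for y in {0, 1}

evidence = (likelihood * prior).sum()  # Pr(x = 1) = sum_y Pr(x = 1 | y) Pr(y)
posterior = likelihood * prior / evidence
print(posterior)                       # Pr(y | x = 1), sums to 1
```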

Independence

Independence means that knowing x gives no information about y (and vice versa). Hence the conditional probability equals the marginal: $Pr(x \mid y)=Pr(x)$. The joint probability then becomes the product of the marginal probabilities: $Pr(x,y)=Pr(x \mid y)Pr(y)=Pr(x)Pr(y)$. Note that independence is different from mutual exclusivity: for two mutually exclusive events A and B, $P(A\cup B)=\frac{N_{A+B}}{N}=\frac{N_A}{N}+\frac{N_B}{N}=P(A)+P(B)$.

Expectation

Given random variable x with Pr(x) and a function f(x), we can calculate the expected value of f(x):

$E[f(x)]=\sum_{x} f(x)\,Pr(x)$ for discrete x

$E[f(x)]=\int f(x)\,Pr(x)\,dx$ for continuous x

For multiple variables x and y:

$E[f(x,y)]=\iint f(x,y)\,Pr(x,y)\,dx\,dy$

When thinking of expectations, remember these rules:

  • the expected value of a constant k with respect to the random variable x is k itself: $E[k]=k$

  • the expected value of a constant k times a function of x is k times the expected value of that function: $E[k\,f(x)]=k\,E[f(x)]$

  • the expected value of the sum of two functions of x is the sum of their individual expected values: $E[f(x)+g(x)]=E[f(x)]+E[g(x)]$

  • the expected value of the product of two functions f(x) and g(y) is the product of the individual expected values if x and y are independent: $E[f(x)\,g(y)]=E[f(x)]\,E[g(y)]$
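
These rules can be checked by Monte Carlo. A sketch, with the arbitrary choices f(x) = x² and g(y) = y for independent standard normal x and y:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)         # independent of x

f, g = x ** 2, y
print(np.mean(3 * f), 3 * np.mean(f))              # E[k f(x)] = k E[f(x)]
print(np.mean(f + g), np.mean(f) + np.mean(g))     # E[f + g] = E[f] + E[g]
print(np.mean(f * g), np.mean(f) * np.mean(g))     # E[f g] = E[f] E[g] (independent x, y)
```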

The expectation has special names for certain functions. Let the mean of the random variable x be $\mu_x$. Then the expectation of $f(x)=(x-\mu_x)^2$ is called the variance, that of $f(x)=(x-\mu_x)^3$ the skew, that of $f(x)=(x-\mu_x)^4$ the kurtosis, and that of $(x-\mu_x)(y-\mu_y)$ the covariance of x and y.

We denote by $m_n=E[x^{n}]=\int x^{n} f(x)\,dx$ the moments of the random variable x. The central moments are taken about the mean: $\mu_n=E[(x-\mu)^{n}]=\int (x-\mu)^{n} f(x)\,dx$.

The absolute moments are $E[|x|^{n}]$ and $E[|x-\mu|^{n}]$.

Variance

The variance of f(x) is defined by:

$$\mathrm{var}(f)=E\big[(f(x)-E[f(x)])^2\big]$$

This measures how much variability there is in f(x) around its mean value E[f(x)]. It is equivalent to:

$$\mathrm{var}(f)=E[f(x)^2]-E[f(x)]^2$$

For a single variable x, $\mathrm{var}(x)=E[x^2]-E[x]^2$. For two random variables x and y, the covariance is defined by $\mathrm{cov}(x,y)=E_{x,y}[(x-E[x])(y-E[y])]=E_{x,y}[xy]-E[x]E[y]$. The covariance measures how much x and y vary together. If x and y are independent then the covariance is 0.
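
A final sketch (the independent normal samples and their parameters are arbitrary choices) computing the variance and covariance identities above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
y = rng.normal(loc=-1.0, scale=0.5, size=1_000_000)   # independent of x

# var(x) = E[x^2] - E[x]^2
print(np.mean(x ** 2) - np.mean(x) ** 2, np.var(x))

# cov(x, y) = E[xy] - E[x] E[y]; near 0 because x and y are independent.
print(np.mean(x * y) - np.mean(x) * np.mean(y))
```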