September 12, 2020

Intro to AI - Uncertainty

ai knowledge cs50

Definitions

probability

Probability is the likelihood that a certain event \$a\$ happens. Every possible outcome can be thought of as a world \$omega\$ with its probability \$P(omega)\$.
The probabilities of all possible worlds summed together equal \$1\$:
\$sum_(omega in Omega) P(omega) = 1\$
For the probability of a single world we always have:
\$ 0 le P(omega) le 1\$
where \$0\$ means the event is impossible and \$1\$ the event is certain.
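
A minimal sketch in Python, assuming a fair six-sided die as the sample space (a made-up example, not from the lecture):

```python
# Hypothetical example: a fair six-sided die.
# Each outcome is one world omega with P(omega) = 1/6.
from fractions import Fraction

worlds = {omega: Fraction(1, 6) for omega in range(1, 7)}

assert sum(worlds.values()) == 1                       # the probabilities sum to 1
assert all(0 <= p <= 1 for p in worlds.values())       # each P(omega) is in [0, 1]
```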

unconditional probability;
marginal probability

Unconditional probability or marginal probability is the degree of belief in a proposition in the absence of any other evidence.
The probability is unaffected by previous or future events. It can be determined by counting the outcomes in which the event \$a\$ occurs and dividing by the total number of possible outcomes.

\$P(a) = (text{number of times 'a' occurs}) / (text{total number of possible outcomes})\$
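
A small sketch, assuming the event \$a\$ is "the sum of two fair dice equals 7" (a made-up example, not from the lecture):

```python
# Count the worlds where the event occurs and divide by all possible worlds.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))        # all 36 possible worlds
favourable = [o for o in outcomes if sum(o) == 7]      # worlds where a occurs

print(Fraction(len(favourable), len(outcomes)))        # 1/6
```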

conditional probability

Conditional probability is the degree of belief in a proposition given that some evidence has already been revealed.
For example, the probability of \$a\$ given that \$b\$ is true:
\$P(a|b) = (P(a wedge b)) / (P (b)) \$
We are interested in the events where both \$a\$ and \$b\$ are true (the numerator), but only from the worlds where we know \$b\$ to be true (the denominator).
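
A small sketch continuing the two-dice example, with my own made-up choice of events: \$a\$ = "the first die shows 6", \$b\$ = "the sum is at least 10".

```python
# P(a | b) = P(a and b) / P(b), computed by counting worlds.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

b_worlds       = [o for o in outcomes if sum(o) >= 10]     # worlds where b holds
a_and_b_worlds = [o for o in b_worlds if o[0] == 6]        # worlds where a and b hold

p_b       = Fraction(len(b_worlds), len(outcomes))         # 6/36
p_a_and_b = Fraction(len(a_and_b_worlds), len(outcomes))   # 3/36

print(p_a_and_b / p_b)                                     # 1/2
```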

joint probability

Joint probability is the probability of the events \$a\$ and \$b\$ both occurring.
The formula is derived from conditional probability by multiplying with \$P(b)\$ or \$P(a)\$, respectively:
\$P(a wedge b) = P(a|b) \ P (b) \$
\$P(a wedge b) = P(b|a) \ P (a) \$
\$P(a wedge b)\$ can also be written as \$P(a, b)\$

random variable

A random variable in probability theory is a variable with a domain of possible (discrete) values it can take on; each possible value may have a different probability.

probability distribution

A probability distribution is a function that gives the probability of each possible value of a random variable. In the case of discrete values this can simply be a vector with one probability per possible value of the variable.
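
As a minimal sketch, a discrete distribution can be stored as a plain mapping from value to probability (the variable and its numbers are made up):

```python
# Probability distribution of a hypothetical random variable "Flight".
flight = {"on time": 0.6, "delayed": 0.3, "cancelled": 0.1}

assert abs(sum(flight.values()) - 1.0) < 1e-9   # a valid distribution sums to 1
```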

independence

Independence is the knowledge that the outcome of one event doesn’t affect the outcome of another event. For two independent events the following is true:
\$P(a|b) = P(a) \$
Used in the formula for joint probability, this gives us for two independent events \$a\$ and \$b\$: \$P(a wedge b) = P(a) \ P (b) \$

Bayes' Rule

Using the joint probability formulas \$P(a wedge b) = P(a|b) \ P (b) \$
\$P(a wedge b) = P(b|a) \ P (a) \$

gives us:
\$P(a|b) \ P (b) = P(b|a) \ P (a) \$
and:
\$P(a|b) = ( P(b|a) \ P (a) ) / (P (b))\$
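
A small numeric sketch of Bayes' rule with made-up numbers: knowing \$P(b|a)\$, \$P(a)\$ and \$P(b)\$, we can compute \$P(a|b)\$.

```python
# a = "rain in the afternoon", b = "clouds in the morning" (hypothetical values)
from fractions import Fraction

p_b_given_a = Fraction(8, 10)   # P(clouds | rain)
p_a = Fraction(1, 10)           # P(rain)
p_b = Fraction(4, 10)           # P(clouds)

print(p_b_given_a * p_a / p_b)  # P(rain | clouds) = 1/5
```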

Bayesian Networks

A Bayesian network is a directed graph in which each node represents a random variable. Each edge from node \$X\$ to \$Y\$ represents a parent-child relationship: the probability distribution of \$Y\$ depends on the value of \$X\$. Each node \$Y\$ has a probability distribution \$P(Y | Parents(Y))\$.
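
A minimal sketch of how such a network could be stored, assuming a made-up two-node network Rain -> WetGrass with made-up numbers:

```python
# Each node stores its distribution P(node | Parents(node)).
p_rain = {"rain": 0.2, "no rain": 0.8}         # Rain has no parents

p_wet_grass = {                                # P(WetGrass | Rain), one row per parent value
    "rain":    {"wet": 0.9, "dry": 0.1},
    "no rain": {"wet": 0.1, "dry": 0.9},
}

# A joint probability follows the chain of parent-child relationships, e.g.
# P(Rain = rain, WetGrass = wet) = P(rain) * P(wet | rain)
print(p_rain["rain"] * p_wet_grass["rain"]["wet"])   # 0.18 (up to float rounding)
```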

negation

\$P(not a) = 1 - P(a)\$

inclusion-exclusion

\$P(a vee b) = P(a) + P(b) - P(a wedge b)\$
Both \$P(a)\$ and \$P(b)\$ include the case \$a wedge b\$, one as \$(a wedge b)\$ and the other as \$(b wedge a)\$, so it is counted twice and we have to subtract its probability once.

marginalization

For a simple Boolean variable \$b\$:
\$P(a) = P(a wedge b) + P(a wedge not b)\$
or more generally:
\$P(X = x_i) = sum_(j) P(X = x_i wedge Y = y_j)\$

conditioning

For a single Boolean variable \$b\$:
\$P(a) = P(a|b) \ P(b) + P(a|not b) \ P(not b) \$
or more generally:
\$P(X = x_i) = sum_(j) P(X = x_i | Y = y_j) \ P( Y = y_j)\$
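
A small numeric check of both identities, using the cloud/rain joint probabilities from the example calculation further below:

```python
# Marginalization: P(rain) = P(rain, clouds) + P(rain, not clouds)
p_rain_and_clouds     = 0.08
p_rain_and_not_clouds = 0.02
p_rain = p_rain_and_clouds + p_rain_and_not_clouds
print(p_rain)                                   # 0.1

# Conditioning gives the same value:
# P(rain) = P(rain | clouds) P(clouds) + P(rain | not clouds) P(not clouds)
p_clouds = 0.08 + 0.32                          # marginalize over R to get P(clouds)
p_rain_given_clouds     = p_rain_and_clouds / p_clouds
p_rain_given_not_clouds = p_rain_and_not_clouds / (1 - p_clouds)
print(p_rain_given_clouds * p_clouds
      + p_rain_given_not_clouds * (1 - p_clouds))   # ~0.1
```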

sample

To sample means randomly picking a value for a variable according to its probability distribution.

markov assumption

The Markov assumption is an assumption that the current state depends on only a finite fixed number of previous states.

markov chain

A Markov chain is a sequence of random variables where the distribution of each variable follows the Markov assumption. That is, each event in the chain occurs based on the probability of the event before it.

transition model

The transition model of a Markov chain gives the probability distribution of the next event for each possible value of the current event.
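
A minimal sketch of sampling from such a transition model, assuming a made-up two-state weather chain:

```python
import random

# P(next state | current state) for a hypothetical two-state weather chain
transition_model = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(start, steps):
    state, chain = start, [start]
    for _ in range(steps):
        dist = transition_model[state]
        state = random.choices(list(dist), weights=list(dist.values()))[0]
        chain.append(state)
    return chain

print(sample_chain("sun", 10))   # e.g. ['sun', 'sun', 'rain', ...]
```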

hidden markov model

A hidden Markov model is a type of Markov model for a system with hidden states that generate some observed event. This means that sometimes, the AI has some measurement of the world but no access to the precise state of the world.
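
A minimal sketch of the two distributions a hidden Markov model combines, again with made-up weather numbers: the hidden weather state is never observed directly, we only see whether people carry umbrellas.

```python
import random

# P(next hidden state | current hidden state)
transition_model = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

# P(observation | hidden state): the observed event generated by the hidden state
sensor_model = {
    "sun":  {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

# The AI only ever sees the observation, not the hidden state itself:
hidden_state = "rain"
obs_dist = sensor_model[hidden_state]
observation = random.choices(list(obs_dist), weights=list(obs_dist.values()))[0]
print(observation)
```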

Example calculations

Using joint probabilities to deduce conditional probability

given:

\$P(C = text{clouds}, R = text{rain}) = 0.08\$
\$P(C = text{clouds}, R = not text{rain}) = 0.32\$
\$P(C = not text{clouds}, R = text{rain}) = 0.02\$
\$P(C = not text{clouds}, R = not text{rain}) = 0.58\$

We can deduce the probability distribution for clouds given that we have rain:
\$P(C | text{rain}) = (P(C , text{rain})) / (P(text{rain})) \$

with \$alpha = 1 / (P(text{rain}))\$:
\$ = alpha \ P(C , text{rain}) \$

with the joint probabilities for \$R = text{rain}\$ from above:
\$ = alpha ((0.08), (0.02)) \$

with \$sum_(c in C) P(c | text{rain}) = 1\$, which gives \$alpha = 1 / (0.08 + 0.02) = 10\$:
\$ = ((0.8), (0.2)) \$
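
The same calculation as a small Python sketch:

```python
# Normalize the joint probabilities for R = rain to get P(C | rain).
joint_given_rain = {"clouds": 0.08, "no clouds": 0.02}

alpha = 1 / sum(joint_given_rain.values())            # alpha = 1 / P(rain) = 10
p_c_given_rain = {c: alpha * p for c, p in joint_given_rain.items()}
print(p_c_given_rain)   # {'clouds': 0.8, 'no clouds': 0.2} (up to float rounding)
```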

Inference in Bayesian networks

inference by enumeration

Inference by enumeration is a process of finding the probability distribution of a variable \$X\$ given observed evidence \$e\$ and some hidden variables \$Y\$:

\$P(X | e) = alpha P(X , e) = alpha sum_(y) P(X, e, y)\$
We sum the joint probability of \$X\$, \$e\$, and \$y\$ over every value \$y\$ the hidden variables \$Y\$ can take, and normalize the result with \$alpha\$.

The drawback is the high computational effort needed.
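
A minimal sketch of inference by enumeration on a made-up chain network Rain -> Maintenance -> Train (the structure and all numbers are my own simplification, not the lecture's network): we query the distribution of Rain given the evidence Train = delayed, summing over the hidden variable Maintenance.

```python
p_rain = {"rain": 0.2, "no rain": 0.8}                 # P(Rain)

p_maintenance = {                                      # P(Maintenance | Rain)
    "rain":    {"yes": 0.6, "no": 0.4},
    "no rain": {"yes": 0.2, "no": 0.8},
}

p_train = {                                            # P(Train | Maintenance)
    "yes": {"on time": 0.6, "delayed": 0.4},
    "no":  {"on time": 0.9, "delayed": 0.1},
}

def enumerate_query(train_evidence):
    # P(Rain, e) = sum over the hidden variable Maintenance of the full joint
    unnormalized = {}
    for r in p_rain:
        unnormalized[r] = sum(p_rain[r] * p_maintenance[r][m] * p_train[m][train_evidence]
                              for m in p_maintenance[r])
    alpha = 1 / sum(unnormalized.values())             # normalize the distribution
    return {r: alpha * p for r, p in unnormalized.items()}

print(enumerate_query("delayed"))   # roughly {'rain': 0.30, 'no rain': 0.70}
```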

approximate inference by sampling

We sample the variables in the nodes of a Bayesian network, starting at the top-level nodes and using each node's probability distribution given the already sampled values of its parent nodes. This gives us a number of samples that can be used for inference.

  • \$P(A = a)\$
    can be calculated by counting the samples with the value \$A = a\$ and dividing by the total number of collected samples.
  • \$P(A = a | B = b)\$
    can be calculated by counting the samples with the values \$A = a\$ and \$B = b\$ and dividing by the number of samples with the value \$B = b\$.

The drawback is that for a low probability of \$B = b\$ we lose a lot of samples (see the sketch below).
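
A minimal sketch of this sampling approach, reusing the made-up Rain -> WetGrass network from the Bayesian-network sketch above; samples that don't match the evidence are discarded.

```python
import random

def sample_network():
    # Sample top-level Rain first, then WetGrass from P(WetGrass | Rain).
    rain = random.choices(["rain", "no rain"], weights=[0.2, 0.8])[0]
    p_wet = {"rain": 0.9, "no rain": 0.1}[rain]
    grass = random.choices(["wet", "dry"], weights=[p_wet, 1 - p_wet])[0]
    return rain, grass

samples = [sample_network() for _ in range(100_000)]

# P(Rain = rain | WetGrass = wet): keep only samples consistent with the evidence
kept = [s for s in samples if s[1] == "wet"]
print(sum(1 for s in kept if s[0] == "rain") / len(kept))   # ~0.69
```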

likelihood weighting

  • Start by fixing the values for evidence variables.
  • Sample the non-evidence variables using conditional probabilities in the Bayesian network.
  • Weight each sample by its likelihood: the probability of all the evidence occurring.
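
A minimal sketch of these three steps on the same made-up Rain -> WetGrass network, with WetGrass = wet fixed as evidence:

```python
import random

p_rain = {"rain": 0.2, "no rain": 0.8}
p_wet_given_rain = {"rain": 0.9, "no rain": 0.1}       # P(WetGrass = wet | Rain)

weights = {"rain": 0.0, "no rain": 0.0}
for _ in range(100_000):
    # Sample only the non-evidence variable Rain ...
    rain = random.choices(list(p_rain), weights=list(p_rain.values()))[0]
    # ... and weight the sample by the likelihood of the fixed evidence.
    weights[rain] += p_wet_given_rain[rain]

total = sum(weights.values())
print({r: w / total for r, w in weights.items()})      # roughly {'rain': 0.69, 'no rain': 0.31}
```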