Wednesday, February 13, 2013 CC-BY-NC
Bayes Networks

Maintainer: admin
  • You are allowed one cheat sheet for the midterm
  • Today's lecture is the last lecture being evaluated

- Today's lecture is on the board
- Bayes Nets are a probabilistic graphical model
- The word "probabilistic" tells you that they represent some kind of probability distribution. The word "graphical" means they do this through a graph, which is annotated with quantitative information that lets us talk about the probability distribution
- We've previously spoken about a model called the Naive Bayes model. This model looks like this:
- I have some random variable, e.g. Disease, and some other random variables, e.g. Symptom 1, Symptom 2; the graph tells us that the disease influences these symptoms. If we have the disease, the graph tells us how likely each symptom is.
- This will be a directed acyclic graph where the arcs represent influences.
- More importantly, the lack of arcs will be equivalent to conditional independence
- In this particular case, the way the graph is interpreted is that the disease influences the symptoms but the symptoms don't affect each other
- So Symptom 1 will be conditionally independent of the other symptoms given the disease
- So this graph represents a probability distribution of these variables and we will annotate it
- On each node we'll have some probability of the symptom, given the disease
- In general, this graph may be given but not the numbers
- We will later talk how to acquire the numbers from the data
- We can factorize the distribution using the graph
- We're going to write this as a product of the probabilities at all the nodes in the graph. So P(D, S1, S2, ..., Sn) = P(D) * P(S1|D) * P(S2|D) * ... * P(Sn|D)
- I can always write any joint distribution in this way using the chain rule
- But the graph tells me that I can simplify things
- So what we're doing is that for S2 you don't need to know S1 so long as you know D.
- This lack of arcs is equivalent to saying we can drop variables in the conditions.
- This basically means that we have some conditional independence relations
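A minimal sketch of this factorization in Python (the CPT numbers below are made up purely for illustration; only the structure comes from the graph):

```python
# Naive Bayes factorization: P(D, S1, ..., Sn) = P(D) * prod_i P(Si | D)
# All numbers below are invented for illustration only.

p_d = {1: 0.01, 0: 0.99}                              # P(D)
p_s_given_d = [                                       # one table P(Si | D) per symptom
    {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}},       # P(S1 | D)
    {1: {1: 0.7, 0: 0.3}, 0: {1: 0.05, 0: 0.95}},     # P(S2 | D)
]

def joint(d, symptoms):
    """P(D=d, S1=s1, ..., Sn=sn) = P(D=d) * prod_i P(Si=si | D=d)."""
    p = p_d[d]
    for table, s in zip(p_s_given_d, symptoms):
        p *= table[d][s]
    return p

print(joint(1, [1, 0]))   # P(D=1, S1=1, S2=0) = 0.01 * 0.8 * 0.3 = 0.0024
```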
- Why would we want to do this kind of thing? Why is this useful?
- Consider these are binary variables. I am going to use binary distributions for all these nodes in order to represent the numbers
- So P(S1|D=0), P(S1|D=1), etc. How many numbers will I need in order to make this kind of model? O(n)
- How many numbers do we need if instead we assume a full joint distribution? With binary variables, we're looking at the joint distribution over the disease and the n symptoms: O(2^n)
- If we pick a simplified model, we have that many fewer parameters to figure out.
- The gain is a simpler model with fewer parameters.
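- For example (taking, say, n = 10 binary symptoms): the Naive Bayes model needs 1 number for P(D) plus 2 per symptom (P(Si = 1 | D = 0) and P(Si = 1 | D = 1)), i.e. 1 + 2*10 = 21 numbers, whereas the full joint over D and the 10 symptoms needs 2^11 - 1 = 2047 numbers.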
- What's the catch? The assumption may not be true.
- In other words, we might have the wrong model.
- Of course, very often conditional independence assumptions are pretty good. In many cases, we can partition our knowledge into simpler chunks.
- Sometimes, we make some wrong assumptions and then our model is wrong.
- Bear in mind that the assumptions and structure are very important.
- Now we're going to look at some very simple cases over 3 variables.
- That gives us a canonical set of graphs that we need to look at.
- We're going to have 3 random variables. One of them is going to be Quality of Education (Q), one of them is going to be Funding for University (F), and the other is going to be Your chances of getting a good job (J).
- And so, these are somewhat interconnected. I can write down different graphical models about the relationships between these variables.
F --> Q --> J
This is an instance of a Markov Chain.
- Now we can take this graph and annotate it
- So over the F node we put P(F) --> we can picture some numbers there
- Then we can put over the Q node P(Q|F), and over the J node P(J|Q)
- Then if I want to write the joint distribution of these variables
P(F, Q, J) = P(F) * P(Q|F) * P(J|Q)
- And now I can see evidence.
- Evidence means that I can observe the value of some of these random variables.
For example: I may observe that Funding = No. Then I can use the probability distribution to see what the likelihood is of getting a good job.
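A rough sketch of that computation (the CPT numbers are invented; 1 = yes/good, 0 = no/bad):

```python
# Markov chain F -> Q -> J, with made-up CPTs.
p_f = {1: 0.7, 0: 0.3}                                      # P(F)
p_q_given_f = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.4, 0: 0.6}}    # P(Q | F)
p_j_given_q = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}    # P(J | Q)

def p_j_given_f(j, f):
    """P(J=j | F=f) = sum over Q of P(Q=q | F=f) * P(J=j | Q=q)."""
    return sum(p_q_given_f[f][q] * p_j_given_q[q][j] for q in (0, 1))

print(p_j_given_f(1, 0))   # chance of a good job given Funding = No
```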
- This is the canonical example of a chain
- They encode an interesting conditional independence
- What this says is that F and J are independent given the variable Q
F ⊥ J | Q   (the symbol is like an upside-down T)

hw: Check that P(J|Q,F) = P(J|Q)
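One way to check this (a sketch, just plugging in the chain factorization above):
P(J|Q,F) = P(F,Q,J) / P(F,Q) = [P(F) * P(Q|F) * P(J|Q)] / [P(F) * P(Q|F)] = P(J|Q)
using P(F,Q) = sum over J of P(F) * P(Q|F) * P(J|Q) = P(F) * P(Q|F), since the P(J|Q) values sum to 1.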

Now we'll look at the V-structure.

Again, we'll have 3 random variables.
The first one is that a Crime has been committed (C); I'll have 2 suspects (S1 and S2)
- Either of them could have committed the crime, but they each have some chance. They didn't like the person or whatever.
- What kind of graph structure would I use to encode this kind of influence or independence?
- Have the 2 suspects as nodes feeding into crime
S1 --> C
S2 --> C - Looks like a V but I can't draw it properly on notepad
- We're assuming the 2 suspects are independent --> this may be wrong
- There may be some other random variable: Like a mafia boss (Feeding into S1 and S2) that ordered the 2. But this is all we have so far.
- Suppose that a crime has been committed and the police have done interviews and S1 has a really good alibi.
- The probability that S1 did it will go down, which automatically means the probability that S2 did it will go up. They're not independent anymore.
- So this is another point in time, after the evidence has come in
- We got an interesting reasoning pattern called: Explaining away
- Explaining away means that if I have evidence for the value of one those variables, that's going to influence the probability of the other variable; even though they were initially independent
- It's a good reasoning pattern for all kinds of fault diagnosis, including debugging
- This kind of computation is happening at query answering
- This is important in Bioinformatics gene networks
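A small numeric sketch of explaining away on the crime example (all numbers invented; a perfect alibi is modeled crudely as observing S1 = 0):

```python
from itertools import product

# V-structure S1 -> C <- S2 (1 = guilty / crime committed), with made-up numbers.
p_s1 = {1: 0.1, 0: 0.9}
p_s2 = {1: 0.1, 0: 0.9}
p_c1 = {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}   # P(C=1 | S1, S2)

def joint(s1, s2, c):
    pc1 = p_c1[(s1, s2)]
    return p_s1[s1] * p_s2[s2] * (pc1 if c == 1 else 1 - pc1)

def prob_s2_guilty(s1_evidence=None):
    """P(S2=1 | C=1 [, S1=s1_evidence]) by enumerating the joint."""
    num = den = 0.0
    for s1, s2 in product((0, 1), repeat=2):
        if s1_evidence is not None and s1 != s1_evidence:
            continue
        w = joint(s1, s2, 1)          # condition on C = 1 (a crime was committed)
        den += w
        if s2 == 1:
            num += w
    return num / den

print(prob_s2_guilty())      # P(S2=1 | C=1)        ~ 0.51
print(prob_s2_guilty(0))     # P(S2=1 | C=1, S1=0)  ~ 0.91: S2 becomes much more likely
```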
- For any kind of graph now, I can think of writing it down in this way.
- For example, I could have a program which is a random variable and I could have a test on the program which might fail or succeed
- This program is perhaps written by two people, Human1 (H1) and Human2 (H2).
- Tests can either Pass or Fail.
- I want to know things like: given that H1 is a good programmer, or given that H2 is a bad programmer, what is the probability that the test is going to succeed?
- Or I may want to do another kind of reasoning: what is the probability that H1 is a good programmer, given that I tested the program and it failed?
- We need to know what we observed, and what influence each variable has on the others.
- In the first case we have two programmers, there's some a priori probability that they're good. I can look at the program and then the program influences the outcome of the test.
H1 ---> P
H2 ---> P
P ---> T

I can make this more complicated. For example, it's not just the program that influences a test but also the data.
- When I make a graph, this automatically tells us our probability distribution.
For example P(T|P)
P(P|H1, H2)
- I'm assuming I can factorize over this and this gives us the distribution for the whole set of random variables.
- Now suppose that I want to answer queries. I'm going to ask something like: What is the probability of H1 having some particular value given that the test fails?
- How do I do this? I can go through the joint probability distribution.
So
P(H1 = G| T = F) = P(H1 = G, T = F) / P(T = F)
- The term on the bottom is just the probability of the evidence (a normalizing constant), so we're going to focus on the term on top.
- If I were to write this in terms of the joint probability distribution of all the variables, which I do know, I can write the term on top as a sum over all possible values of H2 and all possible values of P:
P(H1 = G, T = F) = Sum over H2 ( Sum over P ( P(H1 = G, H2, P, T = F) ) )
- H2 and P aren't evidence or query variables. In statistics they're called nuisance variables.
- The way we get rid of them is by the operation called "summing out" (marginalizing)
- Then we're left with the thing we actually want.
- Here we have discrete values so we can do sums. Continuous values are expensive, we won't worry about them for now. This is going to be the marginal probability of these two variables.
- The interesting thing is we know how to obtain the joint. Factorization.
P(H1 = G) * P(H2) * P(P|H1 = G, H2) * P(T = F|P)
- This is specified by the graph structure
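A brute-force sketch of this kind of query by enumeration (CPT numbers invented; H1 and H2 take values 'G'/'B', P and T are 1 = works/passes, 0 = fails):

```python
from itertools import product

# Bayes net H1 -> P <- H2, P -> T, with made-up CPTs.
p_h = {'G': 0.5, 'B': 0.5}                        # prior for each programmer
p_prog_ok = {('G', 'G'): 0.95, ('G', 'B'): 0.6,   # P(P = 1 | H1, H2)
             ('B', 'G'): 0.6, ('B', 'B'): 0.2}
p_test_pass = {1: 0.9, 0: 0.1}                    # P(T = 1 | P)

def joint(h1, h2, prog, test):
    p = p_h[h1] * p_h[h2]
    p_ok = p_prog_ok[(h1, h2)]
    p *= p_ok if prog == 1 else 1 - p_ok
    p_pass = p_test_pass[prog]
    return p * (p_pass if test == 1 else 1 - p_pass)

# P(H1 = G | T = 0): sum out the nuisance variables H2 and P, then normalize.
num = sum(joint('G', h2, prog, 0) for h2, prog in product('GB', (0, 1)))
den = sum(joint(h1, h2, prog, 0) for h1, h2, prog in product('GB', 'GB', (0, 1)))
print(num / den)
```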
- Now we have this structure but if I tried to do this kind of reasoning, how many operations am I going to have to do for these sums?
- On the order of k^(number of nuisance variables), where k is the number of values each variable can take
- Notice, that not all factors depend on everything.
- Some factors may depend on some of the variables. Not all of the variables. I don't need to compute all these big factors.
- I can take these sums and distribute them over the multiplications in order to compute them quicker in different parts of the Net.
- In this case, the numerator is
Sum over H2 ( Sum over P ( P(H1 = G) * P(H2) * P(P|H1 = G, H2) * P(T = F|P) ) )
- First of all, P(H1 = G) is a constant, so it can be pulled out of both sums. Then the factors that don't depend on P can be pulled out of the inner sum over P.
- So we get P(H1 = G) * Sum over H2 ( P(H2) * Sum over P ( P(P|H1 = G, H2) * P(T = F|P) ) )
- You can marginalize further through variable elimination
- This is an algorithm which will always let you compute exactly the right answer using your graph
- But given the structure of the graph, and how smart you are about the elimination order, it may be significantly cheaper.
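A sketch of the same query with the sums pushed in (reusing the invented CPTs from the sketch above); this reordering is the basic move behind variable elimination:

```python
# P(H1 = h1, T = 0) with the sums distributed:
#   P(H1) * sum_H2 [ P(H2) * sum_P [ P(P | H1, H2) * P(T = 0 | P) ] ]
def score(h1):
    total = 0.0
    for h2 in 'GB':
        inner = 0.0
        for prog in (0, 1):
            p_ok = p_prog_ok[(h1, h2)]
            p_prog = p_ok if prog == 1 else 1 - p_ok
            inner += p_prog * (1 - p_test_pass[prog])   # P(T = 0 | P) = 1 - P(T = 1 | P)
        total += p_h[h2] * inner
    return p_h[h1] * total

print(score('G') / (score('G') + score('B')))   # same P(H1 = G | T = 0) as before
```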
- Say we have
(A) --> (B) --> (C) --> (D)
- I could think of doing the sums over the nuisance variables in any order, but I can sum out the things I don't care about one at a time.
- I can take the terms P(A), P(B|A), P(C|B), and sum over everything that has to do with B first.
- If I do this kind of sum, what I end up with is a term that tells you the probability of C given what I had in A.
- The new factor P(C|A) no longer involves B at all.
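Written out (a sketch using the chain factorization), summing out B first gives:
Sum over B ( P(A) * P(B|A) * P(C|B) ) = P(A) * Sum over B ( P(B|A) * P(C|B) )
and since in this chain P(C|B) = P(C|B,A), the remaining sum is Sum over B ( P(B,C|A) ), which is just P(C|A). So we're left with P(A) * P(C|A), a factor that no longer mentions B.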
- She's explaining this a bit poorly. I am REALLY confused. Find stuff online.
- In the worst case: Variable elimination has no computational benefit.
- Best case, you can go from exponential to roughly linear.
- On the average, it tends to help but it's not really enough for big graphs.
- For big graphs we're going to need a better tool.
- Let's consider a larger graph
(A) --> (D)
(B) --> (D)
(D) --> (Z)
(C) --> (Z)
(Z) = e   (Z is observed as evidence)
(Z) --> (F)
(Z) --> (G)
(G) --> (H)
- What can we ignore?
- We're not interested in indirect influences. (Z) has been observed; everything that influences (Z) doesn't impact what may happen at (H) anymore.
- It blocks flow of information from its parents to its children.
- If I didn't know (Z), would I care about (F)? If I observed (Z)'s value, I won't care about (F). But if I didn't, (F) may be telling me something about (Z).
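As a worked reading of the graph above (this is standard d-separation reasoning, not something spelled out in class): with (Z) observed, every path from (A), (B), (C), or (D) down to (F), (G), or (H) passes through (Z) as a chain or fork node, so it is blocked; on the other hand, (C) and (D), which are both parents of (Z), become dependent once (Z) is observed, because they form a V-structure at (Z).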
- The question is: Given the evidence, which conditional independencies are implied by a graph?
- A related question is: if I want to determine the value of some particular node, and by value I mean some probability distribution, what other nodes do I need to know?
- The first question is useful when I try to think of the computational aspect.
- The second question is an experimental design question.
- Which nodes do I need to probe in order to find the values?
- We're going to start with the second question because it's easier and more intuitive.
- Suppose you're a node in a Bayes net, and maybe you have children and maybe you don't: which nodes influence you?
- Once you know those, you can stop caring about everything else.
- V-structures introduce complexity into these models. If I have information going in, information will get reflected to everybody else.
- Wrap up: the base cases that are interesting are the Markov chain, Naive Bayes, and the V-structure.
- In the Markov chain, if I have evidence on the middle node, this blocks the flow of information.
- In the Naive Bayes case, if I have evidence on the common parent, that blocks the flow of information between the children.
- The V-structure is the only case where it's the other way around: if I have a value for the bottom node, that allows the flow of information.