Hidden Markov Models (HMM)

According to Dan Jurafsky and James Martin’s book Speech and Language Processing, a hidden Markov model (HMM) comprises the following five components (a small code sketch of these components follows the list):

  • Q: a set of N states.
  • A: a transition probability matrix, with each element aij representing the probability of transitioning from state i to state j. The transition probabilities out of any given state i must sum to 1, so each row of A forms a proper probability distribution.
  • O: a sequence of T observations, each one drawn from a vocabulary V.
  • B: a set of observation likelihoods bi(ot), also called emission probabilities, each expressing the probability of an observation ot being generated from a state i.
  • π: an initial probability distribution over states. πi is the probability that the Markov chain will start in state i. Some states j may have πj = 0, meaning they cannot be initial states. As with the transition probabilities, the πi must sum to one to form a valid probability distribution.
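
To make these five components concrete, here is a minimal sketch of a hypothetical two-state HMM in Python/NumPy; the state names, vocabulary, and every probability below are made up purely for illustration.

    import numpy as np

    # A hypothetical two-state HMM; all names and numbers are illustrative only.
    Q = ["NOUN", "VERB"]                  # Q: the set of N states
    V = ["fish", "rivers", "swim"]        # vocabulary V that observations are drawn from

    # A: transition probabilities, A[i, j] = P(next state = Q[j] | current state = Q[i]).
    # Each row sums to 1.
    A = np.array([[0.6, 0.4],
                  [0.5, 0.5]])

    # B: emission probabilities, B[i, k] = P(observation = V[k] | state = Q[i]).
    # Each row sums to 1.
    B = np.array([[0.2, 0.4, 0.4],
                  [0.5, 0.4, 0.1]])

    # pi: initial distribution over states, pi[i] = P(first state = Q[i]); sums to 1.
    pi = np.array([0.8, 0.2])

    # O: a sequence of T observations, each drawn from V.
    O = ["swim", "fish", "swim"]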

A first-order hidden Markov model instantiates two simplifying assumptions:

  • First, as with a first-order Markov chain, the probability of a particular state depends only on the previous state (Markov Assumption): P(qi | q1 … qi−1) = P(qi | qi−1)
  • Second, the probability of an output observation oi depends only on the state qi that produced the observation, and not on any other states or any other observations (Output Independence): P(oi | q1 … qT, o1 … oT) = P(oi | qi)

An HMM has two probability matrices, A and B:

  • Matrix A contains the tag transition probabilities P(ti | ti−1), which represent the probability of a tag occurring given the previous tag. We compute the maximum likelihood estimate of this transition probability by counting, out of the times we see the first tag in a labeled corpus, how often the first tag is followed by the second: P(ti | ti−1) = C(ti−1, ti) / C(ti−1)
    • This matrix will have dimensions N × N, where N is the number of tags.
  • Matrix B contains the emission probabilities P(wi | ti), which represent the probability, given a tag, that it will be associated with a given word. The MLE of the emission probability is P(wi | ti) = C(ti, wi) / C(ti). A counting sketch of both estimates follows this list.
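
Here is a minimal counting sketch of both maximum likelihood estimates, assuming a tiny hand-labeled toy corpus; the sentences, tags, and helper functions below are invented for illustration.

    from collections import defaultdict

    # Hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
    corpus = [
        [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
        [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    ]

    transition_counts = defaultdict(int)   # C(ti-1, ti)
    emission_counts = defaultdict(int)     # C(ti, wi)
    tag_counts = defaultdict(int)          # C(ti)

    for sentence in corpus:
        prev_tag = "<s>"                   # sentence-start pseudo-tag
        tag_counts[prev_tag] += 1
        for word, tag in sentence:
            transition_counts[(prev_tag, tag)] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev_tag = tag

    def transition_prob(prev_tag, tag):
        # P(ti | ti-1) = C(ti-1, ti) / C(ti-1)
        return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

    def emission_prob(word, tag):
        # P(wi | ti) = C(ti, wi) / C(ti)
        return emission_counts[(tag, word)] / tag_counts[tag]

    print(transition_prob("DET", "NOUN"))  # 1.0 in this toy corpus
    print(emission_prob("dog", "NOUN"))    # 0.5 in this toy corpus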

The goal of HMM decoding:

  • Given an HMM λ = (A, B) and a sequence of observations O, find the most probable sequence of states Q: Q̂ = argmax over Q of P(Q | O)
  • The way we would do this in the context of an HMM is to use Bayes’ rule: Q̂ = argmax over Q of P(O | Q) P(Q) / P(O)
  • We can simplify a bit by dropping the denominator P(O), since it is the same for every candidate state sequence: Q̂ = argmax over Q of P(O | Q) P(Q)

HMM taggers make two further simplifying assumptions:

  • The first is that the probability of a word appearing depends only on its own tag and is independent of neighboring words and tags: P(w1 … wn | t1 … tn) ≈ ∏ over i of P(wi | ti)
  • The second, the bigram assumption, is that the probability of a tag depends only on the previous tag rather than on the entire tag sequence: P(t1 … tn) ≈ ∏ over i of P(ti | ti−1)

With our two simplifying assumptions, the equation for the most probable tag sequence simplifies to t̂1 … t̂n = argmax over t1 … tn of ∏ over i of P(wi | ti) P(ti | ti−1), where:

  • The P(wi | ti) terms correspond to our emission probability matrix B.
  • The P(ti | ti−1) terms correspond to our transition probability matrix A. (A brute-force sketch of evaluating this argmax follows these bullets.)
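
To make the objective concrete, the following sketch scores every candidate tag sequence with the product of P(wi | ti) P(ti | ti−1) and keeps the best one. It is a deliberately naive brute-force search over made-up toy tags, words, and probabilities (none of them from the book); the Viterbi algorithm described next computes the same argmax efficiently with dynamic programming.

    from itertools import product

    # Hypothetical model parameters, for illustration only.
    tags = ["DET", "NOUN", "VERB"]
    trans = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,      # P(ti | ti-1)
             ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.7}
    emit = {("DET", "the"): 0.6, ("NOUN", "dog"): 0.05,      # P(wi | ti)
            ("VERB", "barks"): 0.01}

    def score(words, tag_seq):
        """Product of P(wi | ti) * P(ti | ti-1) over the sentence."""
        p, prev = 1.0, "<s>"
        for w, t in zip(words, tag_seq):
            p *= emit.get((t, w), 0.0) * trans.get((prev, t), 0.0)
            prev = t
        return p

    words = ["the", "dog", "barks"]
    # Enumerate all |tags|**n candidate sequences -- exponential, hence Viterbi.
    best = max(product(tags, repeat=len(words)), key=lambda seq: score(words, seq))
    print(best)  # ('DET', 'NOUN', 'VERB') with these toy numbers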

The Viterbi decoding algorithm:

  • First, set up a probability matrix (or lattice), with one column for each observation ot and one row for each state in the state graph.
  • Each cell of the trellis, vt(j), represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q1,…,qt−1, given the HMM λ.
  • The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell.
  • Formally, each cell expresses the probability vt(j) = max over q1,…,qt−1 of P(q1 … qt−1, o1 … ot, qt = j | λ).
  • We represent the most probable path by taking the maximum over all possible previous state sequences. Since the probability of being in every state at time t−1 has already been computed, each cell can be filled recursively as vt(j) = max over i = 1…N of vt−1(i) · aij · bj(ot). (A code sketch of the full algorithm follows this list.)
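
Putting the pieces together, here is a compact sketch of this procedure in Python/NumPy, reusing the made-up two-state parameters from the earlier component sketch. It works with raw probabilities for readability; a real implementation would usually use log probabilities to avoid underflow.

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Return the most probable state sequence for a sequence of observation indices.

        pi[i]   : initial probability of state i
        A[i, j] : transition probability from state i to state j
        B[i, k] : probability that state i emits observation symbol k
        obs     : list of observation symbol indices o1 ... oT
        """
        N, T = A.shape[0], len(obs)
        v = np.zeros((N, T))          # v[j, t] = best path probability ending in state j at time t
        backptr = np.zeros((N, T), dtype=int)

        # Initialization: v1(j) = pi_j * bj(o1)
        v[:, 0] = pi * B[:, obs[0]]

        # Recursion: vt(j) = max over i of vt-1(i) * aij * bj(ot)
        for t in range(1, T):
            for j in range(N):
                scores = v[:, t - 1] * A[:, j] * B[j, obs[t]]
                backptr[j, t] = np.argmax(scores)
                v[j, t] = np.max(scores)

        # Termination: pick the best final state, then follow the backpointers.
        path = [int(np.argmax(v[:, T - 1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[path[-1], t]))
        return list(reversed(path))

    # Toy usage with the made-up parameters from the earlier sketch.
    pi = np.array([0.8, 0.2])
    A = np.array([[0.6, 0.4], [0.5, 0.5]])
    B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
    print(viterbi([2, 0, 2], pi, A, B))   # [0, 1, 0] with these numbers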