Read in (observation token, state tag) sequence data. This can be in the form of a dataframe, a list of tuples in Python, a named vector/named list in R, etc. The main idea is that we have two parallel vectors of the same length, each made up of individual sequences.
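For concreteness, here is a minimal Python sketch of one way to read such data, assuming a hypothetical CoNLL-style text file ("tagged_sequences.tsv" is an illustrative path) with one tab-separated (token, tag) pair per line and a blank line between sequences; the file name and format are assumptions, not requirements.

```python
def read_sequences(path):
    """Read a list of sequences; each sequence is a list of (token, tag) tuples."""
    sequences = []
    current = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line marks the end of a sequence
                if current:
                    sequences.append(current)
                    current = []
                continue
            token, tag = line.split("\t")
            current.append((token, tag))
    if current:                          # keep the last sequence if the file lacks a trailing blank line
        sequences.append(current)
    return sequences

# sequences = read_sequences("tagged_sequences.tsv")
# sequences[0] -> [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")]
```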
If the sequences do not explicitly contain start and end observation tokens/state tags, insert them at the beginning and end of each sequence. This way, we will not be counting state transitions that occur across two sequences; we want to ensure that the only (state tag, state tag) bigrams we count happen within one distinct sequence.
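A small sketch of this padding step, assuming the list-of-tuples representation from above; the "<s>"/"</s>" token and "<START>"/"<STOP>" tag names are arbitrary choices.

```python
# Boundary markers as (token, tag) pairs; the names are illustrative.
START, STOP = ("<s>", "<START>"), ("</s>", "<STOP>")

def add_boundaries(sequences):
    """Pad every sequence with a start and stop marker."""
    return [[START] + seq + [STOP] for seq in sequences]

# padded = add_boundaries(sequences)
# padded[0][0]  -> ("<s>", "<START>")
# padded[0][-1] -> ("</s>", "<STOP>")
```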
Split the sequence data into a training set and a validation or test set. This way, we can train the hidden Markov model on the training data and evaluate its performance on the test set.
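One simple way to do this split, shown here with an illustrative 80/20 ratio and a fixed random seed so the split is reproducible:

```python
import random

def train_test_split(sequences, test_fraction=0.2, seed=42):
    """Randomly split whole sequences into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = sequences[:]            # copy so the original order is untouched
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# train_seqs, test_seqs = train_test_split(padded)
```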
Define a threshold probability for out-of-vocabulary, or OOV, observation tokens. In other words, any observations that occur rarely will all be treated as a single observation, and that observation will have an occurrence probability equal to the sum of the individual occurrence probabilities of all observations falling below the threshold we set. In some cases, we may wish to treat this threshold as a hyperparameter to tune performance, or we may rely on a heuristic such as always keeping the observations that account for the top 95 percent of observation frequencies as our vocabulary and treating any observations falling below that cutoff as out-of-vocabulary.
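As a sketch, one common variant uses a raw count cutoff on the training data rather than a probability or percentile threshold; the minimum count of 2 and the "<UNK>" token name below are illustrative assumptions.

```python
from collections import Counter

def build_vocab(sequences, min_count=2):
    """Keep only tokens that appear at least min_count times in training."""
    counts = Counter(tok for seq in sequences for tok, _ in seq)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_oov(sequences, vocab, unk="<UNK>"):
    """Map every token outside the vocabulary to a single OOV token."""
    return [[(tok if tok in vocab else unk, tag) for tok, tag in seq]
            for seq in sequences]

# vocab = build_vocab(train_seqs)
# train_seqs = replace_oov(train_seqs, vocab)
# test_seqs  = replace_oov(test_seqs, vocab)   # unseen test tokens collapse to "<UNK>" too
```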
Create a transition probability matrix, representing the probability of transitioning from one state tag to the next. This matrix will have dimensions n x n, where n = number of state tags.
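A sketch of estimating this matrix by maximum likelihood from tag bigram counts, assuming NumPy is available and no smoothing is applied:

```python
import numpy as np

def transition_matrix(sequences):
    """Estimate P(next tag | current tag) from tag bigram counts."""
    tags = sorted({tag for seq in sequences for _, tag in seq})
    idx = {t: i for i, t in enumerate(tags)}
    counts = np.zeros((len(tags), len(tags)))
    for seq in sequences:
        for (_, prev_tag), (_, next_tag) in zip(seq, seq[1:]):
            counts[idx[prev_tag], idx[next_tag]] += 1
    # Normalize each row so that row i gives P(next tag | current tag i).
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return probs, tags

# A, tag_list = transition_matrix(train_seqs)   # A has shape (n, n)
```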
Create an emission probability matrix, representing the probability of an observation token given a state tag. This matrix will have dimensionality m x n, where m = number of observation tokens and n again represents the number of state tags.
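A companion sketch for the emission matrix, again using unsmoothed maximum-likelihood estimates, with rows indexed by observation tokens and columns by state tags to match the m x n shape described above:

```python
import numpy as np

def emission_matrix(sequences, tag_list):
    """Estimate P(observation token | state tag) from (token, tag) counts."""
    tokens = sorted({tok for seq in sequences for tok, _ in seq})
    tok_idx = {t: i for i, t in enumerate(tokens)}
    tag_idx = {t: i for i, t in enumerate(tag_list)}
    counts = np.zeros((len(tokens), len(tag_list)))
    for seq in sequences:
        for tok, tag in seq:
            counts[tok_idx[tok], tag_idx[tag]] += 1
    # Normalize each column so that column j gives P(token | tag j).
    col_sums = counts.sum(axis=0, keepdims=True)
    probs = np.divide(counts, col_sums, out=np.zeros_like(counts), where=col_sums > 0)
    return probs, tokens

# B, token_list = emission_matrix(train_seqs, tag_list)   # B has shape (m, n)
```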
Create an initial probabilities matrix, representing the probability of starting a sequence in a given state tag. This matrix will have dimensionality n x 1, where n again represents the number of state tags.
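A sketch of estimating the initial distribution as the fraction of training sequences that begin in each state tag (note that if boundary tags were inserted in the earlier step, every sequence begins in the start tag, so this estimate matters most when boundaries are not used):

```python
import numpy as np

def initial_probabilities(sequences, tag_list):
    """Estimate P(first state tag) from the first tag of each training sequence."""
    tag_idx = {t: i for i, t in enumerate(tag_list)}
    counts = np.zeros(len(tag_list))
    for seq in sequences:
        first_tag = seq[0][1]
        counts[tag_idx[first_tag]] += 1
    return counts / counts.sum()

# pi = initial_probabilities(train_seqs, tag_list)   # length-n vector
```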
For larger models where we are modeling lengthy sequences, we may wish to convert all three probability matrices to log10 probabilities to avoid the problem of numerical underflow. Because we multiply so many probabilities during the Viterbi decoding algorithm, the final probability of a sequence may become so small that our model loses the ability to correctly identify the appropriate hidden state tag after a certain number of elements in a sequence. Working in log10 space allows us to add the log10 probabilities at each time step instead of multiplying, so we will never have a problem where the final probability of the most probable sequence is a smaller number than what we can store on our machines (think of a decimal number with several hundred zeros after the decimal point - this is too small to represent as a standard floating point number). If we still wish to examine the probability, we can raise 10 to the power of the log10 probability to recover the actual probability between 0 and 1, but for the most part we will only be interested in the Viterbi hidden state sequence.
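A brief sketch of the conversion, mapping zero probabilities to negative infinity so that impossible transitions or emissions remain impossible when log probabilities are added during decoding:

```python
import numpy as np

def to_log10(probs):
    """Convert a probability array to log10 space; zeros become -inf."""
    with np.errstate(divide="ignore"):       # suppress the warning from log10(0)
        return np.log10(probs)

# log_A, log_B, log_pi = to_log10(A), to_log10(B), to_log10(pi)
# During decoding we add, e.g., log_A[i, j] + log_B[obs, j] instead of
# multiplying A[i, j] * B[obs, j]; to recover an actual probability,
# raise 10 to the power of the log10 value: 10 ** log_prob.
```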