Python Examples

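Here we take the raw text files, which contain the observed tokens and their hidden-state part-of-speech (POS) tags, and add <START> and <EOS> markers to demarcate the boundaries between sentences (sequences). With the markers in place, we can recognize and exclude bigrams that span two sentences, such as "<EOS> -> <START>", which would occur very frequently but make no sense in building our model. The script then splits the tagged sentences into a training set (roughly the first 70% of sentences) and a test set (the remainder):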


import os
import pandas as pd

# Setting relevant file paths (Mac OS X).
# expanduser turns the leading '~' into the home directory; open() does not do this itself.
homedir = os.path.expanduser('~/Projects/hmm')
datadir = os.path.join(homedir, 'data/')
WSJ_train = os.path.join(datadir, 'WSJ-train.txt')
WSJ_test = os.path.join(datadir, 'WSJ-test.txt')

observation_state_list = []
sentence_count = 0
lowercase = True

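with open(WSJ_train) as infile:
    for line in infile:
        # Token lines have three space-separated fields; only the first two
        # (the token and its POS tag) are used here.
        fields = line.strip('\n').split(' ')
        if len(fields) == 3:
            observation = fields[0].lower() if lowercase else fields[0]
            state = fields[1]
            observation_state_list.append((observation, state))
        else:
            # Any other line (e.g. a blank line) is treated as a sentence boundary.
            sentence_count += 1
            observation_state_list.append(('<EOS>', '<EOS>'))
            observation_state_list.append(('<START>', '<START>'))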

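# Open the first sentence, and drop the dangling <START> appended after the
# final sentence boundary (this assumes the file ends with a boundary line).
observation_state_list.insert(0, ('<START>', '<START>'))
observation_state_list.pop()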

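# Name the columns so the saved CSV files get a 'token,tag' header.
os_df = pd.DataFrame(observation_state_list, columns=['token', 'tag'])
# Flag each <START> row with 1 (all other rows become NaN), then take a running
# total so that every <START> row carries its 1-based sentence number.
os_df['start'] = os_df['token'].map({'<START>' : 1})
os_df['cumsum'] = os_df['start'].cumsum()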

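# Sentence number at which the test set begins, 70% of the way through the corpus.
# The sentence numbered `cutoff` itself becomes the first test sentence.
cutoff = int(round(os_df['cumsum'].max()*0.7))

# Row index of that sentence's <START> marker:
test_start = os_df[os_df['cumsum'] == cutoff].index.values[0]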

# Now we can get rid of the cumsum and start columns:
os_df.drop(['start','cumsum'], axis=1, inplace=True)

train_df = os_df.iloc[:test_start,:]
test_df = os_df.iloc[test_start:,:]

train_df.to_csv(os.path.join(datadir, 'wsj_train.csv'), index=False)
test_df.to_csv(os.path.join(datadir, 'wsj_test.csv'), index=False)
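
To see what the cumulative-sum split does, here is a small sketch on made-up data. The ten toy sentences and the names toy_df, sentence_id, toy_train and toy_test are illustrative assumptions, not part of the WSJ data. The sketch uses a boolean comparison in place of map, which should yield the same sentence numbers on the <START> rows:

```python
import pandas as pd

# Ten made-up one-word sentences, each wrapped in <START> ... <EOS>.
pairs = []
for i in range(10):
    pairs += [('<START>', '<START>'), (f'word{i}', 'NN'), ('<EOS>', '<EOS>')]
toy_df = pd.DataFrame(pairs, columns=['token', 'tag'])

# Running count of <START> markers: every row gets the 1-based number
# of the sentence it belongs to.
sentence_id = (toy_df['token'] == '<START>').cumsum()

# Same cutoff rule as the script: 70% of the sentence count, rounded.
cutoff = int(round(sentence_id.max() * 0.7))

# First row of the sentence numbered `cutoff`, i.e. its <START> marker.
test_start = sentence_id[sentence_id == cutoff].index[0]

toy_train = toy_df.iloc[:test_start]   # sentences 1 .. cutoff - 1
toy_test = toy_df.iloc[test_start:]    # sentences cutoff .. last

print(cutoff, test_start, len(toy_train), len(toy_test))
```

Because the slice starts at the <START> row of sentence number cutoff, that sentence lands in the test set and the training set ends cleanly on an <EOS> row. On a corpus of thousands of sentences the one-sentence difference from an exact 70/30 split is negligible.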

After running the Python script above, we can use pd.read_csv to read the saved wsj_train.csv and wsj_test.csv files, which look something like this:


         token     tag
 1:    <START> <START>
 2:       most     JJS
 3:    banking      NN
 4:     issues     NNS
 5:      <OOV>   <OOV>
 6:      after      IN
 7:          a      DT
 8:     sector      NN
 9:  downgrade      NN
10:         by      IN
11:      <OOV>   <OOV>
12: securities     NNP
13:          ,       ,
14:   although      IN
15:   national     NNP
16:      <OOV>   <OOV>
17:     showed     VBD
18:   strength      NN
19:         on      IN
20:   positive      JJ
21:   comments     NNS
22:       from      IN
23:  brokerage      NN
24:      firms     NNS
25:      about      IN
26:        its    PRP$
27:  long-term      JJ
28:  prospects     NNS
29:          .       .
30:      <EOS>   <EOS>
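
As a rough illustration of why the boundary markers matter, the sketch below counts tag bigrams while skipping any pair that starts at <EOS>. The short tag list and the helper name count_tag_bigrams are hypothetical; in practice the tags would come from the tag column of the saved training file.

```python
from collections import Counter

# Hypothetical miniature tag sequence covering two sentences.
tags = ['<START>', 'JJS', 'NN', 'NNS', '<EOS>',
        '<START>', 'DT', 'NN', '<EOS>']

def count_tag_bigrams(tags):
    """Count adjacent tag pairs, skipping pairs that cross a sentence boundary."""
    counts = Counter()
    for prev, curr in zip(tags, tags[1:]):
        if prev == '<EOS>':
            # '<EOS>' -> '<START>' joins two different sentences, so ignore it.
            continue
        counts[(prev, curr)] += 1
    return counts

bigram_counts = count_tag_bigrams(tags)
for pair, n in bigram_counts.items():
    print(pair, n)
```

Pairs such as ('<START>', 'JJS') and ('NN', '<EOS>') are kept, since they describe how sentences tend to begin and end, while the frequent but meaningless ('<EOS>', '<START>') pair never enters the counts.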