maximtrp / mchmm

Markov Chains and Hidden Markov Models in Python

Home Page: https://mchmm.readthedocs.io

License: GNU General Public License v3.0

Topics: python, markov-chain, markov-model, hidden-markov-model, simulation, probability, hmm, hmm-viterbi-algorithm, baum-welch-algorithm

mchmm's Introduction

mchmm


mchmm is a Python package implementing Markov chains and Hidden Markov models in pure NumPy and SciPy. It can also visualize Markov chains (see below).

Donate

If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.

Buy Me A Coffee

Dependencies

  • NumPy
  • SciPy
  • graphviz (for drawing transition graphs)

Installation

  1. Install from PyPI:
pip install mchmm
  2. Or install from source by cloning the GitHub repository:
git clone https://github.com/maximtrp/mchmm.git
cd mchmm
pip install . --user

Features

Discrete Markov chains

Initializing a Markov chain using some data.

import mchmm as mc
a = mc.MarkovChain().from_data('AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC')

Now, we can look at the observed transition frequency matrix:

a.observed_matrix
# array([[ 7., 18.,  7.],
#        [19.,  5., 29.],
#        [ 5., 30.,  3.]])

And the observed transition probability matrix:

a.observed_p_matrix
# array([[0.21875   , 0.5625    , 0.21875   ],
#        [0.35849057, 0.09433962, 0.54716981],
#        [0.13157895, 0.78947368, 0.07894737]])
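
The probability matrix is just the frequency matrix normalized by its row sums, which is easy to verify directly:

a.observed_matrix / a.observed_matrix.sum(axis=1, keepdims=True)
# same values as a.observed_p_matrix above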

You can visualize your Markov chain. First, build a directed graph with the graph_make() method of a MarkovChain object, then render() it.

graph = a.graph_make(
    format="png",
    graph_attr=[("rankdir", "LR")],
    node_attr=[("fontname", "Roboto bold"), ("fontsize", "20")],
    edge_attr=[("fontname", "Iosevka"), ("fontsize", "12")]
)
graph.render()

Here is the result:

Markov Chain

Pandas can help us annotate columns and rows:

import pandas as pd
pd.DataFrame(a.observed_matrix, index=a.states, columns=a.states, dtype=int)
#      A   B   C
#  A   7  18   7
#  B  19   5  29
#  C   5  30   3

Viewing the expected transition frequency matrix:

a.expected_matrix
# array([[ 8.06504065, 13.78861789, 10.14634146],
#        [13.35772358, 22.83739837, 16.80487805],
#        [ 9.57723577, 16.37398374, 12.04878049]])
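
The expected frequencies follow the usual chi-squared independence layout: the outer product of the row and column totals of the observed matrix, divided by the grand total. A quick cross-check:

row_totals = a.observed_matrix.sum(axis=1, keepdims=True)
col_totals = a.observed_matrix.sum(axis=0, keepdims=True)
row_totals * col_totals / a.observed_matrix.sum()
# reproduces a.expected_matrix above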

Calculating the Nth order transition probability matrix:

a.n_order_matrix(a.observed_p_matrix, order=2)
# array([[0.2782854 , 0.34881028, 0.37290432],
#        [0.1842357 , 0.64252707, 0.17323722],
#        [0.32218957, 0.21081868, 0.46699175]])
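
The Nth order matrix is the matrix power of the one-step transition probability matrix, so the same result can be obtained with NumPy (a minimal cross-check on the chain above):

import numpy as np
np.linalg.matrix_power(a.observed_p_matrix, 2)
# same values as a.n_order_matrix(a.observed_p_matrix, order=2)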

Carrying out a chi-squared test:

a.chisquare(a.observed_matrix, a.expected_matrix, axis=None)
# Power_divergenceResult(statistic=47.89038802624337, pvalue=1.0367838347591701e-07)
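
The returned Power_divergenceResult is the result type produced by scipy.stats.chisquare, so an equivalent direct call (a sketch, assuming the method is a thin wrapper around SciPy) would be:

import scipy.stats as ss
ss.chisquare(a.observed_matrix, a.expected_matrix, axis=None)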

Finally, let's simulate a Markov chain given our data.

import numpy as np

ids, states = a.simulate(10, start='A', seed=np.random.randint(0, 10, 10))
ids
# array([0, 2, 1, 0, 2, 1, 0, 2, 1, 0])

states
# array(['A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A'], dtype='<U1')

"".join(states)
# 'ACBACBACBA'

Hidden Markov models

We will use a fragment of a DNA sequence containing a TATA box as an example. Initializing a hidden Markov model with sequences of observations and states:

import mchmm as mc
obs_seq = 'AGACTGCATATATAAGGGGCAGGCTG'
sts_seq = '00000000111111100000000000'
a = mc.HiddenMarkovModel().from_seq(obs_seq, sts_seq)

Unique states and observations are automatically inferred:

a.states
# ['0' '1']

a.observations
# ['A' 'C' 'G' 'T']

The transition probability matrix for all states can be accessed using the tp attribute:

a.tp
# [[0.94444444 0.05555556]
#  [0.14285714 0.85714286]]

There is also an ep attribute holding the emission probability matrix for all states and observations.

a.ep
# [[0.21052632 0.21052632 0.47368421 0.10526316]
#  [0.57142857 0.         0.         0.42857143]]
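
Each row of both matrices sums to one, which makes for a quick sanity check (assuming tp and ep are NumPy arrays):

a.tp.sum(axis=1), a.ep.sum(axis=1)
# (array([1., 1.]), array([1., 1.]))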

Converting the emission matrix to a pandas DataFrame:

import pandas as pd
pd.DataFrame(a.ep, index=a.states, columns=a.observations)
#            A         C         G         T
#  0  0.210526  0.210526  0.473684  0.105263
#  1  0.571429  0.000000  0.000000  0.428571

Directed graph of the hidden Markov model:

Hidden Markov Model

The graph can be built with the graph_make() method of a HiddenMarkovModel object and then rendered:

graph = a.graph_make(
    format="png", graph_attr=[("rankdir", "LR"), ("ranksep", "1"), ("rank", "same")]
)
graph.render()

Viterbi algorithm

Running the Viterbi algorithm on new observations:

new_obs = "GGCATTGGGCTATAAGAGGAGCTTG"
vs, vsi = a.viterbi(new_obs)
# states sequence
print("VI", "".join(vs))
# observations
print("NO", new_obs)
# VI 0000000001111100000000000
# NO GGCATTGGGCTATAAGAGGAGCTTG

Baum-Welch algorithm

Using the Baum-Welch algorithm to infer the parameters of a hidden Markov model:

import mchmm as mc
obs_seq = 'AGACTGCATATATAAGGGGCAGGCTG'
a = mc.HiddenMarkovModel().from_baum_welch(obs_seq, states=['0', '1'])

# training log: KL divergence values for all iterations
a.log
# {
#     'tp': [0.008646969455670256, 0.0012397829805491124, 0.0003950986109761759],
#     'ep': [0.09078874423746826, 0.0022734816599056084, 0.0010118204023946836],
#     'pi': [0.009030829793043593, 0.016658391248503462, 0.0038894983546756065]
# }
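
Since the log stores one KL divergence value per iteration for each parameter group, convergence can be inspected visually. A minimal sketch using matplotlib (which is not a dependency of mchmm):

import matplotlib.pyplot as plt

for name, values in a.log.items():
    plt.plot(values, label=name)  # tp, ep and pi convergence curves
plt.xlabel("iteration")
plt.ylabel("KL divergence")
plt.legend()
plt.show()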

The inferred transition (tp) and emission (ep) probability matrices and the initial state distribution (pi) can be accessed as attributes:

a.ep, a.tp, a.pi
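
As with the model fitted from sequences, these matrices can be wrapped in pandas DataFrames for readability (a sketch, assuming the Baum-Welch model exposes the same states and observations attributes shown earlier):

import pandas as pd

pd.DataFrame(a.ep, index=a.states, columns=a.observations)
pd.DataFrame(a.tp, index=a.states, columns=a.states)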

This model can be decoded using the Viterbi algorithm:

new_obs = "GGCATTGGGCTATAAGAGGAGCTTG"
vs, vsi = a.viterbi(new_obs)
print("VI", "".join(vs))
print("NO", new_obs)
# VI 0011100001111100000001100
# NO GGCATTGGGCTATAAGAGGAGCTTG

mchmm's Issues

simulate function seed parameter

Dear Maximtrp,

Thank you for this great offering! I am really finding good use for your module. I have noticed some strange behavior with the seed parameter in your simulate function. When I set seed=100 or 1000 or 10**9 or anything at all (a.simulate(500, start='a', seed=100)), I get lots of repetition between states, for example: ['a', 'a', 'a', 'a', 'a', 'a', ...]

However, when the parameter is not included at all (a.simulate(500, start='h')), I get better behavior: ['h' 'l' 'h' 'h' 'l' 'd' 'a' 'e' 'e' 'a' 'e' 'e' 'b' 'g' 'h' 'h' 'h' 'i' 'b' 'h' 'i' 'g' 'l' 'i' 'd']. The states switch between values and I get something much more useful than a string of the same character.

Just letting you know...

Thanks again for some great code :)

Open questions on the usage of mchmm

Hello, I have three questions on the use of mchmm.

I have previously used the hmmlearn library, and I was wondering if it is possible to:

a) Set a starting set of initial probabilities for the transition matrix states. For example:

start_probs = np.array([0.5, 0.5])

b) Use a different type of HMM (Poisson vs. MixtureModel, for example).

In hmmlearn you can set a model type (e.g. hmm.MultinomialHMM, hmm.Poisson). Is it possible to define different model types?

c) Output the probabilities for each predicted state.

In hmmlearn you can output the probabilities of the predictions from the Viterbi function. Is there a method to perform this task in mchmm?

Thanks
Jonathan

Attribute Error - 'NoneType' object has no attribute 'ep'

I'd like to run mc.HiddenMarkovModel() from known probabilities; however, when I try to access the arguments I gave the method, I get AttributeError: 'NoneType' object has no attribute 'states'. I tried using a list of lists or a NumPy array as the input for the ep, tp, states and obs parameters. Maybe I am not running it in a proper way; the code is below:

Importing the necessary modules:

import numpy as np
import mchmm as mc
import pandas as pd

Definition of the matrices and states used for the model:

transition_mx = [[0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001],
                 [0.994, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001],
                 [0.001, 0.994, 0.001, 0.001, 0.001, 0.001, 0.001],
                 [0.001, 0.001, 0.597, 0.398, 0.001, 0.001, 0.001],
                 [0.001, 0.001, 0.398, 0.597, 0.001, 0.001, 0.001],
                 [0.001, 0.001, 0.001, 0.001, 0.994, 0.001, 0.001],
                 [0.001, 0.001, 0.001, 0.001, 0.001, 0.994, 0.001]]

Base A is the first column; next are T, C, G:

emission_mx = [[0.799, 0.199, 0.001, 0.001],
               [0.001, 0.001, 0.799, 0.199],
               [0.799, 0.001, 0.199, 0.001], 
               [0.100, 0.200, 0.400, 0.200],
               [0.997, 0.001, 0.001, 0.001], 
               [0.001, 0.799, 0.001, 0.199],
               [0.001, 0.001, 0.199, 0.799]]

states = [0,1]
obs = ['A', 'T', 'C', 'G']

Creating arrays from the lists of lists:

emission_mx_a = np.array(emission_mx)
transition_mx_a = np.array(transition_mx)
states_a = np.array(states)                          
obs_a =  np.array(obs)

Creating a model from probabilities:

a = mc.HiddenMarkovModel().from_prob(transition_mx_a, emission_mx_a, obs_a, states_a)
pd.DataFrame(a.ep, index=a.states, columns=a.observations)

At this step I am getting AttributeError: 'NoneType' object has no attribute 'ep'.

visualization of Markov chain

Thanks for making mchmm. It is a great package.

As a newcomer to Markov chains, I find it very easy to use. The only question I have is about visualization. In the description, there is a very nice image, mc.png, showing the transitions between three states (A, B, C). I am wondering how to generate such a diagram using mchmm's transition matrix and states.

Any pointers will be appreciated!
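
For a chain built with from_data(), the graph_make() and render() calls shown above produce exactly that diagram. Starting from a raw transition probability matrix and a list of state labels instead, a similar figure can be drawn with the graphviz package directly (a rough sketch independent of mchmm's own API; the matrix values here are illustrative):

import numpy as np
from graphviz import Digraph

p = np.array([[0.2, 0.6, 0.2],
              [0.4, 0.1, 0.5],
              [0.1, 0.8, 0.1]])  # example transition probabilities
states = ["A", "B", "C"]

g = Digraph(format="png", graph_attr={"rankdir": "LR"})
for s in states:
    g.node(s)
for i, src in enumerate(states):
    for j, dst in enumerate(states):
        if p[i, j] > 0:
            g.edge(src, dst, label="{:.2f}".format(p[i, j]))
g.render("mc")  # writes the DOT source and mc.png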

generating a markov simulation starting with a specific sequence

Hi there. Not sure if this is the right place to ask, but here is my problem. I am trying to generate a Markov simulation using a specific sequence as the start. Your code is faster than what I have done, but I am not quite sure if I am using it correctly, and I don't know the role of the Viterbi and Baum-Welch algorithms in the context of Markov chains.

To illustrate, I will follow some of the code in your examples.

import mchmm as mc
data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCB'
a = mc.MarkovChain().from_data(data)

I want a Markov simulation based on an order-3 transition matrix, starting with the last 3 characters in the sequence above ("ABC"):

start_sequence = data[-3:]
tfm3 = a.n_order_matrix(a.observed_p_matrix, order=3)
ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)

This returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2552615/2308700615.py in <module>
----> 1 ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)

~/anaconda3/lib/python3.8/site-packages/mchmm/_mc.py in simulate(self, n, tf, states, start, ret, seed)
    304             _start = np.random.randint(0, len(states))
    305         elif isinstance(start, str):
--> 306             _start = np.argwhere(states == start).item()
    307 
    308         # simulated sequence init

ValueError: can only convert an array of size 1 to a Python scalar

I was expecting to get a sequence of 10 characters starting with the string 'ABC' (data[-3:]), since I want to constrain the Markov simulation to start with the probabilities implied by that specific sequence, following a Markov chain of order 3. Can you help with this? I will also post this on StackOverflow, since maybe one of the users of your code has had a similar question. Thanks.
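
As the traceback suggests, simulate() looks the start value up as a single state label, so a multi-character string such as 'ABC' cannot be matched against the states array. A minimal workaround under that reading (a sketch, not an official solution) is to start from the last character only and keep the default one-step matrix; note that n_order_matrix() still describes transitions between the same single-letter states, only over three steps:

start_state = data[-1]  # the last single character of the sequence
ids, states = a.simulate(n=10, start=start_state)
"".join(states)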
