# Model sequences with Markov chains#

Launch this notebook:

You need striplog version 0.8.6 or later for this notebook to work as intended.

import striplog

striplog.__version__

'unknown'


## Two ways to get a Markov chain#

There are two ways to generate a Markov chain:

1. Parse one or more sequences of states. This will be turned into a transition matrix, from which a probability matrix will be computed.

2. Directly from a transition matrix, if you already have that data.

Let’s look at the transition matrix first, since that’s how Powers & Easterling presented the data in their paper on this topic.

## Powers & Easterling data#

Key reference: Powers, DW and RG Easterling (1982). Improved methodology for using embedded Markov chains to describe cyclical sediments. Journal of Sedimentary Petrology 52 (3), p 913–923.

Let’s use one of the examples in Powers & Easterling — they use this transition matrix from Gingerich, PD (1969). Markov analysis of cyclic alluvial sediments. Journal of Sedimentary Petrology, 39, p. 330-332. https://doi.org/10.1306/74D71C4E-2B21-11D7-8648000102C1865D

from striplog.markov import Markov_chain

data = [[ 0, 37,  3,  2],
[21,  0, 41, 14],
[20, 25,  0,  0],
[ 1, 14,  1,  0]]

m = Markov_chain(data, states=['A', 'B', 'C', 'D'])

m

Markov_chain(179 transitions, states=['A', 'B', 'C', 'D'], step=1, include_self=False)


Note that they do not include self-transitions in their data. So the elements being counted are simple ‘beds’ not ‘depth samples’ (say).

If you build a Markov chain using a matrix with self-transitions, these will be preserved; note include_self=True in the example here:

data = [[10, 37,  3,  2],
[21, 20, 41, 14],
[20, 25, 20,  0],
[ 1, 14,  1, 10]]

Markov_chain(data)

Markov_chain(239 transitions, states=[0, 1, 2, 3], step=1, include_self=True)


### Testing for independence#

We use the model of quasi-independence given in Powers & Easterling, as opposed to an independent model like scipy.stats.chi2_contingency(), for computing chi-squared and the expected transitions:

First, let’s look at the expected transition frequencies of the original Powers & Easterling data:

import numpy as np

np.set_printoptions(precision=3)

m.expected_counts

array([[ 0.   , 31.271,  8.171,  2.558],
[31.282,  0.   , 34.057, 10.661],
[ 8.171, 34.044,  0.   ,  2.785],
[ 2.558, 10.657,  2.785,  0.   ]])


The $$\chi^2$$ statistic shows the value for the observed ordering, along with the critical value at (by default) the 95% confidence level. If the first number is higher than the second number (ideally much higher), then we can reject the hypothesis that the ordering is quasi-independent. That is, we have shown that the ordering is non-random.

m.chi_squared()

Chi2(chi2=35.73687369691601, crit=11.070497693516351, perc=0.9999989278539752)


### Which transitions are interesting?#

The normalized difference shows which transitions are ‘interesting’. These numbers can be interpreted as standard deviations away from the model of quasi-independence. That is, transitions with large positive numbers represent passages that occur more often than might be expected. Any numbers greater than 2 are likely to be important.

m.normalized_difference

array([[ 0.   ,  1.025, -1.809, -0.349],
[-1.838,  0.   ,  1.19 ,  1.023],
[ 4.138, -1.55 ,  0.   , -1.669],
[-0.974,  1.024, -1.07 ,  0.   ]])


We can visualize this as an image:

m.plot_norm_diff()

<AxesSubplot:>


We can also interpret this matrix as a graph. The upward transitions from C to A are particularly strong in this one. Transitions from A to C happen less often than we’d expect. Those from B to D and D to B, less so.

It’s arguably easier to interpret the data when this matrix is interpreted as a directed graph:

m.plot_graph()

Please install networkx with pip install networkx.


We can look at an undirected version of the graph too. It downplays non-reciprocal relationships. I’m not sure this is useful…

m.plot_graph(directed=False)

Please install networkx with pip install networkx.


## Generating random sequences#

We can generate a random succession of beds with the same transition statistics:

''.join(m.generate_states(n=30))

'ABCADBACABCBDCABDBDBCBDBABCBDB'


Again, this will respect the absence or presence of self-transitions, e.g.:

data = [[10, 37,  3,  2],
[21, 20, 41, 14],
[20, 25, 20,  0],
[ 1, 14,  1, 10]]

x = Markov_chain(data)

''.join(map(str, x.generate_states(n=30)))

'112120133312110121000003122010'


## Parse states#

Striplog can interpret various kinds of data as sequences of states. For example, it can get the unique elements and a ‘sequence of sequences’ from:

• A simple list of states, eg [1,2,2,2,1,1]

• A string of states, eg 'ABBDDDDDCCCC'

• A list of lists of states, eg [[1,2,2,3], [2,4,2]] (NB, not same length)

• A list of strings of states, eg ['aaabb', 'aabbccc'] (NB, not same length)

• A list of state names, eg ['sst', 'mud', 'sst'] (requires optional argument)

• A list of lists of state names, eg [['SS', 'M', 'SS'], ['M', 'M', 'LS']]

The corresponding sets of unique states look like:

• [1, 2]

• ['A', 'B', 'C', 'D']

• [1, 2, 3, 4]

• ['a', 'b', 'c']

• ['mud', sst']

• ['LS', 'M', 'SS']

Let’s look at a data example…

## Data from Matt’s thesis#

These are the transitions from some measured sections in my PhD thesis. They start at the bottom, so in Log 7, we start with lithofacies 1 (offshore mudstone) and pass upwards into lithofacies 3, then back into 1, then 3, and so on.

We can instantiate a Markov_chain object from a sequence using its from_sequence() method. This expects either a sequence of ‘states’ (numbers or letters or strings representing rock types) or a sequence of sequences of states.

data = {
'log7':  [1, 3, 1, 3, 5, 1, 2, 1, 3, 1, 5, 6, 1, 2, 1, 2, 1, 2, 1, 3, 5, 6, 5, 1],
'log9':  [1, 3, 1, 5, 1, 5, 3, 1, 2, 1, 2, 1, 3, 5, 1, 5, 6, 5, 6, 1, 2, 1, 5, 6, 1],
'log11': [1, 3, 1, 2, 1, 5, 3, 1, 2, 1, 2, 1, 3, 5, 3, 5, 1, 9, 5, 5, 5, 5, 6, 1],
'log12': [1, 5, 3, 1, 2, 1, 2, 1, 2, 1, 4, 5, 6, 1, 2, 1, 4, 5, 1, 5, 5, 5, 1, 2, 1, 8, 9, 10, 9, 5, 1],
'log13': [1, 6, 1, 3, 1, 3, 5, 3, 6, 1, 6, 5, 3, 1, 5, 1, 2, 1, 4, 3, 5, 3, 4, 3, 5, 1, 5, 9, 11, 9, 1],
'log14': [1, 3, 1, 5, 8, 5, 6, 1, 3, 4, 5, 3, 1, 3, 5, 1, 7, 7, 7, 1, 7, 1, 3, 8, 5, 5, 1, 5, 9, 9, 11, 9, 1],
'log15': [1, 8, 1, 3, 5, 1, 2, 3, 6, 3, 6, 5, 2, 1, 2, 1, 8, 5, 1, 5, 9, 9, 11, 1],
'log16': [1, 8, 1, 5, 1, 5, 5, 6, 1, 3, 5, 3, 5, 5, 5, 8, 5, 1, 9, 9, 3, 1],
'log17': [1, 3, 8, 1, 8, 5, 1, 8, 9, 5, 10, 5, 8, 9, 10, 8, 5, 1, 8, 9, 1],
'log18': [1, 8, 2, 1, 2, 1, 10, 8, 9, 5, 5, 1, 2, 1, 2, 9, 5, 9, 5, 8, 5, 9, 1]
}

logs = list(data.values())

m = Markov_chain.from_sequence(logs)
m.states

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

m.observed_counts

array([[ 0, 21, 17,  3, 15,  2,  2,  8,  2,  1,  0],
[21,  0,  1,  0,  0,  0,  0,  0,  1,  0,  0],
[12,  0,  0,  2, 12,  3,  0,  2,  0,  0,  0],
[ 0,  0,  2,  0,  3,  0,  0,  0,  0,  0,  0],
[19,  1,  9,  0,  0,  9,  0,  4,  5,  1,  0],
[ 9,  0,  1,  0,  4,  0,  0,  0,  0,  0,  0],
[ 2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
[ 3,  1,  0,  0,  7,  0,  0,  0,  5,  0,  0],
[ 4,  0,  1,  0,  6,  0,  0,  0,  0,  2,  3],
[ 0,  0,  0,  0,  1,  0,  0,  2,  1,  0,  0],
[ 1,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0]])


Let’s check out the normalized difference matrix:

m.expected_counts

array([[0.000e+00, 9.734e+00, 1.362e+01, 1.972e+00, 2.329e+01, 5.709e+00,
7.807e-01, 6.576e+00, 6.576e+00, 1.572e+00, 1.175e+00],
[9.734e+00, 0.000e+00, 2.949e+00, 4.270e-01, 5.042e+00, 1.236e+00,
1.690e-01, 1.424e+00, 1.424e+00, 3.404e-01, 2.544e-01],
[1.362e+01, 2.949e+00, 0.000e+00, 5.974e-01, 7.054e+00, 1.729e+00,
2.365e-01, 1.992e+00, 1.992e+00, 4.762e-01, 3.559e-01],
[1.972e+00, 4.270e-01, 5.974e-01, 0.000e+00, 1.022e+00, 2.504e-01,
3.425e-02, 2.885e-01, 2.885e-01, 6.896e-02, 5.154e-02],
[2.329e+01, 5.042e+00, 7.054e+00, 1.021e+00, 0.000e+00, 2.957e+00,
4.044e-01, 3.406e+00, 3.406e+00, 8.143e-01, 6.086e-01],
[5.709e+00, 1.236e+00, 1.729e+00, 2.504e-01, 2.957e+00, 0.000e+00,
9.914e-02, 8.351e-01, 8.351e-01, 1.996e-01, 1.492e-01],
[7.806e-01, 1.690e-01, 2.365e-01, 3.425e-02, 4.044e-01, 9.914e-02,
0.000e+00, 1.142e-01, 1.142e-01, 2.730e-02, 2.040e-02],
[6.576e+00, 1.424e+00, 1.992e+00, 2.885e-01, 3.406e+00, 8.351e-01,
1.142e-01, 0.000e+00, 9.620e-01, 2.300e-01, 1.719e-01],
[6.576e+00, 1.424e+00, 1.992e+00, 2.885e-01, 3.406e+00, 8.351e-01,
1.142e-01, 9.620e-01, 0.000e+00, 2.300e-01, 1.719e-01],
[1.572e+00, 3.404e-01, 4.762e-01, 6.896e-02, 8.143e-01, 1.996e-01,
2.730e-02, 2.300e-01, 2.300e-01, 0.000e+00, 4.109e-02],
[1.175e+00, 2.544e-01, 3.559e-01, 5.154e-02, 6.086e-01, 1.492e-01,
2.040e-02, 1.719e-01, 1.719e-01, 4.109e-02, 0.000e+00]])


It’s hard to read this but we can change NumPy’s display options:

np.set_printoptions(suppress=True, precision=1, linewidth=120)
m.normalized_difference

array([[ 0. ,  3.6,  0.9,  0.7, -1.7, -1.6,  1.4,  0.6, -1.8, -0.5, -1.1],
[ 3.6,  0. , -1.1, -0.7, -2.2, -1.1, -0.4, -1.2, -0.4, -0.6, -0.5],
[-0.4, -1.7,  0. ,  1.8,  1.9,  1. , -0.5,  0. , -1.4, -0.7, -0.6],
[-1.4, -0.7,  1.8,  0. ,  2. , -0.5, -0.2, -0.5, -0.5, -0.3, -0.2],
[-0.9, -1.8,  0.7, -1. ,  0. ,  3.5, -0.6,  0.3,  0.9,  0.2, -0.8],
[ 1.4, -1.1, -0.6, -0.5,  0.6,  0. , -0.3, -0.9, -0.9, -0.4, -0.4],
[ 1.4, -0.4, -0.5, -0.2, -0.6, -0.3,  0. , -0.3, -0.3, -0.2, -0.1],
[-1.4, -0.4, -1.4, -0.5,  1.9, -0.9, -0.3,  0. ,  4.1, -0.5, -0.4],
[-1. , -1.2, -0.7, -0.5,  1.4, -0.9, -0.3, -1. ,  0. ,  3.7,  6.8],
[-1.3, -0.6, -0.7, -0.3,  0.2, -0.4, -0.2,  3.7,  1.6,  0. , -0.2],
[-0.2, -0.5, -0.6, -0.2, -0.8, -0.4, -0.1, -0.4,  4.4, -0.2,  0. ]])


Or use a graphical view:

m.plot_norm_diff()

<AxesSubplot:>


And the graph version. Note you can re-run this cell to rearrange the graph.

m.plot_graph(figsize=(15,15), max_size=2400, edge_labels=True, seed=13)

Please install networkx with pip install networkx.


### Experimental implementation!

Multistep Markov chains are a bit of an experiment in striplog. Please [get in touch](mailto:matt@agilescientific.com) if you have thoughts about how it should work.

## Multistep transitions#

### Step = 2#

So far we’ve just been looking at direct this-to-that transitions, i.e. only considering each previous transition. What if we use the previous-but-one?

m = Markov_chain.from_sequence(logs, step=2)
m

Markov_chain(213 transitions, states=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], step=2, include_self=False)


Note that self-transitions are ignored by default. With multi-step transitions, this results in a lot of zeros because we don’t just eliminate transitions like 1 > 1 > 1, but also 1 > 1 > 2 and 1 > 1 > 3, etc, as well as 1 > 2 > 2, 1 > 3 > 3, etc, and 2 > 1 > 1, 3 > 1 > 1, etc. (But we will keep 1 > 2 > 1.)

Now we have a 3D array of transition probabilities.

m.normalized_difference.shape

(11, 11, 11)


This is hard to inspect! Let’s just get the indices of the highest values.

If we add one to the indices, we’ll have a handy list of facies number transitions, since these are just 1 to 11. So we can interpret these as transitions with anomalously high probability.

There are 11 × 11 × 11 = 1331 transitions in this array (inluding self-transitions).

cutoff = 5  # 1.96 is 95% confidence

idx = np.where(m.normalized_difference > cutoff)
locs = np.array(list(zip(*idx)))

scores = {tuple(loc+1):score for score, loc in zip(m.normalized_difference[idx], locs)}

for (a, b, c), score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
print(f"{a:<2} -> {b:<2} -> {c:<2}   {score:.3f}")

9  -> 11 -> 9    14.602
7  -> 1  -> 7    12.685
10 -> 8  -> 9    12.685
1  -> 2  -> 1    10.109
8  -> 9  -> 10   9.441
2  -> 1  -> 2    8.542
11 -> 9  -> 1    7.825
9  -> 10 -> 8    6.676
2  -> 1  -> 4    5.864
9  -> 10 -> 9    5.851
5  -> 6  -> 1    5.372

m.chi_squared()

Chi2(chi2=1425.2383461290872, crit=112.02198574980785, perc=1.0)


Unfortunately, it’s a bit harder to draw this as a graph. Technically, it’s a hypergraph.

# This should error for now.
# m.plot_graph(figsize=(15,15), max_size=2400, edge_labels=True)


Also, the expected counts or frequencies are a bit… hard to interpret:

np.set_printoptions(suppress=True, precision=3, linewidth=120)

m.expected_counts

array([[[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
[2.624, 0.   , 1.197, ..., 0.681, 0.141, 0.16 ],
[3.484, 1.323, 0.   , ..., 0.792, 0.221, 0.184],
...,
[1.851, 0.574, 0.869, ..., 0.   , 0.16 , 0.098],
[0.411, 0.172, 0.273, ..., 0.12 , 0.   , 0.018],
[0.384, 0.15 , 0.135, ..., 0.077, 0.031, 0.   ]],

[[0.   , 0.899, 1.252, ..., 0.632, 0.16 , 0.123],
[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
[1.311, 0.399, 0.   , ..., 0.233, 0.055, 0.043],
...,
[0.638, 0.249, 0.304, ..., 0.   , 0.049, 0.031],
[0.15 , 0.043, 0.086, ..., 0.031, 0.   , 0.003],
[0.15 , 0.061, 0.046, ..., 0.028, 0.006, 0.   ]],

[[0.   , 1.215, 1.722, ..., 0.918, 0.221, 0.178],
[1.258, 0.   , 0.556, ..., 0.215, 0.058, 0.071],
[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
...,
[0.826, 0.316, 0.384, ..., 0.   , 0.068, 0.049],
[0.181, 0.068, 0.092, ..., 0.064, 0.   , 0.012],
[0.163, 0.049, 0.077, ..., 0.037, 0.009, 0.   ]],

...,

[[0.   , 0.611, 0.816, ..., 0.396, 0.12 , 0.08 ],
[0.632, 0.   , 0.319, ..., 0.15 , 0.025, 0.037],
[0.936, 0.289, 0.   , ..., 0.196, 0.046, 0.043],
...,
[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
[0.15 , 0.04 , 0.074, ..., 0.028, 0.   , 0.003],
[0.126, 0.015, 0.043, ..., 0.018, 0.009, 0.   ]],

[[0.   , 0.153, 0.166, ..., 0.101, 0.015, 0.018],
[0.187, 0.   , 0.071, ..., 0.028, 0.009, 0.009],
[0.246, 0.068, 0.   , ..., 0.052, 0.009, 0.009],
...,
[0.129, 0.058, 0.061, ..., 0.   , 0.   , 0.009],
[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
[0.025, 0.015, 0.006, ..., 0.006, 0.   , 0.   ]],

[[0.   , 0.135, 0.15 , ..., 0.086, 0.018, 0.018],
[0.117, 0.   , 0.055, ..., 0.031, 0.006, 0.006],
[0.144, 0.052, 0.   , ..., 0.04 , 0.012, 0.006],
...,
[0.061, 0.034, 0.034, ..., 0.   , 0.   , 0.003],
[0.028, 0.003, 0.012, ..., 0.009, 0.   , 0.   ],
[0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ]]])


### Step = 3#

We can actually make a model for any number of steps, but we will need commensurately more data, especially if we’re not going to include self-transitions:

m = Markov_chain.from_sequence(logs, step=3)
m

Markov_chain(193 transitions, states=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], step=3, include_self=False)


I have no idea how to visualize or interpret this thing, let me know if you do something with it!

## From Striplog() instance#

I use striplog to represent stratigraphy. Let’s make a Markov chain model from an instance of striplog!

First, we’ll make a striplog by applying a couple of cut-offs to a GR log:

from welly import Well
from striplog import Striplog, Component

w = Well.from_las("P-129_out.LAS")
gr = w.data['GR']

comps = [Component({'lithology': 'sandstone'}),
Component({'lithology': 'greywacke'}),
Component({'lithology': 'shale'}),
]

s = Striplog.from_log(gr, cutoff=[30, 90], components=comps, basis=gr.basis)
s

Striplog(672 Intervals, start=1.0668, stop=1939.1376)

s.plot()


Here’s what one bed (interval in striplog’s vocabulary) looks like:

s[0]

top1.0668
primary
 lithology greywacke
summary135.79 m of greywacke
description
data
base136.8552

It’s easy to get the ‘lithology’ attribute from all the beds:

seq = [i.primary.lithology for i in s]


This is what we need!

m = Markov_chain.from_sequence(seq, strings_are_states=True, include_self=False)

m.normalized_difference

array([[ 0.   ,  0.085,  0.023],
[ 0.121,  0.   , -0.926],
[-0.008, -0.928,  0.   ]])

m.plot_norm_diff()

<AxesSubplot:>

m.plot_graph()

Please install networkx with pip install networkx.