A set of simulation studies
A Markov model is a statistical tool for modeling systems that change over time, where the future depends only on the present state.
Common examples include:
Markov property
\[ P(X_{t+1} = j \; \vert \; X_t = i, X_{t-1} = i_{t-1}, \dots, X_0 = i_0) = P(X_{t+1} = j \; \vert \; X_t = i) \]
In ordinary regression settings, residuals measure the difference between what the model predicts and what we observe.
They are central for diagnosing:
Residuals typically rely on a clear “predicted value” for each observation.
Figure 1: Observed data
Figure 2: Observed data with fitted model
Figure 3: Observed data, fitted model, and predicted values
Figure 4: Residuals: distance between observed and fitted values
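To make the residual idea concrete, here is a minimal sketch of a one-predictor least-squares fit and its residuals. The data are hypothetical toy values, not the data shown in the figures, and the helper names `fit_line` and `residuals` are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least squares for a single predictor with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus fitted value, one residual per observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical toy data (an assumption for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
res = residuals(x, y, intercept, slope)
print(intercept, slope)
print(res)
```

Each observation has exactly one fitted value, so each gets exactly one residual; with an intercept in the model the residuals sum to zero up to rounding.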
Markov models don't deal with predicted values, but rather with probabilities.
\[ P = \begin{bmatrix} P(A \rightarrow A) & P(A \rightarrow B) \\ P(B \rightarrow A) & P(B \rightarrow B) \end{bmatrix} = \begin{bmatrix} 0.6 & 0.4 \\ 0.7 & 0.3 \end{bmatrix} \]
For state A, the model predicts a probability vector:
\[ P(X_{t+1} \; \vert \; X_t = A) = (0.6, 0.4) \]
But in reality we observe only one transition: either \(A \rightarrow A\) or \(A \rightarrow B\).
A residual like \(y - \hat{y}\) no longer makes any mathematical sense.
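The mismatch can be seen in a small simulation. This is a sketch using the two-state matrix above; the names `STATES`, `step`, and the seed are assumptions made for illustration.

```python
import random

STATES = ["A", "B"]
# Rows of the two-state transition matrix, keyed by the current state.
P = {"A": [0.6, 0.4], "B": [0.7, 0.3]}


def step(current, rng):
    """Draw the next state using only the current state (Markov property)."""
    return rng.choices(STATES, weights=P[current], k=1)[0]


rng = random.Random(1)  # fixed seed so the run is repeatable

# One observed transition out of A is a single label, not a vector.
single = step("A", rng)
print("observed next state:", single)

# Only across many transitions do the frequencies resemble (0.6, 0.4).
draws = [step("A", rng) for _ in range(10_000)]
freq_a = draws.count("A") / len(draws)
print("share of A -> A:", freq_a)
```

The model's output for state A is a probability vector, while any single observation is one category, so there is no per-observation number to subtract a prediction from.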
Figure 5: Simulation equations
From each fitted model we compute an individual-specific predicted transition matrix:
\[\begin{equation} \hat{P} = \begin{bmatrix} P(y_t=1 | y_{t-1}=1) & P(y_t=2 | y_{t-1}=1) & P(y_t=3 | y_{t-1}=1) \\ P(y_t=1 | y_{t-1}=2) & P(y_t=2 | y_{t-1}=2) & P(y_t=3 | y_{t-1}=2) \\ P(y_t=1 | y_{t-1}=3) & P(y_t=2 | y_{t-1}=3) & P(y_t=3 | y_{t-1}=3) \\ \end{bmatrix} \end{equation}\]
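The slides do not say how each fitted model produces \(\hat{P}\). As a stand-in only, the sketch below builds a three-state matrix of the same shape from a single sequence by counting transitions; this empirical estimator, the function name, and the example sequence are all assumptions, not the models used in the simulation studies.

```python
def empirical_transition_matrix(sequence, n_states=3):
    """Row-normalised transition counts; states are coded 1..n_states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev - 1][curr - 1] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A state never observed as y_{t-1} has no estimable row;
        # it is left as zeros here rather than guessed.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical sequence for one individual.
seq = [1, 1, 2, 3, 3, 1, 2, 2, 3, 1]
P_hat = empirical_transition_matrix(seq)
for row in P_hat:
    print(row)
```

Row \(i\), column \(j\) holds the estimate of \(P(y_t = j \mid y_{t-1} = i)\), matching the layout of \(\hat{P}\) above.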
We then compare \(P\) and \(\hat{P}\) via several distance metrics. Each is computed from the difference \(\mathbf{D} = P - \hat{P}\), a \(K \times K\) matrix whose \(n = K^2\) entries are collected in the vector \(\mathbf{d}\).
Frobenius Norm \(\rightarrow ||\mathbf{D}||_F = \sqrt{\sum_{j=1}^{K}\sum_{k=1}^{K} d_{j,k}^2}\)
Manhattan Distance \(\rightarrow ||\mathbf{d}||_1 = \sum_{i=1}^{n} |d_i|\)
Maximum Absolute Error \(\rightarrow \max_i |d_i|\)
Mean Absolute Error \(\rightarrow \frac{1}{n} ||\mathbf{d}||_1\)
Root Mean Squared Error \(\rightarrow \frac{1}{\sqrt{n}} ||\mathbf{D}||_F\)
Correlation Dissimilarity \(\rightarrow 1 - \rho(\text{vec}(\mathbf{P}), \text{vec}(\mathbf{\hat{P}}))\)
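The six metrics can be sketched in a few lines. This assumes the difference is taken entrywise between \(P\) and \(\hat{P}\) and that \(n\) counts the matrix entries; the function names and the example \(\hat{P}\) are hypothetical.

```python
import math


def flatten(matrix):
    """vec(): stack the matrix entries into one list."""
    return [value for row in matrix for value in row]


def pearson(a, b):
    """Pearson correlation; undefined if either input is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def distance_metrics(P, P_hat):
    """Distances between a true and a predicted transition matrix."""
    p, q = flatten(P), flatten(P_hat)
    d = [x - y for x, y in zip(p, q)]  # entries of the difference
    n = len(d)                         # number of matrix entries
    frobenius = math.sqrt(sum(v * v for v in d))
    manhattan = sum(abs(v) for v in d)
    return {
        "frobenius": frobenius,
        "manhattan": manhattan,
        "max_abs_error": max(abs(v) for v in d),
        "mean_abs_error": manhattan / n,
        "rmse": frobenius / math.sqrt(n),
        "correlation_dissimilarity": 1 - pearson(p, q),
    }


P = [[0.6, 0.4], [0.7, 0.3]]
P_hat = [[0.5, 0.5], [0.8, 0.2]]  # hypothetical fitted matrix
metrics = distance_metrics(P, P_hat)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Smaller values mean \(\hat{P}\) is closer to \(P\) for every metric here, including the correlation dissimilarity, which is zero when the two vectorised matrices are perfectly positively correlated.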
We also compare against AIC and BIC
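For reference, a minimal sketch of AIC and BIC for a first-order Markov chain, using the standard definitions \(\text{AIC} = 2k - 2\ln L\) and \(\text{BIC} = k\ln(m) - 2\ln L\). The likelihood built from transition counts, the parameter count \(k = K(K-1)\), the use of the number of transitions as the sample size \(m\), and the example counts are assumptions; the fitted models in the studies may count parameters differently.

```python
import math


def log_likelihood(counts, P_hat):
    """Sum of count * log(probability) over all observed transitions."""
    ll = 0.0
    for count_row, prob_row in zip(counts, P_hat):
        for count, prob in zip(count_row, prob_row):
            if count > 0:
                # math.log raises ValueError if prob is 0 for an
                # observed transition (likelihood would be zero).
                ll += count * math.log(prob)
    return ll


def aic(ll, k):
    return 2 * k - 2 * ll


def bic(ll, k, n_obs):
    return k * math.log(n_obs) - 2 * ll


# Hypothetical transition counts for a two-state chain.
counts = [[6, 4], [7, 3]]
P_hat = [[0.6, 0.4], [0.7, 0.3]]

n_states = len(counts)
k = n_states * (n_states - 1)            # free probabilities per matrix
n_obs = sum(sum(row) for row in counts)  # number of observed transitions

ll = log_likelihood(counts, P_hat)
print("logLik:", ll)
print("AIC:", aic(ll, k))
print("BIC:", bic(ll, k, n_obs))
```

Unlike the distance metrics, both criteria can be computed without knowing the true \(P\), which is what makes the comparison between them informative in a simulation setting.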

We have developed goodness-of-fit metrics for small sample sizes.
\[\begin{equation} \hat{P} = \begin{bmatrix} P(y_t=1 | y_{t-1}=1) & P(y_t=2 | y_{t-1}=1) & P(y_t=3 | y_{t-1}=1) \\ P(y_t=1 | y_{t-1}=2) & P(y_t=2 | y_{t-1}=2) & P(y_t=3 | y_{t-1}=2) \\ P(y_t=1 | y_{t-1}=3) & P(y_t=2 | y_{t-1}=3) & P(y_t=3 | y_{t-1}=3) \\ 0 & 0 & 0 \end{bmatrix} \end{equation}\]
These slides were built with Quarto.