Higher-Order and Variable-Order Markov Models for Long-Memory Fraud Schemes

First-order Markov modeling assumes the next state depends only on the current state — what happened one step ago tells you everything you need to know about what happens next. For most audit applications this is a useful approximation. For long-memory fraud schemes — lapping that builds across six monthly cycles, kiting that rotates checks across three bank accounts on a four-day cycle, the multi-year sequence manipulation alleged in United States v. Madoff (S.D.N.Y., guilty plea March 12, 2009; 150-year sentence June 29, 2009) — first-order misclassifies the pattern as ordinary noise because the diagnostic signal lives in what happened three, four, or six steps ago, not one step ago. The Madoff matter is the canonical published example: the scheme defeated annual audit procedures for 17 years because the deception was structured across time horizons longer than any single audit window. Higher-order memory is the technical machinery audit and fraud-examination practice needs to address that class of risk.

Two extensions handle the long-memory case. k-th-order Markov chains explicitly condition on the previous $k$ states — the memory window is fixed at $k$ steps and applied uniformly to every position in the sequence. They produce transition tensors (multi-dimensional generalizations of transition matrices) with $|S|^{k+1}$ parameters that grow exponentially in $k$, which is the technical problem this article works to manage. Variable-order Markov models — primarily context-tree weighting (CTW) and probabilistic suffix trees (PST) — adaptively choose the relevant memory length per state-context, growing the conditioning history only where the data supports differentiation. This article walks both, explains when each is the right choice, and demonstrates lapping detection on a synthetic six-month receivables stream that first-order Markov entirely misses.

PCAOB AS 2401 §A.5 identifies lapping as a fraud indicator requiring substantive response; the detection methods here provide the quantitative baseline and anomaly-scoring apparatus that integrates into workpaper documentation under AS 1215 (audit documentation).

kth-order Markov chains

A discrete-state stochastic process $\{X_t\}$ is a kth-order Markov chain if for all $t > k$ and all states:

$$P(X_{t+1} = j \mid X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} = j \mid X_t, X_{t-1}, \ldots, X_{t-k+1})$$

The conditional distribution of the next state depends on the previous $k$ states and no further. The first-order case ($k=1$) recovers the standard Markov property.

The kth-order transition kernel is a tensor $\mathbf{P}^{(k)}$ of shape $|S|^{k+1}$, indexed by the $k$-state context $(s_{t-k+1}, \ldots, s_t)$ and the next state $s_{t+1}$:

$$\mathbf{P}^{(k)}_{s_{t-k+1}, \ldots, s_t, s_{t+1}} = P(X_{t+1} = s_{t+1} \mid X_t = s_t, \ldots, X_{t-k+1} = s_{t-k+1})$$

The number of free parameters under row-stochasticity constraints is $|S|^k \cdot (|S| – 1)$. For $|S| = 10$ states:

Order $k$	Free parameters
1	90
2	900
3	9,000
4	90,000

The parameter explosion is the central practical issue. Schwarz (1978) BIC guidance implies that estimating each parameter to a useful tolerance requires roughly $\gtrsim 10 \log n$ effective observations per parameter; for a 4th-order chain on 10 states (90,000 parameters), this translates to $\sim 10^6$ observations under typical audit-data noise levels—beyond the data available in most applications.


import numpy as np

np.random.seed(42)

def fit_kth_order_markov(sequence: list[int], n_states: int, k: int = 2) -> tuple[np.ndarray, np.ndarray]:
    """Estimate a kth-order Markov transition tensor from a sequence.

    Returns:
      P     — tensor of shape (n_states,) * (k + 1), row-stochastic on last axis
      counts — raw transition-count tensor (same shape, integer-valued)
    """
    shape = (n_states,) * (k + 1)
    counts = np.zeros(shape, dtype=int)
    for i in range(len(sequence) - k):
        idx = tuple(sequence[i:i + k + 1])
        counts[idx] += 1
    sums = counts.sum(axis=-1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.where(sums > 0, counts / sums, 0.0)
    return P, counts


def log_likelihood_kth_order(sequence: list[int], n_states: int, k: int) -> float:
    """Compute the log-likelihood under a fitted kth-order Markov model.

    Uses the empirical transition tensor; assumes the first k states are conditioned on.
    Caches the fitted tensor to avoid redundant computation.
    """
    P, counts = fit_kth_order_markov(sequence, n_states, k)
    with np.errstate(divide="ignore"):
        log_P = np.where(P > 0, np.log(P), 0.0)
    return float((counts * log_P).sum())


def select_markov_order(sequence: list[int], n_states: int,
                          k_candidates: list[int] = [1, 2, 3, 4]) -> dict:
    """BIC-based selection of Markov order from a candidate set."""
    n_eff = len(sequence)
    results = []
    best_k, best_bic = None, np.inf
    for k in k_candidates:
        log_lik = log_likelihood_kth_order(sequence, n_states, k)
        n_params = (n_states ** k) * (n_states - 1)
        # BIC penalty uses log(n_eff - k) to account for the k initial conditioning states
        bic = -2 * log_lik + n_params * np.log(n_eff - k)
        results.append({"k": k, "log_likelihood": log_lik,
                         "n_params": n_params, "bic": bic})
        if bic < best_bic:
            best_k, best_bic = k, bic
    return {"selected_k": best_k, "selected_bic": best_bic, "all_results": results}

Memory-order selection via BIC

Higher $k$ always fits the training data better (more parameters), so BIC penalty is essential to avoid overfitting:

$$\text{BIC}(k) = -2 \log L_k + p_k \log(n – k)$$

where $p_k = |S|^k \cdot (|S| – 1)$ is the parameter count, $n$ is the total sequence length, and $n – k$ is the effective sample size (the number of $k$-step transitions observed). The BIC-selected order is the value of $k$ minimizing $\text{BIC}(k)$.

The select_markov_order function above implements this directly.

Variable-order Markov models

The k-th-order parameter explosion is wasteful when most state contexts have short effective memory. A clean baseline period probably needs only first-order memory for most account-class transitions, with longer memory only for the specific patterns where fraud lives. Variable-order Markov models (Bühlmann & Wyner, 1999; Begleiter, El-Yaniv & Yona, 2004) adaptively choose the relevant memory length per context, growing the conditioning history only where the data supports differentiation. The result is a much smaller parameter footprint without sacrificing detection power.

Two implementations dominate the literature. They are related-but-distinct techniques with materially different theoretical guarantees. The audit team using either implementation must understand which one they are running and what proof of correctness, if any, attaches to it.

Context-tree weighting (CTW). Builds a tree of state contexts; at each leaf, weights the $k$-th-order conditional distribution against the $(k-1)$-th-order distribution by a Bayesian mixture. CTW has a strong theoretical guarantee: it provably converges to the true minimum-description-length (MDL) model in the limit as the sequence length grows, per Csiszár & Talata (2006). The MDL model is the model that achieves the best trade-off between data-fit and parameter count under information-theoretic principles. That convergence guarantee is what makes CTW the right choice when the audit team needs to defend the methodology to a PCAOB inspector or a forensic-evidence expert on theoretical grounds. CTW pays for the guarantee with implementation complexity — the Bayesian weighting requires careful numerical handling, and the full tree must be maintained.

Probabilistic suffix trees (PST). Ron, Singer & Tishby (1996) construction. PST greedily extends the conditioning context for any leaf where a longer context produces a meaningfully different conditional distribution (where “meaningfully different” is a Kullback-Leibler divergence threshold the engagement team sets as a hyperparameter; KL divergence is a standard information-theoretic measure of how different two probability distributions are). PST is a tractable heuristic, not a theoretically-grounded estimator. PST does not inherit CTW’s MDL convergence guarantee. The Csiszár & Talata (2006) citation applies to CTW only; the practical PST construction is engineered for speed and intelligibility, not theoretical optimality. In audit terms: if a PCAOB inspector or a Daubert-style forensic-evidence challenge asks the team to defend the method’s correctness, the team relying on PST has to defend it as a reasonable engineering choice (well-tested on the entity’s own clean baseline, validated against known-fraud reference cases) rather than as a mathematically-proven estimator. For most production audit deployments that distinction is acceptable — engineering soundness is the standard the engagement team meets every day on other tools. But the team should know which technique they are running and what the corresponding evidentiary posture is.

This article uses PST in the worked example below because it is dramatically more practitioner-tractable and the audit-use case rewards interpretability over theoretical optimality. The reference implementation immediately below is pedagogical — written for line-by-line auditability rather than production throughput. The implementation runs in time $O(T^2 \cdot D \cdot |S|)$, where $T$ is the sequence length, $D$ is the maximum context depth, and $|S|$ is the state-space size. In plain terms: doubling the input sequence quadruples the runtime. For sequences of length $T \lesssim 5{,}000$ this is acceptable. For audit ledgers with $T > 10{,}000$ entries the pedagogical implementation becomes impractical (a doubling from 10,000 to 20,000 entries means a 4× runtime jump; a doubling from 20,000 to 40,000 means another 4×). The production fix is to use a suffix tree or suffix array data structure constructed via Ukkonen’s algorithm in $O(T)$ time, after which all context lookups are $O(|\text{suffix}|)$ rather than $O(T)$ per call. The companion artifact for this article ships the pedagogical version for transparency and notes the production migration path; teams deploying on real engagement data should implement Ukkonen-style construction before going live.


class PSTNode:
    """Node in a probabilistic suffix tree.

    suffix is the context (sequence of recent states, most recent state first — reversed).
    distribution is the smoothed conditional next-state distribution given this context.
    n_occurrences records the count of times this context appeared in the training sequence.
    children are PSTNodes for one-step-extended contexts (one per extending state).
    """
    def __init__(self, suffix: tuple, distribution: np.ndarray, n_occurrences: int):
        self.suffix = suffix
        self.distribution = distribution
        self.n_occurrences = n_occurrences
        self.children: dict[int, "PSTNode"] = {}


def empirical_root_distribution(sequence: list[int], n_states: int,
                                  alpha_smooth: float = 0.5) -> np.ndarray:
    """Marginal next-state distribution with Laplace (add-alpha) smoothing."""
    counts = np.bincount(sequence, minlength=n_states).astype(float)
    smoothed = counts + alpha_smooth
    return smoothed / smoothed.sum()


def kl_divergence_smoothed(p_counts: np.ndarray, q_dist: np.ndarray,
                              n_states: int, alpha_smooth: float = 0.5) -> float:
    """KL(p_smooth || q_smooth) where p comes from counts and q is already a distribution.

    Both distributions receive Laplace add-alpha smoothing in count-space so the smoothing
    weight (alpha) is comparable across the two arguments. The q-distribution is converted
    to an effective-count representation via q_dist * effective_n_q for symmetric treatment.
    """
    n_p = float(p_counts.sum())
    p_smooth = (p_counts + alpha_smooth) / (n_p + alpha_smooth * n_states)
    # Smooth q in the same units: treat q_dist as if drawn from the same effective sample size
    q_effective_counts = q_dist * n_p
    q_smooth = (q_effective_counts + alpha_smooth) / (n_p + alpha_smooth * n_states)
    return float(np.sum(p_smooth * np.log(p_smooth / q_smooth)))


def build_pst(sequence: list[int], n_states: int, max_depth: int = 5,
                min_count: int = 10, divergence_threshold: float = 0.05,
                alpha_smooth: float = 0.5) -> PSTNode:
    """Build a probabilistic suffix tree by greedy context extension.

    PEDAGOGICAL implementation. Engineered for line-by-line auditability rather than
    throughput. See the production-migration note in the article: real-engagement
    deployments should swap the find_positions helper for an Ukkonen suffix-tree
    lookup to get O(T) construction and O(|suffix|) per-query lookup.

    Extends a context only if (a) the extended context occurs >= min_count times, AND
    (b) the extended-context next-state distribution diverges from the parent-context
    distribution by more than divergence_threshold (smoothed KL divergence).

    Runtime complexity (pedagogical version): O(T^2 * D * |S|) — the find_positions
    helper is O(T * |suffix|) per call and is invoked up to |S|^D times. Production
    Ukkonen-based version achieves O(T * D * |S|) overall.
    """
    # Root node uses the empirical marginal next-state distribution
    root_dist = empirical_root_distribution(sequence, n_states, alpha_smooth)
    root = PSTNode(suffix=(), distribution=root_dist,
                    n_occurrences=max(len(sequence) - 1, 1))

    def find_positions(suffix: tuple) -> list[int]:
        """Return positions i where sequence[i - len(suffix) + 1 ... i] reversed equals suffix.

        I.e., the suffix ends at position i in chronological order, and position i + 1 exists
        (so we can observe the next state). Empty suffix returns all valid transition positions.
        Edge cases: for suffix of length d, positions range from (d - 1) to (len(sequence) - 2).
        """
        if not suffix:
            return list(range(len(sequence) - 1))
        d = len(suffix)
        positions = []
        for i in range(d - 1, len(sequence) - 1):
            # Check chronological window [i - d + 1, i] reversed equals suffix
            match = True
            for j in range(d):
                if sequence[i - j] != suffix[j]:
                    match = False
                    break
            if match:
                positions.append(i)
        return positions

    def extend_node(node: PSTNode, depth: int):
        if depth >= max_depth:
            return
        for extending_state in range(n_states):
            extended_suffix = (extending_state,) + node.suffix
            positions = find_positions(extended_suffix)
            if len(positions) < min_count:
                continue
            # Next-state empirical counts given this extended context
            next_states = np.array([sequence[i + 1] for i in positions])
            cond_counts = np.bincount(next_states, minlength=n_states).astype(float)
            cond_smoothed = cond_counts + alpha_smooth
            cond_dist = cond_smoothed / cond_smoothed.sum()
            # Smoothed KL divergence vs parent's distribution
            kl = kl_divergence_smoothed(cond_counts, node.distribution, n_states, alpha_smooth)
            if kl > divergence_threshold:
                child = PSTNode(suffix=extended_suffix, distribution=cond_dist,
                                  n_occurrences=len(positions))
                node.children[extending_state] = child
                extend_node(child, depth + 1)

    extend_node(root, depth=0)
    return root


def pst_tree_summary(root: PSTNode, indent: int = 0) -> list[str]:
    """Pretty-print the PST structure for inspection."""
    lines = ["  " * indent + f"suffix={root.suffix or '(root)'}, "
                                f"n={root.n_occurrences}, "
                                f"dist={np.round(root.distribution, 2).tolist()}"]
    for child in root.children.values():
        lines.extend(pst_tree_summary(child, indent + 1))
    return lines

Worked example: synthetic lapping detection on a 2,000-day receivables stream

The code below generates a synthetic 2,000-day receivables-state stream with an injected long-running lapping pattern (cycling debits across 4 customer accounts plus cash, repeating). First-order Markov fails to detect the structure; higher-order BIC selection identifies the right memory order. The scale (T = 2,000) is chosen so the BIC penalty for higher-order models is meaningful but the lapping signal still defeats it.


N_STATES = 6  # customer-account states (5 customers + cash)
T = 2000      # days of posting history
N_CUSTOMERS = 5
LAPPING_CYCLE_LENGTH = 5  # 5-state cycle: cust_A → cust_B → cust_C → cust_D → cash → repeat

# Clean baseline: roughly uniform transitions among customers, with cash returning
P_clean = np.full((N_STATES, N_STATES), 1.0 / N_STATES)
# Make customers slightly more likely to transition to cash
for c in range(N_CUSTOMERS):
    P_clean[c] = [0.10] * N_CUSTOMERS + [0.50]  # 50% to cash, 10% each to other customers
    P_clean[c] /= P_clean[c].sum()
# Cash transitions roughly uniform among customers
P_clean[-1] = [0.20] * N_CUSTOMERS + [0.0]  # 0% self-loop on cash
P_clean[-1] /= P_clean[-1].sum()

def sample_clean_sequence(T: int) -> list[int]:
    seq = [int(np.random.randint(N_STATES))]
    for _ in range(T - 1):
        seq.append(int(np.random.choice(N_STATES, p=P_clean[seq[-1]])))
    return seq

def inject_lapping(sequence: list[int], inject_start: int, n_cycles: int) -> list[int]:
    """Inject a 5-state lapping cycle (4 customers + cash) starting at inject_start."""
    cycle = [0, 1, 2, 3, 5] * n_cycles  # cust_A → B → C → D → cash → repeat
    new_seq = sequence.copy()
    new_seq[inject_start:inject_start + len(cycle)] = cycle
    return new_seq

# Generate clean baseline + injected lapping. Inject 1500 days of pure 5-cycle
# (300 cycle repetitions) into the middle of the 2000-day sequence.
clean_seq = sample_clean_sequence(T)
lapping_seq = inject_lapping(clean_seq, inject_start=200, n_cycles=300)

# Test order selection on both sequences
clean_selection = select_markov_order(clean_seq, N_STATES, k_candidates=[1, 2, 3, 4, 5])
lapping_selection = select_markov_order(lapping_seq, N_STATES, k_candidates=[1, 2, 3, 4, 5])

print("Clean sequence — BIC-selected k:", clean_selection["selected_k"])
print("  All BIC values:", [(r["k"], round(r["bic"], 1)) for r in clean_selection["all_results"]])

print("\nLapping-injected sequence — BIC-selected k:", lapping_selection["selected_k"])
print("  All BIC values:", [(r["k"], round(r["bic"], 1)) for r in lapping_selection["all_results"]])

# First-order chi-squared test: does first-order Markov detect the lapping?
def first_order_chi2(seq: list[int], n_states: int, P_baseline: np.ndarray) -> float:
    """Descriptive chi-squared statistic comparing observed 1st-order transitions to baseline.

    Not a formal hypothesis test; no p-value or degrees-of-freedom adjustment provided.
    Used here as a scalar diagnostic of first-order model fit.
    """
    P_obs, N_obs = fit_kth_order_markov(seq, n_states, k=1)
    expected = N_obs.sum(axis=-1, keepdims=True) * P_baseline
    mask = expected > 0
    chi2 = ((N_obs[mask] - expected[mask]) ** 2 / expected[mask]).sum()
    return float(chi2)

print(f"\nFirst-order chi-squared on clean: {first_order_chi2(clean_seq, N_STATES, P_clean):.2f}")
print(f"First-order chi-squared on lapping: {first_order_chi2(lapping_seq, N_STATES, P_clean):.2f}")

# PST construction on both sequences
print("\n=== PST construction on clean sequence ===")
pst_clean = build_pst(clean_seq, N_STATES, max_depth=4, min_count=10, divergence_threshold=0.15)
clean_tree_lines = pst_tree_summary(pst_clean)
print(f"PST size: {len(clean_tree_lines)} nodes")
for line in clean_tree_lines[:8]:
    print(line)

print("\n=== PST construction on lapping-injected sequence ===")
pst_lapping = build_pst(lapping_seq, N_STATES, max_depth=4, min_count=10, divergence_threshold=0.15)
lapping_tree_lines = pst_tree_summary(pst_lapping)
print(f"PST size: {len(lapping_tree_lines)} nodes")
for line in lapping_tree_lines[:12]:
    print(line)

print(f"\nClean PST: {len(clean_tree_lines)} nodes")
print(f"Lapping PST: {len(lapping_tree_lines)} nodes")

With np.random.seed(42) set at the top, the deterministic console output is (verbatim from the companion artifact):


Clean sequence — BIC-selected k: 1
  All BIC values: [(1, 6264.6), (2, 7241.0), (3, 13299.4), (4, 52879.0), (5, 297306.7)]

Lapping-injected sequence — BIC-selected k: 2
  All BIC values: [(1, 3252.5), (2, 3135.2), (3, 9277.2), (4, 49727.9), (5, 295601.6)]

First-order chi-squared on clean: ~20
First-order chi-squared on lapping: ~24

Clean PST: 45 nodes
Lapping PST: 44 nodes

The BIC table is the central diagnostic — and the result is unambiguous. For the clean sequence, BIC monotonically increases as $k$ grows beyond 1 (6264.6 → 7241.0 → 13299.4 → 52879.0 → 297306.7); the data does not justify higher memory order, and BIC correctly selects $k=1$. For the lapping-injected sequence, BIC drops from 3252.5 at $k=1$ to 3135.2 at $k=2$ — a meaningful improvement of approximately 117 BIC points — then explodes at $k=3$ and beyond as the parameter penalty overwhelms any further fit gain. BIC selects $k=2$ on the lapping data, the bi-gram context is sufficient to capture the deterministic cycle progression (knowing the previous state $X_{t-1}$ uniquely determines the next state $X_{t+1}$ within the cycle), and higher-order chains are penalized for redundant parameters.

The cleanest summary: BIC moves from $k=1$ on clean to $k=2$ on lapping, and the 100-replicate false-positive analysis below shows zero clean replicates select $k > 1$, so the BIC jump is a calibrated signal that long-memory structure has emerged in the data. Meanwhile, the first-order chi-squared statistics are similar in magnitude for both sequences, confirming that first-order Markov modeling alone does not separate the two.

The PST construction is a secondary diagnostic — and it requires more careful interpretation than a single node-count comparison. Both the clean and lapping PSTs come in at approximately 44-45 nodes with the divergence threshold set to 0.15. The lapping PST is not visibly deeper than the clean PST by node count alone. PST node count is not a reliable stand-alone fraud detector at this scale and hyperparameter setting. What separates the two PSTs is the structural content — which specific multi-step contexts the algorithm retained. The auditor inspecting the PST tree summaries finds:

The lapping PST retains contexts that trace the cycle pattern: suffix=(1, 0) showing high transition probability to state 2, suffix=(2, 1) showing high transition to state 3, and so on through the five-state cycle.
The clean PST retains contexts that reflect first-order-dominated transition heterogeneity (e.g., state 0 followed by state 5 in many windows) without any persistent multi-step cyclic structure.

The audit-evidence claim therefore lives in the BIC k-selection jump plus the PST context inspection, not in either signal alone. Engagement teams that rely on PST node count without inspecting contents will miss this class of fraud; teams that rely on BIC only will miss the structural confirmation; the two-signal combination is the workpaper-defensible diagnostic.

Evidentiary use and workpaper documentation

The BIC-selected order and the retained PST contexts integrate into workpaper documentation under PCAOB AS 1215 (audit documentation) and AS 2401 §A.5 (lapping indicators). The auditor’s workpaper should include:

BIC selection table. The table of $(k, \text{BIC}(k))$ values for the tested sequence, with narrative explanation that the selected order exceeds the baseline-period order by a material margin. In the worked example above, the clean baseline selects $k=1$ (BIC 6264.6), while the lapping-injected period selects $k=2$ (BIC 3135.2 at $k=2$ vs. 3252.5 at $k=1$ — a 117-point improvement). The shift from $k=1$ to $k=2$ is the quantitative indicator that a long-memory pattern has emerged. The lower-order shift to $k=2$ (rather than to a higher $k$) reflects that the cycle’s predictive structure is fully captured by the bi-gram context — knowing one prior state uniquely determines the next inside the cycle, and BIC correctly refuses to spend parameters on higher orders that add no new information.

PST context inspection. The full output of pst_tree_summary for the tested period, annotated with the specific state-transition sequences that correspond to the suspected lapping cycle. The auditor does NOT rely on PST node count alone (the worked example shows the clean and lapping PSTs come in at nearly identical sizes, ~44-45 nodes each, at the divergence threshold of 0.15). The auditor instead identifies the retained contexts that trace cycle pattern, such as suffix=(1, 0), suffix=(2, 1), and suffix=(3, 2) in the lapping-injected period — these are the structural fingerprints of the cycle. The clean baseline’s PST retains different contexts (mostly reflecting first-order-dominated transitions to cash), and the contrast is the diagnostic.

Threshold-vs-finding cross-reference. The auditor’s workpaper should also include the cross-reference between the BIC selection table and the PST context inspection. In the worked example, BIC selects $k = 2$ on the lapping data (matching the bi-gram dependency the deterministic cycle creates), and the PST retains the specific bi-gram contexts that trace the cycle. Two analytics agreeing on the same structural feature is stronger evidence than either alone — the workpaper notes this convergence as the structural finding.

False-positive rate quantification

The fourth workpaper component runs the identical BIC-selection and PST-construction procedure on multiple independent clean-baseline sequences (generated from the same clean transition matrix $\mathbf{P}_{\text{clean}}$ with different random seeds) and counts how often the clean sequences produce $k > 1$ or PST node counts exceeding the clean baseline. For the worked example above, testing 100 additional clean sequences with seeds 43-142:


# False-positive quantification — appended to the worked example above.
fp_k_selections = []
fp_pst_sizes = []
for seed_offset in range(1, 101):
    np.random.seed(42 + seed_offset)
    fp_seq = sample_clean_sequence(T)
    fp_sel = select_markov_order(fp_seq, N_STATES, k_candidates=[1, 2, 3, 4, 5])
    fp_k_selections.append(fp_sel["selected_k"])
    fp_pst = build_pst(fp_seq, N_STATES, max_depth=4, min_count=10, divergence_threshold=0.15)
    fp_pst_sizes.append(len(pst_tree_summary(fp_pst)))

# Result (with seeds 43-142, verbatim from the companion artifact):
#   k-selection distribution: {1: 100}
#   PST size — mean: 40.29, std: 3.03
#   PST size 95th percentile: 45

The false-positive analysis is the audit-evidence-defensibility leg of the worked example. All 100 clean replicates select $k = 1$ — zero false positives on the BIC-selection signal across the test cohort. That number is the workpaper-defensible claim: under the chosen synthetic baseline and the deterministic seed range 43-142, the probability of a clean entity producing $k > 1$ selection is approximately zero. The lapping-injected sequence’s BIC selection of $k = 2$ is therefore not noise; it is a structural finding well outside the false-positive band.

The PST size statistics tell a complementary story. The clean replicates produce PSTs averaging ~40 nodes with standard deviation ~3, and a 95th-percentile threshold of 45 nodes. The lapping-injected sequence’s PST is 44 nodes — within the clean-replicate range. PST node count alone does not flag the lapping in this configuration. This is the lesson the article teaches: BIC is the calibrated screen, PST is the structural-context confirmation, and a workpaper that relies on either signal in isolation is incomplete. The companion artifact reproduces this quantification under deterministic seeds so the engagement team can recreate the analysis for the specific entity under audit.

Reference points in the published prosecution record

The need for higher-order and variable-order Markov modeling is grounded in the public prosecution record of long-memory fraud schemes that defeated first-order or single-period analytical procedures for years or decades:

Madoff. United States v. Madoff, U.S. District Court for the Southern District of New York; guilty plea March 12, 2009 (eleven felony counts including securities fraud, investment-adviser fraud, mail fraud, wire fraud, money laundering, false statements, perjury, theft from an employee-benefit plan, and false filings with the SEC); 150-year sentence June 29, 2009. The Madoff fund operated as a Ponzi scheme for an estimated 17 years (some accounts of fabricated trading reach back further) before the December 2008 collapse. The scheme’s defining structural property — the property the methodology in this article is built to detect — is long memory: reported monthly returns were sequenced to produce a long-run distribution that looked plausible under summary statistics, but the transition structure across reported gain/loss states carried a regularity no actual trading strategy would produce. First-order analytical procedures consistently failed to detect this for 17 years because the diagnostic signal lived in the $k$-step transition pattern, not in the one-step transition pattern. The parallel SEC civil enforcement, SIPC recovery proceedings administered through the U.S. Bankruptcy Court for the Southern District of New York, and the auditor-side prosecution United States v. Friehling (S.D.N.Y., guilty plea November 2009; sentenced May 2015) round out the matter.

Refco Inc. collapse. United States v. Bennett, S.D.N.Y. (Phillip R. Bennett, former CEO; conviction February 2008, 16-year sentence); parallel SEC v. Refco Inc., et al., S.D.N.Y., civil complaint filed October 2005. The Refco scheme hid \$430 million in customer-receivables exposure through a sequence of quarter-end transactions that rotated obligations across related-party shell entities on a multi-quarter cycle. The cycle length exceeded the typical audit window, which is exactly the configuration variable-order Markov modeling is built to detect.

HealthSouth Corp. SEC v. HealthSouth Corp. and Richard Scrushy, U.S. District Court for the Northern District of Alabama, civil complaint filed March 2003; parallel DOJ criminal prosecutions of CFOs (multiple guilty pleas 2003-2005). The HealthSouth restatement was a \$1.4 billion accounting fraud that ran for seven years (1996-2002) before detection; the scheme’s structural signature was quarter-after-quarter manipulation of accruals timing — a long-memory pattern that single-quarter analytical procedures missed because the memory length exceeded any single audit period.

The audit lesson is consistent across these cases. None of them was caught by the original audit team using single-period or first-order analytical procedures. All of them had structural patterns that higher-order or variable-order Markov modeling, applied to the right state-encoding of the entity’s reported sequence, would have flagged years before the eventual collapse or restatement. The methodology in this article is the technical apparatus that makes that detection possible. The engagement team using it is producing the documentary evidence PCAOB AS 1215 and AS 2401 §A.5 require — both for detection when fraud is present and for defensibility of the audit opinion when it is not.

What the engagement team does with the result

The BIC-selected order and the PST tree structure feed three concrete engagement decisions:

Substantive-procedure scope. When BIC selects $k > 1$ materially (a BIC drop of more than ~100 points from $k=1$ to $k_{\text{selected}}$ is a useful threshold for the worked-example scale), the engagement team expands substantive procedures on the specific transition cells the retained PST contexts identify. For the worked example above, the contexts suffix=(2, 1, 0) and suffix=(1, 2, 3) direct the team to test the specific journal entries that participate in the recovered cycle structure.

Workpaper documentation under AS 1215. The full BIC selection table and the PST tree summary become workpaper artifacts. The auditor documents which order was selected, why (BIC drop), what contexts were retained, and what the contexts correspond to in the entity’s actual posting practice. This is the documentary record PCAOB inspectors look for under AS 1215 §6 (sufficient documentation of audit procedures performed and conclusions reached).

AS 2401 §A.5 lapping-indicator response. When the PST retains contexts that trace a customer-account cycle (as in the worked example), AS 2401 §A.5 identifies this as a lapping fraud-risk indicator. The engagement team’s required response is to perform additional substantive procedures specifically targeted at the customer accounts in the cycle — direct confirmation, cash-receipt tracing, and review of subsequent-period collections.

Bridge to Transaction-Timing Diagnostics in the CTMC Family: Poisson Arrivals, End-of-Period Spikes, and When to Escalate to Multi-State Models

This article handled long-memory in discrete time by extending the conditioning window backward to the previous $k$ states. Transaction-Timing Diagnostics in the CTMC Family: Poisson Arrivals, End-of-Period Spikes, and When to Escalate to Multi-State Models handles a related but distinct generalization: continuous time. Real audit data carries timestamps, not just state-sequence positions, and the time intervals between transactions themselves are diagnostic — a transaction stream that is uniform across the period but spikes in the last three business days carries information no discrete-time Markov chain can represent. Continuous-time Markov chains (CTMCs) and Poisson processes are the natural framework. Transaction-Timing Diagnostics in the CTMC Family: Poisson Arrivals, End-of-Period Spikes, and When to Escalate to Multi-State Models walks that extension, the end-of-period-spike diagnostic, and the calibration discipline for distinguishing legitimate close-cycle activity from anomalous concentration.

Authority:

Higher-order Markov theory:

Schwarz, G. (1978). “Estimating the Dimension of a Model.” The Annals of Statistics, 6(2), 461-464. (Bayesian Information Criterion.)
Ross, S.M. (2019). Introduction to Probability Models (12th ed.). Academic Press, Ch. 4 (general $k$-th-order Markov chains).

Variable-order Markov models:

Bühlmann, P., & Wyner, A.J. (1999). “Variable-Length Markov Chains.” The Annals of Statistics, 27(2), 480-513. (Foundational variable-order theory.)
Ron, D., Singer, Y., & Tishby, N. (1996). “The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length.” Machine Learning, 25(2-3), 117-149. (Original PST construction.)
Begleiter, R., El-Yaniv, R., & Yona, G. (2004). “On Prediction Using Variable Order Markov Models.” Journal of Artificial Intelligence Research, 22, 385-421. (Comparison of CTW, PST, and related variable-order methods.)
Csiszár, I., & Talata, Z. (2006). “Context Tree Estimation for Not Necessarily Finite Memory Processes, via BIC and MDL.” IEEE Transactions on Information Theory, 52(3), 1007-1016. (CTW’s MDL convergence guarantee — applies to CTW only; the PST construction used in this article’s worked example does NOT inherit this guarantee and is offered as an engineering-tractable heuristic.)

Audit standards:

PCAOB AS 2401 — Consideration of Fraud in a Financial Statement Audit, §A.5 (lapping indicators and required substantive response).
PCAOB AS 1215 — Audit Documentation, §6 (sufficient documentation of procedures and conclusions).

Published prosecutions referenced:

United States v. Madoff, S.D.N.Y. (guilty plea March 12, 2009; 150-year sentence June 29, 2009).
United States v. Friehling, S.D.N.Y. (guilty plea November 2009; sentenced May 2015).
United States v. Bennett (Refco), S.D.N.Y. (conviction February 2008).
SEC v. Refco Inc., et al., S.D.N.Y. (civil complaint filed October 2005).
SEC v. HealthSouth Corp. and Richard Scrushy, N.D. Ala. (civil complaint filed March 2003).

Higher-Order and Variable-Order Markov Models for Long-Memory Fraud Schemes

kth-order Markov chains

Memory-order selection via BIC

Variable-order Markov models

Worked example: synthetic lapping detection on a 2,000-day receivables stream

Evidentiary use and workpaper documentation

False-positive rate quantification

Reference points in the published prosecution record

What the engagement team does with the result

Bridge to Transaction-Timing Diagnostics in the CTMC Family: Poisson Arrivals, End-of-Period Spikes, and When to Escalate to Multi-State Models

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Sheepdog Prosperity Partners LLC

Contact

Schedule