The first-order Markov apparatus from First-Order Markov Modeling for Transaction-Stream Analysis in Audit worked cleanly on a 5-state synthetic dataset with 1,000 transitions and a known baseline. Production audit work breaks each of those assumptions. The chart of accounts the engagement team inherits has 200 to 2,000 individual general-ledger accounts, not five. The baseline period is contaminated by the same posting practices the auditor wants to test against. And the entity’s monthly close cycle produces a structural transition shift in the last three business days of every period that — if not modeled — generates spurious anomaly signals on every clean entity in the portfolio.

This article walks through the three production decisions that determine whether first-order Markov modeling produces a useful signal or noise on a real engagement. State-space encoding: how to compress a 2,000-account chart into a Markov-tractable partition without throwing away the diagnostic detail. Baseline selection: when to use prior-period self-baseline, peer-group baseline, or synthetic-null baseline, and how to avoid the circular-estimation trap that makes chi-squared p-values misleading. Period-end transition modeling: distinguishing the legitimate close-cycle structural shift from anomalous behavior that happens to occur during the same window.

The discussion is grounded in PCAOB AS 2401 (Consideration of Fraud in a Financial Statement Audit, §60-67 on journal-entry testing) and AS 2305 (Substantive Analytical Procedures). The international counterpart is ISA 240 §32-33. Domestic IRS examination practice is captured in IRM 4.10.1.

State-space encoding

A first-order Markov chain on $n$ states has $n^2$ transition cells and $n(n-1)$ free transition probabilities, because each row must sum to 1. Doubling $n$ roughly quadruples the parameter count and the data required to estimate each cell to a useful tolerance. With a 200-account chart and 10,000 monthly journal entries, a one-state-per-account encoding produces 40,000 cells from roughly 10,000 transition observations: most cells are empty and the rest are noisy. The same 10,000 entries against a 12-state encoding produce 144 cells with an average of ~70 observations each, enough for a chi-squared test against a baseline to become statistically meaningful.
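
The cell arithmetic above can be reproduced in a few lines. This is a minimal sketch; the helper name `cell_budget` is illustrative, and the mean it reports overstates coverage for most cells, because real posting activity concentrates in a small share of them.

```python
def cell_budget(n_states: int, n_transitions: int) -> dict:
    """Cell count, free-parameter count, and mean observations per cell."""
    n_cells = n_states ** 2
    return {
        "n_states": n_states,
        "n_cells": n_cells,
        # Each row sums to 1, so one probability per row is fixed by the rest.
        "free_parameters": n_states * (n_states - 1),
        "mean_obs_per_cell": n_transitions / n_cells,
    }


# Compare the one-state-per-account encoding with a 12-state encoding.
for n in (200, 12):
    print(cell_budget(n, 10_000))
```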

The encoding step in code. The DataFrame contract is: each journal entry is one row; required columns are posted_at (datetime, used for chronological ordering), account_id (string, the GL account number used to look up the state), and period_end_offset_days (integer, business days until period end — 0 means the period-end day, positive values are earlier in the period). State labels are strings throughout (the mapping keys); integer indexing into the transition matrix is internal to transition_matrix and never exposed in the analyst-facing API.


import pandas as pd

# Subledger-granularity encoding: 15 states for a typical mid-size entity
SUBLEDGER_MAP = {
    # Each list holds the GL account-number prefixes that map to a state label
    "cash":             ["1010", "1020", "1030"],
    "ar":               ["1100", "1110", "1120"],
    "inventory":        ["1200", "1210", "1220"],
    "fixed_assets":     ["1500", "1510", "1520", "1530"],
    "intangibles":      ["1600", "1610", "1620"],
    "ap":               ["2000", "2010", "2020"],
    "accrued":          ["2100", "2110", "2120"],
    "deferred_revenue": ["2200", "2210"],
    "debt":             ["2500", "2510", "2520"],
    "equity":           ["3000", "3010", "3020"],
    "revenue":          ["4000", "4010", "4020", "4030"],
    "cogs":             ["5000", "5010", "5020"],
    "opex":             ["6000", "6010", "6020", "6030", "6040"],
    "depreciation":     ["6500", "6510"],
    "intercompany":     ["9000", "9010", "9020"],
}

def encode_account(account_id: str, mapping: dict[str, list[str]]) -> str:
    """Return the state label for a GL account; raises if no prefix matches."""
    for state, prefixes in mapping.items():
        if any(account_id.startswith(p) for p in prefixes):
            return state
    raise KeyError(f"No state mapping for account {account_id!r}")

# Apply to a journal-entry DataFrame
def encode_je_stream(je: pd.DataFrame, mapping: dict = SUBLEDGER_MAP) -> pd.Series:
    """Return a state-label sequence in posting order."""
    return je.sort_values("posted_at")["account_id"].astype(str).map(
        lambda a: encode_account(a, mapping)
    )

The mapping is engagement-specific. The audit team customizes the prefixes during planning; once locked, the same mapping applies across every period and every test.

Three encoding granularities cover the practical range.

Top-level (5-7 states). Asset / liability / equity / revenue / expense / contra / suspense. This is the coarsest defensible partition. It produces a transition matrix dominated by the diagonal (debits and credits within the same statement section) and a small number of off-diagonal cells. Useful for very small entities or for cross-entity comparison where chart-of-accounts heterogeneity makes finer encodings non-comparable. Loses most of the diagnostic detail — period-end fraud schemes typically involve specific account-class transitions (e.g., revenue → AR → cash) that the top-level encoding aggregates into “income statement → balance sheet” at meaningless coarseness.

Subledger (12-20 states). Cash, AR, inventory, fixed assets, intangibles, AP, accrued liabilities, deferred revenue, debt, equity, revenue, COGS, operating expense, depreciation, OCI, intercompany, suspense. This is the practical sweet spot for entities with 200-2,000 accounts. Fine enough to localize anomalies to a specific subledger; coarse enough that each cell receives enough observations for chi-squared validity. PCAOB AS 2305 §10 references the kind of “sufficiently precise” expectation development the subledger granularity supports.

Process-aware (25-40 states). Order-to-cash subdivided into customer-billing / collections / cash-application / write-offs; procure-to-pay subdivided into vendor-onboarding / PO-issuance / receipt / invoice-match / payment; record-to-report subdivided into accruals / reclassifications / consolidation entries / true-ups. Useful for large entities with mature ERP environments and for engagements specifically scoped to process-cycle testing. Requires more observations per period to keep the cell-population diagnostic out of the small-sample warning zone.

The encoding choice should be made before any baseline is fitted. Switching encodings mid-engagement contaminates the comparability of period-over-period results.
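
One way to keep granularities comparable is to define the coarser partition as a roll-up of the finer one, so the GL-account mapping is touched only once and both encodings can be fixed during planning. The sketch below assumes the 15 subledger state labels used in this article; the `TOP_LEVEL_ROLLUP` table and the `coarsen` helper are hypothetical, and treating intercompany as its own top-level section is an illustrative choice.

```python
# Hypothetical roll-up from the 15 subledger states to 6 top-level sections.
# Keeping intercompany as its own section is an assumption for illustration.
TOP_LEVEL_ROLLUP = {
    "cash": "asset", "ar": "asset", "inventory": "asset",
    "fixed_assets": "asset", "intangibles": "asset",
    "ap": "liability", "accrued": "liability",
    "deferred_revenue": "liability", "debt": "liability",
    "equity": "equity",
    "revenue": "revenue",
    "cogs": "expense", "opex": "expense", "depreciation": "expense",
    "intercompany": "intercompany",
}


def coarsen(sequence: list[str], rollup: dict[str, str]) -> list[str]:
    """Map a fine-grained state sequence onto a coarser partition.

    Raises KeyError on an unmapped state, mirroring encode_account.
    """
    return [rollup[s] for s in sequence]


fine = ["revenue", "ar", "cash", "ap", "opex"]
print(coarsen(fine, TOP_LEVEL_ROLLUP))
```

Because the coarse sequence is derived from the fine one, a top-level matrix for cross-entity comparison and a subledger matrix for localization are guaranteed to describe the same entries.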

Baseline selection

The chi-squared test from First-Order Markov Modeling for Transaction-Stream Analysis in Audit compares an observed transition matrix against a baseline. The baseline choice determines what the test actually tests. Three options exist; each has a known failure mode.

Prior-period self-baseline. Use the entity’s own posting pattern from a prior period (typically the comparable period in the prior fiscal year) as $P^{(0)}$. Advantages: maximally comparable on entity-specific characteristics; captures the entity’s actual chart-of-accounts utilization. Failure mode: the circular-estimation trap — if the prior period contained the same fraud the auditor is testing for, the baseline absorbs the fraud and the chi-squared test fails to reject. The trap is most severe in long-running schemes (Higher-Order and Variable-Order Markov Models for Long-Memory Fraud Schemes covers the long-memory case explicitly). Mitigation: use the most-recent-uncontaminated period the auditor can defensibly identify, or composite multiple prior periods to dilute any single-period contamination.

Peer-group baseline. Use a composite transition matrix from a set of comparable entities (same industry, similar size, similar reporting framework). Advantages: not subject to the circular-estimation trap; surfaces entity-specific anomalies even if the entity has manipulated reporting throughout its history. Failure mode: the comparability problem — peers genuinely differ in chart-of-accounts structure, transaction volume, and close-cycle timing. A small baseline pool produces high-variance estimates; a large pool dilutes the comparable-entity match. Mitigation: stratify the peer pool on the audit-relevant characteristics (revenue size, industry NAICS, ERP system) and weight by similarity.

Synthetic-null baseline. Construct $P^{(0)}$ from first principles based on the entity’s documented business processes — what transactions should happen and in what order under the documented control framework. Advantages: not contaminated by either the entity’s own history or peer-group differences; tests directly against the documented expectation. Failure mode: the calibration cost — building a synthetic-null baseline requires substantial process-walkthrough work upfront. Most defensible when the auditor has already documented the entity’s processes for ICFR purposes (PCAOB AS 2110 §28-49 on understanding the entity’s processes); marginal cost is then small.

For any specific engagement, the choice depends on the scoping question. Detection of process-deviation: synthetic-null is best. Detection of entity-specific anomalies: peer-group. Detection of period-over-period drift: prior-period self-baseline (with composite mitigation against contamination).

Building each baseline type from data:


import numpy as np

def transition_matrix(sequence: list[str], states: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Empirical first-order transition matrix from a state sequence.

    Returns (P_hat, N) where P_hat is row-stochastic and N is the count matrix.
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    N = np.zeros((n, n), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[idx[prev], idx[curr]] += 1
    row_sums = N.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        P_hat = np.where(row_sums > 0, N / row_sums, 0.0)
    assert np.allclose(P_hat.sum(axis=1)[row_sums.flatten() > 0], 1.0), \
        "Row-stochasticity violated"
    return P_hat, N

def fit_prior_period_baseline(prior_sequences: list[list[str]], states: list[str]) -> np.ndarray:
    """Composite baseline from N prior periods (dilutes single-period contamination)."""
    P_list = [transition_matrix(seq, states)[0] for seq in prior_sequences]
    return np.mean(P_list, axis=0)

def fit_peer_group_baseline(peer_sequences: list[list[str]], states: list[str],
                             weights: list[float] | None = None) -> np.ndarray:
    """Stratified peer-group baseline; optional similarity weights per peer."""
    P_list = [transition_matrix(seq, states)[0] for seq in peer_sequences]
    if weights is None:
        return np.mean(P_list, axis=0)
    w = np.array(weights) / np.sum(weights)
    return np.tensordot(w, np.stack(P_list), axes=1)

def fit_synthetic_null_baseline(documented_process: dict[str, dict[str, float]],
                                  states: list[str]) -> np.ndarray:
    """Construct P^(0) from documented expected transition probabilities."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))
    for from_state, transitions in documented_process.items():
        for to_state, prob in transitions.items():
            P[idx[from_state], idx[to_state]] = prob
    # Re-normalize to enforce row-stochasticity
    row_sums = P.sum(axis=1, keepdims=True)
    return np.where(row_sums > 0, P / row_sums, 0.0)
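
A caveat on composites: averaging row-normalized matrices gives a zero row for any period in which a state was never visited, which pulls that row of the composite below 1. One alternative, sketched here under the assumption that pooling raw counts across periods is acceptable for the engagement, is to sum the count matrices first and normalize once. The names `count_matrix` and `pooled_count_baseline` are illustrative and are not part of the series' code.

```python
import numpy as np


def count_matrix(sequence: list[str], states: list[str]) -> np.ndarray:
    """Transition counts for one period's state sequence."""
    idx = {s: i for i, s in enumerate(states)}
    N = np.zeros((len(states), len(states)), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[idx[prev], idx[curr]] += 1
    return N


def pooled_count_baseline(sequences: list[list[str]], states: list[str]) -> np.ndarray:
    """Composite baseline from summed counts; rows never visited stay zero."""
    N_total = sum(count_matrix(seq, states) for seq in sequences)
    row_sums = N_total.sum(axis=1, keepdims=True)
    return np.divide(N_total, row_sums,
                     out=np.zeros(N_total.shape, dtype=float),
                     where=row_sums > 0)


states = ["a", "b", "c"]
periods = [["a", "b", "a", "b"],   # state "c" never appears in this period
           ["a", "c", "c", "a"]]   # state "b" never appears in this period
P_pooled = pooled_count_baseline(periods, states)
print(P_pooled.sum(axis=1))  # every visited row sums to 1
```

The trade-off is that pooled counts weight each period by its volume, so a high-volume contaminated period carries more influence than it would in an equal-weight average.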

Period-end transition modeling

Almost every entity’s posting practice exhibits a structural shift in the last three to five business days of each reporting period. Adjusting entries cluster (per PCAOB AS 2401 §60, journal-entry testing specifically targets these); reclassifications occur; consolidating entries are posted; accruals are trued up. The shift is legitimate, not anomalous. A first-order Markov chain fitted across the whole period averages the close-cycle dynamics with the mid-period dynamics and produces a baseline that fits neither well.

Three modeling approaches handle the close-cycle shift.

Separate transition matrices per period window. Fit one matrix for “mid-period” (say, days 1 through period-length minus 5) and a second for “close-cycle” (the last 5 days). Test each against its own window-specific baseline. Advantages: simplest to implement; cleanly separates the two dynamics. Disadvantages: doubles the parameter count (and with it the data needed for chi-squared validity); the day-cut is arbitrary.

Time-of-period as a covariate. Augment the state space with a continuous “fraction of period elapsed” covariate and fit a transition kernel that depends on it. Advantages: smoothly handles the shift without an arbitrary day-cut. Disadvantages: moves the model out of the time-homogeneous discrete-state Markov framework into a covariate-dependent (non-homogeneous) transition model; harder to communicate to engagement teams unfamiliar with the formalism.

Hierarchical model. Treat the close-cycle shift as a known latent regime (mid-period vs. close-cycle) with a documented prior on the shift point (typically days 22-25 of a 30-day period, or days 60-65 of a 90-day period). Fit two transition matrices conditional on the latent regime, with the regime indicator estimated from the data. This is a constrained version of the Hidden Markov Model that Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials develops in full generality.

For most production audit engagements, the separate-matrices approach with composite mid-period baseline is the right starting point. It requires no machinery beyond what First-Order Markov Modeling for Transaction-Stream Analysis in Audit introduced, and it captures the bulk of the close-cycle effect at the cost of roughly doubling the data requirement.

The split in code:


def split_period(je: pd.DataFrame, period_end_days: int = 5) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split journal entries into mid-period vs close-cycle by business-day position.

    Assumes je has a 'period_end_offset_days' column counting business days
    until the end of the reporting period (0 = period end, increasing earlier).
    """
    close_cycle = je[je["period_end_offset_days"] < period_end_days]
    mid_period = je[je["period_end_offset_days"] >= period_end_days]
    return mid_period, close_cycle

Each window is then fitted and tested against its own baseline independently.
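
When the DataFrame spans several reporting periods, concatenating all mid-period rows into one sequence creates artificial transitions at the seams (the last mid-period entry of one month followed by the first of the next). A minimal sketch of one way to avoid this, assuming two extra columns that are not part of the contract above, `period` (a reporting-period label) and `state` (the encoded state label); the function name `window_transitions` is illustrative.

```python
import pandas as pd


def window_transitions(je: pd.DataFrame, period_end_days: int = 5) -> pd.DataFrame:
    """Return (period, window, prev_state, state) rows, one per within-window transition."""
    je = je.sort_values("posted_at").copy()
    is_close = je["period_end_offset_days"] < period_end_days
    je["window"] = is_close.map({True: "close", False: "mid"})
    # shift(1) inside each (period, window) group, so no pair straddles a seam
    je["prev_state"] = je.groupby(["period", "window"])["state"].shift(1)
    pairs = je.dropna(subset=["prev_state"])
    return pairs[["period", "window", "prev_state", "state"]].reset_index(drop=True)


je = pd.DataFrame({
    "posted_at": pd.to_datetime([
        "2024-01-10", "2024-01-15", "2024-01-20", "2024-01-30", "2024-01-31",
        "2024-02-09", "2024-02-14", "2024-02-20", "2024-02-28", "2024-02-29",
    ]),
    "period": ["2024-01"] * 5 + ["2024-02"] * 5,
    "state": ["revenue", "ar", "cash", "accrued", "opex",
              "revenue", "ar", "cash", "accrued", "accrued"],
    "period_end_offset_days": [15, 12, 8, 1, 0, 14, 11, 7, 1, 0],
})
pairs = window_transitions(je)
print(pd.crosstab(pairs["prev_state"], pairs["state"]))
```

The toy ledger has ten entries but only six within-window transitions; the pairs that would cross a window or period boundary (for example cash to accrued at the start of the close window) are never formed.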

Sample-size diagnostics

The chi-squared approximation requires a minimum expected count per cell. The conventional rule of thumb, usually attributed to Cochran (1954) and cited here from Hawkins (1980, p. 145), is:

$$E_{ij} \geq 5 \text{ for at least 80\% of cells, and } E_{ij} \geq 1 \text{ for all cells.}$$

Violation is the dominant practical failure mode in production audit work, particularly with finer state-space encodings. The diagnostic and three responses, in code:


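def cell_population_diagnostic(N: np.ndarray, P_baseline: np.ndarray) -> dict:
    """Returns the expected counts, the fraction of cells violating the
    minimum-expected-count rule, and a recommended response
    (chi_squared_asymptotic, cell_pooling, or simulation_based).

    Cells with zero baseline probability are structural zeros. chi2_test masks
    them out, so they are excluded from the counts here as well; otherwise a
    sparse baseline would always land in the simulation branch.
    """
    expected = N.sum(axis=1, keepdims=True) * P_baseline
    testable = P_baseline > 0
    n_cells = int(testable.sum())
    n_below_5 = int((expected[testable] < 5).sum())
    n_below_1 = int((expected[testable] < 1).sum())
    fraction_below_5 = n_below_5 / n_cells if n_cells else 0.0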

    if fraction_below_5 < 0.20 and n_below_1 == 0:
        recommendation = "chi_squared_asymptotic"
    elif fraction_below_5 < 0.50:
        recommendation = "cell_pooling"
    else:
        recommendation = "simulation_based"  # asymptotic chi-squared unreliable

    return {
        "expected": expected,
        "n_cells": n_cells,
        "n_cells_E_lt_5": n_below_5,
        "n_cells_E_lt_1": n_below_1,
        "fraction_E_lt_5": fraction_below_5,
        "recommendation": recommendation,
    }

def simulation_pvalue(observed_chi2: float, P_baseline: np.ndarray,
                       n_transitions: int, n_states: int,
                       B: int = 999, seed: int = 42) -> float:
    """Simulation-based p-value: generate B replicate sequences from the baseline
    and return the fraction of replicate chi-squared statistics >= observed.

    Each replicate starts in state 0 and contains exactly n_transitions
    transitions; every row of P_baseline must sum to 1.
    """
    rng = np.random.default_rng(seed)
    states_list = list(range(n_states))
    chi2_replicates = np.empty(B)
    for b in range(B):
        # Generate a replicate sequence of n_transitions + 1 states from the baseline
        seq = [0]
        for _ in range(n_transitions):
            seq.append(int(rng.choice(n_states, p=P_baseline[seq[-1]])))
        # Compute chi-squared for this replicate against the same baseline
        _, N_rep = transition_matrix(seq, states_list)
        E_rep = N_rep.sum(axis=1, keepdims=True) * P_baseline
        mask = E_rep > 0
        chi2_replicates[b] = ((N_rep[mask] - E_rep[mask]) ** 2 / E_rep[mask]).sum()
    # Add-one correction keeps the estimated p-value strictly above zero
    return float((chi2_replicates >= observed_chi2).sum() + 1) / (B + 1)

The three responses:

Cell pooling. Combine sparse cells with their row’s largest-expectation cell into a single combined cell, and adjust the degrees of freedom downward by the number of pooled cells. First-Order Markov Modeling for Transaction-Stream Analysis in Audit’s chi2_with_pooling function (reproduced in the companion notebook) handles this automatically.

Exact tests. Use Fisher’s exact test (or its multinomial generalization) instead of the chi-squared approximation. Computationally feasible for small cell counts; loses tractability when many cells require exact treatment.

Simulation-based p-values. Generate $B$ replicate transition sequences from the baseline matrix, compute the chi-squared statistic on each, and use the empirical distribution of statistics as the reference distribution. $B = 999$ replicates is the conventional default; $B = 9{,}999$ for engagements where p-value precision matters in the third decimal. The simulation_pvalue function above implements this.
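
The pooling response can be sketched as follows. This is not the series' chi2_with_pooling function, which is not reproduced in this article; it is a minimal illustration of the rule described above, with the name `pooled_chi2` and the default `pool_threshold` of 5 chosen as assumptions. Within each row, cells whose expected count falls below the threshold are merged into the row's largest-expectation cell, and the row contributes one degree of freedom fewer than its number of remaining cells.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist


def pooled_chi2(N: np.ndarray, P_baseline: np.ndarray,
                pool_threshold: float = 5.0) -> tuple[float, int, float]:
    """Row-wise pooling sketch: returns (statistic, degrees of freedom, p-value)."""
    E = N.sum(axis=1, keepdims=True) * P_baseline
    stat, df = 0.0, 0
    for n_row, e_row in zip(N.astype(float), E):
        live = e_row > 0                      # skip structural zeros and unvisited rows
        if not live.any():
            continue
        n_live, e_live = n_row[live], e_row[live]
        sparse = e_live < pool_threshold
        sparse[np.argmax(e_live)] = False     # the largest-expectation cell absorbs the rest
        n_kept, e_kept = n_live[~sparse], e_live[~sparse]
        target = np.argmax(e_kept)            # same cell, re-indexed among the kept cells
        n_kept[target] += n_live[sparse].sum()
        e_kept[target] += e_live[sparse].sum()
        stat += float(((n_kept - e_kept) ** 2 / e_kept).sum())
        df += len(e_kept) - 1                 # kept cells minus the row-sum constraint
    p = float(chi2_dist.sf(stat, df)) if df > 0 else float("nan")
    return stat, df, p


N_demo = np.array([[70, 20, 10], [6, 14, 20], [5, 1, 4]])
P_demo = np.array([[0.6, 0.3, 0.1], [0.05, 0.5, 0.45], [0.5, 0.1, 0.4]])
print(pooled_chi2(N_demo, P_demo))
```

Pooling has a cost worth keeping in mind: a deviation confined to a sparse cell is merged into a larger cell and can be partly masked, so the pooled cells deserve a separate look when the row is of audit interest.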

When none of the three responses produces a defensible test result, the engagement team should fall back to simpler diagnostics — Frobenius distance from baseline (which has no minimum-cell-count requirement), or one of the analytical-procedure techniques from PCAOB AS 2305 §10-17. The first-order Markov framework is not the only tool; the engagement team’s responsibility is to use whichever tool the data actually supports.
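
For reference, the Frobenius distance is the square root of the summed squared cell-by-cell differences between the two matrices. A minimal sketch using `numpy.linalg.norm`; the helper names are illustrative, and the per-row variant is an addition for localization rather than something the series defines.

```python
import numpy as np


def frobenius_distance(P_hat: np.ndarray, P_baseline: np.ndarray) -> float:
    """Square root of the summed squared cell-by-cell differences."""
    return float(np.linalg.norm(P_hat - P_baseline, ord="fro"))


def row_distances(P_hat: np.ndarray, P_baseline: np.ndarray) -> np.ndarray:
    """Per-origin-state Euclidean distance, to see which rows drive the total."""
    return np.linalg.norm(P_hat - P_baseline, axis=1)


P_hat = np.array([[0.7, 0.3], [0.4, 0.6]])
P_0 = np.array([[0.8, 0.2], [0.4, 0.6]])
print(frobenius_distance(P_hat, P_0), row_distances(P_hat, P_0))
```

The distance carries no p-value on its own; it is most useful as a ranking measure across entities or periods, or compared against the distances observed in periods the team already regards as clean.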

Worked example

Consider a multi-entity audit portfolio of 20 entities across three industries (manufacturing, technology services, healthcare). Subledger-granularity encoding (15 states). Three months of monthly journal-entry data per entity (~6,000 entries per entity per month).

The portfolio-level test runs $m = 120$ chi-squared tests (20 entities × 3 months × 2 windows per entity-month). Running each at the conventional 5% threshold without correction produces an expected $m \cdot 0.05 = 6$ false rejections under the null, too many for the engagement team to investigate one by one. Holm-Bonferroni step-down correction addresses this. In plain English: sort the $m$ p-values from smallest to largest, then walk down the list applying progressively relaxed thresholds. The smallest p-value faces the strictest threshold, $\alpha/m$; the second-smallest faces $\alpha/(m-1)$; and so on. The procedure stops at the first p-value that exceeds its threshold: every hypothesis before that point is rejected, and that hypothesis and every one after it are not. The formal rule for ordered p-values $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$ rejects $H_{(i)}$ if and only if:

$$p_{(j)} \leq \frac{\alpha}{m - j + 1} \text{ for all } j \leq i$$

Holm-Bonferroni is uniformly more powerful than plain Bonferroni (which uses $\alpha/m$ for every test) — it never has fewer rejections and often has more. It is the standard family-wise error-rate procedure for finite test families in modern audit-analytics practice.
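
The step-down rule is short enough to write out by hand, which can help when explaining it to an engagement team. A minimal sketch; `holm_reject` and `bonferroni_reject` are illustrative names, and production work would normally rely on a library routine instead.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: reject in ascending p-value order until one threshold fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank = j - 1 in the formula
        if p_values[i] > alpha / (m - rank):  # threshold alpha / (m - j + 1)
            break                             # this and all larger p-values are kept
        reject[i] = True
    return reject


def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Plain Bonferroni: every test faces the same threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


p_demo = [0.04, 0.0001, 0.02, 0.013]
print("Holm:      ", holm_reject(p_demo))
print("Bonferroni:", bonferroni_reject(p_demo))
```

On this four-test family Holm rejects every hypothesis while plain Bonferroni rejects only the one with the smallest p-value, which illustrates the power difference.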

The full pipeline, reproducible end-to-end:


import numpy as np
import pandas as pd
from scipy.stats import chi2 as chi2_dist
from statsmodels.stats.multitest import multipletests

np.random.seed(42)  # legacy global seed; the default_rng generators below are seeded separately
STATES = list(SUBLEDGER_MAP.keys())  # 15 states (strings)
STATE_INDEX = {s: i for i, s in enumerate(STATES)}
N_ENTITIES = 20
N_MONTHS = 3
N_ENTRIES_PER_ENTITY_MONTH = 6000
CLOSE_CYCLE_FRACTION = 1 / 6  # last ~5 of 30 days

def _row(weights: dict[str, float]) -> np.ndarray:
    """Build a row-stochastic 15-vector from a {state: weight} dict (sparse cells default to 0, then normalize)."""
    row = np.zeros(15)
    for s, w in weights.items():
        row[STATE_INDEX[s]] = w
    return row / row.sum()

# Realistic mid-period baseline. Revenue cycle (revenue→ar→cash), purchase cycle
# (inventory→cogs and cash→ap→opex), and capex (fixed_assets→depreciation) dominate.
# Accruals and deferred revenue are minimal during the regular operating window.
P_mid_baseline = np.array([
    _row({"cash": 0.15, "ar": 0.20, "ap": 0.20, "opex": 0.15, "inventory": 0.10, "fixed_assets": 0.05, "intercompany": 0.10, "intangibles": 0.05}),       # cash
    _row({"cash": 0.45, "ar": 0.15, "revenue": 0.25, "deferred_revenue": 0.05, "intercompany": 0.10}),                                                       # ar
    _row({"cogs": 0.50, "inventory": 0.20, "ap": 0.20, "fixed_assets": 0.05, "intercompany": 0.05}),                                                          # inventory
    _row({"depreciation": 0.30, "fixed_assets": 0.25, "ap": 0.20, "cash": 0.15, "intercompany": 0.10}),                                                       # fixed_assets
    _row({"depreciation": 0.40, "intangibles": 0.30, "ap": 0.15, "cash": 0.15}),                                                                              # intangibles
    _row({"cash": 0.55, "ap": 0.10, "opex": 0.20, "inventory": 0.10, "intercompany": 0.05}),                                                                  # ap
    _row({"cash": 0.30, "accrued": 0.10, "opex": 0.40, "ap": 0.15, "intercompany": 0.05}),                                                                    # accrued
    _row({"revenue": 0.50, "cash": 0.20, "ar": 0.15, "deferred_revenue": 0.10, "intercompany": 0.05}),                                                        # deferred_revenue
    _row({"cash": 0.40, "debt": 0.25, "accrued": 0.20, "opex": 0.10, "intercompany": 0.05}),                                                                  # debt
    _row({"cash": 0.35, "equity": 0.35, "intercompany": 0.20, "debt": 0.10}),                                                                                 # equity
    _row({"ar": 0.60, "cash": 0.20, "revenue": 0.10, "deferred_revenue": 0.05, "intercompany": 0.05}),                                                        # revenue
    _row({"inventory": 0.50, "cogs": 0.20, "ap": 0.15, "intercompany": 0.10, "accrued": 0.05}),                                                               # cogs
    _row({"cash": 0.35, "ap": 0.30, "accrued": 0.15, "opex": 0.10, "intercompany": 0.10}),                                                                    # opex
    _row({"fixed_assets": 0.40, "depreciation": 0.20, "accrued": 0.20, "intangibles": 0.10, "intercompany": 0.10}),                                           # depreciation
    _row({"intercompany": 0.30, "cash": 0.20, "ap": 0.15, "ar": 0.15, "accrued": 0.10, "equity": 0.10}),                                                      # intercompany
])
assert np.allclose(P_mid_baseline.sum(axis=1), 1.0)

# Realistic close-cycle baseline. Accruals, deferred revenue, depreciation/amortization,
# and intercompany reclassifications dominate — exactly the entries the close-certification
# controls produce. Note the materially heavier weight on `accrued` and `intercompany`
# across most rows; this is the period-end concentration the model needs to recognize.
P_close_baseline = np.array([
    _row({"accrued": 0.20, "cash": 0.15, "ap": 0.15, "opex": 0.15, "intercompany": 0.15, "ar": 0.10, "deferred_revenue": 0.10}),                              # cash
    _row({"deferred_revenue": 0.25, "revenue": 0.20, "accrued": 0.15, "intercompany": 0.15, "cash": 0.15, "ar": 0.10}),                                       # ar
    _row({"accrued": 0.20, "cogs": 0.30, "inventory": 0.20, "ap": 0.15, "intercompany": 0.15}),                                                               # inventory
    _row({"depreciation": 0.40, "accrued": 0.20, "fixed_assets": 0.15, "intercompany": 0.15, "intangibles": 0.10}),                                           # fixed_assets
    _row({"depreciation": 0.40, "intangibles": 0.25, "accrued": 0.20, "intercompany": 0.15}),                                                                 # intangibles
    _row({"accrued": 0.30, "ap": 0.20, "cash": 0.20, "opex": 0.15, "intercompany": 0.15}),                                                                    # ap
    _row({"opex": 0.35, "accrued": 0.20, "ap": 0.15, "cash": 0.15, "intercompany": 0.15}),                                                                    # accrued
    _row({"deferred_revenue": 0.30, "revenue": 0.25, "accrued": 0.20, "ar": 0.15, "intercompany": 0.10}),                                                     # deferred_revenue
    _row({"accrued": 0.30, "debt": 0.20, "cash": 0.20, "opex": 0.15, "intercompany": 0.15}),                                                                  # debt
    _row({"equity": 0.30, "intercompany": 0.25, "accrued": 0.20, "cash": 0.15, "debt": 0.10}),                                                                # equity
    _row({"deferred_revenue": 0.30, "ar": 0.30, "revenue": 0.15, "accrued": 0.15, "intercompany": 0.10}),                                                     # revenue
    _row({"accrued": 0.25, "inventory": 0.25, "cogs": 0.20, "ap": 0.15, "intercompany": 0.15}),                                                               # cogs
    _row({"accrued": 0.35, "ap": 0.20, "opex": 0.15, "cash": 0.15, "intercompany": 0.15}),                                                                    # opex
    _row({"depreciation": 0.25, "fixed_assets": 0.25, "accrued": 0.25, "intangibles": 0.15, "intercompany": 0.10}),                                           # depreciation
    _row({"intercompany": 0.35, "accrued": 0.25, "ap": 0.15, "ar": 0.10, "cash": 0.10, "equity": 0.05}),                                                      # intercompany
])
assert np.allclose(P_close_baseline.sum(axis=1), 1.0)

# Tying to the engagement's month-end controls: the two baselines correspond directly to
# the entity's documented mid-period operating cycle and its month-end close-cycle review
# control (typically a controller-signed journal-entry batch summary tied to PCAOB AS 2201
# entity-level control testing). Anomalous patterns inside the close window can either
# mean fraud or that the close certification missed a legitimate posting requiring review.

def synthesize_entity_month(P_mid: np.ndarray, P_close: np.ndarray, n: int, close_fraction: float, seed: int) -> tuple[list[str], list[str]]:
    """Generate (mid_period_sequence, close_cycle_sequence) from two baselines."""
    rng = np.random.default_rng(seed)
    n_close = int(n * close_fraction)
    n_mid = n - n_close
    seq_mid_idx = [0]
    for _ in range(n_mid - 1):
        seq_mid_idx.append(int(rng.choice(15, p=P_mid[seq_mid_idx[-1]])))
    seq_close_idx = [seq_mid_idx[-1]]  # close cycle starts from where mid-period ended
    for _ in range(n_close - 1):
        seq_close_idx.append(int(rng.choice(15, p=P_close[seq_close_idx[-1]])))
    return [STATES[s] for s in seq_mid_idx], [STATES[s] for s in seq_close_idx]

def chi2_test(N: np.ndarray, P_baseline: np.ndarray) -> tuple[float, float]:
    """Standard chi-squared test on transition counts vs baseline.

    df = (cells with E_ij > 0) - (rows with any E_ij > 0), which equals
    |S|*(|S|-1) only when the baseline is dense and every state is visited.
    Assumes E_ij >= 5 for at least 80% of cells (the rule in Sample-size
    diagnostics). Observed transitions into cells whose baseline probability
    is zero are masked out and do not enter the statistic, so review those
    counts separately. For engagements where small expected counts
    materially affect the tail, use First-Order Markov Modeling for Transaction-Stream Analysis in Audit's chi2_with_pooling routine, which
    pools cells with E < pool_threshold into row's max-E cell and adjusts df accordingly.
    """
    E = N.sum(axis=1, keepdims=True) * P_baseline
    mask = E > 0
    chi2_stat = float(((N[mask] - E[mask]) ** 2 / E[mask]).sum())
    df = int(mask.sum() - mask.any(axis=1).sum())  # one constraint per non-empty row
    p = float(chi2_dist.sf(chi2_stat, df))  # survival function stays accurate in the far tail
    return chi2_stat, p

# Run the full diagnostic across all entity-month-window cells.
# Each window is tested against its own period-specific baseline, since the
# legitimate posting pattern differs materially between mid-period and close.
N_WINDOWS_PER_ENTITY_MONTH = 2  # mid + close
m_family = N_ENTITIES * N_MONTHS * N_WINDOWS_PER_ENTITY_MONTH  # 20 * 3 * 2 = 120 tests
results = []
for entity in range(N_ENTITIES):
    for month in range(N_MONTHS):
        seed = 1000 * entity + month
        mid_seq, close_seq = synthesize_entity_month(P_mid_baseline, P_close_baseline,
                                                       N_ENTRIES_PER_ENTITY_MONTH,
                                                       CLOSE_CYCLE_FRACTION, seed)
        for window_label, w_seq, w_baseline in [
            ("mid", mid_seq, P_mid_baseline),
            ("close", close_seq, P_close_baseline),
        ]:
            _, N_obs = transition_matrix(w_seq, STATES)
            chi2_stat, p_val = chi2_test(N_obs, w_baseline)
            results.append({"entity": entity, "month": month,
                            "window": window_label, "chi2": chi2_stat, "p": p_val})

# Holm-Bonferroni correction across the full m=120-test family at family-wise alpha=0.05.
# The family size is N_ENTITIES * N_MONTHS * N_WINDOWS_PER_ENTITY_MONTH; making this
# explicit prevents accidental under-correction when the engagement adds entities or
# windows mid-cycle (e.g., extending the test to a fourth quarter).
results_df = pd.DataFrame(results)
assert len(results_df) == m_family, f"Expected {m_family} tests, got {len(results_df)}"
reject, p_adj, _, _ = multipletests(results_df["p"].values, alpha=0.05, method="holm")
results_df["p_adjusted_holm"] = p_adj
results_df["reject_holm"] = reject

# Per-entity rejection counts
focus_list = (results_df.groupby(["entity", "window"])["reject_holm"]
              .sum().unstack(fill_value=0).reset_index())
focus_list.columns = ["entity", "close_rejections", "mid_rejections"]
focus_list["total_rejections"] = focus_list["mid_rejections"] + focus_list["close_rejections"]
print(focus_list.sort_values("total_rejections", ascending=False).head(10))

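With the per-entity-month seeds set in the loop (seed = 1000 * entity + month) and the non-uniform baselines above, this code produces a deterministic 20-entity ranked focus list; the global np.random.seed(42) call does not feed the default_rng generators. The top of the printed table looks like: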


    entity  close_rejections  mid_rejections  total_rejections
13      13                 1               1                 2
17      17                 1               0                 1
4        4                 0               1                 1
8        8                 1               0                 1
2        2                 0               0                 0
...

Entities with zero rejections in either column receive analytical-procedure-only treatment for the journal-entry-testing requirement of AS 2401 §60. Entities with one or more rejections receive substantive sample expansion targeted at the specific transitions driving the rejection (the cell-level standardized residuals from First-Order Markov Modeling for Transaction-Stream Analysis in Audit’s diagnostic).

In a typical 20-entity portfolio with no fraud present, the framework above produces 0-2 false-positive rejections after Holm correction — well within the engagement team’s capacity to follow up with additional procedures. With injected fraud (e.g., one entity with anomalous revenue → AR transitions concentrated in the close-cycle window), the rejection signal localizes cleanly to the affected entity-month-window cells and survives the family-wise correction.
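
To see which transitions drive a rejection, one common choice is the Pearson residual $(N_{ij} - E_{ij}) / \sqrt{E_{ij}}$ per cell. The sketch below assumes that definition; the series' own residual diagnostic may differ, and the names `pearson_residuals` and `top_cells` as well as the three-state toy data are illustrative.

```python
import numpy as np


def pearson_residuals(N: np.ndarray, P_baseline: np.ndarray) -> np.ndarray:
    """(N_ij - E_ij) / sqrt(E_ij), with zero where the expected count is zero."""
    E = N.sum(axis=1, keepdims=True) * P_baseline
    return np.divide(N - E, np.sqrt(E), out=np.zeros(E.shape), where=E > 0)


def top_cells(N: np.ndarray, P_baseline: np.ndarray,
              states: list[str], k: int = 3) -> list[tuple[str, str, float]]:
    """The k transitions with the largest absolute residual, as (from, to, residual)."""
    R = pearson_residuals(N, P_baseline)
    flat = np.argsort(-np.abs(R), axis=None)[:k]
    rows, cols = np.unravel_index(flat, R.shape)
    return [(states[r], states[c], float(R[r, c])) for r, c in zip(rows, cols)]


demo_states = ["revenue", "ar", "cash"]
# Toy close-window counts with an excess of revenue -> ar postings in the first row
N_close = np.array([[10, 80, 10], [20, 10, 70], [30, 30, 40]])
P_close = np.array([[0.2, 0.5, 0.3], [0.2, 0.1, 0.7], [0.3, 0.3, 0.4]])
for frm, to, r in top_cells(N_close, P_close, demo_states, k=2):
    print(f"{frm} -> {to}: {r:+.2f}")
```

Positive residuals mark transitions that occur more often than the baseline expects and negative ones mark shortfalls; both are worth reading together, since an excess in one cell of a row forces a deficit elsewhere in the same row.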

Reference points in the published prosecution record

Journal-entry sequence anomalies — the structural signature this article’s method is built to detect — are documented across major financial-statement-fraud prosecutions. Two reference points are particularly instructive for the audit team using this methodology:

  • WorldCom (2002). SEC v. WorldCom, Inc., Civil Action No. 02-CV-4963 (S.D.N.Y., complaint filed June 26, 2002, amended November 2002); parallel criminal prosecution United States v. Ebbers, 458 F.3d 110 (2d Cir. 2006) (Bernard J. Ebbers, former CEO; conviction March 15, 2005, sentenced to 25 years); United States v. Sullivan (S.D.N.Y., guilty plea March 2004). The $74-billion WorldCom restatement was, mechanically, a journal-entry-sequence fraud: specific late-period adjusting entries reclassified operating expenses (line costs) as capital expenditures, producing a posting pattern that violated the entity’s documented chart-of-accounts business logic for those transitions. The exact transition cells where the misclassification concentrated — operating-expense → fixed-asset entries in the final close-cycle window — are precisely the kind of localized anomaly that the chi-squared test on mid-vs-close window-specific baselines would flag.
  • Par Funding (2020). SEC v. Complete Business Solutions Group, Inc., d/b/a Par Funding, et al., Case No. 9:20-cv-81205 (S.D. Fla., filed July 24, 2020); parallel DOJ criminal prosecution against the principal officers in the Eastern District of Pennsylvania. The $550-million matter was built on merchant cash advance (MCA) receivables; the borrower-side accounting fraud relevant to this article involved reclassifying MCA proceeds as sales revenue and the daily holdback as cost-of-goods-sold or operating expense. That is a journal-entry-sequence signature with a tight cyclic transition pattern (mca_proceeds → related_party → revenue → cash → daily_holdback → mca_proceeds) that the mixture-of-Markov-chains framework in Markov Mixture Models for Round-Tripping and Lapping Detection detects directly. The first-order framework in this article would flag the elevated transition probability from the cash subledger to the related-party state and from the related-party state to revenue.

Neither case used journal-entry sequence analysis as the sole basis for restatement or enforcement. In both cases, the technique would have functioned as a directional analytic that concentrated substantive-procedure attention on specific entity-period-window cells — exactly the workpaper triage the framework is built to produce.

Production failure modes

Three patterns recur in production deployments and warrant explicit defenses.

Stale baselines. The prior-period baseline drifts with legitimate business change — new product lines, customer-base shifts, accounting-policy adoption. A baseline two or three years old can fit current operations so poorly that the chi-squared test rejects everywhere, regardless of fraud. Defense: review the baseline’s currency at engagement planning; if business activity has materially changed since the baseline period, refresh the baseline using the most recent uncontaminated period available.

Mid-period business changes. Acquisitions, divestitures, ERP-system migrations, and chart-of-accounts restructurings all produce structural breaks within the testing period itself. The Bai-Perron (1998) structural-break test applied to the cell-by-cell transition counts identifies the break points; the engagement response is to fit separate transition matrices for the pre-break and post-break sub-periods. Failing to do this confounds the legitimate change with fraud signals and produces false positives.
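The engagement response can be sketched as follows, assuming the break point has already been located (by Bai-Perron or otherwise) and is supplied as an index into the ordered posting stream. The helper names `transition_matrix` and `split_at_break` and the toy account sequence are hypothetical.

```python
from collections import Counter, defaultdict

def transition_matrix(states):
    """Row-normalized first-order transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }

def split_at_break(states, break_index):
    """Fit separate matrices on either side of a known structural break.

    The single transition that straddles the break is dropped, so neither
    sub-period is estimated from a mixed-regime observation.
    """
    return (transition_matrix(states[:break_index]),
            transition_matrix(states[break_index:]))

# Hypothetical stream: revenue always posts to AR before the break at
# position 6 (say, an ERP migration) and always to cash afterwards.
stream = ["revenue", "ar", "revenue", "ar", "revenue", "ar",
          "revenue", "cash", "revenue", "cash", "revenue", "cash"]
pre, post = split_at_break(stream, break_index=6)
pooled = transition_matrix(stream)
print(pre["revenue"], post["revenue"], pooled["revenue"])
```

The pooled matrix assigns revenue a 50/50 split between AR and cash, which describes neither sub-period; testing either half against it would reject even though both halves are internally consistent.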

Period-end-only schemes. A fraud scheme that operates only at period end can hide inside the legitimate close-cycle structural shift. The mid-period transition matrix looks clean (because the scheme isn’t operating then); the close-cycle matrix shows elevated variance (which the auditor attributes to legitimate close-cycle activity). Defense: for entities with prior restatement history or other elevated risk indicators, draw the close-cycle baseline from a peer-group composite rather than from the entity’s own prior close cycles. Peer-group calibration removes the entity’s own potentially contaminated history from the comparison.
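A minimal sketch of a leave-one-out peer composite, assuming close-cycle transition counts are already tabulated per entity. Pooling raw counts rather than averaging per-entity probabilities is one possible design choice, and it weights larger peers more heavily. The function name `peer_composite`, the entity identifiers, and the counts are hypothetical.

```python
from collections import Counter, defaultdict

def peer_composite(close_counts, target):
    """Pool close-cycle transition counts across every entity except `target`,
    then row-normalize the pooled counts into a baseline transition matrix."""
    pooled = defaultdict(Counter)
    for entity, matrix in close_counts.items():
        if entity == target:
            continue  # leave out the entity's own, possibly contaminated, history
        for src, row in matrix.items():
            pooled[src].update(row)  # Counter.update adds counts cell by cell
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in pooled.items()
    }

# Hypothetical close-cycle counts for the revenue row of three entities.
close_counts = {
    "E01": {"revenue": {"ar": 80, "cash": 20}},
    "E02": {"revenue": {"ar": 70, "cash": 30}},
    "E13": {"revenue": {"ar": 20, "cash": 5, "related_party": 75}},
}
baseline_for_e13 = peer_composite(close_counts, target="E13")
print(baseline_for_e13["revenue"])  # {'ar': 0.75, 'cash': 0.25}
```

A transition that the target entity posts but no peer does (revenue → related_party here) has no baseline probability at all, so its expected count is zero. A chi-squared test against this composite would need an explicit rule for such cells, for example reporting them separately or smoothing the baseline.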

Bridge to Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials

The period-end transition modeling discussion above treats the close-cycle shift as a known structural feature with a documented timing. In some entities, the timing of the shift is itself uncertain: sometimes it occurs on days 26-30, sometimes on days 22-30, and sometimes a regime-shift signal appears mid-period that does not align with the documented close calendar. When the regime-shift component is itself a quantity to estimate from the data, the framework extends to Hidden Markov Models, the topic of Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials. The HMM treats the underlying regime (clean-reporting vs. manipulated-reporting, or mid-period vs. close-cycle) as a latent variable to be inferred from the observed transition pattern, rather than imposed by the engagement team’s documented expectation.


Authority:

Mathematical foundations:

  • Norris, J.R. (1997). Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics, §1.3-1.4 (empirical estimation theory).
  • Hawkins, D.M. (1980). Identification of Outliers. Chapman and Hall, p. 145 (chi-squared cell-population threshold).
  • Bai, J., & Perron, P. (1998). “Estimating and Testing Linear Models with Multiple Structural Changes.” Econometrica, 66(1), 47-78.
  • Holm, S. (1979). “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 6(2), 65-70.

Audit standards:

  • PCAOB AS 2401 — Consideration of Fraud in a Financial Statement Audit, §60-67 (journal-entry testing).
  • PCAOB AS 2305 — Substantive Analytical Procedures, §10-17 (expectation precision).
  • PCAOB AS 2110 — Identifying and Assessing Risks of Material Misstatement, §28-49 (understanding the entity’s processes).
  • ISA 240 — The Auditor’s Responsibilities Relating to Fraud in an Audit of Financial Statements, §32-33 (international counterpart).
  • IRM 4.10.1 — IRS audit techniques (cross-reference to tax-examination practice).

Companion code on GitHub

Runnable Python artifact reproducing this article’s worked example end-to-end under seed=42: stochastic_markov/002_journal_entry_production.py in noahrgreen/dd-tech-lab-companion.

Clone the repo and run with python stochastic_markov/002_journal_entry_production.py.