The first-order Markov apparatus from First-Order Markov Modeling for Transaction-Stream Analysis in Audit worked cleanly on a 5-state synthetic dataset with 1,000 transitions and a known baseline. Production audit work breaks each of those assumptions. The chart of accounts the engagement team inherits has 200 to 2,000 individual general-ledger accounts, not five. The baseline period is contaminated by the same posting practices the auditor wants to test against. And the entity’s monthly close cycle produces a structural transition shift in the last three business days of every period that — if not modeled — generates spurious anomaly signals on every clean entity in the portfolio.
This article walks through the three production decisions that determine whether first-order Markov modeling produces a useful signal or noise on a real engagement. The first is state-space encoding: how to compress a 2,000-account chart into a Markov-tractable partition without throwing away the diagnostic detail. The second is baseline selection: when to use a prior-period self-baseline, a peer-group baseline, or a synthetic-null baseline, and how to avoid the circular-estimation trap that makes chi-squared p-values misleading. The third is period-end transition modeling: distinguishing the legitimate close-cycle structural shift from anomalous behavior that happens to occur during the same window.
The discussion is grounded in PCAOB AS 2401 (Consideration of Fraud in a Financial Statement Audit, §60-67 on journal-entry testing) and AS 2305 (Substantive Analytical Procedures). The international counterpart is ISA 240 §32-33. Domestic IRS examination practice is captured in IRM 4.10.1.
State-space encoding
A first-order Markov chain on $n$ states has $n^2$ transition cells, of which $n(n-1)$ are free once each row is constrained to sum to 1. Doubling $n$ roughly quadruples the parameter count and, with it, the data required to estimate each cell to a useful tolerance. With a 200-account chart and 10,000 monthly journal entries, a one-state-per-account encoding produces 40,000 cells from about 10,000 transition observations: most cells are empty and the rest are noisy. The same 10,000 entries against a 12-state encoding produce 144 cells with an average of ~70 observations each, enough for a chi-squared test against a baseline to be statistically meaningful.
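The arithmetic behind that comparison is easy to check. The sketch below uses a hypothetical helper, `cell_budget` (not part of the article's code), and assumes roughly one transition per journal entry, as the paragraph above does.

```python
def cell_budget(n_states: int, n_transitions: int) -> tuple[int, float]:
    """Return (number of transition cells, average observations per cell)."""
    n_cells = n_states ** 2
    return n_cells, n_transitions / n_cells


# Compare one-state-per-account (200 states) with a 12-state encoding.
for n_states in (200, 12):
    cells, avg = cell_budget(n_states, 10_000)
    print(f"{n_states:>3} states: {cells:>6} cells, {avg:.2f} observations per cell")
```

The average hides how unevenly transitions are distributed across cells, so it is an optimistic bound rather than a guarantee that every cell is adequately populated.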
The encoding step in code. The DataFrame contract is: each journal entry is one row; required columns are posted_at (datetime, used for chronological ordering), account_id (string, the GL account number used to look up the state), and period_end_offset_days (integer, business days until period end — 0 means the period-end day, positive values are earlier in the period). State labels are strings throughout (the mapping keys); integer indexing into the transition matrix is internal to transition_matrix and never exposed in the analyst-facing API.
import pandas as pd
# Subledger-granularity encoding: 15 states for a typical mid-size entity
SUBLEDGER_MAP = {
# Each list holds the GL account-number prefixes that map to the state label (the key)
"cash": ["1010", "1020", "1030"],
"ar": ["1100", "1110", "1120"],
"inventory": ["1200", "1210", "1220"],
"fixed_assets": ["1500", "1510", "1520", "1530"],
"intangibles": ["1600", "1610", "1620"],
"ap": ["2000", "2010", "2020"],
"accrued": ["2100", "2110", "2120"],
"deferred_revenue": ["2200", "2210"],
"debt": ["2500", "2510", "2520"],
"equity": ["3000", "3010", "3020"],
"revenue": ["4000", "4010", "4020", "4030"],
"cogs": ["5000", "5010", "5020"],
"opex": ["6000", "6010", "6020", "6030", "6040"],
"depreciation": ["6500", "6510"],
"intercompany": ["9000", "9010", "9020"],
}
def encode_account(account_id: str, mapping: dict[str, list[str]]) -> str:
    """Return the state label for a GL account; raises if no prefix matches."""
    for state, prefixes in mapping.items():
        if any(account_id.startswith(p) for p in prefixes):
            return state
    raise KeyError(f"No state mapping for account {account_id!r}")

# Apply to a journal-entry DataFrame
def encode_je_stream(je: pd.DataFrame, mapping: dict = SUBLEDGER_MAP) -> pd.Series:
    """Return a state-label sequence in posting order."""
    return je.sort_values("posted_at")["account_id"].astype(str).map(
        lambda a: encode_account(a, mapping)
    )
The mapping is engagement-specific. The audit team customizes the prefixes during planning; once locked, the same mapping applies across every period and every test.
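Before locking the mapping, it helps to confirm that every account in the chart resolves to a state, since an unmapped account would otherwise raise mid-run. The following is a minimal sketch of such a coverage check; `unmapped_accounts` and the toy mapping are hypothetical illustrations, not part of the article's code.

```python
def unmapped_accounts(account_ids, mapping: dict[str, list[str]]) -> list[str]:
    """Return the sorted account IDs whose prefix matches no state in the mapping."""
    # str.startswith accepts a tuple of candidate prefixes
    prefixes = tuple(p for plist in mapping.values() for p in plist)
    return sorted({a for a in map(str, account_ids) if not a.startswith(prefixes)})


# Toy mapping and chart, for illustration only
toy_mapping = {"cash": ["1010", "1020"], "ar": ["1100"]}
chart = ["1010-001", "1100-200", "7777-000", "1020"]
print(unmapped_accounts(chart, toy_mapping))  # ['7777-000']
```

Any account the check returns needs either a new prefix in an existing state or an explicit catch-all state such as suspense.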
Three encoding granularities cover the practical range.
Top-level (5-7 states). Asset / liability / equity / revenue / expense / contra / suspense. This is the coarsest defensible partition. It produces a transition matrix dominated by the diagonal (debits and credits within the same statement section) and a small number of off-diagonal cells. Useful for very small entities or for cross-entity comparison where chart-of-accounts heterogeneity makes finer encodings non-comparable. Loses most of the diagnostic detail — period-end fraud schemes typically involve specific account-class transitions (e.g., revenue → AR → cash) that the top-level encoding aggregates into “income statement → balance sheet” at meaningless coarseness.
Subledger (12-20 states). Cash, AR, inventory, fixed assets, intangibles, AP, accrued liabilities, deferred revenue, debt, equity, revenue, COGS, operating expense, depreciation, OCI, intercompany, suspense. This is the practical sweet spot for entities with 200-2,000 accounts. Fine enough to localize anomalies to a specific subledger; coarse enough that each cell receives enough observations for chi-squared validity. PCAOB AS 2305 §10 references the kind of “sufficiently precise” expectation development the subledger granularity supports.
Process-aware (25-40 states). Order-to-cash subdivided into customer-billing / collections / cash-application / write-offs; procure-to-pay subdivided into vendor-onboarding / PO-issuance / receipt / invoice-match / payment; record-to-report subdivided into accruals / reclassifications / consolidation entries / true-ups. Useful for large entities with mature ERP environments and for engagements specifically scoped to process-cycle testing. Requires more observations per period to keep the cell-population diagnostic out of the small-sample warning zone.
The encoding choice should be made before any baseline is fitted. Switching encodings mid-engagement contaminates the comparability of period-over-period results.
Baseline selection
The chi-squared test from First-Order Markov Modeling for Transaction-Stream Analysis in Audit compares an observed transition matrix against a baseline. The baseline choice determines what the test actually tests. Three options exist; each has a known failure mode.
Prior-period self-baseline. Use the entity’s own posting pattern from a prior period (typically the comparable period in the prior fiscal year) as $P^{(0)}$. Advantages: maximally comparable on entity-specific characteristics; captures the entity’s actual chart-of-accounts utilization. Failure mode: the circular-estimation trap — if the prior period contained the same fraud the auditor is testing for, the baseline absorbs the fraud and the chi-squared test fails to reject. The trap is most severe in long-running schemes (Higher-Order and Variable-Order Markov Models for Long-Memory Fraud Schemes covers the long-memory case explicitly). Mitigation: use the most-recent-uncontaminated period the auditor can defensibly identify, or composite multiple prior periods to dilute any single-period contamination.
Peer-group baseline. Use a composite transition matrix from a set of comparable entities (same industry, similar size, similar reporting framework). Advantages: not subject to the circular-estimation trap; surfaces entity-specific anomalies even if the entity has manipulated reporting throughout its history. Failure mode: the comparability problem — peers genuinely differ in chart-of-accounts structure, transaction volume, and close-cycle timing. A small baseline pool produces high-variance estimates; a large pool dilutes the comparable-entity match. Mitigation: stratify the peer pool on the audit-relevant characteristics (revenue size, industry NAICS, ERP system) and weight by similarity.
Synthetic-null baseline. Construct $P^{(0)}$ from first principles based on the entity’s documented business processes — what transactions should happen and in what order under the documented control framework. Advantages: not contaminated by either the entity’s own history or peer-group differences; tests directly against the documented expectation. Failure mode: the calibration cost — building a synthetic-null baseline requires substantial process-walkthrough work upfront. Most defensible when the auditor has already documented the entity’s processes for ICFR purposes (PCAOB AS 2110 §28-49 on understanding the entity’s processes); marginal cost is then small.
For any specific engagement, the choice depends on the scoping question. Detection of process-deviation: synthetic-null is best. Detection of entity-specific anomalies: peer-group. Detection of period-over-period drift: prior-period self-baseline (with composite mitigation against contamination).
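The circular-estimation trap can be seen in a two-state toy case. The sketch below is an idealized illustration under stated assumptions: the matrices `P_clean` and `P_manip` are invented, and the "observed" counts are set to their exact expectations under the manipulated regime rather than sampled.

```python
import numpy as np


def chi2_stat(N: np.ndarray, P0: np.ndarray) -> float:
    """Pearson chi-squared statistic of transition counts N against baseline P0."""
    E = N.sum(axis=1, keepdims=True) * P0
    mask = E > 0
    return float(((N[mask] - E[mask]) ** 2 / E[mask]).sum())


# Hypothetical two-state example: manipulation shifts the first row only.
P_clean = np.array([[0.7, 0.3], [0.4, 0.6]])
P_manip = np.array([[0.5, 0.5], [0.4, 0.6]])

# Idealized current-period counts: 1,000 transitions out of each state,
# distributed exactly as the manipulated regime predicts.
N_current = np.array([[1000], [1000]]) * P_manip

print(chi2_stat(N_current, P_clean))  # large: a clean baseline exposes the shift
print(chi2_stat(N_current, P_manip))  # zero: a contaminated baseline absorbs it
```

If the prior period was posted under the same manipulated regime, the self-baseline converges to `P_manip` and the test has nothing left to detect, which is why the composite and peer-group options matter.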
Building each baseline type from data:
import numpy as np

def transition_matrix(sequence: list[str], states: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Empirical first-order transition matrix from a state sequence.

    Returns (P_hat, N) where P_hat is row-stochastic and N is the count matrix.
    Rows for states that never occur as a source are left as all zeros.
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    N = np.zeros((n, n), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[idx[prev], idx[curr]] += 1
    row_sums = N.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        P_hat = np.where(row_sums > 0, N / row_sums, 0.0)
    assert np.allclose(P_hat.sum(axis=1)[row_sums.flatten() > 0], 1.0), \
        "Row-stochasticity violated"
    return P_hat, N

def _renormalize_rows(P: np.ndarray) -> np.ndarray:
    """Rescale each non-empty row to sum to 1; all-zero rows stay zero."""
    row_sums = P.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(row_sums > 0, P / row_sums, 0.0)

def fit_prior_period_baseline(prior_sequences: list[list[str]], states: list[str]) -> np.ndarray:
    """Composite baseline from N prior periods (dilutes single-period contamination)."""
    P_list = [transition_matrix(seq, states)[0] for seq in prior_sequences]
    # A state absent from one period contributes an all-zero row, so the plain
    # mean can sum to less than 1; re-normalize to keep the baseline row-stochastic.
    return _renormalize_rows(np.mean(P_list, axis=0))

def fit_peer_group_baseline(peer_sequences: list[list[str]], states: list[str],
                            weights: list[float] | None = None) -> np.ndarray:
    """Stratified peer-group baseline; optional similarity weights per peer."""
    P_list = [transition_matrix(seq, states)[0] for seq in peer_sequences]
    if weights is None:
        return _renormalize_rows(np.mean(P_list, axis=0))
    w = np.array(weights, dtype=float) / np.sum(weights)
    # Weighted average over the peer axis, then the same re-normalization as above
    return _renormalize_rows(np.tensordot(w, np.stack(P_list), axes=1))

def fit_synthetic_null_baseline(documented_process: dict[str, dict[str, float]],
                                states: list[str]) -> np.ndarray:
    """Construct P^(0) from documented expected transition probabilities."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))
    for from_state, transitions in documented_process.items():
        for to_state, prob in transitions.items():
            P[idx[from_state], idx[to_state]] = prob
    # Re-normalize to enforce row-stochasticity
    return _renormalize_rows(P)
Period-end transition modeling
Almost every entity’s posting practice exhibits a structural shift in the last three to five business days of each reporting period. Adjusting entries cluster (per PCAOB AS 2401 §60, journal-entry testing specifically targets these); reclassifications occur; consolidating entries are posted; accruals are trued up. The shift is legitimate, not anomalous. A first-order Markov chain fitted across the whole period averages the close-cycle dynamics with the mid-period dynamics and produces a baseline that fits neither well.
Three modeling approaches handle the close-cycle shift.
Separate transition matrices per period window. Fit one matrix for “mid-period” (say, day 1 through five days before period end) and a second for “close-cycle” (the last 5 days). Test each against its own window-specific baseline. Advantages: simplest to implement; cleanly separates the two dynamics. Disadvantages: doubles the parameter count (and with it, roughly, the data requirement for chi-squared validity); the day-cut is arbitrary.
Time-of-period as a covariate. Augment the model with a continuous “fraction of period elapsed” covariate and fit a transition kernel that depends on it. Advantages: smoothly handles the shift without an arbitrary day-cut. Disadvantages: moves the model out of the time-homogeneous Markov framework into a covariate-dependent (time-inhomogeneous) transition model; harder to communicate to engagement teams unfamiliar with the formalism.
Hierarchical model. Treat the close-cycle shift as a known latent regime (mid-period vs. close-cycle) with a documented prior on the shift point (typically days 22-25 of a 30-day period, or days 60-65 of a 90-day period). Fit two transition matrices conditional on the latent regime, with the regime indicator estimated from the data. This is a constrained version of the Hidden Markov Model that Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials develops in full generality.
For most production audit engagements, the separate-matrices approach with composite mid-period baseline is the right starting point. It requires no machinery beyond what First-Order Markov Modeling for Transaction-Stream Analysis in Audit introduced, and it captures the bulk of the close-cycle effect at the cost of roughly doubling the data requirement.
The split in code:
def split_period(je: pd.DataFrame, period_end_days: int = 5) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split journal entries into mid-period vs close-cycle by business-day position.

    Assumes je has a 'period_end_offset_days' column counting business days
    until the end of the reporting period (0 = period end, increasing earlier).
    """
    close_cycle = je[je["period_end_offset_days"] < period_end_days]
    mid_period = je[je["period_end_offset_days"] >= period_end_days]
    return mid_period, close_cycle
Each window is then fitted and tested against its own baseline independently.
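The boundary behavior is worth confirming on a toy frame: with a five-day close window, offsets 0 through 4 fall in the close cycle and offset 5 is the last mid-period day. The sketch below repeats the same comparison inline so it runs on its own; the five sample entries are invented for illustration.

```python
import pandas as pd

# Five hypothetical entries with their business-day distance from period end
je = pd.DataFrame({
    "account_id": ["4000", "1100", "2100", "6000", "2100"],
    "period_end_offset_days": [12, 5, 4, 1, 0],
})

period_end_days = 5
in_close = je["period_end_offset_days"] < period_end_days  # offsets 0..4
mid_period, close_cycle = je[~in_close], je[in_close]

print(len(mid_period), len(close_cycle))  # 2 3
```

When several periods are stacked into one DataFrame, each period's windows should be encoded separately, so that the last entry of one month and the first entry of the next are not counted as a transition.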
Sample-size diagnostics
The chi-squared approximation requires a minimum expected count per cell. The conventional rule of thumb (widely known as Cochran’s rule; cited here from Hawkins 1980, p. 145) is:
$$E_{ij} \geq 5 \text{ for at least 80\% of cells, and } E_{ij} \geq 1 \text{ for all cells.}$$
Violation is the dominant practical failure mode in production audit work, particularly with finer state-space encodings. The diagnostic and three responses, in code:
def cell_population_diagnostic(N: np.ndarray, P_baseline: np.ndarray) -> dict:
    """Return the expected counts, the fraction of testable cells with E < 5,
    and a recommended response (asymptotic chi-squared, cell pooling, or simulation).
    Exact tests are a manual choice and are never recommended automatically."""
    expected = N.sum(axis=1, keepdims=True) * P_baseline
    # Only cells with E > 0 enter the chi-squared statistic, so structural zeros in
    # the baseline and rows with no observations are excluded from the cell count.
    testable = expected > 0
    n_cells = int(testable.sum())
    n_below_5 = int((expected[testable] < 5).sum())
    n_below_1 = int((expected[testable] < 1).sum())
    fraction_below_5 = n_below_5 / n_cells if n_cells else 1.0
    if fraction_below_5 <= 0.20 and n_below_1 == 0:
        recommendation = "chi_squared_asymptotic"
    elif fraction_below_5 < 0.50:
        recommendation = "cell_pooling"
    else:
        recommendation = "simulation_based"  # asymptotic chi-squared unreliable
    return {
        "expected": expected,
        "n_cells": n_cells,
        "n_cells_E_lt_5": n_below_5,
        "n_cells_E_lt_1": n_below_1,
        "fraction_E_lt_5": fraction_below_5,
        "recommendation": recommendation,
    }

def simulation_pvalue(observed_chi2: float, P_baseline: np.ndarray,
                      n_transitions: int, n_states: int,
                      B: int = 999, seed: int = 42) -> float:
    """Simulation-based p-value: generate B replicate sequences from the baseline
    and return the fraction of replicate chi-squared statistics >= observed."""
    rng = np.random.default_rng(seed)
    states_list = list(range(n_states))
    chi2_replicates = np.empty(B)
    for b in range(B):
        # Generate a replicate with exactly n_transitions transitions, starting in
        # state 0 (assumes every baseline row sums to 1).
        seq = [0]
        for _ in range(n_transitions):
            seq.append(int(rng.choice(n_states, p=P_baseline[seq[-1]])))
        # Compute chi-squared for this replicate against the same baseline
        _, N_rep = transition_matrix(seq, states_list)
        E_rep = N_rep.sum(axis=1, keepdims=True) * P_baseline
        mask = E_rep > 0
        chi2_replicates[b] = ((N_rep[mask] - E_rep[mask]) ** 2 / E_rep[mask]).sum()
    # The +1 terms count the observed statistic as one of the B + 1 draws
    return float((chi2_replicates >= observed_chi2).sum() + 1) / (B + 1)
The three responses:
Cell pooling. Combine sparse cells with their row’s largest-expectation cell into a single combined cell, and adjust the degrees of freedom downward by the number of pooled cells. First-Order Markov Modeling for Transaction-Stream Analysis in Audit’s chi2_with_pooling function (reproduced in the companion notebook) handles this automatically.
Exact tests. Use Fisher’s exact test (or its multinomial generalization) instead of the chi-squared approximation. Computationally feasible for small cell counts; loses tractability when many cells require exact treatment.
Simulation-based p-values. Generate $B$ replicate transition sequences from the baseline matrix, compute the chi-squared statistic on each, and use the empirical distribution of statistics as the reference distribution. $B = 999$ replicates is the conventional default; $B = 9{,}999$ for engagements where p-value precision matters in the third decimal. The simulation_pvalue function above implements this.
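The pooling rule can be sketched in a few lines. This is an illustrative reading of the description above, not the series’ chi2_with_pooling routine: the name `pooled_chi2`, the default threshold of 5, and the handling of zero-expectation cells (skipped) are assumptions.

```python
import numpy as np


def pooled_chi2(N: np.ndarray, P0: np.ndarray, threshold: float = 5.0) -> tuple[float, int]:
    """Chi-squared statistic and df after pooling sparse cells row by row."""
    E = N.sum(axis=1, keepdims=True) * P0
    stat, df = 0.0, 0
    for n_row, e_row in zip(N.astype(float), E):
        live = e_row > 0                     # skip structural zeros and empty rows
        if not live.any():
            continue
        n_live, e_live = n_row[live], e_row[live]
        target = int(np.argmax(e_live))      # the row's largest-expectation cell
        sparse = e_live < threshold
        sparse[target] = False               # the target absorbs; it is never pooled away
        keep = ~sparse
        n_pool, e_pool = n_live[keep], e_live[keep]
        t = int(keep[:target].sum())         # position of the target among kept cells
        n_pool[t] += n_live[sparse].sum()
        e_pool[t] += e_live[sparse].sum()
        stat += float(((n_pool - e_pool) ** 2 / e_pool).sum())
        df += e_pool.size - 1                # one df lost per pooled cell
    return stat, df


# One row of 100 transitions; only the last cell (E = 3) is sparse.
N = np.array([[60, 30, 8, 2]])
P0 = np.array([[0.6, 0.3, 0.07, 0.03]])
stat, df = pooled_chi2(N, P0)
print(round(stat, 4), df)
```

Pooling trades localization for validity: a deviation in a pooled cell is folded into the dominant cell of its row, so the test can no longer attribute it to the specific transition.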
When none of the three responses produces a defensible test result, the engagement team should fall back to simpler diagnostics — Frobenius distance from baseline (which has no minimum-cell-count requirement), or one of the analytical-procedure techniques from PCAOB AS 2305 §10-17. The first-order Markov framework is not the only tool; the engagement team’s responsibility is to use whichever tool the data actually supports.
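A minimal sketch of the Frobenius fallback follows, using invented two-state matrices. The function name `frobenius_distance` is an assumption for illustration.

```python
import numpy as np


def frobenius_distance(P_hat: np.ndarray, P0: np.ndarray) -> float:
    """Square root of the sum of squared cell-wise differences between two matrices."""
    return float(np.linalg.norm(P_hat - P0, ord="fro"))


# Hypothetical baseline and observed matrices
P0 = np.array([[0.7, 0.3], [0.4, 0.6]])
P_hat = np.array([[0.6, 0.4], [0.4, 0.6]])
print(round(frobenius_distance(P_hat, P0), 4))  # 0.1414
```

The distance has no built-in reference distribution, so it only becomes interpretable when compared against something, for example the same distance computed for prior clean periods or for replicates simulated from the baseline.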
Worked example
Consider a multi-entity audit portfolio of 20 entities across three industries (manufacturing, technology services, healthcare). Subledger-granularity encoding (15 states). Three months of monthly journal-entry data per entity (~6,000 entries per entity per month).
The portfolio-level test runs $m = 120$ chi-squared tests (20 entities × 3 months × 2 windows per entity-month). Running each at the conventional 5% threshold without correction produces an expected $m \cdot 0.05 = 6$ false rejections under the null, too many for the engagement team to investigate one by one. Holm-Bonferroni step-down correction addresses this. In plain English: sort the $m$ p-values from smallest to largest, then walk down the list applying progressively relaxed thresholds. The smallest p-value faces the strictest threshold, $\alpha/m$; the second-smallest faces $\alpha/(m-1)$; and so on. The procedure stops at the first p-value that exceeds its threshold: every hypothesis before that point is rejected, and that hypothesis and every one after it are not. Formally, for ordered p-values $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$, the procedure rejects $H_{(i)}$ if and only if:
$$p_{(j)} \leq \frac{\alpha}{m - j + 1} \text{ for all } j \leq i$$
Holm-Bonferroni is uniformly more powerful than plain Bonferroni (which uses $\alpha/m$ for every test) — it never has fewer rejections and often has more. It is the standard family-wise error-rate procedure for finite test families in modern audit-analytics practice.
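The step-down rule is short enough to write by hand, which makes the thresholds concrete. The sketch below is a plain NumPy illustration with a hypothetical function name, `holm_reject`, and four invented p-values; production code would normally rely on a library routine.

```python
import numpy as np


def holm_reject(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean rejection mask under the Holm-Bonferroni step-down rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(np.argsort(p)):   # rank 0 is the smallest p-value
        if p[i] <= alpha / (m - rank):         # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                              # first failure stops the procedure
    return reject


p_vals = [0.001, 0.04, 0.03, 0.015]
print(holm_reject(p_vals).tolist())  # [True, False, False, True]
```

In this example plain Bonferroni, with a fixed threshold of 0.05 / 4 = 0.0125, would reject only the first hypothesis; Holm also rejects the one with p = 0.015 because its threshold has relaxed to 0.05 / 3.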
The full pipeline, reproducible end-to-end:
import numpy as np
import pandas as pd
from scipy.stats import chi2 as chi2_dist
from statsmodels.stats.multitest import multipletests

# Global seed; it does not affect the default_rng generators used below,
# which are seeded per entity-month.
np.random.seed(42)
STATES = list(SUBLEDGER_MAP.keys())  # 15 states (strings)
STATE_INDEX = {s: i for i, s in enumerate(STATES)}
N_ENTITIES = 20
N_MONTHS = 3
N_ENTRIES_PER_ENTITY_MONTH = 6000
CLOSE_CYCLE_FRACTION = 1 / 6  # last ~5 of 30 days

def _row(weights: dict[str, float]) -> np.ndarray:
    """Build a row-stochastic vector from a {state: weight} dict (unlisted cells default to 0, then normalize)."""
    row = np.zeros(len(STATES))
    for s, w in weights.items():
        row[STATE_INDEX[s]] = w
    return row / row.sum()
# Realistic mid-period baseline. Revenue cycle (revenue→ar→cash), purchase cycle
# (inventory→cogs and cash→ap→opex), and capex (fixed_assets→depreciation) dominate.
# Accruals and deferred revenue are minimal during the regular operating window.
P_mid_baseline = np.array([
_row({"cash": 0.15, "ar": 0.20, "ap": 0.20, "opex": 0.15, "inventory": 0.10, "fixed_assets": 0.05, "intercompany": 0.10, "intangibles": 0.05}), # cash
_row({"cash": 0.45, "ar": 0.15, "revenue": 0.25, "deferred_revenue": 0.05, "intercompany": 0.10}), # ar
_row({"cogs": 0.50, "inventory": 0.20, "ap": 0.20, "fixed_assets": 0.05, "intercompany": 0.05}), # inventory
_row({"depreciation": 0.30, "fixed_assets": 0.25, "ap": 0.20, "cash": 0.15, "intercompany": 0.10}), # fixed_assets
_row({"depreciation": 0.40, "intangibles": 0.30, "ap": 0.15, "cash": 0.15}), # intangibles
_row({"cash": 0.55, "ap": 0.10, "opex": 0.20, "inventory": 0.10, "intercompany": 0.05}), # ap
_row({"cash": 0.30, "accrued": 0.10, "opex": 0.40, "ap": 0.15, "intercompany": 0.05}), # accrued
_row({"revenue": 0.50, "cash": 0.20, "ar": 0.15, "deferred_revenue": 0.10, "intercompany": 0.05}), # deferred_revenue
_row({"cash": 0.40, "debt": 0.25, "accrued": 0.20, "opex": 0.10, "intercompany": 0.05}), # debt
_row({"cash": 0.35, "equity": 0.35, "intercompany": 0.20, "debt": 0.10}), # equity
_row({"ar": 0.60, "cash": 0.20, "revenue": 0.10, "deferred_revenue": 0.05, "intercompany": 0.05}), # revenue
_row({"inventory": 0.50, "cogs": 0.20, "ap": 0.15, "intercompany": 0.10, "accrued": 0.05}), # cogs
_row({"cash": 0.35, "ap": 0.30, "accrued": 0.15, "opex": 0.10, "intercompany": 0.10}), # opex
_row({"fixed_assets": 0.40, "depreciation": 0.20, "accrued": 0.20, "intangibles": 0.10, "intercompany": 0.10}), # depreciation
_row({"intercompany": 0.30, "cash": 0.20, "ap": 0.15, "ar": 0.15, "accrued": 0.10, "equity": 0.10}), # intercompany
])
assert np.allclose(P_mid_baseline.sum(axis=1), 1.0)
# Realistic close-cycle baseline. Accruals, deferred revenue, depreciation/amortization,
# and intercompany reclassifications dominate — exactly the entries the close-certification
# controls produce. Note the materially heavier weight on `accrued` and `intercompany`
# across most rows; this is the period-end concentration the model needs to recognize.
P_close_baseline = np.array([
_row({"accrued": 0.20, "cash": 0.15, "ap": 0.15, "opex": 0.15, "intercompany": 0.15, "ar": 0.10, "deferred_revenue": 0.10}), # cash
_row({"deferred_revenue": 0.25, "revenue": 0.20, "accrued": 0.15, "intercompany": 0.15, "cash": 0.15, "ar": 0.10}), # ar
_row({"accrued": 0.20, "cogs": 0.30, "inventory": 0.20, "ap": 0.15, "intercompany": 0.15}), # inventory
_row({"depreciation": 0.40, "accrued": 0.20, "fixed_assets": 0.15, "intercompany": 0.15, "intangibles": 0.10}), # fixed_assets
_row({"depreciation": 0.40, "intangibles": 0.25, "accrued": 0.20, "intercompany": 0.15}), # intangibles
_row({"accrued": 0.30, "ap": 0.20, "cash": 0.20, "opex": 0.15, "intercompany": 0.15}), # ap
_row({"opex": 0.35, "accrued": 0.20, "ap": 0.15, "cash": 0.15, "intercompany": 0.15}), # accrued
_row({"deferred_revenue": 0.30, "revenue": 0.25, "accrued": 0.20, "ar": 0.15, "intercompany": 0.10}), # deferred_revenue
_row({"accrued": 0.30, "debt": 0.20, "cash": 0.20, "opex": 0.15, "intercompany": 0.15}), # debt
_row({"equity": 0.30, "intercompany": 0.25, "accrued": 0.20, "cash": 0.15, "debt": 0.10}), # equity
_row({"deferred_revenue": 0.30, "ar": 0.30, "revenue": 0.15, "accrued": 0.15, "intercompany": 0.10}), # revenue
_row({"accrued": 0.25, "inventory": 0.25, "cogs": 0.20, "ap": 0.15, "intercompany": 0.15}), # cogs
_row({"accrued": 0.35, "ap": 0.20, "opex": 0.15, "cash": 0.15, "intercompany": 0.15}), # opex
_row({"depreciation": 0.25, "fixed_assets": 0.25, "accrued": 0.25, "intangibles": 0.15, "intercompany": 0.10}), # depreciation
_row({"intercompany": 0.35, "accrued": 0.25, "ap": 0.15, "ar": 0.10, "cash": 0.10, "equity": 0.05}), # intercompany
])
assert np.allclose(P_close_baseline.sum(axis=1), 1.0)
# Tying to the engagement's month-end controls: the two baselines correspond directly to
# the entity's documented mid-period operating cycle and its month-end close-cycle review
# control (typically a controller-signed journal-entry batch summary tied to PCAOB AS 2201
# entity-level control testing). Anomalous patterns inside the close window can either
# mean fraud or that the close certification missed a legitimate posting requiring review.
def synthesize_entity_month(P_mid: np.ndarray, P_close: np.ndarray, n: int, close_fraction: float, seed: int) -> tuple[list[str], list[str]]:
    """Generate (mid_period_sequence, close_cycle_sequence) from two baselines."""
    rng = np.random.default_rng(seed)
    n_states = len(STATES)
    n_close = int(n * close_fraction)
    n_mid = n - n_close
    seq_mid_idx = [0]
    for _ in range(n_mid - 1):
        seq_mid_idx.append(int(rng.choice(n_states, p=P_mid[seq_mid_idx[-1]])))
    seq_close_idx = [seq_mid_idx[-1]]  # close cycle starts from where mid-period ended
    for _ in range(n_close - 1):
        seq_close_idx.append(int(rng.choice(n_states, p=P_close[seq_close_idx[-1]])))
    return [STATES[s] for s in seq_mid_idx], [STATES[s] for s in seq_close_idx]

def chi2_test(N: np.ndarray, P_baseline: np.ndarray) -> tuple[float, float]:
    """Standard chi-squared test on transition counts vs a fully specified baseline.

    df is summed row by row: (number of cells with E > 0) - 1 for each row that has
    at least one such cell. Assumes E_ij >= 5 for at least 80% of cells. For
    engagements where small expected counts materially affect the tail, use
    First-Order Markov Modeling for Transaction-Stream Analysis in Audit's
    chi2_with_pooling routine, which pools cells with E < pool_threshold into the
    row's max-E cell and adjusts df accordingly. Observed counts in cells the
    baseline rules out (E == 0) are not scored here.
    """
    E = N.sum(axis=1, keepdims=True) * P_baseline
    mask = E > 0
    chi2_stat = float(((N[mask] - E[mask]) ** 2 / E[mask]).sum())
    # Subtract one df per row that actually contributes cells, not per row of N
    df = int(mask.sum() - mask.any(axis=1).sum())
    p = float(chi2_dist.sf(chi2_stat, df))  # survival function: accurate in the far tail
    return chi2_stat, p
# Run the full diagnostic across all entity-month-window cells.
# Each window is tested against its own period-specific baseline, since the
# legitimate posting pattern differs materially between mid-period and close.
N_WINDOWS_PER_ENTITY_MONTH = 2  # mid + close
m_family = N_ENTITIES * N_MONTHS * N_WINDOWS_PER_ENTITY_MONTH  # 20 * 3 * 2 = 120 tests
results = []
for entity in range(N_ENTITIES):
    for month in range(N_MONTHS):
        seed = 1000 * entity + month
        mid_seq, close_seq = synthesize_entity_month(P_mid_baseline, P_close_baseline,
                                                     N_ENTRIES_PER_ENTITY_MONTH,
                                                     CLOSE_CYCLE_FRACTION, seed)
        for window_label, w_seq, w_baseline in [
            ("mid", mid_seq, P_mid_baseline),
            ("close", close_seq, P_close_baseline),
        ]:
            _, N_obs = transition_matrix(w_seq, STATES)
            chi2_stat, p_val = chi2_test(N_obs, w_baseline)
            results.append({"entity": entity, "month": month,
                            "window": window_label, "chi2": chi2_stat, "p": p_val})

# Holm-Bonferroni correction across the full m=120-test family at family-wise alpha=0.05.
# The family size is N_ENTITIES * N_MONTHS * N_WINDOWS_PER_ENTITY_MONTH; making this
# explicit prevents accidental under-correction when the engagement adds entities or
# windows mid-cycle (e.g., extending the test to a fourth quarter).
results_df = pd.DataFrame(results)
assert len(results_df) == m_family, f"Expected {m_family} tests, got {len(results_df)}"
reject, p_adj, _, _ = multipletests(results_df["p"].values, alpha=0.05, method="holm")
results_df["p_adjusted_holm"] = p_adj
results_df["reject_holm"] = reject

# Per-entity rejection counts (unstack orders the window columns alphabetically: close, mid)
focus_list = (results_df.groupby(["entity", "window"])["reject_holm"]
              .sum().unstack(fill_value=0).reset_index())
focus_list.columns = ["entity", "close_rejections", "mid_rejections"]
focus_list["total_rejections"] = focus_list["mid_rejections"] + focus_list["close_rejections"]
print(focus_list.sort_values("total_rejections", ascending=False).head(10))
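The per-entity-month seeds (1000 * entity + month) make the run deterministic; the global np.random.seed(42) call does not affect the default_rng generators. With the realistic non-uniform baselines above, the code prints a ranked focus list of the ten entities with the most rejections, in the layout below. The counts shown illustrate the format. Because every synthetic sequence here is drawn from the same baseline it is tested against, a run of this exact code should usually show zeros throughout after Holm correction.
    entity  close_rejections  mid_rejections  total_rejections
13      13                 1               1                 2
17      17                 1               0                 1
4        4                 0               1                 1
8        8                 1               0                 1
2        2                 0               0                 0
...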
Entities with zero rejections in both columns receive analytical-procedure-only treatment for the journal-entry-testing requirement of AS 2401 §60. Entities with one or more rejections receive substantive sample expansion targeted at the specific transitions driving the rejection (the cell-level standardized residuals from First-Order Markov Modeling for Transaction-Stream Analysis in Audit’s diagnostic).
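A sketch of how the driving transitions might be ranked follows. It uses Pearson residuals, $(N_{ij} - E_{ij}) / \sqrt{E_{ij}}$, which may differ from the exact diagnostic in the earlier article; the function names, the three-state example, and its counts are assumptions for illustration.

```python
import numpy as np


def pearson_residuals(N: np.ndarray, P0: np.ndarray) -> np.ndarray:
    """Cell-level residuals (N - E) / sqrt(E); cells with E == 0 are left at 0."""
    E = N.sum(axis=1, keepdims=True) * P0
    R = np.zeros(E.shape, dtype=float)
    mask = E > 0
    R[mask] = (N[mask] - E[mask]) / np.sqrt(E[mask])
    return R


def top_cells(R: np.ndarray, states: list[str], k: int = 3) -> list[tuple[str, str, float]]:
    """Return the k transitions with the largest absolute residual."""
    flat = np.argsort(-np.abs(R), axis=None)[:k]
    rows, cols = np.unravel_index(flat, R.shape)
    return [(states[i], states[j], float(R[i, j])) for i, j in zip(rows, cols)]


# Hypothetical window: only the revenue row departs from the baseline.
states = ["revenue", "ar", "cash"]
P0 = np.array([[0.2, 0.5, 0.3], [0.2, 0.2, 0.6], [0.3, 0.3, 0.4]])
N = np.array([[20, 70, 10], [20, 20, 60], [30, 30, 40]])

for src, dst, r in top_cells(pearson_residuals(N, P0), states, k=2):
    print(f"{src} -> {dst}: residual {r:+.2f}")
```

Sample expansion then targets the journal entries behind the highest-ranked transitions rather than the window as a whole.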
In a typical 20-entity portfolio with no fraud present, Holm correction at $\alpha = 0.05$ bounds the probability of even one false-positive rejection across the whole family at roughly 5% (to the extent the chi-squared approximation holds), so most clean runs produce none, and any that do occur are well within the engagement team’s capacity to follow up with additional procedures. With injected fraud (e.g., one entity with anomalous revenue → AR transitions concentrated in the close-cycle window), the rejection signal localizes cleanly to the affected entity-month-window cells and survives the family-wise correction.
Reference points in the published prosecution record
Journal-entry sequence anomalies — the structural signature this article’s method is built to detect — are documented across major financial-statement-fraud prosecutions. Two reference points are particularly instructive for the audit team using this methodology:
- WorldCom (2002). SEC v. WorldCom, Inc., Civil Action No. 02-CV-4963 (S.D.N.Y., complaint filed June 26, 2002, amended November 2002); parallel criminal prosecution United States v. Ebbers, 458 F.3d 110 (2d Cir. 2006) (Bernard J. Ebbers, former CEO; conviction March 15, 2005, sentenced to 25 years); United States v. Sullivan (S.D.N.Y., guilty plea March 2004). The $74-billion WorldCom restatement was, mechanically, a journal-entry-sequence fraud: specific late-period adjusting entries reclassified operating expenses (line costs) as capital expenditures, producing a posting pattern that violated the entity’s documented chart-of-accounts business logic for those transitions. The exact transition cells where the misclassification concentrated — operating-expense → fixed-asset entries in the final close-cycle window — are precisely the kind of localized anomaly that the chi-squared test on mid-vs-close window-specific baselines would flag.
- Par Funding (2020). SEC v. Complete Business Solutions Group, Inc., d/b/a Par Funding, et al., Case No. 9:20-cv-81205 (S.D. Fla., filed July 24, 2020); parallel DOJ criminal prosecution against the principal officers in the Eastern District of Pennsylvania. The $550-million matter was built on merchant cash advance (MCA) receivables; the borrower-side accounting fraud relevant to this article involved reclassifying MCA proceeds as sales revenue and the daily holdback as cost-of-goods-sold or operating expense. The result is a journal-entry-sequence signature with a tight cyclic transition pattern (mca_proceeds → related_party → revenue → cash → daily_holdback → mca_proceeds) that the mixture-of-Markov-chains framework in Markov Mixture Models for Round-Tripping and Lapping Detection detects directly. The first-order framework in this article would flag the elevated transition probability from the cash subledger to the related-party state and from the related-party state to revenue.
Neither case used journal-entry sequence analysis as the sole basis for restatement or enforcement. In both cases, the technique would have functioned as a directional analytic that concentrated substantive-procedure attention on specific entity-period-window cells — exactly the workpaper triage the framework is built to produce.
Production failure modes
Three patterns recur in production deployments and warrant explicit defenses.
Stale baselines. The prior-period baseline drifts with legitimate business change — new product lines, customer-base shifts, accounting-policy adoption. A baseline two or three years old can fit current operations so poorly that the chi-squared test rejects everywhere, regardless of fraud. Defense: review the baseline’s currency at engagement planning; if business activity has materially changed since the baseline period, refresh the baseline using the most recent uncontaminated period available.
Mid-period business changes. Acquisitions, divestitures, ERP-system migrations, and chart-of-accounts restructurings all produce structural breaks within the testing period itself. The Bai-Perron (1998) structural-break test applied to the cell-by-cell transition counts identifies the break points; the engagement response is to fit separate transition matrices for the pre-break and post-break sub-periods. Failing to do this confounds the legitimate change with fraud signals and produces false positives.
Period-end-only schemes. A fraud scheme that operates only at period end can hide inside the legitimate close-cycle structural shift. The mid-period transition matrix looks clean (because the scheme isn’t operating); the close-cycle matrix shows elevated variance (which the auditor attributes to legitimate close-cycle activity). Defense: for entities with prior restatement history or other elevated risk indicators, the close-cycle baseline should be drawn from a peer-group composite rather than the entity’s own prior close cycles. Peer-group calibration removes the entity’s own potentially contaminated history from the comparison.
Bridge to Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials
The period-end transition modeling discussion above treats the close-cycle shift as a known structural feature with a documented timing. In some entities, the shift is itself uncertain — sometimes it occurs days 26-30, sometimes days 22-30, and sometimes a regime-shift signal appears mid-period that does not align with the documented close calendar. When the regime-shift component is itself a quantity to estimate from the data, the framework extends to Hidden Markov Models — the topic of Hidden Markov Models for Earnings-Management Regime Detection in Public-Company Financials. The HMM treats the underlying regime (clean-reporting vs. manipulated-reporting; or mid-period vs. close-cycle) as a latent variable to be inferred from the observed transition pattern, rather than imposed by the engagement team’s documented expectation.
Authority:
Mathematical foundations:
- Norris, J.R. (1997). Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics, §1.3-1.4 (empirical estimation theory).
- Hawkins, D.M. (1980). Identification of Outliers. Chapman and Hall, p. 145 (chi-squared cell-population threshold).
- Bai, J., & Perron, P. (1998). “Estimating and Testing Linear Models with Multiple Structural Changes.” Econometrica, 66(1), 47-78.
- Holm, S. (1979). “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 6(2), 65-70.
Audit standards:
- PCAOB AS 2401 — Consideration of Fraud in a Financial Statement Audit, §60-67 (journal-entry testing).
- PCAOB AS 2305 — Substantive Analytical Procedures, §10-17 (expectation precision).
- PCAOB AS 2110 — Identifying and Assessing Risks of Material Misstatement, §28-49 (understanding the entity’s processes).
- ISA 240 — The Auditor’s Responsibilities Relating to Fraud in an Audit of Financial Statements, §32-33 (international counterpart).
- IRM 4.10.1 — IRS audit techniques (cross-reference to tax-examination practice).
Companion code on GitHub
Runnable Python artifact reproducing this article’s worked example end-to-end under seed=42: stochastic_markov/002_journal_entry_production.py in noahrgreen/dd-tech-lab-companion.
Clone the repo and run with python stochastic_markov/002_journal_entry_production.py.
