Two-Stage Screening: Benford's Law as a Stationary Distribution Combined With First-Order Markov Tests

Benford’s Law and first-order Markov anomaly detection are usually presented as separate tools. Practitioners run each in isolation, get two binary “deviation / no deviation” signals, and then either rely on one tool or combine the two with a simple OR or AND rule. Under independence, the OR-combination nearly doubles the false-positive rate; the AND-combination overlooks anomalies that show up cleanly in only one test. Neither rule yields a single signal with a controlled error rate.

The principled view is that Benford’s Law describes a stationary distribution over the digit space — the long-run frequency the first digit takes if the underlying generation process satisfies Hill’s (1995) scale-invariance hypothesis. A first-order Markov chain describes the transition dynamics over a state space — the conditional probability of moving from one state to another. The two describe orthogonal aspects of the same generative process: Benford’s Law cares about marginal frequencies; the Markov chain cares about sequential structure. A complete screening framework tests both, combines the test-statistics via a controlled family-wise error rate, and produces a single calibrated signal.

This framing aligns with PCAOB AS 2305 (Substantive Analytical Procedures) and the ACFE Fraud Examiners Manual §1.7 (data-driven fraud testing).

Benford’s Law as a stationary distribution

Newcomb (1881) and later Benford (1938) observed that the leading-digit distribution of many empirical datasets follows the logarithmic distribution:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right) \qquad \text{for } d \in \{1, 2, \ldots, 9\}$$

producing $P(1) \approx 0.301$, $P(2) \approx 0.176$, …, $P(9) \approx 0.046$. Hill (1995) provided the theoretical foundation: Benford’s Law is the unique scale-invariant base-independent limit law for the leading-digit distribution under broad regularity conditions on the underlying generation process.
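As a quick numerical check on the formula above, the following sketch tabulates the nine probabilities with NumPy. The function name `benford_probabilities` is illustrative only and does not come from any library.

```python
import numpy as np

def benford_probabilities() -> np.ndarray:
    """Return P(d) = log10(1 + 1/d) for d = 1..9 (illustrative helper)."""
    digits = np.arange(1, 10)
    return np.log10(1.0 + 1.0 / digits)

probs = benford_probabilities()
for d, p in zip(range(1, 10), probs):
    print(f"P({d}) = {p:.3f}")

# The nine terms telescope to log10(10) = 1, so they form a proper distribution.
print(f"sum = {probs.sum():.6f}")
```

The telescoping sum is why no separate normalization constant is needed when these probabilities are used as expected frequencies.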

The connection to Markov-chain theory comes through the stationary-distribution interpretation. A discrete-state Markov chain with transition matrix $P$ has (under irreducibility and aperiodicity, per Norris 1997, §1.7-1.8) a unique stationary distribution $\pi$ satisfying $\pi P = \pi$ and $\sum_d \pi_d = 1$. For the first-digit application, $\pi$ is the long-run frequency the first digit takes. Benford’s Law specifies the value of $\pi$ that the data should converge on under the scale-invariance hypothesis. A chi-squared test of observed first-digit frequencies against Benford’s $\pi$ is therefore the audit-context test of “is this data consistent with the scale-invariance regime.”
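The stationary-distribution condition can be made concrete with a small numerical sketch. The 3-state matrix `P_demo` below is hypothetical and chosen only for illustration; the helper `stationary_distribution` is likewise an assumed name, and it presumes the chain is irreducible and aperiodic so that the solution is unique.

```python
import numpy as np

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Solve pi P = pi with sum(pi) = 1 (assumes a unique stationary distribution)."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 on top of the normalization row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical 3-state transition matrix; each row sums to 1.
P_demo = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

pi_demo = stationary_distribution(P_demo)
print("pi      =", np.round(pi_demo, 4))
print("pi @ P  =", np.round(pi_demo @ P_demo, 4))
print("fixed point:", np.allclose(pi_demo @ P_demo, pi_demo))
```

In the first-digit setting the roles are reversed: Benford’s Law supplies the target $\pi$ directly, and the test asks whether observed frequencies are consistent with it.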

First-order Markov as transition dynamics

The First-Order Markov Modeling for Transaction-Stream Analysis in Audit article introduced the first-order Markov framework for transaction-stream anomaly detection. The transition matrix $P$ encodes the conditional dynamics $P_{ij} = P(X_{t+1} = j \mid X_t = i)$. The chi-squared test of observed transition counts against a baseline $P^{(0)}$ measures whether the sequential structure of the data matches the baseline’s structure.

Crucially, the transition-dynamics test is orthogonal to the marginal-frequency test. A dataset can pass Benford’s first-digit test (correct marginal frequencies) while violating expected transition structure (e.g., clean digit-frequencies but a round-tripping cycle in the transaction sequence). Conversely, a dataset can pass the transition-structure test while violating Benford (e.g., correct transitions but inflated round-number bias). The two tests are not redundant.
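A toy illustration of why a marginal-frequency test cannot see sequential structure: the two hypothetical two-state sequences below have identical state frequencies but entirely different transition counts. The sequences and the helper name `transition_counts` are assumptions made for this sketch.

```python
import numpy as np

def transition_counts(sequence: list[int], n_states: int) -> np.ndarray:
    """Count observed i -> j transitions in a state sequence."""
    N = np.zeros((n_states, n_states), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[prev, curr] += 1
    return N

alternating = [0, 1] * 50          # 0, 1, 0, 1, ...
blocked = [0] * 50 + [1] * 50      # fifty 0s followed by fifty 1s

# Marginal frequencies are identical: 50 visits to each state.
print(np.bincount(alternating, minlength=2), np.bincount(blocked, minlength=2))

# Transition structure is not.
print(transition_counts(alternating, 2))  # [[0, 50], [49, 0]]
print(transition_counts(blocked, 2))      # [[49, 1], [0, 49]]
```

Any test that looks only at how often each state occurs treats these two sequences as the same; only a transition-level test separates them.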

Family-wise error rate under multiple testing

Running two tests at $\alpha = 0.05$ each gives a per-test false-positive probability of 5%. Under independence and the OR-combination decision rule, the combined false-positive rate is:

$$P(\text{at least one rejection} \mid H_0 \text{ true for both}) = 1 - (1 - \alpha)^2 \approx 0.0975$$

— nearly double the nominal $\alpha$. Multiple-testing correction restores the family-wise error rate to the nominal level.
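The same arithmetic extends to any number of independent tests. The sketch below, with illustrative function names, shows how quickly the uncorrected OR-rule error rate grows with $m$ and what happens when the nominal $\alpha$ is instead split evenly across the tests.

```python
def or_rule_fwer(alpha: float, m: int) -> float:
    """Family-wise error rate of the OR rule for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def split_alpha_fwer(alpha: float, m: int) -> float:
    """Family-wise error rate when each independent test uses threshold alpha / m."""
    return 1.0 - (1.0 - alpha / m) ** m

for m in (1, 2, 5, 10):
    print(f"m={m:2d}  uncorrected={or_rule_fwer(0.05, m):.4f}  "
          f"split={split_alpha_fwer(0.05, m):.4f}")
```

For $m = 2$ the uncorrected rate is about 0.0975, while the split-threshold rate is about 0.0494, just under the nominal 0.05. Both figures assume independent tests.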

Bonferroni correction uses threshold $\alpha / m$ for each of $m$ tests; conservative but simple. Holm’s step-down procedure (Holm, 1979) sorts p-values $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$ and rejects $H_{(i)}$ if and only if:

$$p_{(j)} \leq \frac{\alpha}{m - j + 1} \quad \text{for all } j \leq i$$

Holm dominates Bonferroni — never less powerful, often strictly more powerful — and is the standard choice for small $m$. Benjamini-Hochberg correction (Benjamini & Hochberg, 1995) controls the false-discovery rate (FDR) rather than the family-wise error rate; appropriate when tests are conducted across many entities or many periods (portfolio-level screening) and individual false positives are acceptable as long as their proportion is bounded.
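To make the step-down rule concrete, here is a minimal pure-Python sketch of Holm next to Bonferroni. The function names and the two p-values are hypothetical; this is a hand-rolled illustration of the rule stated above, not a replacement for a library routine.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: one reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value (j = 1)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first failure: all larger p-values are retained too
    return reject

def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Bonferroni: every test uses the same threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p_demo = [0.030, 0.020]  # hypothetical p-values for two tests
print("Bonferroni:", bonferroni_reject(p_demo))  # [False, True]
print("Holm:      ", holm_reject(p_demo))        # [True, True]
```

With $m = 2$, the smaller p-value is compared against $\alpha/2 = 0.025$ and, once it passes, the larger one is compared against the full $\alpha = 0.05$. That relaxed second threshold is where Holm gains power over Bonferroni.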

The combined screening function


import numpy as np
import pandas as pd
from scipy.stats import chisquare
from scipy.stats import chi2 as chi2_dist
from statsmodels.stats.multitest import multipletests

def benford_first_digit_test(values: np.ndarray, alpha: float = 0.05) -> dict:
    """Chi-squared goodness-of-fit test against Benford expected frequencies."""
    expected_p = np.array([np.log10(1 + 1.0 / d) for d in range(1, 10)])
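    # Keep finite, nonzero amounts; zero, NaN, and inf carry no leading digit.
    nonzero = values[np.isfinite(values) & (values != 0)]
    # Scientific notation exposes the first significant digit even for |v| < 1,
    # where int(abs(v)) would truncate to 0 and silently drop the observation.
    first_digits = np.array([int(f"{abs(float(v)):.15e}"[0]) for v in nonzero], dtype=int)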
    observed = np.array([(first_digits == d).sum() for d in range(1, 10)])
    expected = expected_p * observed.sum()

    chi2_stat, p_value = chisquare(observed, expected)
    return {
        "test": "Benford first-digit",
        "chi2_statistic": float(chi2_stat),
        "p_value": float(p_value),
        "n_observations": int(nonzero.size),
        "reject_benford": p_value < alpha,
    }


def markov_transition_test(sequence: list[int], n_states: int,
                            P_baseline: np.ndarray, alpha: float = 0.05) -> dict:
    """Chi-squared test of observed transition counts against baseline.

    Degrees of freedom are computed on the active support only: rows with zero
    observed transitions or zero expected mass do not contribute free cells.
    """
    N = np.zeros((n_states, n_states), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[prev, curr] += 1
    expected = N.sum(axis=1, keepdims=True) * P_baseline
    active_rows = expected.sum(axis=1) > 0
    mask = (expected > 0) & active_rows[:, None]
    observed_active = N[mask]
    expected_active = expected[mask]
    if expected_active.size == 0:
        return {
            "test": "First-order Markov transition",
            "chi2_statistic": 0.0,
            "p_value": 1.0,
            "degrees_of_freedom": 0,
            "n_transitions": int(N.sum()),
            "reject_baseline": False,
        }
    chi2_stat = float(((observed_active - expected_active) ** 2 / expected_active).sum())
    active_cell_count = int(expected_active.size)
    active_row_count = int(active_rows.sum())
    df = max(active_cell_count - active_row_count, 1)
    # Survival function keeps precision in the far tail, where 1 - cdf rounds to 0.
    p_value = chi2_dist.sf(chi2_stat, df)
    return {
        "test": "First-order Markov transition",
        "chi2_statistic": chi2_stat,
        "p_value": float(p_value),
        "degrees_of_freedom": df,
        "n_transitions": int(N.sum()),
        "reject_baseline": p_value < alpha,
    }


def two_stage_screen(values: np.ndarray, sequence: list[int], n_states: int,
                      P_baseline: np.ndarray, alpha: float = 0.05,
                      correction: str = "holm") -> dict:
    """Combined Benford + Markov screening with family-wise correction.

    correction: 'holm' (Holm-Bonferroni; default), 'bonferroni',
                'fdr_bh' (Benjamini-Hochberg FDR control)
    """
    benford = benford_first_digit_test(values, alpha)
    markov = markov_transition_test(sequence, n_states, P_baseline, alpha)
    pvals = np.array([benford["p_value"], markov["p_value"]])

    reject, pvals_corrected, _, _ = multipletests(pvals, alpha=alpha, method=correction)

    overall_reject = bool(reject.any())
    return {
        "stage_1_benford": benford,
        "stage_2_markov": markov,
        "raw_p_values": pvals.tolist(),
        "adjusted_p_values": pvals_corrected.tolist(),
        "correction_method": correction,
        "any_test_rejects": overall_reject,
        "decision": "investigate" if overall_reject else "no_action",
    }

Worked example: synthetic 5,000-transaction file

The companion code generates a synthetic 5,000-transaction file with two injected anomalies:

A round-number-bias anomaly (a fraud scheme that disproportionately produces transactions of exactly \$1,000 or \$10,000), which should fail the Benford test
A round-tripping cycle in the counterparty-transition sequence, which should fail the Markov test


import numpy as np
from scipy.stats import chisquare
from scipy.stats import chi2 as chi2_dist
from statsmodels.stats.multitest import multipletests

np.random.seed(42)  # legacy global seed; not used by the default_rng generators below

N_TRANSACTIONS = 5000
N_STATES = 5  # transaction-counterparty states (vendor, customer, intercompany, expense, asset)
STATE_LABELS = ["vendor", "customer", "intercompany", "expense", "asset"]

def benford_first_digit_test(values: np.ndarray, alpha: float = 0.05) -> dict:
    expected_p = np.array([np.log10(1 + 1.0 / d) for d in range(1, 10)])
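    # Keep finite, nonzero amounts; zero, NaN, and inf carry no leading digit.
    nonzero = values[np.isfinite(values) & (values != 0)]
    # Scientific notation exposes the first significant digit even for |v| < 1,
    # where int(abs(v)) would truncate to 0 and silently drop the observation.
    first_digits = np.array([int(f"{abs(float(v)):.15e}"[0]) for v in nonzero], dtype=int)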
    observed = np.array([(first_digits == d).sum() for d in range(1, 10)])
    expected = expected_p * observed.sum()
    chi2_stat, p_value = chisquare(observed, expected)
    return {
        "chi2_statistic": float(chi2_stat),
        "p_value": float(p_value),
        "reject_benford": p_value < alpha,
    }

def markov_transition_test(sequence: list[int], n_states: int,
                            P_baseline: np.ndarray, alpha: float = 0.05) -> dict:
    N = np.zeros((n_states, n_states), dtype=int)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        N[prev, curr] += 1
    expected = N.sum(axis=1, keepdims=True) * P_baseline
    active_rows = expected.sum(axis=1) > 0
    mask = (expected > 0) & active_rows[:, None]
    observed_active = N[mask]
    expected_active = expected[mask]
    if expected_active.size == 0:
        return {
            "chi2_statistic": 0.0,
            "p_value": 1.0,
            "degrees_of_freedom": 0,
            "reject_baseline": False,
        }
    chi2_stat = float(((observed_active - expected_active) ** 2 / expected_active).sum())
    active_cell_count = int(expected_active.size)
    active_row_count = int(active_rows.sum())
    df = max(active_cell_count - active_row_count, 1)
    # Survival function keeps precision in the far tail, where 1 - cdf rounds to 0.
    p_value = chi2_dist.sf(chi2_stat, df)
    return {
        "chi2_statistic": chi2_stat,
        "p_value": float(p_value),
        "degrees_of_freedom": df,
        "reject_baseline": p_value < alpha,
    }

def two_stage_screen(values: np.ndarray, sequence: list[int], n_states: int,
                      P_baseline: np.ndarray, alpha: float = 0.05,
                      correction: str = "holm") -> dict:
    benford = benford_first_digit_test(values, alpha)
    markov = markov_transition_test(sequence, n_states, P_baseline, alpha)
    pvals = np.array([benford["p_value"], markov["p_value"]])
    reject, pvals_corrected, _, _ = multipletests(pvals, alpha=alpha, method=correction)
    return {
        "stage_1_benford": benford,
        "stage_2_markov": markov,
        "raw_p_values": pvals.tolist(),
        "adjusted_p_values": pvals_corrected.tolist(),
        "any_test_rejects": bool(reject.any()),
        "decision": "investigate" if bool(reject.any()) else "no_action",
    }

def generate_clean_data(n: int, seed: int) -> tuple[np.ndarray, list[int], np.ndarray]:
    """Clean values (Benford-compliant), clean counterparty sequence, and the generating matrix."""
    rng = np.random.default_rng(seed)
    # Benford-compliant values: log-uniform distribution over 100-100,000
    log_values = rng.uniform(np.log10(100), np.log10(100000), n)
    values = 10 ** log_values
    # Counterparty transitions: realistic vendor/customer/expense baseline
    P_clean = np.array([
        [0.10, 0.20, 0.05, 0.55, 0.10],  # vendor → mostly expense
        [0.20, 0.10, 0.05, 0.10, 0.55],  # customer → mostly asset (AR)
        [0.05, 0.05, 0.10, 0.40, 0.40],  # intercompany → expense or asset
        [0.55, 0.10, 0.05, 0.20, 0.10],  # expense → mostly vendor
        [0.10, 0.55, 0.05, 0.10, 0.20],  # asset → mostly customer
    ])
    sequence = [int(rng.integers(N_STATES))]
    for _ in range(n - 1):
        sequence.append(int(rng.choice(N_STATES, p=P_clean[sequence[-1]])))
    return values, sequence, P_clean

# Generate clean baseline
clean_values, clean_seq, P_baseline = generate_clean_data(N_TRANSACTIONS, seed=42)

# Inject Benford-violating values: 200 transactions at exact $1,000 / $10,000
rng_inject = np.random.default_rng(43)
inject_idx = rng_inject.choice(N_TRANSACTIONS, 200, replace=False)
clean_values[inject_idx[:100]] = 1000.0
clean_values[inject_idx[100:]] = 10000.0

# Inject Markov-violating sequence: 100-step round-tripping cycle (vendor → customer → intercompany → expense → vendor → ...)
cycle_pattern = [0, 1, 2, 3] * 25
inject_start = 2000
clean_seq[inject_start:inject_start + 100] = cycle_pattern

# Run combined screen
result = two_stage_screen(clean_values, clean_seq, N_STATES, P_baseline, alpha=0.05, correction="holm")

print(f"Benford p-value: {result['stage_1_benford']['p_value']:.6f} (reject: {result['stage_1_benford']['reject_benford']})")
print(f"Markov p-value:  {result['stage_2_markov']['p_value']:.6f} (reject: {result['stage_2_markov']['reject_baseline']})")
print(f"Raw p-values: {result['raw_p_values']}")
print(f"Holm-adjusted p-values: {result['adjusted_p_values']}")
print(f"Combined decision: {result['decision']}")

# Clean-file false-positive comparison: 1,000 clean files of 2,000 transactions each
n_simulations = 1000
or_rejections = 0
holm_rejections = 0
for sim_seed in range(n_simulations):
    sim_values, sim_seq, sim_baseline = generate_clean_data(2000, seed=sim_seed)
    sim_result = two_stage_screen(sim_values, sim_seq, N_STATES, sim_baseline, alpha=0.05)
    benford_p = sim_result["stage_1_benford"]["p_value"]
    markov_p = sim_result["stage_2_markov"]["p_value"]
    if benford_p < 0.05 or markov_p < 0.05:
        or_rejections += 1
    if sim_result["any_test_rejects"]:
        holm_rejections += 1

print(f"OR-combination false-positive rate: {or_rejections / n_simulations:.4f}")
print(f"Holm-corrected false-positive rate: {holm_rejections / n_simulations:.4f}")

With seed=42, the deterministic output produces both Benford and Markov rejections at $\alpha = 0.05$ even after Holm correction — both anomalies are detected. Running the same screen on a clean baseline (no injected anomalies) produces non-rejection on both tests with high probability, as expected.

OR-combination vs principled correction: a comparison

The OR-combination false-positive rate empirically lands near $0.0975$ (the theoretical rate under independence); the Holm-corrected rate stays at or just below the nominal $0.05$. The principled-correction approach reduces false positives by ~50% relative to naive OR-combination.

Operational integration: workpaper template

The two-stage screen produces a workpaper-ready audit-evidence artifact:


TWO-STAGE SCREENING — TRANSACTION FILE [REDACTED-FILE-ID]
Date of analysis: [DATE]
Auditor: [NAME], [CREDENTIALS]

Stage 1 — Benford's Law first-digit test
  H_0: First-digit frequencies match Benford's logarithmic distribution
  Test statistic (chi-squared, df=8): [VALUE]
  p-value: [VALUE]
  Decision: [REJECT / FAIL TO REJECT] at α = 0.05

Stage 2 — First-order Markov transition test
  H_0: Counterparty-transition matrix matches prior-period baseline
  Test statistic (chi-squared, df=[VALUE]): [VALUE]
  p-value: [VALUE]
  Decision: [REJECT / FAIL TO REJECT] at α = 0.05

Multiple-testing correction
  Method: Holm-Bonferroni (m = 2)
  Adjusted p-values: [VALUE], [VALUE]
  Combined decision: [INVESTIGATE / NO ACTION]

If INVESTIGATE: substantive procedures focus on:
  - [If Benford rejected] Round-number bias subset; threshold-avoidance candidates
  - [If Markov rejected] Specific transition cells with standardized residuals > 2

This template satisfies PCAOB AS 2305 §10-17 (expectation precision), AS 2401 §60 (journal-entry testing), and audit-evidence documentation requirements under AS 1215.
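The template’s Markov follow-up line refers to transition cells with standardized residuals above 2. One way to compute such a list is sketched below, under the assumption that “standardized residual” means the Pearson residual $(N_{ij} - E_{ij}) / \sqrt{E_{ij}}$. The function name `flag_transition_cells` and the demo counts are hypothetical.

```python
import numpy as np

def flag_transition_cells(N, P_baseline, threshold: float = 2.0):
    """Return Pearson residuals and the (row, col, residual) cells beyond the threshold.

    Assumes 'standardized residual' means (observed - expected) / sqrt(expected).
    """
    N = np.asarray(N, dtype=float)
    expected = N.sum(axis=1, keepdims=True) * np.asarray(P_baseline, dtype=float)
    residuals = np.zeros_like(expected)
    # Cells with zero expected mass are left at 0 rather than dividing by zero.
    np.divide(N - expected, np.sqrt(expected), out=residuals, where=expected > 0)
    rows, cols = np.where(np.abs(residuals) > threshold)
    flagged = [(int(i), int(j), float(residuals[i, j])) for i, j in zip(rows, cols)]
    return residuals, flagged

# Hypothetical 2-state observed counts and baseline matrix.
N_demo = np.array([[40, 60], [10, 90]])
P_demo = np.array([[0.5, 0.5], [0.3, 0.7]])

residuals, flagged = flag_transition_cells(N_demo, P_demo)
for i, j, r in flagged:
    print(f"transition {i} -> {j}: residual {r:+.2f}")
```

In this toy case only the second row is flagged: its expected counts are 30 and 70 against observed counts of 10 and 90. The flagged cells, not the overall chi-squared statistic, are what direct the substantive follow-up.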

Failure modes and defenses

Three patterns recur in two-stage screening deployment.

Test dependence. Benford and Markov p-values are not independent in finite samples because the same manipulation can distort both the marginal digit profile and the transition structure. Holm is used here as a conservative family-wise error-rate control device despite that dependence, not because independence is assumed. Defense: report both Holm-corrected and uncorrected p-values; transparency over precision.

Benford-applicability question. Benford’s Law applies to data spanning multiple orders of magnitude; series with artificial cutoffs (transactions capped at an approval limit), small ranges, or integer-only values do not satisfy the scale-invariance hypothesis. Defense: precede the Benford test with an applicability check (range spans at least two orders of magnitude; observations are not externally bounded; values are not artificially padded).
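A minimal sketch of such an applicability check follows. It covers only the two conditions that can be read directly off the amounts: the range spanned, and pile-up at the maximum as a rough proxy for an external cap. The function name, the 1% pile-up threshold, and the demo data are assumptions for illustration; padding and other data-preparation issues still need manual review.

```python
import numpy as np

def benford_applicability(values, min_orders: float = 2.0,
                          max_share_at_cap: float = 0.01) -> dict:
    """Heuristic pre-check before a Benford test (thresholds are illustrative)."""
    v = np.abs(np.asarray(values, dtype=float))
    v = v[np.isfinite(v) & (v > 0)]
    if v.size == 0:
        return {"orders_of_magnitude": 0.0, "share_at_maximum": 0.0, "applicable": False}
    orders = float(np.log10(v.max() / v.min()))
    # Many observations sitting exactly at the maximum suggest an approval limit or cap.
    share_at_max = float(np.mean(v == v.max()))
    return {
        "orders_of_magnitude": orders,
        "share_at_maximum": share_at_max,
        "applicable": bool(orders >= min_orders and share_at_max <= max_share_at_cap),
    }

rng = np.random.default_rng(0)
wide = 10 ** rng.uniform(2, 5, 1000)   # log-uniform amounts from 100 to 100,000
capped = np.minimum(wide, 5000.0)      # hypothetical approval limit at 5,000

print(benford_applicability(wide)["applicable"])
print(benford_applicability(capped)["applicable"])
```

The capped series fails on both counts: its range shrinks below two orders of magnitude and a large share of observations sits exactly at the limit.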

Recap-of-Article-001 dependency. Readers who haven’t read the First-Order Markov Modeling for Transaction-Stream Analysis in Audit article may not have the Markov apparatus context. Defense: include a brief recap of the transition-matrix mechanics in §3 of the article (this article does so); link explicitly to the First-Order Markov Modeling for Transaction-Stream Analysis in Audit article for the full derivation.

Practitioner close

Use the two-stage screen when you have both amount data and a defensible transaction-state sequence, and when the engagement question is whether the file deserves escalation rather than immediate substantive conclusion. Do not use it where the amount field is range-bound or policy-capped, where the sequence encoding is unstable, or where the baseline transition matrix is itself contaminated. A Benford rejection without a Markov rejection usually points you toward amount-shaping or threshold avoidance; a Markov rejection without a Benford rejection usually points you toward engineered workflow cycling or round-tripping. When both reject after Holm correction, the right next step is not “fraud proven” but targeted journal-entry work on the subset that drives the deviations.

Authority:

Benford’s Law and digit theory:

Benford, F. (1938). “The Law of Anomalous Numbers.” Proceedings of the American Philosophical Society, 78(4), 551-572.
Newcomb, S. (1881). “Note on the Frequency of Use of the Different Digits in Natural Numbers.” American Journal of Mathematics, 4(1), 39-40.
Hill, T.P. (1995). “A Statistical Derivation of the Significant-Digit Law.” Statistical Science, 10(4), 354-363.
Nigrini, M.J. (2012). Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection. Wiley.

Multiple-testing correction:

Holm, S. (1979). “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 6(2), 65-70.
Benjamini, Y., & Hochberg, Y. (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B, 57(1), 289-300.

Markov-chain stationary distributions:

Norris, J.R. (1997). Markov Chains. Cambridge University Press. §1.7-1.8 (stationary distributions).

Audit standards:

PCAOB AS 2305 — Substantive Analytical Procedures, §10-17.
PCAOB AS 2401 — Consideration of Fraud in a Financial Statement Audit, §60 (journal-entry testing).
PCAOB AS 1215 — Audit Documentation.
ACFE Fraud Examiners Manual, §1.7 (data-driven fraud testing).


Sheepdog Prosperity Partners LLC
