Random-Walk and Stationarity Tests on Account Reconciliations

A reconciliation account that should oscillate around its expected balance—intercompany clearing, suspense, accrued-payroll roll-forward—but instead wanders without bound has failed its control objective. The variance grows linearly with time; the account cannot self-correct. Whether the cause is error accumulation, intentional manipulation, or an unflagged business change, the stochastic signature is the same: the balance series exhibits a unit root. Detecting that signature before the dollar magnitude becomes material is a substantive analytical procedure under PCAOB AS 2305 §10-17 and aligns with the risk-assessment requirements of AS 2110 §28-49.

The practitioner’s toolkit is unit-root testing. The Augmented Dickey-Fuller (ADF) test (Dickey & Fuller, 1979; Said & Dickey, 1984) tests the null hypothesis of a unit root (drift). The KPSS test (Kwiatkowski, Phillips, Schmidt & Shin, 1992) tests the reverse null—stationarity is the null, drift is the alternative. Combining tests with opposite nulls produces a four-quadrant diagnostic that reduces the false-positive risk of relying on either test alone. The continuous-auditing operational integration follows the Internal Audit Foundation’s Continuous Auditing Practice Guide (2018).

The unit-root null hypothesis and its Markov-chain bridge

Three terms anchor what follows; each is defined in plain English before the symbol appears.

Reconciliation balance $X_t$ — the unreconciled-balance amount on a given account at the end of period $t$. For a cash-clearing account it’s the amount that hasn’t tied back to the bank yet; for an intercompany account it’s the unreconciled-between-entities residual; for a suspense account it’s whatever hasn’t been cleared to its final coding. A healthy reconciliation process keeps $X_t$ small and oscillating around zero. A drifting one lets $X_t$ grow.
Autoregressive specification — the model that says today’s reconciliation balance equals some fraction of yesterday’s plus a random shock. The “autoregressive” part refers to the dependence on prior values; “AR(1)” means we use one lag. Notation: $X_t = \rho X_{t-1} + \epsilon_t$, where $\epsilon_t$ is a random shock (normally distributed with mean zero and variance $\sigma^2$).
Stationarity — the property that the series stays in a stable equilibrium over time, with bounded fluctuations around a long-run mean. The opposite is a process whose fluctuations grow without bound as time goes on (the textbook example is a drunkard’s random walk).

The AR(1) coefficient $\rho$ — pronounced “rho” — controls which regime the reconciliation process is in. Three cases, with audit translations:

$|\rho| < 1$ — stationary. The reconciliation series oscillates around its equilibrium and has bounded long-run variance. Audit meaning: the reconciliation control is working; balances mean-revert.
$\rho = 1$ — unit root (random walk). The series wanders without bound; variance grows linearly with $t$. Audit meaning: the reconciliation control has failed; the balance is drifting and no equilibrium is in sight.
$|\rho| > 1$ — explosive. Divergent trajectory; the balance is blowing up. Audit meaning: a structural problem; an escalation event, not a normal-operations finding.
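
The variance behavior of the stationary and unit-root regimes can be checked by simulation. The sketch below is illustrative only: it assumes unit-variance Gaussian shocks and a zero starting balance, and the function name `cross_sectional_variance` and the path counts are hypothetical choices rather than part of the companion code.

```python
import numpy as np

def cross_sectional_variance(rho: float, n_periods: int = 100,
                             n_paths: int = 5000, seed: int = 0) -> np.ndarray:
    """Simulate AR(1) paths X_t = rho * X_{t-1} + eps_t; return Var(X_t) across paths."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_paths, n_periods))
    x = np.zeros((n_paths, n_periods + 1))  # column 0 holds the starting balance X_0 = 0
    for t in range(1, n_periods + 1):
        x[:, t] = rho * x[:, t - 1] + eps[:, t - 1]
    return x.var(axis=0)  # one variance per period, taken across the simulated paths

var_stationary = cross_sectional_variance(rho=0.4)
var_unit_root = cross_sectional_variance(rho=1.0)

for t in (10, 50, 100):
    print(f"t={t:>3} | rho=0.4 variance={var_stationary[t]:.2f} "
          f"| rho=1.0 variance={var_unit_root[t]:.2f}")
```

Under these assumptions the stationary variance should settle near $\sigma^2 / (1 - \rho^2) = 1 / (1 - 0.16) \approx 1.19$, while the unit-root variance should track $t$.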

The auditor’s preferred state is $|\rho| < 1$: bounded, mean-reverting reconciliation behavior. In Articles 001 and 004, reconciliation states were modeled as a discrete Markov chain with a transition matrix over the alphabet $\{\text{Current}, \text{Aged}, \text{Uncleared}\}$. The continuous AR(1) process connects to that discrete framework: each discrete state corresponds to a quantile region of the AR(1)’s stationary distribution, and the discrete transition matrix is the coarse-grained version of the AR(1) dynamics. When $|\rho| < 1$, both the continuous AR(1) and the discrete chain settle into a well-defined long-run distribution (technical condition: the discrete chain's transition matrix must be irreducible and aperiodic, per Hamilton 1994, Ch. 22). When $\rho = 1$, no stationary distribution exists for either: the continuous process has unbounded variance, and the analogous discrete random walk on the integers is null recurrent when driftless (it returns to every level, but the expected return time is infinite) and transient once drift is present. The unit-root test thus asks the auditor's question directly: is the reconciliation process behaving like a controlled process with a well-defined equilibrium ($|\rho| < 1$), or like a drifting process with no equilibrium ($\rho = 1$)?
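
The coarse-graining described above can be sketched in a few lines. This is a minimal illustration, not the construction used in Articles 001 and 004: it assumes the three states map to equal-probability terciles of a simulated stationary AR(1), and the helper names are hypothetical.

```python
import numpy as np

# State labels come from the article; mapping them to terciles is an assumption.
STATES = ("Current", "Aged", "Uncleared")

def simulate_ar1(rho: float, n: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + eps[t]
    return x

def coarse_grain_transition_matrix(x: np.ndarray, n_states: int = 3) -> np.ndarray:
    """Bin the series into equal-probability quantile regions and count transitions."""
    cut_points = np.quantile(x, np.linspace(0, 1, n_states + 1)[1:-1])
    states = np.searchsorted(cut_points, x)  # integer labels 0 .. n_states - 1
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row-normalize to probabilities

def long_run_distribution(P: np.ndarray, n_iter: int = 500) -> np.ndarray:
    """Iterate pi <- pi P from a uniform start until it settles."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        pi = pi @ P
    return pi

x = simulate_ar1(rho=0.4, n=5000)
P = coarse_grain_transition_matrix(x)
pi = long_run_distribution(P)

print(np.round(P, 3))
print(dict(zip(STATES, np.round(pi, 3))))
```

With tercile bins the long-run distribution should sit close to one third per state, and the diagonal of the estimated matrix should dominate the far corners because a positive $\rho$ makes neighboring periods resemble each other.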

The Augmented Dickey-Fuller test

The Augmented Dickey-Fuller (ADF) test is the standard statistical test for distinguishing the stationary case from the unit-root case. The test takes the reconciliation series as input and returns a number (a t-statistic) that the auditor compares against a tabulated critical value to decide which regime the data supports.

The mechanics: ADF fits a regression that includes a constant, an optional linear trend, the prior-period balance, and several lagged differences. Each term in the regression has a job:

$\alpha$ (the constant): captures any non-zero long-run mean
$\beta t$ (the trend term): captures any deterministic linear drift
$\gamma X_{t-1}$ (the prior balance): this coefficient $\gamma$ is the test target; the null hypothesis “the series has a unit root” corresponds to $\gamma = 0$
$\sum_j \delta_j \Delta X_{t-j}$ (the lagged differences): these absorb any short-term serial correlation in the shocks $\epsilon_t$ that would otherwise contaminate the test

The regression equation:

$$\Delta X_t = \alpha + \beta t + \gamma X_{t-1} + \sum_{j=1}^p \delta_j \Delta X_{t-j} + u_t$$

where $\Delta X_t = X_t - X_{t-1}$ is the period-over-period change and $u_t$ is the regression residual. The test statistic is the t-statistic on the estimated $\hat{\gamma}$. Under the null hypothesis of a unit root, this t-statistic does not follow the usual Student-t distribution that auditors might remember from undergraduate statistics; it follows a non-standard distribution called the Dickey-Fuller distribution, with critical values tabulated by MacKinnon (1996). The practitioner doesn’t need to compute the critical values; statistical software does it. The rejection rule is what matters: if the computed t-statistic is more negative than the MacKinnon critical value at the chosen significance level, reject the unit-root null hypothesis and conclude the series is stationary.
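
To make the regression concrete, the following sketch fits the constant-only version of the equation above ($\beta = 0$) by ordinary least squares with a fixed lag length and returns the t-statistic on $\hat{\gamma}$. It is a hand-rolled illustration under stated assumptions, not the statsmodels implementation: the function name `adf_t_statistic` is hypothetical, and the comparison value of -2.86 is the asymptotic 5% Dickey-Fuller critical value for the constant-only case (finite-sample values are somewhat more negative).

```python
import numpy as np

def adf_t_statistic(x: np.ndarray, p: int = 1) -> float:
    """t-statistic on gamma in dX_t = alpha + gamma*X_{t-1} + sum_j delta_j*dX_{t-j} + u_t."""
    dx = np.diff(x)          # dx[i] = x[i+1] - x[i]
    n = dx.size
    y = dx[p:]               # dependent variable: current-period change
    level = x[p:-1]          # prior-period balance X_{t-1}, aligned with y
    cols = [np.ones(y.size), level]
    cols += [dx[p - j: n - j] for j in range(1, p + 1)]  # lagged differences
    Z = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    sigma2 = resid @ resid / (y.size - Z.shape[1])       # residual variance
    cov = sigma2 * np.linalg.inv(Z.T @ Z)                # OLS coefficient covariance
    return float(coef[1] / np.sqrt(cov[1, 1]))

rng = np.random.default_rng(42)
n_obs = 200
eps = rng.standard_normal(n_obs)
stationary = np.zeros(n_obs)
for t in range(1, n_obs):
    stationary[t] = 0.4 * stationary[t - 1] + eps[t]
random_walk = np.cumsum(rng.standard_normal(n_obs))

ASYMPTOTIC_5PCT_CRITICAL = -2.86  # constant-only case; an approximation for finite samples
t_stationary = adf_t_statistic(stationary)
t_random_walk = adf_t_statistic(random_walk)
print(f"stationary  t={t_stationary:.2f}  reject={t_stationary < ASYMPTOTIC_5PCT_CRITICAL}")
print(f"random walk t={t_random_walk:.2f}  reject={t_random_walk < ASYMPTOTIC_5PCT_CRITICAL}")
```

In practice the library routine should be preferred, since it supplies sample-size-adjusted critical values and p-values; the sketch only shows where the statistic comes from.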

Lag selection (the number $p$ of lagged differences to include) is typically automated by AIC or BIC, two information criteria that penalize over-fitting. The default autolag="AIC" setting in statsmodels’ adfuller is a reasonable starting point for audit deployment.

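One interpretation point that matters: a clean reconciliation process should produce an ADF rejection. With the trend term included, that rejection supports stationarity around a deterministic trend rather than around a fixed mean. The auditor’s concern is the case where the ADF fails to reject, which is consistent with drift but, given the test’s limited power in short series, does not by itself prove it.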


import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

def adf_test_diagnostic(series: pd.Series, alpha: float = 0.05) -> dict:
    """ADF test with AIC-selected lag length and constant + trend regression.

    Returns the test statistic, p-value, and the unit-root rejection decision.
    """
    stat, p, used_lag, n_obs, crit_values, _ = adfuller(
        series.dropna().values, autolag="AIC", regression="ct"
    )
    return {
        "test": "ADF",
        "statistic": float(stat),
        "p_value": float(p),
        "used_lag": int(used_lag),
        "n_observations": int(n_obs),
        "critical_values": {k: float(v) for k, v in crit_values.items()},
        # bool() converts numpy.bool_ to a plain Python bool (JSON-serializable)
        "reject_unit_root": bool(p < alpha),
    }

The KPSS test

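The KPSS test inverts the null. The maintained hypothesis is stationarity; the alternative is a unit root. The test starts from a components model that splits the series into a deterministic trend, a random walk, and a stationary error:

$$X_t = \beta t + r_t + \epsilon_t, \qquad r_t = r_{t-1} + u_t, \quad u_t \sim \mathcal{N}(0, \sigma_u^2)$$

Here the fixed starting value $r_0$ plays the role of the intercept, and $u_t$ denotes the random-walk innovation (not the ADF regression residual). The null $H_0: \sigma_u^2 = 0$ corresponds to deterministic-trend stationarity (the random-walk component $r_t$ has zero variance and collapses to the constant $r_0$); setting $\beta = 0$ gives the constant-only, level-stationarity version. The test statistic is a Lagrange-multiplier statistic built from the partial-sum process of the regression residuals; its asymptotic distribution under $H_0$ is non-standard (KPSS-tabulated).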
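
The partial-sum construction can be written out directly. The sketch below computes the constant-only (level-stationarity) statistic with a Bartlett-kernel long-run variance and a fixed number of lags; the function name `kpss_level_statistic` and the choice of four lags are assumptions for illustration, and the 5% critical value of 0.463 is the asymptotic level-stationarity value from the KPSS tables.

```python
import numpy as np

def kpss_level_statistic(x: np.ndarray, n_lags: int = 4) -> float:
    """KPSS statistic for level stationarity: sum(S_t^2) / (n^2 * long-run variance)."""
    n = x.size
    e = x - x.mean()             # residuals from regressing the series on a constant
    S = np.cumsum(e)             # partial-sum process of the residuals
    s2 = e @ e / n               # lag-0 term of the long-run variance
    for k in range(1, n_lags + 1):
        weight = 1.0 - k / (n_lags + 1.0)           # Bartlett kernel weight
        s2 += 2.0 * weight * (e[k:] @ e[:-k]) / n   # weighted lag-k autocovariance
    return float(S @ S / (n ** 2 * s2))

rng = np.random.default_rng(3)
n_obs = 400
eps = rng.standard_normal(n_obs)
stationary = np.zeros(n_obs)
for t in range(1, n_obs):
    stationary[t] = 0.4 * stationary[t - 1] + eps[t]
random_walk = np.cumsum(rng.standard_normal(n_obs))

KPSS_5PCT_LEVEL_CRITICAL = 0.463  # asymptotic value, constant-only model
stat_stationary = kpss_level_statistic(stationary)
stat_random_walk = kpss_level_statistic(random_walk)
print(f"stationary  statistic={stat_stationary:.3f}")
print(f"random walk statistic={stat_random_walk:.3f}")
print(f"5% critical value={KPSS_5PCT_LEVEL_CRITICAL}")
```

The direction of rejection is the reverse of ADF: large values of the statistic count against stationarity, because a wandering series produces partial sums that stay far from zero.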


def kpss_test_diagnostic(series: pd.Series, alpha: float = 0.05,
                          regression: str = "c") -> dict:
    """KPSS test with auto-selected lag and constant (or constant + trend) regression.

    Args:
        regression: 'c' (constant only, default) or 'ct' (constant + trend).
                    Constant-only is appropriate for reconciliation accounts expected
                    to oscillate around a fixed mean with no structural trend.
    """
    stat, p, used_lag, crit_values = kpss(
        series.dropna().values, regression=regression, nlags="auto"
    )
    return {
        "test": "KPSS",
        "statistic": float(stat),
        "p_value": float(p),
        "used_lag": int(used_lag),
        "regression": regression,
        "critical_values": {k: float(v) for k, v in crit_values.items()},
        # bool() converts numpy.bool_ to a plain Python bool (JSON-serializable)
        "reject_stationarity": bool(p < alpha),
    }

The four-quadrant diagnostic

ADF and KPSS test opposite nulls. Combining their results produces four possible classifications, and only two of the four are actionable as direct evidence:

ADF rejects unit root	KPSS rejects stationarity	Diagnosis
Yes	No	Stationary (preferred audit state)
No	Yes	Drift / unit root (investigation warranted)
Yes	Yes	Inconclusive-high (likely near a borderline; both tests may be detecting different aspects of the same series; investigate)
No	No	Inconclusive-low (likely too few observations to discriminate; consider longer series or different specification)


def reconciliation_drift_diagnostic(series: pd.Series, alpha: float = 0.05,
                                     kpss_regression: str = "c") -> dict:
    """Combined ADF + KPSS four-quadrant diagnostic for account-balance drift.

    Args:
        kpss_regression: Passed to KPSS test; 'c' is default (constant-only model)
                         for reconciliation accounts with no expected trend.
    """
    adf_result = adf_test_diagnostic(series, alpha)
    kpss_result = kpss_test_diagnostic(series, alpha, regression=kpss_regression)

    if adf_result["reject_unit_root"] and not kpss_result["reject_stationarity"]:
        diagnosis = "stationary"
        action = "no_audit_action"
    elif not adf_result["reject_unit_root"] and kpss_result["reject_stationarity"]:
        diagnosis = "drift_unit_root"
        action = "investigation_warranted"
    elif adf_result["reject_unit_root"] and kpss_result["reject_stationarity"]:
        diagnosis = "inconclusive_high"
        action = "investigate_inconclusive"
    else:
        diagnosis = "inconclusive_low"
        action = "extend_series_or_alternative_test"

    return {
        "adf": adf_result,
        "kpss": kpss_result,
        "diagnosis": diagnosis,
        "action": action,
        "n_observations": int(series.dropna().shape[0]),
    }

Evidence-and-control architecture: mapping stationarity signals to reconciliation controls

PCAOB AS 2305 §14 requires the auditor to evaluate the reliability of data used in substantive analytical procedures. PCAOB AS 2110 §28-49 requires understanding the design and implementation of controls relevant to the audit. The stationarity diagnostic serves both objectives by providing a continuous-monitoring signal for three standard reconciliation controls:

Three-way match (AP/inventory/receiving). When the unmatched-suspense account balance is stationary ($|\rho| < 1$), the control is clearing exceptions at a rate that prevents accumulation. A unit-root signal ($\rho = 1$) indicates systematic under-clearance—exceptions pile up faster than they resolve. The auditor's response is to test the aging distribution and investigate the oldest unmatched items.

Aging thresholds (AR, AP, intercompany). Stationary aging balances within each bucket (0-30, 31-60, 61-90, >90 days) indicate stable turnover and collection patterns. Drift in the >90-day bucket specifically signals a breakdown in the dunning or write-off control. The dollar magnitude may still be immaterial, but the stochastic trajectory predicts future materiality.

Variance investigation triggers (roll-forward reconciliations). Many entities’ policies require investigation of reconciliation variances exceeding a fixed threshold (e.g., \$50k or 5% of balance). Stationarity of the variance series itself, computed as $|X_t - X_{t-1}|$, indicates the threshold is calibrated to the natural volatility. Unit-root behavior in the variance series suggests the threshold has become too permissive relative to the process’s new volatility regime, so the threshold needs recalibration.
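
A simple descriptive companion to the stationarity test on the variance series is to track how often $|X_t - X_{t-1}|$ breaches the fixed threshold in successive windows. The sketch below is a hypothetical illustration: the function name `breach_rate_by_window`, the 12-month window, the threshold of 2.5 (in arbitrary units), and the synthetic series whose shock volatility quadruples midway are all assumptions, not figures from any engagement.

```python
import numpy as np

def breach_rate_by_window(balance: np.ndarray, threshold: float,
                          window: int = 12) -> list[float]:
    """Share of periods in each consecutive window where |X_t - X_{t-1}| exceeds threshold."""
    abs_change = np.abs(np.diff(balance))
    rates = []
    for start in range(0, abs_change.size - window + 1, window):
        chunk = abs_change[start:start + window]
        rates.append(float((chunk > threshold).mean()))
    return rates

# Synthetic balance: AR(1) with rho = 0.4 whose shock volatility jumps after month 24.
rng = np.random.default_rng(7)
n_obs = 49
shock_sd = np.where(np.arange(n_obs) < 25, 1.0, 4.0)
eps = rng.standard_normal(n_obs) * shock_sd
balance = np.zeros(n_obs)
for t in range(1, n_obs):
    balance[t] = 0.4 * balance[t - 1] + eps[t]

rates = breach_rate_by_window(balance, threshold=2.5)
for i, rate in enumerate(rates, start=1):
    print(f"year {i}: breach rate = {rate:.0%}")
```

A breach rate that shifts sharply between windows is a prompt to revisit the threshold; it does not replace the unit-root test on the variance series.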

These mappings convert the abstract stochastic-process property (stationarity) into concrete control-performance evidence admissible under AS 2305. The diagnostic does not replace transaction-level testing but provides the risk-assessment foundation for scoping that testing.

Worked example: synthetic 36-month reconciliation

The code below generates three synthetic 36-month reconciliation series: a clean stationary control, a gradual upward drift, and a single step change. The combined diagnostic is then run on each.


# Standardize RNG approach: all functions use np.random.default_rng() with explicit seeds.
N_MONTHS = 36

def generate_clean_series(n: int, seed: int = 1) -> pd.Series:
    """Stationary AR(1) with rho = 0.4."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.4 * x[t-1] + eps[t]
    return pd.Series(x)

def generate_gradual_drift(n: int, seed: int = 2) -> pd.Series:
    """Random walk with small upward drift."""
    rng = np.random.default_rng(seed)
    return pd.Series(np.cumsum(rng.standard_normal(n) + 0.15))

def generate_step_change(n: int, seed: int = 3, step_month: int = 18, step_size: float = 5.0) -> pd.Series:
    """Stationary AR(1) baseline with a deterministic step jump at step_month.

    Decoupled from generate_clean_series: this function instantiates its own RNG and
    produces its baseline path independently. Changing generate_clean_series no longer
    perturbs the step-change output. The step_month and step_size are exposed for
    sensitivity testing; the canonical seed=3 produces the series printed below.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.4 * x[t-1] + eps[t]
    x[step_month:] += step_size
    return pd.Series(x)

# Run diagnostic on each
clean = generate_clean_series(N_MONTHS)
drift = generate_gradual_drift(N_MONTHS)
step = generate_step_change(N_MONTHS)

for label, series in [("clean", clean), ("gradual_drift", drift), ("step_change", step)]:
    result = reconciliation_drift_diagnostic(series)
    print(f"{label:<15} | diagnosis={result['diagnosis']:<22} "
          f"| ADF p={result['adf']['p_value']:.4f} | KPSS p={result['kpss']['p_value']:.4f}")

# Reproducibility-artifact reference: persist the canonical synthetic series so any
# reader can diff their local run against the published series exactly.
#
#   import pathlib
#   pathlib.Path("synthetic_reconciliation_series.csv").write_text(
#       pd.DataFrame({"month": range(N_MONTHS), "clean": clean.values,
#                       "gradual_drift": drift.values, "step_change": step.values}).to_csv(index=False)
#   )
#
# The first five values of each series under the canonical seeds (clean=1, drift=2, step=3):
#   clean[:5]         ≈ [ 0.0000, 0.6924, 1.2849, -0.1502, -0.4006]
#   drift[:5]         ≈ [ 0.2300, 1.7104, 1.6839, 3.4262, 4.1108]
#   step_change[:5]   ≈ [ 0.0000, 1.0306, 0.6859, 0.6022, 0.7011]

With the specified seeds, the deterministic output classifies the clean series as stationary, the gradual-drift series as drift_unit_root, and the step-change series as inconclusive_high (both tests partially reject because the structural break confuses both null hypotheses simultaneously). The step-change classification is itself diagnostic: an inconclusive-high result on a reconciliation account warrants investigation for one-time accounting events that may have shifted the equilibrium.

Cointegration extension for paired accounts

When two reconciliation accounts should drift together (intercompany payable on one entity and intercompany receivable on its counterparty; revenue and AR; cash and operating expenses), the relevant test is cointegration rather than individual-series stationarity. Engle & Granger (1987) define two non-stationary series as cointegrated if a linear combination of them is stationary, i.e., they share a common stochastic trend.


from statsmodels.tsa.stattools import coint

def paired_account_cointegration(series_a: pd.Series, series_b: pd.Series,
                                    alpha: float = 0.05) -> dict:
    """Engle-Granger cointegration test on a paired-account series."""
    aligned = pd.concat([series_a, series_b], axis=1).dropna()
    if aligned.shape[0] < 25:
        return {"warning": "insufficient_data", "n_observations": aligned.shape[0]}
    stat, p, crit = coint(aligned.iloc[:, 0], aligned.iloc[:, 1])
    return {
        "test": "Engle-Granger cointegration",
        "statistic": float(stat),
        "p_value": float(p),
        "critical_values": {"1%": float(crit[0]), "5%": float(crit[1]), "10%": float(crit[2])},
        # bool() converts numpy.bool_ to a plain Python bool (JSON-serializable)
        "reject_no_cointegration": bool(p < alpha),
        "n_observations": int(aligned.shape[0]),
    }

For paired accounts, cointegration is the auditor’s preferred state—the accounts wander but together, indicating the underlying business relationship still binds them. Lack of cointegration on accounts that should be cointegrated is an audit-significant signal: the economic link has broken (e.g., intercompany payable grows while intercompany receivable stays flat, indicating one entity is booking transactions the counterparty is not).

Cointegration failure modes

Three patterns recur in paired-account testing and require defenses parallel to those in single-series unit-root tests.

Small-sample bias. The Engle-Granger two-step procedure (regress one series on the other, then test the residuals for stationarity) suffers from finite-sample bias when $n < 50$. The test statistic's distribution is sensitive to the cointegrating vector's estimation error. Mitigation: report the test alongside the sample size; treat $n < 30$ results as indicative only and prioritize manual reconciliation review.
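
The two-step procedure named above can be sketched by hand to show what the library call is doing. This is an illustration under assumptions, not the statsmodels implementation: the function name `engle_granger_two_step` is hypothetical, the second step uses no lagged differences, and the comparison value of -3.34 is the approximate asymptotic 5% critical value for two series with a constant.

```python
import numpy as np

def engle_granger_two_step(a: np.ndarray, b: np.ndarray) -> dict:
    """Step 1: OLS of a on a constant and b. Step 2: Dickey-Fuller t-statistic on residuals."""
    Z = np.column_stack([np.ones(b.size), b])
    coef, *_ = np.linalg.lstsq(Z, a, rcond=None)
    resid = a - Z @ coef

    # Step 2 regression: d(resid)_t = gamma * resid_{t-1} + error.
    # No constant is needed because step-1 residuals already have mean zero.
    d_resid = np.diff(resid)
    lagged = resid[:-1]
    gamma = (lagged @ d_resid) / (lagged @ lagged)
    err = d_resid - gamma * lagged
    sigma2 = err @ err / (d_resid.size - 1)
    t_stat = gamma / np.sqrt(sigma2 / (lagged @ lagged))
    return {"intercept": float(coef[0]), "slope": float(coef[1]),
            "t_statistic": float(t_stat)}

rng = np.random.default_rng(21)
n_obs = 300
common_trend = np.cumsum(rng.standard_normal(n_obs))

# Cointegrated pair: the payable tracks the receivable up to stationary noise.
receivable = common_trend
payable = 2.0 + common_trend + 0.5 * rng.standard_normal(n_obs)
linked = engle_granger_two_step(payable, receivable)

# Unlinked pair: two independent random walks.
unlinked = engle_granger_two_step(np.cumsum(rng.standard_normal(n_obs)),
                                  np.cumsum(rng.standard_normal(n_obs)))

APPROX_5PCT_CRITICAL = -3.34  # asymptotic, two series with constant; an approximation
print(f"linked   slope={linked['slope']:.3f}  t={linked['t_statistic']:.2f}")
print(f"unlinked slope={unlinked['slope']:.3f}  t={unlinked['t_statistic']:.2f}")
```

The residual t-statistic must be compared against cointegration critical values rather than ordinary Dickey-Fuller values, because the residuals come from an estimated regression; that estimation step is the source of the finite-sample sensitivity described above.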

Deterministic trends in both series. If both accounts exhibit deterministic (linear) trends in addition to stochastic trends, the cointegration test may falsely detect a stable relationship that is actually spurious correlation of two independent trends. Mitigation: include a trend term in the cointegrating regression (the coint function’s trend='ct' option) and compare results under both specifications.

Structural breaks in the cointegrating relationship. A one-time event—merger, divestiture, ERP cutover—can shift the cointegrating vector (the ratio between the two accounts changes permanently). The test will reject cointegration even though both before-break and after-break sub-samples are individually cointegrated. Mitigation: partition the series at known event dates and test each partition separately. If $n$ becomes too small, document the limitation and fall back to transaction-level reconciliation for the post-event period.

Single-series failure modes and defenses

Three patterns recur in single-account production deployments.

Underpower against near-unit-root alternatives. Unit-root tests have famously low power against alternatives close to but distinct from a unit root (e.g., $\rho = 0.95$). Series that are stationary but mean-revert slowly will frequently fail to reject the unit-root null with the data available in audit settings (typically 24-48 monthly observations). Mitigation: report ADF and KPSS together; treat ADF non-rejection as “evidence consistent with drift” rather than “evidence of drift.”
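
The power problem can be quantified with a small Monte Carlo experiment. The sketch below is an assumption-laden illustration: it uses a plain Dickey-Fuller regression with a constant and no lagged differences, takes the 5% critical value from simulated random walks of the same length rather than from the MacKinnon tables, and starts every path at zero. The function name `df_t_statistics` is hypothetical.

```python
import numpy as np

def df_t_statistics(rho: float, n_obs: int, n_sims: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Dickey-Fuller t-statistics (constant only) for n_sims simulated AR(1) paths."""
    eps = rng.standard_normal((n_sims, n_obs))
    x = np.zeros((n_sims, n_obs))
    for t in range(1, n_obs):
        x[:, t] = rho * x[:, t - 1] + eps[:, t]
    y = np.diff(x, axis=1)                            # dX_t
    lag = x[:, :-1]                                   # X_{t-1}
    lag_c = lag - lag.mean(axis=1, keepdims=True)     # demeaning absorbs the constant
    y_c = y - y.mean(axis=1, keepdims=True)
    sxx = (lag_c ** 2).sum(axis=1)
    gamma = (lag_c * y_c).sum(axis=1) / sxx
    resid = y_c - gamma[:, None] * lag_c
    sigma2 = (resid ** 2).sum(axis=1) / (y.shape[1] - 2)
    return gamma / np.sqrt(sigma2 / sxx)

rng = np.random.default_rng(0)
N_OBS, N_SIMS = 36, 5000

# Simulated 5% critical value under the unit-root null for this sample size.
critical_value = float(np.quantile(df_t_statistics(1.0, N_OBS, N_SIMS, rng), 0.05))

# Rejection rate (power) for progressively faster mean reversion.
power = {rho: float((df_t_statistics(rho, N_OBS, N_SIMS, rng) < critical_value).mean())
         for rho in (0.95, 0.8, 0.4)}

print(f"simulated 5% critical value (n={N_OBS}): {critical_value:.2f}")
for rho, rate in power.items():
    print(f"rho={rho}: unit-root null rejected in {rate:.0%} of simulations")
```

The rejection rate at $\rho = 0.95$ is the number to keep in mind when reading a non-rejection on three years of monthly data: a slowly mean-reverting account will usually look like a random walk at that sample size.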

Lag-length sensitivity. ADF results can shift materially with the chosen lag length. AIC-selected lags are the standard default, but the selected lag should be reported alongside the test statistic. For sensitivity analysis, also report results at one lag below and one lag above the AIC-selected length ($p-1$, $p$, $p+1$).

Structural breaks misidentified as drift. A real one-time accounting event (acquisition, divestiture, accounting-policy change, ERP migration) creates a structural break that both ADF and KPSS partially reject—producing the inconclusive_high diagnosis. The audit response is to check for known structural events first; if none, investigate further.
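
When a structural event is known and dated, one simple way to check its influence is to remove the level shift and compare persistence before and after the adjustment. The sketch below is a hypothetical illustration, not a formal break test: the helper names, the break at month 18, and the step size of 5.0 mirror the synthetic step-change series only by assumption, and standard critical values are only approximate once the series has been demeaned around a break.

```python
import numpy as np

def remove_known_level_shift(x: np.ndarray, break_index: int) -> np.ndarray:
    """Demean each side of a documented event date so a one-time shift is not read as drift."""
    adjusted = x.astype(float).copy()
    adjusted[:break_index] -= adjusted[:break_index].mean()
    adjusted[break_index:] -= adjusted[break_index:].mean()
    return adjusted

def ar1_coefficient(x: np.ndarray) -> float:
    """OLS estimate of rho in X_t = c + rho * X_{t-1} + eps_t."""
    lag = x[:-1] - x[:-1].mean()
    cur = x[1:] - x[1:].mean()
    return float((lag @ cur) / (lag @ lag))

# Stationary AR(1) baseline (rho = 0.4) with a one-time jump at month 18.
rng = np.random.default_rng(3)
n_months, break_month, step_size = 36, 18, 5.0
eps = rng.standard_normal(n_months)
series = np.zeros(n_months)
for t in range(1, n_months):
    series[t] = 0.4 * series[t - 1] + eps[t]
series[break_month:] += step_size

adjusted = remove_known_level_shift(series, break_month)
rho_raw = ar1_coefficient(series)
rho_adjusted = ar1_coefficient(adjusted)
print(f"estimated rho before adjustment: {rho_raw:.2f}")
print(f"estimated rho after adjustment:  {rho_adjusted:.2f}")
```

A persistence estimate that falls sharply once the documented shift is removed suggests the apparent drift is the event itself; if it stays near one, the further investigation described above is still warranted.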

Computational environment

The code blocks above require the following environment:


# requirements.txt
numpy>=1.24.0
pandas>=2.0.0
statsmodels>=0.14.0

Reproducible notebook and full test suite: github.com/noahrgreen/dd-tech-lab-companion (workbook for this article forthcoming)

Bridge to Markov Decision Processes for Risk-Based Audit Sampling Under Cost-of-Type-II Constraints

The preceding articles in this sub-series have established the apparatus for detecting anomalies in audit data. Markov Decision Processes for Risk-Based Audit Sampling Under Cost-of-Type-II Constraints takes the apparatus into decision-making: given a fixed audit budget, which accounts and periods should receive substantive procedures? The Markov Decision Process framework formalizes the cost-asymmetric trade-off (Type I cost of over-testing vs. Type II cost of under-testing) and produces an optimal sampling policy as a function of the engagement risk profile.

Authority:

Time-series and unit-root theory:

Dickey, D.A., & Fuller, W.A. (1979). “Distribution of the Estimators for Autoregressive Time Series with a Unit Root.” Journal of the American Statistical Association, 74(366), 427-431.
Said, S.E., & Dickey, D.A. (1984). “Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order.” Biometrika, 71(3), 599-607.
Phillips, P.C.B., & Perron, P. (1988). “Testing for a Unit Root in Time Series Regression.” Biometrika, 75(2), 335-346.
Kwiatkowski, D., Phillips, P.C.B., Schmidt, P., & Shin, Y. (1992). “Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root.” Journal of Econometrics, 54(1-3), 159-178.
MacKinnon, J.G. (1996). “Numerical Distribution Functions for Unit Root and Cointegration Tests.” Journal of Applied Econometrics, 11(6), 601-618.
Engle, R.F., & Granger, C.W.J. (1987). “Co-Integration and Error Correction: Representation, Estimation, and Testing.” Econometrica, 55(2), 251-276.
Hamilton, J.D. (1994). Time Series Analysis. Princeton University Press, Ch. 17, 22.

Audit standards:

PCAOB AS 2305 — Substantive Analytical Procedures, §10-17.
PCAOB AS 2110 — Identifying and Assessing Risks of Material Misstatement, §28-49.
Internal Audit Foundation (2018). Continuous Auditing Practice Guide.

Companion code

A fully consolidated companion script for this article is in progress; the worked-example code in the body above is self-contained. Other companion artifacts in the sub-series are at noahrgreen/dd-tech-lab-companion.
