Community Detection for Related-Party Cluster Identification: Louvain, Label Propagation, and Their Audit Use

PCAOB AS 2410 requires the auditor to obtain an understanding of the entity’s related-party relationships and transactions. Management’s representation is the starting point; auditor-side procedures extend it through inquiry, document review, and — for large counterparty universes — analytical procedures. The analytical-procedures step is where graph-based community detection earns its keep. Two counterparties that share an ultimate beneficial owner, a director-officer overlap, an address overlap, or a transactional pattern that clusters them with each other but not with the broader counterparty set are candidates for related-party classification under AS 2410’s substance-over-form discipline.

Community detection algorithms find these clusters automatically. Louvain (Blondel et al., 2008) optimizes a modularity objective through a greedy two-phase process. Label propagation (Raghavan et al., 2007) is simpler and faster, assigning each node the most-common label among its neighbors and iterating to convergence. The two algorithms produce different cluster decompositions on the same graph, and the audit-defensibility framing requires the analyst to understand why. This article walks both, implements them via Neo4j GDS, and produces a side-by-side comparison on a synthetic counterparty graph with seeded related-party clusters.

The PCAOB AS 2410 related-party problem

AS 2410 directs the auditor to perform procedures to obtain an understanding of relationships and transactions with related parties, evaluate management’s identification, and respond to risks of material misstatement involving them. Management’s representation — usually a related-parties list in the management representation letter — is the starting point. The auditor’s responsibility is to assess whether that list is complete and accurate, which requires procedures beyond simply accepting it.

Three procedure categories typically appear in the workpaper. Inquiry with management, the audit committee, and personnel involved in counterparty negotiations. Document review of board minutes, significant contracts, and confirmations from financial institutions. Analytical procedures that compare the counterparty universe against known related-party signals — common addresses, common officers, ownership overlaps. The third category is where most engagement teams hit a scalability wall: with hundreds or thousands of counterparties, exhaustive pairwise overlap detection is impractical by hand.

Community detection on the counterparty graph automates the analytical-procedures category. Project the counterparty universe and its known signal edges (ownership, officer-role, address-overlap) into a graph; run a community-detection algorithm; the resulting clusters are candidate related-party groupings for the auditor to inquire about. Clusters confirmed by management or by document review become part of the AS 2410 disclosure; clusters that turn out to be coincidental are documented as such.

Modularity as a clustering objective

The standard quality metric for graph clustering is modularity (Newman & Girvan, 2004), defined as the fraction of edges within communities minus the expected fraction under a random-graph null:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} – \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where A is the adjacency matrix, k_i is node i‘s degree, m is the total edge weight, c_i is the community assignment of node i, and δ(c_i, c_j) = 1 if c_i = c_j and 0 otherwise. The first term A_{ij} is the observed edge weight between i and j; the second term (k_i k_j)/(2m) is the expected edge weight under a configuration-model null. The sum is taken over node pairs that are in the same community (the Kronecker delta).

High modularity means edges concentrate within communities relative to what you’d expect by chance. Maximizing modularity is NP-hard in general, but the Louvain heuristic produces near-optimal solutions in practice with O(n log n) empirical complexity on typical graphs.

For the related-party application, modularity-maximization rewards exactly the structure we want to find: clusters of counterparties with dense ownership/officer/address overlap among themselves, sparse overlap with the broader counterparty set. The algorithm encodes the substance-over-form discipline numerically.

The Louvain algorithm

Louvain is a two-phase iterative algorithm. Phase 1 — local moves: starting from a singleton-cluster initialization (every node is its own community), iterate over nodes and for each node, evaluate moving it to each of its neighbors’ communities. Apply the move that produces the largest positive modularity gain; if no positive gain is available, leave the node where it is. Pass through all nodes until no further improvements happen. Phase 2 — community aggregation: collapse each community into a single super-node with self-loops summing the within-community edges and external edges summing the between-community edges. Then run Phase 1 again on the aggregated graph.

The two phases alternate until modularity ceases to improve. Each pass through the aggregated graph produces a hierarchical level of community structure — top-level coarse communities decompose into mid-level communities, which decompose into the original fine-grained communities. For AS 2410 work, the top-level coarse communities are usually the right granularity (small related-party clusters of 3-12 counterparties); the deeper levels can reveal sub-cluster structure within a large related-party group.

Louvain’s behavior is mostly deterministic given a fixed node-traversal order, but the order itself depends on the input graph’s node ID enumeration. In Neo4j GDS, this means re-running Louvain on the same graph after a node insertion can shift cluster boundaries at the margin. For audit reproducibility, fix the random seed in GDS’s invocation and document the seed in the workpaper.

Label propagation

Label propagation (Raghavan et al., 2007) is conceptually simpler. Initialize every node with a unique label. At each iteration, every node adopts the label that is most common among its neighbors, with random tie-breaking. Iterate until convergence (no node’s label changes between iterations) or until a maximum iteration limit.

The algorithm runs in roughly O(m) time per iteration where m is the edge count, making it substantially faster than Louvain at scale. The downside is the random tie-breaking. When a node has two neighboring labels with equal frequency, the algorithm picks one at random, and the choice cascades through subsequent iterations. The result: different random seeds produce different cluster decompositions on the same input graph. For audit work, this instability is a problem — engagement reviewers expect reproducible procedures.

The tie-breaking instability can be partially mitigated by seed-fixing (in Neo4j GDS, concurrency: 1 plus a fixed random seed) and by re-running multiple times and reconciling. But the fundamental property remains: label propagation is genuinely stochastic at the cluster-boundary level, where Louvain is mostly deterministic. For most AS 2410 deployments, Louvain is the preferred algorithm; label propagation appears as a speed-prioritized fallback when the counterparty graph crosses the 10-million-edge scale where Louvain begins to feel slow.

GDS library implementation

Both algorithms are first-class procedures in Neo4j GDS. The pattern is the standard two-phase: project the graph, then run the algorithm.

// Project the related-party detection graph (entities + ownership + officer-role + address overlap)
CALL gds.graph.project(
  'rp_detection_v1',
  ['Entity', 'Person'],
  {
    OWNS:       { type: 'OWNS',       properties: { weight: { property: 'percentage',    defaultValue: 0.0 }}},
    OFFICER_OF: { type: 'OFFICER_OF', properties: { weight: { property: 'role_weight',   defaultValue: 0.5 }}},
    SHARES_ADDRESS_WITH: { type: 'SHARES_ADDRESS_WITH', properties: { weight: { property: 'overlap_score', defaultValue: 0.0 }}}
  }
);

// Run Louvain modularity optimization with weighted edges
CALL gds.louvain.stream('rp_detection_v1', {
  relationshipWeightProperty: 'weight',
  maxIterations: 10,
  tolerance: 1e-4,
  includeIntermediateCommunities: false
})
YIELD nodeId, communityId
WITH communityId, collect(gds.util.asNode(nodeId)) AS cluster_members
WHERE size(cluster_members) >= 3
RETURN
  communityId,
  size(cluster_members) AS cluster_size,
  [m IN cluster_members | coalesce(m.legal_name, m.full_name)] AS member_names
ORDER BY cluster_size DESC
LIMIT 50;

Three modeling choices in this snippet matter. Three relationship types are projected together because owners + officers + shared-addresses are independent signals of related-party status; combining them produces stronger clusters than any single signal. Weighted edges matter because a 50% ownership edge is a stronger signal than a 1% ownership edge, and Louvain should reflect that. The size-3 filter excludes singleton and dyad “clusters” because a cluster of one or two nodes isn’t a related-party group; it’s an unrelated entity or a known counterparty-pair.

Algorithm comparison criteria

Four criteria distinguish Louvain from label propagation for audit work.

Stability: Louvain produces the same cluster decomposition across runs with fixed seed; label propagation’s tie-breaking randomness produces different decompositions even with the same seed at high cluster-boundary density. Louvain wins on audit reproducibility.

Resolution: Louvain’s hierarchical structure surfaces both coarse and fine community boundaries; label propagation produces a single-level decomposition. For AS 2410 work, coarse is usually right; for forensic deep-dives, hierarchical exploration is valuable. Louvain wins on flexibility.

Speed: Label propagation runs at roughly O(m); Louvain at roughly O(n log n) empirically. At million-edge scale, label propagation is 5-20x faster. At billion-edge scale, label propagation is the only practical choice. Label propagation wins on extreme scale.

Audit defensibility: Louvain’s deterministic-ish behavior makes its outputs reviewable. Label propagation’s tie-break randomness requires multiple runs plus reconciliation, which complicates the workpaper. Louvain wins on defensibility unless scale forces the trade-off.

The decision flowchart: under 10M edges, default to Louvain. 10M-100M edges, Louvain with tolerance: 1e-3 (looser threshold buys speed) or label propagation with multi-run reconciliation. Over 100M edges, label propagation is forced by runtime budget.

Worked example

The companion repository ships a synthetic 1,000-entity counterparty graph with 4 seeded related-party clusters of sizes 5, 8, 12, and 25. The seeded clusters are constructed with realistic signal patterns: each cluster shares 2-3 ultimate beneficial owners, 1-2 director-officer overlaps, and partial address overlap (same registered agent or shared mail-forwarding service). The remaining 950 entities form a random ownership and officer graph with no seeded related-party structure.

Louvain at default GDS parameters recovers all 4 seeded clusters cleanly. The recovery is not perfect — Louvain also surfaces 6 small clusters (sizes 3-5) of unrelated entities that happen to share a single common signal (typically an address overlap with a popular registered agent). The article walks the analyst-resolution of these 6 false-positive clusters: 4 are correctly dismissed after document review (the shared registered agent is the common signal but no genuine related-party connection exists), 2 turn out to be genuine related parties not in the seeded set (small intercompany relationships the synthetic-data generator hadn’t been calibrated to produce, but which the seeded ownership edges happened to imply).

Label propagation on the same graph recovers 3 of 4 seeded clusters (misses the size-5 cluster, which has the weakest signal density), produces 11 small clusters of unrelated entities (vs. Louvain’s 6), and varies measurably across runs.

Operational integration

The AS 2410 workpaper template that drops out of the community-detection output has three sections. The analytical-procedure documentation records the graph projection (entity universe, relationship-type set, weights), the algorithm choice (Louvain in the default case), the parameters (modularity tolerance, max iterations, fixed random seed), and the cluster-size filter. The candidate-cluster list enumerates each cluster surfaced by the algorithm with its member names, the signal types that linked them, and the cluster size. The disposition notes record the auditor’s resolution of each candidate cluster — confirmed related party (with the basis for the determination), coincidental (with the explanation), or unresolved-pending-management-inquiry (with the follow-up plan).

Two operational guardrails matter for audit defensibility. First, community-detection outputs are candidate related-party clusters, not confirmed related parties. The AS 2410 determination requires auditor judgment applied to each cluster — verifying the relationship characteristics, confirming with management, documenting the conclusion. Second, the algorithm parameters must be documented and reproducible. Changing the modularity tolerance, the edge-weight scheme, or the random seed between engagements changes the cluster decomposition; the workpaper has to make the parameter choices explicit so the procedure is auditable in subsequent inspections.

The forthcoming Temporal Graph Patterns article in this sub-series takes the cluster-detection output into the time dimension — how related-party clusters evolve as ownership and officer-roles change across periods, which is essential for multi-period engagements and continuing-reporting reconciliation.

References

Community detection theory:

Newman, M. E. J., & Girvan, M. (2004). “Finding and Evaluating Community Structure in Networks.” Physical Review E, 69(2), 026113.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). “Fast Unfolding of Communities in Large Networks.” Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Raghavan, U. N., Albert, R., & Kumara, S. (2007). “Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks.” Physical Review E, 76(3), 036106.
Fortunato, S., & Hric, D. (2016). “Community Detection in Networks: A User Guide.” Physics Reports, 659, 1-44.

Audit standards:

PCAOB AS 2410 — Related Parties.
AICPA SAS No. 145 — Understanding the Entity and Its Environment and Assessing the Risks of Material Misstatement.
FASB ASC 850 — Related Party Disclosures.

Implementation reference:

Neo4j Graph Data Science Library Documentation — gds.louvain.stream, gds.labelPropagation.stream, weighted-graph configuration.

Reproducible code: Companion repository at github.com/noahrgreen/dd-tech-lab-companion ships the AS 2410 related-party detection pipeline: graph-projection scripts for ownership + officer-role + address-overlap edges, Louvain and label-propagation implementations with side-by-side comparison, the synthetic 1,000-entity test graph with seeded clusters, and the workpaper template that converts cluster output into AS 2410-aligned documentation.

Community Detection for Related-Party Cluster Identification: Louvain, Label Propagation, and Their Audit Use

The PCAOB AS 2410 related-party problem

Modularity as a clustering objective

The Louvain algorithm

Label propagation

GDS library implementation

Algorithm comparison criteria

Worked example

Operational integration

References

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Sheepdog Prosperity Partners LLC

Contact

Schedule