From Beneficial-Ownership Lists to Cypher: A Practical Knowledge-Graph Setup for Due Diligence

An ownership chain that nests beyond three levels is the point at which the DD analyst’s tooling has to change. A natural person owns 51 percent of Entity A; Entity A owns 80 percent of Entity B; Entity B reaches a 40 percent indirect interest in Entity C through two intermediate holding companies; Entity C is the counterparty under review. The chain has five hops, two intermediate branches that turn out to be dead ends, an effective date for each ownership edge that may have shifted in the relevant period, and a cumulative-ownership math problem that scales as a product of fractions along each surviving path. The practitioner who tries to express this in pivot tables loses an afternoon and produces a workpaper their reviewer cannot follow.

The data is graph-shaped at its source. Persons and entities are nodes; ownership and officer-role relationships are edges with percentage and effective-date properties; the question “which natural persons cumulatively own at least 25 percent of Entity C through any combination of paths as of December 31” reduces to a single parameterized Cypher query against an indexed graph. This article walks through the practical setup: schema design, data loading, the foundational Cypher queries for beneficial-ownership analysis, the audit-evidence discipline that converts graph output into AS 1105 evidence, and the limitations that FinCEN’s Beneficial Ownership Information rule and OFAC’s 50 Percent Rule make load-bearing.

The companion articles in this sub-series extend the methodology in specific directions: Cypher patterns for transaction-graph anomaly detection, schema design for sanctions screening, GraphRAG combining LLM retrieval with Cypher path queries, random walks for cascade-exposure ranking, community detection for related-party clusters, temporal-graph patterns for ownership lifecycle, graph-based wash-sale and layering detection, production-scale operations, and platform-architecture comparison. This first article establishes the schema and the loading discipline that the others build on.

Why ownership data is graph-shaped

The Excel-against-the-grain feeling that practitioners hit when ownership chains nest more than two levels deep is not a tooling complaint; it is a recognition that the data does not naturally fit the row-and-column representation. A row in an ownership spreadsheet has one owner, one entity, and a percentage. A chain of ownership requires joining the table to itself, and a five-hop chain requires five recursive self-joins. SQL handles this through recursive Common Table Expressions, which work but produce queries that are difficult to read, difficult to debug, and performance-fragile at moderate scale.

The graph representation collapses each of those self-joins into a single edge traversal. (:Person)-[:OWNS]->(:Entity) is the elementary unit; the chain of five hops is (:Person)-[:OWNS*1..5]->(:Entity). The query is composable, readable, and, given proper indexes, performant. The cost of the framework is upfront schema design and data loading; the benefit is that every subsequent ownership question becomes a Cypher idiom rather than a SQL puzzle.

For due-diligence work, two regulatory frameworks make the graph model useful, but they do not contribute equally to the public-source data picture in 2026. OFAC’s 50 Percent Rule remains directly load-bearing for indirect-ownership analysis: any entity owned 50 percent or more in the aggregate, directly or indirectly, by one or more sanctioned persons is itself sanctioned by transitivity, and the practitioner must compute cumulative ownership through paths to apply the rule correctly. FinCEN’s Beneficial Ownership Information Reporting Rule under 31 CFR §1010.380 still defines beneficial ownership and update mechanics conceptually (25 percent equity interest or substantial control; 30-day continuing reporting on changes; the temporal dimension is the domain of the forthcoming Temporal-Graph Patterns for Ownership Lifecycle article). After FinCEN’s March 2025 interim final rule, however, domestic U.S. entities are exempt from BOI reporting and only certain foreign reporting companies remain in scope. The practical implication is that the graph model retains its value for indirect-ownership logic and OFAC-cascade analysis, but practitioners should not frame BOI as a broad current public-source feed for U.S. domestic-company DD. The article’s worked-example data path uses corporate-registry sources and synthetic data; production deployments that rely on BOI should track FinCEN’s current effective rule rather than the pre-2025 framing the regulation appears to suggest at first reading.

Neo4j AuraDB vs self-hosted Neo4j

For a single-engagement-scale graph (~50,000 entities or fewer), Neo4j AuraDB Free is the fastest path to a working deployment. The free tier accommodates one database, 200,000 nodes, and 400,000 relationships, which fits a single small DD engagement comfortably. The managed service removes the operational overhead (no JVM tuning, no neo4j.conf decisions, no version upgrades), which is the right trade-off for the first engagement; the deeper platform-architecture question becomes relevant once the firm’s aggregate graph crosses tier limits or the workload pattern exceeds AuraDB Free’s resource budget.

For multi-engagement portfolios, Neo4j AuraDB Professional tiers (10M to 200M relationships per database) cover most mid-size DD practices. Self-hosted Neo4j Community Edition is free but limits clustering and operational features that the Enterprise tier provides; the build-vs-buy and self-host-vs-managed calculation is treated in full in the forthcoming Neo4j vs Alternatives: Architecture Comparison article. For this article, the assumption is AuraDB Free or Professional, and the connection setup uses standard bolt:// or neo4j+s:// URI patterns.

Ownership-graph schema design

The schema reflects the data the practitioner actually has. Persons and entities are nodes with stable identifiers (jurisdiction-issued registration numbers, or, for U.S. legal entities, Secretary of State entity numbers). The schema-design choices that earn their weight are typed relationships (OWNS for equity, OFFICER_OF for officer roles, CONTROLS for substantial-control assertions) and edge properties for ownership percentage and effective-date metadata. The article’s worked example uses the equity-ownership subset; richer schemas with Address and IdentificationDocument sub-nodes appear in the forthcoming Schema Design for Sanctions Screening: Modeling the OFAC SDN List as a Knowledge Graph article’s sanctions-screening setup, where jurisdictional-disclosure variance is load-bearing.

// Schema setup — uniqueness constraints and indexes
CREATE CONSTRAINT person_uid IF NOT EXISTS
  FOR (p:Person) REQUIRE p.uid IS UNIQUE;

CREATE CONSTRAINT entity_uid IF NOT EXISTS
  FOR (e:Entity) REQUIRE e.uid IS UNIQUE;

CREATE INDEX person_full_name IF NOT EXISTS
  FOR (p:Person) ON (p.full_name);

CREATE INDEX entity_legal_name IF NOT EXISTS
  FOR (e:Entity) ON (e.legal_name);

CREATE INDEX owns_percentage IF NOT EXISTS
  FOR ()-[r:OWNS]-() ON (r.percentage);

The uniqueness constraints on uid are non-negotiable and are the load-bearing performance controls for the query patterns below: they allow deterministic starting-node resolution, prevent duplicate nodes during incremental loads (which would silently produce ownership-percentage drift across runs), and let the Cypher planner anchor variable-length traversals at a single resolved node. The indexes on full_name and legal_name support the initial entity-resolution lookups during data ingestion (matching a counterparty against existing graph nodes before deciding whether to merge or create). The relationship-property index on :OWNS(percentage) is optional and usually secondary for bounded ownership traversals; it is more useful for edge-property filtering workloads (e.g., “find all majority-ownership edges above 50 percent”) than for the core path-product queries in this article. Include it if the engagement workload uses percentage-threshold filtering directly; omit it otherwise.

Loading the ownership graph

The data source picture for production loads varies sharply by jurisdiction and filing type. OpenCorporates is useful for company-registry metadata, jurisdictional identifiers, and certain control-related filing signals (officer roles, registered-agent records, principal-place-of-business data), but its coverage depth is jurisdiction-dependent and is not a complete beneficial-ownership feed for U.S. domestic-company DD. Other production sources (paid subscription registries such as Bureau van Dijk Orbis and LexisNexis, individual Secretary-of-State filings pulled directly, FinCEN BOI access where authorized under the access-restricted framework, internal CRM and KYC records) supplement OpenCorporates for different jurisdictional or evidentiary needs. The right production source mix is engagement-specific and outside this article’s scope.

For this article, the synthetic dataset below is the reproducible base case. Any production load should be described as jurisdiction-dependent enrichment from one or more registry / disclosure / internal sources, not as a single-source BO feed. The Python generator below produces the ownership_synthetic.csv referenced by the load query, with a fixed random seed so every reader can reproduce the same graph and verify the same query results.

# synthetic_ownership_generator.py — produces ownership_synthetic.csv
import csv
import random

random.seed(42)

PERSON_COUNT = 25
ENTITY_COUNT = 75
SANCTIONED_PERSONS = {"P-0001", "P-0002"}  # 2 of 25 persons flagged
JURISDICTIONS = ["US-DE", "US-NY", "KY", "LU", "VG", "PA"]

# Designated target entity for the worked-example queries below
TARGET_ENTITY = "E-0042"

rows = []
for i in range(1, ENTITY_COUNT + 1):
    entity_uid = f"E-{i:04d}"
    n_owners = random.choice([1, 2, 3, 4])
    owner_pool = ([f"P-{j:04d}" for j in range(1, PERSON_COUNT + 1)] +
                  [f"E-{j:04d}" for j in range(1, i)])
    owner_uids = random.sample(owner_pool, k=min(n_owners, len(owner_pool)))
    pcts = sorted([random.randint(5, 100) for _ in owner_uids], reverse=True)
    total = sum(pcts)
    # Normalize each entity's ownership weights to sum to 100, producing a per-entity
    # ownership partition. This is the precondition for the cumulative-product
    # semantics in the Q1/Q3 path queries: at each hop, the edge weight is the
    # fraction of the downstream entity owned by the upstream node.
    pcts = [round(p * 100 / total, 2) for p in pcts]
    for owner_uid, pct in zip(owner_uids, pcts):
        rows.append({
            "owner_uid": owner_uid,
            "owner_kind": "Person" if owner_uid.startswith("P-") else "Entity",
            "owner_full_name": f"Person {owner_uid}" if owner_uid.startswith("P-") else f"Entity {owner_uid}",
            "owner_jurisdiction": random.choice(JURISDICTIONS),
            "owner_sanctioned": 1 if owner_uid in SANCTIONED_PERSONS else 0,
            "entity_uid": entity_uid,
            "entity_legal_name": f"Entity {entity_uid}",
            "entity_jurisdiction": random.choice(JURISDICTIONS),
            "entity_type": "LLC",
            "percentage": pct,
            "effective_date": "2024-01-01",
            "source_filing_ref": f"SYNTH-{entity_uid}-2024-001",
        })

with open("ownership_synthetic.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} ownership rows. TARGET_ENTITY = {TARGET_ENTITY!r}")

The generator emits an owner_kind column with the explicit Person / Entity label and a numeric owner_sanctioned flag (0 or 1), so the LOAD does not have to infer labels from uid prefixes or compare against string 'True'/'False'. The designated TARGET_ENTITY = "E-0042" is the worked-example target for the queries below.
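Before loading, it can be worth confirming that the CSV satisfies the per-entity partition precondition described in the generator’s comments. The following is a minimal sketch rather than part of the article’s pipeline: the function names check_ownership_partition and check_file, and the 0.05 tolerance (chosen to absorb two-decimal rounding across up to four owners), are assumptions made for illustration.

```python
import csv
from collections import defaultdict


def check_ownership_partition(rows, tolerance=0.05):
    """Return {entity_uid: total_pct} for entities whose inbound percentages
    do not sum to 100 within the tolerance (hypothetical default: 0.05)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["entity_uid"]] += float(row["percentage"])
    return {uid: round(total, 2)
            for uid, total in totals.items()
            if abs(total - 100.0) > tolerance}


def check_file(path="ownership_synthetic.csv"):
    """Run the check against the CSV written by the generator."""
    with open(path, newline="") as f:
        return check_ownership_partition(csv.DictReader(f))


# Illustrative rows (not generator output): E-0001 is a clean partition,
# E-0002 is missing 10 points of ownership.
sample = [
    {"entity_uid": "E-0001", "percentage": "66.67"},
    {"entity_uid": "E-0001", "percentage": "33.33"},
    {"entity_uid": "E-0002", "percentage": "90.0"},
]
print(check_ownership_partition(sample))  # {'E-0002': 90.0}
```

An empty result from check_file() would indicate that every entity’s inbound edges form a partition, which is what the path-product queries assume.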

// Two-pass load. Pass 1: Person owner rows
LOAD CSV WITH HEADERS FROM 'file:///ownership_synthetic.csv' AS row
WITH row WHERE row.owner_kind = 'Person'
MERGE (p:Person {uid: row.owner_uid})
  ON CREATE SET p.full_name = row.owner_full_name,
                p.jurisdiction = row.owner_jurisdiction,
                p.sanctioned = toBoolean(toInteger(row.owner_sanctioned)),
                p.created_at = datetime()
MERGE (e:Entity {uid: row.entity_uid})
  ON CREATE SET e.legal_name = row.entity_legal_name,
                e.jurisdiction = row.entity_jurisdiction,
                e.entity_type = row.entity_type,
                e.created_at = datetime()
MERGE (p)-[r:OWNS]->(e)
  ON CREATE SET r.percentage = toFloat(row.percentage),
                r.effective_date = date(row.effective_date),
                r.source_filing_ref = row.source_filing_ref;

// Pass 2: Entity-owner rows (entity holds equity in another entity)
LOAD CSV WITH HEADERS FROM 'file:///ownership_synthetic.csv' AS row
WITH row WHERE row.owner_kind = 'Entity'
MERGE (owner:Entity {uid: row.owner_uid})
  ON CREATE SET owner.legal_name = row.owner_full_name,
                owner.jurisdiction = row.owner_jurisdiction,
                owner.created_at = datetime()
MERGE (e:Entity {uid: row.entity_uid})
  ON CREATE SET e.legal_name = row.entity_legal_name,
                e.jurisdiction = row.entity_jurisdiction,
                e.entity_type = row.entity_type,
                e.created_at = datetime()
MERGE (owner)-[r:OWNS]->(e)
  ON CREATE SET r.percentage = toFloat(row.percentage),
                r.effective_date = date(row.effective_date),
                r.source_filing_ref = row.source_filing_ref;

The two-pass pattern is cleaner than a single-pass conditional-label approach: each pass handles a single owner label, so the resulting graph has the right labels for the queries below without an APOC dependency. For small loads (under 10,000 rows), the plain LOAD CSV form above is sufficient. For larger loads, the apoc.periodic.iterate streaming pattern from the forthcoming Production-Scale DD Graph Operations article is the right escalation; the syntax is similar but adds transaction-batching for memory safety. The MERGE ... ON CREATE SET idiom is the canonical pattern for idempotent loads: repeated runs with the same source data do not duplicate nodes or relationships. The created_at and effective_date timestamps support the temporal-graph queries from the forthcoming Temporal-Graph Patterns for Ownership Lifecycle article.

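The synthetic generator is the reproducible-evidence anchor for the rest of the article. The reader can run it locally, load the resulting CSV into a Neo4j AuraDB Free instance, and execute the queries below against the loaded graph; every result figure is reproducible from the fixed seed. One practical caveat: AuraDB instances cannot read file:/// URLs, so for an Aura load the CSV has to be hosted at an HTTPS URL the instance can reach, with that URL substituted into both LOAD CSV statements; the file:/// form works as written on a self-hosted instance with the file in the import directory.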

Cumulative ownership: the underlying math

Before the Cypher idioms, the math. An ownership path $P = (v_0, v_1, \ldots, v_n)$ in the graph is a sequence of nodes where each consecutive pair is connected by an OWNS edge with percentage weight $w(v_{i-1}, v_i) \in [0, 100]$. The cumulative ownership along the path is the product of edge weights, expressed as a fraction:

$$o(P) = \prod_{i=1}^{n} \frac{w(v_{i-1}, v_i)}{100}$$

For a single owner reaching a target through multiple disjoint paths $P_1, P_2, \ldots, P_k$, the aggregate ownership is path-dependent. Two conventions matter for due-diligence work:

Maximum cumulative, $o_{\max} = \max_j o(P_j)$, the conservative reporting choice for beneficial-ownership disclosure questions; reports the strongest single chain of control regardless of additional weaker chains.
Sum of disjoint paths, $o_{\text{sum}} = \sum_j o(P_j)$, appropriate only when paths share no intermediate nodes, which the practitioner must verify before reporting; without that check, summing over paths that share intermediates produces inflated aggregate ownership.

The FinCEN BOI 25-percent threshold and the OFAC 50-percent threshold are both expressed in terms of cumulative ownership, but the regulatory text leaves the path-aggregation convention to the practitioner’s reasonable interpretation. Conservative DD practice is to compute and report both the maximum-cumulative and sum-of-disjoint-paths figures, distinguishing them in the workpaper rather than choosing silently.
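The two conventions are easiest to compare on concrete numbers. The sketch below uses the percentages from the opening example (51, 80, and 40 percent along one chain) and adds a second, node-disjoint chain whose percentages (30 and 50) are invented purely for illustration; path_fraction is a hypothetical helper name.

```python
from math import prod


def path_fraction(edge_percentages):
    """Cumulative ownership o(P): product of edge percentages, as a fraction."""
    return prod(p / 100.0 for p in edge_percentages)


# Edge percentages per hop for each path from one owner to one target.
paths = [
    [51.0, 80.0, 40.0],  # the chain from the opening example
    [30.0, 50.0],        # a second, node-disjoint chain (illustrative values)
]

fractions = [path_fraction(p) for p in paths]
o_max = max(fractions)  # maximum-cumulative convention
o_sum = sum(fractions)  # sum-of-disjoint-paths convention

print([round(f, 4) for f in fractions])  # [0.1632, 0.15]
print(round(o_max, 4), round(o_sum, 4))  # 0.1632 0.3132
```

Here the maximum-cumulative figure (16.32 percent) falls below a 25 percent threshold while the sum over disjoint paths (31.32 percent) exceeds it, which is the kind of divergence that makes reporting both figures worthwhile.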

The three Cypher queries that cover most beneficial-ownership tasks

Query 1: Ultimate beneficial owners above a threshold. This is the most-asked DD question: given a target entity, who are the natural persons with at least 25 percent cumulative ownership through any path? The Cypher idiom traverses owner-to-entity edges, accumulates the product of edge percentages along each path per the $o(P)$ formula above, and filters to the threshold.

// Q1: Ultimate beneficial owners with ≥25% maximum-cumulative ownership of the target
// Parameter binding: $target_uid is the entity uid to investigate
// For the worked example, set $target_uid = 'E-0042'
:param target_uid => 'E-0042';

MATCH path = (owner:Person)-[:OWNS*1..10]->(target:Entity {uid: $target_uid})
WITH owner, path,
     reduce(pct = 1.0, r IN relationships(path) | pct * (r.percentage / 100.0)) AS path_fraction
WITH owner, max(path_fraction) AS max_cumulative_fraction
WHERE max_cumulative_fraction >= 0.25
RETURN owner.full_name,
       owner.jurisdiction,
       round(max_cumulative_fraction * 100, 2) AS max_cumulative_pct
ORDER BY max_cumulative_pct DESC;

Two design choices matter. The 1..10 variable-length-path bound prevents the query from following pathological cycles and makes worst-case behavior predictable; ten hops covers virtually all realistic ownership structures. The max(path_fraction) aggregation reports the strongest single chain of control regardless of additional weaker chains, which is the conservative DD framing per the math above. The outer round(max_cumulative_fraction * 100, 2) AS max_cumulative_pct converts the fraction to a percentage for the returned result, keeping the internal math in fractions and the workpaper-facing output in percentages.

Performance caveat for variable-length paths. The 1..10 traversal pattern is the canonical Cypher hotspot on large graphs. On a 50,000-node graph the query above completes in milliseconds; on a 10M-node graph without proper indexes it can run for minutes. Three mitigations apply specifically to ownership-path analysis: (1) ensure the uid uniqueness constraints from the schema setup are in place so the planner anchors the traversal at a single resolved target node rather than scanning the full graph; (2) on a real-world graph that may contain cycles (entity A owns entity B which owns entity A through cross-holdings), enforce path uniqueness explicitly so the traversal does not revisit nodes (WHERE size(apoc.coll.toSet(nodes(path))) = size(nodes(path)) is the simplest pattern), or use a Cypher path-pattern that requires distinct nodes by construction; (3) for production-scale work, precompute an acyclic ownership closure for the reporting period (a materialized view of the directed ownership DAG with cycles broken and percentages aggregated under a documented convention) before running the cumulative-product query. The forthcoming Production-Scale DD Graph Operations article covers this discipline in detail. Critical: do not substitute shortest-path procedures for cumulative-ownership analysis. Shortest-path procedures (the gds.shortestPath and gds.allShortestPaths families) optimize for path length and are useful for connectivity questions; the DD question here is path-weighted ownership reachability and maximum cumulative exposure, which is not a shortest-path problem regardless of how the graph is structured. The synthetic data generator above produces a DAG (no cycles), so the queries in this article work without the uniqueness filter; on real-world graphs the filter is required.
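The node-uniqueness idea in mitigation (2) can be illustrated outside the database. The following is a minimal pure-Python sketch, assuming a small in-memory edge map; it is not how Neo4j evaluates the pattern, and the function name simple_path_fractions and the toy graph are hypothetical. It enumerates node-distinct paths up to a hop bound and carries the running product along each one.

```python
def simple_path_fractions(edges, source, target, max_hops=10):
    """Yield (path_nodes, fraction) for every node-distinct path source -> target.

    edges: dict mapping owner uid -> list of (owned uid, percentage).
    """
    stack = [(source, (source,), 1.0)]
    while stack:
        node, path, fraction = stack.pop()
        if len(path) - 1 >= max_hops:  # hop bound, mirroring *1..10
            continue
        for owned, pct in edges.get(node, []):
            if owned in path:  # node already on this path: skip, which breaks cycles
                continue
            new_path = path + (owned,)
            new_fraction = fraction * pct / 100.0
            if owned == target:
                yield new_path, new_fraction
            else:
                stack.append((owned, new_path, new_fraction))


# Toy graph with a cross-holding cycle between A and B (illustrative values).
edges = {
    "P": [("A", 60.0)],
    "A": [("B", 50.0), ("T", 10.0)],
    "B": [("A", 20.0), ("T", 90.0)],
}
results = list(simple_path_fractions(edges, "P", "T"))
for path, fraction in sorted(results):
    print(" -> ".join(path), round(fraction, 4))
```

The B-to-A cross-holding is never followed because A is already on the path, so the enumeration terminates with exactly two paths even though the graph is cyclic.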

Note on parameter binding. The :param target_uid => 'E-0042'; syntax above is Neo4j Browser’s parameter-binding syntax. Driver-side code passes parameters through the driver’s own interface, as a dict or as keyword arguments; for example, in the Python driver: session.run(query, target_uid='E-0042'). The Cypher body remains unchanged; only the binding mechanism differs.
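A slightly fuller sketch of the driver-side call may help. It assumes the official neo4j Python package and an already-constructed driver object; the wrapper name run_q1 is hypothetical, and the connection URI and credentials in the trailing comment are placeholders rather than real values. The query text is Q1 as given above, minus the Browser-only :param line.

```python
Q1 = """
MATCH path = (owner:Person)-[:OWNS*1..10]->(target:Entity {uid: $target_uid})
WITH owner, path,
     reduce(pct = 1.0, r IN relationships(path) | pct * (r.percentage / 100.0)) AS path_fraction
WITH owner, max(path_fraction) AS max_cumulative_fraction
WHERE max_cumulative_fraction >= 0.25
RETURN owner.full_name,
       owner.jurisdiction,
       round(max_cumulative_fraction * 100, 2) AS max_cumulative_pct
ORDER BY max_cumulative_pct DESC
"""


def run_q1(driver, target_uid):
    """Run Q1 through an existing driver and return the rows as plain dicts."""
    with driver.session() as session:
        # The keyword argument binds the $target_uid parameter in the query.
        result = session.run(Q1, target_uid=target_uid)
        return [record.data() for record in result]


# Usage (requires `pip install neo4j`; URI and password are placeholders):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("neo4j+s://<instance-host>", auth=("neo4j", "<password>"))
#   rows = run_q1(driver, "E-0042")
#   driver.close()
```

Returning plain dicts keeps the result easy to serialize into a workpaper artifact.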

Query 2: Common ownership between two entities. This pattern surfaces related-party candidates under PCAOB AS 2410: two entities controlled (directly or indirectly) by an overlapping set of persons.

// Q2: Common owners across two entities
// For the worked example, set the two target entity uids:
:param entity_a_uid => 'E-0042';
:param entity_b_uid => 'E-0058';

MATCH (owner:Person)-[:OWNS*1..10]->(:Entity {uid: $entity_a_uid})
MATCH (owner)-[:OWNS*1..10]->(:Entity {uid: $entity_b_uid})
RETURN DISTINCT owner.full_name, owner.jurisdiction
LIMIT 100;

Two MATCH clauses against the same owner node already enforce reachability to both targets; the result-set is the set of persons that reach both. The DISTINCT collapses any owner who reaches a target through multiple paths into a single row. The forthcoming Community Detection for Related-Party Clusters article’s community-detection methodology extends this from pairwise to cluster-level, when the question is not “do these two share an owner” but “which entities form a related-party cluster across the portfolio.”

Query 3: Aggregate exposure to a designated source set (e.g., sanctioned persons). This is the OFAC 50 Percent Rule’s cascade question: given the set of sanctioned persons on the SDN list, which downstream entities are 50-percent-or-more cumulatively owned by them?

// Q3: Entities with aggregate ownership by sanctioned persons ≥ 50%
//
// For each (sanctioned source, target entity) pair, compute the maximum cumulative
// ownership across all paths from source to target. Then sum across sources.
//
// NOTE: This summation across DIFFERENT sources is the OFAC 50 Percent Rule
// "one or more" aggregation; it does NOT double-count multiple paths from the
// SAME source (the max() inside the per-source aggregate handles that).
MATCH path = (sanctioned:Person {sanctioned: true})-[:OWNS*1..10]->(target:Entity)
WITH sanctioned, target,
     reduce(pct = 1.0, r IN relationships(path) | pct * (r.percentage / 100.0)) AS path_fraction
WITH sanctioned, target, max(path_fraction) AS source_cumulative_fraction
WITH target, sum(source_cumulative_fraction) AS aggregated_sanctioned_fraction
WHERE aggregated_sanctioned_fraction >= 0.50
RETURN target.legal_name,
       target.jurisdiction,
       round(aggregated_sanctioned_fraction * 100, 2) AS aggregated_sanctioned_ownership_pct
ORDER BY aggregated_sanctioned_ownership_pct DESC;

The two-stage aggregation matters. Inside each (sanctioned-source, target) pair, max(path_fraction) reports the strongest path from that source to that target; across sanctioned sources, sum(source_cumulative_fraction) aggregates per the 50 Percent Rule’s “one or more sanctioned persons” framing. The outer round(... * 100, 2) converts the resulting fraction back to a percentage for reporting and workpaper readability. This is a first-cut filter that can over-flag when different sanctioned sources reach the target through shared intermediate nodes. The principled extension is the Personalized PageRank treatment in the forthcoming Random Walks, PageRank, and Personalized PageRank for Cascade-Exposure Ranking article, which handles the cascade structure with the random-walk operator on the ownership graph. For production AML / KYC pipelines, the practical pattern is to use Query 3 as a triage filter and that article’s PPR for the final exposure ranking.
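The same max-then-sum logic can be checked by hand on a few path rows. This is an illustrative sketch only: the function name aggregate_sanctioned_exposure and the path fractions are hypothetical, and the input is assumed to be one (source, target, path_fraction) tuple per path, as Query 3 would compute before aggregating.

```python
from collections import defaultdict


def aggregate_sanctioned_exposure(path_rows, threshold=0.50):
    """Stage 1: max fraction per (source, target). Stage 2: sum across sources."""
    per_pair = {}
    for source, target, fraction in path_rows:
        key = (source, target)
        per_pair[key] = max(per_pair.get(key, 0.0), fraction)

    per_target = defaultdict(float)
    for (_source, target), fraction in per_pair.items():
        per_target[target] += fraction

    return {t: round(f, 4) for t, f in per_target.items() if f >= threshold}


# Illustrative path fractions (not generator output).
path_rows = [
    ("P-0001", "E-0042", 0.30),  # strongest path from P-0001
    ("P-0001", "E-0042", 0.10),  # weaker path from the same source: ignored by max
    ("P-0002", "E-0042", 0.25),
    ("P-0002", "E-0007", 0.40),  # below threshold on its own
]
flagged = aggregate_sanctioned_exposure(path_rows)
print(flagged)
```

E-0042 is flagged at 0.30 + 0.25 = 0.55, while the second P-0001 path contributes nothing because only the per-source maximum is carried into the sum.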

Audit evidence and controls over the graph itself

The graph is not just an analytical tool: when its output supports an audit assertion or a regulatory determination, it becomes part of the audit-evidence chain under PCAOB AS 1105 (Audit Evidence). Two operational consequences follow.

Sufficient and appropriate evidence requirements. Under AS 1105, audit evidence must be both sufficient (quantity) and appropriate (relevance and reliability). A query result generated against a graph whose provenance the auditor cannot trace fails the appropriate-evidence bar regardless of the query’s correctness. Three documentation disciplines convert graph output into audit-appropriate evidence: (1) source-data lineage: every node and edge traces to a specific source (OpenCorporates filing reference, jurisdiction-issued registration, internal CRM record, management representation) recorded as a property on the node or edge; (2) ingestion-batch identifiers: every load operation produces an immutable batch record with timestamp, source URL or file hash, and operator identity; (3) query-result archival: for any query result used to support an assertion, the query text, the parameters, the execution timestamp, and the result-set hash are recorded as a workpaper artifact.
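Discipline (3) is straightforward to sketch with the standard library. The record layout and the function name archive_query_result below are assumptions for illustration, not a prescribed workpaper format; the hash is taken over a canonical JSON serialization of the rows.

```python
import hashlib
import json
from datetime import datetime, timezone


def archive_query_result(query_text, parameters, rows):
    """Build a workpaper record: query text, parameters, UTC timestamp, result hash."""
    # Canonical serialization: sorted keys, fixed separators, str() for dates etc.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"), default=str)
    return {
        "query_text": query_text,
        "parameters": parameters,
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "result_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


# Illustrative rows in the shape Q1 returns (values are made up).
rows = [{"owner.full_name": "Person P-0003", "max_cumulative_pct": 41.5}]
record = archive_query_result("MATCH ... RETURN ...", {"target_uid": "E-0042"}, rows)
print(record["row_count"], record["result_sha256"][:12])
```

Because the hash depends on row order, the archived query should carry a deterministic ORDER BY if the same hash is expected on re-execution.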

Controls over the graph as a system. Under PCAOB AS 2110 (Identifying and Assessing Risks of Material Misstatement) and the SOC 1 / SOC 2 controls framework where applicable, the graph itself is an IT system whose controls the auditor must understand. The minimum control set: access controls (who can write to the graph; the read-only-for-analysts pattern); change management (schema changes are version-controlled; the change log is auditable); backup and recovery (the graph state is restorable to a documented prior point); segregation of duties (the practitioner who loads data is not the practitioner who runs the queries that support audit conclusions). For Neo4j AuraDB the managed-service provider documents many of these controls under its SOC 2 reports; for self-hosted deployments, the firm’s own IT-controls program covers them. Either way, the auditor’s understanding of these controls is part of the AS 2110 risk-assessment file.

The practical consequence: a beneficial-ownership query result is not, by itself, audit evidence. The query result plus the lineage of the underlying graph data plus the documented controls over the graph system together constitute audit-appropriate evidence. The Cypher idioms above are necessary; they are not sufficient. A workpaper template that operationalizes this documentation chain will accompany the article series repository when it is published.

Where graph algorithms beyond Cypher help

Path-pattern queries cover the explicit-question DD tasks. Graph algorithms cover the population-level questions: which entities are most central to the ownership network, which clusters exist, which paths are bottlenecks. Three algorithms recur across the sub-series.

PageRank and Personalized PageRank. The PageRank value $r$ for the graph satisfies the eigenvector equation $r = (1-d)\mathbf{1}/n + d M^\top r$, where $d \in [0,1]$ is the damping factor (typically 0.85), $M$ is the row-stochastic normalization of the adjacency matrix, and $n$ is the node count. Personalized PageRank substitutes a source-concentrated restart distribution $p$ for the uniform $\mathbf{1}/n$ term, so $r = (1-d)p + d M^\top r$; the resulting scores measure each node’s exposure to the source set $p$. For the OFAC application, $p$ concentrates probability mass on sanctioned persons, and the PPR scores measure cascade exposure. The Neo4j Graph Data Science library exposes both via gds.pageRank.stream with sourceNodes and weighted-edge parameters.
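The fixed-point equation can be iterated directly on a toy graph. This is a minimal sketch of the equation as written above, not the GDS implementation, and GDS scores may be scaled differently; the function name personalized_pagerank and the toy graph are hypothetical. One simplifying assumption is labeled in the code: mass reaching a node with no outgoing edges is dropped rather than redistributed.

```python
def personalized_pagerank(out_edges, restart, d=0.85, iterations=100):
    """Iterate r = (1 - d) * p + d * M^T r, with M the row-normalized weights.

    out_edges: {node: {neighbour: weight}}; restart: {node: probability}.
    Assumption: nodes without outgoing edges do not redistribute their mass.
    """
    nodes = set(out_edges) | set(restart)
    for nbrs in out_edges.values():
        nodes |= set(nbrs)
    r = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - d) * restart.get(n, 0.0) for n in nodes}
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            if total == 0:
                continue
            for v, w in nbrs.items():
                nxt[v] += d * r[u] * w / total  # row-normalized transition u -> v
        r = nxt
    return r


# Toy ownership graph: one sanctioned person, one holding entity, two subsidiaries.
out_edges = {"P-0001": {"E-A": 100.0}, "E-A": {"E-B": 60.0, "E-C": 40.0}}
scores = personalized_pagerank(out_edges, restart={"P-0001": 1.0})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 6))
```

On this acyclic example the scores settle after a few iterations, and E-B ranks above E-C because it receives the larger share of E-A’s outgoing weight.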

Community detection. Louvain (Blondel et al., 2008) optimizes the modularity objective $Q = \frac{1}{2m} \sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$, where $A$ is the adjacency matrix, $k_i$ is node $i$’s degree, $m$ is the total edge weight, and $\delta(c_i, c_j) = 1$ iff nodes $i$ and $j$ share a community. Label propagation (Raghavan et al., 2007) is faster but produces less stable cluster assignments across runs. PCAOB AS 2410 related-party identification is the canonical application.
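To make the objective concrete, the sketch below evaluates $Q$ for a fixed partition of a small graph. It assumes an undirected, unweighted graph for simplicity (ownership graphs are directed and weighted), it only scores a given partition rather than searching for one as Louvain does, and the function name modularity is a hypothetical helper.

```python
def modularity(edges, community):
    """Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: {node: community label}.
    """
    m = len(edges)
    degree = {}
    adjacency = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacency[(u, v)] = adjacency.get((u, v), 0) + 1  # A is symmetric
        adjacency[(v, u)] = adjacency.get((v, u), 0) + 1
    q = 0.0
    for i in degree:
        for j in degree:
            if community[i] == community[j]:  # delta(c_i, c_j) = 1
                q += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
partition = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
print(round(modularity(edges, partition), 4))  # 0.3571
```

With one triangle per community the value is 5/14, and placing every node in a single community gives 0, which matches the intuition that modularity rewards partitions with dense internal links.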

Centrality measures. Betweenness centrality $C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$ (where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ is the number passing through $v$) identifies chokepoint entities in the ownership network. Eigenvector centrality identifies entities connected to other well-connected entities. Both are diagnostic for which entities to investigate first in a large counterparty universe.
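The betweenness definition can be computed on a small graph with Brandes’ algorithm. This is a sketch for a directed, unweighted graph, assuming hop-count shortest paths; the function name betweenness and the toy holding structure are hypothetical, and GDS’s procedures should be used for real workloads.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm on a directed, unweighted graph. adj: {node: [successors]}."""
    nodes = set(adj)
    for successors in adj.values():
        nodes |= set(successors)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Two persons own through one holding entity H, which owns two operating entities.
adj = {"P1": ["H"], "P2": ["H"], "H": ["E1", "E2"]}
scores = betweenness(adj)
print(scores["H"])  # 4.0
```

Every person-to-operating-entity path runs through H, so H carries all four source-target pairs and every other node scores zero, which is the chokepoint pattern the measure is meant to surface.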

The article series treats each of these in its own dedicated piece; the point here is that the schema designed in this article supports all of them without modification, which is the value of getting the schema right upfront.

Where the graph approach falls short

The framework’s reach has clear limits. Three are common.

Sparse data. A graph built from partial disclosures is partial. FinCEN BOI is access-restricted (the registry is not public) and, after the March 2025 interim final rule, no longer covers domestic U.S. reporting companies, so the public-source picture in 2026 is materially thinner than it was during the brief 2024 BOI-effective window. State corporate filings vary widely in officer-disclosure depth across jurisdictions, and informal control arrangements (nominee shareholders, undisclosed trust structures, oral side-agreements) are by design invisible to corporate-registry data. The graph reports what the source data discloses; sophisticated obfuscation defeats every commercial graph platform, not just Neo4j.

Jurisdictional-disclosure gaps. A U.S. Secretary of State filing is structured differently from a Cayman Islands corporate registry, which is different again from a Luxembourg disclosure. Entity-resolution across jurisdictions requires both transliteration awareness (the forthcoming Schema Design for Sanctions Screening: Modeling the OFAC SDN List as a Knowledge Graph article’s domain) and disclosure-completeness flagging: a counterparty in a jurisdiction with weak disclosure should be flagged as such, not treated as having no beneficial owners.

Temporal misalignment. A static graph answers “who owns what today” but cannot answer “who owned what on December 31, 2024” unless the schema models the time dimension explicitly. The forthcoming Temporal-Graph Patterns for Ownership Lifecycle article treats this comprehensively; for now, the practical pattern is to store an effective_date on every OWNS relationship and to maintain prior-period snapshots for audit-defensibility purposes.
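A rough sketch of an as-of selection over effective-dated records follows. It rests on assumptions that go beyond the article’s schema and are labeled as such: each record is assumed to supersede earlier records for the same owner-entity pair, and a percentage of 0 is assumed to mark a disposal. The function name edges_as_of and the sample records are hypothetical.

```python
from datetime import date


def edges_as_of(records, as_of):
    """Return {(owner_uid, entity_uid): record} effective on the as_of date.

    Assumptions: a later effective_date supersedes earlier records for the
    same pair, and percentage == 0 marks a disposal.
    """
    latest = {}
    for rec in records:
        if rec["effective_date"] > as_of:
            continue  # not yet effective on the as_of date
        key = (rec["owner_uid"], rec["entity_uid"])
        if key not in latest or rec["effective_date"] > latest[key]["effective_date"]:
            latest[key] = rec
    return {k: r for k, r in latest.items() if r["percentage"] > 0}


# Illustrative ownership history (values are made up).
records = [
    {"owner_uid": "P-0001", "entity_uid": "E-0042", "percentage": 60.0,
     "effective_date": date(2024, 1, 1)},
    {"owner_uid": "P-0001", "entity_uid": "E-0042", "percentage": 20.0,
     "effective_date": date(2024, 9, 15)},
    {"owner_uid": "P-0002", "entity_uid": "E-0042", "percentage": 80.0,
     "effective_date": date(2025, 2, 1)},
]
year_end = edges_as_of(records, date(2024, 12, 31))
print({k: r["percentage"] for k, r in year_end.items()})
```

As of December 31, 2024 only the 20 percent P-0001 holding is in force; the P-0002 acquisition is excluded because it takes effect afterwards.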

Putting the schema to work

The practitioner with access to one or more corporate-registry sources (OpenCorporates for the open-data entry tier; paid registries, direct Secretary-of-State filings, internal CRM/KYC, or authorized FinCEN BOI access for richer coverage), a Neo4j AuraDB Free instance, and roughly an hour of setup time has, after this article, the operational ability to run beneficial-ownership investigations that would otherwise require either manual spreadsheet work or a commercial-vendor relationship. The framework is not a substitute for the commercial vendors at the high end (World-Check, LexisNexis Bridger, and Refinitiv ONE have access to data sources the practitioner does not), but it is a substantive improvement over the spreadsheet baseline at the cost of a free database tier.

The next article (002) takes the same graph apparatus to transaction streams, where the diagnostic patterns shift from ownership traversal to cycle detection in time-bounded transaction graphs.

Authority:

Graph database foundations:

Robinson, I., Webber, J., & Eifrem, E. (2015). Graph Databases (2nd ed.). O’Reilly.
Angles, R., Arenas, M., et al. (2017). “Foundations of Modern Query Languages for Graph Databases.” ACM Computing Surveys, 50(5), 1-40.
ISO/IEC 39075:2024. Information technology, Database languages, GQL.

Regulatory framework:

FinCEN. (2022). Beneficial Ownership Information Reporting Rule, 31 CFR §1010.380.
OFAC. (2014). Revised Guidance on Entities Owned by Persons Whose Property and Interests in Property Are Blocked. (The 50 Percent Rule.)
FFIEC. BSA/AML Examination Manual, Customer Due Diligence and Beneficial Ownership sections.
PCAOB AS 2410, Related Parties.

Platform documentation:

Neo4j AuraDB Documentation (current release), pricing tiers, instance specifications.
Neo4j Cypher Manual, current release.
Neo4j Graph Data Science Library Documentation, gds.pageRank.stream, gds.louvain.stream, centrality procedures.
OpenCorporates API Documentation.

Reproducible code: Companion notebook and synthetic dataset will be published with the article series repository. Until the repository is live, the synthetic-data generator and Cypher queries in this article are self-contained and reproducible from the source text alone.

Sheepdog Prosperity Partners LLC
