Loading and Maintaining Production-Scale DD Graphs: APOC Procedures, Bulk Loading, Index Strategy, and Memory Tuning

The Neo4j sub-series up to this point has lived in the data-modeling and query-design layer. The methodology is correct; the queries make sense; the worked examples run in seconds. Production operation is a different problem. A DD firm whose ownership-and-transaction graph crosses 1M relationships, then 10M, then 100M, hits performance walls that have nothing to do with Cypher correctness and everything to do with index strategy, JVM heap allocation, page-cache configuration, bulk-loading methodology, and transaction-batching discipline. A round-tripping query that runs in 800ms on a 50,000-relationship synthetic graph can run in 8 minutes on a 50M-relationship production graph if any of those layers is misconfigured, and the practitioner who hits that wall typically blames Cypher rather than the operational layer.

This article walks the operational layer. Bulk loading, when to use Cypher’s LOAD CSV, when to switch to APOC’s apoc.periodic.iterate for streaming, when to escalate to the offline neo4j-admin import tool for initial bulk inserts. Index strategy, composite indexes for compound lookups, range indexes for time-windowed queries (the Wash-Sale and Layering Detection article’s use case), full-text indexes for fuzzy lookups (the Schema Design for Sanctions Screening article’s case). Memory tuning, the JVM heap vs page-cache trade-off, the empirical sizing rules for typical DD graph workloads. Query performance discipline, PROFILE every variable-length-path query, narrow with selective predicates first, use EXISTS { } subqueries to short-circuit. Each is delivered as a practitioner-deployable workflow rather than a Neo4j-internals deep-dive.

The scale wall

Empirically, DD graphs hit four scale thresholds at which different operational concerns dominate.

Under 1M relationships: Most things work without tuning. Neo4j AuraDB Free or Professional tiers handle the workload; default configurations are adequate; index strategy doesn’t matter much. The methodology articles in this sub-series were tested at this scale.

1M to 10M relationships: Index strategy starts to matter. Variable-length-path queries (round-tripping cycles, ownership cascades, layering chains) become noticeably slower without proper indexing. Bulk-loading new ingestion data via single-transaction LOAD CSV starts to hit memory limits. JVM heap tuning becomes necessary.

10M to 100M relationships: APOC procedures become required, not optional. apoc.periodic.iterate for batched updates is the standard pattern; transaction sizing matters; deadlock avoidance becomes a real concern. Page-cache sizing dominates query latency.

100M+ relationships: The offline neo4j-admin import tool is required for initial bulk loads. Index-creation strategy needs to be carefully planned (drop and recreate during bulk load, or create after the data is in place). At billion-relationship scale, single-instance Neo4j may not be the right substrate at all, the Architecture Choices for the Mid-Size DD Practice article in this sub-series treats the platform-comparison question.

The numbers are practitioner rules of thumb, not theoretical limits. Specific workloads (heavy variable-length-path queries vs simple lookup-heavy workloads) shift the thresholds materially.

Three bulk-loading patterns

Three Neo4j bulk-loading patterns dominate. Each fits a specific scale range.

LOAD CSV (Cypher built-in). Reads a CSV file directly from a URL or file path, executes a MERGE or CREATE per row, runs as a single transaction by default. Fast and simple for files up to ~1M rows. Beyond that, memory exhaustion on the single transaction becomes the failure mode.

apoc.periodic.iterate (APOC streaming). Wraps an inner query in a batched-transaction outer loop. Each batch (typically 1,000-50,000 rows) commits independently, allowing arbitrarily-large source files. The standard pattern for 1M-100M row loads.

neo4j-admin import (offline tool). A command-line tool that takes pre-validated CSV files and produces a fresh Neo4j database directly, bypassing the transaction layer entirely. Required for 100M+ row initial loads. Cannot be used to update an existing database; it’s an initial-bulk-load tool.

The decision flowchart: under 1M rows on an existing database → LOAD CSV. 1M to 100M rows on an existing database → apoc.periodic.iterate. Initial bulk load of 100M+ rows on a fresh database → neo4j-admin import to seed the database, then apoc.periodic.iterate for subsequent updates.

// Streaming bulk-load via APOC
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///ownership_2026.csv' AS row RETURN row",
  "MERGE (owner:Person {uid: row.owner_uid})
     SET owner.full_name = row.owner_name
   MERGE (entity:Entity {uid: row.entity_uid})
     SET entity.legal_name = row.entity_legal_name
   MERGE (owner)-[r:OWNS]->(entity)
     SET r.percentage = toFloat(row.percentage),
         r.valid_from = date(row.valid_from),
         r.valid_to   = CASE WHEN row.valid_to = '' THEN null ELSE date(row.valid_to) END",
  { batchSize: 50000, parallel: false, retries: 3 }
);

The parameters earn their weight. batchSize: 50000 is the empirical sweet spot, smaller batches incur transaction-overhead penalty, larger batches risk OOM on heavy MERGE work. parallel: false for MERGE-heavy loads, concurrent MERGEs against the same node deadlock; serial execution avoids this. retries: 3 provides operational resilience against transient lock-wait timeouts.

Index strategy

Three index types serve different DD-graph workloads.

Range indexes. The default Neo4j index type. Support equality and range scans on a single property. The base case for WHERE entity.legal_name = 'Acme' or WHERE trade.trade_date >= date('2024-01-01'). Most queries need one or more range indexes; the indexing-decision check is “would this query benefit from index-based lookup vs full scan?”

Composite indexes. Cover multiple properties in a single index. Critical for the AS-OF query pattern from the Temporal Graph Patterns article in this sub-series, where the predicate is valid_from <= as_of AND valid_to > as_of. A composite index on (valid_from, valid_to) is dramatically faster than two separate range indexes on the same properties.

Full-text indexes. Lucene-backed; support fuzzy matching, tokenization, case-folding, ASCII-folding. Critical for the sanctions-screening pattern from the Schema Design for Sanctions Screening article, where the predicate is “match this counterparty name against any SDN alias with edit distance ≤ 1 and phonetic similarity ≥ threshold.”

// Range index for the time-windowed cycle queries in the wash-sale article
CREATE INDEX trade_date IF NOT EXISTS FOR (t:Trade) ON (t.trade_date);

// Composite index for the AS-OF query pattern in the temporal-graph article
CREATE INDEX owns_valid_range IF NOT EXISTS FOR ()-[r:OWNS]-() ON (r.valid_from, r.valid_to);

// Full-text index for sanctions fuzzy lookup
CREATE FULLTEXT INDEX sanction_name_search IF NOT EXISTS
  FOR (n:SanctionedParty|Alias) ON EACH [n.full_name, n.normalized_name];

The indexing decision is workload-driven. After deployment, the PROFILE of representative queries surfaces operators that are doing full scans where index lookup would be 100-1000x faster. The practitioner adds the index, re-profiles, and verifies the improvement.

Memory tuning

Neo4j’s runtime memory has two primary consumers: the JVM heap (transaction working memory, query plan caches, runtime allocations) and the page cache (the in-memory cache of the on-disk graph store).

The empirical sizing rule for typical DD-graph workloads: page cache approximately equal to graph store size on disk, JVM heap approximately equal to the working-set transaction size. For a 50GB on-disk graph with typical analyst-paced workloads, page cache 50GB plus JVM heap 4-8GB is the starting point. For workloads with heavy variable-length-path queries (cycle detection, cascade traversal), heap may need to climb to 16-32GB to accommodate the in-memory plan execution.

Configuration in neo4j.conf (Neo4j 5+; parameter names shifted from 4.x):

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=50g

Signs of misconfiguration: frequent garbage-collection pauses (heap too small for the working set), query latency that’s much higher than disk-IO would suggest (page cache too small, queries thrashing the disk), or OOM kills under load (heap or pagecache too large for the physical instance).

Page-cache sizing is the larger lever. Going from a 16GB page cache to a 48GB page cache on a 40GB graph store typically produces a 5-10x query-latency improvement for read-heavy workloads, simply because the working set fits in memory and the disk-IO path is bypassed.

Transaction-batching discipline

The apoc.periodic.iterate pattern from the bulk-loading section generalizes to any large-volume operation: bulk updates, bulk deletes, bulk relationship inserts. Three parameters matter beyond batchSize.

parallel. True for read-only or independent operations; false for MERGE-heavy operations or operations that share node-write contention. Concurrent MERGEs deadlock; serial execution avoids this.

retries. Number of times a batch is retried on transient errors (lock-wait timeouts, brief connectivity hiccups). 3 is a reasonable default; higher values increase operational resilience at the cost of failure-mode masking.

concurrency. Number of concurrent worker threads when parallel: true. Defaults to 50; tuning to the physical CPU count is the right move for CPU-bound workloads.

// Bulk relationship deletion with batching
CALL apoc.periodic.iterate(
  "MATCH ()-[r:OBSOLETE_OWNS]-() RETURN r",
  "DELETE r",
  { batchSize: 10000, parallel: false, retries: 3 }
);

The batched-deletion pattern is the canonical retirement-of-old-relationships workflow when superseded ownership edges accumulate over years of temporal-graph maintenance.

Query performance discipline

Three habits earn their weight on production Cypher.

PROFILE every variable-length-path query. Run PROFILE <query> and read the operator-by-operator execution plan. Look for operators with disproportionately high “DB Hits” counts, those are the bottlenecks. Look for “All Nodes Scan” or “Label Scan” operators where index-based lookup should be available, those are missing indexes.

Narrow with selective predicates first. Cypher’s planner mostly handles predicate ordering, but variable-length-path queries benefit from explicit narrowing. Apply highly-selective predicates (specific node IDs, time-window constraints) before expanding into the variable-length traversal.

Use EXISTS { } subqueries to short-circuit. When a query needs to know whether a path exists but doesn’t need to enumerate all paths, EXISTS { ... } returns true as soon as the first match is found. The pattern is materially faster than the equivalent MATCH for cross-account control checks and similar existence questions in the Wash-Sale and Layering Detection article’s queries.

// Profile a typical round-tripping query to inspect plan + memory cost
PROFILE
MATCH cycle = (origin:Counterparty)-[:TRANSFERRED*2..5]->(origin)
WHERE ALL(r IN relationships(cycle) WHERE r.transferred_at >= datetime('2025-01-01'))
RETURN origin, length(cycle), reduce(t = 0.0, r IN relationships(cycle) | t + r.amount_usd) AS cycle_volume
LIMIT 100;

PROFILE output identifies which operator is consuming the most database hits, surfacing tuning opportunities the practitioner can’t see from query syntax alone.

Worked example

The companion repository ships a synthetic 100M-relationship ownership-and-transaction graph with realistic counterparty topology. Before tuning, a representative round-tripping query (variable-length 2-5 hop cycle detection with a 90-day time window) runs in 8.4 minutes. After applying four configuration changes, the same query runs in 480ms.

Change 1: Range index on transferred_at. Reduces the time-window scan from a full-relationship scan to an index range scan. Improvement: ~10x.

Change 2: Composite index on (transferred_at, amount_usd). Allows the planner to push the amount filter into the index scan. Improvement: ~2x.

Change 3: Page cache resized from 16GB to 48GB. Working set now fits in memory; disk-IO path is bypassed for the hot data. Improvement: ~5x.

Change 4: apoc.periodic.iterate for the result aggregation. The original query returned all matching cycles in a single result set, exhausting heap. Refactoring to batched aggregation eliminates the heap pressure. Improvement: ~2x (and removes the OOM failure mode).

Total improvement: 1050x. The numbers are workload-specific, different query patterns and graph topologies produce different speedups, but the 100x-1000x range is typical of well-targeted production tuning.

Operational integration

Production deployment of a DD graph has three durable operational concerns beyond initial setup.

Backup and restore discipline. Neo4j Aura handles this automatically with point-in-time recovery; self-hosted deployments need a neo4j-admin backup schedule. Test restore quarterly against a non-production instance to verify the backup chain actually works.

Monitoring and alerting. Page-cache hit rate, JVM heap utilization, query latency percentiles, transaction commit rate. Neo4j Operations Manager covers most metrics natively; Prometheus + Grafana on the JMX exporter is the open-source alternative.

Version-upgrade strategy. Neo4j has aggressive release cadence (LTS releases every 18-24 months). The neo4j.conf parameter names shift across major versions; APOC procedure signatures evolve; the GDS library’s algorithm interfaces change. Pin the production version in the companion repository, document the upgrade path, and plan upgrades against a non-production instance first.

The article’s specific configuration parameters and APOC procedure signatures are pinned to Neo4j 5.x current LTS at drafting time. Future versions will have different parameter names; the reader’s responsibility is to consult the current Operations Manual rather than trusting the article’s specifics three years from now.

The forthcoming Architecture Choices for the Mid-Size DD Practice article in this sub-series addresses the next-level decision, when even tuned Neo4j hits scale limits and the architecture choice pivots to alternatives (Neo4j Aura Enterprise, JanusGraph, TigerGraph, Amazon Neptune).

References

Graph database operations:

Robinson, I., Webber, J., & Eifrem, E. (2015). Graph Databases (2nd ed.). O’Reilly. Ch. 7 (Performance) and Ch. 11 (Production environments).

Neo4j documentation:

Neo4j Operations Manual (current release), Memory configuration, Indexes, Performance Tuning.
Neo4j APOC Documentation, apoc.periodic.iterate, apoc.load.csv, apoc.import.*.
Neo4j neo4j-admin import Documentation.
Neo4j Cypher Manual, PROFILE and EXPLAIN introspection.

Theory background:

Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J., & Vrgoč, D. (2017). “Foundations of Modern Query Languages for Graph Databases.” ACM Computing Surveys, 50(5), 1-40.

Reproducible code: Companion repository at github.com/noahrgreen/dd-tech-lab-companion ships the production-deployment toolkit: bulk-loading templates for each of the three patterns, a sizing-and-tuning calculator (input graph dimensions plus workload mix → recommended JVM heap, page-cache, batch sizes), the indexing-decision flowchart as a runnable rules engine, and the worked-example 100M-relationship benchmark scripts.

Loading and Maintaining Production-Scale DD Graphs: APOC Procedures, Bulk Loading, Index Strategy, and Memory Tuning

The scale wall

Three bulk-loading patterns

Index strategy

Memory tuning

Transaction-batching discipline

Query performance discipline

Worked example

Operational integration

References

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Sheepdog Prosperity Partners LLC

Contact

Schedule