The DD practice accumulates institutional knowledge in a specific shape: structured (CRM entries, ownership data, transaction records) plus unstructured (engagement memos, IC presentations, expert-call transcripts, regulatory filings). Vanilla retrieval-augmented generation handles the unstructured layer well: embed each document chunk, retrieve top-K by cosine similarity to a query, hand the chunks to an LLM. It handles the structured layer badly. Ownership chains, transaction networks, sanctions exposures, and related-party clusters are relationships between entities, not paragraphs of text, and a vector index built from embeddings of natural-language text does not represent them.

GraphRAG is the architecture pattern that combines both layers. Structured relationships live in a knowledge graph (Neo4j in this article’s framing); unstructured documents live in a vector store; the LLM at inference time retrieves from both, with the graph providing entity-relationship context and the documents providing prose grounding. The result is generation an auditor can defend: every claim traces back to a Cypher row or a document citation, and the LLM’s tendency to hallucinate relationships between entities is dampened by the explicit graph-side retrieval.

This article walks through the architecture, the worked-example pipeline, and the citation-required-generation discipline that converts GraphRAG from a clever demo to a deployable DD tool.

The DD knowledge-management problem

A typical mid-size DD practice operates with three durable knowledge layers. The structured layer holds entities, relationships, and transactions: counterparties, beneficial owners, officer roles, transaction flows, sanctions designations, regulatory filings. The unstructured layer holds prose: engagement memos, IC packages, expert-call transcripts, prior-period workpapers, regulatory commentary, market research. The temporal layer spans both — what was known about a counterparty in 2022 versus what’s known now, the ownership chain as it existed at a balance-sheet date versus today.
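The temporal layer can be made concrete with validity intervals on each relationship. The following is a minimal sketch, assuming each ownership edge carries a valid_from date and an optional valid_to date; the OwnershipEdge type, the field names, and the sample holdings are hypothetical and not taken from the companion repository.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class OwnershipEdge:
    """One ownership relationship with the period during which it held (hypothetical shape)."""
    owner: str
    owned: str
    pct: float
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still in force


def edges_as_of(edges: List[OwnershipEdge], as_of: date) -> List[OwnershipEdge]:
    """Return the edges in force on as_of (valid_from inclusive, valid_to exclusive)."""
    return [
        e for e in edges
        if e.valid_from <= as_of and (e.valid_to is None or as_of < e.valid_to)
    ]


# Illustrative data: control of Acme Holdings changes hands mid-2023.
edges = [
    OwnershipEdge("Holdco A", "Acme Holdings", 60.0, date(2021, 1, 1), date(2023, 7, 1)),
    OwnershipEdge("Holdco B", "Acme Holdings", 60.0, date(2023, 7, 1)),
]

owners_2022 = [e.owner for e in edges_as_of(edges, date(2022, 12, 31))]
owners_2023 = [e.owner for e in edges_as_of(edges, date(2023, 12, 31))]
```

The same question asked as of two balance-sheet dates returns two different owners, which is the behavior a date-unaware chunk index cannot reproduce.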

Vanilla retrieval-augmented generation addresses the unstructured layer well. A document corpus is chunked into 300-500-token windows, each chunk is embedded with a sentence-transformer or commercial embedding model, the embeddings are indexed for approximate-nearest-neighbor search, and at query time the top-K most similar chunks are retrieved and fed to an LLM along with the user’s question. For tasks that are essentially “find and summarize relevant passages,” this works.
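The retrieval step reduces to ranking chunks by cosine similarity. A minimal sketch follows, using exact search over hand-written three-dimensional vectors in place of a real embedding model and an approximate-nearest-neighbor index; the chunk texts and vectors are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy corpus: (chunk text, pretend embedding).
chunks = [
    ("memo: Acme revenue recognition", [0.9, 0.1, 0.0]),
    ("transcript: supplier concentration", [0.1, 0.9, 0.0]),
    ("filing: sanctions disclosure", [0.0, 0.2, 0.9]),
]

retrieved = top_k([1.0, 0.2, 0.0], chunks, k=2)
```

Nothing in this ranking knows that two chunks mention the same entity or that one entity owns another, which is why the approach stops at passage-finding.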

It does not work for relationship questions. “Which counterparties are indirectly owned by sanctioned persons through 50%+ ownership chains” cannot be answered by retrieving five document chunks; the answer requires graph traversal. “What was the related-party cluster around Acme Holdings as of Q4 2023” similarly requires both the temporal slice of the ownership graph and the cluster-detection algorithm — neither of which the vector index represents. Practitioners who deploy vanilla RAG against these questions find that the LLM either declines to answer (the chunks don’t contain the relationship explicitly) or, worse, hallucinates a relationship that sounds plausible but isn’t grounded in source data.
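To see why the first question needs traversal, here is a minimal in-memory sketch of it. It is a simplification under stated assumptions: edges are (owner, owned, percent) tuples, each edge is tested against the threshold on its own (no aggregation of several owners' stakes), and "indirect" means reached in two or more hops. The function name and sample data are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]  # (owner, owned, percent held)


def indirectly_controlled(edges: List[Edge],
                          sanctioned: Iterable[str],
                          threshold: float = 50.0) -> Dict[str, Set[str]]:
    """Map each entity reached via 2+ hops of >= threshold edges to the sanctioned owners behind it."""
    children: Dict[str, List[str]] = {}
    for owner, owned, pct in edges:
        if pct >= threshold:  # keep only edges that individually meet the threshold
            children.setdefault(owner, []).append(owned)

    found: Dict[str, Set[str]] = {}
    for person in sanctioned:
        seen = {person}
        queue = deque([(person, 0)])
        while queue:  # breadth-first walk down the ownership chain
            node, depth = queue.popleft()
            for child in children.get(node, []):
                if child in seen:
                    continue
                seen.add(child)
                if depth + 1 >= 2:  # depth 1 is direct ownership, so skip it
                    found.setdefault(child, set()).add(person)
                queue.append((child, depth + 1))
    return found


edges = [
    ("Person P1", "HoldCo", 100.0),
    ("HoldCo", "OpCo", 60.0),
    ("OpCo", "Sub", 51.0),
    ("Person P1", "MinorStake", 20.0),   # below threshold, ignored
    ("Other Owner", "Independent", 100.0),
]

exposure = indirectly_controlled(edges, ["Person P1"])
```

OpCo and Sub surface only because the walk follows three edges in sequence; no single memo chunk has to state the full chain for the answer to be found.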

GraphRAG architecture

GraphRAG fixes the gap by retrieving from both a knowledge graph and a vector store at inference time. The reference implementation pattern in LangChain pairs GraphCypherQAChain (LLM-generated Cypher executed against Neo4j) with Neo4jVector (vector index over text chunks stored as Neo4j node properties). The graph-side retrieval answers relationship questions through structured queries; the vector-side retrieval surfaces prose citations that ground the LLM’s response.

import os

from langchain_anthropic import ChatAnthropic
from langchain_neo4j import GraphCypherQAChain, Neo4jGraph, Neo4jVector
from langchain_openai import OpenAIEmbeddings  # successor to the deprecated langchain_community class

graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
)

vectorstore = Neo4jVector.from_existing_index(
    embedding=OpenAIEmbeddings(),
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    index_name="document_chunks",
    text_node_property="text",
)

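graph_qa = GraphCypherQAChain.from_llm(
    llm=ChatAnthropic(model="claude-sonnet-4-5", temperature=0),
    graph=graph,
    verbose=True,
    return_intermediate_steps=True,
    # Recent releases refuse to build the chain without this explicit opt-in,
    # because LLM-written Cypher runs with the connection's privileges.
    # Pair it with a read-only Neo4j user.
    allow_dangerous_requests=True,
)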

def graphrag_answer(question: str) -> dict:
    """Hybrid retrieval: graph relationships + document citations."""
    graph_result = graph_qa.invoke({"query": question})
    docs = vectorstore.similarity_search(question, k=5)
    return {
        "graph_answer": graph_result["result"],
        "graph_cypher": graph_result["intermediate_steps"][0]["query"],
        "graph_data": graph_result["intermediate_steps"][1]["context"],
        "document_citations": [
            {"text": d.page_content, "source": d.metadata.get("source")}
            for d in docs
        ],
    }

Three design choices in this snippet matter operationally. The same Neo4j instance hosts both the structured graph and the vector store, which collapses the deployment surface from two databases to one. The LLM temperature is set to zero because deterministic Cypher generation is the goal, not creative phrasing. The return_intermediate_steps=True flag exposes the Cypher that the LLM generated, which is the audit artifact — the engagement reviewer needs to see what query was actually run, not just the prose answer.

Vector store as Neo4j node-property

Neo4j’s vector-index capability (introduced in 5.11) lets document chunks live as nodes with a text property and an embedding property. The vector index sits on the embedding property; similarity queries traverse the index and return matching nodes. Co-locating the vector store and the relationship graph in the same instance has three benefits.

First, a single backup and access-control surface covers both layers. Second, hybrid queries that combine vector similarity with graph traversal (“find document chunks similar to this question, then for each chunk follow the MENTIONS relationship to the entities discussed in it”) become a single Cypher query rather than a two-system join. Third, the operational layer (memory tuning, index strategy, and transaction discipline, the subject of the forthcoming Loading and Maintaining Production-Scale DD Graphs article in this sub-series) covers both workloads with one set of decisions.
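The hybrid query described in the second benefit might look like the sketch below. It relies on Neo4j's db.index.vector.queryNodes procedure and the MENTIONS relationship named above; the chunk.source and entity.name properties, and the hybrid_params helper, are assumptions about the schema rather than details from the companion repository.

```python
from typing import Any, Dict, Sequence

# One round trip: vector lookup first, then a hop to the entities each chunk mentions.
HYBRID_QUERY = """
CALL db.index.vector.queryNodes($index_name, $k, $query_embedding)
YIELD node AS chunk, score
OPTIONAL MATCH (chunk)-[:MENTIONS]->(entity)
RETURN chunk.text AS text,
       chunk.source AS source,
       score,
       collect(DISTINCT entity.name) AS mentioned_entities
ORDER BY score DESC
"""


def hybrid_params(query_embedding: Sequence[float],
                  k: int = 5,
                  index_name: str = "document_chunks") -> Dict[str, Any]:
    """Build the parameter map for HYBRID_QUERY (hypothetical helper)."""
    return {
        "index_name": index_name,
        "k": k,
        "query_embedding": list(query_embedding),
    }


params = hybrid_params([0.12, 0.53, 0.31], k=3)
```

With a live connection, the pair could be executed through the Neo4jGraph wrapper as graph.query(HYBRID_QUERY, params=params), with the query embedding produced by the same model that populated the index.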

The trade-off is throughput at very large scale. A dedicated vector database (Pinecone, Weaviate, Qdrant) outperforms Neo4j’s vector index when the document corpus exceeds 10-20 million chunks and the query rate is sustained at hundreds of QPS. For a typical mid-size DD practice with hundreds of thousands to single-digit-millions of document chunks and analyst-paced query rates, the single-Neo4j architecture is the right choice.

LLM as Cypher generator

The hardest part of GraphRAG in practice is reliable Cypher generation. The LLM has to take a natural-language question, understand the graph schema, and produce a valid Cypher query that retrieves the right data. Three practitioner patterns are necessary.

Schema-injection prompting. The graph’s schema (node labels, relationship types, property names) is passed to the LLM as part of the system prompt. LangChain’s GraphCypherQAChain does this automatically by introspecting the graph; for production deployments, a curated schema description with example queries tends to produce better generation than the raw introspection output.

Temperature zero. Cypher generation is a deterministic-target task; the same question should produce the same query. Setting temperature to zero removes most of the variance that creative-temperature settings introduce. It does not guarantee identical output on every call (hosted models, Anthropic’s included, can still vary slightly at temperature zero), so treat the logged Cypher, not an assumption of repeatability, as the record of what ran.

Few-shot exemplars. Three to five exemplar question-Cypher pairs in the system prompt anchor the LLM’s generation style. Pick exemplars that cover the graph’s core query patterns — beneficial-ownership traversal, time-windowed transaction lookup, sanctions screening join — so the LLM has a precedent for each major category.
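Schema injection and few-shot exemplars can be combined in one prompt builder. The sketch below assumes a small hand-curated schema and three exemplars, one per query category named above; every label, relationship type, property, and exemplar query is illustrative and would be replaced with the practice's own.

```python
from typing import List, Tuple

# Curated schema description (hypothetical labels and properties).
CURATED_SCHEMA = """\
Node labels: Person(name), Entity(name, jurisdiction), SanctionsListing(program)
Relationships:
  (Person|Entity)-[:OWNS {pct, valid_from, valid_to}]->(Entity)
  (Person|Entity)-[:LISTED_ON]->(SanctionsListing)
"""

# One exemplar per core pattern: ownership traversal, time-windowed lookup, sanctions join.
EXEMPLARS: List[Tuple[str, str]] = [
    ("Who directly owns Acme Holdings?",
     "MATCH (o)-[r:OWNS]->(e:Entity {name: 'Acme Holdings'}) RETURN o.name, r.pct"),
    ("Who owned Acme Holdings on 2023-12-31?",
     "MATCH (o)-[r:OWNS]->(e:Entity {name: 'Acme Holdings'}) "
     "WHERE r.valid_from <= date('2023-12-31') "
     "AND (r.valid_to IS NULL OR r.valid_to > date('2023-12-31')) "
     "RETURN o.name, r.pct"),
    ("Which entities sit below a sanctioned person?",
     "MATCH (p:Person)-[:LISTED_ON]->(:SanctionsListing) "
     "MATCH (p)-[:OWNS*1..4]->(e:Entity) RETURN DISTINCT p.name, e.name"),
]


def build_cypher_prompt(question: str,
                        schema: str = CURATED_SCHEMA,
                        exemplars: List[Tuple[str, str]] = EXEMPLARS) -> str:
    """Assemble instructions, schema, exemplars, and the new question into one prompt string."""
    shots = "\n\n".join(f"Question: {q}\nCypher: {c}" for q, c in exemplars)
    return (
        "Generate one read-only Cypher query for the question.\n"
        "Use only the labels, relationship types, and properties in the schema.\n\n"
        f"Schema:\n{schema}\n"
        f"Examples:\n{shots}\n\n"
        f"Question: {question}\nCypher:"
    )


prompt = build_cypher_prompt("Who owns OpCo through HoldCo?")
```

In current langchain-neo4j releases GraphCypherQAChain.from_llm takes a cypher_prompt argument that can carry a template of this shape; check the installed version's signature before relying on it.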

The combined pattern produces correct Cypher on the first attempt in roughly 85-92% of typical DD questions on our internal benchmarks. The 8-15% failure rate breaks down into three classes: schema-misunderstanding (the LLM uses a property name that doesn’t exist), syntax errors (rare with modern frontier models), and semantic-misalignment (the Cypher is valid but doesn’t actually answer the question). The first two are caught by Cypher validation before execution; the third requires the analyst-review discipline in the “Operational integration” section below.
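A pre-execution check for the schema-misunderstanding class can be as simple as comparing the names in the generated query with an allow-list. The sketch below is a regex-based approximation under stated assumptions, not a Cypher parser: the allow-lists are hypothetical, it only inspects plain MATCH-style queries, and it ignores properties written inside map literals such as {name: 'x'}. Syntax errors are better caught by asking the database to plan the query without running it.

```python
import re
from typing import List

# Hypothetical allow-lists; in practice these would come from the curated schema.
ALLOWED_LABELS = {"Person", "Entity", "SanctionsListing"}
ALLOWED_RELS = {"OWNS", "LISTED_ON"}
ALLOWED_PROPS = {"name", "pct", "jurisdiction", "program", "valid_from", "valid_to"}

WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b", re.IGNORECASE
)


def validate_cypher(query: str) -> List[str]:
    """Return a list of problems found; an empty list means the query passed these checks."""
    # Blank out string literals so their contents are not mistaken for identifiers.
    stripped = re.sub(r"'[^']*'", "''", query)
    stripped = re.sub(r'"[^"]*"', '""', stripped)

    problems: List[str] = []
    if WRITE_CLAUSES.search(stripped):
        problems.append("write clause not allowed")
    for label in re.findall(r"\(\s*\w*\s*:\s*(\w+)", stripped):
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label}")
    for rel in re.findall(r"\[\s*\w*\s*:\s*(\w+)", stripped):
        if rel not in ALLOWED_RELS:
            problems.append(f"unknown relationship type: {rel}")
    for prop in re.findall(r"\b[A-Za-z_]\w*\.(\w+)", stripped):
        if prop not in ALLOWED_PROPS:
            problems.append(f"unknown property: {prop}")
    return problems


good = validate_cypher(
    "MATCH (p:Person)-[r:OWNS]->(e:Entity) WHERE r.pct >= 50 RETURN p.name, e.name"
)
bad = validate_cypher(
    "MATCH (p:Person)-[:CONTROLS]->(e:Entity) RETURN e.legal_name"
)
```

A non-empty result can be fed back to the LLM for one regeneration attempt or routed straight to the analyst; either way the query never reaches the database.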

Citation-required generation

The editorial center of GraphRAG is the discipline that every claim in the LLM’s response is grounded in either a Cypher row or a document citation. The mechanism is a system-prompt constraint: “For every factual claim in your answer, cite either (a) the specific row from the graph query result or (b) the specific document chunk by source identifier. If a claim cannot be cited, do not include it.”

CITATION_REQUIRED_PROMPT = """
You are a DD analyst's research assistant. Answer the question using ONLY the
graph query result and document citations provided. For every factual claim:
  - If the claim is from the graph result, cite the row index, e.g., [graph:3].
  - If the claim is from a document, cite the source, e.g., [doc:engagement_memo_2024_Q3].
If a question cannot be answered from the provided context, say so directly.
Do NOT add background knowledge, plausible inferences, or relationships not
explicitly present in the graph result.

Question: {question}
Graph result: {graph_data}
Document chunks: {document_citations}
"""

This constraint converts the LLM from a generative-answer machine into a research-summarizer that the auditor can defend. Every claim should trace to a verifiable source; a claim without one is a visible defect for the reviewer to reject rather than a silent hallucination. The pattern aligns with the documentation discipline that PCAOB AS 1215 and the equivalent FFIEC examination expectations require: every conclusion traceable to its evidence.
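Because a prompt is an instruction and not a guarantee, the citation rule is worth checking mechanically before an answer reaches the analyst. The sketch below assumes the [graph:N] and [doc:source] markers from the prompt above, zero-based row indices, and citations placed before each sentence's closing punctuation; the audit_citations name and the sample answer are hypothetical.

```python
import re
from typing import List, Sequence, Set

CITATION = re.compile(r"\[(graph|doc):([^\]]+)\]")


def audit_citations(answer: str,
                    graph_rows: Sequence[object],
                    doc_sources: Set[str]) -> List[str]:
    """Flag sentences with no citation and citations that point at nothing."""
    issues: List[str] = []
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cites = CITATION.findall(sentence)
        if not cites:
            issues.append(f"uncited: {sentence}")
            continue
        for kind, ref in cites:
            if kind == "graph":
                # Row index must be a number inside the returned result set.
                if not ref.isdigit() or int(ref) >= len(graph_rows):
                    issues.append(f"bad graph row: {ref}")
            elif ref not in doc_sources:
                issues.append(f"unknown document: {ref}")
    return issues


answer = (
    "Holdco B owns 60% of Acme Holdings [graph:0]. "
    "The Q3 memo flags a pending restructuring [doc:engagement_memo_2024_Q3]. "
    "Acme is likely to be acquired."
)
issues = audit_citations(
    answer,
    graph_rows=[{"owner": "Holdco B", "pct": 60}],
    doc_sources={"engagement_memo_2024_Q3"},
)
```

The check confirms only that citations exist and resolve; whether the cited row or chunk actually supports the sentence remains a human judgment. A legitimate "cannot be answered from the provided context" reply would also be flagged as uncited and needs its own exemption.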

Microsoft GraphRAG comparison

Microsoft’s GraphRAG (Edge et al., 2024) is a different architecture pattern from the LangChain baseline above. The Microsoft approach pre-computes community summaries from the knowledge graph at multiple resolutions (high-level → community-level → entity-level), and at query time selects the appropriate summary level for the question. It excels at query-focused summarization over very large document corpora where the LangChain baseline’s per-query Cypher generation becomes a bottleneck.

For DD use cases, the LangChain baseline is usually the right starting point. The graph is small enough (sub-million entities) that per-query Cypher is fast; the document corpus is small enough (sub-million chunks) that pre-computing community summaries adds engineering complexity without proportional benefit; and the analyst’s question patterns are diverse enough that any single summary resolution would underserve some categories of question.

Microsoft GraphRAG becomes the right pattern when the document corpus crosses 5-10 million chunks, when query latency budgets fall below 500ms (the LLM Cypher round-trip is the bottleneck), or when the typical user is asking summarization questions rather than relationship questions. Re-evaluate the architecture choice when those thresholds are crossed; the migration path is meaningful but not prohibitive.

Worked example

The companion repository contains a synthetic engagement archive: 50 prior-DD memos covering 35 counterparties, an ownership graph of 200 entities and 80 persons with seeded multi-level chains, an LLM-driven query interface, and a benchmark suite of 30 graded questions. The benchmark spans three difficulty bands.

Easy (10 questions): Direct lookups answerable from a single graph relationship or a single document chunk. Both vanilla RAG and GraphRAG score above 95%.

Medium (15 questions): Hybrid questions requiring both a graph traversal and a document citation. GraphRAG scores 91%; vanilla RAG scores 67%, with most errors being hallucinated relationships.

Hard (5 questions): Multi-hop relationship questions with temporal constraints (e.g., “what was the related-party cluster around Acme Holdings as of Q4 2023, and which engagement memos discussed each cluster member”). GraphRAG scores 80%; vanilla RAG scores 20%.

The benchmark is open-source in the companion repository so practitioners can extend it to their own engagement archives and validate the architecture before committing to a production deployment.

Operational integration

Deployment for a small DD firm typically follows this pattern: managed Neo4j AuraDB (Professional tier, ~$200/month at the document-and-graph sizes typical of a single-engagement-portfolio firm), Anthropic API access for the LLM (~$0.10-0.30 per query at typical question complexity using Claude Sonnet 4.5 with extended thinking off), OpenAI embeddings for the vector store (~$0.0001 per 1,000 tokens of document text), and LangChain as the orchestration layer.
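The figures above combine into a rough budget. The sketch below uses the article's own unit prices; the query volume, chunk count, and 400-token average chunk size passed in are illustrative assumptions, and the estimate_costs name is hypothetical.

```python
from typing import Dict


def estimate_costs(queries_per_month: int,
                   chunk_count: int,
                   tokens_per_chunk: int = 400) -> Dict[str, float]:
    """Rough cost estimate in USD from the unit prices quoted in the text."""
    auradb_monthly = 200.00          # managed Neo4j, Professional tier
    llm_per_query_low = 0.10         # LLM cost per query, low end
    llm_per_query_high = 0.30        # LLM cost per query, high end
    embed_per_1k_tokens = 0.0001     # embedding cost per 1,000 tokens

    # Embedding the corpus is a one-time cost (re-incurred only on re-indexing).
    embedding_one_time = chunk_count * tokens_per_chunk / 1000 * embed_per_1k_tokens
    monthly_low = auradb_monthly + queries_per_month * llm_per_query_low
    monthly_high = auradb_monthly + queries_per_month * llm_per_query_high
    return {
        "embedding_one_time": round(embedding_one_time, 2),
        "monthly_low": round(monthly_low, 2),
        "monthly_high": round(monthly_high, 2),
    }


budget = estimate_costs(queries_per_month=500, chunk_count=200_000)
```

For 500 queries a month over 200,000 chunks, this works out to about $8 of one-time embedding cost and $250 to $350 per month, with the LLM calls rather than the database driving the variable part.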

The analyst-review workflow is the operational center. Every GraphRAG answer is staged with its Cypher artifact, its document citations, and the LLM’s prose response together. The analyst’s review job is to verify three things: the Cypher actually answers the question, the Cypher results match the prose claims about graph data, and the document citations actually contain the prose claims attributed to them. When all three check out, the answer goes into the engagement workpaper; when any one fails, the answer is rejected and the prompt or schema is refined.
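The three-part review can be captured as a small record so that acceptance is explicit and auditable. This is a minimal sketch; the ReviewRecord type and its field names are hypothetical, and a production version would also carry reviewer identity and timestamps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewRecord:
    """One staged GraphRAG answer plus the analyst's three verification checks."""
    question: str
    cypher: str
    answer: str
    citations: List[str] = field(default_factory=list)
    cypher_answers_question: bool = False
    results_match_prose: bool = False
    citations_support_claims: bool = False

    def failed_checks(self) -> List[str]:
        """Names of the checks the analyst has not signed off."""
        checks = {
            "cypher_answers_question": self.cypher_answers_question,
            "results_match_prose": self.results_match_prose,
            "citations_support_claims": self.citations_support_claims,
        }
        return [name for name, passed in checks.items() if not passed]

    @property
    def accepted(self) -> bool:
        """True only when all three checks pass; anything less is a rejection."""
        return not self.failed_checks()


record = ReviewRecord(
    question="Who owned Acme Holdings on 2023-12-31?",
    cypher="MATCH (o)-[r:OWNS]->(e:Entity {name: 'Acme Holdings'}) RETURN o.name, r.pct",
    answer="Holdco B owned 60% of Acme Holdings [graph:0].",
    citations=["graph:0"],
    cypher_answers_question=True,
    results_match_prose=True,
)
```

Defaulting every check to False means an answer that nobody reviewed can never count as accepted, and the list of failed checks tells the team whether to refine the prompt or the schema.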

This is the discipline that converts GraphRAG from a productivity demo to a deployable DD tool. GraphRAG is research-and-synthesis acceleration, not decision-making by the LLM. Final-form regulatory filings, audit conclusions, sanctions-screening determinations (the domain of the Schema Design for Sanctions Screening article in this sub-series), and anything that requires sign-off by a credentialed professional under their license remain firmly in the analyst’s column. The LLM proposes; the credentialed practitioner disposes.


References

RAG and GraphRAG architecture:

  • Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems, 33, 9459-9474.
  • Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv:2404.16130.
  • Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” Proceedings of EMNLP 2020, 6769-6781.

Graph databases:

  • Robinson, I., Webber, J., & Eifrem, E. (2015). Graph Databases (2nd ed.). O’Reilly.

Implementation references:

  • LangChain Neo4j Integration Documentation — langchain-neo4j package; GraphCypherQAChain and Neo4jVector class references.
  • Anthropic Claude Prompt Engineering Documentation.
  • Neo4j Vector Index Documentation (release 5.11+).

Audit framing:

  • PCAOB AS 1215 — Audit Documentation.

Reproducible code: Companion repository at github.com/noahrgreen/dd-tech-lab-companion ships the full GraphRAG demo: synthetic engagement-memo corpus, ownership-graph generator, LangChain + Neo4j + Anthropic Claude orchestration, citation-required prompt template, and the 30-question benchmark suite measuring graph-grounding accuracy versus a vanilla-RAG baseline.