What you’ll be able to do after reading this:

  • Read an AIP prompt specification and understand what structured output it produces.
  • Recognize the four-gate validation pattern (schema validate, citation substring, hallucination verifier-LLM, insufficient-corpus check) and what each gate prevents.
  • Read precision/recall metrics and translate them to the practitioner’s “how often does the system get it right” intuition.
  • Document an LLM-augmented adverse-media review in workpapers that satisfy SR 11-7 §VI model governance and FFIEC BSA/AML adverse-media-review expectations.
  • Maintain the discipline that AIP augments, and never substitutes for, analyst judgment.

Terms anchored before the article walks them. First-use acronym expansions matter here because LLM technology brings its own vocabulary that did not exist in CPA / CFE training programs five years ago.

  • AIP (Artificial Intelligence Platform) — Palantir’s bundled LLM-orchestration product. The “AI” in AIP refers to large language models (LLMs), specifically the kind deployed for structured-output text tasks like summarization.
  • LLM (Large Language Model) — a class of statistical models that produce text given text input. The model the institution interacts with is hosted by a provider (Anthropic, OpenAI, Mistral, others); AIP routes prompts to the configured provider.
  • Grounding — the discipline of requiring the model’s output to be tied to specific source documents the analyst can verify. The opposite is “hallucination,” where the model produces text that sounds correct but is not supported by any source.
  • Structured output — the requirement that the model produce a specific JSON schema with declared fields, not free-form prose. The platform validates the structure; outputs that fail validation are rejected.
  • Verifier-LLM — a second LLM call (typically a cheaper model) that evaluates whether each factual assertion the first LLM made is actually supported by the cited source passage. The two-pass pattern trades latency for confidence.
  • Precision / Recall — measurement metrics for binary classification. Precision = $\frac{\text{true positives}}{\text{true positives} + \text{false positives}}$ (of the items the system flagged, what fraction were correctly flagged). Recall = $\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$ (of the items that should have been flagged, what fraction were actually caught).
  • FFIEC (Federal Financial Institutions Examination Council) — the U.S. inter-agency body that publishes the BSA/AML Examination Manual that AML programs are examined against.
  • BSA/AML (Bank Secrecy Act / Anti-Money Laundering) — the federal regulatory framework requiring financial institutions to maintain programs for detecting and reporting suspected money laundering.
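As a quick numerical check on the precision and recall definitions above, the following minimal Python sketch computes both from raw counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the function name is illustrative, not part of AIP.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: the system flagged 100 items, 90 of them correctly,
# and missed 30 items that should have been flagged.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practitioner terms, these hypothetical numbers read as "nine of every ten flags were right, and three of every four items that should have been flagged were caught."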

Adverse-media review at institutional scale is labor-bounded. A single high-risk counterparty in an active jurisdiction can generate fifty to several hundred news mentions across a relevant lookback window. A portfolio of several thousand counterparties refreshed quarterly produces a corpus operationally intractable for manual review. The analyst who tries to read everything within the time available reads almost nothing carefully; the analyst who triages to a manageable subset by ad-hoc rules introduces selection bias the institution cannot defend.

The AIP value proposition for due diligence is to compress the labor. Feed the counterparty’s news corpus to a large language model. Get back a 200-word risk-rating-relevant synthesis with citations to the source articles. Let the human investigator focus attention on the synthesis rather than the raw corpus. The compression is real, and the operational payoff at portfolio scale is substantial.

The failure modes are also real and have been well-characterized in the academic and operational literature. Hallucinated facts: the model asserts something the source articles do not actually say. Missed key articles: the model summarizes the volume but misses the one buried lawsuit that should drive the rating. Citation errors: the model cites an article but the assertion does not appear in the cited paragraph at all. Each failure is individually rare in well-prompted systems, but at scale and with regulator-relevant consequences, even rare failures aggregate into material risk.

This article walks the prompt patterns, grounding strategies, and output-validation controls that make AIP-generated adverse-media summaries defensible enough to ship to a credit committee or a regulatory examiner. The patterns are model-agnostic — they work whether the institution’s AIP deployment routes to Anthropic, OpenAI, Mistral, or another provider. The discipline is the value-add; the specific model is secondary.

One framing the reader should hold throughout: an AIP-generated summary augments analyst judgment. It does not substitute for it. The pattern this article walks treats the summary as an evidentiary input that an analyst evaluates and acts on, not as an automated determination the institution implements. The distinction is what makes the workflow defensible under SR 11-7 / OCC 2011-12 model-risk-management expectations and under FFIEC BSA/AML examination scrutiny. Without the human-in-the-loop framing, the institution has automated a regulated workflow in ways the regulators have not blessed; with it, the institution has scaled a regulated workflow without changing the regulatory locus of the decision.

The adverse-media review problem at scale

The operational tractability problem is the foundation. A bank-group’s commercial portfolio of ten thousand legal-entity customers, refreshed quarterly, with average adverse-media density of ten articles per counterparty per refresh cycle, generates one hundred thousand articles per quarter — four hundred thousand per year. At an average analyst-reading speed of two hundred words per minute against five-hundred-word articles, the raw review labor is two and a half minutes per article, or roughly seventeen thousand analyst-hours per year. The institution staffing this work end-to-end without compression needs the equivalent of eight to ten full-time analysts on adverse-media review alone — and that is before considering the cost of summarizing findings into rating-decision input for the next analytical step.
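The labor arithmetic above can be reproduced in a few lines. This is a sketch using the figures stated in the paragraph; the 1,800 productive hours per analyst-year is an assumption introduced here to convert hours to headcount, not a number from the article.

```python
counterparties = 10_000
refreshes_per_year = 4          # quarterly refresh
articles_per_refresh = 10       # average adverse-media density per counterparty
words_per_article = 500
words_per_minute = 200
hours_per_fte_year = 1_800      # ASSUMPTION: productive hours per analyst per year

articles_per_quarter = counterparties * articles_per_refresh
articles_per_year = articles_per_quarter * refreshes_per_year
minutes_per_article = words_per_article / words_per_minute
analyst_hours_per_year = articles_per_year * minutes_per_article / 60
fte_equivalent = analyst_hours_per_year / hours_per_fte_year

print(f"articles per quarter: {articles_per_quarter:,}")
print(f"articles per year:    {articles_per_year:,}")
print(f"minutes per article:  {minutes_per_article}")
print(f"analyst-hours/year:   {analyst_hours_per_year:,.0f}")
print(f"FTE equivalent:       {fte_equivalent:.1f}")
```

Changing the hours-per-analyst assumption moves the headcount figure, but it stays in the eight-to-ten range for any value between roughly 1,700 and 2,000 hours.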

The institution can compress through any of several strategies. Triage rules can pre-filter the corpus — drop articles below some relevance threshold, drop duplicates of single events covered across many outlets, drop articles where the counterparty is mentioned only in passing. Each triage rule is a defensible heuristic but also a source of false negatives: the one buried mention of an emerging-litigation matter that does not match any heuristic’s keyword set gets dropped.

AIP-based summarization is a different compression strategy. Rather than triaging the corpus to a smaller set the analyst still reads in full, AIP produces a synthesis the analyst reads in lieu of the corpus. The synthesis is structured (per the schema this article walks), grounded in specific source passages, and constructed under controls that prevent the worst failure modes from reaching the analyst’s screen unflagged. Done well, the synthesis takes the analyst from twenty-five minutes of corpus reading per counterparty to two minutes of synthesis reading plus targeted drill-down. The portfolio-level labor compression is roughly an order of magnitude.

Done poorly, the synthesis adds risk without subtracting labor: the analyst still has to verify the summary because they cannot trust the summary, and the verification step takes nearly as long as reading the corpus would have taken. The discipline this article walks is what distinguishes the two cases.

AIP framing: what it is, what it is not

Palantir’s AIP is the platform’s integrated LLM orchestration layer. The integration is specifically with the ontology — AIP functions can take ontology objects as inputs, can produce structured outputs that conform to declared JSON schemas, and can be wired into Workshop applications (Workshop Application Patterns) and Actions framework gates (Foundry Actions Framework) without requiring custom infrastructure code.

The integration matters because the alternative — using an external LLM API directly — produces an architecture where the model’s outputs are unmoored from the institution’s data substrate. The institution has to manually wire input data into prompts, manually parse outputs, manually validate citations against source documents, manually route results into review queues. AIP collapses this work into a configuration surface. The institution still has to design the prompt, validate the output schema, calibrate the verifier-LLM, and gate the production-quality assessment — but the wiring work is platform-level rather than custom code.

What AIP is not, despite occasional product marketing that suggests otherwise: it is not a general-purpose chatbot, and it is not a replacement for analyst judgment on consequential decisions. The grounding patterns this article walks are what restrict AIP outputs to the evidentiary role they are well-suited to. Treating AIP as an autonomous decision-maker is the failure mode the SR 11-7 framework was written to govern; treating it as a structured-output evidentiary tool with disciplined human review is the pattern that actually fits the regulatory framework.

Structured-output enforcement

The single design decision that eliminates the largest class of practical failure modes is requiring structured output. The AIP function does not produce free-form prose. It produces a JSON object conforming to a strict schema that the platform validates at the output gate.

The companion bundle’s schemas/grounded_summary_schema.json ships the full schema as a JSON Schema draft 2020-12 document. The six required fields are: summary_text (string, maximum length bounded), factual_assertions (array of objects, each with an assertion text, a source-article-ID, and a verbatim source-passage substring), rating_pressure_direction (enum: upward, downward, or none), rating_pressure_strength (enum: high, medium, or low), insufficient_corpus (boolean — flagging when the input corpus is too thin to produce a meaningful summary), and items_flagged_for_human_review (array of strings — ambiguous claims requiring analyst judgment).
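To make the six-field contract concrete, here is a minimal standard-library sketch of the checks a schema gate performs. It is a simplified stand-in for JSON Schema validation, not the companion bundle's schema: the key name "assertion" for the assertion text is a hypothetical choice, and the 1800-character bound is the figure the article gives for summary_text.

```python
REQUIRED_FIELDS = {
    "summary_text": str,
    "factual_assertions": list,
    "rating_pressure_direction": str,
    "rating_pressure_strength": str,
    "insufficient_corpus": bool,
    "items_flagged_for_human_review": list,
}
ENUMS = {
    "rating_pressure_direction": {"upward", "downward", "none"},
    "rating_pressure_strength": {"high", "medium", "low"},
}
# "assertion" is a HYPOTHETICAL key name for the assertion text.
ASSERTION_KEYS = ("assertion", "source_article_id", "source_passage")
MAX_SUMMARY_CHARS = 1800


def validate_output(output: dict) -> list[str]:
    """Return a list of violations; an empty list means the output validates."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in output:
            errors.append(f"missing required field: {field}")
        elif not isinstance(output[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if errors:
        return errors  # structural problems first; skip the value checks
    if len(output["summary_text"]) > MAX_SUMMARY_CHARS:
        errors.append("summary_text exceeds maximum length")
    for field, allowed in ENUMS.items():
        if output[field] not in allowed:
            errors.append(f"{field}: value not in enum")
    for i, item in enumerate(output["factual_assertions"]):
        if not isinstance(item, dict):
            errors.append(f"factual_assertions[{i}]: expected object")
            continue
        for key in ASSERTION_KEYS:
            if not isinstance(item.get(key), str) or not item[key]:
                errors.append(f"factual_assertions[{i}].{key}: missing or empty")
    for i, item in enumerate(output["items_flagged_for_human_review"]):
        if not isinstance(item, str):
            errors.append(f"items_flagged_for_human_review[{i}]: expected string")
    return errors


good = {
    "summary_text": "Two matters reported in the lookback window.",
    "factual_assertions": [{
        "assertion": "A civil complaint was filed.",
        "source_article_id": "AM-003",
        "source_passage": "a civil complaint was filed",
    }],
    "rating_pressure_direction": "upward",
    "rating_pressure_strength": "medium",
    "insufficient_corpus": False,
    "items_flagged_for_human_review": [],
}
bad_enum = dict(good, rating_pressure_strength="severe")
missing_flag = {k: v for k, v in good.items() if k != "insufficient_corpus"}
print(validate_output(good), validate_output(bad_enum), validate_output(missing_flag))
```

In a real deployment the platform's schema validator would do this work; the sketch only shows what kinds of violations the gate is looking for.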

The schema validation is non-optional. If the model’s output does not validate, the platform rejects it and the AIP function retries with the schema-violation error included as feedback in the next user turn. Production-ready AIP outputs validate on the first attempt nearly all the time after the prompt has been calibrated against the schema; the retry pattern handles the occasional drift case where the model produces a structurally-non-conforming output.
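The retry-with-feedback loop described above can be sketched generically. Everything here is illustrative: call_model and validate are hypothetical callables standing in for the AIP function and the schema gate, and the cap of three attempts is an assumption, since the article does not state a retry limit.

```python
from typing import Callable, Optional


def generate_with_retry(
    call_model: Callable[[Optional[str]], dict],
    validate: Callable[[dict], list[str]],
    max_attempts: int = 3,  # ASSUMPTION: the article does not specify a cap
) -> Optional[dict]:
    """Call the model, feeding validation errors back until the output validates."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        output = call_model(feedback)
        errors = validate(output)
        if not errors:
            return output
        # Include the violations in the next turn so the model can correct them.
        feedback = "Schema validation failed: " + "; ".join(errors)
    return None  # caller routes the case to human review


# Stub model: fails on the first call, conforms once it receives feedback.
feedback_seen: list[Optional[str]] = []


def stub_model(feedback: Optional[str]) -> dict:
    feedback_seen.append(feedback)
    return {} if feedback is None else {"summary_text": "ok"}


def stub_validate(output: dict) -> list[str]:
    return [] if "summary_text" in output else ["missing required field: summary_text"]


result = generate_with_retry(stub_model, stub_validate)
print(result, len(feedback_seen))
```

Returning None after the final attempt keeps the failure explicit, so the caller has to decide what happens to the case rather than receiving a non-conforming output.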

The reason structured output eliminates so many failure modes: it forces the model to commit to specific assertions tied to specific sources, in a format that downstream validation tools can mechanically check. The vague-summary failure mode (“the counterparty has been reported in adverse media recently”) gets eliminated because the schema requires specific factual_assertions, each linked to a source_article_id and a source_passage. The unsupported-claim failure mode gets caught downstream by the citation-check post-processor.

The bounded length on summary_text (1800 characters in the companion bundle’s schema — roughly 200 words) matters too. The compression value of the AIP function depends on the summary being short enough that reading it is faster than reading the corpus. A 600-word summary loses most of the operational advantage; a 200-word summary preserves it. The schema’s maxLength constraint encodes the operational requirement in the platform’s validation gate.

Citation-check post-processing

Every factual_assertion in the output must include a source_passage that appears verbatim — modulo whitespace normalization and case-folding — in the named source article. The citation_check post-processor enforces this mechanically. Its implementation lives in the companion bundle at tools/citation_check.py.

The post-processor works in three steps. For each factual_assertion in the output, the post-processor looks up the source article by source_article_id in the input corpus. It normalizes both the article body and the asserted source_passage (collapsing whitespace runs, lowercasing). It checks that the normalized source_passage is a substring of the normalized article body. If the check fails for any assertion, the post-processor records the failure with a structured reason (“source_passage not a substring of article body”).

The institutional rule is that if any factual_assertion fails the citation check, the entire AIP output is rejected. The output does not proceed to the analyst’s review queue. Instead, the platform either retries the AIP function (with the citation-failure list included in the retry feedback) or marks the case for human review with the failure reasons surfaced. The default-deny posture is important: producing a partial output where some claims are validated and others are not creates exactly the kind of ambiguous evidentiary record that examiners criticize.
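The three steps and the default-deny rule described above can be sketched as follows. This is a minimal illustration, not the contents of tools/citation_check.py: the function names, the failure-record layout, and the sample article text are assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and fold case, as the citation check requires."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def citation_check(output: dict, corpus: dict[str, str]) -> list[dict]:
    """Return one failure record per assertion whose passage is not in its article."""
    failures = []
    for index, item in enumerate(output["factual_assertions"]):
        article_id = item["source_article_id"]
        body = corpus.get(article_id)
        passage = normalize(item["source_passage"])
        if body is None:
            reason = "source_article_id not found in corpus"
        elif not passage:
            reason = "source_passage is empty"
        elif passage not in normalize(body):
            reason = "source_passage not a substring of article body"
        else:
            continue
        failures.append({"index": index, "source_article_id": article_id, "reason": reason})
    return failures


# Hypothetical one-article corpus and a two-assertion output.
corpus = {"AM-001": "The regulator   opened a\nreview of Port documentation in March."}
output = {"factual_assertions": [
    {"source_article_id": "AM-001",
     "source_passage": "the regulator opened a review of port documentation"},  # verbatim
    {"source_article_id": "AM-001",
     "source_passage": "regulators launched a probe into port paperwork"},      # paraphrase
]}

failures = citation_check(output, corpus)
accepted = not failures  # default-deny: any single failure rejects the whole output
print(accepted, failures)
```

The verbatim passage passes despite differing whitespace and capitalization; the paraphrase fails, and that one failure is enough to reject the entire output.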

The substring-match approach is intentionally strict. A paraphrase of a source passage, even an accurate one, fails the check because it is not a verbatim substring. The model is required to use the exact words from the source article in the source_passage field, regardless of how the accompanying assertion text paraphrases or restates them. The strict-substring rule eliminates the failure mode where the model produces a paraphrase that is almost accurate but drifts slightly from the source, a class of failure the academic literature on hallucination has documented extensively (see Ji et al., 2023).

Hallucination-check verifier pass

The citation check confirms structural grounding (the source_passage is in the article). It does not confirm semantic grounding (the assertion is faithfully supported by the source_passage). The verifier-LLM second pass closes that gap.

The pattern is straightforward. After the citation_check passes, the platform runs a second LLM call for each factual_assertion. The verifier prompt receives only the source_passage and the proposed assertion. It returns one of three verdicts: SUPPORTED, NOT_SUPPORTED, or AMBIGUOUS, with a one-sentence reason for the latter two. The companion bundle’s tools/verifier_prompt_template.md ships the full verifier prompt structure.

The routing logic for the verdicts: SUPPORTED assertions stay in the final output. NOT_SUPPORTED assertions are removed from factual_assertions and added to items_flagged_for_human_review with a “Removed by verifier” prefix. AMBIGUOUS assertions are kept in the final output but added to items_flagged_for_human_review with a “Verifier ambiguous” prefix. The final output presented to the analyst contains no assertion the verifier rejected, and every assertion the verifier was uncertain about appears on an explicit flagged list, which surfaces the uncertainty rather than hiding it.
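A minimal sketch of that routing, assuming each verifier result arrives as a (verdict, reason) pair aligned with the assertion list. The function name, the "assertion" key, and the exact wording after each prefix are hypothetical; only the three verdicts and the two prefixes come from the article.

```python
def route_verdicts(
    assertions: list[dict],
    verdicts: list[tuple[str, str]],
    flagged: list[str],
) -> tuple[list[dict], list[str]]:
    """Apply verifier verdicts; return (kept assertions, updated flagged list)."""
    if len(assertions) != len(verdicts):
        raise ValueError("one verdict is required per assertion")
    kept: list[dict] = []
    flagged = list(flagged)  # copy so the caller's list is not mutated
    for item, (verdict, reason) in zip(assertions, verdicts):
        text = item["assertion"]  # HYPOTHETICAL key name for the assertion text
        if verdict == "SUPPORTED":
            kept.append(item)
        elif verdict == "NOT_SUPPORTED":
            flagged.append(f"Removed by verifier: {text} ({reason})")
        elif verdict == "AMBIGUOUS":
            kept.append(item)
            flagged.append(f"Verifier ambiguous: {text} ({reason})")
        else:
            raise ValueError(f"unknown verdict: {verdict}")
    return kept, flagged


assertions = [{"assertion": "A"}, {"assertion": "B"}, {"assertion": "C"}]
verdicts = [
    ("SUPPORTED", ""),
    ("NOT_SUPPORTED", "passage does not mention a fine"),
    ("AMBIGUOUS", "passage is unclear on timing"),
]
kept, flagged = route_verdicts(assertions, verdicts, flagged=[])
print([a["assertion"] for a in kept])
print(flagged)
```

Raising on an unknown verdict, rather than silently keeping the assertion, matches the default-deny posture of the earlier gates.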

The verifier-LLM does not need to be the same model that generated the original summary. In production deployments, the cost-quality optimum is usually a cheaper, faster model for the verifier role (the verifier task is simpler — judge a single assertion against a single passage — and benefits less from the larger model’s broader contextual reasoning). The verifier model’s precision and recall against a labeled holdout set are what determine whether the institution can promote the verifier from advisory to blocking status in the hallucination-gate configuration.

Calibration is critical. The illustrative ranges in the companion bundle’s verifier template (~0.92-0.96 precision, ~0.88-0.93 recall) are placeholders. The institution must measure these against its own labeled holdout set before changing the verifier’s gate status from blocking: false to blocking: true. The two-stage approach trades latency for confidence — total wall time per counterparty roughly two to three times longer than single-pass generation — and the trade is acceptable for batch DD workflows but not for interactive UX. The companion bundle’s evaluations/hallucination_gate.yaml captures the five-rule gate framework that ties this all together.

Insufficient-corpus handling

The model must be allowed to say “I cannot produce a meaningful summary.” The failure mode where the model instead generates plausible-but-empty text on a thin corpus is, in operational terms, the single most dangerous output the system can produce. The system that confidently summarizes nothing is worse than the system that explicitly reports the corpus was insufficient: the analyst who reads the plausible-but-empty summary may close the case, while the analyst who reads the insufficient-corpus flag knows to either expand the corpus or invest more attention.

The schema enforces this through the insufficient_corpus boolean and a deterministic gate. The gate rule (insufficient_corpus_required_when_thin in the hallucination gates) is that if the news corpus contains fewer than three distinct articles, or if all articles cover the same single event (deduplicated by event_id where available), the output must set insufficient_corpus = true. Producing a non-empty summary on thin corpus is a high-severity policy violation that the gate catches at the platform level rather than relying on the model’s discretion.

The behavioral pattern in the system prompt is explicit:

If the corpus is insufficient to produce a meaningful summary (fewer than three substantive articles, or all articles cover the same single event), set insufficient_corpus = true and explain the gap in summary_text. Producing plausible-but-empty text on a thin corpus is the failure mode this control exists to prevent.

The pattern works because both the model and the platform validate the rule. The model is prompted to honor the rule; the platform enforces it as a deterministic gate. The institution’s confidence in the rule’s actual operation comes from the combination: prompt-side instruction biases the model toward correct behavior; platform-side enforcement catches the residual cases where the model would have produced a confident summary anyway.
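The platform-side half of the rule is simple enough to sketch directly. The article dictionary layout ("article_id", optional "event_id") and the function names are assumptions for illustration. The sketch also assumes that when any article lacks an event_id, the same-event test does not fire; the article says only "where available", so that choice is an interpretation.

```python
def corpus_is_thin(articles: list[dict]) -> bool:
    """True when the corpus has fewer than three distinct articles or one single event."""
    distinct_ids = {a["article_id"] for a in articles}
    if len(distinct_ids) < 3:
        return True
    event_ids = [a.get("event_id") for a in articles]
    # ASSUMPTION: only apply the same-event test when every article carries an event_id.
    if all(e is not None for e in event_ids) and len(set(event_ids)) == 1:
        return True
    return False


def insufficient_corpus_gate(articles: list[dict], output: dict) -> list[str]:
    """Return a violation if a thin corpus was summarized without the flag set."""
    if corpus_is_thin(articles) and not output["insufficient_corpus"]:
        return ["insufficient_corpus must be true when the corpus is thin"]
    return []


two_articles = [{"article_id": "AM-001"}, {"article_id": "AM-002"}]
same_event = [{"article_id": f"AM-00{i}", "event_id": "EV-1"} for i in range(1, 5)]
varied = [{"article_id": f"AM-00{i}", "event_id": f"EV-{i}"} for i in range(1, 5)]

confident = {"insufficient_corpus": False}
print(insufficient_corpus_gate(two_articles, confident))
print(insufficient_corpus_gate(same_event, confident))
print(insufficient_corpus_gate(varied, confident))
```

The gate never inspects summary_text: it depends only on the corpus and the flag, which is what makes it deterministic.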

Human-review handoff

The items_flagged_for_human_review array is the structured handoff to the analyst’s attention. The array captures three types of items: claims the verifier-LLM was uncertain about (AMBIGUOUS verdicts), claims the verifier rejected (NOT_SUPPORTED, removed from the final output), and items the model itself flagged because they require analyst judgment beyond what the AIP function can provide.

The third category is the most subtle. Consider an adverse-media item describing a counterparty as “linked to” a sanctioned entity in a relationship that the article describes ambiguously. The grounding may be solid (the article does say the words). The verifier may approve the assertion (the passage does support the wording). But the analytical question of whether the linkage constitutes a control relationship under the institution’s CDD framework requires judgment the AIP function should not exercise. The right behavior is for the AIP function to surface the claim in items_flagged_for_human_review with framing like “Counterparty described as ‘linked to’ SanctionedEntity-A in article AM-003 — analyst to evaluate whether linkage constitutes a control relationship under §1010.230 framework.”

The Workshop application’s review queue (Workshop Application Patterns) consumes the array directly. Each flagged item becomes a checkbox the analyst evaluates before completing the review. The audit-trail discipline is preserved: the analyst’s resolution of each flagged item is captured in the AuditEntry (Foundry Actions Framework) when the analyst takes an action based on the review. The chain from corpus to summary to flagged items to analyst decision to audit-trail entry is unbroken and reconstructible.

Model-evaluation methodology

The institution that deploys an AIP-based adverse-media summarizer without measured performance against labeled outcomes is deploying an unvalidated model. The SR 11-7 framework expects model performance to be measured before deployment, monitored ongoing, and re-validated periodically — and the AIP function is unambiguously a model under that framework’s definition.

The minimum evaluation methodology has three layers. First, a labeled holdout set: a collection of counterparties with hand-curated “gold” summaries produced by experienced analysts, against which the AIP function’s outputs can be compared. The set should cover the institution’s actual distribution of counterparty patterns (jurisdictions, industries, corpus densities, rating levels) and should contain at least a few hundred examples to support meaningful precision/recall measurement.

Second, defined metrics. The natural metrics for adverse-media summarization are: factual-assertion precision (fraction of asserted facts that an analyst confirms are accurate), factual-assertion recall (fraction of facts the gold summary contains that the AIP output also contains), insufficient-corpus precision (fraction of insufficient-corpus flags that the analyst confirms are appropriate), and rating-pressure-signal correlation (degree to which the AIP function’s directional/strength signals predict the analyst’s eventual rating-action decision). Each metric needs a target threshold the institution commits to before production deployment.

Third, ongoing measurement. The deployed AIP function should be monitored continuously — a randomly-selected sample of outputs should be subjected to analyst review with the analyst’s verdict captured for metric computation. Material drift in any metric is a re-validation trigger. The SR 11-7 framework expects this kind of ongoing-monitoring discipline; the institutional model-risk policy operationalizes it.
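One way to picture the ongoing-measurement loop is a seeded random sample of outputs, an analyst verdict per sampled assertion, and a comparison against the validated baseline. The sketch below is illustrative only: the baseline of 0.94, the tolerance of 0.03, and the sample counts are hypothetical, and each institution's model-risk policy would set its own.

```python
import random


def sample_for_review(output_ids: list[str], sample_size: int, seed: int) -> list[str]:
    """Draw a reproducible random sample of output IDs for analyst review."""
    rng = random.Random(seed)
    return rng.sample(output_ids, min(sample_size, len(output_ids)))


def assertion_precision(analyst_confirmed: list[bool]) -> float:
    """Fraction of sampled assertions the analyst confirmed as accurate."""
    return sum(analyst_confirmed) / len(analyst_confirmed) if analyst_confirmed else 0.0


def drift_triggered(observed: float, baseline: float, tolerance: float) -> bool:
    """True when the observed metric has fallen below baseline by more than tolerance."""
    return (baseline - observed) > tolerance


output_ids = [f"OUT-{i:04d}" for i in range(500)]
sampled = sample_for_review(output_ids, sample_size=25, seed=2026)

# Hypothetical analyst verdicts on 50 sampled assertions: 45 confirmed, 5 not.
verdicts = [True] * 45 + [False] * 5
observed = assertion_precision(verdicts)
revalidate = drift_triggered(observed, baseline=0.94, tolerance=0.03)
print(len(sampled), observed, revalidate)
```

Recording the seed alongside the sample lets an examiner reconstruct exactly which outputs were selected for review in a given period.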

The “we tested it once and it looked good” approach fails both the regulatory framework and the operational reality. Models drift. Source distributions change. The labeled holdout set the institution measured against in March 2026 is not necessarily representative of the corpus the function processes in November 2026. Continuous measurement is what catches the drift before it produces a regulatory finding.

Worked example

Consider counterparty CP-SYNTH-00042 — Meridian Holdings (Pte) Ltd, Singapore, fully synthetic — facing a 90-day adverse-media corpus of four articles. The corpus covers a port-documentation-review matter, a Q1 revenue announcement, a civil-complaint filing at pleadings stage, and a regulator’s Q1 inspection summary naming the company. The AIP function processes the corpus.

The output passes the schema validation: all six required fields present, summary_text under the 1800-character limit, factual_assertions array contains eight grounded assertions each with verbatim source_passages. The citation_check post-processor verifies all eight: 8/8 pass, exit code 0, no failures. The verifier-LLM second pass evaluates each assertion: 8 SUPPORTED, 0 NOT_SUPPORTED, 0 AMBIGUOUS. The output proceeds to the analyst’s review queue with three items_flagged_for_human_review covering analyst-judgment questions the function correctly identified as requiring human evaluation (whether the pleadings-stage civil complaint warrants rating elevation, how to weigh the regulator’s “compliance-dialogue” framing against the absolute hit count, and whether the company’s omission of the matters from its Q1 announcement is itself a disclosure-discipline signal).

The companion bundle’s sample_output/aip_summary_output_example.json ships the full output as it would appear in production. The rating_pressure_direction is “upward” with rating_pressure_strength “medium” — both advisory signals to the analyst, not automated rating-change triggers. The analyst opens the output in the Workshop application, reads the 180-word summary, hover-verifies several citations directly against the source articles, evaluates the three flagged items, and decides to submit an ElevateRiskRating action (per Foundry Actions Framework) elevating the counterparty from B+ to high. The AIP summary becomes a supporting-evidence item on the resulting AuditEntry — snapshotted at the moment of the action so the examiner who reviews the entry eighteen months later sees the summary as the analyst actually saw it, not as the underlying corpus may have evolved by then.

This is the operational success state. The AIP function compressed roughly twelve thousand words of corpus into a 180-word summary the analyst evaluated in two minutes. The compression preserved the citation chain from the summary back to the corpus, with verifier-validated grounding for every assertion. The analyst’s decision was supported by — but not made by — the AIP function. The audit-trail discipline captured the analyst’s reasoning, the supporting evidence (including the summary itself), and the lineage back to the original corpus. The chain is intact for any subsequent examination.

The institution’s confidence in the system rests on the validation work it did before deployment: the labeled holdout set, the measured metrics, the ongoing monitoring, the documented model-risk-management framework that ties the entire workflow into the institution’s broader SR 11-7 / OCC 2011-12 compliance posture. Without that work, the function is unvalidated; with it, the function is a regulator-defensible early-warning-and-compression layer that the analyst still owns the decisions on top of.

Grounding cases: when adverse-media monitoring failures appeared in enforcement

The institutional history of AML enforcement is dense with cases where adverse-media monitoring gaps contributed to the underlying compliance failure. The four cases below illustrate distinct fraud-signal patterns the AIP summarization workflow this article walks would have surfaced — at the granularity the practitioner running the analysis can actually implement. Each case carries the agency record, news corroboration, the specific fraud-signal pattern, and a reference to either real public data or the companion bundle’s synthetic-data corpus that an analyst can use to reproduce the technique.

Case 1: HSBC USA — Mexican-cartel correspondent banking (2012)

Case 2: Standard Chartered — Iran-related transaction monitoring (2012/2018/2019)

Case 3: Wirecard AG — Dan McCrum FT investigation (2015-2020)

  • Agency record: Munich I District Court conviction of former CEO Markus Braun (July 2025); BaFin’s 2020 Wirecard supervisory failures report.
  • News coverage: Financial Times, “The Wirecard scandal” (FT investigations dossier, Dan McCrum reporting 2015-2020); Financial Times, “Wirecard: the timeline” (2020-06-26).
  • Fraud-signal pattern: Between January 2015 and June 2020, Dan McCrum’s FT coverage ran to more than twenty named-fraud-allegation articles citing specific irregularities in Wirecard’s Asia-Pacific accounting: escrow account discrepancies, third-party-acquirer partner opacity, and named subsidiaries with no operational presence at their registered addresses. Correspondent banks holding exposure to Wirecard or its named partner entities (typically through processing or correspondent relationships) could read this adverse-media signal as each piece was published. The signal an AIP-style summarization workflow would surface: counterparty mentions where the same source (here, the FT) publishes recurring named-allegation pieces over a 6+ month window, as distinct from one-off coverage of unrelated news events. The practitioner-implementable rule: track adverse-media density not just in aggregate but stratified by source persistence. A single outlet running multiple named-allegation pieces over months is the high-signal pattern; the same-day pile-on of many outlets covering one event is not. Route counterparties with persistent-source signals to enhanced due diligence regardless of whether other outlets have picked the story up.
  • Reproducible data: The FT’s Wirecard investigation archive is partially accessible at ft.com/wirecard. The companion bundle ships synthetic_data/wirecard_pattern_source_persistence_corpus.json — 80 synthetic adverse-media items demonstrating source-persistence vs. one-time-spike patterns, with a worked example showing how the AIP function distinguishes the two.
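The source-persistence rule from the fraud-signal pattern above can be sketched as a small grouping function. The item layout ("source", "published", "named_allegation"), the outlet names, and the three-piece minimum are hypothetical; the 180-day span approximates the 6+ month window named in the text.

```python
from collections import defaultdict
from datetime import date


def persistent_sources(
    items: list[dict],
    min_pieces: int = 3,        # ASSUMPTION: "multiple" pieces taken as three or more
    min_span_days: int = 180,   # approximates the 6+ month window
) -> list[str]:
    """Return outlets running recurring named-allegation pieces over a long window."""
    dates_by_source: dict[str, list[date]] = defaultdict(list)
    for item in items:
        if item["named_allegation"]:
            dates_by_source[item["source"]].append(item["published"])
    flagged = []
    for source, dates in dates_by_source.items():
        span_days = (max(dates) - min(dates)).days
        if len(dates) >= min_pieces and span_days >= min_span_days:
            flagged.append(source)
    return sorted(flagged)


items = [
    # One outlet returning to the same allegations over roughly seven months.
    {"source": "Outlet A", "published": date(2024, 1, 10), "named_allegation": True},
    {"source": "Outlet A", "published": date(2024, 4, 2), "named_allegation": True},
    {"source": "Outlet A", "published": date(2024, 8, 15), "named_allegation": True},
    # Same-day pile-on: three outlets covering one event once each.
    {"source": "Outlet B", "published": date(2024, 3, 1), "named_allegation": True},
    {"source": "Outlet C", "published": date(2024, 3, 1), "named_allegation": True},
    {"source": "Outlet D", "published": date(2024, 3, 1), "named_allegation": True},
]

escalate = persistent_sources(items)
print(escalate)
```

Aggregate density alone would treat both patterns as six adverse items; stratifying by source is what separates the persistent outlet from the one-day spike.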

Case 4: 1MDB — Goldman Sachs settlement (2020)

What the practitioner does with these references

For each case above, the article provides the agency record (the regulatorily-authoritative source), the news corroboration (so a practitioner without legal-tech tools can verify the claim), the fraud-signal pattern (what to look for in the data), and the reproducible data (real or synthetic) that lets the analyst practice the technique. When designing or evaluating an LLM-augmented adverse-media program, the workpaper documentation should: (1) cite at least two of the patterns above as the operational threat scenarios the institution’s program is designed to detect; (2) demonstrate the institution’s adverse-media corpus covers the language-stratified, source-stratified, and outlet-weighted dimensions the cases reveal; (3) show the labeled-validation set the institution used to calibrate precision and recall against patterns of this kind; (4) document the SR 11-7 §VI model-governance framework that ties the workflow into the institution’s broader control environment.

The platform-side scaffolding (AIP function, grounding controls, verifier gate) is the mechanism. The discipline that turns it into examination-defensible work is the labeled-holdout validation against patterns like these four, the ongoing measured precision/recall against the institution’s actual outcomes, and the SR 11-7 framing that makes the entire workflow regulator-defensible.

Authority

Companion repository

AIP-Driven Adverse-Media Summarization’s full companion bundle lives at github.com/noahrgreen/dd-tech-lab-companion/articles/004_aip_adverse_media_summarization/. It ships the AIP prompt YAML with full system and user prompts, the JSON Schema draft 2020-12 for the structured output, the five-rule hallucination_gate configuration, the citation_check Python validator and verifier-LLM prompt template, the synthetic counterparty plus four-article adverse-media corpus, the sample AIP output passing all gates end-to-end, the four text-based Foundry UI screen-equivalents, and the validation_notes covering scope, known limits, and reproducibility. Foundry Actions Framework takes the human-review-handoff output and walks the Actions framework that captures the resulting analyst decision as an auditable event.