Code Repositories in Foundry: When to Embed Python / PySpark in the Pipeline (and When to Stay in Pipeline Builder)

What you’ll be able to do after reading this. Recognize the decision boundary between no-code Pipeline Builder and code-based Code Repositories. Read a @transform-decorated PySpark function and follow what it does. Identify the audit-metadata columns (pipeline_run_id, transform_version, computed_at) that link a gold-layer output row back to the specific build that produced it. Spot the production-scale failure modes (skew, partition imbalance, shuffle bloat) that turn a transform that works in development into one that fails under production load. Document the transform-change-governance chain that connects to Foundry Actions Framework’s TransformChangeEntry audit pattern.

Terms anchored before the article walks them. Some of these are programming-vocabulary terms the city-school BBA-trained practitioner may not have encountered. Each has a plain-English bridge plus the precise meaning.

PySpark — a Python interface to Apache Spark, a distributed computing engine. “Distributed” means the work runs across many machines simultaneously; “Python interface” means the practitioner writes Python that Spark translates into the distributed work. PySpark is not Foundry-specific; many institutions outside Foundry use it.
DAG (Directed Acyclic Graph) — a chart of computational dependencies where each box depends on its predecessors. Pipeline Builder’s visual editor IS a DAG editor; the DAG defines what runs when.
@transform decorator — a Foundry-specific Python construct that wraps a function and tells Foundry “this function is a transform; its inputs are these datasets, its output is this dataset.” The decorator is what wires the function into Pipeline Builder’s DAG. Practitioners reading the code see the decorator and know which datasets are involved.
Window function — a SQL/Spark operation that ranks or aggregates rows within a “window” defined by partition and order. A familiar audit-side analog: “give me the most recent journal entry per account per quarter” is a windowed query (window = account+quarter; order = posting date desc; pick rank 1).
UDF (User-Defined Function) — a custom Python function the analyst wraps for Spark to call. UDFs bypass Spark’s query optimizer, so they’re powerful but slower; the practitioner reads “UDF” as “the engineer chose to write custom logic here for a specific reason.”
CI/CD (Continuous Integration / Continuous Deployment) — the institutional practice of automatically running tests, security scans, and build steps on every code change. The practitioner reads the CI/CD pipeline as the audit-evidence chain for transform changes: each promotion to production is a documented event.
Shuffle — Spark’s term for moving data between machines to compute joins or groupings. “Shuffle bloat” means the job is moving more data than necessary, which slows it dramatically. The practitioner does not need to fix shuffle bloat; the practitioner needs to recognize the failure mode if a transform that worked on small data is failing at scale.

Foundry Pipeline Builder handles roughly eighty percent of the data transformations a due-diligence ingestion pipeline requires without writing any code. The remaining twenty percent — complex windowing across long time periods, custom string-parsing for jurisdictional-code variants, machine-learning inference embedded mid-pipeline, non-standard joins involving array-typed fields, performance-tuning at scales where Spark-level control matters — needs Python or PySpark in a Foundry Code Repository. The decision of where each transform belongs is the engineering judgment call that distinguishes a maintainable pipeline from a brittle one. Most pipelines get this decision wrong in one of two predictable directions, and both directions produce predictable failure modes.

This article walks the decision boundary, the Code Repository project structure that makes the twenty percent work, the testing and CI/CD disciplines that keep the code path from regressing as schemas evolve underneath it, and the operational pattern of mixing no-code and code transforms in the same pipeline without producing what teams sometimes call “twenty percent code that everyone is afraid to touch.”

The eighty-twenty rule of Foundry transforms

Pipeline Builder is genuinely no-code: a visual editor produces a directed acyclic graph of transforms, each a node that reads input datasets, applies a finite set of operators (filter, join, group, aggregate, pivot, format, rename, derive), and writes output datasets. The operator catalog is rich enough to express most analytical transformations a due-diligence pipeline requires. For the practitioner whose background is in audit or finance rather than data engineering, this is the right entry point. A senior analyst with strong SQL fluency can typically build production-quality Pipeline Builder transforms within their first week on the platform.

The boundary appears in a small set of patterns the no-code envelope does not cover. Window functions (“most recent value per group,” “rank within partition,” “cumulative sum ordered by date”) are the most common. Pipeline Builder operators do not express these compactly, and the analyst who works around the limitation by chaining multiple operators produces transforms that work but become illegible. Custom string parsing is another common boundary: where jurisdictional-code variants such as “US-NY,” “USA-NY,” and “United States: New York” all need to normalize to a canonical form, regex and conditional logic in code are shorter and clearer than the no-code chain. Performance tuning (broadcast hints, partition counts, shuffle reduction) requires Spark-level control the no-code envelope intentionally hides. Embedded machine-learning inference (a fraud-scoring model applied mid-pipeline to every row) needs the code path because Pipeline Builder has no model-serving operator.

The signs that a transform has crossed the boundary are usually visible: the no-code expression is awkward, the analyst has chained six operators where two should have sufficed, the output is correct but the transform is illegible, or the analyst’s review-time on every change exceeds the change effort itself. When those signs appear, the answer is not “build more clever no-code”; the answer is to move the transform to a Code Repository.

The companion bundle’s decision_matrix/pipeline_vs_code_repo_matrix.csv ships an eighteen-criterion comparison covering schema mapping, date filtering patterns, windowing, joins (simple and array-typed), conditional aggregations, custom string-parsing, schema-drift handling, ML inference, performance tuning, audit metadata, testing patterns, CI/CD integration, debuggability, analyst editability, and onboarding time. Each row labels which side wins and provides a one-sentence rationale. The matrix is the operational reference; the rest of this article walks the how of the code-side patterns.

Code Repository project structure

A Foundry Code Repository is a Git-backed project that contains Python transform files, tests, dependencies, and CI configuration. The project structure is familiar to any practitioner who has worked with Python in a production data context — transforms/ for the transform modules, tests/ for the pytest test suite, requirements.txt for dependency declarations, .ci/ for the CI configuration that integrates with the institution’s broader change-management process.

The transform module itself is small and structured. The companion bundle’s transforms/pyspark_silver_to_gold.py shows the canonical pattern:

from transforms.api import transform, Input, Output
from pyspark.sql import functions as F, Window

TRANSFORM_VERSION = "silver_to_gold_counterparty@v2.1"

@transform(
    output=Output("/dd/ontology/Counterparty"),
    silver_resolved=Input("/dd/silver/counterparties_resolved"),
    sanctions=Input("/dd/silver/sanctions_hits"),
    adverse_media=Input("/dd/silver/adverse_media_mentions"),
)
def compute_counterparty_gold(ctx, silver_resolved, sanctions, adverse_media, output):
    """Produce gold-layer Counterparty ontology objects with link metadata."""
    sanctions_active = (
        sanctions.dataframe()
        .filter(F.col("effective_date") >= F.date_sub(F.current_date(), 365))
        .withColumn(
            "rank",
            F.row_number().over(
                Window.partitionBy("counterparty_id").orderBy(F.col("effective_date").desc())
            ),
        )
        .filter(F.col("rank") == 1)
        .select("counterparty_id", "sanctions_list_id", "effective_date")
    )
    # ... elided in this excerpt: media density, gold-layer composition, and audit metadata,
    # which together build the `gold` DataFrame written below
    output.write_dataframe(gold)

Three observations from this pattern matter to the practitioner.

First, the @transform decorator and the Input / Output symbols are Foundry-specific — they come from the platform’s transforms.api library. They wire the transform into Pipeline Builder’s DAG: when one of the input datasets refreshes, this transform reruns and produces its output dataset. The wiring is declarative; the practitioner does not write the orchestration code.

Second, the body of the transform — the PySpark window function, the join, the column derivation — is portable Spark. The same logic with minor adjustments runs on Snowflake Snowpark, on Databricks notebooks, on standalone PySpark deployments. The Foundry value-add is the integration; the analytical work is platform-agnostic.

Third, the transform produces three audit-metadata columns on every output row: pipeline_run_id (from the context object), transform_version (a module-level constant pinning the specific build that ran), and computed_at (the timestamp at execution). These columns are what answer the question “which code generated this specific value” when a downstream user — or an examiner — asks six months later. The audit metadata is non-optional for any transform writing to regulated ontology objects.
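The three audit columns described above can be illustrated without a Spark runtime. The following is a minimal plain-Python sketch, not the companion bundle's code: lists of dicts stand in for DataFrame rows, the helper names stamp_audit_columns and rows_missing_audit are hypothetical, and the run identifier is passed in as a parameter because how it is read from the Foundry context object is not shown in the excerpt.

```python
from datetime import datetime, timezone

AUDIT_COLUMNS = ("pipeline_run_id", "transform_version", "computed_at")


def stamp_audit_columns(rows, pipeline_run_id, transform_version, computed_at=None):
    """Return copies of `rows` with the three audit columns attached.

    Hypothetical helper: `rows` is a list of dicts standing in for a DataFrame.
    """
    computed_at = computed_at or datetime.now(timezone.utc)
    stamp = {
        "pipeline_run_id": pipeline_run_id,
        "transform_version": transform_version,
        "computed_at": computed_at.isoformat(),
    }
    return [{**row, **stamp} for row in rows]


def rows_missing_audit(rows):
    """Return the rows where any audit column is absent or null."""
    return [r for r in rows if any(r.get(c) is None for c in AUDIT_COLUMNS)]


stamped = stamp_audit_columns(
    [{"counterparty_id": "C1"}],
    pipeline_run_id="run-001",  # illustrative value
    transform_version="silver_to_gold_counterparty@v2.1",
)
```

A check like rows_missing_audit is the kind of assertion a contract test could make before a transform is allowed to write to a regulated object.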

Dependency management

Foundry-side Python projects depend on a curated set of libraries plus, for external dependencies, the institution’s allowlist. The institutional security-review process typically gates which external pip dependencies the data-engineering team can introduce. For most due-diligence transforms, the curated library set — pandas, NumPy, PySpark, SciPy, statsmodels, requests, pyyaml — is sufficient. Niche dependencies (specific geospatial libraries, model-serving runtimes, vendor SDKs) go through the security review.

The dependency-pinning discipline matters more than the curation. Every transform should declare its dependencies with explicit version pins in requirements.txt. A transform that depends on pandas without a version pin is a transform whose behavior may silently change when pandas releases a minor version with a behavior tweak. The CI pipeline catches this in the unit-test step, but only if the unit tests are thorough enough to catch the relevant behavior. Pinning the version eliminates the silent-change failure mode entirely.
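One way to enforce the pinning discipline is a small lint step that fails when requirements.txt contains anything other than exact pins. The sketch below is an assumption about how such a check could look, not part of the companion bundle; the function name unpinned_requirements is hypothetical, and the pattern is deliberately simplified (it would also flag include lines and environment markers, which a real check may want to allow).

```python
import re

# Accepts "name==version", optionally with extras such as "name[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[^=\s]+$")


def unpinned_requirements(requirements_text):
    """Return the requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = "pandas==2.1.4\nnumpy>=1.24\n# a comment\nrequests\npyyaml==6.0.1  # pinned\n"
offenders = unpinned_requirements(sample)  # ["numpy>=1.24", "requests"]
```

A CI lint stage could call this on the repository's requirements file and fail the build when the returned list is non-empty.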

For institutions running Foundry across multiple environments — development, staging, production — the version pinning also serves the reproducibility expectation that regulators care about. When the examiner asks “which version of the model ran on April 15?”, the answer requires that the institution can identify, and ideally rerun, the specific build. The transform_version constant in the module plus the pinned dependencies plus the Git commit hash are the three identifiers that make the answer concrete.

PySpark patterns for due-diligence transforms

The patterns that recur across most due-diligence Code Repository transforms form a small set, and the practitioner who learns them can produce roughly eighty percent of the genuinely useful code path after a few days of practice.

Window functions are the most common pattern. The “most recent value per group” — most recent sanctions hit per counterparty, most recent KYC refresh per legal-entity ID, most recent transaction per beneficial-ownership chain — uses Window.partitionBy(...).orderBy(...desc()) combined with F.row_number() to rank and filter(rank == 1) to keep only the top entry. The pattern is the workhorse for any “what is the current state” derivation from a history-of-changes table.
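When writing expected values for a test fixture, it can help to have a plain-Python reference for what “rank within partition, keep rank 1” should return. The sketch below is such a reference under stated assumptions: rows are dicts, the function name most_recent_per_group is hypothetical, and ties keep the first row seen, whereas Spark's row_number does not guarantee an order among tied rows.

```python
from datetime import date


def most_recent_per_group(rows, key, order_by):
    """Keep, for each value of `key`, the row with the greatest `order_by` value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict '>' means the first row seen wins on ties.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


hits = [
    {"counterparty_id": "C1", "sanctions_list_id": "L1", "effective_date": date(2024, 1, 10)},
    {"counterparty_id": "C2", "sanctions_list_id": "L2", "effective_date": date(2024, 3, 1)},
    {"counterparty_id": "C1", "sanctions_list_id": "L3", "effective_date": date(2024, 6, 5)},
]
current_hits = most_recent_per_group(hits, "counterparty_id", "effective_date")
```

If the ordering column can tie in real data, adding a second, unique ordering column in the Spark window keeps the result deterministic across re-runs.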

Multi-source joins combine multiple silver-layer inputs into a single gold-layer object. The join pattern in the worked example combines silver counterparties, silver sanctions hits, and silver adverse-media mentions. The patterns to watch are join key collisions (use explicit join expressions with table aliases when the same column name exists in multiple inputs) and null propagation through left joins (always wrap potentially-null aggregations in F.coalesce to produce a deterministic non-null result).

Conditional aggregations — counting items that match a condition, summing values above a threshold — use F.sum(F.when(condition, value).otherwise(0)). The pattern reads cleanly and translates to efficient Spark execution plans. The alternative pattern of pre-filtering and then aggregating is slightly less flexible because it cannot combine multiple conditional aggregates in a single pass.
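The single-pass point can be seen in a plain-Python equivalent of two F.sum(F.when(...).otherwise(0)) aggregates computed side by side. This is an illustrative sketch only: the column names is_flagged and amount and the function name conditional_aggregates are hypothetical and do not come from the companion bundle.

```python
from collections import defaultdict


def conditional_aggregates(rows, amount_threshold):
    """Per counterparty, count flagged rows and sum amounts above a threshold in one pass."""
    totals = defaultdict(lambda: {"flagged_count": 0, "large_amount_total": 0})
    for row in rows:
        agg = totals[row["counterparty_id"]]
        # Mirrors sum(when(is_flagged, 1).otherwise(0))
        agg["flagged_count"] += 1 if row["is_flagged"] else 0
        # Mirrors sum(when(amount > threshold, amount).otherwise(0))
        agg["large_amount_total"] += row["amount"] if row["amount"] > amount_threshold else 0
    return dict(totals)


transactions = [
    {"counterparty_id": "C1", "is_flagged": True, "amount": 500},
    {"counterparty_id": "C1", "is_flagged": False, "amount": 15000},
    {"counterparty_id": "C2", "is_flagged": False, "amount": 200},
]
summary = conditional_aggregates(transactions, amount_threshold=10000)
```

Because of the otherwise-zero branch, a counterparty with no matching rows still appears with zeros rather than dropping out, which a pre-filter followed by an aggregate would not give.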

Custom string parsing uses UDFs (user-defined functions) when the parsing logic is non-trivial. The practitioner should write UDFs sparingly because they bypass Spark’s query optimizer. For simple string normalization — lower(trim(col)) — the built-in functions are faster. For complex parsing that genuinely needs Python (parsing jurisdictional codes against a lookup table, applying date-format detection across heterogeneous source systems), a UDF is the right tool. The convention is to mark UDFs explicitly in the transform module’s docstring with a note about why a UDF was preferred over a built-in.
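The jurisdictional-code case is a reasonable example of logic that reads more clearly as a Python function. The sketch below assumes two tiny, hypothetical alias tables (a real pipeline would load them from a reference dataset) and a hypothetical function name normalize_jurisdiction; it shows only the parsing logic, which is the part that would be wrapped as a UDF.

```python
import re

# Hypothetical lookup tables for illustration only.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "united states": "US"}
REGION_ALIASES = {"ny": "NY", "new york": "NY"}

# Split on the first separator (hyphen, colon, slash, or comma), ignoring spaces around it.
_SEPARATOR = re.compile(r"\s*[-:/,]\s*")


def normalize_jurisdiction(raw):
    """Return the canonical 'CC-RR' form, or None when the value is not recognized."""
    if raw is None:
        return None
    parts = _SEPARATOR.split(raw.strip().lower(), maxsplit=1)
    if len(parts) != 2:
        return None
    country = COUNTRY_ALIASES.get(parts[0])
    region = REGION_ALIASES.get(parts[1])
    if country is None or region is None:
        return None
    return f"{country}-{region}"


variants = ["US-NY", "USA-NY", "United States: New York"]
normalized = [normalize_jurisdiction(v) for v in variants]
```

Returning None for unrecognized values, rather than guessing, keeps unparsed codes visible downstream, where a null count can be monitored.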

Dynamic-date filtering — “transactions in the last 365 days from the pipeline run” — uses F.date_sub(F.current_date(), 365) rather than a hard-coded date. The pattern is the natural form in code and awkward in no-code. Every dynamic-date filter the analyst writes in code is a transform that ages gracefully; every hard-coded date in no-code is a transform that needs manual updates when the date drifts.

Testing harness

Code Repository transforms benefit from the same testing discipline as any production Python project. The Foundry-side test framework integrates with pytest, runs unit tests against synthetic input dataframes, and runs integration tests against sampled slices of production datasets. The institution that skips this discipline pays the cost downstream — a transform that worked on the analyst’s small test dataset but breaks at production scale, a transform that silently produces wrong results after the upstream schema changes underneath it, a transform whose behavior nobody can confidently reason about because the test coverage is sparse.

The companion bundle’s tests/test_silver_to_gold.py ships seven pytest contract tests for the worked-example transform. Together they check eight behaviors: that all counterparties survive the left joins; that the sanctions window function picks the most recent hit per counterparty; that out-of-window sanctions are dropped before the join; that the no-sanctions case yields null sanctions fields; that the adverse-media density counts mentions only inside the 90-day window; that no-adverse-media coalesces to zero rather than null; that audit-metadata columns are present and non-null on every output row; and that the business-logic output is deterministic across re-runs of the same input.

Two patterns make the tests useful in practice. First, the tests run against the pure-Spark variant of the transform (compute_counterparty_gold_pure in the worked example) rather than the decorated version. The pure-Spark variant takes plain DataFrames; the decorated version takes Foundry’s Input / Output types and requires the runtime to wire them. By extracting the business logic into a pure function and keeping the @transform decoration thin, the testable surface lives in pytest where the practitioner can iterate quickly. Second, the test fixtures use small DataFrames — three or four counterparties, half a dozen sanctions hits, a dozen adverse-media mentions. The small fixtures are intentional: each test should be readable in fifteen seconds, with the input data right alongside the assertion. Integration tests against larger samples live separately, in tests/integration/, and run less frequently (in CI on every merge, not on every commit).
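The split between a pure function and a thin I/O wrapper can be sketched in plain Python. Everything here is illustrative: lists of dicts stand in for DataFrames, and build_gold_rows and run_transform are hypothetical names, not the bundle's compute_counterparty_gold_pure. The two tests mirror two of the contract tests listed earlier (survival through the left join, and coalescing to zero).

```python
def build_gold_rows(counterparties, media_counts):
    """Pure business logic: left-join media counts, coalescing missing counts to 0."""
    counts_by_id = {m["counterparty_id"]: m["mention_count"] for m in media_counts}
    return [
        {**c, "media_mention_count": counts_by_id.get(c["counterparty_id"], 0)}
        for c in counterparties
    ]


def run_transform(read_counterparties, read_media_counts, write_output):
    """Thin I/O wrapper: the only layer that would need the platform runtime."""
    write_output(build_gold_rows(read_counterparties(), read_media_counts()))


# Small fixtures, readable right next to the assertions.
COUNTERPARTIES = [
    {"counterparty_id": "C1", "name": "Acme"},
    {"counterparty_id": "C2", "name": "Beta"},
    {"counterparty_id": "C3", "name": "Gamma"},
]
MEDIA_COUNTS = [{"counterparty_id": "C1", "mention_count": 4}]


def test_all_counterparties_survive_left_join():
    gold = build_gold_rows(COUNTERPARTIES, MEDIA_COUNTS)
    assert [g["counterparty_id"] for g in gold] == ["C1", "C2", "C3"]


def test_no_media_coalesces_to_zero():
    gold = build_gold_rows(COUNTERPARTIES, MEDIA_COUNTS)
    counts = {g["counterparty_id"]: g["media_mention_count"] for g in gold}
    assert counts == {"C1": 4, "C2": 0, "C3": 0}
```

Because the wrapper only moves data in and out, pytest never needs to construct the platform's input and output objects; the tests exercise the logic directly.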

The CI/CD pattern

Foundry Code Repositories integrate with the institution’s broader Git-based change-management process. The CI pipeline that ships with the companion bundle (ci_cd/foundry_code_repo_ci_template.yaml) describes a seven-stage flow with a manual approval gate: lint (ruff, black, mypy), unit test (pytest against synthetic inputs), integration test (pytest against sampled dataset slices), security scan (pip-audit on requirements), Foundry-side build (the platform’s own validation that the transforms compile and the dataset dependencies resolve), staging deploy, and production deploy, with the manual approval gate sitting between staging deploy and production deploy.

The manual approval gate matters most. For transforms writing to gold-layer regulated ontology objects, the production deploy requires dual approval: one approver from data engineering (verifying the technical change) and one from compliance (verifying the change does not introduce a model-risk or audit-trail concern). The dual approval echoes the dual-approval pattern in Foundry Actions Framework, and for the same reason: high-impact state changes deserve segregation of duties at the change boundary, not just at the runtime boundary.

The post-deploy reconciliation step that the CI template references is the operational hook that connects to Foundry Actions Framework’s TransformChangeEntry concept. Every production deploy generates a TransformChangeEntry capturing the prior commit hash, the new commit hash, the change owner, the approvers, and a materiality assessment of the expected affected rows. The entry is itself an immutable AuditEntry subtype in the Foundry Actions Framework sense. The institution’s compliance-data steward gets a daily roll-up of TransformChangeEntries affecting regulated ontology objects, with the materiality-assessment narratives. Reading those narratives weekly is part of how the institution maintains the model-change governance discipline SR 11-7 §V.A.5 expects.
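The fields listed above map naturally onto an immutable record. The sketch below is one possible shape, not the Foundry Actions Framework definition: the field names, the deployed_at timestamp, and the two validation rules (two distinct approvers, and the change owner not approving their own change) are assumptions made for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class TransformChangeEntry:
    prior_commit: str
    new_commit: str
    change_owner: str
    approvers: tuple
    expected_affected_rows: int
    materiality_narrative: str
    deployed_at: datetime  # assumed field, not named in the article

    def __post_init__(self):
        # Assumed rules reflecting the dual-approval and segregation-of-duties idea.
        if len(set(self.approvers)) < 2:
            raise ValueError("dual approval requires two distinct approvers")
        if self.change_owner in self.approvers:
            raise ValueError("change owner cannot approve their own change")


entry = TransformChangeEntry(
    prior_commit="a1b2c3d",  # illustrative values throughout
    new_commit="e4f5a6b",
    change_owner="engineer_a",
    approvers=("data_eng_reviewer", "compliance_reviewer"),
    expected_affected_rows=1200,
    materiality_narrative="Window ordering fix; affects counterparties with same-day hits.",
    deployed_at=datetime(2024, 4, 15, tzinfo=timezone.utc),
)
```

Freezing the dataclass only prevents in-process mutation; durable immutability would still depend on where and how the entries are stored.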

The mixed-pipeline problem

A production pipeline that mixes no-code and code transforms — eighty percent Pipeline Builder, twenty percent Code Repository — can be either a maintainable architecture or an unmaintainable one. The same code-paths and the same no-code paths produce both outcomes; the difference is the discipline around the boundary.

The failure mode looks like this. A senior engineer builds a Code Repository transform for a hard problem — the silver-to-gold counterparty production transform with its windowing and audit metadata. The transform works. The engineer rotates to another priority. Eighteen months later, the upstream silver schema changes, and the transform silently starts producing wrong results — not crashes, just subtly wrong values that nobody catches until a downstream analyst notices an anomaly in a portfolio review. By the time the institution traces the anomaly back to the transform, the original engineer is gone, the test suite was thin, and the transform’s documentation lives in a docstring nobody updated. The institution now has a mission-critical transform nobody on the current team understands and nobody is comfortable modifying. The “twenty percent code that everyone is afraid to touch” outcome.

The disciplines that prevent this failure mode are not exotic. First: documentation in the transform module’s docstring covering the business logic, the schema dependencies, and the failure modes. Second: tests thorough enough that running them gives the maintainer confidence about what the transform does. Third: an ownership designation — every Code Repository transform names a current owner in the project’s CODEOWNERS file, and ownership rotates when the original owner leaves. Fourth: schema-change detection in CI, which the institution wires by capturing schema fingerprints of every input dataset at transform-build time and comparing against current fingerprints at every CI run. Fifth: review on every change, even for minor refactors, with the reviewer not being the same person as the author.
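The fourth discipline, schema-change detection, can be sketched with nothing more than a hash over column names and types. This is an assumed design rather than a documented Foundry feature: the function names schema_fingerprint and drifted_inputs are hypothetical, and how the current schema of each input is read from the platform is left out.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash an iterable of (column_name, column_type) pairs into a stable fingerprint.

    Columns are sorted first, so column order alone does not change the result.
    """
    canonical = json.dumps(sorted([name, dtype] for name, dtype in columns))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_inputs(recorded, current):
    """Return the dataset paths whose fingerprint is missing or differs from the recorded one."""
    return sorted(path for path, fp in recorded.items() if current.get(path) != fp)


# Fingerprint captured at transform-build time (illustrative schema).
recorded = {
    "/dd/silver/sanctions_hits": schema_fingerprint(
        [("counterparty_id", "string"), ("effective_date", "date")]
    )
}
# Fingerprint computed at a later CI run, after an upstream type change.
current = {
    "/dd/silver/sanctions_hits": schema_fingerprint(
        [("counterparty_id", "string"), ("effective_date", "string")]
    )
}
changed = drifted_inputs(recorded, current)
```

Sorting the columns is a design choice: it avoids false alarms from reordering, at the cost of missing changes that matter only for position-dependent code.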

These disciplines apply just as much to no-code Pipeline Builder transforms, but no-code transforms have a kind of self-documenting structure that code transforms lack — the DAG visualization is at least a starting point for a maintainer trying to understand the transform. Code Repository transforms need to compensate with explicit documentation that the no-code path produces implicitly.

Worked example walked through pattern by pattern

The worked example — the silver-to-gold counterparty production transform from the companion bundle — exhibits each pattern this article walks. The @transform decorator wires it into the Pipeline Builder DAG. The window function on effective_date desc() produces “most recent sanctions hit per counterparty.” The 365-day filter is dynamic-date filtering against F.current_date(). The left joins propagate nulls that the F.coalesce calls then resolve to deterministic non-null values. The audit-metadata columns at the end give the transform’s output rows the traceability that regulated workflows expect.

The transform is a small example, but it is a real example. Its size — under one hundred lines of body code — is not because the author was minimalist; it is because each PySpark pattern, applied to its specific job, is intrinsically compact when written correctly. The corollary that the practitioner should take seriously: a Code Repository transform that runs to several hundred lines should be regarded with suspicion. The transform is either doing too much (and should be split into a chain) or implementing a pattern that should have been a UDF, a library function, or a no-code transform. Long transforms are usually wrong transforms.

The companion bundle’s performance_notes.md covers what happens when the worked example runs against a fifty-to-hundred-million-counterparty universe. Three classic Spark failure modes appear at production scale: skew (a small number of counterparties have far more sanctions hits or media mentions than others, producing partition imbalance), partition imbalance (the default shuffle-partition count is wrong for the dataset size, producing either OOM errors or scheduling overhead), and shuffle bloat (the wide transforms move ten times more data over the network than the input size justifies). Each has a concrete mitigation: salting the join key for skew, repartitioning to a calibrated target for imbalance, broadcasting small dimensions for shuffle bloat. The pre-production gate criteria in the performance notes — measure wall time, shuffle volume, and peak executor memory at three different scale tiers before promoting — are the operational discipline that catches these failures before production does.
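Salting is the least intuitive of the three mitigations, so a small plain-Python model may help. It is a sketch of the idea, not the performance notes' code: the large, skewed side gets a salt appended to its join key, and the small side is replicated once per salt value so every salted key still finds its match. The names and the bucket count are hypothetical, and the salt rotates by row index here for determinism, whereas a Spark job would more typically draw it at random.

```python
from collections import Counter

SALT_BUCKETS = 4  # hypothetical; calibrated per dataset in practice


def salt_fact_rows(rows, key):
    """Append a rotating salt to the join key on the large, skewed side."""
    return [{**r, "salted_key": f"{r[key]}#{i % SALT_BUCKETS}"} for i, r in enumerate(rows)]


def explode_dimension_rows(rows, key):
    """Replicate each small-side row once per salt value."""
    return [{**r, "salted_key": f"{r[key]}#{s}"} for r in rows for s in range(SALT_BUCKETS)]


def join_on(left, right, key):
    """Minimal inner hash join on `key`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# One "hot" counterparty with 8 hits, one ordinary counterparty with 1.
hits = [{"counterparty_id": "C1", "hit_id": i} for i in range(8)]
hits.append({"counterparty_id": "C2", "hit_id": 8})
names = [{"counterparty_id": "C1", "name": "Acme"}, {"counterparty_id": "C2", "name": "Beta"}]

plain = join_on(hits, names, "counterparty_id")
salted = join_on(salt_fact_rows(hits, "counterparty_id"),
                 explode_dimension_rows(names, "counterparty_id"), "salted_key")

largest_group_plain = max(Counter(r["counterparty_id"] for r in hits).values())
largest_group_salted = max(Counter(r["salted_key"] for r in salt_fact_rows(hits, "counterparty_id")).values())
```

The join result is the same set of rows either way; what changes is that the largest group sharing one join key shrinks from 8 rows to 2, which is what lets Spark spread the hot key across partitions.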

The deeper architectural observation is the one Time Series in Foundry picks up. For counterparty universes large enough that the worked-example transform’s joins become structurally expensive, the right architecture is not “tune the joins harder”; the right architecture is “pre-compute the rollups in a streaming pipeline and reduce the gold-layer transform to a thin assembly step.” Code Repository transforms scale to the point where re-architecture starts to dominate over tuning, and the practitioner who recognizes the boundary saves the institution from the variant of “twenty percent code nobody is afraid to touch but everyone wishes had been re-architected eighteen months ago.”

A grounding case: when transform-governance gaps appear in enforcement

Transforms that write to regulated ontology objects are the institutional mechanism through which the data ultimately reaches workpapers, SAR filings, and regulatory reports. The institutional history of accounting-restatement and AML enforcement includes cases where transform-change governance gaps contributed to the underlying data-quality issue. The Wells Fargo OCC enforcement (2018-2020 series of actions, totaling several billion in penalties) included findings related to data-quality controls in the institution’s customer-account systems. These are controls of the type that the TransformChangeEntry pattern (per Foundry Actions Framework) and the CI/CD discipline (this article’s §“The CI/CD pattern”) are designed to support.

What the practitioner does with this reference. When the institution proposes promoting a Code Repository transform from staging to production, the practitioner-reviewer asks: who validated this change, what tests passed, what dependencies were verified, what materiality assessment was performed on the expected output drift. The CI/CD pipeline produces the evidence; the institutional discipline that turns the evidence into examination-credible work is consistent dual-approval enforcement on regulated-ontology writers (per Foundry Actions Framework) plus the post-deploy reconciliation that catches material drift before it appears in workpapers downstream.

Authority

Palantir Foundry Documentation — Code Repositories (project structure, @transform decorator, Input/Output API).
Palantir Foundry Documentation — Pipeline Builder (no-code operator catalog, dataset dependency model).
Apache PySpark Documentation — DataFrame API, Window functions, performance tuning.
Zaharia, M., et al. (2012). “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI ’12. The foundational RDD paper for the curious; PySpark sits on top of this abstraction.
Karau, H., & Warren, R. (2017). High Performance Spark. O’Reilly. The practitioner reference for performance-tuning patterns.
Kim, G., Humble, J., Debois, P., Willis, J., & Forsgren, N. (2021). The DevOps Handbook (2nd ed.). IT Revolution. For the CI/CD discipline framing.
Federal Reserve, Supervisory Guidance on Model Risk Management (SR 11-7), §V.A.5 on model change governance.

Companion repository

Code Repositories in Foundry’s full companion bundle lives at github.com/noahrgreen/dd-tech-lab-companion/articles/006_code_repositories_python_pyspark/. It ships the silver-to-gold counterparty PySpark transform, the seven pytest contract tests, the eighteen-criterion Pipeline-Builder-vs-Code-Repository decision matrix, the seven-stage CI/CD configuration template with the dual-approval gate, and the production-scale performance notes covering skew, partition imbalance, and shuffle bloat with concrete mitigation code. Quiver for Ad-Hoc Counterparty Queries covers the Quiver tool for the ad-hoc-query patterns that would otherwise tempt the analyst into writing a Code Repository job for a one-off question.
