Credit risk PD models that survive audit: a concise data-analytical playbook

Probability of default (PD) models live or die on whether they survive model risk review, not on which algorithm they use. This is a compact, data-analytical template for a retail/SME PD project with a 12-month horizon, structured so the same notebook can be defended to credit, finance, and audit in the same meeting.

1. Start from the risk question, not the algorithm

A defensible PD project does not start with "let's try XGBoost". It starts with one specific question: from all active loans today, which are most likely to default in the next 12 months? Lock the definition of default (e.g. 90+ DPD, restructured, written-off), the horizon (12 months), and the output shape (PD buckets that map to approve / decline / limit actions) with credit and risk before touching a single row.

CRISP-DM business understanding

Agreeing on what "success" and "bad" mean before modelling is what separates a reproducible PD pipeline from a one-off score that nobody can explain.
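
One way to make those agreements stick is to write them down as a small, frozen specification that the rest of the pipeline imports. The sketch below is illustrative only: the class name PDProjectSpec, the helper functions, and the bucket thresholds (2% and 10%) are assumptions, not values from any real credit policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PDProjectSpec:
    """Agreed project definitions, frozen so they cannot drift mid-project."""
    dpd_threshold: int = 90      # 90+ days past due counts as default
    horizon_months: int = 12     # outcome window after the cut-off date
    # (exclusive upper PD bound, action); thresholds are illustrative only
    pd_buckets: tuple = ((0.02, "approve"), (0.10, "limit"), (1.01, "decline"))


def is_default(spec, max_dpd, restructured, written_off):
    """Apply the agreed default definition to one loan's outcome-window facts."""
    return max_dpd >= spec.dpd_threshold or restructured or written_off


def action_for(spec, pd_value):
    """Map a PD to the first bucket whose upper bound exceeds it."""
    for upper, action in spec.pd_buckets:
        if pd_value < upper:
            return action
    raise ValueError(f"PD out of range: {pd_value}")


SPEC = PDProjectSpec()
print(is_default(SPEC, max_dpd=95, restructured=False, written_off=False))
print(action_for(SPEC, 0.035))
```

Because the dataclass is frozen, any attempt to change the definition halfway through the project fails loudly instead of silently producing a different label.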

2. Build the right table: one row per loan

The modelling table is one row per loan, frozen at a cut-off date, with a binary default_12m label observed over the following 12 months. This shape prevents leakage and aligns directly with how risk managers think about exposure.

Feature group	Examples	Why it matters
Application	Age, employment, product, tenor, loan amount	Underwriting context at origination
Capacity	Income, instalment-to-income, existing debt	Ability to repay
Behavioural	Payment history, roll rates, prior delinquencies	Strongest single signal for retail PD
Target	default_12m (0/1)	Observed bad outcome over the horizon

Most of the project effort sits here: joining sources, engineering features, sanity-checking distributions, and reconciling business rules with what the data actually contains.
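
As a minimal sketch of the "frozen at a cut-off date" idea, the example below builds a one-row-per-loan table from a toy performance history. The perf table, its column names, and the cut-off date are hypothetical; the 90 DPD threshold is the example definition mentioned earlier. Features only see data up to the cut-off, and the label only sees the 12 months after it.

```python
import pandas as pd

CUTOFF = pd.Timestamp("2023-01-01")
HORIZON_END = CUTOFF + pd.DateOffset(months=12)

# Hypothetical performance snapshots: one row per loan per observation date.
perf = pd.DataFrame({
    "loan_id":  [1, 1, 2, 2, 3, 3],
    "snapshot": pd.to_datetime(["2022-12-31", "2023-06-30",
                                "2022-12-31", "2023-09-30",
                                "2022-12-31", "2024-03-31"]),
    "dpd":      [0, 30, 10, 120, 0, 95],
})

# Features may only use information known at or before the cut-off.
hist = perf[perf["snapshot"] <= CUTOFF]
features = hist.groupby("loan_id").agg(max_dpd_hist=("dpd", "max"))

# The label only uses the 12 months after the cut-off.
window = perf[(perf["snapshot"] > CUTOFF) & (perf["snapshot"] <= HORIZON_END)]
bad = window.groupby("loan_id")["dpd"].max() >= 90

# Loans with no observation in the window are treated as non-default here;
# that is a simplifying assumption to confirm with credit.
label = bad.reindex(features.index, fill_value=False).astype(int)
table = features.assign(default_12m=label)
print(table)
```

Loan 3 reaches 95 DPD only after the horizon ends, so it is labelled 0. Counting it as bad would describe a different horizon than the one agreed with credit.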

3. Transparent baseline first: logistic regression

Logistic regression remains the right starting point for PD: outputs sit naturally in [0, 1], coefficients have signs and magnitudes you can defend, and validation is straightforward. Any fancier model — GBM, XGBoost, neural net — must beat this baseline in measurable lift, not vibes.

Stratified train/test split (75/25) on the default flag.
Standardise numeric features (age, income, loan amount, tenor).
Fit logistic regression on the training fold.
Score the test set and compute AUC ROC plus Gini.

Minimal PD baseline (sklearn, Python)

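from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X: numeric feature matrix, y: default_12m (0/1) from the modelling table.
# Stratify on the label so train and test keep the same default rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training fold only to avoid leaking test statistics.
sc = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(sc.transform(X_tr), y_tr)

# Column 1 of predict_proba is the probability of the positive (default) class.
pd_hat = clf.predict_proba(sc.transform(X_te))[:, 1]
auc = roc_auc_score(y_te, pd_hat)
print('AUC:', round(auc, 3))
print('Gini:', round(2 * auc - 1, 3))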
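
The claim that coefficients have signs and magnitudes you can defend can be checked directly. The sketch below uses synthetic data (the feature names, effect sizes, and sample size are all made up for illustration) and reads the fitted coefficients as odds ratios per one standard deviation, which is a valid reading when the inputs are standardised.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: default risk rises with
# instalment-to-income and falls with income.
rng = np.random.default_rng(0)
n = 5000
income = rng.normal(50, 15, n)
iti = rng.uniform(0.05, 0.6, n)
logit = -3.0 + 4.0 * iti - 0.03 * (income - 50)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

names = ["income", "instalment_to_income"]
X = StandardScaler().fit_transform(np.column_stack([income, iti]))
clf = LogisticRegression(max_iter=1000).fit(X, y)

# On standardised inputs, exp(coef) is the odds ratio per one standard deviation.
odds_ratios = dict(zip(names, np.exp(clf.coef_[0])))
for name, ratio in odds_ratios.items():
    print(f"{name:>22}: odds ratio per 1 SD = {ratio:.2f}")
```

An odds ratio below 1 for income and above 1 for instalment-to-income matches credit intuition. A sign that contradicts intuition is usually worth investigating before the model goes to review.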

4. Evaluate discrimination AND calibration

AUC alone is not enough to sign off a risk model. Two questions matter equally: can the model rank-order risk (AUC / Gini), and are the probabilities believable (calibration)? Plot PD buckets — 0–2%, 2–5%, 5–10% — against realised default rates and check that "5%" really behaves like 1 in 20 over time.
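
The bucket check can be tabulated before it is plotted. This is a rough sketch on synthetic scores: the function name calibration_table is hypothetical, and the extra 10%+ bucket is added only so that every score lands somewhere.

```python
import numpy as np
import pandas as pd


def calibration_table(y_true, pd_hat, edges=(0.0, 0.02, 0.05, 0.10, 1.0)):
    """Compare mean predicted PD with the realised default rate per PD bucket."""
    df = pd.DataFrame({"y": y_true, "pd": pd_hat})
    df["bucket"] = pd.cut(df["pd"], bins=list(edges), include_lowest=True)
    out = df.groupby("bucket", observed=True).agg(
        n=("y", "size"), mean_pd=("pd", "mean"), realised=("y", "mean"))
    # Positive gap: the model under-predicts risk in that bucket.
    out["gap"] = out["realised"] - out["mean_pd"]
    return out


# Synthetic example: outcomes are drawn from the stated PDs,
# so the gaps should be small by construction.
rng = np.random.default_rng(1)
pd_hat = rng.beta(1, 15, size=20000)
y = rng.binomial(1, pd_hat)
cal = calibration_table(y, pd_hat)
print(cal.round(4))
```

On real data, a gap that is consistently positive or negative across buckets points to a level problem, while gaps that change sign across buckets point to a shape problem in the score-to-PD mapping.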

Question	Metric	Used by
Does the model rank risk?	AUC ROC, Gini, KS	Underwriting, limit setting
Are PDs believable?	Calibration plot, Brier score	Provisioning, capital, IFRS 9
Is it stable across segments?	Segment AUC, PSI	Model risk review
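
For readers who want to see what sits behind the metric names, here is a small NumPy sketch of three of them: Gini as 2 × AUC − 1, KS as the largest gap between the cumulative score distributions of bads and goods, and the Brier score as the mean squared error of the predicted probabilities. The function names and the six-loan example are illustrative; in practice a library implementation is usually preferable.

```python
import numpy as np


def gini_from_auc(auc):
    """Gini coefficient derived from the ROC AUC."""
    return 2.0 * auc - 1.0


def ks_statistic(y_true, score):
    """Largest gap between cumulative score distributions of bads and goods."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    order = np.argsort(score)
    y_sorted, s_sorted = y_true[order], score[order]
    cum_bad = np.cumsum(y_sorted) / y_sorted.sum()
    cum_good = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
    # Evaluate only at the last row of each tied score so ties are not split.
    last_of_tie = np.append(s_sorted[1:] != s_sorted[:-1], True)
    return float(np.max(np.abs(cum_bad - cum_good)[last_of_tie]))


def brier_score(y_true, pd_hat):
    """Mean squared difference between predicted PD and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    pd_hat = np.asarray(pd_hat, dtype=float)
    return float(np.mean((pd_hat - y_true) ** 2))


y = [0, 0, 1, 0, 1, 1]
p = [0.02, 0.05, 0.10, 0.20, 0.40, 0.70]
ks = ks_statistic(y, p)
brier = brier_score(y, p)
print("KS:", round(ks, 3), "Brier:", round(brier, 4))
```

KS and Gini only depend on the ordering of the scores, while the Brier score changes if every PD is doubled. That difference is why the two families of metrics answer different questions.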

Both, not either

Underwriting cares about rank ordering. Finance and risk appetite care about absolute PD levels for provisioning and stress testing. A model that passes one and fails the other will not survive review.

5. From notebook to risk process

A beautiful notebook that never leaves the lab adds no risk capability. Three concrete hooks turn the model into part of the risk toolkit:

Model packaging — serialise the fitted model and scalers (joblib) with model ID, training date, and feature schema in version control.
Scoring interface — a function or API that takes one loan record and returns a PD plus the top contributing features.
Monitoring — track score distribution, feature drift (PSI), and the gap between predicted and realised default rates monthly.
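
For the monitoring hook, a population stability index (PSI) can be computed in a few lines. The sketch below is one common formulation, assuming bins taken from the deciles of the baseline sample; the function name, the bin count, and the small floor used to avoid log(0) are choices made here, not something the article prescribes.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Inner bin edges come from baseline quantiles; the outer bins are open-ended.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_share = np.bincount(np.searchsorted(edges, expected),
                          minlength=n_bins) / expected.size
    a_share = np.bincount(np.searchsorted(edges, actual),
                          minlength=n_bins) / actual.size
    # Floor the shares so an empty bin does not produce log(0).
    e_share = np.clip(e_share, eps, None)
    a_share = np.clip(a_share, eps, None)
    return float(np.sum((a_share - e_share) * np.log(a_share / e_share)))


# Synthetic check: an unchanged population versus one shifted by half a
# standard deviation.
rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 10000)
same = rng.normal(0.0, 1.0, 10000)
shifted = rng.normal(0.5, 1.0, 10000)
print("PSI same:", round(psi(baseline, same), 4))
print("PSI shifted:", round(psi(baseline, shifted), 4))
```

The same function works for the score distribution and for individual features. Commonly quoted rule-of-thumb thresholds (around 0.1 and 0.25) are conventions, so the alert levels should be agreed with model risk rather than copied.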

Treat the notebook as the analysis canvas and move production logic into plain Python modules so the same code can run inside batch jobs, real-time APIs, or whichever architecture the bank already runs.
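
As a sketch of what such a plain Python module might contain, the function below scores one loan record with a logistic model and returns the PD together with the largest contributions to the log-odds. Everything in MODEL (the model ID, means, standard deviations, and coefficients) is hypothetical and hard-coded for illustration; in practice those values would come from the serialised model and scaler.

```python
import math

# Hypothetical fitted artefact, hard-coded here for illustration only.
MODEL = {
    "model_id": "pd_retail_demo",
    "intercept": -2.0,
    "features": {
        # name: (training mean, training std, coefficient on standardised value)
        "instalment_to_income": (0.30, 0.10, 0.80),
        "income": (50.0, 15.0, -0.40),
        "prior_delinquencies": (0.5, 1.0, 0.60),
    },
}


def score_loan(record, model=MODEL, top_n=2):
    """Return the PD for one loan plus the largest log-odds contributions."""
    missing = set(model["features"]) - set(record)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    # Contribution of each feature = coefficient * standardised value.
    contributions = {
        name: coef * (record[name] - mean) / std
        for name, (mean, std, coef) in model["features"].items()
    }
    log_odds = model["intercept"] + sum(contributions.values())
    top = sorted(contributions.items(),
                 key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
    return {"model_id": model["model_id"],
            "pd": 1.0 / (1.0 + math.exp(-log_odds)),
            "top_features": top}


result = score_loan({"instalment_to_income": 0.50,
                     "income": 35.0,
                     "prior_delinquencies": 2})
print(result)
```

Returning the model ID with every score makes it possible to trace a decision back to the exact model version. The additive contributions shown here are specific to a linear model on the log-odds scale; a non-linear challenger would need a different explanation method.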

6. The same template generalises

Swap the target label and the structure carries over: fraud detection (transaction-level), supplier/counterparty risk (entity-level), collections prioritisation (delinquent-account-level). The discipline is identical: crisp business question, clean case-level table, transparent baseline, dual-metric evaluation, and monitored deployment from day one.

Rudy Prasetiya

IT GRC, cybersecurity & audit practitioner. Writes about controls that actually hold.