# Building a Credit Risk Data Science Project That Actually Survives Audit

Risk management is often discussed in abstract terms at the board level, but it becomes painfully concrete once defaults start rising and cash stops coming in.
Data science can help, but only if projects are framed around real risk questions and produce outputs that risk, finance, and audit can all live with.

This article walks through a realistic example: a **probability of default (PD)** model for retail or SME loans.
The goal is not to show the fanciest algorithm, but to lay out a project structure that can survive both model risk review and day‑to‑day operations.

---

## 1. Start from the risk question, not the algorithm

A sensible PD project does not start with "let's try XGBoost". It starts with a very specific question:

> "From all active loans today, which ones are most likely to default in the next 12 months?"

That question drives everything else.
Together with credit and risk teams you should lock in:

- **Definition of default** – e.g. more than 90 days past due, restructuring due to financial difficulty, or write‑off.
- **Time horizon** – 12‑month PD is a common anchor for portfolio steering and limit setting.
- **Output format** – not just raw scores, but buckets or segments that map to decisions (approve/decline, limit increase/decrease, early collections).

This is exactly the spirit of the **Business Understanding** phase in CRISP‑DM: agreeing what success looks like before touching a single row of data.
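
One way to keep the agreed definition of default unambiguous is to write it down as a small, reviewable function. The following is a minimal sketch, not a prescribed implementation: the `LoanStatus` class, its field names, and the `is_default` helper are hypothetical, and the 90‑day threshold is simply the example value from the list above.

```python
from dataclasses import dataclass

# Example threshold from the text; the real value comes from the agreed definition.
DPD_THRESHOLD = 90


@dataclass(frozen=True)
class LoanStatus:
    """Hypothetical snapshot of a loan's status on an observation date."""
    days_past_due: int
    distressed_restructuring: bool = False
    written_off: bool = False


def is_default(status: LoanStatus, dpd_threshold: int = DPD_THRESHOLD) -> bool:
    """Return True if any of the agreed default triggers has fired."""
    return (
        status.days_past_due > dpd_threshold  # "more than 90 days past due"
        or status.distressed_restructuring
        or status.written_off
    )
```

Keeping the rule in one place means the label used for training and the flag used in reporting cannot silently drift apart.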

---

## 2. Build the right table: one row per loan

Once the question is clear, you can design the dataset.
For a PD use case, the core table usually combines:

- **Application attributes** – age, employment type, product type, tenor, loan amount.
- **Capacity metrics** – income, instalment‑to‑income ratio, existing indebtedness.
- **Behavioural information** – payment history, roll‑rate patterns, past delinquencies.
- **Target label** – a binary `default_12m` flag describing what happened to that loan over the next 12 months.

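Technically, you end up with one row per loan at a frozen "cut‑off" date: features use only information available on or before that date, and the label is observed over the 12 months after it.
This separation lets you train and evaluate models without leaking future information.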
It also maps neatly to how risk managers think about exposure.

Getting to this clean table is where most of the work sits: joining sources, engineering features, sanity‑checking distributions, and reconciling business rules with what the data actually contains.
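
The cut‑off logic can be sketched in pandas as follows. This is an illustration under assumptions: the `loans` and `payments` tables, their column names (`loan_id`, `instalment`, `income`, `as_of_date`, `days_past_due`), the single behavioural feature, and the `build_pd_table` helper are all hypothetical, and default is reduced to the days‑past‑due trigger for brevity.

```python
import pandas as pd


def build_pd_table(loans: pd.DataFrame, payments: pd.DataFrame, cutoff: str,
                   horizon_months: int = 12, dpd_threshold: int = 90) -> pd.DataFrame:
    """One row per loan: features up to the cut-off, label from the window after it."""
    cutoff_ts = pd.Timestamp(cutoff)
    horizon_end = cutoff_ts + pd.DateOffset(months=horizon_months)
    dates = pd.to_datetime(payments["as_of_date"])

    # Features: only observations on or before the cut-off date.
    history = payments[dates <= cutoff_ts]
    max_dpd = (
        history.groupby("loan_id")["days_past_due"].max()
        .rename("max_dpd_before_cutoff")
    )

    # Label: worst delinquency strictly after the cut-off, within the horizon.
    future = payments[(dates > cutoff_ts) & (dates <= horizon_end)]
    worst_future = future.groupby("loan_id")["days_past_due"].max()
    label = (worst_future > dpd_threshold).astype(int).rename("default_12m")

    table = loans.set_index("loan_id").join(max_dpd).join(label)
    # Assumption: no payment record means no observed delinquency.
    table["max_dpd_before_cutoff"] = table["max_dpd_before_cutoff"].fillna(0)
    table["default_12m"] = table["default_12m"].fillna(0).astype(int)
    table["instalment_to_income"] = table["instalment"] / table["income"]
    return table.reset_index()
```

The point of the sketch is the two date filters: anything used as a feature sits at or before the cut‑off, and anything used for the label sits after it.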

---

## 3. Use a transparent baseline model first

Many data science projects get lost because they jump straight to deep learning or complex ensembles.
For PD, a **logistic regression baseline** is still incredibly valuable:

- It produces probabilities between 0 and 1 that map naturally to "chance of default".
- Coefficients and signs can be explained to credit and audit.
- Implementation and validation are straightforward.

A simple but solid baseline recipe:

1. Split the data into **train** and **test** sets (e.g. 75% / 25%) with stratification on the default flag.
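2. Standardise numeric features (age, income, loan_amount, tenor, etc.), fitting the scaler on the training set only so that test‑set statistics do not leak into the model.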
3. Fit a logistic regression model on the training set.
4. Evaluate it on the test set with metrics such as **ROC AUC**.
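
The four steps above can be sketched with scikit‑learn. This is a minimal illustration, assuming scikit‑learn is the library in use (the text does not name one); the `fit_baseline` helper is hypothetical and the data is synthetic, generated only so the example runs end to end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_baseline(X, y, seed: int = 42):
    """Fit a scaled logistic regression and return (model, test-set AUC)."""
    # 1. Stratified 75/25 split keeps the default rate similar in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # 2 + 3. The pipeline fits the scaler on the training data only, then the model.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    # 4. AUC is computed from predicted default probabilities on the held-out set.
    pd_test = model.predict_proba(X_test)[:, 1]
    return model, roc_auc_score(y_test, pd_test)


# Synthetic stand-in data, purely for illustration (not real portfolio data).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))  # three made-up numeric features
logit = -2.0 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model, auc = fit_baseline(X, y)
print(f"Test AUC: {auc:.3f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that makes it hard to scale the test set with the wrong statistics, and it gives you a single object to persist later.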

The important mindset is this: the baseline is your reference point.
Any more sophisticated model must beat it in a measurable way, not just feel more "advanced".

---

## 4. Evaluate both discrimination and calibration

A single summary metric like AUC is not enough to sign off a risk model.
Sign‑off needs an answer to two questions:

1. **Can the model rank‑order risk?**  
   A decent AUC or Gini (Gini = 2 × AUC − 1) tells you whether high‑score loans really do default more often than low‑score loans.
2. **Are the probabilities believable?**  
   A calibration plot comparing PD buckets (e.g. 0–2%, 2–5%, 5–10%) with realised default rates shows whether "5%" really behaves like 1 in 20 over time.

For practical portfolio steering you care about both:

- Underwriting teams need good rank ordering to set thresholds and limits.
- Finance and risk appetite functions need realistic PD levels for provisioning, capital, and stress testing.

This is also where you start to see if certain segments (e.g. by product, channel, or region) behave differently enough to justify separate models or override rules.
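
The calibration check described in this section can be reduced to a small table before any plotting. The sketch below assumes pandas; the `calibration_table` helper is hypothetical, and the bucket edges are the example ranges from the text plus a catch‑all bucket above 10%.

```python
import pandas as pd

# Example bucket edges from the text (0-2%, 2-5%, 5-10%) plus a catch-all.
PD_EDGES = [0.0, 0.02, 0.05, 0.10, 1.0]


def calibration_table(pd_pred, defaulted, edges=PD_EDGES) -> pd.DataFrame:
    """Compare mean predicted PD with the realised default rate per PD bucket."""
    df = pd.DataFrame({"pd_pred": pd_pred, "defaulted": defaulted})
    df["bucket"] = pd.cut(df["pd_pred"], bins=edges, include_lowest=True)
    return (
        df.groupby("bucket", observed=True)
        .agg(
            n_loans=("defaulted", "size"),
            mean_pd=("pd_pred", "mean"),
            realised_rate=("defaulted", "mean"),
        )
        .reset_index()
    )
```

Running the same function on a filtered subset (one product, channel, or region at a time) is a simple way to spot segments where `mean_pd` and `realised_rate` diverge.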

---

## 5. From notebook to risk process: deployment hooks

Even a beautiful notebook is useless if it never leaves the lab.
To make the PD model part of your real risk toolkit you need a few concrete steps:

- **Model packaging** – serialise the trained model object and any scalers (for example using `joblib`) and store the artefacts in a versioned location, with clear model IDs and metadata.
- **Scoring interface** – implement a function or API endpoint that takes a loan application record and returns a PD plus the key features used.
- **Monitoring** – track the distribution of scores, feature drift, and the gap between predicted and realised default rates over time.
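
For the monitoring bullet, one commonly used way to quantify a shift in the score distribution is the population stability index (PSI). The text does not prescribe a metric, so treat this as one possible choice; the function name and the decile binning are assumptions of this sketch.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a reference score sample and a recent one."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference (e.g. development) sample.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so out-of-range values land in the end bins.
    inner = edges[1:-1]
    exp_bins = np.searchsorted(inner, expected, side="right")
    act_bins = np.searchsorted(inner, actual, side="right")
    exp_share = np.bincount(exp_bins, minlength=n_bins) / expected.size
    act_share = np.bincount(act_bins, minlength=n_bins) / actual.size
    # Floor the shares to avoid log(0) when a bin is empty.
    exp_share = np.clip(exp_share, eps, None)
    act_share = np.clip(act_share, eps, None)
    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))
```

The same function can be applied to individual features as well as to scores. What PSI level triggers a review is a policy decision to agree with model risk rather than something the code can settle.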

A pragmatic pattern is to treat the notebook as the **analysis canvas** and move production code into plain Python modules or services.
That way, the same logic can run inside batch jobs, real‑time APIs, or whatever architecture you use.
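
As a rough sketch of what such a module might contain, the example below persists a fitted scikit‑learn pipeline with `joblib` and exposes a scoring function. Everything named here is hypothetical: the `FEATURES` list, the `save_model` and `score_application` helpers, the `pd_retail_v1` model ID, and the JSON metadata layout. The model is trained on synthetic data only so the example runs.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["income", "loan_amount", "tenor"]  # hypothetical feature order


def save_model(model, directory: Path, model_id: str) -> Path:
    """Persist the fitted pipeline plus a small metadata file next to it."""
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / f"{model_id}.joblib"
    joblib.dump(model, model_path)
    meta = {"model_id": model_id, "features": FEATURES}
    (directory / f"{model_id}.json").write_text(json.dumps(meta, indent=2))
    return model_path


def score_application(model_path: Path, application: dict) -> dict:
    """Return a PD together with the inputs that were used to produce it."""
    # Loading per call keeps the sketch short; a service would load once at start-up.
    model = joblib.load(model_path)
    row = np.array([[float(application[name]) for name in FEATURES]])
    pd_value = float(model.predict_proba(row)[0, 1])
    return {"pd": pd_value,
            "features_used": {name: application[name] for name in FEATURES}}


# Example run on synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(FEATURES)))
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)
fitted = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = save_model(fitted, Path(tmp), model_id="pd_retail_v1")
    result = score_application(
        path, {"income": 0.2, "loan_amount": -0.1, "tenor": 1.0}
    )
    print(result["pd"])
```

Because the scaler and the model are saved as one pipeline object, the batch job and the API call the same `score_application` logic and cannot apply different preprocessing.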

---

## 6. How this template generalises to other risk problems

The same structure applies to many other risk and GRC use cases:

- **Fraud detection** – ranking transactions or accounts by likelihood of being fraudulent.
- **Supplier / counterparty risk** – estimating the chance that a supplier will fail or seriously under‑perform.
- **Collections and recovery** – prioritising which delinquent accounts are most likely to cure given limited collections capacity.

In all of these, the discipline is the same:

1. Start from a crisp business question with a clear definition of "bad" outcomes.
2. Build a case‑level dataset with a robust target label.
3. Establish a transparent baseline model and evaluation.
4. Move results from notebook to production code with monitoring from day one.

That discipline is what turns data science from an experimental slide deck into a repeatable part of your risk governance.