{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Credit Risk Data Science Project\n\nThis notebook sketches a compact, production\u2011minded workflow for a credit risk use case: estimating the probability that a borrower will default on a loan within 12 months.\n\nIt loosely follows the CRISP\u2011DM lifecycle (business understanding \u2192 data understanding \u2192 preparation \u2192 modelling \u2192 evaluation \u2192 deployment hooks)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Business understanding\n\nObjective: estimate the probability of default (PD) for each loan so that the risk team can price accurately, set exposure limits, and design early\u2011warning triggers.\n\nExample success criteria:\n- AUC above 0.75 on a hold\u2011out sample.\n- Reasonable calibration: PD buckets line up with realised default rates over time.\n- Model and features can be explained to credit, risk, and audit teams."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# 2. Imports and configuration\nimport pandas as pd\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score, roc_curve\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.linear_model import LogisticRegression\n\nimport matplotlib.pyplot as plt\n%matplotlib inline\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2. Data understanding\n\nIn a real implementation you would pull data from your warehouse or lake (loan tape, bureau data, behavioural data).\n\nFor this template we assume a single table with one row per loan and a binary target `default_12m`. Replace the CSV path below with your own dataset when you move from template to production."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# 3. Load or simulate a dataset\n# Update `csv_path` to point at your real data extract when available.\n\ncsv_path = 'data/loans_sample.csv'\n\ntry:\n    df = pd.read_csv(csv_path)\nexcept FileNotFoundError:\n    # Lightweight synthetic data so the notebook runs end\u2011to\u2011end.\n    rng = np.random.default_rng(42)\n    n = 1000\n    df = pd.DataFrame({\n        'age': rng.integers(21, 70, size=n),\n        'income': rng.normal(10_000_000, 3_000_000, size=n),\n        'loan_amount': rng.normal(5_000_000, 2_000_000, size=n),\n        'tenor_months': rng.integers(6, 60, size=n),\n        'has_collateral': rng.integers(0, 2, size=n),\n    })\n\n    # Synthetic default rule for illustration only (do not use in production).\n    logit = (\n        0.5\n        + 0.0000001 * df['loan_amount']\n        - 0.00000005 * df['income']\n        + 0.01 * (df['tenor_months'] - 24)\n        + 0.3 * (1 - df['has_collateral'])\n    )\n    p = 1 / (1 + np.exp(-logit))\n    df['default_12m'] = rng.binomial(1, p)\n\ndf.head()\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 3. Quick data checks\n\nAt this stage you typically look for obvious issues: missing values, impossible values, and the balance of the target variable."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "df.describe(include='all')\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "df['default_12m'].value_counts(normalize=True).rename('proportion')\n"
      ],
      "execution_count": null,
      "outputs": []
    },
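    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The checks named above can be made explicit with a small helper. The cell below is a minimal sketch rather than a full data\u2011quality framework: `basic_quality_report` is a hypothetical helper introduced here, and treating `age`, `income`, `loan_amount`, and `tenor_months` as strictly positive is an assumption about the synthetic schema that you should adapt to your own data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import pandas as pd\n\n\ndef basic_quality_report(frame, positive_cols):\n    \"\"\"Share of missing values and count of non-positive values per column.\"\"\"\n    report = pd.DataFrame({'missing_share': frame.isna().mean()})\n    report['non_positive_count'] = 0\n    for col in positive_cols:\n        # Columns absent from the extract are skipped rather than raising.\n        if col in frame.columns:\n            report.loc[col, 'non_positive_count'] = int((frame[col] <= 0).sum())\n    return report\n\n\n# Hypothetical list of columns assumed to be strictly positive.\nbasic_quality_report(df, ['age', 'income', 'loan_amount', 'tenor_months'])\n"
      ],
      "execution_count": null,
      "outputs": []
    },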
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4. Train/test split and baseline model\n\nTo keep the template compact we use all non\u2011target columns as features and fit a logistic regression baseline.\nIn a real project you would iterate on feature engineering, variable selection, and stability analysis across time."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "target_col = 'default_12m'\nfeature_cols = [c for c in df.columns if c != target_col]\n\nX = df[feature_cols].copy()\ny = df[target_col].copy()\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.25, random_state=42, stratify=y\n)\n\nscaler = StandardScaler()\nX_train_scaled = scaler.fit_transform(X_train)\nX_test_scaled = scaler.transform(X_test)\n\nmodel = LogisticRegression(max_iter=1000)\nmodel.fit(X_train_scaled, y_train)\n\ny_pred_proba = model.predict_proba(X_test_scaled)[:, 1]\nauc = roc_auc_score(y_test, y_pred_proba)\nauc\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5. ROC curve for the baseline model\n\nThis gives a quick view of discrimination. For production, you would add calibration plots and back\u2011testing by vintage or cohort."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "fpr, tpr, _ = roc_curve(y_test, y_pred_proba)\n\nplt.figure(figsize=(5, 5))\nplt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')\nplt.plot([0, 1], [0, 1], 'k--', alpha=0.5)\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title('ROC curve \u2013 baseline credit risk model')\nplt.legend(loc='lower right')\nplt.tight_layout()\nplt.show()\n"
      ],
      "execution_count": null,
      "outputs": []
    },
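    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A first calibration check compares the mean predicted PD with the realised default rate inside score buckets. The cell below is a minimal sketch under assumptions: `calibration_table` is a hypothetical helper introduced here, and ten equal\u2011frequency buckets are an arbitrary choice, not a recommendation of this template. On a well\u2011calibrated model the two rate columns should track each other closely."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import numpy as np\nimport pandas as pd\n\n\ndef calibration_table(y_true, y_score, n_buckets=10):\n    \"\"\"Mean predicted PD versus observed default rate per score bucket.\"\"\"\n    scored = pd.DataFrame({\n        'y': np.asarray(y_true),\n        'pd': np.asarray(y_score),\n    })\n    # Equal-frequency buckets; duplicates='drop' merges bin edges that coincide.\n    scored['bucket'] = pd.qcut(scored['pd'], q=n_buckets, labels=False, duplicates='drop')\n    return scored.groupby('bucket').agg(\n        n=('y', 'size'),\n        mean_predicted_pd=('pd', 'mean'),\n        observed_default_rate=('y', 'mean'),\n    )\n\n\ncalibration_table(y_test, y_pred_proba)\n"
      ],
      "execution_count": null,
      "outputs": []
    },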
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 6. Deployment hooks and next steps\n\nIn a live project, this notebook would feed into:\n\n- A packaged model object (for example saved with `joblib`) checked into version control.\n- A scoring function that takes an application record and returns a PD plus key features.\n- Monitoring jobs that track score distribution, feature drift, and realised vs predicted default rates.\n\nTreat this notebook as an analysis canvas. Production code should live in plain Python modules or services so it can run reliably in your batch or real\u2011time environment."
      ]
    }
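    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The first two hooks could look roughly like the cell below. It is a sketch under assumptions, not a production design: `save_artifacts`, `score_application`, and the file name `pd_baseline.joblib` are hypothetical names introduced here, and the scoring function returns only the PD, leaving the key\u2011feature output aside. Bundling the scaler and the feature order with the model keeps scoring consistent with training."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import joblib\nimport pandas as pd\n\n\ndef save_artifacts(path, scaler, model, feature_cols):\n    \"\"\"Persist the scaler, model, and feature order as one bundle.\"\"\"\n    bundle = {'scaler': scaler, 'model': model, 'feature_cols': list(feature_cols)}\n    joblib.dump(bundle, path)\n\n\ndef score_application(record, artifacts):\n    \"\"\"Return the predicted PD for one application given as a dict.\"\"\"\n    # Select and order the columns exactly as they were at training time.\n    row = pd.DataFrame([record])[artifacts['feature_cols']]\n    scaled = artifacts['scaler'].transform(row)\n    return float(artifacts['model'].predict_proba(scaled)[0, 1])\n\n\n# Hypothetical artefact path; point it at your own model store.\nsave_artifacts('pd_baseline.joblib', scaler, model, feature_cols)\nartifacts = joblib.load('pd_baseline.joblib')\nscore_application(X_test.iloc[0].to_dict(), artifacts)\n"
      ],
      "execution_count": null,
      "outputs": []
    }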
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}