Chapter 8: Baseline Experiments¶
Purpose: Train baseline models to understand data predictability and establish performance benchmarks.
What you'll learn:
- How to prepare data for ML with proper train/test splitting
- How to handle class imbalance with class weights
- How to evaluate models with appropriate metrics (not just accuracy!)
- How to interpret feature importance
Outputs:
- Baseline model performance (AUC, Precision, Recall, F1)
- Feature importance rankings
- ROC and Precision-Recall curves
- Performance benchmarks for comparison
Evaluation Metrics for Imbalanced Data¶
| Metric | What It Measures | When to Use |
|---|---|---|
| AUC-ROC | Ranking quality across thresholds | General model comparison |
| Precision | "Of predicted churned, how many are correct?" | When false positives are costly |
| Recall | "Of actual churned, how many did we catch?" | When missing churners is costly |
| F1-Score | Balance of precision and recall | When both matter equally |
| PR-AUC | Area under the Precision-Recall curve | Better for imbalanced data |
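The metrics in the table can all be computed with scikit-learn. A minimal sketch on toy arrays (the labels and scores below are illustrative, not this notebook's data):

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, precision_score, recall_score, f1_score, average_precision_score,
)

# Toy imbalanced example: 3 positives out of 10
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.9, 0.6, 0.45, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # hard predictions at the default 0.5 threshold

print(f"AUC-ROC:   {roc_auc_score(y_true, y_score):.3f}")   # threshold-free ranking quality
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # of predicted positives, correct fraction
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # of actual positives, caught fraction
print(f"F1:        {f1_score(y_true, y_pred):.3f}")         # harmonic mean of precision and recall
print(f"PR-AUC:    {average_precision_score(y_true, y_score):.3f}")
```

Note that AUC-ROC and PR-AUC consume the continuous scores, while precision, recall, and F1 depend on the chosen threshold.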
8.1 Setup¶
Show/Hide Code
from customer_retention.analysis.notebook_progress import track_and_export_previous
track_and_export_previous("08_baseline_experiments.ipynb")
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
average_precision_score,
classification_report,
f1_score,
precision_score,
recall_score,
roc_auc_score,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import NON_FEATURE_COLUMN_TYPES, ColumnType
from customer_retention.core.config.experiments import (
FINDINGS_DIR,
)
Show/Hide Code
from pathlib import Path
from customer_retention.analysis.auto_explorer import load_notebook_findings, resolve_target_column
FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings(
"08_baseline_experiments.ipynb", prefer_aggregated=True
)
print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
target = resolve_target_column(_namespace, findings)
# Load data - prefer aggregated entity-level data for modeling
from customer_retention.analysis.auto_explorer.active_dataset_store import load_active_dataset
from customer_retention.core.config.column_config import DatasetGranularity
from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS
if "_aggregated" in FINDINGS_PATH:
source_path = Path(findings.source_path)
if not source_path.is_absolute():
source_path = Path("..") / source_path
if source_path.is_dir():
from customer_retention.integrations.adapters.factory import get_delta
df = get_delta(force_local=True).read(str(source_path))
elif source_path.is_file():
df = pd.read_parquet(source_path)
else:
df = load_active_dataset(_namespace, dataset_name)
data_source = f"aggregated:{source_path.name}"
elif dataset_name is None and _namespace:
from customer_retention.integrations.adapters.factory import get_delta
df = get_delta(force_local=True).read(str(_namespace.silver_merged_path))
data_source = "silver_merged"
else:
df = load_active_dataset(_namespace, dataset_name)
data_source = dataset_name
charts = ChartBuilder()
print(f"\nLoaded {len(df):,} rows from: {data_source}")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_aggregated_findings.yaml
Loaded 4,998 rows from: aggregated:customer_emails_aggregated
8.2 Prepare Data for Modeling¶
📖 Feature Source:
Features used in this notebook come from the ExplorationFindings generated in earlier notebooks:
- Column types are auto-detected in notebook 01 (Data Discovery)
- Target column is identified from the findings
- Identifier columns are excluded to prevent data leakage
- Text columns are excluded (require specialized NLP processing)
📖 Best Practices:
- Stratified Split: Maintains class ratios in train/test sets
- Scale After Split: Fit scaler on train only (prevents data leakage)
- Handle Missing: Impute or drop before scaling
📖 Transformations Applied:
- Categorical variables → Label Encoded
- Missing values → Median (numeric) or Mode (categorical)
- Features → StandardScaler (fit on train only)
Show/Hide Code
if not target:
raise ValueError("No target column set. Please define one in exploration notebooks.")
y = df[target]
feature_cols = [
name for name, col in findings.columns.items()
if col.inferred_type not in NON_FEATURE_COLUMN_TYPES
and name not in TEMPORAL_METADATA_COLS
]
print("=" * 70)
print("FEATURE SELECTION FROM FINDINGS")
print("=" * 70)
print(f"\n Target Column: {target}")
print(f" Features Selected: {len(feature_cols)}")
type_counts = {}
for name in feature_cols:
col_type = findings.columns[name].inferred_type.value
type_counts[col_type] = type_counts.get(col_type, 0) + 1
print("\n Features by Type:")
for col_type, count in sorted(type_counts.items()):
print(f" {col_type}: {count}")
excluded = [name for name, col in findings.columns.items()
if col.inferred_type in NON_FEATURE_COLUMN_TYPES]
if excluded:
print(f"\n Excluded Columns ({len(excluded)}): {', '.join(excluded[:10])}{'...' if len(excluded) > 10 else ''}")
======================================================================
FEATURE SELECTION FROM FINDINGS
======================================================================

 Target Column: unsubscribed
 Features Selected: 210

 Features by Type:
   binary: 29
   categorical_nominal: 2
   numeric_continuous: 79
   numeric_discrete: 100

 Excluded Columns (7): customer_id, unsubscribed, opened_middle, clicked_middle, send_hour_middle, bounced_middle, time_to_open_hours_middle
Show/Hide Code
# Check feature availability and remove problematic features
from customer_retention.stages.features.feature_selector import FeatureSelector
print("=" * 70)
print("FEATURE AVAILABILITY CHECK")
print("=" * 70)
unavailable_features = []
if findings.has_availability_issues:
selector = FeatureSelector(target_column=target)
availability_recs = selector.get_availability_recommendations(findings.feature_availability)
unavailable_features = [rec.column for rec in availability_recs]
print(f"\n⚠️ {len(availability_recs)} feature(s) have availability issues:\n")
for rec in availability_recs:
print(f" • {rec.column} ({rec.issue_type}, {rec.coverage_pct:.0f}% coverage)")
print("\n📋 Alternative approaches (for investigation):")
print(" • segment_by_cohort: Train separate models per availability period")
print(" • add_indicator: Create availability flags and impute missing")
print(" • filter_window: Restrict data to feature's available period")
original_count = len(feature_cols)
feature_cols = [f for f in feature_cols if f not in unavailable_features]
print(f"\n🗑️ Removed {original_count - len(feature_cols)} unavailable features")
print(f"📊 Features remaining: {len(feature_cols)}")
else:
print("\n✅ All features have full temporal coverage.")
======================================================================
FEATURE AVAILABILITY CHECK
======================================================================

✅ All features have full temporal coverage.
Show/Hide Code
from customer_retention.analysis.auto_explorer.project_context import ProjectContext
from customer_retention.core.config.column_config import select_model_ready_columns
from customer_retention.stages.modeling import DataSplitter, SplitStrategy
_project_ctx = ProjectContext.load(_namespace.project_context_path) if _namespace and _namespace.project_context_path.exists() else None
_use_temporal = _project_ctx.intent.temporal_split if _project_ctx and _project_ctx.intent else False
X = select_model_ready_columns(df[feature_cols].copy())
feature_cols = X.columns.tolist()
_nan_target = y.isna().sum()
if _nan_target:
_valid = y.notna()
X, y, df = X.loc[_valid], y.loc[_valid], df.loc[_valid]
print(f"Dropped {_nan_target} rows with missing target")
for col in X.select_dtypes(include=['object']).columns:
le = LabelEncoder()
X[col] = le.fit_transform(X[col].astype(str))
for col in X.columns:
if X[col].isnull().any():
if X[col].dtype in ['int64', 'float64']:
X[col] = X[col].fillna(X[col].median())
else:
X[col] = X[col].fillna(X[col].mode()[0])
if _use_temporal and "as_of_date" in df.columns:
_purge_gap = _project_ctx.intent.purge_gap_days if _project_ctx and _project_ctx.intent else 104
_exclude = [c for c in ["as_of_date", "entity_id"] if c in X.columns]
_split_df = pd.concat([X, y], axis=1)
_split_df["as_of_date"] = df.loc[X.index, "as_of_date"].values
if "entity_id" in df.columns:
_split_df["entity_id"] = df.loc[X.index, "entity_id"].values
_split_df = _split_df.sort_values("as_of_date").reset_index(drop=True)
splitter = DataSplitter(
target_column=target,
strategy=SplitStrategy.TEMPORAL,
temporal_column="as_of_date",
test_size=0.2,
purge_gap_days=_purge_gap,
exclude_columns=_exclude,
)
_split_result = splitter.split(_split_df)
X_train, X_test = _split_result.X_train, _split_result.X_test
y_train, y_test = _split_result.y_train, _split_result.y_test
_train_entities = _split_df.loc[X_train.index, "entity_id"] if "entity_id" in _split_df.columns else None
_train_dates = _split_df.loc[X_train.index, "as_of_date"] if "as_of_date" in _split_df.columns else None
_split_method = "temporal (purge gap)"
print(f"Purge gap: {_purge_gap} days")
print(f"Cutoff date: {_split_result.split_info.get('cutoff_date', 'N/A')}")
print(f"Rows purged: {_split_result.split_info.get('purge_gap_rows', 0)}")
else:
_split_df = pd.concat([X, y], axis=1)
splitter = DataSplitter(
target_column=target,
strategy=SplitStrategy.RANDOM_STRATIFIED,
test_size=0.2,
random_state=42,
)
_split_result = splitter.split(_split_df)
X_train, X_test = _split_result.X_train, _split_result.X_test
y_train, y_test = _split_result.y_train, _split_result.y_test
_train_entities = None
_train_dates = None
_split_method = "stratified random"
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
print(f"\nSplit method: {_split_method}")
print(f"Train size: {len(X_train):,} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test size: {len(X_test):,} ({len(X_test)/len(X)*100:.0f}%)")
print("\nTrain class distribution:")
print(f" Retained (1): {(y_train == 1).sum():,} ({(y_train == 1).sum()/len(y_train)*100:.1f}%)")
print(f" Churned (0): {(y_train == 0).sum():,} ({(y_train == 0).sum()/len(y_train)*100:.1f}%)")
Split method: stratified random
Train size: 3,998 (80%)
Test size: 1,000 (20%)

Train class distribution:
  Retained (1): 1,781 (44.5%)
  Churned (0): 2,217 (55.5%)
8.3 Baseline Models (with Class Weights)¶
📖 Using Class Weights:
- `class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies
- This helps models pay more attention to the minority class (churned customers)
- Without weights, models may just predict "retained" for everyone
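What `class_weight='balanced'` does under the hood is simple to reproduce. A sketch with assumed toy labels (90/10 imbalance), checked against scikit-learn's own helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 90 + [1] * 10)  # 90% majority class, 10% minority
classes = np.unique(y_toy)

# sklearn's "balanced" formula: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
auto = compute_class_weight("balanced", classes=classes, y=y_toy)

for c, w_m, w_a in zip(classes, manual, auto):
    print(f"class {c}: manual={w_m:.3f}  sklearn={w_a:.3f}")
```

With 90/10 labels the minority class gets weight 5.0 versus roughly 0.56 for the majority, so each minority-class error costs about nine times more in the loss.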
Show/Hide Code
import warnings
import numpy as np
from customer_retention.stages.modeling import CrossValidator, CVStrategy
models = {
"Logistic Regression": LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced'),
"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, class_weight='balanced'),
"Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}
_is_binary = y.nunique() == 2
_avg = "binary" if _is_binary else "weighted"
_cv_scoring = "roc_auc" if _is_binary else "f1_weighted"
def _safe_auc(y_true, y_score, model_classes=None):
try:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
if model_classes is None:
return roc_auc_score(y_true, y_score)
return roc_auc_score(y_true, y_score, multi_class='ovr', labels=model_classes)
except ValueError:
return float('nan')
results = []
model_predictions = {}
for name, model in models.items():
print(f"Training {name}...")
_use_scaled = "Logistic" in name
_X_fit, _X_eval = (X_train_scaled, X_test_scaled) if _use_scaled else (X_train, X_test)
model.fit(_X_fit, y_train)
y_pred = model.predict(_X_eval)
y_pred_proba = model.predict_proba(_X_eval)
if _is_binary:
y_score = y_pred_proba[:, 1]
auc = _safe_auc(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)
else:
y_score = y_pred_proba
auc = _safe_auc(y_test, y_score, model.classes_)
pr_auc = float('nan')
f1 = f1_score(y_test, y_pred, average=_avg, zero_division=0)
precision = precision_score(y_test, y_pred, average=_avg, zero_division=0)
recall = recall_score(y_test, y_pred, average=_avg, zero_division=0)
if _use_temporal and _train_entities is not None:
_cv = CrossValidator(strategy=CVStrategy.TEMPORAL_ENTITY, n_splits=5, scoring=_cv_scoring, purge_gap_days=_purge_gap)
_cv_result = _cv.run(model, _X_fit, y_train, groups=_train_entities, temporal_values=_train_dates)
cv_scores = _cv_result.cv_scores
else:
cv_scores = cross_val_score(model, _X_fit, y_train, cv=5, scoring=_cv_scoring)
results.append({
"Model": name, "Test AUC": auc, "PR-AUC": pr_auc,
"F1-Score": f1, "Precision": precision, "Recall": recall,
"CV Score Mean": cv_scores.mean(), "CV Score Std": cv_scores.std()
})
model_predictions[name] = {
'y_pred': y_pred, 'y_pred_proba': y_score, 'model': model
}
results_df = pd.DataFrame(results).round(4)
_cv_method = "temporal entity (GroupKFold + purge)" if (_use_temporal and _train_entities is not None) else "stratified 5-fold"
_class_type = "binary" if _is_binary else f"multiclass ({y.nunique()} classes)"
_cv_metric = "AUC" if _is_binary else "F1-weighted"
print(f"\nCV method: {_cv_method}")
print(f"CV metric: {_cv_metric}")
print(f"Classification type: {_class_type}")
print("\n" + "=" * 80)
print("MODEL COMPARISON")
print("=" * 80)
display_table(results_df)
Training Logistic Regression...
Training Random Forest...
Training Gradient Boosting...
CV method: stratified 5-fold
CV metric: AUC
Classification type: binary

================================================================================
MODEL COMPARISON
================================================================================
| Model | Test AUC | PR-AUC | F1-Score | Precision | Recall | CV Score Mean | CV Score Std |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9646 | 0.9615 | 0.9179 | 0.9475 | 0.8901 | 0.9613 | 0.0101 |
| Random Forest | 0.9656 | 0.9659 | 0.9336 | 0.9709 | 0.8991 | 0.9698 | 0.0087 |
| Gradient Boosting | 0.9708 | 0.9726 | 0.9333 | 0.9756 | 0.8946 | 0.9723 | 0.0091 |
8.4 Feature Importance (Random Forest)¶
Show/Hide Code
rf_model = models["Random Forest"]
importance_df = pd.DataFrame({
"Feature": feature_cols,
"Importance": rf_model.feature_importances_
}).sort_values("Importance", ascending=False)
top_n = 15
top_features = importance_df.head(top_n)
fig = charts.bar_chart(
top_features["Feature"].tolist(),
top_features["Importance"].tolist(),
title=f"Top {top_n} Feature Importances"
)
display_figure(fig)
8.5 Classification Report (Best Model)¶
Show/Hide Code
best_model = models["Gradient Boosting"]
y_pred = best_model.predict(X_test)
print("Classification Report (Gradient Boosting):")
print(classification_report(y_test, y_pred))
Classification Report (Gradient Boosting):
precision recall f1-score support
0 0.92 0.98 0.95 554
1 0.98 0.89 0.93 446
accuracy 0.94 1000
macro avg 0.95 0.94 0.94 1000
weighted avg 0.95 0.94 0.94 1000
8.6 Model Comparison Grid¶
This visualization shows all models side-by-side with:
- Row 1: Confusion matrices (counts and percentages)
- Row 2: ROC curves with AUC scores
- Row 3: Precision-Recall curves with PR-AUC scores
📖 How to Read:
- Confusion Matrix: Diagonal = correct predictions. Off-diagonal = errors.
- ROC Curve: Higher curve = better. AUC > 0.8 is good, > 0.9 is excellent.
- PR Curve: Higher curve = better at finding positives without false alarms.
Show/Hide Code
grid_results = {
name: {"y_pred": data["y_pred"], "y_pred_proba": data["y_pred_proba"]}
for name, data in model_predictions.items()
}
if _is_binary:
fig = charts.model_comparison_grid(
grid_results, y_test,
class_labels=["Churned (0)", "Retained (1)"],
title="Model Comparison: Confusion Matrix | ROC Curve | Precision-Recall"
)
display_figure(fig)
else:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.metrics import confusion_matrix
model_names = list(grid_results.keys())
n_models = len(model_names)
fig = make_subplots(rows=1, cols=n_models, subplot_titles=[f"{n[:20]}" for n in model_names])
for i, name in enumerate(model_names):
cm = confusion_matrix(y_test, grid_results[name]["y_pred"])
fig.add_trace(go.Heatmap(
z=cm, x=list(range(cm.shape[1])), y=list(range(cm.shape[0])),
text=cm.astype(str), texttemplate="%{text}", showscale=False,
colorscale="Blues",
), row=1, col=i + 1)
fig.update_layout(title="Model Comparison: Confusion Matrices (multiclass)", height=400, width=350 * n_models + 50)
display_figure(fig)
print("\n" + "=" * 80)
print("METRICS SUMMARY")
print("=" * 80)
_metrics_cols = ["Model", "Test AUC", "F1-Score", "Precision", "Recall"]
if _is_binary:
_metrics_cols.insert(2, "PR-AUC")
display_table(results_df[_metrics_cols])
================================================================================
METRICS SUMMARY
================================================================================
| Model | Test AUC | PR-AUC | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Logistic Regression | 0.9646 | 0.9615 | 0.9179 | 0.9475 | 0.8901 |
| Random Forest | 0.9656 | 0.9659 | 0.9336 | 0.9709 | 0.8991 |
| Gradient Boosting | 0.9708 | 0.9726 | 0.9333 | 0.9756 | 0.8946 |
8.6.1 Individual Model Analysis¶
The grid above shows all models together. Below is detailed analysis per model.
Show/Hide Code
print("=" * 70)
print("CLASSIFICATION REPORTS BY MODEL")
print("=" * 70)
_target_names = ["Churned", "Retained"] if _is_binary else None
for name, data in model_predictions.items():
print(f"\n{'='*40}")
print(f" {name}")
print('='*40)
print(classification_report(y_test, data['y_pred'], target_names=_target_names, zero_division=0))
======================================================================
CLASSIFICATION REPORTS BY MODEL
======================================================================
========================================
Logistic Regression
========================================
precision recall f1-score support
Churned 0.92 0.96 0.94 554
Retained 0.95 0.89 0.92 446
accuracy 0.93 1000
macro avg 0.93 0.93 0.93 1000
weighted avg 0.93 0.93 0.93 1000
========================================
Random Forest
========================================
precision recall f1-score support
Churned 0.92 0.98 0.95 554
Retained 0.97 0.90 0.93 446
accuracy 0.94 1000
macro avg 0.95 0.94 0.94 1000
weighted avg 0.94 0.94 0.94 1000
========================================
Gradient Boosting
========================================
precision recall f1-score support
Churned 0.92 0.98 0.95 554
Retained 0.98 0.89 0.93 446
accuracy 0.94 1000
macro avg 0.95 0.94 0.94 1000
weighted avg 0.95 0.94 0.94 1000
8.6.2 Precision-Recall Curves¶
📖 Why PR Curves for Imbalanced Data:
- ROC curves can look optimistic for imbalanced data
- PR curves focus on the minority class (churners)
- Better at showing how well we detect actual churners
📖 How to Read:
- Baseline (dashed line) = proportion of positives in the data
- Higher curve = better at finding churners without too many false alarms
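The baseline and curve are straightforward to compute. A hedged sketch with synthetic stand-ins for this notebook's `y_test` and model scores (the variables below are illustrative, not the fitted models above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(42)
y_test_toy = rng.integers(0, 2, size=500)
# Scores loosely correlated with the label, standing in for model probabilities
y_score_toy = y_test_toy * 0.4 + rng.random(500) * 0.6

precision, recall, thresholds = precision_recall_curve(y_test_toy, y_score_toy)

# The dashed "no-skill" baseline is simply the prevalence of positives:
baseline = y_test_toy.mean()

print(f"No-skill baseline (positive rate): {baseline:.3f}")
print(f"PR-AUC (average precision):        {average_precision_score(y_test_toy, y_score_toy):.3f}")
```

A model adds value only where its curve sits above that prevalence line; for a balanced dataset the baseline is near 0.5, while for rare churners it can be far lower, which is why PR-AUC is the more honest summary.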
8.7 Key Takeaways¶
📖 Interpreting Results:
Show/Hide Code
_primary_metric = "Test AUC" if results_df["Test AUC"].notna().any() else "F1-Score"
best_model = results_df.loc[results_df[_primary_metric].idxmax()]
print("=" * 70)
print("KEY TAKEAWAYS")
print("=" * 70)
print(f"\n BEST MODEL (by {_primary_metric}): {best_model['Model']}")
if results_df["Test AUC"].notna().any():
print(f" Test AUC: {best_model['Test AUC']:.4f}")
if _is_binary:
print(f" PR-AUC: {best_model['PR-AUC']:.4f}")
print(f" F1-Score: {best_model['F1-Score']:.4f}")
print("\n TOP 3 IMPORTANT FEATURES:")
for i, feat in enumerate(importance_df.head(3)['Feature'].tolist(), 1):
imp = importance_df[importance_df['Feature'] == feat]['Importance'].values[0]
print(f" {i}. {feat} ({imp:.3f})")
_best_score = best_model[_primary_metric]
print("\n MODEL PERFORMANCE ASSESSMENT:")
if _best_score > 0.90:
print(" Excellent predictive signal - likely production-ready with tuning")
elif _best_score > 0.80:
print(" Strong predictive signal - good baseline for improvement")
elif _best_score > 0.70:
print(" Moderate signal - consider more feature engineering")
else:
print(" Weak signal - may need more data or different features")
print("\n NEXT STEPS:")
print(" 1. Feature engineering with derived features (notebook 05)")
print(" 2. Hyperparameter tuning (GridSearchCV)")
print(" 3. Threshold optimization for business metrics")
print(" 4. A/B testing in production")
======================================================================
KEY TAKEAWAYS
======================================================================

 BEST MODEL (by Test AUC): Gradient Boosting
 Test AUC: 0.9708
 PR-AUC: 0.9726
 F1-Score: 0.9333

 TOP 3 IMPORTANT FEATURES:
   1. days_since_last_event_x (0.073)
   2. days_since_first_event_y (0.065)
   3. send_hour_count_365d (0.055)

 MODEL PERFORMANCE ASSESSMENT:
   Excellent predictive signal - likely production-ready with tuning

 NEXT STEPS:
   1. Feature engineering with derived features (notebook 05)
   2. Hyperparameter tuning (GridSearchCV)
   3. Threshold optimization for business metrics
   4. A/B testing in production
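Next step 3, threshold optimization, deserves a quick illustration: instead of the default 0.5 cutoff, sweep candidate thresholds and pick the one that maximizes the metric you care about. A sketch on synthetic data (not this notebook's fitted models; in practice you would use a validation split's probabilities):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Synthetic probabilities loosely correlated with the label
y_proba = y_true * 0.3 + rng.random(1000) * 0.7

# Sweep thresholds and score each resulting hard classification
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

print(f"Best threshold by F1: {best_t:.2f} (F1={max(f1s):.3f})")
```

The same sweep works for any business metric (e.g. expected retention-campaign profit per contacted customer) in place of F1, which is what notebook 09 builds toward.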
Summary: What We Learned¶
In this notebook, we trained baseline models and established performance benchmarks:
- Data Preparation - Proper train/test split with stratification and scaling
- Class Imbalance Handling - Used balanced class weights
- Model Comparison - Compared Logistic Regression, Random Forest, and Gradient Boosting
- Multiple Metrics - Evaluated with AUC, PR-AUC, F1, Precision, Recall
- Feature Importance - Identified the most predictive features
Key Results for This Dataset¶
| Metric | Value | Interpretation |
|---|---|---|
| Best AUC | ~0.97 | Excellent discrimination |
| Top Feature | days_since_last_event_x | Engagement recency is critical |
| Imbalance | ~1.2:1 | Mild, handled with class weights |
Next Steps¶
Continue to 09_business_alignment.ipynb to:
- Align model performance with business objectives
- Define intervention strategies by risk level
- Calculate expected ROI from the model
- Set deployment requirements
Show/Hide Code
_best_score_val = results_df[_primary_metric].max()
print("Key Takeaways:")
print("="*50)
print(f"Best baseline {_primary_metric}: {_best_score_val:.4f}")
print(f"Top 3 important features: {', '.join(importance_df.head(3)['Feature'].tolist())}")
if _best_score_val > 0.85:
print("\nStrong predictive signal detected. Data is well-suited for modeling.")
elif _best_score_val > 0.70:
print("\nModerate predictive signal. Consider feature engineering for improvement.")
else:
print("\nWeak predictive signal. May need more features or data.")
Key Takeaways:
==================================================
Best baseline Test AUC: 0.9708
Top 3 important features: days_since_last_event_x, days_since_first_event_y, send_hour_count_365d

Strong predictive signal detected. Data is well-suited for modeling.
Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.