Chapter 1c: Temporal Pattern Analysis (Event Bronze Track)

Purpose: Discover temporal patterns in event-level data that inform feature engineering and model design.

When to use this notebook:

  • After completing 01a and 01b (temporal deep dive and quality checks)
  • Your dataset is EVENT_LEVEL granularity
  • You want to understand time-based patterns before aggregation

What you'll learn:

  • How to detect long-term trends in your data
  • How to identify seasonality patterns (weekly, monthly)
  • How cohort analysis reveals customer lifecycle patterns
  • How recency relates to target outcomes

Pattern Categories:

Pattern          Description                           Feature Engineering Impact
Trend            Long-term direction (up/down)         Detrend features, add trend slope
Seasonality      Periodic patterns (weekly, monthly)   Add cyclical encodings, seasonal indicators
Cohort Effects   Behavior varies by join date          Add cohort features, stratify models
Recency Effects  Recent activity predicts outcomes     Prioritize recent time windows

1c.1 Load Findings and Data

In [1]:
from customer_retention.analysis.notebook_progress import track_and_export_previous

track_and_export_previous("01c_temporal_patterns.ipynb")

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from customer_retention.analysis.auto_explorer import ExplorationFindings, load_notebook_findings
from customer_retention.analysis.visualization import ChartBuilder, display_figure
from customer_retention.core.config.experiments import FINDINGS_DIR  # noqa: F401
from customer_retention.stages.profiling import (
    TemporalFeatureAnalyzer,
    TemporalPatternAnalyzer,
    TrendDirection,
)
In [2]:
DATASET_NAME = None  # Set to override auto-resolved dataset, e.g. "3set_support_tickets"

FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings("01c_temporal_patterns.ipynb")
if DATASET_NAME is not None:
    dataset_name = DATASET_NAME

print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"Loaded findings for {findings.column_count} columns")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_findings.yaml
Loaded findings for 13 columns
In [3]:
# Get time series configuration
ts_meta = findings.time_series_metadata
ENTITY_COLUMN = ts_meta.entity_column if ts_meta else None
TIME_COLUMN = ts_meta.time_column if ts_meta else None

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

# Note: Target column configuration is handled in section 1c.2 below
# This allows for event-level to entity-level aggregation when needed
Entity column: customer_id
Time column: sent_date
In [4]:
from customer_retention.analysis.auto_explorer.project_context import ProjectContext

LIGHT_RUN = False
if _namespace and _namespace.project_context_path.exists():
    _project_ctx = ProjectContext.load(_namespace.project_context_path)
    LIGHT_RUN = _project_ctx.light_run

if LIGHT_RUN:
    print("LIGHT_RUN mode: heavy analysis cells will be skipped")
In [5]:
from customer_retention.analysis.auto_explorer.active_dataset_store import load_active_dataset
from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS

df = load_active_dataset(_namespace, dataset_name)
charts = ChartBuilder()

df[TIME_COLUMN] = pd.to_datetime(df[TIME_COLUMN])

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {dataset_name}")
Loaded 83,198 rows x 13 columns
Data source: customer_emails

1c.2 Target Column Configuration

📖 Event-Level vs Entity-Level Targets:

In time series data, targets can be defined at different granularities:

Target Level   Example                          Usage
Event-level    "Did this email get clicked?"    Exists in raw data
Entity-level   "Did this customer churn?"       Needs to be joined from an entity table

If your target is entity-level, you may need to join it or configure it manually.
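The "max" aggregation used below can be sketched in plain pandas. This is a standalone toy example with made-up data, not the project's `TargetLevelAnalyzer` API:

```python
import pandas as pd

# Hypothetical event-level log: one row per email; the flag is set only
# on the event where the customer unsubscribed.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "unsubscribed": [0, 1, 0, 0, 0, 0],
})

# "max" lifts the flag to entity level: a customer counts as churned
# if ANY of their events carries the positive flag.
entity_target = (
    events.groupby("customer_id")["unsubscribed"]
    .max()
    .rename("unsubscribed_entity")
    .reset_index()
)
print(entity_target)  # customer 1 -> 1, customers 2 and 3 -> 0
```

Swapping `.max()` for `.mean()`, `.sum()`, `.last()`, or `.first()` gives the other aggregation options the configuration cell exposes.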

In [6]:
# === TARGET CONFIGURATION ===
# Override target column if needed (None = auto-detect, "DEFER_TO_MULTI_DATASET" = skip)
TARGET_COLUMN_OVERRIDE = None
TARGET_AGGREGATION = "max"  # Options: "max", "mean", "sum", "last", "first"

# Detect and analyze target
from customer_retention.stages.profiling import AggregationMethod, TargetColumnDetector, TargetLevelAnalyzer

detector = TargetColumnDetector()
target_col, method = detector.detect(findings, df, override=TARGET_COLUMN_OVERRIDE)
detector.print_detection(target_col, method)

TARGET_COLUMN = target_col
if TARGET_COLUMN and TARGET_COLUMN in df.columns and ENTITY_COLUMN:
    analyzer = TargetLevelAnalyzer()
    agg_method = AggregationMethod(TARGET_AGGREGATION)
    df, result = analyzer.aggregate_to_entity(df, TARGET_COLUMN, ENTITY_COLUMN, TIME_COLUMN, agg_method)
    analyzer.print_analysis(result)

    # Update TARGET_COLUMN to entity-level version if aggregated
    if result.entity_target_column:
        ORIGINAL_TARGET = TARGET_COLUMN
        TARGET_COLUMN = result.entity_target_column

print("\n" + "─"*70)
print("Final configuration:")
print(f"   ENTITY_COLUMN: {ENTITY_COLUMN}")
print(f"   TIME_COLUMN: {TIME_COLUMN}")
print(f"   TARGET_COLUMN: {TARGET_COLUMN}")
print("─"*70)
πŸ” Auto-detected target: unsubscribed
======================================================================
TARGET LEVEL ANALYSIS
======================================================================

Column: unsubscribed
Level: EVENT_LEVEL

⚠️  EVENT-LEVEL TARGET DETECTED
   44.1% of entities have varying target values

   Event-level distribution:
      unsubscribed=0: 80,961 events (97.3%)
      unsubscribed=1: 2,237 events (2.7%)

   Suggested aggregation: max

   Aggregation applied: max
   Entity target column: unsubscribed_entity

   Entity-level distribution (after aggregation):
      Retained (unsubscribed_entity=0): 2,761 entities (55.2%)
      Churned (unsubscribed_entity=1): 2,237 entities (44.8%)


──────────────────────────────────────────────────────────────────────
Final configuration:
   ENTITY_COLUMN: customer_id
   TIME_COLUMN: sent_date
   TARGET_COLUMN: unsubscribed_entity
──────────────────────────────────────────────────────────────────────

1c.3 Aggregation Window Configuration

⚙️ Central Configuration for All Pattern Analysis

Windows are loaded from 01a findings and used consistently throughout this notebook for:

  • Velocity analysis (shortest window)
  • Momentum analysis (window pairs)
  • Rolling statistics
  • Feature engineering recommendations

Override below if needed for your specific analysis.

In [7]:
# === AGGREGATION WINDOW CONFIGURATION ===
# These windows were recommended by 01a based on your data's temporal coverage.
# They are used consistently for velocity, momentum, rolling stats, and feature engineering.

# Override: Set to a list like ["7d", "30d", "90d"] to use custom windows
# Set to None to use 01a recommendations
WINDOW_OVERRIDE = None

from customer_retention.stages.profiling import PatternAnalysisConfig

pattern_config = PatternAnalysisConfig.from_findings(
    findings,
    target_column=TARGET_COLUMN,
    window_override=WINDOW_OVERRIDE,
)

# Display configuration
print("="*70)
print("AGGREGATION WINDOW CONFIGURATION")
print("="*70)
print(f"\nSource: {'Manual override' if WINDOW_OVERRIDE else '01a findings (recommended)'}")
print(f"\nWindows: {pattern_config.aggregation_windows}")
print("\nDerived settings used throughout this notebook:")
print(f"   • Velocity/Rolling window: {pattern_config.velocity_window_days} days")
print(f"   • Momentum pairs: {pattern_config.get_momentum_pairs()}")
print("\n💡 To override, set WINDOW_OVERRIDE = ['7d', '30d', '90d'] above and re-run")
======================================================================
AGGREGATION WINDOW CONFIGURATION
======================================================================

Source: 01a findings (recommended)

Windows: ['180d', '365d', 'all_time']

Derived settings used throughout this notebook:
   • Velocity/Rolling window: 180 days
   • Momentum pairs: [(180, 365)]

💡 To override, set WINDOW_OVERRIDE = ['7d', '30d', '90d'] above and re-run

1c.4 Configure Value Column for Analysis

Temporal patterns are analyzed on aggregated metrics. Choose the primary metric to analyze.

In [8]:
# Find numeric columns for pattern analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c not in [ENTITY_COLUMN] and c not in TEMPORAL_METADATA_COLS]

# Separate target columns from feature columns
target_cols = [c for c in numeric_cols if c.lower() in ['target', 'target_entity', 'label']
               or (TARGET_COLUMN and c.lower() == TARGET_COLUMN.lower())]
feature_cols = [c for c in numeric_cols if c not in target_cols]

print("Numeric columns for pattern analysis:")
print("\n  FEATURE COLUMNS (can derive features from):")
for col in feature_cols:
    print(f"    - {col}")

if target_cols:
    print("\n  TARGET COLUMNS (analysis only - never derive features):")
    for col in target_cols:
        print(f"    - {col} [TARGET]")

# Default: use event count (most common for pattern detection)
# Change this to analyze patterns in a specific metric
VALUE_COLUMN = "_event_count"  # Special: will aggregate event counts
Numeric columns for pattern analysis:

  FEATURE COLUMNS (can derive features from):
    - opened
    - clicked
    - send_hour
    - unsubscribed
    - bounced
    - time_to_open_hours

  TARGET COLUMNS (analysis only - never derive features):
    - unsubscribed_entity [TARGET]
In [9]:
# Prepare data for pattern analysis
# Aggregate to daily level for trend/seasonality detection

if VALUE_COLUMN == "_event_count":
    # Aggregate event counts by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date).size().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print("Analyzing: Daily event counts")
else:
    # Aggregate specific column by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date)[VALUE_COLUMN].sum().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print(f"Analyzing: Daily sum of {VALUE_COLUMN}")

print(f"\nDaily data points: {len(daily_data)}")
print(f"Date range: {daily_data[TIME_COLUMN].min()} to {daily_data[TIME_COLUMN].max()}")
Analyzing: Daily event counts

Daily data points: 3286
Date range: 2015-01-01 00:00:00 to 2023-12-30 00:00:00

1c.5 Trend Detection

📖 Understanding Trends:

  • Increasing: Metric growing over time (e.g., expanding customer base)
  • Decreasing: Metric shrinking (e.g., declining engagement)
  • Stationary: No significant trend (stable business)

Impact on ML:

  • Strong trends can cause data leakage if not handled
  • Consider detrending or adding trend as explicit feature
In [10]:
# Trend Analysis - computation and visualization
from customer_retention.stages.profiling import generate_trend_recommendations

analyzer = TemporalPatternAnalyzer(time_column=TIME_COLUMN)
trend_result = analyzer.detect_trend(daily_data, value_column=analysis_col)
trend_recs = generate_trend_recommendations(trend_result, mean_value=daily_data[analysis_col].mean())

# Visualization
direction_emoji = {"increasing": "📈", "decreasing": "📉", "stable": "➡️", "unknown": "❓"}
print(f"Trend: {direction_emoji.get(trend_result.direction.value, '')} {trend_result.direction.value.upper()} (R²={trend_result.strength:.2f})")

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN], y=daily_data[analysis_col],
    mode="lines", name="Daily Values", line=dict(color="steelblue", width=1), opacity=0.7
))

if trend_result.slope is not None:
    x_numeric = (daily_data[TIME_COLUMN] - daily_data[TIME_COLUMN].min()).dt.days
    y_trend = trend_result.slope * x_numeric + (daily_data[analysis_col].mean() - trend_result.slope * x_numeric.mean())
    trend_color = {TrendDirection.INCREASING: "green", TrendDirection.DECREASING: "red"}.get(trend_result.direction, "gray")
    fig.add_trace(go.Scatter(
        x=daily_data[TIME_COLUMN], y=y_trend, mode="lines",
        name=f"Trend ({trend_result.direction.value})", line=dict(color=trend_color, width=3, dash="dash")
    ))

rolling_avg = daily_data[analysis_col].rolling(window=pattern_config.rolling_window, center=True).mean()
fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN], y=rolling_avg, mode="lines",
    name=f"{pattern_config.rolling_window}-day Rolling Avg", line=dict(color="orange", width=2)
))

fig.update_layout(
    title=f"Trend Analysis: {trend_result.direction.value.title()} (R²={trend_result.strength:.2f})",
    xaxis_title="Date", yaxis_title="Value", template="plotly_white", height=400,
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
display_figure(fig)
Trend: ➡️ STABLE (R²=0.52)
[Figure: daily values with fitted trend line and rolling average]
In [11]:
# Trend details and recommendations
print("📈 TREND ANALYSIS DETAILS")
print("="*50)
print(f"\n   Direction: {trend_result.direction.value.upper()}")
print(f"   Strength (R²): {trend_result.strength:.3f}")
print(f"   Confidence: {trend_result.confidence.upper()}")

if trend_result.slope is not None:
    mean_val = daily_data[analysis_col].mean()
    daily_pct = (trend_result.slope / mean_val * 100) if mean_val else 0
    print(f"   Slope: {trend_result.slope:.4f} per day ({daily_pct:+.3f}%/day)")
if trend_result.p_value is not None:
    print(f"   P-value: {trend_result.p_value:.4f}")

print("\n📋 RECOMMENDATIONS:")
for rec in trend_recs:
    priority_icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(rec.priority, "⚪")
    print(f"   {priority_icon} [{rec.priority.upper()}] {rec.action}")
    print(f"      {rec.reason}")
    if rec.features:
        print(f"      Features: {', '.join(rec.features)}")

TREND_RECOMMENDATIONS = [{"action": r.action, "priority": r.priority, "reason": r.reason,
                          "features": r.features} for r in trend_recs]
📈 TREND ANALYSIS DETAILS
==================================================

   Direction: STABLE
   Strength (R²): 0.519
   Confidence: HIGH
   Slope: -0.0057 per day (-0.023%/day)
   P-value: 0.0000

📋 RECOMMENDATIONS:
   🟢 [LOW] skip_trend_features
      No significant trend (R²=0.52) - trend features unlikely to help

1c.6 Seasonality Detection

📖 Understanding Seasonality:

  • Weekly (period=7): Higher activity on certain days
  • Monthly (period~30): End-of-month patterns, billing cycles
  • Quarterly (period~90): Business cycles, seasonal products

📊 Interpreting Strength (Autocorrelation):

Strength measures how well values at a given lag correlate with current values.

Strength   Interpretation              vs. Random-Data Baseline
0.0        No pattern (random noise)   ≈ 0.0
0.1–0.3    Weak pattern                Barely above random
0.3–0.5    Moderate pattern            3–5× lift over random
0.5–0.7    Strong pattern              Clear repeating cycle
> 0.7      Very strong pattern         Near-deterministic cycle

Lift interpretation: a strength of 0.4 means values at that lag correlate at r=0.4 with current values (the lagged value alone explains roughly r² ≈ 16% of variance), vs ≈ 0 for random data.
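The "strength" values above are plain lag autocorrelations, which pandas exposes directly. A standalone sketch on synthetic data with a built-in weekly cycle:

```python
import numpy as np
import pandas as pd

# Synthetic daily counts with a weekly cycle: weekdays busier than weekends
rng = np.random.default_rng(42)
idx = pd.date_range("2023-01-01", periods=180, freq="D")
weekday_signal = np.where(idx.dayofweek < 5, 50, 20)  # strong 7-day pattern
series = pd.Series(weekday_signal + rng.normal(0, 5, size=len(idx)), index=idx)

# Pearson autocorrelation at a given lag - the "strength" in the table
ac7 = series.autocorr(lag=7)   # aligned with the weekly cycle -> high
ac3 = series.autocorr(lag=3)   # off-cycle lag -> low (here even negative)
print(f"lag 7: {ac7:.2f}, lag 3: {ac3:.2f}")
```

Scanning a range of lags and keeping those above a threshold (the notebook uses 0.3) is essentially what period detection does.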

🎯 Window-Aligned Pattern Detection:

We check two types of patterns:

  • Natural periods (7, 14, 21, 30 days): Calendar-driven cycles
  • Aggregation windows (from findings): Patterns at your selected feature windows (e.g., 180d, 365d)

If a pattern aligns with your aggregation window, features computed over that window may capture the full cycle; keep this in mind when interpreting aggregated features.

Impact on ML:

  • Add day-of-week, month features for detected periods
  • Consider seasonal decomposition for strong patterns
  • Use cyclical encodings (sin/cos) for neural networks
In [12]:
# Seasonality Analysis - Temporal Pattern Grid + Autocorrelation

# Prepare temporal columns
daily_data["day_of_week"] = daily_data[TIME_COLUMN].dt.day_name()
daily_data["month"] = daily_data[TIME_COLUMN].dt.month_name()
daily_data["quarter"] = "Q" + daily_data[TIME_COLUMN].dt.quarter.astype(str)
daily_data["year"] = daily_data[TIME_COLUMN].dt.year.astype(str)

dow_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
month_order = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
daily_data["day_of_week"] = pd.Categorical(daily_data["day_of_week"], categories=[d for d in dow_order if d in daily_data["day_of_week"].values], ordered=True)
daily_data["month"] = pd.Categorical(daily_data["month"], categories=[m for m in month_order if m in daily_data["month"].values], ordered=True)
daily_data["quarter"] = pd.Categorical(daily_data["quarter"], categories=[q for q in ["Q1","Q2","Q3","Q4"] if q in daily_data["quarter"].values], ordered=True)

# Compute statistics
dow_stats = daily_data.groupby("day_of_week", observed=True)[analysis_col].agg(["mean", "std"]).reset_index()
monthly_stats = daily_data.groupby("month", observed=True)[analysis_col].agg(["mean", "std"]).reset_index()
quarterly_stats = daily_data.groupby("quarter", observed=True)[analysis_col].agg(["mean", "std"]).reset_index()
yearly_stats = daily_data.groupby("year", observed=True)[analysis_col].agg(["mean", "std"]).reset_index()
overall_mean = daily_data[analysis_col].mean()

# Get aggregation window lags for seasonality detection
window_lags = []
if findings.time_series_metadata and findings.time_series_metadata.suggested_aggregations:
    for w in findings.time_series_metadata.suggested_aggregations:
        if w != "all_time":
            days = int(w.replace("d", "").replace("h", "")) if "d" in w else int(w.replace("h", "")) // 24
            if days > 30:
                window_lags.append(days)

# Run seasonality detection
seasonality_results = analyzer.detect_seasonality(daily_data, value_column=analysis_col, additional_lags=window_lags)

# Create 2x2 visualization grid
fig = make_subplots(rows=2, cols=2, subplot_titles=["Day of Week", "Monthly", "Quarterly", "Yearly"],
                    horizontal_spacing=0.1, vertical_spacing=0.12)

colors_dow = ["lightgray" if d in ["Saturday", "Sunday"] else "steelblue" for d in dow_stats["day_of_week"]]
fig.add_trace(go.Bar(x=dow_stats["day_of_week"], y=dow_stats["mean"], error_y=dict(type="data", array=dow_stats["std"]),
                     marker_color=colors_dow, showlegend=False), row=1, col=1)
fig.add_trace(go.Bar(x=monthly_stats["month"], y=monthly_stats["mean"], error_y=dict(type="data", array=monthly_stats["std"]),
                     marker_color="mediumpurple", showlegend=False), row=1, col=2)
fig.add_trace(go.Bar(x=quarterly_stats["quarter"], y=quarterly_stats["mean"], error_y=dict(type="data", array=quarterly_stats["std"]),
                     marker_color="teal", showlegend=False), row=2, col=1)
fig.add_trace(go.Bar(x=yearly_stats["year"], y=yearly_stats["mean"], error_y=dict(type="data", array=yearly_stats["std"]),
                     marker_color="coral", showlegend=False), row=2, col=2)

for row, col in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    fig.add_hline(y=overall_mean, line_dash="dot", line_color="red", opacity=0.5, row=row, col=col)

fig.update_layout(title={"text": "📅 Temporal Pattern Analysis<br><sup>Gray = weekends | Red line = overall mean</sup>",
                         "x": 0.5, "xanchor": "center"}, template="plotly_white", height=700)
fig.update_yaxes(title_text="Avg Value", row=1, col=1)
fig.update_yaxes(title_text="Avg Value", row=2, col=1)
display_figure(fig)

# Combined Pattern Analysis
print("📊 SEASONALITY & TEMPORAL PATTERN ANALYSIS")
print("="*60)

# Variation analysis
def calc_var(stats): return (stats["mean"].max() - stats["mean"].min()) / overall_mean * 100 if len(stats) > 1 else 0
variations = {"day_of_week": calc_var(dow_stats), "month": calc_var(monthly_stats),
              "quarter": calc_var(quarterly_stats), "year": calc_var(yearly_stats)}

print("\n📈 Pattern Variation (% from mean):")
print(f"   Day of Week: {variations['day_of_week']:.1f}%")
print(f"   Monthly:     {variations['month']:.1f}%")
print(f"   Quarterly:   {variations['quarter']:.1f}%")
print(f"   Yearly:      {variations['year']:.1f}%")

# Autocorrelation seasonality
print("\n🔍 Autocorrelation Seasonality (threshold > 0.3):")
if seasonality_results:
    for sr in seasonality_results:
        strength = "Strong" if sr.strength > 0.5 else "Moderate"
        aligned = " [aggregation window]" if sr.period in window_lags else ""
        print(f"   • {sr.period_name or f'{sr.period}d'}: {sr.strength:.3f} ({strength}){aligned}")
else:
    print("   No significant autocorrelation patterns detected")

# Generate recommendations
SEASONALITY_RECOMMENDATIONS = []
for pattern, var_pct in variations.items():
    priority = "high" if var_pct > 20 else "medium" if var_pct > 10 else "low"

    if pattern == "day_of_week" and var_pct > 10:
        SEASONALITY_RECOMMENDATIONS.append({"pattern": pattern, "variation": var_pct, "priority": priority,
            "features": ["dow_sin", "dow_cos", "is_weekend"], "reason": f"{var_pct:.1f}% variation - add cyclical encoding"})
    elif pattern == "month" and var_pct > 10:
        SEASONALITY_RECOMMENDATIONS.append({"pattern": pattern, "variation": var_pct, "priority": priority,
            "features": ["month_sin", "month_cos"], "reason": f"{var_pct:.1f}% variation - add cyclical encoding"})
    elif pattern == "quarter" and var_pct > 10:
        SEASONALITY_RECOMMENDATIONS.append({"pattern": pattern, "variation": var_pct, "priority": priority,
            "features": ["quarter_sin", "quarter_cos"], "reason": f"{var_pct:.1f}% variation - add cyclical encoding"})
    elif pattern == "year" and var_pct > 20:
        trend_explains = 'trend_result' in dir() and trend_result.strength > 0.3 and trend_result.has_direction
        if trend_explains:
            SEASONALITY_RECOMMENDATIONS.append({"pattern": pattern, "variation": var_pct, "priority": priority,
                "features": ["year_trend"], "reason": f"{var_pct:.1f}% variation aligned with trend"})
        else:
            SEASONALITY_RECOMMENDATIONS.append({"pattern": pattern, "variation": var_pct, "priority": priority,
                "features": ["year_categorical"], "reason": f"{var_pct:.1f}% variation but NO linear trend - use categorical",
                "warning": "Stepwise changes or non-linear cycles suspected"})

# For autocorrelation-detected patterns
for sr in seasonality_results:
    if sr.period in [7, 14, 21, 30] and sr.strength > 0.3:
        SEASONALITY_RECOMMENDATIONS.append({"pattern": f"{sr.period}d_cycle", "variation": sr.strength * 100,
            "priority": "medium", "features": [f"lag_{sr.period}d_ratio"],
            "reason": f"Autocorrelation {sr.strength:.2f} at {sr.period}d - add lag ratio feature"})

print("\n" + "─"*60)
print("📋 SEASONALITY RECOMMENDATIONS:")
print("─"*60)
if SEASONALITY_RECOMMENDATIONS:
    for rec in SEASONALITY_RECOMMENDATIONS:
        icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(rec["priority"], "⚪")
        print(f"\n{icon} [{rec['priority'].upper()}] {rec['pattern'].replace('_', ' ').title()}")
        print(f"   {rec['reason']}")
        if rec.get("warning"):
            print(f"   ⚠️  {rec['warning']}")
        if rec.get("features"):
            print(f"   → Features: {', '.join(rec['features'])}")
else:
    print("\n   No significant patterns - seasonal features unlikely to help")

TEMPORAL_PATTERN_RECOMMENDATIONS = SEASONALITY_RECOMMENDATIONS
[Figure: temporal pattern grid (day-of-week, monthly, quarterly, yearly averages)]
📊 SEASONALITY & TEMPORAL PATTERN ANALYSIS
============================================================

📈 Pattern Variation (% from mean):
   Day of Week: 1.6%
   Monthly:     12.2%
   Quarterly:   8.5%
   Yearly:      70.5%

🔍 Autocorrelation Seasonality (threshold > 0.3):
   • weekly: 0.539 (Strong)
   • tri-weekly: 0.531 (Strong)
   • bi-weekly: 0.530 (Strong)

────────────────────────────────────────────────────────────
📋 SEASONALITY RECOMMENDATIONS:
────────────────────────────────────────────────────────────

🟡 [MEDIUM] Month
   12.2% variation - add cyclical encoding
   → Features: month_sin, month_cos

🔴 [HIGH] Year
   70.5% variation but NO linear trend - use categorical
   ⚠️  Stepwise changes or non-linear cycles suspected
   → Features: year_categorical

🟡 [MEDIUM] 7D Cycle
   Autocorrelation 0.54 at 7d - add lag ratio feature
   → Features: lag_7d_ratio

🟡 [MEDIUM] 21D Cycle
   Autocorrelation 0.53 at 21d - add lag ratio feature
   → Features: lag_21d_ratio

🟡 [MEDIUM] 14D Cycle
   Autocorrelation 0.53 at 14d - add lag ratio feature
   → Features: lag_14d_ratio

1c.7 Cohort Analysis

📖 Understanding Cohorts:

  • Group entities by when they first appeared (signup cohort)
  • Compare behavior across cohorts
  • Identify if acquisition quality changed over time

Cohorts vs Segments: Cohorts are time-bound groups (when entities joined), while segments are attribute-based groups (what entities are). Cohorts are fixed at signup; segments can change over time.

Other time-based cohort ideas:

  • First purchase date (not just signup)
  • First feature usage (e.g., "first mobile app use")
  • Campaign/promotion exposure date
  • Onboarding completion date
  • Product version or pricing plan at signup time

These can be derived as custom features if your data contains the relevant timestamps.
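The basic cohort assignment (cohort = period of each entity's first event) can be sketched in plain pandas. Column names mirror this dataset, but the values are made up:

```python
import pandas as pd

# Hypothetical event log: one row per event, timestamped
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "sent_date": pd.to_datetime([
        "2015-01-05", "2016-03-01",                 # customer 1: first event Jan 2015
        "2015-02-10",                               # customer 2: first event Feb 2015
        "2016-07-01", "2016-07-15", "2017-01-02",   # customer 3: first event Jul 2016
    ]),
})

# Cohort = month of each entity's first event; fixed at "signup",
# unlike attribute-based segments, which can change over time
first_event = events.groupby("customer_id")["sent_date"].min()
cohort = first_event.dt.to_period("M").rename("cohort_month")
print(cohort)
```

Swapping `sent_date` for a first-purchase or onboarding-completion timestamp yields the alternative cohort definitions listed above.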

In [13]:
# Cohort Analysis - computation and visualization
from customer_retention.stages.profiling import analyze_cohort_distribution, generate_cohort_recommendations

COHORT_RECOMMENDATIONS = []
cohort_dist = None

if ENTITY_COLUMN and not LIGHT_RUN:
    first_events = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].min().reset_index()
    first_events.columns = [ENTITY_COLUMN, "first_event"]
    cohort_dist = analyze_cohort_distribution(first_events, "first_event")

    cohort_result = analyzer.analyze_cohorts(
        df, entity_column=ENTITY_COLUMN, cohort_column=TIME_COLUMN,
        target_column=TARGET_COLUMN, period="M"
    )

    print("👥 COHORT ANALYSIS")
    print("="*50)
    print(f"\nEntity Onboarding: {cohort_dist.dominant_pct:.0f}% in {cohort_dist.dominant_year}, {cohort_dist.num_years} years total")

    if len(cohort_result) > 0:
        cohort_sorted = cohort_result.sort_values("cohort")
        has_retention = "retention_rate" in cohort_sorted.columns

        fig = make_subplots(specs=[[{"secondary_y": True}]]) if has_retention else go.Figure()

        cohort_bar = go.Bar(
            x=cohort_sorted["cohort"].astype(str), y=cohort_sorted["entity_count"],
            name="Entities (sign-up cohort)", marker_color="steelblue", opacity=0.7
        )
        if has_retention:
            fig.add_trace(cohort_bar, secondary_y=False)
        else:
            fig.add_trace(cohort_bar)

        if has_retention:
            fig.add_trace(go.Scatter(
                x=cohort_sorted["cohort"].astype(str), y=cohort_sorted["retention_rate"] * 100,
                mode="lines+markers", name="Retention Rate %",
                line=dict(color="coral", width=3), marker=dict(size=8)
            ), secondary_y=True)
            fig.update_yaxes(title_text="Retention Rate %", secondary_y=True)

        fig.update_layout(
            title="Cohort Analysis: Entity Count by Sign-up Month (cohort = first event period)",
            xaxis_title="Cohort (First Event Month)", template="plotly_white", height=400
        )
        if has_retention:
            fig.update_yaxes(title_text="Entity Count", secondary_y=False)
        else:
            fig.update_yaxes(title_text="Entity Count")
        display_figure(fig)
elif LIGHT_RUN:
    print("Cohort analysis skipped (LIGHT_RUN)")
👥 COHORT ANALYSIS
==================================================

Entity Onboarding: 90% in 2015, 4 years total
[Figure: entity count by sign-up month cohort]
In [14]:
# Cohort details and recommendations
if ENTITY_COLUMN and cohort_dist:
    retention_var = None
    if "retention_rate" in cohort_result.columns:
        retention_var = cohort_result["retention_rate"].max() - cohort_result["retention_rate"].min()

    cohort_recs = generate_cohort_recommendations(cohort_dist, retention_variation=retention_var)

    print("📊 COHORT DETAILS")
    print("="*50)
    print("\nEntity Onboarding Distribution by Year:")
    print("─" * 40)
    for year, count in sorted(cohort_dist.year_counts.items()):
        pct = count / cohort_dist.total_entities * 100
        bar = "█" * int(pct / 3)
        print(f"   {year}: {count:>5,} entities ({pct:>5.1f}%) {bar}")

    print(f"\n   Total entities: {cohort_dist.total_entities:,}")
    print(f"   Data spans: {df[TIME_COLUMN].min().date()} to {df[TIME_COLUMN].max().date()}")

    print("\n📋 RECOMMENDATIONS:")
    for rec in cohort_recs:
        priority_icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(rec.priority, "⚪")
        print(f"   {priority_icon} [{rec.priority.upper()}] {rec.action}")
        print(f"      {rec.reason}")
        if rec.features:
            print(f"      Features: {', '.join(rec.features)}")
        if rec.insight:
            print(f"      💡 {rec.insight}")

    COHORT_RECOMMENDATIONS = [{"action": r.action, "priority": r.priority, "reason": r.reason,
                               "features": getattr(r, 'features', []),
                               "insight": getattr(r, 'insight', None)} for r in cohort_recs]
📊 COHORT DETAILS
==================================================

Entity Onboarding Distribution by Year:
────────────────────────────────────────
   2015: 4,505 entities ( 90.1%) ██████████████████████████████
   2016:   452 entities (  9.0%) ███
   2017:    37 entities (  0.7%) 
   2018:     4 entities (  0.1%) 

   Total entities: 4,998
   Data spans: 2015-01-01 to 2023-12-30

📋 RECOMMENDATIONS:
   🟢 [LOW] skip_cohort_features
      90% onboarded in 2015 - insufficient variation
      💡 Established customer base, not a growing acquisition funnel
   🟡 [MEDIUM] investigate_cohort_retention
      Retention varies 67pp across cohorts - investigate drivers

1c.8 Correlation Matrix Analysis

📊 Understanding Feature Relationships:

This section shows feature-feature relationships in two complementary ways:

  • Correlation Matrix: Numerical summary (r values)
  • Scatter Matrix: Visual relationships with cohort overlay

Correlation   Interpretation            Action
|r| > 0.9     Near-duplicate features   Remove one
|r| > 0.7     Strong relationship       Consider combining
|r| < 0.3     Weak/no relationship      Independent features
In [15]:
# Correlation matrix for numeric event attributes
# Define analysis columns - exclude entity, time, target, and temporal metadata
numeric_event_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                      if c not in [ENTITY_COLUMN, TIME_COLUMN, TARGET_COLUMN]
                      and c not in TEMPORAL_METADATA_COLS
                      and 'target' not in c.lower()]

excluded_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c not in numeric_event_cols]

print(f"Correlation Analysis (event-level, n={len(df):,})")
print(f"  Included ({len(numeric_event_cols)}): {numeric_event_cols}")
print(f"  Excluded ({len(excluded_cols)}): {excluded_cols}")

if len(numeric_event_cols) >= 2:
    corr_matrix = df[numeric_event_cols].corr()
    fig = charts.heatmap(
        corr_matrix.values, x_labels=numeric_event_cols, y_labels=numeric_event_cols,
        title="Feature Correlation Matrix (Event-Level)"
    )
    display_figure(fig)

    # High correlation pairs
    high_corr = []
    for i in range(len(numeric_event_cols)):
        for j in range(i+1, len(numeric_event_cols)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) > 0.7:
                high_corr.append((numeric_event_cols[i], numeric_event_cols[j], corr_val))

    if high_corr:
        print("\n⚠️ Highly correlated pairs (|r| > 0.7):")
        for c1, c2, r in sorted(high_corr, key=lambda x: abs(x[2]), reverse=True)[:5]:
            print(f"   {c1} ↔ {c2}: r={r:.2f}")
Correlation Analysis (event-level, n=83,198)
  Included (6): ['opened', 'clicked', 'send_hour', 'unsubscribed', 'bounced', 'time_to_open_hours']
  Excluded (1): ['unsubscribed_entity']
[Figure: Feature Correlation Matrix (Event-Level) heatmap]
InΒ [16]:
Show/Hide Code
# Scatter Matrix: Entity-level features (mixed aggregation types)
if len(numeric_event_cols) >= 2 and ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns:
    # Create entity-level aggregations (mean, sum, std) per numeric event column
    agg_dict = {col: ['mean', 'sum', 'std'] for col in numeric_event_cols}
    entity_aggs = df.groupby(ENTITY_COLUMN).agg(agg_dict)
    entity_aggs.columns = ['_'.join(col).strip() for col in entity_aggs.columns]
    entity_aggs = entity_aggs.reset_index()

    # Get all numeric aggregated columns
    all_agg_cols = [c for c in entity_aggs.columns if c != ENTITY_COLUMN]

    # Select top 4 by variance across ALL aggregation types
    variances = entity_aggs[all_agg_cols].var().sort_values(ascending=False)
    top_features = variances.head(4).index.tolist()

    # Sample if needed
    sample_size = min(1000, len(entity_aggs))
    scatter_sample = entity_aggs.sample(sample_size, random_state=42) if sample_size < len(entity_aggs) else entity_aggs

    print(f"Scatter Matrix (n={len(scatter_sample):,} entities)")
    print(f"  Total aggregated features: {len(all_agg_cols)}")
    print(f"  Selected (top 4 by variance): {top_features}")

    # Short labels for x-axis (no line breaks)
    short_labels = [f.replace('_', ' ') for f in top_features]

    scatter_data = scatter_sample[top_features].copy()
    scatter_data.columns = short_labels

    fig = charts.scatter_matrix(scatter_data, height=500)
    fig.update_traces(marker=dict(opacity=0.5, size=4))

    # Update y-axis labels to be multirow, keep x-axis single row
    n_features = len(short_labels)
    for i in range(n_features):
        # Y-axis: multirow
        yaxis_name = f'yaxis{i+1}' if i > 0 else 'yaxis'
        y_label = top_features[i].replace('_', '<br>')
        fig.update_layout(**{yaxis_name: dict(title=dict(text=y_label))})

        # X-axis: single row (spaces instead of underscores)
        xaxis_name = f'xaxis{i+1}' if i > 0 else 'xaxis'
        x_label = top_features[i].replace('_', ' ')
        fig.update_layout(**{xaxis_name: dict(title=dict(text=x_label))})

    fig.update_layout(
        title="Feature Relationships (Top 4 by Variance)",
        margin=dict(l=100, r=20, t=50, b=60)
    )

    display_figure(fig)

    print("\nπŸ“ˆ Scatter Matrix Insights:")
    print("   β€’ Different aggregation types create different patterns/bands")
    print("   β€’ sum features often show exponential-like distributions")
    print("   β€’ std features reveal variability clusters")
    print("   β€’ mean features show central tendency patterns")
Scatter Matrix (n=1,000 entities)
  Total aggregated features: 18
  Selected (top 4 by variance): ['send_hour_sum', 'time_to_open_hours_sum', 'opened_sum', 'time_to_open_hours_mean']
[Figure: scatter matrix of top 4 aggregated features]
πŸ“ˆ Scatter Matrix Insights:
   β€’ Different aggregation types create different patterns/bands
   β€’ sum features often show exponential-like distributions
   β€’ std features reveal variability clusters
   β€’ mean features show central tendency patterns
InΒ [17]:
Show/Hide Code
# Correlation Analysis: Interpretation
print("\n" + "="*70)
print("CORRELATION ANALYSIS SUMMARY")
print("="*70)

if 'high_corr' in dir() and high_corr:
    print(f"\nπŸ“Š Found {len(high_corr)} highly correlated pairs (|r| > 0.7):")
    for c1, c2, r in sorted(high_corr, key=lambda x: abs(x[2]), reverse=True)[:5]:
        print(f"   β€’ {c1} ↔ {c2}: r={r:.2f}")
    print("\nπŸ’‘ RECOMMENDATIONS:")
    print("   β†’ Remove redundant features to reduce multicollinearity")
    print("   β†’ Or create composite features from correlated groups")
else:
    print("\nβœ… No highly correlated pairs detected")
    print("   β†’ Features appear independent, good for modeling")
======================================================================
CORRELATION ANALYSIS SUMMARY
======================================================================

βœ… No highly correlated pairs detected
   β†’ Features appear independent, good for modeling
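
When highly correlated pairs do turn up, one way to act on the "create composite features" recommendation printed above is to average z-scores across the correlated group. A minimal sketch with hypothetical column names and simulated data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=400)
# Hypothetical correlated engagement signals sharing one latent driver
df = pd.DataFrame({
    "opened": base + rng.normal(scale=0.3, size=400),
    "clicked": base + rng.normal(scale=0.3, size=400),
})
print(df.corr().round(2))  # the pair is strongly correlated by construction

# Composite = mean of z-scores; one column can replace the correlated group
z = (df - df.mean()) / df.std(ddof=0)
df["engagement_index"] = z.mean(axis=1)
print(df["engagement_index"].describe().round(2))
```

The composite keeps the shared signal while shedding one redundant dimension, which reduces multicollinearity in downstream linear models.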

1c.9 Temporal SparklinesΒΆ

📈 Understanding Temporal Trends:

Sparklines show how numeric features evolve over time:

| Pattern | What It Means | Implication |
| --- | --- | --- |
| Upward trend | Metric increasing | Growth or engagement |
| Downward trend | Metric decreasing | Decline or churn signal |
| Flat line | Stable metric | Consistent behavior |
| Spikes | Sudden changes | Events or anomalies |
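
The trend rows in this table come down to the slope of per-period means. A minimal slope check, using hypothetical monthly values for two cohorts:

```python
import numpy as np

def trend_slope(series):
    """Least-squares slope of a sequence of period means (units per period)."""
    y = np.asarray(series, dtype=float)
    return np.polyfit(np.arange(len(y)), y, 1)[0]

# Hypothetical monthly means of an engagement metric, one list per cohort
retained_monthly = [0.40, 0.42, 0.45, 0.47, 0.50]  # drifting up
churned_monthly = [0.50, 0.45, 0.41, 0.36, 0.30]   # drifting down

s_ret = trend_slope(retained_monthly)
s_chu = trend_slope(churned_monthly)
# Opposite signs are the "opposite cohort trends" case that warrants a trend feature
print(f"retained={s_ret:+.3f}/period, churned={s_chu:+.3f}/period")
```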
InΒ [18]:
Show/Hide Code
# Temporal Sparklines - Cohort Γ— Time Period per feature with analysis
sparkline_cols = []
if not LIGHT_RUN and len(numeric_event_cols) >= 2:
    variances = df[numeric_event_cols].var().sort_values(ascending=False)
    sparkline_cols = variances.index.tolist()

    print("\n" + "="*70)
    print("TEMPORAL SPARKLINES - COHORT Γ— TIME PERIOD")
    print("="*70)
    print(f"\n{len(sparkline_cols)} features analyzed across Weekly/Monthly/Yearly periods")

    if ENTITY_COLUMN and TIME_COLUMN:
        df_spark = df.copy()
        df_spark['_week'] = pd.to_datetime(df_spark[TIME_COLUMN]).dt.to_period('W').dt.start_time
        df_spark['_month'] = pd.to_datetime(df_spark[TIME_COLUMN]).dt.to_period('M').dt.start_time
        df_spark['_year'] = pd.to_datetime(df_spark[TIME_COLUMN]).dt.to_period('Y').dt.start_time

        has_target = TARGET_COLUMN and TARGET_COLUMN in df.columns
        all_actions = []

        for col in sparkline_cols:
            if col not in df_spark.columns:
                continue

            feature_data = {}
            cohort_masks = ([("retained", df_spark[TARGET_COLUMN] == 1),
                            ("churned", df_spark[TARGET_COLUMN] == 0),
                            ("overall", slice(None))] if has_target
                           else [("overall", slice(None))])

            for cohort, mask in cohort_masks:
                cohort_df = df_spark[mask] if isinstance(mask, pd.Series) else df_spark
                feature_data[cohort] = {
                    "weekly": cohort_df.groupby('_week')[col].mean().dropna().tolist(),
                    "monthly": cohort_df.groupby('_month')[col].mean().dropna().tolist(),
                    "yearly": cohort_df.groupby('_year')[col].mean().dropna().tolist(),
                }

            period_effects = None
            if has_target:
                analysis = charts.analyze_cohort_trends(feature_data, col)
                period_effects = {p: analysis["periods"][p]["effect_size"]
                                  for p in analysis["periods"]}
                all_actions.extend(analysis.get("actions", []))

            fig = charts.cohort_sparklines(feature_data, feature_name=col, period_effects=period_effects)
            display_figure(fig)

        if has_target and all_actions:
            print("\n" + "="*70)
            print("TREND & VARIANCE RECOMMENDATIONS")
            print("="*70)

            BOLD, RESET = "\033[1m", "\033[0m"

            type_labels = {
                "add_trend_feature": "πŸ“ˆ Add Trend Features (opposite cohort trends)",
                "add_time_indicator": "πŸ“… Add Time Indicators (seasonality detected)",
                "robust_scale": "πŸ”§ Apply Robust Scaling (high variance ratio)",
                "normalize": "πŸ“Š Apply Normalization (high variance)",
            }

            by_type = {}
            for action in all_actions:
                action_type = action["action_type"]
                if action_type not in by_type:
                    by_type[action_type] = []
                by_type[action_type].append(action)

            for action_type, actions in by_type.items():
                print(f"\n{type_labels.get(action_type, action_type)}:")
                for a in actions:
                    params_str = ", ".join(f"{k}={v}" for k, v in a.get("params", {}).items())
                    print(f"   β€’ {BOLD}{a['feature']}{RESET}: {a['reason']}")
                    if params_str:
                        print(f"     params: {{{params_str}}}")
elif LIGHT_RUN:
    print("Sparkline analysis skipped (LIGHT_RUN)")
else:
    print("Insufficient numeric columns for sparkline visualization")
# Store sparkline recommendations for pattern_summary
SPARKLINE_RECOMMENDATIONS = [
    {"action": a["action_type"], "feature": a["feature"], "reason": a["reason"],
     "params": a.get("params", {}), "priority": "high" if a["action_type"] == "add_trend_feature" else "medium",
     "features": [f"{a['feature']}_{a['action_type']}"]}
    for a in all_actions
] if 'all_actions' in dir() and all_actions else []
======================================================================
TEMPORAL SPARKLINES - COHORT Γ— TIME PERIOD
======================================================================

6 features analyzed across Weekly/Monthly/Yearly periods
[Figures: cohort sparklines (weekly / monthly / yearly) for each of the 6 features]
======================================================================
TREND & VARIANCE RECOMMENDATIONS
======================================================================

πŸ“ˆ Add Trend Features (opposite cohort trends):
   β€’ time_to_open_hours: Opposite trends detected at yearly scale
     params: {period=yearly, method=slope}
   β€’ opened: Opposite trends detected at yearly scale
     params: {period=yearly, method=slope}
   β€’ clicked: Opposite trends detected at yearly scale
     params: {period=yearly, method=slope}

πŸ“… Add Time Indicators (seasonality detected):
   β€’ time_to_open_hours: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}
   β€’ send_hour: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}
   β€’ opened: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}
   β€’ clicked: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}
   β€’ unsubscribed: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}
   β€’ bounced: Seasonality detected at weekly scale
     params: {period=weekly, indicators=['cyclical_encoding']}

πŸ”§ Apply Robust Scaling (high variance ratio):
   β€’ time_to_open_hours: High variance ratio (11.1x) between cohorts
     params: {method=robust_scaler}
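
A minimal sketch of the cyclical_encoding indicator recommended above, assuming a day-of-week signal (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Day-of-week wraps around: Sunday (6) and Monday (0) should be neighbors,
# which a raw 0-6 integer does not capture but a sin/cos pair does
ts = pd.Series(pd.date_range("2024-01-01", periods=14, freq="D"))  # starts Monday
dow = ts.dt.dayofweek

encoded = pd.DataFrame({
    "dow_sin": np.sin(2 * np.pi * dow / 7),
    "dow_cos": np.cos(2 * np.pi * dow / 7),
})
# Every encoded point lies on the unit circle; adjacent days stay adjacent
print(encoded.head(8).round(3))
```

The same pattern applies to hour-of-day (divide by 24) or month-of-year (divide by 12).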

1c.10 Entity-Level Feature Analysis (Effect Sizes)ΒΆ

This section uses three complementary approaches to understand feature separation:

| Approach | What It Measures | Output |
| --- | --- | --- |
| Cohen's d | Standardized mean difference | Single number per feature |
| Correlation | Linear relationship with target | Single number per feature |
| Box Plots | Full distribution by cohort | Visual comparison |

📊 Cohen's d Interpretation:

| \|d\| Value | Effect Size | Predictive Value |
| --- | --- | --- |
| ≥ 0.8 | Large | Strong differentiator |
| 0.5-0.8 | Medium | Useful signal |
| 0.2-0.5 | Small | Weak signal |
| < 0.2 | Negligible | Not predictive |

Connection to Sparklines (1c.9): the d values shown in the sparkline column headers are per-period effect sizes; here we compute entity-level effect sizes across all aggregated features.

See also: Section 1c.8 for the scatter matrix showing feature relationships with cohort overlay.
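
The pooled-standard-deviation form of Cohen's d used in the next cell can be isolated into a small helper; the sample values below are hypothetical:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference: (mean(a) - mean(b)) / pooled std."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled) if pooled > 0 else 0.0

# Hypothetical entity-level aggregate: retained vs churned groups
retained = [2.0, 2.5, 3.0, 2.2, 2.8]
churned = [4.0, 4.5, 5.0, 4.2, 4.8]
d = cohens_d(retained, churned)
print(f"d = {d:+.2f}")  # negative: lower values among retained
```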

InΒ [19]:
Show/Hide Code
# Aggregate event data to entity level for effect size analysis
if ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns and not LIGHT_RUN:
    # Build entity-level aggregations
    entity_aggs = df.groupby(ENTITY_COLUMN).agg({
        TIME_COLUMN: ['count', 'min', 'max'],
        **{col: ['mean', 'sum', 'std'] for col in numeric_event_cols if col != TARGET_COLUMN}
    })
    entity_aggs.columns = ['_'.join(col).strip() for col in entity_aggs.columns]
    entity_aggs = entity_aggs.reset_index()

    # Add target
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
    entity_df = entity_aggs.merge(entity_target, on=ENTITY_COLUMN)

    # Add derived features
    entity_df['tenure_days'] = (entity_df[f'{TIME_COLUMN}_max'] - entity_df[f'{TIME_COLUMN}_min']).dt.days
    entity_df['event_count'] = entity_df[f'{TIME_COLUMN}_count']

    # Calculate effect sizes (Cohen's d) for entity-level features
    # Exclude entity, target, and temporal metadata columns
    effect_feature_cols = [c for c in entity_df.select_dtypes(include=[np.number]).columns
                          if c not in [ENTITY_COLUMN, TARGET_COLUMN]
                          and c not in TEMPORAL_METADATA_COLS]

    print("="*80)
    print("ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)")
    print("="*80)
    print(f"\nAnalyzing {len(effect_feature_cols)} aggregated features at entity level")
    print(f"Entities: {len(entity_df):,} (Retained: {(entity_df[TARGET_COLUMN]==1).sum():,}, Churned: {(entity_df[TARGET_COLUMN]==0).sum():,})\n")

    effect_sizes = []
    for col in effect_feature_cols:
        churned = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
        retained = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()

        if len(churned) > 0 and len(retained) > 0:
            pooled_std = np.sqrt(((len(churned)-1)*churned.std()**2 + (len(retained)-1)*retained.std()**2) /
                                 (len(churned) + len(retained) - 2))
            d = (retained.mean() - churned.mean()) / pooled_std if pooled_std > 0 else 0

            abs_d = abs(d)
            if abs_d >= 0.8:
                interp, emoji = "Large effect", "πŸ”΄"
            elif abs_d >= 0.5:
                interp, emoji = "Medium effect", "🟑"
            elif abs_d >= 0.2:
                interp, emoji = "Small effect", "🟒"
            else:
                interp, emoji = "Negligible", "βšͺ"

            effect_sizes.append({
                "feature": col, "cohens_d": d, "abs_d": abs_d,
                "interpretation": interp, "emoji": emoji,
                "retained_mean": retained.mean(), "churned_mean": churned.mean()
            })

    # Sort and display
    effect_df = pd.DataFrame(effect_sizes).sort_values("abs_d", ascending=False)

    print(f"{'Feature':<35} {'d':>8} {'Effect':<15} {'Direction':<20}")
    print("-" * 80)
    for _, row in effect_df.head(15).iterrows():
        direction = "↑ Higher in retained" if row["cohens_d"] > 0 else "↓ Lower in retained"
        print(f"{row['emoji']} {row['feature'][:33]:<33} {row['cohens_d']:>+8.3f} {row['interpretation']:<15} {direction:<20}")

    # Categorize features
    large_effect = effect_df[effect_df["abs_d"] >= 0.8]["feature"].tolist()
    medium_effect = effect_df[(effect_df["abs_d"] >= 0.5) & (effect_df["abs_d"] < 0.8)]["feature"].tolist()
    small_effect = effect_df[(effect_df["abs_d"] >= 0.2) & (effect_df["abs_d"] < 0.5)]["feature"].tolist()

    # INTERPRETATION
    print("\n" + "─"*80)
    print("πŸ“– INTERPRETATION & RECOMMENDATIONS")
    print("─"*80)

    if large_effect:
        print("\nπŸ”΄ LARGE EFFECT (|d| β‰₯ 0.8) - Priority Features:")
        for f in large_effect[:5]:
            row = effect_df[effect_df["feature"] == f].iloc[0]
            direction = "higher" if row["cohens_d"] > 0 else "lower"
            print(f"   β€’ {f}: Retained customers have {direction} values")
            print(f"     Mean: Retained={row['retained_mean']:.2f}, Churned={row['churned_mean']:.2f}")
        print("   β†’ MUST include in predictive model")

    if medium_effect:
        print("\n🟑 MEDIUM EFFECT (0.5 ≀ |d| < 0.8) - Useful Features:")
        for f in medium_effect[:3]:
            print(f"   β€’ {f}")
        print("   β†’ Should include in model")

    if small_effect:
        print("\n🟒 SMALL EFFECT (0.2 ≀ |d| < 0.5) - Supporting Features:")
        print(f"   {', '.join(small_effect[:5])}")
        print("   β†’ May help in combination with other features")

    negligible = effect_df[effect_df["abs_d"] < 0.2]["feature"].tolist()
    if negligible:
        print(f"\nβšͺ NEGLIGIBLE EFFECT (|d| < 0.2): {len(negligible)} features")
        print("   β†’ Consider engineering or dropping from model")
elif LIGHT_RUN:
    print("Entity-level effect size analysis skipped (LIGHT_RUN)")
else:
    print("Entity column or target not available for effect size analysis")
================================================================================
ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)
================================================================================

Analyzing 21 aggregated features at entity level
Entities: 4,998 (Retained: 2,237, Churned: 2,761)

Feature                                    d Effect          Direction           
--------------------------------------------------------------------------------
πŸ”΄ unsubscribed_std                    +4.158 Large effect    ↑ Higher in retained
πŸ”΄ tenure_days                         -2.404 Large effect    ↓ Lower in retained 
πŸ”΄ unsubscribed_mean                   +1.417 Large effect    ↑ Higher in retained
πŸ”΄ opened_sum                          -0.997 Large effect    ↓ Lower in retained 
πŸ”΄ opened_std                          -0.953 Large effect    ↓ Lower in retained 
πŸ”΄ sent_date_count                     -0.874 Large effect    ↓ Lower in retained 
πŸ”΄ event_count                         -0.874 Large effect    ↓ Lower in retained 
πŸ”΄ send_hour_sum                       -0.866 Large effect    ↓ Lower in retained 
πŸ”΄ opened_mean                         -0.838 Large effect    ↓ Lower in retained 
πŸ”΄ time_to_open_hours_sum              -0.816 Large effect    ↓ Lower in retained 
🟑 clicked_sum                         -0.694 Medium effect   ↓ Lower in retained 
🟑 clicked_std                         -0.628 Medium effect   ↓ Lower in retained 
🟒 clicked_mean                        -0.473 Small effect    ↓ Lower in retained 
🟒 bounced_sum                         -0.338 Small effect    ↓ Lower in retained 
🟒 bounced_std                         -0.237 Small effect    ↓ Lower in retained 

────────────────────────────────────────────────────────────────────────────────
πŸ“– INTERPRETATION & RECOMMENDATIONS
────────────────────────────────────────────────────────────────────────────────

πŸ”΄ LARGE EFFECT (|d| β‰₯ 0.8) - Priority Features:
   β€’ unsubscribed_std: Retained customers have higher values
     Mean: Retained=0.33, Churned=0.00
   β€’ tenure_days: Retained customers have lower values
     Mean: Retained=1444.03, Churned=2950.74
   β€’ unsubscribed_mean: Retained customers have higher values
     Mean: Retained=0.14, Churned=0.00
   β€’ opened_sum: Retained customers have lower values
     Mean: Retained=2.17, Churned=5.00
   β€’ opened_std: Retained customers have lower values
     Mean: Retained=0.29, Churned=0.43
   β†’ MUST include in predictive model

🟑 MEDIUM EFFECT (0.5 ≀ |d| < 0.8) - Useful Features:
   β€’ clicked_sum
   β€’ clicked_std
   β†’ Should include in model

🟒 SMALL EFFECT (0.2 ≀ |d| < 0.5) - Supporting Features:
   clicked_mean, bounced_sum, bounced_std
   β†’ May help in combination with other features

βšͺ NEGLIGIBLE EFFECT (|d| < 0.2): 6 features
   β†’ Consider engineering or dropping from model
InΒ [20]:
Show/Hide Code
# Box Plots: Entity-level feature distributions by target
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir() and len(effect_df) > 0:
    # Select top features by effect size for visualization
    top_features = effect_df.head(6)["feature"].tolist()
    n_features = len(top_features)

    if n_features > 0:
        print("="*70)
        print("DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)")
        print("="*70)
        print("\nπŸ“Š Showing top 6 features by effect size")
        print("   🟒 Green = Retained | πŸ”΄ Red = Churned\n")

        fig = make_subplots(rows=1, cols=n_features, subplot_titles=top_features, horizontal_spacing=0.05)

        for i, col in enumerate(top_features):
            col_num = i + 1

            # Retained (1) - Green
            retained_data = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()
            fig.add_trace(go.Box(y=retained_data, name='Retained',
                fillcolor='rgba(46, 204, 113, 0.7)', line=dict(color='#1e8449', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='retained',
                marker=dict(color='rgba(46, 204, 113, 0.5)', size=4)), row=1, col=col_num)

            # Churned (0) - Red
            churned_data = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
            fig.add_trace(go.Box(y=churned_data, name='Churned',
                fillcolor='rgba(231, 76, 60, 0.7)', line=dict(color='#922b21', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='churned',
                marker=dict(color='rgba(231, 76, 60, 0.5)', size=4)), row=1, col=col_num)

        fig.update_layout(height=450, title_text="Top Features: Retained (Green) vs Churned (Red)",
            template='plotly_white', showlegend=True, boxmode='group',
            legend=dict(orientation="h", yanchor="bottom", y=1.05, xanchor="center", x=0.5))
        fig.update_xaxes(showticklabels=False)
        display_figure(fig)

        # INTERPRETATION
        print("─"*70)
        print("πŸ“– HOW TO READ BOX PLOTS")
        print("─"*70)
        print("""
Box Plot Elements:
  β€’ Box = Middle 50% of data (IQR: 25th to 75th percentile)
  β€’ Line inside box = Median (50th percentile)
  β€’ Whiskers = 1.5 Γ— IQR from box edges
  β€’ Dots outside = Outliers

What makes a good predictor:
  βœ“ Clear SEPARATION between green and red boxes
  βœ“ Different MEDIANS (center lines at different heights)
  βœ“ Minimal OVERLAP between boxes

Patterns to look for:
  β€’ Green box entirely above red β†’ Retained have higher values
  β€’ Green box entirely below red β†’ Retained have lower values
  β€’ Overlapping boxes β†’ Feature alone may not discriminate well
  β€’ Many outliers in one group β†’ Subpopulations worth investigating
""")
======================================================================
DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)
======================================================================

πŸ“Š Showing top 6 features by effect size
   🟒 Green = Retained | πŸ”΄ Red = Churned

[Figure: box plots of top 6 features, Retained vs Churned]
──────────────────────────────────────────────────────────────────────
πŸ“– HOW TO READ BOX PLOTS
──────────────────────────────────────────────────────────────────────

Box Plot Elements:
  β€’ Box = Middle 50% of data (IQR: 25th to 75th percentile)
  β€’ Line inside box = Median (50th percentile)
  β€’ Whiskers = 1.5 Γ— IQR from box edges
  β€’ Dots outside = Outliers

What makes a good predictor:
  βœ“ Clear SEPARATION between green and red boxes
  βœ“ Different MEDIANS (center lines at different heights)
  βœ“ Minimal OVERLAP between boxes

Patterns to look for:
  β€’ Green box entirely above red β†’ Retained have higher values
  β€’ Green box entirely below red β†’ Retained have lower values
  β€’ Overlapping boxes β†’ Feature alone may not discriminate well
  β€’ Many outliers in one group β†’ Subpopulations worth investigating
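
The whisker rule above (1.5 × IQR) can be reproduced numerically; a minimal sketch with hypothetical values:

```python
import numpy as np

def box_stats(values):
    """Quartiles plus Tukey whisker bounds (1.5 x IQR), as drawn in a box plot."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = v[(v < lo) | (v > hi)]
    return med, (q1, q3), (lo, hi), outliers

# Hypothetical entity-level feature with one extreme value
med, box, whiskers, out = box_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
print(f"median={med}, box={box}, whisker bounds={whiskers}, outliers={out}")
```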

InΒ [21]:
Show/Hide Code
# Feature-Target Correlation Ranking
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir():
    print("="*70)
    print("FEATURE-TARGET CORRELATIONS (Entity-Level)")
    print("="*70)

    correlations = []
    for col in effect_feature_cols:
        if col != TARGET_COLUMN:
            corr = entity_df[[col, TARGET_COLUMN]].corr().iloc[0, 1]
            if not np.isnan(corr):
                correlations.append({"Feature": col, "Correlation": corr})

    if correlations:
        corr_df = pd.DataFrame(correlations).sort_values("Correlation", key=abs, ascending=False)

        fig = charts.bar_chart(
            corr_df["Feature"].head(12).tolist(),
            corr_df["Correlation"].head(12).tolist(),
            title=f"Feature Correlations with {TARGET_COLUMN}"
        )
        display_figure(fig)

        print("\nπŸ“Š Correlation Rankings:")
        print(f"{'Feature':<35} {'Correlation':>12} {'Strength':<15} {'Direction'}")
        print("-" * 75)

        for _, row in corr_df.head(10).iterrows():
            abs_corr = abs(row["Correlation"])
            if abs_corr >= 0.5:
                strength = "Strong"
            elif abs_corr >= 0.3:
                strength = "Moderate"
            elif abs_corr >= 0.1:
                strength = "Weak"
            else:
                strength = "Very weak"

            direction = "Positive" if row["Correlation"] > 0 else "Negative"
            print(f"{row['Feature'][:34]:<35} {row['Correlation']:>+12.3f} {strength:<15} {direction}")

        # INTERPRETATION
        print("\n" + "─"*70)
        print("πŸ“– INTERPRETING CORRELATIONS WITH TARGET")
        print("─"*70)
        print("""
Correlation with binary target (retained=1, churned=0):

  Positive correlation (+): Higher values β†’ more likely RETAINED
  Negative correlation (-): Higher values β†’ more likely CHURNED

Strength guide:
  |r| > 0.5:  Strong - prioritize this feature
  |r| 0.3-0.5: Moderate - useful predictor
  |r| 0.1-0.3: Weak - may help in combination
  |r| < 0.1:  Very weak - limited predictive value

Note: Correlation captures LINEAR relationships only.
Non-linear relationships may have low correlation but still be predictive.
""")
======================================================================
FEATURE-TARGET CORRELATIONS (Entity-Level)
======================================================================
[Figure: bar chart of feature correlations with the target]
πŸ“Š Correlation Rankings:
Feature                              Correlation Strength        Direction
---------------------------------------------------------------------------
unsubscribed_sum                          +1.000 Strong          Positive
unsubscribed_std                          +0.900 Strong          Positive
tenure_days                               -0.767 Strong          Negative
unsubscribed_mean                         +0.576 Strong          Positive
opened_sum                                -0.444 Moderate        Negative
opened_std                                -0.428 Moderate        Negative
sent_date_count                           -0.399 Moderate        Negative
event_count                               -0.399 Moderate        Negative
send_hour_sum                             -0.396 Moderate        Negative
opened_mean                               -0.385 Moderate        Negative

──────────────────────────────────────────────────────────────────────
πŸ“– INTERPRETING CORRELATIONS WITH TARGET
──────────────────────────────────────────────────────────────────────

Correlation with binary target (retained=1, churned=0):

  Positive correlation (+): Higher values β†’ more likely RETAINED
  Negative correlation (-): Higher values β†’ more likely CHURNED

Strength guide:
  |r| > 0.5:  Strong - prioritize this feature
  |r| 0.3-0.5: Moderate - useful predictor
  |r| 0.1-0.3: Weak - may help in combination
  |r| < 0.1:  Very weak - limited predictive value

Note: Correlation captures LINEAR relationships only.
Non-linear relationships may have low correlation but still be predictive.
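
That caveat can be demonstrated with a hypothetical U-shaped relationship, where Pearson r is near zero even though the feature is almost perfectly predictive:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
# Hypothetical U-shaped risk: churn occurs at both extremes of x
y = (np.abs(x) > 0.5).astype(float)

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:+.3f}")  # near zero despite x determining y

# A binned target-rate check exposes the pattern that r misses
bins = np.digitize(x, [-0.5, 0.5])
for b, label in enumerate(["x < -0.5", "-0.5 to 0.5", "x > 0.5"]):
    print(f"{label}: churn rate = {y[bins == b].mean():.2f}")
```

Binned target rates (or tree-based feature importances) are a quick cross-check before dropping a low-|r| feature.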

InΒ [22]:
Show/Hide Code
# Entity-Level Analysis: Summary
print("\n" + "="*70)
print("ENTITY-LEVEL FEATURE SUMMARY")
print("="*70)

if 'effect_df' in dir() and len(effect_df) > 0:
    large_effects = effect_df[effect_df['cohens_d'].abs() >= 0.5]
    print("\nπŸ“Š Effect Size Summary:")
    print(f"   β€’ Total features analyzed: {len(effect_df)}")
    print(f"   β€’ Features with |d| β‰₯ 0.5 (medium+): {len(large_effects)}")
    print(f"   β€’ Features with |d| < 0.2 (negligible): {(effect_df['cohens_d'].abs() < 0.2).sum()}")

    if len(large_effects) > 0:
        print("\n   Top differentiators:")
        for _, row in large_effects.head(5).iterrows():
            direction = "↑ higher in retained" if row['cohens_d'] > 0 else "↓ lower in retained"
            print(f"   β€’ \033[1m{row['feature']}\033[0m: d={row['cohens_d']:+.2f} ({direction})")

    print("\nπŸ“‹ What the Three Approaches Showed:")
    print("   β€’ Cohen's d β†’ identified features with strongest mean separation")
    print("   β€’ Correlation β†’ confirmed linear relationship direction")
    print("   β€’ Box plots β†’ revealed distribution shapes and outliers")

    print("\nπŸ’‘ RECOMMENDATIONS:")
    print("   β†’ Prioritize features with |d| > 0.5 in model")
    print("   β†’ Consider dropping features with |d| < 0.2")
    print("   β†’ Check box plots for non-normal distributions that may need transformation")
else:
    print("\n⚠️ Effect size analysis not performed")

# Store effect size recommendations for pattern_summary
EFFECT_SIZE_RECOMMENDATIONS = []
if 'effect_df' in dir() and len(effect_df) > 0:
    for _, row in effect_df.iterrows():
        abs_d = abs(row['cohens_d'])
        if abs_d >= 0.5:
            EFFECT_SIZE_RECOMMENDATIONS.append({
                "action": "prioritize_feature", "feature": row['feature'],
                "effect_size": row['cohens_d'], "priority": "high" if abs_d >= 0.8 else "medium",
                "reason": f"Cohen's d={row['cohens_d']:.2f} shows {'large' if abs_d >= 0.8 else 'medium'} effect",
                "features": [row['feature']]
            })
        elif abs_d < 0.2:
            EFFECT_SIZE_RECOMMENDATIONS.append({
                "action": "consider_dropping", "feature": row['feature'],
                "effect_size": row['cohens_d'], "priority": "low",
                "reason": f"Cohen's d={row['cohens_d']:.2f} shows negligible effect",
                "features": []  # No feature to add, just informational
            })
======================================================================
ENTITY-LEVEL FEATURE SUMMARY
======================================================================

πŸ“Š Effect Size Summary:
   β€’ Total features analyzed: 21
   β€’ Features with |d| β‰₯ 0.5 (medium+): 12
   β€’ Features with |d| < 0.2 (negligible): 6

   Top differentiators:
   β€’ unsubscribed_std: d=+4.16 (↑ higher in retained)
   β€’ tenure_days: d=-2.40 (↓ lower in retained)
   β€’ unsubscribed_mean: d=+1.42 (↑ higher in retained)
   β€’ opened_sum: d=-1.00 (↓ lower in retained)
   β€’ opened_std: d=-0.95 (↓ lower in retained)

πŸ“‹ What the Three Approaches Showed:
   β€’ Cohen's d β†’ identified features with strongest mean separation
   β€’ Correlation β†’ confirmed linear relationship direction
   β€’ Box plots β†’ revealed distribution shapes and outliers

πŸ’‘ RECOMMENDATIONS:
   β†’ Prioritize features with |d| > 0.5 in model
   β†’ Consider dropping features with |d| < 0.2
   β†’ Check box plots for non-normal distributions that may need transformation

1c.11 Recency AnalysisΒΆ

πŸ“– What is Recency? Days since each entity's last event. A key predictor in churn models.

πŸ“Š How to Read the Panel:

  • Top Row: Distribution histograms for Retained vs Churned
    • Compare shapes: Similar = weak signal, Different = strong signal
    • Compare medians: Large gap = recency discriminates well
  • Bottom Left: Target rate by recency bucket
    • Look for: Monotonic decline, sharp thresholds, or flat patterns
    • Inflection points suggest where to create binary flags

βœ… Pattern Interpretation:

| Pattern | Meaning | Feature Strategy |
|---|---|---|
| Monotonic decline | Gradual disengagement | Use continuous recency |
| Threshold/step | Clear activity boundary | Create binary is_active_Nd flag |
| Flat | Recency not predictive | May omit or use only in combination |
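As a concrete sketch of the recency computation (illustrative column names; the notebook's cell uses ENTITY_COLUMN/TIME_COLUMN for the same steps):

```python
import pandas as pd

# Toy event log (customer_id / event_ts are illustrative names)
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "event_ts": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-02-15", "2024-02-20", "2023-12-01"]),
})
reference_date = events["event_ts"].max()

# Recency = days from each entity's last event to the reference date
last_event = events.groupby("customer_id")["event_ts"].max()
recency_days = (reference_date - last_event).dt.days
```

Using the dataset's own max timestamp as the reference date (as the cell below does) means the most recently active entity always has recency 0.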
InΒ [23]:
Show/Hide Code
# Recency Analysis - Combined visualization and insights
from customer_retention.analysis.visualization import console
from customer_retention.stages.profiling import compare_recency_by_target

recency_comparison = None
recency_result = None
RECENCY_RECOMMENDATIONS = []

if ENTITY_COLUMN:
    reference_date = df[TIME_COLUMN].max()

    # Compute recency_result for use in summary cells
    recency_result = analyzer.analyze_recency(df, ENTITY_COLUMN, TARGET_COLUMN, reference_date)

    if TARGET_COLUMN and TARGET_COLUMN in df.columns:
        recency_comparison = compare_recency_by_target(
            df, ENTITY_COLUMN, TIME_COLUMN, TARGET_COLUMN, reference_date
        )

    if recency_comparison:
        # Combined visualization panel
        entity_last = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].max().reset_index()
        entity_last["recency_days"] = (reference_date - entity_last[TIME_COLUMN]).dt.days
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        entity_recency = entity_last.merge(entity_target, on=ENTITY_COLUMN)
        cap = entity_recency["recency_days"].quantile(0.99)
        entity_capped = entity_recency[entity_recency["recency_days"] <= cap]

        retained = entity_capped[entity_capped[TARGET_COLUMN] == 1]["recency_days"].values
        churned = entity_capped[entity_capped[TARGET_COLUMN] == 0]["recency_days"].values

        fig = charts.recency_analysis_panel(
            retained_recency=retained,
            churned_recency=churned,
            bucket_stats=recency_comparison.bucket_stats,
            retained_median=recency_comparison.retained_stats.median,
            churned_median=recency_comparison.churned_stats.median,
            cap_value=cap
        )
        display_figure(fig)

        # Key Findings
        console.start_section()
        console.header("Key Findings")
        for insight in recency_comparison.key_findings:
            console.info(insight.finding)
        console.end_section()

        # Statistics
        ret, churn = recency_comparison.retained_stats, recency_comparison.churned_stats
        console.start_section()
        console.header("Detailed Statistics")
        console.metric("Retained (n)", f"{ret.count:,}")
        console.metric("Churned (n)", f"{churn.count:,}")
        print(f"{'Metric':<15} {'Retained':>12} {'Churned':>12} {'Diff':>12}")
        print("-" * 52)
        for name, r, c in [("Mean", ret.mean, churn.mean), ("Median", ret.median, churn.median),
                           ("Std Dev", ret.std, churn.std)]:
            print(f"{name:<15} {r:>12.1f} {c:>12.1f} {c-r:>+12.1f}")
        console.metric("Effect Size", f"{recency_comparison.cohens_d:+.2f} ({recency_comparison.effect_interpretation})")
        console.metric("Pattern", recency_comparison.distribution_pattern.replace("_", " ").title())
        if recency_comparison.inflection_bucket:
            console.metric("Inflection", recency_comparison.inflection_bucket)
        console.end_section()

        # Actionable Recommendations
        console.start_section()
        console.header("Actionable Recommendations")
        RECENCY_RECOMMENDATIONS = recency_comparison.recommendations
        for rec in RECENCY_RECOMMENDATIONS:
            priority = rec.get("priority", "medium")
            priority_icon = {"high": "πŸ”΄", "medium": "🟑", "low": "🟒"}.get(priority, "βšͺ")
            console.info(f"{priority_icon} [{priority.upper()}] {rec['action'].replace('_', ' ').title()}")
            console.info(f"   {rec['reason']}")
            if rec.get("features"):
                console.metric("Features", ", ".join(rec["features"]))
        console.end_section()
    else:
        # No target - show basic recency distribution
        entity_last = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].max().reset_index()
        entity_last["recency_days"] = (reference_date - entity_last[TIME_COLUMN]).dt.days
        median_recency = entity_last["recency_days"].median()
        cap = entity_last["recency_days"].quantile(0.99)
        capped = entity_last[entity_last["recency_days"] <= cap]

        fig = go.Figure()
        fig.add_trace(go.Histogram(x=capped["recency_days"], nbinsx=50, marker_color="coral", opacity=0.7))
        fig.add_vline(x=median_recency, line_dash="solid", line_color="green", annotation_text=f"Median: {median_recency:.0f} days")
        fig.update_layout(title=f"Recency Distribution (capped at {cap:.0f} days)", xaxis_title="Days Since Last Event", yaxis_title="Count", template="plotly_white", height=400)
        display_figure(fig)

        console.start_section()
        console.header("Recency Statistics")
        console.metric("Median", f"{median_recency:.0f} days")
        console.metric("Mean", f"{entity_last['recency_days'].mean():.0f} days")
        console.info("No target column - cannot compare retained vs churned")
        console.end_section()
[Figure: recency analysis panel (retained vs churned distributions; target rate by recency bucket)]

KEY FINDINGSΒΆ

(i) Churned entities last active 1608 days shorter than retained (median: 118d vs 1726d)
(i) ⚠️ Unusual pattern: churned have MORE recent activity. Target=1 is minority (45%) - likely means CHURN not retention.
(i) Recency strongly discriminates target (Large effect, d=+2.32) - high predictive value

DETAILED STATISTICS¶

Retained (n): 2,187
Churned (n): 2,761

Metric              Retained      Churned         Diff
----------------------------------------------------
Mean                  1662.5        168.9      -1493.6
Median                1726.0        118.0      -1608.0
Std Dev                895.4        167.1       -728.3

Effect Size: +2.32 (Large effect)
Pattern: Variable

ACTIONABLE RECOMMENDATIONSΒΆ

(i) πŸ”΄ [HIGH] Invert Target Interpretation
(i) Target=1 is minority (45%) - interpret as CHURN; recency pattern is classic churn behavior
Features: days_since_last_event, log_recency

1c.12 Velocity & Acceleration AnalysisΒΆ

πŸ“– Why Velocity and Acceleration Matter:

| Metric | Formula | Interpretation |
|---|---|---|
| Velocity | Δ(value) / Δt | Rate of change - is activity speeding up or slowing down? |
| Acceleration | Δ(velocity) / Δt | Change in rate - is the slowdown accelerating? |
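With daily-aggregated values and Δt = 1 day, both metrics reduce to successive differences. A minimal pandas sketch on synthetic data:

```python
import pandas as pd

# Daily activity counts for one entity (synthetic)
daily = pd.Series([10, 12, 15, 15, 13, 10],
                  index=pd.date_range("2024-01-01", periods=6, freq="D"))

velocity = daily.diff()          # Δ(value) / Δt, with Δt = 1 day
acceleration = velocity.diff()   # Δ(velocity) / Δt

# Velocity first turns negative on day 5, but acceleration is already
# negative on day 4: the slowdown shows up before the raw counts fall.
```

This is why acceleration can act as an early-warning signal even when activity levels still look healthy.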

πŸ“Š Analysis Approach:

  1. Signal Heatmap: Effect sizes (Cohen's d) across variable Γ— time window combinations

    • Shows cohort separation strength at each time scale
    • Higher |d| = stronger individual signal, but low |d| features may still help in combinations
  2. Detailed Sparklines: For top features (ranked by max |d| across windows)

    • Shows ALL time windows for each feature - different scales capture different dynamics
    • Retained vs churned velocity/acceleration side by side
InΒ [24]:
Show/Hide Code
# Velocity & Acceleration Cohort Analysis with Effect Size Heatmap
velocity_recs = []
if ENTITY_COLUMN and TARGET_COLUMN and sparkline_cols:
    continuous_cols = [c for c in sparkline_cols if df[c].nunique() > 2][:6]

    if not continuous_cols:
        print("\u26a0\ufe0f No continuous numeric columns found for velocity analysis.")
    else:
        print("\n" + "="*70)
        print("VELOCITY & ACCELERATION COHORT ANALYSIS")
        print("="*70)

        if 'feature_analyzer' not in dir():
            feature_analyzer = TemporalFeatureAnalyzer(time_column=TIME_COLUMN, entity_column=ENTITY_COLUMN)

        windows = [7, 14, 30, 90, 180, 365]
        print(f"Analyzing {len(continuous_cols)} features across windows: {windows} days")

        all_results = {}
        heatmap_data = {"velocity": {}, "acceleration": {}}

        for col in continuous_cols:
            results = feature_analyzer.compute_cohort_velocity_signals(
                df, [col], TARGET_COLUMN, windows=windows
            )
            all_results[col] = results[col]
            heatmap_data["velocity"][col] = {f"{r.window_days}d": r.velocity_effect_size for r in results[col]}
            heatmap_data["acceleration"][col] = {f"{r.window_days}d": r.accel_effect_size for r in results[col]}

        fig = charts.velocity_signal_heatmap(heatmap_data)
        display_figure(fig)

        feature_max_d = [(col, max(abs(r.velocity_effect_size) for r in results))
                         for col, results in all_results.items()]
        feature_max_d.sort(key=lambda x: -x[1])

        top_features = [col for col, _ in feature_max_d[:3]]
        for col in top_features:
            fig = charts.cohort_velocity_sparklines(all_results[col], feature_name=col)
            display_figure(fig)

        print("\n\U0001f4ca VELOCITY EFFECT SIZE INTERPRETATION")
        print("="*70)
        print("Cohen's d measures the standardized difference between retained/churned velocity")
        print("|d| \u2265 0.8: large effect | \u2265 0.5: medium | \u2265 0.2: small\n")

        interpretation_notes = feature_analyzer.generate_velocity_interpretation(all_results)
        for note in interpretation_notes:
            print(note)

        print("\n\U0001f3af FEATURE RECOMMENDATIONS")
        print("\u2500"*70)

        velocity_recs = feature_analyzer.generate_velocity_recommendations(all_results)
        if velocity_recs:
            for rec in velocity_recs:
                priority_marker = "\U0001f534" if rec.priority == 1 else "\U0001f7e1"
                print(f"\n{priority_marker} {rec.action.upper()}")
                print(f"   Column: {rec.source_column}")
                print(f"   {rec.description}")
                print(f"   Params: {rec.params}")
        else:
            print("\nNo velocity/acceleration features recommended (no strong signals found).")

# Store velocity recommendations for pattern_summary
VELOCITY_RECOMMENDATIONS = [{"action": r.action, "source_column": r.source_column,
                              "description": r.description, "priority": r.priority,
                              "effect_size": r.effect_size, "params": r.params,
                              "features": [f"{r.source_column}_velocity_{r.params.get('window_days', 7)}d"]}
                             for r in velocity_recs] if velocity_recs else []
======================================================================
VELOCITY & ACCELERATION COHORT ANALYSIS
======================================================================
Analyzing 2 features across windows: [7, 14, 30, 90, 180, 365] days
[Figures: velocity effect-size heatmap; per-feature velocity/acceleration sparklines for top features]
πŸ“Š VELOCITY EFFECT SIZE INTERPRETATION
======================================================================
Cohen's d measures the standardized difference between retained/churned velocity
|d| β‰₯ 0.8: large effect | β‰₯ 0.5: medium | β‰₯ 0.2: small

β€’ time_to_open_hours: No significant velocity difference between cohorts
β€’ send_hour: No significant velocity difference between cohorts

🎯 FEATURE RECOMMENDATIONS
──────────────────────────────────────────────────────────────────────

No velocity/acceleration features recommended (no strong signals found).

1c.13 Momentum Analysis (Window Ratios)ΒΆ

πŸ“– What is Momentum?

Momentum compares recent activity to historical activity for each customer:

Momentum = mean(value over last N days) / mean(value over last M days)

Where N < M (e.g., 7d/30d compares last week to last month).

| Momentum Value | Interpretation |
|---|---|
| > 1.0 | Recent activity higher than historical → engagement increasing |
| < 1.0 | Recent activity lower than historical → potential churn signal |
| ≈ 1.0 | Stable behavior |

Window Pairs Analyzed:

  • Natural pairs (week/month/quarter): 7d/30d, 30d/90d, 7d/90d
  • Recommended pairs from pattern_config (based on 01a aggregation windows)
  • Accumulation pair: recent activity vs all-time behavior
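A minimal sketch of one momentum ratio for a single entity (illustrative column names; the analyzer computes this per entity and then compares cohort means):

```python
import pandas as pd

# One entity's events with a numeric activity flag (synthetic)
df_entity = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-05", "2024-03-12", "2024-03-25",
                          "2024-03-28", "2024-03-29", "2024-03-30"]),
    "opened": [1, 1, 1, 0, 0, 1],
})
ref = df_entity["ts"].max()

def window_mean(frame: pd.DataFrame, days: int) -> float:
    """Mean of `opened` over the trailing `days`-day window ending at `ref`."""
    recent = frame[frame["ts"] > ref - pd.Timedelta(days=days)]
    return recent["opened"].mean()

momentum_7_30 = window_mean(df_entity, 7) / window_mean(df_entity, 30)
# < 1.0 here: the last week is quieter than the last month
```

In practice the denominator needs a guard (an entity with no events in the long window yields a division by NaN), which is one reason to compute these ratios through the framework rather than ad hoc.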
InΒ [25]:
Show/Hide Code
# Momentum Analysis - Cohort Comparison
momentum_recs = []
if ENTITY_COLUMN and TARGET_COLUMN and sparkline_cols:
    print("="*70)
    print("MOMENTUM ANALYSIS (Window Ratios)")
    print("="*70)

    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(time_column=TIME_COLUMN, entity_column=ENTITY_COLUMN)

    # Use sparkline_cols directly (includes all numeric features ranked by variance)
    momentum_cols = sparkline_cols[:6]

    # Build comprehensive window pairs from multiple sources:
    # 1. Standard natural pairs (week/month/quarter)
    natural_pairs = [(7, 30), (30, 90), (7, 90)]

    # 2. Recommended pairs from pattern_config (based on 01a aggregation windows)
    recommended_pairs = pattern_config.get_momentum_pairs()

    # 3. Accumulation pair: shortest window vs all-time
    max_days = (df[TIME_COLUMN].max() - df[TIME_COLUMN].min()).days
    all_windows = [w for pair in natural_pairs + recommended_pairs for w in pair]
    shortest_window = min(all_windows) if all_windows else 7
    accumulation_pair = (shortest_window, max_days)

    # Combine and deduplicate (preserve order: natural first, then recommended, then accumulation)
    seen = set()
    window_pairs = []
    for pair in natural_pairs + recommended_pairs + [accumulation_pair]:
        if pair not in seen:
            window_pairs.append(pair)
            seen.add(pair)

    print(f"Analyzing {len(momentum_cols)} features across {len(window_pairs)} window pairs:")
    print(f"  Natural pairs (week/month/quarter): {natural_pairs}")
    print(f"  Recommended pairs (from 01a): {recommended_pairs}")
    print(f"  Accumulation pair: {shortest_window}d vs all-time ({max_days}d)")
    print(f"  Combined (deduplicated): {len(window_pairs)} pairs")

    all_momentum_results = {}
    for col in momentum_cols:
        results = feature_analyzer.compute_cohort_momentum_signals(
            df, [col], TARGET_COLUMN, window_pairs=window_pairs
        )
        all_momentum_results[col] = results[col]

    print("\n\U0001f4ca Momentum by Cohort:")
    print(f"{'Feature':<18} {'Window':<12} {'Retained':>10} {'Churned':>10} {'Effect d':>10}")
    print("-" * 62)
    for col, col_results in all_momentum_results.items():
        for r in col_results:
            label = r.window_label if r.long_window < 1000 else f"{r.short_window}d/all"
            print(f"{col[:17]:<18} {label:<12} {r.retained_momentum:>10.2f} {r.churned_momentum:>10.2f} {r.effect_size:>10.2f}")

    # Bar chart for best window pair per feature - with window labels above bars
    best_pair_data = {}
    best_window_labels = {}  # Track which window was best
    for col, col_results in all_momentum_results.items():
        best = max(col_results, key=lambda r: abs(r.effect_size))
        best_pair_data[col] = {"retained": best.retained_momentum, "churned": best.churned_momentum}
        best_window_labels[col] = best.window_label if best.long_window < 1000 else f"{best.short_window}d/all"

    if best_pair_data:
        import plotly.graph_objects as go
        columns = list(best_pair_data.keys())
        col_labels = [c[:15] for c in columns]

        # Find max y value for positioning labels above bars
        max_y = max(max(best_pair_data[c]["retained"], best_pair_data[c]["churned"]) for c in columns)

        fig = go.Figure()
        fig.add_trace(go.Bar(
            name="\U0001f7e2 Retained", x=col_labels,
            y=[best_pair_data[c]["retained"] for c in columns],
            marker_color=charts.colors["success"],
        ))
        fig.add_trace(go.Bar(
            name="\U0001f534 Churned", x=col_labels,
            y=[best_pair_data[c]["churned"] for c in columns],
            marker_color=charts.colors["danger"],
        ))
        fig.add_hline(y=1.0, line_dash="dash", line_color="gray",
                     annotation_text="baseline", annotation_position="right")

        # Add window labels as annotations above each bar group
        for i, col in enumerate(columns):
            window_lbl = best_window_labels[col]
            fig.add_annotation(
                x=i, y=max_y * 1.08,
                text=f"<b>{window_lbl}</b>",
                showarrow=False,
                font=dict(size=10, color="#555"),
                xref="x", yref="y",
            )

        fig.update_layout(
            title="Momentum Comparison (Best Window per Feature)",
            xaxis_title="Feature",
            yaxis_title="Momentum Ratio",
            barmode="group",
            height=400,
            yaxis=dict(range=[0, max_y * 1.15]),  # Extra headroom for labels
        )
        display_figure(fig)

    print("\n" + "\u2500"*70)
    print("\U0001f4d6 INTERPRETATION")
    print("\u2500"*70)
    print("\nMomentum = recent_mean / historical_mean (per entity, then averaged)")
    print("> 1.0 = accelerating | < 1.0 = decelerating | \u2248 1.0 = stable")
    print("|d| measures how differently retained vs churned customers behave\n")

    interpretation_notes = feature_analyzer.generate_momentum_interpretation(all_momentum_results)
    for note in interpretation_notes:
        print(note)

    print("\n" + "\u2500"*70)
    print("\U0001f3af FEATURE RECOMMENDATIONS")
    print("\u2500"*70)

    momentum_recs = feature_analyzer.generate_momentum_recommendations(all_momentum_results)
    if momentum_recs:
        for rec in momentum_recs:
            priority_marker = "\U0001f534" if rec.priority == 1 else "\U0001f7e1"
            print(f"\n{priority_marker} {rec.action.upper()}")
            print(f"   Column: {rec.source_column}")
            print(f"   {rec.description}")
            print(f"   Params: {rec.params}")
    else:
        print("\nNo momentum features recommended (no strong cohort separation found).")

# Store momentum recommendations for pattern_summary
MOMENTUM_RECOMMENDATIONS = [{"action": r.action, "source_column": r.source_column,
                              "description": r.description, "priority": r.priority,
                              "effect_size": r.effect_size, "params": r.params,
                              "features": [f"{r.source_column}_momentum_{r.params['short_window']}_{r.params['long_window']}"]}
                             for r in momentum_recs] if momentum_recs else []
======================================================================
MOMENTUM ANALYSIS (Window Ratios)
======================================================================
Analyzing 6 features across 5 window pairs:
  Natural pairs (week/month/quarter): [(7, 30), (30, 90), (7, 90)]
  Recommended pairs (from 01a): [(180, 365)]
  Accumulation pair: 7d vs all-time (3285d)
  Combined (deduplicated): 5 pairs
πŸ“Š Momentum by Cohort:
Feature            Window         Retained    Churned   Effect d
--------------------------------------------------------------
time_to_open_hour  7d/30d             1.00       0.97       0.00
time_to_open_hour  30d/90d            1.00       0.99       0.00
time_to_open_hour  7d/90d             1.00       0.97       0.00
time_to_open_hour  180d/365d          1.06       0.99       0.24
time_to_open_hour  7d/all             1.00       1.06       0.00
send_hour          7d/30d             1.00       1.00      -0.01
send_hour          30d/90d            1.00       0.99       0.12
send_hour          7d/90d             1.00       0.98       0.21
send_hour          180d/365d          1.01       1.00       0.04
send_hour          7d/all             1.12       1.01       0.44
opened             7d/30d             1.00       1.10       0.00
opened             30d/90d            0.00       0.99       0.00
opened             7d/90d             1.00       0.99       0.00
opened             180d/365d          0.55       0.98      -0.56
opened             7d/all             0.00       0.97      -0.79
clicked            7d/30d             1.00       1.12       0.00
clicked            30d/90d            0.00       0.88       0.00
clicked            7d/90d             1.00       0.92       0.00
clicked            180d/365d          0.54       0.97      -0.54
clicked            7d/all             0.00       0.76      -0.40
unsubscribed       7d/30d             1.00       1.00       0.00
unsubscribed       30d/90d            1.36       1.00       0.00
unsubscribed       7d/90d             1.00       1.00       0.00
unsubscribed       180d/365d          1.83       1.00       0.00
unsubscribed       7d/all            17.50       1.00       0.00
bounced            7d/30d             1.00       1.00       0.00
bounced            30d/90d            1.00       0.86       0.00
bounced            7d/90d             1.00       1.00       0.00
bounced            180d/365d          1.75       0.85       0.74
bounced            7d/all             0.00       0.28       0.00
[Figure: momentum comparison bar chart, best window pair per feature]
──────────────────────────────────────────────────────────────────────
πŸ“– INTERPRETATION
──────────────────────────────────────────────────────────────────────

Momentum = recent_mean / historical_mean (per entity, then averaged)
> 1.0 = accelerating | < 1.0 = decelerating | β‰ˆ 1.0 = stable
|d| measures how differently retained vs churned customers behave

β€’ time_to_open_hours: Moderate signal at 180d/365d (d=0.24) - retained=1.06, churned=0.99
β€’ send_hour: Moderate signal at 7d/3285d (d=0.44) - retained=1.12, churned=1.01
β€’ opened: Strong signal at 7d/3285d - retained decelerating (0.00), churned stable (0.97), d=-0.79
β€’ clicked: Strong signal at 180d/365d - retained decelerating (0.54), churned stable (0.97), d=-0.54
β€’ unsubscribed: No significant momentum difference between cohorts
β€’ bounced: Strong signal at 180d/365d - retained accelerating (1.75), churned decelerating (0.85), d=0.74

──────────────────────────────────────────────────────────────────────
🎯 FEATURE RECOMMENDATIONS
──────────────────────────────────────────────────────────────────────

🟑 ADD_MOMENTUM_FEATURE
   Column: opened
   Add 7d/3285d momentum for opened (d=-0.79)
   Params: {'short_window': 7, 'long_window': 3285}

🟑 ADD_MOMENTUM_FEATURE
   Column: bounced
   Add 180d/365d momentum for bounced (d=0.74)
   Params: {'short_window': 180, 'long_window': 365}

🟑 ADD_MOMENTUM_FEATURE
   Column: clicked
   Add 180d/365d momentum for clicked (d=-0.54)
   Params: {'short_window': 180, 'long_window': 365}

1c.14 Lag Correlation AnalysisΒΆ

πŸ“– Why Lag Correlations Matter:

Lag correlations show how a metric relates to itself over time:

  • High lag-1 correlation: Today's value predicts tomorrow's
  • Decaying correlations: Effect diminishes over time
  • Periodic spikes: Seasonality (e.g., spike at lag 7 = weekly pattern)
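Lag-k autocorrelation is simply the correlation of a series with itself shifted by k days. The synthetic example below has a 7-day cycle, so the lag-7 correlation spikes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = np.arange(140)
# Synthetic daily metric with a weekly cycle plus noise
series = pd.Series(np.sin(2 * np.pi * days / 7) + 0.3 * rng.standard_normal(140))

def lag_corr(s: pd.Series, lag: int) -> float:
    """Pearson correlation between the series and its `lag`-day shift."""
    return s.corr(s.shift(lag))

corr_lag7 = lag_corr(series, 7)   # high: the cycle repeats every 7 days
corr_lag3 = lag_corr(series, 3)   # lower: mid-cycle, out of phase
```

A real weekly pattern shows this same signature: the lag-7 (and lag-14) correlations stand out against neighboring lags.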
InΒ [26]:
Show/Hide Code
# Lag Correlation Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and sparkline_cols:
    lag_cols = sparkline_cols[:6]
    max_lag = 14

    print("="*70)
    print("LAG CORRELATION ANALYSIS")
    print("="*70)

    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(time_column=TIME_COLUMN, entity_column=ENTITY_COLUMN)

    # Calculate lag correlations using framework
    lag_results = feature_analyzer.calculate_lag_correlations(df, lag_cols, max_lag=max_lag)

    # Build data for heatmap
    lag_corr_data = {col: result.correlations for col, result in lag_results.items()}

    # Use ChartBuilder for visualization
    fig = charts.lag_correlation_heatmap(
        lag_corr_data,
        max_lag=max_lag,
        title="Autocorrelation by Lag (days)"
    )
    display_figure(fig)

    # Display framework results
    print("\nπŸ“Š Best Lag per Variable:")
    for col, result in lag_results.items():
        best_lag_info = f"best lag={result.best_lag}d (r={result.best_correlation:.2f})"
        weekly_info = " [Weekly pattern]" if result.has_weekly_pattern else ""
        print(f"   {col[:25]}: {best_lag_info}{weekly_info}")

    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("πŸ“– INTERPRETATION")
    print("─"*70)
    print("\nLag correlation shows how a variable relates to its PAST values:")
    print("  β€’ r > 0.5: Strong memory - today predicts tomorrow well")
    print("  β€’ r 0.3-0.5: Moderate predictability from past")
    print("  β€’ r < 0.3: Weak autocorrelation - lag features less useful\n")

    interpretation_notes = feature_analyzer.generate_lag_interpretation(lag_results)
    for note in interpretation_notes:
        print(note)

    # RECOMMENDATIONS SECTION
    print("\n" + "─"*70)
    print("🎯 FEATURE RECOMMENDATIONS")
    print("─"*70)

    lag_recs = feature_analyzer.generate_lag_recommendations(lag_results)
    if lag_recs:
        for rec in lag_recs:
            priority_marker = "πŸ”΄" if rec.priority == 1 else "🟑"
            print(f"\n{priority_marker} {rec.action.upper()}")
            print(f"   Column: {rec.source_column}")
            print(f"   {rec.description}")
            print(f"   Params: {rec.params}")
    else:
        print("\nNo lag features recommended (no strong autocorrelation found).")

# Store lag recommendations for pattern_summary
LAG_RECOMMENDATIONS = [{"action": r.action, "source_column": r.source_column,
                         "description": r.description, "priority": r.priority,
                         "features": [f"{r.source_column}_lag_{r.params.get('lag_days', 7)}d"],
                         "params": r.params}
                        for r in lag_recs] if lag_recs else []
======================================================================
LAG CORRELATION ANALYSIS
======================================================================
[Figure: autocorrelation-by-lag heatmap]
πŸ“Š Best Lag per Variable:
   time_to_open_hours: best lag=5d (r=0.03)
   send_hour: best lag=2d (r=0.03)
   opened: best lag=9d (r=0.02)
   clicked: best lag=6d (r=-0.03)
   unsubscribed: best lag=12d (r=0.04)
   bounced: best lag=3d (r=-0.04)

──────────────────────────────────────────────────────────────────────
πŸ“– INTERPRETATION
──────────────────────────────────────────────────────────────────────

Lag correlation shows how a variable relates to its PAST values:
  β€’ r > 0.5: Strong memory - today predicts tomorrow well
  β€’ r 0.3-0.5: Moderate predictability from past
  β€’ r < 0.3: Weak autocorrelation - lag features less useful

All variables show weak autocorrelation (r < 0.3)
  β†’ Lag features may not be highly predictive
  β†’ Consider aggregated/rolling features instead

──────────────────────────────────────────────────────────────────────
🎯 FEATURE RECOMMENDATIONS
──────────────────────────────────────────────────────────────────────

No lag features recommended (no strong autocorrelation found).

1c.15 Predictive Power Analysis (IV & KS Statistics)ΒΆ

πŸ“– Information Value (IV) and KS Statistics:

These metrics measure how well features predict the target at entity level:

| Metric | What It Measures | Interpretation |
|---|---|---|
| IV | Predictive strength across bins | <0.02 weak, 0.02-0.1 medium, 0.1-0.3 strong, >0.3 very strong |
| KS | Maximum separation between distributions | Higher = better class separation |

How These Relate to Other Sections:

| Section | Metric | Relationship to IV/KS |
|---|---|---|
| 1c.10 | Cohen's d | Should correlate - both measure cohort separation. d assumes normality, IV handles non-linear. |
| 1c.12 | Velocity effect sizes | High velocity d → feature changes differently by cohort → may show high IV |
| 1c.13 | Momentum effect sizes | High momentum d → behavioral change patterns differ → may show high IV |
| 1c.16 | Cramér's V | For categorical features (IV/KS is for numeric) |

Validation: Features with high Cohen's d (1c.10) should generally show high IV here. Disagreements may indicate non-linear relationships (IV captures) or outlier effects (KS captures).
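Both metrics can be sketched with pandas/scipy on synthetic entity-level data. This is an illustration, not the framework's implementation; a production version of the IV loop also needs a guard against empty bins, which make the log term blow up:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 2000)          # binary target
x = rng.normal(loc=y.astype(float))   # feature shifted up for y=1

# KS: maximum vertical distance between the two class CDFs
ks_stat = ks_2samp(x[y == 1], x[y == 0]).statistic

# IV: sum over bins of (good% - bad%) * ln(good% / bad%)
bins = pd.qcut(x, 10, duplicates="drop")
counts = pd.crosstab(bins, y)
good = counts[1] / counts[1].sum()    # class-1 share per bin
bad = counts[0] / counts[0].sum()     # class-0 share per bin
iv = float(((good - bad) * np.log(good / bad)).sum())
```

Because the two classes here differ by one standard deviation, both metrics land firmly in the "predictive" range of the table above, and they agree, which is the validation behavior described for Cohen's d.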

InΒ [27]:
Show/Hide Code
# Lag Correlation Analysis using TemporalFeatureAnalyzer
lag_recs = []
if ENTITY_COLUMN and sparkline_cols:
    lag_cols = sparkline_cols[:6]
    max_lag = 14

    print("="*70)
    print("LAG CORRELATION ANALYSIS")
    print("="*70)
    print(f"\nAnalyzing autocorrelation for {len(lag_cols)} columns (max lag: {max_lag} days)")

    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(time_column=TIME_COLUMN, entity_column=ENTITY_COLUMN)

    # calculate_lag_correlations is the framework API (as in 1c.14);
    # TemporalFeatureAnalyzer has no analyze_lag_correlation method
    lag_results = feature_analyzer.calculate_lag_correlations(df, lag_cols, max_lag=max_lag)

    for col, result in lag_results.items():
        print(f"\nπŸ“Š {col}:")
        print(f"   Best lag: {result.best_lag} days (correlation: {result.best_correlation:.3f})")
        if result.has_weekly_pattern:
            print("   [Weekly pattern]")

    if lag_results:
        print("\n" + "="*70)
        print("πŸ” LAG ANALYSIS SUMMARY")
        print("="*70)

        strong_lags = {col: r for col, r in lag_results.items()
                       if abs(r.best_correlation) > 0.3}
        print(f"\nColumns with strong autocorrelation (>0.3): {len(strong_lags)}/{len(lag_results)}")

        if strong_lags:
            print("\nRecommended lag features:")
            for col, r in sorted(strong_lags.items(), key=lambda x: -abs(x[1].best_correlation)):
                print(f"  β†’ {col}_lag_{r.best_lag}d (corr={r.best_correlation:.3f})")

    print("\n" + "="*70)
    print("🎯 FEATURE RECOMMENDATIONS")
    print("─"*70)

    lag_recs = feature_analyzer.generate_lag_recommendations(lag_results)
    if lag_recs:
        for rec in lag_recs:
            priority_marker = "πŸ”΄" if rec.priority == 1 else "🟑"
            print(f"\n{priority_marker} {rec.action.upper()}")
            print(f"   Column: {rec.source_column}")
            print(f"   {rec.description}")
            print(f"   Params: {rec.params}")
    else:
        print("\nNo lag features recommended (no strong autocorrelation found).")

# Store lag recommendations for pattern_summary
LAG_RECOMMENDATIONS = [{"action": r.action, "source_column": r.source_column,
                         "description": r.description, "priority": r.priority,
                         "features": [f"{r.source_column}_lag_{r.params.get('lag_days', 7)}d"],
                         "params": r.params}
                        for r in lag_recs] if lag_recs else []
======================================================================
LAG CORRELATION ANALYSIS
======================================================================

Analyzing autocorrelation for 6 columns (max lag: 14 days)

⚠️ time_to_open_hours, send_hour, opened, clicked, unsubscribed, bounced: Could not analyze - 'TemporalFeatureAnalyzer' object has no attribute 'analyze_lag_correlation'

======================================================================
🎯 FEATURE RECOMMENDATIONS
──────────────────────────────────────────────────────────────────────

No lag features recommended (no strong autocorrelation found).

1c.16 Categorical Feature AnalysisΒΆ

πŸ“– What This Measures:

For each categorical feature, we analyze how its categories relate to the target (retention/churn):

| Metric | What It Measures | How to Read |
|---|---|---|
| CramΓ©r's V | Overall association strength (0-1) | Higher = categories strongly predict target |
| High-Risk Categories | Categories with target rate < 90% of average | These segments churn more |
| Low-Risk Categories | Categories with target rate > 110% of average | These segments retain better |

Panel Guide:

| Panel | What It Shows | Color Scheme |
|---|---|---|
| Top-Left | Feature ranking by CramΓ©r's V | πŸ”΄ Strong β‰₯0.3 / 🟠 Moderate β‰₯0.1 / πŸ”΅ Weak |
| Top-Right | Count of features per effect bucket | 🟣 Purple gradient (darker = more significant bucket) |
| Bottom-Left | High/low risk category counts | πŸ”΄ High-risk (churn) / 🟒 Low-risk (retain) |
| Bottom-Right | Category breakdown (top feature) | πŸ”΄ Below avg / 🟒 Above avg / πŸ”΅ Near avg |

Effect Strength Thresholds:

| CramΓ©r's V | Strength | Action |
|---|---|---|
| β‰₯ 0.3 | Strong | Priority feature - include and consider interactions |
| 0.15–0.3 | Moderate | Include in model |
| 0.05–0.15 | Weak | May add noise, test impact |
| < 0.05 | Negligible | Consider dropping |
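For reference, CramΓ©r's V can be computed directly from a contingency table. A minimal sketch, assuming scipy is available (`analyze_categorical_features` in the next cell handles this internally):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Bias-uncorrected Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0

# Perfect association -> V = 1; independence -> V near 0.
x = pd.Series(["a"] * 50 + ["b"] * 50)
y = pd.Series([1] * 50 + [0] * 50)
# cramers_v(x, y) -> 1.0
```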
InΒ [28]:
Show/Hide Code
# Categorical Feature Analysis
from customer_retention.stages.profiling import analyze_categorical_features

if ENTITY_COLUMN and TARGET_COLUMN:
    print("="*70)
    print("CATEGORICAL FEATURE ANALYSIS")
    print("="*70)

    # Aggregate to entity level (take mode for categorical columns)
    cat_cols = [c for c in df.select_dtypes(include=['object', 'category']).columns
                if c not in [ENTITY_COLUMN, TIME_COLUMN, TARGET_COLUMN]]

    if cat_cols:
        entity_cats_df = df.groupby(ENTITY_COLUMN).agg(
            {c: lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None for c in cat_cols}
        ).reset_index()
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        entity_data = entity_cats_df.merge(entity_target, on=ENTITY_COLUMN)

        cat_result = analyze_categorical_features(entity_data, ENTITY_COLUMN, TARGET_COLUMN)

        print(f"Features analyzed: {len(cat_result.feature_insights)}")
        print(f"Features filtered: {len(cat_result.filtered_columns)}")
        print(f"Overall target rate: {cat_result.overall_target_rate:.1%}")

        if cat_result.feature_insights:
            # VISUALS
            fig = charts.categorical_analysis_panel(cat_result.feature_insights, cat_result.overall_target_rate)
            display_figure(fig)

            # DETAILS TABLE
            print("\nπŸ“Š Feature Details:")
            print(f"{'Feature':<20} {'CramΓ©r V':>10} {'Effect':>12} {'#Cats':>7} {'High Risk':>10} {'Low Risk':>10}")
            print("-" * 75)
            for insight in cat_result.feature_insights:
                print(f"{insight.feature_name[:19]:<20} {insight.cramers_v:>10.3f} {insight.effect_strength:>12} "
                      f"{insight.n_categories:>7} {len(insight.high_risk_categories):>10} {len(insight.low_risk_categories):>10}")

            # INTERPRETATION
            print("\n" + "─"*70)
            print("πŸ“– INTERPRETATION")
            print("─"*70)

            strong = [i for i in cat_result.feature_insights if i.effect_strength == "strong"]
            moderate = [i for i in cat_result.feature_insights if i.effect_strength == "moderate"]
            weak = [i for i in cat_result.feature_insights if i.effect_strength in ("weak", "negligible")]

            if strong:
                print(f"\nStrong predictors ({len(strong)}): {', '.join(i.feature_name for i in strong)}")
                print("  β†’ These features have clear category-target relationships")
                print("  β†’ Include in model, consider one-hot encoding")

            if moderate:
                print(f"\nModerate predictors ({len(moderate)}): {', '.join(i.feature_name for i in moderate)}")
                print("  β†’ Some predictive power, include if cardinality is reasonable")

            if weak:
                print(f"\nWeak/negligible ({len(weak)}): {', '.join(i.feature_name for i in weak)}")
                print("  β†’ Limited predictive value, may add noise")

            # High-risk category insights
            all_high_risk = [(i.feature_name, c) for i in cat_result.feature_insights for c in i.high_risk_categories[:2]]
            if all_high_risk:
                print("\nHigh-risk segments (below-average retention):")
                for feat, cat in all_high_risk[:5]:
                    print(f"  β€’ {feat} = '{cat}'")

            # RECOMMENDATIONS
            print("\n" + "─"*70)
            print("🎯 FEATURE RECOMMENDATIONS")
            print("─"*70)

            if cat_result.recommendations:
                for rec in cat_result.recommendations:
                    priority_marker = "πŸ”΄" if rec.get('priority') == 'high' else "🟑"
                    print(f"\n{priority_marker} {rec.get('action', 'RECOMMENDATION').upper()}")
                    print(f"   {rec.get('reason', '')}")
            else:
                # Generate recommendations based on analysis
                if strong:
                    print("\nπŸ”΄ INCLUDE STRONG PREDICTORS")
                    for i in strong:
                        print(f"   β€’ {i.feature_name}: V={i.cramers_v:.3f}, {i.n_categories} categories")

                if any(i.n_categories > 20 for i in cat_result.feature_insights):
                    high_card = [i for i in cat_result.feature_insights if i.n_categories > 20]
                    print("\n🟑 HIGH CARDINALITY - CONSIDER GROUPING")
                    for i in high_card:
                        print(f"   β€’ {i.feature_name}: {i.n_categories} categories β†’ group rare categories")

                if not strong and not moderate:
                    print("\nNo strong categorical predictors found.")
                    print("   β€’ Consider creating derived features (e.g., category combinations)")
                    print("   β€’ Or focus on numeric/temporal features")
        else:
            print("\nNo categorical features passed filtering criteria.")
            if cat_result.filtered_columns:
                print("Filtered out:")
                for col in cat_result.filtered_columns[:5]:
                    reason = cat_result.filter_reasons.get(col, "unknown")
                    print(f"  β€’ {col}: {reason}")
    else:
        print("No categorical columns found in dataset.")
else:
    print("Skipped: Requires both entity and target columns")
======================================================================
CATEGORICAL FEATURE ANALYSIS
======================================================================
Features analyzed: 3
Features filtered: 2
Overall target rate: 44.8%
[Figure: four-panel categorical feature analysis (CramΓ©r's V ranking, effect buckets, risk category counts, top-feature breakdown)]
πŸ“Š Feature Details:
Feature                CramΓ©r V       Effect   #Cats  High Risk   Low Risk
---------------------------------------------------------------------------
campaign_type             0.091   negligible       6          0          3
subject_line_catego       0.041   negligible       6          0          0
device_type               0.038   negligible       3          0          0

──────────────────────────────────────────────────────────────────────
πŸ“– INTERPRETATION
──────────────────────────────────────────────────────────────────────

Weak/negligible (3): campaign_type, subject_line_category, device_type
  β†’ Limited predictive value, may add noise

──────────────────────────────────────────────────────────────────────
🎯 FEATURE RECOMMENDATIONS
──────────────────────────────────────────────────────────────────────

No strong categorical predictors found.
   β€’ Consider creating derived features (e.g., category combinations)
   β€’ Or focus on numeric/temporal features
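When a categorical feature does carry signal but has high cardinality, the "group rare categories" recommendation from the cell above can be sketched like this (the 5% threshold and the data are illustrative assumptions):

```python
import pandas as pd

s = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 2 + ["d"] * 1)

# Collapse categories below a 5% share into a single "other" bucket so
# one-hot encoding stays compact and rare levels don't overfit.
min_share = 0.05
shares = s.value_counts(normalize=True)
rare = shares[shares < min_share].index
grouped = s.where(~s.isin(rare), "other")
# grouped levels: "a", "b", "other"
```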

1c.17 Feature Engineering SummaryΒΆ

πŸ“‹ Feature Types with Configured Windows:

The table below shows feature formulas using windows derived from 01a findings. Run the next cell to see actual values for your data.

InΒ [29]:
Show/Hide Code
# Feature Engineering Recommendations
print("="*80)
print("FEATURE ENGINEERING RECOMMENDATIONS")
print("="*80)

# Display configured windows from pattern_config
momentum_pairs = pattern_config.get_momentum_pairs()
short_w = momentum_pairs[0][0] if momentum_pairs else 7
long_w = momentum_pairs[0][1] if momentum_pairs else 30

print(f"""
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Feature Type    β”‚ Formula (using configured windows)                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Velocity        β”‚ (value_now - value_{short_w}d_ago) / {short_w}                 β”‚
β”‚ Acceleration    β”‚ velocity_now - velocity_{short_w}d_ago                   β”‚
β”‚ Momentum        β”‚ mean_{short_w}d / mean_{long_w}d                              β”‚
β”‚ Lag             β”‚ df[col].shift(N)                                   β”‚
β”‚ Rolling Mean    β”‚ df[col].rolling({short_w}).mean()                        β”‚
β”‚ Rolling Std     β”‚ df[col].rolling({long_w}).std()                         β”‚
β”‚ Ratio           β”‚ sum_{long_w}d / sum_all_time                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Windows derived from 01a findings: {pattern_config.aggregation_windows}
Velocity window: {pattern_config.velocity_window_days}d
Momentum pairs: {momentum_pairs}
""")

# Framework recommendations (without target - event-level data)
if 'feature_analyzer' in dir() and sparkline_cols:
    recommendations = feature_analyzer.get_feature_recommendations(
        df, value_columns=sparkline_cols, target_column=None
    )
    if recommendations:
        print("🎯 Framework Recommendations (based on temporal patterns):")
        for rec in recommendations[:5]:
            print(f"   β€’ {rec.feature_type.value}: {rec.source_column} β†’ {rec.feature_name}")
            print(f"     Formula: {rec.formula}")
            print(f"     Rationale: {rec.rationale}")
        print("""
πŸ’‘ Note: Target-based recommendations require entity-level data.
   Run notebook 01d first to aggregate, then 02 for target analysis.
""")
================================================================================
FEATURE ENGINEERING RECOMMENDATIONS
================================================================================

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Feature Type    β”‚ Formula (using configured windows)                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Velocity        β”‚ (value_now - value_180d_ago) / 180                 β”‚
β”‚ Acceleration    β”‚ velocity_now - velocity_180d_ago                   β”‚
β”‚ Momentum        β”‚ mean_180d / mean_365d                              β”‚
β”‚ Lag             β”‚ df[col].shift(N)                                   β”‚
β”‚ Rolling Mean    β”‚ df[col].rolling(180).mean()                        β”‚
β”‚ Rolling Std     β”‚ df[col].rolling(365).std()                         β”‚
β”‚ Ratio           β”‚ sum_365d / sum_all_time                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Windows derived from 01a findings: ['180d', '365d', 'all_time']
Velocity window: 180d
Momentum pairs: [(180, 365)]

🎯 Framework Recommendations (based on temporal patterns):
   β€’ momentum: clicked β†’ clicked_momentum_7_30
     Formula: mean_7d / mean_30d
     Rationale: Momentum indicates accelerating behavior

πŸ’‘ Note: Target-based recommendations require entity-level data.
   Run notebook 01d first to aggregate, then 02 for target analysis.
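The formula table above maps directly onto pandas window operations. A short sketch with a toy daily series and illustrative 7/30-day windows (this run resolved 180/365 from the 01a findings):

```python
import pandas as pd

# Toy daily series that increases by exactly 1 per day.
s = pd.Series(range(60), index=pd.date_range("2023-01-01", periods=60, freq="D"), dtype=float)

short_w, long_w = 7, 30
velocity = (s - s.shift(short_w)) / short_w                      # (value_now - value_7d_ago) / 7
acceleration = velocity - velocity.shift(short_w)                # velocity_now - velocity_7d_ago
momentum = s.rolling(short_w).mean() / s.rolling(long_w).mean()  # mean_7d / mean_30d
rolling_std = s.rolling(long_w).std()

# For a linear series: velocity is 1.0/day everywhere and acceleration is 0.
```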

InΒ [30]:
Show/Hide Code
print("\n" + "="*70)
print("TEMPORAL PATTERN SUMMARY")
print("="*70)

# Windows used
print(f"\nβš™οΈ CONFIGURED WINDOWS: {pattern_config.aggregation_windows}")
print(f"   Velocity: {pattern_config.velocity_window_days}d | Momentum: {pattern_config.get_momentum_pairs()}")

# Trend summary
print("\nπŸ“ˆ TREND:")
print(f"   Direction: {trend_result.direction.value}")
print(f"   Confidence: {trend_result.confidence}")

# Seasonality summary
print("\nπŸ” SEASONALITY:")
if seasonality_results:
    for sr in seasonality_results[:2]:
        period_name = sr.period_name or f"{sr.period}-day"
        print(f"   {period_name.title()} pattern (strength: {sr.strength:.2f})")
else:
    print("   No significant seasonality detected")

# Recency summary
if ENTITY_COLUMN:
    print("\n⏱️ RECENCY:")
    print(f"   Median: {recency_result.median_recency_days:.0f} days")
    if recency_result.target_correlation:
        corr = recency_result.target_correlation
        print(f"   Target correlation: {corr:.3f} {'(strong signal)' if abs(corr) > 0.3 else ''}")

# Velocity summary (if computed)
if 'velocity_summary' in dir() and velocity_summary:
    print(f"\nπŸš€ VELOCITY ({pattern_config.velocity_window_days}d window):")
    divergent = [col for col, v in velocity_summary.items() if v.get('divergent')]
    if divergent:
        print(f"   Divergent columns (retained vs churned): {divergent}")
    else:
        print("   No significant divergence between cohorts")

# Momentum summary (if computed)
if 'momentum_data' in dir() and momentum_data:
    print(f"\nπŸ“Š MOMENTUM ({pattern_config.get_momentum_pairs()[0] if pattern_config.get_momentum_pairs() else 'N/A'}):")
    if 'divergent_cols' in dir() and divergent_cols:
        # Filter out target to prevent misleading metadata
        filtered_divergent = [c for c in divergent_cols if c.lower() != TARGET_COLUMN.lower()] if TARGET_COLUMN else divergent_cols
        if filtered_divergent:
            print(f"   High-signal columns: {filtered_divergent}")
        else:
            print("   No significant momentum differences detected (target excluded)")
    else:
        print("   No significant momentum differences detected")
======================================================================
TEMPORAL PATTERN SUMMARY
======================================================================

βš™οΈ CONFIGURED WINDOWS: ['180d', '365d', 'all_time']
   Velocity: 180d | Momentum: [(180, 365)]

πŸ“ˆ TREND:
   Direction: stable
   Confidence: high

πŸ” SEASONALITY:
   Weekly pattern (strength: 0.54)
   Tri-Weekly pattern (strength: 0.53)

⏱️ RECENCY:
   Median: 314 days
   Target correlation: 0.773 (strong signal)
InΒ [31]:
Show/Hide Code
# Feature engineering recommendations based on patterns
print("\n" + "="*70)
print("RECOMMENDED TEMPORAL FEATURES")
print("="*70)

print("\nπŸ› οΈ Based on detected patterns, consider these features:\n")

print("1. RECENCY FEATURES:")
print("   - days_since_last_event")
print("   - log_days_since_last_event (if right-skewed)")
print("   - recency_bucket (categorical: 0-7d, 8-30d, etc.)")

if seasonality_results:
    weekly = any(6 <= sr.period <= 8 for sr in seasonality_results)
    monthly = any(28 <= sr.period <= 32 for sr in seasonality_results)

    print("\n2. SEASONALITY FEATURES:")
    if weekly:
        print("   - is_weekend (binary)")
        print("   - day_of_week_sin, day_of_week_cos (cyclical encoding)")
    if monthly:
        print("   - day_of_month")
        print("   - is_month_start, is_month_end")

print("\n3. TREND-ADJUSTED FEATURES:")
if trend_result.direction in [TrendDirection.INCREASING, TrendDirection.DECREASING]:
    print("   - event_count_recent_vs_overall (ratio)")
    print("   - activity_trend_direction (for each entity)")
else:
    print("   - Standard time-window aggregations should work well")

print("\n4. COHORT FEATURES:")
print("   - cohort_month (categorical or ordinal)")
print("   - tenure_days (days since first event)")
======================================================================
RECOMMENDED TEMPORAL FEATURES
======================================================================

πŸ› οΈ Based on detected patterns, consider these features:

1. RECENCY FEATURES:
   - days_since_last_event
   - log_days_since_last_event (if right-skewed)
   - recency_bucket (categorical: 0-7d, 8-30d, etc.)

2. SEASONALITY FEATURES:
   - is_weekend (binary)
   - day_of_week_sin, day_of_week_cos (cyclical encoding)

3. TREND-ADJUSTED FEATURES:
   - Standard time-window aggregations should work well

4. COHORT FEATURES:
   - cohort_month (categorical or ordinal)
   - tenure_days (days since first event)
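The recommended features above translate into a few lines of pandas. A hedged sketch (the column names, reference date, and data are illustrative, not the pipeline's):

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2023-12-01", "2023-12-20", "2023-11-15"]),
})
as_of = pd.Timestamp("2023-12-31")

# 1. Recency: days since each entity's last event, plus a log variant for skew.
last_seen = events.groupby("customer_id")["event_time"].max()
feats = pd.DataFrame({"days_since_last_event": (as_of - last_seen).dt.days})
feats["log_days_since_last_event"] = np.log1p(feats["days_since_last_event"])

# 2. Seasonality: cyclical day-of-week encoding keeps Sunday adjacent to Monday.
dow = events["event_time"].dt.dayofweek
events["day_of_week_sin"] = np.sin(2 * np.pi * dow / 7)
events["day_of_week_cos"] = np.cos(2 * np.pi * dow / 7)

# 4. Cohort: tenure in days since the first observed event.
first_seen = events.groupby("customer_id")["event_time"].min()
feats["tenure_days"] = (as_of - first_seen).dt.days
```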

1c.18 Save Pattern Analysis ResultsΒΆ

InΒ [32]:
Show/Hide Code
# Store pattern analysis results in findings with actionable recommendations
pattern_summary = {
    "windows_used": {
        # Note: aggregation_windows already stored in ts_metadata.suggested_aggregations
        "velocity_window": pattern_config.velocity_window_days,
        "momentum_pairs": pattern_config.get_momentum_pairs(),
    },
    "trend": {
        "direction": trend_result.direction.value,
        "strength": trend_result.strength,
        "confidence": trend_result.confidence,
        "recommendations": TREND_RECOMMENDATIONS if 'TREND_RECOMMENDATIONS' in dir() else [],
    },
    "seasonality": {
        "patterns": [
            {"period": sr.period, "name": sr.period_name, "strength": sr.strength,
             "window_aligned": sr.period in window_lags if 'window_lags' in dir() else False}
            for sr in seasonality_results
        ],
        "recommendations": [],
    },
}

# Generate seasonality recommendations
seasonality_recs = []
if seasonality_results:
    strong_patterns = [sr for sr in seasonality_results if sr.strength > 0.5]
    moderate_patterns = [sr for sr in seasonality_results if 0.3 < sr.strength <= 0.5]

    for sr in seasonality_results:
        if sr.period == 7:
            seasonality_recs.append({
                "action": "add_cyclical_feature", "feature": "day_of_week", "encoding": "sin_cos",
                "reason": f"Weekly pattern detected (strength={sr.strength:.2f})"
            })
        elif sr.period in [28, 30, 31]:
            seasonality_recs.append({
                "action": "add_cyclical_feature", "feature": "day_of_month", "encoding": "sin_cos",
                "reason": f"Monthly pattern detected (strength={sr.strength:.2f})"
            })
        elif sr.period in [90, 91, 92]:
            seasonality_recs.append({
                "action": "add_cyclical_feature", "feature": "quarter", "encoding": "sin_cos",
                "reason": f"Quarterly pattern detected (strength={sr.strength:.2f})"
            })

    if strong_patterns:
        seasonality_recs.append({
            "action": "consider_deseasonalization", "periods": [sr.period for sr in strong_patterns],
            "reason": "Strong seasonal patterns may dominate signal"
        })

    if 'window_lags' in dir() and window_lags:
        aligned = [sr for sr in seasonality_results if sr.period in window_lags]
        if aligned:
            seasonality_recs.append({
                "action": "window_captures_cycle", "windows": [sr.period for sr in aligned],
                "reason": "Aggregation window aligns with seasonal cycle"
            })
        else:
            seasonality_recs.append({
                "action": "window_partial_cycle",
                "detected_periods": [sr.period for sr in seasonality_results], "windows": window_lags,
                "reason": "Aggregation windows don't align with detected cycles"
            })

pattern_summary["seasonality"]["recommendations"] = seasonality_recs

# Add temporal pattern recommendations
if 'TEMPORAL_PATTERN_RECOMMENDATIONS' in dir() and TEMPORAL_PATTERN_RECOMMENDATIONS:
    pattern_summary["temporal_patterns"] = {
        "patterns": [{"name": r["pattern"], "variation": r.get("variation", 0), "priority": r["priority"]} for r in TEMPORAL_PATTERN_RECOMMENDATIONS],
        "recommendations": [{"pattern": r["pattern"], "features": r["features"], "priority": r["priority"], "reason": r["reason"]} for r in TEMPORAL_PATTERN_RECOMMENDATIONS if r.get("features")],
    }

# Add recency analysis with recommendations
if ENTITY_COLUMN:
    recency_data = {
        "median_days": recency_result.median_recency_days,
        "target_correlation": recency_result.target_correlation,
    }
    if recency_comparison:
        recency_data.update({
            "effect_size": recency_comparison.cohens_d,
            "effect_interpretation": recency_comparison.effect_interpretation,
            "distribution_pattern": recency_comparison.distribution_pattern,
            "inflection_bucket": recency_comparison.inflection_bucket,
            "retained_median": recency_comparison.retained_stats.median,
            "churned_median": recency_comparison.churned_stats.median,
            "key_findings": [{"finding": f.finding, "metric": f.metric_name, "value": f.metric_value} for f in recency_comparison.key_findings],
            "recommendations": RECENCY_RECOMMENDATIONS,
        })
    pattern_summary["recency"] = recency_data

# Add velocity results
if 'velocity_summary' in dir() and velocity_summary:
    pattern_summary["velocity"] = {col: {"mean_velocity": v["mean_velocity"], "direction": v["direction"]} for col, v in velocity_summary.items()}

# Add momentum results
if 'momentum_data' in dir() and momentum_data:
    pattern_summary["momentum"] = {col: {"retained": v["retained"], "churned": v["churned"]} for col, v in momentum_data.items()}
    if 'divergent_cols' in dir():
        pattern_summary["momentum"]["_divergent_columns"] = [c for c in divergent_cols if c.lower() != TARGET_COLUMN.lower()] if TARGET_COLUMN else divergent_cols

# Add cohort analysis results
if 'COHORT_RECOMMENDATIONS' in dir() and COHORT_RECOMMENDATIONS:
    pattern_summary["cohort"] = {"recommendations": COHORT_RECOMMENDATIONS}


# Add categorical analysis results
if 'cat_result' in dir() and cat_result.feature_insights:
    pattern_summary["categorical"] = {
        "overall_target_rate": cat_result.overall_target_rate,
        "features_analyzed": len(cat_result.feature_insights),
        "columns_filtered": len(cat_result.filtered_columns),
        "insights": [
            {"feature": i.feature_name, "cramers_v": i.cramers_v, "effect_strength": i.effect_strength,
             "high_risk": i.high_risk_categories[:3], "low_risk": i.low_risk_categories[:3]}
            for i in cat_result.feature_insights[:10]
        ],
        "recommendations": cat_result.recommendations,
        "key_findings": cat_result.key_findings,
    }

# Add velocity recommendations (the "velocity" key may be absent if velocity_summary was skipped)
if 'VELOCITY_RECOMMENDATIONS' in dir() and VELOCITY_RECOMMENDATIONS:
    pattern_summary.setdefault("velocity", {})["recommendations"] = VELOCITY_RECOMMENDATIONS

# Add momentum recommendations (separate from momentum data which is already stored)
if 'MOMENTUM_RECOMMENDATIONS' in dir() and MOMENTUM_RECOMMENDATIONS:
    if "momentum" not in pattern_summary:
        pattern_summary["momentum"] = {}
    pattern_summary["momentum"]["recommendations"] = MOMENTUM_RECOMMENDATIONS

# Add lag correlation recommendations
if 'LAG_RECOMMENDATIONS' in dir() and LAG_RECOMMENDATIONS:
    pattern_summary["lag"] = {"recommendations": LAG_RECOMMENDATIONS}

# Add sparkline analysis recommendations (trend, seasonality, scaling)
if 'SPARKLINE_RECOMMENDATIONS' in dir() and SPARKLINE_RECOMMENDATIONS:
    pattern_summary["sparkline"] = {"recommendations": SPARKLINE_RECOMMENDATIONS}

# Add effect size recommendations (feature prioritization)
if 'EFFECT_SIZE_RECOMMENDATIONS' in dir() and EFFECT_SIZE_RECOMMENDATIONS:
    pattern_summary["effect_size"] = {"recommendations": EFFECT_SIZE_RECOMMENDATIONS}

# Add predictive power recommendations (IV/KS based)
if 'PREDICTIVE_POWER_RECOMMENDATIONS' in dir() and PREDICTIVE_POWER_RECOMMENDATIONS:
    pattern_summary["predictive_power"] = {"recommendations": PREDICTIVE_POWER_RECOMMENDATIONS}

# Generate feature flags for 01d aggregation
# These flags tell 01d which optional features to include based on analysis results
pattern_summary["feature_flags"] = {
    "include_recency": (
        recency_comparison.cohens_d > 0.2
        if 'recency_comparison' in dir() and recency_comparison
        else True
    ),
    "include_tenure": True,  # Default on; could be derived from tenure analysis if available
    "include_lifecycle_quadrant": ts_meta.temporal_segmentation_recommendation is not None if 'ts_meta' in dir() else False,
    "include_trend_features": bool(pattern_summary.get("trend", {}).get("recommendations")),
    "include_seasonality_features": bool(pattern_summary.get("seasonality", {}).get("recommendations")),
    "include_cohort_features": not any(
        r.get("action") == "skip_cohort_features"
        for r in pattern_summary.get("cohort", {}).get("recommendations", [])
    ),
}

# Save to findings
if not findings.metadata:
    findings.metadata = {}
findings.metadata["temporal_patterns"] = pattern_summary
findings.save(FINDINGS_PATH)

print(f"Pattern analysis saved to: {FINDINGS_PATH}")
print(f"Saved sections: {list(pattern_summary.keys())}")

# Print recency recommendations
if pattern_summary.get("recency", {}).get("recommendations"):
    recency_recs = pattern_summary["recency"]["recommendations"]
    print(f"\n⏱️ RECENCY FEATURES TO ADD ({len(recency_recs)}):")
    for rec in recency_recs:
        priority_icon = {"high": "πŸ”΄", "medium": "🟑", "low": "🟒"}.get(rec.get("priority", "medium"), "βšͺ")
        features = rec.get("features", [])
        if features:
            print(f"   {priority_icon} [{rec['priority'].upper()}] {', '.join(features)}")
            print(f"      {rec['reason']}")

# Print cohort recommendations
if "cohort" in pattern_summary:
    cohort_recs = pattern_summary["cohort"].get("recommendations", [])
    feature_recs = [r for r in cohort_recs if r.get("features")]
    skip_recs = [r for r in cohort_recs if r.get("action") == "skip_cohort_features"]
    if skip_recs:
        print(f"\nπŸ‘₯ COHORT: Skip cohort features ({skip_recs[0]['reason']})")
    elif feature_recs:
        print("\nπŸ‘₯ COHORT FEATURES TO ADD:")
        for rec in feature_recs:
            print(f"   β€’ {', '.join(rec['features'])} ({rec['priority']} priority)")

# Print trend recommendations
if pattern_summary.get("trend", {}).get("recommendations"):
    trend_recs = [r for r in pattern_summary["trend"]["recommendations"] if r.get("features")]
    if trend_recs:
        print(f"\nπŸ“ˆ TREND FEATURES TO ADD ({len(trend_recs)}):")
        for rec in trend_recs:
            print(f"   β€’ {', '.join(rec['features'])} ({rec['priority']} priority)")

# Print temporal pattern recommendations
if "temporal_patterns" in pattern_summary:
    tp_recs = pattern_summary["temporal_patterns"].get("recommendations", [])
    if tp_recs:
        print(f"\nπŸ“… TEMPORAL PATTERN FEATURES TO ADD ({len(tp_recs)}):")
        for rec in tp_recs:
            print(f"   β€’ {rec['pattern']}: {', '.join(rec['features'])} ({rec['priority']} priority)")


# Print categorical recommendations
if pattern_summary.get("categorical", {}).get("recommendations"):
    cat_recs = pattern_summary["categorical"]["recommendations"]
    print(f"\n🏷️ CATEGORICAL FEATURE RECOMMENDATIONS ({len(cat_recs)}):")
    for rec in cat_recs:
        priority_icon = {"high": "πŸ”΄", "medium": "🟑", "low": "🟒"}.get(rec.get("priority", "medium"), "βšͺ")
        features = rec.get("features", [])
        if features:
            print(f"   {priority_icon} [{rec['priority'].upper()}] {rec['action']}")
            print(f"      {rec['reason']}")

# Print seasonality recommendations
if seasonality_recs:
    print(f"\nπŸ” SEASONALITY RECOMMENDATIONS ({len(seasonality_recs)}):")
    for rec in seasonality_recs:
        action = rec["action"].replace("_", " ").title()
        print(f"   β€’ {action}: {rec['reason']}")
Pattern analysis saved to: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_findings.yaml
Saved sections: ['windows_used', 'trend', 'seasonality', 'temporal_patterns', 'recency', 'cohort', 'categorical', 'momentum', 'sparkline', 'effect_size', 'feature_flags']

⏱️ RECENCY FEATURES TO ADD (1):
   πŸ”΄ [HIGH] days_since_last_event, log_recency
      Target=1 is minority (45%) - interpret as CHURN; recency pattern is classic churn behavior

πŸ‘₯ COHORT: Skip cohort features (90% onboarded in 2015 - insufficient variation)

πŸ“… TEMPORAL PATTERN FEATURES TO ADD (5):
   β€’ month: month_sin, month_cos (medium priority)
   β€’ year: year_categorical (high priority)
   β€’ 7d_cycle: lag_7d_ratio (medium priority)
   β€’ 21d_cycle: lag_21d_ratio (medium priority)
   β€’ 14d_cycle: lag_14d_ratio (medium priority)

πŸ” SEASONALITY RECOMMENDATIONS (3):
   β€’ Add Cyclical Feature: Weekly pattern detected (strength=0.54)
   β€’ Consider Deseasonalization: Strong seasonal patterns may dominate signal
   β€’ Window Partial Cycle: Aggregation windows don't align with detected cycles
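The `include_recency` flag above keys off Cohen's d > 0.2. A minimal sketch of that effect size, with hypothetical recency values (not this run's data):

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return float((a.mean() - b.mean()) / pooled)

churned = [40.0, 55.0, 47.0, 52.0]   # hypothetical: churned users went quiet weeks ago
retained = [5.0, 7.0, 6.0, 5.5]      # hypothetical: retained users were active recently
d = cohens_d(churned, retained)      # large d -> recency cleanly separates the cohorts
```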

1c.19 Record Snapshot Grid VoteΒΆ

Record this dataset's temporal analysis results as a vote on the snapshot grid. The grid uses these votes to determine readiness for entity-level aggregation in notebook 1d.

InΒ [33]:
Show/Hide Code
from customer_retention.analysis.auto_explorer.snapshot_grid import DatasetGridVote, SnapshotGrid

_grid_path = _namespace.snapshot_grid_path
if _grid_path.exists():
    _snap_grid = SnapshotGrid.load(_grid_path)

    _data_start = str(df[TIME_COLUMN].min().date()) if TIME_COLUMN in df.columns else None
    _data_end = str(df[TIME_COLUMN].max().date()) if TIME_COLUMN in df.columns else None

    _vote = DatasetGridVote(
        dataset_name=dataset_name,
        granularity=findings.metadata.get("granularity", "event_level") if findings.metadata else "event_level",
        voted=True,
        data_span_start=_data_start,
        data_span_end=_data_end,
    )
    _snap_grid.record_vote(dataset_name, _vote)
    _snap_grid.save(_grid_path)

    print(f"Snapshot grid vote recorded for '{dataset_name}'")
    print(f"  Data span: {_data_start} to {_data_end}")
    _ready, _missing = _snap_grid.is_ready_for_aggregation()
    if _ready:
        print("  Grid status: READY for aggregation")
    else:
        print(f"  Grid status: waiting on {_missing}")
else:
    print("No snapshot grid found β€” skipping vote (run notebook 00 first)")
Snapshot grid vote recorded for 'customer_emails'
  Data span: 2015-01-01 to 2023-12-30
  Grid status: READY for aggregation

Summary: What We LearnedΒΆ

In this notebook, we analyzed temporal patterns:

  1. Trend Detection - Identified long-term direction in data
  2. Seasonality - Found periodic patterns (weekly, monthly)
  3. Cohort Analysis - Compared behavior by entity join date
  4. Recency Analysis - Measured how recent activity relates to outcomes
  5. Feature Recommendations - Generated feature engineering suggestions

Pattern SummaryΒΆ

| Pattern | Status | Recommendation |
|---|---|---|
| Trend | Check findings | Detrend if strong |
| Seasonality | Check findings | Add cyclical features |
| Cohort Effects | Check findings | Add cohort indicators |
| Recency Effects | Check findings | Prioritize recent windows |

Next StepsΒΆ

Complete the Event Bronze Track:

  • 01d_event_aggregation.ipynb - Aggregate events to entity-level (produces new dataset)

After 01d produces the aggregated dataset, continue with:

  • 04_column_deep_dive.ipynb - Profile aggregated feature distributions
  • 02_source_integrity.ipynb - Quality checks on aggregated data
  • 05_relationship_analysis.ipynb - Feature correlations and relationships

The aggregated data from 01d becomes the input for the Entity Bronze Track.

Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.