Chapter 1a: Temporal Deep Dive (Event Bronze Track)¶
Purpose: Analyze event-level (time series) datasets with a focus on temporal patterns, entity lifecycles, and event frequency distributions.
When to use this notebook:
- Your dataset was detected as EVENT_LEVEL granularity in notebook 01
- You have multiple rows per entity (customer, user, etc.)
- Each row represents an event with a timestamp
What you'll learn:
- How to profile entity lifecycles (first event, last event, duration)
- Understanding event frequency distributions per entity
- Inter-event timing patterns and their implications
- Time series-specific feature engineering opportunities
Outputs:
- Entity lifecycle visualizations
- Event frequency distribution analysis
- Inter-event timing statistics
- Updated exploration findings with time series metadata
Understanding Time Series Profiling¶
| Metric | Description | Why It Matters |
|---|---|---|
| Events per Entity | Distribution of event counts | Identifies power users vs. one-time users |
| Entity Lifecycle | Duration from first to last event | Reveals customer tenure patterns |
| Inter-event Time | Time between consecutive events | Indicates engagement patterns |
| Time Span | Overall data period coverage | Helps plan time window aggregations |
Aggregation Windows (used in notebook 01d):
- 24h: Very recent activity
- 7d: Weekly patterns
- 30d: Monthly patterns
- 90d: Quarterly trends
- 180d: Semi-annual patterns
- 365d: Annual patterns
- all-time: Historical totals
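Conceptually, the lifecycle metrics in the table above reduce to a groupby over the entity column. A minimal pandas sketch with toy data (`customer_id` and `event_ts` are assumed column names for illustration, not the profiler's API):

```python
import pandas as pd

# Toy event log: two customers with timestamped events (names are assumptions)
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "event_ts": pd.to_datetime([
        "2023-01-01", "2023-01-08", "2023-02-01", "2023-03-01",
    ]),
})

# Lifecycle metrics from the table above: first/last event, event count, tenure
lifecycle = events.groupby("customer_id")["event_ts"].agg(
    first_event="min", last_event="max", event_count="count"
)
lifecycle["tenure_days"] = (lifecycle["last_event"] - lifecycle["first_event"]).dt.days

# Inter-event time: per-entity diffs between consecutive events, in days
gaps = (
    events.sort_values("event_ts")
    .groupby("customer_id")["event_ts"]
    .diff()
    .dropna()
    .dt.days
)
print(lifecycle)
print("median gap (days):", gaps.median())
```

Customer `a` yields a 31-day tenure with gaps of 7 and 24 days; customer `b` has a single event, so no inter-event gap and no temporal features are possible for it.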
1a.1 Load Previous Findings¶
from customer_retention.analysis.notebook_progress import track_and_export_previous
track_and_export_previous("01a_temporal_deep_dive.ipynb")
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import display_figure, display_table
from customer_retention.core.config.column_config import DatasetGranularity
from customer_retention.core.config.experiments import (
FINDINGS_DIR,
)
from customer_retention.stages.profiling import (
TimeSeriesProfiler,
TypeDetector,
)
from customer_retention.analysis.auto_explorer import load_notebook_findings
DATASET_NAME = None # Set to override auto-resolved dataset, e.g. "3set_support_tickets"
FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings("01a_temporal_deep_dive.ipynb")
if DATASET_NAME is not None:
dataset_name = DATASET_NAME
print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_findings.yaml

Loaded findings for 13 columns from ../tests/fixtures/customer_emails.csv
# Verify this is a time series dataset
if findings.is_time_series:
ts_meta = findings.time_series_metadata
temporal_pattern = (ts_meta.temporal_pattern or "unknown").upper()
print(f"\u2705 Dataset confirmed as {temporal_pattern} (event-level)")
print(f" Entity column: {ts_meta.entity_column}")
print(f" Time column: {ts_meta.time_column}")
print(f" Avg events per entity: {ts_meta.avg_events_per_entity:.1f}" if ts_meta.avg_events_per_entity else "")
else:
print("\u26a0\ufe0f This dataset was NOT detected as time series.")
print(" Consider using 04_column_deep_dive.ipynb instead.")
print(" Or manually specify entity and time columns below.")
✅ Dataset confirmed as EVENT_LOG (event-level)
  Entity column: customer_id
  Time column: sent_date
  Avg events per entity: 16.6
1a.2 Load Source Data & Configure Columns¶
from customer_retention.analysis.auto_explorer.active_dataset_store import load_active_dataset
from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS
df = load_active_dataset(_namespace, dataset_name)
print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {dataset_name}")
Loaded 83,198 rows x 13 columns
Data source: customer_emails
# === COLUMN CONFIGURATION ===
# These will be auto-populated from findings if available
# Override manually if needed
if findings.is_time_series and findings.time_series_metadata:
ENTITY_COLUMN = findings.time_series_metadata.entity_column
TIME_COLUMN = findings.time_series_metadata.time_column
else:
# Manual configuration - uncomment and set if auto-detection failed
# ENTITY_COLUMN = "customer_id"
# TIME_COLUMN = "event_date"
# Try auto-detection
detector = TypeDetector()
granularity = detector.detect_granularity(df)
ENTITY_COLUMN = granularity.entity_column
TIME_COLUMN = granularity.time_column
print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")
if not ENTITY_COLUMN or not TIME_COLUMN:
raise ValueError("Please set ENTITY_COLUMN and TIME_COLUMN manually above")
Entity column: customer_id
Time column: sent_date
1a.3 Time Series Profile Overview¶
What we analyze:
- Total events and unique entities
- Time span coverage
- Events per entity distribution
- Entity lifecycle metrics
# Create the time series profiler and run analysis
profiler = TimeSeriesProfiler(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN)
ts_profile = profiler.profile(df)
print("="*70)
print("TIME SERIES PROFILE SUMMARY")
print("="*70)
print("\n\U0001f4ca Dataset Overview:")
print(f" Total Events: {ts_profile.total_events:,}")
print(f" Unique Entities: {ts_profile.unique_entities:,}")
print(f" Avg Events/Entity: {ts_profile.events_per_entity.mean:.1f}")
print(f" Time Span: {ts_profile.time_span_days:,} days ({ts_profile.time_span_days/365:.1f} years)")
print("\n\U0001f4c5 Date Range:")
print(f" First Event: {ts_profile.first_event_date}")
print(f" Last Event: {ts_profile.last_event_date}")
print("\n\u23f1\ufe0f Inter-Event Timing:")
if ts_profile.avg_inter_event_days is not None:
print(f" Avg Days Between Events: {ts_profile.avg_inter_event_days:.1f}")
else:
print(" Not enough data to compute inter-event timing")
======================================================================
TIME SERIES PROFILE SUMMARY
======================================================================

📊 Dataset Overview:
  Total Events: 83,198
  Unique Entities: 4,998
  Avg Events/Entity: 16.6
  Time Span: 3,285 days (9.0 years)

📅 Date Range:
  First Event: 2015-01-01 00:00:00
  Last Event: 2023-12-30 00:00:00

⏱️ Inter-Event Timing:
  Avg Days Between Events: 145.5
1a.4 Events per Entity Distribution¶
Goal: Understand how event volume varies across entities to guide feature engineering and identify modeling challenges.
| Segment | Definition | Why It Matters for Modeling |
|---|---|---|
| One-time | Exactly 1 event | No temporal features possible; cold-start problem |
| Low Activity | Below Q25 | Sparse features, many zeros; log-transform counts |
| Medium Activity | Q25 to Q75 | Core population; standard aggregation windows work |
| High Activity | Above Q75 | Rich features; watch for training set dominance |
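The `classify_activity_segments` helper encapsulates this quartile logic; a minimal sketch of the idea with toy data (assumed names and thresholds for illustration, not the package's actual implementation):

```python
import pandas as pd

# Assumed input: one row per entity with its event count
lifecycles = pd.DataFrame({"event_count": [1, 3, 8, 15, 15, 20, 40]})

q25 = lifecycles["event_count"].quantile(0.25)
q75 = lifecycles["event_count"].quantile(0.75)

def segment(n: int) -> str:
    # Order matters: one-time entities are split out before the quartile bands
    if n == 1:
        return "One-time"
    if n < q25:
        return "Low Activity"
    if n <= q75:
        return "Medium Activity"
    return "High Activity"

lifecycles["activity_segment"] = lifecycles["event_count"].map(segment)
print(lifecycles)
```

Note that one-time entities are carved out first, so the Q25/Q75 bands only partition entities that have at least some history.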
from customer_retention.stages.profiling import classify_activity_segments
segment_result = classify_activity_segments(ts_profile.entity_lifecycles)
segment_order = ["One-time", "Low Activity", "Medium Activity", "High Activity"]
segment_colors = {
"One-time": "#d62728", "Low Activity": "#ff7f0e",
"Medium Activity": "#2ca02c", "High Activity": "#1f77b4",
}
event_counts = segment_result.lifecycles["event_count"]
x_max = event_counts.quantile(0.99)
bins = np.linspace(0, x_max, 31)
bin_centers = (bins[:-1] + bins[1:]) / 2
lc = segment_result.lifecycles
bin_indices = np.digitize(lc["event_count"], bins) - 1
bin_indices = bin_indices.clip(0, len(bin_centers) - 1)
lc_binned = lc.assign(_bin=bin_indices)
fig = go.Figure()
for seg in segment_order:
subset = lc_binned[lc_binned["activity_segment"] == seg]
if subset.empty:
continue
counts_per_bin = subset.groupby("_bin").size().reindex(range(len(bin_centers)), fill_value=0)
fig.add_trace(go.Bar(
x=bin_centers, y=counts_per_bin.values, name=seg,
marker_color=segment_colors[seg], opacity=0.85,
))
fig.add_vline(
x=event_counts.median(), line_dash="solid", line_color="gray",
annotation_text=f"Median: {event_counts.median():.0f}",
annotation_position="top left",
)
use_log_y = event_counts.value_counts().max() > event_counts.value_counts().median() * 50
log_note = ("<br><sub>Log Y-axis: bar heights compress large differences — "
"see table below for actual segment shares</sub>" if use_log_y else "")
fig.update_layout(
barmode="stack", template="plotly_white", height=420,
title="Events per Entity by Activity Segment" + log_note,
xaxis_title="Number of Events",
yaxis_title="Entities",
yaxis_type="log" if use_log_y else "linear",
legend=dict(orientation="h", yanchor="top", y=-0.15, xanchor="center", x=0.5),
margin=dict(b=70),
)
display_figure(fig)
print(f"Segment thresholds: Q25 = {segment_result.q25_threshold:.0f} events, "
f"Q75 = {segment_result.q75_threshold:.0f} events\n")
display_table(segment_result.recommendations)
Segment thresholds: Q25 = 12 events, Q75 = 19 events
| Segment | Entities | Share | Avg Events | Feature Approach | Modeling Implication |
|---|---|---|---|---|---|
| Medium Activity | 2529 | 50.6% | 16.6 | Standard windows; mean/std aggregations reliable | Core modeling population; most features well-populated |
| Low Activity | 1273 | 25.5% | 7.3 | Wider windows with count/recency; sparse aggregations | Features will be noisy; log-transform counts, handle many zeros |
| High Activity | 1161 | 23.2% | 27.5 | All windows including narrower; trends and velocity meaningful | Rich feature space; watch for dominance in training set |
| One-time | 35 | 0.7% | 1.0 | No temporal features possible; use event-level attributes only | Cold-start problem; consider population-level fallback or separate model |
1a.5 Entity Lifecycle Analysis¶
Goal: Classify entities by their engagement pattern to inform feature engineering and modeling strategy.
We combine two dimensions — tenure (days from first to last event) and intensity (events per day of tenure) — to identify four lifecycle quadrants:
| Quadrant | Tenure | Intensity | Meaning | Feature Implication |
|---|---|---|---|---|
| Intense & Brief | Short | High | Burst engagement, then gone | Recency features critical |
| Steady & Loyal | Long | High | Consistent power users | Trend/seasonality features valuable |
| Occasional & Loyal | Long | Low | Infrequent but persistent | Wider time windows needed |
| One-shot | Short | Low | Single/few interactions | May lack enough history for features |
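`classify_lifecycle_quadrants` applies a median split on both dimensions along these lines; a minimal sketch with toy data (column names and the exact threshold handling are assumptions):

```python
import pandas as pd

# Assumed lifecycle frame: tenure (days first->last) and intensity (events/day)
lc = pd.DataFrame({
    "duration_days": [900, 900, 30, 10],
    "intensity": [0.10, 0.001, 0.50, 0.001],
})

tenure_med = lc["duration_days"].median()
intensity_med = lc["intensity"].median()

def quadrant(row) -> str:
    # Median splits on each axis give the four quadrants from the table above
    long_tenure = row["duration_days"] >= tenure_med
    high_intensity = row["intensity"] >= intensity_med
    if long_tenure and high_intensity:
        return "Steady & Loyal"
    if long_tenure:
        return "Occasional & Loyal"
    if high_intensity:
        return "Intense & Brief"
    return "One-shot"

lc["lifecycle_quadrant"] = lc.apply(quadrant, axis=1)
print(lc)
```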
from customer_retention.stages.profiling import classify_lifecycle_quadrants
quadrant_result = classify_lifecycle_quadrants(ts_profile.entity_lifecycles)
lifecycles = quadrant_result.lifecycles
quadrant_order = ["Steady & Loyal", "Occasional & Loyal", "Intense & Brief", "One-shot"]
quadrant_colors = {
"Steady & Loyal": "#2ca02c", "Occasional & Loyal": "#1f77b4",
"Intense & Brief": "#ff7f0e", "One-shot": "#d62728",
}
tenure_median = quadrant_result.tenure_threshold
print(f"Split thresholds: Tenure median = {quadrant_result.tenure_threshold:.0f} days, "
f"Intensity median = {quadrant_result.intensity_threshold:.4f} events/day\n")
display_table(quadrant_result.recommendations)
Split thresholds: Tenure median = 2745 days, Intensity median = 0.0066 events/day
| Quadrant | Entities | Share | Windows | Feature Strategy | Risk |
|---|---|---|---|---|---|
| Occasional & Loyal | 1681 | 33.6% | Wider windows (capture sparse events) | Long-window aggregations, recency gap | May churn silently; long gaps are normal |
| Intense & Brief | 1677 | 33.6% | Narrower windows (capture recency) | Recency features, burst detection | High churn risk; may be early churners |
| Steady & Loyal | 822 | 16.4% | All available windows | Trend/seasonality features, engagement decay | Low churn risk; monitor for engagement decline |
| One-shot | 818 | 16.4% | N/A (insufficient history) | Cold-start fallback, population-level stats | Cannot build temporal features; consider separate handling |
# Combined panel: small multiples (top 2x2) + tenure histogram (bottom)
fig = make_subplots(
rows=3, cols=2,
subplot_titles=[*quadrant_order, "Tenure Distribution by Quadrant", ""],
specs=[[{}, {}], [{}, {}], [{"colspan": 2}, None]],
vertical_spacing=0.08, horizontal_spacing=0.10,
row_heights=[0.28, 0.28, 0.44],
)
# Top 2x2: scatter per quadrant
positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for (row, col), q in zip(positions, quadrant_order):
subset = lifecycles[lifecycles["lifecycle_quadrant"] == q]
fig.add_trace(go.Scatter(
x=subset["duration_days"], y=subset["intensity"],
mode="markers", marker=dict(color=quadrant_colors[q], opacity=0.4, size=3),
showlegend=False,
), row=row, col=col)
fig.update_xaxes(title_text="Tenure (d)", title_font_size=10, row=row, col=col)
fig.update_yaxes(title_text="Ev/day", title_font_size=10, row=row, col=col)
# Bottom: overlaid tenure histograms
for q in quadrant_order:
subset = lifecycles[lifecycles["lifecycle_quadrant"] == q]
fig.add_trace(go.Histogram(
x=subset["duration_days"], nbinsx=40, name=q,
marker_color=quadrant_colors[q], opacity=0.6,
), row=3, col=1)
fig.add_vline(x=tenure_median, line_dash="dot", line_color="gray", opacity=0.5,
row=3, col=1, annotation_text=f"Median: {tenure_median:.0f}d",
annotation_position="top left")
fig.update_layout(
barmode="overlay", template="plotly_white", height=900,
title="Entity Lifecycle Quadrants",
legend=dict(orientation="h", yanchor="top", y=-0.05, xanchor="center", x=0.5),
margin=dict(b=80),
)
fig.update_xaxes(title_text="Tenure (days)", row=3, col=1)
fig.update_yaxes(title_text="Entities", row=3, col=1)
display_figure(fig)
1a.6 Create Analysis Views (Historic + Recent)¶
Understanding Temporal Stratification
Splitting data into time periods reveals whether patterns are stable or evolving:
- Historic: older data establishing baseline behavior
- Recent: newer data showing current regime
- Divergence between periods signals concept drift or population shift
Interpreting Period Comparisons:
| Signal | Stable (H ≈ R) | Shifting (H ≠ R) |
|---|---|---|
| Event volume | Consistent population | Growth or decline wave |
| Entity arrivals | Steady acquisition | Acceleration or saturation |
| Inter-event gaps | Uniform cadence | Engagement regime change |
| Data gaps | Known quality | Emerging coverage issues |
Reading Inter-Event Time:
- Median gap = typical engagement cadence
- Mean >> Median (ratio > 1.5) = heavy right skew, long tail of inactive entities
- IQR > Median = high variability across entities
- Period shift in median gap = engagement regime change
Stratification vs Segmentation:
- Stratification = time-based split (when events happened)
- Segmentation = entity-based split (who the entities are)
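The cell below uses `analyze_temporal_coverage` for the full analysis; the core stratification idea — split at a date, then compare inter-event cadence across periods — can be sketched with toy data (column names are assumptions):

```python
import pandas as pd

# Toy events (names assumed); split at the midpoint of the time span
events = pd.DataFrame({
    "customer_id": ["a"] * 6,
    "event_ts": pd.to_datetime([
        "2020-01-01", "2020-02-01", "2020-03-01",
        "2021-01-01", "2021-06-01", "2021-12-01",
    ]),
})
t0, t1 = events["event_ts"].min(), events["event_ts"].max()
split = t0 + (t1 - t0) / 2  # stratification: split on WHEN events happened

def median_gap_days(df: pd.DataFrame) -> float:
    # Median days between consecutive events, computed per entity
    gaps = (df.sort_values("event_ts").groupby("customer_id")["event_ts"]
              .diff().dropna().dt.days)
    return float(gaps.median())

historic = events[events["event_ts"] < split]
recent = events[events["event_ts"] >= split]
h, r = median_gap_days(historic), median_gap_days(recent)
print(f"historic median gap {h:.0f}d, recent {r:.0f}d, shift {(r - h) / h:+.0%}")
```

Here the toy entity's cadence lengthens sharply between periods, which is exactly the "engagement regime change" signal from the table above.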
from customer_retention.core.compat import safe_to_datetime
from customer_retention.stages.profiling import analyze_temporal_coverage, derive_drift_implications
df_temp = df.copy()
df_temp[TIME_COLUMN] = safe_to_datetime(df_temp[TIME_COLUMN])
coverage_result = analyze_temporal_coverage(df_temp, ENTITY_COLUMN, TIME_COLUMN)
drift = derive_drift_implications(coverage_result)
midpoint = coverage_result.first_event + (coverage_result.last_event - coverage_result.first_event) / 2
split_date = drift.recommended_training_start or midpoint
historic_mask = df_temp[TIME_COLUMN] < split_date
def _inter_event_days(data):
result = []
for _, group in data.groupby(ENTITY_COLUMN):
if len(group) < 2:
continue
result.extend(group[TIME_COLUMN].sort_values().diff().dropna().dt.total_seconds() / 86400)
return result
inter_event_times = _inter_event_days(df_temp)
inter_event_series = pd.Series(inter_event_times) if inter_event_times else pd.Series(dtype=float)
historic_iet = pd.Series(_inter_event_days(df_temp[historic_mask])) if historic_mask.any() else pd.Series(dtype=float)
recent_iet = pd.Series(_inter_event_days(df_temp[~historic_mask])) if (~historic_mask).any() else pd.Series(dtype=float)
fig = make_subplots(
rows=2, cols=2,
subplot_titles=["Event Volume Over Time", "New Entities Over Time",
"Inter-Event Time: Historic vs Recent", "Entity Coverage by Window"],
vertical_spacing=0.12, horizontal_spacing=0.10,
)
fig.add_trace(go.Scatter(
x=coverage_result.events_over_time.index, y=coverage_result.events_over_time.values,
mode="lines", fill="tozeroy", line_color="steelblue", showlegend=False,
), row=1, col=1)
fig.add_vline(x=split_date.isoformat(), line_dash="dash", line_color="red", row=1, col=1)
fig.add_annotation(x=split_date.isoformat(), y=1, yref="y domain",
text=f"Split: {split_date.strftime('%Y-%m-%d')}",
showarrow=False, font=dict(size=9), xanchor="left", yanchor="top",
row=1, col=1)
for gap in coverage_result.gaps:
severity_colors = {"minor": "rgba(255,165,0,0.15)", "moderate": "rgba(255,100,0,0.25)",
"major": "rgba(255,0,0,0.25)"}
fig.add_vrect(x0=gap.start.isoformat(), x1=gap.end.isoformat(),
fillcolor=severity_colors[gap.severity], line_width=0, row=1, col=1)
fig.add_trace(go.Bar(
x=coverage_result.new_entities_over_time.index, y=coverage_result.new_entities_over_time.values,
marker_color="mediumseagreen", opacity=0.8, showlegend=False,
), row=1, col=2)
if len(historic_iet) > 0:
fig.add_trace(go.Histogram(
x=historic_iet[historic_iet <= historic_iet.quantile(0.99)], nbinsx=40,
name="Historic", marker_color="steelblue", opacity=0.6,
), row=2, col=1)
if len(recent_iet) > 0:
fig.add_trace(go.Histogram(
x=recent_iet[recent_iet <= recent_iet.quantile(0.99)], nbinsx=40,
name="Recent", marker_color="coral", opacity=0.6,
), row=2, col=1)
cov = [(c.window, c.coverage_pct * 100, c.active_entities) for c in coverage_result.entity_window_coverage]
fig.add_trace(go.Bar(
x=[c[0] for c in cov], y=[c[1] for c in cov], showlegend=False,
marker_color=["#2ca02c" if p >= 50 else "#ff7f0e" if p >= 10 else "#d62728" for _, p, _ in cov],
opacity=0.85, text=[f"{p:.0f}%<br>({n:,})" for _, p, n in cov],
textposition="outside", textfont_size=9,
), row=2, col=2)
trend_label = f"{coverage_result.volume_trend} ({coverage_result.volume_change_pct:+.0%})"
gap_label = f" | {len(coverage_result.gaps)} gap(s)" if coverage_result.gaps else ""
fig.update_layout(
template="plotly_white", height=700, barmode="overlay",
title=(f"Temporal Coverage: Historic vs Recent<br><sub>"
f"Split: {split_date.strftime('%Y-%m-%d')} | Trend: {trend_label}{gap_label}</sub>"),
legend=dict(orientation="h", yanchor="top", y=-0.08, xanchor="center", x=0.5),
margin=dict(b=70),
)
fig.update_xaxes(title_text="Date", row=1, col=1)
fig.update_xaxes(title_text="First Event Date", row=1, col=2)
fig.update_xaxes(title_text="Days Between Events", row=2, col=1)
fig.update_xaxes(title_text="Window", row=2, col=2)
fig.update_yaxes(title_text="Events", row=1, col=1)
fig.update_yaxes(title_text="New Entities", row=1, col=2)
fig.update_yaxes(title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_text="% Entities Active", range=[0, 115], row=2, col=2)
display_figure(fig)
print("DETAILED FINDINGS")
print("=" * 70)
print(f"Time span: {coverage_result.time_span_days:,} days "
f"({coverage_result.first_event.strftime('%Y-%m-%d')} to "
f"{coverage_result.last_event.strftime('%Y-%m-%d')})")
print(f"Volume trend: {coverage_result.volume_trend} ({coverage_result.volume_change_pct:+.0%})")
print(f"Data gaps: {len(coverage_result.gaps)}"
+ (f" ({sum(g.duration_days for g in coverage_result.gaps):.0f} total days)"
if coverage_result.gaps else ""))
print(f"Historic: {historic_mask.sum():,} events | Recent: {(~historic_mask).sum():,} events")
iet_shift_pct, skew_ratio = 0.0, 1.0
if len(inter_event_series) > 0:
skew_ratio = (inter_event_series.mean() / inter_event_series.median()
if inter_event_series.median() > 0 else 1.0)
skew_label = ("heavily right-skewed" if skew_ratio > 1.5
else "moderately skewed" if skew_ratio > 1.2 else "symmetric")
print(f"Inter-event shape: mean/median = {skew_ratio:.2f} ({skew_label})")
h_med = historic_iet.median() if len(historic_iet) > 0 else 0
r_med = recent_iet.median() if len(recent_iet) > 0 else 0
if h_med > 0 and r_med > 0:
iet_shift_pct = (r_med - h_med) / h_med
print(f"Inter-event median: Historic {h_med:.1f}d | Recent {r_med:.1f}d "
f"(shift: {iet_shift_pct:+.0%})")
print(f"Drift: {drift.risk_level.upper()} | Regimes: {drift.regime_count} | "
f"Stability: {drift.population_stability:.2f}")
if drift.recommended_training_start:
print(f"Recommended training start: "
f"{drift.recommended_training_start.strftime('%Y-%m-%d')}")
print("\nIMPLICATIONS")
print("=" * 70)
if len(inter_event_series) > 0:
median_iet = inter_event_series.median()
ev_30d = 30.0 / median_iet if median_iet > 0 else 0
if ev_30d < 2:
print(f"Windowing: 30d captures ~{ev_30d:.1f} events/entity "
f"— longer windows (90d+) needed")
elif median_iet < 7:
print("Windowing: High-frequency — 7d/24h windows rich with signal")
else:
print(f"Windowing: Median cadence {median_iet:.0f}d "
f"— standard windows appropriate")
has_major_gaps = any(g.severity == "major" for g in coverage_result.gaps)
coverage_note = ("Gaps may produce misleading zeros in aggregated features"
                 if coverage_result.gaps else "Clean — no gap artifacts")
print(f"Coverage: {coverage_note}")
print(f"Stability: {drift.volume_drift_risk} volume drift")
if abs(iet_shift_pct) > 0.2:
print(f"Segmentation: Cadence shifted {iet_shift_pct:+.0%} "
f"— consider period-aware features")
print("\nOBJECTIVE SUPPORT")
print("=" * 70)
immediate = (3 if drift.risk_level == "high" or has_major_gaps
else 2 if drift.risk_level == "moderate" or coverage_result.gaps else 1)
disengage = (
(3 if skew_ratio > 1.5 and abs(iet_shift_pct) > 0.2
else 2 if skew_ratio > 1.2 or abs(iet_shift_pct) > 0.1 else 1)
if len(inter_event_series) > 0 else 0)
long_window_cov = max(
(c.coverage_pct for c in coverage_result.entity_window_coverage
if c.window in ("180d", "365d")), default=0)
renewal = 3 if long_window_cov > 0.5 else 2 if long_window_cov > 0.2 else 1
bars = {0: "[ ]", 1: "[█ ]", 2: "[██ ]", 3: "[███]"}
print(f"Immediate risk : {bars[immediate]}")
print(f"Disengagement : {bars[disengage]}")
print(f"Renewal risk : {bars[renewal]}")
why = []
if drift.risk_level in ("moderate", "high"):
why.append(f"{drift.risk_level} drift detected")
if coverage_result.gaps:
why.append(f"{'major' if has_major_gaps else 'minor'} data gaps present")
if skew_ratio > 1.5:
why.append("heavy skew — long tail of inactive entities")
if abs(iet_shift_pct) > 0.1:
why.append(f"engagement cadence shifted {iet_shift_pct:+.0%}")
if long_window_cov < 0.3:
why.append("limited long-window coverage for renewal horizon")
print("\nWhy:")
for w in why:
print(f" - {w}")
DETAILED FINDINGS
======================================================================
Time span: 3,285 days (2015-01-01 to 2023-12-30)
Volume trend: declining (-30%)
Data gaps: 0
Historic: 49,011 events | Recent: 34,187 events
Inter-event shape: mean/median = 1.53 (heavily right-skewed)
Inter-event median: Historic 86.0d | Recent 95.0d (shift: +10%)
Drift: HIGH | Regimes: 1 | Stability: 0.66

IMPLICATIONS
======================================================================
Windowing: 30d captures ~0.3 events/entity — longer windows (90d+) needed
Coverage: Clean — no gap artifacts
Stability: declining volume drift

OBJECTIVE SUPPORT
======================================================================
Immediate risk : [███]
Disengagement : [██ ]
Renewal risk : [███]

Why:
  - high drift detected
  - heavy skew — long tail of inactive entities
  - engagement cadence shifted +10%
1a.7 Temporal Aggregation Perspective¶
Which aggregation windows preserve temporal signal for each column? We compare within-entity variance (how much each entity's values change over time) against between-entity variance (how different entities are from one another).
numeric_cols = [n for n, c in findings.columns.items()
if c.inferred_type.value in ('numeric_continuous', 'numeric_discrete')
and n not in [ENTITY_COLUMN, TIME_COLUMN] and n not in TEMPORAL_METADATA_COLS]
if not numeric_cols:
print("No numeric columns detected — skipping temporal aggregation analysis")
else:
if inter_event_times:
median_iet = inter_event_series.median()
print(f"Median inter-event time: {median_iet:.0f} days")
print("Expected events per window (at median cadence):")
for label, days in [("7d", 7), ("30d", 30), ("90d", 90), ("180d", 180), ("365d", 365)]:
expected = days / median_iet if median_iet > 0 else 0
marker = "\u2705" if expected >= 2 else "\u26a0\ufe0f" if expected >= 1 else "\u274c"
print(f" {marker} {label}: ~{expected:.1f} events/entity")
print()
print(f"{'Column':<25} {'Within-CV':<12} {'Between-CV':<12} {'Ratio':<8} {'Aggregation Guidance'}")
print("-" * 90)
for col in numeric_cols:
col_data = df_temp.groupby(ENTITY_COLUMN)[col]
entity_means = col_data.mean()
entity_stds = col_data.std()
within_cv = (entity_stds / entity_means.abs().clip(lower=1e-10)).median()
between_cv = entity_means.std() / entity_means.abs().mean() if entity_means.abs().mean() > 1e-10 else 0.0
ratio = within_cv / between_cv if between_cv > 0 else (float("inf") if within_cv > 0 else 0.0)
if within_cv < 0.3:
guidance = "Stable per entity -> all_time mean sufficient"
elif ratio > 1.5:
guidance = "High temporal dynamics -> shorter windows preserve signal"
elif ratio > 0.5:
guidance = "Mixed -> both short and long windows add value"
else:
guidance = "Entity-driven -> between-entity differences dominate"
within_str = f"{within_cv:.2f}" if not np.isinf(within_cv) else "inf"
ratio_str = f"{ratio:.2f}" if not np.isinf(ratio) else ">10"
print(f"{col:<25} {within_str:<12} {between_cv:<12.2f} {ratio_str:<8} {guidance}")
print("\nWithin-CV: how much each entity's values vary across their events")
print("Between-CV: how much entity averages differ from each other")
print("Ratio > 1: temporal variation dominates -> shorter windows capture dynamics")
print("Ratio < 1: entity identity dominates -> longer windows (or all_time) sufficient")
Median inter-event time: 95 days
Expected events per window (at median cadence):
  ❌ 7d: ~0.1 events/entity
  ❌ 30d: ~0.3 events/entity
  ❌ 90d: ~0.9 events/entity
  ⚠️ 180d: ~1.9 events/entity
  ✅ 365d: ~3.8 events/entity

Column                    Within-CV    Between-CV   Ratio    Aggregation Guidance
------------------------------------------------------------------------------------------
send_hour                 0.28         0.09         3.22     Stable per entity -> all_time mean sufficient
time_to_open_hours        0.83         0.61         1.37     Mixed -> both short and long windows add value

Within-CV: how much each entity's values vary across their events
Between-CV: how much entity averages differ from each other
Ratio > 1: temporal variation dominates -> shorter windows capture dynamics
Ratio < 1: entity identity dominates -> longer windows (or all_time) sufficient
1a.8 Update Findings with Time Series Metadata¶
from customer_retention.analysis.auto_explorer.findings import TimeSeriesMetadata
from customer_retention.stages.profiling import WindowRecommendationCollector
# Build window recommendations from data coverage analysis
window_collector = WindowRecommendationCollector(coverage_threshold=0.10)
window_collector.add_segment_context(segment_result)
window_collector.add_quadrant_context(quadrant_result)
# Add inter-event timing context if available
if inter_event_times:
window_collector.add_inter_event_context(
median_days=inter_event_series.median(),
mean_days=inter_event_series.mean(),
)
window_result = window_collector.compute_union(
lifecycles=quadrant_result.lifecycles,
time_span_days=ts_profile.time_span_days,
value_columns=len(numeric_cols),
agg_funcs=4,
)
print(f"Selected windows: {window_result.windows}")
print(f"Total features per entity: ~{window_result.feature_count_estimate}\n")
explanation = window_result.explanation.drop(columns=["window_days"]).copy()
explanation["coverage_pct"] = (explanation["coverage_pct"] * 100).round(1).astype(str) + "%"
explanation["meaningful_pct"] = (explanation["meaningful_pct"] * 100).round(1).astype(str) + "%"
display_table(explanation)
print("\nCoverage: % of entities with enough tenure AND expected >=2 events in that window")
print("Meaningful: among entities with enough tenure, % that have sufficient event density")
Selected windows: ['180d', '365d', 'all_time']
Total features per entity: ~27
| window | coverage_pct | meaningful_pct | beneficial_entities | primary_segments | included | exclusion_reason | note |
|---|---|---|---|---|---|---|---|
| 24h | 0.1% | 0.1% | 3 | [High Activity, Low Activity] | False | Coverage 0.1% < threshold 10.0% | |
| 7d | 0.1% | 0.1% | 7 | [High Activity, Low Activity, Medium Activity] | False | Coverage 0.1% < threshold 10.0% | |
| 14d | 0.2% | 0.2% | 11 | [High Activity, Low Activity, Medium Activity] | False | Coverage 0.2% < threshold 10.0% | |
| 30d | 0.7% | 0.7% | 35 | [High Activity, Low Activity, Medium Activity] | False | Coverage 0.7% < threshold 10.0% | |
| 90d | 3.4% | 3.5% | 170 | [High Activity, Low Activity, Medium Activity] | False | Coverage 3.4% < threshold 10.0% | Timing-aligned (median inter-event) |
| 180d | 12.2% | 12.6% | 611 | [High Activity, Low Activity, Medium Activity] | True | | Timing-aligned (median inter-event) |
| 365d | 75.8% | 80.9% | 3789 | [High Activity, Low Activity, Medium Activity] | True | | |
| all_time | 100.0% | 100.0% | 4998 | [High Activity, Low Activity, Medium Activity, One-time] | True | | |
Coverage: % of entities with enough tenure AND expected >=2 events in that window Meaningful: among entities with enough tenure, % that have sufficient event density
h = window_result.heterogeneity
print("Temporal Heterogeneity (eta-squared):")
print(" eta² measures the fraction of variance in a metric explained by lifecycle quadrant grouping.")
print(" Scale: 0 = no group differences, 1 = all variance is between groups.")
print(" Thresholds: <0.06 = low | 0.06-0.14 = moderate | >0.14 = high effect size\n")
eta_max = max(h.eta_squared_intensity, h.eta_squared_event_count)
print(f" Intensity eta²: {h.eta_squared_intensity:.3f} {'<-- dominant' if h.eta_squared_intensity >= h.eta_squared_event_count else ''}")
print(f" Event count eta²: {h.eta_squared_event_count:.3f} {'<-- dominant' if h.eta_squared_event_count > h.eta_squared_intensity else ''}")
print(f" Overall level: {h.heterogeneity_level.upper()} (max eta² = {eta_max:.3f})\n")
advisory_labels = {
"single_model": "Single model with union windows is appropriate",
"consider_segment_feature": "Add lifecycle_quadrant as a categorical feature to the model",
"consider_separate_models": "Consider separate models for entities with vs without history",
}
advisory_text = advisory_labels.get(h.segmentation_advisory, h.segmentation_advisory)
print(f"Recommendation: {advisory_text}")
for r in h.advisory_rationale:
print(f" -> {r}")
print()
display_table(h.coverage_table)
Temporal Heterogeneity (eta-squared):
  eta² measures the fraction of variance in a metric explained by lifecycle quadrant grouping.
  Scale: 0 = no group differences, 1 = all variance is between groups.
  Thresholds: <0.06 = low | 0.06-0.14 = moderate | >0.14 = high effect size

  Intensity eta²: 0.015
  Event count eta²: 0.335 <-- dominant
  Overall level: HIGH (max eta² = 0.335)

Recommendation: Add lifecycle_quadrant as a categorical feature to the model
  -> High temporal diversity across quadrants
  -> Union windows still pragmatic for feature engineering
  -> Model may benefit from knowing entity's engagement pattern
| window | coverage_pct | meaningful_pct | zero_risk_pct |
|---|---|---|---|
| 180d | 0.9696 | 0.1222 | 0.8778 |
| 365d | 0.9366 | 0.7581 | 0.2419 |
| all_time | 1.0000 | 1.0000 | 0.0000 |
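The eta² effect size reported above follows the standard one-way ANOVA decomposition (between-group sum of squares over total sum of squares); a minimal sketch with toy data (the package's own computation may differ in detail):

```python
import pandas as pd

# Assumed: per-entity event counts grouped by lifecycle quadrant
df = pd.DataFrame({
    "quadrant": ["A", "A", "A", "B", "B", "B"],
    "event_count": [2.0, 3.0, 4.0, 10.0, 11.0, 12.0],
})

grand_mean = df["event_count"].mean()
ss_total = ((df["event_count"] - grand_mean) ** 2).sum()
# Between-group sum of squares: group size * (group mean - grand mean)^2
ss_between = sum(
    len(g) * (g["event_count"].mean() - grand_mean) ** 2
    for _, g in df.groupby("quadrant")
)
eta_squared = ss_between / ss_total
print(f"eta^2 = {eta_squared:.3f}")
```

With two well-separated toy groups, nearly all variance is between groups (eta² near 1), which is the "high effect size" end of the thresholds quoted above.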
advisory_labels = {
"single_model": "Single model with union windows is appropriate",
"consider_segment_feature": "Add lifecycle_quadrant as a categorical feature to the model",
"consider_separate_models": "Consider separate models for entities with vs without history",
}
# Preserve temporal_pattern from original findings if available
existing_pattern = findings.time_series_metadata.temporal_pattern if findings.time_series_metadata else None
ts_metadata = TimeSeriesMetadata(
granularity=DatasetGranularity.EVENT_LEVEL,
temporal_pattern=existing_pattern,
entity_column=ENTITY_COLUMN,
time_column=TIME_COLUMN,
avg_events_per_entity=ts_profile.events_per_entity.mean,
time_span_days=ts_profile.time_span_days,
unique_entities=ts_profile.unique_entities,
suggested_aggregations=window_result.windows,
window_coverage_threshold=window_result.coverage_threshold,
heterogeneity_level=window_result.heterogeneity.heterogeneity_level,
eta_squared_intensity=window_result.heterogeneity.eta_squared_intensity,
eta_squared_event_count=window_result.heterogeneity.eta_squared_event_count,
temporal_segmentation_advisory=window_result.heterogeneity.segmentation_advisory,
temporal_segmentation_recommendation=advisory_labels.get(
window_result.heterogeneity.segmentation_advisory,
window_result.heterogeneity.segmentation_advisory,
),
drift_risk_level=drift.risk_level,
volume_drift_risk=drift.volume_drift_risk,
population_stability=drift.population_stability,
regime_count=drift.regime_count,
recommended_training_start=(
drift.recommended_training_start.isoformat() if drift.recommended_training_start else None
),
)
findings.time_series_metadata = ts_metadata
findings.save(FINDINGS_PATH)
print(f"Updated findings saved to: {FINDINGS_PATH}")
print(f" Suggested aggregations: {ts_metadata.suggested_aggregations}")
print(f" Heterogeneity: {ts_metadata.heterogeneity_level}")
print(f" Recommendation: {ts_metadata.temporal_segmentation_recommendation}")
print(f" Drift risk: {ts_metadata.drift_risk_level}")
Updated findings saved to: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_findings.yaml
  Suggested aggregations: ['180d', '365d', 'all_time']
  Heterogeneity: high
  Recommendation: Add lifecycle_quadrant as a categorical feature to the model
  Drift risk: high
Summary: What We Learned¶
In this notebook, we performed a deep dive on time series data:
- Event Distribution - Analyzed how events are distributed across entities
- Activity Segments - Categorized entities by activity level (one-time, low, medium, high)
- Lifecycle Analysis - Examined entity tenure and duration patterns
- Temporal Stratification - Compared historic vs recent periods: coverage, drift, inter-event timing, and objective alignment
- Temporal Aggregation - Assessed within-entity vs between-entity variance per aggregation window
- Window Selection - Selected aggregation windows with heterogeneity and segmentation assessment
Next Steps¶
Continue with the Event Bronze Track:
- 01b_temporal_quality.ipynb - Check for duplicate events, temporal gaps, future dates
- 01c_temporal_patterns.ipynb - Detect trends, seasonality, cohort analysis
- 01d_event_aggregation.ipynb - Aggregate events to entity-level (produces new dataset)
After completing 01d, continue with the Entity Bronze Track (02 → 03 → 04) on the aggregated data.
Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.