Chapter 4: Column Deep Dive
Purpose: Analyze each column in detail with distribution analysis, value validation, and transformation recommendations.
What you'll learn:
- How to validate value ranges for different column types
- How to interpret distribution shapes (skewness, kurtosis)
- When and why to apply transformations (log, sqrt, capping)
- How to detect zero-inflation and handle it
Outputs:
- Value range validation results
- Per-column distribution visualizations with statistics
- Skewness/kurtosis analysis with transformation recommendations
- Zero-inflation detection
- Type confirmation/override capability
- Updated exploration findings
4.1 Load Previous Findings
from customer_retention.analysis.notebook_progress import track_and_export_previous
track_and_export_previous("04_column_deep_dive.ipynb")
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from scipy import stats
from customer_retention.analysis.auto_explorer import ExplorationFindings, RecommendationRegistry
from customer_retention.analysis.visualization import ChartBuilder, console, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType
from customer_retention.core.config.experiments import (
    FINDINGS_DIR,
)
from customer_retention.stages.profiling import (
    CategoricalDistributionAnalyzer,
    DistributionAnalyzer,
    TemporalAnalyzer,
    TemporalGranularity,
    TransformationType,
)
from customer_retention.stages.validation import DataValidator, RuleGenerator
from customer_retention.analysis.auto_explorer import load_notebook_findings
FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings("04_column_deep_dive.ipynb")
print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")
# Warn if this is event-level data (should run 01d first)
if findings.is_time_series and "_aggregated" not in FINDINGS_PATH:
    ts_meta = findings.time_series_metadata
    print("\n\u26a0\ufe0f WARNING: This appears to be EVENT-LEVEL data")
    print(f" Entity: {ts_meta.entity_column}, Time: {ts_meta.time_column}")
    print(" Recommendation: Run 01d_event_aggregation.ipynb first to create entity-level data")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_aggregated_findings.yaml
Loaded findings for 217 columns from /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/data/bronze/customer_emails_aggregated
4.2 Load Source Data
# Load data - handle aggregated data (parquet or Delta Lake)
from pathlib import Path
from customer_retention.analysis.auto_explorer.active_dataset_store import load_active_dataset
from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS
# For aggregated data, load directly from the source path
if "_aggregated" in FINDINGS_PATH:
    source_path = Path(findings.source_path)
    # Relative paths are resolved against the parent of the notebook directory
    if not source_path.is_absolute():
        source_path = Path("..") / source_path
    if source_path.is_dir():
        # A directory is treated as a Delta Lake table
        from customer_retention.integrations.adapters.factory import get_delta
        df = get_delta(force_local=True).read(str(source_path))
    elif source_path.is_file():
        df = pd.read_parquet(source_path)
    else:
        df = load_active_dataset(_namespace, dataset_name)
    data_source = f"aggregated:{source_path.name}"
else:
    # Standard loading for event-level or entity-level data
    df = load_active_dataset(_namespace, dataset_name)
    data_source = dataset_name
print(f"Loaded data from: {data_source}")
print(f"Shape: {df.shape}")
charts = ChartBuilder()
# Initialize recommendation registry for this exploration
registry = RecommendationRegistry()
registry.init_bronze(findings.source_path)
# Find target column for Gold layer initialization
target_col = next((name for name, col in findings.columns.items() if col.inferred_type == ColumnType.TARGET), None)
if target_col:
    registry.init_gold(target_col)
# Find entity column for Silver layer initialization
entity_col = next((name for name, col in findings.columns.items() if col.inferred_type == ColumnType.IDENTIFIER), None)
if entity_col:
    registry.init_silver(entity_col)
print(f"Initialized recommendation registry (Bronze: {findings.source_path})")
Loaded data from: aggregated:customer_emails_aggregated
Shape: (4998, 217)
Initialized recommendation registry (Bronze: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/data/bronze/customer_emails_aggregated)
4.3 Value Range Validation
📖 Interpretation Guide:
- Percentage fields (rates): Should be 0-100 or 0-1 depending on format
- Binary fields: Should only contain 0 and 1
- Count fields: Should be non-negative integers
- Amount fields: Should be non-negative (unless refunds are possible)
What to Watch For:
- Rates > 100% suggest measurement or data entry errors
- Negative values in fields that should be positive
- Binary fields with values other than 0/1
Actions:
- Cap rates at 100 if they exceed it (or investigate the cause)
- Flag records with impossible negative values
- Convert binary fields to proper 0/1 encoding
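The checks and actions above can be sketched in plain pandas. This is a minimal illustration, not the framework's `DataValidator`: the `BOUNDS` table, the helper names `violation_mask` and `cap_to_bounds`, and the demo columns are all hypothetical.

```python
import pandas as pd

# Hypothetical rule names mirroring the guide above; not the framework's API.
BOUNDS = {
    "percentage": (0, 100),    # rates stored on a 0-100 scale
    "rate": (0, 1),            # rates stored on a 0-1 scale
    "non_negative": (0, None)  # counts and amounts
}


def violation_mask(series: pd.Series, rule_type: str) -> pd.Series:
    """Return True where a non-null value breaks the rule."""
    present = series.notna()
    if rule_type == "binary":
        return present & ~series.isin([0, 1])
    low, high = BOUNDS[rule_type]
    mask = series < low
    if high is not None:
        mask |= series > high
    return present & mask


def cap_to_bounds(series: pd.Series, rule_type: str) -> pd.Series:
    """Clip values into the allowed range; nulls are left untouched."""
    low, high = BOUNDS[rule_type]
    return series.clip(lower=low, upper=high)


# Made-up demo data, for illustration only
demo = pd.DataFrame({
    "open_rate_pct": [12.5, 101.0, -3.0, None],
    "is_active": [0, 1, 2, None],
    "order_count": [3, 0, -1, 4],
})
pct_bad = violation_mask(demo["open_rate_pct"], "percentage")
bin_bad = violation_mask(demo["is_active"], "binary")
neg_bad = violation_mask(demo["order_count"], "non_negative")
capped = cap_to_bounds(demo["open_rate_pct"], "percentage")
print(int(pct_bad.sum()), int(bin_bad.sum()), int(neg_bad.sum()))
```

Capping silently rewrites values, so it is usually worth inspecting the flagged rows first: a field that legitimately goes negative (a percent change, for instance) points to a wrong rule rather than bad data.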
validator = DataValidator()
range_rules = RuleGenerator.from_findings(findings)
console.start_section()
console.header("Value Range Validation")
if range_rules:
    range_results = validator.validate_value_ranges(df, range_rules)
    issues_found = []
    for r in range_results:
        detail = f"{r.invalid_values} invalid" if r.invalid_values > 0 else None
        console.check(f"{r.column_name} ({r.rule_type})", r.invalid_values == 0, detail)
        if r.invalid_values > 0:
            issues_found.append(r)
    all_invalid = sum(r.invalid_values for r in range_results)
    if all_invalid == 0:
        console.success("All value ranges valid")
    else:
        console.error(f"Found {all_invalid:,} values outside expected ranges")
        console.info("Examples of invalid values:")
        # Show examples for the first three failing columns only
        for r in issues_found[:3]:
            col = r.column_name
            if col in df.columns:
                if r.rule_type == 'binary':
                    invalid_mask = ~df[col].isin([0, 1, np.nan])
                    condition = "value not in [0, 1]"
                elif r.rule_type == 'non_negative':
                    invalid_mask = df[col] < 0
                    condition = "value < 0"
                elif r.rule_type == 'percentage':
                    invalid_mask = (df[col] < 0) | (df[col] > 100)
                    condition = "value < 0 or value > 100"
                elif r.rule_type == 'rate':
                    invalid_mask = (df[col] < 0) | (df[col] > 1)
                    condition = "value < 0 or value > 1"
                else:
                    continue
                invalid_values = df.loc[invalid_mask, col].dropna()
                if len(invalid_values) > 0:
                    examples = invalid_values.head(5).tolist()
                    console.metric(f" {col}", f"{examples}")
                    # Add filtering recommendation
                    registry.add_bronze_filtering(
                        column=col, condition=condition, action="cap",
                        rationale=f"{r.invalid_values} values violate {r.rule_type} constraint",
                        source_notebook="04_column_deep_dive"
                    )
    console.info("Rules auto-generated from detected column types")
else:
    range_results = []
    console.info("No validation rules generated - no binary/numeric columns detected")
console.end_section()
VALUE RANGE VALIDATION
[OK] opened_max_180d (binary)
[OK] clicked_max_180d (binary)
[OK] bounced_max_180d (binary)
[OK] opened_max_365d (binary)
[OK] clicked_max_365d (binary)
[OK] bounced_max_365d (binary)
[OK] opened_max_all_time (binary)
[OK] clicked_max_all_time (binary)
[OK] bounced_max_all_time (binary)
[OK] lag0_opened_max (binary)
[OK] lag0_clicked_max (binary)
[OK] lag0_bounced_max (binary)
[OK] lag1_opened_max (binary)
[OK] lag1_clicked_max (binary)
[OK] lag1_bounced_sum (binary)
[OK] lag1_bounced_max (binary)
[OK] lag2_opened_max (binary)
[OK] lag2_clicked_max (binary)
[OK] lag2_bounced_sum (binary)
[OK] lag2_bounced_max (binary)
[OK] lag3_opened_max (binary)
[OK] lag3_clicked_sum (binary)
[OK] lag3_clicked_max (binary)
[OK] lag3_bounced_sum (binary)
[OK] lag3_bounced_max (binary)
[X] opened_velocity_pct (percentage) — 137 invalid
[X] clicked_velocity_pct (binary) — 52 invalid
[X] send_hour_velocity_pct (percentage) — 390 invalid
[X] bounced_velocity_pct (binary) — 18 invalid
[X] time_to_open_hours_velocity_pct (percentage) — 155 invalid
[X] opened_acceleration (percentage) — 41 invalid
[X] clicked_acceleration (percentage) — 16 invalid
[X] clicked_momentum (binary) — 48 invalid
[X] send_hour_acceleration (percentage) — 91 invalid
[X] bounced_acceleration (percentage) — 5 invalid
[X] bounced_momentum (binary) — 34 invalid
[X] time_to_open_hours_acceleration (percentage) — 39 invalid
[OK] opened_trend_ratio (percentage)
[OK] clicked_trend_ratio (percentage)
[OK] send_hour_trend_ratio (percentage)
[OK] bounced_trend_ratio (percentage)
[X] time_to_open_hours_trend_ratio (percentage) — 6 invalid
[OK] recency_ratio (percentage)
[OK] opened_vs_cohort_pct (percentage)
[OK] clicked_vs_cohort_pct (percentage)
[OK] send_hour_vs_cohort_pct (percentage)
[OK] bounced_vs_cohort_pct (percentage)
[OK] time_to_open_hours_vs_cohort_pct (percentage)
[X] Found 1,032 values outside expected ranges
(i) Examples of invalid values:
opened_velocity_pct: [-1.0, -0.5, -1.0, -1.0, -1.0]
clicked_velocity_pct: [-1.0, -1.0, -1.0, -1.0, -1.0]
send_hour_velocity_pct: [-0.14285714285714285, -0.36666666666666664, -0.391304347826087, -0.19047619047619047, -0.08823529411764706]
(i) Rules auto-generated from detected column types
4.4 Numeric Columns Analysis
📖 How to Interpret These Charts:
- Red dashed line = Mean (sensitive to outliers)
- Green solid line = Median (robust to outliers)
- Large gap between mean and median = Skewed distribution
- Long right tail = Positive skew (common in count/amount data)
📖 Understanding Distribution Metrics
| Metric | Interpretation | Action |
|---|---|---|
| Skewness | Measures asymmetry (0 = symmetric) | \|skew\| > 1: Consider log transform |
| Kurtosis | Measures tail heaviness (reported as excess kurtosis, so a normal distribution scores 0) | kurt > 10: Cap outliers before transform |
| Zero % | Percentage of zeros | > 40%: Use zero-inflation handling |
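As a rough illustration of how the metrics in this table can be computed, the sketch below uses plain pandas rather than the framework's `DistributionAnalyzer`. The helper name `shape_metrics` and the sample values are hypothetical; the IQR outlier rule uses the conventional 1.5 × IQR fences, which is an assumption about how the framework counts outliers.

```python
import pandas as pd


def shape_metrics(series: pd.Series) -> dict:
    """Compute skewness, excess kurtosis, zero share and IQR outlier share."""
    data = series.dropna()
    q1, q3 = data.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Conventional Tukey fences: anything beyond 1.5 * IQR from the quartiles
    outliers = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    return {
        "mean": float(data.mean()),
        "median": float(data.median()),
        "skewness": float(data.skew()),
        "excess_kurtosis": float(data.kurt()),  # normal distribution scores 0
        "zero_pct": float((data == 0).mean() * 100),
        "outlier_pct_iqr": float(outliers.mean() * 100),
    }


# Made-up count-like sample: mostly zeros with one large value
sample = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 10])
metrics = shape_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this sample the mean sits well above the median, which is the same gap the red and green lines make visible in the charts.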
📖 Transformation Decision Tree (apply the first rule that matches):
- If zeros > 40% → Create binary indicator + log(non-zeros)
- If |skewness| > 1 AND kurtosis > 10 → Cap then log
- If |skewness| > 1 → Log transform
- If kurtosis > 10 → Cap outliers only
- Otherwise → Standard scaling is sufficient
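The decision tree can be written as a small function that checks the rules in order. This is a sketch of the rules exactly as listed here, not the logic inside `DistributionAnalyzer.recommend_transformation`: the outputs further down also include a `sqrt_transform` recommendation, so the framework evidently applies additional thresholds. The function name and the labels `log_transform`, `cap_outliers` and `standard_scaling` are hypothetical.

```python
def recommend_transform(zero_pct: float, skewness: float, kurtosis: float) -> str:
    """Apply the decision tree top to bottom; the first matching rule wins."""
    if zero_pct > 40:
        return "zero_inflation_handling"  # binary indicator + log of non-zeros
    if abs(skewness) > 1 and kurtosis > 10:
        return "cap_then_log"
    if abs(skewness) > 1:
        return "log_transform"            # hypothetical label
    if kurtosis > 10:
        return "cap_outliers"             # hypothetical label
    return "standard_scaling"             # hypothetical label


# Inputs taken from three of the column summaries in this section
print(recommend_transform(61.7, 2.15, 7.46))   # event_count_180d
print(recommend_transform(0.0, 2.61, 16.11))   # event_count_all_time
print(recommend_transform(0.0, 0.04, -0.10))   # send_hour_mean_180d
```

Order matters: a column with 60% zeros and high skewness must hit the zero-inflation rule before the skewness rules, otherwise the log transform would be applied to a spike of zeros.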
# Use framework's DistributionAnalyzer for comprehensive analysis
analyzer = DistributionAnalyzer()
numeric_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
    and name not in TEMPORAL_METADATA_COLS
]
# Analyze all numeric columns using the framework
analyses = analyzer.analyze_dataframe(df, numeric_cols)
recommendations = {col: analyzer.recommend_transformation(analysis)
                   for col, analysis in analyses.items()}
for col_name in numeric_cols:
    col_info = findings.columns[col_name]
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print("-" * 70)
    if analysis:
        print("📊 Distribution Statistics:")
        print(f" Mean: {analysis.mean:.3f} | Median: {analysis.median:.3f} | Std: {analysis.std:.3f}")
        print(f" Range: [{analysis.min_value:.3f}, {analysis.max_value:.3f}]")
        print(f" Percentiles: 1%={analysis.percentiles['p1']:.3f}, 25%={analysis.q1:.3f}, 75%={analysis.q3:.3f}, 99%={analysis.percentiles['p99']:.3f}")
        print("\n📈 Shape Analysis:")
        skew_label = '(Right-skewed)' if analysis.skewness > 0.5 else '(Left-skewed)' if analysis.skewness < -0.5 else '(Symmetric)'
        print(f" Skewness: {analysis.skewness:.2f} {skew_label}")
        kurt_label = '(Heavy tails/outliers)' if analysis.kurtosis > 3 else '(Light tails)'
        print(f" Kurtosis: {analysis.kurtosis:.2f} {kurt_label}")
        print(f" Zeros: {analysis.zero_count:,} ({analysis.zero_percentage:.1f}%)")
        print(f" Outliers (IQR): {analysis.outlier_count_iqr:,} ({analysis.outlier_percentage:.1f}%)")
    if rec:
        print(f"\n🔧 Recommended Transformation: {rec.recommended_transform.value}")
        print(f" Reason: {rec.reason}")
        print(f" Priority: {rec.priority}")
        if rec.warnings:
            for warn in rec.warnings:
                print(f" ⚠️ {warn}")
    # Create enhanced histogram with Plotly
    data = df[col_name].dropna()
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=data, nbinsx=50, name='Distribution',
                               marker_color='steelblue', opacity=0.7))
    # Calculate mean and median
    mean_val = data.mean()
    median_val = data.median()
    # Position labels on opposite sides (left/right) to avoid overlap
    # The larger value gets right-justified, smaller gets left-justified
    mean_position = "top right" if mean_val >= median_val else "top left"
    median_position = "top left" if mean_val >= median_val else "top right"
    # Add mean line
    fig.add_vline(
        x=mean_val,
        line_dash="dash",
        line_color="red",
        annotation_text=f"Mean: {mean_val:.2f}",
        annotation_position=mean_position,
        annotation_font_color="red",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    # Add median line
    fig.add_vline(
        x=median_val,
        line_dash="solid",
        line_color="green",
        annotation_text=f"Median: {median_val:.2f}",
        annotation_position=median_position,
        annotation_font_color="green",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    # Add 99th percentile marker if there are outliers
    if analysis and analysis.outlier_percentage > 5:
        fig.add_vline(x=analysis.percentiles['p99'], line_dash="dot", line_color="orange",
                      annotation_text=f"99th: {analysis.percentiles['p99']:.2f}",
                      annotation_position="top right",
                      annotation_font_color="orange",
                      annotation_bgcolor="rgba(255,255,255,0.8)")
    transform_label = rec.recommended_transform.value if rec else "none"
    # Guard the subtitle: analysis can be missing for a column (see the checks above)
    shape_label = f"Skew: {analysis.skewness:.2f} | Kurt: {analysis.kurtosis:.2f} | " if analysis else ""
    fig.update_layout(
        title=f"Distribution: {col_name}<br><sub>{shape_label}Strategy: {transform_label}</sub>",
        xaxis_title=col_name,
        yaxis_title="Count",
        template='plotly_white',
        height=400
    )
    display_figure(fig)
======================================================================
Column: event_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.639 | Median: 0.000 | Std: 1.009
 Range: [0.000, 11.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
 Skewness: 2.15 (Right-skewed)
 Kurtosis: 7.46 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 285 (5.7%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.15)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
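The warning above suggests a binary zero indicator plus a log transform of the non-zero values. One way this could look in pandas is sketched below. It is not the framework's implementation: the helper name `split_zero_inflated` and the sample values are hypothetical, and the sketch uses `log1p` on the whole column (zeros map to 0) instead of `log` on the non-zero subset, which assumes the column is non-negative.

```python
import numpy as np
import pandas as pd


def split_zero_inflated(series: pd.Series, name: str) -> pd.DataFrame:
    """Return a zero indicator and a log1p-transformed copy of the column."""
    values = series.astype(float)
    return pd.DataFrame({
        # 1 where the original value is exactly zero, 0 otherwise (nulls give 0)
        f"{name}_is_zero": values.eq(0).astype(int),
        # log1p keeps zeros at 0 and stays defined for a count of 1
        f"{name}_log1p": np.log1p(values),
    })


# Made-up counts shaped like a zero-inflated column
counts = pd.Series([0, 0, 1, 3, 11])
features = split_zero_inflated(counts, "event_count_180d")
print(features)
```

The indicator lets a model treat "no activity at all" separately from "a little activity", while the log-scaled column compresses the long right tail.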
======================================================================
Column: event_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 1.316 | Median: 1.000 | Std: 1.656
 Range: [0.000, 15.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
 Skewness: 1.61 (Right-skewed)
 Kurtosis: 4.25 (Heavy tails/outliers)
 Zeros: 2,362 (47.3%)
 Outliers (IQR): 100 (2.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (47.3%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: event_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 16.566 | Median: 16.000 | Std: 9.139
 Range: [1.000, 112.000]
 Percentiles: 1%=2.000, 25%=12.000, 75%=19.000, 99%=53.000

📈 Shape Analysis:
 Skewness: 2.61 (Right-skewed)
 Kurtosis: 16.11 (Heavy tails/outliers)
 Zeros: 0 (0.0%)
 Outliers (IQR): 298 (6.0%)

🔧 Recommended Transformation: cap_then_log
 Reason: High skewness (2.61) with significant outliers (6.0%)
 Priority: high
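A `cap_then_log` recommendation like the one above could be applied roughly as follows. This is a sketch under assumptions: capping at the 99th percentile and using `log1p` are illustrative choices, not confirmed framework behavior, and the helper name `cap_then_log` and the sample values are hypothetical. It also assumes a non-negative column.

```python
import numpy as np
import pandas as pd


def cap_then_log(series: pd.Series, upper_quantile: float = 0.99) -> pd.Series:
    """Clip the upper tail at a quantile, then apply log1p."""
    cap = series.quantile(upper_quantile)
    capped = series.clip(upper=cap)
    return np.log1p(capped)


# Made-up values with one extreme observation
raw = pd.Series([1.0, 12.0, 16.0, 19.0, 112.0])
transformed = cap_then_log(raw)
print(transformed.round(3).tolist())
```

Capping first keeps a handful of extreme values from dominating the log scale. In a train/test setting the cap would normally be computed on the training split and reused on the rest.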
======================================================================
Column: opened_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.152 | Median: 0.000 | Std: 0.424
 Range: [0.000, 4.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000

📈 Shape Analysis:
 Skewness: 3.14 (Right-skewed)
 Kurtosis: 11.52 (Heavy tails/outliers)
 Zeros: 4,350 (87.0%)
 Outliers (IQR): 648 (13.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (87.0%) combined with high skewness (3.14)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: opened_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.232 | Median: 0.000 | Std: 0.366
 Range: [0.000, 1.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.500, 99%=1.000

📈 Shape Analysis:
 Skewness: 1.27 (Right-skewed)
 Kurtosis: 0.05 (Light tails)
 Zeros: 1,266 (66.1%)
 Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (66.1%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: opened_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.639 | Median: 0.000 | Std: 1.009
 Range: [0.000, 11.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
 Skewness: 2.15 (Right-skewed)
 Kurtosis: 7.46 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 285 (5.7%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.15)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.048 | Median: 0.000 | Std: 0.225
 Range: [0.000, 2.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
 Skewness: 4.78 (Right-skewed)
 Kurtosis: 23.72 (Heavy tails/outliers)
 Zeros: 4,767 (95.4%)
 Outliers (IQR): 231 (4.6%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (95.4%) combined with high skewness (4.78)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.072 | Median: 0.000 | Std: 0.219
 Range: [0.000, 1.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
 Skewness: 3.26 (Right-skewed)
 Kurtosis: 10.01 (Heavy tails/outliers)
 Zeros: 1,683 (87.9%)
 Outliers (IQR): 231 (12.1%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (87.9%) combined with high skewness (3.26)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.639 | Median: 0.000 | Std: 1.009
 Range: [0.000, 11.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
 Skewness: 2.15 (Right-skewed)
 Kurtosis: 7.46 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 285 (5.7%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.15)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: send_hour_sum_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 8.645 | Median: 0.000 | Std: 13.953
 Range: [0.000, 149.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=15.000, 99%=56.030

📈 Shape Analysis:
 Skewness: 2.19 (Right-skewed)
 Kurtosis: 7.43 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 227 (4.5%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.19)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: send_hour_mean_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 13.555 | Median: 13.500 | Std: 3.302
 Range: [6.000, 22.000]
 Percentiles: 1%=6.000, 25%=11.333, 75%=16.000, 99%=22.000

📈 Shape Analysis:
 Skewness: 0.04 (Symmetric)
 Kurtosis: -0.10 (Light tails)
 Zeros: 0 (0.0%)
 Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: none
 Reason: Distribution is approximately normal (skewness: 0.04)
 Priority: low
======================================================================
Column: send_hour_max_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 14.692 | Median: 15.000 | Std: 3.750
 Range: [6.000, 22.000]
 Percentiles: 1%=6.000, 25%=12.000, 75%=17.000, 99%=22.000

📈 Shape Analysis:
 Skewness: -0.17 (Symmetric)
 Kurtosis: -0.44 (Light tails)
 Zeros: 0 (0.0%)
 Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: none
 Reason: Distribution is approximately normal (skewness: -0.17)
 Priority: low
======================================================================
Column: send_hour_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.639 | Median: 0.000 | Std: 1.009
 Range: [0.000, 11.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
 Skewness: 2.15 (Right-skewed)
 Kurtosis: 7.46 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 285 (5.7%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.15)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.012 | Median: 0.000 | Std: 0.113
 Range: [0.000, 2.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
 Skewness: 9.70 (Right-skewed)
 Kurtosis: 100.59 (Heavy tails/outliers)
 Zeros: 4,939 (98.8%)
 Outliers (IQR): 59 (1.2%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (98.8%) combined with high skewness (9.70)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.018 | Median: 0.000 | Std: 0.110
 Range: [0.000, 1.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.500

📈 Shape Analysis:
 Skewness: 7.20 (Right-skewed)
 Kurtosis: 55.40 (Heavy tails/outliers)
 Zeros: 1,855 (96.9%)
 Outliers (IQR): 59 (3.1%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (96.9%) combined with high skewness (7.20)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.639 | Median: 0.000 | Std: 1.009
 Range: [0.000, 11.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
 Skewness: 2.15 (Right-skewed)
 Kurtosis: 7.46 (Heavy tails/outliers)
 Zeros: 3,084 (61.7%)
 Outliers (IQR): 285 (5.7%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (61.7%) combined with high skewness (2.15)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: time_to_open_hours_sum_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.581 | Median: 0.000 | Std: 2.158
 Range: [0.000, 26.300]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=11.203

📈 Shape Analysis:
 Skewness: 5.51 (Right-skewed)
 Kurtosis: 38.63 (Heavy tails/outliers)
 Zeros: 4,357 (87.2%)
 Outliers (IQR): 641 (12.8%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (87.2%) combined with high skewness (5.51)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: time_to_open_hours_mean_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 3.861 | Median: 2.800 | Std: 3.559
 Range: [0.000, 25.100]
 Percentiles: 1%=0.047, 25%=1.400, 75%=5.100, 99%=16.412

📈 Shape Analysis:
 Skewness: 1.90 (Right-skewed)
 Kurtosis: 5.01 (Heavy tails/outliers)
 Zeros: 7 (1.1%)
 Outliers (IQR): 40 (6.2%)

🔧 Recommended Transformation: sqrt_transform
 Reason: Moderate skewness (1.90)
 Priority: medium
======================================================================
Column: time_to_open_hours_max_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 4.136 | Median: 3.000 | Std: 3.796
 Range: [0.000, 25.100]
 Percentiles: 1%=0.047, 25%=1.500, 75%=5.525, 99%=16.753

📈 Shape Analysis:
 Skewness: 1.76 (Right-skewed)
 Kurtosis: 3.97 (Heavy tails/outliers)
 Zeros: 7 (1.1%)
 Outliers (IQR): 39 (6.0%)

🔧 Recommended Transformation: sqrt_transform
 Reason: Moderate skewness (1.76)
 Priority: medium
======================================================================
Column: time_to_open_hours_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.152 | Median: 0.000 | Std: 0.424
 Range: [0.000, 4.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000

📈 Shape Analysis:
 Skewness: 3.14 (Right-skewed)
 Kurtosis: 11.52 (Heavy tails/outliers)
 Zeros: 4,350 (87.0%)
 Outliers (IQR): 648 (13.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (87.0%) combined with high skewness (3.14)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: opened_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.315 | Median: 0.000 | Std: 0.633
 Range: [0.000, 6.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=3.000

📈 Shape Analysis:
 Skewness: 2.37 (Right-skewed)
 Kurtosis: 7.21 (Heavy tails/outliers)
 Zeros: 3,794 (75.9%)
 Outliers (IQR): 1,204 (24.1%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (75.9%) combined with high skewness (2.37)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: opened_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.234 | Median: 0.000 | Std: 0.309
 Range: [0.000, 1.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.500, 99%=1.000

📈 Shape Analysis:
 Skewness: 1.19 (Right-skewed)
 Kurtosis: 0.43 (Light tails)
 Zeros: 1,432 (54.3%)
 Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (54.3%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: opened_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 1.316 | Median: 1.000 | Std: 1.656
 Range: [0.000, 15.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
 Skewness: 1.61 (Right-skewed)
 Kurtosis: 4.25 (Heavy tails/outliers)
 Zeros: 2,362 (47.3%)
 Outliers (IQR): 100 (2.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (47.3%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: clicked_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.099 | Median: 0.000 | Std: 0.326
 Range: [0.000, 3.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
 Skewness: 3.48 (Right-skewed)
 Kurtosis: 12.93 (Heavy tails/outliers)
 Zeros: 4,545 (90.9%)
 Outliers (IQR): 453 (9.1%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (90.9%) combined with high skewness (3.48)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_mean_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 0.073 | Median: 0.000 | Std: 0.187
 Range: [0.000, 1.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
 Skewness: 3.06 (Right-skewed)
 Kurtosis: 10.07 (Heavy tails/outliers)
 Zeros: 2,183 (82.8%)
 Outliers (IQR): 453 (17.2%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Zero-inflation (82.8%) combined with high skewness (3.06)
 Priority: high
 ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 1.316 | Median: 1.000 | Std: 1.656
 Range: [0.000, 15.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
 Skewness: 1.61 (Right-skewed)
 Kurtosis: 4.25 (Heavy tails/outliers)
 Zeros: 2,362 (47.3%)
 Outliers (IQR): 100 (2.0%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (47.3%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: send_hour_sum_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
 Mean: 17.825 | Median: 10.000 | Std: 22.838
 Range: [0.000, 193.000]
 Percentiles: 1%=0.000, 25%=0.000, 75%=31.000, 99%=90.060

📈 Shape Analysis:
 Skewness: 1.62 (Right-skewed)
 Kurtosis: 3.96 (Heavy tails/outliers)
 Zeros: 2,362 (47.3%)
 Outliers (IQR): 108 (2.2%)

🔧 Recommended Transformation: zero_inflation_handling
 Reason: Significant zero-inflation (47.3%)
 Priority: medium
 ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: send_hour_mean_365d Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.535 | Median: 13.500 | Std: 2.875 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.702, 75%=15.333, 99%=21.000 📈 Shape Analysis: Skewness: 0.09 (Symmetric) Kurtosis: 0.28 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 62 (2.4%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.09) Priority: low
====================================================================== Column: send_hour_max_365d Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 15.685 | Median: 16.000 | Std: 3.581 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=13.000, 75%=18.000, 99%=22.000 📈 Shape Analysis: Skewness: -0.34 (Symmetric) Kurtosis: -0.23 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.34) Priority: low
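The Shape Analysis lines report skewness and kurtosis. Because some reported kurtosis values are negative (for example -0.23 above), the figure must be excess kurtosis, since raw kurtosis can never fall below 1. The sketch below computes both statistics from central moments and attaches labels like those in the output. The cutoffs (0.5 for skewness, 3.0 for excess kurtosis) are assumptions chosen to agree with the labels printed in this section; the thresholds actually used by `DistributionAnalyzer` are not shown, and the function names are hypothetical.

```python
def shape_stats(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


def describe_shape(skewness, excess_kurtosis, skew_cutoff=0.5, kurt_cutoff=3.0):
    """Label a distribution; both cutoffs are illustrative assumptions."""
    if abs(skewness) < skew_cutoff:
        skew_label = "Symmetric"
    elif skewness > 0:
        skew_label = "Right-skewed"
    else:
        skew_label = "Left-skewed"
    tail_label = "Heavy tails/outliers" if excess_kurtosis >= kurt_cutoff else "Light tails"
    return skew_label, tail_label


print(describe_shape(*shape_stats([1, 2, 3, 4, 5])))
print(describe_shape(3.06, 10.07))
```

Library implementations may apply a small-sample bias correction, so values computed this way can differ slightly from the printed ones.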
======================================================================
Column: send_hour_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.316 | Median: 1.000 | Std: 1.656
  Range: [0.000, 15.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000
📈 Shape Analysis:
  Skewness: 1.61 (Right-skewed)
  Kurtosis: 4.25 (Heavy tails/outliers)
  Zeros: 2,362 (47.3%)
  Outliers (IQR): 100 (2.0%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Significant zero-inflation (47.3%)
  Priority: medium
  ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: bounced_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.026 | Median: 0.000 | Std: 0.167
  Range: [0.000, 3.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 6.95 (Right-skewed)
  Kurtosis: 56.96 (Heavy tails/outliers)
  Zeros: 4,872 (97.5%)
  Outliers (IQR): 126 (2.5%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (97.5%) combined with high skewness (6.95)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_mean_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.019 | Median: 0.000 | Std: 0.100
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.500
📈 Shape Analysis:
  Skewness: 6.70 (Right-skewed)
  Kurtosis: 52.39 (Heavy tails/outliers)
  Zeros: 2,510 (95.2%)
  Outliers (IQR): 126 (4.8%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (95.2%) combined with high skewness (6.70)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.316 | Median: 1.000 | Std: 1.656
  Range: [0.000, 15.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000
📈 Shape Analysis:
  Skewness: 1.61 (Right-skewed)
  Kurtosis: 4.25 (Heavy tails/outliers)
  Zeros: 2,362 (47.3%)
  Outliers (IQR): 100 (2.0%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Significant zero-inflation (47.3%)
  Priority: medium
  ⚠️ Many zero values may indicate a mixture distribution
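clicked_count_365d, send_hour_count_365d and bounced_count_365d report identical statistics above. That suggests, but does not prove, that the three columns hold the same values, presumably because a count aggregation counts events regardless of which source column it is applied to. A quick way to confirm is to group columns by their exact values. The sketch below uses a plain dict of lists as a stand-in for the DataFrame; `find_identical_columns` and the toy values are hypothetical.

```python
def find_identical_columns(table):
    """Group column names whose value sequences are exactly equal."""
    groups = {}
    for name, values in table.items():
        groups.setdefault(tuple(values), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy stand-in for the real data: three count columns share the same values.
toy = {
    "clicked_count_365d": [0, 2, 1, 0, 3],
    "send_hour_count_365d": [0, 2, 1, 0, 3],
    "bounced_count_365d": [0, 2, 1, 0, 3],
    "bounced_sum_365d": [0, 1, 0, 0, 0],
}
print(find_identical_columns(toy))
```

If the columns do turn out to be exact copies, keeping one of them is usually enough, since duplicates add no information and can inflate feature-importance estimates.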
======================================================================
Column: time_to_open_hours_sum_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.221 | Median: 0.000 | Std: 3.252
  Range: [0.000, 37.100]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=16.100
📈 Shape Analysis:
  Skewness: 3.94 (Right-skewed)
  Kurtosis: 20.10 (Heavy tails/outliers)
  Zeros: 3,807 (76.2%)
  Outliers (IQR): 1,191 (23.8%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (76.2%) combined with high skewness (3.94)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
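In the block above, the zero share (76.2%) and the IQR outlier share (23.8%) add up to 100%. That is what the IQR rule produces when the 25th and 75th percentiles are both 0: the IQR is 0, both fences collapse onto 0, and every non-zero value is flagged. The sketch below reproduces the effect on toy data using the conventional 1.5 × IQR fences; whether the library uses exactly this multiplier and quantile method is an assumption.

```python
import statistics


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Toy column: 80% zeros, so Q1 = Q3 = 0 and the fences are [0, 0].
values = [0.0] * 8 + [1.5, 16.1]
nonzero = sum(1 for v in values if v != 0)
print(iqr_outlier_count(values), nonzero)
```

For columns like this, the outlier count is best read as "number of non-zero rows" rather than as evidence of extreme values.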
======================================================================
Column: time_to_open_hours_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3.907 | Median: 2.900 | Std: 3.667
  Range: [0.000, 37.100]
  Percentiles: 1%=0.003, 25%=1.400, 75%=5.300, 99%=16.588
📈 Shape Analysis:
  Skewness: 2.25 (Right-skewed)
  Kurtosis: 9.46 (Heavy tails/outliers)
  Zeros: 13 (1.1%)
  Outliers (IQR): 55 (4.6%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.25) with non-positive values
  Priority: high
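Yeo-Johnson is recommended here because the column contains zeros, which a plain log cannot handle. The sketch below writes out the standard Yeo-Johnson formula for a single value and a fixed power parameter. The fixed `lam=0.5` in the usage line is an arbitrary illustration; in practice the parameter is usually estimated from the data, for example by `scipy.stats.yeojohnson` or scikit-learn's `PowerTransformer(method="yeo-johnson")`. How the library behind this notebook applies the transform is not shown in the output.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of one value for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Toy hours-to-open values, including an exact zero.
hours = [0.0, 1.4, 2.9, 5.3, 37.1]
print([round(yeo_johnson(h, lam=0.5), 3) for h in hours])
```

With `lam=1` the transform is the identity, and with `lam=0` it reduces to `log1p` for non-negative input, which is a convenient sanity check.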
======================================================================
Column: time_to_open_hours_max_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 4.419 | Median: 3.300 | Std: 4.087
  Range: [0.000, 37.100]
  Percentiles: 1%=0.003, 25%=1.500, 75%=6.200, 99%=17.391
📈 Shape Analysis:
  Skewness: 1.90 (Right-skewed)
  Kurtosis: 6.19 (Heavy tails/outliers)
  Zeros: 13 (1.1%)
  Outliers (IQR): 48 (4.0%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (1.90)
  Priority: medium
======================================================================
Column: time_to_open_hours_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.315 | Median: 0.000 | Std: 0.633
  Range: [0.000, 6.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 2.37 (Right-skewed)
  Kurtosis: 7.21 (Heavy tails/outliers)
  Zeros: 3,794 (75.9%)
  Outliers (IQR): 1,204 (24.1%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (75.9%) combined with high skewness (2.37)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: opened_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3.718 | Median: 3.000 | Std: 3.159
  Range: [0.000, 37.000]
  Percentiles: 1%=0.000, 25%=1.000, 75%=5.000, 99%=14.000
📈 Shape Analysis:
  Skewness: 2.27 (Right-skewed)
  Kurtosis: 13.25 (Heavy tails/outliers)
  Zeros: 672 (13.4%)
  Outliers (IQR): 105 (2.1%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.27) with non-positive values
  Priority: high
======================================================================
Column: opened_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.204 | Median: 0.200 | Std: 0.130
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.118, 75%=0.286, 99%=0.524
📈 Shape Analysis:
  Skewness: 0.38 (Symmetric)
  Kurtosis: 0.73 (Light tails)
  Zeros: 672 (13.4%)
  Outliers (IQR): 37 (0.7%)
🔧 Recommended Transformation: none
  Reason: Distribution is approximately normal (skewness: 0.38)
  Priority: low
======================================================================
Column: opened_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 16.566 | Median: 16.000 | Std: 9.139
  Range: [1.000, 112.000]
  Percentiles: 1%=2.000, 25%=12.000, 75%=19.000, 99%=53.000
📈 Shape Analysis:
  Skewness: 2.61 (Right-skewed)
  Kurtosis: 16.11 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 298 (6.0%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.61) with significant outliers (6.0%)
  Priority: high
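The `cap_then_log` recommendation combines two steps: clip extreme values at an upper bound, then apply a log to compress what remains. The sketch below uses the 99th percentile printed above (53.0) as the cap and `log1p` as the log step. Both choices are assumptions for illustration, since the output does not state which cap or log variant the library applies, and the helper name is hypothetical.

```python
import math


def cap_then_log(values, cap):
    """Clip each value at `cap`, then apply log1p."""
    return [math.log1p(min(v, cap)) for v in values]


# Toy counts spanning the printed range [1, 112]; 53.0 is the printed 99th percentile.
counts = [1, 12, 16, 19, 53, 112]
capped_logged = cap_then_log(counts, cap=53.0)
print([round(v, 3) for v in capped_logged])
```

Capping first means the single largest observations no longer stretch the scale, while the log handles the remaining right skew. The cap should be computed on training data only and reused unchanged at scoring time.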
======================================================================
Column: clicked_sum_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.138 | Median: 1.000 | Std: 1.326
  Range: [0.000, 15.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=6.000
📈 Shape Analysis:
  Skewness: 2.00 (Right-skewed)
  Kurtosis: 8.18 (Heavy tails/outliers)
  Zeros: 1,967 (39.4%)
  Outliers (IQR): 52 (1.0%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (39.4%) combined with high skewness (2.00)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: clicked_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.063 | Median: 0.056 | Std: 0.068
  Range: [0.000, 0.500]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.105, 99%=0.250
📈 Shape Analysis:
  Skewness: 1.26 (Right-skewed)
  Kurtosis: 2.33 (Light tails)
  Zeros: 1,967 (39.4%)
  Outliers (IQR): 48 (1.0%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Significant zero-inflation (39.4%)
  Priority: medium
  ⚠️ Many zero values may indicate a mixture distribution
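The two blocks above share the same zero share (39.4%) but receive different priorities, which hints at how the recommendations are ordered: zero-inflation is checked first, and skewness then decides the priority. The sketch below is a hypothetical reconstruction of a rule set that reproduces the recommendations printed in this section. It is not the library's code: the function name, the 30% zero cutoff, the 5% outlier cutoff and the skewness bands are all inferred from the output, and the final `log_transform` branch never occurs in this section, so its name is a guess.

```python
def recommend(skewness, zero_pct, outlier_pct, minimum,
              zero_cutoff=30.0, outlier_cutoff=5.0):
    """Return (transformation, priority); every threshold is an inferred assumption."""
    high_skew = abs(skewness) >= 2.0
    if zero_pct >= zero_cutoff:
        priority = "high" if high_skew else "medium"
        return "zero_inflation_handling", priority
    if abs(skewness) < 1.0:
        return "none", "low"
    if not high_skew:
        return "sqrt_transform", "medium"
    if minimum < 0:
        return "yeo_johnson", "high"      # negative values rule out log/sqrt
    if outlier_pct > outlier_cutoff:
        return "cap_then_log", "high"
    if minimum <= 0:
        return "yeo_johnson", "high"      # zeros present, few outliers
    return "log_transform", "high"        # not observed in this section; assumed name


print(recommend(skewness=2.00, zero_pct=39.4, outlier_pct=1.0, minimum=0.0))
print(recommend(skewness=1.26, zero_pct=39.4, outlier_pct=1.0, minimum=0.0))
```

Writing the rules out this way makes it easier to see why a column landed in a given bucket and to override the recommendation when domain knowledge disagrees.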
======================================================================
Column: clicked_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 16.566 | Median: 16.000 | Std: 9.139
  Range: [1.000, 112.000]
  Percentiles: 1%=2.000, 25%=12.000, 75%=19.000, 99%=53.000
📈 Shape Analysis:
  Skewness: 2.61 (Right-skewed)
  Kurtosis: 16.11 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 298 (6.0%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.61) with significant outliers (6.0%)
  Priority: high
======================================================================
Column: send_hour_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 224.062 | Median: 222.000 | Std: 124.178
  Range: [6.000, 1501.000]
  Percentiles: 1%=23.000, 25%=161.000, 75%=261.750, 99%=703.360
📈 Shape Analysis:
  Skewness: 2.52 (Right-skewed)
  Kurtosis: 15.09 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 253 (5.1%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.52) with significant outliers (5.1%)
  Priority: high
======================================================================
Column: send_hour_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 13.518 | Median: 13.529 | Std: 1.173
  Range: [6.000, 21.000]
  Percentiles: 1%=10.398, 25%=12.846, 75%=14.188, 99%=16.500
📈 Shape Analysis:
  Skewness: -0.25 (Symmetric)
  Kurtosis: 3.46 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 158 (3.2%)
🔧 Recommended Transformation: none
  Reason: Distribution is approximately normal (skewness: -0.25)
  Priority: low
======================================================================
Column: send_hour_max_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 19.871 | Median: 20.000 | Std: 2.169
  Range: [6.000, 22.000]
  Percentiles: 1%=13.000, 25%=19.000, 75%=22.000, 99%=22.000
📈 Shape Analysis:
  Skewness: -1.44 (Left-skewed)
  Kurtosis: 3.55 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 122 (2.4%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (-1.44)
  Priority: medium
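The block above recommends `sqrt_transform` for a left-skewed column (skewness -1.44). A square root is a concave transform, so applied directly it compresses the upper end and tends to make left skew stronger rather than weaker. A common workaround, shown here as a sketch and not as what the library does, is to reflect the values around the maximum first so the long tail points right, then take the square root. The helper names and the toy send hours are hypothetical.

```python
import math


def skewness(values):
    """Population skewness from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def reflected_sqrt(values):
    """Reflect around max + 1 (so the result is >= 1), then take the square root."""
    top = max(values)
    return [math.sqrt(top + 1 - v) for v in values]


# Toy send hours: clustered near 22 with one early outlier, so left-skewed.
hours = [6, 18, 20, 21, 22, 22]
raw = skewness(hours)
plain = skewness([math.sqrt(v) for v in hours])
reflected = skewness(reflected_sqrt(hours))
print(round(raw, 2), round(plain, 2), round(reflected, 2))
```

Reflection reverses the ordering of the feature (late hours become small values), so the sign of any downstream coefficient flips. That is worth recording alongside the transformation.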
======================================================================
Column: send_hour_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 16.566 | Median: 16.000 | Std: 9.139
  Range: [1.000, 112.000]
  Percentiles: 1%=2.000, 25%=12.000, 75%=19.000, 99%=53.000
📈 Shape Analysis:
  Skewness: 2.61 (Right-skewed)
  Kurtosis: 16.11 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 298 (6.0%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.61) with significant outliers (6.0%)
  Priority: high
======================================================================
Column: bounced_sum_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.354 | Median: 0.000 | Std: 0.620
  Range: [0.000, 4.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 1.92 (Right-skewed)
  Kurtosis: 4.18 (Heavy tails/outliers)
  Zeros: 3,553 (71.1%)
  Outliers (IQR): 53 (1.1%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Significant zero-inflation (71.1%)
  Priority: medium
  ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: bounced_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.022 | Median: 0.000 | Std: 0.048
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.042, 99%=0.182
📈 Shape Analysis:
  Skewness: 7.27 (Right-skewed)
  Kurtosis: 114.67 (Heavy tails/outliers)
  Zeros: 3,553 (71.1%)
  Outliers (IQR): 239 (4.8%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (71.1%) combined with high skewness (7.27)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: bounced_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 16.566 | Median: 16.000 | Std: 9.139
  Range: [1.000, 112.000]
  Percentiles: 1%=2.000, 25%=12.000, 75%=19.000, 99%=53.000
📈 Shape Analysis:
  Skewness: 2.61 (Right-skewed)
  Kurtosis: 16.11 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 298 (6.0%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.61) with significant outliers (6.0%)
  Priority: high
======================================================================
Column: time_to_open_hours_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 14.680 | Median: 11.500 | Std: 14.908
  Range: [0.000, 170.500]
  Percentiles: 1%=0.000, 25%=3.525, 75%=21.300, 99%=66.142
📈 Shape Analysis:
  Skewness: 2.23 (Right-skewed)
  Kurtosis: 10.23 (Heavy tails/outliers)
  Zeros: 678 (13.6%)
  Outliers (IQR): 163 (3.3%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.23) with non-positive values
  Priority: high
======================================================================
Column: time_to_open_hours_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3.918 | Median: 3.550 | Std: 2.388
  Range: [0.000, 29.600]
  Percentiles: 1%=0.250, 25%=2.400, 75%=4.969, 99%=12.075
📈 Shape Analysis:
  Skewness: 2.00 (Right-skewed)
  Kurtosis: 9.83 (Heavy tails/outliers)
  Zeros: 6 (0.1%)
  Outliers (IQR): 144 (3.3%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.00) with non-positive values
  Priority: high
======================================================================
Column: time_to_open_hours_max_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 7.817 | Median: 6.800 | Std: 5.218
  Range: [0.000, 43.200]
  Percentiles: 1%=0.300, 25%=4.100, 75%=10.500, 99%=24.450
📈 Shape Analysis:
  Skewness: 1.27 (Right-skewed)
  Kurtosis: 2.83 (Light tails)
  Zeros: 6 (0.1%)
  Outliers (IQR): 122 (2.8%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (1.27)
  Priority: medium
======================================================================
Column: time_to_open_hours_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3.718 | Median: 3.000 | Std: 3.159
  Range: [0.000, 37.000]
  Percentiles: 1%=0.000, 25%=1.000, 75%=5.000, 99%=14.000
📈 Shape Analysis:
  Skewness: 2.27 (Right-skewed)
  Kurtosis: 13.25 (Heavy tails/outliers)
  Zeros: 672 (13.4%)
  Outliers (IQR): 105 (2.1%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.27) with non-positive values
  Priority: high
======================================================================
Column: days_since_last_event_x
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 853.054 | Median: 314.000 | Std: 983.953
  Range: [0.000, 3284.000]
  Percentiles: 1%=3.000, 25%=95.000, 75%=1539.750, 99%=3175.060
📈 Shape Analysis:
  Skewness: 1.03 (Right-skewed)
  Kurtosis: -0.39 (Light tails)
  Zeros: 22 (0.4%)
  Outliers (IQR): 0 (0.0%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (1.03)
  Priority: medium
======================================================================
Column: days_since_first_event_x
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3129.425 | Median: 3179.000 | Std: 158.137
  Range: [1958.000, 3285.000]
  Percentiles: 1%=2569.940, 25%=3063.000, 75%=3244.000, 99%=3284.000
📈 Shape Analysis:
  Skewness: -1.88 (Left-skewed)
  Kurtosis: 4.95 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 217 (4.3%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (-1.88)
  Priority: medium
======================================================================
Column: dow_sin
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.380 | Median: 0.434 | Std: 0.408
  Range: [-1.000, 1.000]
  Percentiles: 1%=-0.782, 25%=0.133, 75%=0.700, 99%=0.997
📈 Shape Analysis:
  Skewness: -0.73 (Left-skewed)
  Kurtosis: 0.21 (Light tails)
  Zeros: 8 (0.2%)
  Outliers (IQR): 74 (1.5%)
🔧 Recommended Transformation: none
  Reason: Distribution is approximately normal (skewness: -0.73)
  Priority: low
======================================================================
Column: dow_cos
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: -0.781 | Median: -0.879 | Std: 0.282
  Range: [-1.000, 1.000]
  Percentiles: 1%=-1.000, 25%=-0.969, 75%=-0.701, 99%=0.475
📈 Shape Analysis:
  Skewness: 2.58 (Right-skewed)
  Kurtosis: 8.92 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 297 (5.9%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (2.58) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
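The names dow_sin and dow_cos, together with their [-1, 1] range, suggest a cyclical encoding of day of week. The sketch below shows the usual construction, assuming days are numbered 0 to 6. That numbering, the function name, and the idea that the columns were built this way are assumptions. The percentiles above are not limited to the seven values this function can return, so the columns were presumably aggregated (for example averaged) per entity afterwards.

```python
import math


def encode_day_of_week(day):
    """Map a day index 0..6 to a point on the unit circle."""
    angle = 2.0 * math.pi * day / 7.0
    return math.sin(angle), math.cos(angle)


def circle_distance(day_a, day_b):
    """Euclidean distance between two encoded days."""
    sin_a, cos_a = encode_day_of_week(day_a)
    sin_b, cos_b = encode_day_of_week(day_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Day 6 and day 0 are neighbours on the circle, just like day 0 and day 1.
print(round(circle_distance(6, 0), 3), round(circle_distance(0, 1), 3))
```

The encoding is useful because the sine and cosine components together preserve distances between days. Transforming one component on its own, as a skewness-driven recommendation would, changes that geometry, so it may be worth reviewing whether the recommendation should be applied to encoded features like these.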
======================================================================
Column: bounced_momentum_180_365
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.997 | Median: 1.000 | Std: 0.136
  Range: [0.000, 4.000]
  Percentiles: 1%=1.000, 25%=1.000, 75%=1.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 4.10 (Right-skewed)
  Kurtosis: 157.83 (Heavy tails/outliers)
  Zeros: 45 (0.9%)
  Outliers (IQR): 81 (1.6%)
🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (4.10) with non-positive values
  Priority: high
======================================================================
Column: clicked_momentum_180_365
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.997 | Median: 1.000 | Std: 0.257
  Range: [0.000, 4.000]
  Percentiles: 1%=0.000, 25%=1.000, 75%=1.000, 99%=2.000
📈 Shape Analysis:
  Skewness: 1.95 (Right-skewed)
  Kurtosis: 33.12 (Heavy tails/outliers)
  Zeros: 149 (3.0%)
  Outliers (IQR): 309 (6.2%)
🔧 Recommended Transformation: sqrt_transform
  Reason: Moderate skewness (1.95)
  Priority: medium
======================================================================
Column: lag0_opened_sum
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.179 | Median: 0.000 | Std: 0.415
  Range: [0.000, 6.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 2.65 (Right-skewed)
  Kurtosis: 12.49 (Heavy tails/outliers)
  Zeros: 4,152 (83.1%)
  Outliers (IQR): 846 (16.9%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (83.1%) combined with high skewness (2.65)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_opened_mean
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.147 | Median: 0.000 | Std: 0.339
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 1.99 (Right-skewed)
  Kurtosis: 2.15 (Light tails)
  Zeros: 4,152 (83.1%)
  Outliers (IQR): 846 (16.9%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Significant zero-inflation (83.1%)
  Priority: medium
  ⚠️ Many zero values may indicate a mixture distribution
======================================================================
Column: lag0_opened_count
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.221 | Median: 1.000 | Std: 0.661
  Range: [1.000, 24.000]
  Percentiles: 1%=1.000, 25%=1.000, 75%=1.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 12.96 (Right-skewed)
  Kurtosis: 350.62 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 891 (17.8%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (12.96) with significant outliers (17.8%)
  Priority: high
======================================================================
Column: lag0_clicked_sum
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.054 | Median: 0.000 | Std: 0.234
  Range: [0.000, 4.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 4.73 (Right-skewed)
  Kurtosis: 29.86 (Heavy tails/outliers)
  Zeros: 4,733 (94.7%)
  Outliers (IQR): 265 (5.3%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (94.7%) combined with high skewness (4.73)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_clicked_mean
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.043 | Median: 0.000 | Std: 0.192
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 4.51 (Right-skewed)
  Kurtosis: 19.06 (Heavy tails/outliers)
  Zeros: 4,733 (94.7%)
  Outliers (IQR): 265 (5.3%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (94.7%) combined with high skewness (4.51)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_clicked_count
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.221 | Median: 1.000 | Std: 0.661
  Range: [1.000, 24.000]
  Percentiles: 1%=1.000, 25%=1.000, 75%=1.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 12.96 (Right-skewed)
  Kurtosis: 350.62 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 891 (17.8%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (12.96) with significant outliers (17.8%)
  Priority: high
======================================================================
Column: lag0_send_hour_sum
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 16.543 | Median: 15.000 | Std: 10.061
  Range: [6.000, 346.000]
  Percentiles: 1%=6.000, 25%=11.000, 75%=19.000, 99%=45.000
📈 Shape Analysis:
  Skewness: 10.65 (Right-skewed)
  Kurtosis: 276.91 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 278 (5.6%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (10.65) with significant outliers (5.6%)
  Priority: high
======================================================================
Column: lag0_send_hour_mean
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 13.541 | Median: 13.500 | Std: 3.626
  Range: [6.000, 22.000]
  Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000
📈 Shape Analysis:
  Skewness: 0.05 (Symmetric)
  Kurtosis: -0.40 (Light tails)
  Zeros: 0 (0.0%)
  Outliers (IQR): 0 (0.0%)
🔧 Recommended Transformation: none
  Reason: Distribution is approximately normal (skewness: 0.05)
  Priority: low
======================================================================
Column: lag0_send_hour_count
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.221 | Median: 1.000 | Std: 0.661
  Range: [1.000, 24.000]
  Percentiles: 1%=1.000, 25%=1.000, 75%=1.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 12.96 (Right-skewed)
  Kurtosis: 350.62 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 891 (17.8%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (12.96) with significant outliers (17.8%)
  Priority: high
======================================================================
Column: lag0_send_hour_max
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 13.964 | Median: 14.000 | Std: 3.821
  Range: [6.000, 22.000]
  Percentiles: 1%=6.000, 25%=11.000, 75%=17.000, 99%=22.000
📈 Shape Analysis:
  Skewness: -0.03 (Symmetric)
  Kurtosis: -0.55 (Light tails)
  Zeros: 0 (0.0%)
  Outliers (IQR): 0 (0.0%)
🔧 Recommended Transformation: none
  Reason: Distribution is approximately normal (skewness: -0.03)
  Priority: low
======================================================================
Column: lag0_bounced_sum
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.024 | Median: 0.000 | Std: 0.153
  Range: [0.000, 2.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 6.44 (Right-skewed)
  Kurtosis: 40.95 (Heavy tails/outliers)
  Zeros: 4,881 (97.7%)
  Outliers (IQR): 117 (2.3%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (97.7%) combined with high skewness (6.44)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_bounced_mean
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.018 | Median: 0.000 | Std: 0.127
  Range: [0.000, 1.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 7.14 (Right-skewed)
  Kurtosis: 50.75 (Heavy tails/outliers)
  Zeros: 4,881 (97.7%)
  Outliers (IQR): 117 (2.3%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (97.7%) combined with high skewness (7.14)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_bounced_count
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 1.221 | Median: 1.000 | Std: 0.661
  Range: [1.000, 24.000]
  Percentiles: 1%=1.000, 25%=1.000, 75%=1.000, 99%=3.000
📈 Shape Analysis:
  Skewness: 12.96 (Right-skewed)
  Kurtosis: 350.62 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 891 (17.8%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (12.96) with significant outliers (17.8%)
  Priority: high
======================================================================
Column: lag0_time_to_open_hours_sum
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.689 | Median: 0.000 | Std: 2.340
  Range: [0.000, 29.600]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=11.703
📈 Shape Analysis:
  Skewness: 5.34 (Right-skewed)
  Kurtosis: 37.08 (Heavy tails/outliers)
  Zeros: 4,162 (83.3%)
  Outliers (IQR): 836 (16.7%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (83.3%) combined with high skewness (5.34)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
======================================================================
Column: lag0_time_to_open_hours_mean
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 3.895 | Median: 2.600 | Std: 4.136
  Range: [0.000, 29.600]
  Percentiles: 1%=0.000, 25%=1.200, 75%=5.200, 99%=19.850
📈 Shape Analysis:
  Skewness: 2.36 (Right-skewed)
  Kurtosis: 7.64 (Heavy tails/outliers)
  Zeros: 10 (1.2%)
  Outliers (IQR): 47 (5.6%)
🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (2.36) with significant outliers (5.6%)
  Priority: high
======================================================================
Column: lag0_time_to_open_hours_count
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
  Mean: 0.179 | Median: 0.000 | Std: 0.415
  Range: [0.000, 6.000]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000
📈 Shape Analysis:
  Skewness: 2.65 (Right-skewed)
  Kurtosis: 12.49 (Heavy tails/outliers)
  Zeros: 4,152 (83.1%)
  Outliers (IQR): 846 (16.9%)
🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (83.1%) combined with high skewness (2.65)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag0_time_to_open_hours_max Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 3.984 | Median: 2.700 | Std: 4.212 Range: [0.000, 29.600] Percentiles: 1%=0.000, 25%=1.200, 75%=5.200, 99%=19.850 📈 Shape Analysis: Skewness: 2.27 (Right-skewed) Kurtosis: 6.95 (Heavy tails/outliers) Zeros: 10 (1.2%) Outliers (IQR): 54 (6.4%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (2.27) with significant outliers (6.4%) Priority: high
====================================================================== Column: lag1_opened_sum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.203 | Median: 0.000 | Std: 0.433 Range: [0.000, 3.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.570 📈 Shape Analysis: Skewness: 2.01 (Right-skewed) Kurtosis: 3.81 (Heavy tails/outliers) Zeros: 763 (80.8%) Outliers (IQR): 181 (19.2%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (80.8%) combined with high skewness (2.01) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
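In the zero-inflated columns, the IQR outlier share is simply the complement of the zero share: for `lag1_opened_sum`, 763 zeros plus 181 outliers account for all 944 rows. That is what a standard IQR rule produces when at least 75% of the values are zero, because Q1 = Q3 = 0 collapses both fences to zero and every non-zero value is flagged. The sketch below assumes the common 1.5 × IQR fences; the multiplier used by the analyzer is not stated in this output.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# 80% zeros: both quartiles are 0, so the fences collapse to [0, 0].
sample = pd.Series([0] * 8 + [1, 2])
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), int((sample != 0).sum()))
```

For these columns the outlier count therefore restates the zero-inflation finding rather than pointing to separate extreme values.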
====================================================================== Column: lag1_opened_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.172 | Median: 0.000 | Std: 0.365 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 1.74 (Right-skewed) Kurtosis: 1.15 (Light tails) Zeros: 763 (80.8%) Outliers (IQR): 181 (19.2%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (80.8%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: lag1_opened_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.224 | Median: 0.000 | Std: 0.530 Range: [0.000, 8.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.59 (Right-skewed) Kurtosis: 23.31 (Heavy tails/outliers) Zeros: 4,054 (81.1%) Outliers (IQR): 944 (18.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.1%) combined with high skewness (3.59) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_clicked_sum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.061 | Median: 0.000 | Std: 0.245 Range: [0.000, 2.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 3.87 (Right-skewed) Kurtosis: 14.15 (Heavy tails/outliers) Zeros: 887 (94.0%) Outliers (IQR): 57 (6.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (94.0%) combined with high skewness (3.87) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_clicked_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.053 | Median: 0.000 | Std: 0.216 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 4.03 (Right-skewed) Kurtosis: 14.57 (Heavy tails/outliers) Zeros: 887 (94.0%) Outliers (IQR): 57 (6.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (94.0%) combined with high skewness (4.03) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_clicked_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.224 | Median: 0.000 | Std: 0.530 Range: [0.000, 8.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.59 (Right-skewed) Kurtosis: 23.31 (Heavy tails/outliers) Zeros: 4,054 (81.1%) Outliers (IQR): 944 (18.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.1%) combined with high skewness (3.59) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_send_hour_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 16.089 | Median: 14.000 | Std: 9.233 Range: [6.000, 115.000] Percentiles: 1%=6.000, 25%=11.000, 75%=18.000, 99%=52.280 📈 Shape Analysis: Skewness: 3.61 (Right-skewed) Kurtosis: 22.79 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 74 (7.8%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (3.61) with significant outliers (7.8%) Priority: high
====================================================================== Column: lag1_send_hour_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.503 | Median: 13.500 | Std: 3.789 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000 📈 Shape Analysis: Skewness: 0.01 (Symmetric) Kurtosis: -0.41 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.01) Priority: low
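The shape statistics in these reports can be reproduced approximately with `scipy.stats`. The negative kurtosis values (for example -0.41 above) indicate that the report uses excess (Fisher) kurtosis, where a normal distribution scores 0, since raw kurtosis can never be negative. Whether the analyzer applies a bias correction is not visible in this output, so the sketch below uses SciPy's defaults and the helper name `shape_summary` is hypothetical.

```python
import pandas as pd
from scipy import stats


def shape_summary(values: pd.Series) -> dict:
    """Hypothetical helper: skewness, excess kurtosis, and zero share."""
    clean = values.dropna()
    return {
        "skewness": float(stats.skew(clean)),
        # fisher=True is SciPy's default: normal distribution -> 0.
        "excess_kurtosis": float(stats.kurtosis(clean, fisher=True)),
        "zero_pct": float(clean.eq(0).mean() * 100),
    }


# Symmetric toy data: skewness near 0, negative excess kurtosis.
summary = shape_summary(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]))
print(summary)
```

A skewness near zero shows symmetry only; it does not on its own establish that a column is normally distributed, so a histogram or Q-Q plot remains a useful check.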
====================================================================== Column: lag1_send_hour_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.224 | Median: 0.000 | Std: 0.530 Range: [0.000, 8.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.59 (Right-skewed) Kurtosis: 23.31 (Heavy tails/outliers) Zeros: 4,054 (81.1%) Outliers (IQR): 944 (18.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.1%) combined with high skewness (3.59) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_send_hour_max Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.854 | Median: 14.000 | Std: 3.975 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=17.000, 99%=22.000 📈 Shape Analysis: Skewness: -0.05 (Symmetric) Kurtosis: -0.56 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.05) Priority: low
====================================================================== Column: lag1_bounced_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.016 | Median: 0.000 | Std: 0.118 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 7.81 (Right-skewed) Kurtosis: 61.15 (Heavy tails/outliers) Zeros: 924 (97.9%) Outliers (IQR): 20 (2.1%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (97.9%) combined with high skewness (7.81) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_bounced_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.224 | Median: 0.000 | Std: 0.530 Range: [0.000, 8.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.59 (Right-skewed) Kurtosis: 23.31 (Heavy tails/outliers) Zeros: 4,054 (81.1%) Outliers (IQR): 944 (18.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.1%) combined with high skewness (3.59) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_time_to_open_hours_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.791 | Median: 0.000 | Std: 2.249 Range: [0.000, 24.900] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=10.613 📈 Shape Analysis: Skewness: 4.05 (Right-skewed) Kurtosis: 22.65 (Heavy tails/outliers) Zeros: 766 (81.1%) Outliers (IQR): 178 (18.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.1%) combined with high skewness (4.05) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_time_to_open_hours_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 3.963 | Median: 3.200 | Std: 3.505 Range: [0.000, 24.900] Percentiles: 1%=0.000, 25%=1.400, 75%=5.700, 99%=13.820 📈 Shape Analysis: Skewness: 1.92 (Right-skewed) Kurtosis: 6.71 (Heavy tails/outliers) Zeros: 3 (1.7%) Outliers (IQR): 5 (2.8%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.92) Priority: medium
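For moderately skewed, non-negative columns such as the one above, a square-root transform is a gentler alternative to a log. The following sketch assumes a plain `np.sqrt` with a guard against negative input; the function name and the guard are illustrative choices, not the pipeline's code.

```python
import numpy as np
import pandas as pd


def sqrt_transform(values: pd.Series) -> pd.Series:
    """Hypothetical helper: square-root transform for non-negative data."""
    if (values.dropna() < 0).any():
        raise ValueError("sqrt_transform expects non-negative values")
    return np.sqrt(values)


# Toy values in the same range as lag1_time_to_open_hours_mean.
sample = pd.Series([0.0, 1.4, 3.2, 5.7, 24.9])
transformed = sqrt_transform(sample)
print(round(sample.skew(), 2), round(transformed.skew(), 2))
```

Unlike a log, the square root is defined at zero without an offset, which suits columns that contain a small number of exact zeros.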
====================================================================== Column: lag1_time_to_open_hours_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.038 | Median: 0.000 | Std: 0.204 Range: [0.000, 3.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 5.77 (Right-skewed) Kurtosis: 37.96 (Heavy tails/outliers) Zeros: 4,817 (96.4%) Outliers (IQR): 181 (3.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (96.4%) combined with high skewness (5.77) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag1_time_to_open_hours_max Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 4.025 | Median: 3.400 | Std: 3.502 Range: [0.000, 24.900] Percentiles: 1%=0.000, 25%=1.500, 75%=5.800, 99%=13.820 📈 Shape Analysis: Skewness: 1.88 (Right-skewed) Kurtosis: 6.61 (Heavy tails/outliers) Zeros: 3 (1.7%) Outliers (IQR): 5 (2.8%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.88) Priority: medium
====================================================================== Column: lag2_opened_sum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.220 | Median: 0.000 | Std: 0.450 Range: [0.000, 4.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 2.22 (Right-skewed) Kurtosis: 7.42 (Heavy tails/outliers) Zeros: 714 (79.2%) Outliers (IQR): 188 (20.8%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (79.2%) combined with high skewness (2.22) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_opened_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.180 | Median: 0.000 | Std: 0.370 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 1.67 (Right-skewed) Kurtosis: 0.94 (Light tails) Zeros: 714 (79.2%) Outliers (IQR): 188 (20.8%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (79.2%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: lag2_opened_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.214 | Median: 0.000 | Std: 0.518 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.53 (Right-skewed) Kurtosis: 20.98 (Heavy tails/outliers) Zeros: 4,096 (82.0%) Outliers (IQR): 902 (18.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (82.0%) combined with high skewness (3.53) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_clicked_sum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.065 | Median: 0.000 | Std: 0.252 Range: [0.000, 2.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 3.73 (Right-skewed) Kurtosis: 12.97 (Heavy tails/outliers) Zeros: 844 (93.6%) Outliers (IQR): 58 (6.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (93.6%) combined with high skewness (3.73) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_clicked_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.053 | Median: 0.000 | Std: 0.214 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 4.02 (Right-skewed) Kurtosis: 14.64 (Heavy tails/outliers) Zeros: 844 (93.6%) Outliers (IQR): 58 (6.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (93.6%) combined with high skewness (4.02) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_clicked_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.214 | Median: 0.000 | Std: 0.518 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.53 (Right-skewed) Kurtosis: 20.98 (Heavy tails/outliers) Zeros: 4,096 (82.0%) Outliers (IQR): 902 (18.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (82.0%) combined with high skewness (3.53) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_send_hour_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 15.989 | Median: 14.000 | Std: 8.653 Range: [6.000, 91.000] Percentiles: 1%=6.000, 25%=11.000, 75%=18.000, 99%=52.980 📈 Shape Analysis: Skewness: 3.24 (Right-skewed) Kurtosis: 16.68 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 63 (7.0%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (3.24) with significant outliers (7.0%) Priority: high
====================================================================== Column: lag2_send_hour_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.514 | Median: 13.500 | Std: 3.683 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000 📈 Shape Analysis: Skewness: 0.05 (Symmetric) Kurtosis: -0.43 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.05) Priority: low
====================================================================== Column: lag2_send_hour_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.214 | Median: 0.000 | Std: 0.518 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.53 (Right-skewed) Kurtosis: 20.98 (Heavy tails/outliers) Zeros: 4,096 (82.0%) Outliers (IQR): 902 (18.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (82.0%) combined with high skewness (3.53) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_send_hour_max Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.822 | Median: 14.000 | Std: 3.815 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000 📈 Shape Analysis: Skewness: -0.02 (Symmetric) Kurtosis: -0.52 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.02) Priority: low
====================================================================== Column: lag2_bounced_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.021 | Median: 0.000 | Std: 0.138 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 6.61 (Right-skewed) Kurtosis: 42.98 (Heavy tails/outliers) Zeros: 879 (97.5%) Outliers (IQR): 23 (2.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (97.5%) combined with high skewness (6.61) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_bounced_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.214 | Median: 0.000 | Std: 0.518 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.53 (Right-skewed) Kurtosis: 20.98 (Heavy tails/outliers) Zeros: 4,096 (82.0%) Outliers (IQR): 902 (18.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (82.0%) combined with high skewness (3.53) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_time_to_open_hours_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.909 | Median: 0.000 | Std: 2.452 Range: [0.000, 21.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=11.999 📈 Shape Analysis: Skewness: 3.65 (Right-skewed) Kurtosis: 16.23 (Heavy tails/outliers) Zeros: 714 (79.2%) Outliers (IQR): 188 (20.8%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (79.2%) combined with high skewness (3.65) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_time_to_open_hours_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 4.198 | Median: 3.425 | Std: 3.509 Range: [0.100, 19.700] Percentiles: 1%=0.187, 25%=1.475, 75%=5.650, 99%=14.208 📈 Shape Analysis: Skewness: 1.33 (Right-skewed) Kurtosis: 2.11 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 9 (4.8%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.33) Priority: medium
====================================================================== Column: lag2_time_to_open_hours_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.040 | Median: 0.000 | Std: 0.209 Range: [0.000, 4.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 6.27 (Right-skewed) Kurtosis: 54.88 (Heavy tails/outliers) Zeros: 4,810 (96.2%) Outliers (IQR): 188 (3.8%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (96.2%) combined with high skewness (6.27) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag2_time_to_open_hours_max Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 4.292 | Median: 3.450 | Std: 3.588 Range: [0.100, 19.700] Percentiles: 1%=0.187, 25%=1.500, 75%=5.900, 99%=14.295 📈 Shape Analysis: Skewness: 1.29 (Right-skewed) Kurtosis: 1.86 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 7 (3.7%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.29) Priority: medium
====================================================================== Column: lag3_opened_sum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.203 | Median: 0.000 | Std: 0.421 Range: [0.000, 2.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 1.78 (Right-skewed) Kurtosis: 1.99 (Light tails) Zeros: 748 (80.4%) Outliers (IQR): 182 (19.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (80.4%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: lag3_opened_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.173 | Median: 0.000 | Std: 0.365 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 1.74 (Right-skewed) Kurtosis: 1.16 (Light tails) Zeros: 748 (80.4%) Outliers (IQR): 182 (19.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (80.4%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: lag3_opened_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.221 | Median: 0.000 | Std: 0.520 Range: [0.000, 6.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.19 (Right-skewed) Kurtosis: 15.50 (Heavy tails/outliers) Zeros: 4,068 (81.4%) Outliers (IQR): 930 (18.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.4%) combined with high skewness (3.19) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_clicked_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.049 | Median: 0.000 | Std: 0.210 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 4.19 (Right-skewed) Kurtosis: 15.91 (Heavy tails/outliers) Zeros: 879 (94.5%) Outliers (IQR): 51 (5.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (94.5%) combined with high skewness (4.19) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_clicked_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.221 | Median: 0.000 | Std: 0.520 Range: [0.000, 6.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.19 (Right-skewed) Kurtosis: 15.50 (Heavy tails/outliers) Zeros: 4,068 (81.4%) Outliers (IQR): 930 (18.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.4%) combined with high skewness (3.19) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_send_hour_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 16.137 | Median: 14.000 | Std: 8.406 Range: [6.000, 72.000] Percentiles: 1%=6.000, 25%=11.000, 75%=18.000, 99%=50.840 📈 Shape Analysis: Skewness: 2.68 (Right-skewed) Kurtosis: 10.46 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 74 (8.0%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (2.68) with significant outliers (8.0%) Priority: high
====================================================================== Column: lag3_send_hour_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.593 | Median: 13.500 | Std: 3.625 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000 📈 Shape Analysis: Skewness: 0.06 (Symmetric) Kurtosis: -0.44 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.06) Priority: low
====================================================================== Column: lag3_send_hour_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.221 | Median: 0.000 | Std: 0.520 Range: [0.000, 6.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.19 (Right-skewed) Kurtosis: 15.50 (Heavy tails/outliers) Zeros: 4,068 (81.4%) Outliers (IQR): 930 (18.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.4%) combined with high skewness (3.19) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_send_hour_max Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 13.923 | Median: 14.000 | Std: 3.774 Range: [6.000, 22.000] Percentiles: 1%=6.000, 25%=11.000, 75%=17.000, 99%=22.000 📈 Shape Analysis: Skewness: -0.01 (Symmetric) Kurtosis: -0.56 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.01) Priority: low
====================================================================== Column: lag3_bounced_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.027 | Median: 0.000 | Std: 0.156 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 5.88 (Right-skewed) Kurtosis: 33.29 (Heavy tails/outliers) Zeros: 902 (97.0%) Outliers (IQR): 28 (3.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (97.0%) combined with high skewness (5.88) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_bounced_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.221 | Median: 0.000 | Std: 0.520 Range: [0.000, 6.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 3.19 (Right-skewed) Kurtosis: 15.50 (Heavy tails/outliers) Zeros: 4,068 (81.4%) Outliers (IQR): 930 (18.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (81.4%) combined with high skewness (3.19) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_time_to_open_hours_sum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.781 | Median: 0.000 | Std: 2.295 Range: [0.000, 19.900] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=11.242 📈 Shape Analysis: Skewness: 3.95 (Right-skewed) Kurtosis: 18.23 (Heavy tails/outliers) Zeros: 752 (80.9%) Outliers (IQR): 178 (19.1%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (80.9%) combined with high skewness (3.95) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_time_to_open_hours_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 3.909 | Median: 2.650 | Std: 3.767 Range: [0.000, 19.900] Percentiles: 1%=0.000, 25%=1.200, 75%=5.300, 99%=16.109 📈 Shape Analysis: Skewness: 1.51 (Right-skewed) Kurtosis: 2.40 (Light tails) Zeros: 4 (2.2%) Outliers (IQR): 9 (4.9%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.51) Priority: medium
====================================================================== Column: lag3_time_to_open_hours_count Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.038 | Median: 0.000 | Std: 0.198 Range: [0.000, 2.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 5.38 (Right-skewed) Kurtosis: 30.10 (Heavy tails/outliers) Zeros: 4,816 (96.4%) Outliers (IQR): 182 (3.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (96.4%) combined with high skewness (5.38) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: lag3_time_to_open_hours_max Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 3.962 | Median: 2.700 | Std: 3.762 Range: [0.000, 19.900] Percentiles: 1%=0.000, 25%=1.200, 75%=5.550, 99%=16.109 📈 Shape Analysis: Skewness: 1.48 (Right-skewed) Kurtosis: 2.34 (Light tails) Zeros: 4 (2.2%) Outliers (IQR): 8 (4.4%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (1.48) Priority: medium
====================================================================== Column: opened_velocity Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.000 | Median: 0.000 | Std: 0.020 Range: [-0.100, 0.133] Percentiles: 1%=-0.033, 25%=0.000, 75%=0.000, 99%=0.033 📈 Shape Analysis: Skewness: 0.26 (Symmetric) Kurtosis: 4.21 (Heavy tails/outliers) Zeros: 674 (71.4%) Outliers (IQR): 270 (28.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (71.4%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: opened_velocity_pct Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.704 | Median: -1.000 | Std: 0.565 Range: [-1.000, 2.000] Percentiles: 1%=-1.000, 25%=-1.000, 75%=-1.000, 99%=1.000 📈 Shape Analysis: Skewness: 1.98 (Right-skewed) Kurtosis: 3.86 (Heavy tails/outliers) Zeros: 36 (19.9%) Outliers (IQR): 45 (24.9%) 🔧 Recommended Transformation: yeo_johnson Reason: Moderate skewness (1.98) with negative values Priority: medium
====================================================================== Column: clicked_velocity Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.000 | Median: 0.000 | Std: 0.011 Range: [-0.067, 0.033] Percentiles: 1%=-0.033, 25%=0.000, 75%=0.000, 99%=0.033 📈 Shape Analysis: Skewness: -0.28 (Symmetric) Kurtosis: 7.26 (Heavy tails/outliers) Zeros: 844 (89.4%) Outliers (IQR): 100 (10.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (89.4%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: send_hour_velocity Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.069 | Median: 0.033 | Std: 0.443 Range: [-1.300, 6.933] Percentiles: 1%=-0.867, 25%=-0.133, 75%=0.233, 99%=1.219 📈 Shape Analysis: Skewness: 4.33 (Right-skewed) Kurtosis: 61.54 (Heavy tails/outliers) Zeros: 55 (5.8%) Outliers (IQR): 69 (7.3%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (4.33) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
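As the warning notes, Yeo-Johnson accepts negative values where log and sqrt do not. The sketch below applies `scipy.stats.yeojohnson` to synthetic skewed data containing negatives and compares skewness before and after. The data is made up for illustration and the notebook's own transformation step may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed data, shifted so that some values are negative
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1000) - 1.0

# With lmbda omitted, SciPy estimates lambda by maximum likelihood
transformed, fitted_lambda = stats.yeojohnson(raw)

skew_before = stats.skew(raw)
skew_after = stats.skew(transformed)
print(f"lambda={fitted_lambda:.3f}  skew before={skew_before:.2f}  after={skew_after:.2f}")
```

In a modelling pipeline the lambda should be fitted on training data only and reused on new data, for instance by passing it back through the `lmbda` argument.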
====================================================================== Column: send_hour_velocity_pct Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.310 | Median: 0.071 | Std: 0.954 Range: [-0.848, 13.000] Percentiles: 1%=-0.722, 25%=-0.263, 75%=0.628, 99%=3.714 📈 Shape Analysis: Skewness: 3.94 (Right-skewed) Kurtosis: 36.14 (Heavy tails/outliers) Zeros: 55 (5.8%) Outliers (IQR): 43 (4.6%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (3.94) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
====================================================================== Column: bounced_velocity Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.001 | Median: 0.000 | Std: 0.008 Range: [-0.033, 0.033] Percentiles: 1%=-0.033, 25%=0.000, 75%=0.000, 99%=0.033 📈 Shape Analysis: Skewness: 1.11 (Right-skewed) Kurtosis: 15.08 (Heavy tails/outliers) Zeros: 892 (94.5%) Outliers (IQR): 52 (5.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (94.5%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: time_to_open_hours_velocity Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.003 | Median: 0.000 | Std: 0.118 Range: [-0.830, 0.900] Percentiles: 1%=-0.354, 25%=0.000, 75%=0.000, 99%=0.465 📈 Shape Analysis: Skewness: 1.06 (Right-skewed) Kurtosis: 13.06 (Heavy tails/outliers) Zeros: 642 (68.0%) Outliers (IQR): 302 (32.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (68.0%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: time_to_open_hours_velocity_pct Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.066 | Median: -1.000 | Std: 5.805 Range: [-1.000, 70.000] Percentiles: 1%=-1.000, 25%=-1.000, 75%=-0.993, 99%=12.530 📈 Shape Analysis: Skewness: 10.37 (Right-skewed) Kurtosis: 121.24 (Heavy tails/outliers) Zeros: 1 (0.6%) Outliers (IQR): 44 (24.7%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (10.37) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
====================================================================== Column: opened_acceleration Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.002 | Median: 0.000 | Std: 0.039 Range: [-0.167, 0.100] Percentiles: 1%=-0.133, 25%=0.000, 75%=0.033, 99%=0.067 📈 Shape Analysis: Skewness: -1.08 (Left-skewed) Kurtosis: 2.63 (Light tails) Zeros: 118 (54.6%) Outliers (IQR): 29 (13.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (54.6%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: opened_momentum Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.007 | Median: 0.000 | Std: 0.026 Range: [-0.033, 0.533] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.067 📈 Shape Analysis: Skewness: 11.60 (Right-skewed) Kurtosis: 197.90 (Heavy tails/outliers) Zeros: 810 (85.8%) Outliers (IQR): 134 (14.2%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (85.8%) combined with high skewness (11.60) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: clicked_acceleration Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.001 | Median: 0.000 | Std: 0.022 Range: [-0.133, 0.067] Percentiles: 1%=-0.067, 25%=0.000, 75%=0.000, 99%=0.033 📈 Shape Analysis: Skewness: -1.96 (Left-skewed) Kurtosis: 8.55 (Heavy tails/outliers) Zeros: 174 (80.6%) Outliers (IQR): 42 (19.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (80.6%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: send_hour_acceleration Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.067 | Median: 0.067 | Std: 0.737 Range: [-2.133, 2.833] Percentiles: 1%=-1.990, 25%=-0.300, 75%=0.500, 99%=1.818 📈 Shape Analysis: Skewness: -0.09 (Symmetric) Kurtosis: 1.99 (Light tails) Zeros: 7 (3.2%) Outliers (IQR): 11 (5.1%) 🔧 Recommended Transformation: cap_outliers Reason: Significant outliers (5.1%) despite low skewness Priority: medium ⚠️ Consider investigating outlier causes before capping
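If the report uses the usual 1.5 × IQR rule (an assumption), the quartiles above give IQR = 0.500 − (−0.300) = 0.800 and fences of −0.300 − 1.2 = −1.5 and 0.500 + 1.2 = 1.7, so values on both ends of the reported range [−2.133, 2.833] would be capped. A minimal sketch of such capping follows, with `cap_iqr` as a hypothetical helper.

```python
import numpy as np


def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] and return the fences used."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(x, lower, upper), (lower, upper)


# Toy values with one extreme point on each side
data = np.array([-2.1, -0.3, -0.1, 0.0, 0.1, 0.2, 0.5, 2.8])
capped, (lower_fence, upper_fence) = cap_iqr(data)
print(lower_fence, upper_fence)
print(capped)
```

Capping keeps the row count unchanged but hides how extreme the original value was, which is why the warning recommends investigating outlier causes first.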
====================================================================== Column: send_hour_momentum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 5.931 | Median: 0.467 | Std: 53.154 Range: [-39.200, 1553.067] Percentiles: 1%=-11.228, 25%=-1.600, 75%=4.000, 99%=70.501 📈 Shape Analysis: Skewness: 26.41 (Right-skewed) Kurtosis: 763.67 (Heavy tails/outliers) Zeros: 55 (5.8%) Outliers (IQR): 137 (14.5%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (26.41) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
====================================================================== Column: bounced_acceleration Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.002 | Median: 0.000 | Std: 0.014 Range: [-0.067, 0.033] Percentiles: 1%=-0.067, 25%=0.000, 75%=0.000, 99%=0.033 📈 Shape Analysis: Skewness: -1.25 (Left-skewed) Kurtosis: 11.65 (Heavy tails/outliers) Zeros: 191 (88.4%) Outliers (IQR): 25 (11.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (88.4%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: time_to_open_hours_acceleration Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.002 | Median: 0.000 | Std: 0.204 Range: [-0.907, 0.567] Percentiles: 1%=-0.727, 25%=0.000, 75%=0.024, 99%=0.516 📈 Shape Analysis: Skewness: -1.20 (Left-skewed) Kurtosis: 5.33 (Heavy tails/outliers) Zeros: 116 (53.7%) Outliers (IQR): 85 (39.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (53.7%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: time_to_open_hours_momentum Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.262 | Median: 0.000 | Std: 1.425 Range: [-0.806, 24.300] Percentiles: 1%=-0.141, 25%=0.000, 75%=0.000, 99%=7.608 📈 Shape Analysis: Skewness: 9.23 (Right-skewed) Kurtosis: 112.85 (Heavy tails/outliers) Zeros: 775 (82.1%) Outliers (IQR): 169 (17.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (82.1%) combined with high skewness (9.23) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: opened_beginning Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 1.385 | Median: 1.000 | Std: 1.423 Range: [0.000, 16.000] Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=6.000 📈 Shape Analysis: Skewness: 1.98 (Right-skewed) Kurtosis: 8.80 (Heavy tails/outliers) Zeros: 1,479 (30.0%) Outliers (IQR): 73 (1.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (30.0%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: opened_end Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 1.237 | Median: 1.000 | Std: 1.423 Range: [0.000, 17.000] Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=6.000 📈 Shape Analysis: Skewness: 1.77 (Right-skewed) Kurtosis: 6.72 (Heavy tails/outliers) Zeros: 1,955 (39.7%) Outliers (IQR): 55 (1.1%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (39.7%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: opened_trend_ratio Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.870 | Median: 0.500 | Std: 0.989 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000 📈 Shape Analysis: Skewness: 1.66 (Right-skewed) Kurtosis: 3.49 (Heavy tails/outliers) Zeros: 1,104 (32.0%) Outliers (IQR): 240 (7.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (32.0%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: clicked_beginning Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.420 | Median: 0.000 | Std: 0.696 Range: [0.000, 7.000] Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=3.000 📈 Shape Analysis: Skewness: 2.03 (Right-skewed) Kurtosis: 6.01 (Heavy tails/outliers) Zeros: 3,315 (67.3%) Outliers (IQR): 73 (1.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (67.3%) combined with high skewness (2.03) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: clicked_end Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.381 | Median: 0.000 | Std: 0.672 Range: [0.000, 6.000] Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=3.000 📈 Shape Analysis: Skewness: 2.06 (Right-skewed) Kurtosis: 5.45 (Heavy tails/outliers) Zeros: 3,472 (70.5%) Outliers (IQR): 68 (1.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (70.5%) combined with high skewness (2.06) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: clicked_trend_ratio Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.402 | Median: 0.000 | Std: 0.650 Range: [0.000, 5.000] Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=3.000 📈 Shape Analysis: Skewness: 1.93 (Right-skewed) Kurtosis: 4.65 (Heavy tails/outliers) Zeros: 1,032 (64.1%) Outliers (IQR): 20 (1.2%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (64.1%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: send_hour_beginning Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 79.199 | Median: 74.000 | Std: 48.213 Range: [6.000, 541.000] Percentiles: 1%=10.000, 25%=47.000, 75%=101.000, 99%=248.760 📈 Shape Analysis: Skewness: 2.29 (Right-skewed) Kurtosis: 12.26 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 132 (2.7%) 🔧 Recommended Transformation: log_transform Reason: High positive skewness (2.29) with all positive values Priority: high
====================================================================== Column: send_hour_end Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 82.094 | Median: 76.000 | Std: 48.684 Range: [6.000, 613.000] Percentiles: 1%=10.000, 25%=51.000, 75%=104.000, 99%=249.280 📈 Shape Analysis: Skewness: 2.00 (Right-skewed) Kurtosis: 10.42 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 157 (3.2%) 🔧 Recommended Transformation: sqrt_transform Reason: Moderate skewness (2.00) Priority: medium
====================================================================== Column: send_hour_trend_ratio Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 1.270 | Median: 1.031 | Std: 0.989 Range: [0.092, 23.833] Percentiles: 1%=0.208, 25%=0.702, 75%=1.523, 99%=4.887 📈 Shape Analysis: Skewness: 5.00 (Right-skewed) Kurtosis: 67.66 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 302 (6.1%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (5.00) with significant outliers (6.1%) Priority: high
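The `cap_then_log` recommendation combines two steps: limit the extreme upper tail, then compress the remaining range. The sketch below caps at the 99th percentile and applies `log1p`. Both choices are assumptions for illustration, since the report does not state the thresholds the pipeline uses.

```python
import numpy as np


def cap_then_log(values, upper_pct=99.0):
    """Cap the upper tail at a percentile, then apply log1p to the capped values."""
    x = np.asarray(values, dtype=float)
    cap = np.nanpercentile(x, upper_pct)
    capped = np.minimum(x, cap)
    return np.log1p(capped), cap


# Synthetic positive ratios, not the notebook's data
rng = np.random.default_rng(1)
ratios = rng.lognormal(mean=0.0, sigma=0.6, size=1000)
ratios[0] = 23.833  # reported maximum of send_hour_trend_ratio, used as one extreme value
transformed, cap = cap_then_log(ratios)
print(f"cap={cap:.3f}  max after transform={transformed.max():.3f}")
```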
====================================================================== Column: bounced_beginning Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.124 | Median: 0.000 | Std: 0.359 Range: [0.000, 3.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 3.06 (Right-skewed) Kurtosis: 10.56 (Heavy tails/outliers) Zeros: 4,355 (88.4%) Outliers (IQR): 570 (11.6%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (88.4%) combined with high skewness (3.06) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: bounced_end Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.125 | Median: 0.000 | Std: 0.357 Range: [0.000, 3.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000 📈 Shape Analysis: Skewness: 2.96 (Right-skewed) Kurtosis: 9.46 (Heavy tails/outliers) Zeros: 4,351 (88.3%) Outliers (IQR): 574 (11.7%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (88.3%) combined with high skewness (2.96) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: bounced_trend_ratio Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.155 | Median: 0.000 | Std: 0.405 Range: [0.000, 3.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000 📈 Shape Analysis: Skewness: 2.83 (Right-skewed) Kurtosis: 8.96 (Heavy tails/outliers) Zeros: 488 (85.6%) Outliers (IQR): 82 (14.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (85.6%) combined with high skewness (2.83) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: time_to_open_hours_beginning Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 5.443 | Median: 2.900 | Std: 7.335 Range: [0.000, 95.300] Percentiles: 1%=0.000, 25%=0.000, 75%=8.300, 99%=31.500 📈 Shape Analysis: Skewness: 2.67 (Right-skewed) Kurtosis: 13.49 (Heavy tails/outliers) Zeros: 1,502 (30.5%) Outliers (IQR): 209 (4.2%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (30.5%) combined with high skewness (2.67) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: time_to_open_hours_end Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 4.833 | Median: 1.700 | Std: 7.044 Range: [0.000, 71.500] Percentiles: 1%=0.000, 25%=0.000, 75%=7.400, 99%=29.976 📈 Shape Analysis: Skewness: 2.30 (Right-skewed) Kurtosis: 7.76 (Heavy tails/outliers) Zeros: 1,970 (40.0%) Outliers (IQR): 251 (5.1%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (40.0%) combined with high skewness (2.30) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: time_to_open_hours_trend_ratio Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 2.480 | Median: 0.412 | Std: 9.399 Range: [0.000, 218.000] Percentiles: 1%=0.000, 25%=0.000, 75%=1.586, 99%=38.725 📈 Shape Analysis: Skewness: 11.13 (Right-skewed) Kurtosis: 174.48 (Heavy tails/outliers) Zeros: 1,106 (32.3%) Outliers (IQR): 394 (11.5%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (32.3%) combined with high skewness (11.13) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: days_since_last_event_y Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.000 | Median: 0.000 | Std: 0.000 Range: [0.000, 0.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.000 📈 Shape Analysis: Skewness: 0.00 (Symmetric) Kurtosis: 0.00 (Light tails) Zeros: 4,998 (100.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (100.0%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
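The block above reports a standard deviation of 0 and a range of [0, 0], so the column holds a single value in every row. A column like this can be flagged before any zero-inflation handling is considered. The sketch below uses a hypothetical helper, `constant_columns`, on a toy frame.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that contain at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame, not the notebook's data
frame = pd.DataFrame(
    {
        "days_since_last_event_y": [0, 0, 0],
        "event_frequency": [0.2, 0.3, 60.0],
    }
)
flagged = constant_columns(frame)
print(flagged)
```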
====================================================================== Column: days_since_first_event_y Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 2276.371 | Median: 2745.000 | Std: 976.745 Range: [0.000, 3282.000] Percentiles: 1%=29.970, 25%=1553.750, 75%=3037.000, 99%=3258.000 📈 Shape Analysis: Skewness: -0.95 (Left-skewed) Kurtosis: -0.50 (Light tails) Zeros: 35 (0.7%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.95) Priority: low
====================================================================== Column: active_span_days Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 2276.371 | Median: 2745.000 | Std: 976.745 Range: [0.000, 3282.000] Percentiles: 1%=29.970, 25%=1553.750, 75%=3037.000, 99%=3258.000 📈 Shape Analysis: Skewness: -0.95 (Left-skewed) Kurtosis: -0.50 (Light tails) Zeros: 35 (0.7%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: -0.95) Priority: low
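`active_span_days` and `days_since_first_event_y` report identical statistics, which suggests (but does not prove) that the two columns hold the same values. A direct comparison settles it. The sketch below uses a hypothetical helper, `identical_column_pairs`, on a toy frame.

```python
from itertools import combinations

import pandas as pd


def identical_column_pairs(df):
    """Return pairs of columns whose values match row by row."""
    # Series.equals also requires matching dtypes, so an int column and a
    # float column holding the same numbers are not reported as identical
    return [(a, b) for a, b in combinations(df.columns, 2) if df[a].equals(df[b])]


# Toy frame, not the notebook's data
frame = pd.DataFrame(
    {
        "days_since_first_event_y": [0.0, 1553.0, 3282.0],
        "active_span_days": [0.0, 1553.0, 3282.0],
        "inter_event_gap_mean": [1.0, 165.2, 829.0],
    }
)
pairs = identical_column_pairs(frame)
print(pairs)
```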
====================================================================== Column: recency_ratio Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.000 | Median: 0.000 | Std: 0.000 Range: [0.000, 0.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.000 📈 Shape Analysis: Skewness: 0.00 (Symmetric) Kurtosis: 0.00 (Light tails) Zeros: 4,998 (100.0%) Outliers (IQR): 0 (0.0%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Significant zero-inflation (100.0%) Priority: medium ⚠️ Many zero values may indicate a mixture distribution
====================================================================== Column: event_frequency Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.315 | Median: 0.196 | Std: 1.186 Range: [0.072, 60.000] Percentiles: 1%=0.126, 25%=0.170, 75%=0.256, 99%=1.937 📈 Shape Analysis: Skewness: 35.78 (Right-skewed) Kurtosis: 1562.28 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 550 (11.1%) 🔧 Recommended Transformation: cap_then_log Reason: High skewness (35.78) with significant outliers (11.1%) Priority: high
====================================================================== Column: inter_event_gap_mean Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 157.100 | Median: 165.182 | Std: 54.689 Range: [1.000, 829.000] Percentiles: 1%=19.207, 25%=129.165, 75%=190.219, 99%=303.538 📈 Shape Analysis: Skewness: 0.40 (Symmetric) Kurtosis: 6.97 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 208 (4.2%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.40) Priority: low
====================================================================== Column: inter_event_gap_std Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 143.798 | Median: 144.447 | Std: 62.407 Range: [0.000, 576.999] Percentiles: 1%=0.000, 25%=107.526, 75%=179.898, 99%=306.029 📈 Shape Analysis: Skewness: 0.37 (Symmetric) Kurtosis: 1.92 (Light tails) Zeros: 72 (1.5%) Outliers (IQR): 82 (1.7%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.37) Priority: low
====================================================================== Column: inter_event_gap_max Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 497.434 | Median: 488.000 | Std: 218.570 Range: [1.000, 1661.000] Percentiles: 1%=42.620, 25%=361.000, 75%=622.000, 99%=1107.280 📈 Shape Analysis: Skewness: 0.44 (Symmetric) Kurtosis: 0.90 (Light tails) Zeros: 0 (0.0%) Outliers (IQR): 96 (1.9%) 🔧 Recommended Transformation: none Reason: Distribution is approximately normal (skewness: 0.44) Priority: low
====================================================================== Column: regularity_score Type: numeric_continuous (Confidence: 90%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.141 | Median: 0.088 | Std: 0.183 Range: [0.000, 1.000] Percentiles: 1%=0.000, 25%=0.000, 75%=0.216, 99%=1.000 📈 Shape Analysis: Skewness: 2.29 (Right-skewed) Kurtosis: 7.01 (Heavy tails/outliers) Zeros: 1,652 (33.3%) Outliers (IQR): 171 (3.4%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (33.3%) combined with high skewness (2.29) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: opened_vs_cohort_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.000 | Median: -0.179 | Std: 0.415 Range: [-0.179, 5.821] Percentiles: 1%=-0.179, 25%=-0.179, 75%=-0.179, 99%=0.821 📈 Shape Analysis: Skewness: 2.65 (Right-skewed) Kurtosis: 12.49 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 846 (16.9%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (2.65) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
====================================================================== Column: opened_vs_cohort_pct Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 1.000 | Median: 0.000 | Std: 2.321 Range: [0.000, 33.581] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=5.597 📈 Shape Analysis: Skewness: 2.65 (Right-skewed) Kurtosis: 12.49 (Heavy tails/outliers) Zeros: 4,152 (83.1%) Outliers (IQR): 846 (16.9%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (83.1%) combined with high skewness (2.65) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: opened_cohort_zscore Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.000 | Median: -0.431 | Std: 1.000 Range: [-0.431, 14.037] Percentiles: 1%=-0.431, 25%=-0.431, 75%=-0.431, 99%=1.980 📈 Shape Analysis: Skewness: 2.65 (Right-skewed) Kurtosis: 12.49 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 846 (16.9%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (2.65) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
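The three `opened` cohort features above share the same skewness (2.65) and kurtosis (12.49). That is expected if they are a difference from the cohort mean, a ratio to the cohort mean, and a z-score of one underlying column (an assumption based on the names and the reported ranges): skewness and kurtosis do not change under shifting or under scaling by a positive constant. A small check on synthetic counts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic zero-heavy counts standing in for a per-customer "opened" value
opened = rng.poisson(lam=0.2, size=5000).astype(float)

diff = opened - opened.mean()                      # shift only
ratio = opened / opened.mean()                     # positive scaling only
zscore = (opened - opened.mean()) / opened.std()   # shift and positive scaling

variants = (opened, diff, ratio, zscore)
skews = [stats.skew(v) for v in variants]
kurts = [stats.kurtosis(v) for v in variants]
print(np.round(skews, 4), np.round(kurts, 4))
```

Because the three variants differ only by a shift and a scale, they are perfectly correlated with each other, so they add no independent information to a model.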
====================================================================== Column: clicked_vs_cohort_mean Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 0.000 | Median: -0.054 | Std: 0.234 Range: [-0.054, 3.946] Percentiles: 1%=-0.054, 25%=-0.054, 75%=-0.054, 99%=0.946 📈 Shape Analysis: Skewness: 4.73 (Right-skewed) Kurtosis: 29.86 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 265 (5.3%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (4.73) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
====================================================================== Column: clicked_vs_cohort_pct Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: 1.000 | Median: 0.000 | Std: 4.321 Range: [0.000, 73.771] Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=18.443 📈 Shape Analysis: Skewness: 4.73 (Right-skewed) Kurtosis: 29.86 (Heavy tails/outliers) Zeros: 4,733 (94.7%) Outliers (IQR): 265 (5.3%) 🔧 Recommended Transformation: zero_inflation_handling Reason: Zero-inflation (94.7%) combined with high skewness (4.73) Priority: high ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values
====================================================================== Column: clicked_cohort_zscore Type: numeric_discrete (Confidence: 70%) ---------------------------------------------------------------------- 📊 Distribution Statistics: Mean: -0.000 | Median: -0.231 | Std: 1.000 Range: [-0.231, 16.841] Percentiles: 1%=-0.231, 25%=-0.231, 75%=-0.231, 99%=4.037 📈 Shape Analysis: Skewness: 4.73 (Right-skewed) Kurtosis: 29.86 (Heavy tails/outliers) Zeros: 0 (0.0%) Outliers (IQR): 265 (5.3%) 🔧 Recommended Transformation: yeo_johnson Reason: High skewness (4.73) with negative values present Priority: high ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
```text
======================================================================
Column: send_hour_vs_cohort_mean
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -1.543 | Std: 10.061
  Range: [-10.543, 329.457]
  Percentiles: 1%=-10.543, 25%=-5.543, 75%=2.457, 99%=28.457

📈 Shape Analysis:
  Skewness: 10.65 (Right-skewed)
  Kurtosis: 276.91 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 278 (5.6%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (10.65) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt

======================================================================
Column: send_hour_vs_cohort_pct
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 1.000 | Median: 0.907 | Std: 0.608
  Range: [0.363, 20.916]
  Percentiles: 1%=0.363, 25%=0.665, 75%=1.149, 99%=2.720

📈 Shape Analysis:
  Skewness: 10.65 (Right-skewed)
  Kurtosis: 276.91 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 278 (5.6%)

🔧 Recommended Transformation: cap_then_log
  Reason: High skewness (10.65) with significant outliers (5.6%)
  Priority: high

======================================================================
Column: send_hour_cohort_zscore
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -0.153 | Std: 1.000
  Range: [-1.048, 32.745]
  Percentiles: 1%=-1.048, 25%=-0.551, 75%=0.244, 99%=2.828

📈 Shape Analysis:
  Skewness: 10.65 (Right-skewed)
  Kurtosis: 276.91 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 278 (5.6%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (10.65) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt

======================================================================
Column: bounced_vs_cohort_mean
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -0.024 | Std: 0.153
  Range: [-0.024, 1.976]
  Percentiles: 1%=-0.024, 25%=-0.024, 75%=-0.024, 99%=0.976

📈 Shape Analysis:
  Skewness: 6.44 (Right-skewed)
  Kurtosis: 40.95 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 117 (2.3%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (6.44) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt

======================================================================
Column: bounced_vs_cohort_pct
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 1.000 | Median: 0.000 | Std: 6.487
  Range: [0.000, 84.712]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=42.356

📈 Shape Analysis:
  Skewness: 6.44 (Right-skewed)
  Kurtosis: 40.95 (Heavy tails/outliers)
  Zeros: 4,881 (97.7%)
  Outliers (IQR): 117 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (97.7%) combined with high skewness (6.44)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values

======================================================================
Column: bounced_cohort_zscore
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -0.154 | Std: 1.000
  Range: [-0.154, 12.904]
  Percentiles: 1%=-0.154, 25%=-0.154, 75%=-0.154, 99%=6.375

📈 Shape Analysis:
  Skewness: 6.44 (Right-skewed)
  Kurtosis: 40.95 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 117 (2.3%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (6.44) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt

======================================================================
Column: time_to_open_hours_vs_cohort_mean
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -0.689 | Std: 2.340
  Range: [-0.689, 28.911]
  Percentiles: 1%=-0.689, 25%=-0.689, 75%=-0.689, 99%=11.014

📈 Shape Analysis:
  Skewness: 5.34 (Right-skewed)
  Kurtosis: 37.08 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 836 (16.7%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (5.34) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt

======================================================================
Column: time_to_open_hours_vs_cohort_pct
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 1.000 | Median: 0.000 | Std: 3.396
  Range: [0.000, 42.955]
  Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=16.983

📈 Shape Analysis:
  Skewness: 5.34 (Right-skewed)
  Kurtosis: 37.08 (Heavy tails/outliers)
  Zeros: 4,162 (83.3%)
  Outliers (IQR): 836 (16.7%)

🔧 Recommended Transformation: zero_inflation_handling
  Reason: Zero-inflation (83.3%) combined with high skewness (5.34)
  Priority: high
  ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values

======================================================================
Column: time_to_open_hours_cohort_zscore
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Statistics:
  Mean: 0.000 | Median: -0.294 | Std: 1.000
  Range: [-0.294, 12.356]
  Percentiles: 1%=-0.294, 25%=-0.294, 75%=-0.294, 99%=4.707

📈 Shape Analysis:
  Skewness: 5.34 (Right-skewed)
  Kurtosis: 37.08 (Heavy tails/outliers)
  Zeros: 0 (0.0%)
  Outliers (IQR): 836 (16.7%)

🔧 Recommended Transformation: yeo_johnson
  Reason: High skewness (5.34) with negative values present
  Priority: high
  ⚠️ Yeo-Johnson handles negative values unlike log/sqrt
```
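The zero-inflation warning in the output above suggests a binary indicator for zeros plus a log transform of the non-zero values, and the Yeo-Johnson notes point out that log and sqrt cannot take negative inputs. The following is a minimal sketch of both ideas, assuming pandas, NumPy, and SciPy. The helper name `split_zero_inflated` and the sample values are illustrative assumptions and are not part of the `customer_retention` framework.

```python
import numpy as np
import pandas as pd
from scipy import stats


def split_zero_inflated(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: binary zero indicator plus log1p of the non-zero part."""
    is_zero = (series == 0).astype(int)
    # log1p(0) == 0, so zeros stay at 0 while positive values are compressed
    log_nonzero = np.log1p(series.where(series > 0, 0.0))
    return pd.DataFrame({
        f"{series.name}_is_zero": is_zero,
        f"{series.name}_log_nonzero": log_nonzero,
    })


# Illustrative values only: mostly zeros with a long right tail
pct = pd.Series([0.0, 0.0, 0.0, 0.0, 42.356, 84.712], name="bounced_vs_cohort_pct")
parts = split_zero_inflated(pct)

# Yeo-Johnson accepts negative inputs, unlike log or sqrt.
# With no lambda supplied, SciPy estimates it by maximum likelihood.
zscores = np.array([-1.0, -0.6, -0.2, 0.1, 0.4, 2.8, 12.9])
transformed, fitted_lambda = stats.yeojohnson(zscores)
```

Keeping the indicator and the compressed magnitude as two separate features lets a model learn "did anything happen" independently of "how much happened", which is one common reason to split zero-inflated columns this way.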
```python
# Numerical Feature Statistics Table
if numeric_cols:
    stats_data = []
    for col_name in numeric_cols:
        series = df[col_name].dropna()
        if len(series) > 0:
            stats_data.append({
                "feature": col_name,
                "count": len(series),
                "mean": series.mean(),
                "std": series.std(),
                "min": series.min(),
                "25%": series.quantile(0.25),
                "50%": series.quantile(0.50),
                "75%": series.quantile(0.75),
                "95%": series.quantile(0.95),
                "99%": series.quantile(0.99),
                "max": series.max(),
                "skewness": stats.skew(series),
                "kurtosis": stats.kurtosis(series)
            })
    stats_df = pd.DataFrame(stats_data)
    # Format for display
    display_stats = stats_df.copy()
    for col in ["mean", "std", "min", "25%", "50%", "75%", "95%", "99%", "max"]:
        display_stats[col] = display_stats[col].apply(lambda x: f"{x:.3f}")
    display_stats["skewness"] = display_stats["skewness"].apply(lambda x: f"{x:.3f}")
    display_stats["kurtosis"] = display_stats["kurtosis"].apply(lambda x: f"{x:.3f}")
    print("=" * 80)
    print("NUMERICAL FEATURE STATISTICS")
    print("=" * 80)
    display(display_stats)
```
```text
================================================================================
NUMERICAL FEATURE STATISTICS
================================================================================
```
| | feature | count | mean | std | min | 25% | 50% | 75% | 95% | 99% | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | event_count_180d | 4998 | 0.639 | 1.009 | 0.000 | 0.000 | 0.000 | 1.000 | 3.000 | 4.000 | 11.000 | 2.151 | 7.453 |
| 1 | event_count_365d | 4998 | 1.316 | 1.656 | 0.000 | 0.000 | 1.000 | 2.000 | 4.000 | 7.000 | 15.000 | 1.611 | 4.247 |
| 2 | event_count_all_time | 4998 | 16.566 | 9.139 | 1.000 | 12.000 | 16.000 | 19.000 | 30.000 | 53.000 | 112.000 | 2.612 | 16.094 |
| 3 | opened_sum_180d | 4998 | 0.152 | 0.424 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 2.000 | 4.000 | 3.141 | 11.507 |
| 4 | opened_mean_180d | 1914 | 0.232 | 0.366 | 0.000 | 0.000 | 0.000 | 0.500 | 1.000 | 1.000 | 1.000 | 1.267 | 0.042 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 174 | bounced_vs_cohort_pct | 4998 | 1.000 | 6.487 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 42.356 | 84.712 | 6.442 | 40.903 |
| 175 | bounced_cohort_zscore | 4998 | 0.000 | 1.000 | -0.154 | -0.154 | -0.154 | -0.154 | -0.154 | 6.375 | 12.904 | 6.442 | 40.903 |
| 176 | time_to_open_hours_vs_cohort_mean | 4998 | 0.000 | 2.340 | -0.689 | -0.689 | -0.689 | -0.689 | 4.211 | 11.014 | 28.911 | 5.341 | 37.046 |
| 177 | time_to_open_hours_vs_cohort_pct | 4998 | 1.000 | 3.396 | 0.000 | 0.000 | 0.000 | 0.000 | 7.111 | 16.983 | 42.955 | 5.341 | 37.046 |
| 178 | time_to_open_hours_cohort_zscore | 4998 | 0.000 | 1.000 | -0.294 | -0.294 | -0.294 | -0.294 | 1.800 | 4.707 | 12.356 | 5.341 | 37.046 |
179 rows × 13 columns
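The kurtosis values in this table differ slightly from the per-column output (for example 40.903 here versus 40.95 there for `bounced_vs_cohort_pct`). The table is built with `scipy.stats.skew` and `scipy.stats.kurtosis`, which default to biased (population) estimators. One plausible explanation, stated here as an assumption because the analyzer's internals are not shown, is that the per-column analysis uses a bias-corrected estimator such as pandas `Series.kurt()`. A small sketch on synthetic data shows how the two conventions relate:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed sample, for illustration only
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=1.0, size=500))

biased_kurt = stats.kurtosis(sample)                # SciPy default: bias=True, Fisher (excess) kurtosis
unbiased_kurt = stats.kurtosis(sample, bias=False)  # bias-corrected sample kurtosis
pandas_kurt = sample.kurt()                         # pandas reports the bias-corrected value

biased_skew = stats.skew(sample)                    # SciPy default: bias=True
unbiased_skew = stats.skew(sample, bias=False)
pandas_skew = sample.skew()                         # pandas reports the bias-corrected value
```

The gap between the two estimators shrinks as the sample grows, so with roughly 5,000 rows the difference only shows up in the second or third decimal place.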
4.5 Distribution Summary & Transformation Plan¶
This table summarizes all numeric columns with their recommended transformations.
```python
# Build transformation summary table
summary_data = []
for col_name in numeric_cols:
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    if analysis and rec:
        summary_data.append({
            "Column": col_name,
            "Skewness": f"{analysis.skewness:.2f}",
            "Kurtosis": f"{analysis.kurtosis:.2f}",
            "Zeros %": f"{analysis.zero_percentage:.1f}%",
            "Outliers %": f"{analysis.outlier_percentage:.1f}%",
            "Transform": rec.recommended_transform.value,
            "Priority": rec.priority
        })
        # Add Gold transformation recommendation if not "none"
        if rec.recommended_transform != TransformationType.NONE and registry.gold:
            registry.add_gold_transformation(
                column=col_name,
                transform=rec.recommended_transform.value,
                parameters=rec.parameters,
                rationale=rec.reason,
                source_notebook="04_column_deep_dive"
            )
if summary_data:
    summary_df = pd.DataFrame(summary_data)
    display_table(summary_df)
    # Show how many transformation recommendations were added
    transform_count = sum(1 for r in recommendations.values() if r and r.recommended_transform != TransformationType.NONE)
    if transform_count > 0 and registry.gold:
        print(f"\n✅ Added {transform_count} transformation recommendations to Gold layer")
else:
    console.info("No numeric columns to summarize")
```
| Column | Skewness | Kurtosis | Zeros % | Outliers % | Transform | Priority |
|---|---|---|---|---|---|---|
| event_count_180d | 2.15 | 7.46 | 61.7% | 5.7% | zero_inflation_handling | high |
| event_count_365d | 1.61 | 4.25 | 47.3% | 2.0% | zero_inflation_handling | medium |
| event_count_all_time | 2.61 | 16.11 | 0.0% | 6.0% | cap_then_log | high |
| opened_sum_180d | 3.14 | 11.52 | 87.0% | 13.0% | zero_inflation_handling | high |
| opened_mean_180d | 1.27 | 0.05 | 66.1% | 0.0% | zero_inflation_handling | medium |
| opened_count_180d | 2.15 | 7.46 | 61.7% | 5.7% | zero_inflation_handling | high |
| clicked_sum_180d | 4.78 | 23.72 | 95.4% | 4.6% | zero_inflation_handling | high |
| clicked_mean_180d | 3.26 | 10.01 | 87.9% | 12.1% | zero_inflation_handling | high |
| clicked_count_180d | 2.15 | 7.46 | 61.7% | 5.7% | zero_inflation_handling | high |
| send_hour_sum_180d | 2.19 | 7.43 | 61.7% | 4.5% | zero_inflation_handling | high |
| send_hour_mean_180d | 0.04 | -0.10 | 0.0% | 0.0% | none | low |
| send_hour_max_180d | -0.17 | -0.44 | 0.0% | 0.0% | none | low |
| send_hour_count_180d | 2.15 | 7.46 | 61.7% | 5.7% | zero_inflation_handling | high |
| bounced_sum_180d | 9.70 | 100.59 | 98.8% | 1.2% | zero_inflation_handling | high |
| bounced_mean_180d | 7.20 | 55.40 | 96.9% | 3.1% | zero_inflation_handling | high |
| bounced_count_180d | 2.15 | 7.46 | 61.7% | 5.7% | zero_inflation_handling | high |
| time_to_open_hours_sum_180d | 5.51 | 38.63 | 87.2% | 12.8% | zero_inflation_handling | high |
| time_to_open_hours_mean_180d | 1.90 | 5.01 | 1.1% | 6.2% | sqrt_transform | medium |
| time_to_open_hours_max_180d | 1.76 | 3.97 | 1.1% | 6.0% | sqrt_transform | medium |
| time_to_open_hours_count_180d | 3.14 | 11.52 | 87.0% | 13.0% | zero_inflation_handling | high |
| opened_sum_365d | 2.37 | 7.21 | 75.9% | 24.1% | zero_inflation_handling | high |
| opened_mean_365d | 1.19 | 0.43 | 54.3% | 0.0% | zero_inflation_handling | medium |
| opened_count_365d | 1.61 | 4.25 | 47.3% | 2.0% | zero_inflation_handling | medium |
| clicked_sum_365d | 3.48 | 12.93 | 90.9% | 9.1% | zero_inflation_handling | high |
| clicked_mean_365d | 3.06 | 10.07 | 82.8% | 17.2% | zero_inflation_handling | high |
| clicked_count_365d | 1.61 | 4.25 | 47.3% | 2.0% | zero_inflation_handling | medium |
| send_hour_sum_365d | 1.62 | 3.96 | 47.3% | 2.2% | zero_inflation_handling | medium |
| send_hour_mean_365d | 0.09 | 0.28 | 0.0% | 2.4% | none | low |
| send_hour_max_365d | -0.34 | -0.23 | 0.0% | 0.0% | none | low |
| send_hour_count_365d | 1.61 | 4.25 | 47.3% | 2.0% | zero_inflation_handling | medium |
| bounced_sum_365d | 6.95 | 56.96 | 97.5% | 2.5% | zero_inflation_handling | high |
| bounced_mean_365d | 6.70 | 52.39 | 95.2% | 4.8% | zero_inflation_handling | high |
| bounced_count_365d | 1.61 | 4.25 | 47.3% | 2.0% | zero_inflation_handling | medium |
| time_to_open_hours_sum_365d | 3.94 | 20.10 | 76.2% | 23.8% | zero_inflation_handling | high |
| time_to_open_hours_mean_365d | 2.25 | 9.46 | 1.1% | 4.6% | yeo_johnson | high |
| time_to_open_hours_max_365d | 1.90 | 6.19 | 1.1% | 4.0% | sqrt_transform | medium |
| time_to_open_hours_count_365d | 2.37 | 7.21 | 75.9% | 24.1% | zero_inflation_handling | high |
| opened_sum_all_time | 2.27 | 13.25 | 13.4% | 2.1% | yeo_johnson | high |
| opened_mean_all_time | 0.38 | 0.73 | 13.4% | 0.7% | none | low |
| opened_count_all_time | 2.61 | 16.11 | 0.0% | 6.0% | cap_then_log | high |
| clicked_sum_all_time | 2.00 | 8.18 | 39.4% | 1.0% | zero_inflation_handling | high |
| clicked_mean_all_time | 1.26 | 2.33 | 39.4% | 1.0% | zero_inflation_handling | medium |
| clicked_count_all_time | 2.61 | 16.11 | 0.0% | 6.0% | cap_then_log | high |
| send_hour_sum_all_time | 2.52 | 15.09 | 0.0% | 5.1% | cap_then_log | high |
| send_hour_mean_all_time | -0.25 | 3.46 | 0.0% | 3.2% | none | low |
| send_hour_max_all_time | -1.44 | 3.55 | 0.0% | 2.4% | sqrt_transform | medium |
| send_hour_count_all_time | 2.61 | 16.11 | 0.0% | 6.0% | cap_then_log | high |
| bounced_sum_all_time | 1.92 | 4.18 | 71.1% | 1.1% | zero_inflation_handling | medium |
| bounced_mean_all_time | 7.27 | 114.67 | 71.1% | 4.8% | zero_inflation_handling | high |
| bounced_count_all_time | 2.61 | 16.11 | 0.0% | 6.0% | cap_then_log | high |
✅ Added 159 transformation recommendations to Gold layer
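Two of the transform names in the table above, `cap_then_log` and `sqrt_transform`, can be sketched in a few lines. This is only an illustration of the general technique: the helper names, the 99th-percentile cap, and the use of `log1p` are assumptions, since the parameters the framework actually stores in `rec.parameters` are not shown.

```python
import numpy as np
import pandas as pd


def cap_then_log(series: pd.Series, upper_quantile: float = 0.99) -> pd.Series:
    """Hypothetical helper: cap the right tail at a quantile, then apply log1p."""
    cap = series.quantile(upper_quantile)
    return np.log1p(series.clip(upper=cap))


def sqrt_transform(series: pd.Series) -> pd.Series:
    """Hypothetical helper: square root for moderately skewed, non-negative data."""
    if (series < 0).any():
        raise ValueError("sqrt requires non-negative values")
    return np.sqrt(series)


# Illustrative values only
counts = pd.Series([1, 12, 16, 19, 30, 53, 112], name="event_count_all_time")
capped_log = cap_then_log(counts)
rooted = sqrt_transform(counts)
```

Capping before the log keeps a handful of extreme values from dominating the scale, while `log1p` stays defined at zero. In a real pipeline the cap would be learned on the training split only and reused on new data.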
4.6 Categorical Columns Analysis¶
📖 Distribution Metrics (Analogues to Numeric Skewness/Kurtosis):
| Metric | Interpretation | Action |
|---|---|---|
| Imbalance Ratio | Largest / Smallest category count | > 10: Consider grouping rare categories |
| Entropy | Diversity measure (0 = one category, higher = more uniform) | Low entropy: May need stratified sampling |
| Top-3 Concentration | % of data in top 3 categories | > 90%: Rare categories may cause issues |
| Rare Category % | Categories with < 1% of data | High %: Group into "Other" category |
📖 Encoding Recommendations:
- Low cardinality (≤5) → One-hot encoding
- Medium cardinality (6-20) → One-hot or Target encoding
- High cardinality (>20) → Target encoding or Frequency encoding
- Cyclical (days, months) → Sin/Cos encoding
⚠️ Common Issues:
- Rare categories can cause overfitting with one-hot encoding
- High cardinality + one-hot = feature explosion
- Imbalanced categories may need special handling in train/test splits
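The distribution metrics described above can be computed directly from category counts. The sketch below uses a hypothetical helper, `categorical_metrics`, and made-up data; it is not the framework's `CategoricalDistributionAnalyzer`. It assumes entropy is measured in bits (log base 2), which is consistent with the reported output for a 4-category column (entropy 1.91 at 96% of a maximum of log2(4) = 2).

```python
import numpy as np
import pandas as pd


def categorical_metrics(series: pd.Series, rare_threshold: float = 0.01) -> dict:
    """Hypothetical helper mirroring the metrics in the table above."""
    counts = series.value_counts()           # sorted from most to least frequent
    shares = counts / counts.sum()
    entropy = float(-(shares * np.log2(shares)).sum())
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 1.0
    return {
        "category_count": int(len(counts)),
        "imbalance_ratio": float(counts.max() / counts.min()),
        "entropy": entropy,
        "normalized_entropy": float(entropy / max_entropy),
        "top3_concentration": float(shares.head(3).sum() * 100),
        "rare_category_count": int((shares < rare_threshold).sum()),
    }


# Illustrative data: four categories with a 40/30/20/10 split
quadrants = pd.Series(["a"] * 40 + ["b"] * 30 + ["c"] * 20 + ["d"] * 10)
metrics = categorical_metrics(quadrants)
```

Normalizing entropy by its maximum makes columns with different numbers of categories comparable, which is why the notebook reports it as a percentage of the maximum.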
```python
# Use framework's CategoricalDistributionAnalyzer
cat_analyzer = CategoricalDistributionAnalyzer()
categorical_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.CATEGORICAL_NOMINAL, ColumnType.CATEGORICAL_ORDINAL, ColumnType.CATEGORICAL_CYCLICAL]
    and col.inferred_type != ColumnType.TEXT  # TEXT columns processed separately in 02a
    and name not in TEMPORAL_METADATA_COLS
]
# Analyze all categorical columns
cat_analyses = cat_analyzer.analyze_dataframe(df, categorical_cols)
# Get encoding recommendations
cyclical_cols = [name for name, col in findings.columns.items()
                 if col.inferred_type == ColumnType.CATEGORICAL_CYCLICAL]
cat_recommendations = cat_analyzer.get_all_recommendations(df, categorical_cols, cyclical_columns=cyclical_cols)
for col_name in categorical_cols:
    col_info = findings.columns[col_name]
    analysis = cat_analyses.get(col_name)
    rec = next((r for r in cat_recommendations if r.column_name == col_name), None)
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print("-" * 70)
    if analysis:
        print("\n📊 Distribution Metrics:")
        print(f" Categories: {analysis.category_count}")
        print(f" Imbalance Ratio: {analysis.imbalance_ratio:.1f}x (largest/smallest)")
        print(f" Entropy: {analysis.entropy:.2f} ({analysis.normalized_entropy*100:.0f}% of max)")
        print(f" Top-1 Concentration: {analysis.top1_concentration:.1f}%")
        print(f" Top-3 Concentration: {analysis.top3_concentration:.1f}%")
        print(f" Rare Categories (<1%): {analysis.rare_category_count}")
        # Interpretation
        print("\n📈 Interpretation:")
        if analysis.has_low_diversity:
            print(" ⚠️ LOW DIVERSITY: Distribution dominated by few categories")
        elif analysis.normalized_entropy > 0.9:
            print(" ✓ HIGH DIVERSITY: Categories are relatively balanced")
        else:
            print(" ✓ MODERATE DIVERSITY: Some category dominance but acceptable")
        if analysis.imbalance_ratio > 100:
            print(" 🔴 SEVERE IMBALANCE: Rarest category has very few samples")
        elif analysis.is_imbalanced:
            print(" 🟡 MODERATE IMBALANCE: Consider grouping rare categories")
    # Recommendations
    if rec:
        print("\n🔧 Recommendations:")
        print(f" Encoding: {rec.encoding_type.value}")
        print(f" Reason: {rec.reason}")
        print(f" Priority: {rec.priority}")
        if rec.preprocessing_steps:
            print(" Preprocessing:")
            for step in rec.preprocessing_steps:
                print(f" • {step}")
        if rec.warnings:
            for warn in rec.warnings:
                print(f" ⚠️ {warn}")
    # Visualization
    value_counts = df[col_name].value_counts()
    subtitle = f"Entropy: {analysis.normalized_entropy*100:.0f}% | Imbalance: {analysis.imbalance_ratio:.1f}x | Rare: {analysis.rare_category_count}" if analysis else ""
    fig = charts.bar_chart(
        value_counts.head(10).index.tolist(),
        value_counts.head(10).values.tolist(),
        title=f"Top Categories: {col_name}<br><sub>{subtitle}</sub>"
    )
    display_figure(fig)
# Summary table and add recommendations to registry
if cat_analyses:
    print("\n" + "=" * 70)
    print("CATEGORICAL COLUMNS SUMMARY")
    print("=" * 70)
    summary_data = []
    for col_name, analysis in cat_analyses.items():
        rec = next((r for r in cat_recommendations if r.column_name == col_name), None)
        summary_data.append({
            "Column": col_name,
            "Categories": analysis.category_count,
            "Imbalance": f"{analysis.imbalance_ratio:.1f}x",
            "Entropy": f"{analysis.normalized_entropy*100:.0f}%",
            "Top-3 Conc.": f"{analysis.top3_concentration:.1f}%",
            "Rare (<1%)": analysis.rare_category_count,
            "Encoding": rec.encoding_type.value if rec else "N/A"
        })
        # Add encoding recommendation to Gold layer
        if rec and registry.gold:
            registry.add_gold_encoding(
                column=col_name,
                method=rec.encoding_type.value,
                rationale=rec.reason,
                source_notebook="04_column_deep_dive"
            )
    display_table(pd.DataFrame(summary_data))
    if registry.gold:
        print(f"\n✅ Added {len(cat_recommendations)} encoding recommendations to Gold layer")
```
```text
======================================================================
Column: lifecycle_quadrant
Type: categorical_nominal (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Metrics:
  Categories: 4
  Imbalance Ratio: 2.1x (largest/smallest)
  Entropy: 1.91 (96% of max)
  Top-1 Concentration: 33.7%
  Top-3 Concentration: 83.7%
  Rare Categories (<1%): 0

📈 Interpretation:
  ✓ HIGH DIVERSITY: Categories are relatively balanced

🔧 Recommendations:
  Encoding: one_hot
  Reason: Low cardinality (4 categories) - safe feature expansion
  Priority: low

======================================================================
Column: recency_bucket
Type: categorical_nominal (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Metrics:
  Categories: 5
  Imbalance Ratio: 25.1x (largest/smallest)
  Entropy: 1.64 (71% of max)
  Top-1 Concentration: 61.7%
  Top-3 Concentration: 90.3%
  Rare Categories (<1%): 0

📈 Interpretation:
  ✓ MODERATE DIVERSITY: Some category dominance but acceptable
  🟡 MODERATE IMBALANCE: Consider grouping rare categories

🔧 Recommendations:
  Encoding: one_hot
  Reason: Low cardinality (5 categories) - safe feature expansion
  Priority: low
  ⚠️ Use stratified sampling to preserve rare category representation

======================================================================
CATEGORICAL COLUMNS SUMMARY
======================================================================
```
| Column | Categories | Imbalance | Entropy | Top-3 Conc. | Rare (<1%) | Encoding |
|---|---|---|---|---|---|---|
| lifecycle_quadrant | 4 | 2.1x | 96% | 83.7% | 0 | one_hot |
| recency_bucket | 5 | 25.1x | 71% | 90.3% | 0 | one_hot |
✅ Added 2 encoding recommendations to Gold layer
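Neither column in this run has categories below the 1% rarity threshold, but the guidance to group rare categories into "Other" before one-hot encoding is easy to sketch for datasets that do. The helper `group_rare` and the sample data below are illustrative assumptions; only `pandas.get_dummies` is an existing library call.

```python
import pandas as pd


def group_rare(series: pd.Series, min_share: float = 0.01, other_label: str = "Other") -> pd.Series:
    """Hypothetical helper: replace categories below min_share with a single label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)


# Illustrative data: two dominant categories and two that each cover 0.5% of rows
buckets = pd.Series(
    ["active"] * 150 + ["lapsed"] * 48 + ["new"] * 1 + ["vip"] * 1,
    name="recency_bucket",
)
grouped = group_rare(buckets)
encoded = pd.get_dummies(grouped, prefix="recency_bucket")
```

Grouping first keeps the one-hot matrix small and avoids indicator columns that are almost entirely zero, which is the overfitting risk noted under common issues.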
4.7 Datetime Columns Analysis¶
📖 Unlike numeric transformations, datetime analysis recommends NEW FEATURES to create:
| Recommendation Type | Purpose | Examples |
|---|---|---|
| Feature Engineering | Create predictive features from dates | days_since_signup, tenure_years, month_sin_cos |
| Modeling Strategy | How to structure train/test | Time-based splits when trends detected |
| Data Quality | Issues to address before modeling | Placeholder dates (1/1/1900) to filter |
📖 Feature Engineering Strategies:
- Recency: `days_since_X` - How recent was the event? (useful for predicting behavior)
- Tenure: `tenure_years` - How long has the customer been active? (maturity/loyalty)
- Duration: `days_between_A_and_B` - Time between events (e.g., signup to first purchase)
- Cyclical: `month_sin`, `month_cos` - Preserves that December is near January
- Categorical: `is_weekend`, `is_quarter_end` - Behavioral indicators
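The strategies listed above map onto a few pandas expressions. The sketch below is illustrative only: the column names `signup_date` and `last_open_date`, the sample dates, and the `reference_date` snapshot are assumptions and do not come from the dataset analyzed in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical event data
events = pd.DataFrame({
    "signup_date": pd.to_datetime(["2022-01-15", "2023-06-30", "2023-12-02"]),
    "last_open_date": pd.to_datetime(["2023-12-20", "2023-07-04", "2023-12-31"]),
})
reference_date = pd.Timestamp("2024-01-01")  # assumed snapshot date

# Map months 1..12 onto a circle so December and January end up adjacent
month_angle = 2 * np.pi * (events["last_open_date"].dt.month - 1) / 12

features = pd.DataFrame({
    # Recency: days since the most recent event
    "days_since_last_open": (reference_date - events["last_open_date"]).dt.days,
    # Tenure: years since signup
    "tenure_years": (reference_date - events["signup_date"]).dt.days / 365.25,
    # Duration: days between two events
    "days_signup_to_last_open": (events["last_open_date"] - events["signup_date"]).dt.days,
    # Cyclical encoding
    "month_sin": np.sin(month_angle),
    "month_cos": np.cos(month_angle),
    # Categorical indicators
    "is_weekend": events["last_open_date"].dt.dayofweek >= 5,
    "is_quarter_end": events["last_open_date"].dt.is_quarter_end,
})
```

Using a fixed reference date rather than "today" keeps recency and tenure features reproducible between training and scoring runs.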
```python
from customer_retention.stages.profiling.temporal_analyzer import TemporalRecommendationType

datetime_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type == ColumnType.DATETIME
    and name not in TEMPORAL_METADATA_COLS
]
temporal_analyzer = TemporalAnalyzer()
# Store all datetime recommendations grouped by type
feature_engineering_recs = []
modeling_strategy_recs = []
data_quality_recs = []
datetime_summaries = []
for col_name in datetime_cols:
    col_info = findings.columns[col_name]
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"{'='*70}")
    date_series = pd.to_datetime(df[col_name], errors='coerce', format='mixed')
    valid_dates = date_series.dropna()
    print(f"\n📅 Date Range: {valid_dates.min()} to {valid_dates.max()}")
    print(f" Nulls: {date_series.isna().sum():,} ({date_series.isna().mean()*100:.1f}%)")
    # Basic temporal analysis
    analysis = temporal_analyzer.analyze(date_series)
    print(f" Auto-detected granularity: {analysis.granularity.value}")
    print(f" Span: {analysis.span_days:,} days ({analysis.span_days/365:.1f} years)")
    # Growth analysis
    growth = temporal_analyzer.calculate_growth_rate(date_series)
    if growth.get("has_data"):
        print("\n📈 Growth Analysis:")
        print(f" Trend: {growth['trend_direction'].upper()}")
        print(f" Overall growth: {growth['overall_growth_pct']:+.1f}%")
        print(f" Avg monthly growth: {growth['avg_monthly_growth']:+.1f}%")
    # Seasonality analysis
    seasonality = temporal_analyzer.analyze_seasonality(date_series)
    if seasonality.has_seasonality:
        print("\n🔄 Seasonality Detected:")
        print(f" Peak months: {', '.join(seasonality.peak_periods[:3])}")
        print(f" Trough months: {', '.join(seasonality.trough_periods[:3])}")
        print(f" Seasonal strength: {seasonality.seasonal_strength:.2f}")
    # Get recommendations using framework
    # (named temporal_recs so the numeric `recommendations` dict from 4.5 is not overwritten)
    other_dates = [c for c in datetime_cols if c != col_name]
    temporal_recs = temporal_analyzer.recommend_features(date_series, col_name, other_date_columns=other_dates)
    # Group by recommendation type
    col_feature_recs = [r for r in temporal_recs if r.recommendation_type == TemporalRecommendationType.FEATURE_ENGINEERING]
    col_modeling_recs = [r for r in temporal_recs if r.recommendation_type == TemporalRecommendationType.MODELING_STRATEGY]
    col_quality_recs = [r for r in temporal_recs if r.recommendation_type == TemporalRecommendationType.DATA_QUALITY]
    feature_engineering_recs.extend(col_feature_recs)
    modeling_strategy_recs.extend(col_modeling_recs)
    data_quality_recs.extend(col_quality_recs)
    # Display recommendations grouped by type
    if col_feature_recs:
        print("\n🛠️ FEATURES TO CREATE:")
        for rec in col_feature_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {priority_icon} {rec.feature_name} ({rec.category})")
            print(f" Why: {rec.reason}")
            if rec.code_hint:
                print(f" Code: {rec.code_hint}")
    if col_modeling_recs:
        print("\n⚙️ MODELING CONSIDERATIONS:")
        for rec in col_modeling_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {priority_icon} {rec.feature_name}")
            print(f" Why: {rec.reason}")
    if col_quality_recs:
        print("\n⚠️ DATA QUALITY ISSUES:")
        for rec in col_quality_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {priority_icon} {rec.feature_name}")
            print(f" Why: {rec.reason}")
            if rec.code_hint:
                print(f" Code: {rec.code_hint}")
    # Standard extractions always available
    print("\n Standard extractions available: year, month, day, day_of_week, quarter")
    # Store summary
    datetime_summaries.append({
        "Column": col_name,
        "Span (days)": analysis.span_days,
        "Seasonality": "Yes" if seasonality.has_seasonality else "No",
        "Trend": growth.get('trend_direction', 'N/A').capitalize() if growth.get("has_data") else "N/A",
        "Features to Create": len(col_feature_recs),
        "Modeling Notes": len(col_modeling_recs),
        "Quality Issues": len(col_quality_recs)
    })
    # === VISUALIZATIONS ===
    if growth.get("has_data"):
        fig = charts.growth_summary_indicators(growth, title=f"Growth Summary: {col_name}")
        display_figure(fig)
    chart_type = "line" if analysis.granularity in [TemporalGranularity.DAY, TemporalGranularity.WEEK] else "bar"
    fig = charts.temporal_distribution(analysis, title=f"Records Over Time: {col_name}", chart_type=chart_type)
    display_figure(fig)
    fig = charts.temporal_trend(analysis, title=f"Trend Analysis: {col_name}")
    display_figure(fig)
    yoy_data = temporal_analyzer.year_over_year_comparison(date_series)
    if len(yoy_data) > 1:
        fig = charts.year_over_year_lines(yoy_data, title=f"Year-over-Year: {col_name}")
        display_figure(fig)
        fig = charts.year_month_heatmap(yoy_data, title=f"Records Heatmap: {col_name}")
        display_figure(fig)
    if growth.get("has_data"):
        fig = charts.cumulative_growth_chart(growth["cumulative"], title=f"Cumulative Records: {col_name}")
        display_figure(fig)
    fig = charts.temporal_heatmap(date_series, title=f"Day of Week Distribution: {col_name}")
    display_figure(fig)
# === DATETIME SUMMARY ===
if datetime_summaries:
    print("\n" + "=" * 70)
    print("DATETIME COLUMNS SUMMARY")
    print("=" * 70)
    display_table(pd.DataFrame(datetime_summaries))
    # Summary by recommendation type
    print("\n📋 ALL RECOMMENDATIONS BY TYPE:")
    if feature_engineering_recs:
        print(f"\n🛠️ FEATURES TO CREATE ({len(feature_engineering_recs)}):")
        for i, rec in enumerate(feature_engineering_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {i}. {priority_icon} {rec.feature_name}")
    if modeling_strategy_recs:
        print(f"\n⚙️ MODELING CONSIDERATIONS ({len(modeling_strategy_recs)}):")
        for i, rec in enumerate(modeling_strategy_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {i}. {priority_icon} {rec.feature_name}: {rec.reason}")
    if data_quality_recs:
        print(f"\n⚠️ DATA QUALITY TO ADDRESS ({len(data_quality_recs)}):")
        for i, rec in enumerate(data_quality_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f" {i}. {priority_icon} {rec.feature_name}: {rec.reason}")
    # Add recommendations to registry
    added_derived = 0
    added_modeling = 0
    # Add feature engineering recommendations to Silver layer (derived columns)
    if registry.silver:
        for rec in feature_engineering_recs:
            registry.add_silver_derived(
                column=rec.feature_name,
                expression=rec.code_hint or "",
                feature_type=rec.category,
                rationale=rec.reason,
                source_notebook="04_column_deep_dive"
            )
            added_derived += 1
    # Add modeling strategy recommendations to Bronze layer
    seen_strategies = set()
    for rec in modeling_strategy_recs:
        if rec.feature_name not in seen_strategies:
            registry.add_bronze_modeling_strategy(
                strategy=rec.feature_name,
                column=datetime_cols[0] if datetime_cols else "",
                parameters={"category": rec.category},
                rationale=rec.reason,
                source_notebook="04_column_deep_dive"
            )
            seen_strategies.add(rec.feature_name)
            added_modeling += 1
    print(f"\n✅ Added {added_derived} derived column recommendations to Silver layer")
    print(f"✅ Added {added_modeling} modeling strategy recommendations to Bronze layer")
```
4.8 Type Override (Optional)¶
If any column types were incorrectly inferred, you can override them here.
Common overrides:
- Binary columns detected as numeric → `ColumnType.BINARY`
- IDs detected as numeric → `ColumnType.IDENTIFIER`
- Ordinal categories detected as nominal → `ColumnType.CATEGORICAL_ORDINAL`
```python
# === TYPE OVERRIDES ===
# Uncomment and modify to override any incorrectly inferred types
TYPE_OVERRIDES = {
    # "column_name": ColumnType.NEW_TYPE,
    # Examples:
    # "is_active": ColumnType.BINARY,
    # "user_id": ColumnType.IDENTIFIER,
    # "satisfaction_level": ColumnType.CATEGORICAL_ORDINAL,
}
if TYPE_OVERRIDES:
    print("Applying type overrides:")
    for col_name, new_type in TYPE_OVERRIDES.items():
        if col_name in findings.columns:
            old_type = findings.columns[col_name].inferred_type.value
            findings.columns[col_name].inferred_type = new_type
            findings.columns[col_name].confidence = 1.0
            findings.columns[col_name].evidence.append("Manually overridden")
            print(f" {col_name}: {old_type} → {new_type.value}")
else:
    print("No type overrides configured.")
    print("To override a type, add entries to TYPE_OVERRIDES dictionary above.")
```
```text
No type overrides configured.
To override a type, add entries to TYPE_OVERRIDES dictionary above.
```
4.9 Data Segmentation Analysis¶
Purpose: Determine if the dataset contains natural subgroups that might benefit from separate models.
📖 Why This Matters:
- Some datasets have distinct customer segments with very different behaviors
- A single model might struggle to capture patterns that vary significantly across segments
- Segmented models can improve accuracy but add maintenance complexity
Recommendations:
- single_model - Data is homogeneous; one model for all records
- consider_segmentation - Some variation exists; evaluate if complexity is worth it
- strong_segmentation - Distinct segments with different target rates; separate models likely beneficial
Important: This is exploratory guidance only. The final decision depends on business context, model complexity tolerance, and available resources.
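To make the idea of segment profiles concrete, the sketch below clusters synthetic data and reports size, share, and target rate per segment, the same quantities printed in the segment profiles of this section. It is an assumption-laden illustration: it uses SciPy's `kmeans2` rather than the framework's `SegmentAnalyzer`, whose internal method and quality score are not shown here, and all data is made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans2, whiten

rng = np.random.default_rng(42)

# Synthetic features: two well-separated groups with different target rates
low = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
high = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
X = np.vstack([low, high])
target = np.concatenate([rng.random(100) < 0.2, rng.random(100) < 0.7]).astype(int)

# whiten() rescales each feature to unit variance before clustering
_, labels = kmeans2(whiten(X), 2, minit="++")

# Per-segment size and target rate
profiles = (
    pd.DataFrame({"segment": labels, "target": target})
    .groupby("segment")["target"]
    .agg(size="count", target_rate="mean")
)
profiles["size_pct"] = profiles["size"] / len(labels) * 100
```

If the per-segment target rates are close to each other, as they would be for homogeneous data, a single model is usually the simpler choice; large gaps are what motivate the stronger segmentation recommendations.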
```python
from customer_retention.stages.profiling import SegmentAnalyzer

# Initialize segment analyzer
segment_analyzer = SegmentAnalyzer()
# Find target column if detected
target_col = None
for col_name, col_info in findings.columns.items():
    if col_info.inferred_type == ColumnType.TARGET:
        target_col = col_name
        break
# Run segmentation analysis using numeric features
print("="*70)
print("DATA SEGMENTATION ANALYSIS")
print("="*70)
segmentation = segment_analyzer.analyze(
    df,
    target_col=target_col,
    feature_cols=numeric_cols if numeric_cols else None,
    max_segments=5
)
print("\n🎯 Analysis Results:")
print(f" Method: {segmentation.method.value}")
print(f" Detected Segments: {segmentation.n_segments}")
print(f" Cluster Quality Score: {segmentation.quality_score:.2f}")
if segmentation.target_variance_ratio is not None:
    print(f" Target Variance Ratio: {segmentation.target_variance_ratio:.2f}")
print("\n📊 Segment Profiles:")
for profile in segmentation.profiles:
    target_info = f" | Target Rate: {profile.target_rate*100:.1f}%" if profile.target_rate is not None else ""
    print(f" Segment {profile.segment_id}: {profile.size:,} records ({profile.size_pct:.1f}%){target_info}")
# Display recommendation card
fig = charts.segment_recommendation_card(segmentation)
display_figure(fig)
# Display segment overview
fig = charts.segment_overview(segmentation, title="Segment Overview")
display_figure(fig)
# Display feature comparison if we have features
if segmentation.n_segments > 1 and any(p.defining_features for p in segmentation.profiles):
    fig = charts.segment_feature_comparison(segmentation, title="Feature Comparison Across Segments")
    display_figure(fig)
print("\n📝 Rationale:")
for reason in segmentation.rationale:
    print(f" • {reason}")
```
```text
======================================================================
DATA SEGMENTATION ANALYSIS
======================================================================

🎯 Analysis Results:
  Method: kmeans
  Detected Segments: 1
  Cluster Quality Score: 0.00
  Target Variance Ratio: 0.00

📊 Segment Profiles:
  Segment 0: 4,998 records (100.0%) | Target Rate: 44.6%

📝 Rationale:
  • Insufficient data for meaningful segmentation
```
4.10 Save Updated Findings¶
```python
# Save updated findings back to the same file
findings.save(FINDINGS_PATH)
print(f"Updated findings saved to: {FINDINGS_PATH}")
# Save recommendations registry
recommendations_path = FINDINGS_PATH.replace("_findings.yaml", "_recommendations.yaml")
registry.save(recommendations_path)
print(f"Recommendations saved to: {recommendations_path}")
# Summary of recommendations
all_recs = registry.all_recommendations
print("\n📋 Recommendations Summary:")
print(f" Bronze layer: {len(registry.get_by_layer('bronze'))} recommendations")
print(f" Silver layer: {len(registry.get_by_layer('silver'))} recommendations")
print(f" Gold layer: {len(registry.get_by_layer('gold'))} recommendations")
print(f" Total: {len(all_recs)} recommendations")
```
```text
Updated findings saved to: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_aggregated_findings.yaml
Recommendations saved to: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_aggregated_recommendations.yaml

📋 Recommendations Summary:
  Bronze layer: 3 recommendations
  Silver layer: 0 recommendations
  Gold layer: 161 recommendations
  Total: 164 recommendations
```
Summary: What We Learned
In this notebook, we performed a deep dive analysis that included:
- Value Range Validation - Validated rates, binary fields, and non-negative constraints
- Numeric Distribution Analysis - Calculated skewness, kurtosis, and percentiles with transformation recommendations
- Categorical Distribution Analysis - Calculated imbalance ratio, entropy, and concentration with encoding recommendations
- Datetime Analysis - Analyzed seasonality, trends, and patterns with feature engineering recommendations
- Data Segmentation - Evaluated if natural subgroups exist that might benefit from separate models
Key Metrics Reference
Numeric Columns:
| Metric | Threshold | Action |
|---|---|---|
| Skewness | abs(skew) > 1 | Log transform |
| Kurtosis | > 10 | Cap outliers first |
| Zero % | > 40% | Zero-inflation handling |
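The numeric thresholds above can be applied to a single column roughly as follows. This is a minimal sketch, not the `DistributionAnalyzer` implementation: the function name and action labels are illustrative, and it assumes the kurtosis threshold refers to excess kurtosis (SciPy's default, where a normal distribution scores 0).

```python
import pandas as pd
from scipy import stats


def suggest_numeric_actions(series: pd.Series) -> list[str]:
    """Map the reference thresholds to suggested actions for one numeric column.

    Illustrative helper; the action labels are assumptions, not library constants.
    """
    values = series.dropna().astype(float)
    actions = []
    # Heavy tails: cap outliers before any other transform.
    # Assumption: threshold applies to excess (Fisher) kurtosis.
    if stats.kurtosis(values) > 10:
        actions.append("cap_outliers")
    # Strong asymmetry in either direction.
    if abs(stats.skew(values)) > 1:
        actions.append("log_transform")
    # More than 40% exact zeros suggests zero-inflation.
    if (values == 0).mean() > 0.40:
        actions.append("zero_inflation_handling")
    return actions


# Example: half zeros plus one extreme value trips all three checks.
example = pd.Series([0] * 50 + [1] * 49 + [1000])
print(suggest_numeric_actions(example))
```

When a column contains zeros, `numpy.log1p` is a common choice for the log transform because it is defined at zero.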
Categorical Columns:
| Metric | Threshold | Action |
|---|---|---|
| Imbalance Ratio | > 10x | Group rare categories |
| Entropy | < 50% | Stratified sampling |
| Rare Categories | > 0 | Group into "Other" |
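The categorical metrics can be sketched with plain pandas. The definitions here are assumptions, since the table does not spell them out: imbalance ratio is taken as the largest category count divided by the smallest, entropy is normalized by its maximum (`log(k)` for `k` categories) so it can be read as a percentage, and a category counts as rare below a 1% share. `CategoricalDistributionAnalyzer` may define these differently.

```python
import numpy as np
import pandas as pd


def categorical_metrics(series: pd.Series, rare_share: float = 0.01) -> dict:
    """Compute imbalance ratio, normalized entropy, and rare categories (illustrative)."""
    counts = series.dropna().value_counts()
    shares = counts / counts.sum()
    k = len(counts)
    entropy = float(-(shares * np.log(shares)).sum())
    # Normalize by the maximum possible entropy; a single category is treated as 0.
    normalized_entropy = entropy / np.log(k) if k > 1 else 0.0
    return {
        "imbalance_ratio": float(counts.max() / counts.min()),
        "normalized_entropy": float(normalized_entropy),
        "rare_categories": sorted(shares[shares < rare_share].index),
    }


def group_rare(series: pd.Series, rare_categories: list) -> pd.Series:
    """Replace rare categories with a single "Other" label."""
    return series.where(~series.isin(rare_categories), "Other")


# Example with hypothetical category labels.
plans = pd.Series(["a"] * 900 + ["b"] * 95 + ["c"] * 5)
metrics = categorical_metrics(plans)
grouped = group_rare(plans, metrics["rare_categories"])
```

In this example the imbalance ratio is 180x and the only rare category is `c`, so both the "group rare categories" and "group into Other" actions from the table would apply.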
Datetime Columns:
| Finding | Action |
|---|---|
| Seasonality | Add cyclical month encoding |
| Strong trend | Time-based train/test split |
| Multiple dates | Calculate duration features |
| Placeholder dates | Filter or flag |
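Two of the datetime actions lend themselves to a short illustration: cyclical month encoding and duration features. The sketch below uses plain pandas and NumPy; the function names and the column names `signup_date` and `last_email_date` are hypothetical and do not come from the dataset. Cyclical encoding maps the month onto a circle so that December and January end up close together, which a plain 1 to 12 integer does not capture.

```python
import numpy as np
import pandas as pd


def add_cyclical_month(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Add sine/cosine month features for a datetime column (illustrative)."""
    out = df.copy()
    month = pd.to_datetime(out[date_col]).dt.month
    angle = 2 * np.pi * (month - 1) / 12  # January maps to angle 0
    out[f"{date_col}_month_sin"] = np.sin(angle)
    out[f"{date_col}_month_cos"] = np.cos(angle)
    return out


def add_duration_days(df: pd.DataFrame, start_col: str, end_col: str, name: str) -> pd.DataFrame:
    """Add the whole-day difference between two datetime columns (illustrative)."""
    out = df.copy()
    delta = pd.to_datetime(out[end_col]) - pd.to_datetime(out[start_col])
    out[name] = delta.dt.days
    return out


# Hypothetical columns for demonstration only.
demo = pd.DataFrame({
    "signup_date": ["2024-01-01", "2024-07-10"],
    "last_email_date": ["2024-01-31", "2024-07-20"],
})
demo = add_cyclical_month(demo, "signup_date")
demo = add_duration_days(demo, "signup_date", "last_email_date", "days_active")
```

For a strong trend, the table's recommendation is a time-based split: sort by the date column and hold out the most recent rows rather than sampling at random.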
Transformation & Encoding Summary
Review the summary tables above for:
- Numeric: Which columns need log transforms, capping, or zero-inflation handling
- Categorical: Which encoding to use and whether to group rare categories
- Datetime: Which temporal features to engineer based on detected patterns
Next Steps
Continue to 02_source_integrity.ipynb to:
- Analyze duplicate records and value conflicts
- Deep dive into missing value patterns
- Analyze outliers with IQR method
- Check data consistency
- Get cleaning recommendations
Or jump to 05_feature_opportunities.ipynb if you want to see derived feature recommendations.
Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.