Chapter 2: Source Integrity
Purpose: Deep dive into data quality issues with actionable remediation strategies.
What you'll learn:
- How to analyze missing value patterns (MCAR vs MAR vs MNAR)
- How to detect and handle outliers using the IQR method
- How to validate date sequences and binary fields
- How to implement data cleanup strategies
Outputs:
- Missing value analysis with correlation patterns
- Outlier detection with visualization
- Date logic validation results
- Binary field validation
- Cleanup code examples ready to use
Quality Assessment Framework
| Issue Type | Detection Method | Common Solutions |
|---|---|---|
| Missing Values | Null counts, pattern analysis | Impute (mean/median/mode), drop, flag |
| Outliers | IQR, Z-score, isolation forest | Cap/clip, winsorize, transform, keep (if valid) |
| Date Logic | Sequence validation | Set placeholders to NULL, exclude invalid |
| Duplicates | Key uniqueness | Drop exact, keep most recent |
| Invalid Values | Range/domain checks | Correct, flag, exclude |
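Most of the remediations in the table reduce to a few pandas idioms. A minimal sketch on a toy frame (all column names, values, and thresholds here are illustrative, not from this dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "custid": [1, 2, 2, 3, 4, 5],
    "amount": [10.0, np.nan, np.nan, 5000.0, 12.0, 14.0],
    "rate":   [0.5, 1.2, 1.2, 0.9, 0.4, 0.6],  # rates above 1.0 are out of domain
})

# Missing values: impute with the median, keep a flag in case missingness is informative
df["amount_missing"] = df["amount"].isna().astype(int)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Outliers: cap/clip at the IQR bounds
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["amount"] = df["amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Duplicates: drop exact duplicates, then enforce key uniqueness
df = df.drop_duplicates().drop_duplicates(subset=["custid"], keep="first")

# Invalid values: flag out-of-domain entries rather than silently correcting them
df["rate_invalid"] = df["rate"] > 1.0
```

Each operation is order-sensitive: the missing flag must be captured before imputation, and the IQR bounds are computed on the imputed column.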
2.1 Setup
from customer_retention.analysis.notebook_progress import track_and_export_previous
track_and_export_previous("02_source_integrity.ipynb")
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from customer_retention.analysis.auto_explorer import ExplorationFindings, RecommendationRegistry
from customer_retention.analysis.visualization import ChartBuilder, display_figure
from customer_retention.core.config.column_config import ColumnType
from customer_retention.core.config.experiments import (
FINDINGS_DIR,
)
from pathlib import Path
import yaml
from customer_retention.analysis.auto_explorer import load_notebook_findings
FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings("02_source_integrity.ipynb")
print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
# Load data - handle aggregated vs standard paths
from customer_retention.analysis.auto_explorer.active_dataset_store import (
load_active_dataset,
load_merge_dataset,
)
from customer_retention.core.config.column_config import DatasetGranularity
from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS
if "_aggregated" in str(FINDINGS_PATH):
df = load_merge_dataset(_namespace, dataset_name, DatasetGranularity.EVENT_LEVEL)
data_source = f"aggregated:{dataset_name}"
else:
df = load_active_dataset(_namespace, dataset_name)
data_source = dataset_name
print(f"Loaded data from: {data_source}")
print(f"Shape: {df.shape}")
charts = ChartBuilder()
# Load or initialize recommendation registry
RECOMMENDATIONS_PATH = FINDINGS_PATH.replace("_findings.yaml", "_recommendations.yaml")
if Path(RECOMMENDATIONS_PATH).exists():
with open(RECOMMENDATIONS_PATH, "r") as f:
registry = RecommendationRegistry.from_dict(yaml.safe_load(f))
print(f"Loaded existing recommendations from: {RECOMMENDATIONS_PATH}")
else:
registry = RecommendationRegistry()
registry.init_bronze(findings.source_path)
if findings.target_column:
registry.init_gold(findings.target_column)
entity_col = next((name for name, col in findings.columns.items() if col.inferred_type == ColumnType.IDENTIFIER), None)
if entity_col:
registry.init_silver(entity_col)
print("Initialized new recommendation registry")
print(f"\nLoaded findings for {findings.column_count} columns")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/retail-e7471284/datasets/customer_retention_retail/findings/customer_retention_retail_aggregated_findings.yaml
Loaded data from: aggregated:customer_retention_retail
Shape: (30770, 407)
Initialized new recommendation registry
Loaded findings for 407 columns
2.2 Duplicate Analysis
📊 Why This Matters:
- Duplicate records can skew statistics and model training
- Key column duplicates may indicate data quality issues or event-level data
- Value conflicts (same key, different values) require investigation
What to Watch For:
- Exact duplicates: Identical rows that should be deduplicated
- Key duplicates: Same identifier with different values (may indicate updates or errors)
- Value conflicts: Same key with conflicting values in important columns
from customer_retention.stages.validation import DataValidator
validator = DataValidator()
# Auto-detect potential key columns
potential_keys = [name for name, col in findings.columns.items()
if col.inferred_type.value in ('identifier', 'id') or 'id' in name.lower()]
KEY_COLUMN = potential_keys[0] if potential_keys else None
print("=" * 60)
print("DUPLICATE ANALYSIS")
print("=" * 60)
if KEY_COLUMN:
dup_result = validator.check_duplicates(df, key_column=KEY_COLUMN, check_value_conflicts=True)
print(f"\nKey Column: {KEY_COLUMN}")
print(f"Total Rows: {dup_result.total_rows:,}")
print(f"Unique Keys: {dup_result.unique_keys:,}")
print(f"Duplicate Keys: {dup_result.duplicate_keys:,} ({dup_result.duplicate_percentage:.2f}%)")
# Exact duplicates
if dup_result.exact_duplicate_rows > 0:
print(f"\n⚠️ Exact duplicate rows: {dup_result.exact_duplicate_rows:,}")
dup_mask = df.duplicated(keep=False)
dup_examples = df[dup_mask].head(6)
if len(dup_examples) > 0:
print("\nExample duplicate rows:")
display(dup_examples)
# Add deduplication recommendation for exact duplicates
registry.add_bronze_deduplication(
key_column=KEY_COLUMN, strategy="drop_exact_duplicates",
rationale=f"{dup_result.exact_duplicate_rows} exact duplicate rows detected",
source_notebook="02_source_integrity"
)
else:
print("\n✅ No exact duplicate rows")
# Value conflicts
if dup_result.has_value_conflicts:
print(f"\n⚠️ Value conflicts detected in: {', '.join(dup_result.conflict_columns[:5])}")
if findings.target_column and findings.target_column in dup_result.conflict_columns:
print(f" 🔴 CRITICAL: Target '{findings.target_column}' has conflicting values!")
# Show examples of conflicting records
key_counts = df[KEY_COLUMN].value_counts()
dup_keys = key_counts[key_counts > 1].head(3).index.tolist()
if dup_keys:
print("\nExample records with duplicate keys:")
conflict_examples = df[df[KEY_COLUMN].isin(dup_keys)].sort_values(KEY_COLUMN).head(10)
display(conflict_examples)
# Add deduplication recommendation for value conflicts
registry.add_bronze_deduplication(
key_column=KEY_COLUMN, strategy="keep_first",
rationale=f"Value conflicts in {len(dup_result.conflict_columns)} columns",
source_notebook="02_source_integrity",
conflict_columns=dup_result.conflict_columns[:5]
)
else:
print("\n✅ No value conflicts")
# Duplicate frequency distribution
if dup_result.duplicate_keys > 0:
key_counts = df[KEY_COLUMN].value_counts()
dup_distribution = key_counts[key_counts > 1].value_counts().sort_index()
if len(dup_distribution) > 0:
print("\nDuplicate frequency distribution:")
for count, num_keys in dup_distribution.head(5).items():
print(f" Keys appearing {count}x: {num_keys:,}")
# Recommendations
print("\n💡 RECOMMENDATIONS:")
if dup_result.exact_duplicate_rows > 0:
print(" • Remove exact duplicates: df.drop_duplicates()")
if dup_result.has_value_conflicts:
print(" • For value conflicts, decide strategy:")
print(" - Keep most recent (if you have a timestamp)")
print(" - Keep first occurrence: df.drop_duplicates(subset=[KEY_COLUMN], keep='first')")
print(" - Aggregate values (for numeric columns)")
else:
print("\n⚠️ No key column detected.")
print(" Set KEY_COLUMN above to enable duplicate analysis.")
print(f" Available columns: {list(findings.columns.keys())[:10]}...")
============================================================
DUPLICATE ANALYSIS
============================================================

Key Column: custid
Total Rows: 30,770
Unique Keys: 30,769
Duplicate Keys: 0 (0.00%)

✅ No exact duplicate rows

✅ No value conflicts
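The three conflict-resolution strategies listed in the recommendations can be sketched in pandas. A minimal example on a toy frame (`events`, `spend`, and `updated_at` are illustrative names; the notebook's real key is `custid`):

```python
import pandas as pd

events = pd.DataFrame({
    "custid": [1, 1, 2],
    "spend": [10.0, 25.0, 7.0],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

# Strategy 1: keep the most recent record per key (requires a timestamp)
latest = (events.sort_values("updated_at")
                .drop_duplicates(subset=["custid"], keep="last"))

# Strategy 2: keep the first occurrence per key
first = events.drop_duplicates(subset=["custid"], keep="first")

# Strategy 3: aggregate numeric values per key
agg = events.groupby("custid", as_index=False)["spend"].sum()
```

The right choice depends on what a duplicate key means in the source system: updates favor keep-most-recent, while event-level data usually calls for aggregation.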
2.3 Overall Quality Score
print(f"Overall Quality Score: {findings.overall_quality_score:.1f}/100")
if findings.overall_quality_score >= 90:
print("Excellent: Data is high quality and ready for modeling.")
elif findings.overall_quality_score >= 70:
print("Good: Minor quality issues that should be addressed.")
elif findings.overall_quality_score >= 50:
print("Fair: Significant quality issues require attention.")
else:
print("Poor: Major quality issues must be resolved before modeling.")
Overall Quality Score: 80.0/100
Good: Minor quality issues that should be addressed.
2.4 Target Variable Analysis
Understanding the target distribution is critical:
- Class imbalance affects model training and evaluation metrics
- Business context helps interpret what we're trying to predict
- Sampling strategies depend on imbalance severity
print("=" * 60)
print(f"TARGET VARIABLE DISTRIBUTION: {findings.target_column}")
print("=" * 60)
if findings.target_column and findings.target_column in df.columns:
target_series = df[findings.target_column]
target_counts = target_series.value_counts().sort_index()
# Create distribution table
dist_data = []
for val, count in target_counts.items():
pct = count / len(df) * 100
dist_data.append({
findings.target_column: val,
"count": count,
"percentage": f"{pct:.3f}"
})
dist_df = pd.DataFrame(dist_data)
display(dist_df)
# Calculate imbalance metrics
if len(target_counts) == 2:
majority = target_counts.max()
minority = target_counts.min()
minority_class = target_counts.idxmin()
imbalance_ratio = majority / minority
retention_rate = target_counts.get(1, 0) / len(df) * 100
print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1 (minority class: {minority_class})")
print(f"Retention rate: {retention_rate:.1f}%")
# Business context
if retention_rate > 70:
print(f"\n📊 Business Context: {retention_rate:.0f}% retention is healthy!")
print(" Churned customers are the minority class we want to predict.")
elif retention_rate > 50:
print(f"\n📊 Business Context: {retention_rate:.0f}% retention is moderate.")
print(" Balanced focus on both retention and churn prediction.")
else:
print(f"\n⚠️ Business Context: {retention_rate:.0f}% retention is concerning!")
print(" High churn rate requires urgent attention.")
# Modeling recommendations based on imbalance
print("\n⚠️ Class imbalance considerations for modeling:")
print(" - Use stratified sampling for train/test splits")
print(" - Consider class weights in model training")
print(" - Evaluate with Precision-Recall AUC (not just ROC-AUC)")
print(" - Focus on recall for churned class (catch at-risk customers)")
# Add imbalance strategy recommendation
if imbalance_ratio < 3:
strategy = "stratified_sampling"
rationale = f"Mild imbalance ({imbalance_ratio:.2f}:1) - stratified sampling sufficient"
print(" - SMOTE not needed (imbalance is mild)")
elif imbalance_ratio < 5:
strategy = "class_weights"
rationale = f"Moderate imbalance ({imbalance_ratio:.2f}:1) - use class weights"
print(" - SMOTE may not be necessary (imbalance is moderate)")
else:
strategy = "smote"
rationale = f"Severe imbalance ({imbalance_ratio:.2f}:1) - consider SMOTE"
print(" - Consider SMOTE or undersampling (imbalance is severe)")
registry.add_bronze_imbalance_strategy(
target_column=findings.target_column,
imbalance_ratio=imbalance_ratio,
minority_class=minority_class,
strategy=strategy,
rationale=rationale,
source_notebook="02_source_integrity"
)
# Visualization
fig = make_subplots(rows=1, cols=2, specs=[[{"type": "pie"}, {"type": "bar"}]],
subplot_titles=["Class Distribution", "Count Comparison"])
labels = [f"{'Retained' if v == 1 else 'Churned'} ({v})" for v in target_counts.index]
# Index is sorted [0, 1], so churned (0) should get red and retained (1) green
fig.add_trace(go.Pie(labels=labels, values=target_counts.values, hole=0.4,
marker_colors=["#e74c3c", "#2ecc71"]), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=target_counts.values,
marker_color=["#e74c3c", "#2ecc71"]), row=1, col=2)
fig.update_layout(height=350, title_text="Target Variable Distribution",
showlegend=False, template="plotly_white")
display_figure(fig)
else:
print(f"\nMulticlass target with {len(target_counts)} classes")
fig = go.Figure(go.Bar(x=[str(v) for v in target_counts.index], y=target_counts.values,
marker_color=px.colors.qualitative.Set2[:len(target_counts)]))
fig.update_layout(height=350, title_text="Target Variable Distribution",
xaxis_title=findings.target_column, yaxis_title="Count",
template="plotly_white")
display_figure(fig)
else:
print("\n⚠️ No target column detected or specified.")
print(" Set target_hint parameter in DataExplorer.explore() or")
print(" manually specify in findings.target_column")
============================================================
TARGET VARIABLE DISTRIBUTION: retained
============================================================
| | retained | count | percentage |
|---|---|---|---|
| 0 | 0.0 | 6316 | 20.526 |
| 1 | 1.0 | 24453 | 79.470 |
Imbalance ratio: 3.87:1 (minority class: 0.0)
Retention rate: 79.5%

📊 Business Context: 79% retention is healthy!
 Churned customers are the minority class we want to predict.

⚠️ Class imbalance considerations for modeling:
 - Use stratified sampling for train/test splits
 - Consider class weights in model training
 - Evaluate with Precision-Recall AUC (not just ROC-AUC)
 - Focus on recall for churned class (catch at-risk customers)
 - SMOTE may not be necessary (imbalance is moderate)
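The threshold logic above (stratify under 3:1, class weights under 5:1, SMOTE beyond) can be sketched in a few lines. The counts mirror this run's distribution table; the weight formula shown is the one scikit-learn's `class_weight="balanced"` uses, stated here for illustration:

```python
counts = {0: 6316, 1: 24453}  # churned / retained counts from the table above
majority, minority = max(counts.values()), min(counts.values())
ratio = majority / minority

# Same thresholds as the notebook's recommendation logic
if ratio < 3:
    strategy = "stratified_sampling"
elif ratio < 5:
    strategy = "class_weights"
else:
    strategy = "smote"

# "Balanced" class weights: w_c = n_samples / (n_classes * n_c),
# so the minority class gets proportionally more weight in the loss
n = sum(counts.values())
weights = {c: n / (len(counts) * n_c) for c, n_c in counts.items()}
```

At 3.87:1 this lands in the class-weights band, matching the printed recommendation.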
2.5 Missing Value Analysis
📊 Interpretation Guide:
- MCAR (Missing Completely at Random): Missing values have no pattern - safe to impute with mean/median
- MAR (Missing at Random): Missingness depends on other observed variables - use regression imputation
- MNAR (Missing Not at Random): Missingness depends on the missing value itself - create missing indicator
⚠️ What to Watch For:
- Columns with >50% missing may need to be dropped
- Highly correlated missing patterns suggest MAR
- ID columns with missing values indicate data integrity issues
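The three mechanisms map to three remediations. A minimal sketch on a toy frame (column names are illustrative): median fill for MCAR, regression imputation from a correlated observed column for MAR, and an explicit missing indicator for MNAR.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenure": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "spend":  [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

# MCAR: a simple median fill is usually safe
df["spend_mcar"] = df["spend"].fillna(df["spend"].median())

# MAR: regression imputation from a correlated observed variable
obs = df["spend"].notna()
slope, intercept = np.polyfit(df.loc[obs, "tenure"], df.loc[obs, "spend"], 1)
df["spend_mar"] = df["spend"].fillna(intercept + slope * df["tenure"])

# MNAR: keep the missingness itself as a feature
df["spend_was_missing"] = df["spend"].isna().astype(int)
```

In practice you rarely know the mechanism for certain; the correlation heatmap in the next section is one diagnostic for telling MCAR from MAR.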
missing_data = []
for col_name, col_info in findings.columns.items():
null_count = col_info.universal_metrics.get("null_count", 0)
null_pct = col_info.universal_metrics.get("null_percentage", 0)
if null_count > 0:
missing_data.append({
"Column": col_name,
"Missing Count": null_count,
"Missing %": f"{null_pct:.2f}%"
})
if missing_data:
missing_df = pd.DataFrame(missing_data).sort_values("Missing Count", ascending=False)
print("Columns with Missing Values:")
display(missing_df)
fig = charts.bar_chart(
missing_df["Column"].tolist(),
[float(x.replace("%", "")) for x in missing_df["Missing %"].tolist()],
title="Missing Value Percentage by Column"
)
display_figure(fig)
else:
print("No missing values detected.")
Columns with Missing Values:
| | Column | Missing Count | Missing % |
|---|---|---|---|
| 172 | lag2_refill_mean | 30770 | 100.00% |
| 164 | lag2_ordfreq_mean | 30770 | 100.00% |
| 167 | lag2_paperless_sum | 30770 | 100.00% |
| 168 | lag2_paperless_mean | 30770 | 100.00% |
| 170 | lag2_paperless_max | 30770 | 100.00% |
| ... | ... | ... | ... |
| 81 | lag0_avgorder_count | 1 | 0.00% |
| 80 | lag0_avgorder_mean | 1 | 0.00% |
| 181 | lag2_created_delta_hours_count | 1 | 0.00% |
| 79 | lag0_avgorder_sum | 1 | 0.00% |
| 0 | custid | 1 | 0.00% |
345 rows × 3 columns
2.6 Missing Value Patterns
📊 How to Read the Correlation Heatmap:
- Correlation = 1.0: Columns always missing together (same rows)
- Correlation > 0.5: Strong pattern - investigate the relationship
- Correlation ≈ 0: Independent missing patterns (MCAR likely)
missing_matrix = df.isnull()
missing_correlations = missing_matrix.corr()
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]
if len(cols_with_missing) > 1:
print("Missing Value Correlations (MCAR vs MAR analysis):")
fig = charts.heatmap(
missing_correlations.loc[cols_with_missing, cols_with_missing].values,
x_labels=cols_with_missing,
y_labels=cols_with_missing,
title="Missing Value Pattern Correlation"
)
display_figure(fig)
Missing Value Correlations (MCAR vs MAR analysis):
2.7 Segment-Aware Outlier Analysis
📊 Why Segment Before Detecting Outliers?
Global outlier detection can produce false positives when data contains natural segments:
- Retail vs Enterprise customers: Order values of $5K may be outliers for retail but normal for enterprise
- New vs Established accounts: Activity patterns differ dramatically by customer tenure
- Geographic segments: Regional price differences can appear as outliers globally
⚠️ The Risk: If you remove "outliers" that are actually valid data from a different segment, you lose critical patterns needed for accurate modeling.
📊 What This Analysis Does:
- Detects natural data segments (using clustering or explicit segment columns)
- Compares global outliers vs segment-specific outliers
- Identifies "false outliers" - values flagged globally but normal within their segment
- Recommends whether segment-specific outlier treatment is beneficial
from customer_retention.stages.profiling import SegmentAwareOutlierAnalyzer
print("=" * 80)
print("SEGMENT-AWARE OUTLIER ANALYSIS")
print("=" * 80)
# Get numeric columns for analysis
numeric_cols = [
name for name, col in findings.columns.items()
if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
and name not in TEMPORAL_METADATA_COLS
and name in df.columns
]
# === CONFIGURATION ===
# Option 1: Specify an explicit segment column if you have one (e.g., customer_type, region)
SEGMENT_COL = None # e.g., "customer_segment", "account_type"
# Option 2: Load from findings metadata if saved in previous notebook
if SEGMENT_COL is None and "segment_column" in findings.metadata:
SEGMENT_COL = findings.metadata["segment_column"]
print(f"Using segment column from findings: {SEGMENT_COL}")
if numeric_cols:
analyzer = SegmentAwareOutlierAnalyzer(max_segments=5)
# Run segment-aware analysis
segment_result = analyzer.analyze(
df,
feature_cols=numeric_cols,
segment_col=SEGMENT_COL,
target_col=findings.target_column
)
print("\n📊 SEGMENTATION RESULTS:")
print(f" Segments detected: {segment_result.n_segments}")
if segment_result.n_segments > 1:
print("\n📊 GLOBAL VS SEGMENT OUTLIER COMPARISON:")
print("-" * 60)
comparison_data = []
for col in numeric_cols:
global_outliers = segment_result.global_analysis[col].outliers_detected
segment_outliers = sum(
seg[col].outliers_detected
for seg in segment_result.segment_analysis.values()
if col in seg
)
false_outliers = segment_result.false_outliers.get(col, 0)
if global_outliers > 0:
reduction_pct = (global_outliers - segment_outliers) / global_outliers * 100
false_pct = false_outliers / global_outliers * 100
else:
reduction_pct = 0
false_pct = 0
comparison_data.append({
"Feature": col,
"Global Outliers": global_outliers,
"Segment Outliers": segment_outliers,
"False Outliers": false_outliers,
"Reduction": f"{reduction_pct:.1f}%"
})
comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)
# Show false outlier analysis
has_false_outliers = any(segment_result.false_outliers.get(col, 0) > 0 for col in numeric_cols)
if has_false_outliers:
print("\n⚠️ FALSE OUTLIERS DETECTED:")
print(" (Global outliers that are normal within their segment)")
for col, count in segment_result.false_outliers.items():
if count > 0:
global_count = segment_result.global_analysis[col].outliers_detected
pct = count / global_count * 100 if global_count > 0 else 0
print(f" • {col}: {count} false outliers ({pct:.1f}% of global)")
# Recommendations
print("\n💡 RECOMMENDATIONS:")
if segment_result.segmentation_recommended:
print(" ✅ SEGMENT-SPECIFIC OUTLIER TREATMENT RECOMMENDED")
for rec in segment_result.recommendations:
print(f" • {rec}")
# Add outlier recommendations for columns with high false outlier rate
for col, count in segment_result.false_outliers.items():
if count > 0:
global_count = segment_result.global_analysis[col].outliers_detected
false_pct = count / global_count * 100 if global_count > 0 else 0
if false_pct > 50: # High false outlier rate
registry.add_bronze_outlier(
column=col, action="segment_aware_cap",
parameters={"method": "segment_iqr", "n_segments": segment_result.n_segments},
rationale=f"{false_pct:.0f}% of global outliers are segment-normal",
source_notebook="02_source_integrity"
)
else:
print(" ℹ️ Global outlier treatment is appropriate for this data")
# Rationale
print("\n📊 RATIONALE:")
for rationale in segment_result.rationale:
print(f" • {rationale}")
# Visualization: Compare outlier counts
cols_with_diff = [
row["Feature"] for _, row in comparison_df.iterrows()
if row["Global Outliers"] > 0 and row["Global Outliers"] != row["Segment Outliers"]
]
if cols_with_diff and len(cols_with_diff) <= 8:
fig = go.Figure()
global_counts = [comparison_df[comparison_df["Feature"] == c]["Global Outliers"].values[0] for c in cols_with_diff]
segment_counts = [comparison_df[comparison_df["Feature"] == c]["Segment Outliers"].values[0] for c in cols_with_diff]
fig.add_trace(go.Bar(name="Global Outliers", x=cols_with_diff, y=global_counts, marker_color="#e74c3c"))
fig.add_trace(go.Bar(name="Segment Outliers", x=cols_with_diff, y=segment_counts, marker_color="#2ecc71"))
fig.update_layout(
barmode="group",
title="Global vs Segment-Specific Outlier Detection",
xaxis_title="Feature",
yaxis_title="Outlier Count",
template="plotly_white",
height=400
)
display_figure(fig)
else:
print("\n ℹ️ Data appears homogeneous (single segment)")
print(" → Proceeding with standard global outlier detection")
# Store result in findings metadata for use in later notebooks
findings.metadata["segment_aware_analysis"] = {
"n_segments": segment_result.n_segments,
"segmentation_recommended": segment_result.segmentation_recommended,
"recommendations": segment_result.recommendations
}
else:
print("\nNo numeric columns to analyze for outliers.")
================================================================================
SEGMENT-AWARE OUTLIER ANALYSIS
================================================================================
📊 SEGMENTATION RESULTS:
 Segments detected: 1

 ℹ️ Data appears homogeneous (single segment)
 → Proceeding with standard global outlier detection
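This run found a single segment, but when segments do exist, per-segment IQR bounds are a groupby away. A minimal sketch with an explicit segment column (all values illustrative), mirroring the global-vs-segment comparison the analyzer reports:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["retail"] * 8 + ["enterprise"] * 2,
    "order_value": [9, 10, 11, 12, 12, 13, 14, 200, 5000, 5200],
})

def iqr_outlier_mask(s: pd.Series) -> pd.Series:
    """Boolean mask of values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Global detection flags the enterprise orders as outliers
global_outliers = iqr_outlier_mask(df["order_value"])

# Segment-aware detection only flags the anomalous $200 retail order
segment_outliers = df.groupby("segment")["order_value"].transform(iqr_outlier_mask)

# "False outliers": flagged globally but normal within their own segment
false_outliers = global_outliers & ~segment_outliers
```

Note the symmetry: global detection both over-flags (the two normal enterprise orders) and under-flags (the $200 retail order hides inside the inflated global IQR).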
2.8 Global Outlier Detection
📊 IQR Method Explained:
- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1 (the middle 50% of data)
- Lower Bound = Q1 - 1.5 Γ IQR
- Upper Bound = Q3 + 1.5 Γ IQR
- Values outside these bounds are considered outliers
⚠️ Important Considerations:
- Review Section 2.7 above to determine whether global or segment-specific outlier treatment is appropriate
- Outliers in rate fields (>100%) are likely errors → Cap at 100
- Outliers in amount fields may be valid high-value customers → Keep but consider capping for modeling
- High outlier % (>10%) suggests a heavy-tailed distribution → Consider log transform instead of capping
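The three treatments named above (cap at the IQR bounds, winsorize at fixed percentiles, log transform) in a minimal sketch; the series and percentile choices are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 250.0])  # one heavy-tail value

# Cap/clip at the IQR bounds
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
capped = s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Winsorize: clip at the 1st/99th percentiles instead of IQR bounds
winsorized = s.clip(s.quantile(0.01), s.quantile(0.99))

# Log transform: compresses heavy right tails (log1p handles zeros)
logged = np.log1p(s)
```

Capping discards tail information entirely, winsorizing keeps rank order while bounding influence, and the log transform preserves all distinctions at a compressed scale, which is why it is preferred when more than ~10% of values fall outside the bounds.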
print("=" * 80)
print("OUTLIER DETECTION (IQR Method)")
print("=" * 80)
numeric_cols = [
name for name, col in findings.columns.items()
if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
and name not in TEMPORAL_METADATA_COLS
and name in df.columns
]
# Build comprehensive outlier table
outlier_data = []
for col_name in numeric_cols:
series = df[col_name].dropna()
q1 = series.quantile(0.25)
q3 = series.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers_low = (series < lower_bound).sum()
outliers_high = (series > upper_bound).sum()
total_outliers = outliers_low + outliers_high
outlier_data.append({
"feature": col_name,
"Q1": q1,
"Q3": q3,
"IQR": iqr,
"lower_bound": lower_bound,
"upper_bound": upper_bound,
"outliers_low": outliers_low,
"outliers_high": outliers_high,
"total_outliers": total_outliers,
"outlier_pct": total_outliers / len(series) * 100
})
outlier_df = pd.DataFrame(outlier_data)
if outlier_df.empty:
print("\nNo numeric columns available for outlier detection.")
else:
# Display IQR bounds table
print("\n📊 IQR BOUNDS TABLE:")
bounds_display = outlier_df[["feature", "Q1", "Q3", "IQR", "lower_bound", "upper_bound",
"outliers_low", "outliers_high"]].copy()
for col in ["Q1", "Q3", "IQR", "lower_bound", "upper_bound"]:
bounds_display[col] = bounds_display[col].apply(lambda x: f"{x:.2f}")
display(bounds_display)
# Outlier summary for columns with issues
cols_with_outliers = outlier_df[outlier_df["total_outliers"] > 0].copy()
if len(cols_with_outliers) > 0:
print("\n⚠️ COLUMNS WITH OUTLIERS:")
for _, row in cols_with_outliers.iterrows():
severity = "🔴 HIGH" if row["outlier_pct"] > 10 else "🟡 MEDIUM" if row["outlier_pct"] > 5 else "🟢 LOW"
print(f"\n {row['feature']}: {row['total_outliers']:,} outliers ({row['outlier_pct']:.2f}%) {severity}")
print(f" Lower bound: {row['lower_bound']:.2f} | Upper bound: {row['upper_bound']:.2f}")
if row["outliers_low"] > 0:
print(f" Below lower: {row['outliers_low']:,}")
if row["outliers_high"] > 0:
print(f" Above upper: {row['outliers_high']:,}")
# Determine action and add recommendation (skip if segment-aware already added)
col_name = row['feature']
existing_outlier_recs = [r for r in registry.bronze.outlier_handling if r.target_column == col_name]
if not existing_outlier_recs and row["outlier_pct"] > 5: # Only add if significant and not already handled
if row["outlier_pct"] > 10:
action = "log_transform"
rationale = f"{row['outlier_pct']:.1f}% outliers - heavy tails require log transform"
print(" → Consider log transform or RobustScaler")
else:
action = "winsorize"
rationale = f"{row['outlier_pct']:.1f}% outliers - winsorize to 1st/99th percentile"
print(" → Consider Winsorization (clip to 1st/99th percentile)")
registry.add_bronze_outlier(
column=col_name, action=action,
parameters={"method": "iqr", "lower_bound": row["lower_bound"], "upper_bound": row["upper_bound"]},
rationale=rationale,
source_notebook="02_source_integrity"
)
elif row["outlier_pct"] <= 5:
print(" → Minor issue, can cap at IQR bounds if needed")
else:
print("\n✅ No significant outliers detected in numeric columns")
# Box plots for columns with outliers
if len(cols_with_outliers) > 0 and len(cols_with_outliers) <= 6:
outlier_cols = cols_with_outliers["feature"].tolist()
fig = make_subplots(rows=1, cols=len(outlier_cols), subplot_titles=outlier_cols)
for i, col in enumerate(outlier_cols, 1):
fig.add_trace(
go.Box(y=df[col].dropna(), name=col, boxpoints="outliers",
marker_color="#3498db", showlegend=False),
row=1, col=i
)
fig.update_layout(height=400, title_text="Outlier Distribution (Box Plots)",
template="plotly_white")
display_figure(fig)
================================================================================
OUTLIER DETECTION (IQR Method)
================================================================================
📊 IQR BOUNDS TABLE:
| | feature | Q1 | Q3 | IQR | lower_bound | upper_bound | outliers_low | outliers_high |
|---|---|---|---|---|---|---|---|---|
| 0 | event_count_all_time | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1 | 5 |
| 1 | esent_sum_all_time | 16.00 | 42.00 | 26.00 | -23.00 | 81.00 | 0 | 12 |
| 2 | esent_mean_all_time | 16.00 | 42.00 | 26.00 | -23.00 | 81.00 | 0 | 12 |
| 3 | esent_max_all_time | 16.00 | 42.00 | 26.00 | -23.00 | 81.00 | 0 | 12 |
| 4 | esent_count_all_time | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 341 | created_delta_hours_vs_cohort_mean | 618.52 | 3162.52 | 2544.00 | -3197.48 | 6978.52 | 4646 | 0 |
| 342 | created_delta_hours_vs_cohort_pct | 0.00 | 0.80 | 0.80 | -1.21 | 2.01 | 0 | 4646 |
| 343 | created_delta_hours_cohort_zscore | 0.10 | 0.49 | 0.39 | -0.50 | 1.08 | 4646 | 0 |
| 344 | created_hour_vs_cohort_mean | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 |
| 345 | created_hour_vs_cohort_pct | nan | nan | nan | nan | nan | 0 | 0 |
346 rows × 8 columns
β οΈ COLUMNS WITH OUTLIERS:
event_count_all_time: 6 outliers (0.02%) π’ LOW
Lower bound: 1.00 | Upper bound: 1.00
Below lower: 1
Above upper: 5
β Minor issue, can cap at IQR bounds if needed
esent_sum_all_time: 12 outliers (0.04%) π’ LOW
Lower bound: -23.00 | Upper bound: 81.00
Above upper: 12
β Minor issue, can cap at IQR bounds if needed
esent_mean_all_time: 12 outliers (0.04%) π’ LOW
Lower bound: -23.00 | Upper bound: 81.00
Above upper: 12
β Minor issue, can cap at IQR bounds if needed
esent_max_all_time: 12 outliers (0.04%) π’ LOW
Lower bound: -23.00 | Upper bound: 81.00
Above upper: 12
β Minor issue, can cap at IQR bounds if needed
esent_count_all_time: 6 outliers (0.02%) π’ LOW
Lower bound: 1.00 | Upper bound: 1.00
Below lower: 1
Above upper: 5
β Minor issue, can cap at IQR bounds if needed
eopenrate_sum_all_time: 1,181 outliers (3.84%) π’ LOW
Lower bound: -54.79 | Upper bound: 96.88
Above upper: 1,181
β Minor issue, can cap at IQR bounds if needed
eopenrate_mean_all_time: 1,180 outliers (3.84%) π’ LOW
Lower bound: -54.79 | Upper bound: 96.88
Above upper: 1,180
β Minor issue, can cap at IQR bounds if needed
eopenrate_max_all_time: 1,181 outliers (3.84%) π’ LOW
Lower bound: -54.79 | Upper bound: 96.88
Above upper: 1,181
β Minor issue, can cap at IQR bounds if needed
eopenrate_count_all_time: 6 outliers (0.02%) π’ LOW
Lower bound: 1.00 | Upper bound: 1.00
Below lower: 1
Above upper: 5
β Minor issue, can cap at IQR bounds if needed
eclickrate_sum_all_time: 2,787 outliers (9.06%) π‘ MEDIUM
Lower bound: -10.71 | Upper bound: 17.86
Above upper: 2,787
β Consider Winsorization (clip to 1st/99th percentile)
eclickrate_mean_all_time: 2,787 outliers (9.06%) π‘ MEDIUM
Lower bound: -10.71 | Upper bound: 17.86
Above upper: 2,787
β Consider Winsorization (clip to 1st/99th percentile)
eclickrate_max_all_time: 2,787 outliers (9.06%) π‘ MEDIUM
Lower bound: -10.71 | Upper bound: 17.86
Above upper: 2,787
β Consider Winsorization (clip to 1st/99th percentile)
eclickrate_count_all_time: 6 outliers (0.02%) π’ LOW
Lower bound: 1.00 | Upper bound: 1.00
Below lower: 1
Above upper: 5
β Minor issue, can cap at IQR bounds if needed
avgorder_sum_all_time: 1,741 outliers (5.66%) π‘ MEDIUM
Lower bound: -11.35 | Upper bound: 125.64
Above upper: 1,741
β Consider Winsorization (clip to 1st/99th percentile)
avgorder_mean_all_time: 1,742 outliers (5.66%) π‘ MEDIUM
Lower bound: -11.34 | Upper bound: 125.62
Above upper: 1,742
β Consider Winsorization (clip to 1st/99th percentile)
Recommended action by severity — 🔴 HIGH: log transform or RobustScaler; 🟡 MEDIUM: winsorize (clip to 1st/99th percentile); 🟢 LOW: minor issue, cap at IQR bounds if needed.

| Column | Outliers | % | Severity | Lower bound | Upper bound | Below lower | Above upper |
|---|---|---|---|---|---|---|---|
| avgorder_max_all_time | 1,742 | 5.66% | 🟡 MEDIUM | -11.34 | 125.62 | — | 1,742 |
| avgorder_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| ordfreq_sum_all_time | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| ordfreq_mean_all_time | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| ordfreq_max_all_time | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| ordfreq_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| paperless_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| refill_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| doorstep_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| created_delta_hours_sum_all_time | 4,646 | 15.10% | 🔴 HIGH | -6360.00 | 3816.00 | 4,646 | — |
| created_delta_hours_mean_all_time | 4,587 | 15.08% | 🔴 HIGH | -6540.00 | 3924.00 | 4,587 | — |
| created_delta_hours_max_all_time | 4,587 | 15.08% | 🔴 HIGH | -6540.00 | 3924.00 | 4,587 | — |
| created_delta_hours_count_all_time | 353 | 1.15% | 🟢 LOW | 1.00 | 1.00 | 352 | 1 |
| created_hour_count_all_time | 332 | 1.08% | 🟢 LOW | 1.00 | 1.00 | 331 | 1 |
| created_dow_count_all_time | 332 | 1.08% | 🟢 LOW | 1.00 | 1.00 | 331 | 1 |
| created_is_weekend_count_all_time | 332 | 1.08% | 🟢 LOW | 1.00 | 1.00 | 331 | 1 |
| firstorder_delta_hours_sum_all_time | 4,646 | 15.10% | 🔴 HIGH | -792.00 | 1320.00 | 272 | 4,374 |
| firstorder_delta_hours_mean_all_time | 4,646 | 15.11% | 🔴 HIGH | -792.00 | 1320.00 | 272 | 4,374 |
| firstorder_delta_hours_max_all_time | 4,646 | 15.11% | 🔴 HIGH | -792.00 | 1320.00 | 272 | 4,374 |
| firstorder_delta_hours_count_all_time | 18 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 13 | 5 |
| firstorder_hour_count_all_time | 18 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 13 | 5 |
| firstorder_dow_count_all_time | 18 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 13 | 5 |
| firstorder_is_weekend_mean_all_time | 7,082 | 23.02% | 🔴 HIGH | 0.00 | 0.00 | — | 7,082 |
| firstorder_is_weekend_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| days_since_created_sum_all_time | 4,646 | 15.10% | 🔴 HIGH | -159.00 | 265.00 | — | 4,646 |
| days_since_created_mean_all_time | 4,587 | 15.08% | 🔴 HIGH | -163.50 | 272.50 | — | 4,587 |
| days_since_created_max_all_time | 4,587 | 15.08% | 🔴 HIGH | -163.50 | 272.50 | — | 4,587 |
| days_since_created_count_all_time | 353 | 1.15% | 🟢 LOW | 1.00 | 1.00 | 352 | 1 |
| days_until_created_sum_all_time | 4,646 | 15.10% | 🔴 HIGH | -265.00 | 159.00 | 4,646 | — |
| days_until_created_mean_all_time | 4,587 | 15.08% | 🔴 HIGH | -272.50 | 163.50 | 4,587 | — |
| days_until_created_max_all_time | 4,587 | 15.08% | 🔴 HIGH | -272.50 | 163.50 | 4,587 | — |
| days_until_created_count_all_time | 353 | 1.15% | 🟢 LOW | 1.00 | 1.00 | 352 | 1 |
| log1p_days_since_created_count_all_time | 353 | 1.15% | 🟢 LOW | 1.00 | 1.00 | 352 | 1 |
| is_missing_created_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| is_future_created_count_all_time | 332 | 1.08% | 🟢 LOW | 1.00 | 1.00 | 331 | 1 |
| days_since_firstorder_sum_all_time | 5,209 | 16.93% | 🔴 HIGH | -69.00 | 115.00 | — | 5,209 |
| days_since_firstorder_mean_all_time | 5,209 | 16.94% | 🔴 HIGH | -69.00 | 115.00 | — | 5,209 |
| days_since_firstorder_max_all_time | 5,209 | 16.94% | 🔴 HIGH | -69.00 | 115.00 | — | 5,209 |
| days_since_firstorder_count_all_time | 25 | 0.08% | 🟢 LOW | 1.00 | 1.00 | 24 | 1 |
| days_until_firstorder_sum_all_time | 5,209 | 16.93% | 🔴 HIGH | -115.00 | 69.00 | 5,209 | — |
| days_until_firstorder_mean_all_time | 5,209 | 16.94% | 🔴 HIGH | -115.00 | 69.00 | 5,209 | — |
| days_until_firstorder_max_all_time | 5,209 | 16.94% | 🔴 HIGH | -115.00 | 69.00 | 5,209 | — |
| days_until_firstorder_count_all_time | 25 | 0.08% | 🟢 LOW | 1.00 | 1.00 | 24 | 1 |
| log1p_days_since_firstorder_count_all_time | 25 | 0.08% | 🟢 LOW | 1.00 | 1.00 | 24 | 1 |
| is_missing_firstorder_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| is_future_firstorder_count_all_time | 4 | 0.01% | 🟢 LOW | 1.00 | 1.00 | 3 | 1 |
| lastorder_delta_hours_sum_all_time | 4,657 | 15.13% | 🔴 HIGH | -3816.00 | 6360.00 | 11 | 4,646 |
| lastorder_delta_hours_mean_all_time | 4,643 | 15.10% | 🔴 HIGH | -3852.00 | 6420.00 | 11 | 4,632 |
| lastorder_delta_hours_max_all_time | 4,643 | 15.10% | 🔴 HIGH | -3852.00 | 6420.00 | 11 | 4,632 |
| lastorder_delta_hours_count_all_time | 19 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 18 | 1 |
| lastorder_hour_count_all_time | 19 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 18 | 1 |
| lastorder_dow_count_all_time | 19 | 0.06% | 🟢 LOW | 1.00 | 1.00 | 18 | 1 |
| lastorder_is_weekend_count_all_time | 6 | 0.02% | 🟢 LOW | 1.00 | 1.00 | 1 | 5 |
| days_since_first_event_x | 6,472 | 21.03% | 🔴 HIGH | 789.50 | 2665.50 | 4,835 | 1,637 |
| cohort_year | 7,722 | 25.10% | 🔴 HIGH | 2010.50 | 2014.50 | 2,023 | 5,699 |
| lag0_esent_sum | 12 | 0.04% | 🟢 LOW | -23.00 | 81.00 | — | 12 |
| lag0_esent_mean | 12 | 0.04% | 🟢 LOW | -23.00 | 81.00 | — | 12 |
| lag0_esent_max | 12 | 0.04% | 🟢 LOW | -23.00 | 81.00 | — | 12 |
| lag0_eopenrate_sum | 1,180 | 3.84% | 🟢 LOW | -54.79 | 96.88 | — | 1,180 |
| lag0_eopenrate_mean | 1,180 | 3.84% | 🟢 LOW | -54.79 | 96.88 | — | 1,180 |
| lag0_eopenrate_max | 1,180 | 3.84% | 🟢 LOW | -54.79 | 96.88 | — | 1,180 |
| lag0_eclickrate_sum | 2,787 | 9.06% | 🟡 MEDIUM | -10.71 | 17.86 | — | 2,787 |
| lag0_eclickrate_mean | 2,787 | 9.06% | 🟡 MEDIUM | -10.71 | 17.86 | — | 2,787 |
| lag0_eclickrate_max | 2,787 | 9.06% | 🟡 MEDIUM | -10.71 | 17.86 | — | 2,787 |
| lag0_avgorder_sum | 1,742 | 5.66% | 🟡 MEDIUM | -11.34 | 125.62 | — | 1,742 |
| lag0_avgorder_mean | 1,742 | 5.66% | 🟡 MEDIUM | -11.34 | 125.62 | — | 1,742 |
| lag0_avgorder_max | 1,742 | 5.66% | 🟡 MEDIUM | -11.34 | 125.62 | — | 1,742 |
| lag0_ordfreq_sum | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| lag0_ordfreq_mean | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| lag0_ordfreq_max | 3,727 | 12.11% | 🔴 HIGH | -0.06 | 0.10 | — | 3,727 |
| lag0_created_delta_hours_sum | 4,646 | 15.10% | 🔴 HIGH | -6360.00 | 3816.00 | 4,646 | — |
| lag0_created_delta_hours_mean | 4,587 | 15.08% | 🔴 HIGH | -6540.00 | 3924.00 | 4,587 | — |
| lag0_created_delta_hours_max | 4,587 | 15.08% | 🔴 HIGH | -6540.00 | 3924.00 | 4,587 | — |
| eopenrate_beginning | 2 | 40.00% | 🔴 HIGH | 19.64 | 33.93 | 1 | 1 |
| eopenrate_end | 1 | 20.00% | 🔴 HIGH | -25.00 | 41.67 | — | 1 |
| eopenrate_trend_ratio | 1 | 25.00% | 🔴 HIGH | -0.73 | 1.55 | — | 1 |
| esent_vs_cohort_mean | 12 | 0.04% | 🟢 LOW | -51.14 | 52.86 | — | 12 |
| esent_vs_cohort_pct | 12 | 0.04% | 🟢 LOW | -0.82 | 2.88 | — | 12 |
| esent_cohort_zscore | 12 | 0.04% | 🟢 LOW | -3.05 | 3.16 | — | 12 |
| eopenrate_vs_cohort_mean | 1,180 | 3.84% | 🟢 LOW | -80.34 | 71.33 | — | 1,180 |
| eopenrate_vs_cohort_pct | 1,180 | 3.84% | 🟢 LOW | -2.14 | 3.79 | — | 1,180 |
| eopenrate_cohort_zscore | 1,180 | 3.84% | 🟢 LOW | -2.72 | 2.41 | — | 1,180 |
| eclickrate_vs_cohort_mean | 2,787 | 9.06% | 🟡 MEDIUM | -16.39 | 12.19 | — | 2,787 |
| eclickrate_vs_cohort_pct | 2,787 | 9.06% | 🟡 MEDIUM | -1.89 | 3.15 | — | 2,787 |
| eclickrate_cohort_zscore | 2,787 | 9.06% | 🟡 MEDIUM | -1.55 | 1.15 | — | 2,787 |
| avgorder_vs_cohort_mean | 1,742 | 5.66% | 🟡 MEDIUM | -73.19 | 63.77 | — | 1,742 |
| avgorder_vs_cohort_pct | 1,742 | 5.66% | 🟡 MEDIUM | -0.18 | 2.03 | — | 1,742 |
| avgorder_cohort_zscore | 1,742 | 5.66% | 🟡 MEDIUM | -1.79 | 1.56 | — | 1,742 |
| ordfreq_vs_cohort_mean | 3,727 | 12.11% | 🔴 HIGH | -0.10 | 0.06 | — | 3,727 |
| ordfreq_vs_cohort_pct | 3,727 | 12.11% | 🔴 HIGH | -1.62 | 2.70 | — | 3,727 |
| ordfreq_cohort_zscore | 3,727 | 12.11% | 🔴 HIGH | -0.95 | 0.62 | — | 3,727 |
| created_delta_hours_vs_cohort_mean | 4,646 | 15.10% | 🔴 HIGH | -3197.48 | 6978.52 | 4,646 | — |
| created_delta_hours_vs_cohort_pct | 4,646 | 15.10% | 🔴 HIGH | -1.21 | 2.01 | — | 4,646 |
| created_delta_hours_cohort_zscore | 4,646 | 15.10% | 🔴 HIGH | -0.50 | 1.08 | 4,646 | — |
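The three remediation strategies recommended in the report above (cap at IQR bounds, winsorize to the 1st/99th percentile, log transform) can be sketched with pandas. The helper names below are illustrative, not part of the `customer_retention` package:

```python
import numpy as np
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    # Standard IQR fences: [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_iqr(s: pd.Series) -> pd.Series:
    # LOW severity: clip values to the IQR fences
    lo, hi = iqr_bounds(s)
    return s.clip(lo, hi)

def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    # MEDIUM severity: clip to the 1st/99th percentiles
    return s.clip(s.quantile(lower), s.quantile(upper))

def log_transform(s: pd.Series) -> pd.Series:
    # HIGH severity: log1p compresses heavy tails; shift first if negatives exist
    shifted = s - min(s.min(), 0)
    return np.log1p(shifted)

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
print(cap_iqr(demo).max())   # 7.0 — the outlier is pulled down to the upper IQR fence
print(winsorize(demo).tolist())
print(log_transform(demo).tolist())
```

For heavy-tailed columns that feed a model, `sklearn.preprocessing.RobustScaler` (which centers on the median and scales by the IQR) is the alternative the report suggests when clipping would discard real signal.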
2.9 Date Logic Validation
What This Checks:
- Date ranges and suspicious placeholder dates (e.g., 1/1/1900, 1/1/2004)
- Date sequence violations, if DATE_SEQUENCE is configured below
⚠️ Common Issues:
- Very old dates (pre-2005): often placeholder values → set to NULL
- Sequence violations (e.g., last_purchase < first_purchase): data entry errors → flag for review
💡 Configuration:
Set DATE_SEQUENCE below to validate that dates occur in the expected chronological order.
# === DATE SEQUENCE CONFIGURATION ===
# Define expected chronological order of date columns (earliest to latest)
# Example: ["account_created", "first_purchase", "last_purchase"]
# Set to None or empty list to skip sequence validation

# Option 1: Override here
DATE_SEQUENCE = None  # e.g., ["created", "firstorder", "lastorder"]

# Option 2: Load from findings (saved in notebook 01)
if DATE_SEQUENCE is None and "date_sequence" in findings.metadata:
    DATE_SEQUENCE = findings.metadata["date_sequence"]
    print(f"Loaded date sequence from findings: {DATE_SEQUENCE}")

# Detect date columns from findings
date_cols = [name for name, col in findings.columns.items()
             if col.inferred_type == ColumnType.DATETIME]

print("=" * 60)
print("DATE LOGIC VALIDATION")
print("=" * 60)
print(f"\nDetected date columns: {date_cols}")

if date_cols:
    df_dates = df.copy()
    for col in date_cols:
        df_dates[col] = pd.to_datetime(df_dates[col], errors='coerce', format='mixed')

    # Date ranges
    print("\n📅 DATE RANGES:")
    for col in date_cols:
        print(f"  {col}: {df_dates[col].min()} to {df_dates[col].max()}")

    # Placeholder detection
    print("\n🕵️ PLACEHOLDER DATE DETECTION:")
    for col in date_cols:
        old_dates = (df_dates[col] < '2005-01-01').sum()
        if old_dates > 0:
            print(f"  {col}: {old_dates:,} dates before 2005 (possible placeholders)")
        else:
            print(f"  {col}: No suspicious early dates")

    # Sequence validation
    if DATE_SEQUENCE and len(DATE_SEQUENCE) >= 2:
        valid_sequence_cols = [c for c in DATE_SEQUENCE if c in date_cols]
        if len(valid_sequence_cols) >= 2:
            print("\nDATE SEQUENCE VALIDATION:")
            print(f"  Expected order: {' ≤ '.join(valid_sequence_cols)}")
            total_violations = 0
            for i in range(len(valid_sequence_cols) - 1):
                col1, col2 = valid_sequence_cols[i], valid_sequence_cols[i + 1]
                # Check where col2 < col1 (violation)
                mask = df_dates[col1].notna() & df_dates[col2].notna()
                violations = (df_dates.loc[mask, col2] < df_dates.loc[mask, col1]).sum()
                total_violations += violations
                if violations > 0:
                    pct = violations / mask.sum() * 100
                    print(f"  ⚠️ {col2} < {col1}: {violations:,} violations ({pct:.2f}%)")
                else:
                    print(f"  ✓ {col1} ≤ {col2}: No violations")
            if total_violations == 0:
                print("\n  ✅ All date sequences valid")
            else:
                print(f"\n  ⚠️ Total sequence violations: {total_violations:,}")
        else:
            print(f"\n⚠️ DATE_SEQUENCE columns not found in data: {DATE_SEQUENCE}")
            print(f"  Available date columns: {date_cols}")
    else:
        print("\n💡 TIP: Set DATE_SEQUENCE above or in notebook 01 to enable sequence validation")
        if len(date_cols) >= 2:
            print(f"  Available date columns: {date_cols}")
else:
    print("\nNo date columns detected.")
============================================================
DATE LOGIC VALIDATION
============================================================

Detected date columns: []
No date columns detected.
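No date columns survive in this aggregated dataset, but the remediation described above — nulling placeholder dates and flagging sequence violations — is easy to apply where raw dates exist. A minimal sketch on a toy frame (the column names echo the e.g. values in the configuration cell and are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "created":    pd.to_datetime(["1900-01-01", "2019-03-01", "2020-06-15"]),
    "firstorder": pd.to_datetime(["2019-05-01", "2019-02-01", "2020-07-01"]),
})

# 1) Null out suspicious placeholder dates (pre-2005 cutoff, as in the check above)
cutoff = pd.Timestamp("2005-01-01")
df.loc[df["created"] < cutoff, "created"] = pd.NaT

# 2) Flag sequence violations: firstorder should not precede created
both = df["created"].notna() & df["firstorder"].notna()
df["seq_violation"] = both & (df["firstorder"] < df["created"])
print(df["seq_violation"].sum())  # 1 — only the middle row violates the sequence
```

Nulling rather than deleting preserves the row for modeling while removing the bogus signal; the violation flag keeps the affected rows reviewable instead of silently "correcting" them.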
2.10 Binary Field Validation
Binary fields should contain only 0 and 1 values. Any other values indicate data quality issues.
binary_cols = [name for name, col in findings.columns.items()
               if col.inferred_type == ColumnType.BINARY
               and name not in TEMPORAL_METADATA_COLS]

print("=" * 60)
print("BINARY FIELD VALIDATION")
print("=" * 60)
print(f"\nDetected binary columns: {binary_cols}")

if binary_cols:
    binary_results = []
    for col in binary_cols:
        unique_vals = sorted(df[col].dropna().unique())
        is_valid = set(unique_vals).issubset({0, 1, 0.0, 1.0})
        count_0 = (df[col] == 0).sum()
        count_1 = (df[col] == 1).sum()
        total = count_0 + count_1
        binary_results.append({
            'column': col,
            'unique_values': unique_vals,
            'is_valid': is_valid,
            'count_0': count_0,
            'count_1': count_1,
            'pct_1': count_1 / total * 100 if total > 0 else 0
        })
        status = "✓" if is_valid else "⚠️"
        print(f"\n{status} {col}:")
        print(f"  Unique values: {unique_vals}")
        print(f"  0 (No):  {count_0:,} ({count_0/total*100:.1f}%)")
        print(f"  1 (Yes): {count_1:,} ({count_1/total*100:.1f}%)")
        if not is_valid:
            invalid_vals = [v for v in unique_vals if v not in [0, 1, 0.0, 1.0]]
            print(f"  ⚠️ Invalid values found: {invalid_vals}")

    if len(binary_cols) <= 6:
        n_cols = len(binary_cols)
        fig = make_subplots(rows=1, cols=n_cols, subplot_titles=binary_cols)
        for i, col in enumerate(binary_cols, 1):
            counts = df[col].value_counts().sort_index()
            fig.add_trace(
                go.Bar(x=['No (0)', 'Yes (1)'], y=[counts.get(0, 0), counts.get(1, 0)],
                       marker_color=['#d62728', '#2ca02c'], showlegend=False),
                row=1, col=i
            )
        fig.update_layout(height=350, title_text="Binary Field Distributions",
                          template='plotly_white')
        display_figure(fig)
else:
    print("\nNo binary columns detected.")
============================================================
BINARY FIELD VALIDATION
============================================================

Detected binary columns (48): paperless_mean_all_time, paperless_max_all_time, refill_sum_all_time, refill_mean_all_time, refill_max_all_time, doorstep_sum_all_time, doorstep_mean_all_time, doorstep_max_all_time, created_is_weekend_sum_all_time, created_is_weekend_mean_all_time, created_is_weekend_max_all_time, firstorder_is_weekend_sum_all_time, firstorder_is_weekend_max_all_time, is_missing_firstorder_sum_all_time, is_missing_firstorder_mean_all_time, is_missing_firstorder_max_all_time, lastorder_is_weekend_sum_all_time, lastorder_is_weekend_mean_all_time, lastorder_is_weekend_max_all_time, lifecycle_quadrant, lag0_paperless_sum, lag0_paperless_mean, lag0_paperless_max, lag0_refill_sum, lag0_refill_mean, lag0_refill_max, lag0_doorstep_sum, lag0_doorstep_mean, lag0_doorstep_max, lag0_created_delta_hours_count, lag0_created_hour_count, eclickrate_end, avgorder_trend_ratio, ordfreq_beginning, ordfreq_end, paperless_beginning, paperless_end, days_since_first_event_y, active_span_days, paperless_vs_cohort_mean, paperless_vs_cohort_pct, paperless_cohort_zscore, refill_vs_cohort_mean, refill_vs_cohort_pct, refill_cohort_zscore, doorstep_vs_cohort_mean, doorstep_vs_cohort_pct, doorstep_cohort_zscore

Per-column results (float values rounded to 4 decimal places):

| Status | Column | Unique values | 0 (No) | 1 (Yes) | Invalid values |
|---|---|---|---|---|---|
| ✓ | paperless_mean_all_time | 0.0, 1.0 | 10,796 (35.1%) | 19,973 (64.9%) | — |
| ✓ | paperless_max_all_time | 0.0, 1.0 | 10,796 (35.1%) | 19,973 (64.9%) | — |
| ✓ | refill_sum_all_time | 0, 1 | 27,846 (90.5%) | 2,924 (9.5%) | — |
| ✓ | refill_mean_all_time | 0.0, 1.0 | 27,845 (90.5%) | 2,924 (9.5%) | — |
| ✓ | refill_max_all_time | 0.0, 1.0 | 27,845 (90.5%) | 2,924 (9.5%) | — |
| ✓ | doorstep_sum_all_time | 0, 1 | 29,572 (96.1%) | 1,198 (3.9%) | — |
| ✓ | doorstep_mean_all_time | 0.0, 1.0 | 29,571 (96.1%) | 1,198 (3.9%) | — |
| ✓ | doorstep_max_all_time | 0.0, 1.0 | 29,571 (96.1%) | 1,198 (3.9%) | — |
| ✓ | created_is_weekend_sum_all_time | 0.0, 1.0 | 23,426 (76.1%) | 7,344 (23.9%) | — |
| ✓ | created_is_weekend_mean_all_time | 0.0, 1.0 | 23,095 (75.9%) | 7,344 (24.1%) | — |
| ✓ | created_is_weekend_max_all_time | 0.0, 1.0 | 23,095 (75.9%) | 7,344 (24.1%) | — |
| ✓ | firstorder_is_weekend_sum_all_time | 0.0, 1.0 | 23,688 (77.0%) | 7,082 (23.0%) | — |
| ✓ | firstorder_is_weekend_max_all_time | 0.0, 1.0 | 23,687 (77.0%) | 7,082 (23.0%) | — |
| ✓ | is_missing_firstorder_sum_all_time | 0.0, 1.0 | 30,758 (100.0%) | 12 (0.0%) | — |
| ✓ | is_missing_firstorder_mean_all_time | 0.0, 1.0 | 30,757 (100.0%) | 12 (0.0%) | — |
| ✓ | is_missing_firstorder_max_all_time | 0.0, 1.0 | 30,757 (100.0%) | 12 (0.0%) | — |
| ✓ | lastorder_is_weekend_sum_all_time | 0.0, 1.0 | 23,255 (75.6%) | 7,515 (24.4%) | — |
| ✓ | lastorder_is_weekend_mean_all_time | 0.0, 1.0 | 23,254 (75.6%) | 7,515 (24.4%) | — |
| ✓ | lastorder_is_weekend_max_all_time | 0.0, 1.0 | 23,254 (75.6%) | 7,515 (24.4%) | — |
| ⚠️ | lifecycle_quadrant | 'Occasional & Loyal', 'Steady & Loyal' | 0 (nan%) | 0 (nan%) | 'Occasional & Loyal', 'Steady & Loyal' |
| ✓ | lag0_paperless_sum | 0.0, 1.0 | 10,796 (35.1%) | 19,973 (64.9%) | — |
| ✓ | lag0_paperless_mean | 0.0, 1.0 | 10,796 (35.1%) | 19,973 (64.9%) | — |
| ✓ | lag0_paperless_max | 0.0, 1.0 | 10,796 (35.1%) | 19,973 (64.9%) | — |
| ✓ | lag0_refill_sum | 0.0, 1.0 | 27,845 (90.5%) | 2,924 (9.5%) | — |
| ✓ | lag0_refill_mean | 0.0, 1.0 | 27,845 (90.5%) | 2,924 (9.5%) | — |
| ✓ | lag0_refill_max | 0.0, 1.0 | 27,845 (90.5%) | 2,924 (9.5%) | — |
| ✓ | lag0_doorstep_sum | 0.0, 1.0 | 29,571 (96.1%) | 1,198 (3.9%) | — |
| ✓ | lag0_doorstep_mean | 0.0, 1.0 | 29,571 (96.1%) | 1,198 (3.9%) | — |
| ✓ | lag0_doorstep_max | 0.0, 1.0 | 29,571 (96.1%) | 1,198 (3.9%) | — |
| ✓ | lag0_created_delta_hours_count | 0.0, 1.0 | 351 (1.1%) | 30,418 (98.9%) | — |
| ✓ | lag0_created_hour_count | 0.0, 1.0 | 334 (1.1%) | 30,435 (98.9%) | — |
| ⚠️ | eclickrate_end | 0.0, 13.6364 | 4 (100.0%) | 0 (0.0%) | 13.6364 |
| ⚠️ | avgorder_trend_ratio | 1.0, 1.0099 | 0 (0.0%) | 4 (100.0%) | 1.0099 |
| ⚠️ | ordfreq_beginning | 0.0, 0.0481 | 4 (100.0%) | 0 (0.0%) | 0.0481 |
| ⚠️ | ordfreq_end | 0.0, 0.0123 | 4 (100.0%) | 0 (0.0%) | 0.0123 |
| ✓ | paperless_beginning | 0.0, 1.0 | 3 (60.0%) | 2 (40.0%) | — |
| ✓ | paperless_end | 0.0, 1.0 | 3 (60.0%) | 2 (40.0%) | — |
| ⚠️ | days_since_first_event_y | 0.0, 1462.0 | 30,764 (100.0%) | 0 (0.0%) | 1462.0 |
| ⚠️ | active_span_days | 0.0, 1462.0 | 30,764 (100.0%) | 0 (0.0%) | 1462.0 |
| ⚠️ | paperless_vs_cohort_mean | -0.6491, 0.3509 | 0 (nan%) | 0 (nan%) | -0.6491, 0.3509 |
| ⚠️ | paperless_vs_cohort_pct | 0.0, 1.5405 | 10,796 (100.0%) | 0 (0.0%) | 1.5405 |
| ⚠️ | paperless_cohort_zscore | -1.3601, 0.7352 | 0 (nan%) | 0 (nan%) | -1.3601, 0.7352 |
| ⚠️ | refill_vs_cohort_mean | -0.0950, 0.9050 | 0 (nan%) | 0 (nan%) | -0.0950, 0.9050 |
| ⚠️ | refill_vs_cohort_pct | 0.0, 10.5229 | 27,845 (100.0%) | 0 (0.0%) | 10.5229 |
| ⚠️ | refill_cohort_zscore | -0.3240, 3.0859 | 0 (nan%) | 0 (nan%) | -0.3240, 3.0859 |
| ⚠️ | doorstep_vs_cohort_mean | -0.0389, 0.9611 | 0 (nan%) | 0 (nan%) | -0.0389, 0.9611 |
| ⚠️ | doorstep_vs_cohort_pct | 0.0, 25.6836 | 29,571 (100.0%) | 0 (0.0%) | 25.6836 |
| ⚠️ | doorstep_cohort_zscore | -0.2013, 4.9682 | 0 (nan%) | 0 (nan%) | -0.2013, 4.9682 |
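Most of the flagged columns are not dirty binary data — they are string categoricals (`lifecycle_quadrant`) or continuous features that type inference misread as binary because only two distinct values survived. A triage sketch that separates the three cases (the `audit_binary` helper is illustrative, not part of the package):

```python
import pandas as pd

def audit_binary(s: pd.Series) -> dict:
    # Distinguish true 0/1 fields from columns misclassified as binary
    vals = set(s.dropna().unique())
    if vals <= {0, 1, 0.0, 1.0}:
        return {"status": "valid"}
    if s.dtype == object and len(vals) == 2:
        # Two string labels: encode explicitly rather than guessing a mapping
        return {"status": "needs_mapping", "labels": sorted(map(str, vals))}
    # Numeric but not 0/1: likely a continuous column — fix the type, not the data
    return {"status": "reclassify"}

print(audit_binary(pd.Series([0, 1, 1, 0])))
print(audit_binary(pd.Series(["Steady & Loyal", "Occasional & Loyal"])))
print(audit_binary(pd.Series([0.0, 13.64])))
```

The "reclassify" outcome matters: clipping or re-encoding a continuous column because it was mislabeled binary would destroy information, so the fix belongs in the column-type configuration, not in the values.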
2.11 Data Consistency Checks
Check for case variants, leading/trailing spaces, and other string inconsistencies.
consistency_issues = []
for col_name in df.select_dtypes(include=['object']).columns:
    if col_name in TEMPORAL_METADATA_COLS:
        continue
    unique_vals = df[col_name].dropna().unique()
    case_variants = {}
    for val in unique_vals:
        lower_val = str(val).lower().strip()
        if lower_val not in case_variants:
            case_variants[lower_val] = []
        case_variants[lower_val].append(val)
    for lower_val, variants in case_variants.items():
        if len(variants) > 1:
            consistency_issues.append({
                "Column": col_name,
                "Issue": "Case/Spacing Variants",
                "Details": str(variants[:5]),
                "variants": variants
            })

if consistency_issues:
    print("Data Consistency Issues:")
    display(pd.DataFrame([{k: v for k, v in issue.items() if k != "variants"}
                          for issue in consistency_issues]))
    # Add consistency recommendations
    for issue in consistency_issues:
        registry.add_bronze_consistency(
            column=issue["Column"],
            issue_type="case_variants",
            action="normalize_lower",
            variants=issue["variants"][:5],
            rationale=f"{len(issue['variants'])} case/spacing variants detected",
            source_notebook="02_source_integrity"
        )
else:
    print("No consistency issues detected.")
No consistency issues detected.
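No variants were found here, but when they are, the `normalize_lower` action recorded by the registry corresponds to a simple strip-and-lowercase pass; a minimal sketch:

```python
import pandas as pd

# Three spellings of the same label, plus stray whitespace
s = pd.Series(["Gold", "gold ", " GOLD", "Silver"])

# normalize_lower: trim whitespace, then lowercase
normalized = s.str.strip().str.lower()
print(normalized.value_counts().to_dict())  # {'gold': 3, 'silver': 1}
```

Applying this at the bronze layer keeps downstream group-bys and one-hot encodings from treating `"Gold"` and `"gold "` as distinct categories.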
2.12 Quality Improvement Recommendations
Automated recommendations based on the issues detected above.
from customer_retention.analysis.auto_explorer import RecommendationEngine
recommender = RecommendationEngine()
cleaning_recs = recommender.recommend_cleaning(findings)
print("=" * 80)
print("QUALITY IMPROVEMENT RECOMMENDATIONS")
print("=" * 80)
if cleaning_recs:
# Group by severity
high_severity = [r for r in cleaning_recs if r.severity == "high"]
medium_severity = [r for r in cleaning_recs if r.severity == "medium"]
low_severity = [r for r in cleaning_recs if r.severity == "low"]
if high_severity:
print("\nπ΄ HIGH PRIORITY (must fix before modeling):")
print("-" * 60)
for rec in high_severity:
print(f"\n π {rec.column_name}")
print(f" Issue: {rec.description}")
print(f" Strategy: {rec.strategy_label}")
print(f" Impact: {rec.problem_impact}")
if rec.action_steps:
print(" Action Steps:")
for step in rec.action_steps:
print(f" β’ {step}")
if medium_severity:
print("\nπ‘ MEDIUM PRIORITY (recommended fixes):")
print("-" * 60)
for rec in medium_severity:
print(f"\n π {rec.column_name}")
print(f" Issue: {rec.description}")
print(f" Strategy: {rec.strategy_label}")
print(f" Impact: {rec.problem_impact}")
if rec.action_steps:
print(" Action Steps:")
for step in rec.action_steps:
print(f" β’ {step}")
if low_severity:
print("\nπ’ LOW PRIORITY (nice to have):")
print("-" * 60)
for rec in low_severity:
print(f"\n π {rec.column_name}")
print(f" Issue: {rec.description}")
print(f" Strategy: {rec.strategy_label}")
print(f" Impact: {rec.problem_impact}")
# Persist cleaning recommendations to registry
for rec in cleaning_recs:
# Check if not already added by previous sections
existing_null = [r for r in registry.bronze.null_handling if r.target_column == rec.column_name]
existing_outlier = [r for r in registry.bronze.outlier_handling if r.target_column == rec.column_name]
if rec.issue_type in ["null_values", "missing_values"] and not existing_null:
strategy = "median" if "median" in rec.strategy.lower() else "mode" if "mode" in rec.strategy.lower() else "drop"
registry.add_bronze_null(
column=rec.column_name,
strategy=strategy,
rationale=rec.description,
source_notebook="02_source_integrity"
)
elif rec.issue_type == "outliers" and not existing_outlier:
registry.add_bronze_outlier(
column=rec.column_name,
action="winsorize" if "winsor" in rec.strategy.lower() else "cap",
parameters={"severity": rec.severity, "affected_rows": rec.affected_rows},
rationale=rec.description,
source_notebook="02_source_integrity"
)
# Summary table
print("\n" + "=" * 80)
print("CLEANUP SUMMARY")
print("=" * 80)
summary_data = []
for rec in cleaning_recs:
summary_data.append({
"Column": rec.column_name,
"Issue": rec.issue_type.replace("_", " ").title(),
"Severity": rec.severity.upper(),
"Recommended Action": rec.strategy_label,
"Affected Rows": f"{rec.affected_rows:,}"
})
summary_df = pd.DataFrame(summary_data)
display(summary_df)
# Total impact
total_affected = sum(r.affected_rows for r in cleaning_recs)
unique_affected = min(total_affected, len(df)) # Can't exceed total rows
print(f"\nTotal potentially affected: {total_affected:,} cell values")
print(f"Columns needing attention: {len(cleaning_recs)}")
else:
print("\n✅ No cleaning recommendations - data quality is excellent!")
================================================================================
QUALITY IMPROVEMENT RECOMMENDATIONS
================================================================================
🔴 HIGH PRIORITY (must fix before modeling):
------------------------------------------------------------
📌 lag1_esent_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
• Investigate why so much data is missing (data collection issue?)
• If pattern-based: create binary indicator column for 'is_missing'
• If random: consider dropping column if not critical
• If critical: use advanced imputation (KNN, iterative)
... (the identical finding, strategy, impact, and action steps repeat for 68 further columns: the sum, mean, and max aggregates of esent, eopenrate, eclickrate, avgorder, ordfreq, paperless, refill, doorstep, created_delta_hours, and created_hour at lag1 and lag2, and of esent, eopenrate, and eclickrate at lag3; output truncated)
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_avgorder_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_avgorder_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_avgorder_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_ordfreq_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_ordfreq_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_ordfreq_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_paperless_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_paperless_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_paperless_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_refill_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_refill_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_refill_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_doorstep_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_doorstep_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_doorstep_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_delta_hours_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_delta_hours_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_delta_hours_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_hour_sum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_hour_mean
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π lag3_created_hour_max
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π esent_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π esent_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eopenrate_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eopenrate_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eclickrate_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eclickrate_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π avgorder_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π avgorder_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π ordfreq_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π ordfreq_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π paperless_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π paperless_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π refill_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π refill_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π doorstep_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π doorstep_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π created_delta_hours_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π created_delta_hours_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π created_hour_velocity
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π created_hour_velocity_pct
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π esent_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π esent_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eopenrate_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eopenrate_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eclickrate_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π eclickrate_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π avgorder_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π avgorder_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π ordfreq_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π ordfreq_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π paperless_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π paperless_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π refill_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π refill_momentum
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
π doorstep_acceleration
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
β’ Investigate why so much data is missing (data collection issue?)
β’ If pattern-based: create binary indicator column for 'is_missing'
β’ If random: consider dropping column if not critical
β’ If critical: use advanced imputation (KNN, iterative)
51 additional columns share an identical critical issue and remediation, so they are grouped here rather than repeated per column.
Issue: 100.0% missing values (critical)
Strategy: Drop Column or Create Missing Indicator
Impact: Models will fail or lose significant data. High missingness often indicates systematic data collection issues.
Action Steps:
• Investigate why so much data is missing (data collection issue?)
• If pattern-based: create a binary 'is_missing' indicator column
• If random: drop the column if it is not critical
• Note: at 100% missingness there are no observed values for KNN or iterative imputation to learn from, so these columns must be fixed upstream or dropped
Affected columns:
doorstep_momentum, created_delta_hours_acceleration, created_delta_hours_momentum,
created_hour_acceleration, created_hour_momentum,
esent_beginning, esent_middle, esent_end, esent_trend_ratio,
eopenrate_beginning, eopenrate_middle, eopenrate_end, eopenrate_trend_ratio,
eclickrate_beginning, eclickrate_middle, eclickrate_end, eclickrate_trend_ratio,
avgorder_beginning, avgorder_middle, avgorder_end, avgorder_trend_ratio,
ordfreq_beginning, ordfreq_middle, ordfreq_end, ordfreq_trend_ratio,
paperless_beginning, paperless_middle, paperless_end, paperless_trend_ratio,
refill_beginning, refill_middle, refill_end, refill_trend_ratio,
doorstep_beginning, doorstep_middle, doorstep_end, doorstep_trend_ratio,
created_delta_hours_beginning, created_delta_hours_middle, created_delta_hours_end, created_delta_hours_trend_ratio,
created_hour_beginning, created_hour_middle, created_hour_end, created_hour_trend_ratio,
event_frequency, inter_event_gap_mean, inter_event_gap_std, inter_event_gap_max,
regularity_score, created_hour_vs_cohort_pct
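The drop-or-flag remediation recommended for these fully missing columns can be applied mechanically. A hedged sketch (the DataFrame and column names are invented for illustration; in the notebook the real frame comes from the loaded dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "esent_beginning": [np.nan, np.nan, np.nan],  # 100% missing
    "eopenrate_mean": [0.2, np.nan, 0.4],         # partially missing
})

# Columns that are entirely null carry no usable values.
fully_missing = [c for c in df.columns if df[c].isna().all()]

# Optionally record missingness as a flag before dropping, in case the
# pattern itself is informative (MAR/MNAR rather than MCAR).
for col in fully_missing:
    df[f"{col}_is_missing"] = df[col].isna().astype(int)

cleaned = df.drop(columns=fully_missing)
print(fully_missing)  # ['esent_beginning']
```

Partially missing columns (like eopenrate_mean above) are deliberately left alone here; they fall under the imputation strategies rather than wholesale dropping.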
MEDIUM PRIORITY (recommended fixes):
------------------------------------------------------------
π eclickrate_sum_all_time
Issue: 9.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π eclickrate_mean_all_time
Issue: 9.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π eclickrate_max_all_time
Issue: 9.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π avgorder_sum_all_time
Issue: 5.7% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π avgorder_mean_all_time
Issue: 5.7% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π avgorder_max_all_time
Issue: 5.7% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π ordfreq_sum_all_time
Issue: 12.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π ordfreq_mean_all_time
Issue: 12.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π ordfreq_max_all_time
Issue: 12.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π created_delta_hours_sum_all_time
Issue: 15.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π created_delta_hours_mean_all_time
Issue: 15.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π created_delta_hours_max_all_time
Issue: 15.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π firstorder_delta_hours_sum_all_time
Issue: 15.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
π firstorder_delta_hours_mean_all_time
Issue: 15.1% outliers detected
Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. May cause unstable predictions.
Action Steps:
β’ First verify if outliers are valid (high-value customers) or errors
β’ If errors: remove or cap at reasonable bounds
β’ If valid: clip to 1st/99th percentile (Winsorization)
β’ Consider log transform if right-skewed
β’ Use RobustScaler instead of StandardScaler
Every column below was flagged with the same strategy, impact, and action steps, so the shared guidance is stated once here rather than repeated per column.

Strategy: Clip to Bounds or Winsorize
Impact: Outliers skew mean/std calculations, affect scaling, and can dominate model training. They may cause unstable predictions.
Action Steps:
• First verify whether the outliers are valid (e.g. high-value customers) or errors
• If errors: remove them or cap them at reasonable bounds
• If valid: clip to the 1st/99th percentiles (winsorization)
• Consider a log transform if the distribution is right-skewed
• Use RobustScaler instead of StandardScaler

| Column | Outliers Detected |
|---|---|
| firstorder_delta_hours_max_all_time | 15.1% |
| firstorder_is_weekend_mean_all_time | 23.0% |
| days_since_created_sum_all_time | 15.1% |
| days_since_created_mean_all_time | 15.1% |
| days_since_created_max_all_time | 15.1% |
| days_until_created_sum_all_time | 15.1% |
| days_until_created_mean_all_time | 15.1% |
| days_until_created_max_all_time | 15.1% |
| days_since_firstorder_sum_all_time | 16.9% |
| days_since_firstorder_mean_all_time | 16.9% |
| days_since_firstorder_max_all_time | 16.9% |
| days_until_firstorder_sum_all_time | 16.9% |
| days_until_firstorder_mean_all_time | 16.9% |
| days_until_firstorder_max_all_time | 16.9% |
| lastorder_delta_hours_sum_all_time | 15.1% |
| lastorder_delta_hours_mean_all_time | 15.1% |
| lastorder_delta_hours_max_all_time | 15.1% |
| days_since_first_event_x | 21.0% |
| cohort_year | 25.1% |
| lag0_eclickrate_sum | 9.1% |
| lag0_eclickrate_mean | 9.1% |
| lag0_eclickrate_max | 9.1% |
| lag0_avgorder_sum | 5.7% |
| lag0_avgorder_mean | 5.7% |
| lag0_avgorder_max | 5.7% |
| lag0_ordfreq_sum | 12.1% |
| lag0_ordfreq_mean | 12.1% |
| lag0_ordfreq_max | 12.1% |
| lag0_created_delta_hours_sum | 15.1% |
| lag0_created_delta_hours_mean | 15.1% |
| lag0_created_delta_hours_max | 15.1% |
| eopenrate_beginning | 40.0% |
| eopenrate_end | 20.0% |
| eopenrate_trend_ratio | 25.0% |
| eclickrate_vs_cohort_mean | 9.1% |
| eclickrate_vs_cohort_pct | 9.1% |
| eclickrate_cohort_zscore | 9.1% |
| avgorder_vs_cohort_mean | 5.7% |
| avgorder_vs_cohort_pct | 5.7% |
| avgorder_cohort_zscore | 5.7% |
| ordfreq_vs_cohort_mean | 12.1% |
| ordfreq_vs_cohort_pct | 12.1% |
| ordfreq_cohort_zscore | 12.1% |
| created_delta_hours_vs_cohort_mean | 15.1% |
| created_delta_hours_vs_cohort_pct | 15.1% |
| created_delta_hours_cohort_zscore | 15.1% |

Note: for low-cardinality or discrete fields such as firstorder_is_weekend_mean_all_time and cohort_year, the IQR method can flag perfectly legitimate values as outliers. Inspect these distributions before clipping anything.
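The clip-to-percentile step recommended above can be sketched with plain pandas. This is a minimal illustration on synthetic right-skewed data; the column name `order_value` and the 1st/99th percentile bounds are assumptions, not values from this dataset.

```python
import numpy as np
import pandas as pd


def winsorize_column(df: pd.DataFrame, col: str,
                     lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip a numeric column to its empirical lower/upper quantiles."""
    lo, hi = df[col].quantile([lower, upper])
    return df[col].clip(lo, hi)


# Synthetic right-skewed demo data (stands in for a flagged column)
rng = np.random.default_rng(42)
df = pd.DataFrame({"order_value": rng.lognormal(mean=3.0, sigma=1.0, size=1000)})

df["order_value_wins"] = winsorize_column(df, "order_value")
# Alternative for right-skew: log transform instead of (or before) clipping
df["order_value_log"] = np.log1p(df["order_value"])
```

Fit any quantile bounds on the training split only and reuse them at inference time, otherwise the clipping leaks test-set information.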
LOW PRIORITY (nice to have):
------------------------------------------------------------
π custid
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π esent_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π esent_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π eopenrate_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π eopenrate_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π eclickrate_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π eclickrate_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π avgorder_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π avgorder_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π ordfreq_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π ordfreq_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π paperless_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π paperless_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π refill_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π refill_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π doorstep_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π doorstep_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_delta_hours_mean_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_delta_hours_max_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_hour_mean_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_hour_max_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_dow_mean_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_dow_max_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_is_weekend_mean_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π created_is_weekend_max_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_delta_hours_mean_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_delta_hours_max_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_hour_mean_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_hour_max_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_dow_mean_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_dow_max_all_time
Issue: 13 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_is_weekend_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π firstorder_is_weekend_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Mode
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_since_created_mean_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_since_created_max_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_until_created_mean_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_until_created_max_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π log1p_days_since_created_mean_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π log1p_days_since_created_max_all_time
Issue: 352 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π is_missing_created_mean_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π is_missing_created_max_all_time
Issue: 1 null values (0.0%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π is_future_created_mean_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π is_future_created_max_all_time
Issue: 331 null values (1.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_since_firstorder_mean_all_time
Issue: 24 null values (0.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_since_firstorder_max_all_time
Issue: 24 null values (0.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_until_firstorder_mean_all_time
Issue: 24 null values (0.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
π days_until_firstorder_max_all_time
Issue: 24 null values (0.1%)
Strategy: Impute with Median
Impact: Minor impact. Some models (XGBoost, LightGBM) handle nulls natively. Others will fail.
The remaining flagged columns, grouped by null count and recommended imputation strategy. Impact is minor for every column below: some models (XGBoost, LightGBM) handle nulls natively; others will fail without imputation.

Columns with more than one null value:

| Column | Nulls | % | Strategy |
|---|---|---|---|
| lag0_created_delta_hours_mean | 352 | 1.1% | Impute with Median |
| lag0_created_delta_hours_max | 352 | 1.1% | Impute with Median |
| lag0_created_hour_mean | 335 | 1.1% | Impute with Median |
| lag0_created_hour_max | 335 | 1.1% | Impute with Median |
| log1p_days_since_firstorder_mean_all_time | 24 | 0.1% | Impute with Median |
| log1p_days_since_firstorder_max_all_time | 24 | 0.1% | Impute with Median |
| lastorder_delta_hours_mean_all_time | 18 | 0.1% | Impute with Median |
| lastorder_delta_hours_max_all_time | 18 | 0.1% | Impute with Median |
| lastorder_hour_mean_all_time | 18 | 0.1% | Impute with Median |
| lastorder_hour_max_all_time | 18 | 0.1% | Impute with Median |
| lastorder_dow_mean_all_time | 18 | 0.1% | Impute with Median |
| lastorder_dow_max_all_time | 18 | 0.1% | Impute with Median |
| is_future_firstorder_mean_all_time | 3 | 0.0% | Impute with Median |
| is_future_firstorder_max_all_time | 3 | 0.0% | Impute with Median |

Columns with a single null value (0.0%), impute with Mode:
is_missing_firstorder_mean_all_time, is_missing_firstorder_max_all_time, lastorder_is_weekend_mean_all_time, lastorder_is_weekend_max_all_time, lifecycle_quadrant, retained, lag0_paperless_sum, lag0_paperless_mean, lag0_paperless_max, lag0_refill_sum, lag0_refill_mean, lag0_refill_max, lag0_doorstep_sum, lag0_doorstep_mean, lag0_doorstep_max, lag0_created_delta_hours_count, lag0_created_hour_count, days_since_first_event_y, active_span_days, paperless_vs_cohort_mean, paperless_vs_cohort_pct, paperless_cohort_zscore, refill_vs_cohort_mean, refill_vs_cohort_pct, refill_cohort_zscore, doorstep_vs_cohort_mean, doorstep_vs_cohort_pct, doorstep_cohort_zscore

Columns with a single null value (0.0%), impute with Median:
days_since_first_event_x, dow_sin, dow_cos, cohort_year, lag0_esent_sum, lag0_esent_mean, lag0_esent_count, lag0_esent_max, lag0_eopenrate_sum, lag0_eopenrate_mean, lag0_eopenrate_count, lag0_eopenrate_max, lag0_eclickrate_sum, lag0_eclickrate_mean, lag0_eclickrate_count, lag0_eclickrate_max, lag0_avgorder_sum, lag0_avgorder_mean, lag0_avgorder_count, lag0_avgorder_max, lag0_ordfreq_sum, lag0_ordfreq_mean, lag0_ordfreq_count, lag0_ordfreq_max, lag0_paperless_count, lag0_refill_count, lag0_doorstep_count, lag0_created_delta_hours_sum, lag0_created_hour_sum, lag1_esent_count, lag1_eopenrate_count, lag1_eclickrate_count, lag1_avgorder_count, lag1_ordfreq_count, lag1_paperless_count, lag1_refill_count, lag1_doorstep_count, lag1_created_delta_hours_count, lag1_created_hour_count, lag2_esent_count, lag2_eopenrate_count, lag2_eclickrate_count, lag2_avgorder_count, lag2_ordfreq_count, lag2_paperless_count, lag2_refill_count, lag2_doorstep_count, lag2_created_delta_hours_count, lag2_created_hour_count, lag3_esent_count, lag3_eopenrate_count, lag3_eclickrate_count, lag3_avgorder_count, lag3_ordfreq_count, lag3_paperless_count, lag3_refill_count, lag3_doorstep_count, lag3_created_delta_hours_count, lag3_created_hour_count, days_since_last_event, recency_ratio, esent_vs_cohort_mean, esent_vs_cohort_pct, esent_cohort_zscore, eopenrate_vs_cohort_mean, eopenrate_vs_cohort_pct, eopenrate_cohort_zscore, eclickrate_vs_cohort_mean, eclickrate_vs_cohort_pct, eclickrate_cohort_zscore, avgorder_vs_cohort_mean, avgorder_vs_cohort_pct, avgorder_cohort_zscore, ordfreq_vs_cohort_mean, ordfreq_vs_cohort_pct, ordfreq_cohort_zscore, created_delta_hours_vs_cohort_mean, created_delta_hours_vs_cohort_pct, created_delta_hours_cohort_zscore, created_hour_vs_cohort_mean
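As a worked sketch, the median/mode strategies above can be applied with a small helper. The DataFrame, column names, and `impute_by_strategy` function here are illustrative stand-ins, not part of the project's own API:

```python
import pandas as pd

def impute_by_strategy(df, strategies):
    """Fill nulls column-by-column using the recommended strategy per column."""
    out = df.copy()
    for col, strategy in strategies.items():
        if strategy == "median":
            out[col] = out[col].fillna(out[col].median())
        elif strategy == "mode":
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data standing in for the real feature table
df = pd.DataFrame({
    "esent_mean_all_time": [1.0, None, 3.0, 5.0],   # median of [1, 3, 5] is 3.0
    "retained": [1, 0, None, 1],                    # mode is 1
})
clean = impute_by_strategy(df, {"esent_mean_all_time": "median", "retained": "mode"})
```

In practice the strategy mapping would be built from the recommendation registry rather than written by hand.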
================================================================================
CLEANUP SUMMARY
================================================================================
| | Column | Issue | Severity | Recommended Action | Affected Rows |
|---|---|---|---|---|---|
| 0 | custid | Null Values | LOW | Impute with Mode | 1 |
| 1 | esent_mean_all_time | Null Values | LOW | Impute with Median | 1 |
| 2 | esent_max_all_time | Null Values | LOW | Impute with Median | 1 |
| 3 | eopenrate_mean_all_time | Null Values | LOW | Impute with Median | 1 |
| 4 | eopenrate_max_all_time | Null Values | LOW | Impute with Median | 1 |
| ... | ... | ... | ... | ... | ... |
| 400 | created_delta_hours_vs_cohort_pct | Outliers | MEDIUM | Clip to Bounds or Winsorize | 4,646 |
| 401 | created_delta_hours_cohort_zscore | Null Values | LOW | Impute with Median | 1 |
| 402 | created_delta_hours_cohort_zscore | Outliers | MEDIUM | Clip to Bounds or Winsorize | 4,646 |
| 403 | created_hour_vs_cohort_mean | Null Values | LOW | Impute with Median | 1 |
| 404 | created_hour_vs_cohort_pct | Missing Values | HIGH | Drop Column or Create Missing Indicator | 30,770 |
405 rows × 5 columns
Total potentially affected: 5,673,144 cell values. Columns needing attention: 405.
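Two of the recommended actions in the summary can be sketched directly: clipping to IQR bounds (the MEDIUM-severity outlier action) and creating a missing indicator (one option for the HIGH-severity missing-value case). These helpers are illustrative, not the pipeline's own functions:

```python
import pandas as pd

def clip_to_iqr_bounds(s, k=1.5):
    """'Clip to Bounds': cap values at [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def add_missing_indicator(df, col):
    """'Create Missing Indicator': keep the column, add a 0/1 null flag."""
    out = df.copy()
    out[f"{col}_is_missing"] = out[col].isna().astype(int)
    return out

# Toy examples (stand-ins for the real feature table)
clipped = clip_to_iqr_bounds(pd.Series([1.0, 2.0, 2.5, 3.0, 100.0]))
flagged = add_missing_indicator(
    pd.DataFrame({"created_hour_vs_cohort_pct": [0.4, None, 0.9]}),
    "created_hour_vs_cohort_pct",
)
```

Whether to clip or winsorize is a judgment call; clipping preserves row count while bounding the influence of extreme values.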
2.13 Save Updated Findings¶
Show/Hide Code
# Save updated findings back to the same file
findings.save(FINDINGS_PATH)
print(f"Updated findings saved to: {FINDINGS_PATH}")
# Save recommendations registry
registry.save(RECOMMENDATIONS_PATH)
print(f"Recommendations saved to: {RECOMMENDATIONS_PATH}")
# Summary of recommendations
all_recs = registry.all_recommendations
print("\nπ Recommendations Summary:")
print(f" Bronze layer: {len(registry.get_by_layer('bronze'))} recommendations")
if registry.silver:
print(f" Silver layer: {len(registry.get_by_layer('silver'))} recommendations")
if registry.gold:
print(f" Gold layer: {len(registry.get_by_layer('gold'))} recommendations")
print(f" Total: {len(all_recs)} recommendations")
Updated findings saved to: /Users/Vital/python/CustomerRetention/experiments/runs/retail-e7471284/datasets/customer_retention_retail/findings/customer_retention_retail_aggregated_findings.yaml
Recommendations saved to: /Users/Vital/python/CustomerRetention/experiments/runs/retail-e7471284/datasets/customer_retention_retail/findings/customer_retention_retail_aggregated_recommendations.yaml
Recommendations Summary:
  Bronze layer: 406 recommendations
  Silver layer: 0 recommendations
  Gold layer: 0 recommendations
  Total: 406 recommendations
Summary: What We Learned¶
In this notebook, we performed a comprehensive quality assessment:
- Duplicate Analysis - Identified key-based duplicates, exact duplicates, and value conflicts
- Target Variable - Analyzed class distribution and imbalance for modeling guidance
- Missing Values - Analyzed patterns (MCAR/MAR/MNAR) and correlations
- Segment-Aware Outliers - Detected natural data segments to avoid false positive outliers
- Global Outliers - Detected using IQR method with bounds and percentages
- Date Logic - Validated temporal sequences and detected placeholders
- Binary Fields - Verified 0/1 encoding and distributions
- Consistency - Checked for case variants and spacing issues
- Recommendations - Generated automated cleaning strategies
Key Insights¶
Duplicate Analysis:
- Exact duplicates should be removed before modeling
- Key duplicates with value conflicts require investigation and a resolution strategy
- High duplicate percentages may indicate event-level data requiring aggregation
Segment-Aware Analysis:
- If segments were detected, some global "outliers" may actually be valid data from different customer segments
- Enterprise vs retail customers, new vs established accounts often have legitimately different value ranges
- Use segment-specific outlier treatment when recommended to preserve important patterns
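A minimal sketch of segment-aware detection, assuming a segment column is available (the names and data here are hypothetical): compute IQR fences within each group, so a value is judged against its own segment's range rather than the global distribution.

```python
import pandas as pd

def segment_iqr_flags(df, value_col, segment_col, k=1.5):
    """Flag IQR outliers within each segment, not against the global distribution."""
    def flag(group):
        q1, q3 = group.quantile(0.25), group.quantile(0.75)
        iqr = q3 - q1
        return (group < q1 - k * iqr) | (group > q3 + k * iqr)
    return df.groupby(segment_col)[value_col].transform(flag)

# Toy data: enterprise order values are legitimately ~100x retail values
df = pd.DataFrame({
    "segment": ["enterprise"] * 4 + ["retail"] * 4,
    "order_value": [1000.0, 1100.0, 1050.0, 1200.0, 10.0, 12.0, 11.0, 500.0],
})
flags = segment_iqr_flags(df, "order_value", "segment")
# Only the retail 500.0 is flagged; the enterprise values are normal for their segment
```

A purely global IQR pass on the same data would instead flag the enterprise rows, which is exactly the false-positive pattern described above.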
Key Cleanup Actions for This Dataset¶
Based on the analysis above:
- Duplicates: Review key duplicates and resolve value conflicts
- Missing Values: Low (0.06%) - can drop or impute with mode
- Outliers: Check segment-aware analysis results - some may be false positives
- Date Issues: Check for placeholder dates before lastorder
- Binary Fields: All valid with 0/1 encoding
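The placeholder-date check can be sketched as follows. The sentinel value and column names are hypothetical examples — substitute whatever the date-logic validation actually flagged for this dataset:

```python
import pandas as pd

orders = pd.DataFrame({
    "firstorder": pd.to_datetime(["1900-01-01", "2020-03-01", "2021-06-01"]),
    "lastorder":  pd.to_datetime(["2020-05-01", "2020-04-01", "2021-01-01"]),
})
PLACEHOLDER = pd.Timestamp("1900-01-01")  # assumed sentinel value

# Set placeholder dates to NULL (NaT), per the framework table
orders.loc[orders["firstorder"] == PLACEHOLDER, "firstorder"] = pd.NaT
# A first order after the last order violates the sequence — null it too
orders.loc[orders["firstorder"] > orders["lastorder"], "firstorder"] = pd.NaT
```

Nulling (rather than dropping) keeps the rows so downstream imputation or missing-indicator logic can handle them consistently.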
Next Steps¶
Continue to 05_relationship_analysis.ipynb to:
- Explore correlations between features
- Analyze feature-target relationships
- Identify potential feature interactions
- Detect multicollinearity issues
Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.