Chapter 4a: Text Columns Deep Dive¶

Purpose: Transform TEXT columns (tickets, emails, messages) into numeric features using embeddings and dimensionality reduction.

When to use this notebook:

  • Your dataset contains TEXT columns (unstructured text data)
  • Detected automatically when ColumnType.TEXT appears in the findings

What you'll learn:

  • How text embeddings capture semantic meaning
  • How PCA reduces dimensionality while preserving most of the variance
  • How to choose between fast and high-quality embedding models

Outputs:

  • PC features (text_pc1, text_pc2, ...) for each TEXT column
  • TextProcessingMetadata in findings
  • Recommendations for production pipeline

Two Approaches to Text Feature Engineering¶

| Approach | Method | When to Use |
|---|---|---|
| 1. Embeddings + PCA (this notebook) | Sentence-transformers → PCA | General semantic features |
| 2. LLM Labeling (future) | LLM on samples → Train classifier | Specific categories needed |

Approach 1: Embeddings + Dimensionality Reduction (Current)¶

TEXT Column → Embeddings → PCA → pc1, pc2, ..., pcN
  • Embeddings: dense vectors capturing semantic meaning (similar texts produce similar vectors)
  • PCA: reduces dimensionality to the N components covering the target variance (default 95%)
  • Output: numeric features usable with standard ML models (see the sketch below)
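
To make this concrete, here is a minimal sketch of the pipeline written directly against sentence-transformers and scikit-learn, independent of this project's TextColumnProcessor. The column name and texts are illustrative.

# Minimal sketch of Approach 1: embeddings + PCA.
# Assumes sentence-transformers and scikit-learn are installed.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

df_demo = pd.DataFrame({"ticket_text": [
    "Cannot log in to my account",
    "Question about last month's invoice",
    "App crashes when I open settings",
    "Please reset my password",
]})

model = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim embeddings
embeddings = model.encode(df_demo["ticket_text"].tolist())

pca = PCA(n_components=0.95, svd_solver="full")      # keep 95% of variance
components = pca.fit_transform(embeddings)           # shape: (n_rows, n_kept)
for i in range(components.shape[1]):
    df_demo[f"ticket_text_pc{i + 1}"] = components[:, i]

Passing a float as n_components tells scikit-learn to keep just enough components to reach that fraction of explained variance, which mirrors the VARIANCE_THRESHOLD setting used later in this notebook.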

Embedding Model Options¶

| Model | Size | Embedding Dim | Speed | Quality | Best For |
|---|---|---|---|---|---|
| MiniLM (default) | 90 MB | 384 | Fast | Good | CPU, quick iteration, small datasets |
| Qwen3-0.6B | 1.2 GB | 1024 | Medium | Better | GPU available, production quality |
| Qwen3-4B | 8 GB | 2560 | Slow | High | 16GB+ GPU, multilingual, high accuracy |
| Qwen3-8B | 16 GB | 4096 | Slowest | Highest | 32GB+ GPU, research, max quality |

Note: Models are downloaded on first use (lazy loading). The Qwen3 models require a GPU for reasonable performance.
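
If you prefer to pay the download cost up front (for example, before an offline session), constructing the model once is enough to warm the local cache. A minimal sketch, assuming sentence-transformers; cache_folder is optional and defaults to the Hugging Face cache directory.

# One-time download to warm the local cache; later constructions load from disk.
from sentence_transformers import SentenceTransformer

SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", cache_folder="./models")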

Approach 2: LLM Labeling (Future Enhancement)¶

TEXT Column → Sample → LLM Labels → Train Classifier → Apply to All
  • Use when you need specific categorical labels (sentiment, topic, intent)
  • More expensive per label, but the outputs are directly interpretable (see the hypothetical sketch below)
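
Because this approach is a future enhancement, the repo has no supporting code yet. The sketch below is one plausible shape under that assumption: llm_label stands in for a real LLM call, and every name in it is hypothetical.

# Hypothetical sketch of Approach 2: label a small sample with an LLM,
# train a cheap classifier on embeddings, then label everything else.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def llm_label(text: str) -> str:
    # Stand-in for a real LLM call; returns a dummy label so the sketch runs.
    return "billing" if "invoice" in text.lower() else "support"

texts = [
    "Cannot log in to my account",
    "Question about last month's invoice",
    "App crashes when I open settings",
]
sample = texts[:2]                                  # in practice, a few hundred rows
sample_labels = [llm_label(t) for t in sample]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(sample), sample_labels)

all_labels = clf.predict(encoder.encode(texts))     # label every row cheaply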

4a.1 Load Previous Findings¶

In [1]:
from customer_retention.analysis.notebook_progress import track_and_export_previous

track_and_export_previous("04a_text_columns_deep_dive.ipynb")

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from customer_retention.analysis.auto_explorer import ExplorationFindings, TextProcessingMetadata
from customer_retention.analysis.visualization import ChartBuilder, display_figure
from customer_retention.core.config.column_config import ColumnType
from customer_retention.core.config.experiments import (
    EXPERIMENTS_DIR,
    FINDINGS_DIR,
)
from customer_retention.stages.profiling import (
    TextColumnProcessor,
    TextProcessingConfig,
    get_model_info,
    list_available_models,
)
In [2]:
from customer_retention.analysis.auto_explorer import load_notebook_findings

FINDINGS_PATH, _namespace, dataset_name = load_notebook_findings("04a_text_columns_deep_dive.ipynb")
print(f"Using: {FINDINGS_PATH}")

findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")
Using: /Users/Vital/python/CustomerRetention/experiments/runs/email-6301db6c/datasets/customer_emails/findings/customer_emails_findings.yaml

Loaded findings for 13 columns from ../tests/fixtures/customer_emails.csv
In [3]:
# Identify TEXT columns
text_columns = [
    name for name, col in findings.columns.items()
    if col.inferred_type == ColumnType.TEXT
]

if not text_columns:
    print("\u26a0\ufe0f No TEXT columns detected in this dataset.")
    print("   This notebook is only needed when TEXT columns are present.")
    print("   Continue to notebook 02_source_integrity.ipynb")
else:
    print(f"\u2705 Found {len(text_columns)} TEXT column(s):")
    for col in text_columns:
        col_info = findings.columns[col]
        print(f"   - {col} (Confidence: {col_info.confidence:.0%})")
⚠️ No TEXT columns detected in this dataset.
   This notebook is only needed when TEXT columns are present.
   Continue to notebook 02_source_integrity.ipynb

4a.2 Load Source Data¶

In [4]:
from customer_retention.analysis.auto_explorer.active_dataset_store import load_active_dataset

df = load_active_dataset(_namespace, dataset_name)
charts = ChartBuilder()

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {dataset_name}")
Loaded 83,198 rows x 13 columns
Data source: customer_emails

4a.3 Configuration¶

Available Embedding Models¶

Run the cell below to see available models and their specifications. Then configure your choice.

In [5]:
# Display available embedding models
print("Available Embedding Models")
print("=" * 80)
print(f"{'Preset':<15} {'Model':<35} {'Size':<10} {'Dim':<8} {'GPU?'}")
print("-" * 80)

for preset in list_available_models():
    info = get_model_info(preset)
    size = f"{info['size_mb']} MB" if info['size_mb'] < 1000 else f"{info['size_mb']/1000:.1f} GB"
    gpu = "Yes" if info['gpu_recommended'] else "No"
    print(f"{preset:<15} {info['model_name']:<35} {size:<10} {info['embedding_dim']:<8} {gpu}")
    print(f"                {info['description']}")
    print()

print("\nModels are downloaded on first use. Choose based on your hardware and quality needs.")
Available Embedding Models
================================================================================
Preset          Model                               Size       Dim      GPU?
--------------------------------------------------------------------------------
minilm          all-MiniLM-L6-v2                    90 MB      384      No
                Fast, lightweight model. Good for CPU and quick experimentation.

qwen3-0.6b      Qwen/Qwen3-Embedding-0.6B           1.2 GB     1024     Yes
                Higher quality embeddings, multilingual. Requires GPU for reasonable speed.

qwen3-4b        Qwen/Qwen3-Embedding-4B             8.0 GB     2560     Yes
                High quality, large model. Requires significant GPU memory (16GB+).

qwen3-8b        Qwen/Qwen3-Embedding-8B             16.0 GB    4096     Yes
                Highest quality, very large model. Requires 32GB+ GPU memory.


Models are downloaded on first use. Choose based on your hardware and quality needs.
In [6]:
# === TEXT PROCESSING CONFIGURATION ===
# Choose your embedding model preset:
#   "minilm"     - Fast, CPU-friendly, good for exploration (default)
#   "qwen3-0.6b" - Better quality, needs GPU
#   "qwen3-4b"   - High quality, needs 16GB+ GPU
#   "qwen3-8b"   - Highest quality, needs 32GB+ GPU

EMBEDDING_PRESET = "minilm"  # Change this to try different models

# PCA configuration
VARIANCE_THRESHOLD = 0.95  # Keep components explaining 95% of variance
MIN_COMPONENTS = 2         # At least 2 features per text column
MAX_COMPONENTS = None      # No upper limit (set to e.g., 20 to cap)

# Get model info and create config
model_info = get_model_info(EMBEDDING_PRESET)
config = TextProcessingConfig(
    embedding_model=model_info["model_name"],
    variance_threshold=VARIANCE_THRESHOLD,
    max_components=MAX_COMPONENTS,
    min_components=MIN_COMPONENTS,
    batch_size=32
)

print("Text Processing Configuration")
print("=" * 50)
print(f"  Preset: {EMBEDDING_PRESET}")
print(f"  Model: {config.embedding_model}")
print(f"  Model size: {model_info['size_mb']} MB")
print(f"  Embedding dimension: {model_info['embedding_dim']}")
print(f"  GPU recommended: {'Yes' if model_info['gpu_recommended'] else 'No'}")
print()
print(f"  Variance threshold: {config.variance_threshold:.0%}")
print(f"  Min components: {config.min_components}")
print(f"  Max components: {config.max_components or 'unlimited'}")

if model_info['gpu_recommended']:
    print()
    print("Note: This model works best with GPU. Processing may be slow on CPU.")
Text Processing Configuration
==================================================
  Preset: minilm
  Model: all-MiniLM-L6-v2
  Model size: 90 MB
  Embedding dimension: 384
  GPU recommended: No

  Variance threshold: 95%
  Min components: 2
  Max components: unlimited

4a.4 Text Column Analysis¶

Before processing, let's understand each TEXT column.

In [7]:
if text_columns:
    for col_name in text_columns:
        print(f"\n{'='*70}")
        print(f"Column: {col_name}")
        print(f"{'='*70}")

        text_series = df[col_name].fillna("")

        # Basic statistics
        non_empty = (text_series.str.len() > 0).sum()
        avg_length = text_series.str.len().mean()
        max_length = text_series.str.len().max()

        print("\n\U0001f4ca Statistics:")
        print(f"   Total rows: {len(text_series):,}")
        print(f"   Non-empty: {non_empty:,} ({non_empty/len(text_series)*100:.1f}%)")
        print(f"   Avg length: {avg_length:.0f} characters")
        print(f"   Max length: {max_length:,} characters")

        # Sample texts
        print("\n\U0001f4dd Sample texts:")
        samples = text_series[text_series.str.len() > 10].head(3)
        for i, sample in enumerate(samples, 1):
            truncated = sample[:100] + "..." if len(sample) > 100 else sample
            print(f"   {i}. {truncated}")

        # Text length distribution
        lengths = text_series.str.len()
        fig = go.Figure()
        fig.add_trace(go.Histogram(x=lengths[lengths > 0], nbinsx=50,
                                    marker_color='steelblue', opacity=0.7))
        fig.add_vline(x=lengths.median(), line_dash="solid", line_color="green",
                      annotation_text=f"Median: {lengths.median():.0f}")
        fig.update_layout(
            title=f"Text Length Distribution: {col_name}",
            xaxis_title="Character Count",
            yaxis_title="Frequency",
            template="plotly_white",
            height=350
        )
        display_figure(fig)

4a.5 Process Text Columns¶

This step:

  1. Generates embeddings using sentence-transformers
  2. Applies PCA to reduce dimensions
  3. Creates PC feature columns
In [8]:
if text_columns:
    processor = TextColumnProcessor(config)

    print("Processing TEXT columns...")
    print("(This may take a moment for large datasets)\n")

    results = []
    df_processed = df.copy()

    for col_name in text_columns:
        print(f"\n{'='*70}")
        print(f"Processing: {col_name}")
        print(f"{'='*70}")

        df_processed, result = processor.process_column(df_processed, col_name)
        results.append(result)

        print("\n\u2705 Processing complete:")
        print(f"   Embedding shape: {result.embeddings_shape}")
        print(f"   Components kept: {result.n_components}")
        print(f"   Explained variance: {result.explained_variance:.1%}")
        print(f"   Features created: {', '.join(result.component_columns)}")

    print(f"\n\n{'='*70}")
    print("PROCESSING SUMMARY")
    print(f"{'='*70}")
    print(f"\nOriginal columns: {len(df.columns)}")
    print(f"New columns added: {len(df_processed.columns) - len(df.columns)}")
    print(f"Total columns: {len(df_processed.columns)}")

4a.6 Visualize Results¶

Understanding the PC features created from text embeddings.

In [9]:
if text_columns and results:
    for result in results:
        print(f"\n{'='*70}")
        print(f"Results: {result.column_name}")
        print(f"{'='*70}")

        # Explained variance per component, read from the fitted PCA
        # (internal attributes of the processor's reducer)
        reducer = processor._reducers[result.column_name]
        var_ratios = reducer._pca.explained_variance_ratio_
        cumulative = np.cumsum(var_ratios)

        fig = make_subplots(rows=1, cols=2,
                            subplot_titles=("Variance per Component", "Cumulative Variance"))

        fig.add_trace(go.Bar(
            x=[f"PC{i+1}" for i in range(len(var_ratios))],
            y=var_ratios,
            marker_color='steelblue'
        ), row=1, col=1)

        fig.add_trace(go.Scatter(
            x=[f"PC{i+1}" for i in range(len(cumulative))],
            y=cumulative,
            mode='lines+markers',
            line_color='green'
        ), row=1, col=2)

        fig.add_hline(y=config.variance_threshold, line_dash="dash", line_color="red",
                      annotation_text=f"Target: {config.variance_threshold:.0%}",
                      row=1, col=2)

        fig.update_layout(
            title=f"PCA Results: {result.column_name}",
            height=400,
            template="plotly_white",
            showlegend=False
        )
        fig.update_yaxes(title_text="Variance Ratio", row=1, col=1)
        fig.update_yaxes(title_text="Cumulative Variance", row=1, col=2)
        display_figure(fig)

        # PC feature distributions
        if len(result.component_columns) >= 2:
            fig = px.scatter(
                df_processed,
                x=result.component_columns[0],
                y=result.component_columns[1],
                title=f"PC1 vs PC2: {result.column_name}",
                opacity=0.5
            )
            fig.update_layout(template="plotly_white", height=400)
            display_figure(fig)

4a.7 Update Findings with Text Processing Metadata¶

In [10]:
if text_columns and results:
    for result in results:
        metadata = TextProcessingMetadata(
            column_name=result.column_name,
            embedding_model=config.embedding_model,
            embedding_dim=result.embeddings_shape[1],
            n_components=result.n_components,
            explained_variance=result.explained_variance,
            component_columns=result.component_columns,
            variance_threshold_used=config.variance_threshold,
            processing_approach="pca"
        )
        findings.text_processing[result.column_name] = metadata

        print(f"\u2705 Added metadata for {result.column_name}:")
        print(f"   Model: {metadata.embedding_model}")
        print(f"   Components: {metadata.n_components}")
        print(f"   Explained variance: {metadata.explained_variance:.1%}")

    findings.save(FINDINGS_PATH)
    print(f"\nFindings saved to: {FINDINGS_PATH}")

from pathlib import Path

from customer_retention.analysis.notebook_html_exporter import export_notebook_html

export_notebook_html(Path("04a_text_columns_deep_dive.ipynb"), EXPERIMENTS_DIR / "docs")

4a.8 Generate Recommendations¶

In [ ]:
if text_columns and results:
    print("\n" + "="*70)
    print("PRODUCTION RECOMMENDATIONS")
    print("="*70)

    for result in results:
        print(f"\n\U0001f527 {result.column_name}:")
        print("   Action: embed_reduce (embeddings + PCA)")
        print(f"   Model: {config.embedding_model}")
        print(f"   Variance threshold: {config.variance_threshold:.0%}")
        print(f"   Expected features: {result.n_components}")
        print(f"   Feature names: {', '.join(result.component_columns[:3])}...")

    print("\n\U0001f4a1 These recommendations will be used by the pipeline generator.")
    print("   The same processing will be applied in production.")

Summary¶

In this notebook, we:

  1. Analyzed TEXT columns for length and content patterns
  2. Generated embeddings using sentence-transformers
  3. Applied PCA to reduce dimensions while preserving variance
  4. Created numeric features (pc1, pc2, ...) for downstream ML
  5. Updated findings with processing metadata

Key Results¶

| Column | Components | Explained Variance |
|---|---|---|
| (Filled by execution) | | |

Next Steps¶

Continue to 02_source_integrity.ipynb to:

  • Analyze duplicate records and value conflicts
  • Deep dive into missing value patterns
  • Analyze outliers with IQR method
  • Get cleaning recommendations

Save Reminder: Save this notebook (Ctrl+S / Cmd+S) before running the next one. The next notebook will automatically export this notebook's HTML documentation from the saved file.