Targets Guide
=============

A comprehensive guide to Rhoa's advanced target generation system for machine learning. Learn how to create optimized binary classification targets for trading strategies.

Overview
--------

Rhoa's ``generate_target_combinations`` function creates 8 different binary classification targets, each representing a different way to define a "profitable trade." The system automatically finds optimal parameters using either:

1. **Auto Mode**: Pareto optimization to find the optimal period AND threshold
2. **Manual Mode**: Elbow method to find the optimal threshold for a fixed period

This guide explains both modes in depth, the mathematics behind them, and best practices for production use.

Why Target Generation Matters
------------------------------

The Problem
~~~~~~~~~~~

When building ML models for trading, you need to define what constitutes a "buy signal." This requires answering:

1. **How far ahead should the price move?** (lookback period, i.e. the number of periods to look forward)
2. **How much should it move?** (threshold percentage)
3. **Which price metric to use?** (close-to-close, high-to-close, etc.)

Poor choices lead to:

- **Too many signals**: High transaction costs, low precision
- **Too few signals**: Insufficient training data
- **Unrealistic targets**: Model learns patterns that don't translate to profit

The Solution
~~~~~~~~~~~~

Rhoa's target generation:

- Tests all 8 common target definitions
- Automatically finds optimal parameters
- Balances class distribution
- Provides detailed metadata
- Ensures reproducibility

Quick Start
-----------

Auto Mode (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pandas as pd
    from rhoa.targets import generate_target_combinations

    # Load your OHLC data
    df = pd.read_csv('stock_data.csv')

    # Generate optimized targets
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        target_class_balance=0.5  # aim for 50% of the maximum positive instances
    )

    # Check what was found
    print(f"Method 7 uses period {meta['method_7']['period']}, "
          f"threshold {meta['method_7']['threshold']}%")
    # Method 7 uses period 6, threshold 4.0%

    # Use in ML pipeline
    print(targets.head())
    #    Target_1  Target_2  Target_3  ...  Target_8
    # 0     False     False     False  ...     False
    # 1      True     False      True  ...      True

Manual Mode
~~~~~~~~~~~

.. code-block:: python

    # Fixed 5-day lookback, optimize thresholds
    targets, meta = generate_target_combinations(
        df,
        mode='manual',
        lookback_periods=5
    )

    # All methods use period=5
    print(meta['method_1'])
    # {'period': 5, 'threshold': 6.0, 'instances': 22, 'pct_of_max': 1.4}

The 8 Target Methods
--------------------

Each method defines "success" differently:

.. list-table::
   :header-rows: 1
   :widths: 10 35 25 30

   * - Method
     - Formula
     - Description
     - Use Case
   * - 1
     - Close[N] / Close[0]
     - Future close vs. current close
     - Conservative, actual exit
   * - 2
     - Close[N] / High[0]
     - Future close vs. current high
     - Buy at top of range
   * - 3
     - High[N] / Close[0]
     - Future high vs. current close
     - Intraday profit potential
   * - 4
     - High[N] / High[0]
     - Future high vs. current high
     - Very conservative
   * - 5
     - MaxClose / Close[0]
     - Best close in period vs. current
     - Best exit timing
   * - 6
     - MaxClose / High[0]
     - Best close vs. current high
     - Optimal buy at top
   * - 7
     - MaxHigh / Close[0]
     - Best high in period vs. current
     - Maximum profit potential
   * - 8
     - MaxHigh / High[0]
     - Best high vs. current high
     - Ultra-conservative max

Method Details
~~~~~~~~~~~~~~

**Method 1: Close[N] / Close[0] - 1 >= threshold**

Conservative baseline. Represents buying at the close today and selling at the close N days later.

**Method 7: MaxHigh / Close[0] - 1 >= threshold**

Most generous.
Represents the maximum profit achievable in the next N days, assuming perfect intraday timing.

**Recommendation**: Start with Method 7 for training data abundance, then validate with Method 1 for conservative estimates.

Auto Mode Deep Dive
-------------------

How Pareto Optimization Works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Auto mode searches across both period (time) and threshold (percentage) dimensions to find optimal combinations.

**Objective**: Find parameters that:

1. Maximize threshold (higher quality signals)
2. Minimize period (faster trades)
3. Achieve target class balance (e.g., 50% positive instances)

**Mathematical Formulation**:

For each method, we solve:

.. math::

    \text{maximize} \quad & \text{threshold} \\
    \text{minimize} \quad & \text{period} \\
    \text{minimize} \quad & |\text{instances} - \text{target\_instances}| \\
    \text{subject to} \quad & \text{period} \in [\text{min\_period}, \text{max\_period}] \\
    & \text{threshold} \in [\text{min\_pct}, \text{max\_pct}]

This is a multi-objective optimization problem solved using Pareto frontier analysis.

The Algorithm
~~~~~~~~~~~~~

**Step 1: Find Maximum Instances**

For each method and period, calculate the maximum possible number of instances (threshold=0%):

.. code-block:: python

    # Method 1 (Close[N] / Close[0]) shown; other methods swap in their own price series
    for period in range(1, 21):
        future_close = df['Close'].shift(-period)
        max_instances = (future_close / df['Close'] - 1 >= 0).sum()

**Step 2: Calculate Target Instances**

.. code-block:: python

    target_instances = max_instances * target_class_balance
    # e.g., max_instances=500, balance=0.5 → target=250

**Step 3: Search Parameter Space**

Test all combinations of period and threshold:

.. code-block:: python

    # count_instances() stands for the per-method counting shown in Step 1
    results = []
    for period in range(1, 21):
        for threshold_pct in range(0, 100):
            instances = count_instances(period, threshold_pct)
            deviation = abs(instances - target_instances)
            results.append({
                'period': period,
                'threshold': threshold_pct,
                'instances': instances,
                'deviation': deviation
            })

**Step 4: Pareto Optimization**

Find Pareto-optimal solutions (non-dominated solutions):

.. code-block:: python

    import pandas as pd
    from paretoset import paretoset

    # Convert the list of dicts from Step 3 into a DataFrame
    results = pd.DataFrame(results)

    # Pareto optimization: max threshold, min period, min deviation
    data = results[['threshold', 'period', 'deviation']].values
    mask = paretoset(data, sense=["max", "min", "min"])
    pareto_solutions = results[mask]

**Step 5: Select Best Solution**

From the Pareto frontier, choose the solution closest to the target balance:

.. code-block:: python

    # Copy to avoid modifying a slice of the results DataFrame
    pareto_solutions = pareto_solutions.copy()
    pareto_solutions['distance'] = abs(
        pareto_solutions['instances'] / max_instances - target_class_balance
    )
    best = pareto_solutions.loc[pareto_solutions['distance'].idxmin()]

Pareto Frontier Example
~~~~~~~~~~~~~~~~~~~~~~~

Imagine these solutions for Method 7:

.. code-block:: text

    Period  Threshold  Instances  Deviation
    ------  ---------  ---------  ---------
    3       8.0        180        70         ← Pareto optimal
    5       6.0        210        40         ← Pareto optimal
    6       4.0        249        1          ← BEST (closest to 250)
    10      3.0        255        5          ← dominated by period 6
    15      2.0        240        10         ← dominated by period 6

The solution at period 6, threshold 4.0% is selected because:

- It's on the Pareto frontier (not dominated)
- Its instance count (249) is closest to the target (250)
- It is a good trade-off: moderate period, reasonable threshold

The last two rows are dominated: the period 6 solution has a higher threshold, a shorter period, and a smaller deviation than either of them.

Parameters Explained
~~~~~~~~~~~~~~~~~~~~

**target_class_balance** (float, default=0.5)

Target fraction of positive instances, measured against the maximum possible count (the count at threshold 0%, see Step 2).

- 0.3 (30%): Conservative, higher quality signals
- 0.5 (50%): Balanced, plenty of training data
- 0.7 (70%): Aggressive, many signals

**Example**:

.. code-block:: python

    # Conservative: only 30% positive
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        target_class_balance=0.3
    )

**min_period / max_period** (int, defaults: 1/20)

Period range to search.

- Shorter periods (1-5): Day trading
- Medium periods (5-15): Swing trading
- Longer periods (15-30): Position trading

**Example**:

.. code-block:: python

    # Search only swing trading range
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        min_period=5,
        max_period=15
    )

**min_pct / max_pct / step** (numeric, defaults: 0/100/1)

Threshold range and granularity, in percent.

**Example**:

.. code-block:: python

    # Search 2-10% in 0.5% increments
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        min_pct=2,
        max_pct=10,
        step=0.5  # Finer granularity
    )

Manual Mode Deep Dive
---------------------

How the Elbow Method Works
~~~~~~~~~~~~~~~~~~~~~~~~~~

Manual mode fixes the lookback period and finds the optimal threshold using the "elbow method."

**Concept**: As the threshold increases, the number of instances decreases. The "elbow" is where diminishing returns begin: the point that balances signal quality against quantity.

**Mathematical Basis**:

Plot instances vs. threshold:

.. code-block:: text

    Threshold (%)  Instances
    -------------  ---------
    0              500        ← Max instances
    1              450
    2              380
    3              300
    4              240        ← Elbow point
    5              220
    6              205
    ...            ...
    20             5          ← Very few

The Elbow Algorithm
~~~~~~~~~~~~~~~~~~~

**Step 1: Calculate Instances Across Thresholds**

.. code-block:: python

    from kneed import KneeLocator
    import numpy as np

    # Method 1 shown as an example, with a fixed 5-period horizon
    period = 5
    current_price = df['Close']
    future_price = df['Close'].shift(-period)

    thresholds = np.arange(0, 100, 1)
    instances = []
    for threshold in thresholds:
        count = (future_price / current_price - 1 >= threshold / 100).sum()
        instances.append(count)

**Step 2: Find Knee Point**

.. code-block:: python

    kn = KneeLocator(
        thresholds,
        instances,
        curve='convex',         # Curve shape
        direction='decreasing'  # Instances decrease as threshold increases
    )
    optimal_threshold = kn.elbow  # None if no elbow is detected

**Step 3: Generate Targets**

Use the detected elbow threshold:

.. code-block:: python

    target = (future_price / current_price - 1 >= optimal_threshold / 100)

Visual Example
~~~~~~~~~~~~~~

.. code-block:: text

    Instances
        |
    500 |*
        |
    400 | *
        |
    300 |  *
        |   ╲
    200 |    *  ← Elbow at threshold ≈ 4%
        |     ╲
    100 |      ╲___
        |          ╲____
      0 |_______________╲____
         0   2   4   6   8   10
             Threshold (%)

The elbow at 4% represents the optimal threshold where:

- A substantial number of instances remains (about 200)
- The threshold is meaningful (4% return)
- Diminishing returns begin beyond this point

Parameters Explained
~~~~~~~~~~~~~~~~~~~~

**lookback_periods** (int, default=5)

Fixed number of periods to look forward.

**Example**:

.. code-block:: python

    # 10-day lookback
    targets, meta = generate_target_combinations(
        df,
        mode='manual',
        lookback_periods=10
    )

    # All methods use period=10
    for i in range(1, 9):
        assert meta[f'method_{i}']['period'] == 10

Metadata Structure
------------------

Understanding the Output
~~~~~~~~~~~~~~~~~~~~~~~~

The metadata dictionary records the parameters chosen for each method:

.. code-block:: python

    targets, meta = generate_target_combinations(df, mode='auto')

    # Metadata structure
    meta = {
        'mode': 'auto',  # or 'manual'
        'method_1': {
            'period': 5,        # Lookback period
            'threshold': 3.5,   # Threshold percentage
            'instances': 247,   # Number of positive instances
            'pct_of_max': 45.2  # Percentage of maximum possible instances
        },
        # ... method_2 through method_8 ...
    }

**Field Meanings**:

- ``period``: How many days/periods to look forward
- ``threshold``: Minimum return % required for a positive label
- ``instances``: How many data points are positive
- ``pct_of_max``: What % of the theoretical maximum this represents

Using Metadata
~~~~~~~~~~~~~~

**Compare Methods**:

.. code-block:: python

    import pandas as pd

    # Create comparison DataFrame
    comparison = pd.DataFrame([
        meta[f'method_{i}'] for i in range(1, 9)
    ])
    comparison.index = [f'Method_{i}' for i in range(1, 9)]
    print(comparison)
    #           period  threshold  instances  pct_of_max
    # Method_1       5        3.5        247        45.2
    # Method_2       4        5.0        198        38.1
    # ...

**Save for Reproducibility**:

.. code-block:: python

    import json

    # Save metadata
    with open('target_metadata.json', 'w') as f:
        json.dump(meta, f, indent=2)

    # Load later
    with open('target_metadata.json', 'r') as f:
        loaded_meta = json.load(f)

**Apply to New Data**:

.. code-block:: python

    # Apply saved parameters to a test set.
    # This helper implements the Method 1 formula (Close[N] / Close[0]);
    # other methods need their own price series.
    def apply_target_params(df, method_meta):
        period = method_meta['period']
        threshold = method_meta['threshold'] / 100
        future_close = df['Close'].shift(-period)
        return (future_close / df['Close'] - 1 >= threshold)

    # Apply Method 1 parameters to test data
    test_target = apply_target_params(test_df, meta['method_1'])

Choosing Between Modes
-----------------------

Use Auto Mode When:
~~~~~~~~~~~~~~~~~~~

- You want optimal parameters for your specific data
- Class balance is critical (e.g., balanced dataset for training)
- You're exploring different timeframes
- You need reproducible, data-driven decisions
- You have sufficient data (500+ rows)

**Example Use Cases**:

- Initial model development
- Production systems with regular retraining
- Research and strategy development

Use Manual Mode When:
~~~~~~~~~~~~~~~~~~~~~~

- You have a specific trading timeframe in mind
- You want to compare performance across fixed horizons
- You're validating a hypothesis
- You have domain knowledge about appropriate periods
- You want simpler, more interpretable parameters

**Example Use Cases**:

- Backtesting specific strategies (e.g., "5-day swing trades")
- Regulatory or operational constraints on holding periods
- Comparing different stocks on equal footing

Comparison Example
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Auto mode: Let optimizer decide everything
    auto_targets, auto_meta = generate_target_combinations(
        df,
        mode='auto',
        target_class_balance=0.5
    )
    # Result: period=7, threshold=3.8%, instances=512 (50.1% of max)

    # Manual mode: You control the period
    manual_targets, manual_meta = generate_target_combinations(
        df,
        mode='manual',
        lookback_periods=7
    )
    # Result: period=7, threshold=5.2%, instances=384 (37.5% of max)
    # Auto mode found a lower threshold to hit the target balance

Complete Workflow Example
--------------------------

End-to-End ML Pipeline
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pandas as pd
    import numpy as np
    from rhoa.targets import generate_target_combinations
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    import json

    # 1. Load and prepare data
    df = pd.read_csv('stock_data.csv')
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values('Date').reset_index(drop=True)

    # 2. Time-based split (IMPORTANT: split before target generation)
    split_idx = int(len(df) * 0.8)
    train_df = df[:split_idx].copy()
    test_df = df[split_idx:].copy()

    # 3. Generate targets on TRAINING data only
    targets_train, meta = generate_target_combinations(
        train_df,
        mode='auto',
        target_class_balance=0.4  # 40% of the maximum positives
    )
    print(f"Using Method 7: {meta['method_7']}")
    # {'period': 6, 'threshold': 4.2, 'instances': 201, 'pct_of_max': 39.8}

    # 4. Save metadata for reproducibility
    with open('target_config.json', 'w') as f:
        json.dump(meta, f, indent=2)

    # 5. Create features
    train_df['SMA_20'] = train_df['Close'].rolling(20).mean()
    train_df['SMA_50'] = train_df['Close'].rolling(50).mean()
    train_df['Returns'] = train_df['Close'].pct_change()
    train_df['Volatility'] = train_df['Returns'].rolling(20).std()

    # 6. Combine features and target
    train_df['Target'] = targets_train['Target_7']
    train_clean = train_df.dropna()

    # 7. Train model
    feature_cols = ['SMA_20', 'SMA_50', 'Returns', 'Volatility']
    X_train = train_clean[feature_cols]
    y_train = train_clean['Target']
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # 8. Apply SAME parameters to test data (Method 7: MaxHigh / Close[0])
    test_period = meta['method_7']['period']
    test_threshold = meta['method_7']['threshold'] / 100
    # Highest high over the next test_period rows: a rolling max on the
    # reversed series looks forward; shift(-1) excludes the current row
    future_max_high = (
        test_df['High'][::-1].rolling(window=test_period).max()[::-1].shift(-1)
    )
    test_df['Target'] = (future_max_high / test_df['Close'] - 1 >= test_threshold)
    # Drop the final rows, whose future window is incomplete
    test_df = test_df[future_max_high.notna()].copy()

    # 9. Create test features
    test_df['SMA_20'] = test_df['Close'].rolling(20).mean()
    test_df['SMA_50'] = test_df['Close'].rolling(50).mean()
    test_df['Returns'] = test_df['Close'].pct_change()
    test_df['Volatility'] = test_df['Returns'].rolling(20).std()
    test_clean = test_df.dropna()

    # 10. Evaluate
    X_test = test_clean[feature_cols]
    y_test = test_clean['Target']
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    #               precision    recall  f1-score
    #            0       0.88      0.91      0.90
    #            1       0.76      0.69      0.72

    # 11. Visualize results
    test_clean.rhoa.plots.signal(
        y_pred=y_pred,
        y_true=y_test,
        date_col='Date',
        price_col='Close',
        threshold=meta['method_7']['threshold'],
        title=f"Method 7 Predictions (Period={test_period}, Threshold={test_threshold*100:.1f}%)"
    )

Best Practices
--------------

Data Preparation
~~~~~~~~~~~~~~~~

**Always Split Before Target Generation**:

.. code-block:: python

    # CORRECT
    split_idx = int(len(df) * 0.8)
    train, test = df[:split_idx], df[split_idx:]
    targets_train, meta = generate_target_combinations(train)
    # Apply meta parameters to test

    # WRONG - Look-ahead bias!
    targets, meta = generate_target_combinations(df)
    train, test = df[:split_idx], df[split_idx:]

**Handle NaN Values**:

.. code-block:: python

    # Targets create NaN for last N rows (future unknown)
    print(targets.tail())
    # Last 'period' rows will be NaN

    # Always drop NaN before training
    combined = pd.concat([features, targets], axis=1)
    clean_data = combined.dropna()

**Ensure Data Quality**:

.. code-block:: python

    # Check for missing values
    assert df[['Open', 'High', 'Low', 'Close']].isnull().sum().sum() == 0

    # Check OHLC relationships
    assert (df['High'] >= df['Close']).all()
    assert (df['Close'] >= df['Low']).all()
    assert (df['High'] >= df['Low']).all()

Parameter Selection
~~~~~~~~~~~~~~~~~~~

**Start Conservative**:

.. code-block:: python

    # Start with lower balance for higher quality
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        target_class_balance=0.3  # Only best 30%
    )

**Consider Your Trading Style**:

.. code-block:: python

    # Day trading: short periods
    targets, meta = generate_target_combinations(
        df, mode='auto', min_period=1, max_period=5
    )

    # Swing trading: medium periods
    targets, meta = generate_target_combinations(
        df, mode='auto', min_period=5, max_period=15
    )

    # Position trading: long periods
    targets, meta = generate_target_combinations(
        df, mode='auto', min_period=15, max_period=30
    )

**Match Threshold to Costs**:

.. code-block:: python

    # If trading costs are 0.5% round-trip,
    # the minimum profitable threshold should be > 0.5%
    targets, meta = generate_target_combinations(
        df,
        mode='auto',
        min_pct=1,  # Start at 1% minimum
        max_pct=20
    )

Method Selection
~~~~~~~~~~~~~~~~

**For Training (generous)**:

.. code-block:: python

    # Use Method 7 or 8 for more positive examples
    y_train = targets_train['Target_7']  # MaxHigh/Close
    # More data for model to learn from

**For Validation (conservative)**:

.. code-block:: python

    # Use Method 1 for realistic validation
    y_val = targets_val['Target_1']  # Close/Close
    # More realistic profit expectations

**Compare Multiple Methods**:

.. code-block:: python

    # Train on each method, compare results
    # evaluate() is a placeholder for your own scoring function
    results = {}
    for i in range(1, 9):
        X_train, y_train = features, targets[f'Target_{i}']
        model.fit(X_train, y_train)
        results[f'Method_{i}'] = evaluate(model, X_test, y_test)
    # Choose method with best out-of-sample performance

Reproducibility
~~~~~~~~~~~~~~~

**Always Save Metadata**:

.. code-block:: python

    # Save with timestamp
    import datetime
    import json

    meta_with_timestamp = {
        'generated_at': datetime.datetime.now().isoformat(),
        'data_shape': df.shape,
        'date_range': f"{df['Date'].min()} to {df['Date'].max()}",
        **meta
    }
    with open('target_metadata.json', 'w') as f:
        json.dump(meta_with_timestamp, f, indent=2)

**Version Your Data and Targets**:

.. code-block:: python

    import datetime
    import pandas as pd

    # Save targets alongside data
    df_with_targets = pd.concat([df, targets], axis=1)
    df_with_targets.to_csv(f'data_with_targets_{datetime.date.today()}.csv')

Common Pitfalls
---------------

Pitfall 1: Look-Ahead Bias
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: Generating targets on the full dataset before splitting.

.. code-block:: python

    # WRONG
    targets, meta = generate_target_combinations(df)  # Uses all data
    train, test = df[:split_idx], df[split_idx:]      # Then split
    # Why wrong: Optimization saw test data

**Solution**: Generate on train only.

.. code-block:: python

    # CORRECT
    train, test = df[:split_idx], df[split_idx:]
    targets_train, meta = generate_target_combinations(train)
    # Apply meta params to test

Pitfall 2: Ignoring Class Imbalance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: Using the default 50% balance with limited data.

.. code-block:: python

    # With 100 samples and 50% balance
    targets, meta = generate_target_combinations(
        small_df,  # Only 100 rows
        target_class_balance=0.5
    )
    # Result: at most 50 positive examples (too few!)

**Solution**: Adjust balance or get more data.

.. code-block:: python

    # Better
    targets, meta = generate_target_combinations(
        small_df,
        target_class_balance=0.7  # up to 70 positive examples
    )

Pitfall 3: Unrealistic Thresholds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: Thresholds below transaction costs.

.. code-block:: python

    # Transaction cost = 0.5% round-trip
    # But threshold = 0.3%
    # Every "profitable" trade actually loses money!

**Solution**: Set the minimum threshold above costs.

.. code-block:: python

    targets, meta = generate_target_combinations(
        df,
        min_pct=1,  # 1% minimum (above 0.5% costs)
    )

Pitfall 4: Over-Optimizing on Single Stock
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: Perfect parameters for AAPL may not work for others.

**Solution**: Test on multiple assets.

.. code-block:: python

    # Find parameters on a basket of stocks
    all_meta = {}
    for ticker in ['AAPL', 'MSFT', 'GOOGL']:
        df = load_data(ticker)  # load_data() is a placeholder for your own loader
        targets, meta = generate_target_combinations(df)
        all_meta[ticker] = meta
    # Compare all_meta across tickers and use common parameters across stocks

Advanced Topics
---------------

Multi-Target Ensemble
~~~~~~~~~~~~~~~~~~~~~

Train separate models on different target methods and ensemble:

.. code-block:: python

    from sklearn.ensemble import RandomForestClassifier

    # Train models on different targets
    models = {}
    for i in [1, 5, 7]:  # Conservative, medium, aggressive
        X, y = features, targets[f'Target_{i}']
        model = RandomForestClassifier()
        model.fit(X, y)
        models[i] = model

    # Ensemble: require 2 out of 3 to agree
    pred_1 = models[1].predict_proba(X_test)[:, 1]
    pred_5 = models[5].predict_proba(X_test)[:, 1]
    pred_7 = models[7].predict_proba(X_test)[:, 1]
    # Cast votes to int first: adding boolean arrays acts as a logical OR
    votes = (
        (pred_1 > 0.5).astype(int)
        + (pred_5 > 0.5).astype(int)
        + (pred_7 > 0.5).astype(int)
    )
    ensemble_pred = votes >= 2

Custom Target Generation
~~~~~~~~~~~~~~~~~~~~~~~~

Create your own target based on Rhoa's patterns:

.. code-block:: python

    def custom_target(df, period, threshold, cost_pct=0.5):
        """Target that accounts for transaction costs."""
        future_high = df['High'].shift(-period)
        future_low = df['Low'].shift(-period)

        # Best case profit on day N, net of costs
        profit = (future_high / df['Close'] - 1) - (cost_pct / 100)

        # Worst case loss on day N, net of costs (stop loss at 2%)
        loss = (future_low / df['Close'] - 1) - (cost_pct / 100)

        # Positive if profit >= threshold AND loss >= -2%
        return (profit >= threshold / 100) & (loss >= -0.02)

    # Use custom target
    target = custom_target(df, period=5, threshold=3.0)

Performance Tips
----------------

For Large Datasets
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Reduce search space
    targets, meta = generate_target_combinations(
        large_df,
        mode='auto',
        period_step=2,  # Check every 2 periods instead of 1
        step=2          # Check every 2% threshold instead of 1%
    )
    # Searches about 4x fewer combinations (half the periods, half the
    # thresholds), at the cost of coarser parameters

For Production
~~~~~~~~~~~~~~

.. code-block:: python

    # Cache targets if data doesn't change
    import os
    import joblib

    cache_file = 'targets_cache.pkl'
    if os.path.exists(cache_file):
        targets, meta = joblib.load(cache_file)
    else:
        targets, meta = generate_target_combinations(df)
        joblib.dump((targets, meta), cache_file)

Further Reading
---------------

- :doc:`indicators_guide` - Using indicators as features
- :doc:`visualization_guide` - Evaluating target quality
- :doc:`/examples/target_generation` - Hands-on examples
- :doc:`/examples/complete_pipeline` - Full ML pipeline
- :doc:`/api/targets` - API reference

Summary
-------

Key takeaways:

1. **Auto mode** for optimal parameters, **manual mode** for fixed periods
2. Always split data **before** generating targets
3. Save metadata for reproducibility
4. Choose target method based on use case (Method 7 for training, Method 1 for validation)
5. Account for transaction costs in threshold selection
6. Test parameters across multiple assets
7. Validate on true out-of-sample data

The target generation system is designed to remove guesswork and provide data-driven, reproducible trading signal definitions for machine learning.
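
As a closing illustration, the eight ratio definitions from the methods table can be sketched in plain pandas. This is a minimal sketch under stated assumptions, not Rhoa's implementation: the function name ``sketch_targets`` and the sample prices are hypothetical, and "MaxClose" / "MaxHigh" are assumed to mean the best value over the next N rows, excluding the current row.

```python
import pandas as pd


def sketch_targets(df: pd.DataFrame, period: int, threshold_pct: float) -> pd.DataFrame:
    """Label each row with the eight ratio definitions (hypothetical helper)."""
    close, high = df["Close"], df["High"]

    # Price exactly `period` rows ahead
    close_n = close.shift(-period)
    high_n = high.shift(-period)

    # Best price over rows t+1 .. t+period: a rolling max on the reversed
    # series looks forward in time; shift(-1) excludes the current row
    max_close = close[::-1].rolling(period).max()[::-1].shift(-1)
    max_high = high[::-1].rolling(period).max()[::-1].shift(-1)

    # Numerators follow the table order; denominators alternate Close[0], High[0]
    numerators = [close_n, close_n, high_n, high_n,
                  max_close, max_close, max_high, max_high]
    denominators = [close, high] * 4

    thr = threshold_pct / 100
    out = pd.DataFrame(index=df.index)
    for i, (num, den) in enumerate(zip(numerators, denominators), start=1):
        out[f"Target_{i}"] = (num / den - 1) >= thr

    # Rows without a complete future window cannot be labeled, so drop them
    return out[close_n.notna()]


# Made-up prices, for illustration only
prices = pd.DataFrame({
    "Close": [100.0, 101.0, 103.0, 102.0, 108.0, 107.0],
    "High": [101.0, 102.0, 104.0, 105.0, 110.0, 108.0],
})
labels = sketch_targets(prices, period=2, threshold_pct=2.5)
print(labels)
```

Because the best future high is at least as large as every other numerator and the current close is never above the current high, Method 7 is positive wherever any other method is, which matches its description as the most generous definition.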