Targets Guide
A comprehensive guide to Rhoa’s advanced target generation system for machine learning. Learn how to create optimized binary classification targets for trading strategies.
Overview
Rhoa’s generate_target_combinations function creates 8 different binary classification targets, each representing different ways to define a “profitable trade.” The system automatically finds optimal parameters using either:
Auto Mode: Pareto optimization to find optimal period AND threshold
Manual Mode: Elbow method to find optimal threshold for fixed period
This guide explains both modes in depth, the mathematics behind them, and best practices for production use.
Why Target Generation Matters
The Problem
When building ML models for trading, you need to define what constitutes a “buy signal.” This requires answering:
How far ahead should the price move? (lookback period)
How much should it move? (threshold percentage)
Which price metric to use? (close-to-close, high-to-close, etc.)
Poor choices lead to:
Too many signals: High transaction costs, low precision
Too few signals: Insufficient training data
Unrealistic targets: Model learns patterns that don’t translate to profit
The Solution
Rhoa’s target generation:
Tests all 8 common target definitions
Automatically finds optimal parameters
Balances class distribution
Provides detailed metadata
Ensures reproducibility
Quick Start
Auto Mode (Recommended)
from rhoa.targets import generate_target_combinations
import pandas as pd
# Load your OHLC data
df = pd.read_csv('stock_data.csv')
# Generate optimized targets
targets, meta = generate_target_combinations(
df,
mode='auto',
target_class_balance=0.5 # 50% positive instances
)
# Check what was found
print(f"Method 7 uses period {meta['method_7']['period']}, "
f"threshold {meta['method_7']['threshold']}%")
# Method 7 uses period 6, threshold 4.0%
# Use in ML pipeline
print(targets.head())
# Target_1 Target_2 Target_3 ... Target_8
# 0 False False False ... False
# 1 True False True ... True
Manual Mode
# Fixed 5-day lookback, optimize thresholds
targets, meta = generate_target_combinations(
df,
mode='manual',
lookback_periods=5
)
# All methods use period=5
print(meta['method_1'])
# {'period': 5, 'threshold': 6.0, 'instances': 22, 'pct_of_max': 1.4}
The 8 Target Methods
Each method defines “success” differently:
| Method | Formula | Definition | Use Case |
|---|---|---|---|
| 1 | Close[N] / Close[0] | Future close vs. current close | Conservative, actual exit |
| 2 | Close[N] / High[0] | Future close vs. current high | Buy at top of range |
| 3 | High[N] / Close[0] | Future high vs. current close | Intraday profit potential |
| 4 | High[N] / High[0] | Future high vs. current high | Very conservative |
| 5 | MaxClose / Close[0] | Best close in period vs. current | Best exit timing |
| 6 | MaxClose / High[0] | Best close vs. current high | Optimal buy at top |
| 7 | MaxHigh / Close[0] | Best high in period vs. current | Maximum profit potential |
| 8 | MaxHigh / High[0] | Best high vs. current high | Ultra-conservative max |
Method Details
Method 1: Close[N] / Close[0] - 1 >= threshold
Most conservative. Represents buying at close today, selling at close N days later.
Method 7: MaxHigh / Close[0] - 1 >= threshold
Most generous. Represents the maximum profit achievable in the next N days, assuming perfect intraday timing.
Recommendation: Start with Method 7 for training data abundance, then validate with Method 1 for conservative estimates.
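To make the eight definitions concrete, the sketch below computes all eight ratios for a single row using plain Python lists. The helper name `target_ratios` and the sample prices are illustrative assumptions, not part of Rhoa's API. It also assumes that the "period" window covers the next N bars and excludes the current one.

```python
def target_ratios(close, high, i, n):
    """Return the eight ratios for row i with an n-bar horizon.

    Hypothetical helper for illustration; assumes the window covers
    bars i+1 .. i+n (the current bar is excluded).
    """
    if i + n >= len(close):
        raise ValueError("not enough future bars for this horizon")
    future_closes = close[i + 1 : i + n + 1]
    future_highs = high[i + 1 : i + n + 1]
    return {
        1: close[i + n] / close[i],        # Close[N] / Close[0]
        2: close[i + n] / high[i],         # Close[N] / High[0]
        3: high[i + n] / close[i],         # High[N] / Close[0]
        4: high[i + n] / high[i],          # High[N] / High[0]
        5: max(future_closes) / close[i],  # MaxClose / Close[0]
        6: max(future_closes) / high[i],   # MaxClose / High[0]
        7: max(future_highs) / close[i],   # MaxHigh / Close[0]
        8: max(future_highs) / high[i],    # MaxHigh / High[0]
    }


# Made-up prices for four consecutive bars
close = [100.0, 104.0, 102.0, 103.0]
high = [101.0, 106.0, 103.0, 104.0]

ratios = target_ratios(close, high, i=0, n=3)
threshold = 0.035  # 3.5%
labels = {method: ratio - 1 >= threshold for method, ratio in ratios.items()}
print(labels)
```

With these made-up prices, Methods 3, 5, 7 and 8 are labeled positive at a 3.5% threshold while Methods 1, 2, 4 and 6 are not, which illustrates how much the choice of numerator and denominator changes the label for the same row.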
Auto Mode Deep Dive
How Pareto Optimization Works
Auto mode searches across both period (time) and threshold (percentage) dimensions to find optimal combinations.
Objective: Find parameters that:
Maximize threshold (higher quality signals)
Minimize period (faster trades)
Achieve target class balance (e.g., 50% positive instances)
Mathematical Formulation:
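For each method, we solve:
$$\max_{p,\,\tau}\ \tau, \qquad \min_{p,\,\tau}\ p, \qquad \min_{p,\,\tau}\ \bigl|\,N(p,\tau) - b\,N_{\max}\bigr|$$
where $p$ is the period, $\tau$ the threshold, $N(p,\tau)$ the number of positive instances, $N_{\max}$ the instance count at a 0% threshold, and $b$ the target class balance.
This is a multi-objective optimization problem solved using Pareto frontier analysis.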
The Algorithm
Step 1: Find Maximum Instances
For each method and period, calculate maximum possible instances (threshold=0%):
for period in range(1, 21):
    future_close = df['Close'].shift(-period)
    max_instances = (future_close / df['Close'] - 1 >= 0).sum()
Step 2: Calculate Target Instances
target_instances = max_instances * target_class_balance
# e.g., max_instances=500, balance=0.5 → target=250
Step 3: Search Parameter Space
Test all combinations of period and threshold:
results = []
for period in range(1, 21):
    for threshold_pct in range(0, 100):
        instances = count_instances(period, threshold_pct)
        deviation = abs(instances - target_instances)
        results.append({
            'period': period,
            'threshold': threshold_pct,
            'instances': instances,
            'deviation': deviation
        })
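`count_instances` is not defined in the snippet above. A minimal stand-in for Method 1 (close-to-close), written against plain lists, might look like the following. The signature, including the extra `close` argument, is an assumption chosen for illustration and is not Rhoa's internal helper.

```python
def count_instances(close, period, threshold_pct):
    """Count rows whose close-to-close return over `period` bars
    meets or exceeds `threshold_pct` (Method 1 only).

    Hypothetical stand-in: rows without `period` future bars are skipped.
    """
    threshold = threshold_pct / 100
    count = 0
    for i in range(len(close) - period):
        if close[i + period] / close[i] - 1 >= threshold:
            count += 1
    return count


close = [100.0, 102.0, 101.0, 105.0, 107.0, 104.0]  # made-up closes

# Step 1: maximum instances at a 0% threshold
max_instances = count_instances(close, period=2, threshold_pct=0)
# Step 2: scale by the desired balance
target_instances = max_instances * 0.5
print(max_instances, target_instances)  # 3 1.5
```

Raising the threshold can only keep or reduce the count, which is why the search can treat the 0% count as the upper bound.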
Step 4: Pareto Optimization
Find Pareto-optimal solutions (non-dominated solutions):
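import pandas as pd
from paretoset import paretoset
# The search results are a list of dicts; convert them for column access
results = pd.DataFrame(results)
# Pareto optimization: max threshold, min period, min deviation
data = results[['threshold', 'period', 'deviation']].values
mask = paretoset(data, sense=["max", "min", "min"])
pareto_solutions = results[mask].copy()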
Step 5: Select Best Solution
From Pareto frontier, choose solution closest to target balance:
pareto_solutions['distance'] = abs(
pareto_solutions['instances'] / max_instances - target_class_balance
)
best = pareto_solutions.loc[pareto_solutions['distance'].idxmin()]
Pareto Frontier Example
Imagine these solutions for Method 7:
Period  Threshold  Instances  Deviation
------  ---------  ---------  ---------
3       8.0        180        70         ← Pareto optimal
5       6.0        210        40         ← Pareto optimal
6       4.0        249        1          ← BEST (closest to 250)
10      3.0        255        5          ← dominated by period 6
15      2.0        240        10         ← dominated by period 6
The solution at period 6, threshold 4.0% is selected because:
It’s on the Pareto frontier (not dominated)
Instances (249) closest to target (250)
Good trade-off: moderate period, reasonable threshold
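Dominance in this example can be checked directly. Under the three objectives (higher threshold, lower period, lower deviation), a row is on the frontier only if no other row is at least as good on every objective and strictly better on one. The sketch below is a plain-Python illustration of that rule applied to the table's numbers; it is not the `paretoset` implementation, and the helper name `dominates` is made up for this example.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every objective
    (higher threshold, lower period, lower deviation) and strictly
    better on at least one."""
    at_least_as_good = (
        a["threshold"] >= b["threshold"]
        and a["period"] <= b["period"]
        and a["deviation"] <= b["deviation"]
    )
    strictly_better = (
        a["threshold"] > b["threshold"]
        or a["period"] < b["period"]
        or a["deviation"] < b["deviation"]
    )
    return at_least_as_good and strictly_better


target = 250
rows = [(3, 8.0, 180), (5, 6.0, 210), (6, 4.0, 249), (10, 3.0, 255), (15, 2.0, 240)]
solutions = [
    {"period": p, "threshold": t, "instances": n, "deviation": abs(n - target)}
    for p, t, n in rows
]

# Keep only rows that no other row dominates
frontier = [
    s for s in solutions
    if not any(dominates(other, s) for other in solutions)
]
best = min(frontier, key=lambda s: s["deviation"])

print([(s["period"], s["threshold"]) for s in frontier])
# [(3, 8.0), (5, 6.0), (6, 4.0)]
print(best["period"], best["threshold"])  # 6 4.0
```

The rows at periods 10 and 15 drop out because the period-6 row has a higher threshold, a shorter period, and a smaller deviation than either of them.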
Parameters Explained
- target_class_balance (float, default=0.5)
Target share of positive instances, expressed as a fraction of the maximum possible count (the number of instances at a 0% threshold).
0.3 (30%): Conservative, higher quality signals
0.5 (50%): Balanced, plenty of training data
0.7 (70%): Aggressive, many signals
Example:
# Conservative: only 30% positive
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3
)
- min_period / max_period (int, defaults: 1/20)
Period range to search.
Shorter periods (1-5): Day trading
Medium periods (5-15): Swing trading
Longer periods (15-30): Position trading
Example:
# Search only swing trading range
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=5,
    max_period=15
)
- min_pct / max_pct / step (int, defaults: 0/100/1)
Threshold range and granularity.
Example:
# Search 2-10% in 0.5% increments
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_pct=2,
    max_pct=10,
    step=0.5  # Finer granularity
)
Manual Mode Deep Dive
How the Elbow Method Works
Manual mode fixes the lookback period and finds the optimal threshold using the “elbow method.”
Concept: As threshold increases, instances decrease. The “elbow” is where diminishing returns begin - the optimal balance between quality and quantity.
Mathematical Basis:
Plot instances vs. threshold:
Threshold (%) Instances
------------- ---------
0 500 ← Max instances
1 450
2 380
3 300
4 240 ← Elbow point
5 220
6 205
... ...
20 5 ← Very few
The Elbow Algorithm
Step 1: Calculate Instances Across Thresholds
from kneed import KneeLocator
import numpy as np
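thresholds = np.arange(0, 100, 1)
instances = []
# future_price / current_price: the Series pair for the chosen method,
# e.g. df['Close'].shift(-period) and df['Close'] for Method 1
for threshold in thresholds:
    count = (future_price / current_price - 1 >= threshold / 100).sum()
    instances.append(count)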
Step 2: Find Knee Point
kn = KneeLocator(
thresholds,
instances,
curve='convex', # Curve shape
direction='decreasing' # Instances decrease as threshold increases
)
optimal_threshold = kn.elbow
Step 3: Generate Targets
Use the detected elbow threshold:
target = (future_price / current_price - 1 >= optimal_threshold / 100)
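For intuition about what a knee detector looks for, the sketch below finds the point that lies farthest below the straight line joining the first and last points of the curve, after scaling both axes to [0, 1]. This is a simplified chord-distance heuristic written for illustration, not the algorithm `kneed` implements, and `find_elbow` is a hypothetical name. The data are the first seven rows of the instances-vs-threshold table shown earlier.

```python
def find_elbow(xs, ys):
    """Return the x whose point lies farthest below the straight line
    joining the first and last points, after scaling both axes to [0, 1].

    Simplified stand-in for a knee detector on a decreasing curve;
    assumes ys starts at its maximum and ends at its minimum.
    """
    x_span = xs[-1] - xs[0]
    y_span = ys[0] - ys[-1]
    best_x, best_gap = xs[0], 0.0
    for x, y in zip(xs, ys):
        x_norm = (x - xs[0]) / x_span
        y_norm = (y - ys[-1]) / y_span
        gap = (1 - x_norm) - y_norm  # vertical distance below the chord
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x


thresholds = [0, 1, 2, 3, 4, 5, 6]
instances = [500, 450, 380, 300, 240, 220, 205]

elbow = find_elbow(thresholds, instances)
print(elbow)  # 4
```

On this data the heuristic picks 4%, matching the elbow marked in the table. On noisier curves a dedicated library is likely to be more robust than this sketch.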
Visual Example
Instances
    |
500 |*
    |
400 |  *
    |
300 |     *
    |      ╲
200 |       * ← Elbow at threshold ≈ 4%
    |        ╲
100 |         ╲___
    |             ╲____
  0 |__________________╲____
    0   2   4   6   8   10   Threshold (%)
The elbow at 4% represents the optimal threshold where:
- Still have substantial instances (200)
- Threshold is meaningful (4% return)
- Diminishing returns begin beyond this point
Parameters Explained
- lookback_periods (int, default=5)
Fixed number of periods to look forward.
Example:
# 10-day lookback
targets, meta = generate_target_combinations(
    df,
    mode='manual',
    lookback_periods=10
)
# All methods use period=10
for i in range(1, 9):
    assert meta[f'method_{i}']['period'] == 10
Metadata Structure
Understanding the Output
The metadata dictionary contains rich information:
targets, meta = generate_target_combinations(df, mode='auto')
# Metadata structure
meta = {
'mode': 'auto', # or 'manual'
'method_1': {
'period': 5, # Lookback period
'threshold': 3.5, # Threshold percentage
'instances': 247, # Number of positive instances
'pct_of_max': 45.2 # Percentage of maximum possible instances
},
# ... method_2 through method_8 ...
}
Field Meanings:
period: How many days/periods to look forward
threshold: Minimum return % required for positive label
instances: How many data points are positive
pct_of_max: What % of theoretical maximum this represents
Using Metadata
Compare Methods:
import pandas as pd
# Create comparison DataFrame
comparison = pd.DataFrame([
meta[f'method_{i}'] for i in range(1, 9)
])
comparison.index = [f'Method_{i}' for i in range(1, 9)]
print(comparison)
# period threshold instances pct_of_max
# Method_1 5 3.5 247 45.2
# Method_2 4 5.0 198 38.1
# ...
Save for Reproducibility:
import json
# Save metadata
with open('target_metadata.json', 'w') as f:
    json.dump(meta, f, indent=2)
# Load later
with open('target_metadata.json', 'r') as f:
    loaded_meta = json.load(f)
Apply to New Data:
# Apply same parameters to test set (Method 7: MaxHigh / Close[0])
def apply_target_params(df, method_meta):
    period = method_meta['period']
    threshold = method_meta['threshold'] / 100
    # Highest high over the next `period` rows (window assumed to
    # exclude the current row)
    future_max_high = df['High'].rolling(window=period).max().shift(-period)
    return (future_max_high / df['Close'] - 1 >= threshold)
# Apply Method 7 parameters to test data
test_target = apply_target_params(test_df, meta['method_7'])
Choosing Between Modes
Use Auto Mode When:
You want optimal parameters for your specific data
Class balance is critical (e.g., balanced dataset for training)
You’re exploring different timeframes
You need reproducible, data-driven decisions
You have sufficient data (500+ rows)
Example Use Cases:
Initial model development
Production systems with regular retraining
Research and strategy development
Use Manual Mode When:
You have a specific trading timeframe in mind
You want to compare performance across fixed horizons
You’re validating a hypothesis
You have domain knowledge about appropriate periods
You want simpler, more interpretable parameters
Example Use Cases:
Backtesting specific strategies (e.g., “5-day swing trades”)
Regulatory or operational constraints on holding periods
Comparing different stocks on equal footing
Comparison Example
# Auto mode: Let optimizer decide everything
auto_targets, auto_meta = generate_target_combinations(
df, mode='auto', target_class_balance=0.5
)
# Result: period=7, threshold=3.8%, instances=512 (50.1% of max)
# Manual mode: You control the period
manual_targets, manual_meta = generate_target_combinations(
df, mode='manual', lookback_periods=7
)
# Result: period=7, threshold=5.2%, instances=384 (37.5% of max)
# Auto mode found lower threshold to hit target balance
Complete Workflow Example
End-to-End ML Pipeline
import pandas as pd
import numpy as np
from rhoa.targets import generate_target_combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import json
# 1. Load and prepare data
df = pd.read_csv('stock_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)
# 2. Time-based split (IMPORTANT: split before target generation)
split_idx = int(len(df) * 0.8)
train_df = df[:split_idx].copy()
test_df = df[split_idx:].copy()
# 3. Generate targets on TRAINING data only
targets_train, meta = generate_target_combinations(
train_df,
mode='auto',
target_class_balance=0.4 # 40% positive
)
print(f"Using Method 7: {meta['method_7']}")
# {'period': 6, 'threshold': 4.2, 'instances': 201, 'pct_of_max': 39.8}
# 4. Save metadata for reproducibility
with open('target_config.json', 'w') as f:
    json.dump(meta, f, indent=2)
# 5. Create features
train_df['SMA_20'] = train_df['Close'].rolling(20).mean()
train_df['SMA_50'] = train_df['Close'].rolling(50).mean()
train_df['Returns'] = train_df['Close'].pct_change()
train_df['Volatility'] = train_df['Returns'].rolling(20).std()
# 6. Combine features and target
train_df['Target'] = targets_train['Target_7']
train_clean = train_df.dropna()
# 7. Train model
feature_cols = ['SMA_20', 'SMA_50', 'Returns', 'Volatility']
X_train = train_clean[feature_cols]
y_train = train_clean['Target']
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 8. Apply SAME parameters to test data
test_period = meta['method_7']['period']
test_threshold = meta['method_7']['threshold'] / 100
# Method 7: highest high over the next `test_period` rows vs. current close
future_max_high = test_df['High'].rolling(window=test_period).max().shift(-test_period)
test_df['Target'] = (future_max_high / test_df['Close'] - 1 >= test_threshold)
# The last `test_period` rows have no complete future window; drop them
test_df = test_df[future_max_high.notna()].copy()
# 9. Create test features
test_df['SMA_20'] = test_df['Close'].rolling(20).mean()
test_df['SMA_50'] = test_df['Close'].rolling(50).mean()
test_df['Returns'] = test_df['Close'].pct_change()
test_df['Volatility'] = test_df['Returns'].rolling(20).std()
test_clean = test_df.dropna()
# 10. Evaluate
X_test = test_clean[feature_cols]
y_test = test_clean['Target']
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# precision recall f1-score
# 0 0.88 0.91 0.90
# 1 0.76 0.69 0.72
# 11. Visualize results
test_clean.rhoa.plots.signal(
y_pred=y_pred,
y_true=y_test,
date_col='Date',
price_col='Close',
threshold=meta['method_7']['threshold'],
title=f"Method 7 Predictions (Period={test_period}, Threshold={test_threshold*100:.1f}%)"
)
Best Practices
Data Preparation
Always Split Before Target Generation:
# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta parameters to test
# WRONG - Look-ahead bias!
targets, meta = generate_target_combinations(df)
train, test = split(df)
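The `split` call above is shorthand. A minimal chronological split, assuming the rows are already sorted by date, could look like this; `chronological_split` is a hypothetical helper name, and the same slicing works on a date-sorted DataFrame.

```python
def chronological_split(rows, train_frac=0.8):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be between 0 and 1")
    split_idx = int(len(rows) * train_frac)
    return rows[:split_idx], rows[split_idx:]


closes = [100 + i for i in range(10)]  # stand-in for date-sorted prices
train, test = chronological_split(closes)
print(len(train), len(test))  # 8 2
```

Because nothing is shuffled, every training row precedes every test row, which is the property the "split first" rule depends on.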
Handle NaN Values:
# Targets create NaN for last N rows (future unknown)
print(targets.tail())
# Last 'period' rows will be NaN
# Always drop NaN before training
combined = pd.concat([features, targets], axis=1)
clean_data = combined.dropna()
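A related subtlety applies when targets are recomputed by hand with shift-and-compare expressions: a comparison against NaN evaluates to False, not NaN, so rows with an unknown future can silently turn into negative labels. The sketch below uses plain Python floats to show the effect and one way to keep those rows distinguishable. The values are made up, and it makes no claim about how Rhoa itself marks these rows.

```python
import math

future_close = [103.0, 99.0, math.nan]  # last row has no future price yet
current_close = [100.0, 100.0, 100.0]
threshold = 0.02

# Naive comparison: the NaN row becomes False
naive = [f / c - 1 >= threshold for f, c in zip(future_close, current_close)]

# Keep unknown rows as None so they can be dropped instead of
# being counted as negative examples
masked = [
    None if math.isnan(f) else (f / c - 1 >= threshold)
    for f, c in zip(future_close, current_close)
]

print(naive)   # [True, False, False]
print(masked)  # [True, False, None]
```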
Ensure Data Quality:
# Check for missing values
assert df[['Open', 'High', 'Low', 'Close']].isnull().sum().sum() == 0
# Check OHLC relationships
assert (df['High'] >= df['Close']).all()
assert (df['Close'] >= df['Low']).all()
assert (df['High'] >= df['Low']).all()
Parameter Selection
Start Conservative:
# Start with lower balance for higher quality
targets, meta = generate_target_combinations(
df,
mode='auto',
target_class_balance=0.3 # Only best 30%
)
Consider Your Trading Style:
# Day trading: short periods
targets, meta = generate_target_combinations(
df,
mode='auto',
min_period=1,
max_period=5
)
# Swing trading: medium periods
targets, meta = generate_target_combinations(
df,
mode='auto',
min_period=5,
max_period=15
)
# Position trading: long periods
targets, meta = generate_target_combinations(
df,
mode='auto',
min_period=15,
max_period=30
)
Match Threshold to Costs:
# If trading costs are 0.5% round-trip
# Minimum profitable threshold should be > 0.5%
targets, meta = generate_target_combinations(
df,
mode='auto',
min_pct=1, # Start at 1% minimum
max_pct=20
)
Method Selection
For Training (generous):
# Use Method 7 or 8 for more positive examples
y_train = targets_train['Target_7'] # MaxHigh/Close
# More data for model to learn from
For Validation (conservative):
# Use Method 1 for realistic validation
y_val = targets_val['Target_1'] # Close/Close
# More realistic profit expectations
Compare Multiple Methods:
# Train on each method, compare results
results = {}
for i in range(1, 9):
    X_train, y_train = features, targets[f'Target_{i}']
    model.fit(X_train, y_train)
    results[f'Method_{i}'] = evaluate(model, X_test, y_test)
# Choose method with best out-of-sample performance
Reproducibility
Always Save Metadata:
# Save with timestamp
import datetime
import json
meta_with_timestamp = {
    'generated_at': datetime.datetime.now().isoformat(),
    'data_shape': df.shape,
    'date_range': f"{df['Date'].min()} to {df['Date'].max()}",
    **meta
}
with open('target_metadata.json', 'w') as f:
    json.dump(meta_with_timestamp, f, indent=2)
Version Your Data and Targets:
# Save targets alongside data
df_with_targets = pd.concat([df, targets], axis=1)
df_with_targets.to_csv(f'data_with_targets_{datetime.date.today()}.csv')
Common Pitfalls
Pitfall 1: Look-Ahead Bias
Problem: Generating targets on full dataset before splitting.
# WRONG
targets, meta = generate_target_combinations(df) # Uses all data
train, test = split(df) # Then split
# Why wrong: Optimization saw test data
Solution: Generate on train only.
# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta params to test
Pitfall 2: Ignoring Class Imbalance
Problem: Using default 50% balance with limited data.
# With 100 samples and 50% balance
targets, meta = generate_target_combinations(
small_df, # Only 100 rows
target_class_balance=0.5
)
# Result: at most 50 positive examples, likely fewer (too few!)
Solution: Adjust balance or get more data.
# Better
targets, meta = generate_target_combinations(
small_df,
target_class_balance=0.7 # up to 70 positive examples
)
Pitfall 3: Unrealistic Thresholds
Problem: Thresholds below transaction costs.
# Transaction cost = 0.5% round-trip
# But threshold = 0.3%
# Every "profitable" trade actually loses money!
Solution: Set minimum threshold above costs.
targets, meta = generate_target_combinations(
df,
min_pct=1, # 1% minimum (above 0.5% costs)
)
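The arithmetic behind this pitfall is simple to check. The sketch below subtracts the round-trip cost from the gross threshold for the two cases discussed above. `net_return_pct` is a hypothetical helper, and a flat percentage cost is a simplifying assumption.

```python
def net_return_pct(gross_pct, round_trip_cost_pct):
    """Return the percentage left after subtracting round-trip costs."""
    return gross_pct - round_trip_cost_pct


cost = 0.5  # round-trip cost in percent, as in the example above

for threshold in (0.3, 1.0):
    net = net_return_pct(threshold, cost)
    status = "profitable" if net > 0 else "loses money"
    print(f"threshold {threshold}% -> net {net:+.1f}% ({status})")
# threshold 0.3% -> net -0.2% (loses money)
# threshold 1.0% -> net +0.5% (profitable)
```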
Pitfall 4: Over-Optimizing on Single Stock
Problem: Perfect parameters for AAPL may not work for others.
Solution: Test on multiple assets.
# Find parameters on basket of stocks
all_targets = []
for ticker in ['AAPL', 'MSFT', 'GOOGL']:
    df = load_data(ticker)  # load_data: placeholder for your own data loader
    targets, meta = generate_target_combinations(df)
    all_targets.append(targets)
# Use common parameters across stocks
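The final comment leaves open how "common parameters" are derived. One simple option, sketched here as an assumption rather than a Rhoa feature, is to take the median period and threshold across the per-ticker metadata. The `common_params` helper and the metadata values are made up for illustration.

```python
from statistics import median


def common_params(metas, method="method_7"):
    """Combine per-ticker metadata into one (period, threshold) pair
    using the median; one possible aggregation among many."""
    periods = [m[method]["period"] for m in metas.values()]
    thresholds = [m[method]["threshold"] for m in metas.values()]
    # int() truncates if an even number of tickers yields a .5 median
    return {"period": int(median(periods)), "threshold": median(thresholds)}


# Made-up metadata in the shape returned by generate_target_combinations
metas = {
    "AAPL": {"method_7": {"period": 6, "threshold": 4.0}},
    "MSFT": {"method_7": {"period": 5, "threshold": 3.5}},
    "GOOGL": {"method_7": {"period": 8, "threshold": 5.0}},
}

shared = common_params(metas)
print(shared)  # {'period': 6, 'threshold': 4.0}
```

The median is less sensitive than the mean to one ticker with unusual volatility. Whether a shared setting performs acceptably on each asset still needs to be confirmed out of sample.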
Advanced Topics
Multi-Target Ensemble
Train separate models on different target methods and ensemble:
# Train models on different targets
models = {}
for i in [1, 5, 7]: # Conservative, medium, aggressive
    X, y = features, targets[f'Target_{i}']
    model = RandomForestClassifier()
    model.fit(X, y)
    models[i] = model
# Ensemble: require 2 out of 3 to agree
pred_1 = models[1].predict_proba(X_test)[:, 1]
pred_5 = models[5].predict_proba(X_test)[:, 1]
pred_7 = models[7].predict_proba(X_test)[:, 1]
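# Cast to int before summing: adding NumPy boolean arrays acts as logical OR
votes = (pred_1 > 0.5).astype(int) + (pred_5 > 0.5).astype(int) + (pred_7 > 0.5).astype(int)
ensemble_pred = votes >= 2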
Custom Target Generation
Create your own target based on Rhoa’s patterns:
def custom_target(df, period, threshold, cost_pct=0.5):
    """Target that accounts for transaction costs."""
    future_high = df['High'].shift(-period)
    future_low = df['Low'].shift(-period)
    # Best case profit
    profit = (future_high / df['Close'] - 1) - (cost_pct / 100)
    # Worst case loss (stop loss at 2%)
    loss = (future_low / df['Close'] - 1) - (cost_pct / 100)
    # Positive if profit >= threshold AND loss >= -2%
    return (profit >= threshold / 100) & (loss >= -0.02)
# Use custom target
target = custom_target(df, period=5, threshold=3.0)
Performance Tips
For Large Datasets
# Reduce search space
targets, meta = generate_target_combinations(
large_df,
mode='auto',
period_step=2, # Check every 2 periods instead of 1
step=2 # Check every 2% threshold instead of 1%
)
# Runs 4x faster with minimal accuracy loss
For Production
# Cache targets if data doesn't change
import os
import joblib
cache_file = 'targets_cache.pkl'
if os.path.exists(cache_file):
    targets, meta = joblib.load(cache_file)
else:
    targets, meta = generate_target_combinations(df)
    joblib.dump((targets, meta), cache_file)
Further Reading
Indicators Guide - Using indicators as features
Visualization Guide - Evaluating target quality
Target Generation - Hands-on examples
Complete ML Pipeline - Full ML pipeline
rhoa.targets module - API reference
Summary
Key takeaways:
Auto mode for optimal parameters, manual mode for fixed periods
Always split data before generating targets
Save metadata for reproducibility
Choose target method based on use case (Method 7 for training, Method 1 for validation)
Account for transaction costs in threshold selection
Test parameters across multiple assets
Validate on true out-of-sample data
The target generation system is designed to remove guesswork and provide data-driven, reproducible trading signal definitions for machine learning.