Targets Guide

A comprehensive guide to Rhoa’s advanced target generation system for machine learning. Learn how to create optimized binary classification targets for trading strategies.

Overview

Rhoa’s generate_target_combinations function creates 8 different binary classification targets, each representing a different way to define a “profitable trade.” The system automatically finds optimal parameters using either:

Auto Mode: Pareto optimization to find optimal period AND threshold
Manual Mode: Elbow method to find optimal threshold for fixed period

This guide explains both modes in depth, the mathematics behind them, and best practices for production use.

Why Target Generation Matters

The Problem

When building ML models for trading, you need to define what constitutes a “buy signal.” This requires answering:

How far ahead should the price move? (lookback period)
How much should it move? (threshold percentage)
Which price metric to use? (close-to-close, high-to-close, etc.)

Poor choices lead to:

Too many signals: High transaction costs, low precision
Too few signals: Insufficient training data
Unrealistic targets: Model learns patterns that don’t translate to profit

The Solution

Rhoa’s target generation:

Tests all 8 common target definitions
Automatically finds optimal parameters
Balances class distribution
Provides detailed metadata
Ensures reproducibility

Quick Start

Auto Mode (Recommended)

from rhoa.targets import generate_target_combinations
import pandas as pd

# Load your OHLC data
df = pd.read_csv('stock_data.csv')

# Generate optimized targets
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.5  # 50% positive instances
)

# Check what was found
print(f"Method 7 uses period {meta['method_7']['period']}, "
      f"threshold {meta['method_7']['threshold']}%")
# Method 7 uses period 6, threshold 4.0%

# Use in ML pipeline
print(targets.head())
#    Target_1  Target_2  Target_3  ...  Target_8
# 0     False     False     False  ...     False
# 1      True     False      True  ...      True

Manual Mode

# Fixed 5-day lookback, optimize thresholds
targets, meta = generate_target_combinations(
    df,
    mode='manual',
    lookback_periods=5
)

# All methods use period=5
print(meta['method_1'])
# {'period': 5, 'threshold': 6.0, 'instances': 22, 'pct_of_max': 1.4}

The 8 Target Methods

Each method defines “success” differently:

Method	Formula	Definition	Use Case
1	Close[N] / Close[0]	Future close vs. current close	Conservative, actual exit
2	Close[N] / High[0]	Future close vs. current high	Buy at top of range
3	High[N] / Close[0]	Future high vs. current close	Intraday profit potential
4	High[N] / High[0]	Future high vs. current high	Very conservative
5	MaxClose / Close[0]	Best close in period vs. current	Best exit timing
6	MaxClose / High[0]	Best close vs. current high	Optimal buy at top
7	MaxHigh / Close[0]	Best high in period vs. current	Maximum profit potential
8	MaxHigh / High[0]	Best high vs. current high	Ultra-conservative max

Method Details

Method 1: Close[N] / Close[0] - 1 >= threshold

Conservative, because it uses an actual exit price. Represents buying at today's close and selling at the close N days later.

Method 7: MaxHigh / Close[0] - 1 >= threshold

Most generous. Represents the maximum profit achievable in the next N days, assuming perfect intraday timing.

Recommendation: Start with Method 7 for training data abundance, then validate with Method 1 for conservative estimates.
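
The eight definitions in the table can be written out directly with pandas. The sketch below is illustrative and is not Rhoa's own implementation: the helper name method_ratios is hypothetical, and it assumes the "Max" variants look at the next N bars, excluding the current one.

```python
import pandas as pd


def method_ratios(df: pd.DataFrame, period: int) -> pd.DataFrame:
    """Return numerator / denominator - 1 for each of the eight methods.

    Hypothetical helper, not part of Rhoa. Rows whose future window runs
    past the end of the data come out as NaN.
    """
    close, high = df["Close"], df["High"]

    def forward_max(s: pd.Series) -> pd.Series:
        # Trailing rolling max, shifted back so row t sees bars t+1 .. t+period
        return s.rolling(period).max().shift(-period)

    numerators = [
        close.shift(-period),  # Close[N]   -> methods 1, 2
        high.shift(-period),   # High[N]    -> methods 3, 4
        forward_max(close),    # MaxClose   -> methods 5, 6
        forward_max(high),     # MaxHigh    -> methods 7, 8
    ]
    out = {}
    method = 1
    for numerator in numerators:
        for denominator in (close, high):  # Close[0], then High[0]
            out[f"Ratio_{method}"] = numerator / denominator - 1
            method += 1
    return pd.DataFrame(out)


# Tiny made-up price series
prices = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0],
                       "High": [11.0, 12.0, 13.0, 14.0]})
ratios = method_ratios(prices, period=2)
```

A target is then just a comparison such as ratios["Ratio_7"] >= threshold / 100.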

Auto Mode Deep Dive

How Pareto Optimization Works

Auto mode searches across both period (time) and threshold (percentage) dimensions to find optimal combinations.

Objective: Find parameters that:

Maximize threshold (higher quality signals)
Minimize period (faster trades)
Achieve target class balance (e.g., 50% positive instances)

Mathematical Formulation:

For each method, we solve:

\[\begin{split}\text{maximize} \quad & threshold \\ \text{minimize} \quad & period \\ \text{minimize} \quad & |instances - target\_instances| \\ \text{subject to} \quad & period \in [min\_period, max\_period] \\ & threshold \in [min\_pct, max\_pct]\end{split}\]

This is a multi-objective optimization problem solved using Pareto frontier analysis.

The Algorithm

Step 1: Find Maximum Instances

For each method and period, calculate maximum possible instances (threshold=0%):

for period in range(1, 21):
    future_close = df['Close'].shift(-period)
    max_instances = (future_close / df['Close'] - 1 >= 0).sum()

Step 2: Calculate Target Instances

target_instances = max_instances * target_class_balance
# e.g., max_instances=500, balance=0.5 → target=250

Step 3: Search Parameter Space

Test all combinations of period and threshold:

results = []
for period in range(1, 21):
    for threshold_pct in range(0, 100):
        instances = count_instances(period, threshold_pct)
        deviation = abs(instances - target_instances)
        results.append({
            'period': period,
            'threshold': threshold_pct,
            'instances': instances,
            'deviation': deviation
        })

Step 4: Pareto Optimization

Find Pareto-optimal solutions (non-dominated solutions):

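import pandas as pd
from paretoset import paretoset

# Step 3 built a list of dicts; convert it so columns can be selected by name
results = pd.DataFrame(results)

# Pareto optimization: max threshold, min period, min deviation
data = results[['threshold', 'period', 'deviation']].values
mask = paretoset(data, sense=["max", "min", "min"])
pareto_solutions = results[mask]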

Step 5: Select Best Solution

From Pareto frontier, choose solution closest to target balance:

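# Work on a copy so the new column is not written to a slice of `results`
pareto_solutions = pareto_solutions.copy()
pareto_solutions['distance'] = abs(
    pareto_solutions['instances'] / max_instances - target_class_balance
)
best = pareto_solutions.loc[pareto_solutions['distance'].idxmin()]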

Pareto Frontier Example

Imagine these solutions for Method 7:

Period  Threshold  Instances  Deviation
------  ---------  ---------  ---------
     …        8.0        180         70      ← Pareto optimal
     …        6.0        210         40      ← Pareto optimal
     6        4.0        249          1      ← BEST (closest to 250)
     …        3.0        255          5      ← Pareto optimal
     …        2.0        240         10

The solution at period 6, threshold 4.0% is selected because:

It’s on the Pareto frontier (not dominated)
Instances (249) closest to target (250)
Good trade-off: moderate period, reasonable threshold
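
To make "not dominated" concrete, here is a minimal pure-Python sketch of the dominance test, independent of the paretoset package. The Candidate type, the dominates and pareto_front helpers, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    period: int       # lower is better
    threshold: float  # higher is better
    deviation: int    # |instances - target_instances|, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (a.period <= b.period
                and a.threshold >= b.threshold
                and a.deviation <= b.deviation)
    strictly_better = (a.period < b.period
                       or a.threshold > b.threshold
                       or a.deviation < b.deviation)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up candidates for illustration only
candidates = [
    Candidate(period=10, threshold=8.0, deviation=70),
    Candidate(period=6, threshold=4.0, deviation=1),
    Candidate(period=7, threshold=4.0, deviation=10),  # dominated by the row above
    Candidate(period=3, threshold=2.0, deviation=30),
]
front = pareto_front(candidates)

# Final pick: the frontier point closest to the target instance count
best = min(front, key=lambda c: c.deviation)
```

The quadratic scan is fine for a grid of a few thousand combinations; a dedicated library is the better choice for larger searches.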

Parameters Explained

target_class_balance (float, default=0.5)

Target percentage of positive instances.

0.3 (30%): Conservative, higher quality signals
0.5 (50%): Balanced, plenty of training data
0.7 (70%): Aggressive, many signals

Example:

# Conservative: only 30% positive
targets, meta = generate_target_combinations(
    df, mode='auto', target_class_balance=0.3
)

min_period / max_period (int, defaults: 1/20)

Period range to search.

Shorter periods (1-5): Day trading
Medium periods (5-15): Swing trading
Longer periods (15-30): Position trading

Example:

# Search only swing trading range
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=5,
    max_period=15
)

min_pct / max_pct / step (int, defaults: 0/100/1)

Threshold range and granularity.

Example:

# Search 2-10% in 0.5% increments
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_pct=2,
    max_pct=10,
    step=0.5  # Finer granularity
)

Manual Mode Deep Dive

How the Elbow Method Works

Manual mode fixes the lookback period and finds the optimal threshold using the “elbow method.”

Concept: As threshold increases, instances decrease. The “elbow” is where diminishing returns begin - the optimal balance between quality and quantity.

Mathematical Basis:

Plot instances vs. threshold:

Threshold (%)    Instances
-------------    ---------
            0          500     ← Max instances
            …          450
            …          380
            …          300
            …          240     ← Elbow point
            …          220
            …          205
          ...          ...
            …            5     ← Very few

The Elbow Algorithm

Step 1: Calculate Instances Across Thresholds

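from kneed import KneeLocator
import numpy as np

# Example prices for Method 1 (Close[N] / Close[0]); other methods swap in
# the relevant High or rolling-max series
period = 5
current_price = df['Close']
future_price = df['Close'].shift(-period)

thresholds = np.arange(0, 100, 1)
instances = []

for threshold in thresholds:
    # Count rows whose forward return meets the threshold (given in percent)
    count = (future_price / current_price - 1 >= threshold/100).sum()
    instances.append(count)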

Step 2: Find Knee Point

kn = KneeLocator(
    thresholds,
    instances,
    curve='convex',      # Curve shape
    direction='decreasing'  # Instances decrease as threshold increases
)

optimal_threshold = kn.elbow

Step 3: Generate Targets

Use the detected elbow threshold:

target = (future_price / current_price - 1 >= optimal_threshold / 100)

Visual Example

   Instances
    |
500 |*
    |
400 | *
    |
300 |  *
    |   ╲
200 |    *  ← Elbow at threshold ≈ 4%
    |     ╲
100 |      ╲___
    |          ╲____
  0 |_______________╲____
    0   2   4   6   8   10  Threshold (%)

The elbow at 4% represents the optimal threshold where:

Still have substantial instances (200)
Threshold is meaningful (4% return)
Diminishing returns begin beyond this point
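
For intuition about what an elbow detector does, one simple heuristic is to pick the point farthest from the straight line joining the two ends of the curve. The sketch below is a stand-in for illustration, not the algorithm kneed uses internally; the find_elbow helper and the instance counts are made up.

```python
def find_elbow(xs, ys):
    """Return the x whose point lies farthest from the chord joining the endpoints.

    Both axes are scaled to [0, 1] first so the result does not depend on units.
    Assumes at least two distinct x values and a non-flat curve.
    """
    x_first, x_last = xs[0], xs[-1]
    y_low, y_high = min(ys), max(ys)
    nx = [(x - x_first) / (x_last - x_first) for x in xs]
    ny = [(y - y_low) / (y_high - y_low) for y in ys]

    ax, ay, bx, by = nx[0], ny[0], nx[-1], ny[-1]

    def chord_distance(px, py):
        # Numerator of the point-to-line distance; the denominator is constant
        return abs((by - ay) * (px - ax) - (bx - ax) * (py - ay))

    best = max(range(len(xs)), key=lambda i: chord_distance(nx[i], ny[i]))
    return xs[best]


# Made-up curve: steep drop up to 4%, gentle decline afterwards
thresholds = list(range(0, 11))
instances = [500, 420, 340, 260, 180, 160, 140, 120, 100, 80, 60]

elbow = find_elbow(thresholds, instances)
print(f"Elbow at threshold {elbow}%")
```

On real data the curve is noisier, which is one reason to prefer a dedicated library over this heuristic.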

Parameters Explained

lookback_periods (int, default=5)

Fixed number of periods to look forward.

Example:

# 10-day lookback
targets, meta = generate_target_combinations(
    df,
    mode='manual',
    lookback_periods=10
)

# All methods use period=10
for i in range(1, 9):
    assert meta[f'method_{i}']['period'] == 10

Metadata Structure

Understanding the Output

The metadata dictionary contains rich information:

targets, meta = generate_target_combinations(df, mode='auto')

# Metadata structure
meta = {
    'mode': 'auto',  # or 'manual'
    'method_1': {
        'period': 5,          # Lookback period
        'threshold': 3.5,     # Threshold percentage
        'instances': 247,     # Number of positive instances
        'pct_of_max': 45.2    # Percentage of maximum possible instances
    },
    # ... method_2 through method_8 ...
}

Field Meanings:

period: How many days/periods to look forward
threshold: Minimum return % required for positive label
instances: How many data points are positive
pct_of_max: What % of theoretical maximum this represents
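
The fields are linked by simple arithmetic. The sketch below assumes pct_of_max is the positive count divided by the count obtained at a 0% threshold, which is how Step 1 of the auto mode algorithm defines the maximum. The helper name is hypothetical, and the rounding Rhoa applies is an assumption.

```python
def pct_of_max(instances: int, max_instances: int) -> float:
    """Share of the maximum possible positives, as a percentage with one decimal."""
    return round(100 * instances / max_instances, 1)


# Made-up counts: 250 positives out of 500 rows that clear a 0% threshold
print(pct_of_max(250, 500))  # 50.0
```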

Using Metadata

Compare Methods:

import pandas as pd

# Create comparison DataFrame
comparison = pd.DataFrame([
    meta[f'method_{i}'] for i in range(1, 9)
])
comparison.index = [f'Method_{i}' for i in range(1, 9)]

print(comparison)
#           period  threshold  instances  pct_of_max
# Method_1       5        3.5        247        45.2
# Method_2       4        5.0        198        38.1
# ...

Save for Reproducibility:

import json

# Save metadata
with open('target_metadata.json', 'w') as f:
    json.dump(meta, f, indent=2)

# Load later
with open('target_metadata.json', 'r') as f:
    loaded_meta = json.load(f)

Apply to New Data:

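# Apply same parameters to test set
def apply_target_params(df, method_meta):
    """Rebuild the Method 7 target (MaxHigh / Close[0]) from saved parameters."""
    period = method_meta['period']
    threshold = method_meta['threshold'] / 100

    # Highest High over the next `period` bars (assumes the current bar is excluded)
    future_max_high = df['High'].rolling(period).max().shift(-period)
    return (future_max_high / df['Close'] - 1 >= threshold)

# Apply Method 7 parameters to test data
test_target = apply_target_params(test_df, meta['method_7'])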

Choosing Between Modes

Use Auto Mode When:

You want optimal parameters for your specific data
Class balance is critical (e.g., balanced dataset for training)
You’re exploring different timeframes
You need reproducible, data-driven decisions
You have sufficient data (500+ rows)

Example Use Cases:

Initial model development
Production systems with regular retraining
Research and strategy development

Use Manual Mode When:

You have a specific trading timeframe in mind
You want to compare performance across fixed horizons
You’re validating a hypothesis
You have domain knowledge about appropriate periods
You want simpler, more interpretable parameters

Example Use Cases:

Backtesting specific strategies (e.g., “5-day swing trades”)
Regulatory or operational constraints on holding periods
Comparing different stocks on equal footing

Comparison Example

# Auto mode: Let optimizer decide everything
auto_targets, auto_meta = generate_target_combinations(
    df, mode='auto', target_class_balance=0.5
)
# Result: period=7, threshold=3.8%, instances=512 (50.1% of max)

# Manual mode: You control the period
manual_targets, manual_meta = generate_target_combinations(
    df, mode='manual', lookback_periods=7
)
# Result: period=7, threshold=5.2%, instances=384 (37.5% of max)

# Auto mode found lower threshold to hit target balance

Complete Workflow Example

End-to-End ML Pipeline

import pandas as pd
import numpy as np
from rhoa.targets import generate_target_combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import json

# 1. Load and prepare data
df = pd.read_csv('stock_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

# 2. Time-based split (IMPORTANT: split before target generation)
split_idx = int(len(df) * 0.8)
train_df = df[:split_idx].copy()
test_df = df[split_idx:].copy()

# 3. Generate targets on TRAINING data only
targets_train, meta = generate_target_combinations(
    train_df,
    mode='auto',
    target_class_balance=0.4  # 40% positive
)

print(f"Using Method 7: {meta['method_7']}")
# {'period': 6, 'threshold': 4.2, 'instances': 201, 'pct_of_max': 39.8}

# 4. Save metadata for reproducibility
with open('target_config.json', 'w') as f:
    json.dump(meta, f, indent=2)

# 5. Create features
train_df['SMA_20'] = train_df['Close'].rolling(20).mean()
train_df['SMA_50'] = train_df['Close'].rolling(50).mean()
train_df['Returns'] = train_df['Close'].pct_change()
train_df['Volatility'] = train_df['Returns'].rolling(20).std()

# 6. Combine features and target
train_df['Target'] = targets_train['Target_7']
train_clean = train_df.dropna()

# 7. Train model
feature_cols = ['SMA_20', 'SMA_50', 'Returns', 'Volatility']
X_train = train_clean[feature_cols]
y_train = train_clean['Target']

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 8. Apply SAME parameters to test data
test_period = meta['method_7']['period']
test_threshold = meta['method_7']['threshold'] / 100

# Method 7: highest High over the next `test_period` bars vs. the current Close
# (assumes the window excludes the current bar). A trailing rolling max shifted
# back by `test_period` covers bars t+1..t+test_period; the last `test_period`
# rows have no complete future window and therefore compare as False.
future_max_high = test_df['High'].rolling(window=test_period).max().shift(-test_period)

test_df['Target'] = (future_max_high / test_df['Close'] - 1 >= test_threshold)

# 9. Create test features
test_df['SMA_20'] = test_df['Close'].rolling(20).mean()
test_df['SMA_50'] = test_df['Close'].rolling(50).mean()
test_df['Returns'] = test_df['Close'].pct_change()
test_df['Volatility'] = test_df['Returns'].rolling(20).std()

test_clean = test_df.dropna()

# 10. Evaluate
X_test = test_clean[feature_cols]
y_test = test_clean['Target']
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
#               precision    recall  f1-score
#          0       0.88      0.91      0.90
#          1       0.76      0.69      0.72

# 11. Visualize results
test_clean.rhoa.plots.signal(
    y_pred=y_pred,
    y_true=y_test,
    date_col='Date',
    price_col='Close',
    threshold=meta['method_7']['threshold'],
    title=f"Method 7 Predictions (Period={test_period}, Threshold={test_threshold*100:.1f}%)"
)

Best Practices

Data Preparation

Always Split Before Target Generation:

# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta parameters to test

# WRONG - Look-ahead bias!
targets, meta = generate_target_combinations(df)
train, test = split(df)
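
The split helper in the snippets above is left undefined. A minimal chronological version might look like the following; the signature, including the train_frac and date_col parameters, is an assumption made for illustration.

```python
import pandas as pd


def split(df: pd.DataFrame, train_frac: float = 0.8, date_col: str = "Date"):
    """Chronological split: earliest rows for training, latest rows for testing."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    # Copies keep later column assignments from touching the original frame
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()


# Made-up daily data
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "Close": range(10),
})
train, test = split(df)
```

A random shuffle split would leak future prices into training, so the cut is always made on time order.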

Handle NaN Values:

# Targets create NaN for last N rows (future unknown)
print(targets.tail())
# Last 'period' rows will be NaN

# Always drop NaN before training
combined = pd.concat([features, targets], axis=1)
clean_data = combined.dropna()

Ensure Data Quality:

# Check for missing values
assert df[['Open', 'High', 'Low', 'Close']].isnull().sum().sum() == 0

# Check OHLC relationships
assert (df['High'] >= df['Close']).all()
assert (df['Close'] >= df['Low']).all()
assert (df['High'] >= df['Low']).all()

Parameter Selection

Start Conservative:

# Start with lower balance for higher quality
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3  # Only best 30%
)

Consider Your Trading Style:

# Day trading: short periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=1,
    max_period=5
)

# Swing trading: medium periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=5,
    max_period=15
)

# Position trading: long periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=15,
    max_period=30
)

Match Threshold to Costs:

# If trading costs are 0.5% round-trip
# Minimum profitable threshold should be > 0.5%
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_pct=1,  # Start at 1% minimum
    max_pct=20
)

Method Selection

For Training (generous):

# Use Method 7 or 8 for more positive examples
y_train = targets_train['Target_7']  # MaxHigh/Close
# More data for model to learn from

For Validation (conservative):

# Use Method 1 for realistic validation
y_val = targets_val['Target_1']  # Close/Close
# More realistic profit expectations

Compare Multiple Methods:

# Train on each method, compare results
results = {}
for i in range(1, 9):
    X_train, y_train = features, targets[f'Target_{i}']
    model.fit(X_train, y_train)
    results[f'Method_{i}'] = evaluate(model, X_test, y_test)

# Choose method with best out-of-sample performance

Reproducibility

Always Save Metadata:

# Save with timestamp
import datetime

meta_with_timestamp = {
    'generated_at': datetime.datetime.now().isoformat(),
    'data_shape': df.shape,
    'date_range': f"{df['Date'].min()} to {df['Date'].max()}",
    **meta
}

with open('target_metadata.json', 'w') as f:
    json.dump(meta_with_timestamp, f, indent=2)

Version Your Data and Targets:

# Save targets alongside data
df_with_targets = pd.concat([df, targets], axis=1)
df_with_targets.to_csv(f'data_with_targets_{datetime.date.today()}.csv')

Common Pitfalls

Pitfall 1: Look-Ahead Bias

Problem: Generating targets on full dataset before splitting.

# WRONG
targets, meta = generate_target_combinations(df)  # Uses all data
train, test = split(df)  # Then split

# Why wrong: Optimization saw test data

Solution: Generate on train only.

# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta params to test

Pitfall 2: Ignoring Class Imbalance

Problem: Using default 50% balance with limited data.

# With 100 samples and 50% balance
targets, meta = generate_target_combinations(
    small_df,  # Only 100 rows
    target_class_balance=0.5
)
# Result: At most about 50 positive examples (too few!)

Solution: Adjust balance or get more data.

# Better
targets, meta = generate_target_combinations(
    small_df,
    target_class_balance=0.7  # Up to about 70 positive examples
)

Pitfall 3: Unrealistic Thresholds

Problem: Thresholds below transaction costs.

# Transaction cost = 0.5% round-trip
# But threshold = 0.3%
# Every "profitable" trade actually loses money!

Solution: Set minimum threshold above costs.

targets, meta = generate_target_combinations(
    df,
    min_pct=1,  # 1% minimum (above 0.5% costs)
)
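
The arithmetic behind this pitfall is worth spelling out. The sketch below uses a deliberately simple additive cost model (gross move minus round-trip cost, ignoring compounding and slippage); the function name is hypothetical.

```python
def net_return_pct(gross_return_pct: float, round_trip_cost_pct: float) -> float:
    """Net percentage return after round-trip costs (additive approximation)."""
    return gross_return_pct - round_trip_cost_pct


cost = 0.5  # round-trip cost in percent
for threshold in (0.3, 1.0):
    net = net_return_pct(threshold, cost)
    verdict = "profitable" if net > 0 else "loses money"
    print(f"threshold {threshold}% -> net {net:+.1f}% ({verdict})")
```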

Pitfall 4: Over-Optimizing on Single Stock

Problem: Perfect parameters for AAPL may not work for others.

Solution: Test on multiple assets.

# Find parameters on basket of stocks
all_targets = []
for ticker in ['AAPL', 'MSFT', 'GOOGL']:
    df = load_data(ticker)
    targets, meta = generate_target_combinations(df)
    all_targets.append(targets)

# Use common parameters across stocks
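
How to derive the common parameters is left open. One simple option, offered here as an assumption and not as a Rhoa feature, is to take the median period and threshold for a given method across the per-ticker metadata. The common_params helper and the sample metadata below are hypothetical.

```python
from statistics import median


def common_params(metas: dict, method: str = "method_7") -> dict:
    """Median period and threshold for one method across several tickers."""
    periods = [m[method]["period"] for m in metas.values()]
    thresholds = [m[method]["threshold"] for m in metas.values()]
    return {
        "period": int(round(median(periods))),   # periods must stay whole numbers
        "threshold": float(median(thresholds)),
    }


# Made-up per-ticker metadata in the shape returned by generate_target_combinations
metas = {
    "AAPL": {"method_7": {"period": 6, "threshold": 4.0}},
    "MSFT": {"method_7": {"period": 5, "threshold": 3.5}},
    "GOOGL": {"method_7": {"period": 8, "threshold": 5.0}},
}
shared = common_params(metas)
```

The median is less sensitive than the mean to a single ticker with unusual volatility.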

Advanced Topics

Multi-Target Ensemble

Train separate models on different target methods and ensemble:

# Train models on different targets
models = {}
for i in [1, 5, 7]:  # Conservative, medium, aggressive
    X, y = features, targets[f'Target_{i}']
    model = RandomForestClassifier()
    model.fit(X, y)
    models[i] = model

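# Ensemble: require 2 out of 3 to agree
pred_1 = models[1].predict_proba(X_test)[:, 1]
pred_5 = models[5].predict_proba(X_test)[:, 1]
pred_7 = models[7].predict_proba(X_test)[:, 1]

# Cast the boolean votes to int before summing: adding NumPy boolean arrays
# performs a logical OR, so the sum would never reach 2
votes = (
    (pred_1 > 0.5).astype(int)
    + (pred_5 > 0.5).astype(int)
    + (pred_7 > 0.5).astype(int)
)
ensemble_pred = votes >= 2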

Custom Target Generation

Create your own target based on Rhoa’s patterns:

def custom_target(df, period, threshold, cost_pct=0.5):
    """Target that accounts for transaction costs."""
    future_high = df['High'].shift(-period)
    future_low = df['Low'].shift(-period)

    # Best case profit
    profit = (future_high / df['Close'] - 1) - (cost_pct / 100)

    # Worst case loss (stop loss at 2%)
    loss = (future_low / df['Close'] - 1) - (cost_pct / 100)

    # Positive if profit > threshold AND loss > -2%
    return (profit >= threshold / 100) & (loss >= -0.02)

# Use custom target
target = custom_target(df, period=5, threshold=3.0)

Performance Tips

For Large Datasets

# Reduce search space
targets, meta = generate_target_combinations(
    large_df,
    mode='auto',
    period_step=2,  # Check every 2 periods instead of 1
    step=2          # Check every 2% threshold instead of 1%
)
# Roughly 4x fewer parameter combinations to evaluate, at the cost of a coarser search grid
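
The factor of four follows from counting grid points. The sketch below assumes the period range is inclusive and the threshold range excludes its upper bound, mirroring the range(1, 21) and range(0, 100) loops shown in the auto mode algorithm; grid_size is a hypothetical helper.

```python
def grid_size(min_period: int, max_period: int, period_step: int,
              min_pct: int, max_pct: int, step: int) -> int:
    """Number of (period, threshold) combinations in the search grid."""
    periods = range(min_period, max_period + 1, period_step)
    thresholds = range(min_pct, max_pct, step)
    return len(periods) * len(thresholds)


full = grid_size(1, 20, 1, 0, 100, 1)    # 20 periods x 100 thresholds
coarse = grid_size(1, 20, 2, 0, 100, 2)  # 10 periods x 50 thresholds
print(full, coarse, full / coarse)       # 2000 500 4.0
```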

For Production

# Cache targets if data doesn't change
import os

import joblib

cache_file = 'targets_cache.pkl'

if os.path.exists(cache_file):
    targets, meta = joblib.load(cache_file)
else:
    targets, meta = generate_target_combinations(df)
    joblib.dump((targets, meta), cache_file)

Summary

Key takeaways:

Auto mode for optimal parameters, manual mode for fixed periods
Always split data before generating targets
Save metadata for reproducibility
Choose target method based on use case (Method 7 for training, Method 1 for validation)
Account for transaction costs in threshold selection
Test parameters across multiple assets
Validate on true out-of-sample data

The target generation system is designed to remove guesswork and provide data-driven, reproducible trading signal definitions for machine learning.
