Targets Guide

A comprehensive guide to Rhoa’s advanced target generation system for machine learning. Learn how to create optimized binary classification targets for trading strategies.

Overview

Rhoa’s generate_target_combinations function creates 8 binary classification targets, each representing a different way to define a “profitable trade.” The system selects parameters automatically in one of two modes:

  1. Auto Mode: Pareto optimization to find optimal period AND threshold

  2. Manual Mode: Elbow method to find optimal threshold for fixed period

This guide explains both modes in depth, the mathematics behind them, and best practices for production use.

Why Target Generation Matters

The Problem

When building ML models for trading, you need to define what constitutes a “buy signal.” This requires answering:

  1. How far ahead should the price move? (lookback period)

  2. How much should it move? (threshold percentage)

  3. Which price metric to use? (close-to-close, high-to-close, etc.)

Poor choices lead to:

  • Too many signals: High transaction costs, low precision

  • Too few signals: Insufficient training data

  • Unrealistic targets: Model learns patterns that don’t translate to profit

The Solution

Rhoa’s target generation:

  • Tests all 8 common target definitions

  • Automatically finds optimal parameters

  • Balances class distribution

  • Provides detailed metadata

  • Ensures reproducibility

Quick Start

Manual Mode

# Fixed 5-day lookback, optimize thresholds
targets, meta = generate_target_combinations(
    df,
    mode='manual',
    lookback_periods=5
)

# All methods use period=5
print(meta['method_1'])
# {'period': 5, 'threshold': 6.0, 'instances': 22, 'pct_of_max': 1.4}

The 8 Target Methods

Each method defines “success” differently:

Method  Formula              Definition                        Use Case
------  -------------------  --------------------------------  -------------------------
   1    Close[N] / Close[0]  Future close vs. current close    Conservative, actual exit
   2    Close[N] / High[0]   Future close vs. current high     Buy at top of range
   3    High[N] / Close[0]   Future high vs. current close     Intraday profit potential
   4    High[N] / High[0]    Future high vs. current high      Very conservative
   5    MaxClose / Close[0]  Best close in period vs. current  Best exit timing
   6    MaxClose / High[0]   Best close vs. current high       Optimal buy at top
   7    MaxHigh / Close[0]   Best high in period vs. current   Maximum profit potential
   8    MaxHigh / High[0]    Best high vs. current high        Ultra-conservative max

Method Details

Method 1: Close[N] / Close[0] - 1 >= threshold

A conservative, realistic baseline. Represents buying at today’s close and selling at the close N days later.

Method 7: MaxHigh / Close[0] - 1 >= threshold

Most generous. Represents the maximum profit achievable in the next N days, assuming perfect intraday timing.

Recommendation: Start with Method 7 for training data abundance, then validate with Method 1 for conservative estimates.
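The eight definitions can be sketched in pandas as follows. This is an illustration, not Rhoa’s implementation: the helper names forward_max and method_ratios are hypothetical, and the sketch assumes that MaxClose and MaxHigh cover the next N bars (t+1 through t+N), excluding the current bar.

```python
import pandas as pd


def forward_max(series: pd.Series, period: int) -> pd.Series:
    """Highest value over the next `period` rows, excluding the current row."""
    # Rolling over the reversed series turns a trailing window into a forward one.
    return series.shift(-1)[::-1].rolling(window=period).max()[::-1]


def method_ratios(df: pd.DataFrame, period: int) -> pd.DataFrame:
    """Price ratio behind each of the eight methods (NaN where the future is unknown)."""
    close, high = df["Close"], df["High"]
    future_close = close.shift(-period)
    future_high = high.shift(-period)
    max_close = forward_max(close, period)
    max_high = forward_max(high, period)
    return pd.DataFrame({
        "method_1": future_close / close,
        "method_2": future_close / high,
        "method_3": future_high / close,
        "method_4": future_high / high,
        "method_5": max_close / close,
        "method_6": max_close / high,
        "method_7": max_high / close,
        "method_8": max_high / high,
    })


prices = pd.DataFrame({
    "Close": [10.0, 11.0, 12.0, 11.0, 13.0],
    "High": [11.0, 12.0, 13.0, 12.0, 14.0],
})
ratios = method_ratios(prices, period=2)

# A target is then (ratio - 1 >= threshold); rows without a known future stay NaN.
target_7 = (ratios["method_7"] - 1 >= 0.25).where(ratios["method_7"].notna())
```

On this toy data Method 7 is never below Method 1 for the same row, which is why it yields the most positive labels at any given threshold.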

Auto Mode Deep Dive

How Pareto Optimization Works

Auto mode searches across both period (time) and threshold (percentage) dimensions to find optimal combinations.

Objective: Find parameters that:

  1. Maximize threshold (higher quality signals)

  2. Minimize period (faster trades)

  3. Achieve target class balance (e.g., 50% positive instances)

Mathematical Formulation:

For each method, we solve:

\[\begin{split}\text{maximize} \quad & threshold \\ \text{minimize} \quad & period \\ \text{minimize} \quad & |instances - target\_instances| \\ \text{subject to} \quad & period \in [min\_period, max\_period] \\ & threshold \in [min\_pct, max\_pct]\end{split}\]

This is a multi-objective optimization problem solved using Pareto frontier analysis.

The Algorithm

Step 1: Find Maximum Instances

For each method and period, calculate maximum possible instances (threshold=0%):

for period in range(1, 21):
    future_close = df['Close'].shift(-period)
    max_instances = (future_close / df['Close'] - 1 >= 0).sum()

Step 2: Calculate Target Instances

target_instances = max_instances * target_class_balance
# e.g., max_instances=500, balance=0.5 → target=250

Step 3: Search Parameter Space

Test all combinations of period and threshold:

results = []
for period in range(1, 21):
    for threshold_pct in range(0, 100):
        instances = count_instances(period, threshold_pct)
        deviation = abs(instances - target_instances)
        results.append({
            'period': period,
            'threshold': threshold_pct,
            'instances': instances,
            'deviation': deviation
        })
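Steps 1 to 3 can be run end to end on synthetic data. The sketch below assumes Method 1 (close-to-close) and supplies a concrete count_instances, which the snippet above leaves undefined; the random price series and the per-period target are illustrative choices, not Rhoa’s internals.

```python
import numpy as np
import pandas as pd

# Synthetic price path: 500 bars of a geometric random walk (illustrative only).
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 500))))


def count_instances(close: pd.Series, period: int, threshold_pct: float) -> int:
    """Rows whose close-to-close return over `period` bars meets the threshold."""
    forward_return = close.shift(-period) / close - 1
    return int((forward_return >= threshold_pct / 100).sum())


target_class_balance = 0.5
rows = []
for period in range(1, 21):
    max_instances = count_instances(close, period, 0)         # Step 1
    target_instances = max_instances * target_class_balance   # Step 2
    for threshold_pct in range(0, 100):                       # Step 3
        instances = count_instances(close, period, threshold_pct)
        rows.append({
            "period": period,
            "threshold": threshold_pct,
            "instances": instances,
            "deviation": abs(instances - target_instances),
        })

results = pd.DataFrame(rows)
print(results.sort_values("deviation").head())
```

Collecting the rows into a DataFrame at the end keeps the result ready for column-based selection in the Pareto step.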

Step 4: Pareto Optimization

Find Pareto-optimal solutions (non-dominated solutions):

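import pandas as pd
from paretoset import paretoset

# results is a list of dicts, so convert it before selecting columns
results_df = pd.DataFrame(results)

# Pareto optimization: max threshold, min period, min deviation
data = results_df[['threshold', 'period', 'deviation']].values
mask = paretoset(data, sense=["max", "min", "min"])
pareto_solutions = results_df[mask].copy()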

Step 5: Select Best Solution

From Pareto frontier, choose solution closest to target balance:

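# assign() returns a new frame, avoiding a write into a slice of the results
pareto_solutions = pareto_solutions.assign(
    distance=(pareto_solutions['instances'] / max_instances - target_class_balance).abs()
)
best = pareto_solutions.loc[pareto_solutions['distance'].idxmin()]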

Pareto Frontier Example

Imagine these solutions for Method 7:

Period  Threshold  Instances  Deviation
------  ---------  ---------  ---------
   3       8.0        180        70      ← Pareto optimal
   5       6.0        210        40      ← Pareto optimal
   6       4.0        249         1      ← BEST (closest to 250)
  10       3.0        255         5      ← Dominated by period 6
  15       2.0        240        10      ← Dominated by period 6

The solution at period 6, threshold 4.0% is selected because:

  • It’s on the Pareto frontier (not dominated)

  • Instances (249) closest to target (250)

  • Good trade-off: moderate period, reasonable threshold
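A quick dominance check over the five example rows makes the frontier explicit. This is a plain-Python sketch of the non-dominated test (the function name dominates is illustrative); it does not use the paretoset package, but it applies the same three objectives.

```python
# (period, threshold %, deviation) for the five example rows
candidates = [
    (3, 8.0, 70),
    (5, 6.0, 40),
    (6, 4.0, 1),
    (10, 3.0, 5),
    (15, 2.0, 10),
]


def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one.

    Better means: shorter period, higher threshold, smaller deviation.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


frontier = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
best = min(frontier, key=lambda c: c[2])  # smallest deviation from the target

print(frontier)  # [(3, 8.0, 70), (5, 6.0, 40), (6, 4.0, 1)]
print(best)      # (6, 4.0, 1)
```

The rows at periods 10 and 15 drop out because the period-6 row has a shorter period, a higher threshold and a smaller deviation than either of them.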

Parameters Explained

target_class_balance (float, default=0.5)

Target fraction of positive instances, measured against the maximum possible count (the number of positives at threshold = 0%).

  • 0.3 (30%): Conservative, higher quality signals

  • 0.5 (50%): Balanced, plenty of training data

  • 0.7 (70%): Aggressive, many signals

Example:

# Conservative: only 30% positive
targets, meta = generate_target_combinations(
    df, mode='auto', target_class_balance=0.3
)

min_period / max_period (int, defaults: 1/20)

Period range to search.

  • Shorter periods (1-5): Day trading

  • Medium periods (5-15): Swing trading

  • Longer periods (15-30): Position trading

Example:

# Search only swing trading range
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=5,
    max_period=15
)

min_pct / max_pct / step (numeric, defaults: 0/100/1)

Threshold range and granularity.

Example:

# Search 2-10% in 0.5% increments
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_pct=2,
    max_pct=10,
    step=0.5  # Finer granularity
)

Manual Mode Deep Dive

How the Elbow Method Works

Manual mode fixes the lookback period and finds the optimal threshold using the “elbow method.”

Concept: As the threshold increases, the number of instances decreases. The “elbow” is the point where diminishing returns begin, marking the best balance between signal quality and quantity.

Mathematical Basis:

Plot instances vs. threshold:

Threshold (%)    Instances
-------------    ---------
      0             500     ← Max instances
      1             450
      2             380
      3             300
      4             240     ← Elbow point
      5             220
      6             205
     ...            ...
     20               5     ← Very few

The Elbow Algorithm

Step 1: Calculate Instances Across Thresholds

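from kneed import KneeLocator
import numpy as np

# future_price and current_price are the price series for the chosen method
# and period, e.g. df['Close'].shift(-period) and df['Close'] for Method 1
thresholds = np.arange(0, 100, 1)
instances = []

for threshold in thresholds:
    count = (future_price / current_price - 1 >= threshold/100).sum()
    instances.append(count)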

Step 2: Find Knee Point

kn = KneeLocator(
    thresholds,
    instances,
    curve='convex',      # Curve shape
    direction='decreasing'  # Instances decrease as threshold increases
)

optimal_threshold = kn.elbow  # None if no elbow is detected

Step 3: Generate Targets

Use the detected elbow threshold:

target = (future_price / current_price - 1 >= optimal_threshold / 100)
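To see the idea without installing kneed, the elbow can be approximated as the point farthest from the straight line joining the two ends of the curve. This is a simple stand-in heuristic, not kneed’s algorithm, and elbow_by_chord_distance is a hypothetical helper; it assumes the curve is decreasing, as in the instances vs. threshold table above.

```python
import numpy as np


def elbow_by_chord_distance(x, y):
    """Return the x value whose point lies farthest from the chord between the endpoints."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Normalise both axes to [0, 1] so the distance does not depend on units.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # For a decreasing curve the chord runs from (0, 1) to (1, 0), i.e. xn + yn = 1.
    distance = np.abs(xn + yn - 1) / np.sqrt(2)
    return x[int(np.argmax(distance))]


# First seven rows of the instances vs. threshold table
thresholds = [0, 1, 2, 3, 4, 5, 6]
instances = [500, 450, 380, 300, 240, 220, 205]

print(elbow_by_chord_distance(thresholds, instances))  # 4.0
```

On these seven points the heuristic lands on the 4% threshold that the table marks as the elbow.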

Visual Example

   Instances
    |
500 |*
    |
400 | *
    |
300 |  *
    |   ╲
200 |    *  ← Elbow at threshold ≈ 4%
    |     ╲
100 |      ╲___
    |          ╲____
  0 |_______________╲____
    0   2   4   6   8   10  Threshold (%)

The elbow at 4% represents the optimal threshold where:

  • Still have substantial instances (200)

  • Threshold is meaningful (4% return)

  • Diminishing returns begin beyond this point

Parameters Explained

lookback_periods (int, default=5)

Fixed number of periods to look forward.

Example:

# 10-day lookback
targets, meta = generate_target_combinations(
    df,
    mode='manual',
    lookback_periods=10
)

# All methods use period=10
for i in range(1, 9):
    assert meta[f'method_{i}']['period'] == 10

Metadata Structure

Understanding the Output

The metadata dictionary contains rich information:

targets, meta = generate_target_combinations(df, mode='auto')

# Metadata structure
meta = {
    'mode': 'auto',  # or 'manual'
    'method_1': {
        'period': 5,          # Lookback period
        'threshold': 3.5,     # Threshold percentage
        'instances': 247,     # Number of positive instances
        'pct_of_max': 45.2    # Percentage of maximum possible instances
    },
    # ... method_2 through method_8 ...
}

Field Meanings:

  • period: How many days/periods to look forward

  • threshold: Minimum return % required for positive label

  • instances: How many data points are positive

  • pct_of_max: What % of theoretical maximum this represents
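The relationship between instances and pct_of_max can be checked with simple arithmetic. The sketch below assumes pct_of_max is 100 * instances / max_instances, where max_instances is the count at threshold = 0%; both helper names are hypothetical.

```python
def pct_of_max(instances: int, max_instances: int) -> float:
    """Share of the maximum possible positives, as a percentage."""
    return round(100 * instances / max_instances, 1)


def implied_max_instances(method_meta: dict) -> float:
    """Back out the maximum instance count from one method's metadata entry."""
    return method_meta["instances"] / (method_meta["pct_of_max"] / 100)


print(pct_of_max(250, 500))  # 50.0

example = {"period": 5, "threshold": 3.5, "instances": 247, "pct_of_max": 45.2}
print(round(implied_max_instances(example)))  # 546
```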

Using Metadata

Compare Methods:

import pandas as pd

# Create comparison DataFrame
comparison = pd.DataFrame([
    meta[f'method_{i}'] for i in range(1, 9)
])
comparison.index = [f'Method_{i}' for i in range(1, 9)]

print(comparison)
#           period  threshold  instances  pct_of_max
# Method_1       5        3.5        247        45.2
# Method_2       4        5.0        198        38.1
# ...

Save for Reproducibility:

import json

# Save metadata
with open('target_metadata.json', 'w') as f:
    json.dump(meta, f, indent=2)

# Load later
with open('target_metadata.json', 'r') as f:
    loaded_meta = json.load(f)

Apply to New Data:

# Apply same parameters to test set (Method 1: Close[N] / Close[0])
def apply_target_params(df, method_meta):
    period = method_meta['period']
    threshold = method_meta['threshold'] / 100

    future_close = df['Close'].shift(-period)
    target = future_close / df['Close'] - 1 >= threshold
    # The last `period` rows have no future close: mark them unknown, not negative
    return target.where(future_close.notna())

# Apply Method 1 parameters to test data
test_target = apply_target_params(test_df, meta['method_1'])

Choosing Between Modes

Use Auto Mode When:

  • You want optimal parameters for your specific data

  • Class balance is critical (e.g., balanced dataset for training)

  • You’re exploring different timeframes

  • You need reproducible, data-driven decisions

  • You have sufficient data (500+ rows)

Example Use Cases:

  • Initial model development

  • Production systems with regular retraining

  • Research and strategy development

Use Manual Mode When:

  • You have a specific trading timeframe in mind

  • You want to compare performance across fixed horizons

  • You’re validating a hypothesis

  • You have domain knowledge about appropriate periods

  • You want simpler, more interpretable parameters

Example Use Cases:

  • Backtesting specific strategies (e.g., “5-day swing trades”)

  • Regulatory or operational constraints on holding periods

  • Comparing different stocks on equal footing

Comparison Example

# Auto mode: Let optimizer decide everything
auto_targets, auto_meta = generate_target_combinations(
    df, mode='auto', target_class_balance=0.5
)
# Result: period=7, threshold=3.8%, instances=512 (50.1% of max)

# Manual mode: You control the period
manual_targets, manual_meta = generate_target_combinations(
    df, mode='manual', lookback_periods=7
)
# Result: period=7, threshold=5.2%, instances=384 (37.5% of max)

# Auto mode found a lower threshold to hit the target balance

Complete Workflow Example

End-to-End ML Pipeline

import pandas as pd
import numpy as np
from rhoa.targets import generate_target_combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import json

# 1. Load and prepare data
df = pd.read_csv('stock_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

# 2. Time-based split (IMPORTANT: split before target generation)
split_idx = int(len(df) * 0.8)
train_df = df[:split_idx].copy()
test_df = df[split_idx:].copy()

# 3. Generate targets on TRAINING data only
targets_train, meta = generate_target_combinations(
    train_df,
    mode='auto',
    target_class_balance=0.4  # 40% positive
)

print(f"Using Method 7: {meta['method_7']}")
# {'period': 6, 'threshold': 4.2, 'instances': 201, 'pct_of_max': 39.8}

# 4. Save metadata for reproducibility
with open('target_config.json', 'w') as f:
    json.dump(meta, f, indent=2)

# 5. Create features
train_df['SMA_20'] = train_df['Close'].rolling(20).mean()
train_df['SMA_50'] = train_df['Close'].rolling(50).mean()
train_df['Returns'] = train_df['Close'].pct_change()
train_df['Volatility'] = train_df['Returns'].rolling(20).std()

# 6. Combine features and target
train_df['Target'] = targets_train['Target_7']
train_clean = train_df.dropna()

# 7. Train model
feature_cols = ['SMA_20', 'SMA_50', 'Returns', 'Volatility']
X_train = train_clean[feature_cols]
y_train = train_clean['Target']

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

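# 8. Apply SAME parameters to test data
test_period = meta['method_7']['period']
test_threshold = meta['method_7']['threshold'] / 100

# Method 7 compares the highest High over the next N bars with today's Close.
# Rolling over the reversed series gives a forward-looking window
# (assumes the window covers bars t+1..t+N; check against Rhoa's definition).
future_max_high = (
    test_df['High'].shift(-1)[::-1].rolling(window=test_period).max()[::-1]
)

test_df['Target'] = (future_max_high / test_df['Close'] - 1 >= test_threshold)

# The last N rows have no complete future window; drop them instead of
# labelling them negative
test_df = test_df[future_max_high.notna()].copy()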

# 9. Create test features
test_df['SMA_20'] = test_df['Close'].rolling(20).mean()
test_df['SMA_50'] = test_df['Close'].rolling(50).mean()
test_df['Returns'] = test_df['Close'].pct_change()
test_df['Volatility'] = test_df['Returns'].rolling(20).std()

test_clean = test_df.dropna()

# 10. Evaluate
X_test = test_clean[feature_cols]
y_test = test_clean['Target']
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
#               precision    recall  f1-score
#          0       0.88      0.91      0.90
#          1       0.76      0.69      0.72

# 11. Visualize results
test_clean.rhoa.plots.signal(
    y_pred=y_pred,
    y_true=y_test,
    date_col='Date',
    price_col='Close',
    threshold=meta['method_7']['threshold'],
    title=f"Method 7 Predictions (Period={test_period}, Threshold={test_threshold*100:.1f}%)"
)

Best Practices

Data Preparation

Always Split Before Target Generation:

# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta parameters to test

# WRONG - Look-ahead bias!
targets, meta = generate_target_combinations(df)
train, test = split(df)
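The split helper above is left undefined. One minimal way to write it, mirroring step 2 of the workflow example, is a chronological cut; chronological_split and its train_frac parameter are illustrative names, not part of Rhoa.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str = "Date", train_frac: float = 0.8):
    """Split into (train, test) so every training row is earlier than every test row."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    split_idx = int(len(ordered) * train_frac)
    return ordered.iloc[:split_idx].copy(), ordered.iloc[split_idx:].copy()


frame = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "Close": range(10),
})
train, test = chronological_split(frame)
print(len(train), len(test))  # 8 2
```

A random shuffle split would leak future prices into training, so the cut is always made along the time axis.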

Handle NaN Values:

# Targets create NaN for last N rows (future unknown)
print(targets.tail())
# Last 'period' rows will be NaN

# Always drop NaN before training
combined = pd.concat([features, targets], axis=1)
clean_data = combined.dropna()

Ensure Data Quality:

# Check for missing values
assert df[['Open', 'High', 'Low', 'Close']].isnull().sum().sum() == 0

# Check OHLC relationships
assert (df['High'] >= df['Close']).all()
assert (df['Close'] >= df['Low']).all()
assert (df['High'] >= df['Low']).all()
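In a pipeline it can be more convenient to collect problems than to stop at the first failed assert. The following sketch reports every issue it finds; ohlc_problems is a hypothetical helper and the checks are the same relationships as above, extended to the Open column.

```python
import pandas as pd


def ohlc_problems(df: pd.DataFrame) -> list[str]:
    """Return human-readable descriptions of data-quality issues (empty if none)."""
    problems = []
    cols = ["Open", "High", "Low", "Close"]
    missing = int(df[cols].isnull().sum().sum())
    if missing:
        problems.append(f"{missing} missing OHLC values")
    # High must be the largest of the four prices on every row.
    if (df["High"] < df[["Open", "Close", "Low"]].max(axis=1)).any():
        problems.append("High is below Open, Close or Low on some rows")
    # Low must be the smallest of the four prices on every row.
    if (df["Low"] > df[["Open", "Close", "High"]].min(axis=1)).any():
        problems.append("Low is above Open, Close or High on some rows")
    return problems


good = pd.DataFrame({"Open": [9.5], "High": [10.5], "Low": [9.0], "Close": [10.0]})
bad = pd.DataFrame({"Open": [9.5], "High": [9.0], "Low": [9.0], "Close": [10.0]})

print(ohlc_problems(good))  # []
print(ohlc_problems(bad))   # ['High is below Open, Close or Low on some rows']
```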

Parameter Selection

Start Conservative:

# Start with lower balance for higher quality
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3  # Only best 30%
)

Consider Your Trading Style:

# Day trading: short periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=1,
    max_period=5
)

# Swing trading: medium periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=5,
    max_period=15
)

# Position trading: long periods
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_period=15,
    max_period=30
)

Match Threshold to Costs:

# If trading costs are 0.5% round-trip
# Minimum profitable threshold should be > 0.5%
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    min_pct=1,  # Start at 1% minimum
    max_pct=20
)

Method Selection

For Training (generous):

# Use Method 7 or 8 for more positive examples
y_train = targets_train['Target_7']  # MaxHigh/Close
# More data for model to learn from

For Validation (conservative):

# Use Method 1 for realistic validation
y_val = targets_val['Target_1']  # Close/Close
# More realistic profit expectations

Compare Multiple Methods:

# Train on each method, compare results
results = {}
for i in range(1, 9):
    X_train, y_train = features, targets[f'Target_{i}']
    model.fit(X_train, y_train)
    results[f'Method_{i}'] = evaluate(model, X_test, y_test)

# Choose method with best out-of-sample performance

Reproducibility

Always Save Metadata:

# Save with timestamp
import datetime

meta_with_timestamp = {
    'generated_at': datetime.datetime.now().isoformat(),
    'data_shape': df.shape,
    'date_range': f"{df['Date'].min()} to {df['Date'].max()}",
    **meta
}

with open('target_metadata.json', 'w') as f:
    json.dump(meta_with_timestamp, f, indent=2)

Version Your Data and Targets:

# Save targets alongside data
df_with_targets = pd.concat([df, targets], axis=1)
df_with_targets.to_csv(f'data_with_targets_{datetime.date.today()}.csv')

Common Pitfalls

Pitfall 1: Look-Ahead Bias

Problem: Generating targets on full dataset before splitting.

# WRONG
targets, meta = generate_target_combinations(df)  # Uses all data
train, test = split(df)  # Then split

# Why wrong: Optimization saw test data

Solution: Generate on train only.

# CORRECT
train, test = split(df)
targets_train, meta = generate_target_combinations(train)
# Apply meta params to test

Pitfall 2: Ignoring Class Imbalance

Problem: Using default 50% balance with limited data.

# With 100 samples and 50% balance
targets, meta = generate_target_combinations(
    small_df,  # Only 100 rows
    target_class_balance=0.5
)
# Result: At most 50 positive examples (too few!)

Solution: Adjust balance or get more data.

# Better
targets, meta = generate_target_combinations(
    small_df,
    target_class_balance=0.7  # Up to 70 positive examples
)

Pitfall 3: Unrealistic Thresholds

Problem: Thresholds below transaction costs.

# Transaction cost = 0.5% round-trip
# But threshold = 0.3%
# Every "profitable" trade actually loses money!

Solution: Set minimum threshold above costs.

targets, meta = generate_target_combinations(
    df,
    min_pct=1,  # 1% minimum (above 0.5% costs)
)

Pitfall 4: Over-Optimizing on Single Stock

Problem: Perfect parameters for AAPL may not work for others.

Solution: Test on multiple assets.

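# Find parameters on a basket of stocks
# (load_data is your own loader)
all_targets = []
all_meta = {}
for ticker in ['AAPL', 'MSFT', 'GOOGL']:
    df = load_data(ticker)
    targets, meta = generate_target_combinations(df)
    all_targets.append(targets)
    all_meta[ticker] = meta  # Keep per-ticker parameters for comparison

# Use common parameters across stocks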
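One possible way to derive common parameters is to take the median period and threshold across tickers. The metadata values below are made up for illustration, common_params is a hypothetical helper, and the median is only one reasonable aggregation choice.

```python
import statistics

# Hypothetical per-ticker metadata, shaped like the meta dictionaries in this guide.
all_meta = {
    "AAPL": {"method_1": {"period": 5, "threshold": 3.0}},
    "MSFT": {"method_1": {"period": 7, "threshold": 4.0}},
    "GOOGL": {"method_1": {"period": 6, "threshold": 3.5}},
}


def common_params(all_meta: dict, method: str) -> dict:
    """Median period and threshold for one method across all tickers."""
    periods = [m[method]["period"] for m in all_meta.values()]
    thresholds = [m[method]["threshold"] for m in all_meta.values()]
    return {
        "period": int(statistics.median(periods)),
        "threshold": statistics.median(thresholds),
    }


print(common_params(all_meta, "method_1"))  # {'period': 6, 'threshold': 3.5}
```

The median is less sensitive than the mean to one ticker with unusual volatility, which is the situation this pitfall describes.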

Advanced Topics

Multi-Target Ensemble

Train separate models on different target methods and ensemble:

# Train models on different targets
models = {}
for i in [1, 5, 7]:  # Conservative, medium, aggressive
    X, y = features, targets[f'Target_{i}']
    model = RandomForestClassifier()
    model.fit(X, y)
    models[i] = model

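# Ensemble: require 2 out of 3 to agree
pred_1 = models[1].predict_proba(X_test)[:, 1]
pred_5 = models[5].predict_proba(X_test)[:, 1]
pred_7 = models[7].predict_proba(X_test)[:, 1]

# Cast votes to int before adding: "+" on NumPy boolean arrays is a logical OR
votes = (
    (pred_1 > 0.5).astype(int)
    + (pred_5 > 0.5).astype(int)
    + (pred_7 > 0.5).astype(int)
)
ensemble_pred = votes >= 2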
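The vote count needs integer arithmetic, because adding NumPy boolean arrays with + performs a logical OR rather than a sum. A small worked example with made-up probabilities shows a 2-of-3 majority vote computed by stacking the votes and summing along the model axis.

```python
import numpy as np

# Hypothetical positive-class probabilities from three models for four samples.
pred_1 = np.array([0.9, 0.6, 0.2, 0.4])
pred_5 = np.array([0.8, 0.3, 0.7, 0.4])
pred_7 = np.array([0.7, 0.6, 0.1, 0.6])

# Stack the boolean votes; summing booleans along axis 0 yields integer counts.
votes = np.stack([pred_1 > 0.5, pred_5 > 0.5, pred_7 > 0.5]).sum(axis=0)
ensemble_pred = votes >= 2

print(votes)          # [3 2 1 1]
print(ensemble_pred)  # [ True  True False False]
```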

Custom Target Generation

Create your own target based on Rhoa’s patterns:

def custom_target(df, period, threshold, cost_pct=0.5):
    """Target that accounts for transaction costs."""
    future_high = df['High'].shift(-period)
    future_low = df['Low'].shift(-period)

    # Best case profit
    profit = (future_high / df['Close'] - 1) - (cost_pct / 100)

    # Return at the exit bar's low, net of costs. This is a proxy for a 2%
    # stop loss: it does not check lows on the bars in between.
    loss = (future_low / df['Close'] - 1) - (cost_pct / 100)

    # Positive if profit >= threshold AND loss >= -2%
    return (profit >= threshold / 100) & (loss >= -0.02)

# Use custom target
target = custom_target(df, period=5, threshold=3.0)

Performance Tips

For Large Datasets

# Reduce search space
targets, meta = generate_target_combinations(
    large_df,
    mode='auto',
    period_step=2,  # Check every 2 periods instead of 1
    step=2          # Check every 2% threshold instead of 1%
)
# Evaluates roughly 4x fewer combinations, at the cost of a coarser search
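The factor of four follows from the size of the search grid. The sketch below counts (period, threshold) combinations under the defaults described in this guide, assuming an inclusive period range and thresholds from min_pct up to but excluding max_pct, as in the range() calls of the algorithm section; grid_size is a hypothetical helper.

```python
def grid_size(min_period, max_period, period_step, min_pct, max_pct, step):
    """Number of (period, threshold) combinations the search would evaluate."""
    periods = len(range(min_period, max_period + 1, period_step))
    thresholds = len(range(min_pct, max_pct, step))
    return periods * thresholds


full = grid_size(1, 20, 1, 0, 100, 1)     # 20 periods * 100 thresholds
reduced = grid_size(1, 20, 2, 0, 100, 2)  # 10 periods * 50 thresholds

print(full, reduced, full / reduced)  # 2000 500 4.0
```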

For Production

# Cache targets if data doesn't change
import os

import joblib

cache_file = 'targets_cache.pkl'

if os.path.exists(cache_file):
    targets, meta = joblib.load(cache_file)
else:
    targets, meta = generate_target_combinations(df)
    joblib.dump((targets, meta), cache_file)
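A fixed cache file name cannot tell whether the data has changed. One option is to derive the file name from a fingerprint of the DataFrame, so a changed input misses the cache. This is a sketch under that assumption; frame_fingerprint and the file-name pattern are illustrative.

```python
import hashlib

import pandas as pd


def frame_fingerprint(df: pd.DataFrame) -> str:
    """Short, stable hash of a DataFrame's values and index."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]


original = pd.DataFrame({"Close": [10.0, 11.0, 12.0]})
same = original.copy()
changed = original.copy()
changed.loc[0, "Close"] = 9.0

cache_file = f"targets_cache_{frame_fingerprint(original)}.pkl"

print(frame_fingerprint(original) == frame_fingerprint(same))     # True
print(frame_fingerprint(original) == frame_fingerprint(changed))  # False
```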

Summary

Key takeaways:

  1. Auto mode for optimal parameters, manual mode for fixed periods

  2. Always split data before generating targets

  3. Save metadata for reproducibility

  4. Choose target method based on use case (Method 7 for training, Method 1 for validation)

  5. Account for transaction costs in threshold selection

  6. Test parameters across multiple assets

  7. Validate on true out-of-sample data

The target generation system is designed to remove guesswork and provide data-driven, reproducible trading signal definitions for machine learning.