Source code for rhoa.targets

# rhoa - A pandas DataFrame extension for technical analysis
# Copyright (C) 2025 nainajnahO
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

"""
Target Generation for Financial Time Series Analysis
=====================================================

This module provides tools for generating optimized binary classification targets
based on future price movements in financial time series data. It supports two
optimization strategies: Pareto multi-objective optimization and elbow method
threshold selection.

Core Functionality
------------------
- Generate 8 target combinations using different entry/exit price definitions
- Automatic parameter optimization for lookback periods and thresholds
- Support for both end-of-period and maximum-during-period gain calculations
- Flexible class balance targeting or elbow-based threshold selection

Target Methods
--------------
The module generates eight distinct target definitions, each representing a
different combination of entry price (Close[t] or High[t]) and exit price
(end-of-period or maximum-during-period). Windows written as [t+1:t+N] include
both endpoints. Entering at High[t] is the stricter assumption, because the
gain is measured against a higher starting price:

1. Close[t+N] / Close[t] - Entry at close, end-point exit
2. Close[t+N] / High[t] - Entry at high, end-point exit
3. High[t+N] / Close[t] - Entry at close, high exit
4. High[t+N] / High[t] - Entry at high, high exit
5. max(Close[t+1:t+N]) / Close[t] - Entry at close, optimal close exit
6. max(Close[t+1:t+N]) / High[t] - Entry at high, optimal close exit
7. max(High[t+1:t+N]) / Close[t] - Entry at close, optimal high exit
8. max(High[t+1:t+N]) / High[t] - Entry at high, optimal high exit
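
The eight ratios above can be sketched as follows. This is a minimal
illustration, not the module's implementation: `gain_ratios` and the `demo`
frame are hypothetical names, and the maximum-during-period window is assumed
to be the inclusive range [t+1, t+N].

```python
import pandas as pd


def gain_ratios(df, n, close_col="Close", high_col="High"):
    # Hypothetical helper (not part of rhoa): one gain series per target.
    close, high = df[close_col], df[high_col]
    exits = [
        close.shift(-n),                    # Close[t+N]
        high.shift(-n),                     # High[t+N]
        close.rolling(n).max().shift(-n),   # max(Close[t+1..t+N])
        high.rolling(n).max().shift(-n),    # max(High[t+1..t+N])
    ]
    out = {}
    k = 1
    for exit_price in exits:
        for entry_price in (close, high):
            # Gain relative to the entry price, e.g. 0.05 for +5%.
            out[f"gain_{k}"] = exit_price / entry_price - 1
            k += 1
    return pd.DataFrame(out)


# Assumed toy data for illustration only.
demo = pd.DataFrame({
    "Close": [100, 102, 101, 105, 103, 107],
    "High": [101, 103, 102, 106, 104, 108],
})
gains = gain_ratios(demo, n=2)
```

Each `gain_k` column corresponds to item k in the list; a target is then the
comparison of that column against a threshold.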

Optimization Modes
------------------
Auto Mode (Pareto Optimization)
    Searches the space of (period, threshold) combinations to find Pareto-optimal
    solutions that balance three objectives:
    - Maximize threshold (stricter gain requirement)
    - Minimize period (shorter holding time)
    - Minimize deviation from target class balance

Manual Mode (Elbow Method)
    Uses a fixed lookback period and finds the optimal threshold using
    elbow/knee point detection on the curve of instance counts vs. thresholds.

Mathematical Background
-----------------------
Pareto Optimization
    A solution is Pareto-optimal if no other solution exists that improves at
    least one objective without worsening any other. For point A to dominate B:
    - A must be better than B in at least one objective
    - A must be no worse than B in all other objectives
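
The dominance rule can be sketched in plain NumPy. This is an illustrative
brute-force check, not the `paretoset` implementation; `pareto_mask` and the
`candidates` array are hypothetical.

```python
import numpy as np


def pareto_mask(points, sense):
    # points: (n, k) array of objective values.
    # sense: one of "max" or "min" per column.
    pts = np.asarray(points, dtype=float)
    signs = np.array([1.0 if s == "max" else -1.0 for s in sense])
    scores = pts * signs  # after flipping, larger is better in every column
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        # Row j dominates row i if it is no worse everywhere
        # and strictly better somewhere.
        no_worse = (scores >= scores[i]).all(axis=1)
        better = (scores > scores[i]).any(axis=1)
        keep[i] = not (no_worse & better).any()
    return keep


# Columns: threshold (max), period (min), deviation (min). Assumed values.
candidates = np.array([
    [5, 3, 10],
    [5, 4, 10],   # dominated by the first row (same threshold, longer period)
    [7, 3, 20],
    [4, 2, 10],
])
mask = pareto_mask(candidates, ["max", "min", "min"])
```

The loop is quadratic in the number of candidates, which is acceptable for a
sketch but slower than a dedicated library on large search spaces.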

Elbow Method
    Identifies the point of maximum curvature on a convex decreasing curve by
    finding the point with maximum perpendicular distance from the line
    connecting curve endpoints.
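
A minimal sketch of the distance-based rule described above, assuming raw
(unnormalized) axes. `elbow_by_distance` is a hypothetical helper; the
`KneeLocator` class normalizes the curve first, so its result may differ.

```python
import numpy as np


def elbow_by_distance(x, y):
    # Index of the point farthest from the chord joining the endpoints.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x0, y0, xn, yn = x[0], y[0], x[-1], y[-1]
    numerator = np.abs((yn - y0) * x - (xn - x0) * y + xn * y0 - yn * x0)
    denominator = np.hypot(yn - y0, xn - x0)
    return int(np.argmax(numerator / denominator))


# Assumed instance counts for thresholds 0%..5%, for illustration only.
thresholds = np.arange(0, 6)
counts = np.array([100, 40, 20, 12, 8, 6])
elbow_idx = elbow_by_distance(thresholds, counts)
elbow_threshold = int(thresholds[elbow_idx])
```

Because the axes are not rescaled here, the result depends on the units of
each axis; normalizing both to [0, 1] first removes that dependence.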

Dependencies
------------
- pandas : DataFrame operations and time series handling
- numpy : Numerical computations
- kneed : Elbow/knee point detection (KneeLocator)
- paretoset : Pareto frontier computation

Notes
-----
**Look-ahead Bias Warning:**
    These targets use future price information and should only be used as
    labels in supervised learning. Never use future data as model input
    (for example in feature engineering); use it only to define what the
    model should predict.

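**NaN Handling:**
    The last N rows (where N is the lookback period) have no complete future
    window, so their future prices are NaN. A comparison against NaN evaluates
    to False, so in boolean target columns these rows appear as False rather
    than NaN. They should be excluded from analysis.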
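
The effect on the trailing rows can be sketched as follows, assuming a
hypothetical close series and a 1.5% threshold. The names are illustrative and
not part of the module.

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 107.0])
n = 2

# Gain from t to t+n; the last n rows have no future price and are NaN.
gain = close.shift(-n) / close - 1

# Comparing NaN with a threshold yields False, so the column stays boolean.
raw_target = gain >= 0.015

# Keep only rows whose future window is complete before training.
valid = gain.notna()
clean_target = raw_target[valid]
```

Filtering on `gain.notna()` separates rows that are genuinely negative from
rows that merely lack future data.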

**Class Imbalance:**
    Very low target_class_balance (< 0.1) may result in insufficient positive
    examples. Very high values (> 0.9) may result in overly easy targets with
    poor discrimination.

Examples
--------
Basic usage with auto mode:

>>> import pandas as pd
>>> import numpy as np
>>> from rhoa.targets import generate_target_combinations
>>>
>>> # Load your OHLC data
>>> df = pd.read_csv('prices.csv', index_col='Date', parse_dates=True)
>>>
>>> # Generate targets with 50% class balance
>>> targets, metadata = generate_target_combinations(
...     df, mode='auto', target_class_balance=0.5
... )
>>>
>>> print(f"Generated {len(targets.columns)} targets")
>>> print(f"Target_1 has {targets['Target_1'].sum()} positive instances")

Manual mode with fixed period:

>>> targets, metadata = generate_target_combinations(
...     df, mode='manual', lookback_periods=10
... )
>>> print(f"Method 1 threshold: {metadata['method_1']['threshold']}%")

See Also
--------
generate_target_combinations : Main function for target generation.

References
----------
.. [1] Pareto, V. (1906). "Manuale di economia politica"
.. [2] Satopää, V., et al. (2011). "Finding a 'Kneedle' in a Haystack:
       Detecting Knee Points in System Behavior"
.. [3] Deb, K., et al. (2002). "A fast and elitist multiobjective genetic
       algorithm: NSGA-II"
"""

import pandas as pd
import numpy as np
from typing import Tuple, Dict
from kneed import KneeLocator
from paretoset import paretoset


def generate_target_combinations(
        df: pd.DataFrame,
        mode: str = 'auto',
        # Manual mode parameters:
        lookback_periods: int = 5,
        # Auto mode parameters:
        target_class_balance: float = 0.5,
        min_period: int = 1,
        max_period: int = 20,
        period_step: int = 1,
        # Common parameters:
        min_pct: int = 0,
        max_pct: int = 100,
        step: int = 1,
        close_col: str = "Close",
        high_col: str = "High"
) -> Tuple[pd.DataFrame, Dict]:
    """
    Generate eight target combinations with optimized thresholds and lookback periods.

    This function creates binary classification targets based on future price
    movements, supporting two optimization modes: automatic Pareto-based
    optimization and manual elbow-based threshold selection. Each target
    represents a different way of measuring future price gains relative to
    current entry prices.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing OHLC (Open, High, Low, Close) price data.
        Must have at least the columns specified by `close_col` and `high_col`.
        Index should be a time series (e.g., DatetimeIndex).
    mode : {'auto', 'manual'}, default='auto'
        Optimization mode to use:

        - 'auto' : Uses Pareto optimization to find optimal lookback period and
          threshold that balance multiple objectives (maximize threshold,
          minimize period, achieve target class balance).
        - 'manual' : Uses fixed lookback period with elbow method to find
          optimal threshold based on the curve of instance counts vs. thresholds.
    lookback_periods : int, default=5
        Number of periods to look forward for future price calculations.
        Only used when `mode='manual'`. Must be >= 1 and < len(df).
    target_class_balance : float, default=0.5
        Target proportion of positive class instances (range: 0.0 to 1.0).
        Only used when `mode='auto'`. For example, 0.5 means aim for 50%
        positive instances, 0.3 means 30%.
    min_period : int, default=1
        Minimum lookback period to consider in optimization search space.
        Only used when `mode='auto'`. Must be >= 1.
    max_period : int, default=20
        Maximum lookback period to consider in optimization search space.
        Only used when `mode='auto'`. Must be > min_period and < len(df).
    period_step : int, default=1
        Increment step for lookback period search.
        Only used when `mode='auto'`. Must be >= 1.
    min_pct : int, default=0
        Minimum threshold percentage to consider (e.g., 0 for 0%).
        Must be >= 0 and < max_pct.
    max_pct : int, default=100
        Maximum threshold percentage to consider (e.g., 100 for 100%).
        Must be > min_pct and <= 100.
    step : int, default=1
        Increment step for threshold search in percentage points.
        Must be >= 1.
    close_col : str, default='Close'
        Name of the close price column in the DataFrame.
    high_col : str, default='High'
        Name of the high price column in the DataFrame.

    Returns
    -------
    targets_df : pd.DataFrame
        DataFrame with same index as input `df`, containing 8 boolean columns:

        - Target_1 : (Close[t+N] / Close[t]) - 1 >= threshold
        - Target_2 : (Close[t+N] / High[t]) - 1 >= threshold
        - Target_3 : (High[t+N] / Close[t]) - 1 >= threshold
        - Target_4 : (High[t+N] / High[t]) - 1 >= threshold
        - Target_5 : (max(Close[t+1:t+N]) / Close[t]) - 1 >= threshold
        - Target_6 : (max(Close[t+1:t+N]) / High[t]) - 1 >= threshold
        - Target_7 : (max(High[t+1:t+N]) / Close[t]) - 1 >= threshold
        - Target_8 : (max(High[t+1:t+N]) / High[t]) - 1 >= threshold

        Where N is the optimized lookback period, and threshold is the
        optimized percentage gain threshold.
    metadata : dict
        Dictionary containing optimization results and configuration with keys:

        - 'mode' : str
            The mode used ('auto' or 'manual').
        - 'method_1' through 'method_8' : dict
            Each method dictionary contains:

            - 'period' : int
                Optimal lookback period in number of time steps.
            - 'threshold' : float
                Optimal threshold as percentage (e.g., 5.0 for 5%).
            - 'instances' : int
                Number of positive instances at the optimal parameters.
            - 'pct_of_max' : float
                Percentage of maximum possible instances (at 0% threshold).

    Raises
    ------
    ValueError
        If the DataFrame is empty.
    ValueError
        If `close_col` or `high_col` not found in DataFrame columns.
    ValueError
        If `mode` is not 'auto' or 'manual'.

    See Also
    --------
    _find_optimal_params_pareto : Pareto optimization for auto mode.
    _find_optimal_params_elbow : Elbow method for manual mode.
    _generate_targets : Generates target columns from optimal parameters.

    Notes
    -----
    **Target Interpretation:**

    - Targets 1-4 measure end-of-period gains (single point in time at t+N).
    - Targets 5-8 measure maximum gains during the period (any time in [t+1, t+N]).
    - Using High[t] as denominator (Targets 2, 4, 6, 8) is more conservative than
      Close[t], as it requires overcoming intraday peaks.
    - Maximum-based targets (5-8) capture exit opportunities that might occur
      before the end of the lookback period.

    **Pareto Optimization (Auto Mode):**

    The Pareto optimization finds solutions that are not dominated by any other
    solution in the objective space. A solution A dominates solution B if:

    - A is better than B in at least one objective
    - A is no worse than B in all other objectives

    For this problem, we optimize three objectives:

    1. Maximize threshold (prefer higher gain requirements)
    2. Minimize period (prefer shorter holding periods)
    3. Minimize deviation from target class balance

    From the Pareto-optimal set, we select the solution closest to the target
    class balance.

    **Elbow Method (Manual Mode):**

    The elbow method finds the "knee" or "elbow" point on the curve of instance
    counts vs. thresholds. This point represents the optimal trade-off where:

    - Increasing threshold further causes steep drops in instances (high cost)
    - Decreasing threshold provides diminishing returns in instances

    Mathematically, the elbow is found by maximizing the distance from the curve
    to the line connecting the endpoints.

    **Common Pitfalls:**

    - Insufficient data: Ensure df has enough rows for the lookback period.
    - Look-ahead bias: Use future data only to create targets, never as model
      input.
    - Class imbalance: Very low or very high target_class_balance may yield poor
      results.
    - NaN handling: The last N rows lack sufficient future data and should be
      excluded from training and evaluation.

    Examples
    --------
    **Example 1: Auto mode with default parameters**

    >>> import pandas as pd
    >>> import numpy as np
    >>> from rhoa.targets import generate_target_combinations
    >>>
    >>> # Create sample OHLC data
    >>> np.random.seed(42)
    >>> dates = pd.date_range('2020-01-01', periods=100, freq='D')
    >>> df = pd.DataFrame({
    ...     'Close': 100 + np.cumsum(np.random.randn(100)),
    ...     'High': 100 + np.cumsum(np.random.randn(100)) + 1
    ... }, index=dates)
    >>>
    >>> # Generate targets with auto mode
    >>> targets, meta = generate_target_combinations(df, mode='auto')
    >>>
    >>> # Check results
    >>> print(f"Mode: {meta['mode']}")
    Mode: auto
    >>> print(f"Target_7 period: {meta['method_7']['period']}")
    Target_7 period: 6
    >>> print(f"Target_7 threshold: {meta['method_7']['threshold']}%")
    Target_7 threshold: 4.0%
    >>> print(f"Positive instances: {targets['Target_7'].sum()}")  # doctest: +SKIP

    **Example 2: Manual mode with custom lookback period**

    >>> # Generate targets with manual mode
    >>> targets, meta = generate_target_combinations(
    ...     df,
    ...     mode='manual',
    ...     lookback_periods=10,
    ...     min_pct=0,
    ...     max_pct=20,
    ...     step=1
    ... )
    >>>
    >>> # Check results for method 1
    >>> method_1 = meta['method_1']
    >>> print(f"Period: {method_1['period']}, Threshold: {method_1['threshold']}%")
    Period: 10, Threshold: 6.0%
    >>> print(f"Instances: {method_1['instances']} "
    ...       f"({method_1['pct_of_max']:.1f}% of max)")  # doctest: +SKIP

    **Example 3: Target specific class balance**

    >>> # Aim for 30% positive instances
    >>> targets, meta = generate_target_combinations(
    ...     df,
    ...     mode='auto',
    ...     target_class_balance=0.3,
    ...     min_period=1,
    ...     max_period=15
    ... )
    >>>
    >>> # Verify class balance for each target
    >>> for i in range(1, 9):  # doctest: +SKIP
    ...     positive_pct = targets[f'Target_{i}'].sum() / len(targets) * 100
    ...     print(f"Target_{i}: {positive_pct:.1f}% positive")

    **Example 4: Custom column names**

    >>> # DataFrame with different column names
    >>> df_custom = df.rename(columns={'Close': 'close_price', 'High': 'high_price'})
    >>> targets, meta = generate_target_combinations(
    ...     df_custom,
    ...     mode='auto',
    ...     close_col='close_price',
    ...     high_col='high_price'
    ... )
    >>> print(targets.columns.tolist())
    ['Target_1', 'Target_2', 'Target_3', 'Target_4', 'Target_5', 'Target_6', 'Target_7', 'Target_8']

    References
    ----------
    .. [1] Pareto, V. (1906). "Manuale di economia politica"
    .. [2] Satopää, V., et al. (2011). "Finding a 'Kneedle' in a Haystack:
           Detecting Knee Points in System Behavior"
    """
    # 1. Validate inputs
    _validate_inputs(df, mode, close_col, high_col)

    if mode == 'auto':
        # 2a. Auto mode: Pareto optimization
        optimal_params = _find_optimal_params_pareto(
            df, target_class_balance, min_period, max_period, period_step,
            min_pct, max_pct, step, close_col, high_col
        )
    else:  # mode == 'manual'
        # 2b. Manual mode: Fixed period, find elbow
        optimal_params = _find_optimal_params_elbow(
            df, lookback_periods, min_pct, max_pct, step, close_col, high_col
        )

    # 3. Generate 8 target columns using optimal params
    targets_df = _generate_targets(df, optimal_params, close_col, high_col)

    # 4. Create metadata dict
    metadata = {
        'mode': mode,
        **optimal_params  # Contains method_1 through method_8 dicts
    }

    return targets_df, metadata


def _validate_inputs(df: pd.DataFrame, mode: str, close_col: str, high_col: str) -> None:
    """
    Validate input parameters for target generation.

    Ensures the DataFrame is not empty, contains required columns, and the mode
    parameter is valid before proceeding with target generation.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame to validate.
    mode : str
        Optimization mode to validate. Must be 'auto' or 'manual'.
    close_col : str
        Name of the close price column to check for existence.
    high_col : str
        Name of the high price column to check for existence.

    Raises
    ------
    ValueError
        If DataFrame is empty.
    ValueError
        If `close_col` not found in DataFrame columns.
    ValueError
        If `high_col` not found in DataFrame columns.
    ValueError
        If `mode` is not 'auto' or 'manual'.

    Notes
    -----
    This is a helper function that performs early validation to provide clear
    error messages before expensive computation begins.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'Close': [100, 101], 'High': [102, 103]})
    >>> _validate_inputs(df, 'auto', 'Close', 'High')  # No error
    >>> _validate_inputs(df, 'invalid', 'Close', 'High')  # Raises ValueError
    Traceback (most recent call last):
        ...
    ValueError: mode must be 'auto' or 'manual', got 'invalid'
    """
    if df.empty:
        raise ValueError("DataFrame is empty")

    if close_col not in df.columns:
        raise ValueError(f"Column '{close_col}' not found in DataFrame")

    if high_col not in df.columns:
        raise ValueError(f"Column '{high_col}' not found in DataFrame")

    if mode not in ['auto', 'manual']:
        raise ValueError(f"mode must be 'auto' or 'manual', got '{mode}'")


def _calculate_future_values(
        df: pd.DataFrame,
        period: int,
        close_col: str,
        high_col: str
) -> Tuple[pd.Series, pd.Series, pd.Series, pd.Series]:
    """
    Calculate future price values and maximum values over the lookback period.

    This function computes four forward-looking price series that are used to
    create the eight target definitions.
    It handles both point-in-time future values (at period N) and maximum
    values over the entire period (1 to N).

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing price data.
    period : int
        Number of periods to look forward. Must be >= 1.
    close_col : str
        Name of the close price column.
    high_col : str
        Name of the high price column.

    Returns
    -------
    future_close : pd.Series
        Close price at time t+period. Shape matches input df.
        Last `period` values will be NaN due to insufficient future data.
    future_high : pd.Series
        High price at time t+period. Shape matches input df.
        Last `period` values will be NaN.
    future_max_close : pd.Series
        Maximum close price over the window [t+1, t+period].
        Last `period` values will be NaN.
    future_max_high : pd.Series
        Maximum high price over the window [t+1, t+period].
        Last `period` values will be NaN.

    Notes
    -----
    **Shifting Logic:**

    - `shift(-period)` moves values backward in time, making future values
      available at current time index.
    - For max values, we calculate the rolling max over the trailing window
      [t-period+1, t] and then shift it back by `period`, so the value at
      index t covers the window [t+1, t+period].

    **Rolling Window:**

    The rolling window calculation uses `min_periods=1` to handle edge cases
    at the start of the series, but in practice, the initial values are not
    used due to the shifting operation.

    **Memory Efficiency:**

    The function creates four new Series objects but does not copy the entire
    DataFrame, making it efficient for large datasets.

    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame({
    ...     'Close': [100, 102, 101, 105, 103, 107],
    ...     'High': [101, 103, 102, 106, 104, 108]
    ... })
    >>> future_close, future_high, future_max_close, future_max_high = \\
    ...     _calculate_future_values(df, period=2, close_col='Close', high_col='High')
    >>>
    >>> # future_close is Close shifted back by 2
    >>> print(future_close.values)
    [101. 105. 103. 107.  nan  nan]
    >>>
    >>> # future_max_close is max(Close) over next 2 periods
    >>> print(future_max_close.values)
    [102. 105. 105. 107.  nan  nan]

    See Also
    --------
    pd.Series.shift : Shift index by desired number of periods.
    pd.Series.rolling : Provide rolling window calculations.
    """
    future_close = df[close_col].shift(-period)
    future_high = df[high_col].shift(-period)

    # For max values: take the trailing rolling max, then shift it back by
    # `period` so that index t holds the max over [t+1, t+period].
    # (Shifting before rolling and then shifting forward again would return
    # the trailing window instead of the future one.)
    future_max_close = df[close_col].rolling(
        window=period, min_periods=1
    ).max().shift(-period)
    future_max_high = df[high_col].rolling(
        window=period, min_periods=1
    ).max().shift(-period)

    return future_close, future_high, future_max_close, future_max_high


def _find_optimal_params_pareto(
        df: pd.DataFrame,
        target_class_balance: float,
        min_period: int,
        max_period: int,
        period_step: int,
        min_pct: int,
        max_pct: int,
        step: int,
        close_col: str,
        high_col: str
) -> Dict:
    """
    Find optimal parameters using Pareto multi-objective optimization.

    This function searches the parameter space of (period, threshold)
    combinations to find Pareto-optimal solutions that balance three competing
    objectives: maximize threshold, minimize period, and minimize deviation
    from target class balance. From the Pareto-optimal set, it selects the
    solution closest to the desired class balance for each of the 8 target
    methods.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing OHLC price data.
    target_class_balance : float
        Target proportion of positive class instances (0.0 to 1.0).
        For example, 0.5 means aim for 50% positive instances.
    min_period : int
        Minimum lookback period to consider. Must be >= 1.
    max_period : int
        Maximum lookback period to consider. Must be > min_period.
    period_step : int
        Increment for period search. Must be >= 1.
    min_pct : int
        Minimum threshold percentage (e.g., 0 for 0%).
    max_pct : int
        Maximum threshold percentage (e.g., 100 for 100%).
    step : int
        Increment for threshold search in percentage points.
    close_col : str
        Name of the close price column.
    high_col : str
        Name of the high price column.

    Returns
    -------
    optimal_params : dict
        Dictionary with keys 'method_1' through 'method_8', where each value
        is a dictionary containing:

        - 'period' : int
            Optimal lookback period.
        - 'threshold' : float
            Optimal threshold as percentage (e.g., 5.0 for 5%).
        - 'instances' : int
            Number of positive instances at optimal point.
        - 'pct_of_max' : float
            Percentage of maximum possible instances (at 0% threshold).

    Notes
    -----
    **Pareto Optimization:**

    A solution (period, threshold) is Pareto-optimal if no other solution
    exists that is better in at least one objective and no worse in all others.

    Mathematically, solution A dominates solution B if:

    .. math::
        \\forall i: f_i(A) \\geq f_i(B) \\text{ and } \\exists j: f_j(A) > f_j(B)

    where :math:`f_i` are the objective functions (with appropriate sense).

    **Objectives:**

    1. **Maximize threshold**: Higher thresholds mean more stringent
       requirements, leading to higher precision predictions.

       .. math::
           \\text{maximize } \\theta

    2. **Minimize period**: Shorter holding periods reduce risk and capital
       allocation time.

       .. math::
           \\text{minimize } N

    3. **Minimize deviation from target balance**: Stay close to the desired
       class distribution.

       .. math::
           \\text{minimize } |\\text{instances}(\\theta, N) - \\text{target}|

    **Two-Pass Algorithm:**

    1. **First pass**: Calculate maximum instances at 0% threshold for each
       period to establish the upper bound and compute target instance counts.
    2. **Second pass**: Evaluate all (period, threshold) combinations, storing
       results for each method along with deviation from target.
    3. **Selection**: Apply Pareto dominance to find optimal set, then select
       solution closest to target class balance.

    **Computational Complexity:**

    Time complexity: O(P × T × M) where:

    - P = number of periods = (max_period - min_period) / period_step
    - T = number of thresholds = (max_pct - min_pct) / step
    - M = number of methods = 8

    Space complexity: O(P × T × M) to store all candidate solutions.

    **Common Pitfalls:**

    - If target_class_balance is too low (e.g., < 0.01), the search may not
      find feasible solutions within the threshold range.
    - Large search spaces (many periods and thresholds) increase computation
      time.
    - Insufficient data relative to max_period can lead to poor statistics.

    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> df = pd.DataFrame({
    ...     'Close': 100 + np.cumsum(np.random.randn(100)),
    ...     'High': 102 + np.cumsum(np.random.randn(100))
    ... })
    >>> optimal = _find_optimal_params_pareto(
    ...     df, target_class_balance=0.5, min_period=1, max_period=10,
    ...     period_step=1, min_pct=0, max_pct=20, step=1,
    ...     close_col='Close', high_col='High'
    ... )
    >>> sorted(optimal['method_1'])
    ['instances', 'pct_of_max', 'period', 'threshold']

    See Also
    --------
    paretoset : Library for computing Pareto-optimal sets.
    _find_optimal_params_elbow : Alternative elbow method for manual mode.

    References
    ----------
    .. [1] Pareto, V. (1906). "Manuale di economia politica"
    .. [2] Deb, K., et al. (2002). "A fast and elitist multiobjective genetic
           algorithm: NSGA-II"
    """
    # First pass: find maximum instances for each method at 0% threshold
    max_instances = [0] * 8

    for period in range(min_period, max_period + 1, period_step):
        future_close, future_high, future_max_close, future_max_high = \
            _calculate_future_values(df, period, close_col, high_col)

        # Calculate max instances for each method at threshold=0
        method_instances = [
            (future_close / df[close_col] - 1 >= 0).sum(),
            (future_close / df[high_col] - 1 >= 0).sum(),
            (future_high / df[close_col] - 1 >= 0).sum(),
            (future_high / df[high_col] - 1 >= 0).sum(),
            (future_max_close / df[close_col] - 1 >= 0).sum(),
            (future_max_close / df[high_col] - 1 >= 0).sum(),
            (future_max_high / df[close_col] - 1 >= 0).sum(),
            (future_max_high / df[high_col] - 1 >= 0).sum()
        ]

        for i in range(8):
            max_instances[i] = max(max_instances[i], method_instances[i])

    # Calculate target instances for each method
    target_instances = [int(mi * target_class_balance) for mi in max_instances]

    # Second pass: collect all combinations for each method
    results = {f'method_{i+1}': [] for i in range(8)}

    for period in range(min_period, max_period + 1, period_step):
        future_close, future_high, future_max_close, future_max_high = \
            _calculate_future_values(df, period, close_col, high_col)

        for threshold_pct in range(min_pct, max_pct, step):
            threshold = threshold_pct / 100

            # Calculate instances for all 8 methods
            instances = [
                (future_close / df[close_col] - 1 >= threshold).sum(),
                (future_close / df[high_col] - 1 >= threshold).sum(),
                (future_high / df[close_col] - 1 >= threshold).sum(),
                (future_high / df[high_col] - 1 >= threshold).sum(),
                (future_max_close / df[close_col] - 1 >= threshold).sum(),
                (future_max_close / df[high_col] - 1 >= threshold).sum(),
                (future_max_high / df[close_col] - 1 >= threshold).sum(),
                (future_max_high / df[high_col] - 1 >= threshold).sum()
            ]

            # Store results for each method
            for i in range(8):
                deviation = abs(instances[i] - target_instances[i])
                pct_of_max = (
                    instances[i] / max_instances[i] * 100
                ) if max_instances[i] > 0 else 0

                results[f'method_{i+1}'].append({
                    'period': period,
                    'threshold': threshold_pct,
                    'instances': instances[i],
                    'pct_of_max': pct_of_max,
                    'deviation': deviation
                })

    # Find Pareto-optimal solution for each method
    optimal_params = {}

    for method_idx in range(8):
        method_key = f'method_{method_idx + 1}'
        method_results = pd.DataFrame(results[method_key])

        # Pareto optimization: maximize threshold, minimize period, minimize deviation
        data = method_results[['threshold', 'period', 'deviation']].values
        mask = paretoset(data, sense=["max", "min", "min"])
        pareto_df = method_results[mask].copy()

        # Select solution closest to target_class_balance
        pareto_df['abs_dev_from_target'] = (
            pareto_df['pct_of_max'] / 100 - target_class_balance
        ).abs()
        best_solution = pareto_df.loc[pareto_df['abs_dev_from_target'].idxmin()]

        optimal_params[method_key] = {
            'period': int(best_solution['period']),
            'threshold': float(best_solution['threshold']),
            'instances': int(best_solution['instances']),
            'pct_of_max': float(best_solution['pct_of_max'])
        }

    return optimal_params


def _find_optimal_params_elbow(
        df: pd.DataFrame,
        lookback_periods: int,
        min_pct: int,
        max_pct: int,
        step: int,
        close_col: str,
        high_col: str
) -> Dict:
    """
    Find optimal thresholds using the elbow method with a fixed lookback period.

    The elbow method identifies the "knee" or "elbow" point on the curve of
    instance counts versus threshold percentages. This point represents an
    optimal trade-off where increasing the threshold further results in
    disproportionately fewer positive instances, while decreasing it provides
    diminishing returns.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing OHLC price data.
    lookback_periods : int
        Fixed number of periods to look forward. Must be >= 1 and < len(df).
    min_pct : int
        Minimum threshold percentage (e.g., 0 for 0%).
    max_pct : int
        Maximum threshold percentage (e.g., 100 for 100%).
    step : int
        Increment for threshold search in percentage points. Must be >= 1.
    close_col : str
        Name of the close price column.
    high_col : str
        Name of the high price column.

    Returns
    -------
    optimal_params : dict
        Dictionary with keys 'method_1' through 'method_8', where each value
        is a dictionary containing:

        - 'period' : int
            Lookback period (same value `lookback_periods` for all methods).
        - 'threshold' : float
            Elbow threshold as percentage (e.g., 6.0 for 6%).
        - 'instances' : int
            Number of positive instances at the elbow point.
        - 'pct_of_max' : float
            Percentage of maximum possible instances (at 0% threshold).

    Notes
    -----
    **Elbow Method:**

    The elbow method finds the point of maximum curvature on a curve. For a
    decreasing convex curve (instances vs. threshold), the elbow represents
    the threshold where:

    - Below the elbow: Small threshold increases cause gradual instance decreases
    - Above the elbow: Small threshold increases cause steep instance decreases

    **Mathematical Formulation:**

    Given points on curve :math:`(x_i, y_i)` for thresholds :math:`x_i` and
    instance counts :math:`y_i`, the elbow maximizes the perpendicular distance
    from the curve to the line connecting the endpoints.

    For a line from :math:`(x_0, y_0)` to :math:`(x_n, y_n)`, the distance
    from point :math:`(x_i, y_i)` is:

    .. math::
        d_i = \\frac{|(y_n - y_0)x_i - (x_n - x_0)y_i + x_n y_0 - y_n x_0|}
                   {\\sqrt{(y_n - y_0)^2 + (x_n - x_0)^2}}

    The elbow is at :math:`\\arg\\max_i d_i`.

    **KneeLocator Implementation:**

    This function uses the `kneed` library's `KneeLocator` class with:

    - `curve='convex'`: The instance count curve is convex (curves downward)
    - `direction='decreasing'`: Instance counts decrease as threshold increases

    **Fallback Behavior:**

    If no elbow is detected (e.g., monotonic curve with no clear knee), the
    function defaults to `min_pct` to ensure a valid result.
**Advantages:** - Simple and interpretable method - No need to specify target class balance - Automatic threshold selection based on curve shape **Limitations:** - Fixed lookback period (no period optimization) - May not find elbow if curve is too smooth or too noisy - Assumes convex decreasing relationship Examples -------- >>> import pandas as pd >>> import numpy as np >>> np.random.seed(42) >>> df = pd.DataFrame({ ... 'Close': 100 + np.cumsum(np.random.randn(100)), ... 'High': 102 + np.cumsum(np.random.randn(100)) ... }) >>> optimal = _find_optimal_params_elbow( ... df, lookback_periods=5, min_pct=0, max_pct=50, step=1, ... close_col='Close', high_col='High' ... ) >>> print(optimal['method_1']) {'period': 5, 'threshold': 6.0, 'instances': 22, 'pct_of_max': 22.4} >>> >>> # All methods use the same period >>> periods = [optimal[f'method_{i}']['period'] for i in range(1, 9)] >>> print(all(p == 5 for p in periods)) True See Also -------- KneeLocator : Knee/elbow detection algorithm from kneed library. _find_optimal_params_pareto : Pareto optimization for auto mode. References ---------- .. [1] Satopää, V., et al. (2011). "Finding a 'Kneedle' in a Haystack: Detecting Knee Points in System Behavior" .. [2] Zhao, Q., et al. (2008). 
"Knee Point Detection in BIC for Detecting the Number of Clusters" """ # Calculate future values for fixed period future_close, future_high, future_max_close, future_max_high = \ _calculate_future_values(df, lookback_periods, close_col, high_col) # Calculate instance counts across threshold range for all 8 methods x = np.array(range(min_pct, max_pct, step)) methods_data = [ [(future_close / df[close_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_close / df[high_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_high / df[close_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_high / df[high_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_max_close / df[close_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_max_close / df[high_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_max_high / df[close_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)], [(future_max_high / df[high_col] - 1 >= i / 100).sum() for i in range(min_pct, max_pct, step)] ] # Find elbow points for all methods optimal_params = {} for method_idx, pct_lst in enumerate(methods_data): method_key = f'method_{method_idx + 1}' # Find elbow using KneeLocator kn = KneeLocator(x, pct_lst, curve='convex', direction='decreasing') elbow_threshold = kn.elbow if kn.elbow is not None else min_pct # Get instances at elbow elbow_idx = int((elbow_threshold - min_pct) / step) instances = pct_lst[elbow_idx] if elbow_idx < len(pct_lst) else 0 # Calculate max instances (at threshold=0) max_inst = pct_lst[0] if len(pct_lst) > 0 else 1 pct_of_max = (instances / max_inst * 100) if max_inst > 0 else 0 optimal_params[method_key] = { 'period': lookback_periods, 'threshold': float(elbow_threshold), 'instances': int(instances), 'pct_of_max': float(pct_of_max) } return optimal_params def _generate_targets( df: pd.DataFrame, optimal_params: Dict, 
    close_col: str,
    high_col: str
) -> pd.DataFrame:
    """
    Generate eight binary target columns using optimized parameters.

    Creates boolean target columns based on whether future price gains meet or
    exceed specified thresholds. Each target uses its own optimized period and
    threshold determined by either Pareto optimization or the elbow method.

    Parameters
    ----------
    df : pd.DataFrame
        Original input DataFrame containing price data.
    optimal_params : dict
        Dictionary with keys 'method_1' through 'method_8', where each value
        contains 'period' and 'threshold' parameters. Thresholds are given in
        percent (e.g. 5.0 for 5%) and converted to decimals internally.
    close_col : str
        Name of the close price column.
    high_col : str
        Name of the high price column.

    Returns
    -------
    targets_df : pd.DataFrame
        DataFrame with same index as input `df`, containing 8 boolean columns
        named 'Target_1' through 'Target_8'. Each column is True where the
        gain meets or exceeds the threshold and False otherwise. Rows whose
        future value is NaN also evaluate to False, because a comparison
        against NaN is False.

    Notes
    -----
    **Target Formulas:**

    For time t, lookback period N, threshold :math:`\\theta` (as a decimal
    fraction):

    - **Target_1**: :math:`\\frac{\\text{Close}[t+N]}{\\text{Close}[t]} - 1 \\geq \\theta`
    - **Target_2**: :math:`\\frac{\\text{Close}[t+N]}{\\text{High}[t]} - 1 \\geq \\theta`
    - **Target_3**: :math:`\\frac{\\text{High}[t+N]}{\\text{Close}[t]} - 1 \\geq \\theta`
    - **Target_4**: :math:`\\frac{\\text{High}[t+N]}{\\text{High}[t]} - 1 \\geq \\theta`
    - **Target_5**: :math:`\\frac{\\max_{i=1}^{N}(\\text{Close}[t+i])}{\\text{Close}[t]} - 1 \\geq \\theta`
    - **Target_6**: :math:`\\frac{\\max_{i=1}^{N}(\\text{Close}[t+i])}{\\text{High}[t]} - 1 \\geq \\theta`
    - **Target_7**: :math:`\\frac{\\max_{i=1}^{N}(\\text{High}[t+i])}{\\text{Close}[t]} - 1 \\geq \\theta`
    - **Target_8**: :math:`\\frac{\\max_{i=1}^{N}(\\text{High}[t+i])}{\\text{High}[t]} - 1 \\geq \\theta`

    **Entry Price Interpretation:**

    - Close[t]: Assume entry at closing price of day t
    - High[t]: Assume entry at intraday high of day t (a more conservative
      gain estimate, since the entry price is at least as high as the close)

    **Exit Price Interpretation:**

    - Point targets (1-4): Exit at specific
      time t+N
    - Maximum targets (5-8): Exit at any optimal time in [t+1, t+N]

    **NaN Handling:**

    The last N rows of each target lack a complete future window, so their
    gains cannot be calculated. Where the future value is NaN, the ``>=``
    comparison evaluates to False, so these rows appear as False in the
    boolean output rather than NaN. They should be excluded from training
    and evaluation.

    **Memory Efficiency:**

    Each target calculation recomputes future values for its specific period,
    which allows different periods per target but requires multiple passes
    over the data. For large datasets, consider caching intermediate results.

    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame({
    ...     'Close': [100, 102, 101, 105, 103, 107, 106, 110],
    ...     'High': [101, 103, 102, 106, 104, 108, 107, 111]
    ... })
    >>> optimal_params = {
    ...     'method_1': {'period': 2, 'threshold': 5.0},
    ...     'method_2': {'period': 2, 'threshold': 5.0},
    ...     'method_3': {'period': 2, 'threshold': 5.0},
    ...     'method_4': {'period': 2, 'threshold': 5.0},
    ...     'method_5': {'period': 3, 'threshold': 7.0},
    ...     'method_6': {'period': 3, 'threshold': 7.0},
    ...     'method_7': {'period': 3, 'threshold': 7.0},
    ...     'method_8': {'period': 3, 'threshold': 7.0}
    ... }
    >>> targets = _generate_targets(df, optimal_params, 'Close', 'High')
    >>> print(targets.columns.tolist())
    ['Target_1', 'Target_2', 'Target_3', 'Target_4', 'Target_5', 'Target_6', 'Target_7', 'Target_8']
    >>> print(targets['Target_1'].dtype)
    bool
    >>>
    >>> # Check target calculation for first row
    >>> # Close[2]/Close[0] = 101/100 = 1.01, gain = 1%
    >>> # 1% < 5% threshold, so False
    >>> print(targets['Target_1'].iloc[0])
    False

    See Also
    --------
    _calculate_future_values : Computes future price series.
    """
    targets = {}

    for method_idx in range(8):
        method_key = f'method_{method_idx + 1}'
        params = optimal_params[method_key]
        period = params['period']
        threshold = params['threshold'] / 100  # Convert percentage to decimal

        # Calculate future values for this method's period
        future_close, future_high, future_max_close, future_max_high = \
            _calculate_future_values(df, period, close_col, high_col)

        # Generate target based on method
        if method_idx == 0:  # Method 1: Close[N]/Close[0]
            targets[f'Target_{method_idx + 1}'] = (future_close / df[close_col] - 1 >= threshold)
        elif method_idx == 1:  # Method 2: Close[N]/High[0]
            targets[f'Target_{method_idx + 1}'] = (future_close / df[high_col] - 1 >= threshold)
        elif method_idx == 2:  # Method 3: High[N]/Close[0]
            targets[f'Target_{method_idx + 1}'] = (future_high / df[close_col] - 1 >= threshold)
        elif method_idx == 3:  # Method 4: High[N]/High[0]
            targets[f'Target_{method_idx + 1}'] = (future_high / df[high_col] - 1 >= threshold)
        elif method_idx == 4:  # Method 5: MaxClose/Close[0]
            targets[f'Target_{method_idx + 1}'] = (future_max_close / df[close_col] - 1 >= threshold)
        elif method_idx == 5:  # Method 6: MaxClose/High[0]
            targets[f'Target_{method_idx + 1}'] = (future_max_close / df[high_col] - 1 >= threshold)
        elif method_idx == 6:  # Method 7: MaxHigh/Close[0]
            targets[f'Target_{method_idx + 1}'] = (future_max_high / df[close_col] - 1 >= threshold)
        elif method_idx == 7:  # Method 8: MaxHigh/High[0]
            targets[f'Target_{method_idx + 1}'] = (future_max_high / df[high_col] - 1 >= threshold)

    return pd.DataFrame(targets, index=df.index)
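

# --- Illustrative sketch (not part of the original module) -------------------
# The "NaN Handling" note in `_generate_targets` says that the trailing rows of
# each target lack a complete future window and should be excluded from
# training and evaluation. One possible way to make those rows explicit is to
# mark them as missing. This sketch assumes that each method's 'period' in
# `optimal_params` equals the number of trailing rows without future data.
# `_mask_incomplete_rows` is a hypothetical helper name introduced only for
# this example.
import pandas as pd


def _mask_incomplete_rows(targets, optimal_params):
    """Return a copy of `targets` with the last `period` rows of each column set to <NA>."""
    # Nullable boolean dtype so that <NA> can be stored next to True/False
    masked = targets.astype('boolean')
    for method_idx in range(8):
        column = f'Target_{method_idx + 1}'
        period = int(optimal_params[f'method_{method_idx + 1}']['period'])
        if period > 0:
            # Positional slice: the final `period` rows of this one column
            masked.iloc[-period:, masked.columns.get_loc(column)] = pd.NA
    return masked


# Usage, under the same assumptions:
#     masked = _mask_incomplete_rows(targets, optimal_params)
#     usable = masked.dropna()  # keeps only rows where every target is defined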