Basic Concepts

This guide introduces the fundamental concepts you need in order to use Rhoa effectively.

Pandas DataFrame Extension

Rhoa extends pandas using the accessor API, which allows adding custom methods to pandas objects.

How Accessors Work

import pandas as pd
import rhoa  # This registers the accessors

# Series and DataFrame objects now expose the .rhoa.indicators accessor
prices = pd.Series([100, 102, 105, 103, 107])
sma = prices.rhoa.indicators.sma(window_size=3)

# DataFrame objects also expose the .rhoa.plots accessor
# (df, predictions, and targets are assumed to be defined elsewhere)
df.rhoa.plots.signal(y_pred=predictions, y_true=targets)

Key Points:
- You must import rhoa to register the accessors
- Accessors feel like native pandas methods
- They return standard pandas objects (Series, DataFrame)
- Perfect for method chaining
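To show the mechanism behind the key points above, here is a minimal sketch of how a pandas accessor can be registered. The accessor name "demo" and the DemoAccessor class are hypothetical and are not Rhoa's source code; only pd.api.extensions.register_series_accessor is the real pandas API.

```python
import pandas as pd


# Hypothetical accessor for illustration; this is not how Rhoa is implemented.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series: pd.Series) -> None:
        # pandas passes in the Series the accessor was called on.
        self._series = series

    def sma(self, window_size: int) -> pd.Series:
        """Simple moving average over a trailing window."""
        return self._series.rolling(window_size).mean()


prices = pd.Series([100, 102, 105, 103, 107], dtype=float)
sma = prices.demo.sma(window_size=3)

# The result is a plain Series, so ordinary pandas methods chain directly.
smoothed_change = prices.demo.sma(window_size=3).diff().dropna()
```

The decorator runs when the module is imported, which is why an import is needed before an accessor becomes available.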

Technical Indicators

Technical indicators are mathematical calculations based on historical price, volume, or open interest.

Categories of Indicators

Trend Indicators

Show the direction of price movement (up, down, sideways).

Examples: SMA, EMA, ADX, Parabolic SAR

Momentum Indicators

Measure the speed and strength of price movements.

Examples: RSI, MACD, Stochastic, Williams %R

Volatility Indicators

Measure the magnitude of price fluctuations (high or low volatility), regardless of direction.

Examples: ATR, Bollinger Bands, Standard Deviation

Oscillators

Indicators that fluctuate between fixed levels or around a central value.

Examples: RSI (0-100), Stochastic (0-100), CCI (unbounded, oscillates around zero)

Indicator Properties

Window Size (Period)

Most indicators are computed over a rolling window. A larger window produces a smoother line but reacts to price changes with more lag.

sma_20 = df.rhoa.indicators.sma(window_size=20)  # Slower, smoother
sma_5 = df.rhoa.indicators.sma(window_size=5)    # Faster, noisier
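The smoothness-versus-lag trade-off can be seen on a synthetic step series. This sketch uses plain pandas rolling means as stand-ins for the SMA indicator, and the price data is made up for illustration.

```python
import pandas as pd

# Synthetic step series: flat at 100, then a jump to 110 at position 30.
prices = pd.Series([100.0] * 30 + [110.0] * 30)

# Plain pandas rolling means, used here as stand-ins for the SMA indicator.
sma_5 = prices.rolling(5).mean()
sma_20 = prices.rolling(20).mean()

# First position at which each average has fully caught up with the new level.
caught_up_5 = int(((sma_5 - 110.0).abs() < 1e-9).idxmax())
caught_up_20 = int(((sma_20 - 110.0).abs() < 1e-9).idxmax())

# Number of bars each average needs after the jump to reach the new level.
delay_5 = caught_up_5 - 30
delay_20 = caught_up_20 - 30
```

The longer window needs more bars to reflect the new price level, which is the lag described above.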

NaN Values

Indicators return NaN for the initial periods, where the window does not yet contain enough data.

sma_10 = df.rhoa.indicators.sma(window_size=10)
# First 9 values will be NaN
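One common way to handle the warm-up rows is to drop them before building a feature table. The sketch below assumes a plain pandas rolling mean as a stand-in for the indicator and uses synthetic prices.

```python
import pandas as pd

prices = pd.Series(range(100, 130), dtype=float)  # 30 synthetic prices
sma_10 = prices.rolling(10).mean()                # stand-in for the SMA indicator

# Count the warm-up rows where the window is not yet full.
n_missing = int(sma_10.isna().sum())

# Drop the warm-up rows so every remaining row has a complete feature set.
features = pd.DataFrame({"close": prices, "sma_10": sma_10}).dropna()
```

Dropping rows shortens the dataset by window_size - 1 observations, so long windows need correspondingly more history.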

Lagging vs. Leading

Lagging: Based on past prices (most indicators)
Leading: Attempts to predict future moves (rare, often unreliable)

Time Series Considerations

Financial time series have unique properties that affect analysis.

Stationarity

Definition: A stationary series has constant mean, variance, and autocorrelation over time.

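Why It Matters: Most ML models assume that the statistical properties of the training data carry over to future data. Raw prices violate this assumption because their level and variance drift over time.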

Solutions:
- Use returns instead of prices
- Use indicators (already somewhat stationary)
- Apply differencing or detrending

# Non-stationary (price levels)
prices = df['Close']

# More stationary (returns)
returns = df['Close'].pct_change()

# Bounded indicator (RSI stays within 0-100, so it is closer to stationary)
rsi = df.rhoa.indicators.rsi(14)
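A rough way to see the difference is to compare the mean of the first and second half of each series. The sketch below uses a synthetic price path that grows 1% per step; it is an informal check under that assumption, not a formal stationarity test.

```python
import numpy as np
import pandas as pd

# Synthetic prices growing 1% per step (illustrative only, not market data).
prices = pd.Series(100.0 * 1.01 ** np.arange(200))
returns = prices.pct_change().dropna()


def half_means(series: pd.Series) -> tuple[float, float]:
    """Return the mean of the first half and the mean of the second half."""
    mid = len(series) // 2
    return float(series.iloc[:mid].mean()), float(series.iloc[mid:].mean())


price_first, price_second = half_means(prices)   # means drift apart
ret_first, ret_second = half_means(returns)      # means stay the same
```

The price level more than doubles between the two halves, while the return series keeps the same mean throughout.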

Autocorrelation

Financial data is often autocorrelated (today’s price depends on yesterday’s).

Implications:
- Can’t use random train/test splits
- Must use time-based splits
- Cross-validation requires special handling (TimeSeriesSplit)

Correct Split:

# Time-based split (correct)
split_date = '2024-01-01'
train = df[df['Date'] < split_date]
test = df[df['Date'] >= split_date]

Wrong Split:

# Random split (WRONG for time series!)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df)  # Don't do this!
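For cross-validation, scikit-learn's TimeSeriesSplit produces folds in which the training indices always precede the test indices. A minimal sketch on a placeholder feature matrix of 12 chronologically ordered rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in chronological order

splitter = TimeSeriesSplit(n_splits=3)
folds = [(train_idx, test_idx) for train_idx, test_idx in splitter.split(X)]

for train_idx, test_idx in folds:
    # In each fold, the last training index comes before the first test index.
    print("train up to", train_idx.max(), "| test from", test_idx.min())
```

Each successive fold extends the training window forward in time, so the model is never evaluated on data older than what it was trained on.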

Look-Ahead Bias

Definition: Using future information in training that wouldn’t be available in production.

Common Mistakes:
1. Normalizing before splitting (uses future statistics)
2. Generating targets on full dataset then splitting
3. Using future price data in features

Avoiding Bias:

# split() is a placeholder for any time-based split, such as the date-based one above.

# WRONG: Normalize then split
df_norm = (df - df.mean()) / df.std()
train, test = split(df_norm)

# CORRECT: Split then normalize
train, test = split(df)
train_mean, train_std = train.mean(), train.std()
train_norm = (train - train_mean) / train_std
test_norm = (test - train_mean) / train_std  # Use TRAIN statistics!
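To make the split-then-normalize pattern concrete, here is a small runnable sketch. The six prices and the split date are made-up values chosen for illustration.

```python
import pandas as pd

# Tiny synthetic frame indexed by date (values are illustrative only).
df = pd.DataFrame(
    {"Close": [100.0, 102.0, 104.0, 106.0, 120.0, 130.0]},
    index=pd.date_range("2023-12-29", periods=6, freq="D"),
)

# Time-based split: everything before the split date is training data.
split_date = pd.Timestamp("2024-01-01")
train = df[df.index < split_date]
test = df[df.index >= split_date]

# Statistics come from the training period only.
train_mean, train_std = train.mean(), train.std()
train_norm = (train - train_mean) / train_std
test_norm = (test - train_mean) / train_std
```

The normalized test values fall well outside the training range here, which is expected: the test period contains price levels the training statistics never saw.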

Machine Learning for Trading

Applying ML to trading requires special considerations.

Binary Classification

Most trading strategies can be framed as binary classification:
- Class 1: Buy signal (price will increase enough to profit)
- Class 0: No signal (price won’t increase enough, or will decrease)

Key Decisions:
- What threshold defines “enough to profit”?
- What time horizon to consider?
- How to handle transaction costs?

Rhoa’s target generation addresses these questions systematically.
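As a rough illustration of how the three decisions interact, the sketch below labels a bar as class 1 when the forward return over a chosen horizon exceeds a threshold plus an assumed cost. The function make_binary_target and its parameters are hypothetical; this is not Rhoa's target generation.

```python
import pandas as pd


def make_binary_target(
    close: pd.Series, horizon: int, threshold: float, cost: float = 0.0
) -> pd.Series:
    """Label 1 when the return over the next `horizon` bars exceeds threshold + cost."""
    forward_return = close.shift(-horizon) / close - 1.0
    target = (forward_return > threshold + cost).astype(float)
    # The last `horizon` rows have no future price, so their label is unknown.
    target[forward_return.isna()] = float("nan")
    return target


close = pd.Series([100.0, 101.0, 105.0, 104.0, 110.0, 109.0])
target = make_binary_target(close, horizon=2, threshold=0.03, cost=0.002)
```

Rows with an unknown label should be dropped before training rather than treated as class 0.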

Class Imbalance

Real trading data often has severe class imbalance:
- True opportunities may be rare (5-20% of days)
- Too many positives = frequent trading, high costs
- Too few positives = model rarely trades

Solutions:
- Adjust target thresholds (Rhoa’s auto mode)
- Use appropriate metrics (precision, recall, F1, not just accuracy)
- Consider cost-sensitive learning

# Control class balance
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3  # 30% positive instances
)
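To see how the choice of threshold drives the class balance, the following sketch picks the threshold as a quantile of the forward returns. It illustrates the general idea only and makes no claim about how Rhoa's auto mode works internally; the prices are synthetic.

```python
import pandas as pd

# Synthetic closing prices (illustrative only).
close = pd.Series(
    [100.0, 101.0, 99.0, 103.0, 102.0, 106.0, 104.0, 108.0, 107.0, 111.0, 110.0]
)
forward_return = (close.shift(-1) / close - 1.0).dropna()  # 10 one-step returns

# Choose the threshold so that roughly 30% of returns lie above it.
target_class_balance = 0.3
threshold = forward_return.quantile(1.0 - target_class_balance)

target = (forward_return > threshold).astype(int)
positive_rate = float(target.mean())
```

Computing the threshold on the training period only avoids leaking information from the test period into the labels.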

Evaluation Metrics

For trading strategies, each metric answers a different question:

Precision (How many signals are correct?): Critical for avoiding false trades and transaction costs.
Recall (How many opportunities are caught?): Important for not missing profitable trades.
F1 Score (Harmonic mean of precision and recall): Balances both concerns.
Sharpe Ratio (Risk-adjusted returns): Measures the return earned per unit of risk, so it evaluates the strategy itself rather than only the classifier.

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)  # Want this high!
recall = recall_score(y_test, y_pred)        # And this!
f1 = f1_score(y_test, y_pred)                # Compromise
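The Sharpe ratio is not part of sklearn.metrics, but it is simple to compute from a series of strategy returns. The sketch below assumes daily returns, 252 trading days per year, and a zero risk-free rate by default; the function name and the sample returns are illustrative.

```python
import numpy as np
import pandas as pd


def sharpe_ratio(
    returns: pd.Series, periods_per_year: int = 252, risk_free_rate: float = 0.0
) -> float:
    """Annualized Sharpe ratio from per-period returns (sample standard deviation)."""
    excess = returns - risk_free_rate / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std())


# Made-up daily strategy returns for illustration.
daily_returns = pd.Series([0.01, -0.005, 0.007, 0.002, -0.003, 0.004])
sharpe = sharpe_ratio(daily_returns)
```

With only a handful of observations the estimate is very noisy, so it should be computed over the full out-of-sample period.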

Data Requirements

Quality Over Quantity

Minimum Data:
- At least 200-500 data points for basic indicators
- 1000+ data points for reliable ML training
- More data for longer-period indicators

Data Quality:
- No missing values (or properly handled)
- Correct OHLC relationships (High ≥ max(Open, Close) and Low ≤ min(Open, Close))
- Adjusted for splits and dividends
- Consistent time intervals
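Some of these checks are easy to automate. The sketch below is a hypothetical helper, check_ohlc, that assumes the column names Date, Open, High, Low, and Close; it does not cover split or dividend adjustments, which cannot be verified from the prices alone.

```python
import pandas as pd


def check_ohlc(df: pd.DataFrame) -> dict[str, bool]:
    """Run basic sanity checks on a frame with Date/Open/High/Low/Close columns."""
    body_high = df[["Open", "Close"]].max(axis=1)
    body_low = df[["Open", "Close"]].min(axis=1)
    return {
        "no_missing": not df[["Open", "High", "Low", "Close"]].isna().any().any(),
        "high_is_max": bool((df["High"] >= body_high).all()),
        "low_is_min": bool((df["Low"] <= body_low).all()),
        "dates_increasing": bool(df["Date"].is_monotonic_increasing),
    }


# Made-up rows; the last one is deliberately invalid (High below Open).
df = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=3, freq="D"),
    "Open": [100.0, 102.0, 104.0],
    "High": [105.0, 106.0, 103.0],
    "Low": [98.0, 100.0, 101.0],
    "Close": [103.0, 104.0, 102.0],
})
report = check_ohlc(df)
```

Running a check like this right after loading the data catches malformed rows before they distort any indicator.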

Required Columns

For Basic Indicators:
- Close price (minimum requirement)

For Advanced Indicators:
- Open, High, Low, Close (OHLC)
- Volume (optional but recommended)

For ML:
- Date/Timestamp column
- OHLC data
- Sufficient history

# Minimal DataFrame structure ("..." stands for the remaining values;
# each column needs 100 entries to match the 100 dates)
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=100),
    'Open': [100, 102, ...],
    'High': [105, 106, ...],
    'Low': [98, 100, ...],
    'Close': [103, 104, ...],
    'Volume': [1000000, 1200000, ...]
})

Common Pitfalls

Overfitting

Using too many features
Training on too little data
Not using proper cross-validation
Optimizing on test set

Data Leakage

Using future information
Improper normalization
Including target in features

Unrealistic Assumptions

Ignoring transaction costs
Assuming perfect execution
Not accounting for slippage
Ignoring liquidity constraints
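Even small per-trade frictions change the picture. The sketch below subtracts an assumed cost and slippage from a few made-up per-trade returns; all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical numbers for illustration only.
gross_returns = pd.Series([0.004, -0.002, 0.006, 0.001])  # per-trade returns
cost_per_trade = 0.001   # assumed round-trip commission
slippage = 0.0005        # assumed average slippage per trade

# Every trade pays both frictions, whether it wins or loses.
net_returns = gross_returns - cost_per_trade - slippage

# Compounded totals before and after costs.
gross_total = float((1 + gross_returns).prod() - 1)
net_total = float((1 + net_returns).prod() - 1)
```

A trade with a small gross gain can turn into a net loss, which is why precision matters more as trading frequency rises.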

Poor Risk Management

No stop losses
Over-leveraging
No position sizing
Correlated positions

Next Steps

Now that you understand the basics, dive into specific topics:

Indicators Guide - Learn about each indicator
Targets Guide - Master target generation
Visualization Guide - Evaluate your models
Examples - See practical examples