Basic Concepts

This guide introduces the fundamental concepts you need to understand to use Rhoa effectively.

Pandas DataFrame Extension

Rhoa extends pandas using the accessor API, which allows adding custom methods to pandas objects.

How Accessors Work

import pandas as pd
import rhoa  # This registers the accessors

# Series and DataFrame objects now expose indicators via .rhoa.indicators
prices = pd.Series([100, 102, 105, 103, 107])
sma = prices.rhoa.indicators.sma(window_size=3)

# DataFrame objects also expose .rhoa.plots
# (df, predictions, and targets are assumed to be defined already)
df.rhoa.plots.signal(y_pred=predictions, y_true=targets)

Key Points:
  • You must import rhoa to register the accessors

  • Accessors feel like native pandas methods

  • They return standard pandas objects (Series, DataFrame)

  • Perfect for method chaining
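The registration step can be illustrated with pandas' own extension API. The sketch below is not Rhoa's source code: the accessor name `demo` and the class `DemoAccessor` are made up for illustration, and only `pd.api.extensions.register_series_accessor` is a real pandas function.

```python
import pandas as pd


# Hypothetical accessor for illustration only; Rhoa's real classes may differ.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series: pd.Series):
        # pandas passes in the Series the accessor was reached from
        self._series = series

    def sma(self, window_size: int) -> pd.Series:
        # Simple moving average built on pandas' rolling window
        return self._series.rolling(window_size).mean()


prices = pd.Series([100, 102, 105, 103, 107])
sma = prices.demo.sma(window_size=3)
```

Because the method returns an ordinary Series, the result can be chained with any pandas method, for example `prices.demo.sma(window_size=3).dropna()`.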

Technical Indicators

Technical indicators are mathematical calculations based on historical price, volume, or open interest.

Categories of Indicators

Trend Indicators

Show the direction (up, down, sideways) or strength of a price trend.

Examples: SMA, EMA, ADX, Parabolic SAR

Momentum Indicators

Measure the speed and strength of price movements.

Examples: RSI, MACD, Stochastic, Williams %R

Volatility Indicators

Measure the magnitude of price fluctuations (high or low volatility).

Examples: ATR, Bollinger Bands, Standard Deviation

Oscillators

Indicators that fluctuate around a central level or between fixed bounds.

Examples: RSI (0-100), Stochastic (0-100), CCI (unbounded, but usually between -100 and +100)

Indicator Properties

Window Size (Period)

Most indicators are computed over a rolling window. A larger window gives a smoother line but reacts to price changes with more lag.

sma_20 = df.rhoa.indicators.sma(window_size=20)  # Slower, smoother
sma_5 = df.rhoa.indicators.sma(window_size=5)    # Faster, noisier
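The trade-off between smoothing and lag can be seen with plain pandas. This sketch uses `Series.rolling(...).mean()` as a stand-in for an SMA and a made-up price series with a single jump; it makes no claim about how Rhoa computes its indicators internally.

```python
import pandas as pd

# Made-up prices: flat at 100, then a jump to 110 at position 10
prices = pd.Series([100.0] * 10 + [110.0] * 10)

fast = prices.rolling(3).mean()   # short window
slow = prices.rolling(8).mean()   # long window

# First position at which each average has fully caught up with the new level
fast_caught_up = int(fast.ge(110.0 - 1e-9).idxmax())
slow_caught_up = int(slow.ge(110.0 - 1e-9).idxmax())
```

Here the 3-period average reaches the new level at position 12, while the 8-period average needs until position 17.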

NaN Values

Indicators return NaN for the initial periods, where the window does not yet contain enough data.

sma_10 = df.rhoa.indicators.sma(window_size=10)
# First 9 values will be NaN
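A quick way to see the warm-up period, using pandas' rolling mean as a stand-in for an indicator, together with one common way to handle it before model training (dropping the incomplete rows):

```python
import pandas as pd

prices = pd.Series(range(100, 115), dtype=float)  # 15 made-up prices

sma_10 = prices.rolling(10).mean()
n_warmup = int(sma_10.isna().sum())   # window_size - 1 leading NaN values

# Drop the warm-up rows so that features and prices stay aligned
features = pd.DataFrame({"close": prices, "sma_10": sma_10}).dropna()
```

Dropping is usually safer than filling, since a filled value would pretend the window had data it did not have.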

Lagging vs. Leading
  • Lagging: Based on past prices (most indicators)

  • Leading: Attempts to predict future moves (rare, often unreliable)

Time Series Considerations

Financial time series have unique properties that affect analysis.

Stationarity

Definition: A stationary series has constant mean, variance, and autocorrelation over time.

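Why It Matters: Most ML models assume that the statistical properties of their inputs stay stable over time. Raw prices are non-stationary, so a model fit on one price range may not transfer to another.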

Solutions:
  • Use returns instead of prices

  • Use indicators (already somewhat stationary)

  • Apply differencing or detrending

# Non-stationary (price levels)
prices = df['Close']

# More stationary (returns)
returns = df['Close'].pct_change()

# Bounded indicator (closer to stationary than raw prices)
rsi = df.rhoa.indicators.rsi(14)
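To make the difference concrete, the sketch below uses a made-up price series that grows by exactly 1% per step. The price level drifts, so its mean depends on which part of the sample you look at, while the returns stay constant.

```python
import numpy as np
import pandas as pd

# Made-up prices growing 1% per step (illustrative, not market data)
prices = pd.Series(100.0 * 1.01 ** np.arange(200))

returns = prices.pct_change().dropna()         # simple returns
log_returns = np.log(prices).diff().dropna()   # differencing of log prices

# Compare the mean of the first and second half of each series
price_gap = prices.iloc[100:].mean() - prices.iloc[:100].mean()
return_gap = returns.iloc[100:].mean() - returns.iloc[:100].mean()
```

`price_gap` is large while `return_gap` is essentially zero. Real returns are noisier than this toy series, but the same transformation applies.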

Autocorrelation

Financial data is often autocorrelated (today’s price depends on yesterday’s).

Implications:
  • Can’t use random train/test splits

  • Must use time-based splits

  • Cross-validation requires special handling (TimeSeriesSplit)

Correct Split:

# Time-based split (correct)
split_date = '2024-01-01'
train = df[df['Date'] < split_date]
test = df[df['Date'] >= split_date]

Wrong Split:

# Random split (WRONG for time series!)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df)  # Don't do this!
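For cross-validation, scikit-learn's `TimeSeriesSplit` produces folds in which every training index precedes every test index. A minimal sketch on a made-up frame (the data and the choice of four splits are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Made-up, chronologically ordered data
df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=100),
    "Close": np.linspace(100.0, 120.0, 100),
})

tscv = TimeSeriesSplit(n_splits=4)
folds = []
for train_idx, test_idx in tscv.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # Record where training ends and testing begins in each fold
    folds.append((train["Date"].max(), test["Date"].min()))
```

The rows must already be sorted by date, because `TimeSeriesSplit` works on row positions rather than on the timestamps themselves.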

Look-Ahead Bias

Definition: Using future information in training that wouldn’t be available in production.

Common Mistakes:
  1. Normalizing before splitting (uses future statistics)

  2. Generating targets on full dataset then splitting

  3. Using future price data in features

Avoiding Bias:

# split() is a placeholder for any time-based split, such as the date filter shown earlier

# WRONG: Normalize then split (mean and std include the test period)
df_norm = (df - df.mean()) / df.std()
train, test = split(df_norm)

# CORRECT: Split then normalize
train, test = split(df)
train_norm = (train - train.mean()) / train.std()
test_norm = (test - train.mean()) / train.std()  # Use TRAIN statistics!
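The same rule applies when using scikit-learn: fit the scaler on the training rows only, then reuse it for the test rows. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature column, already in chronological order
X = np.arange(10, dtype=float).reshape(-1, 1)
X_train, X_test = X[:8], X[8:]           # time-based split: first 80% for training

scaler = StandardScaler().fit(X_train)   # statistics come from the training rows only
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)   # reuses the training mean and std
```

Note that `StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas' `.std()` defaults to ddof=1, so the two approaches differ slightly in scale.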

Machine Learning for Trading

Applying ML to trading requires special considerations.

Binary Classification

Most trading strategies can be framed as binary classification:
  • Class 1: Buy signal (price will increase enough to profit)

  • Class 0: No signal (price won’t increase enough, or will decrease)

Key Decisions:
  • What threshold defines “enough to profit”?

  • What time horizon to consider?

  • How to handle transaction costs?

Rhoa’s target generation addresses these questions systematically.
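To show how the three decisions interact, here is a hand-rolled labelling sketch in plain pandas. It is not Rhoa's target generation; the horizon, threshold, and cost values are arbitrary assumptions chosen for the example.

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 103.0, 102.0, 108.0, 107.0])  # made-up prices

horizon = 2        # look 2 bars ahead (illustrative choice)
threshold = 0.025  # require a 2.5% net gain (illustrative choice)
cost = 0.002       # assumed round-trip transaction cost

# Net return realised `horizon` bars after each row
future_return = close.shift(-horizon) / close - 1 - cost

# The last `horizon` rows have no known outcome, so drop them instead of labelling them 0
known = future_return.notna()
labels = (future_return[known] > threshold).astype(int)
```

The labels deliberately use future prices; features paired with them must only use data available up to each row.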

Class Imbalance

Real trading data often has severe class imbalance:
  • True opportunities may be rare (5-20% of days)

  • Too many positives = frequent trading, high costs

  • Too few positives = model rarely trades

Solutions:
  • Adjust target thresholds (Rhoa’s auto mode)

  • Use appropriate metrics (precision, recall, F1, not just accuracy)

  • Consider cost-sensitive learning

# Control class balance
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3  # 30% positive instances
)
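Independently of how the targets were produced, it helps to measure the resulting balance and, for cost-sensitive learning, to derive class weights. The sketch below uses made-up labels and scikit-learn's `compute_class_weight`; it does not depend on Rhoa.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

y = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])  # made-up labels: 2 positives in 10

positive_rate = y.mean()   # share of class 1

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y.to_numpy()
)
class_weight = dict(zip([0, 1], weights))
```

A dictionary like `class_weight` can be passed to the `class_weight` parameter that many scikit-learn classifiers accept, so that errors on the rare class count for more.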

Evaluation Metrics

For trading strategies, different metrics matter:

Precision (How many signals are correct?)

Critical for avoiding false trades and transaction costs.

Recall (How many opportunities are caught?)

Important for not missing profitable trades.

F1 Score (Harmonic mean of precision and recall)

Balances both concerns.

Sharpe Ratio (Risk-adjusted returns)

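Relates a strategy's average excess return to the volatility of its returns, so it reflects trading performance rather than classification accuracy alone.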

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)  # Want this high!
recall = recall_score(y_test, y_pred)        # And this!
f1 = f1_score(y_test, y_pred)                # Compromise
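A small worked example with made-up labels shows how the classification metrics relate to each other. The `sharpe_ratio` helper is a hypothetical function written for this sketch (annualised with 252 periods and the sample standard deviation); it is not part of Rhoa or scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up true labels
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]   # made-up predictions

precision = precision_score(y_test, y_pred)  # 2 correct out of 3 signals
recall = recall_score(y_test, y_pred)        # 2 caught out of 4 opportunities
f1 = f1_score(y_test, y_pred)


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualised Sharpe ratio from per-period returns (sample standard deviation)."""
    excess = np.asarray(returns, dtype=float) - risk_free_rate / periods_per_year
    return excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)


daily_returns = [0.01, -0.01, 0.02, 0.0]   # made-up strategy returns
sharpe = sharpe_ratio(daily_returns)
```

With these numbers, precision is 2/3, recall is 1/2, and F1 is 4/7. A four-day sample is far too short for a meaningful Sharpe ratio; it is used here only to keep the arithmetic checkable.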

Data Requirements

Quality Over Quantity

Minimum Data:
  • At least 200-500 data points for basic indicators

  • 1000+ data points for reliable ML training

  • More data for longer-period indicators

Data Quality:
  • No missing values (or properly handled)

  • Correct OHLC relationships (High ≥ max(Open, Close) and Low ≤ min(Open, Close))

  • Adjusted for splits and dividends

  • Consistent time intervals
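Several of these checks are easy to automate. The helper below, `check_ohlc`, is a hypothetical function written for this guide, not a Rhoa API; it assumes columns named Date, Open, High, Low, and Close.

```python
import pandas as pd


def check_ohlc(df: pd.DataFrame) -> dict:
    """Return a count of rows failing each basic sanity check."""
    body_high = df[["Open", "Close"]].max(axis=1)
    body_low = df[["Open", "Close"]].min(axis=1)
    return {
        "missing_values": int(df.isna().any(axis=1).sum()),
        "high_below_body": int((df["High"] < body_high).sum()),
        "low_above_body": int((df["Low"] > body_low).sum()),
        "dates_not_increasing": int((df["Date"].diff().dropna() <= pd.Timedelta(0)).sum()),
    }


# Made-up sample: the second row has High below Close
sample = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=3),
    "Open": [100.0, 102.0, 101.0],
    "High": [105.0, 103.0, 104.0],
    "Low": [98.0, 100.0, 99.0],
    "Close": [103.0, 104.0, 102.0],
})
report = check_ohlc(sample)
```

Any non-zero count in `report` points at rows worth inspecting before indicators are computed. Split and dividend adjustments cannot be verified this way and depend on the data source.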

Required Columns

For Basic Indicators:
  • Close price (minimum requirement)

For Advanced Indicators:
  • Open, High, Low, Close (OHLC)

  • Volume (optional but recommended)

For ML:
  • Date/Timestamp column

  • OHLC data

  • Sufficient history

# Minimal DataFrame structure
# (each ... stands for the remaining values; every column needs 100 entries)
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=100),
    'Open': [100, 102, ...],
    'High': [105, 106, ...],
    'Low': [98, 100, ...],
    'Close': [103, 104, ...],
    'Volume': [1000000, 1200000, ...]
})
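For experimenting without real market data, a frame with the same columns can be generated synthetically. The values below are random and purely illustrative; the seed, price process, and volume range are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n = 100

close = 100 + rng.normal(0, 1, n).cumsum()
open_ = np.concatenate(([close[0]], close[:-1]))   # open at the previous close
spread = rng.uniform(0.1, 1.0, n)                  # made-up intraday range

df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=n),
    "Open": open_,
    "High": np.maximum(open_, close) + spread,     # High sits above Open and Close
    "Low": np.minimum(open_, close) - spread,      # Low sits below Open and Close
    "Close": close,
    "Volume": rng.integers(1_000_000, 2_000_000, n),
})
```

The construction guarantees valid OHLC relationships, which makes the frame convenient for smoke-testing a pipeline but says nothing about how a strategy would behave on real prices.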

Common Pitfalls

Overfitting
  • Using too many features

  • Training on too little data

  • Not using proper cross-validation

  • Optimizing on test set

Data Leakage
  • Using future information

  • Improper normalization

  • Including target in features

Unrealistic Assumptions
  • Ignoring transaction costs

  • Assuming perfect execution

  • Not accounting for slippage

  • Ignoring liquidity constraints
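The effect of costs is easy to underestimate. A rough sketch, with made-up per-trade returns and assumed commission and slippage rates, shows how they reduce the compounded result:

```python
import pandas as pd

gross_returns = pd.Series([0.012, -0.004, 0.008, 0.003])  # made-up per-trade returns

commission = 0.001   # assumed cost per round trip (0.1%)
slippage = 0.0005    # assumed average slippage per round trip (0.05%)

net_returns = gross_returns - commission - slippage

# Compound the per-trade returns into a total return
gross_total = (1 + gross_returns).prod() - 1
net_total = (1 + net_returns).prod() - 1
```

The more often a strategy trades, the larger the gap between `gross_total` and `net_total` becomes, which is one reason precision matters for trading signals.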

Poor Risk Management
  • No stop losses

  • Over-leveraging

  • No position sizing

  • Correlated positions

Next Steps

Now that you understand the basics, dive into specific topics: