Basic Concepts
This guide introduces fundamental concepts you need to understand before using Rhoa effectively.
Pandas DataFrame Extension
Rhoa extends pandas using the accessor API, which allows adding custom methods to pandas objects.
How Accessors Work
import pandas as pd
import rhoa # This registers the accessors
# DataFrame and Series objects now expose indicators via .rhoa.indicators
prices = pd.Series([100, 102, 105, 103, 107])
sma = prices.rhoa.indicators.sma(window_size=3)
# DataFrame objects also expose plotting helpers via .rhoa.plots
# (df, predictions, and targets are assumed to be defined elsewhere)
df.rhoa.plots.signal(y_pred=predictions, y_true=targets)
Key Points:
- You must import rhoa to register the accessors
- Accessors feel like native pandas methods
- They return standard pandas objects (Series, DataFrame)
- They work well in method chains
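For readers curious about the mechanism, pandas provides registration decorators in `pandas.api.extensions`. The following is a minimal sketch of a custom Series accessor. The `demo` namespace and the `DemoAccessor` class are hypothetical names used for illustration; they are not part of Rhoa, and Rhoa's own implementation may differ.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("demo")  # hypothetical namespace
class DemoAccessor:
    """Illustrative accessor only; not Rhoa's implementation."""

    def __init__(self, series):
        # pandas passes the Series the accessor was called on.
        self._series = series

    def sma(self, window_size):
        # Simple moving average over a rolling window.
        return self._series.rolling(window_size).mean()


prices = pd.Series([100, 102, 105, 103, 107])
sma = prices.demo.sma(window_size=3)  # returns a regular pandas Series
```

Because the decorator runs when the module is imported, the accessor only exists after the import. This is why the import of rhoa is required even when nothing is called on the module directly.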
Technical Indicators
Technical indicators are mathematical calculations based on historical price, volume, or open interest.
Categories of Indicators
- Trend Indicators
Show the direction or strength of a price trend (up, down, sideways).
Examples: SMA, EMA, ADX, Parabolic SAR
- Momentum Indicators
Measure the speed and strength of price movements.
Examples: RSI, MACD, Stochastic, Williams %R
- Volatility Indicators
Measure the size of price fluctuations (high or low volatility).
Examples: ATR, Bollinger Bands, Standard Deviation
- Oscillators
Bounded indicators that fluctuate between fixed levels.
Examples: RSI (0-100), Stochastic (0-100), CCI
Indicator Properties
- Window Size (Period)
Most indicators use a rolling window. Larger windows give a smoother line but react more slowly to new prices (more lag).
sma_20 = df.rhoa.indicators.sma(window_size=20) # Slower, smoother
sma_5 = df.rhoa.indicators.sma(window_size=5) # Faster, noisier
- NaN Values
Indicators return NaN for the initial periods, where the rolling window does not yet have enough data.
sma_10 = df.rhoa.indicators.sma(window_size=10) # First 9 values will be NaN
- Lagging vs. Leading
Lagging: Based on past prices (most indicators)
Leading: Attempts to predict future moves (rare, often unreliable)
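The window-size and NaN behaviour described above can be reproduced with plain pandas. This sketch uses `Series.rolling(...).mean()` as a stand-in for an SMA; it is an assumption, not something confirmed by this guide, that Rhoa's `sma` behaves the same way. The price series is made up.

```python
import pandas as pd

# 30 steadily rising prices: 100, 101, ..., 129 (synthetic data)
prices = pd.Series([float(p) for p in range(100, 130)])

sma_5 = prices.rolling(window=5).mean()
sma_20 = prices.rolling(window=20).mean()

# A window of size n leaves n - 1 leading NaN values.
nan_5 = int(sma_5.isna().sum())
nan_20 = int(sma_20.isna().sum())

# On a rising series, the longer window trails further behind the latest price.
lag_5 = prices.iloc[-1] - sma_5.iloc[-1]
lag_20 = prices.iloc[-1] - sma_20.iloc[-1]

print(nan_5, nan_20, lag_5, lag_20)
```

The longer window produces both more leading NaN values and a larger gap to the current price, which is the smoothness-versus-lag trade-off in concrete terms.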
Time Series Considerations
Financial time series have unique properties that affect analysis.
Stationarity
Definition: A stationary series has constant mean, variance, and autocorrelation over time.
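Why It Matters: Many ML models implicitly assume that training data and future data share the same statistical properties. Raw price levels are non-stationary, so that assumption fails.
Solutions:
- Use returns instead of prices
- Use bounded indicators such as RSI (moving averages like SMA still follow the price level)
- Apply differencing or detrending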
# Non-stationary (price levels)
prices = df['Close']
# More stationary (returns)
returns = df['Close'].pct_change()
# Bounded oscillator (typically closer to stationary than price levels)
rsi = df.rhoa.indicators.rsi(14)
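As a small illustration of the transformations listed above, the following sketch computes simple returns, log returns, and first differences with pandas and NumPy only. The price values are made up.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 105.0, 103.0, 107.0])  # synthetic prices

simple_returns = close.pct_change()   # p_t / p_{t-1} - 1
log_returns = np.log(close).diff()    # ln(p_t) - ln(p_{t-1})
first_difference = close.diff()       # p_t - p_{t-1}

# Each transform loses the first observation, which has no predecessor.
print(simple_returns.isna().sum(), log_returns.isna().sum(), first_difference.isna().sum())
```

Returns are scale-free, so a 2% move looks the same at a price of 10 as at a price of 1000. First differences remove the level but still grow with the price.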
Autocorrelation
Financial data is often autocorrelated (today’s price depends on yesterday’s).
Implications:
- Can’t use random train/test splits
- Must use time-based splits
- Cross-validation requires special handling (TimeSeriesSplit)
Correct Split:
# Time-based split (correct)
split_date = '2024-01-01'
train = df[df['Date'] < split_date]
test = df[df['Date'] >= split_date]
Wrong Split:
# Random split (WRONG for time series!)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df) # Don't do this!
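For cross-validation, scikit-learn's `TimeSeriesSplit` produces folds in which the training rows always come before the test rows. A minimal sketch on placeholder data, assuming the rows are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations (placeholder data)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # Every training index precedes every test index in each fold.
    print(f"train: 0-{train_idx[-1]}, test: {test_idx[0]}-{test_idx[-1]}")
```

Each successive fold extends the training window forward, so the model is always evaluated on data that comes after everything it was trained on.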
Look-Ahead Bias
Definition: Using future information in training that wouldn’t be available in production.
Common Mistakes:
1. Normalizing before splitting (uses future statistics)
2. Generating targets on full dataset then splitting
3. Using future price data in features
Avoiding Bias:
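# Assumes df is sorted by date; normalize numeric columns only
features = df.select_dtypes('number')
split_idx = int(len(features) * 0.8) # Example: first 80% of rows for training
# WRONG: Normalize then split (test-period statistics leak into training)
df_norm = (features - features.mean()) / features.std()
train, test = df_norm.iloc[:split_idx], df_norm.iloc[split_idx:]
# CORRECT: Split then normalize
train, test = features.iloc[:split_idx], features.iloc[split_idx:]
train_norm = (train - train.mean()) / train.std()
test_norm = (test - train.mean()) / train.std() # Use TRAIN statistics!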
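The second mistake in the list, generating targets on the full dataset and then splitting, is easier to see with a forward-looking label. In this sketch the `horizon` of 2 rows, the 3% threshold, and the split position are all hypothetical values chosen for illustration; nothing here is Rhoa's target-generation code.

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 105.0, 103.0, 107.0, 110.0, 108.0, 111.0])
horizon = 2  # hypothetical look-ahead window, in rows

# Forward return: the price `horizon` rows ahead, relative to the current price.
future_return = close.shift(-horizon) / close - 1
target = (future_return > 0.03).astype(int)  # hypothetical 3% threshold

# The last `horizon` rows have no future price, so they cannot be labeled.
labeled = pd.DataFrame({"close": close, "target": target})[future_return.notna()]

split_idx = 4  # hypothetical split position
# Drop the final `horizon` training rows: their labels depend on test-period prices.
train = labeled.iloc[: split_idx - horizon]
test = labeled.iloc[split_idx:]
```

Without the gap, the labels for rows 2 and 3 would be computed from the prices in rows 4 and 5, which belong to the test period.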
Machine Learning for Trading
Applying ML to trading requires special considerations.
Binary Classification
Most trading strategies can be framed as binary classification:
- Class 1: Buy signal (price will increase enough to profit)
- Class 0: No signal (price won’t increase enough, or will decrease)
Key Decisions:
- What threshold defines “enough to profit”?
- What time horizon to consider?
- How to handle transaction costs?
Rhoa’s target generation addresses these questions systematically.
Class Imbalance
Real trading data often has severe class imbalance:
- True opportunities may be rare (5-20% of days)
- Too many positives = frequent trading, high costs
- Too few positives = model rarely trades
Solutions:
- Adjust target thresholds (Rhoa’s auto mode)
- Use appropriate metrics (precision, recall, F1, not just accuracy)
- Consider cost-sensitive learning
# Control class balance
targets, meta = generate_target_combinations(
    df,
    mode='auto',
    target_class_balance=0.3 # 30% positive instances
)
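One way to build intuition for a target class balance is to treat the threshold as a quantile of future returns. The sketch below illustrates that idea only. It is not taken from Rhoa's source, `generate_target_combinations` may work quite differently, and the returns are simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated forward returns; real data would come from price history.
future_returns = pd.Series(rng.normal(0.0, 0.02, size=1000))

desired_balance = 0.3  # aim for roughly 30% positive labels

# The (1 - balance) quantile is the return level exceeded by ~30% of rows.
threshold = future_returns.quantile(1 - desired_balance)
target = (future_returns > threshold).astype(int)

positive_share = target.mean()
print(f"threshold={threshold:.4f}, positive share={positive_share:.2%}")
```

A threshold chosen this way describes the sample it was computed on. To avoid look-ahead bias, it would need to be derived from the training period only.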
Evaluation Metrics
For trading strategies, different metrics matter:
- Precision (How many signals are correct?)
Critical for avoiding false trades and transaction costs.
- Recall (How many opportunities are caught?)
Important for not missing profitable trades.
- F1 Score (Harmonic mean of precision and recall)
Balances both concerns.
- Sharpe Ratio (Risk-adjusted returns)
Measures the returns a strategy earns relative to their volatility, so it reflects trading outcomes rather than label accuracy.
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_test, y_pred) # Want this high!
recall = recall_score(y_test, y_pred) # And this!
f1 = f1_score(y_test, y_pred) # Compromise
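The Sharpe ratio is not part of `sklearn.metrics`, because it is computed from strategy returns rather than labels. A minimal sketch of the usual annualized form follows. It assumes daily returns, 252 trading periods per year, and a zero risk-free rate by default. The `sharpe_ratio` helper and the sample returns are illustrative and not part of Rhoa.

```python
import numpy as np
import pandas as pd


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns."""
    # Convert the annual risk-free rate to a per-period rate.
    excess = returns - risk_free_rate / periods_per_year
    # Mean excess return over its sample standard deviation, annualized.
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()


daily_returns = pd.Series([0.01, -0.005, 0.007, 0.002, -0.003])  # made-up values
annualized_sharpe = sharpe_ratio(daily_returns)
```

A sample this short is for demonstration only. Estimates from a handful of returns are far too noisy to interpret.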
Data Requirements
Quality Over Quantity
Minimum Data:
- At least 200-500 data points for basic indicators
- 1000+ data points for reliable ML training
- More data for longer-period indicators
Data Quality:
- No missing values (or properly handled)
- Correct OHLC relationships (High ≥ max(Open, Close) and Low ≤ min(Open, Close))
- Adjusted for splits and dividends
- Consistent time intervals
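Several of these quality checks are easy to automate. The helper below is a hedged sketch using pandas only. `check_ohlc` is a hypothetical function, not a Rhoa API, and it assumes the column names `Date`, `Open`, `High`, `Low`, and `Close`.

```python
import pandas as pd


def check_ohlc(df):
    """Return a small report of common data-quality problems."""
    price_cols = ["Open", "High", "Low", "Close"]
    # High must be the largest price of the bar, Low the smallest.
    bad_high = df["High"] < df[["Open", "Close"]].max(axis=1)
    bad_low = df["Low"] > df[["Open", "Close"]].min(axis=1)
    return {
        "missing_values": int(df[price_cols].isna().sum().sum()),
        "bad_ohlc_rows": int((bad_high | bad_low).sum()),
        "duplicate_dates": int(df["Date"].duplicated().sum()),
        "dates_sorted": bool(df["Date"].is_monotonic_increasing),
    }


sample = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=3),
    "Open": [100.0, 102.0, 104.0],
    "High": [105.0, 106.0, 103.0],  # last High is below Open and Close
    "Low": [98.0, 100.0, 101.0],
    "Close": [103.0, 104.0, 105.0],
})
report = check_ohlc(sample)
```

The report flags the third row, whose High lies below both Open and Close. Checking for split and dividend adjustments needs external reference data and is not covered here.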
Required Columns
For Basic Indicators:
- Close price (minimum requirement)
For Advanced Indicators:
- Open, High, Low, Close (OHLC)
- Volume (optional but recommended)
For ML:
- Date/Timestamp column
- OHLC data
- Sufficient history
# Minimal DataFrame structure (each "..." stands for the remaining values;
# every column needs one value per date)
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=100),
    'Open': [100, 102, ...],
    'High': [105, 106, ...],
    'Low': [98, 100, ...],
    'Close': [103, 104, ...],
    'Volume': [1000000, 1200000, ...]
})
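For experimenting without real market data, a synthetic frame with the same columns can be generated. This is a sketch with made-up random-walk prices; the seed, the volatility, and the volume range are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Random-walk closing prices starting near 100.
close = 100 + rng.normal(0, 1, n).cumsum()
# Each bar opens at the previous close (the first bar opens at its own close).
open_ = np.concatenate(([close[0]], close[:-1]))
# High and Low are pushed outside the Open/Close range so the bars stay valid.
high = np.maximum(open_, close) + rng.uniform(0, 1, n)
low = np.minimum(open_, close) - rng.uniform(0, 1, n)

df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=n),
    "Open": open_,
    "High": high,
    "Low": low,
    "Close": close,
    "Volume": rng.integers(1_000_000, 2_000_000, n),
})
```

Synthetic data like this is useful for checking that code runs. It says nothing about how a strategy would behave on real markets.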
Common Pitfalls
- Overfitting
Using too many features
Training on too little data
Not using proper cross-validation
Optimizing on test set
- Data Leakage
Using future information
Improper normalization
Including target in features
- Unrealistic Assumptions
Ignoring transaction costs
Assuming perfect execution
Not accounting for slippage
Ignoring liquidity constraints
- Poor Risk Management
No stop losses
Over-leveraging
No position sizing
Correlated positions
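To make the transaction-cost pitfall concrete, the sketch below compares gross and net strategy returns. The returns, the signal, and the cost of 0.1% per position change are made-up values, and the cost model is deliberately simplistic (a flat fee per change, no slippage).

```python
import pandas as pd

asset_returns = pd.Series([0.01, -0.02, 0.015, 0.005, -0.01])  # made-up values
signal = pd.Series([1, 1, 0, 1, 0])  # position held during each period
cost_per_trade = 0.001               # hypothetical 0.1% per position change

gross = signal * asset_returns

# A trade happens whenever the position changes; the first entry counts too.
trades = signal.diff().abs().fillna(signal.iloc[0])
net = gross - trades * cost_per_trade

print(f"gross: {gross.sum():.4f}, net: {net.sum():.4f}, trades: {int(trades.sum())}")
```

Even a small per-trade cost adds up quickly when the signal flips often, which is why precision tends to matter more than raw signal count.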
Next Steps
Now that you understand the basics, dive into specific topics:
Indicators Guide - Learn about each indicator
Targets Guide - Master target generation
Visualization Guide - Evaluate your models
Examples - See practical examples