Basic Concepts
==============

This guide introduces fundamental concepts you need to understand before using Rhoa effectively.

Pandas DataFrame Extension
---------------------------

Rhoa extends pandas using the **accessor API**, which allows adding custom methods to pandas objects.

How Accessors Work
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd
   import rhoa  # This registers the accessors

   # Now DataFrame and Series objects have .indicators accessor
   prices = pd.Series([100, 102, 105, 103, 107])
   sma = prices.rhoa.indicators.sma(window_size=3)

   # DataFrame objects have .plots accessor
   df.rhoa.plots.signal(y_pred=predictions, y_true=targets)

**Key Points:**
- You must import rhoa to register the accessors
- Accessors feel like native pandas methods
- They return standard pandas objects (Series, DataFrame)
- Perfect for method chaining

Technical Indicators
--------------------

Technical indicators are mathematical calculations based on historical price, volume, or open interest.

Categories of Indicators
~~~~~~~~~~~~~~~~~~~~~~~~~

**Trend Indicators**
   Show the direction of price movement (up, down, sideways).

   Examples: SMA, EMA, ADX, Parabolic SAR

**Momentum Indicators**
   Measure the speed and strength of price movements.

   Examples: RSI, MACD, Stochastic, Williams %R

**Volatility Indicators**
   Measure the rate of price changes (high/low volatility).

   Examples: ATR, Bollinger Bands, Standard Deviation

**Oscillators**
   Bounded indicators that fluctuate between fixed levels.

   Examples: RSI (0-100), Stochastic (0-100), CCI

Indicator Properties
~~~~~~~~~~~~~~~~~~~~

**Window Size (Period)**
   Most indicators use a rolling window. Larger windows = smoother but more lag.

   .. code-block:: python

      sma_20 = df.rhoa.indicators.sma(window_size=20)  # Slower, smoother
      sma_5 = df.rhoa.indicators.sma(window_size=5)    # Faster, noisier

**NaN Values**
   Indicators create NaN for initial periods where insufficient data exists.

   .. code-block:: python

      sma_10 = df.rhoa.indicators.sma(window_size=10)
      # First 9 values will be NaN

**Lagging vs. Leading**
   - **Lagging**: Based on past prices (most indicators)
   - **Leading**: Attempts to predict future moves (rare, often unreliable)

Time Series Considerations
---------------------------

Financial time series have unique properties that affect analysis.

Stationarity
~~~~~~~~~~~~

**Definition**: A stationary series has constant mean, variance, and autocorrelation over time.

**Why It Matters**: Most ML models assume stationarity. Raw prices are non-stationary.

**Solutions**:
- Use returns instead of prices
- Use indicators (already somewhat stationary)
- Apply differencing or detrending

.. code-block:: python

   # Non-stationary (price levels)
   prices = df['Close']

   # More stationary (returns)
   returns = df['Close'].pct_change()

   # Stationary indicator
   rsi = df.rhoa.indicators.rsi(14)

Autocorrelation
~~~~~~~~~~~~~~~

Financial data is often autocorrelated (today's price depends on yesterday's).

**Implications**:
- Can't use random train/test splits
- Must use time-based splits
- Cross-validation requires special handling (TimeSeriesSplit)

**Correct Split**:

.. code-block:: python

   # Time-based split (correct)
   split_date = '2024-01-01'
   train = df[df['Date'] < split_date]
   test = df[df['Date'] >= split_date]

**Wrong Split**:

.. code-block:: python

   # Random split (WRONG for time series!)
   from sklearn.model_selection import train_test_split
   train, test = train_test_split(df)  # Don't do this!

Look-Ahead Bias
~~~~~~~~~~~~~~~

**Definition**: Using future information in training that wouldn't be available in production.

**Common Mistakes**:
1. Normalizing before splitting (uses future statistics)
2. Generating targets on full dataset then splitting
3. Using future price data in features

**Avoiding Bias**:

.. code-block:: python

   # WRONG: Normalize then split
   df_norm = (df - df.mean()) / df.std()
   train, test = split(df_norm)

   # CORRECT: Split then normalize
   train, test = split(df)
   train_norm = (train - train.mean()) / train.std()
   test_norm = (test - train.mean()) / train.std()  # Use TRAIN statistics!

Machine Learning for Trading
-----------------------------

Applying ML to trading requires special considerations.

Binary Classification
~~~~~~~~~~~~~~~~~~~~~~

Most trading strategies can be framed as binary classification:
- **Class 1**: Buy signal (price will increase enough to profit)
- **Class 0**: No signal (price won't increase enough, or will decrease)

**Key Decisions**:
- What threshold defines "enough to profit"?
- What time horizon to consider?
- How to handle transaction costs?

Rhoa's target generation addresses these questions systematically.

Class Imbalance
~~~~~~~~~~~~~~~

Real trading data often has severe class imbalance:
- True opportunities may be rare (5-20% of days)
- Too many positives = frequent trading, high costs
- Too few positives = model rarely trades

**Solutions**:
- Adjust target thresholds (Rhoa's auto mode)
- Use appropriate metrics (precision, recall, F1, not just accuracy)
- Consider cost-sensitive learning

.. code-block:: python

   # Control class balance
   targets, meta = generate_target_combinations(
       df,
       mode='auto',
       target_class_balance=0.3  # 30% positive instances
   )

Evaluation Metrics
~~~~~~~~~~~~~~~~~~

For trading strategies, different metrics matter:

**Precision (How many signals are correct?)**
   Critical for avoiding false trades and transaction costs.

**Recall (How many opportunities are caught?)**
   Important for not missing profitable trades.

**F1 Score (Harmonic mean of precision and recall)**
   Balances both concerns.

**Sharpe Ratio (Risk-adjusted returns)**
   The ultimate metric for trading strategies.

.. code-block:: python

   from sklearn.metrics import precision_score, recall_score, f1_score

   precision = precision_score(y_test, y_pred)  # Want this high!
   recall = recall_score(y_test, y_pred)        # And this!
   f1 = f1_score(y_test, y_pred)                # Compromise

Data Requirements
-----------------

Quality Over Quantity
~~~~~~~~~~~~~~~~~~~~~

**Minimum Data**:
- At least 200-500 data points for basic indicators
- 1000+ data points for reliable ML training
- More data for longer-period indicators

**Data Quality**:
- No missing values (or properly handled)
- Correct OHLC relationships (High ≥ Close ≥ Low)
- Adjusted for splits and dividends
- Consistent time intervals

Required Columns
~~~~~~~~~~~~~~~~

**For Basic Indicators**:
- Close price (minimum requirement)

**For Advanced Indicators**:
- Open, High, Low, Close (OHLC)
- Volume (optional but recommended)

**For ML**:
- Date/Timestamp column
- OHLC data
- Sufficient history

.. code-block:: python

   # Minimal DataFrame structure
   df = pd.DataFrame({
       'Date': pd.date_range('2023-01-01', periods=100),
       'Open': [100, 102, ...],
       'High': [105, 106, ...],
       'Low': [98, 100, ...],
       'Close': [103, 104, ...],
       'Volume': [1000000, 1200000, ...]
   })

Common Pitfalls
---------------

**Overfitting**
   - Using too many features
   - Training on too little data
   - Not using proper cross-validation
   - Optimizing on test set

**Data Leakage**
   - Using future information
   - Improper normalization
   - Including target in features

**Unrealistic Assumptions**
   - Ignoring transaction costs
   - Assuming perfect execution
   - Not accounting for slippage
   - Ignoring liquidity constraints

**Poor Risk Management**
   - No stop losses
   - Over-leveraging
   - No position sizing
   - Correlated positions

Next Steps
----------

Now that you understand the basics, dive into specific topics:

- :doc:`indicators_guide` - Learn about each indicator
- :doc:`targets_guide` - Master target generation
- :doc:`visualization_guide` - Evaluate your models
- :doc:`/examples/index` - See practical examples