Case Study · Local AI · On-device inference

Local LLM Trading Engine: Ollama + RAG on M3 24GB

A multi-model decision engine that runs 100% on-device: qwen2.5:14b as primary, llama3.1:8b as fallback, LLaVA:7b for vision on chart screenshots, and ChromaDB for RAG. Zero API cost, zero data leakage, and full control over prompts and decision logic.

Type: Local-LLM decision engine
Hardware: MacBook Air M3 · 24GB unified RAM
Stack: Ollama · qwen2.5:14b · LLaVA:7b · ChromaDB
Outcome: 0 violations, 76% NO_TRADE
01 · Pain Point

Cloud LLMs for trading: three reasons not to use them

(1) Latency. With hundreds of signals per day, every request to OpenAI/Claude adds seconds, and you miss entry points. (2) Cost. At $0.01-0.10 per signal, 1,000 signals cost $10-100/day on inference alone. (3) Strategy leakage. Trading edges encoded in prompts are IP. Sending them to a third-party API means sharing your competitive advantage.

What was needed: a fully local engine that receives ICT/Smart Money Concepts signals (BHM-3BP, iFVG, IMPULSIVE, REVERSAL), analyzes confluence (HTF bias + displacement + liquidity), and produces a BUY/SELL/NO_TRADE decision with a confidence score.

02 · Stack

Three models + RAG + bridge

Primary · 14.8B params
qwen2.5:14b
9.0 GB · latency 4-11 sec · quality ⭐⭐⭐

Primary model for the trade decision. Reasoning over structured-input signals + historical context from RAG.

Fallback · 8B params
llama3.1:8b
4.9 GB · latency 7-10 sec · quality ⭐⭐

Backup on qwen timeouts and for fast auxiliary queries (market classification, sanity check).
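
The primary-to-fallback handoff can be sketched as below. This is a minimal sketch, assuming Ollama's default local endpoint and its non-streaming /api/generate call; the function names, the 15-second timeout, and the injectable `call` parameter are illustrative assumptions, not the engine's actual code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODELS = ["qwen2.5:14b", "llama3.1:8b"]             # primary first, then fallback


def ollama_generate(model: str, prompt: str, timeout_s: float) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


def generate_with_fallback(prompt, call=ollama_generate, timeout_s=15.0):
    """Try each model in order; move on when a call times out or cannot connect."""
    last_error = None
    for model in MODELS:
        try:
            return model, call(model, prompt, timeout_s)
        except OSError as exc:  # covers TimeoutError and urllib.error.URLError
            last_error = exc
    raise RuntimeError("all local models failed") from last_error
```

Returning the model name alongside the text lets the caller log which model answered, which helps when comparing decision quality between the primary and the fallback.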

Vision · 7B params
LLaVA:7b
4.7 GB · image analysis

Vision model for chart-screenshot analysis: pattern recognition, visual validation of order block / FVG / liquidity sweep.
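
A chart screenshot can be passed to the vision model through the same Ollama endpoint, which accepts base64-encoded images in an `images` list. The sketch below only builds the request body; the function name and the example question are hypothetical, and `llava:7b` is assumed to be the Ollama tag behind the "LLaVA:7b" label used here.

```python
import base64
import json


def build_vision_request(image_bytes: bytes, question: str) -> bytes:
    """Build the JSON body for a LLaVA call through Ollama's /api/generate."""
    payload = {
        "model": "llava:7b",
        "prompt": question,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload).encode("utf-8")
```

In use, `image_bytes` would come from reading the screenshot file in binary mode, and the returned body would be POSTed to the local Ollama server.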

Knowledge
ChromaDB · RAG
embeddings + similarity search

Stores historical trades (win/loss + context). Pulls 5-10 nearest analogs for the current signal → gives the LLM context for the decision.
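
ChromaDB handles embedding storage and the similarity search itself. To show just the retrieval logic, the sketch below uses a plain-Python cosine similarity as a stand-in, not the ChromaDB API. The trade record layout and the toy two-dimensional embeddings are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_trades(query_vec, trades, k=5):
    """Return the k past trades whose embeddings are most similar to the query."""
    ranked = sorted(trades, key=lambda t: cosine(query_vec, t["embedding"]), reverse=True)
    return ranked[:k]


def win_rate(trades):
    """Fraction of winning trades, or None when there is no history."""
    if not trades:
        return None
    return sum(t["won"] for t in trades) / len(trades)


# Toy history with hypothetical embeddings; real ones have hundreds of dimensions.
history = [
    {"id": "t1", "embedding": [1.0, 0.0], "won": True},
    {"id": "t2", "embedding": [0.9, 0.1], "won": False},
    {"id": "t3", "embedding": [0.0, 1.0], "won": True},
    {"id": "t4", "embedding": [0.8, 0.2], "won": True},
]
analogs = nearest_trades([1.0, 0.0], history, k=3)
print([t["id"] for t in analogs], win_rate(analogs))
```

The win rate of the retrieved analogs is the number that feeds the historical modifier in the pipeline.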

03 · Pipeline

Decision engine with confidence score

For each incoming signal (BHM-3BP / iFVG / SM-OB-CONT / SM-ASIA-SWEEP), the decision engine assembles:

  • Tier classification of the signal (A/B/C/D) → base confidence 0.45-0.65
  • Confluence bonuses: HTF bias (+0.05) + displacement (+0.03) + liquidity sweep (+0.02)
  • Historical modifier via RAG: 5-10 similar trades are pulled, win rate is computed
  • Regulatory gating: time-of-day, session (Asia/London/NY), seasonality

The LLM produces a final confidence and a decision (BUY/SELL/NO_TRADE). The threshold is 0.70: anything below it is cut off as "borderline". Code-level guardrails then check lot size, distance to TP/SL, and correlation with open positions.
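
The code-level guardrails can be sketched as a function that returns the list of violated rules. This is a minimal sketch: the `Order` fields, the rule names, and every numeric limit are placeholder assumptions, since the actual limits are not given here.

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    lot_size: float
    entry: float
    stop_loss: float
    take_profit: float


# All limits below are placeholder assumptions for illustration.
MAX_LOT = 1.0            # largest allowed position size
MIN_SL_DISTANCE = 0.001  # minimum stop distance as a fraction of entry price
MIN_REWARD_RISK = 1.5    # minimum take-profit distance relative to stop distance
MAX_CORRELATION = 0.8    # maximum allowed correlation with any open position


def guardrail_violations(order, open_symbols, correlation):
    """Return a list of violated rules; an empty list means the order may pass.

    `correlation` maps a frozenset of two symbols to a correlation in [-1, 1].
    """
    violations = []
    if not 0 < order.lot_size <= MAX_LOT:
        violations.append("lot_size")
    risk = abs(order.entry - order.stop_loss)
    reward = abs(order.take_profit - order.entry)
    if risk < MIN_SL_DISTANCE * order.entry:
        violations.append("sl_distance")
    if risk == 0 or reward / risk < MIN_REWARD_RISK:
        violations.append("tp_distance")
    for other in open_symbols:
        pair = frozenset((order.symbol, other))
        if abs(correlation.get(pair, 0.0)) > MAX_CORRELATION:
            violations.append(f"correlation:{other}")
    return violations
```

Keeping these checks in plain code rather than in the prompt means a bad model output cannot bypass them: the LLM proposes, and deterministic rules decide whether the order is allowed through.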

Example (SM-OB-CONT, BTCUSDT, NY session)
Base confidence (Tier B): 0.58
Full confluence bonus: +0.10 (HTF bias + displacement + liquidity)
Historical modifier: +0.04 (win rate 58%)
Final confidence: 0.72 ≥ 0.70 → SELL
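
The arithmetic of the example above can be reproduced in a few lines. In the engine itself the LLM produces the final figure, so treat this as a sketch of the scoring components only. The Tier B base and the three bonuses come from the text; the other tier values and the linear win-rate mapping are assumptions chosen to match the example.

```python
# Tier B = 0.58 comes from the example; the other tier values are
# assumptions picked inside the stated 0.45-0.65 range.
TIER_BASE = {"A": 0.65, "B": 0.58, "C": 0.50, "D": 0.45}
CONFLUENCE_BONUS = {"htf_bias": 0.05, "displacement": 0.03, "liquidity_sweep": 0.02}
THRESHOLD = 0.70


def historical_modifier(win_rate: float) -> float:
    """Assumed linear mapping: a 50% win rate gives 0, 58% gives +0.04."""
    return (win_rate - 0.5) * 0.5


def score(tier, confluences, win_rate, direction):
    """Combine the components and apply the 0.70 threshold."""
    confidence = TIER_BASE[tier]
    confidence += sum(CONFLUENCE_BONUS[name] for name in confluences)
    confidence += historical_modifier(win_rate)
    confidence = round(confidence, 2)
    decision = direction if confidence >= THRESHOLD else "NO_TRADE"
    return confidence, decision


print(score("B", ["htf_bias", "displacement", "liquidity_sweep"], 0.58, "SELL"))
# prints (0.72, 'SELL')
```

Dropping any one confluence factor from this example pushes the score to 0.70 or below 0.70 depending on the factor, which shows how close to the threshold a Tier B signal sits even with supportive history.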
04 · Results

Phase-3A dry run · 100 signals

NO_TRADE rate: 76%
Just above the 60-75% corridor → strict noise filter.

Guardrail violations: 0
All safety checks passed: no lot-size, SL, or correlation violations.

Inference cost: $0
Versus up to ~$100/day for the OpenAI/Claude API at comparable volume.

Where it fits for clients: a local-LLM stack is ideal wherever data must not leave the company perimeter: healthcare (personal data under Russia's Federal Law 152-FZ), banking, defense-affiliated organizations, corporate security, and legal tech. It can be deployed on an M2/M3 Mac mini (about ₽1,000/month in electricity); Apple Silicon with unified memory handles 14B models at comfortable latency.

"The same approach I use myself, applied to your business." Custom RAG over your documents + multi-model fallback + local inference. No external APIs, no IP leakage, full control.

Ready to start?

An audit for ₽5,000, with a concrete report and a cost estimate

I'll tell you what to implement in your business first, what the payback will be, and whether you need AI for your task at all (sometimes you don't).

Or just send me your question and I'll reply within 2 hours