Case Study · Local AI · On-device inference

Local LLM Trading Engine: Ollama + RAG on M3 24GB

Multi-model decision engine 100% on-device: qwen2.5:14b primary + llama3.1:8b fallback + LLaVA:7b vision for chart screenshots + ChromaDB RAG. Zero API cost, zero data leakage, full control over prompts and decision system.

Type

Local-LLM decision engine

Hardware

MacBook Air M3 · 24GB unified RAM

Stack

Ollama · qwen2.5:14b · LLaVA:7b · ChromaDB

Outcome

0 violations, 76% NO_TRADE

01 · Pain Point

Cloud LLM for trading — three reasons not to use it

(1) Latency. With hundreds of signals per day, every request to OpenAI/Claude adds seconds → you miss entry points. (2) Cost. $0.01-0.10 per signal × 1000 signals = $100/day on inference alone. (3) Strategy leakage. Trading edges in prompts are IP. Sending them to OpenAI means sharing your competitive advantage.

Needed a fully local engine capable of receiving ICT/Smart Money Concepts signals (BHM-3BP, iFVG, IMPULSIVE, REVERSAL), analyzing confluence (HTF bias + displacement + liquidity), and producing BUY/SELL/NO_TRADE decisions with a confidence metric.

02 · Stack

Three models + RAG + bridge

Primary · 14.8B params

qwen2.5:14b

9.0 GB · latency 4-11 sec · quality ⭐⭐⭐

Primary model for the trade decision. Reasoning over structured-input signals + historical context from RAG.

Fallback · 8B params

llama3.1:8b

4.9 GB · latency 7-10 sec · quality ⭐⭐

Backup on qwen timeouts and for fast auxiliary queries (market classification, sanity check).

Vision · 7B params

LLaVA:7b

4.7 GB · image analysis

Vision model for chart-screenshot analysis: pattern recognition, visual validation of order block / FVG / liquidity sweep.

Knowledge

ChromaDB · RAG

embeddings + similarity search

Stores historical trades (win/loss + context). Pulls 5-10 nearest analogs for the current signal → gives the LLM context for the decision.

03 · Pipeline

Decision engine with confidence score

For each incoming signal (BHM-3BP / iFVG / SM-OB-CONT / SM-ASIA-SWEEP), the decision engine assembles:

Tier classification of the signal (A/B/C/D) → base confidence 0.45-0.65
Confluence bonuses: HTF bias (+0.05) + displacement (+0.03) + liquidity sweep (+0.02)
Historical modifier via RAG: 5-10 similar trades are pulled, win rate is computed
Regulatory gating: time-of-day, session (Asia/London/NY), seasonality

The LLM produces a final confidence and decision (BUY/SELL/NO_TRADE). Threshold 0.70 — below that is cut off as "borderline". Code guardrails check lot size, distance to TP/SL, correlation with open positions.

Example (SM_OB_CONT BTCUSDT NY session)

Base confidence (Tier B): 0.58

Full confluence bonus: +0.10 (HTF bias + displacement + liquidity)

Historical modifier: +0.04 (win rate 58%)

Final confidence: 0.72 ≥ 0.70 → SELL

04 · Results

Phase-3A dry run · 100 signals

NO_TRADE rate

76%

within the 60-75% corridor → strict noise filter

Guardrails violations

all safety checks passed, no lot-size / SL / correlation violations

Inference cost

vs OpenAI/Claude API ~$100/day at comparable volume

Where it fits for clients: a local-LLM stack is ideal where data must not leave the company perimeter: healthcare (FZ-152 PII), banking, defense-affiliated, corporate security, legal-tech. Deployment on an M2/M3 Mac mini (~₽1000/mo electricity), Apple Silicon with unified memory handles 14B models at comfortable latency.

"The same approach I use myself — applied to you." Custom RAG over your documents + multi-model fallback + local inference. No external APIs, no IP leakage, full control.

Готовы начать?

Аудит за 5 000 ₽ — с конкретным отчётом и сметой

Расскажу что внедрить в вашем бизнесе в первую очередь, какая будет окупаемость, и нужен ли вообще AI для вашей задачи (иногда — нет).

Записаться на аудит Написать в Telegram

Или просто напишите свой вопрос — отвечу в течение 2 часов