Monitoring a Trading Bot: Logs, Metrics, and Alerts
Instrument a trading bot with structured logs, key metrics, a health endpoint, and alerts on backlog and dead-letters — tool-agnostic with Prometheus and Grafana examples.
Engineering and education only. This article covers observability for software systems. It is not investment advice, a trading signal, or a recommendation. Connecting a bot to live execution is your responsibility and risk.
A trading bot fails quietly. Unlike a web app, no user files a ticket when a consumer stalls or a webhook stops arriving; the process simply does less and less until you notice. Observability is how you turn silent failure into a page. The three pillars covered here are logs (what happened), metrics (how much, how fast), and alerts (tell me before it matters).
Structured logging
Plain-text logs are unsearchable at scale. Emit one JSON object per event so a log backend can index and filter on fields.
{"ts":"2026-06-17T20:05:01.412Z","level":"info","event":"order_intent_processed","correlation_id":"req_8c1f","symbol":"ESM6","latency_ms":42,"queue_depth":3}
The two fields that pay for themselves are a correlation id, propagated from the inbound request through every downstream step, and an event name you can group by. With those, "show me every step for req_8c1f" and "count order_intent_rejected in the last hour" both become one query.
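To make those two queries concrete, here is a minimal sketch that runs them over raw JSON lines in plain Python. A real log backend would do this with its own query language; the sample lines, the `order_intent_received` event name, and the helper names are illustrative assumptions, not part of any particular tool.

```python
import json
from collections import Counter

def parse_lines(lines):
    """Yield one dict per JSON log line, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def steps_for(records, correlation_id):
    """Event names logged under one correlation id, ordered by timestamp."""
    matching = [r for r in records if r.get("correlation_id") == correlation_id]
    # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
    return [r.get("event") for r in sorted(matching, key=lambda r: r.get("ts", ""))]

def count_events(records):
    """Number of log lines per event name."""
    return Counter(r["event"] for r in records if "event" in r)

# Hypothetical sample lines, deliberately out of order and with one bad line.
SAMPLE = [
    '{"ts":"2026-06-17T20:05:01.412Z","event":"order_intent_processed","correlation_id":"req_8c1f"}',
    '{"ts":"2026-06-17T20:05:01.370Z","event":"order_intent_received","correlation_id":"req_8c1f"}',
    '{"ts":"2026-06-17T20:05:02.100Z","event":"order_intent_rejected","correlation_id":"req_91aa"}',
    "plain text that a JSON-only pipeline cannot index",
]

records = list(parse_lines(SAMPLE))
trace = steps_for(records, "req_8c1f")
rejected = count_events(records)["order_intent_rejected"]
```

Both queries depend only on the field names being consistent across every log line, which is the real requirement behind "structured".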
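import structlog
# structlog's default renderer prints console text; configure JSON output explicitly.
structlog.configure(processors=[structlog.processors.add_log_level,
                                structlog.processors.TimeStamper(fmt="iso", key="ts"),
                                structlog.processors.JSONRenderer()])
log = structlog.get_logger()
# cid, sym, elapsed, and depth come from the surrounding handler.
log.info("order_intent_processed",
         correlation_id=cid, symbol=sym, latency_ms=elapsed, queue_depth=depth)
Log decisions and transitions, not every line of code. A good log line answers what changed and why.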
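Propagating the correlation id by hand through every function call is easy to get wrong. One possible approach, sketched here with only the standard library, stores the id in a `contextvars.ContextVar` and lets the log formatter read it. The names `current_cid`, `JsonFormatter`, `build_logger`, and `handle_intent` are hypothetical, and the `extra={"fields": ...}` convention is an assumption of this sketch rather than a logging standard.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Correlation id for the request currently being handled; "-" means none is set.
current_cid = contextvars.ContextVar("current_cid", default="-")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object on a single line."""
    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "correlation_id": current_cid.get(),
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("bot")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def handle_intent(logger, cid, symbol):
    """Set the correlation id once; every log call inside picks it up."""
    token = current_cid.set(cid)
    try:
        logger.info("order_intent_received", extra={"fields": {"symbol": symbol}})
        logger.info("order_intent_processed", extra={"fields": {"symbol": symbol}})
    finally:
        current_cid.reset(token)  # do not leak the id into the next request

buffer = io.StringIO()
logger = build_logger(buffer)
handle_intent(logger, "req_8c1f", "ESM6")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Context variables follow the current thread or asyncio task, so concurrent handlers do not see each other's ids. structlog offers a comparable mechanism in its `structlog.contextvars` module.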
The metrics that matter
Pick a small set of numbers that describe the system's health regardless of strategy. The classic four for a queue-driven bot:
- Queue depth / backlog: items waiting to be processed. A steadily rising backlog means consumers cannot keep up. This is your single best leading indicator.
- Error rate: failures per minute, ideally split by type (validation, downstream timeout, unexpected). Track it as a ratio of total events so it scales with load.
- Latency: time from event received to event handled. Watch the p95 and p99, not the mean; tail latency is where trouble hides.
- Heartbeat / liveness: a timestamp the bot updates each loop. If now - last_heartbeat exceeds a threshold, the process is wedged even if it is technically running.
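The arithmetic behind three of these numbers is small enough to sketch directly. The example below uses a nearest-rank percentile and made-up latency samples to show how a mean can look tolerable while the tail is not; the function names and the sample data are illustrative assumptions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def error_ratio(errors, total):
    """Errors as a fraction of all events, so the number scales with load."""
    return errors / total if total else 0.0

def is_wedged(last_heartbeat, now, threshold_s):
    """True when the loop has not ticked within the threshold."""
    return now - last_heartbeat > threshold_s

# Hypothetical window: 95 fast events and 5 very slow ones.
latencies_ms = [40] * 95 + [900] * 5
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 83.0
p95_ms = percentile(latencies_ms, 95)            # 40
p99_ms = percentile(latencies_ms, 99)            # 900
```

In this window the mean of 83 ms hides the fact that one event in twenty took 900 ms; the p99 exposes it immediately.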
Expose these in a format a scraper can read. Prometheus and Grafana are common, but the pattern is tool-agnostic — any system that pulls counters and gauges works.
from prometheus_client import Counter, Gauge, Histogram
events = Counter("bot_events_total", "events handled", ["event", "result"])
backlog = Gauge("bot_queue_depth", "items awaiting processing")
latency = Histogram("bot_handle_seconds", "handle latency")
heartbeat = Gauge("bot_heartbeat_unixtime", "last loop tick")
A /health endpoint
Expose a cheap HTTP endpoint that an orchestrator (Kubernetes, a load balancer, an uptime checker) can poll. Distinguish liveness (is the process up?) from readiness (can it actually do work right now?).
import time
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
STALE_AFTER = 60  # seconds
# last_heartbeat(), broker_reachable(), and db_reachable() are application-specific helpers defined elsewhere.
@app.get("/health")
def health():
    age = time.time() - last_heartbeat()
    deps_ok = broker_reachable() and db_reachable()
    healthy = age < STALE_AFTER and deps_ok
    body = {"status": "ok" if healthy else "degraded",
            "heartbeat_age_s": round(age, 1), "deps_ok": deps_ok}
    # JSONResponse serializes the dict as JSON; str(body) would emit a Python repr instead.
    return JSONResponse(content=body, status_code=200 if healthy else 503)
Return a non-200 status when unhealthy so checkers react automatically. Most probes look only at the status code, so a 200 with "status": "degraded" in the body reads as healthy.
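The endpoint above folds liveness and readiness into a single answer. Where the orchestrator treats them differently (Kubernetes, for example, restarts a container that fails its liveness probe but only stops routing traffic to one that fails readiness), it can help to compute them separately. The following is a framework-agnostic sketch; the function names and the `(status_code, body)` return shape are assumptions, to be wired into whatever HTTP layer the bot already uses.

```python
import time

STALE_AFTER = 60  # seconds, the same staleness threshold as the endpoint above

def liveness(last_heartbeat, now=None):
    """Liveness: is the main loop still ticking? Deliberately ignores dependencies."""
    now = time.time() if now is None else now
    age = now - last_heartbeat
    alive = age < STALE_AFTER
    body = {"status": "ok" if alive else "stale", "heartbeat_age_s": round(age, 1)}
    return (200 if alive else 503), body

def readiness(dependency_checks):
    """Readiness: can the bot do useful work right now?

    dependency_checks maps a name to a zero-argument callable returning a bool.
    A check that raises is counted as failed rather than crashing the probe.
    """
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ready = all(results.values())
    body = {"status": "ok" if ready else "degraded", "deps": results}
    return (200 if ready else 503), body

# Usage with stand-in checks: the broker answers, the database does not.
live_status, live_body = liveness(last_heartbeat=1000.0, now=1010.0)
ready_status, ready_body = readiness({"broker": lambda: True, "db": lambda: False})
```

Keeping dependency checks out of the liveness path avoids restarting a healthy process just because a downstream service is briefly unreachable.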
Alerting on what actually breaks
Alert on symptoms a human must act on, not on every blip. The highest-value alerts for a bot:
- Dead-letter arrivals: any message landing in a dead-letter queue means an event failed every retry. Alert on > 0.
- Growing backlog: queue depth above a threshold for several minutes, which catches a stuck or under-provisioned consumer.
- Stale heartbeat: no tick within the staleness window; the process is alive but not progressing.
- Error-rate spike: error ratio crossing a baseline, pointing at a bad deploy or a failing dependency.
- alert: BacklogGrowing
  expr: bot_queue_depth > 100
  for: 5m
  labels: { severity: page }
  annotations:
    summary: "Queue backlog above 100 for 5m; consumer may be stuck"
Route paging alerts to a human channel and informational ones to a dashboard. Every alert should be actionable: if nobody needs to respond to it, it is noise and should be tuned down or deleted.
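The "above a threshold for several minutes" condition is what separates a sustained problem from a blip. As an in-process illustration of that idea (not a reimplementation of any alerting system), the sketch below fires only once a breach has held continuously for a set duration and resets as soon as the value recovers. `SustainedAlert`, `route`, the `DeadLetterArrived` name, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SustainedAlert:
    """Fires only after value > threshold has held continuously for for_seconds."""
    name: str
    threshold: float
    for_seconds: float
    severity: str = "page"
    _breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        if value > self.threshold:
            if self._breach_started is None:
                self._breach_started = now  # first sample of this breach
            return now - self._breach_started >= self.for_seconds
        self._breach_started = None  # recovered: the clock starts over
        return False

def route(severity: str) -> str:
    """Hypothetical routing: pages reach a human, everything else a dashboard."""
    return "pager" if severity == "page" else "dashboard"

backlog_alert = SustainedAlert("BacklogGrowing", threshold=100, for_seconds=300)
dead_letter_alert = SustainedAlert("DeadLetterArrived", threshold=0, for_seconds=0)

# Simulated (seconds, queue depth) samples; the dip at t=120 resets the timer.
samples = [(0, 120), (60, 130), (120, 90), (180, 150), (240, 160),
           (300, 170), (360, 180), (420, 190), (480, 200)]
fired_at = [t for t, depth in samples if backlog_alert.observe(depth, now=t)]
dead_letter_fired = dead_letter_alert.observe(1, now=0)
```

With these samples the backlog alert fires only at t=480, five minutes after the breach that began at t=180, while a single dead-letter message fires at once because its duration is zero.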
Dashboards
A single dashboard with backlog, error rate, p99 latency, and heartbeat age — sparklines over the last few hours — lets anyone answer "is the bot healthy?" in five seconds. Grafana panels backed by the metrics above are the usual choice, but the principle holds for any visualization layer.
Where to go next
With logs, metrics, and alerts in place, harden the inputs to your system next. If your bot ingests external triggers, see how to secure inbound webhooks so a malformed or forged request never becomes a processed event.