
The Engineering Checklist Before You Automate Anything

A pre-flight engineering checklist — idempotency, validation, observability, kill-switches, secrets, sandboxing, rate limits, reconciliation, and alerting.

FactorQX

Education and engineering only. This is a checklist about software discipline, not about what to automate, trade, or build a business on. Nothing here is financial advice or a recommendation of any kind.

Automation removes the human in the loop, and the human was the system's last line of defense. A person notices a duplicate, hesitates at a weird number, stops when something smells wrong. Code does none of that unless you build it in. This is the checklist we run before letting any unattended process touch a real external system — payments, brokers, infrastructure, anything with side effects.

Idempotency

The first question for any automated action: what happens if it runs twice? Networks retry, queues redeliver, processes restart mid-flight. If a duplicate request causes a duplicate side effect, you have a bug waiting for a bad day. Attach a unique idempotency key to every state-changing request and make the receiving side deduplicate on it. The correct behavior for a retried operation is "same result, no second effect."
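To make "same result, no second effect" concrete, here is a minimal sketch of receiver-side deduplication. The `PaymentReceiver` class, its in-memory dictionary, and the `amount_cents` field are all hypothetical; a real receiver would need a durable store and an atomic check-and-set so that two concurrent retries cannot both pass the check.

```python
import uuid


class PaymentReceiver:
    """Hypothetical receiving side that deduplicates on an idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.side_effects = 0   # counts how many times the real action ran

    def handle(self, idempotency_key, amount_cents):
        # A key seen before means a retry: return the stored result, do nothing new.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.side_effects += 1  # stand-in for the real external side effect
        result = {"status": "ok", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


receiver = PaymentReceiver()
key = str(uuid.uuid4())              # generated once per logical request, reused on retry
first = receiver.handle(key, 1250)
retry = receiver.handle(key, 1250)   # e.g. the client timed out and resent
```

The key is generated once by the caller and travels with every retry of the same logical request. Generating a fresh key per attempt would defeat the deduplication.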

Validation at the boundary

Validate every input the instant it crosses into your system, before it touches any logic. Range-check numbers, reject malformed payloads, and refuse anything you cannot fully parse rather than guessing. A good rule: an automated system should be more paranoid than a human operator, because it will faithfully execute a garbage instruction at full speed.
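As an illustration, a boundary validator might look like the sketch below. The `Order` shape, the field names, and the `MAX_QUANTITY` limit are assumptions made up for this example; the point is that anything unparseable, unexpected, or out of range is rejected rather than coerced.

```python
import json
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a payload is rejected at the boundary."""


@dataclass(frozen=True)
class Order:
    item_id: str
    quantity: int


MAX_QUANTITY = 1000  # assumed limit, for illustration only


def parse_order(raw: str) -> Order:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValidationError("payload must be a JSON object")
    unknown = set(data) - {"item_id", "quantity"}
    if unknown:
        # Refuse fields we do not understand instead of silently ignoring them.
        raise ValidationError(f"unexpected fields: {sorted(unknown)}")
    item_id = data.get("item_id")
    quantity = data.get("quantity")
    if not isinstance(item_id, str) or not item_id:
        raise ValidationError("item_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        raise ValidationError("quantity must be an integer")
    if not 1 <= quantity <= MAX_QUANTITY:
        raise ValidationError(f"quantity out of range: {quantity}")
    return Order(item_id=item_id, quantity=quantity)
```

Downstream logic then only ever sees an `Order`, never a raw dictionary, so it does not need to repeat these checks.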

Observability

If you cannot see what the automation did, you cannot trust it. That means three things working together:

  • Structured logs for every decision, with enough context to replay it.
  • Metrics for rates, latencies, and error counts so you can see trends, not just incidents.
  • Traces so a single action can be followed end-to-end across services.

Build this before go-live. Adding observability after an incident means you have already lost the evidence for the incident that prompted it.
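For the structured-logs item, one possible shape using only Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `context` attribute, and the field names are assumptions for this example, not a standard; metrics and traces would need their own tooling.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # 'context' is a hypothetical attribute supplied via `extra=`.
            **getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()                    # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("automation.decisions")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False                     # keep the demo output in one place

# Log the decision with the context needed to replay it later.
log.info(
    "order_submitted",
    extra={"context": {"request_id": "req-123", "item_id": "a1", "quantity": 3}},
)
```

Carrying the same `request_id` through every log line for one action is what lets you reconstruct the decision afterwards.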

A kill-switch

Every unattended system needs a way to stop it now, without a deploy. The simplest form is a single flag, checked at the top of the main loop, that halts new actions and lets in-flight ones drain cleanly. Test that the switch actually works on a quiet day, because the day you need it is the worst possible time to discover it was wired up wrong.

killswitch.py
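import logging

log = logging.getLogger(__name__)


def main_loop(should_run, do_work):
    """Call do_work() repeatedly until should_run() returns False."""
    while should_run():          # cheap flag check: env, file, or feature flag
        try:
            do_work()            # one unit of work; it finishes before the next flag check
        except Exception:
            log.exception("work failed; loop continues")
    log.info("kill-switch tripped; draining and exiting")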
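One way to back `should_run` with a file flag is sketched below. The `make_file_switch` helper and the `STOP` file name are hypothetical; an environment variable or a feature-flag service would fit the same shape. The demo trips the switch from inside the work function only to keep the example self-contained.

```python
import os
import tempfile


def make_file_switch(path):
    """Return a should_run() that is True until a file exists at `path`."""
    def should_run():
        return not os.path.exists(path)
    return should_run


# Demo: the "work" trips the switch itself after three iterations.
stop_file = os.path.join(tempfile.mkdtemp(), "STOP")   # hypothetical location
should_run = make_file_switch(stop_file)
completed = []


def do_work():
    completed.append(len(completed))
    if len(completed) == 3:
        open(stop_file, "w").close()   # an operator would create this file by hand


while should_run():
    do_work()
```

Because the flag is read on every iteration, creating the file stops the loop at the next check without a deploy or a restart.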

Secrets handling

API keys and tokens never live in source, in logs, or in error messages. Load them from a secrets manager or injected environment at runtime, scope each credential to the narrowest permission set it needs, and rotate on a schedule. Assume any string you log could end up in a screenshot, so make sure none of those strings ever contains a secret.
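A small sketch of the runtime-loading and never-in-logs parts follows. The `Secret` wrapper, the `load_secret` helper, and the `EXAMPLE_API_KEY` variable name are assumptions for illustration; a secrets manager client would replace the environment lookup in a real deployment.

```python
import os


class Secret:
    """Wrap a secret so accidental printing or logging shows a placeholder."""

    def __init__(self, value):
        self._value = value

    def reveal(self):
        # Call only at the point of use, e.g. when building an auth header.
        return self._value

    def __repr__(self):
        return "Secret(****)"

    __str__ = __repr__


def load_secret(name, environ=os.environ):
    value = environ.get(name)
    if not value:
        # Name the variable, never the value, in the error message.
        raise RuntimeError(f"missing required secret: {name}")
    return Secret(value)


# EXAMPLE_API_KEY is a hypothetical variable name, injected here for the demo.
api_key = load_secret("EXAMPLE_API_KEY", {"EXAMPLE_API_KEY": "s3cr3t-value"})
message = f"calling API with key={api_key}"
```

Failing fast on a missing secret at startup is usually preferable to discovering it on the first authenticated call.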

Test against a sandbox

Point the automation at a sandbox or paper endpoint and let it run for real, end-to-end, before it ever touches production. Sandboxes catch the integration bugs unit tests cannot: malformed auth, wrong field names, timeouts, partial responses. Treat a clean sandbox run as a gate, not a nicety.

Rate limits and backoff

You will hit a rate limit; the only question is whether you handle it gracefully. Respect documented limits proactively, back off exponentially with jitter on 429 and 5xx responses, and cap retries so a degraded dependency does not turn into an infinite hammer. A retry storm can take down the very service you depend on.
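Capped exponential backoff with jitter can be sketched as follows. The `RetryableError` exception is an assumed convention: the wrapped function is expected to raise it for 429 and 5xx responses. The default delays and attempt count are illustrative, not recommendations.

```python
import random
import time


class RetryableError(Exception):
    """Raised by the wrapped function for 429 or 5xx responses (assumed convention)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RetryableError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                      # cap reached: surface the failure
            # "Full jitter": sleep a random amount up to the exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist so the function can be tested without real waiting. If the service sends a `Retry-After` header, honoring it should take precedence over the computed delay; that is left out of this sketch.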

Reconciliation

Trust nothing; verify everything. After acting, independently confirm the result against the source of truth — did the action you think succeeded actually land, exactly once, with the values you intended? Run reconciliation on a schedule and alert on any drift between your internal record and the external system's record. This is how you catch the silent failures that monitoring misses.
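At its core, a reconciliation pass is a diff between two sets of records. The sketch below assumes both sides can be reduced to a mapping from a record ID to an amount; the `reconcile` function and the `tx-` IDs are made up for this example. Detecting duplicates on the external side would need a list of records rather than a mapping.

```python
def reconcile(internal, external):
    """Compare two {record_id: amount} mappings and report every kind of drift."""
    missing_externally = sorted(set(internal) - set(external))
    unexpected_externally = sorted(set(external) - set(internal))
    mismatched = sorted(
        rid for rid in set(internal) & set(external)
        if internal[rid] != external[rid]
    )
    return {
        "missing_externally": missing_externally,        # we think it landed; it did not
        "unexpected_externally": unexpected_externally,  # it landed; we have no record
        "mismatched": mismatched,                        # landed with different values
    }


internal = {"tx-1": 1250, "tx-2": 400, "tx-3": 999}
external = {"tx-1": 1250, "tx-2": 450, "tx-4": 100}
drift = reconcile(internal, external)
```

Any non-empty list in the result is drift worth alerting on: here `tx-3` never landed, `tx-4` landed without an internal record, and `tx-2` landed with a different amount.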

Alerting that someone reads

An alert nobody sees is a log line. Route alerts to a channel a human actually watches, set thresholds so they fire on real problems rather than noise, and make every alert actionable — it should tell the responder what broke and where to look. Alert fatigue is a failure mode: if everything pages, nothing does.

Where to go next

Treat this list as a gate, not a wish list — each item is a yes/no you answer before flipping the switch. Wire the kill-switch and observability first, because they are what let you recover when one of the other items turns out to be wrong in production. Everything else is easier to fix once you can see and stop the system.

Educational content. This post covers software development and research methods only. It is not investment advice, a trading signal, or a recommendation. See our disclaimer.
