n8n workflows that scale: retries, idempotency, alerts
Build automations like product features — observable, debuggable, safe at 2 AM.
The problem with "quick automations"
Most n8n workflows I inherit from clients look like this: a happy-path sequence of nodes that works 95% of the time. The other 5%? Silent failures, duplicate records, missed leads.
Building for reliability
Idempotency first
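Every workflow should be safe to re-run: running it twice on the same input should leave the same end state as running it once. Use external IDs as stable keys, upsert instead of blindly inserting, and check for duplicates before any write or destructive action.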
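To make the upsert idea concrete, here is a minimal sketch in plain Python. It is not n8n node code: `ContactStore` is a hypothetical in-memory stand-in for a CRM table, and the external ID `form-123` is an invented example. The point is only that a second run with the same input creates nothing new.

```python
from dataclasses import dataclass, field


@dataclass
class ContactStore:
    """Hypothetical in-memory stand-in for a CRM table keyed by external ID."""
    rows: dict = field(default_factory=dict)

    def upsert(self, external_id: str, data: dict) -> str:
        """Insert or update by external ID and report what happened."""
        existing = self.rows.get(external_id)
        if existing == data:
            return "unchanged"  # re-run with identical input: no write at all
        self.rows[external_id] = dict(data)  # copy so callers cannot mutate the stored row
        return "updated" if existing is not None else "created"


store = ContactStore()
lead = {"email": "ada@example.com", "name": "Ada"}
print(store.upsert("form-123", lead))  # created
print(store.upsert("form-123", lead))  # unchanged
print(len(store.rows))                 # 1
```

In a real workflow the same role is usually played by the target system's own upsert operation or a unique constraint on the external ID column, assuming the system offers one.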
Retry with backoff
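n8n's per-node Retry On Fail setting covers the basics, but it waits a fixed interval between tries. For external APIs you want exponential backoff, which means building the retry loop yourself (for example with a Wait node or a Code node). A reasonable starting point: 3 retries, an initial delay of 5s, and a backoff factor of 2, which gives waits of 5s, 10s, and 20s.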
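The schedule described above (3 retries, 5s initial delay, factor 2) can be sketched as follows. This is a standalone illustration, not an n8n API: the names `backoff_delays` and `call_with_backoff` are hypothetical, and the `sleep` parameter is injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 3, initial: float = 5.0, factor: float = 2.0) -> List[float]:
    """Seconds to wait before each retry: initial * factor**n."""
    return [initial * factor**n for n in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    initial: float = 5.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on an exception wait and retry, re-raising once retries run out."""
    delays = backoff_delays(retries, initial, factor)
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: let the caller handle the failure
            sleep(delays[attempt])
    raise AssertionError("unreachable")


print(backoff_delays())  # [5.0, 10.0, 20.0]
```

A production version would likely retry only on errors worth retrying (timeouts, HTTP 429, 5xx) and add random jitter to the delays so that many failed executions do not retry in lockstep.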
Dead-letter queues
When retries are exhausted, don't just log the error. Route the execution to a dedicated error workflow that alerts and stores the failed payload for manual review.
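A rough sketch of what the error workflow might capture is shown below. The record fields, the `to_dead_letter` and `handle_failure` names, and the list used as storage are all assumptions for illustration; in practice the store would be a database table or queue and the alert a Slack or PagerDuty call.

```python
import json
from datetime import datetime, timezone
from typing import Callable, List


def to_dead_letter(workflow: str, payload: dict, error: Exception, attempts: int) -> dict:
    """Build a record with enough context to review and replay the failure."""
    return {
        "workflow": workflow,
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "attempts": attempts,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "payload": payload,  # stored verbatim so the item can be replayed later
    }


def handle_failure(record: dict, store: List[str], alert: Callable[[str], None]) -> None:
    """Persist the failed item first, then alert a human."""
    store.append(json.dumps(record))
    alert(
        f"{record['workflow']} failed after {record['attempts']} attempts: "
        f"{record['error_message']}"
    )


dead_letters: List[str] = []
record = to_dead_letter("crm-sync", {"external_id": "form-123"}, TimeoutError("CRM timed out"), 4)
handle_failure(record, dead_letters, alert=print)
```

Storing before alerting is a deliberate choice here: if the alert channel itself is down, the payload is still recoverable.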
Observability
Every production workflow should have:
- Structured logs with correlation IDs
- Success/failure metrics in a dashboard (Grafana, Datadog)
- Slack/PagerDuty alerts when the error rate exceeds a threshold
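Two of the items in the list above, structured logs with correlation IDs and threshold-based alerting, can be sketched in a few lines. The field names, the 100-execution window, and the 5% threshold are assumptions chosen for illustration, not recommendations from any particular tool.

```python
import json
import uuid
from collections import deque


def log_event(correlation_id: str, step: str, status: str, **fields) -> str:
    """Emit one JSON line per event so logs can be filtered by correlation ID."""
    line = json.dumps(
        {"correlation_id": correlation_id, "step": step, "status": status, **fields},
        sort_keys=True,
    )
    print(line)
    return line


class ErrorRateMonitor:
    """Track the failure rate over the most recent `window` executions."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold


# One ID generated at the start of an execution and attached to every log line.
run_id = str(uuid.uuid4())
log_event(run_id, "fetch_leads", "success", count=12)
log_event(run_id, "crm_upsert", "failure", error="timeout")
```

Alerting on a rate over a window, rather than on every single error, keeps a lone transient failure from paging anyone while still catching a sustained problem.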
Real example
On a recent CRM sync project, adding these four patterns took us from a 2% silent failure rate to a 0.3% failure rate, with full visibility into every remaining failure.