# Pitfalls Research
**Domain:** Migration from Docker Socket Proxy to Unraid GraphQL API
**Researched:** 2026-02-09
**Confidence:** MEDIUM (mixture of verified Unraid-specific issues and general GraphQL migration patterns)
## Critical Pitfalls
### Pitfall 1: Container ID Format Mismatch Breaking All Operations
**What goes wrong:**
All container operations fail with "container not found" errors despite containers existing. Docker uses 12-character hex IDs (`8a9907a24576`), Unraid GraphQL uses PrefixedID format (`{server_hash}:{container_hash}` — two 64-character SHA256 strings). Passing Docker IDs to Unraid API or vice versa results in complete operation failure.
**Why it happens:**
Migration assumes container IDs are interchangeable between systems. Developers test name-based lookup operations, which succeed, and miss that action operations using cached Docker IDs will fail once routed to the Unraid API. The 290-node workflow system uses Execute Workflow nodes that pass containerId between sub-workflows; if any node still uses Docker IDs after cutover, errors propagate silently through the chain.
**How to avoid:**
1. Create container ID translation layer BEFORE migration (Phase 1)
2. Add runtime validation: reject IDs not matching `^[a-f0-9]{64}:[a-f0-9]{64}$` pattern (see the sketch after this list)
3. Update ALL 17 Execute Workflow input preparation nodes to use Unraid ID format
4. Store ONLY Unraid PrefixedIDs in callback data after migration
5. Test with containers having similar names but different IDs
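A minimal sketch of the validation in step 2 as a standalone TypeScript helper; the regex matches the PrefixedID format described above, and the function and error wording are illustrative rather than taken from the existing workflow:
```typescript
// Unraid PrefixedID: two 64-char lowercase hex hashes joined by a colon.
const UNRAID_PREFIXED_ID = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
// Docker short ID: 12 lowercase hex characters.
const DOCKER_SHORT_ID = /^[a-f0-9]{12}$/;

function assertUnraidId(id: string): string {
  if (UNRAID_PREFIXED_ID.test(id)) return id;
  if (DOCKER_SHORT_ID.test(id)) {
    // Make the most common migration mistake loud instead of letting it reach the API.
    throw new Error(
      `Got a Docker short ID (${id}); expected an Unraid PrefixedID. ` +
        `Run it through the ID mapping layer before calling the Unraid API.`,
    );
  }
  throw new Error(`Unrecognised container ID format: ${id}`);
}
```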
**Warning signs:**
- Operations succeed via text commands (resolve by name) but fail via inline keyboard callbacks (use cached IDs)
- HTTP 400 "invalid container ID format" errors from Unraid API
- Batch operations fail for some containers but not others
- Telegram callback data still contains 12-character hex strings after cutover
**Phase to address:**
Phase 1 (Container ID Mapping Layer) — MUST complete before any live API calls
---
### Pitfall 2: myunraid.net Cloud Relay Internet Dependency Kills Local Network Operations
**What goes wrong:**
Bot becomes completely non-functional during internet outages despite both the Unraid server and the n8n container being on the same LAN. Users lose container management capability when they need it most (troubleshooting network issues). The system goes from near-instant local Docker socket access (sub-10ms) to 200-500ms cloud relay latency, or complete failure if Unraid's cloud relay service has an outage.
**Why it happens:**
Direct LAN IP access fails because Unraid's nginx redirects HTTP→HTTPS and strips auth headers on redirect. Developers adopt the myunraid.net cloud relay as the "working solution" without implementing a fallback strategy. The ARCHITECTURE.md documents this as the solution, not a compromise.
**How to avoid:**
1. Implement dual-path fallback: attempt direct HTTPS with proper SSL handling first, fall back to myunraid.net if connection fails (see the sketch after this list)
2. Add network connectivity pre-flight check before each API call batch
3. Expose degraded mode: if cloud relay unavailable, switch back to Docker socket proxy (requires keeping proxy running during migration period)
4. Monitor myunraid.net relay latency and availability as first-class metrics
5. Document internet dependency in user-facing error messages
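A sketch of the dual-path idea from step 1, assuming a generic fetch-based client: the LAN address, relay hostname, and `x-api-key` header name are placeholders to replace with whatever the existing Header Auth credential actually sends, and whether direct LAN HTTPS works at all depends on the nginx redirect behaviour described above.
```typescript
type GraphQLResult = { data?: unknown; errors?: Array<{ message: string }> };

async function queryWithFallback(
  query: string,
  variables: Record<string, unknown>,
  apiKey: string,
): Promise<GraphQLResult> {
  // Try the direct LAN path first, then the cloud relay (both URLs are hypothetical).
  const endpoints = [
    "https://192.168.1.10/graphql",
    "https://example.myunraid.net/graphql",
  ];
  let lastError: unknown;
  for (const url of endpoints) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json", "x-api-key": apiKey },
        body: JSON.stringify({ query, variables }),
        signal: AbortSignal.timeout(10_000), // don't hang on an unreachable path
      });
      if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
      return (await res.json()) as GraphQLResult;
    } catch (err) {
      lastError = err; // fall through to the next endpoint
    }
  }
  throw new Error(`All Unraid endpoints failed: ${String(lastError)}`);
}
```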
**Warning signs:**
- Timeout errors during internet outage testing
- Latency spikes visible in execution logs (compare pre/post migration)
- Users report "bot stopped working" correlated with ISP issues
- Unraid server reachable via LAN but bot reports "cannot connect"
**Phase to address:**
Phase 2 (Network Resilience Strategy) — BEFORE cutover, implement fallback mechanism
---
### Pitfall 3: GraphQL Query Result Structure Changes Break Response Parsing
**What goes wrong:**
Bot accepts commands but returns garbled data, shows empty container lists, or crashes on status checks. Field name changes (`state: "RUNNING"` vs `status: "running"`), nested structure differences (Docker's flat JSON vs GraphQL's nested response), and uppercase/lowercase variations break parsing logic across 60 Code nodes in the main workflow.
**Why it happens:**
Docker REST API returns flat JSON arrays. GraphQL returns nested `{ data: { docker: { containers: [...] } } }` structure. Developers update a few obvious parsing nodes but miss edge cases in error handling, batch processing, and inline keyboard builders. The codebase already has field behavior documentation warnings (`state` values are UPPERCASE, `names` prefixed with `/`) suggesting parsing brittleness.
**How to avoid:**
1. Create GraphQL response normalization layer that transforms Unraid responses to match Docker API shape (see the sketch after this list)
2. Add response schema validation in EVERY HTTP Request node (n8n's JSON schema validation)
3. Test response parsing independently from workflow logic (unit test the Code nodes)
4. Document ALL field format differences in normalization layer comments
5. Use TypeScript types for response shapes (n8n Code nodes support TypeScript)
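A sketch of the normalization layer from step 1. The GraphQL shape (`data.docker.containers`, `names` prefixed with `/`, UPPERCASE `state`) follows the field behaviours described above, while the Docker-shaped output fields are assumed examples of what downstream parsers expect; verify both against real responses before relying on this.
```typescript
// Docker-REST-like shape the existing parsing code expects (assumed fields).
interface DockerShapedContainer {
  Id: string;
  Names: string[]; // "/"-prefixed names
  State: string;   // lowercase: "running", "exited", ...
}

// Unraid GraphQL shape per the field notes above (assumed field names).
interface UnraidContainer {
  id: string;      // PrefixedID
  names: string[]; // already "/"-prefixed
  state: string;   // UPPERCASE: "RUNNING", "EXITED", ...
}

interface UnraidResponse {
  data?: { docker?: { containers?: UnraidContainer[] } };
  errors?: Array<{ message: string }>;
}

function normalizeContainers(resp: UnraidResponse): DockerShapedContainer[] {
  if (resp.errors?.length) {
    // Surface GraphQL errors as ordinary exceptions so existing error handling sees them.
    throw new Error(resp.errors.map((e) => e.message).join("; "));
  }
  const containers = resp.data?.docker?.containers ?? [];
  return containers.map((c) => ({
    Id: c.id,
    Names: c.names,
    State: c.state.toLowerCase(),
  }));
}
```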
**Warning signs:**
- Container list shows but names display as `undefined` or `[object Object]`
- Status command returns "running" for stopped containers or vice versa
- Batch selection keyboard shows wrong container names
- Error messages contain GraphQL error structure (`response.errors[0].message`) instead of friendly text
**Phase to address:**
Phase 3 (Response Schema Normalization) — BEFORE touching any sub-workflow, build and test normalization
---
### Pitfall 4: Unraid GraphQL Schema Changes Silently Break Operations
**What goes wrong:**
Operations that worked yesterday fail today with cryptic errors. Unraid's GraphQL schema evolves (field additions, deprecations, type changes) but the bot has no detection mechanism. The ARCHITECTURE.md already documents one schema discrepancy: `isUpdateAvailable` field documented in Phase 14 research does NOT exist in actual Unraid 7.2 schema.
**Why it happens:**
GraphQL schemas evolve continuously (additive changes, deprecations) per best practices. Unlike REST API versioning (breaking changes = new `/v2/` endpoint), GraphQL encourages in-place evolution. Phase 14 research used outdated/incorrect sources. No schema introspection validation in the deployment pipeline means schema mismatches only surface as runtime errors.
**How to avoid:**
1. Implement schema introspection check at workflow startup (query `__schema` endpoint; see the sketch after this list)
2. Store expected schema snapshot in repo, compare on deployment
3. Add field existence checks BEFORE using optional fields in queries
4. Use GraphQL Inspector or similar tooling in CI/CD to detect breaking changes
5. Subscribe to Unraid API changelog/release notes
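The startup check from step 1 can introspect just the type the bot depends on and diff its fields against a snapshot kept in the repo. In this sketch the type name `DockerContainer` and the expected field list are assumptions to replace with values taken from the real schema:
```typescript
async function checkSchemaDrift(endpoint: string, apiKey: string): Promise<string[]> {
  // Introspect only the type the workflows query, not the entire schema.
  const query = `{ __type(name: "DockerContainer") { fields { name } } }`;
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": apiKey },
    body: JSON.stringify({ query }),
  });
  const body = (await res.json()) as {
    data?: { __type?: { fields?: Array<{ name: string }> } };
  };
  const liveFields = new Set((body.data?.__type?.fields ?? []).map((f) => f.name));
  // Snapshot of the fields the workflows actually use (assumed names, commit to the repo).
  const expected = ["id", "names", "state", "image"];
  return expected.filter((f) => !liveFields.has(f)); // any missing field means drift
}
```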
**Warning signs:**
- New Unraid version installed, bot starts throwing "unknown field" errors
- Operations succeed on test server (older Unraid) but fail on production (newer Unraid)
- GraphQL returns `errors: [{ message: "Cannot query field 'X' on type 'Y'" }]`
- Update status sync stops working after Unraid update
**Phase to address:**
Phase 4 (Schema Validation Layer) — Add introspection checks, implement before full cutover
---
### Pitfall 5: Credential Rotation Kills Bot Mid-Operation
**What goes wrong:**
Bot stops responding to all commands. Unraid admin rotates API key for security hygiene (recommended practice for 2026), but n8n's "Unraid API Key" Header Auth credential still uses old key. All GraphQL requests return 401 Unauthorized. The dual-credential system (`.env.unraid-api` for CLI testing + n8n Header Auth for workflows) means updating one doesn't update the other.
**Why it happens:**
2026 security best practices mandate regular credential rotation. API keys "remain valid forever unless someone revokes or rotates them manually" per research. The system uses TWO separate credential stores that must be manually synchronized. No monitoring detects credential expiration, and rotating a key in the Unraid UI gives no warning about consumers that still hold the old key.
**How to avoid:**
1. Consolidate credential storage: use ONLY n8n Header Auth, remove `.env.unraid-api` CLI pattern
2. Implement 401 error detection with user-friendly message: "API key invalid, check Unraid API Keys settings" (see the sketch after this list)
3. Add credential validation endpoint check on workflow startup
4. Document credential rotation procedure in CLAUDE.md and user docs
5. Consider OAuth 2.0 migration if Unraid adds support (more rotation-friendly)
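The 401 detection from step 2 is small enough to sit in front of every response parser; a sketch with illustrative message text:
```typescript
function toUserFacingError(
  status: number,
  body: { errors?: Array<{ message: string }> },
): string | null {
  if (status === 401 || status === 403) {
    return (
      "Unraid API key was rejected. Check the key under Unraid's API Keys settings " +
      "and update the n8n Header Auth credential to match."
    );
  }
  if (body.errors?.length) {
    return `Unraid API error: ${body.errors[0].message}`;
  }
  return null; // no auth or GraphQL error detected, continue normal parsing
}
```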
**Warning signs:**
- All GraphQL operations fail with 401 errors
- Bot worked yesterday, stopped today without code changes
- CLI testing with `.env.unraid-api` works but workflows fail (keys out of sync)
- Unraid API Keys page shows "Last used: N days ago" with large N value
**Phase to address:**
Phase 5 (Authentication Resilience) — Implement before cutover, add monitoring
---
### Pitfall 6: Sub-Workflow Timeout Errors Lost in Propagation
**What goes wrong:**
User triggers container update, bot appears to hang, no error message returned. After 2 minutes, execution silently fails. Logs show sub-workflow timeout but main workflow never receives error. User retries, creates duplicate operations. Known n8n issue: "Execute Workflow node ignores the timeout of the sub-workflow."
**Why it happens:**
n8n Execute Workflow nodes don't properly propagate sub-workflow timeout errors to parent workflow. Cloud relay adds 200-500ms latency per request. Update operations (pull image, recreate container) that completed in 10-30 seconds with local Docker socket now take 60-120 seconds. Default timeout becomes too aggressive, but timeout errors don't surface to user.
**How to avoid:**
1. Increase ALL sub-workflow timeouts by 3-5x to account for cloud relay latency
2. Implement client-side timeout in main workflow (Code node timestamp checks; see the sketch after this list)
3. Add progress indicators for long-running operations (Telegram "typing" action every 10 seconds)
4. Configure HTTP Request node timeouts explicitly (don't rely on workflow-level timeout)
5. Test timeouts with network throttling simulation
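A sketch of the client-side timeout from step 2: race the slow call against a deadline so the main workflow can tell the user about a timeout even when the sub-workflow's own timeout never propagates. Plain TypeScript; the 180-second budget is an example of the 3-5x padding suggested above.
```typescript
async function withDeadline<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} exceeded ${ms / 1000}s deadline`)),
      ms,
    );
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Example: an update that took 10-30s over the local socket gets a padded 180s budget
// behind the cloud relay, and the caller receives a real error instead of a silent hang.
// await withDeadline(updateContainer(id), 180_000, "container update");
```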
**Warning signs:**
- Update operations show "executing" for 2+ minutes then disappear
- Execution logs show sub-workflow timeout but no error message sent to user
- User reports "bot doesn't respond to update commands"
- Success rate drops for slow operations (image pull, large container recreate)
**Phase to address:**
Phase 6 (Timeout Hardening) — Adjust before cutover, test under latency
---
### Pitfall 7: Race Condition Between Container State Query and Action Execution
**What goes wrong:**
User issues "stop plex" command. Bot queries container list (container running), sends stop command, but container already stopped by another process (Unraid WebGUI, another bot user). Unraid API returns error "container not running" but bot displays "successfully stopped." Callback data contains stale container state from 30 seconds ago (Telegram message edit cycle).
**Why it happens:**
GraphQL query and mutation are separate HTTP requests with 200-500ms cloud relay latency each. Container state can change between query and action. Docker socket proxy had sub-10ms latency making race conditions rare. Telegram inline keyboards cache container state in callback data (64-byte limit prevents re-querying). Multiple users can trigger conflicting actions on same container.
**How to avoid:**
1. Implement optimistic locking: query container state immediately before action, abort if state changed (see the sketch after this list)
2. Add version/timestamp to callback data, reject stale callbacks (>30 seconds old)
3. Handle "already in target state" as success (304 pattern from Docker API)
4. Query fresh state after action completes, show actual result to user
5. Add conflict detection: if action fails with state error, query and show current state
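Steps 1 and 3 combined as a sketch: re-query the container right before the mutation, treat "already in the target state" as success, and report the state actually observed afterwards. `getContainerState` and `stopContainer` stand in for the real GraphQL calls, and the UPPERCASE state names are assumptions to verify against the schema.
```typescript
type Outcome = { ok: boolean; message: string };

async function stopWithStateCheck(
  id: string,
  getContainerState: (id: string) => Promise<string>, // wraps the GraphQL state query
  stopContainer: (id: string) => Promise<void>,        // wraps the stop mutation
): Promise<Outcome> {
  const before = await getContainerState(id);
  if (before !== "RUNNING") {
    // Docker's 304 pattern: already in the target state counts as success.
    return { ok: true, message: `Already stopped (state: ${before.toLowerCase()})` };
  }
  await stopContainer(id);
  const after = await getContainerState(id); // show the real result, not the assumption
  return after === "RUNNING"
    ? { ok: false, message: "Stop sent, but container still reports running" }
    : { ok: true, message: `Stopped (state: ${after.toLowerCase()})` };
}
```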
**Warning signs:**
- "Successfully stopped X" message but container still running when user checks status
- Action commands fail with "container already stopped/started" errors
- Batch operations report success but some containers in wrong state
- Multiple users report conflicts when managing same container
**Phase to address:**
Phase 7 (State Consistency Layer) — Implement before cutover, critical for multi-user
---
### Pitfall 8: Dual-Write Period Data Inconsistency
**What goes wrong:**
During migration cutover, some operations write to Docker API, others to Unraid API. Container list query returns different results depending on which API responded. Status updates go to Unraid but actions go to Docker, creating split-brain state. Rollback impossible because no single source of truth exists.
**Why it happens:**
Phased migration requires running both systems simultaneously. Developer enables feature flag to route reads to Unraid but keeps writes on Docker for safety. Cache invalidation becomes impossible — Docker changes invisible to Unraid queries, Unraid changes invisible to Docker queries. Callback data mixes Docker IDs and Unraid IDs from different query sources.
**How to avoid:**
1. Implement write-forwarding: Docker writes also trigger Unraid API updates (or vice versa)
2. Route ALL traffic through abstraction layer that handles dual-write internally
3. Keep cutover window SHORT (hours not days) to minimize inconsistency window
4. Use feature flag for routing but maintain single source of truth (either Docker OR Unraid)
5. Add request tracing to identify which API served each operation
**Warning signs:**
- Status command shows different container list than batch selection keyboard
- Container appears stopped in one interface, running in another
- Update operation succeeds but status doesn't refresh in Unraid WebGUI
- Rollback leaves orphaned container state (metadata mismatch between APIs)
**Phase to address:**
Phase 8 (Cutover Strategy) — Plan before implementation starts, execution in final phase
---
### Pitfall 9: GraphQL Batching vs n8n Batch Processing Confusion
**What goes wrong:**
Batch update operations (update all :latest containers) that processed 10 containers in 30 seconds now take 5+ minutes or time out. Each container update triggers a separate GraphQL HTTP Request → 10 containers = 10 round-trips through the cloud relay. Response parsing fails because the developer assumes GraphQL query batching (multiple queries sent in a single request) but actually implements n8n batch processing (looping through items).
**Why it happens:**
n8n's batching (Items per Batch setting on HTTP Request node) is for rate limiting, NOT efficient batching. GraphQL supports query batching but requires specific request format. Cloud relay latency multiplied by sequential operations destroys performance. Docker socket proxy had negligible latency so sequential operations were acceptable.
**How to avoid:**
1. Use GraphQL batching for reads: single request with multiple container queries (see the sketch after this list)
2. Keep mutations sequential (safer) but add parallel processing for independent operations
3. Configure n8n HTTP Request node batching: 3-5 items per batch, 500ms interval
4. Add progress streaming: update Telegram message after each container (don't wait for all)
5. Implement timeout circuit breaker: abort batch if any single operation takes >60 seconds
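Step 1's read batching can use GraphQL field aliases so ten containers cost one round-trip through the relay instead of ten. A sketch that builds and unpacks the aliased query; the `container(id:)` field name is an assumption to check against the actual Unraid schema.
```typescript
// Builds one query of the form:
//   query BatchStates { c0: container(id: "<id0>") { id state }  c1: ... }
function buildBatchedStateQuery(ids: string[]): string {
  const fields = ids
    .map((id, i) => `c${i}: container(id: ${JSON.stringify(id)}) { id state }`)
    .join("\n  ");
  return `query BatchStates {\n  ${fields}\n}`;
}

// Unpacks the aliased response back into a flat array in the original order.
function unpackBatchedStates(
  data: Record<string, { id: string; state: string }>,
  count: number,
) {
  return Array.from({ length: count }, (_, i) => data[`c${i}`]);
}
```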
**Warning signs:**
- Batch operations work for 2-3 containers but timeout for 10+
- Linear performance degradation (10 containers takes 10x longer than 1)
- n8n execution logs show sequential HTTP requests with 500ms gaps
- User cancels batch operations because they appear hung
**Phase to address:**
Phase 9 (Batch Performance Optimization) — After basic operations work, before batch features enabled
---
### Pitfall 10: Telegram Callback Data Size Limit Breaks With Longer IDs
**What goes wrong:**
Inline keyboard buttons stop working. User taps the "Stop" button on a container status page and nothing happens. Logs show "callback data exceeds 64 bytes" errors. Docker IDs (12 chars) fit the callback format `stop:8a9907a24576`; Unraid PrefixedIDs (129 chars) do not fit: `stop:{64-char-hash}:{64-char-hash}`.
**Why it happens:**
Telegram's 64-byte callback data limit was manageable with Docker IDs. System already uses bitmap encoding for batch selection (base36 BigInt), but single-container operations still use colon-delimited format. Migration assumes callback format unchanged, doesn't account for 10x ID length increase.
**How to avoid:**
1. Implement container ID shortening: store PrefixedID lookup table in workflow static data, use index in callback
2. Alternative: hash PrefixedID to 8-character base62 string, store mapping (see the sketch after this list)
3. Update callback format: `s:idx` where idx is lookup key, not full container ID
4. Test ALL callback patterns (status, actions, confirmation, batch) with Unraid IDs
5. Implement callback data size validation in Prepare Input nodes
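A sketch of the shortening idea from steps 1-2 (using a hex digest rather than base62, which is the same idea): derive a short, stable key from the PrefixedID so the callback stays well under 64 bytes, and keep the key-to-ID map wherever the workflows already persist lookup state, since static data is documented as unreliable.
```typescript
import { createHash } from "node:crypto";

// 12 hex chars of SHA-256 (~48 bits): collisions across a few hundred containers are negligible.
function shortKey(prefixedId: string): string {
  return createHash("sha256").update(prefixedId).digest("hex").slice(0, 12);
}

function buildCallback(action: string, prefixedId: string, map: Map<string, string>): string {
  const key = shortKey(prefixedId);
  map.set(key, prefixedId);          // resolve back to the full ID when the tap arrives
  const data = `${action}:${key}`;   // e.g. "stop:3fa85f64a1b2"
  if (Buffer.byteLength(data, "utf8") > 64) {
    throw new Error(`Callback data too long: ${data}`);
  }
  return data;
}
```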
**Warning signs:**
- Callback queries fail silently (no error to user)
- n8n logs show "callback data size exceeded" errors
- Inline keyboard buttons work for containers with short names, fail for others
- Parse Callback Data node returns truncated IDs
**Phase to address:**
Phase 2 (Callback Data Encoding) — Parallel to Phase 1, before any inline keyboard migration
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Keep Docker socket proxy running during migration, route errors back to it | Zero-downtime cutover, instant rollback | Maintenance burden, two credential systems, split-brain debugging | Acceptable for 1-2 week migration window MAX |
| Skip GraphQL response normalization, update parsers directly | Fewer code layers, "simpler" architecture | 60+ Code nodes to update, high bug rate, impossible to rollback | Never — normalization is mandatory |
| Use n8n workflow static data for ID lookup table | No external database needed | Static data unreliable (execution-scoped per ARCHITECTURE.md), lost on workflow reimport | Never — already documented as broken in CLAUDE.md |
| Implement feature flag routing in main workflow only | Easy to toggle, single point of control | Sub-workflows unaware of API source, error messages confusing | Acceptable if sub-workflows receive normalized responses |
| Skip schema introspection validation | Faster deployment, fewer dependencies | Silent breakage on Unraid updates, no early warning | Never — schema changes are inevitable |
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| n8n GraphQL node | Using dedicated GraphQL node instead of HTTP Request node | Use HTTP Request node with POST to `/graphql` — better error handling, supports Header Auth credential |
| n8n Header Auth | Setting credential in HTTP Request node but forgetting to configure in sub-workflows | ALL 7 sub-workflows need credential configured, not inherited from main workflow |
| Unraid API authentication | Using environment variables directly in workflow expressions | Use n8n credential system, environment variables only for host URL |
| myunraid.net URL format | Including `/graphql` in `UNRAID_HOST` environment variable | Env var should be base URL only, append `/graphql` in HTTP Request node URL field |
| GraphQL error responses | Checking `response.error` like REST APIs | GraphQL returns HTTP 200 with `errors` array, check `response.errors` not `response.error` |
| Container ID format | Assuming IDs are strings, treating them as opaque tokens | Validate ID format `^[a-f0-9]{64}:[a-f0-9]{64}$`, store in typed fields |
| Docker 204 No Content | Assuming empty response = error | Empty response body with HTTP 204 = success per CLAUDE.md |
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Sequential GraphQL queries in loops | Batch operations timeout, linear slowdown | Use GraphQL query batching or parallel HTTP requests | 5+ containers in batch operation |
| No HTTP Request timeout configuration | Indefinite hangs, zombie workflows | Set explicit timeout on EVERY HTTP Request node (30-60 seconds) | First cloud relay hiccup |
| Callback data re-querying | Every inline keyboard tap queries full container list | Cache container state in callback data (within 64-byte limit) | 10+ active users, rate limiting kicks in |
| Missing retry logic for transient errors | Intermittent failures, user frustration | Implement exponential backoff retry (3 attempts, 1s → 2s → 4s delay); sketch below | Network instability, cloud relay rate limits |
| No operation result caching | Same container queried 5 times in single workflow execution | Cache query results in workflow execution context for 30 seconds | Complex workflows with multiple sub-workflow calls |
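The retry prevention above (3 attempts, 1s → 2s → 4s) as a sketch; a production version would also check that the failure looks transient (network error, 429, 5xx) before retrying.
```typescript
async function withRetry<T>(op: () => Promise<T>, attempts = 3, baseDelayMs = 1000): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === attempts - 1) break;
      const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```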
## Security Mistakes
| Mistake | Risk | Prevention |
|---------|------|------------|
| Storing API key in workflow JSON | Credential exposure in git, logs, backups | Use n8n credential system exclusively, never hardcode |
| No API permission scope validation | Over-privileged API key, blast radius on compromise | Use minimal permission (`DOCKER:UPDATE_ANY` only), validate in workflow |
| Telegram user ID auth in single location | Bypass via direct sub-workflow execution | Implement auth check in EVERY sub-workflow, not just main |
| Logging full GraphQL responses | API key, sensitive container config in logs | Log only operation result, redact credentials from error messages |
| No rate limiting on bot commands | API key exhaustion, Unraid API rate limits | Implement per-user rate limiting (5 commands/minute), queue batched operations |
## UX Pitfalls
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| No latency indication | User unsure if command received, double-taps, duplicate operations | Send immediate "Processing..." message, update on completion |
| Generic error messages | "Operation failed" tells user nothing, can't self-recover | Parse Unraid API errors, show actionable message: "Container already stopped, current state: exited" |
| No migration communication | Users confused why bot slower after "upgrade" | Send broadcast message before cutover: "Bot migrating to Unraid API, expect 2-3x slower responses for improved reliability" |
| Hiding internet dependency | Users blame bot when ISP down | Error message: "Cannot reach Unraid API (requires internet), check network connection" |
| No rollback announcement | Users report bugs, developer fixes by rollback, users still see bugs (cache) | Announce rollbacks: "Rolled back to Docker socket, please retry failed operations" |
## "Looks Done But Isn't" Checklist
- [ ] **Container actions:** Often missing state validation BEFORE action — verify error message when stopping already-stopped container shows current state
- [ ] **GraphQL errors:** Often missing `response.errors` array parsing — verify malformed query returns user-friendly message, not JSON dump
- [ ] **Timeout handling:** Often missing client-side timeout — verify 2-minute operation shows progress indicator, doesn't appear hung
- [ ] **Credential expiration:** Often missing 401 error detection — verify rotated API key returns "credential invalid" not generic error
- [ ] **Callback data encoding:** Often missing length validation — verify longest possible container ID + action fits in 64 bytes
- [ ] **Schema validation:** Often missing field existence checks — verify missing field returns helpful error, not "undefined is not a function"
- [ ] **Batch progress:** Often missing incremental updates — verify batch operation shows "3/10 completed" updates, not just final result
- [ ] **Rollback procedure:** Often missing documented steps — verify CLAUDE.md has exact commands to switch back to Docker socket proxy
- [ ] **Dual-credential sync:** Often missing procedure to update both `.env.unraid-api` and n8n credential — verify documented workflow
- [ ] **Performance baseline:** Often missing pre-migration metrics — verify recorded latency/success rate to compare post-migration
## Recovery Strategies
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Container ID mismatch breaking all operations | HIGH (all operations broken) | 1. Rollback to Docker socket proxy immediately 2. Implement ID translation layer 3. Test with synthetic Unraid IDs 4. Re-deploy |
| myunraid.net relay outage | LOW (temporary, auto-recover) | 1. Wait for relay recovery OR 2. Implement LAN fallback if extended outage 3. Monitor status at connect.myunraid.net |
| GraphQL response parsing errors | MEDIUM (degraded functionality) | 1. Identify broken Code node from error logs 2. Add response schema logging 3. Fix parser 4. Redeploy affected sub-workflow |
| Schema changes breaking queries | MEDIUM (affected features broken) | 1. Query Unraid `__schema` endpoint 2. Compare to expected schema snapshot 3. Update queries to match current schema 4. Add missing field checks |
| Credential rotation killing bot | LOW (quick fix) | 1. Generate new API key in Unraid 2. Update n8n Header Auth credential 3. Reactivate workflow (auto-retries) |
| Sub-workflow timeout errors | LOW (increase timeouts) | 1. Identify timeout threshold from logs 2. Increase sub-workflow timeout by 3x 3. Add progress indicators 4. Redeploy |
| Race condition state conflicts | MEDIUM (requires code changes) | 1. Implement fresh state query before action 2. Handle "already in state" as success 3. Show actual state after operation |
| Dual-write inconsistency | HIGH (data integrity compromised) | 1. Choose source of truth (Docker OR Unraid) 2. Query truth source, discard other 3. Regenerate callback data 4. Force user refresh |
| Batch operation performance issues | MEDIUM (requires optimization) | 1. Implement GraphQL batching for reads 2. Add parallel processing for mutations 3. Stream progress updates |
| Callback data size exceeded | MEDIUM (redesign callback format) | 1. Implement ID shortening with lookup table 2. Update ALL Prepare Input nodes 3. Test all callback paths 4. Redeploy |
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Container ID format mismatch | Phase 1: ID Mapping Layer | Test Docker ID fails validation, Unraid ID passes, translation correct |
| myunraid.net dependency | Phase 2: Network Resilience | Disconnect internet, verify fallback message or graceful degradation |
| GraphQL response structure | Phase 3: Response Normalization | Compare normalized output to Docker API shape, all fields present |
| Schema changes | Phase 4: Schema Validation | Change expected schema snapshot, verify detection on next workflow run |
| Credential rotation | Phase 5: Auth Resilience | Rotate API key, verify 401 error message user-friendly and actionable |
| Sub-workflow timeouts | Phase 6: Timeout Hardening | Simulate 2-minute operation, verify progress indicator and completion |
| Race conditions | Phase 7: State Consistency | Two users stop same container simultaneously, verify conflict resolution |
| Dual-write inconsistency | Phase 8: Cutover Strategy | Query both APIs during cutover, verify consistent results |
| Batch performance | Phase 9: Batch Optimization | Update 10 containers, verify completion <60 seconds with progress |
| Callback data size | Phase 2: Callback Encoding | Generate callback with longest ID, verify <64 bytes |
## Sources
**GraphQL Migration Patterns:**
- [Schema Migration - GraphQL](https://dgraph.io/docs/graphql/schema/migration/)
- [How to Handle Versioning in GraphQL APIs](https://oneuptime.com/blog/post/2026-01-24-graphql-api-versioning/view)
- [Migrating from REST to GraphQL - GitHub Docs](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql)
- [3 GraphQL pitfalls and how we avoid them](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them)
**n8n Integration Issues:**
- [HTTP Request node common issues | n8n Docs](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.httprequest/common-issues/)
- [Error handling | n8n Docs](https://docs.n8n.io/flow-logic/error-handling/)
- [Execute Workflow node ignores timeout - GitHub Issue #1572](https://github.com/n8n-io/n8n/issues/1572)
- [Error Handling in n8n: How to Retry & Monitor Workflows](https://easify-ai.com/error-handling-in-n8n-monitor-workflow-failures/)
**Migration Strategy:**
- [API migration dual-write pattern - AWS DMS](https://aws.amazon.com/blogs/database/rolling-back-from-a-migration-with-aws-dms/)
- [Zero-Downtime Database Migration: The Complete Engineering Guide](https://dev.to/ari-ghosh/zero-downtime-database-migration-the-definitive-guide-5672)
- [Canary releases with feature flags](https://www.getunleash.io/blog/canary-deployment-what-is-it)
**Authentication & Security:**
- [API Authentication Best Practices in 2026](https://dev.to/apiverve/api-authentication-best-practices-in-2026-3k4a)
- [Migrate from API keys to OAuth 2.1](https://www.scalekit.com/blog/migrating-from-api-keys-to-oauth-mcp-servers)
**Container Management:**
- [Race condition between stop and rm - GitHub Issue #130](https://github.com/apple/container/issues/130)
- [Eventual Consistency in Distributed Systems](https://www.geeksforgeeks.org/system-design/eventual-consistency-in-distributive-systems-learn-system-design/)
**Unraid Specific:**
- [Unraid Connect overview & setup | Unraid Docs](https://docs.unraid.net/unraid-connect/overview-and-setup/)
- Project ARCHITECTURE.md (verified container ID format, field behaviors, myunraid.net requirement)
- Project CLAUDE.md (Docker API patterns, n8n conventions, static data limitations)
---
*Pitfalls research for: Unraid Docker Manager — Docker Socket to GraphQL API Migration*
*Researched: 2026-02-09*
*Confidence: MEDIUM (verified Unraid-specific issues HIGH, general GraphQL patterns MEDIUM, n8n integration issues HIGH)*