
Pitfalls Research

Domain: Migration from Docker Socket Proxy to Unraid GraphQL API
Researched: 2026-02-09
Confidence: MEDIUM (mixture of verified Unraid-specific issues and general GraphQL migration patterns)

Critical Pitfalls

Pitfall 1: Container ID Format Mismatch Breaking All Operations

What goes wrong: All container operations fail with "container not found" errors despite the containers existing. Docker uses 12-character hex IDs (8a9907a24576); Unraid GraphQL uses a PrefixedID format ({server_hash}:{container_hash} — two 64-character SHA256 hex strings joined by a colon). Passing Docker IDs to the Unraid API, or vice versa, results in complete operation failure.

Why it happens: The migration assumes container IDs are interchangeable between systems. Developers test lookup operations that succeed (name-based resolution) and miss that action operations using cached Docker IDs will fail when routed to the Unraid API. The 290-node workflow system uses Execute Workflow nodes that pass containerId between sub-workflows — if any node still uses Docker IDs after cutover, errors propagate silently through the chain.

How to avoid:

  1. Create container ID translation layer BEFORE migration (Phase 1)
  2. Add runtime validation: reject IDs not matching ^[a-f0-9]{64}:[a-f0-9]{64}$ pattern
  3. Update ALL 17 Execute Workflow input preparation nodes to use Unraid ID format
  4. Store ONLY Unraid PrefixedIDs in callback data after migration
  5. Test with containers having similar names but different IDs
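
A minimal sketch of steps 1 and 2, in plain n8n Code-node JavaScript. The Unraid response field shapes (id, names) are illustrative assumptions; the translation layer keys on container name, the one stable identifier both APIs share:

```javascript
// PrefixedID per ARCHITECTURE.md: {64-hex server hash}:{64-hex container hash}
const UNRAID_ID_RE = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
const DOCKER_ID_RE = /^[a-f0-9]{12}$/;

// Runtime validation gate: reject anything that is not a Unraid PrefixedID,
// and call out legacy Docker IDs explicitly so the error names the real cause.
function assertUnraidId(id) {
  if (UNRAID_ID_RE.test(id)) return id;
  if (DOCKER_ID_RE.test(id)) {
    throw new Error(`Legacy 12-char Docker ID "${id}" used after cutover; translate it first`);
  }
  throw new Error(`Unrecognized container ID format: "${id}"`);
}

// Translation layer: map short Docker IDs to Unraid PrefixedIDs by name.
// Both APIs may prefix names with "/", so strip it before matching.
function buildIdMap(dockerContainers, unraidContainers) {
  const byName = new Map(
    unraidContainers.map(c => [c.names[0].replace(/^\//, ''), c.id])
  );
  const map = {};
  for (const c of dockerContainers) {
    const name = c.Names[0].replace(/^\//, '');
    if (byName.has(name)) map[c.Id.slice(0, 12)] = byName.get(name);
  }
  return map;
}
```

The validation gate belongs in every Prepare Input node; the map is built once per execution from a paired listing of both APIs.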

Warning signs:

  • Operations succeed via text commands (resolve by name) but fail via inline keyboard callbacks (use cached IDs)
  • HTTP 400 "invalid container ID format" errors from Unraid API
  • Batch operations fail for some containers but not others
  • Telegram callback data still contains 12-character hex strings after cutover

Phase to address: Phase 1 (Container ID Mapping Layer) — MUST complete before any live API calls


Pitfall 2: myunraid.net Cloud Relay Internet Dependency Kills Local Network Operations

What goes wrong: Bot becomes completely non-functional during internet outages despite both Unraid server and n8n container being on the same LAN. Users lose container management capability when they need it most (troubleshooting network issues). The system goes from zero-latency local Docker socket access (sub-10ms) to 200-500ms cloud relay latency, or complete failure if Unraid's cloud relay service has an outage.

Why it happens: Direct LAN IP access fails because Unraid's nginx redirects HTTP→HTTPS and strips auth headers on redirect. Developers choose myunraid.net cloud relay as "working solution" without implementing fallback strategy. The ARCHITECTURE.md documents this as the solution, not a compromise.

How to avoid:

  1. Implement dual-path fallback: attempt direct HTTPS with proper SSL handling first, fall back to myunraid.net if connection fails
  2. Add network connectivity pre-flight check before each API call batch
  3. Expose degraded mode: if cloud relay unavailable, switch back to Docker socket proxy (requires keeping proxy running during migration period)
  4. Monitor myunraid.net relay latency and availability as first-class metrics
  5. Document internet dependency in user-facing error messages
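
The dual-path idea (step 1) can be sketched as follows. httpPost stands in for whatever HTTP client the workflow uses, and both URLs are placeholders; real direct-LAN access still needs the SSL/redirect handling described above:

```javascript
// Try the direct LAN endpoint first, fall back to the myunraid.net relay.
// Collect per-endpoint errors so the user-facing message can explain
// exactly which path failed and why.
async function graphqlWithFallback(query, httpPost, urls) {
  const errors = [];
  for (const target of [urls.lan, urls.relay]) {
    try {
      return { target, data: await httpPost(target, query) };
    } catch (err) {
      errors.push(`${target}: ${err.message}`);
    }
  }
  throw new Error(`All Unraid endpoints failed:\n${errors.join('\n')}`);
}
```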

Warning signs:

  • Timeout errors during internet outage testing
  • Latency spikes visible in execution logs (compare pre/post migration)
  • Users report "bot stopped working" correlated with ISP issues
  • Unraid server reachable via LAN but bot reports "cannot connect"

Phase to address: Phase 2 (Network Resilience Strategy) — BEFORE cutover, implement fallback mechanism


Pitfall 3: GraphQL Query Result Structure Changes Break Response Parsing

What goes wrong: Bot sends commands but returns garbled data, shows empty container lists, or crashes on status checks. Field name changes (state: "RUNNING" vs status: "running"), nested structure differences (Docker's flat JSON vs GraphQL's nested response), and uppercase/lowercase variations break parsing logic across 60 Code nodes in the main workflow.

Why it happens: Docker REST API returns flat JSON arrays. GraphQL returns nested { data: { docker: { containers: [...] } } } structure. Developers update a few obvious parsing nodes but miss edge cases in error handling, batch processing, and inline keyboard builders. The codebase already has field behavior documentation warnings (state values are UPPERCASE, names prefixed with /) suggesting parsing brittleness.

How to avoid:

  1. Create GraphQL response normalization layer that transforms Unraid responses to match Docker API shape
  2. Add response schema validation in EVERY HTTP Request node (n8n's JSON schema validation)
  3. Test response parsing independently from workflow logic (unit test the Code nodes)
  4. Document ALL field format differences in normalization layer comments
  5. Use TypeScript types for response shapes (n8n Code nodes support TypeScript)
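
A sketch of the normalization layer (step 1), reshaping the nested GraphQL response into the flat Docker-API-like array the existing Code nodes expect. The GraphQL field names (data.docker.containers, state, names) are assumptions based on the field behaviors documented above:

```javascript
// Transform an Unraid GraphQL response into Docker-REST-shaped items.
function normalizeContainers(graphqlResponse) {
  // GraphQL errors arrive with HTTP 200; surface them explicitly here
  // so downstream nodes never see a half-valid payload.
  if (graphqlResponse.errors && graphqlResponse.errors.length) {
    throw new Error(`GraphQL error: ${graphqlResponse.errors[0].message}`);
  }
  const containers = graphqlResponse.data?.docker?.containers ?? [];
  return containers.map(c => ({
    Id: c.id,
    // Names may carry Docker's leading "/"; strip to match existing parsers
    Names: (c.names || []).map(n => n.replace(/^\//, '')),
    // Unraid state values are UPPERCASE ("RUNNING"); Docker uses lowercase
    State: (c.state || '').toLowerCase(),
    Image: c.image,
  }));
}
```

Putting this in one shared Code node means the 60 downstream parsers never change, which is the whole point of the layer.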

Warning signs:

  • Container list shows but names display as undefined or [object Object]
  • Status command returns "running" for stopped containers or vice versa
  • Batch selection keyboard shows wrong container names
  • Error messages contain GraphQL error structure (response.errors[0].message) instead of friendly text

Phase to address: Phase 3 (Response Schema Normalization) — BEFORE touching any sub-workflow, build and test normalization


Pitfall 4: Unraid GraphQL Schema Changes Silently Break Operations

What goes wrong: Operations that worked yesterday fail today with cryptic errors. Unraid's GraphQL schema evolves (field additions, deprecations, type changes) but the bot has no detection mechanism. The ARCHITECTURE.md already documents one schema discrepancy: isUpdateAvailable field documented in Phase 14 research does NOT exist in actual Unraid 7.2 schema.

Why it happens: GraphQL schemas evolve continuously (additive changes, deprecations) per best practices. Unlike REST API versioning (breaking changes = new /v2/ endpoint), GraphQL encourages in-place evolution. Phase 14 research used outdated/incorrect sources. No schema introspection validation in the deployment pipeline means schema mismatches only surface as runtime errors.

How to avoid:

  1. Implement schema introspection check at workflow startup (query __schema endpoint)
  2. Store expected schema snapshot in repo, compare on deployment
  3. Add field existence checks BEFORE using optional fields in queries
  4. Use GraphQL Inspector or similar tooling in CI/CD to detect breaking changes
  5. Subscribe to Unraid API changelog/release notes
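
Steps 1-3 can be combined into a startup check: introspect the fields of a type and diff them against a snapshot committed to the repo. The introspection query below is standard GraphQL; the expected field list is whatever the snapshot contains:

```javascript
// Standard GraphQL introspection: list the fields of a named type.
const INTROSPECT_TYPE = `
  query TypeFields($name: String!) {
    __type(name: $name) { fields { name } }
  }`;

// Diff introspected fields against the committed snapshot. Missing
// fields will break queries that reference them; added fields are
// informational (additive schema evolution).
function diffSchemaFields(introspectionResult, expectedFields) {
  const actual = new Set(
    (introspectionResult.data.__type?.fields ?? []).map(f => f.name)
  );
  return {
    missing: expectedFields.filter(f => !actual.has(f)),
    added: [...actual].filter(f => !expectedFields.includes(f)),
  };
}
```

Run at workflow startup; a non-empty missing list is exactly the isUpdateAvailable class of failure caught before any user hits it.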

Warning signs:

  • New Unraid version installed, bot starts throwing "unknown field" errors
  • Operations succeed on test server (older Unraid) but fail on production (newer Unraid)
  • GraphQL returns errors: [{ message: "Cannot query field 'X' on type 'Y'" }]
  • Update status sync stops working after Unraid update

Phase to address: Phase 4 (Schema Validation Layer) — Add introspection checks, implement before full cutover


Pitfall 5: Credential Rotation Kills Bot Mid-Operation

What goes wrong: Bot stops responding to all commands. Unraid admin rotates API key for security hygiene (recommended practice for 2026), but n8n's "Unraid API Key" Header Auth credential still uses old key. All GraphQL requests return 401 Unauthorized. The dual-credential system (.env.unraid-api for CLI testing + n8n Header Auth for workflows) means updating one doesn't update the other.

Why it happens: 2026 security best practices mandate regular credential rotation. API keys "remain valid forever unless someone revokes or rotates them manually" per research. The system uses TWO separate credential stores that must be manually synchronized. No monitoring detects credential expiration. Unraid doesn't warn before rotating keys.

How to avoid:

  1. Consolidate credential storage: use ONLY n8n Header Auth, remove .env.unraid-api CLI pattern
  2. Implement 401 error detection with user-friendly message: "API key invalid, check Unraid API Keys settings"
  3. Add credential validation endpoint check on workflow startup
  4. Document credential rotation procedure in CLAUDE.md and user docs
  5. Consider OAuth 2.0 migration if Unraid adds support (more rotation-friendly)
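
A sketch of the 401 detection from step 2. The error-object field names (httpCode, statusCode) are assumptions about the HTTP Request node's error output; adapt to the actual payload:

```javascript
// Translate an auth failure into the actionable message step 2 calls for.
// Returns null for anything that is not a 401, so generic error handling
// can take over.
function friendlyAuthError(err) {
  const status = err.httpCode ?? err.statusCode;
  if (Number(status) === 401) {
    return 'Unraid API key invalid or rotated. Generate a new key in the '
      + 'Unraid API Keys settings, then update the "Unraid API Key" '
      + 'Header Auth credential in n8n.';
  }
  return null;
}
```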

Warning signs:

  • All GraphQL operations fail with 401 errors
  • Bot worked yesterday, stopped today without code changes
  • CLI testing with .env.unraid-api works but workflows fail (keys out of sync)
  • Unraid API Keys page shows "Last used: N days ago" with large N value

Phase to address: Phase 5 (Authentication Resilience) — Implement before cutover, add monitoring


Pitfall 6: Sub-Workflow Timeout Errors Lost in Propagation

What goes wrong: User triggers container update, bot appears to hang, no error message returned. After 2 minutes, execution silently fails. Logs show sub-workflow timeout but main workflow never receives error. User retries, creates duplicate operations. Known n8n issue: "Execute Workflow node ignores the timeout of the sub-workflow."

Why it happens: n8n Execute Workflow nodes don't properly propagate sub-workflow timeout errors to parent workflow. Cloud relay adds 200-500ms latency per request. Update operations (pull image, recreate container) that completed in 10-30 seconds with local Docker socket now take 60-120 seconds. Default timeout becomes too aggressive, but timeout errors don't surface to user.

How to avoid:

  1. Increase ALL sub-workflow timeouts by 3-5x to account for cloud relay latency
  2. Implement client-side timeout in main workflow (Code node timestamp checks)
  3. Add progress indicators for long-running operations (Telegram "typing" action every 10 seconds)
  4. Configure HTTP Request node timeouts explicitly (don't rely on workflow-level timeout)
  5. Test timeouts with network throttling simulation
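
The client-side timeout from step 2 is just a timestamp check in the main workflow, since Execute Workflow will not propagate the sub-workflow's own timeout. startedAt would be stamped in the Prepare Input node; the budget is the old Docker-socket timing scaled 3-5x for relay latency:

```javascript
// Deadline check for a Code node in the main workflow. Returns either
// a user-facing timeout message or the remaining budget.
function checkDeadline(startedAtMs, budgetMs, nowMs = Date.now()) {
  const elapsed = nowMs - startedAtMs;
  if (elapsed > budgetMs) {
    return {
      timedOut: true,
      message: `Operation exceeded its ${Math.round(budgetMs / 1000)}s budget `
        + `(elapsed ${Math.round(elapsed / 1000)}s); the sub-workflow may still be running.`,
    };
  }
  return { timedOut: false, remainingMs: budgetMs - elapsed };
}
```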

Warning signs:

  • Update operations show "executing" for 2+ minutes then disappear
  • Execution logs show sub-workflow timeout but no error message sent to user
  • User reports "bot doesn't respond to update commands"
  • Success rate drops for slow operations (image pull, large container recreate)

Phase to address: Phase 6 (Timeout Hardening) — Adjust before cutover, test under latency


Pitfall 7: Race Condition Between Container State Query and Action Execution

What goes wrong: User issues "stop plex" command. Bot queries container list (container running), sends stop command, but container already stopped by another process (Unraid WebGUI, another bot user). Unraid API returns error "container not running" but bot displays "successfully stopped." Callback data contains stale container state from 30 seconds ago (Telegram message edit cycle).

Why it happens: GraphQL query and mutation are separate HTTP requests with 200-500ms cloud relay latency each. Container state can change between query and action. Docker socket proxy had sub-10ms latency making race conditions rare. Telegram inline keyboards cache container state in callback data (64-byte limit prevents re-querying). Multiple users can trigger conflicting actions on same container.

How to avoid:

  1. Implement optimistic locking: query container state immediately before action, abort if state changed
  2. Add version/timestamp to callback data, reject stale callbacks (>30 seconds old)
  3. Handle "already in target state" as success (304 pattern from Docker API)
  4. Query fresh state after action completes, show actual result to user
  5. Add conflict detection: if action fails with state error, query and show current state
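
Steps 1, 3, and 4 combine into one guard: re-query state immediately before the mutation, treat "already in target state" as success (the 304 pattern), and report the post-action state rather than the assumed one. fetchState and mutate are caller-supplied; the names are illustrative:

```javascript
// Optimistic-lock wrapper for a container action.
async function guardedAction(containerId, targetState, fetchState, mutate) {
  const current = await fetchState(containerId);
  if (current === targetState) {
    // Another actor already did it; success, nothing to do.
    return { ok: true, changed: false, state: current };
  }
  await mutate(containerId);
  // Query fresh state and show the actual result to the user.
  const after = await fetchState(containerId);
  return { ok: after === targetState, changed: true, state: after };
}
```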

Warning signs:

  • "Successfully stopped X" message but container still running when user checks status
  • Action commands fail with "container already stopped/started" errors
  • Batch operations report success but some containers in wrong state
  • Multiple users report conflicts when managing same container

Phase to address: Phase 7 (State Consistency Layer) — Implement before cutover, critical for multi-user


Pitfall 8: Dual-Write Period Data Inconsistency

What goes wrong: During migration cutover, some operations write to Docker API, others to Unraid API. Container list query returns different results depending on which API responded. Status updates go to Unraid but actions go to Docker, creating split-brain state. Rollback impossible because no single source of truth exists.

Why it happens: Phased migration requires running both systems simultaneously. Developer enables feature flag to route reads to Unraid but keeps writes on Docker for safety. Cache invalidation becomes impossible — Docker changes invisible to Unraid queries, Unraid changes invisible to Docker queries. Callback data mixes Docker IDs and Unraid IDs from different query sources.

How to avoid:

  1. Implement write-forwarding: Docker writes also trigger Unraid API updates (or vice versa)
  2. Route ALL traffic through abstraction layer that handles dual-write internally
  3. Keep cutover window SHORT (hours not days) to minimize inconsistency window
  4. Use feature flag for routing but maintain single source of truth (either Docker OR Unraid)
  5. Add request tracing to identify which API served each operation

Warning signs:

  • Status command shows different container list than batch selection keyboard
  • Container appears stopped in one interface, running in another
  • Update operation succeeds but status doesn't refresh in Unraid WebGUI
  • Rollback leaves orphaned container state (metadata mismatch between APIs)

Phase to address: Phase 8 (Cutover Strategy) — Plan before implementation starts, execution in final phase


Pitfall 9: GraphQL Batching vs n8n Batch Processing Confusion

What goes wrong: Batch update operations (update all :latest containers) that processed 10 containers in 30 seconds now take 5+ minutes or timeout. Each container update triggers separate GraphQL HTTP Request → 10 containers = 10 round-trips through cloud relay. Response body parsing fails because developer assumes GraphQL response batching (send multiple queries in single request) but implements n8n batch processing (loop through items).

Why it happens: n8n's batching (Items per Batch setting on HTTP Request node) is for rate limiting, NOT efficient batching. GraphQL supports query batching but requires specific request format. Cloud relay latency multiplied by sequential operations destroys performance. Docker socket proxy had negligible latency so sequential operations were acceptable.

How to avoid:

  1. Use GraphQL batching for reads: single request with multiple container queries
  2. Keep mutations sequential (safer) but add parallel processing for independent operations
  3. Configure n8n HTTP Request node batching: 3-5 items per batch, 500ms interval
  4. Add progress streaming: update Telegram message after each container (don't wait for all)
  5. Implement timeout circuit breaker: abort batch if any single operation takes >60 seconds
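
Read-batching (step 1) means building one GraphQL document with aliased fields, so ten containers cost one relay round-trip instead of ten. The container(id:) field name is an assumption about the Unraid schema:

```javascript
// Build a single query that fetches state for many containers via
// aliases (c0, c1, ...) instead of one HTTP request per container.
function buildBatchedQuery(ids) {
  const parts = ids.map((id, i) =>
    `c${i}: container(id: "${id}") { id state }`);
  return `query BatchStates { ${parts.join(' ')} }`;
}
```

This is distinct from n8n's Items-per-Batch setting, which only throttles separate requests; it never merges them.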

Warning signs:

  • Batch operations work for 2-3 containers but timeout for 10+
  • Linear performance degradation (10 containers takes 10x longer than 1)
  • n8n execution logs show sequential HTTP requests with 500ms gaps
  • User cancels batch operations because they appear hung

Phase to address: Phase 9 (Batch Performance Optimization) — After basic operations work, before batch features enabled


Pitfall 10: Telegram Callback Data Size Limit Breaks With Longer IDs

What goes wrong: Inline keyboard buttons stop working. User taps the "Stop" button on a container status page and nothing happens. Logs show "callback data exceeds 64 bytes" errors. Docker IDs (12 chars) fit the callback format stop:8a9907a24576; Unraid PrefixedIDs (129 chars) do not: stop:{64-char-hash}:{64-char-hash} is 134 bytes, so the ID alone exceeds the entire 64-byte budget.

Why it happens: Telegram's 64-byte callback data limit was manageable with Docker IDs. System already uses bitmap encoding for batch selection (base36 BigInt), but single-container operations still use colon-delimited format. Migration assumes callback format unchanged, doesn't account for 10x ID length increase.

How to avoid:

  1. Implement container ID shortening: store PrefixedID lookup table in workflow static data, use index in callback
  2. Alternative: hash PrefixedID to 8-character base62 string, store mapping
  3. Update callback format: s:idx where idx is lookup key, not full container ID
  4. Test ALL callback patterns (status, actions, confirmation, batch) with Unraid IDs
  5. Implement callback data size validation in Prepare Input nodes
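
Steps 1, 3, and 5 together look roughly like this: the callback carries a base36 index into a lookup table instead of the PrefixedID, with a size check at encode time. Where the table persists is a separate problem (static data is unreliable per CLAUDE.md):

```javascript
// Encode "action:index" callbacks that stay far under Telegram's 64-byte
// limit regardless of ID length.
function encodeCallback(action, containerId, table) {
  let idx = table.indexOf(containerId);
  if (idx === -1) idx = table.push(containerId) - 1;
  const data = `${action}:${idx.toString(36)}`;
  // Size validation belongs in every Prepare Input node.
  if (Buffer.byteLength(data, 'utf8') > 64) {
    throw new Error(`Callback data too long: ${data}`);
  }
  return data;
}

// Reverse the lookup when the callback query arrives.
function decodeCallback(data, table) {
  const [action, idx36] = data.split(':');
  return { action, containerId: table[parseInt(idx36, 36)] };
}
```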

Warning signs:

  • Callback queries fail silently (no error to user)
  • n8n logs show "callback data size exceeded" errors
  • Inline keyboard buttons work for containers with short names, fail for others
  • Parse Callback Data node returns truncated IDs

Phase to address: Phase 2 (Callback Data Encoding) — Parallel to Phase 1, before any inline keyboard migration


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Keep Docker socket proxy running during migration, route errors back to it | Zero-downtime cutover, instant rollback | Maintenance burden, two credential systems, split-brain debugging | Acceptable for 1-2 week migration window MAX |
| Skip GraphQL response normalization, update parsers directly | Fewer code layers, "simpler" architecture | 60+ Code nodes to update, high bug rate, impossible to rollback | Never — normalization is mandatory |
| Use n8n workflow static data for ID lookup table | No external database needed | Static data unreliable (execution-scoped per ARCHITECTURE.md), lost on workflow reimport | Never — already documented as broken in CLAUDE.md |
| Implement feature flag routing in main workflow only | Easy to toggle, single point of control | Sub-workflows unaware of API source, error messages confusing | Acceptable if sub-workflows receive normalized responses |
| Skip schema introspection validation | Faster deployment, fewer dependencies | Silent breakage on Unraid updates, no early warning | Never — schema changes are inevitable |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| n8n GraphQL node | Using dedicated GraphQL node instead of HTTP Request node | Use HTTP Request node with POST to /graphql — better error handling, supports Header Auth credential |
| n8n Header Auth | Setting credential in HTTP Request node but forgetting to configure in sub-workflows | ALL 7 sub-workflows need credential configured, not inherited from main workflow |
| Unraid API authentication | Using environment variables directly in workflow expressions | Use n8n credential system, environment variables only for host URL |
| myunraid.net URL format | Including /graphql in UNRAID_HOST environment variable | Env var should be base URL only, append /graphql in HTTP Request node URL field |
| GraphQL error responses | Checking response.error like REST APIs | GraphQL returns HTTP 200 with errors array, check response.errors not response.error |
| Container ID format | Assuming IDs are interchangeable, treating them as opaque tokens | Validate ID format ^[a-f0-9]{64}:[a-f0-9]{64}$, store in typed fields |
| Docker 204 No Content | Assuming empty response = error | Empty response body with HTTP 204 = success per CLAUDE.md |

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Sequential GraphQL queries in loops | Batch operations timeout, linear slowdown | Use GraphQL query batching or parallel HTTP requests | 5+ containers in batch operation |
| No HTTP Request timeout configuration | Indefinite hangs, zombie workflows | Set explicit timeout on EVERY HTTP Request node (30-60 seconds) | First cloud relay hiccup |
| Callback data re-querying | Every inline keyboard tap queries full container list | Cache container state in callback data (within 64-byte limit) | 10+ active users, rate limiting kicks in |
| Missing retry logic for transient errors | Intermittent failures, user frustration | Implement exponential backoff retry (3 attempts, 1s → 2s → 4s delay) | Network instability, cloud relay rate limits |
| No operation result caching | Same container queried 5 times in single workflow execution | Cache query results in workflow execution context for 30 seconds | Complex workflows with multiple sub-workflow calls |

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Storing API key in workflow JSON | Credential exposure in git, logs, backups | Use n8n credential system exclusively, never hardcode |
| No API permission scope validation | Over-privileged API key, blast radius on compromise | Use minimal permission (DOCKER:UPDATE_ANY only), validate in workflow |
| Telegram user ID auth in single location | Bypass via direct sub-workflow execution | Implement auth check in EVERY sub-workflow, not just main |
| Logging full GraphQL responses | API key, sensitive container config in logs | Log only operation result, redact credentials from error messages |
| No rate limiting on bot commands | API key exhaustion, Unraid API rate limits | Implement per-user rate limiting (5 commands/minute), queue batched operations |

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| No latency indication | User unsure if command received, double-taps, duplicate operations | Send immediate "Processing..." message, update on completion |
| Generic error messages | "Operation failed" tells user nothing, can't self-recover | Parse Unraid API errors, show actionable message: "Container already stopped, current state: exited" |
| No migration communication | Users confused why bot slower after "upgrade" | Send broadcast message before cutover: "Bot migrating to Unraid API, expect 2-3x slower responses for improved reliability" |
| Hiding internet dependency | Users blame bot when ISP down | Error message: "Cannot reach Unraid API (requires internet), check network connection" |
| No rollback announcement | Users report bugs, developer fixes by rollback, users still see bugs (cache) | Announce rollbacks: "Rolled back to Docker socket, please retry failed operations" |

"Looks Done But Isn't" Checklist

  • Container actions: Often missing state validation BEFORE action — verify error message when stopping already-stopped container shows current state
  • GraphQL errors: Often missing response.errors array parsing — verify malformed query returns user-friendly message, not JSON dump
  • Timeout handling: Often missing client-side timeout — verify 2-minute operation shows progress indicator, doesn't appear hung
  • Credential expiration: Often missing 401 error detection — verify rotated API key returns "credential invalid" not generic error
  • Callback data encoding: Often missing length validation — verify longest possible container ID + action fits in 64 bytes
  • Schema validation: Often missing field existence checks — verify missing field returns helpful error, not "undefined is not a function"
  • Batch progress: Often missing incremental updates — verify batch operation shows "3/10 completed" updates, not just final result
  • Rollback procedure: Often missing documented steps — verify CLAUDE.md has exact commands to switch back to Docker socket proxy
  • Dual-credential sync: Often missing procedure to update both .env.unraid-api and n8n credential — verify documented workflow
  • Performance baseline: Often missing pre-migration metrics — verify recorded latency/success rate to compare post-migration

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| Container ID mismatch breaking all operations | HIGH (all operations broken) | 1. Rollback to Docker socket proxy immediately 2. Implement ID translation layer 3. Test with synthetic Unraid IDs 4. Re-deploy |
| myunraid.net relay outage | LOW (temporary, auto-recover) | 1. Wait for relay recovery OR 2. Implement LAN fallback if extended outage 3. Monitor status at connect.myunraid.net |
| GraphQL response parsing errors | MEDIUM (degraded functionality) | 1. Identify broken Code node from error logs 2. Add response schema logging 3. Fix parser 4. Redeploy affected sub-workflow |
| Schema changes breaking queries | MEDIUM (affected features broken) | 1. Query Unraid __schema endpoint 2. Compare to expected schema snapshot 3. Update queries to match current schema 4. Add missing field checks |
| Credential rotation killing bot | LOW (quick fix) | 1. Generate new API key in Unraid 2. Update n8n Header Auth credential 3. Reactivate workflow (auto-retries) |
| Sub-workflow timeout errors | LOW (increase timeouts) | 1. Identify timeout threshold from logs 2. Increase sub-workflow timeout by 3x 3. Add progress indicators 4. Redeploy |
| Race condition state conflicts | MEDIUM (requires code changes) | 1. Implement fresh state query before action 2. Handle "already in state" as success 3. Show actual state after operation |
| Dual-write inconsistency | HIGH (data integrity compromised) | 1. Choose source of truth (Docker OR Unraid) 2. Query truth source, discard other 3. Regenerate callback data 4. Force user refresh |
| Batch operation performance issues | MEDIUM (requires optimization) | 1. Implement GraphQL batching for reads 2. Add parallel processing for mutations 3. Stream progress updates |
| Callback data size exceeded | MEDIUM (redesign callback format) | 1. Implement ID shortening with lookup table 2. Update ALL Prepare Input nodes 3. Test all callback paths 4. Redeploy |

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| Container ID format mismatch | Phase 1: ID Mapping Layer | Test Docker ID fails validation, Unraid ID passes, translation correct |
| myunraid.net dependency | Phase 2: Network Resilience | Disconnect internet, verify fallback message or graceful degradation |
| GraphQL response structure | Phase 3: Response Normalization | Compare normalized output to Docker API shape, all fields present |
| Schema changes | Phase 4: Schema Validation | Change expected schema snapshot, verify detection on next workflow run |
| Credential rotation | Phase 5: Auth Resilience | Rotate API key, verify 401 error message user-friendly and actionable |
| Sub-workflow timeouts | Phase 6: Timeout Hardening | Simulate 2-minute operation, verify progress indicator and completion |
| Race conditions | Phase 7: State Consistency | Two users stop same container simultaneously, verify conflict resolution |
| Dual-write inconsistency | Phase 8: Cutover Strategy | Query both APIs during cutover, verify consistent results |
| Batch performance | Phase 9: Batch Optimization | Update 10 containers, verify completion <60 seconds with progress |
| Callback data size | Phase 2: Callback Encoding | Generate callback with longest ID, verify <64 bytes |

Sources

Unraid Specific:

  • Unraid Connect overview & setup | Unraid Docs
  • Project ARCHITECTURE.md (verified container ID format, field behaviors, myunraid.net requirement)
  • Project CLAUDE.md (Docker API patterns, n8n conventions, static data limitations)

Pitfalls research for: Unraid Docker Manager — Docker Socket to GraphQL API Migration
Researched: 2026-02-09
Confidence: MEDIUM (verified Unraid-specific issues HIGH, general GraphQL patterns MEDIUM, n8n integration issues HIGH)