
Pitfalls Research

Domain: Migration from Docker Socket Proxy to Unraid GraphQL API
Researched: 2026-02-09
Confidence: MEDIUM (mixture of verified Unraid-specific issues and general GraphQL migration patterns)

Critical Pitfalls

Pitfall 1: Container ID Format Mismatch Breaking All Operations

What goes wrong: All container operations fail with "container not found" errors despite the containers existing. Docker uses 12-character hex IDs (8a9907a24576); Unraid GraphQL uses a PrefixedID format ({server_hash}:{container_hash} — two 64-character SHA256 hex strings joined by a colon). Passing Docker IDs to the Unraid API, or vice versa, results in complete operation failure.

Why it happens: The migration assumes container IDs are interchangeable between systems. Developers test lookup operations that succeed (name-based resolution) and miss that action operations using cached Docker IDs will fail when routed to the Unraid API. The 290-node workflow system uses Execute Workflow nodes that pass containerId between sub-workflows — if any node still uses Docker IDs after cutover, errors propagate silently through the chain.

How to avoid:

  1. Create container ID translation layer BEFORE migration (Phase 1)
  2. Add runtime validation: reject IDs not matching ^[a-f0-9]{64}:[a-f0-9]{64}$ pattern
  3. Update ALL 17 Execute Workflow input preparation nodes to use Unraid ID format
  4. Store ONLY Unraid PrefixedIDs in callback data after migration
  5. Test with containers having similar names but different IDs
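
A minimal sketch of steps 1 and 2, in plain n8n Code-node JavaScript. The Unraid response field shapes (id, names) are illustrative assumptions; the translation layer keys on container name, the one stable identifier both APIs share:

```javascript
// PrefixedID per ARCHITECTURE.md: {64-hex server hash}:{64-hex container hash}
const UNRAID_ID_RE = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
const DOCKER_ID_RE = /^[a-f0-9]{12}$/;

// Runtime validation gate: reject anything that is not a Unraid PrefixedID,
// and call out legacy Docker IDs explicitly so the error names the real cause.
function assertUnraidId(id) {
  if (UNRAID_ID_RE.test(id)) return id;
  if (DOCKER_ID_RE.test(id)) {
    throw new Error(`Legacy 12-char Docker ID "${id}" used after cutover; translate it first`);
  }
  throw new Error(`Unrecognized container ID format: "${id}"`);
}

// Translation layer: map short Docker IDs to Unraid PrefixedIDs by name.
// Both APIs may prefix names with "/", so strip it before matching.
function buildIdMap(dockerContainers, unraidContainers) {
  const byName = new Map(
    unraidContainers.map(c => [c.names[0].replace(/^\//, ''), c.id])
  );
  const map = {};
  for (const c of dockerContainers) {
    const name = c.Names[0].replace(/^\//, '');
    if (byName.has(name)) map[c.Id.slice(0, 12)] = byName.get(name);
  }
  return map;
}
```

The validation gate belongs in every Prepare Input node; the map is built once per execution from a paired listing of both APIs.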

Warning signs:

  • Operations succeed via text commands (resolve by name) but fail via inline keyboard callbacks (use cached IDs)
  • HTTP 400 "invalid container ID format" errors from Unraid API
  • Batch operations fail for some containers but not others
  • Telegram callback data still contains 12-character hex strings after cutover

Phase to address: Phase 1 (Container ID Mapping Layer) — MUST complete before any live API calls


Pitfall 2: myunraid.net Cloud Relay Internet Dependency Kills Local Network Operations

What goes wrong: Bot becomes completely non-functional during internet outages despite both Unraid server and n8n container being on the same LAN. Users lose container management capability when they need it most (troubleshooting network issues). The system goes from zero-latency local Docker socket access (sub-10ms) to 200-500ms cloud relay latency, or complete failure if Unraid's cloud relay service has an outage.

Why it happens: Direct LAN IP access fails because Unraid's nginx redirects HTTP→HTTPS and strips auth headers on redirect. Developers choose myunraid.net cloud relay as "working solution" without implementing fallback strategy. The ARCHITECTURE.md documents this as the solution, not a compromise.

How to avoid:

  1. Implement dual-path fallback: attempt direct HTTPS with proper SSL handling first, fall back to myunraid.net if connection fails
  2. Add network connectivity pre-flight check before each API call batch
  3. Expose degraded mode: if cloud relay unavailable, switch back to Docker socket proxy (requires keeping proxy running during migration period)
  4. Monitor myunraid.net relay latency and availability as first-class metrics
  5. Document internet dependency in user-facing error messages
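
The dual-path idea (step 1) can be sketched as follows. httpPost stands in for whatever HTTP client the workflow uses, and both URLs are placeholders; real direct-LAN access still needs the SSL/redirect handling described above:

```javascript
// Try the direct LAN endpoint first, fall back to the myunraid.net relay.
// Collect per-endpoint errors so the user-facing message can explain
// exactly which path failed and why.
async function graphqlWithFallback(query, httpPost, urls) {
  const errors = [];
  for (const target of [urls.lan, urls.relay]) {
    try {
      return { target, data: await httpPost(target, query) };
    } catch (err) {
      errors.push(`${target}: ${err.message}`);
    }
  }
  throw new Error(`All Unraid endpoints failed:\n${errors.join('\n')}`);
}
```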

Warning signs:

  • Timeout errors during internet outage testing
  • Latency spikes visible in execution logs (compare pre/post migration)
  • Users report "bot stopped working" correlated with ISP issues
  • Unraid server reachable via LAN but bot reports "cannot connect"

Phase to address: Phase 2 (Network Resilience Strategy) — BEFORE cutover, implement fallback mechanism


Pitfall 3: GraphQL Query Result Structure Changes Break Response Parsing

What goes wrong: Bot sends commands but returns garbled data, shows empty container lists, or crashes on status checks. Field name changes (state: "RUNNING" vs status: "running"), nested structure differences (Docker's flat JSON vs GraphQL's nested response), and uppercase/lowercase variations break parsing logic across 60 Code nodes in the main workflow.

Why it happens: Docker REST API returns flat JSON arrays. GraphQL returns nested { data: { docker: { containers: [...] } } } structure. Developers update a few obvious parsing nodes but miss edge cases in error handling, batch processing, and inline keyboard builders. The codebase already has field behavior documentation warnings (state values are UPPERCASE, names prefixed with /) suggesting parsing brittleness.

How to avoid:

  1. Create GraphQL response normalization layer that transforms Unraid responses to match Docker API shape
  2. Add response schema validation in EVERY HTTP Request node (n8n's JSON schema validation)
  3. Test response parsing independently from workflow logic (unit test the Code nodes)
  4. Document ALL field format differences in normalization layer comments
  5. Use TypeScript types for response shapes (n8n Code nodes support TypeScript)
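
A sketch of the normalization layer (step 1), reshaping the nested GraphQL response into the flat Docker-API-like array the existing Code nodes expect. The GraphQL field names (data.docker.containers, state, names) are assumptions based on the field behaviors documented above:

```javascript
// Transform an Unraid GraphQL response into Docker-REST-shaped items.
function normalizeContainers(graphqlResponse) {
  // GraphQL errors arrive with HTTP 200; surface them explicitly here
  // so downstream nodes never see a half-valid payload.
  if (graphqlResponse.errors && graphqlResponse.errors.length) {
    throw new Error(`GraphQL error: ${graphqlResponse.errors[0].message}`);
  }
  const containers = graphqlResponse.data?.docker?.containers ?? [];
  return containers.map(c => ({
    Id: c.id,
    // Names may carry Docker's leading "/"; strip to match existing parsers
    Names: (c.names || []).map(n => n.replace(/^\//, '')),
    // Unraid state values are UPPERCASE ("RUNNING"); Docker uses lowercase
    State: (c.state || '').toLowerCase(),
    Image: c.image,
  }));
}
```

Putting this in one shared Code node means the 60 downstream parsers never change, which is the whole point of the layer.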

Warning signs:

  • Container list shows but names display as undefined or [object Object]
  • Status command returns "running" for stopped containers or vice versa
  • Batch selection keyboard shows wrong container names
  • Error messages contain GraphQL error structure (response.errors[0].message) instead of friendly text

Phase to address: Phase 3 (Response Schema Normalization) — BEFORE touching any sub-workflow, build and test normalization


Pitfall 4: Unraid GraphQL Schema Changes Silently Break Operations

What goes wrong: Operations that worked yesterday fail today with cryptic errors. Unraid's GraphQL schema evolves (field additions, deprecations, type changes) but the bot has no detection mechanism. The ARCHITECTURE.md already documents one schema discrepancy: isUpdateAvailable field documented in Phase 14 research does NOT exist in actual Unraid 7.2 schema.

Why it happens: GraphQL schemas evolve continuously (additive changes, deprecations) per best practices. Unlike REST API versioning (breaking changes = new /v2/ endpoint), GraphQL encourages in-place evolution. Phase 14 research used outdated/incorrect sources. No schema introspection validation in the deployment pipeline means schema mismatches only surface as runtime errors.

How to avoid:

  1. Implement schema introspection check at workflow startup (query __schema endpoint)
  2. Store expected schema snapshot in repo, compare on deployment
  3. Add field existence checks BEFORE using optional fields in queries
  4. Use GraphQL Inspector or similar tooling in CI/CD to detect breaking changes
  5. Subscribe to Unraid API changelog/release notes
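
Steps 1-3 can be combined into a startup check: introspect the fields of a type and diff them against a snapshot committed to the repo. The introspection query below is standard GraphQL; the expected field list is whatever the snapshot contains:

```javascript
// Standard GraphQL introspection: list the fields of a named type.
const INTROSPECT_TYPE = `
  query TypeFields($name: String!) {
    __type(name: $name) { fields { name } }
  }`;

// Diff introspected fields against the committed snapshot. Missing
// fields will break queries that reference them; added fields are
// informational (additive schema evolution).
function diffSchemaFields(introspectionResult, expectedFields) {
  const actual = new Set(
    (introspectionResult.data.__type?.fields ?? []).map(f => f.name)
  );
  return {
    missing: expectedFields.filter(f => !actual.has(f)),
    added: [...actual].filter(f => !expectedFields.includes(f)),
  };
}
```

Run at workflow startup; a non-empty missing list is exactly the isUpdateAvailable class of failure caught before any user hits it.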

Warning signs:

  • New Unraid version installed, bot starts throwing "unknown field" errors
  • Operations succeed on test server (older Unraid) but fail on production (newer Unraid)
  • GraphQL returns errors: [{ message: "Cannot query field 'X' on type 'Y'" }]
  • Update status sync stops working after Unraid update

Phase to address: Phase 4 (Schema Validation Layer) — Add introspection checks, implement before full cutover


Pitfall 5: Credential Rotation Kills Bot Mid-Operation

What goes wrong: Bot stops responding to all commands. Unraid admin rotates API key for security hygiene (recommended practice for 2026), but n8n's "Unraid API Key" Header Auth credential still uses old key. All GraphQL requests return 401 Unauthorized. The dual-credential system (.env.unraid-api for CLI testing + n8n Header Auth for workflows) means updating one doesn't update the other.

Why it happens: 2026 security best practices mandate regular credential rotation. API keys "remain valid forever unless someone revokes or rotates them manually" per research. The system uses TWO separate credential stores that must be manually synchronized. No monitoring detects credential expiration. Unraid doesn't warn before rotating keys.

How to avoid:

  1. Consolidate credential storage: use ONLY n8n Header Auth, remove .env.unraid-api CLI pattern
  2. Implement 401 error detection with user-friendly message: "API key invalid, check Unraid API Keys settings"
  3. Add credential validation endpoint check on workflow startup
  4. Document credential rotation procedure in CLAUDE.md and user docs
  5. Consider OAuth 2.0 migration if Unraid adds support (more rotation-friendly)
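
A sketch of the 401 detection from step 2. The error-object field names (httpCode, statusCode) are assumptions about the HTTP Request node's error output; adapt to the actual payload:

```javascript
// Translate an auth failure into the actionable message step 2 calls for.
// Returns null for anything that is not a 401, so generic error handling
// can take over.
function friendlyAuthError(err) {
  const status = err.httpCode ?? err.statusCode;
  if (Number(status) === 401) {
    return 'Unraid API key invalid or rotated. Generate a new key in the '
      + 'Unraid API Keys settings, then update the "Unraid API Key" '
      + 'Header Auth credential in n8n.';
  }
  return null;
}
```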

Warning signs:

  • All GraphQL operations fail with 401 errors
  • Bot worked yesterday, stopped today without code changes
  • CLI testing with .env.unraid-api works but workflows fail (keys out of sync)
  • Unraid API Keys page shows "Last used: N days ago" with large N value

Phase to address: Phase 5 (Authentication Resilience) — Implement before cutover, add monitoring


Pitfall 6: Sub-Workflow Timeout Errors Lost in Propagation

What goes wrong: User triggers container update, bot appears to hang, no error message returned. After 2 minutes, execution silently fails. Logs show sub-workflow timeout but main workflow never receives error. User retries, creates duplicate operations. Known n8n issue: "Execute Workflow node ignores the timeout of the sub-workflow."

Why it happens: n8n Execute Workflow nodes don't properly propagate sub-workflow timeout errors to parent workflow. Cloud relay adds 200-500ms latency per request. Update operations (pull image, recreate container) that completed in 10-30 seconds with local Docker socket now take 60-120 seconds. Default timeout becomes too aggressive, but timeout errors don't surface to user.

How to avoid:

  1. Increase ALL sub-workflow timeouts by 3-5x to account for cloud relay latency
  2. Implement client-side timeout in main workflow (Code node timestamp checks)
  3. Add progress indicators for long-running operations (Telegram "typing" action every 10 seconds)
  4. Configure HTTP Request node timeouts explicitly (don't rely on workflow-level timeout)
  5. Test timeouts with network throttling simulation
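
The client-side timeout from step 2 is just a timestamp check in the main workflow, since Execute Workflow will not propagate the sub-workflow's own timeout. startedAt would be stamped in the Prepare Input node; the budget is the old Docker-socket timing scaled 3-5x for relay latency:

```javascript
// Deadline check for a Code node in the main workflow. Returns either
// a user-facing timeout message or the remaining budget.
function checkDeadline(startedAtMs, budgetMs, nowMs = Date.now()) {
  const elapsed = nowMs - startedAtMs;
  if (elapsed > budgetMs) {
    return {
      timedOut: true,
      message: `Operation exceeded its ${Math.round(budgetMs / 1000)}s budget `
        + `(elapsed ${Math.round(elapsed / 1000)}s); the sub-workflow may still be running.`,
    };
  }
  return { timedOut: false, remainingMs: budgetMs - elapsed };
}
```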

Warning signs:

  • Update operations show "executing" for 2+ minutes then disappear
  • Execution logs show sub-workflow timeout but no error message sent to user
  • User reports "bot doesn't respond to update commands"
  • Success rate drops for slow operations (image pull, large container recreate)

Phase to address: Phase 6 (Timeout Hardening) — Adjust before cutover, test under latency


Pitfall 7: Race Condition Between Container State Query and Action Execution

What goes wrong: User issues "stop plex" command. Bot queries container list (container running), sends stop command, but container already stopped by another process (Unraid WebGUI, another bot user). Unraid API returns error "container not running" but bot displays "successfully stopped." Callback data contains stale container state from 30 seconds ago (Telegram message edit cycle).

Why it happens: GraphQL query and mutation are separate HTTP requests with 200-500ms cloud relay latency each. Container state can change between query and action. Docker socket proxy had sub-10ms latency making race conditions rare. Telegram inline keyboards cache container state in callback data (64-byte limit prevents re-querying). Multiple users can trigger conflicting actions on same container.

How to avoid:

  1. Implement optimistic locking: query container state immediately before action, abort if state changed
  2. Add version/timestamp to callback data, reject stale callbacks (>30 seconds old)
  3. Handle "already in target state" as success (304 pattern from Docker API)
  4. Query fresh state after action completes, show actual result to user
  5. Add conflict detection: if action fails with state error, query and show current state
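
Steps 1, 3, and 4 combine into one guard: re-query state immediately before the mutation, treat "already in target state" as success (the 304 pattern), and report the post-action state rather than the assumed one. fetchState and mutate are caller-supplied; the names are illustrative:

```javascript
// Optimistic-lock wrapper for a container action.
async function guardedAction(containerId, targetState, fetchState, mutate) {
  const current = await fetchState(containerId);
  if (current === targetState) {
    // Another actor already did it; success, nothing to do.
    return { ok: true, changed: false, state: current };
  }
  await mutate(containerId);
  // Query fresh state and show the actual result to the user.
  const after = await fetchState(containerId);
  return { ok: after === targetState, changed: true, state: after };
}
```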

Warning signs:

  • "Successfully stopped X" message but container still running when user checks status
  • Action commands fail with "container already stopped/started" errors
  • Batch operations report success but some containers in wrong state
  • Multiple users report conflicts when managing same container

Phase to address: Phase 7 (State Consistency Layer) — Implement before cutover, critical for multi-user


Pitfall 8: Dual-Write Period Data Inconsistency

What goes wrong: During migration cutover, some operations write to Docker API, others to Unraid API. Container list query returns different results depending on which API responded. Status updates go to Unraid but actions go to Docker, creating split-brain state. Rollback impossible because no single source of truth exists.

Why it happens: Phased migration requires running both systems simultaneously. Developer enables feature flag to route reads to Unraid but keeps writes on Docker for safety. Cache invalidation becomes impossible — Docker changes invisible to Unraid queries, Unraid changes invisible to Docker queries. Callback data mixes Docker IDs and Unraid IDs from different query sources.

How to avoid:

  1. Implement write-forwarding: Docker writes also trigger Unraid API updates (or vice versa)
  2. Route ALL traffic through abstraction layer that handles dual-write internally
  3. Keep cutover window SHORT (hours not days) to minimize inconsistency window
  4. Use feature flag for routing but maintain single source of truth (either Docker OR Unraid)
  5. Add request tracing to identify which API served each operation

Warning signs:

  • Status command shows different container list than batch selection keyboard
  • Container appears stopped in one interface, running in another
  • Update operation succeeds but status doesn't refresh in Unraid WebGUI
  • Rollback leaves orphaned container state (metadata mismatch between APIs)

Phase to address: Phase 8 (Cutover Strategy) — Plan before implementation starts, execution in final phase


Pitfall 9: GraphQL Batching vs n8n Batch Processing Confusion

What goes wrong: Batch update operations (update all :latest containers) that processed 10 containers in 30 seconds now take 5+ minutes or timeout. Each container update triggers separate GraphQL HTTP Request → 10 containers = 10 round-trips through cloud relay. Response body parsing fails because developer assumes GraphQL response batching (send multiple queries in single request) but implements n8n batch processing (loop through items).

Why it happens: n8n's batching (Items per Batch setting on HTTP Request node) is for rate limiting, NOT efficient batching. GraphQL supports query batching but requires specific request format. Cloud relay latency multiplied by sequential operations destroys performance. Docker socket proxy had negligible latency so sequential operations were acceptable.

How to avoid:

  1. Use GraphQL batching for reads: single request with multiple container queries
  2. Keep mutations sequential (safer) but add parallel processing for independent operations
  3. Configure n8n HTTP Request node batching: 3-5 items per batch, 500ms interval
  4. Add progress streaming: update Telegram message after each container (don't wait for all)
  5. Implement timeout circuit breaker: abort batch if any single operation takes >60 seconds
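
Read-batching (step 1) means building one GraphQL document with aliased fields, so ten containers cost one relay round-trip instead of ten. The container(id:) field name is an assumption about the Unraid schema:

```javascript
// Build a single query that fetches state for many containers via
// aliases (c0, c1, ...) instead of one HTTP request per container.
function buildBatchedQuery(ids) {
  const parts = ids.map((id, i) =>
    `c${i}: container(id: "${id}") { id state }`);
  return `query BatchStates { ${parts.join(' ')} }`;
}
```

This is distinct from n8n's Items-per-Batch setting, which only throttles separate requests; it never merges them.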

Warning signs:

  • Batch operations work for 2-3 containers but timeout for 10+
  • Linear performance degradation (10 containers takes 10x longer than 1)
  • n8n execution logs show sequential HTTP requests with 500ms gaps
  • User cancels batch operations because they appear hung

Phase to address: Phase 9 (Batch Performance Optimization) — After basic operations work, before batch features enabled


Pitfall 10: Telegram Callback Data Size Limit Breaks With Longer IDs

What goes wrong: Inline keyboard buttons stop working. User taps the "Stop" button on a container status page and nothing happens. Logs show "callback data exceeds 64 bytes" errors. Docker IDs (12 chars) fit the callback format stop:8a9907a24576; Unraid PrefixedIDs (129 chars) do not: stop:{64-char-hash}:{64-char-hash} is 134 bytes, so the ID alone exceeds the entire 64-byte budget.

Why it happens: Telegram's 64-byte callback data limit was manageable with Docker IDs. System already uses bitmap encoding for batch selection (base36 BigInt), but single-container operations still use colon-delimited format. Migration assumes callback format unchanged, doesn't account for 10x ID length increase.

How to avoid:

  1. Implement container ID shortening: store PrefixedID lookup table in workflow static data, use index in callback
  2. Alternative: hash PrefixedID to 8-character base62 string, store mapping
  3. Update callback format: s:idx where idx is lookup key, not full container ID
  4. Test ALL callback patterns (status, actions, confirmation, batch) with Unraid IDs
  5. Implement callback data size validation in Prepare Input nodes
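
Steps 1, 3, and 5 together look roughly like this: the callback carries a base36 index into a lookup table instead of the PrefixedID, with a size check at encode time. Where the table persists is a separate problem (static data is unreliable per CLAUDE.md):

```javascript
// Encode "action:index" callbacks that stay far under Telegram's 64-byte
// limit regardless of ID length.
function encodeCallback(action, containerId, table) {
  let idx = table.indexOf(containerId);
  if (idx === -1) idx = table.push(containerId) - 1;
  const data = `${action}:${idx.toString(36)}`;
  // Size validation belongs in every Prepare Input node.
  if (Buffer.byteLength(data, 'utf8') > 64) {
    throw new Error(`Callback data too long: ${data}`);
  }
  return data;
}

// Reverse the lookup when the callback query arrives.
function decodeCallback(data, table) {
  const [action, idx36] = data.split(':');
  return { action, containerId: table[parseInt(idx36, 36)] };
}
```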

Warning signs:

  • Callback queries fail silently (no error to user)
  • n8n logs show "callback data size exceeded" errors
  • Inline keyboard buttons work for containers with short names, fail for others
  • Parse Callback Data node returns truncated IDs

Phase to address: Phase 2 (Callback Data Encoding) — Parallel to Phase 1, before any inline keyboard migration


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Keep Docker socket proxy running during migration, route errors back to it | Zero-downtime cutover, instant rollback | Maintenance burden, two credential systems, split-brain debugging | Acceptable for 1-2 week migration window MAX |
| Skip GraphQL response normalization, update parsers directly | Fewer code layers, "simpler" architecture | 60+ Code nodes to update, high bug rate, impossible to rollback | Never — normalization is mandatory |
| Use n8n workflow static data for ID lookup table | No external database needed | Static data unreliable (execution-scoped per ARCHITECTURE.md), lost on workflow reimport | Never — already documented as broken in CLAUDE.md |
| Implement feature flag routing in main workflow only | Easy to toggle, single point of control | Sub-workflows unaware of API source, error messages confusing | Acceptable if sub-workflows receive normalized responses |
| Skip schema introspection validation | Faster deployment, fewer dependencies | Silent breakage on Unraid updates, no early warning | Never — schema changes are inevitable |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| n8n GraphQL node | Using dedicated GraphQL node instead of HTTP Request node | Use HTTP Request node with POST to /graphql — better error handling, supports Header Auth credential |
| n8n Header Auth | Setting credential in HTTP Request node but forgetting to configure in sub-workflows | ALL 7 sub-workflows need credential configured, not inherited from main workflow |
| Unraid API authentication | Using environment variables directly in workflow expressions | Use n8n credential system, environment variables only for host URL |
| myunraid.net URL format | Including /graphql in UNRAID_HOST environment variable | Env var should be base URL only, append /graphql in HTTP Request node URL field |
| GraphQL error responses | Checking response.error like REST APIs | GraphQL returns HTTP 200 with errors array, check response.errors not response.error |
| Container ID format | Assuming IDs are interchangeable, treating them as opaque tokens | Validate ID format ^[a-f0-9]{64}:[a-f0-9]{64}$, store in typed fields |
| Docker 204 No Content | Assuming empty response = error | Empty response body with HTTP 204 = success per CLAUDE.md |

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Sequential GraphQL queries in loops | Batch operations timeout, linear slowdown | Use GraphQL query batching or parallel HTTP requests | 5+ containers in batch operation |
| No HTTP Request timeout configuration | Indefinite hangs, zombie workflows | Set explicit timeout on EVERY HTTP Request node (30-60 seconds) | First cloud relay hiccup |
| Callback data re-querying | Every inline keyboard tap queries full container list | Cache container state in callback data (within 64-byte limit) | 10+ active users, rate limiting kicks in |
| Missing retry logic for transient errors | Intermittent failures, user frustration | Implement exponential backoff retry (3 attempts, 1s → 2s → 4s delay) | Network instability, cloud relay rate limits |
| No operation result caching | Same container queried 5 times in single workflow execution | Cache query results in workflow execution context for 30 seconds | Complex workflows with multiple sub-workflow calls |

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Storing API key in workflow JSON | Credential exposure in git, logs, backups | Use n8n credential system exclusively, never hardcode |
| No API permission scope validation | Over-privileged API key, blast radius on compromise | Use minimal permission (DOCKER:UPDATE_ANY only), validate in workflow |
| Telegram user ID auth in single location | Bypass via direct sub-workflow execution | Implement auth check in EVERY sub-workflow, not just main |
| Logging full GraphQL responses | API key, sensitive container config in logs | Log only operation result, redact credentials from error messages |
| No rate limiting on bot commands | API key exhaustion, Unraid API rate limits | Implement per-user rate limiting (5 commands/minute), queue batched operations |

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| No latency indication | User unsure if command received, double-taps, duplicate operations | Send immediate "Processing..." message, update on completion |
| Generic error messages | "Operation failed" tells user nothing, can't self-recover | Parse Unraid API errors, show actionable message: "Container already stopped, current state: exited" |
| No migration communication | Users confused why bot slower after "upgrade" | Send broadcast message before cutover: "Bot migrating to Unraid API, expect 2-3x slower responses for improved reliability" |
| Hiding internet dependency | Users blame bot when ISP down | Error message: "Cannot reach Unraid API (requires internet), check network connection" |
| No rollback announcement | Users report bugs, developer fixes by rollback, users still see bugs (cache) | Announce rollbacks: "Rolled back to Docker socket, please retry failed operations" |

"Looks Done But Isn't" Checklist

  • Container actions: Often missing state validation BEFORE action — verify error message when stopping already-stopped container shows current state
  • GraphQL errors: Often missing response.errors array parsing — verify malformed query returns user-friendly message, not JSON dump
  • Timeout handling: Often missing client-side timeout — verify 2-minute operation shows progress indicator, doesn't appear hung
  • Credential expiration: Often missing 401 error detection — verify rotated API key returns "credential invalid" not generic error
  • Callback data encoding: Often missing length validation — verify longest possible container ID + action fits in 64 bytes
  • Schema validation: Often missing field existence checks — verify missing field returns helpful error, not "undefined is not a function"
  • Batch progress: Often missing incremental updates — verify batch operation shows "3/10 completed" updates, not just final result
  • Rollback procedure: Often missing documented steps — verify CLAUDE.md has exact commands to switch back to Docker socket proxy
  • Dual-credential sync: Often missing procedure to update both .env.unraid-api and n8n credential — verify documented workflow
  • Performance baseline: Often missing pre-migration metrics — verify recorded latency/success rate to compare post-migration

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| Container ID mismatch breaking all operations | HIGH (all operations broken) | 1. Rollback to Docker socket proxy immediately 2. Implement ID translation layer 3. Test with synthetic Unraid IDs 4. Re-deploy |
| myunraid.net relay outage | LOW (temporary, auto-recover) | 1. Wait for relay recovery OR 2. Implement LAN fallback if extended outage 3. Monitor status at connect.myunraid.net |
| GraphQL response parsing errors | MEDIUM (degraded functionality) | 1. Identify broken Code node from error logs 2. Add response schema logging 3. Fix parser 4. Redeploy affected sub-workflow |
| Schema changes breaking queries | MEDIUM (affected features broken) | 1. Query Unraid __schema endpoint 2. Compare to expected schema snapshot 3. Update queries to match current schema 4. Add missing field checks |
| Credential rotation killing bot | LOW (quick fix) | 1. Generate new API key in Unraid 2. Update n8n Header Auth credential 3. Reactivate workflow (auto-retries) |
| Sub-workflow timeout errors | LOW (increase timeouts) | 1. Identify timeout threshold from logs 2. Increase sub-workflow timeout by 3x 3. Add progress indicators 4. Redeploy |
| Race condition state conflicts | MEDIUM (requires code changes) | 1. Implement fresh state query before action 2. Handle "already in state" as success 3. Show actual state after operation |
| Dual-write inconsistency | HIGH (data integrity compromised) | 1. Choose source of truth (Docker OR Unraid) 2. Query truth source, discard other 3. Regenerate callback data 4. Force user refresh |
| Batch operation performance issues | MEDIUM (requires optimization) | 1. Implement GraphQL batching for reads 2. Add parallel processing for mutations 3. Stream progress updates |
| Callback data size exceeded | MEDIUM (redesign callback format) | 1. Implement ID shortening with lookup table 2. Update ALL Prepare Input nodes 3. Test all callback paths 4. Redeploy |

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| Container ID format mismatch | Phase 1: ID Mapping Layer | Test Docker ID fails validation, Unraid ID passes, translation correct |
| myunraid.net dependency | Phase 2: Network Resilience | Disconnect internet, verify fallback message or graceful degradation |
| GraphQL response structure | Phase 3: Response Normalization | Compare normalized output to Docker API shape, all fields present |
| Schema changes | Phase 4: Schema Validation | Change expected schema snapshot, verify detection on next workflow run |
| Credential rotation | Phase 5: Auth Resilience | Rotate API key, verify 401 error message user-friendly and actionable |
| Sub-workflow timeouts | Phase 6: Timeout Hardening | Simulate 2-minute operation, verify progress indicator and completion |
| Race conditions | Phase 7: State Consistency | Two users stop same container simultaneously, verify conflict resolution |
| Dual-write inconsistency | Phase 8: Cutover Strategy | Query both APIs during cutover, verify consistent results |
| Batch performance | Phase 9: Batch Optimization | Update 10 containers, verify completion <60 seconds with progress |
| Callback data size | Phase 2: Callback Encoding | Generate callback with longest ID, verify <64 bytes |

Sources

Unraid Specific:

  • Unraid Connect overview & setup | Unraid Docs
  • Project ARCHITECTURE.md (verified container ID format, field behaviors, myunraid.net requirement)
  • Project CLAUDE.md (Docker API patterns, n8n conventions, static data limitations)

Pitfalls research for: Unraid Docker Manager — Docker Socket to GraphQL API Migration
Researched: 2026-02-09
Confidence: MEDIUM (verified Unraid-specific issues HIGH, general GraphQL patterns MEDIUM, n8n integration issues HIGH)