docs: complete v1.4 project research synthesis

Lucas Berger
2026-02-09 08:08:25 -05:00
parent bb47664188
commit bab819f6c8
5 changed files with 2013 additions and 1453 deletions
# Pitfalls Research
**Domain:** Migration from Docker Socket Proxy to Unraid GraphQL API
**Researched:** 2026-02-09
**Confidence:** MEDIUM (mixture of verified Unraid-specific issues and general GraphQL migration patterns)
## Critical Pitfalls
### Pitfall 1: Container ID Format Mismatch Breaking All Operations
**What goes wrong:**
All container operations fail with "container not found" errors despite the containers existing. Docker uses 12-character short-form hex IDs (`8a9907a24576`); Unraid GraphQL uses the PrefixedID format (`{server_hash}:{container_hash}` — two 64-character SHA256 strings). Passing Docker IDs to the Unraid API, or vice versa, results in complete operation failure.
**Why it happens:**
Migration assumes container IDs are interchangeable between systems. Developers test lookup operations that succeed (name-based), miss that action operations using cached Docker IDs will fail when routed to Unraid API. The 290-node workflow system uses Execute Workflow nodes that pass containerId between sub-workflows — if any node still uses Docker IDs after cutover, errors propagate silently through the chain.
**How to avoid:**
1. Create container ID translation layer BEFORE migration (Phase 1)
2. Add runtime validation: reject IDs not matching `^[a-f0-9]{64}:[a-f0-9]{64}$` pattern
3. Update ALL 17 Execute Workflow input preparation nodes to use Unraid ID format
4. Store ONLY Unraid PrefixedIDs in callback data after migration
5. Test with containers having similar names but different IDs
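The guard in step 2 can be sketched as a small validator for an n8n Code node (function names are illustrative, not from the codebase):

```javascript
// Hypothetical guard: reject IDs in the wrong format before they
// reach either API. Regexes follow the ID shapes described above.
const UNRAID_PREFIXED_ID = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
const DOCKER_SHORT_ID = /^[a-f0-9]{12}$/;

function classifyContainerId(id) {
  if (UNRAID_PREFIXED_ID.test(id)) return 'unraid';
  if (DOCKER_SHORT_ID.test(id)) return 'docker';
  return 'unknown';
}

function assertUnraidId(id) {
  const kind = classifyContainerId(id);
  if (kind !== 'unraid') {
    // Surfacing the detected format makes stale Docker IDs in
    // callback data easy to spot in execution logs.
    throw new Error(`Expected Unraid PrefixedID, got ${kind} ID: ${id}`);
  }
  return id;
}
```

Running `assertUnraidId` in every Execute Workflow input preparation node turns silent propagation (step 3's failure mode) into an immediate, attributable error.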
**Warning signs:**
- Operations succeed via text commands (resolve by name) but fail via inline keyboard callbacks (use cached IDs)
- HTTP 400 "invalid container ID format" errors from Unraid API
- Batch operations fail for some containers but not others
- Telegram callback data still contains 12-character hex strings after cutover
**Phase to address:**
Phase 1 (Container ID Mapping Layer) — MUST complete before any live API calls
---
### Pitfall 2: myunraid.net Cloud Relay Internet Dependency Kills Local Network Operations
**What goes wrong:**
Bot becomes completely non-functional during internet outages despite both Unraid server and n8n container being on the same LAN. Users lose container management capability when they need it most (troubleshooting network issues). The system goes from zero-latency local Docker socket access (sub-10ms) to 200-500ms cloud relay latency, or complete failure if Unraid's cloud relay service has an outage.
**Why it happens:**
Direct LAN IP access fails because Unraid's nginx redirects HTTP→HTTPS and strips auth headers on redirect. Developers choose myunraid.net cloud relay as "working solution" without implementing fallback strategy. The ARCHITECTURE.md documents this as the solution, not a compromise.
**How to avoid:**
1. Implement dual-path fallback: attempt direct HTTPS with proper SSL handling first, fall back to myunraid.net if connection fails
2. Add network connectivity pre-flight check before each API call batch
3. Expose degraded mode: if cloud relay unavailable, switch back to Docker socket proxy (requires keeping proxy running during migration period)
4. Monitor myunraid.net relay latency and availability as first-class metrics
5. Document internet dependency in user-facing error messages
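The dual-path fallback in step 1 reduces to "try transports in order, return the first success." A minimal sketch, assuming the transports are async wrappers around the direct-LAN and myunraid.net endpoints (names and call shapes are hypothetical):

```javascript
// Try each transport in order (direct LAN HTTPS first, cloud relay
// second); collect failures so the final error explains the whole chain.
async function requestWithFallback(transports, payload) {
  const errors = [];
  for (const t of transports) {
    try {
      const result = await t(payload);
      return { via: t.name, result }; // record which path served the call
    } catch (err) {
      errors.push(`${t.name}: ${err.message}`);
    }
  }
  throw new Error(`All transports failed — ${errors.join('; ')}`);
}
```

In n8n this would wrap the HTTP Request calls; logging `via` gives the latency/availability metrics that step 4 asks for.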
**Warning signs:**
- Timeout errors during internet outage testing
- Latency spikes visible in execution logs (compare pre/post migration)
- Users report "bot stopped working" correlated with ISP issues
- Unraid server reachable via LAN but bot reports "cannot connect"
**Phase to address:**
Phase 2 (Network Resilience Strategy) — BEFORE cutover, implement fallback mechanism
---
### Pitfall 3: GraphQL Query Result Structure Changes Break Response Parsing
**What goes wrong:**
The bot sends commands but gets back garbled data, shows empty container lists, or crashes on status checks. Field name changes (`state: "RUNNING"` vs `status: "running"`), nested structure differences (Docker's flat JSON vs GraphQL's nested response), and uppercase/lowercase variations break parsing logic across 60 Code nodes in the main workflow.
**Why it happens:**
Docker REST API returns flat JSON arrays. GraphQL returns nested `{ data: { docker: { containers: [...] } } }` structure. Developers update a few obvious parsing nodes but miss edge cases in error handling, batch processing, and inline keyboard builders. The codebase already has field behavior documentation warnings (`state` values are UPPERCASE, `names` prefixed with `/`) suggesting parsing brittleness.
**How to avoid:**
1. Create GraphQL response normalization layer that transforms Unraid responses to match Docker API shape
2. Add response schema validation in EVERY HTTP Request node (n8n's JSON schema validation)
3. Test response parsing independently from workflow logic (unit test the Code nodes)
4. Document ALL field format differences in normalization layer comments
5. Use TypeScript types for response shapes (n8n Code nodes support TypeScript)
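A minimal sketch of the normalization layer in step 1, assuming the nested `data.docker.containers` shape and the field quirks noted above (UPPERCASE `state`, `/`-prefixed `names`); the exact Unraid payload shape should be verified against the live schema:

```javascript
// Flatten Unraid's nested GraphQL response into the Docker-API-like
// shape the existing Code nodes expect. Field names are assumptions
// based on the documented quirks, not a verified schema.
function normalizeContainers(graphqlResponse) {
  const containers = graphqlResponse?.data?.docker?.containers;
  if (!Array.isArray(containers)) {
    throw new Error('Unexpected GraphQL response: missing data.docker.containers');
  }
  return containers.map((c) => ({
    id: c.id,
    // Strip the leading "/" prefixed onto container names.
    name: (c.names?.[0] || '').replace(/^\//, ''),
    // Map the UPPERCASE GraphQL enum to the lowercase value parsers expect.
    state: (c.state || '').toLowerCase(),
  }));
}
```

Throwing on an unexpected shape (rather than returning `[]`) is deliberate: silent empty lists are exactly the failure mode this pitfall describes.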
**Warning signs:**
- Container list shows but names display as `undefined` or `[object Object]`
- Status command returns "running" for stopped containers or vice versa
- Batch selection keyboard shows wrong container names
- Error messages contain GraphQL error structure (`response.errors[0].message`) instead of friendly text
**Phase to address:**
Phase 3 (Response Schema Normalization) — BEFORE touching any sub-workflow, build and test normalization
---
### Pitfall 4: Unraid GraphQL Schema Changes Silently Break Operations
**What goes wrong:**
Operations that worked yesterday fail today with cryptic errors. Unraid's GraphQL schema evolves (field additions, deprecations, type changes) but the bot has no detection mechanism. The ARCHITECTURE.md already documents one schema discrepancy: `isUpdateAvailable` field documented in Phase 14 research does NOT exist in actual Unraid 7.2 schema.
**Why it happens:**
GraphQL schemas evolve continuously (additive changes, deprecations) per best practices. Unlike REST API versioning (breaking changes = new `/v2/` endpoint), GraphQL encourages in-place evolution. Phase 14 research used outdated/incorrect sources. No schema introspection validation in the deployment pipeline means schema mismatches only surface as runtime errors.
**How to avoid:**
1. Implement schema introspection check at workflow startup (query `__schema` endpoint)
2. Store expected schema snapshot in repo, compare on deployment
3. Add field existence checks BEFORE using optional fields in queries
4. Use GraphQL Inspector or similar tooling in CI/CD to detect breaking changes
5. Subscribe to Unraid API changelog/release notes
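The introspection check in step 1 can be a pure comparison against the fields the workflows depend on. A sketch, assuming the standard `__type` introspection result shape (the type and field names used here are illustrative):

```javascript
// Given the JSON result of a `__type` introspection query, return the
// required fields that no longer exist. Non-empty result = schema drift.
function missingFields(introspectionResult, typeName, requiredFields) {
  const type = introspectionResult?.data?.__type;
  if (!type || type.name !== typeName) {
    throw new Error(`Introspection returned no type named ${typeName}`);
  }
  const present = new Set((type.fields || []).map((f) => f.name));
  return requiredFields.filter((f) => !present.has(f));
}
```

Run once at workflow startup: a non-empty return (e.g. `['isUpdateAvailable']`, the exact discrepancy ARCHITECTURE.md already found) means halting with a clear startup error instead of failing mid-command.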
**Warning signs:**
- New Unraid version installed, bot starts throwing "unknown field" errors
- Operations succeed on test server (older Unraid) but fail on production (newer Unraid)
- GraphQL returns `errors: [{ message: "Cannot query field 'X' on type 'Y'" }]`
- Update status sync stops working after Unraid update
**Phase to address:**
Phase 4 (Schema Validation Layer) — Add introspection checks, implement before full cutover
---
### Pitfall 5: Credential Rotation Kills Bot Mid-Operation
**What goes wrong:**
Bot stops responding to all commands. The Unraid admin rotates the API key for security hygiene (recommended practice for 2026), but n8n's "Unraid API Key" Header Auth credential still uses the old key. All GraphQL requests return 401 Unauthorized. The dual-credential system (`.env.unraid-api` for CLI testing + n8n Header Auth for workflows) means updating one doesn't update the other.
**Why it happens:**
2026 security best practices mandate regular credential rotation. API keys "remain valid forever unless someone revokes or rotates them manually" per research. The system uses TWO separate credential stores that must be manually synchronized. No monitoring detects credential expiration. Unraid doesn't warn before rotating keys.
**How to avoid:**
1. Consolidate credential storage: use ONLY n8n Header Auth, remove `.env.unraid-api` CLI pattern
2. Implement 401 error detection with user-friendly message: "API key invalid, check Unraid API Keys settings"
3. Add credential validation endpoint check on workflow startup
4. Document credential rotation procedure in CLAUDE.md and user docs
5. Consider OAuth 2.0 migration if Unraid adds support (more rotation-friendly)
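Step 2's error mapping can live in one shared Code-node helper. A sketch, assuming the HTTP Request node exposes the status code and parsed body (the message text follows step 2's wording):

```javascript
// Map raw API failures to user-friendly Telegram messages instead of
// leaking GraphQL error structures. Status-code handling is an
// assumption about how the HTTP Request node surfaces errors.
function friendlyApiError(statusCode, body) {
  if (statusCode === 401 || statusCode === 403) {
    return 'API key invalid — check the Unraid API Keys settings, then update the n8n Header Auth credential.';
  }
  const gqlMessage = body?.errors?.[0]?.message;
  return gqlMessage
    ? `Unraid API error: ${gqlMessage}`
    : `Unexpected API response (HTTP ${statusCode})`;
}
```

This also covers Pitfall 3's warning sign of raw `response.errors[0].message` leaking to users.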
**Warning signs:**
- All GraphQL operations fail with 401 errors
- Bot worked yesterday, stopped today without code changes
- CLI testing with `.env.unraid-api` works but workflows fail (keys out of sync)
- Unraid API Keys page shows "Last used: N days ago" with large N value
**Phase to address:**
Phase 5 (Authentication Resilience) — Implement before cutover, add monitoring
---
### Pitfall 6: Sub-Workflow Timeout Errors Lost in Propagation
**What goes wrong:**
User triggers container update, bot appears to hang, no error message returned. After 2 minutes, execution silently fails. Logs show sub-workflow timeout but main workflow never receives error. User retries, creates duplicate operations. Known n8n issue: "Execute Workflow node ignores the timeout of the sub-workflow."
**Why it happens:**
n8n Execute Workflow nodes don't properly propagate sub-workflow timeout errors to parent workflow. Cloud relay adds 200-500ms latency per request. Update operations (pull image, recreate container) that completed in 10-30 seconds with local Docker socket now take 60-120 seconds. Default timeout becomes too aggressive, but timeout errors don't surface to user.
**How to avoid:**
1. Increase ALL sub-workflow timeouts by 3-5x to account for cloud relay latency
2. Implement client-side timeout in main workflow (Code node timestamp checks)
3. Add progress indicators for long-running operations (Telegram "typing" action every 10 seconds)
4. Configure HTTP Request node timeouts explicitly (don't rely on workflow-level timeout)
5. Test timeouts with network throttling simulation
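The client-side timeout in step 2 is a `Promise.race` against a timer, so the parent workflow always receives an error it can report even when the Execute Workflow node swallows the sub-workflow timeout. A sketch:

```javascript
// Race an operation against a timer; always clear the timer so the
// process can exit cleanly after the fast path wins.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

For example, `withTimeout(callUpdateSubworkflow(id), 120000, 'update')` — the 120 s budget reflects the 3-5x headroom suggested in step 1 (`callUpdateSubworkflow` is a hypothetical name).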
**Warning signs:**
- Update operations show "executing" for 2+ minutes then disappear
- Execution logs show sub-workflow timeout but no error message sent to user
- User reports "bot doesn't respond to update commands"
- Success rate drops for slow operations (image pull, large container recreate)
**Phase to address:**
Phase 6 (Timeout Hardening) — Adjust before cutover, test under latency
---
### Pitfall 7: Race Condition Between Container State Query and Action Execution
**What goes wrong:**
User issues "stop plex" command. Bot queries container list (container running), sends stop command, but container already stopped by another process (Unraid WebGUI, another bot user). Unraid API returns error "container not running" but bot displays "successfully stopped." Callback data contains stale container state from 30 seconds ago (Telegram message edit cycle).
**Why it happens:**
GraphQL query and mutation are separate HTTP requests with 200-500ms cloud relay latency each. Container state can change between query and action. Docker socket proxy had sub-10ms latency making race conditions rare. Telegram inline keyboards cache container state in callback data (64-byte limit prevents re-querying). Multiple users can trigger conflicting actions on same container.
**How to avoid:**
1. Implement optimistic locking: query container state immediately before action, abort if state changed
2. Add version/timestamp to callback data, reject stale callbacks (>30 seconds old)
3. Handle "already in target state" as success (304 pattern from Docker API)
4. Query fresh state after action completes, show actual result to user
5. Add conflict detection: if action fails with state error, query and show current state
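Step 3's "304 pattern" can be a small interpreter that reclassifies state-conflict errors as success. A sketch — the error-message matching is an assumption about Unraid's wording and must be confirmed against real responses:

```javascript
// Decide whether an action "failure" actually means the container was
// already in the requested state. Message substrings are assumed.
function interpretActionResult(action, error) {
  if (!error) return { ok: true, note: null };
  const msg = String(error.message || error).toLowerCase();
  const alreadyDone =
    (action === 'stop' && msg.includes('not running')) ||
    (action === 'start' && msg.includes('already started'));
  if (alreadyDone) {
    return { ok: true, note: 'Container was already in the requested state.' };
  }
  return { ok: false, note: msg };
}
```

Pair this with step 4 (query fresh state afterwards) so the user sees the actual result, not the optimistic one.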
**Warning signs:**
- "Successfully stopped X" message but container still running when user checks status
- Action commands fail with "container already stopped/started" errors
- Batch operations report success but some containers in wrong state
- Multiple users report conflicts when managing same container
**Phase to address:**
Phase 7 (State Consistency Layer) — Implement before cutover, critical for multi-user
---
### Pitfall 8: Dual-Write Period Data Inconsistency
**What goes wrong:**
During migration cutover, some operations write to Docker API, others to Unraid API. Container list query returns different results depending on which API responded. Status updates go to Unraid but actions go to Docker, creating split-brain state. Rollback impossible because no single source of truth exists.
**Why it happens:**
Phased migration requires running both systems simultaneously. Developer enables feature flag to route reads to Unraid but keeps writes on Docker for safety. Cache invalidation becomes impossible — Docker changes invisible to Unraid queries, Unraid changes invisible to Docker queries. Callback data mixes Docker IDs and Unraid IDs from different query sources.
**How to avoid:**
1. Implement write-forwarding: Docker writes also trigger Unraid API updates (or vice versa)
2. Route ALL traffic through abstraction layer that handles dual-write internally
3. Keep cutover window SHORT (hours not days) to minimize inconsistency window
4. Use feature flag for routing but maintain single source of truth (either Docker OR Unraid)
5. Add request tracing to identify which API served each operation
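Steps 2, 4, and 5 combine into a single routing helper: one backend per deployment (never mixed per-operation), with a trace tag recording which API served each call. A minimal sketch with illustrative names:

```javascript
// Route every operation through one function so the backend choice is
// a single source of truth; the trace tag supports step 5's debugging.
function routeOperation(op, flags) {
  // Reads and writes MUST use the same backend — mixing them per
  // operation is the split-brain failure mode described above.
  const backend = flags.useUnraidApi ? 'unraid' : 'docker';
  return { ...op, backend, traceId: `${backend}-${op.kind}-${Date.now()}` };
}
```

Flipping `useUnraidApi` is then the entire cutover (and rollback), which is what keeps the window short.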
**Warning signs:**
- Status command shows different container list than batch selection keyboard
- Container appears stopped in one interface, running in another
- Update operation succeeds but status doesn't refresh in Unraid WebGUI
- Rollback leaves orphaned container state (metadata mismatch between APIs)
**Phase to address:**
Phase 8 (Cutover Strategy) — Plan before implementation starts, execution in final phase
---
### Pitfall 9: GraphQL Batching vs n8n Batch Processing Confusion
**What goes wrong:**
Batch update operations (update all `:latest` containers) that processed 10 containers in 30 seconds now take 5+ minutes or time out. Each container update triggers a separate GraphQL HTTP Request → 10 containers = 10 round-trips through the cloud relay. Response body parsing fails because the developer assumes GraphQL response batching (multiple queries sent in a single request) but actually implements n8n batch processing (looping through items).
**Why it happens:**
n8n's batching (Items per Batch setting on HTTP Request node) is for rate limiting, NOT efficient batching. GraphQL supports query batching but requires specific request format. Cloud relay latency multiplied by sequential operations destroys performance. Docker socket proxy had negligible latency so sequential operations were acceptable.
**How to avoid:**
1. Use GraphQL batching for reads: single request with multiple container queries
2. Keep mutations sequential (safer) but add parallel processing for independent operations
3. Configure n8n HTTP Request node batching: 3-5 items per batch, 500ms interval
4. Add progress streaming: update Telegram message after each container (don't wait for all)
5. Implement timeout circuit breaker: abort batch if any single operation takes >60 seconds
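GraphQL read batching (step 1) uses field aliases so N container lookups become one request. A sketch that builds such a document — the `dockerContainer(id:)` field is an assumed name and must be checked against the actual Unraid schema:

```javascript
// Build ONE GraphQL document querying several containers via aliases,
// so a 10-container status read is a single relay round-trip.
function buildBatchedQuery(ids) {
  const selections = ids
    .map((id, i) => `  c${i}: dockerContainer(id: "${id}") { id state }`)
    .join('\n');
  return `query BatchStatus {\n${selections}\n}`;
}
```

The response comes back keyed by alias (`c0`, `c1`, …), so the caller can map results back to the input order.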
**Warning signs:**
- Batch operations work for 2-3 containers but timeout for 10+
- Linear performance degradation (10 containers takes 10x longer than 1)
- n8n execution logs show sequential HTTP requests with 500ms gaps
- User cancels batch operations because they appear hung
**Phase to address:**
Phase 9 (Batch Performance Optimization) — After basic operations work, before batch features enabled
---
### Pitfall 10: Telegram Callback Data Size Limit Breaks With Longer IDs
**What goes wrong:**
Inline keyboard buttons stop working. User taps "Stop" button on container status page, nothing happens. Logs show "callback data exceeds 64 bytes" error. Docker IDs (12 chars) fit the callback format `stop:8a9907a24576`; Unraid PrefixedIDs (129 chars) do not: `stop:{64-char-hash}:{64-char-hash}`.
**Why it happens:**
Telegram's 64-byte callback data limit was manageable with Docker IDs. System already uses bitmap encoding for batch selection (base36 BigInt), but single-container operations still use colon-delimited format. Migration assumes callback format unchanged, doesn't account for 10x ID length increase.
**How to avoid:**
1. Implement container ID shortening: store PrefixedID lookup table in workflow static data, use index in callback
2. Alternative: hash PrefixedID to 8-character base62 string, store mapping
3. Update callback format: `s:idx` where idx is lookup key, not full container ID
4. Test ALL callback patterns (status, actions, confirmation, batch) with Unraid IDs
5. Implement callback data size validation in Prepare Input nodes
**Warning signs:**
- Callback queries fail silently (no error to user)
- n8n logs show "callback data size exceeded" errors
- Inline keyboard buttons work for containers with short names, fail for others
- Parse Callback Data node returns truncated IDs
**Phase to address:**
Phase 2 (Callback Data Encoding) — Parallel to Phase 1, before any inline keyboard migration
---
## Technical Debt Patterns
Shortcuts that seem reasonable but create long-term problems.
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Skip Unraid version detection | Faster implementation | Silent breakage on Unraid upgrades | Never — version changes are documented |
| Mount `/var/lib/docker` into n8n | Direct file access | Security bypass, tight coupling, upgrade fragility | Only if helper script impossible |
| Sync immediately after update (no delay) | Simpler code | Race conditions with Unraid update check | Only for single-container updates (not batch) |
| Assume file format from one Unraid version | Works on dev system | Breaks for users on different versions | Only during Phase 1 investigation (must validate before Phase 2) |
| Write directly to status file without locking | Avoids complexity | File corruption on concurrent access | Never — use atomic operations |
| Hardcode file paths | Works today | Breaks if Unraid changes internal structure | Acceptable if combined with version detection + validation |
| Keep Docker socket proxy running during migration, route errors back to it | Zero-downtime cutover, instant rollback | Maintenance burden, two credential systems, split-brain debugging | Acceptable for 1-2 week migration window MAX |
| Skip GraphQL response normalization, update parsers directly | Fewer code layers, "simpler" architecture | 60+ Code nodes to update, high bug rate, impossible to rollback | Never — normalization is mandatory |
| Use n8n workflow static data for ID lookup table | No external database needed | Static data unreliable (execution-scoped per ARCHITECTURE.md), lost on workflow reimport | Never — already documented as broken in CLAUDE.md |
| Implement feature flag routing in main workflow only | Easy to toggle, single point of control | Sub-workflows unaware of API source, error messages confusing | Acceptable if sub-workflows receive normalized responses |
| Skip schema introspection validation | Faster deployment, fewer dependencies | Silent breakage on Unraid updates, no early warning | Never — schema changes are inevitable |
## Integration Gotchas
Common mistakes when connecting to external services.
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Unraid update status file | Assume JSON structure is stable | Validate structure before modification, reject unknown formats |
| Docker socket proxy | Expect filesystem access like Docker socket mount | Use helper script on host OR Unraid API if available |
| Unraid API (if used) | Assume unauthenticated localhost access | Check auth requirements, API key management |
| File modification timing | Write immediately after container update | Delay 5-10 seconds to avoid collision with Docker event handlers |
| Batch operations | Sync after each container update | Collect all updates, sync once after batch completes |
| Network config preservation | Assume Docker API preserves settings | Explicitly copy network settings from old container inspect to new create |
| n8n GraphQL node | Using dedicated GraphQL node instead of HTTP Request node | Use HTTP Request node with POST to `/graphql` — better error handling, supports Header Auth credential |
| n8n Header Auth | Setting credential in HTTP Request node but forgetting to configure in sub-workflows | ALL 7 sub-workflows need credential configured, not inherited from main workflow |
| Unraid API authentication | Using environment variables directly in workflow expressions | Use n8n credential system, environment variables only for host URL |
| myunraid.net URL format | Including `/graphql` in `UNRAID_HOST` environment variable | Env var should be base URL only, append `/graphql` in HTTP Request node URL field |
| GraphQL error responses | Checking `response.error` like REST APIs | GraphQL returns HTTP 200 with `errors` array, check `response.errors` not `response.error` |
| Container ID format | Assuming IDs are strings, treating them as opaque tokens | Validate ID format `^[a-f0-9]{64}:[a-f0-9]{64}$`, store in typed fields |
| Docker 204 No Content | Assuming empty response = error | Empty response body with HTTP 204 = success per CLAUDE.md |
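Two of the gotchas above — GraphQL errors arriving inside an HTTP 200 body, and the PrefixedID format — lend themselves to small guard functions in a Code node. A sketch with hypothetical names:

```javascript
// GraphQL returns HTTP 200 even on failure; errors live in an `errors`
// array on the response body, not in an HTTP-level error field.
function unwrapGraphQL(body) {
  if (Array.isArray(body.errors) && body.errors.length > 0) {
    const msgs = body.errors.map(e => e.message).join('; ');
    throw new Error(`GraphQL error: ${msgs}`);
  }
  return body.data;
}

// Unraid PrefixedID: two 64-char lowercase hex hashes joined by a colon.
const PREFIXED_ID = /^[a-f0-9]{64}:[a-f0-9]{64}$/;

function assertUnraidId(id) {
  if (!PREFIXED_ID.test(id)) {
    throw new Error(`not an Unraid PrefixedID (got ${id.length} chars)`);
  }
  return id;
}
```

Calling `assertUnraidId` at the boundary of every sub-workflow turns the silent "container not found" failure mode into an immediate, diagnosable error.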
## Performance Traps
Patterns that work at small scale but fail as usage grows.
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Sync per container in batch | File contention, slow batch updates | Batch sync after all updates complete | 5+ containers in batch |
| Full file rewrite for each sync | High I/O, race window increases | Delete stale entries OR modify only changed entries | 10+ containers tracked |
| No retry logic for file access | Silent sync failures | Exponential backoff retry (max 3 attempts) | Concurrent Unraid update check |
| Sync blocks workflow execution | Slow Telegram responses | Async sync (fire and forget) OR move to separate workflow | 3+ second file operations |
| Sequential GraphQL queries in loops | Batch operations timeout, linear slowdown | Use GraphQL query batching or parallel HTTP requests | 5+ containers in batch operation |
| No HTTP Request timeout configuration | Indefinite hangs, zombie workflows | Set explicit timeout on EVERY HTTP Request node (30-60 seconds) | First cloud relay hiccup |
| Callback data re-querying | Every inline keyboard tap queries full container list | Cache container state in callback data (within 64-byte limit) | 10+ active users, rate limiting kicks in |
| Missing retry logic for transient errors | Intermittent failures, user frustration | Implement exponential backoff retry (3 attempts, 1s → 2s → 4s delay) | Network instability, cloud relay rate limits |
| No operation result caching | Same container queried 5 times in single workflow execution | Cache query results in workflow execution context for 30 seconds | Complex workflows with multiple sub-workflow calls |
Note: Current system has 8-15 containers (from UAT scenarios). Performance traps unlikely to manifest, but prevention is low-cost.
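Several of these traps compose: run per-container requests in parallel rather than sequentially, wrap each in an explicit timeout, and retry transient failures with exponential backoff. A minimal sketch (helper names hypothetical; `queryContainer` stands in for the actual per-container GraphQL call):

```javascript
// Reject a promise if it does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry with exponential backoff: delays of base, 2x base, 4x base, ...
async function withRetry(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}

// Parallel fan-out: all container queries start at once instead of one
// HTTP round-trip per container.
async function queryAll(ids, queryContainer, retryOpts) {
  return Promise.all(
    ids.map(id => withRetry(() => withTimeout(queryContainer(id), 30000), retryOpts))
  );
}
```

With 8-15 containers the difference is one round-trip's latency versus fifteen, which is what keeps batch operations inside Telegram's perceived-responsiveness window.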
## Security Mistakes
Domain-specific security issues beyond general web security.
| Mistake | Risk | Prevention |
|---------|------|------------|
| Mount entire `/var/lib/docker` into n8n | n8n gains root-level access to all Docker data | Mount only specific file OR use helper script |
| World-writable status file permissions | Any container can corrupt Unraid state | Verify file permissions, use host-side helper with proper permissions |
| No validation before writing to status file | Malformed data corrupts Unraid Docker UI | Validate JSON structure, reject unknown formats |
| Expose Unraid API key in workflow | API key visible in n8n execution logs | Use n8n credentials, not hardcoded keys |
| Execute arbitrary commands on host | Container escape vector | Whitelist allowed operations in helper script |
| Storing API key in workflow JSON | Credential exposure in git, logs, backups | Use n8n credential system exclusively, never hardcode |
| No API permission scope validation | Over-privileged API key, blast radius on compromise | Use minimal permission (`DOCKER:UPDATE_ANY` only), validate in workflow |
| Telegram user ID auth in single location | Bypass via direct sub-workflow execution | Implement auth check in EVERY sub-workflow, not just main |
| Logging full GraphQL responses | API key, sensitive container config in logs | Log only operation result, redact credentials from error messages |
| No rate limiting on bot commands | API key exhaustion, Unraid API rate limits | Implement per-user rate limiting (5 commands/minute), queue batched operations |
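The per-user rate limit from the last row might be sketched as a sliding window keyed by Telegram user ID. The in-memory `Map` is for illustration only; across n8n executions this state would need an external store, since workflow static data is unreliable per CLAUDE.md.

```javascript
// Sliding-window rate limiter: at most `limit` commands per user per window.
const windows = new Map(); // userId -> array of accepted-command timestamps

function allowCommand(userId, { limit = 5, windowMs = 60000, now = Date.now() } = {}) {
  // Drop timestamps that have aged out of the window.
  const recent = (windows.get(userId) || []).filter(t => now - t < windowMs);
  if (recent.length >= limit) {
    windows.set(userId, recent);
    return false; // over the limit: reject (or queue) the command
  }
  recent.push(now);
  windows.set(userId, recent);
  return true;
}
```

The injectable `now` parameter exists so the window logic can be tested deterministically without waiting out real time.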
## UX Pitfalls
Common user experience mistakes in this domain.
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Silent sync failure | User thinks status updated, Unraid still shows "update ready" | Log error to correlation ID, send Telegram notification on sync failure |
| No indication of sync status | User doesn't know if sync worked | Include in update success message: "Updated + synced to Unraid" |
| Sync delay causes confusion | User checks Unraid immediately, sees old status | Document 10-30 second sync delay in README troubleshooting |
| Unraid badge still shows after sync | User thinks update failed | README: explain Unraid caches aggressively, manual "Check for Updates" forces refresh |
| Batch update spam notifications | 10 updates = 10 Unraid notifications | Batch sync prevents this (if implemented correctly) |
| No latency indication | User unsure if command received, double-taps, duplicate operations | Send immediate "Processing..." message, update on completion |
| Generic error messages | "Operation failed" tells user nothing, can't self-recover | Parse Unraid API errors, show actionable message: "Container already stopped, current state: exited" |
| No migration communication | Users confused why bot slower after "upgrade" | Send broadcast message before cutover: "Bot migrating to Unraid API, expect 2-3x slower responses for improved reliability" |
| Hiding internet dependency | Users blame bot when ISP down | Error message: "Cannot reach Unraid API (requires internet), check network connection" |
| No rollback announcement | Users report bugs, developer fixes by rollback, users still see bugs (cache) | Announce rollbacks: "Rolled back to Docker socket, please retry failed operations" |
## "Looks Done But Isn't" Checklist
Things that appear complete but are missing critical pieces.
- [ ] **File modification:** Wrote to status file — verify atomic operation (temp file + rename, not direct write)
- [ ] **Batch sync:** Syncs after each update — verify batching for multi-container operations
- [ ] **Version compatibility:** Works on dev Unraid — verify against 6.11, 6.12, 7.0, 7.2
- [ ] **Error handling:** Sync returns success — verify retry logic for file contention
- [ ] **Network preservation:** Container starts after update — verify DNS resolution from other containers
- [ ] **Race condition testing:** Works in sequential tests — verify concurrent update + Unraid check scenario
- [ ] **Filesystem access:** Works on dev system — verify n8n container can actually reach file (or helper script exists)
- [ ] **Notification validation:** No duplicate notifications in single test — verify batch scenario (5+ containers)
- [ ] **Container actions:** Often missing state validation BEFORE action — verify error message when stopping already-stopped container shows current state
- [ ] **GraphQL errors:** Often missing `response.errors` array parsing — verify malformed query returns user-friendly message, not JSON dump
- [ ] **Timeout handling:** Often missing client-side timeout — verify 2-minute operation shows progress indicator, doesn't appear hung
- [ ] **Credential expiration:** Often missing 401 error detection — verify rotated API key returns "credential invalid" not generic error
- [ ] **Callback data encoding:** Often missing length validation — verify longest possible container ID + action fits in 64 bytes
- [ ] **Schema validation:** Often missing field existence checks — verify missing field returns helpful error, not "undefined is not a function"
- [ ] **Batch progress:** Often missing incremental updates — verify batch operation shows "3/10 completed" updates, not just final result
- [ ] **Rollback procedure:** Often missing documented steps — verify CLAUDE.md has exact commands to switch back to Docker socket proxy
- [ ] **Dual-credential sync:** Often missing procedure to update both `.env.unraid-api` and n8n credential — verify documented workflow
- [ ] **Performance baseline:** Often missing pre-migration metrics — verify recorded latency/success rate to compare post-migration
## Recovery Strategies
When pitfalls occur despite prevention, how to recover.
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Corrupted status file | LOW | Delete `/var/lib/docker/unraid-update-status.json`, Unraid recreates on next update check |
| State desync (Unraid shows stale) | LOW | Manual "Check for Updates" in Unraid UI forces recalculation |
| Unraid version breaks format | MEDIUM | Disable sync feature via feature flag, update sync logic for new format |
| Network resolution broken | MEDIUM | Restart Docker service in Unraid (`Settings -> Docker -> Enable: No -> Yes`) |
| File permission errors | LOW | Helper script with proper permissions, OR mount file read-only + use API |
| n8n can't reach status file | HIGH | Architecture change required (add helper script OR switch to API) |
| Notification spam | LOW | Unraid notification settings: disable Docker update notifications temporarily |
| Container ID mismatch breaking all operations | HIGH (all operations broken) | 1. Rollback to Docker socket proxy immediately 2. Implement ID translation layer 3. Test with synthetic Unraid IDs 4. Re-deploy |
| myunraid.net relay outage | LOW (temporary, auto-recover) | 1. Wait for relay recovery 2. Implement LAN fallback if outage is extended 3. Monitor status at connect.myunraid.net |
| GraphQL response parsing errors | MEDIUM (degraded functionality) | 1. Identify broken Code node from error logs 2. Add response schema logging 3. Fix parser 4. Redeploy affected sub-workflow |
| Schema changes breaking queries | MEDIUM (affected features broken) | 1. Query Unraid `__schema` endpoint 2. Compare to expected schema snapshot 3. Update queries to match current schema 4. Add missing field checks |
| Credential rotation killing bot | LOW (quick fix) | 1. Generate new API key in Unraid 2. Update n8n Header Auth credential 3. Reactivate workflow (auto-retries) |
| Sub-workflow timeout errors | LOW (increase timeouts) | 1. Identify timeout threshold from logs 2. Increase sub-workflow timeout by 3x 3. Add progress indicators 4. Redeploy |
| Race condition state conflicts | MEDIUM (requires code changes) | 1. Implement fresh state query before action 2. Handle "already in state" as success 3. Show actual state after operation |
| Dual-write inconsistency | HIGH (data integrity compromised) | 1. Choose source of truth (Docker OR Unraid) 2. Query truth source, discard other 3. Regenerate callback data 4. Force user refresh |
| Batch operation performance issues | MEDIUM (requires optimization) | 1. Implement GraphQL batching for reads 2. Add parallel processing for mutations 3. Stream progress updates |
| Callback data size exceeded | MEDIUM (redesign callback format) | 1. Implement ID shortening with lookup table 2. Update ALL Prepare Input nodes 3. Test all callback paths 4. Redeploy |
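The schema-changes recovery above assumes a saved snapshot to diff against. One way to sketch the comparison: introspect with `{ __schema { types { name fields { name } } } }` and check that every field the workflows depend on still exists. The snapshot shape here is illustrative, not a real Unraid schema.

```javascript
// Compare a saved snapshot of required fields per type against a live
// __schema introspection result, returning the fields that disappeared.
// snapshot: { TypeName: ['field', ...] }
// introspectedTypes: the __schema.types array from the GraphQL response
function missingFields(snapshot, introspectedTypes) {
  const live = new Map(
    introspectedTypes.map(t => [t.name, new Set((t.fields || []).map(f => f.name))])
  );
  const missing = [];
  for (const [type, fields] of Object.entries(snapshot)) {
    const liveFields = live.get(type);
    for (const f of fields) {
      if (!liveFields || !liveFields.has(f)) missing.push(`${type}.${f}`);
    }
  }
  return missing;
}
```

Running this check at workflow startup turns a silent "undefined is not a function" deep in a Code node into an explicit "Unraid schema changed: DockerContainer.state missing" warning.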
## Pitfall-to-Phase Mapping
How roadmap phases should address these pitfalls.
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| State desync (Docker API vs Unraid) | Phase 1 (Investigation) + Phase 2 (Sync) | UAT: update via bot, verify Unraid shows "up to date" |
| Race condition (concurrent access) | Phase 2 (Sync Implementation) | Stress test: simultaneous bot update + manual Unraid check |
| Unraid version compatibility | Phase 1 (Format Documentation) + Phase 3 (Multi-version UAT) | Test on Unraid 6.12, 7.0, 7.2 |
| Filesystem access from container | Phase 1 (Architecture Decision) | Deploy to prod, verify file access or helper script works |
| Notification spam | Phase 2 (Batch Sync) | UAT: batch update 5+ containers, count notifications |
| n8n state persistence assumption | Phase 1 (Architecture) | Code review: reject any `staticData` usage for sync queue |
| Network recreation (br0) | Phase 2 (Update Flow) + Phase 3 (UAT) | Test: update container on custom network, verify resolution |
| Container ID format mismatch | Phase 1: ID Mapping Layer | Test Docker ID fails validation, Unraid ID passes, translation correct |
| myunraid.net dependency | Phase 2: Network Resilience | Disconnect internet, verify fallback message or graceful degradation |
| GraphQL response structure | Phase 3: Response Normalization | Compare normalized output to Docker API shape, all fields present |
| Schema changes | Phase 4: Schema Validation | Change expected schema snapshot, verify detection on next workflow run |
| Credential rotation | Phase 5: Auth Resilience | Rotate API key, verify 401 error message user-friendly and actionable |
| Sub-workflow timeouts | Phase 6: Timeout Hardening | Simulate 2-minute operation, verify progress indicator and completion |
| Race conditions | Phase 7: State Consistency | Two users stop same container simultaneously, verify conflict resolution |
| Dual-write inconsistency | Phase 8: Cutover Strategy | Query both APIs during cutover, verify consistent results |
| Batch performance | Phase 9: Batch Optimization | Update 10 containers, verify completion <60 seconds with progress |
| Callback data size | Phase 2: Callback Encoding | Generate callback with longest ID, verify <64 bytes |
## Sources
**HIGH confidence (official/authoritative):**
- [Unraid API — Docker and VM Integration](https://deepwiki.com/unraid/api/2.4.2-notification-system) — DockerService, DockerEventService architecture
- [Unraid API — Notifications Service](https://deepwiki.com/unraid/api/2.4.1-notifications-service) — Race condition handling, duplicate detection
- [Docker Socket Proxy Security](https://github.com/Tecnativa/docker-socket-proxy) — Security model, endpoint filtering
- [Docker Socket Security Critical Vulnerability Guide](https://medium.com/@instatunnel/docker-socket-security-a-critical-vulnerability-guide-76f4137a68c5) — Filesystem access risks
- [n8n Docker File System Access](https://community.n8n.io/t/file-system-access-in-docker-environment/214017) — Container filesystem limitations
**GraphQL Migration Patterns:**
- [Schema Migration - GraphQL](https://dgraph.io/docs/graphql/schema/migration/)
- [How to Handle Versioning in GraphQL APIs](https://oneuptime.com/blog/post/2026-01-24-graphql-api-versioning/view)
- [Migrating from REST to GraphQL - GitHub Docs](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql)
- [3 GraphQL pitfalls and how we avoid them](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them)
**MEDIUM confidence (community-verified):**
- [Watchtower Discussion #1389](https://github.com/containrrr/watchtower/discussions/1389) — Unraid doesn't detect external updates
- [Unraid Docker Troubleshooting](https://docs.unraid.net/unraid-os/troubleshooting/common-issues/docker-troubleshooting/) — br0 network recreation issue
- [Unraid Forums: Docker Update Check](https://forums.unraid.net/topic/49041-warning-file_put_contentsvarlibdockerunraid-update-statusjson-blah/) — Status file location
- [Unraid Forums: 7.2.1 Docker Issues](https://forums.unraid.net/topic/195255-unraid-721-upgrade-seems-to-break-docker-functionalities/) — Version upgrade breaking changes
**n8n Integration Issues:**
- [HTTP Request node common issues | n8n Docs](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.httprequest/common-issues/)
- [Error handling | n8n Docs](https://docs.n8n.io/flow-logic/error-handling/)
- [Execute Workflow node ignores timeout - GitHub Issue #1572](https://github.com/n8n-io/n8n/issues/1572)
- [Error Handling in n8n: How to Retry & Monitor Workflows](https://easify-ai.com/error-handling-in-n8n-monitor-workflow-failures/)
**LOW confidence (single source, needs validation):**
- File format structure (`/var/lib/docker/unraid-update-status.json`) — inferred from forum posts, not officially documented
- Unraid update check timing/frequency — user-configurable, no default documented
- Cache invalidation triggers — inferred from API docs, not explicitly tested
**Migration Strategy:**
- [API migration dual-write pattern - AWS DMS](https://aws.amazon.com/blogs/database/rolling-back-from-a-migration-with-aws-dms/)
- [Zero-Downtime Database Migration: The Complete Engineering Guide](https://dev.to/ari-ghosh/zero-downtime-database-migration-the-definitive-guide-5672)
- [Canary releases with feature flags](https://www.getunleash.io/blog/canary-deployment-what-is-it)
**Project-specific (from existing codebase):**
- STATE.md — n8n static data limitation (Phase 10.2 findings)
- ARCHITECTURE.md — Current system architecture, socket proxy usage
- CLAUDE.md — n8n workflow patterns, sub-workflow contracts
**Authentication & Security:**
- [API Authentication Best Practices in 2026](https://dev.to/apiverve/api-authentication-best-practices-in-2026-3k4a)
- [Migrate from API keys to OAuth 2.1](https://www.scalekit.com/blog/migrating-from-api-keys-to-oauth-mcp-servers)
**Container Management:**
- [Race condition between stop and rm - GitHub Issue #130](https://github.com/apple/container/issues/130)
- [Eventual Consistency in Distributed Systems](https://www.geeksforgeeks.org/system-design/eventual-consistency-in-distributive-systems-learn-system-design/)
**Unraid Specific:**
- [Unraid Connect overview & setup | Unraid Docs](https://docs.unraid.net/unraid-connect/overview-and-setup/)
- Project ARCHITECTURE.md (verified container ID format, field behaviors, myunraid.net requirement)
- Project CLAUDE.md (Docker API patterns, n8n conventions, static data limitations)
---
*Pitfalls research for: Unraid Docker Manager — Docker Socket to GraphQL API Migration*
*Researched: 2026-02-09*
*Confidence: MEDIUM (verified Unraid-specific issues HIGH, general GraphQL patterns MEDIUM, n8n integration issues HIGH)*