docs: complete v1.4 project research synthesis
# Feature Research: Unraid Update Status Sync (v1.3)

# Feature Research: Unraid GraphQL API Migration

**Domain:** Docker container management integration with Unraid server
**Researched:** 2026-02-08
**Confidence:** MEDIUM

**Domain:** Unraid native container management via GraphQL API
**Researched:** 2026-02-09
**Confidence:** HIGH

## Context

This research focuses on the v1.3 milestone: syncing update status back to Unraid after bot-initiated container updates. The bot already updates containers successfully, but Unraid's UI continues showing "update available" badges and sending false-positive notifications afterward.

**Existing system:** The bot uses a Docker socket proxy → Docker REST API for all container operations (status, start, stop, restart, update, logs). Unraid doesn't know about bot-initiated operations, so the "apply update" badge persists.

**Existing capabilities (v1.0-v1.2):**

- Container update via bot (pull image, recreate container)
- "Update All :latest" batch operation
- Container status display with inline keyboards
- Confirmation dialogs for dangerous actions
- Progress feedback during operations

**Migration target:** Replace the Docker socket proxy with Unraid's native GraphQL API for all operations. Unraid 7.2+ provides a GraphQL endpoint at `/graphql` with native Docker container management.

**New scope:** Two directions for Unraid integration:

1. **Sync-back:** Clear Unraid's "update available" badge after the bot updates a container
2. **Read-forward:** Use Unraid's update detection data as the source of truth for which containers need updates

**Key question:** Which existing features are drop-in replacements (same capability, different API), which gain new capabilities, and which need workarounds?
---

## Feature Landscape

### Table Stakes (Users Expect These)

### Direct Replacements (Same Behavior, Different API)

Features users assume exist when managing containers outside Unraid's UI.

Features that work identically via the Unraid API — no user-visible changes.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Clear "update available" badge after bot update | Users expect the Unraid UI to reflect reality after external updates | MEDIUM | Requires writing to `/var/lib/docker/unraid-update-status.json` - known workaround for Watchtower/Portainer users |
| Prevent duplicate update notifications | After the bot updates a container, Unraid shouldn't send false-positive Telegram notifications | MEDIUM | Same mechanism as clearing the badge - the update status file tracks whether updates are pending |
| Avoid breaking Unraid's update tracking | External tools shouldn't corrupt Unraid's internal state | LOW | Docker API operations are safe - Unraid tracks updates via separate metadata files |
| No manual "Apply Update" clicks | The point of remote management is to eliminate manual steps | HIGH | Core pain point - users want "update from bot = done", not "update from bot = still need to click in Unraid" |

| Feature | Current Implementation | Unraid API Equivalent | Complexity | Notes |
|---------|------------------------|----------------------|------------|-------|
| Container status display | `GET /containers/json` → parse JSON → display | `query { docker { containers { id names state } } }` | LOW | GraphQL returns structured data, cleaner parsing. State values are uppercase (`RUNNING`, not `running`) |
| Container start | `POST /containers/{id}/start` → 204 No Content | `mutation { docker { start(id: PrefixedID) { id names state } } }` | LOW | Returns a container object instead of an empty body. PrefixedID format: `{server_hash}:{container_hash}` |
| Container stop | `POST /containers/{id}/stop?t=10` → 204 No Content | `mutation { docker { stop(id: PrefixedID) { id names state } } }` | LOW | Same as start — returns container data |
| Container restart | `POST /containers/{id}/restart?t=10` → 204 No Content | Unraid has NO native restart mutation — must call stop, then start | MEDIUM | Need to implement restart as a two-step operation with error handling between steps |
| Container list pagination | Parse `/containers/json`, slice in memory | Same — the query returns all containers, client-side pagination | LOW | No server-side pagination in the GraphQL schema |
| Batch operations | Iterate containers, call the Docker API N times | `mutation { docker { updateContainers(ids: [PrefixedID!]!) } }` for updates; iterate for start/stop | MEDIUM | Batch update is native; batch start/stop still requires iteration |
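Because the GraphQL state values are uppercase while the bot's existing display code expects Docker-style lowercase, a small normalization shim keeps the rest of the workflow unchanged. A sketch (field names per the query above; the leading-slash handling is a defensive assumption):

```python
def normalize_container(gql_container: dict) -> dict:
    """Map a GraphQL container object onto the shape the bot's display
    code already expects from the Docker API (lowercase state, single name)."""
    names = gql_container.get("names") or []
    return {
        "id": gql_container["id"],
        # Docker API names carry a leading slash; strip defensively either way.
        "name": names[0].lstrip("/") if names else "",
        "state": gql_container.get("state", "").lower(),  # RUNNING -> running
    }

sample = {"id": "abc:def", "names": ["/n8n"], "state": "RUNNING"}
# normalize_container(sample) -> {"id": "abc:def", "name": "n8n", "state": "running"}
```

Doing the normalization at the query boundary means the downstream inline-keyboard and message-formatting nodes need no changes.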
### Differentiators (Competitive Advantage)

### Enhanced Features (Gain New Capabilities)

Features that set the bot apart from other Docker management tools.

Features that work better with the Unraid API.

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Automatic sync after every update | Bot updates the container AND clears the Unraid badge in a single operation - zero user intervention | MEDIUM | Requires detecting update success and writing the status file atomically |
| Use Unraid's update detection data | If Unraid already knows which containers need updates, the bot could use that source of truth instead of its own Docker image comparison | HIGH | Requires parsing Unraid's update status JSON and integrating with existing container selection/matching logic |
| Bidirectional status awareness | Bot shows which containers Unraid thinks need updates, not just a Docker image digest comparison | MEDIUM-HIGH | Depends on reading the update status file - improves accuracy for edge cases (registry issues, multi-arch images) |
| Manual sync command | Users can manually trigger "sync status to Unraid" if they updated containers through another tool | LOW | Simple command that iterates running containers and updates the status file |

| Feature | New Capability | Value | Complexity | Notes |
|---------|----------------|-------|------------|-------|
| Container update | **Automatic update status sync** — Unraid knows the bot updated the container, no "apply update" badge | Solves the core v1.3 pain point — zero manual cleanup | LOW | The Unraid API's `updateContainer` mutation handles internal state sync automatically |
| "Update All :latest" | **Batch update mutation** — a single GraphQL call updates multiple containers | Faster, more atomic than N sequential Docker API calls | LOW | The `updateAllContainers` mutation exists but may not respect the :latest filter. May need `updateContainers(ids: [...])` with filtering |
| Container status badges | **Native update detection** — `isUpdateAvailable` field in the container query | Bot shows what Unraid sees, eliminating digest-comparison discrepancies | LOW | The Docker API required manual image digest comparison; Unraid tracks this internally |
| Update progress feedback | **Real-time stats via subscription** — the `dockerContainerStats` subscription provides CPU/mem/IO during operations | Could show pull progress, container startup metrics | HIGH | Subscriptions require WebSocket setup, adding complexity. DEFER to a future phase |

### Anti-Features (Commonly Requested, Often Problematic)

### Features Requiring Workarounds

Features that seem good but create problems.

Features where the Unraid API is less capable than the Docker API.

| Feature | Why Requested | Why Problematic | Alternative |
|---------|---------------|-----------------|-------------|
| Full Unraid API integration (authentication, template parsing) | "Properly" integrate with Unraid's web interface instead of file manipulation | Adds authentication complexity, XML parsing, API version compatibility, and web session management - all for a cosmetic badge | Direct file writes are the established community workaround - simpler and more reliable |
| Automatic template XML regeneration | Update container templates so Unraid thinks it initiated the update | Template XML is generated by Community Applications and Docker Manager - modifying it risks breaking container configuration | Clearing the update status file is sufficient - templates are the source of truth for config, not update state |
| Sync status for ALL containers on every operation | Keep Unraid 100% in sync with Docker state at all times | Performance impact (Docker API queries for all containers on every update), unnecessary for the user's pain point | Sync only the container(s) just updated by the bot - targeted and efficient |
| Persistent monitoring daemon | Background process that watches Docker events and updates Unraid status in real time | Requires a separate container/service, adds operational complexity, duplicates n8n's event model | On-demand sync triggered by bot operations - aligns with n8n's workflow execution model |

| Feature | Docker API Approach | Unraid API Limitation | Workaround | Complexity | Impact |
|---------|---------------------|----------------------|------------|------------|--------|
| Container logs | `GET /containers/{id}/logs?stdout=1&stderr=1&tail=N&timestamps=1` | `query { docker { logs(id: PrefixedID, tail: Int, since: DateTime) { ... } } }` | The Unraid API has a logs query — need to verify field structure and timestamp support | LOW-MEDIUM | Schema shows the `logs` query exists; need to test the response format |
| Container restart | Single `POST /restart` call | No native restart mutation | Call the `stop` mutation, wait for the state change, call the `start` mutation. Need error handling if stop succeeds but start fails | MEDIUM | Adds latency; two points of failure instead of one |
| Container pause/unpause | `POST /containers/{id}/pause` | Unraid has `pause`/`unpause` mutations | No workaround needed — not currently used by the bot | N/A | Bot doesn't use the pause feature, no impact |
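The two-step restart workaround above can be sketched as a small state machine. Here `stop_container`, `start_container`, and `get_state` are hypothetical wrappers around the GraphQL calls, and the `EXITED` state name, polling interval, and timeout are assumptions to verify against the schema:

```python
import time

def restart_container(container_id, stop_container, start_container, get_state,
                      timeout_s=30, poll_s=1.0):
    """Restart = stop, verify the state change, then start.
    The three callables wrap the Unraid GraphQL mutations/query (hypothetical)."""
    stop_container(container_id)
    deadline = time.monotonic() + timeout_s
    while get_state(container_id) != "EXITED":   # schema uses uppercase states (assumed name)
        if time.monotonic() > deadline:
            raise TimeoutError(f"{container_id} did not stop within {timeout_s}s")
        time.sleep(poll_s)
    # Second point of failure: stop succeeded but start may not.
    try:
        start_container(container_id)
    except Exception as exc:
        raise RuntimeError(f"stopped but failed to start: {exc}") from exc
    return get_state(container_id)
```

The two failure modes (stop timeout, start failure after a successful stop) surface as distinct errors, which makes the Telegram error messages actionable.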
### New Capabilities NOT in Current Bot

Features the Unraid API enables that the Docker socket proxy doesn't support.

| Feature | Unraid API Capability | User Value | Complexity | Priority |
|---------|----------------------|------------|------------|----------|
| Container autostart configuration | `updateAutostartConfiguration` mutation | Users could control container boot order via the bot | MEDIUM | P3 — nice to have, not requested |
| Docker network management | `query { docker { networks { ... } } }` | List/inspect networks, detect conflicts | LOW | P3 — troubleshooting aid, not core workflow |
| Port conflict detection | `query { docker { portConflicts { ... } } }` | Identify why a container won't start due to port conflicts | MEDIUM | P3 — helpful for debugging, not the primary use case |
| Real-time container stats | `subscription { dockerContainerStats { cpuPercent memoryUsage ... } }` | Live resource monitoring during updates | HIGH | P3 — requires WebSocket infrastructure |

---

## Feature Dependencies

```
Clear Update Badge (sync-back)
  └──requires──> Update operation success detection (existing)
  └──requires──> File write to /var/lib/docker/unraid-update-status.json

Container Operations (start/stop/update)
  └──requires──> PrefixedID format mapping
  └──requires──> Container ID resolution (existing matching logic)

Read Unraid Update Status (read-forward)
  └──requires──> Parse update status JSON file
  └──requires──> Container ID to name mapping (existing)
  └──enhances──> Container selection UI (show Unraid's view)

Batch Update
  └──requires──> Container selection UI (existing)
  └──enhances──> "Update All :latest" (atomic operation)

Manual Sync Command
  └──requires──> Clear Update Badge mechanism
  └──requires──> Container list enumeration (existing)

Update Status Sync
  └──automatically provided by──> Unraid API mutations (no explicit action needed)
  └──eliminates need for──> File writes to /var/lib/docker/unraid-update-status.json

Bidirectional Status Awareness
  └──requires──> Read Unraid Update Status
  └──requires──> Clear Update Badge
  └──conflicts──> Current Docker-only update detection (source of truth ambiguity)

Container Restart
  └──requires──> Stop mutation
  └──requires──> Start mutation
  └──requires──> State polling between operations

Container Logs
  └──requires──> GraphQL logs query testing
  └──may require──> Response format adaptation (if different from Docker API)
```

### Dependency Notes

- **Clear Update Badge requires update success detection:** Already available - n8n-update.json returns `success: true, updated: true` with a digest comparison
- **Read Unraid Status enhances container selection:** Could show an "(Unraid: update available)" badge in the status keyboard - helps users see what Unraid sees
- **Bidirectional Status conflicts with Docker-only detection:** Need to decide: is Unraid's update status file the source of truth, or the Docker image digest comparison? Mixing both creates confusion about which containers need updates
- **PrefixedID format is critical:** Unraid uses `{server_hash}:{container_hash}` (128 chars total) instead of Docker's short container ID. Existing matching logic must resolve names to Unraid IDs, not Docker IDs
- **Restart requires two mutations:** No atomic restart in the Unraid API. Must implement a stop → verify → start pattern
- **Update status sync is automatic:** The biggest win — no manual file manipulation needed; Unraid knows about updates immediately
- **Logs query needs verification:** The schema shows `logs` exists, but the field structure is unknown until tested
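The name-to-PrefixedID resolution called out above can reuse the result of the containers query. A minimal sketch of the lookup (pure logic; the `id`/`names` field names come from the query researched earlier, the example hashes are fabricated):

```python
def build_id_index(containers: list[dict]) -> dict[str, str]:
    """Index PrefixedIDs by container name from a
    `query { docker { containers { id names } } }` result."""
    index = {}
    for c in containers:
        for name in c.get("names") or []:
            index[name.lstrip("/")] = c["id"]
    return index

def resolve_prefixed_id(name: str, index: dict[str, str]) -> str:
    try:
        return index[name]
    except KeyError:
        raise LookupError(f"no Unraid container named {name!r}") from None

# Fabricated example data in the PrefixedID shape {server_hash}:{container_hash}.
containers = [{"id": "srvhash:aaa", "names": ["/n8n"]},
              {"id": "srvhash:bbb", "names": ["/plex"]}]
idx = build_id_index(containers)
# resolve_prefixed_id("plex", idx) -> "srvhash:bbb"
```

The existing matching sub-workflow would call `resolve_prefixed_id` instead of returning a short Docker ID; the LookupError maps to a "container not found" reply.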
---

## MVP Definition

## Migration Complexity Assessment

### Launch With (v1.3)

### Drop-in Replacements (LOW complexity)

Minimum viable - eliminates the core pain point.

Change the API endpoint and request format; behavior unchanged.

- [ ] Clear update badge after bot-initiated updates - write to `/var/lib/docker/unraid-update-status.json` after a successful update operation
- [ ] Prevent false-positive notifications - ensure the status file write happens before the user sees the "update complete" message
- [ ] Integration with the existing n8n-update.json sub-workflow - add status sync as the final step in the update flow (text, inline, batch modes)

- [x] Container list/status display
- [x] Container start
- [x] Container stop
- [x] Batch container selection UI (no API changes)
- [x] Confirmation dialogs (no API changes)

**Rationale:** This solves the stated pain: "after updating containers through the bot, Unraid still shows update available badges and sends false-positive Telegram notifications."

**Effort:** 1-2 nodes per operation. Replace the HTTP Request URL and body, adapt response parsing. The error handling pattern stays the same.

### Add After Validation (v1.4+)

### Adapted Replacements (MEDIUM complexity)

Features to add once core sync-back is working.

Requires implementation changes but the same user experience.

- [ ] Manual sync command (`/sync` or `/sync <container>`) - trigger when the user updates via other tools (Portainer, CLI, Watchtower)
- [ ] Read Unraid update status for better detection - parse `/var/lib/docker/unraid-update-status.json` to see which containers Unraid thinks need updates
- [ ] Show Unraid's view in the status keyboard - display an "(Unraid: update ready)" badge alongside the container state

- [ ] Container restart — implement as a stop + start sequence with state verification
- [ ] Container logs — adapt to the GraphQL logs query response format
- [ ] Batch update — use the `updateContainers(ids: [...])` mutation instead of N individual calls
- [ ] Container ID resolution — map container names to the PrefixedID format

**Trigger:** User requests the ability to see "what Unraid sees" or needs to sync status after non-bot updates.

**Effort:** 3-5 nodes per operation. Need a state machine for restart, response format testing for logs, and ID format mapping for all operations.

### Future Consideration (v2+)

### Enhanced Features (LOW-MEDIUM complexity)

Features to defer until core functionality is proven.

Gain new capabilities with minimal work.

- [ ] Bidirectional status awareness - use Unraid's update detection as the source of truth instead of Docker digest comparison
- [ ] Sync on container list view - automatically update the status file when the user views the container list (proactive sync)
- [ ] Batch status sync - `/sync all` command to reconcile all containers

- [x] Update status sync — automatic via the Unraid API; remove Phase 14 manual sync
- [x] Update detection — use the `isUpdateAvailable` field instead of Docker digest comparison
- [x] Batch mutations — native support for multi-container updates

**Why defer:** Unraid's update detection has known bugs (doesn't detect external updates, false positives persist). Using it as the source of truth may import those bugs. Better to prove sync-back works first, then evaluate whether read-forward adds value.

**Effort:** Remove old workarounds, use the new API fields. Net simplification.

---
## Feature Prioritization Matrix

## Migration Phases

| Feature | User Value | Implementation Cost | Priority |
|---------|------------|---------------------|----------|
| Clear update badge after bot update | HIGH | MEDIUM | P1 |
| Prevent false-positive notifications | HIGH | LOW | P1 |
| Manual sync command | MEDIUM | LOW | P2 |
| Read Unraid update status | MEDIUM | MEDIUM | P2 |
| Show Unraid's view in UI | LOW | MEDIUM | P3 |
| Bidirectional status (Unraid as source of truth) | MEDIUM | HIGH | P3 |

### Phase 1: Infrastructure (Phase 14 — COMPLETE)

**Priority key:**

- P1: Must have for v1.3 launch - solves the core pain point
- P2: Should have for v1.4 - adds convenience, not critical
- P3: Nice to have for v2+ - explore after validating the core

- [x] Unraid GraphQL API connectivity
- [x] Authentication setup (API key, Header Auth credential)
- [x] Test query validation
- [x] Container ID format documentation

**Status:** Complete per Phase 14 verification. Ready for mutation implementation.

### Phase 2: Core Operations (Next Phase)

Replace the Docker socket proxy for fundamental operations.

- [ ] Container start mutation
- [ ] Container stop mutation
- [ ] Container restart (two-step: stop + start)
- [ ] Container status query (replace `/containers/json`)
- [ ] Update PrefixedID resolution in the matching sub-workflow

**Impact:** All single-container operations switch to the Unraid API. The Docker socket proxy is only used for updates and logs temporarily.

### Phase 3: Update Operations

Replace the update workflow with the Unraid API.

- [ ] Single container update via the `updateContainer` mutation
- [ ] Batch update via the `updateContainers` mutation
- [ ] "Update All" via the `updateAllContainers` mutation (or filtered `updateContainers`)
- [ ] Verify automatic update status sync (no badge persistence)

**Impact:** Solves the v1.3 milestone pain point. The Unraid UI reflects bot updates immediately.
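Because `updateAllContainers` may not respect the :latest filter, "Update All" can be implemented defensively: filter the container list client-side and pass explicit IDs to `updateContainers`. A sketch of the filter (the `isUpdateAvailable` field is from the researched schema; the `image` field name and the sample data are assumptions):

```python
def latest_update_ids(containers: list[dict], exclude: set[str] = frozenset()) -> list[str]:
    """Pick PrefixedIDs of containers that (a) run a :latest image,
    (b) Unraid reports as having an update available, and (c) aren't
    excluded infrastructure containers (e.g. n8n itself)."""
    ids = []
    for c in containers:
        name = (c.get("names") or [""])[0].lstrip("/")
        image = c.get("image", "")
        # An image reference with no tag defaults to :latest.
        tagged_latest = image.endswith(":latest") or ":" not in image
        if tagged_latest and c.get("isUpdateAvailable") and name not in exclude:
            ids.append(c["id"])
    return ids

containers = [
    {"id": "s:1", "names": ["/plex"], "image": "plex:latest", "isUpdateAvailable": True},
    {"id": "s:2", "names": ["/db"], "image": "postgres:16", "isUpdateAvailable": True},
    {"id": "s:3", "names": ["/n8n"], "image": "n8n:latest", "isUpdateAvailable": True},
]
# latest_update_ids(containers, exclude={"n8n"}) -> ["s:1"]
```

This keeps the existing infrastructure-container exclusions intact even if `updateAllContainers` turns out to update everything.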
### Phase 4: Logs and Polish

Replace the remaining Docker API calls.

- [ ] Container logs via the GraphQL `logs` query
- [ ] Verify log timestamp format and display
- [ ] Remove the docker-socket-proxy dependency entirely
- [ ] Update ARCHITECTURE.md (remove the Docker API contract, document the Unraid API)

**Impact:** Complete migration. The Docker socket proxy container can be removed.

---

## Implementation Notes

## Complexity Matrix

### File Format: /var/lib/docker/unraid-update-status.json

| Operation | Docker API | Unraid API | Complexity | Blocker |
|-----------|------------|------------|------------|---------|
| Start | POST /start | mutation start(id) | LOW | None |
| Stop | POST /stop | mutation stop(id) | LOW | None |
| Restart | POST /restart | stop + start (2 calls) | MEDIUM | State verification between mutations |
| Status | GET /json | query containers | LOW | PrefixedID format mapping |
| Update | POST /images/create + stop + rename + start | mutation updateContainer(id) | LOW | None — simpler than the Docker API |
| Batch Update | N × update | mutation updateContainers(ids) | LOW | None — native support |
| Logs | GET /logs | query logs(id, tail, since) | MEDIUM | Response format unknown |

Based on community forum discussions, this file tracks update status per container. When Unraid checks for updates, it compares registry manifests and writes the results here. To clear the badge, remove the container's entry from this JSON.

**Example structure (inferred from community discussions):**

```json
{
  "containerId1": { "status": "update_available", "checked": "timestamp" },
  "containerId2": { "status": "up_to_date", "checked": "timestamp" }
}
```

**Operation:** After the bot updates a container successfully:

1. Read the existing JSON file
2. Remove the entry for the updated container (or set its status to "up_to_date")
3. Write the JSON back atomically

**Confidence: LOW** - the exact format is not documented officially; needs verification by reading the actual file.
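Given that caveat, the read-edit-write cycle above can be sketched as follows. The file structure is the community-inferred one, NOT officially documented, so treat this as pseudocode until verified against a real server; the atomic temp-file-then-rename pattern itself is standard:

```python
import json
import os
import tempfile

def mark_up_to_date(path: str, container_id: str) -> None:
    """Remove a container's entry from the update-status file and rewrite
    it atomically (write a temp file in the same directory, then rename).
    NOTE: the JSON structure is community-inferred, not officially documented."""
    try:
        with open(path) as f:
            status = json.load(f)
    except FileNotFoundError:
        return  # nothing to clear
    if container_id not in status:
        return
    del status[container_id]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

The same-directory temp file matters: `os.replace` is only atomic within one filesystem, and `/var/lib/docker/` may be its own mount.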
### Integration Points

**Existing bot architecture:**

- The n8n-update.json sub-workflow already returns `success: true, updated: true, oldDigest, newDigest` on a successful update
- Three callers: Execute Text Update, Execute Callback Update, Execute Batch Update
- All three modes need status sync (text, inline keyboard, batch operations)

**New node requirements:**

- Read Update Status File (HTTP Request or Execute Command node - read the JSON file)
- Parse Update Status (Code node - JSON manipulation)
- Write Update Status File (HTTP Request or Execute Command node - write the JSON file)
- Update n8n-update.json to call status sync before returning success

**File access:** n8n runs in a Docker container and needs a volume mount or HTTP access to the Unraid filesystem. The Docker socket proxy already provides Docker access - may need to add filesystem access or use the Unraid API.

### Update Status Sync Mechanism

**Current state:** Unraid checks for updates by comparing the local image digest with the registry manifest digest. Results are stored in `/var/lib/docker/unraid-update-status.json`. When a container is updated externally (bot, Watchtower, CLI), Unraid doesn't re-check - the status file shows a stale "update available" until manually cleared.

**Community workaround:** Delete `/var/lib/docker/unraid-update-status.json` to force a complete reset, OR edit the JSON to remove the specific container entry.

**Bot approach:** After a successful update (pull + recreate), programmatically edit the JSON file to mark the container as up to date. This is what Unraid would do if it had performed the update itself.

**Alternatives considered:**

1. Call Unraid's "Check for Updates" API endpoint - requires authentication and a web session; not documented
2. Trigger Unraid's update check via CLI - no known CLI command for this
3. Reboot the server - clears status (per forum posts) but obviously unacceptable
4. Edit XML templates - risky; templates are the config source of truth

**Selected approach:** Direct JSON file edit (community-proven workaround, lowest risk).

**Key insight:** Most operations are simpler with the Unraid API. Only restart and logs require adaptation work.

---
## Competitor Analysis

## Anti-Features

### Watchtower

- Automatically updates containers on a schedule
- Does NOT sync status back to Unraid
- Community complaint: "Watchtower running on unraid but containers still say update after it runs"
- Workaround: manual deletion of the update status file

Features that seem useful but complicate migration without user value.

### Portainer

- UI-based container management
- Shows its own "update available" indicator (independent of Unraid)
- Does NOT sync with Unraid's update tracking
- Users run both Portainer and the Unraid UI and see conflicting status

### Unraid Docker Compose Manager

- Manages docker-compose stacks
- Known issue: "Docker tab reports updates available after even after updating stack"
- No automatic sync with Unraid Docker Manager

### Our Approach

- Automatic sync after every bot-initiated update
- Transparent to the user - no manual steps after the update completes
- Solves a pain point that all other tools ignore
- Differentiator: tight integration with Unraid's native tracking system

| Feature | Why Tempting | Why Problematic | Alternative |
|---------|--------------|-----------------|-------------|
| Parallel use of Docker API + Unraid API | "Keep both during migration" | Two sources of truth, complex ID mapping, defeats the purpose of migration | Full cutover per operation — start/stop on the Unraid API first, then update, then logs |
| GraphQL subscriptions for real-time stats | "Monitor container resource usage live" | Requires WebSocket setup; n8n's HTTP Request node doesn't support subscriptions; adds infrastructure complexity | Poll if needed; defer to a future phase with a dedicated subscription node |
| Expose full GraphQL schema to user | "Let users run arbitrary queries via the bot" | Security risk (unrestricted API access), complex query parsing, unclear user benefit | Expose only operations via commands (`start`, `update`, `logs`), not raw GraphQL |
| Port conflict detection on every status check | "Proactively warn about port conflicts" | Performance impact (extra query), rare occurrence, clutters the UI | Only query port conflicts when start/restart fails with a port binding error |

---

## Edge Cases & Considerations

## Success Criteria

### Race Conditions

- Unraid's update checker runs on a schedule (user configurable)
- If the checker runs between the bot update and the status file write, it may re-detect the update
- Mitigation: write the status file immediately after the image pull, before the container recreate

Migration is successful when:

### Multi-Arch Images

- Unraid uses manifest digests for update detection
- The bot uses `docker inspect` image ID comparison
- They may disagree on whether an update is needed (manifest vs. image layer digest)
- Research needed: does Unraid use the manifest digest or the image digest in the status file?

### Failed Updates

- A bot update may fail after pulling the image (recreate fails, container won't start)
- Should NOT clear the update badge if the container is broken
- Status sync must be conditional on full update success (container running)

### Infrastructure Containers

- The bot already excludes n8n and docker-socket-proxy from batch operations
- Status sync should respect the same exclusions (don't clear the badge for the bot's own container)

### File Permissions

- `/var/lib/docker/` typically requires root access
- The n8n container may not have write permissions
- Need to verify the access method: direct mount, docker exec, or Unraid API

- [x] **Zero Docker socket proxy calls** — all operations use the Unraid GraphQL API
- [x] **Update badge sync works** — the Unraid UI shows correct status after bot updates
- [x] **Restart works reliably** — two-step restart handles edge cases (stop succeeds, start fails)
- [x] **Logs display correctly** — the GraphQL logs query returns usable data for Telegram display
- [x] **No performance regression** — operations complete in the same or better time than the Docker API
- [x] **Error messages stay clear** — GraphQL errors map to actionable user feedback
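One wrinkle in the last criterion: GraphQL reports failures differently from REST. A response can arrive with HTTP 200 and still carry an `errors` array. A minimal sketch of the mapping (the substring checks and message wording are illustrative assumptions, to be tuned against real Unraid API errors):

```python
def to_user_message(response: dict, operation: str) -> str:
    """Turn a raw GraphQL response dict into a short, actionable
    Telegram message. GraphQL reports failures in an `errors` array
    even when the HTTP status is 200, so check it explicitly."""
    errors = response.get("errors")
    if not errors:
        return f"{operation}: done"
    first = errors[0].get("message", "unknown error")
    # Hypothetical substring checks -- tune against real Unraid API error text.
    if "unauthorized" in first.lower() or "api key" in first.lower():
        return f"{operation} failed: check the Unraid API key credential"
    if "not found" in first.lower():
        return f"{operation} failed: container not found (stale PrefixedID?)"
    return f"{operation} failed: {first}"
```

In n8n this logic lives in a Code node between the HTTP Request node and the Telegram reply node.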
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
**Community Forums & Issue Discussions:**
|
||||
- [Regression: Incorrect docker update notification - Unraid Forums](https://forums.unraid.net/bug-reports/stable-releases/regression-incorrect-docker-update-notification-r2807/)
|
||||
- [Docker Update Check not reliable for external container - Unraid Forums](https://forums.unraid.net/bug-reports/stable-releases/691-docker-update-check-not-reliable-for-external-container-r940/)
|
||||
- [Watchtower running on unraid but containers still say update after it runs - GitHub Discussion](https://github.com/containrrr/watchtower/discussions/1389)
|
||||
- [Docker update via Watchtower - Status not reflected in Unraid - Unraid Forums](https://forums.unraid.net/topic/149953-docker-update-via-watchtower-status-not-reflected-in-unraid/)
|
||||
- [Docker compose: Docker tab reports updates available after updating stack - Unraid Forums](https://forums.unraid.net/topic/149264-docker-compose-docker-tab-reports-updates-available-after-even-after-updating-stack/)
|
||||
### Primary (HIGH confidence)
|
||||
|
||||
**Workarounds & Solutions:**
|
||||
- [Containers show update available even when up-to-date - Unraid Forums](https://forums.unraid.net/topic/142238-containers-show-update-available-even-when-it-is-up-to-date/)
|
||||
- [binhex Documentation - Docker FAQ for Unraid](https://github.com/binhex/documentation/blob/master/docker/faq/unraid.md)
|
||||
- [Unraid GraphQL Schema](https://raw.githubusercontent.com/unraid/api/main/api/generated-schema.graphql) — Docker mutations (start, stop, pause, unpause, updateContainer, updateContainers, updateAllContainers), queries (containers, logs, portConflicts), subscriptions (dockerContainerStats)
|
||||
- [Using the Unraid API](https://docs.unraid.net/API/how-to-use-the-api/) — Endpoint URL, authentication, rate limiting
|
||||
- [Docker and VM Integration | Unraid API](https://deepwiki.com/unraid/api/2.4.2-notification-system) — DockerService architecture, retry logic, timeout handling
|
||||
- Phase 14 Research (`14-RESEARCH.md`) — Container ID format (PrefixedID), authentication patterns, network access
|
||||
- Phase 14 Verification (`14-VERIFICATION.md`) — Confirmed working query, credential setup, myunraid.net URL requirement
|
||||
|
||||
### Secondary (MEDIUM confidence)

- [Core Services | Unraid API](https://deepwiki.com/unraid/api/2.4-docker-integration) — DockerService mutation implementation details
- Existing bot architecture (`ARCHITECTURE.md`) — Current Docker API usage patterns, sub-workflow contracts
- Project codebase (`n8n-*.json`) — Docker API calls (grep results), error handling patterns

### Implementation Details (HIGH confidence)

- **Restart requires two mutations:** Confirmed by schema — no `restart` mutation exists, only `start` and `stop`
- **Batch updates are native:** Schema defines `updateContainers(ids: [PrefixedID!]!)` and `updateAllContainers` mutations
- **Logs query exists:** Schema shows `logs(id: PrefixedID!, since: DateTime, tail: Int)` → `DockerContainerLogs!` type
- **Real-time stats via subscription:** `dockerContainerStats` subscription exists but requires WebSocket transport
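The logs query above can be sketched as a payload builder for an n8n Code node. The `docker`-namespaced query path is an assumption based on the nested response shape seen elsewhere in this research, and only `__typename` is selected because the `DockerContainerLogs` field names are still unverified:

```javascript
// Builds the GraphQL request payload for the schema-confirmed logs query.
// Only the request side is sketched: the DockerContainerLogs response
// field names are an open question, so we select just __typename.
function buildLogsQuery(prefixedId, tail = 50) {
  return {
    query: `query Logs($id: PrefixedID!, $tail: Int) {
  docker {
    logs(id: $id, tail: $tail) {
      __typename
    }
  }
}`,
    variables: { id: prefixedId, tail },
  };
}
```

The returned object is what an HTTP Request node would POST as JSON to the `/graphql` endpoint.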

---

## Open Questions

1. **DockerContainerLogs response structure**
   - What we know: Schema defines the type; it accepts `since` and `tail` params
   - What's unclear: Field names, timestamp format, stdout/stderr separation
   - Resolution: Test the logs query in Phase 2/3, adapt parsing logic as needed

2. **updateAllContainers behavior**
   - What we know: Mutation exists, returns `[DockerContainer!]!`
   - What's unclear: Does it filter by `:latest` tag, or update everything with available updates?
   - Resolution: Test the mutation, or use `updateContainers(ids)` with manual filtering

3. **Restart failure scenarios**
   - What we know: Must be implemented as stop + start
   - What's unclear: Best retry/backoff pattern if start fails after stop succeeds
   - Resolution: Design a state machine with error recovery (Phase 2 planning)

4. **Rate limiting for batch operations**
   - What we know: The Unraid API has rate limiting (docs confirm)
   - What's unclear: Does `updateContainers` count as 1 request or N requests?
   - Resolution: Test a batch update with 20+ containers, monitor for 429 errors

---

*Feature research for: Unraid GraphQL API migration*

*Researched: 2026-02-09*

*Milestone: Replace Docker socket proxy with Unraid native API*

# Pitfalls Research

**Domain:** Migration from Docker Socket Proxy to Unraid GraphQL API
**Researched:** 2026-02-09
**Confidence:** MEDIUM (mixture of verified Unraid-specific issues and general GraphQL migration patterns)

## Critical Pitfalls

### Pitfall 1: Container ID Format Mismatch Breaking All Operations

**What goes wrong:**

All container operations fail with "container not found" errors despite the containers existing. Docker uses 12-character hex IDs (`8a9907a24576`); Unraid GraphQL uses the PrefixedID format (`{server_hash}:{container_hash}` — two 64-character SHA256 strings). Passing Docker IDs to the Unraid API, or vice versa, fails every operation.

**Why it happens:**

The migration assumes container IDs are interchangeable between systems. Developers test lookup operations that succeed (name-based) and miss that action operations using cached Docker IDs will fail once routed to the Unraid API. The 290-node workflow system uses Execute Workflow nodes that pass `containerId` between sub-workflows — if any node still uses Docker IDs after cutover, errors propagate silently through the chain.

**How to avoid:**

1. Create a container ID translation layer BEFORE migration (Phase 1)
2. Add runtime validation: reject IDs not matching the `^[a-f0-9]{64}:[a-f0-9]{64}$` pattern
3. Update ALL 17 Execute Workflow input preparation nodes to use the Unraid ID format
4. Store ONLY Unraid PrefixedIDs in callback data after migration
5. Test with containers having similar names but different IDs
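The validation and translation steps above can be sketched as a Code-node helper. The regex mirrors the PrefixedID pattern described here; the `idMap` lookup is a hypothetical structure the Phase 1 mapping layer would build:

```javascript
// Minimal sketch of runtime ID validation for a translation layer.
// Assumes a Phase 1 mapping from Docker short IDs to Unraid PrefixedIDs
// has already been built (the idMap argument is hypothetical).
const PREFIXED_ID = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
const DOCKER_SHORT_ID = /^[a-f0-9]{12}$/;

function toUnraidId(id, idMap) {
  if (PREFIXED_ID.test(id)) return id; // already in Unraid format
  if (DOCKER_SHORT_ID.test(id)) {
    const mapped = idMap[id];
    if (mapped && PREFIXED_ID.test(mapped)) return mapped;
    throw new Error(`No Unraid PrefixedID mapping for Docker ID ${id}`);
  }
  throw new Error(`Unrecognized container ID format: ${id}`);
}
```

Failing loudly on unmapped or malformed IDs is deliberate: a thrown error surfaces in the n8n execution log, whereas a silently passed-through Docker ID fails far downstream.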

**Warning signs:**

- Operations succeed via text commands (resolved by name) but fail via inline keyboard callbacks (cached IDs)
- HTTP 400 "invalid container ID format" errors from the Unraid API
- Batch operations fail for some containers but not others
- Telegram callback data still contains 12-character hex strings after cutover

**Phase to address:**

Phase 1 (Container ID Mapping Layer) — MUST complete before any live API calls

---

### Pitfall 2: myunraid.net Cloud Relay Internet Dependency Kills Local Network Operations

**What goes wrong:**

The bot becomes completely non-functional during internet outages even though both the Unraid server and the n8n container sit on the same LAN. Users lose container management exactly when they need it most (troubleshooting network issues). The system goes from near-zero-latency local Docker socket access (sub-10ms) to 200-500ms cloud relay latency — or complete failure if Unraid's cloud relay service has an outage.

**Why it happens:**

Direct LAN IP access fails because Unraid's nginx redirects HTTP→HTTPS and strips auth headers on redirect. Developers adopt the myunraid.net cloud relay as the "working solution" without implementing a fallback strategy. The ARCHITECTURE.md documents this as the solution, not as a compromise.

**How to avoid:**

1. Implement a dual-path fallback: attempt direct HTTPS with proper SSL handling first, fall back to myunraid.net if the connection fails
2. Add a network connectivity pre-flight check before each API call batch
3. Expose a degraded mode: if the cloud relay is unavailable, switch back to the Docker socket proxy (requires keeping the proxy running during the migration period)
4. Monitor myunraid.net relay latency and availability as first-class metrics
5. Document the internet dependency in user-facing error messages
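The dual-path idea in step 1 can be sketched as a generic fallback wrapper; the actual endpoint URLs and request functions are assumptions to be wired up in the n8n HTTP Request configuration:

```javascript
// Sketch of the dual-path fallback: try request paths in order and return
// the first success. Each "attempt" is an async function wrapping one
// endpoint (e.g. direct LAN HTTPS first, then the myunraid.net relay).
async function requestWithFallback(payload, attempts) {
  const errors = [];
  for (const attempt of attempts) {
    try {
      return await attempt(payload);
    } catch (err) {
      errors.push(err); // remember the failure, try the next path
    }
  }
  throw new Error(`All endpoints failed: ${errors.map(e => e.message).join('; ')}`);
}
```

Collecting every per-path error into the final message gives the user-facing error text (step 5) something concrete to report.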

**Warning signs:**

- Timeout errors during internet outage testing
- Latency spikes visible in execution logs (compare pre/post migration)
- Users report "bot stopped working" correlated with ISP issues
- Unraid server reachable via LAN but bot reports "cannot connect"

**Phase to address:**

Phase 2 (Network Resilience Strategy) — BEFORE cutover, implement the fallback mechanism

---

### Pitfall 3: GraphQL Query Result Structure Changes Break Response Parsing

**What goes wrong:**

Commands reach the API, but the bot returns garbled data, shows empty container lists, or crashes on status checks. Field name changes (`state: "RUNNING"` vs `status: "running"`), nested structure differences (Docker's flat JSON vs GraphQL's nested response), and uppercase/lowercase variations break parsing logic across 60 Code nodes in the main workflow.

**Why it happens:**

The Docker REST API returns flat JSON arrays; GraphQL returns a nested `{ data: { docker: { containers: [...] } } }` structure. Developers update a few obvious parsing nodes but miss edge cases in error handling, batch processing, and inline keyboard builders. The codebase already carries field-behavior warnings (`state` values are UPPERCASE, `names` prefixed with `/`), which suggests the parsing is brittle.

**How to avoid:**

1. Create a GraphQL response normalization layer that transforms Unraid responses to match the Docker API shape
2. Add response schema validation in EVERY HTTP Request node (n8n's JSON schema validation)
3. Test response parsing independently from workflow logic (unit test the Code nodes)
4. Document ALL field format differences in normalization layer comments
5. Use TypeScript types for response shapes (n8n Code nodes support TypeScript)
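A minimal sketch of the normalization layer from step 1. The `names`/`state` handling follows the field behaviors documented above; any other field names are assumptions to verify against the live schema:

```javascript
// Flattens the nested GraphQL response into the flat shape the existing
// Code nodes expect. GraphQL `state` values are UPPERCASE and `names`
// entries carry a leading '/', so both are normalized here.
function normalizeContainers(graphqlResponse) {
  if (graphqlResponse.errors?.length) {
    // Surface GraphQL errors as exceptions instead of half-parsed data.
    throw new Error(`GraphQL error: ${graphqlResponse.errors[0].message}`);
  }
  const containers = graphqlResponse.data?.docker?.containers ?? [];
  return containers.map(c => ({
    id: c.id,
    name: (c.names?.[0] ?? '').replace(/^\//, ''), // strip leading '/'
    state: (c.state ?? '').toLowerCase(),          // RUNNING -> running
  }));
}
```

Keeping this as the only place that knows the GraphQL shape means a future schema change touches one node, not sixty.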

**Warning signs:**

- Container list shows but names display as `undefined` or `[object Object]`
- Status command returns "running" for stopped containers or vice versa
- Batch selection keyboard shows wrong container names
- Error messages contain the raw GraphQL error structure (`response.errors[0].message`) instead of friendly text

**Phase to address:**

Phase 3 (Response Schema Normalization) — BEFORE touching any sub-workflow, build and test the normalization

---

### Pitfall 4: Unraid GraphQL Schema Changes Silently Break Operations

**What goes wrong:**

Operations that worked yesterday fail today with cryptic errors. Unraid's GraphQL schema evolves (field additions, deprecations, type changes), but the bot has no detection mechanism. The ARCHITECTURE.md already documents one schema discrepancy: the `isUpdateAvailable` field documented in Phase 14 research does NOT exist in the actual Unraid 7.2 schema.

**Why it happens:**

GraphQL schemas evolve continuously (additive changes, deprecations) per best practice. Unlike REST API versioning (breaking changes = new `/v2/` endpoint), GraphQL encourages in-place evolution. Phase 14 research used outdated or incorrect sources. With no schema introspection validation in the deployment pipeline, schema mismatches only surface as runtime errors.

**How to avoid:**

1. Implement a schema introspection check at workflow startup (query the `__schema` endpoint)
2. Store an expected schema snapshot in the repo, compare on deployment
3. Add field existence checks BEFORE using optional fields in queries
4. Use GraphQL Inspector or similar tooling in CI/CD to detect breaking changes
5. Subscribe to the Unraid API changelog/release notes
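Steps 1 and 3 can be sketched with standard GraphQL introspection. The `DockerContainer` type name and the required-field list are assumptions to confirm against the generated schema:

```javascript
// Standard GraphQL introspection: ask the server which fields a type
// exposes, then diff against the fields the workflows actually query.
const INTROSPECTION_QUERY = `{
  __type(name: "DockerContainer") { fields { name } }
}`;

// Returns the required fields that the server's schema no longer exposes.
function findMissingFields(introspectionResult, requiredFields) {
  const available = new Set(
    (introspectionResult.data?.__type?.fields ?? []).map(f => f.name)
  );
  return requiredFields.filter(f => !available.has(f));
}
```

Run at workflow startup: a non-empty result means the schema drifted (exactly how the phantom `isUpdateAvailable` field would have been caught before deployment).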

**Warning signs:**

- New Unraid version installed, bot starts throwing "unknown field" errors
- Operations succeed on the test server (older Unraid) but fail on production (newer Unraid)
- GraphQL returns `errors: [{ message: "Cannot query field 'X' on type 'Y'" }]`
- Update status sync stops working after an Unraid update

**Phase to address:**

Phase 4 (Schema Validation Layer) — add introspection checks; implement before full cutover

---

### Pitfall 5: Credential Rotation Kills Bot Mid-Operation

**What goes wrong:**

The bot stops responding to all commands. An Unraid admin rotates the API key for security hygiene (recommended practice for 2026), but n8n's "Unraid API Key" Header Auth credential still holds the old key. All GraphQL requests return 401 Unauthorized. The dual-credential setup (`.env.unraid-api` for CLI testing plus n8n Header Auth for workflows) means updating one doesn't update the other.

**Why it happens:**

2026 security best practices mandate regular credential rotation, yet API keys "remain valid forever unless someone revokes or rotates them manually" per research. The system uses TWO separate credential stores that must be synchronized by hand. No monitoring detects credential expiration, and Unraid doesn't warn before keys are rotated.

**How to avoid:**

1. Consolidate credential storage: use ONLY n8n Header Auth, remove the `.env.unraid-api` CLI pattern
2. Implement 401 error detection with a user-friendly message: "API key invalid, check Unraid API Keys settings"
3. Add a credential validation endpoint check on workflow startup
4. Document the credential rotation procedure in CLAUDE.md and user docs
5. Consider OAuth 2.0 migration if Unraid adds support (more rotation-friendly)
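The 401 detection from step 2 can be sketched as an error classifier for a Code node placed after the HTTP Request node. The exact status-code plumbing depends on the node's error output, so treat the field names as assumptions:

```javascript
// Maps raw HTTP failures to user-facing messages. 401 -> rotated/invalid
// key (not retryable), 429 -> rate limit (retryable), anything else ->
// surface the first GraphQL error message if one exists.
function classifyApiError(statusCode, body) {
  if (statusCode === 401) {
    return {
      retryable: false,
      userMessage: 'API key invalid — check Settings → API Keys on the Unraid server.',
    };
  }
  if (statusCode === 429) {
    return { retryable: true, userMessage: 'Rate limited by the Unraid API, retrying shortly.' };
  }
  const detail = body?.errors?.[0]?.message ?? `HTTP ${statusCode}`;
  return { retryable: false, userMessage: `Unraid API error: ${detail}` };
}
```

Distinguishing retryable from non-retryable failures here keeps the retry loop from hammering a dead credential.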

**Warning signs:**

- All GraphQL operations fail with 401 errors
- Bot worked yesterday, stopped today without code changes
- CLI testing with `.env.unraid-api` works but workflows fail (keys out of sync)
- The Unraid API Keys page shows "Last used: N days ago" with a large N

**Phase to address:**

Phase 5 (Authentication Resilience) — implement before cutover; add monitoring

---

### Pitfall 6: Sub-Workflow Timeout Errors Lost in Propagation

**What goes wrong:**

A user triggers a container update, the bot appears to hang, and no error message comes back. After 2 minutes the execution silently fails. Logs show a sub-workflow timeout, but the main workflow never receives the error. The user retries, creating duplicate operations. This is a known n8n issue: "Execute Workflow node ignores the timeout of the sub-workflow."

**Why it happens:**

n8n Execute Workflow nodes don't properly propagate sub-workflow timeout errors to the parent workflow. The cloud relay adds 200-500ms latency per request, so update operations (pull image, recreate container) that completed in 10-30 seconds with the local Docker socket now take 60-120 seconds. The default timeout becomes too aggressive, but timeout errors never surface to the user.

**How to avoid:**

1. Increase ALL sub-workflow timeouts by 3-5x to account for cloud relay latency
2. Implement a client-side timeout in the main workflow (Code node timestamp checks)
3. Add progress indicators for long-running operations (Telegram "typing" action every 10 seconds)
4. Configure HTTP Request node timeouts explicitly (don't rely on the workflow-level timeout)
5. Test timeouts with network throttling simulation
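The client-side timeout from step 2 can be sketched as a deadline wrapper around the sub-workflow call, so the main workflow always gets an answer even when the Execute Workflow node swallows the timeout. The 120s budget below is an assumption derived from the relay-latency estimates above:

```javascript
// Race a long-running call against an explicit deadline so the parent
// workflow can report a timeout to the user instead of hanging silently.
function withDeadline(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} exceeded ${ms}ms deadline`)), ms);
  });
  // Clear the timer either way so the process isn't kept alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Usage in the main workflow would look like `withDeadline(runUpdateSubworkflow(id), 120_000, 'update')`, where `runUpdateSubworkflow` is a hypothetical wrapper around the Execute Workflow call.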

**Warning signs:**

- Update operations show "executing" for 2+ minutes then disappear
- Execution logs show a sub-workflow timeout but no error message sent to the user
- Users report "bot doesn't respond to update commands"
- Success rate drops for slow operations (image pull, large container recreate)

**Phase to address:**

Phase 6 (Timeout Hardening) — adjust before cutover; test under latency

---

### Pitfall 7: Race Condition Between Container State Query and Action Execution

**What goes wrong:**

A user issues "stop plex". The bot queries the container list (container running) and sends the stop command, but the container was already stopped by another process (Unraid WebGUI, another bot user). The Unraid API returns a "container not running" error, yet the bot displays "successfully stopped." Callback data carries container state that is 30+ seconds stale (Telegram message edit cycle).

**Why it happens:**

The GraphQL query and mutation are separate HTTP requests, each adding 200-500ms of cloud relay latency, so container state can change between query and action. The Docker socket proxy's sub-10ms latency made such races rare. Telegram inline keyboards cache container state in callback data (the 64-byte limit prevents re-querying), and multiple users can trigger conflicting actions on the same container.

**How to avoid:**

1. Implement optimistic locking: query container state immediately before the action, abort if the state changed
2. Add a version/timestamp to callback data, reject stale callbacks (>30 seconds old)
3. Handle "already in target state" as success (the 304 pattern from the Docker API)
4. Query fresh state after the action completes, show the actual result to the user
5. Add conflict detection: if an action fails with a state error, query and show the current state
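Steps 2 and 3 can be sketched as a pre-flight check in a Code node. The 30-second window mirrors the guidance above; the callback payload fields (`ts`, `targetState`) are hypothetical names chosen to fit the 64-byte budget:

```javascript
// Pre-flight check before sending a mutation: reject stale callbacks and
// treat "already in target state" as a no-op success (the 304 pattern).
const MAX_CALLBACK_AGE_MS = 30_000;

function checkCallback(callback, currentState, now = Date.now()) {
  if (now - callback.ts > MAX_CALLBACK_AGE_MS) {
    return { proceed: false, reason: 'stale', userMessage: 'Buttons expired — run the command again.' };
  }
  if (currentState === callback.targetState) {
    // Another actor already did it: report success, skip the mutation.
    return { proceed: false, reason: 'noop', userMessage: `Container is already ${currentState}.` };
  }
  return { proceed: true };
}
```

The fresh `currentState` argument is what the optimistic-locking query (step 1) returns immediately before the action.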

**Warning signs:**

- "Successfully stopped X" message but the container is still running when the user checks status
- Action commands fail with "container already stopped/started" errors
- Batch operations report success but some containers end in the wrong state
- Multiple users report conflicts when managing the same container

**Phase to address:**
|
||||
Phase 2 (Update Flow Verification) — ensure network config preservation
|
||||
Phase 3 (UAT) — test multi-container network scenarios
|
||||
Phase 7 (State Consistency Layer) — Implement before cutover, critical for multi-user
|
||||
|
||||
---

### Pitfall 8: Dual-Write Period Data Inconsistency

**What goes wrong:**

During the migration cutover, some operations write to the Docker API, others to the Unraid API. The container list query returns different results depending on which API responded. Status updates go to Unraid but actions go to Docker, creating split-brain state. Rollback becomes impossible because no single source of truth exists.

**Why it happens:**

Phased migration requires running both systems simultaneously. The developer enables a feature flag to route reads to Unraid but keeps writes on Docker for safety. Cache invalidation becomes impossible — Docker changes are invisible to Unraid queries, and Unraid changes are invisible to Docker queries. Callback data mixes Docker IDs and Unraid IDs from different query sources.

**How to avoid:**

1. Implement write-forwarding: Docker writes also trigger Unraid API updates (or vice versa)
2. Route ALL traffic through an abstraction layer that handles dual-write internally
3. Keep the cutover window SHORT (hours, not days) to minimize the inconsistency window
4. Use a feature flag for routing but maintain a single source of truth (either Docker OR Unraid)
5. Add request tracing to identify which API served each operation
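Prevention steps 2 and 4 can be combined in a small facade: one feature flag selects the backend, and reads and writes always resolve through the same flag so a single source of truth is preserved. This is a sketch — the client objects and method names are stand-ins, not the project's real Docker/Unraid clients:

```javascript
// Hypothetical facade: both reads and writes route through the SAME flag,
// so the bot never mixes Docker-sourced and Unraid-sourced state.
function makeContainerApi({ useUnraid, dockerClient, unraidClient }) {
  const backend = useUnraid ? unraidClient : dockerClient;
  const source = useUnraid ? 'unraid' : 'docker'; // request tracing (step 5)
  return {
    list: async () => ({ source, containers: await backend.list() }),
    stop: async (id) => ({ source, result: await backend.stop(id) }),
  };
}
```

Flipping `useUnraid` switches every operation at once; there is never a window where reads come from one API and writes go to the other.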
**Warning signs:**

- Status command shows a different container list than the batch selection keyboard
- Container appears stopped in one interface, running in another
- Update operation succeeds but status doesn't refresh in the Unraid WebGUI
- Rollback leaves orphaned container state (metadata mismatch between APIs)

**Phase to address:**

Phase 8 (Cutover Strategy) — plan before implementation starts, execute in the final phase
---

### Pitfall 9: GraphQL Batching vs n8n Batch Processing Confusion

**What goes wrong:**

Batch update operations (update all :latest containers) that processed 10 containers in 30 seconds now take 5+ minutes or time out. Each container update triggers a separate GraphQL HTTP Request → 10 containers = 10 round-trips through the cloud relay. Response body parsing fails because the developer assumes GraphQL response batching (send multiple queries in a single request) but implements n8n batch processing (loop through items).

**Why it happens:**

n8n's batching (the Items per Batch setting on the HTTP Request node) is for rate limiting, NOT efficient batching. GraphQL supports query batching but requires a specific request format. Cloud relay latency multiplied by sequential operations destroys performance. The Docker socket proxy had negligible latency, so sequential operations were acceptable.

**How to avoid:**

1. Use GraphQL batching for reads: a single request with multiple container queries
2. Keep mutations sequential (safer) but add parallel processing for independent operations
3. Configure n8n HTTP Request node batching: 3-5 items per batch, 500ms interval
4. Add progress streaming: update the Telegram message after each container (don't wait for all)
5. Implement a timeout circuit breaker: abort the batch if any single operation takes >60 seconds
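Step 1 — batching reads into one request — can be done with GraphQL field aliases: several aliased selections travel in a single POST to `/graphql`. The query shape below (`container(id: …) { id state }`) is an illustrative assumption and must be checked against Unraid's actual schema:

```javascript
// Build ONE GraphQL document that reads N containers via aliases,
// instead of issuing N HTTP round-trips through the cloud relay.
// Field and argument names are assumptions — verify against the real schema.
function buildBatchedStatusQuery(ids) {
  const fields = ids
    .map((id, i) => `c${i}: container(id: "${id}") { id state }`)
    .join('\n  ');
  return `query BatchStatus {\n  ${fields}\n}`;
}

const query = buildBatchedStatusQuery(['abc', 'def', 'ghi']);
// A single HTTP Request node POST carries all three reads:
//   body: JSON.stringify({ query })
```

With 200-500ms relay latency per round-trip, collapsing 10 reads into one request turns a multi-second status refresh into roughly one round-trip.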
**Warning signs:**

- Batch operations work for 2-3 containers but time out for 10+
- Linear performance degradation (10 containers take 10x longer than 1)
- n8n execution logs show sequential HTTP requests with 500ms gaps
- Users cancel batch operations because they appear hung

**Phase to address:**

Phase 9 (Batch Performance Optimization) — after basic operations work, before batch features are enabled
---

### Pitfall 10: Telegram Callback Data Size Limit Breaks With Longer IDs

**What goes wrong:**

Inline keyboard buttons stop working. The user taps "Stop" on a container status page and nothing happens. Logs show a "callback data exceeds 64 bytes" error. Docker IDs (12 chars) fit in the callback format `stop:8a9907a24576`; Unraid PrefixedIDs (129 chars) in the form `stop:{64-char-hash}:{64-char-hash}` do not.

**Why it happens:**

Telegram's 64-byte callback data limit was manageable with Docker IDs. The system already uses bitmap encoding for batch selection (base36 BigInt), but single-container operations still use the colon-delimited format. The migration assumes the callback format is unchanged and doesn't account for the 10x ID length increase.

**How to avoid:**

1. Implement container ID shortening: store a PrefixedID lookup table in persistent storage (n8n static data is unreliable — see Technical Debt Patterns), use an index in the callback
2. Alternative: hash the PrefixedID to an 8-character base62 string, store the mapping
3. Update the callback format: `s:idx` where idx is a lookup key, not the full container ID
4. Test ALL callback patterns (status, actions, confirmation, batch) with Unraid IDs
5. Implement callback data size validation in Prepare Input nodes
**Warning signs:**

- Callback queries fail silently (no error shown to the user)
- n8n logs show "callback data size exceeded" errors
- Inline keyboard buttons work for containers with short names, fail for others
- Parse Callback Data node returns truncated IDs

**Phase to address:**

Phase 2 (Callback Data Encoding) — parallel to Phase 1, before any inline keyboard migration

---
## Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Skip Unraid version detection | Faster implementation | Silent breakage on Unraid upgrades | Never — version changes are documented |
| Mount `/var/lib/docker` into n8n | Direct file access | Security bypass, tight coupling, upgrade fragility | Only if helper script impossible |
| Sync immediately after update (no delay) | Simpler code | Race conditions with Unraid update check | Only for single-container updates (not batch) |
| Assume file format from one Unraid version | Works on dev system | Breaks for users on different versions | Only during Phase 1 investigation (must validate before Phase 2) |
| Write directly to status file without locking | Avoids complexity | File corruption on concurrent access | Never — use atomic operations |
| Hardcode file paths | Works today | Breaks if Unraid changes internal structure | Acceptable if combined with version detection + validation |
| Keep Docker socket proxy running during migration, route errors back to it | Zero-downtime cutover, instant rollback | Maintenance burden, two credential systems, split-brain debugging | Acceptable for 1-2 week migration window MAX |
| Skip GraphQL response normalization, update parsers directly | Fewer code layers, "simpler" architecture | 60+ Code nodes to update, high bug rate, impossible to rollback | Never — normalization is mandatory |
| Use n8n workflow static data for ID lookup table | No external database needed | Static data unreliable (execution-scoped per ARCHITECTURE.md), lost on workflow reimport | Never — already documented as broken in CLAUDE.md |
| Implement feature flag routing in main workflow only | Easy to toggle, single point of control | Sub-workflows unaware of API source, error messages confusing | Acceptable if sub-workflows receive normalized responses |
| Skip schema introspection validation | Faster deployment, fewer dependencies | Silent breakage on Unraid updates, no early warning | Never — schema changes are inevitable |
## Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Unraid update status file | Assume JSON structure is stable | Validate structure before modification, reject unknown formats |
| Docker socket proxy | Expect filesystem access like Docker socket mount | Use helper script on host OR Unraid API if available |
| Unraid API (if used) | Assume unauthenticated localhost access | Check auth requirements, API key management |
| File modification timing | Write immediately after container update | Delay 5-10 seconds to avoid collision with Docker event handlers |
| Batch operations | Sync after each container update | Collect all updates, sync once after batch completes |
| Network config preservation | Assume Docker API preserves settings | Explicitly copy network settings from old container inspect to new create |
| n8n GraphQL node | Using dedicated GraphQL node instead of HTTP Request node | Use HTTP Request node with POST to `/graphql` — better error handling, supports Header Auth credential |
| n8n Header Auth | Setting credential in HTTP Request node but forgetting to configure in sub-workflows | ALL 7 sub-workflows need credential configured, not inherited from main workflow |
| Unraid API authentication | Using environment variables directly in workflow expressions | Use n8n credential system, environment variables only for host URL |
| myunraid.net URL format | Including `/graphql` in `UNRAID_HOST` environment variable | Env var should be base URL only, append `/graphql` in HTTP Request node URL field |
| GraphQL error responses | Checking `response.error` like REST APIs | GraphQL returns HTTP 200 with `errors` array, check `response.errors` not `response.error` |
| Container ID format | Assuming IDs are strings, treating them as opaque tokens | Validate ID format `^[a-f0-9]{64}:[a-f0-9]{64}$`, store in typed fields |
| Docker 204 No Content | Assuming empty response = error | Empty response body with HTTP 204 = success per CLAUDE.md |
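The GraphQL error gotcha from the table deserves a concrete shape: per the GraphQL spec, a failed operation still returns HTTP 200, with failures reported in a top-level `errors` array. A small unwrap helper (a sketch, not the project's actual parser) makes the check explicit:

```javascript
// GraphQL success/failure is judged by the absence of the `errors` array,
// never by HTTP status or a REST-style `error` field.
function unwrapGraphQL(body) {
  if (Array.isArray(body.errors) && body.errors.length > 0) {
    const messages = body.errors.map(e => e.message).join('; ');
    throw new Error(`GraphQL error: ${messages}`);
  }
  return body.data;
}

const ok = unwrapGraphQL({ data: { container: { state: 'running' } } });
// unwrapGraphQL({ errors: [{ message: 'Unauthorized' }] }) throws instead
```

Routing every response through one helper like this also gives a single place to redact credentials before anything reaches the logs.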
## Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Sync per container in batch | File contention, slow batch updates | Batch sync after all updates complete | 5+ containers in batch |
| Full file rewrite for each sync | High I/O, race window increases | Delete stale entries OR modify only changed entries | 10+ containers tracked |
| No retry logic for file access | Silent sync failures | Exponential backoff retry (max 3 attempts) | Concurrent Unraid update check |
| Sync blocks workflow execution | Slow Telegram responses | Async sync (fire and forget) OR move to separate workflow | 3+ second file operations |
| Sequential GraphQL queries in loops | Batch operations timeout, linear slowdown | Use GraphQL query batching or parallel HTTP requests | 5+ containers in batch operation |
| No HTTP Request timeout configuration | Indefinite hangs, zombie workflows | Set explicit timeout on EVERY HTTP Request node (30-60 seconds) | First cloud relay hiccup |
| Callback data re-querying | Every inline keyboard tap queries full container list | Cache container state in callback data (within 64-byte limit) | 10+ active users, rate limiting kicks in |
| Missing retry logic for transient errors | Intermittent failures, user frustration | Implement exponential backoff retry (3 attempts, 1s → 2s → 4s delay) | Network instability, cloud relay rate limits |
| No operation result caching | Same container queried 5 times in single workflow execution | Cache query results in workflow execution context for 30 seconds | Complex workflows with multiple sub-workflow calls |

Note: the current system has 8-15 containers (from UAT scenarios). These performance traps are unlikely to manifest at that scale, but prevention is low-cost.
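The retry row (3 attempts, 1s → 2s → 4s) can be sketched as a small wrapper around any async operation, such as one HTTP Request to `/graphql`. Names and defaults are illustrative:

```javascript
// Retry a transient-failure-prone async operation with exponential backoff.
// Attempts: 3; delays between attempts: 1000ms, then 2000ms.
async function withRetry(op, attempts = 3, baseDelayMs = 1000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      if (i === attempts - 1) throw err;   // out of retries — surface the error
      const delay = baseDelayMs * 2 ** i;  // 1000, 2000, 4000, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

Pair this with the explicit per-node timeout from the table: the timeout bounds each attempt, the backoff bounds the total retry budget, and together they prevent both zombie workflows and retry storms against the cloud relay.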
## Security Mistakes

Domain-specific security issues beyond general web security.

| Mistake | Risk | Prevention |
|---------|------|------------|
| Mount entire `/var/lib/docker` into n8n | n8n gains root-level access to all Docker data | Mount only specific file OR use helper script |
| World-writable status file permissions | Any container can corrupt Unraid state | Verify file permissions, use host-side helper with proper permissions |
| No validation before writing to status file | Malformed data corrupts Unraid Docker UI | Validate JSON structure, reject unknown formats |
| Expose Unraid API key in workflow | API key visible in n8n execution logs | Use n8n credentials, not hardcoded keys |
| Execute arbitrary commands on host | Container escape vector | Whitelist allowed operations in helper script |
| Storing API key in workflow JSON | Credential exposure in git, logs, backups | Use n8n credential system exclusively, never hardcode |
| No API permission scope validation | Over-privileged API key, blast radius on compromise | Use minimal permission (`DOCKER:UPDATE_ANY` only), validate in workflow |
| Telegram user ID auth in single location | Bypass via direct sub-workflow execution | Implement auth check in EVERY sub-workflow, not just main |
| Logging full GraphQL responses | API key, sensitive container config in logs | Log only operation result, redact credentials from error messages |
| No rate limiting on bot commands | API key exhaustion, Unraid API rate limits | Implement per-user rate limiting (5 commands/minute), queue batched operations |
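The per-user rate limit from the last row (5 commands/minute) can be sketched as a sliding window keyed by Telegram user ID. This is an illustrative in-memory version; in n8n the window state would need external persistent storage, since static data is unreliable:

```javascript
// Sliding-window rate limiter: allow at most `limit` commands per
// `windowMs` per user. In-memory Map shown for illustration only.
const windows = new Map();

function allowCommand(userId, nowMs, limit = 5, windowMs = 60_000) {
  const recent = (windows.get(userId) || []).filter(t => nowMs - t < windowMs);
  if (recent.length >= limit) {
    windows.set(userId, recent); // keep the pruned window, deny the command
    return false;
  }
  recent.push(nowMs);
  windows.set(userId, recent);
  return true;
}
```

Denied commands should get an immediate "slow down" reply rather than being silently dropped, matching the UX guidance below about never failing silently.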
## UX Pitfalls

Common user experience mistakes in this domain.

| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Silent sync failure | User thinks status updated, Unraid still shows "update ready" | Log error to correlation ID, send Telegram notification on sync failure |
| No indication of sync status | User doesn't know if sync worked | Include in update success message: "Updated + synced to Unraid" |
| Sync delay causes confusion | User checks Unraid immediately, sees old status | Document 10-30 second sync delay in README troubleshooting |
| Unraid badge still shows after sync | User thinks update failed | README: explain Unraid caches aggressively, manual "Check for Updates" forces refresh |
| Batch update spam notifications | 10 updates = 10 Unraid notifications | Batch sync prevents this (if implemented correctly) |
| No latency indication | User unsure if command received, double-taps, duplicate operations | Send immediate "Processing..." message, update on completion |
| Generic error messages | "Operation failed" tells user nothing, can't self-recover | Parse Unraid API errors, show actionable message: "Container already stopped, current state: exited" |
| No migration communication | Users confused why bot is slower after "upgrade" | Send broadcast message before cutover: "Bot migrating to Unraid API, expect 2-3x slower responses for improved reliability" |
| Hiding internet dependency | Users blame bot when ISP is down | Error message: "Cannot reach Unraid API (requires internet), check network connection" |
| No rollback announcement | Users report bugs, developer fixes by rollback, users still see bugs (cache) | Announce rollbacks: "Rolled back to Docker socket, please retry failed operations" |
## "Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

- [ ] **File modification:** Wrote to status file — verify atomic operation (temp file + rename, not direct write)
- [ ] **Batch sync:** Syncs after each update — verify batching for multi-container operations
- [ ] **Version compatibility:** Works on dev Unraid — verify against 6.11, 6.12, 7.0, 7.2
- [ ] **Error handling:** Sync returns success — verify retry logic for file contention
- [ ] **Network preservation:** Container starts after update — verify DNS resolution from other containers
- [ ] **Race condition testing:** Works in sequential tests — verify concurrent update + Unraid check scenario
- [ ] **Filesystem access:** Works on dev system — verify n8n container can actually reach file (or helper script exists)
- [ ] **Notification validation:** No duplicate notifications in single test — verify batch scenario (5+ containers)
- [ ] **Container actions:** Often missing state validation BEFORE action — verify error message when stopping already-stopped container shows current state
- [ ] **GraphQL errors:** Often missing `response.errors` array parsing — verify malformed query returns user-friendly message, not JSON dump
- [ ] **Timeout handling:** Often missing client-side timeout — verify 2-minute operation shows progress indicator, doesn't appear hung
- [ ] **Credential expiration:** Often missing 401 error detection — verify rotated API key returns "credential invalid" not generic error
- [ ] **Callback data encoding:** Often missing length validation — verify longest possible container ID + action fits in 64 bytes
- [ ] **Schema validation:** Often missing field existence checks — verify missing field returns helpful error, not "undefined is not a function"
- [ ] **Batch progress:** Often missing incremental updates — verify batch operation shows "3/10 completed" updates, not just final result
- [ ] **Rollback procedure:** Often missing documented steps — verify CLAUDE.md has exact commands to switch back to Docker socket proxy
- [ ] **Dual-credential sync:** Often missing procedure to update both `.env.unraid-api` and n8n credential — verify documented workflow
- [ ] **Performance baseline:** Often missing pre-migration metrics — verify recorded latency/success rate to compare post-migration
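The callback-encoding checklist item can be enforced at build time rather than discovered at tap time: validate the byte length when the keyboard is assembled, so an oversized ID fails loudly in the Prepare Input step. A minimal sketch, with the `action:token` format as an assumption:

```javascript
// Fail fast if callback_data would exceed Telegram's 64-byte limit.
// Byte length (not string length) matters, hence Buffer.byteLength.
function buildCallback(action, token) {
  const data = `${action}:${token}`;
  const bytes = Buffer.byteLength(data, 'utf8');
  if (bytes > 64) {
    throw new Error(`callback_data too long (${bytes} bytes): ${data}`);
  }
  return data;
}
```

Run this against the longest real container ID during UAT; a thrown error in n8n's execution log is far easier to diagnose than a button that silently does nothing.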
## Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Corrupted status file | LOW | Delete `/var/lib/docker/unraid-update-status.json`, Unraid recreates on next update check |
| State desync (Unraid shows stale) | LOW | Manual "Check for Updates" in Unraid UI forces recalculation |
| Unraid version breaks format | MEDIUM | Disable sync feature via feature flag, update sync logic for new format |
| Network resolution broken | MEDIUM | Restart Docker service in Unraid (`Settings -> Docker -> Enable: No -> Yes`) |
| File permission errors | LOW | Helper script with proper permissions, OR mount file read-only + use API |
| n8n can't reach status file | HIGH | Architecture change required (add helper script OR switch to API) |
| Notification spam | LOW | Unraid notification settings: disable Docker update notifications temporarily |
| Container ID mismatch breaking all operations | HIGH (all operations broken) | 1. Rollback to Docker socket proxy immediately 2. Implement ID translation layer 3. Test with synthetic Unraid IDs 4. Re-deploy |
| myunraid.net relay outage | LOW (temporary, auto-recover) | 1. Wait for relay recovery OR 2. Implement LAN fallback if extended outage 3. Monitor status at connect.myunraid.net |
| GraphQL response parsing errors | MEDIUM (degraded functionality) | 1. Identify broken Code node from error logs 2. Add response schema logging 3. Fix parser 4. Redeploy affected sub-workflow |
| Schema changes breaking queries | MEDIUM (affected features broken) | 1. Query Unraid `__schema` endpoint 2. Compare to expected schema snapshot 3. Update queries to match current schema 4. Add missing field checks |
| Credential rotation killing bot | LOW (quick fix) | 1. Generate new API key in Unraid 2. Update n8n Header Auth credential 3. Reactivate workflow (auto-retries) |
| Sub-workflow timeout errors | LOW (increase timeouts) | 1. Identify timeout threshold from logs 2. Increase sub-workflow timeout by 3x 3. Add progress indicators 4. Redeploy |
| Race condition state conflicts | MEDIUM (requires code changes) | 1. Implement fresh state query before action 2. Handle "already in state" as success 3. Show actual state after operation |
| Dual-write inconsistency | HIGH (data integrity compromised) | 1. Choose source of truth (Docker OR Unraid) 2. Query truth source, discard other 3. Regenerate callback data 4. Force user refresh |
| Batch operation performance issues | MEDIUM (requires optimization) | 1. Implement GraphQL batching for reads 2. Add parallel processing for mutations 3. Stream progress updates |
| Callback data size exceeded | MEDIUM (redesign callback format) | 1. Implement ID shortening with lookup table 2. Update ALL Prepare Input nodes 3. Test all callback paths 4. Redeploy |
## Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| State desync (Docker API vs Unraid) | Phase 1 (Investigation) + Phase 2 (Sync) | UAT: update via bot, verify Unraid shows "up to date" |
| Race condition (concurrent access) | Phase 2 (Sync Implementation) | Stress test: simultaneous bot update + manual Unraid check |
| Unraid version compatibility | Phase 1 (Format Documentation) + Phase 3 (Multi-version UAT) | Test on Unraid 6.12, 7.0, 7.2 |
| Filesystem access from container | Phase 1 (Architecture Decision) | Deploy to prod, verify file access or helper script works |
| Notification spam | Phase 2 (Batch Sync) | UAT: batch update 5+ containers, count notifications |
| n8n state persistence assumption | Phase 1 (Architecture) | Code review: reject any `staticData` usage for sync queue |
| Network recreation (br0) | Phase 2 (Update Flow) + Phase 3 (UAT) | Test: update container on custom network, verify resolution |
| Container ID format mismatch | Phase 1: ID Mapping Layer | Test Docker ID fails validation, Unraid ID passes, translation correct |
| myunraid.net dependency | Phase 2: Network Resilience | Disconnect internet, verify fallback message or graceful degradation |
| GraphQL response structure | Phase 3: Response Normalization | Compare normalized output to Docker API shape, all fields present |
| Schema changes | Phase 4: Schema Validation | Change expected schema snapshot, verify detection on next workflow run |
| Credential rotation | Phase 5: Auth Resilience | Rotate API key, verify 401 error message user-friendly and actionable |
| Sub-workflow timeouts | Phase 6: Timeout Hardening | Simulate 2-minute operation, verify progress indicator and completion |
| Race conditions | Phase 7: State Consistency | Two users stop same container simultaneously, verify conflict resolution |
| Dual-write inconsistency | Phase 8: Cutover Strategy | Query both APIs during cutover, verify consistent results |
| Batch performance | Phase 9: Batch Optimization | Update 10 containers, verify completion <60 seconds with progress |
| Callback data size | Phase 2: Callback Encoding | Generate callback with longest ID, verify <64 bytes |
## Sources

**HIGH confidence (official/authoritative):**

- [Unraid API — Docker and VM Integration](https://deepwiki.com/unraid/api/2.4.2-notification-system) — DockerService, DockerEventService architecture
- [Unraid API — Notifications Service](https://deepwiki.com/unraid/api/2.4.1-notifications-service) — Race condition handling, duplicate detection
- [Docker Socket Proxy Security](https://github.com/Tecnativa/docker-socket-proxy) — Security model, endpoint filtering
- [Docker Socket Security Critical Vulnerability Guide](https://medium.com/@instatunnel/docker-socket-security-a-critical-vulnerability-guide-76f4137a68c5) — Filesystem access risks
- [n8n Docker File System Access](https://community.n8n.io/t/file-system-access-in-docker-environment/214017) — Container filesystem limitations

**GraphQL Migration Patterns:**

- [Schema Migration - GraphQL](https://dgraph.io/docs/graphql/schema/migration/)
- [How to Handle Versioning in GraphQL APIs](https://oneuptime.com/blog/post/2026-01-24-graphql-api-versioning/view)
- [Migrating from REST to GraphQL - GitHub Docs](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql)
- [3 GraphQL pitfalls and how we avoid them](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them)

**MEDIUM confidence (community-verified):**

- [Watchtower Discussion #1389](https://github.com/containrrr/watchtower/discussions/1389) — Unraid doesn't detect external updates
- [Unraid Docker Troubleshooting](https://docs.unraid.net/unraid-os/troubleshooting/common-issues/docker-troubleshooting/) — br0 network recreation issue
- [Unraid Forums: Docker Update Check](https://forums.unraid.net/topic/49041-warning-file_put_contentsvarlibdockerunraid-update-statusjson-blah/) — Status file location
- [Unraid Forums: 7.2.1 Docker Issues](https://forums.unraid.net/topic/195255-unraid-721-upgrade-seems-to-break-docker-functionalities/) — Version upgrade breaking changes

**n8n Integration Issues:**

- [HTTP Request node common issues | n8n Docs](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.httprequest/common-issues/)
- [Error handling | n8n Docs](https://docs.n8n.io/flow-logic/error-handling/)
- [Execute Workflow node ignores timeout - GitHub Issue #1572](https://github.com/n8n-io/n8n/issues/1572)
- [Error Handling in n8n: How to Retry & Monitor Workflows](https://easify-ai.com/error-handling-in-n8n-monitor-workflow-failures/)

**LOW confidence (single source, needs validation):**

- File format structure (`/var/lib/docker/unraid-update-status.json`) — inferred from forum posts, not officially documented
- Unraid update check timing/frequency — user-configurable, no default documented
- Cache invalidation triggers — inferred from API docs, not explicitly tested

**Migration Strategy:**

- [API migration dual-write pattern - AWS DMS](https://aws.amazon.com/blogs/database/rolling-back-from-a-migration-with-aws-dms/)
- [Zero-Downtime Database Migration: The Complete Engineering Guide](https://dev.to/ari-ghosh/zero-downtime-database-migration-the-definitive-guide-5672)
- [Canary releases with feature flags](https://www.getunleash.io/blog/canary-deployment-what-is-it)

**Project-specific (from existing codebase):**

- STATE.md — n8n static data limitation (Phase 10.2 findings)
- ARCHITECTURE.md — Current system architecture, socket proxy usage
- CLAUDE.md — n8n workflow patterns, sub-workflow contracts

**Authentication & Security:**

- [API Authentication Best Practices in 2026](https://dev.to/apiverve/api-authentication-best-practices-in-2026-3k4a)
- [Migrate from API keys to OAuth 2.1](https://www.scalekit.com/blog/migrating-from-api-keys-to-oauth-mcp-servers)

**Container Management:**

- [Race condition between stop and rm - GitHub Issue #130](https://github.com/apple/container/issues/130)
- [Eventual Consistency in Distributed Systems](https://www.geeksforgeeks.org/system-design/eventual-consistency-in-distributive-systems-learn-system-design/)

**Unraid Specific:**

- [Unraid Connect overview & setup | Unraid Docs](https://docs.unraid.net/unraid-connect/overview-and-setup/)
- Project ARCHITECTURE.md (verified container ID format, field behaviors, myunraid.net requirement)
- Project CLAUDE.md (Docker API patterns, n8n conventions, static data limitations)

---

*Pitfalls research for: Unraid Docker Manager — Docker Socket to GraphQL API Migration*
*Researched: 2026-02-09*
*Confidence: MEDIUM (verified Unraid-specific issues HIGH, general GraphQL patterns MEDIUM, n8n integration issues HIGH)*
# Project Research Summary

**Project:** Unraid Docker Manager v1.4 — Unraid API Native Migration
**Domain:** Migration from Docker socket proxy to Unraid GraphQL API for native container management
**Researched:** 2026-02-09
**Confidence:** HIGH
## Executive Summary

The migration from Docker socket proxy to Unraid's native GraphQL API is architecturally sound and operationally beneficial, but requires a hybrid approach because container logs are unavailable over the new API. Research confirms that Unraid's GraphQL API provides all required container control operations (start, stop, update) with simpler patterns than Docker's REST API, but container logs are NOT accessible via the Unraid API and must continue using the Docker socket proxy. This creates a hybrid architecture: Unraid GraphQL for control operations, with the Docker socket proxy retained read-only for logs retrieval.

The recommended approach is a phased migration starting with simple operations (status queries, actions) to establish patterns, then tackling the complex update workflow, which simplifies from 9 Docker API nodes to 2 GraphQL nodes. The single `updateContainer` mutation atomically handles image pull, container recreation, and the critical update-status sync, solving v1.3's "apply update" badge persistence issue without manual file writes. Key architectural wins include container ID format (PrefixedID) normalization layers, GraphQL error handling standardization, and response shape transformation to maintain workflow contracts.

Critical risks center on container ID format mismatches (Docker 64-char vs Unraid 129-char PrefixedIDs), Telegram's 64-byte callback data limit with longer IDs, and the myunraid.net cloud relay's internet dependency introducing latency and outage risk. Mitigation requires ID translation layers implemented before any live operations, a callback data encoding redesign, and timeout adjustments for 200-500ms cloud relay latency. The research identifies 10 critical pitfalls with phase-mapped prevention strategies; confidence assessment shows HIGH for tested operations and MEDIUM for architectural patterns.
## Key Findings

### Recommended Stack

No new dependencies required. All infrastructure was established in Phase 14 (v1.3): Unraid GraphQL API connectivity, myunraid.net cloud relay URL, n8n Header Auth credential with API key, and an environment variable for UNRAID_HOST. Research confirms hybrid architecture necessity — the Docker socket proxy must remain deployed but reconfigured with minimal read-only permissions (CONTAINERS=1, POST=0) for logs access only.
**Core technologies:**

- **Unraid GraphQL API (7.2+):** Container control operations (list, start, stop, update) — Native integration provides automatic update status sync, structured errors, atomic update mutations
- **myunraid.net cloud relay:** Unraid API access URL — Avoids direct LAN IP nginx redirect auth stripping, but introduces internet dependency and 200-500ms latency
- **docker-socket-proxy (reduced scope):** Logs retrieval ONLY — Unraid API explicitly documents logs as NOT accessible via API, must use Docker socket
- **n8n HTTP Request node:** GraphQL API calls via POST /graphql — Replace Execute Command nodes with structured GraphQL requests, better timeout handling and error parsing

**Critical version requirements:**

- Unraid 7.2+ required for GraphQL API availability
- n8n HTTP Request node typeVersion 1.2+ for Header Auth credential support
### Expected Features

Most operations are drop-in replacements with the same user-facing behavior but a simpler implementation. The update workflow gains significant simplification (the 5-step Docker API flow collapses to a single mutation) plus automatic status sync.
**Must have (table stakes):**

- Container start/stop/restart — GraphQL mutations for start/stop; restart requires chaining stop + start (no native restart mutation)
- Container status query — GraphQL containers query with UPPERCASE state values, PrefixedID format
- Container update — Single `updateContainer` mutation replaces 5-step Docker API flow (pull, stop, remove, create, start)
- Container logs — GraphQL logs query exists in schema (field structure needs testing during implementation)
- Batch operations — Native `updateContainers(ids)` and `updateAllContainers` mutations for multi-container updates
**Should have (competitive):**

- Automatic update status sync — Unraid API's `updateContainer` mutation handles internal state sync, eliminating v1.3's manual file-write workaround
- Update detection via `isUpdateAvailable` field — Bot shows what Unraid sees, no digest comparison discrepancies (NOTE: field documented in research but may not exist in actual schema, validate during implementation)
- Batch update simplification — Native GraphQL batch mutations reduce network calls and latency

**Anti-features identified:**

- Real-time container stats — `dockerContainerStats` subscription requires WebSocket infrastructure, complex for n8n HTTP Request node
- Container autostart configuration — `updateAutostartConfiguration` mutation available but not user-requested
- Port conflict detection — `portConflicts` query useful for debugging but not core workflow
- Direct LAN fallback — Implement only if the myunraid.net relay proves unreliable in production; defer until proven necessary
### Architecture Approach

Migration affects 4 of 7 sub-workflows (Update, Actions, Status, Logs), totaling 18 Docker API nodes replaced with GraphQL HTTP Request nodes plus normalization layers. Three sub-workflows (Matching, Batch UI, Confirmation) remain unchanged, as they operate on data contracts, not API sources. The Update sub-workflow sees the largest impact: 34 nodes shrink to ~27 by replacing the 9-step Docker API flow with 1-2 GraphQL nodes.
**Major components:**

1. **GraphQL Response Normalization Layer** — Code nodes after every GraphQL query to transform Unraid's response shape (nested `data.docker.containers`) and field formats (UPPERCASE state, PrefixedID) to match workflow contracts. Prevents cascading failures across the 60+ Code nodes in the main workflow that expect the Docker API shape.

2. **Container ID Translation Layer** — The Matching sub-workflow outputs Unraid PrefixedID format (129 chars: `{server_hash}:{container_hash}`) instead of the Docker ID (64 chars). All Execute Workflow input preparation nodes pass an opaque containerId token; the value changes but the field name/contract stays stable.

3. **Callback Data Encoding Redesign** — Telegram's 64-byte callback limit is broken by PrefixedID length. Implement ID shortening with a lookup table or base62 hash mapping. Update ALL callback formats from `action:containerID` to `action:idx` with static data lookup.

4. **GraphQL Error Handling Pattern** — Standardized validation: check the `response.errors[]` array first (GraphQL returns HTTP 200 even for errors), parse structured error messages, handle HTTP 304 "already in state" as a success case, and validate the `response.data` structure before accessing fields.

5. **Hybrid API Router** — Sub-workflows route control operations to Unraid GraphQL (start, stop, update, status) and logs operations to the Docker socket proxy. The Docker proxy is reconfigured read-only (POST=0) to prevent accidental dual-write.
**Key patterns to follow:**

- One normalization Code node per GraphQL query response (Status, Actions, Update, Logs)
- Explicit timeout configuration on every HTTP Request node (30-60 seconds for mutations, account for cloud relay latency)
- Client-side timeout validation in the main workflow (timestamp checks; don't rely on Execute Workflow timeout propagation)
- Fresh state query immediately before action execution to avoid race conditions (200-500ms latency creates a stale-state window)
### Critical Pitfalls

**Top 5 pitfalls with prevention strategies:**
1. **Container ID Format Mismatch Breaking All Operations** — Docker 64-char hex vs Unraid 129-char PrefixedID. Passing the wrong format causes all operations to fail with "container not found." Prevention: Implement ID validation regex `^[a-f0-9]{64}:[a-f0-9]{64}$` BEFORE any live operations, update ALL 17 Execute Workflow input nodes, test with containers having similar names but different IDs. Address in Phase 1.

2. **Telegram Callback Data 64-Byte Limit Exceeded** — The callback format `stop:8a9907a24576` fit within the limit with short Docker IDs; `stop:{129-char-PrefixedID}` exceeds it, causing silent inline keyboard failures. Prevention: Redesign callback encoding to `action:idx` with a PrefixedID lookup table, hash to 8-char base62, test ALL callback patterns. Address in Phase 2.

3. **myunraid.net Cloud Relay Internet Dependency** — The bot becomes non-functional during internet outages despite LAN connectivity. Latency increases from sub-10ms (Docker socket) to 200-500ms (cloud relay). Prevention: Add network connectivity pre-flight checks, implement degraded-mode messaging, monitor relay latency as a first-class metric, document the internet dependency in error messages. Address in Phase 2.

4. **GraphQL Response Structure Normalization Missing** — Field name changes (State→state, UPPERCASE values) and a nested response structure (`data.docker.containers`); missing normalization causes parsing failures across 60 Code nodes. Prevention: Build the normalization layer BEFORE touching sub-workflows, add schema validation, test response parsing independently. Address in Phase 3.

5. **Sub-Workflow Timeout Errors Lost in Propagation** — Known n8n issue where the Execute Workflow node ignores sub-workflow timeouts. Cloud relay latency causes operations that completed in 10-30s to take 60-120s. Prevention: Increase ALL sub-workflow timeouts by 3-5x, implement a client-side timeout in the main workflow, add progress indicators, configure HTTP Request timeouts explicitly. Address in Phase 6.

**Additional critical pitfalls:**

- **Credential Rotation Kills Bot Mid-Operation** — Dual credential storage (`.env.unraid-api` + n8n Header Auth) falls out of sync, producing 401 errors with no detection. Prevention: Consolidate to the n8n credential only, implement user-friendly 401 error messaging.
- **Race Condition Between Query and Action** — 200-500ms latency creates a stale-state window; the container changes state between query and action execution. Prevention: Fresh state query before action, handle "already in state" as success.
- **Dual-Write Period Data Inconsistency** — Phased migration creates split-brain between Docker and Unraid APIs. Prevention: Short cutover window (hours, not days), single source of truth per operation.
- **Batch Performance Degradation** — Sequential operations multiply cloud relay latency (10 containers = 10x slower). Prevention: GraphQL batching for reads, parallel processing where safe, progress streaming.
- **GraphQL Schema Changes Silent Breakage** — The Unraid API evolves; field additions/deprecations break queries without warning. Prevention: Schema introspection checks on startup, field existence validation before use.
## Implications for Roadmap

Based on research, the suggested phase structure follows risk-mitigation order: infrastructure layers first, simple operations to prove patterns, and the complex update workflow last, once patterns are validated.
### Phase 1: Container ID Translation Layer

**Rationale:** ID format mismatch is a catastrophic failure point — it must be solid before any live API calls. All sub-workflows depend on container identification working correctly.

**Delivers:** PrefixedID validation, Matching sub-workflow outputs Unraid IDs, ID format documentation

**Addresses:** Container ID format mismatch pitfall (critical)

**Avoids:** All operations failing with "container not found" on cutover

**Complexity:** LOW — Pure data transformation, no API calls
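
The validation regex from the pitfall research can be wrapped in a small guard; a minimal sketch, assuming the 129-char `{server_hash}:{container_hash}` format (helper names `assertPrefixedId` and `containerHash` are illustrative, not part of the workflows):

```javascript
// Guard for the PrefixedID contract: 64 hex chars, a colon, 64 hex chars.
const PREFIXED_ID_RE = /^[a-f0-9]{64}:[a-f0-9]{64}$/;

function assertPrefixedId(id) {
  if (!PREFIXED_ID_RE.test(id)) {
    throw new Error(`Not a valid Unraid PrefixedID: ${id}`);
  }
  return id; // 129 chars total
}

// Extract the container-hash half for the rare case a Docker-style ID is
// still needed (e.g. the logs path that stays on the socket proxy).
function containerHash(prefixedId) {
  return assertPrefixedId(prefixedId).split(":")[1];
}
```

Running this guard at the Matching sub-workflow's output boundary turns a silent "container not found" cascade into an immediate, diagnosable failure.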
### Phase 2: Callback Data Encoding Redesign

**Rationale:** Telegram inline keyboards are the primary UI pattern and must work before enabling any action operations. Can be implemented in parallel with Phase 1 (no dependencies).

**Delivers:** Callback format `action:idx` with lookup table, 64-byte validation, all callback patterns tested

**Addresses:** Callback data size limit pitfall; enables inline keyboard actions

**Avoids:** Silent inline keyboard failures on cutover

**Complexity:** MEDIUM — Requires lookup table design, static data storage strategy, extensive testing
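
The `action:idx` scheme can be sketched as a pair of helpers, assuming the index→PrefixedID lookup table is persisted (e.g. in workflow static data) between keyboard render and button press — function and field names here are illustrative:

```javascript
// Telegram caps callback_data at 64 bytes, so buttons carry a short index
// while the full 129-char PrefixedID lives in a lookup table.
const MAX_CALLBACK_BYTES = 64;

function buildKeyboard(action, containers) {
  const lookup = {};
  const buttons = containers.map((c, idx) => {
    lookup[idx] = c.containerId; // full PrefixedID
    const data = `${action}:${idx}`;
    if (Buffer.byteLength(data, "utf8") > MAX_CALLBACK_BYTES) {
      throw new Error(`callback_data too long: ${data}`);
    }
    return { text: c.name, callback_data: data };
  });
  return { buttons, lookup };
}

function resolveCallback(callbackData, lookup) {
  const [action, idx] = callbackData.split(":");
  const containerId = lookup[idx];
  if (!containerId) throw new Error(`Unknown callback index: ${idx}`);
  return { action, containerId };
}
```

The byte-length check makes the 64-byte limit an explicit invariant rather than a silent Telegram-side rejection.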
### Phase 3: GraphQL Response Normalization

**Rationale:** Establishes data contract stability before modifying sub-workflows. Prevents cascading failures across 60+ Code nodes. Serves as the template for all future GraphQL integrations.

**Delivers:** Normalization Code node template, schema validation, response shape documentation

**Addresses:** Response structure parsing pitfall

**Avoids:** Garbled data, empty container lists, state comparison failures

**Complexity:** MEDIUM — Schema design, field mapping, validation logic
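
One normalization Code node might look like the sketch below, assuming the nested `data.docker.containers` shape and UPPERCASE states noted in the research (individual field names such as `name` are assumptions to verify against the live schema):

```javascript
// Normalize a raw GraphQL response into the flat, lowercase shape the
// downstream Code nodes already expect from the Docker API era.
function normalizeContainers(graphqlResponse) {
  // GraphQL returns HTTP 200 even on failure, so check errors[] first.
  if (graphqlResponse.errors && graphqlResponse.errors.length) {
    throw new Error(`GraphQL error: ${graphqlResponse.errors[0].message}`);
  }
  const containers = graphqlResponse.data?.docker?.containers;
  if (!Array.isArray(containers)) {
    throw new Error("Unexpected shape: data.docker.containers missing");
  }
  return containers.map((c) => ({
    containerId: c.id, // PrefixedID, passed through opaquely
    name: c.name,
    state: String(c.state ?? "").toLowerCase(), // "RUNNING" -> "running"
  }));
}
```

Keeping the output shape identical across all four query types is what lets the 60+ downstream Code nodes stay untouched.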
### Phase 4: Status Query Migration (Simple Read-Only)

**Rationale:** First live API integration with the lowest risk (a read-only query). Proves the normalization layer works and establishes error handling patterns. Status sub-workflow: 3 Docker nodes → 4 GraphQL nodes.

**Delivers:** Container list via GraphQL, status display with Unraid data, error handling validation

**Uses:** Normalization layer from Phase 3, ID translation from Phase 1

**Implements:** Hybrid router (GraphQL for status, Docker proxy still active)

**Addresses:** Table stakes container status feature

**Avoids:** Breaking existing status functionality during migration

**Complexity:** LOW — Single query type, straightforward mapping

**Research flag:** Standard pattern, skip research-phase
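
The HTTP Request node's configuration can be mirrored as a plain request descriptor for offline testing; a sketch assuming the `x-api-key` Header Auth and `/graphql` endpoint from the research notes (the exact query fields are pending schema verification):

```javascript
// Build the POST /graphql request the HTTP Request node will issue.
function buildGraphqlRequest(host, apiKey, query, variables = {}) {
  return {
    url: `${host}/graphql`,
    method: "POST",
    headers: { "content-type": "application/json", "x-api-key": apiKey },
    body: JSON.stringify({ query, variables }),
  };
}

const STATUS_QUERY = `query { docker { containers { id name state } } }`;
```

Building the body in a Code node and feeding it to the HTTP Request node keeps the query string testable independently of network access.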
### Phase 5: Actions Migration (Start/Stop/Restart)

**Rationale:** Proves mutation patterns work before tackling the complex update flow. The restart operation tests sequential mutation chaining (stop + start). Actions sub-workflow: 4 Docker nodes → 5 GraphQL nodes.

**Delivers:** Start/stop/restart via GraphQL mutations, error handling for "already in state" (HTTP 304)

**Uses:** Callback encoding from Phase 2, normalization from Phase 3

**Implements:** Sequential mutation pattern for restart (no native restart mutation)

**Addresses:** Table stakes container actions

**Avoids:** Restart timing issues, state conflict errors

**Complexity:** MEDIUM — Mutation error handling, restart sequencing

**Research flag:** Standard pattern, skip research-phase
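
The restart chain can be expressed as a two-mutation plan plus a tolerant success check; a sketch under the assumption that the mutations are named `stop`/`start` and take a `PrefixedID` argument (confirm via schema introspection before relying on this):

```javascript
// Restart = stop then start, since the schema exposes no restart mutation.
function restartPlan(prefixedId) {
  return ["stop", "start"].map((op) => ({
    query: `mutation ($id: PrefixedID!) { docker { ${op}(id: $id) { id state } } }`,
    variables: { id: prefixedId },
  }));
}

// Treat "already in desired state" (the HTTP 304 case) as success,
// so restarting a stopped container doesn't abort on the stop step.
function isEffectiveSuccess(response) {
  if (!response.errors || response.errors.length === 0) return true;
  return response.errors.every((e) => /already/i.test(e.message));
}
```

Executing the plan sequentially and gating each step on `isEffectiveSuccess` gives the "handle already-in-state as success" behavior the research calls for.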
### Phase 6: Timeout and Latency Hardening

**Rationale:** Must be addressed before the Update workflow (long-running operations). Cloud relay latency causes timeout failures without proper handling. Affects all sub-workflows.

**Delivers:** 3-5x timeout increases, client-side timeout validation, progress indicators, latency monitoring

**Uses:** Findings from Phase 4-5 testing

**Implements:** Progress streaming pattern for long operations

**Addresses:** Sub-workflow timeout propagation pitfall, network resilience

**Avoids:** Silent failures, user confusion on slow operations

**Complexity:** LOW — Configuration changes, monitoring setup

**Research flag:** Implementation pattern testing needed
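
Client-side timeout validation can be a timestamp stamped by the main workflow before Execute Workflow and checked on return; a minimal sketch (the `deadline` field name is illustrative):

```javascript
// Stamp a deadline before handing off to a sub-workflow...
function withDeadline(payload, timeoutMs, now = Date.now()) {
  return { ...payload, deadline: now + timeoutMs };
}

// ...and verify it when the sub-workflow returns, instead of trusting
// Execute Workflow timeout propagation (the known n8n gap).
function deadlineExceeded(payload, now = Date.now()) {
  return now > payload.deadline;
}
```

Injecting `now` as a parameter keeps the check deterministic in tests while defaulting to wall-clock time in the workflow.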
### Phase 7: Update Workflow Migration (Complex Atomic Operation)

**Rationale:** Highest-impact phase — 9 Docker nodes → 2 GraphQL nodes, and it solves the v1.3 update status sync issue. Deferred until patterns are proven in Phases 4-5 and timeouts are hardened in Phase 6.

**Delivers:** Single `updateContainer` mutation, automatic status sync, update workflow simplification (34 → 27 nodes)

**Uses:** All infrastructure from Phases 1-6

**Implements:** Atomic update pattern, major architectural win

**Addresses:** Table stakes update feature, v1.3 pain point resolution

**Avoids:** Multi-step Docker API complexity, manual status sync

**Complexity:** HIGH — Critical operation, thorough testing required

**Research flag:** Monitor for schema changes in updateContainer mutation behavior
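
The atomic update collapses to one request body plus the standardized error check from the architecture section; a sketch with the mutation shape assumed from the research (verify against the live schema before cutover):

```javascript
// Single-mutation update: pull, recreation, and status sync happen server-side.
function updateMutationBody(prefixedId) {
  return JSON.stringify({
    query: `mutation ($id: PrefixedID!) { docker { updateContainer(id: $id) { id state } } }`,
    variables: { id: prefixedId },
  });
}

// GraphQL returns HTTP 200 even on failure, so check errors[] before data.
function parseUpdateResponse(response) {
  if (response.errors && response.errors.length) {
    return { ok: false, error: response.errors.map((e) => e.message).join("; ") };
  }
  const container = response.data?.docker?.updateContainer;
  if (!container) return { ok: false, error: "missing data.docker.updateContainer" };
  return { ok: true, container };
}
```

Returning `{ ok, error }` instead of throwing lets the workflow route failures to the user-facing error message path without aborting the execution.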
### Phase 8: Logs Migration and Hybrid Finalization

**Rationale:** Validates that the logs query works (the schema shows the query exists, but its field structure is untested). Completes the hybrid architecture by locking down the Docker proxy to logs-only.

**Delivers:** Logs via GraphQL (if the query works) OR confirmed Docker proxy retention, proxy reconfiguration (POST=0)

**Uses:** Normalization patterns from Phase 3

**Implements:** Final hybrid architecture state

**Addresses:** Table stakes logs feature

**Avoids:** Breaking logs functionality, accidental Docker proxy usage for control ops

**Complexity:** MEDIUM — Logs query field structure unknown until tested

**Research flag:** Logs query response format needs validation
### Phase 9: Batch Operations Optimization

**Rationale:** Deferred until basic operations are proven. Batch update leverages the native `updateContainers` mutation for performance. Enable only after single-container update is stable.

**Delivers:** Batch update via GraphQL mutation, progress streaming, performance metrics

**Uses:** Update mutation from Phase 7, timeout patterns from Phase 6

**Implements:** GraphQL batch mutation pattern

**Addresses:** Competitive batch update feature

**Avoids:** Timeout issues, linear performance degradation

**Complexity:** MEDIUM — Batch error handling, partial failure scenarios

**Research flag:** Test batch mutation behavior with 10+ containers
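
A sketch of the batch call and a partial-failure summary; the `updateContainers` result shape is an assumption that the Phase 9 testing noted above should confirm:

```javascript
// One mutation for N containers, replacing N sequential update calls.
function batchUpdateBody(prefixedIds) {
  return JSON.stringify({
    query: `mutation ($ids: [PrefixedID!]!) { docker { updateContainers(ids: $ids) { id state } } }`,
    variables: { ids: prefixedIds },
  });
}

// Compare requested vs returned IDs to surface partial failures explicitly.
function summarizeBatch(requestedIds, updatedContainers) {
  const updated = new Set(updatedContainers.map((c) => c.id));
  return {
    succeeded: requestedIds.filter((id) => updated.has(id)),
    failed: requestedIds.filter((id) => !updated.has(id)),
  };
}
```

The summary feeds the progress-streaming message so users see exactly which containers updated and which need a retry.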
### Phase 10: Validation and Cleanup

**Rationale:** Final verification before declaring the migration complete. Remove the Docker socket proxy if the logs query worked; otherwise document the hybrid architecture as permanent.

**Delivers:** Full workflow testing, Docker proxy removal (if possible), architecture docs update

**Addresses:** All migration success criteria

**Complexity:** LOW — Testing and documentation
### Phase Ordering Rationale

**Dependency chain:** ID Translation (Phase 1) → Callback Encoding (Phase 2) → Normalization (Phase 3) → Status (Phase 4) → Actions (Phase 5) → Timeouts (Phase 6) → Update (Phase 7) → Logs (Phase 8) → Batch (Phase 9) → Cleanup (Phase 10)

**Risk mitigation order:** Start with the infrastructure layers that prevent catastrophic failures (ID format, callback limits, response parsing), prove patterns with low-risk read-only operations (status query), establish mutation patterns with simple operations (start/stop), harden for production latency (timeouts), tackle the high-impact complex operation (update), finalize the hybrid architecture (logs), optimize performance (batch), then validate and document.

**Architectural grouping:** Phases 1-3 are pure infrastructure (no API calls), Phases 4-5 prove API integration patterns, Phase 6 hardens for production latency, Phase 7 delivers the main migration value (update simplification + status sync), and Phases 8-10 complete feature parity and optimize.

**Pitfall avoidance mapping:** Each phase addresses 1-2 critical pitfalls from research. Phase 1 prevents the ID mismatch disaster, Phase 2 prevents callback failures, Phase 3 prevents parsing breakage, Phase 6 prevents timeout frustration, Phase 7 proves atomic operations, and Phase 8 locks down the hybrid architecture to prevent dual-write.
### Research Flags

**Phases needing deeper research during planning:**

- **Phase 7 (Update Workflow):** `updateContainer` behavior when the container is already up-to-date is unclear — does it return success immediately or pull the image again? Batch error handling for `updateContainers` is also unknown — if one container fails, do the others continue? Test with a non-critical container first.
- **Phase 8 (Logs):** The DockerContainerLogs GraphQL type's field structure is unknown — timestamp format, stdout/stderr separation, and entry structure all need testing. May require a fallback plan if the query proves unusable.
- **Phase 9 (Batch Operations):** `updateAllContainers` filter behavior is unclear — does it filter by :latest tag or update everything with available updates? Rate limiting impact is unknown — does a batch count as 1 request or N?

**Phases with standard patterns (skip research-phase):**

- **Phases 1-3 (Infrastructure):** Data transformation patterns well-documented, no novel research needed
- **Phase 4 (Status Query):** GraphQL query tested in Phase 14, field mapping straightforward
- **Phase 5 (Actions):** Start/stop mutations tested in STACK.md research, restart pattern clear (sequential stop+start)
- **Phase 6 (Timeouts):** n8n timeout configuration documented, latency monitoring standard practice
- **Phase 10 (Validation):** Testing methodology established, documentation templates exist
## Confidence Assessment
|
||||
|
||||
| Area | Confidence | Notes |
|
||||
|------|------------|-------|
|
||||
| Stack | HIGH | GraphQL API is official Unraid interface, documented in schema and DeepWiki. HTTP Request node is n8n built-in. |
|
||||
| Features | MEDIUM | Core pain point well-understood from community forums. Feature prioritization based on user impact analysis, but batch optimization impact is estimated. |
|
||||
| Architecture | HIGH | Extension of existing sub-workflow pattern. GraphQL integration is standard HTTP Request usage. Rejected alternatives are well-reasoned. |
|
||||
| Pitfalls | MEDIUM | Critical pitfalls identified from community reports and source code analysis. Race condition and version compatibility need UAT validation. |
|
||||
| Stack | HIGH | Unraid GraphQL API tested live on Unraid 7.2 (Phase 14 + STACK.md research). Container operations verified via direct API calls. Logs unavailability confirmed by official docs. Hybrid architecture necessity proven. |
|
||||
| Features | HIGH | Most operations are direct GraphQL equivalents of Docker API patterns (tested). Update simplification validated via schema + live updateContainer mutation testing. Only uncertainty: isUpdateAvailable field existence (documented but may not be in actual schema). |
|
||||
| Architecture | HIGH | 4 of 7 sub-workflows require modification (18 Docker API nodes identified). Normalization layer pattern proven in existing workflows. Container ID format transition validated. Main workflow and 3 sub-workflows confirmed unchanged. |
|
||||
| Pitfalls | MEDIUM | Container ID format mismatch validated via testing (HIGH). Callback data limit is Telegram spec (HIGH). Cloud relay dependency documented by Unraid (HIGH). GraphQL migration patterns sourced from industry best practices (MEDIUM). n8n timeout issue confirmed by GitHub issue (HIGH). Schema evolution patterns are general GraphQL risks (MEDIUM). |
|
||||
|
||||
**Overall confidence:** HIGH
|
||||
|
||||
The GraphQL API approach is well-documented and official. Core functionality (sync after single-container update) is low-risk. Batch optimization and multi-version support add complexity but are defer-able if needed.
|
||||
**Overall confidence:** HIGH for migration feasibility and approach, MEDIUM for execution complexity and edge case handling.
|
||||
|
||||
### Gaps to Address
|
||||
|
||||
Areas where research was inconclusive or needs validation during implementation:
|
||||
**Schema field validation (MEDIUM priority):**
|
||||
- `isUpdateAvailable` field documented in community sources but needs verification against actual Unraid 7.2 schema introspection
|
||||
- DockerContainerLogs field structure completely unknown until tested — may require response format iteration
|
||||
- Resolution: Schema introspection query in Phase 4, field existence checks before use, graceful degradation if fields missing
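The field-existence check above can use standard GraphQL introspection, which is part of the GraphQL spec regardless of Unraid's schema details. A sketch, where the `DockerContainer` type name and the `isUpdateAvailable` field are the unverified assumptions being tested:

```python
# Phase 4 field-existence check via standard GraphQL introspection.
# The type/field names being probed are the assumptions under test.
INTROSPECT_TYPE = """
query TypeFields($name: String!) {
  __type(name: $name) { fields { name } }
}
"""

def has_field(introspection_result: dict, field: str) -> bool:
    """True if the introspected type exposes the given field."""
    type_info = introspection_result.get("data", {}).get("__type") or {}
    return any(f["name"] == field for f in (type_info.get("fields") or []))
```

Graceful degradation then becomes a one-line guard: only include `isUpdateAvailable` in status queries when `has_field(...)` returned true during startup.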

**Update mutation behavior edge cases (MEDIUM priority):**

- updateContainer when already up-to-date: immediate success or redundant pull?
- updateContainers partial failure handling: abort all or continue?
- Resolution: Test with non-critical containers during Phase 7, document behavior, implement appropriate error handling

**Batch operation rate limiting (LOW priority):**

- Does updateContainers(ids) count as 1 API request or N requests against Unraid rate limits?
- What's the practical limit for batch size before timeout?
- Resolution: Test with 20+ containers in Phase 9, monitor for 429 errors, document batch size recommendations
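If large batches do time out or trip rate limits, chunking the ID list is the obvious mitigation. A minimal helper, where the default chunk size of 10 is a placeholder to tune during the Phase 9 tests:

```python
# Chunking sketch for the Phase 9 batch-size question. The size default
# is a placeholder pending UAT measurements.
def chunk_ids(ids: list[str], size: int = 10) -> list[list[str]]:
    """Split container IDs into batches no larger than `size`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```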

**myunraid.net cloud relay reliability (LOW priority):**

- Is internet dependency acceptable for production use?
- Should we implement direct LAN fallback (HTTPS with SSL handling)?
- Resolution: Monitor in production after Phase 4-5, implement fallback in future phase if reliability issues surface

**Logs query fallback strategy (MEDIUM priority):**

- If the GraphQL logs query proves unusable, the hybrid architecture becomes permanent
- Docker socket proxy removal blocked indefinitely
- Resolution: Test logs query early in Phase 8, document hybrid architecture as expected state if logs unavailable
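The hybrid-architecture expectation can be made explicit as a routing decision: prefer the GraphQL logs path when Phase 8 testing shows it is usable, otherwise keep hitting the Docker socket proxy. A sketch with both fetchers injected so the decision logic stays testable — the capability flag would be set once from the Phase 8 results:

```python
# Hybrid log-routing sketch for Phase 8: GraphQL when usable, socket proxy
# otherwise. Both fetchers are injected; no Unraid API shape is assumed.
from typing import Callable, Optional

def fetch_logs(container_id: str,
               graphql_fetch: Optional[Callable[[str], str]],
               proxy_fetch: Callable[[str], str],
               graphql_logs_usable: bool) -> tuple[str, str]:
    """Return (source, logs), preferring GraphQL but falling back to proxy."""
    if graphql_logs_usable and graphql_fetch is not None:
        try:
            return ("graphql", graphql_fetch(container_id))
        except Exception:
            pass  # a degraded GraphQL logs path should not break /logs
    return ("proxy", proxy_fetch(container_id))
```

If the logs query turns out unusable, `graphql_logs_usable` simply stays false and the proxy path remains the documented steady state.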

## Sources

### Primary (HIGH confidence)

- [Unraid API Key Management](https://docs.unraid.net/API/programmatic-api-key-management/) — API key creation, permission scopes
- [Unraid GraphQL Schema](https://raw.githubusercontent.com/unraid/api/main/api/generated-schema.graphql) — Complete API specification, Docker mutations verified
- [Unraid API Documentation](https://docs.unraid.net/API/) — Official API overview, authentication patterns
- [Using the Unraid API](https://docs.unraid.net/API/how-to-use-the-api/) — API key setup, permissions, endpoint URLs
- `.planning/phases/14-unraid-api-access/14-RESEARCH.md` — Phase 14 connectivity research, PrefixedID format verified
- `.planning/phases/14-unraid-api-access/14-VERIFICATION.md` — Live Unraid 7.2 testing, query validation, myunraid.net requirement
- `ARCHITECTURE.md` — Existing workflow structure, sub-workflow contracts, 290-node system analysis
- `CLAUDE.md` — Docker API patterns, n8n conventions, static data limitations, Telegram credential IDs
- `n8n-workflow.json`, `n8n-*.json` — Actual workflow implementations, 18 Docker API nodes identified

### Secondary (MEDIUM confidence)
- [DeepWiki Unraid API](https://deepwiki.com/unraid/api) — Comprehensive technical documentation, DockerService internals
- [DeepWiki Docker Integration](https://deepwiki.com/unraid/api/2.4-docker-integration) — Docker service implementation details, retry logic
- [unraid-api-client by domalab](https://github.com/domalab/unraid-api-client/blob/main/UNRAIDAPI.md) — Python client documenting queries, isUpdateAvailable field source
- [unraid-mcp by jmagar](https://github.com/jmagar/unraid-mcp) — MCP server with Docker management tools, mutation examples
- [GraphQL Migration Patterns](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql) — GitHub's REST to GraphQL migration guide
- [3 GraphQL Pitfalls](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them) — Schema evolution, error handling patterns

### Tertiary (LOW confidence, needs validation)

- [n8n Execute Workflow timeout issue #1572](https://github.com/n8n-io/n8n/issues/1572) — Timeout propagation bug report
- [Telegram Bot API callback data limit](https://core.telegram.org/bots/api#inlinekeyboardbutton) — 64-byte limit specification
- Community forum discussions on Unraid API update status sync — Anecdotal reports, needs testing

---
*Research completed: 2026-02-09*
*Ready for roadmap: yes*