docs: complete v1.4 project research synthesis

2026-02-09 08:08:25 -05:00
parent bb47664188
commit bab819f6c8
5 changed files with 2013 additions and 1453 deletions
@@ -1,250 +1,277 @@
 # Project Research Summary

-**Project:** Unraid Docker Manager v1.3 — Update Status Sync
-**Domain:** Docker container management integration with Unraid server
-**Researched:** 2026-02-08
+**Project:** Unraid Docker Manager v1.4 — Unraid API Native Migration
+**Domain:** Migration from Docker socket proxy to Unraid GraphQL API for native container management
+**Researched:** 2026-02-09
 **Confidence:** HIGH

 ## Executive Summary

-Unraid tracks container update status through an internal state file (`/var/lib/docker/unraid-update-status.json`) that is only updated when updates are initiated through Unraid's WebGUI or GraphQL API. When containers are updated externally via Docker API (as the bot currently does), Unraid's state becomes stale, showing false "update available" badges and sending duplicate notifications. This is the core pain point for v1.3.
+The migration from the Docker socket proxy to Unraid's native GraphQL API is architecturally sound and operationally beneficial, but it requires a hybrid approach because container logs are unavailable through the API. Research confirms that Unraid's GraphQL API provides all required container control operations (start, stop, update) with simpler patterns than Docker's REST API. Container logs, however, are NOT accessible via the Unraid API and must continue to use the Docker socket proxy. The result is a hybrid architecture: Unraid GraphQL for control operations, with the Docker socket proxy retained read-only for log retrieval.

-The recommended approach is to use Unraid's official GraphQL API with the `updateContainer` mutation. This leverages the same mechanism Unraid's WebGUI uses and automatically handles status file synchronization, image digest comparison, and UI refresh. The GraphQL API is available natively in Unraid 7.2+ or via the Connect plugin for earlier versions. This approach is significantly more robust than direct file manipulation, is version-safe, and has proper error handling.
+The recommended approach is phased migration starting with simple operations (status queries, actions) to establish patterns, then tackling the complex update workflow which simplifies from 9 Docker API nodes to 2 GraphQL nodes. The single `updateContainer` mutation atomically handles image pull, container recreation, and critical update status sync, solving v1.3's "apply update" badge persistence issue without manual file writes. Key architectural wins include container ID format (PrefixedID) normalization layers, GraphQL error handling standardization, and response shape transformation to maintain workflow contracts.

-Key risks include race conditions between bot updates and Unraid's periodic update checker, version compatibility across Unraid 6.x-7.x, and ensuring n8n container has proper network access to the Unraid host's GraphQL endpoint. These are mitigated through atomic operations via the GraphQL API, version detection, and using `host.docker.internal` networking with proper container configuration.
+Critical risks center on container ID format mismatches (Docker 64-char IDs vs Unraid 129-char PrefixedIDs), Telegram's 64-byte callback data limit being exceeded by the longer IDs, and the myunraid.net cloud relay's internet dependency, which introduces latency and outage risk. Mitigation requires ID translation layers implemented before any live operations, a callback data encoding redesign, and timeout adjustments for the 200-500ms cloud relay latency. The research identifies 10 critical pitfalls with phase-mapped prevention strategies. The confidence assessment is HIGH for tested operations and MEDIUM for architectural patterns.

 ## Key Findings

 ### Recommended Stack

-**Use Unraid's GraphQL API via n8n's built-in HTTP Request node** — no new dependencies required. The GraphQL API is the official Unraid interface and handles all status synchronization internally through the DockerService and Dynamix Docker Manager integration.
+No new dependencies are required. All infrastructure was established in Phase 14 (v1.3): Unraid GraphQL API connectivity, the myunraid.net cloud relay URL, the n8n Header Auth credential with API key, and the `UNRAID_HOST` environment variable. Research confirms that the hybrid architecture is necessary: the Docker socket proxy must remain deployed, but reconfigured with minimal read-only permissions (CONTAINERS=1, POST=0) for logs access only.

 **Core technologies:**
- **Unraid GraphQL API (7.2+ or Connect plugin)**: Container update status sync — Official API, same mechanism as WebGUI, handles all internal state updates automatically
- **HTTP Request (n8n built-in)**: GraphQL client — Already available, no new dependencies, simple POST request pattern
- **Unraid API Key**: Authentication — Same credential pattern as existing n8n API keys, stored in `.env.unraid-api`
+- **Unraid GraphQL API (7.2+):** Container control operations (list, start, stop, update) — Native integration provides automatic update status sync, structured errors, atomic update mutations
+- **myunraid.net cloud relay:** Unraid API access URL — Avoids direct LAN IP nginx redirect auth stripping, but introduces internet dependency and 200-500ms latency
+- **docker-socket-proxy (reduced scope):** Logs retrieval ONLY — Unraid API explicitly documents logs as NOT accessible via API, must use Docker socket
+- **n8n HTTP Request node:** GraphQL API calls via POST /graphql — Replace Execute Command nodes with structured GraphQL requests, better timeout handling and error parsing

-**Network requirements:**
- n8n container → Unraid host via `http://host.docker.internal/graphql`
- Requires `--add-host=host.docker.internal:host-gateway` in n8n container config
- Alternative: Use Unraid IP directly (`http://tower.local/graphql` or `http://192.168.x.x/graphql`)
-
-**Critical finding from stack research:** Direct file manipulation of `/var/lib/docker/unraid-update-status.json` was investigated but rejected. The file format is undocumented, doesn't trigger UI refresh, and is prone to race conditions. The GraphQL API is the proper abstraction layer.
+**Critical version requirements:**
+- Unraid 7.2+ required for GraphQL API availability
+- n8n HTTP Request node typeVersion 1.2+ for Header Auth credential support

 ### Expected Features

-Based on feature research, the v1.3 scope is tightly focused on solving the core pain point.
+Most operations are drop-in replacements with the same user-facing behavior but a simpler implementation. The update workflow is simplified the most (the 5-step Docker API flow collapses to a single mutation) and gains automatic status sync.

 **Must have (table stakes):**
- Clear "update available" badge after bot updates container — Users expect Unraid UI to reflect reality after external updates
- Prevent duplicate update notifications — After bot updates, Unraid shouldn't send false-positive Telegram notifications
- Automatic sync after every update — Zero user intervention, bot handles sync transparently
+- Container start/stop/restart — GraphQL mutations for start/stop, restart requires chaining stop + start (no native restart mutation)
+- Container status query — GraphQL containers query with UPPERCASE state values, PrefixedID format
+- Container update — Single `updateContainer` mutation replaces 5-step Docker API flow (pull, stop, remove, create, start)
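+- Container logs — A GraphQL logs query exists in the schema, but the official docs state that logs are not accessible via the API; field structure needs testing during implementation, with the Docker socket proxy as the working path until then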
+- Batch operations — Native `updateContainers(ids)` and `updateAllContainers` mutations for multi-container updates

-**Should have (v1.4+):**
- Manual sync command (`/sync`) — Trigger when user updates via other tools (Portainer, CLI, Watchtower)
- Read Unraid update status for better detection — Parse Unraid's view to enhance bot's container selection UI
- Batch status sync — Optimize multi-container update operations
+**Should have (competitive):**
+- Automatic update status sync — Unraid API's `updateContainer` mutation handles internal state sync, eliminates v1.3's manual file write workaround
+- Update detection via `isUpdateAvailable` field — Bot shows what Unraid sees, no digest comparison discrepancies (NOTE: field documented in research but may not exist in actual schema, validate during implementation)
+- Batch update simplification — Native GraphQL batch mutations reduce network calls and latency

 **Defer (v2+):**
- Bidirectional status awareness — Use Unraid's update detection as source of truth instead of Docker digest comparison
- Persistent monitoring daemon — Real-time sync via Docker events (conflicts with n8n's workflow execution model)
- Full Unraid API integration — Authentication, template parsing, web session management (overly complex for cosmetic badge)
-
-**Anti-features identified:**
- Automatic template XML regeneration (breaks container configuration)
- Sync status for ALL containers on every operation (performance impact, unnecessary)
- Full Unraid API integration via authentication/sessions (complexity overkill for status sync)
+- Real-time container stats — `dockerContainerStats` subscription requires WebSocket infrastructure, complex for n8n HTTP Request node
+- Container autostart configuration — `updateAutostartConfiguration` mutation available but not user-requested
+- Port conflict detection — `portConflicts` query useful for debugging but not core workflow
+- Direct LAN fallback — Implement if myunraid.net relay proves unreliable in production, defer until proven necessary

 ### Architecture Approach

-**Extend n8n-update.json sub-workflow with a single HTTP Request node** that calls Unraid's GraphQL `updateContainer` mutation after successful Docker API update. This keeps sync tightly coupled to update operations, requires minimal architectural changes, and maintains single responsibility (update sub-workflow owns all update-related actions).
+Migration affects 4 of 7 sub-workflows (Update, Actions, Status, Logs), totaling 18 Docker API nodes replaced with GraphQL HTTP Request nodes plus normalization layers. Three sub-workflows (Matching, Batch UI, Confirmation) remain unchanged because they operate on data contracts, not API sources. The Update sub-workflow sees the largest impact: 34 nodes shrink to ~27 nodes by replacing the 9-node Docker API flow with 1-2 GraphQL nodes.

 **Major components:**
-1. **Clear Unraid Status (NEW node)** — HTTP Request to GraphQL API after "Remove Old Image (Success)", calls `mutation { docker { updateContainer(id: "docker:containerName") } }`
-2. **n8n container configuration (modified)** — Add `--add-host=host.docker.internal:host-gateway` for network access to Unraid host
-3. **Unraid API credential (NEW)** — API key stored in `.env.unraid-api`, loaded via n8n credential system, requires `DOCKER:UPDATE_ANY` permission

-**Data flow:**
-```
-Update sub-workflow success → Extract container name → HTTP Request to GraphQL
-→ Unraid's DockerService executes syncVersions() → Update status file written
-→ Return to sub-workflow → Success response to user
-```
+1. **GraphQL Response Normalization Layer** — Code nodes after every GraphQL query to transform Unraid response shape (nested `data.docker.containers`) and field formats (UPPERCASE state, PrefixedID) to match workflow contracts. Prevents cascading failures across 60+ Code nodes in main workflow that expect Docker API shape.

-**Rejected alternatives:**
- **Direct file write** — Brittle, undocumented format, no UI refresh, race conditions
- **New sub-workflow (n8n-unraid-sync.json)** — Overkill for single operation
- **Main workflow integration** — Adds latency, harder to test
- **Helper script on host** — Unnecessary complexity when GraphQL API exists
+2. **Container ID Translation Layer** — Matching sub-workflow outputs Unraid PrefixedID format (129 chars: `{server_hash}:{container_hash}`) instead of the Docker container ID (64-char hex). All Execute Workflow input preparation nodes pass an opaque containerId token; the value changes but the field name and contract stay stable.
+
+3. **Callback Data Encoding Redesign** — Telegram 64-byte callback limit broken by PrefixedID length. Implement ID shortening with lookup table or base62 hash mapping. Update ALL callback formats from `action:containerID` to `action:idx` with static data lookup.
+
+4. **GraphQL Error Handling Pattern** — Standardized validation: check `response.errors[]` array first (GraphQL returns HTTP 200 even for errors), parse structured error messages, handle HTTP 304 "already in state" as success case, validate `response.data` structure before accessing fields.
+
+5. **Hybrid API Router** — Sub-workflows route control operations to Unraid GraphQL (start, stop, update, status), logs operations to Docker socket proxy. Docker proxy reconfigured read-only (POST=0) to prevent accidental dual-write.
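
The error handling pattern in component 4 can be sketched as a single check that a Code node might run on every GraphQL response. This is a minimal sketch under stated assumptions: the function name `checkGraphQLResponse`, the `Outcome` type, and the idea that the HTTP status code is available alongside the parsed body are all hypothetical and not taken from the existing workflows.

```typescript
// Hypothetical shapes; the real response fields must be confirmed against the Unraid API.
interface GraphQLError {
  message: string;
}

interface GraphQLResponse<T> {
  data?: T | null;
  errors?: GraphQLError[];
}

type Outcome<T> =
  | { ok: true; alreadyInState: boolean; data: T | null }
  | { ok: false; reason: string };

function checkGraphQLResponse<T>(
  statusCode: number,
  body: GraphQLResponse<T> | null | undefined,
): Outcome<T> {
  // HTTP 304 is treated as "already in the requested state", which counts as success.
  if (statusCode === 304) {
    return { ok: true, alreadyInState: true, data: null };
  }
  if (!body) {
    return { ok: false, reason: "empty response body" };
  }
  // Check errors[] first: GraphQL can answer HTTP 200 and still report errors.
  if (Array.isArray(body.errors) && body.errors.length > 0) {
    return { ok: false, reason: body.errors.map((e) => e.message).join("; ") };
  }
  // Only after that, validate that data exists before any field access.
  if (body.data === undefined || body.data === null) {
    return { ok: false, reason: "response has no data field" };
  }
  return { ok: true, alreadyInState: false, data: body.data };
}
```

Returning a tagged result instead of throwing lets each sub-workflow decide whether a failure should stop the run or be reported back to the user.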
+
+**Key patterns to follow:**
+- One normalization Code node per GraphQL query response (Status, Actions, Update, Logs)
+- Explicit timeout configuration on every HTTP Request node (30-60 seconds for mutations, account for cloud relay latency)
+- Client-side timeout validation in main workflow (timestamp checks, don't rely on Execute Workflow timeout propagation)
+- Fresh state query immediately before action execution to avoid race conditions (200-500ms latency creates stale state window)

 ### Critical Pitfalls

-Research identified 7 critical pitfalls with phase-specific prevention strategies:
+**Top 5 pitfalls with prevention strategies:**

-1. **State Desync Between Docker API and Unraid's Internal Tracking** — GraphQL API solves this by using Unraid's official update mechanism instead of direct file manipulation. Validation: Update via bot, verify Unraid shows "up-to-date" after manual "Check for Updates"
+1. **Container ID Format Mismatch Breaking All Operations** — Docker 64-char hex vs Unraid 129-char PrefixedID. Passing wrong format causes all operations to fail with "container not found." Prevention: Implement ID validation regex `^[a-f0-9]{64}:[a-f0-9]{64}$` BEFORE any live operations, update ALL 17 Execute Workflow input nodes, test with containers having similar names but different IDs. Address in Phase 1.

-2. **Race Condition Between Unraid's Periodic Update Check and Bot Sync** — GraphQL `updateContainer` mutation is idempotent and atomic. Even if Unraid's update checker runs concurrently, no file corruption or lost updates occur. The API handles synchronization internally.
+2. **Telegram Callback Data 64-Byte Limit Exceeded** — Callback format `stop:8a9907a24576` fit with Docker IDs, but `stop:{129-char-PrefixedID}` exceeds the limit, causing silent inline keyboard failures. Prevention: Redesign callback encoding to `action:idx` with a PrefixedID lookup table or a hash to 8-char base62, and test ALL callback patterns. Address in Phase 2.

-3. **Unraid Version Compatibility** — GraphQL API is stable across Unraid 6.9+ (via Connect plugin) and 7.2+ (native). Version detection should check `/etc/unraid-version` and verify API availability before sync. If GraphQL unavailable, log error and skip sync (preserve bot functionality).
+3. **myunraid.net Cloud Relay Internet Dependency** — Bot becomes non-functional during internet outages despite LAN connectivity. Latency increases from sub-10ms (Docker socket) to 200-500ms (cloud relay). Prevention: Add network connectivity pre-flight checks, implement degraded mode messaging, monitor relay latency as first-class metric, document internet dependency in error messages. Address in Phase 2.

-4. **Docker Socket Proxy Blocks Filesystem Access** — Resolved by using GraphQL API instead of direct file access. n8n only needs HTTP access to Unraid host, not filesystem mounts. This preserves security boundaries.
+4. **GraphQL Response Structure Normalization Missing** — Field name changes (State→state, UPPERCASE values), nested response structure (`data.docker.containers`), missing normalization causes parsing failures across 60 Code nodes. Prevention: Build normalization layer BEFORE touching sub-workflows, add schema validation, test response parsing independently. Address in Phase 3.

-5. **Notification Spam During Batch Updates** — Batch updates should collect all container names and call `updateContainers` (plural) mutation once after batch completes, not per-container. This triggers single status recalculation instead of N notifications.
+5. **Sub-Workflow Timeout Errors Lost in Propagation** — Known n8n issue where Execute Workflow node ignores sub-workflow timeouts. Cloud relay latency causes operations that completed in 10-30s to take 60-120s. Prevention: Increase ALL sub-workflow timeouts by 3-5x, implement client-side timeout in main workflow, add progress indicators, configure HTTP Request timeouts explicitly. Address in Phase 6.

-6. **n8n Workflow State Doesn't Persist** — Sync happens within same execution (no cross-execution state needed). Batch updates already collect results in single execution, sync after loop completes. No reliance on static data.
-
-7. **Unraid's br0 Network Recreate** — Not directly related to status sync, but update flow must preserve network config. Current n8n-update.json uses Docker API recreate — verify network settings are copied from old container inspect to new create body.
+**Additional critical pitfalls:**
+- **Credential Rotation Kills Bot Mid-Operation** — Dual credential storage (`.env.unraid-api` + n8n Header Auth) falls out of sync, 401 errors with no detection. Prevention: Consolidate to n8n credential only, implement 401 error user-friendly messaging.
+- **Race Condition Between Query and Action** — 200-500ms latency creates stale state window, container changes between query and action execution. Prevention: Fresh state query before action, handle "already in state" as success.
+- **Dual-Write Period Data Inconsistency** — Phased migration creates split-brain between Docker and Unraid APIs. Prevention: Short cutover window (hours not days), single source of truth per operation.
+- **Batch Performance Degradation** — Sequential operations multiply cloud relay latency (10 containers = 10x slower). Prevention: GraphQL batching for reads, parallel processing where safe, progress streaming.
+- **GraphQL Schema Changes Silent Breakage** — Unraid API evolves, field additions/deprecations break queries without warning. Prevention: Schema introspection checks on startup, field existence validation before use.

 ## Implications for Roadmap

-Based on combined research, v1.3 should be structured as 3 focused phases:
+Based on research, suggested phase structure follows risk mitigation order: infrastructure layers first, simple operations to prove patterns, complex update workflow last when patterns validated.

-### Phase 1: API Setup & Network Access
-**Rationale:** Validate the GraphQL approach before n8n integration. Infrastructure changes first, workflow changes second.
+### Phase 1: Container ID Translation Layer
+**Rationale:** ID format mismatch is catastrophic failure point — must be solid before any live API calls. All sub-workflows depend on container identification working correctly.
+**Delivers:** PrefixedID validation, Matching sub-workflow outputs Unraid IDs, ID format documentation
+**Addresses:** Container ID format mismatch pitfall (critical)
+**Avoids:** All operations failing with "container not found" on cutover
+**Complexity:** LOW — Pure data transformation, no API calls
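
The PrefixedID validation this phase delivers could look roughly like the following. The regex is the one given in the pitfalls list; the names `classifyContainerId` and `requirePrefixedId` are hypothetical, and the sketch assumes lowercase hex on both sides of the colon, which should be confirmed against real API output.

```typescript
// Pattern from the summary: 64 hex chars, a colon, 64 hex chars (129 chars total).
const PREFIXED_ID = /^[a-f0-9]{64}:[a-f0-9]{64}$/;
// A full Docker container ID: 64 hex chars.
const DOCKER_FULL_ID = /^[a-f0-9]{64}$/;

type IdKind = "unraid-prefixed" | "docker-full" | "unknown";

function classifyContainerId(id: string): IdKind {
  if (PREFIXED_ID.test(id)) return "unraid-prefixed";
  if (DOCKER_FULL_ID.test(id)) return "docker-full";
  return "unknown";
}

// Guard for use before any live GraphQL call: fail fast with a clear message
// rather than letting the API answer "container not found".
function requirePrefixedId(id: string): string {
  const kind = classifyContainerId(id);
  if (kind !== "unraid-prefixed") {
    throw new Error(`Expected Unraid PrefixedID, got ${kind} ID (${id.length} chars)`);
  }
  return id;
}
```

Classifying the ID rather than only rejecting it makes the error message say which format leaked through, which helps when hunting down an Execute Workflow input node that was missed.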

-**Delivers:**
- Unraid API key created with `DOCKER:UPDATE_ANY` permission
- Network access verified from n8n container to Unraid host
- Container ID format documented (via GraphQL query test)
+### Phase 2: Callback Data Encoding Redesign
+**Rationale:** Telegram inline keyboards are primary UI pattern. Must work before enabling any action operations. Can implement in parallel with Phase 1 (no dependencies).
+**Delivers:** Callback format `action:idx` with lookup table, 64-byte validation, all callback patterns tested
+**Addresses:** Callback data size limit pitfall, enables inline keyboard actions
+**Avoids:** Silent inline keyboard failures on cutover
+**Complexity:** MEDIUM — Requires lookup table design, static data storage strategy, extensive testing
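
One way to picture the `action:idx` format is a small lookup table that hands out an index per PrefixedID and checks the 64-byte limit on every encode. This is a sketch, not the planned implementation: the `CallbackIndex` class and `utf8ByteLength` helper are hypothetical, and it assumes the table would be persisted somewhere such as n8n workflow static data, which is not shown here.

```typescript
// Telegram's callback_data limit, as stated in the summary.
const MAX_CALLBACK_BYTES = 64;

// UTF-8 byte length computed by hand so the sketch needs no runtime globals.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

class CallbackIndex {
  private ids: string[] = [];

  // Reuses the index of a known ID, otherwise appends the ID to the table.
  private indexFor(prefixedId: string): number {
    const existing = this.ids.indexOf(prefixedId);
    if (existing !== -1) return existing;
    this.ids.push(prefixedId);
    return this.ids.length - 1;
  }

  encode(action: string, prefixedId: string): string {
    const data = `${action}:${this.indexFor(prefixedId)}`;
    if (utf8ByteLength(data) > MAX_CALLBACK_BYTES) {
      throw new Error(`callback data exceeds ${MAX_CALLBACK_BYTES} bytes: ${data}`);
    }
    return data;
  }

  decode(data: string): { action: string; containerId: string } {
    const match = /^([^:]+):(\d+)$/.exec(data);
    if (!match) throw new Error(`malformed callback data: ${data}`);
    const containerId = this.ids[Number(match[2])];
    if (containerId === undefined) throw new Error(`unknown callback index: ${match[2]}`);
    return { action: match[1], containerId };
  }
}
```

An index keeps callbacks short regardless of ID length, at the cost of the table needing to outlive the message that carries the button; a stale index after a table reset is the failure mode to test for.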

-**Tasks:**
- Create API key via `unraid-api apikey --create` CLI command
- Store in `.env.unraid-api` (gitignored)
- Add `--add-host=host.docker.internal:host-gateway` to n8n container config
- Test GraphQL query from command line: `curl -X POST http://host.docker.internal/graphql -H "x-api-key: ..." -d '{"query": "{ docker { containers { id name } } }"}'`
- Verify container ID format returned (likely `docker:<name>`)
+### Phase 3: GraphQL Response Normalization
+**Rationale:** Establishes data contract stability before modifying sub-workflows. Prevents cascading failures across 60+ Code nodes. Template for all future GraphQL integrations.
+**Delivers:** Normalization Code node template, schema validation, response shape documentation
+**Addresses:** Response structure parsing pitfall
+**Avoids:** Garbled data, empty container lists, state comparison failures
+**Complexity:** MEDIUM — Schema design, field mapping, validation logic
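
A normalization Code node along these lines would sit directly after the GraphQL query. The sketch assumes the Unraid container fields are `id`, `name`, and `state`, and that downstream nodes expect a Docker-style `Id` / `Names` / `State` shape with lowercase state values; both sides of that mapping are assumptions to verify against the live schema and the existing workflow contracts.

```typescript
// Assumed Unraid GraphQL shape (nested under data.docker.containers).
interface UnraidContainer {
  id: string; // PrefixedID
  name: string;
  state: string; // UPPERCASE, e.g. "RUNNING"
}

interface UnraidContainersResponse {
  data?: { docker?: { containers?: UnraidContainer[] } } | null;
}

// Assumed workflow contract, modelled on the Docker list-containers response.
interface DockerStyleContainer {
  Id: string;
  Names: string[];
  State: string; // lowercase, e.g. "running"
}

function normalizeContainers(response: UnraidContainersResponse): DockerStyleContainer[] {
  const data = response.data;
  const containers = data && data.docker ? data.docker.containers : undefined;
  // Schema validation: fail loudly instead of passing an empty list downstream.
  if (!Array.isArray(containers)) {
    throw new Error("Unexpected response shape: data.docker.containers is missing");
  }
  return containers.map((c) => ({
    Id: c.id,
    // Docker's list endpoint prefixes names with "/"; keep that convention.
    Names: [c.name.charAt(0) === "/" ? c.name : `/${c.name}`],
    State: c.state.toLowerCase(),
  }));
}
```

Throwing on a missing `containers` array distinguishes "API shape changed" from "no containers exist", which an empty-list default would hide.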

-**Avoids:**
- Pitfall 4 (filesystem access issues) — uses API not file mounts
- Pitfall 3 (version compatibility) — validates API availability before implementation
+### Phase 4: Status Query Migration (Simple Read-Only)
+**Rationale:** First live API integration with lowest risk (read-only query). Proves normalization layer works, establishes error handling patterns. Status sub-workflow = 3 Docker nodes → 4 GraphQL nodes.
+**Delivers:** Container list via GraphQL, status display with Unraid data, error handling validation
+**Uses:** Normalization layer from Phase 3, ID translation from Phase 1
+**Implements:** Hybrid router (GraphQL for status, Docker proxy still active)
+**Addresses:** Table stakes container status feature
+**Avoids:** Breaking existing status functionality during migration
+**Complexity:** LOW — Single query type, straightforward mapping
+**Research flag:** Standard pattern, skip research-phase

-**Research flag:** NEEDS DEEPER RESEARCH — Network access testing, container ID format verification, API permission validation
+### Phase 5: Actions Migration (Start/Stop/Restart)
+**Rationale:** Proves mutation patterns work before tackling complex update flow. Restart operation tests sequential mutation chaining (stop + start). Actions sub-workflow = 4 Docker nodes → 5 GraphQL nodes.
+**Delivers:** Start/stop/restart via GraphQL mutations, error handling for "already in state" (HTTP 304)
+**Uses:** Callback encoding from Phase 2, normalization from Phase 3
+**Implements:** Sequential mutation pattern for restart (no native restart mutation)
+**Addresses:** Table stakes container actions
+**Avoids:** Restart timing issues, state conflict errors
+**Complexity:** MEDIUM — Mutation error handling, restart sequencing
+**Research flag:** Standard pattern, skip research-phase

-### Phase 2: n8n Workflow Integration
-**Rationale:** Once API access is proven, integrate into update sub-workflow. Single node addition is minimal risk.
+### Phase 6: Timeout and Latency Hardening
+**Rationale:** Must address before Update workflow (long-running operations). Cloud relay latency causes timeout failures without proper handling. Affects all sub-workflows.
+**Delivers:** 3-5x timeout increases, client-side timeout validation, progress indicators, latency monitoring
+**Uses:** Findings from Phase 4-5 testing
+**Implements:** Progress streaming pattern for long operations
+**Addresses:** Sub-workflow timeout propagation pitfall, network resilience
+**Avoids:** Silent failures, user confusion on slow operations
+**Complexity:** LOW — Configuration changes, monitoring setup
+**Research flag:** Implementation pattern testing needed
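
The two timeout measures in this phase reduce to simple arithmetic, sketched below. The helper names are hypothetical, and the 30-second base value in the usage note is only an illustration drawn from the 30-60 second range mentioned for mutations.

```typescript
// Scale an existing timeout by a multiplier (the summary suggests 3-5x).
function scaledTimeoutMs(baseMs: number, factor: number): number {
  if (baseMs <= 0 || factor < 1) {
    throw new RangeError("baseMs must be positive and factor at least 1");
  }
  return Math.round(baseMs * factor);
}

// Client-side deadline check for the main workflow: compare timestamps directly
// instead of relying on Execute Workflow to propagate a sub-workflow timeout.
function hasTimedOut(startedAtMs: number, nowMs: number, timeoutMs: number): boolean {
  return nowMs - startedAtMs > timeoutMs;
}
```

For example, `scaledTimeoutMs(30000, 3)` gives a 90-second budget, and the main workflow would call `hasTimedOut(startedAt, Date.now(), 90000)` after the sub-workflow returns or at each progress update.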

-**Delivers:**
- n8n-update.json calls GraphQL `updateContainer` after successful Docker API update
- Error handling for API failures (continue on error, log for debugging)
- Single-container updates sync automatically
+### Phase 7: Update Workflow Migration (Complex Atomic Operation)
+**Rationale:** Highest impact phase — 9 Docker nodes → 2 GraphQL nodes, solves v1.3 update status sync issue. Deferred until patterns proven in Phase 4-5 and timeouts hardened in Phase 6.
+**Delivers:** Single `updateContainer` mutation, automatic status sync, update workflow simplification (34 → 27 nodes)
+**Uses:** All infrastructure from Phase 1-6
+**Implements:** Atomic update pattern, major architectural win
+**Addresses:** Table stakes update feature, v1.3 pain point resolution
+**Avoids:** Multi-step Docker API complexity, manual status sync
+**Complexity:** HIGH — Critical operation, thorough testing required
+**Research flag:** Monitor for schema changes in updateContainer mutation behavior

-**Tasks:**
- Add HTTP Request node to n8n-update.json after "Remove Old Image (Success)"
- Configure GraphQL mutation: `mutation { docker { updateContainer(id: "docker:{{$json.containerName}}") { id name state } } }`
- Set authentication: Header Auth with `x-api-key: {{$env.UNRAID_API_KEY}}`
- Error handling: `continueRegularOutput` (don't fail update if sync fails)
- Connect to "Return Success" node
- Test with single container update
+### Phase 8: Logs Migration and Hybrid Finalization
+**Rationale:** Validates logs query works (schema shows query exists but field structure untested). Completes hybrid architecture by locking down Docker proxy to logs-only.
+**Delivers:** Logs via GraphQL (if query works) OR confirm Docker proxy retention, proxy reconfiguration (POST=0)
+**Uses:** Normalization patterns from Phase 3
+**Implements:** Final hybrid architecture state
+**Addresses:** Table stakes logs feature
+**Avoids:** Breaking logs functionality, accidental Docker proxy usage for control ops
+**Complexity:** MEDIUM — Logs query field structure unknown until tested
+**Research flag:** Logs query response format needs validation

-**Uses:**
- Unraid GraphQL API from Phase 1
- n8n HTTP Request node (built-in)
+### Phase 9: Batch Operations Optimization
+**Rationale:** Deferred until basic operations proven. Batch update leverages native `updateContainers` mutation for performance. Only enable after single-container update stable.
+**Delivers:** Batch update via GraphQL mutation, progress streaming, performance metrics
+**Uses:** Update mutation from Phase 7, timeout patterns from Phase 6
+**Implements:** GraphQL batch mutation pattern
+**Addresses:** Competitive batch update feature
+**Avoids:** Timeout issues, linear performance degradation
+**Complexity:** MEDIUM — Batch error handling, partial failure scenarios
+**Research flag:** Test batch mutation behavior with 10+ containers

-**Implements:**
- Clear Unraid Status component from architecture research
- Post-Action Sync pattern (sync after primary operation completes)
-
-**Avoids:**
- Pitfall 1 (state desync) — uses official sync mechanism
- Pitfall 2 (race conditions) — GraphQL API handles atomicity
-
-**Research flag:** STANDARD PATTERNS — HTTP Request node usage well-documented, unlikely to need additional research
-
-### Phase 3: Batch Optimization & UAT
-**Rationale:** After core functionality works, optimize for batch operations and validate across scenarios.
-
-**Delivers:**
- Batch updates use `updateContainers` (plural) mutation for efficiency
- Validation across Unraid versions (6.12, 7.0, 7.2)
- Confirmation that network config preservation works
-
-**Tasks:**
- Modify batch update flow to collect all updated container IDs
- Call `updateContainers(ids: ["docker:container1", "docker:container2"])` once after batch loop completes
- Test on Unraid 6.12 (with Connect plugin) and 7.2 (native API)
- Verify no notification spam during batch updates (5+ containers)
- Test container on custom network (`br0`) — verify DNS resolution after update
- Document manual sync option in README (if GraphQL sync fails, user clicks "Check for Updates")
-
-**Avoids:**
- Pitfall 5 (notification spam) — batch mutation prevents per-container notifications
- Pitfall 3 (version compatibility) — multi-version UAT catches breaking changes
- Pitfall 7 (network issues) — UAT includes network-dependent scenarios
-
-**Research flag:** STANDARD PATTERNS — Batch optimization is iteration of Phase 2 pattern
+### Phase 10: Validation and Cleanup
+**Rationale:** Final verification before declaring migration complete. Remove Docker socket proxy if logs query worked, otherwise document hybrid architecture as permanent.
+**Delivers:** Full workflow testing, Docker proxy removal (if possible), architecture docs update
+**Addresses:** All migration success criteria
+**Complexity:** LOW — Testing and documentation

 ### Phase Ordering Rationale

- **Infrastructure before integration** — Phase 1 validates GraphQL API access before modifying workflows. If network access fails, can pivot to alternative (e.g., helper script) without workflow rework.
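+**Dependency chain:** ID Translation (Phase 1) → Callback Encoding (Phase 2) → Normalization (Phase 3) → Status (Phase 4) → Actions (Phase 5) → Timeouts (Phase 6) → Update (Phase 7) → Logs (Phase 8) → Batch (Phase 9) → Cleanup (Phase 10). Phases 1 and 2 have no dependency on each other and can be implemented in parallel; every later phase depends on the ones before it.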

- **Single-container before batch** — Phase 2 proves core sync mechanism with simplest case. Batch optimization (Phase 3) is safe iteration once foundation works.
+**Risk mitigation order:** Start with infrastructure layers that prevent catastrophic failures (ID format, callback limits, response parsing), prove patterns with low-risk read-only operations (status query), establish mutation patterns with simple operations (start/stop), harden for production (timeouts/latency), tackle high-impact complex operation (update), finalize hybrid architecture (logs), optimize performance (batch), validate and document.

- **Validation last** — Phase 3 UAT happens after implementation complete. Testing earlier wastes time if implementation changes.
+**Architectural grouping:** Phases 1-3 are pure infrastructure (no API calls), Phases 4-5 prove API integration patterns, Phase 6 hardens for production latency, Phase 7 delivers main migration value (update simplification + status sync), Phases 8-10 complete feature parity and optimize.

-**Dependencies discovered:**
- Phase 2 depends on Phase 1 (API access must work)
- Phase 3 depends on Phase 2 (batch builds on single-container pattern)
- No parallelization opportunities (sequential phases)
+**Pitfall avoidance mapping:** Each phase addresses 1-2 critical pitfalls from research. Phase 1 prevents ID mismatch disaster, Phase 2 prevents callback failures, Phase 3 prevents parsing breakage, Phase 6 prevents timeout frustration, Phase 7 proves atomic operations, Phase 8 locks down hybrid architecture to prevent dual-write.

 ### Research Flags

-Phases likely needing deeper research during planning:
+**Phases needing deeper research during planning:**
+- **Phase 7 (Update Workflow):** updateContainer mutation behavior when already up-to-date unclear — does it return success immediately or pull image again? Batch error handling for updateContainers unknown — if one fails, do others continue? Test with non-critical container first.
+- **Phase 8 (Logs):** DockerContainerLogs GraphQL type field structure unknown — timestamp format, stdout/stderr separation, entry structure all need testing. May require fallback plan if query unusable.
+- **Phase 9 (Batch Operations):** updateAllContainers filter behavior unclear — does it filter by :latest tag or update everything with available updates? Rate limiting impact unknown — does batch count as 1 request or N?

- **Phase 1:** Network access testing, API permission verification — Some unknowns around container ID format and `host.docker.internal` behavior in Unraid's Docker implementation. Low risk (fallback to IP-based access), but needs validation.
-
-Phases with standard patterns (skip research-phase):
-
- **Phase 2:** HTTP Request node integration — Well-documented n8n pattern, GraphQL mutation structure is simple
- **Phase 3:** Batch optimization and UAT — Iteration of Phase 2, no new concepts
+**Phases with standard patterns (skip research-phase):**
+- **Phase 1-3 (Infrastructure):** Data transformation patterns well-documented, no novel research needed
+- **Phase 4 (Status Query):** GraphQL query tested in Phase 14, field mapping straightforward
+- **Phase 5 (Actions):** Start/stop mutations tested in STACK.md research, restart pattern clear (sequential stop+start)
+- **Phase 6 (Timeouts):** n8n timeout configuration documented, latency monitoring standard practice
+- **Phase 10 (Validation):** Testing methodology established, documentation templates exist

 ## Confidence Assessment

 | Area | Confidence | Notes |
 |------|------------|-------|
-| Stack | HIGH | GraphQL API is official Unraid interface, documented in schema and DeepWiki. HTTP Request node is n8n built-in. |
-| Features | MEDIUM | Core pain point well-understood from community forums. Feature prioritization based on user impact analysis, but batch optimization impact is estimated. |
-| Architecture | HIGH | Extension of existing sub-workflow pattern. GraphQL integration is standard HTTP Request usage. Rejected alternatives are well-reasoned. |
-| Pitfalls | MEDIUM | Critical pitfalls identified from community reports and source code analysis. Race condition and version compatibility need UAT validation. |
+| Stack | HIGH | Unraid GraphQL API tested live on Unraid 7.2 (Phase 14 + STACK.md research). Container operations verified via direct API calls. Logs unavailability confirmed by official docs. Hybrid architecture necessity proven. |
+| Features | HIGH | Most operations are direct GraphQL equivalents of Docker API patterns (tested). Update simplification validated via schema + live updateContainer mutation testing. Only uncertainty: isUpdateAvailable field existence (documented but may not be in actual schema). |
+| Architecture | HIGH | 4 of 7 sub-workflows require modification (18 Docker API nodes identified). Normalization layer pattern proven in existing workflows. Container ID format transition validated. Main workflow and 3 sub-workflows confirmed unchanged. |
+| Pitfalls | MEDIUM | Container ID format mismatch validated via testing (HIGH). Callback data limit is Telegram spec (HIGH). Cloud relay dependency documented by Unraid (HIGH). GraphQL migration patterns sourced from industry best practices (MEDIUM). n8n timeout issue confirmed by GitHub issue (HIGH). Schema evolution patterns are general GraphQL risks (MEDIUM). |

-**Overall confidence:** HIGH
-
-The GraphQL API approach is well-documented and official. Core functionality (sync after single-container update) is low-risk. Batch optimization and multi-version support add complexity but are defer-able if needed.
+**Overall confidence:** HIGH for migration feasibility and approach, MEDIUM for execution complexity and edge case handling.

 ### Gaps to Address

-Areas where research was inconclusive or needs validation during implementation:
+**Schema field validation (MEDIUM priority):**
+- `isUpdateAvailable` field documented in community sources but needs verification against actual Unraid 7.2 schema introspection
+- DockerContainerLogs field structure completely unknown until tested — may require response format iteration
+- Resolution: Schema introspection query in Phase 4, field existence checks before use, graceful degradation if fields missing
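
The field-existence check described in the resolution above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it relies only on the standard GraphQL `__type` introspection query, and the helper names (`build_introspection_payload`, `available_fields`, `select_fields`), the type name `DockerContainer`, and the canned response are assumptions made for the example. The actual HTTP call (for instance from an n8n HTTP Request node) is left out.

```python
import json

# Standard GraphQL introspection query; the type name is passed as a variable.
INTROSPECTION_QUERY = """
query TypeFields($typeName: String!) {
  __type(name: $typeName) {
    fields { name }
  }
}
"""


def build_introspection_payload(type_name: str) -> str:
    """Return the JSON body to POST to the GraphQL endpoint."""
    return json.dumps(
        {"query": INTROSPECTION_QUERY, "variables": {"typeName": type_name}}
    )


def available_fields(response: dict) -> set:
    """Extract field names from an introspection response.

    Returns an empty set when the type is unknown (``__type`` is null).
    """
    type_info = (response.get("data") or {}).get("__type") or {}
    return {field["name"] for field in (type_info.get("fields") or [])}


def select_fields(wanted: list, response: dict) -> list:
    """Keep only the wanted fields the schema exposes (graceful degradation)."""
    present = available_fields(response)
    return [name for name in wanted if name in present]


# Canned response standing in for a live server reply (hypothetical field set).
sample = {
    "data": {"__type": {"fields": [{"name": "id"}, {"name": "names"}, {"name": "state"}]}}
}
# "DockerContainer" is an assumed type name; confirm it against the real schema.
payload = build_introspection_payload("DockerContainer")
selected = select_fields(["id", "state", "isUpdateAvailable"], sample)
print(selected)
```

With the canned response, `isUpdateAvailable` is dropped from the selection, so a status query built from `selected` would still succeed on a schema that lacks the field.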

- **Container ID format** — GraphQL schema shows `id: ID!` but exact format (`docker:<name>` vs just `<name>`) needs testing. Resolution: Phase 1 queries containers to get actual ID format.
+**Update mutation behavior edge cases (MEDIUM priority):**
+- updateContainer when already up-to-date: immediate success or redundant pull?
+- updateContainers partial failure handling: abort all or continue?
+- Resolution: Test with non-critical containers during Phase 7, document behavior, implement appropriate error handling
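
Until the partial-failure semantics of `updateContainers` are known, one conservative option is to update containers one at a time and collect failures on the client side. The sketch below assumes a hypothetical `update_one` callable that wraps a single `updateContainer` call and raises on failure; the container IDs and the fake updater are illustrative only and do not reflect real Unraid API behavior.

```python
from typing import Callable, Dict, List, Tuple


def update_each(
    ids: List[str], update_one: Callable[[str], None]
) -> Tuple[List[str], Dict[str, str]]:
    """Update containers sequentially, continuing after individual failures.

    Returns (succeeded_ids, {failed_id: error_message}).
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for container_id in ids:
        try:
            update_one(container_id)
            succeeded.append(container_id)
        except Exception as exc:  # report every failure instead of aborting
            failed[container_id] = str(exc)
    return succeeded, failed


def fake_update(container_id: str) -> None:
    """Stand-in for a real updateContainer call (hypothetical behavior)."""
    if container_id == "b":
        raise RuntimeError("image pull failed")


ok, errors = update_each(["a", "b", "c"], fake_update)
print(ok, errors)
```

This trades batch speed for predictable behavior: a single failed update never blocks the remaining containers, and the per-container error map can be reported back to the user.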

- **API rate limiting** — Not documented. Impact: LOW (bot updates are infrequent, <10/min even in batch). Resolution: Monitor during UAT, no preemptive handling.
+**Batch operation rate limiting (LOW priority):**
+- Does updateContainers(ids) count as 1 API request or N requests against Unraid rate limits?
+- What's the practical limit for batch size before timeout?
+- Resolution: Test with 20+ containers in Phase 9, monitor for 429 errors, document batch size recommendations
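
If testing shows that large batches time out or trigger 429 responses, a client-side mitigation is to split the ID list into smaller batches and back off exponentially when rate limited. The sketch below is one way to do that under stated assumptions: `send` is a hypothetical callable wrapping one `updateContainers` request, `RateLimited` is a made-up exception that the caller would raise on an HTTP 429, and the batch size and delays are placeholders rather than measured recommendations.

```python
import time
from typing import Callable, Iterator, List


class RateLimited(Exception):
    """Raised by the caller-supplied send function on an HTTP 429 (assumed)."""


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def send_batches(
    ids: List[str],
    send: Callable[[List[str]], None],
    batch_size: int = 5,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send ids in batches, retrying with exponential backoff when rate limited.

    Returns the total number of requests attempted.
    """
    requests = 0
    for batch in chunked(ids, batch_size):
        for attempt in range(max_retries + 1):
            requests += 1
            try:
                send(batch)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)
    return requests


# Demo: the first request is rate limited, every later request succeeds.
calls: List[List[str]] = []
delays: List[float] = []


def fake_send(batch: List[str]) -> None:
    calls.append(batch)
    if len(calls) == 1:
        raise RateLimited()


total = send_batches([str(n) for n in range(12)], fake_send, sleep=delays.append)
print(total, delays)
```

Injecting `sleep` keeps the function testable without real waiting; in production the default `time.sleep` applies.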

- **GraphQL subscription for real-time status** — Schema includes subscriptions but unclear if Docker status changes are subscribable. Impact: None for v1.3 (defer to future enhancement). Resolution: Document as v2+ exploration.
+**myunraid.net cloud relay reliability (LOW priority):**
+- Is internet dependency acceptable for production use?
+- Should we implement direct LAN fallback (HTTPS with SSL handling)?
+- Resolution: Monitor in production after Phase 4-5, implement fallback in future phase if reliability issues surface

- **Unraid 6.9-7.1 Connect plugin availability** — Research confirms Connect plugin exists but installation process not validated. Impact: MEDIUM (users on pre-7.2 need this). Resolution: Phase 1 should document plugin installation steps and test.
-
- **Exact permission name for API key** — Research shows `DOCKER:UPDATE_ANY` but this wasn't explicitly stated in official docs (inferred from schema). Impact: LOW (easily testable). Resolution: Phase 1 tests API key creation and documents exact permission syntax.
+**Logs query fallback strategy (MEDIUM priority):**
+- If GraphQL logs query unusable, hybrid architecture becomes permanent
+- Docker socket proxy removal blocked indefinitely
+- Resolution: Test logs query early in Phase 8, document hybrid architecture as expected state if logs unavailable

 ## Sources

 ### Primary (HIGH confidence)
- [Unraid API Schema](https://raw.githubusercontent.com/unraid/api/main/api/generated-schema.graphql) — GraphQL mutations, container ID types, query structure
- [DeepWiki — Docker Integration](https://deepwiki.com/unraid/api/2.4-docker-integration) — DockerService architecture, updateContainer mutation behavior
- [limetech/dynamix source](https://github.com/limetech/dynamix/blob/master/plugins/dynamix.docker.manager/include/DockerClient.php) — syncVersions() function, update status file handling
- [Unraid API Documentation](https://docs.unraid.net/API/) — API key management, version availability
- [Unraid API Key Management](https://docs.unraid.net/API/programmatic-api-key-management/) — API key creation, permission scopes
+- [Unraid GraphQL Schema](https://raw.githubusercontent.com/unraid/api/main/api/generated-schema.graphql) — Complete API specification, Docker mutations verified
+- [Unraid API Documentation](https://docs.unraid.net/API/) — Official API overview, authentication patterns
+- [Using the Unraid API](https://docs.unraid.net/API/how-to-use-the-api/) — API key setup, permissions, endpoint URLs
+- `.planning/phases/14-unraid-api-access/14-RESEARCH.md` — Phase 14 connectivity research, PrefixedID format verified
+- `.planning/phases/14-unraid-api-access/14-VERIFICATION.md` — Live Unraid 7.2 testing, query validation, myunraid.net requirement
+- `ARCHITECTURE.md` — Existing workflow structure, sub-workflow contracts, 290-node system analysis
+- `CLAUDE.md` — Docker API patterns, n8n conventions, static data limitations, Telegram credential IDs
+- `n8n-workflow.json`, `n8n-*.json` — Actual workflow implementations, 18 Docker API nodes identified

 ### Secondary (MEDIUM confidence)
- [Unraid Forum: Incorrect Update Notification](https://forums.unraid.net/bug-reports/stable-releases/regression-incorrect-docker-update-notification-r2807/) — Community-identified pain point, file deletion workaround
- [Watchtower + Unraid Discussion](https://github.com/containrrr/watchtower/discussions/1389) — External update detection issues, Unraid doesn't auto-sync
- [Unraid Forum: Watchtower Status Not Reflected](https://forums.unraid.net/topic/149953-docker-update-via-watchtower-status-not-reflected-in-unraid/) — Confirmation of stale status after external updates
- [Unraid MCP Server](https://github.com/jmagar/unraid-mcp) — Reference GraphQL client implementation
- [Home Assistant Unraid Integration](https://github.com/domalab/unraid-api-client) — Additional GraphQL usage examples
- [Docker host.docker.internal guide](https://eastondev.com/blog/en/posts/dev/20251217-docker-host-access/) — Network access pattern
+- [DeepWiki Unraid API](https://deepwiki.com/unraid/api) — Comprehensive technical documentation, DockerService internals
+- [DeepWiki Docker Integration](https://deepwiki.com/unraid/api/2.4-docker-integration) — Docker service implementation details, retry logic
+- [unraid-api-client by domalab](https://github.com/domalab/unraid-api-client/blob/main/UNRAIDAPI.md) — Python client documenting queries, isUpdateAvailable field source
+- [unraid-mcp by jmagar](https://github.com/jmagar/unraid-mcp) — MCP server with Docker management tools, mutation examples
+- [GraphQL Migration Patterns](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql) — GitHub's REST to GraphQL migration guide
+- [3 GraphQL Pitfalls](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them) — Schema evolution, error handling patterns

-### Tertiary (LOW confidence)
- [Unraid Forums — Docker Update Status](https://forums.unraid.net/topic/114415-plugin-docker-compose-manager/page/9/) — Status file structure (observed, not documented)
- Community reports of notification spam — Anecdotal, but consistent pattern across forums
+### Tertiary (LOW confidence, needs validation)
+- [n8n Execute Workflow timeout issue #1572](https://github.com/n8n-io/n8n/issues/1572) — Timeout propagation bug report
+- [Telegram Bot API callback data limit](https://core.telegram.org/bots/api#inlinekeyboardbutton) — 64-byte limit specification
+- Community forum discussions on Unraid API update status sync — Anecdotal reports, needs testing

 ---
-*Research completed: 2026-02-08*
+*Research completed: 2026-02-09*
 *Ready for roadmap: yes*