# Project Research Summary
**Project:** Unraid Docker Manager v1.4 — Unraid API Native Migration
**Domain:** Migration from Docker socket proxy to Unraid GraphQL API for native container management
**Researched:** 2026-02-09
**Confidence:** HIGH
## Executive Summary
The migration from Docker socket proxy to Unraid's native GraphQL API is architecturally sound and operationally beneficial, but requires a hybrid approach because container logs are unavailable through the API. Research confirms that Unraid's GraphQL API provides all required container control operations (start, stop, update) with simpler patterns than Docker's REST API, but container logs are NOT accessible via the Unraid API and must continue using the Docker socket proxy. This creates a hybrid architecture: Unraid GraphQL for control operations, Docker socket proxy retained read-only for logs retrieval.
The recommended approach is phased migration starting with simple operations (status queries, actions) to establish patterns, then tackling the complex update workflow which simplifies from 9 Docker API nodes to 2 GraphQL nodes. The single `updateContainer` mutation atomically handles image pull, container recreation, and critical update status sync, solving v1.3's "apply update" badge persistence issue without manual file writes. Key architectural wins include container ID format (PrefixedID) normalization layers, GraphQL error handling standardization, and response shape transformation to maintain workflow contracts.
Critical risks center on container ID format mismatches (Docker 64-char vs Unraid 129-char PrefixedIDs), Telegram callback data 64-byte limits with longer IDs, and the myunraid.net cloud relay's internet dependency introducing latency and outage risk. Mitigation requires ID translation layers implemented before any live operations, a callback data encoding redesign, and timeout adjustments for 200-500ms cloud relay latency. The research identifies 10 critical pitfalls with phase-mapped prevention strategies; the confidence assessment is HIGH for tested operations and MEDIUM for architectural patterns.
## Key Findings
### Recommended Stack
No new dependencies required. All infrastructure was established in Phase 14 (v1.3): Unraid GraphQL API connectivity, the myunraid.net cloud relay URL, an n8n Header Auth credential with API key, and the UNRAID_HOST environment variable. Research confirms the necessity of a hybrid architecture — the Docker socket proxy must remain deployed but reconfigured with minimal read-only permissions (CONTAINERS=1, POST=0) for logs access only.
**Core technologies:**
- **Unraid GraphQL API (7.2+):** Container control operations (list, start, stop, update) — Native integration provides automatic update status sync, structured errors, atomic update mutations
- **myunraid.net cloud relay:** Unraid API access URL — Avoids direct LAN IP nginx redirect auth stripping, but introduces internet dependency and 200-500ms latency
- **docker-socket-proxy (reduced scope):** Logs retrieval ONLY — Unraid API explicitly documents logs as NOT accessible via API, must use Docker socket
- **n8n HTTP Request node:** GraphQL API calls via POST /graphql — Replace Execute Command nodes with structured GraphQL requests, better timeout handling and error parsing
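As a concrete sketch of what the HTTP Request node replacement sends, every call reduces to a POST of a `{ query, variables }` JSON envelope to `/graphql`. The `x-api-key` header name and the container field names below are assumptions carried over from Phase 14 research — verify both against the live schema before relying on them.

```javascript
// Sketch: the request shape an n8n HTTP Request node issues for a
// GraphQL call. Host, header name, and field names are assumptions.
function buildGraphQLRequest(host, apiKey, query, variables = {}) {
  return {
    method: 'POST',
    url: `https://${host}/graphql`,
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': apiKey, // supplied via n8n Header Auth credential in practice
    },
    // GraphQL always POSTs one JSON envelope: { query, variables }
    body: JSON.stringify({ query, variables }),
  };
}

// Example: container list query (field names pending schema introspection).
const req = buildGraphQLRequest(
  'example.myunraid.net',
  'API_KEY',
  `query { docker { containers { id names state image } } }`
);
```

In n8n the same shape is configured declaratively on the node (URL, Header Auth credential, JSON body), so this function is only a mental model of the wire format.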
**Critical version requirements:**
- Unraid 7.2+ required for GraphQL API availability
- n8n HTTP Request node typeVersion 1.2+ for Header Auth credential support
### Expected Features
Most operations are drop-in replacements with the same user-facing behavior but simpler implementation. The update workflow gains significant simplification (the 5-step Docker API flow collapses to a single mutation) and the benefit of automatic status sync.
**Must have (table stakes):**
- Container start/stop/restart — GraphQL mutations for start/stop, restart requires chaining stop + start (no native restart mutation)
- Container status query — GraphQL containers query with UPPERCASE state values, PrefixedID format
- Container update — Single `updateContainer` mutation replaces 5-step Docker API flow (pull, stop, remove, create, start)
- Container logs — GraphQL logs query exists in schema (field structure needs testing during implementation)
- Batch operations — Native `updateContainers(ids)` and `updateAllContainers` mutations for multi-container updates
**Should have (competitive):**
- Automatic update status sync — Unraid API's `updateContainer` mutation handles internal state sync, eliminates v1.3's manual file write workaround
- Update detection via `isUpdateAvailable` field — Bot shows what Unraid sees, no digest comparison discrepancies (NOTE: field documented in research but may not exist in actual schema, validate during implementation)
- Batch update simplification — Native GraphQL batch mutations reduce network calls and latency
**Defer (v2+):**
- Real-time container stats — `dockerContainerStats` subscription requires WebSocket infrastructure, complex for n8n HTTP Request node
- Container autostart configuration — `updateAutostartConfiguration` mutation available but not user-requested
- Port conflict detection — `portConflicts` query useful for debugging but not core workflow
- Direct LAN fallback — Implement if myunraid.net relay proves unreliable in production, defer until proven necessary
### Architecture Approach
Migration affects 4 of 7 sub-workflows (Update, Actions, Status, Logs), totaling 18 Docker API nodes replaced with GraphQL HTTP Request nodes plus normalization layers. Three sub-workflows (Matching, Batch UI, Confirmation) remain unchanged because they operate on data contracts, not API sources. The Update sub-workflow sees the largest impact: 34 nodes shrink to ~27 by replacing the 9-step Docker API flow with 1-2 GraphQL nodes.
**Major components:**
1. **GraphQL Response Normalization Layer** — Code nodes after every GraphQL query to transform Unraid response shape (nested `data.docker.containers`) and field formats (UPPERCASE state, PrefixedID) to match workflow contracts. Prevents cascading failures across 60+ Code nodes in main workflow that expect Docker API shape.
2. **Container ID Translation Layer** — Matching sub-workflow outputs Unraid PrefixedID format (129 chars: `{server_hash}:{container_hash}`) instead of Docker short ID (64 chars). All Execute Workflow input preparation nodes pass opaque containerId token, value changes but field name/contract stable.
3. **Callback Data Encoding Redesign** — Telegram 64-byte callback limit broken by PrefixedID length. Implement ID shortening with lookup table or base62 hash mapping. Update ALL callback formats from `action:containerID` to `action:idx` with static data lookup.
4. **GraphQL Error Handling Pattern** — Standardized validation: check `response.errors[]` array first (GraphQL returns HTTP 200 even for errors), parse structured error messages, handle HTTP 304 "already in state" as success case, validate `response.data` structure before accessing fields.
5. **Hybrid API Router** — Sub-workflows route control operations to Unraid GraphQL (start, stop, update, status), logs operations to Docker socket proxy. Docker proxy reconfigured read-only (POST=0) to prevent accidental dual-write.
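Components 1 and 4 can be combined in a single Code node per query. A minimal sketch, assuming the nested `data.docker.containers` shape and UPPERCASE state enum described above (exact Unraid field names still need introspection):

```javascript
// Sketch of a per-query normalization Code node: checks the GraphQL
// error envelope first (GraphQL returns HTTP 200 even on failure),
// then flattens the nested response and converts Unraid formats
// (UPPERCASE state, PrefixedID) to the shape downstream nodes expect.
function normalizeContainers(response) {
  if (Array.isArray(response.errors) && response.errors.length > 0) {
    throw new Error(`GraphQL error: ${response.errors[0].message}`);
  }
  const containers = response?.data?.docker?.containers;
  if (!Array.isArray(containers)) {
    throw new Error('Unexpected shape: data.docker.containers missing');
  }
  return containers.map((c) => ({
    containerId: c.id,                          // opaque PrefixedID token
    name: Array.isArray(c.names) ? c.names[0].replace(/^\//, '') : c.names,
    state: String(c.state || '').toLowerCase(), // RUNNING -> running
    image: c.image,
  }));
}
```

Keeping the validation and the shape transform in one node means a schema drift surfaces as a single explicit error instead of garbled data fanning out across 60+ Code nodes.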
**Key patterns to follow:**
- One normalization Code node per GraphQL query response (Status, Actions, Update, Logs)
- Explicit timeout configuration on every HTTP Request node (30-60 seconds for mutations, account for cloud relay latency)
- Client-side timeout validation in main workflow (timestamp checks, don't rely on Execute Workflow timeout propagation)
- Fresh state query immediately before action execution to avoid race conditions (200-500ms latency creates stale state window)
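The fresh-state pattern in the last bullet can be reduced to a small pure decision step: after re-querying, decide whether the action is still needed, and treat "already in desired state" as success rather than an error. State values here are an assumption (UPPERCASE Unraid states normalized to lowercase `running`/`exited`); adjust to the actual enum.

```javascript
// Sketch of the fresh-state guard run immediately after the pre-action
// re-query. Desired-state mapping is an assumption pending verification.
function planAction(action, currentState) {
  const desired = { start: 'running', stop: 'exited' }[action];
  if (!desired) throw new Error(`Unknown action: ${action}`);
  if (currentState === desired) {
    // No-op success: mirrors treating HTTP 304 "already in state" as OK
    return { execute: false, result: 'already-in-state' };
  }
  return { execute: true, result: 'proceed' };
}
```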
### Critical Pitfalls
**Top 5 pitfalls with prevention strategies:**
1. **Container ID Format Mismatch Breaking All Operations** — Docker 64-char hex vs Unraid 129-char PrefixedID. Passing wrong format causes all operations to fail with "container not found." Prevention: Implement ID validation regex `^[a-f0-9]{64}:[a-f0-9]{64}$` BEFORE any live operations, update ALL 17 Execute Workflow input nodes, test with containers having similar names but different IDs. Address in Phase 1.
2. **Telegram Callback Data 64-Byte Limit Exceeded** — Callback format `stop:8a9907a24576` fit with Docker IDs, `stop:{129-char-PrefixedID}` exceeds limit causing silent inline keyboard failures. Prevention: Redesign callback encoding to `action:idx` with PrefixedID lookup table, hash to 8-char base62, test ALL callback patterns. Address in Phase 2.
3. **myunraid.net Cloud Relay Internet Dependency** — Bot becomes non-functional during internet outages despite LAN connectivity. Latency increases from sub-10ms (Docker socket) to 200-500ms (cloud relay). Prevention: Add network connectivity pre-flight checks, implement degraded mode messaging, monitor relay latency as first-class metric, document internet dependency in error messages. Address in Phase 2.
4. **GraphQL Response Structure Normalization Missing** — Field name changes (State→state, UPPERCASE values), nested response structure (`data.docker.containers`), missing normalization causes parsing failures across 60 Code nodes. Prevention: Build normalization layer BEFORE touching sub-workflows, add schema validation, test response parsing independently. Address in Phase 3.
5. **Sub-Workflow Timeout Errors Lost in Propagation** — Known n8n issue where Execute Workflow node ignores sub-workflow timeouts. Cloud relay latency causes operations that completed in 10-30s to take 60-120s. Prevention: Increase ALL sub-workflow timeouts by 3-5x, implement client-side timeout in main workflow, add progress indicators, configure HTTP Request timeouts explicitly. Address in Phase 6.
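Pitfalls 1 and 2 share a prevention point: validate the PrefixedID format before it ever reaches Telegram, then ship only a short `action:idx` token in the callback. A sketch under those assumptions (the lookup table would live in workflow static data in practice):

```javascript
// Sketch: PrefixedID validation plus action:idx callback encoding that
// stays under Telegram's 64-byte callback_data limit.
const PREFIXED_ID_RE = /^[a-f0-9]{64}:[a-f0-9]{64}$/; // 129 chars total

function buildCallbacks(action, prefixedIds) {
  const lookup = []; // idx -> PrefixedID, persisted via static data
  return {
    lookup,
    buttons: prefixedIds.map((id) => {
      if (!PREFIXED_ID_RE.test(id)) {
        throw new Error(`Not a 129-char PrefixedID: ${id.slice(0, 20)}...`);
      }
      const idx = lookup.push(id) - 1;
      const data = `${action}:${idx}`;
      if (Buffer.byteLength(data, 'utf8') > 64) {
        throw new Error(`callback_data exceeds 64 bytes: ${data}`);
      }
      return { callback_data: data };
    }),
  };
}
```

On callback receipt, the handler resolves `idx` back through the lookup table; a missing index (e.g. after static data loss) should produce an explicit "stale keyboard" message rather than a silent failure.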
**Additional critical pitfalls:**
- **Credential Rotation Kills Bot Mid-Operation** — Dual credential storage (`.env.unraid-api` + n8n Header Auth) falls out of sync, 401 errors with no detection. Prevention: Consolidate to n8n credential only, implement 401 error user-friendly messaging.
- **Race Condition Between Query and Action** — 200-500ms latency creates stale state window, container changes between query and action execution. Prevention: Fresh state query before action, handle "already in state" as success.
- **Dual-Write Period Data Inconsistency** — Phased migration creates split-brain between Docker and Unraid APIs. Prevention: Short cutover window (hours not days), single source of truth per operation.
- **Batch Performance Degradation** — Sequential operations multiply cloud relay latency (10 containers = 10x slower). Prevention: GraphQL batching for reads, parallel processing where safe, progress streaming.
- **GraphQL Schema Changes Silent Breakage** — Unraid API evolves, field additions/deprecations break queries without warning. Prevention: Schema introspection checks on startup, field existence validation before use.
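The schema-drift pitfall suggests a startup guard: run one introspection query and confirm every field the workflows depend on still exists before any live operation. The type and field names below are illustrative assumptions, not the verified Unraid schema.

```javascript
// Sketch of a startup schema guard against silent GraphQL breakage.
// REQUIRED_FIELDS lists illustrative dependencies; replace with the
// fields the workflows actually read once introspection confirms them.
const REQUIRED_FIELDS = {
  DockerContainer: ['id', 'names', 'state', 'image'],
};

function checkSchema(introspectionTypes) {
  const missing = [];
  for (const [typeName, fields] of Object.entries(REQUIRED_FIELDS)) {
    const type = introspectionTypes.find((t) => t.name === typeName);
    const present = new Set((type?.fields || []).map((f) => f.name));
    for (const f of fields) {
      if (!present.has(f)) missing.push(`${typeName}.${f}`);
    }
  }
  return missing; // non-empty -> alert and halt before live operations
}
```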
## Implications for Roadmap
Based on research, suggested phase structure follows risk mitigation order: infrastructure layers first, simple operations to prove patterns, complex update workflow last when patterns validated.
### Phase 1: Container ID Translation Layer
**Rationale:** ID format mismatch is catastrophic failure point — must be solid before any live API calls. All sub-workflows depend on container identification working correctly.
**Delivers:** PrefixedID validation, Matching sub-workflow outputs Unraid IDs, ID format documentation
**Addresses:** Container ID format mismatch pitfall (critical)
**Avoids:** All operations failing with "container not found" on cutover
**Complexity:** LOW — Pure data transformation, no API calls
### Phase 2: Callback Data Encoding Redesign
**Rationale:** Telegram inline keyboards are primary UI pattern. Must work before enabling any action operations. Can implement in parallel with Phase 1 (no dependencies).
**Delivers:** Callback format `action:idx` with lookup table, 64-byte validation, all callback patterns tested
**Addresses:** Callback data size limit pitfall, enables inline keyboard actions
**Avoids:** Silent inline keyboard failures on cutover
**Complexity:** MEDIUM — Requires lookup table design, static data storage strategy, extensive testing
### Phase 3: GraphQL Response Normalization
**Rationale:** Establishes data contract stability before modifying sub-workflows. Prevents cascading failures across 60+ Code nodes. Template for all future GraphQL integrations.
**Delivers:** Normalization Code node template, schema validation, response shape documentation
**Addresses:** Response structure parsing pitfall
**Avoids:** Garbled data, empty container lists, state comparison failures
**Complexity:** MEDIUM — Schema design, field mapping, validation logic
### Phase 4: Status Query Migration (Simple Read-Only)
**Rationale:** First live API integration with lowest risk (read-only query). Proves normalization layer works, establishes error handling patterns. Status sub-workflow = 3 Docker nodes → 4 GraphQL nodes.
**Delivers:** Container list via GraphQL, status display with Unraid data, error handling validation
**Uses:** Normalization layer from Phase 3, ID translation from Phase 1
**Implements:** Hybrid router (GraphQL for status, Docker proxy still active)
**Addresses:** Table stakes container status feature
**Avoids:** Breaking existing status functionality during migration
**Complexity:** LOW — Single query type, straightforward mapping
**Research flag:** Standard pattern, skip research-phase
### Phase 5: Actions Migration (Start/Stop/Restart)
**Rationale:** Proves mutation patterns work before tackling complex update flow. Restart operation tests sequential mutation chaining (stop + start). Actions sub-workflow = 4 Docker nodes → 5 GraphQL nodes.
**Delivers:** Start/stop/restart via GraphQL mutations, error handling for "already in state" (HTTP 304)
**Uses:** Callback encoding from Phase 2, normalization from Phase 3
**Implements:** Sequential mutation pattern for restart (no native restart mutation)
**Addresses:** Table stakes container actions
**Avoids:** Restart timing issues, state conflict errors
**Complexity:** MEDIUM — Mutation error handling, restart sequencing
**Research flag:** Standard pattern, skip research-phase
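Since there is no native restart mutation, restart is modeled as an ordered stop-then-start pair the sub-workflow executes serially. A sketch assuming the schema exposes `docker { stop }` and `docker { start }` mutations taking a container ID (mutation names and argument shape need verification against the generated schema):

```javascript
// Sketch of restart sequencing as two serial GraphQL mutations.
// Mutation and field names are assumptions pending schema verification.
function restartPlan(containerId) {
  const mutation = (op) =>
    `mutation { docker { ${op}(id: "${containerId}") { id state } } }`;
  return [
    { step: 1, query: mutation('stop') },
    { step: 2, query: mutation('start') }, // run only after stop confirms exit
  ];
}
```

The key design point is sequencing: step 2 must wait for step 1's response to confirm the container actually stopped, otherwise the start mutation races the shutdown.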
### Phase 6: Timeout and Latency Hardening
**Rationale:** Must address before Update workflow (long-running operations). Cloud relay latency causes timeout failures without proper handling. Affects all sub-workflows.
**Delivers:** 3-5x timeout increases, client-side timeout validation, progress indicators, latency monitoring
**Uses:** Findings from Phase 4-5 testing
**Implements:** Progress streaming pattern for long operations
**Addresses:** Sub-workflow timeout propagation pitfall, network resilience
**Avoids:** Silent failures, user confusion on slow operations
**Complexity:** LOW — Configuration changes, monitoring setup
**Research flag:** Implementation pattern testing needed
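Because Execute Workflow timeout propagation is unreliable (the n8n issue cited above), the client-side check amounts to the main workflow recording its own dispatch timestamp and comparing elapsed time itself. The threshold below is an illustrative 2-minute budget, not a recommendation:

```javascript
// Sketch of the client-side timeout validation: the main workflow
// stores dispatchedAtMs when it calls a sub-workflow, then checks
// elapsed time on return instead of trusting timeout propagation.
function checkElapsed(dispatchedAtMs, nowMs, limitMs = 120000) {
  const elapsed = nowMs - dispatchedAtMs;
  return {
    elapsed,
    timedOut: elapsed > limitMs, // limit sized 3-5x the Docker-socket budget
  };
}
```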
### Phase 7: Update Workflow Migration (Complex Atomic Operation)
**Rationale:** Highest impact phase — 9 Docker nodes → 2 GraphQL nodes, solves v1.3 update status sync issue. Deferred until patterns proven in Phase 4-5 and timeouts hardened in Phase 6.
**Delivers:** Single `updateContainer` mutation, automatic status sync, update workflow simplification (34 → 27 nodes)
**Uses:** All infrastructure from Phase 1-6
**Implements:** Atomic update pattern, major architectural win
**Addresses:** Table stakes update feature, v1.3 pain point resolution
**Avoids:** Multi-step Docker API complexity, manual status sync
**Complexity:** HIGH — Critical operation, thorough testing required
**Research flag:** Monitor for schema changes in updateContainer mutation behavior
### Phase 8: Logs Migration and Hybrid Finalization
**Rationale:** Validates logs query works (schema shows query exists but field structure untested). Completes hybrid architecture by locking down Docker proxy to logs-only.
**Delivers:** Logs via GraphQL (if query works) OR confirm Docker proxy retention, proxy reconfiguration (POST=0)
**Uses:** Normalization patterns from Phase 3
**Implements:** Final hybrid architecture state
**Addresses:** Table stakes logs feature
**Avoids:** Breaking logs functionality, accidental Docker proxy usage for control ops
**Complexity:** MEDIUM — Logs query field structure unknown until tested
**Research flag:** Logs query response format needs validation
### Phase 9: Batch Operations Optimization
**Rationale:** Deferred until basic operations proven. Batch update leverages native `updateContainers` mutation for performance. Only enable after single-container update stable.
**Delivers:** Batch update via GraphQL mutation, progress streaming, performance metrics
**Uses:** Update mutation from Phase 7, timeout patterns from Phase 6
**Implements:** GraphQL batch mutation pattern
**Addresses:** Competitive batch update feature
**Avoids:** Timeout issues, linear performance degradation
**Complexity:** MEDIUM — Batch error handling, partial failure scenarios
**Research flag:** Test batch mutation behavior with 10+ containers
### Phase 10: Validation and Cleanup
**Rationale:** Final verification before declaring migration complete. Remove Docker socket proxy if logs query worked, otherwise document hybrid architecture as permanent.
**Delivers:** Full workflow testing, Docker proxy removal (if possible), architecture docs update
**Addresses:** All migration success criteria
**Complexity:** LOW — Testing and documentation
### Phase Ordering Rationale
**Dependency chain:** ID Translation (Phase 1) → Callback Encoding (Phase 2) → Normalization (Phase 3) → Status (Phase 4) → Actions (Phase 5) → Timeouts (Phase 6) → Update (Phase 7) → Logs (Phase 8) → Batch (Phase 9) → Cleanup (Phase 10)
**Risk mitigation order:** Start with infrastructure layers that prevent catastrophic failures (ID format, callback limits, response parsing), prove patterns with low-risk read-only operations (status query), establish mutation patterns with simple operations (start/stop), harden for production (timeouts/latency), tackle high-impact complex operation (update), finalize hybrid architecture (logs), optimize performance (batch), validate and document.
**Architectural grouping:** Phases 1-3 are pure infrastructure (no API calls), Phases 4-5 prove API integration patterns, Phase 6 hardens for production latency, Phase 7 delivers main migration value (update simplification + status sync), Phases 8-10 complete feature parity and optimize.
**Pitfall avoidance mapping:** Each phase addresses 1-2 critical pitfalls from research. Phase 1 prevents ID mismatch disaster, Phase 2 prevents callback failures, Phase 3 prevents parsing breakage, Phase 6 prevents timeout frustration, Phase 7 proves atomic operations, Phase 8 locks down hybrid architecture to prevent dual-write.
### Research Flags
**Phases needing deeper research during planning:**
- **Phase 7 (Update Workflow):** updateContainer mutation behavior when already up-to-date unclear — does it return success immediately or pull image again? Batch error handling for updateContainers unknown — if one fails, do others continue? Test with non-critical container first.
- **Phase 8 (Logs):** DockerContainerLogs GraphQL type field structure unknown — timestamp format, stdout/stderr separation, entry structure all need testing. May require fallback plan if query unusable.
- **Phase 9 (Batch Operations):** updateAllContainers filter behavior unclear — does it filter by :latest tag or update everything with available updates? Rate limiting impact unknown — does batch count as 1 request or N?
**Phases with standard patterns (skip research-phase):**
- **Phase 1-3 (Infrastructure):** Data transformation patterns well-documented, no novel research needed
- **Phase 4 (Status Query):** GraphQL query tested in Phase 14, field mapping straightforward
- **Phase 5 (Actions):** Start/stop mutations tested in STACK.md research, restart pattern clear (sequential stop+start)
- **Phase 6 (Timeouts):** n8n timeout configuration documented, latency monitoring standard practice
- **Phase 10 (Validation):** Testing methodology established, documentation templates exist
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | Unraid GraphQL API tested live on Unraid 7.2 (Phase 14 + STACK.md research). Container operations verified via direct API calls. Logs unavailability confirmed by official docs. Hybrid architecture necessity proven. |
| Features | HIGH | Most operations are direct GraphQL equivalents of Docker API patterns (tested). Update simplification validated via schema + live updateContainer mutation testing. Only uncertainty: isUpdateAvailable field existence (documented but may not be in actual schema). |
| Architecture | HIGH | 4 of 7 sub-workflows require modification (18 Docker API nodes identified). Normalization layer pattern proven in existing workflows. Container ID format transition validated. Main workflow and 3 sub-workflows confirmed unchanged. |
| Pitfalls | MEDIUM | Container ID format mismatch validated via testing (HIGH). Callback data limit is Telegram spec (HIGH). Cloud relay dependency documented by Unraid (HIGH). GraphQL migration patterns sourced from industry best practices (MEDIUM). n8n timeout issue confirmed by GitHub issue (HIGH). Schema evolution patterns are general GraphQL risks (MEDIUM). |
**Overall confidence:** HIGH for migration feasibility and approach, MEDIUM for execution complexity and edge case handling.
### Gaps to Address
**Schema field validation (MEDIUM priority):**
- `isUpdateAvailable` field documented in community sources but needs verification against actual Unraid 7.2 schema introspection
- DockerContainerLogs field structure completely unknown until tested — may require response format iteration
- Resolution: Schema introspection query in Phase 4, field existence checks before use, graceful degradation if fields missing
**Update mutation behavior edge cases (MEDIUM priority):**
- updateContainer when already up-to-date: immediate success or redundant pull?
- updateContainers partial failure handling: abort all or continue?
- Resolution: Test with non-critical containers during Phase 7, document behavior, implement appropriate error handling
**Batch operation rate limiting (LOW priority):**
- Does updateContainers(ids) count as 1 API request or N requests against Unraid rate limits?
- What's the practical limit for batch size before timeout?
- Resolution: Test with 20+ containers in Phase 9, monitor for 429 errors, document batch size recommendations
**myunraid.net cloud relay reliability (LOW priority):**
- Is internet dependency acceptable for production use?
- Should we implement direct LAN fallback (HTTPS with SSL handling)?
- Resolution: Monitor in production after Phase 4-5, implement fallback in future phase if reliability issues surface
**Logs query fallback strategy (MEDIUM priority):**
- If GraphQL logs query unusable, hybrid architecture becomes permanent
- Docker socket proxy removal blocked indefinitely
- Resolution: Test logs query early in Phase 8, document hybrid architecture as expected state if logs unavailable
## Sources
### Primary (HIGH confidence)
- [Unraid GraphQL Schema](https://raw.githubusercontent.com/unraid/api/main/api/generated-schema.graphql) — Complete API specification, Docker mutations verified
- [Unraid API Documentation](https://docs.unraid.net/API/) — Official API overview, authentication patterns
- [Using the Unraid API](https://docs.unraid.net/API/how-to-use-the-api/) — API key setup, permissions, endpoint URLs
- `.planning/phases/14-unraid-api-access/14-RESEARCH.md` — Phase 14 connectivity research, PrefixedID format verified
- `.planning/phases/14-unraid-api-access/14-VERIFICATION.md` — Live Unraid 7.2 testing, query validation, myunraid.net requirement
- `ARCHITECTURE.md` — Existing workflow structure, sub-workflow contracts, 290-node system analysis
- `CLAUDE.md` — Docker API patterns, n8n conventions, static data limitations, Telegram credential IDs
- `n8n-workflow.json`, `n8n-*.json` — Actual workflow implementations, 18 Docker API nodes identified
### Secondary (MEDIUM confidence)
- [DeepWiki Unraid API](https://deepwiki.com/unraid/api) — Comprehensive technical documentation, DockerService internals
- [DeepWiki Docker Integration](https://deepwiki.com/unraid/api/2.4-docker-integration) — Docker service implementation details, retry logic
- [unraid-api-client by domalab](https://github.com/domalab/unraid-api-client/blob/main/UNRAIDAPI.md) — Python client documenting queries, isUpdateAvailable field source
- [unraid-mcp by jmagar](https://github.com/jmagar/unraid-mcp) — MCP server with Docker management tools, mutation examples
- [GraphQL Migration Patterns](https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql) — GitHub's REST to GraphQL migration guide
- [3 GraphQL Pitfalls](https://www.vanta.com/resources/3-graphql-pitfalls-and-steps-to-avoid-them) — Schema evolution, error handling patterns
### Tertiary (LOW confidence, needs validation)
- [n8n Execute Workflow timeout issue #1572](https://github.com/n8n-io/n8n/issues/1572) — Timeout propagation bug report
- [Telegram Bot API callback data limit](https://core.telegram.org/bots/api#inlinekeyboardbutton) — 64-byte limit specification
- Community forum discussions on Unraid API update status sync — Anecdotal reports, needs testing
---
*Research completed: 2026-02-09*
*Ready for roadmap: yes*