docs: complete v1.3 project research (STACK, FEATURES, ARCHITECTURE, PITFALLS, SUMMARY)
# Pitfalls Research

**Project:** Unraid Docker Manager
**Domain:** Unraid Update Status Sync for Existing Docker Management Bot
**Researched:** 2026-02-08
**Confidence:** MEDIUM

## Context

Research combines verified Unraid architecture (HIGH confidence) with integration patterns from community sources (MEDIUM confidence). File format and API internals have LIMITED documentation — risk areas are flagged for phase-specific investigation.
## Critical Pitfalls

**Risk focus:** Breaking existing functionality while adding new features.
### Pitfall 1: State Desync Between Docker API and Unraid's Internal Tracking

**What goes wrong:**
After bot-initiated updates via Docker API (pull + recreate), Unraid's Docker tab continues showing "update ready" status. Unraid doesn't detect that the container was updated externally. This creates user confusion ("I just updated, why does it still show?") and leads to duplicate update attempts.

**Why it happens:**
Unraid tracks update status through multiple mechanisms that aren't automatically synchronized with Docker API operations:
- `/var/lib/docker/unraid-update-status.json` — cached update status file (stale after external updates)
- DockerManifestService cache — compares local image digests to registry manifests
- Real-time DockerEventService — monitors Docker daemon events but doesn't trigger update status recalculation

The bot bypasses Unraid's template system entirely, so Unraid "probably doesn't check if a container has magically been updated and change its UI" (watchtower discussion).

**How to avoid:**
Phase 1 (Investigation) must determine ALL state locations:
1. **Verify update status file format** — inspect `/var/lib/docker/unraid-update-status.json` structure (undocumented, requires reverse engineering)
2. **Document cache invalidation triggers** — what causes DockerManifestService to recompute?
3. **Test event-based refresh** — does recreating a container trigger an update check, or only manual "Check for Updates"?

Phase 2 (Sync Implementation) options (in order of safety):
- **Option A (safest):** Delete stale entries from `unraid-update-status.json` for updated containers (forces recalculation on next check)
- **Option B (if A insufficient):** Call Unraid API update check endpoint after bot updates (triggers full recalc)
- **Option C (last resort):** Directly modify `unraid-update-status.json` with current digest (highest risk of corruption)

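Option A can be sketched as a small helper. This is a hedged sketch, not Unraid's documented interface: the real schema of `unraid-update-status.json` must be confirmed in Phase 1, so the code assumes a top-level JSON object keyed by container name, and writes atomically via temp-file + rename so Unraid never observes a half-written file.

```python
import json
import os
import tempfile

STATUS_FILE = "/var/lib/docker/unraid-update-status.json"

def remove_stale_entries(status: dict, updated: list) -> dict:
    """Drop entries for containers the bot just updated, forcing Unraid to
    recalculate their status on the next check. Assumes the file is a JSON
    object keyed by container name (unverified -- confirm in Phase 1)."""
    return {name: entry for name, entry in status.items() if name not in updated}

def sync_after_update(updated: list, path: str = STATUS_FILE) -> None:
    with open(path) as f:
        status = json.load(f)
    cleaned = remove_stale_entries(status, updated)
    # Write to a temp file in the same directory, then atomically rename.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(cleaned, f)
    os.replace(tmp, path)
```

Because deletion is idempotent, re-running the sync for the same container is harmless — one reason Option A is listed as safest.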
**Warning signs:**
- "Apply Update" shown in Unraid UI immediately after bot reports successful update
- Unraid notification shows update available for container that bot just updated
- `/var/lib/docker/unraid-update-status.json` modified timestamp doesn't change after bot update

**Phase to address:**
Phase 1 (Investigation & File Format Analysis) — understand state structure
Phase 2 (Sync Implementation) — implement chosen sync strategy
Phase 3 (UAT) — verify sync works across Unraid versions

---
### Pitfall 2: Race Condition Between Unraid's Periodic Update Check and Bot Sync-Back

**What goes wrong:**
Unraid periodically checks for updates (user-configurable interval, often 15-60 minutes). If the bot writes to `unraid-update-status.json` while Unraid's update check is running, data corruption or lost updates occur. Symptoms: Unraid shows containers as "update ready" immediately after sync, or sync writes are silently discarded.

**Why it happens:**
Two processes writing to the same file without coordination:
- Unraid's update check: reads file → queries registries → writes full file
- Bot sync: reads file → modifies entry → writes full file

If both run concurrently, the last writer wins (the classic lost-update problem). No evidence of file locking in Unraid's update status handling.

**How to avoid:**
1. **Read-modify-write atomicity:** Use file locking or atomic write (write to temp file, atomic rename)
2. **Timestamp verification:** Read file, modify, check mtime before write — retry if changed
3. **Idempotent sync:** Deleting entries (Option A above) is safer than modifying — delete is idempotent
4. **Rate limiting:** Don't sync immediately after update — wait 5-10 seconds to avoid collision with Unraid's Docker event handler

Phase 2 implementation requirements:
- Use Python's `fcntl.flock()` or atomic file operations
- Include retry logic with exponential backoff (max 3 attempts)
- Log all file modification failures for debugging

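The locking and retry requirements above can be sketched as follows. Hedge: `flock` is advisory and Unraid's own writer almost certainly does not take this lock, so the mtime check and the atomic rename do most of the real work; the `locked_update` helper and lock-file name are illustrative, not an existing interface.

```python
import fcntl
import json
import os
import tempfile
import time

def locked_update(path: str, mutate, attempts: int = 3) -> bool:
    """Read-modify-write a JSON file with an advisory lock, an mtime check,
    and an atomic rename. Retries with exponential backoff on conflict."""
    lock_path = path + ".lock"
    for attempt in range(attempts):
        with open(lock_path, "w") as lock:
            # Advisory: only protects against other cooperating writers.
            fcntl.flock(lock, fcntl.LOCK_EX)
            before = os.stat(path).st_mtime_ns
            with open(path) as f:
                data = json.load(f)
            new_data = mutate(data)
            # Only write if nobody (e.g. Unraid's update check) touched the
            # file while we worked; otherwise retry from a fresh read.
            if os.stat(path).st_mtime_ns == before:
                fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
                with os.fdopen(fd, "w") as f:
                    json.dump(new_data, f)
                os.replace(tmp, path)  # atomic on POSIX
                return True
        time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
    return False
```

Failures (return value `False`) should be logged, per the requirements above.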

**Warning signs:**
- Sync reports success but Unraid state unchanged
- File modification timestamp inconsistent with sync execution time
- "Resource temporarily unavailable" errors when accessing the file

**Phase to address:**
Phase 2 (Sync Implementation) — implement atomic file operations and retry logic

---
### Pitfall 3: Unraid Version Compatibility — Internal Format Changes Break Integration

**What goes wrong:**
Unraid updates change the structure of `/var/lib/docker/unraid-update-status.json` or introduce new update tracking mechanisms. The bot's sync logic breaks silently (no status updates) or corrupts the file (containers disappear from UI, update checks fail).

**Why it happens:**
- File format is undocumented (no schema, no version field)
- Unraid 7.x introduced major API changes (GraphQL, new DockerService architecture)
- Past example: Unraid 6.12.8 template errors that "previously were silently ignored could cause Docker containers to fail to start"
- No backward compatibility guarantees for internal files

Historical evidence of breaking changes:
- Unraid 7.2.1 (Nov 2025): Docker localhost networking broke
- Unraid 6.12.8: Docker template validation strictness increased
- Unraid API open-sourced Jan 2025 — likely more changes incoming

**How to avoid:**
1. **Version detection:** Read Unraid version from `/etc/unraid-version` or API
2. **Format validation:** Before modifying the file, validate the expected structure (reject unknown formats)
3. **Graceful degradation:** If the file format is unrecognized, log an error and skip sync (preserve existing bot functionality)
4. **Testing matrix:** Test against Unraid 6.11, 6.12, 7.0, 7.2 (Phase 3)

Phase 1 requirements:
- Document current file format for Unraid 7.x
- Check Unraid forums for known format changes across versions
- Identify version-specific differences (if any)

Phase 2 implementation:
```python
SUPPORTED_VERSIONS = ['6.11', '6.12', '7.0', '7.1', '7.2']

version = read_unraid_version()
if not version_compatible(version):
    log_error(f"Unsupported Unraid version: {version}")
    return  # Skip sync, preserve bot functionality
```
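The helpers referenced in the snippet aren't defined there; a minimal sketch follows, assuming `/etc/unraid-version` contains a line like `version="7.0.1"` (the exact format should be confirmed in Phase 1, and `parse_unraid_version` is an illustrative name):

```python
import re

SUPPORTED_VERSIONS = ('6.11', '6.12', '7.0', '7.1', '7.2')

def parse_unraid_version(text: str) -> str:
    """Extract '7.0.1' from content like: version="7.0.1" (assumed format)."""
    match = re.search(r'version="([^"]+)"', text)
    if not match:
        raise ValueError(f"Unrecognized /etc/unraid-version content: {text!r}")
    return match.group(1)

def read_unraid_version(path: str = "/etc/unraid-version") -> str:
    with open(path) as f:
        return parse_unraid_version(f.read())

def version_compatible(version: str) -> bool:
    # Match on the major.minor prefix so e.g. 7.0.1 is accepted via '7.0'.
    return any(version == v or version.startswith(v + ".") for v in SUPPORTED_VERSIONS)
```

Prefix matching keeps the allowlist at major.minor granularity, matching the testing matrix above.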

**Warning signs:**
- After Unraid upgrade, sync stops working (no errors, just no state change)
- Unraid Docker tab shows errors or missing containers after bot update
- File size changes significantly after Unraid upgrade (format change)

**Phase to address:**
Phase 1 (Investigation) — document current format, check version differences
Phase 2 (Implementation) — add version detection and validation
Phase 3 (UAT) — test across Unraid versions

---
### Pitfall 4: Docker Socket Proxy Blocks Filesystem Access — n8n Can't Reach Unraid State Files

**What goes wrong:**
The bot runs inside the n8n container, which accesses Docker via the socket proxy (a security layer). The socket proxy filters Docker API endpoints but doesn't provide filesystem access. `/var/lib/docker/unraid-update-status.json` is on the Unraid host, unreachable from the n8n container.

Attempting to mount host paths into n8n violates the security boundary and creates a maintenance burden (n8n updates require preserving mounts).

**Why it happens:**
Current architecture (from ARCHITECTURE.md):

```
n8n container → docker-socket-proxy → Docker Engine
```

Socket proxy security model:
- Grants specific Docker API endpoints (containers, images, exec)
- Blocks direct filesystem access
- n8n has no `/host` mount (intentional security decision)

Mounting `/var/lib/docker` into the n8n container:
- Bypasses socket proxy security (defeats the purpose)
- Requires n8n container restart when file path changes
- Couples n8n deployment to Unraid internals

**How to avoid:**
Three architectural options (in order of preference):

**Option A: Unraid API Integration (cleanest, highest effort)**
- Use Unraid's native API (GraphQL or REST) if update status endpoints exist
- Requires: API key management, authentication flow, endpoint documentation
- Benefits: Version-safe, no direct file access, official interface
- Risk: API may not expose update status mutation endpoints

**Option B: Helper Script on Host (recommended for v1.3)**
- Small Python script runs on the Unraid host (not in a container)
- n8n triggers it via `docker exec` to a host helper, or via webhook
- Helper has direct filesystem access, performs the sync
- Benefits: Clean separation, no n8n filesystem access, minimal coupling
- Implementation: `.planning/research/ARCHITECTURE.md` should detail this pattern

**Option C: Controlled Host Mount (fallback, higher risk)**
- Mount only `/var/lib/docker/unraid-update-status.json` (not the entire `/var/lib/docker`)
- Read-only mount + separate write mechanism (requires Docker API or exec)
- Benefits: Direct access
- Risk: Tight coupling, version fragility

**Phase 1 must investigate:**
1. Does the Unraid API expose update status endpoints? (check GraphQL schema)
2. Can Docker exec reach host scripts? (test in current deployment)
3. Security implications of each option

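If Option B is chosen, the host-side helper could be as small as a localhost-only HTTP endpoint that n8n POSTs updated container names to. Everything here is illustrative: the port, the payload shape, and the `sync_containers` hook are assumptions, not an existing interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_sync_request(body: bytes) -> list:
    """Validate the assumed payload shape: {"containers": ["plex", ...]}."""
    data = json.loads(body)
    names = data.get("containers")
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected {'containers': [<name>, ...]}")
    return names

class SyncHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            names = parse_sync_request(self.rfile.read(length))
        except ValueError:  # includes json.JSONDecodeError
            self.send_response(400)
            self.end_headers()
            return
        sync_containers(names)  # hypothetical: clears stale update-status entries
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    # Bind to localhost only; expose to n8n deliberately (e.g. reverse proxy),
    # never to the wider network.
    HTTPServer(("127.0.0.1", 8909), SyncHandler).serve_forever()
```

Keeping the listener on localhost preserves the security boundary that Option B exists to protect.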
**Warning signs:**
- "Permission denied" when attempting to read/write the status file from n8n
- File-not-found errors (path doesn't exist in the container filesystem)
- n8n container has no visibility of the host filesystem

**Phase to address:**
Phase 1 (Architecture Decision) — choose integration pattern
Phase 2 (Implementation) — implement chosen pattern

---
### Pitfall 5: Unraid Update Check Triggers While Bot Is Syncing — Notification Spam

**What goes wrong:**
Bot updates container → syncs status back to Unraid → Unraid's periodic update check runs during sync → update check sees partially-written file or stale cache → sends duplicate "update available" notification to user. User receives a notification storm when updating multiple containers.

**Why it happens:**
Unraid's update check is asynchronous and periodic:
- Notification service triggers on update detection
- No debouncing for rapid state changes
- File write + cache invalidation not atomic

Community evidence:
- "Excessive notifications from unRAID" — users report notification spam
- "Duplicate notifications" — longstanding issue in the notification system
- System excludes duplicates from the archive but not from the active stream

**How to avoid:**
1. **Sync timing:** Delay sync by 10-30 seconds after update completion (let Docker events settle)
2. **Batch sync:** If updating multiple containers, sync all at once (not per-container)
3. **Cache invalidation signal:** If the Unraid API provides cache invalidation, trigger it AFTER all syncs complete
4. **Idempotent sync:** Delete entries (forces recalc) rather than writing new digests (avoids partial state)

Phase 2 implementation pattern:
```javascript
// In Update sub-workflow
if (responseMode === 'batch') {
  return { success: true, skipSync: true } // Sync after batch completes
}

// In main workflow (after batch completion)
const updatedContainers = [...] // Collect all updated containers
await syncAllToUnraid(updatedContainers) // Single sync operation
```

**Warning signs:**
- Multiple "update available" notifications for the same container within 1 minute
- Notifications triggered immediately after bot update completes
- Unraid notification log shows duplicate entries with close timestamps

**Phase to address:**
Phase 2 (Sync Implementation) — add batch sync and timing delays
Phase 3 (UAT) — verify no notification spam during batch updates

---
### Pitfall 6: n8n Workflow State Doesn't Persist — Can't Queue Sync Operations

**What goes wrong:**
The developer assumes n8n workflow static data persists between executions (like the Phase 10.2 error-logging attempt) and builds a queue of "pending syncs" to batch them. The queue is lost between workflow executions. Each update then triggers an immediate sync attempt → file access contention, race conditions.

**Why it happens:**
Known limitation from STATE.md:
> **n8n workflow static data does NOT persist between executions** (execution-scoped, not workflow-scoped)

Phase 10.2 attempted a ring buffer + debug commands — entirely removed due to this limitation.

Implications for sync-back:
- Can't queue sync operations across multiple update requests
- Can't implement a retry queue for failed syncs
- Each workflow execution is stateless

**How to avoid:**
Don't rely on workflow state for sync coordination. Options:

**Option A: Synchronous sync (simplest)**
- Update container → immediately sync (no queue)
- Atomic file operations handle contention
- Acceptable for single updates, problematic for batch

**Option B: External queue (Redis, file-based)**
- Write pending syncs to an external queue
- Separate workflow polls the queue and processes the batch
- Higher complexity, requires infrastructure

**Option C: Batch-aware sync (recommended)**
- Single updates: sync immediately
- Batch updates: collect all container IDs in the batch loop, sync once after completion
- No cross-execution state needed (the batch completes in a single execution)

Implementation in Phase 2:
```javascript
// Batch loop already collects results
const batchResults = []
for (const container of containers) {
  const result = await updateContainer(container)
  batchResults.push({ containerId: container.id, updated: result.updated })
}

// After loop completes (still in same execution):
const toSync = batchResults.filter(r => r.updated).map(r => r.containerId)
await syncToUnraid(toSync) // Single sync call
```

**Warning signs:**
- Developer adds static-data writes for a sync queue
- Testing shows the queue is empty on next execution
- Sync attempts happen per-container instead of batched

**Phase to address:**
Phase 1 (Architecture) — document stateless constraint, reject queue-based designs
Phase 2 (Implementation) — use in-execution batching, not cross-execution state

---
### Pitfall 7: Unraid's br0 Network Recreate Breaks Container Resolution After Bot Update

**What goes wrong:**
Bot updates container using Docker API (remove + create) → Unraid recreates bridge network (`br0`) → Docker network ID changes → other containers using `br0` fail to resolve the updated container by name → service disruption beyond just the updated container.

**Why it happens:**
Community report: "Unraid recreates 'br0' when the docker service restarts, and then services using 'br0' cannot be started because the ID of 'br0' has changed."

Bot update flow: `docker pull` → `docker stop` → `docker rm` → `docker run` with same config
- If the container uses a custom bridge network, recreation may trigger a network ID change
- Unraid's Docker service monitors for container lifecycle events
- Network recreation is asynchronous to container operations

**How to avoid:**
1. **Preserve network settings:** Ensure container recreation uses identical network config (Phase 2)
2. **Test network-dependent scenarios:** UAT must include containers with custom networks (Phase 3)
3. **Graceful degradation:** If a network issue is detected (container unreachable after update), log an error and notify the user
4. **Documentation:** Warn users about potential network disruption during updates (README)

Phase 2 implementation check:
- Current update sub-workflow uses Docker API recreate — verify network config preservation
- Check if `n8n-update.json` copies network settings from the old container to the new one
- Test: update a container on `br0`, verify other containers still resolve it

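Preserving the network config (point 1 above) can be sketched by carrying fields from `docker inspect` of the old container into the create payload. The key names (`HostConfig.NetworkMode`, `NetworkSettings.Networks`, `EndpointsConfig`, `Aliases`, `IPAMConfig`) are real Docker Engine API fields; the helper itself and the payload shape are illustrative.

```python
def network_create_payload(inspect_data: dict) -> dict:
    """Build the networking portion of a container-create payload from
    `docker inspect` output, so recreation keeps the same config."""
    host_config = inspect_data.get("HostConfig", {})
    networks = inspect_data.get("NetworkSettings", {}).get("Networks", {})
    return {
        "HostConfig": {"NetworkMode": host_config.get("NetworkMode", "bridge")},
        "NetworkingConfig": {
            "EndpointsConfig": {
                name: {
                    # Keep aliases and any static IP so other containers
                    # can still resolve this one after recreation.
                    "Aliases": cfg.get("Aliases") or [],
                    "IPAMConfig": cfg.get("IPAMConfig") or {},
                }
                for name, cfg in networks.items()
            }
        },
    }
```

Merging this into the full create payload before `POST /containers/create` is the easiest place to guarantee point 1.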
**Warning signs:**
|
||||
- Container starts successfully but is unreachable by hostname
|
||||
- Other containers report DNS resolution failures after update
|
||||
- `docker network ls` shows new network ID for `br0` after container update
|
||||
|
||||
**Phase to address:**
|
||||
Phase 2 (Update Flow Verification) — ensure network config preservation
|
||||
Phase 3 (UAT) — test multi-container network scenarios
|
||||
|
||||
---
## Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Skip Unraid version detection | Faster implementation | Silent breakage on Unraid upgrades | Never — version changes are documented |
| Mount `/var/lib/docker` into n8n | Direct file access | Security bypass, tight coupling, upgrade fragility | Only if helper script impossible |
| Sync immediately after update (no delay) | Simpler code | Race conditions with Unraid update check | Only for single-container updates (not batch) |
| Assume file format from one Unraid version | Works on dev system | Breaks for users on different versions | Only during Phase 1 investigation (must validate before Phase 2) |
| Write directly to status file without locking | Avoids complexity | File corruption on concurrent access | Never — use atomic operations |
| Hardcode file paths | Works today | Breaks if Unraid changes internal structure | Acceptable if combined with version detection + validation |
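
The "use atomic operations" row can be made concrete. A minimal sketch, assuming a host-side helper in Python; the status-file path comes from forum posts and is not officially documented:

```python
import json
import os
import tempfile

# Path inferred from forum posts, NOT officially documented.
STATUS_FILE = "/var/lib/docker/unraid-update-status.json"

def write_status_atomic(data: dict, path: str = STATUS_FILE) -> None:
    """Write JSON via temp file + rename so a concurrent reader (e.g.
    Unraid's own update check) never sees a half-written file.
    os.replace() is atomic on POSIX when the temp file lives on the
    same filesystem as the target, hence dir=dir_name."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap
    except BaseException:
        os.unlink(tmp_path)  # don't leave stray .tmp files behind
        raise
```

A direct `open(path, "w")` would truncate the file first, leaving a window where Unraid could read an empty or partial JSON document; the temp-file-plus-rename pattern closes that window.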
## Integration Gotchas
Common mistakes when connecting to external services.
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Unraid update status file | Assume JSON structure is stable | Validate structure before modification, reject unknown formats |
| Docker socket proxy | Expect filesystem access like Docker socket mount | Use helper script on host OR Unraid API if available |
| Unraid API (if used) | Assume unauthenticated localhost access | Check auth requirements, API key management |
| File modification timing | Write immediately after container update | Delay 5-10 seconds to avoid collision with Docker event handlers |
| Batch operations | Sync after each container update | Collect all updates, sync once after batch completes |
| Network config preservation | Assume Docker API preserves settings | Explicitly copy network settings from old container inspect to new create |
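
Since the status-file format is only inferred from forum posts (see Sources, LOW confidence), the "validate before modification" row deserves a sketch. A hypothetical validator, assuming the file is a JSON object mapping image names to per-image dicts; that shape is an assumption and must be confirmed during Phase 1:

```python
import json

def load_status_if_known_format(raw: str):
    """Return the parsed status data only if it matches the shape we
    expect; otherwise return None so the caller skips the sync instead
    of writing a structure Unraid may not understand. Expected shape
    (JSON object of image name -> per-image dict) is inferred from
    forum posts, NOT official docs — revisit in Phase 1."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not all(isinstance(k, str) and isinstance(v, dict)
               for k, v in data.items()):
        return None
    return data
```

Returning `None` on anything unrecognized is deliberate: a skipped sync is recoverable (manual "Check for Updates"), a corrupted status file is not.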
## Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Sync per container in batch | File contention, slow batch updates | Batch sync after all updates complete | 5+ containers in batch |
| Full file rewrite for each sync | High I/O, race window increases | Delete stale entries OR modify only changed entries | 10+ containers tracked |
| No retry logic for file access | Silent sync failures | Exponential backoff retry (max 3 attempts) | Concurrent Unraid update check |
| Sync blocks workflow execution | Slow Telegram responses | Async sync (fire and forget) OR move to separate workflow | 3+ second file operations |
Note: Current system has 8-15 containers (from UAT scenarios). Performance traps unlikely to manifest, but prevention is low-cost.
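
The retry row above can be sketched as a small wrapper. A minimal illustration; the `with_retry` helper is hypothetical, not existing code:

```python
import time

def with_retry(operation, attempts: int = 3, base_delay: float = 0.5):
    """Retry a file operation with exponential backoff (0.5s, 1s, 2s)
    so a collision with Unraid's own update check doesn't become a
    silent sync failure. Re-raises the last error so the caller can
    log it against the correlation ID instead of swallowing it."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In the n8n context the same idea can live in a Code node wrapping the file write, or in the host-side helper script; either way the last failure must surface, not vanish.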
## Security Mistakes

Domain-specific security issues beyond general web security.

| Mistake | Risk | Prevention |
|---------|------|------------|
| Mount entire `/var/lib/docker` into n8n | n8n gains root-level access to all Docker data | Mount only specific file OR use helper script |
| World-writable status file permissions | Any container can corrupt Unraid state | Verify file permissions, use host-side helper with proper permissions |
| No validation before writing to status file | Malformed data corrupts Unraid Docker UI | Validate JSON structure, reject unknown formats |
| Expose Unraid API key in workflow | API key visible in n8n execution logs | Use n8n credentials, not hardcoded keys |
| Execute arbitrary commands on host | Container escape vector | Whitelist allowed operations in helper script |
## UX Pitfalls

Common user experience mistakes in this domain.

| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Silent sync failure | User thinks status updated, Unraid still shows "update ready" | Log error to correlation ID, send Telegram notification on sync failure |
| No indication of sync status | User doesn't know if sync worked | Include in update success message: "Updated + synced to Unraid" |
| Sync delay causes confusion | User checks Unraid immediately, sees old status | Document 10-30 second sync delay in README troubleshooting |
| Unraid badge still shows after sync | User thinks update failed | README: explain Unraid caches aggressively, manual "Check for Updates" forces refresh |
| Batch update spam notifications | 10 updates = 10 Unraid notifications | Batch sync prevents this (if implemented correctly) |
## "Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

- [ ] **File modification:** Wrote to status file — verify atomic operation (temp file + rename, not direct write)
- [ ] **Batch sync:** Syncs after each update — verify batching for multi-container operations
- [ ] **Version compatibility:** Works on dev Unraid — verify against 6.11, 6.12, 7.0, 7.2
- [ ] **Error handling:** Sync returns success — verify retry logic for file contention
- [ ] **Network preservation:** Container starts after update — verify DNS resolution from other containers
- [ ] **Race condition testing:** Works in sequential tests — verify concurrent update + Unraid check scenario
- [ ] **Filesystem access:** Works on dev system — verify n8n container can actually reach file (or helper script exists)
- [ ] **Notification validation:** No duplicate notifications in single test — verify batch scenario (5+ containers)
## Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Corrupted status file | LOW | Delete `/var/lib/docker/unraid-update-status.json`, Unraid recreates on next update check |
| State desync (Unraid shows stale) | LOW | Manual "Check for Updates" in Unraid UI forces recalculation |
| Unraid version breaks format | MEDIUM | Disable sync feature via feature flag, update sync logic for new format |
| Network resolution broken | MEDIUM | Restart Docker service in Unraid (`Settings -> Docker -> Enable: No -> Yes`) |
| File permission errors | LOW | Helper script with proper permissions, OR mount file read-only + use API |
| n8n can't reach status file | HIGH | Architecture change required (add helper script OR switch to API) |
| Notification spam | LOW | Unraid notification settings: disable Docker update notifications temporarily |
## Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| State desync (Docker API vs Unraid) | Phase 1 (Investigation) + Phase 2 (Sync) | UAT: update via bot, verify Unraid shows "up to date" |
| Race condition (concurrent access) | Phase 2 (Sync Implementation) | Stress test: simultaneous bot update + manual Unraid check |
| Unraid version compatibility | Phase 1 (Format Documentation) + Phase 3 (Multi-version UAT) | Test on Unraid 6.12, 7.0, 7.2 |
| Filesystem access from container | Phase 1 (Architecture Decision) | Deploy to prod, verify file access or helper script works |
| Notification spam | Phase 2 (Batch Sync) | UAT: batch update 5+ containers, count notifications |
| n8n state persistence assumption | Phase 1 (Architecture) | Code review: reject any `staticData` usage for sync queue |
| Network recreation (br0) | Phase 2 (Update Flow) + Phase 3 (UAT) | Test: update container on custom network, verify resolution |
## Sources
**HIGH confidence (official/authoritative):**

- [Unraid API — Docker and VM Integration](https://deepwiki.com/unraid/api/2.4.2-notification-system) — DockerService, DockerEventService architecture
- [Unraid API — Notifications Service](https://deepwiki.com/unraid/api/2.4.1-notifications-service) — Race condition handling, duplicate detection
- [Docker Socket Proxy Security](https://github.com/Tecnativa/docker-socket-proxy) — Security model, endpoint filtering
- [Docker Socket Security Critical Vulnerability Guide](https://medium.com/@instatunnel/docker-socket-security-a-critical-vulnerability-guide-76f4137a68c5) — Filesystem access risks
- [n8n Docker File System Access](https://community.n8n.io/t/file-system-access-in-docker-environment/214017) — Container filesystem limitations

**MEDIUM confidence (community-verified):**

- [Watchtower Discussion #1389](https://github.com/containrrr/watchtower/discussions/1389) — Unraid doesn't detect external updates
- [Unraid Docker Troubleshooting](https://docs.unraid.net/unraid-os/troubleshooting/common-issues/docker-troubleshooting/) — br0 network recreation issue
- [Unraid Forums: Docker Update Check](https://forums.unraid.net/topic/49041-warning-file_put_contentsvarlibdockerunraid-update-statusjson-blah/) — Status file location
- [Unraid Forums: 7.2.1 Docker Issues](https://forums.unraid.net/topic/195255-unraid-721-upgrade-seems-to-break-docker-functionalities/) — Version upgrade breaking changes

**LOW confidence (single source, needs validation):**

- File format structure (`/var/lib/docker/unraid-update-status.json`) — inferred from forum posts, not officially documented
- Unraid update check timing/frequency — user-configurable, no default documented
- Cache invalidation triggers — inferred from API docs, not explicitly tested

**Project-specific (from existing codebase):**

- STATE.md — n8n static data limitation (Phase 10.2 findings)
- ARCHITECTURE.md — Current system architecture, socket proxy usage
- CLAUDE.md — n8n workflow patterns, sub-workflow contracts
---
*Pitfalls research for: Unraid Update Status Sync*
*Researched: 2026-02-08*