unraid-docker-manager/.planning/research/PITFALLS.md

Pitfalls Research

Domain: Unraid Update Status Sync for Existing Docker Management Bot
Researched: 2026-02-08
Confidence: MEDIUM

Research combines verified Unraid architecture (HIGH confidence) with integration patterns from community sources (MEDIUM confidence). File format and API internals have LIMITED documentation — risk areas flagged for phase-specific investigation.

Critical Pitfalls

Pitfall 1: State Desync Between Docker API and Unraid's Internal Tracking

What goes wrong: After bot-initiated updates via Docker API (pull + recreate), Unraid's Docker tab continues showing "update ready" status. Unraid doesn't detect that the container was updated externally. This creates user confusion ("I just updated, why does it still show?") and leads to duplicate update attempts.

Why it happens: Unraid tracks update status through multiple mechanisms that aren't automatically synchronized with Docker API operations:

  • /var/lib/docker/unraid-update-status.json — cached update status file (stale after external updates)
  • DockerManifestService cache — compares local image digests to registry manifests
  • Real-time DockerEventService — monitors Docker daemon events but doesn't trigger update status recalculation

The bot bypasses Unraid's template system entirely, so Unraid "probably doesn't check if a container has magically been updated and change its UI" (watchtower discussion).

How to avoid: Phase 1 (Investigation) must determine ALL state locations:

  1. Verify update status file format — inspect /var/lib/docker/unraid-update-status.json structure (undocumented, requires reverse engineering)
  2. Document cache invalidation triggers — what causes DockerManifestService to recompute?
  3. Test event-based refresh — does recreating a container trigger update check, or only on manual "Check for Updates"?

Phase 2 (Sync Implementation) options (in order of safety):

  • Option A (safest): Delete stale entries from unraid-update-status.json for updated containers (forces recalculation on next check)
  • Option B (if A insufficient): Call Unraid API update check endpoint after bot updates (triggers full recalc)
  • Option C (last resort): Directly modify unraid-update-status.json with current digest (highest risk of corruption)
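Option A can be sketched as a small Python routine. The file structure here is an assumption (a flat JSON object keyed by container name, which Phase 1 must verify against a real file), and the temp-file-plus-rename write anticipates the concurrency concerns in Pitfall 2:

```python
import json
import os
import tempfile

def clear_update_status(status_path, container_names):
    """Remove entries for the given containers from the status file.

    Assumes the (undocumented) file is a JSON object keyed by container
    name -- verify against a real unraid-update-status.json in Phase 1.
    A missing file or missing keys count as already-synced, so repeated
    calls are idempotent. Returns the names actually removed.
    """
    try:
        with open(status_path) as f:
            status = json.load(f)
    except FileNotFoundError:
        return []  # nothing to clear
    removed = [n for n in container_names if status.pop(n, None) is not None]
    if removed:
        # Atomic replace: write a temp file in the same directory, then rename
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(status_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
        os.replace(tmp, status_path)
    return removed
```

Deleting rather than rewriting entries keeps the blast radius small: if the assumed structure is wrong, the worst case is a no-op rather than a corrupted digest.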

Warning signs:

  • "Apply Update" shown in Unraid UI immediately after bot reports successful update
  • Unraid notification shows update available for container that bot just updated
  • /var/lib/docker/unraid-update-status.json modified timestamp doesn't change after bot update

Phase to address:

  • Phase 1 (Investigation & File Format Analysis) — understand state structure
  • Phase 2 (Sync Implementation) — implement chosen sync strategy
  • Phase 3 (UAT) — verify sync works across Unraid versions


Pitfall 2: Race Condition Between Unraid's Periodic Update Check and Bot Sync-Back

What goes wrong: Unraid periodically checks for updates (user-configurable interval, often 15-60 minutes). If the bot writes to unraid-update-status.json while Unraid's update check is running, data corruption or lost updates occur. Symptoms: Unraid shows containers as "update ready" immediately after sync, or sync writes are silently discarded.

Why it happens: Two processes writing to the same file without coordination:

  • Unraid's update check: reads file → queries registries → writes full file
  • Bot sync: reads file → modifies entry → writes full file

If both run concurrently, last writer wins (lost update problem). No evidence of file locking in Unraid's update status handling.

How to avoid:

  1. Read-modify-write atomicity: Use file locking or atomic write (write to temp file, atomic rename)
  2. Timestamp verification: Read file, modify, check mtime before write — retry if changed
  3. Idempotent sync: Deleting entries (Option A above) is safer than modifying — delete is idempotent
  4. Rate limiting: Don't sync immediately after update — wait 5-10 seconds to avoid collision with Unraid's Docker event handler

Phase 2 implementation requirements:

  • Use Python's fcntl.flock() or atomic file operations
  • Include retry logic with exponential backoff (max 3 attempts)
  • Log all file modification failures for debugging
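The requirements above can be combined into one read-modify-write helper, sketched here. Note the advisory lock only serializes writers that honor it (i.e. the bot itself, since there is no evidence Unraid takes locks); the atomic rename is what protects concurrent readers:

```python
import fcntl
import json
import os
import tempfile
import time

def locked_modify(status_path, modify, attempts=3):
    """Read-modify-write the status file under an exclusive advisory lock.

    `modify` receives the parsed dict and mutates it in place. On lock
    contention ("Resource temporarily unavailable") we back off
    exponentially, up to `attempts` tries. Returns True on success,
    False if all attempts failed (caller should log and skip sync).
    """
    for attempt in range(attempts):
        try:
            with open(status_path, "r+") as f:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                status = json.load(f)
                modify(status)
                # Write a sibling temp file, then rename over the original
                fd, tmp = tempfile.mkstemp(dir=os.path.dirname(status_path) or ".")
                with os.fdopen(fd, "w") as out:
                    json.dump(status, out)
                os.replace(tmp, status_path)  # atomic on POSIX
                return True
        except BlockingIOError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
    return False
```

A caller would pass the mutation as a function, e.g. `locked_modify(path, lambda s: s.pop("plex", None))`, and treat a False return as a sync failure to surface to the user.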

Warning signs:

  • Sync reports success but Unraid state unchanged
  • File modification timestamp inconsistent with sync execution time
  • "Resource temporarily unavailable" errors when accessing the file

Phase to address: Phase 2 (Sync Implementation) — implement atomic file operations and retry logic


Pitfall 3: Unraid Version Compatibility — Internal Format Changes Break Integration

What goes wrong: Unraid updates change the structure of /var/lib/docker/unraid-update-status.json or introduce new update tracking mechanisms. Bot's sync logic breaks silently (no status updates) or corrupts the file (containers disappear from UI, update checks fail).

Why it happens:

  • File format is undocumented (no schema, no version field)
  • Unraid 7.x introduced major API changes (GraphQL, new DockerService architecture)
  • Past example: Unraid 6.12.8 template errors that "previously were silently ignored could cause Docker containers to fail to start"
  • No backward compatibility guarantees for internal files

Historical evidence of breaking changes:

  • Unraid 7.2.1 (Nov 2025): Docker localhost networking broke
  • Unraid 6.12.8: Docker template validation strictness increased
  • Unraid API open-sourced Jan 2025 — likely more changes incoming

How to avoid:

  1. Version detection: Read Unraid version from /etc/unraid-version or API
  2. Format validation: Before modifying file, validate expected structure (reject unknown formats)
  3. Graceful degradation: If file format unrecognized, log error and skip sync (preserve existing bot functionality)
  4. Testing matrix: Test against Unraid 6.11, 6.12, 7.0, 7.2 (Phase 3)
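Format validation (point 2) can start as a structural sanity check. The expected shape is an assumption pending the Phase 1 file-format documentation:

```python
def looks_like_status_file(status):
    """Heuristic structure check before touching the file.

    The real schema is undocumented; the assumption here is a JSON
    object whose values are per-container dicts. Anything else is
    treated as an unknown format: reject it and skip sync.
    """
    if not isinstance(status, dict):
        return False
    return all(isinstance(v, dict) for v in status.values())
```

A False result should feed straight into the graceful-degradation path (point 3): log the unrecognized format and leave the file untouched.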

Phase 1 requirements:

  • Document current file format for Unraid 7.x
  • Check Unraid forums for known format changes across versions
  • Identify version-specific differences (if any)

Phase 2 implementation:

```python
SUPPORTED_VERSIONS = ['6.11', '6.12', '7.0', '7.1', '7.2']

def sync_if_supported(containers):
    version = read_unraid_version()
    if not version_compatible(version):
        log_error(f"Unsupported Unraid version: {version}")
        return  # Skip sync, preserve existing bot functionality
    sync_to_unraid(containers)
```
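The snippet above assumes two helpers. A possible implementation, assuming /etc/unraid-version contains a line like version="7.0.1" (the exact format is inferred from forum posts and should be confirmed in Phase 1):

```python
import re
from pathlib import Path

SUPPORTED_VERSIONS = ['6.11', '6.12', '7.0', '7.1', '7.2']  # major.minor only

def read_unraid_version(path="/etc/unraid-version"):
    """Parse the Unraid version string, e.g. from: version="7.0.1"

    Returns None if the file is missing or unparseable, which
    version_compatible() treats as unsupported (fail closed).
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    match = re.search(r'version="([\d.]+)"', text)
    return match.group(1) if match else None

def version_compatible(version):
    """True if the major.minor prefix is on the tested list."""
    if version is None:
        return False
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in SUPPORTED_VERSIONS
```

Failing closed matters here: an unreadable or reformatted version file should disable sync, not let it run against an unknown Unraid release.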

Warning signs:

  • After Unraid upgrade, sync stops working (no errors, just no state change)
  • Unraid Docker tab shows errors or missing containers after bot update
  • File size changes significantly after Unraid upgrade (format change)

Phase to address:

  • Phase 1 (Investigation) — document current format, check version differences
  • Phase 2 (Implementation) — add version detection and validation
  • Phase 3 (UAT) — test across Unraid versions


Pitfall 4: Docker Socket Proxy Blocks Filesystem Access — n8n Can't Reach Unraid State Files

What goes wrong: The bot runs inside the n8n container, which accesses Docker through a socket proxy (a security layer). The socket proxy filters Docker API endpoints but provides no filesystem access, and /var/lib/docker/unraid-update-status.json lives on the Unraid host, unreachable from the n8n container.

Attempting to mount host paths into n8n violates the security boundary and creates a maintenance burden (n8n updates require preserving the mounts).

Why it happens: Current architecture (from ARCHITECTURE.md):

n8n container → docker-socket-proxy → Docker Engine

Socket proxy security model:

  • Grants specific Docker API endpoints (containers, images, exec)
  • Blocks direct filesystem access
  • n8n has no /host mount (intentional security decision)

Mounting /var/lib/docker into n8n container:

  • Bypasses socket proxy security (defeats the purpose)
  • Requires n8n container restart when file path changes
  • Couples n8n deployment to Unraid internals

How to avoid: Three architectural options (order of preference):

Option A: Unraid API Integration (cleanest, highest effort)

  • Use Unraid's native API (GraphQL or REST) if update status endpoints exist
  • Requires: API key management, authentication flow, endpoint documentation
  • Benefits: Version-safe, no direct file access, official interface
  • Risk: API may not expose update status mutation endpoints

Option B: Helper Script on Host (recommended for v1.3)

  • Small Python script runs on Unraid host (not in container)
  • n8n triggers via docker exec to host helper or webhook
  • Helper has direct filesystem access, performs sync
  • Benefits: Clean separation, no n8n filesystem access, minimal coupling
  • Implementation: .planning/research/ARCHITECTURE.md should detail this pattern

Option C: Controlled Host Mount (fallback, higher risk)

  • Mount only /var/lib/docker/unraid-update-status.json (not entire /var/lib/docker)
  • Read-only mount + separate write mechanism (requires Docker API or exec)
  • Benefits: Direct access
  • Risk: Tight coupling, version fragility
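Option B can be sketched as a small host-side CLI. The script name, path, and trigger mechanism are all assumptions to be settled in Phase 1; the important design point is the operation whitelist, so the container side can only name predefined operations, never run arbitrary host commands (see Security Mistakes below):

```python
#!/usr/bin/env python3
"""Hypothetical host-side sync helper (Option B sketch).

Runs on the Unraid host with direct filesystem access; n8n triggers it
(e.g. via SSH or a thin webhook) instead of touching host files itself.
Usage: sync-helper.py clear <container> [<container> ...]
"""
import argparse
import json
import os
import tempfile

STATUS_FILE = "/var/lib/docker/unraid-update-status.json"  # assumed path

def op_clear(names, status_file=STATUS_FILE):
    """Drop entries so Unraid recalculates them on its next update check."""
    try:
        with open(status_file) as f:
            status = json.load(f)
    except FileNotFoundError:
        return  # nothing to clear
    for name in names:
        status.pop(name, None)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(status_file) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(status, f)
    os.replace(tmp, status_file)  # atomic on POSIX

# Whitelist: the caller can only name these operations
OPERATIONS = {"clear": op_clear}

def main(argv):
    parser = argparse.ArgumentParser(description="Unraid status sync helper")
    parser.add_argument("operation", choices=sorted(OPERATIONS))
    parser.add_argument("containers", nargs="+")
    args = parser.parse_args(argv)
    OPERATIONS[args.operation](args.containers)
```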

Phase 1 must investigate:

  1. Does Unraid API expose update status endpoints? (check GraphQL schema)
  2. Can Docker exec reach host scripts? (test in current deployment)
  3. Security implications of each option

Warning signs:

  • "Permission denied" when attempting to read/write status file from n8n
  • File not found errors (path doesn't exist in container filesystem)
  • n8n container has no visibility of host filesystem

Phase to address:

  • Phase 1 (Architecture Decision) — choose integration pattern
  • Phase 2 (Implementation) — implement chosen pattern


Pitfall 5: Unraid Update Check Triggers While Bot Is Syncing — Notification Spam

What goes wrong: Bot updates container → syncs status back to Unraid → Unraid's periodic update check runs during sync → update check sees partially-written file or stale cache → sends duplicate "update available" notification to user. User receives notification storm when updating multiple containers.

Why it happens: Unraid's update check is asynchronous and periodic:

  • Notification service triggers on update detection
  • No debouncing for rapid state changes
  • File write + cache invalidation not atomic

Community evidence:

  • "Excessive notifications from unRAID" — users report notification spam
  • "Duplicate notifications" — longstanding issue in notification system
  • System excludes duplicates from archive but not from active stream

How to avoid:

  1. Sync timing: Delay sync by 10-30 seconds after update completion (let Docker events settle)
  2. Batch sync: If updating multiple containers, sync all at once (not per-container)
  3. Cache invalidation signal: If Unraid API provides cache invalidation, trigger AFTER all syncs complete
  4. Idempotent sync: Delete entries (forces recalc) rather than writing new digests (avoids partial state)

Phase 2 implementation pattern:

```javascript
// In the Update sub-workflow
if (responseMode === 'batch') {
  return { success: true, skipSync: true };  // sync happens after the batch completes
}

// In the main workflow (after batch completion)
const updatedContainers = [/* collected from batch results */];
await syncAllToUnraid(updatedContainers);  // single sync operation
```

Warning signs:

  • Multiple "update available" notifications for same container within 1 minute
  • Notifications triggered immediately after bot update completes
  • Unraid notification log shows duplicate entries with close timestamps

Phase to address:

  • Phase 2 (Sync Implementation) — add batch sync and timing delays
  • Phase 3 (UAT) — verify no notification spam during batch updates


Pitfall 6: n8n Workflow State Doesn't Persist — Can't Queue Sync Operations

What goes wrong: A developer assumes n8n workflow static data persists between executions (as in the Phase 10.2 error-logging attempt) and builds a queue of "pending syncs" to batch them. The queue is lost between workflow executions, so each update triggers an immediate sync attempt, leading to file access contention and race conditions.

Why it happens: Known limitation from STATE.md:

n8n workflow static data does NOT persist between executions (execution-scoped, not workflow-scoped)

Phase 10.2 attempted ring buffer + debug commands — entirely removed due to this limitation.

Implications for sync-back:

  • Can't queue sync operations across multiple update requests
  • Can't implement retry queue for failed syncs
  • Each workflow execution is stateless

How to avoid: Don't rely on workflow state for sync coordination. Options:

Option A: Synchronous sync (simplest)

  • Update container → immediately sync (no queue)
  • Atomic file operations handle contention
  • Acceptable for single updates, problematic for batch

Option B: External queue (Redis, file-based)

  • Write pending syncs to external queue
  • Separate workflow polls queue and processes batch
  • Higher complexity, requires infrastructure

Option C: Batch-aware sync (recommended)

  • Single updates: sync immediately
  • Batch updates: collect all container IDs in batch loop, sync once after completion
  • No cross-execution state needed (batch completes in single execution)

Implementation in Phase 2:

```javascript
// The batch loop already collects results (all within one execution)
const batchResults = [];
for (const container of containers) {
  const result = await updateContainer(container);
  batchResults.push({ containerId: container.id, updated: result.updated });
}
// After the loop completes (still in the same execution):
const toSync = batchResults.filter(r => r.updated).map(r => r.containerId);
await syncToUnraid(toSync);  // single sync call
```

Warning signs:

  • Developer adds static data writes for sync queue
  • Testing shows queue is empty on next execution
  • Sync attempts happen per-container instead of batched

Phase to address:

  • Phase 1 (Architecture) — document stateless constraint, reject queue-based designs
  • Phase 2 (Implementation) — use in-execution batching, not cross-execution state


Pitfall 7: Unraid's br0 Network Recreate Breaks Container Resolution After Bot Update

What goes wrong: Bot updates container using Docker API (remove + create) → Unraid recreates bridge network (br0) → Docker network ID changes → other containers using br0 fail to resolve updated container by name → service disruption beyond just the updated container.

Why it happens: Community report: "Unraid recreates 'br0' when the docker service restarts, and then services using 'br0' cannot be started because the ID of 'br0' has changed."

Bot update flow: docker pull → docker stop → docker rm → docker run with the same config

  • If container uses custom bridge network, recreation may trigger network ID change
  • Unraid's Docker service monitors for container lifecycle events
  • Network recreation is asynchronous to container operations

How to avoid:

  1. Preserve network settings: Ensure container recreation uses identical network config (Phase 2)
  2. Test network-dependent scenarios: UAT must include containers with custom networks (Phase 3)
  3. Graceful degradation: If network issue detected (container unreachable after update), log error and notify user
  4. Documentation: Warn users about potential network disruption during updates (README)

Phase 2 implementation check:

  • Current update sub-workflow uses Docker API recreate — verify network config preservation
  • Check if n8n-update.json copies network settings from old container to new
  • Test: update container on br0, verify other containers still resolve it
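To make the network-preservation check concrete, here is a sketch that extracts the settings worth copying from docker inspect output (the NetworkSettings.Networks structure is standard Docker Engine API). Re-attaching them needs a live daemon, so it appears only as a comment, using docker-py's connect_container_to_network():

```python
def extract_network_config(inspect_data):
    """Collect per-network settings to reapply after container recreation.

    Returns {network_name: {"aliases": [...], "ipv4_address": str or None}}.
    Note: Docker may include the old container's short ID in Aliases;
    that auto-generated alias does not need to be copied forward.

    Re-attach sketch (docker-py), per network:
        client.api.connect_container_to_network(
            new_container_id, network_name,
            aliases=cfg["aliases"], ipv4_address=cfg["ipv4_address"])
    """
    networks = inspect_data.get("NetworkSettings", {}).get("Networks", {})
    config = {}
    for name, settings in networks.items():
        ipam = settings.get("IPAMConfig") or {}
        config[name] = {
            "aliases": list(settings.get("Aliases") or []),
            "ipv4_address": ipam.get("IPv4Address"),
        }
    return config
```

Comparing this extraction before and after a bot update is also a cheap UAT assertion: the two dicts should be identical.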

Warning signs:

  • Container starts successfully but is unreachable by hostname
  • Other containers report DNS resolution failures after update
  • docker network ls shows new network ID for br0 after container update

Phase to address:

  • Phase 2 (Update Flow Verification) — ensure network config preservation
  • Phase 3 (UAT) — test multi-container network scenarios


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Skip Unraid version detection | Faster implementation | Silent breakage on Unraid upgrades | Never — version changes are documented |
| Mount /var/lib/docker into n8n | Direct file access | Security bypass, tight coupling, upgrade fragility | Only if helper script impossible |
| Sync immediately after update (no delay) | Simpler code | Race conditions with Unraid update check | Only for single-container updates (not batch) |
| Assume file format from one Unraid version | Works on dev system | Breaks for users on different versions | Only during Phase 1 investigation (must validate before Phase 2) |
| Write directly to status file without locking | Avoids complexity | File corruption on concurrent access | Never — use atomic operations |
| Hardcode file paths | Works today | Breaks if Unraid changes internal structure | Acceptable if combined with version detection + validation |

Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| Unraid update status file | Assume JSON structure is stable | Validate structure before modification, reject unknown formats |
| Docker socket proxy | Expect filesystem access like Docker socket mount | Use helper script on host OR Unraid API if available |
| Unraid API (if used) | Assume unauthenticated localhost access | Check auth requirements, API key management |
| File modification timing | Write immediately after container update | Delay 5-10 seconds to avoid collision with Docker event handlers |
| Batch operations | Sync after each container update | Collect all updates, sync once after batch completes |
| Network config preservation | Assume Docker API preserves settings | Explicitly copy network settings from old container inspect to new create |

Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| Sync per container in batch | File contention, slow batch updates | Batch sync after all updates complete | 5+ containers in batch |
| Full file rewrite for each sync | High I/O, race window increases | Delete stale entries OR modify only changed entries | 10+ containers tracked |
| No retry logic for file access | Silent sync failures | Exponential backoff retry (max 3 attempts) | Concurrent Unraid update check |
| Sync blocks workflow execution | Slow Telegram responses | Async sync (fire and forget) OR move to separate workflow | 3+ second file operations |
Note: Current system has 8-15 containers (from UAT scenarios). Performance traps unlikely to manifest, but prevention is low-cost.

Security Mistakes

Domain-specific security issues beyond general web security.

| Mistake | Risk | Prevention |
| --- | --- | --- |
| Mount entire /var/lib/docker into n8n | n8n gains root-level access to all Docker data | Mount only specific file OR use helper script |
| World-writable status file permissions | Any container can corrupt Unraid state | Verify file permissions, use host-side helper with proper permissions |
| No validation before writing to status file | Malformed data corrupts Unraid Docker UI | Validate JSON structure, reject unknown formats |
| Expose Unraid API key in workflow | API key visible in n8n execution logs | Use n8n credentials, not hardcoded keys |
| Execute arbitrary commands on host | Container escape vector | Whitelist allowed operations in helper script |

UX Pitfalls

Common user experience mistakes in this domain.

| Pitfall | User Impact | Better Approach |
| --- | --- | --- |
| Silent sync failure | User thinks status updated, Unraid still shows "update ready" | Log error to correlation ID, send Telegram notification on sync failure |
| No indication of sync status | User doesn't know if sync worked | Include in update success message: "Updated + synced to Unraid" |
| Sync delay causes confusion | User checks Unraid immediately, sees old status | Document 10-30 second sync delay in README troubleshooting |
| Unraid badge still shows after sync | User thinks update failed | README: explain Unraid caches aggressively, manual "Check for Updates" forces refresh |
| Batch update spam notifications | 10 updates = 10 Unraid notifications | Batch sync prevents this (if implemented correctly) |

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • File modification: Wrote to status file — verify atomic operation (temp file + rename, not direct write)
  • Batch sync: Syncs after each update — verify batching for multi-container operations
  • Version compatibility: Works on dev Unraid — verify against 6.11, 6.12, 7.0, 7.2
  • Error handling: Sync returns success — verify retry logic for file contention
  • Network preservation: Container starts after update — verify DNS resolution from other containers
  • Race condition testing: Works in sequential tests — verify concurrent update + Unraid check scenario
  • Filesystem access: Works on dev system — verify n8n container can actually reach file (or helper script exists)
  • Notification validation: No duplicate notifications in single test — verify batch scenario (5+ containers)

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| Corrupted status file | LOW | Delete /var/lib/docker/unraid-update-status.json; Unraid recreates it on next update check |
| State desync (Unraid shows stale) | LOW | Manual "Check for Updates" in Unraid UI forces recalculation |
| Unraid version breaks format | MEDIUM | Disable sync feature via feature flag, update sync logic for new format |
| Network resolution broken | MEDIUM | Restart Docker service in Unraid (Settings -> Docker -> Enable: No -> Yes) |
| File permission errors | LOW | Helper script with proper permissions, OR mount file read-only + use API |
| n8n can't reach status file | HIGH | Architecture change required (add helper script OR switch to API) |
| Notification spam | LOW | Unraid notification settings: disable Docker update notifications temporarily |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| State desync (Docker API vs Unraid) | Phase 1 (Investigation) + Phase 2 (Sync) | UAT: update via bot, verify Unraid shows "up to date" |
| Race condition (concurrent access) | Phase 2 (Sync Implementation) | Stress test: simultaneous bot update + manual Unraid check |
| Unraid version compatibility | Phase 1 (Format Documentation) + Phase 3 (Multi-version UAT) | Test on Unraid 6.12, 7.0, 7.2 |
| Filesystem access from container | Phase 1 (Architecture Decision) | Deploy to prod, verify file access or helper script works |
| Notification spam | Phase 2 (Batch Sync) | UAT: batch update 5+ containers, count notifications |
| n8n state persistence assumption | Phase 1 (Architecture) | Code review: reject any staticData usage for sync queue |
| Network recreation (br0) | Phase 2 (Update Flow) + Phase 3 (UAT) | Test: update container on custom network, verify resolution |

Sources

HIGH confidence (official/authoritative):

MEDIUM confidence (community-verified):

LOW confidence (single source, needs validation):

  • File format structure (/var/lib/docker/unraid-update-status.json) — inferred from forum posts, not officially documented
  • Unraid update check timing/frequency — user-configurable, no default documented
  • Cache invalidation triggers — inferred from API docs, not explicitly tested

Project-specific (from existing codebase):

  • STATE.md — n8n static data limitation (Phase 10.2 findings)
  • ARCHITECTURE.md — Current system architecture, socket proxy usage
  • CLAUDE.md — n8n workflow patterns, sub-workflow contracts

Pitfalls research for: Unraid Update Status Sync
Researched: 2026-02-08