docs: complete v1.1 research (4 researchers + synthesis)

Files:
- STACK.md: Socket proxy, n8n API, Telegram keyboards
- FEATURES.md: Table stakes, differentiators, MVP scope
- ARCHITECTURE.md: Integration points, data flow changes
- PITFALLS.md: Top 5 risks with prevention strategies
- SUMMARY.md: Executive summary, build order, confidence

Key findings:
- Stack: LinuxServer socket-proxy, HTTP Request nodes for keyboards
- Architecture: TCP curl migration (~15 nodes), new callback routes
- Critical pitfall: Socket proxy breaks existing curl commands

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Lucas Berger
2026-02-02 22:09:06 -05:00
parent ff289677ab
commit 811030cee4
5 changed files with 1614 additions and 0 deletions
@@ -0,0 +1,224 @@
# Pitfalls Research: v1.1
**Project:** Unraid Docker Manager
**Milestone:** v1.1 - n8n Integration & Polish
**Researched:** 2026-02-02
**Confidence:** MEDIUM-HIGH (verified with official docs where possible)
## Context
This research identifies pitfalls specific to **adding** these features to an existing working system:
- n8n API access (programmatic workflow read/update/test/logs)
- Docker socket proxy (security hardening)
- Telegram inline keyboards (UX improvements)
- Unraid update sync (clear "update available" badge)
**Risk focus:** Breaking existing functionality while adding new features.
---
## n8n API Access Pitfalls
| Pitfall | Warning Signs | Prevention | Phase |
|---------|---------------|------------|-------|
| **API key with full access** | API key created without scopes; all workflows accessible | Enterprise: use scoped API keys (read-only for Claude Code initially). Non-enterprise: accept risk, rotate keys every 6-12 months | API Setup |
| **Missing X-N8N-API-KEY header** | 401 Unauthorized errors on all API calls | Store API key in Claude Code MCP config; always send as `X-N8N-API-KEY` header, not Bearer token | API Setup |
| **Workflow ID mismatch after import** | API calls return 404; workflow actions fail | Workflow IDs change on import; query `/api/v1/workflows` first to get current IDs, don't hardcode | API Setup |
| **Editing active workflow via API** | Production workflow changes unexpectedly; users see partial updates | n8n 2.0: Save vs Publish are separate actions. Use API to read only; manual publish via UI | API Setup |
| **N8N_BLOCK_ENV_ACCESS_IN_NODE default** | Code nodes can't access env vars; returns undefined | n8n 2.0+ blocks env vars by default. Use credentials system instead, or explicitly set `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` | API Setup |
| **API not enabled on instance** | Connection refused on /api/v1 endpoints | Self-hosted: API is available by default. Cloud trial: API not available. Verify with `curl -H "X-N8N-API-KEY: <key>" http://localhost:5678/api/v1/workflows`; a 401 response means the API is reachable but the key is missing or wrong | API Setup |
| **Rate limiting on rapid API calls** | 429 errors when reading workflow repeatedly | Add delay between API calls (1-2 seconds); use caching for workflow data that doesn't change frequently | API Usage |
**Sources:**
- [n8n API Authentication](https://docs.n8n.io/api/authentication/)
- [n8n API Reference](https://docs.n8n.io/api/)
- [n8n v2.0 Breaking Changes](https://docs.n8n.io/2-0-breaking-changes/)
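The header and workflow ID pitfalls above can be illustrated with a minimal sketch. It assumes Node 18+ (global `fetch`), a local instance at `http://localhost:5678`, and that the list endpoint returns its workflows under a `data` array; the helper names `buildN8nRequest`, `findWorkflowId`, and `getWorkflowIdByName` are hypothetical and not part of n8n.

```javascript
// Hypothetical helpers; the base URL and the API key source are assumptions.

// Build the URL and headers for an n8n public API call.
// The key goes in the X-N8N-API-KEY header, not in an Authorization: Bearer header.
function buildN8nRequest(baseUrl, apiKey, path) {
  return {
    url: new URL("/api/v1" + path, baseUrl).toString(),
    options: {
      headers: { "X-N8N-API-KEY": apiKey, Accept: "application/json" },
    },
  };
}

// Look up the current ID by workflow name instead of hardcoding it,
// because IDs change when a workflow is imported.
function findWorkflowId(listResponse, name) {
  const match = (listResponse.data || []).find((wf) => wf.name === name);
  return match ? match.id : null;
}

// Network call, shown for context only. It checks just the first page of results.
async function getWorkflowIdByName(baseUrl, apiKey, name) {
  const { url, options } = buildN8nRequest(baseUrl, apiKey, "/workflows");
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`n8n API returned ${res.status}`);
  return findWorkflowId(await res.json(), name);
}
```

Resolving the ID at call time costs one extra request, which is why the rate-limiting row suggests caching the result.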
---
## Docker Socket Security Pitfalls
| Pitfall | Warning Signs | Prevention | Phase |
|---------|---------------|------------|-------|
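| **POST enabled more broadly than needed** | Container can create/delete containers; security scan flags | Keep `POST=0` when the workflow only reads (GET covers most read operations). This project needs `POST=1` for actions, so pair it with the narrow `ALLOW_*` flags and keep every unused section at `0` | Socket Proxy |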
| **Using `--privileged` unnecessarily** | Security audit fails; container has excessive permissions | Remove `--privileged` flag; Tecnativa proxy works without it on standard Docker | Socket Proxy |
| **Outdated socket proxy image** | Using `latest` tag which is 3+ years old | Pin to specific version: `tecnativa/docker-socket-proxy:0.2.0` or use `linuxserver/socket-proxy` | Socket Proxy |
| **Proxy port exposed publicly** | Port 2375 accessible from network; security scan fails | Never expose proxy port; run on internal Docker network only | Socket Proxy |
| **Insufficient permissions for n8n** | "Permission denied" or empty responses from Docker API | Enable minimum required: `CONTAINERS=1`, `ALLOW_START=1`, `ALLOW_STOP=1`, `ALLOW_RESTARTS=1` for actions | Socket Proxy |
| **Breaking existing curl commands** | Existing workflow fails after adding proxy; commands timeout | Socket proxy uses TCP, not Unix socket. Update curl commands: `curl http://socket-proxy:2375/...` instead of `--unix-socket` | Socket Proxy |
| **Network isolation breaks connectivity** | n8n can't reach proxy; "connection refused" errors | Both containers must be on same Docker network; verify with `docker network inspect` | Socket Proxy |
| **Permissions too restrictive** | Container list works but start/stop fails | Must explicitly enable action endpoints: `ALLOW_START=1`, `ALLOW_STOP=1`, `ALLOW_RESTARTS=1` (separate from `CONTAINERS=1`) | Socket Proxy |
| **Missing INFO or VERSION permissions** | Some Docker API calls fail unexpectedly | `VERSION=1` and `PING=1` are enabled by default; may need `INFO=1` for system queries | Socket Proxy |
**Minimum safe configuration for this project:**
```yaml
environment:
- CONTAINERS=1 # Read container info
- ALLOW_START=1 # Start containers
- ALLOW_STOP=1 # Stop containers
- ALLOW_RESTARTS=1 # Restart containers
- IMAGES=1 # Pull images (for updates)
- POST=1 # Required for start/stop/restart actions
- NETWORKS=0 # Not needed
- VOLUMES=0 # Not needed
- BUILD=0 # Not needed
- COMMIT=0 # Not needed
- CONFIGS=0 # Not needed
- SECRETS=0 # Security critical - keep disabled
- EXEC=0 # Security critical - keep disabled
- AUTH=0 # Security critical - keep disabled
```
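A configuration like the one above is easy to loosen by accident during debugging. The following is a rough sketch of a check that could run before deployment; the required and forbidden flag lists mirror this document's recommendation only, and the helper names `parseProxyEnv` and `checkProxyEnv` are hypothetical.

```javascript
// Flag lists taken from the minimum configuration in this document (an assumption
// about this project, not a general rule for every socket proxy deployment).
const REQUIRED_ON = ["CONTAINERS", "ALLOW_START", "ALLOW_STOP", "ALLOW_RESTARTS", "IMAGES", "POST"];
const MUST_BE_OFF = ["SECRETS", "EXEC", "AUTH"];

// Parse lines such as "- CONTAINERS=1 # Read container info" into { CONTAINERS: "1" }.
function parseProxyEnv(yamlFragment) {
  const env = {};
  for (const line of yamlFragment.split("\n")) {
    const match = line.match(/^\s*-\s*([A-Z_]+)=(\S+)/);
    if (match) env[match[1]] = match[2];
  }
  return env;
}

// Return a list of human-readable problems; an empty list means the config passes.
// Unset security-critical flags are treated as off.
function checkProxyEnv(env) {
  const problems = [];
  for (const name of REQUIRED_ON) {
    if (env[name] !== "1") problems.push(`${name} should be 1`);
  }
  for (const name of MUST_BE_OFF) {
    if (env[name] === "1") problems.push(`${name} must stay 0`);
  }
  return problems;
}
```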
**Sources:**
- [Tecnativa docker-socket-proxy](https://github.com/Tecnativa/docker-socket-proxy)
- [LinuxServer socket-proxy](https://docs.linuxserver.io/images/docker-socket-proxy/)
- [Docker Community Forums - Socket Proxy Security](https://forums.docker.com/t/does-a-docker-socket-proxy-improve-security/136305)
---
## Telegram Keyboard Pitfalls
| Pitfall | Warning Signs | Prevention | Phase |
|---------|---------------|------------|-------|
| **Native node rejects dynamic keyboards** | Error: "The value '[[...]]' is not supported!" | Use HTTP Request node for inline keyboards instead of native Telegram node; this is a known n8n limitation | Keyboards |
| **callback_data exceeds 64 bytes** | Buttons don't respond; no callback_query received; 400 BUTTON_DATA_INVALID | Use short codes: `s:plex` not `start_container:plex-media-server`. Hash long names to 8-char IDs | Keyboards |
| **Callback auth path missing** | Keyboard clicks ignored; no response to button press | Existing workflow already handles callback_query (lines 56-74 in workflow). Ensure new keyboards use same auth flow | Keyboards |
| **Multiple additional fields ignored** | Button has both callback_data and URL; only URL works | n8n Telegram node limitation - can't use both. Choose one per button: either action (callback) or link (URL) | Keyboards |
| **Keyboard flickers on every message** | Visual glitches; keyboard re-renders constantly | Send `reply_markup` only on /start or menu requests; omit from action responses (keyboard persists) | Keyboards |
| **Inline vs Reply keyboard confusion** | Wrong keyboard type appears; buttons don't trigger callbacks | Inline keyboards (InlineKeyboardMarkup) for callbacks; Reply keyboards (ReplyKeyboardMarkup) for persistent menus. Use inline for container actions | Keyboards |
| **answerCallbackQuery not called** | "Loading..." spinner persists after button click; Telegram shows timeout | Must call `answerCallbackQuery` within 10 seconds of receiving callback_query, even if just to acknowledge | Keyboards |
| **Button layout exceeds limits** | Buttons don't appear; API error | Bot API 7.0: max 100 buttons total per message. For container lists, paginate or limit to 8-10 buttons | Keyboards |
**Recommended keyboard structure for container actions:**
```javascript
// Short callback_data pattern: action:container_short_id
// Example: "s:abc12345" for start, "x:abc12345" for stop
// buildContainerKeyboard is an illustrative name, not an n8n or Telegram API.
function buildContainerKeyboard(containerId) {
  // An 8-character ID prefix keeps callback_data far below the 64-byte limit.
  const shortId = containerId.slice(0, 8);
  return {
    inline_keyboard: [
      [
        { text: "Start", callback_data: "s:" + shortId },
        { text: "Stop", callback_data: "x:" + shortId }
      ],
      [
        { text: "Restart", callback_data: "r:" + shortId },
        { text: "Logs", callback_data: "l:" + shortId }
      ]
    ]
  };
}
```
**Sources:**
- [n8n GitHub Issue #19955 - Inline Keyboard Expression](https://github.com/n8n-io/n8n/issues/19955)
- [n8n Telegram Callback Operations](https://docs.n8n.io/integrations/builtin/app-nodes/n8n-nodes-base.telegram/callback-operations/)
- [Telegram Bot API - InlineKeyboardButton](https://core.telegram.org/bots/api#inlinekeyboardbutton)
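Since the native node cannot take a dynamic keyboard, the request has to be assembled by hand for the HTTP Request node. The sketch below shows one possible shape for the two Bot API calls involved (`sendMessage` and `answerCallbackQuery`), plus a byte-length guard for `callback_data`. The helper names are hypothetical, and the returned objects are plain data rather than an n8n node configuration.

```javascript
const TELEGRAM_API = "https://api.telegram.org";
const MAX_CALLBACK_BYTES = 64; // Bot API limit for callback_data

// The limit is in bytes, not characters, so measure the UTF-8 encoding.
function callbackDataFits(data) {
  const bytes = new TextEncoder().encode(data).length;
  return bytes >= 1 && bytes <= MAX_CALLBACK_BYTES;
}

// Build a sendMessage request carrying an inline keyboard.
// inlineKeyboard is an array of rows, each row an array of button objects.
function buildSendMessageRequest(botToken, chatId, text, inlineKeyboard) {
  for (const row of inlineKeyboard) {
    for (const button of row) {
      if (button.callback_data !== undefined && !callbackDataFits(button.callback_data)) {
        throw new Error(`callback_data for "${button.text}" must be 1-64 bytes`);
      }
    }
  }
  return {
    method: "POST",
    url: `${TELEGRAM_API}/bot${botToken}/sendMessage`,
    body: { chat_id: chatId, text, reply_markup: { inline_keyboard: inlineKeyboard } },
  };
}

// Build the acknowledgement that stops the client-side loading indicator.
function buildAnswerCallbackRequest(botToken, callbackQueryId, text) {
  const body = { callback_query_id: callbackQueryId };
  if (text) body.text = text;
  return { method: "POST", url: `${TELEGRAM_API}/bot${botToken}/answerCallbackQuery`, body };
}
```

Validating before sending turns a silent "button does nothing" failure into an explicit error inside the workflow.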
---
## Unraid Integration Pitfalls
| Pitfall | Warning Signs | Prevention | Phase |
|---------|---------------|------------|-------|
| **Update badge persists after bot update** | Unraid UI shows "update available" after container updated via bot | Delete `/var/lib/docker/unraid-update-status.json` to force recheck; or trigger Unraid's check mechanism | Unraid Sync |
| **unraid-update-status.json format unknown** | Attempted to modify file directly; broke Unraid Docker tab | File format is undocumented. Safest approach: delete file and let Unraid regenerate. Don't modify directly | Unraid Sync |
| **Unraid only checks for new updates** | Badge never clears; only sees new updates, not cleared updates | This is known Unraid behavior. Deletion of status file is current workaround per Unraid forums | Unraid Sync |
| **Race condition on status file** | Status file deleted but badge still shows; file regenerated too fast | Wait for Unraid's update check interval, or manually trigger "Check for Updates" from Unraid UI after deletion | Unraid Sync |
| **Bot can't access Unraid filesystem** | Permission denied when accessing /var/lib/docker/ | n8n container needs additional volume mount: `/var/lib/docker:/var/lib/docker` or execute via SSH | Unraid Sync |
| **Breaking Unraid's Docker management** | Unraid Docker tab shows errors; containers appear in wrong state | Never modify Unraid's internal files (in /boot/config/docker or /var/lib/docker) except update-status.json deletion | Unraid Sync |
**Unraid sync approach (safest):**
1. Wait until the bot has successfully updated the container
2. Execute: `rm -f /var/lib/docker/unraid-update-status.json`
3. Unraid regenerates the file on the next "Check for Updates", whether triggered manually or automatically
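The steps above can be sketched in Node.js as follows. This assumes the process can see the status file (through the extra volume mount or an SSH session, as noted in the table); `clearUnraidUpdateStatus` is a hypothetical helper name, and inside n8n the same effect would more likely come from an Execute Command node running `rm -f`.

```javascript
const fs = require("fs");

// Path from this document; it must be reachable from wherever this code runs.
const UNRAID_STATUS_FILE = "/var/lib/docker/unraid-update-status.json";

// Delete the status file only after a successful update.
// Returns true when a deletion was attempted, false when it was skipped.
function clearUnraidUpdateStatus(updateSucceeded, statusFile = UNRAID_STATUS_FILE) {
  if (!updateSucceeded) return false;
  // force: true mirrors `rm -f`: a missing file is not treated as an error.
  fs.rmSync(statusFile, { force: true });
  return true;
}
```

The file is only ever deleted, never edited, which matches the warning that its format is undocumented.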
**Sources:**
- [Unraid Forums - Update notification regression](https://forums.unraid.net/bug-reports/stable-releases/regression-incorrect-docker-update-notification-r2807/)
- [Unraid Forums - Update badge persists](https://forums.unraid.net/topic/157820-docker-shows-update-ready-after-updating/)
- [Unraid Forums - Containers show update available incorrectly](https://forums.unraid.net/topic/142238-containers-show-update-available-even-when-it-is-up-to-date/)
---
## Integration Pitfalls (Breaking Existing Functionality)
| Pitfall | Warning Signs | Prevention | Phase |
|---------|---------------|------------|-------|
| **Socket proxy breaks existing curl** | All Docker commands fail after adding proxy | Existing workflow uses `--unix-socket`. Migrate curl commands to use proxy TCP endpoint: `http://socket-proxy:2375` | Socket Proxy |
| **Auth flow bypassed on new paths** | New keyboard handlers skip user ID check; anyone can click buttons | Existing workflow has auth at lines 92-122 and 126-155. Copy same pattern for any new callback handlers | All |
| **Workflow test vs production mismatch** | Works in test mode; fails when activated | Test with actual Telegram messages, not just manual execution. Production triggers differ from manual runs | All |
| **n8n 2.0 upgrade breaks workflow** | After n8n update, workflow stops working; nodes missing | n8n 2.0 has breaking changes: Execute Command disabled by default, Start node removed, env vars blocked. Check [migration guide](https://docs.n8n.io/2-0-breaking-changes/) before upgrading | All |
| **Credential reference breaks after import** | Imported workflow can't decrypt credentials; all nodes fail | n8n uses N8N_ENCRYPTION_KEY. After import, must recreate credentials manually in n8n UI | All |
| **HTTP Request node vs Execute Command** | HTTP Request can't reach Docker socket; timeout errors | HTTP Request node doesn't support Unix sockets. Keep using Execute Command with curl for Docker API (or migrate to TCP proxy) | Socket Proxy |
| **Parallel execution race conditions** | Two button clicks cause conflicting container states | Add debounce logic: ignore rapid duplicate callbacks within 2-3 seconds. Store last action timestamp | Keyboards |
| **Error workflow doesn't fire** | Errors occur but no notification; silent failures | Error Trigger only fires on automatic executions, not manual test runs. Test by triggering via Telegram with intentional failure | All |
| **Save vs Publish confusion (n8n 2.0)** | Edited workflow but production still uses old version | n8n 2.0 separates Save (preserves edits) from Publish (updates production). Must explicitly publish changes | All |
**Pre-migration checklist:**
- [ ] Export current workflow JSON as backup
- [ ] Document current curl commands and endpoints
- [ ] Test each existing command works after changes
- [ ] Verify auth flow applies to new handlers
- [ ] Test error handling triggers correctly
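For the auth item in the checklist, one way to keep message and button paths consistent is to route both through a single check. The existing workflow's auth nodes are not reproduced in this document, so the following is only a stand-in sketch; `senderId` and `isAuthorized` are hypothetical names, and the allowed ID list is an assumption.

```javascript
// Return the Telegram user ID behind an update, for both text messages and
// inline keyboard presses. Returns null when the update has no sender.
function senderId(update) {
  const from = update.callback_query?.from ?? update.message?.from;
  return from ? from.id : null;
}

// A callback_query must pass the same user ID check as a normal message.
function isAuthorized(update, allowedUserIds) {
  const id = senderId(update);
  return id !== null && allowedUserIds.includes(id);
}
```

A new callback handler that calls `isAuthorized` first cannot silently skip the user ID check that the message path already performs.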
**Sources:**
- [n8n v2.0 Breaking Changes](https://docs.n8n.io/2-0-breaking-changes/)
- [n8n Manual vs Production Executions](https://docs.n8n.io/workflows/executions/manual-partial-and-production-executions/)
- [n8n Community - Test vs Production Behavior](https://community.n8n.io/t/workflow-behaves-differently-in-test-vs-production/139973)
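The debounce suggested for parallel button clicks could look roughly like this. It is an in-memory sketch with a hypothetical `createDebouncer` helper; in n8n the timestamps would have to live somewhere that survives between executions (workflow static data is one candidate), which this example does not attempt.

```javascript
// Create a function that accepts an action key only if the same key was not
// accepted within the last windowMs milliseconds.
function createDebouncer(windowMs = 3000) {
  const lastAccepted = new Map(); // key -> timestamp of the last accepted action

  return function shouldProcess(key, now = Date.now()) {
    const previous = lastAccepted.get(key);
    if (previous !== undefined && now - previous < windowMs) {
      return false; // rapid duplicate: ignore it
    }
    lastAccepted.set(key, now);
    return true;
  };
}
```

A key such as `userId + ":" + callbackData` would treat two users pressing the same button as separate actions while still dropping one user's double click.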
---
## Summary: Top 5 Risks
Ranked by likelihood × impact for this specific milestone:
### 1. Socket Proxy Breaks Existing Commands (HIGH likelihood, HIGH impact)
**Why:** Current workflow uses `--unix-socket` flag. Socket proxy uses TCP. All existing functionality breaks if not migrated correctly.
**Prevention:**
1. Add socket proxy container first (don't remove direct socket yet)
2. Update curl commands one-by-one to use proxy
3. Test each command works via proxy
4. Only then remove direct socket mount
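Step 2 is mechanical enough to sketch. The function below assumes the existing commands have the shape `curl ... --unix-socket <path> http://localhost/<endpoint>`, which is a guess because the current workflow's commands are not shown here; `migrateCurlCommand` is a hypothetical helper, and its output should still be tested one command at a time as step 3 requires.

```javascript
// Proxy endpoint named in this document.
const PROXY_BASE = "http://socket-proxy:2375";

// Rewrite a Unix-socket curl command to target the TCP socket proxy.
// Assumed input shape:
//   curl -s --unix-socket /var/run/docker.sock http://localhost/containers/json
function migrateCurlCommand(command, proxyBase = PROXY_BASE) {
  return command
    // Drop the --unix-socket flag together with its path argument.
    .replace(/\s+--unix-socket\s+\S+/, "")
    // Point the placeholder host at the proxy, keeping the API path intact.
    .replace(/https?:\/\/localhost(?=\/)/, proxyBase);
}
```

For example, `migrateCurlCommand("curl -s --unix-socket /var/run/docker.sock http://localhost/containers/json")` returns `curl -s http://socket-proxy:2375/containers/json`.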
### 2. Native Telegram Node Rejects Dynamic Keyboards (HIGH likelihood, MEDIUM impact)
**Why:** n8n's native Telegram node has a known bug (Issue #19955) where it rejects array expressions for inline keyboards.
**Prevention:** Use HTTP Request node to call Telegram API directly for any dynamic keyboard generation. Keep native node for simple text responses only.
### 3. Unraid Update Badge Never Clears (HIGH likelihood, LOW impact)
**Why:** Unraid doesn't check for "no longer outdated" containers - only new updates. Documented behavior, not a bug.
**Prevention:** Delete `/var/lib/docker/unraid-update-status.json` after successful bot update. Requires additional volume mount or SSH access.
### 4. n8n 2.0 Breaking Changes on Upgrade (MEDIUM likelihood, HIGH impact)
**Why:** n8n 2.0 (released Dec 2025) has multiple breaking changes: Execute Command disabled by default, env vars blocked, Save/Publish separation.
**Prevention:**
1. Check current n8n version before starting
2. If upgrading, run Migration Report first (Settings > Migration Report)
3. Don't upgrade n8n during this milestone unless necessary
### 5. callback_data Exceeds 64 Bytes (MEDIUM likelihood, MEDIUM impact)
**Why:** Container names can be long (e.g., `linuxserver-plex-media-server`). Combined with a verbose action prefix such as `start_container:`, `callback_data` can exceed Telegram's 64-byte limit.
**Prevention:** Use short action codes (`s:`, `x:`, `r:`, `l:`) and container ID prefix (8 chars) instead of full names. Map back via lookup.
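The "map back via lookup" step might be sketched as below. It assumes the container list has the shape returned by Docker's container list endpoint (objects with `Id` and `Names`); `parseCallbackData` and `resolveContainer` are hypothetical names.

```javascript
// Short action codes from this document, mapped to readable action names.
const ACTIONS = new Map([
  ["s", "start"],
  ["x", "stop"],
  ["r", "restart"],
  ["l", "logs"],
]);

// Split "s:abc12345" into { action: "start", shortId: "abc12345" }.
// Returns null for unknown codes or malformed data.
function parseCallbackData(data) {
  const [code, shortId] = data.split(":");
  const action = ACTIONS.get(code);
  if (!action || !shortId) return null;
  return { action, shortId };
}

// Find the one container whose full ID starts with the short prefix.
// Returns null when nothing matches or when the prefix is ambiguous.
function resolveContainer(shortId, containers) {
  const matches = containers.filter((c) => c.Id.startsWith(shortId));
  return matches.length === 1 ? matches[0] : null;
}
```

Refusing ambiguous prefixes is a deliberate choice: acting on the wrong container is worse than asking the user to retry.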
---
## Phase Assignment Summary
| Phase | Pitfalls to Address |
|-------|---------------------|
| **API Setup** | API key scoping, header format, workflow ID discovery, env var blocking |
| **Socket Proxy** | Proxy configuration, permission settings, curl command migration, network setup |
| **Keyboards** | HTTP Request node for keyboards, callback_data limits, answerCallbackQuery |
| **Unraid Sync** | Update status file deletion, volume mount for access |
| **All Phases** | Auth flow consistency, test vs production, error workflow testing |
---
## Confidence Assessment
| Area | Confidence | Rationale |
|------|------------|-----------|
| n8n API | HIGH | Official docs verified, known breaking changes documented |
| Docker Socket Proxy | HIGH | Official Tecnativa docs, community best practices verified |
| Telegram Keyboards | MEDIUM-HIGH | n8n GitHub issues confirm limitations, Telegram API docs verified |
| Unraid Integration | MEDIUM | Forum posts describe workaround, but file format undocumented |
| Integration Risks | MEDIUM | Based on existing v1.0 codebase analysis and general patterns |
**Research date:** 2026-02-02
**Valid until:** 2026-03-02 (about one month - n8n and Telegram APIs stable)