docs: complete v1.1 research (4 researchers + synthesis)

Files:
- STACK.md: Socket proxy, n8n API, Telegram keyboards
- FEATURES.md: Table stakes, differentiators, MVP scope
- ARCHITECTURE.md: Integration points, data flow changes
- PITFALLS.md: Top 5 risks with prevention strategies
- SUMMARY.md: Executive summary, build order, confidence

Key findings:
- Stack: LinuxServer socket-proxy, HTTP Request nodes for keyboards
- Architecture: TCP curl migration (~15 nodes), new callback routes
- Critical pitfall: Socket proxy breaks existing curl commands

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:09:06 -05:00
parent ff289677ab
commit 811030cee4
5 changed files with 1614 additions and 0 deletions
@@ -0,0 +1,381 @@
+# Features Research: v1.1
+
+**Domain:** Telegram Bot for Docker Container Management
+**Researched:** 2026-02-02
+**Confidence:** MEDIUM-HIGH (WebSearch verified with official docs where available)
+
+## Telegram Inline Keyboards
+
+### Table Stakes
+
+| Feature | Why Expected | Complexity | Dependencies |
+|---------|--------------|------------|--------------|
+| Callback button handling | Core inline keyboard functionality - buttons must trigger actions | Low | Telegram Trigger already handles callback_query |
+| answerCallbackQuery response | Required by Telegram - clients show loading animation until answered (up to 1 minute) | Low | None |
+| Edit message after button press | Standard pattern - update existing message rather than send new one to reduce clutter | Low | None |
+| Container action buttons | Users expect tap-to-action for start/stop/restart without typing | Medium | Existing container matching logic |
+| Status view with action buttons | Show container list with inline buttons for each container | Medium | Existing status command |
+
+### Differentiators
+
+| Feature | Value Proposition | Complexity | Dependencies |
+|---------|-------------------|------------|--------------|
+| Confirmation dialogs for dangerous actions | "Are you sure?" before stop/restart/update prevents accidental actions | Low | None - edit message with Yes/No buttons |
+| Contextual button removal | Remove buttons after action completes (prevents double-tap issues) | Low | None |
+| Dynamic container list keyboards | Generate buttons based on actual running containers | Medium | Container listing logic |
+| Progress indicators via message edit | Update message with "Updating..." then "Complete" states | Low | None |
+| Pagination for many containers | "Next page" button when >8-10 containers | Medium | None |
+
+### Anti-features
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| Reply keyboards for actions | Takes over user keyboard space, sends visible messages to chat | Use inline keyboards attached to bot messages |
+| More than 5 buttons per row | Wraps poorly on mobile/desktop, breaks muscle memory | Max 3-4 buttons per row for container actions |
+| Complex callback_data structures | 64-byte limit, easy to exceed with JSON | Use short action codes in the `action:containername` form: `start:plex`, `stop:sonarr` |
+| Buttons without feedback | Users think tap didn't work, tap again | Always answerCallbackQuery, even for errors |
+| Auto-refreshing keyboards | High API traffic, rate limiting risk | Refresh on explicit user action only |
+
+### Implementation Notes
+
+**Critical constraint:** callback_data is limited to 64 bytes. Use short codes like `action:containername` rather than JSON structures.
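+
+A minimal sketch of the short-code scheme, assuming the `action:containername` layout above. The helper names (`make_callback_data`, `parse_callback_data`) are hypothetical and not part of the existing workflow. The check is on the encoded byte length, since the limit is in bytes rather than characters:
+
+```python
+MAX_CALLBACK_BYTES = 64  # Telegram Bot API limit for callback_data
+
+
+def make_callback_data(action: str, container: str) -> str:
+    """Build 'action:container' and reject values Telegram would refuse."""
+    data = f"{action}:{container}"
+    size = len(data.encode("utf-8"))
+    if not 1 <= size <= MAX_CALLBACK_BYTES:
+        raise ValueError(f"callback_data is {size} bytes, limit is {MAX_CALLBACK_BYTES}")
+    return data
+
+
+def parse_callback_data(data: str) -> tuple[str, str]:
+    """Split on the first ':' only, so container names may contain '_' or ':'."""
+    action, _, container = data.partition(":")
+    return action, container
+```
+
+Splitting on the first colon keeps container names such as `my_app` intact, which an underscore separator would not.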
+
+**n8n native node limitation:** The Telegram node doesn't support dynamic inline keyboards well. Workaround is HTTP Request node calling Telegram Bot API directly for `sendMessage` with `reply_markup` parameter.
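+
+As a rough illustration of the workaround, the JSON body an HTTP Request node would POST to `sendMessage` could be assembled as below. The function names, the "Restart" labels, and the three-buttons-per-row default are assumptions made for this sketch; only `chat_id`, `text`, `reply_markup`, and `inline_keyboard` come from the Telegram Bot API:
+
+```python
+import json
+
+
+def build_status_keyboard(containers: list[str], per_row: int = 3) -> dict:
+    """Return a reply_markup dict with one 'restart' button per container."""
+    buttons = [
+        {"text": f"Restart {name}", "callback_data": f"restart:{name}"}
+        for name in containers
+    ]
+    # Chunk the flat button list into rows of at most `per_row` buttons.
+    rows = [buttons[i:i + per_row] for i in range(0, len(buttons), per_row)]
+    return {"inline_keyboard": rows}
+
+
+def build_send_message_body(chat_id: int, text: str, containers: list[str]) -> str:
+    """JSON body for POST https://api.telegram.org/bot<token>/sendMessage."""
+    return json.dumps({
+        "chat_id": chat_id,
+        "text": text,
+        "reply_markup": build_status_keyboard(containers),
+    })
+```
+
+In n8n the same structure would be produced by a Code node or an expression and passed to the HTTP Request node as the request body.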
+
+**Pattern for confirmations:**
+1. User taps "Stop plex"
+2. Edit message: "Stop plex container?" with [Yes] [Cancel] buttons
+3. User taps Yes -> perform action, edit message with result, remove buttons
+4. User taps Cancel -> edit message back to original state
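+
+The four steps above can be sketched as a single dispatch function. The `confirm:` and `cancel:` prefixes and the `handle_callback` name are hypothetical. The returned dict holds the fields for the next `editMessageText` call plus the action to run, if any:
+
+```python
+DANGEROUS = {"stop", "restart", "update"}
+
+
+def handle_callback(data: str, original_text: str) -> dict:
+    """Map a callback_data value to the next message edit and optional action."""
+    parts = data.split(":", 2)
+    if parts[0] in DANGEROUS and len(parts) == 2:
+        action, name = parts
+        # Step 2: ask before acting.
+        return {
+            "text": f"{action.capitalize()} {name} container?",
+            "reply_markup": {"inline_keyboard": [[
+                {"text": "Yes", "callback_data": f"confirm:{action}:{name}"},
+                {"text": "Cancel", "callback_data": f"cancel:{action}:{name}"},
+            ]]},
+            "run": None,
+        }
+    if parts[0] == "confirm" and len(parts) == 3:
+        # Step 3: the caller runs the action, then edits in the result.
+        # An empty inline_keyboard removes the buttons from the message.
+        return {
+            "text": f"Running {parts[1]} on {parts[2]}...",
+            "reply_markup": {"inline_keyboard": []},
+            "run": (parts[1], parts[2]),
+        }
+    # Step 4 (cancel) and anything unrecognized: restore the original text.
+    # The caller re-attaches the original keyboard.
+    return {"text": original_text, "reply_markup": None, "run": None}
+```
+
+Note that the `confirm:` prefix counts against the 64-byte callback_data limit, so long container names leave less headroom here than in the first tap.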
+
+**Sources:**
+- [Telegram Bot Features](https://core.telegram.org/bots/features) (HIGH confidence)
+- [Telegram Bot API Buttons](https://core.telegram.org/api/bots/buttons) (HIGH confidence)
+- [n8n Telegram Callback Operations](https://docs.n8n.io/integrations/builtin/app-nodes/n8n-nodes-base.telegram/callback-operations/) (HIGH confidence)
+- [n8n Community: Dynamic Inline Keyboard](https://community.n8n.io/t/dynamic-inline-keyboard-for-telegram-bot/86568) (MEDIUM confidence)
+
+---
+
+## Batch Operations
+
+### Table Stakes
+
+| Feature | Why Expected | Complexity | Dependencies |
+|---------|--------------|------------|--------------|
+| Update multiple specified containers | Core batch use case - `update plex sonarr radarr` | Medium | Existing update logic, loop handling |
+| Sequential execution | Process one at a time to avoid resource contention | Low | None |
+| Per-container status feedback | "Updated plex... Updated sonarr..." progress | Low | Existing message sending |
+| Error handling per container | One failure shouldn't abort the batch | Low | Try-catch per iteration |
+| Final summary message | "3 updated, 1 failed: jellyfin" | Low | Accumulator pattern |
+
+### Differentiators
+
+| Feature | Value Proposition | Complexity | Dependencies |
+|---------|-------------------|------------|--------------|
+| "Update all" command | Single command to update everything (with confirmation) | Medium | Container listing |
+| "Update all except X" | Exclude specific containers from batch | Medium | Exclusion pattern |
+| Parallel status checks | Check which containers have updates available first | Medium | None |
+| Batch operation confirmation | Show what will happen before doing it | Low | Keyboard buttons |
+| Cancel mid-batch | Stop processing remaining containers | High | State management |
+
+### Anti-features
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| Parallel container updates | Resource contention, disk I/O saturation, network bandwidth | Sequential with progress feedback |
+| Silent batch operations | User thinks bot is frozen during long batch | Send progress message per container |
+| Update without checking first | Wastes time on already-updated containers | Check for updates, report "3 containers have updates" |
+| Auto-update on schedule | Out of scope: an update could cause downtime while the user is using the system | User-initiated only; this is a reactive tool |
+
+### Implementation Notes
+
+**Existing update flow:** Current implementation pulls image, recreates container, cleans up old image. Batch needs to wrap this in a loop.
+
+**Progress pattern:**
+```
+User: update all
+Bot: Found 5 containers with updates. Update now? [Yes] [Cancel]
+User: Yes
+Bot: Updating plex (1/5)...
+Bot: (edit) Updated plex. Updating sonarr (2/5)...
+...
+Bot: (edit) Batch complete: 5 updated, 0 failed.
+```
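+
+A minimal sketch of the loop behind that transcript, assuming the existing single-container update is available as a callable. `update_one` and `report` are placeholders for the real update flow and the Telegram message edit:
+
+```python
+from typing import Callable
+
+
+def run_batch(containers: list[str],
+              update_one: Callable[[str], None],
+              report: Callable[[str], None]) -> str:
+    """Update containers one at a time; a failure is recorded, not fatal."""
+    updated, failed = [], []
+    total = len(containers)
+    for index, name in enumerate(containers, start=1):
+        report(f"Updating {name} ({index}/{total})...")
+        try:
+            update_one(name)
+            updated.append(name)
+        except Exception:
+            # Keep going so one bad container does not abort the batch.
+            failed.append(name)
+    summary = f"Batch complete: {len(updated)} updated, {len(failed)} failed"
+    if failed:
+        summary += ": " + ", ".join(failed)
+    return summary + "."
+```
+
+The accumulator lists give the final summary message for free, and the sequential loop matches the recommendation against parallel updates.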
+
+**Watchtower-style options (NOT recommended for this bot):**
+- Watchtower does automatic updates on schedule
+- This bot is intentionally reactive (user asks, bot does)
+- Automation can cause downtime at bad times
+
+**Sources:**
+- [Watchtower Documentation](https://containrrr.dev/watchtower/) (HIGH confidence)
+- [Docker Multi-Container Apps](https://docs.docker.com/get-started/docker-concepts/running-containers/multi-container-applications/) (HIGH confidence)
+- [How to Update Docker Containers](https://phoenixnap.com/kb/update-docker-image-container) (MEDIUM confidence)
+
+---
+
+## Development API Workflow
+
+### Table Stakes
+
+| Feature | Why Expected | Complexity | Dependencies |
+|---------|--------------|------------|--------------|
+| API key authentication | Standard n8n API auth method | Low | n8n configuration |
+| Get workflow by ID | Read current workflow JSON | Low | n8n REST API |
+| Update workflow | Push modified workflow back | Low | n8n REST API |
+| Activate/deactivate workflow | Turn workflow on/off programmatically | Low | n8n REST API |
+| Get execution list | See recent runs | Low | n8n REST API |
+| Get execution details/logs | Debug failed executions | Low | n8n REST API |
+
+### Differentiators
+
+| Feature | Value Proposition | Complexity | Dependencies |
+|---------|-------------------|------------|--------------|
+| Execute workflow on demand | Trigger test run via API | Medium | n8n REST API with test data |
+| Version comparison | Diff local vs deployed workflow | High | JSON diff tooling |
+| Backup before update | Save current version before pushing changes | Low | File system or git |
+| Rollback capability | Restore previous version on failure | Medium | Version history |
+| MCP integration | Claude Code can manage workflows via MCP | High | MCP server setup |
+
+### Anti-features
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| Direct n8n database access | Bypasses API, can corrupt state | Use REST API only |
+| Credential exposure via API | API returns credential IDs, not values | Never try to extract credential values |
+| Auto-deploy on git push | Adds CI/CD complexity, not needed for single-user | Manual deploy via API call |
+| Real-time workflow editing | n8n UI is better for this | API for read/bulk operations only |
+
+### Implementation Notes
+
+**n8n REST API key endpoints:**
+
+| Operation | Method | Endpoint |
+|-----------|--------|----------|
+| List workflows | GET | `/api/v1/workflows` |
+| Get workflow | GET | `/api/v1/workflows/{id}` |
+| Update workflow | PUT | `/api/v1/workflows/{id}` |
+| Activate | POST | `/api/v1/workflows/{id}/activate` |
+| Deactivate | POST | `/api/v1/workflows/{id}/deactivate` |
+| List executions | GET | `/api/v1/executions` |
+| Get execution | GET | `/api/v1/executions/{id}` |
+| Execute workflow (internal editor endpoint, not part of the public `/api/v1` API; API-key auth may not apply) | POST | `/rest/workflows/{id}/run` |
+
+**Authentication:** Header `X-N8N-API-KEY: your_api_key`
+
+**Workflow structure:** n8n workflows are JSON documents (~3,200 lines for this bot). Key sections:
+- `nodes[]` - Array of workflow nodes
+- `connections` - How nodes connect
+- `settings` - Workflow-level settings
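+
+A sketch of the read and update calls using only the standard library. It builds the requests without sending them. The field whitelist in `UPDATE_FIELDS` is an assumption (that the update endpoint expects only editable fields rather than the full fetched document) and should be verified against the API reference; the function names are hypothetical:
+
+```python
+import json
+import urllib.request
+
+# Fields kept when pushing a workflow back. Assumed, not confirmed by the notes above.
+UPDATE_FIELDS = ("name", "nodes", "connections", "settings")
+
+
+def get_workflow_request(base_url: str, api_key: str, workflow_id: str) -> urllib.request.Request:
+    """Build (but do not send) GET /api/v1/workflows/{id}."""
+    return urllib.request.Request(
+        f"{base_url}/api/v1/workflows/{workflow_id}",
+        headers={"X-N8N-API-KEY": api_key},
+        method="GET",
+    )
+
+
+def update_workflow_request(base_url: str, api_key: str, workflow: dict) -> urllib.request.Request:
+    """Build (but do not send) PUT /api/v1/workflows/{id} from a fetched workflow."""
+    body = {key: workflow[key] for key in UPDATE_FIELDS if key in workflow}
+    return urllib.request.Request(
+        f"{base_url}/api/v1/workflows/{workflow['id']}",
+        data=json.dumps(body).encode("utf-8"),
+        headers={"X-N8N-API-KEY": api_key, "Content-Type": "application/json"},
+        method="PUT",
+    )
+```
+
+Either request can be sent with `urllib.request.urlopen(req)`. Saving the JSON returned by the GET before sending the PUT covers the "backup before update" item.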
+
+**MCP option:** There's an unofficial n8n MCP server (makafeli/n8n-workflow-builder) that could enable Claude Code to manage workflows directly, but this adds complexity. Standard REST API is simpler for v1.1.
+
+**Sources:**
+- [n8n API Documentation](https://docs.n8n.io/api/) (HIGH confidence)
+- [n8n API Reference](https://docs.n8n.io/api/api-reference/) (HIGH confidence)
+- [n8n Workflow Manager API Template](https://n8n.io/workflows/4166-n8n-workflow-manager-api/) (MEDIUM confidence)
+- [Python n8n API Guide](https://martinuke0.github.io/posts/2025-12-10-a-detailed-guide-to-using-the-n8n-api-with-python/) (MEDIUM confidence)
+
+---
+
+## Update Notification Sync
+
+### Table Stakes
+
+| Feature | Why Expected | Complexity | Dependencies |
+|---------|--------------|------------|--------------|
+| Update clears bot's "update available" state | Bot should know container is now current | Low | Already works - re-check after update |
+| Accurate update status reporting | Status command shows which have updates | Medium | Image digest comparison |
+
+### Differentiators
+
+| Feature | Value Proposition | Complexity | Dependencies |
+|---------|-------------------|------------|--------------|
+| Sync with Unraid UI | Clear "update available" badge in Unraid web UI | High | Unraid API or file manipulation |
+| Pre-update check | Show what version you're on, what version available | Medium | Image tag inspection |
+| Update notification to user | "3 containers have updates available" proactive message | Medium | Scheduled check, notification logic |
+
+### Anti-features
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| Taking over Unraid notifications | Explicitly out of scope per PROJECT.md | Keep Unraid notifications, bot is for control |
+| Proactive monitoring | Bot is reactive per PROJECT.md | User checks status manually |
+| Blocking Unraid auto-updates | User may want both systems | Coexist with Unraid's own update mechanism |
+
+### Implementation Notes
+
+**The core problem:** When you update a container via the bot (or Watchtower), Unraid's web UI may still show "update available" because it has its own tracking.
+
+**Unraid update status file:** `/var/lib/docker/unraid-update-status.json`
+- This file tracks which containers have updates
+- Deleting it forces Unraid to recheck
+- Can also trigger recheck via: Settings > Docker > Check for Updates
+
+**Unraid API (v7.2+):**
+- GraphQL API for Docker containers
+- Can query container status
+- Mutations for notifications exist
+- API key auth: `x-api-key` header
+
+**Practical approach for v1.1:**
+1. **Minimum:** Document that Unraid UI may lag behind - user can click "Check for Updates" in Unraid
+2. **Better:** After bot update, delete `/var/lib/docker/unraid-update-status.json` to force Unraid recheck
+3. **Best (requires Unraid 7.2+):** Use Unraid GraphQL API to clear notification state
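+
+Option 2 amounts to a guarded file delete. A minimal sketch, assuming the code runs somewhere that can see the Unraid host path (the notes above do not settle how the bot reaches the host filesystem); `force_unraid_recheck` is a hypothetical name:
+
+```python
+from pathlib import Path
+
+# Path from the notes above; the default is only meaningful on the Unraid host.
+STATUS_FILE = Path("/var/lib/docker/unraid-update-status.json")
+
+
+def force_unraid_recheck(status_file: Path = STATUS_FILE) -> bool:
+    """Delete the status file so that Unraid has to recheck for updates.
+
+    Returns True if a file was removed, False if there was nothing to delete.
+    """
+    if not status_file.exists():
+        return False
+    status_file.unlink()
+    return True
+```
+
+Returning a boolean lets the bot report whether a recheck was actually forced instead of failing when the file is already gone.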
+
+**Known issue:** Users report Unraid shows "update ready" even after container is updated. This is a known Unraid bug where it only checks for new updates, not whether containers are now current.
+
+**Sources:**
+- [Unraid API Documentation](https://docs.unraid.net/API/how-to-use-the-api/) (HIGH confidence)
+- [Unraid Docker Integration DeepWiki](https://deepwiki.com/unraid/api/2.4.1-docker-integration) (MEDIUM confidence)
+- [Watchtower + Unraid Discussion](https://github.com/containrrr/watchtower/discussions/1389) (MEDIUM confidence)
+- [Unraid Forum: Update Badge Issues](https://forums.unraid.net/topic/157820-docker-shows-update-ready-after-updating/) (MEDIUM confidence)
+
+---
+
+## Docker Socket Security
+
+### Table Stakes
+
+| Feature | Why Expected | Complexity | Dependencies |
+|---------|--------------|------------|--------------|
+| Remove direct socket from internet-exposed n8n | Security requirement per PROJECT.md scope | Medium | Socket proxy setup |
+| Maintain all existing functionality | Bot should work identically after security change | Medium | API compatibility |
+| Container start/stop/restart/update | Core actions must still work | Low | Proxy allows these APIs |
+| Container list/inspect | Status command must still work | Low | Proxy allows read APIs |
+| Image pull | Update command needs this | Low | Proxy configuration |
+
+### Differentiators
+
+| Feature | Value Proposition | Complexity | Dependencies |
+|---------|-------------------|------------|--------------|
+| Granular API restrictions | Only allow APIs the bot actually uses | Low | Socket proxy env vars |
+| Block dangerous APIs | Prevent exec, build, commit, and system commands | Low | Socket proxy defaults |
+| Audit logging | Log all Docker API calls through proxy | Medium | Proxy logging config |
+
+### Anti-features
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| Read-only socket mount (:ro) | Doesn't actually protect: `:ro` only makes the socket file read-only, and API requests sent through it still work | Use proper socket proxy |
+| Direct socket access from internet-facing container | Full root access if n8n is compromised | Socket proxy isolates access |
+| Allowing exec API | Enables arbitrary command execution in containers | Block exec in proxy |
+| Allowing network/volume APIs | Bot doesn't manage networks or volumes (container create is still needed, because the update flow recreates containers) | Block the unused API categories |
+
+### Implementation Notes
+
+**Recommended: Tecnativa/docker-socket-proxy or LinuxServer.io/docker-socket-proxy**
+
+Both provide HAProxy-based filtering of Docker API requests.
+
+**Minimal proxy configuration for this bot:**
+
+```yaml
+# docker-compose.yml
+services:
+  socket-proxy:
+    image: tecnativa/docker-socket-proxy
+    environment:
+      - CONTAINERS=1      # List/inspect containers
+      - IMAGES=1          # Pull images
+      - POST=1            # Allow write operations
+      - SERVICES=0        # Swarm services (not needed)
+      - TASKS=0           # Swarm tasks (not needed)
+      - NETWORKS=0        # Network management (not needed)
+      - VOLUMES=0         # Volume management (not needed)
+      - EXEC=0            # CRITICAL: Block exec
+      - BUILD=0           # CRITICAL: Block build
+      - COMMIT=0          # CRITICAL: Block commit
+      - SECRETS=0         # CRITICAL: Block secrets
+      - CONFIGS=0         # CRITICAL: Block configs
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+    networks:
+      - docker-proxy
+
+  n8n:
+    # ... existing config ...
+    environment:
+      - DOCKER_HOST=tcp://socket-proxy:2375
+    networks:
+      - docker-proxy
+      # Plus existing networks
+```
+
+**Key security benefits:**
+1. n8n no longer has direct socket access
+2. Only whitelisted API categories are available
+3. EXEC=0 prevents arbitrary command execution
+4. Proxy is on internal network only, not internet-exposed
+
+**Migration path:**
+1. Deploy socket-proxy container
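+2. Update n8n to use `DOCKER_HOST=tcp://socket-proxy:2375`, and point existing curl commands that target the Unix socket at `http://socket-proxy:2375` (curl does not read `DOCKER_HOST`)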
+3. Remove direct socket mount from n8n
+4. Test all bot commands still work
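+
+For step 4, it helps to list the Docker Engine API calls the bot makes once they go over TCP. The sketch below maps bot actions to method and URL pairs. `docker_call` is a hypothetical helper, and the host name and port are taken from the compose example above:
+
+```python
+from urllib.parse import quote, urlencode
+
+PROXY_URL = "http://socket-proxy:2375"  # service name and port from the compose example
+
+
+def docker_call(action: str, target: str = "", base: str = PROXY_URL) -> tuple[str, str]:
+    """Return (HTTP method, URL) for the Docker Engine API calls the bot uses."""
+    if action == "list":
+        return "GET", f"{base}/containers/json?all=true"
+    if action == "inspect":
+        return "GET", f"{base}/containers/{quote(target, safe='')}/json"
+    if action in ("start", "stop", "restart"):
+        return "POST", f"{base}/containers/{quote(target, safe='')}/{action}"
+    if action == "pull":
+        # target is an image reference that includes its tag, e.g. "repo/name:latest".
+        return "POST", f"{base}/images/create?" + urlencode({"fromImage": target})
+    raise ValueError(f"unsupported action: {action}")
+```
+
+Each pair corresponds to one curl invocation, for example `curl -X POST http://socket-proxy:2375/containers/plex/restart`. A request that the proxy configuration does not allow should be refused by the proxy, which is a useful negative test during migration.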
+
+**Sources:**
+- [Tecnativa docker-socket-proxy](https://github.com/Tecnativa/docker-socket-proxy) (HIGH confidence)
+- [LinuxServer.io docker-socket-proxy](https://docs.linuxserver.io/images/docker-socket-proxy/) (HIGH confidence)
+- [Docker Socket Security Guide](https://www.paulsblog.dev/how-to-secure-your-docker-environment-by-using-a-docker-socket-proxy/) (MEDIUM confidence)
+
+---
+
+## Feature Summary Table
+
+| Feature | Complexity | Dependencies | Priority | Notes |
+|---------|------------|--------------|----------|-------|
+| **Inline Keyboards** | | | | |
+| Basic callback handling | Low | Existing trigger | Must Have | Foundation for all buttons |
+| Container action buttons | Medium | Container matching | Must Have | Core UX improvement |
+| Confirmation dialogs | Low | None | Should Have | Prevents accidents |
+| Dynamic keyboard generation | Medium | HTTP Request node | Must Have | n8n native node limitation workaround |
+| **Batch Operations** | | | | |
+| Update multiple containers | Medium | Existing update | Must Have | Sequential with progress |
+| "Update all" command | Medium | Container listing | Should Have | With confirmation |
+| Per-container feedback | Low | None | Must Have | Progress visibility |
+| **n8n API** | | | | |
+| API key setup | Low | n8n config | Must Have | Enable programmatic access |
+| Read workflow | Low | REST API | Must Have | Development workflow |
+| Update workflow | Low | REST API | Must Have | Development workflow |
+| Activate/deactivate | Low | REST API | Should Have | Testing workflow |
+| **Update Sync** | | | | |
+| Delete status file | Low | SSH/exec access | Should Have | Simple Unraid sync |
+| Unraid GraphQL API | High | Unraid 7.2+, API key | Nice to Have | Requires version check |
+| **Security** | | | | |
+| Socket proxy deployment | Medium | New container | Must Have | Security requirement |
+| API restriction config | Low | Proxy env vars | Must Have | Minimize attack surface |
+| Migration testing | Low | All commands | Must Have | Verify no regression |
+
+## MVP Recommendation for v1.1
+
+**Phase 1: Foundation (Must Have)**
+1. Docker socket security via proxy - security first
+2. n8n API access setup - enables faster development
+3. Basic inline keyboard infrastructure - callback handling
+
+**Phase 2: UX Improvements (Should Have)**
+4. Container action buttons from status view
+5. Confirmation dialogs for stop/update actions
+6. Batch update with progress feedback
+
+**Phase 3: Polish (Nice to Have)**
+7. Unraid update status sync (file deletion method)
+8. "Update all" convenience command
+
+## Confidence Assessment
+
+| Area | Confidence | Reason |
+|------|------------|--------|
+| Telegram Inline Keyboards | HIGH | Official Telegram docs + n8n docs verified |
+| Batch Operations | MEDIUM-HIGH | Standard Docker patterns, well-documented |
+| n8n API | MEDIUM | API exists, but endpoint details had to be fetched from the docs and are not yet verified |
+| Unraid Update Sync | MEDIUM | Community knowledge, API docs limited |
+| Docker Socket Security | HIGH | Well-documented proxy solutions |
+
+## Gaps to Address in Phase Planning
+
+1. **Exact n8n API endpoints** - Need to verify full endpoint list during implementation
+2. **Unraid version compatibility** - GraphQL API requires Unraid 7.2+, need version check
+3. **n8n Telegram node workarounds** - HTTP Request approach needs testing
+4. **Socket proxy on Unraid** - Deployment specifics for Unraid environment