# Phase 10.2: Better Logging and Log Management - Research

**Researched:** 2026-02-08
**Domain:** n8n workflow execution logging, error tracking, and debug infrastructure
**Confidence:** HIGH
## Summary

Phase 10.2 adds centralized logging and error tracking to improve Claude's ability to diagnose issues in the n8n-based Unraid Docker Manager bot. The research reveals that n8n provides native capabilities for this exact use case: workflow static data for in-memory storage, structured error data from Error Trigger nodes, sub-workflow return patterns for error propagation, and API access to execution logs. The primary challenge is designing a trace format that makes the three specific pain points (sub-workflow data loss, callback routing confusion, execution log parsing) immediately queryable.

The standard approach combines ring buffer storage in workflow static data, structured error objects with context, correlation IDs for request tracing, and programmatic access via both Telegram commands and n8n API. This infrastructure is well-established in distributed systems observability (as of 2026) and maps cleanly to n8n's architecture.

**Primary recommendation:** Use workflow static data for ring buffer storage (50 errors), structured error objects with correlation IDs, sub-workflow error propagation via return values, and selective debug mode that captures boundary data only when enabled. Avoid over-logging; focus on the three stated pain points with targeted trace data.
<user_constraints>

## User Constraints (from CONTEXT.md)

### Locked Decisions

**Error capture & reporting:**
- Errors display inline to the user as summary + cause (e.g., "Failed to stop nginx: Docker API returned 404 (container not found)")
- Full diagnostic data (sub-workflow name, node, raw response, stack trace) captured in central error store for Claude's use
- Only report errors on user-triggered actions — no proactive/unsolicited error notifications
- Error store uses ring buffer: last 50 errors, auto-rotated
- Manual clear command also available (/clear-errors or similar, hidden/unlisted)

**Execution traceability:**
- All sub-workflows report errors back to main workflow for centralized storage
- Trace data designed for programmatic access — Claude can query it during debugging sessions
- Hidden/unlisted Telegram commands for quick error checks (e.g., /errors to see recent errors)
- File-based access also available for deep investigation during debugging sessions

**Log output & storage:**
- Error/trace data stored in n8n workflow static data (main workflow)
- Centralized in main workflow — sub-workflows report back, main stores
- Auto-rotate (ring buffer, 50 entries) + manual clear command
- Both Telegram commands (quick checks) and file/API access (deep investigation)

**Debug mode:**
- Debug mode is for Claude's use during debugging — not user-facing
- Must address three specific pain points:
  1. **Sub-workflow data loss** — capture what data was sent to and received from each sub-workflow at boundaries
  2. **Callback routing confusion** — trace which path a callback took through routing logic
  3. **n8n API execution log parsing** — make execution data easily queryable without manual workflow investigation

### Claude's Discretion

- Trace format and structure (timeline vs. data snapshots vs. both)
- Whether to trace all executions or only errors (overhead vs. usefulness)
- Structured entries vs. simple log lines (what enables best debugging)
- Debug toggle mechanism (global toggle, per-request, or always-on for errors)
- Log level granularity (on/off vs. error/warn/info)
- What specific debug data to capture (raw API responses, sub-workflow I/O, timing)
- Telegram command naming and exact interface

### Deferred Ideas (OUT OF SCOPE)

None — discussion stayed within phase scope

</user_constraints>
## Standard Stack

### Core Components

| Component | Version/Type | Purpose | Why Standard |
|-----------|--------------|---------|--------------|
| n8n Workflow Static Data | Built-in (`$getWorkflowStaticData('global')`) | In-memory ring buffer storage | Native n8n persistence mechanism, survives across executions |
| n8n Error Trigger | Built-in node type | Structured error capture | Standard n8n error handling pattern, provides rich error context |
| n8n Execute Workflow | Built-in node type | Sub-workflow communication | Existing pattern in project (7 sub-workflows deployed) |
| n8n API | `/api/v1/executions` endpoint | Programmatic execution log access | Official n8n API for querying execution history and data |
| Correlation ID | String field in trace entries | Request tracking across workflow boundaries | Industry standard for distributed tracing (OpenTelemetry pattern) |

**Note:** No external logging libraries needed. n8n's built-in capabilities are sufficient for this use case.

### Supporting Patterns

| Pattern | Implementation | Purpose | When to Use |
|---------|----------------|---------|-------------|
| Ring Buffer | JavaScript array with modulo arithmetic | Auto-rotating error store (50 entries) | Size-bounded in-memory storage |
| Structured Error Object | JSON with standard fields (timestamp, executionId, node, error, context) | Queryable error data | Always — enables programmatic access |
| Error Propagation | Sub-workflow return values include error object | Centralized error collection | When sub-workflow encounters error |
| Debug Toggle | Boolean flag in workflow static data | Enable/disable debug tracing | Claude sets via Telegram command or API |
| Correlation ID | UUID passed through sub-workflow calls | Trace single request across workflows | All sub-workflow invocations |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Workflow static data | External database (Redis, MongoDB) | External DB provides unlimited storage but adds infrastructure complexity; static data is simpler, sufficient for 50-entry ring buffer |
| Ring buffer | Append-only log with external rotation | Unlimited history but requires external storage and log rotation scripts; ring buffer is self-managing |
| n8n API access | n8n log streaming to external service | Real-time streaming but requires external log aggregator; API access is simpler for on-demand queries |
| Correlation IDs | Execution ID only | Execution ID doesn't span sub-workflows; correlation ID tracks single user request across all workflows |

**Installation:** No external packages needed. All components are n8n built-ins.
## Architecture Patterns

### Recommended Data Structure

```javascript
// Workflow static data structure
{
  "debug": {
    "enabled": false,            // Debug mode toggle
    "logLevel": "error"          // "off" | "error" | "warn" | "info" | "debug"
  },
  "errors": {
    "buffer": [                  // Ring buffer (max 50 entries)
      {
        "id": "err_001",                 // Sequential error ID
        "correlationId": "uuid-v4",      // Trace across sub-workflows
        "timestamp": "2026-02-08T10:30:00Z",
        "executionId": "12345",          // n8n execution ID
        "workflow": "main",              // "main" or sub-workflow name
        "node": "Execute Container Action",
        "operation": "docker.stop",
        "userMessage": "Failed to stop nginx: Docker API returned 404 (container not found)",
        "error": {
          "message": "Container not found",
          "stack": "Error: Container not found\n    at ...",
          "httpCode": 404,
          "rawResponse": "{\"message\":\"No such container: nginx\"}"
        },
        "context": {
          "userId": "123456789",
          "containerId": "nginx",
          "subWorkflowInput": {...},     // Data sent to sub-workflow
          "subWorkflowOutput": {...}     // Data received from sub-workflow
        }
      }
    ],
    "nextId": 2,                 // Auto-increment for error IDs
    "count": 1,                  // Total errors captured (all-time)
    "lastCleared": "2026-02-08T09:00:00Z"
  },
  "traces": {                    // Debug mode traces (only when debug.enabled = true)
    "buffer": [                  // Ring buffer (max 50 entries)
      {
        "id": "trace_001",
        "correlationId": "uuid-v4",
        "timestamp": "2026-02-08T10:29:55Z",
        "executionId": "12345",
        "event": "sub-workflow-call",
        "workflow": "n8n-actions",
        "node": "Execute Container Action",
        "data": {
          "input": {...},        // Boundary data: what was sent
          "output": {...},       // Boundary data: what was received
          "duration": 234        // Execution time in ms
        }
      },
      {
        "id": "trace_002",
        "correlationId": "uuid-v4",
        "timestamp": "2026-02-08T10:29:56Z",
        "executionId": "12345",
        "event": "callback-routing",
        "node": "Route Callback",
        "data": {
          "callbackData": "action:stop:nginx",
          "routeTaken": "single-action",   // Which switch output path
          "availableRoutes": ["cancel", "expired", "batch", "single-action"]
        }
      }
    ],
    "nextId": 3
  }
}
```
### Pattern 1: Ring Buffer Implementation

**What:** Fixed-size circular buffer that auto-rotates when full, keeping only the most recent N entries.

**When to use:** Storing errors and traces in bounded memory (workflow static data has size limits).

**Example:**
```javascript
// Code node: Add Error to Ring Buffer
const staticData = $getWorkflowStaticData('global');

// Initialize if needed
if (!staticData.errors) {
  staticData.errors = {
    buffer: [],
    nextId: 1,
    count: 0,
    lastCleared: new Date().toISOString()
  };
}

const MAX_ENTRIES = 50;
const errorEntry = {
  id: `err_${String(staticData.errors.nextId).padStart(3, '0')}`,
  correlationId: $input.item.json.correlationId || $execution.id, // Prefer the request's correlation ID; fall back to execution ID
  timestamp: new Date().toISOString(),
  executionId: $execution.id,
  workflow: 'main',
  node: 'Execute Container Action', // Name of the failing node
  operation: 'docker.stop',
  userMessage: $input.item.json.errorMessage,
  error: {
    message: $input.item.json.error.message,
    stack: $input.item.json.error.stack,
    httpCode: $input.item.json.error.httpCode,
    rawResponse: $input.item.json.error.rawResponse
  },
  context: {
    userId: $input.item.json.userId,
    containerId: $input.item.json.containerId,
    subWorkflowInput: $input.item.json.subWorkflowInput,
    subWorkflowOutput: $input.item.json.subWorkflowOutput
  }
};

// Ring buffer: add at end, remove from start if full
staticData.errors.buffer.push(errorEntry);
if (staticData.errors.buffer.length > MAX_ENTRIES) {
  staticData.errors.buffer.shift(); // Remove oldest
}

staticData.errors.nextId++;
staticData.errors.count++;

return { json: { success: true, errorId: errorEntry.id } };
```

**Source:** Ring buffer pattern from [Tucker Leach - Ring Buffer in TypeScript](https://www.tuckerleach.com/blog/ring-buffer)
### Pattern 2: Sub-workflow Error Propagation

**What:** Sub-workflows return error objects to main workflow for centralized storage.

**When to use:** All sub-workflow calls. Enables centralized error collection.

**Example:**
```javascript
// Sub-workflow (n8n-actions.json): Return error to main workflow
// Code node: Format Error Response (on error path)
return {
  json: {
    success: false,
    error: {
      message: $input.item.json.error.message,
      stack: $input.item.json.error.stack || '',
      httpCode: $input.item.json.error.httpCode || 500,
      rawResponse: $input.item.json.error.rawResponse || ''
    },
    context: {
      workflow: 'n8n-actions',
      node: 'Stop Container',
      operation: 'docker.stop',
      input: $('When executed by another workflow').item.json // What was sent to this sub-workflow
    }
  }
};

// Main workflow: Capture sub-workflow error
// IF node: Check Sub-workflow Success
{{ $('Execute Container Action').item.json.success }} equals false

// Code node: Log Error (on false path)
const subWorkflowResult = $('Execute Container Action').item.json;
const errorData = {
  errorMessage: `Failed to stop ${subWorkflowResult.context.input.containerId}: ${subWorkflowResult.error.message}`,
  error: subWorkflowResult.error,
  userId: $('Telegram Trigger').item.json.message.from.id,
  containerId: subWorkflowResult.context.input.containerId,
  subWorkflowInput: subWorkflowResult.context.input,
  subWorkflowOutput: subWorkflowResult
};

// Pass to ring buffer node
return { json: errorData };
```

**Source:** n8n sub-workflow pattern from [n8n Execute Sub-workflow docs](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.executeworkflow.md)
### Pattern 3: Correlation ID for Request Tracing

**What:** Unique ID generated at the workflow entry point, passed through all sub-workflow calls, used to correlate logs/traces for a single user request.

**When to use:** Always. Essential for tracing requests across sub-workflows.

**Example:**
```javascript
// Main workflow: Generate Correlation ID
// Code node: Initialize Request Context (early in workflow, after auth)
// Note: Code nodes can't require() external modules unless
// NODE_FUNCTION_ALLOW_EXTERNAL permits it; crypto.randomUUID() is a
// built-in global in recent Node.js and avoids the dependency.
const correlationId = crypto.randomUUID();
const requestContext = {
  correlationId,
  userId: $('Telegram Trigger').item.json.message.from.id,
  messageId: $('Telegram Trigger').item.json.message.message_id,
  timestamp: new Date().toISOString()
};

return { json: { ...requestContext, ...$input.item.json } };

// Pass correlation ID to sub-workflow
// Execute Workflow node: Execute Container Action
// Input parameters:
{{ { correlationId: $('Initialize Request Context').item.json.correlationId, ...otherParams } }}

// Debug trace: Log callback routing decision
const staticData = $getWorkflowStaticData('global');
if (staticData.debug?.enabled) {
  if (!staticData.traces) {
    staticData.traces = { buffer: [], nextId: 1 }; // Initialize on first trace
  }
  const traceEntry = {
    id: `trace_${String(staticData.traces.nextId).padStart(3, '0')}`,
    correlationId: $('Initialize Request Context').item.json.correlationId,
    timestamp: new Date().toISOString(),
    executionId: $execution.id,
    event: 'callback-routing',
    node: 'Route Callback',
    data: {
      callbackData: $input.item.json.callback_query.data,
      routeTaken: $json.routeName, // Set by switch node metadata
      availableRoutes: ['cancel', 'expired', 'batch', 'single-action']
    }
  };

  // Add to ring buffer (same pattern as errors)
  staticData.traces.buffer.push(traceEntry);
  if (staticData.traces.buffer.length > 50) {
    staticData.traces.buffer.shift();
  }
  staticData.traces.nextId++;
}
```

**Source:** Correlation ID pattern from [Microsoft Engineering Playbook - Correlation IDs](https://microsoft.github.io/code-with-engineering-playbook/observability/correlation-id/)
### Pattern 4: Debug Mode Toggle

**What:** Boolean flag in workflow static data that enables/disables debug tracing. When enabled, captures boundary data (sub-workflow I/O) and routing decisions.

**When to use:** Claude needs to diagnose issues. User doesn't see debug traces; only visible via /errors command or API.

**Example:**
```javascript
// Telegram command: /debug on|off (hidden command)
// Code node: Toggle Debug Mode
const staticData = $getWorkflowStaticData('global');
const command = $input.item.json.message.text.toLowerCase();

if (!staticData.debug) {
  staticData.debug = { enabled: false, logLevel: 'error' };
}

if (command === '/debug on') {
  staticData.debug.enabled = true;
  return { json: { message: 'Debug mode enabled. Tracing sub-workflow boundaries and callback routing.' } };
} else if (command === '/debug off') {
  staticData.debug.enabled = false;
  return { json: { message: 'Debug mode disabled.' } };
} else if (command === '/debug status') {
  return { json: {
    message: `Debug mode: ${staticData.debug.enabled ? 'ON' : 'OFF'}\nLog level: ${staticData.debug.logLevel}`
  } };
}
```
### Pattern 5: Query Errors via Telegram

**What:** Hidden command that returns recent errors in human-readable format.

**When to use:** Quick error checks during debugging sessions.

**Example:**
```javascript
// Telegram command: /errors [count] (hidden command)
// Code node: Format Error Report
const staticData = $getWorkflowStaticData('global');
const errors = staticData.errors?.buffer || [];
const requestedCount = parseInt($input.item.json.message.text.split(' ')[1], 10) || 5;

const recentErrors = errors.slice(-requestedCount).reverse();

if (recentErrors.length === 0) {
  return { json: { message: 'No errors recorded.' } };
}

let message = `📋 Recent Errors (${recentErrors.length}):\n\n`;
recentErrors.forEach(err => {
  const time = new Date(err.timestamp).toLocaleString();
  message += `🔴 ${err.id} - ${time}\n`;
  message += `Workflow: ${err.workflow} → ${err.node}\n`;
  message += `User: ${err.userMessage}\n`;
  message += `Error: ${err.error.message}\n`;
  if (err.error.httpCode) {
    message += `HTTP: ${err.error.httpCode}\n`;
  }
  message += `\n`;
});

message += `Total errors: ${staticData.errors.count}\n`;
message += `Last cleared: ${new Date(staticData.errors.lastCleared).toLocaleString()}`;

return { json: { message } };
```
### Pattern 6: n8n API Access for Deep Investigation

**What:** Use n8n API to retrieve full execution data including node inputs/outputs.

**When to use:** Deep debugging when Telegram command output isn't sufficient.

**Example:**
```bash
# Claude Code: Query recent failed executions
curl -X GET 'http://n8n:5678/api/v1/executions?status=error&limit=10' \
  -H 'X-N8N-API-KEY: <api-key>'

# Response:
{
  "data": [
    {
      "id": "12345",
      "workflowId": "1000",
      "status": "error",
      "startedAt": "2026-02-08T10:29:55Z",
      "finishedAt": "2026-02-08T10:30:00Z"
    }
  ]
}

# Get detailed execution data
curl -X GET 'http://n8n:5678/api/v1/executions/12345?includeData=true' \
  -H 'X-N8N-API-KEY: <api-key>'

# Response includes node-level data:
{
  "id": "12345",
  "data": {
    "resultData": {
      "runData": {
        "Execute Container Action": [
          {
            "startTime": "...",
            "executionTime": 234,
            "data": {
              "main": [
                [
                  {
                    "json": {
                      "success": false,
                      "error": { ... }
                    }
                  }
                ]
              ]
            }
          }
        ]
      }
    }
  }
}
```

**Source:** [n8n Executions API](https://docs.n8n.io/api/v1/executions/)
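The nested `runData` shape is awkward to scan by eye. A small helper can flatten it to per-node results and surface the failures — a sketch only: the field names follow the response shape shown above (`includeData=true`) and may differ across n8n versions, and `summarizeExecution` is a hypothetical name.

```javascript
// Sketch: flatten an execution's runData into per-node summaries,
// surfacing items that report success === false. Field names assume the
// API response shape illustrated above; verify against your n8n version.
function summarizeExecution(execution) {
  const runData = execution.data?.resultData?.runData || {};
  const summary = [];
  for (const [node, runs] of Object.entries(runData)) {
    for (const run of runs) {
      const items = run.data?.main?.[0] || [];
      const failed = items.filter(i => i.json?.success === false);
      summary.push({
        node,
        executionTime: run.executionTime,
        errors: failed.map(i => i.json.error),
      });
    }
  }
  return summary;
}

// Example against the response sketched above
const sample = {
  id: '12345',
  data: {
    resultData: {
      runData: {
        'Execute Container Action': [{
          executionTime: 234,
          data: { main: [[{ json: { success: false, error: { message: 'Container not found' } } }]] },
        }],
      },
    },
  },
};
const failures = summarizeExecution(sample);
```

Piping the API response through a helper like this turns pain point 3 (manual execution log parsing) into a one-call query.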
### Anti-Patterns to Avoid

- **Over-logging:** Don't trace every node execution — only boundaries (sub-workflow I/O) and decision points (routing). Full tracing creates noise and fills the ring buffer quickly.
- **Logging sensitive data:** Don't capture Telegram API keys, Docker socket responses with sensitive container environment variables, or user credentials in error context.
- **Unbounded storage:** Don't append errors indefinitely to workflow static data — use ring buffer with fixed size (50 entries). Static data has size limits and isn't designed for unlimited storage.
- **Synchronous API calls:** Don't call n8n API from within workflow execution for logging — too slow, creates circular dependency. Use workflow static data; query API externally (Claude Code).
- **User-facing debug output:** Don't send raw error objects or stack traces to Telegram user — only show `userMessage` field. Full diagnostic data is for Claude only.
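The "logging sensitive data" point can be enforced mechanically rather than by convention: scrub the error context before it reaches the ring buffer. A sketch — the key list here is an assumption; extend it to whatever secrets your containers actually expose.

```javascript
// Sketch: strip likely-sensitive keys from an error context before it is
// written to the ring buffer. SENSITIVE is an assumed starter list, not a
// complete inventory; adjust it to your environment.
const SENSITIVE = ['env', 'token', 'password', 'secret', 'apikey', 'authorization'];

function redact(value) {
  if (Array.isArray(value)) return value.map(v => redact(v));
  if (value && typeof value === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = SENSITIVE.includes(k.toLowerCase()) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value;
}

// Usage: errorEntry.context = redact(errorEntry.context);
```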
## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Ring buffer with manual rotation | Custom linked list, manual cleanup logic | Simple array with `push()` and `shift()` | Ring buffer with array + modulo is 10 lines of code; custom structures add complexity for zero benefit |
| Correlation ID generation | Manual timestamp-based IDs | UUID v4 (`require('uuid').v4()`) | UUIDs are guaranteed unique; custom IDs risk collisions |
| Error serialization | Custom error formatting | `JSON.stringify(error)` with try-catch | Errors aren't always JSON-serializable; need safe serialization (`error.message`, `error.stack` fields) |
| Execution log parsing | Manual n8n database queries | n8n API `/api/v1/executions` | API provides structured access; database queries are fragile and break on schema changes |
| Log aggregation service | External ELK/Splunk/Datadog | Workflow static data + n8n API | 50-entry ring buffer is sufficient for debugging; external service is over-engineering for this use case |

**Key insight:** n8n's built-in capabilities (static data, Error Trigger, API) are designed for exactly this use case. Don't add external dependencies when native features are sufficient.
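The error-serialization row deserves a concrete shape, since `JSON.stringify(new Error(...))` yields `{}` — `message` and `stack` are non-enumerable. A minimal sketch (the `httpCode` pass-through assumes errors carry that custom field, as in the schema above):

```javascript
// Sketch: convert an Error (or anything thrown) into a plain,
// stringifiable object. JSON.stringify on a raw Error drops message/stack
// because those properties are non-enumerable.
function serializeError(err) {
  if (err instanceof Error) {
    return {
      message: err.message,
      stack: (err.stack || '').substring(0, 500), // Truncate per Pitfall 3
      ...(err.httpCode !== undefined && { httpCode: err.httpCode }),
    };
  }
  // Non-Error throw values: stringify safely
  try {
    return { message: JSON.stringify(err) };
  } catch {
    return { message: String(err) };
  }
}
```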
## Common Pitfalls

### Pitfall 1: Workflow Static Data Not Persisting

**What goes wrong:** Static data cleared between executions, errors not retained.

**Why it happens:** Workflow static data only persists when the workflow is **active** (production executions, not manual test runs), and changes are saved when the execution completes successfully. If a workflow execution errors before reaching its end, static data changes are lost.

**How to avoid:**
- Ensure main workflow is active (not testing)
- Write to static data in nodes that execute **before** error occurs
- For error logging: catch errors without failing the execution (e.g., a node's on-error "continue" setting or an Error Trigger workflow) so the static data write still persists

**Warning signs:**
- `/errors` command shows no errors despite known failures
- Ring buffer resets to empty on every execution
- `nextId` counter doesn't increment

**Source:** [n8n workflow static data behavior](https://docs.n8n.io/code/cookbook/builtin/get-workflow-static-data/)
### Pitfall 2: Execution ID vs Correlation ID Confusion

**What goes wrong:** Using execution ID to trace across sub-workflows fails because each sub-workflow has its own execution ID.

**Why it happens:** n8n creates a new execution ID for each sub-workflow invocation. Single user request = multiple execution IDs (main + N sub-workflows).

**How to avoid:**
- Generate correlation ID in main workflow (UUID v4)
- Pass correlation ID to all sub-workflows as input parameter
- Use correlation ID (not execution ID) to query logs for single user request

**Warning signs:**
- Can't trace callback from callback_query through sub-workflow to result
- Errors from sub-workflows appear unrelated to main workflow execution

**Example:**
```
User request "stop nginx"
├─ Main workflow execution: executionId=12345, correlationId=uuid-abc
├─ Sub-workflow (n8n-actions): executionId=12346, correlationId=uuid-abc ← Same correlation ID
└─ Error logged with correlationId=uuid-abc ← Can query all entries for this request
```

**Source:** [Distributed tracing correlation ID pattern](https://microsoft.github.io/code-with-engineering-playbook/observability/correlation-id/)
### Pitfall 3: Static Data Size Limits

**What goes wrong:** Workflow static data grows unbounded, eventually fails with "data too large" error.

**Why it happens:** n8n stores static data in its database. Large objects (50+ entries with full rawResponse fields) can exceed database column size limits.

**How to avoid:**
- Use ring buffer (fixed size, auto-rotate)
- Limit `rawResponse` field size (truncate to 1000 chars)
- Don't store binary data or large payloads in error context
- Provide manual clear command (`/clear-errors`) for ring buffer reset

**Warning signs:**
- Workflow execution fails with database error
- Static data write operations timing out
- Execution time increases as ring buffer fills

**Mitigation:**
```javascript
// Truncate large fields before storing
error: {
  message: err.message,
  stack: err.stack?.substring(0, 500) || '',           // Limit stack trace
  rawResponse: err.rawResponse?.substring(0, 1000) || '' // Limit response
}
```

**Source:** [n8n community: static data size limits](https://community.n8n.io/t/question-about-workflowstaticdata/47029)
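Since the exact size ceiling is version- and database-dependent, one defensive option is to measure the serialized buffer and flag it before it grows risky. A sketch — the 100 KB threshold is an assumption, not a documented n8n limit:

```javascript
// Sketch: estimate the serialized size of the error buffer so it can be
// flagged before approaching database column limits. maxBytes (100 KB) is
// an assumed safety margin, not a documented n8n limit.
function bufferSizeBytes(buffer) {
  return Buffer.byteLength(JSON.stringify(buffer), 'utf8');
}

function checkBufferSize(staticData, maxBytes = 100 * 1024) {
  const size = bufferSizeBytes(staticData.errors?.buffer || []);
  return { size, overLimit: size > maxBytes };
}

// Usage: run after each write; if overLimit, shrink MAX_ENTRIES or tighten
// truncation before n8n starts rejecting the static data write.
```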
### Pitfall 4: Querying Errors by Wrong Field

**What goes wrong:** Can't find specific error when searching logs because field name assumptions are wrong.

**Why it happens:** Inconsistent field naming (e.g., `containerId` vs `container_id`, `workflow` vs `workflowName`).

**How to avoid:**
- Define standard error schema (see Architecture Patterns above)
- Use TypeScript-style interfaces as comments in Code nodes
- Validate error object structure when storing (check required fields exist)

**Warning signs:**
- `/errors` command can't filter by container or user
- Claude's queries return empty results despite known errors for that container

**Prevention:**
```javascript
// Code node: Validate Error Schema
const requiredFields = ['id', 'correlationId', 'timestamp', 'workflow', 'node', 'userMessage', 'error'];
const errorEntry = { ... };

// Validate
const missing = requiredFields.filter(field => !errorEntry[field]);
if (missing.length > 0) {
  console.error(`Missing required error fields: ${missing.join(', ')}`);
}
```
### Pitfall 5: Debug Mode Always-On Performance Impact

**What goes wrong:** Debug mode left enabled, fills ring buffer with traces, obscures actual errors.

**Why it happens:** Claude enables debug mode for investigation, forgets to disable it.

**How to avoid:**
- Default debug mode to OFF
- Auto-disable debug mode after N executions (e.g., 100)
- Include debug status in `/errors` command output
- Separate ring buffers for errors (always on) and traces (debug mode only)

**Warning signs:**
- Ring buffer fills with trace entries, pushes out error entries
- `/errors` command mostly shows traces, not actual errors
- Workflow execution noticeably slower

**Mitigation:**
```javascript
// Auto-disable debug mode after 100 executions
const staticData = $getWorkflowStaticData('global');
if (staticData.debug?.enabled) {
  staticData.debug.executionCount = (staticData.debug.executionCount || 0) + 1;

  if (staticData.debug.executionCount > 100) {
    staticData.debug.enabled = false;
    // Send notification to Claude via Telegram
    return { json: {
      message: '⚠️ Debug mode auto-disabled after 100 executions.'
    }};
  }
}
```
## Code Examples

All code examples provided in Architecture Patterns section above. Key patterns:

1. **Ring Buffer Implementation** - Add/rotate entries in workflow static data
2. **Sub-workflow Error Propagation** - Return error objects from sub-workflows
3. **Correlation ID Tracking** - Generate and pass correlation ID through calls
4. **Debug Mode Toggle** - Enable/disable tracing via Telegram command
5. **Query Errors via Telegram** - Format and display recent errors
6. **n8n API Access** - Retrieve execution data for deep investigation
## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Log to external service (Splunk, Datadog) | Store in workflow static data + query via API | 2024-2025 | n8n static data sufficient for small-scale debugging; no external dependencies |
| Trace every node execution | Trace only boundaries and decisions | 2025-2026 | Reduces noise, focuses on actionable data (distributed tracing best practices) |
| Execution ID only | Correlation ID + Execution ID | 2024-2026 | Correlation ID essential for multi-workflow tracing (OpenTelemetry pattern) |
| Manual log parsing | Structured JSON logs | 2023-2024 | Programmatic querying replaces manual log reading |
| Error Trigger to external workflow | Error propagation via return values | 2024-2025 | Centralized storage in main workflow, simpler architecture |

**Deprecated/outdated:**
- **n8n log streaming to external service:** Requires self-hosted n8n with log streaming enabled. Adds infrastructure complexity. Static data + API is simpler for debugging use case.
- **External error tracking service (Sentry, Rollbar):** Over-engineering for workflow errors. These services are for application errors in production systems, not workflow debugging.
- **Database storage for logs:** n8n already stores execution data in database. Querying via API is cleaner than direct database access (which is fragile and breaks on schema changes).

**Source:** [n8n log streaming](https://docs.n8n.io/hosting/logging-monitoring/log-streaming/) (optional feature, not required)
## Open Questions
|
||||
|
||||
### 1. **Workflow Static Data Size Limits**
|
||||
- **What we know:** Static data persists in n8n database, has size limits, can fail with "data too large" error
|
||||
- **What's unclear:** Exact size limit in bytes/entries before failure occurs
|
||||
- **Recommendation:** Conservative ring buffer size (50 entries), truncate large fields (`rawResponse` to 1000 chars), provide manual clear command. Monitor in production; reduce to 25 entries if size errors occur.
### 2. **Sub-workflow Error Context Propagation**

- **What we know:** Sub-workflows can return error objects via return values
- **What's unclear:** Do all 7 sub-workflows currently return structured responses, or do some fail silently?
- **Recommendation:** Audit the existing sub-workflows during implementation. Standardize the return format: `{ success: boolean, error?: object, data?: object }`. Update all sub-workflows to return errors rather than throwing and failing the execution.
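Hypothetical helpers for the standardized return envelope above. In practice these would live in the final Code node of each sub-workflow; the field names beyond `success`/`error`/`data` are assumptions.

```javascript
// Build the success branch of the { success, error?, data? } envelope.
function ok(data) {
  return { success: true, data };
}

// Build the failure branch: a user-facing summary plus diagnostic context
// (sub-workflow name, node, raw response, ...) for the central error store.
function fail(summary, context) {
  return {
    success: false,
    error: {
      summary,
      ...context,
      timestamp: new Date().toISOString(),
    },
  };
}
```

Usage in a sub-workflow would look like `return [fail('Failed to stop nginx: Docker API returned 404 (container not found)', { workflow: 'docker-control', node: 'Stop Container' })]`, so the caller always gets a structured item instead of a failed execution.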
### 3. **Debug Mode Performance Impact**

- **What we know:** Capturing boundary data and routing decisions adds code execution overhead
- **What's unclear:** Measurable impact on workflow execution time (milliseconds? seconds?)
- **Recommendation:** Implement debug mode with selective tracing (only the 3 pain points). Measure execution time before and after enabling debug mode. If the impact exceeds 500ms, reduce trace granularity.
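One simple way to get the before/after numbers: wrap the traced section in a timing helper and compare runs with debug on and off. A sketch, with illustrative names:

```javascript
// Run fn and report how long it took; good enough for millisecond-scale
// comparisons of debug-on vs debug-off executions.
function timed(fn) {
  const start = Date.now();
  const result = fn();
  return { result, durationMs: Date.now() - start };
}
```

Comparing `timed(runWithDebug).durationMs` against `timed(runWithoutDebug).durationMs` over a handful of executions gives a rough overhead figure to check against the 500ms threshold.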
### 4. **n8n API Rate Limits**

- **What we know:** n8n provides an API for querying executions
- **What's unclear:** Are there rate limits on API calls? Does frequent querying impact n8n performance?
- **Recommendation:** Use Telegram commands for quick checks (they read static data and never hit the API). Reserve API queries for deep investigation. If rate limits are discovered, implement query caching/throttling.
### 5. **Telegram Message Size Limits**

- **What we know:** Telegram messages have a 4096-character limit
- **What's unclear:** If the `/errors` command returns 50 errors, will the message exceed the limit?
- **Recommendation:** Paginate error output (default: last 5 errors, with an optional count parameter). Provide `/errors full` for file-based export (Telegram file upload API). Split long messages if needed.
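A sketch of the pagination and splitting above: format the last N errors, then chunk the text at Telegram's 4096-character limit. The entry shape and helper names are assumptions.

```javascript
const TELEGRAM_LIMIT = 4096; // documented max message length

// Render the last `count` error entries as one line each.
function formatErrors(buffer, count = 5) {
  return buffer
    .slice(-count)
    .map(e => `[${e.timestamp}] ${e.summary}`)
    .join('\n');
}

// Split text into Telegram-sized chunks; always returns at least one message.
function splitMessage(text, limit = TELEGRAM_LIMIT) {
  const chunks = [];
  for (let i = 0; i < text.length; i += limit) {
    chunks.push(text.slice(i, i + limit));
  }
  return chunks.length ? chunks : [''];
}
```

This splits at a fixed character offset for simplicity; splitting on the nearest newline before the limit would keep entries intact, at the cost of slightly more code.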
## Sources

### Primary (HIGH confidence)

- [n8n Workflow Static Data](https://docs.n8n.io/code/cookbook/builtin/get-workflow-static-data/) - Official docs on `$getWorkflowStaticData()`
- [n8n Error Trigger Node](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.errortrigger/) - Error data structure and usage
- [n8n Execute Sub-workflow](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.executeworkflow.md) - Sub-workflow communication patterns
- [n8n Executions API](https://docs.n8n.io/api/v1/executions/) - Querying execution data programmatically
- [n8n workflow data access](https://docs.n8n.io/code/builtin/node-execution-data/) - Accessing node data and workflow metadata
### Secondary (MEDIUM confidence)

- [Better Stack: Node.js Logging Best Practices](https://betterstack.com/community/guides/logging/nodejs-logging-best-practices/) - Structured logging patterns
- [Microsoft Engineering Playbook: Correlation IDs](https://microsoft.github.io/code-with-engineering-playbook/observability/correlation-id/) - Request tracing pattern
- [Distributed Tracing Logs (GroundCover)](https://www.groundcover.com/learn/logging/distributed-tracing-logs) - Tracing workflow debugging patterns
- [Tucker Leach: Ring Buffer in TypeScript](https://www.tuckerleach.com/blog/ring-buffer) - Ring buffer implementation
- [n8n Community: Workflow Static Data](https://community.n8n.io/t/question-about-workflowstaticdata/47029) - Static data limitations and behaviors
### Tertiary (LOW confidence)

- [n8n community: inline keyboard callback query](https://community.n8n.io/t/n8n-telegram-inline-keyboard-callback-query-workflow-example/112588) - Telegram callback patterns (referenced for callback routing context)
- [Ring buffer npm packages](https://www.npmjs.com/search?q=ring+buffer) - External libraries (not needed, but validate the pattern)
## Metadata

**Confidence breakdown:**

- **Standard stack:** HIGH - All components are n8n built-ins, well documented in the official docs
- **Architecture patterns:** HIGH - Ring buffers, correlation IDs, and structured errors are industry-standard patterns; n8n static data verified in official docs
- **Common pitfalls:** MEDIUM - Based on n8n community reports and general workflow debugging experience; specific size limits are not documented precisely
- **Code examples:** HIGH - All examples use documented n8n APIs and standard JavaScript patterns

**Research date:** 2026-02-08
**Valid until:** 2026-03-08 (30 days - stable technology stack)
## Implementation Recommendations

Based on research findings and user constraints:

### 1. Trace Format (Claude's Discretion)

**Recommendation:** Hybrid approach — structured error objects (always on) + selective debug traces (opt-in).

**Rationale:** Errors are rare and always need full context. Debug traces are verbose and only needed for specific pain points. Separate ring buffers prevent trace noise from obscuring errors.

**Structure:**

- `staticData.errors.buffer` - 50 entries, always on
- `staticData.traces.buffer` - 50 entries, only when `staticData.debug.enabled = true`
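Initializing that structure could look like the sketch below. In an n8n Code node `staticData` would be `$getWorkflowStaticData('global')`; here it is passed in so the logic is testable, and the `executionsLeft` counter is an assumption tied to the auto-disable idea.

```javascript
// Idempotent setup: creates the two ring buffers and the debug flag on first
// run, and leaves existing data untouched on later runs (??= only assigns
// when the property is null/undefined).
function initLogStore(staticData) {
  staticData.errors ??= { buffer: [] };                        // always on
  staticData.traces ??= { buffer: [] };                        // debug mode only
  staticData.debug  ??= { enabled: false, executionsLeft: 0 }; // off by default
  return staticData;
}
```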
### 2. Trace Scope (Claude's Discretion)

**Recommendation:** Trace only errors (always) + the three pain points (debug mode only).

**Pain point traces (debug mode only):**

1. **Sub-workflow boundaries:** Capture input/output at Execute Workflow nodes
2. **Callback routing:** Capture which switch path is taken in the Route Callback node
3. **n8n API queries:** No tracing needed — data queried via the API is already structured

**Rationale:** Tracing every execution creates noise. Focus on high-value data: errors (always actionable) and specific debug scenarios (when Claude needs deep visibility).
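An illustrative capture of the two traced pain points: a sub-workflow boundary and a callback-routing decision. Field names follow section 6 below; everything else is a sketch, not the actual workflow code.

```javascript
// Record sub-workflow input/output around an Execute Workflow node.
// No-op unless debug mode is on, so normal runs pay almost nothing.
function traceBoundary(staticData, subWorkflow, input, output, durationMs) {
  if (!staticData.debug?.enabled) return;
  staticData.traces.buffer.push({
    type: 'subworkflow', subWorkflow, input, output, duration: durationMs,
    timestamp: new Date().toISOString(),
  });
}

// Record which switch path the Route Callback node took and why.
function traceRoute(staticData, callbackData, routeTaken, availableRoutes) {
  if (!staticData.debug?.enabled) return;
  staticData.traces.buffer.push({
    type: 'route', callbackData, routeTaken, availableRoutes,
    timestamp: new Date().toISOString(),
  });
}
```

Putting the `enabled` check inside the trace functions keeps call sites to a single line, which matters when the same call appears before every Execute Workflow node.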
### 3. Structured vs. Simple Logs (Claude's Discretion)

**Recommendation:** Structured JSON objects.

**Rationale:** Claude needs programmatic access to query by correlation ID, workflow, node, and error type. Simple log lines require text parsing; structured objects enable direct field access.
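The payoff of structured entries is that filtering becomes a field lookup rather than text parsing. A minimal sketch, with an illustrative entry shape:

```javascript
// With structured objects, "show me everything for this request" is one filter
// over the combined error and trace buffers; no regex over log lines needed.
function queryByCorrelationId(entries, correlationId) {
  return entries.filter(e => e.correlationId === correlationId);
}
```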
### 4. Debug Toggle Mechanism (Claude's Discretion)

**Recommendation:** Global toggle via a Telegram command (`/debug on|off`) with auto-disable after 100 executions.

**Rationale:** A global toggle is simplest. Per-request debugging adds complexity (specific requests must be tagged). Always-on would fill the ring buffer with traces. Auto-disable prevents performance impact from forgotten debug mode.
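The toggle plus auto-disable can be sketched as a budget counter stored alongside the flag. The counter field and budget constant are assumptions, not part of n8n.

```javascript
const DEBUG_EXECUTION_BUDGET = 100; // auto-disable threshold from the recommendation

// Called by the /debug on|off handler.
function setDebug(staticData, enabled) {
  staticData.debug = { enabled, executionsLeft: enabled ? DEBUG_EXECUTION_BUDGET : 0 };
}

// Called once at the start of each workflow execution; returns whether
// tracing is active for this run, decrementing the budget as it goes.
function tickDebug(staticData) {
  const d = staticData.debug;
  if (!d?.enabled) return false;
  d.executionsLeft -= 1;
  if (d.executionsLeft <= 0) d.enabled = false; // budget spent: auto-disable
  return true;
}
```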
### 5. Log Level Granularity (Claude's Discretion)

**Recommendation:** Binary on/off for debug mode. Errors are always logged (no levels).

**Rationale:** Traditional log levels (error/warn/info/debug) are for application logs. Workflow debugging has two modes: normal (errors only) and debug (errors + traces). Additional levels add complexity without benefit.
### 6. Specific Debug Data to Capture (Claude's Discretion)

**Recommendation:** Minimal boundary data + routing decisions.

**Capture:**

- Sub-workflow I/O: `{ input: {...}, output: {...}, duration: 234 }`
- Callback routing: `{ callbackData: "...", routeTaken: "...", availableRoutes: [...] }`
- Docker API responses: `{ httpCode: 404, rawResponse: "..." }` (truncated to 1000 chars)

**Don't capture:**

- Every node execution (too verbose)
- Full execution data from the n8n API (query on demand, don't cache)
- User messages and Telegram webhook payloads (not relevant to the pain points)
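For the Docker API shape above, a small truncation helper keeps raw responses inside the 1000-character budget. The helper names are illustrative:

```javascript
// Cap string fields so one verbose Docker response cannot bloat the trace store.
function truncate(s, max = 1000) {
  return typeof s === 'string' && s.length > max ? s.slice(0, max) + '…' : s;
}

// Build the { httpCode, rawResponse } record from section 6's capture list.
function dockerApiTrace(httpCode, rawResponse) {
  return { httpCode, rawResponse: truncate(rawResponse) };
}
```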
### 7. Telegram Command Interface (Claude's Discretion)

**Recommendation:**

| Command | Description | Hidden? |
|---------|-------------|---------|
| `/errors [count]` | Show last N errors (default 5) | Yes (unlisted) |
| `/clear-errors` | Clear error ring buffer | Yes (unlisted) |
| `/debug on\|off\|status` | Toggle debug mode | Yes (unlisted) |
| `/trace <correlationId>` | Show all entries for a correlation ID | Yes (unlisted) |

**Rationale:** Developer/debug tools should be hidden (not in the `/help` menu). Claude can use them during debugging sessions; the user never needs to see these commands.
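A hypothetical dispatcher for the commands in the table; in the real workflow this would more likely be a Switch node keyed on the message text, with this logic spread across its branches.

```javascript
// Parse a hidden debug command into a structured action, or return null so
// non-debug messages fall through to normal routing.
function parseDebugCommand(text) {
  const [cmd, ...args] = text.trim().split(/\s+/);
  switch (cmd) {
    case '/errors':       return { action: 'errors', count: Number(args[0]) || 5 };
    case '/clear-errors': return { action: 'clearErrors' };
    case '/debug':        return { action: 'debug', mode: args[0] ?? 'status' };
    case '/trace':        return { action: 'trace', correlationId: args[0] };
    default:              return null;
  }
}
```

Returning `null` for everything unrecognized keeps the debug commands invisible: they never appear in `/help`, and unknown input is handled by the bot's normal message path.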