Phase 10.2: Better Logging and Log Management - Research
Researched: 2026-02-08
Domain: n8n workflow execution logging, error tracking, and debug infrastructure
Confidence: HIGH
Summary
Phase 10.2 adds centralized logging and error tracking to improve Claude's ability to diagnose issues in the n8n-based Unraid Docker Manager bot. The research reveals that n8n provides native capabilities for this exact use case: workflow static data for in-memory storage, structured error data from Error Trigger nodes, sub-workflow return patterns for error propagation, and API access to execution logs. The primary challenge is designing a trace format that makes the three specific pain points (sub-workflow data loss, callback routing confusion, execution log parsing) immediately queryable.
The standard approach combines ring buffer storage in workflow static data, structured error objects with context, correlation IDs for request tracing, and programmatic access via both Telegram commands and n8n API. This infrastructure is well-established in distributed systems observability (2026) and maps cleanly to n8n's architecture.
Primary recommendation: Use workflow static data for ring buffer storage (50 errors), structured error objects with correlation IDs, sub-workflow error propagation via return values, and selective debug mode that captures boundary data only when enabled. Avoid over-logging; focus on the three stated pain points with targeted trace data.
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
Error capture & reporting:
- Errors display inline to the user as summary + cause (e.g., "Failed to stop nginx: Docker API returned 404 (container not found)")
- Full diagnostic data (sub-workflow name, node, raw response, stack trace) captured in central error store for Claude's use
- Only report errors on user-triggered actions — no proactive/unsolicited error notifications
- Error store uses ring buffer: last 50 errors, auto-rotated
- Manual clear command also available (/clear-errors or similar, hidden/unlisted)
Execution traceability:
- All sub-workflows report errors back to main workflow for centralized storage
- Trace data designed for programmatic access — Claude can query it during debugging sessions
- Hidden/unlisted Telegram commands for quick error checks (e.g., /errors to see recent errors)
- File-based access also available for deep investigation during debugging sessions
Log output & storage:
- Error/trace data stored in n8n workflow static data (main workflow)
- Centralized in main workflow — sub-workflows report back, main stores
- Auto-rotate (ring buffer, 50 entries) + manual clear command
- Both Telegram commands (quick checks) and file/API access (deep investigation)
Debug mode:
- Debug mode is for Claude's use during debugging — not user-facing
- Must address three specific pain points:
- Sub-workflow data loss — capture what data was sent to and received from each sub-workflow at boundaries
- Callback routing confusion — trace which path a callback took through routing logic
- n8n API execution log parsing — make execution data easily queryable without manual workflow investigation
Claude's Discretion
- Trace format and structure (timeline vs. data snapshots vs. both)
- Whether to trace all executions or only errors (overhead vs. usefulness)
- Structured entries vs. simple log lines (what enables best debugging)
- Debug toggle mechanism (global toggle, per-request, or always-on for errors)
- Log level granularity (on/off vs. error/warn/info)
- What specific debug data to capture (raw API responses, sub-workflow I/O, timing)
- Telegram command naming and exact interface
Deferred Ideas (OUT OF SCOPE)
None — discussion stayed within phase scope
</user_constraints>
Standard Stack
Core Components
| Component | Version/Type | Purpose | Why Standard |
|---|---|---|---|
| n8n Workflow Static Data | Built-in (`$getWorkflowStaticData('global')`) | In-memory ring buffer storage | Native n8n persistence mechanism; survives across executions |
| n8n Error Trigger | Built-in node type | Structured error capture | Standard n8n error handling pattern, provides rich error context |
| n8n Execute Workflow | Built-in node type | Sub-workflow communication | Existing pattern in project (7 sub-workflows deployed) |
| n8n API | `/api/v1/executions` endpoint | Programmatic execution log access | Official n8n API for querying execution history and data |
| Correlation ID | String field in trace entries | Request tracking across workflow boundaries | Industry standard for distributed tracing (OpenTelemetry pattern) |
Note: No external logging libraries needed. n8n's built-in capabilities are sufficient for this use case.
Supporting Patterns
| Pattern | Implementation | Purpose | When to Use |
|---|---|---|---|
| Ring Buffer | JavaScript array with push/shift rotation | Auto-rotating error store (50 entries) | Size-bounded in-memory storage |
| Structured Error Object | JSON with standard fields (timestamp, executionId, node, error, context) | Queryable error data | Always — enables programmatic access |
| Error Propagation | Sub-workflow return values include error object | Centralized error collection | When sub-workflow encounters error |
| Debug Toggle | Boolean flag in workflow static data | Enable/disable debug tracing | Claude sets via Telegram command or API |
| Correlation ID | UUID passed through sub-workflow calls | Trace single request across workflows | All sub-workflow invocations |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| Workflow static data | External database (Redis, MongoDB) | External DB provides unlimited storage but adds infrastructure complexity; static data is simpler, sufficient for 50-entry ring buffer |
| Ring buffer | Append-only log with external rotation | Unlimited history but requires external storage and log rotation scripts; ring buffer is self-managing |
| n8n API access | n8n log streaming to external service | Real-time streaming but requires external log aggregator; API access is simpler for on-demand queries |
| Correlation IDs | Execution ID only | Execution ID doesn't span sub-workflows; correlation ID tracks single user request across all workflows |
Installation: No external packages needed. All components are n8n built-ins.
Architecture Patterns
Recommended Data Structure
// Workflow static data structure
{
"debug": {
"enabled": false, // Debug mode toggle
"logLevel": "error" // "off" | "error" | "warn" | "info" | "debug"
},
"errors": {
"buffer": [ // Ring buffer (max 50 entries)
{
"id": "err_001", // Sequential error ID
"correlationId": "uuid-v4", // Trace across sub-workflows
"timestamp": "2026-02-08T10:30:00Z",
"executionId": "12345", // n8n execution ID
"workflow": "main", // "main" or sub-workflow name
"node": "Execute Container Action",
"operation": "docker.stop",
"userMessage": "Failed to stop nginx: Docker API returned 404 (container not found)",
"error": {
"message": "Container not found",
"stack": "Error: Container not found\n at ...",
"httpCode": 404,
"rawResponse": "{\"message\":\"No such container: nginx\"}"
},
"context": {
"userId": "123456789",
"containerId": "nginx",
"subWorkflowInput": {...}, // Data sent to sub-workflow
"subWorkflowOutput": {...} // Data received from sub-workflow
}
}
],
"nextId": 2, // Auto-increment for error IDs
"count": 1, // Total errors captured (all-time)
"lastCleared": "2026-02-08T09:00:00Z"
},
"traces": { // Debug mode traces (only when debug.enabled = true)
"buffer": [ // Ring buffer (max 50 entries)
{
"id": "trace_001",
"correlationId": "uuid-v4",
"timestamp": "2026-02-08T10:29:55Z",
"executionId": "12345",
"event": "sub-workflow-call",
"workflow": "n8n-actions",
"node": "Execute Container Action",
"data": {
"input": {...}, // Boundary data: what was sent
"output": {...}, // Boundary data: what was received
"duration": 234 // Execution time in ms
}
},
{
"id": "trace_002",
"correlationId": "uuid-v4",
"timestamp": "2026-02-08T10:29:56Z",
"executionId": "12345",
"event": "callback-routing",
"node": "Route Callback",
"data": {
"callbackData": "action:stop:nginx",
"routeTaken": "single-action", // Which switch output path
"availableRoutes": ["cancel", "expired", "batch", "single-action"]
}
}
],
"nextId": 3
}
}
Pattern 1: Ring Buffer Implementation
What: Fixed-size circular buffer that auto-rotates when full, keeping only the most recent N entries.
When to use: Storing errors and traces in bounded memory (workflow static data has size limits).
Example:
// Code node: Add Error to Ring Buffer
const staticData = $getWorkflowStaticData('global');
// Initialize if needed
if (!staticData.errors) {
staticData.errors = {
buffer: [],
nextId: 1,
count: 0,
lastCleared: new Date().toISOString()
};
}
const MAX_ENTRIES = 50;
const errorEntry = {
id: `err_${String(staticData.errors.nextId).padStart(3, '0')}`,
correlationId: $input.item.json.correlationId || $execution.id, // Prefer the request's correlation ID (see Pattern 3); fall back to execution ID
timestamp: new Date().toISOString(),
executionId: $execution.id,
workflow: 'main',
node: 'Execute Container Action', // Name of the node where the error occurred
operation: 'docker.stop',
userMessage: $input.item.json.errorMessage,
error: {
message: $input.item.json.error.message,
stack: $input.item.json.error.stack,
httpCode: $input.item.json.error.httpCode,
rawResponse: $input.item.json.error.rawResponse
},
context: {
userId: $input.item.json.userId,
containerId: $input.item.json.containerId,
subWorkflowInput: $input.item.json.subWorkflowInput,
subWorkflowOutput: $input.item.json.subWorkflowOutput
}
};
// Ring buffer: add at end, remove from start if full
staticData.errors.buffer.push(errorEntry);
if (staticData.errors.buffer.length > MAX_ENTRIES) {
staticData.errors.buffer.shift(); // Remove oldest
}
staticData.errors.nextId++;
staticData.errors.count++;
return { json: { success: true, errorId: errorEntry.id } };
Source: Ring buffer pattern from Tucker Leach - Ring Buffer in TypeScript
Pattern 2: Sub-workflow Error Propagation
What: Sub-workflows return error objects to main workflow for centralized storage.
When to use: All sub-workflow calls. Enables centralized error collection.
Example:
// Sub-workflow (n8n-actions.json): Return error to main workflow
// Code node: Format Error Response (on error path)
return {
json: {
success: false,
error: {
message: $input.item.json.error.message,
stack: $input.item.json.error.stack || '',
httpCode: $input.item.json.error.httpCode || 500,
rawResponse: $input.item.json.error.rawResponse || ''
},
context: {
workflow: 'n8n-actions',
node: 'Stop Container', // Node where the error occurred
operation: 'docker.stop',
input: $('When executed by another workflow').item.json // What was sent to this sub-workflow
}
}
};
// Main workflow: Capture sub-workflow error
// IF node: Check Sub-workflow Success
{{ $('Execute Container Action').item.json.success }} equals false
// Code node: Log Error (on false path)
const subWorkflowResult = $('Execute Container Action').item.json;
const errorData = {
errorMessage: `Failed to stop ${subWorkflowResult.context.input.containerId}: ${subWorkflowResult.error.message}`,
error: subWorkflowResult.error,
userId: $('Telegram Trigger').item.json.message.from.id,
containerId: subWorkflowResult.context.input.containerId,
subWorkflowInput: subWorkflowResult.context.input,
subWorkflowOutput: subWorkflowResult
};
// Pass to ring buffer node
return { json: errorData };
Source: n8n sub-workflow pattern from n8n Execute Sub-workflow docs
Pattern 3: Correlation ID for Request Tracing
What: Unique ID generated at workflow entry point, passed through all sub-workflow calls, used to correlate logs/traces for single user request.
When to use: Always. Essential for tracing requests across sub-workflows.
Example:
// Main workflow: Generate Correlation ID
// Code node: Initialize Request Context (early in workflow, after auth)
const { v4: uuidv4 } = require('uuid'); // bundled with n8n; external requires may need NODE_FUNCTION_ALLOW_EXTERNAL
const correlationId = uuidv4();
const requestContext = {
correlationId,
userId: $('Telegram Trigger').item.json.message.from.id,
messageId: $('Telegram Trigger').item.json.message.message_id,
timestamp: new Date().toISOString()
};
return { json: { ...requestContext, ...$input.item.json } };
// Pass correlation ID to sub-workflow
// Execute Workflow node: Execute Container Action
// Input parameters:
{{ { correlationId: $('Initialize Request Context').item.json.correlationId, ...otherParams } }}
// Debug trace: Log callback routing decision
const staticData = $getWorkflowStaticData('global');
if (staticData.debug?.enabled) {
if (!staticData.traces) staticData.traces = { buffer: [], nextId: 1 }; // Initialize on first use
const traceEntry = {
id: `trace_${String(staticData.traces.nextId).padStart(3, '0')}`,
correlationId: $('Initialize Request Context').item.json.correlationId,
timestamp: new Date().toISOString(),
executionId: $execution.id,
event: 'callback-routing',
node: 'Route Callback',
data: {
callbackData: $input.item.json.callback_query.data,
routeTaken: $input.item.json.routeName, // Set by upstream routing logic ({{ }} expressions don't evaluate inside Code nodes)
availableRoutes: ['cancel', 'expired', 'batch', 'single-action']
}
};
// Add to ring buffer (same pattern as errors)
staticData.traces.buffer.push(traceEntry);
if (staticData.traces.buffer.length > 50) {
staticData.traces.buffer.shift();
}
staticData.traces.nextId++;
}
Source: Correlation ID pattern from Microsoft Engineering Playbook - Correlation IDs
Pattern 4: Debug Mode Toggle
What: Boolean flag in workflow static data that enables/disables debug tracing. When enabled, captures boundary data (sub-workflow I/O) and routing decisions.
When to use: Claude needs to diagnose issues. User doesn't see debug traces; only visible via /errors command or API.
Example:
// Telegram command: /debug on|off (hidden command)
// Code node: Toggle Debug Mode
const staticData = $getWorkflowStaticData('global');
const command = $input.item.json.message.text.toLowerCase();
if (!staticData.debug) {
staticData.debug = { enabled: false, logLevel: 'error' };
}
if (command === '/debug on') {
staticData.debug.enabled = true;
return { json: { message: 'Debug mode enabled. Tracing sub-workflow boundaries and callback routing.' } };
} else if (command === '/debug off') {
staticData.debug.enabled = false;
return { json: { message: 'Debug mode disabled.' } };
} else if (command === '/debug status') {
return { json: {
message: `Debug mode: ${staticData.debug.enabled ? 'ON' : 'OFF'}\nLog level: ${staticData.debug.logLevel}`
} };
}
Pattern 5: Query Errors via Telegram
What: Hidden command that returns recent errors in human-readable format.
When to use: Quick error checks during debugging sessions.
Example:
// Telegram command: /errors [count] (hidden command)
// Code node: Format Error Report
const staticData = $getWorkflowStaticData('global');
const errors = staticData.errors?.buffer || [];
const requestedCount = parseInt($input.item.json.message.text.split(' ')[1], 10) || 5;
const recentErrors = errors.slice(-requestedCount).reverse();
if (recentErrors.length === 0) {
return { json: { message: 'No errors recorded.' } };
}
let message = `📋 Recent Errors (${recentErrors.length}):\n\n`;
recentErrors.forEach(err => {
const time = new Date(err.timestamp).toLocaleString();
message += `🔴 ${err.id} - ${time}\n`;
message += `Workflow: ${err.workflow} → ${err.node}\n`;
message += `User: ${err.userMessage}\n`;
message += `Error: ${err.error.message}\n`;
if (err.error.httpCode) {
message += `HTTP: ${err.error.httpCode}\n`;
}
message += `\n`;
});
message += `Total errors: ${staticData.errors.count}\n`;
message += `Last cleared: ${new Date(staticData.errors.lastCleared).toLocaleString()}`;
return { json: { message } };
Pattern 6: n8n API Access for Deep Investigation
What: Use n8n API to retrieve full execution data including node inputs/outputs.
When to use: Deep debugging when Telegram command output isn't sufficient.
Example:
# Claude Code: Query recent failed executions
curl -X GET 'http://n8n:5678/api/v1/executions?status=error&limit=10' \
-H 'X-N8N-API-KEY: <api-key>'
# Response:
{
"data": [
{
"id": "12345",
"workflowId": "1000",
"status": "error",
"startedAt": "2026-02-08T10:29:55Z",
"finishedAt": "2026-02-08T10:30:00Z"
}
]
}
# Get detailed execution data
curl -X GET 'http://n8n:5678/api/v1/executions/12345?includeData=true' \
-H 'X-N8N-API-KEY: <api-key>'
# Response includes node-level data:
{
"id": "12345",
"data": {
"resultData": {
"runData": {
"Execute Container Action": [
{
"startTime": "...",
"executionTime": 234,
"data": {
"main": [
[
{
"json": {
"success": false,
"error": { ... }
}
}
]
]
}
}
]
}
}
}
}
Source: n8n Executions API
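The nested `runData` shape above is awkward to inspect by hand. A minimal sketch of flattening it into a queryable list; the helper name `summarizeRunData` is hypothetical, and the response shape is assumed from the example response above:

```javascript
// Hypothetical helper: flatten the nested runData shape from
// /api/v1/executions/:id?includeData=true into a list of
// { node, executionTime, failed, items } entries.
function summarizeRunData(execution) {
  const runData = execution?.data?.resultData?.runData || {};
  const summary = [];
  for (const [node, runs] of Object.entries(runData)) {
    for (const run of runs) {
      // main is an array of output branches, each an array of items
      const items = (run.data?.main || []).flat().map((i) => i.json);
      summary.push({
        node,
        executionTime: run.executionTime,
        failed: items.some((j) => j && j.success === false),
        items,
      });
    }
  }
  return summary;
}

// Example against the documented response shape
const sample = {
  id: '12345',
  data: {
    resultData: {
      runData: {
        'Execute Container Action': [
          {
            executionTime: 234,
            data: {
              main: [[{ json: { success: false, error: { message: 'Container not found' } } }]],
            },
          },
        ],
      },
    },
  },
};
const failedRuns = summarizeRunData(sample).filter((r) => r.failed);
```

Claude Code can run a helper like this against the raw API response to find failed nodes without manually walking the JSON tree.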
Anti-Patterns to Avoid
- Over-logging: Don't trace every node execution — only boundaries (sub-workflow I/O) and decision points (routing). Full tracing creates noise and fills the ring buffer quickly.
- Logging sensitive data: Don't capture Telegram API keys, Docker socket responses with sensitive container environment variables, or user credentials in error context.
- Unbounded storage: Don't append errors indefinitely to workflow static data — use ring buffer with fixed size (50 entries). Static data has size limits and isn't designed for unlimited storage.
- Synchronous API calls: Don't call n8n API from within workflow execution for logging — too slow, creates circular dependency. Use workflow static data; query API externally (Claude Code).
- User-facing debug output: Don't send raw error objects or stack traces to the Telegram user; show only the `userMessage` field. Full diagnostic data is for Claude only.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Ring buffer with manual rotation | Custom linked list, manual cleanup logic | Simple array with push() and shift() | A push/shift ring buffer is about 10 lines of code; custom structures add complexity for zero benefit |
| Correlation ID generation | Manual timestamp-based IDs | UUID v4 (`require('uuid').v4()`) | UUIDs are collision-resistant; timestamp-based IDs risk collisions |
| Error serialization | Custom error formatting | Explicit field copying (`error.message`, `error.stack`) wrapped in try-catch | Error objects aren't JSON-serializable by default; safe serialization needs explicit fields |
| Execution log parsing | Manual n8n database queries | n8n API `/api/v1/executions` | API provides structured access; database queries are fragile and break on schema changes |
| Log aggregation service | External ELK/Splunk/Datadog | Workflow static data + n8n API | 50-entry ring buffer is sufficient for debugging; external service is over-engineering for this use case |
Key insight: n8n's built-in capabilities (static data, Error Trigger, API) are designed for exactly this use case. Don't add external dependencies when native features are sufficient.
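On error serialization specifically: plain `JSON.stringify(new Error('x'))` yields `'{}'` because Error fields are non-enumerable. A hedged sketch of safe serialization (`serializeError` is a hypothetical helper name; the truncation limits match those recommended elsewhere in this document):

```javascript
// Hypothetical helper: copy the useful Error fields explicitly,
// truncate large ones, and never let logging itself throw.
function serializeError(err) {
  try {
    return {
      message: err?.message ?? String(err),
      stack: typeof err?.stack === 'string' ? err.stack.substring(0, 500) : '',
      httpCode: err?.httpCode ?? null,
      rawResponse:
        typeof err?.rawResponse === 'string' ? err.rawResponse.substring(0, 1000) : '',
    };
  } catch (e) {
    // Last resort for exotic values (Proxies, getters that throw, etc.)
    return { message: 'Unserializable error', stack: '', httpCode: null, rawResponse: '' };
  }
}
```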
Common Pitfalls
Pitfall 1: Workflow Static Data Not Persisting
What goes wrong: Static data cleared between executions, errors not retained.
Why it happens: Workflow static data only persists when workflow is active (not testing mode) and execution completes successfully. If workflow execution errors before reaching end, static data changes are lost.
How to avoid:
- Ensure main workflow is active (not testing)
- Write to static data in nodes that execute before error occurs
- For error logging: catch errors (try/catch in a Code node, or an Error Trigger workflow) so the execution completes and static data is written
Warning signs:
- `/errors` command shows no errors despite known failures
- Ring buffer resets to empty on every execution
- `nextId` counter doesn't increment
Source: n8n workflow static data behavior
Pitfall 2: Execution ID vs Correlation ID Confusion
What goes wrong: Using execution ID to trace across sub-workflows fails because each sub-workflow has its own execution ID.
Why it happens: n8n creates new execution ID for each sub-workflow invocation. Single user request = multiple execution IDs (main + N sub-workflows).
How to avoid:
- Generate correlation ID in main workflow (UUID v4)
- Pass correlation ID to all sub-workflows as input parameter
- Use correlation ID (not execution ID) to query logs for single user request
Warning signs:
- Can't trace callback from callback_query through sub-workflow to result
- Errors from sub-workflows appear unrelated to main workflow execution
Example:
User request "stop nginx"
├─ Main workflow execution: executionId=12345, correlationId=uuid-abc
├─ Sub-workflow (n8n-actions): executionId=12346, correlationId=uuid-abc ← Same correlation ID
└─ Error logged with correlationId=uuid-abc ← Can query all entries for this request
Source: Distributed tracing correlation ID pattern
Pitfall 3: Static Data Size Limits
What goes wrong: Workflow static data grows unbounded, eventually fails with "data too large" error.
Why it happens: n8n stores static data in database. Large objects (50+ entries with full rawResponse fields) can exceed database column size limits.
How to avoid:
- Use ring buffer (fixed size, auto-rotate)
- Limit `rawResponse` field size (truncate to 1000 chars)
- Don't store binary data or large payloads in error context
- Provide a manual clear command (`/clear-errors`) for ring buffer reset
Warning signs:
- Workflow execution fails with database error
- Static data write operations timing out
- Execution time increases as ring buffer fills
Mitigation:
// Truncate large fields before storing
error: {
message: err.message,
stack: err.stack?.substring(0, 500) || '', // Limit stack trace
rawResponse: err.rawResponse?.substring(0, 1000) || '' // Limit response
}
Source: n8n community: static data size limits
Pitfall 4: Querying Errors by Wrong Field
What goes wrong: Can't find specific error when searching logs because field name assumptions are wrong.
Why it happens: Inconsistent field naming (e.g., containerId vs container_id, workflow vs workflowName).
How to avoid:
- Define standard error schema (see Architecture Patterns above)
- Use TypeScript-style interfaces as comments in Code nodes
- Validate error object structure when storing (check required fields exist)
Warning signs:
- `/errors` command can't filter by container or user
- Claude's queries return empty results despite known errors for that container
Prevention:
// Code node: Validate Error Schema
const requiredFields = ['id', 'correlationId', 'timestamp', 'workflow', 'node', 'userMessage', 'error'];
const errorEntry = { ... };
// Validate
const missing = requiredFields.filter(field => !errorEntry[field]);
if (missing.length > 0) {
console.error(`Missing required error fields: ${missing.join(', ')}`);
}
Pitfall 5: Debug Mode Always-On Performance Impact
What goes wrong: Debug mode left enabled, fills ring buffer with traces, obscures actual errors.
Why it happens: Claude enables debug mode for investigation, forgets to disable it.
How to avoid:
- Default debug mode to OFF
- Auto-disable debug mode after N executions (e.g., 100)
- Include debug status in `/errors` command output
- Separate ring buffers for errors (always on) and traces (debug mode only)
Warning signs:
- Ring buffer fills with trace entries, pushes out error entries
- `/errors` command mostly shows traces, not actual errors
- Workflow execution noticeably slower
Mitigation:
// Auto-disable debug mode after 100 executions
const staticData = $getWorkflowStaticData('global');
if (staticData.debug?.enabled) {
staticData.debug.executionCount = (staticData.debug.executionCount || 0) + 1;
if (staticData.debug.executionCount > 100) {
staticData.debug.enabled = false;
// Send notification to Claude via Telegram
return { json: {
message: '⚠️ Debug mode auto-disabled after 100 executions.'
}};
}
}
Code Examples
All code examples provided in Architecture Patterns section above. Key patterns:
- Ring Buffer Implementation - Add/rotate entries in workflow static data
- Sub-workflow Error Propagation - Return error objects from sub-workflows
- Correlation ID Tracking - Generate and pass correlation ID through calls
- Debug Mode Toggle - Enable/disable tracing via Telegram command
- Query Errors via Telegram - Format and display recent errors
- n8n API Access - Retrieve execution data for deep investigation
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| Log to external service (Splunk, Datadog) | Store in workflow static data + query via API | 2024-2025 | n8n static data sufficient for small-scale debugging; no external dependencies |
| Trace every node execution | Trace only boundaries and decisions | 2025-2026 | Reduces noise, focuses on actionable data (distributed tracing best practices) |
| Execution ID only | Correlation ID + Execution ID | 2024-2026 | Correlation ID essential for multi-workflow tracing (OpenTelemetry pattern) |
| Manual log parsing | Structured JSON logs | 2023-2024 | Programmatic querying replaces manual log reading |
| Error Trigger to external workflow | Error propagation via return values | 2024-2025 | Centralized storage in main workflow, simpler architecture |
Deprecated/outdated:
- n8n log streaming to external service: Requires self-hosted n8n with log streaming enabled. Adds infrastructure complexity. Static data + API is simpler for debugging use case.
- External error tracking service (Sentry, Rollbar): Over-engineering for workflow errors. These services are for application errors in production systems, not workflow debugging.
- Database storage for logs: n8n already stores execution data in database. Querying via API is cleaner than direct database access (which is fragile and breaks on schema changes).
Source: n8n log streaming (optional feature, not required)
Open Questions
1. Workflow Static Data Size Limits
- What we know: Static data persists in n8n database, has size limits, can fail with "data too large" error
- What's unclear: Exact size limit in bytes/entries before failure occurs
- Recommendation: Conservative ring buffer size (50 entries), truncate large fields (`rawResponse` to 1000 chars), and provide a manual clear command. Monitor in production; reduce to 25 entries if size errors occur.
2. Sub-workflow Error Context Propagation
- What we know: Sub-workflows can return error objects via return values
- What's unclear: Do all 7 sub-workflows currently return structured responses, or do some fail silently?
- Recommendation: Audit existing sub-workflows during implementation. Standardize the return format: `{ success: boolean, error?: object, data?: object }`. Update all sub-workflows to return errors rather than throwing and failing the execution.
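A minimal sketch of that standardized envelope, using hypothetical helper names (`ok`, `fail`, `normalize`); the main-workflow side treats any non-conforming response as a failure, so sub-workflows that fail silently still surface:

```javascript
// Hypothetical sub-workflow helpers for the { success, error?, data? } envelope.
function ok(data) {
  return { success: true, data };
}

function fail(message, extra = {}) {
  return {
    success: false,
    error: {
      message,
      httpCode: extra.httpCode ?? 500,
      rawResponse: extra.rawResponse ?? '',
    },
    context: extra.context ?? {},
  };
}

// Main-workflow side: coerce anything non-standard into a failure
// so "silent" sub-workflow responses still get logged.
function normalize(response) {
  if (response && typeof response.success === 'boolean') return response;
  return fail('Sub-workflow returned non-standard response', {
    rawResponse: JSON.stringify(response)?.substring(0, 1000) ?? '',
  });
}
```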
3. Debug Mode Performance Impact
- What we know: Capturing boundary data and routing decisions adds code execution overhead
- What's unclear: Measurable impact on workflow execution time (milliseconds? seconds?)
- Recommendation: Implement debug mode with selective tracing (only 3 pain points). Measure execution time before/after debug mode enabled. If impact > 500ms, reduce trace granularity.
4. n8n API Rate Limits
- What we know: n8n provides API for querying executions
- What's unclear: Are there rate limits on API calls? Does frequent querying impact n8n performance?
- Recommendation: Use Telegram commands for quick checks (doesn't hit API, reads static data). Reserve API queries for deep investigation. If rate limits discovered, implement query caching/throttling.
5. Telegram Message Size Limits
- What we know: Telegram messages have 4096 character limit
- What's unclear: If the `/errors` command returns 50 errors, will the message exceed the limit?
- Recommendation: Paginate error output (default: last 5 errors, optional count parameter). Provide `/errors full` for file-based export (Telegram file upload API). Split long messages if needed.
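A sketch of the message-splitting fallback (the `chunkMessage` helper is hypothetical; 4096 is Telegram's documented text-message limit), preferring newline boundaries so individual error entries stay intact:

```javascript
// Hypothetical helper: split a long report into Telegram-sized chunks,
// breaking at the last newline inside the limit where possible.
const TELEGRAM_LIMIT = 4096;

function chunkMessage(text, limit = TELEGRAM_LIMIT) {
  const chunks = [];
  let rest = text;
  while (rest.length > limit) {
    let cut = rest.lastIndexOf('\n', limit);
    if (cut <= 0) cut = limit; // No newline available: hard split
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut).replace(/^\n/, '');
  }
  chunks.push(rest);
  return chunks;
}
```

Each chunk would then be sent as a separate sendMessage call.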
Sources
Primary (HIGH confidence)
- n8n Workflow Static Data - Official docs on `$getWorkflowStaticData()`
- n8n Error Trigger Node - Error data structure and usage
- n8n Execute Sub-workflow - Sub-workflow communication patterns
- n8n Executions API - Querying execution data programmatically
- n8n workflow data access - Accessing node data and workflow metadata
Secondary (MEDIUM confidence)
- Better Stack: Node.js Logging Best Practices - Structured logging patterns
- Microsoft Engineering Playbook: Correlation IDs - Request tracing pattern
- Distributed Tracing Logs (GroundCover) - Tracing workflow debugging patterns
- Tucker Leach: Ring Buffer in TypeScript - Ring buffer implementation
- n8n Community: Workflow Static Data - Static data limitations and behaviors
Tertiary (LOW confidence)
- n8n community: inline keyboard callback query - Telegram callback patterns (referenced for callback routing context)
- Ring buffer npm packages - External libraries (not needed, but validate pattern)
Metadata
Confidence breakdown:
- Standard stack: HIGH - All components are n8n built-ins, well-documented in official docs
- Architecture patterns: HIGH - Ring buffer, correlation IDs, structured errors are industry-standard patterns; n8n static data verified in official docs
- Common pitfalls: MEDIUM - Based on n8n community reports and general workflow debugging experience; specific size limits not documented precisely
- Code examples: HIGH - All examples use documented n8n APIs and standard JavaScript patterns
Research date: 2026-02-08 Valid until: 2026-03-08 (30 days - stable technology stack)
Implementation Recommendations
Based on research findings and user constraints:
1. Trace Format (Claude's Discretion)
Recommendation: Hybrid approach — structured error objects (always on) + selective debug traces (opt-in).
Rationale: Errors are rare and always need full context. Debug traces are verbose and only needed for specific pain points. Separate ring buffers prevent trace noise from obscuring errors.
Structure:
- `staticData.errors.buffer` - 50 entries, always on
- `staticData.traces.buffer` - 50 entries, only when `staticData.debug.enabled = true`
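Since both buffers share the same push-and-rotate logic, it can be factored into one helper reused by the error and trace Code nodes. A sketch; `pushToRing` and the `prefix` field are assumptions, not part of the patterns above:

```javascript
// Hypothetical shared helper for both ring buffers in workflow static data.
// store: { buffer: [], nextId: number, count?: number, prefix?: string }
function pushToRing(store, entry, maxEntries = 50) {
  const id = `${store.prefix || 'entry'}_${String(store.nextId).padStart(3, '0')}`;
  store.buffer.push({ ...entry, id });
  if (store.buffer.length > maxEntries) store.buffer.shift(); // Drop oldest
  store.nextId++;
  store.count = (store.count || 0) + 1; // All-time counter survives rotation
  return id;
}
```

Usage: `pushToRing(staticData.errors, errorEntry)` in the error path, `pushToRing(staticData.traces, traceEntry)` in debug-mode traces.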
2. Trace Scope (Claude's Discretion)
Recommendation: Trace only errors (always) + three pain points (debug mode only).
Pain point traces (debug mode only):
- Sub-workflow boundaries: Capture input/output at Execute Workflow nodes
- Callback routing: Capture which switch path taken in Route Callback node
- n8n API queries: (No tracing needed — query via API is already structured)
Rationale: Tracing every execution creates noise. Focus on high-value data: errors (always actionable) and specific debug scenarios (when Claude needs deep visibility).
3. Structured vs. Simple Logs (Claude's Discretion)
Recommendation: Structured JSON objects.
Rationale: Claude needs programmatic access to query by correlationId, workflow, node, error type. Simple log lines require text parsing; structured objects enable direct field access.
4. Debug Toggle Mechanism (Claude's Discretion)
Recommendation: Global toggle via Telegram command (/debug on|off) with auto-disable after 100 executions.
Rationale: Global toggle is simplest. Per-request debugging adds complexity (need to tag specific requests). Always-on would fill ring buffer with traces. Auto-disable prevents performance impact from forgotten debug mode.
5. Log Level Granularity (Claude's Discretion)
Recommendation: Binary on/off for debug mode. Errors are always logged (no levels).
Rationale: Traditional log levels (error/warn/info/debug) are for application logs. Workflow debugging has two modes: normal (errors only) and debug (errors + traces). Additional levels add complexity without benefit.
6. Specific Debug Data to Capture (Claude's Discretion)
Recommendation: Minimal boundary data + routing decisions.
Capture:
- Sub-workflow I/O: `{ input: {...}, output: {...}, duration: 234 }`
- Callback routing: `{ callbackData: "...", routeTaken: "...", availableRoutes: [...] }`
- Docker API responses: `{ httpCode: 404, rawResponse: "..." }` (truncate to 1000 chars)
Don't capture:
- Every node execution (too verbose)
- Full execution data from n8n API (query on-demand, don't cache)
- User messages, Telegram webhook payloads (not relevant to pain points)
7. Telegram Command Interface (Claude's Discretion)
Recommendation:
| Command | Description | Hidden? |
|---|---|---|
| `/errors [count]` | Show last N errors (default 5) | Yes (unlisted) |
| `/clear-errors` | Clear error ring buffer | Yes (unlisted) |
| `/debug on\|off\|status` | Toggle debug mode | Yes (unlisted) |
| `/trace <correlationId>` | Show all entries for correlation ID | Yes (unlisted) |
Rationale: Developer/debug tools should be hidden (not in /help menu). Claude can use them during debugging sessions. User never needs to see these commands.