
Monitoring & Observability

Effective monitoring is crucial for maintaining healthy autonomous agents. PAP provides comprehensive observability through heartbeats, metrics, logs, and distributed tracing.

The Three Pillars of Agent Observability

💓 Heartbeats

Liveness Signals
Lightweight health checks proving the agent is alive and responsive

📊 Metrics

Resource Telemetry
CPU, memory, requests, and custom business metrics

📝 Logs

Event Streams
Structured logs for debugging and audit

Heartbeats: The Liveness Channel

Purpose: Prove the agent is alive and responsive.
CRITICAL: Heartbeats contain ONLY liveness data. Resource metrics are FORBIDDEN in heartbeats per PAP-RFC-001 §8.2.

Heartbeat Structure

{
  "agent_uuid": "123e4567-e89b-12d3-a456-426614174000",
  "mode": "IDLE",
  "uptime_seconds": 3600,
  "timestamp": "2025-11-13T08:30:00Z"
}
Allowed Fields:
  • mode: EMERGENCY, IDLE, or SLEEP
  • uptime_seconds: How long the agent has been running
  • timestamp: When heartbeat was sent
Forbidden in Heartbeats:
  • ❌ CPU usage
  • ❌ Memory usage
  • ❌ Request counts
  • ❌ Any resource or business metrics
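Sending one is just a small POST on a timer. A minimal sketch, assuming a heartbeat endpoint at /api/agents/UUID/heartbeat (the exact path is an assumption, not documented here) and Node 18+ for the global fetch:
// Minimal heartbeat sender (endpoint path is an assumption)
async function sendHeartbeat(agentUuid, mode, startedAt) {
  const payload = {
    agent_uuid: agentUuid,
    mode,                                                  // EMERGENCY, IDLE, or SLEEP
    uptime_seconds: Math.floor((Date.now() - startedAt) / 1000),
    timestamp: new Date().toISOString()
    // Deliberately no CPU/memory/request fields: forbidden here per PAP-RFC-001 §8.2
  };

  await fetch(`https://plugged.in/api/agents/${agentUuid}/heartbeat`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAP_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(payload)
  });
}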

Heartbeat Modes

IDLE Mode (Default)

Interval: 30 seconds
Use Case: Normal operation
{
  "mode": "IDLE",
  "uptime_seconds": 1800
}

EMERGENCY Mode

Interval: 5 seconds
Use Case: Critical operations requiring aggressive monitoring
{
  "mode": "EMERGENCY",
  "uptime_seconds": 45
}
Use EMERGENCY Sparingly: High-frequency heartbeats increase control plane load. Only use for truly critical situations (e.g., handling financial transactions, emergency alerts).

SLEEP Mode

Interval: 15 minutes
Use Case: Background agents with low-priority work
{
  "mode": "SLEEP",
  "uptime_seconds": 86400
}
SLEEP Mode Benefits: Reduces control plane load for infrequently-used agents. Perfect for scheduled report generators or monitoring agents that only act occasionally.
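The three intervals map naturally onto one loop that re-reads the agent's current mode before scheduling the next beat. A sketch reusing the hypothetical sendHeartbeat helper from above:
// Heartbeat interval per mode, in milliseconds
const HEARTBEAT_INTERVAL_MS = {
  EMERGENCY: 5_000,   // 5 seconds
  IDLE: 30_000,       // 30 seconds
  SLEEP: 900_000      // 15 minutes
};

async function heartbeatLoop(agent) {
  const startedAt = Date.now();
  for (;;) {
    await sendHeartbeat(agent.uuid, agent.mode, startedAt);
    // Re-read the mode each iteration so a switch to EMERGENCY takes
    // effect on the very next beat instead of after a stale interval
    await new Promise(resolve => setTimeout(resolve, HEARTBEAT_INTERVAL_MS[agent.mode]));
  }
}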

Viewing Heartbeats

Retrieve recent heartbeats via API:
curl https://plugged.in/api/agents/AGENT_UUID \
  -H "Authorization: Bearer $API_KEY" \
  | jq '.recentHeartbeats'
Response:
[
  {
    "id": "1003",
    "agent_uuid": "...",
    "mode": "IDLE",
    "uptime_seconds": 3630,
    "timestamp": "2025-11-13T08:30:30Z"
  },
  {
    "id": "1002",
    "agent_uuid": "...",
    "mode": "IDLE",
    "uptime_seconds": 3600,
    "timestamp": "2025-11-13T08:30:00Z"
  }
]

Heartbeat Health Check

function isAgentHealthy(agent) {
  if (agent.state !== 'ACTIVE') return false;
  if (!agent.last_heartbeat_at) return false;

  const lastHeartbeat = new Date(agent.last_heartbeat_at);
  const now = new Date();
  const msSinceHeartbeat = now - lastHeartbeat;

  // Default IDLE mode: 30s interval * 1.5 = 45s max
  const maxInterval = 45000;

  return msSinceHeartbeat < maxInterval;
}
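The 45-second cutoff above is specific to IDLE mode. If the last reported mode is known, the same 1.5× grace factor extends to all three modes (a sketch; last_heartbeat_mode is an assumed field, not a documented one):
// Expected heartbeat gap per mode (interval × 1.5 grace factor), in ms
const MAX_HEARTBEAT_GAP_MS = {
  EMERGENCY: 7_500,    // 5s × 1.5
  IDLE: 45_000,        // 30s × 1.5
  SLEEP: 1_350_000     // 15min × 1.5
};

function isAgentHealthyForMode(agent) {
  if (agent.state !== 'ACTIVE' || !agent.last_heartbeat_at) return false;
  const gapMs = Date.now() - new Date(agent.last_heartbeat_at).getTime();
  // last_heartbeat_mode is an assumed field; default to IDLE's threshold
  return gapMs < MAX_HEARTBEAT_GAP_MS[agent.last_heartbeat_mode || 'IDLE'];
}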

Metrics: The Resource Channel

Purpose: Monitor resource usage and business metrics.
Separation is Key: Metrics are sent on a completely separate channel from heartbeats. This separation is PAP’s superpower for zombie prevention: an agent that heartbeats on schedule but reports no activity on its metrics channel can be flagged as a zombie instead of passing for healthy.

Metrics Structure

{
  "agent_uuid": "123e4567-e89b-12d3-a456-426614174000",
  "cpu_percent": 23,
  "memory_mb": 384,
  "requests_handled": 152,
  "custom_metrics": {
    "queue_depth": 5,
    "cache_hit_rate": 0.87,
    "active_connections": 12
  },
  "timestamp": "2025-11-13T08:30:00Z"
}

Standard Metrics

Metric             Type      Description
cpu_percent        integer   CPU usage (0-100)
memory_mb          integer   Memory usage in megabytes
requests_handled   integer   Total requests processed since start
custom_metrics     object    Agent-specific business metrics
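A metrics reporter looks like the heartbeat sender but runs on its own channel and carries the resource data heartbeats must not. A sketch, assuming a /api/agents/UUID/metrics endpoint (not documented here); process.memoryUsage and process.cpuUsage are standard Node APIs:
// Minimal metrics reporter (endpoint path is an assumption)
async function sendMetrics(agentUuid, requestsHandled, customMetrics = {}) {
  const payload = {
    agent_uuid: agentUuid,
    cpu_percent: await sampleCpuPercent(),
    memory_mb: Math.round(process.memoryUsage().rss / 1024 / 1024),  // bytes → MB
    requests_handled: requestsHandled,
    custom_metrics: customMetrics,
    timestamp: new Date().toISOString()
  };

  await fetch(`https://plugged.in/api/agents/${agentUuid}/metrics`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAP_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(payload)
  });
}

// Rough CPU estimate: µs of CPU consumed over a 100ms wall-clock window
async function sampleCpuPercent() {
  const start = process.cpuUsage();
  await new Promise(resolve => setTimeout(resolve, 100));
  const { user, system } = process.cpuUsage(start);
  return Math.min(100, Math.round((user + system) / 1000));  // 100,000µs busy = 100%
}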

Viewing Metrics

Retrieve recent metrics via API:
curl https://plugged.in/api/agents/AGENT_UUID \
  -H "Authorization: Bearer $API_KEY" \
  | jq '.recentMetrics'
Response:
[
  {
    "id": "2005",
    "agent_uuid": "...",
    "cpu_percent": 23,
    "memory_mb": 384,
    "requests_handled": 152,
    "custom_metrics": {
      "queue_depth": 5
    },
    "timestamp": "2025-11-13T08:30:00Z"
  }
]

Metric Collection Frequency

Recommended: 60 seconds
Unlike heartbeats, metrics can be sent less frequently; the frequency is a tradeoff:
  • More frequent = finer granularity, higher storage
  • Less frequent = reduced load, coarser data
Adaptive Frequency: Send metrics more frequently during high activity, less during idle periods to optimize storage.
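One way to implement this is to derive the next interval from recent activity (a sketch; the thresholds are illustrative, not prescribed):
// Pick the next metrics interval based on recent request volume
function nextMetricsIntervalMs(requestsLastMinute) {
  if (requestsLastMinute > 100) return 15_000;   // busy: finer granularity
  if (requestsLastMinute > 0) return 60_000;     // normal: recommended default
  return 300_000;                                 // idle: reduce storage
}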

Alerting on Metrics

High CPU Alert:
if (metrics.cpu_percent > 80) {
  alert('Agent CPU usage critical: ' + metrics.cpu_percent + '%');
}
Memory Leak Detection:
// Check if memory consistently increases.
// recentMetrics is returned newest-first, so reverse into chronological order
const recent = agent.recentMetrics.slice(0, 10).reverse();
const memoryTrend = recent.length >= 2 && recent.every((m, i) =>
  i === 0 || m.memory_mb > recent[i - 1].memory_mb
);

if (memoryTrend) {
  alert('Possible memory leak detected');
}

Logs: The Event Stream

Purpose: Detailed event logs for debugging and audit.

Log Access (Current)

For now, server administrators can access logs via kubectl:
# Get recent logs
kubectl logs deployment/AGENT_NAME -n agents --tail=100 --timestamps

# Follow logs in real-time
kubectl logs deployment/AGENT_NAME -n agents -f

# Get logs from previous container (after crash)
kubectl logs deployment/AGENT_NAME -n agents --previous

Log API (Coming Soon)

Future API endpoint for log retrieval:
curl https://plugged.in/api/agents/AGENT_UUID/logs?tail=100 \
  -H "Authorization: Bearer $API_KEY"

Structured Logging Best Practices

Use structured JSON logs:
console.log(JSON.stringify({
  level: 'info',
  timestamp: new Date().toISOString(),
  agent_uuid: 'agent-uuid',
  event: 'request_processed',
  duration_ms: 45,
  status: 200,
  trace_id: 'trace-123'
}));
Benefits:
  • Parseable by log aggregators
  • Searchable by field
  • Compatible with OpenTelemetry

Log Levels

Level   Use Case                       Example
error   Failures requiring attention   Database connection failed
warn    Potential issues               Retry attempt 3 of 5
info    Normal operations              Request completed successfully
debug   Detailed debugging             Cache lookup: key=foo, hit=true
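A thin wrapper keeps the level names and structured fields consistent across a codebase (a sketch building on the JSON format shown earlier):
// Minimal leveled JSON logger; field layout mirrors the example above
function makeLogger(agentUuid) {
  const emit = (level) => (event, fields = {}) =>
    console.log(JSON.stringify({
      level,
      timestamp: new Date().toISOString(),
      agent_uuid: agentUuid,
      event,
      ...fields   // e.g. duration_ms, status, trace_id
    }));
  return { error: emit('error'), warn: emit('warn'), info: emit('info'), debug: emit('debug') };
}

// Usage
const log = makeLogger('agent-uuid');
log.info('request_processed', { duration_ms: 45, status: 200, trace_id: 'trace-123' });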

Kubernetes-Level Monitoring

Pod Health

Check pod status:
kubectl get pods -n agents -l app=AGENT_NAME
Healthy pod:
NAME                         READY   STATUS    RESTARTS   AGE
my-agent-65bf69d8c-abc123    1/1     Running   0          5m
Unhealthy indicators:
  • CrashLoopBackOff: Pod keeps crashing
  • ImagePullBackOff: Cannot pull container image
  • Pending: Not scheduled (quota/resources)
  • RESTARTS > 0: Pod restarted (check logs)

Resource Usage

Check actual resource consumption:
kubectl top pod -n agents -l app=AGENT_NAME
Output:
NAME                         CPU(cores)   MEMORY(bytes)
my-agent-65bf69d8c-abc123    23m          384Mi

Events

View recent Kubernetes events:
kubectl get events -n agents \
  --field-selector involvedObject.name=AGENT_NAME \
  --sort-by='.lastTimestamp'

Distributed Tracing

PAP supports OpenTelemetry for distributed tracing across agents and tools.

Trace Context Propagation

All PAP messages carry:
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
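Propagation means keeping the caller's trace_id while minting a fresh span_id for each hop, then attaching both to every outgoing message and log line. A sketch (not the official PAP SDK):
const crypto = require('crypto');

// Keep the incoming trace_id (or start a new trace), mint a span_id for this hop
function childTraceContext(incoming) {
  return {
    trace_id: incoming?.trace_id || crypto.randomBytes(16).toString('hex'),
    span_id: crypto.randomBytes(8).toString('hex')
  };
}

// Attach the context to an outgoing PAP message so the next agent or tool
// can continue the same trace
function withTrace(message, incoming) {
  return { ...message, ...childTraceContext(incoming) };
}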

Tracing Agent-to-Tool Calls

// Agent makes MCP tool call (imports from @opentelemetry/api; tracer name is illustrative)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('pap-agent');

const span = tracer.startSpan('mcp.invoke', {
  attributes: {
    'mcp.server': 'filesystem',
    'mcp.tool': 'read_file',
    'mcp.arguments': JSON.stringify({path: '/docs/file.pdf'})
  }
});

try {
  const result = await mcpClient.invoke('filesystem', 'read_file', {
    path: '/docs/file.pdf'
  });
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
  throw error;  // still propagate the failure after recording it
} finally {
  span.end();
}

Viewing Traces (Future)

Integration with Jaeger/Zipkin/Tempo for trace visualization showing:
  • Agent request flow
  • Tool invocations
  • Database queries
  • External API calls

Alerting Strategies

Critical Alerts (Immediate Response)

  1. Agent Unhealthy:
    if (now - last_heartbeat > 45000 && state === 'ACTIVE') {
      pagerDuty.alert('Agent UNHEALTHY: ' + agent.name);
    }
    
  2. Agent Killed:
    const latest = lifecycleEvents.latest();
    if (latest.to_state === 'KILLED') {
      slack.alert('Agent KILLED: ' + agent.name + ' - Reason: ' + latest.metadata.kill_reason);
    }
    
  3. Resource Exhaustion:
    if (metrics.memory_mb > 0.9 * limits.memory_mb) {
      alert('Agent approaching memory limit');
    }
    

Warning Alerts (Monitor)

  1. High Error Rate:
    // Average error rate across recent samples; assumes each sample's
    // custom_metrics.error_rate is errors / requests for that interval
    const errorRate = recentMetrics.reduce((sum, m) =>
      sum + (m.custom_metrics?.error_rate || 0), 0
    ) / recentMetrics.length;

    if (errorRate > 0.05) {
      warn('Agent error rate elevated: ' + (errorRate * 100).toFixed(1) + '%');
    }
    
  2. Slow Response Times:
    if (metrics.custom_metrics.avg_response_ms > 1000) {
      warn('Agent response time degraded');
    }
    

Informational (Logging)

  1. State Changes:
    lifecycleEvents.forEach(event => {
      if (event.timestamp > lastCheck) {
        log(`Agent ${agent.name}: ${event.from_state} → ${event.to_state}`);
      }
    });
    

Dashboard Recommendations

Agent Health Dashboard

Metrics to Display:
  1. Agent Count by State
    • ACTIVE count (green)
    • PROVISIONED count (yellow)
    • TERMINATED count (grey)
    • KILLED count (red)
  2. Heartbeat Status (see the bucketing sketch after this list)
    • Agents with recent heartbeat (< 1min ago)
    • Agents with stale heartbeat (1-2min ago)
    • Agents missing heartbeat (> 2min ago)
  3. Resource Utilization
    • CPU usage histogram
    • Memory usage histogram
    • Agents approaching limits
  4. Request Throughput
    • Requests per minute per agent
    • Error rate per agent
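The heartbeat-status buckets in item 2 fall out directly from last_heartbeat_at (a sketch matching the thresholds above):
// Bucket an agent by heartbeat staleness for the dashboard
function heartbeatBucket(agent, now = Date.now()) {
  if (!agent.last_heartbeat_at) return 'missing';
  const ageMs = now - new Date(agent.last_heartbeat_at).getTime();
  if (ageMs < 60_000) return 'recent';    // < 1min ago
  if (ageMs < 120_000) return 'stale';    // 1-2min ago
  return 'missing';                        // > 2min ago
}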

Individual Agent Dashboard

Sections:
  1. Status Overview
    • Current state
    • Uptime
    • Last heartbeat time
  2. Resource Graphs (Time Series)
    • CPU usage over time
    • Memory usage over time
  3. Throughput (Time Series)
    • Requests handled per minute
    • Error rate
  4. Recent Events
    • Lifecycle events (last 24h)
    • Log errors (last 1h)
  5. Kubernetes Health
    • Pod status
    • Restart count
    • Resource quota usage

Monitoring Tools Integration

Prometheus

Metrics Endpoint (Future):
curl https://AGENT_NAME.is.plugged.in/metrics
Example output:
# HELP agent_cpu_percent Agent CPU usage
# TYPE agent_cpu_percent gauge
agent_cpu_percent{agent="my-agent"} 23

# HELP agent_memory_mb Agent memory usage
# TYPE agent_memory_mb gauge
agent_memory_mb{agent="my-agent"} 384

# HELP agent_requests_total Total requests handled
# TYPE agent_requests_total counter
agent_requests_total{agent="my-agent"} 152
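Until the hosted endpoint ships, an agent can expose the same exposition format itself with the prom-client npm package (a sketch; the port and label values are illustrative):
const http = require('http');
const client = require('prom-client');

// Metric names mirror the example output above
const cpu = new client.Gauge({ name: 'agent_cpu_percent', help: 'Agent CPU usage', labelNames: ['agent'] });
const mem = new client.Gauge({ name: 'agent_memory_mb', help: 'Agent memory usage', labelNames: ['agent'] });
const reqs = new client.Counter({ name: 'agent_requests_total', help: 'Total requests handled', labelNames: ['agent'] });

cpu.set({ agent: 'my-agent' }, 23);   // update these wherever metrics are sampled
mem.set({ agent: 'my-agent' }, 384);
reqs.inc({ agent: 'my-agent' });

// Serve the Prometheus text format on /metrics
http.createServer(async (req, res) => {
  if (req.url !== '/metrics') { res.statusCode = 404; res.end(); return; }
  res.setHeader('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
}).listen(9464);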

Grafana

Import Plugged.in Agent dashboard (future):
# Dashboard JSON available at
curl https://plugged.in/grafana-dashboards/agents.json

Datadog / New Relic

Configure agent to export to APM:
// OpenTelemetry setup; the Datadog exporter package name is illustrative —
// check your APM vendor's docs for the current exporter package
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { DatadogExporter } = require('@opentelemetry/exporter-datadog');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new DatadogExporter({
  agentUrl: 'http://datadog-agent:8126'
})));
provider.register();  // install as the global tracer provider

Troubleshooting with Monitoring Data

Problem: Agent Not Processing Requests

Diagnosis:
  1. Check heartbeats: Are they arriving?
  2. Check metrics: Is CPU/memory normal?
  3. Check logs: Any errors?
  4. Check Kubernetes: Is pod healthy?
Resolution Path:
# 1. Check state
curl https://plugged.in/api/agents/UUID | jq '.agent.state'

# 2. Check heartbeats
curl https://plugged.in/api/agents/UUID | jq '.recentHeartbeats[0]'

# 3. Check pod status
kubectl get pod -n agents -l app=AGENT_NAME

# 4. Check logs
kubectl logs deployment/AGENT_NAME -n agents --tail=50

Problem: High CPU Usage

Diagnosis:
  1. Check metrics: CPU consistently > 80%?
  2. Check logs: Infinite loop? Expensive operation?
  3. Check traces: Which operations are slow?
Resolution:
  • Optimize hot code paths
  • Increase CPU limit
  • Scale horizontally (future)

Problem: Memory Leak

Diagnosis:
  1. Check metrics: Memory steadily increasing?
  2. Check logs: Out of memory errors?
  3. Take heap snapshot (future feature)
Resolution:
  • Identify leaking resources
  • Fix and redeploy
  • Restart agent as temporary fix

Best Practices Summary

DO: Keep heartbeats lightweight (mode + uptime only)
DO: Send comprehensive metrics on separate channel
DO: Use structured JSON logging
DO: Include trace_id in all logs
DO: Set up alerts for missed heartbeats
DON’T: Mix metrics with heartbeats
DON’T: Use EMERGENCY mode for normal operations
DON’T: Log sensitive data (passwords, tokens)

Next Steps