Monitoring & Observability
Effective monitoring is crucial for maintaining healthy autonomous agents. PAP provides comprehensive observability through heartbeats, metrics, logs, and distributed tracing.
The Three Pillars of Agent Observability
💓 Heartbeats: Liveness Signals
Lightweight health checks proving the agent is alive and responsive
📊 Metrics: Resource Telemetry
CPU, memory, requests, and custom business metrics
📝 Logs: Event Streams
Structured logs for debugging and audit
Heartbeats: The Liveness Channel
Purpose: Prove the agent is alive and responsive.
CRITICAL: Heartbeats contain ONLY liveness data. Resource metrics are FORBIDDEN in heartbeats per PAP-RFC-001 §8.2.
Heartbeat Structure
{
  "agent_uuid": "123e4567-e89b-12d3-a456-426614174000",
  "mode": "IDLE",
  "uptime_seconds": 3600,
  "timestamp": "2025-11-13T08:30:00Z"
}
Allowed Fields:
mode: EMERGENCY, IDLE, or SLEEP
uptime_seconds: How long the agent has been running
timestamp: When the heartbeat was sent
Forbidden in Heartbeats:
❌ CPU usage
❌ Memory usage
❌ Request counts
❌ Any resource or business metrics
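As a rough sketch, a conforming sender builds nothing beyond those fields. The POST endpoint and the AGENT_UUID environment variable used below are assumptions for illustration, not documented parts of the API:

const startedAt = Date.now();

// Liveness only, per PAP-RFC-001 §8.2: no CPU, memory, or request counts.
async function sendHeartbeat(mode) {
  const payload = {
    agent_uuid: process.env.AGENT_UUID,   // agent's UUID, assumed to be injected via environment
    mode,                                  // EMERGENCY | IDLE | SLEEP
    uptime_seconds: Math.floor((Date.now() - startedAt) / 1000),
    timestamp: new Date().toISOString()
  };

  // Hypothetical heartbeat route; check the PAP API reference for the real path.
  await fetch(`https://plugged.in/api/agents/${payload.agent_uuid}/heartbeat`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(payload)
  });
}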
Heartbeat Modes
IDLE Mode (Default)
Interval: 30 seconds
Use Case: Normal operation
{
  "mode": "IDLE",
  "uptime_seconds": 1800
}
EMERGENCY Mode
Interval: 5 seconds
Use Case: Critical operations requiring aggressive monitoring
{
  "mode": "EMERGENCY",
  "uptime_seconds": 45
}
Use EMERGENCY Sparingly: High-frequency heartbeats increase control plane load. Use it only for truly critical situations (e.g., handling financial transactions or emergency alerts).
SLEEP Mode
Interval: 15 minutes
Use Case: Background agents with low-priority work
{
  "mode": "SLEEP",
  "uptime_seconds": 86400
}
SLEEP Mode Benefits: Reduces control plane load for infrequently used agents. Perfect for scheduled report generators or monitoring agents that only act occasionally.
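Because each mode implies a different cadence, the sender can pick its delay from a small lookup table. A minimal sketch, reusing the assumed sendHeartbeat helper from the earlier sketch:

// Heartbeat interval per mode, mirroring the cadences documented above.
const HEARTBEAT_INTERVAL_MS = {
  EMERGENCY: 5 * 1000,      // 5 seconds
  IDLE: 30 * 1000,          // 30 seconds (default)
  SLEEP: 15 * 60 * 1000     // 15 minutes
};

let currentMode = 'IDLE';

// Re-arm after every beat so a mode change takes effect on the next tick.
async function heartbeatLoop() {
  await sendHeartbeat(currentMode);
  setTimeout(heartbeatLoop, HEARTBEAT_INTERVAL_MS[currentMode]);
}

heartbeatLoop();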
Viewing Heartbeats
Retrieve recent heartbeats via API:
curl https://plugged.in/api/agents/AGENT_UUID \
  -H "Authorization: Bearer $API_KEY" \
  | jq '.recentHeartbeats'
Response:
[
  {
    "id": "1003",
    "agent_uuid": "...",
    "mode": "IDLE",
    "uptime_seconds": 3630,
    "timestamp": "2025-11-13T08:30:30Z"
  },
  {
    "id": "1002",
    "agent_uuid": "...",
    "mode": "IDLE",
    "uptime_seconds": 3600,
    "timestamp": "2025-11-13T08:30:00Z"
  }
]
Heartbeat Health Check
function isAgentHealthy(agent) {
  if (agent.state !== 'ACTIVE') return false;
  if (!agent.last_heartbeat_at) return false;

  const lastHeartbeat = new Date(agent.last_heartbeat_at);
  const msSinceHeartbeat = Date.now() - lastHeartbeat.getTime();

  // Default IDLE mode: 30s interval * 1.5 grace = 45s max
  const maxInterval = 45000;
  return msSinceHeartbeat < maxInterval;
}
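A sketch of how this check might back a periodic watchdog, using the documented agent endpoint (the console.error call stands in for whatever alerting integration you use):

// Poll the agent endpoint and apply the health check above.
async function watchAgent(agentUuid) {
  const res = await fetch(`https://plugged.in/api/agents/${agentUuid}`, {
    headers: { 'Authorization': `Bearer ${process.env.API_KEY}` }
  });
  const { agent } = await res.json();

  if (!isAgentHealthy(agent)) {
    console.error(`Agent ${agent.name} has missed its heartbeat window`);
  }
}

setInterval(() => watchAgent(process.env.AGENT_UUID), 60 * 1000);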
Metrics: The Resource Channel
Purpose: Monitor resource usage and business metrics.
Separation is Key: Metrics travel on a completely separate channel from heartbeats. This separation is PAP's key defense against zombie agents.
Metrics Structure
{
  "agent_uuid": "123e4567-e89b-12d3-a456-426614174000",
  "cpu_percent": 23,
  "memory_mb": 384,
  "requests_handled": 152,
  "custom_metrics": {
    "queue_depth": 5,
    "cache_hit_rate": 0.87,
    "active_connections": 12
  },
  "timestamp": "2025-11-13T08:30:00Z"
}
Standard Metrics
cpu_percent (integer): CPU usage (0-100)
memory_mb (integer): Memory usage in megabytes
requests_handled (integer): Total requests processed since start
custom_metrics (object): Agent-specific business metrics
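A rough sketch of how an agent might gather these values in Node.js before reporting them. The cpu_percent figure is approximated from process.cpuUsage deltas, and the AGENT_UUID environment variable is an assumption; submitting the payload is left to an assumed reporting helper used later:

let requestsHandled = 0;              // increment this wherever requests are served
let lastCpu = process.cpuUsage();
let lastSampleAt = Date.now();

function collectMetrics() {
  const cpuDelta = process.cpuUsage(lastCpu);   // CPU time since last sample (microseconds)
  const elapsedMs = Date.now() - lastSampleAt;
  lastCpu = process.cpuUsage();
  lastSampleAt = Date.now();

  return {
    agent_uuid: process.env.AGENT_UUID,
    // Approximate CPU percentage of one core over the sampling window.
    cpu_percent: Math.min(100, Math.round(((cpuDelta.user + cpuDelta.system) / 1000) / elapsedMs * 100)),
    memory_mb: Math.round(process.memoryUsage().rss / (1024 * 1024)),
    requests_handled: requestsHandled,
    custom_metrics: {
      queue_depth: 0                             // replace with agent-specific values
    },
    timestamp: new Date().toISOString()
  };
}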
Viewing Metrics
Retrieve recent metrics via API:
curl https://plugged.in/api/agents/AGENT_UUID \
  -H "Authorization: Bearer $API_KEY" \
  | jq '.recentMetrics'
Response:
[
  {
    "id": "2005",
    "agent_uuid": "...",
    "cpu_percent": 23,
    "memory_mb": 384,
    "requests_handled": 152,
    "custom_metrics": {
      "queue_depth": 5
    },
    "timestamp": "2025-11-13T08:30:00Z"
  }
]
Metric Collection Frequency
Recommended: 60 seconds
Unlike heartbeats, metrics can be sent less frequently:
More frequent = finer granularity, higher storage
Less frequent = reduced load, coarser data
Adaptive Frequency: Send metrics more frequently during high activity and less frequently during idle periods to optimize storage.
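One sketch of the adaptive approach: keep the recommended 60-second cadence while the agent is busy and stretch it out when idle. The thresholds are illustrative, and sendMetrics is an assumed reporting helper, not a documented API:

// Choose the next reporting delay from current activity.
function nextMetricsIntervalMs(metrics) {
  const busy = metrics.cpu_percent > 50 ||
               (metrics.custom_metrics?.queue_depth || 0) > 0;
  return busy ? 60 * 1000 : 5 * 60 * 1000;   // 60s while busy, 5min while idle
}

async function metricsLoop() {
  const metrics = collectMetrics();           // from the earlier sketch
  await sendMetrics(metrics);                 // assumed reporting helper
  setTimeout(metricsLoop, nextMetricsIntervalMs(metrics));
}

metricsLoop();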
Alerting on Metrics
High CPU Alert:
if (metrics.cpu_percent > 80) {
  alert('Agent CPU usage critical: ' + metrics.cpu_percent + '%');
}
Memory Leak Detection:
// Check whether memory consistently increases across recent samples.
// recentMetrics is assumed to be newest-first (like recentHeartbeats above),
// so reverse it into chronological order before checking the trend.
const recentMetrics = agent.recentMetrics.slice(0, 10).reverse();
const memoryTrend = recentMetrics.every((m, i) =>
  i === 0 || m.memory_mb > recentMetrics[i - 1].memory_mb
);

if (memoryTrend) {
  alert('Possible memory leak detected');
}
Logs: The Event Stream
Purpose : Detailed event logs for debugging and audit.
Log Access (Current)
For now, server administrators can access logs via kubectl:
# Get recent logs
kubectl logs deployment/AGENT_NAME -n agents --tail=100 --timestamps
# Follow logs in real-time
kubectl logs deployment/AGENT_NAME -n agents -f
# Get logs from previous container (after crash)
kubectl logs deployment/AGENT_NAME -n agents --previous
Log API (Coming Soon)
Future API endpoint for log retrieval:
curl "https://plugged.in/api/agents/AGENT_UUID/logs?tail=100" \
  -H "Authorization: Bearer $API_KEY"
Structured Logging Best Practices
Use structured JSON logs:
console.log(JSON.stringify({
  level: 'info',
  timestamp: new Date().toISOString(),
  agent_uuid: 'agent-uuid',
  event: 'request_processed',
  duration_ms: 45,
  status: 200,
  trace_id: 'trace-123'
}));
Benefits:
Parseable by log aggregators
Searchable by field
Compatible with OpenTelemetry
Log Levels
error: Failures requiring attention (e.g., "Database connection failed")
warn: Potential issues (e.g., "Retry attempt 3 of 5")
info: Normal operations (e.g., "Request completed successfully")
debug: Detailed debugging (e.g., "Cache lookup: key=foo, hit=true")
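To keep every entry consistent with the structure and levels above, it can help to route all logging through a small wrapper. A minimal sketch (the LOG_LEVEL threshold and AGENT_UUID variable are illustrative conventions, not PAP requirements):

const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
const threshold = LEVELS[process.env.LOG_LEVEL || 'info'];

function log(level, event, fields = {}) {
  if (LEVELS[level] > threshold) return;     // drop entries below the configured level
  console.log(JSON.stringify({
    level,
    timestamp: new Date().toISOString(),
    agent_uuid: process.env.AGENT_UUID,
    event,
    ...fields                                // e.g. duration_ms, status, trace_id
  }));
}

// Usage
log('info', 'request_processed', { duration_ms: 45, status: 200, trace_id: 'trace-123' });
log('debug', 'cache_lookup', { key: 'foo', hit: true });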
Kubernetes-Level Monitoring
Pod Health
Check pod status:
kubectl get pods -n agents -l app=AGENT_NAME
Healthy pod:
NAME READY STATUS RESTARTS AGE
my-agent-65bf69d8c-abc123 1/1 Running 0 5m
Unhealthy indicators:
CrashLoopBackOff: Pod keeps crashing
ImagePullBackOff: Cannot pull container image
Pending: Not scheduled (quota/resources)
RESTARTS > 0: Pod restarted (check logs)
Resource Usage
Check actual resource consumption:
kubectl top pod -n agents -l app=AGENT_NAME
Output:
NAME CPU(cores) MEMORY(bytes)
my-agent-65bf69d8c-abc123 23m 384Mi
Events
View recent Kubernetes events:
kubectl get events -n agents \
  --field-selector involvedObject.name=AGENT_NAME \
  --sort-by='.lastTimestamp'
Distributed Tracing
PAP supports OpenTelemetry for distributed tracing across agents and tools.
Trace Context Propagation
All PAP messages carry:
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
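As a sketch of how an agent might join the upstream trace, the incoming trace_id and span_id can be rebuilt into a W3C traceparent header and extracted with the OpenTelemetry API (the message shape follows the example above; the trailing 01 flag marks the parent as sampled):

const { context, propagation } = require('@opentelemetry/api');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');

// Use the standard W3C trace-context propagator for extraction.
propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// Rebuild a traceparent header from the PAP message fields and extract it
// into an OpenTelemetry context, so spans started under that context become
// children of the upstream span.
function contextFromPapMessage(message) {
  const carrier = { traceparent: `00-${message.trace_id}-${message.span_id}-01` };
  return propagation.extract(context.active(), carrier);
}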
Agents can then wrap downstream work, such as MCP tool calls, in child spans (SpanStatusCode and the tracer come from @opentelemetry/api):

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('pap-agent');

// Agent makes an MCP tool call wrapped in a span
const span = tracer.startSpan('mcp.invoke', {
  attributes: {
    'mcp.server': 'filesystem',
    'mcp.tool': 'read_file',
    'mcp.arguments': JSON.stringify({ path: '/docs/file.pdf' })
  }
});

try {
  const result = await mcpClient.invoke('filesystem', 'read_file', {
    path: '/docs/file.pdf'
  });
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
} finally {
  span.end();
}
Viewing Traces (Future)
Integration with Jaeger/Zipkin/Tempo for trace visualization showing:
Agent request flow
Tool invocations
Database queries
External API calls
Alerting Strategies
Critical Alerts
Agent Unhealthy:
if (Date.now() - new Date(agent.last_heartbeat_at).getTime() > 45000 && agent.state === 'ACTIVE') {
  pagerDuty.alert('Agent UNHEALTHY: ' + agent.name);
}
Agent Killed:
const lastEvent = lifecycleEvents.latest();
if (lastEvent.to_state === 'KILLED') {
  slack.alert('Agent KILLED: ' + agent.name + ' - Reason: ' + lastEvent.metadata.kill_reason);
}
Resource Exhaustion:
if (metrics.memory_mb > 0.9 * limits.memory_mb) {
  alert('Agent approaching memory limit');
}
Warning Alerts (Monitor)
High Error Rate:
// Average reported errors per metrics sample over the recent window
const errorRate = recentMetrics.reduce((sum, m) =>
  sum + (m.custom_metrics?.error_count || 0), 0
) / recentMetrics.length;

if (errorRate > 0.05) {
  warn('Agent error rate elevated: ' + (errorRate * 100) + '%');
}
Slow Response Times:
if (metrics.custom_metrics?.avg_response_ms > 1000) {
  warn('Agent response time degraded');
}
State Changes:
lifecycleEvents.forEach(event => {
  if (event.timestamp > lastCheck) {
    log(`Agent ${agent.name}: ${event.from_state} → ${event.to_state}`);
  }
});
Dashboard Recommendations
Agent Health Dashboard
Metrics to Display:
Agent Count by State
ACTIVE count (green)
PROVISIONED count (yellow)
TERMINATED count (grey)
KILLED count (red)
Heartbeat Status
Agents with recent heartbeat (< 1min ago)
Agents with stale heartbeat (1-2min ago)
Agents missing heartbeat (> 2min ago)
Resource Utilization
CPU usage histogram
Memory usage histogram
Agents approaching limits
Request Throughput
Requests per minute per agent
Error rate per agent
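A sketch of how the "Agent Count by State" panel above might be fed. Note that the /api/agents list route used here is an assumption; only the single-agent endpoint is documented above:

// Aggregate agent counts by state for the health dashboard.
async function agentCountsByState() {
  // Assumed list endpoint returning an array of agents.
  const res = await fetch('https://plugged.in/api/agents', {
    headers: { 'Authorization': `Bearer ${process.env.API_KEY}` }
  });
  const agents = await res.json();

  return agents.reduce((counts, agent) => {
    counts[agent.state] = (counts[agent.state] || 0) + 1;
    return counts;
  }, {});   // e.g. { ACTIVE: 12, PROVISIONED: 2, KILLED: 1 }
}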
Individual Agent Dashboard
Sections:
Status Overview
Current state
Uptime
Last heartbeat time
Resource Graphs (Time Series)
CPU usage over time
Memory usage over time
Throughput (Time Series)
Requests handled per minute
Error rate
Recent Events
Lifecycle events (last 24h)
Log errors (last 1h)
Kubernetes Health
Pod status
Restart count
Resource quota usage
Prometheus
Metrics Endpoint (Future):
curl https://AGENT_NAME.is.plugged.in/metrics
Example output:
# HELP agent_cpu_percent Agent CPU usage
# TYPE agent_cpu_percent gauge
agent_cpu_percent{agent="my-agent"} 23
# HELP agent_memory_mb Agent memory usage
# TYPE agent_memory_mb gauge
agent_memory_mb{agent="my-agent"} 384
# HELP agent_requests_total Total requests handled
# TYPE agent_requests_total counter
agent_requests_total{agent="my-agent"} 152
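Until that hosted endpoint ships, an agent could expose the same series itself with the prom-client library. A sketch under that assumption (metric names mirror the example output above; port 9464 and the helper names are illustrative):

const client = require('prom-client');
const http = require('http');

// Series matching the example output above.
const cpuGauge = new client.Gauge({ name: 'agent_cpu_percent', help: 'Agent CPU usage', labelNames: ['agent'] });
const memGauge = new client.Gauge({ name: 'agent_memory_mb', help: 'Agent memory usage', labelNames: ['agent'] });
const reqCounter = new client.Counter({ name: 'agent_requests_total', help: 'Total requests handled', labelNames: ['agent'] });

const AGENT = 'my-agent';

// Call these from the agent's own sampling and request-handling paths.
function recordSample(cpuPercent, memoryMb) {
  cpuGauge.set({ agent: AGENT }, cpuPercent);
  memGauge.set({ agent: AGENT }, memoryMb);
}
function recordRequest() {
  reqCounter.inc({ agent: AGENT });
}

// Expose /metrics for Prometheus to scrape.
http.createServer(async (req, res) => {
  if (req.url !== '/metrics') { res.statusCode = 404; return res.end(); }
  res.setHeader('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
}).listen(9464);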
Grafana
Import Plugged.in Agent dashboard (future):
# Dashboard JSON available at
curl https://plugged.in/grafana-dashboards/agents.json
Datadog / New Relic
Configure the agent to export traces to an APM backend:
// OpenTelemetry exporter
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { DatadogExporter } = require('@opentelemetry/exporter-datadog');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new DatadogExporter({
  agentUrl: 'http://datadog-agent:8126'
})));
provider.register();
Troubleshooting with Monitoring Data
Problem: Agent Not Processing Requests
Diagnosis:
Check heartbeats: Are they arriving?
Check metrics: Is CPU/memory normal?
Check logs: Any errors?
Check Kubernetes: Is pod healthy?
Resolution Path:
# 1. Check state
curl https://plugged.in/api/agents/UUID \
  -H "Authorization: Bearer $API_KEY" | jq '.agent.state'
# 2. Check heartbeats
curl https://plugged.in/api/agents/UUID \
  -H "Authorization: Bearer $API_KEY" | jq '.recentHeartbeats[0]'
# 3. Check pod status
kubectl get pod -n agents -l app=AGENT_NAME
# 4. Check logs
kubectl logs deployment/AGENT_NAME -n agents --tail=50
Problem: High CPU Usage
Diagnosis:
Check metrics: CPU consistently > 80%?
Check logs: Infinite loop? Expensive operation?
Check traces: Which operations are slow?
Resolution:
Optimize hot code paths
Increase CPU limit
Scale horizontally (future)
Problem: Memory Leak
Diagnosis:
Check metrics: Memory steadily increasing?
Check logs: Out of memory errors?
Take heap snapshot (future feature)
Resolution:
Identify leaking resources
Fix and redeploy
Restart agent as temporary fix
Best Practices Summary
DO: Keep heartbeats lightweight (mode + uptime only)
DO: Send comprehensive metrics on a separate channel
DO: Use structured JSON logging
DO: Include trace_id in all logs
DO: Set up alerts for missed heartbeats
DON’T: Mix metrics with heartbeats
DON’T: Use EMERGENCY mode for normal operations
DON’T: Log sensitive data (passwords, tokens)
Next Steps