Monitoring & Observability
Effective monitoring is crucial for maintaining healthy autonomous agents. PAP provides comprehensive observability through heartbeats, metrics, logs, and distributed tracing.
The Three Pillars of Agent Observability
💓 Heartbeats
Liveness Signals
Lightweight health checks proving agent is alive and responsive
📊 Metrics
Resource Telemetry
CPU, memory, requests, and custom business metrics
📝 Logs
Event Streams
Structured logs for debugging and audit
Heartbeats: The Liveness Channel
Purpose: Prove the agent is alive and responsive.
Heartbeat Structure
Heartbeats contain only:
- mode: EMERGENCY, IDLE, or SLEEP
- uptime_seconds: How long the agent has been running
- timestamp: When the heartbeat was sent
Heartbeats never contain:
- ❌ CPU usage
- ❌ Memory usage
- ❌ Request counts
- ❌ Any resource or business metrics
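The three fields above can be modelled as a small, validated payload. The following is a minimal sketch, not PAP's actual wire format: the `Heartbeat` class, the `build_heartbeat` helper, and the ISO 8601 UTC timestamp encoding are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, asdict

# Modes named in this document.
VALID_MODES = {"EMERGENCY", "IDLE", "SLEEP"}


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat payload: liveness fields only, no metrics."""
    mode: str
    uptime_seconds: int
    timestamp: str  # ISO 8601 in UTC; the exact encoding is an assumption

    def __post_init__(self):
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown heartbeat mode: {self.mode!r}")
        if self.uptime_seconds < 0:
            raise ValueError("uptime_seconds must be non-negative")


def build_heartbeat(mode, started_at, now=None):
    """Build a heartbeat from the agent start time (epoch seconds)."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return Heartbeat(mode=mode, uptime_seconds=int(now - started_at), timestamp=stamp)


heartbeat = build_heartbeat("IDLE", started_at=1000.0, now=1090.0)
payload = asdict(heartbeat)  # dict with exactly three keys
```

Keeping the payload to these three fields means a heartbeat can be produced without touching any resource accounting, which is what keeps it cheap to send.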
Heartbeat Modes
IDLE Mode (Default)
Interval: 30 seconds
Use Case: Normal operation
EMERGENCY Mode
Interval: 5 seconds
Use Case: Critical operations requiring aggressive monitoring
SLEEP Mode
Interval: 15 minutes
Use Case: Background agents with low-priority work
SLEEP Mode Benefits: Reduces control plane load for infrequently-used agents. Perfect for scheduled report generators or monitoring agents that only act occasionally.
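To make the load difference between modes concrete, the intervals above can be put in a lookup table and converted to heartbeats per hour. This is a sketch only; the names `HEARTBEAT_INTERVAL_SECONDS`, `next_heartbeat_at`, and `heartbeats_per_hour` are hypothetical and not part of PAP.

```python
# Intervals taken from the mode descriptions above, in seconds.
HEARTBEAT_INTERVAL_SECONDS = {
    "EMERGENCY": 5,
    "IDLE": 30,
    "SLEEP": 15 * 60,
}


def next_heartbeat_at(last_sent: float, mode: str) -> float:
    """Epoch time at which the next heartbeat is due for the given mode."""
    return last_sent + HEARTBEAT_INTERVAL_SECONDS[mode]


def heartbeats_per_hour(mode: str) -> int:
    """How many heartbeats one agent sends per hour in the given mode."""
    return 3600 // HEARTBEAT_INTERVAL_SECONDS[mode]


for mode in ("EMERGENCY", "IDLE", "SLEEP"):
    print(mode, heartbeats_per_hour(mode))
```

By this arithmetic an agent in SLEEP mode sends 4 heartbeats per hour, against 120 in IDLE and 720 in EMERGENCY.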
Viewing Heartbeats
Recent heartbeats can be retrieved via the API.
Heartbeat Health Check
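A health check reduces to comparing the age of the most recent heartbeat against thresholds. The sketch below borrows the buckets that the Agent Health Dashboard section of this document uses (under 1 minute, 1 to 2 minutes, over 2 minutes); the function name and the returned labels are assumptions, not PAP API.

```python
from datetime import datetime, timedelta, timezone


def classify_heartbeat(last_seen: datetime, now: datetime) -> str:
    """Bucket an agent by the age of its most recent heartbeat."""
    age = now - last_seen
    if age < timedelta(minutes=1):
        return "recent"
    if age <= timedelta(minutes=2):
        return "stale"
    return "missing"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
status = classify_heartbeat(now - timedelta(seconds=90), now)
```

Fixed buckets like these suit the 30-second IDLE interval. An agent in SLEEP mode reports every 15 minutes, so a real check would presumably scale the thresholds to the agent's current mode.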
Metrics: The Resource Channel
Purpose: Monitor resource usage and business metrics.
Separation is Key: Metrics are sent on a completely separate channel from heartbeats. This separation is PAP’s superpower for zombie prevention.
Metrics Structure
Standard Metrics
| Metric | Type | Description |
|---|---|---|
| cpu_percent | integer | CPU usage (0-100) |
| memory_mb | integer | Memory usage in megabytes |
| requests_handled | integer | Total requests processed since start |
| custom_metrics | object | Agent-specific business metrics |
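The standard metrics in the table map naturally onto a typed record. This is an illustrative sketch under the assumption that the payload is a flat object with these four fields; the `Metrics` class and its validation rules are hypothetical, and the `reports_generated` custom metric is an invented example.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Metrics:
    """Hypothetical metrics payload mirroring the Standard Metrics table."""
    cpu_percent: int
    memory_mb: int
    requests_handled: int
    custom_metrics: dict = field(default_factory=dict)

    def __post_init__(self):
        if not 0 <= self.cpu_percent <= 100:
            raise ValueError("cpu_percent must be between 0 and 100")
        if self.memory_mb < 0 or self.requests_handled < 0:
            raise ValueError("memory_mb and requests_handled must be non-negative")


sample = Metrics(
    cpu_percent=42,
    memory_mb=256,
    requests_handled=1500,
    custom_metrics={"reports_generated": 12},  # example business metric
)
metrics_payload = asdict(sample)
```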
Viewing Metrics
Recent metrics can be retrieved via the API.
Metric Collection Frequency
Recommended: 60 seconds
Unlike heartbeats, metrics can be sent less frequently. The interval is a trade-off:
- More frequent = finer granularity, higher storage
- Less frequent = reduced load, coarser data
Alerting on Metrics
High CPU Alert:
Logs: The Event Stream
Purpose: Detailed event logs for debugging and audit.
Log Access (Current)
For now, server administrators can access logs via kubectl.
Log API (Coming Soon)
A future API endpoint will provide log retrieval.
Structured Logging Best Practices
Use structured JSON logs. They are:
- Parseable by log aggregators
- Searchable by field
- Compatible with OpenTelemetry
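One way to emit structured JSON logs is a custom formatter on Python's standard `logging` module. The sketch below is an illustration, not PAP's logging implementation: the field names (`timestamp`, `level`, `message`, `logger`, `trace_id`) and the `make_logger` helper are assumptions, and the level names follow the Log Levels table in this document.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Map stdlib levels onto the level names used in this document.
LEVEL_NAMES = {
    logging.ERROR: "error",
    logging.WARNING: "warn",
    logging.INFO: "info",
    logging.DEBUG: "debug",
}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "message": record.getMessage(),
            "logger": record.name,
        }
        # trace_id is attached per call through the `extra` argument.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            entry["trace_id"] = trace_id
        return json.dumps(entry)


def make_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


log = make_logger("agent.example")
log.info("Request completed successfully", extra={"trace_id": "abc123"})
```

Passing `trace_id` through `extra` keeps call sites short while still making every line searchable by trace.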
Log Levels
| Level | Use Case | Example |
|---|---|---|
| error | Failures requiring attention | Database connection failed |
| warn | Potential issues | Retry attempt 3 of 5 |
| info | Normal operations | Request completed successfully |
| debug | Detailed debugging | Cache lookup: key=foo, hit=true |
Kubernetes-Level Monitoring
Pod Health
Check pod status:
- CrashLoopBackOff: Pod keeps crashing
- ImagePullBackOff: Cannot pull container image
- Pending: Not scheduled (quota/resources)
- RESTARTS > 0: Pod restarted (check logs)
Resource Usage
Check actual resource consumption.
Events
View recent Kubernetes events.
Distributed Tracing
PAP supports OpenTelemetry for distributed tracing across agents and tools.
Trace Context Propagation
All PAP messages carry trace context.
Tracing Agent-to-Tool Calls
Viewing Traces (Future)
Integration with Jaeger/Zipkin/Tempo for trace visualization showing:
- Agent request flow
- Tool invocations
- Database queries
- External API calls
Alerting Strategies
Critical Alerts (Immediate Response)
- Agent Unhealthy
- Agent Killed
- Resource Exhaustion
Warning Alerts (Monitor)
- High Error Rate
- Slow Response Times
Informational (Logging)
- State Changes
Dashboard Recommendations
Agent Health Dashboard
Metrics to Display:
- Agent Count by State
  - ACTIVE count (green)
  - PROVISIONED count (yellow)
  - TERMINATED count (grey)
  - KILLED count (red)
- Heartbeat Status
  - Agents with recent heartbeat (< 1min ago)
  - Agents with stale heartbeat (1-2min ago)
  - Agents missing heartbeat (> 2min ago)
- Resource Utilization
  - CPU usage histogram
  - Memory usage histogram
  - Agents approaching limits
- Request Throughput
  - Requests per minute per agent
  - Error rate per agent
Individual Agent Dashboard
Sections:
- Status Overview
  - Current state
  - Uptime
  - Last heartbeat time
- Resource Graphs (Time Series)
  - CPU usage over time
  - Memory usage over time
- Throughput (Time Series)
  - Requests handled per minute
  - Error rate
- Recent Events
  - Lifecycle events (last 24h)
  - Log errors (last 1h)
- Kubernetes Health
  - Pod status
  - Restart count
  - Resource quota usage
Monitoring Tools Integration
Prometheus
A metrics endpoint is planned (future).
Grafana
Import the Plugged.in Agent dashboard (future).
Datadog / New Relic
Configure the agent to export to your APM.
Troubleshooting with Monitoring Data
Problem: Agent Not Processing Requests
Diagnosis:
- Check heartbeats: Are they arriving?
- Check metrics: Is CPU/memory normal?
- Check logs: Any errors?
- Check Kubernetes: Is pod healthy?
Problem: High CPU Usage
Diagnosis:
- Check metrics: CPU consistently > 80%?
- Check logs: Infinite loop? Expensive operation?
- Check traces: Which operations are slow?
Solutions:
- Optimize hot code paths
- Increase CPU limit
- Scale horizontally (future)
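The "CPU consistently > 80%" check can be made precise by requiring several consecutive samples above the threshold, so that a single spike does not count. This is a sketch: the function name and the default of five consecutive samples are assumptions, chosen so that at the recommended 60-second metrics interval the window covers roughly five minutes.

```python
def cpu_consistently_high(samples, threshold=80, min_consecutive=5):
    """Return True if the latest `min_consecutive` cpu_percent samples all exceed `threshold`.

    `samples` is ordered oldest to newest.
    """
    if len(samples) < min_consecutive:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-min_consecutive:])


spiky = [30, 95, 40, 35, 90, 45]
sustained = [60, 85, 88, 91, 87, 93]
print(cpu_consistently_high(spiky), cpu_consistently_high(sustained))
```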
Problem: Memory Leak
Diagnosis:
- Check metrics: Memory steadily increasing?
- Check logs: Out of memory errors?
- Take heap snapshot (future feature)
Solutions:
- Identify leaking resources
- Fix and redeploy
- Restart agent as temporary fix
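"Memory steadily increasing" can be quantified by fitting a straight line through recent `memory_mb` samples and reading off the slope. The sketch below uses an ordinary least-squares fit; the function name is hypothetical, and the default 60-second spacing is taken from the recommended metrics interval.

```python
def memory_growth_mb_per_min(samples, interval_seconds=60):
    """Least-squares slope of memory_mb samples, in MB per minute.

    `samples` is ordered oldest to newest and evenly spaced.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    minutes = [i * interval_seconds / 60 for i in range(n)]
    mean_x = sum(minutes) / n
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(minutes, samples))
    variance = sum((x - mean_x) ** 2 for x in minutes)
    return covariance / variance


leaking = [100, 102, 104, 106]
steady = [200, 200, 200, 200]
growth = memory_growth_mb_per_min(leaking)
```

A slope that stays positive across a long window suggests a leak, while a slope near zero suggests stable usage. What counts as "too steep" depends on the agent's memory limit and is left to the operator.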
Best Practices Summary
DO: Keep heartbeats lightweight (mode + uptime only)
DO: Send comprehensive metrics on separate channel
DO: Use structured JSON logging
DO: Include trace_id in all logs
DO: Set up alerts for missed heartbeats
Next Steps
Lifecycle Management
Understand agent states and transitions
Architecture Deep Dive
Learn about PAP’s dual-profile design

