Agent Lifecycle Management
PAP agents follow a normative state machine defined in PAP-RFC-001 v1.0. Understanding these states and transitions is crucial for effective agent management.
The Normative State Machine
Normative: These states and transitions are protocol-mandated. Invalid transitions are rejected to maintain system integrity.
State Definitions
NEW
Initial state when an agent is created in the database but not yet deployed.
Characteristics:
- Agent record exists in database
- No Kubernetes resources yet
- No heartbeats
- No external access
Transitions to:
- PROVISIONED (automatic upon successful K8s deployment)
- TERMINATED (if user deletes before provisioning completes)
PROVISIONED
Infrastructure ready: Kubernetes resources are deployed and the agent is waiting for its first heartbeat.
Characteristics:
- Kubernetes Deployment created
- Service and Ingress configured
- TLS certificate provisioning (Let’s Encrypt)
- Pod starting up
- Agent initializing but not yet healthy
Transitions to:
- ACTIVE (upon first heartbeat)
- TERMINATED (if user deletes)
- KILLED (if provisioning fails repeatedly)
What happens during this state:
- Container image pulling
- Application startup
- Health probes initializing
- Heartbeat mechanism activating
ACTIVE
Running and healthy: the agent is operational and accepting requests.
Characteristics:
- Receiving regular heartbeats
- Emitting metrics
- Accessible via https://{agent}.is.plugged.in
- TLS certificate active
- Processing requests
Transitions to:
- DRAINING (graceful shutdown initiated)
- TERMINATED (immediate user deletion)
- KILLED (Station decision: zombie, policy violation, etc.)
Heartbeat modes:
- IDLE mode: Every 30 seconds (default)
- EMERGENCY mode: Every 5 seconds (critical operations)
- SLEEP mode: Every 15 minutes (background tasks)
DRAINING
Gracefully shutting down: the agent is completing in-flight work before termination.
Characteristics:
- No new requests accepted
- Existing requests being completed
- Still sending heartbeats
- Still emitting metrics
- Traefik removes from routing
Transitions to:
- TERMINATED (after drain complete)
- KILLED (if drain timeout exceeded)
Examples of in-flight work:
- Completing multi-step workflows
- Finishing file uploads/downloads
- Flushing buffered data
- Saving state
Current Implementation: The DRAINING state is defined in the protocol but not yet exposed in the API. It is planned for a future release.
TERMINATED
Cleanly shut down: user-initiated termination completed successfully.
Characteristics:
- All Kubernetes resources deleted
- TLS certificate removed
- Agent no longer accessible
- Heartbeats stopped
- Record preserved for audit
Data retained after termination:
- Agent record remains in database
- Lifecycle events preserved
- Historical metrics/heartbeats preserved (per retention policy)
KILLED
Forcefully terminated: the Station (control plane) forcibly terminated the agent.
Characteristics:
- Control plane decision (NOT user-initiated)
- Immediate termination (no drain period)
- Resources forcefully removed
- Indicates policy violation or zombie
Common reasons:
- Zombie detected (missed heartbeats)
- Policy violation (resource abuse, security breach)
- Compliance requirement
- Emergency station action
State Transitions
Valid Transitions
| From | To | Trigger | Automatic? |
|---|---|---|---|
| NEW | PROVISIONED | K8s deployment created | ✅ Yes |
| NEW | TERMINATED | User deletes during provisioning | No |
| PROVISIONED | ACTIVE | First heartbeat received | ✅ Yes |
| PROVISIONED | KILLED | Provisioning timeout or failure | ✅ Yes |
| PROVISIONED | TERMINATED | User deletes | No |
| ACTIVE | DRAINING | Graceful shutdown initiated | No (future) |
| ACTIVE | TERMINATED | User deletes | No |
| ACTIVE | KILLED | Station decision (zombie, policy) | ✅ Yes |
| DRAINING | TERMINATED | Drain completed | ✅ Yes |
| DRAINING | KILLED | Drain timeout exceeded | ✅ Yes |
Invalid Transitions
These transitions are rejected by the protocol:
❌ TERMINATED → any state (cannot revive)
❌ KILLED → any state (cannot revive)
❌ NEW → ACTIVE (must go through PROVISIONED)
❌ PROVISIONED → DRAINING (must be ACTIVE first)
❌ DRAINING → ACTIVE (drain is one-way)
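The two lists above can be encoded as a lookup table that a client checks before it requests a state change. This is a minimal sketch, not the Station's actual implementation; the names `AgentState`, `VALID_TRANSITIONS`, `is_valid_transition`, and `is_terminal` are illustrative assumptions.

```python
from enum import Enum


class AgentState(Enum):
    NEW = "NEW"
    PROVISIONED = "PROVISIONED"
    ACTIVE = "ACTIVE"
    DRAINING = "DRAINING"
    TERMINATED = "TERMINATED"
    KILLED = "KILLED"


# Allowed target states per source state, copied from the Valid Transitions table.
VALID_TRANSITIONS = {
    AgentState.NEW: {AgentState.PROVISIONED, AgentState.TERMINATED},
    AgentState.PROVISIONED: {AgentState.ACTIVE, AgentState.KILLED, AgentState.TERMINATED},
    AgentState.ACTIVE: {AgentState.DRAINING, AgentState.TERMINATED, AgentState.KILLED},
    AgentState.DRAINING: {AgentState.TERMINATED, AgentState.KILLED},
    AgentState.TERMINATED: set(),  # terminal: cannot revive
    AgentState.KILLED: set(),      # terminal: cannot revive
}


def is_valid_transition(current: AgentState, target: AgentState) -> bool:
    """Return True if the table allows moving from `current` to `target`."""
    return target in VALID_TRANSITIONS[current]


def is_terminal(state: AgentState) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[state]
```

For example, `is_valid_transition(AgentState.NEW, AgentState.ACTIVE)` is false because the agent must pass through PROVISIONED first.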
Lifecycle Events
All state transitions are logged immutably in the agent_lifecycle_events table.
Event Structure
Event Types
| Event Type | Description | Typical Trigger |
|---|---|---|
| CREATED | Agent created | User API call |
| PROVISIONED | K8s resources deployed | Automatic |
| ACTIVATED | First heartbeat received | Automatic |
| DRAINING_STARTED | Graceful shutdown initiated | User or Station |
| TERMINATED | Clean shutdown completed | Automatic after drain or user deletion |
| KILLED | Forceful termination | Station decision |
| ERROR | Error occurred (non-state-changing) | Various |
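Because every state change produces an event, an agent's current state can in principle be reconstructed by replaying its event types in order. The sketch below assumes a one-to-one mapping from event type to resulting state, inferred from the descriptions in the table rather than taken from the protocol. `EVENT_TO_STATE` and `state_after` are hypothetical names.

```python
from typing import Iterable, Optional

# Assumed mapping from event type to the state the agent is in afterwards.
# ERROR is deliberately absent because it does not change state.
EVENT_TO_STATE = {
    "CREATED": "NEW",
    "PROVISIONED": "PROVISIONED",
    "ACTIVATED": "ACTIVE",
    "DRAINING_STARTED": "DRAINING",
    "TERMINATED": "TERMINATED",
    "KILLED": "KILLED",
}


def state_after(event_types: Iterable[str]) -> Optional[str]:
    """Return the state implied by an ordered sequence of event types.

    Returns None when the sequence contains no state-changing event.
    """
    state: Optional[str] = None
    for event_type in event_types:
        # Event types missing from the mapping (such as ERROR) keep the state as-is.
        state = EVENT_TO_STATE.get(event_type, state)
    return state
```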
Retrieving Lifecycle Events
Zombie Detection
Zombie detection is PAP’s killer feature: it keeps non-functional agents from lingering unnoticed.
What is a Zombie?
A zombie agent is one that appears to be running (Kubernetes pod healthy) but is actually non-functional (not processing requests, not responsive).
Causes:
- Application deadlock
- Network partition from control plane
- Resource starvation (CPU throttling)
- Infinite loop in application code
Detection Algorithm
An agent is declared a zombie when no heartbeat has been received for 1.5x its heartbeat interval:
- IDLE mode (30s heartbeat): Zombie after 45 seconds
- EMERGENCY mode (5s heartbeat): Zombie after 7.5 seconds
- SLEEP mode (15min heartbeat): Zombie after 22.5 minutes
Why 1.5x? It tolerates a heartbeat that arrives up to half an interval late (for example, due to transient network issues), but still catches real zombies quickly.
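The thresholds above follow directly from the heartbeat intervals. The sketch below reproduces the arithmetic. The function names are illustrative, and the use of a strict "greater than" comparison at the boundary is an assumption, since the source does not say how an exact match is handled.

```python
# Heartbeat intervals per mode, in seconds, as listed in this document.
HEARTBEAT_INTERVAL_SECONDS = {
    "IDLE": 30,
    "EMERGENCY": 5,
    "SLEEP": 15 * 60,
}

# Silence longer than this multiple of the interval marks the agent as a zombie.
ZOMBIE_FACTOR = 1.5


def zombie_threshold(mode: str) -> float:
    """Seconds of heartbeat silence after which an agent in `mode` is a zombie."""
    return HEARTBEAT_INTERVAL_SECONDS[mode] * ZOMBIE_FACTOR


def is_zombie(mode: str, last_heartbeat: float, now: float) -> bool:
    """Compare elapsed time since the last heartbeat against the threshold.

    `last_heartbeat` and `now` are timestamps in seconds on the same clock.
    """
    return (now - last_heartbeat) > zombie_threshold(mode)
```

With these values, `zombie_threshold("SLEEP")` is 1350 seconds, which is the 22.5 minutes quoted above.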
Preventing Zombies
Best Practices:
- Robust heartbeat emission
- Monitor event loop
- Resource limits:
  - Set appropriate CPU/memory limits
  - Monitor for throttling
  - Add alerts for resource exhaustion
- Separate heartbeat thread (advanced):
  - Dedicated thread/process for heartbeats
  - Independent of main application logic
  - Guarantees heartbeat even if app hangs
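The dedicated-thread practice can be sketched with the standard library alone. `HeartbeatEmitter` is a hypothetical helper, not part of any PAP SDK, and `send` stands in for whatever function actually delivers a heartbeat to the Station. The 30-second default mirrors the IDLE interval.

```python
import threading
from typing import Callable


class HeartbeatEmitter:
    """Calls `send` every `interval_seconds` on a background thread."""

    def __init__(self, send: Callable[[], None], interval_seconds: float = 30.0) -> None:
        self._send = send
        self._interval = interval_seconds
        self._stopped = threading.Event()
        # daemon=True so a forgotten emitter never blocks interpreter exit.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        """Signal the loop to exit and wait for the thread to finish."""
        self._stopped.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stopped.is_set():
            try:
                self._send()
            except Exception:
                # A failed send must not end the loop; retry on the next tick.
                pass
            # Event.wait doubles as an interruptible sleep: stop() wakes it early.
            self._stopped.wait(self._interval)
```

One design trade-off: a heartbeat that is fully independent of the application keeps reporting even when the main logic is stuck, so this pattern works best alongside the event-loop monitoring listed above.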
State Management Best Practices
1. Poll for State Changes
After creating an agent, poll its state until it reaches ACTIVE.
2. Handle Terminal States
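Polling and terminal-state handling fit naturally into one loop: keep checking until the agent is ACTIVE, but stop immediately if it lands in TERMINATED or KILLED, since neither state can be left. In this sketch `get_state` is a hypothetical callable that returns the agent's current state as a string (for instance by calling the API), and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable

# States with no outgoing transitions; waiting longer cannot help.
TERMINAL_STATES = {"TERMINATED", "KILLED"}


def wait_until_active(
    get_state: Callable[[], str],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 5.0,
) -> str:
    """Poll `get_state` until it returns ACTIVE.

    Raises RuntimeError if the agent reaches a terminal state first,
    and TimeoutError if it is still not ACTIVE when the timeout expires.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(f"agent reached terminal state {state} before becoming ACTIVE")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"agent still in state {state} after {timeout_seconds}s")
        time.sleep(poll_interval_seconds)
```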
3. Monitor State Transitions
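One simple way to monitor transitions from the client side is to compare each observed state with the previous one and record every change. The helper below is a hypothetical sketch that works on any sequence of observed state strings, however they were obtained.

```python
from typing import Iterable, List, Tuple


def detect_transitions(observed_states: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (from_state, to_state) pairs for each change in the sequence.

    Repeated observations of the same state produce no entry.
    """
    transitions: List[Tuple[str, str]] = []
    previous = None
    for state in observed_states:
        if previous is not None and state != previous:
            transitions.append((previous, state))
        previous = state
    return transitions
```

An unexpected pair such as ACTIVE to KILLED is a natural trigger for an alert.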
4. Graceful Shutdown (Future)
Troubleshooting State Issues
Agent Stuck in NEW
Symptoms: Agent created but never reaches PROVISIONED.
Causes:
- Kubernetes API unavailable
- Quota exceeded in agents namespace
- Invalid image specified
Agent Stuck in PROVISIONED
Symptoms: Infrastructure deployed but agent never becomes ACTIVE.
Causes:
- Application not starting
- Heartbeat endpoint misconfigured
- Pod crashlooping
Unexpected KILLED State
Symptoms: Agent killed without user action.
Causes:
- Zombie detection (missed heartbeats)
- Policy violation
- Resource limits exceeded

