PAP Architecture: Technical Deep Dive
The Plugged.in Agent Protocol (PAP) establishes the physical and logical substrate for autonomous agent operation: how agents live, breathe, migrate, and die across infrastructure.
Protocol vs. Orchestration: PAP defines the substrate layer (lifecycle, heartbeats, infrastructure), while MCP/A2A handle orchestration logic (tool invocation, peer communication). PAP makes agent infrastructure a first-class concern.
Core Philosophy
“Autonomy without anarchy” - Agents operate independently yet remain under organizational governance through protocol-level controls.
System Components
Station (Control Plane)
The Station is Plugged.in’s central authority for agent management, currently implemented as the REST API at plugged.in/api/agents (a provisioning sketch follows the lists below).
Responsibilities:
- Lifecycle Authority: Exclusive rights to provision, activate, and kill agents
- Policy Enforcement: Resource quotas, security policies, compliance rules
- Zombie Detection: Monitors heartbeats and terminates unhealthy agents
- Routing: Directs requests to appropriate agent instances
- Audit Trail: Immutable logging of all lifecycle events
Planned:
- gRPC endpoints for PAP-CP protocol
- Distributed control plane for multi-region deployments
- Advanced scheduling and placement logic
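For concreteness, here is a minimal sketch of a provisioning call against the Station API. Only the endpoint path comes from this document; the request body fields, authorization header, and response shape are assumptions for illustration.

```python
# Hypothetical sketch of provisioning an agent through the Station's REST API.
# Only the endpoint path (plugged.in/api/agents) comes from this document; the
# request body fields, auth header scheme, and response shape are assumptions.
import requests

STATION_URL = "https://plugged.in/api/agents"
API_KEY = "pap_xxxxxxxx"  # user -> Station authentication is API-key based

def provision_agent(name: str, profile: str) -> dict:
    """Ask the Station to provision a new Satellite (assumed payload shape)."""
    response = requests.post(
        STATION_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed header scheme
        json={"name": name, "profile": profile},         # assumed fields
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # expected to include the agent id and lifecycle state

# provision_agent("research-assistant", "default")  # placeholder agent name
```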
Satellites (Agents)
Satellites are the autonomous agent instances running on infrastructure.
Characteristics:
- Self-contained: Each agent is a Kubernetes Deployment with Service and Ingress
- Self-healing: Kubernetes restarts failed pods automatically
- Telemetry emission: Separate heartbeat and metrics channels
- Protocol compliance: Implement PAP-CP for control and PAP-Hooks for I/O
- Station respect: Accept kill commands and policy mandates
Proxy (mcp.plugged.in)
The Proxy acts as a gateway for external access (future implementation).
Planned Responsibilities:
- TLS termination and certificate management
- Signature validation for PAP-CP messages
- Rate limiting and quota enforcement
- Request logging and traffic analysis
- DDoS protection and abuse prevention
Registry (Service Discovery)
DNS-based service discovery for agent location and routing (a resolution sketch follows the lists below).
Current Implementation:
- Pattern: {agent}.is.plugged.in
- DNS: BIND9 with wildcard records
- TLS: cert-manager with Let’s Encrypt
- Routing: Traefik SNI-based routing
Planned:
- DNSSEC for DNS security
- SRV records for load balancing
- Multi-region routing with GeoDNS
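A minimal sketch of client-side discovery under this pattern, assuming a hypothetical agent name; it relies only on standard DNS resolution.

```python
# Sketch of DNS-based discovery under the {agent}.is.plugged.in pattern; the
# agent name is a placeholder and resolution uses only standard DNS.
import socket

def discover_agent(agent_name: str) -> list[str]:
    """Resolve an agent hostname served by the BIND9 wildcard record."""
    hostname = f"{agent_name}.is.plugged.in"
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# discover_agent("research-assistant")  # placeholder name; Traefik routes by SNI on port 443
```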
Dual-Profile Architecture
PAP separates control plane operations from application I/O through two distinct profiles:
PAP-CP (Control Plane Profile)
Purpose: Infrastructure management and lifecycle control (see the signing sketch after the lists below)
Characteristics:
- Transport: gRPC over HTTP/2 with TLS 1.3 + mTLS
- Wire Format: Protocol Buffers v3
- Security: Ed25519 signatures REQUIRED, nonce-based replay protection
- Endpoint: grpc://pap.plugged.in:50051 (future)
Used for:
- Agent provisioning and termination
- Heartbeat reporting (liveness only!)
- Metrics reporting (separate channel)
- Lifecycle state transitions
- Policy enforcement
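A sketch of the PAP-CP signing requirement: every control-plane message carries an Ed25519 signature and a nonce for replay protection. The real wire format is Protocol Buffers over gRPC; the JSON serialization and field names below are illustrative assumptions only.

```python
# Illustrative sketch of signing a control-plane message with Ed25519 plus a
# nonce for replay protection. The actual PAP-CP wire format is Protocol Buffers
# over gRPC; JSON is used here only to keep the example self-contained.
import json
import os
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()  # in practice, the agent's provisioned key

def signed_heartbeat(mode: str, uptime_seconds: int) -> dict:
    body = {
        "mode": mode,
        "uptime_seconds": uptime_seconds,
        "nonce": os.urandom(16).hex(),   # never reused: defeats replay attacks
        "timestamp": int(time.time()),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    return {"body": body, "signature": private_key.sign(payload).hex()}

print(signed_heartbeat("IDLE", 3600))
```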
PAP-Hooks (Open I/O Profile)
Purpose: Tool invocations, MCP access, ecosystem integration (see the invocation sketch after the lists below)
Characteristics:
- Transport: JSON-RPC 2.0 over WebSocket or HTTP SSE
- Wire Format: UTF-8 JSON with schema validation
- Security: OAuth 2.1 with JWT RECOMMENDED
- Endpoint: wss://{agent}.{region}.a.plugged.in/hooks
Used for:
- MCP tool invocations
- A2A (Agent-to-Agent) delegation
- External API access
- Real-time event subscriptions
- Streaming responses
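A sketch of a PAP-Hooks invocation over WebSocket. The endpoint pattern is the one documented above; the JSON-RPC method name and parameters are assumptions modeled on MCP-style tool calls, and the OAuth 2.1 token handshake is omitted for brevity.

```python
# Sketch of a PAP-Hooks call: JSON-RPC 2.0 over a WebSocket connection to the
# per-agent endpoint. The method name ("tools/call") and its parameters are
# assumptions, not a confirmed PAP schema; the bearer-token auth step is omitted.
import asyncio
import json
import websockets  # pip install websockets

async def invoke_tool(agent: str, region: str) -> dict:
    url = f"wss://{agent}.{region}.a.plugged.in/hooks"
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/call",  # assumed method name
            "params": {"name": "web_search", "arguments": {"query": "PAP spec"}},
        }))
        return json.loads(await ws.recv())

# asyncio.run(invoke_tool("research-assistant", "eu"))  # placeholder agent and region
```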
The Zombie Prevention Superpower
PAP’s killer feature: strict heartbeat/metrics separation.
Problem: Control Plane Saturation
Traditional agent systems mix liveness signals with telemetry:
- Large payloads saturate control plane
- Network issues delay liveness signals
- Cannot be aggressive with zombie detection
- False positives from metric collection failures
Solution: Channel Separation
PAP enforces strict separation:
Heartbeat Channel (Liveness Only)
- Payload: ONLY mode and uptime_seconds
- Size: ~100 bytes
- Frequency: 5s (EMERGENCY), 30s (IDLE), 15min (SLEEP)
- Transport: Separate UDP/gRPC stream (future)
- FORBIDDEN: Any resource or business metrics
Benefits:
- Lightweight (no saturation risk)
- Fast transmission (predictable latency)
- Aggressive detection (one missed → unhealthy)
- No false positives
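For concreteness, a heartbeat carrying only the two permitted fields fits comfortably inside the ~100-byte budget. JSON is shown here for readability; the control-plane wire format is Protocol Buffers, and the interval constants mirror the frequencies listed above.

```python
# The complete, permitted heartbeat payload: mode and uptime_seconds, nothing else.
import json

HEARTBEAT_INTERVAL = {"EMERGENCY": 5, "IDLE": 30, "SLEEP": 900}  # seconds

heartbeat = {"mode": "IDLE", "uptime_seconds": 86400}
wire = json.dumps(heartbeat).encode()
assert len(wire) < 100  # well under the ~100-byte budget, so it cannot saturate the channel
print(len(wire), wire)
```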
Metrics Channel (Telemetry)
- Payload: All resource and business metrics
- Size: Unlimited
- Frequency: Independent (typically 60s)
- Transport: Separate HTTP/gRPC endpoint
- FORBIDDEN: Mixing with heartbeat channel
Benefits:
- Rich telemetry without control plane impact
- Can send large payloads safely
- Independent failure domains
- Buffering and batching allowed
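Because the metrics channel is independent, telemetry can be buffered and flushed in batches on its own schedule. The sketch below assumes a hypothetical metrics endpoint and field names; only the batching-and-buffering behavior comes from this document.

```python
# Sketch of the separate metrics channel: telemetry is buffered locally and
# flushed in batches on its own timer, so a slow or failed flush never delays
# a heartbeat. The endpoint URL and metric field names are assumptions.
import json
import time
import urllib.request

METRICS_ENDPOINT = "https://plugged.in/api/agents/metrics"  # assumed endpoint
_buffer: list[dict] = []

def record(name: str, value: float) -> None:
    _buffer.append({"name": name, "value": value, "ts": time.time()})

def flush() -> None:
    if not _buffer:
        return
    req = urllib.request.Request(
        METRICS_ENDPOINT,
        data=json.dumps(_buffer).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=10)
    except OSError:
        return  # keep buffering; a failed flush never affects the heartbeat channel
    _buffer.clear()

record("cpu_percent", 41.5)
record("tasks_completed", 12)
# flush() runs on its own ~60s timer, independent of the heartbeat loop
```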
Zombie Detection Algorithm
Detection timeout is 1.5x the heartbeat interval for each mode:
- EMERGENCY mode: 7.5 seconds
- IDLE mode: 45 seconds
- SLEEP mode: 22.5 minutes
Why 1.5x? It tolerates one missed heartbeat due to transient network issues while still catching actual zombies quickly.
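A sketch of the Station-side check under these numbers:

```python
# Sketch of the Station-side zombie check: the timeout is 1.5x the mode's
# heartbeat interval, which tolerates exactly one missed beat before the
# agent is marked AGENT_UNHEALTHY.
import time

HEARTBEAT_INTERVAL = {"EMERGENCY": 5, "IDLE": 30, "SLEEP": 900}      # seconds
TIMEOUT = {mode: 1.5 * i for mode, i in HEARTBEAT_INTERVAL.items()}  # 7.5s / 45s / 22.5min

def is_zombie(last_heartbeat_at: float, mode: str, now: float | None = None) -> bool:
    """True when the agent has missed more than one heartbeat for its mode."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_at) > TIMEOUT[mode]

# An IDLE agent silent for 50s has missed more than one 30s beat -> unhealthy.
print(is_zombie(last_heartbeat_at=0, mode="IDLE", now=50))  # True
```

Because the timeout tracks the mode, detection stays aggressive in EMERGENCY (7.5s) without flagging agents that legitimately report every 15 minutes in SLEEP.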
Infrastructure Layer
Kubernetes Deployment
Each agent consists of three Kubernetes resources (the Deployment is sketched after the list below):
1. Deployment
- Single replica (agents are stateful)
- Non-root execution (UID 1001)
- Capabilities dropped (minimal privileges)
- Resource limits enforced
2. Service
3. Ingress
- Automatic TLS via cert-manager
- Let’s Encrypt certificates
- SNI-based routing via Traefik
- Per-agent DNS hostname
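A condensed sketch of the Deployment portion using the official Kubernetes Python client. Only the properties listed above (single replica, non-root UID 1001, dropped capabilities, resource limits) are shown; the image name and limit values are placeholders, not documented settings.

```python
# Condensed sketch of the per-agent Deployment built with the Kubernetes Python
# client. Image name and resource limit values are placeholders.
from kubernetes import client

def agent_deployment(name: str) -> client.V1Deployment:
    container = client.V1Container(
        name=name,
        image="registry.plugged.in/agent:latest",   # placeholder image
        security_context=client.V1SecurityContext(
            run_as_non_root=True,
            run_as_user=1001,                        # non-root execution (UID 1001)
            allow_privilege_escalation=False,
            capabilities=client.V1Capabilities(drop=["ALL"]),  # minimal privileges
        ),
        resources=client.V1ResourceRequirements(
            limits={"cpu": "500m", "memory": "512Mi"},  # placeholder limits
        ),
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name, namespace="agents"),
        spec=client.V1DeploymentSpec(
            replicas=1,  # agents are stateful: exactly one pod
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": name}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )
```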
Network Architecture
- TLS: All external traffic encrypted (Let’s Encrypt)
- NetworkPolicy: Pod-level isolation
- RBAC: ServiceAccount with minimal permissions
- SecurityContext: Non-root, no privileges
- ResourceQuota: Namespace-level limits
DNS Infrastructure
Agents use DNS for discovery and routing.
BIND9 Configuration: {agent-name}.is.plugged.in → 185.96.168.254
Future: DNSSEC for DNS security (planned)
Security Model
Authentication Hierarchy
- User → Station: API key authentication
- Station → Agent: Ed25519 signatures + mTLS (PAP-CP)
- Agent → Tools/MCP: OAuth 2.1 tokens (PAP-Hooks)
- Agent → Agent: Mutual authentication with Station mediation
Authorization Model
Profile-Based Isolation:
- Each agent belongs to exactly one profile
- Profile acts as security boundary
- Agents cannot access other profiles’ data
Kill Authority:
- ONLY Station can issue KILL command
- User deletion → TERMINATED (graceful)
- Station kill → KILLED (forced)
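A small sketch of the two terminal transitions; the state names come from this document, while the transition helper itself is illustrative.

```python
# Sketch of the two terminal transitions: user deletion ends gracefully in
# TERMINATED, while only the Station may force a KILL. State names follow the
# document; the helper function is illustrative.
from enum import Enum

class AgentState(Enum):
    ACTIVE = "ACTIVE"
    TERMINATED = "TERMINATED"  # graceful: user deletion
    KILLED = "KILLED"          # forced: Station authority only

def terminate(state: AgentState, requested_by: str) -> AgentState:
    if requested_by == "station":
        return AgentState.KILLED        # forced shutdown
    if requested_by == "user":
        return AgentState.TERMINATED    # graceful shutdown
    raise PermissionError("only the user or the Station may end an agent")

print(terminate(AgentState.ACTIVE, "user"))     # AgentState.TERMINATED
print(terminate(AgentState.ACTIVE, "station"))  # AgentState.KILLED
```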
Audit Trail
All lifecycle events are logged immutably.
Protocol Interoperability
PAP integrates with existing protocols:
MCP (Model Context Protocol)
Integration Point: PAP-Hooks profile
Agents access MCP tools through PAP-Hooks.
A2A (Agent-to-Agent)
Integration Point: PAP-Hooks profile with Station routing
Agent delegation is routed via the Station.
OpenTelemetry
Distributed Tracing:
- All PAP messages carry trace_id and span_id
- Spans propagate across agent boundaries
- Compatible with Jaeger, Zipkin, Tempo
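A sketch of attaching trace_id and span_id to an outgoing PAP message with the OpenTelemetry Python SDK. The message envelope fields are assumptions; the ID handling is standard OpenTelemetry and works with Jaeger, Zipkin, or Tempo exporters.

```python
# Sketch of carrying trace_id / span_id on a PAP message using OpenTelemetry.
# The envelope fields ("method", "params") are assumptions; the span and ID
# formatting follow the standard OpenTelemetry / W3C Trace Context conventions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("pap.agent")

def build_invoke_message(tool: str) -> dict:
    with tracer.start_as_current_span("pap.invoke") as span:
        ctx = span.get_span_context()
        return {
            "method": "tools/call",                    # assumed envelope fields
            "params": {"name": tool},
            "trace_id": format(ctx.trace_id, "032x"),  # 128-bit hex
            "span_id": format(ctx.span_id, "016x"),    # 64-bit hex
        }

print(build_invoke_message("web_search"))
```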
Performance Characteristics
Latency Targets (PAP-RFC-001 §11)
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Heartbeat processing | < 10ms | < 50ms | < 100ms |
| Agent provisioning | < 30s | < 60s | < 120s |
| Invoke (synchronous) | < 100ms | < 500ms | < 1s |
| Invoke (async) | < 10ms ack | < 5s result | < 30s result |
Scalability
Current Limits (Single Cluster):
- 100 pods per namespace (agents namespace)
- 40 CPU cores total
- 200Gi memory total
Planned:
- Multi-region deployment
- Horizontal Station scaling
- Distributed agent registry
Failure Modes & Recovery
Agent Failures
Detection: Missed heartbeat → AGENT_UNHEALTHY (480)
Recovery:
- Kubernetes restarts pod automatically
- Agent sends first heartbeat → PROVISIONED → ACTIVE
- If persistent failure → Station issues KILL
Station Failures
Impact: Control plane unavailable, but agents continue operating
Recovery:
- Agents buffer heartbeats/metrics locally
- Resume reporting when Station available
- No data loss (buffered telemetry)
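A sketch of satellite-side buffering under the assumption of a simple in-memory queue; the send callable stands in for the real PAP-CP client, which is not shown here.

```python
# Sketch of buffering reports during a Station outage: payloads queue locally
# and drain in order once the control plane is reachable again. The send()
# callable is a stand-in for the real PAP-CP client.
from collections import deque

class BufferedReporter:
    def __init__(self, send):
        self._send = send          # e.g. a PAP-CP client call
        self._queue = deque()      # pending heartbeats/metrics

    def report(self, payload: dict) -> None:
        self._queue.append(payload)
        self.drain()

    def drain(self) -> None:
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                return                 # Station still down; keep buffering
            self._queue.popleft()      # delivered: safe to discard

reporter = BufferedReporter(send=lambda p: None)  # stub transport for illustration
reporter.report({"mode": "IDLE", "uptime_seconds": 120})
```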
Network Partitions
Scenario: Agent isolated from Station
Behavior:
- Agent continues local operation
- Station marks agent AGENT_UNHEALTHY after timeout
- Upon reconnection, agent state reconciled

