Health Checks & Application Metrics

Plugged.in exposes health check endpoints for load balancers and monitoring systems, plus comprehensive Node.js runtime metrics for performance tracking.

Health Check Endpoint

GET /api/health

Returns the health status of the application, including a database connectivity check.
# Basic health check
curl http://localhost:12005/api/health

# Example response
{
  "status": "healthy",
  "timestamp": "2025-11-10T18:00:00.000Z",
  "checks": {
    "service": true,
    "database": true
  }
}
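
If the database check fails, the endpoint responds with 503 and marks the failing check as false. An illustrative (not captured from a live system) unhealthy payload:
# Example response (unhealthy, 503)
{
  "status": "unhealthy",
  "timestamp": "2025-11-10T18:05:00.000Z",
  "checks": {
    "service": true,
    "database": false
  }
}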

Response Format

status (string, required)
Overall health status: healthy or unhealthy
  • 200 OK: Application is healthy
  • 503 Service Unavailable: Application has issues

timestamp (string, required)
ISO 8601 timestamp of the health check

checks (object, required)
Individual health check results

uptime (number)
Whitelisted IPs only: Process uptime in seconds

version (string)
Whitelisted IPs only: Application version (from APP_VERSION env var)

environment (string)
Whitelisted IPs only: Runtime environment (development/production)

Security & IP Restrictions

Detailed health information (version, environment, uptime) is only visible to whitelisted monitoring IPs to prevent information disclosure.
The health endpoint uses the same METRICS_ALLOWED_IPS configuration as the metrics endpoint:
# .env configuration
METRICS_ALLOWED_IPS="127.0.0.1,::1,172.17.0.0/16,172.18.0.0/16,185.96.168.253/32"
Allowed IPs see:
  • Full health status with version, environment, uptime
Non-whitelisted IPs see:
  • Basic health status only (status, timestamp, checks)
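
For comparison, a request from a whitelisted monitoring IP receives the extended payload; the values below are illustrative:
# Example response (whitelisted IP)
{
  "status": "healthy",
  "timestamp": "2025-11-10T18:00:00.000Z",
  "checks": {
    "service": true,
    "database": true
  },
  "uptime": 86400,
  "version": "1.0.0",
  "environment": "production"
}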

HEAD /api/health

Lightweight health check that returns only a status code (no body).
# HEAD request for load balancers
curl -I http://localhost:12005/api/health

# Returns:
# 200 OK - healthy
# 503 Service Unavailable - unhealthy
Ideal for:
  • Load balancer health checks
  • Kubernetes liveness/readiness probes (see the sketch below)
  • High-frequency monitoring
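
A minimal sketch of Kubernetes probes against this endpoint, assuming the container listens on port 12005; tune intervals and thresholds for your environment:
# Illustrative pod spec excerpt
livenessProbe:
  httpGet:
    path: /api/health
    port: 12005
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /api/health
    port: 12005
  periodSeconds: 15
  failureThreshold: 3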

Application Metrics Endpoint

GET /api/metrics

Exposes Node.js runtime and HTTP metrics in Prometheus format.
# Access metrics (requires IP whitelist)
curl http://localhost:12005/api/metrics

# Filter specific metrics
curl http://localhost:12005/api/metrics | grep pluggedin_http
The /api/metrics endpoint is IP-restricted and should only be accessible to your Prometheus/Grafana server.

IP Whitelist Configuration

Configure allowed IPs in .env:
# Development (permissive for Docker)
METRICS_ALLOWED_IPS="127.0.0.1,::1,172.17.0.0/16,172.18.0.0/16"

# Production (restrictive)
METRICS_ALLOWED_IPS="127.0.0.1,::1,172.17.0.0/16,172.18.0.0/16,185.96.168.253/32"
Supported formats:
  • IPv4: 127.0.0.1, 10.0.0.1
  • IPv6: ::1, fe80::1
  • CIDR: 172.17.0.0/16, 10.0.0.0/8
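
To sanity-check the whitelist, you can query the metrics endpoint from the application host itself (loopback is whitelisted in the examples above); requests from non-whitelisted addresses are rejected:
# Quick check from the app host (127.0.0.1 is whitelisted)
curl -s http://localhost:12005/api/metrics | head -n 5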

Node.js Runtime Metrics

pluggedin_process_cpu_user_seconds_total
Counter
Total user CPU time consumed by the process
pluggedin_process_cpu_system_seconds_total
Counter
Total system CPU time consumed by the process
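A common derived query combines the two CPU counters into an approximate usage figure (fraction of one core):
# CPU usage as a fraction of one core
rate(pluggedin_process_cpu_user_seconds_total[5m])
  + rate(pluggedin_process_cpu_system_seconds_total[5m])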
pluggedin_process_start_time_seconds
Gauge
Process start time in seconds since Unix epoch
pluggedin_process_resident_memory_bytes
Gauge
Resident memory size in bytes
pluggedin_nodejs_heap_size_total_bytes
Gauge
Total heap size allocated for the process
pluggedin_nodejs_heap_size_used_bytes
Gauge
Currently used heap size
# Memory usage percentage
(pluggedin_nodejs_heap_size_used_bytes / pluggedin_nodejs_heap_size_total_bytes) * 100
pluggedin_nodejs_external_memory_bytes
Gauge
Memory used by C++ objects bound to JavaScript objects
pluggedin_nodejs_eventloop_lag_seconds
Gauge
Event loop lag in seconds (sampled every 10ms)
High event loop lag (> 0.1s) indicates the process is blocked or under heavy load
# Alert when event loop lag > 100ms
pluggedin_nodejs_eventloop_lag_seconds > 0.1
pluggedin_nodejs_eventloop_lag_p50_seconds
Gauge
50th percentile event loop lag
pluggedin_nodejs_eventloop_lag_p90_seconds
Gauge
90th percentile event loop lag
pluggedin_nodejs_eventloop_lag_p99_seconds
Gauge
99th percentile event loop lag
pluggedin_nodejs_gc_duration_seconds
Histogram
Garbage collection duration by GC type
Labels: kind (minor/major/incremental/etc.)
Buckets: 0.001s, 0.01s, 0.1s, 1s, 2s, 5s
# GC duration p95 by type
histogram_quantile(0.95,
  rate(pluggedin_nodejs_gc_duration_seconds_bucket[5m])
)
pluggedin_nodejs_active_handles
Gauge
Number of active handles (file descriptors, sockets, etc.)
pluggedin_nodejs_active_requests
Gauge
Number of active asynchronous requests

HTTP Metrics

pluggedin_http_requests_total
Counter
Total HTTP requests
Labels: method, path, status_code
# Request rate by endpoint
rate(pluggedin_http_requests_total[5m])

# Requests by status code
sum by (status_code) (pluggedin_http_requests_total)
pluggedin_http_request_duration_seconds
Histogram
HTTP request duration in seconds
Labels: method, path, status_code
Buckets: 0.01s, 0.05s, 0.1s, 0.5s, 1s, 2s, 5s, 10s
# p50, p95, p99 latency
histogram_quantile(0.50, rate(pluggedin_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(pluggedin_http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(pluggedin_http_request_duration_seconds_bucket[5m]))

# Average latency by endpoint
sum by (path) (rate(pluggedin_http_request_duration_seconds_sum[5m]))
  / sum by (path) (rate(pluggedin_http_request_duration_seconds_count[5m]))
pluggedin_http_errors_total
Counter
Total HTTP errors (4xx + 5xx responses)
Labels: method, path, error_type
Error types: client_error (4xx), server_error (5xx), rate_limit, unauthorized
# Error rate
rate(pluggedin_http_errors_total[5m])

# Error rate percentage
(rate(pluggedin_http_errors_total[5m]) / rate(pluggedin_http_requests_total[5m])) * 100

Prometheus Configuration

Add to your prometheus.yml:
scrape_configs:
  - job_name: 'pluggedin-app'
    metrics_path: '/api/metrics'
    scheme: 'https'  # or 'http' for local
    static_configs:
      - targets: ['plugged.in']  # or 'localhost:12005'
        labels:
          service: 'pluggedin-app'
          environment: 'production'
    scrape_interval: 30s
    scrape_timeout: 10s
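
After editing the file, it can help to validate the configuration and reload Prometheus without a restart; a sketch assuming a standard installation with the lifecycle endpoint enabled:
# Validate the scrape configuration
promtool check config /etc/prometheus/prometheus.yml

# Ask Prometheus to reload (requires the --web.enable-lifecycle flag)
curl -X POST http://localhost:9090/-/reload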

Alert Rules

Health Check Alerts

Add to prometheus/rules/pluggedin-app-alerts.yml:
groups:
  - name: pluggedin_app_health
    interval: 30s
    rules:
      - alert: PluggedinAppDown
        expr: up{job="pluggedin-app"} == 0
        for: 2m
        labels:
          severity: critical
          service: pluggedin-app
          category: availability
        annotations:
          summary: "Plugged.in application is down"
          description: "No metrics are being scraped for 2+ minutes"

      - alert: PluggedinAppDatabaseDown
        expr: rate(pluggedin_http_requests_total{path="/api/health", status_code="503"}[5m]) > 0
        for: 3m
        labels:
          severity: critical
          service: pluggedin-app
          category: database
        annotations:
          summary: "Database connection lost"
          description: "Health check returning 503 due to database issues"

Performance Alerts

  - name: pluggedin_app_performance
    interval: 30s
    rules:
      - alert: HighMemoryUsage
        expr: (pluggedin_nodejs_heap_size_used_bytes / pluggedin_nodejs_heap_size_total_bytes) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (>80%)"

      - alert: HighEventLoopLag
        expr: pluggedin_nodejs_eventloop_lag_seconds > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High event loop lag (>100ms)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(pluggedin_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s"

Grafana Dashboard

Query Examples

# Application uptime
time() - pluggedin_process_start_time_seconds

# Memory usage percentage
(pluggedin_nodejs_heap_size_used_bytes / pluggedin_nodejs_heap_size_total_bytes) * 100

# Request rate (requests/sec)
rate(pluggedin_http_requests_total[5m])

# Error rate percentage
(rate(pluggedin_http_errors_total[5m]) / rate(pluggedin_http_requests_total[5m])) * 100

# P95 latency by endpoint
histogram_quantile(0.95,
  sum by (path, le) (rate(pluggedin_http_request_duration_seconds_bucket[5m]))
)

# Active handles trend (a gauge, so use deriv rather than rate)
deriv(pluggedin_nodejs_active_handles[5m])

Troubleshooting

Health check reports unhealthy (503):
  1. Check database connectivity: psql $DATABASE_URL -c "SELECT 1"
  2. Review application logs for database errors
  3. Verify the database server is running
  4. Check connection pool settings

Metrics endpoint rejects your requests:
  1. Verify your IP is in METRICS_ALLOWED_IPS
  2. Check the IP format (IPv4, IPv6, or CIDR)
  3. For CIDR, ensure proper notation (e.g., 172.17.0.0/16)
  4. Test from an allowed IP: curl -H "X-Forwarded-For: 127.0.0.1" http://localhost:12005/api/metrics

High event loop lag:
  1. Check for blocking synchronous operations
  2. Review CPU usage: pluggedin_process_cpu_user_seconds_total
  3. Identify long-running functions
  4. Consider offloading heavy work to background workers

High memory usage:
  1. Check for memory leaks with heap snapshots (see the sketch below)
  2. Review the pluggedin_nodejs_heap_size_used_bytes trend
  3. Check GC metrics: pluggedin_nodejs_gc_duration_seconds
  4. Consider increasing heap size or implementing memory limits
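
One way to capture heap snapshots from a running instance, assuming the app is started directly with node (the server.js entry point is a placeholder; adapt for your process manager):
# Start with snapshot-on-signal enabled (Node.js 12+)
node --heapsnapshot-signal=SIGUSR2 server.js

# Later, trigger a heap snapshot of the running process
kill -USR2 <pid>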

Best Practices

Health Check Frequency

  • Load Balancers: Poll every 10-30 seconds using HEAD request
  • Monitoring Systems: Poll every 30-60 seconds using GET request
  • Avoid: Polling more frequently than 10 seconds (unnecessary load)

IP Whitelist Security

  • Production: Only whitelist your specific monitoring server IPs
  • Never: Use 0.0.0.0/0 or overly broad CIDR ranges
  • Review: Audit whitelist quarterly, remove unused IPs

Metrics Retention

  • Prometheus: 15-30 days for detailed metrics
  • Long-term: Export to a time-series database for historical analysis
  • Aggregation: Use recording rules for frequently-queried metrics (see the example below)
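
For instance, a recording rule that pre-computes the error ratio used in several queries above; the group and rule names are illustrative:
groups:
  - name: pluggedin_app_recording
    interval: 1m
    rules:
      - record: pluggedin:http_error_ratio:rate5m
        expr: sum(rate(pluggedin_http_errors_total[5m])) / sum(rate(pluggedin_http_requests_total[5m]))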

Next Steps