Grafana Dashboards for OAuth Monitoring

Build comprehensive Grafana dashboards combining Prometheus metrics and Loki logs for real-time OAuth 2.1 monitoring.

Dashboard Overview

Recommended dashboard structure:

OAuth Overview

High-level health metrics, SLO tracking, and alert status

OAuth Performance

Latency percentiles, throughput, and resource utilization

OAuth Security

Security events, attack detection, and compliance monitoring

OAuth Debugging

Detailed logs, traces, and error analysis

OAuth Overview Dashboard

Panel 1: SLO Summary (Row)

OAuth Flow Success Rate
# Query
(
  sum(rate(oauth_flows_total{status="success"}[5m]))
  / sum(rate(oauth_flows_total[5m]))
) * 100

# Settings
- Visualization: Stat
- Unit: Percent (0-100)
- Decimals: 2
- Thresholds:
  - Red: < 95
  - Yellow: 95-98
  - Green: > 98
- Value mapping: Display "N/A" if no data

Token Refresh Success Rate
# Query
(
  sum(rate(oauth_token_refresh_total{status="success"}[5m]))
  / sum(rate(oauth_token_refresh_total[5m]))
) * 100

# Settings
- Visualization: Stat
- Unit: Percent
- Thresholds:
  - Red: < 99
  - Yellow: 99-99.5
  - Green: > 99.5

PKCE Validation Success Rate
# Query
(
  sum(rate(oauth_pkce_validations_total{status="success"}[5m]))
  / sum(rate(oauth_pkce_validations_total[5m]))
) * 100

# Settings
- Visualization: Stat
- Unit: Percent
- Thresholds:
  - Red: < 98
  - Yellow: 98-99
  - Green: > 99

Critical Security Events (Last 24h)
# Query
sum(increase(oauth_security_events_total{severity="critical"}[24h]))

# Settings
- Visualization: Stat
- Unit: None
- Color: Red if > 0, Green if 0
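
The success-rate Stat panels above share one pattern: a success ratio scaled to a percentage, then colored by thresholds. As a rough illustration of that logic outside Grafana, the Python sketch below computes the percentage from two counter increases and maps it to a panel color. The function names and the "N/A" return value are assumptions made for illustration; only the threshold numbers come from the panel settings.

```python
from typing import Optional


def success_rate(successes: float, total: float) -> Optional[float]:
    """Return the success percentage, or None when there was no traffic."""
    if total <= 0:
        return None  # mirrors the 'Display "N/A" if no data' value mapping
    return successes / total * 100


def threshold_color(rate: Optional[float], red_below: float, green_above: float) -> str:
    """Map a percentage to the panel colors: red, yellow, or green."""
    if rate is None:
        return "N/A"
    if rate < red_below:
        return "red"
    if rate > green_above:
        return "green"
    return "yellow"


# OAuth Flow Success Rate thresholds: red < 95, green > 98
print(threshold_color(success_rate(970, 1000), 95, 98))  # yellow
print(threshold_color(success_rate(0, 0), 95, 98))       # N/A
```

Returning None for zero traffic keeps "no requests" distinct from "0% success", which is the same distinction the value mapping draws in the panel.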

Panel 2: Operations Rate (Time Series)

OAuth Operations per Minute
# Flows
sum(rate(oauth_flows_total[5m])) * 60

# Token Refreshes
sum(rate(oauth_token_refresh_total[5m])) * 60

# PKCE Validations
sum(rate(oauth_pkce_validations_total[5m])) * 60

# Settings
- Visualization: Time series
- Legend: Flows, Refreshes, PKCE
- Y-axis: ops/min
- Stack: None

Panel 3: Active Resources (Time Series)

Active Tokens & PKCE States
# Active Tokens
oauth_active_tokens

# Active PKCE States
oauth_active_pkce_states

# Settings
- Visualization: Time series
- Y-axis: Count
- Legend: Tokens, PKCE States

Panel 4: Error Rate (Time Series)

OAuth Errors per Minute
# Flow Failures
sum(rate(oauth_flows_total{status="failure"}[5m])) * 60

# Refresh Failures
sum(rate(oauth_token_refresh_total{status="failure"}[5m])) * 60

# PKCE Failures
sum(rate(oauth_pkce_validations_total{status="failure"}[5m])) * 60

# Settings
- Visualization: Time series
- Y-axis: errors/min
- Color scheme: Red/Orange

OAuth Performance Dashboard

Panel 1: Latency Percentiles (Time Series)

OAuth Flow Duration (p50, p95, p99)
# p50
histogram_quantile(0.50, sum(rate(oauth_flow_duration_seconds_bucket[5m])) by (le))

# p95
histogram_quantile(0.95, sum(rate(oauth_flow_duration_seconds_bucket[5m])) by (le))

# p99
histogram_quantile(0.99, sum(rate(oauth_flow_duration_seconds_bucket[5m])) by (le))

# Settings
- Visualization: Time series
- Y-axis: Seconds
- Legend: p50, p95, p99
- Thresholds for p95:
  - Green: < 5s
  - Yellow: 5-10s
  - Red: > 10s

Token Refresh Duration (p50, p95, p99)
# Same pattern as flow duration
histogram_quantile(0.50, sum(rate(oauth_token_refresh_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(oauth_token_refresh_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(oauth_token_refresh_duration_seconds_bucket[5m])) by (le))

# Settings
- Thresholds for p95:
  - Green: < 1s
  - Yellow: 1-2s
  - Red: > 2s
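
Because these percentiles are estimated from histogram buckets, their precision depends on the bucket boundaries. The Python sketch below shows the linear interpolation that `histogram_quantile` performs, in simplified form. It assumes cumulative buckets sorted by upper bound, at least one finite bucket, and a final `+Inf` bucket; the bucket values are invented for illustration.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate a quantile from cumulative buckets sorted by upper bound.

    Simplified: requires 0 < q <= 1 and a trailing le="+Inf" bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


buckets = [(0.5, 50), (1, 80), (5, 95), (10, 100), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(q, histogram_quantile(q, buckets))
```

With these buckets, any p99 between 5s and 10s is interpolated within a single wide bucket, so a finer boundary near the 10s threshold would make the red/yellow split more trustworthy.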

Panel 2: Throughput by Provider (Bar Gauge)

OAuth Flows by Provider (Last Hour)
# Query
topk(10,
  sum by (provider) (increase(oauth_flows_total[1h]))
)

# Settings
- Visualization: Bar gauge
- Orientation: Horizontal
- Display mode: Gradient
- Show: Value

Panel 3: Discovery Performance (Stat + Time Series)

Average Discovery Time by Method
# Query
sum by (method) (rate(oauth_discovery_duration_seconds_sum[5m]))
  / sum by (method) (rate(oauth_discovery_duration_seconds_count[5m]))

# Settings
- Visualization: Stat (current value) + Time series (trend)
- Unit: Seconds
- Group by: method

Panel 4: Client Registration Performance

Registration Duration p95
histogram_quantile(0.95, sum(rate(oauth_client_registration_duration_seconds_bucket[5m])) by (le))

# Settings
- Visualization: Gauge
- Min: 0
- Max: 10
- Unit: Seconds

OAuth Security Dashboard

Panel 1: Security Events Timeline (Logs)

Critical/High Security Events
# Query
{service_name="pluggedin-app"}
  | json
  | severity =~ "critical|high"
  | event =~ "(token_reuse|code_injection|integrity_violation|ownership_violation)"
  | line_format "[{{.severity}}] {{.event}}: {{.msg}}"

# Settings
- Visualization: Logs
- Show time: Yes
- Wrap lines: Yes
- Deduplication: None

Panel 2: Security Event Counters (Stat Row)

Token Reuse Detected (24h)
sum(increase(oauth_token_refresh_total{status="reuse_detected"}[24h]))

# Settings
- Color: Red if > 0
- Sparkline: Show trend

Code Injection Attempts (24h)
sum(increase(oauth_code_injection_attempts_total[24h]))

# Settings
- Color: Red if > 0

Integrity Violations (24h)
sum(increase(oauth_integrity_violations_total[24h]))

# Settings
- Color: Yellow if > 0

Ownership Violations (24h)
# Count from logs
count_over_time({service_name="pluggedin-app"} | json | event="oauth_ownership_violation" [24h])

# Settings
- Color: Yellow if > 0

Panel 3: Attack Heatmap (Heatmap)

Security Events by Hour
# Query
sum by (event_type) (
  increase(oauth_security_events_total{severity=~"high|critical"}[1h])
)

# Settings
- Visualization: Heatmap
- Color scheme: Red-Yellow-Green (inverted)
- Cell gap: 2

Panel 4: Top Attackers (Table from Logs)

Users with Most Security Events
# Query
topk(20,
  sum by (userId) (
    count_over_time(
      {service_name="pluggedin-app"}
      | json
      | severity =~ "high|critical"
      [24h]
    )
  )
)

# Settings
- Visualization: Table
- Columns: User ID, Event Count
- Sort: Event Count (desc)

Panel 5: PKCE Replay Attempts (Time Series)

PKCE States in Audit Table (Prevented Replays)
# Query (from logs)
count_over_time(
  {service_name="pluggedin-app"}
  | json
  | event="pkce_replay_prevented"
  [5m]
)

# Settings
- Visualization: Time series
- Color: Orange
- Alert if > 0

OAuth Debugging Dashboard

Panel 1: Error Logs (Logs Panel)

OAuth Errors (Last 6 Hours)
# Query
{service_name="pluggedin-app"}
  | json
  | level >= 50
  | event =~ "oauth_.*"
  | line_format "{{ __timestamp__ | date \"15:04:05\" }} [{{.level}}] {{.event}}: {{.msg}}"

# Settings
- Visualization: Logs
- Show labels: event, level, serverUuid
- Order: Time descending
- Live: Yes

Panel 2: Error Distribution (Pie Chart)

OAuth Errors by Type (Last 24h)
# Query
sum by (event) (
  count_over_time(
    {service_name="pluggedin-app"}
    | json
    | level >= 50
    | event =~ "oauth_.*"
    [24h]
  )
)

# Settings
- Visualization: Pie chart
- Legend: Values + percent
- Tooltip: All series

Panel 3: Token Refresh Failures by Reason (Bar Chart)

Refresh Failures by Reason (Last Hour)
# Query
topk(10,
  sum by (reason) (increase(oauth_token_refresh_total{status="failure"}[1h]))
)

# Settings
- Visualization: Bar chart
- Orientation: Horizontal
- Show values: On bars

Panel 4: OAuth Flow Trace (Logs with Trace)

Complete Flow by Trace ID

Variables:
Name: trace_id
Label: Trace ID
Type: Query
Query:
  label_values(
    {service_name="pluggedin-app"} | json | event =~ "oauth_.*",
    trace_id
  )
Panel Query:
{service_name="pluggedin-app"}
  | json
  | trace_id="$trace_id"
  | event =~ "(oauth_|pkce_|token_).*"
  | line_format "{{ __timestamp__ | date \"15:04:05.000\" }} {{.event}}: {{.msg}}"

# Settings
- Visualization: Logs
- Sorted by: Time (asc)
- Show common labels: No

Panel 5: Slow Operations (Table)

Operations Exceeding SLO (Last Hour)
# Query
{service_name="pluggedin-app"}
  | json
  | duration_ms > 2000
  | event =~ "(token_refresh|oauth_flow)"
  | line_format "{{.event}} {{.serverUuid}} {{.duration_ms}}ms"

# Settings
- Visualization: Table
- Columns: Time, Event, Server, Duration
- Sort: Duration (desc)
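
To check the slow-operations filter against a captured log file before building the panel, the same logic can be reproduced locally. The Python sketch below is an assumption-laden stand-in for the LogQL pipeline: it expects one JSON object per line with the `event`, `serverUuid`, and `duration_ms` fields used above, and the function name and sample values are hypothetical.

```python
import json
import re
from typing import Dict, Iterable, List

# Same alternatives as the panel query; fullmatch() is used below because
# LogQL label-filter regexes are anchored at both ends.
EVENT_PATTERN = re.compile(r"(token_refresh|oauth_flow)")


def slow_operations(lines: Iterable[str], threshold_ms: float = 2000) -> List[Dict]:
    """Return matching entries slower than threshold_ms, slowest first."""
    rows = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        duration = entry.get("duration_ms")
        event = str(entry.get("event", ""))
        if not isinstance(duration, (int, float)):
            continue
        if duration > threshold_ms and EVENT_PATTERN.fullmatch(event):
            rows.append({
                "event": event,
                "serverUuid": entry.get("serverUuid"),
                "duration_ms": duration,
            })
    rows.sort(key=lambda row: row["duration_ms"], reverse=True)
    return rows


sample = [
    '{"event": "token_refresh", "serverUuid": "srv-a", "duration_ms": 2500}',
    '{"event": "oauth_flow", "serverUuid": "srv-b", "duration_ms": 4100}',
    '{"event": "oauth_flow", "serverUuid": "srv-c", "duration_ms": 900}',
]
for row in slow_operations(sample):
    print(f'{row["event"]} {row["serverUuid"]} {row["duration_ms"]}ms')
```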

Dashboard Variables

Global Variables

# Environment
Name: env
Type: Constant
Value: production

# Time Range Selector
Name: time_range
Type: Interval
Options: 5m, 15m, 1h, 6h, 24h, 7d

# Service Name
Name: service
Type: Query
Query: label_values(service_name)

OAuth-Specific Variables

# Provider Filter
Name: provider
Type: Query
Query: label_values(oauth_flows_total, provider)
Multi-value: Yes
Include all: Yes

# Server UUID (for debugging)
Name: server_uuid
Type: Query
Query:
  label_values(
    {service_name="pluggedin-app"} | json | serverUuid != "",
    serverUuid
  )
Multi-value: No

# User ID (for debugging)
Name: user_id
Type: Query
Query:
  label_values(
    {service_name="pluggedin-app"} | json | userId != "",
    userId
  )
Multi-value: No

Alert Annotations

Add alert annotations to show when alerts fire:
# Annotation Query (Prometheus Alerts)
Name: OAuth Alerts
Data source: Prometheus
Query: ALERTS{alertname=~"OAuth.*"}

# Annotation Query (Deployments)
Name: Deployments
Data source: Loki
Query: {service_name="pluggedin-app"} |= "deployment_completed"
Color: Green

Dashboard JSON Export

Complete OAuth Overview Dashboard

Import this dashboard JSON into Grafana for a quick start:
{
  "dashboard": {
    "title": "OAuth 2.1 Overview",
    "tags": ["oauth", "security", "authentication"],
    "timezone": "browser",
    "panels": [
      {
        "title": "OAuth Flow Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(rate(oauth_flows_total{status=\"success\"}[5m])) / sum(rate(oauth_flows_total[5m]))) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 95, "color": "yellow" },
                { "value": 98, "color": "green" }
              ]
            }
          }
        }
      }
    ],
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    }
  }
}
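
The export above contains only the first SLO panel. If the dashboard JSON is kept under version control, the remaining success-rate panels can be generated rather than written by hand. The Python sketch below is one possible approach, not part of any Grafana tooling: `slo_stat_panel` is a hypothetical helper that reproduces the panel structure shown above, with the metric name and thresholds as parameters.

```python
import json
from typing import Dict


def slo_stat_panel(title: str, metric: str, yellow_from: float, green_from: float) -> Dict:
    """Build a stat panel with the same shape as the exported panel above."""
    expr = (
        f'(sum(rate({metric}{{status="success"}}[5m])) '
        f"/ sum(rate({metric}[5m]))) * 100"
    )
    return {
        "title": title,
        "type": "stat",
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "unit": "percent",
                "thresholds": {
                    "steps": [
                        {"value": 0, "color": "red"},
                        {"value": yellow_from, "color": "yellow"},
                        {"value": green_from, "color": "green"},
                    ]
                },
            }
        },
    }


panels = [
    slo_stat_panel("OAuth Flow Success Rate", "oauth_flows_total", 95, 98),
    slo_stat_panel("Token Refresh Success Rate", "oauth_token_refresh_total", 99, 99.5),
    slo_stat_panel("PKCE Validation Success Rate", "oauth_pkce_validations_total", 98, 99),
]
print(json.dumps(panels, indent=2))
```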

Best Practices

Use Template Variables

Add provider, server, and user filters for flexible debugging

Set Appropriate Refresh Rates

Overview: 30s, Performance: 15s, Security: 10s, Debugging: 5s

Configure Alerts

Critical alerts: Immediate notification (PagerDuty, Slack)
Warning alerts: Email or dashboard only

Use Annotations

Mark deployments, incidents, and configuration changes

Optimize Query Performance

Use recording rules for complex queries
Limit time ranges for expensive log queries
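
A recording rule precomputes an expression in Prometheus so that panels query one stored series instead of re-evaluating the full ratio on every refresh. The Python sketch below renders a rule group file for two of the SLO expressions used on the Overview dashboard. The group name and the `oauth:...:ratio_rate5m` record names are illustrative assumptions, not names defined elsewhere in this guide.

```python
import json
from typing import Dict


def render_rule_group(name: str, rules: Dict[str, str]) -> str:
    """Render a Prometheus rule file containing one group of recording rules."""
    lines = ["groups:", f"  - name: {name}", "    rules:"]
    for record, expr in rules.items():
        # json.dumps produces double-quoted strings, which are valid YAML scalars.
        lines.append(f"      - record: {json.dumps(record)}")
        lines.append(f"        expr: {json.dumps(expr)}")
    return "\n".join(lines) + "\n"


rules = {
    # Hypothetical record names following the level:metric:operation convention.
    "oauth:flows_success:ratio_rate5m": (
        'sum(rate(oauth_flows_total{status="success"}[5m])) '
        "/ sum(rate(oauth_flows_total[5m]))"
    ),
    "oauth:token_refresh_success:ratio_rate5m": (
        'sum(rate(oauth_token_refresh_total{status="success"}[5m])) '
        "/ sum(rate(oauth_token_refresh_total[5m]))"
    ),
}
print(render_rule_group("oauth_slo_rules", rules))
```

A panel would then query the recorded series, for example `oauth:flows_success:ratio_rate5m * 100`, in place of the full expression.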

Troubleshooting

Panels show no data

  1. Verify that the data sources (Prometheus, Loki) are configured correctly
  2. Check the metrics endpoint: curl http://localhost:12005/metrics
  3. Ensure OAuth operations are generating events
  4. Verify that the time range includes recent data

Dashboards load slowly

  1. Add recording rules for expensive queries
  2. Reduce the time range for log queries
  3. Use indexed labels in Loki queries
  4. Consider query caching in Grafana

Template variables return no values

  1. Check the data source selected in the variable
  2. Verify that the label exists in the metrics or logs
  3. Test the query in the Explore view first
  4. Check the query for syntax errors

Alerts do not fire

  1. Verify that the alert rule expression is correct
  2. Check the evaluation interval (should match panel refresh)
  3. Test that the query returns data in the expected range
  4. Verify that notification channels are configured

Dashboard Recommendations

Priority 1: Must-Have Panels

  • ✅ OAuth Flow Success Rate (SLO)
  • ✅ Token Refresh Success Rate (SLO)
  • ✅ Critical Security Events (Last 24h)
  • ✅ Operations Rate (flows, refreshes, PKCE)
  • ✅ p95 Latency (flows, refreshes)

Priority 2: Recommended Panels

  • ⭐ Error Rate Trends
  • ⭐ Active Tokens/States Gauge
  • ⭐ Security Events Timeline
  • ⭐ Top Providers by Volume
  • ⭐ Slow Operations Table

Priority 3: Advanced Panels

  • 🔧 Trace Correlation
  • 🔧 Attack Heatmap
  • 🔧 Error Distribution Pie
  • 🔧 User Activity Analysis
  • 🔧 Discovery Method Breakdown

Pre-built Dashboard Downloads

OAuth Overview Dashboard

Production-ready overview with all critical metrics

OAuth Security Dashboard

Security monitoring and incident response

OAuth Performance Dashboard

Latency analysis and capacity planning

Next Steps

Set Up Alerts

Configure Grafana alerts for critical events

Customize Dashboards

Adapt panels to your specific needs

Share Dashboards

Export and version control dashboard JSON

Monitor SLOs

Track OAuth SLOs and error budgets