Health and Monitoring
Is IdentityScribe healthy? This page shows you how to check, what the numbers mean, and what to do when they look wrong.
For the full list of /observe/* endpoints and their response shapes, see the Endpoints Reference.
Built-in tools
Scribe ships with a web interface. Point your browser at the running instance — no external tools needed.
Portal (/) shows system status at a glance: version, uptime, golden signals strip, entry counts, and change feed summary.
Operator UI (/ui/) goes deeper: entry counts by type, live sparklines for ingest activity, pressure indicators, and quick links to REST, GraphQL, and Observe views.
Entries browser (/ui/entries) lets you search, filter, and inspect directory entries. You can view point-in-time snapshots, compare versions, and browse the change timeline for any entry.
Observe dashboard (/ui/observe) shows pressure gauges, golden signals, service health, and the Operator Copilot — which surfaces recommendations when it detects problems.
| What you need | Where to go |
|---|---|
| Browse entries | /ui/entries |
| Track changes | /ui/changes |
| Quick health spot-check | /ui/observe |
| Full health report | /ui/observe/doctor |
| Production alerting | Grafana bundle |
Health endpoints
Examples use port 8080 (the default). The bundled monitoring stack uses port 9001 via socket separation. These endpoints are also listed in the Endpoints Reference with full response schemas.
Doctor (/observe/doctor)
The smartest health check. Returns threshold-based assessments, per-service status, and actionable recommendations.
curl -s http://localhost:8080/observe/doctor | jq

| Status | Meaning |
|---|---|
| healthy | All checks pass |
| degraded | Warning thresholds exceeded — investigate soon |
| critical | Critical thresholds exceeded — act now |
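A deploy script might gate on this verdict. Here is a minimal sketch: the `status` values are the ones in the table above, but the parsing step is stubbed out so the example runs standalone (against a live instance you would use `curl -sf http://localhost:8080/observe/doctor | jq -r '.status'`).

```shell
# Map a doctor status to an exit code a CI step can act on.
doctor_exit() {
  case "$1" in
    healthy)  echo 0 ;;   # all checks pass
    degraded) echo 1 ;;   # investigate soon
    *)        echo 2 ;;   # critical (or unknown): act now
  esac
}

status="degraded"   # stand-in for the parsed "status" field
echo "doctor status: $status, exit code: $(doctor_exit "$status")"
```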
Pressure (/observe/pressure)
Focused on saturation metrics only. Shows per-entry-type breakdown of queue and task pressure, plus query_permit_queue (threads waiting for permits). Use this when you want a quick saturation check without the full doctor report.
curl -s http://localhost:8080/observe/pressure | jq

Kubernetes probes
For Kubernetes liveness, readiness, and startup checks, see the probe configuration in the deployment guide.
| Endpoint | Purpose |
|---|---|
| /livez | Is the process alive? |
| /readyz | Can it serve traffic? |
| /startedz | Have services initialized? |
| /healthz | Combined health check |
Append ?verbose for detailed output.
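Outside Kubernetes, a deploy script can poll the same endpoint before cutting traffic over. A minimal sketch, with the probe command injectable so the example runs standalone; against a live instance you would pass `"curl -sf http://localhost:8080/readyz"`:

```shell
# Poll a probe until it succeeds or the attempt budget runs out.
wait_ready() {
  probe=$1; tries=$2; i=0
  while [ "$i" -lt "$tries" ]; do
    if $probe; then
      echo "ready after $((i + 1)) attempt(s)"; return 0
    fi
    i=$((i + 1))
    # sleep 2   # back off between attempts against a real endpoint
  done
  echo "not ready after $tries attempts"; return 1
}

wait_ready true 3    # a probe that always succeeds
wait_ready false 2 || echo "giving up"
```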
Pressure gauges
Four metrics tell you whether the system is saturated.

Pressure gauges are the first thing to check when something feels off. Each one measures how close a resource is to saturation. For the full metric definitions with labels, units, and scrape details, see the Telemetry Reference.
| Metric | Range | Warning | Critical | What it means |
|---|---|---|---|---|
| scribe_query_permit_pressure | 0..1 | 0.8 | 0.95 | Query capacity used |
| scribe_query_permit_queue | 0..N | 5 | 10 | Threads waiting for query permits |
| scribe_ingest_queue_pressure | 0..1 | 0.8 | 0.95 | Ingest buffer fill level |
| scribe_ingest_task_pressure | ~1 | 1.2 | 2.0 | Processing keep-up ratio (>1 = falling behind) |
| jvm_memory_pressure | 0..1 | 0.8 | 0.95 | Heap utilization |
Look for sustained elevation, not spikes. A brief spike during a bulk import is normal. Sustained high pressure with rising latency or rejections means you need to act.
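The 0..1 gauges share the same warning/critical thresholds, so an alerting side-script can classify them uniformly. A sketch, with sample values standing in for scraped metrics (awk handles the float comparison):

```shell
# Classify a 0..1 pressure gauge against the thresholds in the table above.
classify_pressure() {
  awk -v v="$1" 'BEGIN {
    if (v + 0 > 0.95)      print "critical"
    else if (v + 0 > 0.8)  print "warning"
    else                   print "ok"
  }'
}

echo "0.42 -> $(classify_pressure 0.42)"
echo "0.85 -> $(classify_pressure 0.85)"
echo "0.97 -> $(classify_pressure 0.97)"
```

In production you would scrape the gauge values from your metrics endpoint rather than hardcoding them.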
Connection health
| Metric | Threshold |
|---|---|
| scribe_db_connections_pending | > 0 means pool saturation |
| scribe_ingest_lag_seconds | > 60s warning, > 300s critical |
| scribe_query_rejected_5m | > 0 means queries are being dropped |
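The ingest lag thresholds above (> 60s warning, > 300s critical) are integer seconds, so plain shell arithmetic is enough to classify them. A sketch with sample values standing in for the scraped metric:

```shell
# Classify scribe_ingest_lag_seconds against the table's thresholds.
classify_lag() {
  if [ "$1" -gt 300 ]; then echo "critical"
  elif [ "$1" -gt 60 ]; then echo "warning"
  else echo "ok"; fi
}

echo "lag 15s  -> $(classify_lag 15)"
echo "lag 120s -> $(classify_lag 120)"
echo "lag 900s -> $(classify_lag 900)"
```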
Deployment verification
After a deployment or restart:
# 1. Services started?
curl http://localhost:8080/startedz?verbose

# 2. Uptime and status
curl -s http://localhost:8080/observe/status | jq

# 3. Sync completed?
curl http://localhost:8080/readyz?verbose

# 4. Any recent restarts?
curl -s http://localhost:8080/observe/doctor | jq '.checks[] | select(.name == "services.restarts_5m")'

Data freshness
If data looks stale, check ingest status:
curl -s http://localhost:8080/observe/stats/ingest | jq

| Symptom | Likely cause | Action |
|---|---|---|
| High lag, low task pressure | Source LDAP slow | Check source LDAP performance |
| High lag, high task pressure | Processing bottleneck | Increase workers or check DB |
| High queue pressure | DB writes slow | Check connections, run VACUUM ANALYZE |
| ingest_completed = false | Initial sync still running | Check /readyz |
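The symptom table can be turned into a quick triage helper. In this sketch, "high" is operationalized as above the warning thresholds from the pressure gauge section (lag over 60s, task pressure over 1.2) — that mapping is an assumption, and the sample values stand in for scraped metrics:

```shell
# Suggest a likely cause from ingest lag (seconds) and task pressure (ratio).
triage() {
  lag=$1; task=$2
  high_task=$(awk -v t="$task" 'BEGIN { if (t + 0 > 1.2) print 1; else print 0 }')
  if [ "$lag" -le 60 ]; then
    echo "fresh enough"
  elif [ "$high_task" -eq 1 ]; then
    echo "processing bottleneck: increase workers or check DB"
  else
    echo "source LDAP slow: check source performance"
  fi
}

echo "$(triage 30 0.9)"
echo "$(triage 400 0.8)"
echo "$(triage 400 1.8)"
```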
Alert runbooks
High permit pressure

Alert: scribe_query_permit_pressure > 0.8

- Check /observe/doctor for query.permit_pressure
- Find slow queries via traces or stage durations
- Consider raising database.channelPoolSize
- Check /observe/hints?severity=warning for missing indexes
Ingest falling behind
Alert: scribe_ingest_task_pressure > 1.2

- Check /observe/stats/ingest for per-entry-type breakdown
- Verify source LDAP connectivity
- Run VACUUM ANALYZE if DB is slow
- Consider raising ingest workers
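The gauge table describes scribe_ingest_task_pressure as a keep-up ratio where anything above 1 means ingest is falling behind. One way to read it — the exact formula is an assumption, but "work arriving vs. work completed per interval" matches the ">1 = falling behind" semantics — can be sketched with sample rates:

```shell
# Compute a keep-up ratio from arrival and completion rates (per interval).
task_pressure() {
  awk -v arriving="$1" -v completed="$2" 'BEGIN { printf "%.2f\n", arriving / completed }'
}

echo "ratio $(task_pressure 120 80): falling behind"   # above 1
echo "ratio $(task_pressure 60 80): keeping up"        # below 1
```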
Memory pressure
Alert: jvm_memory_pressure > 0.8

- Check /observe/doctor context section
- Review -Xmx heap size
- Profile for leaks if persistent
- Restart as last resort
Service flapping
Alert: scribe_service_restarts_5m > 1

- Check /observe/doctor for services.restarts_5m
- Review logs for crash causes
- Check resource limits (CPU, memory)
- Check external dependencies (DB, LDAP)
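The alert rule above is a simple integer comparison, so a monitoring side-script can evaluate it directly. A sketch, with sample counts standing in for the scraped gauge:

```shell
# Evaluate the flapping rule: scribe_service_restarts_5m > 1.
flapping() {
  if [ "$1" -gt 1 ]; then echo "flapping: $1 restarts in 5m"
  else echo "stable"; fi
}

echo "$(flapping 0)"
echo "$(flapping 3)"
```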