Health and Monitoring
Is IdentityScribe healthy? This page shows you how to check, what the numbers mean, and what to do when they look wrong.
For the full list of /observe/* endpoints and their response shapes, see the Endpoints Reference.
Built-in tools
Scribe ships with a web interface. Point your browser at the running instance — no external tools needed.
Portal (/) shows system status at a glance: version, uptime, golden signals strip, entry counts, and change feed summary.
Operator UI (/ui/) goes deeper: entry counts by type, live sparklines for ingest activity, pressure indicators, and quick links to REST, GraphQL, and Observe views.
Entries browser (/ui/entries) lets you search, filter, and inspect directory entries. You can view point-in-time snapshots, compare versions, and browse the change timeline for any entry.
Observe dashboard (/ui/observe) shows pressure gauges, golden signals, service health, and the Operator Copilot — which surfaces recommendations when it detects problems.
| What you need | Where to go |
|---|---|
| Browse entries | /ui/entries |
| Track changes | /ui/changes |
| Quick health spot-check | /ui/observe |
| Full health report | /ui/observe/doctor |
| Production alerting | Grafana bundle |
Health endpoints
Examples use port 8080 (the default). The bundled monitoring stack uses port 9001 via socket separation. These endpoints are also listed in the Endpoints Reference with full response schemas.
Doctor (/observe/doctor)
The smartest health check. Returns threshold-based assessments, per-service status, and actionable recommendations.
curl -s http://localhost:8080/observe/doctor | jq

| Status | Meaning |
|---|---|
| healthy | All checks pass |
| degraded | Warning thresholds exceeded — investigate soon |
| critical | Critical thresholds exceeded — act now |
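A deploy script might gate on this verdict. Here is a minimal sketch: the `status` values are the ones in the table above, but the parsing step is stubbed out so the example runs standalone (against a live instance you would use `curl -sf http://localhost:8080/observe/doctor | jq -r '.status'`).

```shell
# Map a doctor status to an exit code a CI step can act on.
doctor_exit() {
  case "$1" in
    healthy)  echo 0 ;;   # all checks pass
    degraded) echo 1 ;;   # investigate soon
    *)        echo 2 ;;   # critical (or unknown): act now
  esac
}

status="degraded"   # stand-in for the parsed "status" field
echo "doctor status: $status, exit code: $(doctor_exit "$status")"
```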
Pressure (/observe/pressure)
Focused on saturation metrics only. Shows per-entry-type breakdown of queue and task pressure, plus query_permit_queue (threads waiting for permits). Use this when you want a quick saturation check without the full doctor report.
curl -s http://localhost:8080/observe/pressure | jq

Kubernetes probes
For Kubernetes liveness, readiness, and startup checks, see the probe configuration in the deployment guide.
| Endpoint | Purpose |
|---|---|
| /livez | Is the process alive? |
| /readyz | Can it serve traffic? |
| /startedz | Have services initialized? |
| /healthz | Combined health check |
Append ?verbose for detailed output.
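Outside Kubernetes, a deploy script can poll the same endpoint before cutting traffic over. A minimal sketch, with the probe command injectable so the example runs standalone; against a live instance you would pass `"curl -sf http://localhost:8080/readyz"`:

```shell
# Poll a probe until it succeeds or the attempt budget runs out.
wait_ready() {
  probe=$1; tries=$2; i=0
  while [ "$i" -lt "$tries" ]; do
    if $probe; then
      echo "ready after $((i + 1)) attempt(s)"; return 0
    fi
    i=$((i + 1))
    # sleep 2   # back off between attempts against a real endpoint
  done
  echo "not ready after $tries attempts"; return 1
}

wait_ready true 3    # a probe that always succeeds
wait_ready false 2 || echo "giving up"
```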
Pressure gauges
Four metrics tell you whether the system is saturated.

Pressure gauges are the first thing to check when something feels off. Each one measures how close a resource is to saturation. For the full metric definitions with labels, units, and scrape details, see the Telemetry Reference.
| Metric | Range | Warning | Critical | What it means |
|---|---|---|---|---|
| scribe_query_permit_pressure | 0..1 | 0.8 | 0.95 | Query capacity used |
| scribe_query_permit_queue | 0..N | 5 | 10 | Threads waiting for query permits |
| scribe_ingest_queue_pressure | 0..1 | 0.8 | 0.95 | Ingest buffer fill level |
| scribe_ingest_task_pressure | ~1 | 1.2 | 2.0 | Processing keep-up ratio (>1 = falling behind) |
| jvm_memory_pressure | 0..1 | 0.8 | 0.95 | Heap utilization |
Look for sustained elevation, not spikes. A brief spike during a bulk import is normal. Sustained high pressure with rising latency or rejections means you need to act.
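The 0..1 gauges share the same warning/critical thresholds, so an alerting side-script can classify them uniformly. A sketch, with sample values standing in for scraped metrics (awk handles the float comparison):

```shell
# Classify a 0..1 pressure gauge against the thresholds in the table above.
classify_pressure() {
  awk -v v="$1" 'BEGIN {
    if (v + 0 > 0.95)      print "critical"
    else if (v + 0 > 0.8)  print "warning"
    else                   print "ok"
  }'
}

echo "0.42 -> $(classify_pressure 0.42)"
echo "0.85 -> $(classify_pressure 0.85)"
echo "0.97 -> $(classify_pressure 0.97)"
```

In production you would scrape the gauge values from your metrics endpoint rather than hardcoding them.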
Connection health
| Metric | Threshold |
|---|---|
| scribe_db_connections_pending | > 0 means pool saturation |
| scribe_ingest_lag_seconds | > 60s warning, > 300s critical |
| scribe_query_rejected_5m | > 0 means queries are being dropped |
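The ingest lag thresholds above (> 60s warning, > 300s critical) are integer seconds, so plain shell arithmetic is enough to classify them. A sketch with sample values standing in for the scraped metric:

```shell
# Classify scribe_ingest_lag_seconds against the table's thresholds.
classify_lag() {
  if [ "$1" -gt 300 ]; then echo "critical"
  elif [ "$1" -gt 60 ]; then echo "warning"
  else echo "ok"; fi
}

echo "lag 15s  -> $(classify_lag 15)"
echo "lag 120s -> $(classify_lag 120)"
echo "lag 900s -> $(classify_lag 900)"
```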
Deployment verification
After a deployment or restart:
# 1. Services started?
curl http://localhost:8080/startedz?verbose

# 2. Uptime and status
curl -s http://localhost:8080/observe/status | jq

# 3. Sync completed?
curl http://localhost:8080/readyz?verbose

# 4. Any recent restarts?
curl -s http://localhost:8080/observe/doctor | jq '.checks[] | select(.name == "services.restarts_5m")'

Data freshness
If data looks stale, check ingest status:
curl -s http://localhost:8080/observe/stats/ingest | jq

| Symptom | Likely cause | Action |
|---|---|---|
| High lag, low task pressure | Source LDAP slow | Check source LDAP performance |
| High lag, high task pressure | Processing bottleneck | Increase workers or check DB |
| High queue pressure | DB writes slow | Check connections, run VACUUM ANALYZE |
| ingest_completed = false | Initial sync still running | Check /readyz |
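The symptom table can be turned into a quick triage helper. In this sketch, "high" is operationalized as above the warning thresholds from the pressure gauge section (lag over 60s, task pressure over 1.2) — that mapping is an assumption, and the sample values stand in for scraped metrics:

```shell
# Suggest a likely cause from ingest lag (seconds) and task pressure (ratio).
triage() {
  lag=$1; task=$2
  high_task=$(awk -v t="$task" 'BEGIN { if (t + 0 > 1.2) print 1; else print 0 }')
  if [ "$lag" -le 60 ]; then
    echo "fresh enough"
  elif [ "$high_task" -eq 1 ]; then
    echo "processing bottleneck: increase workers or check DB"
  else
    echo "source LDAP slow: check source performance"
  fi
}

echo "$(triage 30 0.9)"
echo "$(triage 400 0.8)"
echo "$(triage 400 1.8)"
```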
Alert runbooks
High permit pressure

Alert: scribe_query_permit_pressure > 0.8

- Check /observe/doctor for query.permit_pressure
- Find slow queries via traces or stage durations
- Consider raising database.channelPoolSize
- Check /observe/hints?severity=warning for missing indexes
Ingest falling behind
Alert: scribe_ingest_task_pressure > 1.2

- Check /observe/stats/ingest for per-entry-type breakdown
- Verify source LDAP connectivity
- Run VACUUM ANALYZE if DB is slow
- Consider raising ingest workers
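The gauge table describes scribe_ingest_task_pressure as a keep-up ratio where anything above 1 means ingest is falling behind. One way to read it — the exact formula is an assumption, but "work arriving vs. work completed per interval" matches the ">1 = falling behind" semantics — can be sketched with sample rates:

```shell
# Compute a keep-up ratio from arrival and completion rates (per interval).
task_pressure() {
  awk -v arriving="$1" -v completed="$2" 'BEGIN { printf "%.2f\n", arriving / completed }'
}

echo "ratio $(task_pressure 120 80): falling behind"   # above 1
echo "ratio $(task_pressure 60 80): keeping up"        # below 1
```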
Memory pressure
Alert: jvm_memory_pressure > 0.8

- Check /observe/doctor context section
- Review -Xmx heap size
- Profile for leaks if persistent
- Restart as last resort
Service flapping
Alert: scribe_service_restarts_5m > 1

- Check /observe/doctor for services.restarts_5m
- Review logs for crash causes
- Check resource limits (CPU, memory)
- Check external dependencies (DB, LDAP)
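The alert rule above is a simple integer comparison, so a monitoring side-script can evaluate it directly. A sketch, with sample counts standing in for the scraped gauge:

```shell
# Evaluate the flapping rule: scribe_service_restarts_5m > 1.
flapping() {
  if [ "$1" -gt 1 ]; then echo "flapping: $1 restarts in 5m"
  else echo "stable"; fi
}

echo "$(flapping 0)"
echo "$(flapping 3)"
```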