Observability

Collect metrics, traces, and logs from IdentityScribe. This guide covers endpoints, signals, playbooks, and PromQL examples.

For exhaustive inventories, see the generated telemetry reference.

Related:

Monitoring — Dashboards, workflows, and troubleshooting
Upgrading IdentityScribe — Migration notes for endpoints and metrics
Failures — Error codes, retry semantics, and support workflow

Note: This document describes the OTel-first contract. Legacy /status/* endpoints have been removed in favor of /observe/* (breaking change, no aliases).

Endpoint taxonomy

Standard paths

Path	Description
`/metrics`	Prometheus scrape endpoint (OpenTelemetry-backed)
`/observe/health`	Combined MicroProfile health (JSON)
`/observe/health/live`	Liveness probe (JSON)
`/observe/health/ready`	Readiness probe (JSON)
`/observe/health/started`	Startup probe (JSON)
`/observe/health/check/{name}`	Individual health check by name (JSON)
`/livez`	Kubernetes liveness probe (plain text)
`/readyz`	Kubernetes readiness probe (plain text)
`/startedz`	Kubernetes startup probe (plain text)
`/healthz`	Kubernetes combined health (plain text)

Note: The root path / no longer serves metrics — use /metrics explicitly.

Observe endpoints (`/observe/*`)

These endpoints return on-demand JSON insights (cached) for data that would cause a cardinality explosion if it were exported as metric labels.

Path	Description
`/observe`	OpenAPI documentation (HTML UI + JSON/YAML spec)
`/observe/status`	Basic status
`/observe/channels`	Channel and socket discovery with runtime binding info
`/observe/config`	Resolved configuration (passwords redacted)
`/observe/doctor`	Health report with threshold checks and recommendations
`/observe/services`	Service lifecycle status (state, uptime, failures)
`/observe/pressure`	Saturation metrics (queue/task/memory pressure)
`/observe/signals`	Golden signals summary (latency, traffic, errors, saturation)
`/observe/indexes`	Index build status and concurrent build detection
`/observe/hints`	Persisted hints
`/observe/signatures`	Query signatures
`/observe/stats/values`	Value size statistics per entry type and attribute
`/observe/stats/entries`	Entry blob size percentiles per entry type
`/observe/stats/events`	Event rate windows (dashboard-friendly buckets)
`/observe/stats/ingest`	Ingest lag and checkpoint positions
`/observe/mcp`	MCP server for AI assistants (see MCP Channel)

Tip: Use /observe for interactive docs. If you need machine-readable status, use /observe/status.

Target: /observe/* replaces the legacy /status/* paths (breaking change, no aliases).

Stats endpoints performance note

The /observe/stats/* endpoints execute direct database queries and can be expensive on large datasets:

Use case: Investigation, debugging, collecting real usage statistics—not high-frequency polling.
Caching: Two-tier caching protects the database:
- In-process cache (30s TTL): /values, /entries, /ingest endpoints cache responses server-side.
- HTTP Cache-Control: All responses include Cache-Control: max-age=30, private for client-side caching.
- /events is parameterized by since, so it uses client-side caching only.
Row limits: Queries return at most 100 rows by default to bound response size.
since parameter (/observe/stats/events):
- ISO-8601 timestamps: 2026-01-01T00:00:00Z, 2026-01-01T00:00:00+01:00
- Duration strings: 1h, 24h, 7d, 30m, PT1H
- Invalid input returns 400: Future timestamps, negative/zero durations, or unparseable values.
Precision: Numeric byte fields (avgBytes, p50Bytes, etc.) use double to preserve fractional values.

These endpoints are intended for operational investigation and dashboard population, not continuous scraping.
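
The accepted `since` formats can be checked on the client before a request is sent. The following is a minimal sketch of such a pre-check, not the server's parser: the function name `parse_since`, the set of unit suffixes (seconds are an assumption; the examples above only show `m`, `h`, and `d`), and the restriction of ISO-8601 durations to hours, minutes, and seconds are all illustrative choices.

```python
import re
from datetime import datetime, timedelta, timezone

# Units seen in the documented examples (30m, 1h, 24h, 7d); "s" is an assumption.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_SHORT = re.compile(r"^(\d+)([smhd])$")
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_since(value: str, now: datetime) -> datetime:
    """Resolve a `since` value to an absolute UTC instant, or raise ValueError.

    `now` must be timezone-aware.
    """
    short = _SHORT.match(value)
    iso = _ISO_DURATION.match(value)
    if short:
        seconds = int(short.group(1)) * _UNIT_SECONDS[short.group(2)]
    elif iso and any(iso.groups()):
        hours, minutes, secs = (int(g or 0) for g in iso.groups())
        seconds = hours * 3600 + minutes * 60 + secs
    else:
        # Not a duration: try an ISO-8601 timestamp with an explicit offset.
        # fromisoformat raises ValueError for unparseable input.
        ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
        if ts.tzinfo is None:
            raise ValueError("timestamp needs a UTC offset")
        if ts > now:
            raise ValueError("timestamp is in the future")
        return ts.astimezone(timezone.utc)
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return now - timedelta(seconds=seconds)
```

Rejecting bad values locally avoids a round trip that would end in a 400 response; the server remains the authority on what it accepts.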

Channels endpoint (`/observe/channels`)

Returns enabled channels, sockets, and runtime binding information. Useful for service discovery, UI connectivity information, and debugging network configuration.

Example response:

{
  "channels": {
    "ldap": {
      "enabled": true,
      "running": true,
      "bindings": [
        {
          "host": "0.0.0.0",
          "configuredPort": 0,
          "actualPort": 10389,
          "ssl": false,
          "url": "ldap://0.0.0.0:10389"
        },
        {
          "host": "0.0.0.0",
          "configuredPort": 10636,
          "actualPort": 10636,
          "ssl": true,
          "url": "ldaps://0.0.0.0:10636"
        }
      ]
    },
    "identityHub": {
      "enabled": false,
      "running": false,
      "bindings": []
    },
    "rest": {
      "enabled": true,
      "sockets": ["@default"],
      "basePath": "/api"
    }
  },
  "monitoring": {
    "prometheus": { "enabled": true, "sockets": ["@default"], "path": "/metrics" },
    "observe": { "enabled": true, "sockets": ["@default"], "path": "/observe" },
    "health": { "enabled": true, "sockets": ["@default"], "paths": ["/livez", "/readyz", "/startedz", "/healthz"] }
  },
  "telemetry": {
    "traces": { "enabled": true },
    "metrics": { "prometheus": true, "otlp": false },
    "hints": { "enabled": true, "explain": false, "persistence": true }
  },
  "sockets": {
    "@default": {
      "host": "0.0.0.0",
      "configuredPort": 0,
      "actualPort": 8080,
      "ssl": false,
      "url": "http://api.example.com:8080"
    },
    "internal": {
      "host": "127.0.0.1",
      "configuredPort": 9090,
      "actualPort": 9090,
      "ssl": false,
      "url": "http://127.0.0.1:9090"
    }
  },
  "request": {
    "socket": "@default",
    "host": "api.example.com",
    "port": 8080,
    "scheme": "http"
  },
  "timestamp": "2026-01-13T12:00:00Z"
}

Key features:

channels: Enabled channels with binding info. LDAP shows running status and actual ports (important for ephemeral port 0). REST shows socket references.
sockets: HTTP sockets with both configuredPort and actualPort (useful when port 0 is configured for ephemeral binding).
request: Request context showing detected host/port/scheme (respects X-Forwarded-* headers from proxies).
url: Auto-generated connection URL. For the current socket, uses detected host/scheme from request headers. For other sockets, uses configured values.

Use cases:

UI service discovery (fetch socket URLs dynamically)
Debugging ephemeral port bindings in test environments
Verifying proxy header forwarding (request.host, request.scheme)
Operational dashboards showing enabled features
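
For the service-discovery and ephemeral-port use cases, a client only needs a few fields from the response. The sketch below works on an already-parsed response dictionary shaped like the example above; the helper names `socket_urls` and `ephemeral_bindings` are hypothetical, and fetching the JSON is left to the caller.

```python
def socket_urls(channels_response: dict) -> dict:
    """Map each HTTP socket name to its advertised URL."""
    sockets = channels_response.get("sockets", {})
    return {name: sock["url"] for name, sock in sockets.items()}


def ephemeral_bindings(channels_response: dict) -> list:
    """List (owner, actualPort) pairs whose configured port was 0."""
    found = []
    for name, sock in channels_response.get("sockets", {}).items():
        if sock.get("configuredPort") == 0:
            found.append((f"socket:{name}", sock["actualPort"]))
    for name, channel in channels_response.get("channels", {}).items():
        # REST-style channels reference sockets and carry no bindings list.
        for binding in channel.get("bindings", []):
            if binding.get("configuredPort") == 0:
                found.append((f"channel:{name}", binding["actualPort"]))
    return found
```

Applied to the example response, `ephemeral_bindings` would report the `@default` socket on port 8080 and the first LDAP binding on port 10389.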

Config endpoint (`/observe/config`)

Returns the resolved configuration with passwords redacted. Equivalent to the --printconfig CLI flag but accessible at runtime via HTTP.

Content negotiation:

Accept Header	Response Format
`application/json` (default)	JSON with `config` and `timestamp` fields
`text/plain`	Raw HOCON config text

Example JSON response:

{
  "config": "Configuration Sources:\n\n- system properties\n- reference.conf\n\napp {\n  mode = production\n}\n\ndatabase {\n  \"password\" : \"<REDACTED>\"\n  host = \"localhost\"\n  port = 5432\n}\n...",
  "timestamp": "2026-01-13T12:00:00Z"
}

Example plain text request:

curl -H "Accept: text/plain" http://localhost:8080/observe/config

Key features:

Password redaction: All password fields are replaced with <REDACTED>
Configuration sources: Shows merge order (system properties, env vars, config files)
Lazy caching: Config string is computed once on first request
Cache-Control: Responses include Cache-Control: max-age=3600, private (1-hour TTL)
Excludes JVM internals: Filters out java.*, jdk.*, sun.*, org.graalvm.* paths

Use cases:

Debugging configuration issues in production without shell access
Verifying environment variable overrides are applied correctly
Support diagnostics (config can be shared without exposing secrets)
CI/CD verification that config is resolved as expected
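
For the CI/CD use case, the content negotiation described above can be driven from a script. This is a minimal sketch using only the Python standard library; the helper names and the `base_url` value are assumptions, and the endpoint path and `Accept` values are the ones documented in this section.

```python
import urllib.request


def build_config_request(base_url: str, plain_text: bool = False) -> urllib.request.Request:
    """Build a GET request for /observe/config with the desired Accept header."""
    accept = "text/plain" if plain_text else "application/json"
    url = f"{base_url.rstrip('/')}/observe/config"
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_config(base_url: str, plain_text: bool = False, timeout: float = 5.0) -> str:
    """Fetch the resolved configuration and return the response body as text."""
    request = build_config_request(base_url, plain_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A pipeline step could call `fetch_config("http://localhost:8080", plain_text=True)` and search the returned HOCON text for the override it expects.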

Services endpoint (`/observe/services`)

Returns detailed per-service status with uptime, startup duration, tags, and failure causes.

Example response:

{
  "services": [
    {
      "id": "Scribe.user",
      "name": "Scribe",
      "state": "running",
      "healthy": true,
      "tags": {"entryType": "user"},
      "uptime_seconds": 3600,
      "startup_seconds": 2.5
    },
    {
      "id": "Database.System",
      "name": "Database.System",
      "state": "failed",
      "healthy": false,
      "failure": "Connection refused"
    }
  ],
  "summary": {
    "total": 8,
    "healthy": 7,
    "unhealthy": 1,
    "restarts_5m": 3
  },
  "timestamp": "2026-01-08T12:00:00Z"
}

Use case: Detailed service diagnostics, startup timing analysis, failure investigation.
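
When investigating failures, the unhealthy entries are usually the only ones of interest. A small sketch that filters a parsed response of the shape shown above (the function name `unhealthy_services` is hypothetical):

```python
def unhealthy_services(response: dict) -> list:
    """Return (id, state, failure) for every service not reporting healthy."""
    return [
        (svc["id"], svc["state"], svc.get("failure", ""))
        for svc in response.get("services", [])
        if not svc.get("healthy", False)
    ]
```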

Doctor endpoint (`/observe/doctor`)

Returns a health report with threshold-based checks, per-service status, and actionable recommendations.

Key features:

services.down check shows which services are down by ID (not just count)
services array provides per-service state, uptime, and health
recommendations array with prioritized, actionable hints

Example services.down hint: "Down: Scribe.user, Database.Batch"

Golden signals

IdentityScribe exposes the Four Golden Signals for quick system health assessment, covering both query and ingest sides.

Query signals (channel side)

Signal	Metric	What it measures
Latency	`scribe_signals_latency_p95`	Response time p95 (seconds)
Traffic	`scribe_signals_requests_per_second`	Request throughput
Errors	`scribe_signals_error_rate_percent`	Failure percentage
Saturation	`scribe_signals_traffic_ratio`, `scribe_db_pool_pressure`	Resource utilization

Per-channel breakdown with channel label:

scribe_signals_channel_latency_p95{channel="ldap"}
scribe_signals_channel_requests_per_second{channel="rest"}
scribe_signals_channel_error_rate_percent{channel="graphql"}

Ingest signals (sync side)

Signal	Metric	What it measures
Latency	`scribe_signals_ingest_task_duration_p95`	Task processing time p95
Latency	`scribe_signals_ingest_lag_max_seconds`	Worst replication lag
Traffic	`scribe_signals_ingest_changes_per_second`	Change detection rate
Errors	`scribe_signals_ingest_failed_rate_percent`	Task failure percentage

Per-entry-type breakdown with entry_type label:

scribe_signals_ingest_entry_lag_seconds{entry_type="user"}
scribe_signals_ingest_entry_task_duration_p95{entry_type="group"}
scribe_signals_ingest_entry_changes_per_second{entry_type="role"}

Built-in dashboard

The /observe/signals endpoint returns a JSON summary for the built-in dashboard:

curl -s http://localhost:8080/observe/signals | jq

Grafana integration

These metrics integrate with existing dashboards via PromQL:

# Query signals alerting
scribe_signals_latency_p95 > 2.0
scribe_signals_error_rate_percent > 5.0
scribe_signals_traffic_ratio > 5.0

# Ingest signals alerting
scribe_signals_ingest_lag_max_seconds > 300
scribe_signals_ingest_failed_rate_percent > 1.0
scribe_signals_ingest_task_duration_p95 > 5.0

# Per-entry-type lag comparison
scribe_signals_ingest_entry_lag_seconds

A dedicated Golden Signals dashboard is available at monitoring/grafana/dashboards/signals.json.
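
Outside Grafana, the same thresholds can be checked directly against a `/metrics` scrape. The sketch below is an illustration only: it uses a deliberately simplified parser for the Prometheus text format (no timestamps, no `}` inside label values), the function name is hypothetical, and the threshold values are the example values from the alerting expressions above, not recommended defaults.

```python
import re

# Example thresholds copied from the alerting expressions in this section.
THRESHOLDS = {
    "scribe_signals_latency_p95": 2.0,
    "scribe_signals_error_rate_percent": 5.0,
    "scribe_signals_traffic_ratio": 5.0,
    "scribe_signals_ingest_lag_max_seconds": 300.0,
    "scribe_signals_ingest_failed_rate_percent": 1.0,
    "scribe_signals_ingest_task_duration_p95": 5.0,
}

# metric_name, optional {labels}, then the sample value.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def breached_signals(exposition: str, thresholds: dict = THRESHOLDS) -> dict:
    """Return {metric: worst value} for every metric above its threshold."""
    breached = {}
    for line in exposition.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:  # "# HELP", "# TYPE", and blank lines do not match
            continue
        name, value = match.group(1), float(match.group(2))
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            breached[name] = max(value, breached.get(name, value))
    return breached
```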

Core metrics inventory

All metrics in this contract use the scribe. prefix (canonical dot notation), which is auto-converted to scribe_ for Prometheus.
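
To translate a canonical name from the tables below into the name used in PromQL, replace the dots with underscores. A one-line sketch (the helper name is hypothetical, and it covers only the dot-to-underscore step described here):

```python
def to_prometheus_name(canonical: str) -> str:
    """Convert canonical dot notation to the Prometheus metric name."""
    return canonical.replace(".", "_")
```

Histograms additionally appear in Prometheus with the standard `_bucket`, `_sum`, and `_count` suffixes, which is why the PromQL examples query `scribe_channel_request_duration_seconds_bucket`.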

Channel (front door SLO)

Metric	Type	Labels	Description
`scribe.channel.requests.total`	Counter	`channel`, `op`, `result`	Total requests by channel/operation
`scribe.channel.request.duration.seconds`	Histogram	`channel`, `op`, `result`	Request latency distribution
`scribe.channel.inflight`	Gauge	`channel`, `op`	Currently processing requests

Query pipeline

Metric	Type	Labels	Description
`scribe.query.stage.duration.seconds`	Histogram	`channel`, `op`, `stage`, `result`	Per-stage latency breakdown
`scribe.query.shapes.total`	Counter	`channel`, `op`, `shape`	Query shape classification counts
`scribe.query.permit.pressure`	Gauge	—	Permit utilization (0..1)
`scribe.query.permit.queue`	Gauge	—	Threads waiting for permits (count)
`scribe.query.rejected.total`	Counter	`channel`, `result`	Rejected queries (resource exhaustion)

Ingest

Metric	Type	Labels	Description
`scribe.ingest.lag.seconds`	Gauge	`entry_type`	Seconds behind head
`scribe.ingest.queue.pressure`	Gauge	`entry_type`	Queue fill ratio (0..1)
`scribe.ingest.task.pressure`	Gauge	`entry_type`	Processing demand ratio (~1 steady)
`scribe.ingest.changes.total`	Counter	`entry_type`, `change`	Change events by type
`scribe.ingest.events.written.total`	Counter	`entry_type`, `event_type`	Events written to store

Store

Metric	Type	Labels	Description
`scribe.store.commit.duration.seconds`	Histogram	`phase`	Commit phase durations
`scribe.store.commit.wait.duration.seconds`	Histogram	`phase`	Wait time for commit phases

Services

Metric	Type	Labels	Description
`scribe.service.restarts.total`	Counter	`service`	Service restart count
`scribe.service.up`	Gauge	`service`	Service health (0/1)
`scribe.service.transitions.total`	Counter	`service`, `from`, `to`	Service state transition count
`scribe.service.transition.duration.seconds`	Histogram	`service`, `from`, `to`	Time spent in each state before transition

Registered services: Database.System, Database.Batch, Database.Channel, Channel.LDAP, Channel.IdentityHub, Scribe, TaskExecutor, HintEngine, LicenseVerification, HelidonObserveServer.

Hints

Metric	Type	Labels	Description
`scribe.hints.queue.size`	Gauge	—	Persistence queue size
`scribe.hints.queue.dropped.total`	Counter	—	Dropped hints (queue full)
`scribe.hints.persisted.total`	Counter	—	Successfully persisted hints

JVM / process

These metrics provide minimal, portable runtime gauges that work reliably in both JVM and GraalVM native-image environments.

Metric	Type	Labels	Description
`jvm.memory.used`	Gauge	`jvm.memory.type`	Memory in use (bytes)
`jvm.memory.committed`	Gauge	`jvm.memory.type`	Memory committed (bytes)
`jvm.memory.limit`	Gauge	`jvm.memory.type`	Max memory (bytes)
`jvm.memory.pressure`	Gauge	—	Memory pressure (used/max, 0..1)
`jvm.thread.count`	Gauge	—	Active thread count
`jvm.cpu.count`	Gauge	—	Available processors
`process.uptime`	Gauge	—	Process uptime (seconds)

Note: GC metrics (jvm.gc.*) are intentionally omitted as they are not reliably available in GraalVM native-image.

External metrics (not in `scribe.*` contract)

These are emitted by libraries/frameworks and are kept as-is:

hikaricp_* — HikariCP connection pool metrics
jvm_* — JVM memory, threads (minimal, native-image safe; GC metrics intentionally omitted)
process_* — Process uptime (minimal, native-image safe)
system_* — System CPU, load
executor_* — Executor service metrics

PromQL examples

Request rate and latency

# Request rate by channel
sum by (channel) (rate(scribe_channel_requests_total[5m]))

# p99 latency by channel
histogram_quantile(0.99,
  sum by (channel, le) (rate(scribe_channel_request_duration_seconds_bucket[5m]))
)

# Error rate
sum(rate(scribe_channel_requests_total{result!="ok"}[5m]))
  / sum(rate(scribe_channel_requests_total[5m]))

Pressure alerts

# Query permit pressure sustained high
avg_over_time(scribe_query_permit_pressure[5m]) > 0.9

# Ingest queue pressure by entry_type
scribe_ingest_queue_pressure > 0.8

# Ingest falling behind
scribe_ingest_task_pressure > 1.2

Stage breakdown

# p95 by stage
histogram_quantile(0.95,
  sum by (stage, le) (rate(scribe_query_stage_duration_seconds_bucket[5m]))
)

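# Which stage dominates? (share of total stage time; scalar() lets the
# per-stage vector divide by the unlabelled total)
sum by (stage) (rate(scribe_query_stage_duration_seconds_sum[5m]))
  / scalar(sum(rate(scribe_query_stage_duration_seconds_sum[5m])))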

Service health

# Services that restarted recently
increase(scribe_service_restarts_total[1h]) > 0

# Services currently down
scribe_service_up == 0

# Service state transitions (e.g., identify flapping services)
rate(scribe_service_transitions_total[5m])

# How long services spent starting (to detect slow startups)
histogram_quantile(0.95, rate(scribe_service_transition_duration_seconds_bucket{to="running"}[1h]))

# Services that failed recently
increase(scribe_service_transitions_total{to="failed"}[1h]) > 0
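
Any of these expressions can also be evaluated from a script through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the URL, so that the PromQL is encoded correctly; the Prometheus base URL is an assumption about your environment and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Example: services currently down.
url = instant_query_url("http://prometheus:9090", "scribe_service_up == 0")

# Round-trip check: the encoded query decodes back to the original PromQL.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```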

Trace layout

Query pipeline spans

When tracing is enabled, each LDAP/REST/GraphQL query produces a nested span hierarchy:

LDAP.Search (or REST.Search, etc.)         ← Channel entry span
├── Query.Normalize                         ← Normalization stage
├── Query.Plan                              ← Planning stage
├── Query.Compile                           ← SQL emission stage (Prepare)
└── Query.Execute                           ← DB execution + result streaming

Span attributes:

Attribute	Description	Example
`scribe.result`	Outcome classification	`ok`, `cancelled`, `deadline_exceeded`
`scribe.search.kind`	Search pagination mode (trace-only)	`simple`, `paged`, `vlv`
`scribe.query.signature`	Query signature hash (trace-only)	`a1b2c3d4`
`scribe.entry_type`	Entry type(s) in scope (trace-only)	`inetOrgPerson`

Stage names

The stage label in scribe.query.stage.duration.seconds maps directly to span names:

Stage	Span Name	Description
`normalize`	`Query.Normalize`	Attribute mapping, filter canonicalization
`plan`	`Query.Plan`	Index selection, predicate classification
`compile`	`Query.Compile`	Logical plan → SQL
`execute`	`Query.Execute`	JDBC execution, result streaming

Using traces for latency debugging

Find the slow stage: Look at scribe_query_stage_duration_seconds by stage
Get a sample trace: Filter by trace ID or look for traces with high duration
Drill into the Execute span: If execute is slow, check for:
- Database lock contention
- Missing indexes (see /observe/hints)
- Pool saturation (hikaricp_connections_pending > 0)
Check the Plan span: If plan is slow, the query may be too complex
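
The first step, finding the slow stage, comes down to comparing per-stage time and mapping the winner to the span to inspect. A minimal sketch, assuming the per-stage seconds have already been collected (for example from the `_sum` series of the stage histogram); the function name and the input shape are illustrative, and the stage-to-span mapping is taken from the table above.

```python
# Stage label -> span name, as listed in the "Stage names" table.
SPAN_FOR_STAGE = {
    "normalize": "Query.Normalize",
    "plan": "Query.Plan",
    "compile": "Query.Compile",
    "execute": "Query.Execute",
}


def dominant_stage(stage_seconds: dict) -> tuple:
    """Return (stage, span name, share of total time) for the slowest stage."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("no stage time recorded")
    stage = max(stage_seconds, key=stage_seconds.get)
    return stage, SPAN_FOR_STAGE[stage], stage_seconds[stage] / total
```

With 0.5 s in normalize, 1.0 s in plan, 0.5 s in compile, and 8.0 s in execute, the result points at `Query.Execute` with an 80% share, which is the cue to check locks, indexes, and pool saturation.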

Enabling tracing

Configure OTLP traces export in application.conf:

monitoring.telemetry.traces {
  enabled = true
  endpoint = "http://otel-collector:4317"
  protocol = "grpc"  # or "http/protobuf" for port 4318
}

Or via environment variables:

export SCRIBE_TELEMETRY_TRACES_ENABLED=true
export SCRIBE_TELEMETRY_TRACES_ENDPOINT=http://otel-collector:4317

Wide event logging

WideLogCollector provides trace-first observability by accumulating context throughout request/task execution and emitting one structured log line when the operation is “interesting.”

What gets logged

Operations emit at completion only when:

Failure: Any failure kind (normalized via Failure.wrap()) → ERROR level
Warnings: At least one warning recorded via WideLogCollector.warn() → WARN level
Exceptions: At least one exception event recorded → WARN level
Slow: Duration exceeds the configured threshold for the flow type
Marked: Explicitly marked interesting via WideLogCollector.markInteresting()

Staying silent on fast successes without warnings or exceptions is intentional: it keeps logs focused on actionable events.

Logger categories

IdentityScribe uses five canonical loggers, all under the com.kenoxa.scribe.* namespace:

Logger	Config Key	Env Override	Purpose
`com.kenoxa.scribe.SuperVisor`	`log.SuperVisor`	`SCRIBE_LOG_SUPERVISOR`	Startup/shutdown, orchestration
`com.kenoxa.scribe.Ingest`	`log.Ingest`	`SCRIBE_LOG_INGEST`	Transcription pipeline, event store
`com.kenoxa.scribe.Monitoring`	`log.Monitoring`	`SCRIBE_LOG_MONITORING`	Wide-log output, observability
`com.kenoxa.scribe.License`	`log.License`	`SCRIBE_LOG_LICENSE`	License verification
`com.kenoxa.scribe.Config`	`log.Config`	`SCRIBE_LOG_CONFIG`	Configuration parsing

All loggers inherit from log.level by default. Configure per-logger levels to control verbosity:

# Disable wide-log output entirely
SCRIBE_LOG_MONITORING=off

# Verbose transcription debugging
SCRIBE_LOG_INGEST=debug

# Quiet license checks
SCRIBE_LOG_LICENSE=error

Log format

Wide logs are emitted via the Monitoring logger at INFO/WARN/ERROR level:

{"trace_id":"...","span_id":"...","duration_seconds":1.5,"result":"ok","scribe.operation":"LDAP.Search",...}

JSON format

Single-line JSON for machine parsing (expanded here for readability):

{
  "trace_id": "abc123",
  "span_id": "def456",
  "duration_seconds": 1.5,
  "result": "ok",
  "scribe.operation": "LDAP.Search",
  "scribe.entry_type": "user",
  "scribe.search.kind": "paged",
  "events": [
    {"name": "warning", "timestamp": "...", "attributes": {"code": "unsupported_control", "message": "VLV not available"}}
  ]
}
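
Because each JSON wide log is one line, downstream tooling can process it with any JSON parser. A small sketch that pulls the warning events out of a single line, assuming the field layout shown above (the function name is hypothetical):

```python
import json


def warnings_in(line: str) -> list:
    """Return the attribute dicts of all warning events in one wide-log line."""
    record = json.loads(line)
    return [
        event.get("attributes", {})
        for event in record.get("events", [])
        if event.get("name") == "warning"
    ]
```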

Pretty format (default)

Human-friendly format with header line, auto-grouped attributes, and segment timeline. When terminal color support is detected, output is enhanced with ANSI colors:

Element	Color	Purpose
Result `ok`	Green	Success at a glance
Result `ok` (with warnings)	Yellow	Success but needs attention
Result (error)	Red	Immediate attention required
Duration	Dim	Visual separation from content
Attribute group prefixes	Cyan	Quick section identification (`db:`, `http:`)
Attribute keys	Bold	Easy scanning within groups
Warning events	Yellow	Stand out in event timeline
Failure header	Red	Draws eye to error details

Color detection uses TERM, FORCE_COLOR, NO_COLOR (no-color.org), and TTY presence. Disable colors with NO_COLOR=1 or TERM=dumb.

LDAP.Search ok dur=1.5s warnings=1 events=5
  trace=abc123 span=def456
  db: system=postgresql duration.seconds=0.42 row_count=42
  http: route="/ldap/search"
  scribe: entry_type=user search.kind=paged
  scribe.query: scope="ou=users,o=org" where="(&(objectClass=*))" sort="cn"
  events:
    +  5.1ms  12.3ms  Query.Plan
    warning code=unsupported_control message="VLV not available"
    + 18.0ms 401.9ms  DB.Fetch row_count=42
    cache.hit
    +520.2ms   9.8ms  Limiter.Acquire permits=1

Failure output (with full details and stacktrace):

scribe REST.Modify internal dur=311ms trace=trace123 span=span456
  failure:
    kind=INTERNAL code=SERVER_ERROR
    message="Database connection failed"
    details={"context": "user.modify", "attempt": 3}
    trace_id=trace123 span_id=span456
    cause_type=java.sql.SQLException
    cause_message="Connection refused"
    stacktrace:
      java.sql.SQLException: Connection refused
        at org.postgresql.Driver.connect(Driver.java:285)
        at ...
  scribe: entry_type=user

Features:

Header line: Operation, result (colored on TTY), duration, warning/event counts
Trace/span: Only shown on warnings or failures (de-emphasized on success)
Auto-grouped attributes: Keys grouped by prefix (e.g., db:, scribe:, scribe.query:)
Segment timeline: Child segments shown with offset and duration (+offset duration name)
JSON detection: String values that are JSON are auto-formatted
Color support: Auto-detected based on TTY, CI environment, and FORCE_COLOR/NO_COLOR env vars

Wide event fields

Field	Type	Description
`trace_id`	string	OpenTelemetry trace ID (when tracing enabled)
`span_id`	string	OpenTelemetry span ID
`parent_span_id`	string	Parent span ID (optional)
`duration_seconds`	double	Operation duration in seconds
`result`	string	`ok` or failure kind (e.g., `internal`, `not_found`)
`failure`	object	Failure details: `kind`, `code`, `message`, `details`, `stacktrace` (when failed)
`events`	array	Event records including warnings (name=`warning`) and segment timing
`scribe.operation`	string	Operation name (e.g., `LDAP.Search`, `REST.GET`, `Transcription.WorkItem`)
Custom attributes	varies	Any attributes set via `Segment.annotate()`

Configuration

monitoring.log {
  # Enable/disable wide event logging (default: true)
  enabled = true
  enabled = ${?SCRIBE_LOG_ENABLED}

  # Log format: pretty (default) | json | auto
  # auto = pretty in dev mode + TTY, json otherwise
  # Uses app.mode (dev/development/local/test → dev mode)
  format = "pretty"
  format = ${?SCRIBE_LOG_FORMAT}

  # Random sampling for ops that don't match any rule
  # 0 = never sample, 100 = always log (default)
  sample-rate = 100
  sample-rate = ${?SCRIBE_LOG_SAMPLE_RATE}

  # Per-key redaction strategies
  # Keys support glob patterns: * = any chars, ? = single char
  #
  # Available strategies:
  #   replace  - Replace with "[REDACTED]"
  #   hash     - SHA-256, base64url, 16 chars
  #   truncate - Show ≤33% of chars (max 8 visible); ≤8 chars fall back to [REDACTED]
  #   omit     - Remove attribute entirely
  #
  # Hard-coded security patterns (always OMIT, cannot be disabled):
  #   *password*, *credential*, *token*, *secret*, *apikey*, *api_key*
  redaction {
    "*dn*" = hash           # Matches scribe.entry_dn, ldap.base_dn, etc.
    "*email*" = truncate    # Matches user.email, notification.email, etc.
    "*.raw" = replace
    "*.pii.*" = hash
  }
}
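
To make the four strategies concrete, here is an illustrative sketch of how they could behave. It is not the product's implementation: the truncation suffix, the case-insensitive matching of the hard-coded patterns, the first-match rule order, and the use of Python's `fnmatchcase` (which also understands `[...]` sets, beyond the documented `*` and `?`) are all assumptions.

```python
import base64
import hashlib
from fnmatch import fnmatchcase

# Hard-coded security patterns from the configuration comments above.
ALWAYS_OMIT = ("*password*", "*credential*", "*token*", "*secret*", "*apikey*", "*api_key*")
_DROP = object()  # sentinel meaning "remove the attribute entirely"


def _apply(strategy: str, value: str):
    if strategy == "omit":
        return _DROP
    if strategy == "replace":
        return "[REDACTED]"
    if strategy == "hash":
        # SHA-256, base64url-encoded, first 16 characters.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return base64.urlsafe_b64encode(digest).decode("ascii")[:16]
    if strategy == "truncate":
        if len(value) <= 8:
            return "[REDACTED]"
        visible = min(8, len(value) * 33 // 100)  # at most 33% and at most 8 chars
        return value[:visible] + "..."            # suffix is an assumption
    raise ValueError(f"unknown strategy: {strategy}")


def redact(attributes: dict, rules: dict) -> dict:
    """Apply the security patterns first, then the first matching rule per key."""
    out = {}
    for key, value in attributes.items():
        if any(fnmatchcase(key.lower(), pattern) for pattern in ALWAYS_OMIT):
            continue
        strategy = next((s for pattern, s in rules.items() if fnmatchcase(key, pattern)), None)
        result = value if strategy is None else _apply(strategy, str(value))
        if result is not _DROP:
            out[key] = result
    return out
```

With the rules from the example configuration, `scribe.entry_dn` would be replaced by a stable 16-character hash, a 17-character email address would keep only its first 5 characters, and `db.password` would never reach the log regardless of the configured rules.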

Rule-based filtering

Rules control which operations are logged. They evaluate in order — first match wins. Operations not matching any rule fall through to random sampling.

Decision flow:

Failures → Always logged (ERROR level)
Warnings/Exceptions → Always logged (WARN level)
markInteresting() → Always logged
First matching rule → include logs, exclude suppresses
No match → Apply sample-rate (0-100%)

Rule syntax:

monitoring.log.rules = [
  { action = include|exclude, name = "glob", where = "(filter)" }
]

Field	Required	Description
`action`	Yes	`include` (log) or `exclude` (suppress)
`name`	No	Glob pattern for operation name (`*` = any chars, `?` = single char)
`where`	No	LDAP-style filter on attributes

Duration filtering:

The synthetic duration.seconds attribute is injected before rule evaluation, enabling duration-based filtering:

monitoring.log.rules = [
  # Suppress fast successful operations
  { action = exclude, where = "(&(scribe.result=ok)(duration.seconds<=50ms))" }

  # Always log slow operations
  { action = include, where = "(duration.seconds>=5s)" }

  # Suppress internal maintenance under threshold
  { action = exclude, name = "Hints.*", where = "(duration.seconds<=100ms)" }
  { action = exclude, name = "Metrics.*", where = "(duration.seconds<=500ms)" }
]

Duration values support multiple formats: plain seconds (0.05), HOCON style (50ms, 5s, 1m), or ISO 8601 (PT5S).

Common patterns:

# Log all LDAP operations
{ action = include, name = "LDAP.*" }

# Suppress fast successful ops
{ action = exclude, where = "(&(scribe.result=ok)(duration.seconds<=50ms))" }

# Log slow DB queries
{ action = include, where = "(db.duration.seconds>=1s)" }

# Exclude everything else (catch-all)
{ action = exclude }

See monitoring.log.rules for the complete reference.
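
The decision flow can be summarised in a few lines of code. The sketch below is a simplified model, not the actual evaluator: the LDAP-style `where` filter is replaced by a plain Python predicate to keep the example short, and the `Rule` class, `should_log` function, and `rng` parameter are hypothetical names.

```python
import random
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Optional


@dataclass
class Rule:
    action: str                                      # "include" or "exclude"
    name: str = "*"                                  # glob on the operation name
    where: Optional[Callable[[dict], bool]] = None   # stand-in for the LDAP-style filter


def should_log(op_name: str, attrs: dict, rules: list, sample_rate: float = 100,
               failed: bool = False, warned: bool = False, marked: bool = False,
               rng: Callable[[], float] = random.random) -> bool:
    # Failures, warnings/exceptions, and explicitly marked operations are always logged.
    if failed or warned or marked:
        return True
    # First matching rule wins.
    for rule in rules:
        if fnmatchcase(op_name, rule.name) and (rule.where is None or rule.where(attrs)):
            return rule.action == "include"
    # No rule matched: fall through to random sampling (0-100%).
    return rng() * 100 < sample_rate


rules = [
    Rule("exclude", where=lambda a: a["scribe.result"] == "ok" and a["duration.seconds"] <= 0.05),
    Rule("include", name="LDAP.*"),
]
fast = {"scribe.result": "ok", "duration.seconds": 0.01}
slow = {"scribe.result": "ok", "duration.seconds": 0.2}
```

Here a fast successful `LDAP.Search` is suppressed by the first rule, a slower one is logged by the second, and a failed operation is logged no matter which rules are configured.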

Segment tracking

Child segments can be automatically captured as events in wide logs, providing operation breakdown without trace analysis.

Modes:

Mode	What’s Captured	Overhead
`auto`	Full in dev mode, minimal in prod (default)	Varies
`off`	Nothing	None
`minimal`	Name, offset, duration	Low
`full`	Name, offset, duration, segment attributes	Moderate

Configuration:

monitoring.log.childs {
  # Mode: auto (default) | off | minimal | full
  # auto = full in dev mode, minimal in prod
  mode = "auto"
  mode = ${?SCRIBE_LOG_CHILDS_MODE}

  # Display: auto (default) | off | summary | details
  # auto = summary in dev, off in prod
  display = "auto"

  # Rules: which segments to track (first match wins, empty = include all)
  rules = [
    { action = include, where = "(duration.seconds>=1ms)" }
    { action = exclude }  # Catch-all: filter sub-millisecond noise
  ]
}

Example configurations:

# Track all segments (no filtering)
childs {
  mode = "full"
  rules = []  # Empty = include all
}

# Track only slow segments
childs {
  mode = "minimal"
  rules = [
    { action = include, name = "Query.*", where = "(duration.seconds>=10ms)" }
    { action = include, name = "DB.*", where = "(duration.seconds>=5ms)" }
    { action = exclude }
  ]
}

Event attributes (added to each tracked segment event):

Attribute	Type	Description
`offset.seconds`	double	Time from operation start to segment start
`duration.seconds`	double	Segment duration
Segment attributes	varies	In `full` mode, includes attributes set via `segment.annotate()`

Ordering: Events appear in segment start-time order (not end-time), making the wide log easy to read chronologically.

Redaction: Segment-local attributes are subject to the same redaction rules as operation-level attributes.

Use case: Identify slow sub-operations without diving into traces:

{
  "duration_seconds": 0.125,
  "result": "ok",
  "scribe.operation": "LDAP.Search",
  "events": [
    {
      "name": "Query.Plan",
      "timestamp": "2026-01-07T10:30:00.001Z",
      "attributes": { "offset.seconds": 0.001, "duration.seconds": 0.002 }
    },
    {
      "name": "Query.Execute",
      "timestamp": "2026-01-07T10:30:00.003Z",
      "attributes": { "offset.seconds": 0.003, "duration.seconds": 0.095 }
    },
    {
      "name": "Query.Map",
      "timestamp": "2026-01-07T10:30:00.098Z",
      "attributes": { "offset.seconds": 0.098, "duration.seconds": 0.025 }
    }
  ]
}
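
Given a parsed wide log like the one above, the slowest sub-operation can be found without opening a trace viewer. A minimal sketch (the function name is hypothetical; events without a `duration.seconds` attribute, such as warnings, are skipped):

```python
from typing import Optional, Tuple


def slowest_segment(record: dict) -> Optional[Tuple[str, float]]:
    """Return (name, duration in seconds) of the longest tracked segment."""
    segments = [
        event for event in record.get("events", [])
        if "duration.seconds" in event.get("attributes", {})
    ]
    if not segments:
        return None
    top = max(segments, key=lambda event: event["attributes"]["duration.seconds"])
    return top["name"], top["attributes"]["duration.seconds"]
```

For the example record this yields `Query.Execute` at 0.095 seconds, about three quarters of the 0.125-second operation.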

Kubernetes probe configuration

startupProbe:
  httpGet:
    path: /startedz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 90   # 15min max for services + DB init

livenessProbe:
  httpGet:
    path: /livez
    port: 8081
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 1
  successThreshold: 1
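
When adjusting these values, it helps to check the time budget they imply. The sketch below uses the common approximation of initial delay plus period times failure threshold; it ignores probe timeouts, and the function name is hypothetical.

```python
def probe_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time until a continuously failing probe trips its threshold."""
    return initial_delay + period * failure_threshold


startup_budget = probe_budget_seconds(5, 10, 90)   # startupProbe above: 905 s, about 15 min
liveness_budget = probe_budget_seconds(0, 10, 3)   # livenessProbe above: 30 s
readiness_budget = probe_budget_seconds(0, 5, 1)   # readinessProbe above: 5 s
```

The startup figure matches the "15min max" comment in the manifest: services and database initialization have roughly that long before the container is restarted.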
