Observability & Telemetry¶
JarvisCore ships with a built-in, two-layer observability system. Every agent turn, tool call, LLM request, mailbox message, and HITL event is captured automatically — no instrumentation required in your agent code.
The two layers serve different purposes:
| Layer | What it does | Where data goes |
|---|---|---|
| Structured Tracing (TraceManager) | Records what happened — the full execution narrative | Redis List (persistent), Redis PubSub (real-time), JSONL (compliance fallback) |
| Operational Metrics (metrics.py) | Records how it performed — counters, histograms, and gauges | Prometheus (scrape endpoint on port 9090) |
Both layers are non-blocking by design. If Redis is unavailable, tracing falls back to JSONL. If prometheus-client is not installed, metrics become silent no-ops. Neither failure will crash your agent.
Structured Tracing¶
TraceManager¶
TraceManager is the framework's flight data recorder. It is instantiated automatically by the kernel for each workflow step and receives events for the lifetime of that step.
```python
from jarviscore.telemetry import TraceManager

tracer = TraceManager(
    workflow_id="wf-abc123",
    step_id="step-001",
    redis_store=redis_store,  # optional — omit to use JSONL-only mode
    trace_dir="traces",       # directory for JSONL fallback files
)
```
Event Output Channels¶
Every event emitted through TraceManager.log_event() is written to up to three places simultaneously:
1. Redis List (persistent): suitable for replay, post-mortem debugging, and audit log retention. Survives process restarts.
2. Redis PubSub (real-time): subscribe from a dashboard, alerting system, or log aggregator to receive events as they happen.
3. JSONL file (compliance fallback): written even when Redis is unavailable. Each line is a self-contained JSON event — parseable with any standard tooling.

Trace Event Shape¶
Every event has the same envelope:
```json
{
  "workflow_id": "wf-abc123",
  "step_id": "step-001",
  "timestamp": "2026-05-01T17:00:00.000000+00:00",
  "type": "tool_start",
  "data": {
    "tool": "slack_send_message_v1",
    "params": { "channel": "#alerts", "text": "Deploy complete" }
  }
}
```
Event Types¶
All event types are defined in TraceEventType (a typed str enum). The full set:
Workflow Lifecycle¶
| Event | When emitted |
|---|---|
| workflow_start | Workflow begins; includes step_count |
| workflow_complete | Workflow finishes; includes status and summary |
Step Execution¶
| Event | When emitted |
|---|---|
| step_claimed | An agent claims a step from the queue |
| step_complete | Step finishes successfully |
| step_failed | Step terminates with an unrecoverable error |
Kernel Cognition¶
| Event | When emitted |
|---|---|
| thinking | Kernel or subagent logs a reasoning step |
| kernel_delegate | Kernel dispatches work to a subagent |
| subagent_yield | Subagent returns control and requests human input |
Tool Execution¶
| Event | When emitted |
|---|---|
| tool_start | Tool invocation begins; includes tool name and parameters |
| tool_result | Tool invocation completes; includes result preview or error |
LLM Interaction¶
| Event | When emitted |
|---|---|
| llm_request | Outgoing request to an LLM provider; includes provider, model, prompt preview |
| llm_response | LLM response received; includes latency, input tokens, output tokens |
Mailbox¶
| Event | When emitted |
|---|---|
| mailbox_send | Agent sends a message to another agent |
| mailbox_receive | Agent reads messages from its inbox |
HITL¶
| Event | When emitted |
|---|---|
| hitl_task_created | A human review task is created |
| hitl_waiting | Kernel enters wait state for human input |
| hitl_response_received | Human provides a response |
| hitl_resolved | HITL task is closed with an outcome |
Infrastructure¶
| Event | When emitted |
|---|---|
| context_snapshot | Periodic capture of the context store state |
| error_recovery | Automatic recovery action is taken after an error |
Emitting Events Manually¶
The kernel handles all standard events automatically. If you are building a custom subagent or tool, you can emit custom events directly:
```python
# Convenience methods — the recommended approach
tracer.log_tool_start("my_custom_tool", params={"key": "value"})
tracer.log_tool_result("my_custom_tool", result="Done")
tracer.log_thinking("Evaluating whether the output meets the acceptance criteria")

# Raw event — for custom types not covered by convenience methods
tracer.log_event("my_custom_event", data={"detail": "something happened"})
```
Consuming Traces¶
Live stream (Redis PubSub):
```python
import redis, json

r = redis.Redis()
ps = r.pubsub()
ps.subscribe("trace_events:wf-abc123")

for message in ps.listen():
    if message["type"] == "message":
        event = json.loads(message["data"])
        print(event["type"], event["data"])
```
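The persistent Redis List supports replay without a live subscriber. A minimal sketch, assuming the list key mirrors the PubSub channel naming (a hypothetical key format; confirm how your TraceManager deployment names its lists):

```python
import redis, json

r = redis.Redis()

# Hypothetical key format — check your TraceManager configuration.
key = "trace_events:wf-abc123:step-001"

# LRANGE 0 -1 returns every stored event in insertion order.
for raw in r.lrange(key, 0, -1):
    event = json.loads(raw)
    print(event["timestamp"], event["type"])
```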
Replay from JSONL:
```bash
# All events for a workflow
cat traces/wf-abc123_step-001.jsonl | jq .

# Filter for tool events only
cat traces/wf-abc123_step-001.jsonl | jq 'select(.type | startswith("tool_"))'

# Total LLM latency across all workflows
cat traces/*.jsonl | jq 'select(.type == "llm_response") | .data.latency_ms' | awk '{sum+=$1} END {print "Total latency:", sum, "ms"}'
```
Operational Metrics (Prometheus)¶
Installation¶
Prometheus metrics require the prometheus-client package. Without it, all metric calls are silent no-ops — your agent runs normally but no metrics are collected.
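The package installs from PyPI under the same name:

```bash
pip install prometheus-client
```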
Starting the Metrics Server¶
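The sketch below uses prometheus-client's standard start_http_server helper; it assumes JarvisCore's collectors register against the library's default registry, which is an assumption here rather than documented behaviour:

```python
from prometheus_client import start_http_server

# Serve the default registry at http://localhost:9090/metrics.
# Assumption: JarvisCore's metrics.py registers its collectors against
# prometheus-client's default registry.
start_http_server(9090)

# ... start your agents as usual; the exporter runs in a background thread.
```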
Metrics are then available at http://localhost:9090/metrics for Prometheus to scrape.
Available Metrics¶
LLM Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_llm_tokens_input_total | Counter | provider, model | Total input tokens consumed |
| jarviscore_llm_tokens_output_total | Counter | provider, model | Total output tokens generated |
| jarviscore_llm_cost_dollars_total | Counter | provider, model | Total LLM cost in USD |
| jarviscore_llm_request_duration_seconds | Histogram | provider, model | LLM request latency in seconds |
| jarviscore_llm_requests_total | Counter | provider, model, status | Total LLM requests (success/error) |
Workflow & Step Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_workflow_steps_total | Counter | status | Steps processed, by outcome |
| jarviscore_step_execution_duration_seconds | Histogram | status | Step duration in seconds |
| jarviscore_active_workflows | Gauge | — | Currently running workflows |
| jarviscore_active_steps | Gauge | — | Currently executing steps |
Event Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_events_emitted_total | Counter | event_type | Total trace events, by type |
Recording Metrics Manually¶
```python
from jarviscore.telemetry.metrics import record_llm_call, record_step_execution

# After an LLM call completes
record_llm_call(
    provider="anthropic",
    model="claude-opus-4-5",
    input_tokens=1500,
    output_tokens=800,
    cost=0.12,
    duration=2.3,
    success=True,
)

# After a workflow step completes
record_step_execution(duration=8.5, status="completed")
```
Prometheus Configuration¶
Add a scrape target to your prometheus.yml:
```yaml
scrape_configs:
  - job_name: jarviscore
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
```
Grafana Dashboard¶
With the Prometheus scrape active, you can build dashboards around these queries:
```promql
# LLM cost rate over 1 hour
rate(jarviscore_llm_cost_dollars_total[1h])

# P95 step latency
histogram_quantile(0.95, rate(jarviscore_step_execution_duration_seconds_bucket[5m]))

# Error rate per model
sum by (model) (rate(jarviscore_llm_requests_total{status="error"}[5m]))
  / sum by (model) (rate(jarviscore_llm_requests_total[5m]))

# Active workflow count
jarviscore_active_workflows
```
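The token counters support throughput views in the same way; for example, a per-model consumption rate (standard PromQL over the metrics documented above):

```promql
# Input-token consumption rate per model, 5-minute window
sum by (model) (rate(jarviscore_llm_tokens_input_total[5m]))
```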
Exporting to External Stacks¶
Datadog¶
Use the Datadog Agent's OpenMetrics integration to scrape the Prometheus endpoint:

```yaml
# conf.d/openmetrics.d/conf.yaml
instances:
  - openmetrics_endpoint: http://localhost:9090/metrics
    namespace: jarviscore
    metrics:
      - jarviscore_llm_.*
      - jarviscore_workflow_.*
      - jarviscore_active_.*
```
Grafana Cloud / Mimir¶
Use Prometheus's built-in remote_write or the Grafana Agent to forward metrics directly, without self-hosting a full Prometheus stack; a sketch follows.
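A minimal remote_write sketch for prometheus.yml; the endpoint and credentials below are placeholders, so substitute the values shown in your Grafana Cloud stack settings:

```yaml
remote_write:
  - url: https://prometheus-prod-01.grafana.net/api/prom/push  # placeholder endpoint
    basic_auth:
      username: "123456"       # placeholder: your Grafana Cloud instance ID
      password: "<api-token>"  # placeholder: a Grafana Cloud access token
```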
Splunk / ELK¶
The JSONL trace files are the simplest integration point. Point a Filebeat or Splunk Universal Forwarder at the traces/ directory — each line is already structured JSON, with workflow_id, step_id, timestamp, and type as top-level fields for indexing.
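For example, a minimal Filebeat input as a sketch; the path is hypothetical, so point it at whatever trace_dir your agents actually write to:

```yaml
filebeat.inputs:
  - type: filestream
    id: jarviscore-traces
    paths:
      - /var/lib/jarviscore/traces/*.jsonl  # hypothetical path: use your trace_dir
    parsers:
      - ndjson:
          target: ""           # lift the JSON fields to the event root
          add_error_key: true  # flag lines that fail to parse
```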
Running Without Redis¶
JarvisCore observability works in three modes:
| Mode | Configuration | Behaviour |
|---|---|---|
| Full | Redis connected, prometheus-client installed | All three trace channels active; Prometheus metrics collected |
| JSONL-only | No Redis | Traces written to JSONL files only; no real-time stream |
| Local dev | No Redis, no prometheus-client | JSONL traces only; metrics are silent no-ops |
No configuration flag is needed — the system detects what is available at runtime and degrades gracefully.