Observability & Telemetry¶
JarvisCore ships with a built-in, two-layer observability system. Every agent turn, tool call, LLM request, mailbox message, and HITL event is captured automatically — no instrumentation required in your agent code.
The two layers serve different purposes:
| Layer | What it does | Where data goes |
|---|---|---|
| Structured Tracing (TraceManager) | Records what happened — the full execution narrative | Redis List (persistent), Redis PubSub (real-time), JSONL (compliance fallback) |
| Operational Metrics (metrics.py) | Records how it performed — counters, histograms, and gauges | Prometheus (scrape endpoint on port 9090) |
Both layers are non-blocking by design. If Redis is unavailable, tracing falls back to JSONL. If prometheus-client is not installed, metrics become silent no-ops. Neither failure will crash your agent.
Structured Tracing¶
TraceManager¶
TraceManager is the framework's flight data recorder. It is instantiated automatically by the kernel for each workflow step and receives events for the lifetime of that step.
```python
from jarviscore.telemetry import TraceManager

tracer = TraceManager(
    workflow_id="wf-abc123",
    step_id="step-001",
    redis_store=redis_store,  # optional — omit to use JSONL-only mode
    trace_dir="traces",       # directory for JSONL fallback files
)
```
Event Output Channels¶
Every event emitted through TraceManager.log_event() is written to up to three places simultaneously:
1. Redis List (persistent): suitable for replay, post-mortem debugging, and audit log retention. Survives process restarts.
2. Redis PubSub (real-time): subscribe from a dashboard, alerting system, or log aggregator to receive events as they happen.
3. JSONL file (compliance fallback): written even when Redis is unavailable. Each line is a self-contained JSON event — parseable with any standard tooling.

Trace Event Shape¶
Every event has the same envelope:
```json
{
  "workflow_id": "wf-abc123",
  "step_id": "step-001",
  "timestamp": "2026-05-01T17:00:00.000000+00:00",
  "type": "tool_start",
  "data": {
    "tool": "slack_send_message_v1",
    "params": { "channel": "#alerts", "text": "Deploy complete" }
  }
}
```
Event Types¶
All event types are defined in TraceEventType (a typed str enum). The full set:
Workflow Lifecycle¶
| Event | When emitted |
|---|---|
| workflow_start | Workflow begins; includes step_count |
| workflow_complete | Workflow finishes; includes status and summary |
Step Execution¶
| Event | When emitted |
|---|---|
| step_claimed | An agent claims a step from the queue |
| step_complete | Step finishes successfully |
| step_failed | Step terminates with an unrecoverable error |
Kernel Cognition¶
| Event | When emitted |
|---|---|
| thinking | Kernel or subagent logs a reasoning step |
| kernel_delegate | Kernel dispatches work to a subagent |
| subagent_yield | Subagent returns control and requests human input |
Tool Execution¶
| Event | When emitted |
|---|---|
| tool_start | Tool invocation begins; includes tool name and parameters |
| tool_result | Tool invocation completes; includes result preview or error |
LLM Interaction¶
| Event | When emitted |
|---|---|
| llm_request | Outgoing request to an LLM provider; includes provider, model, prompt preview |
| llm_response | LLM response received; includes latency, input tokens, output tokens |
Mailbox¶
| Event | When emitted |
|---|---|
| mailbox_send | Agent sends a message to another agent |
| mailbox_receive | Agent reads messages from its inbox |
HITL¶
| Event | When emitted |
|---|---|
| hitl_task_created | A human review task is created |
| hitl_waiting | Kernel enters wait state for human input |
| hitl_response_received | Human provides a response |
| hitl_resolved | HITL task is closed with an outcome |
Infrastructure¶
| Event | When emitted |
|---|---|
| context_snapshot | Periodic capture of the context store state |
| error_recovery | Automatic recovery action is taken after an error |
Emitting Events Manually¶
The kernel handles all standard events automatically. If you are building a custom subagent or tool, you can emit custom events directly:
```python
# Convenience methods — the recommended approach
tracer.log_tool_start("my_custom_tool", params={"key": "value"})
tracer.log_tool_result("my_custom_tool", result="Done")
tracer.log_thinking("Evaluating whether the output meets the acceptance criteria")

# Raw event — for custom types not covered by convenience methods
tracer.log_event("my_custom_event", data={"detail": "something happened"})
```
Consuming Traces¶
Live stream (Redis PubSub):
```python
import redis, json

r = redis.Redis()
ps = r.pubsub()
ps.subscribe("trace_events:wf-abc123")

for message in ps.listen():
    if message["type"] == "message":
        event = json.loads(message["data"])
        print(event["type"], event["data"])
```
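The persistent Redis List supports replay without a live subscriber. A minimal sketch, assuming the list key mirrors the PubSub channel naming (a hypothetical key format; confirm how your TraceManager deployment names its lists):

```python
import redis, json

r = redis.Redis()

# Hypothetical key format — check your TraceManager configuration.
key = "trace_events:wf-abc123:step-001"

# LRANGE 0 -1 returns every stored event in insertion order.
for raw in r.lrange(key, 0, -1):
    event = json.loads(raw)
    print(event["timestamp"], event["type"])
```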
Replay from JSONL:
```bash
# All events for a workflow
cat traces/wf-abc123_step-001.jsonl | jq .

# Filter for tool events only
cat traces/wf-abc123_step-001.jsonl | jq 'select(.type | startswith("tool_"))'

# Total LLM latency across all workflows
cat traces/*.jsonl | jq 'select(.type == "llm_response") | .data.latency_ms' | awk '{sum+=$1} END {print "Total latency:", sum, "ms"}'
```
Operational Metrics (Prometheus)¶
Installation¶
Prometheus metrics require the prometheus-client package. Without it, all metric calls are silent no-ops — your agent runs normally but no metrics are collected.
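The package installs from PyPI under the same name:

```bash
pip install prometheus-client
```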
Starting the Metrics Server¶
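The sketch below uses prometheus-client's standard start_http_server helper; it assumes JarvisCore's collectors register against the library's default registry, which is an assumption here rather than documented behaviour:

```python
from prometheus_client import start_http_server

# Serve the default registry at http://localhost:9090/metrics.
# Assumption: JarvisCore's metrics.py registers its collectors against
# prometheus-client's default registry.
start_http_server(9090)

# ... start your agents as usual; the exporter runs in a background thread.
```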
Metrics are then available at http://localhost:9090/metrics for Prometheus to scrape.
Available Metrics¶
LLM Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_llm_tokens_input_total | Counter | provider, model | Total input tokens consumed |
| jarviscore_llm_tokens_output_total | Counter | provider, model | Total output tokens generated |
| jarviscore_llm_cost_dollars_total | Counter | provider, model | Total LLM cost in USD |
| jarviscore_llm_request_duration_seconds | Histogram | provider, model | LLM request latency in seconds |
| jarviscore_llm_requests_total | Counter | provider, model, status | Total LLM requests (success/error) |
Workflow & Step Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_workflow_steps_total | Counter | status | Steps processed, by outcome |
| jarviscore_step_execution_duration_seconds | Histogram | status | Step duration in seconds |
| jarviscore_active_workflows | Gauge | — | Currently running workflows |
| jarviscore_active_steps | Gauge | — | Currently executing steps |
Event Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| jarviscore_events_emitted_total | Counter | event_type | Total trace events, by type |
Recording Metrics Manually¶
```python
from jarviscore.telemetry.metrics import record_llm_call, record_step_execution

# After an LLM call completes
record_llm_call(
    provider="anthropic",
    model="claude-opus-4-5",
    input_tokens=1500,
    output_tokens=800,
    cost=0.12,
    duration=2.3,
    success=True,
)

# After a workflow step completes
record_step_execution(duration=8.5, status="completed")
```
Prometheus Configuration¶
Add a scrape target to your prometheus.yml:
```yaml
scrape_configs:
  - job_name: jarviscore
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9090"]
```
Grafana Dashboard¶
With the Prometheus scrape active, you can build dashboards around these queries:
```promql
# LLM cost rate over 1 hour
rate(jarviscore_llm_cost_dollars_total[1h])

# P95 step latency
histogram_quantile(0.95, rate(jarviscore_step_execution_duration_seconds_bucket[5m]))

# Error rate per model
sum by (model) (rate(jarviscore_llm_requests_total{status="error"}[5m]))
  / sum by (model) (rate(jarviscore_llm_requests_total[5m]))

# Active workflow count
jarviscore_active_workflows
```
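The token counters support throughput views in the same way; for example, a per-model consumption rate (standard PromQL over the metrics documented above):

```promql
# Input-token consumption rate per model, 5-minute window
sum by (model) (rate(jarviscore_llm_tokens_input_total[5m]))
```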
Exporting to External Stacks¶
Datadog¶
Use the Datadog Agent's OpenMetrics integration to scrape the Prometheus endpoint:

```yaml
# conf.d/openmetrics.d/conf.yaml
instances:
  - openmetrics_endpoint: http://localhost:9090/metrics
    namespace: jarviscore
    metrics:
      - jarviscore_llm_.*
      - jarviscore_workflow_.*
      - jarviscore_active_.*
```
Grafana Cloud / Mimir¶
Use Prometheus's built-in remote_write or the Grafana Agent to forward metrics directly, without self-hosting a full Prometheus stack; a sketch follows.
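A minimal remote_write sketch for prometheus.yml; the endpoint and credentials below are placeholders, so substitute the values shown in your Grafana Cloud stack settings:

```yaml
remote_write:
  - url: https://prometheus-prod-01.grafana.net/api/prom/push  # placeholder endpoint
    basic_auth:
      username: "123456"       # placeholder: your Grafana Cloud instance ID
      password: "<api-token>"  # placeholder: a Grafana Cloud access token
```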
Splunk / ELK¶
The JSONL trace files are the simplest integration point. Point a Filebeat or Splunk Universal Forwarder at the traces/ directory — each line is already structured JSON, with workflow_id, step_id, timestamp, and type as top-level fields for indexing.
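For example, a minimal Filebeat input as a sketch; the path is hypothetical, so point it at whatever trace_dir your agents actually write to:

```yaml
filebeat.inputs:
  - type: filestream
    id: jarviscore-traces
    paths:
      - /var/lib/jarviscore/traces/*.jsonl  # hypothetical path: use your trace_dir
    parsers:
      - ndjson:
          target: ""           # lift the JSON fields to the event root
          add_error_key: true  # flag lines that fail to parse
```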
Running Without Redis¶
JarvisCore observability works in three modes:
| Mode | Configuration | Behaviour |
|---|---|---|
| Full | Redis connected, prometheus-client installed | All three trace channels active; Prometheus metrics collected |
| JSONL-only | No Redis | Traces written to JSONL files only; no real-time stream |
| Local dev | No Redis, no prometheus-client | JSONL traces only; metrics are silent no-ops |
No configuration flag is needed — the system detects what is available at runtime and degrades gracefully.