Troubleshooting¶

Common issues and solutions for JarvisCore developers — from installation through production mesh deployments.

Quick Diagnostics¶

Run these first before digging into individual issues:

# Check installation, env vars, and LLM connectivity
jarviscore check

# Validate LLM connectivity (makes real API calls)
jarviscore check --validate-llm

# Verbose health check output
jarviscore check --verbose

# End-to-end smoke test
jarviscore smoketest

# Verbose output for debugging
jarviscore smoketest --verbose

Installation¶

`ModuleNotFoundError: No module named 'jarviscore'`¶

pip install jarviscore-framework

# Development install
pip install -e .

`ImportError: cannot import name 'AutoAgent'`¶

Stale cached install. Reinstall:

pip uninstall jarviscore-framework -y
pip install jarviscore-framework

LLM Configuration¶

`No LLM provider configured`¶

Missing API key. Add one of the following to .env:

GEMINI_API_KEY=...
CLAUDE_API_KEY=...       # or ANTHROPIC_API_KEY
AZURE_API_KEY=...        # also requires AZURE_ENDPOINT and AZURE_DEPLOYMENT

Then validate:

jarviscore check --validate-llm
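Before running the validator, it can help to see which provider keys the process can actually see. The following is a minimal sketch, not part of JarvisCore: the variable names come from the list above, while `configured_providers` and the grouping into providers are hypothetical. It inspects the process environment only, so values that live solely in .env must be loaded first.

```python
import os

# Variable names are taken from the list above; the grouping is illustrative.
PROVIDER_KEYS = {
    "gemini": ["GEMINI_API_KEY"],
    "claude": ["CLAUDE_API_KEY", "ANTHROPIC_API_KEY"],
    "azure": ["AZURE_API_KEY"],
}
AZURE_EXTRAS = ["AZURE_ENDPOINT", "AZURE_DEPLOYMENT"]


def configured_providers(env=None):
    """Return {provider: [missing variables]} for every provider with a key set."""
    env = os.environ if env is None else env
    report = {}
    for provider, keys in PROVIDER_KEYS.items():
        if not any(env.get(key) for key in keys):
            continue  # no key for this provider at all
        missing = []
        if provider == "azure":
            # Azure needs an endpoint and a deployment in addition to the key.
            missing = [name for name in AZURE_EXTRAS if not env.get(name)]
        report[provider] = missing
    return report


if __name__ == "__main__":
    found = configured_providers()
    if not found:
        print("No provider key found in the environment")
    for provider, missing in found.items():
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"{provider}: {status}")
```

An empty report corresponds to the "No LLM provider configured" case; a non-empty list of missing variables points at an incomplete Azure setup.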

`Error code: 401 — Unauthorized`¶

Invalid or expired API key. Verify the key value, check expiry, and for Azure confirm AZURE_ENDPOINT and AZURE_DEPLOYMENT are set.

`Error code: 429 — Rate limit exceeded`¶

Wait 60 seconds, then retry. If persistent, upgrade your API plan or switch to a less-loaded model.

`Error code: 529 — Overloaded`¶

Provider temporarily overloaded (common with Claude). The smoke test retries automatically 3 times. Retry manually after a few seconds or add a secondary provider.
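For your own calls, 429 and 529 responses are usually handled with retries and exponential backoff. The sketch below is one way to do that under stated assumptions: `TransientLLMError` is a hypothetical stand-in for whatever exception your provider client raises, and `call_with_backoff` is not a JarvisCore API.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical stand-in for a provider's 429/529 exception."""


def call_with_backoff(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on TransientLLMError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientLLMError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # Jitter keeps several workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the retry logic can be tested without waiting.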

Execution Errors¶

`Task failed: Code execution timed out`¶

Default timeout is controlled by SANDBOX_TIMEOUT. Increase it in .env:

SANDBOX_TIMEOUT=600   # seconds — default is 300

`Sandbox execution failed`¶

The framework auto-repairs up to 3 times. If all attempts fail:

Check traces for the exact error:

ls traces/
python -m json.tool --json-lines traces/<workflow>_<step>.jsonl | grep error

Make the task more explicit — the agent needs to know exactly what to produce:

system_prompt = """
You are a Python expert. Generate clean, working code.
Use only the standard library.
Store the final answer in a variable named `result`.
Handle edge cases explicitly.
"""

Simplify the task first, then add complexity once it runs.

`Maximum repair attempts exceeded`¶

The LLM could not generate working code in 3 tries. Simplify the task or add more detail to the system prompt. Check the trace log to see what errors occurred each attempt.
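To review what went wrong on each attempt, the trace file can also be filtered from Python. This is a rough sketch that assumes only that the trace is JSON Lines (one JSON object per line); the record schema is not documented here, so it matches on the serialized text rather than on a specific field. `iter_error_records` is a hypothetical helper name.

```python
import json
from pathlib import Path


def iter_error_records(trace_path):
    """Yield (line_number, record) for JSONL records that mention an error."""
    with Path(trace_path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partially written or corrupt lines
            # The schema is unknown here, so match on the serialized text.
            if "error" in json.dumps(record).lower():
                yield line_number, record
```

Printing each yielded record in order gives a per-attempt view of the failures.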

Silent success with `execution_time ≈ 0.003s` and `output: null`¶

This combination is a reliable warning sign. Real LLM-driven computation takes 1–30 seconds, so a run that finishes in under 10 ms almost certainly means the generated code crashed before doing any work.

Cause: the agent-generated code raised NameError: name 'context' is not defined, and the sandbox swallowed the exception instead of reporting a failure.

Fix: Confirm autoagent.py passes context to the sandbox:

result = await self.sandbox.execute(code, context=task.get('context'))

If you subclass AutoAgent and override execute_task, pass context=task.get('context') in your sandbox.execute() call.
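Because this failure reports success, it is worth checking for explicitly. The sketch below assumes each step result is a dict with `execution_time` (seconds) and `output` keys, as in the heading above; the dict shape and the `looks_like_silent_crash` helper are assumptions, not a JarvisCore API.

```python
def looks_like_silent_crash(result, threshold_seconds=0.01):
    """Return True when a step 'succeeded' too fast and produced no output."""
    elapsed = result.get("execution_time")
    return (
        elapsed is not None
        and elapsed < threshold_seconds
        and result.get("output") is None
    )
```

Running this over workflow results in a test or a post-run hook turns the silent case into a visible one.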

Workflow Issues¶

`Agent not found: <role>`¶

Role string mismatch between agent definition and workflow step:

class CalculatorAgent(AutoAgent):
    role = "calculator"        # ← this value

results = await mesh.workflow("wf-1", [
    {"agent": "calculator", "task": "..."},  # ← must match exactly
])
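A mismatch is often a casing or whitespace difference that is hard to spot by eye. As a rough pre-flight check, the step list can be compared against the roles you registered. `unknown_roles` is a hypothetical helper, and how you collect the registered roles (for example from your agent classes) is left to you.

```python
def unknown_roles(steps, registered_roles):
    """Return the agent names used in steps that no registered agent declares."""
    known = set(registered_roles)
    return sorted({step["agent"] for step in steps if step["agent"] not in known})


steps = [
    {"agent": "calculator", "task": "..."},
    {"agent": "Calculator", "task": "..."},  # wrong case: will not match
]
print(unknown_roles(steps, ["calculator"]))
```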

`Dependency not satisfied: <step-id>`¶

A step ID listed in depends_on does not exist in the workflow, or that step failed. The key is depends_on, not dependencies:

results = await mesh.workflow("wf-1", [
    {"id": "step1", "agent": "agent1", "task": "..."},
    {"id": "step2", "agent": "agent2", "task": "...",
     "depends_on": ["step1"]},   # ← correct key
])
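Both mistakes can be caught before the workflow is submitted. The following sketch checks a step list for the misspelled key and for references to step IDs that do not exist; `dependency_problems` is a hypothetical helper, and the step format is the one shown above.

```python
def dependency_problems(steps):
    """Return human-readable problems with step IDs and depends_on references."""
    problems = []
    known_ids = {step["id"] for step in steps if "id" in step}
    for index, step in enumerate(steps):
        label = step.get("id", f"step #{index}")
        if "dependencies" in step:
            problems.append(f"{label}: uses 'dependencies'; the key is 'depends_on'")
        for dep in step.get("depends_on", []):
            if dep not in known_ids:
                problems.append(f"{label}: depends on unknown step '{dep}'")
    return problems
```

An empty list means every depends_on entry points at a step defined in the same workflow. It does not tell you whether that step will succeed at run time.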

CustomAgent Issues¶

`self.mailbox is None` / `self._redis_store is None`¶

The Mesh injects infrastructure attributes after __init__ runs, so they are available from setup() onward and are still None in the constructor:

# ❌ Wrong — __init__ runs before injection
class MyAgent(CustomAgent):
    def __init__(self):
        self.memory = UnifiedMemory(redis_store=self._redis_store)  # None!

# ✅ Correct — setup() runs after injection
class MyAgent(CustomAgent):
    async def setup(self):
        await super().setup()
        self.memory = UnifiedMemory(redis_store=self._redis_store)  # injected ✓

Verify injection after mesh.start():

await mesh.start()
for agent in mesh.agents:
    print(f"{agent.role}: redis={agent._redis_store is not None} "
          f"mailbox={agent.mailbox is not None}")

Redis & Memory Issues¶

`ConnectionError: Redis connection refused`¶

Redis is not running, or REDIS_URL is not set.

# Start Redis
docker compose -f docker-compose.infra.yml up -d

# Verify
redis-cli ping   # → PONG

# Check .env
grep REDIS_URL .env

[!NOTE] Without REDIS_URL, the Mesh degrades gracefully — _redis_store and mailbox become None. Workflow execution still works but checkpointing, mailboxes, and distributed coordination are disabled.
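Since a missing Redis only degrades the Mesh instead of stopping it, a startup check can make the difference explicit. This is a minimal sketch using only the standard library: it confirms that the host and port in REDIS_URL accept a TCP connection, which is weaker than a real PING and does not verify credentials. Both function names are hypothetical.

```python
import socket
from urllib.parse import urlparse


def redis_host_port(redis_url):
    """Extract (host, port) from a redis:// URL, defaulting to port 6379."""
    parsed = urlparse(redis_url)
    return parsed.hostname or "localhost", parsed.port or 6379


def redis_reachable(redis_url, timeout=2.0):
    """Return True if a TCP connection to the Redis host and port succeeds."""
    host, port = redis_host_port(redis_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `redis_reachable(os.environ["REDIS_URL"])` before `mesh.start()` distinguishes "Redis is down" from "REDIS_URL was never set".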

`EpisodicLedger.append()` fails / events not in Redis¶

Ensure REDIS_URL is set and Redis is reachable
Confirm UnifiedMemory is initialised in setup() (not __init__)

Check the ledger stream directly:

redis-cli xrange ledgers:your-workflow-id - +

`blob_storage.load()` returns `None`¶

The file was saved under a different STORAGE_BASE_PATH, or a relative path was resolved against another process's working directory.

ls -la blob_storage/
find blob_storage/ -name "*.json" -o -name "*.md" | head -20

Fix: pin STORAGE_BASE_PATH in .env to an absolute path:

STORAGE_BASE_PATH=/app/blob_storage
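To see why a relative value causes this, print where it resolves from each process. The sketch below is illustrative only: `describe_storage_path` is a hypothetical helper, and the `blob_storage` fallback is an assumption based on the directory listed above, not a confirmed default.

```python
from pathlib import Path


def describe_storage_path(value):
    """Return (resolved_path, is_absolute) for a STORAGE_BASE_PATH value."""
    path = Path(value)
    # A relative path resolves against the current working directory,
    # so two processes started in different directories disagree.
    return path.resolve(), path.is_absolute()


if __name__ == "__main__":
    import os

    # The fallback value here is an assumption, not a documented default.
    value = os.environ.get("STORAGE_BASE_PATH", "blob_storage")
    resolved, absolute = describe_storage_path(value)
    print(f"STORAGE_BASE_PATH={value!r} resolves to {resolved} (absolute: {absolute})")
```

If the writer and the reader print different resolved paths, the `None` from `blob_storage.load()` is a path problem rather than a missing file.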

Distributed Mesh Issues¶

Step stuck in `"pending"` forever¶

Causes:

A prior step's step_output:wf:step_id key was never written to Redis
No node has the agent role the step requires
Step was claimed by a crashed node

Diagnose:

redis-cli hgetall "workflow_graph:your-workflow-id"
redis-cli --scan --pattern "step_output:your-workflow-id:*"   # SCAN avoids blocking Redis the way KEYS can
redis-cli smembers "jarviscore:active_workflows"

Reset a stuck step:

redis-cli hset "workflow_graph:wf-id" "step-id:status" "pending"
redis-cli del "claim:wf-id:step-id"

Per-process port conflicts in multi-node setups¶

Each process needs a unique port. A shared .env with a single BIND_PORT cannot work when several nodes run on the same host, because every node would try to bind the same port.

Recommended approach — explicit config dict:

BIND_PORT = 7949   # this script's port — part of its identity
mesh = Mesh(config={"bind_port": BIND_PORT, ...})

Production approach — per-process env var:

JARVISCORE_BIND_PORT=7949 python synthesizer.py
JARVISCORE_BIND_PORT=7946 python research_node1.py

JarvisCore reads JARVISCORE_BIND_PORT (not BIND_PORT) to keep per-process config cleanly separated from shared .env values.
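When a node fails to start, a quick probe shows whether its port is already taken. This sketch checks TCP on one interface only; whether the Mesh transport uses TCP, UDP, or both is not stated here, so treat a positive result as a hint. `port_is_free` is a hypothetical helper.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # another process already holds the port
    return True
```

The result is a snapshot: another process can still claim the port between the check and the node's own bind.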

`self._auth_manager` is `None` despite `requires_auth = True`¶

NEXUS_GATEWAY_URL is not set. The Mesh only injects AuthenticationManager when a gateway URL is configured:

# In .env
NEXUS_GATEWAY_URL=https://your-dromos-gateway.example.com
AUTH_MODE=production

For local development:

AUTH_MODE=mock

Or guard the call in your agent:

if self._auth_manager:
    result = await self._auth_manager.make_authenticated_request(...)
else:
    pass  # graceful degradation

Performance¶

Code generation is slow (> 10 seconds)¶

Switch to a faster model in .env:

# Gemini
GEMINI_MODEL=gemini-2.0-flash

# Claude
CLAUDE_MODEL=claude-haiku-4

# Local vLLM (free, no API cost)
LLM_ENDPOINT=http://localhost:8000
LLM_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct

Also simplify the system prompt — shorter, more specific prompts generate faster.

High API costs¶

Use cheaper models (gemini-2.0-flash, claude-haiku-4)
Run a local vLLM server
Reduce OODA loop turns by making tasks and system prompts more precise

Testing¶

Smoke test fails but agents work in examples¶

The smoke test is stricter than examples. Run with --verbose to see which assertion failed:

jarviscore smoketest --verbose

If a retry passes without any code change, the failure was temporary LLM overload, not a code issue.

All tests pass but my agent fails¶

Test with the simplest possible task first:

task = "Calculate 2 + 2. Store the result in `result`."

Check the trace log:

python -m json.tool --json-lines traces/<workflow>_<step>.jsonl

Add complexity incrementally once the simple case passes.

Debug Mode¶

# .env
LOG_LEVEL=DEBUG

Then tail:

tail -f logs/<latest>.log

Getting Help¶

When opening an issue on GitHub, include:

Python version: python --version
JarvisCore version: pip show jarviscore-framework
LLM provider (Gemini / Claude / Azure / vLLM)
Full error message and relevant log lines
Minimal code to reproduce

Run diagnostics first and paste the output:

jarviscore check --verbose
jarviscore smoketest --verbose
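The version details can also be collected in one step. This is a small sketch using only the standard library; `environment_report` is a hypothetical helper, and the package name is the one used in the install commands above.

```python
import platform
from importlib import metadata


def environment_report(package="jarviscore-framework"):
    """Return Python and package version lines suitable for an issue report."""
    lines = [f"Python: {platform.python_version()}"]
    try:
        lines.append(f"{package}: {metadata.version(package)}")
    except metadata.PackageNotFoundError:
        lines.append(f"{package}: not installed")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```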
