JarvisCore Troubleshooting Guide¶
Common issues and solutions for AutoAgent and CustomAgent users.
Quick Diagnostics¶
Run these commands to diagnose issues:
# Check installation and configuration
python -m jarviscore.cli.check
# Test LLM connectivity
python -m jarviscore.cli.check --validate-llm
# Run end-to-end smoke test
python -m jarviscore.cli.smoketest
# Verbose output for debugging
python -m jarviscore.cli.smoketest --verbose
Common Issues¶
- Installation Problems
- LLM Configuration Issues
- Execution Errors
- Workflow Issues
- CustomAgent Issues
- Environment Issues
- Sandbox Configuration
- Infrastructure & Memory Issues (v0.4.0)
- P2P/Distributed Mode Issues
- Performance Issues
- Testing Issues
1. Installation Problems¶
Issue: ModuleNotFoundError: No module named 'jarviscore'¶
Solution: Install the package into the active Python environment: pip install jarviscore-framework
Issue: ImportError: cannot import name 'AutoAgent'¶
Cause: Old/cached installation
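Solution: Reinstall the package so the stale copy is replaced: pip install --force-reinstall --no-cache-dir jarviscore-framework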
2. LLM Configuration Issues¶
Issue: No LLM provider configured¶
Cause: Missing API key in .env
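Solution:
1. Initialize the project and copy the example config: run python -m jarviscore.cli.scaffold, then cp .env.example .env
2. Add your API key to .env, for example CLAUDE_API_KEY=your-key-here
3. Validate the configuration: python -m jarviscore.cli.check --validate-llm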
Issue: Error code: 401 - Unauthorized¶
Cause: Invalid API key
Solution:
1. Verify your API key is correct
2. Check it hasn't expired
3. For Azure: Ensure AZURE_ENDPOINT and AZURE_DEPLOYMENT are correct
Issue: Error code: 529 - Overloaded¶
Cause: LLM provider temporarily overloaded (Claude, Azure, etc.)
Solution:
- This is temporary - retry after a few seconds
- The smoke test automatically retries 3 times
- Consider adding a backup LLM provider in .env
Issue: Error code: 429 - Rate limit exceeded¶
Cause: Too many requests to LLM API
Solution:
- Wait 60 seconds before retrying
- Check your API plan limits
- Consider upgrading your API plan
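Both the 529 and 429 cases above come down to waiting and retrying. A minimal sketch of retrying with exponential backoff follows; call_with_backoff and its parameters are hypothetical names chosen for illustration and are not part of JarvisCore.

```python
import random
import time


def call_with_backoff(fn, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delays grow 1s, 2s, 4s, ... with a little jitter to avoid bursts.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Wrap only the call that talks to the LLM provider, and narrow retry_on to the exception type your client raises for overload or rate-limit responses so that genuine bugs are not retried.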
3. Execution Errors¶
Issue: Task failed: Code execution timed out¶
Cause: Generated code runs longer than timeout (default: 300s)
Solution:
Increase timeout in .env:
Issue: Sandbox execution failed: <error>¶
Cause: Generated code has errors
What happens:
- Framework automatically attempts repairs (max 3 attempts)
- If repairs fail, the task fails
Solution:
1. Check the logs for details
2. Make the prompt more specific
3. Adjust the system prompt
Issue: Maximum repair attempts exceeded¶
Cause: LLM unable to generate working code after 3 tries
Solution:
1. Simplify your task
2. Be more explicit in the prompt
3. Check the logs to see what errors occurred
4. Workflow Issues¶
Issue: Agent not found: <role>¶
Cause: Agent role mismatch
Solution:
# Agent definition
class CalculatorAgent(AutoAgent):
    role = "calculator"  # <-- This name

# Workflow must match
await mesh.workflow("wf-1", [
    {"agent": "calculator", "task": "..."}  # <-- Must match role
])
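A role mismatch can be caught before the workflow is submitted. The sketch below assumes roles are compared as exact, case-sensitive strings and that each step is a dict with an "agent" key, as in the example above; find_unknown_roles is a hypothetical helper, not a JarvisCore function.

```python
def find_unknown_roles(steps, registered_roles):
    """Return the agent names used in steps that match no registered role."""
    known = set(registered_roles)
    return sorted({step["agent"] for step in steps if step["agent"] not in known})


steps = [
    {"agent": "calculator", "task": "Add 2 and 2"},
    {"agent": "Calculator", "task": "Multiply 3 by 4"},  # wrong case: no match
]
print(find_unknown_roles(steps, ["calculator"]))  # → ['Calculator']
```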
Issue: Dependency not satisfied: <step-id>¶
Cause: Workflow dependency chain broken
Solution:
# Ensure dependencies exist
await mesh.workflow("wf-1", [
{"id": "step1", "agent": "agent1", "task": "..."},
{"id": "step2", "agent": "agent2", "task": "...",
"dependencies": ["step1"]} # step1 must exist
])
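The same kind of pre-flight check works for dependencies. This is a sketch under the assumption that steps are dicts with "id" and an optional "dependencies" list, as shown above; find_missing_dependencies is a hypothetical name.

```python
def find_missing_dependencies(steps):
    """Map each step id to the dependency ids that no step in the list defines."""
    defined = {step["id"] for step in steps}
    missing = {}
    for step in steps:
        unknown = [dep for dep in step.get("dependencies", []) if dep not in defined]
        if unknown:
            missing[step["id"]] = unknown
    return missing


steps = [
    {"id": "step1", "agent": "agent1", "task": "..."},
    {"id": "step2", "agent": "agent2", "task": "...", "dependencies": ["step0"]},
]
print(find_missing_dependencies(steps))  # → {'step2': ['step0']}
```

An empty dict means every listed dependency refers to a step that exists in the same workflow.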
5. CustomAgent Issues¶
Issue: execute_task not called¶
Cause: Wrong mode for your use case
Solution:
# For workflow orchestration (autonomous/distributed modes)
class MyAgent(CustomAgent):
    async def execute_task(self, task):  # Called by workflow engine
        return {"status": "success", "output": ...}

# For P2P mode, use run() instead
class MyAgent(CustomAgent):
    async def run(self):  # Called in P2P mode
        while not self.shutdown_requested:
            msg = await self.peers.receive(timeout=0.5)
            ...
Issue: self.peers is None¶
Cause: Agent not in P2P or distributed mode
Solution:
# Ensure mesh is in p2p or distributed mode
mesh = Mesh(mode="distributed", config={ # or "p2p"
'bind_port': 7950,
'node_name': 'my-node',
})
# Check peers is available before using
if self.peers:
    result = await self.peers.as_tool().execute("ask_peer", {...})
Issue: No response from peer¶
Cause: Target agent not listening or wrong role
Solution:
# Ensure target agent is running its run() loop
# In researcher agent:
async def run(self):
    while not self.shutdown_requested:
        msg = await self.peers.receive(timeout=0.5)
        if msg and msg.is_request:
            await self.peers.respond(msg, {"response": ...})
# When asking, use correct role
result = await self.peers.as_tool().execute(
"ask_peer",
{"role": "researcher", "question": "..."} # Must match agent's role
)
6. Environment Issues¶
Issue: .env file not found¶
Solution:
# Initialize project first (creates .env.example)
python -m jarviscore.cli.scaffold
# Then copy and configure
cp .env.example .env
# Or create manually
cat > .env << 'EOF'
CLAUDE_API_KEY=your-key-here
EOF
Issue: Environment variable not loading¶
Cause: .env file in wrong location
Solution:
Place .env in one of these locations:
- Current working directory: ./.env
- Project root: jarviscore/.env
Or set the environment variable directly in your shell, for example: export CLAUDE_API_KEY=your-key-here
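To see which case applies, a small standalone check can help. The sketch below only looks at the two locations listed above and at the process environment; locate_env_file and is_variable_set are hypothetical helpers, and a value stored in .env shows up in the process environment only after the framework (or your own loader) has read the file.

```python
import os
from pathlib import Path


def locate_env_file(candidates=(Path(".env"), Path("jarviscore") / ".env")):
    """Return the first candidate .env path that exists, or None."""
    for path in candidates:
        if path.is_file():
            return path.resolve()
    return None


def is_variable_set(name):
    """True if the variable is present and non-empty in the process environment."""
    return bool(os.environ.get(name, "").strip())


print("env file:", locate_env_file())
print("CLAUDE_API_KEY set:", is_variable_set("CLAUDE_API_KEY"))
```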
7. Sandbox Configuration¶
Issue: Remote sandbox connection failed¶
Cause: SANDBOX_SERVICE_URL incorrect or service down
Solution:
1. Use the local sandbox (the default)
2. Or verify that SANDBOX_SERVICE_URL points to the correct remote URL
3. Test connectivity to the sandbox service
8. Infrastructure & Memory Issues (v0.4.0)¶
Issue: self._redis_store / self._blob_storage / self.mailbox is None after setup()¶
Cause: Accessing injected attributes in __init__ instead of setup(), or using a
Mesh mode that does not start the full infrastructure.
Solution:
# Wrong: __init__ runs before injection
class MyAgent(CustomAgent):
    def __init__(self):
        self.memory = UnifiedMemory(..., redis_store=self._redis_store)  # None here!

# Correct: setup() runs after injection
class MyAgent(CustomAgent):
    async def setup(self):
        await super().setup()
        self.memory = UnifiedMemory(..., redis_store=self._redis_store)  # injected ✓
Verify injection after mesh.start():
await mesh.start()
for agent in mesh.agents:
    print(f"{agent.role}: redis={agent._redis_store is not None} "
          f"blob={agent._blob_storage is not None} "
          f"mailbox={agent.mailbox is not None}")
Issue: ConnectionError: Redis connection refused / Redis unavailable¶
Cause: Redis is not running, or REDIS_URL is not set / incorrect.
Solution:
# Start Redis (quickest)
docker compose -f docker-compose.infra.yml up -d
# Verify Redis is responding
redis-cli ping # → PONG
# Check REDIS_URL in .env
grep REDIS_URL .env # → REDIS_URL=redis://localhost:6379/0
Required for: mailbox, distributed workflow, and UnifiedMemory.
Without REDIS_URL, these degrade gracefully — _redis_store / mailbox become None.
Issue: Silent task success with execution_time ≈ 0.003s and output: null¶
Cause: Agent-generated function tool raised NameError: name 'context' is not defined.
The sandbox catches the exception silently and returns a fallback result. This happens
when context=task.get('context') is not passed to sandbox.execute().
Diagnostic tell:
Real LLM-driven computation takes 1–30s. Sub-10ms means the code crashed instantly.
Solution (v0.4.0, already fixed): Confirm that jarviscore/profiles/autoagent.py passes context=task.get('context') to sandbox.execute().
If you call the sandbox from your own execute_task, ensure you pass
context=task.get('context') when calling sandbox.execute().
For agent-generated function tools that read prior steps, use the simple access pattern:
# In system_prompt — tell the LLM to use this pattern:
research = context.get('previous_step_results', {}).get('fetch', {})
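The diagnostic tell described above (a "success" that finishes in milliseconds with no output) can be checked automatically. This is a sketch that assumes the step result is a dict with "status", "execution_time" (seconds), and "output" keys; looks_like_silent_failure and the 10 ms threshold are illustrative choices, not framework behavior.

```python
def looks_like_silent_failure(result, min_seconds=0.01):
    """Flag a 'successful' result that finished too fast and produced no output."""
    too_fast = result.get("execution_time", 0.0) < min_seconds
    no_output = result.get("output") is None
    return result.get("status") == "success" and too_fast and no_output


suspicious = {"status": "success", "execution_time": 0.003, "output": None}
healthy = {"status": "success", "execution_time": 4.2, "output": {"total": 42}}
print(looks_like_silent_failure(suspicious))  # → True
print(looks_like_silent_failure(healthy))     # → False
```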
Per-Process Port Configuration — Multi-Node Setup¶
In a multi-node deployment each process needs a unique port. A single BIND_PORT
value in a shared .env file cannot serve four nodes that require four different ports.
The right approach — explicit Mesh config dict (recommended for example scripts):
# Each script declares its own port as an architecture constant
BIND_PORT = 7949 # synthesizer — this is its role; the port is part of its identity
mesh = Mesh(mode="distributed", config={"bind_port": BIND_PORT, ...})
The right approach — per-process env var (recommended for production / containers):
# Set at process launch — not in a shared .env file
JARVISCORE_BIND_PORT=7949 python ex2_synthesizer.py
JARVISCORE_BIND_PORT=7946 python ex2_research_node1.py
JarvisCore reads JARVISCORE_BIND_PORT (not BIND_PORT) to keep per-process port
config cleanly separated from other shared settings in .env.
Port reference for Ex2:
| Script | SWIM port | ZMQ port | Role |
|--------|-----------|----------|------|
| ex2_synthesizer.py | 7949 | 8949 | Seed (no SEED_NODES) |
| ex2_research_node1.py | 7946 | 8946 | TechResearcher |
| ex2_research_node2.py | 7947 | 8947 | MarketResearcher |
| ex2_research_node3.py | 7948 | 8948 | RegResearcher |
What NOT to do — shared .env for per-process settings:
All four processes would read the same value. Use the Mesh config dict or
JARVISCORE_BIND_PORT set per-process instead.
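The two recommended approaches can be combined in an example script: keep the port as a constant, but let a per-process JARVISCORE_BIND_PORT override it. The sketch below is only an illustration of that precedence in your own script; resolve_bind_port is a hypothetical helper and does not describe how JarvisCore itself parses the variable.

```python
import os


def resolve_bind_port(default_port, env=os.environ):
    """Prefer JARVISCORE_BIND_PORT when set; otherwise use the script's own constant."""
    raw = env.get("JARVISCORE_BIND_PORT")
    if raw is None or raw.strip() == "":
        return default_port
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"JARVISCORE_BIND_PORT out of range: {port}")
    return port


BIND_PORT = resolve_bind_port(7949)  # 7949 is the synthesizer's port in the table above
```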
Issue: Distributed step never starts — stuck in "pending" forever¶
Cause: One of: (a) are_dependencies_met() returning False because a prior step
never wrote its status to Redis; (b) no node has the matching agent role; (c) the step
was already claimed by another node.
Diagnose:
# See all step statuses for a workflow
redis-cli hgetall "workflow_graph:your-workflow-id"
# Check what step outputs exist
redis-cli keys "step_output:your-workflow-id:*"
# Check which workflows are active
redis-cli smembers "jarviscore:active_workflows"
Solutions:
1. Ensure the prior step completed: its step_output:wf:step_id key must exist in Redis
2. Confirm the node running the expected agent is alive and has joined the cluster
3. If a step is stuck in "claimed" (crashed mid-run), reset it:
redis-cli hset "workflow_graph:wf-id" "step-id:status" "pending"
redis-cli del "claim:wf-id:step-id"
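When several steps are stuck, typing the key names by hand is error-prone. The following sketch builds the two reset commands shown above from a workflow id and step id; it uses only the key patterns that appear in this guide, and reset_commands is a hypothetical helper rather than a JarvisCore utility.

```python
import shlex


def reset_commands(workflow_id, step_id):
    """Build the two redis-cli commands that return a claimed step to 'pending'."""
    graph_key = f"workflow_graph:{workflow_id}"
    status_field = f"{step_id}:status"
    claim_key = f"claim:{workflow_id}:{step_id}"
    return [
        shlex.join(["redis-cli", "hset", graph_key, status_field, "pending"]),
        shlex.join(["redis-cli", "del", claim_key]),
    ]


for command in reset_commands("wf-1", "fetch"):
    print(command)
```

Review the printed commands before running them, and only reset a step once you are sure the node that claimed it is no longer running.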
Issue: self._auth_manager is None despite requires_auth = True¶
Cause: NEXUS_GATEWAY_URL is not set in .env. The Mesh only injects
AuthenticationManager when a gateway URL is configured.
Solution:
For local development without a Nexus gateway, use mock mode:
Or guard the call in your agent:
if self._auth_manager:
    result = await self._auth_manager.make_authenticated_request(...)
else:
    result = None  # Graceful degradation path: no gateway configured
Issue: EpisodicLedger.append() raises / events not appearing in Redis¶
Cause: Redis unavailable, or UnifiedMemory initialised without a valid redis_store.
Diagnose:
Solution:
1. Ensure REDIS_URL is set and Redis is reachable
2. Confirm UnifiedMemory is initialised in setup() (not __init__):
async def setup(self):
    await super().setup()
    self.memory = UnifiedMemory(
        workflow_id="wf-001", step_id=self.role,
        agent_id=self.role,
        redis_store=self._redis_store,  # must not be None
        blob_storage=self._blob_storage,
    )
3. Check that self._redis_store is not None before initialising UnifiedMemory
Issue: blob_storage.load() returns None for a path that should exist¶
Cause: (a) Path was saved with a different base; (b) STORAGE_BASE_PATH differs
between save and load runs; (c) file was saved to a different process's working directory.
Diagnose:
Solution:
- Use consistent path conventions: {type}/{workflow_id}/{filename}.{ext}
- Pin STORAGE_BASE_PATH in .env rather than relying on the default ./blob_storage
- In CI/Docker, use an absolute path:
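One way to keep save and load paths consistent is to build every key through a single helper and resolve the base directory to an absolute path once. The sketch below assumes the {type}/{workflow_id}/{filename}.{ext} convention and the ./blob_storage default mentioned above; blob_key and storage_root are hypothetical names, not part of the blob storage API.

```python
import os
from pathlib import Path, PurePosixPath


def blob_key(kind, workflow_id, filename, ext):
    """Build a storage key following {type}/{workflow_id}/{filename}.{ext}."""
    return str(PurePosixPath(kind) / workflow_id / f"{filename}.{ext}")


def storage_root(env=os.environ):
    """Resolve STORAGE_BASE_PATH (default ./blob_storage) to an absolute path."""
    return Path(env.get("STORAGE_BASE_PATH", "./blob_storage")).resolve()


print(blob_key("reports", "wf-001", "summary", "json"))  # → reports/wf-001/summary.json
print(storage_root())
```

Printing storage_root() in both the saving and the loading process makes a working-directory mismatch visible immediately.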
9. P2P / Distributed Mode Issues¶
Issue: P2P coordinator failed to start¶
Cause: Port already in use or network issue
Solution:
# Check if port is in use
lsof -i :7950
# Try different port
mesh = Mesh(mode="distributed", config={
'bind_port': 7960, # Different port
})
Issue: Cannot connect to seed nodes¶
Cause: Firewall, wrong address, or seed node not running
Solution:
# Check connectivity
nc -zv 192.168.1.10 7950
# Open firewall ports
sudo ufw allow 7950/tcp
sudo ufw allow 7950/udp
# Ensure seed node is running first
# On seed node:
mesh = Mesh(mode="distributed", config={
'bind_host': '0.0.0.0', # Listen on all interfaces
'bind_port': 7950,
})
Issue: Workflow not available in p2p mode¶
Cause: P2P mode doesn't include workflow engine
Solution:
# Use distributed mode for both workflow + P2P
mesh = Mesh(mode="distributed", config={...})
# Or use p2p mode with run() loops instead
mesh = Mesh(mode="p2p", config={...})
await mesh.start()
await mesh.run_forever() # Agents use run() loops
Issue: Agents not discovering each other¶
Cause: Network configuration or timing
Solution:
import asyncio

# Wait for mesh to stabilize after start
await mesh.start()
await asyncio.sleep(1)  # Give time for peer discovery
# Check if peers are available
agent = mesh.get_agent("my_role")
if agent.peers:
    print("Peers available")
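A fixed one-second sleep can be too short on a slow network and needlessly long on a fast one. As an alternative, the sketch below polls a condition until it holds or a timeout expires; wait_until is a hypothetical helper, and using agent.peers as the readiness condition is an assumption based on the check above.

```python
import asyncio
import time


async def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)


# Hypothetical usage inside your startup coroutine:
#     ready = await wait_until(lambda: bool(agent.peers), timeout=10.0)
#     if not ready:
#         print("Peers still unavailable after 10s; check ports and seed nodes")
```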
10. Performance Issues¶
Issue: Code generation is slow (>10 seconds)¶
Cause: LLM latency or complex prompt
Solutions:
1. Use a faster model
2. Simplify the system prompt:
   - Remove unnecessary instructions
   - Be concise but specific
3. Use a local vLLM deployment
Issue: High LLM API costs¶
Solutions:
1. Use cheaper models (Haiku, Flash)
2. Set up local vLLM (free)
3. Cache common operations
4. Reduce MAX_REPAIR_ATTEMPTS in .env
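For point 3, caching can be as simple as keying results by agent role and task text so that an identical request does not trigger a second LLM call. This is a minimal in-memory sketch under that assumption; ResultCache is a hypothetical class, not a JarvisCore feature, and it is only appropriate for tasks whose answers do not change between calls.

```python
import hashlib
import json


class ResultCache:
    """In-memory cache keyed by a hash of (role, task)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(role, task):
        payload = json.dumps({"role": role, "task": task}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, role, task, compute):
        """Return the cached result, or call compute() once and remember it."""
        key = self._key(role, task)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = compute()
        self._store[key] = result
        return result
```

Here compute stands for whatever callable runs the expensive step, such as a function that submits the task to your agent and returns its output.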
11. Testing Issues¶
Issue: Smoke test fails but examples work¶
Cause: Temporary LLM issues or network
Solution:
- The smoke test is stricter than the examples
- Run with verbose output to see details: python -m jarviscore.cli.smoketest --verbose
- If retrying eventually works, the cause is temporary LLM overload
Issue: All tests pass but my agent fails¶
Cause: Task-specific issue
Solution:
1. Test with a simpler task first
2. Gradually increase complexity
3. Check the agent logs
Debug Mode¶
Enable verbose logging for detailed diagnostics:
Then check logs:
Getting Help¶
If issues persist:
1. Check the logs
2. Run the diagnostics listed under Quick Diagnostics
3. Provide this info when asking for help:
   - Python version: python --version
   - JarvisCore version: pip show jarviscore-framework
   - LLM provider used (Claude/Azure/Gemini)
   - Error message and logs
   - Minimal code to reproduce the issue
4. Create an issue:
   - GitHub: https://github.com/Prescott-Data/jarviscore-framework/issues
   - Include the diagnostics output above
Best Practices to Avoid Issues¶
1. Always validate setup first
2. Use specific prompts:
   - ❌ "Do math"
   - ✅ "Calculate the factorial of 10 and store result in 'result' variable"
3. Start simple, then scale:
   - Test with simple tasks first
   - Add complexity gradually
4. Monitor logs for warnings
5. Keep dependencies updated
6. Use version control for .env:
   - Never commit API keys
   - Use .env.example as template
   - Document required variables
Performance Benchmarks (Expected)¶
Use these as baselines:
| Operation | Expected Time | Notes |
|---|---|---|
| Sandbox execution | 2-5ms | Local code execution |
| Code generation | 2-4s | LLM response time |
| Simple task (e.g., 2+2) | 3-5s | End-to-end |
| Complex task | 5-15s | With potential repairs |
| Multi-step workflow (2 steps) | 7-10s | Sequential execution |
If significantly slower:
1. Check network latency
2. Try a different LLM model
3. Consider local vLLM
4. Check LOG_LEVEL (DEBUG is slower)
Last updated: 2026-02-19
Version¶
Troubleshooting Guide for JarvisCore v1.0.2