Internet Search¶

JarvisCore includes a built-in InternetSearch module used primarily by ResearcherSubAgent. It runs multiple search providers in parallel, deduplicates and ranks results by relevance and source trust, then returns a unified result list your agents can reason over.

No configuration is required to get started — if you have GEMINI_API_KEY set, search is active immediately.

How it works¶

When an agent triggers a search, the module fans out to all configured providers simultaneously, collects results, deduplicates by URL, and ranks them by a weighted trust score. All providers have circuit breakers — a failing provider is automatically bypassed after 8 consecutive failures and retried after 5 minutes.

Agent → InternetSearch.search(query)
           ├── Google Grounded Search  (score weight: 1.4)
           ├── Serper                  (score weight: 1.2)
           ├── SearXNG                 (score weight: 1.1)
           ├── arXiv                   (score weight: 0.9)  ← academic papers
           ├── Crossref                (score weight: 0.8)  ← academic papers
           └── Wikipedia               (score weight: 0.6)
                    ↓
           Deduplicate by URL
           Rank by (source weight + query-term overlap + pdf bonus)
                    ↓
           Return top N results

Providers¶

Google Grounded Search Primary¶

Uses the Gemini API with Google Search grounding enabled. The model queries Google Search in real time and returns structured web results alongside a synthesised summary grounded in those sources.

Required: one of the following —

Variable	Description
`GEMINI_API_KEY`	Standard Gemini API key (starts with `AIza`). Uses the `v1alpha` endpoint.
`GOOGLE_CLOUD_PROJECT` + ADC	Vertex AI path. Uses `v1` endpoint via Application Default Credentials.

Optional:

Variable	Default	Description
`GEMINI_GROUNDING_API_KEY`	falls back to `GEMINI_API_KEY`	Override the key used specifically for grounded search
`GEMINI_GROUNDING_MODEL`	`gemini-2.5-flash`	Which Gemini model to use for grounded search
`GOOGLE_CLOUD_LOCATION`	`global`	GCP location for Vertex AI path

[!NOTE] Google Grounded Search is the highest-quality provider. If you have GEMINI_API_KEY set (which is already required for agents to function), this provider is active with zero additional setup.

Serper Optional¶

Serper calls the Google Search API via serper.dev. It adds a second independent Google Search signal and is useful when you want results without burning Gemini API quota on grounding.

Variable	Default	Description
`SERPER_API_KEY`	—	API key from serper.dev. Provider is skipped if unset.

Serper results score slightly lower than Google Grounded Search in the ranking (1.2 vs 1.4) because they don't carry grounding metadata or synthesised summaries.

SearXNG Optional Self-hosted¶

SearXNG is a self-hosted, privacy-respecting metasearch engine that aggregates results from Google, Bing, and dozens of other engines — without API keys. It is always attempted unless you point SEARXNG_INSTANCE_URL at a non-responsive host, in which case the circuit breaker takes it offline.

Variable	Default	Description
`SEARXNG_INSTANCE_URL`	`http://localhost:8080`	URL of your SearXNG instance

Running SearXNG locally:

docker run -d \
  -p 8080:8080 \
  -e SEARXNG_SETTINGS_SEARCH_FORMATS="json" \
  searxng/searxng

[!IMPORTANT] SearXNG's JSON format must be enabled. Add - json under search.formats in settings.yml, or pass the env var shown above.

If SearXNG is not running, the connection fails silently and the circuit breaker skips it — no error reaches your agent.

arXiv & Crossref No config needed¶

Both are always active. They use free public APIs with no authentication.

arXiv — searches academic preprints via export.arxiv.org. Best for research-heavy queries.
Crossref — searches published academic papers via api.crossref.org. Returns DOI links.

These providers score lower by default (0.9 and 0.8) because they are domain-specific. Use exclude_providers to skip them for non-academic tasks.

Wikipedia No config needed¶

Always active. Uses the Wikipedia search API. Scores lowest (0.6) — useful as a broad fallback but not a primary signal for real-time queries.

Configuration summary¶

# .env — minimum for search to work (also needed for agents)
GEMINI_API_KEY=AIza...

# Optional: add Serper for dual Google-source coverage
SERPER_API_KEY=your-serper-key

# Optional: self-hosted SearXNG
SEARXNG_INSTANCE_URL=http://your-searxng-host:8080

# Optional: override the Gemini model used for grounding
GEMINI_GROUNDING_MODEL=gemini-2.5-flash

# Optional: tune PDF extraction behaviour
RESEARCH_PDF_TIMEOUT_SECONDS=90
RESEARCH_PDF_MAX_RETRIES=3

Excluding providers at call time¶

If your agent's task doesn't benefit from certain sources, exclude them per-call:

results = await search.search(
    query="latest SEC filings for AAPL",
    max_results=10,
    exclude_providers={"arxiv", "crossref", "wikipedia"},  # skip academic sources
)

This prevents wasted network calls — the provider is skipped entirely, not just filtered from results.

Browser escalation (SPAs and paywalled content)¶

When HTTP extraction returns empty content — common for single-page applications and some paywalled sites — the module can escalate to a headless browser. This is an opt-in extra:

pip install "jarviscore[browser]"

Variable	Default	Description
`BROWSER_HEADLESS`	`true`	Run browser in headless mode
`BROWSER_CONTROL_URL`	—	CDP endpoint for remote browser (e.g. Browserless, Playwright server)

Resilience: circuit breakers¶

Every provider has an independent circuit breaker:

Parameter	Default
Failure threshold	8 consecutive failures
Recovery timeout	300 seconds (5 minutes)

Once a provider's breaker opens, it is skipped entirely until the recovery window passes. Healthy providers continue operating normally. You'll see circuit breaker state changes in agent logs at WARNING level.

Which providers should I enable?¶

Scenario	Recommendation
Getting started	Just `GEMINI_API_KEY` — Google Grounded Search alone is sufficient
High-volume research agents	Add `SERPER_API_KEY` to reduce Gemini quota usage
Privacy-sensitive deployments	Add self-hosted SearXNG, exclude `google_grounded` and `serper`
Academic / scientific research	Keep arXiv and Crossref enabled (they are by default)
Real-time financial / news	Google Grounded Search + Serper; exclude arxiv, crossref, wikipedia