2 changes: 1 addition & 1 deletion AGENTS.md
@@ -1,4 +1,4 @@
-# AGENTS.md Dados Financeiros Abertos
+# AGENTS.md: Dados Financeiros Abertos

This file is for coding agents working in this repository. Keep it practical:
follow the project conventions, avoid speculative dependencies, and produce
115 changes: 115 additions & 0 deletions docs/MCP_SURFACE.md
@@ -0,0 +1,115 @@
# MCP surface: curated tools over the REST API

> Status: prototype / design proposal (alpha 0.3.x). Non-breaking: the REST API
> is untouched. Implemented in [`src/findata/api/mcp_app.py`](../src/findata/api/mcp_app.py).

## Problem

The MCP server used to be auto-generated **1:1 from the FastAPI app**:
`FastApiMCP(app)` turns every route into a tool, so the catalog was **95 tools**,
one per dataset/endpoint. From a client/agent's point of view that means:

- **~21k tokens of `tools/list`** loaded at the start of every session, before a
single call.
- **Worse tool selection.** A model chooses less reliably among 95 near-duplicate
names (one tool per SGS series, per CVM fund facet…) than among ~two dozen
well-described tools.

## Approach (A + B + C)

A separate FastAPI app, `mcp_app`, is the **only** source of the tool catalog.
It exposes a small, hand-curated set of tools that dispatch to the same
`findata.sources.*` functions the REST routers already use.

```python
# app.py: tools come from mcp_app; transport is served on the public app
_mcp = FastApiMCP(mcp_app, name=..., description=...)
_mcp.mount_http(router=app) # /mcp on the public app; REST routes untouched
```

`FastApiMCP(mcp_app)` builds the catalog from `mcp_app`'s OpenAPI and executes
each tool via `httpx.ASGITransport(app=mcp_app)`. Because the routers carry no
app-state/rate-limiter coupling, reusing the source functions in a second app is
safe. **The 95 REST routes that back the CLI and HTTP consumers never change.**

- **A, curation.** Each tool has an explicit `operation_id`, an agent-oriented
one-line `summary`, and a docstring written *for an agent deciding whether to
call it*, not the raw route docstring. `response_model=None` + `-> Any` keeps
response schemas out of the catalog (they would re-inflate it).
- **B, consolidation.** Sprawling clusters of endpoints collapse behind a
`dataset`/`kind` selector (see the consolidation map). The work moves from
"many thin tools" to "few tools with good docs".
- **C, code mode.** One optional tool, `findata_run_code`, runs a Python
snippet against the `findata` library in an isolated child interpreter. It
replaces dozens of fine-grained calls for filter/join/aggregate flows that
would otherwise stream every intermediate result through the model's context.
**Gated off by default** (`FINDATA_MCP_CODE_MODE=1` to enable).

## Result

| | 1:1 (old) | curated (new) |
|---|---:|---:|
| MCP tools | 95 | **24** (25 with code mode) |
| `tools/list` size | ~85k chars (~21k tok) | **~29k chars (~7k tok)** |
| REST operations | 95 | **95 (unchanged)** |
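
The character and token figures in the table line up with the common rule of thumb of roughly 4 characters per token. That ratio is an assumption used here for a quick sanity check, not a figure stated by the PR or produced by a real tokenizer:

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(chars: int) -> int:
    """Approximate a token count from a character count."""
    return round(chars / CHARS_PER_TOKEN)


old_chars, new_chars = 85_000, 29_000  # `tools/list` sizes from the table

old_tokens = estimate_tokens(old_chars)
new_tokens = estimate_tokens(new_chars)
reduction = 1 - new_chars / old_chars  # fraction of the catalog removed

print(f"old ~{old_tokens} tok, new ~{new_tokens} tok, {reduction:.0%} smaller")
# prints: old ~21250 tok, new ~7250 tok, 66% smaller
```

Under that assumption the curated catalog costs about a third of the original per session, which matches the ~21k and ~7k token figures above.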

## The 24 curated tools

```
registry_lookup ← start here: CNPJ / ticker / code / name → entities

bcb_series bcb_ptax bcb_focus (BCB: 12 → 3)
cvm_company cvm_financials cvm_fund cvm_structured_fund (CVM: 22 → 4)
b3_quote b3_cotahist b3_index (B3: 9 → 3)
tesouro_bonds tesouro_siconfi (Tesouro: 6 → 2)
ibge_indicator ibge_ipca_breakdown (IBGE: 4 → 2)
ipea_series ipea_search (IPEA: 4 → 2)
anbima (ANBIMA: 3 → 1)
openfinance_directory (Open Finance: 15 → 1)
basedosdados_search basedosdados_sql (BdD: 7 → 2)
receita_arrecadacao aneel_leiloes susep_empresas
findata_run_code (code mode, opt-in)
```
Comment on lines +58 to +72


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a language tag to this fenced block.

This trips MD040 and makes the snippet less tooling-friendly. `text` would be enough here.

Suggested patch
-```
+```text
 registry_lookup          ← start here: CNPJ / ticker / code / name → entities
@@
 findata_run_code                                         (code mode, opt-in)
-```
+```

As per coding guidelines, `**/*.md`: Keep repository-facing Markdown disciplined and functional.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
registry_lookup ← start here: CNPJ / ticker / code / name → entities
bcb_series bcb_ptax bcb_focus (BCB: 12 → 3)
cvm_company cvm_financials cvm_fund cvm_structured_fund (CVM: 22 → 4)
b3_quote b3_cotahist b3_index (B3: 9 → 3)
tesouro_bonds tesouro_siconfi (Tesouro: 6 → 2)
ibge_indicator ibge_ipca_breakdown (IBGE: 4 → 2)
ipea_series ipea_search (IPEA: 4 → 2)
anbima (ANBIMA: 3 → 1)
openfinance_directory (Open Finance: 15 → 1)
basedosdados_search basedosdados_sql (BdD: 7 → 2)
receita_arrecadacao aneel_leiloes susep_empresas
findata_run_code (code mode, opt-in)
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/MCP_SURFACE.md` around lines 58 - 72, The fenced block in the MCP
surface document is missing a language tag, which triggers MD040. Update the
fence around the registry and source list to use a plain text tag so the snippet
remains tooling-friendly, and keep the surrounding content unchanged.

Sources: Coding guidelines, Linters/SAST tools


### Consolidation map

| Tool | Folds in | Selector |
|---|---|---|
| `bcb_series` | `/series`, `/series/code/{code}`, `/series/name/{name}` | `code` / `name` / none=catalog |
| `bcb_ptax` | `/ptax/usd`, `/ptax/usd/period`, `/ptax/{currency}` | `start`+`end` → period |
| `bcb_focus` | `/focus/{indicators,annual,monthly,selic,top5}` | `horizon`, `panel`, `indicator` |
| `cvm_company` | companies search/list, `fca/*`, `ipe` | `dataset=search\|list\|fca_*\|filings` |
| `cvm_fund` | `funds`, `funds/{daily,holdings,lamina,profile,periods}`, returns | `dataset` |
| `cvm_structured_fund` | `funds/{fii,fidc,fip}/*` | `kind` + `dataset` |
| `b3_index` | index portfolio + monthly + list | `dataset`, omit `symbol` to list |
| `tesouro_bonds` | bonds list/search/history | `dataset` |
| `tesouro_siconfi` | `rreo`, `rgf`, `entes` | `report` |
| `openfinance_directory` | participants/endpoints/resources/roles | `dataset` |
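
As a rough illustration of the first row, the `code` / `name` / none=catalog selector could be dispatched as below. This is a sketch only: the three helper functions are hypothetical stand-ins for whatever `findata.sources.*` functions the real tool calls, and their return values are placeholders.

```python
from typing import Any, Optional


# Hypothetical stand-ins for the real source functions (names and return
# shapes are assumptions, not taken from the repository).
def list_series_catalog() -> list[dict[str, Any]]:
    return [{"code": 1, "name": "example"}]


def get_series_by_code(code: int) -> dict[str, Any]:
    return {"code": code, "points": []}


def get_series_by_name(name: str) -> dict[str, Any]:
    return {"name": name, "points": []}


def bcb_series(code: Optional[int] = None, name: Optional[str] = None) -> Any:
    """One tool standing in for three routes: by code, by name, or the catalog."""
    if code is not None and name is not None:
        # Ambiguous selector: reject instead of silently picking one.
        raise ValueError("pass either `code` or `name`, not both")
    if code is not None:
        return get_series_by_code(code)
    if name is not None:
        return get_series_by_name(name)
    return list_series_catalog()  # no selector: return the catalog
```

Calling `bcb_series()` with no arguments returns the catalog, so an agent can discover valid codes before asking for a specific series.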

## Tradeoffs

- **Fewer but "fatter" tools.** Each carries a `dataset` enum and more
documentation. The whole bet is that good descriptions beat tool count, so the
docstrings are the deliverable, not an afterthought.
- **Consolidation can hide endpoint-specific params behind an enum.** Mitigated
by documenting each `dataset`/`kind` value and validating bad combinations with
a `400` (e.g. `cvm_fund dataset=holdings` requires `cnpj`+`month`), matching the
REST API's `ValueError → 400` behaviour.
- **Discoverability of rare endpoints.** A handful of niche REST routes are not
individually surfaced as tools. They remain fully reachable over REST and via
`findata_run_code`.
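
The validation described in the second bullet can be sketched as a per-`dataset` table of required parameters. Only the `holdings` rule (`cnpj` + `month`) comes from the text above; the other entries and both function names are illustrative assumptions.

```python
from typing import Optional

# Required parameters per `dataset` value. Only `holdings` is documented
# above; `daily` and `list` are assumptions added for illustration.
REQUIRED_PARAMS: dict[str, tuple[str, ...]] = {
    "holdings": ("cnpj", "month"),
    "daily": ("cnpj",),
    "list": (),
}


def validate_cvm_fund(dataset: str, **params: Optional[str]) -> None:
    """Raise ValueError when a dataset is unknown or misses required params."""
    if dataset not in REQUIRED_PARAMS:
        allowed = ", ".join(sorted(REQUIRED_PARAMS))
        raise ValueError(f"unknown dataset {dataset!r}; expected one of: {allowed}")
    missing = [p for p in REQUIRED_PARAMS[dataset] if not params.get(p)]
    if missing:
        raise ValueError(f"dataset={dataset} requires {' + '.join(missing)}")


def to_http_error(exc: ValueError) -> tuple[int, dict[str, str]]:
    """Mirror the REST API's ValueError -> 400 mapping as (status, body)."""
    return 400, {"detail": str(exc)}
```

Keeping the rules in one table means the docstring for each `dataset` value and the runtime check can be reviewed side by side.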

## Code mode: security

`findata_run_code` is a **prototype, not a hardened sandbox**. The snippet runs
in a child `python -I` (isolated mode, cwd in a tempdir) with a wall-clock
timeout and a 20k-char output cap, but it has full library and network access.
It is **disabled unless `FINDATA_MCP_CODE_MODE=1`** and is intended for trusted,
local/agent use. A production deployment should run it in a real sandbox
(container/seccomp/network egress controls) before enabling.
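
The mechanics described above can be sketched with the standard library alone. This is not the code in `mcp_app.py`: the function name and the default timeout are assumptions, and the sketch provides no more isolation than the prototype does.

```python
import subprocess
import sys
import tempfile

MAX_OUTPUT_CHARS = 20_000  # output cap stated above


def run_snippet(code: str, timeout_s: float = 10.0) -> str:
    """Run `code` in a child `python -I` with a tempdir cwd; return capped stdout.

    The default timeout is an assumption; the document does not state a value.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # wall-clock limit; the child is killed on expiry
            )
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout_s}s"
    return proc.stdout[:MAX_OUTPUT_CHARS]
```

Note what this does not do: the child can still open sockets, read any file the parent user can read, and import every installed library, which is why the text above calls for a real sandbox before production use.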

## Example flows (verified through the curated MCP)

- `registry_lookup(q="PETR4")` → PETROBRAS, CNPJ `33.000.167/0001-01`, `[PETR3, PETR4]` (offline).
- `bcb_ptax(start=2024-01-02, end=2024-01-05)` → daily PTAX USD series (the handoff's headline flow).
- `findata_run_code("import findata; ...")` → runs in the isolated child interpreter (not a hardened sandbox), returns captured stdout.
Comment thread
coderabbitai[bot] marked this conversation as resolved.
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -134,6 +134,9 @@ max-statements = 50
]
# FastAPI idiom: Query() / Depends() calls in argument defaults.
"src/findata/api/routers/**" = ["B008", "PLR0913"]
+# Curated MCP layer: FastAPI Query() defaults (B008), wide consolidated tools
+# (PLR0913), and intentional flat dataset-dispatch switches (C901/PLR0912/PLR0911).
+"src/findata/api/mcp_app.py" = ["B008", "PLR0913", "C901", "PLR0912", "PLR0911"]
# CLI commands are naturally wide (many typer.Option flags).
"src/findata/cli.py" = ["PLR0913"]
# Banner uses rich + sys.stdout directly — not a print-statement debug.
12 changes: 10 additions & 2 deletions src/findata/api/app.py
@@ -148,15 +148,23 @@ async def _value_error_handler(_: Request, exc: ValueError) -> JSONResponse:
 try:
     from fastapi_mcp import FastApiMCP

+    from findata.api.mcp_app import mcp_app
+
+    # The MCP tool catalog is built from the *curated* `mcp_app` (a separate
+    # FastAPI app, ~24 well-described tools), not from the public `app`, which
+    # would expose one near-duplicate tool per REST route (~95) and bloat every
+    # agent's context. `mount_http(router=app)` serves the /mcp transport on the
+    # public app, while the tools are generated from and executed against
+    # `mcp_app` (via its ASGI transport). The 95 REST routes stay untouched.
     _mcp = FastApiMCP(
-        app,
+        mcp_app,
         name=_PROJECT_SLUG,
         description=(
             f"{_PROJECT_STATEMENT} MCP para BCB, CVM, B3, IBGE, IPEA, "
             "Tesouro, Base dos Dados, Open Finance e gráficos experimentais."
         ),
     )
-    _mcp.mount_http()  # Serves MCP at /mcp (fastapi-mcp >=0.4)
+    _mcp.mount_http(router=app)  # Serves MCP at /mcp (fastapi-mcp >=0.4)
Comment on lines +151 to +167


🔒 Security & Privacy | 🔴 Critical | 🏗️ Heavy lift

The public /mcp mount becomes unauthenticated RCE when code mode is enabled.

Because this transport is mounted on the public app, enabling `findata_run_code` in `src/findata/api/mcp_app.py` exposes arbitrary `payload.code` execution to any MCP client that can reach `/mcp`. `FINDATA_MCP_CODE_MODE` is only a feature flag; it is not an access-control boundary.

Please keep code mode off the public transport entirely, or put it behind an explicit auth/private-admin mount before merge.
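
A minimal sketch of the second option, assuming a shared admin token is an acceptable control. `FINDATA_MCP_ADMIN_TOKEN` and `may_run_code` are hypothetical names that do not exist in this PR; only `FINDATA_MCP_CODE_MODE` comes from the change under review.

```python
import hmac
from typing import Mapping, Optional

CODE_MODE_FLAG = "FINDATA_MCP_CODE_MODE"     # feature flag from the PR
ADMIN_TOKEN_VAR = "FINDATA_MCP_ADMIN_TOKEN"  # hypothetical, not in the PR


def may_run_code(env: Mapping[str, str], presented_token: Optional[str]) -> bool:
    """Allow code mode only when the flag is on AND the caller proves identity."""
    if env.get(CODE_MODE_FLAG) != "1":
        return False  # feature disabled
    expected = env.get(ADMIN_TOKEN_VAR, "")
    if not expected or not presented_token:
        return False  # fail closed: no configured secret or no credential
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode(), presented_token.encode())
```

The point of the sketch is the separation of concerns: the flag decides whether the feature exists at all, and a distinct credential check decides who may call it.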

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/findata/api/app.py` around lines 151 - 167, The public /mcp transport
mounted via FastApiMCP.mount_http on app must not expose code execution when
findata_run_code is enabled in mcp_app. Move code mode off the public transport
entirely or gate it behind a separate authenticated/private-admin mount, and
ensure the FastApiMCP setup for mcp_app does not allow arbitrary payload.code
execution through the public app.

     _MCP_ENABLED = True
 except Exception:  # optional subsystem must never break core API
     _MCP_ENABLED = False