fix(cvm): stop silently dropping BLC_2 holdings (stray quotes in CDA)#24
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Review limit reached
More reviews will be available in 36 minutes and 54 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request fixes a bug where cvm holdings silently dropped BLC_2 (cotas de fundos) data because stray double-quotes in free-text names caused the default CSV parser to swallow delimiters and drop rows. This is resolved by passing quoting=csv.QUOTE_NONE to csv.DictReader. Feedback on the changes suggests using io.TextIOWrapper instead of reading the entire file into memory with f.read().decode(...) to maintain the streaming design and avoid memory spikes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| reader = csv.DictReader( | ||
| io.StringIO(f.read().decode("iso-8859-1")), | ||
| delimiter=";", | ||
| quoting=csv.QUOTE_NONE, | ||
| ) |
There was a problem hiding this comment.
The module docstring states that we line-stream every CSV inside the ZIP to avoid burning RAM and CPU. However, f.read().decode("iso-8859-1") reads and decodes the entire uncompressed CSV file into memory at once before wrapping it in io.StringIO.
For large CSV files (which can be tens of megabytes), this defeats the streaming design and can cause significant memory spikes.
We can achieve true line-streaming by wrapping the binary stream f in io.TextIOWrapper instead of reading the whole file into memory.
| reader = csv.DictReader( | |
| io.StringIO(f.read().decode("iso-8859-1")), | |
| delimiter=";", | |
| quoting=csv.QUOTE_NONE, | |
| ) | |
| reader = csv.DictReader( | |
| io.TextIOWrapper(f, encoding="iso-8859-1"), | |
| delimiter=";", | |
| quoting=csv.QUOTE_NONE, | |
| ) |
CDA free-text fund names (especially BLC_2, cotas de fundos) contain stray double-quotes. The default csv dialect treats `"` as a quote char and swallows delimiters/newlines across rows, so whole blocks parse to garbage and match nothing. The CLI then printed a partial table that looked complete — for a fund-of-funds (FIC) this hid ~99.8% of the portfolio (e.g. Verde FIC's R$ 850.657.058,68 Verde Master position vanished). Pass `quoting=csv.QUOTE_NONE` so every `"` is kept as a literal char. The field-size limit bumped at import already guards against the resulting long fields. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bf13e8b to
8514d27
Compare
CDA free-text fund names (especially BLC_2, cotas de fundos) contain stray double-quotes. The default csv dialect treats `"` as a quote char and swallows delimiters/newlines across rows, so whole blocks parse to garbage and match nothing. The CLI then printed a partial table that looked complete — for a fund-of-funds (FIC) this hid ~99.8% of the portfolio (e.g. Verde FIC's R$ 850.657.058,68 Verde Master position vanished). Pass `quoting=csv.QUOTE_NONE` so every `"` is kept as a literal char. The field-size limit bumped at import already guards against the resulting long fields. Co-authored-by: Roberto <robertoecf@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Bug (silent data loss)
findata cvm holdingssilently dropped the entire BLC_2 block (cotas de fundos) for fund-of-funds (FIC) portfolios. The CLI printed a partial table that looked complete, hiding the bulk of the portfolio with no warning or error.Concrete case — Verde FIC (
22.187.946/0001-41), ref 2025-12: the output omitted its Verde Master position of R$ 850.657.058,68 (~99.8% of PL). Only BLC_1 + BLC_8 showed.Root cause
CDA free-text fund names contain stray double-quotes. Python's default
csvdialect treats"as a quote char: when one opens a field, the reader swallows delimiters and newlines until it finds the next", merging several rows into one. The merged row's columns shift,CNPJ_FUNDO_CLASSEholds garbage, nothing matches — the whole block contributes zero rows.Reproduced deterministically: a 3-row block with one stray opening quote collapses to 1 garbage row under the default dialect (2 rows silently lost); with
QUOTE_NONEall 3 parse correctly.Fix
Pass
quoting=csv.QUOTE_NONEto theDictReaderso every"is kept as a literal character (matching the known-good pandas referencesep=";", encoding="latin1", quoting=csv.QUOTE_NONE). The field-size limit already bumped at import covers the long free-text fields.Verification
test_holdings_blc2_stray_quote_not_dropped: a BLC_2 fixture whose fund name has a stray quote — both positions survive, the stray quote is preserved literally (fails onmain, passes here).ruff format/check,mypy src/findata, fullpytestsuite all green.Note
Companion PR fixes an independent defect in the same reader (bare-digit CNPJ matching). The two touch the same
CHANGELOG.md/test anchor, so whichever merges second will hit a trivial both-added conflict — happy to rebase this one once the other lands.