56 changes: 56 additions & 0 deletions .github/workflows/etl-tests.yml
@@ -0,0 +1,56 @@
name: ETL Pipeline Tests

on:
  push:
    branches: [main, codex/etl-standardization]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install ETL dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pandas openpyxl requests pytest anywidget plotly

      - name: Run ETL core tests
        run: pytest tests/etl/test_core_etl.py -v

      - name: Run schema compatibility tests
        run: pytest tests/etl/test_full_compat_matrix.py -v -s

      - name: Test CLI sweep
        run: python tests/run_etl.py --sweep

      - name: Verify all 7 sources load
        run: |
          python -c "
          from www.services.etl import convert2df
          sources = [
              ('SCOPUS', 'sources/Scopus/Scopus.csv'),
              ('DIMENSIONS', 'sources/Dimensions/Dimensions.xlsx'),
              ('PUBMED_FILE', 'sources/PubMed/pubmed-allergicrh-set.txt'),
              ('COCHRANE', 'sources/Cochrane/citation-export.txt'),
              ('LENS', 'sources/Lens/Lens.csv'),
          ]
          for src, path in sources:
              df = convert2df(src, input_path=path)
              assert len(df) > 0, f'{src} produced empty DataFrame'
              assert len(df.columns) == 24, f'{src} schema mismatch'
              print(f' OK {src}: {len(df)} records')
          "
13 changes: 10 additions & 3 deletions .gitignore
@@ -1,4 +1,11 @@
__pycache__/
bibliovenv/
Bibenv/
.idea/
.idea/
.venv/
.venv312/
**/.DS_Store
.pytest_cache/
.ipynb_checkpoints/
out/

# Local Claude preview/launch config (not part of the project)
.claude/
184 changes: 184 additions & 0 deletions DEBUGGING_WALKTHROUGH.md
@@ -0,0 +1,184 @@
# Debugging Walkthrough — Making the Analytical Functions Source-Agnostic

The analytical functions in `functions/` and `www/services/` were written
assuming **Web of Science** data: that every column is present, that
multi-value fields are always lists, and that the input DataFrame behaves like
a Shiny reactive value. When fed standardized Scopus / Dimensions / PubMed /
Lens / Cochrane data, several of them crashed.

This document shows the **method** used to fix them. Every patch followed the
same four steps:

> **Symptom → Diagnose (root cause) → Patch (source-agnostic) → Verify.**

A full list of all patches is in `PROJECT_REPORT.md` §8. Four representative
examples are walked through below.

---

## Example 1 — "Main Information" crashes: `'float' object is not iterable`

**Symptom.** Load a dataset, open the **Main Information** panel → the whole
panel shows `Error: 'float' object is not iterable`. Every country-based panel
(Countries Production, Corresponding Authors, Cited Countries) fails the same
way.

**Diagnose.** Read the traceback to the deepest application frame:

```
functions/get_maininformations.py:102 -> metaTagExtraction(df, "AU_CO")
www/services/metatagextraction.py:137 -> for c1 in C1.iloc[i]:
TypeError: 'float' object is not iterable
```

The function iterates over each record's affiliation list, `C1.iloc[i]`. For
records with no affiliation, that value is a `NaN` **float** rather than a
list, so the `for` loop raises the `TypeError` above. This is a WoS-only
assumption: *"the affiliation field is always a populated list."*

**Patch.** Treat any non-list affiliation as empty, and only parse string
entries (`www/services/metatagextraction.py`):

```python
c1_value = C1.iloc[i]
if not isinstance(c1_value, (list, tuple)):
    c1_value = []  # NaN / missing -> no affiliations
for c1 in c1_value:
    if isinstance(c1, str) and pd.notna(c1):
        ...
```

**Verify.** Restart the dashboard → open Main Information → it renders the
dataset summary (Timespan 1985–2020, 281 Sources, 898 Documents, 14.05%
growth). 0 errors. (Before/after screenshots captured.)
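
The guard can be exercised in isolation, without the dashboard. The following is a minimal sketch, not the project's code: `iter_affiliations` and the sample records are hypothetical names and values chosen for illustration, and only the standard library is used.

```python
import math


def iter_affiliations(c1_value):
    """Yield only the string entries of one record's affiliation field.

    Hypothetical helper mirroring the guard in the patch above.
    """
    # Anything that is not a list/tuple (NaN, None, a bare float)
    # is treated as "no affiliations".
    if not isinstance(c1_value, (list, tuple)):
        c1_value = []
    for c1 in c1_value:
        # Skip non-string entries such as a NaN inside the list.
        if isinstance(c1, str):
            yield c1


# Illustrative records: a normal one, a missing one, and a partly missing one.
records = [
    ["Univ A, Italy", "Univ B, France"],
    math.nan,
    ["Univ C, Spain", math.nan],
]

parsed = [list(iter_affiliations(r)) for r in records]
print(parsed)
```

Iterating `math.nan` directly would raise the same `TypeError` as in the traceback; with the guard, the missing record simply contributes an empty list.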

---

## Example 2 — Historiograph crashes only *after* opening the Thematic Map

**Symptom.** Each panel works alone, but in the dashboard, opening **Thematic
Map** and then **Historiograph** crashes with:

```
ValueError: 'SR' is both an index level and a column label, which is ambiguous
```

Run on its own, `histNetwork` is fine — so the bug depends on **execution
order**.

**Diagnose.** In the dashboard every module reads the *same* reactive object,
`df.get()`. Instrumenting the shared DataFrame after each panel shows the
mutation point:

```
get_thematic_map(df) -> df.index.name changes from None to "SR"
```

The Thematic Map path calls `cocMatrix`, which does `M.index = M["SR"]`. Because
`M` was the caller's DataFrame **by reference** (no copy), this left the shared
frame with an index named `SR` *and* a column named `SR`. The next module
(Historiograph) then hit the ambiguity. This affects **all** databases,
including WoS.

**Patch.** Make `cocMatrix` a pure function — copy at entry so it can't corrupt
its caller (`www/services/cocmatrix.py`):

```python
# was: M = df if isinstance(df, pd.DataFrame) else df.get()
M = (df if isinstance(df, pd.DataFrame) else df.get()).copy()
```

**Verify.** Thematic Map → Historiograph in sequence: no error, the shared
`df.index.name` stays `None`. 65/65 tests pass.
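
The order dependence can be reproduced outside the dashboard with a tiny DataFrame. This is a sketch under assumptions: `coc_index_in_place` and `coc_index_pure` are hypothetical stand-ins for the unpatched and patched entry of `cocMatrix`, reduced to the one line that matters, and the sample data is invented.

```python
import pandas as pd


def coc_index_in_place(df):
    """Stand-in for the unpatched pattern: works on the caller's object."""
    M = df              # same object, no copy
    M.index = M["SR"]   # the caller's frame now has an index named "SR"
    return M


def coc_index_pure(df):
    """Stand-in for the patched pattern: works on a private copy."""
    M = df.copy()
    M.index = M["SR"]   # only the copy is re-indexed
    return M


# Illustrative shared frame, standing in for the reactive `df.get()` value.
shared = pd.DataFrame({"SR": ["A 2020", "B 2021"], "TC": [3, 5]})

coc_index_pure(shared)
name_after_pure = shared.index.name        # still None

coc_index_in_place(shared)
name_after_in_place = shared.index.name    # now "SR"

print(name_after_pure, name_after_in_place)
```

Recording `df.index.name` before and after each call is the same instrumentation used in the diagnosis, applied to a frame small enough to inspect by eye.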

---

## Example 3 — Dimensions standardizes to 100% empty rows

**Symptom.** The schema tests pass for Dimensions, but the standardized output
is *empty* — every column blank, `PY = 0`, `AU = []` for all 500 rows. The
tests only checked the schema/types, so they were green on meaningless data.

**Diagnose.** Compare the raw header the extractor reads against the mapping
keys:

```python
pd.read_excel("Dimensions.xlsx").columns # -> ['"About the data: ...', 'Unnamed: 1', ...]
pd.read_excel("Dimensions.xlsx", skiprows=1) # -> ['Rank', 'Publication ID', 'Title', 'Authors', ...]
```

Dimensions exports prepend a one-line copyright banner, so the **real header is
on row 2**. The extractor read row 1 as the header → none of the mapping keys
matched → everything mapped to empty. Then, even after fixing that, `PY` stayed
empty because the mapping used `"Publication Year"` while the actual column is
`"PubYear"`.

**Patch.** Two targeted fixes:

```python
# extractor: skip the banner row (with a fallback if absent)
df = pd.read_excel(self.input_path, skiprows=1)

# mapping: add the real year column name
"PubYear": "PY",
```
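
The snippet above does not show the fallback mentioned in its comment. One possible shape for it, as a sketch only: scan the first few raw rows for the expected column names and skip everything before the match. `find_header_row`, its `max_scan` parameter, and the sample rows are hypothetical; the column names `Title` and `Authors` are taken from the header shown in the diagnosis.

```python
def find_header_row(rows, expected, max_scan=5):
    """Return the index of the first row that contains every expected
    column name, or 0 if none of the scanned rows does.

    Hypothetical helper: `rows` is a list of raw rows (lists of cells).
    """
    for i, row in enumerate(rows[:max_scan]):
        cells = {str(c).strip() for c in row if c is not None}
        if expected <= cells:
            return i
    return 0  # fallback: assume the first row is the header


expected = {"Title", "Authors"}

# Export with a banner line above the real header (banner text is a placeholder).
with_banner = [
    ["About the data: (banner text)", None, None, None],
    ["Rank", "Publication ID", "Title", "Authors"],
    [1, "pub.1", "Some title", "Some author"],
]

# Export where the banner is absent.
without_banner = [
    ["Rank", "Publication ID", "Title", "Authors"],
    [1, "pub.1", "Some title", "Some author"],
]

skip_with = find_header_row(with_banner, expected)
skip_without = find_header_row(without_banner, expected)
print(skip_with, skip_without)
```

The returned index could then be passed as `skiprows` to `pd.read_excel`, so the same extractor handles exports with and without the banner.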

**Verify.**

```
Dimensions: 500 rows | PY 100% populated | SR: "Sohda Makoto, 2022, Surgery Today"
```

Lesson: a passing schema test is not the same as correct data — validate that
fields are actually **populated**, not just present.
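
A population check of this kind can be sketched in a few lines. The helpers below are hypothetical and use only the standard library. Treating `0` as empty is an assumption that suits year-like fields such as `PY` (where `0` is the fallback value) and would be wrong for a field like a citation count.

```python
import math


def is_populated(value):
    """True when a standardized cell carries real content (hypothetical helper)."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if isinstance(value, (list, tuple, str)):
        return len(value) > 0      # [] and "" count as empty
    if isinstance(value, (int, float)):
        return value != 0          # assumption: 0 is the "missing" fallback
    return True


def populated_fraction(values):
    """Share of cells in a column that are actually populated."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_populated(v) for v in values) / len(values)


# Illustrative columns: an all-fallback year column and a mostly filled author column.
py_column = [0, 0, 0]
au_column = [["Author A"], ["Author B", "Author C"], []]

print(populated_fraction(py_column), populated_fraction(au_column))
```

In a test, an assertion such as `populated_fraction(column) > 0.9` (the threshold is arbitrary) would have failed on the all-empty Dimensions output even though the schema check passed.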

---

## Example 4 — A short-reference loop that never terminates (`chr()` overflow)

**Symptom.** On Lens data, `metaTagExtraction(df, "SR")` hangs for ~10 minutes
then raises `ValueError: chr() arg not in range(0x110000)`.

**Diagnose.** The duplicate-SR disambiguation loop appends `-a`, `-b`, ... to
repeated short references until none repeat:

```python
while st == 0:
    ind = SR.duplicated()
    if ind.any():
        i += 1
        SR[ind] = SR[ind] + "-" + chr(96 + i)  # i grows forever
```

Nine Lens rows have a missing journal, so their SR is `NaN`. `NaN + "-a"` is
still `NaN`, so those rows can *never* be made unique — the loop spins ~1.1
million times until `chr(96 + i)` exceeds the Unicode range.

**Patch.** Remove the NaN, and replace the fragile loop with a single-pass,
overflow-proof suffixer (`www/services/metatagextraction.py`):

```python
J9 = J9.fillna("NA"); SR = (... ).fillna("NA") # no NaN can enter
dup_rank = SR.groupby(SR).cumcount() # 0,1,2,... per group
SR = SR + dup_rank.map(_dup_suffix) # "", "-a", "-b", ... "-aa"
```

**Verify.** Lens: 1000 rows → 1000 unique SRs, 0 NaN, completes instantly.
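
The body of `_dup_suffix` is not shown above. The sketch below is one guess that reproduces the sequence in the comment (`""`, `"-a"`, `"-b"`, ... `"-aa"`) using bijective base-26 numbering. `dup_suffix` and `disambiguate` are hypothetical names, and the plain-Python counter stands in for the `groupby(...).cumcount()` step.

```python
from collections import defaultdict


def dup_suffix(rank):
    """Map a duplicate rank to a suffix: 0 -> '', 1 -> '-a', 26 -> '-z', 27 -> '-aa'."""
    if rank <= 0:
        return ""
    letters = []
    while rank > 0:
        # Bijective base 26: there is no zero digit, so shift by one first.
        rank, rem = divmod(rank - 1, 26)
        letters.append(chr(ord("a") + rem))
    return "-" + "".join(reversed(letters))


def disambiguate(short_refs):
    """Single pass: the n-th repeat of a value gets the n-th suffix."""
    seen = defaultdict(int)
    out = []
    for sr in short_refs:
        out.append(sr + dup_suffix(seen[sr]))
        seen[sr] += 1
    return out


# Illustrative input, with "NA" standing in for filled missing values.
result = disambiguate(["A", "A", "NA", "A", "NA"])
print(result)
```

Because the suffix is computed from the rank rather than by repeated `chr(96 + i)` increments, the number of duplicates affects only the suffix length, and no code point is ever out of range.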

---

## The pattern

Across all 16 patches the same WoS-only assumptions recurred, and the
source-agnostic fix was always one of:

| WoS assumption | Source-agnostic fix |
|----------------|---------------------|
| A field is always a populated list | Normalize: list stays, string is split, NaN → `[]` |
| The input is a Shiny reactive (`df.get()`) | Accept a plain DataFrame too / defensive `.copy()` |
| The DB is exactly `"Web_of_Science"` | Case-insensitive matching + non-WoS routing |
| A computed matrix is never empty | Guard `None` / empty before using it |
| Author/year/journal are always present | Fall back to `"NA"` / `0` / `""` |

The ETL guarantees the **schema**; these patches make the **functions** stop
assuming the data came from Web of Science.
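
The first row of the table can be sketched as a single normalizer. This is an illustration, not the project's implementation: `normalize_multivalue` is a hypothetical name, and the `";"` separator is an assumption about how multi-value strings are delimited.

```python
import math


def normalize_multivalue(value, sep=";"):
    """Return a list for any multi-value cell: list stays, string is split, NaN -> []."""
    if isinstance(value, (list, tuple)):
        return list(value)
    if isinstance(value, str):
        # Split on the assumed separator and drop empty fragments.
        return [part.strip() for part in value.split(sep) if part.strip()]
    return []  # NaN, None, or any other scalar


examples = [
    ["Smith J", "Lee K"],   # already a list
    "Smith J; Lee K ;",     # delimited string
    math.nan,               # missing value
]

normalized = [normalize_multivalue(v) for v in examples]
print(normalized)
```

Applying such a function once at the boundary of each analytical function means the loop bodies can keep iterating as if every field were a list.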