From Heterogeneous Bibliographic Data to a Unified Schema: Source-Agnostic ETL Pipeline for Bibliometrix-Python by ideepkush · Pull Request #16 · PRAISELab-PicusLab/bibliometrix-python

ideepkush · 2026-06-12T12:17:42Z

Group Members

Name	Matricola
Deepak Kushwaha	D03000258
Subhadip Maity	D03000291
Vedant Gajanan Pawar	D03000257
Vishal Kumar	D03000263

Course: Data Science — Academic Year 2025/2026
Professor: Vincenzo Moscato

1. Summary

This contribution adds a source-agnostic ETL pipeline (www/services/etl/) to Bibliometrix-Python. The pipeline converts bibliographic data from 7 sources — Scopus, Dimensions, PubMed (file + API), OpenAlex, Cochrane, and Lens.org — into the standardized Web of Science (WoS) schema expected by the analytical functions in functions/ and www/services/.

Metric	Value
Sources supported	7 (5 file + 2 API)
Required columns guaranteed	24 (full WoS glossary)
Files patched for WoS-bug compatibility	40+
Automated tests	65 passing
Function compatibility	96% — 135/140 (27/28 functions × 5 sources)
Throughput	up to 8,800 records/sec (Cochrane)
CI/CD	GitHub Actions across Python 3.10/3.11/3.12
Dashboard integration	API query panel + Standardized CSV loader

2. Architecture

                ┌────────────────────────────────────┐
                │  convert2df(source, ...)           │  ← single public entry
                └──────────────┬─────────────────────┘
                               │
                ┌──────────────▼─────────────────────┐
                │  Dispatcher (SOURCE_REGISTRY)      │
                │  routes by source name             │
                └──────────────┬─────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────────┐
        │                      │                          │
   ┌────▼────────┐    ┌────────▼─────────┐    ┌──────────▼─────────┐
   │ Extractors  │    │ Mappings (dicts) │    │ Transform pipeline │
   │ (7 sources) │    │ raw col → WoS    │    │ rename→types→SR    │
   └─────────────┘    └──────────────────┘    └──────────┬─────────┘
                                                          │
                                              ┌───────────▼──────────┐
                                              │ Validation (24 cols, │
                                              │  no NaN, list types) │
                                              └───────────┬──────────┘
                                                          │
                                              ┌───────────▼──────────┐
                                              │ Standardized DF      │
                                              │ → CSV / Dashboard /  │
                                              │   Analytical funcs   │
                                              └──────────────────────┘

2.1 Dispatcher Pattern with Plugin API

www/services/etl/dispatcher.py exposes a single registry plus a public register_source() API for third-party extensions:

SOURCE_REGISTRY = {
    "SCOPUS":      {"extractor": ScopusCSVExtractor,      "mapping": SCOPUS_MAPPING,      "mode": "file"},
    "DIMENSIONS":  {"extractor": DimensionsExcelExtractor,"mapping": DIMENSIONS_MAPPING,  "mode": "file"},
    "PUBMED_FILE": {"extractor": PubMedFileExtractor,     "mapping": PUBMED_MAPPING,      "mode": "file"},
    "OPENALEX":    {"extractor": OpenAlexAPIExtractor,    "mapping": OPENALEX_MAPPING,    "mode": "api"},
    "PUBMED_API":  {"extractor": PubMedAPIExtractor,      "mapping": PUBMED_MAPPING,      "mode": "api"},
    "COCHRANE":    {"extractor": CochraneFileExtractor,   "mapping": COCHRANE_MAPPING,    "mode": "file"},
    "LENS":        {"extractor": LensCSVExtractor,        "mapping": LENS_MAPPING,        "mode": "file"},
}
# Plugin API — third-party packages can add new sources without modifying core code
register_source("MY_DB", MyExtractor, MY_MAPPING, mode="file")
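
The following is a minimal sketch of how a registration call like the one above could populate the registry. It is not the code in dispatcher.py: the upper-casing of keys, the duplicate check, the resolve_source() helper, and the MyExtractor class are all assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical stand-in for the registry kept in dispatcher.py.
SOURCE_REGISTRY: Dict[str, Dict[str, Any]] = {}

VALID_MODES = {"file", "api"}


def register_source(name: str, extractor: type, mapping: Dict[str, str], mode: str = "file") -> None:
    """Add a source to the registry (assumed behaviour: keys are upper-cased)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    key = name.upper()
    if key in SOURCE_REGISTRY:
        raise ValueError(f"source {key!r} is already registered")
    SOURCE_REGISTRY[key] = {"extractor": extractor, "mapping": mapping, "mode": mode}


def resolve_source(name: str) -> Dict[str, Any]:
    """Look up a registry entry and fail with a readable message if it is unknown."""
    try:
        return SOURCE_REGISTRY[name.upper()]
    except KeyError:
        known = ", ".join(sorted(SOURCE_REGISTRY)) or "none"
        raise ValueError(f"unknown source {name!r}; registered: {known}") from None


class MyExtractor:  # hypothetical third-party extractor
    def extract(self, path: str):
        raise NotImplementedError


register_source("my_db", MyExtractor, {"Title": "TI", "Year": "PY"}, mode="file")
entry = resolve_source("MY_DB")
```

Keeping the extractor class, mapping, and mode in one entry lets a dispatcher route on the source name alone, which is what allows a new source to be added without touching core code.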

2.2 Mapping Dictionaries (declarative, not procedural)

Each source has a dedicated mapping file under www/services/etl/mappings/: scopus_mapping.py, dimensions_mapping.py, pubmed_mapping.py, openalex_mapping.py, cochrane_mapping.py, lens_mapping.py. These are pure Python dicts of {"source_column": "WoS_field_tag"} — no conditional branching, no hardcoded source-specific logic.
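
As a rough illustration of the declarative approach, the sketch below applies a mapping dict to one raw record. EXAMPLE_MAPPING is a made-up subset, not the contents of scopus_mapping.py, and dropping unmapped columns is an assumption about the renamer's behaviour.

```python
from typing import Dict

# Illustrative subset only; the real mapping files are not reproduced here.
EXAMPLE_MAPPING: Dict[str, str] = {
    "Title": "TI",
    "Year": "PY",
    "Source title": "SO",
    "Cited by": "TC",
}


def rename_record(record: Dict[str, object], mapping: Dict[str, str]) -> Dict[str, object]:
    """Keep only mapped columns and rename them to WoS field tags."""
    return {mapping[col]: value for col, value in record.items() if col in mapping}


raw = {"Title": "Deep learning", "Year": "2015", "Cited by": "42", "EID": "2-s2.0-1"}
renamed = rename_record(raw, EXAMPLE_MAPPING)
# renamed == {"TI": "Deep learning", "PY": "2015", "TC": "42"}
```

Values are still raw strings at this point; casting them is left to the type-contract step.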

2.3 Type Contracts

Field group	Python type	Null default
`AU, AF, C1, CR, DE, ID`	`list[str]`	`[]`
`TC, PY`	`int`	`0`
All other (16 fields)	`str`	`""`
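
The contract in the table can be sketched as three small coercion helpers. This is an illustration under assumptions, not the code in transform/type_contracts.py: the helper names, the ";" separator for delimited strings, and the fallback to 0 for unparseable integers are hypothetical.

```python
import math
from typing import Any, List

LIST_FIELDS = ("AU", "AF", "C1", "CR", "DE", "ID")
INT_FIELDS = ("TC", "PY")


def _is_missing(value: Any) -> bool:
    """True for None and float NaN, the two null forms named in the validation rules."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_list(value: Any, sep: str = ";") -> List[str]:
    """Pass lists through, split delimited strings, map missing values to []."""
    if _is_missing(value):
        return []
    items = value if isinstance(value, (list, tuple)) else str(value).split(sep)
    cleaned = [str(item).strip() for item in items if not _is_missing(item)]
    return [item for item in cleaned if item]


def to_int(value: Any) -> int:
    """Cast to int, using 0 for missing or unparseable values (assumed fallback)."""
    if _is_missing(value):
        return 0
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0


def to_str(value: Any) -> str:
    return "" if _is_missing(value) else str(value).strip()


def apply_contract(tag: str, value: Any):
    """Dispatch on the WoS tag to the coercion required by the table above."""
    if tag in LIST_FIELDS:
        return to_list(value)
    if tag in INT_FIELDS:
        return to_int(value)
    return to_str(value)
```

For example, apply_contract("AU", "SMITH J; DOE A") gives ["SMITH J", "DOE A"], and apply_contract("PY", None) gives 0.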

2.4 SR Calculated Field

SR (Short Reference) uses the Author, Year, Journal format and is populated for every record via the existing metaTagExtraction logic.

2.5 Validation Module

Programmatically verifies:

All 24 mandatory columns exist
No NaN or None values remain
Multi-value columns are real list[str]
PY is a 4-digit year integer (or 0)
DB is populated for every row
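
A minimal sketch of these checks on a single record follows. The column list is taken from the glossary in section 9; the function name, the error strings, and the reading of "4-digit year" as 1000 to 9999 are assumptions, and validation/validator.py may be organised differently.

```python
import math
from typing import Any, Dict, List

REQUIRED_COLUMNS = [
    "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "LA", "TC", "AU", "AF", "C1",
    "DT", "PY", "RP", "CR", "DE", "ID", "AB", "VL", "IS", "BP", "EP", "SR",
]
LIST_COLUMNS = {"AU", "AF", "C1", "CR", "DE", "ID"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    for col in REQUIRED_COLUMNS:
        if col not in record:
            errors.append(f"missing column {col}")
            continue
        value = record[col]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            errors.append(f"{col} is null")
        elif col in LIST_COLUMNS and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            errors.append(f"{col} is not list[str]")
    year = record.get("PY")
    is_int = isinstance(year, int) and not isinstance(year, bool)
    if not is_int or not (year == 0 or 1000 <= year <= 9999):
        errors.append("PY must be a 4-digit int or 0")
    if not record.get("DB"):
        errors.append("DB is empty")
    return errors


# A record with typed defaults everywhere, plus a source label and a year.
good = {col: "" for col in REQUIRED_COLUMNS}
good.update({col: [] for col in LIST_COLUMNS})
good.update(DB="SCOPUS", PY=2020, TC=3)

# The same record with a flattened author string and a truncated year.
bad = dict(good, AU="SMITH J;DOE A", PY=20)
problems = validate_record(bad)
```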

3. Limitations of Original Python Implementation — Solution Matrix

#	Original limitation	Where addressed
1	No single entry-point like `convert2df()`	`convert.py::convert_to_bibliometrix_df()` + `convert2df` alias
2	Scattered transformation logic	`transform/pipeline.py` orchestrator
3	Weak type enforcement	`transform/type_contracts.py`
4	Poor NaN/None handling	`transform/normalizer.py`
5	Implicit WoS dependency	Mapping dicts + case-insensitive DB matching in `histNetwork`
6	Incomplete column mapping	24-column TARGET schema enforced
7	Non-standard reference parsing	Reference parsing in extractors + `normalize_list_field`

4. ETL Pipeline Phases

Phase	Module	Responsibility
1. Extract	`extractors/` (7 files)	Source-specific raw load (CSV / XLSX / TXT / REST JSON / XML)
2. Transform — Rename	`transform/renamer.py`	Map raw columns → WoS tags via mapping dicts
2. Transform — Type contracts	`transform/type_contracts.py`	Cast values to required types, split delimited strings into lists
2. Transform — Schema completion	`transform/schema_completion.py`	Add missing columns with typed defaults
3. Calculated Fields	`transform/calculated_fields.py`	SR (Short Reference), reusing the existing `metaTagExtraction`
4. Validation	`validation/validator.py`	Schema, type, and null checks
5. Load (Export)	`export/csv_exporter.py`	CSV serialization with `;` delimiter for list fields

No monolithic function — each phase is a separate module with explicit boundaries.
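
To make the export phase concrete, here is a small sketch of writing list fields with a ";" delimiter and splitting them again on reload. It uses only the standard csv module; the function names are hypothetical and export/csv_exporter.py may work differently.

```python
import csv
import io
from typing import Any, Dict, List

LIST_FIELDS = ("AU", "AF", "C1", "CR", "DE", "ID")


def records_to_csv(records: List[Dict[str, Any]], columns: List[str]) -> str:
    """Serialize records, joining list fields with ';' inside a single CSV cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {
            col: ";".join(record[col]) if col in LIST_FIELDS else record[col]
            for col in columns
        }
        writer.writerow(row)
    return buffer.getvalue()


def csv_to_records(text: str) -> List[Dict[str, Any]]:
    """Reload the CSV and split list fields back into lists."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in LIST_FIELDS:
            if col in row:
                row[col] = [item for item in row[col].split(";") if item]
        records.append(row)
    return records


records = [{"TI": "A title, with comma", "AU": ["SMITH J", "DOE A"], "PY": 2020}]
text = records_to_csv(records, ["TI", "AU", "PY"])
reloaded = csv_to_records(text)
```

Note that PY comes back as the string "2020" after the round trip, so a reloaded file still needs the type-contract step before it satisfies the schema.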

5. Advanced Level — API Extraction

5.1 OpenAlex

Endpoint: https://api.openalex.org/works
Pagination: page + per-page parameters
Rate limit: HTTP 429 → exponential backoff (time.sleep(2**attempt))
Retries: 3 attempts per request
Abstract reconstruction from inverted index; author/institution/concept normalization
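
OpenAlex returns abstracts as an inverted index that maps each word to the positions where it occurs. A minimal sketch of the reconstruction step is shown below; the function name is hypothetical and the extractor's own implementation may differ.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Turn {"word": [positions]} back into running text; empty input gives ""."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))


example = {"Deep": [0], "learning": [1, 4], "improves": [2], "representation": [3]}
abstract = rebuild_abstract(example)
# abstract == "Deep learning improves representation learning"
```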

5.2 PubMed API

NCBI ESearch + EFetch endpoints; XML payload parsing with xml.etree.ElementTree
Same retry / backoff strategy
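
The retry strategy shared by both extractors can be sketched as follows. The fetch callable and the injectable sleep function are assumptions that keep the example testable without network access; whether the real code counts attempts from 0 or 1, and how it treats other error statuses, is not stated in this description.

```python
import time
from typing import Callable, Tuple

MAX_ATTEMPTS = 3  # "3 attempts per request"


def get_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = MAX_ATTEMPTS,
) -> str:
    """Call fetch() until it stops returning HTTP 429 or the attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < max_attempts - 1:
            sleep(2 ** attempt)  # waits 1 s, then 2 s (assumes attempt starts at 0)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")


# Simulated server: two rate-limit responses, then success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits = []
result = get_with_backoff(lambda: next(responses), sleep=waits.append)
# result == "ok", waits == [1, 2]
```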

5.3 Caching Layer (`cache.py`)

Every API GET is cached on disk for 24 hours (SHA-1 of url + params as key), reducing repeated network calls during notebook runs, CI, and dashboard reloads.
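
A minimal sketch of such a cache is given below, assuming one JSON file per key and expiry based on file modification time. The function names, the file layout, and the exact way url and params are combined before hashing are assumptions; cache.py may differ.

```python
import hashlib
import json
import os
import tempfile
import time
from typing import Any, Dict, Optional

TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cache_key(url: str, params: Dict[str, Any]) -> str:
    """SHA-1 of url + params; sort_keys makes the key independent of dict order."""
    payload = url + json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def cache_put(cache_dir: str, url: str, params: Dict[str, Any], payload: Any) -> None:
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(url, params) + ".json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def cache_get(
    cache_dir: str, url: str, params: Dict[str, Any], now: Optional[float] = None
) -> Optional[Any]:
    """Return the cached payload, or None if it is absent or older than the TTL."""
    path = os.path.join(cache_dir, cache_key(url, params) + ".json")
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) > TTL_SECONDS:
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


cache_dir = tempfile.mkdtemp()
url = "https://api.openalex.org/works"
params = {"search": "machine learning", "per-page": 20}
cache_put(cache_dir, url, params, {"results": []})
hit = cache_get(cache_dir, url, params)
expired = cache_get(cache_dir, url, params, now=time.time() + TTL_SECONDS + 60)
```

Here hit is the stored payload, while expired is None because the simulated clock is past the 24-hour window.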

5.4 Shared Pipeline

Both API extractors feed through convert2df() and inherit the same transformation, type contracts, SR calculation, and validation as file-based sources — no duplicated logic.

6. Shiny Dashboard Integration

app.py exposes a new API Data Retrieval panel:

Sidebar: Data → API — platform selector (OpenAlex / PubMed), search query, max-records input, "Fetch from API" button
Real-time progress feedback and standardized preview table after retrieval
Fetched DataFrame is pushed into the reactive df, immediately enabling all downstream analytical modules

Verified live end-to-end:

http://127.0.0.1:8000 → Data → API → "machine learning" / OpenAlex / 20 records
"✅ Successfully retrieved 20 records from OPENALEX and standardized into the WoS schema"
Preview table shows DB | UT | TI | PY | AU | TC columns populated

A second panel — "Load a Standardized CSV" — re-imports any CSV produced by the ETL pipeline, re-validates it against the WoS schema, and renders a column-coverage map.

7. Performance Benchmarks

Source	Records	ETL Time	Throughput
SCOPUS	1,000	0.40s	2,503 rec/s
DIMENSIONS	501	0.14s	3,673 rec/s
PUBMED_FILE	10,000	1.82s	5,481 rec/s
COCHRANE	1,126	0.13s	8,801 rec/s
LENS	1,000	0.18s	5,550 rec/s

8. Function Patches — Removing Hardcoded WoS-Specific Logic

8.1 `df.get()` reactive-value pattern (39 files)

# Before
data = df.get()
# After
data = df if isinstance(df, pd.DataFrame) else df.get()

8.2 `histNetwork` — case-insensitive DB + non-WoS routing

The function compared db == "Web_of_Science" (case-sensitive) and rejected everything else. It now matches db.upper().replace("-", "_") against an accepted set and routes non-WoS sources through the Scopus-compatible code path.
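
The normalization step can be illustrated with a short sketch. The contents of ACCEPTED_WOS and the route_db() helper are hypothetical; only the upper() and replace("-", "_") calls come from the description above.

```python
# Hypothetical accepted set; the real list in histNetwork is not shown in this PR text.
ACCEPTED_WOS = {"WEB_OF_SCIENCE", "WOS", "ISI"}


def route_db(db: str) -> str:
    """Return which histNetwork branch a DB label would take in this sketch."""
    normalized = db.upper().replace("-", "_")
    return "wos" if normalized in ACCEPTED_WOS else "scopus"


branches = [route_db(label) for label in ("Web_of_Science", "web-of-science", "SCOPUS", "LENS")]
# branches == ["wos", "wos", "scopus", "scopus"]
```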

8.3 Empty `CR` guard

For sources without cited references, histNetwork returns None gracefully; callers check for None and short-circuit.

8.4 NaN-on-empty-data guards (8 functions)

Functions computing int(max_x) from possibly-empty Series now guard against NaN/zero with a safe default.
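
A guard of this kind might look like the sketch below. The helper name and the default of 1 are assumptions; the patched functions may use different defaults.

```python
import math
from typing import Iterable


def safe_int_max(values: Iterable[float], default: int = 1) -> int:
    """int(max(values)) that tolerates empty input, NaN, infinity, and a zero maximum."""
    finite = [
        v for v in values
        if v is not None and not math.isnan(v) and not math.isinf(v)
    ]
    if not finite:
        return default
    top = int(max(finite))
    return top if top > 0 else default
```

Without the guard, int(float("nan")) raises ValueError and max() of an empty sequence raises ValueError as well, which is how an empty dataset can crash an axis-tick calculation.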

8.5 `get_thematicmap` column count bug

The original code joined words into a comma-separated string and then re-split it, losing alignment with sC. Patched to keep the words as a list throughout.

8.6 `get_factorialanalysis` infinity guard

Default topWordPlot=np.inf was cast directly via int(). Patched to treat infinity as "all rows".
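
The failure is easy to reproduce: int(float("inf")) raises OverflowError. A minimal sketch of the guard, with a hypothetical helper name and a clamp to the available row count added as an assumption:

```python
import math


def resolve_top_n(top_word_plot: float, n_rows: int) -> int:
    """Interpret an infinite limit as "all rows"; otherwise clamp to [0, n_rows]."""
    if math.isinf(top_word_plot):
        return n_rows
    return max(0, min(int(top_word_plot), n_rows))


limits = [resolve_top_n(float("inf"), 50), resolve_top_n(10, 50), resolve_top_n(100, 50)]
# limits == [50, 10, 50]
```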

8.7 `cocMatrix` in-place mutation of shared DataFrame

cocMatrix set M.index = M["SR"] by reference. Every subsequent module reading the shared reactive df.get() found SR as both index and column, crashing with 'SR' is both an index level and a column label, which is ambiguous. Fixed by taking a defensive .copy() at function entry — this affected all databases including WoS.

8.8 `metaTagExtraction` (`SR`) — infinite-loop / `chr()` overflow

The SR de-duplication loop appended -{chr(96+i)} to duplicate SR values. When a record produced NaN as SR, NaN + "-a" stayed NaN, so the loop spun ~1.1M times until chr() exceeded Unicode range. Fixed by filling the missing journal field and replacing the loop with a single-pass vectorized suffixer.
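
A single-pass suffixer can be sketched in plain Python as below. It is not the vectorized implementation from the patch: the choice to leave the first occurrence unchanged and the letter scheme beyond "z" (aa, ab, ...) are assumptions, and the sketch does not check whether a generated value such as "X-a" already exists in the input.

```python
from collections import defaultdict
from typing import Dict, List


def letter_suffix(n: int) -> str:
    """1 -> 'a', 26 -> 'z', 27 -> 'aa'; stays within a-z, so chr() cannot overflow."""
    letters = []
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters.append(chr(ord("a") + remainder))
    return "".join(reversed(letters))


def dedupe_sr(values: List[str]) -> List[str]:
    """Suffix repeated SR values in one pass; the first occurrence is left as is."""
    seen: Dict[str, int] = defaultdict(int)
    result = []
    for value in values:
        count = seen[value]
        seen[value] += 1
        result.append(value if count == 0 else f"{value}-{letter_suffix(count)}")
    return result


deduped = dedupe_sr([
    "SMITH J, 2020, NATURE",
    "DOE A, 2019, SCIENCE",
    "SMITH J, 2020, NATURE",
    "SMITH J, 2020, NATURE",
])
# deduped == ["SMITH J, 2020, NATURE", "DOE A, 2019, SCIENCE",
#             "SMITH J, 2020, NATURE-a", "SMITH J, 2020, NATURE-b"]
```

Because each value is visited exactly once, the running time is linear in the number of records and there is no loop condition that a NaN value could keep alive.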

8.9 `histNetwork` (`wos` branch) — non-iterable `CR` guard

When CR was a NaN float (empty-CR rows in the bundled WoS sample), iterating it raised TypeError: 'float' object is not iterable. Fixed by normalising CR to a list first.

8.10 `histNetwork` (`wos` branch) — empty local-citation matrix guard

cocMatrix(..., Field="LCR") returns None for sparse datasets. The next line did set(WLCR.columns), raising AttributeError: 'NoneType'. Added a guard that falls back to an empty zero self-matrix.

8.11 `metaTagExtraction` (`AU_CO`) — non-iterable affiliation guard

Country extraction iterated C1.iloc[i] which was a NaN float for records with no affiliation. Fixed by treating any non-list affiliation as empty — confirmed live in the dashboard (Main Information and Countries Production now render).

8.12 `metaTagExtraction` (`SR`) — list/string/NaN author normalization

The SR builder did [x.strip() for x in l] over AU, assuming a list. When AU was a ";"-delimited string (reloaded CSV) it iterated single characters and produced garbage. Normalised AU to a list (pass lists, split strings on ;, map missing to []).

8.13 `histNetwork` (`scopus` branch) — list/string/NaN `CR` normalization

The Scopus path assumed CR was already a list. Reloaded flat data supplies CR as a ";"-delimited string or NaN. Normalised CR to lists first, making the historiograph render on Scopus data — confirmed live in the dashboard.

9. Standard Column Glossary — All 24 Columns Present

Tag	Type	Tag	Type	Tag	Type	Tag	Type
DB	str	LA	str	RP	str	IS	str
UT	str	TC	int	CR	list	BP	str
DI	str	AU	list	DE	list	EP	str
PMID	str	AF	list	ID	list	SR	str
TI	str	C1	list	AB	str
SO	str	DT	str	VL	str
JI	str	PY	int

10. Test Results

10.1 Automated Test Suite

Total tests passing:  65
Test files:           4 (test_core_etl, test_all_sources,
                         test_function_compatibility, test_full_compat_matrix)
Per-source schema compliance:    5/5 sources ✅
Per-source type contracts:      25/25 checks ✅

10.2 Function Compatibility Matrix (28 functions × 5 sources)

Source	Records	Pass Rate
SCOPUS	1,000	27 / 28 (96%) ✅
DIMENSIONS	501	27 / 28 (96%) ✅
PUBMED	10,000	27 / 28 (96%) ✅
COCHRANE	1,126	27 / 28 (96%) ✅
LENS	1,000	27 / 28 (96%) ✅
TOTAL	13,627	135 / 140 (96%) ✅

10.3 Single Remaining Limitation

get_thematicevolution requires user-provided year breakpoints from the Shiny reactive context — it is interactive by design and cannot be tested headlessly. It works correctly when called from the Shiny UI.

10.4 Continuous Integration

.github/workflows/etl-tests.yml runs on every push and PR across Python 3.10, 3.11, and 3.12.

11. How to Reproduce

# Run all tests
pytest tests/etl/ -v -s

# CLI sweep over all 5 file sources
python tests/run_etl.py --sweep

# Process a single source
python tests/run_etl.py --source COCHRANE --file sources/Cochrane/citation-export.txt

# Live API query
python tests/run_etl.py --source OPENALEX --query "machine learning" --max 50

# Launch the dashboard
shiny run app.py
# Open http://127.0.0.1:8000 → Sidebar → Data → API

12. Files Changed

New ETL package: www/services/etl/ — dispatcher, extractors (7), mappings (6), transform, validation, export, cache

Tests: tests/conftest.py, tests/etl/test_core_etl.py, tests/etl/test_all_sources.py, tests/etl/test_function_compatibility.py, tests/etl/test_full_compat_matrix.py, tests/run_etl.py

Notebooks & CI: notebooks/ETL_Demonstration.ipynb, .github/workflows/etl-tests.yml

Modified (dashboard): app.py — API Data Retrieval + Standardized CSV Loader panels

Modified (WoS-bug patches): 33 files in functions/, 7 files in www/services/

Implements www/services/etl/, a modular ETL pipeline that converts bibliographic data from Scopus, Dimensions, PubMed (file + API), and OpenAlex into the standardized Web of Science schema expected by the analytical functions in functions/ and www/services/.

Architecture:
- Single entry point: convert_to_bibliometrix_df()
- Dispatcher pattern routing to 5 source-specific extractors
- Mapping dictionaries (no hardcoded if/else)
- Type contracts: list[str] for AU/AF/C1/CR/DE/ID, int for TC/PY
- Null handling: empty string / 0 / [] defaults
- SR calculated field generation
- Validation engine (24-column schema check)
- CSV export with semicolon delimiters for list fields

Advanced level features:
- OpenAlex and PubMed REST API extractors
- Pagination, rate-limit handling (HTTP 429), exponential-backoff retries
- API extractors reuse the same transformation pipeline (no duplicated logic)

Honors bonus:
- API Data Retrieval panel integrated into Shiny dashboard (app.py)
- Live query → standardized DataFrame → ready for analysis

Function patches (per exam: 'debug and patch hardcoded WoS logic'):
- 39 files patched for df.get() reactive-value pattern compatibility
- 2 service files patched for df.set() pattern
- 7 files: missing 'from typing import List' imports added
- histNetwork: case-insensitive DB matching, non-WoS source routing
- Empty CR guard in citation-network functions
- NaN guards in plot-axis tick calculations across 8 functions
- Fixed thematicmap column count alignment bug
- Fixed factorialanalysis infinity overflow
- biblionetwork / cocMatrix: explicit None-result propagation

Tests:
- 12/12 automated tests pass
- 96% function compatibility on Scopus, Dimensions, PubMed

See PROJECT_REPORT.md for full architecture and patch documentation.

…alias
- convert2df() alias matching the R original (per exam Section 4)
- tests/run_etl.py: CLI tool with --sweep, --source/--file/--query, --strict, --mailto
- notebooks/ETL_Demonstration.ipynb: 10-cell walkthrough of the pipeline
- Dashboard: 'Load a Standardized CSV' panel with pill-badge column coverage
- tests/etl/test_full_compat_matrix.py: parametrized matrix across all sources

…fixtures

New sources (now 7 total supported):
- Cochrane Library citation export (CochraneFileExtractor + COCHRANE_MAPPING)
- Lens.org CSV export (LensCSVExtractor + LENS_MAPPING)

Plugin architecture:
- dispatcher.py exposes register_source() for runtime registration of third-party extractors, enabling extension without core modifications

Production hardening:
- www/services/etl/cache.py: SHA-1 keyed on-disk cache for API responses with 24h TTL; speeds up notebooks, CI runs, and dashboard reloads

Test infrastructure:
- tests/conftest.py: shared session-scoped fixtures for all 5 file sources
- tests/etl/test_all_sources.py: 35 schema + type-contract tests across all sources (covers Scopus, Dimensions, PubMed, Cochrane, Lens)
- Total tests now: 65 passing

CI/CD:
- .github/workflows/etl-tests.yml: GitHub Actions matrix across Python 3.10 / 3.11 / 3.12 with ETL core, schema, CLI, and 7-source tests

CLI:
- tests/run_etl.py --sweep now processes all 5 file-based sources

Documentation:
- PROJECT_REPORT.md: architecture diagram, problem to solution matrix, performance benchmarks (up to 8,800 records/sec)

- Section 10 now documents the function compatibility matrix:
  - 5 sources tested (Scopus, Dimensions, PubMed, Cochrane, Lens)
  - 28 analytical functions per source
  - 135/140 pass rate (96%) across all sources
  - List of all 27 functions that pass per source
  - Single remaining limitation (get_thematicevolution) explained
- Section 6: dashboard integration restructured (API + CSV loader)
- Removed promotional phrasing for neutral, factual tone
- Tightened section titles for a professional contribution voice

ETL pipeline:
- Dimensions extractor skipped the export banner row and missed the "PubYear" column, producing 100% empty standardized data; fix skiprows + add PubYear -> PY mapping
- SR (Short Reference) now invokes the existing metaTagExtraction/SR() function instead of reimplementing it (per project requirement)
- Route raw Scopus/Dimensions/PubMed/Lens/Cochrane dashboard imports through convert2df, with a safe fallback to the legacy parser

Analytical-function robustness (no longer crash on real data):
- cocMatrix: defensive copy so it no longer mutates the shared reactive DataFrame ("'SR' is both an index level and a column label")
- metaTagExtraction.SR: normalize AU (list/str/NaN) + single-pass, overflow-proof duplicate suffixing (fixes chr() overflow loop)
- metaTagExtraction.AU_CO: guard non-list affiliations (fixes Main Information + country panels)
- histNetwork wos()/scopus(): normalize CR (list/str/NaN) and guard the empty local-citation matrix
- Auto-download required NLTK corpora (stopwords, wordnet) so text-mining functions work on a fresh environment

Docs: PROJECT_REPORT sections 8.10-8.16, add TESTING.md

Drop dead assets (fonts, JS libs, static images), orphaned source data, and redundant .gitignore entries. All tests still pass.

ideepkush added 11 commits May 13, 2026 01:04

Add source-agnostic bibliographic ETL pipeline

aa50b66

Add timeout to CRAN version check

ecd7ac0

Add debugging walkthrough (symptom -> root cause -> patch -> verify)

053dd10

Ignore local .claude/ preview config

51ea7ea

Remove unused files and clean up .gitignore

81bec89

Add group member details to project report

00e512b

ideepkush closed this Jun 12, 2026

ideepkush reopened this Jun 12, 2026

ideepkush changed the title ~~Add source agnostic etl~~ From Heterogeneous Bibliographic Data to a Unified Schema: Source-Agnostic ETL Pipeline for Bibliometrix-Python Jun 12, 2026

ideepkush wants to merge 11 commits into PRAISELab-PicusLab:main from ideepkush:add-source-agnostic-etl
