From Heterogeneous Bibliographic Data to a Unified Schema: Source-Agnostic ETL Pipeline for Bibliometrix-Python #16
Open
ideepkush wants to merge 11 commits into
Implements www/services/etl/, a modular ETL pipeline that converts bibliographic data from Scopus, Dimensions, PubMed (file + API), and OpenAlex into the standardized Web of Science schema expected by the analytical functions in functions/ and www/services/.
Architecture:
- Single entry point: convert_to_bibliometrix_df()
- Dispatcher pattern routing to 5 source-specific extractors
- Mapping dictionaries (no hardcoded if/else)
- Type contracts: list[str] for AU/AF/C1/CR/DE/ID, int for TC/PY
- Null handling: empty string / 0 / [] defaults
- SR calculated field generation
- Validation engine (24-column schema check)
- CSV export with semicolon delimiters for list fields
Advanced level features:
- OpenAlex and PubMed REST API extractors
- Pagination, rate-limit handling (HTTP 429), exponential-backoff retries
- API extractors reuse the same transformation pipeline (no duplicated logic)
Honors bonus:
- API Data Retrieval panel integrated into Shiny dashboard (app.py)
- Live query → standardized DataFrame → ready for analysis
Function patches (per exam: 'debug and patch hardcoded WoS logic'):
- 39 files patched for df.get() reactive-value pattern compatibility
- 2 service files patched for df.set() pattern
- 7 files: missing 'from typing import List' imports added
- histNetwork: case-insensitive DB matching, non-WoS source routing
- Empty CR guard in citation-network functions
- NaN guards in plot-axis tick calculations across 8 functions
- Fixed thematicmap column count alignment bug
- Fixed factorialanalysis infinity overflow
- biblionetwork / cocMatrix: explicit None-result propagation
Tests:
- 12/12 automated tests pass
- 96% function compatibility on Scopus, Dimensions, PubMed
See PROJECT_REPORT.md for full architecture and patch documentation.
…alias
- convert2df() alias matching the R original (per exam Section 4)
- tests/run_etl.py: CLI tool with --sweep, --source/--file/--query, --strict, --mailto
- notebooks/ETL_Demonstration.ipynb: 10-cell walkthrough of the pipeline
- Dashboard: 'Load a Standardized CSV' panel with pill-badge column coverage
- tests/etl/test_full_compat_matrix.py: parametrized matrix across all sources
…fixtures
New sources (now 7 total supported):
- Cochrane Library citation export (CochraneFileExtractor + COCHRANE_MAPPING)
- Lens.org CSV export (LensCSVExtractor + LENS_MAPPING)
Plugin architecture:
- dispatcher.py exposes register_source() for runtime registration of third-party extractors, enabling extension without core modifications
Production hardening:
- www/services/etl/cache.py: SHA-1 keyed on-disk cache for API responses with 24h TTL; speeds up notebooks, CI runs, and dashboard reloads
Test infrastructure:
- tests/conftest.py: shared session-scoped fixtures for all 5 file sources
- tests/etl/test_all_sources.py: 35 schema + type-contract tests across all sources (covers Scopus, Dimensions, PubMed, Cochrane, Lens)
- Total tests now: 65 passing
CI/CD:
- .github/workflows/etl-tests.yml: GitHub Actions matrix across Python 3.10 / 3.11 / 3.12 with ETL core, schema, CLI, and 7-source tests
CLI:
- tests/run_etl.py --sweep now processes all 5 file-based sources
Documentation:
- PROJECT_REPORT.md: architecture diagram, problem to solution matrix, performance benchmarks (up to 8,800 records/sec)
- Section 10 now documents the function compatibility matrix:
  - 5 sources tested (Scopus, Dimensions, PubMed, Cochrane, Lens)
  - 28 analytical functions per source
  - 135/140 pass rate (96%) across all sources
  - List of all 27 functions that pass per source
  - Single remaining limitation (get_thematicevolution) explained
- Section 6: dashboard integration restructured (API + CSV loader)
- Removed promotional phrasing for neutral, factual tone
- Tightened section titles for a professional contribution voice
ETL pipeline:
- Dimensions extractor skipped the export banner row and missed the
"PubYear" column, producing 100% empty standardized data; fix skiprows
+ add PubYear -> PY mapping
- SR (Short Reference) now invokes the existing metaTagExtraction/SR()
function instead of reimplementing it (per project requirement)
- Route raw Scopus/Dimensions/PubMed/Lens/Cochrane dashboard imports
through convert2df, with a safe fallback to the legacy parser
Analytical-function robustness (no longer crash on real data):
- cocMatrix: defensive copy so it no longer mutates the shared reactive
DataFrame ("'SR' is both an index level and a column label")
- metaTagExtraction.SR: normalize AU (list/str/NaN) + single-pass,
overflow-proof duplicate suffixing (fixes chr() overflow loop)
- metaTagExtraction.AU_CO: guard non-list affiliations (fixes Main
Information + country panels)
- histNetwork wos()/scopus(): normalize CR (list/str/NaN) and guard the
empty local-citation matrix
- Auto-download required NLTK corpora (stopwords, wordnet) so text-mining
functions work on a fresh environment
Docs: PROJECT_REPORT sections 8.10-8.16, add TESTING.md
Drop dead assets (fonts, JS libs, static images), orphaned source data, and redundant .gitignore entries. All tests still pass.
Group Members
Course: Data Science — Academic Year 2025/2026
Professor: Vincenzo Moscato
1. Summary
This contribution adds a source-agnostic ETL pipeline (`www/services/etl/`) to Bibliometrix-Python. The pipeline converts bibliographic data from 7 sources (Scopus, Dimensions, PubMed file and API, OpenAlex, Cochrane, and Lens.org) into the standardized Web of Science (WoS) schema expected by the analytical functions in `functions/` and `www/services/`.
2. Architecture
2.1 Dispatcher Pattern with Plugin API
`www/services/etl/dispatcher.py` exposes a single registry plus a public `register_source()` API for third-party extensions.
2.2 Mapping Dictionaries (declarative, not procedural)
Each source has a dedicated mapping file under `www/services/etl/mappings/`: `scopus_mapping.py`, `dimensions_mapping.py`, `pubmed_mapping.py`, `openalex_mapping.py`, `cochrane_mapping.py`, `lens_mapping.py`. These are pure Python dicts of `{"source_column": "WoS_field_tag"}`, with no conditional branching and no hardcoded source-specific logic.
2.3 Type Contracts
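
| Fields | Type | Null default |
| --- | --- | --- |
| AU, AF, C1, CR, DE, ID | `list[str]` | `[]` |
| TC, PY | `int` | `0` |
| (other fields) | `str` | `""` |

2.4 SR Calculated Field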
`Author, Year, Journal` format, populated for every record via the existing `metaTagExtraction` logic.
2.5 Validation Module
Programmatically verifies:
- no `NaN` or `None` values remain
- list fields are `list[str]`
- `PY` is a 4-digit year integer (or 0)
- `DB` is populated for every row
3. Limitations of Original Python Implementation: Solution Matrix
- `convert2df()`
- `convert.py::convert_to_bibliometrix_df()` + `convert2df` alias
- `transform/pipeline.py` orchestrator
- `transform/type_contracts.py`
- `transform/normalizer.py`
- `histNetwork`
- `normalize_list_field`
4. ETL Pipeline Phases
- `extractors/` (7 files)
- `transform/renamer.py`
- `transform/type_contracts.py`
- `transform/schema_completion.py`
- `transform/calculated_fields.py` (`metaTagExtraction`)
- `validation/validator.py`
- `export/csv_exporter.py` (`;` delimiter for list fields)
No monolithic function: each phase is a separate module with explicit boundaries.
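The phase chain above can be illustrated with a minimal, stdlib-only sketch. It is not the PR's code: plain dicts stand in for the DataFrame, `DEMO_MAPPING` and the reduced `SCHEMA` are invented for illustration, and the `register_source()` signature shown here is an assumption (the PR only states that such a function exists).

```python
from typing import Callable

# Hypothetical mapping dict: source column -> WoS field tag.
DEMO_MAPPING = {"Authors": "AU", "Title": "TI", "Year": "PY", "Cited by": "TC"}

LIST_FIELDS = ("AU", "AF", "C1", "CR", "DE", "ID")
INT_FIELDS = ("TC", "PY")
# Only a handful of the 24 schema columns, to keep the sketch short.
SCHEMA = ("AU", "TI", "PY", "TC", "DE", "DB")

Record = dict
Extractor = Callable[[list], list]
_REGISTRY: dict = {}


def register_source(name: str, extractor: Extractor, mapping: dict) -> None:
    """Assumed signature: register an extractor and its mapping by name."""
    _REGISTRY[name.lower()] = (extractor, mapping)


def rename(record: Record, mapping: dict) -> Record:
    """Phase: rename source columns to WoS tags, dropping unmapped ones."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def apply_type_contracts(record: Record) -> Record:
    """Phase: coerce list fields to list[str] and int fields to int."""
    out = dict(record)
    for tag in LIST_FIELDS:
        if tag in out:
            value = out[tag]
            if isinstance(value, str):
                out[tag] = [p.strip() for p in value.split(";") if p.strip()]
            elif not isinstance(value, list):
                out[tag] = []
    for tag in INT_FIELDS:
        if tag in out:
            try:
                out[tag] = int(out[tag])
            except (TypeError, ValueError, OverflowError):
                out[tag] = 0
    return out


def default_for(tag: str):
    if tag in LIST_FIELDS:
        return []
    return 0 if tag in INT_FIELDS else ""


def complete_schema(record: Record, db: str) -> Record:
    """Phase: add missing schema columns with their null defaults."""
    out = {}
    for tag in SCHEMA:
        value = record.get(tag)
        out[tag] = default_for(tag) if value is None else value
    out["DB"] = db
    return out


def convert(raw: list, source: str) -> list:
    """Dispatch on the source name, then run each phase in order."""
    extractor, mapping = _REGISTRY[source.lower()]
    return [
        complete_schema(apply_type_contracts(rename(r, mapping)), source.upper())
        for r in extractor(raw)
    ]


register_source("demo", lambda rows: rows, DEMO_MAPPING)
rows = convert(
    [{"Authors": "Smith J; Doe A", "Title": "A study", "Year": "2021", "Cited by": None}],
    "demo",
)
print(rows[0])
# {'AU': ['Smith J', 'Doe A'], 'TI': 'A study', 'PY': 2021, 'TC': 0, 'DE': [], 'DB': 'DEMO'}
```

Keeping each phase as a small pure function means a new source only needs an extractor and a mapping dict; the later phases are shared unchanged.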
5. Advanced Level: API Extraction
5.1 OpenAlex
- Endpoint: `https://api.openalex.org/works`
- Pagination: `page` + `per-page` parameters
- Retries: exponential backoff (`time.sleep(2**attempt)`)
5.2 PubMed API
- `xml.etree.ElementTree` (XML parsing)
5.3 Caching Layer (`cache.py`)
Every API GET is cached on disk for 24 hours (SHA-1 of url + params as key), reducing repeated network calls during notebook runs, CI, and dashboard reloads.
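As a rough illustration of how the retry and caching behaviour described in sections 5.1 and 5.3 could fit together, the sketch below wraps an injected fetch function. It is an assumption-laden stand-in, not the contents of `cache.py`: the names `RateLimited`, `cache_key`, `fetch_with_backoff`, and `cached_get` are hypothetical, and the network call is replaced by a fake so the example runs offline.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

TTL_SECONDS = 24 * 60 * 60  # 24-hour time-to-live, as described above


class RateLimited(Exception):
    """Raised by the injected fetch function to signal an HTTP 429."""


def cache_key(url: str, params: dict) -> str:
    """SHA-1 over the URL plus the params in a stable (sorted) order."""
    payload = url + json.dumps(params, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def fetch_with_backoff(fetch, url, params, max_attempts=4, sleep=time.sleep):
    """Retry on RateLimited, waiting 2**attempt seconds between tries."""
    for attempt in range(max_attempts):
        try:
            return fetch(url, params)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(2 ** attempt)


def cached_get(fetch, url, params, cache_dir, now=time.time, sleep=time.sleep):
    """Return a fresh cached body if present, else fetch and store it."""
    path = Path(cache_dir) / f"{cache_key(url, params)}.json"
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now() - entry["stored_at"] < TTL_SECONDS:
            return entry["body"]
    body = fetch_with_backoff(fetch, url, params, sleep=sleep)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"stored_at": now(), "body": body}), encoding="utf-8")
    return body


# Offline demo: the fake fetch is rate-limited once, then succeeds.
calls = []


def fake_fetch(url, params):
    calls.append(url)
    if len(calls) == 1:
        raise RateLimited()
    return {"results": [], "page": params["page"]}


with tempfile.TemporaryDirectory() as tmp:
    url = "https://api.openalex.org/works"
    first = cached_get(fake_fetch, url, {"page": 1}, tmp, sleep=lambda s: None)
    second = cached_get(fake_fetch, url, {"page": 1}, tmp, sleep=lambda s: None)

print(first == second, len(calls))  # True 2 (the second call is served from disk)
```

Injecting `fetch`, `now`, and `sleep` keeps the sketch testable without network access or real waiting; a real extractor would pass an HTTP client call instead.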
5.4 Shared Pipeline
Both API extractors feed through `convert2df()` and inherit the same transformation, type contracts, SR calculation, and validation as file-based sources, with no duplicated logic.
6. Shiny Dashboard Integration
`app.py` exposes a new API Data Retrieval panel:
- … `df`, immediately enabling all downstream analytical modules
Verified live end-to-end:
- `http://127.0.0.1:8000` → Data → API → "machine learning" / OpenAlex / 20 records
- `DB | UT | TI | PY | AU | TC` columns populated
A second panel, "Load a Standardized CSV", re-imports any CSV produced by the ETL pipeline, re-validates it against the WoS schema, and renders a column-coverage map.
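Re-importing an exported CSV means list fields come back as `;`-delimited strings and have to be turned into lists again. A minimal stdlib sketch of that round trip follows; `normalize_list_field` is named in the report, but its signature here is an assumption, and `export_csv` / `load_csv` are hypothetical helpers covering only two list columns.

```python
import csv
import io

LIST_FIELDS = ("AU", "CR")  # subset of the list-typed columns, for brevity


def normalize_list_field(value, sep=";"):
    """Return list[str] for a list, a delimited string, or a missing value."""
    if isinstance(value, list):
        return [str(v).strip() for v in value if str(v).strip()]
    if isinstance(value, str):
        return [p.strip() for p in value.split(sep) if p.strip()]
    return []  # None, NaN, or any other non-list value


def export_csv(records, columns):
    """Write records to CSV text, joining list fields with ';'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    for rec in records:
        writer.writerow(
            {c: ";".join(rec[c]) if c in LIST_FIELDS else rec[c] for c in columns}
        )
    return buf.getvalue()


def load_csv(text):
    """Read CSV text back and restore list fields to list[str]."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for c in LIST_FIELDS:
            if c in row:
                row[c] = normalize_list_field(row[c])
        rows.append(row)
    return rows


records = [{"AU": ["Smith J", "Doe A"], "TI": "A study", "CR": []}]
reloaded = load_csv(export_csv(records, ["AU", "TI", "CR"]))
print(reloaded[0])
# {'AU': ['Smith J', 'Doe A'], 'TI': 'A study', 'CR': []}
```

Without the normalization step, iterating over a reloaded `AU` value would walk single characters of the string, which is the failure mode the report describes for reloaded CSVs.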
7. Performance Benchmarks
8. Function Patches — Removing Hardcoded WoS-Specific Logic
8.1 `df.get()` reactive-value pattern (39 files)
8.2 `histNetwork`: case-insensitive DB + non-WoS routing
The function compared `db == "Web_of_Science"` (case-sensitive) and rejected everything else. Now matches `db.upper().replace("-", "_")` against an accepted set and routes non-WoS sources through the Scopus-compatible code path.
8.3 Empty `CR` guard
For sources without cited references, `histNetwork` returns `None` gracefully; callers check for `None` and short-circuit.
8.4 NaN-on-empty-data guards (8 functions)
Functions computing `int(max_x)` from possibly-empty Series now guard against NaN/zero with a safe default.
8.5 `get_thematicmap` column count bug
Original code joined `words` into a comma-separated string then re-split, losing alignment with `sC`. Patched to keep the values as a list throughout.
8.6 `get_factorialanalysis` infinity guard
Default `topWordPlot=np.inf` was cast directly via `int()`. Patched to treat infinity as "all rows".
8.7 `cocMatrix` in-place mutation of shared DataFrame
`cocMatrix` set `M.index = M["SR"]` by reference. Every subsequent module reading the shared reactive `df.get()` found `SR` as both index and column, crashing with `'SR' is both an index level and a column label, which is ambiguous`. Fixed by taking a defensive `.copy()` at function entry. This affected all databases, including WoS.
8.8 `metaTagExtraction` (`SR`): infinite-loop / `chr()` overflow
The SR de-duplication loop appended `-{chr(96+i)}` to duplicate SR values. When a record produced `NaN` as SR, `NaN + "-a"` stayed `NaN`, so the loop spun ~1.1M times until `chr()` exceeded the Unicode range. Fixed by filling the missing journal field and replacing the loop with a single-pass vectorized suffixer.
8.9 `histNetwork` (`wos` branch): non-iterable `CR` guard
When `CR` was a `NaN` float (empty-CR rows in the bundled WoS sample), iterating it raised `TypeError: 'float' object is not iterable`. Fixed by normalising `CR` to a list first.
8.10 `histNetwork` (`wos` branch): empty local-citation matrix guard
`cocMatrix(..., Field="LCR")` returns `None` for sparse datasets. The next line did `set(WLCR.columns)`, raising `AttributeError: 'NoneType'`. Added a guard that falls back to an empty zero self-matrix.
8.11 `metaTagExtraction` (`AU_CO`): non-iterable affiliation guard
Country extraction iterated `C1.iloc[i]`, which was a `NaN` float for records with no affiliation. Fixed by treating any non-list affiliation as empty; confirmed live in the dashboard (Main Information and Countries Production now render).
8.12 `metaTagExtraction` (`SR`): list/string/NaN author normalization
The SR builder did `[x.strip() for x in l]` over `AU`, assuming a list. When `AU` was a `";"`-delimited string (reloaded CSV), it iterated single characters and produced garbage. Normalised `AU` to a list (pass lists, split strings on `;`, map missing to `[]`).
8.13 `histNetwork` (`scopus` branch): list/string/NaN `CR` normalization
The Scopus path assumed `CR` was already a list. Reloaded flat data supplies `CR` as a `";"`-delimited string or `NaN`. Normalised `CR` to lists first, making the historiograph render on Scopus data; confirmed live in the dashboard.
9. Standard Column Glossary: All 24 Columns Present
10. Test Results
10.1 Automated Test Suite
10.2 Function Compatibility Matrix (28 functions × 5 sources)
10.3 Single Remaining Limitation
`get_thematicevolution` requires user-provided year breakpoints from the Shiny reactive context; it is interactive by design and cannot be tested headlessly. It works correctly when called from the Shiny UI.
10.4 Continuous Integration
`.github/workflows/etl-tests.yml` runs on every push and PR across Python 3.10, 3.11, and 3.12.
11. How to Reproduce
12. Files Changed
New ETL package:
- `www/services/etl/`: dispatcher, extractors (7), mappings (6), transform, validation, export, cache
Tests:
- `tests/conftest.py`, `tests/etl/test_core_etl.py`, `tests/etl/test_all_sources.py`, `tests/etl/test_function_compatibility.py`, `tests/etl/test_full_compat_matrix.py`, `tests/run_etl.py`
Notebooks & CI:
- `notebooks/ETL_Demonstration.ipynb`, `.github/workflows/etl-tests.yml`
Modified (dashboard):
- `app.py`: API Data Retrieval + Standardized CSV Loader panels
Modified (WoS-bug patches): 33 files in `functions/`, 7 files in `www/services/`
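Among the patches counted above, the single-pass duplicate suffixing from section 8.8 is the one most easily shown in isolation. The sketch below is a plain-Python stand-in for the vectorized version, not the patched `metaTagExtraction` code. It assumes the first occurrence keeps its value and later duplicates receive `-a`, `-b`, and so on; the `"NA"` placeholder for missing values and both function names are hypothetical.

```python
from collections import Counter
from string import ascii_lowercase


def letter_suffix(n: int) -> str:
    """Map 0 -> 'a', 25 -> 'z', 26 -> 'aa', ... without ever calling chr()."""
    letters = ""
    n += 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = ascii_lowercase[rem] + letters
    return letters


def suffix_duplicates(values, placeholder="NA"):
    """Make values unique in one pass; missing entries become the placeholder."""
    cleaned = [v if isinstance(v, str) and v else placeholder for v in values]
    seen = Counter()
    out = []
    for v in cleaned:
        if seen[v] > 0:
            # Second occurrence gets '-a', third '-b', and so on.
            out.append(f"{v}-{letter_suffix(seen[v] - 1)}")
        else:
            out.append(v)
        seen[v] += 1
    return out


sr = ["SMITH J, 2020, NATURE", "SMITH J, 2020, NATURE", float("nan"), "SMITH J, 2020, NATURE"]
print(suffix_duplicates(sr))
# ['SMITH J, 2020, NATURE', 'SMITH J, 2020, NATURE-a', 'NA', 'SMITH J, 2020, NATURE-b']
```

Because the suffix is derived from a per-value counter rather than a retry loop, a missing value can no longer keep the loop running, and suffixes beyond `z` roll over to `aa` instead of leaving the alphabet.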