
From Heterogeneous Bibliographic Data to a Unified Schema: Source-Agnostic ETL Pipeline for Bibliometrix-Python#16

Open
ideepkush wants to merge 11 commits into PRAISELab-PicusLab:main from ideepkush:add-source-agnostic-etl

ideepkush commented Jun 12, 2026

Group Members

| Name | Matricola |
| --- | --- |
| Deepak Kushwaha | D03000258 |
| Subhadip Maity | D03000291 |
| Vedant Gajanan Pawar | D03000257 |
| Vishal Kumar | D03000263 |

Course: Data Science — Academic Year 2025/2026
Professor: Vincenzo Moscato


1. Summary

This contribution adds a source-agnostic ETL pipeline (www/services/etl/) to Bibliometrix-Python. The pipeline converts bibliographic data from 7 sources — Scopus, Dimensions, PubMed (file + API), OpenAlex, Cochrane, and Lens.org — into the standardized Web of Science (WoS) schema expected by the analytical functions in functions/ and www/services/.

| Metric | Value |
| --- | --- |
| Sources supported | 7 (5 file + 2 API) |
| Required columns guaranteed | 24 (full WoS glossary) |
| Files patched for WoS-bug compatibility | 40+ |
| Automated tests | 65 passing |
| Function compatibility | 96% (135/140; 27/28 functions × 5 sources) |
| Throughput | up to 8,800 records/sec (Cochrane) |
| CI/CD | GitHub Actions across Python 3.10/3.11/3.12 |
| Dashboard integration | API query panel + Standardized CSV loader |

2. Architecture

                ┌────────────────────────────────────┐
                │  convert2df(source, ...)           │  ← single public entry
                └──────────────┬─────────────────────┘
                               │
                ┌──────────────▼─────────────────────┐
                │  Dispatcher (SOURCE_REGISTRY)      │
                │  routes by source name             │
                └──────────────┬─────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────────┐
        │                      │                          │
   ┌────▼────────┐    ┌────────▼─────────┐    ┌──────────▼─────────┐
   │ Extractors  │    │ Mappings (dicts) │    │ Transform pipeline │
   │ (7 sources) │    │ raw col → WoS    │    │ rename→types→SR    │
   └─────────────┘    └──────────────────┘    └──────────┬─────────┘
                                                          │
                                              ┌───────────▼──────────┐
                                              │ Validation (24 cols, │
                                              │  no NaN, list types) │
                                              └───────────┬──────────┘
                                                          │
                                              ┌───────────▼──────────┐
                                              │ Standardized DF      │
                                              │ → CSV / Dashboard /  │
                                              │   Analytical funcs   │
                                              └──────────────────────┘

2.1 Dispatcher Pattern with Plugin API

www/services/etl/dispatcher.py exposes a single registry plus a public register_source() API for third-party extensions:

SOURCE_REGISTRY = {
    "SCOPUS":      {"extractor": ScopusCSVExtractor,      "mapping": SCOPUS_MAPPING,      "mode": "file"},
    "DIMENSIONS":  {"extractor": DimensionsExcelExtractor,"mapping": DIMENSIONS_MAPPING,  "mode": "file"},
    "PUBMED_FILE": {"extractor": PubMedFileExtractor,     "mapping": PUBMED_MAPPING,      "mode": "file"},
    "OPENALEX":    {"extractor": OpenAlexAPIExtractor,    "mapping": OPENALEX_MAPPING,    "mode": "api"},
    "PUBMED_API":  {"extractor": PubMedAPIExtractor,      "mapping": PUBMED_MAPPING,      "mode": "api"},
    "COCHRANE":    {"extractor": CochraneFileExtractor,   "mapping": COCHRANE_MAPPING,    "mode": "file"},
    "LENS":        {"extractor": LensCSVExtractor,        "mapping": LENS_MAPPING,        "mode": "file"},
}
# Plugin API — third-party packages can add new sources without modifying core code
register_source("MY_DB", MyExtractor, MY_MAPPING, mode="file")
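A minimal sketch of how such a registry and its registration hook could work. The body of `register_source()` is not shown in this PR, so the validation rules and the `resolve_source()` helper below are assumptions for illustration, not the code in dispatcher.py.

```python
from typing import Any, Dict

# Hypothetical stand-in for the registry held in dispatcher.py.
SOURCE_REGISTRY: Dict[str, Dict[str, Any]] = {}


def register_source(name: str, extractor: type, mapping: Dict[str, str],
                    mode: str = "file") -> None:
    """Register an extractor class and its column mapping under a source name."""
    if mode not in {"file", "api"}:
        raise ValueError(f"mode must be 'file' or 'api', got {mode!r}")
    key = name.upper()  # assumption: source names are matched case-insensitively
    if key in SOURCE_REGISTRY:
        raise ValueError(f"source {key!r} is already registered")
    SOURCE_REGISTRY[key] = {"extractor": extractor, "mapping": mapping, "mode": mode}


def resolve_source(name: str) -> Dict[str, Any]:
    """Hypothetical lookup helper: fail loudly on unknown sources."""
    try:
        return SOURCE_REGISTRY[name.upper()]
    except KeyError:
        raise ValueError(
            f"unknown source {name!r}; known: {sorted(SOURCE_REGISTRY)}"
        ) from None


class MyExtractor:
    """Placeholder extractor used only for this example."""


register_source("my_db", MyExtractor, {"Title": "TI"}, mode="file")
entry = resolve_source("MY_DB")
```

Keeping the registry a plain dict means a third-party package only needs to call `register_source()` at import time; the dispatcher itself never changes.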

2.2 Mapping Dictionaries (declarative, not procedural)

Each source has a dedicated mapping file under www/services/etl/mappings/: scopus_mapping.py, dimensions_mapping.py, pubmed_mapping.py, openalex_mapping.py, cochrane_mapping.py, lens_mapping.py. These are pure Python dicts of {"source_column": "WoS_field_tag"} — no conditional branching, no hardcoded source-specific logic.
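To illustrate the declarative approach, here is a sketch of how a mapping dict can drive the rename step. `EXAMPLE_MAPPING` is an illustrative subset, not the contents of scopus_mapping.py, and dropping unmapped columns is an assumption about how the renamer behaves.

```python
import pandas as pd

# Illustrative subset only; the real SCOPUS_MAPPING lives in scopus_mapping.py.
EXAMPLE_MAPPING = {
    "Authors": "AU",
    "Title": "TI",
    "Year": "PY",
    "Source title": "SO",
    "Cited by": "TC",
}


def rename_columns(raw: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Keep only the columns named in the mapping and rename them to WoS tags."""
    present = [col for col in raw.columns if col in mapping]
    return raw[present].rename(columns=mapping)


raw = pd.DataFrame({
    "Authors": ["Smith J.; Doe A."],
    "Title": ["A study"],
    "Year": [2021],
    "EID": ["2-s2.0-1"],  # not in the mapping, so it is dropped
})
renamed = rename_columns(raw, EXAMPLE_MAPPING)
```

Because the function takes the mapping as data, supporting a new source requires a new dict rather than a new code path.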

2.3 Type Contracts

| Field group | Python type | Null default |
| --- | --- | --- |
| AU, AF, C1, CR, DE, ID | list[str] | [] |
| TC, PY | int | 0 |
| All other (16 fields) | str | "" |
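The contract above can be sketched as a single per-value coercion function. `apply_contract()` and its handling of malformed numbers are assumptions for illustration; type_contracts.py may be structured differently.

```python
import math

LIST_FIELDS = {"AU", "AF", "C1", "CR", "DE", "ID"}
INT_FIELDS = {"TC", "PY"}


def _is_missing(value) -> bool:
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_contract(tag: str, value, sep: str = ";"):
    """Coerce one cell to the type required for its WoS tag."""
    if tag in LIST_FIELDS:
        if _is_missing(value):
            return []
        if isinstance(value, (list, tuple)):
            return [str(v).strip() for v in value if str(v).strip()]
        # Delimited string: split and drop empty fragments.
        return [part.strip() for part in str(value).split(sep) if part.strip()]
    if tag in INT_FIELDS:
        if _is_missing(value):
            return 0
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0  # assumption: unparseable numbers fall back to the null default
    return "" if _is_missing(value) else str(value)
```

For example, `apply_contract("AU", "Smith J; Doe A")` yields a two-element list, while a missing `TC` becomes `0`.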

2.4 SR Calculated Field

SR (Short Reference) follows the "Author, Year, Journal" format and is populated for every record via the existing metaTagExtraction logic.

2.5 Validation Module

Programmatically verifies:

  1. All 24 mandatory columns exist
  2. No NaN or None values remain
  3. Multi-value columns are real list[str]
  4. PY is a 4-digit year integer (or 0)
  5. DB is populated for every row
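The five checks can be sketched for a single record as follows. This version works on a plain dict rather than a DataFrame, and the error messages are invented for the example; validation/validator.py may report problems differently.

```python
import math

REQUIRED = ("DB UT DI PMID TI SO JI LA TC AU AF C1 DT PY "
            "RP CR DE ID AB VL IS BP EP SR").split()
LIST_FIELDS = ("AU", "AF", "C1", "CR", "DE", "ID")


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    # 1. All mandatory columns exist.
    errors = [f"missing column {tag}" for tag in REQUIRED if tag not in record]
    # 2. No NaN or None values remain.
    for tag, value in record.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            errors.append(f"{tag} is null")
    # 3. Multi-value columns are real lists of strings.
    for tag in LIST_FIELDS:
        value = record.get(tag, [])
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            errors.append(f"{tag} is not list[str]")
    # 4. PY is a 4-digit year integer, or 0.
    year = record.get("PY", 0)
    if not (isinstance(year, int) and (year == 0 or 1000 <= year <= 9999)):
        errors.append("PY is not a 4-digit year or 0")
    # 5. DB is populated.
    if not record.get("DB"):
        errors.append("DB is empty")
    return errors


good = {tag: "" for tag in REQUIRED}
good.update({tag: [] for tag in LIST_FIELDS})
good.update({"DB": "SCOPUS", "PY": 2021, "TC": 0})
```

Returning a list of problems instead of raising on the first one lets a caller show every violation at once.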

3. Limitations of Original Python Implementation — Solution Matrix

| # | Original limitation | Where addressed |
| --- | --- | --- |
| 1 | No single entry-point like convert2df() | convert.py::convert_to_bibliometrix_df() + convert2df alias |
| 2 | Scattered transformation logic | transform/pipeline.py orchestrator |
| 3 | Weak type enforcement | transform/type_contracts.py |
| 4 | Poor NaN/None handling | transform/normalizer.py |
| 5 | Implicit WoS dependency | Mapping dicts + case-insensitive DB matching in histNetwork |
| 6 | Incomplete column mapping | 24-column TARGET schema enforced |
| 7 | Non-standard reference parsing | Reference parsing in extractors + normalize_list_field |

4. ETL Pipeline Phases

| Phase | Module | Responsibility |
| --- | --- | --- |
| 1. Extract | extractors/ (7 files) | Source-specific raw load (CSV / XLSX / TXT / REST JSON / XML) |
| 2. Transform: Rename | transform/renamer.py | Map raw columns → WoS tags via mapping dicts |
| 2. Transform: Type contracts | transform/type_contracts.py | Cast values to required types, split delimited strings into lists |
| 2. Transform: Schema completion | transform/schema_completion.py | Add missing columns with typed defaults |
| 4. Calculated Fields | transform/calculated_fields.py | SR (Short Reference), reuses existing metaTagExtraction |
| 5. Validation | validation/validator.py | Schema, type, and null checks |
| 6. Load (Export) | export/csv_exporter.py | CSV serialization with ; delimiter for list fields |

No monolithic function — each phase is a separate module with explicit boundaries.
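As an illustration of the export phase, here is a sketch of serializing list fields with a `;` delimiter and splitting them again on reload. It uses the standard-library `csv` module for brevity; `export_rows()` and `import_rows()` are hypothetical names, and export/csv_exporter.py may well use pandas instead.

```python
import csv
import io

LIST_FIELDS = {"AU", "AF", "C1", "CR", "DE", "ID"}


def export_rows(rows, fieldnames) -> str:
    """Write records to CSV text, joining list values with ';'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: ";".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        })
    return buffer.getvalue()


def import_rows(text: str) -> list:
    """Read CSV text back, splitting the list fields on ';'."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for tag in LIST_FIELDS & row.keys():
            row[tag] = [part for part in row[tag].split(";") if part]
        rows.append(row)
    return rows


rows = [{"TI": "A study", "AU": ["Smith J", "Doe A"], "PY": 2021}]
text = export_rows(rows, fieldnames=["TI", "AU", "PY"])
reloaded = import_rows(text)
```

Note that integers come back as strings after a CSV round trip, which is why a reloaded file still has to pass through the type contracts.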


5. Advanced Level — API Extraction

5.1 OpenAlex

  • Endpoint: https://api.openalex.org/works
  • Pagination: page + per-page parameters
  • Rate limit: HTTP 429 → exponential backoff (time.sleep(2**attempt))
  • Retries: 3 attempts per request
  • Abstract reconstruction from inverted index; author/institution/concept normalization
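OpenAlex returns abstracts as an inverted index (the `abstract_inverted_index` field maps each word to the positions where it occurs) rather than as plain text. A minimal sketch of the reconstruction step, with `rebuild_abstract()` as a hypothetical helper name:

```python
def rebuild_abstract(inverted_index) -> str:
    """Turn {word: [positions]} back into a plain-text abstract."""
    if not inverted_index:
        return ""  # OpenAlex sends null when no abstract is available
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))


example = {"Machine": [0], "learning": [1, 3], "and": [2]}
abstract = rebuild_abstract(example)
```

A word that occurs several times appears once in the index with several positions, so it has to be emitted once per position.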

5.2 PubMed API

  • NCBI ESearch + EFetch endpoints; XML payload parsing with xml.etree.ElementTree
  • Same retry / backoff strategy
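The retry and backoff strategy shared by both API extractors can be sketched as below. To keep the example runnable without network access, the HTTP call is passed in as a `fetch` callable returning `(status, payload)`; that interface, and the choice to fail immediately on non-429 errors, are assumptions and not the extractors' actual signatures.

```python
import time
from typing import Callable, Tuple


def get_with_backoff(fetch: Callable[[], Tuple[int, object]],
                     max_attempts: int = 3,
                     sleep: Callable[[float], None] = time.sleep):
    """Call fetch() up to max_attempts times, backing off on HTTP 429."""
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        if status == 429 and attempt < max_attempts - 1:
            sleep(2 ** attempt)  # 1s, then 2s, doubling on each retry
            continue
        raise RuntimeError(
            f"request failed with HTTP {status} after {attempt + 1} attempt(s)"
        )
    raise RuntimeError("no attempts were made")


# Simulated server: rate-limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
waits = []
result = get_with_backoff(lambda: next(responses), sleep=waits.append)
```

Injecting `sleep` keeps tests fast: the recorded waits confirm the backoff schedule without actually pausing.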

5.3 Caching Layer (cache.py)

Every API GET is cached on disk for 24 hours (SHA-1 of url + params as key), reducing repeated network calls during notebook runs, CI, and dashboard reloads.
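A sketch of a disk cache along these lines. The key derivation (URL plus JSON-serialized, key-sorted params), the JSON file layout, and the `cached_get()` signature are assumptions; cache.py may differ in all three.

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cache_key(url: str, params: dict) -> str:
    """SHA-1 of the URL plus params, with params sorted for stability."""
    canonical = url + "?" + json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def cached_get(url: str, params: dict, fetch, cache_dir: Path, now=time.time):
    """Return a cached payload if it is fresh, otherwise fetch and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(url, params)}.json"
    if path.exists() and now() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text(encoding="utf-8"))
    payload = fetch(url, params)  # fetch is any callable returning JSON-serializable data
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

Sorting the params before hashing means two requests that differ only in parameter order share one cache entry.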

5.4 Shared Pipeline

Both API extractors feed through convert2df() and inherit the same transformation, type contracts, SR calculation, and validation as file-based sources — no duplicated logic.


6. Shiny Dashboard Integration

app.py exposes a new API Data Retrieval panel:

  • Sidebar: Data → API — platform selector (OpenAlex / PubMed), search query, max-records input, "Fetch from API" button
  • Real-time progress feedback and standardized preview table after retrieval
  • Fetched DataFrame is pushed into the reactive df, immediately enabling all downstream analytical modules

Verified live end-to-end:

  1. http://127.0.0.1:8000 → Data → API → "machine learning" / OpenAlex / 20 records
  2. "✅ Successfully retrieved 20 records from OPENALEX and standardized into the WoS schema"
  3. Preview table shows DB | UT | TI | PY | AU | TC columns populated

A second panel — "Load a Standardized CSV" — re-imports any CSV produced by the ETL pipeline, re-validates it against the WoS schema, and renders a column-coverage map.


7. Performance Benchmarks

| Source | Records | ETL Time | Throughput |
| --- | --- | --- | --- |
| SCOPUS | 1,000 | 0.40s | 2,503 rec/s |
| DIMENSIONS | 501 | 0.14s | 3,673 rec/s |
| PUBMED_FILE | 10,000 | 1.82s | 5,481 rec/s |
| COCHRANE | 1,126 | 0.13s | 8,801 rec/s |
| LENS | 1,000 | 0.18s | 5,550 rec/s |

8. Function Patches — Removing Hardcoded WoS-Specific Logic

8.1 df.get() reactive-value pattern (39 files)

# Before
data = df.get()
# After
data = df if isinstance(df, pd.DataFrame) else df.get()

8.2 histNetwork — case-insensitive DB + non-WoS routing

The function compared db == "Web_of_Science" (case-sensitive) and rejected everything else. Now matches db.upper().replace("-", "_") against an accepted set and routes non-WoS sources through the scopus-compatible code path.
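A sketch of the routing rule described here. The normalization expression is taken from the description above, but the contents of the accepted set and the `route_db()` helper are assumptions for illustration; the actual set in histNetwork is not shown in this PR.

```python
# Hypothetical accepted set; the real one lives in histNetwork.
ACCEPTED_WOS = {"WEB_OF_SCIENCE", "WOS", "ISI"}


def route_db(db: str) -> str:
    """Pick the histNetwork code path for a DB label."""
    key = db.upper().replace("-", "_")
    # Anything that is not WoS goes through the scopus-compatible path.
    return "wos" if key in ACCEPTED_WOS else "scopus"
```

With this rule, `"Web_of_Science"`, `"web-of-science"`, and `"WEB_OF_SCIENCE"` all select the same branch, and an unknown source is routed rather than rejected.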

8.3 Empty CR guard

For sources without cited references, histNetwork returns None gracefully; callers check for None and short-circuit.

8.4 NaN-on-empty-data guards (8 functions)

Functions computing int(max_x) from possibly-empty Series now guard against NaN/zero with a safe default.
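A minimal sketch of such a guard. `safe_int_max()` and the default of 1 are hypothetical; each patched function may choose its own fallback value.

```python
import math


def safe_int_max(values, default: int = 1) -> int:
    """int(max(values)) that survives empty input, NaN, and zero."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        return default  # empty or all-NaN input
    top = int(max(finite))
    return top if top > 0 else default
```

The function accepts any iterable of numbers, so it works on a list as well as on a pandas Series.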

8.5 get_thematicmap column count bug

The original code joined words into a comma-separated string and then re-split it, losing alignment with sC. Patched to keep the words as a list throughout.

8.6 get_factorialanalysis infinity guard

Default topWordPlot=np.inf was cast directly via int(). Patched to treat infinity as "all rows".
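The failure is easy to reproduce: `int(float("inf"))` raises `OverflowError`. A sketch of the guard, with `resolve_top_n()` as a hypothetical helper; treating NaN the same way as infinity is an added assumption.

```python
import math


def resolve_top_n(top_n, n_rows: int) -> int:
    """Translate a possibly infinite 'top N' setting into a row count."""
    if top_n is None or not math.isfinite(top_n):
        return n_rows  # infinity (or NaN/None) means "all rows"
    return max(0, min(int(top_n), n_rows))
```

Clamping to `n_rows` also covers the case where a finite setting exceeds the number of available rows.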

8.7 cocMatrix in-place mutation of shared DataFrame

cocMatrix set M.index = M["SR"] by reference. Every subsequent module reading the shared reactive df.get() found SR as both index and column, crashing with 'SR' is both an index level and a column label, which is ambiguous. Fixed by taking a defensive .copy() at function entry — this affected all databases including WoS.
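The difference between the two behaviours can be shown on a toy frame. The two functions below are simplified stand-ins, not the real cocMatrix:

```python
import pandas as pd


def coc_matrix_mutating(M: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the old behaviour: re-indexes the caller's frame."""
    M.index = M["SR"]  # the shared frame now has an index named "SR"
    return M


def coc_matrix_safe(M: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the patched behaviour: work on a private copy."""
    M = M.copy()
    M.index = M["SR"]
    return M


shared = pd.DataFrame({"SR": ["A, 2020, J", "B, 2021, K"], "TC": [3, 5]})

coc_matrix_safe(shared)
index_name_after_safe = shared.index.name      # caller's frame is untouched

coc_matrix_mutating(shared)
index_name_after_mutation = shared.index.name  # caller's frame was changed
```

After the mutating call, `SR` exists both as the index name and as a column of the shared frame, which is the state that later label lookups reject as ambiguous.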

8.8 metaTagExtraction (SR) — infinite-loop / chr() overflow

The SR de-duplication loop appended -{chr(96+i)} to duplicate SR values. When a record produced NaN as SR, NaN + "-a" stayed NaN, so the loop spun ~1.1M times until chr() exceeded Unicode range. Fixed by filling the missing journal field and replacing the loop with a single-pass vectorized suffixer.

8.9 histNetwork (wos branch) — non-iterable CR guard

When CR was a NaN float (empty-CR rows in the bundled WoS sample), iterating it raised TypeError: 'float' object is not iterable. Fixed by normalising CR to a list first.

8.10 histNetwork (wos branch) — empty local-citation matrix guard

cocMatrix(..., Field="LCR") returns None for sparse datasets. The next line did set(WLCR.columns), raising AttributeError: 'NoneType'. Added a guard that falls back to an empty zero self-matrix.

8.11 metaTagExtraction (AU_CO) — non-iterable affiliation guard

Country extraction iterated C1.iloc[i] which was a NaN float for records with no affiliation. Fixed by treating any non-list affiliation as empty — confirmed live in the dashboard (Main Information and Countries Production now render).

8.12 metaTagExtraction (SR) — list/string/NaN author normalization

The SR builder did [x.strip() for x in l] over AU, assuming a list. When AU was a ";"-delimited string (reloaded CSV) it iterated single characters and produced garbage. Normalised AU to a list (pass lists, split strings on ;, map missing to []).

8.13 histNetwork (scopus branch) — list/string/NaN CR normalization

The Scopus path assumed CR was already a list. Reloaded flat data supplies CR as a ";"-delimited string or NaN. Normalised CR to lists first, making the historiograph render on Scopus data — confirmed live in the dashboard.


9. Standard Column Glossary — All 24 Columns Present

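| Tag | Type | Tag | Type | Tag | Type | Tag | Type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DB | str | LA | str | RP | str | IS | str |
| UT | str | TC | int | CR | list | BP | str |
| DI | str | AU | list | DE | list | EP | str |
| PMID | str | AF | list | ID | list | SR | str |
| TI | str | C1 | list | AB | str | | |
| SO | str | DT | str | VL | str | | |
| JI | str | PY | int | | | | |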

10. Test Results

10.1 Automated Test Suite

Total tests passing:  65
Test files:           4 (test_core_etl, test_all_sources,
                         test_function_compatibility, test_full_compat_matrix)
Per-source schema compliance:    5/5 sources ✅
Per-source type contracts:      25/25 checks ✅

10.2 Function Compatibility Matrix (28 functions × 5 sources)

| Source | Records | Pass Rate |
| --- | --- | --- |
| SCOPUS | 1,000 | 27 / 28 (96%) |
| DIMENSIONS | 501 | 27 / 28 (96%) |
| PUBMED | 10,000 | 27 / 28 (96%) |
| COCHRANE | 1,126 | 27 / 28 (96%) |
| LENS | 1,000 | 27 / 28 (96%) |
| TOTAL | 13,627 | 135 / 140 (96%) |

10.3 Single Remaining Limitation

get_thematicevolution requires user-provided year breakpoints from the Shiny reactive context — it is interactive by design and cannot be tested headlessly. It works correctly when called from the Shiny UI.

10.4 Continuous Integration

.github/workflows/etl-tests.yml runs on every push and PR across Python 3.10, 3.11, and 3.12.


11. How to Reproduce

# Run all tests
pytest tests/etl/ -v -s

# CLI sweep over all 5 file sources
python tests/run_etl.py --sweep

# Process a single source
python tests/run_etl.py --source COCHRANE --file sources/Cochrane/citation-export.txt

# Live API query
python tests/run_etl.py --source OPENALEX --query "machine learning" --max 50

# Launch the dashboard
shiny run app.py
# Open http://127.0.0.1:8000 → Sidebar → Data → API

12. Files Changed

New ETL package: www/services/etl/ — dispatcher, extractors (7), mappings (6), transform, validation, export, cache

Tests: tests/conftest.py, tests/etl/test_core_etl.py, tests/etl/test_all_sources.py, tests/etl/test_function_compatibility.py, tests/etl/test_full_compat_matrix.py, tests/run_etl.py

Notebooks & CI: notebooks/ETL_Demonstration.ipynb, .github/workflows/etl-tests.yml

Modified (dashboard): app.py — API Data Retrieval + Standardized CSV Loader panels

Modified (WoS-bug patches): 33 files in functions/, 7 files in www/services/

ideepkush added 11 commits May 13, 2026 01:04
Implements www/services/etl/ — a modular ETL pipeline that converts
bibliographic data from Scopus, Dimensions, PubMed (file + API),
and OpenAlex into the standardized Web of Science schema expected
by the analytical functions in functions/ and www/services/.

Architecture:
- Single entry point: convert_to_bibliometrix_df()
- Dispatcher pattern routing to 5 source-specific extractors
- Mapping dictionaries (no hardcoded if/else)
- Type contracts: list[str] for AU/AF/C1/CR/DE/ID, int for TC/PY
- Null handling: empty string / 0 / [] defaults
- SR calculated field generation
- Validation engine (24-column schema check)
- CSV export with semicolon delimiters for list fields

Advanced level features:
- OpenAlex and PubMed REST API extractors
- Pagination, rate-limit handling (HTTP 429), exponential-backoff retries
- API extractors reuse the same transformation pipeline (no duplicated logic)

Honors bonus:
- API Data Retrieval panel integrated into Shiny dashboard (app.py)
- Live query → standardized DataFrame → ready for analysis

Function patches (per exam: 'debug and patch hardcoded WoS logic'):
- 39 files patched for df.get() reactive-value pattern compatibility
- 2 service files patched for df.set() pattern
- 7 files: missing 'from typing import List' imports added
- histNetwork: case-insensitive DB matching, non-WoS source routing
- Empty CR guard in citation-network functions
- NaN guards in plot-axis tick calculations across 8 functions
- Fixed thematicmap column count alignment bug
- Fixed factorialanalysis infinity overflow
- biblionetwork / cocMatrix: explicit None-result propagation

Tests:
- 12/12 automated tests pass
- 96% function compatibility on Scopus, Dimensions, PubMed

See PROJECT_REPORT.md for full architecture and patch documentation.
…alias

- convert2df() alias matching the R original (per exam Section 4)
- tests/run_etl.py: CLI tool with --sweep, --source/--file/--query, --strict, --mailto
- notebooks/ETL_Demonstration.ipynb: 10-cell walkthrough of the pipeline
- Dashboard: 'Load a Standardized CSV' panel with pill-badge column coverage
- tests/etl/test_full_compat_matrix.py: parametrized matrix across all sources
…fixtures

New sources (now 7 total supported):
- Cochrane Library citation export (CochraneFileExtractor + COCHRANE_MAPPING)
- Lens.org CSV export (LensCSVExtractor + LENS_MAPPING)

Plugin architecture:
- dispatcher.py exposes register_source() for runtime registration of
  third-party extractors, enabling extension without core modifications

Production hardening:
- www/services/etl/cache.py: SHA-1 keyed on-disk cache for API responses
  with 24h TTL — speeds up notebooks, CI runs, and dashboard reloads

Test infrastructure:
- tests/conftest.py: shared session-scoped fixtures for all 5 file sources
- tests/etl/test_all_sources.py: 35 schema + type-contract tests across
  all sources (covers Scopus, Dimensions, PubMed, Cochrane, Lens)
- Total tests now: 65 passing

CI/CD:
- .github/workflows/etl-tests.yml: GitHub Actions matrix across
  Python 3.10 / 3.11 / 3.12 with ETL core, schema, CLI, and 7-source tests

CLI:
- tests/run_etl.py --sweep now processes all 5 file-based sources

Documentation:
- PROJECT_REPORT.md: architecture diagram, problem to solution matrix,
  performance benchmarks (up to 8,800 records/sec)
- Section 10 now documents the function compatibility matrix:
  - 5 sources tested (Scopus, Dimensions, PubMed, Cochrane, Lens)
  - 28 analytical functions per source
  - 135/140 pass rate (96%) across all sources
  - List of all 27 functions that pass per source
  - Single remaining limitation (get_thematicevolution) explained
- Section 6: dashboard integration restructured (API + CSV loader)
- Removed promotional phrasing for neutral, factual tone
- Tightened section titles for a professional contribution voice
ETL pipeline:
- Dimensions extractor skipped the export banner row and missed the
  "PubYear" column, producing 100% empty standardized data; fix skiprows
  + add PubYear -> PY mapping
- SR (Short Reference) now invokes the existing metaTagExtraction/SR()
  function instead of reimplementing it (per project requirement)
- Route raw Scopus/Dimensions/PubMed/Lens/Cochrane dashboard imports
  through convert2df, with a safe fallback to the legacy parser

Analytical-function robustness (no longer crash on real data):
- cocMatrix: defensive copy so it no longer mutates the shared reactive
  DataFrame ("'SR' is both an index level and a column label")
- metaTagExtraction.SR: normalize AU (list/str/NaN) + single-pass,
  overflow-proof duplicate suffixing (fixes chr() overflow loop)
- metaTagExtraction.AU_CO: guard non-list affiliations (fixes Main
  Information + country panels)
- histNetwork wos()/scopus(): normalize CR (list/str/NaN) and guard the
  empty local-citation matrix
- Auto-download required NLTK corpora (stopwords, wordnet) so text-mining
  functions work on a fresh environment

Docs: PROJECT_REPORT sections 8.10-8.16, add TESTING.md
Drop dead assets (fonts, JS libs, static images), orphaned source data,
and redundant .gitignore entries. All tests still pass.
ideepkush closed this Jun 12, 2026
ideepkush reopened this Jun 12, 2026
ideepkush changed the title from "Add source agnostic etl" to "From Heterogeneous Bibliographic Data to a Unified Schema: Source-Agnostic ETL Pipeline for Bibliometrix-Python" Jun 12, 2026