Add Advanced OpenAlex ETL pipeline for WoS-like standardization#15

Open
francyfra2709-tech wants to merge 3 commits into PRAISELab-PicusLab:main from francyfra2709-tech:advanced-etl-openalex
Advanced OpenAlex ETL Pipeline for Bibliometrix-Python

1. Objective

This pull request introduces an Advanced ETL pipeline for Bibliometrix-Python.

The goal is to make the system more source-agnostic by retrieving bibliographic data directly from OpenAlex through its API and converting the raw API response into a standardized Web of Science-like schema.

The pipeline follows the required ETL structure:

  1. EXTRACT
  2. TRANSFORM / RENAME
  3. CALCULATED FIELDS
  4. VALIDATION
  5. LOAD / CSV export

The implementation avoids a single monolithic function and separates responsibilities across dedicated modules.

2. Problems addressed

The current Python implementation is fragile when working with heterogeneous bibliographic sources because several analytical functions assume consistent Web of Science-like columns, formats and data types.

This PR addresses the following issues:

  • lack of a source-agnostic conversion workflow;
  • scattered transformation logic;
  • weak type enforcement;
  • poor handling of missing values;
  • missing mandatory columns when the input source does not provide them;
  • need for list-based multi-value fields such as authors, affiliations, cited references and keywords;
  • need for a calculated SR field compatible with citation-based analyses.

3. Files added or modified

Added files:

  • www/services/api_retriever.py
  • www/services/mappings.py
  • www/services/standardizer.py
  • www/services/validators.py
  • scripts/run_advanced_etl.py

Modified files:

  • www/services/format_functions.py
  • .gitignore

4. Architecture

4.1 EXTRACT phase

Implemented in:

www/services/api_retriever.py

The extraction phase retrieves raw bibliographic records from OpenAlex using a textual query.

Main features:

  • OpenAlex Works API support;
  • user-provided textual query;
  • cursor-based pagination;
  • configurable maximum number of results;
  • retry logic for temporary failures;
  • basic handling of HTTP 429 rate-limit responses;
  • raw JSON-like output without transformation.

Example function:

fetch_openalex_works("machine learning", max_results=100)

This keeps extraction separate from transformation.
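
The pagination and retry behaviour listed above could look roughly like the following. This is a minimal sketch, not the code in `api_retriever.py`: the names `http_get_json` and `fetch_works_sketch`, the retry count and the backoff values are assumptions made for illustration, and it uses only the standard library. It relies on the OpenAlex Works endpoint accepting `search`, `per-page` and `cursor` parameters and returning `results` plus `meta.next_cursor`.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def http_get_json(url, retries=3, backoff=1.0):
    """GET a URL and decode JSON, retrying on HTTP 429 and 5xx responses."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as exc:
            retryable = exc.code == 429 or exc.code >= 500
            if not retryable or attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)  # exponential backoff


def fetch_works_sketch(query, max_results=100, get_json=http_get_json):
    """Collect up to max_results raw work records using cursor pagination."""
    records, cursor = [], "*"  # "*" asks OpenAlex for the first page
    while cursor and len(records) < max_results:
        params = urllib.parse.urlencode(
            {"search": query, "per-page": min(200, max_results), "cursor": cursor}
        )
        page = get_json(f"{OPENALEX_WORKS_URL}?{params}")
        results = page.get("results", [])
        if not results:
            break
        records.extend(results)  # records are kept raw, no transformation
        cursor = page.get("meta", {}).get("next_cursor")
    return records[:max_results]
```

Passing the HTTP function as a parameter is a design choice of this sketch: it lets the pagination loop be exercised with a fake page source, without network access.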

4.2 TRANSFORM / RENAME phase

Implemented in:

www/services/mappings.py

and:

www/services/standardizer.py

The mapping layer defines the correspondence between OpenAlex fields and the internal WoS-like schema.

Example mappings:

  • OpenAlex id -> UT
  • OpenAlex doi -> DI
  • OpenAlex title -> TI
  • OpenAlex publication_year -> PY
  • OpenAlex cited_by_count -> TC
  • OpenAlex referenced_works -> CR

Derived OpenAlex fields are also normalized:

  • authorships -> AU, AF, C1
  • primary_location.source -> SO, JI
  • abstract_inverted_index -> AB
  • concepts / keywords -> ID, DE
  • biblio -> VL, IS, BP, EP
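
Two of these derived fields need more than a rename. The sketch below illustrates one plausible way to rebuild `AB` from `abstract_inverted_index` and to pull author and institution names out of `authorships`. The function names are hypothetical, and the split between `AU`, `AF` and `C1` in `standardizer.py` may differ; here the author names and the de-duplicated institution names are simply returned as two lists.

```python
def rebuild_abstract(inverted_index):
    """Turn {"word": [positions]} back into a plain-text abstract."""
    if not inverted_index:
        return ""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))


def split_authorships(authorships):
    """Return (author names, institution names) from OpenAlex authorships."""
    authors, affiliations = [], []
    for entry in authorships or []:
        name = (entry.get("author") or {}).get("display_name")
        if name:
            authors.append(str(name))
        for inst in entry.get("institutions") or []:
            inst_name = inst.get("display_name")
            # Keep each institution once, in first-seen order.
            if inst_name and inst_name not in affiliations:
                affiliations.append(str(inst_name))
    return authors, affiliations
```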

The target schema contains all mandatory columns:

DB, UT, DI, PMID, TI, SO, JI, PY, DT, LA, TC, AU, AF, C1, RP, CR, DE, ID, AB, VL, IS, BP, EP, SR

If OpenAlex does not provide a field, the corresponding standard column is still created and populated with an empty value.
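
As an illustration of the rename step and of the "always create every column" rule, the sketch below works on plain dictionaries rather than a DataFrame so that it stays dependency-free. `DIRECT_MAPPING`, `empty_value` and `rename_record` are hypothetical names, not the contents of `mappings.py`; only the direct mappings listed above are covered.

```python
MANDATORY_COLUMNS = [
    "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "PY", "DT", "LA", "TC", "AU",
    "AF", "C1", "RP", "CR", "DE", "ID", "AB", "VL", "IS", "BP", "EP", "SR",
]
LIST_COLUMNS = {"AU", "AF", "C1", "CR", "DE", "ID"}

# OpenAlex field -> WoS-like tag (direct renames only).
DIRECT_MAPPING = {
    "id": "UT",
    "doi": "DI",
    "title": "TI",
    "publication_year": "PY",
    "cited_by_count": "TC",
    "referenced_works": "CR",
}


def empty_value(column):
    """Empty placeholder matching the column's expected type."""
    if column in LIST_COLUMNS:
        return []
    if column == "TC":
        return 0
    return ""


def rename_record(raw):
    """Map one raw OpenAlex work onto the full WoS-like column set."""
    # Start from a complete row so that no mandatory column can be missing.
    row = {column: empty_value(column) for column in MANDATORY_COLUMNS}
    for source, target in DIRECT_MAPPING.items():
        value = raw.get(source)
        if value is not None:
            row[target] = value
    return row
```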

5. Type contracts

The ETL pipeline enforces strong type contracts before validation.

Scalar fields are converted to strings:

DB, UT, DI, PMID, TI, SO, JI, PY, DT, LA, RP, AB, VL, IS, BP, EP, SR

Citation counts are converted to integers:

TC -> int

Multi-value fields are converted to Python lists of strings:

AU, AF, C1, CR, DE, ID -> list[str]

Missing values are handled as follows:

  • None / NaN in scalar fields -> ""
  • None / NaN in multi-value fields -> []
  • None / NaN in TC -> 0

When exported to CSV, list-valued fields are serialized using the semicolon delimiter.
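
The contracts above can be expressed as three small coercion helpers plus a serializer. This is a sketch under assumptions: the helper names are invented here, and falling back to `0` for an unparseable citation count is a choice of the sketch, not something the PR states.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_scalar(value):
    """Scalar contract: always a string, "" when missing."""
    return "" if is_missing(value) else str(value)


def to_count(value):
    """TC contract: always an int, 0 when missing or unparseable."""
    if is_missing(value):
        return 0
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0


def to_list(value):
    """Multi-value contract: always a list of non-empty strings."""
    if is_missing(value):
        return []
    if not isinstance(value, (list, tuple)):
        value = [value]
    cleaned = (str(v).strip() for v in value if not is_missing(v))
    return [text for text in cleaned if text]


def serialize_list(values, sep=";"):
    """CSV export: join a list[str] with the semicolon delimiter."""
    return sep.join(values)
```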

6. Calculated field: SR

The SR field is required for citation-related analyses.

Instead of creating a fully separate SR implementation, this PR reuses and extends the existing function:

format_sr_column()

located in:

www/services/format_functions.py

A new OpenAlex branch was added to support standardized OpenAlex records:

FirstAuthor, PublicationYear, Source

When the author or source is missing, the formatter drops the empty component rather than emitting a malformed value. For example, it produces:

  • 1989, Choice Reviews Online
  • J. R. Quinlan, 1992

instead of producing leading or trailing commas.
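
The comma handling can be sketched in a few lines. `build_sr_sketch` is a hypothetical stand-alone helper, not the OpenAlex branch of `format_sr_column()`; it only demonstrates joining the non-empty components so that no stray separators appear.

```python
def build_sr_sketch(first_author, year, source):
    """Join the non-empty SR components with ", "."""
    parts = (
        str(part).strip()
        for part in (first_author, year, source)
        if part is not None
    )
    # Empty components are skipped, so no leading or trailing comma is produced.
    return ", ".join(part for part in parts if part)
```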

7. VALIDATION phase

Implemented in:

www/services/validators.py

The validation module checks that the standardized DataFrame is suitable for downstream Bibliometrix-Python analyses.

Validation checks include:

  1. all mandatory columns exist;
  2. no NaN values remain;
  3. no None values remain;
  4. multi-value columns are list[str];
  5. scalar columns are strings;
  6. TC is numeric/integer.

The validation function is:

validate_standardized_dataframe(df)

If validation fails, a descriptive ValidationError is raised.
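
A simplified version of these checks, written against a list of row dictionaries instead of a DataFrame, might look as follows. The function name, its signature and the message format are assumptions of this sketch; only the `ValidationError` name comes from the PR, and it is redefined here so the example is self-contained.

```python
class ValidationError(ValueError):
    """Raised when standardized rows break the schema or type contracts."""


def validate_rows_sketch(rows, mandatory, list_columns, int_columns=("TC",)):
    """Check column presence and per-column types; raise on any problem."""
    problems = []
    for index, row in enumerate(rows):
        missing = [column for column in mandatory if column not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        for column in mandatory:
            value = row[column]
            if column in list_columns:
                ok = isinstance(value, list) and all(
                    isinstance(item, str) for item in value
                )
            elif column in int_columns:
                ok = isinstance(value, int) and not isinstance(value, bool)
            else:
                ok = isinstance(value, str)
            # None and NaN fail every branch above, so they are caught here too.
            if not ok:
                problems.append(f"row {index}: column {column} has invalid value {value!r}")
    if problems:
        raise ValidationError("; ".join(problems))
```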

8. LOAD / CSV export

Implemented in:

scripts/run_advanced_etl.py

This script runs the complete pipeline:

EXTRACT -> TRANSFORM -> CALCULATED FIELDS -> VALIDATION -> LOAD

Example command:

python scripts/run_advanced_etl.py --platform openalex --query "machine learning" --max-results 10 --output standardized_openalex.csv
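
For reference, the four flags in this command could be declared with `argparse` along these lines. The flag names are taken from the command above; the defaults, the `choices` restriction and the `build_parser` name are assumptions and may not match `run_advanced_etl.py`.

```python
import argparse


def build_parser():
    """Command-line interface accepting the flags used in the example command."""
    parser = argparse.ArgumentParser(description="Advanced Bibliometrix ETL (sketch)")
    parser.add_argument("--platform", default="openalex", choices=["openalex"],
                        help="bibliographic source to query")
    parser.add_argument("--query", required=True,
                        help="textual search query")
    parser.add_argument("--max-results", type=int, default=100,
                        help="maximum number of records to retrieve")
    parser.add_argument("--output", default="standardized_openalex.csv",
                        help="path of the CSV file to write")
    return parser
```

`argparse` exposes `--max-results` as `args.max_results`, so the parsed value can be passed straight to the extraction step.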

The script prints:

  • selected platform;
  • query;
  • number of raw records retrieved;
  • number of standardized rows;
  • SR sample;
  • validation result;
  • normalized preview;
  • output CSV path.

The generated CSV is ignored through .gitignore because it is an output artifact, not source code.

9. Execution evidence

The pipeline was tested locally with the following command:

python scripts/run_advanced_etl.py --platform openalex --query "machine learning" --max-results 10 --output standardized_openalex.csv

The execution produced:

=== ADVANCED BIBLIOMETRIX ETL ===
Platform: openalex
Query: machine learning
Max results: 10

[1/5] EXTRACT - Retrieving raw API records...
Raw records retrieved: 10

[2/5] TRANSFORM - Standardizing records into WoS-like schema...
Standardized rows: 10

[3/5] CALCULATED FIELDS - SR generated during standardization.

[4/5] VALIDATION - Checking schema, null values and type contracts...
VALIDATION PASSED

[5/5] LOAD - Saving standardized CSV...
Output saved to: standardized_openalex.csv

A normalized preview was also printed, including fields such as:

`DB, UT, DI, TI, PY, TC, AU, SO, SR`

## 10. Debugging and patches

During implementation, the following issues were identified and addressed.

### 10.1 Missing dependencies during isolated import

`format_functions.py` originally relied on relative imports that made isolated testing of `format_sr_column()` difficult.

The import section was made more defensive so that the SR formatter can be reused by the ETL module without requiring the entire dashboard dependency stack during CLI testing.
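
One common way to make an import section defensive is to treat heavy dependencies as optional and fail only when they are actually used. The sketch below shows that general pattern with hypothetical names; it is not the actual import block of `format_functions.py`.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module, feature):
    """Raise a clear error only at the point where the dependency is needed."""
    if module is None:
        raise RuntimeError(f"{feature} requires an optional dependency that is not installed")
    return module
```

With this pattern, a pure formatting function can be imported from a CLI script even if the dashboard dependencies are absent.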

### 10.2 Missing OpenAlex support in SR formatter

The existing `format_sr_column()` function supported sources such as Web of Science, PubMed, Scopus, Dimensions and Lens, but did not include OpenAlex.

An OpenAlex branch was added.

### 10.3 Multi-value validation issues

The validator initially detected that multi-value columns such as `CR`, `ID` and `AU` could contain non-string or nested values.

The standardization step was strengthened so that all multi-value fields are normalized as `list[str]` before validation.
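
The kind of normalization described here can be illustrated with a recursive flattener. This is a sketch: the name `flatten_to_strings` is hypothetical, and reducing a nested dictionary to its `display_name` or `id` is an assumption about what the nested OpenAlex values look like.

```python
def flatten_to_strings(value):
    """Flatten nested or non-string values into a flat list[str]."""
    if value is None:
        return []
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return []
    if isinstance(value, dict):
        # Assumption: nested objects are represented by their name, else their id.
        return flatten_to_strings(value.get("display_name") or value.get("id"))
    if isinstance(value, (list, tuple)):
        flat = []
        for item in value:
            flat.extend(flatten_to_strings(item))
        return flat
    text = str(value).strip()
    return [text] if text else []
```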

### 10.4 CSV output handling

The generated file:

`standardized_openalex.csv`

was added to `.gitignore` because it is produced by the ETL script and should not be committed as source code.

## 11. Limitations and future improvements

This PR currently implements the Advanced ETL workflow for OpenAlex only.

Future extensions may include:

- PubMed API support;
- direct integration into the Shiny dashboard interface;
- additional source-specific mappings;
- broader automated tests against more analytical functions.

The current contribution focuses on a robust OpenAlex API-based pipeline with clear separation between extraction, transformation, validation and export.

## 12. Summary

This PR adds an Advanced OpenAlex ETL pipeline that:

- retrieves data through the OpenAlex API;
- handles pagination and retries;
- maps OpenAlex records to the WoS-like Bibliometrix schema;
- enforces strong type contracts;
- handles missing values;
- generates the calculated `SR` field through the existing formatter;
- validates the final DataFrame;
- exports a standardized CSV file;
- provides a reproducible command-line execution workflow.
