Add Advanced OpenAlex ETL pipeline for WoS-like standardization #15
Open
francyfra2709-tech wants to merge 3 commits into
Advanced OpenAlex ETL Pipeline for Bibliometrix-Python
1. Objective
This pull request introduces an Advanced ETL pipeline for Bibliometrix-Python.
The goal is to make the system more source-agnostic by retrieving bibliographic data directly from OpenAlex through its API and converting the raw API response into a standardized Web of Science-like schema.
The pipeline follows the required ETL structure: EXTRACT -> TRANSFORM -> CALCULATED FIELDS -> VALIDATION -> LOAD.
The implementation avoids a single monolithic function and separates responsibilities across dedicated modules.
2. Problems addressed
The current Python implementation is fragile when working with heterogeneous bibliographic sources because several analytical functions assume consistent Web of Science-like columns, formats and data types.
This PR addresses the following issues:
SR field compatible with citation-based analyses.
3. Files added or modified
Added files:
    www/services/api_retriever.py
    www/services/mappings.py
    www/services/standardizer.py
    www/services/validators.py
    scripts/run_advanced_etl.py
Modified files:
    www/services/format_functions.py
    .gitignore
4. Architecture
4.1 EXTRACT phase
Implemented in:
    www/services/api_retriever.py
The extraction phase retrieves raw bibliographic records from OpenAlex using a textual query.
Main features:
Example function:
    fetch_openalex_works("machine learning", max_results=100)
This keeps extraction separate from transformation.
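For reviewers unfamiliar with the OpenAlex API, the following is a minimal sketch of what a paginated extraction step can look like. It is not the code in api_retriever.py: the function name `fetch_works_sketch`, the `get_json` hook, and the page size are hypothetical, and the endpoint and query parameters (`search`, `per-page`, `cursor`) are assumed from the public OpenAlex documentation.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumed public endpoint for OpenAlex works.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def _http_get_json(url: str) -> dict:
    """Default fetcher: plain HTTPS GET, decoded as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_works_sketch(
    query: str,
    max_results: int = 100,
    per_page: int = 50,
    get_json: Callable[[str], dict] = _http_get_json,
) -> list:
    """Collect up to max_results raw work records for a text query.

    Hypothetical helper: it only gathers raw records and does no renaming,
    so extraction stays separate from transformation.
    """
    works = []
    cursor: Optional[str] = "*"  # "*" is assumed to start cursor pagination
    while cursor and len(works) < max_results:
        params = urllib.parse.urlencode(
            {"search": query, "per-page": per_page, "cursor": cursor}
        )
        page = get_json(f"{OPENALEX_WORKS_URL}?{params}")
        results = page.get("results", [])
        if not results:
            break  # nothing more to read
        works.extend(results)
        # A missing or null next_cursor ends the loop.
        cursor = page.get("meta", {}).get("next_cursor")
    return works[:max_results]
```

Passing the fetcher in as `get_json` is a design choice for this sketch only: it lets the pagination logic be exercised with canned pages, without network access.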
4.2 TRANSFORM / RENAME phase
Implemented in:
    www/services/mappings.py
and:
    www/services/standardizer.py
The mapping layer defines the correspondence between OpenAlex fields and the internal WoS-like schema.
Example mappings:
    id -> UT
    doi -> DI
    title -> TI
    publication_year -> PY
    cited_by_count -> TC
    referenced_works -> CR
Derived OpenAlex fields are also normalized:
    authorships -> AU, AF, C1
    primary_location.source -> SO, JI
    abstract_inverted_index -> AB
    concepts / keywords -> ID, DE
    biblio -> VL, IS, BP, EP
The target schema contains all mandatory columns:
    DB, UT, DI, PMID, TI, SO, JI, PY, DT, LA, TC, AU, AF, C1, RP, CR, DE, ID, AB, VL, IS, BP, EP, SR
If OpenAlex does not provide a field, the corresponding standard column is still created and populated with an empty value.
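To make the mapping idea concrete, here is a small sketch of the rename step for a single raw record. It is an illustration under assumptions, not the contents of mappings.py or standardizer.py: the names `DIRECT_MAPPING`, `rebuild_abstract`, and `standardize_record` are hypothetical, the `biblio` sub-keys (`volume`, `issue`, `first_page`, `last_page`) are assumed from the OpenAlex work object, and the author, source, and keyword fields are left out for brevity.

```python
STANDARD_COLUMNS = [
    "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "PY", "DT", "LA", "TC", "AU",
    "AF", "C1", "RP", "CR", "DE", "ID", "AB", "VL", "IS", "BP", "EP", "SR",
]

# Direct renames taken from the example mappings in this description.
DIRECT_MAPPING = {
    "id": "UT",
    "doi": "DI",
    "title": "TI",
    "publication_year": "PY",
    "cited_by_count": "TC",
    "referenced_works": "CR",
}

# Assumed sub-keys of the OpenAlex "biblio" object.
BIBLIO_MAPPING = {
    "volume": "VL",
    "issue": "IS",
    "first_page": "BP",
    "last_page": "EP",
}


def rebuild_abstract(inverted_index):
    """Turn {"word": [positions, ...]} back into running text."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def standardize_record(work):
    """Map one raw work (a dict) onto the full WoS-like column set."""
    # Every standard column exists, even when OpenAlex provides nothing.
    row = {column: "" for column in STANDARD_COLUMNS}
    for source_key, target in DIRECT_MAPPING.items():
        if work.get(source_key) is not None:
            row[target] = work[source_key]
    row["AB"] = rebuild_abstract(work.get("abstract_inverted_index"))
    biblio = work.get("biblio") or {}
    for source_key, target in BIBLIO_MAPPING.items():
        row[target] = biblio.get(source_key) or ""
    return row
```

Values are copied as-is here; converting them to the contracted types is a separate step.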
5. Type contracts
The ETL pipeline enforces strong type contracts before validation.
Scalar fields are converted to strings:
    DB, UT, DI, PMID, TI, SO, JI, PY, DT, LA, RP, AB, VL, IS, BP, EP, SR
Citation counts are converted to integers:
    TC -> int
Multi-value fields are converted to Python lists of strings:
    AU, AF, C1, CR, DE, ID -> list[str]
Missing values are handled as follows:
    None / NaN in scalar fields -> ""
    None / NaN in multi-value fields -> []
    None / NaN in TC -> 0
When exported to CSV, list-valued fields are serialized using the semicolon delimiter.
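The contracts above can be sketched on a plain dict record as follows. This is an illustrative, dependency-free version, not the PR's implementation: all function names are hypothetical, and splitting an incoming string on ";" in `to_list` is an assumption about how string-valued inputs might be treated.

```python
import math

SCALAR_COLUMNS = [
    "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "PY", "DT", "LA", "RP", "AB",
    "VL", "IS", "BP", "EP", "SR",
]
LIST_COLUMNS = ["AU", "AF", "C1", "CR", "DE", "ID"]


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_scalar(value):
    """Scalar contract: always a string, "" when missing."""
    return "" if is_missing(value) else str(value).strip()


def to_list(value):
    """Multi-value contract: always list[str], [] when missing."""
    if is_missing(value):
        return []
    if isinstance(value, str):
        value = value.split(";")  # assumption: strings are ";"-delimited
    items = [to_scalar(item) for item in value]
    return [item for item in items if item]  # drop empty entries


def to_count(value):
    """TC contract: always an int, 0 when missing or empty."""
    return 0 if is_missing(value) or value == "" else int(value)


def apply_type_contracts(row):
    """Return a copy of row with every standard column coerced."""
    typed = dict(row)
    for column in SCALAR_COLUMNS:
        typed[column] = to_scalar(row.get(column))
    for column in LIST_COLUMNS:
        typed[column] = to_list(row.get(column))
    typed["TC"] = to_count(row.get("TC"))
    return typed


def to_csv_cell(value):
    """Serialize list-valued fields with the semicolon delimiter."""
    return ";".join(value) if isinstance(value, list) else value
```

Keeping lists as lists until `to_csv_cell` is applied at export time means downstream analyses never have to re-split delimited strings.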
6. Calculated field: SR
The SR field is required for citation-related analyses.
Instead of creating a fully separate SR implementation, this PR reuses and extends the existing function format_sr_column(), located in:
    www/services/format_functions.py
A new OpenAlex branch was added to support standardized OpenAlex records:
    FirstAuthor, PublicationYear, Source
The implementation also avoids malformed values when the author or source is missing. For example:
    1989, Choice Reviews Online
    J. R. Quinlan, 1992
instead of producing leading or trailing commas.
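The comma handling described here comes down to joining only the non-empty parts. The helper below is a hypothetical stand-alone sketch of that idea; it is not the OpenAlex branch of format_sr_column(), whose actual signature is not shown in this description.

```python
def build_sr(authors, year, source):
    """Join first author, year, and source, skipping whatever is empty.

    Hypothetical helper: `authors` is assumed to be a list of strings,
    `year` and `source` plain strings (possibly empty).
    """
    first_author = authors[0].strip() if authors else ""
    parts = [
        first_author,
        str(year).strip() if year else "",
        (source or "").strip(),
    ]
    # Filtering before joining is what prevents stray commas.
    return ", ".join(part for part in parts if part)
```

With the two cases mentioned above, `build_sr([], "1989", "Choice Reviews Online")` gives `1989, Choice Reviews Online` and `build_sr(["J. R. Quinlan"], "1992", "")` gives `J. R. Quinlan, 1992`.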
7. VALIDATION phase
Implemented in:
    www/services/validators.py
The validation module checks that the standardized DataFrame is suitable for downstream Bibliometrix-Python analyses.
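As a rough illustration of this kind of check, the sketch below validates plain dict records rather than a DataFrame, so that it runs without pandas. It is not the code in validators.py: `validate_records` is a hypothetical name, and the `ValidationError` class defined here is only a stand-in for the PR's own error type.

```python
import math

LIST_COLUMNS = ["AU", "AF", "C1", "CR", "DE", "ID"]


class ValidationError(ValueError):
    """Stand-in error type for this sketch."""


def validate_records(records):
    """Raise ValidationError listing every contract violation found."""
    problems = []
    for index, row in enumerate(records):
        # No None or NaN may survive standardization.
        for column, value in row.items():
            if value is None:
                problems.append(f"row {index}: {column} is None")
            elif isinstance(value, float) and math.isnan(value):
                problems.append(f"row {index}: {column} is NaN")
        # Multi-value fields must be lists of strings.
        for column in LIST_COLUMNS:
            value = row.get(column)
            is_str_list = isinstance(value, list) and all(
                isinstance(item, str) for item in value
            )
            if not is_str_list:
                problems.append(f"row {index}: {column} is not list[str]")
        # TC must be a real integer (bool is rejected explicitly).
        tc = row.get("TC")
        if isinstance(tc, bool) or not isinstance(tc, int):
            problems.append(f"row {index}: TC is not an integer")
    if problems:
        raise ValidationError("; ".join(problems))
```

Collecting all problems before raising, instead of failing on the first one, keeps the error message descriptive enough to fix a bad batch in one pass.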
Validation checks include:
no NaN values remain;
no None values remain;
multi-value fields are list[str];
TC is numeric/integer.
The validation function is:
    validate_standardized_dataframe(df)
If validation fails, a descriptive ValidationError is raised.
8. LOAD / CSV export
Implemented in:
    scripts/run_advanced_etl.py
This script runs the complete pipeline:
    EXTRACT -> TRANSFORM -> CALCULATED FIELDS -> VALIDATION -> LOAD
Example command:
    python scripts/run_advanced_etl.py --platform openalex --query "machine learning" --max-results 10 --output standardized_openalex.csv
The script prints:
The generated CSV is ignored through .gitignore because it is an output artifact, not source code.
9. Execution evidence
The pipeline was tested locally with the following command:
    python scripts/run_advanced_etl.py --platform openalex --query "machine learning" --max-results 10 --output standardized_openalex.csv
The execution produced: