ETL Pipeline: From heterogeneous bibliographic data to a unified schema by marioloskovic55-jpg · Pull Request #14 · PRAISELab-PicusLab/bibliometrix-python

marioloskovic55-jpg · 2026-06-06T16:27:25Z

Function of the project

The goal of this pull request is to add a complete Data Extraction, Transformation, and Loading (ETL) pipeline to bibliometrix-python, so that we have clean, standardized data from the source and can proceed with analysis.
The work starts from a problem present in bibliometrix-python: it only works with data exported from Web of Science (WoS) and fails on data from PubMed, Scopus, or Dimensions because their column names differ.
The created pipeline acts as a universal translator: wherever the data comes from, the output will always be standard so that it can be understood by all analytical functions.

Problems found in the current implementation

- There was no function in bibliometrix-python to load and standardize the data (convert2df exists, but only in the R version).

- The transformation logic had no clear structure, making it difficult to maintain or extend.

- Data stored in the wrong format caused crashes (for example, the authors "AU" and cited references "CR" columns held strings instead of lists).

- Many functions crashed when handling missing values (NaN, None).

- The data was assumed to come from Web of Science, so input from other sources crashed the code.

- There was no validation step: data was passed directly to the analytical functions without checking its format.

Building

Five modules have been created within the www/services/etl folder, each with a single, specific purpose.

1. Mapping (`www/services/etl/mappings/`)

Different sources use different names for the same information: for example, the title of an article is labeled "Title" in Scopus and Dimensions but "TI" in Web of Science.

I created one dictionary file per source, each acting as a translation table:

- scopus_mapping.py: maps Scopus column names to WoS standard tags
- dimensions_mapping.py: maps Dimensions column names to WoS standard tags
- pubmed_mapping.py: maps PubMed field tags to WoS standard tags
- openalex_mapping.py: maps OpenAlex field names to WoS standard tags

These dictionaries are just lookup tables used by the transformer.
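
As a rough illustration of how such a lookup table might be applied, the sketch below renames a few Scopus-style columns to WoS tags. Only the "Title" to "TI" pair comes from the description above; the other entries, the sample row, and the name `SCOPUS_MAPPING_SKETCH` are assumptions for illustration, not the contents of scopus_mapping.py.

```python
import pandas as pd

# Hypothetical subset of a Scopus-to-WoS lookup table (illustrative only).
SCOPUS_MAPPING_SKETCH = {
    "Title": "TI",
    "Authors": "AU",
    "Year": "PY",
    "Source title": "SO",
    "Cited by": "TC",
}

# Made-up sample row standing in for a Scopus CSV export.
raw = pd.DataFrame(
    {
        "Title": ["A study of graphs"],
        "Authors": ["Rossi M.; Bianchi L."],
        "Year": [2021],
        "Source title": ["Journal of Examples"],
        "Cited by": [7],
    }
)

# rename() leaves any column that is not in the mapping untouched.
renamed = raw.rename(columns=SCOPUS_MAPPING_SKETCH)
print(list(renamed.columns))
```

Keeping the tables as plain dictionaries means a new source can be supported by adding one file, without touching the transformer.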

2. Transformer (`www/services/etl/transformer.py`)

This module performs five steps:

- Renames columns to the names used in WoS.
- Converts columns that hold strings but should hold lists (such as authors "AU") to list format.
- Converts columns that hold strings but should hold integers (such as number of citations "TC") to integers, setting missing data to 0.
- Fills in missing columns: if a source does not provide one of the 24 mandatory columns, it is added with an empty default value, so the output always contains all 24.
- Calculates the "SR" (Short Reference), a unique identifier for each article, automatically from the other columns. It consists of: "FirstAuthorSurname, PublicationYear, JournalName".
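
The five steps can be sketched roughly as follows. This is a simplified illustration, not the code in transformer.py: it uses a hypothetical six-column subset instead of the 24 mandatory columns, fills missing columns before the type conversions so that later steps see every column, and assumes the first author string starts with the surname. All names (`DEFAULTS`, `to_list`, `transform_sketch`) are invented for this example.

```python
import pandas as pd

# Hypothetical subset of the mandatory columns with their empty defaults.
DEFAULTS = {"AU": list, "TI": str, "SO": str, "PY": int, "TC": int, "DE": list}
LIST_COLUMNS = ["AU", "DE"]
INT_COLUMNS = ["PY", "TC"]


def to_list(value, sep=";"):
    """Turn a separator-delimited string into a list; keep lists as they are."""
    if isinstance(value, list):
        return value
    if value is None or (isinstance(value, float) and pd.isna(value)):
        return []
    return [part.strip() for part in str(value).split(sep) if part.strip()]


def transform_sketch(df, mapping):
    out = df.rename(columns=mapping)  # rename to WoS tags
    for col, factory in DEFAULTS.items():  # add missing columns with defaults
        if col not in out.columns:
            out[col] = pd.Series(
                [factory() for _ in range(len(out))], index=out.index, dtype=object
            )
    for col in LIST_COLUMNS:  # strings -> lists
        out[col] = out[col].apply(to_list)
    for col in INT_COLUMNS:  # strings -> integers, missing -> 0
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0).astype(int)
    # Short Reference: "FirstAuthorSurname, PublicationYear, JournalName"
    out["SR"] = [
        f"{au[0].split()[0] if au else ''}, {py}, {so}"
        for au, py, so in zip(out["AU"], out["PY"], out["SO"])
    ]
    return out


raw = pd.DataFrame(
    {
        "Title": ["A study"],
        "Authors": ["Rossi M.; Bianchi L."],
        "Year": ["2021"],
        "Source title": ["J Ex"],
        "Cited by": [None],
    }
)
mapping = {"Title": "TI", "Authors": "AU", "Year": "PY",
           "Source title": "SO", "Cited by": "TC"}
result = transform_sketch(raw, mapping)
print(result.loc[0, "SR"])
```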

3. Validator (`www/services/etl/validator.py`)

The validator checks the data before it is passed to the analytical functions. Specifically, it performs three checks:

- It verifies that all 24 predefined columns are present.
- It verifies that all columns are free of NaN and None.
- It verifies that the data in every column has the correct type.

The validator prints a clear [OK] or [FAIL] for each check.
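
A minimal sketch of the three checks, assuming a hypothetical four-column schema rather than the real 24 columns. The names `EXPECTED_TYPES` and `validate_sketch` are invented for this example and may not match validator.py.

```python
import numbers

import pandas as pd

# Hypothetical expected types for a few columns (the real schema has 24).
EXPECTED_TYPES = {"AU": list, "TI": str, "PY": numbers.Integral, "TC": numbers.Integral}


def column_has_type(series, expected):
    """True if every value in the column is an instance of the expected type."""
    return all(isinstance(value, expected) for value in series)


def validate_sketch(df):
    """Run the three checks and print [OK] or [FAIL] for each one."""
    present = [col for col in EXPECTED_TYPES if col in df.columns]
    results = {
        "all columns present": len(present) == len(EXPECTED_TYPES),
        "no NaN or None": not bool(df[present].isna().any().any()),
        "correct types": all(
            column_has_type(df[col], EXPECTED_TYPES[col]) for col in present
        ),
    }
    for name, passed in results.items():
        print(f"[{'OK' if passed else 'FAIL'}] {name}")
    return all(results.values())


good = pd.DataFrame({"AU": [["Rossi M."]], "TI": ["A study"], "PY": [2021], "TC": [0]})
bad = good.assign(TI=[None])  # a missing title should fail two checks

good_passed = validate_sketch(good)
bad_passed = validate_sketch(bad)
```

Using `numbers.Integral` rather than `int` lets the type check accept NumPy integers, which is what pandas stores in integer columns.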

4. Api Retriever (`www/services/etl/api_retriever.py`)

This module retrieves data over the internet, with no need for manual downloading.
It draws on two free public databases:

- OpenAlex: a search query is sent (for example, "machine learning") and the matching articles are returned in JSON format. The module paginates the results 25 per page, limits the request rate to one request every 0.5 seconds, and automatically retries up to a maximum of 3 times.
- PubMed: the US biomedical database. In this case, we first search for article IDs and then download the corresponding full records. The output is in MEDLINE text format and is parsed line by line.
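
To make the retry and parsing behaviour concrete, here is a small sketch that avoids real network calls. `fetch_with_retry` takes the page-fetching function as a parameter and treats "up to 3 times" as three attempts in total, which is an assumption; `parse_medline` is a simplified line-by-line parser for the tagged MEDLINE layout. Both names are hypothetical and neither is taken from api_retriever.py.

```python
import time


def fetch_with_retry(fetch_page, page, retries=3, delay=0.5):
    """Call fetch_page(page), pausing `delay` seconds before each attempt
    and giving up after `retries` failed attempts."""
    last_error = None
    for _ in range(retries):
        time.sleep(delay)  # simple rate limit between requests
        try:
            return fetch_page(page)
        except OSError as err:  # stdlib network errors derive from OSError
            last_error = err
    raise RuntimeError(f"page {page} failed after {retries} attempts") from last_error


def parse_medline(text):
    """Parse MEDLINE text line by line into one dict of lists per record."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      ") and tag:  # continuation of the last field
            current[tag][-1] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":  # "TAG - value" line
            tag = line[:4].strip()
            current.setdefault(tag, []).append(line[6:].strip())
    if current:
        records.append(current)
    return records


# Made-up two-record sample in MEDLINE layout.
sample = (
    "PMID- 1\n"
    "TI  - A long\n"
    "      title\n"
    "AU  - Rossi M\n"
    "AU  - Bianchi L\n"
    "\n"
    "PMID- 2\n"
    "TI  - Other\n"
)
records = parse_medline(sample)
```

Storing every field as a list keeps repeated tags such as AU together, which fits the list format the transformer expects for authors.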

5. Standardizer (`www/services/etl/standardizer.py`)

This is the equivalent of the convert2df function in R and is the entry point of the pipeline.
You only need to call this function.
It works in two modes:

File mode (Base level) - load a manually exported file:

from www.services.etl.standardizer import convert2df
df = convert2df(source="scopus", filepath="scopus_export.csv")

API mode (Advanced level) - retrieve data automatically:

df = convert2df(source="openalex", query="machine learning", max_results=100)
df = convert2df(source="pubmed", query="deep learning", max_results=50)

Internally, convert2df() coordinates all the other modules in order:
Extract -> Transform -> Validate. The user does not need to know anything about the internal modules.

Modifications to Existing Files

`www/services/__init__.py`

Originally, the file imported all modules using wildcards, which forced numerous additional installations (such as selenium, pyvis, etc.) even when only the ETL pipeline was needed. The imports were therefore replaced with a comment, so modules can still be imported explicitly when necessary.

Validation results

The pipeline passed validation in tests on 3 different sources:

- Scopus CSV file (20 articles): all columns present, no missing data, and all types correct. OK
- OpenAlex "machine learning" API query (10 articles): passed the same checks. OK
- PubMed "machine learning" API query (10 articles): passed the same checks. OK

All three sources passed full validation without any crashes.

Demo notebook

The demo_etl_pipeline.ipynb file contains a step-by-step process starting from the raw pre-transformation data and displaying the standardized outputs after each stage.

marioloskovic55-jpg · 2026-06-08T10:35:32Z

Fixes applied to existing functions

During testing, a systematic bug was found in 38 files in the www/services/ and functions/ directories and has been fixed.

Bug 1

The original files called df.get() and df.set(M) as if df were an instance of a custom class. But df is a standard pandas DataFrame: DataFrame.get() requires a key argument and DataFrame has no set() method, so these calls caused crashes.

Fix 1

- All df.get() statements have been replaced with df.copy().
- In metatagextraction.py, df.set(M) has been replaced with return M.
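
The failure and the replacement pattern can be reproduced on a tiny, made-up DataFrame. This sketch only demonstrates pandas behaviour; it is not code from the repository.

```python
import pandas as pd

df = pd.DataFrame({"TI": ["A study"], "PY": [2021]})

# DataFrame.get() requires a key argument, so a bare call fails.
get_error = None
try:
    df.get()
except TypeError as err:
    get_error = type(err).__name__

# DataFrame has no set() method at all.
has_set = hasattr(df, "set")

# Replacement pattern: work on a copy so the caller's frame stays untouched,
# then return the modified copy from the function.
M = df.copy()
M["PY"] = M["PY"] + 1

print(get_error, has_set)
```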

Bug 2

Several functions in functions/ were missing imports.

Fix 2

- get_maininformations.py: added import time, import pandas as pd, and from www.services.metatagextraction import metaTagExtraction
- get_annualproduction.py: added import pandas as pd, import plotly.express as px, import plotly.graph_objects as go

Testing

After the corrections, the following functions were tested using standardized Scopus data and worked correctly:

get_maininformations
get_annualproduction

marioloskovic55-jpg · 2026-06-08T15:52:31Z

Additional fixing applied

Bug 3

After importing pandas and plotly into all functions, other imports were still missing.

Fix 3

- All functions in functions/: added import plotly.express as px and import plotly.graph_objects as go
- get_lotkalaw.py: added import numpy as np
- get_frequentwords.py: added from collections import Counter
- get_countriesproduction.py: added import geopandas as gpd and from www.services.metatagextraction import metaTagExtraction
- get_bradfordlaw.py: added import numpy as np
- get_sourcesproduction.py: added from www.services.cocmatrix import cocMatrix

Bug 4

In addition to the missing imports, get_sourcesproduction also failed inside cocMatrix, which called split() directly without first converting the values to strings. This raised an AttributeError on the integer values in the "PY" (publication year) column.

Fix 4

cocmatrix.py: added a str(x) cast before .split(sep)
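
A minimal reproduction of the idea, using a made-up integer "PY" column. The variable names are illustrative and the surrounding cocMatrix logic is omitted.

```python
import pandas as pd

py = pd.Series([2020, 2021, 2020])
sep = ";"

# Without the cast, x.split(sep) raises AttributeError because int has no split().
split_error = None
try:
    py.apply(lambda x: x.split(sep))
except AttributeError as err:
    split_error = type(err).__name__

# With the cast, integer and string values are handled the same way.
tokens = py.apply(lambda x: [t.strip() for t in str(x).split(sep)])
print(tokens.tolist())
```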

Testing

Further functions were tested with standardized Scopus data, all successfully:

get_countriesproduction
get_frequentwords
get_lotkalaw
get_averagecitations
get_bradfordlaw
get_sourcesproduction

marioloskovic55-jpg · 2026-06-09T15:01:35Z

Dashboard Revisions

To ensure the Shiny dashboard works with data not derived from WoS, it was revised as follows:

functions/__init__.py

Replaced the comment with explicit imports so that all analytical functions are loaded in app.py. Also, corrected many function calls to match their actual names (e.g., get_co_citation instead of get_cocitation).

Missing function imports

The analytical functions were missing imports that had previously been supplied by the wildcard imports in __init__.py. The added imports were:

import numpy as np
import pandas as pd
import plotly.express as px and import plotly.graph_objects as go
from www.services.metatagextraction import metaTagExtraction
from www.services.histnetwork import histNetwork
from www.services.biblionetwork import biblionetwork
from www.services.cocmatrix import cocMatrix
from www.services.termextraction import term_extraction

Bug 5

In get_maininformations.py, metaTagExtraction overwrote the data variable, losing columns it previously calculated, such as Min_Year.

Fix 5

The fix avoids replacing the entire DataFrame: only the resulting "AU_CO" column is copied back into data.
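
The pattern looks roughly like this. `meta_tag_extraction_stub` is a stand-in written for this example; its behaviour (returning a new frame without the previously computed columns, and taking the country from the last comma-separated part of "C1") is assumed, not copied from metaTagExtraction.

```python
import pandas as pd


def meta_tag_extraction_stub(df, field):
    """Stand-in for metaTagExtraction: returns a NEW frame with the extra column."""
    out = df[["AU", "C1"]].copy()  # note: columns such as Min_Year are not carried over
    out[field] = out["C1"].str.split(",").str[-1].str.strip()
    return out


data = pd.DataFrame(
    {"AU": ["Rossi M"], "C1": ["Univ X, Naples, Italy"], "PY": [2021]}
)
data["Min_Year"] = data["PY"].min()

# Buggy pattern: data = meta_tag_extraction_stub(data, "AU_CO") would lose Min_Year.
# Fixed pattern: keep `data` and copy over only the new column.
data["AU_CO"] = meta_tag_extraction_stub(data, "AU_CO")["AU_CO"]
print(list(data.columns))
```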

Bug 6

In get_localcitedauthors.py, there were NaN values in the Plotly marker size due to division by 0, which occurred when all local citation values were 0.

Fix 6

Fixed by adding fillna(18) to the size calculation.
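
The effect of the fix can be checked without Plotly, since the NaN values come from the pandas arithmetic itself. The series below are made-up values.

```python
import pandas as pd

all_zero = pd.Series([0, 0, 0])
mixed = pd.Series([0, 2, 4])

# 0 / 0 yields NaN in pandas, so every size would be NaN without fillna().
raw_size = 18 + 6 * (all_zero / all_zero.max())
size_all_zero = raw_size.fillna(18)

# When at least one value is non-zero, fillna() changes nothing.
size_mixed = (18 + 6 * (mixed / mixed.max())).fillna(18)

print(size_all_zero.tolist(), size_mixed.tolist())
```

The fallback of 18 matches the base marker size, so rows with no citations are drawn at the minimum size.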

app.py fixes

- Added missing imports: add_to_report, get_filtered_table, get_data, get_table, get_database
- Fixed all analytical function calls to use df.get() instead of df
- Fixed DT(style="width=100%;") → style="width:100%;"

marioloskovic55-jpg · 2026-06-11T10:26:44Z

Dashboard revision 2

Missing imports in functions/

All analytical functions were missing imports that the wildcard imports had previously provided; these have now been added explicitly:

import numpy as np (all files)
import pandas as pd (all files)
from www.services.networkplot import network_plot
from pyvis.network import Network
from www.services.couplingmap import avoid_net_overlaps
from www.services.histplot import histPlot
from www.services.tabletag import table_tag
from www.services.thematicmap import thematic_map
import tempfile, import os, import json, import math, import re
from collections import Counter
import matplotlib.pyplot as plt, import matplotlib.colors as mcolors
from matplotlib.colors import to_hex
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, to_tree
from prince import CA, MCA
from sklearn.manifold import MDS, from sklearn.preprocessing import StandardScaler
from typing import Union, Optional, Sequence, Dict, List

Bug 7

All analytical functions received df as a reactive.Value rather than a DataFrame. They now unwrap it with df = df.get() if hasattr(df, 'get') and not isinstance(df, pd.DataFrame) else df.copy().
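
The guard can be exercised without Shiny by using a small stand-in class. `FakeReactiveValue` and `unwrap` are names invented for this sketch; only the conditional expression mirrors the fix described above.

```python
import pandas as pd


class FakeReactiveValue:
    """Minimal stand-in for Shiny's reactive.Value, used only for this sketch."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


def unwrap(df):
    # A DataFrame also has a .get attribute, hence the isinstance guard.
    if hasattr(df, "get") and not isinstance(df, pd.DataFrame):
        return df.get()
    return df.copy()


frame = pd.DataFrame({"TI": ["A study"]})
from_reactive = unwrap(FakeReactiveValue(frame))
from_frame = unwrap(frame)
```

With this guard, the same analytical function can be called from the dashboard with a reactive value or from a notebook with a plain DataFrame.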

Bug 8

All functions that used size=18 + 6 * (x / x.max()) produced NaN marker sizes when x contained only 0 values. Fixed by adding .fillna(18) to Plotly's marker size calculation.

Bug 9

Several functions misspelled the method name layout_fruchterman_reingold; this was fixed in get_co_occurence_network.py, get_cocitation.py, get_collaborationnetwork.py, and thematicmap.py.

Bug 10

In couplingmap.py and thematicmap.py, the groupby operation consumed the key column, so it was no longer available inside the applied function. Using x.name inside the lambda retrieves the group key instead.
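
A small sketch of the technique on made-up data: inside a groupby apply, pandas exposes the current group key as x.name, so the key does not have to be read from a column. The column names here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"cluster": ["a", "a", "b"], "freq": [3, 1, 5]})

# Only "freq" is passed to the function; the key "cluster" is read from x.name.
labels = df.groupby("cluster")[["freq"]].apply(
    lambda x: f"{x.name}:{int(x['freq'].sum())}"
)
print(labels.to_dict())
```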

Bug 11

termextraction.py ended with return df and used df.set(M). Fixed by returning M instead and removing all .get() calls.

Bug 12

The histnetwork.py function did not call metaTagExtraction to create the SR column, so that call was added at the beginning.

Bug 13

The couplingmap.py function dropped the SR column with M.drop(columns=['SR']) before it was even needed. Fixed this by removing the drop.

Bug 14

The networkplot.py function crashed because after filtering it found an array of size 0. Problem solved by adding a check before the clustering and layout phases.

Bug 15

The get_collaborationnetwork.py function received isolates as a string with a "yes/no" value, not a Boolean. This issue was fixed by adding a default value of False for small databases.

Bug 16

The get_worldmapcollaboration.py function crashed when biblionetwork returned None for collaboration in small Scopus databases. This issue was resolved by returning an empty figure as a fallback result.

Bug 17

In standardizer.py, the API mode set an empty list (mapping = []) instead of the correct mapping dictionary. This was fixed by assigning the correct mapping based on the source (PUBMED_MAPPING for PubMed and OPENALEX_MAPPING for OpenAlex).
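
One way to make this kind of mistake harder to repeat is a lookup keyed by source name that fails loudly for unknown sources. This is only a sketch: the dictionary contents are placeholders, and `API_MAPPINGS` and `mapping_for` are hypothetical names that may not exist in standardizer.py.

```python
# Placeholder contents: the real dictionaries live in www/services/etl/mappings/.
PUBMED_MAPPING = {"TI": "TI"}
OPENALEX_MAPPING = {"title": "TI"}

# Hypothetical registry keyed by source name.
API_MAPPINGS = {"pubmed": PUBMED_MAPPING, "openalex": OPENALEX_MAPPING}


def mapping_for(source):
    """Return the mapping dictionary for an API source, or fail loudly."""
    try:
        return API_MAPPINGS[source.lower()]
    except KeyError:
        raise ValueError(f"unsupported API source: {source!r}") from None


print(mapping_for("pubmed"))
```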

Mario Losco added 3 commits June 5, 2026 16:29

Add ETL pipeline: standardizer, transformer, validator, api_retriever…

be2aa70

…, mappings, demo notebook

Fix file naming: rename misspelled ETL module files

0546739

Fix analytical functions: replace df.get()/df.set() with pandas stand…

b6582b0

…ard methods, and missing imports

Fix analytical functions: add missing imports (numpy, plotly, collect…

a257aef

…ions, geopandas, matatagextraction)

Mario Losco added 2 commits June 8, 2026 18:11

Fix cocmatrix.py: cast field values to str before split; add missing …

c7d4533

…imports in bradfordlaw and sourcesproduction

Fix dashboard functions: add missing imports (numpy, histNetwork, bib…

e77e018

…lionetwork, cocMatrix, term_extraction), fix megaTagExtraction overwrite bug, fix NaN in plotly marker size

Fix all dashboard functions:imports, reactive.Value, NaN markers, col…

cc1f065

…laboration network, world map, histograph, thematic map, co-occurrence, co-citation, clustering, factorial analysis

Mario Losco added 2 commits June 11, 2026 12:59

Fix PubMed mapping in standardizer.py

08fd768

Add pubmed test files to gitignore

aa4bd7a

marioloskovic55-jpg wants to merge 9 commits into PRAISELab-PicusLab:main from marioloskovic55-jpg:main
