MaRDI4NFDI/MathMLintent-Wikipedia2Wikidata

Wikipedia to Wikidata Link Resolver

This project extracts unique Wikipedia URLs from a local knowledge graph repository (such as the MaRDI Knowledge Graph) and maps them directly to their corresponding Wikidata IDs (QIDs) using the official English Wikipedia Action API.

To maximize throughput and avoid standard API rate limits, the pipeline is split into two independent steps and supports OAuth 2.0 authentication.


Architecture Overview

  1. getUrls.py: Queries the MaRDI SPARQL endpoint for all unique Wikipedia links in the dataset. This keeps the expensive graph query separate from the rest of the pipeline and saves the results as a plain-text cache in wikipedia_urls.txt.
  2. resolveUrls.py: Reads the local text cache, queries the English Wikipedia Action API (https://wikipedia.org) one URL at a time, and writes each result directly to wikidata_mapping.csv.
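
The lookup in step 2 can be sketched as follows. This is a minimal illustration, not the code in resolveUrls.py: the helper names (title_from_url, build_params, qid_from_response) are hypothetical, and it assumes article URLs of the form https://en.wikipedia.org/wiki/Title. The query parameters (action=query, prop=pageprops, ppprop=wikibase_item) are the standard Action API way to read a page's Wikidata ID.

```python
from urllib.parse import urlparse, unquote

NO_LINK = "NO_WIKIDATA_LINK"  # sentinel used in wikidata_mapping.csv


def title_from_url(url):
    """Extract the page title from a URL like https://en.wikipedia.org/wiki/Some_Page."""
    path = urlparse(url).path
    if not path.startswith("/wiki/"):
        raise ValueError(f"not an article URL: {url}")
    # Decode percent-escapes and turn underscores back into spaces.
    return unquote(path[len("/wiki/"):]).replace("_", " ")


def build_params(title):
    """Action API parameters that ask for the page's Wikidata item."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,          # follow redirects to the target page
        "titles": title,
        "format": "json",
        "formatversion": 2,      # 'pages' is returned as a list
    }


def qid_from_response(data):
    """Return the QID from a parsed JSON response, or the sentinel if absent."""
    for page in data.get("query", {}).get("pages", []):
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return NO_LINK
```

With the requests library, the parameters would be passed as requests.get(api_url, params=build_params(title), headers=...) and the parsed JSON handed to qid_from_response.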

Fast Mode Integration via OAuth 2.0 (Required for Speed)

Anonymous clients querying the Wikipedia API frequently receive 429 Client Error: Too Many Requests responses. To reach high request rates (up to hundreds of requests per minute), you must authenticate with an official Owner-Only OAuth 2.0 Access Token.

How to generate your high-speed token:

  1. Log into your standard account on Wikipedia.
  2. Go to meta:Special:OAuthConsumerRegistration/propose/oauth2.
  3. Complete the required fields (Application Name, Description).
  4. Crucial: Ensure you check the box titled "This consumer is for use only by user" to instantly self-approve the configuration without waiting for administrator review.
  5. Under requested grants/permissions, select at least "Basic Rights (Read)".
  6. Save the application and copy your generated Access Token string.

Setup & Running the Project

1. Installation

Install the only dependency, requests, inside your virtual environment:

pip install requests

2. Step 1: Populate the URL Cache

Run the first step to fetch everything from your repository. This writes directly into wikipedia_urls.txt:

python getUrls.py

3. Step 2: High-Speed Link Resolution

You can pass the generated token to the script in two ways. In both cases the script processes outstanding URLs one at a time and automatically skips records that are already cached in wikidata_mapping.csv.

Option A: Automatic Pipeline Integration (Recommended)

Export the token as an environment variable before running the script. No interactive prompt is needed, so this also works in automated pipelines:

export WIKIPEDIA_TOKEN="your_copied_oauth2_access_token"
python resolveUrls.py

Option B: Secure Terminal Prompt

If the environment variable is missing, the program prompts for the token and hides your input as you type:

python resolveUrls.py
# The program will output:
# WIKIPEDIA_TOKEN environment variable not found.
# Please paste your Wikipedia OAuth2 Access Token: [Your input stays hidden]
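
The two options above amount to an environment lookup with a hidden-input fallback. The sketch below shows one way to do that with the standard library; load_token and auth_headers are hypothetical names, and the User-Agent string is a placeholder to replace with your own contact details. MediaWiki OAuth 2.0 access tokens are sent as a Bearer token in the Authorization header.

```python
import os
from getpass import getpass


def load_token(env=os.environ, prompt=getpass):
    """Read the token from WIKIPEDIA_TOKEN, else ask for it with hidden input."""
    token = env.get("WIKIPEDIA_TOKEN", "").strip()
    if not token:
        print("WIKIPEDIA_TOKEN environment variable not found.")
        token = prompt("Please paste your Wikipedia OAuth2 Access Token: ").strip()
    if not token:
        raise SystemExit("No access token provided.")
    return token


def auth_headers(token, user_agent="wikipedia2wikidata-sketch/0.1 (you@example.org)"):
    """Headers for authenticated requests; the User-Agent value is a placeholder."""
    return {
        "Authorization": f"Bearer {token}",
        "User-Agent": user_agent,
    }
```

Passing env and prompt as parameters keeps the function easy to exercise without touching the real environment or terminal.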

Output Formatting

The mapping is appended to wikidata_mapping.csv after every lookup, so results are preserved even if the run is interrupted:

qid,original_link
Q12345,https://wikipedia.org
NO_WIKIDATA_LINK,https://wikipedia.org
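
Because the CSV is written after each lookup, a later run can resume where the previous one stopped. The following is a minimal sketch of that idea, not the actual logic in resolveUrls.py: the function names are hypothetical, and it assumes that any URL already present in the file, including NO_WIKIDATA_LINK rows, counts as done.

```python
import csv
import os


def load_done(csv_path):
    """Return the set of URLs that already have a row in the mapping file."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["original_link"] for row in csv.DictReader(f)}


def append_result(csv_path, qid, url):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["qid", "original_link"])
        writer.writerow([qid, url])


def pending(urls, done):
    """Keep only URLs that still need a lookup, preserving input order."""
    return [u for u in urls if u not in done]
```

Opening the file in append mode for each row means every result reaches disk as soon as it is known, at the cost of one open/close per lookup.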

Import to MaRDI

Copy the input file (one QID per line) to the importer pod:

kubectl --context mardi --namespace=production cp input.txt importer-676d4db986-tlkl4:/app/input.txt -c importer

On the pod, run:

while IFS= read -r qid || [ -n "$qid" ]; do
  qid=$(echo "$qid" | tr -d "\r\n ")
  if [ -n "$qid" ]; then
    echo "----------------------------------------"
    echo "Processing item token: $qid"
    echo "----------------------------------------"
    python -m cli.importer_cli import-wikidata --qids "$qid"
  fi
done < /app/input.txt
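
The README does not say how input.txt is produced. Assuming it is derived from wikidata_mapping.csv and holds one QID per line, as the loop above expects, a sketch like the following could generate it. The function name write_qid_list is hypothetical; rows marked NO_WIKIDATA_LINK are skipped and duplicate QIDs are written only once.

```python
import csv


def write_qid_list(csv_path, out_path):
    """Write the unique, well-formed QIDs from the mapping CSV, one per line."""
    seen = set()
    qids = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            qid = (row.get("qid") or "").strip()
            # Accept only values shaped like Q<digits>; this drops NO_WIKIDATA_LINK.
            if qid.startswith("Q") and qid[1:].isdigit() and qid not in seen:
                seen.add(qid)
                qids.append(qid)
    # Force Unix line endings so the shell loop sees no stray carriage returns.
    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(qids) + ("\n" if qids else ""))
    return len(qids)
```

For example, write_qid_list("wikidata_mapping.csv", "input.txt") returns the number of QIDs written.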

About

Vibe coding session to convert Wikipedia links to Wikidata links
