This project extracts unique Wikipedia URLs from a local knowledge graph repository (such as the MaRDI Knowledge Graph) and maps them directly to their corresponding Wikidata IDs (QIDs) using the official English Wikipedia Action API.
To maximize throughput and avoid the API's standard rate limits, the pipeline is split into two independent steps and supports OAuth 2.0 authentication.
- getUrls.py: Queries the MaRDI SPARQL endpoint to fetch all unique Wikipedia links present in the dataset. It isolates the heavy graph operation and saves a clean, raw text cache to wikipedia_urls.txt.
- resolveUrls.py: Reads your text cache locally, contacts the English Wikipedia Action API (https://wikipedia.org) one URL at a time, and logs results directly to wikidata_mapping.csv.
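As a rough illustration of the kind of lookup resolveUrls.py performs, the sketch below maps one Wikipedia URL to a QID through the Action API's `pageprops` query. It is not the project's actual code. The endpoint `https://en.wikipedia.org/w/api.php`, the helper names (`title_from_url`, `qid_from_response`, `resolve`), and the choice of query parameters are all assumptions. It also uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the usual English Wikipedia Action API entry point.
API_URL = "https://en.wikipedia.org/w/api.php"


def title_from_url(url):
    """Extract the page title from a URL of the form .../wiki/<Title>."""
    path = urllib.parse.urlparse(url).path
    if "/wiki/" not in path:
        return None
    raw = path.split("/wiki/", 1)[1]
    # Decode percent-escapes and turn underscores back into spaces.
    return urllib.parse.unquote(raw).replace("_", " ") or None


def qid_from_response(data):
    """Pull the Wikidata QID out of a parsed pageprops response, if any."""
    pages = data.get("query", {}).get("pages", [])
    for page in pages:
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    # Same sentinel the mapping CSV uses for pages without a Wikidata item.
    return "NO_WIKIDATA_LINK"


def resolve(url, headers=None):
    """Look up a single Wikipedia URL; performs exactly one HTTP request."""
    title = title_from_url(url)
    if title is None:
        return "NO_WIKIDATA_LINK"
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
        "formatversion": 2,  # makes "pages" a list instead of a dict
    })
    request = urllib.request.Request(f"{API_URL}?{params}", headers=headers or {})
    with urllib.request.urlopen(request, timeout=30) as response:
        return qid_from_response(json.load(response))
```

Keeping the title parsing and response parsing separate from the network call makes both easy to check offline; only `resolve` touches the network.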
Anonymous requests to the Wikipedia endpoints are frequently rejected with 429 Client Error: Too Many Requests. To unlock high-volume speeds (up to hundreds of requests per minute), you must use an official Owner-Only OAuth 2.0 Access Token:
- Log into your standard account on Wikipedia.
- Go to meta:Special:OAuthConsumerRegistration/propose/oauth2.
- Fill in the required form fields (Application Name, Description).
- Crucial: Ensure you check the box titled "This consumer is for use only by user" to instantly self-approve the configuration without waiting for administrator review.
- Under requested grants/permissions, select at least "Basic Rights (Read)".
- Save the application and copy your generated Access Token string.
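Once the token is copied, a script has to pick it up, attach it to each request, and back off when a 429 still occurs. The sketch below shows one way this could look. The function names, the User-Agent string, and the backoff parameters are hypothetical; the `WIKIPEDIA_TOKEN` variable name and the prompt text are taken from this README. It assumes the token is sent as a standard OAuth 2.0 `Authorization: Bearer` header.

```python
import getpass
import os


def load_token(env=os.environ, prompt=getpass.getpass):
    """Read the token from the environment, or ask for it with hidden input."""
    token = env.get("WIKIPEDIA_TOKEN")
    if not token:
        print("WIKIPEDIA_TOKEN environment variable not found.")
        token = prompt("Please paste your Wikipedia OAuth2 Access Token: ")
    return token.strip()


def auth_headers(token):
    """Build request headers carrying the OAuth 2.0 bearer token."""
    return {
        "Authorization": f"Bearer {token}",
        # Hypothetical value; replace with your own tool name and contact.
        "User-Agent": "wikipedia-qid-mapper/0.1 (contact: you@example.org)",
    }


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after a 429: honor a numeric Retry-After value if
    given, otherwise fall back to capped exponential backoff."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # non-numeric header value (e.g. an HTTP date): use backoff
    return min(base * 2 ** attempt, cap)
```

A caller would typically sleep for `retry_delay(attempt, response_headers.get("Retry-After"))` seconds before repeating the failed request, where `response_headers` stands for whatever header mapping the HTTP client exposes.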
Install the required networking dependency inside your virtual environment:
pip install requests
Run the first step to fetch everything from your repository. This writes directly into wikipedia_urls.txt:
python getUrls.py
You can pass your generated token to the script in two secure ways. The script processes outstanding items one by one and automatically skips records that are already cached in wikidata_mapping.csv.
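The skip-and-resume behavior described above could be implemented roughly as follows. This is a sketch, not the script's actual code: the helper names are hypothetical, and it assumes that every row already present in wikidata_mapping.csv (including `NO_WIKIDATA_LINK` rows) counts as done and that the CSV uses the `qid,original_link` header.

```python
import csv
import os


def load_done(csv_path):
    """Return the set of links already recorded in the mapping CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {
            row["original_link"]
            for row in csv.DictReader(handle)
            if row.get("original_link")
        }


def pending_urls(urls, done):
    """Yield each non-empty URL once, skipping those already processed."""
    seen = set(done)
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            yield url


def append_result(csv_path, qid, url):
    """Append one row right after a lookup so progress survives interruption."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["qid", "original_link"])
        writer.writerow([qid, url])
```

Appending and closing the file after every lookup costs a little I/O but means an interrupted run can restart without repeating requests.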
Export the token as an environment variable before running the script. This skips the interactive prompt entirely:
export WIKIPEDIA_TOKEN="your_copied_oauth2_access_token"
python resolveUrls.py
If the environment variable is missing, the program prompts for the token with hidden input:
python resolveUrls.py
# The program will output:
# WIKIPEDIA_TOKEN environment variable not found.
# Please paste your Wikipedia OAuth2 Access Token: [Your input stays hidden]
The mapping in wikidata_mapping.csv is updated immediately after every single lookup:
qid,original_link
Q12345,https://wikipedia.org
NO_WIKIDATA_LINK,https://wikipedia.org
Copy the input file to the importer pod:
kubectl --context mardi --namespace=production cp input.txt importer-676d4db986-tlkl4:/app/input.txt -c importer
On the pod, run:
# Read one QID per line; the "|| [ -n ... ]" part also handles a last line without a trailing newline.
while IFS= read -r qid || [ -n "$qid" ]; do
    # Strip carriage returns, newlines, and spaces.
    qid=$(echo "$qid" | tr -d "\r\n ")
    if [ -n "$qid" ]; then
        echo "----------------------------------------"
        echo "Processing item token: $qid"
        echo "----------------------------------------"
        python -m cli.importer_cli import-wikidata --qids "$qid"
    fi
done < /app/input.txt
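The loop above expects /app/input.txt to contain one QID per line. This README does not say how that file is produced. Assuming it is derived from the `qid` column of wikidata_mapping.csv, a sketch like the following could generate it. The function names, the default file names, and the de-duplication step are assumptions.

```python
import csv
import re

# A QID is "Q" followed by a positive integer; this filters out NO_WIKIDATA_LINK.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def extract_qids(rows):
    """Return the unique, valid QIDs from CSV rows, in first-seen order."""
    seen = set()
    qids = []
    for row in rows:
        qid = (row.get("qid") or "").strip()
        if QID_PATTERN.match(qid) and qid not in seen:
            seen.add(qid)
            qids.append(qid)
    return qids


def write_input_file(csv_path="wikidata_mapping.csv", out_path="input.txt"):
    """Write one QID per line to out_path and return how many were written."""
    with open(csv_path, newline="", encoding="utf-8") as source:
        qids = extract_qids(csv.DictReader(source))
    with open(out_path, "w", encoding="utf-8") as target:
        target.write("\n".join(qids) + ("\n" if qids else ""))
    return len(qids)
```

Calling `write_input_file()` in the directory that holds wikidata_mapping.csv would produce an input.txt suitable for the `kubectl cp` step.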