Structured metadata extraction from academic PDFs on International Large-Scale Assessments (PISA, TIMSS, etc.) and machine learning. Core stack: PyMuPDF for text, OpenAI for JSON extraction, Pydantic schema in src/schemas/models.py.
conda activate ilsa-literature-review # or your own environment
pip install -r ilsa_pipeline/requirements.txt
cp ilsa_pipeline/.env.example ilsa_pipeline/.env
# Add OPENAI_API_KEY

The root requirements.txt is a full conda-style lockfile. For extraction only, ilsa_pipeline/requirements.txt is sufficient.
Main orchestration (batch PDFs, JSON plus optional SQLite):
cd /path/to/ILSA_LLMs
python ilsa_pipeline/scripts/run_pipeline.py \
--pdf-dir ./data/pdfs \
--output-dir ./output \
--workers 3 \
--resume

Targeted batch: ilsa_pipeline/scripts/extract_targeted.py
- `output/json/*.json`: Per PDF, a single object with only the top-level keys `metadata` and `data` (same shape as the Pydantic `ILSAArticleMetadata` public schema). Pipeline-only failures use a sentinel prefix inside `data.outcome_summary` so `--resume` can retry those files.
- Parquet / SQLite helpers: `build_master_parquet`, `build_sqlite_database`, and `StorageManager` in `ilsa_pipeline/utils/storage.py`.
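
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the sentinel value `FAILURE_PREFIX`, the helper names `needs_retry` and `pending_pdfs`, and the assumption that each output file is named after the PDF stem are all hypothetical.

```python
import json
from pathlib import Path

# Hypothetical sentinel; the real prefix is defined in the pipeline code.
FAILURE_PREFIX = "[PIPELINE_ERROR]"


def needs_retry(json_path: Path, prefix: str = FAILURE_PREFIX) -> bool:
    """Return True if the output is missing, malformed, or marked as a pipeline failure."""
    if not json_path.exists():
        return True
    try:
        record = json.loads(json_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True
    # A valid output has exactly the two top-level keys `metadata` and `data`.
    if not isinstance(record, dict) or set(record) != {"metadata", "data"}:
        return True
    data = record["data"]
    if not isinstance(data, dict):
        return True
    summary = data.get("outcome_summary")
    return isinstance(summary, str) and summary.startswith(prefix)


def pending_pdfs(pdf_dir: Path, json_dir: Path) -> list[Path]:
    """List PDFs whose output (assumed to be <pdf stem>.json) still needs processing."""
    return sorted(
        pdf for pdf in pdf_dir.glob("*.pdf")
        if needs_retry(json_dir / f"{pdf.stem}.json")
    )
```

Under these assumptions, `pending_pdfs(Path("data/pdfs"), Path("output/json"))` would return the files that a resumed run still has to process.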
Legacy backups and alternate schemas live under cop_kutusu/.
The structured outputs are publicly available on HuggingFace:
dedemerve/ILSA-LLM-Extractor-Dataset
| Table | Rows | Description |
|---|---|---|
| articles_master | 1,264 | Core article metadata (deduplicated, enriched) |
| findings | 2,126 | Main findings per article |
| confounders | 8,334 | Confounders and covariates per study |
| raw/ | 1,756 | Per-article JSON extraction outputs |
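
As a sanity check before publishing, the row counts above could be compared against a local database. The sketch below assumes that a local SQLite file contains tables with the same names as the dataset tables, which this README does not confirm; `EXPECTED_ROWS` and `count_mismatches` are illustrative names, not part of the pipeline.

```python
import sqlite3

# Row counts taken from the table above; the SQLite table names are assumed.
EXPECTED_ROWS = {"articles_master": 1264, "findings": 2126, "confounders": 8334}


def count_mismatches(conn: sqlite3.Connection, expected: dict = EXPECTED_ROWS) -> dict:
    """Return {table: (expected, actual)} for tables whose row count differs.

    A missing table is reported with an actual count of None.
    """
    mismatches = {}
    for table, want in expected.items():
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if exists is None:
            mismatches[table] = (want, None)
            continue
        # Table names come from the fixed dictionary above, not from user input.
        (got,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        if got != want:
            mismatches[table] = (want, got)
    return mismatches
```

An empty result from `count_mismatches(sqlite3.connect(db_path))` means every expected table exists with the listed number of rows; `db_path` is whatever location the database was written to.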
To upload or update the dataset:
python scripts/upload_to_hf.py