PathBench is a benchmark designed to evaluate pathological speech assessment systems.
Speaker-level Pearson correlation coefficient (PCC) results can be found below. Currently, ArtP is the best reference-based method and DArtP is the best reference-free method.
- CSV (canonical, in this repo): results_table.csv
- Google Docs
There are several use cases for PathBench:
No install needed beyond numpy. Provide a CSV of predicted speaker scores and compare against the PathBench ground truth.
Single dataset:
python scripts/evaluate_from_csv.py \
--predictions results/datasets/copas/pathological/word/balanced/dummy_scores.csv \
--ground-truth datasets/copas/pathological/word/balanced/spk2score
Full benchmark: evaluate one evaluator across all datasets. Place your CSVs in a results directory that mirrors the dataset structure, with each CSV named <evaluator>.csv. Dummy score files are provided as a worked example:
results/datasets/
copas/pathological/word/balanced/dummy_scores.csv
torgo/pathological/utterances/balanced/dummy_scores.csv
youtube/dummy_scores.csv
Then run:
python scripts/evaluate_from_csv.py \
--results-dir results/datasets/ \
--datasets-root datasets/ \
--evaluator dummy_scores
This prints a table with the Pearson correlation for each dataset and the mean across all datasets.
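The mirrored layout described above can be illustrated with a minimal sketch that pairs each `<evaluator>.csv` with the ground-truth file of the same dataset. The helper name `pair_predictions_with_truth` is hypothetical, and this is not necessarily how `scripts/evaluate_from_csv.py` resolves paths internally; it only assumes that the ground truth lives in a `spk2score` file under the matching dataset directory, as in the single-dataset command.

```python
from pathlib import Path
import tempfile


def pair_predictions_with_truth(results_dir, datasets_root, evaluator):
    """Map each dataset's relative path to (prediction CSV, ground-truth path).

    Hypothetical helper: relies only on the mirrored directory layout.
    """
    results_root = Path(results_dir)
    pairs = {}
    for csv_path in sorted(results_root.rglob(f"{evaluator}.csv")):
        # The dataset is identified by the CSV's directory relative to the results root.
        rel = csv_path.parent.relative_to(results_root)
        pairs[rel.as_posix()] = (csv_path, Path(datasets_root) / rel / "spk2score")
    return pairs


# Demo on a throwaway directory that mimics the example layout.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "copas/pathological/word/balanced",
        "torgo/pathological/utterances/balanced",
        "youtube",
    ]:
        dataset_dir = Path(tmp) / rel
        dataset_dir.mkdir(parents=True)
        (dataset_dir / "dummy_scores.csv").write_text("speaker_id,score\n")
    pairs = pair_predictions_with_truth(tmp, "datasets/", "dummy_scores")
    for name, (pred_path, truth_path) in pairs.items():
        print(name, "->", truth_path.as_posix())
```

Keying the result by relative path keeps the dataset name stable regardless of where the results directory is located.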
Expected CSV format (speaker IDs must match the ground truth exactly):
speaker_id,score
C16,15.61
C17,16.74
Every speaker ID must appear in both the CSV and the ground truth; the script exits with an error if any are missing from either side.
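As a rough illustration of what the comparison involves, the sketch below parses a prediction CSV in the format shown above, checks that the speaker IDs match, and computes the Pearson correlation with numpy. It is not the code of `scripts/evaluate_from_csv.py`. The `spk2score` layout (one `<speaker_id> <score>` pair per line) is an assumption, and the third speaker and all ground-truth values are made up for the demo.

```python
import csv
import io

import numpy as np


def read_predictions(csv_text):
    """Parse 'speaker_id,score' CSV text into a dict of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["speaker_id"]: float(row["score"]) for row in reader}


def read_spk2score(text):
    """Parse ground truth, ASSUMED to be '<speaker_id> <score>' per line."""
    scores = {}
    for line in text.splitlines():
        if line.strip():
            speaker, value = line.split()
            scores[speaker] = float(value)
    return scores


def speaker_pcc(pred, truth):
    """Pearson correlation over speakers; fails if the ID sets differ."""
    mismatched = set(pred) ^ set(truth)
    if mismatched:
        raise ValueError(f"Speaker ID mismatch: {sorted(mismatched)}")
    speakers = sorted(truth)
    x = np.array([pred[s] for s in speakers])
    y = np.array([truth[s] for s in speakers])
    return float(np.corrcoef(x, y)[0, 1])


# Illustrative values only, not real PathBench data.
pred = read_predictions("speaker_id,score\nC16,15.61\nC17,16.74\nC18,12.00\n")
truth = read_spk2score("C16 3.0\nC17 3.5\nC18 1.0\n")
pcc = speaker_pcc(pred, truth)
print(f"PCC: {pcc:.3f}")
```

Sorting the speakers before building the arrays ensures both vectors are aligned on the same IDs.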
If your evaluator works well, consider opening a pull request to incorporate it into the codebase following the contribution guidelines. This is the only way we can verify that you are not "cheating", i.e., using resources that are not allowed for your evaluator.
Follow the steps in the Installation section. Then you can run inference on individual audio files. For example, to score a single utterance with ArtP:
from pathbench import ArticulatoryPrecisionEvaluator
evaluator = ArticulatoryPrecisionEvaluator()
score = evaluator.score("utt1", "/path/to/audio.wav", transcription="the cat sat on the mat", language="en")
print(f"ArtP score: {score}")
See CONTRIBUTING.md for a step-by-step guide.
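To go from the single-utterance call shown earlier to speaker-level scores, one option is to score every utterance and average per speaker. The sketch below uses a stub in place of a real evaluator so that it runs without PathBench installed. `StubEvaluator` and `score_speakers` are hypothetical names, and averaging utterance scores is an assumption, not necessarily how PathBench aggregates internally.

```python
from collections import defaultdict
from statistics import mean


class StubEvaluator:
    """Stand-in with the same score(...) call shape as the ArtP example."""

    def score(self, utt_id, wav_path, transcription=None, language="en"):
        # Placeholder: returns the transcription length, NOT a real metric.
        return float(len(transcription or ""))


def score_speakers(evaluator, utterances, language="en"):
    """Average utterance scores per speaker.

    utterances: iterable of (speaker_id, utt_id, wav_path, transcription).
    """
    per_speaker = defaultdict(list)
    for speaker, utt_id, wav_path, text in utterances:
        value = evaluator.score(utt_id, wav_path, transcription=text, language=language)
        per_speaker[speaker].append(value)
    return {speaker: mean(values) for speaker, values in per_speaker.items()}


utterances = [
    ("spk1", "utt1", "/path/to/a.wav", "the cat"),
    ("spk1", "utt2", "/path/to/b.wav", "sat"),
    ("spk2", "utt3", "/path/to/c.wav", "on the mat"),
]
speaker_scores = score_speakers(StubEvaluator(), utterances)
print(speaker_scores)
```

Swapping `StubEvaluator()` for a real evaluator instance should work as long as it exposes the same `score` call.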
- Follow the steps in the Installation section.
- Follow the steps in the Downloads section.
- Follow the steps in the Testing section.
- Run the evaluation script on the dataset(s) you want to evaluate. For example, to evaluate on the YouTube dataset:
python scripts/evaluate_spk2score.py datasets/youtube
To also write per-evaluator score CSVs, pass --results-dir:
python scripts/evaluate_spk2score.py datasets/youtube --results-dir results/datasets/
You can evaluate multiple datasets in a single run:
python scripts/evaluate_spk2score.py \
datasets/copas/pathological/word/balanced \
datasets/torgo/pathological/utterances/balanced \
datasets/youtube
Results are written to the results_11/ directory as timestamped text files containing per-evaluator Pearson correlations and a summary table.
We are continuously trying to make the installation easier for your use case.
If you can start from a clean AWS/GCE instance, please do so and follow the make installation.
If you are working on a highly restricted HPC cluster, we recommend starting from the Singularity container provided.
Package installation is the recommended route when you want to incorporate PathBench into an existing environment. In this case, you are on your own when it comes to resolving dependency conflicts.
PathBench cannot be published to PyPI because it depends on Git-hosted forks of phonemizer and pyctcdecode.
System dependencies (not installable via pip — must be installed separately):
- espeak-ng at commit 2ea41210 (post-1.52.0), required by the phonemizer for grapheme-to-phoneme conversion. The exact commit matters: different espeak-ng versions produce different IPA symbols for some languages (e.g. Italian ɾ vs r), which affects phoneme-based metrics (PER, dPER, ArtP). Build from source:
git clone https://github.com/espeak-ng/espeak-ng.git
cd espeak-ng && git checkout 2ea41210
cmake -B build -DUSE_ASYNC=OFF -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc) && sudo cmake --install build
- PyTorch with CUDA support — install following pytorch.org before installing pathbench
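To see why the pinned espeak-ng commit matters for phoneme-based metrics, consider a minimal phoneme error rate computed as edit distance divided by reference length. This is a generic sketch, not PathBench's PER implementation, and the phoneme sequences are hypothetical: two references for the same word that differ only in the rhotic symbol.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical references from two phonemizer builds for the same word.
ref_with_r = ["k", "a", "r", "o"]
ref_with_tap = ["k", "a", "ɾ", "o"]
# Hypothetical recogniser output.
recognised = ["k", "a", "ɾ", "o"]

per_with_r = phoneme_error_rate(ref_with_r, recognised)
per_with_tap = phoneme_error_rate(ref_with_tap, recognised)
print(per_with_r, per_with_tap)
```

The same recognised sequence is scored as one substitution error against one reference and as error-free against the other, so scores obtained with different espeak-ng builds are not directly comparable.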
Option A (install from a GitHub Release):
pip install https://github.com/karkirowle/pathbench/releases/download/v0.1.0/pathbench-0.1.0-py3-none-any.whl
Option B (install directly from the repository):
pip install git+https://github.com/karkirowle/pathbench.git
With optional dependencies (scripts, docs):
pip install "pathbench[scripts] @ git+https://github.com/karkirowle/pathbench.git"
The make installation route assumes the default setup of a standard Ubuntu 22.04 image (ubuntu-2204-jammy).
sudo apt-get update -qq
sudo apt install python3 python3-pip python3-venv build-essential cmake libfftw3-dev liblapack-dev -y
# Install espeak-ng from source (pinned commit for reproducible phonemization)
git clone https://github.com/espeak-ng/espeak-ng.git /tmp/espeak-ng
cd /tmp/espeak-ng && git checkout 2ea41210
cmake -B build -DUSE_ASYNC=OFF -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc) && sudo cmake --install build && sudo ldconfig
cd -
git clone git@github.com:karkirowle/pathbench.git
cd pathbench/tools && make
cd ..
source tools/venv/bin/activate
Without sudo access: A containerised environment such as Docker is recommended.
We are not allowed to share these datasets ourselves. However, all of them are relatively easy to access. Please obtain your own copy.
After downloading the datasets, repoint the wav.scp files to your local dataset root. We do not provide a script for this, but you can use a regex replacement such as:
find datasets/ -name "wav.scp" -exec sed -i 's|/data/group1/z40484r/datasets|/path/to/your/datasets|g' {} +
EasyCall fix: Some EasyCall audio files have a stray space in their filename (m13 _ instead of m13_). Rename them before running the benchmark:
find /path/to/your/datasets/easycall/EasyCall/m13 -name "m13 _*" -exec bash -c 'mv "$1" "${1//m13 _/m13_}"' _ {} \;
The n-gram models required for DArtP and ArtP are included in the Oral Cancer - YouTube download.
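If `sed` is not available, the wav.scp repointing step described above can also be done in Python. The sketch below is an equivalent plain-string replacement, not a script shipped with PathBench. The helper name `repoint_wav_scp` is hypothetical, and the demo line assumes a Kaldi-style `<utt_id> <path>` entry.

```python
from pathlib import Path
import tempfile


def repoint_wav_scp(datasets_dir, old_root, new_root):
    """Rewrite every wav.scp under datasets_dir, replacing old_root with new_root.

    Returns the list of files that were actually modified.
    """
    changed = []
    for scp in sorted(Path(datasets_dir).rglob("wav.scp")):
        text = scp.read_text(encoding="utf-8")
        updated = text.replace(old_root, new_root)
        if updated != text:
            scp.write_text(updated, encoding="utf-8")
            changed.append(scp)
    return changed


# Demo on a throwaway directory with a single made-up entry.
with tempfile.TemporaryDirectory() as tmp:
    scp = Path(tmp) / "youtube" / "wav.scp"
    scp.parent.mkdir(parents=True)
    scp.write_text(
        "utt1 /data/group1/z40484r/datasets/youtube/utt1.wav\n", encoding="utf-8"
    )
    changed = repoint_wav_scp(
        tmp, "/data/group1/z40484r/datasets", "/path/to/your/datasets"
    )
    result = scp.read_text(encoding="utf-8")
    print(result, end="")
```

Returning the modified files makes it easy to confirm that every dataset was touched before running the benchmark.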
After this setup, we recommend running the unit tests below. If they pass, you can be reasonably confident that the installation is intact.
source tools/venv/bin/activate
python -m pytest tests/test_evaluators.py::TestEvaluatorMethods -v
All tests should pass. If all evaluator tests fail simultaneously, the reference audio file in tests/data/test_audio.wav may be corrupted; the test_audio_integrity test will confirm this.
Note: During the NAD evaluator tests you will see a Wav2Vec2Model LOAD REPORT table listing several keys (e.g. project_q, quantizer) as UNEXPECTED. These warnings are harmless: the keys belong to pre-training heads that are not needed for feature extraction and can be safely ignored.
python -m pytest tests/test_evaluators.py::TestDatasetIntegrity::test_audio_file_hashes -v
Share these hashes alongside your results so others can verify they are using the same data.
Note: Different versions of UASpeech exist. A denoising step was applied to UASpeech in December 2020. PathBench uses the denoised version canonically: datasets/uaspeech/ points at the noisereduce audio set. If you have the pre-denoising audio, your hashes will not match.
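For readers who want to compare individual files by hand, a file digest can be computed with the standard library. This is a minimal sketch, and the choice of SHA-256 is an assumption: the algorithm used by `test_audio_file_hashes` is not stated here, so these digests may not be directly comparable with the test output.

```python
import hashlib
from pathlib import Path
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file with placeholder bytes (not real audio).
data = b"placeholder audio bytes"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.wav"
    demo_path.write_bytes(data)
    demo_digest = file_sha256(demo_path)
    print(demo_digest)
```

Two copies of a file yield the same digest only if they are byte-identical, which is why the pre-denoising and denoised UASpeech audio produce different hashes.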
If you use PathBench in your research, please cite:
@misc{halpern2026pathbenchspeechintelligibilitybenchmark,
title={PathBench: Speech Intelligibility Benchmark for Automatic Pathological Speech Assessment},
author={Bence Mark Halpern and Thomas Tienkamp and Defne Abur and Tomoki Toda},
year={2026},
eprint={2603.08097},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08097},
}
The PathBench repository code is released under the MIT License. However, some individual evaluators include or derive from code released under more restrictive licenses. In particular, the PraatSpeechRateEvaluator in pathbench/speech_rate.py is based on the Praat Script Syllable Nuclei by Nivja de Jong and Ton Wempe, which is licensed under the GNU General Public License v3 (GPL-3.0).
We are not able to provide legal advice. If you believe there is a licensing concern with any component in this codebase, please open an issue.
Many parts were shamelessly copied from other libraries or reproduced after consultation with their authors. I would especially like to thank Martijn Bartelds and Parvaneh Janbakhshi.
- WADA-SNR: https://gist.github.com/johnmeade/d8d2c67b87cda95cd253f55c21387e75
- NAD: https://github.com/Bartelds/neural-acoustic-distance
- CPP: https://github.com/satvik-dixit/CPP
- Unit test audio from the Speech Accent Archive
This work is partly financed by the Dutch Research Council (NWO) under project number 019.232SG.011, and partly supported by JST CREST JPMJCR19A3, Japan.
Bence Mark Halpern, Nagoya University
