PathBench

PathBench is a benchmark designed to evaluate pathological speech assessment systems.

Results

Speaker-level Pearson correlation coefficient (PCC) results can be found below. Currently, ArtP is the best reference-based method and DArtP is the best reference-free method.

CSV (canonical, in this repo): results_table.csv
Google Docs

Usage guide

There are several use cases for PathBench:

I want to evaluate my newly developed predictor
I want to use the predictors developed by you
I want to contribute a new predictor to this repository, how do I do that?
I want to reproduce your research

I want to evaluate my newly developed predictor

No installation is needed beyond numpy. Provide a CSV of predicted speaker scores and compare it against the PathBench ground truth.

Single dataset:

python scripts/evaluate_from_csv.py \
    --predictions results/datasets/copas/pathological/word/balanced/dummy_scores.csv \
    --ground-truth datasets/copas/pathological/word/balanced/spk2score

Full benchmark — evaluate one evaluator across all datasets. Place your CSVs in a results directory that mirrors the dataset structure, with each CSV named <evaluator>.csv. Dummy score files are provided as a worked example:

results/datasets/
  copas/pathological/word/balanced/dummy_scores.csv
  torgo/pathological/utterances/balanced/dummy_scores.csv
  youtube/dummy_scores.csv

Then run:

python scripts/evaluate_from_csv.py \
    --results-dir results/datasets/ \
    --datasets-root datasets/ \
    --evaluator dummy_scores

This prints a table with the Pearson correlation for each dataset and the mean across all datasets.
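
The mirroring convention can be pictured with a short sketch. It is not the logic of scripts/evaluate_from_csv.py; the function name find_dataset_pairs is hypothetical, and it assumes that every dataset directory holds a ground-truth file named spk2score, as in the single-dataset example.

```python
from pathlib import Path


def find_dataset_pairs(results_dir, datasets_root, evaluator):
    """Pair each <evaluator>.csv with the ground truth of the mirrored dataset.

    Returns a list of (dataset_name, predictions_csv, ground_truth_path).
    ASSUMPTION: the ground truth lives at <datasets_root>/<dataset>/spk2score.
    """
    results_dir = Path(results_dir)
    datasets_root = Path(datasets_root)
    pairs = []
    for csv_path in sorted(results_dir.rglob(f"{evaluator}.csv")):
        # The dataset name is the CSV's directory relative to the results root.
        dataset = csv_path.parent.relative_to(results_dir)
        ground_truth = datasets_root / dataset / "spk2score"
        pairs.append((dataset.as_posix(), csv_path, ground_truth))
    return pairs
```

Calling find_dataset_pairs("results/datasets/", "datasets/", "dummy_scores") on the layout above would pair results/datasets/youtube/dummy_scores.csv with datasets/youtube/spk2score, and likewise for the other datasets.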

Expected CSV format (speaker IDs must match the ground truth exactly):

speaker_id,score
C16,15.61
C17,16.74

Every speaker ID must appear in both the CSV and the ground truth; the script exits with an error if any are missing from either side.
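
For orientation, the ID check and the metric can be sketched with numpy alone. This is an illustration, not the code of scripts/evaluate_from_csv.py: the function names are hypothetical, and the spk2score file is assumed to contain one whitespace-separated "speaker_id score" pair per line.

```python
import csv
import io

import numpy as np


def read_predictions(csv_text):
    """Parse CSV text with a 'speaker_id,score' header into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["speaker_id"]: float(row["score"]) for row in reader}


def read_spk2score(text):
    """Parse ground truth; ASSUMED format is 'speaker_id score' per line."""
    scores = {}
    for line in text.splitlines():
        if line.strip():
            speaker_id, value = line.split()
            scores[speaker_id] = float(value)
    return scores


def speaker_pcc(pred, truth):
    """Pearson correlation over speakers; raises if the ID sets differ."""
    if set(pred) != set(truth):
        mismatched = sorted(set(pred) ^ set(truth))
        raise ValueError(f"speaker ID mismatch: {mismatched}")
    speakers = sorted(truth)
    x = np.array([pred[s] for s in speakers])
    y = np.array([truth[s] for s in speakers])
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return float(np.corrcoef(x, y)[0, 1])
```

In practice the two parsers would be fed the contents of the predictions CSV and the spk2score file, for example via pathlib.Path(...).read_text().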

If your evaluator works well, consider opening a pull request that adds it to the codebase following the contribution guidelines. This is the only way we can verify that you are not "cheating", i.e., using resources that are not allowed for your evaluator.

I want to use the predictors developed by you

Follow the steps in the Installation section. Then you can run inference on individual audio files. For example, to score a single utterance with ArtP:

from pathbench import ArticulatoryPrecisionEvaluator

evaluator = ArticulatoryPrecisionEvaluator()
score = evaluator.score("utt1", "/path/to/audio.wav", transcription="the cat sat on the mat", language="en")
print(f"ArtP score: {score}")
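
To go from single utterances to the speaker-level CSV format used by evaluate_from_csv.py, a loop like the following may help. It is a sketch under assumptions: the helper names are hypothetical, the score is assumed to be numeric, and averaging utterance scores per speaker is an assumption of this example rather than a documented PathBench convention. The only PathBench call used is evaluator.score(...) with the arguments shown above.

```python
import csv
from collections import defaultdict
from statistics import mean


def score_speakers(evaluator, utterances, language="en"):
    """Score utterances and aggregate them per speaker.

    utterances: iterable of (speaker_id, utt_id, wav_path, transcription).
    ASSUMPTION: a speaker's score is the mean of their utterance scores.
    """
    per_speaker = defaultdict(list)
    for speaker_id, utt_id, wav_path, text in utterances:
        score = evaluator.score(
            utt_id, wav_path, transcription=text, language=language
        )
        per_speaker[speaker_id].append(score)
    return {spk: mean(values) for spk, values in per_speaker.items()}


def write_scores_csv(path, speaker_scores):
    """Write speaker scores with the 'speaker_id,score' header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["speaker_id", "score"])
        for speaker_id in sorted(speaker_scores):
            writer.writerow([speaker_id, speaker_scores[speaker_id]])
```

With PathBench installed, an ArticulatoryPrecisionEvaluator() instance can be passed as the evaluator argument.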

I want to contribute a new predictor to this repository, how do I do that?

See CONTRIBUTING.md for a step-by-step guide.

I want to reproduce your research

Follow the steps in the Installation section.
Follow the steps in the Downloads section.
Follow the steps in the Testing section.
Run the evaluation script on the dataset(s) you want to evaluate. For example, to evaluate on the YouTube dataset:

python scripts/evaluate_spk2score.py datasets/youtube

To also write per-evaluator score CSVs, pass --results-dir:

python scripts/evaluate_spk2score.py datasets/youtube --results-dir results/datasets/

You can evaluate multiple datasets in a single run:

python scripts/evaluate_spk2score.py \
    datasets/copas/pathological/word/balanced \
    datasets/torgo/pathological/utterances/balanced \
    datasets/youtube

Results are written to the results_11/ directory as timestamped text files containing per-evaluator Pearson correlations and a summary table.

Installation

We are continuously trying to make the installation easier for your use case.

If you can start from a clean AWS/GCE instance, please do so and follow the make installation.

If you are working on a highly restricted HPC cluster, I recommend starting from the Singularity container provided.

Package installation is the recommended pathway when you want to integrate PathBench into an existing environment. In this case, you are on your own when it comes to resolving dependency conflicts.

Package installation

PathBench cannot be published to PyPI because it depends on Git-hosted forks of phonemizer and pyctcdecode.

System dependencies (not installable via pip — must be installed separately):

espeak-ng at commit 2ea41210 (post-1.52.0) — required by the phonemizer for grapheme-to-phoneme conversion. The exact commit matters: different espeak-ng versions produce different IPA symbols for some languages (e.g. Italian ɾ vs r), which affects phoneme-based metrics (PER, dPER, ArtP). Build from source:
```
git clone https://github.com/espeak-ng/espeak-ng.git
cd espeak-ng && git checkout 2ea41210
cmake -B build -DUSE_ASYNC=OFF -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc) && sudo cmake --install build
```
PyTorch with CUDA support — install following pytorch.org before installing pathbench

Option A — Install from a GitHub Release:

pip install https://github.com/karkirowle/pathbench/releases/download/v0.1.0/pathbench-0.1.0-py3-none-any.whl

Option B — Install directly from the repository:

pip install git+https://github.com/karkirowle/pathbench.git

With optional dependencies (scripts, docs):

pip install "pathbench[scripts] @ git+https://github.com/karkirowle/pathbench.git"

Make installation

The make installation route assumes the default setup of a standard Ubuntu 22.04 image (ubuntu-2204-jammy).

sudo apt-get update -qq
sudo apt install python3 python3-pip python3-venv build-essential cmake libfftw3-dev liblapack-dev -y
# Install espeak-ng from source (pinned commit for reproducible phonemization)
git clone https://github.com/espeak-ng/espeak-ng.git /tmp/espeak-ng
cd /tmp/espeak-ng && git checkout 2ea41210
cmake -B build -DUSE_ASYNC=OFF -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc) && sudo cmake --install build && sudo ldconfig
cd -
git clone git@github.com:karkirowle/pathbench.git
cd pathbench/tools && make
cd ..
source tools/venv/bin/activate

Without sudo access: A containerised environment such as Docker is recommended.

Downloads

Datasets

We are not allowed to share these datasets ourselves; however, all of them are relatively easy to access. Please obtain your own copy.

After downloading the datasets, repoint the wav.scp files to your local dataset root. We do not provide a script for this, but you can use a regex replacement such as:

find datasets/ -name "wav.scp" -exec sed -i 's|/data/group1/z40484r/datasets|/path/to/your/datasets|g' {} +
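
Where sed -i is unavailable or behaves differently (for example on macOS), the same replacement can be sketched in Python. The function name repoint_wav_scp is hypothetical and not part of PathBench; it performs a plain string replacement of the old root in every wav.scp it finds.

```python
from pathlib import Path


def repoint_wav_scp(datasets_dir, old_root, new_root):
    """Replace old_root with new_root in every wav.scp under datasets_dir.

    Returns the list of files that were actually modified.
    """
    changed = []
    for scp in sorted(Path(datasets_dir).rglob("wav.scp")):
        text = scp.read_text(encoding="utf-8")
        updated = text.replace(old_root, new_root)
        if updated != text:
            scp.write_text(updated, encoding="utf-8")
            changed.append(scp)
    return changed
```

For example, repoint_wav_scp("datasets/", "/data/group1/z40484r/datasets", "/path/to/your/datasets") mirrors the sed command above.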

EasyCall fix: Some EasyCall audio files have a stray space in their filename (m13 _ instead of m13_). Rename them before running the benchmark:

find /path/to/your/datasets/easycall/EasyCall/m13 -name "m13 _*" -exec bash -c 'mv "$1" "${1//m13 _/m13_}"' _ {} \;
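
A Python version of the same rename is sketched below for those who prefer to inspect the result before or after running it. The function name fix_m13_filenames is hypothetical; it only touches regular files whose name starts with "m13 _".

```python
from pathlib import Path


def fix_m13_filenames(m13_dir):
    """Rename files named 'm13 _*' to 'm13_*' under m13_dir.

    Returns the list of new paths.
    """
    renamed = []
    # sorted() materialises the matches before any file is renamed.
    for path in sorted(Path(m13_dir).rglob("m13 _*")):
        if not path.is_file():
            continue
        target = path.with_name(path.name.replace("m13 _", "m13_"))
        path.rename(target)
        renamed.append(target)
    return renamed
```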

N-gram models

The n-gram models required for DArtP and ArtP are included in the Oral Cancer - YouTube download.

Testing

Installation integrity

It is recommended that after this setup you run the unit tests below. If these pass you can be reasonably sure about installation integrity.

source tools/venv/bin/activate
python -m pytest tests/test_evaluators.py::TestEvaluatorMethods -v

All tests should pass. If all evaluator tests fail simultaneously, the reference audio file in tests/data/test_audio.wav may be corrupted — the test_audio_integrity test will confirm this.

Note: During the NAD evaluator tests you will see a Wav2Vec2Model LOAD REPORT table listing several keys (e.g. project_q, quantizer) as UNEXPECTED. These warnings are harmless — the keys belong to pre-training heads that are not needed for feature extraction and can be safely ignored.

Dataset integrity

python -m pytest tests/test_evaluators.py::TestDatasetIntegrity::test_audio_file_hashes -v

Share these hashes alongside your results so others can verify they are using the same data.
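
If you want to produce a hash listing independently of the test suite, a sketch like the one below can be used. The hash algorithm used by the PathBench test is not stated here, so SHA-256 and the "*.wav" pattern are assumptions, and the function names are hypothetical; the output is not guaranteed to match what the test compares against.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_audio_files(root, pattern="*.wav"):
    """Map each matching file (path relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob(pattern))
    }
```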

Note: Different versions of UASpeech exist. A denoising step was applied to UASpeech in December 2020. PathBench uses the denoised version canonically — datasets/uaspeech/ points at the noisereduce audio set. If you have the pre-denoising audio your hashes will not match.

Citation

If you use PathBench in your research, please cite:

@misc{halpern2026pathbenchspeechintelligibilitybenchmark,
      title={PathBench: Speech Intelligibility Benchmark for Automatic Pathological Speech Assessment},
      author={Bence Mark Halpern and Thomas Tienkamp and Defne Abur and Tomoki Toda},
      year={2026},
      eprint={2603.08097},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08097},
}

License

The PathBench repository code is released under the MIT License. However, some individual evaluators include or derive from code released under more restrictive licenses. In particular, the PraatSpeechRateEvaluator in pathbench/speech_rate.py is based on the Praat Script Syllable Nuclei by Nivja de Jong and Ton Wempe, which is licensed under the GNU General Public License v3 (GPL-3.0).

We are not able to provide legal advice. If you believe there is a licensing concern with any component in this codebase, please open an issue.

Acknowledgements

Many parts were shamelessly copied from other libraries or reproduced after consultation with their authors. I would especially like to thank Martijn Bartelds and Parvaneh Janbakhshi.

WADA-SNR: https://gist.github.com/johnmeade/d8d2c67b87cda95cd253f55c21387e75
NAD: https://github.com/Bartelds/neural-acoustic-distance
CPP: https://github.com/satvik-dixit/CPP
Unit test audio from the Speech Accent Archive

Funding

This work is partly financed by the Dutch Research Council (NWO) under project number 019.232SG.011, and partly supported by JST CREST JPMJCR19A3, Japan.

Author

Bence Mark Halpern, Nagoya University
