Research software for fine-tuning large language models to match a target author's writing style, combining a calibrated stylometric evaluation suite with LoRA supervised fine-tuning and GRPO reinforcement learning.
VOICE is an NLP research toolkit for stylometric style alignment: fine-tuning a large language model so its outputs are stylistically consistent with a target author. The toolkit provides three integrated components:
- Stylometry: a suite of surface writing style metrics (word length moments, vocabulary richness, function word frequency, character n-gram diversity) organised into four metric groups.
- Evaluation: a calibrated alignment score $\mathcal{S}\in[0, 1]$ comparing model completions to a reference corpus using Wasserstein distance, normalised against within-author variation estimated from the training split via bootstrap resampling. Uncertainty estimates are provided via jackknife resampling.
- Fine-tuning: a CLI for running LoRA experiments (single runs or hyperparameter sweeps) via axolotl, with style alignment scoring built in. Both supervised fine-tuning and GRPO are available, with the latter using a custom typicality reward function.
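To make the metric families named above concrete, here is a minimal sketch of how such surface metrics could be computed for a single text. It is an illustration only, not VOICE's implementation: the function names, the tokenisation rule, the tiny function-word list and the choice of n = 3 are all assumptions made for this example. See docs/01_stylometry.md for the actual definitions.

```python
import re
from statistics import mean, pvariance

# Tiny illustrative function-word list (an assumption, not VOICE's list).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "it", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def surface_metrics(text, n=3):
    """Compute a few surface style metrics for one text (hypothetical helper)."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no word tokens")
    lengths = [len(w) for w in words]
    # Collapse whitespace so character n-grams do not depend on layout.
    chars = re.sub(r"\s+", " ", text.lower())
    ngrams = [chars[i:i + n] for i in range(len(chars) - n + 1)]
    return {
        # Word length moments: mean and (population) variance.
        "mean_word_length": mean(lengths),
        "word_length_variance": pvariance(lengths),
        # Vocabulary richness as a simple type-token ratio.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of tokens that are function words.
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / len(words),
        # Distinct character n-grams divided by total character n-grams.
        "char_ngram_diversity": len(set(ngrams)) / len(ngrams),
    }


text = "The cat sat on the mat and the dog sat in the sun."
metrics = surface_metrics(text)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Applying a function like this to every completion and every reference text yields one distribution per metric, which is the kind of input a distributional comparison needs.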
Example: Function word ratio distributions for base model and VOICE fine-tuned model completions vs. the reference corpus with the Wasserstein distance annotated.
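The comparison in this example can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not VOICE's scoring code: `wasserstein_1d` and `bootstrap_baseline` are hypothetical helpers, the resampling scheme is one plausible way to estimate within-author variation, and the numbers are toy values invented for the demonstration rather than results. The mapping from the normalised distance to the final score in [0, 1] is not reproduced here; docs/02_evals.md describes the actual methodology.

```python
import random
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two 1D empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    grid = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def bootstrap_baseline(reference, n_resamples=200, seed=0):
    """Estimate within-author variation (assumed scheme).

    Draws pairs of bootstrap resamples from the reference sample and
    averages the Wasserstein distance between them.
    """
    rng = random.Random(seed)
    n = len(reference)
    distances = []
    for _ in range(n_resamples):
        a = [rng.choice(reference) for _ in range(n)]
        b = [rng.choice(reference) for _ in range(n)]
        distances.append(wasserstein_1d(a, b))
    return sum(distances) / len(distances)


# Toy function-word ratios per text (made-up values for illustration).
reference = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.43, 0.48, 0.51]
base_model = [0.30, 0.31, 0.33, 0.35, 0.36, 0.32, 0.34, 0.29, 0.37, 0.33]
tuned = [0.41, 0.44, 0.46, 0.49, 0.50, 0.53, 0.42, 0.47, 0.52, 0.45]

baseline = bootstrap_baseline(reference)
ratio_base = wasserstein_1d(base_model, reference) / baseline
ratio_tuned = wasserstein_1d(tuned, reference) / baseline
print(f"base model: {ratio_base:.2f}x within-author variation")
print(f"tuned model: {ratio_tuned:.2f}x within-author variation")
```

Dividing by the bootstrap baseline expresses each distance in units of the author's own variability, so a value near 1 means the completions differ from the reference about as much as the author differs from themselves.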
docs/
├── 00_data.md - Datasets: format and Hugging Face references
├── 01_stylometry.md - Stylometric metrics: definitions and catalogue
├── 02_evals.md - Evaluation suite: scoring methodology and API
├── 03_rl.md - Reward functions for GRPO training
└── 04_cli.md - Fine-tuning CLI: single runs and sweeps
VOICE requires Python 3.12. Training functionality requires Linux (the pinned axolotl version is Linux-only); the evaluation and stylometry components run on all platforms.
Using uv (recommended):
git clone https://github.com/acceleratescience/voice
cd voice
uv sync
source .venv/bin/activate
Authenticate with Hugging Face before running fine-tuning jobs:
huggingface-cli login
Run a single fine-tuning job:
voice finetune single configs/single/example.yaml
This trains a LoRA adapter on top of Llama-3.1-8B-Instruct and writes per-epoch completions and alignment scores to runs/{run_name}/.
For a hyperparameter sweep:
voice finetune sweep configs/sweep/example.yaml
See docs/04_cli.md for the full CLI reference, config format, and output layout.
The evaluation suite can be used independently of the CLI:
from voice import get_dataset, make_comparison, DatasetSpec
from voice.datasets._schema import Split

ds = get_dataset(
    DatasetSpec(
        repo_id="AccelerateScience/bo-press-conference-qa",
        splits=(Split.TRAIN, Split.VALIDATION, Split.TEST),
    )
)

# completions: list[Example], model outputs on the same prompts as ds.validation
results = make_comparison(completions, ds)
print(results.score)        # overall alignment score in [0, 1]
print(results.group_tails)  # per-group breakdown
print(results.score_ci())   # 90% jackknife confidence interval
See docs/02_evals.md for the full scoring methodology and available diagnostic fields.
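For readers unfamiliar with jackknife intervals, the following is a minimal sketch of the general technique. It is not the code behind `score_ci()`: the helper name `jackknife_ci`, the leave-one-out scheme over individual values and the normal approximation (z = 1.645 for a two-sided 90% interval) are assumptions made for this illustration.

```python
import math
from statistics import mean


def jackknife_ci(values, statistic=mean, z=1.645):
    """Leave-one-out jackknife confidence interval (hypothetical helper).

    Returns (lower, upper) around the statistic computed on all values,
    using a normal approximation with the jackknife standard error.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    full = statistic(values)
    # Recompute the statistic with each observation removed in turn.
    loo = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    # Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations).
    se = math.sqrt((n - 1) / n * sum((x - loo_mean) ** 2 for x in loo))
    return full - z * se, full + z * se


per_item_scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy values, not real scores
lower, upper = jackknife_ci(per_item_scores)
print(f"90% interval: [{lower:.3f}, {upper:.3f}]")
```

For the mean, the jackknife standard error coincides with the usual sample standard deviation divided by the square root of n, which makes this toy case easy to check by hand.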
Contributions are welcome. To propose a change:
- Fork the repository
- Create a feature branch (git checkout -b feature/my-change)
- Commit your changes (git commit -m 'Add my change')
- Push to the branch (git push origin feature/my-change)
- Open a pull request
Please raise an issue first for substantial changes.
Distributed under the GNU General Public License. See LICENSE for details.