+ Research software for fine-tuning large language models to match a target author's writing style,
+ combining a calibrated stylometric evaluation suite with LoRA supervised fine-tuning and GRPO reinforcement learning.
+
+
+
+
+---
+
+## Overview
+
+VOICE is an NLP research toolkit for **stylometric style alignment**: fine-tuning a large language model so its outputs are stylistically consistent with a target author. The toolkit provides three integrated components:
+
+- **Stylometry**: a suite of surface writing style metrics (word length moments, vocabulary richness, function word frequency, character n-gram diversity) organised into four metric groups.
+- **Evaluation**: a calibrated alignment score $\mathcal{S}\in[0, 1]$ comparing model completions to a reference corpus using Wasserstein distance, normalised against within-author variation estimated from the training split via bootstrap resampling. Uncertainty estimates are provided via jackknife resampling.
+- **Fine-tuning**: a CLI for running LoRA experiments (single runs or hyperparameter sweeps) via [axolotl](https://github.com/axolotl-ai-cloud/axolotl), with style alignment scoring built in. Both supervised fine-tuning (SFT) and GRPO are supported; GRPO uses a custom *typicality reward* function.
+
+
+
+
+Example: function word ratio distributions for base-model and VOICE fine-tuned completions versus the reference corpus, with the Wasserstein distance annotated.
+
+---
+
+## Installation
+
+VOICE requires Python 3.12. Training requires Linux because the pinned axolotl version is Linux-only; the evaluation and stylometry components run on all platforms.
+
+Using [uv](https://github.com/astral-sh/uv) (recommended):
+
+```bash
+git clone https://github.com/acceleratescience/voice
+cd voice
+uv sync
+source .venv/bin/activate
+```
+
+Authenticate with Hugging Face before running fine-tuning jobs:
+
+```bash
+huggingface-cli login
+```
+
+
+
+---
+
+## Quick Start
+
+Run a single fine-tuning job:
+
+```bash
+voice finetune single configs/single/example.yaml
+```
+
+This trains a LoRA adapter on top of Llama-3.1-8B-Instruct and writes per-epoch completions and alignment scores to `runs/{run_name}/`.
+
+For a hyperparameter sweep:
+
+```bash
+voice finetune sweep configs/sweep/example.yaml
+```
+
+See [docs/04_cli.md](docs/04_cli.md) for the full CLI reference, config format and output layout.
+
+
+
+---
+
+## Contributing
+
+Contributions are welcome. To propose a change:
+
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feature/my-change`)
+3. Commit your changes (`git commit -m 'Add my change'`)
+4. Push to the branch (`git push origin feature/my-change`)
+5. Open a pull request
+
+Please raise an issue first for substantial changes.
+
+
diff --git a/configs/sweep/bush/gwb_qwen3_14b_grpo.yaml b/configs/sweep/bush/gwb_qwen3_14b_grpo.yaml
new file mode 100644
index 0000000..037177b
--- /dev/null
+++ b/configs/sweep/bush/gwb_qwen3_14b_grpo.yaml
@@ -0,0 +1,71 @@
+sweep:
+  learning_rate: [1.0e-5, 5.0e-6, 1.0e-6]
+  trl.beta: [0.01, 0.005, 0.015]
+
+  lora_r: [32]
+  micro_batch_size: [2]
+  gradient_accumulation_steps: [8]
+
+  target_layers:
+    - name: mlp_attention
+      modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
+
+axolotl:
+  base_model: AccelerateScience/Qwen3-14B-gwb-press-conference-sft-02-merged
+  rl: grpo
+  chat_template: tokenizer_default
+  chat_template_kwargs:
+    enable_thinking: false
+
+  datasets:
+    - path: AccelerateScience/gwb-press-conference-qa
+      type: voice._prompt_transform
+
+  skip_prepare_dataset: true
+  val_set_size: 0.0
+  hf_use_auth_token: true
+
+  trl:
+    num_generations: 8
+    max_completion_length: 1024
+    temperature: 0.7
+    top_p: 1.0
+    top_k: 0
+    repetition_penalty: 1.0
+    reward_funcs:
+      - voice.rl.rewards.typicality_reward
+    reward_weights:
+      - 1.0
+
+  adapter: lora
+  lora_dropout: 0.0
+
+  sequence_len: 2048
+
+  num_epochs: 4
+
+  bf16: true
+  tf32: true
+  flash_attention: true
+  gradient_checkpointing: true
+
+  optimizer: adamw_torch_fused
+  lr_scheduler: cosine
+  warmup_ratio: 0.03
+  weight_decay: 0.0
+
+  eval_strategy: "no"
+  save_strategy: epoch
+  save_total_limit: 4
+
+  special_tokens:
+    pad_token: "<|eot_id|>"
+
+  plugins:
+    - voice.finetune.callbacks.EvalCompletionsPlugin
+    - voice.finetune.callbacks.TypicalityRewardPlugin
+
+  use_wandb: true
+  wandb_project: VOICE
+  wandb_entity: accelerate-science
+  logging_steps: 1
diff --git a/configs/sweep/bush/gwb_qwen3_14b_sft.yaml b/configs/sweep/bush/gwb_qwen3_14b_sft.yaml
new file mode 100644
index 0000000..8f7a780
--- /dev/null
+++ b/configs/sweep/bush/gwb_qwen3_14b_sft.yaml
@@ -0,0 +1,79 @@
+sweep:
+  learning_rate: [5.0e-4, 3.0e-4, 2.0e-4, 1.0e-4]
+  lora_r: [2, 4, 8, 16, 32]
+
+  micro_batch_size: [2, 4, 8]
+  gradient_accumulation_steps: [1]
+
+  target_layers:
+    - name: mlp
+      modules: [gate_proj, up_proj, down_proj]
+    - name: attention
+      modules: [q_proj, k_proj, v_proj, o_proj]
+    - name: mlp_attention
+      modules: [gate_proj, up_proj, down_proj, q_proj, k_proj, v_proj, o_proj]
+
+axolotl:
+
+  base_model: Qwen/Qwen3-14B
+
+  datasets:
+    - path: AccelerateScience/gwb-press-conference-qa
+      type: chat_template
+      field_messages: messages
+      roles_to_train:
+        - assistant
+      train_on_eos: last
+  eot_tokens: ["<|eot_id|>"]
+
+  special_tokens:
+    pad_token: "<|eot_id|>"
+
+  hf_use_auth_token: true
+
+  chat_template: tokenizer_default
+  val_set_size: 0.0
+
+  test_datasets:
+    - path: AccelerateScience/gwb-press-conference-qa
+      split: validation
+      type: chat_template
+      field_messages: messages
+
+  sequence_len: 2048
+  sample_packing: true
+
+  adapter: lora
+  lora_dropout: 0.05
+
+  num_epochs: 3
+
+  bf16: true
+  tf32: true
+  flash_attention: true
+  gradient_checkpointing: true
+
+  lora_mlp_kernel: false
+  lora_qkv_kernel: false
+  lora_o_kernel: false
+
+  optimizer: adamw_torch_fused
+  lr_scheduler: cosine
+  warmup_ratio: 0.03
+  weight_decay: 0.0
+
+  eval_strategy: "epoch"
+
+  save_strategy: epoch
+  save_total_limit: 3
+
+  plugins:
+    - voice.finetune.callbacks.EvalCompletionsPlugin
+
+  use_wandb: true
+  wandb_project: VOICE
+  wandb_entity: accelerate-science
+
+  # Disable thinking
+  chat_template_kwargs:
+    enable_thinking: false
diff --git a/configs/sweep/obama/bo_qwen3_14b_grpo.yaml b/configs/sweep/obama/bo_qwen3_14b_grpo.yaml
new file mode 100644
index 0000000..536588c
--- /dev/null
+++ b/configs/sweep/obama/bo_qwen3_14b_grpo.yaml
@@ -0,0 +1,71 @@
+sweep:
+  learning_rate: [1.0e-5, 5.0e-6]
+  trl.beta: [0.005, 0.01, 0.015]
+
+  lora_r: [64]
+  micro_batch_size: [2]
+  gradient_accumulation_steps: [8]
+
+  target_layers:
+    - name: mlp_attention
+      modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
+
+axolotl:
+  base_model: AccelerateScience/Qwen3-14B-bo-press-conference-sft-merged
+  rl: grpo
+  chat_template: tokenizer_default
+  chat_template_kwargs:
+    enable_thinking: false
+
+  datasets:
+    - path: AccelerateScience/bo-press-conference-qa
+      type: voice._prompt_transform
+
+  skip_prepare_dataset: true
+  val_set_size: 0.0
+  hf_use_auth_token: true
+
+  trl:
+    num_generations: 8
+    max_completion_length: 1024
+    temperature: 0.7
+    top_p: 1.0
+    top_k: 0
+    repetition_penalty: 1.0
+    reward_funcs:
+      - voice.rl.rewards.typicality_reward
+    reward_weights:
+      - 1.0
+
+  adapter: lora
+  lora_dropout: 0.0
+
+  sequence_len: 2048
+
+  num_epochs: 4
+
+  bf16: true
+  tf32: true
+  flash_attention: true
+  gradient_checkpointing: true
+
+  optimizer: adamw_torch_fused
+  lr_scheduler: cosine
+  warmup_ratio: 0.03
+  weight_decay: 0.0
+
+  eval_strategy: "no"
+  save_strategy: epoch
+  save_total_limit: 4
+
+  special_tokens:
+    pad_token: "<|eot_id|>"
+
+  plugins:
+    - voice.finetune.callbacks.EvalCompletionsPlugin
+    - voice.finetune.callbacks.TypicalityRewardPlugin
+
+  use_wandb: true
+  wandb_project: VOICE
+  wandb_entity: accelerate-science
+  logging_steps: 1
diff --git a/configs/sweep/obama/bo_qwen3_14b_sft.yaml b/configs/sweep/obama/bo_qwen3_14b_sft.yaml
new file mode 100644
index 0000000..7875f27
--- /dev/null
+++ b/configs/sweep/obama/bo_qwen3_14b_sft.yaml
@@ -0,0 +1,76 @@
+sweep:
+  learning_rate: [1.0e-4, 2.0e-4, 3.0e-4, 5.0e-4, 7.0e-4, 1.03e-3]
+  lora_r: [16, 32, 64]
+
+  micro_batch_size: [1, 2, 4]
+  gradient_accumulation_steps: [2]
+
+  target_layers:
+    - name: mlp_attention
+      modules: [gate_proj, up_proj, down_proj, q_proj, k_proj, v_proj, o_proj]
+    - name: mlp
+      modules: [gate_proj, up_proj, down_proj]
+
+axolotl:
+
+  base_model: Qwen/Qwen3-14B
+
+  datasets:
+    - path: AccelerateScience/bo-press-conference-qa
+      type: chat_template
+      field_messages: messages
+      roles_to_train:
+        - assistant
+      train_on_eos: last
+  eot_tokens: ["<|eot_id|>"]
+
+  special_tokens:
+    pad_token: "<|eot_id|>"
+
+  hf_use_auth_token: true
+
+  chat_template: tokenizer_default
+  val_set_size: 0.0
+
+  test_datasets:
+    - path: AccelerateScience/bo-press-conference-qa
+      split: validation
+      type: chat_template
+      field_messages: messages
+
+  sequence_len: 2048
+  sample_packing: true
+
+  adapter: lora
+  lora_dropout: 0.05
+
+  num_epochs: 3
+
+  bf16: true
+  tf32: true
+  flash_attention: true
+  gradient_checkpointing: true
+
+  lora_mlp_kernel: false
+  lora_qkv_kernel: false
+  lora_o_kernel: false
+
+  optimizer: adamw_torch_fused
+  lr_scheduler: cosine
+  warmup_ratio: 0.03
+  weight_decay: 0.0
+
+  eval_strategy: "epoch"
+
+  save_strategy: epoch
+  save_total_limit: 3
+
+  plugins:
+    - voice.finetune.callbacks.EvalCompletionsPlugin
+
+  use_wandb: true
+  wandb_project: VOICE
+  wandb_entity: accelerate-science
+
+  chat_template_kwargs:
+    enable_thinking: false
diff --git a/cspell/library-words.txt b/cspell/library-words.txt
index 730b848..687eecf 100644
--- a/cspell/library-words.txt
+++ b/cspell/library-words.txt
@@ -25,3 +25,5 @@ PYTHONPATH
venv
cuda
funcs
+pathlib
+kwargs
diff --git a/cspell/project-words.txt b/cspell/project-words.txt
index 8f88110..0b0c804 100644
--- a/cspell/project-words.txt
+++ b/cspell/project-words.txt
@@ -29,3 +29,5 @@ Wegmann
embs
cdfs
unprimed
+qwen
+imgs
diff --git a/docs/00_data.md b/docs/00_data.md
index e504061..038f68b 100644
--- a/docs/00_data.md
+++ b/docs/00_data.md
@@ -2,12 +2,14 @@
---
-Two example datasets are included, each containing press conference Q&A transcripts split into train, validation and test sets:
+VOICE expects datasets in chat format with train, validation, and test splits. Each example is a `messages` list containing system, user, and assistant turns; the assistant turn is the text against which stylometric alignment is measured.
-| President | HuggingFace |
+Two example datasets are hosted on Hugging Face:
+
+| Dataset | Hugging Face |
|---|---|
-| Barack Obama | [`AccelerateScience/bo-press-conference-qa`](https://huggingface.co/datasets/AccelerateScience/bo-press-conference-qa) |
-| George W. Bush | [`AccelerateScience/gwb-press-conference-qa`](https://huggingface.co/datasets/AccelerateScience/gwb-press-conference-qa) |
+| Barack Obama press conference Q&A | [`AccelerateScience/bo-press-conference-qa`](https://huggingface.co/datasets/AccelerateScience/bo-press-conference-qa) |
+| George W. Bush press conference Q&A | [`AccelerateScience/gwb-press-conference-qa`](https://huggingface.co/datasets/AccelerateScience/gwb-press-conference-qa) |
Each example is a single JSONL record in chat format:
@@ -20,3 +22,32 @@ Each example is a single JSONL record in chat format:
]
}
```
+
+## Bringing Your Own Data
+
+Any dataset following the chat format above can be used by specifying its Hugging Face repo id in the axolotl config:
+
+```yaml
+datasets:
+  - path: your-org/your-dataset
+    type: chat_template
+    field_messages: messages
+    roles_to_train: [assistant]
+```
+
+For local datasets, use `LocalDatasetSpec` and provide one `.jsonl` file per split:
+
+```python
+from voice import get_dataset, LocalDatasetSpec
+from voice.datasets._schema import Split
+from pathlib import Path
+
+ds = get_dataset(
+    LocalDatasetSpec(
+        path=Path("data/my-author"),
+        splits=(Split.TRAIN, Split.VALIDATION, Split.TEST),
+    )
+)
+```
+
+VOICE expects files at `{path}/train.jsonl`, `{path}/validation.jsonl`, and `{path}/test.jsonl`. Columns `system`, `question`, and `answer` are also accepted as an alternative to `messages`.
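
As an illustration of the two accepted layouts, the sketch below converts flat `system`/`question`/`answer` records into the `messages` form and writes them as JSONL. The helper names `to_messages` and `write_split`, the role names, and the temporary output directory are assumptions made for this example; they are not part of the VOICE API.

```python
import json
import tempfile
from pathlib import Path


def to_messages(record: dict) -> dict:
    """Map a flat system/question/answer record to chat format (assumed role names)."""
    return {
        "messages": [
            {"role": "system", "content": record["system"]},
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    }


def write_split(records: list[dict], path: Path) -> None:
    """Write one JSON object per line, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(to_messages(record), ensure_ascii=False) + "\n")


# Demo: write a one-record train split into a throwaway directory.
flat = [{"system": "You are the author.", "question": "Why?", "answer": "Because."}]
out_dir = Path(tempfile.mkdtemp()) / "my-author"
write_split(flat, out_dir / "train.jsonl")
```

Repeating `write_split` for `validation.jsonl` and `test.jsonl` gives the three files described above.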
diff --git a/docs/01_stylometry.md b/docs/01_stylometry.md
index ee053e4..ccb7484 100644
--- a/docs/01_stylometry.md
+++ b/docs/01_stylometry.md
@@ -8,9 +8,12 @@ $$f : \mathcal{T} \rightarrow \mathbb{R}$$
that maps a text string to a real scalar capturing some surface property of writing style. VOICE treats each metric as a distribution over a corpus: given a set of texts, $f$ is applied to each one to produce a sample from the author's stylometric distribution for that feature.
+Metrics are organised into four groups. Because metrics within the same group are highly correlated (they characterise the same underlying linguistic object from different angles), the evaluation suite averages within groups before aggregating across them, preventing any single dimension from dominating the alignment score through sheer metric count.
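
To make the "metric as a distribution" idea concrete, here is a minimal sketch in which one scalar metric is applied to every text in a corpus. The tokenisation rule and the function names are assumptions for illustration and need not match the VOICE implementation.

```python
import re
import statistics


def words(text: str) -> list[str]:
    """Tokenise on runs of letters and apostrophes (a simplifying assumption)."""
    return re.findall(r"[A-Za-z']+", text)


def mean_word_length(text: str) -> float:
    """One scalar metric f: text -> R. Raises on texts with no words."""
    return statistics.fmean(len(w) for w in words(text))


def metric_sample(metric, corpus: list[str]) -> list[float]:
    """Apply a metric to every text, giving a sample from its distribution."""
    return [metric(text) for text in corpus]


corpus = ["The cat sat.", "We will prevail over adversity."]
sample = metric_sample(mean_word_length, corpus)
```

Each entry of `sample` is one draw from the author's distribution for that feature; it is these samples, not the raw texts, that the evaluation suite compares.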
+
## Implemented Metrics
### Word Length Distribution
+
Moments of the per-word character length distribution.
| Metric | Description |
@@ -21,6 +24,7 @@ Moments of the per-word character length distribution.
| `kurtosis_word_length` | Kurtosis of word length |
### Vocabulary Richness
+
Type–token and word statistics measuring lexical diversity.
| Metric | Description |
@@ -32,21 +36,16 @@ Type–token and word statistics measuring lexical diversity.
| `tri_legomena_ratio` | Fraction of words appearing exactly three times |
### Function Words
+
| Metric | Description |
|---|---|
| `function_word_ratio` | Proportion of tokens drawn from a closed function-word list |
### Character N-gram Diversity
+
Type–token ratio and MATTR computed over character $n$-grams for $n \in \{3, 4, 5\}$.
| Metric | Description |
|---|---|
| `char_{n}gram_type_token_ratio` | Character $n$-gram TTR |
| `char_{n}gram_moving_avg_type_token_ratio` | Character $n$-gram MATTR |
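
A rough sketch of the two character n-gram statistics follows. The window size, the handling of inputs shorter than one window, and the absence of any case or whitespace normalisation are assumptions; VOICE's own defaults may differ.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_avg_type_token_ratio(tokens: list[str], window: int = 50) -> float:
    """Mean TTR over every full sliding window; plain TTR if the input is short."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


grams = char_ngrams("abcabcabc", 3)
ttr = type_token_ratio(grams)
mattr = moving_avg_type_token_ratio(grams, window=3)
```

Plain TTR falls as texts get longer because types repeat; averaging over fixed-size windows removes most of that length dependence, which is why both variants are reported.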
-
-### Text Length
-| Metric | Description |
-|---|---|
-| `num_words` | Total word count |
-
-> **Note:** `num_words` may be deprecated in a future release. It does not tend to be used as a signature for authorship attribution in the broader stylometry literature.
diff --git a/docs/02_evals.md b/docs/02_evals.md
index a5d1d16..054b653 100644
--- a/docs/02_evals.md
+++ b/docs/02_evals.md
@@ -28,7 +28,7 @@ $t$ is the **tail value**: the fraction of self-distances that *exceed* the obse
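
The tail value can be sketched in a few lines of Python. The 1D Wasserstein distance is computed directly from the two empirical CDFs; the way self-distances are drawn here (two bootstrap resamples of the reference sample, each the size of the model sample) is an assumption about the procedure, and all function names are hypothetical.

```python
import random


def wasserstein_1d(xs: list[float], ys: list[float]) -> float:
    """W1 between two empirical distributions: the integral of |F_x - F_y|."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right).
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def self_distances(ref: list[float], size: int, n_boot: int = 200, seed: int = 0) -> list[float]:
    """Within-author distances between pairs of bootstrap resamples (assumed scheme)."""
    rng = random.Random(seed)
    return [
        wasserstein_1d(rng.choices(ref, k=size), rng.choices(ref, k=size))
        for _ in range(n_boot)
    ]


def tail_value(observed: float, null: list[float]) -> float:
    """Fraction of self-distances that exceed the observed distance."""
    return sum(d > observed for d in null) / len(null)


rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(300)]
model = [rng.gauss(3.0, 1.0) for _ in range(50)]  # deliberately far from the reference
null = self_distances(reference, size=len(model))
t = tail_value(wasserstein_1d(model, reference), null)
```

A model sample that sits well outside the author's own variation yields a tail value near zero; a sample indistinguishable from the author's yields a value spread across $[0, 1]$.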
### Step 4: Group and Overall Score
-Metrics are organised into groups (see [Stylometric Metrics](01_stylometry.md)). Each group captures a distinct linguistic object (word length, vocabulary richness, .etc) and the metrics within it represent different ways of characterising that same object. Because they study the same underlying property, within-group metrics are highly correlated; averaging within groups before aggregating across them prevents any single linguistic dimension from dominating the score simply by having more metrics defined for it.
+Metrics are organised into groups (see [Stylometric Metrics](01_stylometry.md)). Each group captures a distinct linguistic object (word length, vocabulary richness, etc.) and the metrics within it represent different ways of characterising that same object. Because they study the same underlying property, metrics in the same group are highly correlated; averaging within groups before aggregating across them prevents any single linguistic dimension from dominating the score simply by having more metrics defined for it.
Within each group the tail values are averaged:
@@ -112,6 +112,6 @@ results.group_tail_cis() # dict[MetricGroup, tuple[float, float]] | None
results.score_ci() # tuple[float, float] | None
```
-Per-metric tails are useful for identifying which stylometric dimensions are misaligned; per-group tails aggregate correlated metrics and correspond directly to the terms summed in the overall score.
+Per-metric tails are useful for identifying which stylometric dimensions are misaligned; per-group tails aggregate correlated metrics and correspond directly to the terms in the geometric mean.
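
The relationship between per-metric tails, per-group tails, and the overall score can be sketched as follows. The floor value `EPS`, the point at which the floor is applied, and the group names are assumptions for this example; only the metric names come from the metric tables.

```python
import math

EPS = 1e-3  # assumed floor; the actual default is not stated in this section


def group_tails(metric_tails: dict[str, float], groups: dict[str, list[str]]) -> dict[str, float]:
    """Average the per-metric tail values inside each group."""
    return {
        name: sum(metric_tails[m] for m in members) / len(members)
        for name, members in groups.items()
    }


def overall_score(tails_by_group: dict[str, float], eps: float = EPS) -> float:
    """Geometric mean of the floored group tails."""
    logs = [math.log(max(t, eps)) for t in tails_by_group.values()]
    return math.exp(sum(logs) / len(logs))


metric_tails = {
    "char_3gram_type_token_ratio": 0.2,
    "char_3gram_moving_avg_type_token_ratio": 0.6,
    "function_word_ratio": 0.9,
}
groups = {
    "char_ngram_diversity": [
        "char_3gram_type_token_ratio",
        "char_3gram_moving_avg_type_token_ratio",
    ],
    "function_words": ["function_word_ratio"],
}
tails = group_tails(metric_tails, groups)
score = overall_score(tails)
```

Because the score is a geometric mean, one badly misaligned group pulls it down sharply, and the floor keeps a single zero from collapsing the whole score.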
-The confidence interval methods accept a `confidence` keyword (default 0.90, see `UNCERTAINTY_DEFAULTS`). Passing `uncertainty=False` to `make_comparison` skips the jackknife pass, in which case the interval methods return `None`.
+The confidence interval methods accept a `confidence` keyword (default `0.90`, see `UNCERTAINTY_DEFAULTS`). Passing `uncertainty=False` to `make_comparison` skips the jackknife pass, in which case the interval methods return `None`.
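
For readers unfamiliar with the jackknife, the sketch below shows a generic leave-one-out interval. Which units VOICE leaves out and how it turns the jackknife variance into an interval are not specified here, so the normal approximation and the function name `jackknife_ci` are assumptions; only the `0.90` default is taken from the text above.

```python
import math
from statistics import NormalDist, fmean


def jackknife_ci(values: list[float], statistic, confidence: float = 0.90) -> tuple[float, float]:
    """Leave-one-out standard error with a normal-approximation interval."""
    n = len(values)
    full = statistic(values)
    # Recompute the statistic with each observation removed in turn.
    loo = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    mean_loo = fmean(loo)
    se = math.sqrt((n - 1) / n * sum((v - mean_loo) ** 2 for v in loo))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return full - z * se, full + z * se


low, high = jackknife_ci([1.0, 2.0, 3.0, 4.0, 5.0], fmean)
```

For the sample mean this reproduces the familiar standard error $s/\sqrt{n}$, which is a convenient sanity check before applying the same routine to a more complex statistic such as the alignment score.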
diff --git a/docs/03_rl.md b/docs/03_rl.md
index 1163109..6466793 100644
--- a/docs/03_rl.md
+++ b/docs/03_rl.md
@@ -1,4 +1,10 @@
-# Reward Function
+# Reward Functions
+
+---
+
+VOICE provides a GRPO reward function that guides an online model to produce completions that are more typical of the target author's style than those of a reference (base) model. It can be used as the primary training signal or in combination with other rewards.
+
+The same `voice finetune` CLI is used for GRPO runs; the axolotl config simply adds `rl: grpo` and a few extra keys. See [Fine-Tuning CLI](04_cli.md) for CLI options and output layout.
---
@@ -10,7 +16,7 @@ Let $\pi_{\theta}$ and $\pi_{\text{base}}$ denote the online and reference (base
The typicality reward reuses the stylometric metrics directly (see [Stylometric Metrics](01_stylometry.md)), and asks how typical a single completion is of the author's general style.
-For each metric $f$, let $\hat{F}_f$ be its empirical CDF over the author's training split, the same split used to construct $\mathcal{W}_0$ in the eval suite. For a text $a$, define its percentile $u_f(a) = \hat{F}_f(f(a)) \in [0, 1]$ and
+For each metric $f$, let $\hat{F}_f$ be its empirical CDF over the author's training split - the same split used to construct $\mathcal{W}_0$ in the eval suite. For a text $a$, define its percentile $u_f(a) = \hat{F}_f(f(a)) \in [0, 1]$ and
$$\tau_f(a) = 1 - 2\left|u_f(a) - \frac{1}{2}\right| \in [0, 1]$$
@@ -21,3 +27,43 @@ $$\bar{\tau}_g(a) = \frac{1}{|g|} \sum_{f \in g} \tau_f(a), \qquad T(a) = \left(
using the same floor $\varepsilon$ as the eval suite. The reward is then:
$$r_{\text{typicality}}(d_i, d_i^{\text{base}}) = \max\big(T(d_i) - T(d_i^{\text{base}}),\ 0\big)$$
+
+This is a relative reward: a completion only receives a positive signal if it is more typical of the author's style than what the base model would have produced for the same prompt.
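
The reward can be sketched as below, taking already-computed metric values as input. This is an illustration under stated assumptions, not the `voice.rl.rewards.typicality_reward` implementation: the floor value, the right-continuous empirical CDF, the group name, and every function name are hypothetical.

```python
import bisect
import math


def ecdf(sample: list[float]):
    """Right-continuous empirical CDF of a training-split sample."""
    ordered = sorted(sample)

    def cdf(x: float) -> float:
        return bisect.bisect_right(ordered, x) / len(ordered)

    return cdf


def tau(u: float) -> float:
    """1 at the median percentile, falling linearly to 0 at either extreme."""
    return 1.0 - 2.0 * abs(u - 0.5)


def typicality(values: dict[str, float], cdfs: dict, groups: dict[str, list[str]], eps: float = 1e-3) -> float:
    """Geometric mean over groups of the floored mean per-metric typicality."""
    logs = []
    for members in groups.values():
        mean_tau = sum(tau(cdfs[m](values[m])) for m in members) / len(members)
        logs.append(math.log(max(mean_tau, eps)))
    return math.exp(sum(logs) / len(logs))


def typicality_reward_sketch(online: dict[str, float], base: dict[str, float], cdfs: dict, groups: dict[str, list[str]]) -> float:
    """Positive only when the online completion is more typical than the base one."""
    return max(typicality(online, cdfs, groups) - typicality(base, cdfs, groups), 0.0)


cdfs = {"function_word_ratio": ecdf([0.40, 0.45, 0.50, 0.55, 0.60])}
groups = {"function_words": ["function_word_ratio"]}
online = {"function_word_ratio": 0.50}  # near the author's median
base = {"function_word_ratio": 0.90}    # beyond anything in the training split
reward = typicality_reward_sketch(online, base, cdfs, groups)
```

Swapping `online` and `base` in the call gives a reward of zero, which is the clipping behaviour described above.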
+
+---
+
+## Configuration
+
+### Dataset
+
+GRPO training requires a `ref_completion` column in the dataset containing pre-generated base model completions for each prompt. The `voice._prompt_transform` axolotl dataset type extracts the prompt, the ground-truth assistant turn, and the reference completion and makes them available to TRL:
+
+```yaml
+datasets:
+  - path: AccelerateScience/bo-press-conference-qa
+    type: voice._prompt_transform
+```
+
+### Axolotl Config
+
+A minimal GRPO config adds the `rl`, `trl`, and `plugins` keys to the standard axolotl setup:
+
+```yaml
+rl: grpo
+
+trl:
+  num_generations: 8
+  max_completion_length: 1024
+  reward_funcs:
+    - voice.rl.rewards.typicality_reward
+  reward_weights:
+    - 1.0
+
+plugins:
+  - voice.finetune.callbacks.EvalCompletionsPlugin
+  - voice.finetune.callbacks.TypicalityRewardPlugin
+```
+
+`TypicalityRewardPlugin` pre-computes and caches the per-metric empirical CDFs from the training split before training begins. The reward function reads from this cache at each step; calling `typicality_reward` without the plugin registered raises a `RuntimeError`.
+
+A full sweep example is at `configs/sweep/obama/bo_qwen3_14b_grpo.yaml`.
diff --git a/docs/04_cli.md b/docs/04_cli.md
index a5662a9..17be2e2 100644
--- a/docs/04_cli.md
+++ b/docs/04_cli.md
@@ -2,7 +2,7 @@
---
-The `voice finetune` command group runs LoRA fine-tuning jobs via [axolotl](https://github.com/axolotl-ai-cloud/axolotl). Two subcommands are available: `single` for a single training run and `sweep` for a hyperparameter grid search.
+The `voice finetune` command group runs LoRA fine-tuning jobs via [axolotl](https://github.com/axolotl-ai-cloud/axolotl). Two subcommands are available: `single` for a single training run and `sweep` for a hyperparameter grid search. Both work for SFT and GRPO runs; the training mode is controlled by the config.
---
@@ -87,7 +87,9 @@ axolotl:
All five `sweep:` axes are required. The CLI expands them into a Cartesian product; `lora_alpha` is set equal to `lora_r` for each run and does not need to be listed. Per-run values for `learning_rate`, `lora_r`, `lora_alpha`, `lora_target_modules`, `micro_batch_size`, and `gradient_accumulation_steps` are merged on top of the shared `axolotl:` block before training starts.
-A full example is at `configs/sweep/example.yaml`.
+Additional axes (such as `trl.beta` for GRPO sweeps) can be added freely and are passed through to the axolotl config using dot notation.
+
+A full SFT example is at `configs/sweep/example.yaml`. A full GRPO example is at `configs/sweep/obama/bo_qwen3_14b_grpo.yaml`.
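
The expansion described above can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the CLI's code: the helper names are invented, and mapping each `target_layers` entry to `lora_target_modules` is an assumption based on the per-run keys listed above.

```python
import copy
import itertools


def set_dotted(config: dict, dotted_key: str, value) -> None:
    """Write `value` at a dot-notation path, creating nested dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def expand_sweep(sweep: dict, base: dict) -> list[dict]:
    """Cartesian product of all sweep axes, merged on top of the shared config."""
    axes = list(sweep)
    runs = []
    for combo in itertools.product(*(sweep[axis] for axis in axes)):
        config = copy.deepcopy(base)
        for axis, value in zip(axes, combo):
            if axis == "target_layers":
                config["lora_target_modules"] = value["modules"]
            else:
                set_dotted(config, axis, value)
        if "lora_r" in config:
            config["lora_alpha"] = config["lora_r"]  # alpha follows r
        runs.append(config)
    return runs


sweep = {
    "learning_rate": [1.0e-5, 5.0e-6],
    "trl.beta": [0.005, 0.01, 0.015],
    "lora_r": [64],
    "target_layers": [{"name": "mlp_attention", "modules": ["q_proj", "v_proj"]}],
}
base = {"rl": "grpo", "trl": {"num_generations": 8}}
runs = expand_sweep(sweep, base)
```

Here two learning rates and three `trl.beta` values produce six runs, each with `trl.beta` written into the nested `trl` block of its own copy of the shared config.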
### Resume
diff --git a/docs/imgs/function_word_ratio.svg b/docs/imgs/function_word_ratio.svg
new file mode 100644
index 0000000..21164a9
--- /dev/null
+++ b/docs/imgs/function_word_ratio.svg
@@ -0,0 +1,3385 @@
+
+
+