8 changes: 7 additions & 1 deletion .gitignore
@@ -75,6 +75,12 @@ configs/sweep/**
!configs/single/example.yaml
!configs/single/test.yaml
!configs/sweep/example.yaml

!configs/sweep/obama/
!configs/sweep/obama/bo_qwen3_14b_grpo.yaml
!configs/sweep/obama/bo_qwen3_14b_sft.yaml
!configs/sweep/bush/
!configs/sweep/bush/gwb_qwen3_14b_grpo.yaml
!configs/sweep/bush/gwb_qwen3_14b_sft.yaml
# Paper write up
paper/**
164 changes: 159 additions & 5 deletions README.md
@@ -1,14 +1,168 @@
# 🗣️ VOICE
[![GPL-3 License](https://img.shields.io/badge/License-GPLv3-brightgreen.svg)](https://opensource.org/licenses/GPL-3.0)
[![Issues](https://img.shields.io/github/issues-raw/acceleratescience/voice.svg?maxAge=25000)](https://github.com/acceleratescience/voice/issues)
[![GitHub contributors](https://img.shields.io/github/contributors/acceleratescience/voice.svg?style=flat)]()
[![GitHub pull requests](https://img.shields.io/github/issues-pr/acceleratescience/voice.svg?style=flat)]()
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
<br>
[![GitHub stars](https://img.shields.io/github/stars/acceleratescience/voice.svg?style=social&label=Star)]()
[![GitHub watchers](https://img.shields.io/github/watchers/acceleratescience/voice.svg?style=social&label=Watch)]()
[![GitHub forks](https://img.shields.io/github/forks/acceleratescience/voice.svg?style=social&label=Fork)]()

...
<br />
<div align="center">
<h2>🗣️VOICE</h2>
<p align="justify">
Research software for fine-tuning large language models to match a target author's writing style,
combining a calibrated stylometric evaluation suite with LoRA supervised fine-tuning and GRPO reinforcement learning.
</p>
<p align="center">
<a href="docs/">Documentation</a>
·
<a href="https://github.com/acceleratescience/voice/issues">Report Bug</a>
·
<a href="https://github.com/acceleratescience/voice/issues">Request Feature</a>
</p>
</div>

<details>
<summary>Table of Contents</summary>
<ol>
<li><a href="#overview">Overview</a></li>
<li><a href="#documentation">Documentation</a></li>
<li><a href="#installation">Installation</a></li>
<li><a href="#quick-start">Quick Start</a></li>
<li><a href="#python-api">Python API</a></li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#license">License</a></li>
</ol>
</details>

---

## Overview

VOICE is an NLP research toolkit for **stylometric style alignment**: fine-tuning a large language model so its outputs are stylistically consistent with a target author. The toolkit provides three integrated components:

- **Stylometry**: a suite of surface writing style metrics (word length moments, vocabulary richness, function word frequency, character n-gram diversity) organised into four metric groups.
- **Evaluation**: a calibrated alignment score $\mathcal{S}\in[0, 1]$ comparing model completions to a reference corpus using Wasserstein distance, normalised against within-author variation estimated from the training split via bootstrap resampling. Uncertainty estimates are provided via jackknife resampling.
- **Fine-tuning**: a CLI for running LoRA experiments (single runs or hyperparameter sweeps) via [axolotl](https://github.com/axolotl-ai-cloud/axolotl), with style alignment scoring built in. Both supervised fine-tuning (SFT) and GRPO are supported; GRPO uses a custom *typicality reward* function.
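
As a rough illustration of the kind of surface metrics listed above, the sketch below computes a few of them for a single text. It is not the VOICE implementation: the function name `surface_metrics`, the tiny `FUNCTION_WORDS` set, the regex tokeniser and the choice of character trigrams are all assumptions made for this example. See the stylometry documentation for the actual metric definitions.

```python
import re
from statistics import mean, pstdev

# Hypothetical, deliberately tiny function-word list (assumption for illustration).
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "that", "it"}


def surface_metrics(text: str) -> dict[str, float]:
    """Compute a few simple surface style metrics for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    lengths = [len(w) for w in words]
    # Collapse whitespace runs so trigrams are not dominated by formatting.
    chars = re.sub(r"\s+", " ", text.lower())
    trigrams = [chars[i:i + 3] for i in range(len(chars) - 2)]
    return {
        # First two word-length moments.
        "mean_word_length": mean(lengths),
        "std_word_length": pstdev(lengths),
        # Vocabulary richness as a type-token ratio.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of tokens that are function words.
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / len(words),
        # Distinct character trigrams divided by total trigrams.
        "char_trigram_diversity": len(set(trigrams)) / len(trigrams) if trigrams else 0.0,
    }


metrics = surface_metrics("The cat sat on the mat and the dog sat too.")
print(metrics)
```

Computing one such dictionary per completion gives a sample of values for each metric, which is the kind of input a distributional comparison needs.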

<div align="center">
<img src="docs/imgs/function_word_ratio.svg" alt="Function word ratio distribution comparison between model completions and reference corpus" width="50%">
<p><em>Example: function word ratio distributions for completions from the base model and from the VOICE fine-tuned model, each compared with the reference corpus, with the Wasserstein distance annotated.</em></p>
</div>
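
To make the idea behind the figure concrete, here is a minimal sketch of comparing two samples of one metric with a 1-D Wasserstein distance and expressing the result relative to within-author variation. It is an illustration under stated assumptions, not the VOICE scoring code: the helper names (`wasserstein_1d`, `within_author_baseline`, `normalised_distance`), the use of the median bootstrap distance as the baseline, and the synthetic Gaussian data are all hypothetical. The mapping from distances to the score $\mathcal{S}\in[0, 1]$ is not reproduced here.

```python
import random
from bisect import bisect_right
from statistics import median


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F is the
    empirical CDF of each sample.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def within_author_baseline(reference, n_boot=200, seed=0):
    """Median distance between pairs of bootstrap resamples of the reference.

    Assumption: this stands in for "within-author variation"; the real
    estimator may differ.
    """
    rng = random.Random(seed)
    n = len(reference)
    return median(
        wasserstein_1d(rng.choices(reference, k=n), rng.choices(reference, k=n))
        for _ in range(n_boot)
    )


def normalised_distance(model, reference, n_boot=200, seed=0):
    """Model-to-reference distance in units of the within-author baseline."""
    baseline = within_author_baseline(reference, n_boot, seed)
    if baseline == 0:
        raise ValueError("reference sample has no variation")
    return wasserstein_1d(model, reference) / baseline


# Synthetic function-word ratios (made-up numbers, for illustration only).
rng = random.Random(1)
reference = [rng.gauss(0.45, 0.05) for _ in range(200)]
close = [rng.gauss(0.45, 0.05) for _ in range(200)]
far = [rng.gauss(0.30, 0.05) for _ in range(200)]

print(normalised_distance(close, reference))
print(normalised_distance(far, reference))
```

A normalised distance near 1 means the model differs from the reference about as much as the author differs from themselves across resamples; larger values indicate a stylistic gap beyond that natural variation.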

<p align="right">(<a href="#top">back to top</a>)</p>

---

## Documentation

```
docs/
├── 00_data.md — Included datasets
├── 01_evals.md — Evaluation suite: scoring and interpretation
└── 02_stylometry.md — Stylometric metrics: definition and catalogue
├── 00_data.md - Datasets: format and Hugging Face references
├── 01_stylometry.md - Stylometric metrics: definitions and catalogue
├── 02_evals.md - Evaluation suite: scoring methodology and API
├── 03_rl.md - Reward functions for GRPO training
└── 04_cli.md - Fine-tuning CLI: single runs and sweeps
```

<p align="right">(<a href="#top">back to top</a>)</p>

---

## Installation

VOICE requires Python 3.12. Training requires Linux (the pinned axolotl version is Linux-only); the evaluation and stylometry components run on all platforms.

Using [uv](https://github.com/astral-sh/uv) (recommended):

```bash
git clone https://github.com/acceleratescience/voice
cd voice
uv sync
source .venv/bin/activate
```

Authenticate with Hugging Face before running fine-tuning jobs:

```bash
huggingface-cli login
```

<p align="right">(<a href="#top">back to top</a>)</p>

---

## Quick Start

Run a single fine-tuning job:

```bash
voice finetune single configs/single/example.yaml
```

This trains a LoRA adapter on top of Llama-3.1-8B-Instruct and writes per-epoch completions and alignment scores to `runs/{run_name}/`.

For a hyperparameter sweep:

```bash
voice finetune sweep configs/sweep/example.yaml
```

See [docs/04_cli.md](docs/04_cli.md) for the full CLI reference, config format and output layout.
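
For intuition about what a sweep involves, the sketch below expands a `sweep:` mapping into one override dictionary per run. It assumes the sweep is a plain Cartesian product over the listed values, which may not match how `voice finetune sweep` actually schedules runs; `expand_sweep` is a hypothetical helper, and the values mirror those in `configs/sweep/bush/gwb_qwen3_14b_grpo.yaml`.

```python
from itertools import product


def expand_sweep(sweep: dict[str, list]) -> list[dict]:
    """Return one override dict per combination of swept values.

    Assumption: every key in `sweep` maps to a list of candidate values and
    all combinations are run (a full grid).
    """
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


sweep = {
    "learning_rate": [1.0e-5, 5.0e-6, 1.0e-6],
    "trl.beta": [0.01, 0.005, 0.015],
    "lora_r": [32],
}

runs = expand_sweep(sweep)
print(len(runs))
for overrides in runs[:3]:
    print(overrides)
```

Under this assumption the grid size is the product of the list lengths, so adding values to several keys at once grows the number of runs quickly.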

<p align="right">(<a href="#top">back to top</a>)</p>

---

## Python API

The evaluation suite can be used independently of the CLI:

```python
from voice import get_dataset, make_comparison, DatasetSpec
from voice.datasets._schema import Split

ds = get_dataset(
    DatasetSpec(
        repo_id="AccelerateScience/bo-press-conference-qa",
        splits=(Split.TRAIN, Split.VALIDATION, Split.TEST),
    )
)

# completions: list[Example] — model outputs on the same prompts as ds.validation
results = make_comparison(completions, ds)

print(results.score) # overall alignment score in [0, 1]
print(results.group_tails) # per-group breakdown
print(results.score_ci()) # 90% jackknife confidence interval
```

See [docs/02_evals.md](docs/02_evals.md) for the full scoring methodology and available diagnostic fields.
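
For readers unfamiliar with jackknife intervals, the following is a generic sketch of a leave-one-out jackknife confidence interval for a statistic of per-example values. It is not the implementation behind `results.score_ci()`: the function name `jackknife_ci`, the normal-quantile approximation and the example values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def jackknife_ci(values, statistic=mean, level=0.90):
    """Leave-one-out jackknife confidence interval (normal approximation)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    full = statistic(values)
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    loo_mean = mean(leave_one_out)
    # Standard jackknife standard error.
    se = sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return full - z * se, full + z * se


values = [0.2, 0.4, 0.6, 0.8]  # made-up per-example scores
low, high = jackknife_ci(values)
print(low, high)
```

When the statistic is the mean, the jackknife standard error coincides with the usual sample standard deviation divided by the square root of the sample size, which is a convenient sanity check.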

<p align="right">(<a href="#top">back to top</a>)</p>

---

## Contributing

Contributions are welcome. To propose a change:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-change`)
3. Commit your changes (`git commit -m 'Add my change'`)
4. Push to the branch (`git push origin feature/my-change`)
5. Open a pull request

Please raise an issue first for substantial changes.

<p align="right">(<a href="#top">back to top</a>)</p>

---

## License

Distributed under the GNU General Public License v3.0 (GPL-3.0). See `LICENSE` for details.

<p align="right">(<a href="#top">back to top</a>)</p>
71 changes: 71 additions & 0 deletions configs/sweep/bush/gwb_qwen3_14b_grpo.yaml
@@ -0,0 +1,71 @@
sweep:
  learning_rate: [1.0e-5, 5.0e-6, 1.0e-6]
  trl.beta: [0.01, 0.005, 0.015]

  lora_r: [32]
  micro_batch_size: [2]
  gradient_accumulation_steps: [8]

  target_layers:
    - name: mlp_attention
      modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

axolotl:
  base_model: AccelerateScience/Qwen3-14B-gwb-press-conference-sft-02-merged
  rl: grpo
  chat_template: tokenizer_default
  chat_template_kwargs:
    enable_thinking: false

  datasets:
    - path: AccelerateScience/gwb-press-conference-qa
      type: voice._prompt_transform

  skip_prepare_dataset: true
  val_set_size: 0.0
  hf_use_auth_token: true

  trl:
    num_generations: 8
    max_completion_length: 1024
    temperature: 0.7
    top_p: 1.0
    top_k: 0
    repetition_penalty: 1.0
    reward_funcs:
      - voice.rl.rewards.typicality_reward
    reward_weights:
      - 1.0

  adapter: lora
  lora_dropout: 0.0

  sequence_len: 2048

  num_epochs: 4

  bf16: true
  tf32: true
  flash_attention: true
  gradient_checkpointing: true

  optimizer: adamw_torch_fused
  lr_scheduler: cosine
  warmup_ratio: 0.03
  weight_decay: 0.0

  eval_strategy: "no"
  save_strategy: epoch
  save_total_limit: 4

  special_tokens:
    pad_token: "<|eot_id|>"

  plugins:
    - voice.finetune.callbacks.EvalCompletionsPlugin
    - voice.finetune.callbacks.TypicalityRewardPlugin

  use_wandb: true
  wandb_project: VOICE
  wandb_entity: accelerate-science
  logging_steps: 1
79 changes: 79 additions & 0 deletions configs/sweep/bush/gwb_qwen3_14b_sft.yaml
@@ -0,0 +1,79 @@
sweep:
  learning_rate: [5.0e-4, 3.0e-4, 2.0e-4, 1.0e-4]
  lora_r: [2, 4, 8, 16, 32]

  micro_batch_size: [2, 4, 8]
  gradient_accumulation_steps: [1]

  target_layers:
    - name: mlp
      modules: [gate_proj, up_proj, down_proj]
    - name: attention
      modules: [q_proj, k_proj, v_proj, o_proj]
    - name: mlp_attention
      modules: [gate_proj, up_proj, down_proj, q_proj, k_proj, v_proj, o_proj]

axolotl:

  base_model: Qwen/Qwen3-14B

  datasets:
    - path: AccelerateScience/gwb-press-conference-qa
      type: chat_template
      field_messages: messages
      roles_to_train:
        - assistant
      train_on_eos: last
  eot_tokens: ["<|eot_id|>"]

  special_tokens:
    pad_token: "<|eot_id|>"

  hf_use_auth_token: true

  chat_template: tokenizer_default
  val_set_size: 0.0

  test_datasets:
    - path: AccelerateScience/gwb-press-conference-qa
      split: validation
      type: chat_template
      field_messages: messages

  sequence_len: 2048
  sample_packing: true

  adapter: lora
  lora_dropout: 0.05

  num_epochs: 3

  bf16: true
  tf32: true
  flash_attention: true
  gradient_checkpointing: true

  lora_mlp_kernel: false
  lora_qkv_kernel: false
  lora_o_kernel: false

  optimizer: adamw_torch_fused
  lr_scheduler: cosine
  warmup_ratio: 0.03
  weight_decay: 0.0

  eval_strategy: "epoch"

  save_strategy: epoch
  save_total_limit: 3

  plugins:
    - voice.finetune.callbacks.EvalCompletionsPlugin

  use_wandb: true
  wandb_project: VOICE
  wandb_entity: accelerate-science

  # Disable thinking
  chat_template_kwargs:
    enable_thinking: false