Where historical variation meets multilingual alignment.
This repository presents a curated collection of multilingual texts aligned through manual and semi-manual workflows, including scriptural materials and medieval prose. It is intended for cross-lingual representation learning and for the fine-tuning or domain adaptation of multilingual models such as LaBSE to historical and pre-standardized textual data.
Unlike many modern corpora, this dataset exposes models to:
- non-standardized spelling
- strong linguistic variation
- non-literal translation practices
- historically variable textual traditions
This makes it particularly suitable for studying how robust multilingual representations are on historical and pre-standardized texts, under conditions that differ significantly from modern benchmarks.
Most existing parallel or alignment-oriented corpora focus on modern, standardized languages.
Historical and pre-standardized texts introduce specific challenges:
- orthographic instability
- linguistic heterogeneity
- non-literal translation practices
- narrative or textual restructuring
- variation across textual traditions
Existing resources are often:
- small
- domain-restricted
- weakly aligned
- not designed for model adaptation and evaluation
This project provides a curated multilingual alignment dataset combining scriptural materials and medieval prose, designed to support cross-lingual representation learning, LaBSE fine-tuning, and robustness evaluation on historical textual data.
This repository brings together two complementary resources:
- scriptural parallel corpora
- a smaller dataset of narrative prose
The scriptural corpora provide multilingual data that is:
- structurally stable
- inherently parallel
- relatively standardized
These properties make them well suited for alignment-oriented experiments.
The prose dataset extends this setting to more structurally variable narrative texts, where:
- translations may be non-literal
- events can be reordered
- discourse structure may vary significantly
- semantic equivalence does not always correspond to direct structural alignment
Together, these resources support the study of multilingual textual alignment across both highly standardized parallel corpora and more divergent narrative prose.
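The difference between the two kinds of corpora can be made concrete by counting crossing alignment links. The sketch below is only an illustration: it assumes an alignment is represented as a list of `(source_index, target_index)` pairs, which is a hypothetical representation and not necessarily the format used in this repository's files. The function name `count_crossings` is also an assumption.

```python
def count_crossings(links):
    """Count pairs of alignment links that cross each other.

    `links` is a list of (source_index, target_index) tuples (a hypothetical
    representation). Two links cross when the source order and the target
    order of the linked segments disagree, e.g. when events are reordered.
    """
    ordered = sorted(links)  # sort by source index, then target index
    crossings = 0
    for i, (_, target_a) in enumerate(ordered):
        for _, target_b in ordered[i + 1:]:
            # A later source segment linked to an earlier target segment.
            if target_b < target_a:
                crossings += 1
    return crossings


# A verse-by-verse alignment is typically monotonic (no crossings),
# whereas a reordered narrative produces crossing links.
monotonic = [(0, 0), (1, 1), (2, 2)]
reordered = [(0, 0), (1, 2), (2, 1)]
print(count_crossings(monotonic), count_crossings(reordered))
```

A count of zero means the alignment preserves segment order; higher counts indicate more restructuring between the two versions.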
The corpus covers a broad range of linguistic traditions, with particular emphasis on medieval Romance languages.
French · Catalan · Portuguese · Castilian · Occitan · Italian · Aragonese · Venetian · Galician-Portuguese
Middle English
Latin · Medieval Latin
Other languages include Greek (Septuagint) and Arabic.
The datasets are distributed as JSON files, but their internal structure differs depending on the corpus.
Detailed documentation for each corpus is available in the docs/ directory, where users can find information about:
- file organization
- corpus sources and source-text availability
- metadata fields
- alignment structure
- dataset statistics
- segmentation or alignment units
- corpus-specific conventions and limitations
Because the repository combines different types of aligned material, corpus-level documentation should be consulted for details on structure, coverage, sources, and licensing constraints.
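Since the internal structure varies by corpus, it can help to inspect a file before writing a loader for it. The following is a minimal sketch using only the Python standard library; the helper names `summarize` and `summarize_file` and any path passed to them are assumptions, not part of this repository. The corpus documentation in docs/ remains the reference for the actual fields.

```python
import json
from pathlib import Path


def summarize(node, depth=0, max_depth=2):
    """Return a compact summary of the shape of a parsed JSON value."""
    if isinstance(node, dict):
        if depth >= max_depth:
            return f"dict({len(node)} keys)"
        return {key: summarize(value, depth + 1, max_depth)
                for key, value in node.items()}
    if isinstance(node, list):
        if not node or depth >= max_depth:
            return f"list({len(node)} items)"
        # Only the first item is summarized; this assumes homogeneous lists.
        return [summarize(node[0], depth + 1, max_depth),
                f"... {len(node)} items"]
    return type(node).__name__


def summarize_file(path):
    """Load a JSON file (path is supplied by the user) and summarize it."""
    with Path(path).open(encoding="utf-8") as handle:
        return summarize(json.load(handle))


# Toy example with invented keys, shown only to illustrate the output shape.
sample = {"meta": {"lang": "fr"}, "units": [{"a": "x", "b": ["y"]}]}
print(summarize(sample))
```

Raising `max_depth` shows more of the nesting at the cost of a longer summary.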
This dataset is designed to support multilingual representation learning, historical NLP, and the fine-tuning or evaluation of cross-lingual alignment models.
It is particularly suited for experiments aiming to:
- improve semantic alignment across languages and textual traditions
- handle historical and pre-standardized language variation
- evaluate multilingual models beyond literal translation settings
- fine-tune or adapt multilingual sentence embedding models
- test robustness on scriptural and medieval prose materials
- study multilingual text alignment across structurally different corpora
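One simple way to evaluate a sentence embedding model on aligned pairs is nearest-neighbour retrieval: for each source segment, check whether its most similar target segment is the one it is aligned with. The sketch below is written under stated assumptions. It takes precomputed embedding vectors as plain lists of floats (produced by any model, for example LaBSE), assumes that `source_vectors[i]` and `target_vectors[i]` form an aligned pair, and uses the illustrative names `cosine` and `retrieval_accuracy`. It is not the evaluation protocol used by the authors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(source_vectors, target_vectors):
    """Fraction of source segments whose nearest target is the aligned one.

    Assumes source_vectors[i] and target_vectors[i] embed an aligned pair.
    Ties are resolved in favour of the lowest target index.
    """
    if not source_vectors:
        return 0.0
    hits = 0
    for i, source in enumerate(source_vectors):
        scores = [cosine(source, target) for target in target_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best == i
    return hits / len(source_vectors)


# Toy vectors standing in for real embeddings.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.2, 0.8]]
print(retrieval_accuracy(source, target))
```

Comparing this score before and after fine-tuning, on held-out pairs, gives a rough indication of whether adaptation helped. The pure-Python loop is quadratic in the number of segments, so a vectorized implementation would be preferable for large corpora.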
This dataset may be useful for:
- NLP researchers working on low-resource, historical, or cross-lingual alignment
- digital humanists studying multilingual translation, textual reuse, or textual variants
- scholars exploring textual transmission across religious, linguistic, and historical traditions
This dataset is intended for computational experiments in multilingual representation learning, model adaptation, and alignment evaluation.
It is not a substitute for philological alignment, critical editing, or scholarly textual collation: during manual alignment, segments may have been adjusted and non-alignable material omitted.
This repository is under development. Corpus coverage, metadata, and alignment files may be updated as new texts become available or as existing alignments are revised.
Contributions are welcome.
You can contribute by:
- opening an issue to report errors, inconsistencies, or missing metadata
- providing usable source texts or witnesses for additional languages and textual traditions
- submitting a pull request with corrections, new alignments, new texts, or documentation improvements
This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:
- Aquilign: a clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.
- Multilingual Segmentation Data: source texts and segmented versions in multiple medieval Romance languages, as well as Latin and English, used for training and evaluating clause segmentation models.
Please cite the repository when using the dataset, alignment metadata, or derived files in publications, models, or downstream experiments.
@misc{parallelium_multilingual_alignment_dataset_2026,
title = {Parallelium: A Multilingual Alignment Dataset for Scriptural and Medieval Narrative Texts},
author = {Ing, Lucence and Levenson, Matthias Gille and Macedo, Carolina},
year = {2026},
howpublished = {GitHub repository}
}

The alignment metadata produced for this dataset are distributed under the CC BY-NC-SA 4.0 license, unless otherwise specified.
⚠️ Some source texts—including certain medieval Bible translations, the Qur’anic corpus, and some critical editions—are not redistributed in this repository due to third-party copyright restrictions.
Please refer to the documentation or the original editions for licensing information on specific versions.
This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).
