Build sample-specific diploid genomes with non-reference mobile element insertions from targeted long-read TE-capture data.
NanoMEI is a post-processing tool for TEnCATS and Fiber-TEnCATS datasets analyzed with NanoPal.
Starting from NanoPal mobile element insertion calls, NanoMEI extracts insertion-supporting sequence from read softclips or BLAST-defined read segments, groups reads that support the same insertion event, and builds a consensus sequence for each MEI. It then writes an (optionally phased) MEI VCF and can use vcf2diploid to augment the reference genome with the reconstructed non-reference insertions.
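The grouping step described above can be pictured with a small sketch. This is not NanoMEI's implementation and does not use its real input format: it assumes a hypothetical two-column TSV (read name, insertion ID), and `count_support` is a name made up for illustration. It tallies the rows per insertion and keeps only insertions that reach a minimum read count, defaulting to 10 to mirror the documented default of `--min-reads-support`.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT NanoMEI's code or file format.
# Assumed (hypothetical) input: a TSV with columns read_name <TAB> insertion_id.

# count_support FILE [MIN_READS]
# Prints "insertion_id<TAB>n_rows" for every insertion with at least
# MIN_READS supporting rows (default 10), sorted by insertion ID.
count_support() {
  local table="$1" min_reads="${2:-10}"
  awk -F'\t' -v min="$min_reads" '
    { n[$2]++ }                      # tally rows per insertion ID
    END {
      for (id in n)
        if (n[id] >= min) print id "\t" n[id]
    }
  ' "$table" | sort
}
```

Insertions that pass such a threshold would be the ones carried forward to consensus building; how NanoMEI actually groups reads and builds consensus sequences is defined by the tool itself.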
All dependencies can be installed with the provided conda environment:
conda env create -f environment.yml
conda activate NanoMEI
pip install -e .

To make sure everything is working, you can run the following test from the main repository folder:
pytest -v

Please first run NanoPal on your (Fiber-)TEnCATS data to identify captured non-reference TEs. Once you have the NanoPal output files summary.final.PALMER.TE.read.txt and blastn_refine.all.txt, you can use NanoMEI to reconstruct MEI sequences and augment your reference genome.
reconstruct-mei \
--final-summary-palmer summary.final.PALMER.TE.read.txt \
--blastn-refine blastn_refine.all.txt \
--bam-file reads.hg38.bam \
--output-vcf sample.MEI.vcf \
--reference-genome-fasta hg38.fa \
--sample-id sample

| Argument | Required | Description |
|---|---|---|
| `--final-summary-palmer` | Yes | Path to the NanoPal/PALMER summary file, usually named summary.final.PALMER.TE.read.txt. This file provides read-level evidence for captured non-reference TE insertions. |
| `--blastn-refine` | Yes | Path to the NanoPal BLAST refinement output, usually named blastn_refine.all.txt. This file is used to determine which portion of each read corresponds to the mobile element insertion. |
| `--bam-file` | Yes | BAM file containing nanopore reads aligned to the reference genome. NanoMEI uses this file to extract insertion-supporting read sequence from softclips or BLAST-defined read segments. |
| `--output-vcf` | Yes | Path/name for the final MEI VCF produced by NanoMEI. |
| `--reference-genome-fasta` | Yes | Reference genome FASTA used by vcf2diploid when creating the sample-specific diploid genome. |
| `--sample-id` | Yes | Sample identifier to use in the final VCF and vcf2diploid output. |
| `--vcf2diploid-jar` | No | Path to the vcf2diploid.jar file. vcf2diploid.jar is already in the resources subdirectory, but you can provide a new path if you want to try out new versions. |
| `--min-reads-support` | No | Minimum number of reads required to build a consensus sequence for an insertion event. Default: 10. |
| `--phased-reads` | No | Optional file with read-level haplotype assignments. The first column must contain the read name matching the FASTA header, and the second column must contain the phase (`1` …) |
| `--output-dir` | No | Directory for intermediate files and vcf2diploid output. Default: TE_vcf. |
| `--vcf-header` | No | Path to the VCF header file. The provided default is a minimal header that can be used with any reference genome. An example of a more detailed header is also provided for hg38; use a different header if working with another reference. |
| `--use-blast-defined-only` | No | If set, NanoMEI extracts only the BLAST-defined read segment instead of the full trimmed softclip. |
| `-v`, `--version` | No | Print the NanoMEI version and exit. |
| `-h`, `--help` | No | Print the help message and exit. |
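Since several arguments are required input files, it may be convenient to check that they exist before starting a run. The helper below is a minimal sketch and is not shipped with NanoMEI; the function name `check_inputs` is hypothetical. The commented invocation uses only flags from the table above, with example values (`5`, `sample_TE_vcf`) chosen purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of NanoMEI.
# check_inputs FILE...
# Returns non-zero (and reports on stderr) if any file is missing or empty.
check_inputs() {
  local status=0 f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty input: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example use (commented out so the snippet runs without NanoMEI installed):
# check_inputs summary.final.PALMER.TE.read.txt blastn_refine.all.txt \
#              reads.hg38.bam hg38.fa &&
# reconstruct-mei \
#   --final-summary-palmer summary.final.PALMER.TE.read.txt \
#   --blastn-refine blastn_refine.all.txt \
#   --bam-file reads.hg38.bam \
#   --output-vcf sample.MEI.vcf \
#   --reference-genome-fasta hg38.fa \
#   --sample-id sample \
#   --min-reads-support 5 \
#   --output-dir sample_TE_vcf
```

The check only tests that each file exists and is non-empty; it does not validate the file contents or formats.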
If you use this repository, please cite:
Fiber-TEnCATS reveals haplotype-specific chromatin accessibility and DNA methylation at human L1HS loci