
TextSpitter

Transforming documents into insights, effortlessly and efficiently.


Built with the tools and technologies: TOML, Pytest, Python, Rust, GitHub Actions, uv



Overview

TextSpitter is a Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types (file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes) into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
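As a rough illustration of what normalising those four input types involves, the sketch below reduces each of them to raw bytes using only the standard library. The helper name `read_source` is hypothetical and is not part of TextSpitter's API; the library's own handling (including decoding to a string) may differ.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile


def read_source(source) -> bytes:
    """Return the raw bytes behind a path, a binary stream, or a bytes object.

    Hypothetical helper for illustration; not TextSpitter's implementation.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        # BytesIO and SpooledTemporaryFile both expose seek() and read().
        source.seek(0)
        return source.read()
    raise TypeError(f"unsupported source type: {type(source).__name__}")


payload = b"hello, world"
with SpooledTemporaryFile() as spooled:
    spooled.write(payload)
    from_spooled = read_source(spooled)
from_stream = read_source(BytesIO(payload))
from_bytes = read_source(payload)
```

All three calls return the same bytes, which is why a single downstream reader can serve every input type without writing temp files.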

As of v2.0, the processing core is written in Rust (via PyO3 + Maturin), delivering 10x–40x batch throughput improvements over the pure-Python v1 implementation. A transparent Python fallback is included for environments where the native extension is unavailable.

Why TextSpitter?

  • 📄 Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
  • 🔌 Stream-first API: accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
  • ⚡ Rust-powered core: encoding detection, Unicode normalisation, BPE token counting, and text chunking all run in native code with Rayon parallelism and GIL-released batch methods.
  • 🐍 Graceful fallback: pure-Python mirror of every Rust class; _RUST_AVAILABLE flag lets callers detect which path is active.
  • 🛠️ Optional structured logging: install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
  • 🖥️ CLI included: uv tool install textspitter gives you a textspitter command for quick one-off extractions.
  • 🚀 Automated CI/CD: GitHub Actions run the test matrix (Python 3.12–3.14) and publish multi-platform wheels (Linux, Windows, macOS) to PyPI on every release.
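A small sketch of how a caller might check which code path is active, based on the `_RUST_AVAILABLE` flag named above. The function name `active_backend` and the returned labels are illustrative assumptions; only the flag itself comes from this README.

```python
def active_backend() -> str:
    """Report which TextSpitter code path is in use.

    The labels returned here are illustrative, not defined by the library.
    """
    try:
        # Flag name taken from this README; exported by TextSpitter/__init__.py.
        from TextSpitter import _RUST_AVAILABLE
    except ImportError:
        # Package missing, or a version that does not export the flag.
        return "not installed"
    return "rust" if _RUST_AVAILABLE else "python-fallback"


print(active_backend())
```

Logging this value once at start-up can help explain throughput differences between environments.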

Features

Component Details
⚙️ Architecture
  • Four-layer design: TextSpitter convenience function → WordLoader dispatcher → FileExtractor reader → Rust _core extension
  • Transparent Python fallback (_fallback.py) when the native extension is unavailable
🦀 Rust Core
  • detect_encoding: single-pass chardetng encoding detection with UTF-8 BOM handling
  • TextNormalizer: Unicode NFC/NFD/NFKC/NFKD, whitespace collapse, OCR artifact repair, header/footer stripping
  • TokenCounter: BPE counting via tiktoken-rs; count_batch() releases the GIL via Rayon
  • TextChunker / Chunk: token-aware chunking with table preservation and section detection
🔩 Code Quality
  • Strict PEP 8 / ruff linting with black formatting
  • Full type hints on both Python and Rust layers; ships a py.typed PEP 561 marker
📄 Documentation
  • API docs auto-published to GitHub Pages via pdoc
  • Quick-start guide, tutorial, use-case examples, and recipes
🔌 Integrations
  • CI/CD with GitHub Actions (tests + docs + multi-platform PyPI publish via maturin-action)
  • Package management via uv; installable via pip or uv tool install
🧩 Modularity
  • Core FileExtractor separated from dispatch logic in WordLoader
  • Logging abstraction in logger.py isolates the optional loguru dependency
🧪 Testing
  • 239 pytest tests covering all readers, Rust classes, and Python fallback paths
  • Dual-path test fixtures exercise both _RUST_AVAILABLE=True and False branches
⚡️ Performance
  • 10x–40x batch throughput improvement over v1 via Rust + Rayon parallelism
  • GIL released on all *_batch() methods; Python threads unblocked during Rust work
📦 Dependencies
  • Core: pymupdf, pypdf, python-docx
  • Optional logging: loguru (pip install textspitter[logging])
  • No Rust toolchain required at runtime: pre-built wheels for Linux, Windows, macOS
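To make the normalisation step listed under Rust Core more concrete, here is a standard-library sketch of Unicode normalisation followed by whitespace collapse. It is a stand-in written for illustration: the function name `normalize_text` is hypothetical, and TextNormalizer's actual constructor options and methods are not shown in this README.

```python
import re
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Apply a Unicode normalisation form, then collapse whitespace runs.

    Illustrative stand-in only; not TextNormalizer's implementation.
    """
    normalized = unicodedata.normalize(form, text)
    # Collapse runs of spaces and tabs, but keep line breaks intact.
    collapsed = re.sub(r"[ \t]+", " ", normalized)
    return "\n".join(line.strip() for line in collapsed.splitlines())


# The sample contains an "fi" ligature, two no-break spaces, and a tab.
sample = "The \ufb01le\u00a0\u00a0was   saved.\n  Second\tline. "
cleaned = normalize_text(sample)
print(cleaned)
```

With NFKC, the ligature becomes plain "fi" and the no-break spaces become ordinary spaces, so the whitespace collapse can treat them uniformly.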

Project Structure

TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml             # pdoc → GitHub Pages
│       ├── python-publish.yml   # multi-platform PyPI release (maturin-action)
│       └── tests.yml            # pytest matrix (3.12 – 3.14)
├── src/                         # Rust extension (PyO3 / Maturin)
│   ├── lib.rs                   # PyModule registration
│   ├── encoding.rs              # detect_encoding() via chardetng
│   ├── normalize.rs             # TextNormalizer
│   ├── token.rs                 # TokenCounter via tiktoken-rs
│   ├── chunk.rs                 # TextChunker + Chunk
│   └── separator.rs             # Section-boundary detection (stub)
├── TextSpitter/
│   ├── __init__.py              # imports _core or _fallback; exports _RUST_AVAILABLE
│   ├── _fallback.py             # Pure-Python mirror of all _core exports
│   ├── cli.py                   # argparse CLI entry point
│   ├── core.py                  # FileExtractor class
│   ├── logger.py                # Optional loguru / stdlib fallback
│   ├── main.py                  # WordLoader dispatcher
│   ├── py.typed                 # PEP 561 marker
│   └── guide/                   # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py              # shared fixtures (log_capture)
│   ├── test_chunker.py          # TextChunker: Rust + fallback paths
│   ├── test_detect_encoding.py  # detect_encoding()
│   ├── test_normalizer.py       # TextNormalizer
│   ├── test_token_counter.py    # TokenCounter
│   ├── test_rust_integration.py # cross-class integration tests
│   ├── test_file_extractor.py
│   ├── test_cli.py
│   └── ...
├── Cargo.toml
├── Cargo.lock
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

Getting Started

Prerequisites

  • Python ≥ 3.10
  • uv (recommended) or pip
  • No Rust toolchain required: pre-built wheels are provided for Linux (x86_64, aarch64), Windows (x64), and macOS (x86_64, Apple Silicon)

Installation

From PyPI:

pip install textspitter

# With optional loguru logging
pip install "textspitter[logging]"

Using uv:

uv add textspitter

# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:

uv tool install textspitter

From source:

git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

Usage

As a library (one-liner):

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream
from io import BytesIO
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")
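Building on the one-liner, the following sketch extracts every supported file in a folder. It relies only on the `TextSpitter(filename=...)` call shown above; the helper name `extract_folder`, its `extract` parameter, and the default suffix list are assumptions made for this example.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_folder(
    folder: str,
    extract: Optional[Callable[[str], str]] = None,
    suffixes: tuple = (".pdf", ".docx", ".txt", ".csv"),
) -> dict:
    """Map each matching file name in `folder` to its extracted text.

    Hypothetical helper; `extract` can be swapped for a stub in tests.
    """
    if extract is None:
        # Imported lazily so the helper also works with a custom extractor.
        from TextSpitter import TextSpitter

        def extract(path: str) -> str:
            return TextSpitter(filename=path)

    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            results[path.name] = extract(str(path))
    return results
```

Calling `extract_folder("reports/")` with the library installed would return a dictionary keyed by file name; passing your own `extract` callable lets you test the traversal logic without real documents.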

Using the WordLoader class directly:

from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:

# Extract a single file to stdout
textspitter report.pdf

# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt
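If you need to drive the CLI from a Python script, a thin subprocess wrapper is one option. This sketch assumes only the command form shown above (positional files plus `-o`); the function names `build_cli_args` and `run_cli` are hypothetical.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def build_cli_args(files: Sequence[str], output: Optional[str] = None) -> list:
    """Build the argument list for the `textspitter` command."""
    args = ["textspitter", *files]
    if output is not None:
        args += ["-o", output]
    return args


def run_cli(files: Sequence[str], output: Optional[str] = None) -> str:
    """Run the CLI and return its stdout.

    Raises FileNotFoundError if the command is not on PATH, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    if shutil.which("textspitter") is None:
        raise FileNotFoundError("textspitter is not on PATH")
    completed = subprocess.run(
        build_cli_args(files, output),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

For in-process use, calling the library directly avoids the subprocess overhead; the wrapper is mainly useful when the tool is installed in a separate environment via `uv tool install`.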

Testing

uv run pytest tests/

# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap

v1.x

  • Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
  • CLI entry point (uv tool install textspitter)
  • Optional loguru logging with stdlib fallback
  • Programming-language file support (50+ extensions)
  • CI matrix (Python 3.12 – 3.14) + GitHub Pages docs
  • Async extraction API
  • CSV → structured output (list of dicts)
  • PPTX support

v2.0: Rust backend (full roadmap)

  • Rust core via PyO3 + Maturin: 10x–40x batch throughput (encoding, normalize, token, chunk)
  • Graceful Python fallback when Rust extension is unavailable (_fallback.py)
  • Pre-built wheels on PyPI (manylinux for Linux, plus Windows and macOS): zero-compile install
  • chardetng encoding detection replacing 4-attempt Python loop
  • Token-aware chunking with Markdown table preservation and section detection
  • Rayon parallelism + GIL release on all *_batch() methods
  • Memory-mapped file processing for very large PDFs (memmap2)
  • SIMD-accelerated string search for separator detection
  • Streaming iterator API (yield chunks instead of collecting all)
  • Optional SIMD feature flag (pip install "textspitter[simd]")
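To illustrate the idea behind the streaming iterator item, here is a generator that yields chunks as soon as they fill up instead of collecting them all first. It is a sketch under stated assumptions: whitespace splitting stands in for BPE tokenisation, and the name `iter_chunks` and its parameters are hypothetical rather than the planned API.

```python
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str], max_tokens: int = 200) -> Iterator[str]:
    """Yield chunks of at most `max_tokens` whitespace-separated tokens.

    Whitespace splitting is a stand-in for BPE tokenisation.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    buffer = []
    for line in lines:
        for token in line.split():
            buffer.append(token)
            if len(buffer) == max_tokens:
                # Emit the full chunk immediately and start a new one.
                yield " ".join(buffer)
                buffer = []
    if buffer:
        # Emit the final, possibly shorter, chunk.
        yield " ".join(buffer)


chunks = list(iter_chunks(["one two three", "four five"], max_tokens=2))
print(chunks)
```

Because the function is a generator, a consumer can process each chunk while later lines are still being read, which keeps memory use bounded by the chunk size.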

Contributing

Contributing Guidelines
  1. Fork the Repository: Fork the project to your GitHub account.
  2. Clone Locally: Clone your fork (replace <your-username> with your GitHub account name).
    git clone https://github.com/<your-username>/TextSpitter.git
  3. Create a New Branch: Always work on a new branch.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message.
    git commit -m 'Add new feature x.'
  6. Push to GitHub: Push the changes to your fork.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
  8. Review: Once approved, your PR will be merged. Thanks for contributing!

License

TextSpitter is released under the MIT License.
