The Document Collector is an ingestion pipeline built on a layered architecture that separates API handling from the complexity of file format conversion.

Tech Stack

FastAPI

v0.115.0+ with Uvicorn (ASGI) for high-concurrency request handling.

PyMuPDF

Primary engine for PDF text extraction (with PyPDF2 fallback).

Playwright

Browser automation for rendering and scraping JavaScript-heavy websites.

TikToken

Token counting with the cl100k_base tokenizer.

OpenAI Whisper

Transcription of audio and video files.

HMAC

Optional payload integrity verification for secure API communication.
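
As a rough illustration of what payload integrity verification can look like, the sketch below signs a raw request body and checks the signature in constant time. The choice of SHA-256, the hex encoding, and the function names sign_payload and verify_payload are assumptions made for this example; the collector's actual scheme may differ.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature)
```

The sender would compute the signature over the exact bytes it transmits, and the receiver would verify it before parsing the body, so any change to the payload in transit causes verification to fail.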

Data Flow Pipeline

The processing pipeline moves data from ingestion to storage through a series of specialized components.

[Figure: Collector Data Flow]

Processing Layers

  1. Routes (app/routes/): Define the HTTP endpoints and DTOs.
  2. Processors (app/processors/): Orchestrate the logic. For example, single_file.py handles direct uploads, moving files from the “Hotdir” to the processing pipeline.
  3. Converters (app/converters/): Specialized modules for each file type (e.g., as_pdf.py, as_docx.py).
  4. Extensions (app/extensions/): Fetchers for external sources like YouTube or GitHub.
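
One way to picture how a processor hands a file to the right converter is a small registry keyed by file extension. This is only a sketch: the stub converter functions, the CONVERTERS mapping, and the convert helper are hypothetical names invented for this example and are not taken from the collector's code.

```python
from pathlib import Path
from typing import Callable, Dict

# A converter takes a file path and returns extracted text (assumed shape).
Converter = Callable[[Path], str]


def as_pdf(path: Path) -> str:
    # Stand-in for a real PDF converter.
    return f"pdf:{path.name}"


def as_docx(path: Path) -> str:
    # Stand-in for a real DOCX converter.
    return f"docx:{path.name}"


# Hypothetical registry mapping file extensions to converters.
CONVERTERS: Dict[str, Converter] = {
    ".pdf": as_pdf,
    ".docx": as_docx,
}


def convert(path: Path) -> str:
    """Pick a converter by file extension, ignoring case."""
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    return converter(path)
```

Keeping the lookup in one place means a new file type only needs a new converter module and one registry entry, which matches the separation between processors and converters described above.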

Chunking Strategy

Large documents are split into smaller, meaningful chunks to fit within LLM context windows. This logic is handled by app/utils/chunker.py.
  • Default Model: gpt-3.5-turbo (cl100k_base encoding).
  • Default Size: 1000 tokens per chunk.
  • Overlap: 200 tokens to maintain context between chunks.
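
The defaults above can be sketched as a sliding window over a token sequence. This is a minimal illustration, not the contents of app/utils/chunker.py; the function name chunk_tokens is an assumption. In practice the token ids would come from something like tiktoken.get_encoding("cl100k_base").encode(text), but the sketch accepts any sequence of integers so it runs without extra dependencies.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], size: int = 1000, overlap: int = 200
) -> List[List[int]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # each window starts this far after the previous one
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once a window reaches the end, so no redundant tail is emitted.
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, each chunk starts 800 tokens after the previous one, so the last 200 tokens of one chunk reappear as the first 200 tokens of the next.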

Strategies

  • By Tokens: Hard limit on token count.
  • By Characters: Simple character count limit.
  • By Paragraph: Attempts to split at paragraph boundaries for better semantic coherence.
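
The paragraph strategy can be approximated by greedily packing whole paragraphs into a chunk until a size limit would be exceeded. The sketch below is one possible reading, with assumptions labeled: it measures size in characters, treats a blank line as the paragraph boundary, and falls back to a hard character split for a single oversized paragraph. The name chunk_by_paragraph and the max_chars default are hypothetical.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 4000) -> List[str]:
    """Pack whole paragraphs into chunks of at most `max_chars` characters."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A paragraph longer than the limit is split at a hard character cut.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are kept intact whenever they fit, chunk sizes vary more than with the token or character strategies, in exchange for boundaries that fall between ideas rather than inside them.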

Storage Contract

The collector outputs files to a designated storage directory defined by STORAGE_DIR (default: ./storage).
  • Processed Docs: <STORAGE_DIR>/documents/custom-documents/
  • Direct Uploads: <STORAGE_DIR>/direct-uploads/
  • Hotdir: A temporary landing zone (<STORAGE_DIR>/hotdir/) used for atomic file moves during processing.
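
The layout above, together with the atomic move out of the hotdir, might be sketched as follows. The helper names storage_layout and promote_from_hotdir are hypothetical. os.replace is atomic only when source and destination are on the same filesystem, which holds here because both directories live under STORAGE_DIR.

```python
import os
from pathlib import Path

# Root directory from the environment, falling back to the documented default.
STORAGE_DIR = Path(os.environ.get("STORAGE_DIR", "./storage"))


def storage_layout(root: Path) -> dict:
    """Return the documented subdirectories under the storage root."""
    return {
        "documents": root / "documents" / "custom-documents",
        "direct_uploads": root / "direct-uploads",
        "hotdir": root / "hotdir",
    }


def promote_from_hotdir(
    root: Path, filename: str, target: str = "direct_uploads"
) -> Path:
    """Move a file from the hotdir to its destination in a single rename."""
    layout = storage_layout(root)
    name = Path(filename).name  # drop any directory components
    src = layout["hotdir"] / name
    dest_dir = layout[target]
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / name
    # A rename on the same filesystem never exposes a half-written file.
    os.replace(src, dest)
    return dest
```

Writing an upload fully into the hotdir first and renaming it afterwards means readers of the destination directory see either the complete file or nothing.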