Architecture

The Document Collector is designed to provide a robust ingestion pipeline. It follows a layered architecture to separate API handling from the complexity of file format conversion.

Tech Stack

FastAPI

v0.115.0+ with Uvicorn (ASGI) for high-concurrency request handling.

PyMuPDF

Primary engine for PDF text extraction (with PyPDF2 fallback).

Playwright

Browser automation for rendering and scraping JavaScript-heavy websites.

TikToken

Using the cl100k_base tokenizer for accurate token counting.

OpenAI Whisper

Integration for state-of-the-art audio and video transcription.

HMAC

Optional payload integrity verification for secure API communication.

Data Flow Pipeline

The processing pipeline moves data from ingestion to storage through a series of specialized components.

Processing Layers

Routes (app/routes/): Define the HTTP endpoints and DTOs.
Processors (app/processors/): Orchestrate the logic. For example, single_file.py handles direct uploads, moving files from the “Hotdir” to the processing pipeline.
Converters (app/converters/): specialized modules for each file type (e.g., as_pdf.py, as_docx.py).
Extensions (app/extensions/): Fetchers for external sources like YouTube or GitHub.

Chunking Strategy

Large documents are split into smaller, meaningful chunks to fit within LLM context windows. This logic is handled by app/utils/chunker.py.

Default Model: gpt-3.5-turbo (cl100k_base encoding).
Default Size: 1000 tokens per chunk.
Overlap: 200 tokens to maintain context between chunks.

Strategies

By Tokens: Strict hard limit on token count.
By Characters: Simple character count limit.
By Paragraph: Attempts to split at paragraph boundaries for better semantic coherence.

Storage Contract

The collector outputs files to a designated storage directory defined by STORAGE_DIR (default: ./storage).

Processed Docs: <STORAGE_DIR>/documents/custom-documents/
Direct Uploads: <STORAGE_DIR>/direct-uploads/
Hotdir: A temporary landing zone (<STORAGE_DIR>/hotdir/) used for atomic file moves during processing.

Getting Started

Core Components

Tech Stack

FastAPI

PyMuPDF

Playwright

TikToken

OpenAI Whisper

HMAC

Data Flow Pipeline

Processing Layers

Chunking Strategy

Strategies

Storage Contract

Getting Started

Core Components

​Tech Stack

FastAPI

PyMuPDF

Playwright

TikToken

OpenAI Whisper

HMAC

​Data Flow Pipeline

​Processing Layers

​Chunking Strategy

​Strategies

​Storage Contract

Tech Stack

Data Flow Pipeline

Processing Layers

Chunking Strategy

Strategies

Storage Contract