Tech Stack
FastAPI
v0.115.0+ with Uvicorn (ASGI) for high-concurrency request handling.
PyMuPDF
Primary engine for PDF text extraction (with PyPDF2 fallback).
Playwright
Browser automation for rendering and scraping JavaScript-heavy websites.
TikToken
Using the
cl100k_base tokenizer for accurate token counting.OpenAI Whisper
Integration for state-of-the-art audio and video transcription.
HMAC
Optional payload integrity verification for secure API communication.
Data Flow Pipeline
The processing pipeline moves data from ingestion to storage through a series of specialized components.Collector Data Flow
Processing Layers
- Routes (
app/routes/): Define the HTTP endpoints and DTOs. - Processors (
app/processors/): Orchestrate the logic. For example,single_file.pyhandles direct uploads, moving files from the “Hotdir” to the processing pipeline. - Converters (
app/converters/): specialized modules for each file type (e.g.,as_pdf.py,as_docx.py). - Extensions (
app/extensions/): Fetchers for external sources like YouTube or GitHub.
Chunking Strategy
Large documents are split into smaller, meaningful chunks to fit within LLM context windows. This logic is handled byapp/utils/chunker.py.
- Default Model:
gpt-3.5-turbo(cl100k_base encoding). - Default Size: 1000 tokens per chunk.
- Overlap: 200 tokens to maintain context between chunks.
Strategies
- By Tokens: Strict hard limit on token count.
- By Characters: Simple character count limit.
- By Paragraph: Attempts to split at paragraph boundaries for better semantic coherence.
Storage Contract
The collector outputs files to a designated storage directory defined bySTORAGE_DIR (default: ./storage).
- Processed Docs:
<STORAGE_DIR>/documents/custom-documents/ - Direct Uploads:
<STORAGE_DIR>/direct-uploads/ - Hotdir: A temporary landing zone (
<STORAGE_DIR>/hotdir/) used for atomic file moves during processing.