The Document Collector is a high-performance FastAPI service responsible for ingesting, normalizing, and converting diverse document formats into a standardized JSON structure optimized for RAG pipelines.

Capabilities

The service supports a wide range of input formats:

PDF Documents

High-fidelity extraction using pymupdf (fitz) with PyPDF2 fallback.

Microsoft Office

Native support for Word (.docx), Excel (.xlsx), and PowerPoint (.pptx).

Text & Code

Parses .txt, .md, .json, .csv, .html, and other plain-text formats.

Multimedia

Integrates with the OpenAI Whisper API to transcribe audio and video files.

Web Scraping

Extracts clean, readable content from URLs using playwright and beautifulsoup4.

E-Books & Email

Processes .epub books and .mbox email archives.
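
To show how one entry point could cover all of these formats, here is a minimal sketch of extension-based routing. It is not the collector's actual dispatch code: the `HANDLERS` table, the category names, and `pick_handler` are hypothetical. Only the extensions come from the list above. Audio and video are left out because the source does not name their extensions.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a handler category.
# The extensions come from the capability list; the category names
# are illustrative and not taken from the service's code.
HANDLERS = {
    ".pdf": "pdf",
    ".docx": "office",
    ".xlsx": "office",
    ".pptx": "office",
    ".txt": "text",
    ".md": "text",
    ".json": "text",
    ".csv": "text",
    ".html": "text",
    ".epub": "ebook",
    ".mbox": "email",
}


def pick_handler(source: str) -> str:
    """Return a handler category for a file path or URL (illustrative only)."""
    # Anything that looks like a web address goes to the scraping path.
    if source.startswith(("http://", "https://")):
        return "web"
    # Otherwise route on the lowercased file extension.
    suffix = Path(source).suffix.lower()
    return HANDLERS.get(suffix, "unsupported")
```

For example, `pick_handler("report.PDF")` returns `"pdf"`, and any `http://` or `https://` source returns `"web"` whatever its suffix.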

Standardized Output

Regardless of the input format, the collector outputs a consistent JSON structure. This normalization is crucial for downstream embedding and indexing.
{
  "id": "uuid-v4",
  "url": "file:///path/to/source.pdf",
  "title": "Clean Filename.pdf",
  "docAuthor": "Extracted Author",
  "wordCount": 1500,
  "pageContent": "The full extracted text content...",
  "token_count_estimate": 1950,
  "location": "storage/custom-documents/uuid.json"
}
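
As a rough illustration, a record with this shape could be assembled from extracted text as follows. This is a sketch under stated assumptions, not the collector's implementation: `build_document` is a hypothetical helper, and the 1.3 tokens-per-word factor is inferred only from the sample above (1950 / 1500), so the real estimator may differ.

```python
import json
import uuid
from pathlib import Path


def build_document(source_path: str, text: str, author: str = "Unknown") -> dict:
    """Assemble a record in the output shape shown above (hypothetical helper)."""
    doc_id = str(uuid.uuid4())
    word_count = len(text.split())  # whitespace-delimited word count
    return {
        "id": doc_id,
        # Assumes source_path is an absolute POSIX path such as /path/to/source.pdf
        "url": "file://" + source_path,
        "title": Path(source_path).name,
        "docAuthor": author,
        "wordCount": word_count,
        "pageContent": text,
        # Assumed heuristic: about 1.3 tokens per word, matching the sample ratio
        "token_count_estimate": round(word_count * 1.3),
        "location": f"storage/custom-documents/{doc_id}.json",
    }


if __name__ == "__main__":
    doc = build_document("/path/to/source.pdf", "The full extracted text content")
    print(json.dumps(doc, indent=2))
```

Because the `id` and the `location` come from the same UUID, a consumer can find the stored JSON file for any record without a separate lookup.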

Getting Started

To learn more about the internals and how to integrate with the collector: