Overview - Datadot

The Document Collector is a high-performance FastAPI service responsible for ingesting, normalizing, and converting diverse document formats into a standardized JSON structure optimized for RAG pipelines.

Capabilities

The service supports a wide range of input formats:

PDF Documents

High-fidelity extraction using pymupdf (fitz), with PyPDF2 as a fallback.
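The primary-then-fallback idea can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the function names (`extract_with_pymupdf`, `extract_with_pypdf2`, `extract_pdf_text`) are hypothetical, and treating "raised an error or returned only whitespace" as the fallback trigger is an assumption.

```python
from typing import Callable, Optional, Sequence


def extract_with_pymupdf(path: str) -> str:
    """Extract text with PyMuPDF. Imported lazily so this module loads without it."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pypdf2(path: str) -> str:
    """Extract text with PyPDF2. Imported lazily for the same reason."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return None or an empty string for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_pdf_text(
    path: str,
    extractors: Sequence[Callable[[str], str]] = (
        extract_with_pymupdf,
        extract_with_pypdf2,
    ),
) -> str:
    """Try each extractor in order and return the first non-empty result.

    Assumption: an extractor "fails" if it raises or yields only whitespace.
    """
    last_error: Optional[Exception] = None
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # a broad catch is deliberate in this sketch
            last_error = exc
            continue
        if text.strip():
            return text
    raise RuntimeError(f"no extractor produced text for {path}") from last_error
```

Passing the extractors as a parameter keeps the fallback order explicit and lets the logic be exercised without either PDF library installed.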

Microsoft Office

Native support for Word (.docx), Excel (.xlsx), and PowerPoint (.pptx).

Text & Code

Parses .txt, .md, .json, .csv, .html and other plain text formats.

Multimedia

Integrates with the OpenAI Whisper API to transcribe audio and video files.

Web Scraping

Extracts clean content from URLs using playwright and beautifulsoup4.

E-Books & Email

Processes .epub books and .mbox email archives.
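One way to picture how a single service covers the formats above is a lookup from file extension to handler. The sketch below is an assumption about the design, not the collector's documented routing: the `HANDLERS` table, the handler labels, and `pick_handler` are hypothetical, and only extensions named above are included (audio, video, and URL inputs are left out because no extensions are given for them).

```python
from pathlib import Path

# Hypothetical mapping from file extension to a handler label.
HANDLERS = {
    ".pdf": "pdf",
    ".docx": "office",
    ".xlsx": "office",
    ".pptx": "office",
    ".txt": "text",
    ".md": "text",
    ".json": "text",
    ".csv": "text",
    ".html": "text",
    ".epub": "epub",
    ".mbox": "mbox",
}


def pick_handler(filename: str) -> str:
    """Return the handler label for a filename, ignoring extension case."""
    suffix = Path(filename).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename!r}") from None
```

For example, `pick_handler("Report.PDF")` resolves to the `"pdf"` label, while an unlisted extension raises `ValueError` instead of being silently parsed as text.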

Standardized Output

Regardless of the input format, the collector outputs a consistent JSON structure. This normalization is crucial for downstream embedding and indexing.

{
  "id": "uuid-v4",
  "url": "file:///path/to/source.pdf",
  "title": "Clean Filename.pdf",
  "docAuthor": "Extracted Author",
  "wordCount": 1500,
  "pageContent": "The full extracted text content...",
  "token_count_estimate": 1950,
  "location": "storage/custom-documents/uuid.json"
}
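A record with the fields shown above could be assembled along these lines. This is a sketch under stated assumptions: `build_record` is a hypothetical helper, the source path is assumed to be an absolute POSIX path, the `"Unknown"` author default is invented for illustration, and the token estimate simply reuses the 1.3 tokens-per-word ratio implied by the sample (1950 tokens for 1500 words). The collector's real estimator is not stated.

```python
import uuid
from pathlib import Path


def build_record(source_path: str, text: str, author: str = "Unknown") -> dict:
    """Build a dict with the same keys as the sample output.

    Assumptions (not confirmed by the docs): whitespace-based word counting,
    a 1.3 tokens-per-word estimate, and an absolute POSIX source path.
    """
    doc_id = str(uuid.uuid4())
    word_count = len(text.split())
    return {
        "id": doc_id,
        "url": f"file://{source_path}",
        "title": Path(source_path).name,
        "docAuthor": author,
        "wordCount": word_count,
        "pageContent": text,
        # Integer arithmetic avoids float rounding: 1500 words -> 1950 tokens.
        "token_count_estimate": word_count * 13 // 10,
        "location": f"storage/custom-documents/{doc_id}.json",
    }
```

Calling `json.dumps(build_record("/path/to/source.pdf", text), indent=2)` would then yield a document shaped like the sample, with a fresh UUID in both `id` and `location`.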

Getting Started

To learn more about the internals and how to integrate with the collector:

Architecture

Review the internal processing pipeline and tech stack.

Extensions

Learn about external integrations like YouTube and GitHub.
