
Overview

DataDot is a Retrieval-Augmented Generation (RAG) application that pairs a Next.js frontend with Python backend services. It provides:
  • Multi-LLM Support: Integration with OpenAI, Anthropic, Google Gemini, Cohere, and Ollama
  • Smart Document Processing: Convert and process PDFs, Word docs, Excel files, images, audio, and more
  • Intelligent Chat Interface: Real-time streaming conversations with AI agents
  • Workspace Management: Organize documents and conversations by project
  • Vector Search: Multiple vector database support (LanceDB, ChromaDB, Pinecone, Qdrant, Weaviate)
  • MCP Server: Model Context Protocol server for AI agent integration

Features

Document Collector

Ingestion system that converts PDFs, websites, and YouTube videos into RAG-ready formats.

MCP Server

Model Context Protocol server that exposes documentation and tools to AI agents like Claude and Cursor.

Vector Search

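Similarity search over document embeddings, for example OpenAI embeddings stored in Pinecone. The other vector databases listed in the Overview (LanceDB, ChromaDB, Qdrant, Weaviate) are also supported.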
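
To make the idea of similarity search concrete, here is a minimal sketch in plain Python. It is not DataDot's implementation: the `cosine_similarity` and `top_k` names and the in-memory `index` dictionary are assumptions made for illustration, and a real deployment would delegate this ranking to the configured vector database.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    index: dict[str, list[float]],  # hypothetical in-memory store: id -> embedding
    k: int = 3,
) -> list[tuple[str, float]]:
    """Rank every stored embedding against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Calling `top_k(query_embedding, index, k=5)` returns `(document_id, score)` pairs sorted from most to least similar. The brute-force scan is fine for a handful of vectors; dedicated vector databases use approximate indexes to keep this fast at scale.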

Admin Dashboard

Next.js-based interface for managing workspaces, documents, and API keys.
  • Customizable Themes: Dark/light mode with custom branding
  • Internationalization: Multi-language support via i18next
  • Built with Next.js 16 (App Router) and React 19
  • Real-time chat with streaming responses (SSE + WebSockets)
  • Responsive design with mobile support
  • Custom theme system with CSS variables
  • JWT-based authentication
  • Drag-and-drop file uploads
  • Text-to-speech and speech-to-text capabilities
  • Data visualization with Recharts

Backend Server

  • FastAPI with async support
  • Multi-user mode with workspace isolation
  • Secure authentication with JWT tokens
  • Unified LLM interface for multiple providers
  • Vector database operations for RAG
  • Background job scheduling
  • Comprehensive API documentation (OpenAPI/Swagger)
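
One common way to build a unified LLM interface like the one listed above is a small provider protocol plus a registry keyed by provider name. The sketch below only illustrates that pattern; `LLMProvider`, `EchoProvider`, `register_provider`, and `get_provider` are hypothetical names, not part of DataDot's API, and the stand-in provider makes no network calls.

```python
from __future__ import annotations

from typing import Callable, Protocol


class LLMProvider(Protocol):
    """Minimal contract every provider adapter is assumed to satisfy."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in adapter for this sketch; a real one would call a vendor SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Maps a provider name to a factory that builds its adapter on demand.
_REGISTRY: dict[str, Callable[[], LLMProvider]] = {}


def register_provider(name: str, factory: Callable[[], LLMProvider]) -> None:
    _REGISTRY[name] = factory


def get_provider(name: str) -> LLMProvider:
    """Look up a provider by name, failing loudly on unknown names."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
    return factory()


register_provider("openai", lambda: EchoProvider("openai"))
register_provider("ollama", lambda: EchoProvider("ollama"))
```

With this shape, calling code depends only on `complete()`, so switching providers is a one-line change such as `get_provider("ollama")`.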

Document Collector

  • Support for 9+ file formats, including PDF, DOCX, XLSX, EPUB, MBOX, images, and audio
  • Audio transcription with OpenAI Whisper
  • OCR for images using Tesseract
  • Web scraping with depth control
  • GitHub/GitLab repository ingestion
  • YouTube transcript extraction
  • Obsidian Vault processing
  • Automatic token counting and text chunking
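
The token counting and chunking step can be pictured with the following sketch. It rests on a loud assumption: tokens are approximated as whitespace-separated words, whereas a production pipeline would normally use the tokenizer of the target model. The `count_tokens` and `chunk_text` names and their parameters are hypothetical.

```python
from __future__ import annotations


def count_tokens(text: str) -> int:
    """Approximate token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of at most max_tokens words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears with some context in the next chunk.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk can then be embedded and stored separately. Larger overlaps improve recall at chunk boundaries at the cost of more stored vectors.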

MCP Server

  • Model Context Protocol (MCP) server for AI agent integration
  • SSE (Server-Sent Events) transport for real-time communication
  • Dynamic configuration updates at runtime
  • API key authentication
  • Multiple LLM provider support
  • Document upload and processing tools
  • Vector database query tools
  • Workspace management tools
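
For readers unfamiliar with the SSE transport mentioned above, the sketch below shows how a single Server-Sent Events frame is laid out on the wire: optional `id:` and `event:` fields, one `data:` line per line of payload, and a blank line to end the event. The `format_sse` helper is a hypothetical illustration and is not taken from the DataDot codebase.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def format_sse(
    data: Any,
    event: Optional[str] = None,
    event_id: Optional[str] = None,
) -> str:
    """Serialize one message in the text/event-stream wire format."""
    lines: list[str] = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")

    # Non-string payloads are sent as JSON; strings are sent as-is.
    payload = data if isinstance(data, str) else json.dumps(data)

    # Multi-line payloads need one "data:" field per line.
    for line in payload.splitlines() or [""]:
        lines.append(f"data: {line}")

    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

A server would write each returned string to a response with the `text/event-stream` content type, and the client reassembles the `data:` lines into one message per event.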