Overview
DataDot is a comprehensive RAG (Retrieval-Augmented Generation) application that combines a modern Next.js frontend with powerful Python backend services to provide:
- Multi-LLM Support: Integration with OpenAI, Anthropic, Google Gemini, Cohere, and Ollama
- Smart Document Processing: Convert and process PDFs, Word docs, Excel files, images, audio, and more
- Intelligent Chat Interface: Real-time streaming conversations with AI agents
- Workspace Management: Organize documents and conversations by project
- Vector Search: Multiple vector database support (LanceDB, ChromaDB, Pinecone, Qdrant, Weaviate)
- MCP Server: Model Context Protocol server for AI agent integration
Features
Document Collector
Intelligent ingestion system that processes PDFs, websites, and YouTube videos into RAG-ready formats.
MCP Server
Model Context Protocol server that exposes documentation and tools to AI agents like Claude and Cursor.
Vector Search
High-performance similarity search using Pinecone and OpenAI embeddings.
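To make the idea of similarity search concrete, here is a minimal, dependency-free sketch of ranking stored embeddings against a query by cosine similarity. It is an illustration only: the function names `cosine_similarity` and `top_k` and the in-memory dictionary are hypothetical, and DataDot itself delegates this work to a vector database rather than a Python loop.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    documents: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (document id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, but the ranking criterion is the same kind of vector similarity.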
Admin Dashboard
Next.js-based interface for managing workspaces, documents, and API keys.
- Customizable Themes: Dark/light mode with custom branding
- Internationalization: Multi-language support via i18next
- Built with Next.js 16 (App Router) and React 19
- Real-time chat with streaming responses (SSE + WebSockets)
- Responsive design with mobile support
- Custom theme system with CSS variables
- JWT-based authentication
- Drag-and-drop file uploads
- Text-to-speech and speech-to-text capabilities
- Data visualization with Recharts
Backend Server
- FastAPI with async support
- Multi-user mode with workspace isolation
- Secure authentication with JWT tokens
- Unified LLM interface for multiple providers
- Vector database operations for RAG
- Background job scheduling
- Comprehensive API documentation (OpenAPI/Swagger)
Document Collector
- Support for 9+ file formats (PDF, DOCX, XLSX, EPUB, MBOX, images, audio)
- Audio transcription with OpenAI Whisper
- OCR for images using Tesseract
- Web scraping with depth control
- GitHub/GitLab repository ingestion
- YouTube transcript extraction
- Obsidian Vault processing
- Automatic token counting and text chunking
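
The token counting and chunking step listed above can be sketched as follows. This is a simplified illustration, not the collector's actual implementation: it approximates tokens by splitting on whitespace (a real pipeline would use the target model's tokenizer), and the names `count_tokens`, `chunk_text`, `max_tokens`, and `overlap` are assumptions made for the example.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Approximate the token count by splitting on whitespace (an assumption)."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

The overlap keeps a little shared context between neighboring chunks, so a sentence that straddles a boundary is still retrievable from at least one of them.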
MCP Server
- Model Context Protocol (MCP) server for AI agent integration
- SSE (Server-Sent Events) transport for real-time communication
- Dynamic configuration updates at runtime
- API key authentication
- Multiple LLM provider support
- Document upload and processing tools
- Vector database query tools
- Workspace management tools
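
For readers unfamiliar with the SSE transport mentioned above, the sketch below shows how a single Server-Sent Events message is framed on the wire: optional `id:` and `event:` fields, one `data:` line per line of payload, and a blank line to end the message. The helper name `format_sse` is hypothetical and is not taken from DataDot's code or from any MCP SDK.

```python
import json
from typing import Any, List, Optional


def format_sse(data: Any, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream wire format."""
    # Non-string payloads are JSON-encoded; strings are sent as-is.
    payload = data if isinstance(data, str) else json.dumps(data)

    lines: List[str] = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of a multi-line payload needs its own "data:" field.
    for line in payload.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"
```

A server would write these strings to a response with the `text/event-stream` content type and keep the connection open between messages.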