diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 0000000..b61318a --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,268 @@ +# CodeRED-Astra Architecture + +## Overview + +CodeRED-Astra is a Retrieval-Augmented Generation (RAG) system for querying ISS technical documentation using vector search, MySQL metadata storage, and Gemini AI for analysis and response generation. + +## System Components + +### 1. **Rust Backend** (`rust-engine/`) +High-performance Rust backend using Warp for HTTP, SQLx for MySQL, and Reqwest for external API calls. + +#### Modules + +**`main.rs`** - Entry point +- Initializes tracing, database, storage +- Spawns FileWorker and QueryWorker background tasks +- Serves API routes on port 8000 + +**`db.rs`** - Database initialization +- Connects to MySQL +- Creates `files` table (id, filename, path, description, pending_analysis, analysis_status) +- Creates `queries` table (id, status, payload, result, timestamps) + +**`api.rs`** - HTTP endpoints +- `POST /api/files` - Upload file (multipart/form-data) +- `POST /api/files/import-demo` - Bulk import from demo-data directory +- `GET /api/files/list` - List all files with status +- `GET /api/files/delete?id=` - Delete file and remove from Qdrant +- `POST /api/query/create` - Create new query (returns query ID) +- `GET /api/query/status?id=` - Check query status +- `GET /api/query/result?id=` - Get query result +- `GET /api/query/cancel?id=` - Cancel in-progress query + +**`file_worker.rs`** - File analysis pipeline +- **Background worker** that processes files with `pending_analysis = TRUE` +- Claims stale/queued files (requeues if stuck >10 min) +- **Stage 1**: Call Gemini 1.5 Flash for initial description +- **Stage 2**: Call Gemini 1.5 Pro for deep vector graph data (keywords, relationships) +- **Stage 3**: Generate embedding and upsert to Qdrant +- **Stage 4**: Mark file as ready (`pending_analysis = FALSE`, `analysis_status = 'Completed'`) +- Resumable: Can recover from 
crashes/restarts + +**`worker.rs`** - Query processing pipeline +- **Background worker** that processes queries with `status = 'Queued'` +- Requeues stale InProgress jobs (>10 min) +- **Stage 1**: Embed query text +- **Stage 2**: Search top-K similar vectors in Qdrant +- **Stage 3**: Fetch file metadata from MySQL (only completed files) +- **Stage 4**: Call Gemini to analyze relationships between files +- **Stage 5**: Call Gemini for final answer synthesis (strict: no speculation) +- **Stage 6**: Save results to database +- Supports cancellation checks between stages + +**`gemini_client.rs`** - Gemini API integration +- `generate_text(prompt)` - Text generation with model switching via GEMINI_MODEL env var +- `demo_text_embedding(text)` - Demo 64-dim embeddings (replace with real Gemini embeddings) +- Falls back to demo responses if GEMINI_API_KEY not set + +**`vector_db.rs`** - Qdrant client +- `ensure_files_collection(dim)` - Create 'files' collection with Cosine distance +- `upsert_point(id, vector)` - Store file embedding +- `search_top_k(vector, k)` - Find k nearest neighbors +- `delete_point(id)` - Remove file from index + +**`storage.rs`** - File storage utilities +- `storage_dir()` - Get storage path from ASTRA_STORAGE env or default `/app/storage` +- `ensure_storage_dir()` - Create storage directory if missing +- `save_file(filename, contents)` - Save file to storage +- `delete_file(path)` - Remove file from storage + +**`models.rs`** - Data structures +- `FileRecord` - File metadata (mirrors files table) +- `QueryRecord` - Query metadata (mirrors queries table) +- `QueryStatus` enum - Queued, InProgress, Completed, Cancelled, Failed + +### 2. **Web App** (`web-app/`) +React + Vite frontend with Express backend for API proxying. 
+ +#### Backend (`server.mjs`) +- Express server that proxies API calls to rust-engine:8000 +- Serves React static build from `/dist` +- **Why needed**: Docker networking - React can't call rust-engine directly from browser + +#### Frontend (`src/`) +- `App.jsx` - Main chat interface component +- `components/ui/chat/chat-header.jsx` - Header with debug-only "Seed Demo Data" button (visible with `?debug=1`) +- Calls `/api/files/import-demo` endpoint to bulk-load ISS PDFs + +### 3. **MySQL Database** +Two tables for metadata storage: + +**`files` table** +```sql +id VARCHAR(36) PRIMARY KEY +filename TEXT NOT NULL +path TEXT NOT NULL +description TEXT +created_at DATETIME DEFAULT CURRENT_TIMESTAMP +pending_analysis BOOLEAN DEFAULT TRUE +analysis_status VARCHAR(32) DEFAULT 'Queued' +``` + +**`queries` table** +```sql +id VARCHAR(36) PRIMARY KEY +status VARCHAR(32) NOT NULL +payload JSON +result JSON +created_at DATETIME DEFAULT CURRENT_TIMESTAMP +updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP +``` + +### 4. **Qdrant Vector Database** +- Collection: `files` +- Dimension: 64 (demo) - replace with real Gemini embedding dimension +- Distance: Cosine similarity +- Stores file embeddings for semantic search + +### 5. **Demo Data** (`rust-engine/demo-data/`) +~20 ISS technical PDFs organized by subsystem: +- Electrical Power System (EPS) +- Environmental Control & Life Support (ECLSS) +- Command & Data Handling (C&DH) +- Structures & Mechanisms + +## Data Flow + +### File Upload & Analysis +``` +1. User uploads PDF → POST /api/files +2. API saves file to storage, inserts DB record (pending_analysis=true) +3. FileWorker claims pending file +4. Gemini 1.5 Flash generates description +5. Gemini 1.5 Pro generates vector graph data +6. Embed text → upsert to Qdrant +7. Mark file as ready (pending_analysis=false) +``` + +### Query Processing +``` +1. User submits query → POST /api/query/create +2. API inserts query record (status='Queued') +3. 
QueryWorker claims queued query +4. Embed query text +5. Search Qdrant for top-K similar files +6. Fetch file metadata from MySQL +7. Gemini analyzes relationships between files +8. Gemini synthesizes final answer (no speculation) +9. Save results to database +``` + +## Deployment + +### Development (`docker-compose.yml`) +- Local testing with hot-reload +- Bind mounts for code + +### Production (`docker-compose.prod.yml`) +- Used by GitHub Actions for deployment +- Runs rust-engine as user "1004" (github-actions) +- Docker volume: `rust-storage` → `/app/storage` +- Bind mount: `/var/www/codered-astra/rust-engine/demo-data` → `/app/demo-data:ro` +- Environment variables: + - `ASTRA_STORAGE=/app/storage` + - `DEMO_DATA_DIR=/app/demo-data` + - `QDRANT_URL=http://qdrant:6333` + - `GEMINI_API_KEY=` + - `DATABASE_URL=mysql://astraadmin:password@mysql:3306/astra` + +## Key Design Decisions + +### 1. **Two-Stage Analysis (Flash → Pro)** +- Flash is faster/cheaper for initial description +- Pro is better for deep analysis and relationship extraction +- Enables cost-effective scaling + +### 2. **Resumable Workers** +- Workers requeue stale jobs (>10 min in InProgress) +- Survives container restarts without data loss +- Atomic state transitions via SQL + +### 3. **Separation of Concerns** +- FileWorker: Makes files searchable +- QueryWorker: Answers user queries +- Independent scaling and failure isolation + +### 4. **Strict Answer Generation** +- Gemini prompted to not speculate +- Must state uncertainty when info is insufficient +- Prevents hallucination in critical ISS documentation + +### 5. 
**Demo Embeddings** +- Current: 64-dim deterministic embeddings from text hash +- Production: Replace with real Gemini text embeddings API +- Allows development/testing without embedding API credits + +## API Usage Examples + +### Upload File +```bash +curl -F "file=@document.pdf" http://localhost:3001/api/files +``` + +### Import Demo Data +```bash +curl -X POST http://localhost:3001/api/files/import-demo +``` + +### Create Query +```bash +curl -X POST http://localhost:3001/api/query/create \ + -H "Content-Type: application/json" \ + -d '{"q": "What is the voltage of the ISS main bus?", "top_k": 5}' +``` + +### Check Status +```bash +curl http://localhost:3001/api/query/status?id= +``` + +### Get Result +```bash +curl http://localhost:3001/api/query/result?id= +``` + +## Future Enhancements + +### High Priority +1. Real Gemini text embeddings (replace demo embeddings) +2. File status UI panel (show processing progress) +3. Health check endpoint (`/health`) +4. Data purge endpoint (clear all files/queries) + +### Medium Priority +1. Streaming query responses (SSE/WebSocket) +2. Query result caching +3. File chunking for large PDFs +4. User authentication + +### Low Priority +1. Multi-collection support (different document types) +2. Query history UI +3. File preview in chat +4. 
Export results to PDF + +## Troubleshooting + +### Storage Permission Errors +- Ensure `/app/storage` is owned by the container user +- Docker volume must be writable by user 1004 in production + +### SQL Syntax Errors +- Issue each `CREATE TABLE` as its own statement +- `sqlx::query()` prepares a single statement, so multiple DDL statements cannot be combined in one call + +### Qdrant Connection Issues +- Check the QDRANT_URL environment variable +- Ensure the qdrant service is running and healthy +- Verify network connectivity between containers + +### Worker Not Processing +- Check logs: `docker logs rust-engine` +- Verify database connectivity +- Look for stale InProgress jobs in the queries/files tables + +## Demo Presentation (3 minutes) + +See `rust-engine/DEMODETAILS.md` for a curated demo script with example queries. diff --git a/QUICK_REFERENCE.md b/QUICK_REFERENCE.md new file mode 100644 index 0000000..c694cfa --- /dev/null +++ b/QUICK_REFERENCE.md @@ -0,0 +1,219 @@ +# CodeRED-Astra Quick Reference + +## System Overview + +**Two-worker architecture for ISS document RAG:** + +1. **FileWorker**: Analyzes uploaded files (Flash → Pro → Embed → Qdrant) +2. **QueryWorker**: Answers queries (Embed → Search → Relationships → Answer) + +Both workers are **resumable** and automatically recover from crashes. 
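The recovery behaviour can be illustrated with a minimal in-memory sketch. It is not the code in `file_worker.rs` or `worker.rs`: the `Job` struct, the `requeue_stale` function, and the `updated_at` field are hypothetical stand-ins for database rows, and only a subset of the status values is modelled. The idea it shows is the one described in this document: a job left in InProgress for longer than a timeout (10 minutes here) is moved back to Queued so a worker can claim it again after a crash or restart.

```rust
use std::time::{Duration, Instant};

/// Subset of the job states listed in this reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    InProgress,
    Completed,
}

/// Hypothetical in-memory stand-in for a row in `files` or `queries`.
#[derive(Debug)]
struct Job {
    id: u32,
    status: Status,
    /// Assumed "last state change" time; the real tables use DATETIME columns.
    updated_at: Instant,
}

/// Move every InProgress job older than `timeout` back to Queued.
/// Returns how many jobs were requeued.
fn requeue_stale(jobs: &mut [Job], now: Instant, timeout: Duration) -> usize {
    let mut requeued = 0;
    for job in jobs.iter_mut() {
        let age = now.saturating_duration_since(job.updated_at);
        if job.status == Status::InProgress && age > timeout {
            job.status = Status::Queued;
            job.updated_at = now;
            requeued += 1;
        }
    }
    requeued
}

fn main() {
    let start = Instant::now();
    let timeout = Duration::from_secs(10 * 60);
    let mut jobs = vec![
        // Claimed 15 minutes before `now`: treated as abandoned.
        Job { id: 1, status: Status::InProgress, updated_at: start },
        // Claimed 1 minute before `now`: still considered active.
        Job { id: 2, status: Status::InProgress, updated_at: start + Duration::from_secs(14 * 60) },
        // Finished jobs are never touched.
        Job { id: 3, status: Status::Completed, updated_at: start },
    ];

    let now = start + Duration::from_secs(15 * 60);
    let count = requeue_stale(&mut jobs, now, timeout);
    println!("requeued {count} job(s)");
    for job in &jobs {
        println!("job {} -> {:?}", job.id, job.status);
    }
}
```

In the real service the same check would presumably be a single SQL `UPDATE` with a time condition, which keeps the transition atomic when more than one worker is running.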
+ +## Core Data Flow + +``` +Upload PDF → Storage → MySQL (pending) → FileWorker → Qdrant → MySQL (ready) + ↓ +User Query → MySQL (queued) → QueryWorker → Search Qdrant → Gemini → Result +``` + +## Module Map + +| Module | Purpose | Key Functions | +|--------|---------|---------------| +| `main.rs` | Entry point | Spawns workers, serves API | +| `db.rs` | Database init | Creates files/queries tables | +| `api.rs` | HTTP endpoints | Upload, list, delete, query CRUD | +| `file_worker.rs` | File analysis | Flash→Pro→embed→upsert | +| `worker.rs` | Query processing | Search→relationships→answer | +| `gemini_client.rs` | AI integration | Text generation, embeddings | +| `vector_db.rs` | Qdrant client | Upsert, search, delete | +| `storage.rs` | File management | Save/delete files | +| `models.rs` | Data structures | FileRecord, QueryRecord | + +## API Endpoints + +### Files +- `POST /api/files` - Upload file +- `POST /api/files/import-demo?force=1` - Bulk import demo PDFs +- `GET /api/files/list` - List all files with status +- `GET /api/files/delete?id=` - Delete file + +### Queries +- `POST /api/query/create` - Create query +- `GET /api/query/status?id=` - Check status +- `GET /api/query/result?id=` - Get result +- `GET /api/query/cancel?id=` - Cancel query + +## Database Schema + +### files +- `id` - UUID primary key +- `filename` - Original filename +- `path` - Storage path +- `description` - Gemini Flash description +- `pending_analysis` - FALSE when ready for search +- `analysis_status` - Queued/InProgress/Completed/Failed + +### queries +- `id` - UUID primary key +- `status` - Queued/InProgress/Completed/Cancelled/Failed +- `payload` - JSON query params `{"q": "...", "top_k": 5}` +- `result` - JSON result `{"summary": "...", "related_files": [...], "relationships": "...", "final_answer": "..."}` + +## Environment Variables + +### Required +- `GEMINI_API_KEY` - Gemini API key +- `DATABASE_URL` - MySQL connection string +- `QDRANT_URL` - Qdrant URL (default: 
http://qdrant:6333) + +### Optional +- `ASTRA_STORAGE` - Storage directory (default: /app/storage) +- `DEMO_DATA_DIR` - Demo data directory (default: /app/demo-data) +- `GEMINI_MODEL` - Override Gemini model (default: gemini-1.5-pro) + +## Worker States + +### FileWorker +1. **Queued** - File uploaded, awaiting processing +2. **InProgress** - Currently being analyzed +3. **Completed** - Ready for search (pending_analysis=FALSE) +4. **Failed** - Error during processing + +### QueryWorker +1. **Queued** - Query created, awaiting processing +2. **InProgress** - Currently searching/analyzing +3. **Completed** - Result available +4. **Cancelled** - User cancelled +5. **Failed** - Error during processing + +## Gemini Prompts + +### FileWorker Stage 1 (Flash) +``` +Describe the file '{filename}' and extract all key components, keywords, +and details for later vectorization. Be comprehensive and factual. +``` + +### FileWorker Stage 2 (Pro) +``` +Given the file '{filename}' and its description: {desc} +Generate a set of vector graph data (keywords, use cases, relationships) +that can be used for broad and precise search. Only include what is +directly supported by the file. +``` + +### QueryWorker Stage 4 (Relationships) +``` +You are an assistant analyzing relationships STRICTLY within the provided files. +Query: {query} +Files: {file_list} +Tasks: +1) Summarize key details from the files relevant to the query. +2) Describe relationships and linkages strictly supported by these files. +3) List important follow-up questions that could be answered only using the provided files. +Rules: Do NOT guess or invent. If information is insufficient in the files, explicitly state that. +``` + +### QueryWorker Stage 5 (Final Answer) +``` +You are to compose a final answer to the user query using only the information from the files. 
+Query: {query} +Files considered: {file_list} +Relationship analysis: {relationships} +Requirements: +- Use only information present in the files and analysis above. +- If the answer is uncertain or cannot be determined from the files, clearly state that limitation. +- Avoid speculation or assumptions. +Provide a concise, structured answer. +``` + +## Docker Architecture + +### Services +- **rust-engine** - Warp API + workers (port 8000) +- **web-app** - Express + React (port 3001) +- **mysql** - MySQL 9.1 (port 3306) +- **qdrant** - Qdrant vector DB (port 6333) +- **phpmyadmin** - DB admin UI (port 8080) + +### Volumes (Production) +- `rust-storage:/app/storage` - File storage (writable) +- `/var/www/codered-astra/rust-engine/demo-data:/app/demo-data:ro` - Demo PDFs (read-only) +- `~/astra-logs:/var/log` - Log files + +## Common Issues + +### 1. SQL Syntax Error +**Problem**: `error near 'CREATE TABLE'` +**Cause**: Multiple CREATE TABLE in one query +**Fix**: Split into separate `sqlx::query()` calls + +### 2. Permission Denied +**Problem**: `Permission denied (os error 13)` +**Cause**: Container user can't write to storage +**Fix**: Use Docker volume, ensure ownership matches container user + +### 3. Worker Not Processing +**Problem**: Files/queries stuck in Queued +**Cause**: Worker crashed or not started +**Fix**: Check logs, ensure workers spawned in main.rs + +### 4. 
Qdrant Connection Failed +**Problem**: `qdrant upsert/search failed` +**Cause**: Qdrant not running or wrong URL +**Fix**: Verify QDRANT_URL, check qdrant container health + +## Development Commands + +```bash +# Build and run locally +cd rust-engine +cargo build +cargo run + +# Check code +cargo check + +# Run with logs +RUST_LOG=info cargo run + +# Docker compose (dev) +docker-compose up --build + +# Docker compose (production) +docker-compose -f docker-compose.prod.yml up -d + +# View logs +docker logs rust-engine -f + +# Rebuild single service +docker-compose build rust-engine +docker-compose up -d rust-engine +``` + +## Testing Flow + +1. Start services: `docker-compose up -d` +2. Import demo data: `curl -X POST http://localhost:3001/api/files/import-demo` +3. Wait for FileWorker to complete (~30 seconds for 20 files) +4. Check file status: `curl http://localhost:3001/api/files/list` +5. Create query: `curl -X POST http://localhost:3001/api/query/create -H "Content-Type: application/json" -d '{"q": "ISS main bus voltage", "top_k": 5}'` +6. Check status: `curl http://localhost:3001/api/query/status?id=` +7. 
Get result: `curl http://localhost:3001/api/query/result?id=` + +## Performance Notes + +- FileWorker: ~1-2 sec per file (demo embeddings) +- QueryWorker: ~3-5 sec per query (search + 2 Gemini calls) +- Qdrant search: <100ms for 1000s of vectors +- MySQL queries: <10ms for simple selects + +## Security Considerations + +- Store GEMINI_API_KEY in GitHub Secrets (production) +- Use environment variables for all credentials +- Don't commit `.env` files +- Restrict phpmyadmin to internal network only +- Use HTTPS in production deployment diff --git a/rust-engine/src/db.rs b/rust-engine/src/db.rs index efbe686..3245814 100644 --- a/rust-engine/src/db.rs +++ b/rust-engine/src/db.rs @@ -1,10 +1,11 @@ -use sqlx::{MySql, MySqlPool}; +use sqlx::MySqlPool; use tracing::info; pub async fn init_db(database_url: &str) -> Result { let pool = MySqlPool::connect(database_url).await?; // Create tables if they don't exist. Simple schema for demo/hackathon use. + // Note: MySQL requires separate statements for each CREATE TABLE sqlx::query( r#" CREATE TABLE IF NOT EXISTS files ( @@ -15,8 +16,14 @@ pub async fn init_db(database_url: &str) -> Result { created_at DATETIME DEFAULT CURRENT_TIMESTAMP, pending_analysis BOOLEAN DEFAULT TRUE, analysis_status VARCHAR(32) DEFAULT 'Queued' - ); + ) + "#, + ) + .execute(&pool) + .await?; + sqlx::query( + r#" CREATE TABLE IF NOT EXISTS queries ( id VARCHAR(36) PRIMARY KEY, status VARCHAR(32) NOT NULL, @@ -24,7 +31,7 @@ pub async fn init_db(database_url: &str) -> Result { result JSON, created_at DATETIME DEFAULT CURRENT_TIMESTAMP, updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP - ); + ) "#, ) .execute(&pool)