A portfolio project demonstrating edge AI deployment for querying DOD Directives using RAG (Retrieval Augmented Generation). Features a modern web interface with folder selection, real-time processing output, and adaptive hardware detection.
- Advanced Retrieval System:
- Cross-encoder Reranking: Optional two-stage retrieval with cross-encoder models for improved result quality
- Document-aware Ranking: Hierarchical boosting for document relationships and adjacent chunk proximity
- Semantic Search: Vector similarity search using SentenceTransformers embeddings with contextual responses
- Adaptive Hardware Detection: Automatically detects system capabilities and recommends optimal configurations
- Centralized Configuration: All models and settings managed through config.yml for easy customization
- Web Interface: Intuitive GUI with folder selection, live terminal output, and real-time status monitoring
- Offline Operation: Complete offline functionality with lightweight, file-based vector database
To generate the vector database, you will need to manually download the policies from https://esd.whs.mil/. This README assumes you will save them to policies/dodi, policies/dodm, and policies/dodd. As of 19 June 2025 there were about a thousand documents retrievable from this site.
Alternatively, prebuilt vector databases and policy PDFs can be downloaded. Without a CUDA-capable GPU or Apple silicon, generating embeddings with the highest-performing model will likely take several hours, so the prebuilt data will get you querying much, much faster:
https://www.dropbox.com/scl/fi/trgevdgnzkxsbbvzwx3ze/iris-data.zip?rlkey=5uzfu4iwk56lv12n5koqh03wo&st=725nkl6t&dl=0
- Install Prerequisites
- Python 3.10+: Download from https://python.org/downloads/
- Be sure to check "Add Python to PATH" during installation
- Test: Open Command Prompt → python --version
- Ollama: Download from https://ollama.ai/download
- Install the Windows executable
- Test: ollama --version
- Setup Virtual Environment
- Open Command Prompt in your IRIS folder:
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
- Start Ollama Server
- In one Command Prompt window (keep running):
ollama serve
- Launch IRIS GUI
- In another Command Prompt window:
cd path\to\iris
venv\Scripts\activate
python gui\app.py
# Install Ollama (prerequisite for LLM responses)
# macOS: brew install ollama
# Linux: curl -fsSL https://ollama.ai/install.sh | sh
# Windows: Download from https://ollama.ai/download
# Create virtual environment and install dependencies
make setup
# Or manually:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Launch the web interface
python3 gui/app.py
# Opens automatically at http://localhost:8080
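To confirm the Ollama server is reachable before (or after) launching the GUI, you can hit its REST endpoint from Python. This is a minimal sketch, assuming Ollama's default port (11434) and the requests package; the ollama_is_running helper is illustrative and not part of the IRIS codebase:

```python
import requests

def ollama_is_running(host: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server answers on its REST endpoint."""
    try:
        resp = requests.get(f"{host}/api/tags", timeout=3)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is up. Installed models:", models or "none yet")
    return True

if __name__ == "__main__":
    if not ollama_is_running():
        print("Ollama not reachable -- start it with `ollama serve` first.")
```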
- Python: 3.10+
- RAM: 4GB minimum, 8GB+ recommended
- GPU: Not required, but highly recommended for speed
- Storage: 2GB+ for full document collection
- OS: Windows, macOS, or Linux
- Document Processing: pdfplumber, SentenceTransformers
- Vector Database: ChromaDB, NumPy
- Cross-encoder Reranking: sentence-transformers (cross-encoder models)
- Hardware Detection: psutil, GPUtil
- LLM Integration: Ollama
- Configuration: PyYAML
- Web Interface: Flask
- 2-4GB RAM: TinyLlama 1.1B Chat Q4 (minimum spec functionality)
- 4-6GB RAM: Llama 3.2 1B Instruct Q4 (basic LLM functionality)
- 6-8GB RAM: Llama 3.2 3B Instruct Q4 (standard LLM performance)
- 8-12GB RAM: Gemma2 9B Instruct Q4 (high-quality performance)
- 12GB+ RAM: Phi4 Mini (premium performance)
- GPU: Automatic GPU acceleration when available (Ollama)
- 4-8GB RAM: all-MiniLM-L6-v2 (fast processing - default)
- 8-16GB RAM: all-mpnet-base-v2 (policy documents)
- 16GB+ RAM: mixedbread-ai/mxbai-embed-large-v1 (best retrieval)
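All three embedding models load through the same SentenceTransformers API. The snippet below is an illustrative sketch (the sample chunks are made up), not IRIS code:

```python
from sentence_transformers import SentenceTransformer

# Swap in "all-mpnet-base-v2" or "mixedbread-ai/mxbai-embed-large-v1" as RAM allows.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "The Defense Acquisition Executive is the MDA for ACAT ID programs.",
    "Performance evaluations are conducted annually per component policy.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for MiniLM; larger models produce wider vectors
```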
- Choose Embedding Model: Select your preferred embedding model for processing (all-MiniLM-L6-v2, all-mpnet-base-v2, or mixedbread-ai/mxbai-embed-large-v1)
- Select Document Folders: Click "Select Document Folders" to choose directories containing PDF files
- Process Documents: Click "Process and Generate Embeddings" to start document processing
- Live Terminal Output: Watch real-time command output in the terminal-style display
- Progress Monitoring: See exactly what files are being processed and current status
- Auto-completion: Process completes automatically and updates system status
- Start Ollama Server: Use the "Ollama Server Control" panel to start the Ollama server
- Model Selection: Choose your preferred LLM and embedding models - they load automatically when selected
- Advanced Options: Enable cross-encoder reranking for improved result quality (slower but more accurate)
- Real-time Status: Monitor system readiness with live status indicators for docs, models, and server
- Ask Questions: Query your processed documents and get intelligent, contextual responses with automatic citations
- "What ACAT levels delegate decision authority to the service components?"
- "When is a CAPE AOA requried for an MDAP?"
- "How are performance evaluations conducted?"
Sample Response with Citations:
Response:
Based on DOD policies: For an ACAT ID program, the decision authority is typically either the Defense Acquisition Executive (DAE) or a designee. This can be inferred from Section 2430(d)(3)(A) of Title 10 U.S.C., which states that for programs where "the USD(A&S) has designated [an alternate MDA],... the Secretary of the Military Department concerned, or designee," may request reversion back to the SAE. Additionally, it is mentioned in Section 2430(d)(2), stating: "The service acquisition executive (SAE)...will review ACAT IB programs unless otherwise specified." Therefore, for an ACAT ID program specifically designated as such by the USD(A&S) and not delegated elsewhere within DoD Policy, either the DAE or a designee would typically have decision authority.
Sources: 5000.01, 5000.02, 5000.82, 5000.85
GPU acceleration is used automatically when available; on CPU-only systems, query latency can be frustrating even at the lowest-end configuration.
Model | Parameters | RAM Usage | Speed | Quality | Best For |
---|---|---|---|---|---|
TinyLlama 1.1B Chat Q4 | 1.1B | 2-4GB | Fastest | Basic | Minimum systems |
Llama 3.2 1B Instruct Q4 | 1B | 4-6GB | Very Fast | Good | Low-end systems |
Llama 3.2 3B Instruct Q4 | 3.2B | 6-8GB | Fast | Very Good | Standard systems |
Gemma2 9B Instruct Q4 | 9B | 8-12GB | Medium | Excellent | High-end systems |
Phi4 Mini | 14B | 12GB+ | Slower | Best | Premium systems |
Model | Quality Score | Speed | RAM Usage | Best For |
---|---|---|---|---|
all-MiniLM-L6-v2 | 81.3 | Fastest | 200MB | Fast processing |
all-mpnet-base-v2 | 84.8 | Fast | 800MB | Policy documents |
mixedbread-ai/mxbai-embed-large-v1 | 87.2 | Medium | 1200MB | Best retrieval |
For advanced users or automation, CLI commands are available: [GitHub link]
# Check hardware compatibility
python3 -m src.cli --info
# Load single directory
python3 -m src.cli --load-docs --doc-dirs policies/test
# Load multiple directories (full pipeline)
python3 -m src.cli --load-docs --doc-dirs policies/dodd policies/dodi policies/dodm
# Show detailed processing information (verbose mode)
python3 -m src.cli --load-docs --doc-dirs policies/dodd policies/dodi policies/dodm --verbose
# Query from command line
python3 -m src.cli --query "What are the requirements for security clearances?"
# Query with cross-encoder reranking for improved accuracy
python3 -m src.cli --query "Who has ACAT delegation authority?" --xencode
# Use specific embedding model
python3 -m src.cli --query "your question" --embedding-model mixedbread-ai/mxbai-embed-large-v1
# Use any Ollama-supported model
python3 -m src.cli --query "your question" --model-name any-ollama-model
# Show retrieved context chunks with query response
python3 -m src.cli --query "your question" --show-context
Note: The CLI supports any model available in Ollama. The GUI focuses on the recommended models for optimal user experience, but CLI users can specify any model name that Ollama supports.
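Any model name you pass ultimately maps to Ollama's standard generate endpoint. The sketch below shows that call made directly over the REST API; the ask_ollama helper and the example model name are illustrative, not part of IRIS:

```python
import requests

def ask_ollama(prompt: str, model: str = "llama3.2:3b-instruct-q4_K_M") -> str:
    """Send a single non-streaming prompt to a local Ollama model."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_ollama("Summarize the purpose of DoDD 5000.01 in one sentence."))
```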
IRIS uses a centralized config.yml file for all system settings, making customization easy and consistent across the entire application.
Both LLM and embedding models are configured in config.yml:
# LLM Models (automatically detected by GUI)
llm:
models:
"llama3.2:1b-instruct-q4_K_M":
tier: 1
max_tokens: 400
temperature: 0.7
context_window: 4096
prompt_style: "instruct"
# Embedding Models (automatically detected by GUI)
embedding_models:
"all-MiniLM-L6-v2":
tier: 0
ram_usage_mb: 200
quality_score: 81.3
best_for: "Fast processing"
Configure cross-encoder settings for improved retrieval quality:
retrieval:
cross_encoder:
model_name: "cross-encoder/ms-marco-MiniLM-L6-v2"
rerank_top_k: 20 # Candidates to rerank
final_top_k: 10 # Final results returned
batch_size: 8 # Inference batch size
max_length: 512 # Max tokens per query+passage pair
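For reference, a minimal sketch of the two-stage rerank using the configured cross-encoder via sentence-transformers; the query and candidate passages are made up, and in IRIS the candidates come from the vector search (rerank_top_k of them):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", max_length=512)

query = "Who is the decision authority for ACAT ID programs?"
candidates = [
    "The DAE serves as the MDA for ACAT ID programs unless delegated.",
    "Travel vouchers must be submitted within five working days.",
]
# Score each (query, passage) pair, then keep the highest-scoring results.
scores = reranker.predict([(query, p) for p in candidates], batch_size=8)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in reranked[:10]:  # final_top_k
    print(f"{score:.3f}  {passage}")
```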
Customize how the system instructs LLMs to respond:
prompts:
simple: # For TinyLlama models
template: |
Context from DOD policies:
{context}
Question: {question}
Instructions: Answer based ONLY on the context above...
instruct: # For Llama 3.2 models
template: |
### System:
You are an expert assistant for Department of Defense policy questions...
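At query time a template is filled with the retrieved context and the user's question. A minimal sketch, assuming the templates expose {context} and {question} placeholders as the simple template above does:

```python
import yaml

with open("config.yml", "r", encoding="utf-8") as fh:
    prompts = yaml.safe_load(fh)["prompts"]

# Fill the placeholders with retrieved context and the user's question.
final_prompt = prompts["simple"]["template"].format(
    context="DoDI 5000.85: The MDA for ACAT ID programs is the DAE...",
    question="Who approves ACAT ID milestone decisions?",
)
print(final_prompt)
```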
Key benefits:
- Easy experimentation: Test different prompt strategies without code changes
- Model-specific optimization: Different templates for different model families
- Version control: Prompt changes are tracked in git
- Quick iteration: Modify prompts and test immediately
The config.yml file controls all major system parameters:
# Model behavior
llm:
default_context_window: 4096 # Context length for all models
fallback_temperature: 0.7 # Response randomness
# Document processing
document_processing:
default_chunk_size: 500 # Words per document chunk
default_chunk_overlap: 50 # Overlap between chunks
# Document-aware ranking with adjacency boosting
ranking:
same_document_boost: 1.5 # Same filename
adjacent_chunk_boost: 1.3 # Adjacent chunks (±1)
near_chunk_boost: 1.15 # Near chunks (±2)
nearby_chunk_boost: 1.05 # Nearby chunks (±3-5)
# Hardware detection
# (automatically configured based on system capabilities)
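A minimal sketch of how boosts like these can be applied to raw similarity scores; apply_boosts is illustrative, not the exact logic in the retrieval pipeline:

```python
RANKING = {
    "same_document_boost": 1.5,
    "adjacent_chunk_boost": 1.3,
    "near_chunk_boost": 1.15,
    "nearby_chunk_boost": 1.05,
}

def apply_boosts(score: float, same_doc: bool, chunk_distance: int | None) -> float:
    """Boost a similarity score based on document match and chunk proximity."""
    if same_doc:
        score *= RANKING["same_document_boost"]
    if chunk_distance is not None:
        if chunk_distance <= 1:
            score *= RANKING["adjacent_chunk_boost"]
        elif chunk_distance <= 2:
            score *= RANKING["near_chunk_boost"]
        elif chunk_distance <= 5:
            score *= RANKING["nearby_chunk_boost"]
    return score

print(apply_boosts(0.62, same_doc=True, chunk_distance=1))  # 0.62 * 1.5 * 1.3
```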
For development and testing, you can split document processing into two phases:
Phase 1: Process PDFs (slow, ~10+ minutes for full dataset)
# Process all documents and save to intermediate JSON
python3 -m src.cli --load-docs --doc-dirs policies/dodd policies/dodi policies/dodm --save-intermediate processed_docs.json
Phase 2: Generate Embeddings (fast, ~2-10 minutes per model)
# Test different embedding models quickly
python3 -m src.cli --load-intermediate processed_docs.json --embedding-model all-MiniLM-L6-v2
python3 -m src.cli --load-intermediate processed_docs.json --embedding-model all-mpnet-base-v2
python3 -m src.cli --load-intermediate processed_docs.json --embedding-model mixedbread-ai/mxbai-embed-large-v1
Benefits:
- Time Savings: PDF processing happens only once, then test multiple embedding models quickly
- Development Efficiency: Iterate on embedding models without re-processing PDFs
- Debugging: Inspect intermediate JSON to understand document processing results
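Conceptually, the intermediate file lets you extract once and embed many times. A minimal sketch, assuming processed_docs.json holds a list of chunk records with a "text" field (inspect the file for the real schema):

```python
import json
from sentence_transformers import SentenceTransformer

# Phase 1 output, written once by --save-intermediate.
with open("processed_docs.json", "r", encoding="utf-8") as fh:
    docs = json.load(fh)
chunks = [d["text"] for d in docs]

# Phase 2: re-embed cheaply with any model under test.
for model_name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    embeddings = SentenceTransformer(model_name).encode(chunks, show_progress_bar=True)
    print(model_name, embeddings.shape)
```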
src/
├── documents.py # PDF processing, text extraction, and embedding generation
├── vectorstore.py # ChromaDB vector database with similarity search
├── rag.py # Complete RAG pipeline with document retrieval
├── llm.py # LLM integration using Ollama with configurable prompts
├── hardware.py # Hardware detection and model recommendation
├── embedding_models.py # Embedding model utilities
├── config.py # Configuration management (YAML-based) with prompt templates
├── cli.py # Command-line interface
├── error_utils.py # Error handling utilities
├── logging_utils.py # Logging configuration
└── __init__.py # Package initialization
gui/
├── app.py # Flask web interface with SSE streaming and task management
├── templates/
│ └── index.html # Main web interface with terminal display
└── static/
└── style.css # CSS with terminal styling and responsive design
database/ # ChromaDB vector database (auto-created)
├── chroma.sqlite3 # ChromaDB metadata and collections
├── documents.sqlite3 # Document chunks (shared between embeddings)
└── [uuid-dirs]/ # Collection data (binary files)
(policies/) # DOD document collection (downloaded separately)
├── (dodd/) # DOD Directives (259 files)
├── (dodi/) # DOD Instructions (638 files)
├── (dodm/) # DOD Manuals (112 files)
└── (test/) # Test subset (25 files)
tests/ # Test suite
├── test_documents.py # Document processing tests
├── test_hardware.py # Hardware detection tests
├── test_llm.py # LLM integration tests
├── test_rag.py # RAG functionality tests
└── test_vectorstore.py # Vector database tests
config.yml # YAML configuration file with model settings and prompt templates
start_gui.py # GUI launcher script
Makefile # Build and setup automation
- Folder Selection → Browser dialog → Selected paths → Backend validation
- Document Processing → PDF extraction → DOD number extraction → Text chunks → Live terminal output → Embeddings → Model-specific Vector DB
- Real-time Streaming → Server-Sent Events → Terminal display → Progress feedback
- Query Processing → Embedding → Similarity search → DOD context-aware ranking → Context retrieval → LLM response
- GUI Management → Status monitoring → Model hot-loading → Server control
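The similarity-search step above runs against the persistent ChromaDB store in database/. A minimal sketch of that lookup; the collection name and metadata field are illustrative, not the exact IRIS schema:

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="database")
collection = client.get_or_create_collection("dod_policies_minilm")

# Embed the query with the same model used to build the collection.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = embedder.encode(["Who has ACAT delegation authority?"]).tolist()

results = collection.query(query_embeddings=query_vec, n_results=10)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta.get("dod_number"), doc[:80])
```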
- Step 1: PDF processing extracts and cleans text, creates chunks, identifies DOD numbers
- Step 2: Optional intermediate JSON save enables faster iteration and debugging
- Step 3: Embedding generation converts text chunks to vectors for semantic search
- Step 4: Vector database storage in model-specific ChromaDB files with DOD metadata
- pdfplumber table detection and vector generation both take significant time, especially with the more advanced embedding models. Fortunately, this only needs to be done once.
- Number Extraction: Robust extraction of DOD directive numbers (DODD/DODI/DODM) from filenames and content
- Context-Aware Ranking: Related policies automatically cluster together in search results
- Hierarchical Understanding: System recognizes DOD's logical document organization (major groups, subgroups)
- Smart Boosting: Documents in the same series receive similarity boosts, plus adjacent chunks within the same document receive multiplicative proximity boosts
- Adjacency Boosting: Sequential chunks (before/after relevant content) are boosted to provide better context and narrative flow
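An illustrative sketch of directive-number extraction from a filename; the regex and helper are assumptions, not the exact implementation in src/documents.py:

```python
import re

# Matches patterns like "DoDI 5000.02" or a bare "5000.02".
DOD_NUMBER = re.compile(r"\b(?:DoD[DIM]?\s*)?(\d{4}\.\d{2})\b", re.IGNORECASE)

def extract_dod_number(filename: str) -> str | None:
    match = DOD_NUMBER.search(filename)
    return match.group(1) if match else None

print(extract_dod_number("DoDI 5000.02.pdf"))  # 5000.02
print(extract_dod_number("500085p.pdf"))       # None -- fall back to scanning content
```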
System Configuration | Document Extraction | all-MiniLM-L6-v2 | all-mpnet-base-v2 | mixedbread-ai/mxbai-embed-large-v1 |
---|---|---|---|---|
M4 Mac Mini 10 core CPU/GPU, 16GB RAM | X | 1m 42s | 13m 42s | 49m 16s |
AMD 9800X3D 64GB RAM, Pop!_OS 22.04 | 4m 59s | 3m 17s | 42m 49s | (chose not to run) |
M4, 10 core CPU/GPU | tinyllama:1.1b-chat-v1-q4_K_M | llama3.2:1b-instruct-q4_K_M | llama3.2:3b-instruct-q4_K_M | gemma2:9b-instruct-q4_K_M | phi4-mini:latest |
---|---|---|---|---|---|
all-MiniLM-L6-v2 | 6.3s | 11.4s | 11.2s | 51.1s | 33.8s |
all-mpnet-base-v2 | 5.6s | 6.5s | 11.3s | 62.7s | 32.9s |
mixedbread-ai/mxbai-embed-large-v1 | 8.1s† | 7.2s† | 21.0s† | 58.0s†* | 44.1s† |
9800X3D/6950XT | tinyllama:1.1b-chat-v1-q4_K_M | llama3.2:1b-instruct-q4_K_M | llama3.2:3b-instruct-q4_K_M | gemma2:9b-instruct-q4_K_M | phi4-mini:latest |
---|---|---|---|---|---|
all-MiniLM-L6-v2 | 2.6s | 3.6s | 6.1s | 12.7s | 14.0s |
all-mpnet-base-v2 | 2.5s | 3.8s | 6.2s | 15.2s† | 12.1s†* |
mixedbread-ai/mxbai-embed-large-v1 | 4.5s† | 4.6s† | 8.4s† | 14.1s | 11.4s |
9800X3D (no GPU) | tinyllama:1.1b-chat-v1-q4_K_M | llama3.2:1b-instruct-q4_K_M | llama3.2:3b-instruct-q4_K_M | gemma2:9b-instruct-q4_K_M | phi4-mini:latest |
---|---|---|---|---|---|
all-MiniLM-L6-v2 | 7.8s | 2.6s | 27.2s | 132.9s | 64.1s |
all-mpnet-base-v2 | 8.3s† | 10.5s | 20.4s | 148.7s†* | 76.1s |
mixedbread-ai/mxbai-embed-large-v1 | 9.0s | 14.8s† | 38.0s† | 133.7s | 73.7s† |
† = best per model; * = best overall
Windows 10 was tested for installation and verification, not benchmarking; I only have Windows VMs available and do not run Windows on bare metal.
- ✅ Modern Web Interface: Complete GUI with intuitive folder selection and responsive design
- ✅ Live Terminal Output: Real-time streaming of document processing with terminal-style display
- ✅ Flexible Document Processing: Support for any PDF folders with live progress feedback
- ✅ Server-Sent Events: Streaming command output via SSE for real-time user feedback
- ✅ Automatic Model Management: Models load automatically when selected with status indicators
- ✅ Ollama Server Control: Start/stop server directly from GUI with real-time status
- ✅ Multiple Embedding Models: Process documents with different models and instant database switching
- ✅ Hardware Detection: Automatic recommendations with manual override options
- Web Backend Performance Refactor: Hybrid architecture combining CLI power with web performance
- Windows installer with model downloading
- MacOS installer and distribution optimization
Apache License 2.0
Dr. Christopher Waters - [email protected] - Github account