# AutoRAG

A Python application for automatically building vector databases from document collections. This system processes documents, generates embeddings using OpenAI, and stores them in Pinecone for retrieval and RAG (Retrieval-Augmented Generation) applications.
## Features

- Processes `.txt`, `.md`, `.mdx`, and `.markdown` files from a data directory
- Chunks documents into ~750-character segments
- Generates embeddings using OpenAI's `text-embedding-3-small` model
- Stores embeddings in a Pinecone vector database
- Saves processed data to a local JSON file
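The chunking step can be sketched as a simple paragraph-aware splitter. This is illustrative only: `chunk_text` is a hypothetical name, and the actual logic lives in `scripts/chunker.py`, which may choose different break points.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 750) -> List[str]:
    """Split text into chunks of at most max_chars characters,
    preferring to break on paragraph boundaries (a sketch, not
    the repo's actual chunker)."""
    chunks = []
    current = ""
    for para in text.split("\n\n"):
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        # A single oversized paragraph is split by hard cuts.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```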
## Prerequisites

- Python 3.7+
- OpenAI API key
- Pinecone API key
## Installation

1. Clone this repository:

   ```bash
   git clone [repository-url]
   cd AutoRAG
   ```

2. Create a virtual environment and activate it:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file based on `.env.example`:

   ```bash
   cp .env.example .env
   ```

5. Add your API keys to the `.env` file:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   PINECONE_API_KEY=your_pinecone_api_key_here
   ```
## Usage

Place your prompt engineering documents in the `data/raw/` directory. The system supports `.txt`, `.md`, `.mdx`, and `.markdown` files.
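Discovering the supported files can be sketched with a small `pathlib` helper. The function name `find_documents` is illustrative; the repo's own discovery logic may differ.

```python
from pathlib import Path
from typing import List

# Extensions the README lists as supported.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".mdx", ".markdown"}


def find_documents(raw_dir: str = "data/raw") -> List[Path]:
    """Recursively collect files whose suffix is a supported extension."""
    return sorted(
        p for p in Path(raw_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```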
Run the main script to process documents and create the vector database:

```bash
python3 main.py
```
You can automatically fetch markdown files from a GitHub repository:

```bash
python3 main.py --github owner/repo-name
```

If the repository is private, you'll need a GitHub personal access token:

```bash
python3 main.py --github owner/repo-name --token your_github_token
```

To use previously fetched files without downloading them again:

```bash
python3 main.py --github owner/repo-name --skip-fetch
```

You can specify a custom Pinecone index name:

```bash
python3 main.py --index-name my-custom-index
```
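The command-line interface described above could be defined with `argparse` roughly as follows. This is a sketch of the flags as documented, not the exact parser in `main.py`:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the flags the README documents."""
    parser = argparse.ArgumentParser(
        description="Build a vector database from a document collection."
    )
    parser.add_argument("--github", metavar="OWNER/REPO",
                        help="Fetch markdown files from this GitHub repository")
    parser.add_argument("--token",
                        help="GitHub personal access token for private repositories")
    parser.add_argument("--skip-fetch", action="store_true",
                        help="Reuse previously fetched files instead of downloading again")
    parser.add_argument("--index-name", default="prompt-feedback",
                        help="Pinecone index name")
    return parser
```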
The script will:

1. (Optional) Fetch markdown documents from GitHub if requested
2. Read documents from `data/raw/`
3. Chunk the documents into ~750-character segments
4. Generate embeddings for each chunk
5. Save the embeddings to `data/processed/embedded.json`
6. Initialize a Pinecone index (default name: `prompt-feedback`)
7. Upload the embedded chunks to Pinecone
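The save step (step 5) can be sketched as pairing each chunk with its embedding and writing one JSON file. The record schema shown here is an assumption for illustration; the actual format written by the pipeline may differ:

```python
import json
from pathlib import Path
from typing import List


def save_embedded(chunks: List[str], embeddings: List[List[float]],
                  out_path: str = "data/processed/embedded.json") -> None:
    """Write (chunk, embedding) pairs as a JSON array.
    The {"id", "text", "embedding"} schema is illustrative only."""
    records = [
        {"id": f"chunk-{i}", "text": text, "embedding": vector}
        for i, (text, vector) in enumerate(zip(chunks, embeddings))
    ]
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2))
```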
## Project Structure

```
AutoRAG/
├── data/
│   ├── raw/               # Place your documents here
│   └── processed/         # Output directory for processed data
├── scripts/
│   ├── __init__.py        # Package initialization
│   ├── chunker.py         # Functions for chunking documents
│   ├── embedder.py        # Functions for generating embeddings
│   ├── github_fetcher.py  # Functions for fetching markdown files from GitHub
│   └── uploader.py        # Functions for uploading to Pinecone
├── .env                   # Environment variables (create from .env.example)
├── .env.example           # Example environment file
├── .gitignore             # Git ignore file
├── main.py                # Main script
├── README.md              # This file
└── requirements.txt       # Project dependencies
```
## Components

- **Chunker**: Splits documents into chunks of approximately 750 characters
- **Embedder**: Generates embeddings for text chunks using OpenAI
- **Uploader**: Manages Pinecone database initialization and uploads
- **GitHub Fetcher**: Clones GitHub repositories and extracts markdown files for processing
## License

[Specify the license here]