# AutoRAG

A Python application for automatically building vector databases from document collections. This system processes documents, generates embeddings using OpenAI, and stores them in Pinecone for retrieval and RAG (Retrieval-Augmented Generation) applications.
## Features

- Processes `.txt`, `.md`, `.mdx`, and `.markdown` files from a data directory
- Chunks documents into ~750-character segments
- Generates embeddings using OpenAI's `text-embedding-3-small` model
- Stores embeddings in a Pinecone vector database
- Saves processed data to a local JSON file
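The chunking step can be sketched as a simple paragraph-aware splitter. This is illustrative only: `chunk_text` is a hypothetical name, and the actual logic lives in `scripts/chunker.py`, which may choose different break points.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 750) -> List[str]:
    """Split text into chunks of at most max_chars characters,
    preferring to break on paragraph boundaries (a sketch, not
    the repo's actual chunker)."""
    chunks = []
    current = ""
    for para in text.split("\n\n"):
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        # A single oversized paragraph is split by hard cuts.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```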
## Prerequisites

- Python 3.7+
- OpenAI API key
- Pinecone API key
## Installation

1. Clone this repository:

   ```bash
   git clone [repository-url]
   cd AutoRAG
   ```

2. Create a virtual environment and activate it:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file based on `.env.example`:

   ```bash
   cp .env.example .env
   ```

5. Add your API keys to the `.env` file:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   PINECONE_API_KEY=your_pinecone_api_key_here
   ```
## Usage

Place your prompt engineering documents in the `data/raw/` directory. The system supports `.txt`, `.md`, `.mdx`, and `.markdown` files.
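Discovering the supported files can be sketched with a small `pathlib` helper. The function name `find_documents` is illustrative; the repo's own discovery logic may differ.

```python
from pathlib import Path
from typing import List

# Extensions the README lists as supported.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".mdx", ".markdown"}


def find_documents(raw_dir: str = "data/raw") -> List[Path]:
    """Recursively collect files whose suffix is a supported extension."""
    return sorted(
        p for p in Path(raw_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```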
Run the main script to process documents and create the vector database:

```bash
python3 main.py
```
You can automatically fetch markdown files from a GitHub repository:

```bash
python3 main.py --github owner/repo-name
```

If the repository is private, you'll need a GitHub personal access token:

```bash
python3 main.py --github owner/repo-name --token your_github_token
```

To use previously fetched files without downloading them again:

```bash
python3 main.py --github owner/repo-name --skip-fetch
```

You can specify a custom Pinecone index name:

```bash
python3 main.py --index-name my-custom-index
```
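The command-line interface described above could be defined with `argparse` roughly as follows. This is a sketch of the flags as documented, not the exact parser in `main.py`:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the flags the README documents."""
    parser = argparse.ArgumentParser(
        description="Build a vector database from a document collection."
    )
    parser.add_argument("--github", metavar="OWNER/REPO",
                        help="Fetch markdown files from this GitHub repository")
    parser.add_argument("--token",
                        help="GitHub personal access token for private repositories")
    parser.add_argument("--skip-fetch", action="store_true",
                        help="Reuse previously fetched files instead of downloading again")
    parser.add_argument("--index-name", default="prompt-feedback",
                        help="Pinecone index name")
    return parser
```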
The script will:

1. (Optional) Fetch markdown documents from GitHub if requested
2. Read documents from `data/raw/`
3. Chunk the documents into ~750-character segments
4. Generate embeddings for each chunk
5. Save the embeddings to `data/processed/embedded.json`
6. Initialize a Pinecone index (default name: `prompt-feedback`)
7. Upload the embedded chunks to Pinecone
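The save step (step 5) can be sketched as pairing each chunk with its embedding and writing one JSON file. The record schema shown here is an assumption for illustration; the actual format written by the pipeline may differ:

```python
import json
from pathlib import Path
from typing import List


def save_embedded(chunks: List[str], embeddings: List[List[float]],
                  out_path: str = "data/processed/embedded.json") -> None:
    """Write (chunk, embedding) pairs as a JSON array.
    The {"id", "text", "embedding"} schema is illustrative only."""
    records = [
        {"id": f"chunk-{i}", "text": text, "embedding": vector}
        for i, (text, vector) in enumerate(zip(chunks, embeddings))
    ]
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2))
```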
## Project Structure

```
AutoRAG/
├── data/
│   ├── raw/               # Place your documents here
│   └── processed/         # Output directory for processed data
├── scripts/
│   ├── __init__.py        # Package initialization
│   ├── chunker.py         # Functions for chunking documents
│   ├── embedder.py        # Functions for generating embeddings
│   ├── github_fetcher.py  # Functions for fetching markdown files from GitHub
│   └── uploader.py        # Functions for uploading to Pinecone
├── .env                   # Environment variables (create from .env.example)
├── .env.example           # Example environment file
├── .gitignore             # Git ignore file
├── main.py                # Main script
├── README.md              # This file
└── requirements.txt       # Project dependencies
```
## Components

- **Chunker**: Splits documents into chunks of approximately 750 characters
- **Embedder**: Generates embeddings for text chunks using OpenAI
- **Uploader**: Manages Pinecone database initialization and uploads
- **GitHub Fetcher**: Clones GitHub repositories and extracts markdown files for processing
## License

[Specify the license here]