magpy

A powerful Python library for extracting structured data from unstructured text using Large Language Models (LLMs). magpy provides a simple, flexible interface for converting free-form text into structured, machine-readable data formats.

Installation

pip install git+https://github.com/Autonomy-Data-Unit/magpy

Run the following commands in your terminal to set up a test environment where you can try out magpy's features:

mkdir magpy-test # Create a folder to test magpy in
cd magpy-test
# Download example notebooks
curl -O https://raw.githubusercontent.com/Autonomy-Data-Unit/magpy/refs/heads/main/nbs/examples/extract/00_data_extraction_from_text.ipynb
curl -O https://raw.githubusercontent.com/Autonomy-Data-Unit/magpy/refs/heads/main/nbs/examples/extract/01_pdfs_to_csv.ipynb
# Download example PDF files
mkdir charity_accounts
curl -o "charity_accounts/3966635 2023-12-31 THE DAVID SNOWDON TRUST.pdf" https://raw.githubusercontent.com/Autonomy-Data-Unit/magpy/refs/heads/main/nbs/examples/extract/charity_accounts/3966635%202023-12-31%20THE%20DAVID%20SNOWDON%20TRUST.pdf
curl -o "charity_accounts/4028313 2020-12-31 CATS IN CRISIS.pdf" https://raw.githubusercontent.com/Autonomy-Data-Unit/magpy/refs/heads/main/nbs/examples/extract/charity_accounts/4028313%202020-12-31%20CATS%20IN%20CRISIS.pdf
curl -o "charity_accounts/5211224 2023-12-31 ROYAL ENGINEERS HEADQUARTER MESS.pdf" https://raw.githubusercontent.com/Autonomy-Data-Unit/magpy/refs/heads/main/nbs/examples/extract/charity_accounts/5211224%202023-12-31%20ROYAL%20ENGINEERS%20HEADQUARTER%20MESS.pdf
# Create a virtual environment and install dependencies
python -m venv venv
source ./venv/bin/activate
pip install git+https://github.com/Autonomy-Data-Unit/magpy
pip install jupyterlab
# Create a .env file in the directory
echo "OPENAI_API_KEY=" > .env
# Run jupyterlab
jupyter lab

Note: To run the notebooks, you must first register for an API key with an LLM provider. The example scripts use OpenAI's models, so you'll need an API key from https://openai.com/api/, which you should paste into the .env file created by the commands above. Because Jupyter does not display hidden files by default, you cannot edit .env from within Jupyter Lab; on macOS you can open it by running open .env in the terminal.
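
If you want to load the key from .env in your own scripts outside the notebooks, here is a minimal sketch using the python-dotenv package (an assumption; the example notebooks may handle this differently):

from dotenv import load_dotenv  # requires: pip install python-dotenv
import os

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the current directory
api_key = os.getenv("OPENAI_API_KEY")
print("API key loaded:", bool(api_key))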

Quick Start

For detailed examples see the examples folder.

Basic Usage

from magpy import extract_structured, set_magpy_config
from datetime import datetime
import os

# Configure the LLM
set_magpy_config(
    api_key=os.getenv("OPENAI_API_KEY"), 
    model_name="gpt-4o",    
    temperature=0.1,
    cache_path='.cache'
)

# Define your extraction schema
schema = {
    "name": str,
    "amount": int,
    "date": datetime,
    "category": str
}

# Extract structured data from text
text = "John Doe donated $500 to the charity on 2024-01-15 for education programs."
result = extract_structured(text=text, schema=schema)
print(result)
# Output: {'name': 'John Doe', 'amount': 500, 'date': datetime(2024, 1, 15), 'category': 'education'}
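
The same schema can be reused across multiple documents. Below is a rough sketch of batching several snippets into a CSV file, reusing only the extract_structured call shown above (the input texts and output file name are illustrative):

import csv

texts = [
    "John Doe donated $500 to the charity on 2024-01-15 for education programs.",
    "Jane Smith gave $1,200 to the charity on 2024-02-03 for housing support.",
]

# Extract one structured record per snippet using the same schema
rows = [extract_structured(text=t, schema=schema) for t in texts]

# Write the records to a CSV; the schema keys double as column names
with open("donations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(schema.keys()))
    writer.writeheader()
    writer.writerows(rows)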

Contributing

To contribute to the development of the package, follow the instructions below.

Prerequisites

  • Install uv.
  • Install direnv to automatically load the project virtual environment when you enter the repository folder.
    • Mac: brew install direnv
    • Linux: curl -sfL https://direnv.net/install.sh | bash

Setting up the environment

Run the following:

# In the root of the repo folder
uv sync --all-extras # Installs the virtual environment at './.venv'
direnv allow # Allows the automatic running of the script './.envrc'
nbl install-hooks # Installs git hooks that ensure notebooks are added properly

You are now set up to develop the codebase.

Further instructions:

  • To export notebooks, run nbl export.
  • To clean notebooks, run nbl clean.
  • To see other available commands, run just nbl.
  • To add a new dependency, run uv add package-name. See the uv documentation for more details.
  • You need to git add all 'twinned' notebooks for the commit to be validated by the git hook. For example, if you add nbs/my-nb.ipynb, you must also add pts/my-nb.pct.py.
  • To render the documentation, run nbl render-docs. To preview it, run nbl preview-docs.
  • To upgrade all dependencies, run uv sync --upgrade --all-extras.

Support

For questions, issues, or contributions, please open an issue on the GitHub repository.
