This project demonstrates how to fine-tune one of OpenAI's key models to achieve JSON output formatting for generating fake identity data. By leveraging fine-tuning, we can get better steerability, shorter prompts, and therefore, reduced costs.
Detailed Article on This Project - A comprehensive guide on this project, its motivation, and methodology.
Often, in the development stages, there's a need to generate structured data to seed our databases, populate dashboards, etc. This project specifically focuses on generating Twitter-like user profiles in a structured format.
With the fine-tuned model, the aim is to reduce the number of tokens used in a prompt without compromising on the quality of the response. This project shows you how to:
- Prepare synthetic training data
- Format the data according to OpenAI's guidelines
- Fine-tune the model using the prepared data
- Test the fine-tuned model
-
Clone the GitHub repository.
-
Install required packages:
pip install -U -r requirements.txt
-
Include your OpenAI API key in your environment variables:
export OPENAI_API_KEY="sk-XXXXX"
Follow the instructions in the article to generate the training data, fine-tune the model, and test it.
- Detailed Article on This Project - A comprehensive guide on this project, its motivation, and methodology.
- Langchain - a popular library for language processing
- Native Function Calling Demo
-
requirements.txt
- Purpose: Lists all the required Python packages and libraries for this project.
-
prepare_data.py
- Purpose: Contains scripts to generate synthetic training data for model fine-tuning.
-
transform_data.py
- Purpose: Formats the synthetic data according to OpenAI's guidelines.
-
openai_formatting.py
- Purpose: Validates the data formatting according to OpenAI's guidelines. Counts tokens. Source.
-
finetuning.py
- Purpose: Contains scripts and instructions to fine-tune the OpenAI model with the prepared data.
-
run_model.py
- Purpose: Allows users to test the fine-tuned model by generating JSON formatted data.
-
training_examples.json
- Purpose: Output of
prepare_data.py
so you don't have to pay for generating synthetic data again.
- Purpose: Output of
For more insights, updates, and discussions, connect with me:
This project is open-sourced under the MIT License. The exemption is openai_formatting.py, which is proprietary to OpenAI.