- Start weaviate docker container
docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.27.2
- Set environment variable with Anthropic API Key
export ANTHROPIC_API_KEY='...'
Oh no, the old coach has moved out of town! You have taken over the team.
The players know their positions in the field, but the coach didn't leave behind a batting order.
Luckily we know the players on the team and their hitting stats from the previous season.
Let's calculate the OPS for each player and use the top 9 batters in our lineup!
poetry run make-lineup
Pandas is a library for working with data. The data is stored in data frames (2-dimensional like a spreadsheet) and it is easy to manipulate columns and rows using this library.
scikit-learn is a very popular machine learning library in python.
XGBoost is a machine learning library that implements Gradient Boosting. It is particularly useful for predicting column values in tabular data (2-dimensional).
Gradient Boosting
is a machine learning
technique that works well for making predictions from tabular data. XGBoost makes predicions based on decision trees.
You can make predictions that are:
Classification
A choice from a set of pre-defined categories. Useful for predicting something is a, b, or c.
Regression
A numeric value. Useful for predicting how strong the relationship is or how likely something is.
Note
Our example is a regression example because we predicted the OPS, a numeric value.
Tip
Collect data that includes the values you would like to predict. It is required for training your model.
Tip
Be prepared to clean your data and normalize your data if necessary.
I haven't coached before, so I am not good at calling pitches. What if AI could learn to call pitches for me?
First we need to train the AI to learn which pitches to call based on the game situation.
poetry run learn-to-pitch
Now let's use the model to simulate a game and see what it learned.
poetry run practice-calling-pitches
Python library for scientific computing. It has N-dimensional arrays and numerical computing functions.
Provides an API for
reinforcement learning
. It provides some referenceenvironments
(typically video games) but we did not use any. The older library was calledgym
.
A library of reinforcement learning algorithms. It uses
PyTorch
. We used theProximal Policy Optimization
(PPO
) algorithm forDeep Reinforcement Learning
(DRL
).
Library for building
deep learning
models. Typically used forcomputer vision
(akaimage recognition
) andnatural language processing
(NLP
).
Use Deep Reinforcement Learning
to create models that learn which action to take based on a reward function.
Tip
Too much training can overfit your model. Insufficient training can underfit your model which means that the predictions it makes will not be very good.
Finally, it is game day. But oh no, one of our hitters can't make it to the game. Let's find the hitter on the bench that is the most similar so we can put them in the game!
poetry run recommend-similar-hitter
Library for computing
embeddings
for text and images. It can also calculate similarity scores. It is useful for semantic search.
We compute embeddings
for the various hitters. An embedding is a numerical representation of the data. Typically they are vectors
.
We use cosine similarity
(which is the cosine angle between vectors) to measure the similarity between two vectors. We store the similarities in a 2-dimensional similarity matrix
(Pandas DataFrame).
This allows us to rank the other hitters by how similar they are to the hitter we are looking at.
Earlier we developed a DRL model to help determine what the best pitch was to throw given the current state of the game.
Let's use that now to tell us which pitches to call in game!
poetry run call-pitches
Framework for building
Large Language Model
(LLM
) applications
Cloud based AI with an API. Competitor to OpenAI and ChatGPT. We use it because it is much cheaper and sufficiently capable for our needs.
Important
This is the only commercial product we use in this project. You will need a paid account and an API key to run this demo.
Agentic workflows
offer an excellent balance of AI based decision making with some structure for the overall framework of what should be happening.
Tip
You can use function calls, API calls, AI models and much more as steps in your workflow.
Tip
Do not make calls to LLM's that are too complicated or you will get results that will not be easy to handle. We recommend having the AI do things like choosing from a small discrete set of options or having the API parse human provided text to provide text that is simpler for models to understand.
Now that our game is over, let's look at some real world AI generated game recaps.
GameChanger
(a commercial product) provides a real world example of using AI to summarize baseball games into a game recap article. The text of the game recaps was copied into the data/game_recaps
folder (with names modified).
That I like baseball. :)
poetry run summarize-game-recap
Hugging Face provides machine learning models and datasets. It is similar to GitHub for AI models and datasets.
Transformers
are aneural network
architecture that change an input sequence into an output sequence. HuggingFace provides an API for transformers that works with PyTorch, TensorFlow and JAX.
Hugging Face
is a website with models, datasets and much more provided by the community. We can use existing models with our data or use existing datasets to train our own models. We can use models for all sorts of tasks such as image recognition, audio processing, and Natural Language Processing (NLP) like our recommender
or similarity
example.
Summarization
is a common NLP task.
Tip
There are many models and datasets on HuggingFace. Be sure to experiment with several models to see which one works best for you.
Tip
New models and datasets are being added to HuggingFace all the time. Be sure to check the website for the latest models.
Tip
We have found this web page to be very helpful when trying to understand the types of models that exist on HuggingFace. Hugging Face Tasks
poetry run game-sentiment
Models exist to perform sentiment analysis
on text.
poetry run chat-about-the-season
poetry run python novalug/wvt/query.py
Vector database useful for
Retrieval-Augmented Generation
(RAG
). Weaviate offers a paid cloud version or you can run the open source database locally.
We can augment a Large Language Model
or LLM
with our own data with Retrieval-Augmented Generation (RAG)
!
Our data is stored in a vector store database
.
Caution
Consider carefully before sending any data to an third party API or cloud service.
Note
We did not cover ollama, but it is possible to run LLMs locally. You could replace Anthropic with an open source LLM run in ollama.
This planning agent will attempt to figure out what steps are necessary to solve the question asked.
poetry run python novalug/li/agents.py
Agents
are a very exciting idea. We expect big advancements in these library capabilities in the next few years.
Warning
Agents tend to be highly unreliable. While they are a very exciting concept, they frequently fail to run.
poetry run streamlit run novalug/sl/site.py
Library for creating websites via a python API. Useful for presenting data frames, charts, etc. with your findings.
Streamlit makes it convenient to build websites with your AI
/ML
findings. We could have also used Jupyter, Dash, or Gradio.