---
description: Evaluate finetuned embeddings and compare to original base embeddings.
---

Now that we've finetuned our embeddings, we can evaluate them and compare to the
base embeddings. We have all the data saved and versioned already, and we will
reuse the same MatryoshkaLoss function for evaluation.

In code, our evaluation steps are easy to comprehend. Here, for example, is the
base model evaluation step:

```python
from typing import Dict

import torch
from datasets import DatasetDict
from sentence_transformers import SentenceTransformer
from typing_extensions import Annotated
from zenml import log_model_metadata, step

# `get_evaluator`, `EMBEDDINGS_MODEL_ID_BASELINE` and
# `EMBEDDINGS_MODEL_MATRYOSHKA_DIMS` are defined elsewhere in the project.


def evaluate_model(
    dataset: DatasetDict, model: SentenceTransformer
) -> Dict[str, float]:
    """Evaluate the given model on the dataset."""
    evaluator = get_evaluator(
        dataset=dataset,
        model=model,
    )
    return evaluator(model)


@step
def evaluate_base_model(
    dataset: DatasetDict,
) -> Annotated[Dict[str, float], "base_model_evaluation_results"]:
    """Evaluate the base model on the given dataset."""
    model = SentenceTransformer(
        EMBEDDINGS_MODEL_ID_BASELINE,
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

    results = evaluate_model(
        dataset=dataset,
        model=model,
    )

    # Convert numpy.float64 values to regular Python floats
    # (needed for serialization)
    base_model_eval = {
        f"dim_{dim}_cosine_ndcg@10": float(
            results[f"dim_{dim}_cosine_ndcg@10"]
        )
        for dim in EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
    }

    log_model_metadata(
        metadata={"base_model_eval": base_model_eval},
    )

    return results
```

We log the results for our core Matryoshka dimensions as model metadata to ZenML
within our evaluation step. This will allow us to inspect these results from
within [the Model Control Plane](https://docs.zenml.io/how-to/use-the-model-control-plane) (see
below for more details). Our results come in the form of a dictionary of string
keys and float values which will, like all step inputs and outputs, be
versioned, tracked and saved in your artifact store.
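
The `get_evaluator` helper isn't shown above, but conceptually it builds one
information-retrieval evaluator per Matryoshka dimension and chains them so that
a single call produces the `dim_{dim}_cosine_ndcg@10` scores used in the step.
Below is a minimal sketch of that idea using `sentence-transformers`; the dataset
column names, the dimension list and the unused `model` argument are assumptions
for illustration, not the project's exact implementation:

```python
from datasets import DatasetDict
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim

# Hypothetical Matryoshka dimensions; the project defines its own constant.
EMBEDDINGS_MODEL_MATRYOSHKA_DIMS = [384, 256, 128, 64]


def get_evaluator(
    dataset: DatasetDict, model: SentenceTransformer
) -> SequentialEvaluator:
    """Chain one information-retrieval evaluator per Matryoshka dimension."""
    # `model` is accepted to mirror the call in the step above, but it isn't
    # needed to construct the evaluators themselves.
    test_split = dataset["test"]
    # Assumed columns: "anchor" holds the query, "positive" the matching chunk.
    corpus = {str(i): row["positive"] for i, row in enumerate(test_split)}
    queries = {str(i): row["anchor"] for i, row in enumerate(test_split)}
    relevant_docs = {qid: {qid} for qid in queries}  # each query maps to its own chunk

    evaluators = [
        InformationRetrievalEvaluator(
            queries=queries,
            corpus=corpus,
            relevant_docs=relevant_docs,
            name=f"dim_{dim}",  # yields keys such as "dim_384_cosine_ndcg@10"
            truncate_dim=dim,  # score embeddings truncated to this dimension
            score_functions={"cosine": cos_sim},
        )
        for dim in EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
    ]
    return SequentialEvaluator(evaluators)
```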

## Visualizing results

It's possible to visualize results in a few different ways in ZenML, but one
easy option is to output your chart as a `PIL.Image` object. (See our
[documentation on more ways to visualize your
results](../../../how-to/visualize-artifacts/README.md).) The rest of the
implementation of our `visualize_results` step is just simple `matplotlib` code
that plots the base model evaluation against the finetuned model evaluation. We
represent the results as percentage values and horizontally stack the two sets
to make comparison a little easier.
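
The full step isn't reproduced here, but a rough sketch of the approach might
look like the following; the step signature, the input dictionaries and the
chart styling are illustrative assumptions rather than the project's exact code:

```python
import io
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from typing_extensions import Annotated
from zenml import step


@step
def visualize_results(
    base_results: Dict[str, float], finetuned_results: Dict[str, float]
) -> Annotated[Image.Image, "evaluation_chart"]:
    """Plot base vs. finetuned NDCG@10 as horizontal bars and return a PIL image."""
    # Keys look like "dim_384_cosine_ndcg@10"; extract the dimension for labelling.
    dims = sorted((int(key.split("_")[1]) for key in base_results), reverse=True)
    base = [base_results[f"dim_{d}_cosine_ndcg@10"] * 100 for d in dims]
    tuned = [finetuned_results[f"dim_{d}_cosine_ndcg@10"] * 100 for d in dims]

    y = np.arange(len(dims))
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(y - 0.2, base, height=0.4, label="base")
    ax.barh(y + 0.2, tuned, height=0.4, label="finetuned")
    ax.set_yticks(y)
    ax.set_yticklabels([f"dim {d}" for d in dims])
    ax.set_xlabel("cosine NDCG@10 (%)")
    ax.legend()
    fig.tight_layout()

    # Render the figure to an in-memory PNG and hand it back as a PIL image so
    # ZenML can store and display it as an image artifact.
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)
    buffer.seek(0)
    image = Image.open(buffer)
    image.load()  # read the buffer fully before it goes out of scope
    return image
```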

We can see that our finetuned embeddings have improved the recall of our
retrieval system across all of the dimensions, but the results are still not
amazing. In a production setting, we would likely want to focus on improving the
data being used for the embeddings training. In particular, we could consider
stripping out some of the logs output from the documentation, and perhaps omit
some pages which offer low signal for the retrieval task. This embeddings
finetuning was run purely on the full set of synthetic data generated by
`distilabel` and `gpt-4o`, so we wouldn't necessarily expect to see huge
improvements out of the box, especially when the underlying data chunks are
complex and contain multiple topics.

## Model Control Plane as unified interface

Once all our pipelines are finished running, the best place to inspect our
results as well as the artifacts and models we generated is the Model Control
Plane.

The interface is split into sections that correspond to:

- the artifacts generated by our steps
- the models generated by our steps
- the metadata logged by our steps
- (potentially) any deployments of models made, though we didn't use this in
  this guide so far
- any pipeline runs associated with this 'Model'

We can easily see which artifact or technical model versions are the latest, as
well as compare the actual values of our evals or inspect the hardware or
hyperparameters used for training.
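
If you'd rather pull these numbers programmatically than read them off the
dashboard, something along the following lines works with the ZenML Python
client; the model name and the `finetuned_model_eval` key are assumptions for
illustration, and the exact shape of the metadata response can vary a little
between ZenML versions:

```python
from zenml.client import Client

# Hypothetical model name; use whatever your pipelines register in the
# Model Control Plane. Passing "latest" fetches the most recent version.
model_version = Client().get_model_version(
    model_name_or_id="finetuned-zenml-docs-embeddings",
    model_version_name_or_number_or_id="latest",
)

# Metadata logged with `log_model_metadata` is attached to the model version.
# (Depending on your ZenML version you may need `.value` to unwrap each entry.)
base_eval = model_version.run_metadata["base_model_eval"]
finetuned_eval = model_version.run_metadata["finetuned_model_eval"]

for key in sorted(base_eval):
    print(f"{key}: base={base_eval[key]:.3f} -> finetuned={finetuned_eval[key]:.3f}")
```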

This one-stop-shop interface is available on ZenML Pro and you can learn more
about it in the [Model Control Plane
documentation](https://docs.zenml.io/how-to/use-the-model-control-plane).

## Next Steps

Now that we've finetuned our embeddings and evaluated them, once they're in good
shape for use we could bring them into [the original RAG pipeline](../rag/basic-rag-inference-pipeline.md),
regenerate a new series of embeddings for our data, and then rerun our RAG
retrieval evaluations to see how they've improved in our hand-crafted and
LLM-powered evaluations.

The next section will cover [LLM finetuning and deployment](../finetuning-llms/finetuning-llms.md) as the
final part of our LLMOps guide. (This section is currently still a work in
progress, but if you're eager to try out LLM finetuning with ZenML, you can use
[our LoRA
project](https://github.com/zenml-io/zenml-projects/blob/main/llm-lora-finetuning/README.md)
to get started. We also have [a
blogpost](https://www.zenml.io/blog/how-to-finetune-llama-3-1-with-zenml) that takes you through
all the steps you need to finetune Llama 3.1 using GCP's Vertex AI with ZenML,
including one-click stack creation!)

To try out the two pipelines, please follow the instructions in [the project
repository README](https://github.com/zenml-io/zenml-projects/blob/main/llm-complete-guide/README.md),
and you can find the full code in that same directory.

<!-- For scarf -->
<figure><img alt="ZenML Scarf" referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" /></figure>