Build a fine-tuned Large Language Model for your robot
Let's learn how to build a chatbot for your hobby or business with Llama2
In this tutorial, we continue our theme of how robotics (and pretty much everything around us) is going to be transformed by Generative AI. The world of Generative AI took yet another significant leap last week when Meta released the Llama2 family of models under a license that permits commercial use. This means you can build a customized Generative AI model tailored to your use case. We are going to show you how.
First, let us explore why Llama2 is such a big deal.
What’s big about Llama2?
Llama2 is the first openly released model of its caliber whose license permits commercial use (with only a few conditions, such as restrictions on very large-scale services, that do not affect most businesses). Prior to Llama2, the main way you could commercially use a Large Language Model (henceforth referred to as LLM) was through services from OpenAI. OpenAI offers APIs to run inference queries on its pre-trained models (such as GPT-3, GPT-3.5, and GPT-4). It also offers APIs to fine-tune a pre-trained model on your specific data, and then run inference queries on the fine-tuned model. The latter is obviously the more expensive option. APIs like OpenAI's limit you in two important ways:
1. You cannot use the model without Internet access. This makes it impossible to run inference on a closed device, such as a robot, that may be disconnected from the Internet.
2. You need to send your private data to OpenAI. Depending on how sensitive your data is, you risk the data leaking outside your control (because you no longer have control over it).
With Llama2, you have access to an open model that you can use and modify in any way, without Meta charging you any royalty. This kind of access opens up vast opportunities for using a top-quality LLM, especially in the field of robotics. You have the freedom to build a customized version of Llama2 in house and deploy it for your use case without revealing any details to a third party.
Why can’t you use Llama2 directly?
Llama2 comes in three sizes, measured by the number of parameters in the model: 7 Billion, 13 Billion, and 70 Billion. A number of cloud services (such as Together and Replicate) offer APIs to run inference queries on any of these standard models (much the same way OpenAI does). You can use the results from these APIs to examine how the model behaves for your queries. If you are satisfied with the results, you can keep using the APIs (assuming you are okay with limitations 1 and 2 above), or you can download the models and run inference yourself (assuming you have access to an nVidia GPU to run the inference queries on).
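As a quick sketch, here is how you might run such a query with Replicate's Python client (the model slug below is an assumption based on Replicate's public catalog; check their listings for the current identifier):

import replicate

# Requires the REPLICATE_API_TOKEN environment variable to be set.
# The model identifier below is an assumption; look up the exact
# slug in Replicate's model catalog.
output = replicate.run(
    "meta/llama-2-70b-chat",
    input={"prompt": "Why should I buy a Bittle robot made by Petoi Inc?"},
)
# replicate.run() streams the answer token by token
print("".join(output))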
Let’s build a chatbot
For the purpose of this tutorial, our goal is to develop a chatbot for one of the robots we frequently feature in this forum, the Petoi Bittle. The chatbot should answer simple questions people might have about the Bittle. Such a chatbot could be deployed on the portal where the Bittle is sold (attention, Petoi Inc!) to help educate potential users about the robot. Let us first examine how an off-the-shelf Llama2 with 70 Billion parameters performs. We use the services of Replicate to access Llama2 and ask it the following question:
Why should I buy a Bittle robot made by Petoi Inc?
The answer from Llama2 is:
I cannot provide a response that is biased towards a specific product or brand. However, I can provide some general information about the benefits of robotic pets and the factors to consider when choosing one.
Robotic pets, like the Bittle robot made by Petoi Inc, can provide companionship and entertainment for people who may not be able to care for a live pet. They can also be a great option for those who want to experience the joys of pet ownership without the responsibilities and costs associated with caring for a live animal.
When choosing a robotic pet, it's important to consider factors such as the product's quality, durability, and safety features. You may also want to consider the level of interaction and engagement the robot provides, as well as its ability to respond to your needs and emotions.
Ultimately, the decision to purchase a robotic pet, including the Bittle robot, should be based on your personal preferences and needs. It's important to do your research and compare different products before making a decision.
You will notice that the answer is very generic and applies to any robotic pet, not the Petoi Bittle in particular. If you were someone at Petoi, you would not be very enthusiastic about this answer from a chatbot, because it fails to distinguish your product from your competitors'. Let us examine how we can customize this chatbot to converse about the Petoi Bittle.
Let’s improve this chatbot!
What do we mean by finetuning models?
As we mentioned, pre-trained models such as Llama2 are trained on large open datasets. Although Meta did not disclose the dataset it used to train Llama2, it did disclose the dataset for the previous version, Llama1. That dataset mainly comprised open-source books and the Common Crawl dataset. When the model was trained, its parameters (70 Billion of them in the case of Llama2-70B) were adjusted to give the best performance on the data presented to the model.
To customize the model and build a chatbot for your use case, you need to re-adjust the model's parameters so that it gives the best performance on the dataset you present to it. On the surface, this seems an arduous task: you would have to adjust all the parameters of the model (70 Billion in the case of Llama2-70B), which might be as computationally expensive as training a model from scratch. However, several techniques make finetuning a model a much cheaper proposition than training from scratch.
Size of the dataset: While the dataset used to pretrain a model such as Llama2-70B is huge, finetuning needs a much smaller dataset. You can think about it intuitively this way: the pre-trained model has already built up an ability to generate the best text for a given context, so it doesn't take much work to adjust its outputs for the use case you present to it.
Parameter-Efficient Fine-Tuning (PEFT): PEFT is a technique that adds a small set of new parameters to the existing model and tunes only those (and perhaps a select few of the original model's). Therefore, instead of adjusting 70 Billion parameters, we are talking about adjusting as few as 100 Million (we realize that 100 Million sounds like a big number too!). PEFT makes finetuning models much less computationally expensive.
Quantized Low Rank Adapters (QLoRA): The challenge with a large model is that it takes a large amount of memory to hold all the model parameters. Consider the Llama2 70 Billion model. The standard way of representing a floating point number is the 32-bit format, which takes 4 bytes; storing the parameters this way would take 70,000,000,000 * 4 = 280 GBytes of memory. Models nowadays usually store parameters in a more compact floating point format known as bf16, which takes 2 bytes, but even that requires 140 GBytes. One of the higher-end GPUs from nVidia (the A100 80GB) has only 80 GB of memory. It would therefore take at least two such GPUs just to fit the bf16 model, let alone hold the additional memory required to run the training computation.
QLoRA reduces the memory required for training by keeping the model parameters in a 4-bit format (instead of the original 16-bit), without compromising on model quality.
Techniques such as PEFT and QLoRA have been shown to degrade model quality by less than 1% relative to full finetuning. The result is that you can now build a fully customized LLM for free, using the nVidia Tesla T4 GPU available on Google Colab's free tier (see our popular article comparing Google Colab with AWS Sagemaker Studio here), and that model will perform nearly as well as a state-of-the-art fully finetuned one.
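To make this concrete, here is a minimal sketch of a QLoRA setup using the Hugging Face transformers, bitsandbytes, and peft integrations (the rank, alpha, and target modules below are illustrative assumptions, not tuned values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with its weights quantized to 4 bits (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Add a small set of trainable low-rank adapter weights (PEFT);
# r and target_modules here are illustrative assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total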
Creating a dataset
Now, let us examine how to finetune Llama2 for your specific use case (we use the 13 Billion variant in this tutorial). A rough sketch of the steps can be found in the following figure:
You first need to create a dataset that can be used for finetuning. For our use case, we create a dataset of questions and answers (Q&A) about the Bittle. We compile the Q&A from three sources: i) Q&A available on Petoi's website, ii) Q&A from Amazon reviews, and iii) Q&A created from our own knowledge of the product. You can find our dataset at Hugging Face here.
As an example, a sample entry from the dataset looks as follows:
{"text":"### Human: Who made the Bittle robot?### Assistant: Bittle is manufactured by the robotics company Petoi founded by Dr. Rongzhong Li."}
You can similarly construct the dataset for your specific use case by having Q&A in the above form. In essence, you are emulating the best answer that you think the chatbot should give for the asked question.
You do not need to construct a huge dataset. The dataset I constructed has only 28 Q&A pairs, but I ensured the questions were highly relevant and the answers to the point. There is ample research evidence that precise, high-quality data leads to a better trained model; hence a large proportion of the work in building a Machine Learning model is the human effort of creating a precise dataset.
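If you host your dataset on Hugging Face as we did, loading it for finetuning takes a single call. The dataset ID below is a placeholder; substitute your own repository name:

from datasets import load_dataset

# Replace the ID below with your own Hugging Face dataset repository
dataset = load_dataset("your-username/bittle-qa", split="train")
print(dataset[0]["text"])  # one "### Human: ... ### Assistant: ..." record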
Tip: If you are familiar with the Bittle, please consider adding to the Q&A above and open a pull request. It will help enhance the quality of the dataset.
Fine-tuning Llama2
Finetuning an LLM such as Llama2 is actually quite easy, given the Python packages for PEFT and QLoRA and the support available at Hugging Face. The best part is that you can finetune your model for free using the T4 GPU available on Google Colab's free tier. Please refer to my Google Colab notebook here. There is also an excellent tutorial on this topic which helped me learn these steps easily. Since my dataset was rather small, it took me less than 5 minutes to finetune Llama2-13B. Once you fine-tune your model, you can upload it to Hugging Face; as an example, you can find my fine-tuned model here. Feel free to use it if you are interested in a chatbot for the Bittle, or create your own finetuned version. Once you have uploaded the model to Hugging Face, you can disconnect your Colab notebook. Also note that since this is a PEFT model, the trained weights are quite small (210 MB), compared to the 26 Gigabytes of the original Llama2-13B.
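For reference, the core of the training code looks roughly like the following sketch, based on the trl library's SFTTrainer as it existed in mid-2023 (argument names have changed across trl releases, and the hyperparameters shown are illustrative defaults, not tuned values):

from transformers import TrainingArguments
from trl import SFTTrainer

# model, tokenizer, lora_config, and dataset as prepared in the earlier sketches
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,          # a common starting point for QLoRA
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",   # the field holding our Q&A strings
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()

# Push only the small adapter weights (a few hundred MB) to Hugging Face
trainer.model.push_to_hub("your-username/llama2-for-your-robot")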
Using our fine-tuned Llama2
Since our model is on Hugging Face, it is easy and convenient to use. It takes only a few lines of code to load the model. I also have another Google Colab notebook here, just for the purpose of inference.
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "hiamitabha/llama2forbittlerobot"

# Read the adapter config to discover which base model it was trained from
config = PeftConfig.from_pretrained(model_name)

# Load the base model in 4-bit, so it fits in the T4's 16 GB of GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the GPU
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    config.base_model_name_or_path,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token

# Attach the fine-tuned PEFT adapter weights on top of the base model
model = PeftModel.from_pretrained(model, model_name)
Now to use the model, all you need to do is:
text = "###Human: Give me some good reasons to buy the Petoi Bittle robot. ### Assistant: "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
And you get the answer as follows:
Here are some reasons.
1. Advanced Technology: The Petoi Bittle robot is equipped with advanced technology that allows it to learn and adapt to its environment, making it a highly intelligent and interactive companion.
2. Customizable: The Bittle robot can be customized with different accessories, allowing you to personalize your robot and make it unique to your preferences.
3. Long Battery Life: The Bittle robot has a long battery life, allowing you to enjoy hours of fun and entertainment without needing to recharge.
4. Voice Control: The Bittle robot can be controlled using voice commands, making it easy and convenient to use.
5. Educational: The Bittle robot can be used as an educational tool to teach children about programming, robotics, and AI technology.
6. Affordable: The Bittle robot is an affordable option compared to other robots on the market, making it a great value for the price.
7. Compatible with Other Devices: The Bittle robot can be controlled using a smartphone app, allowing you to control it from anywhere in the world.
You can see for yourself that even with a smaller model (13B rather than 70B), the answers are much better than off-the-shelf Llama2's. Some parts still need correction, e.g. the Bittle cannot learn and adapt to its environment. We can remedy some of these hallucinations with more fine-tuning data. Keep checking this forum for subsequent efforts.
I have some more queries in my Google Colab notebook, in case you wish to play with those.
Deploying our fine-tuned Llama2
Since our model is on Hugging Face, it can easily be deployed to production with Hugging Face Inference Endpoints. This service lets you deploy your model automatically with various cloud vendors (AWS, Azure, GCP), with the required protection settings and scalability options. Since there are many Hugging Face tutorials on the steps involved, we will not discuss them in detail here.
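For reference, once an endpoint is up, querying it is a plain HTTPS request following Hugging Face's standard text-generation payload schema. The URL below is a placeholder; copy the real one from your endpoint's dashboard:

import os
import requests

# Placeholder URL; your endpoint's page shows the real one after deployment
API_URL = "https://your-endpoint.endpoints.huggingface.cloud"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

payload = {
    "inputs": "### Human: Who made the Bittle robot?### Assistant: ",
    "parameters": {"max_new_tokens": 100},
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())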
Next Steps
We went over one iteration of finetuning Llama2 and examining the quality of its inference results. Any Machine Learning solution needs continuous examination and iterative improvement. The next steps to improve our rather primitive model would include:
Improving the dataset. A more precise dataset always leads to a better model.
Trying out other pre-trained models. Our example was restricted to Llama2-13B. You should try other pre-trained models from Hugging Face and examine which one is best for your use case.
Changing model training parameters. There are a number of default training parameters in the provided notebook. Advanced users may try changing some of these parameters and examine their impact on performance.
And please document your findings in the comments of this post, so that they help others. Pointers to your open-source code are invaluable.
If we can help you deploy your customized LLM and use it in your use case, please let us know as well.
Conclusion
This tutorial teaches you how to build a customized Large Language Model by finetuning Meta's Llama2. The best part is that you can finetune for free using Google Colab. We hope many of you get a chance to try the model and witness for yourself the capability of state-of-the-art LLMs.
As always, drop your comments and experiences below. It enriches the entire community.
Would you use this same method for later versions of Llama?