Exploring serverless endpoints for Large Language Models (LLMs)
We explore the APIs provided by Together.ai
In our last post, we explored how to support the knowledge graph in Vector Wirepod using Large Language Models (LLMs) available from Together AI. Together AI is one of the startups that provide on-demand APIs for directly using a Large Language Model. You simply pay for the tokens (a token is a fragment of a word, in generative AI terminology) you pass to the model and the tokens the model generates.
Benefits of serverless APIs for LLMs
Services such as those from Together AI greatly simplify the use of generative AI, because you no longer need to spend resources on a deployment of your own. This helps you in the following ways:
You don’t need to calculate or plan the size of your deployment, such as the number of GPUs you need.
You do not need to pay upfront and right-size your cloud or on-premises deployment. You simply need an idea of your requirements, and you can then easily budget for your work.
You can pick a model, experiment with it, and settle on the one that gives you the desired quality of results at a price you are willing to pay.
How does Together AI compare with OpenAI?
Together AI’s service is very similar to that of OpenAI (now a well-known name because of ChatGPT), which also provides a pay-as-you-go API. The main difference is that, because Together AI supports open source models, it offers a much larger collection of models across a much wider range of price points. You will find almost every well-known model from Hugging Face on Together.
What are the tradeoffs of using serverless endpoints?
The tradeoff of using a serverless endpoint is that, since you do not own the deployment, you are not entitled to any guarantees of service. Together AI does not provide any Service Level Agreement (SLA) for its services, nor does it document their performance. This post aims to shed some light on the latency of Together AI’s APIs, so you can judge whether they make sense for your use case.
Our benchmark
We created a small benchmark, which we call Together-bench, to measure the performance of the Large Language Models (LLMs) offered by Together. The benchmark is simple: it asks the Together AI service to answer a short question, “What is the capital of France?” It takes the LLM model as an input parameter and measures the end-to-end latency of the API call, which translates to the response time you can expect from Together. Note that the expected answer is a single word, Paris, so the response time measured is essentially the time to generate a single token. While you would typically use a generative AI model to produce paragraphs of text, we believe the time to the first token is a good proxy for performance, because the first token takes the longest for the model to produce.
The benchmark is available open source on GitHub, and I encourage you to explore and play with it.
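To make the measurement concrete, here is a minimal sketch of what such a latency probe can look like. This is not the Together-bench code itself: the endpoint URL, payload fields, and model identifier are assumptions based on Together AI’s OpenAI-compatible API, so check the repository for the actual implementation.

```python
# A minimal latency probe, assuming Together AI's OpenAI-compatible HTTP API.
# The endpoint URL, payload fields, and model name are illustrative assumptions;
# the actual Together-bench code on GitHub may differ.
import os
import time

import requests

TOGETHER_API_URL = "https://api.together.xyz/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["TOGETHER_API_KEY"]


def measure_latency(model: str, prompt: str = "What is the capital of France?") -> float:
    """Send one short prompt and return the end-to-end API response time in seconds."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,  # the expected answer ("Paris") is a single word
    }
    headers = {"Authorization": f"Bearer {API_KEY}"}

    start = time.perf_counter()
    response = requests.post(TOGETHER_API_URL, json=payload, headers=headers, timeout=60)
    elapsed = time.perf_counter() - start

    response.raise_for_status()
    return elapsed


if __name__ == "__main__":
    # The model identifier below is illustrative; pass any model you want to test.
    print(f"{measure_latency('meta-llama/Llama-2-70b-chat-hf'):.3f} s")
```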
Editor’s note: Thank you for reading this article. We depend on your support and word-of-mouth referrals to grow. If you know someone who could benefit from this newsletter, please refer them. Better yet, please consider giving them a gift subscription. Thanks again.
Results
We used Together-bench to understand how the response time of Together AI’s service varies with the Large Language Model. We chose a wide set of models from common open source families: Llama/Llama2, Falcon, MPT-Instruct, and GPT-NeoX. The results are shown in the chart below, with the Y-axis reporting the API response time in seconds. For each model, we probed the API every 5 minutes over a period of 1 hour, and the median latency is the data point we report.
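Conceptually, that sampling loop is as simple as the following sketch, where measure_latency is the hypothetical probe from the earlier sketch rather than the actual benchmark code:

```python
# Sketch of the sampling loop: one probe every 5 minutes for 1 hour,
# reporting the median latency. measure_latency is the hypothetical
# probe defined in the earlier sketch.
import time
from statistics import median


def median_latency(model: str, interval_s: int = 300, duration_s: int = 3600) -> float:
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(measure_latency(model))
        time.sleep(interval_s)
    return median(samples)
```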
We note that the latency for most models is around 0.4 seconds. There are a couple of models, Falcon 40B and Llama2-70B, for which the API response time is a bit higher, at 0.7s and 0.53s respectively. Let us try to understand this in more detail.
The remainder of this article is for paid members only; it discusses the factors behind the latency and how the latency varies over time. Please consider supporting our publication by becoming a paid member.