We’re excited to announce streaming support in LangChain. There's been a lot of talk about the best UX for LLM applications, and we believe streaming is at its core. We’ve also updated the chat-langchain repo to include streaming and async execution. We hope that this repo can serve as a template for developers building best-in-class chat and question/answering applications.
Motivation
One of the biggest pain points developers run into when building useful LLM applications is latency: these applications often make multiple calls to LLM APIs, each one taking a few seconds. It can be a frustrating user experience to stare at a loading spinner for more than a couple of seconds.
Streaming helps reduce this perceived latency by returning the output of the LLM token by token, instead of all at once. In the context of a chat application, as a token is generated by the LLM, it can be served immediately to the user. While this doesn’t change the end-to-end execution time from question submission to full response, it greatly reduces the perceived latency by showing the user that the LLM is making progress. ChatGPT is a great example of an application that leverages LLM streaming. We've built a sample chatbot application that uses streaming just like ChatGPT (more details below).
Usage
As a starting point, we’ve implemented streaming support for the OpenAI implementation of LLM. Read the full documentation here. We’ve added a callback called on_llm_new_token that users can implement in their callback handlers; it is invoked for each new token when the streaming parameter of OpenAI is set to True. Streaming is supported for both synchronous and asynchronous execution.
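To make this concrete, here is a minimal sketch of a custom callback handler that prints tokens as they arrive. It is illustrative only: import paths and the exact way a handler is attached (callbacks vs. callback_manager) vary across LangChain versions.

```python
from langchain.callbacks.base import BaseCallbackHandler
from langchain.llms import OpenAI


class PrintTokenHandler(BaseCallbackHandler):
    """Print each token to stdout as soon as the LLM emits it."""

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Invoked once per token when streaming=True.
        print(token, end="", flush=True)


# streaming=True makes the OpenAI wrapper request a streamed response and
# forward each token to on_llm_new_token instead of waiting for the full text.
llm = OpenAI(streaming=True, callbacks=[PrintTokenHandler()], temperature=0)
llm("Write me a haiku about streaming tokens.")
```

The same handler interface is what the web application described below builds on: instead of printing to stdout, its handlers push tokens to the browser.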
Web Application Template
Now that you’ve used LangChain to build a chatbot for the Chat-Your-Data Challenge (or another application) that you can run in the terminal, what’s next? How about turning that program into a web application that multiple users can take advantage of?
We’ve implemented some changes in chat-langchain to highlight best practices for integrating the relevant LangChain features into a ready-to-deploy application that can support many users. The app leverages FastAPI for the backend and a very basic UI made with Jinja templates. The repo remains open source; changes and suggestions are welcome!
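For orientation, here is a stripped-down sketch of that backend shape: a FastAPI app that serves a Jinja-rendered page and accepts one websocket connection per chat session. The route names and template file are illustrative, not the repo's actual code.

```python
from fastapi import FastAPI, Request, WebSocket
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")


@app.get("/")
async def index(request: Request):
    # Render the chat UI from a Jinja template.
    return templates.TemplateResponse("index.html", {"request": request})


@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    while True:
        question = await websocket.receive_text()
        # In the real app, the chain runs here and streams tokens back
        # through the callback handlers described in the next section.
        await websocket.send_text(f"Received: {question}")
```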
Streaming
The application takes advantage of LangChain streaming and implements a StreamingLLMCallbackHandler to send each token back to the client over a websocket. Another callback handler, QuestionGenCallbackHandler, is used to send messages to the client during the question-generation step of the ChatVectorDBChain.
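As a rough sketch of the idea (not the repo's exact code, and the async handler base class may live under a different path depending on your LangChain version), a websocket-aware handler can hold a reference to the client connection and forward every token it receives:

```python
from fastapi import WebSocket
from langchain.callbacks.base import AsyncCallbackHandler


class StreamingLLMCallbackHandler(AsyncCallbackHandler):
    """Forward each generated token to the connected client over a websocket."""

    def __init__(self, websocket: WebSocket):
        self.websocket = websocket

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        # The browser appends each token to the in-progress bot message,
        # so the answer appears to type itself out.
        await self.websocket.send_json(
            {"sender": "bot", "message": token, "type": "stream"}
        )
```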
Async Execution
The application leverages recently added asyncio support for select chains and LLMs to support concurrent execution, without having to spawn multiple threads and reason about races. This is important when multiple clients are connected to the application at once.
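As a standalone illustration of what asyncio support buys you (separate from the web app, which awaits the chain inside each websocket connection), several LLM calls can be awaited concurrently on a single event loop:

```python
import asyncio

from langchain.llms import OpenAI


async def answer(llm: OpenAI, question: str) -> str:
    # agenerate awaits the API call on the event loop instead of blocking a thread.
    result = await llm.agenerate([question])
    return result.generations[0][0].text


async def main() -> None:
    llm = OpenAI(temperature=0)
    questions = ["What is LangChain?", "What is streaming?", "What is asyncio?"]
    # All three requests are in flight at the same time.
    answers = await asyncio.gather(*(answer(llm, q) for q in questions))
    for q, a in zip(questions, answers):
        print(q, "->", a.strip())


asyncio.run(main())
```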
Up Next
We're just getting started with streaming, callbacks, and async support! We'd love any and all feedback. In the near future, we hope to implement:
- Streaming support for other LLMs.
- More examples and use-cases for callback handlers.