We’re excited to announce streaming support in LangChain. There's been a lot of talk about the best UX for LLM applications, and we believe streaming is at its core. We’ve also updated the chat-langchain repo to include streaming and async execution. We hope that this repo can serve as a template for developers building best-in-class chat and question-answering applications.
One of the biggest pain points developers raise when building useful LLM applications is latency: these applications often make multiple calls to LLM APIs, each one taking a few seconds. Staring at a loading spinner for more than a couple of seconds can be quite a frustrating user experience.
Streaming helps reduce this perceived latency by returning the output of the LLM token by token, instead of all at once. In the context of a chat application, each token can be served to the user as soon as the LLM generates it. While this doesn’t change the end-to-end execution time from question submission to full response, it shows the user that the LLM is making progress, so the wait feels much shorter. ChatGPT is a great example of an application that leverages LLM streaming. We've built a sample chatbot application that uses streaming just like ChatGPT (more details below).
As a starting point, we’ve implemented streaming support for the OpenAI implementation of LLM. Read the full documentation here.
We’ve added support for a callback called on_llm_new_token, which users can implement in their callback handlers. It is invoked for each new token when the streaming parameter of OpenAI is set to True. Streaming is supported for both synchronous and asynchronous execution.
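To make the callback pattern concrete, here is a minimal sketch in plain Python. It does not use LangChain's classes: fake_streaming_llm and CollectingHandler are hypothetical stand-ins, and only the on_llm_new_token method name and the streaming flag come from the text above.

```python
from typing import Iterable, List


class CollectingHandler:
    """Hypothetical handler that records each token as it arrives."""

    def __init__(self) -> None:
        self.tokens: List[str] = []

    def on_llm_new_token(self, token: str) -> None:
        # In a real application, this is where the token would be pushed to the UI.
        self.tokens.append(token)


def fake_streaming_llm(
    tokens: Iterable[str], handler: CollectingHandler, streaming: bool = True
) -> str:
    """Stand-in for an LLM call (an assumption, not LangChain's OpenAI class)."""
    pieces: List[str] = []
    for token in tokens:
        pieces.append(token)
        if streaming:
            # Notify the handler immediately instead of waiting for the full output.
            handler.on_llm_new_token(token)
    return "".join(pieces)


handler = CollectingHandler()
full_text = fake_streaming_llm(["Hello", ",", " world"], handler)
```

The caller still receives the complete string at the end, while the handler has seen every token along the way. With streaming set to False, the handler is never invoked.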
Web Application Template
Now that you’ve used LangChain to build a chatbot for the Chat-Your-Data Challenge (or some other application) that you can run in the terminal, what’s next? How about turning that program into a web application that multiple users can take advantage of?
We’ve implemented some changes in chat-langchain to highlight best practices on how to integrate the relevant LangChain features into a ready-to-deploy application that can support many users. The app leverages FastAPI for the backend and a very basic UI made with Jinja templates. The repo remains open source — changes and suggestions are welcome!
The application takes advantage of LangChain streaming and implements StreamingLLMCallbackHandler to send each token back to the client via websocket. Another callback handler, QuestionGenCallbackHandler, is used to send messages to the client at the question-generation step.
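The idea of forwarding each token over a websocket can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's actual code: FakeWebSocket stands in for a real websocket connection, StreamingHandlerSketch is a hypothetical name, and the message fields (sender, message, type) are an assumed schema.

```python
import asyncio
from typing import Any, Dict, List


class FakeWebSocket:
    """Stand-in for a real websocket; it just records what would be sent."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, Any]] = []

    async def send_json(self, data: Dict[str, Any]) -> None:
        self.sent.append(data)


class StreamingHandlerSketch:
    """Hypothetical handler that forwards every new token to the client."""

    def __init__(self, websocket: FakeWebSocket) -> None:
        self.websocket = websocket

    async def on_llm_new_token(self, token: str) -> None:
        # Assumed message schema; the real application may use different fields.
        await self.websocket.send_json(
            {"sender": "bot", "message": token, "type": "stream"}
        )


async def demo() -> List[Dict[str, Any]]:
    ws = FakeWebSocket()
    handler = StreamingHandlerSketch(ws)
    for token in ["Lang", "Chain"]:
        await handler.on_llm_new_token(token)
    return ws.sent


sent = asyncio.run(demo())
```

Because the handler only depends on an object with an async send_json method, the fake can be swapped for a real connection without changing the handler itself.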
The application leverages recently added asyncio support for select chains and LLMs to support concurrent execution (without having to spawn multiple threads and reason about race conditions). This is important when there are multiple client connections to the application.
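As a rough illustration of why this matters, the sketch below serves two clients concurrently on a single thread. fake_llm_call is a hypothetical stand-in for an async LLM or chain call, with asyncio.sleep simulating network latency; the client names and delays are made up for the example.

```python
import asyncio
from typing import List

events: List[str] = []


async def fake_llm_call(client: str, delay: float) -> str:
    """Hypothetical async LLM call; the sleep simulates waiting on the API."""
    events.append(f"start {client}")
    await asyncio.sleep(delay)  # yields control so other requests can proceed
    events.append(f"end {client}")
    return f"answer for {client}"


async def main() -> List[str]:
    # Both calls are in flight at the same time on one event loop.
    return await asyncio.gather(
        fake_llm_call("client-1", 0.02),
        fake_llm_call("client-2", 0.01),
    )


answers = asyncio.run(main())
```

Both requests start before either finishes, so the second client does not wait for the first one's response. asyncio.gather still returns the results in the order the calls were passed in.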
We're just getting started with streaming, callbacks, and async support! We'd love any and all feedback. In the near future, we hope to implement:
- Streaming support for other LLMs.
- More examples and use-cases for callback handlers.