Step 03 - Streaming responses
LLM responses can be long. Imagine asking the model to generate a story. It could potentially produce hundreds of lines of text.
In the current application, the entire response is accumulated before being sent to the client. During that generation, the client is waiting for the response, and the server is waiting for the model to finish. Sure, there is the “…” bubble indicating that something is happening, but it is not the best user experience.
Streaming lets us send the response to the client as it is generated: the model emits the response in chunks (tokens), and the server forwards these chunks to the client as they arrive.
The final code of this step is located in the step-03 directory. However, we recommend that you follow the instructions below to get there, and continue extending your current application.
Asking the LLM to return chunks
The first step is to ask the LLM to return the response in chunks. Initially, our AI service looked like this:
package dev.langchain4j.quarkus.workshop;

import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.SessionScoped;

@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    String chat(String userMessage);
}
Note that the return type of the chat method is String. We will change it to Multi<String> to indicate that the response will be streamed instead of returned synchronously.
package dev.langchain4j.quarkus.workshop;

import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.SessionScoped;

@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    Multi<String> chat(String userMessage);
}
A Multi<String> is a stream of strings. Multi is a type from the Mutiny library that represents a stream of items, possibly infinite. In this case, it will be a stream of strings representing the response from the LLM, and it will be finite (fortunately). A Multi has other characteristics, such as the ability to handle back pressure, which we will not cover in this workshop.
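To make this more concrete, here is a tiny, self-contained Mutiny sketch, unrelated to the AI service, that consumes a finite Multi<String> item by item. The class name and the hard-coded items are just for illustration.

import io.smallrye.mutiny.Multi;

public class MultiDemo {

    public static void main(String[] args) {
        // A finite stream of strings, similar in shape to the chunks an LLM emits.
        Multi<String> chunks = Multi.createFrom().items("Hello", ", ", "world", "!");

        // Subscribe with callbacks for each item, for a failure, and for completion.
        chunks.subscribe().with(
                System.out::print,
                failure -> System.err.println("Failed: " + failure),
                () -> System.out.println("\n<stream completed>")
        );
    }
}

In our application, you will not subscribe to the Multi yourself: the websocket endpoint returns it and Quarkus takes care of the subscription, as shown in the next section.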
Serving streams from the websocket
OK, now our AI service returns a stream of strings, but we need to modify our websocket endpoint to handle this stream and send it to the client. Currently, our websocket endpoint looks like this:
package dev.langchain4j.quarkus.workshop;

import io.quarkus.websockets.next.OnOpen;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;

@WebSocket(path = "/customer-support-agent")
public class CustomerSupportAgentWebSocket {

    private final CustomerSupportAgent customerSupportAgent;

    public CustomerSupportAgentWebSocket(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Miles of Smiles! How can I help you today?";
    }

    @OnTextMessage
    public String onTextMessage(String message) {
        return customerSupportAgent.chat(message);
    }
}
Let’s modify the onTextMessage method to send the response to the client as it arrives.
package dev.langchain4j.quarkus.workshop;

import io.quarkus.websockets.next.OnOpen;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import io.smallrye.mutiny.Multi;

@WebSocket(path = "/customer-support-agent")
public class CustomerSupportAgentWebSocket {

    private final CustomerSupportAgent customerSupportAgent;

    public CustomerSupportAgentWebSocket(CustomerSupportAgent customerSupportAgent) {
        this.customerSupportAgent = customerSupportAgent;
    }

    @OnOpen
    public String onOpen() {
        return "Welcome to Miles of Smiles! How can I help you today?";
    }

    @OnTextMessage
    public Multi<String> onTextMessage(String message) {
        return customerSupportAgent.chat(message);
    }
}
That’s it! The response will now be streamed to the client as it arrives. This works because Quarkus natively understands the Multi return type and knows how to handle it.
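If you also want to observe the chunks on the server side, for example for debugging, you could tap into the stream before returning it. This is an optional sketch, not part of the workshop code; it uses Mutiny’s onItem().invoke(...) operator and assumes the io.quarkus.logging.Log helper for logging.

    // Optional variation, requires: import io.quarkus.logging.Log;
    @OnTextMessage
    public Multi<String> onTextMessage(String message) {
        // Log every chunk as it flows from the LLM to the client.
        return customerSupportAgent.chat(message)
                .onItem().invoke(chunk -> Log.infof("chunk: %s", chunk));
    }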
Testing the streaming
To test the streaming, you can use the same chat interface as before. The application should still be running. Go back to the browser, refresh the page, and start chatting. If you ask simple questions, you may not notice the difference.
Ask something that requires a longer answer, for example a request to write a story, and you will see the response being displayed as it arrives.
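If you would rather watch the chunks arrive outside the browser, here is a minimal, optional sketch of a standalone client built on the JDK’s java.net.http.WebSocket API. It assumes the application runs on the default Quarkus dev port (8080); the path is the one declared on the endpoint, and the prompt is just an example. Each item emitted by the Multi should show up as a separate text message.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class StreamingChatClient {

    public static void main(String[] args) throws Exception {
        // Prints every incoming text message as soon as it arrives.
        WebSocket.Listener listener = new WebSocket.Listener() {
            @Override
            public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
                System.out.print(data);
                webSocket.request(1); // ask for the next message
                return null;
            }
        };

        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create("ws://localhost:8080/customer-support-agent"), listener)
                .join();

        // Example prompt that should produce a long, streamed answer.
        ws.sendText("Please write a short story about a memorable car rental trip.", true).join();

        // Give the streamed response some time to arrive, then close the connection.
        Thread.sleep(30_000);
        ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}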
Let’s now switch to the next step!