Raw Record Source

{
  "path": "/posts/2024/lm-streaming-with-sse/index",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "sse",
    "vercel",
    "language_models",
    "python",
    "fastapi"
  ],
  "$type": "site.standard.document",
  "title": "Language Model Streaming With SSE",
  "updatedAt": "2024-01-31T08:19:38.000Z",
  "publishedAt": "2024-01-31T08:19:38.000Z",
  "textContent": "OpenAI popularized a pattern of streaming results from a backend API in realtime with ChatGPT.\nThis approach is useful because the time a language model takes to run inference is often longer than what you want for an API call to feel snappy and fast.\nBy streaming the results as they're produced, the user can start reading them and the product experience doesn't feel slow as a result.\n\nOpenAI has a nice example of how to use their client to stream results.\nThis approach makes it straightforward to print each token out as it is returned by the model.\nMost user facing apps aren't command line interfaces, so to build our own ChatGPT like experience where the tokens show up in realtime on a user interface, we need to do a bit more work.\nUsing Server-Sent Events (SSE), we can display results to a user on a webpage in realtime.\n\nIf you're short on time, Vercel wrote a nice ai library that handles much of the work we're about to do here in React, with nice hooks and an easy pattern for hosting a backend (through proprietary).\nThis exploration is to understand SSE more deeply and look at what it takes to build your own API and UI capable of the UX described.\n\nA simple implementation\n\nWe're going to build a fastapi backend to stream tokens from OpenAI to a frontend, which I will build as well.\nLet's start with a simple server that returns at StreamingResponse.\n\nserver.py\n\nRun the server\n\nCurl the endpoint shows things are wired together reasonably\n\nThis is _almost_ all we need on the server for an MVP of streaming to a UI.\nReferencing the same SSE docs as earlier, we see a description of an interface called EventSource.\nWe'll see a simple html and Javascript frontend with EventSource to build our UI.\nLooking at the example PHP code, we also see our server code needs some slight modifications.\nThe data we emit needs to be formatted as f\"data: {data}\\n\\n\" for EventSource to handle it.\nWe also need to set the content type of text/event-stream.\n\nHere is the server with those changes:\n\nAnd the curl still works (though now is less human readable).\n\nNow let's create a simple UI and serve it from the server (for simplicity).\n\nindex.html\n\nWe also modify the server to render with html file when request the root url.\n\nWe can open the site at http://localhost:8000 and enter a message to the model.\nThen click \"Start Streaming\".\n\n!Language model streaming without close\n\nWe see the response from the model showing up in real-time!\nSomething funny happens though.\nIf we wait a few seconds, we see new responses to the message continue coming through, as if the request is being submitted repeated it.\nIn fact, this is exactly what is happening.\n\nWe can modify the server to see a [DONE] response when the streaming is complete and the client to close the EventSource connection when it receives this data.\n\nNow when we submit a message from the UI, the client closes the connection with the server after it is finished responding.\nIf we submit another request, the client clears the content and streaming the new response.\nIt generally works.\nThe weirdness in the spacing is due to the model returning newlines (\\n) which are being consumed by EventSource as the suffix of data.\nThat is a problem we won't delve into here.\n\nRepurposing the backend\n\nEventSource works for streaming events from a backend with SSE, but there are other options depending on how you're building your UI.\nAs mentioned earlier, Vercel's ai package provides React hooks to build a realtime streaming chat.\nWe can modify our backend to support this library.\n\nLet's first consider a minimal (as much as possible anyway) Next.js app with the ai library installed, taken straight from the docs\n\nsrc/app/page.tsx\n\nWe've setup the useChat hook to point to our backend and wired up an input field to submit the conversation along with a new message.\nIf we inspect the payload the ai sends to the backend, we see it's a POST request and that the body content contains the OpenAI schema for an array of messages.\n\nWith this knowledge, we can modify our backend to expect this request schema and pass it to the model.\nThrough some quick trial and error (maybe the docs have this somewhere as well), I also learned the library expects the streaming responses to contain the tokens only, not the data: {content}\\n\\n\\ wrapping like EventSource does.\n\nOur server code now looks like this\n\nHere's how it looks (sorry, no fancy UI):\n\n!Demo using Vercel",
  "canonicalUrl": "https://www.danielcorin.com/posts/2024/lm-streaming-with-sse/index"
}