May 06, 2026 · 6 min read

Streaming Two Streams at Once: ASR Partials + LLM Tokens in One UI

Building QolAssist - what 'live' actually means when speech transcription and Claude tokens both have to share an event loop.

pyqt6 · claude-api · streaming · asr

Most “AI assistant” demos work the same way: one big request goes out, one big answer comes back. That pattern is fine for a chatbot. It falls apart the moment you want the assistant to feel live: reading captions on one side of the screen while an answer types itself out on the other.

QolAssist is a Windows desktop overlay I built to find out what that actually feels like. Two frameless, always-on-top panels float over whatever app you’re using: a Transcript on the left captioning system audio word-by-word, an Answer on the right streaming Claude tokens. One hotkey toggles between listening and asking. Everything except the Claude call runs on-device.

Here’s what I learned building it — the architectural decisions, the threading model, and the deadlock that taught me what thread.join() actually does.

[Figure: QolAssist running as a transparent desktop overlay]

Vosk Over Whisper: Why Perceived Latency Wins

The first version used faster-whisper on CPU. Whisper is more accurate, but captions arrived in 3-second chunks. It felt unusably laggy in conversation: you’d hear the speaker say something, then watch the words appear three seconds later, with no visual signal in between that anything was happening.

Switching to Vosk gave word-by-word partials at sub-second latency on the same hardware, at the cost of some accuracy on uncommon words. That’s the tradeoff in one sentence: for any ‘live’ UX, perceived latency beats peak accuracy. Users will forgive a wrong word; they won’t forgive three seconds of empty UI.

Vosk also streams. Every audio chunk you feed it can return either a partial result (current best guess for the in-flight phrase) or a final result (a confirmed segment). The transcript panel renders partials in a dimmed style and finals in a bright one, so the user can see the engine’s confidence in real time without reading the model’s mind.
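
One way to model the partial/final split is a small buffer that keeps confirmed segments plus a single replaceable in-flight partial. This is a minimal sketch, not QolAssist's actual panel code: TranscriptBuffer and its method names are hypothetical, and it assumes Vosk's JSON payloads, where a partial result carries a "partial" key and a final result carries a "text" key.

```python
import json


class TranscriptBuffer:
    """Confirmed segments plus one replaceable in-flight partial (hypothetical helper)."""

    def __init__(self):
        self.finals = []    # confirmed segments, rendered bright
        self.partial = ""   # current best guess, rendered dimmed

    def on_partial(self, payload: str) -> None:
        # Each partial replaces the previous one; partials never accumulate.
        self.partial = json.loads(payload).get("partial", "")

    def on_final(self, payload: str) -> None:
        text = json.loads(payload).get("text", "")
        if text:  # skip empty finals, which can show up after silence
            self.finals.append(text)
        self.partial = ""  # the in-flight phrase is now confirmed

    def render(self):
        """Return (bright_text, dimmed_text) for the panel to style."""
        return " ".join(self.finals), self.partial


buf = TranscriptBuffer()
buf.on_partial('{"partial": "hello wor"}')
buf.on_partial('{"partial": "hello world"}')
buf.on_final('{"text": "hello world"}')
buf.on_partial('{"partial": "how are"}')
```

The key design point is that a partial overwrites its predecessor while a final is appended, so the dimmed text can be rewritten freely without disturbing anything already confirmed.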

The Threading Model

Three things have to happen at once and none of them can block the UI:

  • Audio capture: WASAPI loopback pulling system audio into a numpy buffer.
  • ASR transcription: Vosk consuming the buffer and emitting partial/final results.
  • LLM streaming: the Anthropic SDK’s .stream() helper iterating over Claude’s SSE response.

Each of these lives on its own background thread. The Qt main thread owns every widget; nothing else is allowed to touch a widget directly. Background threads communicate with the UI exclusively through Qt signals.

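import json
from PyQt6.QtCore import QObject, pyqtSignal

# Background ASR worker: signals emitted here are delivered to slots on the UI thread
class TranscriberWorker(QObject):
    partial_result = pyqtSignal(str)
    final_result = pyqtSignal(str)

    def __init__(self, recognizer, audio_queue):
        super().__init__()
        self.recognizer = recognizer    # a vosk.KaldiRecognizer
        self.audio_queue = audio_queue  # assumed: queue.Queue of PCM bytes, None = stop

    def run(self):
        # iter(get, None) blocks on the queue and exits at the None sentinel
        for chunk in iter(self.audio_queue.get, None):
            if self.recognizer.AcceptWaveform(chunk):
                # Result() and PartialResult() return JSON strings
                self.final_result.emit(json.loads(self.recognizer.Result())["text"])
            else:
                self.partial_result.emit(json.loads(self.recognizer.PartialResult())["partial"])

# Wire the signals to UI slots once
worker.partial_result.connect(transcript_panel.show_partial)
worker.final_result.connect(transcript_panel.append_final)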

Qt signals are a thread-safe message bus: emit from any thread and, with the default connection type, Qt queues the call so the slot runs on the thread that owns the receiving QObject. This is the single most useful pattern in PyQt for this kind of app.
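
The hand-off from the capture thread to the ASR thread can be sketched with a standard-library queue. This is an assumption about the plumbing rather than the project's code: capture_loop, transcribe_loop, and the None stop sentinel are illustrative names, and the fake byte chunks stand in for WASAPI buffers.

```python
import queue
import threading

STOP = None  # sentinel telling the consumer to exit


def capture_loop(chunks, audio_queue):
    """Producer: push audio chunks, then the stop sentinel."""
    for chunk in chunks:
        audio_queue.put(chunk)
    audio_queue.put(STOP)


def transcribe_loop(audio_queue, on_chunk):
    """Consumer: block on the queue until the sentinel arrives."""
    # iter(callable, sentinel) calls get() repeatedly and stops at STOP
    for chunk in iter(audio_queue.get, STOP):
        on_chunk(chunk)


# Bounded queue: if the consumer stalls, put() blocks instead of growing memory.
audio_queue = queue.Queue(maxsize=32)
received = []

consumer = threading.Thread(target=transcribe_loop, args=(audio_queue, received.append))
producer = threading.Thread(
    target=capture_loop, args=([b"\x00\x01", b"\x02\x03"], audio_queue)
)
consumer.start()
producer.start()
producer.join()
consumer.join()
```

In a real app, on_chunk would feed the recognizer and emit a signal rather than append to a list. The sentinel gives the consumer a clean exit path that does not require anyone to join it from the wrong thread.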

The Self-Join Deadlock

Early on, the error handler on the audio mixer thread called stop_listening() directly. That helper joined the audio thread before returning. The audio thread was the calling thread. Python obligingly tried to wait for itself to finish.

The app froze. No exception. No crash. Just silent deadlock.

The fix is the same fix I’d apply to any Qt threading bug: stop calling cross-thread methods directly. Emit a signal instead.

# WRONG - runs in the audio thread, joins itself, deadlocks
def on_audio_error(self, err):
    log.error(err)
    self.stop_listening()  # joins the calling thread

# RIGHT - emit, let the main thread handle teardown
class AudioMixer(QObject):
    error_occurred = pyqtSignal(str)

    def on_audio_error(self, err):
        log.error(err)
        self.error_occurred.emit(str(err))  # queued to the receiver's thread

# controller lives on the main thread, so stop_listening() runs there
mixer.error_occurred.connect(controller.stop_listening)

Lesson: in any threaded UI app, treat “background thread calling a method that touches the main thread” as a code smell. Route it through a signal even when it looks unnecessary. The one time it is necessary, you’ll already be doing the right thing.
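
As a second line of defence alongside the signal, a teardown helper can refuse to wait on the thread it is running on. A minimal sketch, assuming the capture thread is a plain threading.Thread; Listener and its fields are hypothetical. The standard library's Thread.join() raises RuntimeError when a thread joins itself, but other blocking waits give no such warning, so an explicit identity check is cheap insurance.

```python
import threading


class Listener:
    """Hypothetical owner of a capture thread with a self-join guard."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.joined_from = None

    def _run(self):
        # Simulate an error inside the worker that triggers teardown.
        self.stop_listening()

    def stop_listening(self):
        self._stop.set()
        if threading.current_thread() is self._thread:
            # Called from the worker itself: never wait on our own thread.
            self.joined_from = "worker (join skipped)"
            return
        self._thread.join(timeout=2.0)
        self.joined_from = "other thread"

    def start(self):
        self._thread.start()


listener = Listener()
listener.start()
listener._thread.join(timeout=2.0)  # main thread waits; the worker exits on its own
```

The guard does not replace the signal-based fix; it only ensures that a stray direct call from the worker degrades into a skipped join instead of a hang.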

Streaming LLM with Prompt Caching

Every ask sends the same reference text file as the system prompt and the latest caption lines as the user message. Without caching, every ask re-tokenizes the whole reference file on Anthropic’s side and you pay full input cost every time.

With cache_control: ephemeral set on the system block, the cached part of the input is billed at roughly a tenth of the normal input rate on repeat asks, after a slightly more expensive first call that writes the cache. The reference file usually sits idle for two minutes between asks, well inside the cache's default five-minute lifetime, then gets hit five times in a row. Caching is purpose-built for that pattern.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": reference_file_contents,
        "cache_control": {"type": "ephemeral"},  # cache the prompt prefix up to this block
    }],
    messages=[{"role": "user", "content": latest_captions}],
) as stream:
    for text in stream.text_stream:
        token_received.emit(text)  # Qt signal → answer panel

The SDK’s .stream() context manager handles the SSE plumbing: event parsing, text accumulation, and clean teardown when the block exits. The client retries a failed initial request by default, though it does not resume a stream that drops midway. Each text fragment becomes a Qt signal emission; the answer panel appends it. Tokens land on screen as soon as they arrive.
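
To put rough numbers on the caching saving, the arithmetic can be sketched as below. The multipliers are assumptions to check against current pricing: a cache write billed at about 1.25x the base input rate and a cache read at about 0.1x. The function name, token counts, and price are illustrative, not measurements from QolAssist.

```python
def input_cost(asks: int, ref_tokens: int, caption_tokens: int,
               price_per_mtok: float, cached: bool,
               write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total input cost in dollars for `asks` calls inside one cache window."""
    per_token = price_per_mtok / 1_000_000
    captions = asks * caption_tokens * per_token  # user message, never cached
    if not cached:
        return asks * ref_tokens * per_token + captions
    first = ref_tokens * per_token * write_mult                # one cache write
    repeats = (asks - 1) * ref_tokens * per_token * read_mult  # cache reads
    return first + repeats + captions


# Illustrative numbers only: 20k-token reference file, 200-token captions,
# $1 per million input tokens, five asks in a row.
uncached = input_cost(5, 20_000, 200, 1.0, cached=False)
cached = input_cost(5, 20_000, 200, 1.0, cached=True)
```

Under these assumptions the five-ask burst costs about $0.034 with caching versus about $0.101 without, roughly a 3x saving overall: the 10x figure applies to each repeat read, while the first call pays the write premium. The response's usage block reports cache_creation_input_tokens and cache_read_input_tokens, which is a quick way to confirm the cache is actually being hit.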

What ‘Live’ Actually Feels Like

With both streams running, the experience is qualitatively different from any single-request demo I’d built before. Captions tick across the left panel as the speaker talks. You hit space. The right panel starts typing an answer before the left panel has even finished rendering its last partial. Two streams, one UI, no blocking, no lag.

The whole project is ~1,000 lines of Python across 8 modules, distributed as a self-contained Windows folder via PyInstaller. Solo build — design, architecture, and code — for Windows 10/11.

If you take one thing from this: for streaming UIs, the bottleneck is almost never the model. It’s the threading boundary between the model’s output and your UI thread. Get that boundary right with signals, and the rest of the architecture falls into place.
