
Bug: Multi-byte UTF-8 characters corrupted in streaming responses causing LLM regeneration #3068

@PlaneInABottle

Description


When streaming responses from LLMs contain multi-byte UTF-8 characters (Turkish, emoji, CJK), they can be corrupted if a chunk boundary falls inside a character's multi-byte sequence. This causes the LLM to detect malformed JSON in structured output and restart generation, resulting in double-generated content within a single response.

Symptoms

  1. Malformed JSON in agent responses: Response contains nested/duplicated JSON structures
  2. LLM double-generation: The LLM starts generating, gets cut off mid-character, then restarts and generates again
  3. Affects specific languages: Turkish characters (ü, ö, ş, ç, ğ), emoji (🚀, 🎯), CJK (中, 日)
  4. Memory block errors: Downstream blocks fail with "field doesn't exist" because the response wasn't properly formatted

Example of corrupted response:

```json
{
  "model": "google/gemini-3-flash-preview",
  "content": "{\"response\": \"First part of response... Ö{\"response\": \"Complete response...\"}",
  "toolCalls": { "list": [] }
}
```

The first "response" field is cut off at a multi-byte character boundary, then the JSON restarts.

Root Cause

When HTTP streaming splits at byte boundaries within multi-byte UTF-8 sequences:

  • Example: Turkish "Ö" is 2 bytes in UTF-8 (0xC3 0x96)
  • If split between chunks: Chunk 1 gets 0xC3, Chunk 2 gets 0x96
  • TextDecoder.decode(chunk) without { stream: true } doesn't maintain state
  • Each chunk, decoded in isolation, yields a replacement character (U+FFFD) in place of the split character
  • Result: the JSON becomes malformed → the LLM detects the error → restarts generation (reproduced in the sketch below)
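A minimal, standalone TypeScript repro of the decoding failure (not Sim's actual code, just the two-byte "Ö" case from above):

```ts
// "Ö" encodes to two bytes (0xC3 0x96). Splitting them across chunks and
// decoding each chunk with a fresh, stateless decode() call yields U+FFFD
// replacement characters instead of "Ö".
const bytes = new TextEncoder().encode('Ö') // Uint8Array [0xC3, 0x96]
const chunk1 = bytes.slice(0, 1) // 0xC3 — incomplete lead byte
const chunk2 = bytes.slice(1) // 0x96 — stray continuation byte

const broken =
  new TextDecoder().decode(chunk1) + new TextDecoder().decode(chunk2)
console.log(broken) // "��" (U+FFFD U+FFFD), not "Ö"
```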

Impact

This affects:

  • All workflows using agents with structured output (responseFormat)
  • Any language with multi-byte UTF-8 characters
  • Streaming responses from any LLM provider (OpenAI, Anthropic, Google, OpenRouter, etc.)

Example Scenario

A support bot configured with structured output (strict JSON schema) receives a query in a language with multi-byte characters. The agent's response gets split during streaming:

  1. First chunk arrives with incomplete byte sequence of a multi-byte character
  2. TextDecoder without { stream: true } emits a replacement character (see the buggy read-loop sketch after this list)
  3. JSON becomes invalid
  4. LLM detects malformed JSON (syntax error)
  5. LLM regenerates response within same call
  6. Both partial and regenerated responses concatenated = corrupt output
  7. Downstream memory/processing blocks fail with "field doesn't exist" errors
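A hypothetical sketch of the failure pattern in a typical streaming read loop (readStreamBroken is an illustrative name, not Sim's actual implementation):

```ts
async function readStreamBroken(body: ReadableStream<Uint8Array>): Promise<string> {
  const reader = body.getReader()
  let text = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // BUG: a fresh, stateless decode per chunk. Any multi-byte character
    // split across chunks becomes U+FFFD replacement characters.
    text += new TextDecoder().decode(value)
  }
  return text
}
```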

Workaround

Use TextDecoder.decode(value, { stream: true }) with a single TextDecoder instance created once, outside the streaming loop. This lets the decoder maintain internal state and buffer incomplete multi-byte sequences until the next chunk arrives.
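A minimal sketch of the workaround, assuming a standard ReadableStream consumer (readStreamFixed is an illustrative name):

```ts
async function readStreamFixed(body: ReadableStream<Uint8Array>): Promise<string> {
  const reader = body.getReader()
  // One decoder for the whole stream, so it can buffer partial sequences.
  const decoder = new TextDecoder('utf-8')
  let text = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // { stream: true } keeps incomplete multi-byte sequences buffered
    // until the remaining bytes arrive in a later chunk.
    text += decoder.decode(value, { stream: true })
  }
  // Final call without { stream: true } flushes any buffered bytes
  // (emitting U+FFFD only if the stream genuinely ended mid-character).
  text += decoder.decode()
  return text
}
```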
