Fix stdio encoding issue: Enforce explicit UTF-8 for correct Unicode handling #73

Merged

willibrandon
Contributor

Overview

This pull request addresses the encoding issue reported in #35, where JSON-RPC messages printed in the terminal showed corrupted characters (e.g., Chinese characters displayed as question marks).

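The problem stemmed from the stdio transport layer relying on the system default encoding (for example, Windows-1252 on some Windows configurations) instead of explicitly using UTF-8. Characters that the default encoding cannot represent are replaced during encoding, which is why they appeared as question marks.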
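
A minimal sketch of this failure mode, not code from this PR: `Encoding.ASCII` is used here as a stand-in for a legacy single-byte system encoding (an assumption made so the example runs on any platform), and the `EncodingLossDemo` class and `RoundTrip` helper are hypothetical names. Any encoding that cannot represent a character substitutes `?` for it, while UTF-8 preserves the text.

```csharp
using System;
using System.Text;

static class EncodingLossDemo
{
    // Encodes text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding) =>
        encoding.GetString(encoding.GetBytes(text));

    static void Main()
    {
        const string text = "上下文";

        // ASCII stands in for a legacy system default encoding (assumption):
        // each character it cannot represent is replaced with '?'.
        Console.WriteLine(RoundTrip(text, Encoding.ASCII)); // ???

        // BOM-less UTF-8, as used by the fix, round-trips the text unchanged.
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        Console.WriteLine(RoundTrip(text, utf8) == text);   // True
    }
}
```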

Changes

  • Encoding Fix: Updated both client and server transports to explicitly use UTF8Encoding (without BOM) for reading and writing:
    // Create streams with explicit UTF-8 encoding to ensure proper Unicode character handling
    // This is especially important for non-ASCII characters like Chinese text and emoji
    var utf8Encoding = new UTF8Encoding(false); // No BOM
    _stdInWriter = new StreamWriter(_process.StandardInput.BaseStream, utf8Encoding) { AutoFlush = true };
    _stdOutReader = new StreamReader(_process.StandardOutput.BaseStream, utf8Encoding);
  • Tests Added: New tests verify that both BMP Unicode characters (Chinese: "上下文伺服器") and non-BMP Unicode characters (emoji: 🔍🚀👍) are preserved during transport.
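
The round-trip property those tests check can be sketched in isolation. This is not the test code from this PR: it swaps the child process streams for an in-memory `MemoryStream`, and the `Utf8StreamRoundTrip` class and its `RoundTrip` helper are hypothetical names introduced for illustration. Only the `UTF8Encoding(false)`, `StreamWriter`, and `StreamReader` usage mirrors the change above.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8StreamRoundTrip
{
    // Writes one line to an in-memory stream and reads it back, using the
    // same BOM-less UTF-8 encoding for both the writer and the reader.
    public static string RoundTrip(string line)
    {
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        using var buffer = new MemoryStream();

        // leaveOpen keeps the MemoryStream usable after the writer is disposed;
        // disposing the writer flushes the encoded bytes into the buffer.
        using (var writer = new StreamWriter(buffer, utf8, bufferSize: 1024, leaveOpen: true))
        {
            writer.WriteLine(line);
        }

        buffer.Position = 0; // rewind so the reader starts at the first byte
        using var reader = new StreamReader(buffer, utf8);
        return reader.ReadLine() ?? string.Empty;
    }

    static void Main()
    {
        // One BMP sample and one non-BMP (surrogate pair) sample from the PR text.
        foreach (var sample in new[] { "上下文伺服器", "🔍🚀👍" })
        {
            Console.WriteLine(RoundTrip(sample) == sample);
        }
    }
}
```

Using an in-memory stream keeps the sketch self-contained. A test against the real transport would presumably need to spawn a process and exchange the messages over its stdin and stdout.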

Impact

This fix resolves the Unicode character corruption by ensuring that the transport layer uses consistent UTF-8 encoding, improving the reliability of message display in all locales. The changes maintain the existing API surface while enhancing support for international characters.

Next Steps

Please review the changes and let me know if any further modifications or additional tests are needed.

@willibrandon willibrandon mentioned this pull request Mar 23, 2025
willibrandon and others added 2 commits March 24, 2025 08:53
This change replaces the default system encoding with an explicit UTF8Encoding (without BOM)
for both client and server transports. This ensures proper handling of Unicode characters,
including Chinese characters and emoji.

- Use UTF8Encoding explicitly for StreamReader and StreamWriter.
- Add tests for Chinese characters ("上下文伺服器") and emoji (🔍🚀👍) to confirm the fix.

Fixes modelcontextprotocol#35.
@stephentoub stephentoub self-assigned this Mar 24, 2025
@eiriktsarpalis eiriktsarpalis linked an issue Mar 24, 2025 that may be closed by this pull request
@stephentoub stephentoub force-pushed the fix/utf8-stdio-encoding branch from 1162a06 to f47d99f Compare March 24, 2025 13:46
@stephentoub stephentoub merged commit 36d5019 into modelcontextprotocol:main Mar 24, 2025
9 of 13 checks passed
@stephentoub
Contributor

Thanks, @willibrandon.

Linked issue: Encoding issue