Wednesday, February 20, 2008

Writing XML with UTF-8 Encoding using XmlTextWriter and StringWriter







If you want to use XmlTextWriter to write XML into a StringBuilder you can create the XmlTextWriter like this:

StringBuilder builder = new StringBuilder();
XmlWriter writer = new XmlTextWriter(new StringWriter(builder));


But this produces an XML declaration with an encoding of UTF-16. XmlTextWriter takes the encoding from the TextWriter it is given, and StringWriter always reports UTF-16 (the in-memory encoding of a .NET string). There doesn't seem to be a straightforward way to make the declaration say UTF-8 in this setup.

You can, of course, use a MemoryStream instead of a StringWriter and then call Encoding.UTF8.GetString(...) to convert the bytes to a string, but when we did this the resulting string contained non-printable characters (most likely the UTF-8 byte order mark that the writer emits at the start of the stream), which we don't want.
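
To make the MemoryStream problem concrete, here is a minimal sketch. It assumes the non-printable characters were the UTF-8 byte order mark (U+FEFF), which the post does not state outright. The class name BomDemo and the helper WriteAndDecode are made up for illustration. The sketch writes the same small document twice, once with Encoding.UTF8 and once with new UTF8Encoding(false), which has no preamble, and checks the first character of each decoded string.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomDemo
{
    // Hypothetical helper: writes a tiny document to a MemoryStream using the
    // given encoding, then decodes the raw bytes back into a string.
    public static string WriteAndDecode(Encoding encoding)
    {
        MemoryStream stream = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(stream, encoding);
        writer.WriteStartDocument();
        writer.WriteElementString("greeting", "hello");
        writer.WriteEndDocument();
        writer.Flush();

        // ToArray copies only the bytes actually written to the stream.
        // GetString decodes every byte it is given, including a leading BOM.
        return Encoding.UTF8.GetString(stream.ToArray());
    }

    public static void Main()
    {
        string withPreamble = WriteAndDecode(Encoding.UTF8);
        string withoutPreamble = WriteAndDecode(new UTF8Encoding(false));

        // Is the first character the byte order mark?
        Console.WriteLine(withPreamble[0] == '\uFEFF');
        // Does the string start directly with the XML declaration?
        Console.WriteLine(withoutPreamble[0] == '<');
    }
}
```

If the assumption holds, passing a UTF8Encoding constructed with false is one way to keep the MemoryStream approach without the stray leading character.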

The solution is to subclass StringWriter and override the Encoding property. It sounds like overkill, but it works very well. Just create the following class (based on Jon Skeet's class):

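using System.IO;
using System.Text;

public class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding encoding;

    public StringWriterWithEncoding(StringBuilder builder, Encoding encoding)
        : base(builder)
    {
        this.encoding = encoding;
    }

    // XmlTextWriter reads this property to fill in the encoding
    // attribute of the XML declaration.
    public override Encoding Encoding
    {
        get { return encoding; }
    }
}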

Then use StringWriterWithEncoding instead of StringWriter in your XmlTextWriter.
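
As a rough end-to-end sketch of that swap (the Demo class, the BuildXml helper, and the greeting element are made up for illustration; the StringWriterWithEncoding class is repeated only so the snippet compiles on its own):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Same class as in the post, repeated so this sketch is self-contained.
public class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding encoding;

    public StringWriterWithEncoding(StringBuilder builder, Encoding encoding)
        : base(builder)
    {
        this.encoding = encoding;
    }

    public override Encoding Encoding
    {
        get { return encoding; }
    }
}

public static class Demo
{
    // Hypothetical helper: builds a tiny document and returns it as a string.
    public static string BuildXml()
    {
        StringBuilder builder = new StringBuilder();
        using (XmlTextWriter writer = new XmlTextWriter(
            new StringWriterWithEncoding(builder, Encoding.UTF8)))
        {
            writer.WriteStartDocument();
            writer.WriteElementString("greeting", "hello");
            writer.WriteEndDocument();
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // The declaration should now name utf-8 instead of utf-16.
        Console.WriteLine(BuildXml());
    }
}
```

Note that only the declaration changes. The string itself is still an ordinary .NET string, so it has to be encoded as UTF-8 when it is eventually written to a file or sent over the wire, otherwise the declaration and the actual bytes will disagree.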

10 comments:

Anonymous said...

Try the following:

StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = Encoding.UTF8;
settings.CloseOutput = false;
settings.CheckCharacters = true;

XmlWriter w = XmlWriter.Create(sb, settings);

thoward37 said...

Here's a slightly more complete version (sorry no html formatting for the code listing):



public class StringWriterWithEncoding : StringWriter
{
    private Encoding _encoding;

    public StringWriterWithEncoding()
        : base() { }

    public StringWriterWithEncoding(IFormatProvider formatProvider)
        : base(formatProvider) { }

    public StringWriterWithEncoding(StringBuilder sb)
        : base(sb) { }

    public StringWriterWithEncoding(StringBuilder sb, IFormatProvider formatProvider)
        : base(sb, formatProvider) { }

    public StringWriterWithEncoding(Encoding encoding)
        : base()
    {
        _encoding = encoding;
    }

    public StringWriterWithEncoding(IFormatProvider formatProvider, Encoding encoding)
        : base(formatProvider)
    {
        _encoding = encoding;
    }

    public StringWriterWithEncoding(StringBuilder sb, Encoding encoding)
        : base(sb)
    {
        _encoding = encoding;
    }

    public StringWriterWithEncoding(StringBuilder sb, IFormatProvider formatProvider, Encoding encoding)
        : base(sb, formatProvider)
    {
        _encoding = encoding;
    }

    public override Encoding Encoding
    {
        get
        {
            return (null == _encoding) ? base.Encoding : _encoding;
        }
    }
}

Anonymous said...

To avoid the undesirable chars when using Encoding.UTF8.GetString on a memory stream, load the memory stream into a text reader and then read from that:

Stream s = new MemoryStream();
XmlWriter xw = new XmlTextWriter(s, Encoding.UTF8);
// Write XML to xw, then flush so everything reaches the stream
xw.Flush();

// Now read back
s.Seek(0, SeekOrigin.Begin);
TextReader tr = new StreamReader(s);
string xml = tr.ReadToEnd();

Chris said...

You can cut the number of lines down quite a bit: just use an XmlWriter (not XmlTextWriter), specify your tab settings, and use a MemoryStream with a StreamReader as Anonymous said, though it needs UTF-8 forced for the byte order to be correct. I put two example snippets for writing UTF-8 XML here, feel free to take them.

Ling said...

I ran into the same issue trying to write XML with UTF-8 encoding to a StringWriter. I tried all the suggestions from this thread. Only the suggestion from thoward37, with a subclass of StringWriter, worked.

andi said...

The hint to override the Encoding property was great!

Thank You!

Avi said...

Thanks :)

UttaM said...

Great solution.... really good..works well...
