Faster string to UTF-8 encoding in Java

Update: After further investigation, the performance improvements didn’t hold up: fixing some newly uncovered correctness bugs forced code changes that erased the gains. The patch was rolled back, so we’re stuck with the same old encoding mechanism. Sigh.

I’ve spent a lot of time profiling Thrift serialization and deserialization, and one thing that has always stood out is Java’s UTF-8 String encoding and decoding. I’ve monkeyed around with different ways to make this faster, but I’ve always come up short of any big improvements.

The culprit is the notoriously slow String.getBytes("UTF-8") call. While it is an incredibly convenient method for getting your string’s UTF-8 representation as a byte array, for some reason, it takes up a ton of time. To boot, the method can only encode into a brand new byte array, so you pay twice: once for allocating that array, and again for copying the encoded bytes into your final output. (On the decoding side, however, you can pass in a buffer, offset, and length, so there’s no need for an extra array copy if you know what you’re doing.)
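
To make the asymmetry concrete, here is a minimal sketch of both directions. The class and method names are made up for illustration, and it uses StandardCharsets.UTF_8 (Java 7 and later) rather than the "UTF-8" name from the post, only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Thrift.
class GetBytesCopyExample {
    /** Encodes s and places it in out at offset; returns the number of bytes written. */
    static int encodeInto(String s, byte[] out, int offset) {
        // getBytes always allocates a fresh array for the encoded form...
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        // ...so getting it into an existing output buffer costs an extra copy.
        System.arraycopy(encoded, 0, out, offset, encoded.length);
        return encoded.length;
    }

    /** Decodes straight from a slice of buf; no intermediate byte array is needed. */
    static String decodeFrom(byte[] buf, int offset, int length) {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] out = new byte[64];
        // "h\u00e9llo" has one non-ASCII character, which takes two bytes in UTF-8.
        int n = encodeInto("h\u00e9llo", out, 4);
        System.out.println(n + " bytes written, round trip ok: "
                + "h\u00e9llo".equals(decodeFrom(out, 4, n)));
    }
}
```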

In looking for alternatives, I found my way to CharsetEncoder. I was pretty hopeful about this one – it doesn’t take the charset name as a string on every call, and it allows encoding into a ByteBuffer, so it seemed promising. Alas, it turned out to be slower than getBytes, if only marginally. It seems like this particular class is more suited to situations when you need more control over the exact same encoding process that getBytes taps into.
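
For reference, encoding into a caller-supplied ByteBuffer with a reused CharsetEncoder looks roughly like the sketch below. This is not the benchmark code from the post; the class name and error handling are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper, not part of Thrift.
class CharsetEncoderExample {
    // Created once and reused. CharsetEncoder is not thread-safe,
    // so keep one instance per thread.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    /** Encodes s into out at its current position; returns the number of bytes written. */
    int encodeInto(String s, ByteBuffer out) {
        int start = out.position();
        encoder.reset();
        // endOfInput = true: the whole string is handed over in one call.
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (buffer too small) or malformed input such as a lone surrogate.
            throw new IllegalStateException("encoding failed: " + result);
        }
        return out.position() - start;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        int n = new CharsetEncoderExample().encodeInto("h\u00e9llo", buf);
        System.out.println(n + " bytes written");
    }
}
```

The encoder writes directly into the destination buffer, which avoids the extra array and copy, but as noted above that did not translate into a speedup in practice.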

Finally, luck shone on me when I was looking over the Thrift/Protobuf Comparison benchmark project. The project’s maintainer pointed out that DataOutputStream’s writeUTF method was the fastest string encoder in sight, and that some other serialization frameworks (like Kryo) had poached that methodology for their own purposes. Tantalized, I quickly set up a benchmark shootout between getBytes, CharsetEncoder, and the writeUTF method. Stunningly, writeUTF turned out to be about 2x as fast as the other two methods! Now we’re getting somewhere.
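
A small sketch of calling writeUTF directly is below. It is not the Thrift Utf8Helper code, and the class name is made up. One thing worth knowing when comparing it with getBytes: writeUTF emits a 2-byte length prefix followed by Java’s “modified UTF-8”, so its output is not byte-for-byte standard UTF-8 for every string. The post does not say which correctness bugs surfaced later, so this is not a claim about them.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Thrift.
class WriteUtfExample {
    /** Returns exactly what writeUTF emits: a 2-byte length, then modified UTF-8. */
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // Throws UTFDataFormatException if the encoded form exceeds 65535 bytes.
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Plain ASCII, a NUL character, and a supplementary character (U+1F600).
        String[] samples = { "hello", "\0", "\uD83D\uDE00" };
        for (String s : samples) {
            int standard = s.getBytes(StandardCharsets.UTF_8).length;
            int viaWriteUtf = writeUtfBytes(s).length - 2; // drop the length prefix
            System.out.println("getBytes: " + standard + " bytes, writeUTF: " + viaWriteUtf + " bytes");
        }
    }
}
```

For ASCII text the payloads match. The NUL character takes two bytes instead of one, and a supplementary character is written as two 3-byte surrogate encodings (six bytes) instead of the standard four.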

I have followed suit and added this method to Thrift. You can find it in TRUNK and the upcoming 0.3 release in the Utf8Helper class. If your Thrift structs are string-heavy, then this will give you a substantial boost in performance. Try it out and let me know how it goes!

Follow Bryan on Twitter: @bryanduxbury

2 Comments

  1. Posted June 25, 2010 at 12:04 am

    Thanks for the post :)

    I’m in the middle of writing a fault-tolerant queue on top of Thrift (Java + Python server implementations) and am finding all of these optimizations incredibly interesting.

    BTW, the link to the Utf8Helper class is invalid. I was able to dig up the commit link: http://mail-archives.apache.org/mod_mbox/incubator-thrift-commits/201004.mbox/%3C20100425152011.3DFA123888FE@eris.apache.org%3E

  2. Posted June 25, 2010 at 9:20 am

    Kunal – the link doesn’t work anymore because, as I say in the update at the top of the post, the change ended up not working correctly and had to be reverted.
