Converting String to UTF8?

Let's see what happens when a random string is converted to UTF-8 using different methods.

[code lang="java"]
package org.kari.test.string;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

import org.apache.log4j.Logger;
import org.kari.log.LogUtil;
import org.kari.util.DirectByteArrayOutputStream;

/**
 * Test UTF8 conversion
 *
 * @author kari
 */
public class UTF8Test {
    public static final Logger LOG = LogUtil.getLogger("utf8");

    private static final DirectByteArrayOutputStream mOutBuffer = new DirectByteArrayOutputStream(100000);
    private static OutputStreamWriter mWriter;

    static final class ByteArrayReference {
        private byte[] mBuffer;
        private int mOffset;
        private int mLength;

        public void set(byte[] pBuffer, int pOffset, int pLength) {
            mBuffer = pBuffer;
            mOffset = pOffset;
            mLength = pLength;
        }

        public void clear() {
            mBuffer = null;
            mOffset = 0;
            mLength = 0;
        }

        public byte[] getBuffer() {
            return mBuffer;
        }

        public int getOffset() {
            return mOffset;
        }

        public int getLength() {
            return mLength;
        }

    }

    public static abstract class Test {
        public abstract void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException;
    }

    public static final class WriterTest extends Test {
        @Override
        public void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException
        {
            DirectByteArrayOutputStream out = mOutBuffer;
            OutputStreamWriter writer = mWriter;

            out.reset();
            writer.write(pStr);
            writer.flush();

            pRef.set(out.getBuffer(), 0, out.size());

            // System.out.print('.');
        }
    }

    public static final class BasicTest extends Test {
        @Override
        public void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException
        {
            byte[] data = pStr.getBytes("UTF-8");
            pRef.set(data, 0, data.length);
            // System.out.print('.');
        }
    }

    private UTF8Test() throws Exception {
        mWriter = new OutputStreamWriter(mOutBuffer, "UTF-8");
    }

    public ByteArrayReference test(String str, Test pTest)
        throws Exception
    {
        try {
            ByteArrayReference ref = new ByteArrayReference();
            System.out.println("string len=" + str.length());

            long startTime = System.nanoTime();
            int COUNT = 100;
            for (int i = 0; i < COUNT; i++) {
                ref.clear();
                pTest.convert(str, ref);
            }
            long endTime = System.nanoTime();

            long diff = endTime - startTime;
            System.out.println();
            System.out.println(" utf8 len=" + ref.getLength());
            System.out.println("total nano = " + diff + " nanos");
            System.out.println(" per nano = " + (diff / (double)COUNT) + " nanos");
            System.out.println("total time = " + (diff / (1000.0 * 1000)) + " ms");
            System.out.println(" per time = " + ((diff / (1000.0 * 1000)) / (double)COUNT) + " ms");
            return ref;
        } catch (Exception e) {
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            test();
        } catch (Exception e) {
            LOG.error("Failed", e);
        }
    }

    private static void test() throws Exception {
        UTF8Test test = new UTF8Test();
        ByteArrayReference ref1;
        ByteArrayReference ref2;
        {
            String str = createLongString();
            System.out.println("====================");
            System.out.println("=====BASIC==========");
            System.out.println("====================");
            ref1 = test.test(str, new BasicTest());
            System.out.println("====================");
            System.out.println("=====WRITER=========");
            System.out.println("====================");
            ref2 = test.test(str, new WriterTest());
        }
        System.out.println("====================");
        System.out.println("equal= " + equals(ref1, ref2));
    }

    public static boolean equals(
        ByteArrayReference ref1,
        ByteArrayReference ref2)
    {
        boolean result = false;
        result = ref1.getLength() == ref2.getLength();
        if (result) {
            byte[] buf1 = ref1.getBuffer();
            byte[] buf2 = ref2.getBuffer();
            int offset1 = ref1.getOffset();
            int offset2 = ref2.getOffset();
            for (int i = 0; result && i < ref1.getLength(); i++) {
                result = buf1[offset1 + i] == buf2[offset2 + i];
            }
        }
        return result;
    }

    private static String createLongString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000 * 1000; i++) {
            char ch = (char)(32 + (60000 * Math.random()));
            sb.append(ch);
        }
        return sb.toString();
    }
}
[/code]

The test was run with the following memory settings on Sun Java 1.6.0_20-b02 (32-bit):

[code]
-Xms100M -Xmx400M
[/code]

And the results are:
[code]
====================
=====BASIC==========
====================
string len=1000000

utf8 len=2897030
total nano = 3675792784 nanos
per nano = 3.675792784E7 nanos
total time = 3675.792784 ms
per time = 36.75792784 ms
====================
=====WRITER=========
====================
string len=1000000

utf8 len=2897030
total nano = 3252002400 nanos
per nano = 3.2520024E7 nanos
total time = 3252.0024 ms
per time = 32.520024 ms
====================
equal= true
[/code]

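It seems that using a Writer for the conversion is slightly faster in this test run (about 32.5 ms versus 36.8 ms per conversion, i.e. roughly 12% less time). However, in real life I believe the difference can be even greater because of the memory thrashing that String.getBytes() causes: it allocates a fresh byte[] on every call.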

A faster approach could be to extract the encoder from the JDK's UTF_8 charset class (i.e. re-implement it). The caveat is that a re-implementation can easily introduce subtle bugs, since most of the internal character set encoding logic must be duplicated in order to do so.
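To give a rough idea of what such a re-implementation involves, here is a minimal hand-rolled encoder sketch. The class `ManualUtf8` and its `encode` method are hypothetical names, not code from the JDK or from the articles referenced below. Unpaired surrogates are written as `?`, which is an assumption chosen to match what String.getBytes() does. The caller supplies the output buffer, so no byte[] is allocated per call.

```java
/** Hypothetical hand-rolled UTF-8 encoder; a sketch, not the JDK's implementation. */
final class ManualUtf8 {
    private ManualUtf8() {}

    /**
     * Encodes pStr into pOut starting at index 0 and returns the number of
     * bytes written. The caller must supply at least 3 * pStr.length() bytes
     * (worst case: 3 bytes per char; a surrogate pair is 2 chars -> 4 bytes).
     */
    static int encode(String pStr, byte[] pOut) {
        int len = pStr.length();
        int pos = 0;
        for (int i = 0; i < len; i++) {
            char ch = pStr.charAt(i);
            if (ch < 0x80) {
                // 1 byte: 0xxxxxxx
                pOut[pos++] = (byte) ch;
            } else if (ch < 0x800) {
                // 2 bytes: 110xxxxx 10xxxxxx
                pOut[pos++] = (byte) (0xC0 | (ch >> 6));
                pOut[pos++] = (byte) (0x80 | (ch & 0x3F));
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(pStr.charAt(i + 1))) {
                // Valid surrogate pair -> 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                int cp = Character.toCodePoint(ch, pStr.charAt(++i));
                pOut[pos++] = (byte) (0xF0 | (cp >> 18));
                pOut[pos++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                pOut[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                pOut[pos++] = (byte) (0x80 | (cp & 0x3F));
            } else if (ch >= 0xD800 && ch <= 0xDFFF) {
                // Unpaired surrogate: assumed replacement '?'
                pOut[pos++] = (byte) '?';
            } else {
                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                pOut[pos++] = (byte) (0xE0 | (ch >> 12));
                pOut[pos++] = (byte) (0x80 | ((ch >> 6) & 0x3F));
                pOut[pos++] = (byte) (0x80 | (ch & 0x3F));
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        String str = "Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4 \u20ac";
        byte[] buf = new byte[str.length() * 3];
        int n = encode(str, buf);
        System.out.println("chars=" + str.length() + " bytes=" + n);
    }
}
```

Even this short version has to get four branches and the surrogate handling right, which is exactly where subtle bugs tend to creep in. Comparing its output against String.getBytes() for a wide range of inputs is a sensible safety net.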

References:
Faster string to UTF-8 encoding in Java
Fast alternative to java.nio.charset.Charset.decode(..)/encode(..)

Update: 16.6.2010
For completeness, I also tried what happens when Charset.encode() (which uses a CharsetEncoder internally) is used:
[code lang="java"]
Charset cs = Charset.forName("UTF-8");
ByteBuffer data = cs.encode(pStr);
[/code]

The net result is that this is much slower (over 50% slower) than String.getBytes(). The main reason for the slowness is likely that this API cannot use the optimized logic inside String, which has direct char[] access to the original character data.
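A variation that was not measured here is to keep one CharsetEncoder and one preallocated ByteBuffer around instead of calling Charset.encode() for every string. The sketch below only illustrates the API usage. The class name `ReusableEncoder` is hypothetical, and no claim is made about how its speed compares to the variants tested above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: reuses one encoder and one output buffer across calls. */
final class ReusableEncoder {
    // REPLACE mirrors String.getBytes(), which substitutes bad input instead of failing.
    private final CharsetEncoder mEncoder = Charset.forName("UTF-8")
        .newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private ByteBuffer mOut;

    ReusableEncoder(int pInitialCapacity) {
        mOut = ByteBuffer.allocate(pInitialCapacity);
    }

    /**
     * Encodes pStr into the shared buffer. The returned buffer is only valid
     * until the next call to encode().
     */
    ByteBuffer encode(String pStr) {
        // Grow the shared buffer only when the worst case does not fit.
        int needed = (int) Math.ceil(pStr.length() * (double) mEncoder.maxBytesPerChar());
        if (mOut.capacity() < needed) {
            mOut = ByteBuffer.allocate(needed);
        }
        mOut.clear();
        mEncoder.reset();

        CoderResult result = mEncoder.encode(CharBuffer.wrap(pStr), mOut, true);
        if (result.isUnderflow()) {
            result = mEncoder.flush(mOut);
        }
        if (!result.isUnderflow()) {
            throw new IllegalStateException("unexpected coder result: " + result);
        }
        mOut.flip();
        return mOut;
    }

    public static void main(String[] args) {
        ReusableEncoder encoder = new ReusableEncoder(1024);
        ByteBuffer data = encoder.encode("Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4");
        System.out.println("encoded bytes=" + data.remaining());
    }
}
```

Note that `CharBuffer.wrap(pStr)` still reads the characters through the CharSequence interface, so the limitation described above (no direct char[] access) applies to this variant as well.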

Notice 1:
It seems that the speed of OutputStreamWriter comes at a cost. The logic inside OSW uses StreamEncoder, which allocates a temporary char[] for the *whole* string contents in order to copy the chars out of the String for fast access. If the strings are large, this can cause problems (!).
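If that temporary allocation is a concern, one possible workaround is to hand the string to the writer in bounded slices with Writer.write(String, int, int), so that any per-call copy is limited to the slice size. This is only a sketch, and it assumes the temporary char[] is sized by the length argument of the call. `ChunkedWrite`, `writeChunked` and the chunk size of 8192 are hypothetical names and values.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

/** Hypothetical helper: writes a String to a Writer in bounded slices. */
final class ChunkedWrite {
    private ChunkedWrite() {}

    /** Writes pStr in slices of roughly pChunkSize chars (pChunkSize must be > 0). */
    static void writeChunked(Writer pWriter, String pStr, int pChunkSize)
        throws IOException
    {
        int len = pStr.length();
        int off = 0;
        while (off < len) {
            int end = Math.min(off + pChunkSize, len);
            // Do not end a slice on a high surrogate: keep the pair in one write() call.
            if (end < len && Character.isHighSurrogate(pStr.charAt(end - 1))) {
                end++;
            }
            pWriter.write(pStr, off, end - off);
            off = end;
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append((char) ('a' + (i % 26)));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        writeChunked(writer, sb.toString(), 8192);
        writer.flush();
        System.out.println("encoded bytes=" + out.size());
    }
}
```

The trade-off is more write() calls per string, so whether this helps or hurts would have to be measured in the same way as the tests above.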

Notice 2:
When the test was changed to use 100-char strings with 1 million iterations, it turned out that String.getBytes() was practically as fast as the Writer (or faster, depending on whether an extra gc() caused by the allocation in StreamEncoder is hit or not).

Notice 3:
On my not-so-new hardware, I got String encoding speeds between 54M/s for plain ASCII/Latin-1 chars (random characters in the range 32 – 255) and 18M/s when "high" Unicode chars were included (random characters in the range 32 – 60000). Not stellar performance, but what is noteworthy is that for non-western users the speed is less than 50% of that (and the encoded byte[] storage is tripled), so this needs to be taken into account when trying to "optimize" strings.
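The storage growth follows directly from the UTF-8 encoding rules: code points below 0x80 take one byte, those below 0x800 take two, the rest of the Basic Multilingual Plane takes three, and supplementary code points take four. The sketch below computes the encoded size of a string without actually encoding it. `Utf8Size` is a hypothetical helper, and unpaired surrogates are counted as one byte on the assumption that they are replaced with `?`.

```java
/** Hypothetical helper: computes the UTF-8 encoded size of a String. */
final class Utf8Size {
    private Utf8Size() {}

    static int utf8Length(String pStr) {
        int bytes = 0;
        int len = pStr.length();
        for (int i = 0; i < len; ) {
            // codePointAt() joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as-is (0xD800..0xDFFF).
            int cp = pStr.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                bytes += 1; // unpaired surrogate, assumed to become '?'
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String western = "Hello world";
        String eastern = "\u3053\u3093\u306b\u3061\u306f"; // five hiragana characters
        System.out.println(western.length() + " chars -> " + utf8Length(western) + " bytes");
        System.out.println(eastern.length() + " chars -> " + utf8Length(eastern) + " bytes");
    }
}
```

A helper like this can also be used to size an output buffer exactly before encoding, at the price of one extra pass over the string.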

12.6.2010 / java
