Let's see what happens when we try to convert a random string into UTF-8 using different methods.
[code lang="java"]
package org.kari.test.string;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

import org.apache.log4j.Logger;
import org.kari.log.LogUtil;
import org.kari.util.DirectByteArrayOutputStream;

/**
 * Test UTF8 conversion
 *
 * @author kari
 */
public class UTF8Test {
    public static final Logger LOG = LogUtil.getLogger("utf8");

    private static final DirectByteArrayOutputStream mOutBuffer = new DirectByteArrayOutputStream(100000);
    private static OutputStreamWriter mWriter;

    static final class ByteArrayReference {
        private byte[] mBuffer;
        private int mOffset;
        private int mLength;

        public void set(byte[] pBuffer, int pOffset, int pLength) {
            mBuffer = pBuffer;
            mOffset = pOffset;
            mLength = pLength;
        }

        public void clear() {
            mBuffer = null;
            mOffset = 0;
            mLength = 0;
        }

        public byte[] getBuffer() {
            return mBuffer;
        }

        public int getOffset() {
            return mOffset;
        }

        public int getLength() {
            return mLength;
        }
    }

    public static abstract class Test {
        public abstract void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException;
    }

    public static final class WriterTest extends Test {
        @Override
        public void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException
        {
            DirectByteArrayOutputStream out = mOutBuffer;
            OutputStreamWriter writer = mWriter;
            out.reset();
            writer.write(pStr);
            writer.flush();
            pRef.set(out.getBuffer(), 0, out.size());
//            System.out.print('.');
        }
    }

    public static final class BasicTest extends Test {
        @Override
        public void convert(
            String pStr,
            ByteArrayReference pRef)
            throws IOException
        {
            byte[] data = pStr.getBytes("UTF-8");
            pRef.set(data, 0, data.length);
//            System.out.print('.');
        }
    }

    private UTF8Test() throws Exception {
        mWriter = new OutputStreamWriter(mOutBuffer, "UTF-8");
    }

    public ByteArrayReference test(String str, Test pTest)
        throws Exception
    {
        ByteArrayReference ref = new ByteArrayReference();
        System.out.println("string len=" + str.length());

        long startTime = System.nanoTime();
        int COUNT = 100;
        for (int i = 0; i < COUNT; i++) {
            ref.clear();
            pTest.convert(str, ref);
        }
        long endTime = System.nanoTime();
        long diff = endTime - startTime;

        System.out.println();
        System.out.println("  utf8 len=" + ref.getLength());
        System.out.println("total nano = " + diff + " nanos");
        System.out.println("  per nano = " + (diff / (double)COUNT) + " nanos");
        System.out.println("total time = " + (diff / (1000.0 * 1000)) + " ms");
        System.out.println("  per time = " + ((diff / (1000.0 * 1000)) / (double)COUNT) + " ms");
        return ref;
    }

    public static void main(String[] args) {
        try {
            test();
        } catch (Exception e) {
            LOG.error("Failed", e);
        }
    }

    private static void test()
        throws Exception
    {
        UTF8Test test = new UTF8Test();
        ByteArrayReference ref1;
        ByteArrayReference ref2;
        {
            String str = createLongString();
            System.out.println("====================");
            System.out.println("=====BASIC==========");
            System.out.println("====================");
            ref1 = test.test(str, new BasicTest());

            System.out.println("====================");
            System.out.println("=====WRITER=========");
            System.out.println("====================");
            ref2 = test.test(str, new WriterTest());
        }
        System.out.println("====================");
        System.out.println("equal= " + equals(ref1, ref2));
    }

    public static boolean equals(
        ByteArrayReference ref1,
        ByteArrayReference ref2)
    {
        boolean result = ref1.getLength() == ref2.getLength();
        if (result) {
            byte[] buf1 = ref1.getBuffer();
            byte[] buf2 = ref2.getBuffer();
            int offset1 = ref1.getOffset();
            int offset2 = ref2.getOffset();
            for (int i = 0; result && i < ref1.getLength(); i++) {
                result = buf1[offset1 + i] == buf2[offset2 + i];
            }
        }
        return result;
    }

    private static String createLongString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000 * 1000; i++) {
            char ch = (char)(32 + (60000 * Math.random()));
            sb.append(ch);
        }
        return sb.toString();
    }
}
[/code]
The test was run with the following memory settings on Sun Java 1.6.0_20-b02 (32-bit):
[code]
-Xms100M -Xmx400M
[/code]
And the results are:
[code]
====================
=====BASIC==========
====================
string len=1000000
utf8 len=2897030
total nano = 3675792784 nanos
per nano = 3.675792784E7 nanos
total time = 3675.792784 ms
per time = 36.75792784 ms
====================
=====WRITER=========
====================
string len=1000000
utf8 len=2897030
total nano = 3252002400 nanos
per nano = 3.2520024E7 nanos
total time = 3252.0024 ms
per time = 32.520024 ms
====================
equal= true
[/code]
It seems that using a Writer for the conversion is slightly faster in this test run. In real life, however, I believe the difference can be even greater, due to the memory thrashing that String.getBytes() causes.
A faster approach could be to extract the encoder from the UTF_8 class (i.e. re-implement it). The caveat is naturally that such a re-implementation can easily introduce subtle bugs, since most of the internal character-set encoding logic must be duplicated in order to do so.
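As an illustration of what such a re-implementation involves, here is a minimal sketch of a hand-rolled UTF-8 encoder (my own sketch, not the JDK's UTF_8 code; note that it encodes unpaired surrogates as plain three-byte sequences, whereas getBytes() would replace them with '?'):

```java
public final class ManualUtf8 {
    /**
     * Encode a String to UTF-8 without the Charset machinery.
     * Worst case is 3 bytes per char (a surrogate pair is 2 chars -> 4 bytes).
     */
    public static byte[] encode(String s) {
        byte[] buf = new byte[s.length() * 3];
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                // 1 byte: U+0000..U+007F
                buf[n++] = (byte) c;
            } else if (c < 0x800) {
                // 2 bytes: U+0080..U+07FF
                buf[n++] = (byte) (0xC0 | (c >> 6));
                buf[n++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // 4 bytes: supplementary plane via a surrogate pair
                int cp = Character.toCodePoint(c, s.charAt(++i));
                buf[n++] = (byte) (0xF0 | (cp >> 18));
                buf[n++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                buf[n++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[n++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // 3 bytes: rest of the BMP (unpaired surrogates fall in here too)
                buf[n++] = (byte) (0xE0 | (c >> 12));
                buf[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }
}
```

Comparing its output against String.getBytes("UTF-8") for well-formed input is a cheap sanity check before trusting a re-implementation like this.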
References:
Faster string to UTF-8 encoding in Java
Fast alternative to java.nio.charset.Charset.decode(..)/encode(..)
Update: 16.6.2010
For completeness, I also tried what happens if Charset.encode() (the CharsetEncoder path) is used:
[code lang="java"]
Charset cs = Charset.forName("UTF-8");
ByteBuffer data = cs.encode(pStr);
[/code]
The net result is that this is much slower (over 50% slower) than String.getBytes(). The main reason for the slowness is likely that this API cannot use the optimized logic in String, which has direct char[] access to the original character data.
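Part of that overhead can be shaved off by reusing the encoder and the output buffer instead of calling Charset.encode() each time. A sketch (my own helper, not from the test above; it still cannot reach into String's char[], so it only removes the per-call Charset lookup and ByteBuffer allocation):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public final class ReusableEncoder {
    // CharsetEncoder is stateful and not thread-safe, so keep one per thread.
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            new ThreadLocal<CharsetEncoder>() {
                @Override protected CharsetEncoder initialValue() {
                    return Charset.forName("UTF-8").newEncoder();
                }
            };

    /**
     * Encodes s into the caller-supplied buffer and returns the byte count.
     * The buffer must be large enough; on overflow this throws an
     * unchecked BufferOverflowException via CoderResult.throwException().
     */
    public static int encode(String s, ByteBuffer out) throws CharacterCodingException {
        CharsetEncoder enc = ENCODER.get();
        enc.reset();          // required before reusing an encoder
        out.clear();
        CoderResult cr = enc.encode(CharBuffer.wrap(s), out, true);
        if (!cr.isUnderflow()) cr.throwException();
        cr = enc.flush(out);
        if (!cr.isUnderflow()) cr.throwException();
        out.flip();
        return out.remaining();
    }
}
```

Note that a fresh CharsetEncoder reports malformed input by default, unlike String.getBytes(), which silently replaces it.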
Notice 1:
It seems that the speed of OutputStreamWriter comes at a cost. Internally OSW uses StreamEncoder, which allocates a temporary char[] for the *whole* string contents in order to copy the chars out of the String for fast access. If strings are large, this can cause problems (!).
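One hedge against that allocation is to push very large strings through the Writer in fixed-size slices, so the temporary buffer is bounded by the chunk size rather than the string length. A sketch (writeChunked is my own helper, not part of the test above; StreamEncoder keeps its encoder state across write() calls, so even a surrogate pair split at a chunk boundary is encoded correctly):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public final class ChunkedWrite {
    /**
     * Writes s in slices of at most chunk chars; each write(String, int, int)
     * call then only copies at most chunk chars into a temporary buffer.
     */
    public static void writeChunked(Writer w, String s, int chunk) throws IOException {
        for (int off = 0; off < s.length(); off += chunk) {
            w.write(s, off, Math.min(chunk, s.length() - off));
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(bout, "UTF-8");
        writeChunked(w, "some fairly long string with \u20ac in it", 8);
        w.flush();
        System.out.println(bout.size());
    }
}
```

The chunk size is a trade-off: smaller chunks mean smaller temporary buffers but more encoder calls per string.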
Notice 2:
When changing the test to use 100-char strings with one million iterations, it turned out that String.getBytes() was practically as fast as the Writer (or faster, depending on whether the extra gc() caused by the allocation in StreamEncoder is hit or not).
Notice 3:
On my not-so-new hardware, I got string encoding speeds between 54M/s for plain ASCII chars (random characters in the range 32-255) and 18M/s when "high" Unicode chars were included (random characters in the range 32-60000). Not stellar performance, but what is noteworthy is that for non-Western users the speed is less than half (and the encoded byte[] storage triples), so that needs to be taken into account when trying to "optimize" strings.
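The tripled storage follows directly from the UTF-8 encoding rules: a BMP character costs 1, 2, or 3 bytes depending on its code point, and random characters in the range 32-60000 land mostly in the 3-byte band. A tiny sketch (utf8Bytes is my own helper, ignoring surrogate pairs; StandardCharsets is Java 7+, used here for brevity):

```java
import java.nio.charset.StandardCharsets;

public final class Utf8Size {
    /** Bytes needed for one BMP char in UTF-8 (sketch; ignores surrogate pairs). */
    public static int utf8Bytes(char c) {
        if (c < 0x80)  return 1;  // ASCII: U+0000..U+007F
        if (c < 0x800) return 2;  // e.g. Latin-1 supplement, Greek, Cyrillic
        return 3;                 // rest of the BMP, e.g. CJK
    }

    public static void main(String[] args) {
        System.out.println("hello".getBytes(StandardCharsets.UTF_8).length);        // 5 bytes for 5 chars
        System.out.println("\u4f60\u597d".getBytes(StandardCharsets.UTF_8).length); // 6 bytes for 2 CJK chars
    }
}
```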