Thursday, April 01, 2010

Writing Out Very Large Files

I recently ran into an issue in which I needed to write out a very large file (upwards of 250 MB) in Groovy (it could just as easily have been Java). In researching ways to do this I discovered the java.nio package, which includes FileChannels. If you haven't played around with these yet, I encourage you to give them a try. They are super fast, and if you are doing IO on a lot of data, there really isn't another option.
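As a minimal illustration (a hypothetical sketch in Java, not the code from my project), writing a buffer of bytes through a FileChannel looks like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("demo", ".txt");
        // Open a channel for writing and push the bytes through it
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap("hello channel\n".getBytes());
            // write() may not drain the buffer in one call, so loop
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
        }
        System.out.print(new String(Files.readAllBytes(path)));
    }
}
```

Note the loop around `channel.write(buf)`: a single call is not guaranteed to write the whole buffer.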

The challenge I had was dealing with the heap size. Even with a FileChannel, I still had no way to hold the entire file in memory. So I came up with a pretty cool solution. It's simple, but it seems to hit the sweet spot between the cost of holding a large object in memory and the cost of writing it out. I needed to do this because I was writing across a mounted drive and the writes were costly. With a local drive it wouldn't have been an issue.

I needed to write each record, in order, to a file based on its contents. I found that using a BufferedWriter was DOG slow, and a FileChannel write for each record was really fast locally but struggled over the mount. Ultimately I decided on an approach that used a StringBuffer to store up records in order and then wrote them out once the buffer got large enough. The interesting thing was that there was a significant point past which no further gain in processing time seemed possible: basically, once the buffer reached a length of about 150,000 characters.

So I had a loop that looked like this:

StringBuffer good = new StringBuffer()
StringBuffer bad = new StringBuffer()

records.each { rec ->
    if (good.length() > 150000) {
        ByteBuffer buf = ByteBuffer.wrap(good.toString().getBytes())
        goodChannel.write(buf)
        good = new StringBuffer()
    }

    if (bad.length() > 150000) {
        ByteBuffer buf = ByteBuffer.wrap(bad.toString().getBytes())
        badChannel.write(buf)
        bad = new StringBuffer()
    }

    //read the next record and do my processing

    if (goodRecord) {
        good.append(chunk)
    } else {
        bad.append(chunk)
    }
}

//Then write out whatever is left in case the buffers aren't empty
if (good.length() > 0) {
    ByteBuffer buf = ByteBuffer.wrap(good.toString().getBytes())
    goodChannel.write(buf)
}

if (bad.length() > 0) {
    ByteBuffer buf = ByteBuffer.wrap(bad.toString().getBytes())
    badChannel.write(buf)
}


This worked extremely well for large file processing, given that I had to contend with the heap size. I wouldn't do this for most IO problems but for very large files, this is really effective.
