Solutions to IT problems

Solutions I found when learning new IT stuff


Fast random file access and line by line reading in Java


Introduction

I recently read a message on the mailing list of an open-source project where a user compared it to a commercial product and claimed that the commercial product was a lot faster. The topic was random access to a large file containing multi-line records. The user said that when the file was first opened, it took minutes to build the index used later to access the desired records quickly. Worse, that file contained only 50k records, whereas the real files would hold about 1 million. The commercial product opens the 1-million-record file instantly. Questions like that always trigger my curiosity, all the more since I have to admit that my knowledge of Java IO is very limited: it amounts to reading text files line by line with BufferedReader. So I set out on a quest into the java.io API.

The Open-Source code

The commercial product probably uses multi-threading to create the index in the background. Another possibility is that it does not actually allow true random access, only scrolling up and down, and does some read-ahead caching. Hard to tell without access to it. So I looked at the code of the open-source project, and it turns out that it uses java.io.RandomAccessFile. This makes sense. I did not fully understand the indexing method, which seemed more complex than it needed to be, but it read the file line by line using java.io.RandomAccessFile. Basically, each record is separated by a delimiter that appears on a separate line. So mapping each line (or its position) seems a fast and reasonable way to index the file. Or so I thought.

Dark Side of the JDK

The java.io.RandomAccessFile.readLine() method was supposedly, and I quote,

written by a first semester CS student that dropped out. It can hardly perform any worse and it performs two orders of magnitude slower than it could.

Anyone can check and confirm this, even in the newest JDK 7: the method reads the file one byte at a time, with no buffer. So the conclusion was to look elsewhere. However, BufferedReader and the other alternatives offer neither random access nor a way to get the offset in bytes from the start of the file. And java.nio remained a mystery even after consulting my best friend Google. So I was kind of lost.
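To make the problem concrete, here is a minimal sketch of indexing delimiter-separated records directly with RandomAccessFile.readLine() and getFilePointer(). It is not the project's actual code; the class name NaiveRecordIndexer and the delimiter parameter are made up for illustration. Every readLine() call here pulls single bytes from the file, which is why this style of indexing is so slow on large files.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the open-source project.
final class NaiveRecordIndexer {

    /**
     * Returns the byte offset at which each record starts.
     * A record ends with a line that consists only of the delimiter.
     */
    static List<Long> index(String path, String delimiter) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length > 0) {
                offsets.add(0L); // the first record starts at the beginning
            }
            String line;
            // readLine() is unbuffered: it reads the file one byte at a time.
            while ((line = raf.readLine()) != null) {
                if (line.equals(delimiter)) {
                    // The file pointer now sits just after the delimiter line,
                    // which is where the next record (if any) begins.
                    long next = raf.getFilePointer();
                    if (next < length) {
                        offsets.add(next);
                    }
                }
            }
        }
        return offsets;
    }
}
```

With the offsets in hand, jumping to record n is a single seek(offsets.get(n)) call. The lookup is cheap; building the index is the expensive part.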

Roll-your-own

Or shall I say learning by doing? I set out to create my own indexing method using BufferedReader and mapping line numbers. Going to a specific record then requires a certain number of readLine() calls whose returned data is simply discarded. This was already a lot faster than the infamous java.io.RandomAccessFile.readLine() way of doing it. However, I was not satisfied, because it was still too slow and, let's be honest, kind of an ugly way to do it.

As a next step I tried to read the file in a buffered way using the java.io.RandomAccessFile.read(byte[]) method. I converted the buffer to a String, searched for the delimiter, and mapped its offset in bytes from the start of the file. With the java.io.RandomAccessFile.seek(long) method that position can then be accessed quickly and randomly. This took some tinkering until I got it right, but to my surprise it was still not very fast. In fact, it was hardly faster than the previous ugly BufferedReader method. This left me puzzled. Actually, I'm still puzzled, even after finding the actual solution, why this was over 10 times slower. I guess that in some critical places, using "convenience" classes like String and ArrayList instead of shuffling array indexes around has a very high price.
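The buffered approach can be sketched as follows. This is not my original attempt (which went through String) but a byte-level variant written for illustration: it reads the file in chunks with read(byte[]), scans the raw bytes for '\n', and stores line-start offsets in a plain long[]. The class name ChunkedLineIndex, the 8 KB chunk size and the assumption that lines end in '\n' (or "\r\n") are all choices made for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class ChunkedLineIndex {

    /** Returns the byte offset of the start of every line in the file. */
    static long[] lineOffsets(String path) throws IOException {
        long[] offsets = new long[1024];
        int count = 0;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length == 0) {
                return new long[0];
            }
            offsets[count++] = 0L;          // first line starts at offset 0
            byte[] buffer = new byte[8192]; // assumed chunk size
            long base = 0;                  // file offset of buffer[0]
            int n;
            while ((n = raf.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') {
                        long next = base + i + 1; // byte after the newline
                        if (next < length) {
                            if (count == offsets.length) {
                                offsets = Arrays.copyOf(offsets, count * 2);
                            }
                            offsets[count++] = next;
                        }
                    }
                }
                base += n;
            }
        }
        return Arrays.copyOf(offsets, count);
    }

    /** Jumps straight to a previously indexed offset and reads one line. */
    static String readLineAt(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            return raf.readLine(); // slow per byte, but only one line is read
        }
    }
}
```

Scanning bytes directly avoids decoding every chunk into a String, and the offsets stay valid as byte positions for seek(long) regardless of the character encoding, as long as the newline byte does not occur inside a multi-byte character (true for ASCII, Latin-1 and UTF-8).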

The Solution

Instead of rolling my own indexing method, I decided to create a RandomAccessFile wrapper that has a usable readLine() method. Looking back, the solution is obvious. Basically, I just copied and pasted the BufferedReader.readLine() method and made some minor adjustments. These adjustments track the position (or offset, or file pointer), set it to the correct value when, say, a write method is called, and invalidate the buffer used for readLine(). And it works! So I now have fast random access and fast line-by-line reading in one single Java class called OptimizedRandomAccessFile. Indexing is now pretty much 100 times faster. Wow. One would have thought this is a simple task. Way to go, ex-Sun and Oracle!
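The idea can be sketched in a few dozen lines. The following is not the OptimizedRandomAccessFile class itself but a simplified, read-only illustration of the same technique: a byte buffer in front of RandomAccessFile plus bookkeeping so that getFilePointer() and seek() stay consistent with what readLine() has consumed. The class name BufferedLineFile, the 8 KB buffer and UTF-8 decoding are assumptions made for this example; write support, which would also have to invalidate the buffer, is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical, read-only sketch; not the author's OptimizedRandomAccessFile.
final class BufferedLineFile implements Closeable {
    private final RandomAccessFile file;
    private final byte[] buffer = new byte[8192]; // assumed buffer size
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLength = 0; // number of valid bytes in buffer
    private int bufferPos = 0;    // index of the next unread byte

    BufferedLineFile(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
    }

    /** Logical position: offset of the next byte readLine() will consume. */
    long getFilePointer() {
        return bufferStart + bufferPos;
    }

    void seek(long position) throws IOException {
        if (position >= bufferStart && position <= bufferStart + bufferLength) {
            bufferPos = (int) (position - bufferStart); // target is buffered
        } else {
            file.seek(position); // outside the buffer: invalidate it
            bufferStart = position;
            bufferLength = 0;
            bufferPos = 0;
        }
    }

    /** Returns the next line without its terminator, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream(80);
        boolean sawData = false;
        while (true) {
            if (bufferPos >= bufferLength && !fill()) {
                return sawData ? decode(line) : null;
            }
            sawData = true;
            int start = bufferPos;
            while (bufferPos < bufferLength && buffer[bufferPos] != '\n') {
                bufferPos++;
            }
            line.write(buffer, start, bufferPos - start);
            if (bufferPos < bufferLength) {
                bufferPos++; // consume the '\n'
                return decode(line);
            }
        }
    }

    /** Loads the next chunk; returns false at end of file. */
    private boolean fill() throws IOException {
        bufferStart += bufferLength;
        bufferPos = 0;
        int n = file.read(buffer);
        bufferLength = Math.max(n, 0);
        return n > 0;
    }

    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the '\r' of a "\r\n" terminator
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Indexing then becomes a loop that records getFilePointer() before each readLine() call, and a later seek() to one of those offsets followed by readLine() returns the matching line. The key point is that the position reported to the caller is the logical one (buffer start plus position inside the buffer), not the underlying file pointer, which has already run ahead to the end of the buffered chunk.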

Written by kienerj

September 23, 2013 at 21:01

Posted in Java, Programming
