Solutions to IT problems

Solutions I found when learning new IT stuff

Fast random file access and line by line reading in Java


Introduction

I recently read a message on the mailing list of an open-source project in which a user compared it to a commercial product and claimed that the commercial product was a lot faster. The specific topic was random access to a large file containing multi-line records. The user said that when the file was first opened, it took minutes to create the index that is later used to quickly access the desired records. The issue was that this file contained only 50k records, whereas the real files would contain about 1 million. The commercial product opens the 1-million-record file instantly. Questions like that always trigger my curiosity, especially since I have to admit that my knowledge of Java IO is very limited: it amounts to reading text files line by line with BufferedReader. So I set out on a quest into the java.io API.

The Open-Source code

The commercial product probably uses multi-threading to create the index in the background. Another possibility is that it does not in fact allow true random access, just scrolling up and down, and does some read-ahead caching. Hard to tell without access to it. So I looked at the code of the open-source project, and it turns out that it uses java.io.RandomAccessFile. This makes sense. I did not fully get the indexing method, which seemed more complex than it needed to be, but it read the file line by line using java.io.RandomAccessFile. Basically, each record is separated by a delimiter that appears on a separate line. So mapping each line (or its position) seems a fast and reasonable way to index the file. Or so I thought.
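
To make the idea concrete, here is a minimal sketch of such an index. It is not the project's actual code: the class name NaiveRecordIndex and the delimiter passed in are made up for illustration, and it assumes every record ends with a line that contains only the delimiter. It walks the file with RandomAccessFile.readLine() and remembers where each record starts.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class NaiveRecordIndex {

    /**
     * Returns the byte offset at which each record starts. A record ends at a
     * line equal to the delimiter; the next record starts on the following line.
     */
    static List<Long> index(String path, String delimiter) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length > 0) {
                offsets.add(0L); // the first record starts at the beginning of the file
            }
            String line;
            while ((line = raf.readLine()) != null) {
                // After readLine(), the file pointer sits on the first byte of the next line.
                long next = raf.getFilePointer();
                if (line.equals(delimiter) && next < length) {
                    offsets.add(next);
                }
            }
        }
        return offsets;
    }
}
```

With the offsets in hand, jumping to record n is a single raf.seek(offsets.get(n)) call. The weak spot is the readLine() call in the loop.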

Dark-Side of the JDK

The java.io.RandomAccessFile.readLine() method was supposedly, and I quote,

written by a first semester CS student that dropped out. It can hardly perform any worse and it performs two orders of magnitude slower than it could.

And anyone can check and confirm this, even in the newest JDK 7: it reads the file byte by byte with no buffer. So the conclusion was to look elsewhere. However, BufferedReader and the other alternatives offer neither random access nor a way to get the offset in bytes from the start of the file. And java.nio remained a mystery even after consulting my best friend Google. So I was kind of lost.

Roll-your-own

Or shall I say learning by doing? I set out to create my own indexing method using BufferedReader and mapping line numbers. Going to a specific record then requires a certain number of readLine() calls without caring about the returned data. This was already a lot faster than the infamous java.io.RandomAccessFile.readLine() way of doing it. However, I was not satisfied, because it was still too slow and, let's be honest, kind of an ugly way to do it. As a next step I tried to read the file in a buffered way using the java.io.RandomAccessFile.read(byte[]) method. I converted the buffer to a String, searched for the delimiter and mapped its offset in bytes from the start of the file. With the java.io.RandomAccessFile.seek(long) method that position can then be accessed quickly and randomly. This took some tinkering until I got it right, but to my surprise it was still not very fast. In fact, it was hardly faster than the previous ugly BufferedReader method. This left me puzzled. Actually, even after finding the actual solution, I'm still puzzled why this was over 10 times slower than that solution. I guess that at some critical places, using "convenience" classes like String and ArrayList instead of shuffling around array indexes has a very high price.
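
For comparison, here is a rough sketch of the "shuffling around array indexes" variant of the buffered approach. It is an illustration under a few assumptions, not the code described above: the class name LineOffsetIndex is made up, lines are assumed to end in '\n', and to stay short it records the start of every line rather than searching for a delimiter. The chunk is scanned as raw bytes, with no String conversion and no ArrayList.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class LineOffsetIndex {

    /** Returns the byte offset of the start of every line, assuming '\n' ends a line. */
    static long[] index(String path) throws IOException {
        long[] offsets = new long[1024];
        int count = 0;
        byte[] chunk = new byte[64 * 1024];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length > 0) {
                offsets[count++] = 0L; // the first line starts at offset 0
            }
            long chunkStart = 0; // file offset of chunk[0]
            int n;
            while ((n = raf.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    long next = chunkStart + i + 1;
                    // A new line starts right after each '\n', unless that is the end of the file.
                    if (chunk[i] == '\n' && next < length) {
                        if (count == offsets.length) {
                            offsets = Arrays.copyOf(offsets, count * 2);
                        }
                        offsets[count++] = next;
                    }
                }
                chunkStart += n;
            }
        }
        return Arrays.copyOf(offsets, count);
    }

    /** Jumps straight to a known offset and reads the single line that starts there. */
    static String lineAt(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            return raf.readLine();
        }
    }
}
```

Reading one line with the slow readLine() after a seek is fine here, because it happens once per lookup and not once per line of the file.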

The Solution

Instead of rolling my own indexing method, I decided to create a RandomAccessFile wrapper that has a usable readLine() method. Looking back, the solution is obvious. Basically, I just copied and pasted the BufferedReader.readLine() method and made some minor adjustments. These adjustments track the position (or offset, or file pointer), and, if say a write method is called, set the file pointer to the correct position and invalidate the buffer used for readLine(). And it works! So I now have fast random access and fast line-by-line reading in one single Java class called OptimizedRandomAccessFile. Indexing is now pretty much 100 times faster. Wow. One would have thought that this is a simple task. Way to go ex-Sun and Oracle!
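
The core of such a wrapper can be sketched roughly as follows. This is not the OptimizedRandomAccessFile source: the class name BufferedLineFile and its fields are invented for illustration, it is read-only, it treats only '\n' (optionally preceded by '\r') as a line end, and it decodes bytes as ISO-8859-1, so it is only safe for single-byte encodings.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical read-only wrapper, for illustration only.
class BufferedLineFile implements Closeable {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private long bufStart = 0; // file offset of buf[0]
    private int bufLen = 0;    // number of valid bytes in buf
    private int bufPos = 0;    // index of the next unread byte in buf

    BufferedLineFile(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
    }

    /** Logical position in bytes from the start of the file, independent of read-ahead. */
    long getFilePointer() {
        return bufStart + bufPos;
    }

    /** Moves to an absolute offset and throws away whatever was buffered. */
    void seek(long pos) throws IOException {
        raf.seek(pos);
        bufStart = pos;
        bufLen = 0;
        bufPos = 0;
    }

    /** Loads the next chunk; only called once the buffer is used up. Returns false at end of file. */
    private boolean fill() throws IOException {
        bufStart += bufLen;
        bufPos = 0;
        int n = raf.read(buf);
        bufLen = Math.max(n, 0);
        return n > 0;
    }

    /** Returns the next line without its terminator, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean readAnything = false;
        while (true) {
            if (bufPos == bufLen && !fill()) {
                break; // end of file
            }
            readAnything = true;
            int start = bufPos;
            while (bufPos < bufLen && buf[bufPos] != '\n') {
                bufPos++;
            }
            line.write(buf, start, bufPos - start);
            if (bufPos < bufLen) { // stopped on '\n'
                bufPos++;          // consume the terminator
                break;
            }
        }
        if (!readAnything) {
            return null;
        }
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the '\r' of a "\r\n" terminator
        }
        return new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }
}
```

The point of the design is that getFilePointer() reports the logical position (buffer start plus index into the buffer) rather than the position of the underlying file, which has already read ahead. A version that also writes would have to call something like seek(getFilePointer()) before each write so that the buffer is invalidated.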

Written by kienerj

September 23, 2013 at 21:01

Posted in Java, Programming

5 Responses

  1. I realise this is an old post, but I ran into a similar problem with RandomAccessFile. I hunted around for a solution and found this: https://gist.github.com/anonymous/b0aa9819239bb821ffbc5525dc785f18. However, for some reason it always ends in an IOException. I'm still mastering Java, but I have narrowed down the error to not handling EOF very well. Could you take a look and provide a succinct solution, as I understand the code I linked better than yours?

  2. In the fill() method, charBuffer[i] = (char) buffer[i], so the class cannot be used for non-Unicode files …

    黄笑

    November 17, 2015 at 13:15

  3. This does not work if the file is encoded as non-Unicode. The result of readLine is broken. Please fix it.

    limo stoflimo

    January 27, 2015 at 18:58

  4. Thanks for your solution to this IT problem. I stumbled across the same problem and found your blog entry, which is very good and helpful. Thanks for it. But instead of writing my own class which is not a subclass of RandomAccessFile, I came up with another solution, which is creating an input stream using the file channel and reading from it with a BufferedReader.

    InputStream s = Channels.newInputStream(raf.getChannel());
    InputStreamReader isr = new InputStreamReader(s, Charset.forName("US-ASCII"));
    BufferedReader br = new BufferedReader(isr);

    I hope Oracle fixes this problem in newer versions of Java. I am still using Java 6 here.

    Tino Schlegel

    July 25, 2014 at 07:57

    • But in this case you cannot track your reading offset in bytes, as raf.getChannel().position() is not correct here.

      黄笑

      November 17, 2015 at 13:20
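
The objection in the last reply can be checked with a small experiment. The sketch below uses a made-up class name, ChannelPositionDemo. It reads a single line through the BufferedReader construction from comment 4 and then asks the channel for its position. Because the reader chain pulls a whole chunk from the channel at once, the reported position is expected to lie well beyond the end of the first line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
class ChannelPositionDemo {

    /** Reads one line through a BufferedReader and returns the channel position afterwards. */
    static long positionAfterFirstLine(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    Channels.newInputStream(raf.getChannel()), Charset.forName("US-ASCII")));
            br.readLine();
            // This is how far the reader chain has read ahead, not where the line ended.
            return raf.getChannel().position();
        }
    }
}
```

For a file with many short lines, the returned value is therefore not the offset of the second line, which is why this construction cannot be used to build a byte-offset index.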

