2007-09-20

System.IO.StreamReader + complex file processing = Headaches (Part 1)

System.IO.StreamReader is a great thing for simple reading operations but tends to cause some unforseen headaches in more complex scenarios. Try to image a scenario like the following:

You are trying to develop some kind application to display text subtitles for Audiobooks, apart from simply playing audiobooks and seeking within books, no other features a required. To keep things simple, will the Audiobooks be stored a common audio format and the subtitles shall be stored in additional text files of the following format:

Some Header-Lines

HH:MM:SS;The text to display at a certain time
HH:MM:SS;The text that will be displayed after the first line
...
Now first let's try to implement a simple szenario: Play the Book from the beginning to the end without seeking

The Algorithm for parsing the file could be as simple as the following:
Method Init:
Create a StreamReader to read the file.
A couple of readlines that parse the header.
Read the first and second text-line from the stream.
Display the first line.

while playing audiobook
Wait for the position where the next line should be diplayed.
Display the Next Line.
Read a new line from the stream.
loop

But as we explicitly want seeking our algorithm must be altered. But let's first evaluate some ways to implement seeking-capability:
  1. Keep all the lines stored in memory, but as we expect pretty large Subtitle files (about 15MB) this is not an option.
  2. Remember the position inside the filestream where the data began. On seek jump to the beginning of the data and then read out all lines till the right one has been found. This will consume the least memory but again as our files might become pretty large reading the file from the beginning on each seek-operation might result in a severe performance loss and a lot of disk interaction.
  3. Create a temporary index of the subtitle-file in the memory. That index will not contain the text but just the a dictionary to store the times found within the file and their position in the file-stream. Since a full index can still consume up to 5 MB a partial index will suffice where every 50th line will be stored.
In this case I think implementing the algorithm number 3 would be the best choice, so let's alter our code then:
Method Init:
Create a StreamReader to read the file.
A couple of readlines that parse the header.

while not end of file
if this is every 50th line then
parse the time
store time and FileStream.Position
endif
loop

Seek Filestream to first text-entry
Call ContinuePlaying


Method OnSeek(timeToSeek):
seeking = true

Look within our dictionary for the time directly smaller than or equal to timeToSeek
Seek our FileStream to the position of our time retrieved from the dictionary

if dictionaryTime == timeToSeek then
Read contents from stream and display it
Call ContinuePlaying
endif

currently retrieved time = dictionaryTime
previousLine = null
while currently retrieved time <= timeToSeek
previousLine = currentLine
currentLine = FileStream.ReadLine
loop

seeking = false
Display the previousLine
Call ContinuePlaying


Method ContinuePlaying:
While Not end of file
if seeking then
exit method
endif

Read next line from file
Wait till it's time to display the new line
Display the new Line
loop
So far so good, looks pretty correct to me (Please hit me if you think otherwise :-). However testing this with a sample subtitle file will terribly fail. But why?

Because StreamReader.ReadLine does not work similar like Read. While Read only reads out 1 char from the stream, ReadLine reads out the next 1kB, this will be buffered internally and the first line within that buffer will be returned. The next 1kB will be read from the string when all the lines from the buffer have been returned.

In our application this will mean that our index will not contain the positions but 1024, 2048, 3072,... instead. In addition as this buffer is not being cleared if the FileStream performs a seek, we won't be able to perform any seek operation as well.

So this is basically how System.IO.StreamReader caused me headaches (Took me quite some time and a lot of debug output to figure this behavior out). Next part of this post I will tell you what I did to fix this unforeseen complications.

3 comments:

Brian Nickel said...

I personally ran into much the same problem when doing a trace reader. Since traces can be 200+ MB, I only stored lines from a given level and then filled in when the user double clicked on a function name.

I ended up just using Stream.ReadByte. For your case, a function like the following may work:

int ReadLine (byte [] buffer, out TimeSpan time, out string text)
{
int i = 0;
int b;
while ((b = stream.ReadByte ()) != -1 && b != '\n')
buffer [i++] = (byte) b;

if (i < 8 || buffer [2] != 0x3a || buffer [5] != 0x3a || buffer [8] != 0x3b)
throw new Exception ("Bad Line");

time = new TimeSpan (
10 * (buffer [0] - 0x30) + buffer [1] - 0x30,
10 * (buffer [3] - 0x30) + buffer [4] - 0x30,
10 * (buffer [6] - 0x30) + buffer [7] - 0x30
);

text = System.Text.Encoding.Default.GetString (buffer, 9, i);
return i;
}

You can use a buffer to cut down on the allocations and keep track of position without using stream.Position by summing up the output.

- Brian

Kyle said...

We personally ran into a similar issue with very large files, containing one or more lines of arbitrary length, but some lines began with a timestamp.

We implemented a binary search algorithm. And used Read() (byte[]) -- basically we get the file length, pick a byte, then step forwards past \r\n's to find a line that starts with a timestamp. Then start halving the file till you find a timestamp previous to the one you want. (We can then decode the byte[] into text)

There are a few 'hairs' with it/edge cases you need to account for, but it is wicked fast, works on very large files, and reads a very small chunk of the file.

Anonymous said...

I had a problem where I wanted to go back and forth in a text file (IO.StreamReader). I was seeking and was using ReadLIne, so I ran into the buffer problem, which coupled with seek, means you may get some ReadLines returning random mixed chunks of file. After a seek, I used DiscardBufferData which fixed the ReadLine behavior. Of course the BaseStream.Position is still dependent on ReadLine's buffer so you can't use it to represent an exact place in the file.