Predictive Reading

I often start a book not knowing if I'll finish it. I have a graveyard (more kindly called a bookshelf) full of books that aren't quite completed. This is a nasty habit. Can I do anything to improve it?

I'm currently reading The Fractalist by Benoit Mandelbrot. So far, I've spent an hour and five minutes reading the book. If history is any indication, I'll finish it in another four hours and thirty minutes.

What do I mean? Determining the amount of reading time¹ to complete a book should be a straightforward prediction problem. The data are all there. How many pages have you read? How long did it take to read those pages? How many pages remain? With these three pieces of information, estimating the amount of time to complete a book is a matter of simple extrapolation: how many minutes, at my reading rate, will it take to read the remaining pages? A very simple word problem, indeed.
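As a minimal sketch of that extrapolation, assuming a single average pace over everything read so far: the function name `minutes_remaining` and the numbers below are illustrative placeholders, not entries from the reading log.

```python
def minutes_remaining(pages_read, minutes_spent, pages_left):
    """Extrapolate the remaining reading time from the average pace so far."""
    if pages_read <= 0:
        raise ValueError("need at least one page read to estimate a pace")
    minutes_per_page = minutes_spent / pages_read
    return minutes_per_page * pages_left


# Hypothetical numbers: 50 pages read in 60 minutes, 200 pages to go.
estimate = minutes_remaining(pages_read=50, minutes_spent=60, pages_left=200)
print(f"About {estimate:.0f} minutes of reading left")
```

The three arguments are exactly the three pieces of information above: pages read, time spent on them, and pages remaining.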

Of course, this only becomes simple if we can find a simple functional relationship between the time spent reading and the number of pages read. I'm not a machine (?), but the different pages of a pleasure²-reading book are more or less the same in terms of grammatical structure, vocabulary, and content-difficulty. (I'm not reading James Joyce's Ulysses.) Thus, there should be some consistency in the number of pages I read in a given interval of time.

Fortunately, I've been recording the amount of time it takes me to read a given number of pages for a given book since December 2012³. Here is a sample of this data for nine books I've read over the past six months:

As you can see, the number of pages I read in a sitting is more or less linear⁴ as a function of the amount of time I spend reading. This linear form is consistent across books. The only thing that changes is the slope of the line.
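One way to put a number on "the slope of the line" is a least-squares fit of a line through the origin (zero minutes, zero pages). The sketch below assumes each sitting is logged as a `(minutes, pages)` pair; the helper `fit_rate` and both sample logs are hypothetical, not the data plotted above.

```python
def fit_rate(sessions):
    """Least-squares slope of a line through the origin.

    sessions: list of (minutes, pages) pairs, one per sitting.
    Returns the fitted reading rate in pages per minute.
    """
    sum_ty = sum(t * y for t, y in sessions)
    sum_tt = sum(t * t for t, _ in sessions)
    if sum_tt == 0:
        raise ValueError("need at least one sitting with nonzero time")
    return sum_ty / sum_tt


# Hypothetical logs for two books: same linear shape, different slopes.
book_a = [(20, 15), (40, 30), (60, 45)]
book_b = [(20, 10), (40, 22), (60, 29)]

for name, log in [("book A", book_a), ("book B", book_b)]:
    print(f"{name}: {fit_rate(log):.2f} pages per minute")
```

Forcing the line through the origin is a design choice: it encodes the assumption that no time spent means no pages read, which leaves the slope as the only free parameter per book.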

This, of course, also makes sense. Different books will have different font sizes, different content difficulty, etc., which should result in differing reading rates. Which makes the concept of a fixed 'reading rate' a bit nebulous⁵.

The problem with my current approach is that with each new book, I start the prediction problem from scratch. Basically, I have the model \[ Y_{i} = \beta t_{i} + \epsilon_{i},\] where \(Y_{i}\) is the number of pages I've read on the \(i^{\text{th}}\) reading, \(t_{i}\) is the amount of time spent during that reading, \(\epsilon_{i}\) is a fudge factor, and \(\beta\) is my reading rate for a given book. The catch is that, a priori, I don't know how different my reading rate will be on the new book compared to previous books I've read.

One way to get around this is to use a mixed-effects model, and assume that my reading rate on a book is the sum of an intrinsic reading rate and a book-specific reading rate, \[ \beta = \beta_{\text{book}} + \beta_{\text{intrinsic}}.\] (This seems like a reasonable model to me.) I can then use the previous books to inform my guess at \(\beta_{\text{intrinsic}}\), and then estimate \(\beta_{\text{book}}\) from scratch. This is very similar to empirical Bayes, where we use prior information (collected empirically!) to get a better guess at some parameter.
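A rough sketch of the empirical-Bayes flavor of this idea, not a formal mixed-effects fit: treat the rates fitted on previous books as draws from a normal prior, whose mean stands in for \(\beta_{\text{intrinsic}}\) and whose spread stands in for the variation in \(\beta_{\text{book}}\), and then shrink the new book's rate toward that mean. The function `shrunken_rate`, the normal assumptions, the known noise variance `noise_var`, and all of the numbers are assumptions made for illustration.

```python
from statistics import mean, variance


def shrunken_rate(sessions, prior_rates, noise_var):
    """Posterior mean of beta under a normal prior and normal noise.

    sessions:    (minutes, pages) pairs logged so far for the new book.
    prior_rates: fitted pages-per-minute rates from previous books (>= 2).
    noise_var:   assumed variance of the fudge factor epsilon (pages squared).
    """
    mu = mean(prior_rates)        # stand-in for the intrinsic rate
    tau2 = variance(prior_rates)  # spread of book-specific deviations
    if tau2 == 0:
        raise ValueError("previous rates must not all be identical")
    sum_ty = sum(t * y for t, y in sessions)
    sum_tt = sum(t * t for t, _ in sessions)
    prior_precision = 1.0 / tau2
    data_precision = sum_tt / noise_var
    return (prior_precision * mu + sum_ty / noise_var) / (
        prior_precision + data_precision
    )


# Hypothetical values: three earlier books and one sitting of the new book.
previous = [0.6, 0.8, 1.0]
new_book = [(30, 15)]  # 30 minutes, 15 pages
print(f"{shrunken_rate(new_book, previous, noise_var=25.0):.3f} pages per minute")
```

With no sittings logged, the estimate is just the average of the previous rates; as more sittings accumulate, the data term dominates and the estimate approaches the book's own least-squares slope.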

I'd have to spend more time thinking about how to do this type of analysis formally. But that time would be better spent reading these books. So for now, my crude completion estimator will have to do.

As opposed to the number of calendar days. That's a question of behavioral control. Which, presumably, should be made easier with good data. A common pattern from control theory is that we must first be able to accurately monitor a system before we can control it. This applies to life just as much as it does to factories.↩
As opposed to a technical book. A good rule of thumb: never read a math textbook as if you're reading a novel.↩
Has it only been that long? I can't believe I've lived so much of my life in the dark, unable to predict how much more time I would need to commit to a book to complete it.↩
Except, perhaps, for very short amounts of time spent reading. There are myriad explanations for this: I haven't gotten 'in the zone,' the measurements are noisy at shorter reading times (differences across pages haven't had a chance to average out, time measurements are off, etc.).↩
Much like the idea of a fixed intelligence quotient.↩