The Data Daily

A Tale of Books and Bias, Nightingale

Books were one of our first methods of data storytelling, beyond speaking or painting. Humans turned data into words, and now we have begun to turn those words back into data. We typically analyze books and their texts for their content, or for the effect they have on us. But what about the effect we have on what other readers may think of those same books, based on our ratings? Might we, as readers, be agents of bias?

I tried to answer that query by analyzing reader behavior as reflected in Goodreads, a social network for readers with more than 125 million users. In this social network, readers can report the books they have read and rate them on a scale from one to five.

I built on Maria Antoniak's Goodreads Web Scraper to extract data from 42 fantasy and sci-fi book series. The scraped data included the total number and distribution of ratings for each book, as well as each book's position within its series.

For the purpose of this article, the number of ratings will be treated as the number of readers, although, as with any self-report measurement, there are probably Goodreads users who have not registered or rated a book even though they've read it. And many of a book's readers don't use Goodreads, or any social network, to report their ratings at all.

Today, however, I am going to focus on another type of self-selection bias: the (possible) self-selection bias among Goodreads readers of a book series; that is, which readers continue reading a series and, consequently, rate its later books on Goodreads.

To gauge the likelihood of this bias, I examined how rating distributions and reader numbers evolve throughout each book series. I define 'return rate' as the number of readers of a specific book divided by the number of readers of the first book of its series. The return rate is thus a percentage that is always 100 percent for the first book and should never increase from one book to the next within the same series, because it doesn't make sense to read the third book of a series without having read the second.
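As a minimal sketch of this calculation (the numbers below are illustrative, not the article's scraped data), the return rate for one series could be computed like this:

```python
# Hypothetical ratings counts for one series, ordered by book position.
# (Illustrative numbers only, not taken from the Goodreads dataset.)
ratings_count = [120_000, 56_000, 41_000, 38_000]

# Return rate: readers of each book divided by readers of the first book.
return_rate = [n / ratings_count[0] for n in ratings_count]

print([round(r * 100) for r in return_rate])  # → [100, 47, 34, 32]
```

By construction the first value is always 100 percent, and the sequence should be non-increasing for any real series.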

The books included in the analysis, along with their rating distributions, weighted mean ratings, and return rates, are shown in Figure 1.

There's a saying that sequels are never as good as the first book in a series, but in 67 percent of the 42 book series analyzed, that statement seems to be false: second books have a higher average rating than their respective first books. However, not everyone who rated the first book rated the second one: the mean return rate is 47 percent.

Does this effect propagate to the following books of each series? 

I tested that by calculating the weighted mean rating of the sequels and the return rate between the first and last books of each series. By 'weighted mean of the sequels' I mean the average rating of the second through last books, with each book's rating weighted by its number of readers. I chose this method because, if a book was on average not well liked, it's likely that only those who did like it continued with the series, so the later books would not reflect the opinion of readers who chose not to continue, which matters when rating the whole series. Weighting the average by the number of readers mitigates that bias.
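The weighted mean described above can be sketched as follows (again with made-up numbers standing in for one series' sequels):

```python
# Hypothetical average ratings and reader counts for books 2..N of a series
# (illustrative values, not the article's data).
avg_ratings = [4.2, 4.0, 3.8]        # books 2, 3, 4
num_readers = [56_000, 41_000, 38_000]

# Weighted mean: each book's average rating weighted by its reader count,
# so books that shed many readers contribute proportionally less than
# books that kept them.
weighted_mean = (
    sum(r * n for r, n in zip(avg_ratings, num_readers)) / sum(num_readers)
)
```

Because later, lower-rated books here also have fewer readers, the weighted mean lands closer to the well-read early sequels than a plain average would.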

The result is similar to the first analysis, with 67 percent of sequels rated higher than their first books, although the mean return rate is much lower (28 percent). So, on average, sequels are rated (slightly) higher.
