If you could only read books to learn language, how many would you have to read to acquire an educated vocabulary? I set out to answer this question in this article. Oh, and, the answer is 137.
I besmirch my foreign language books with lines and marginalia when I don’t understand words and phrases. I spend a lot of time learning Mandarin, but since I can’t re-live my childhood in Beijing, I read to improve my vocabulary. Novels, non-fiction, newspapers, anything. Recently, I was astounded at how much I still had to highlight on a page, reflecting my ignorance of the language. When will it end? I thought to myself. When can I just read in Mandarin for pleasure?
I admit: flash cards and spaced-repetition software are a highly effective solution to my problem. But they’re torturous. Reading is the only way I can remain stimulated enough to further my language learning. So, I embarked on a journey to find out how many books I would have to read to learn a fine vocabulary. The vocabulary of a well-educated native speaker!
Statistics are easy to find for English, so I constructed an experiment for it via a Monte Carlo simulation¹. I further assumed that:
All of these are simplifying assumptions, which relegates these results to the realm of pseudoscience, but we can still get a rough idea. In the Monte Carlo simulation, a trial reader “reads” a stream of words sampled from a Zipf distribution until he has encountered each of the 35,000 most frequent words in English at least 12 times. When this condition is satisfied, the experiment ends, and we count the total number of words the reader encountered. Here are the results of 1,000 trials:
$ ./reading-monte-carlo -trials 1000
> Avg 11644939 words, or approx 137.0 books, over 1000 trials
The stream of words ends up being around 11.6 million words long, or about 137 books’ worth, before the reader has learned the 35,000-word vocabulary. The trend is linear (note that the numbers vary from simulation to simulation depending on the parameters; the trend is what matters most):
At a pace of one book every three weeks, which is reasonable for a working professional with other obligations, one could read 137 books in just under eight years. Indeed, I don’t know any English speakers with an erudite vocabulary who haven’t spent years reading. For me, reading’s primary benefit is knowledge or entertainment. The language learning is just gravy, so I’m willing to accept an eight-year fate. You weren’t expecting a magical solution, were you? Language learning is hard work!
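To redo the pace arithmetic for other reading speeds, here is a small sketch. The function names are mine, not from the simulation code, and the 52-week year is a rounding assumption. The only inputs taken from the article are the simulation output (11,644,939 words, 137.0 books), which implies a book length of roughly 85,000 words.

```python
WEEKS_PER_YEAR = 52  # rounding assumption; a year is slightly longer


def words_per_book(total_words, books):
    """Book length implied by a simulation result."""
    return total_words / books


def years_to_finish(books, weeks_per_book):
    """Years needed to read `books` books at a steady pace."""
    return books * weeks_per_book / WEEKS_PER_YEAR


if __name__ == "__main__":
    implied = words_per_book(11_644_939, 137.0)
    print(f"implied book length: {implied:,.0f} words")
    # Try a few paces, from a book a week to a book a month.
    for weeks in (1, 2, 3, 4):
        years = years_to_finish(137, weeks)
        print(f"one book every {weeks} week(s): {years:.1f} years")
```

Swapping in your own book count or pace shows how sensitive the timeline is to reading speed.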
Among my foreign languages, I’ve read this many millions of words only in German, over the course of half a decade. It’s rare for me to find a novel word now, but it happens, as it does in my native English. Highbrow sources use a broader vocabulary than popular ones; I’m more likely to learn new words reading Le Monde Diplomatique than reading Bild, or, by way of anglophone analogy, more words from The Economist than from Cosmopolitan.
I provide the code¹⁰ to experiment with the simulation, change the assumptions, and expand it to other languages. Let me know what you find! Until then, I’ll be reading. A lot.
Included in the code are two versions of the trial. The first assumes a continuous word flow and considers a word “learned” once it has been seen 12 times in total, regardless of timing. The second also requires 12 sightings in total, but they must be separated by exponentially increasing intervals.
The results are about the same, likely because rare words in a Zipf distribution probabilistically appear with large chronological gaps between them.
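For readers who want to play with the idea before opening the real code, here is a minimal Python sketch of both kinds of trial. It is my own illustration, not the code linked above. It assumes a Zipf exponent of 1, and its spacing rule (the required gap between counted sightings doubles each time, starting from a hypothetical `base_gap`) is only one possible reading of “exponentially increasing intervals”. The parameters are deliberately tiny so it finishes quickly.

```python
import bisect
import itertools
import random


def zipf_cumulative(vocab_size, exponent=1.0):
    """Cumulative unnormalised Zipf weights: rank r weighs 1 / r**exponent."""
    weights = (1.0 / rank ** exponent for rank in range(1, vocab_size + 1))
    return list(itertools.accumulate(weights))


def run_trial(vocab_size, required, rng, spaced=False, base_gap=1):
    """Return how many words are read before every word counts as learned.

    spaced=False: any `required` sightings of a word suffice.
    spaced=True:  a sighting counts only if enough words have passed since
                  the previous counted sighting; the required gap doubles
                  each time (an assumed rule, not taken from the article).
    Expects vocab_size >= 1 and required >= 1.
    """
    cumulative = zipf_cumulative(vocab_size)
    total_weight = cumulative[-1]
    counted = [0] * vocab_size   # counted sightings per word
    last_pos = [0] * vocab_size  # stream position of the last counted sighting
    unlearned = vocab_size
    position = 0

    while unlearned:
        position += 1
        # Inverse-CDF sampling: pick the rank whose cumulative weight covers the draw.
        word = bisect.bisect_left(cumulative, rng.random() * total_weight)
        if counted[word] >= required:
            continue  # already learned
        if spaced and counted[word] > 0:
            needed_gap = base_gap * 2 ** (counted[word] - 1)
            if position - last_pos[word] < needed_gap:
                continue  # seen too soon after the last counted sighting
        counted[word] += 1
        last_pos[word] = position
        if counted[word] == required:
            unlearned -= 1
    return position


def average_words(trials, vocab_size, required, spaced=False, seed=0):
    """Average stream length over several independent trials."""
    rng = random.Random(seed)
    runs = [run_trial(vocab_size, required, rng, spaced) for _ in range(trials)]
    return sum(runs) / trials


if __name__ == "__main__":
    # Tiny stand-in parameters; the article's run used 35,000 words and 12 sightings.
    for spaced in (False, True):
        avg = average_words(trials=10, vocab_size=500, required=4, spaced=spaced)
        print(f"spaced={spaced}: avg {avg:.0f} words")
```

Because both variants draw one random number per word, running them with the same seed feeds them the same stream, which makes it easy to compare how much extra reading the spacing rule costs.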
Berlin. Write me: auf Deutsch, in English, 用中文.