Ashwin Purohit

For a fine vocabulary, read 137 books

If you could only read books to learn language, how many would you have to read to acquire an educated vocabulary? I set out to answer this question in this article. Oh, and, the answer is 137.

I besmirch my foreign language books with lines and marginalia when I don’t understand words and phrases. I spend a lot of time learning Mandarin, but since I can’t re-live my childhood in Beijing, I read to improve my vocabulary. Novels, non-fiction, newspapers, anything. Recently, I was astounded at how much I still had to highlight on a page, reflecting my ignorance of the language. When will it end? I thought to myself. When can I just read in Mandarin for pleasure?

I admit: flash cards and spaced-repetition software are a highly effective solution to my problem. But they’re torturous. Reading is the only way I can remain stimulated enough to further my language learning. So, I embarked on a journey to find out how many books I would have to read to learn a fine vocabulary. The vocabulary of a well-educated native speaker!

Statistics are easy to find for English, so I constructed an experiment for it via a Monte Carlo simulation¹. I further assumed that:

All of these are simplifying assumptions, which lifts these results into the realm of pseudoscience, but we can still get a rough idea. In the Monte Carlo simulation, a trial reader “reads” a stream of words sampled from a Zipf distribution until he has encountered each of the 35,000 most frequent words in English at least 12 times. When this condition is satisfied, the experiment ends, and we count the total number of words the reader encountered. Here are the results of 1,000 trials:

$ ./reading-monte-carlo -trials 1000
> Avg 11644939 words, or approx 137.0 books, over 1000 trials

The stream of words ends up being around 11.6 million words long, or about 137 books’ worth, to learn the 35,000-word vocabulary.
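For readers who want to play with the idea before opening the real code, here is a minimal Python sketch of the trial described above. It is not the program that produced the numbers in this post: the function names (`words_until_learned`, `average_words`), the Zipf exponent of 1.0, and the choice to draw only from the target vocabulary itself are all assumptions made for illustration.

```python
import itertools
import random


def zipf_cum_weights(vocab_size, exponent=1.0):
    """Cumulative, unnormalised Zipf weights: rank r gets weight 1 / r**exponent."""
    weights = (1.0 / rank ** exponent for rank in range(1, vocab_size + 1))
    return list(itertools.accumulate(weights))


def words_until_learned(vocab_size, needed=12, exponent=1.0, rng=None, batch=10_000):
    """Read random words until every word has been seen `needed` times.

    Returns the total number of words read in this one trial.
    Assumption: the reader only ever meets words from the target vocabulary.
    """
    rng = rng or random.Random()
    cum = zipf_cum_weights(vocab_size, exponent)
    seen = [0] * vocab_size
    learned = 0
    words_read = 0
    while True:
        # Draw words in batches; index 0 is the most frequent word.
        for word in rng.choices(range(vocab_size), cum_weights=cum, k=batch):
            words_read += 1
            seen[word] += 1
            if seen[word] == needed:
                learned += 1
                if learned == vocab_size:
                    return words_read


def average_words(trials, vocab_size, needed=12, exponent=1.0, seed=0):
    """Average stream length over several independent trials."""
    rng = random.Random(seed)
    total = sum(
        words_until_learned(vocab_size, needed, exponent, rng) for _ in range(trials)
    )
    return total / trials
```

Something like `average_words(trials=5, vocab_size=500)` finishes quickly. The full 35,000-word setting should also run, but pure Python will take a while per trial, and with different assumptions the average need not match the figure above.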

At a pace of one book every three weeks, which is reasonable for a working professional with other obligations, one could read 137 books in just under eight years. Indeed, I don’t know any English speakers with an erudite vocabulary who haven’t spent years reading. For me, reading’s primary benefit is knowledge or entertainment. The language learning is just gravy, so I’m willing to accept an eight-year fate. You weren’t expecting a magical solution, were you? Language learning is hard work!
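The arithmetic behind the reading schedule is easy to check. The sketch below uses only the figures quoted in this post (11,644,939 words, 137 books, one book every three weeks); the helper name `reading_plan` and the 52-week year are my own choices for illustration.

```python
def reading_plan(words_read, books, weeks_per_book, weeks_per_year=52):
    """Return (words per book, total weeks, total years) for a reading schedule."""
    words_per_book = words_read / books
    total_weeks = books * weeks_per_book
    years = total_weeks / weeks_per_year
    return words_per_book, total_weeks, years


words_per_book, total_weeks, years = reading_plan(
    words_read=11_644_939,  # average stream length reported by the simulation
    books=137,
    weeks_per_book=3,
)
print(f"{words_per_book:,.0f} words per book, {total_weeks} weeks, {years:.1f} years")
```

The simulation output implies a book length of roughly 85,000 words, and 137 books at three weeks each is 411 weeks, a little under eight years.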

German is the only foreign language in which I’ve read this many millions of words, over the course of half a decade. It’s rare for me to find a novel word now, but it happens, as it does in my native English. Highbrow sources use a broader vocabulary than popular ones; I’m more likely to learn new words reading Le Monde Diplomatique than Bild, or, by way of anglophone analogy, more words from The Economist than from Cosmopolitan.

I provide the code¹⁰ to experiment with the simulation, change the assumptions, and expand it to other languages. Let me know what you find! Until then, I’ll be reading. A lot.

A note on spaced-repetition

Included in the code are two versions of the trial. The first assumes a continuous word flow and considers a word “learned” once it has been seen 12 times, regardless of timing. The second also requires 12 sightings in total, but only counts a sighting if it comes after an exponentially increasing interval since the previous one.

The results are about the same, likely because rare words in a Zipf distribution probabilistically appear with large chronological gaps between them.
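One way to picture the spaced variant is sketched below in Python. This is my reading of the rule, not a copy of the linked code: a sighting only earns credit if enough words have gone by since the last credited sighting of the same word, and that required gap doubles each time. The name `words_until_learned_spaced`, the `base_gap` parameter, the doubling schedule, and the Zipf exponent of 1.0 are all assumptions.

```python
import itertools
import random


def words_until_learned_spaced(vocab_size, needed=12, base_gap=1, exponent=1.0,
                               rng=None, batch=10_000):
    """Read random words until every word has `needed` credited sightings.

    Assumed rule: the first sighting always counts; sighting number c + 1 only
    counts if at least base_gap * 2**(c - 1) words have been read since the
    c-th credited sighting. With base_gap=0 this reduces to the unspaced trial.
    """
    rng = rng or random.Random()
    weights = (1.0 / rank ** exponent for rank in range(1, vocab_size + 1))
    cum = list(itertools.accumulate(weights))
    credited = [0] * vocab_size      # credited sightings per word
    last_credit = [0] * vocab_size   # stream position of the last credited sighting
    learned = 0
    position = 0
    while True:
        for word in rng.choices(range(vocab_size), cum_weights=cum, k=batch):
            position += 1
            count = credited[word]
            if count >= needed:
                continue  # already learned, nothing more to track
            required_gap = base_gap * 2 ** (count - 1) if count else 0
            if position - last_credit[word] >= required_gap:
                credited[word] = count + 1
                last_credit[word] = position
                if count + 1 == needed:
                    learned += 1
                    if learned == vocab_size:
                        return position
```

Under this rule the common words are held back by the gaps, but they are not the bottleneck. The rarest words decide when the trial ends, and their natural spacing in the stream is usually far larger than any required gap, which fits the observation that both versions give about the same answer.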

Footnotes

  1. A Monte Carlo method is a statistical sampling method for people who, like me, are bad at math and can’t compute closed-form solutions to deterministic problems.
  2. Zipf’s law holds true of most languages, but the parameters differ considerably.
  3. 35,000 words is a purposely high bar: estimates put the vocabularies of educated native speakers between 20,000 and 40,000. The distinction between types, lemmas, and word families is important and described in the paper How many words do we know?; in this post, I use lemmas to speak of vocabulary size.
  4. 12 is somewhat arbitrary; there are no statistics I know of on how many times a word must be passively seen for it to sink in. There are some theories regarding active recall, which this isn’t.
  5. The principle of spaced-repetition indicates that optimal learning happens when you see words in successively longer intervals. That is, once today, once tomorrow, once in four days, once in two weeks, etc. However, those experiments are mostly for active recall. Intuitively, though, if forgetting is exponential as Ebbinghaus posited in Über das Gedächtnis, then we do not want to see all 12 occurrences of the word we want to learn on the same page, but rather spread out over time in many books.
  6. Typically 80,000 to 90,000, source
  7. Vocabulary size estimate of English
  8. You can’t read 100 books on penguins and expect to learn a broad vocabulary.
  9. Most certainly false; vocabulary sizes differ drastically across languages.
  10. Source code

Berlin-based. Write me, auf Deutsch oder in English.