Mere Words? Google’s Library Project Speaks Volumes

Google's Online Library Project: 5 million+ volumes and growing.
An article published recently in Science magazine presents fascinating sociological findings drawn from analyzing the contents of the growing number of books in Google’s digital library.

Google has amassed a database of some 2 billion words and phrases from more than 5 million books published over the past 200 years. Much of the news coverage about this project has focused on the intense criticism from some publishers and authors who are concerned about copyright protections and Google’s alleged knowledge “power grab.”

But a more interesting and useful result of Google’s library project has been that linguists have been able to use this trove to measure information and trends based on the language in the books and the people and concepts that are referenced therein.

By analyzing the digitized text of the books in Google’s database in relation to when they were published, the researchers found that they can measure all sorts of trends – such as changing tastes in foods, ebbs and flows in relations between countries, and the role of religion in the world.

For example, references to “sausage” peaked in the 1940s and have dropped off dramatically since then, whereas references to “sushi” began to appear in significant volume in the 1980s.
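To make the idea concrete, here is a minimal sketch of how a term's share of all words could be tracked by decade of publication. It is not the researchers' actual method or Google's tooling; the function name `relative_frequency_by_decade` and the tiny `toy_books` corpus are made up purely for illustration.

```python
from collections import defaultdict

def relative_frequency_by_decade(books, term):
    """Return {decade: share of all words in that decade equal to `term`}.

    `books` is an iterable of (publication_year, text) pairs.
    Hypothetical helper for illustration only.
    """
    hits = defaultdict(int)    # occurrences of the term per decade
    totals = defaultdict(int)  # total word count per decade
    term = term.lower()
    for year, text in books:
        decade = (year // 10) * 10
        words = text.lower().split()  # naive whitespace tokenization
        totals[decade] += len(words)
        hits[decade] += words.count(term)
    return {d: hits[d] / totals[d] for d in sorted(totals) if totals[d]}

# Made-up miniature corpus; the texts and years are not real data.
toy_books = [
    (1943, "sausage and eggs and more sausage"),
    (1948, "a plate of sausage"),
    (1985, "sushi for lunch and sausage for dinner"),
    (1988, "sushi sushi and tea"),
]

for word in ("sausage", "sushi"):
    print(word, relative_frequency_by_decade(toy_books, word))
```

Dividing by the total word count for each decade matters because far more books are published in some periods than others; raw counts alone would mostly reflect the size of the library rather than changing tastes.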

It’s also interesting to see how references to certain “personalities” grow or decline over the decades. Revolutionary leader Che Guevara was covered widely in the 1960s but has receded since then, whereas Hollywood actress Marilyn Monroe has seen a slow, steady increase in references even decades following her death.

References to “God” have declined steadily since their peak in the 1840s, which likely comes as no surprise. More interestingly, references to “men” far outpaced those to “women” all through the 1800s and 1900s … until the 1980s, when the two reached parity. And by 2000, references to women had surpassed those to men.

When evaluating emotional concepts, the researchers found that references to concepts like “empathy” and “self esteem” have exploded since the 1940s and 1950s … while references to “will power,” “self control” and “prudence” have all declined.

Commenting on the importance of this academic research, Mark Liberman, a computational linguist at the University of Pennsylvania, said, “We see patterns in space, time and cultural context on a scale a million times greater than in the past.”

It turns out that Google’s digital database of books is but a small fraction of the total number of volumes published since the invention of the printing press; that figure has been estimated at ~129 million. But Google’s 5 million+ books are giving us a much more precise view of trends than what’s ever been possible before.

An interesting ancillary finding of the research is the sheer number of completely new words that have come into use in the English language. It turns out that more than 500,000 new English words have made their “debut” since 1950.

Google is making this data available at a time when it continues to face criticism about its online library endeavor. The initiative has faced copyright disputes, lawsuits and charges that Google is attempting to create an “information monopoly” (some of these disputes have been at least partly settled). But over the long haul, I think it’s a pretty safe bet that people will view the pluses as outweighing the minuses in Google’s library project.
