The Big Dig: Scraping and Scooping the Web

I’ve blogged before about how the Internet is making people’s lives pretty much an open book.

Most people who are online are pretty aware of how their reputation can be affected by their Facebook or MySpace pages and other public or quasi-public online information. But The Wall Street Journal has been publishing a series of stories on how much more pervasive than that digital snooping has become.

The series is titled “What They Know” … and it’s well worth checking out. The most recent article appeared on the front page of the October 12, 2010 edition of the WSJ, and focuses on the phenomenon of “data scraping.”

For those who aren’t familiar with the term, “scraping” is a method by which sophisticated software is used to access and scoop up information that has been posted anonymously on sites that are supposed to be closed to prying eyes. One scraped site cited in the WSJ article is PatientsLikeMe, which has message boards and forums dealing with mental disorders, depression and other issues that most people would prefer to keep private.

People who post on discussion forums like these do so using pseudonyms, and the identity of the posters is carefully guarded by the host sites.

But it turns out that these sites are little match for the sophisticated IT capabilities of companies like Nielsen and PeekYou, which are in the business of matching psychographics as well as demographics to individual people for purposes of serving up relevant advertising – and goodness knows what else.

Think of it as the “lifestyle” direct mail lists of yesteryear – but now on steroids.

PeekYou has applied for a patent on a system whereby it matches real people to the pseudonyms used on forums, blogs, Twitter and other social media outlets. Taking a “peek” at the company’s patent application reveals the great lengths to which its systems go to ferret out and cross-analyze small, innocuous bits of information that, taken together, find the “needle in a haystack” match to the actual individual:

 Birthday match
 Age match
 First name match
 Nickname match
 Middle name match
 Middle initial match
 Gender match
 e-Mail address match
 Phone number match
 Physical address match
 Username match
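To get a feel for how small matches like these can add up, here is a minimal sketch in Python. It is not PeekYou’s method; the patent application is the place to look for that. The field names, the weights and the functions match_score and best_match are all my own assumptions, made up purely for illustration: each field that agrees between a pseudonymous profile and a known person adds a little to a score, and the highest-scoring candidate is the likeliest match.

```python
# Hypothetical field names and weights, chosen only for illustration.
# Rarer, more identifying fields (e-mail, phone) count for more than
# common ones (gender, age).
WEIGHTS = {
    "email": 5.0,
    "phone": 5.0,
    "username": 3.0,
    "address": 3.0,
    "birthday": 2.0,
    "first_name": 1.0,
    "nickname": 1.0,
    "middle_name": 1.0,
    "middle_initial": 0.5,
    "age": 0.5,
    "gender": 0.25,
}


def normalize(value):
    """Compare values case-insensitively and ignore stray whitespace."""
    return str(value).strip().lower()


def match_score(profile, candidate):
    """Sum the weights of every field present in both records that agrees."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a = profile.get(field)
        b = candidate.get(field)
        if a is None or b is None:
            continue  # a missing field neither helps nor hurts
        if normalize(a) == normalize(b):
            score += weight
    return score


def best_match(profile, candidates):
    """Return the candidate with the highest score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: match_score(profile, c))


# A forum profile and two made-up people it might belong to.
forum_profile = {"username": "nightowl72", "first_name": "Pat", "gender": "F", "age": 38}
person_a = {"username": "NightOwl72", "first_name": "pat", "gender": "f", "age": 38}
person_b = {"first_name": "Pat", "gender": "M"}

likeliest = best_match(forum_profile, [person_a, person_b])
```

No single field gives the game away here. It is the accumulation of individually harmless matches that singles out one candidate over the other.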

When you consider that the same type of powerful computers that are used to analyze and process search engine queries are the ones processing millions or billions of information bits and instantaneously testing and slotting them based on relational patterns … it’s not hard to understand how, over time, eerily accurate portraits of individuals can be drawn that not only correctly reflect the “demographics” of the person, but also a host of psychographic and behavioral aspects such as:

 Shopping habits
 Recreational pursuits
 Personal finance profile
 Health information
 Political leanings
 Hobbies and interests
 Spirituality/religiosity
 Sexual preference or sexual proclivities

The WSJ articles detail how web sites are attempting to stay one step ahead of the “scrapers” by employing software that alerts them to suspicious “bot” activity on forums and other password-protected areas. It’s often a losing battle … and is that particularly surprising?

These days, not even the Orthodox monks at Mount Athos are protected, probably!
