The Big Dig: Scraping and Scooping the Web

I’ve blogged before about how the Internet is making people’s lives pretty much an open book.

Most people who are online are pretty aware of how their reputation can be affected by their Facebook or MySpace pages and other public or quasi-public online information. But The Wall Street Journal has been publishing a series of stories showing that digital snooping has become far more pervasive than that.

The series is titled “What They Know” … and it’s well worth checking out. The most recent article appeared on the front page of the October 12, 2010 edition of the WSJ, and focuses on the phenomenon of “data scraping.”

For those who aren’t familiar with the term, “scraping” is a method by which sophisticated software is used to access and scoop up information that has been posted anonymously on sites that are supposed to be closed to prying eyes. One example cited in the WSJ article of a site that has been scraped is PatientsLikeMe, which has message boards and forums dealing with mental disorders, depression and other issues that most people would prefer to keep private.

People who post on discussion forums like these do so using pseudonyms, and the identity of the posters is carefully guarded by the host sites.

But it turns out that these sites are little match for the sophisticated IT capabilities of companies like Nielsen and PeekYou, which are in the business of matching psychographics as well as demographics to individual people for purposes of serving up relevant advertising, and goodness knows what else.

Think of it as the “lifestyle” direct mail lists of yesteryear – but now on steroids.

PeekYou has applied for a patent on a system whereby it matches real people to the pseudonyms used on forums, blogs, Twitter and other social media outlets. Taking a “peek” at the company’s patent application reveals the great lengths to which its systems go to ferret out and cross-analyze small, innocuous bits of information that, taken together, yield the “needle in a haystack” match to the actual individual:

 Birthday match
 Age match
 First name match
 Nickname match
 Middle name match
 Middle initial match
 Gender match
 e-Mail address match
 Phone number match
 Physical address match
 Username match
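
To make the idea concrete, here is a rough sketch of how a handful of weak matches like these might be added up into one score. This is my own illustration, not anything taken from PeekYou’s patent application: the field names, the weights, the threshold and the function names are all made-up assumptions, and a real system would surely be far more elaborate.

```python
# Hypothetical weights: rarer, more identifying fields count for more.
# None of these numbers come from the patent application.
WEIGHTS = {
    "email": 5.0,
    "phone": 5.0,
    "username": 3.0,
    "address": 3.0,
    "birthday": 2.0,
    "first_name": 1.0,
    "nickname": 1.0,
    "middle_name": 1.0,
    "middle_initial": 0.5,
    "age": 0.5,
    "gender": 0.25,
}


def match_score(known, candidate):
    """Sum the weights of every field present in both records and equal."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a = known.get(field)
        b = candidate.get(field)
        if a is None or b is None:
            continue  # a missing field neither helps nor hurts
        # Compare case-insensitively, ignoring stray whitespace.
        if str(a).strip().lower() == str(b).strip().lower():
            score += weight
    return score


def best_match(known, candidates, threshold=4.0):
    """Return the highest-scoring candidate, or None if none clears the bar."""
    best = max(candidates, key=lambda c: match_score(known, c), default=None)
    if best is None or match_score(known, best) < threshold:
        return None
    return best
```

The point of the sketch is that no single field has to be conclusive. A shared first name and gender mean very little alone, but a username plus a birthday plus a first name can be enough to clear the threshold.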

When you consider that the same type of powerful computers that are used to analyze and process search engine queries are the ones processing millions or billions of information bits and instantaneously testing and slotting them based on relational patterns … it’s not hard to understand how, over time, eerily accurate portraits of individuals can be drawn that not only correctly reflect the “demographics” of the person, but also a host of psychographic and behavioral aspects such as:

 Shopping habits
 Recreational pursuits
 Personal finance profile
 Health information
 Political leanings
 Hobbies and interests
 Spirituality/religiosity
 Sexual preference or sexual proclivities

The WSJ articles detail how web sites are attempting to stay one step ahead of the “scrapers” by employing software that alerts them to suspicious “bot” activity on forums and other password-protected areas. It’s often a losing battle … and is that particularly surprising?
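
For a sense of what that kind of alerting software might look at, here is a minimal sketch of one common approach: flagging any account that requests pages faster than a person plausibly could. The class name, the limits and the method are my own assumptions for illustration; the WSJ articles don’t describe how any particular site actually does it.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flag accounts that make too many requests inside a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        # Hypothetical limits; a real site would tune these to its own traffic.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def record(self, account, timestamp):
        """Record one page request; return True if the account looks suspicious."""
        hits = self._hits[account]
        hits.append(timestamp)
        # Drop requests that have aged out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A scraper that slows itself down to human speed slips right past a check like this, which is one reason the sites so often lose.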

These days, not even the Orthodox monks at Mount Athos are protected, probably!

An About-Face on Facebook?

This past week, social networking site Facebook trumpeted the fact that it signed up its 500 millionth member. That’s an impressive statistic, and all the more so when you realize that Facebook had only about 100 million registrants just two short years ago.

And the site is truly international these days, with ~70% of Facebook users living someplace other than the USA.

But there are some interesting rumblings in cyberspace these days that suggest the bloom may be off the rose for Facebook. After having climbed to the #1 perch in terms of registrations and site traffic, there are some intriguing new signs that all is not well in Farmville – or elsewhere in the land of Facebook.

Inside Facebook, an independent research entity that tracks the Facebook platform for developers and marketers, is reporting new Facebook registrations dropped in June to ~250,000. That may still seem like a lot of people, but it’s a far cry from the ~7.7 million new registrants in May.

Furthermore, looking at age demographics, Inside Facebook has concluded that in the critical 26-34 age group, the total number of U.S. users active on Facebook actually declined during the month of June.

Are these people being swayed by the ongoing privacy debate over how much visibility Facebook postings are given on Google and other search engines?

That may be one explanation for the decline, but there could be other forces at work as well. The latest American Customer Satisfaction Index report from ForeSee Results, a web research and consulting firm, places Facebook’s ranking near dead-last on a list of 30 major online web sites in terms of customer satisfaction with site design and utility.

Who scored highest? Dowdy old Wikipedia. Even boring government sites like the IRS scored better.

It’s evident the issue goes far beyond privacy concerns. There’s also confusion or irritation with Facebook’s ever-changing user interface. As Aaron Shapiro wrote recently in MediaPost’s Online Media Daily:

“The truth is, Facebook isn’t fun to use anymore. It’s become a chore, just one more place that busy people have to log in to stay up-to-date. And Facebook is making the goal of staying up-to-date harder and harder to achieve. There are so many apps like Farmville producing status updates, as well as people using Facebook as their repository for passing thoughts and private/public conversations, I have to sort through tons of what I don’t want to read before I get to something I want or need to know.”

Back in its early days, the beauty of Facebook was that it provided such an easy framework to stay connected with family and friends. It was a way to share photos and other personal information quickly – and almost effortlessly – with far-flung contacts all over the world.

Those attributes seem to have gotten buried in all of the “spammy” hi-jinks and gimmicks that characterize so much of today’s Facebook.

Considering the growing dissatisfaction with Facebook, ranging from things like privacy (mis)management and ubiquitous advertising to confusion with the site’s ever-changing design and irritating lack of utility, some industry watchers are predicting that users will begin seriously looking at alternatives. Despite Facebook’s huge presence and large pool of registrants, they may find simpler, purer sites out there that are more to their liking. Two that could be beneficiaries of the “Facebook fall-off” are Diaspora and Collegiate Nation.