Sample Size vs. Sample Bias

Analysis, Business | October 11, 2006 at 5:40 pm

There are numerous posts online about how the various online measurement firms present very different views on things like unique users and page views. A few of the better ones I’ve read include Fred Wilson’s ‘Whose Numbers are right?’, Donna Bogatin’s ‘Data Attraction: Hard science or numbers game?’ and Avinash Kaushik’s detailed analysis of how Hitwise and Comscore get their data. I’ve even commented on Alexa’s fallibility w.r.t. Judy’s Book Traffic.

Someone invariably comments that the company with the biggest sample must be the most reliable. However, people often overlook that a statistically valid sample depends on both sample size and sample bias. In fact, sample bias is typically far more important if you’re trying to extrapolate from the data you find.

Sample Size

Sample size is a hard concept for people to grok. Yes, a larger sample is better, but a larger sample only narrows the margin of error of an estimate. Offline survey firms routinely survey only 500 to 1,000 people to estimate how the nation feels on an issue. As a simplified example, the margin of error at a 99% confidence level for different sample sizes looks like this (assuming a population of 250 million and using this sample size tool):

  • 1,000 (+/- 4.07%)
  • 10,000 (+/- 1.29%)
  • 100,000 (+/- 0.41%)
  • 1,000,000 (+/- 0.13%)
  • 10,000,000 (+/- 0.04%)

As you can tell, the extra 9 million sample points do little to increase your confidence that Google search improved or declined vis-a-vis Yahoo search. A larger sample does help collect enough data to estimate long-tail usage, but it doesn’t better quantify the top websites.
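For anyone who wants to check the arithmetic, here is a minimal sketch of the standard margin-of-error formula for a proportion. It assumes the worst case (a 50/50 split), a z-score of about 2.576 for 99% confidence, and a finite population correction for the 250 million population; the sample size tool mentioned above may use slightly different inputs, so treat the function name and defaults as my own assumptions.

```python
import math

Z_99 = 2.576  # approximate z-score for a 99% confidence level


def margin_of_error(n, population=250_000_000, p=0.5, z=Z_99):
    """Margin of error for an estimated proportion.

    p=0.5 is the worst case (widest interval). The finite population
    correction barely matters until n is a sizable share of the population.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,}: +/- {margin_of_error(n) * 100:.2f}%")
```

The key point is the square root: each tenfold increase in sample size shrinks the margin of error by only a factor of about 3.16.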

Sample Bias

Offline measurement firms go to great lengths to reduce the bias of a sample. Folks like Gallup, Ipsos & Nielsen understand that sample bias can completely corrupt the results of a survey or sample. Examples of bias offline include:

  • bias that someone has a landline (younger people are less likely to have landlines)
  • bias that someone is home when you call (families are more likely to be home at dinner)
  • bias introduced by where you approach someone (in a mall or a coffee shop as an example)

There is a great Wikipedia entry on the topic of selection bias (the kind discussed above), and on sample bias in general.

Online, bias can be far worse and it is much more difficult to generate unbiased samples. A few examples of the bias introduced by Hitwise, Comscore & Alexa:

  • Hitwise: biased towards users at home (they get their logs from consumer ISPs)
  • Comscore: biased towards people that click on ads to ‘speed up their internet’ or ‘protect your computer from email viruses’
  • Alexa: biased towards people that install the Alexa toolbar. These are widely believed to be webmasters who are curious about their own sites and the sites of others
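To illustrate how panel bias can swamp sample size, here is a small simulation. Every number in it is made up for illustration: I assume webmasters are 5% of the real population but 50% of a toolbar-style panel, and that they visit some hypothetical site far more often than everyone else. It compares a random sample of 1,000 against a biased panel of 100,000.

```python
import random

# Hypothetical figures, chosen only to illustrate the effect.
POPULATION_WEBMASTER_SHARE = 0.05  # webmasters in the real population
PANEL_WEBMASTER_SHARE = 0.50       # webmasters in the self-selected panel
WEBMASTER_VISIT_PROB = 0.60        # chance a webmaster visits the site
OTHER_VISIT_PROB = 0.10            # chance anyone else visits the site

TRUE_RATE = (POPULATION_WEBMASTER_SHARE * WEBMASTER_VISIT_PROB
             + (1 - POPULATION_WEBMASTER_SHARE) * OTHER_VISIT_PROB)


def draw_visit(rng, webmaster_share):
    """Draw one person from a group mix and report whether they visited."""
    is_webmaster = rng.random() < webmaster_share
    visit_prob = WEBMASTER_VISIT_PROB if is_webmaster else OTHER_VISIT_PROB
    return rng.random() < visit_prob


def estimate_visit_rate(rng, n, webmaster_share):
    """Estimate the visit rate from n people drawn with the given mix."""
    return sum(draw_visit(rng, webmaster_share) for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
small_unbiased = estimate_visit_rate(rng, 1_000, POPULATION_WEBMASTER_SHARE)
large_biased = estimate_visit_rate(rng, 100_000, PANEL_WEBMASTER_SHARE)

print(f"true rate:             {TRUE_RATE:.3f}")
print(f"1,000 random people:   {small_unbiased:.3f}")
print(f"100,000 biased panel:  {large_biased:.3f}")
```

Under these assumptions the large panel settles very precisely on the wrong answer, while the small random sample lands near the true rate. More data only makes the biased estimate more confidently wrong.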

In addition, all of the basic stats on these services can easily be gamed, introducing further bias. See Markus Frind’s post on how these services get gamed.

It is easy to see that online, sample bias has a greater impact on data quality than differences in sample size. One final note on sample bias: it is OK if you are aware of the bias and can quantify it (and therefore remove it). This can be an incredibly complex process, though, and can easily produce imperfect results that are believed to be accurate.
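As a rough sketch of what "quantify and remove" can look like, the snippet below reweights a biased panel so that each group counts in proportion to its real share of the population (a simple form of post-stratification). The groups, shares and visit rates are hypothetical, and I am not claiming any of the measurement firms do it this way.

```python
import random

# Hypothetical figures, chosen only to illustrate the technique.
POPULATION_SHARE = {"webmaster": 0.05, "other": 0.95}
PANEL_SHARE = {"webmaster": 0.50, "other": 0.50}
VISIT_PROB = {"webmaster": 0.60, "other": 0.10}


def draw_panel(rng, n):
    """Simulate a biased panel as a list of (group, visited) pairs."""
    panel = []
    for _ in range(n):
        is_webmaster = rng.random() < PANEL_SHARE["webmaster"]
        group = "webmaster" if is_webmaster else "other"
        panel.append((group, rng.random() < VISIT_PROB[group]))
    return panel


def raw_estimate(panel):
    """Visit rate taken straight from the panel, bias included."""
    return sum(visited for _, visited in panel) / len(panel)


def reweighted_estimate(panel, population_share):
    """Weight each group's visit rate by its share of the real population."""
    estimate = 0.0
    for group, share in population_share.items():
        outcomes = [visited for g, visited in panel if g == group]
        estimate += share * sum(outcomes) / len(outcomes)
    return estimate


rng = random.Random(0)  # fixed seed so the run is repeatable
panel = draw_panel(rng, 100_000)
true_rate = sum(POPULATION_SHARE[g] * VISIT_PROB[g] for g in POPULATION_SHARE)

print(f"true rate:   {true_rate:.3f}")
print(f"raw panel:   {raw_estimate(panel):.3f}")
print(f"reweighted:  {reweighted_estimate(panel, POPULATION_SHARE):.3f}")
```

The correction works here only because the simulation knows exactly which group each panelist belongs to and the true population shares. In practice the bias has many dimensions that are hard to observe, which is where the false confidence creeps in.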

All of these services provide interesting data, but they are not dependable enough to base strategic business decisions on.



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. | Dave Naffziger's Blog | Dave & Iva Naffziger