Friday, June 25, 2010

Bridging the Numerati-Ignoscenti tracking divide?



I've just finished reading an informative book on the likes of us - "They've Got Your Number" by Stephen Baker. The book talks about the Numerati - "mathematicians who are mapping our behaviour" in various industries, not just e-commerce: the workplace, politics, blogging and healthcare, for example. There were a number of themes in the book, none of which came as a surprise. For example, Baker talks about the large amounts of data available in each scenario, and how powerful mathematical tools and knowledgeable analysts are required not only to derive insight but to interpret the data correctly. In the chapter on terrorism he pointed out the importance of NSA (or GCHQ) analysis being correct first time; in other industries, as Avinash likes to point out, we can (and should) learn from our mistakes; indeed, failing makes success easier.

Whilst Baker's book didn't try to paint a picture of illicit snooping or stir up the usual scare stories, it did get me thinking about how this subject is perceived by the general public. There is a lot of information available on internet technology, and more of it is filtering into the public arena. For example, browser selection is becoming more sophisticated; whereas a couple of years ago Firefox was the preserve of net geeks, now my parents are using it - Microsoft's share of the market is eroding. But it's not just browser choice that people are becoming more au fait with; it's the contents of the browser's options menu, and with it cookie blocking, private browsing and opt-out addons.



Whilst we should respect the wish for privacy of those who've chosen to block cookies, adopt private browsing or install these addons, we should not be scared of making the case for tracking, so that these people have all the facts at their disposal before they make their decision. As people become more aware of the perceived murky world of corporate tracking, without a clear counter-argument it's easy for the public to assume tracking is of no benefit to them, or worse. And yet one of the most popular websites on the planet is in that position precisely because of its tracking: people agree that Amazon is a great site, and are impressed by its cross-selling abilities and its recommendations based on their search history (both on and off the site). It surely shouldn't be hard to use this to sell the benefits of tagging a site. Whilst it's becoming fashionable to talk about how we live in a "Big Brother" society of constant surveillance, be it CCTV or online tracking, it should be possible to distinguish between a true "Big Brother" society, in which monitoring takes place to crush dissent, and one built to help people do what they want on a website more effectively.

So how do we go about getting rid of this "Big Brother" image before the battle's lost?
1. Site transparency. A clearly stated (i.e. not legal-speak) and up-front privacy policy page (i.e. not hidden away in the smallest font possible somewhere inaccessible), explaining the methods used and the information gleaned.
2. Present a clear case to the public. Whilst the case is clear, how it should be communicated is less so. Is this something for the WAA to do? The case needs to be made globally, and whilst they have a presence across many countries, this is something which needs to get into the living rooms of people across the world. Web analytics is being discussed in German and American parliaments at the moment; maybe petitioning your local politician to raise a question could bring it into the public domain. What is clear is that the internet is a global phenomenon, and, as with policing it, lobbying it is hard to do.
3. Better education. In a previous post I discussed the importance of educating children about the internet. IT is an important topic, and learning about using the internet is a major part of it, be it tracking, site construction or communication. Informing young people of all the facts at an early age is the best way to remove this image, if a slightly long-termist one...
4. Improve your site! Earn the right to stop people deleting your cookies - people would be more reluctant to delete their cookies for a site if they got an Amazon experience from it.

So there we have it, my thoughts on how we can turn the ignoscenti into the cognoscenti. Have I left anything out? I'd love to hear your comments.

Wednesday, June 9, 2010

Applying statistical rigour to web analytics reporting

Web analytics is all about making decisions from the data. But how can you be sure of the quality of the data you investigate, and the recommendations you provide from it? Whilst the numbers may be accurate and reflect what happened on your site thanks to a successful tagging implementation, are they statistically significant? Furthermore, once you've uncovered what's fluke and what's not, how can you illustrate this succinctly in your reporting?

Unfortunately, with a few limited exceptions, vendors don't provide any indication of the robustness of the data in their consoles. Wouldn't it be great if, for a selected time period, you could segment your site and see that, although there's a marked difference in conversion (or your metric of preference), it's only significant at the 50% level? Or, alternatively, that what appears to be only a small difference in bounce rate is actually statistically significant? Until that day comes, though, you need to be able to do it yourself. Avinash wrote about a couple of related topics a while back - applying statistical limits to reporting and an overview of statistical significance. In a more recent post, Anil Batra highlights the importance of not rushing to pick the winning page from an A/B test. And in the last few days, Alec Cochrane has written a great piece on how to improve the statistical significance of a model based on web analytics data.

There are plenty of statistical tests out there, with different datasets and situations that call for them, but for the purposes of this post I'll focus on just two, both of which are listed here amongst others.

The two-proportion z-test assesses whether the proportions observed in two samples are statistically different from one another.


This test can be applied to a number of reports within web analytics, but its main use would be comparing the click-through rate or response rate of two campaigns to determine whether one is conclusively better than the other. The beauty of this test is that it requires only four values - the two proportions (%s) and the two sample sizes - and as such can be calculated without the use of a spreadsheet.
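To make the calculation concrete, here's a minimal sketch of the two-proportion z-test in Python. The function name and the campaign figures are my own illustration, not from any vendor tool; the formula is the standard pooled-proportion version.

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-proportion z-test: are two observed proportions
    (e.g. campaign click-through rates) statistically different?

    p1, p2 -- observed proportions (e.g. 0.032 for a 3.2% CTR)
    n1, n2 -- sample sizes (e.g. impressions per campaign)
    Returns the z statistic; |z| > 1.96 is significant at the 95% level.
    """
    # Pooled proportion under the null hypothesis that p1 == p2
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two proportions
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical figures: campaign A gets a 3.2% CTR from 10,000
# impressions, campaign B a 2.8% CTR from 12,000 impressions.
z = two_proportion_z_test(0.032, 10000, 0.028, 12000)
print(round(z, 2))  # 1.74 -> |z| < 1.96, so not significant at 95%
```

Note that although campaign A's rate looks clearly better, the test says the gap could still plausibly be noise at these sample sizes - exactly the kind of fluke the post is warning about.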

The second test is the two-sample t-test, which determines whether the means of two samples are statistically different from each other, given the two sample sizes and sample standard deviations.


Because it requires the standard deviations of both samples, this test takes more time to compute: the user needs to download the underlying series data. It has a variety of uses, for example comparing whether the average values of a given metric for two segments are statistically different, or looking at the same data series before and after an external event to determine whether that event has had a statistically significant effect on the data.
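As a sketch of the arithmetic involved, here is Welch's version of the two-sample t-test (which doesn't assume equal variances) in Python. The segment figures are invented for illustration.

```python
import math

def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t-test: are two sample means statistically
    different, given their sample standard deviations and sizes?
    Returns (t statistic, approximate degrees of freedom)."""
    # Squared standard error contributed by each sample
    se1, se2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical example: average order value for two visitor segments
t, df = welch_t_test(54.2, 18.5, 400, 51.0, 21.3, 350)
print(round(t, 2))  # 2.18 -> significant at the 95% level
```

With the t statistic and degrees of freedom in hand, the corresponding p-value can be looked up in a t-distribution table (or via a stats library) to state the confidence level precisely.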

Now that you're confident you know which results are statistical flukes and which aren't, how do you go about illustrating this in your reporting? One option would be to include the t-test results and significance levels in your reports, but this is likely to clutter them as well as potentially confuse the reader. A neater way might be to colour-code the values to illustrate their confidence levels, if you're happy to introduce different text colours to your report. For time series data you can add the mean and upper and lower bounds to a graph, to show which peaks and troughs merit further investigation.
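The mean-and-bounds idea for time series can be sketched very simply: compute the series mean and standard deviation, and flag any point outside mean ± 1.96 standard deviations. This is a rough screen that assumes the metric hovers around a stable mean; the function name and visit figures below are made up for illustration.

```python
import statistics

def flag_outliers(series, z=1.96):
    """Flag points outside mean +/- z * standard deviation.
    A rough screen for the peaks and troughs that merit further
    investigation, assuming a roughly stable, normal-ish metric."""
    mean = statistics.mean(series)
    sd = statistics.stdev(series)
    upper, lower = mean + z * sd, mean - z * sd
    # Return (index, value) for each point outside the bounds
    return [(i, v) for i, v in enumerate(series) if v > upper or v < lower]

# Hypothetical daily visits: a quiet fortnight with one spike on day 9
visits = [1020, 980, 1005, 995, 1010, 990, 1000, 1015, 985, 1600,
          1008, 992, 1003, 997]
print(flag_outliers(visits))  # only the day-9 spike is flagged
```

In a report, the same upper and lower bounds can be drawn as horizontal lines on the graph, so stakeholders can see at a glance which movements are worth asking about.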

Of course, once you've come up with a clear and succinct way of displaying this statistical information, you still need to explain it to your stakeholders, not all of whom will have a background in statistical analysis. Demonstrating the robustness of the data, and how the varying levels of robustness are determined, will not only provide extra confidence in the decisions you're recommending from the data, but also illustrate the importance of asking constructive questions of the data, rather than slavishly following what it suggests at first glance.

Images from Wikipedia.org