In my last article I had a quick look at some tools for the analyst's toolkit beyond the web analytics system. These included Business Intelligence or OLAP tools, visualisation tools, and statistical analysis and data mining tools. This week I want to take a deeper look at the use (and possible abuse) of statistical analysis and data mining techniques.

Statistical analysis and data mining cover a wide variety of approaches, methodologies and techniques that might be useful for the web analyst. They can broadly be classified as follows:

  • Statistical analysis
  • Classification techniques
  • Clustering and segmentation methodologies
  • Forecasting
  • Text analysis

It’s probably best to start with a note of caution. There’s a saying: “If you torture the data long enough, it will tell you anything you want it to.” These kinds of data analysis techniques can be very powerful and can uncover nuggets of gold in your data, but they need to be used carefully. The analyst needs to ensure that the results are robust, reliable and, above all, make sense. Data mining is as much an art as it is a science.

Simple statistical analysis techniques such as frequencies and histograms can reveal interesting patterns in your data. I’ve written before about the dangers of using average metrics such as “average pages per visit”, as they hide interesting differences in behaviour. Worse than that, they can actually be misleading.
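To make the point about averages concrete, here is a minimal sketch in Python. The visit data is entirely made up for illustration, and the variable names are my own. It compares the average with the median and a simple text histogram of the same data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical pages-per-visit counts for ten visits (illustrative data only).
pages_per_visit = [1, 1, 1, 1, 1, 1, 2, 2, 15, 25]

average = mean(pages_per_visit)
middle = median(pages_per_visit)
frequencies = Counter(pages_per_visit)

print(f"average pages per visit: {average}")
print(f"median pages per visit:  {middle}")

# A crude text histogram: one '#' per visit with that page count.
for pages, visits in sorted(frequencies.items()):
    print(f"{pages:>3} pages | {'#' * visits}")
```

In this invented data set most visits view a single page, yet two very long visits pull the average well above what a typical visitor does. The frequency table shows that straight away; the average on its own does not.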

Often in the work we do, we will spend a lot of time initially carrying out exploratory analysis looking at the patterns and distributions in the data. It’s time well spent. It gives you a feel for what is going on below the topline metrics and also helps later when you begin to look at the results of other analytical techniques. As a marketing analyst you need to have a sense of how the data is made up, how the topline metrics are constructed and where they come from. For example, you may find that there are some extreme values or “outliers” that might affect your results and so need to be dealt with in some way or another.
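One common way of flagging extreme values during exploratory analysis is the interquartile range (IQR) rule. The sketch below assumes a hypothetical list of order values, and the 1.5 × IQR fences are a widely used convention rather than anything specific to the work described here. Whether a flagged value should be removed, capped or left alone is still a judgement call for the analyst.

```python
from statistics import mean, quantiles

# Hypothetical order values (illustrative data only).
order_values = [20, 22, 25, 27, 30, 31, 33, 35, 38, 400]

# Quartiles split the sorted data into four equal-sized groups.
q1, _, q3 = quantiles(order_values, n=4)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in order_values if v < lower_fence or v > upper_fence]
typical = [v for v in order_values if lower_fence <= v <= upper_fence]

print(f"flagged as outliers: {outliers}")
print(f"mean with outliers:    {mean(order_values):.2f}")
print(f"mean without outliers: {mean(typical):.2f}")
```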

With statistical analysis you may want to compare different groups of visitors or customers. For example, you might look at whether the repeat order rate is higher amongst some groups of customers than others. You can apply statistical tests to see whether any differences are statistically significant or whether they might simply be due to the variability in the data. Significance testing can be important in experiments such as A/B tests to ensure that “A” is really better or worse than “B” before making any changes to the site.
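As an illustration of significance testing for an A/B test, here is a minimal sketch of a two-proportion z-test, one of several tests that could be used for comparing conversion rates. The function name and the visitor and order counts are hypothetical, and the test relies on a normal approximation that assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: version A had 200 orders from 5,000 visitors,
# version B had 260 orders from 5,000 visitors.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference is unlikely to be down to random variability alone. It does not tell you whether the difference is large enough to matter commercially.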

There are many different types of “classification” techniques including regression analysis, often used in credit scoring, as well as Artificial Intelligence approaches including neural networks. The class of techniques that I want to take a look at today is “decision trees”. There are a number of different algorithms of this type including CHAID, CART and QUEST. These algorithms essentially do the same thing in different ways, and that is to assign the data records (such as visitors or customers) into groups of interest based upon the other variables that you have on the record.

For example, you may have records on customers that split them into two groups: “single order customers” and “repeat customers”. You may also have a whole string of other data on those customers, and you are interested in understanding the key characteristics that distinguish someone who orders once from someone who goes on to order again. Decision tree methods will look at all the other variables and determine which one is the most important factor in determining the difference between a single order shopper and a repeat order shopper. The process is then repeated again and again within each resulting group until all the significant factors have been determined, in order of priority.
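The first step of that process, finding the single most important variable, can be sketched in a few lines. The example below scores each candidate variable by how much it reduces Gini impurity, which is the measure used by CART-style trees; CHAID and QUEST use different statistical criteria. The customer records, field names and function names are all hypothetical and exist only to show the mechanics.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def impurity_reduction(records, feature, target):
    """How much splitting on `feature` reduces the impurity of `target`."""
    parent = gini([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[feature], []).append(r[target])
    children = sum(len(g) / len(records) * gini(g) for g in groups.values())
    return parent - children


def best_split(records, features, target):
    """Return the feature with the largest impurity reduction, plus all scores."""
    scores = {f: impurity_reduction(records, f, target) for f in features}
    return max(scores, key=scores.get), scores


# Hypothetical customer records (illustrative data only).
rows = [
    (True, True, True), (True, False, True), (True, True, True),
    (True, False, False), (False, True, False), (False, False, False),
    (False, False, True), (False, False, False),
]
fields = ("opted_in_newsletter", "large_first_order", "repeat_customer")
customers = [dict(zip(fields, row)) for row in rows]

best, scores = best_split(
    customers, ["opted_in_newsletter", "large_first_order"], "repeat_customer"
)
print(f"most important factor: {best}")
for feature, score in scores.items():
    print(f"  {feature}: impurity reduction {score:.3f}")
```

A full decision tree would then apply the same search within each resulting group, stopping when no remaining variable gives a worthwhile improvement. In practice you would use an established statistics package rather than hand-rolled code like this.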

The great thing about decision trees is that the output is very visual and relatively easy to understand. They can get a bit big and cumbersome though, especially if you are dealing with a lot of variables. Decision tree techniques have been used for years in direct marketing work to determine which types of people are most likely to respond to mailings, so that companies can cut down on mailing costs.

In online marketing, mailing costs aren’t such a big issue as they are in the offline world, but we have used techniques like decision trees in other areas to understand what factors influence visitors to do something or not. For the example above of single order customers vs repeat order customers, we did a piece of work where we looked at many potential factors, including:

  • the size of the first order
  • the number of visits to the website after the first order
  • the product category of the first order
  • the product categories browsed after the first order
  • whether they were opted in to the email newsletter
  • how many newsletters they had received
  • the timing of the newsletters after the first order

We found that the most important factor in determining whether someone went on to order again after their first order (out of all the ones we examined) was that someone had opted into the email newsletter and had received a newsletter within 5 days of that first order. Vital input into a retention marketing programme.

Decision tree techniques are also useful for profiling and understanding different segments of visitors or customers. Segmentation techniques are what I will be looking at in the next part of this series.

Till then…
