Steps 4 and 5 – Modelling and Evaluation, The Theory
Now for the serious stuff (or the fun stuff, depending on your inclination!). The modelling phase is, of course, at the core of any predictive analytics effort. CRISP rightly treats modelling and evaluation as separate steps, which emphasises the importance of the latter. However, the two are intrinsically linked, so we will consider them together here.
As this is really the central issue I’ll break it into two parts. Here we’ll talk about the theory of how we go about it, and in the next blog entry I’ll try to “make it real” with a practical example.
Just to recap on how we got here. Starting with a research or business objective we’ve garnered enough understanding to embark on a predictive exercise. Furthermore we’ve explored the data and found predictive potential. More than likely we’ve uncovered enough relationships in the data as we explored it to indicate that patterns exist which will allow us to predict the outcome(s) of interest.
So who can do this?
Traditionally predictive modelling has been the domain of the expert: the statistician, mathematician, econometrician, the numerate researcher or the more expert “analyst”. This is still largely the case today, but we are seeing increasing signs of analytical democratisation.
Some of the contemporary tools discussed below require less expertise to develop models because of smarter user interfaces. There has also been a move to more automated algorithms, like decision trees, where the analyst does not need to know as much about the data structures or the requirements/assumptions of the algorithm to specify the correct analysis. More traditional statistical methods, regression for example, do require the analyst to understand the technique well enough to specify the right settings/options and to follow certain rules about the data, e.g. that the input variables are not too highly correlated with each other (i.e. “multicollinear”). Methods and algorithms from the world of Artificial Intelligence, e.g. neural nets and trees, are generally more tolerant of different data patterns and have fewer options for the analyst to worry about.
Nevertheless, it is rarely the case that we can just “press the button” without a certain level of expertise in the analytical tool and/or the handling of data. But with a few days’ training most business and research users should be able to run models even in the most advanced tools. More specifically developed Analytical Applications can often make deeper analytical methods accessible to a broader, less expert audience.
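To make the “not too highly correlated” rule a little more concrete, here is a minimal Python sketch of the kind of check an analyst might run before fitting a regression. The variable names, the data and the 0.9 threshold are all invented for illustration; they are assumptions, not a standard, and a real tool would offer more thorough diagnostics.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(inputs, threshold=0.9):
    """Return (name_a, name_b, r) for each pair of inputs with |r| >= threshold.

    The 0.9 default is an arbitrary illustrative cut-off.
    """
    flagged = []
    for a, b in combinations(sorted(inputs), 2):
        r = pearson(inputs[a], inputs[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical customer inputs, made up for this example.
inputs = {
    "monthly_spend": [20, 35, 50, 65, 80, 95],
    "annual_spend": [250, 410, 610, 770, 970, 1130],  # roughly 12 x monthly
    "tenure_months": [30, 4, 18, 55, 9, 22],
}

flagged = highly_correlated_pairs(inputs)
print(flagged)
```

Here only the monthly/annual spend pair is flagged, which suggests keeping just one of the two as an input to a regression. A tree or neural net would typically be less sensitive to this.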
And how?…
For heavy duty predictive modelling the analyst will typically have an arsenal of predictive tools and algorithms at his/her disposal. We’ll revisit the various tools/platforms later, but the vendors who probably offer the most are SAS and SPSS, though there are some relatively new entrants making headway, such as KXEN, Salford Systems and Think Analytics. See the Gartner Magic Quadrant for Customer Data Mining for one view of the landscape of predictive software tools.
In the last step we spent some time ensuring that the data was in the right shape for this step. Hence, in the simplest sense the modelling process itself is just about defining the input and output variable(s) of interest and building and evaluating multiple models.
Which method to choose?
In part of course this will depend on what you have available. If you only have Excel then, without purchasing an add-on like XLMiner, you have access to the models available in the Excel statistical pack. As I mentioned in earlier blogs if you are entering the predictive arena for the first time you may want to consider some of the freely available software, particularly R. The caveat to this is that, as I write, you need to be able to learn the R language to drive the models. I am not currently aware of any particular user interfaces that help accelerate the usage. Despite that initial technical hurdle R does offer a very impressive range of modelling algorithms. Alternatively you may have one, or more, of the toolsets from the Gartner Quadrant mentioned earlier.
We should probably try as many of the appropriate candidate models as time allows. Some, particularly those that come from classical statistics (see the earlier point), may not be appropriate because of the shape of the data and so may be ruled out. Going in, especially with new data, it is usually difficult to know which type of model will give us the best predictions. From experience, analysts may like to start with methods they know have produced the best models on what feels like similar data.
So what is a model?
The different types of algorithms construct models in different styles, but at the most abstract level a model defines a pattern, or relationship, between the input variables and the output (outcome) variables. A statistical regression model, for example, will use a mathematical formula to achieve this. A Decision Tree/Rule induction model will produce a tree or a set of rules to characterise the relationship. A Neural Network model, by contrast, will typically build a more opaque view of the relationships by connecting an abstract network of nodes, links and weights to encapsulate the underlying pattern.
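As a rough illustration of the first two styles, the Python sketch below fits a one-input regression (the model is a formula) and a single-split “tree” (the model is a rule) to a tiny made-up data set. The function names and the data are assumptions for illustration only, and the rule learner is deliberately simplistic: it only considers rules of the form “input at or below a threshold means churn”. A neural network is not shown because its weights would not be readable in the same way.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_stump(xs, labels):
    """Find the threshold t for the rule 'x <= t means 1' with the fewest errors."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        errors = sum((x <= t) != bool(label) for x, label in zip(xs, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Hypothetical data: tenure in months, monthly spend, and a churn flag.
tenure = [2, 5, 8, 14, 20, 30]
spend = [10, 16, 22, 34, 46, 66]
churned = [1, 1, 1, 0, 0, 0]

intercept, slope = fit_line(tenure, spend)
print(f"Regression model: spend = {intercept:.1f} + {slope:.1f} * tenure")

threshold, errors = fit_stump(tenure, churned)
print(f"Rule model: IF tenure <= {threshold} THEN churn ({errors} training errors)")
```

Both outputs are “models” in the sense used above: each is a compact description of the relationship between an input and an outcome that can be applied to new cases.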
The core train/test process
One of the beauties of predictive analytics is the way in which we construct a simple experimental structure which allows us to test (validate) models on unseen data. This empirical approach, if it is done properly, gives us a pretty good approximation of how the models will perform when deployed in a live setting on new data. For example, let’s say we have a data set from a period in time for which we know which customers churned or stayed. We would typically model a customer’s likelihood to churn on a subset (60% say) of that data and then test it on the other 40% to see how well the model predicts churn. If the accuracy is good enough (and that depends on the success criteria that we defined) then, all other things being equal and provided we constructed a representative enough data mining table, we would expect similar results when we use the model going forward in a live setting. Usually this means that we randomly split the data into two subsets:
- The training subset is the one used to build the model (the 60% in the churn modelling scenario described above).
- The testing subset is the one we use to evaluate the model (the 40% in the above scenario). This second set is used to simulate what we want to do in practice (when we deploy); that is, to use our model to accurately predict the outcome(s) of interest.
We do this because the true test of a model is not how well it can predict the outcome when it already knows it (which is what it does with the training subset), but how well it can predict the outcome when it doesn’t know what the outcome is.
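The train/test process described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the synthetic churn data, the 60/40 split via a seeded shuffle, and the deliberately trivial “model” (a single tenure threshold). A real project would use the algorithms from a proper modelling tool, but the experimental structure would be the same.

```python
import random

def split_train_test(rows, train_fraction=0.6, seed=42):
    """Randomly split rows into a training subset and a testing subset."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(threshold, rows):
    """Share of rows where the rule 'tenure <= threshold means churn' is right."""
    hits = sum((tenure <= threshold) == churned for tenure, churned in rows)
    return hits / len(rows)

def fit_threshold(train):
    """Pick the tenure threshold that is most accurate on the training rows."""
    candidates = sorted({tenure for tenure, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))

# Synthetic history, invented for this example: short-tenure customers
# are made more likely to have churned.
rng = random.Random(0)
rows = []
for _ in range(100):
    tenure = rng.randint(1, 60)
    churned = rng.random() < (0.8 if tenure <= 12 else 0.15)
    rows.append((tenure, churned))

train, test = split_train_test(rows)       # 60% to build, 40% to evaluate
threshold = fit_threshold(train)           # the model only ever sees train
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)  # the honest estimate

print(f"Rule: tenure <= {threshold} means churn")
print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Testing accuracy:  {test_accuracy:.2f}")
```

The testing figure is the one to compare against the success criteria, since the training figure is measured on data the model has already seen.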
So how good is my model (really)?
Until now we have only considered how accurate the model is by looking at the percentage of the time it gets the prediction right, e.g. when predicting churners. In practice, of course, this is only part of the evaluation process. We may find, for example, that our model is good at finding low value fraud (of which there is likely to be more, and hence our overall percentage of correct predictions is higher) but that the more valuable transactions, which hurt us more, are missed. One way to address this could be to focus the modelling on the cases that matter most, e.g. by creating a subset which concentrates on the valuable minority while still being sufficiently representative to be deployable. Either way, our evaluation of candidate models, and hence our choice of which models to continue to develop and refine, should be led by model evaluations which include all the factors that we really care about. These are often around the cost/benefit of the actions that the model would have us take in the field to act on its predictions. This is where more involved simulations enable us to make more meaningful assessments of the future impact of a model.
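The fraud example can be made concrete with a small Python sketch that scores two hypothetical models both ways: by simple accuracy and by a crude cost/benefit measure. The transactions, the two sets of predictions and the review cost of 50 per flagged transaction are all invented for illustration; a real evaluation would use the organisation’s own costs and benefits.

```python
# Hypothetical test-set transactions: (amount, is_fraud).
transactions = [
    (20, True), (35, True), (15, True), (40, True),
    (5000, True), (8000, True),
    (60, False), (25, False), (900, False), (45, False),
]

# Hypothetical predictions: True means the model flags the transaction.
predictions = {
    "A": [True, True, True, True, False, False, False, False, False, False],
    "B": [False, False, True, False, True, True, False, False, True, False],
}

def accuracy(flags):
    """Share of transactions where the flag matches the true fraud label."""
    hits = sum(flag == is_fraud
               for flag, (_, is_fraud) in zip(flags, transactions))
    return hits / len(transactions)

def net_benefit(flags, review_cost=50):
    """Fraud value caught minus an assumed fixed cost per flagged transaction."""
    total = 0
    for flag, (amount, is_fraud) in zip(flags, transactions):
        if flag:
            total -= review_cost
            if is_fraud:
                total += amount
    return total

results = {name: (accuracy(flags), net_benefit(flags))
           for name, flags in predictions.items()}

for name, (acc, benefit) in results.items():
    print(f"Model {name}: accuracy {acc:.0%}, net benefit {benefit}")
```

In this made-up case model A is the more accurate (80% against 60%) because it catches the many small frauds, yet it loses money once review costs are counted, while model B catches the two large frauds and comes out far ahead. Which measure leads the evaluation changes which model we would choose to refine.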
Next we will take a real life example to better illustrate how this step can work in practice…
This entry was posted on 15 Mar 2007 by John McConnell.


