Step 3 – Data Preparation
Anyone who has ever analysed data knows what a nuisance it can be. Whenever we want to analyse it in a new style we often have to manipulate in some way before we can do so. The more “raw” the data, or the more fundamentally different the analysis, the more work we typically have to do to get into the shape we need for the analysis we want to perform.


As I mentioned in an earlier blog; if the primary data source is a data warehouse which contains well structured, rigorously cleaned and de-duped data, then this is usually the best starting point. But it is only that. The shape of the data tables in the warehouse will inevitably have been defined with a certain type of analysis in mind; most often to produce the standard business intelligence style of reports. You might get lucky and find you can use that data as-is for the kinds of predictive analysis you have in mind. The chances are that you won’t, and that you will have to re-structure the data in preparation for that analysis.
Furthermore it may well contain aggregated data, perhaps an OLAP structure of some sorts, which may allow you to produce time series forecasts but which will most likely contain data which is too summarised for most other kinds of predictive analysis. If this is the case then you’ll probably need to go back and locate the sources of the summary data. That might not be a trivial exercise.

How did we get here?
In the previous steps, discussed in other blog entries, we effectively designed the analyses which we intend to perform at the next step; Data Modelling. In Data Understanding we learnt all about the existing data structures, formats and sources and we started to look for patterns in those sources which are pertinent to the analytical objectives we defined at the start of the process. The truth is that, to perform the exploration, we would have had to prepare the data to some extent. But this is the point where we get serious and apply the necessary data management steps to get the data into the shape(s) required for the main task; predictive modelling.
At the top level this means we end up doing one or more of the following:

  • Cleaning data
    This may not be necessary depending on how “clean” the original source is (though it is not unusual to find data problems when we start to analyse it in an unfamiliar way). Our previous exploration should have revealed any errors, or inconsistencies, which need to be corrected, or excluded.
  • Merging data from multiple sources
    If you are lucky the data will be in a single data file, or a single table in a database. If you are unlucky it will be in a variety of disparate sources with different formats in various locations
  • Shaping it for the analysis
    Often the most time consuming element. A classic example is where we have data with a sequence to it; typical if we are looking to predict the likelihood of a an event given a set of previous events. The starting point is typically data in a database which often contains all event transactions. In order to model it in a way which mimics how we will look to apply (deploy) the model we have to define an appropriate point in history as the baseline, e.g. if we are interested to know what will happen after March 2007 we might use March 2006 as that anchor point. We then have to restructure the incoming data to derive all the interesting predictors e.g. transaction frequency, transaction value in previous months, years, etc. from March 2006 backwards. We also need to have a separate data partition which contains the “what happened next” data for a period after March 2006 that corresponds to the period we want to predict into in 2007; so if we are interested to see which customers are likely to churn in April 2007, then April 2006 is likely to be the best month to look at it 2006. NB. Modelling and Evaluation (see later) will help test that hypothesis.
  • Deriving new data elements
    Typically new fields(variables). In our exploration, for example, we may have found that there appears to be a strong relationship between the rate at which a customer buys products and the likelihood that they will churn. In many cases that rate will not exist as a separate measure in the current data, so we create it in this step.
  • Describing it
    Labelling, formatting and generally documenting the data in a way which helps the analyst, or other viewers of the data, to understand its meaning.
    The outcome of the above is a set of tables, or data files, which are in the shape we believe we need for the modelling effort we have in mind.

You [almost always] never get it right first time
We’ve mentioned it before but it is worth re-stating that much of the CRISP process is iterative. Quite often we will get into the modelling step, for example, and discover a potential relationship that looks interesting but which we have to go back to the preparation process to derive. Frequently, because we are often building complex data handling processes from scratch, we just make mistakes which need to be corrected.
With large datasets the preparation time can be significant; It can take hours … sometimes days, so mistakes and re-runs can be costly. Hence wherever possible it is a good idea to test the process using data subsets, ideally random, or at least representative, samples. Samples can also be used to boost productivity when we get into the analysis…more on that next time.

An example
Data collected in the web channel is a great illustration of this point. We work with a lot of this kind of data typically for web sites with large numbers of visitors; usually millions per week. These sites inevitably have a web analytics tool which they use to analyse key metrics of site performance. Most often we are interested to apply predictive and/or segmentation methods to the site data. This typically involves:

  • Extracting behavioural data from the data warehouse (underlying the web analytics tool) or via a data feed that the analytics vendor provides. More often than not we extract this data to a number of text format files.
  • For our Customer Journey Framework we usually have an additional data source in the form of on-line surveys. Depending on the analytics tool that the client is using we have developed a number of ways of linking the data that the visitor provides as a respondent in the survey to the behavioural data which maps that visitors journey through the site.
  • The data we end up with can be at various levels but more often than not it is at the individual page or individual click level (remember these sites have millions of visitors so the number of records gets multiplied up). We take this data and aggregate it over a period of time to end up with tables for analysis which are at the visit and/or visitor level. Each of the resulting records will contain fields of interest; e.g. site content viewed, visit intentions and conversion goals which we will use for analysis.

For a typical site processing a weeks worth of data into the shape needed for analysis can take 4-6 hours.

Which tools?
As is often the case the choice of tool for data management comes down to those that the analyst/data is familiar with. Database tools are all about this type of work and often the best approach is to aim to construct data mining tables inside a relational database. This can be achieved using a combination of SQL, ETL tools and other database utilities.
Generally speaking; the more sophisticated the predictive tool itself the greater the data management capabilities which are built in. So SPSS, SPSS Clementine, SAS and SAS Enterprise Miner offer a broad range of data handling procedures.

So much for progressEven though we have more and better tools, and faster hardware, with which to manipulate data these days this is offset by the increasing volume of data, complexity of structures and number of sources. Hence the old adage that data management consumes more of a data analysis effort than the analysis itself typically holds as much today as it ever did. But it is a necessary pain to get us to the point where we can get to the next step which is at the core of the predictive process; Data Modelling.

More from Applied Insights

See more: Applied Insights Blog
See more: Segmentation
See more: Predictive analytics
See more: Data mining

Learn about our web analytics consulting Discover our innovative marketing applications Get to know Applied Insights