In this post, we continue describing the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology, following our previous post on Phase III: Data Understanding. This time, we discuss the fourth phase of a data analysis project, known as Modeling.

It is an extracted version of:

Chapman, Pete (NCR); Clinton, Julian (SPSS); Kerber, Randy (NCR); Khabaza, Thomas (SPSS); Reinartz, Thomas (DaimlerChrysler); Shearer, Colin (SPSS); Wirth, Rüdiger (DaimlerChrysler). Step-by-step data mining guide. 2000. 

DataPrix. Metodología CRISP-DM para minería de datos. 2007.

CRISP-DM Phase IV: Modeling

The figure below outlines the different tasks that need to be carried out in this phase.

Select modeling technique

In this task, we should select the actual modeling technique that is to be used. Although you may have already selected a tool during the Business Understanding phase, this task refers to the specific modeling technique. If multiple techniques are applied, this task is performed for each technique.

Many modeling techniques make specific assumptions about the data that we should take into consideration, for example, that all attributes follow uniform distributions, or that no missing values are allowed.

Checklist below represents the tasks to be done at this stage:

  • Decide appropriate techniques for the scenario, taking into account the selected tool.
  • Document the actual modeling technique that is to be used.
  • Analyze any data assumptions made by the modeling technique, such as quality, format, and distribution. Then compare them against the data description report, and make sure they still hold.
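The assumption check in the last item can be sketched in code. A minimal example, assuming Python with pandas (the original prescribes no tool, and the small DataFrame is purely hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample data standing in for the project's dataset
df = pd.DataFrame({
    "age": [25, 32, None, 47, 51],
    "income": [30000, 45000, 52000, None, 61000],
})

# Missing values: many techniques assume none are present
missing = df.isna().sum()
print(missing)

# Distribution shape: strong skewness can violate a technique's
# distributional assumptions
skewness = df.skew(numeric_only=True)
print(skewness)
```

Checks like these make the technique's assumptions explicit before any model is built, so violations can be sent back to the Data Preparation phase.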

Generate test design

Before we actually build a model, we need to define a procedure or mechanism to test the model’s quality and validity. Normally, the dataset is separated into training and test sets, and the model’s quality is estimated on the held-out test set. In this task, we describe the intended plan for training, testing, and assessing models, and we also determine how to divide the available data into training, test, and validation sets. Other validation strategies, such as cross-validation, can also be used.

Checklist below represents the tasks to be done at this stage:

  • Check existing test designs for each data mining goal.
  • Define the necessary steps (number of iterations, number of folds, etc.).
  • Prepare data required for test.
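As a minimal sketch of such a test design, assuming Python with scikit-learn (a hypothetical tool choice) and using the bundled Iris dataset as a stand-in for the project's data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split

# Iris stands in for the project's dataset
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as a separate test set, stratified by class
# so that class proportions are preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Alternative validation strategy mentioned in the text: k-fold
# cross-validation, here defined as 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

print(len(X_train), len(X_test))
```

Fixing the random seed makes the split reproducible, which matters when the test design must be documented and re-run across modeling iterations.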

Build model

Once the training set is selected, the modeling tool is run on it to create one or more models. With any modeling tool, there are often a large number of parameters that can be adjusted, so it is necessary to list the parameters and their chosen values, along with the rationale for each choice.

At the end of this task, we obtain the actual models produced by the tool (not a report), and it is necessary to describe and interpret them, as well as document any difficulties encountered.

Checklist below represents the tasks to be done at this stage:

  • Describe any characteristics of the current model that may be useful for the future.
  • Adjust parameter settings used to produce the model.
  • Give a detailed description of the model and any special features.
  • For rule-based models, list the rules produced, plus any assessment of per-rule or overall model accuracy and coverage.
  • For opaque models, list any technical information about the model, such as neural network topology, and any behavioral descriptions produced by the modeling process, such as accuracy or sensitivity.
  • Describe the model’s behavior and interpretation.
  • State conclusions regarding patterns in the data, if any.
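A small sketch of the build step, again assuming Python with scikit-learn (the methodology itself is tool-agnostic). It records parameter settings together with their rationale and, since a decision tree is a rule-based model, lists the learned rules as the checklist suggests:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Parameter settings recorded together with the rationale for each
# choice, as the task description recommends
params = {
    "max_depth": 3,         # keep the tree shallow and interpretable
    "min_samples_leaf": 5,  # avoid overfitting to tiny leaves
    "random_state": 42,     # reproducibility across runs
}

model = DecisionTreeClassifier(**params).fit(X, y)

# A decision tree is rule-based, so the learned rules can be listed
# directly for the model description report
rules = export_text(model, feature_names=list(load_iris().feature_names))
print(rules)
```

Keeping the parameters in a plain dictionary makes them easy to log alongside the model, which is exactly the documentation this task calls for.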

Assess model

The model should be assessed to ensure that it meets the domain criteria, the data mining success criteria, and the desired test criteria. We evaluate it according to the assessment criteria: apply a single technique several times, generate results with different techniques, and finally compare the results against those criteria. As an output, we summarize the results of this task, list the qualities of the generated models (e.g., in terms of accuracy), and rank them relative to each other.

According to the model assessment, parameter settings should be revised and adjusted for the next run of the Build Model task. We should then iterate model building and assessment until we find the best model.

Checklist below represents the tasks to be done at this stage:

  • Execute validation tests, and evaluate results according to evaluation criteria.
  • Compare evaluation results and their interpretation.
  • Create ranking of results with respect to success and evaluation criteria.
  • Interpret results in business terms as far as possible at this stage.
  • Get comments on models by domain or data experts.
  • Check reliability of the model.
  • Check effect on data mining goals.
  • Check the models against the given knowledge base to see whether the discovered information is new and useful.
  • Check credibility of the results.
  • Analyze potential for deployment of each result.
  • If there is a verbal description of the generated model (e.g., via rules), evaluate the rules.
  • Analyze specific features of each modeling technique to get insights into why a certain modeling technique and certain parameter settings lead to good or bad results.
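The comparison-and-ranking part of this assessment can be sketched as follows, assuming Python with scikit-learn (a hypothetical tool choice): several techniques are evaluated under the same criterion, here cross-validated accuracy, and then ranked against each other:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate techniques compared under the same assessment criterion
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Evaluate each candidate with 5-fold cross-validated accuracy
scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}

# Rank the models relative to each other, best first
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
print(scores)
```

The resulting ranking feeds back into the Build Model task: the parameter settings of the leading candidates are revised and the build/assess loop is repeated.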

That was the last task in Phase IV, Modeling, which focuses on modeling data with computational learning algorithms. In the next post, we will describe Phase V, Evaluation, which focuses on concrete results and outcome evaluation.

Our professional team can effectively address Data Analytics projects in any complex scenario with the maximum guarantees of success applying CRISP-DM methodology. If you would like more information about this area, please do not hesitate to contact us. We will be glad to help you.

[translated by Marta Villegas González]