Data Preparation

In this post, we continue our description of the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology, following our previous post on Phase II: Data Understanding. Here we discuss the third phase of a data analysis project, known as Data Preparation.

This post is adapted from the following sources:

Chapman, Pete (NCR); Clinton, Julian (SPSS); Kerber, Randy (NCR); Khabaza, Thomas (SPSS); Reinartz, Thomas (DaimlerChrysler); Shearer, Colin (SPSS); Wirth, Rüdiger (DaimlerChrysler). Step-by-step data mining guide. 2000. 

DataPrix. Metodología CRISP-DM para minería de datos. 2007.

CRISP-DM Phase III. Data Preparation. Data analysis and feature selection

The figure below outlines the different tasks that need to be carried out in this phase.

Select data

In this task, we decide which data will be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types. The output is a list of the data to be included or excluded, together with the reasons for these decisions.

As a practical guide for the project manager, the checklist below lists the activities to be carried out in this task:

  • List the data to be included or excluded and the reasons for these decisions.
  • Collect appropriate additional data from both in-house and external sources.
  • Perform correlation tests to decide if fields should be included.
  • Reconsider Data Selection Criteria (see Phase II) according to data quality and data exploration.
  • Select different data subsets, for example, different attributes, only data which meet certain conditions, etc.
  • Consider the use of sampling techniques.
  • Document reasons for inclusion or exclusion of attributes.
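The correlation and sampling items above can be illustrated with a minimal sketch. It uses only the Python standard library. The records, the field names, the target field `spend`, and the 0.5 threshold are all hypothetical choices made for illustration; a real project would pick its own test and cut-off.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records: field names and values are illustrative only.
records = [
    {"age": 25, "income": 30000, "shoe_size": 42, "spend": 310},
    {"age": 32, "income": 42000, "shoe_size": 39, "spend": 405},
    {"age": 41, "income": 58000, "shoe_size": 44, "spend": 600},
    {"age": 50, "income": 61000, "shoe_size": 40, "spend": 640},
    {"age": 58, "income": 75000, "shoe_size": 43, "spend": 770},
]
TARGET = "spend"
THRESHOLD = 0.5  # assumed cut-off, not prescribed by CRISP-DM

# Record the decision AND the reason (the correlation) for each field.
decisions = {}
for field in ("age", "income", "shoe_size"):
    r = pearson([rec[field] for rec in records],
                [rec[TARGET] for rec in records])
    verdict = "include" if abs(r) >= THRESHOLD else "exclude"
    decisions[field] = (verdict, round(r, 2))

# Simple random sampling without replacement, seeded for reproducibility.
rng = random.Random(0)
sample = rng.sample(records, k=3)

for field, (verdict, r) in decisions.items():
    print(f"{field}: {verdict} (r = {r})")
```

Keeping the correlation value next to each verdict gives the documented reason for inclusion or exclusion that the checklist asks for.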

Clean data

If necessary, we raise data quality to the level required by the selected analysis techniques. This may involve selecting clean subsets of the data or inserting suitable default values.

The checklist below lists the activities to be carried out in this task:

  • Consider how to deal with noise.
  • Correct, remove, or ignore noise.
  • Decide how to deal with special values and their meaning.
  • Reconsider Data Selection Criteria (see Phase II) according to data quality after the cleaning process.
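As a rough illustration of the noise and special-value items above, the sketch below marks special codes as missing, removes out-of-range values, and logs each action. The sentinel codes (`-999`, empty string), the valid ranges, and the field names are assumptions for this example; the meaning of special values has to be established with the data owners in a real project.

```python
# Assumed codes that mean "unknown" in this hypothetical source.
SENTINELS = {-999, ""}

# Assumed plausible ranges per field; values outside are treated as noise.
VALID_RANGE = {"age": (0, 120), "temp_c": (-50, 60)}

raw = [
    {"id": 1, "age": 34, "temp_c": 21.5},
    {"id": 2, "age": -999, "temp_c": 22.1},   # special value
    {"id": 3, "age": 29, "temp_c": 480.0},    # implausible reading
    {"id": 4, "age": 41, "temp_c": None},     # already missing
]


def clean(records):
    """Return cleaned copies of the records plus a log of every change."""
    cleaned, log = [], []
    for record in records:
        row = dict(record)  # copy so the raw data stays untouched
        for field, (low, high) in VALID_RANGE.items():
            value = row[field]
            if value is None or value in SENTINELS:
                row[field] = None
                log.append((row["id"], field, "special value treated as missing"))
            elif not low <= value <= high:
                row[field] = None
                log.append((row["id"], field, "out-of-range value removed as noise"))
        cleaned.append(row)
    return cleaned, log


cleaned, log = clean(raw)

# One possible strategy: keep only the fully populated records.
complete_rows = [row for row in cleaned if None not in row.values()]

for entry in log:
    print(entry)
```

Whether to drop incomplete records, as done here, or to insert defaults instead depends on how much data would be lost and on what the modeling technique tolerates.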

Construct data

This task covers constructive data preparation operations, such as producing derived attributes (also known as synthetic variables), creating entire new records, or transforming the values of existing attributes.

Derived attributes are new attributes that are constructed from one or more existing attributes in the same record, for example, area = length * width.

The checklist below lists the activities to be carried out in this task:

  • Construct derived attributes, entire new records, or transformed values for existing attributes.
  • Decide which attributes should be normalized.
  • Consider adding new information on the relative importance of attributes by adding new attributes (e.g., attribute weights, weighted normalization).
  • Decide how missing attributes can be constructed or imputed, and choose the type of construction (e.g., aggregate, average, or induction).
  • Specify the necessary transformation steps in terms of the available transformation facilities, for example, changing the binning of a numeric attribute.
  • Execute transformation steps.
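The construction steps above can be sketched on a toy table. The example derives area = length * width, imputes a missing value with the average of the known ones, applies min-max normalization, and bins a numeric attribute. The rows, the `price` attribute, and the bin boundaries are hypothetical, and mean imputation and min-max scaling are only two of several reasonable choices.

```python
from statistics import mean

# Hypothetical rows: names and values are illustrative only.
rows = [
    {"length": 2.0, "width": 3.0, "price": 100.0},
    {"length": 4.0, "width": 1.5, "price": None},   # missing value
    {"length": 5.0, "width": 2.0, "price": 300.0},
]

# 1. Derived attribute built from attributes of the same record.
for row in rows:
    row["area"] = row["length"] * row["width"]

# 2. Imputation by average: fill missing prices with the mean of known ones.
known_prices = [row["price"] for row in rows if row["price"] is not None]
fill_value = mean(known_prices)
for row in rows:
    if row["price"] is None:
        row["price"] = fill_value

# 3. Min-max normalization of price to the range [0, 1].
low = min(row["price"] for row in rows)
high = max(row["price"] for row in rows)
for row in rows:
    row["price_norm"] = (row["price"] - low) / (high - low)


# 4. Binning of a numeric attribute (boundaries are assumed).
def area_bin(area):
    if area < 7:
        return "small"
    if area < 10:
        return "medium"
    return "large"


for row in rows:
    row["area_bin"] = area_bin(row["area"])

for row in rows:
    print(row)
```

Changing the binning later only requires editing `area_bin`, which keeps that transformation step explicit and repeatable.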

Integrate data

In this task, we combine information from multiple tables or other information sources to create new records or values. Combining tables refers to joining together two or more tables that have different information about the same object. Combined data may also include aggregations, which are operations where new values are computed by summarizing information from multiple records and/or tables. 

This is the checklist for this task:

  • Check if integration facilities are able to integrate input sources as required.
  • Integrate sources and store results.
  • Reconsider Data Selection Criteria (See Phase II) according to data integration results.
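To make the two operations described in this task concrete, the sketch below aggregates a purchases table per customer and then joins the result to a customers table on a shared key. Both tables, the key `customer_id`, and the new column names are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical tables describing the same objects (customers).
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Luis"},
    {"customer_id": 3, "name": "Eva"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]

# Aggregation: summarize many purchase records into one value per customer.
totals = defaultdict(float)
counts = defaultdict(int)
for purchase in purchases:
    key = purchase["customer_id"]
    totals[key] += purchase["amount"]
    counts[key] += 1

# Join: one integrated record per customer, keeping customers
# without purchases (similar to a left join).
integrated = [
    {
        **customer,
        "n_purchases": counts.get(customer["customer_id"], 0),
        "total_amount": totals.get(customer["customer_id"], 0.0),
    }
    for customer in customers
]

for record in integrated:
    print(record)
```

Keeping customers with no purchases is a design choice; dropping them instead would change the population under study, which is one reason to revisit the selection criteria after integration.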

Format data

Formatting refers to syntactic modifications made to the data. These modifications do not change its meaning, but might be required by the modeling tool (e.g., a particular order of fields or records, or a specific delimiter).

These are purely syntactic changes made to satisfy the requirements of the specific modeling tool.
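A small sketch of such a change follows. It assumes a hypothetical modeling tool that expects semicolon-delimited text with the outcome field in the last column; the records and the column order are illustrative only. The values themselves are not altered.

```python
import csv
import io

# Hypothetical prepared records; the outcome field "churn" comes first here.
records = [
    {"churn": "yes", "age": 34, "plan": "basic"},
    {"churn": "no", "age": 51, "plan": "premium"},
]

# Assumed tool requirement: outcome field last, ";" as delimiter.
COLUMN_ORDER = ["age", "plan", "churn"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=COLUMN_ORDER,
    delimiter=";",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)

formatted = buffer.getvalue()
print(formatted)
```

Writing to an in-memory buffer keeps the example self-contained; in practice the same writer would target the file that the modeling tool reads.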

That was the last task in Phase III, Data Preparation, which involves data analysis and feature selection. In the next post, we will describe Phase IV, Modeling, which focuses on data modeling by using computational learning algorithms.

Our professional team can effectively address Data Analytics projects in any complex scenario, with the maximum guarantees of success, by applying the CRISP-DM methodology. If you would like more information about this area, please do not hesitate to contact us. We will be glad to help you.

[translated by Marta Villegas González]