As data scientists, we have to deal with badly formatted data sets, missing or wrong values, and many other problems that hamper our progress. Some reports, such as "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says", found that data scientists spend 60% of their time cleaning and organizing data.
Once this process is complete, the data analysis phase begins. In this phase, values, histograms, variable distributions and the correlations between variables are studied. The modeling phase that follows also consists largely of repetitive tasks, such as selecting the best algorithm through automated procedures (for example, GridSearchCV in scikit-learn) or selecting features by applying predefined techniques.
In the end, all Data Analytics projects are very similar in terms of the methodology and techniques applied.
The ADAM (Automated Discovery and Analysis Machine) system was developed to save us time, so that we can focus on the intellectual work and techniques needed to solve the specific problem. ADAM is a framework that helps us perform an automated analysis of a data set by applying Data Science techniques.
ADAM: Automated Discovery and Analysis Machine
ADAM can automatically clean, analyze and model a given data set, extracting as much information from it as possible.
It is developed in Python and based on scikit-learn. It comprises two main modules: Data Understanding and Modeling.
Data Understanding Module
This is the first module executed in the system. It receives a data set and gives the user basic statistics about it. Several data cleaning tasks are also performed in this module. Depending on the type of problem, the data set consists of examples described by predictor variables and objective variables (supervised learning), or by predictor variables only, which are used to model the problem (unsupervised learning).
This module generates the following information:
- Basic statistics: a Python dictionary is created containing as much information as possible about each variable: maximum, minimum, average, standard deviation, number of wrong or missing values (NaN), counts of distinct values (for nominal variables), etc. In addition, histograms are calculated for every variable that allows it, so that the user can inspect each distribution.
- Correlation analysis: the main correlations between variables are shown. First we look at the correlations among predictor variables, and then at the correlation between each predictor variable and the objective variable. Predictor variables with low or no correlation with the objective variable are excluded.
- Support (association rule learning): the support of each possible combination of values is calculated, that is, the fraction of records in the data set that contain that specific combination.
- Synthetic variable creation and analysis: several synthetic variables are created. A synthetic variable is the result of an operation, such as addition, subtraction or multiplication, between two or more original variables. To validate each new variable, its correlation with the objective variable is checked; variables that do not reach a minimum correlation are excluded.
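The four outputs listed above could be approximated with pandas along the following lines. This is a minimal sketch and not ADAM's actual code: the function names, the use of Pearson correlation and the 0.1 threshold are all assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd


def basic_statistics(df):
    """Return a dictionary with summary statistics for each variable."""
    stats = {}
    for col in df.columns:
        series = df[col]
        info = {"missing": int(series.isna().sum())}
        if pd.api.types.is_numeric_dtype(series):
            info.update(
                min=float(series.min()),
                max=float(series.max()),
                mean=float(series.mean()),
                std=float(series.std()),
            )
        else:
            # Nominal variable: count how often each distinct value occurs.
            info["value_counts"] = series.value_counts().to_dict()
        stats[col] = info
    return stats


def support(df, conditions):
    """Fraction of records that match every column == value pair."""
    mask = pd.Series(True, index=df.index)
    for col, value in conditions.items():
        mask &= (df[col] == value)
    return float(mask.mean())


def add_synthetic_variables(df, columns):
    """Add pairwise sums, differences and products of the given columns."""
    out = df.copy()
    for a, b in combinations(columns, 2):
        out[f"{a}+{b}"] = df[a] + df[b]
        out[f"{a}-{b}"] = df[a] - df[b]
        out[f"{a}*{b}"] = df[a] * df[b]
    return out


def correlated_with_target(df, target, threshold=0.1):
    """Numeric predictors whose |Pearson r| with the target reaches threshold."""
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    corr = numeric.corrwith(df[target].astype(float)).abs()
    # Constant columns yield NaN and are dropped by the comparison below.
    return corr[corr >= threshold].sort_values(ascending=False)
```

A typical call chain would be `add_synthetic_variables` followed by `correlated_with_target`, so that original and synthetic variables are filtered by the same rule.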
Modeling Module
This module receives a clean, well-processed data set from the previous module and uses it for modeling. It analyzes the importance of the predictor variables and produces a trained predictive or clustering model that is ready to be used.
- Most predictive variables: an SGD (Stochastic Gradient Descent) regressor is trained. A list of variables, ranked by predictive ability, is generated from this regressor.
- Best predictive model search: we check whether the objective variable is categorical, and then look for the best predictive model among the main algorithms used in Data Science: decision trees, nearest neighbors, different types of regression (linear, logistic, Lasso, Ridge, etc.), support vector machines (SVM), neural networks, probabilistic models based on Bayesian networks, random forests, etc.
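The post does not say how the ranking is derived from the SGD regressor. One plausible reading, shown here as an assumption, is to standardize the inputs and order the variables by the absolute value of the fitted coefficients. The function name `rank_variables` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler


def rank_variables(X, y, names, random_state=0):
    """Rank variables by |coefficient| of an SGD regressor on scaled inputs."""
    # Scaling puts all variables on the same footing, so that coefficient
    # magnitudes are comparable (assumption: this is the ranking criterion).
    X_scaled = StandardScaler().fit_transform(X)
    model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=random_state)
    model.fit(X_scaled, y)
    order = np.argsort(-np.abs(model.coef_))
    return [(names[i], float(model.coef_[i])) for i in order]
```

Coefficient magnitude only reflects linear relationships, so a ranking of this kind is a first approximation rather than a full measure of predictive ability.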
Proof of Concept
In this post we demonstrate ADAM’s capabilities on a data set taken from a telecoms company. First, we list the available columns. Most of them are categorical variables, so ADAM splits each one into as many binary variables as it has categories. Our objective variable in this data set is “Churn”, which is also categorical and indicates whether a client leaves the company (True) or not (False).
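Splitting a categorical column into one binary variable per category is commonly done with `pandas.get_dummies`. The sketch below uses a tiny made-up table: the rows and the `MonthlyCharges` column are hypothetical, and whether ADAM uses this function internally is an assumption.

```python
import pandas as pd

# Hypothetical sample; only "Contract" and "Churn" are named in the post.
customers = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [70.0, 55.5, 20.0, 89.9],
    "Churn": [True, False, False, True],
})

# One binary column per category, named "<column>: <category>".
encoded = pd.get_dummies(
    customers, columns=["Contract"], prefix_sep=": ", dtype=int
)
```

The resulting frame has one column per contract type, such as "Contract: Month-to-month", alongside the untouched numeric and objective columns.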
ADAM extracts both basic statistics and histograms for every variable that allows it. The statistics are stored in a dictionary that is returned to the user and has the following structure:
In addition, the histograms below are obtained automatically.
Finally, the Data Understanding module generates useful synthetic variables and excludes those that have low or no correlation with the objective. In the following figure, we can see that most of the variables correlated with our objective variable are synthetic (only “Contract: Month-to-month” is an original variable).
In the following figure, we can see that some of the original variables are not correlated with the target variable.
Next, the data generated by the first module is passed to the Modeling module, which performs its analysis. The chart below shows the variables with the highest predictive strength.
After this step, the regression or classification algorithms are evaluated to find the best one, along with its best parameters.
In this case, the best-scoring algorithm is Logistic Regression, which is the output model shown in the next figure.
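A search of this kind can be sketched with scikit-learn's `GridSearchCV`, which the introduction already mentions. The candidate list, the parameter grids and the accuracy metric below are illustrative assumptions, and the data is synthetic rather than the telecoms data set, so the winning algorithm may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical candidates: each entry is (estimator, parameter grid).
CANDIDATES = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]},
    ),
    "nearest_neighbors": (
        KNeighborsClassifier(),
        {"n_neighbors": [3, 5, 11]},
    ),
}


def search_best_model(X, y, candidates=CANDIDATES, cv=5):
    """Grid-search every candidate and return the best (name, search)."""
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
        search.fit(X, y)
        results[name] = search
    best_name = max(results, key=lambda n: results[n].best_score_)
    return best_name, results[best_name]


# Synthetic binary classification problem standing in for real data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
best_name, best_search = search_best_model(X, y)
```

`best_search.best_estimator_` is the refitted model with the selected parameters, and `best_search.best_params_` records which grid point won.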
Data scientists will use the information generated automatically by ADAM to assess the Data Science problem and take appropriate action to find a solution. They can focus on analyzing and evaluating the results and alternatives, without investing effort in the repetitive tasks mentioned above.
We now use the ADAM system systematically in our Data Analytics projects. The system is continually maintained and updated, adding new capabilities or algorithms and automating more analysis tasks.
One possibility is to deploy the ADAM system as an analysis service, giving external users access to its functionality through a REST API.
Sngular Data & Analytics has more than 20 years of experience in data analysis. Our professional team can effectively address Data Analytics projects in any complex scenario with the maximum guarantees of success. If you would like more information about this area, please do not hesitate to contact us. We will be glad to help.
[translated by Marta Villegas González]