In this case study, we continue describing how Data Analytics techniques can be applied to analyze human personality based on the Big Five model. The aim of the study is to analyze questionnaires in order to detect personality profiles. In a previous post, we described the first three phases of the CRISP-DM methodology: I. Business Understanding, II. Data Understanding, and III. Data Preparation. That post covered the context of the case study and the available data, together with a detailed analysis of that data.
Now we discuss the next phase of the analysis according to the CRISP-DM methodology, Phase IV: Modeling. The objective is to identify personality profiles according to the Big Five model by applying data analytics techniques, in particular clustering with several different algorithms.
Data analysis and preparation
The data set was described in depth in the previous post. In addition, synthetic variables have been added to the original variables of the data set. These synthetic variables combine the values of the users’ answers to the questionnaires. The synthetic variables (E, N, A, C, O) are obtained, respectively, by adding and subtracting the variables E1-E10, N1-N10, A1-A10, C1-C10 and O1-O10, following the scoring scales described by the test coordinators. Thus, the new synthetic variables, known as personality variables, take values between 10 and 50.
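As a minimal sketch of how such a score could be computed, the snippet below sums the ten answers of one trait and flips the reverse-keyed items. It assumes answers on a 1 to 5 scale (which is what a 10 to 50 range for ten items implies) and models the "subtraction" as 6 minus the answer. The set of reversed items shown here is purely hypothetical; the real key is the one published by the test coordinators.

```python
# Hypothetical scoring key: which items are reverse-keyed is an assumption
# made for illustration only; the real key comes from the test coordinators.
REVERSED_ITEMS = {"E": {2, 4, 6, 8, 10}}

LIKERT_MIN, LIKERT_MAX = 1, 5  # assumed answer scale


def trait_score(answers, trait, reversed_items=REVERSED_ITEMS):
    """Sum the ten answers of one trait, flipping reverse-keyed items.

    answers: dict mapping item names such as "E1".."E10" to values in 1..5.
    """
    total = 0
    for i in range(1, 11):
        value = answers[f"{trait}{i}"]
        if not LIKERT_MIN <= value <= LIKERT_MAX:
            raise ValueError(f"{trait}{i} is outside the 1..5 scale: {value}")
        if i in reversed_items.get(trait, set()):
            # A reversed item contributes 6 - value, so 5 becomes 1 and 1 becomes 5.
            value = LIKERT_MIN + LIKERT_MAX - value
        total += value
    return total


# Answering 5 to every item: five direct items give 5, five reversed items give 1.
example = {f"E{i}": 5 for i in range(1, 11)}
print(trait_score(example, "E"))  # 30
```

With this convention every trait score stays between 10 and 50, whatever the mix of direct and reversed items.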
For this clustering analysis, variables such as “country” or “age” are not included, since the exploratory analysis showed that most of the sample is between 15 and 25 years old and from the United States. The data set has a total of 62 variables. This analysis focuses on clustering only the results of the personality tests. We take into consideration the 5 synthetic variables, which are derived from the item-level variables and carry the information needed, plus gender, which allows us to analyze differences between women and men.
In addition, information on participants over 92 has not been included. We have also filtered the profiles that fall within the average +/- the standard deviation. In order to have a clean data set, profiles that do not bring information to the study have not been included. Approximately 10% of the sample is excluded after this procedure.
Data standardization is done with StandardScaler, which rescales each variable to mean 0 and variance 1. Standardization is a very common requirement for many learning algorithms, which may behave poorly when the individual variables do not look roughly like standard normally distributed data (e.g., Gaussian with mean 0 and variance 1).
The following histograms show the standardized variables.
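The standardization step can be sketched as follows. The data here is a synthetic stand-in for the five personality scores (random integers between 10 and 50), not the real questionnaire data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the five personality scores (E, N, A, C, O), range 10..50.
scores = rng.integers(10, 51, size=(1000, 5)).astype(float)

scaler = StandardScaler()
scaled = scaler.fit_transform(scores)

# Each column now has mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))

# The fitted scaler can map standardized values back to the original units.
restored = scaler.inverse_transform(scaled)
```

Keeping the fitted scaler around is useful later, because cluster centroids found in the standardized space can be mapped back to the 10 to 50 scale for interpretation.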
Clustering personality profiles
Different clustering algorithms are going to be applied in combination with dimensionality reduction. The silhouette value of the results will be calculated in order to evaluate the quality of the clusters. Silhouette analysis is a method for interpreting and validating the consistency within the groups identified by a clustering algorithm. It provides a succinct graphical representation of how well each object lies within its cluster. It was first described by Peter J. Rousseeuw in 1986.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Silhouette values range from -1 to 1. A high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, close to 1, the clustering configuration is appropriate. If many objects have a low or negative value, the clustering configuration may have too many or too few clusters.
In our analysis, we will look for between 2 and 6 groups. The objective is to obtain the clustering with the best possible silhouette value. We will also assess the resulting groups subjectively for each algorithm, to check that they are consistent with the expected results. Information on this process can be found on this page.
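For reference, the standard definition of the silhouette value can be written as follows. The notation is ours: d(i, j) is the distance between objects i and j, C_I is the cluster containing object i, a(i) measures cohesion and b(i) measures separation.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

By convention s(i) is set to 0 when the cluster of i contains a single object. The silhouette value reported for a whole clustering is the mean of s(i) over all objects.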
To reduce processing time, we are going to use a sample of 8000 records instead of the whole data set. Once we get the best clusters, we will be able to work with the whole data set.
First, we apply the KMeans algorithm to the data set. Next, we apply Principal Component Analysis (PCA), with different types of kernel, to reduce the dimension of the data set, and then apply the clustering algorithm in that transformed space, so that we can use the silhouette technique to compare the results of the different clustering settings. To compare the groups and recover them in their original space, an inverse PCA transformation is needed.
To choose the number of components used in PCA, we take into account both the clustering silhouette value within the transformed space and the variance retained by the inverse transformation, which needs to be sufficient. The inverse-transformed data should be distributed similarly to the original sample and cover the same space. Consequently, we look for a number of PCA components that explains around 80% of the variance.
Initially, we are going to apply KMeans without dimensionality reduction. Afterwards, we will apply PCA with different kernels to find the one that suits the data best. Finally, we will apply other clustering algorithms with the most suitable kernel in order to improve the clustering.
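The search over the number of groups on a random subsample could look like the sketch below. It uses synthetic data from make_blobs as a stand-in for the standardized personality variables, and a smaller subsample than the 8000 records mentioned above so that it runs quickly; the seeds and sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized data: 5 features, like E, N, A, C, O.
X, _ = make_blobs(n_samples=3000, centers=4, n_features=5, random_state=0)

# Draw a random subsample without replacement to keep the evaluation cheap.
rng = np.random.default_rng(0)
sample_size = min(800, len(X))
idx = rng.choice(len(X), size=sample_size, replace=False)
X_sample = X[idx]

# Fit KMeans for k = 2..6 and record the mean silhouette value of each setting.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_sample)
    scores[k] = silhouette_score(X_sample, labels)

best_k = max(scores, key=scores.get)
for k, value in scores.items():
    print(f"k={k}: silhouette={value:.4f}")
print("best k on this synthetic sample:", best_k)
```

The silhouette computation needs all pairwise distances, so its cost grows quadratically with the number of records, which is the practical reason for working on a subsample first.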
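A minimal sketch of this selection for linear PCA is shown below: it picks the smallest number of components whose cumulative explained variance reaches 80% and then checks how much variance survives the inverse transformation. The data is synthetic (correlated Gaussian features), and the 0.80 threshold is the only value taken from the text.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 5 correlated features built from 3 latent factors plus noise.
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(2000, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

# Fit PCA with all components to inspect the cumulative explained variance.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

# Reduce, then map back to the original space with the inverse transformation.
pca = PCA(n_components=n_components).fit(X)
X_reduced = pca.transform(X)
X_back = pca.inverse_transform(X_reduced)

# Share of the original variance that the reconstruction retains.
retained = X_back.var(axis=0).sum() / X.var(axis=0).sum()
print(n_components, cumulative.round(3), round(float(retained), 3))
```

For linear PCA the variance retained by the reconstruction equals the cumulative explained variance of the kept components, so both checks agree. With non-linear kernels the inverse transformation is only approximate, and comparing the reconstructed data with the original sample becomes the more informative check.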
KMeans without dimensionality reduction
First, the KMeans algorithm is applied directly to the data set, without dimensionality reduction. With k=2, the silhouette value is 0.2239, as you can see in the following figure.
The analysis is repeated up to k=6, with similar results. For k=6, we get a silhouette value of 0.2097 and the following charts.
The KMeans algorithm does not obtain a good silhouette score, although the groups are appropriate for the evaluated values (k=2 up to k=6), since we obtain similar and appropriate silhouette profiles.
The following radar chart shows the centroids obtained for each value of k.
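The centroids behind such a chart can be expressed in the original 10 to 50 units by undoing the standardization. The sketch below assumes the clustering was fitted on standardized data and uses random synthetic scores, so the printed centroids carry no meaning beyond illustrating the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

TRAITS = ["E", "N", "A", "C", "O"]

rng = np.random.default_rng(0)
# Synthetic stand-in for the personality scores, range 10..50.
scores = rng.integers(10, 51, size=(1500, 5)).astype(float)

scaler = StandardScaler()
scaled = scaler.fit_transform(scores)

# Fit KMeans for each k and express the centroids in the original units.
centroids_by_k = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    centroids_by_k[k] = scaler.inverse_transform(km.cluster_centers_)

# One row per group, one column per trait, ready to be drawn on a radar chart.
for row in centroids_by_k[2]:
    print({trait: round(float(value), 1) for trait, value in zip(TRAITS, row)})
```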
KMeans and PCA with linear kernel
First, we apply linear PCA to the data set to study the optimal number of principal components to select. In the following chart, we can see that just 4 components explain 83% of the sample variance. Thus, we will use 4 components for linear PCA.
Applying KMeans with k=2 to the data set transformed with linear PCA, we obtain a silhouette value of 0.2794, a better result than the previous one without PCA.
For k=6, we obtain a silhouette value of 0.2659. It is still much better than without PCA.
The following figure shows the centroids obtained.
KMeans clustering obtains a better silhouette value with linear PCA than without dimensionality reduction. The profiles obtained are very similar to the groups obtained before. Consequently, PCA improves the clustering.
KMeans and PCA with polynomial kernel
The next phase is to apply PCA with a polynomial kernel of different degrees. After trying several polynomial degrees, we find that 90% of the sample’s variance can be explained with a degree of 4 and 4 components. In a similar analysis with KMeans, we obtain a silhouette value of 0.3846 for k=2 and 0.3818 for k=6. However, PCA with the polynomial kernel does not work properly with this data set. It gets a good silhouette for a small number of groups, but with more than 4 groups it overfits some samples, so it is not suitable for this analysis. The following figure shows the silhouette profiles for k=6 with the deviation mentioned above.
The figure below shows the centroids obtained for all iterations.
KMeans and PCA with other kernels
We have used two additional approaches, PCA with an RBF (radial basis function) kernel and PCA with a cosine kernel:
- With the RBF kernel, we can obtain a better silhouette than with the linear kernel. In addition, the groups obtained are consistent.
- With the cosine kernel, we generally obtain groups that make sense, although the clustering silhouette is lower than the one obtained with the linear kernel.
The results obtained are shown below. Results from the polynomial kernel have not been included because of overfitting.
The RBF kernel obtains the best scores for 4 and 5 clusters.
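The kernel comparison described in these sections can be sketched with scikit-learn's KernelPCA followed by KMeans. Everything below runs on synthetic data, the kernel parameters other than the polynomial degree of 4 are library defaults rather than the settings used in the study, and the resulting numbers are not comparable to the ones reported above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized personality variables.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Kernels discussed in the text; parameters not named there are defaults.
kernel_settings = {
    "linear": {"kernel": "linear"},
    "poly": {"kernel": "poly", "degree": 4},
    "rbf": {"kernel": "rbf"},
    "cosine": {"kernel": "cosine"},
}

results = {}
for name, params in kernel_settings.items():
    # Project onto 4 components, then cluster in the transformed space.
    embedding = KernelPCA(n_components=4, **params).fit_transform(X)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embedding)
    # The silhouette is measured in the transformed space, as in the text.
    results[name] = silhouette_score(embedding, labels)

for name, value in results.items():
    print(f"{name:>7}: silhouette={value:.4f}")
```

Because each silhouette is computed in a different transformed space, a higher score alone does not guarantee better groups, which is why the profiles of the groups are also inspected for consistency.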
Clustering algorithm study
Next, other clustering algorithms are tested using PCA with the RBF kernel and 4 dimensions. The following chart shows the algorithms with the best silhouette values.
From the previous chart, we select the following algorithms to compare the resulting groups:
- KMeans RBF with 5 clusters
- SpectralClustering RBF with 5 clusters
- AgglomerativeClustering RBF with 5 clusters
The resulting groups are shown below.
The group profiles for the three algorithms are very similar and follow the same shapes. Hence, any of the proposed algorithms is valid.
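A comparison of the three selected algorithms on the same RBF-kernel PCA embedding might be set up as in the sketch below. The data is synthetic and the algorithms run with scikit-learn defaults apart from the number of clusters, which is an assumption on our part since the text does not list the hyperparameters used.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized personality variables.
X, _ = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Shared 4-dimensional embedding obtained with the RBF kernel.
embedding = KernelPCA(n_components=4, kernel="rbf").fit_transform(X)

N_CLUSTERS = 5
algorithms = {
    "KMeans": KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0),
    "SpectralClustering": SpectralClustering(n_clusters=N_CLUSTERS, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=N_CLUSTERS),
}

labels = {}
silhouettes = {}
for name, model in algorithms.items():
    # All three estimators expose fit_predict, so they can share one loop.
    labels[name] = model.fit_predict(embedding)
    silhouettes[name] = silhouette_score(embedding, labels[name])

for name, value in silhouettes.items():
    print(f"{name:>24}: silhouette={value:.4f}")
```

Cluster numbering is arbitrary in each algorithm, so group 1 of one method need not be group 1 of another; comparing the group profiles, as done above, sidesteps that issue.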
Description of the resulting groups
The resulting groups can be described according to what makes them different from the others. Based on the KMeans result, a possible interpretation of the groups is the following:
- Group 1: introvert and discreet profile, self-centered and competitive, but with a lot of imagination and open to new experiences.
- Group 2: irresponsible profile, emotionally unstable and insecure, unpredictable with a lack of objectives.
- Group 3*: extrovert profile, sociable, supportive and organized, but carefree and without clear objectives at the same time.
- Group 4: introvert and discreet profile, not confident but creative and open to intense emotions.
- Group 5: pragmatic and persistent profile, quiet and independent.
*This group could also include people who answered the test with the maximum score without reading it. Consequently, it is a group worth highlighting.
We have shown that it is possible to cluster personalities from the data given in personality tests. This allows this type of analysis to be applied in business areas in order to improve personnel and team management.
In future work, additional data could be studied along with the data from personality tests, depending on the specific objective, such as personal and geopositioning data, browsing history, buying preferences, or interview transcripts.
The aim of this study is to show that Data Science techniques allow the analysis and interpretation of a given data set; in this case, data is related to human personality questionnaires. Our professional team can effectively address Data Analytics projects in any complex scenario with the maximum guarantee of success. If you would like more information about this area, please do not hesitate to contact us. We will be glad to help you.
[translated by Marta Villegas González]