Highly skilled data analyst with data scientist expertise, offering over 15 years of experience in data analysis using Power BI, SQL, proficient with Python-based tools for visualization, analysis, algorithms, and statistical modeling.
View My LinkedIn Profile
Return on Investment (ROI) is a critical metric for evaluating the effectiveness of marketing campaigns and other business initiatives. By segmenting customers based on their engagement metrics and marketing activities data, businesses can optimize their ROI by targeting the right customers with tailored marketing strategies.
In this analysis, we compare various clustering methods for segmenting users based on their characteristics and purchasing behavior. We have considered the following clustering methods: K-means, t-SNE, K-medoids, and GMM.
Main Findings
Based on the comparison of all methods, we can observe that the clusters derived from each method have similar characteristics:
However, the general trend among all methods is that they divide customers into three segments:
The choice of the best method for this dataset would depend on the specific goals of the analysis and the preference for the characteristics of the resulting clusters. Overall, K-means and t-SNE provide very similar results, while K-medoids and GMM might offer a more nuanced view of the customers’ family structures.
It is important to note that t-SNE is a stochastic method, which means that it can produce different results each time it is run. This variability can make it challenging to reproduce and validate the results consistently. Additionally, t-SNE is more computationally intensive than other methods like K-means, which can be a concern when working with larger datasets. Moreover, t-SNE is primarily a visualization technique and is not specifically designed for clustering, so it may not always yield the most accurate or meaningful groupings of data points.
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
data = pd.read_csv("marketing_campaign+%284%29.csv")
learn = data.copy()
data.shape
(2240, 27)
data.head()
ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2240 non-null int64
1 Year_Birth 2240 non-null int64
2 Education 2240 non-null object
3 Marital_Status 2240 non-null object
4 Income 2216 non-null float64
5 Kidhome 2240 non-null int64
6 Teenhome 2240 non-null int64
7 Dt_Customer 2240 non-null object
8 Recency 2240 non-null int64
9 MntWines 2240 non-null int64
10 MntFruits 2240 non-null int64
11 MntMeatProducts 2240 non-null int64
12 MntFishProducts 2240 non-null int64
13 MntSweetProducts 2240 non-null int64
14 MntGoldProds 2240 non-null int64
15 NumDealsPurchases 2240 non-null int64
16 NumWebPurchases 2240 non-null int64
17 NumCatalogPurchases 2240 non-null int64
18 NumStorePurchases 2240 non-null int64
19 NumWebVisitsMonth 2240 non-null int64
20 AcceptedCmp3 2240 non-null int64
21 AcceptedCmp4 2240 non-null int64
22 AcceptedCmp5 2240 non-null int64
23 AcceptedCmp1 2240 non-null int64
24 AcceptedCmp2 2240 non-null int64
25 Complain 2240 non-null int64
26 Response 2240 non-null int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
data.isnull().sum()
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Response 0
dtype: int64
Since there are only 1% of rows with missing values, it makes sense to remove them.
# Drop rows with missing values
data = data.dropna()
# Display the updated dataset
print(data.head())
# Check the remaining number of rows
print(f"Remaining rows: {data.shape[0]}")
ID Year_Birth Education Marital_Status Income Kidhome Teenhome \
0 5524 1957 Graduation Single 58138.0 0 0
1 2174 1954 Graduation Single 46344.0 1 1
2 4141 1965 Graduation Together 71613.0 0 0
3 6182 1984 Graduation Together 26646.0 1 0
4 5324 1981 PhD Married 58293.0 1 0
Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases \
0 04-09-2012 58 635 ... 10 4
1 08-03-2014 38 11 ... 1 2
2 21-08-2013 26 426 ... 2 10
3 10-02-2014 26 11 ... 0 4
4 19-01-2014 94 173 ... 3 6
NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 \
0 7 0 0 0 0
1 5 0 0 0 0
2 4 0 0 0 0
3 6 0 0 0 0
4 5 0 0 0 0
AcceptedCmp2 Complain Response
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
[5 rows x 27 columns]
Remaining rows: 2216
data.duplicated().sum()
0
data.ID.nunique()
2216
All ID are unique. As all entries are unique, it would not add value to our analysis. We can drop the “ID” column.
# Dropping the "ID" column
data.drop(columns = "ID", inplace = True)
To speed up the analysis, we create new variables that contain only numerical and categorical features from the original data. This will allow us to perform further data analysis more quickly, as we reduce the number of features that need to be considered.
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)
print(cat_col)
['Education', 'Marital_Status', 'Dt_Customer']
# Making a list of all categorical variables
num_col = list(data.select_dtypes(include=["int64", "float64"]).columns)
print(num_col)
['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response']
Questions:
data.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Year_Birth | 2216.0 | 1968.820397 | 11.985554 | 1893.0 | 1959.0 | 1970.0 | 1977.00 | 1996.0 |
Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.0 | 51381.5 | 68522.00 | 666666.0 |
Kidhome | 2216.0 | 0.441787 | 0.536896 | 0.0 | 0.0 | 0.0 | 1.00 | 2.0 |
Teenhome | 2216.0 | 0.505415 | 0.544181 | 0.0 | 0.0 | 0.0 | 1.00 | 2.0 |
Recency | 2216.0 | 49.012635 | 28.948352 | 0.0 | 24.0 | 49.0 | 74.00 | 99.0 |
MntWines | 2216.0 | 305.091606 | 337.327920 | 0.0 | 24.0 | 174.5 | 505.00 | 1493.0 |
MntFruits | 2216.0 | 26.356047 | 39.793917 | 0.0 | 2.0 | 8.0 | 33.00 | 199.0 |
MntMeatProducts | 2216.0 | 166.995939 | 224.283273 | 0.0 | 16.0 | 68.0 | 232.25 | 1725.0 |
MntFishProducts | 2216.0 | 37.637635 | 54.752082 | 0.0 | 3.0 | 12.0 | 50.00 | 259.0 |
MntSweetProducts | 2216.0 | 27.028881 | 41.072046 | 0.0 | 1.0 | 8.0 | 33.00 | 262.0 |
MntGoldProds | 2216.0 | 43.965253 | 51.815414 | 0.0 | 9.0 | 24.5 | 56.00 | 321.0 |
NumDealsPurchases | 2216.0 | 2.323556 | 1.923716 | 0.0 | 1.0 | 2.0 | 3.00 | 15.0 |
NumWebPurchases | 2216.0 | 4.085289 | 2.740951 | 0.0 | 2.0 | 4.0 | 6.00 | 27.0 |
NumCatalogPurchases | 2216.0 | 2.671029 | 2.926734 | 0.0 | 0.0 | 2.0 | 4.00 | 28.0 |
NumStorePurchases | 2216.0 | 5.800993 | 3.250785 | 0.0 | 3.0 | 5.0 | 8.00 | 13.0 |
NumWebVisitsMonth | 2216.0 | 5.319043 | 2.425359 | 0.0 | 3.0 | 6.0 | 7.00 | 20.0 |
AcceptedCmp3 | 2216.0 | 0.073556 | 0.261106 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
AcceptedCmp4 | 2216.0 | 0.074007 | 0.261842 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
AcceptedCmp5 | 2216.0 | 0.073105 | 0.260367 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
AcceptedCmp1 | 2216.0 | 0.064079 | 0.244950 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
AcceptedCmp2 | 2216.0 | 0.013087 | 0.113672 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
Complain | 2216.0 | 0.009477 | 0.096907 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
Response | 2216.0 | 0.150271 | 0.357417 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
The mean age of the customers in the dataset is 52 years with a standard deviation of 12 years. The average income of customers is 52,247 USD with a standard deviation of 25,173 USD.
On average, customers have 0.44 children and 0.51 teenagers in the household. The customers made purchases recently with an average recency of 49 days. Customers have made purchases in all categories with the highest average spending on wines (305USD) and the lowest on fruits (26USD).
Customers have made purchases through multiple channels with an average of 2.3 deals, 4.1 web purchases, 2.7 catalog purchases, and 5.8 store purchases. Customers visit the website an average of 5.3 times per month.
Only a small percentage of customers responded to marketing campaigns (acceptedcmp1: 6.4%, acceptedcmp2: 1.3%, acceptedcmp3: 7.4%, acceptedcmp4: 7.4%, acceptedcmp5: 7.3%). Only 0.9% of customers made a complaint.
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
cont_cols = list(data[num_col].columns)
for col in cont_cols:
print(col)
print('Skew :',round(data[col].skew(),2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
Year_Birth
Skew : -0.35
Income
Skew : 6.76
Kidhome
Skew : 0.64
Teenhome
Skew : 0.41
Recency
Skew : 0.0
MntWines
Skew : 1.17
MntFruits
Skew : 2.1
MntMeatProducts
Skew : 2.03
MntFishProducts
Skew : 1.92
MntSweetProducts
Skew : 2.1
MntGoldProds
Skew : 1.84
NumDealsPurchases
Skew : 2.42
NumWebPurchases
Skew : 1.2
NumCatalogPurchases
Skew : 1.88
NumStorePurchases
Skew : 0.7
NumWebVisitsMonth
Skew : 0.22
AcceptedCmp3
Skew : 3.27
AcceptedCmp4
Skew : 3.26
AcceptedCmp5
Skew : 3.28
AcceptedCmp1
Skew : 3.56
AcceptedCmp2
Skew : 8.57
Complain
Skew : 10.13
Response
Skew : 1.96
After analyzing the ‘Income’ variable, it was found that it has a right-skewed distribution with some extreme values on the right side. The upper whisker in the box plot for the ‘Income’ variable lies at 122,308.5 USD, which means that any value above this value can be considered as an outlier. However, due to the lack of domain expertise and the fact that there are only a few rows with extreme values, it was decided to keep them in the dataset and analyze their impact on further analysis. The only exception was the value of 666,666 USD, which was removed as it was clearly an error or a dummy value. Therefore, we will proceed with the analysis with the extreme values included in the dataset except for the value of 666,666 USD.
for col in cat_col:
print(data[col].value_counts())
print(' ')
Graduation 1116
PhD 481
Master 365
2n Cycle 200
Basic 54
Name: Education, dtype: int64
Married 857
Together 573
Single 471
Divorced 232
Widow 76
Alone 3
Absurd 2
YOLO 2
Name: Marital_Status, dtype: int64
31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
05-08-2012 1
18-11-2012 1
09-05-2014 1
26-06-2013 1
09-01-2014 1
Name: Dt_Customer, Length: 662, dtype: int64
The distribution of the Income variable appears to be right-skewed, with most of the data points lying on the left side of the distribution. There are a few extreme values on the right side of the distribution that appear as outliers in both the histogram and the box plot.
rows_with_outliers = data[(data.Income > 122308.5)]
rows_with_outliers
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
164 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
617 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
655 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
687 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1300 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1653 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2132 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2233 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
After analyzing the ‘Income’ variable, it was found that it has a right-skewed distribution with some extreme values on the right side. The upper whisker in the box plot for the ‘Income’ variable lies at 122,308.5 USD, which means that any value above this value can be considered as an outlier. However, due to the lack of domain expertise and the fact that there are only a few rows with extreme values, it was decided to keep them in the dataset and analyze their impact on further analysis. The only exception was the value of 666,666 USD, which was removed as it was clearly an error or a dummy value. Therefore, we will proceed with the analysis with the extreme values included in the dataset except for the value of 666,666 USD.
rows_with_outliers = data[(data.Income > 200000)]
# Get index for rows witn outliers
index_of_rows_to_drop = rows_with_outliers.index
# Drop rows
data = data.drop(index_of_rows_to_drop)
#Check the result
data[(data.Income > 200000)]
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
---|
0 rows × 26 columns
We can create a new feature that represents the total number of members in each family by summing up the number of kids, teenagers, and adults in the household. This could be a useful variable for understanding household size and composition, and could potentially be used in segmentation and targeting strategies.
# Find columns with unusual marital status values
unusual_categories = ['Alone', 'Absurd', 'YOLO']
data[data['Marital_Status'].isin(unusual_categories)]
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
131 | 1958 | Master | Alone | 61331.0 | 1 | 1 | 10-03-2013 | 42 | 534 | 5 | ... | 1 | 6 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
138 | 1973 | PhD | Alone | 35860.0 | 1 | 1 | 19-05-2014 | 37 | 15 | 0 | ... | 1 | 2 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
153 | 1988 | Graduation | Alone | 34176.0 | 1 | 0 | 12-05-2014 | 12 | 5 | 7 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2093 | 1993 | Graduation | Absurd | 79244.0 | 0 | 0 | 19-12-2012 | 58 | 471 | 102 | ... | 10 | 7 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2134 | 1957 | Master | Absurd | 65487.0 | 0 | 0 | 10-01-2014 | 48 | 240 | 67 | ... | 5 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2177 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 18-10-2012 | 3 | 322 | 3 | ... | 1 | 6 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2202 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 18-10-2012 | 3 | 322 | 3 | ... | 1 | 6 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
7 rows × 26 columns
# Replace 'Alone' with 'Single'
data['Marital_Status'] = data['Marital_Status'].replace('Alone', 'Single')
data[data['Marital_Status'].isin(unusual_categories)]
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2093 | 1993 | Graduation | Absurd | 79244.0 | 0 | 0 | 19-12-2012 | 58 | 471 | 102 | ... | 10 | 7 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2134 | 1957 | Master | Absurd | 65487.0 | 0 | 0 | 10-01-2014 | 48 | 240 | 67 | ... | 5 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2177 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 18-10-2012 | 3 | 322 | 3 | ... | 1 | 6 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2202 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 18-10-2012 | 3 | 322 | 3 | ... | 1 | 6 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 rows × 26 columns
# Find columns with unusual marital status values
unusual_categories = ['Absurd', 'YOLO']
mask = data['Marital_Status'].isin(unusual_categories)
unusual_columns = data.loc[mask].index
# Remove columns with unusual marital status values
data.drop(unusual_columns, axis=0, inplace=True)
# Create a new column for total number of family members
data['Family_Size'] = data['Kidhome'] + data['Teenhome'] + 1
# Update the family size column for customers who are Married or Together
data.loc[data['Marital_Status'].isin(['Married', 'Together']), 'Family_Size'] += 1
data.head()
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Family_Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | ... | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | ... | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | 49 | ... | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | 4 | ... | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | 43 | ... | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
5 rows × 27 columns
To process the Dt_Customer feature, which is a date, we can extract useful information from this attribute by converting it to a numerical format and calculating the number of years since the customer’s registration date up to July 1, 2016. Since we do not know the exact date, we will take the midpoint of the year.
# Convert the 'Dt_Customer' column to datetime format with the specified format 'DD-MM-YYYY'
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y')
# Set the reference date to 2016
reference_date = pd.to_datetime('01-07-2016')
# Calculate the number of years since the reference date
data['Years_Since_Registration'] = (reference_date - data['Dt_Customer']).dt.days / 365
# Round the new column to one decimal place
data['Years_Since_Registration'] = data['Years_Since_Registration'].round(1)
# Display the updated dataset
print(data.head())
Year_Birth Education Marital_Status Income Kidhome Teenhome \
0 1957 Graduation Single 58138.0 0 0
1 1954 Graduation Single 46344.0 1 1
2 1965 Graduation Together 71613.0 0 0
3 1984 Graduation Together 26646.0 1 0
4 1981 PhD Married 58293.0 1 0
Dt_Customer Recency MntWines MntFruits ... NumWebVisitsMonth \
0 2012-09-04 58 635 88 ... 7
1 2014-03-08 38 11 1 ... 5
2 2013-08-21 26 426 49 ... 4
3 2014-02-10 26 11 4 ... 6
4 2014-01-19 94 173 43 ... 5
AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 \
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Complain Response Family_Size Years_Since_Registration
0 0 1 1 3.3
1 0 0 3 1.8
2 0 0 2 2.4
3 0 0 3 1.9
4 0 0 3 2.0
[5 rows x 28 columns]
# Remove the 'Dt_Customer' column
data.drop('Dt_Customer', axis=1, inplace=True)
To create a new feature for age, we can subtract the birth year of each customer from 2016. We can do this by first converting the ‘Year_Birth’ column to a datetime format, and then using the .apply() method to subtract the birth year from 2016 for each row. We will also need to filter out any rows with birth years earlier than 1940 as these values are likely to be errors.
data[(data.Year_Birth < 1940)]
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
192 | 1900 | 2n Cycle | Divorced | 36640.0 | 1 | 0 | 99 | 15 | 6 | 8 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 2.3 |
239 | 1893 | 2n Cycle | Single | 60182.0 | 0 | 1 | 23 | 8 | 0 | 5 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1.6 |
339 | 1899 | PhD | Together | 83532.0 | 0 | 0 | 36 | 755 | 144 | 562 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 2.3 |
3 rows × 27 columns
rows_with_outliers = data[(data.Year_Birth < 1940)]
# Get index for rows witn outliers
index_of_rows_to_drop = rows_with_outliers.index
# Drop rows
data = data.drop(index_of_rows_to_drop)
#Check the result
data[(data.Year_Birth < 1940)]
Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration |
---|
0 rows × 27 columns
data['Age'] = 2016 - data['Year_Birth']
data = data.drop('Year_Birth', axis=1)
data['Total_Spent'] = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
data['Total_Accepted_Offers'] = data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3'] + data['AcceptedCmp4'] + data['AcceptedCmp5'] + data['Response']
We can find out the amount spent per purchase by dividing the total amount spent on various products by the total number of purchases made by the customer. This can help us understand the average spending behavior of the customers and can also be used to identify high-value customers.
We can create a new feature for the amount spent per purchase by dividing the total amount spent on all products by the sum of the number of purchases made from all channels. The resulting value will represent the average amount spent per purchase.
data['AmountPerPurchase'] = (data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] +
data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']) / (
data['NumDealsPurchases'] + data['NumWebPurchases'] +
data['NumCatalogPurchases'] + data['NumStorePurchases']).replace(0, 1)
We can create a new feature to find the total number of kids and teens in the household by adding the ‘Kidhome’ and ‘Teenhome’ columns.
data['Total_Kids_Teens'] = data['Kidhome'] + data['Teenhome']
data.head()
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Graduation | Single | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 0 | 1 | 1 | 3.3 | 59 | 1617 | 1 | 64.680000 | 0 |
1 | Graduation | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 0 | 3 | 1.8 | 62 | 27 | 0 | 4.500000 | 2 |
2 | Graduation | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 0 | 2 | 2.4 | 51 | 776 | 0 | 36.952381 | 0 |
3 | Graduation | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 0 | 3 | 1.9 | 32 | 53 | 0 | 6.625000 | 1 |
4 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 0 | 3 | 2.0 | 35 | 422 | 0 | 22.210526 | 1 |
5 rows × 31 columns
These new columns can provide us with additional insights and help us better understand the behavior and preferences of customers.
plt.figure(figsize = (15, 10))
sns.heatmap(data.corr(), annot = True, fmt = '0.2f')
plt.show()
The analysis shows that there are some strong positive correlations between variables , such as Total_Spent and AmountPerPurchase with a correlation coefficient of 0.93, and Total_Spent and MntWines with a correlation coefficient of 0.89. This suggests that customers who spend more money on purchases also tend to make larger purchases in terms of amount and buy more wines. There are also strong positive correlations between Family_Size and Total_Kids_Teens with a correlation coefficient of 0.85, which is expected as having more kids or teenagers in a family will increase the family size.
Moreover, there are moderate positive correlations between variables, such as Income and Total_Spent with a correlation coefficient of 0.79, and Income and AmountPerPurchase with a correlation coefficient of 0.76. This indicates that customers with higher income tend to spend more money on purchases and make larger purchases in terms of amount.
There are also some weak to moderate correlations between variables, such as NumCatalogPurchases and MntMeatProducts with a correlation coefficient of 0.73, and Total_Accepted_Offers and AcceptedCmp5 with a correlation coefficient of 0.72. This suggests that customers who make more catalog purchases tend to buy more meat products, and customers who accept more offers tend to accept offers from campaign 5.
We can see that there are several pairs of variables that are strongly negatively correlated, such as NumWebVisitsMonth and Income, Total_Spent and Kidhome, and AmountPerPurchase and Kidhome. This suggests that as one variable increases, the other decreases.
On the other hand, there are several pairs of variables that are strongly positively correlated, such as Total_Spent and MntWines, Total_Kids_Teens and Family_Size, and Total_Spent and MntMeatProducts. This suggests that as one variable increases, the other also tends to increase.
sns.boxplot(x='Marital_Status', y='Total_Spent', data=data)
<AxesSubplot:xlabel='Marital_Status', ylabel='Total_Spent'>
sns.boxplot(x='Marital_Status', y='Income', data=data)
<AxesSubplot:xlabel='Marital_Status', ylabel='Income'>
sns.boxplot(x='Family_Size', y='Total_Spent', data=data)
<AxesSubplot:xlabel='Family_Size', ylabel='Total_Spent'>
sns.boxplot(x='Total_Kids_Teens', y='Total_Spent', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='Total_Spent'>
sns.boxplot(x='Total_Kids_Teens', y='AmountPerPurchase', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='AmountPerPurchase'>
sns.boxplot(x='Total_Kids_Teens', y='AmountPerPurchase', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='AmountPerPurchase'>
sns.boxplot(x='Total_Kids_Teens', y='Income', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='Income'>
sns.boxplot(x='Total_Kids_Teens', y='MntWines', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='MntWines'>
sns.boxplot(x='Total_Kids_Teens', y='MntFishProducts', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='MntFishProducts'>
sns.boxplot(x='Total_Kids_Teens', y='MntFruits', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='MntFruits'>
sns.boxplot(x='Total_Kids_Teens', y='MntSweetProducts', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='MntSweetProducts'>
sns.boxplot(x='Total_Kids_Teens', y='NumWebPurchases', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='NumWebPurchases'>
sns.boxplot(x='Total_Kids_Teens', y='NumStorePurchases', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='NumStorePurchases'>
sns.boxplot(x='Total_Kids_Teens', y='NumStorePurchases', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='NumStorePurchases'>
sns.boxplot(x='Total_Accepted_Offers', y='Years_Since_Registration', data=data)
<AxesSubplot:xlabel='Total_Accepted_Offers', ylabel='Years_Since_Registration'>
sns.boxplot(x='Total_Kids_Teens', y='MntGoldProds', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='MntGoldProds'>
sns.boxplot(x='Total_Kids_Teens', y='NumDealsPurchases', data=data)
<AxesSubplot:xlabel='Total_Kids_Teens', ylabel='NumDealsPurchases'>
plt.hist([data[data['Total_Accepted_Offers'] == 0]['Total_Kids_Teens'],
data[data['Total_Accepted_Offers'] > 0]['Total_Kids_Teens']],
bins=10, alpha=0.7, label=['No accepted offers', 'Accepted offers'])
plt.legend(loc='upper right')
plt.xlabel('Total Kids Teens')
plt.ylabel('Frequency')
plt.title('Distribution of Total Kids Teens by Total Accepted Offers')
plt.show()
import matplotlib.pyplot as plt
plt.scatter(data['Income'], data['Total_Spent'])
plt.title('Scatter Plot of Income vs. Total_Spent')
plt.xlabel('Income')
plt.ylabel('Total_Spent')
plt.show()
dataI = data.copy()
# Dropping the "ID" column
dataI.drop(columns = "Marital_Status", inplace = True)
# Dropping the "ID" column
dataI.drop(columns = "Education", inplace = True)
dataI[(dataI.MntSweetProducts > 200)]
Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1898 | 113734.0 | 0 | 0 | 9 | 6 | 2 | 3 | 1 | 262 | 3 | ... | 0 | 0 | 0 | 1 | 1.6 | 71 | 277 | 0 | 10.259259 | 0 |
1 rows × 29 columns
rows_with_outliers = dataI[(dataI.MntSweetProducts > 200)]
# Get index for rows witn outliers
index_of_rows_to_drop = rows_with_outliers.index
# Drop rows
dataI = dataI.drop(index_of_rows_to_drop)
#Check the result
dataI[(dataI.MntSweetProducts > 200)]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens |
---|
0 rows × 31 columns
dataI[(dataI.NumDealsPurchases > 13)]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21 | Graduation | Married | 2447.0 | 1 | 0 | 42 | 1 | 1 | 1725 | 1 | ... | 0 | 0 | 0 | 3 | 3.0 | 37 | 1730 | 0 | 40.232558 | 1 |
164 | PhD | Married | 157243.0 | 0 | 1 | 98 | 20 | 2 | 1582 | 1 | ... | 0 | 0 | 0 | 3 | 1.9 | 43 | 1608 | 0 | 43.459459 | 1 |
432 | 2n Cycle | Together | 67309.0 | 1 | 1 | 76 | 515 | 47 | 181 | 149 | ... | 0 | 0 | 0 | 4 | 3.0 | 49 | 1082 | 0 | 27.743590 | 2 |
687 | PhD | Married | 160803.0 | 0 | 0 | 21 | 55 | 16 | 1622 | 17 | ... | 0 | 0 | 0 | 2 | 3.4 | 34 | 1717 | 0 | 39.022727 | 0 |
1042 | Graduation | Single | 8028.0 | 0 | 0 | 62 | 73 | 18 | 66 | 7 | ... | 0 | 0 | 0 | 1 | 3.3 | 25 | 178 | 0 | 11.125000 | 0 |
1245 | Graduation | Divorced | 1730.0 | 0 | 0 | 65 | 1 | 1 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 1.6 | 45 | 8 | 0 | 0.533333 | 0 |
1846 | PhD | Married | 4023.0 | 1 | 1 | 29 | 5 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 4 | 1.5 | 53 | 9 | 0 | 0.600000 | 2 |
7 rows × 31 columns
dataI[(dataI.MntGoldProds > 250)]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1328 | Master | Single | 6560.0 | 0 | 0 | 2 | 67 | 11 | 26 | 4 | ... | 0 | 0 | 0 | 1 | 2.1 | 34 | 373 | 0 | 186.50 | 0 |
1806 | PhD | Single | 7144.0 | 0 | 2 | 92 | 81 | 4 | 33 | 5 | ... | 0 | 0 | 0 | 3 | 2.1 | 50 | 416 | 0 | 16.64 | 2 |
1975 | Graduation | Married | 4428.0 | 0 | 1 | 0 | 16 | 4 | 12 | 2 | ... | 0 | 0 | 0 | 3 | 2.3 | 47 | 359 | 0 | 14.36 | 1 |
3 rows × 31 columns
rows_with_outliers = dataI[(dataI.MntGoldProds > 250)]
# Get index for rows witn outliers
index_of_rows_to_drop = rows_with_outliers.index
# Drop rows
dataI = dataI.drop(index_of_rows_to_drop)
#Check the result
dataI[(dataI.MntGoldProds > 250)]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens |
---|
0 rows × 31 columns
dataI[(dataI.NumWebPurchases > 20)]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1806 | PhD | Single | 7144.0 | 0 | 2 | 92 | 81 | 4 | 33 | 5 | ... | 0 | 0 | 0 | 3 | 2.1 | 50 | 416 | 0 | 16.640000 | 2 |
1898 | PhD | Single | 113734.0 | 0 | 0 | 9 | 6 | 2 | 3 | 1 | ... | 0 | 0 | 0 | 1 | 1.6 | 71 | 277 | 0 | 10.259259 | 0 |
1975 | Graduation | Married | 4428.0 | 0 | 1 | 0 | 16 | 4 | 12 | 2 | ... | 0 | 0 | 0 | 3 | 2.3 | 47 | 359 | 0 | 14.360000 | 1 |
3 rows × 31 columns
rows_with_outliers = dataI[(dataI.NumWebPurchases > 20) & (dataI.Income < 800)]
# Get index for rows witn outliers
index_of_rows_to_drop = rows_with_outliers.index
# Drop rows
dataI = dataI.drop(index_of_rows_to_drop)
# Check the result
dataI[dataI.NumWebPurchases > 20]
Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens |
---|
0 rows × 31 columns
# Apply one-hot encoding to 'Marital_Status' column using get_dummies()
dataI = pd.get_dummies(dataI, columns=['Marital_Status'], drop_first=True)
# Display the updated dataset
print(dataI.head())
# Check the new columns created
print(dataI.columns)
# Apply one-hot encoding to 'Marital_Status' column using get_dummies()
dataI = pd.get_dummies(dataI, columns=['Education'], drop_first=True)
# Display the updated dataset
print(dataI.head())
# Check the new columns created
print(dataI.columns)
Education Income Kidhome Teenhome Recency MntWines MntFruits \
0 Graduation 58138.0 0 0 58 635 88
1 Graduation 46344.0 1 1 38 11 1
2 Graduation 71613.0 0 0 26 426 49
3 Graduation 26646.0 1 0 26 11 4
4 PhD 58293.0 1 0 94 173 43
MntMeatProducts MntFishProducts MntSweetProducts ... \
0 546 172 88 ...
1 6 2 1 ...
2 127 111 21 ...
3 20 10 3 ...
4 118 46 27 ...
Years_Since_Registration Age Total_Spent Total_Accepted_Offers \
0 3.3 59 1617 1
1 1.8 62 27 0
2 2.4 51 776 0
3 1.9 32 53 0
4 2.0 35 422 0
AmountPerPurchase Total_Kids_Teens Marital_Status_Married \
0 64.680000 0 0
1 4.500000 2 0
2 36.952381 0 0
3 6.625000 1 0
4 22.210526 1 1
Marital_Status_Single Marital_Status_Together Marital_Status_Widow
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 0
[5 rows x 34 columns]
Index(['Education', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines',
'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response', 'Family_Size',
'Years_Since_Registration', 'Age', 'Total_Spent',
'Total_Accepted_Offers', 'AmountPerPurchase', 'Total_Kids_Teens',
'Marital_Status_Married', 'Marital_Status_Single',
'Marital_Status_Together', 'Marital_Status_Widow'],
dtype='object')
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts \
0 58138.0 0 0 58 635 88 546
1 46344.0 1 1 38 11 1 6
2 71613.0 0 0 26 426 49 127
3 26646.0 1 0 26 11 4 20
4 58293.0 1 0 94 173 43 118
MntFishProducts MntSweetProducts MntGoldProds ... AmountPerPurchase \
0 172 88 88 ... 64.680000
1 2 1 6 ... 4.500000
2 111 21 42 ... 36.952381
3 10 3 5 ... 6.625000
4 46 27 15 ... 22.210526
Total_Kids_Teens Marital_Status_Married Marital_Status_Single \
0 0 0 1
1 2 0 1
2 0 0 0
3 1 0 0
4 1 1 0
Marital_Status_Together Marital_Status_Widow Education_Basic \
0 0 0 0
1 0 0 0
2 1 0 0
3 1 0 0
4 0 0 0
Education_Graduation Education_Master Education_PhD
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 0 0 1
[5 rows x 37 columns]
Index(['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response', 'Family_Size',
'Years_Since_Registration', 'Age', 'Total_Spent',
'Total_Accepted_Offers', 'AmountPerPurchase', 'Total_Kids_Teens',
'Marital_Status_Married', 'Marital_Status_Single',
'Marital_Status_Together', 'Marital_Status_Widow', 'Education_Basic',
'Education_Graduation', 'Education_Master', 'Education_PhD'],
dtype='object')
#dataS = dataI.copy()
dataS.head()
Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | ... | 0 | 0 | 1 | 1 | 3.3 | 59 | 1617 | 1 | 64.680000 | 0 |
1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | ... | 0 | 0 | 0 | 3 | 1.8 | 62 | 27 | 0 | 4.500000 | 2 |
2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | ... | 0 | 0 | 0 | 2 | 2.4 | 51 | 776 | 0 | 36.952381 | 0 |
3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | ... | 0 | 0 | 0 | 3 | 1.9 | 32 | 53 | 0 | 6.625000 | 1 |
4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | ... | 0 | 0 | 0 | 3 | 2.0 | 35 | 422 | 0 | 22.210526 | 1 |
5 rows × 29 columns
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(dataS), columns = dataS.columns)
data_scaled.head()
Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | AcceptedCmp2 | Complain | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.351086 | 0.0 | 0.0 | 0.585859 | 0.425318 | 0.442211 | 0.316522 | 0.664093 | 0.335878 | 0.274143 | ... | 0.0 | 0.0 | 1.0 | 0.00 | 0.947368 | 0.696429 | 0.639683 | 0.2 | 0.344936 | 0.000000 |
1 | 0.277680 | 0.5 | 0.5 | 0.383838 | 0.007368 | 0.005025 | 0.003478 | 0.007722 | 0.003817 | 0.018692 | ... | 0.0 | 0.0 | 0.0 | 0.50 | 0.157895 | 0.750000 | 0.008730 | 0.0 | 0.021330 | 0.666667 |
2 | 0.434956 | 0.0 | 0.0 | 0.262626 | 0.285332 | 0.246231 | 0.073623 | 0.428571 | 0.080153 | 0.130841 | ... | 0.0 | 0.0 | 0.0 | 0.25 | 0.473684 | 0.553571 | 0.305952 | 0.0 | 0.195836 | 0.000000 |
3 | 0.155079 | 0.5 | 0.0 | 0.262626 | 0.007368 | 0.020101 | 0.011594 | 0.038610 | 0.011450 | 0.015576 | ... | 0.0 | 0.0 | 0.0 | 0.50 | 0.210526 | 0.214286 | 0.019048 | 0.0 | 0.032757 | 0.333333 |
4 | 0.352051 | 0.5 | 0.0 | 0.949495 | 0.115874 | 0.216080 | 0.068406 | 0.177606 | 0.103053 | 0.046729 | ... | 0.0 | 0.0 | 0.0 | 0.50 | 0.263158 | 0.267857 | 0.165476 | 0.0 | 0.116565 | 0.333333 |
5 rows × 29 columns
# Defining the number of principal components to generate
n = data_scaled.shape[1]
# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(data_scaled))
# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
X = data_pca
# Set up the PCA object with maximum number of components
pca = PCA(n_components=X.shape[1])
# Fit and transform the data
pca.fit_transform(X)
# Create a table to display cumulative variance explained by number of components
table = pd.DataFrame({'Components': range(1, X.shape[1]+1),
'Cumulative Variance Explained': np.cumsum(pca.explained_variance_ratio_)})
# Find the number of components needed to explain 80% of the variance
n_components = len(table[table['Cumulative Variance Explained'] <= 0.90])
# Plot the cumulative variance explained by number of components
plt.figure(figsize=(10, 6))
plt.plot(table['Components'], table['Cumulative Variance Explained'], '-o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.axvline(x=n_components, color='r', linestyle='--')
plt.text(n_components+1, 0.85, f'{n_components} components', color='r')
plt.show()
## # Create a new dataframe with the first 12 principal components as columns
cols = ["PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10", "PC11", "PC12"]
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:12], 2), index = data_scaled.columns, columns = cols)
pc1
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Income | 0.16 | -0.08 | 0.10 | 0.05 | 0.05 | 0.05 | 0.06 | -0.05 | -0.03 | -0.04 | 0.01 | -0.03 |
Kidhome | -0.29 | 0.25 | -0.01 | 0.04 | -0.16 | 0.09 | 0.56 | -0.01 | -0.19 | 0.03 | 0.22 | -0.30 |
Teenhome | -0.14 | -0.07 | 0.62 | 0.01 | 0.17 | 0.10 | -0.22 | -0.09 | 0.11 | -0.03 | -0.18 | 0.35 |
Recency | -0.02 | -0.21 | 0.04 | 0.17 | -0.84 | 0.27 | -0.22 | -0.29 | -0.08 | 0.02 | -0.06 | -0.00 |
MntWines | 0.29 | -0.05 | 0.23 | 0.04 | -0.04 | -0.02 | 0.05 | 0.14 | -0.09 | -0.12 | -0.07 | -0.32 |
MntFruits | 0.20 | -0.13 | -0.01 | -0.10 | 0.05 | 0.15 | 0.20 | -0.11 | -0.11 | 0.09 | 0.21 | 0.33 |
MntMeatProducts | 0.16 | -0.05 | -0.00 | -0.02 | -0.00 | 0.06 | 0.11 | -0.06 | -0.02 | -0.04 | 0.06 | -0.03 |
MntFishProducts | 0.22 | -0.14 | -0.02 | -0.10 | 0.05 | 0.15 | 0.22 | -0.12 | -0.03 | 0.23 | 0.30 | 0.27 |
MntSweetProducts | 0.15 | -0.09 | -0.00 | -0.05 | 0.01 | 0.10 | 0.14 | -0.07 | -0.01 | 0.06 | 0.09 | 0.19 |
MntGoldProds | 0.13 | -0.06 | 0.06 | -0.10 | 0.01 | 0.13 | 0.03 | 0.03 | -0.02 | 0.04 | 0.00 | 0.10 |
NumDealsPurchases | -0.04 | 0.02 | 0.17 | -0.10 | -0.02 | 0.00 | 0.05 | 0.01 | -0.09 | 0.05 | -0.10 | -0.10 |
NumWebPurchases | 0.08 | -0.03 | 0.10 | -0.05 | 0.01 | 0.02 | -0.01 | 0.02 | -0.05 | 0.03 | -0.07 | -0.06 |
NumCatalogPurchases | 0.12 | -0.04 | 0.04 | -0.02 | 0.01 | 0.06 | 0.02 | -0.01 | -0.02 | 0.00 | 0.02 | -0.03 |
NumStorePurchases | 0.26 | -0.23 | 0.21 | -0.07 | 0.09 | 0.00 | 0.09 | 0.06 | -0.26 | 0.04 | -0.40 | -0.36 |
NumWebVisitsMonth | -0.10 | 0.10 | 0.03 | -0.09 | -0.08 | -0.07 | -0.02 | 0.10 | 0.01 | 0.06 | -0.03 | -0.04 |
AcceptedCmp3 | 0.05 | 0.26 | -0.03 | -0.03 | 0.01 | 0.66 | -0.27 | 0.58 | -0.08 | 0.05 | 0.11 | -0.02 |
AcceptedCmp4 | 0.14 | 0.13 | 0.22 | 0.36 | -0.16 | -0.51 | -0.13 | 0.27 | -0.38 | 0.31 | 0.27 | 0.18 |
AcceptedCmp5 | 0.24 | 0.18 | 0.04 | 0.31 | -0.13 | -0.06 | 0.21 | 0.11 | 0.29 | -0.69 | 0.01 | 0.19 |
AcceptedCmp1 | 0.20 | 0.16 | 0.04 | 0.27 | -0.05 | 0.05 | 0.20 | -0.02 | 0.62 | 0.57 | -0.23 | -0.10 |
AcceptedCmp2 | 0.04 | 0.05 | 0.04 | 0.07 | -0.03 | -0.05 | -0.03 | 0.06 | -0.03 | 0.01 | -0.00 | -0.00 |
Complain | -0.01 | 0.00 | 0.00 | -0.01 | -0.01 | 0.00 | 0.00 | 0.00 | 0.01 | -0.01 | 0.01 | -0.00 |
Response | 0.26 | 0.69 | 0.03 | -0.20 | 0.05 | 0.03 | -0.16 | -0.53 | -0.21 | -0.02 | -0.10 | -0.01 |
Family_Size | -0.23 | 0.07 | 0.35 | 0.05 | 0.00 | 0.11 | 0.25 | 0.01 | -0.00 | -0.02 | 0.03 | 0.04 |
Years_Since_Registration | 0.07 | 0.09 | 0.18 | -0.73 | -0.39 | -0.27 | 0.05 | 0.26 | 0.28 | 0.00 | 0.08 | 0.05 |
Age | 0.02 | -0.08 | 0.25 | 0.03 | 0.13 | 0.00 | -0.32 | -0.22 | 0.29 | -0.05 | 0.65 | -0.45 |
Total_Spent | 0.35 | -0.11 | 0.14 | -0.03 | -0.01 | 0.08 | 0.16 | 0.02 | -0.09 | -0.05 | 0.05 | -0.12 |
Total_Accepted_Offers | 0.18 | 0.30 | 0.07 | 0.16 | -0.06 | 0.02 | -0.04 | 0.09 | 0.04 | 0.05 | 0.01 | 0.05 |
AmountPerPurchase | 0.22 | -0.05 | 0.04 | 0.00 | -0.01 | 0.04 | 0.08 | -0.00 | -0.02 | -0.06 | 0.11 | 0.01 |
Total_Kids_Teens | -0.29 | 0.12 | 0.41 | 0.03 | 0.00 | 0.13 | 0.22 | -0.06 | -0.06 | -0.00 | 0.02 | 0.03 |
Using this information, we can develop targeted marketing strategies to encourage customers to make larger purchases and to retain their loyalty over time.
from sklearn.metrics import silhouette_score
inertia = []
silhouette_scores = []
range_n_clusters = range(2, 11)
for n_clusters in range_n_clusters:
kmeans = KMeans(n_clusters=n_clusters, random_state=1)
kmeans_labels = kmeans.fit_predict(data_scaled)
inertia.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(data_scaled, kmeans_labels))
plt.figure(figsize=(10, 5))
plt.plot(range_n_clusters, inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_scaled)
# Adding predicted labels to the original data and the scaled data
data_scaled['KMeans_Labels'] = kmeans.predict(data_scaled)
data_scaled['KMeans_Labels'].value_counts()
# Add the cluster labels as a new column to the original dataset
data_with_labels['KMeans_Labels'] = kmeans.predict(data_scaled)
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
data_with_labels.head()
Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | Response | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | KMeans_Labels | ClusterLabels | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | ... | 1 | 1 | 3.3 | 59 | 1617 | 1 | 64.680000 | 0 | 2 | 1 |
1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | ... | 0 | 3 | 1.8 | 62 | 27 | 0 | 4.500000 | 2 | 0 | 0 |
2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | ... | 0 | 2 | 2.4 | 51 | 776 | 0 | 36.952381 | 0 | 1 | 1 |
3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | ... | 0 | 3 | 1.9 | 32 | 53 | 0 | 6.625000 | 1 | 0 | 0 |
4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | ... | 0 | 3 | 2.0 | 35 | 422 | 0 | 22.210526 | 1 | 0 | 0 |
5 rows × 31 columns
scaler = MinMaxScaler()
dataS_norm = scaler.fit_transform(dataS)
# Применяем KMeans
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(dataS_norm)
# Добавление меток кластеров в DataFrame
data_with_labels['KMeans_Labels'] = kmeans.labels_
# Группировка данных по меткам кластеров и расчет средних значений для каждого столбца
cluster_means = data_with_labels.groupby('KMeans_Labels').mean()
# Группировка данных по меткам кластеров и расчет средних значений для каждого столбца
cluster_means = data_with_labels.groupby('KMeans_Labels').mean()
# Add the cluster labels as a new column to the original dataset
#data_with_labels = dataS.copy()
#data_with_labels['ClusterLabels'] = kmeans_labels
C:\Users\dvmes\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
data_with_labels.head()
Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | ... | Family_Size | Years_Since_Registration | Age | Total_Spent | Total_Accepted_Offers | AmountPerPurchase | Total_Kids_Teens | KMeans_Labels | ClusterLabels | kmedoLabels | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | ... | 1 | 3.3 | 59 | 1617 | 1 | 64.680000 | 0 | 2 | 1 | 2 |
1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | ... | 3 | 1.8 | 62 | 27 | 0 | 4.500000 | 2 | 1 | 0 | 1 |
2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | ... | 2 | 2.4 | 51 | 776 | 0 | 36.952381 | 0 | 0 | 1 | 2 |
3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | ... | 3 | 1.9 | 32 | 53 | 0 | 6.625000 | 1 | 1 | 0 | 0 |
4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | ... | 3 | 2.0 | 35 | 422 | 0 | 22.210526 | 1 | 1 | 0 | 0 |
5 rows × 32 columns
data_with_labels['KMeans_Labels'].value_counts()
1 1202
0 756
2 250
Name: KMeans_Labels, dtype: int64
# Calculating the mean and the median of the original data for each label
mean = data_with_labels.groupby('KMeans_Labels').mean()
median = data_with_labels.groupby('KMeans_Labels').median()
df_kmeans = pd.concat([mean, median], axis = 0)
#df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
KMeans_Labels | 0 | 1 | 2 | 0 | 1 | 2 |
---|---|---|---|---|---|---|
Income | 67475.526455 | 37438.415141 | 74715.284000 | 67432.500000 | 37070.000 | 78211.00000 |
Kidhome | 0.052910 | 0.769551 | 0.048000 | 0.000000 | 1.000 | 0.00000 |
Teenhome | 0.537037 | 0.547421 | 0.212000 | 1.000000 | 1.000 | 0.00000 |
Recency | 51.685185 | 48.996672 | 41.404000 | 54.000000 | 49.000 | 36.00000 |
MntWines | 514.887566 | 80.591514 | 751.260000 | 483.000000 | 29.000 | 769.50000 |
MntFruits | 48.722222 | 6.133943 | 55.436000 | 33.000000 | 3.000 | 37.50000 |
MntMeatProducts | 287.375661 | 35.958403 | 433.068000 | 220.000000 | 18.000 | 417.00000 |
MntFishProducts | 70.567460 | 8.465058 | 77.344000 | 51.500000 | 4.000 | 59.00000 |
MntSweetProducts | 49.940476 | 6.308652 | 57.688000 | 35.000000 | 3.000 | 42.00000 |
MntGoldProds | 70.124339 | 20.754576 | 74.844000 | 53.000000 | 12.000 | 49.00000 |
NumDealsPurchases | 2.160053 | 2.566556 | 1.640000 | 2.000000 | 2.000 | 1.00000 |
NumWebPurchases | 5.603175 | 2.786190 | 5.748000 | 5.000000 | 2.000 | 5.00000 |
NumCatalogPurchases | 4.619048 | 0.791181 | 5.804000 | 4.000000 | 1.000 | 6.00000 |
NumStorePurchases | 8.621693 | 3.651414 | 7.648000 | 9.000000 | 3.000 | 7.00000 |
NumWebVisitsMonth | 3.957672 | 6.467554 | 3.944000 | 4.000000 | 7.000 | 3.50000 |
AcceptedCmp3 | 0.031746 | 0.068220 | 0.228000 | 0.000000 | 0.000 | 0.00000 |
AcceptedCmp4 | 0.085979 | 0.017471 | 0.312000 | 0.000000 | 0.000 | 0.00000 |
AcceptedCmp5 | 0.050265 | 0.000000 | 0.488000 | 0.000000 | 0.000 | 0.00000 |
AcceptedCmp1 | 0.054233 | 0.002496 | 0.388000 | 0.000000 | 0.000 | 0.00000 |
AcceptedCmp2 | 0.006614 | 0.001664 | 0.088000 | 0.000000 | 0.000 | 0.00000 |
Complain | 0.007937 | 0.010815 | 0.004000 | 0.000000 | 0.000 | 0.00000 |
Response | 0.005291 | 0.094010 | 0.856000 | 0.000000 | 0.000 | 1.00000 |
Family_Size | 2.256614 | 2.979201 | 1.772000 | 2.000000 | 3.000 | 2.00000 |
Years_Since_Registration | 2.514947 | 2.448003 | 2.662800 | 2.500000 | 2.400 | 2.80000 |
Age | 49.880952 | 45.283694 | 47.384000 | 50.000000 | 44.000 | 46.50000 |
Total_Spent | 1041.617725 | 158.212146 | 1449.640000 | 1005.500000 | 77.000 | 1490.00000 |
Total_Accepted_Offers | 0.234127 | 0.183860 | 2.360000 | 0.000000 | 0.000 | 2.00000 |
AmountPerPurchase | 50.815040 | 12.940770 | 72.116445 | 42.630833 | 10.775 | 68.02381 |
Total_Kids_Teens | 0.589947 | 1.316972 | 0.260000 | 1.000000 | 1.000 | 0.00000 |
ClusterLabels | 1.113757 | 0.020799 | 1.540000 | 1.000000 | 0.000 | 2.00000 |
kmedoLabels | 1.968254 | 0.516639 | 2.000000 | 2.000000 | 0.000 | 2.00000 |
# Apply PCA
pca = PCA(n_components=2)
scaler = MinMaxScaler()
data_scaled_N = pd.DataFrame(scaler.fit_transform(data_with_labels), columns = data_with_labels.columns)
pca.fit(data_scaled_N)
pca_df = pd.DataFrame(pca.transform(data_scaled_N), columns=['PC1', 'PC2'])
# Plot the clusters
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=data_with_labels['KMeans_Labels'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('KMeans Clusters')
plt.show()
# For each variable
for col in df_kmeans.columns:
# Create a plot for each cluster
plt.figure()
df_kmeans.boxplot(column=[col], by='KMeans_Labels')
plt.title(col)
plt.xlabel('Cluster')
plt.ylabel('Value')
C:\Users\dvmes\AppData\Local\Temp\ipykernel_5196\3908337906.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
plt.figure()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
kmedo = KMedoids(n_clusters = 3, random_state = 1)
kmedo.fit(data_scaled)
data_with_labels['kmedoLabels'] = kmedo.predict(data_scaled)
#data['kmedoLabels'] = kmedo.predict(data_scaled)
data_with_labels.kmedoLabels.value_counts()
2 1009
0 608
1 591
Name: kmedoLabels, dtype: int64
# Calculating the mean and the median of the original data for each label
mean = data_with_labels.groupby('kmedoLabels').mean()
median = data_with_labels.groupby('kmedoLabels').median()
df_kmeans = pd.concat([mean, median], axis = 0)
#df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
kmedoLabels | 0 | 1 | 2 | 0 | 1 | 2 |
---|---|---|---|---|---|---|
Income | 31127.623355 | 44047.563452 | 69111.584737 | 30324.5 | 44421.0000 | 69661.00000 |
Kidhome | 0.809211 | 0.729272 | 0.053518 | 1.0 | 1.0000 | 0.00000 |
Teenhome | 0.044408 | 1.052453 | 0.463826 | 0.0 | 1.0000 | 0.00000 |
Recency | 48.883224 | 48.133672 | 49.703667 | 48.0 | 49.0000 | 51.00000 |
MntWines | 39.486842 | 117.236887 | 575.466799 | 14.0 | 64.0000 | 529.00000 |
MntFruits | 6.702303 | 5.121827 | 50.509415 | 3.0 | 2.0000 | 34.00000 |
MntMeatProducts | 30.861842 | 38.477157 | 324.322101 | 15.0 | 22.0000 | 255.00000 |
MntFishProducts | 8.949013 | 7.563452 | 72.298315 | 4.0 | 3.0000 | 54.00000 |
MntSweetProducts | 6.735197 | 5.461929 | 51.969277 | 4.0 | 2.0000 | 35.00000 |
MntGoldProds | 18.108553 | 23.521151 | 71.120912 | 11.0 | 12.0000 | 51.00000 |
NumDealsPurchases | 1.960526 | 3.101523 | 2.084242 | 1.0 | 2.0000 | 1.00000 |
NumWebPurchases | 2.328947 | 3.223350 | 5.650149 | 2.0 | 3.0000 | 5.00000 |
NumCatalogPurchases | 0.578947 | 0.979695 | 4.918731 | 0.0 | 1.0000 | 4.00000 |
NumStorePurchases | 3.180921 | 4.050761 | 8.415263 | 3.0 | 4.0000 | 8.00000 |
NumWebVisitsMonth | 6.791118 | 6.131980 | 3.963330 | 7.0 | 6.0000 | 4.00000 |
AcceptedCmp3 | 0.082237 | 0.054146 | 0.080278 | 0.0 | 0.0000 | 0.00000 |
AcceptedCmp4 | 0.001645 | 0.030457 | 0.143707 | 0.0 | 0.0000 | 0.00000 |
AcceptedCmp5 | 0.000000 | 0.000000 | 0.158573 | 0.0 | 0.0000 | 0.00000 |
AcceptedCmp1 | 0.000000 | 0.006768 | 0.135778 | 0.0 | 0.0000 | 0.00000 |
AcceptedCmp2 | 0.000000 | 0.003384 | 0.026759 | 0.0 | 0.0000 | 0.00000 |
Complain | 0.011513 | 0.010152 | 0.006938 | 0.0 | 0.0000 | 0.00000 |
Response | 0.111842 | 0.077834 | 0.215064 | 0.0 | 0.0000 | 0.00000 |
Family_Size | 2.493421 | 3.478849 | 2.138751 | 3.0 | 3.0000 | 2.00000 |
Years_Since_Registration | 2.433882 | 2.445008 | 2.561645 | 2.4 | 2.4000 | 2.60000 |
Age | 38.899671 | 51.546531 | 49.427156 | 38.0 | 51.0000 | 50.00000 |
Total_Spent | 110.843750 | 197.382403 | 1145.686819 | 63.0 | 122.0000 | 1101.00000 |
Total_Accepted_Offers | 0.195724 | 0.172589 | 0.760159 | 0.0 | 0.0000 | 0.00000 |
AmountPerPurchase | 11.447391 | 14.526143 | 55.951559 | 9.0 | 13.1875 | 52.73913 |
Total_Kids_Teens | 0.853618 | 1.781726 | 0.517344 | 1.0 | 2.0000 | 0.00000 |
KMeans_Labels | 0.996711 | 0.966159 | 0.520317 | 1.0 | 1.0000 | 0.00000 |
ClusterLabels | 0.000000 | 0.000000 | 1.240833 | 0.0 | 0.0000 | 1.00000 |
HCLabels | 2.996711 | 3.000000 | 0.753221 | 3.0 | 3.0000 | 1.00000 |
# Apply PCA
pca = PCA(n_components=2)
scaler = MinMaxScaler()
data_scaled_N = pd.DataFrame(scaler.fit_transform(data_with_labels), columns = data_with_labels.columns)
pca.fit(data_scaled_N)
pca_df = pd.DataFrame(pca.transform(data_scaled_N), columns=['PC1', 'PC2'])
# Plot the clusters
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=data_with_labels['kmedoLabels'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('KMedoids Clusters')
plt.show()
# For each variable
for col in df_kmeans.columns:
# Creating a boxplot chart for each cluster
plt.figure()
df_kmeans.boxplot(column=[col], by='kmedoLabels')
plt.title(col)
plt.xlabel('Cluster')
plt.ylabel('Value')
C:\Users\dvmes\AppData\Local\Temp\ipykernel_5196\4071859610.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
plt.figure()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
gmm = GaussianMixture(n_components = 3, random_state = 1)
gmm.fit(data_scaled)
data_with_labels['GmmLabels'] = gmm.predict(data_scaled)
data_with_labels.GmmLabels.value_counts()
0 1190
1 741
2 277
Name: GmmLabels, dtype: int64
# Calculating the mean and the median of the original data for each label
mean = data_with_labels.groupby('GmmLabels').mean()
median = data_with_labels.groupby('GmmLabels').median()
df_kmeans = pd.concat([mean, median], axis = 0)
#df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
GmmLabels | 0 | 1 | 2 | 0 | 1 | 2 |
---|---|---|---|---|---|---|
Income | 37439.544538 | 65589.304993 | 77749.317690 | 36967.000000 | 66334.000 | 79174.000000 |
Kidhome | 0.768908 | 0.066127 | 0.046931 | 1.000000 | 0.000 | 0.000000 |
Teenhome | 0.536975 | 0.572200 | 0.194946 | 1.000000 | 1.000 | 0.000000 |
Recency | 48.547899 | 50.240216 | 48.083032 | 49.000000 | 51.000 | 46.000000 |
MntWines | 76.266387 | 485.896086 | 805.541516 | 28.000000 | 459.000 | 823.000000 |
MntFruits | 5.913445 | 49.273954 | 52.407942 | 3.000000 | 35.000 | 31.000000 |
MntMeatProducts | 32.886555 | 280.990553 | 438.252708 | 17.000000 | 217.000 | 415.000000 |
MntFishProducts | 8.204202 | 68.114710 | 81.675090 | 4.000000 | 50.000 | 63.000000 |
MntSweetProducts | 6.010084 | 48.388664 | 60.476534 | 3.000000 | 33.000 | 42.000000 |
MntGoldProds | 20.239496 | 69.607287 | 75.841155 | 12.000000 | 51.000 | 54.000000 |
NumDealsPurchases | 2.496639 | 2.419703 | 1.314079 | 2.000000 | 2.000 | 1.000000 |
NumWebPurchases | 2.713445 | 5.727395 | 5.592058 | 2.000000 | 5.000 | 5.000000 |
NumCatalogPurchases | 0.745378 | 4.558704 | 5.880866 | 1.000000 | 4.000 | 6.000000 |
NumStorePurchases | 3.607563 | 8.443995 | 8.191336 | 3.000000 | 8.000 | 8.000000 |
NumWebVisitsMonth | 6.482353 | 4.130904 | 3.527076 | 7.000000 | 4.000 | 3.000000 |
AcceptedCmp3 | 0.068908 | 0.044534 | 0.173285 | 0.000000 | 0.000 | 0.000000 |
AcceptedCmp4 | 0.015966 | 0.062078 | 0.357401 | 0.000000 | 0.000 | 0.000000 |
AcceptedCmp5 | 0.000000 | 0.000000 | 0.577617 | 0.000000 | 0.000 | 1.000000 |
AcceptedCmp1 | 0.000000 | 0.005398 | 0.494585 | 0.000000 | 0.000 | 0.000000 |
AcceptedCmp2 | 0.001681 | 0.000000 | 0.097473 | 0.000000 | 0.000 | 0.000000 |
Complain | 0.010924 | 0.008097 | 0.003610 | 0.000000 | 0.000 | 0.000000 |
Response | 0.095798 | 0.116059 | 0.472924 | 0.000000 | 0.000 | 0.000000 |
Family_Size | 2.973109 | 2.253711 | 1.884477 | 3.000000 | 2.000 | 2.000000 |
Years_Since_Registration | 2.441345 | 2.581646 | 2.495668 | 2.400000 | 2.600 | 2.500000 |
Age | 45.078151 | 50.094467 | 47.740072 | 44.000000 | 51.000 | 47.000000 |
Total_Spent | 149.520168 | 1002.271255 | 1514.194946 | 76.000000 | 976.000 | 1544.000000 |
Total_Accepted_Offers | 0.182353 | 0.228070 | 2.173285 | 0.000000 | 0.000 | 2.000000 |
AmountPerPurchase | 12.868616 | 48.426336 | 75.099342 | 10.585714 | 40.375 | 68.538462 |
Total_Kids_Teens | 1.305882 | 0.638327 | 0.241877 | 1.000000 | 1.000 | 0.000000 |
KMeans_Labels | 0.982353 | 0.271255 | 1.198556 | 1.000000 | 0.000 | 2.000000 |
ClusterLabels | 0.000000 | 0.987854 | 1.877256 | 0.000000 | 1.000 | 2.000000 |
kmedoLabels | 0.489916 | 1.986505 | 2.000000 | 0.000000 | 2.000 | 2.000000 |
HCLabels | 3.000000 | 0.905533 | 0.411552 | 3.000000 | 1.000 | 0.000000 |
DBSLabels | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000 | -1.000000 |
Cluster 0 represents lower-income families with more children and teenagers at home, spending less on products and having a lower acceptance rate for marketing campaigns. Cluster 1 represents middle-income families with fewer kids at home, moderate spending on products, and a moderate acceptance rate for marketing campaigns. Cluster 2 represents high-income families with the smallest family size and very few kids at home, spending the most on products, and having the highest acceptance rate for marketing campaigns.
# Apply PCA
pca = PCA(n_components=2)
scaler = MinMaxScaler()
data_scaled_N = pd.DataFrame(scaler.fit_transform(data_with_labels), columns = data_with_labels.columns)
pca.fit(data_scaled_N)
pca_df = pd.DataFrame(pca.transform(data_scaled_N), columns=['PC1', 'PC2'])
# Plot the clusters
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=data_with_labels['GmmLabels'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('GmmLabels')
plt.show()
# For each variable
for col in df_kmeans.columns:
# Creating a boxplot chart for each cluster
plt.figure()
df_kmeans.boxplot(column=[col], by='GmmLabels')
plt.title(col)
plt.xlabel('Cluster')
plt.ylabel('Value')
C:\Users\dvmes\AppData\Local\Temp\ipykernel_5196\404400333.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
plt.figure()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
for i in range(10, 90, 5):
tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)
data_tsne = tsne.fit_transform(data_scaled)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (7,7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
plt.title("perplexity = {}".format(i))
# Add the cluster labels as a new column to the original dataset
data_with_labels = dataS.copy()
data_with_labels['ClusterLabels_TSNE'] = kmeans_labels
data_with_labels['ClusterLabels_TSNE'].value_counts()
2 1073
1 733
0 402
Name: ClusterLabels_TSNE, dtype: int64
# Calculating the mean and the median of the original data for each label
mean = data_with_labels.groupby('ClusterLabels_TSNE').mean()
median = data_with_labels.groupby('ClusterLabels_TSNE').median()
df_kmeans = pd.concat([mean, median], axis = 0)
#df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
ClusterLabels_TSNE | 0 | 1 | 2 | 0 | 1 | 2 |
---|---|---|---|---|---|---|
Income | 63948.136816 | 66295.001364 | 37642.038211 | 70947.500000 | 66825.000000 | 37085.000000 |
Kidhome | 0.293532 | 0.057299 | 0.761417 | 0.000000 | 0.000000 | 1.000000 |
Teenhome | 0.278607 | 0.566166 | 0.549860 | 0.000000 | 1.000000 | 1.000000 |
Recency | 39.728856 | 51.504775 | 50.880708 | 35.000000 | 54.000000 | 51.000000 |
MntWines | 566.721393 | 500.802183 | 73.654240 | 549.000000 | 466.000000 | 27.000000 |
MntFruits | 41.039801 | 48.435198 | 5.652377 | 23.000000 | 33.000000 | 3.000000 |
MntMeatProducts | 324.651741 | 277.094134 | 32.735322 | 254.000000 | 216.000000 | 16.000000 |
MntFishProducts | 56.907960 | 69.914052 | 8.141659 | 32.000000 | 50.000000 | 4.000000 |
MntSweetProducts | 44.179104 | 48.705321 | 5.870457 | 25.500000 | 33.000000 | 3.000000 |
MntGoldProds | 62.343284 | 70.484311 | 18.588071 | 40.000000 | 53.000000 | 10.000000 |
NumDealsPurchases | 2.124378 | 2.283765 | 2.423113 | 1.000000 | 2.000000 | 2.000000 |
NumWebPurchases | 5.099502 | 5.612551 | 2.663560 | 5.000000 | 5.000000 | 2.000000 |
NumCatalogPurchases | 4.480100 | 4.540246 | 0.712954 | 4.000000 | 4.000000 | 0.000000 |
NumStorePurchases | 6.539801 | 8.618008 | 3.609506 | 6.000000 | 9.000000 | 3.000000 |
NumWebVisitsMonth | 4.838308 | 4.075034 | 6.356011 | 5.000000 | 4.000000 | 7.000000 |
AcceptedCmp3 | 0.213930 | 0.038199 | 0.045666 | 0.000000 | 0.000000 | 0.000000 |
AcceptedCmp4 | 0.203980 | 0.088677 | 0.015843 | 0.000000 | 0.000000 | 0.000000 |
AcceptedCmp5 | 0.393035 | 0.002729 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
AcceptedCmp1 | 0.238806 | 0.055935 | 0.003728 | 0.000000 | 0.000000 | 0.000000 |
AcceptedCmp2 | 0.054726 | 0.006821 | 0.001864 | 0.000000 | 0.000000 | 0.000000 |
Complain | 0.007463 | 0.009550 | 0.009320 | 0.000000 | 0.000000 | 0.000000 |
Response | 0.823383 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
Family_Size | 2.116915 | 2.272851 | 2.994408 | 2.000000 | 2.000000 | 3.000000 |
Years_Since_Registration | 2.671393 | 2.540246 | 2.398509 | 2.800000 | 2.600000 | 2.400000 |
Age | 46.977612 | 49.986357 | 45.164958 | 46.000000 | 51.000000 | 44.000000 |
Total_Spent | 1095.843284 | 1015.435198 | 144.642125 | 1213.000000 | 989.000000 | 71.000000 |
Total_Accepted_Offers | 1.927861 | 0.192360 | 0.067102 | 2.000000 | 0.000000 | 0.000000 |
AmountPerPurchase | 55.667561 | 49.459610 | 12.458372 | 57.027778 | 41.269231 | 10.111111 |
Total_Kids_Teens | 0.572139 | 0.623465 | 1.311277 | 0.000000 | 1.000000 | 1.000000 |
KMeans_Labels | 1.251244 | 1.549795 | 1.381174 | 1.000000 | 2.000000 | 2.000000 |
The t-SNE clustering result shows that the customers can be grouped into three distinct clusters based on their purchasing behavior, demographic characteristics, and response to marketing campaigns. The first cluster, labeled as 0, has the highest average income, the highest amount spent on purchases, and the highest number of accepted marketing campaigns. They are mostly middle-aged, with relatively small families, and have made a significant number of purchases from the store and through the website. The second cluster, labeled as 1, has a lower average income and a lower amount spent on purchases compared to the first cluster. They have a moderate number of accepted marketing campaigns and tend to have more teenage children. The third cluster, labeled as 2, has the lowest average income, the lowest amount spent on purchases, and the lowest number of accepted marketing campaigns. They tend to have more kids and teens in their families and are less likely to make purchases through the website.
# Для каждой переменной
for col in df_kmeans.columns:
# Создание графика boxplot для каждого кластера
plt.figure()
df_kmeans.boxplot(column=[col], by='ClusterLabels_TSNE')
plt.title(col)
plt.xlabel('Cluster')
plt.ylabel('Value')
C:\Users\dvmes\AppData\Local\Temp\ipykernel_5196\3974133964.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
plt.figure()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
Based on the comparison, we can observe that the clusters derived from each method have similar characteristics. K-means and t-SNE produce almost identical cluster profiles, while K-medoids and GMM generate clusters with slightly different family structures. However, the general trend among all methods is that they divide customers into three segments: