K-means Clustering for Customer Segmentation.

In this tutorial, you will learn how to implement K-means Customer segmentation with BigQuery

Look at the assignment guide: https://scribehow.com/shared/K-means_in_Google_Cloud_Console__X8Z_dr2DSmSGiBpTWnO67Q

Install the Scribe Chrome extension, or open the guide in a different browser. Go to Google Cloud, and keep this tab open.

1. Introduction to Customer Segmentation

Marketing professionals need to understand their customers. Data can be collected from numerous sources, including online storefronts, mobile apps, email campaigns, CRM systems, website analytics, social media channels, surveys, market research, and in-store transactions. Once the data is gathered, it is organized and analyzed. Selecting a customer segmentation method is crucial for tailoring marketing strategies to specific population segments.

1.1. Customer Segmentation Methods

Demographic information includes age, gender, income, education, and occupation. We can gather this data using tools such as GA4 and CRM systems like Salesforce and HubSpot. With Google Signals enabled Google Analytics can give us insights into user demographics, such as age, gender, location, and device type, based on how they interact with our website.CRM platforms can gather demographic information through forms, chatbot interactions, or landing pages. Social media platforms can also provide demographic insights.

Socio-cultural segmentation measures the cultural, societal, and familial influences on consumers. Variables like values, beliefs, social class, religion, lifestyle, cultural background, social identity, and family structure are important to consider. You can keep track of these socio-cultural values by analyzing social media posts, comments, interactions, and geo-locations. Other methods include analyzing data from surveys, statistics, purchase habits, media consumption patterns, search trends, focus groups, and customer feedback.

Behavioral segmentation is a powerful tool that divides customers based on their actions, such as purchasing patterns, product usage, or brand loyalty. Variables for this segmentation can be retrieved from point-of-sale records, website and app analytics, loyalty program data, CRM systems, social media interactions, and customer feedback. Tools like GA4 track user interactions, session durations, and conversion paths on websites and apps. CRM systems provide sales data and customer interactions across various channels; social media analytics reveal engagement patterns and brand interactions.

Marketers can segment customers based on psychological characteristics, including personality, values, attitudes, interests, and lifestyles. They can gather information about customers’ opinions, beliefs, and lifestyle choices through surveys, questionnaires, and interviews. Additionally, businesses can use social media analytics to better understand their customers by examining their online activities and interactions. Platforms like Facebook Audience Insights, X, and TikTok Analytics can show what customers are interested in. Customer relationship management (CRM) systems track how customers interact with the business and what they prefer.

2. Big Query

Before diving into the specifics of customer segmentation methods with machine learning, we must select a data warehouse to store and manage data. According to Gartner’s 2023 Magic Quadrant for Cloud Database Management Systems, the top companies in the cloud data warehouse field are Google with BigQuery, Amazon with Redshift, Microsoft with Azure Synapse Analytics, and Oracle. Their 2023 data warehouse report compares different data warehouses.
For our segmentation exercise, we’ve chosen BigQuery. BigQuery is a data management solution on Google Cloud. It uses standard SQL for querying and has built-in machine learning (BigQuery ML), allowing us to create and run machine learning models using SQL. SQL (Structured Query Language) was developed at IBM in the early 1970s. SQL helps us organize, retrieve, insert data, and modify databases.

2.1 Pull and Prepare the Data

To conduct the segmentation, we should connect our data sources to the data management solution we selected. The customer data we will use for our WordPress websites is in Woocommerce, the e-commerce plugin installed on our websites. We also need Google Analytics data to gather additional behavioral data.

To automate the extraction of WooCommerce data, we need to pull in the following variables for segmentation: customer_id, name, age, city, country, order_id, order_date, order_amount, and product_id. We can use tools like Supermetrics, which provides pre-built connectors to pull the data into BigQuery. Alternatively, we can use the WooCommerce REST API to connect WooCommerce with BigQuery directly, or we can set up cron jobs (automated scripts) on a server or cloud service to fetch and load the data into BigQuery on a daily or weekly basis.
Google Analytics is more straightforward to connect because BigQuery has a native connector for GA4 (Google Analytics 4) that can automatically import data.
We can combine both datasets once the data is imported into one table. SQL can do that for us. We need to use an SQL joins query that connects the WooCommerce table with the Google Analytics table using the customer_id field. This will combine the main fields from both sources and create one complete table. To keep this combined table up to date with the most recent data, we can schedule the SQL query to run automatically in BigQuery.

3. Recency, Frequency, Monetary Value

We can cluster customer groups based on their purchasing behavior. This segmentation is often based on the Recency, Frequency, and Monetary (RFM) model. This model takes into account how much money a customer spends on purchases (Monetary), how recently a customer made a purchase (Recency), and how often a customer makes a purchase (Frequency). We must calculate each customer’s Recency, Frequency, and Monetary (RFM) values to analyze customer behavior. To calculate Recency, we will find the number of days since each customer’s last purchase by subtracting their most recent order date from the current date. For Frequency, we will count the total number of purchases made by each customer. Finally, to calculate the Monetary value, we will sum up the total amount spent by each customer over a specific period. We can use a combined SQL query in BigQuery to gather the necessary data for calculating the RFM values.


WITH recency AS (
  SELECT
    customer_id,
    DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS recency
  FROM
    `your_project.your_dataset.woocommerce_table`
  GROUP BY
    customer_id
),
frequency AS (
  SELECT
    customer_id,
    COUNT(order_id) AS frequency
  FROM
    `your_project.your_dataset.woocommerce_table`
  GROUP BY
    customer_id
),
monetary AS (
  SELECT
    customer_id,
    SUM(order_amount) AS monetary
  FROM
    `your_project.your_dataset.woocommerce_table`
  GROUP BY
    customer_id
)

SELECT
  r.customer_id,
  r.recency,
  f.frequency,
  m.monetary
FROM
  recency r
JOIN
  frequency f ON r.customer_id = f.customer_id
JOIN
  monetary m ON r.customer_id = m.customer_id;

3.1 Standardize the RFM variables.

Before using K-means clustering, make sure your data is standardized. This is important because RFM values have different scales. For example, monetary values could range in the thousands, while frequency could be a single-digit number. Standardizing the data, usually by converting it to z-scores, helps the K-means algorithm perform better by treating all variables equally. The z-score shows how far a data point is from the average of a dataset, measured in standard deviations. It helps determine if a data point is typical or unusual compared to the rest. A positive z-score means the value is above average, and a negative z-score is below average.


Z-Score Calculation:

A z-score is calculated using the formula:

\[
Z = \frac{X – \mu}{\sigma}
\]

where:

  • \(X\) is the individual data point (e.g., a customer’s recency, frequency, or monetary value),
  • \(\mu\) is the mean of the dataset,
  • \(\sigma\) is the standard deviation of the dataset. The standard deviation measures how far the datapoints are from the mean.

Standardization SQL Query:


WITH standardized_rfm AS (
  SELECT
    customer_id,
    (recency - AVG(recency) OVER()) / STDDEV(recency) OVER() AS recency_z,
    (frequency - AVG(frequency) OVER()) / STDDEV(frequency) OVER() AS frequency_z,
    (monetary - AVG(monetary) OVER()) / STDDEV(monetary) OVER() AS monetary_z
  FROM
    `your_project.your_dataset.rfm_table`
)
SELECT
  customer_id,
  recency_z,
  frequency_z,
  monetary_z
FROM
  standardized_rfm;

4. K-means Clustering

To improve segmentation efficiency, we can use machine learning to help us identify customer groups. K-means clustering is an unsupervised learning algorithm that is particularly useful when dealing with datasets that do not have predefined labels. The K-means algorithm identifies fraud, detects cyberattacks, helps with medical image analysis, and optimizes delivery routes. K-means is also a popular method for customer segmentation. In k-means clustering, data is divided into a set number of clusters (k). The value of k is chosen beforehand, often randomly. For example, if we choose four clusters, the algorithm places four initial points (centroids) in a multidimensional space where each variable is a dimension. Each data point is assigned to the nearest centroid, forming four clusters. After all the points are assigned to a cluster, the algorithm calculates the average of all the points in each cluster to find a new center (or centroid) for that cluster. This new centroid now represents the center of that cluster. The data points are then reassigned based on their proximity to the new centroids, and the process repeats. This process repeats until the centroids no longer move, resulting in the final clusters. The simplified method is illustrated below.

View the interactive version of these graphs: K-means steps.

SQL Query to Create K-means Model:

num_clusters=2,3,4,5 : Adjust the number of clusters based on how finely you want to segment your customers


CREATE OR REPLACE MODEL `your_project.your_dataset.rfm_kmeans_model`
OPTIONS(model_type='kmeans', num_clusters=4) AS
SELECT
  recency_z,
  frequency_z,
  monetary_z
FROM
  `your_project.your_dataset.standardized_rfm_table`;

Save the model

4.1 Evaluate the K-means Model

After creating the K-means model, evaluate it to understand how well the clustering has performed.

After running the K-Means model, we can view the results on the evaluation panel. The first two metrics in this example are the Davies-Bouldin index (DBI) and the Mean squared distance (MDI). The Davies-Bouldin index shows how well the clusters are formed. A high index number shows that the clusters are not well-separated. If the data points in a cluster are spread out, it becomes harder to separate different segments. The Mean squared distance measures the average distance between each data point in a cluster and the cluster’s centroid (the cluster’s center). It indicates how tightly the data points in a cluster are grouped around the centroid. Ideally, the Davies-Bouldin index and the Mean Squared distance should be as low as possible. A low DBI indicates that the clusters are well-formed and separated; a low MDI means that the data points are close to the centroids and that the clusters are cohesive.

Since the results shown in the example above are not ideal (>1), I could rerun the K-Means model with a different number of clusters to see if this would improve both the Davies-Bouldin index and the Mean Squared distance.

In this example, we have three clusters. Cluster one has 508 customers who have a lower-than-average frequency of purchases and spend less money but were recently active on your e-commerce platform. The second cluster buys more frequently, spends more money, and is also recently active. Cluster three has a higher-than-average frequency and order value but has not been active buyers recently. Name the clusters to give them more relevant segment names. You could, for example, use this classification by Bloomreach. Businesses usually invent names for their customer segments.

4.2 Predict and Assign Clusters to Customers

Use the ML.PREDICT function to apply the model to your data and assign each customer to a cluster.

Prediction SQL Query:


SELECT
  customer_id,
  recency_z AS recency,
  frequency_z AS frequency,
  monetary_z AS monetary,
  CENTROID_ID AS predicted_cluster
FROM
  ML.PREDICT(MODEL `your_project.your_dataset.rfm.model`,
    (
    SELECT
      customer_id,
      recency_z,
      frequency_z,
      monetary_z
    FROM
      `your_project.your_dataset.RFM_Z_VALUES`
    )
  );

This query will add a predicted_cluster column to your data, indicating which cluster each customer belongs to.

Save the table

4.3 Combine the Tables

SQL Query:


-- Create a new table in your dataset called RFM_COMBINED
CREATE TABLE `your_project.your_dataset.RFM_COMBINED` AS

-- Select the necessary columns from the original RFM_TABLE (alias as t1)
SELECT
    t1.customer_id,      -- Customer ID
    t1.order_id,         -- Order ID
    t1.order_date,       -- Date of the order
    t1.order_amount,     -- Amount of the order
    t1.name,             -- Customer's name
    t1.age,              -- Customer's age
    t1.city,             -- Customer's city
    t1.country,          -- Customer's country
    t1.favorite_product, -- Customer's favorite product
    t1.frequency,        -- Original frequency value
    t1.recency,          -- Original recency value
    t1.monetary,         -- Original monetary value

    -- Select and rename columns from the predicted clusters table (alias as t2)
    t2.frequency AS frequency_z,     -- Predicted frequency value (standardized)
    t2.recency AS recency_z,         -- Predicted recency value (standardized)
    t2.monetary AS monetary_z,       -- Predicted monetary value (standardized)
    t2.predicted_cluster             -- Cluster ID predicted by the ML model

-- Indicate the primary table to select from (the original RFM table)
FROM
    `your_project.your_dataset.RFM_TABLE` AS t1

-- Left join the original table with the predicted clusters on customer_id
LEFT JOIN
    `your_project.your_dataset.rfm_cluster` AS t2  -- Join with the predictions generated by the ML model
ON
    t1.customer_id = t2.customer_id;  -- Match records by customer_id

Save the table

5. Visualize the results using a data visualization tool.

We can choose from many data visualization and reporting tools to visualize our data. Some of the most well-known tools are Google’s Looker Studio, Salesforce’s Tableau, and Microsoft’s Power BI. Because we are already working in Google Cloud, we will choose Looker Studio to visualize our data. Once you have finished running the K-means model, you can go to the table with the results and click on visualize in Looker Studio.

Save the Looker studio template

Complete the related assignment describing Milestone I

Resources

Christy, A. J., Umamakeswari, A., Priyatharsini, L., & Neyaa, A. (2022). RFM ranking – An effective approach to customer segmentation. Journal of King Saud University – Computer and Information Sciences, 33(10), 1251-1257. https://www.sciencedirect.com/science/article/pii/S1319157819309802?via%3Dihub

Hariram, S., Subramanian, R. R., Kamalakannan, P., Reddy, M. T., Dhanasekaran, S., & Ishvarya, M. (2023). Customer segmentation analysis leveraging machine learning algorithms. IEEE. https://ieeexplore.ieee.org/document/10263179

Smaili, M. Y., & Hachimi, H. (2023). New RFM-D classification model for improving customer analysis and response prediction. Ain Shams Engineering Journal, 14(12), Article 102254. https://doi.org/10.1016/j.asej.2023.102254

Sun, Y., Liu, H., & Gao, Y. (2023). Research on customer lifetime value based on machine learning algorithms and customer relationship management analysis model. Heliyon, 9(2), e13384. https://doi.org/10.1016/j.heliyon.2023.e13384

Venkatesan, R., & Lecinski, J. (2021). The AI marketing canvas : a five-stage road map to implementing artificial intelligence in marketing. Stanford Business Books. Boston University Library Link.