K-means clustering and customer segmentation

Understanding your customers is essential in marketing. To help you know who your customers are, you can gather data from various sources, such as your online storefront, mobile apps, email campaigns, CRM systems, website analytics, social media channels, surveys, market research, and in-store transactions. After setting up data collection platforms, we should gather the data in a data warehouse through data connectors or application programming interfaces.

1. Customer Segmentation Methods.

Demographic information includes age, gender, income, education, and occupation. We can gather this data using tools such as GA4 and CRM systems like Salesforce and HubSpot. With Google Signals enabled, Google Analytics can give us insights into user demographics, such as age, gender, location, and device type, based on how they interact with our website.CRM platforms can gather demographic information through forms, chatbot interactions, or landing pages. Social media platforms can also provide demographic insights.

Socio-cultural segmentation entails the cultural, societal, and familial influences on consumers. Variables like values, beliefs, social class, religion, lifestyle, cultural background, social identity, and family structure are important to consider. You can keep track of these socio-cultural values by analyzing social media posts, comments, interactions, and geo-locations. Other methods include analyzing data from surveys, statistics, purchase habits, media consumption patterns, search trends, focus groups, and customer feedback. This analysis should help you understand the socio-cultural segmentation process better.

Behavioral segmentation is a powerful tool that divides customers based on their actions, such as purchasing patterns, product usage, or brand loyalty. Variables for this segmentation can be sourced from point-of-sale records, website and app analytics, loyalty program data, CRM systems, social media interactions, and customer feedback. Tools like GA4 track user interactions, session durations, and conversion paths on websites and apps. CRM systems provide sales data and customer interactions across various channels; social media analytics reveal engagement patterns and brand interactions.

We can segment customers based on psychological characteristics, including personality, values, attitudes, interests, and lifestyles. This helps businesses understand the underlying motivations driving consumer behavior. You can gather information about customers’ opinions, beliefs, and lifestyle choices through surveys, questionnaires, and interviews. Tools like SurveyMonkey and Qualtrics can help create and distribute these surveys.

Additionally, businesses can use social media analytics to better understand their customers by examining their online activities and interactions. Platforms like Facebook Audience Insights, Twitter (now called X), and TikTok Analytics can show what customers are interested in. Customer relationship management (CRM) systems track how customers interact with the business and what they prefer.

2. K-means Clustering and RFM segmentation.

To improve segmentation efficiency, we can use machine learning to help us identify customer groups. K-means clustering is an unsupervised learning algorithm that is particularly useful when dealing with datasets that do not have predefined labels. The algorithm divides a dataset into clusters, each containing data points that are more similar than those in different clusters. The similarity is typically measured using distance metrics to minimize the variance within each cluster. A centroid represents each cluster, the average of all the data points within that cluster. The K-means algorithm adjusts these centroids to ensure each data point is assigned to the cluster with the nearest centroid, thereby refining the cluster boundaries.

A centroid represents each cluster, the average of all the data points within that cluster. The K-means algorithm adjusts these centroids to ensure each data point is assigned to the cluster with the nearest centroid, thereby refining the cluster boundaries.

2.1 Applying K-means to Customer Segmentation

In customer segmentation, K-means clustering groups customers into segments based on their purchasing behaviors. This segmentation is often based on the Recency, Frequency, and Monetary (RFM) model:

Monetary: How much money a customer spends on purchases.

Recency: How recently a customer made a purchase.

Frequency: How often a customer makes a purchase.

3. Big Query

According to Gartner’s 2023 Magic Quadrant for Cloud Database Management Systems, the top companies in the cloud data warehouse field are Google with BigQuery, Amazon with Redshift, Microsoft with Azure Synapse Analytics, and Oracle. Their 2023 data warehouse report compares different data warehouses.
For our segmentation exercise, we’ve chosen BigQuery. BigQuery is a data management solution on Google Cloud. It uses standard SQL for querying and has built-in machine learning (BigQuery ML), allowing us to create and run machine learning models using SQL.

3.1 Pull and Prepare the Data

To automate the extraction of WooCommerce data, I need to pull in the following key variables for segmentation: customer ID, name, age, city, country, order ID, order date, order amount, product ID, product name, and category. I can use tools like Supermetrics, which provides pre-built connectors to pull the data into BigQuery.Alternatively, I can use the WooCommerce REST API to connect WooCommerce with BigQuery directly, or I can set up cron jobs (automated scripts) on a server or cloud service to fetch and load the data into BigQuery on a daily or weekly basis.
Google Analytics is more straightforward to connect because BigQuery has a native connector for GA4 (Google Analytics 4) that can automatically import data.
Once the data is imported into one table, you can combine both datasets. SQL can do that for us. I need to use an SQL joins query that connects the WooCommerce table with the Google Analytics table using the customer_id field. This will combine the main fields from both sources and create one complete table.

To keep this combined table up to date with the most recent data, I can schedule the SQL query to run automatically in BigQuery. Using the scheduled queries feature in BigQuery, I can set the query to run daily or weekly to ensure the data is current and ready for analysis. Setting up this schedule is simple: go to the BigQuery console, create a new scheduled query, and specify how often you want it to run.

3.2 Recency, Frequency, Monetary Value

To analyze customer behavior, we must calculate each customer’s Recency, Frequency, and Monetary (RFM) values.

For Recency, we’ll determine how many days have passed since each customer’s last purchase. This involves identifying each customer’s most recent order date and subtracting it from the current date. Next, for Frequency, we’ll count each customer’s total number of purchases by looking at their orders. Finally, to calculate Monetary value, we’ll add the total amount each customer spent over the period. You’ll use a combined SQL query to gather the needed data to calculate Recency, Frequency, and Monetary (RFM) values in BigQuery.

3.3 Prepare the Data for Clustering

Before applying K-means clustering, ensure your data is standardized. This step is crucial because RFM values might have different scales (e.g., Monetary values might range in the thousands, while Frequency could be a single-digit number). Standardizing the data (usually by converting it to z-scores) helps the K-means algorithm perform better by treating all variables equally.

The Z-score formula standardizes data by determining how many standard deviations a data point is from the mean. It is calculated as:

\[ Z = \frac{X – \mu}{\sigma} \]

Where:

Z = Z-score (standardized value)
X = Original data point
μ = Mean of the dataset
σ = Standard deviation of the dataset

Standardization SQL Query:

3.4 Create the K-means Model in BigQuery

BigQuery ML allows you to build machine learning models directly in BigQuery using SQL. You can create a K-means clustering model on the standardized RFM data.

SQL Query to Create K-means Model:

num_clusters=4: Adjust the number of clusters based on how finely you want to segment your customers

3.5 Evaluate the K-means Model

After creating the K-means model, evaluate it to understand how well the clustering has performed. You can use the ML.EVALUATE function to assess the model.

Evaluation SQL Query:


SELECT *
FROM
  ML.EVALUATE(MODEL `your_project.your_dataset.rfm_kmeans_model`);

This will give you metrics like the Davies-Bouldin index, which measures the average similarity ratio of each cluster.

2.6 Predict and Assign Clusters to Customers

Use the ML.PREDICT function to apply the model to your data and assign each customer to a cluster.

Prediction SQL Query:


SELECT
  customer_id,
  recency,
  frequency,
  monetary,
  predicted_cluster
FROM
  ML.PREDICT(MODEL `your_project.your_dataset.rfm_kmeans_model`,
    (
    SELECT
      customer_id,
      recency_z,
      frequency_z,
      monetary_z
    FROM
      `your_project.your_dataset.standardized_rfm_table`
    )
  );

This query will add a predicted_cluster column to your data, indicating which cluster each customer belongs to.

Table of Contents