Unsupervised ML: Practical

1 Libraries

⚠️ See https://rpkgs.datanovia.com/factoextra/

1.1 Basic setup

Code
library(tidyverse)
library(knitr)
library(glue)
library(bertheme)
library(tidymodels)

knitr::opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 6,
  fig.height = 5,
  fig.align='center',
  cache = TRUE
  )

2 Functions

Code
# Tidy wrapper around FactoMineR::catdes(): describes each cluster by the
# categorical and numeric variables that characterise it, and returns a list
# of two tidy data frames (qualitative, quantitative).
# `df` can be an HCPC object or a data frame; for a data frame, `cluster`
# names the column holding the cluster labels.
tidy_catdesc <- function(df, cluster){
  options(stringsAsFactors = FALSE)
  
  if(missing(cluster)){
    cluster_col <- NULL
  } else {
    cluster_col <- enquo(cluster)
  }
  
  # HCPC objects already carry the cluster assignment in data.clust$clust
  if(class(df)[1] %in% c("HCPC")){
    df <- df$data.clust %>%
      data.frame %>%
      select(clust, everything())
  } else {
    if(is.null(cluster_col)) stop("Please provide cluster column name for analysis")
    
    df <- df %>%
      select(!!cluster_col, everything())
  }
  # catdes() expects the cluster variable to be a factor in the first column
  df[,1] <- factor(df %>% pull(1))
  df <- df %>% tibble()
  
  res_catdes <- catdes(df, 1)
  
  quali <- res_catdes['category'][[1]]
  quanti <- res_catdes['quanti'][[1]]
  
  # Stack the per-cluster tables of significant categorical levels
  if(!is.null(quali))
  {
    quali <-  quali %>%
        purrr::compact() %>%
        imap(~ {
          df <- as.data.frame(.x)
          df$cluster <- .y
          df
        }) %>% 
       reduce(rbind) %>%
       tibble::rownames_to_column("variable") %>% 
       tibble %>% 
       mutate(type_variable = "qualitative") %>% 
       janitor::clean_names()
    
  } else { quali <- data.frame(Message = "No qualitative variables were significant")}
  
  # Same for the significant quantitative variables
  if(!is.null(quanti)){
    quanti <- quanti %>%
      purrr::compact() %>%
      imap(~ {
          df <- as.data.frame(.x)
          df$cluster <- .y
          df
        }) %>% 
       reduce(rbind) %>%
       tibble::rownames_to_column("variable") %>% 
       tibble %>% 
       janitor::clean_names() %>%
      mutate(type_variable = "quantitative")
  } else { quanti <- data.frame(Message = "No quantitative variables were significant")}
  
  list(quali, quanti)
}

3 Library

Code
library(FactoMineR)
library(factoextra)

4 Clustering the tidy way

4.1 Load Data

We load a dataset of rental and sale listings from Gumtree. Each listing carries a rich free-text description field with unstructured information about the property (the main input for the NLP work), alongside structured attributes such as dwelling type, room counts, parking, size and price, which are what we use in this practical. The target variable is type, which labels each listing as either “rental” or “sales”.

Code
gumtree_texts <- read_csv("data/gumtree_clean.zip")

4.2 Basic MFA

We apply Multiple Factor Analysis (MFA) to explore joint patterns in a dataset containing both categorical and numerical variables describing property listings.

MFA is a powerful method for:

  • Combining mixed data types (e.g., room counts, prices, parking types)
  • Balancing groups of variables so no group dominates the analysis
  • Producing interpretable low-dimensional components for clustering or visualization

We first select and clean relevant columns, treating all categorical variables as strings.

Code
df_og <- gumtree_texts %>% 
  select(ad_id, dwelling_type, contains("rooms"), parking, 
         size_sqm, price) %>% 
  mutate(across(dwelling_type:parking, as.character)) %>% 
  drop_na() %>% 
  sample_n(5e3)

💡 We ensure that all categorical variables are explicitly cast as characters to be recognized by MFA.

We define 3 blocks:

  • Group 1: ad_id → supplementary ID
  • Group 2: dwelling_type, bedrooms, bathrooms, parking → qualitative
  • Group 3: size_sqm, price → quantitative

The fitted MFA object is read from a cached file; the chunk that follows shows how it was fitted and saved in the first place.

Code
mfa_res <- read_rds("data/mfa_res.rds")
Code
mfa_res <- MFA(df_og, group = c(1, 4, 2), type = c("n","n", "s"),
    ncp = 5, name.group = c("id","qual" , "quant"),
    num.group.sup = c(1), graph = FALSE)

saveRDS(mfa_res, "data/mfa_res.rds")
  • The qualitative variables are handled like in MCA, and the quantitative variables like in PCA — both projected into a shared low-dimensional space.
  • MFA weights each group equally, regardless of number of variables or variable type.
  • The result is a set of global principal components summarizing joint structure across variable types.

You can now use the MFA result to:

  • Cluster listings (mfa_res$global.pca$ind$coord)
  • Visualize listings by dimensions
  • Interpret the structure of variation across property types and price
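
For a quick peek at what these coordinates look like, a minimal sketch using the fitted mfa_res from above:

Code
# first few listings' coordinates on the first three global dimensions
head(mfa_res$global.pca$ind$coord[, 1:3])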

4.3 Eigenvalues/variances of dimensions

Code
mfa_res$global.pca$eig[1:5,]
#        eigenvalue percentage of variance cumulative percentage of variance
# comp 1  1.3127765              15.461616                          15.46162
# comp 2  0.9943915              11.711741                          27.17336
# comp 3  0.8331033               9.812122                          36.98548
# comp 4  0.4763161               5.609955                          42.59543
# comp 5  0.4546441               5.354706                          47.95014

What do eigenvalues mean?

  • Each dimension (component) in MFA is associated with an eigenvalue, which reflects how much variance it explains.
  • Higher eigenvalues = more important dimension.
  • The percentage of variance tells us how much structure is captured by each axis.
  • The cumulative percentage helps us decide how many dimensions to retain: “How many axes do we need to explain ~80% of the variation?” (see the sketch after the table below)

In this example:

  Component   Variance (%)   Cumulative (%)
  Comp 1      15.5%          15.5%
  Comp 2      11.7%          27.2%
  Comp 3      9.8%           37.0%
  Comp 4      5.6%           42.6%
  Comp 5      5.4%           47.9%
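
A quick way to read off how many global components you would need to reach a given cumulative-variance threshold (the 80% cutoff here is purely illustrative):

Code
cum_var <- mfa_res$global.pca$eig[, "cumulative percentage of variance"]
# smallest number of components whose cumulative variance reaches 80%
which(cum_var >= 80)[1]
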
Code
fviz_screeplot(mfa_res)

  • Look for the elbow point: the dimension after which added components contribute little additional variance.

  • Typically, we retain the first few dimensions that together explain a meaningful portion (e.g., ~70–80%) of the variance.

    This step is crucial before moving on to interpretation or clustering in the reduced space.

4.4 Group of variables

This plot shows each group projected onto the first two dimensions of the MFA:

  • Dim1 (15.5%) and Dim2 (11.7%) are the axes explaining the most variance.
  • Each triangle represents a group of variables used in the MFA:
    • quant: the quantitative variable group
    • qual: the qualitative variable group
    • id: an identifying or supplementary group (e.g. individual IDs or metadata)
Code
fviz_mfa_var(mfa_res, "group")

4.5 Qualitative variables

Code
mfa_res$quali.var$coord
#                                       Dim.1        Dim.2       Dim.3        Dim.4        Dim.5
# apartment                       -0.83588029 -0.008914313  0.86445122 -0.111072093 -0.093158964
# farm                             2.96810353 -0.511488057  1.34637909 -2.116365658  0.817341813
# house                            0.43088461 -0.055881962 -0.44924188  0.156258499 -0.005603694
# other                            0.05987695 -0.044024172 -0.07901804 -0.143475484  4.038681986
# townhouse_villa                  0.06178860  0.864872018 -0.49369313 -1.007681203  0.136880393
# bedrooms_1                      -1.13468761 -0.014320919  1.23902994  0.401293447 -0.181532801
# bedrooms_2                      -0.77649867  0.140085119  0.47391471 -0.302449679 -0.116602224
# bedrooms_3                       0.13491469 -0.061295020 -0.66368427 -0.354805310  0.188147185
# bedrooms_4                       0.92849230 -0.105326761 -0.28762702  0.627898295 -0.513579271
# bedrooms_5                       1.42909382  0.207611232 -0.04804985  1.104500907 -0.291413960
# bedrooms_6                       2.20073568 -0.163669449  1.02742117  0.388086674  1.005826420
# bedrooms_studio_or_bachelor_pad -0.47375299 -0.045047111  1.19494751  1.576102915  3.844282556
# bathrooms_1                     -0.90160741  0.125068738  0.68194329  0.170978778  0.022465569
# bathrooms_2                     -0.03607059 -0.057388334 -0.50285119 -0.491781824  0.128351580
# bathrooms_3                      0.72253619 -0.110183573 -0.53094583  0.483910282 -0.499290006
# bathrooms_4                      2.05585998 -0.005044342  0.66013615  0.693494535  0.203930093
# covered                         -0.82194311 -0.050049578  0.91140608 -0.256872129 -0.380843161
# garage                           0.41657378  0.005739822 -0.38302835 -0.003298937 -0.016919656
# none                            -0.89326440 -0.070178839  0.55898803  0.319338571  0.849108103
# off_street                      -0.48782816 -0.071343159  0.31717749  1.195584059 -1.354311245
# secure_parking                  -0.83390784  0.240482405  0.30569590  0.843897882  1.389923676
# street                          -0.82763107  0.039653373  0.54923925  0.299850725  0.576385314

This plot shows:

  • Each point represents a category of a qualitative variable (e.g. bedrooms_3, covered, garage)
  • Points are placed in the factor space defined by the first two dimensions (Dim1 and Dim2)
  • Categories that are closer together are more similar in terms of their contribution to the underlying MFA structure
  • Categories that are farther from the origin contribute more strongly to the structure captured by these dimensions

This visualisation helps us:

  • Explore associations between levels of qualitative variables
  • Group similar categories (e.g., homes with 4+ bedrooms)
  • Spot outliers or standout categories (e.g., studio_or_bachelor_pad)
Code
fviz_mfa_var(mfa_res, "quali.var")

  • After examining the positions of qualitative categories in the factor space, we now look at how much each category contributes to the construction of each dimension.

Each number represents the percentage contribution of a specific category to a given dimension. For example:

  • bathrooms_4 contributes 8.62% to Dim1
  • studio_or_bachelor_pad contributes 28.5% to Dim5
  • bathrooms_2 contributes 15.84% to Dim4, making it one of the key drivers of that dimension

For interpretation:

  • Higher contribution values = the category helps define that dimension
  • If several categories of the same variable contribute to the same dimension, that dimension likely captures a latent trait related to that variable (e.g. household size, parking quality)
  • Some categories only show up prominently in later dimensions (e.g. other, studio_or_bachelor_pad, secure_parking on Dim5) - these may capture more nuanced variation

Practical Use

You can use this information to:

  • Name/interpret dimensions based on dominant contributing categories
  • Select dimensions for clustering or visualisation
  • Identify influential or standout category levels in your data
Code
mfa_res$quali.var$contrib
#                                 Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
# apartment                        4.92  0.00 13.07  0.66  0.51
# farm                             0.93  0.05  0.47  3.58  0.59
# house                            2.37  0.07  6.41  2.37  0.00
# other                            0.00  0.00  0.00  0.02 17.16
# townhouse_villa                  0.00  1.26  0.59  7.46  0.15
# bedrooms_1                       2.68  0.00  7.94  2.55  0.57
# bedrooms_2                       3.50  0.20  3.24  4.03  0.66
# bedrooms_3                       0.14  0.05  8.69  7.60  2.35
# bedrooms_4                       2.94  0.07  0.70 10.20  7.49
# bedrooms_5                       1.80  0.07  0.01  8.18  0.63
# bedrooms_6                       3.26  0.03  1.76  0.77  5.68
# bedrooms_studio_or_bachelor_pad  0.05  0.00  0.82  4.36 28.50
# bathrooms_1                      5.69  0.19  8.09  1.55  0.03
# bathrooms_2                      0.01  0.05  5.41 15.84  1.18
# bathrooms_3                      1.76  0.07  2.36  5.98  6.99
# bathrooms_4                      8.62  0.00  2.21  7.45  0.71
# covered                          3.17  0.02  9.67  2.35  5.67
# garage                           2.43  0.00  5.09  0.00  0.03
# none                             0.64  0.01  0.62  0.62  4.80
# off_street                       0.03  0.00  0.03  1.14  1.61
# secure_parking                   0.56  0.08  0.19  4.32 12.87
# street                           0.45  0.00  0.49  0.45  1.81
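
Rather than scanning the raw matrix, factoextra can draw these contributions as a bar chart; a minimal sketch for the first dimension (top = 10 is an arbitrary choice):

Code
# categories contributing most to Dim 1
fviz_contrib(mfa_res, choice = "quali.var", axes = 1, top = 10)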

The cos² value (squared cosine) tells us how well each qualitative category is represented by each dimension. Each value ranges from 0 to 1:

  • Closer to 1 → the category is well represented on that dimension
  • Closer to 0 → the dimension does not explain that category’s position well

Why this matters

  • cos² helps us filter for the most interpretable categories when labelling dimensions

  • It’s also helpful for plotting: e.g. show only points with high cos² to avoid clutter (see the sketch after the cos² table below)

    • High cos² = “this dimension is a good summary of this category’s behavior”
    • Low cos² = “this category doesn’t fit cleanly into any of the top dimensions”
Code
mfa_res$quali.var$cos2 %>% round(2)
#                                 Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
# apartment                        0.45  0.00  0.48  0.01  0.01
# farm                             0.11  0.00  0.02  0.06  0.01
# house                            0.40  0.01  0.44  0.05  0.00
# other                            0.00  0.00  0.00  0.00  0.26
# townhouse_villa                  0.00  0.09  0.03  0.13  0.00
# bedrooms_1                       0.26  0.00  0.31  0.03  0.01
# bedrooms_2                       0.40  0.01  0.15  0.06  0.01
# bedrooms_3                       0.02  0.00  0.52  0.15  0.04
# bedrooms_4                       0.33  0.00  0.03  0.15  0.10
# bedrooms_5                       0.21  0.00  0.00  0.12  0.01
# bedrooms_6                       0.33  0.00  0.07  0.01  0.07
# bedrooms_studio_or_bachelor_pad  0.01  0.00  0.04  0.07  0.42
# bathrooms_1                      0.55  0.01  0.32  0.02  0.00
# bathrooms_2                      0.00  0.00  0.38  0.36  0.02
# bathrooms_3                      0.21  0.00  0.12  0.10  0.10
# bathrooms_4                      0.65  0.00  0.07  0.07  0.01
# covered                          0.32  0.00  0.40  0.03  0.07
# garage                           0.49  0.00  0.41  0.00  0.00
# none                             0.08  0.00  0.03  0.01  0.07
# off_street                       0.00  0.00  0.00  0.02  0.03
# secure_parking                   0.07  0.01  0.01  0.07  0.19
# street                           0.06  0.00  0.03  0.01  0.03
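
As suggested above, cos² can also be used to declutter the category map; a sketch that colours categories by cos² and keeps only the reasonably well-represented ones (the 0.3 cutoff is illustrative):

Code
fviz_mfa_var(mfa_res, "quali.var", col.var = "cos2",
             select.var = list(cos2 = 0.3), repel = TRUE)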

4.6 Quantitative

The following plot shows how the quantitative variables are projected into the global MFA space:

Code
fviz_mfa_var(mfa_res, "quanti.var", palette = "jco",
  col.var.sup = "violet", repel = TRUE)

Each arrow represents a quantitative variable — its direction and length reveal:

  • Direction: the dimension it is most aligned with
  • Length: the strength of its contribution (longer = more important)

Interpretation

  • price is strongly aligned with Dim1
  • size_sqm is strongly aligned with Dim2
  • The two variables are nearly orthogonal (90°), meaning they are uncorrelated in this reduced space
  • This means the first two dimensions capture two distinct drivers in the data: Dim1: variation in price, Dim2: variation in size

This helps you interpret the meaning of each dimension. You can also use this to assess which variables are dominant in driving the structure - helpful for:

  • Dimension labelling
  • Clustering interpretation
  • Communicating insights back to stakeholders
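
For dimension labelling, FactoMineR’s dimdesc() lists the variables and categories most strongly associated with each axis; a quick sketch, assuming it is applied directly to the fitted MFA result above:

Code
# variables / categories most associated with the first two global dimensions
dimdesc(mfa_res, axes = 1:2)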

4.7 Partial axes

The partial axes plot shows how the principal components computed within each variable group (the “partial axes”) relate to the global dimensions of the MFA.

Code
fviz_mfa_axes(mfa_res)

What does this show?

  • Each arrow is a principal axis of one variable group (e.g. the first component of the quantitative block) projected onto the global MFA space
  • Arrows closely aligned with a global dimension indicate which group drives that dimension; groups whose axes point in similar directions capture similar structure

A closely related view is the partial individuals plot (fviz_mfa_ind() with the partial argument), in which each observation is projected once per variable group and linked to its global consensus position (see the sketch below).

Why are these partial views useful?

  • They help diagnose which group pulls an observation in which direction
  • They reveal whether groups agree or conflict in their view of the data structure
  • They are especially useful in mixed-data contexts: e.g., do qualitative and quantitative blocks reinforce or contradict each other?
    • Short lines = groups agree on the observation’s position
    • Long lines = disagreement across groups → heterogeneity or measurement conflict
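
A minimal sketch of the partial-individuals view mentioned above, limited to a handful of listings so the plot stays readable (the subset of row names is purely illustrative):

Code
ids <- rownames(mfa_res$global.pca$ind$coord)[1:10]  # arbitrary subset of listings
fviz_mfa_ind(mfa_res, geom = "point", partial = ids)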

5 Clustering on MFA Coordinates

Now that we’ve reduced the data into lower-dimensional space using MFA, we can apply clustering algorithms to uncover potential group structure.

We’ll use:

  • K-Means: partition-based, requires the number of clusters
  • GMM (Gaussian Mixture Models): probabilistic clustering
  • HDBSCAN: density-based clustering that can handle noise and variable cluster shapes

5.1 Prepare the data

Code
mfa_coords <- mfa_res$global.pca$ind$coord[,1:3] %>% 
  as_tibble()

This extracts the individuals’ coordinates on the first three global dimensions - a reduced representation well suited to clustering.

5.2 K-Means Clustering

How many clusters should we use? Let’s check with a few diagnostics.

Code
library(NbClust)

Before applying K-Means, we need to choose a good value for k, the number of clusters. We use three complementary methods:

  • Elbow Method (WSS – Within Sum of Squares)
Code
fviz_nbclust(mfa_coords, kmeans, method = "wss") +
  geom_vline(xintercept = 2, linetype = 2) +
  labs(subtitle = "Elbow method")

This method looks for a “bend” in the WSS plot - the elbow point - where adding more clusters yields diminishing returns in variance explained.

  • Silhouette Method
Code
fviz_nbclust(mfa_coords, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

This method chooses the k that gives the highest average silhouette width, measuring how well-separated each point is from points in other clusters.

  • Gap Statistic
Code
set.seed(123)
# fviz_nbclust(mfa_coords, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
#   labs(subtitle = "Gap statistic method")

Compares within-cluster dispersion to that expected under a null reference distribution - the optimal k is where the gap is largest.

5.2.1 Final K-Means Clustering with k = 3

After reviewing the diagnostics above, we select k = 3 and run K-Means:

Code
set.seed(123)
km_res <- kmeans(mfa_coords, centers = 3, nstart = 25)

clusters <- tibble(
  cluster_kmean = factor(km_res$cluster)
)

You can now plot the clusters or compare results with GMM and HDBSCAN.
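
A quick sanity check on how the listings split across the three clusters:

Code
# number of listings assigned to each K-means cluster
km_res$size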

5.3 Gaussian Mixture Model (GMM)

Code
library(mclust)

gmm_res <- Mclust(mfa_coords)

clusters$cluster_gmm <- as.factor(gmm_res$classification)
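
Mclust selects the number of mixture components (and the covariance structure) by BIC; a quick way to inspect what it settled on:

Code
summary(gmm_res)             # chosen model and number of components
plot(gmm_res, what = "BIC")  # BIC across candidate models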

5.4 HDBSCAN – Density-Based Clustering

Code
library(dbscan)

hdb_res <- hdbscan(mfa_coords, minPts = 50, 
                   cluster_selection_epsilon = 1)

clusters$cluster_hdb <- as.factor(hdb_res$cluster)

HDBSCAN is useful for detecting clusters of varying density and shapes - and automatically labels noise as cluster 0.
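
A quick look at the resulting cluster sizes, including how many points were flagged as noise:

Code
# cluster sizes; label 0 is the noise cluster
table(hdb_res$cluster)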

5.5 Visualize Clustering Results (e.g. K-means)

Code
library(ggplot2)
library(patchwork)

df <- mfa_coords %>% bind_cols(clusters)

p1 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_kmean)) +
  geom_point(alpha = 0.7) +
  labs(title = "K-Means Clustering on MFA Coordinates") +
  theme_minimal() + 
  ylim(0, 4)

p2 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_gmm)) +
  geom_point(alpha = 0.7) +
  labs(title = "GMM Clustering on MFA Coordinates") +
  theme_minimal() +
  ylim(0, 4)

p3 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_hdb)) +
  geom_point(alpha = 0.7) +
  labs(title = "HDB Clustering on MFA Coordinates") +
  theme_minimal() +
  ylim(0, 4)
Code
p1

Code
p2

Code
p3
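
Since patchwork is already loaded, the three views can also be stacked into a single figure:

Code
p1 / p2 / p3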

5.6 Profiling

Code
all_results <- df_og %>% select(-ad_id) %>% 
  bind_cols(clusters)
Code
all_tests <- all_results %>% 
  select(-c(cluster_kmean, cluster_gmm)) %>% 
  tidy_catdesc(., cluster = "cluster_hdb")
Code
all_tests[[1]] %>% split(f = .$cluster)

To understand what defines each cluster, we use catdes() to identify over- and underrepresented categories in each group. Below is the result for Cluster 0.

This output tells us:

  • Which categorical levels are most characteristic of Cluster 0
  • How these levels compare to the global distribution
  • Which associations are statistically significant
  Variable                 cla_mod   mod_cla   Global %   p-value    v-test   Interpretation
  bathrooms=4              45.4%     49.5%     12.5%      < 1e-120   +23.8    Strongly overrepresented
  bedrooms=6               55.3%     22.0%     4.56%      < 1e-60    +16.7    Strongly overrepresented
  dwelling_type=farm       94.1%     2.79%     0.34%      < 1e-14    +7.72    Nearly all farm listings fall in this cluster
  parking=secure_parking   22.3%     6.79%     3.5%       < 0.01     +4.13    Moderately overrepresented
  bedrooms=2               7.3%      17.4%     27.3%      < 1e-8     -5.84    Underrepresented
  bathrooms=2              5.5%      19.9%     41.4%      < 1e-30    -11.6    Strongly underrepresented

Column Descriptions

  Column    Meaning
  cla_mod   % of all individuals with this level that fall into this cluster
  mod_cla   % of individuals in this cluster that have this level
  global    % of all individuals in the dataset with this level
  p_value   Significance of the association between the level and the cluster
  v_test    Standardized test statistic; positive = overrepresentation, negative = underrepresentation

Cluster 0 is characterized by:

  • Large, expensive farm properties
  • Many bathrooms (3–4) and bedrooms (5–6)
  • Rare in the general dataset but dominant in this group
  • Underrepresents small apartments and low-bedroom listings

This profiling helps us assign meaning to the cluster and supports downstream analysis (e.g., marketing segmentation or price modeling).

Code
# $`0`
# # A tibble: 17 × 8
#    variable                        cla_mod mod_cla global   p_value v_test cluster type_variable
#    <chr>                             <dbl>   <dbl>  <dbl>     <dbl>  <dbl> <chr>   <chr>        
#  1 bathrooms=4                       45.4    49.5   12.5  1.60e-125  23.8  0       qualitative  
#  2 bedrooms=6                        55.3    22.0    4.56 2.29e- 62  16.7  0       qualitative  
#  3 dwelling_type=farm                94.1     2.79   0.34 1.16e- 14   7.72 0       qualitative  
#  4 bedrooms=5                        23.9    12.4    5.94 4.97e- 10   6.22 0       qualitative  
#  5 dwelling_type=other               60       2.61   0.5  7.82e-  9   5.77 0       qualitative  
#  6 parking=secure_parking            22.3     6.79   3.5  3.55e-  5   4.13 0       qualitative  
#  7 bedrooms=4                        15.3    21.4   16.1  3.55e-  4   3.57 0       qualitative  
#  8 parking=off_street                31.6     2.09   0.76 9.40e-  4   3.31 0       qualitative  
#  9 bathrooms=3                       14.6    19.3   15.2  4.71e-  3   2.83 0       qualitative  
# 10 bedrooms=studio_or_bachelor_pad   25.5     2.26   1.02 5.34e-  3   2.79 0       qualitative  
# 11 parking=covered                    9.40   18.1   22.1  1.26e-  2  -2.49 0       qualitative  
# 12 bedrooms=1                         7.91    6.27   9.1  9.43e-  3  -2.60 0       qualitative  
# 13 dwelling_type=house               10.5    56.3   61.4  7.45e-  3  -2.68 0       qualitative  
# 14 bedrooms=2                         7.34   17.4   27.3  5.14e-  9  -5.84 0       qualitative  
# 15 bedrooms=3                         5.83   18.3   36.0  7.53e- 23  -9.84 0       qualitative  
# 16 bathrooms=2                        5.51   19.9   41.4  4.79e- 31 -11.6  0       qualitative  
# 17 bathrooms=1                        4.21   11.3   30.9  2.52e- 31 -11.6  0       qualitative  
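
Instead of reading the full per-cluster tables, the tidy output can also be filtered programmatically; a small sketch that pulls out the over-represented categories (the v-test cutoff of 2 is a rough ~5% threshold and purely illustrative):

Code
all_tests[[1]] %>%
  filter(v_test > 2) %>%          # keep over-represented levels only
  arrange(cluster, desc(v_test))
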
Code
all_tests[[2]] %>% split(f = .$cluster)
Code
# $`0`
# # A tibble: 2 × 9
#   variable v_test mean_in_category overall_mean sd_in_category overall_sd   p_value cluster type_variable
#   <chr>     <dbl>            <dbl>        <dbl>          <dbl>      <dbl>     <dbl> <chr>   <chr>        
# 1 price     34.9         13129699.     3864220.      16502989.   6764558. 1.75e-266 0       quantitative 
# 2 size_sqm   9.39            4627.         820.         30170.     10319. 5.81e- 21 0       quantitative 
# 
# $`1`
# # A tibble: 2 × 9
#   variable  v_test mean_in_category overall_mean sd_in_category overall_sd  p_value cluster type_variable
#   <chr>      <dbl>            <dbl>        <dbl>          <dbl>      <dbl>    <dbl> <chr>   <chr>        
# 1 size_sqm1  -4.16             161.         820.           225.     10319. 3.20e- 5 1       quantitative 
# 2 price1    -20.8          1698308.     3864220.       1180197.   6764558. 2.07e-96 1       quantitative

Quantitative Variables

In addition to categorical levels, catdes() also evaluates how numeric variables (e.g., price, size_sqm) vary across clusters. This is done using ANOVA-style comparisons.

Each row below describes the relationship between a numeric variable and a given cluster.

Cluster 0 (High-end listings)

  Variable   Cluster Mean    Global Mean     v-test   p-value    Interpretation
  price      R13.1 million   R3.86 million   +34.9    < 1e-250   Listings in Cluster 0 are vastly more expensive
  size_sqm   4,627 m²        820 m²          +9.39    < 1e-20    These listings are much larger
  • The positive v-test indicates that values are significantly higher in this cluster than average.
  • Both variables are strongly overrepresented in Cluster 0.

Cluster 1 (Low-cost listings)

  Variable   Cluster Mean   Global Mean     v-test   p-value   Interpretation
  price      R1.7 million   R3.86 million   −20.8    < 1e-95   Prices in this cluster are significantly lower
  size_sqm   161 m²         820 m²          −4.16    < 0.001   Smaller listings dominate this group

(In the raw output these rows appear as price1 and size_sqm1; the trailing 1 is just a row-name suffix added when the per-cluster tables are stacked, not part of the variable name.)
  • The negative v-test shows that Cluster 1 listings are significantly smaller and cheaper than the dataset average.

  Column             Meaning
  mean_in_category   Mean of the variable within this cluster
  overall_mean       Global mean of the variable (all clusters)
  v_test             Standardized test statistic (like a t-value)
  p_value            Significance of the difference between the cluster mean and the overall mean
  sd_in_category     Std. deviation within this cluster
  overall_sd         Global std. deviation of the variable

5.6.1 Conclusion

Quantitative profiling helps us numerically distinguish clusters by measuring:

  • Which clusters represent high-value or low-value properties
  • Whether size, price, or other features are defining dimensions
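
These summaries can be cross-checked directly against the raw data; a minimal sketch using the all_results table built above:

Code
all_results %>%
  group_by(cluster_hdb) %>%
  summarise(n = n(),
            mean_price = mean(price),
            mean_size_sqm = mean(size_sqm))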

Together with categorical profiles, this gives us a full statistical summary for interpreting the meaning of each cluster.

You can now merge this with the qualitative results to build complete cluster narratives.