Unsupervised ML: Practical

1 Libraries

⚠️ See https://rpkgs.datanovia.com/factoextra/

1.1 Basic setup

Code
library(tidyverse)
library(knitr)
library(glue)
library(bertheme)
library(tidymodels)

knitr::opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 6,
  fig.height = 5,
  fig.align='center',
  cache = TRUE
  )

2 Functions

Code
# Tidy wrapper around FactoMineR::catdes(): describes each cluster by the
# categorical and numeric variables that characterise it, and returns a list
# of two tidy data frames (qualitative, quantitative).
# `df` can be an HCPC object or a data frame; for a data frame, `cluster`
# names the column holding the cluster labels.
tidy_catdesc <- function(df, cluster){
  options(stringsAsFactors = FALSE)
  
  if(missing(cluster)){
    cluster_col <- NULL
  } else {
    cluster_col <- enquo(cluster)
  }
  
  # HCPC objects already carry the cluster assignment in data.clust$clust
  if(class(df)[1] %in% c("HCPC")){
    df <- df$data.clust %>%
      data.frame %>%
      select(clust, everything())
  } else {
    if(is.null(cluster_col)) stop("Please provide cluster column name for analysis")
    
    df <- df %>%
      select(!!cluster_col, everything())
  }
  # catdes() expects the cluster variable to be a factor in the first column
  df[,1] <- factor(df %>% pull(1))
  df <- df %>% tibble()
  
  res_catdes <- catdes(df, 1)
  
  quali <- res_catdes['category'][[1]]
  quanti <- res_catdes['quanti'][[1]]
  
  # Stack the per-cluster tables of significant categorical levels
  if(!is.null(quali))
  {
    quali <-  quali %>%
        purrr::compact() %>%
        imap(~ {
          df <- as.data.frame(.x)
          df$cluster <- .y
          df
        }) %>% 
       reduce(rbind) %>%
       tibble::rownames_to_column("variable") %>% 
       tibble %>% 
       mutate(type_variable = "qualitative") %>% 
       janitor::clean_names()
    
  } else { quali <- data.frame(Message = "No qualitative variables were significant")}
  
  # Same for the significant quantitative variables
  if(!is.null(quanti)){
    quanti <- quanti %>%
      purrr::compact() %>%
      imap(~ {
          df <- as.data.frame(.x)
          df$cluster <- .y
          df
        }) %>% 
       reduce(rbind) %>%
       tibble::rownames_to_column("variable") %>% 
       tibble %>% 
       janitor::clean_names() %>%
      mutate(type_variable = "quantitative")
  } else { quanti <- data.frame(Message = "No quantitative variables were significant")}
  
  list(quali, quanti)
}

3 Library

Code
library(FactoMineR)
library(factoextra)

4 Clustering the tidy way

4.1 Load Data

We load a dataset of rental and sale listings from Gumtree. Each listing carries a rich free-text description field with unstructured information about the property (the main input for the NLP work), alongside structured attributes such as dwelling type, room counts, parking, size and price, which are what we use in this practical. The target variable is type, which labels each listing as either “rental” or “sales”.

Code
gumtree_texts <- read_csv("data/gumtree_clean.zip")

4.2 Basic MFA

We apply Multiple Factor Analysis (MFA) to explore joint patterns in a dataset containing both categorical and numerical variables describing property listings.

MFA is a powerful method for:

  • Combining mixed data types (e.g., room counts, prices, parking types)
  • Balancing groups of variables so no group dominates the analysis
  • Producing interpretable low-dimensional components for clustering or visualization

We first select and clean relevant columns, treating all categorical variables as strings.

Code
df_og <- gumtree_texts %>% 
  select(ad_id, dwelling_type, contains("rooms"), parking, 
         size_sqm, price) %>% 
  mutate(across(dwelling_type:parking, as.character)) %>% 
  drop_na() %>% 
  sample_n(5e3)

💡 We ensure that all categorical variables are explicitly cast as characters to be recognized by MFA.

We define 3 blocks:

  • Group 1: ad_id → supplementary ID
  • Group 2: dwelling_type, bedrooms, bathrooms, parking → qualitative
  • Group 3: size_sqm, price → quantitative

The fitted MFA object is read from a cached file; the chunk that follows shows how it was fitted and saved in the first place.

Code
mfa_res <- read_rds("data/mfa_res.rds")
Code
mfa_res <- MFA(df_og, group = c(1, 4, 2), type = c("n","n", "s"),
    ncp = 5, name.group = c("id","qual" , "quant"),
    num.group.sup = c(1), graph = FALSE)

saveRDS(mfa_res, "data/mfa_res.rds")
  • The qualitative variables are handled like in MCA, and the quantitative variables like in PCA — both projected into a shared low-dimensional space.
  • MFA weights each group equally, regardless of number of variables or variable type.
  • The result is a set of global principal components summarizing joint structure across variable types.

You can now use the MFA result to:

  • Cluster listings (mfa_res$global.pca$ind$coord)
  • Visualize listings by dimensions
  • Interpret the structure of variation across property types and price
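
For a quick peek at what these coordinates look like, a minimal sketch using the fitted mfa_res from above:

Code
# first few listings' coordinates on the first three global dimensions
head(mfa_res$global.pca$ind$coord[, 1:3])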

4.3 Eigenvalues/variances of dimensions

Code
mfa_res$global.pca$eig[1:5,]
#        eigenvalue percentage of variance cumulative percentage of variance
# comp 1  1.3127765              15.461616                          15.46162
# comp 2  0.9943915              11.711741                          27.17336
# comp 3  0.8331033               9.812122                          36.98548
# comp 4  0.4763161               5.609955                          42.59543
# comp 5  0.4546441               5.354706                          47.95014

What do eigenvalues mean?

  • Each dimension (component) in MFA is associated with an eigenvalue, which reflects how much variance it explains.
  • Higher eigenvalues = more important dimension.
  • The percentage of variance tells us how much structure is captured by each axis.
  • The cumulative percentage helps us decide how many dimensions to retain: “How many axes do we need to explain ~80% of the variation?” (see the sketch after the table below)

In this example:

  Component   Variance (%)   Cumulative (%)
  Comp 1      15.5%          15.5%
  Comp 2      11.7%          27.2%
  Comp 3      9.8%           37.0%
  Comp 4      5.6%           42.6%
  Comp 5      5.4%           47.9%
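
A quick way to read off how many global components you would need to reach a given cumulative-variance threshold (the 80% cutoff here is purely illustrative):

Code
cum_var <- mfa_res$global.pca$eig[, "cumulative percentage of variance"]
# smallest number of components whose cumulative variance reaches 80%
which(cum_var >= 80)[1]
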
Code
fviz_screeplot(mfa_res)

  • Look for the elbow point: the dimension after which added components contribute little additional variance.

  • Typically, we retain the first few dimensions that together explain a meaningful portion (e.g., ~70–80%) of the variance.

    This step is crucial before moving on to interpretation or clustering in the reduced space.

4.4 Group of variables

This plot shows each group projected onto the first two dimensions of the MFA:

  • Dim1 (15.5%) and Dim2 (11.7%) are the axes explaining the most variance.
  • Each triangle represents a group of variables used in the MFA:
    • quant: the quantitative variable group
    • qual: the qualitative variable group
    • id: an identifying or supplementary group (e.g. individual IDs or metadata)
Code
fviz_mfa_var(mfa_res, "group")

4.5 Qualitative variables

Code
mfa_res$quali.var$coord
#                                       Dim.1        Dim.2       Dim.3        Dim.4        Dim.5
# apartment                       -0.83588029 -0.008914313  0.86445122 -0.111072093 -0.093158964
# farm                             2.96810353 -0.511488057  1.34637909 -2.116365658  0.817341813
# house                            0.43088461 -0.055881962 -0.44924188  0.156258499 -0.005603694
# other                            0.05987695 -0.044024172 -0.07901804 -0.143475484  4.038681986
# townhouse_villa                  0.06178860  0.864872018 -0.49369313 -1.007681203  0.136880393
# bedrooms_1                      -1.13468761 -0.014320919  1.23902994  0.401293447 -0.181532801
# bedrooms_2                      -0.77649867  0.140085119  0.47391471 -0.302449679 -0.116602224
# bedrooms_3                       0.13491469 -0.061295020 -0.66368427 -0.354805310  0.188147185
# bedrooms_4                       0.92849230 -0.105326761 -0.28762702  0.627898295 -0.513579271
# bedrooms_5                       1.42909382  0.207611232 -0.04804985  1.104500907 -0.291413960
# bedrooms_6                       2.20073568 -0.163669449  1.02742117  0.388086674  1.005826420
# bedrooms_studio_or_bachelor_pad -0.47375299 -0.045047111  1.19494751  1.576102915  3.844282556
# bathrooms_1                     -0.90160741  0.125068738  0.68194329  0.170978778  0.022465569
# bathrooms_2                     -0.03607059 -0.057388334 -0.50285119 -0.491781824  0.128351580
# bathrooms_3                      0.72253619 -0.110183573 -0.53094583  0.483910282 -0.499290006
# bathrooms_4                      2.05585998 -0.005044342  0.66013615  0.693494535  0.203930093
# covered                         -0.82194311 -0.050049578  0.91140608 -0.256872129 -0.380843161
# garage                           0.41657378  0.005739822 -0.38302835 -0.003298937 -0.016919656
# none                            -0.89326440 -0.070178839  0.55898803  0.319338571  0.849108103
# off_street                      -0.48782816 -0.071343159  0.31717749  1.195584059 -1.354311245
# secure_parking                  -0.83390784  0.240482405  0.30569590  0.843897882  1.389923676
# street                          -0.82763107  0.039653373  0.54923925  0.299850725  0.576385314

This plot shows:

  • Each point represents a category of a qualitative variable (e.g. bedrooms_3, covered, garage)
  • Points are placed in the factor space defined by the first two dimensions (Dim1 and Dim2)
  • Categories that are closer together are more similar in terms of their contribution to the underlying MFA structure
  • Categories that are farther from the origin contribute more strongly to the structure captured by these dimensions

This visualisation helps us:

  • Explore associations between levels of qualitative variables
  • Group similar categories (e.g., homes with 4+ bedrooms)
  • Spot outliers or standout categories (e.g., studio_or_bachelor_pad)
Code
fviz_mfa_var(mfa_res, "quali.var")

  • After examining the positions of qualitative categories in the factor space, we now look at how much each category contributes to the construction of each dimension.

Each number represents the percentage contribution of a specific category to a given dimension. For example:

  • bathrooms_4 contributes 8.62% to Dim1
  • studio_or_bachelor_pad contributes 28.5% to Dim5
  • bathrooms_2 contributes 15.84% to Dim4, making it one of the key drivers of that dimension

For interpretation:

  • Higher contribution values = the category helps define that dimension
  • If several categories of the same variable contribute to the same dimension, that dimension likely captures a latent trait related to that variable (e.g. household size, parking quality)
  • Some categories only show up prominently in later dimensions (e.g. other, studio_or_bachelor_pad, secure_parking on Dim5) - these may capture more nuanced variation

Practical Use

You can use this information to:

  • Name/interpret dimensions based on dominant contributing categories
  • Select dimensions for clustering or visualisation
  • Identify influential or standout category levels in your data
Code
mfa_res$quali.var$contrib
#                                 Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
# apartment                        4.92  0.00 13.07  0.66  0.51
# farm                             0.93  0.05  0.47  3.58  0.59
# house                            2.37  0.07  6.41  2.37  0.00
# other                            0.00  0.00  0.00  0.02 17.16
# townhouse_villa                  0.00  1.26  0.59  7.46  0.15
# bedrooms_1                       2.68  0.00  7.94  2.55  0.57
# bedrooms_2                       3.50  0.20  3.24  4.03  0.66
# bedrooms_3                       0.14  0.05  8.69  7.60  2.35
# bedrooms_4                       2.94  0.07  0.70 10.20  7.49
# bedrooms_5                       1.80  0.07  0.01  8.18  0.63
# bedrooms_6                       3.26  0.03  1.76  0.77  5.68
# bedrooms_studio_or_bachelor_pad  0.05  0.00  0.82  4.36 28.50
# bathrooms_1                      5.69  0.19  8.09  1.55  0.03
# bathrooms_2                      0.01  0.05  5.41 15.84  1.18
# bathrooms_3                      1.76  0.07  2.36  5.98  6.99
# bathrooms_4                      8.62  0.00  2.21  7.45  0.71
# covered                          3.17  0.02  9.67  2.35  5.67
# garage                           2.43  0.00  5.09  0.00  0.03
# none                             0.64  0.01  0.62  0.62  4.80
# off_street                       0.03  0.00  0.03  1.14  1.61
# secure_parking                   0.56  0.08  0.19  4.32 12.87
# street                           0.45  0.00  0.49  0.45  1.81
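
Rather than scanning the raw matrix, factoextra can draw these contributions as a bar chart; a minimal sketch for the first dimension (top = 10 is an arbitrary choice):

Code
# categories contributing most to Dim 1
fviz_contrib(mfa_res, choice = "quali.var", axes = 1, top = 10)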

The cos² value (squared cosine) tells us how well each qualitative category is represented by each dimension. Each value ranges from 0 to 1:

  • Closer to 1 → the category is well represented on that dimension
  • Closer to 0 → the dimension does not explain that category’s position well

Why this matters

  • cos² helps us filter for the most interpretable categories when labelling dimensions

  • It’s also helpful for plotting: e.g. show only points with high cos² to avoid clutter (see the sketch after the cos² table below)

    • High cos² = “this dimension is a good summary of this category’s behavior”
    • Low cos² = “this category doesn’t fit cleanly into any of the top dimensions”
Code
mfa_res$quali.var$cos2 %>% round(2)
#                                 Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
# apartment                        0.45  0.00  0.48  0.01  0.01
# farm                             0.11  0.00  0.02  0.06  0.01
# house                            0.40  0.01  0.44  0.05  0.00
# other                            0.00  0.00  0.00  0.00  0.26
# townhouse_villa                  0.00  0.09  0.03  0.13  0.00
# bedrooms_1                       0.26  0.00  0.31  0.03  0.01
# bedrooms_2                       0.40  0.01  0.15  0.06  0.01
# bedrooms_3                       0.02  0.00  0.52  0.15  0.04
# bedrooms_4                       0.33  0.00  0.03  0.15  0.10
# bedrooms_5                       0.21  0.00  0.00  0.12  0.01
# bedrooms_6                       0.33  0.00  0.07  0.01  0.07
# bedrooms_studio_or_bachelor_pad  0.01  0.00  0.04  0.07  0.42
# bathrooms_1                      0.55  0.01  0.32  0.02  0.00
# bathrooms_2                      0.00  0.00  0.38  0.36  0.02
# bathrooms_3                      0.21  0.00  0.12  0.10  0.10
# bathrooms_4                      0.65  0.00  0.07  0.07  0.01
# covered                          0.32  0.00  0.40  0.03  0.07
# garage                           0.49  0.00  0.41  0.00  0.00
# none                             0.08  0.00  0.03  0.01  0.07
# off_street                       0.00  0.00  0.00  0.02  0.03
# secure_parking                   0.07  0.01  0.01  0.07  0.19
# street                           0.06  0.00  0.03  0.01  0.03
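
As suggested above, cos² can also be used to declutter the category map; a sketch that colours categories by cos² and keeps only the reasonably well-represented ones (the 0.3 cutoff is illustrative):

Code
fviz_mfa_var(mfa_res, "quali.var", col.var = "cos2",
             select.var = list(cos2 = 0.3), repel = TRUE)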

4.6 Quantitative

The following plot shows how the quantitative variables are projected into the global MFA space:

Code
fviz_mfa_var(mfa_res, "quanti.var", palette = "jco",
  col.var.sup = "violet", repel = TRUE)

Each arrow represents a quantitative variable — its direction and length reveal:

  • Direction: the dimension it is most aligned with
  • Length: the strength of its contribution (longer = more important)

Interpretation

  • price is strongly aligned with Dim1
  • size_sqm is strongly aligned with Dim2
  • The two variables are nearly orthogonal (90°), meaning they are uncorrelated in this reduced space
  • This means the first two dimensions capture two distinct drivers in the data: Dim1: variation in price, Dim2: variation in size

This helps you interpret the meaning of each dimension. You can also use this to assess which variables are dominant in driving the structure - helpful for:

  • Dimension labelling
  • Clustering interpretation
  • Communicating insights back to stakeholders
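
For dimension labelling, FactoMineR’s dimdesc() lists the variables and categories most strongly associated with each axis; a quick sketch, assuming it is applied directly to the fitted MFA result above:

Code
# variables / categories most associated with the first two global dimensions
dimdesc(mfa_res, axes = 1:2)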

4.7 Partial axes

The partial axes plot shows how the principal components computed within each variable group (the “partial axes”) relate to the global dimensions of the MFA.

Code
fviz_mfa_axes(mfa_res)

What does this show?

  • Each arrow is a principal axis of one variable group (e.g. the first component of the quantitative block) projected onto the global MFA space
  • Arrows closely aligned with a global dimension indicate which group drives that dimension; groups whose axes point in similar directions capture similar structure

A closely related view is the partial individuals plot (fviz_mfa_ind() with the partial argument), in which each observation is projected once per variable group and linked to its global consensus position (see the sketch below).

Why are these partial views useful?

  • They help diagnose which group pulls an observation in which direction
  • They reveal whether groups agree or conflict in their view of the data structure
  • They are especially useful in mixed-data contexts: e.g., do qualitative and quantitative blocks reinforce or contradict each other?
    • Short lines = groups agree on the observation’s position
    • Long lines = disagreement across groups → heterogeneity or measurement conflict
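
A minimal sketch of the partial-individuals view mentioned above, limited to a handful of listings so the plot stays readable (the subset of row names is purely illustrative):

Code
ids <- rownames(mfa_res$global.pca$ind$coord)[1:10]  # arbitrary subset of listings
fviz_mfa_ind(mfa_res, geom = "point", partial = ids)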

5 Clustering on MFA Coordinates

Now that we’ve reduced the data into lower-dimensional space using MFA, we can apply clustering algorithms to uncover potential group structure.

We’ll use:

  • K-Means: partition-based, requires the number of clusters
  • GMM (Gaussian Mixture Models): probabilistic clustering
  • HDBSCAN: density-based clustering that can handle noise and variable cluster shapes

5.1 Prepare the data

Code
mfa_coords <- mfa_res$global.pca$ind$coord[,1:3] %>% 
  as_tibble()

This extracts the individuals’ coordinates on the first three global dimensions - a reduced representation well suited to clustering.

5.2 K-Means Clustering

How many clusters should we use? Let’s check with a few diagnostics.

Code
library(NbClust)

Before applying K-Means, we need to choose a good value for k, the number of clusters. We use three complementary methods:

  • Elbow Method (WSS – Within Sum of Squares)
Code
fviz_nbclust(mfa_coords, kmeans, method = "wss") +
  geom_vline(xintercept = 2, linetype = 2) +
  labs(subtitle = "Elbow method")

This method looks for a “bend” in the WSS plot - the elbow point - where adding more clusters yields diminishing returns in variance explained.

  • Silhouette Method
Code
fviz_nbclust(mfa_coords, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

This method chooses the k that gives the highest average silhouette width, measuring how well-separated each point is from points in other clusters.

  • Gap Statistic
Code
set.seed(123)
# fviz_nbclust(mfa_coords, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
#   labs(subtitle = "Gap statistic method")

Compares within-cluster dispersion to that expected under a null reference distribution - the optimal k is where the gap is largest.

5.2.1 Final K-Means Clustering with k = 3

After reviewing the diagnostics above, we select k = 3 and run K-Means:

Code
set.seed(123)
km_res <- kmeans(mfa_coords, centers = 3, nstart = 25)

clusters <- tibble(
  cluster_kmean = factor(km_res$cluster)
)

You can now plot the clusters or compare results with GMM and HDBSCAN.
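
A quick sanity check on how the listings split across the three clusters:

Code
# number of listings assigned to each K-means cluster
km_res$size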

5.3 Gaussian Mixture Model (GMM)

Code
library(mclust)

gmm_res <- Mclust(mfa_coords)

clusters$cluster_gmm <- as.factor(gmm_res$classification)
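
Mclust selects the number of mixture components (and the covariance structure) by BIC; a quick way to inspect what it settled on:

Code
summary(gmm_res)             # chosen model and number of components
plot(gmm_res, what = "BIC")  # BIC across candidate models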

5.4 HDBSCAN – Density-Based Clustering

Code
library(dbscan)

hdb_res <- hdbscan(mfa_coords, minPts = 50, 
                   cluster_selection_epsilon = 1)

clusters$cluster_hdb <- as.factor(hdb_res$cluster)

HDBSCAN is useful for detecting clusters of varying density and shapes - and automatically labels noise as cluster 0.
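
A quick look at the resulting cluster sizes, including how many points were flagged as noise:

Code
# cluster sizes; label 0 is the noise cluster
table(hdb_res$cluster)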

5.5 Visualize Clustering Results (e.g. K-means)

Code
library(ggplot2)
library(patchwork)

df <- mfa_coords %>% bind_cols(clusters)

p1 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_kmean)) +
  geom_point(alpha = 0.7) +
  labs(title = "K-Means Clustering on MFA Coordinates") +
  theme_minimal() + 
  ylim(0, 4)

p2 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_gmm)) +
  geom_point(alpha = 0.7) +
  labs(title = "GMM Clustering on MFA Coordinates") +
  theme_minimal() +
  ylim(0, 4)

p3 <- ggplot(df, aes(x = Dim.1, y = Dim.2, color = cluster_hdb)) +
  geom_point(alpha = 0.7) +
  labs(title = "HDB Clustering on MFA Coordinates") +
  theme_minimal() +
  ylim(0, 4)
Code
p1

Code
p2

Code
p3
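
Since patchwork is already loaded, the three views can also be stacked into a single figure:

Code
p1 / p2 / p3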

5.6 Profiling

Code
all_results <- df_og %>% select(-ad_id) %>% 
  bind_cols(clusters)
Code
all_tests <- all_results %>% 
  select(-c(cluster_kmean, cluster_gmm)) %>% 
  tidy_catdesc(., cluster = "cluster_hdb")
Code
all_tests[[1]] %>% split(f = .$cluster)

To understand what defines each cluster, we use catdes() to identify over- and underrepresented categories in each group. Below is the result for Cluster 0.

This output tells us:

  • Which categorical levels are most characteristic of Cluster 0
  • How these levels compare to the global distribution
  • Which associations are statistically significant
  Variable                 cla_mod   mod_cla   Global %   p-value    v-test   Interpretation
  bathrooms=4              45.4%     49.5%     12.5%      < 1e-120   +23.8    Strongly overrepresented
  bedrooms=6               55.3%     22.0%     4.56%      < 1e-60    +16.7    Strongly overrepresented
  dwelling_type=farm       94.1%     2.79%     0.34%      < 1e-14    +7.72    Nearly all farm listings fall in this cluster
  parking=secure_parking   22.3%     6.79%     3.5%       < 0.01     +4.13    Moderately overrepresented
  bedrooms=2               7.3%      17.4%     27.3%      < 1e-8     -5.84    Underrepresented
  bathrooms=2              5.5%      19.9%     41.4%      < 1e-30    -11.6    Strongly underrepresented

Column Descriptions

  Column    Meaning
  cla_mod   % of all individuals with this level that fall into this cluster
  mod_cla   % of individuals in this cluster that have this level
  global    % of all individuals in the dataset with this level
  p_value   Significance of the association between the level and the cluster
  v_test    Standardized test statistic; positive = overrepresentation, negative = underrepresentation

Cluster 0 is characterized by:

  • Large, expensive farm properties
  • Many bathrooms (3–4) and bedrooms (5–6)
  • Rare in the general dataset but dominant in this group
  • Underrepresents small apartments and low-bedroom listings

This profiling helps us assign meaning to the cluster and supports downstream analysis (e.g., marketing segmentation or price modeling).

Code
# $`0`
# # A tibble: 17 × 8
#    variable                        cla_mod mod_cla global   p_value v_test cluster type_variable
#    <chr>                             <dbl>   <dbl>  <dbl>     <dbl>  <dbl> <chr>   <chr>        
#  1 bathrooms=4                       45.4    49.5   12.5  1.60e-125  23.8  0       qualitative  
#  2 bedrooms=6                        55.3    22.0    4.56 2.29e- 62  16.7  0       qualitative  
#  3 dwelling_type=farm                94.1     2.79   0.34 1.16e- 14   7.72 0       qualitative  
#  4 bedrooms=5                        23.9    12.4    5.94 4.97e- 10   6.22 0       qualitative  
#  5 dwelling_type=other               60       2.61   0.5  7.82e-  9   5.77 0       qualitative  
#  6 parking=secure_parking            22.3     6.79   3.5  3.55e-  5   4.13 0       qualitative  
#  7 bedrooms=4                        15.3    21.4   16.1  3.55e-  4   3.57 0       qualitative  
#  8 parking=off_street                31.6     2.09   0.76 9.40e-  4   3.31 0       qualitative  
#  9 bathrooms=3                       14.6    19.3   15.2  4.71e-  3   2.83 0       qualitative  
# 10 bedrooms=studio_or_bachelor_pad   25.5     2.26   1.02 5.34e-  3   2.79 0       qualitative  
# 11 parking=covered                    9.40   18.1   22.1  1.26e-  2  -2.49 0       qualitative  
# 12 bedrooms=1                         7.91    6.27   9.1  9.43e-  3  -2.60 0       qualitative  
# 13 dwelling_type=house               10.5    56.3   61.4  7.45e-  3  -2.68 0       qualitative  
# 14 bedrooms=2                         7.34   17.4   27.3  5.14e-  9  -5.84 0       qualitative  
# 15 bedrooms=3                         5.83   18.3   36.0  7.53e- 23  -9.84 0       qualitative  
# 16 bathrooms=2                        5.51   19.9   41.4  4.79e- 31 -11.6  0       qualitative  
# 17 bathrooms=1                        4.21   11.3   30.9  2.52e- 31 -11.6  0       qualitative  
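
Instead of reading the full per-cluster tables, the tidy output can also be filtered programmatically; a small sketch that pulls out the over-represented categories (the v-test cutoff of 2 is a rough ~5% threshold and purely illustrative):

Code
all_tests[[1]] %>%
  filter(v_test > 2) %>%          # keep over-represented levels only
  arrange(cluster, desc(v_test))
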
Code
all_tests[[2]] %>% split(f = .$cluster)
Code
# $`0`
# # A tibble: 2 × 9
#   variable v_test mean_in_category overall_mean sd_in_category overall_sd   p_value cluster type_variable
#   <chr>     <dbl>            <dbl>        <dbl>          <dbl>      <dbl>     <dbl> <chr>   <chr>        
# 1 price     34.9         13129699.     3864220.      16502989.   6764558. 1.75e-266 0       quantitative 
# 2 size_sqm   9.39            4627.         820.         30170.     10319. 5.81e- 21 0       quantitative 
# 
# $`1`
# # A tibble: 2 × 9
#   variable  v_test mean_in_category overall_mean sd_in_category overall_sd  p_value cluster type_variable
#   <chr>      <dbl>            <dbl>        <dbl>          <dbl>      <dbl>    <dbl> <chr>   <chr>        
# 1 size_sqm1  -4.16             161.         820.           225.     10319. 3.20e- 5 1       quantitative 
# 2 price1    -20.8          1698308.     3864220.       1180197.   6764558. 2.07e-96 1       quantitative

Quantitative Variables

In addition to categorical levels, catdes() also evaluates how numeric variables (e.g., price, size_sqm) vary across clusters. This is done using ANOVA-style comparisons.

Each row below describes the relationship between a numeric variable and a given cluster.

Cluster 0 (High-end listings)

  Variable   Cluster Mean    Global Mean     v-test   p-value    Interpretation
  price      R13.1 million   R3.86 million   +34.9    < 1e-250   Listings in Cluster 0 are vastly more expensive
  size_sqm   4,627 m²        820 m²          +9.39    < 1e-20    These listings are much larger
  • The positive v-test indicates that values are significantly higher in this cluster than average.
  • Both variables are strongly overrepresented in Cluster 0.

Cluster 1 (Low-cost listings)

  Variable   Cluster Mean   Global Mean     v-test   p-value   Interpretation
  price      R1.7 million   R3.86 million   −20.8    < 1e-95   Prices in this cluster are significantly lower
  size_sqm   161 m²         820 m²          −4.16    < 0.001   Smaller listings dominate this group

(In the raw output these rows appear as price1 and size_sqm1; the trailing 1 is just a row-name suffix added when the per-cluster tables are stacked, not part of the variable name.)
  • The negative v-test shows that Cluster 1 listings are significantly smaller and cheaper than the dataset average.

  Column             Meaning
  mean_in_category   Mean of the variable within this cluster
  overall_mean       Global mean of the variable (all clusters)
  v_test             Standardized test statistic (like a t-value)
  p_value            Significance of the difference between the cluster mean and the overall mean
  sd_in_category     Std. deviation within this cluster
  overall_sd         Global std. deviation of the variable

5.6.1 Conclusion

Quantitative profiling helps us numerically distinguish clusters by measuring:

  • Which clusters represent high-value or low-value properties
  • Whether size, price, or other features are defining dimensions
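
These summaries can be cross-checked directly against the raw data; a minimal sketch using the all_results table built above:

Code
all_results %>%
  group_by(cluster_hdb) %>%
  summarise(n = n(),
            mean_price = mean(price),
            mean_size_sqm = mean(size_sqm))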

Together with categorical profiles, this gives us a full statistical summary for interpreting the meaning of each cluster.

You can now merge this with the qualitative results to build complete cluster narratives.