We load a dataset of rental and sale listings from Gumtree, each with a rich free-text description field. This field contains unstructured information about the property and will serve as our main input for NLP. The target variable is type, which labels each listing as either “rental” or “sales”.
We apply Multiple Factor Analysis (MFA) to explore joint patterns in a dataset containing both categorical and numerical variables describing property listings.
MFA is a powerful method for:

- Combining mixed data types (e.g., room counts, prices, parking types)
- Balancing groups of variables so no group dominates the analysis
- Producing interpretable low-dimensional components for clustering or visualization
We first select and clean relevant columns, treating all categorical variables as strings.
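As a sketch, the selection and cleaning step might look like the following (object and column names such as `raw_listings`, `price`, and `size_sqm` are assumptions, not taken from the original code):

```r
library(dplyr)
library(FactoMineR)

# Hypothetical column names -- adapt to your own data
listings <- raw_listings %>%
  select(price, size_sqm, bedrooms, bathrooms, dwelling_type, parking) %>%
  # categorical variables kept as strings, then converted to factors for MFA
  mutate(across(c(bedrooms, bathrooms, dwelling_type, parking),
                ~ factor(as.character(.x))))

# Two variable groups: quantitative (price, size_sqm) and qualitative (the rest)
mfa_res <- MFA(listings,
               group      = c(2, 4),
               type       = c("s", "n"),  # "s" = standardised numeric, "n" = nominal
               name.group = c("quant", "qual"),
               graph      = FALSE)
```

Note that `MFA()` expects the columns to appear in the same order as the groups declared in `group`.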
- Each point represents a category of a qualitative variable (e.g. bedrooms_3, covered, garage)
- Points are placed in the factor space defined by the first two dimensions (Dim1 and Dim2)
- Categories that are closer together are more similar in terms of their contribution to the underlying MFA structure
- Categories that are farther from the origin contribute more strongly to the structure captured by these dimensions
This visualisation helps us:
- Explore associations between levels of qualitative variables
- Group similar categories (e.g., homes with 4+ bedrooms)
- Spot outliers or standout categories (e.g., studio_or_bachelor_pad)
```r
fviz_mfa_var(mfa_res, "quali.var")
```
After examining the positions of qualitative categories in the factor space, we now look at how much each category contributes to the construction of each dimension.
Each number represents the percentage contribution of a specific category to a given dimension. For example:

- bathrooms_4 contributes 8.62% to Dim1
- studio_or_bachelor_pad contributes 28.5% to Dim5
- bathrooms_2 contributes 15.84% to Dim4, making it one of the key drivers of that dimension
For interpretation:
- Higher contribution values = the category helps define that dimension
- If several categories of the same variable contribute to the same dimension, that dimension likely captures a latent trait related to that variable (e.g. household size, parking quality)
- Some categories only show up prominently in later dimensions (e.g. other, studio_or_bachelor_pad, secure_parking on Dim5) - these may capture more nuanced variation
Practical Use
You can use this information to:
- Name/interpret dimensions based on dominant contributing categories
- Select dimensions for clustering or visualisation
- Identify influential or standout category levels in your data
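The contribution table discussed above can be extracted directly from the MFA result; a minimal sketch, assuming the `mfa_res` object from the earlier FactoMineR fit:

```r
library(factoextra)

# Percentage contribution of each qualitative category to each dimension
round(mfa_res$quali.var$contrib, 2)

# Bar chart of the top contributors to Dim1
fviz_contrib(mfa_res, choice = "quali.var", axes = 1, top = 10)
```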
Each arrow represents a quantitative variable — its direction and length reveal:
- Direction: the dimension it is most aligned with
- Length: the strength of its contribution (longer = more important)
Interpretation
- price is strongly aligned with Dim1
- size_sqm is strongly aligned with Dim2
- The two arrows are nearly orthogonal (≈90°), meaning the variables are approximately uncorrelated in this reduced space

This means the first two dimensions capture two distinct drivers in the data: Dim1 reflects variation in price, and Dim2 reflects variation in size.
This helps you interpret the meaning of each dimension. You can also use this to assess which variables are dominant in driving the structure - helpful for:
- Dimension labelling
- Clustering interpretation
- Communicating insights back to stakeholders
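The correlation circle for the quantitative variables described above is typically drawn with factoextra; a minimal sketch, assuming the same `mfa_res` object:

```r
library(factoextra)

# Arrows for the quantitative variables (price, size_sqm) on Dim1 x Dim2
fviz_mfa_var(mfa_res, "quanti.var", repel = TRUE)
```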
4.7 Partial individuals

The partial individuals plot shows how each group of variables (e.g. quantitative, qualitative) contributes to the position of each observation in the global MFA space.

```r
fviz_mfa_ind(mfa_res, partial = "all")
```
What does this show?
- Each point (observation) is projected into the global space multiple times — once per variable group
- The plot shows these partial projections, one per group, and how they relate to the global projection
- Lines connect each group’s position to the global consensus position for that observation
Why is this useful?
- Helps diagnose which group pulls an observation in which direction
- Reveals whether groups agree or conflict in their view of the data structure
- Useful in mixed-data contexts: e.g., do qualitative and quantitative blocks reinforce or contradict each other?
- Short lines = groups agree on the observation’s position
- Long lines = disagreement across groups → heterogeneity or measurement conflict
5 Clustering on MFA Coordinates
Now that we’ve reduced the data to a lower-dimensional space using MFA, we can apply clustering algorithms to uncover potential group structure.

We’ll use:

- K-Means: partition-based; requires the number of clusters up front
- GMM (Gaussian Mixture Models): probabilistic, model-based clustering
- HDBSCAN: density-based clustering that can handle noise and variable cluster shapes
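A sketch of how these three algorithms might be run on the MFA coordinates (the package choices — mclust for GMM, dbscan for HDBSCAN — and the number of dimensions and clusters are assumptions):

```r
library(mclust)   # Gaussian mixture models
library(dbscan)   # HDBSCAN

# Use the first five MFA dimensions as features
coords <- mfa_res$ind$coord[, 1:5]

km  <- kmeans(coords, centers = 3, nstart = 25)  # K-means: k fixed in advance
gmm <- Mclust(coords)                            # GMM: k and covariance model chosen by BIC
hdb <- hdbscan(coords, minPts = 10)              # HDBSCAN: cluster 0 = noise

# Cross-tabulate the assignments to compare the solutions
table(kmeans = km$cluster, hdbscan = hdb$cluster)
```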
To understand what defines each cluster, we use catdes() to identify over- and underrepresented categories in each group. Below is the result for Cluster 0.
This output tells us:
- Which categorical levels are most characteristic of Cluster 0
- How these levels compare to the global distribution
- Which associations are statistically significant
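A sketch of the profiling call, assuming a `listings` data frame holding the cleaned variables plus a factor column `cluster` with each listing’s cluster assignment (object and column names are illustrative):

```r
library(FactoMineR)

# `listings` is assumed to contain the cleaned variables plus a
# factor column `cluster` with the cluster label of each listing
listings$cluster <- factor(listings$cluster)

res_cat <- catdes(listings, num.var = which(names(listings) == "cluster"))

# Categorical profile of Cluster 0 (Cla/Mod, Mod/Cla, global %, p-value, v-test)
res_cat$category$`0`
```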
| Variable | cla_mod (% of level in this cluster) | mod_cla (% of cluster with this level) | Global % | p-value | v-test | Interpretation |
|---|---|---|---|---|---|---|
| bathrooms=4 | 45.4% | 49.5% | 12.5% | < 1e-120 | +23.8 | Strongly overrepresented |
| bedrooms=6 | 55.3% | 22.0% | 4.56% | < 1e-60 | +16.7 | Strongly overrepresented |
| dwelling_type=farm | 94.1% | 2.79% | 0.34% | < 1e-14 | +7.72 | Dominant dwelling type in cluster |
| parking=secure_parking | 22.3% | 6.79% | 3.5% | < 0.01 | +4.13 | Moderately overrepresented |
| bedrooms=2 | 7.3% | 17.4% | 27.3% | < 1e-8 | −5.84 | Underrepresented |
| bathrooms=2 | 5.5% | 19.9% | 41.4% | < 1e-30 | −11.6 | Strongly underrepresented |
Column Descriptions

| Column | Meaning |
|---|---|
| cla_mod | % of all individuals with this level that fall into this cluster |
| mod_cla | % of individuals in this cluster that have this level |
| global | % of all individuals in the dataset with this level |
| p_value | Significance of the association between the level and the cluster |
| v_test | Standardized test statistic; positive = overrepresentation, negative = underrepresentation |
Cluster 0 is characterized by:

- Large, expensive farm properties
- Many bathrooms (3–4) and bedrooms (5–6)
- Levels that are rare in the overall dataset but dominant in this group
- Underrepresentation of small apartments and low-bedroom listings
This profiling helps us assign meaning to the cluster and supports downstream analysis (e.g., marketing segmentation or price modeling).
In addition to categorical levels, catdes() also evaluates how numeric variables (e.g., price, size_sqm) vary across clusters. This is done using ANOVA-style comparisons.
Each row below describes the relationship between a numeric variable and a given cluster.
Cluster 0 (High-end listings)
| Variable | Cluster Mean | Global Mean | v-test | p-value | Interpretation |
|---|---|---|---|---|---|
| price | R13.1 million | R3.86 million | +34.9 | < 1e-250 | Listings in Cluster 0 are vastly more expensive |
| size_sqm | 4,627 m² | 820 m² | +9.39 | < 1e-20 | These listings are much larger |
The positive v-test indicates that values are significantly higher in this cluster than average.
Both variables are strongly overrepresented in Cluster 0.
Cluster 1 (Low-cost listings)
| Variable | Cluster Mean | Global Mean | v-test | p-value | Interpretation |
|---|---|---|---|---|---|
| price | R1.7 million | R3.86 million | −20.8 | < 1e-95 | Prices in this cluster are significantly lower |
| size_sqm | 161 m² | 820 m² | −4.16 | < 0.001 | Smaller listings dominate this group |
The negative v-test shows that Cluster 1 listings are significantly smaller and cheaper than the dataset average.
| Column | Meaning |
|---|---|
| mean_in_category | Mean of the variable within this cluster |
| overall_mean | Global mean of the variable (all clusters) |
| v_test | Standardized test statistic (like a t-value) |
| p_value | Significance level for mean difference (via ANOVA) |
| sd_in_category | Std. deviation in this cluster |
| overall_sd | Global std. deviation for that variable |
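These columns correspond to the `$quanti` element of the `catdes()` output. A self-contained sketch (object and column names are illustrative):

```r
library(FactoMineR)

# `listings` is assumed to contain price, size_sqm and a factor column `cluster`
res_cat <- catdes(listings, num.var = which(names(listings) == "cluster"))

# Numeric profile of Cluster 0: one row per variable, with the cluster mean,
# overall mean, standard deviations, v-test and p-value
res_cat$quanti$`0`
```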
5.6.1 Conclusion
Quantitative profiling helps us numerically distinguish clusters by measuring:

- Which clusters represent high-value or low-value properties
- Whether size, price, or other features are defining dimensions
Together with categorical profiles, this gives us a full statistical summary for interpreting the meaning of each cluster.
You can now merge this with the qualitative results to build complete cluster narratives.