Practical NLP

1 Load Data

We load a dataset of rental and sale listings from Gumtree, each with a rich free-text description field. This field contains unstructured information about the property and will serve as our main input for NLP. The target variable is type, which labels each listing as either “rental” or “sales”.

Code
library(tidyverse)  # provides read_csv(), %>% and sample_n()
gumtree_texts <- read_csv("data/gumtree_clean.zip")
Rows: 16681 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): type, ad_id, ad_url, location, for_sale_by, dwelling_type, bedroom...
dbl  (3): bathrooms, size_sqm, price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
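Since type is the target variable, it helps to check how the two classes are distributed before modelling. The following is a minimal sketch of that check. It runs on a small made-up tibble, not the Gumtree data: toy_listings and its rows are hypothetical. On the real data, the same pipeline would presumably be applied to gumtree_texts.

```r
library(dplyr)
library(tibble)

# Toy stand-in for gumtree_texts; these rows are invented for illustration only.
toy_listings <- tibble(
  type = c("sales", "rental", "sales", "rental", "sales"),
  description = c("Family home", "Flat to let", "Plot for sale",
                  "Room to rent", "Townhouse")
)

# Count listings per class and add each class's share of the total.
class_balance <- toy_listings %>%
  count(type) %>%
  mutate(share = n / sum(n))

print(class_balance)
```

A strongly skewed share column would suggest looking at stratified sampling or class weights later on.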
Code
gumtree_texts %>% 
    sample_n(1) %>% 
    t
               [,1]
type           "sales"
ad_id          "ecbb8120ba01be9cb4d473efe21c8901"
ad_url         "https://www.gumtree.co.za/a-houses-flats-for-sale/milnerton/family-oasis-in-tijgerhof-milnerton/10012471292341012689660009"
location       "milnerton_northern_suburbs"
for_sale_by    "agency"
dwelling_type  "house"
bedrooms       "4"
bathrooms      "3"
size_sqm       "368"
parking        "street"
price          "2e+06"
available_from NA
for_rent_by    NA
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
furnished      NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
smoking        NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
pet_friendly   NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
description    "Your Ideal Haven Awaits!Please note!  All written offers FROM R2 000 000 will be considered by the Seller.Nestled in the heart of Tijgerhof, Milnerton, this enchanting family home is a haven of comfort and convenience. Perfectly tailored to accommodate the needs of your growing family, this property offers a lifestyle of ease and enjoyment. Here's why you'll fall in love with this home:Prime Location:Experience the epitome of convenience with this residence located within walking distance to shops and public transport. The proximity to schools ensures a stress-free morning routine, and the allure of the beach, just 6 minutes away, promises endless family adventures and relaxation.Comfortable Living:Boasting 4 bedrooms, 3 of which having built-in cupboards and one being a flatlet, this home provides ample space for your family to thrive. The main bedroom is a sanctuary with a sleek ensuite featuring a refreshing shower, offering parents a private retreat within the home.Family-Friendly Layout:With 1.5 bathrooms, including the main Bathroom, mornings become a breeze for the entire family. Two separate entrances, one with its own bathroom, provide flexibility for extended family arrangements or guests, ensuring everyone feels at home.Parking and More:Ample parking space allows for the hassle-free arrival and departure of family members and guests. The kitchen, complete with built-in cupboards, is a culinary haven where family meals become cherished moments.Outdoor Bliss:Step into the large backyard, a sprawling canvas for family gatherings and outdoor activities. 
Whether it's a weekend braai, playtime for the kids, or a serene moment in nature, this backyard is a true extension of your living space.Sustainable Living:The property is equipped with JoJo tanks and an irrigation system for both the back and front yards, promoting water-wise living and ensuring your garden flourishes year-round.Fibre Ready:Stay connected with high-speed internet as the property is fibre-ready, catering to the demands of modern family life.Seize the opportunity to create lasting memories in this family-oriented haven. This Tijgerhof gem offers not just a home but a lifestyle where comfort, convenience, and joy intersect seamlessly. Don't miss out - your family's next chapter begins here!Has GardenProperty Reference #: 2199987Agent Details:Eben DebbesIcon Property Group (SA)201 Pinehurst, Somerset Links Office Park,De Beers Avenue,Somerset West 9 Kruger StreetStrand7139"

2 Concordance

We start by converting the description text into tokens, the basic building blocks for most NLP tasks. This uses the tokens() function from {quanteda}, which lets us control how text is split. Here, we lowercase the text with tolower() before tokenizing, then strip out punctuation, numbers, symbols and URLs. This step transforms raw text into a structured format that we can analyze and manipulate.

Code
gumtree_tokens <- tokens(tolower(gumtree_texts$description), 
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE,
      remove_url = TRUE, 
      verbose = TRUE)
Creating a tokens from a character object...
 ...starting tokenization
 ...tokenizing 1 of 1 blocks
 ...preserving hyphens
 ...preserving elisions
 ...preserving social media tags (#, @)
 ...removing separators, punctuation, symbols, numbers, URLs 
 ...95,537 unique types
 ...complete, elapsed time: 7.03 seconds.
Finished constructing tokens from 16,681 documents.

Next, we use kwic() (keyword-in-context) to find every occurrence of the word “ocean” in the tokenized descriptions. Setting window = 5 returns the five tokens before and after each match, so we can read the word in context, a useful exploratory tool in text analysis.

Code
ocean_kwic <- kwic(
  # define text
  gumtree_tokens, 
  # define search pattern
  pattern = "ocean", 
  # define context window size
  window = 5) %>% 
  as_tibble %>% 
  select(-pattern)

We can also search for multi-word expressions using phrase(). Here we look for the exact phrase “blue ocean” and again extract a small context window. This is helpful for identifying marketing language or specialized phrasing in real estate listings.

Code
ocean_kwic_phrase <- kwic(
  # define text
  gumtree_tokens, 
  # define search pattern
  pattern = phrase("blue ocean"), 
  # define context window size
  window = 5) %>% 
  as_tibble %>% 
  select(-pattern)

Each result is a tidy tibble with columns for the document, the token position, and the pre- and post-context around the match. Printing the single-word result, ocean_kwic, makes it easy to examine patterns in usage: for example, you might find that listings with “ocean views” tend to emphasize luxury or location.

Code
ocean_kwic
# A tibble: 1,895 × 6
   docname  from    to pre                                         keyword post 
   <chr>   <int> <int> <chr>                                       <chr>   <chr>
 1 text1      16    16 has views overlooking the open              ocean   robb…
 2 text3     102   102 stunning vistas of the azure                ocean   that…
 3 text5     121   121 unobstructed views of the atlantic          ocean   from…
 4 text7      70    70 panoramic views of the atlantic             ocean   and …
 5 text7     250   250 views of the majestic atlantic              ocean   desi…
 6 text7     380   380 mesmerising glimpse of the atlantic         ocean   disp…
 7 text8      21    21 two bedroom apartment overlooks the         ocean   and …
 8 text12     32    32 connecting terrace for sunset and           ocean   view…
 9 text14     16    16 has views overlooking the open              ocean   robb…
10 text18     57    57 entertainer's balcony encompassing the mag… ocean   view…
# ℹ 1,885 more rows
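
To see kwic() and phrase() at a small scale, the sketch below runs both on two made-up snippets. The toy_texts vector and every toy_* name are illustrative assumptions, not part of the Gumtree data, and the example assumes {quanteda} is installed.

```r
library(quanteda)

# Two made-up listing snippets (illustrative only, not from the Gumtree data)
toy_texts <- c(
  ad1 = "Sunny flat with blue ocean views and a large balcony",
  ad2 = "Family home near the ocean, close to schools and shops"
)

# Lowercase first, then tokenize and drop punctuation
toy_tokens <- tokens(tolower(toy_texts), remove_punct = TRUE)

# Single-word pattern: every "ocean", with two tokens of context on each side
toy_kwic <- kwic(toy_tokens, pattern = "ocean", window = 2)

# Multi-word pattern: phrase() makes kwic() match the token sequence "blue ocean"
toy_kwic_phrase <- kwic(toy_tokens, pattern = phrase("blue ocean"), window = 2)

# Inspect the single-word matches as a plain data frame
as.data.frame(toy_kwic)[, c("docname", "pre", "keyword", "post")]
```

The single-word search matches both snippets, while the phrase search matches only the first, since only that snippet contains the two tokens side by side.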

3 Ngrams

We first combine the pre- and post-context around the keyword (“ocean”) into a single text column. This gives us a compact window of nearby words from which we can extract n-grams. Keeping track of the docname allows us to trace each n-gram back to its source listing if needed.

Code
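ocean_pre_post <- ocean_kwic %>% 
    # sep = " " keeps the last pre word and the first post word separate
    # (unite() defaults to sep = "_")
    unite("text", c(pre, post), sep = " ") %>% 
    select(docname, text)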
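
The separator used by unite() matters here: unless sep is given, unite() joins columns with an underscore, which glues the last word before the keyword to the first word after it. A small sketch with one made-up row (toy_row is a hypothetical stand-in for a row of kwic output):

```r
library(dplyr)
library(tidyr)

# One made-up kwic-style row (illustrative only)
toy_row <- tibble(pre = "overlooking the open", post = "robben island")

# Default separator: the boundary words are joined by "_"
default_text <- toy_row %>% 
  unite("text", c(pre, post)) %>% 
  pull(text)

# Explicit space separator: the boundary words stay separate
spaced_text <- toy_row %>% 
  unite("text", c(pre, post), sep = " ") %>% 
  pull(text)

default_text  # "overlooking the open_robben island"
spaced_text   # "overlooking the open robben island"
```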

Using unnest_tokens() from the tidytext package, we break the context text into individual tokens (i.e., unigrams). We specify n = 1 to extract single words, but this could be increased to extract bigrams (n = 2) or trigrams (n = 3) for phrase analysis.

Code
ocean_tokens <- ocean_pre_post %>% 
    unnest_tokens(input = text, output = word, 
                  token = "ngrams", 
                  n = 1)
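
As a minimal sketch of the n = 2 case, the example below extracts bigrams from two made-up context strings and counts them. The toy_context table is an illustrative assumption, not the real kwic output.

```r
library(dplyr)
library(tidytext)

# Two made-up context windows (illustrative only)
toy_context <- tibble(
  docname = c("text1", "text2"),
  text = c("panoramic views of the atlantic",
           "stunning vistas of the azure")
)

# n = 2 returns overlapping two-word sequences, one row per bigram
toy_bigrams <- toy_context %>% 
  unnest_tokens(input = text, output = bigram, 
                token = "ngrams", 
                n = 2)

# Count bigrams across documents, most frequent first
toy_bigram_counts <- toy_bigrams %>% 
  count(bigram, sort = TRUE)
```

Each five-word string yields four bigrams, and "of the" is the only one shared by both documents. Stopword removal is less straightforward for bigrams, since a stopword can sit in either position.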

After tokenization, we:

  • Strip stray underscores from the tokens (unite() joins columns with "_" unless sep is set),
  • Remove common stopwords using the built-in stop_words list from tidytext.

This leaves us with only meaningful content words surrounding the keyword “ocean”, which we can now count, visualize, or use in downstream models.

Code
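ocean_tokens <- ocean_tokens %>% 
    # delete literal underscores (fixed = TRUE: plain string, not a regex)
    mutate(word = gsub("_", "", word, fixed = TRUE)) %>% 
    # drop stopwords found in tidytext's stop_words lexicon
    anti_join(stop_words, by = join_by(word))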

We now create a word cloud of the most frequent words that appeared near the word “ocean” in the property listings. The steps are:

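  • count(word, name = "obs", sort = TRUE): Tally how often each word appears in the context, most frequent first.
  • sample_frac(weight = obs, size = 0.1): Draw a random 10% sample of the distinct words, weighted by frequency, to reduce clutter while still favoring common words. The sample changes between runs unless a seed is set.
  • geom_text_wordcloud(): Plot the word cloud (from the ggwordcloud package), with text size and color mapped to frequency.
  • scale_color_gradient(): Apply a blue-to-red color gradient for better visual contrast.
  • scale_size_area(max_size = 20): Scale word area in proportion to frequency and cap the size of the largest word.
  • theme_minimal(): Use a clean theme to keep the focus on the words.
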
Code
(p1 <- ocean_tokens %>% 
  count(word, name = "obs", sort = TRUE) %>% 
  sample_frac(weight = obs, size = 0.1) %>% 
  ggplot(., aes(label = word, size = obs, 
                color = obs)) +
  geom_text_wordcloud() +
  scale_color_gradient(low = "#189bcc", 
                       high = "#960018") +
  scale_size_area(max_size = 20) +
  theme_minimal())

Code
# ggsave("figures/ocean_wordcloud.png", plot = p1, 
#        width = 8, 
#        height = 8, 
#        bg = "transparent")
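
Because the weighted sample is random, the word cloud changes from run to run. Below is a minimal sketch of a reproducible version, assuming dplyr 1.0.0 or later, where slice_sample() supersedes sample_frac(). The toy_counts table and the seed value are made up for illustration.

```r
library(dplyr)

# Made-up word counts (illustrative only)
toy_counts <- tibble(
  word = c("views", "atlantic", "beach", "sunset", "balcony",
           "island", "sea", "mountain", "terrace", "pool"),
  obs  = c(50, 40, 30, 20, 10, 5, 4, 3, 2, 1)
)

# Fix the random seed so the same words are drawn on every run
set.seed(42)

# Keep half of the distinct words; frequent words are more likely to be kept
toy_sample <- toy_counts %>% 
  slice_sample(prop = 0.5, weight_by = obs)
```

The sample always has five rows here, but which words survive depends on the seed.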

4 Sentiment Analysis

This pipeline applies a lexicon-based sentiment analysis using the Bing Liu dictionary, which classifies individual words as either positive or negative. The steps are:

  • unnest_tokens(...): Tokenize each listing description into words.
  • anti_join(stop_words, ...): Remove common stopwords like “the” or “and”.
  • inner_join(get_sentiments("bing"), ...): Retain only words that have an associated sentiment label.
  • drop_na(): Drop rows with missing values, for example listings without a dwelling_type.
  • count(...): Count how many positive or negative words appear in each listing.
  • pivot_wider(...): Reshape the data to create separate columns for positive and negative word counts.
  • mutate(sentiment = ...): Compute a normalized sentiment score for each ad using the formula:

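\text{sentiment} = \frac{W_p - W_n}{W_p + W_n}

Here W_p and W_n are the numbers of positive and negative words in a listing. The score lies between −1 (only negative words) and +1 (only positive words). Listings without any lexicon words are removed by the inner join, so the denominator is never zero. The score gives a rough sense of the tone of each property listing. You can use it to study whether certain property types (e.g., “apartment” vs “house”) are marketed more positively.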

Code
gumtree_texts %>% select(ad_id, dwelling_type, description) %>% 
  unnest_tokens(word, description) %>% 
  anti_join(stop_words, by = "word") %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  drop_na() %>% 
  count(ad_id, dwelling_type, sentiment, name = "obs") %>% 
  pivot_wider(names_from = "sentiment", values_from = "obs", 
              values_fill = 0) %>% 
  mutate(sentiment = (positive - negative)/(positive + negative))
Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1193502 of `x` matches multiple rows in `y`.
ℹ Row 3281 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 16,116 × 5
   ad_id                            dwelling_type positive negative sentiment
   <chr>                            <chr>            <int>    <int>     <dbl>
 1 00026b744459e17f11d5d66a9634f159 apartment            3        0     1    
 2 000513d19f7999faf08868e98d1a5dde house               12        0     1    
 3 0007b1cbdc9ef27ad3d663aaa5a11240 house                9        2     0.636
 4 000a2cd6656b24763bd1fc416ea01b00 house                9        1     0.8  
 5 000b5a78eeb86987a0f8f4fba68fd568 house                3        1     0.5  
 6 000bb0c255cf59fec4ce6d65d7276e6b apartment           14        1     0.867
 7 000eef2b00e4fa0068d40cb3bd4977ce apartment           28        5     0.697
 8 0010f52723234aafbfe6c7717a7b13b0 house                5        0     1    
 9 00130b3cffe0f6eed05ebb1f656a4a80 house                8        1     0.778
10 00174e940b79dabf51d580976595af3f apartment            6        0     1    
# ℹ 16,106 more rows
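
The arithmetic behind the score can be checked on a small made-up example. The sketch below uses a hypothetical five-word lexicon (toy_lexicon) in place of the Bing dictionary and three invented ads, so the numbers are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("stunning", "spacious", "secure", "noisy", "cramped"),
  sentiment = c("positive", "positive", "positive", "negative", "negative")
)

# Already-tokenized words for three made-up ads
toy_words <- tibble(
  ad_id = c("a", "a", "a", "a", "b", "b", "c"),
  word  = c("stunning", "spacious", "noisy", "garden",
            "secure", "stunning", "garage")
)

toy_scores <- toy_words %>% 
  # keep only words with a sentiment label; ad "c" has none and drops out
  inner_join(toy_lexicon, by = "word") %>% 
  count(ad_id, sentiment, name = "obs") %>% 
  pivot_wider(names_from = "sentiment", values_from = "obs", 
              values_fill = 0) %>% 
  mutate(sentiment = (positive - negative) / (positive + negative))
```

Ad "a" has two positive words and one negative word, giving (2 − 1) / (2 + 1) = 1/3. Ad "b" has only positive words and scores 1. In the real pipeline, the many-to-many warning indicates that at least one word occurs more than once in the lexicon, so such a word is counted once per matching lexicon row.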