9 Text Analysis III: Finding Groups of Texts

9.1 Goals

  • introduce text clustering approaches that allow us to find groups of thematically similar texts:
    • hierarchical clustering;
    • k-means clustering;
    • topic modeling;

9.2 Preliminaries

9.2.1 Libraries

The following are the libraries that we will need for this section. Install those that you do not have yet.

#install.packages(c("ggplot2", "LDAvis", "readr", "slam", "stringr", "tictoc", "tidytext", "tidyverse", "tm", "topicmodels", "text2vec", "stylo", "cluster", "factoextra"))

# general
library(ggplot2)

# text analysis specific
library(readr)
library(slam)
library(stringr)
library(tidytext)
library(tidyverse)
library(tm)
library(topicmodels)

library(text2vec)
library(stylo)

library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization

# extra
library(tictoc) # to time operations

9.2.1.1 The Dispatch Data: Preprocessing

Texts for experiments:

Topic models, pretrained:

We will limit ourselves to a single year, but, of course, the results are always more interesting with more data…

Loading…

d1864 <- read.delim("./files/data/dispatch_1864.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
d1864$date <- as.Date(d1864$date, format="%Y-%m-%d")
knitr::kable(head(d1864, 1), table.attr = "class=\"striped\"", format="html")
id date type header text
1864-04-28_article_1 1864-04-28 article The news. The news. Yesterday was an unusual dull day, there being a great dearth in the news market, which was the subject of general remark. – The report which prevailed towards night, to the effect that a battle was progressing on the line of the Rapidan, we feel authorized to contradict. All was quiet when the Central train left Gordonsville yesterday evening. The cannonading heard in the direction of Germanna Ford, previously noticed, was caused by artillery practice.

Let’s remove low-frequency items:

d1864_lowFreq <- d1864 %>%
  unnest_tokens(word, text) %>%
  count(word, sort=TRUE)

summary(d1864_lowFreq)
##      word                 n            
##  Length:58774       Min.   :     1.00  
##  Class :character   1st Qu.:     1.00  
##  Mode  :character   Median :     2.00  
##                     Mean   :    61.63  
##                     3rd Qu.:     8.00  
##                     Max.   :278040.00
lowFreq <- d1864_lowFreq %>%
  filter(n <= 1)
summary(lowFreq)
##      word                 n    
##  Length:28107       Min.   :1  
##  Class :character   1st Qu.:1  
##  Mode  :character   Median :1  
##                     Mean   :1  
##                     3rd Qu.:1  
##                     Max.   :1

Most of these low-frequency items are typos or OCR/tokenization artifacts:

knitr::kable(head(lowFreq, 15), table.attr = "class=\"striped\"", format="html")
word n
0001850holstein482 1
00024 1
0003750 1
0004unshelled 1
0005 1
000aliens 1
000anne 1
000another 1
000april 1
000august 1
000banks’s 1
000butchers5500blacksmiths2200wheelrights2200teamsters505 1
000butler’s 1
000california50 1
000connecticut40 1

We can anti-join our corpus with lowFreq, which will remove them:

d1864_clean <- d1864 %>%
  filter(type != "ad_blank")

d1864_clean <- d1864_clean %>%
  unnest_tokens(word, text) %>%
  anti_join(lowFreq, by="word") %>%
  group_by(id) %>%
  count(word, sort=TRUE)

# unfiltered:         2,815,144
# filtered (n > 1):   2,749,078

Additionally, we need to remove stop words, but first we need to identify them.

d1864_clean_FL <- d1864_clean %>%
  group_by(word) %>%
  summarize(freq=sum(n)) %>%
  arrange(desc(freq))

knitr::kable(head(d1864_clean_FL, 15), table.attr = "class=\"striped\"", format="html")
word freq
the 278040
of 163446
and 114985
to 98596
in 68070
a 67022
that 35265
for 32885
on 31679
was 30730
is 29191
at 27723
be 27673
by 26285
from 24300

To make things faster, you can remove the top 50, 100, 150, or 200 most frequent words, but this is a rather brute-force approach. Ideally, you want to curate your own stop word list, tuned to your texts. Below, I have taken the top 500 words and manually removed everything that was meaningful (or, better, what I considered meaningful). Additionally, there are NLP procedures designed to lemmatize words (i.e., reduce all words to their dictionary forms) and to do part-of-speech tagging, which allows you to remove words categorically (for example, keeping only nouns, adjectives, and verbs).
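One way to start such a curation yourself is to export the most frequent words to a file, prune that file by hand, and read what is left back in as your stop word list. A minimal sketch (the file path is just an example):

candidate_sw <- head(d1864_clean_FL, 500) # top 500 most frequent words
write.csv(candidate_sw, "./files/data/stopword_candidates.csv", row.names=FALSE)
# ... curate the file manually, keeping only true stop words ...
# curated_sw <- read.csv("./files/data/stopword_candidates.csv", stringsAsFactors=FALSE)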

word <- c("the", "of", "and", "to", "in", "a", "that", "for", "on", "was", "is", "at", "be", "by", "from", "his", "he", "it", "with", "as", "this", "will", "which", "have", "or", "are", "they", "their", "not", "were", "been", "has", "our", "we", "all", "but", "one", "had", "who", "an", "no", "i", "them", "about", "him", "two", "upon", "may", "there", "any", "some", "so", "men", "when", "if", "day", "her", "under", "would", "c", "such", "made", "up", "last", "j", "time", "years", "other", "into", "said", "new", "very", "five", "after", "out", "these", "shall", "my", "w", "more", "its", "now", "before", "three", "m", "than", "h", "o'clock", "old", "being", "left", "can", "s", "man", "only", "same", "act", "first", "between", "above", "she", "you", "place", "following", "do", "per", "every", "most", "near", "us", "good", "should", "having", "great", "also", "over", "r", "could", "twenty", "people", "those", "e", "without", "four", "received", "p", "then", "what", "well", "where", "must", "says", "g", "large", "against", "back", "000", "through", "b", "off", "few", "me", "sent", "while", "make", "number", "many", "much", "give", "1", "six", "down", "several", "high", "since", "little", "during", "away", "until", "each", "5", "year", "present", "own", "t", "here", "d", "found", "reported", "2", "right", "given", "age", "your", "way", "side", "did", "part", "long", "next", "fifty", "another", "1st", "whole", "10", "still", "among", "3", "within", "get", "named", "f", "l", "himself", "ten", "both", "nothing", "again", "n", "thirty", "eight", "took", "never", "came", "called", "small", "passed", "just", "brought", "4", "further", "yet", "half", "far", "held", "soon", "main", "8", "second", "however", "say", "heavy", "thus", "hereby", "even", "ran", "come", "whom", "like", "cannot", "head", "ever", "themselves", "put", "12", "cause", "known", "7", "go", "6", "once", "therefore", "thursday", "full", "apply", "see", "though", "seven", "tuesday", "11", "done", "whose", "let", "how", "making", "immediately", "forty", "early", "wednesday", "either", "too", "amount", "fact", "heard", "receive", "short", "less", "100", "know", "might", "except", "supposed", "others", "doubt", "set", "works") 

sWordsDF <- data.frame(word)

d1864_clean_minusSW <- d1864_clean %>%
  anti_join(sWordsDF, by="word")
dim(d1864_clean)
## [1] 1765298       3
# 1,759,828
dim(d1864_clean_minusSW)
## [1] 1162644       3
# 1,159,214

9.2.2 TF-IDF

From Wikipedia: In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification. One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
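The bind_tf_idf() function used below does all of this for us; just to make the formula concrete, the same three values could be computed by hand roughly like this (a sketch, not needed for the rest of the chapter):

n_docs <- n_distinct(d1864_clean_minusSW$id) # number of documents in the corpus

manual_tf_idf <- d1864_clean_minusSW %>%
  group_by(id) %>%
  mutate(tf = n / sum(n)) %>%                    # term frequency within a document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(id))) %>% # how rare the word is across documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)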

df_TF_IDF <- d1864_clean_minusSW %>% # d1864_clean, d1864_clean_minusSW
  bind_tf_idf(word, id, n) %>%
  arrange(desc(tf_idf)) %>%
  ungroup

knitr::kable(head(df_TF_IDF, 15), table.attr = "class=\"striped\"", format="html")
id word n tf idf tf_idf
1864-07-25_article_6 suicides 2 0.2500000 9.589941 2.397485
1864-09-12_advert_131 leonard’s 3 0.2307692 8.896793 2.053106
1864-02-05_article_19 sissy 3 0.2000000 9.589941 1.917988
1864-01-28_article_49 generality 1 0.2500000 7.644030 1.911008
1864-01-11_advert_105 blacking 6 0.2307692 8.203646 1.893149
1864-09-12_advert_131 pills 3 0.2307692 8.203646 1.893149
1864-12-08_article_70 council 2 0.4000000 4.508536 1.803414
1864-12-12_advert_143 alf 2 0.1818182 9.589941 1.743626
1864-09-27_article_95 calendar 2 0.2857143 6.006422 1.716120
1864-03-22_article_71 apartments 2 0.2222222 7.644030 1.698673
1864-12-29_article_42 dishonest 1 0.2500000 6.756727 1.689182
1864-09-10_article_61 mobile 2 0.5000000 3.373334 1.686667
1864-05-05_article_17 ling 1 0.2000000 8.203646 1.640729
1864-09-26_article_63 hoy 1 0.2000000 8.203646 1.640729
1864-03-30_article_50 telegrams 2 0.3333333 4.810817 1.603606
articleID = "1864-07-25_article_6"
filter(df_TF_IDF, id==articleID) %>%
  arrange(desc(tf_idf))
## # A tibble: 6 × 6
##   id                   word          n    tf   idf tf_idf
##   <chr>                <chr>     <int> <dbl> <dbl>  <dbl>
## 1 1864-07-25_article_6 suicides      2 0.25   9.59  2.40 
## 2 1864-07-25_article_6 france        2 0.25   4.35  1.09 
## 3 1864-07-25_article_6 suicide       1 0.125  6.50  0.812
## 4 1864-07-25_article_6 committed     1 0.125  3.09  0.387
## 5 1864-07-25_article_6 persons       1 0.125  2.61  0.327
## 6 1864-07-25_article_6 take          1 0.125  2.26  0.283
d1864$text[d1864$id==articleID]
## [1] "Suicides in France -- More than ten suicides take place every day in France; last year 4, 000 persons committed suicide."

9.3 Hierarchical clustering

Clustering is a method for breaking items into closely related groups, or clusters. The code below shows how our data can be broken into clusters with hierarchical clustering, using distance matrices. Hierarchical clustering has rather high precision, but it is sensitive to outliers and computationally expensive, which makes it nearly unusable with large datasets. K-means clustering is more suitable for large datasets, but struggles with uniform data. In both cases you have to define the number of clusters.

In what follows, we take our TF-IDF data, sample it, and run cluster analysis on a small sample. The chunk below simply prepares our data for analysis:

# RANDOMLY SELECT N ITEMS
set.seed(48965)
N = 100
sampledIDs <- sample(unique(df_TF_IDF$id), N)
sample_d1864_tfidf <- df_TF_IDF %>% filter(id %in% sampledIDs) %>%
  select(id, word, tf_idf)

# CONVERT INTO DTM MATRIX
colnames(sample_d1864_tfidf) <- c("document", "term", "count")
sample <- tibble(sample_d1864_tfidf) %>% cast_dtm(document, term, count)
sample_matrix <- as.matrix(sample)

# CONVERT INTO REGULAR MATRIX AND CALCULATE DISTANCES
distanceMatrix <- dist.cosine(sample_matrix) # from library(stylo)
distanceMatrixHC <- as.dist(distanceMatrix)

Now we can do the actual clustering. There are several clustering/linkage methods available, and which one to use usually depends on your goals.

From ?hclust: “A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note however, that methods "median" and "centroid" are not leading to a monotone distance measure, or equivalently the resulting dendrograms can have so called inversions or reversals which are hard to interpret …”

As a rule of thumb: balanced trees are what you want when you expect roughly equal numbers of items in each cluster. Unbalanced trees are useful for finding outliers: with this method you can see which items you might want to remove in order to achieve better clustering.

You can find additional explanations here, here, and here.

# THE FOLLOWING IS THE ACTUAL CLUSTERING   
clustered.data.ward <- hclust(distanceMatrixHC, method = "ward.D")
clustered.data.complete <- hclust(distanceMatrixHC, method = "complete")
clustered.data.average <- hclust(distanceMatrixHC, method = "average")
clustered.data.single <- hclust(distanceMatrixHC, method = "single")

str(clustered.data.ward)
## List of 7
##  $ merge      : int [1:99, 1:2] -23 -2 -58 -99 -52 -3 -51 -60 -86 -17 ...
##  $ height     : num [1:99] 0.699 0.714 0.732 0.733 0.758 ...
##  $ order      : int [1:100] 51 23 72 86 52 58 89 32 60 61 ...
##  $ labels     : chr [1:100] "1864-10-10_printrun_6" "1864-04-16_article_26" "1864-10-01_ad-blank_69" "1864-03-23_article_58" ...
##  $ method     : chr "ward.D"
##  $ call       : language hclust(d = distanceMatrixHC, method = "ward.D")
##  $ dist.method: NULL
##  - attr(*, "class")= chr "hclust"
plot(clustered.data.ward, labels=FALSE, main="Ward")
abline(h=1.25, col="blue", lty=3)
rect.hclust(clustered.data.ward, k=10, border="red")

You can use rect.hclust() either with h=, which cuts the tree at a given height (thus determining the number of clusters), or with k=, which cuts the tree into a given number of clusters.
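For example, both variants can be drawn on the same dendrogram (a small sketch reusing the Ward tree from above):

plot(clustered.data.ward, labels=FALSE, main="Ward")
rect.hclust(clustered.data.ward, h=1.25, border="blue") # cut the tree at height 1.25
rect.hclust(clustered.data.ward, k=10, border="red")    # cut the tree into 10 clusters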

9.3.1 PCA viz for HCLUST

Note that PCA (principal component analysis) and clustering are not the same thing, but PCA can also be used to visualize hierarchical clustering:

library(factoextra)

set.seed(1)
distanceMatrixHC.scaled <- scale(distanceMatrixHC)
hc.cut <- hcut(distanceMatrixHC.scaled, k = 3, hc_method = "ward.D")
fviz_cluster(hc.cut, labelsize=0, ellipse.type = "convex")

plot(clustered.data.complete, labels=FALSE, main="Complete")

plot(clustered.data.average, labels=FALSE, main="Average")

plot(clustered.data.single, labels=FALSE, main="Single")

hclust() creates an object from which you can extract further information. For example, we can use the cutree() function to extract cluster membership. As with rect.hclust(), you can call cutree() either with h=, which cuts the tree at a given height (thus determining the number of clusters), or with k=, which cuts the tree into a given number of clusters.
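For instance, cutting the same tree at a fixed height rather than into a fixed number of clusters would look like this (a small sketch; the analysis below uses k=):

clusters_by_height <- cutree(clustered.data.ward, h=1.25)
table(clusters_by_height) # how many items end up in each cluster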

clusters_named_vector <- cutree(clustered.data.ward, k=10)
clusters_df <- stack(clusters_named_vector)
colnames(clusters_df) <- c("cluster", "id")
clusters_df <- clusters_df %>% select(id, cluster)

knitr::kable(head(clusters_df, 15), table.attr = "class=\"striped\"", format="html")
id cluster
1864-10-10_printrun_6 1
1864-04-16_article_26 2
1864-10-01_ad-blank_69 3
1864-03-23_article_58 4
1864-08-23_article_36 4
1864-05-09_advert_47 1
1864-01-06_article_114 4
1864-04-14_article_39 2
1864-12-12_advert_116 4
1864-10-12_article_14 4
1864-10-27_advert_59 4
1864-07-11_advert_55 5
1864-03-09_article_53 6
1864-09-12_advert_4 7
1864-01-07_article_68 4

We can then left_join these clustering results with the original table and manually check if our clustering makes sense.

d1864_clustering <- d1864 %>%
  left_join(clusters_df, by = "id")

Let’s print out a few clusters:

cluster <- d1864_clustering %>% filter(cluster == 9) %>%
  select(text)

knitr::kable(cluster, table.attr = "class=\"striped\"", format="html")
text
Twelve Hundred and fifty Dollars reward. – A reward of two Hundred and fifty Dollars will be paid for the delivery to me, in this city, of each one of the following Slaves, who have absconded from the Carbon Hill Mines, in this county, during the past six months: Charles, a mulatto, about five feet nine inches in height, thirty-two years old, not stout, with very noticeable grey eyes; from Orange county. Daniel, a dark brown negro, about five feet eight inches in height, nineteen or twenty years old, very sprightly countenance, and very talkative. William, dark brown or black, five feet eight or nine inches in height, very slightly made, thirty-two or thirty-four years old, very quiet in his demeanor; from Gloucester county Virginia. Festus, dark brown, five feet seven or eight inches in height, well made, pleasant countenance and good address, twenty-two to twenty-four years of age. Lewis, black, five feet seven or eight inches in height, square built, but not very stout; twenty-eight or thirty years of age; limps slightly. The two latter have relations living in Richmond, and have been employed for some years in Manchester. John J. Werth, Agent. de 2 – 2aw4w
50 dollars reward. – Ran away from my farm, near White Oak Swamp bridge, my negro boy Mitchell. Said negro is about 18 years old, black, five feet high, has a scar over the left temple; had on when last seen a slouched hat and black woolen pants; supposed to be lurking about Richmond. The above reward will be given if delivered at my farm, or Mr. Jack Fisher’s. Loftin D. Allen. Henrico county. je 10 – cod3t
One hundred and fifty dollars reward. – Ran away, on Monday, 17th instant, a Negro Boy, sixteen years old; about five feet eight or nine inches high, very nearsighted, and black, with a scar on the left cheek, under the eye. He has been seen in the city every day. The above reward will be paid for his delivery at the New Richmond Theatre, corner Seventh and Broad streets. oc 21 – ts
50 dollars reward. – Ran away on Monday, the 18th inst, from Mrs Lucy C Binford, , a negro girl about eighteen years old, well grown, dark brown, slender made — She was at the head of Mechanicsville Turnpike, Hanover county, and she is probably lurking about that neighborhood. She has a husband living with Dr John G Lempkis. I will pay the above reward if delivered to me, or secured in jail so that I can get her. J. B. Keesee, Adm’r of W. M. A. Binford, dec’d. Henrico co, July 25, 1864 jy 26 – 2t
500 dollars reward. – I will give the above reward for the apprehension of, and delivery to me; at my office, or his lodgement in jail so that I can get him) of my servant boy George Henry Ray. Said boy is a bright mulatto, about five feet high, wooly hair, has a swaggering walk, speaks very quick, and has a very sullen look. He may possibly be lurking about the city, as he has acquaintances in every part of it, but my impression is that he is endeavoring to get to Fredericksburg, from thence to Stafford, to the farm of the late Major John Seddon where he has a father.) in order to pass the lines. Soldiers in the camps are cautioned against employing said boy. M A Blackman, Surgeon and Dentist, No. 83 Main street, Richmond, Va. As he has been accompanying the to the hospitals, he has a pass from Capt Coke, requesting the guard to pass him unmolested. M A B. jy 22 – ts
Ran away – From the gravel train on Sunday, 28th ult, four negroes, named Ned, Frederick, Efford, and Albert, hired of Mrs. A C Isbell, of Cumberland county. Efford and Albert are of a bright gingerbread color, 5 feet 10 inches high; Ned and Frederick are of dark complexion, stout, 5 feet 7 inches high. The usual reward will be paid for their apprehension. C G Talbott. Supt Richmond and Denville R R. mh 6 – ts
200 dollars reward. – Ranaway from the subscriber about the 17th of January last, two slaves, named Doctor and Raleigh commonly called Flem. Both are black, quick, active man. – Doctor is about 19 years old, had his left hand hurt in an apple mill and scarred on the outside, and has lost one or more of his nails from that hand. He had on a red flannel shirt when he left. Flem is left handed, and is about 17 years of age. I will give $100 a piece for the delivery to me, or to some jail from which I can get them, of the said slaves, if caught out of the county of Charlotte, or $50 a piece if apprehended in the county. I believe they are passing as free men, and are trying to get employment on the Richmond and Danville, or Southside Railroad. Address, Henry E. Edmunds, Moesingford P. O., Charlotte county. Va. fe 16 – 1m
200 dollars reward. – Ran off on 19th March a negro woman named Creasy, a very small black woman, thick lips, and large nose, and very short spoken. Carried off with her two new striped homespun yarn dresses, one brown and the other black. The above reward will be given for her apprehension if taken out of the county, at one hundred if taken in the county, in either case to be delivered to me or secured in jail so I get her again. William Priddy. Negro Foot P. O., Hanover co, Va. ap 2 – 6t
300 dollars reward – will be paid for the delivery of my servant girl Carellec, who left on Tuesday morning, the 1st instant. She is black, about 18 years of age, well grown, rather stout, has round face and thick lips. A Rodeker, Druggist, No. 19 Main st. au 6 –
Five hundred dollars reward. – Ran away from the Richmond Arsenal, where he was hired, about the 20th of July, my negro man, Peter miles, sometimes called Peter Redd. He is twenty-two or twenty-three years of age, five feet eight or nine inches high, gingerbread color, rather a long face, nose rather long and flat, and carries himself very erect. When he left the arsenal be wore a grey jacket and cap. He is no doubt making an effort to pass through our lines. I will give the above reward if caught beyond the limits of the city, or two hundred and fifty dollars if caught within the corporate limits and secured in any jail so that I get him. Neal McCURDY. au 10 – 6t*
Ran away from my Farm, at the Half-way House, on the Richmond and Petersburg railroad, Chesterfield country, my man Richard. He left my farm last Tuesday morning, the 9th instant, and had on when he left a pair of dark pants, white cotton shirt, and had on a pair of shoes, no coat nor hat. He is about twenty or twenty-one years old, five feet six or seven inches high, black, has a small moustache, and speaks slow. I bought him last April, of Lee & Bowman, in Richmond. He formerly belonged to Miss Margaret Bottom, of Amelia Courthouse. He has a wife at or near Amelia Courthouse, and may be trying to go there. He was last seen near the Half-Way Station. I will pay a liberal reward if caught and put in jail, or delivered to me. Address J. M. Wolff, 64 Main street, Richmond, Va., or Proctor’s Creek, Chesterfield county. au 17 – 6t*
Ranaway. – $100 reward. – From the subscriber, in 10th March, 1864, my men Washington, again about 26 years, about complexion, large mouth will be given if taken out of the State, and if taken in it, and secured so I get him. My address in Fork Union, Fluvanna co, Va., Samuel R. Pellet. ap 26 – 1aw4t
100 Dollars reward. – Ranaway from the Midlothian Coal Mines, a negro man named Joe, or Joe Hampton. He is about 25 years old, of dark brown color, spare made, about 5 feet 10 inches high, with rather large eyes, and somewhat wild expression of countenance, though generally smiling when spoken to. He was bought in January last of Mr C C Burton, near Petersburg, where his friends and connexions are, and he is probably in the neighborhood of that place. The above reward will be paid for his apprehension and delivery in any jail, or to the agent of the Company, at their mines, or in Richmond. my 12 – ts
Ruanaway. – $100 reward will be paid for the delivery to S N Davis & Co. of a negro boy named John. He is about 18 years old, gingerbread color; he had on a black felt hat, boots tipped on the toes, and gray pantaloons, when he left Friday. He was raised in Albemarle, by Dr. T J Cook Geo. Turner mh 14 – 6t
cluster <- d1864_clustering %>% filter(cluster == 7) %>%
  select(text)

knitr::kable(cluster, table.attr = "class=\"striped\"", format="html")
text
For rent – A store neatly fitted up, and the best stand in the town. Apply to John Perry, Ashland Hotel. fe 6 – 2t*
For Rent – Two nicely furnishes rooms, suitable for ladies and gentlemen, or gentlemen, with use of parlor and kitchen. Apply at the corner house on the north corner of Marshall and Adams streets. mh 7 – 1t*
Shockoe Mill, Seventeenth street, across the Dock. Wheat and Corn ground on tell. For sale – Nice New Flour at $1. 25 per pound. Corn Meal at $55 per bushel. au 4 – 10t* For Rent and Sale.
For Rent, a furnished House in a pleasant part of the city. For terms, address " E. S.," with references, Dispatch office. se 12 – 3t*
For Rent, a very desirable Dwelling-House, with eight rooms, a spacious yard, out-houses, & c., on Nineteenth street, one door from Grace. Possession given on 1st October. Apply to Mrs. Branch, Franklin street, between Eighteenth and Nineteenth. se 12 – eod3t*
cluster <- d1864_clustering %>% filter(cluster == 5) %>%
  select(text)

knitr::kable(cluster, table.attr = "class=\"striped\"", format="html")
text
Silk culture. Some of our citizens are disposed to turn their attention to the culture of silk. It would afford a new and most profitable employment to women and children. Many years ago there was a great deal of silk produced in this State. The morus multicaulis was extensively cultivated, and almost every house had its room devoted to silk worms, and their feeding and spinning. The business was abandoned, however, and the multicaulis trees were taken up by the roots, and cast out upon the highways to die. Yet there are no doubt some scattering bushes which escaped the general massacre which occurred when it was concluded that the enterprise was unprofitable. Moreover, it may be that some person having faith in the old adage that everything — the most useless, even — comes in use once in seven years, may have saved and kept on hand a small stock of the silk worms. There is now an enquiry for them, and if anybody has ever so small a family, their eggs will sell almost as high as hen’s eggs. The have but to announce that they have the genuine silk worm, to invite orders from every direction.
50 Dollars reward. – Stolen from the residence of James Sinton, Jr, on Franklin st, on Sunday, the 8th last, one new large black Silk Circular, trimmed with white buttons. Also, one Stella Shawl. The above reward will be given for the return of the articles to the " Illustrated News Office," and no questions asked. my 9 – 3t
Blockade Goods. – Mantle Depot, 60 Main, between 14th and 15th streets. – Ladies’silk parasols, large sizes; white silk illusion; superior quality bonnet ribbons; blond laces; blk silk lacer; blk English crapes; blk English crapes; blk dress silk; bonnet frames; checked muslins, jaconets; cambrics; bleached cotton; extra heavy English cotton hose; lists hose; crape collars; palm leaf fans; black alpacas; castile soap; Ezekiel’s hair tonic; tooth brushes; pomades; collogue, & c, & c. Mantle Depot, 69 Main, bet 14th and 15th sts jy 11 – 3t*

9.3.2 Determining the optimal number of clusters: “Elbow Method” and “Average Silhouette Method”

In a nutshell, we repeatedly run the clustering, increasing the number of clusters by one each time, and calculate the total within-cluster sum of squares (wss). We then plot the wss curve and look for the point with the sharpest bend (hence the “elbow”), which is taken as an indicator of the appropriate number of clusters. library(factoextra) can perform this with a single command:
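To make the idea more concrete, here is roughly the same computation done by hand. This is only a sketch: it uses the tf-idf sample matrix and the Ward tree from above, whereas fviz_nbclust() in the next chunk re-clusters the distance matrix for you and draws the plot.

wss_for_k <- function(k, hc, mat) {
  clusters <- cutree(hc, k=k)
  sum(sapply(unique(clusters), function(cl) {
    members <- mat[clusters == cl, , drop=FALSE]
    sum(scale(members, scale=FALSE)^2) # squared deviations from the cluster centroid
  }))
}

wss <- sapply(1:10, wss_for_k, hc=clustered.data.ward, mat=sample_matrix)
plot(1:10, wss, type="b", xlab="number of clusters k", ylab="total within-cluster sum of squares")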

set.seed(786)
fviz_nbclust(as.matrix(distanceMatrixHC), FUN = hcut, k.max=10, method = "wss")

Ideally, we should see something like an L or a V. Here, perhaps, 3 is our optimal number. We can try another method, the average silhouette method (which is also easily callable from library(factoextra)). As with the elbow method, we run the clustering multiple times, but here we measure the quality of each clustering by determining how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
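The average silhouette width for a single k can also be computed directly with library(cluster), which we loaded at the beginning (a sketch; fviz_nbclust() below simply repeats this for a range of k values):

sil <- silhouette(cutree(clustered.data.ward, k=3), distanceMatrixHC)
mean(sil[, "sil_width"]) # average silhouette width for k = 3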

set.seed(786)
fviz_nbclust(as.matrix(distanceMatrixHC), FUN = hcut, k.max = 10, method = "silhouette")

More on hierarchical clustering: “Hierarchical Cluster Analysis”, in the U of Cincinnati Business Analytics R Programming Course, http://uc-r.github.io/hc_clustering.

9.4 K-means clustering

Let’s take a different sample from our data. Since k-means clustering copes better with large datasets, we can run it on more documents:

# RANDOMLY SELECT N ITEMS
set.seed(48965)
N = 1000
sampledIDs <- sample(unique(df_TF_IDF$id), N)
sample_d1864_tfidf <- df_TF_IDF %>% filter(id %in% sampledIDs) %>%
  select(id, word, tf_idf)

# CONVERT INTO DTM MATRIX
colnames(sample_d1864_tfidf) <- c("document", "term", "count")
sample <- tibble(sample_d1864_tfidf) %>% cast_dtm(document, term, count)
sample_matrix <- as.matrix(sample)

# CONVERT INTO REGULAR MATRIX AND CALCULATE DISTANCES
distanceMatrix <- dist.cosine(sample_matrix) # from library(stylo)
distanceMatrixKM <- as.dist(distanceMatrix)
kmeansClusters <- kmeans(distanceMatrixKM, centers=5, nstart=25)
str(kmeansClusters)
## List of 9
##  $ cluster     : Named int [1:1000] 2 2 2 2 2 2 5 2 2 2 ...
##   ..- attr(*, "names")= chr [1:1000] "1864-01-28_article_49" "1864-10-10_printrun_6" "1864-10-10_advert_3" "1864-06-22_article_22" ...
##  $ centers     : num [1:5, 1:1000] 1 0.997 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:1000] "1864-01-28_article_49" "1864-10-10_printrun_6" "1864-10-10_advert_3" "1864-06-22_article_22" ...
##  $ totss       : num 2199
##  $ withinss    : num [1:5] 9.03 1313.82 84.42 186.05 9.94
##  $ tot.withinss: num 1603
##  $ betweenss   : num 595
##  $ size        : int [1:5] 12 806 37 121 24
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

We can extract cluster information in the same manner as with the hclust() object:

kmeans_clusters_df <- stack(kmeansClusters$cluster)
colnames(kmeans_clusters_df) <- c("cluster", "id")
kmeans_clusters_df <- kmeans_clusters_df %>% select(id, cluster)

head(kmeans_clusters_df)
##                      id cluster
## 1 1864-01-28_article_49       2
## 2 1864-10-10_printrun_6       2
## 3   1864-10-10_advert_3       2
## 4 1864-06-22_article_22       2
## 5 1864-04-16_article_26       2
## 6 1864-03-07_advert_112       2

We can visualize the results of our k-means clustering using fviz_cluster() (from factoextra). Our data is multidimensional (each word in our matrix is a separate dimension), so this function will perform principal component analysis (PCA) and plot the data according to the first two principal components, which explain the largest share of the variance in our dataset.
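If you want to see how much variance those two components actually capture, you can run the PCA step yourself (a small sketch; fviz_cluster() does the equivalent internally):

pca <- prcomp(as.matrix(distanceMatrix))
summary(pca)$importance[, 1:2] # standard deviation and share of variance for PC1 and PC2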

set.seed(786)
fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)

9.4.1 Determining the optimal number of clusters: “Elbow Method” and “Average Silhouette Method”

In a nutshell, we repeatedly run the clustering, increasing the number of clusters by one each time, and calculate the total within-cluster sum of squares (wss). We then plot the wss curve and look for the point with the sharpest bend (hence the “elbow”), which is taken as an indicator of the appropriate number of clusters. library(factoextra) can perform this with a single command:
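Since kmeans() already reports the total within-cluster sum of squares, the “by hand” version of the elbow computation is very short (a sketch, assuming distanceMatrixKM from above; it may take a moment to run):

set.seed(786)
wss_km <- sapply(1:20, function(k) kmeans(distanceMatrixKM, centers=k, nstart=10)$tot.withinss)
plot(1:20, wss_km, type="b", xlab="number of clusters k", ylab="total within-cluster sum of squares")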

set.seed(786)
fviz_nbclust(as.matrix(distanceMatrix), kmeans, k.max=20, method = "wss")

Ideally, we should see something like an L or a V. Here the results do not seem very helpful. We can try another method, the average silhouette method (which is also easily callable from library(factoextra)). As with the elbow method, we run the clustering multiple times, but here we measure the quality of each clustering by determining how well each object lies within its cluster. A high average silhouette width indicates a good clustering.

set.seed(786)
fviz_nbclust(as.matrix(distanceMatrix), kmeans, k.max = 20, method = "silhouette")

Let’s try to visualize our clusters again:

set.seed(786)
kmeansClusters <- kmeans(distanceMatrixKM, centers=5, nstart=25)
fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)

Here, however, is an example of how k-means clustering may fail: we gave the algorithm a different number of random starts (nstart), and the clustering outcome may change as a result:
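One way to see this sensitivity directly is to give the algorithm a single random start and vary only the seed (a quick sketch; the cluster sizes may or may not differ between the two runs):

set.seed(1); kmeans(distanceMatrixKM, centers=5, nstart=1)$size
set.seed(2); kmeans(distanceMatrixKM, centers=5, nstart=1)$size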

set.seed(786)
kmeansClusters <- kmeans(distanceMatrixKM, centers=5, nstart=50)
fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)

For more information, see:

9.5 Other “clustering” methods

Although not cluster analysis techniques strictly speaking, PCA (principal component analysis) and MDS (multi-dimensional scaling) are used in similar ways. We will touch upon these in the context of stylometry.

9.6 Topic Modeling

9.6.1 Topics?

9.6.1.1 Example 1

Thursday, March 27, 1862

LIGHT ARTILLERY

—I am authorized by the Governor of Virginia to raise a Company of Light Artillery for the war. All those desirous of enlisting in this the most effective arm of the service, would do well to call at once at the office of Johnson & Guigon, Whig Building.

Uniforms and subsistence furnished.

A. B. GUIGON. mh 25—6t

9.6.1.2 Example 2

Wednesday, August 17, 1864

Royal Marriages.

—There is a story circulated in Germany, and some in Paris, that the match between the heir-apparent of the Imperial throne of Russia and the Princess Dagmar of Denmark having been definitively broken off, another is in the course of negotiation between His Imperial Highness and the Princess Helens of England.

9.6.1.3 Example 3

Monday, June 22, 1863

NEWS FROM EUROPE.

The steamship Scotia arrived at New York on Thursday from Europe, with foreign news to the 7th inst. The news is not important. The Confederate steamer Lord Clyde was searched by order of the British Government, but nothing contraband being found on board her she was permitted to sail. The Russians have been defeated near Grochoury by the Polish insurgents. The three Powers have sent an earnest note to Russia, asking for a representative Government, a general amnesty, and an immediate cessation of hostilities in Poland.

9.6.2 Getting to code

We can start with our preprocessed variable d1864_clean_minusSW, which is essentially a word frequency list for each article (with low-frequency items and stop words removed).

library(tm)
library(topicmodels)
library(tictoc)
head(d1864_clean_minusSW)
## # A tibble: 6 × 3
## # Groups:   id [5]
##   id                    word      n
##   <chr>                 <chr> <int>
## 1 1864-09-12_advert_30  mrs     307
## 2 1864-09-12_advert_30  miss    251
## 3 1864-04-18_article_10 mr       84
## 4 1864-08-27_article_41 slave    79
## 5 1864-11-24_article_46 mr       71
## 6 1864-04-26_article_8  enemy    64
d1864_dm <- d1864_clean_minusSW %>%
  cast_dtm(id, word, n)
d1864_dm
## <<DocumentTermMatrix (documents: 14617, terms: 30391)>>
## Non-/sparse entries: 1162644/443062603
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)

Training a model. NB: the eval=FALSE setting will prevent this chunk from running when you Knit the notebook, but you can still execute it within the notebook when you run chunks individually.

tic()

d1864_lda <- LDA(d1864_dm, k = 4, control = list(seed = 1234))
d1864_lda

toc()

#A LDA_VEM topic model with 2 topics.
#35.962 sec elapsed

#A LDA_VEM topic model with 4 topics.
#72.545 sec elapsed

Do not run this in class: it takes too long!

tic()

kVal <- 25
d1864_lda_better <- LDA(d1864_dm, k=kVal, control=list(seed=1234))

toc()

#A LDA_VEM topic model with 20 topics.
#1261.087 sec elapsed (21 minutes)

#A LDA_VEM topic model with 25 topics.
#1112.262 sec elapsed (18 minutes)

Save/load the model, so that there is no need to regenerate it every time.

#d1864_lda_vem_25t_better <- d1864_lda_better
#save(d1864_lda_vem_25t_better, file="./files/data/d1864_lda_vem_25t_better.rda")
#load(file="./data/d1864_lda_vem_25t_better.rda")
load(file="./files/data/d1864_lda_vem_25t_better.rda")
lda_model <- d1864_lda_vem_25t_better
corpus <- d1864

From this point on, the code should simply run, provided you assign your own model produced above to lda_model.
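For example, if you trained the 25-topic model above yourself, you could point lda_model at it instead of the pretrained one:

# lda_model <- d1864_lda_better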

9.6.3 Per-topic-per-word probabilities (beta)

lda_model_better_topic_term_prob <- tidy(lda_model, matrix="beta")

lda_model_better_topic_term_prob %>%
  filter(term == "bonds") %>%
  arrange(desc(beta))
## # A tibble: 25 × 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     8 bonds 3.53e- 2
##  2     9 bonds 5.05e- 4
##  3    24 bonds 3.92e- 4
##  4    11 bonds 2.34e-13
##  5     2 bonds 2.13e-13
##  6     1 bonds 3.67e-16
##  7    16 bonds 2.01e-20
##  8    14 bonds 3.08e-21
##  9    18 bonds 3.87e-22
## 10     3 bonds 2.00e-22
## # … with 15 more rows

NB: This step may throw an error. The error seems a bit cryptic, but restarting R (without saving workspace) seems to help. (beta stands for term-per-topic probability).

top_terms <- lda_model_better_topic_term_prob %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

head(top_terms)
## # A tibble: 6 × 3
##   topic term          beta
##   <int> <chr>        <dbl>
## 1     1 york       0.0113 
## 2     1 mr         0.0102 
## 3     1 lincoln    0.00915
## 4     1 union      0.00827
## 5     1 states     0.00809
## 6     1 washington 0.00689
library(ggplot2)

graph <- top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Here is the same plot at a slightly better quality:

Topic-per-document probabilities (gamma): this object will tell us which topics each document belongs to, and to what extent.

lda_model_topic_doc_prob <- tidy(lda_model, matrix="gamma")
lda_model_topic_doc_prob
## # A tibble: 356,150 × 3
##    document              topic     gamma
##    <chr>                 <int>     <dbl>
##  1 1864-09-12_advert_30      1 0.0000115
##  2 1864-04-18_article_10     1 0.709    
##  3 1864-08-27_article_41     1 0.0000581
##  4 1864-11-24_article_46     1 0.0614   
##  5 1864-04-26_article_8      1 0.0000152
##  6 1864-05-09_article_1      1 0.0000152
##  7 1864-05-14_article_23     1 0.139    
##  8 1864-05-16_article_6      1 0.122    
##  9 1864-06-27_article_2      1 0.0000270
## 10 1864-04-16_article_15     1 0.620    
## # … with 356,140 more rows
  1. Pick a document and print out the topics it belongs to (from the most prominent to the least prominent). (Hint: use the object we just created > filter > arrange).
# your code here

Top N documents per topic: this will create an object with the top N documents for each topic.

N = 10

top_docs <- lda_model_topic_doc_prob %>%
  group_by(topic) %>%
  top_n(N, gamma) %>%
  ungroup() %>%
  arrange(topic, -gamma)

top_docs
## # A tibble: 285 × 3
##    document               topic gamma
##    <chr>                  <int> <dbl>
##  1 1864-09-14_article_65      1 0.992
##  2 1864-08-31_article_103     1 0.989
##  3 1864-09-08_article_34      1 0.984
##  4 1864-04-19_article_30      1 0.980
##  5 1864-03-23_article_68      1 0.977
##  6 1864-06-03_article_12      1 0.965
##  7 1864-12-29_article_47      1 0.957
##  8 1864-03-10_article_118     1 0.956
##  9 1864-06-08_article_15      1 0.940
## 10 1864-09-01_article_65      1 0.939
## # … with 275 more rows

Now that we have the IDs of representative documents, we could check those documents one by one, but let’s do something else first: a topic-title function. It is not really necessary, but it combines the topic number (from the model) with its top words, which can be used for graph titles.

topicVar <- function(top, topTerms){
  topicVar <- topTerms %>%
    filter(topic == top) %>%
    arrange(-beta)
  vals = paste(topicVar$term, collapse=", ")
  as.String(paste(c("Topic ", top, ": ", vals), collapse=""))
}
topicNum = 8 
idTest   = "1864-08-31_orders_74"

topicVar(topicNum, top_terms)
## Topic 8: treasury, bonds, notes, 1864, states, cent, certificates, confederate, secretary, department, america, richmond, payment, interest, treasurer
print("---")
## [1] "---"
as.String(d1864[d1864$id==idTest, ]["text"])
## Treasury Department, Confederate States of America, Richmond, August 8, 1864. Certificates of Indebtedness Bearing Six Per Cent. Per Annum interest and Free from Taxation. -- By the fourteenth section of the act to reduce the currency, approved February 17, 1864, the Secretary of the Treasury is authorized to issue the above certificates, payable two years after the ratification of a treaty of peace with the United States. They cannot be sold, but are only to be issued to such creditors of the Government as are willing to receive the same in payment of their demands. They must also be given at par, though free from taxation. The attention of purchasing agents and disbursing officers of the Government is called to this class of public securities as offering peculiar advantages to those from whom the supplies of the Government are bought; and to facilitate the use of them, checks drawn by disbursing officers upon the depositaries holding these funds, and marked across the face " payable in certificates of indebtedness," will be paid in conformity therewith. Depositaries are hereby authorized and required to comply with this regulation, and to make application to the Register for supplies of certificates as required. Signed) G. A. Trenholm, Secretary of Treasury. au 22 -- ts

9.6.4 Topics over time

corpus_light <- corpus %>%
  select(-header, -text)

lda_model_topics <- lda_model_topic_doc_prob %>%
  rename(id=document) %>%
  left_join(corpus_light, by="id") %>%
  group_by(topic, date) %>%
  summarize(freq=sum(gamma))
## `summarise()` has grouped output by 'topic'. You can override using the
## `.groups` argument.
lda_model_topics
## # A tibble: 7,725 × 3
## # Groups:   topic [25]
##    topic date        freq
##    <int> <date>     <dbl>
##  1     1 1864-01-01 0.219
##  2     1 1864-01-02 0.731
##  3     1 1864-01-04 1.02 
##  4     1 1864-01-05 0.587
##  5     1 1864-01-06 0.708
##  6     1 1864-01-07 1.35 
##  7     1 1864-01-08 0.400
##  8     1 1864-01-09 0.580
##  9     1 1864-01-11 0.838
## 10     1 1864-01-12 0.783
## # … with 7,715 more rows

Now, we can plot topic distribution over time:

topicVal = 8

lda_model_topics_final <- lda_model_topics %>%
  filter(topic == topicVal)

graph2 <- plot(x=lda_model_topics_final$date, y=lda_model_topics_final$freq, type="l", lty=3, lwd=1,
     main=topicVar(topicVal, top_terms),
     xlab = "1864 - Dispatch coverage", ylab = "topic saliency"); segments(x0=lda_model_topics_final$date, x1=lda_model_topics_final$date, y0=0, y1=lda_model_topics_final$freq, lty=1, lwd=2)

9.6.5 Exploring topics

LDAvis offers a visual browser for topics, which has already become a very popular tool for this purpose. If everything is done right, a visualization similar to the one below should open in a browser.

LDAvis Browser Example.

We can use the following function that extracts all needed information from a model and converts it into a format that LDAvis expects:

library(LDAvis)
library(slam)

topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]], 
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}
serVis(topicmodels2LDAvis(lda_model))

NB: there are some issues with LDAvis, and for some reason it does not always parse a topic model object correctly. We can try loading another one, which does work: this is a 20-topic model based on issues of the Dispatch covering the period of 1861-1864.

load(file="./files/data/dispatch_lda_vem_better.rda")
serVis(topicmodels2LDAvis(dispatch_lda_vem_better))

9.7 Addendum: different distances code sample

This is simply a chunk of code that you can reuse for generating different distance matrices. It will not run as-is, because the variables YOUR_MATRIX, distanceMethod, and clusteringMethod are not defined!

#library(stylo)

# YOUR_MATRIX must be generated before you run the code below

# USING library(stylo) FUNCTIONS
if (distanceMethod == "cosine"){distanceMatrix <- dist.cosine(YOUR_MATRIX)
} else if (distanceMethod == "delta"){distanceMatrix <- dist.delta(YOUR_MATRIX)
} else if (distanceMethod == "argamon"){distanceMatrix <- dist.argamon(YOUR_MATRIX)
} else if (distanceMethod == "eder"){distanceMatrix <- dist.eder(YOUR_MATRIX)
} else if (distanceMethod == "minmax"){distanceMatrix <- dist.minmax(YOUR_MATRIX)
} else if (distanceMethod == "enthropy"){distanceMatrix <- dist.enthropy(YOUR_MATRIX)
} else if (distanceMethod == "simple"){distanceMatrix <- dist.simple(YOUR_MATRIX)
} else if (distanceMethod == "wurzburg"){distanceMatrix <- dist.wurzburg(YOUR_MATRIX)
# USING dist() FUNCTION
} else if (distanceMethod == "euclidean"){distanceMatrix <- dist(YOUR_MATRIX, method="euclidean")
} else if (distanceMethod == "maximum"){distanceMatrix <- dist(YOUR_MATRIX, method="maximum")
} else if (distanceMethod == "manhattan"){distanceMatrix <- dist(YOUR_MATRIX, method="manhattan")
} else if (distanceMethod == "canberra"){distanceMatrix <- dist(YOUR_MATRIX, method="canberra")
} else if (distanceMethod == "binary"){distanceMatrix <- dist(YOUR_MATRIX, method="binary")
} else if (distanceMethod == "minkowski"){distanceMatrix <- dist(YOUR_MATRIX, method="minkowski")

distanceMatrix <- as.dist(distanceMatrix)
clustered.data <- hclust(distanceMatrix, method = clusteringMethod)

9.8 Homework

  • given in the chapter.

9.9 Submitting homework

  • Homework assignment must be submitted by the beginning of the next class;
  • Email your homework to the instructor as attachments.
    • In the subject of your email, please add the following: 57528-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the number of the lesson for which you submit homework; YourLastName is your last name; and YourMatriculationNumber is your matriculation number.

9.10 Additional Materials