9 Text Analysis III: Finding Groups of Texts
9.1 Goals
- introduce text clustering approaches that allow us to find groups of thematically similar texts:
- hierarchical clustering;
- k-means clustering;
- topic modeling;
9.2 Preliminaries
9.2.1 Libraries
The following are the libraries that we will need for this section. Install those that you do not have yet.
#install.packages("ggplot2", "LDAvis", "readr", "slam", "stringr", "tictoc", "tidytext", "tidyverse", "tm", "topicmodels")
# general
library(ggplot2)
# text analysis specific
library(readr)
library(slam)
library(stringr)
library(tidytext)
library(tidyverse)
library(tm)
library(topicmodels)
library(text2vec)
library(stylo)
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
# extra
library(tictoc) # to time operations
9.2.1.1 The Dispatch Data: Preprocessing
Texts for experiments:
- The Richmond Dispatch (1861)
- The Richmond Dispatch (1862)
- The Richmond Dispatch (1863)
- The Richmond Dispatch (1864)
Topic models, pretrained: these will be loaded later in this section (from ./files/data/).
We will limit ourselves to only one year, but, of course, the results are always more interesting with more data…
Loading…
<- read.delim("./files/data/dispatch_1864.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
d1864 $date <- as.Date(d1864$date, format="%Y-%m-%d") d1864
::kable(head(d1864, 1), table.attr = "class=\"striped\"", format="html") knitr
id | date | type | header | text |
---|---|---|---|---|
1864-04-28_article_1 | 1864-04-28 | article | The news. | The news. Yesterday was an unusual dull day, there being a great dearth in the news market, which was the subject of general remark. – The report which prevailed towards night, to the effect that a battle was progressing on the line of the Rapidan, we feel authorized to contradict. All was quiet when the Central train left Gordonsville yesterday evening. The cannonading heard in the direction of Germanna Ford, previously noticed, was caused by artillery practice. |
Let’s remove low-frequency items:
<- d1864 %>%
d1864_lowFreq unnest_tokens(word, text) %>%
count(word, sort=TRUE)
summary(d1864_lowFreq)
## word n
## Length:58774 Min. : 1.00
## Class :character 1st Qu.: 1.00
## Mode :character Median : 2.00
## Mean : 61.63
## 3rd Qu.: 8.00
## Max. :278040.00
<- d1864_lowFreq %>%
lowFreq filter(n <= 1)
summary(lowFreq)
## word n
## Length:28107 Min. :1
## Class :character 1st Qu.:1
## Mode :character Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
Most of these low-frequency items are typos:
knitr::kable(head(lowFreq, 15), table.attr = "class=\"striped\"", format="html")
word | n |
---|---|
0001850holstein482 | 1 |
00024 | 1 |
0003750 | 1 |
0004unshelled | 1 |
0005 | 1 |
000aliens | 1 |
000anne | 1 |
000another | 1 |
000april | 1 |
000august | 1 |
000banks’s | 1 |
000butchers5500blacksmiths2200wheelrights2200teamsters505 | 1 |
000butler’s | 1 |
000california50 | 1 |
000connecticut40 | 1 |
We can anti-join our corpus with lowFreq, which will remove them:
<- d1864 %>%
d1864_clean filter(type != "ad_blank")
<- d1864_clean %>%
d1864_clean unnest_tokens(word, text) %>%
anti_join(lowFreq, by="word") %>%
group_by(id) %>%
count(word, sort=TRUE)
# token counts, unfiltered: 2,815,144
# token counts, filtered: 2,749,078
Additionally, we need to remove stop words, but first we need to identify them.
<- d1864_clean %>%
d1864_clean_FL group_by(word) %>%
summarize(freq=sum(n)) %>%
arrange(desc(freq))
::kable(head(d1864_clean_FL, 15), table.attr = "class=\"striped\"", format="html") knitr
word | freq |
---|---|
the | 278040 |
of | 163446 |
and | 114985 |
to | 98596 |
in | 68070 |
a | 67022 |
that | 35265 |
for | 32885 |
on | 31679 |
was | 30730 |
is | 29191 |
at | 27723 |
be | 27673 |
by | 26285 |
from | 24300 |
To make things faster, you can simply remove the top 50, 100, 150, or 200 most frequent words, but this is a rather brutal approach. Ideally, you want to curate your own stop word list that is tuned to your texts. Below, I have taken the top 500 words and manually removed everything that was meaningful (or, better, what I considered meaningful). Additionally, there are NLP procedures designed to lemmatize words (i.e., reduce all words to their dictionary forms) and to do part-of-speech tagging, which allows us to remove words categorically (for example, keeping only nouns, adjectives, and verbs); see the sketch at the end of this subsection.
<- c("the", "of", "and", "to", "in", "a", "that", "for", "on", "was", "is", "at", "be", "by", "from", "his", "he", "it", "with", "as", "this", "will", "which", "have", "or", "are", "they", "their", "not", "were", "been", "has", "our", "we", "all", "but", "one", "had", "who", "an", "no", "i", "them", "about", "him", "two", "upon", "may", "there", "any", "some", "so", "men", "when", "if", "day", "her", "under", "would", "c", "such", "made", "up", "last", "j", "time", "years", "other", "into", "said", "new", "very", "five", "after", "out", "these", "shall", "my", "w", "more", "its", "now", "before", "three", "m", "than", "h", "o'clock", "old", "being", "left", "can", "s", "man", "only", "same", "act", "first", "between", "above", "she", "you", "place", "following", "do", "per", "every", "most", "near", "us", "good", "should", "having", "great", "also", "over", "r", "could", "twenty", "people", "those", "e", "without", "four", "received", "p", "then", "what", "well", "where", "must", "says", "g", "large", "against", "back", "000", "through", "b", "off", "few", "me", "sent", "while", "make", "number", "many", "much", "give", "1", "six", "down", "several", "high", "since", "little", "during", "away", "until", "each", "5", "year", "present", "own", "t", "here", "d", "found", "reported", "2", "right", "given", "age", "your", "way", "side", "did", "part", "long", "next", "fifty", "another", "1st", "whole", "10", "still", "among", "3", "within", "get", "named", "f", "l", "himself", "ten", "both", "nothing", "again", "n", "thirty", "eight", "took", "never", "came", "called", "small", "passed", "just", "brought", "4", "further", "yet", "half", "far", "held", "soon", "main", "8", "second", "however", "say", "heavy", "thus", "hereby", "even", "ran", "come", "whom", "like", "cannot", "head", "ever", "themselves", "put", "12", "cause", "known", "7", "go", "6", "once", "therefore", "thursday", "full", "apply", "see", "though", "seven", "tuesday", "11", "done", "whose", "let", "how", "making", "immediately", "forty", "early", "wednesday", "either", "too", "amount", "fact", "heard", "receive", "short", "less", "100", "know", "might", "except", "supposed", "others", "doubt", "set", "works")
word
<- data.frame(word)
sWordsDF
<- d1864_clean %>%
d1864_clean_minusSW anti_join(sWordsDF, by="word")
dim(d1864_clean)
## [1] 1765298 3
# 1,759,828
dim(d1864_clean_minusSW)
## [1] 1162644 3
# 1,159,214
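The lemmatization and part-of-speech tagging mentioned above can be done in R with, for example, the udpipe package. The sketch below only illustrates the approach and is not used elsewhere in this chapter (assumptions: udpipe is installed, and a pretrained English model is downloaded on first use):

library(udpipe)
udModelInfo <- udpipe_download_model(language = "english") # downloads a pretrained model (once)
udModel <- udpipe_load_model(udModelInfo$file_model)
annotated <- as.data.frame(udpipe_annotate(udModel, x = d1864$text[1:10]))
# keep only nouns, adjectives, and verbs, reduced to their dictionary forms (lemmas):
contentWords <- subset(annotated, upos %in% c("NOUN", "ADJ", "VERB"))$lemma
head(contentWords)

Note, though, that annotating the full Dispatch corpus this way would take quite a while; the curated stop word list above is the quicker route.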
9.2.2 TF-IDF
From Wikipedia: In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification. One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
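Before applying this to the Dispatch, a tiny worked example may help (a hypothetical two-document corpus):

toy <- tibble(document = c("d1", "d1", "d2", "d2"),
              word = c("the", "cotton", "the", "battle"),
              n = c(10, 2, 5, 3))
toy %>% bind_tf_idf(word, document, n)
# "the" occurs in both documents: idf = log(2/2) = 0, so its tf_idf is 0,
# no matter how frequent it is; "battle" occurs only in d2:
# tf = 3/8, idf = log(2/1) ≈ 0.69, tf_idf ≈ 0.26

In other words, ubiquitous words are weighted down to zero automatically, which is why tf-idf can double as a soft stop word filter. Now, the same weighting applied to the Dispatch: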
<- d1864_clean_minusSW %>% # d1864_clean, d1864_clean_minusSW
df_TF_IDF bind_tf_idf(word, id, n) %>%
arrange(desc(tf_idf)) %>%
ungroup
::kable(head(df_TF_IDF, 15), table.attr = "class=\"striped\"", format="html") knitr
id | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
1864-07-25_article_6 | suicides | 2 | 0.2500000 | 9.589941 | 2.397485 |
1864-09-12_advert_131 | leonard’s | 3 | 0.2307692 | 8.896793 | 2.053106 |
1864-02-05_article_19 | sissy | 3 | 0.2000000 | 9.589941 | 1.917988 |
1864-01-28_article_49 | generality | 1 | 0.2500000 | 7.644030 | 1.911008 |
1864-01-11_advert_105 | blacking | 6 | 0.2307692 | 8.203646 | 1.893149 |
1864-09-12_advert_131 | pills | 3 | 0.2307692 | 8.203646 | 1.893149 |
1864-12-08_article_70 | council | 2 | 0.4000000 | 4.508536 | 1.803414 |
1864-12-12_advert_143 | alf | 2 | 0.1818182 | 9.589941 | 1.743626 |
1864-09-27_article_95 | calendar | 2 | 0.2857143 | 6.006422 | 1.716120 |
1864-03-22_article_71 | apartments | 2 | 0.2222222 | 7.644030 | 1.698673 |
1864-12-29_article_42 | dishonest | 1 | 0.2500000 | 6.756727 | 1.689182 |
1864-09-10_article_61 | mobile | 2 | 0.5000000 | 3.373334 | 1.686667 |
1864-05-05_article_17 | ling | 1 | 0.2000000 | 8.203646 | 1.640729 |
1864-09-26_article_63 | hoy | 1 | 0.2000000 | 8.203646 | 1.640729 |
1864-03-30_article_50 | telegrams | 2 | 0.3333333 | 4.810817 | 1.603606 |
= "1864-07-25_article_6"
articleID filter(df_TF_IDF, id==articleID) %>%
arrange(desc(tf_idf))
## # A tibble: 6 × 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1864-07-25_article_6 suicides 2 0.25 9.59 2.40
## 2 1864-07-25_article_6 france 2 0.25 4.35 1.09
## 3 1864-07-25_article_6 suicide 1 0.125 6.50 0.812
## 4 1864-07-25_article_6 committed 1 0.125 3.09 0.387
## 5 1864-07-25_article_6 persons 1 0.125 2.61 0.327
## 6 1864-07-25_article_6 take 1 0.125 2.26 0.283
d1864$text[d1864$id==articleID]
## [1] "Suicides in France -- More than ten suicides take place every day in France; last year 4, 000 persons committed suicide."
9.3 Hierarchical clustering
Clustering is a method for breaking items into closely related groups, or clusters. The code below shows how our data can be broken into clusters with hierarchical clustering, using distance matrices. Hierarchical clustering has rather high precision, but it is sensitive to outliers and computationally expensive, which makes it nearly unusable with large datasets. K-means clustering is more suitable for large datasets, but it struggles with uniform data. In both cases you have to define the number of clusters.
In what follows, we take our TF-IDF data, sample it, and run cluster analysis on a small sample. The chunk below simply prepares our data for analysis:
# RANDOMLY SELECT N ITEMS
set.seed(48965)
N = 100
sampledIDs <- sample(unique(df_TF_IDF$id), N)
sample_d1864_tfidf <- df_TF_IDF %>% filter(id %in% sampledIDs) %>%
  select(id, word, tf_idf)

# CONVERT INTO DTM MATRIX
colnames(sample_d1864_tfidf) <- c("document", "term", "count")
sample <- sample_d1864_tfidf %>% cast_dtm(document, term, count)
sample_matrix <- as.matrix(sample)

# CONVERT INTO A REGULAR MATRIX AND CALCULATE DISTANCES
distanceMatrix <- dist.cosine(sample_matrix) # from library(stylo)
distanceMatrixHC <- as.dist(distanceMatrix)
Now we can do the actual clustering. There are several clustering/linkage methods that can be used, and the right choice usually depends on your goals.
From ?hclust: “A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note however, that methods “median” and “centroid” are not leading to a monotone distance measure, or equivalently the resulting dendrograms can have so called inversions or reversals which are hard to interpret …”
As a rule of thumb: you want balanced trees when you want a roughly even number of items assigned to each cluster. Unbalanced trees are useful for finding outliers; with this method you can identify which items you might want to remove in order to achieve better clustering.
You can find additional explanations here, here, and here.
# THE FOLLOWING IS THE ACTUAL CLUSTERING
<- hclust(distanceMatrixHC, method = "ward.D")
clustered.data.ward <- hclust(distanceMatrixHC, method = "complete")
clustered.data.complete <- hclust(distanceMatrixHC, method = "average")
clustered.data.average <- hclust(distanceMatrixHC, method = "single")
clustered.data.single
str(clustered.data.ward)
## List of 7
## $ merge : int [1:99, 1:2] -23 -2 -58 -99 -52 -3 -51 -60 -86 -17 ...
## $ height : num [1:99] 0.699 0.714 0.732 0.733 0.758 ...
## $ order : int [1:100] 51 23 72 86 52 58 89 32 60 61 ...
## $ labels : chr [1:100] "1864-10-10_printrun_6" "1864-04-16_article_26" "1864-10-01_ad-blank_69" "1864-03-23_article_58" ...
## $ method : chr "ward.D"
## $ call : language hclust(d = distanceMatrixHC, method = "ward.D")
## $ dist.method: NULL
## - attr(*, "class")= chr "hclust"
plot(clustered.data.ward, labels=FALSE, main="Ward")
abline(h=1.25, col="blue", lty=3)
rect.hclust(clustered.data.ward, k=10, border="red")
You can use rect.hclust() either with h=, which will cut the tree at a given height (thus determining the number of clusters), or with k=, which will cut the tree into a given number of clusters.
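For example, the following cuts the same dendrogram at a fixed height instead of into a fixed number of clusters (1.25 is the height marked by the dotted line above):

plot(clustered.data.ward, labels=FALSE, main="Ward")
rect.hclust(clustered.data.ward, h=1.25, border="blue")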
9.3.1 PCA viz for HCLUST
Note that PCA (principal component analysis) and clustering are not the same thing, but PCA can also be used to visualize hierarchical clustering:
library(factoextra)
set.seed(1)
<- scale(distanceMatrixHC)
distanceMatrixHC.scaled <- hcut(distanceMatrixHC.scaled, k = 3, hc_method = "ward.D")
hc.cut fviz_cluster(hc.cut, labelsize=0, ellipse.type = "convex")
plot(clustered.data.complete, labels=FALSE, main="Complete")
plot(clustered.data.average, labels=FALSE, main="Average")
plot(clustered.data.single, labels=FALSE, main="Single")
hclust() creates an object from which you can extract further information. For example, we can use the cutree() function to extract clustering information. You can use cutree() either with h=, which will cut the tree at a given height (thus determining the number of clusters), or with k=, which will cut the tree into a given number of clusters.
<- cutree(clustered.data.ward, k=10)
clusters_named_vector <- stack(clusters_named_vector)
clusters_df colnames(clusters_df) <- c("cluster", "id")
<- clusters_df %>% select(id, cluster)
clusters_df
::kable(head(clusters_df, 15), table.attr = "class=\"striped\"", format="html") knitr
id | cluster |
---|---|
1864-10-10_printrun_6 | 1 |
1864-04-16_article_26 | 2 |
1864-10-01_ad-blank_69 | 3 |
1864-03-23_article_58 | 4 |
1864-08-23_article_36 | 4 |
1864-05-09_advert_47 | 1 |
1864-01-06_article_114 | 4 |
1864-04-14_article_39 | 2 |
1864-12-12_advert_116 | 4 |
1864-10-12_article_14 | 4 |
1864-10-27_advert_59 | 4 |
1864-07-11_advert_55 | 5 |
1864-03-09_article_53 | 6 |
1864-09-12_advert_4 | 7 |
1864-01-07_article_68 | 4 |
We can then left_join these clustering results with the original table and manually check whether our clustering makes sense.
<- d1864 %>%
d1864_clustering left_join(clusters_df, by = "id")
Let’s print out a few clusters:
<- d1864_clustering %>% filter(cluster == 9) %>%
cluster select(text)
::kable(cluster, table.attr = "class=\"striped\"", format="html") knitr
text |
---|
Twelve Hundred and fifty Dollars reward. – A reward of two Hundred and fifty Dollars will be paid for the delivery to me, in this city, of each one of the following Slaves, who have absconded from the Carbon Hill Mines, in this county, during the past six months: Charles, a mulatto, about five feet nine inches in height, thirty-two years old, not stout, with very noticeable grey eyes; from Orange county. Daniel, a dark brown negro, about five feet eight inches in height, nineteen or twenty years old, very sprightly countenance, and very talkative. William, dark brown or black, five feet eight or nine inches in height, very slightly made, thirty-two or thirty-four years old, very quiet in his demeanor; from Gloucester county Virginia. Festus, dark brown, five feet seven or eight inches in height, well made, pleasant countenance and good address, twenty-two to twenty-four years of age. Lewis, black, five feet seven or eight inches in height, square built, but not very stout; twenty-eight or thirty years of age; limps slightly. The two latter have relations living in Richmond, and have been employed for some years in Manchester. John J. Werth, Agent. de 2 – 2aw4w |
50 dollars reward. – Ran away from my farm, near White Oak Swamp bridge, my negro boy Mitchell. Said negro is about 18 years old, black, five feet high, has a scar over the left temple; had on when last seen a slouched hat and black woolen pants; supposed to be lurking about Richmond. The above reward will be given if delivered at my farm, or Mr. Jack Fisher’s. Loftin D. Allen. Henrico county. je 10 – cod3t |
One hundred and fifty dollars reward. – Ran away, on Monday, 17th instant, a Negro Boy, sixteen years old; about five feet eight or nine inches high, very nearsighted, and black, with a scar on the left cheek, under the eye. He has been seen in the city every day. The above reward will be paid for his delivery at the New Richmond Theatre, corner Seventh and Broad streets. oc 21 – ts |
50 dollars reward. – Ran away on Monday, the 18th inst, from Mrs Lucy C Binford, , a negro girl about eighteen years old, well grown, dark brown, slender made — She was at the head of Mechanicsville Turnpike, Hanover county, and she is probably lurking about that neighborhood. She has a husband living with Dr John G Lempkis. I will pay the above reward if delivered to me, or secured in jail so that I can get her. J. B. Keesee, Adm’r of W. M. A. Binford, dec’d. Henrico co, July 25, 1864 jy 26 – 2t |
500 dollars reward. – I will give the above reward for the apprehension of, and delivery to me; at my office, or his lodgement in jail so that I can get him) of my servant boy George Henry Ray. Said boy is a bright mulatto, about five feet high, wooly hair, has a swaggering walk, speaks very quick, and has a very sullen look. He may possibly be lurking about the city, as he has acquaintances in every part of it, but my impression is that he is endeavoring to get to Fredericksburg, from thence to Stafford, to the farm of the late Major John Seddon where he has a father.) in order to pass the lines. Soldiers in the camps are cautioned against employing said boy. M A Blackman, Surgeon and Dentist, No. 83 Main street, Richmond, Va. As he has been accompanying the to the hospitals, he has a pass from Capt Coke, requesting the guard to pass him unmolested. M A B. jy 22 – ts |
Ran away – From the gravel train on Sunday, 28th ult, four negroes, named Ned, Frederick, Efford, and Albert, hired of Mrs. A C Isbell, of Cumberland county. Efford and Albert are of a bright gingerbread color, 5 feet 10 inches high; Ned and Frederick are of dark complexion, stout, 5 feet 7 inches high. The usual reward will be paid for their apprehension. C G Talbott. Supt Richmond and Denville R R. mh 6 – ts |
200 dollars reward. – Ranaway from the subscriber about the 17th of January last, two slaves, named Doctor and Raleigh commonly called Flem. Both are black, quick, active man. – Doctor is about 19 years old, had his left hand hurt in an apple mill and scarred on the outside, and has lost one or more of his nails from that hand. He had on a red flannel shirt when he left. Flem is left handed, and is about 17 years of age. I will give $100 a piece for the delivery to me, or to some jail from which I can get them, of the said slaves, if caught out of the county of Charlotte, or $50 a piece if apprehended in the county. I believe they are passing as free men, and are trying to get employment on the Richmond and Danville, or Southside Railroad. Address, Henry E. Edmunds, Moesingford P. O., Charlotte county. Va. fe 16 – 1m |
200 dollars reward. – Ran off on 19th March a negro woman named Creasy, a very small black woman, thick lips, and large nose, and very short spoken. Carried off with her two new striped homespun yarn dresses, one brown and the other black. The above reward will be given for her apprehension if taken out of the county, at one hundred if taken in the county, in either case to be delivered to me or secured in jail so I get her again. William Priddy. Negro Foot P. O., Hanover co, Va. ap 2 – 6t |
300 dollars reward – will be paid for the delivery of my servant girl Carellec, who left on Tuesday morning, the 1st instant. She is black, about 18 years of age, well grown, rather stout, has round face and thick lips. A Rodeker, Druggist, No. 19 Main st. au 6 – |
Five hundred dollars reward. – Ran away from the Richmond Arsenal, where he was hired, about the 20th of July, my negro man, Peter miles, sometimes called Peter Redd. He is twenty-two or twenty-three years of age, five feet eight or nine inches high, gingerbread color, rather a long face, nose rather long and flat, and carries himself very erect. When he left the arsenal be wore a grey jacket and cap. He is no doubt making an effort to pass through our lines. I will give the above reward if caught beyond the limits of the city, or two hundred and fifty dollars if caught within the corporate limits and secured in any jail so that I get him. Neal McCURDY. au 10 – 6t* |
Ran away from my Farm, at the Half-way House, on the Richmond and Petersburg railroad, Chesterfield country, my man Richard. He left my farm last Tuesday morning, the 9th instant, and had on when he left a pair of dark pants, white cotton shirt, and had on a pair of shoes, no coat nor hat. He is about twenty or twenty-one years old, five feet six or seven inches high, black, has a small moustache, and speaks slow. I bought him last April, of Lee & Bowman, in Richmond. He formerly belonged to Miss Margaret Bottom, of Amelia Courthouse. He has a wife at or near Amelia Courthouse, and may be trying to go there. He was last seen near the Half-Way Station. I will pay a liberal reward if caught and put in jail, or delivered to me. Address J. M. Wolff, 64 Main street, Richmond, Va., or Proctor’s Creek, Chesterfield county. au 17 – 6t* |
Ranaway. – $100 reward. – From the subscriber, in 10th March, 1864, my men Washington, again about 26 years, about complexion, large mouth will be given if taken out of the State, and if taken in it, and secured so I get him. My address in Fork Union, Fluvanna co, Va., Samuel R. Pellet. ap 26 – 1aw4t |
100 Dollars reward. – Ranaway from the Midlothian Coal Mines, a negro man named Joe, or Joe Hampton. He is about 25 years old, of dark brown color, spare made, about 5 feet 10 inches high, with rather large eyes, and somewhat wild expression of countenance, though generally smiling when spoken to. He was bought in January last of Mr C C Burton, near Petersburg, where his friends and connexions are, and he is probably in the neighborhood of that place. The above reward will be paid for his apprehension and delivery in any jail, or to the agent of the Company, at their mines, or in Richmond. my 12 – ts |
Ruanaway. – $100 reward will be paid for the delivery to S N Davis & Co. of a negro boy named John. He is about 18 years old, gingerbread color; he had on a black felt hat, boots tipped on the toes, and gray pantaloons, when he left Friday. He was raised in Albemarle, by Dr. T J Cook Geo. Turner mh 14 – 6t |
<- d1864_clustering %>% filter(cluster == 7) %>%
cluster select(text)
::kable(cluster, table.attr = "class=\"striped\"", format="html") knitr
text |
---|
For rent – A store neatly fitted up, and the best stand in the town. Apply to John Perry, Ashland Hotel. fe 6 – 2t* |
For Rent – Two nicely furnishes rooms, suitable for ladies and gentlemen, or gentlemen, with use of parlor and kitchen. Apply at the corner house on the north corner of Marshall and Adams streets. mh 7 – 1t* |
Shockoe Mill, Seventeenth street, across the Dock. Wheat and Corn ground on tell. For sale – Nice New Flour at $1. 25 per pound. Corn Meal at $55 per bushel. au 4 – 10t* For Rent and Sale. |
For Rent, a furnished House in a pleasant part of the city. For terms, address " E. S.," with references, Dispatch office. se 12 – 3t* |
For Rent, a very desirable Dwelling-House, with eight rooms, a spacious yard, out-houses, & c., on Nineteenth street, one door from Grace. Possession given on 1st October. Apply to Mrs. Branch, Franklin street, between Eighteenth and Nineteenth. se 12 – eod3t* |
<- d1864_clustering %>% filter(cluster == 5) %>%
cluster select(text)
::kable(cluster, table.attr = "class=\"striped\"", format="html") knitr
text |
---|
Silk culture. Some of our citizens are disposed to turn their attention to the culture of silk. It would afford a new and most profitable employment to women and children. Many years ago there was a great deal of silk produced in this State. The morus multicaulis was extensively cultivated, and almost every house had its room devoted to silk worms, and their feeding and spinning. The business was abandoned, however, and the multicaulis trees were taken up by the roots, and cast out upon the highways to die. Yet there are no doubt some scattering bushes which escaped the general massacre which occurred when it was concluded that the enterprise was unprofitable. Moreover, it may be that some person having faith in the old adage that everything — the most useless, even — comes in use once in seven years, may have saved and kept on hand a small stock of the silk worms. There is now an enquiry for them, and if anybody has ever so small a family, their eggs will sell almost as high as hen’s eggs. The have but to announce that they have the genuine silk worm, to invite orders from every direction. |
50 Dollars reward. – Stolen from the residence of James Sinton, Jr, on Franklin st, on Sunday, the 8th last, one new large black Silk Circular, trimmed with white buttons. Also, one Stella Shawl. The above reward will be given for the return of the articles to the " Illustrated News Office," and no questions asked. my 9 – 3t |
Blockade Goods. – Mantle Depot, 60 Main, between 14th and 15th streets. – Ladies’silk parasols, large sizes; white silk illusion; superior quality bonnet ribbons; blond laces; blk silk lacer; blk English crapes; blk English crapes; blk dress silk; bonnet frames; checked muslins, jaconets; cambrics; bleached cotton; extra heavy English cotton hose; lists hose; crape collars; palm leaf fans; black alpacas; castile soap; Ezekiel’s hair tonic; tooth brushes; pomades; collogue, & c, & c. Mantle Depot, 69 Main, bet 14th and 15th sts jy 11 – 3t* |
9.3.2 Determining the optimal number of clusters: “Elbow Method” and “Average Silhouette Method”
In a nutshell, we repeatedly run clustering, increasing the number of clusters by one each time, and calculate the total within-cluster sum of squares (wss). We then plot the wss curve and look for the point with the sharpest bend (hence the “elbow”), which is considered an indicator of the appropriate number of clusters. library(factoextra) can perform this with a single command:
set.seed(786)
fviz_nbclust(as.matrix(distanceMatrixHC), FUN = hcut, k.max=10, method = "wss")
Ideally, we should see something like an L or a V. Here, perhaps, 3 is our optimal number. We can try another method, the average silhouette method (which is also easily callable from library(factoextra)). As with the elbow method, we run clustering multiple times, but here we measure the quality of the clustering by determining how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
set.seed(786)
fviz_nbclust(as.matrix(distanceMatrixHC), FUN = hcut, k.max = 10, method = "silhouette")
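If you want the actual numbers behind this plot, the average silhouette width can also be computed directly with library(cluster), which we loaded at the start; a minimal sketch, here for k = 3:

sil <- silhouette(cutree(clustered.data.ward, k=3), distanceMatrixHC)
mean(sil[, "sil_width"]) # the average silhouette width for k = 3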
More on hierarchical clustering: “Hierarchical Cluster Analysis”, in U of Cincinnati Business Analytics R Programming Course http://uc-r.github.io/hc_clustering.
9.4 K-means clustering
Let’s get a different sample from our data; with k-means clustering we can run on more data:
# RANDOMLY SELECT N ITEMS
set.seed(48965)
N = 1000
sampledIDs <- sample(unique(df_TF_IDF$id), N)
sample_d1864_tfidf <- df_TF_IDF %>% filter(id %in% sampledIDs) %>%
  select(id, word, tf_idf)

# CONVERT INTO DTM MATRIX
colnames(sample_d1864_tfidf) <- c("document", "term", "count")
sample <- sample_d1864_tfidf %>% cast_dtm(document, term, count)
sample_matrix <- as.matrix(sample)

# CONVERT INTO A REGULAR MATRIX AND CALCULATE DISTANCES
distanceMatrix <- dist.cosine(sample_matrix) # from library(stylo)
distanceMatrixKM <- as.dist(distanceMatrix)
<- kmeans(distanceMatrixKM, centers=5, nstart=25)
kmeansClusters str(kmeansClusters)
## List of 9
## $ cluster : Named int [1:1000] 2 2 2 2 2 2 5 2 2 2 ...
## ..- attr(*, "names")= chr [1:1000] "1864-01-28_article_49" "1864-10-10_printrun_6" "1864-10-10_advert_3" "1864-06-22_article_22" ...
## $ centers : num [1:5, 1:1000] 1 0.997 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:1000] "1864-01-28_article_49" "1864-10-10_printrun_6" "1864-10-10_advert_3" "1864-06-22_article_22" ...
## $ totss : num 2199
## $ withinss : num [1:5] 9.03 1313.82 84.42 186.05 9.94
## $ tot.withinss: num 1603
## $ betweenss : num 595
## $ size : int [1:5] 12 806 37 121 24
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
We can extract cluster information in the same manner as with an hclust() object:
<- stack(kmeansClusters$cluster)
kmeans_clusters_df colnames(kmeans_clusters_df) <- c("cluster", "id")
<- kmeans_clusters_df %>% select(id, cluster)
kmeans_clusters_df
head(kmeans_clusters_df)
## id cluster
## 1 1864-01-28_article_49 2
## 2 1864-10-10_printrun_6 2
## 3 1864-10-10_advert_3 2
## 4 1864-06-22_article_22 2
## 5 1864-04-16_article_26 2
## 6 1864-03-07_advert_112 2
We can visualize the results of our k-means clustering using fviz_cluster() (from factoextra). Our data is multidimensional (each word in our matrix is a single dimension), so this function will perform principal component analysis (PCA) and plot the data according to the first two principal components, which explain the majority of the variance in our dataset.
set.seed(786)
fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)
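To demystify fviz_cluster() a little, the sketch below does the PCA step by hand with prcomp() and colors each document by its k-means cluster. This is a rough equivalent, not factoextra’s exact procedure:

pca <- prcomp(as.matrix(distanceMatrix))
plot(pca$x[, 1:2], col = kmeansClusters$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters in PCA space")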
9.4.1 Determining the optimal number of clusters: “Elbow Method” and “Average Silhouette Method”
In a nutshell, we repeatedly run clustering, increasing the number of clusters by one each time, and calculate the total within-cluster sum of squares (wss). We then plot the wss curve and look for the point with the sharpest bend (hence the “elbow”), which is considered an indicator of the appropriate number of clusters. library(factoextra) can perform this with a single command:
set.seed(786)
fviz_nbclust(as.matrix(distanceMatrix), kmeans, k.max=20, method = "wss")
Ideally, we should see something like an L or a V. Here the results do not seem to be very helpful. We can try another method, the average silhouette method (which is also easily callable from library(factoextra)). As with the elbow method, we run clustering multiple times, but here we measure the quality of the clustering by determining how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
set.seed(786)
fviz_nbclust(as.matrix(distanceMatrix), kmeans, k.max = 20, method = "silhouette")
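Under the hood, the elbow curve above amounts to something like the following sketch: run kmeans() once for every candidate k and collect tot.withinss. It is slower than fviz_nbclust() and is shown here only to make the logic explicit:

set.seed(786)
wss <- sapply(1:20, function(k) kmeans(as.matrix(distanceMatrix), centers=k, nstart=10)$tot.withinss)
plot(1:20, wss, type="b", xlab="number of clusters k", ylab="total within-cluster sum of squares")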
Let’s try to visualize our clusters again:
set.seed(786)
<- kmeans(distanceMatrixKM, centers=5, nstart=25)
kmeansClusters fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)
Here, however, is an example of where k-means clustering can fail: we give the algorithm a different number of random starts (nstart), and the clustering outcome may turn out different:
set.seed(786)
<- kmeans(distanceMatrixKM, centers=5, nstart=50)
kmeansClusters fviz_cluster(kmeansClusters, data = distanceMatrix, labelsize = 0)
For more information, see:
- “K-means Cluster Analysis”, in U of Cincinnati Business Analytics R Programming Course https://uc-r.github.io/kmeans_clustering.
- https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
9.5 Other “clustering” methods
Although not cluster analysis techniques strictly speaking, PCA (principal component analysis) and MDS (multi-dimensional scaling) are used in similar ways. We will touch upon these in the context of stylometry.
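As a teaser, classical MDS is a one-liner in base R: cmdscale() takes a distance matrix and projects it into two dimensions. A minimal sketch, using the cosine distances computed above:

mds <- cmdscale(distanceMatrixKM, k = 2)
plot(mds, pch = 19, xlab = "dimension 1", ylab = "dimension 2", main = "MDS of the sampled articles")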
9.6 Topic Modeling
9.6.1 Topics?
9.6.1.1 Example 1
Thursday, March 27, 1862
LIGHT ARTILLERY
—I am authorized by the Governor of Virginia to raise a Company of Light Artillery for the war. All those desirous of enlisting in this the most effective arm of the service, would do well to call at once at the office of Johnson & Guigon, Whig Building.
Uniforms and subsistence furnished.
A. B. GUIGON. mh 25—6t
9.6.1.2 Example 2
Wednesday, August 17, 1864
Royal Marriages.
—There is a story circulated in Germany, and some in Paris, that the match between the heir-apparent of the Imperial throne of Russia and the Princess Dagmar of Denmark having been definitively broken off, another is in the course of negotiation between His Imperial Highness and the Princess Helens of England.
9.6.1.3 Example 3
Monday, June 22, 1863
NEWS FROM EUROPE.
The steamship Scotia arrived at New York on Thursday from Europe, with foreign news to the 7th inst. The news is not important. The Confederate steamer Lord Clyde was searched by order of the British Government, but nothing contraband being found on board her she was permitted to sail. The Russians have been defeated near Grochoury by the Polish insurgents. The three Powers have sent an earnest note to Russia, asking for a representative Government, a general amnesty, and an immediate cessation of hostilities in Poland.
9.6.2 Getting to code
We can start with our preprocessed variable d1864_clean_minusSW, which is essentially a cumulative frequency list for all articles.
library(tm)
library(topicmodels)
library(tictoc)
head(d1864_clean_minusSW)
## # A tibble: 6 × 3
## # Groups: id [5]
## id word n
## <chr> <chr> <int>
## 1 1864-09-12_advert_30 mrs 307
## 2 1864-09-12_advert_30 miss 251
## 3 1864-04-18_article_10 mr 84
## 4 1864-08-27_article_41 slave 79
## 5 1864-11-24_article_46 mr 71
## 6 1864-04-26_article_8 enemy 64
<- d1864_clean_minusSW %>%
d1864_dm cast_dtm(id, word, n)
d1864_dm
## <<DocumentTermMatrix (documents: 14617, terms: 30391)>>
## Non-/sparse entries: 1162644/443062603
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
Training a model. NB: the eval=FALSE setting will prevent this chunk from running when you knit the notebook; you can still execute it within the notebook by running the chunk individually.
tic()
<- LDA(d1864_dm, k = 4, control = list(seed = 1234))
d1864_lda
d1864_lda
toc()
#A LDA_VEM topic model with 2 topics.
#35.962 sec elapsed
#A LDA_VEM topic model with 4 topics.
#72.545 sec elapsed
Do not run this in class: it takes too long!
tic()
<- 25
kVal <- LDA(d1864_dm, k=kVal, control=list(seed=1234))
d1864_lda_better
toc()
#A LDA_VEM topic model with 20 topics.
#1261.087 sec elapsed (21 minutes)
#A LDA_VEM topic model with 25 topics.
#1112.262 sec elapsed (18 minutes)
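If you are unsure which k to pick, one common (and equally slow) heuristic is to compare models by perplexity: lower perplexity on the same document-term matrix suggests a better statistical fit, although not necessarily more interpretable topics. A sketch, also not to be run in class:

tic()
for (k in c(5, 10, 15)){
  m <- LDA(d1864_dm, k=k, control=list(seed=1234))
  print(c(k, perplexity(m, d1864_dm)))
}
toc()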
Save/load the model, so that there is no need to regenerate it every time.
#d1864_lda_vem_25t_better <- d1864_lda_better
#save(d1864_lda_vem_25t_better, file="./files/data/d1864_lda_vem_25t_better.rda")
#load(file="./data/d1864_lda_vem_25t_better.rda")
load(file="./files/data/d1864_lda_vem_25t_better.rda")
<- d1864_lda_vem_25t_better
lda_model <- d1864 corpus
From this point on, the code should simply run, provided you rename your own model produced above to lda_model.
9.6.3 Per-topic-per-word probabilities (beta)
<- tidy(lda_model, matrix="beta")
lda_model_better_topic_term_prob
%>%
lda_model_better_topic_term_prob filter(term == "bonds") %>%
arrange(desc(beta))
## # A tibble: 25 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 8 bonds 3.53e- 2
## 2 9 bonds 5.05e- 4
## 3 24 bonds 3.92e- 4
## 4 11 bonds 2.34e-13
## 5 2 bonds 2.13e-13
## 6 1 bonds 3.67e-16
## 7 16 bonds 2.01e-20
## 8 14 bonds 3.08e-21
## 9 18 bonds 3.87e-22
## 10 3 bonds 2.00e-22
## # … with 15 more rows
NB: This step may throw an error. The error seems a bit cryptic, but restarting R (without saving the workspace) seems to help. (beta stands for the per-topic-per-word probability.)
<- lda_model_better_topic_term_prob %>%
top_terms group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 york 0.0113
## 2 1 mr 0.0102
## 3 1 lincoln 0.00915
## 4 1 union 0.00827
## 5 1 states 0.00809
## 6 1 washington 0.00689
library(ggplot2)
<- top_terms %>%
graph mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend=FALSE) +
facet_wrap(~topic, scales = "free") +
coord_flip()
Here it is in a bit better quality:
Topic-per-document probabilities: this object will tell us to which topics documents belong, and to what extent (gamma stands for the per-document-per-topic probability).
<- tidy(lda_model, matrix="gamma")
lda_model_topic_doc_prob lda_model_topic_doc_prob
## # A tibble: 356,150 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1864-09-12_advert_30 1 0.0000115
## 2 1864-04-18_article_10 1 0.709
## 3 1864-08-27_article_41 1 0.0000581
## 4 1864-11-24_article_46 1 0.0614
## 5 1864-04-26_article_8 1 0.0000152
## 6 1864-05-09_article_1 1 0.0000152
## 7 1864-05-14_article_23 1 0.139
## 8 1864-05-16_article_6 1 0.122
## 9 1864-06-27_article_2 1 0.0000270
## 10 1864-04-16_article_15 1 0.620
## # … with 356,140 more rows
- Pick a document and print out the topics it belongs to, from the most prominent to the least prominent. (Hint: use the object we just created, then filter, then arrange.) One possible solution is sketched below.
# your code here
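One possible solution, using a document ID that appears in the gamma table above:

docID <- "1864-08-27_article_41"
lda_model_topic_doc_prob %>%
  filter(document == docID) %>%
  arrange(desc(gamma))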
Top N documents per topic: this will create an object with the top N documents for each topic.
N = 10
<- lda_model_topic_doc_prob %>%
top_docs group_by(topic) %>%
top_n(N, gamma) %>%
ungroup() %>%
arrange(topic, -gamma)
top_docs
## # A tibble: 285 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1864-09-14_article_65 1 0.992
## 2 1864-08-31_article_103 1 0.989
## 3 1864-09-08_article_34 1 0.984
## 4 1864-04-19_article_30 1 0.980
## 5 1864-03-23_article_68 1 0.977
## 6 1864-06-03_article_12 1 0.965
## 7 1864-12-29_article_47 1 0.957
## 8 1864-03-10_article_118 1 0.956
## 9 1864-06-08_article_15 1 0.940
## 10 1864-09-01_article_65 1 0.939
## # … with 275 more rows
Now that we have the IDs of representative documents, we can check those documents one by one, but let’s do something else first: a topic-title function. It is not really necessary, but it combines the topic number (from the model) with its top words, which can be used for graphs.
<- function(top, topTerms){
topicVar <- topTerms %>%
topicVar filter(topic == top) %>%
arrange(-beta)
= paste(topicVar$term, collapse=", ")
vals as.String(paste(c("Topic ", top, ": ", vals), collapse=""))
}
topicNum = 8
idTest = "1864-08-31_orders_74"
topicVar(topicNum, top_terms)
## Topic 8: treasury, bonds, notes, 1864, states, cent, certificates, confederate, secretary, department, america, richmond, payment, interest, treasurer
print("---")
## [1] "---"
as.String(d1864[d1864$id==idTest, ]["text"])
## Treasury Department, Confederate States of America, Richmond, August 8, 1864. Certificates of Indebtedness Bearing Six Per Cent. Per Annum interest and Free from Taxation. -- By the fourteenth section of the act to reduce the currency, approved February 17, 1864, the Secretary of the Treasury is authorized to issue the above certificates, payable two years after the ratification of a treaty of peace with the United States. They cannot be sold, but are only to be issued to such creditors of the Government as are willing to receive the same in payment of their demands. They must also be given at par, though free from taxation. The attention of purchasing agents and disbursing officers of the Government is called to this class of public securities as offering peculiar advantages to those from whom the supplies of the Government are bought; and to facilitate the use of them, checks drawn by disbursing officers upon the depositaries holding these funds, and marked across the face " payable in certificates of indebtedness," will be paid in conformity therewith. Depositaries are hereby authorized and required to comply with this regulation, and to make application to the Register for supplies of certificates as required. Signed) G. A. Trenholm, Secretary of Treasury. au 22 -- ts
9.6.4 Topics over time
<- corpus %>%
corpus_light select(-header, -text)
<- lda_model_topic_doc_prob %>%
lda_model_topics rename(id=document) %>%
left_join(corpus_light, by="id") %>%
group_by(topic, date) %>%
summarize(freq=sum(gamma))
## `summarise()` has grouped output by 'topic'. You can override using the
## `.groups` argument.
lda_model_topics
## # A tibble: 7,725 × 3
## # Groups: topic [25]
## topic date freq
## <int> <date> <dbl>
## 1 1 1864-01-01 0.219
## 2 1 1864-01-02 0.731
## 3 1 1864-01-04 1.02
## 4 1 1864-01-05 0.587
## 5 1 1864-01-06 0.708
## 6 1 1864-01-07 1.35
## 7 1 1864-01-08 0.400
## 8 1 1864-01-09 0.580
## 9 1 1864-01-11 0.838
## 10 1 1864-01-12 0.783
## # … with 7,715 more rows
Now, we can plot topic distribution over time:
topicVal = 8

lda_model_topics_final <- lda_model_topics %>%
  filter(topic == topicVal)

plot(x=lda_model_topics_final$date, y=lda_model_topics_final$freq, type="l", lty=3, lwd=1,
     main=topicVar(topicVal, top_terms),
     xlab = "1864 - Dispatch coverage", ylab = "topic saliency")
segments(x0=lda_model_topics_final$date, x1=lda_model_topics_final$date, y0=0, y1=lda_model_topics_final$freq, lty=1, lwd=2)
9.6.5 Exploring topics
LDAvis offers a visual browser for topics, which has become a very popular tool for this purpose. If everything is done right, a visualization similar to the one below should open in a browser.

LDAvis
Browser Example.
We can use the following function, which extracts all the needed information from a model and converts it into the format that LDAvis expects:
library(LDAvis)
library(slam)
<- function(x, ...){
topicmodels2LDAvis <- topicmodels::posterior(x)
post if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
<- x@wordassignments
mat ::createJSON(
LDAvisphi = post[["terms"]],
theta = post[["topics"]],
vocab = colnames(post[["terms"]]),
doc.length = slam::row_sums(mat, na.rm = TRUE),
term.frequency = slam::col_sums(mat, na.rm = TRUE)
) }
serVis(topicmodels2LDAvis(lda_model))
NB: there are some issues with LDAvis, and for some reason it does not always parse a topic model object. We can try loading another one, which does work: this is a 20-topic model based on issues of the Dispatch covering the period 1861-1864.
load(file="./files/data/dispatch_lda_vem_better.rda")
serVis(topicmodels2LDAvis(dispatch_lda_vem_better))
9.7 Addendum: different distances code sample
This is simply a chunk of code that you can reuse for generating different distances. This code will not run as is, because there is no variable YOUR_MATRIX (and the strings distanceMethod and clusteringMethod must be defined as well)!
#library(stylo)
# YOUR_MATRIX must be generated before you run the code below
# USING library(stylo) FUNCTIONS
if (distanceMethod == "cosine"){distanceMatrix <- dist.cosine(YOUR_MATRIX)
else if (distanceMethod == "delta"){distanceMatrix <- dist.delta(YOUR_MATRIX)
} else if (distanceMethod == "argamon"){distanceMatrix <- dist.argamon(YOUR_MATRIX)
} else if (distanceMethod == "eder"){distanceMatrix <- dist.eder(YOUR_MATRIX)
} else if (distanceMethod == "minmax"){distanceMatrix <- dist.minmax(YOUR_MATRIX)
} else if (distanceMethod == "enthropy"){distanceMatrix <- dist.enthropy(YOUR_MATRIX)
} else if (distanceMethod == "simple"){distanceMatrix <- dist.simple(YOUR_MATRIX)
} else if (distanceMethod == "wurzburg"){distanceMatrix <- dist.wurzburg(YOUR_MATRIX)
} # USING dist() FUNCTION
else if (distanceMethod == "euclidean"){distanceMatrix <- dist(YOUR_MATRIX, method="euclidean")
} else if (distanceMethod == "maximum"){distanceMatrix <- dist(YOUR_MATRIX, method="maximum")
} else if (distanceMethod == "manhattan"){distanceMatrix <- dist(YOUR_MATRIX, method="manhattan")
} else if (distanceMethod == "canberra"){distanceMatrix <- dist(YOUR_MATRIX, method="canberra")
} else if (distanceMethod == "binary"){distanceMatrix <- dist(YOUR_MATRIX, method="binary")
} else if (distanceMethod == "minkowski"){distanceMatrix <- dist(YOUR_MATRIX, method="minkowski")
}
<- as.dist(distanceMatrix)
distanceMatrix <- hclust(distanceMatrix, method = clusteringMethod) clustered.data
9.8 Homework
- given in the chapter.
9.9 Submitting homework
- Homework assignment must be submitted by the beginning of the next class;
- Email your homework to the instructor as attachments.
- In the subject of your email, please add the following: 57528-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the number of the lesson for which you submit homework; YourLastName is your last name; and YourMatriculationNumber is your matriculation number.
9.10 Additional Materials
- Chapter 3 “Analyzing word and document frequency: tf-idf” in: Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing, Boston, Farnham: O’Reilly. Available online: https://www.tidytextmining.com/tfidf.html
- Chapter 6 “Topic modeling” in: Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing, Boston, Farnham: O’Reilly. Available online: https://www.tidytextmining.com/topicmodeling.html
- Chapter on Topic Modeling in: Benjamin Soltoff. MACS 305001: Computing for the Social Sciences, University of Chicago, https://cfss.uchicago.edu/notes/topic-modeling/
- David Meza. Topic Modeling using R, https://knowledger.rbind.io/post/topic-modeling-using-r/
- Grün, Bettina, and Kurt Hornik. 2011. “Topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13). https://doi.org/10.18637/jss.v040.i13.