2 years ago
#15800
w5698
Question about pre-processing Russian text for stm in R
I am trying to run a structural topic model in R
using the stm
package. The corpus is a collection of Russian-language speeches. The problem I am having is that the Russian words are not being pre-processed correctly. Here is the code I have written thus far:
library(stm) # Package for sturctural topic modeling
library(igraph) # Package for network analysis and visualisation
library(stmCorrViz) # Package for hierarchical correlation view of STMs
data <- read.csv("convocation4.csv") # Load data
stopwordsRU <- readLines("stop_words_russian.txt", encoding = "UTF-8") # Custom stopwords
processed <- textProcessor(
data$text,
metadata = data,
lowercase = TRUE,
removestopwords = TRUE,
removenumbers = TRUE,
removepunctuation = TRUE,
stem = TRUE,
language = "ru",
customstopwords = stopwordsRU
)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
fit <- stm(out$documents, out$vocab, K=20, prevalence=~party_id,
max.em.its=75, data=out$meta, init.type="Spectral",
seed=8458159)
Here is an example of the problem. I have included one Russian stopword, уважаемый, in my list of custom stopwords. However, the word уважаемые, the plural form, comes out as one of the top words after running the model. Why is this happening? How might I solve it?
r
topic-modeling
0 Answers
Your Answer