Question about pre-processing Russian text for stm in R

2 years ago

#15800

w5698

I am trying to run a structural topic model in R using the stm package. The corpus is a collection of Russian-language speeches. The problem I am having is that the Russian words are not being pre-processed correctly. Here is the code I have written thus far:

library(stm)        # Package for structural topic modeling
library(igraph)     # Package for network analysis and visualisation
library(stmCorrViz) # Package for hierarchical correlation view of STMs

data <- read.csv("convocation4.csv") # Load data 

stopwordsRU <- readLines("stop_words_russian.txt", encoding = "UTF-8") # Custom stopwords 

processed <- textProcessor( 
  data$text,
  metadata = data,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers  = TRUE,
  removepunctuation = TRUE,
  stem = TRUE,
  language = "ru",
  customstopwords = stopwordsRU
)

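# Build the stm inputs from the processed corpus; `out` is used below
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

docs <- out$documents
vocab <- out$vocab
meta <- out$meta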

fit <- stm(out$documents, out$vocab, K = 20, prevalence = ~party_id,
           max.em.its = 75, data = out$meta, init.type = "Spectral",
           seed = 8458159)

Here is an example of the problem. I have included one Russian stopword, уважаемый, in my list of custom stopwords. However, the word уважаемые, the plural form, comes out as one of the top words after running the model. Why is this happening? How might I solve it?
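
The mismatch can be illustrated with a small base-R sketch. It assumes that custom stopwords are compared to tokens as exact whole words before any stemming is applied, so a stopword entry in one inflected form does not cover the other forms. The helper `drop_by_stem` and the two toy sentences are hypothetical and are not part of stm; they only show one possible workaround, which is to remove every token that starts with a shared stem before the text is passed to `textProcessor`.

```r
# Exact matching: the singular entry does not match the plural token.
"уважаемые" %in% c("уважаемый")

# Hypothetical helper (not part of stm): drop every whitespace-separated
# token that begins with one of the given stems, so all inflected forms
# sharing that stem are removed together.
drop_by_stem <- function(texts, stems) {
  vapply(texts, function(txt) {
    tokens <- strsplit(txt, "\\s+")[[1]]
    keep <- !vapply(tokens,
                    function(tok) any(startsWith(tok, stems)),
                    logical(1), USE.NAMES = FALSE)
    paste(tokens[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Toy input, already lowercased; these are not the real speeches.
speeches <- c("уважаемые коллеги, начинаем заседание",
              "уважаемый председатель, прошу слова")

# Stem chosen by hand for this example.
stems <- c("уважаем")

cleaned <- drop_by_stem(speeches, stems)
print(cleaned)
```

Under the same assumption, a simpler alternative is to list each inflected form (уважаемый, уважаемые, уважаемая, and so on) as a separate line in the stopword file. The prefix approach is shorter to maintain but can also remove unrelated words that happen to share the prefix, so the stems should be chosen with care.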

topic-modeling

0 Answers
