2 years ago
#4476

Karl Wolfschtagg
Fundamental problem - Word frequencies/histogram in R
I've spent an entire day on this and I cannot seem to find my mistake. I'm sure someone else will be able to spot it in a second (for which I am thankful).
I have a few text files that I need to analyze; it doesn't matter what's in them (use lorem ipsum for a reproducible example). I am at the EDA step. My code looks like this:
lorem <- readLines("lorem.txt", skipNul = TRUE) # Read in data; this works
# stringi functions
library(stringi)
# Count words in each entry; this also works fine
lorem_wc <- data.frame(stri_count_words(lorem))
This produces a data frame with one column with the number of words in each line - so far, this works great. Now I'd like to create a histogram (i.e., see a distribution of word-count frequencies); the number of words would be on the horizontal axis and the frequency of each would be on the vertical. My thought was this:
lorem_df <- table(lorem_wc)
This produces something I don't want. I'm not actually sure what it is: it's a single row of information, but I don't understand what R computed. If I try this:
lorem_df <- data.frame(table(stri_count_words(lorem)))
I get the same information, but it's now in a column instead of a row. The contents are still not what I'd expect. So I created a test file and did the same thing:
testfile <- readLines("test.txt", skipNul = TRUE)
testtab <- data.frame(table(stri_count_words(testfile)))
When I print testtab, it looks like a two-column data frame; the first column is called Var1 and looks like it's the number of words per line in the text file called "test.txt". The second column is called Freq and looks like it's the frequency that goes with the number of words in the first column.
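For reference, here is a minimal sketch of what table() and data.frame(table(...)) return for a vector of word counts. The input lines are made up, and lengths(strsplit(...)) is only a base-R stand-in for stri_count_words() so the snippet runs without stringi. The point is that the Var1 column comes back as a factor rather than a number.

```r
# Hypothetical input lines (not the contents of test.txt)
lines <- c("a b c", "a b", "a b c", "a b c d", "a b")

# Base-R stand-in for stri_count_words(): split on single spaces, count pieces
wc <- lengths(strsplit(lines, " ", fixed = TRUE))

# table() tallies how many lines have each distinct word count;
# it prints as a header row of values with a row of counts underneath
tab <- table(wc)
print(tab)

# Wrapping an unnamed table in data.frame() gives one row per distinct value,
# with columns Var1 (the value) and Freq (how often it occurs)
tab_df <- data.frame(table(lengths(strsplit(lines, " ", fixed = TRUE))))
print(tab_df)

# Var1 is stored as a factor, i.e. a discrete variable
print(class(tab_df$Var1))
```

So the single row printed by table(lorem_wc) would be the same tally, just shown horizontally.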
Now I use ggplot to generate a histogram:
library(ggplot2)
g <- ggplot(testtab, aes(x = Var1))
g <- g + geom_histogram()
g
... which throws an error: Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
I'm sure this is trivial, but I've apparently gone into vapor lock.
Thanks in advance.
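A possible reading of the error, sketched under assumptions: geom_histogram() does its own binning and expects a continuous x, while the Var1 column of a tabulated data frame is a factor. Two ways around it are to pass the raw word counts to geom_histogram(), or to plot the already-tabulated counts with geom_col(). The word counts and the column name words below are hypothetical, and the plotting part is skipped if ggplot2 is not installed.

```r
# Hypothetical word counts per line, standing in for stri_count_words(testfile)
wc <- c(3L, 2L, 3L, 4L, 2L, 3L)

# Option 1 input: the raw counts, one row per line of text
raw_df <- data.frame(words = wc)

# Option 2 input: the pre-tabulated counts (the words column is a factor here)
tab_df <- as.data.frame(table(words = wc))

# If numeric values are needed again, convert via character,
# otherwise as.numeric() would return the factor's internal codes
words_num <- as.numeric(as.character(tab_df$words))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Option 1: let ggplot bin the raw counts; binwidth = 1 gives one bar per count
  p_hist <- ggplot(raw_df, aes(x = words)) +
    geom_histogram(binwidth = 1)

  # Option 2: the data are already counted, so draw the bars directly
  p_col <- ggplot(tab_df, aes(x = words, y = Freq)) +
    geom_col()

  # print(p_hist) or print(p_col) would draw the plots
}
```

Option 1 keeps a numeric axis, so word counts that never occur show up as gaps. Option 2 treats each observed count as a category and spaces the bars evenly.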
r
ggplot2
nlp
histogram
word-frequency