R Fun! – Text Mining to Create Vocabulary Lists

Use R to scrape and mine text from the web to create personalised, discipline-specific vocabulary lists!

I love playing with R and I have recently learnt how to scrape and text mine websites. I am going to provide a short tutorial on how to do this using an example I hope you find useful.

Learning the jargon of a new topic that you're interested in can significantly increase your comprehension of the subject matter, so it can be important to spend some time getting to know the lingo. But how can you work out the most important words in the area? You could find lists of key words, but these may only identify words that people within the field think you need to learn. Another way is to create a vocabulary list by identifying the most common words across several texts on the topic. This is what we will be doing.

First of all you will need a topic. I will be using the topic of nutrigenomics because Jess (my wife) has recently become interested in learning about the interaction between nutrition and the genome. Now that we have a topic, we will use the following process to create our vocabulary list:

  • Find the documents that you will use to build your vocabulary list.
  • Scrape the text from the website.
  • Clean up the text to get rid of useless information.
  • Identify the most common words across the texts.

Finding the Documents

I am going to use PLOS ONE to find papers on nutrigenomics because it is open access and I will be able to retrieve the information I want. I start by searching PLOS ONE for the term nutrigenomics, which finds 192 matches as of 22/08/2015. Each match is listed by its paper title, which links to the full paper containing the text we are interested in. In R we will use URLs to find the web pages we are interested in and scrape their text. In order to scrape the text from every paper we will need to retrieve the corresponding URL for each paper. To do this we will use the magic of the rvest package, which allows you to specify particular elements of a website to scrape. In this case we will be scraping the URL links associated with the heading of each paper returned in our PLOS ONE search. So let's get started!


First take note of the URL from your PLOS ONE search. In my case it is: http://journals.plos.org/plosone/search?q=nutrigenomics&filterJournals=PLoSONE. As I mentioned earlier there are 192 results associated with this search but they don't all show up on the same page. However, if I go to the bottom of the page and select 30 results per page, the URL changes to specify the number of results per page. We can use this to our advantage and change the number from 30 to 192, which puts the whole list of papers, and more importantly all their associated URLs, on one page, e.g. http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192. We are going to use this URL to find the URLs of all our papers.
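The URL trick above can be wrapped in a small function so the search term and page size are easy to change. This is only a sketch: buildSearchURL() is a hypothetical helper name, and the query parameters are simply copied from the URL shown above, so they may not match the current PLOS ONE site.

```r
# Hypothetical helper: build a PLOS ONE search URL for a term and page size.
# The parameter names are copied from the URL in the text, not from any API docs.
buildSearchURL <- function(term, resultsPerPage) {
  paste0("http://journals.plos.org/plosone/search?q=",
         URLencode(term, reserved = TRUE),   # escape spaces and symbols in the term
         "&sortOrder=RELEVANCE&filterJournals=PLoSONE",
         "&resultsPerPage=", resultsPerPage)
}

searchURL <- buildSearchURL("nutrigenomics", 192)
searchURL
```

Changing the first argument gives the matching URL for another topic.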

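First we will open R and load the packages that we require to get our vocabulary list. I like to use rvest for the scraping, and stringr is used later on to trim white space.

library(rvest)
library(stringr)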

Now we can create an object which contains the HTML for the PLOS ONE nutrigenomics search, with all returned papers on the same page. This literally pulls down the HTML code from the web address that you pass to the read_html() function (called html() in older versions of rvest).

paperList <- read_html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")

Using this HTML code, we can now locate the URLs associated with each paper title with the rvest function html_nodes(). This function uses CSS or XPath syntax to identify specific locations within the structure of an HTML document. So to pull out the URLs we are after we will need to determine the path to them. This can be easily done in the Google Chrome web browser using the inspect element functionality (I have not checked other web browsers, but I expect they have a similar function).

In Google Chrome go to the list of papers on the PLOS ONE search page, right click on one of the paper titles and select 'inspect element'. This will split your window in half and show you the HTML for the web page. In the HTML viewer the code for the specific element that you clicked on will be highlighted; this is what you want. You can right click this highlighted section and select 'copy css path' or 'copy xpath' and you will get the specific location of that node to use in html_nodes(). However, we want every URL associated with a paper title in the document, so we need a path that contains the elements common to every location we are interested in. Luckily, CSS path and XPath syntax can specify multiple locations if they share the same identifying elements. By looking at the HTML with Google Chrome's inspect element we can see that the URLs we are interested in sit in the href attribute of an a (link) tag inside an element with class="search-results-title". This combination is common to each of our papers but does not match the href of links elsewhere on the page.
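To see how a shared class picks out several links at once, here is a small sketch that runs the same XPath on a made-up HTML fragment instead of the live page. The fragment is an assumption for illustration only (the dl/dt tags and the help link are invented; the two article paths are taken from this tutorial), and read_html() is the current rvest name for the function that parses HTML.

```r
library(rvest)

# Invented fragment: two titled results plus one unrelated link.
toyPage <- read_html(paste0(
  '<html><body><dl>',
  '<dt class="search-results-title">',
  '<a href="/plosone/article?id=10.1371/journal.pone.0001681">Paper A</a></dt>',
  '<dt class="search-results-title">',
  '<a href="/plosone/article?id=10.1371/journal.pone.0082825">Paper B</a></dt>',
  '</dl><p><a href="/plosone/help">Help</a></p></body></html>'))

# Select every <a> directly inside an element with the shared class,
# then keep only the href attribute of each match.
toyLinks <- toyPage %>%
  html_nodes(xpath = "//*[@class='search-results-title']/a") %>%
  html_attr("href")
toyLinks
```

Only the two links under the shared class come back; the help link is ignored because its parent element has no such class.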


The code to retrieve the URLs has three parts: first we pass in our HTML document, then we specify the locations we are interested in with html_nodes(), and finally we indicate what we want to retrieve. In this case we will be retrieving an HTML attribute, href, using the function html_attr().

paperURLs <- paperList %>%
             html_nodes(xpath="//*[@class='search-results-title']/a") %>%
             html_attr("href")

This returns a vector of 192 URLs that specify the location of the papers we are interested in.

## [1] "/plosone/article?id=10.1371/journal.pone.0001681"
## [2] "/plosone/article?id=10.1371/journal.pone.0082825"
## [3] "/plosone/article?id=10.1371/journal.pone.0060881"
## [4] "/plosone/article?id=10.1371/journal.pone.0026669"
## [5] "/plosone/article?id=10.1371/journal.pone.0110614"
## [6] "/plosone/article?id=10.1371/journal.pone.0112665"

If you look closely you will notice that the URLs are missing the beginning of a proper web address. Using these URLs as they are will result in a retrieval error. To fix this we will add the start of the address with paste(). Here we are simply saying: paste the string http://journals.plos.org onto the beginning of each of our paperURLs, with nothing separating the two strings.

paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")

# Check it out
head(paperURLs)
## [1] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0001681"
## [2] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082825"
## [3] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0060881"
## [4] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026669"
## [5] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0110614"
## [6] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112665"
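One thing to watch is that running the paste() step twice would add the prefix twice. A minimal sketch of a guard against that is below; makeAbsolute() is a hypothetical helper name, not something from rvest, and it only adds the site root to links that do not already start with http:// or https://.

```r
# Hypothetical helper: prepend the site root only to relative links.
makeAbsolute <- function(urls, root = "http://journals.plos.org") {
  isAbsolute <- grepl("^https?://", urls)   # TRUE for links that are already complete
  ifelse(isAbsolute, urls, paste0(root, urls))
}

links <- c("/plosone/article?id=10.1371/journal.pone.0001681",
           "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082825")
fixedLinks <- makeAbsolute(links)
fixedLinks
```

Applying makeAbsolute() to its own output leaves the links unchanged, so it is safe to re-run.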

As you can see we now have complete URLs. Try copying and pasting one into your browser to make sure it is working.

Scraping the Text

We can scrape the text from these papers, using the URLs we have just extracted. We will do this by pulling down each paper in its HTML format.

Using the URLs we extracted in the previous step, we will pull down the HTML file for each of the 192 papers. We will use lapply() to do this, which is a looping function that runs read_html() (called html() in older versions of rvest) on every item within a vector and always returns the results as a list, one element per paper. This step is pulling a large amount of information from the web so it might take a few minutes to run.

paper_html <- lapply(paperURLs, read_html)
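Because this step makes a couple of hundred requests, it can be worth pausing between them and not letting one failed download stop the whole run. The sketch below shows one way to do that in base R. fetchAll() and fakeFetch() are hypothetical names invented for this example; in the tutorial's setting the fetch argument would be the rvest function that downloads a page.

```r
# Hypothetical helper (not part of rvest): apply `fetch` to each URL,
# pause between requests, and record a failure as NULL instead of stopping.
fetchAll <- function(urls, fetch, pause = 1) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[i] <- list(tryCatch(fetch(urls[i]), error = function(e) NULL))
    if (i < length(urls)) Sys.sleep(pause)   # be polite to the server
  }
  names(pages) <- urls
  pages
}

# Offline demonstration with a stand-in fetch function that fails on one URL.
fakeFetch <- function(u) {
  if (grepl("bad", u)) stop("could not retrieve ", u)
  paste("<html>", u, "</html>")
}
demoPages <- fetchAll(c("http://example.org/a", "http://example.org/bad"),
                      fakeFetch, pause = 0)
failed <- names(demoPages)[vapply(demoPages, is.null, logical(1))]
failed
```

The names of the NULL entries tell you which URLs to retry later.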

Now we can extract the text from all of these HTML files. Using the inspect element functionality of Google Chrome we have determined that the content of the articles is found within class="article-content". We use html_text() to extract only the text from the HTML documents and trim off any white space with the stringr function str_trim(). Because we have a list of 192 HTML documents we will iterate over each document using the awesome sapply() function, where 1:length(paper_html) simply says run the following function with x equal to 1 through 192.

# paper_html[[x]] selects the x-th paper on each pass through the loop
paperText <- sapply(1:length(paper_html), function(x) paper_html[[x]] %>%
                     html_nodes(xpath="//*[@class='article-content']") %>%
                     html_text() %>%
                     str_trim())

This results in a very large vector containing the text for each of the 192 papers we are interested in.
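Before moving on it is worth a quick sanity check that every paper actually produced text and that no paper was extracted more than once. Here is a rough sketch on a toy stand-in for paperText; the three sentences are invented for illustration.

```r
# Toy stand-in for paperText: three "papers", the third a copy of the first.
toyText <- c("Nutrigenomics studies diet and genes.",
             "Fatty acids regulate gene expression.",
             "Nutrigenomics studies diet and genes.")

nchar(toyText)                               # length of each text in characters
emptyDocs    <- which(nchar(toyText) == 0)   # papers where nothing was extracted
repeatedDocs <- which(duplicated(toyText))   # papers identical to an earlier one
emptyDocs
repeatedDocs
```

On the real data both results should be empty; anything else suggests a problem with the XPath or with the loop index.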

Cleaning the Text

Now that we have all of the text that we are interested in we can transform it into a format used for text mining and start to clean it up.

First we need to load the tm and SnowballC packages. tm is used for text mining and SnowballC provides the stemming functions that will be explained later.

library(tm)
library(SnowballC)

Now we will transform it into a document corpus using the tm function Corpus() and specifying that the text is of a VectorSource().

paperCorp <- Corpus(VectorSource(paperText))

Now we will remove any text elements that are not useful to us. This includes punctuation, numbers, and common words such as 'a', 'is' and 'the'.

First we will remove any special characters that we might find in the document. To determine what these will be, take some time to look at one of the paperText elements.

# Check it out by running the following code.
paperText[1]

Now that we have identified the special characters that we want to get rid of, we can remove them using the following loop.

# The braces keep all three replacements inside the loop
for (j in seq(paperCorp)) {
  paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}
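As an aside, the same three replacements can be written as a single regular expression applied to a plain character vector, for example to the scraped text before it is turned into a corpus. This is an alternative sketch rather than the approach used in this tutorial, and the two input strings are made up.

```r
# Made-up strings containing the three characters we want to replace.
rawText <- c("gene-diet: a\nb", "PPAR-alpha")

# "[-:\n]" matches a hyphen, a colon or a newline; each becomes a space.
# gsub() is vectorised, so no explicit loop is needed.
cleanText <- gsub("[-:\n]", " ", rawText)
cleanText
```

The doubled space left behind where ": " used to be is dealt with later when white space is stripped.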

The tm package has several built-in functions to remove common elements from text, which are rather self-explanatory given their names.

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)

It is really important to run the tolower function through tm_map(), which changes all characters to lower case. (NOTE: I didn't do this in the beginning and it caused me trouble when I tried to remove specific words in later steps. Thanks to phiver on stackoverflow for helping fix this problem for me!). We will also remove commonly used words in the English language, using removeWords with the word list returned by stopwords("english").

paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))

We also want to remove all the common endings of English words, such as 'ing', 'es' and 's'. This is referred to as 'stemming' and is done with the tm function stemDocument(), which relies on the SnowballC package.

paperCorp <- tm_map(paperCorp, stemDocument)

To make sure none of our filtering has left any annoying white space, we will remove it.

paperCorp <- tm_map(paperCorp, stripWhitespace)
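If it is not obvious what these cleaning steps do to a piece of text, the following base R sketch imitates them on a single made-up sentence. It is not how tm implements its functions, the three-word stop list is a tiny hypothetical stand-in for stopwords("english"), and stemming is left out.

```r
sentence <- "The PPAR gene is regulated by 3 dietary fatty acids."
stopWords <- c("the", "is", "by")   # hypothetical mini stop-word list

step <- tolower(sentence)                        # lower case first, so "The" matches "the"
step <- gsub("[[:punct:]]", "", step)            # drop punctuation
step <- gsub("[[:digit:]]+", "", step)           # drop numbers
stopPattern <- paste0("\\b(", paste(stopWords, collapse = "|"), ")\\b")
step <- gsub(stopPattern, "", step, perl = TRUE) # drop whole stop words only
cleaned <- trimws(gsub("\\s+", " ", step))       # collapse the leftover white space
cleaned
```

Running the lower-casing step after the stop-word step would leave "The" in place, which is why the order matters.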

If you have a look at this document you can see that it is very different from when you started.


Now we tell R to treat the processed documents as text documents.

paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

Finally we use these plain text documents to create a document term matrix. This is a large matrix that records how often each term occurs in each document. We will use the document term matrix to look at the details of our documents.

dtm <- DocumentTermMatrix(paperCorpPTD)
## <<DocumentTermMatrix (documents: 192, terms: 1684)>>
## Non-/sparse entries: 323328/0
## Sparsity           : 0%
## Maximal term length: 27
## Weighting          : term frequency (tf)

We are close but there's still one cleaning step that we need to do. There will be words that occur commonly in our documents that we aren't interested in. We will want to remove these words, but first we need to identify what they are. To do this we will find the frequent terms in the document term matrix. We can calculate the frequency of each of our terms and then create a data.frame where they are ordered from most frequent to least frequent. We can then look through the most common terms in the data frame and remove those that we aren't interested in. First we will calculate the frequency of each term.

termFreq <- colSums(as.matrix(dtm))

# Have a look at it.
head(termFreq)
##       able  abolished    absence absorption   abstract       acad 
##        192        192        192       1344        192        960

Now we will create a dataframe and order it by term frequency.

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]

# Have a look at it.
head(tf)
##            term  freq
## fatty     fatty 29568
## pparα     pparα 23232
## acids     acids 22848
## gene       gene 15360
## dietary dietary 12864
## article article 12288
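To make the idea of summing the columns of a document term matrix concrete, here is a sketch that builds a tiny one by hand in base R. The three toy documents are invented, and the toy matrix is only meant to mirror the documents-by-terms layout that tm produces.

```r
docs <- c("gene diet gene", "diet fatty acid", "gene fatty acid")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per document, one column per term, each cell a count.
toyDtm <- t(vapply(tokens,
                   function(tk) as.integer(table(factor(tk, levels = terms))),
                   integer(length(terms))))
colnames(toyDtm) <- terms

toyFreq <- colSums(toyDtm)   # total count of each term across all documents
toyTf <- data.frame(term = names(toyFreq), freq = toyFreq,
                    stringsAsFactors = FALSE)
toyTf <- toyTf[order(-toyTf$freq), ]   # most frequent first
toyTf
```

The term "gene" comes out on top because it appears three times across the toy documents.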

As we can see there are a number of terms that are simply a product of the text being scraped from a website, e.g. 'google', 'article', etc. Now go through the list and make a note of all of the terms that aren't important to you. Once you have a list, remove the words from the paperCorp corpus.

paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "analysis",
                                      "download", "google", "figure",
                                      "fig", "groups", "however",
                                      "high", "human", "levels",
                                      "larger", "may", "number",
                                      "shown", "study", "studies", "this",
                                      "using", "two", "the", "scholar",
                                      "pubmedncbi", "view", "biol",
                                      "via", "image", "doi", "one"))

There will also be particular terms that should occur together but which end up being split up in the document term matrix. We will replace these terms so they occur together.

for (j in seq(paperCorp))
  paperCorp[[j]] <- gsub("fatty acid", "fatty_acid", paperCorp[[j]])
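If you have several phrases to keep together, the same replacement can be driven from a vector of phrases. The sketch below works on a plain character string; joinPhrases() is a hypothetical helper name and "amino acid" is just an extra example phrase.

```r
# Hypothetical helper: replace the spaces inside each listed phrase with "_".
joinPhrases <- function(x, phrases) {
  for (p in phrases) {
    joinedPhrase <- gsub(" ", "_", p, fixed = TRUE)   # "fatty acid" -> "fatty_acid"
    x <- gsub(p, joinedPhrase, x, fixed = TRUE)
  }
  x
}

joined <- joinPhrases("fatty acids and amino acid",
                      c("fatty acid", "amino acid"))
joined
```

Because the match is on the start of the phrase, plurals such as "fatty acids" are joined as well.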

Now we have to recreate our document term matrix.

paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
##                    term  freq
## pparα             pparα 23232
## fatty_acids fatty_acids 22272
## gene               gene 15360
## dietary         dietary 12864
## expression   expression 10752
## genes             genes  9408

From this dataset we will create a word cloud of the most frequent terms. The number of words being displayed is determined by 'max.words'. We will do this using the wordcloud package.

library(wordcloud)
wordcloud(tf$term, tf$freq, max.words = 100, rot.per = 0.2, colors = brewer.pal(5, "Dark2"))


You can use the tf data frame to find common terms that occur in your field and build a vocabulary list.
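As a rough sketch of that last step, the most frequent terms can be pulled out of the data frame and saved as a plain text list. The toy data frame below copies a few of the terms and counts shown above, and the cut-off of three terms and the temporary file are arbitrary choices for the example.

```r
# Toy copy of a few rows from the term frequency table above.
tfToy <- data.frame(term = c("fatty_acids", "gene", "dietary", "expression", "genes"),
                    freq = c(22272, 15360, 12864, 10752, 9408),
                    stringsAsFactors = FALSE)

# Keep the three most frequent terms as the vocabulary list.
vocab <- head(tfToy$term[order(-tfToy$freq)], 3)

# Write one term per line to a temporary file and read it back.
vocabFile <- tempfile(fileext = ".txt")
writeLines(vocab, vocabFile)
readLines(vocabFile)
```

With the real tf data frame you would swap tfToy for tf, raise the cut-off, and write to a file of your choosing.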

By changing your search term in PLOS ONE you can create a vocabulary list for any scientific field you like.

That's it, have fun!!

If anyone has suggested changes to the code, questions or comments, please leave a reply below.