R Fun! – Text Mining to Create Vocabulary Lists

Use R to scrape and mine text from the web to create personalised discipline specific vocabulary lists!

I love playing with R and I have recently learnt how to scrape and text mine websites. I am going to provide a short tutorial on how to do this using an example I hope you find useful.

Learning the jargon of a new topic that you're interested in can significantly increase your comprehension of the subject matter, so it can be important to spend some time getting to know the lingo. But how can you work out the most important words in the area. You could find lists of key words but these may only identify words that people within the field think you need to learn. Another way is to created a vocabulary list by identifying the most common words across several texts on the topic. This is what we will be doing.

First of all you will need a topic. I will be using the topics of nutrigenomics because Jess (my wife) has recently become interested in learning about the interaction between nutrition and the genome. Now that we have a topic we will follow the following process to created our vocabulary list:

  • Find the documents that you will use to build your vocabulary list.
  • Scrape the text from the website.
  • Clean up the text to get rid of useless information.
  • Identify the most common words across the texts.

Finding the Documents

I am going to use PLOS ONE to find papers on nutrigenomics because it is open access and I will be able to retrieve the information I want. I start by searching PLOS ONE for the nutrigenomics, which finds 192 matches as of the 22/08/2015. Each match is listed by the paper name which contains a hyperlink to the URL for the full paper with the text we are interested in. In R we will use URL's to find the website we are interested in and scrape it's text. In order to scrape the text from every paper we will need to retrieve the corresponding URL's for each paper. To do this we will use the magic of package rvest which allows you to specify specific elements of a website to scrape, in this case we will be scraping the URL links associated with the heading of each paper returned in our PLOS ONE search. So lets get started!


First take note of the URL from your PLOS ONE search. In my case it is: http://journals.plos.org/plosone/search?q=nutrigenomics&filterJournals=PLoSONE. As I mentioned earlier there are 192 results associated with this search but they don't all show up on the same page. However, if I go to the bottom of the page at select to see 30 results per page the URL changes to specify the number of results per page. We can use this to our advantage and change the number from 30 to 192, which then gets the whole list of papers on one page and more importantly all their associated URL's on one page, e.g. http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192. We are going to use this URL to find all of the URL's to our papers.

First we will open R and load the package that we require to get our vocabulary list. I like to use rvest.


Now we can create a vector which contains the html for for the PLOS ONE nutrigenics search, with all returned papers on the same page. This literally pulls down the html code from the web address that you parse to the html() function.

paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")

Using this HTML code, we can now locate the URL's associated with each paper title with the special rvest function html_nodes(). This function uses css or xpath syntax to identify specific locations within the structure of a HTML document. So to pull out the URL's we are after we will need to determine the path to them. This can be easily done on a Google chrome web browser using the inspect element functionality (I am not sure whether other web browser have a similar function but I am sure they do).

In Google chrome go to the list of papers in the PLOS ONE search page, right click on one of the paper titles and select 'inspect element'. This will split your window in half and show you the HTML for the webpage. In the HTML viewer the code for the specific element that you clicked on will be highlighted, this is what you want. You can right click this highlighted section and select 'copy css path' or 'copy xpath' and you will get the specific location for that node to use in html_nodes(). However, we want to specify every URL associated with a paper title in the document so we will need to use a path the contains common elements for every location we are interested in. Luckily 'css path' and 'xpath' syntax can specific multiple locations if they contain the same identifying elements. By looking at the HTML with Google chromes inspect element we can see that the URL's we are interested in are identified by class="search-results-title" and contained within a href="URL" tag. These two elements are common for each of our papers but will not include href= for links elsewhere on the page.


The code to retrieve the URL's occurs in three parts; first we parse our HTML file, then we specify the locations we are interested in with html_nodes(), and finally we indicate what we want to retrieve. In this case we will be retreiving a HTML attribute using the function html_attr()

paperURLs <- paperList %>%
             html_nodes(xpath="//*[@class='search-results-title']/a") %>%

This returns a list of 192 URL's that specify the location of the papers we are interested in.

## [1] "/plosone/article?id=10.1371/journal.pone.0001681"
## [2] "/plosone/article?id=10.1371/journal.pone.0082825"
## [3] "/plosone/article?id=10.1371/journal.pone.0060881"
## [4] "/plosone/article?id=10.1371/journal.pone.0026669"
## [5] "/plosone/article?id=10.1371/journal.pone.0110614"
## [6] "/plosone/article?id=10.1371/journal.pone.0112665"

If you look closely you will notice that the URL's are missing the beginning of a proper web address. Using these URLs will result in a retrieval error. To fix this we will add the start to the URL's with paste(). Here we are simply saying paste the string http://journals.plos.org to the beginning of each of out paperURLs and separate these two strings by no space.

paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")

# Check it out
## [1] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0001681"
## [2] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082825"
## [3] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0060881"
## [4] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026669"
## [5] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0110614"
## [6] "http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112665"

As you can see we now have a complete URL. Try copy/pasting one into your browser to make sure it is working.

Scraping the Text

We can scrape the text from these papers, using the URLs we have just extracted. We will do this by pulling down each paper in its HTML format.

Using the URL's we have extracted from the previous step we will pull down the HTML file for each of the 192 papers. We will use sapply() to do this, which is a looping function that allows us to run html() on every item whithin a list. This step is pulling a large amount of information from the web so it might take a few minutes to run.

paper_html <- sapply(1:length(paperURLs),
                     function(x) html(paperURLs[x]))

Now we can extract the text from all of these HTML files. Using the inspect element functionality of Google chrome we have determined that the content of the articles is found within class="article-content". We are using html_text() to extract only text from the html documents and trim off any white space with stringr function str_trim(). Because we have a list of 192 HTML documents we will iterate over each document using the awesome sapply() function. Where 1:length(paper_html) simply says iterate the following function where x equals 1 until 192.

paperText <- sapply(1:length(paper_html), function(x) paper_html[[1]] %>%
                     html_nodes(xpath="//*[@class='article-content']") %>%
                     html_text() %>%

This results in a very large vector containing the text for each of the 192 papers we are interested in.

Cleaning the Text

Now that we have all of the text that we are interested in we can transform it into a format used for text mining and start to clean it up clean it up.

First we need to load the tm and SnowballC packages.tm is used for text mining and SnowballC has some useful functions that will be explained later.

Now we will transform it into a document corpus using the tm function Corpus() and specifying that the text is of a VectorSource().

paperCorp <- Corpus(VectorSource(paperText))

Now we will remove any text elements that are not useful to us. This includes punctuation, common words such as 'a', 'is', 'the', and remove numbers.

First we will remove any special characters that we might find in the document. To determine what these will be take some time to look at one of the paperText elements.

# Check it out by running the following code.

Now that we have identified the special?characters that we want to get rid of we can?remove them using the following function.

for(j in seq(paperCorp))
paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])

The tm package has several built in functions to remove common elements from text, which are rather self explanatory given their names.

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)

It is really important to run the tolower argument in tm_map(), which changes all characters “to lower” characters. (NOTE: I didn't do this in the beginning and it caused me trouble when I tried to remove specific words in later steps. Thanks to phiver on stackoverflow for helping fix this problem for me!). We will also remove commonly used words in the english language, using the removeWords stopwords() arguments.

paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))

We also want to remove all the common endings to english words, such as 'ing', 'es, and 's'. This is referred to as 'stemming' and is done with a function in the SnowballC package.

paperCorp <- tm_map(paperCorp, stemDocument)

To make sure none of our filtering has left any annoying white space we will make sure to remove it.

paperCorp <- tm_map(paperCorp, stripWhitespace)

If you have a look at this document you can see that it is very different from when you started.


Now we tell R to treat the processed documents as text documents.

paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

Finally we use this plain text document to create a document term matrix. This is a large matrix that contains statistics about each of the words that are contained within the document. We use the document term matrix that we use to look at the details of our documents.

dtm <- DocumentTermMatrix(paperCorpPTD)
## <<DocumentTermMatrix (documents: 192, terms: 1684)>>
## Non-/sparse entries: 323328/0
## Sparsity           : 0%
## Maximal term length: 27
## Weighting          : term frequency (tf)

We are close but there's still one cleaning step that we need to do. There will be words that occur commonly in our document that we aren't interested. We will want to remove these words but first we need to identify what they are. To do this we will find the frequent terms in the document term matrix. We can calculate the frequency of each of our terms and then creat a data.frame where they are order from most frequent to least frequent. We can look through the most common terms in the dataframe and remove those that we aren't interested in. First we will calculate the frequency of each term.

termFreq <- colSums(as.matrix(dtm))

# Have a look at it.
##       able  abolished    absence absorption   abstract       acad 
##        192        192        192       1344        192        960

Now we will create a dataframe and order it by term frequency.

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]

# Have a look at it.
##            term  freq
## fatty     fatty 29568
## pparα     pparα 23232
## acids     acids 22848
## gene       gene 15360
## dietary dietary 12864
## article article 12288

As we can see there are a number of terms that are simply a product of the text being scraped from a website, e.g. 'google', 'article', etc. Now go through the list and make not of all of the terms that aren't important to you. Once you have a list remove the words from the paperCorp document.

paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "analysis",
                                      "download", "google", "figure",
                                      "fig", "groups", "however",
                                      "high", "human", "levels",
                                      "larger", "may", "number",
                                      "shown", "study", "studies", "this",
                                      "using", "two", "the", "scholar",
                                      "pubmedncbi", "view", "the", "biol",
                                      "via", "image", "doi", "one"

There will also be particular terms that should occur together but which end up being split up in the text matrix. We will replace these terms so they occure together.

for (j in seq(paperCorp))
  paperCorp[[j]] <- gsub("fatty acid", "fatty_acid", paperCorp[[j]])

Now we have to recreate our document term matrix.

paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
##                    term  freq
## pparα             pparα 23232
## fatty_acids fatty_acids 22272
## gene               gene 15360
## dietary         dietary 12864
## expression   expression 10752
## genes             genes  9408

From this dataset we will create a word cloud of the most frequent terms. The number of words being displayed is determined by 'max.words'. We will do this using the wordcloud package.

wordcloud(tf$term, tf$freq, max.words = 100, rot.per = 0.2, colors = brewer.pal(5, "Dark2"))


You can use the tm dataframe to find common terms that occur in your field and build a vocabulary list.

By changing your search term in PLoS ONE you can create a vocabulary list for any scientific field you like.

That's it, have fun!!

If anyone has suggested changes to the code, qeustions or comments, please leave a reply below.

Compassionate Behaviour: a challenge to researchers of animal behaviour

It’s time for researchers of animal behaviour to develop ethical methods for studying animal behaviour, and move towards a framework of ‘Compassionate Behaviour’.

Behavioural research is perhaps one of the most fascinating areas of science because it explores our presence and interaction in the world. From communication to cognition and personality to pain, behavioural research captures our imagination and increasingly shows us that non-human species have behaviours that are as complex as our own.

Behaviour is studied in various ways including relatively innocuous observational methods, where an investigator sits and watches a study subject for hours, days, weeks and sometimes years; think Jane Goodall watching chimpanzees. If the investigator wants to test a particular aspect of behaviour they might manipulate the natural environment and see what response the study subject has (e.g. study shows bees have map-like spatial memory). However, there is a darker side to behavioural research that is not often discussed.


The Problem

There are invasive techniques in behaviour research which have a significant impact on the individuals being studied, including methods that manipulate the life history of individuals in the wild, as is done in brood size manipulations, or trapping wild individuals and bringing them into a laboratory. Of individuals brought into laboratory conditions many will be killed once the research is completed.

Finally, some behavioural studies will breed individuals for the sole purpose of conducting behavioural research, in which case they will live and die in the laboratory.

Ironically, behavioural research has helped us understand the incredible nature of non-human animals and has fostered movements that promote greater moral consideration for them (see 12). We now recognise that many species have incredibly complex social systems, excellent cognitive skills, individual personalities and can form friendships. That they are individuals who have an interest in self determination and freedom from interference.


Unfortunately our understanding of animals has not stopped us from purposefully subjugated them to tormenting conditions. In our pursuit of knowledge we have ignored the rights of non-human individuals to be free from unnecessary use.

This was particularly evident at Behaviour 2015, the 34th International Ethological Conference held in Cairns (10th – 14th August 2015) and where over 800 delegates from all over the world presented, discussed and planned all things behavioural science.

I was lucky enough to attend the conference and see dozens of talks on current animal behaviour research. A large number of talks involved studies that used captive individuals and many ending with the study subjects being killed. Despite the considerable cost to the individuals being studied I didn’t hear anyone question the ethics of what was being done. In fact, there seemed to be a generally accepted assumption in animal behaviour research that it’s OK to use and kill animals if it is in the pursuit of knowledge.

I would like to challenge this assumption, and argue that it is unethical to use invasive methods during the course of animal behaviour research.

Our Faulty Logic for Using Animals

Animals are used in experimental research simply because they are not human. This is evident by the extensive limits on human research but not non-human animal research. The distinction between human and non-human animals is based on species membership, whereby being a Homo sapien grants one greater moral consideration. However, species membership is a morally irrelevant characteristic because it is based on a difference in physiology, and physiology shouldn’t matter when deciding how to treat an individual. For instance, whether someone is male or female, Brazilian or Chinese does not matter how they should be treated morally. This is because the randomness of being born a female in Zimbabwe does not mean you are any less worthy of ethical treatment than any other person in the world. Biology does not matter when deciding what is morally acceptable, therefore taking away the freedoms of individual based on their species is unethical (see 3 for a more extensive discussion of this idea).

Some people argue that certain human traits make them superior to other species and justifies the human position to treat non-human animals as we like. This thinking is problematic for several reasons. First, it is incredibly bias to a human view of the world because superiority is based on exceptional human traits, e.g. intellectual capacity. If superiority were judge on the ability to fly, or swim, humans would be considered inferior to many other species. Second, behavioural research is finding that humans are not as exceptional as we once thought! An argument of superiority is illogical and does not give us cause to use non-human animals how we want.

The Challenge

I suggest that behavioural scientists look at the emergence of Compassionate Conservation as a guide to how ethics concerning the individual can be used in scientific research. Compassionate conservation seeks to promote the protection of wildlife as individuals and takes into consideration individual interests when planning management options. This is a huge change from a field where the wholesale killing of millions of individuals is considered OK just because they ‘don’t belong’ in an ecosystem; even when such intervention ultimately have little to no benefit.

We need a new field of Compassionate Behaviour which considers the interests of individuals in the pursuit of knowledge.