
Preprocessing and Text Analysis

Preprocessing

The data were cleaned to remove rows with null values and to declare the data type of each variable, a process that depends on the dataset. The responses were prepared for topic modelling by expanding contractions (e.g., don't -> do not), removing punctuation and digits, lowercasing the text and removing stopwords. Stopwords are words that convey little to no additional meaning in a sentence, such as "a", "and" and "am". The stopword list can be customised to include other non-informative words from the corpus, as determined by the user, such as "nothing" and "nope". The words were also normalised to their roots using stemming or lemmatisation, tokenised and converted into a document-feature matrix for STM. All tokens are unigrams, unless otherwise stated. Responses with fewer than 3 tokens were removed.

In stemming, part of a word is removed to reduce the word to its stem. In lemmatisation, a word is mapped to its lemma, or canonical form. Lemmatisation can be seen as more interpretable because the derived roots are real words: "happy" and "happily" both map to "happy", whereas with stemming they would map to "happy" and "happi". Similarly, with lemmatisation, "run", "running" and "ran" all map to "run".
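The pipeline itself was implemented in R; the following is a minimal Python sketch of the same steps, assuming hypothetical contraction, stopword and lemma lists (not the ones used in the study):

```python
import re

# Tiny illustrative word lists; the study's actual lists differ.
CONTRACTIONS = {"don't": "do not", "can't": "can not"}
STOPWORDS = {"a", "and", "am", "the", "i", "do", "not", "nothing", "nope"}
LEMMAS = {"happily": "happy", "running": "run", "ran": "run"}

def stem(word):
    # Crude rule-based stemmer: the truncated stems need not be
    # real words, e.g. "happily" -> "happi".
    if word.endswith("ily"):
        return word[:-3] + "i"
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatise(word):
    # Dictionary lookup to the canonical form; real lemmatisers
    # use part-of-speech tags and large lexicons.
    return LEMMAS.get(word, word)

def preprocess(text, normalise=lemmatise, min_tokens=3):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]|\d", " ", text)  # drop punctuation and digits
    tokens = [normalise(t) for t in text.split() if t not in STOPWORDS]
    # Responses with fewer than min_tokens tokens are dropped.
    return tokens if len(tokens) >= min_tokens else None

preprocess("I don't like running, nothing happened!")
# -> ["like", "run", "happened"]
```

Swapping `normalise=stem` for `normalise=lemmatise` switches the pipeline between the two normalisation strategies described above.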

N-gram analysis

The tidytext library was used to extract unigrams, bigrams and trigrams from the text. The frequencies of these n-grams were calculated, providing an overview of the words and phrases in the data and highlighting features that needed to be considered during preprocessing. In one experiment, both unigrams and bigrams were used as the input for STM; otherwise, only unigrams were used.
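The extraction step is conceptually simple. A Python sketch of n-gram extraction and frequency counting (the study uses tidytext in R; the token lists below are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list and join
    # each window into a single n-gram string.
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(documents, n):
    # documents: a list of token lists (already preprocessed).
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts

docs = [["topic", "model", "output"], ["topic", "model", "fit"]]
ngram_frequencies(docs, 2).most_common(1)
# -> [("topic model", 2)]
```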

Sentiment Analysis

Sentiment analysis estimates the affective state of a text. Most commonly, polarity is calculated: a score stating how positive, negative or neutral a statement is. Several sentiment analysis tools are available in R. On the raw data, we compared the performance of VADER, SentimentAnalysis, ANEW and the NRC emotion lexicon. These libraries use different methods and are built from different corpora; for example, VADER is a rule-based dictionary tuned on Twitter data, while the GI dictionary in SentimentAnalysis is a general-purpose dictionary based on the Harvard-IV dictionary.
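At their core, these tools are lexicon-based scorers. A minimal Python sketch of dictionary-based polarity with simple negation flipping, where the lexicon and negator list are tiny illustrative stand-ins rather than any published dictionary:

```python
# Hypothetical word scores; published lexicons contain thousands of entries.
LEXICON = {"good": 1.0, "great": 2.0, "happy": 1.0,
           "bad": -1.0, "terrible": -2.0, "sad": -1.0}
NEGATORS = {"not", "no", "never"}

def polarity(tokens):
    # Sum the word scores, flipping the sign of a word that follows
    # a negator, then normalise by the number of tokens so longer
    # texts are comparable with shorter ones.
    score, negate = 0.0, False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0.0)
        score += -value if negate else value
        negate = False
    return score / len(tokens) if tokens else 0.0

polarity(["not", "good"])   # -> -0.5
polarity(["great", "day"])  # -> 1.0
```

Rule-based tools such as VADER extend this basic idea with intensifiers, punctuation emphasis and capitalisation heuristics.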