This post demonstrates the difference between simple word counts and term frequency-inverse document frequency (tf-idf) in document retrieval, using an example. Document retrieval is the method used to retrieve (recommend) articles similar to an article of interest.
The word count approach retrieves similar documents based on the frequent words those documents share, whereas tf-idf magnifies the most important words: those that are common locally but rare globally. In other words, important words are informative words that describe the unique aspects of a document.
Mathematically, we can compute term frequency (simple word count) and tf-idf as:
$$tf = f_{w,d}$$

$$idf = \log\left(\frac{N}{1+D}\right)$$

$$tfidf = tf \cdot idf$$

Where: $tf$ is the term frequency (word count); $f_{w,d}$ is the frequency of word $w$ in document $d$; $idf$ is the inverse document frequency (one is added to the denominator to avoid division by zero when a word does not appear in the corpus); $N$ is the total number of documents in the corpus; $D$ is the number of documents containing the word; and $tfidf$ is the term frequency-inverse document frequency.
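To make these formulas concrete, here is a minimal sketch in plain Python (the toy corpus and the whitespace tokenization are my own illustration, not part of the course material):

import math

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]
docs = [doc.split() for doc in corpus]  # whitespace tokenization
N = len(docs)  # total number of documents in the corpus

def tf(word, doc):
    return doc.count(word)  # raw term frequency f_{w,d}

def idf(word):
    D = sum(word in doc for doc in docs)  # documents containing the word
    return math.log(N / (1 + D))

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

tfidf("the", docs[0])  # 0.0 -- "the" appears in most documents, so it is down-weighted
tfidf("cat", docs[0])  # ~0.41 -- "cat" is rare globally, so it scores higher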
The data was obtained from the Machine Learning Foundations: A Case Study Approach course offered by the University of Washington on Coursera. The analysis was performed using graphlab (GraphLab Create), a Python library that performs a range of machine learning tasks. For more information about graphlab, refer here.
The data contains three columns: url, name, and text, which hold the link to the Wikipedia article, the name of the person, and the text of the article, respectively. There are 59,071 observations.
import graphlab
people = graphlab.SFrame('people_wiki.gl') # load the data
people.shape # the dimension of the data frame
people.head(2) # the first two observations
The following two graphlab functions perform two tasks: the first counts the words in the text column of the data, and the second computes the tf-idf scores from those counts. The results are then added to the original data as new columns.
people["word_count"]=graphlab.text_analytics.count_words(people["text"])
people["tf-idf"]= graphlab.text_analytics.tf_idf(people["word_count"])
people.head(2) # the new data frame with word count and tfidf included
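The word_count column now holds one dictionary per article, mapping each word to its frequency in that article. Ignoring the details of graphlab's tokenizer, it behaves roughly like this sketch (my own approximation, not graphlab's actual implementation):

from collections import Counter

def count_words(text):
    # Rough stand-in for graphlab.text_analytics.count_words:
    # lowercase the text, split on whitespace, count occurrences.
    return dict(Counter(text.lower().split()))

count_words("united states secretary of state of the united states")
# {'united': 2, 'states': 2, 'secretary': 1, 'of': 2, 'state': 1, 'the': 1}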
In order to explore the difference between document retrieval with simple word counts and with tf-idf, I picked the article about the 67th United States Secretary of State, Hillary Clinton. Please note that the information in the articles may not reflect the current status of the people described.
HClinton = people[people["name"]=="Hillary Rodham Clinton"] # Querying information about Hillary Clinton
HClinton
# The first few lines of Hillary Clinton's article
HClinton["text"][0][0:800]
As you can see from the table below, "the", "in", and "of" are the most frequent words in Hillary Clinton's article, and they are not informative for retrieving similar documents.
# convert the word count dictionary into a table having two columns, word and count.
hclinton_wc = HClinton[['word_count']].stack('word_count', new_column_name=['word', 'count'])
# Word count of Hillary Clinton's article in descending order
hclinton_wc.sort('count', ascending=False)
Informative words describe what is unique about a document, and it is on this basis that similar documents are retrieved. The tradeoff between common locally and rare globally is balanced by weighting words by their tf-idf scores: words with a large tf-idf score are more informative than words with a low one. In Hillary Clinton's article, "clinton", "lady", and "she" are the most informative words.
# convert the "tf-idf" dictionary into a table having two columns, word and tf-idf.
hclinton_tfidf = HClinton[["tf-idf"]].stack("tf-idf", new_column_name=["word", "tf-idf"])
# Sort by tf-idf in descending order.
hclinton_tfidf.sort('tf-idf', ascending=False)
In this post, cosine similarity is used to compute the similarity of documents in the corpus; it is the most commonly used similarity measure in information retrieval. Graphlab computes the cosine distance, which is one minus the cosine similarity. The smaller the cosine distance, the more similar the documents are.
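To illustrate the distance itself, here is a minimal sketch of cosine distance on sparse word vectors stored as dictionaries (the toy tf-idf vectors are my own illustration; graphlab's internal implementation may differ):

import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity for two sparse
    # vectors represented as word -> weight dictionaries.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (norm_a * norm_b)

doc_a = {"clinton": 3.2, "lady": 1.5, "senate": 0.8}  # toy tf-idf vectors
doc_b = {"clinton": 2.9, "senate": 0.5, "obama": 1.1}
cosine_distance(doc_a, doc_b)  # ~0.15: the closer to 0, the more similar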
# knn model using word_count as the feature
knn_model_wc = graphlab.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')
# knn model using tf-idf as the feature
knn_model_tfidf = graphlab.nearest_neighbors.create(people, features=['tf-idf'], label='name', distance='cosine')
When I used word counts as the feature for the nearest neighbor search, I found the articles of Adrienne Corri, Maria Damanaki, Diana B. Henriques, and Raylawni Branch closest to Hillary Clinton's article.
# the top four articles most similar to Hillary Clinton's, using word_count as the feature
query_wc = knn_model_wc.query(HClinton, verbose=False)
query_wc
The following lines of code print the names of the people whose articles are suggested as similar to Hillary Clinton's (including hers), along with the first few lines of each article.
As we can see from the text, the articles retrieved using word counts are not very similar to Hillary Clinton's: the closest article is that of Adrienne Corri, who is an actress; Maria Damanaki is the global managing director for oceans; Diana B. Henriques is an American financial journalist; and so on.
similar_wc = query_wc["reference_label"]
for name in similar_wc:
    print("{name}: {article}\n".format(name=name, article=people[people["name"] == name]["text"][0][0:200]))
The tf-idf model suggested the articles of Bill Clinton, Melanne Verveer, Barack Obama, and Sheffield Nelson as the top four closest to Hillary Clinton's.
# similar articles to Hillary Clinton's using tf-idf as the feature
query_tfidf = knn_model_tfidf.query(HClinton, verbose=False)
query_tfidf
The articles retrieved using tf-idf make more sense. The closest article to hers is Bill Clinton's, as he is her spouse and a politician. The other people listed, such as Barack Obama, are also politicians and may have more in common with her.
# First few lines of the articles of the people similar to Hillary Clinton, including hers.
similar_tfidf = query_tfidf["reference_label"]
for name in similar_tfidf:
    print("{name}: {article}\n".format(name=name, article=people[people["name"] == name]["text"][0][0:170]))
Analyzing document similarity by word count alone can be misleading, as common words such as "the", "of", and "in" may dominate rare, informative words. Term frequency-inverse document frequency (tf-idf) addresses this problem by magnifying the most important words in a document, which helps retrieve similar documents more effectively. As we saw with Hillary Clinton's article, simple word counts retrieved the articles of an actress, an oceans director, a financial journalist, and so on. The articles retrieved using tf-idf, on the other hand, were those of politicians, with Bill Clinton's article suggested first. This outcome matches common sense: someone reading about Hillary Clinton may well be interested in reading about Bill Clinton (her spouse) and other politicians, including Barack Obama, exactly as tf-idf retrieved.