In this post, I used Bidirectional Encoder Representations from Transformers (BERT) to classify whether a news article is fake or real. BERT is a state-of-the-art technique for Natural Language Processing (NLP) created and published by Google in 2018. Bidirectional means that it looks at both the left and the right context to understand the text. It can be used for next sentence prediction, question answering, language inference and more. Here, BERT is applied to news classification using the transformers library from Hugging Face, which provides a PyTorch interface.
The analysis was created and executed in a Google Colaboratory notebook and can be accessed here. Google Colaboratory, or Colab for short, is a free research tool provided by Google for executing Python code and performing machine learning tasks. It also allows us to use a Graphics Processing Unit (GPU) or a Tensor Processing Unit (TPU).
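To use a GPU, change the runtime type in Colab (Runtime > Change runtime type) and select GPU; a quick check like the sketch below (any equivalent check works) confirms which accelerator was assigned.
# confirm the accelerator assigned by Colab
import torch
print(torch.cuda.is_available())  # True when a GPU runtime is active
!nvidia-smi  # shell command that shows the GPU model and its memory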
Import libraries
# Important libraries
import pandas as pd
import numpy as np
import re # to use regular expression pattern
import datetime as dt # to parse to datetime
import string
from scipy import stats
from collections import defaultdict
#for data preprocessing
from sklearn.model_selection import train_test_split
# to evaluate model performance
from sklearn.metrics import confusion_matrix, classification_report
# for visualization
import matplotlib.pyplot as plt
from matplotlib import rc
from pylab import rcParams
import seaborn as sns
%matplotlib inline
The analysis is discussed step by step as follows:
The data is publicly available and can be downloaded as fake and true news separately. I downloaded the data to my local machine and read it from there. Though there are different ways to get data into Google Colab, I got the data by mounting my Google Drive into the Colab environment.
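For reference, mounting Google Drive inside Colab takes a short snippet like the one below; the NLP_files folder used in the read_csv calls is simply where I stored the CSV files in my own Drive.
# mount Google Drive so the files stored there are readable from the notebook
from google.colab import drive
drive.mount('/content/drive')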
true= pd.read_csv("drive/My Drive/NLP_files/True.csv", parse_dates=["date"]) # the true news data
fake= pd.read_csv("drive/My Drive/NLP_files/Fake.csv") # the fake news data
true.head(3) # the first few rows of real dataframe
fake.head(3)
# the size of fake and true news data sets
true.shape, fake.shape
# create a new column called label and set it to "true" for true news
true["label"]="true"
# create a new column called label and set it to "fake" for fake news
fake["label"]= "fake"
# parse the date column into datetime.
# Since the date column uses three different formats, we need three formats to parse.
def parsing_datetime(string):
    for f in ("%B %d, %Y", '%d-%b-%y', "%b %d, %Y"):  # e.g. "February 19, 2018", "19-Feb-18", "Feb 19, 2018"
        try:
            return dt.datetime.strptime(string, f)
        except ValueError:
            pass
# parse the date column of fake dataframe into datetime
fake.date= fake.date.apply(lambda x: parsing_datetime(x))
# Merge the fake and true dataframes
news= pd.concat([true,fake], axis=0, ignore_index=True) # index to have unique index
news.head(3)
sns.countplot(x="label", data=news)
plt.title("Count plot of fake and true news")
plt.show()
Regular expression patterns are used to detect and remove emoji symbols, URL links, HTML tags, special characters and punctuation marks.
def remove_pattern(text, patterns):
    """The function remove_pattern returns the new string with a set of patterns removed.
    Parameters:
    ------------
    text: the text from which the patterns will be removed
    patterns: the set of patterns (iterable) we are interested in removing from the text
    """
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text
# patterns to be extracted and to be removed from the data
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF\U000024C2-\U0001F251]+"
url= re.compile("https?://\S+|www\.\S+") # pattern for url
html= r'<.*?>' # pattern for html tag
num_with_text= r"\S*\d+\S*" # pattern for digit
reuters= r"(\s\(Reuters\))" # pattern to detect "(Reuters)", a common word in the true news
punctuation= r"[#@&%$~=\.;:\?,(){}\"\“\”\‘\'\*!\+`^<>\[\]\-]+" #pattern for punctuations and special characters
apostroph=r"\’s?"
# collect the patterns to remove
patterns=[emoji, url, html, num_with_text, apostroph, reuters, punctuation]
At this stage of data cleaning, the emoji symbols, URL links, HTML tags, digits and special characters are removed.
# Clean the title and text of merged data using regular expression patterns
news_clean_title = news.title.apply(remove_pattern, patterns= patterns)
news_clean_text= news.text.apply(remove_pattern, patterns= patterns)
news["cleaned_title"]= news_clean_title # add the cleaned title as new column
news["cleaned_text"]= news_clean_text # add the cleaned text as new column
dicmap= {"true": 0, "fake": 1} # label true news as 0 and fake news as 1
news["is_fake"]= news.label.apply(lambda x: dicmap[x])
news.head(2)
# copy the relevant columns to data_brt, which will be split into training, validation and test sets
data_brt= news[["cleaned_title","cleaned_text","is_fake","label"]]
data_brt.head(3)
#train_data, validate_data, test_data (70%, 15%, 15% respectively)
train_data, validate_data, test_data= np.split(data_brt.sample(frac=1, random_state=42), [ int(.7*len(news)), int(.85*len(news))])
# size of the training, validation and test data
train_data.shape, validate_data.shape, test_data.shape
Note that the steps in this section are similar to the steps in Valkov's article posted on the Curiously blog, with some adaptation to the data at hand.
!pip install -qq transformers
# For BERT tokenization and modeling
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
Tokenization is the process of breaking a text into smaller units called tokens, for example, breaking a sentence into a list of words. BERT provides the option to be case-sensitive (distinguishing uppercase and lowercase characters) or case-insensitive. In our case, the case-insensitive (uncased) version of BERT is used.
# uncased version of BERT
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=True)
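Before encoding a real news title, it may help to see what the tokenizer produces for a short made-up sentence; this is only an illustration and is not part of the classification pipeline.
# a quick look at WordPiece tokenization on a made-up sentence
example = "Fake news spreads faster than corrections"
tokens = tokenizer.tokenize(example)  # list of (sub)word tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # corresponding vocabulary ids
print(tokens)
print(token_ids)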
train_data.iloc[0,0] # get a sample news title from the training data
sample_txt=train_data.iloc[0,0]
# tokenizing sample text from training data
encoding = tokenizer.encode_plus(
sample_txt,
max_length=20,
add_special_tokens=True, # Add '[CLS]' and '[SEP]'
return_token_type_ids=False,
padding='max_length',
return_attention_mask=True,
return_tensors='pt', # Return PyTorch tensors
truncation=True # to truncate excess tokens to meet the maximum length
)
encoding.keys()
Parameter explanation: max_length caps the number of tokens per text, add_special_tokens adds the '[CLS]' and '[SEP]' tokens, padding='max_length' pads shorter texts up to max_length, return_attention_mask returns a mask with 1 for real tokens and 0 for padding, return_tensors='pt' returns PyTorch tensors, and truncation drops tokens beyond max_length. For more parameters and detailed explanations, please refer here.
encoding["input_ids"]
encoding["attention_mask"]# shows 1 for real token and 0 for pad tokens
Set the maximum length for our training data
To set the max_length, let us look at the token length of the news titles. In this post only the title of the news was considered for the analysis and classification. Even though the text of the news carries more detailed information and has better prediction potential, the limited memory capacity of the free version of Google Colab prevented us from using the news text for classification. A similar approach can nevertheless be applied to use the text of the news for prediction.
token_length=[] # placeholder for the number of tokens in each news title
for ttl in data_brt.cleaned_title:
    tokens = tokenizer.encode(ttl, max_length=512, truncation=True)
    token_length.append(len(tokens)) # list of token counts of each news title
plt.figure(figsize=(12,4))
#sns.distplot(token_length)
sns.countplot(token_length)
plt.title("Frequency of token count")
plt.xlabel("Number of tokens in news title")
plt.ylabel("Frequency")
plt.show()
# see the median, 95% quantile and maximum token count
print("median number of tokens = {},95% quantile = {}, maximum number of tokens = {}"\
.format(np.median(token_length),np.quantile(token_length,0.95),np.max(token_length)))
To be safe, we set the maximum length to 60, although a smaller value would also work since the majority of token counts are below 30.
max_len= 60 # maximum length
# define a class that can tokenize the data
class NewsDataset(Dataset):
    def __init__(self, text, targets, tokenizer, max_len):
        self.text = text
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self):
        return len(self.text)
    def __getitem__(self, item):
        row = str(self.text[item])
        target = self.targets[item]
        encoding = self.tokenizer.encode_plus(
            row,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            truncation=True, # to truncate excess tokens to meet the maximum length
            return_tensors='pt',
        )
        return {
            'row_text': row,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
The following function is specifically defined for the title of the news. For the text of the news, a similar approach can be used, but the input data has to be the column "cleaned_text"; a sketch of that variant is shown after the function below.
# create dataloader for news title
def create_data_loader(data, tokenizer, max_len, batch_size):
    ds = NewsDataset(
        text=data.cleaned_title.to_numpy(),
        targets=data.is_fake.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(ds, batch_size=batch_size, num_workers=4)
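For completeness, the variant for the news text could look like the sketch below; it only swaps the input column and was not run here because of the memory limits mentioned earlier (the function name is just illustrative).
# create dataloader for the full news text (illustrative variant, differs only in the input column)
def create_text_data_loader(data, tokenizer, max_len, batch_size):
    ds = NewsDataset(
        text=data.cleaned_text.to_numpy(),  # use cleaned_text instead of cleaned_title
        targets=data.is_fake.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(ds, batch_size=batch_size, num_workers=4)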
BATCH_SIZE = 16
train_data_loader = create_data_loader(train_data, tokenizer, max_len, BATCH_SIZE)
val_data_loader = create_data_loader(validate_data, tokenizer, max_len, BATCH_SIZE)
test_data_loader = create_data_loader(test_data, tokenizer, max_len, BATCH_SIZE)
data = next(iter(train_data_loader)) # get one batch from the training data loader
data.keys()
# the shape of input_ids, attention_mask and targets for a batch of 16
print(data['input_ids'].shape) # batch size * max_length
print(data['attention_mask'].shape) # batch size * max_length
print(data['targets'].shape)
Let us see what the result looks like:
data["row_text"][0:2] # the first two news title in training data
print(data['input_ids'][:2]) # the first two news input_ids in training data
print(data['targets'][:2]) # target value of the first two rows from a batch of 16
# We can confirm the above result from the training dataframe
train_data.head(2)
The BERT base-uncased pretrained model was used to train and validate the data.
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
# for the sample data
last_hidden_state, pooled_output = bert_model(
    input_ids=encoding['input_ids'],
    attention_mask=encoding['attention_mask'],
    return_dict=False # needed in newer versions of transformers to get a tuple instead of a ModelOutput object
)
last_hidden_state contains the hidden state of the last layer of the network for each token in the news title, whereas pooled_output is a summary of the whole news title derived from last_hidden_state.
last_hidden_state.shape # the number of hidden units in the feedforward-networks=768
pooled_output.shape
# to make the classification
class NewsClassifier(nn.Module):
    def __init__(self, n_classes):
        super(NewsClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.1) # set dropout probability for regularization
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
    def forward(self, input_ids, attention_mask):
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False # needed in newer versions of transformers to get a tuple instead of a ModelOutput object
        )
        output = self.drop(pooled_output)
        return self.out(output)
# create a classifier instance and move it to the GPU
model = NewsClassifier(2) # the argument is 2 because we have two classes, fake and real news labels
model = model.to(device)
# move the batch of training data to GPU
input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
print(input_ids.shape) # batch size x seq length
print(attention_mask.shape) # batch size x seq length
# predicted probabilities from the trained model using softmax function
nn.functional.softmax(model(input_ids, attention_mask), dim=1)
Recommendations for fine-tuning by BERT's authors: a batch size of 16 or 32, an Adam learning rate of 5e-5, 3e-5 or 2e-5, and 2 to 4 epochs. The settings below (batch size 16, learning rate 2e-5, 4 epochs) follow these recommendations.
#Training
EPOCHS = 4
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
Cross entropy measures the performance of a classification model by comparing the actual class probabilities with the predicted probabilities (for a more detailed explanation please refer here).
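As a purely illustrative sanity check (the numbers below are made up), nn.CrossEntropyLoss takes the raw logits and the integer class labels and returns the average negative log-likelihood of the correct classes.
# tiny illustrative example of cross entropy on made-up logits
example_logits = torch.tensor([[2.0, -1.0],   # leans strongly towards class 0 (true)
                               [0.5,  1.5]])  # leans towards class 1 (fake)
example_targets = torch.tensor([0, 1])        # the actual labels
print(nn.CrossEntropyLoss()(example_logits, example_targets)) # relatively small loss, since both predictions are correct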
# helper function to train the model
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        _, preds = torch.max(outputs, dim=1) # take the class with the highest score as the prediction
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets) # number of correct predictions
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return correct_predictions.double() / n_examples, np.mean(losses) # accuracy, loss
# helper function to evaluate the model
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1) # take the maximum
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses) # accuracy, loss
%%time
history = defaultdict(list) # placeholder to store the history of training and validation performance
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    # training performance
    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, device, scheduler, len(train_data))
    print(f'Train loss {train_loss} accuracy {train_acc}')
    # validation performance
    val_acc, val_loss = eval_model(model, val_data_loader, loss_fn, device, len(validate_data))
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc.cpu().item()) # .item() so the values can be plotted later
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc.cpu().item())
    history['val_loss'].append(val_loss)
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin') # store the best model
        best_accuracy = val_acc
# plot the accuracy of training and validation data for different epoch
plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);
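Since the weights of the best epoch were saved as best_model_state.bin during training, you can optionally reload that checkpoint before scoring the test set (a small sketch; evaluating the in-memory model, as done below, also works).
# optionally reload the best checkpoint saved during training
model.load_state_dict(torch.load('best_model_state.bin'))
model = model.to(device)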
# accuracy of the model for test data
test_acc, _ = eval_model(model, test_data_loader,loss_fn, device, len(test_data))
test_acc.item()
The accuracy of the model on the test data is very good, and the model can be used on new data sets.
# A function to make predictions for new data. It returns the actual text,
# prediction, prediction probability and real labeling
def get_predictions(model, data_loader):
    model = model.eval()
    news_texts = []
    predictions = []
    prediction_probs = []
    real_values = []
    with torch.no_grad():
        for d in data_loader:
            texts = d["row_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)
            probs = nn.functional.softmax(outputs, dim=1) # convert the logits into probabilities
            news_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)
    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()
    return news_texts, predictions, prediction_probs, real_values
y_news_texts, y_pred, y_pred_probs, y_test = get_predictions(
model, test_data_loader)
# classification report for test data
print(classification_report(y_test, y_pred, target_names=["true","fake"]))
# to display the confusion matrix
def show_confusion_matrix(confusion_matrix):
    hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
    hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
    plt.ylabel('True label')
    plt.xlabel('Predicted label');
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=[0,1], columns=[0,1]) # [0,1] is class name
show_confusion_matrix(df_cm)
We can also look at examples from the test data and their corresponding predictions.
idx = 2 # index
class_names= ["true","fake"]
news_text_ttl = y_news_texts[idx] # extract the news title with the given index
true_label = y_test[idx]
pred_df = pd.DataFrame({'class_names': class_names, 'values': y_pred_probs[idx]})
print("\n".join(wrap(news_text_ttl)))
print()
print(f'True label: {class_names[true_label]}')
y_pred[2] #prediction
test_data.iloc[2,:]# we can confirm the third row(index of two) from the test data
Some news items from Snopes, a fact-checking website, were collected to test our model. Snopes rates news as false, true, mixed and so on. To test the model, archived news from 2016/2017 was collected from Snopes, because the data at hand was from 2015 to 2018. Current news might not be a good test, as news in 2020 is dominated by issues such as COVID-19, so older news was used. A total of twenty news titles, half of them labeled as fake and the rest as true, were used to test our model.
# news title and their corresponding labeling by Snopes
snopes_data= ["Is This James Earl Jones Dressed as Darth Vader",
"David Rockefeller's Sixth Heart Transplant Successful at Age 99",
"Did Bloomington Police Discover Over 200 Penises During Raid at a Mortician's Home?",
"Is the Trump Administration Price Gouging Puerto Rico Evacuees and Seizing Passports?",
"2017 Tainted Halloween Candy Reports 11/5/2014",
"Did President Trump Say Pedophiles Will Get the Death Penalty?",
"Michelle Obama Never Placed Her Hand Over Her Heart During the National Anthem?",
"Katy Perry Reveals Penchant for Cannibalism?" ,
"Is a Virginia Church Ripping Out an 'Offensive' George Washington Plaque?",
"Were Scientists Caught Tampering with Raw Data to Exaggerate Sea Level Rise?",
"Did Trump Retweet a Cartoon of a Train Hitting a CNN Reporter?",
"Did Pipe-Bombing Suspect Cesar Sayoc Attend Donald Trump Rallies?",
"Did President Trump’s Grandfather Beg the Government of Bavaria Not to Deport Him?",
"Did Gun Violence Kill More People in U.S. in 9 Weeks than U.S. Combatants Died on D-Day?",
"Did the Florida Shooter’s Instagram Profile Picture Feature a ‘MAGA’ Hat?",
"Wisconsin Department of Natural Resources Removes References to ‘Climate’ from Web Site",
"Hillary Clinton Referenced RFK Assassination as Reason to Continue 2008 Campaign",
"Did Richard Nixon Write a Letter Predicting Donald Trump’s Success in Politics?",
"Did a Twitter User Jeopardize Her NASA Internship by Insulting a Member of the National Space Council?",
"Did WaPo Headline Call IS Leader al-Baghdadi an ‘Austere Religious Scholar’?"]
label_actual = ["fake", "fake","fake","fake","fake","mixed", "fake","fake","mostly_false","fake","true",
"true", "true","true","true","true","true","true","true","true"] # rated by Snopes
label_adjusted = ["fake", "fake","fake","fake","fake","fake", "fake","fake","fake","fake","true","true",
"true","true","true","true","true","true","true","true"] # adjusted to fake or true
Clean the news titles in the same way as the training data was cleaned and then make predictions.
snopes_pred=[]
count_true_pred=0
for pos, ttl in enumerate(snopes_data):
    ttl = remove_pattern(ttl, patterns) # clean the title
    encoded_snopes = tokenizer.encode_plus(
        ttl,
        max_length=max_len,
        add_special_tokens=True,
        return_token_type_ids=False,
        padding="max_length",
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt',
    )
    input_ids = encoded_snopes['input_ids'].to(device)
    attention_mask = encoded_snopes['attention_mask'].to(device)
    output = model(input_ids, attention_mask)
    _, prediction = torch.max(output, dim=1)
    pred_label = class_names[prediction] # predicted class
    snopes_pred.append(pred_label)
    # compare the predicted and actual class label
    count_true_pred += (pred_label == label_adjusted[pos])
print(f'Snopes news title: {snopes_data}')
print(f'Prediction : {snopes_pred}')
print(f'accuracy : {count_true_pred/len(snopes_data)}')
The BERT model has better accuracy than the other models built with word counts, TF-IDF and word2vec (available in my GitHub repository). The test accuracy of the BERT model was approximately 0.98, whereas the accuracies obtained by the other models were between 0.93 and 0.94. Although the memory limit prevented us from making predictions using the news text, we expect a similar advantage over the corresponding models; this is left for further experiments.