In this post, I am going to demonstrate web scraping using BeautifulSoup, a Python package for parsing HTML and XML documents. Web scraping is a technique used to extract data from websites and save it in a more structured format. Here, information about some companies was extracted from the New Zealand companies registry website. The webpage allows us to search for companies by name, number or NZBN (New Zealand Business Number). The goal of this scraping is to gather information about companies and compile it into a more structured or semi-structured form, so that follow-up analysis becomes easier.
To limit the number of companies, the keyword "Kiwi company" was used to search only for Kiwi companies. The list of Kiwi companies found in the registry is available here. Note that if the keyword is just "Kiwi", the list will be different. Each company has a link that leads to a separate page with several tabs providing different information about the company. The task was to collect all of that information for each company and bring it together. Let us see how it was done!
The following Python libraries were used to perform the tasks described in the comments next to them.
import requests # to request the webpage
from bs4 import BeautifulSoup # to make soup and pull data out of HTML
import urllib.robotparser # to check whether we are allowed to scrape the website
import json # to save the output as json file
import pandas as pd # to see saved data as dataframe
import IPython # to display the webpage
Before scraping a web page, it is good practice to check whether we are allowed to do so. Website owners list their rules in a file called robots.txt.
Permission to scrape a web page can be checked using the following code:
# read the rules of the web site found in robots.txt
robots = requests.get("https://companies-register.companiesoffice.govt.nz/robots.txt")
#print(robots.text)
robotpars = urllib.robotparser.RobotFileParser() #instantiate the RobotFileParser
#set the robots.txt url of the New Zealand Companies Office
robotpars.set_url("https://companies-register.companiesoffice.govt.nz/robots.txt")
robotpars.read() # Reads the robots.txt
# check whether the user agent can fetch the url; True means fetching is allowed
print("Can we fetch New Zealand Companies Office website?", \
robotpars.can_fetch("*", "https://companies-register.companiesoffice.govt.nz"))
Using the search word "Kiwi company", hundreds of companies are displayed as shown below.
# the link for the search result of Kiwi companies on the New Zealand Companies Office website
url_Kiwi= "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/search?q=kiwi+company&entityTypes=LTD&entityTypes=UNLTD&entityTypes=COOP&entityTypes=ASIC&entityTypes=NON_ASIC&entityStatusGroups=ALL&incorpFrom=&incorpTo=&addressTypes=ALL&addressKeyword=&start=0&limit=15&sf=&sd=&advancedPanel=true&mode=advanced"
# display the search result page
IPython.display.IFrame(url_Kiwi, width=800, height=400)
Once we have checked that we are allowed to scrape the website, the next step is to gather the relevant information about the companies:
- First, get the link for the search result of Kiwi companies
- Make a soup for the page containing the search result
- Get the company numbers that appear in the search result (I used only the first page)
- Knowing the company numbers, collect all the relevant information, including:
  - summary information
  - directors' information
  - shareholders' information
  - address information
- Save the collected data as a JSON file and view the data as a dataframe
Google Chrome was used to inspect the HTML source code.
def getSoup(url):
    """Make a soup for a webpage given its url."""
    # request the webpage and get the text
    pagetext = requests.get(url).text
    # make a soup using the html parser for the content of the web page
    soup = BeautifulSoup(pagetext, "html.parser")
    return soup
def getCompanyNumberFP(soup):
    """Get the company numbers displayed on the first page of the search result.
    soup: the soup of a webpage that contains the companies we are looking for, e.g. 'Kiwi companies'."""
    # links to company detail pages
    links = soup.find_all("a", attrs={"class": "link"})  # finds all <a> tags with attribute class="link"
    companyNum = []  # placeholder for company numbers
    for link in links:
        if "javascript:viewCompanyDetails" in link.get("href"):
            href = link.get("href")  # get the href
            num = href.replace("javascript:viewCompanyDetails('", "")
            num = num.replace("');", "")  # keep only the number
            companyNum.append(num)  # append the company numbers listed in a single page
    return companyNum
# the first 15 company numbers that appear in the search result
soupKiwi= getSoup(url_Kiwi) # get the soup for Kiwi search result page
print(getCompanyNumberFP(soupKiwi))
Once we have the company numbers from above, we can proceed to each company's detail page and collect all the relevant information. We can collect the information for one company at a time, or gather information about many companies in one go. The latter takes longer to execute and might affect the performance of the website, so I designed the program to collect information about a single entry at a time and append it to a datastore.
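As a minimal sketch of that design (the one-second pause and the datastore list are illustrative choices; makeAsoup() and getSummaryInfo() are the helper functions defined in the sections below), the collection loop looks roughly like this:
import time  # to pause between requests

datastore = []  # illustrative in-memory datastore: one dictionary per company
for number in ["1658678"]:  # e.g. the numbers returned by getCompanyNumberFP()
    entrySoup = makeAsoup(number)  # request this company's detail page (defined below)
    datastore.append(getSummaryInfo(entrySoup))  # extract the relevant fields (defined below)
    time.sleep(1)  # short pause between requests, to be gentle on the website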
def makeAsoup(companyNumber):
    """Make a soup for a webpage that contains information about a company, given its number.
    companyNumber is used as an ID for the specific company."""
    a = "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/"
    link = a + str(companyNumber)
    # request the webpage and get the text
    pagetext = requests.get(link).text
    # make a soup
    entrySoup = BeautifulSoup(pagetext, "html.parser")
    return entrySoup
def entryInfoTabs(entrySoup):
    """The function entryInfoTabs gets the soup of an entry (a company) and returns the links to navigate
    to each information tab, such as summary, directors, addresses, shareholders and so on."""
    infoTab = {}  # placeholder for tab name and link pairs
    # navigate to the unordered list of tabs
    tabs = entrySoup.find("ul", attrs={"class": "tabs"})  # finds a <ul> tag with attribute class="tabs"
    li = tabs.find_all("li")  # finds all list items
    for i in range(len(li)):
        try:
            # combine the main page url with the relative address of each information tab (summary, directors, ...)
            url = "https://app.companiesoffice.govt.nz"  # main page
            infoTab[li[i].get("id")] = url + li[i].a.get("href")
        except Exception:
            pass
    return infoTab
# see the tab links for the company whose number is 1658678, the first company in the search result
entrySoup= makeAsoup("1658678")
entryInfoTabs(entrySoup)
Look at the page displayed below to see and navigate the tabs that provide information about a specific company. Among them are Company Summary, Addresses, Directors, Shareholdings, Documents and so on.
# display the tabs for a single company
companyinfo_url= "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/1658678"
IPython.display.IFrame(companyinfo_url, width=800, height=400)
Using the function entryInfoTabs(), we can get the link to navigate to each information tab, including the company summary, directors, shareholders and addresses. To keep the code efficient, I used "entrySoup" (the soup of a company page, built from its company number) as the input for the functions that extract information from each tab. The code can also be adjusted to accept a company number as input instead, as sketched below. This will become clearer after looking at the following functions and their sample output.
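For example, a thin wrapper (a sketch only; the name getSummaryByNumber is hypothetical and not part of the code in this post) could build the soup from a company number and then reuse the tab-specific extractor:
def getSummaryByNumber(companyNumber):
    """Return the relevant summary information for a company, given its number."""
    entrySoup = makeAsoup(companyNumber)  # build the soup from the company number
    return getSummaryInfo(entrySoup)      # reuse the summary extractor defined below

# example usage, same company as above:
# getSummaryByNumber("1658678")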
# get summary information
def getSummaryInfo(entrySoup):
    """The function getSummaryInfo gets the soup of an entry (a company) as input and
    returns the relevant summary information as a dictionary, where the keys
    are the fields we are interested in and the values are their values."""
    divs = entrySoup.find_all("div", attrs={"class": "row"})
    summary = {}
    for div in divs:
        labl = div.find("label")
        try:
            if labl.has_attr("id"):
                key = labl.get_text(strip=True)
            elif labl.has_attr("for"):
                key = labl.get_text(strip=True)
            elif not labl.has_attr("for") and not labl.has_attr("id"):
                key = labl.get_text(strip=True)
            labl.clear()  # clear the content of the label tag
            # format the key and remove the colon (:) and spaces
            key = key.title().replace(":", "")
            key = key.replace(" ", "")
            value = div.get_text(strip=True)
            summary[key] = value  # collect the summary information
        except Exception:
            pass
    # company information we are interested in, including the company record link to be used as a reference
    relevantInfo = ['CompanyNumber', 'Nzbn', 'IncorporationDate', 'CompanyStatus', 'EntityType', \
                    'ConstitutionFiled', 'ArFilingMonth', 'UltimateHoldingCompany', 'CompanyRecordLink']
    # keep only the relevant summary information
    relevantSummary = {}
    for key in summary.keys():
        if key in relevantInfo:
            # exception handling in case the page doesn't have the required information
            try:
                if key == "ArFilingMonth":
                    summary[key] = summary[key].split("\r")[0]  # keep only the relevant value
                if key == "UltimateHoldingCompany":
                    summary[key] = summary[key].capitalize().replace("edit", "")  # keep only the relevant text
                relevantSummary[key] = summary[key]
            except Exception:
                pass
    return relevantSummary
# a sample of how the summary information looks; CompanyRecordLink was added to be used as a reference
getSummaryInfo(entrySoup) #entrySoup=makeAsoup(1658678)
def getDirectorsInfo(entrySoup):
    """The function getDirectorsInfo() gets the soup of a company as input
    and returns the list of company directors and their information."""
    director = entryInfoTabs(entrySoup)["directorsTab"]  # get the link that contains the directors' information
    directorSoup = getSoup(director)  # make a soup for the page containing the directors' information
    # collect the directors' information from the page
    dirs = directorSoup.find_all("td", attrs={"class": "director"})  # all <td> tags with attribute class="director"
    dirList = []  # placeholder for the list of directors
    dirInfo = {}  # placeholder for a director's information
    for dr in dirs:  # loop over directors
        drow = dr.find_all("div", attrs={"class": "row"})  # all <div> tags with attribute class="row"
        dirInfo = {}  # make it empty for the next director
        for row in drow:
            rlabel = row.find("label")
            rname = rlabel.get_text()  # get the field name
            rvalue = rlabel.parent.get_text(strip=True).replace(rname, "")  # get the formatted value
            rvalue = rvalue.replace("\r\n", "")
            rname = rname.replace(":", "")  # delete the colon (:) from the field name
            dirInfo[rname] = rvalue  # collect the director's information
            try:  # drop fields we are not interested in, if present
                # if the information is about consent or shareholder status
                if rname == "Consent" or rname == "Shareholder":
                    del dirInfo[rname]
            except KeyError:
                pass  # pass if the key is not found
        dirList.append(dirInfo)  # add each director to the list if there is more than one
    return dirList
# sample view of directors information
getDirectorsInfo(entrySoup) # entrySoup=makeAsoup(1658678)
def getSharesInfo(entrySoup):
    """The function getSharesInfo() gets the soup of a company as input
    and returns the list of company shareholders and their information."""
    shareholder = entryInfoTabs(entrySoup)["shareholdingsTab"]  # get the link that contains the shareholders' information
    shareSoup = getSoup(shareholder)  # get the soup for the shareholdings page
    # collect the shareholders' information from the page
    # a list containing allocations
    shares = shareSoup.find_all("div", attrs={"class": "allocationDetail"})  # all <div> tags with attribute class="allocationDetail"
    shareInfo = []  # placeholder for a shareholder's information
    shareList = []  # placeholder for the shareholder list
    for share in shares:  # loop over allocations/shareholders
        drow = share.find_all("div", attrs={"class": "row"})  # all <div> tags with attribute class="row"
        shareInfo = []  # make it empty for the next shareholder
        count = 0  # counter
        for row in drow:
            nextrow = row.find_next("label")
            nextrow.clear()  # clear the first label tag
            count = count + 1
            if count == 1:
                rtext = row.get_text()
                allocation = rtext.splitlines()  # split the share number and share portion
                shareNum = allocation[0].strip()  # share number
                sharePort = allocation[1].strip()  # share portion
                shareInfo.extend([shareNum, sharePort])  # add the share number and portion to the list
            else:
                rtext = row.get_text(strip=True)
                shareInfo.append(rtext)  # information about a shareholder
        count = 0  # restart counting for the next shareholder
        # name the information gathered
        shareInfoDic = {}  # create a dictionary for the shareholder's info
        for key, value in zip(["ShareNumber", "SharePortion", "Name", "Address"], shareInfo):
            shareInfoDic[key] = value
        shareList.append(shareInfoDic)  # add each shareholder to the list if there is more than one
    return shareList
# a sample view of how the shareholders' information looks
getSharesInfo(entrySoup) #entrySoup=makeAsoup(1658678)
def getAddress(entrySoup):
    """The function getAddress() gets the soup of a company as input
    and returns the list of the company's addresses."""
    # make a soup for the addresses page
    addrUrl = entryInfoTabs(entrySoup)["addressesTab"]  # get the link that contains the addresses information
    addressSoup = getSoup(addrUrl)  # make a soup for the addresses page
    # get the list of addresses
    addressLines = addressSoup.find_all("div", attrs={"class": "addressLine"})  # the <div> tags with class="addressLine"
    addressList = []  # placeholder for the collection of addresses
    try:
        for adrsLine in addressLines:
            # build a new dictionary for each address line so earlier entries are not overwritten
            address = {"Address": adrsLine.get_text(strip=True)}  # get the stripped text of the address
            addressList.append(address)  # list of addresses
    except Exception:
        pass
    return addressList
#Sample Address view
getAddress(entrySoup) #entrySoup=makeAsoup(1658678)
To make the code flexible, I used a list of company numbers as one of the parameters of the saveMultiData() function. We could also automatically gather information about every company listed in the search result; a sketch of how to page through the results follows the code below. For demonstration purposes, only the companies that appear on the first page (15 companies) of the search result were collected. First, their numbers were collected as a list using the function getCompanyNumberFP() and then passed as an argument to the function saveMultiData() to save the data and display the summary.
def saveMultiData(companyNumList, fileName):
    """The function saveMultiData() saves all the relevant information about the companies
    whose numbers are given and returns the information as a dataframe.
    We can use this function for a single company, but make sure the number is given as a list.
    Parameters:
    - companyNumList is a list that contains company numbers.
    - fileName is the name of the file we want to save the data as."""
    alldata = []  # collect the information for multiple companies
    for i in range(len(companyNumList)):
        entrySoup = makeAsoup(companyNumList[i])
        relevantSummary = getSummaryInfo(entrySoup)
        dirList = getDirectorsInfo(entrySoup)
        shareList = getSharesInfo(entrySoup)
        addressList = getAddress(entrySoup)
        # collect all the information together in a dictionary for a single company
        entryData = {**relevantSummary, **{"Directors": dirList}, \
                     **{"Shareholders": shareList}, **{"Addresses": addressList}}
        # gather all companies' info
        alldata.append(entryData)
    # save the data as a json file
    with open(fileName, "w", encoding="utf8") as jsonfile:
        json.dump(alldata, jsonfile, ensure_ascii=False, indent=4)
    return pd.read_json(fileName).head()  # read and display the saved file as a pandas dataframe
firstpage = getCompanyNumberFP(soupKiwi)  # where soupKiwi = getSoup(url_Kiwi), and url_Kiwi is the url of the search result
print(firstpage)
# save the companies from the first page of the search result and display the first five
saveMultiData(firstpage, "KiwiCompanies")
# or it is possible to read the saved json file using :
jsonfile= pd.read_json("KiwiCompanies")
jsonfile.head(2) # see the first two companies
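To go beyond the first page of results, one option (a sketch only, assuming the search URL keeps accepting the start and limit query parameters visible in url_Kiwi, with 15 results per page) is to step the start offset and reuse getCompanyNumberFP() on each page:
allNumbers = []  # company numbers collected across several result pages
for start in range(0, 45, 15):  # first three pages of 15 results each (illustrative)
    pageUrl = url_Kiwi.replace("start=0", "start=" + str(start))  # shift the result offset
    allNumbers.extend(getCompanyNumberFP(getSoup(pageUrl)))
# saveMultiData(allNumbers, "KiwiCompaniesAll")  # then save them all at once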
By scraping the web pages of New Zealand's public companies registry, we were able to collect information about companies, including their addresses, directors and shareholders, and put it together in a structured format.
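As one example of the follow-up analysis this enables (a sketch, assuming the "KiwiCompanies" file saved above and that every saved record contains a CompanyNumber field), the nested directors' records can be flattened into their own dataframe with pandas:
with open("KiwiCompanies", encoding="utf8") as f:
    data = json.load(f)
# one row per director, keeping the parent company's number for reference
directors = pd.json_normalize(data, record_path="Directors", meta=["CompanyNumber"])
print(directors.head())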
For reference, here are the functions used in this post together with their docstrings:
- getSoup(url):
  """Make a soup for a webpage given its url."""
- getCompanyNumberFP(soup):
  """Get the company numbers displayed on the first page of the search result.
  soup: the soup of a webpage that contains the companies we are looking for, e.g. 'Kiwi companies'."""
- makeAsoup(companyNumber):
  """Make a soup for a webpage that contains information about a company, given its number."""
- entryInfoTabs(entrySoup):
  """The function entryInfoTabs gets the soup of an entry (a company) and returns the links to navigate
  to each information tab, such as summary, directors, addresses, shareholders and so on."""
- getSummaryInfo(entrySoup):
  """The function getSummaryInfo gets the soup of an entry (a company) as input and
  returns the relevant summary information as a dictionary, where the keys
  are the fields we are interested in and the values are their values."""
- getDirectorsInfo(entrySoup):
  """The function getDirectorsInfo() gets the soup of a company as input
  and returns the list of company directors and their information."""
- getSharesInfo(entrySoup):
  """The function getSharesInfo() gets the soup of a company as input
  and returns the list of company shareholders and their information."""
- getAddress(entrySoup):
  """The function getAddress() gets the soup of a company as input
  and returns the list of the company's addresses."""
- saveMultiData(companyNumList, fileName):
  """The function saveMultiData() saves all the relevant information about the companies
  whose numbers are given and returns the information as a dataframe.
  We can use this function for a single company, but make sure the number is given as a list.
  Parameters:
  - companyNumList is a list that contains company numbers.
  - fileName is the name of the file we want to save the data as."""