In this post, I am going to demonstrate web scraping using BeautifulSoup, a Python package for parsing HTML and XML documents. Web scraping is a technique used to extract data from websites and save it in a more structured format. Here, information about some companies was extracted from the New Zealand companies registry website. The webpage allows us to search for companies by name, number or NZBN (New Zealand Business Number). The goal of this scraping is to gather information about companies and compile it into a more structured or semi-structured form, so that follow-up analysis becomes easier.
To limit the number of companies, the keyword "Kiwi company" was used to search only for Kiwi companies. The list of Kiwi companies found in the registry is available here. Note that if the keyword is just "Kiwi", the list will be different. Each company has a link that leads to a separate page with several tabs providing different information about the company. The task was to collect all of that information for each company and bring it together. Let us see how it was done!
The following Python libraries were used to perform the tasks described in the comments next to them.
import requests # to request the webpage
from bs4 import BeautifulSoup # to make soup and pull data out of HTML
import urllib.robotparser # to check whether we are allowed to scrape the website
import json # to save the output as json file
import pandas as pd # to see saved data as dataframe
import IPython # to display the webpage
Before scraping a web page, it is good practice to check whether we are allowed to do so. Website owners list their rules in a file called robots.txt.
Permission to scrape a web page can be checked using the following code:
# read the rules of the web site found in robots.txt
robots = requests.get("https://companies-register.companiesoffice.govt.nz/robots.txt")
#print(robots.text)
robotpars = urllib.robotparser.RobotFileParser() #instantiate the RobotFileParser
#set the robots.txt url of the New Zealand Companies Office
robotpars.set_url("https://companies-register.companiesoffice.govt.nz/robots.txt")
robotpars.read() # Reads the robots.txt
# check whether the user agent can fetch the url; True means fetching is allowed
print("Can we fetch New Zealand Companies Office website?", \
robotpars.can_fetch("*", "https://companies-register.companiesoffice.govt.nz"))
Using the search word "Kiwi company", hundreds of companies are displayed as shown below.
# the link for the search result of Kiwi companies on the New Zealand Companies Office website
url_Kiwi= "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/search?q=kiwi+company&entityTypes=LTD&entityTypes=UNLTD&entityTypes=COOP&entityTypes=ASIC&entityTypes=NON_ASIC&entityStatusGroups=ALL&incorpFrom=&incorpTo=&addressTypes=ALL&addressKeyword=&start=0&limit=15&sf=&sd=&advancedPanel=true&mode=advanced"
# display the search result page
IPython.display.IFrame(url_Kiwi, width=800, height=400)
Once we have checked that we are allowed to scrape the website, the next step is to gather the relevant information about the companies:
- First, get the link for the search result of Kiwi companies
- Make a soup for the page containing the search result
- Get the company numbers that appear in the search result (I used only the first page)
- Knowing the company numbers, collect all the relevant information, including:
  - summary information
  - directors' information
  - shareholders' information
  - address information
- Save the collected data as a JSON file and view the data as a dataframe
Google Chrome was used to inspect the HTML source code.
def getSoup(url):
    """Make a soup for a webpage given its url."""
    # request the webpage and get the text
    pagetext = requests.get(url).text
    # make a soup using the html parser for the content of the web page
    soup = BeautifulSoup(pagetext, "html.parser")
    return soup
def getCompanyNumberFP(soup):
    """Get the company numbers displayed on the first page of the search result.
    soup: the soup of a webpage that contains the companies we are looking for, e.g. 'Kiwi companies'."""
    # links to company detail pages
    links = soup.find_all("a", attrs={"class": "link"})  # finds all <a> tags with attribute class="link"
    companyNum = []  # placeholder for company numbers
    for link in links:
        if "javascript:viewCompanyDetails" in link.get("href"):
            href = link.get("href")  # get the href
            num = href.replace("javascript:viewCompanyDetails('", "")
            num = num.replace("');", "")  # keep only the number
            companyNum.append(num)  # append the company numbers listed in a single page
    return companyNum
# the first 15 company numbers that appear in the search result
soupKiwi= getSoup(url_Kiwi) # get the soup for Kiwi search result page
print(getCompanyNumberFP(soupKiwi))
Once we have the company numbers from above, we can proceed to each company's detail page and collect all the relevant information. We can collect the information for one company at a time, or gather information about many companies in one go. The latter takes longer to execute and might affect the performance of the website, so I designed the program to collect information about a single entry at a time and append it to a datastore.
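As a minimal sketch of that design (the one-second pause and the datastore list are illustrative choices; makeAsoup() and getSummaryInfo() are the helper functions defined in the sections below), the collection loop looks roughly like this:
import time  # to pause between requests

datastore = []  # illustrative in-memory datastore: one dictionary per company
for number in ["1658678"]:  # e.g. the numbers returned by getCompanyNumberFP()
    entrySoup = makeAsoup(number)  # request this company's detail page (defined below)
    datastore.append(getSummaryInfo(entrySoup))  # extract the relevant fields (defined below)
    time.sleep(1)  # short pause between requests, to be gentle on the website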
def makeAsoup(companyNumber):
    """Make a soup for a webpage that contains information about a company, given its number.
    companyNumber is used as an ID for the specific company."""
    a = "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/"
    link = a + str(companyNumber)
    # request the webpage and get the text
    pagetext = requests.get(link).text
    # make a soup
    entrySoup = BeautifulSoup(pagetext, "html.parser")
    return entrySoup
def entryInfoTabs(entrySoup):
    """The function entryInfoTabs gets the soup of an entry (a company) and returns the links to navigate
    to each information tab, such as summary, directors, addresses, shareholders and so on."""
    infoTab = {}  # placeholder for tab name and link pairs
    # navigate to the unordered list of tabs
    tabs = entrySoup.find("ul", attrs={"class": "tabs"})  # finds a <ul> tag with attribute class="tabs"
    li = tabs.find_all("li")  # finds all list items
    for i in range(len(li)):
        try:
            # combine the main page url with the relative address of each information tab (summary, directors, ...)
            url = "https://app.companiesoffice.govt.nz"  # main page
            infoTab[li[i].get("id")] = url + li[i].a.get("href")
        except Exception:
            pass
    return infoTab
# see the tab links for the company whose number is 1658678, the first company in the search result
entrySoup= makeAsoup("1658678")
entryInfoTabs(entrySoup)
Look at the page displayed below to see and navigate the tabs that provide information about a specific company. Among them are Company Summary, Addresses, Directors, Shareholdings, Documents and so on.
# display the tabs for a single company
companyinfo_url= "https://app.companiesoffice.govt.nz/companies/app/ui/pages/companies/1658678"
IPython.display.IFrame(companyinfo_url, width=800, height=400)
Using the function entryInfoTabs(), we can get the link to navigate to each information tab, including the company summary, directors, shareholders and addresses. To keep the code efficient, I used "entrySoup" (the soup of a company page, built from its company number) as the input for the functions that extract information from each tab. The code can also be adjusted to accept a company number as input instead, as sketched below. This will become clearer after looking at the following functions and their sample output.
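For example, a thin wrapper (a sketch only; the name getSummaryByNumber is hypothetical and not part of the code in this post) could build the soup from a company number and then reuse the tab-specific extractor:
def getSummaryByNumber(companyNumber):
    """Return the relevant summary information for a company, given its number."""
    entrySoup = makeAsoup(companyNumber)  # build the soup from the company number
    return getSummaryInfo(entrySoup)      # reuse the summary extractor defined below

# example usage, same company as above:
# getSummaryByNumber("1658678")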
# get summary information
def getSummaryInfo(entrySoup):
    """The function getSummaryInfo gets the soup of an entry (a company) as input and
    returns the relevant summary information as a dictionary, where the keys
    are the fields we are interested in and the values are their values."""
    divs = entrySoup.find_all("div", attrs={"class": "row"})
    summary = {}
    for div in divs:
        labl = div.find("label")
        try:
            if labl.has_attr("id"):
                key = labl.get_text(strip=True)
            elif labl.has_attr("for"):
                key = labl.get_text(strip=True)
            elif not labl.has_attr("for") and not labl.has_attr("id"):
                key = labl.get_text(strip=True)
            labl.clear()  # clear the content of the label tag
            # format the key and remove the colon (:) and spaces
            key = key.title().replace(":", "")
            key = key.replace(" ", "")
            value = div.get_text(strip=True)
            summary[key] = value  # collect the summary information
        except Exception:
            pass
    # company information we are interested in, including the company record link to be used as a reference
    relevantInfo = ['CompanyNumber', 'Nzbn', 'IncorporationDate', 'CompanyStatus', 'EntityType', \
                    'ConstitutionFiled', 'ArFilingMonth', 'UltimateHoldingCompany', 'CompanyRecordLink']
    # keep only the relevant summary information
    relevantSummary = {}
    for key in summary.keys():
        if key in relevantInfo:
            # exception handling in case the page doesn't have the required information
            try:
                if key == "ArFilingMonth":
                    summary[key] = summary[key].split("\r")[0]  # keep only the relevant value
                if key == "UltimateHoldingCompany":
                    summary[key] = summary[key].capitalize().replace("edit", "")  # keep only the relevant text
                relevantSummary[key] = summary[key]
            except Exception:
                pass
    return relevantSummary
# a sample of how the summary information looks; CompanyRecordLink was added to be used as a reference
getSummaryInfo(entrySoup) #entrySoup=makeAsoup(1658678)
def getDirectorsInfo(entrySoup):
    """The function getDirectorsInfo() gets the soup of a company as input
    and returns the list of company directors and their information."""
    director = entryInfoTabs(entrySoup)["directorsTab"]  # get the link that contains the directors' information
    directorSoup = getSoup(director)  # make a soup for the page containing the directors' information
    # collect the directors' information from the page
    dirs = directorSoup.find_all("td", attrs={"class": "director"})  # all <td> tags with attribute class="director"
    dirList = []  # placeholder for the list of directors
    dirInfo = {}  # placeholder for a director's information
    for dr in dirs:  # loop over directors
        drow = dr.find_all("div", attrs={"class": "row"})  # all <div> tags with attribute class="row"
        dirInfo = {}  # make it empty for the next director
        for row in drow:
            rlabel = row.find("label")
            rname = rlabel.get_text()  # get the field name
            rvalue = rlabel.parent.get_text(strip=True).replace(rname, "")  # get the formatted value
            rvalue = rvalue.replace("\r\n", "")
            rname = rname.replace(":", "")  # delete the colon (:) from the field name
            dirInfo[rname] = rvalue  # collect the director's information
            try:  # drop fields we are not interested in, if present
                # if the information is about consent or shareholder status
                if rname == "Consent" or rname == "Shareholder":
                    del dirInfo[rname]
            except KeyError:
                pass  # pass if the key is not found
        dirList.append(dirInfo)  # add each director to the list if there is more than one
    return dirList
# sample view of directors information
getDirectorsInfo(entrySoup) # entrySoup=makeAsoup(1658678)
def getSharesInfo(entrySoup):
    """The function getSharesInfo() gets the soup of a company as input
    and returns the list of company shareholders and their information."""
    shareholder = entryInfoTabs(entrySoup)["shareholdingsTab"]  # get the link that contains the shareholders' information
    shareSoup = getSoup(shareholder)  # get the soup for the shareholdings page
    # collect the shareholders' information from the page
    # a list containing allocations
    shares = shareSoup.find_all("div", attrs={"class": "allocationDetail"})  # all <div> tags with attribute class="allocationDetail"
    shareInfo = []  # placeholder for a shareholder's information
    shareList = []  # placeholder for the shareholder list
    for share in shares:  # loop over allocations/shareholders
        drow = share.find_all("div", attrs={"class": "row"})  # all <div> tags with attribute class="row"
        shareInfo = []  # make it empty for the next shareholder
        count = 0  # counter
        for row in drow:
            nextrow = row.find_next("label")
            nextrow.clear()  # clear the first label tag
            count = count + 1
            if count == 1:
                rtext = row.get_text()
                allocation = rtext.splitlines()  # split the share number and share portion
                shareNum = allocation[0].strip()  # share number
                sharePort = allocation[1].strip()  # share portion
                shareInfo.extend([shareNum, sharePort])  # add the share number and portion to the list
            else:
                rtext = row.get_text(strip=True)
                shareInfo.append(rtext)  # information about a shareholder
        count = 0  # restart counting for the next shareholder
        # name the information gathered
        shareInfoDic = {}  # create a dictionary for the shareholder's info
        for key, value in zip(["ShareNumber", "SharePortion", "Name", "Address"], shareInfo):
            shareInfoDic[key] = value
        shareList.append(shareInfoDic)  # add each shareholder to the list if there is more than one
    return shareList
# a sample view of how the shareholders' information looks
getSharesInfo(entrySoup) #entrySoup=makeAsoup(1658678)
def getAddress(entrySoup):
    """The function getAddress() gets the soup of a company as input
    and returns the list of the company's addresses."""
    # make a soup for the addresses page
    addrUrl = entryInfoTabs(entrySoup)["addressesTab"]  # get the link that contains the addresses information
    addressSoup = getSoup(addrUrl)  # make a soup for the addresses page
    # get the list of addresses
    addressLines = addressSoup.find_all("div", attrs={"class": "addressLine"})  # the <div> tags with class="addressLine"
    addressList = []  # placeholder for the collection of addresses
    try:
        for adrsLine in addressLines:
            # build a new dictionary for each address line so earlier entries are not overwritten
            address = {"Address": adrsLine.get_text(strip=True)}  # get the stripped text of the address
            addressList.append(address)  # list of addresses
    except Exception:
        pass
    return addressList
#Sample Address view
getAddress(entrySoup) #entrySoup=makeAsoup(1658678)
To make the code flexible, I used a list of company numbers as one of the parameters of the saveMultiData() function. We could also automatically gather information about every company listed in the search result; a sketch of how to page through the results follows the code below. For demonstration purposes, only the companies that appear on the first page (15 companies) of the search result were collected. First, their numbers were collected as a list using the function getCompanyNumberFP() and then passed as an argument to the function saveMultiData() to save the data and display the summary.
def saveMultiData(companyNumList, fileName):
    """The function saveMultiData() saves all the relevant information about the companies
    whose numbers are given and returns the information as a dataframe.
    We can use this function for a single company, but make sure the number is given as a list.
    Parameters:
    - companyNumList is a list that contains company numbers.
    - fileName is the name of the file we want to save the data as."""
    alldata = []  # collect the information for multiple companies
    for i in range(len(companyNumList)):
        entrySoup = makeAsoup(companyNumList[i])
        relevantSummary = getSummaryInfo(entrySoup)
        dirList = getDirectorsInfo(entrySoup)
        shareList = getSharesInfo(entrySoup)
        addressList = getAddress(entrySoup)
        # collect all the information together in a dictionary for a single company
        entryData = {**relevantSummary, **{"Directors": dirList}, \
                     **{"Shareholders": shareList}, **{"Addresses": addressList}}
        # gather all companies' info
        alldata.append(entryData)
    # save the data as a json file
    with open(fileName, "w", encoding="utf8") as jsonfile:
        json.dump(alldata, jsonfile, ensure_ascii=False, indent=4)
    return pd.read_json(fileName).head()  # read and display the saved file as a pandas dataframe
firstpage = getCompanyNumberFP(soupKiwi)  # where soupKiwi = getSoup(url_Kiwi), and url_Kiwi is the url of the search result
print(firstpage)
# save the companies from the first page of the search result and display the first five
saveMultiData(firstpage, "KiwiCompanies")
# or it is possible to read the saved json file using :
jsonfile= pd.read_json("KiwiCompanies")
jsonfile.head(2) # see the first two companies
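To go beyond the first page of results, one option (a sketch only, assuming the search URL keeps accepting the start and limit query parameters visible in url_Kiwi, with 15 results per page) is to step the start offset and reuse getCompanyNumberFP() on each page:
allNumbers = []  # company numbers collected across several result pages
for start in range(0, 45, 15):  # first three pages of 15 results each (illustrative)
    pageUrl = url_Kiwi.replace("start=0", "start=" + str(start))  # shift the result offset
    allNumbers.extend(getCompanyNumberFP(getSoup(pageUrl)))
# saveMultiData(allNumbers, "KiwiCompaniesAll")  # then save them all at once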
By scraping the web pages of New Zealand's public companies registry, we were able to collect information about companies, including their addresses, directors and shareholders, and put it together in a structured format.
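As one example of the follow-up analysis this enables (a sketch, assuming the "KiwiCompanies" file saved above and that every saved record contains a CompanyNumber field), the nested directors' records can be flattened into their own dataframe with pandas:
with open("KiwiCompanies", encoding="utf8") as f:
    data = json.load(f)
# one row per director, keeping the parent company's number for reference
directors = pd.json_normalize(data, record_path="Directors", meta=["CompanyNumber"])
print(directors.head())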
For reference, here are the functions used in this post together with their docstrings:
- getSoup(url):
  """Make a soup for a webpage given its url."""
- getCompanyNumberFP(soup):
  """Get the company numbers displayed on the first page of the search result.
  soup: the soup of a webpage that contains the companies we are looking for, e.g. 'Kiwi companies'."""
- makeAsoup(companyNumber):
  """Make a soup for a webpage that contains information about a company, given its number."""
- entryInfoTabs(entrySoup):
  """The function entryInfoTabs gets the soup of an entry (a company) and returns the links to navigate
  to each information tab, such as summary, directors, addresses, shareholders and so on."""
- getSummaryInfo(entrySoup):
  """The function getSummaryInfo gets the soup of an entry (a company) as input and
  returns the relevant summary information as a dictionary, where the keys
  are the fields we are interested in and the values are their values."""
- getDirectorsInfo(entrySoup):
  """The function getDirectorsInfo() gets the soup of a company as input
  and returns the list of company directors and their information."""
- getSharesInfo(entrySoup):
  """The function getSharesInfo() gets the soup of a company as input
  and returns the list of company shareholders and their information."""
- getAddress(entrySoup):
  """The function getAddress() gets the soup of a company as input
  and returns the list of the company's addresses."""
- saveMultiData(companyNumList, fileName):
  """The function saveMultiData() saves all the relevant information about the companies
  whose numbers are given and returns the information as a dataframe.
  We can use this function for a single company, but make sure the number is given as a list.
  Parameters:
  - companyNumList is a list that contains company numbers.
  - fileName is the name of the file we want to save the data as."""