Upcoming post: Hypothesis Testing
In this post, I will present the results of statistical analyses comparing the prices of hotel and non-hotel Airbnb listings in New York City. The comparison was made using both frequentist and bootstrapping techniques. The post walks through the analysis step by step, from data acquisition to conclusion. First, hotel and non-hotel listings were identified from the listing descriptions. Then the following steps were followed: defining the null and alternative hypotheses, identifying the test statistic, checking the assumptions behind the test statistic, setting the significance level (the tolerated rejection error), computing the test statistic, computing and interpreting the p-value, and drawing a conclusion from the test. Please stay tuned for a more detailed explanation of the analysis; in the meantime, you may check the analysis in my GitHub repository.
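To give a flavor of the two approaches, here is a minimal Python sketch using synthetic prices in place of the real listing data (the actual analysis lives in the GitHub repository):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical price arrays; in the real analysis these come from the
# NYC Airbnb data after separating hotel and non-hotel listings.
hotel_prices = rng.lognormal(5.0, 0.5, 300)
other_prices = rng.lognormal(4.8, 0.6, 3000)

# Frequentist approach: Welch's two-sample t-test.
# H0: the mean prices are equal; H1: they differ.
t_stat, p_value = stats.ttest_ind(hotel_prices, other_prices, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject H0 if p < alpha (e.g. 0.05)

# Bootstrap approach: resample the difference in means.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    h = rng.choice(hotel_prices, size=hotel_prices.size, replace=True)
    o = rng.choice(other_prices, size=other_prices.size, replace=True)
    diffs[i] = h.mean() - o.mean()
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
```

If the bootstrap confidence interval excludes zero, the two approaches should point to the same conclusion.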
(By Muluemebet G Ayalew, Sept 2020)
In this post, I used Bidirectional Encoder Representations from Transformers (BERT) to classify whether a news is fake or real. BERT is a state-of-the-art technique for Natural Language Processing (NLP) created and published by Google in 2018. Bidirectional means that it looks both left and right context to understand the text. It can be used for next sentence prediction, question answering, language inference and more. Read more >>
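As a rough illustration of the approach (not the exact code from the post), a minimal sketch using the Hugging Face transformers library might look like the following; the model name and the 0 = real / 1 = fake label convention are assumptions, and the classification head would still need fine-tuning on labeled news data before its predictions mean anything:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pretrained BERT with a 2-class head. The head is randomly
# initialized here; fine-tuning on a labeled fake/real news dataset
# is required before the predictions are meaningful.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

text = "Scientists discover a new planet in the solar system."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits.argmax(dim=-1).item()
print("fake" if predicted == 1 else "real")  # assumed label convention
```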
(By Muluemebet G Ayalew, Feb 2020)
In this post, I am going to demonstrate web scraping using BeautifulSoup, a Python package for parsing HTML and XML documents. Web scraping is a technique used to extract data from websites and save it in a more structured format. I extracted information about a number of companies from the New Zealand companies registry website. The webpage allows us to search for companies by name, number or NZBN (New Zealand Business Number). The goal of this scraping is to gather information about companies and compile it in a more structured or semi-structured form, so that follow-up analysis becomes easier. Read more >>
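The core parsing pattern looks roughly like the sketch below; the URL, tag names and CSS classes are placeholders, since the real registry pages have their own structure:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical search-results URL; the real NZ companies register
# uses its own query parameters and page layout.
url = "https://example.com/companies/search?q=acme"
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract one record per result row; the tag names and classes here
# stand in for whatever the real page uses.
companies = []
for row in soup.find_all("div", class_="search-result"):
    name = row.find("h3").get_text(strip=True)
    number = row.find("span", class_="company-number").get_text(strip=True)
    companies.append({"name": name, "number": number})

print(companies)
```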
(By Muluemebet G Ayalew, Nov 2017)
I have been using Jupyter Notebook for a while, and I have found that opening .ipynb files for quick reference is not straightforward. Jupyter Notebook has functionality to download a notebook in a familiar, easy-to-share format such as HTML or PDF. However, it is cumbersome to download each and every file manually, especially when you have multiple files.
In this post, I demonstrate, step by step, how to automate the conversion of Jupyter notebook files into HTML or PDF from the command line. Read more >>
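The gist of the automation is a loop over notebook files that calls the standard jupyter nbconvert command; a minimal Python sketch (with an assumed folder name) looks like this:

```python
import subprocess
from pathlib import Path

# Convert every notebook in a folder to HTML using the standard
# `jupyter nbconvert` command; swap "html" for "pdf" if a LaTeX
# toolchain is installed.
notebooks_dir = Path("notebooks")  # assumed folder name
for nb in notebooks_dir.glob("*.ipynb"):
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "html", str(nb)],
        check=True,
    )
    print(f"converted {nb.name}")
```

The same loop could equally be written as a one-liner in a shell; the post covers the details.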
(By Muluemebet G Ayalew, June 2017)
This post demonstrates, with an example, the difference between simple word counts and term frequency-inverse document frequency (tf-idf) in document retrieval. We use document retrieval to find and recommend articles similar to an article of interest. Read more >>
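For a flavor of the difference, here is a minimal scikit-learn sketch on toy documents (the post works through a fuller example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
query = ["the cat sat"]

# Raw word counts: common words like "the" dominate the similarity.
counts = CountVectorizer().fit(docs)
sim_counts = cosine_similarity(counts.transform(query), counts.transform(docs))

# tf-idf: words that appear everywhere are downweighted, so the
# distinctive terms drive the ranking instead.
tfidf = TfidfVectorizer().fit(docs)
sim_tfidf = cosine_similarity(tfidf.transform(query), tfidf.transform(docs))

print("count similarity:", sim_counts.round(2))
print("tf-idf similarity:", sim_tfidf.round(2))
```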
(By Muluemebet G Ayalew, May 2017)
A multivariable linear regression was used to predict and validate the prices of pre-owned cars. In this post, you can find a step-by-step model implementation, including data preprocessing, variable selection, model building, checking regression assumptions, taking corrective actions and testing the final model. Read more >>
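As a rough sketch of just the model-fitting step (with made-up column names and numbers; the post covers the full workflow from preprocessing through diagnostics), an ordinary-least-squares fit in Python with statsmodels might look like this:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical pre-owned car data; the post uses a real dataset and
# walks through preprocessing and variable selection first.
cars = pd.DataFrame({
    "price":   [8500, 12300, 6400, 15200, 9900],
    "age":     [5, 3, 8, 2, 4],
    "mileage": [62000, 34000, 91000, 18000, 47000],
})

X = sm.add_constant(cars[["age", "mileage"]])  # intercept + predictors
model = sm.OLS(cars["price"], X).fit()

print(model.summary())           # coefficients, R-squared, diagnostics
print(model.predict(X.head(1)))  # predicted price for the first car
```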
(By Muluemebet G Ayalew, May 2017)
Data science tools and techniques have been evolving rapidly, and it is hard to memorize everything. I have found the following cheat sheets useful for quick reference. More cheat sheets to be added...
R Cheat Sheets that cover the following:
R Base,
dplyr,
ggplot2,
devtools,
rmarkdown,
Data Exploration Cheat Sheet,
R Reference Card
Python Cheat Sheets:
Basics,
NumPy Basics,
Pandas Basics,
Bokeh,
Scikit-Learn
Machine Learning, Data Science, Probability, SQL & Big Data: this page collects the top 28 cheat sheets for machine learning, data science, probability, SQL and big data.
(By Muluemebet G Ayalew, Feb 2017)
This post demonstrates how classification can be done using a Support Vector Machine (SVM). SVM is a supervised learning algorithm that classifies a training dataset by finding an optimal hyperplane separating the class labels. The post also discusses the purposes of the training, validation and test datasets in the course of model development. The dataset is obtained from the online Machine Learning course materials offered by Dr Andrew Ng through Coursera. In the course, we implemented SVM using Octave; here I show how to implement it using R, as well as the importance of the three datasets. ... Read more >>
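The post itself implements SVM in R; purely to illustrate the train/validate/test idea in a compact, language-neutral way, here is a scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the course dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split into train (60%), validation (20%) and test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Use the validation set to pick the regularization parameter C,
# keeping the test set untouched for the final performance estimate.
best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", C=c).fit(X_train, y_train)
    score = clf.score(X_val, y_val)
    if score > best_score:
        best_c, best_score = c, score

final = SVC(kernel="rbf", C=best_c).fit(X_train, y_train)
print(f"best C = {best_c}, test accuracy = {final.score(X_test, y_test):.3f}")
```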
(By Muluemebet G Ayalew, Feb 2017)
Earlier in my blog, I explored the average-Medicare-spending-per-episode hospital data using R and R Shiny. Here, I explore the average spending across states and by claim type and period, as well as whether the mean and the median lead to similar conclusions. Note that the data provided was average spending per beneficiary episode, hereafter called spending, and is described in more detail in my previous post. The visualization, made using Tableau, presents four stories: ... Read more >>
(By Muluemebet G Ayalew, Jan 2017)
In this post, I developed an R Shiny app that explores the differences in average Medicare spending across providers by period and claim type within each state. You can find the R Shiny code here or on my GitHub page, and you can read the data description and explore the app here.
(By Muluemebet G Ayalew, Jan 2017)
In this post, I explored whether the total average spending per complete episode varies across states, and whether the average spending per episode varies across states for a given claim type and period. ... Read more >>
(By Muluemebet G Ayalew, Dec 2016)
In this post, I illustrate the commonly used clustering algorithm called K-means using self-written Octave functions. There are also readily available R, Python and Octave packages/functions, which I list at the end of the post. Cluster analysis groups similar objects together so that objects in the same group are more similar to each other than to objects in other groups. It can be used for market segmentation, social network analysis, astronomical data analysis and so on (see, for example, Bjiuraj et al.).
K-means clustering is an unsupervised learning algorithm that partitions a given dataset into k coherent clusters, so that each observation belongs to the cluster with the nearest mean. The number of clusters k should be known or determined a priori, and for each cluster we need to provide an initial mean (centroid). ... Read more >>
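The post uses self-written Octave functions; a minimal NumPy sketch of the same two-step loop (assign each point to its nearest centroid, then recompute the means) could look like this:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen observations.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

A production implementation would also handle empty clusters and rerun from several random initializations, since the result depends on the starting centroids.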
(By Muluemebet G Ayalew, Nov 2016)
Preparing and managing data is one of the most time-consuming tasks for data scientists. Importing data into data science software is the first step toward downstream analysis; it seems trivial, but figuring out the right way to do it can at times be frustrating and time consuming. Data can be entered directly into a system (often feasible only for small datasets), or imported into analytical and visualization software from different data file formats. In this post, I put together a list of R packages/functions (see the table below) with illustrations that, I hope, will ease the burden of locating the right package/function. Note that the table lists neither all data file formats nor all available packages. For a comprehensive discussion of importing and exporting in R, you can read the R Data Import/Export documentation. ... Read more >>