Preparing and managing data is one of the most time-consuming task for data scientists. Importing data into a data science software is the first step for the downstream analysis, which seems trivial but at times frustrating and time consuming to figure out the right way. Data can be directly entered into a system (often feasible only for small data), or imported into analytical and visualization software from different data file formats. In this post, I put together a list of R packages/functions (see Table below) with illustrations that, I hope, will ease the burden of locating the right package/function. Note that the table neither list all data file formats nor all available packages. For a comprehensive R import/export discussion you can read R Import/Export documentation.
R packages / functions to read/import data files
Data File Format | File extension | R Package | R Import Function |
Text file | .txt | Base package 'utils' | read.table() read.delim()* |
Comma Separated Value (CSV) | .csv | Base package 'utils' | read.csv()** |
Excel | .xlsx | xlsx | read.xlsx() |
OCTAVE | .mat | foreign R.matlab |
read.octave() readMat() |
SPSS | .sav | foreign Hmisc |
read.spss() spss.get() |
SYSTAT | .sys | foreign | read.systat() |
STATA | .dta | foreign | read.dta() |
SAS | .sas7bdat | sas7bdat | read.sas7bdat() |
SAS Transport file | .xpt | foreign Hmisc |
read.xport() sasxport.get() |
EpiInfo | .rec | foreign | read.epiinfo() |
Minitab | .mtp | foreign | read.mtp() |
RDA or RDATA | .rda | base | load() |
* If comma is used instead of decimal point, use read.delim2()
** If comma is used instead of decimal point, use read.csv2()
Before running the data import function, make sure that the required packages are installed. If the package is not base package or not installed previously, the first task is installing and loading the package. I wrote the following R function to install R packages along with any package that the given package depends on.
# R function (r.pkg.install) to install multiple packages at once.
r_pkg_install <- function(r_pkgs)
for (i in 1:length(r_pkgs)){
# "dependencies" option is TRUE to install all packages that a given package depends on.
install.packages(paste(r_pkgs[i]), lib="directory you want to install package",
repos = "https://cran.cnr.berkeley.edu/", dependencies = TRUE)
}
# List all R packages you need to install into your computer ;
r_pkg_install(r_pkgs=c("xlsx", "XML", "sas7bdat", "foreign", "R.matlab"))
# Locating the data file directory.
# You can specify the full directory in the R function or, or change the working directory so that you don't need to specify every time you run your R function.
setwd("C:/Users/TM/MWWW/dataset1/") # Replace by your working directory
# Check working directory
getwd()
# Note that each function has several arguments and can be checked using help() or ?. For example,
help(read.table)
?read.table
# using read.table ;
tab_data <- read.table("iris.txt", header = TRUE)
head(tab_data)
# Or using read.delim ;
delim_data <- read.delim("iris.txt", sep="\t", header = TRUE)
head(delim_data)
Caution should be taken in reading text files with an empty (" ") missing data value. For example, for missing data values within a tab delimited text file, you can specify sep="\t" option to read in the data properly.
tab_data <- read.table("iris.txt", sep="\t", header = TRUE)
head(tab_data)
Some data files coded missing values differently, for example "?". We can read such data using the na.string option. This will automatically replace the special missing values by NA. For example, if you are reading a data with missing value recorded as "?", then use the following code
tab_data_missing <- read.table("iris_missing.txt", na.strings=c("?"), header = TRUE)
head(tab_data_missing)
On the other hand, if you want to replace NA by a different value, for example, replacing all NA by 0, you can use the following code
tab_data[is.na(tab_data)] <- 0
csv_data <- read.csv("iris.csv" )
head(csv_data)
library(rJava)
library(xlsxjars)
library(xlsx)
xlsx_data <- read.xlsx("iris.xlsx", sheetName = "iris" )
head(xlsx_data)
library(foreign)
spss_data <- read.spss("iris.sav", to.data.frame = TRUE )
head(spss_data)
library(sas7bdat)
stata_data <- read.dta("iris.dta" )
ead(stata_data)
library(sas7bdat)
sas_data <- read.sas7bdat("iris.sas7bdat" )
head(sas_data)
library(foreign)
sasxpt_data <- read.xport("iris.xpt" )
head(sasxpt_data)
The above list helps to locate the right R package/function. When the file size is larger than the computer memory allocated to R, the functions may give you an error message. In such cases, you may 1) look for a computer with bigger memory that accommodate the file size 2) split the data into several small datasets, read and analyze each of them separately, and later combine the results. 3) Use R packages or techniques that are tailored to big data. See for example the discussion in High-Performance and Parallel Computing with R.