This page is the continuation of my blog post on R commands. On the blog, see also why use R and the RSS feed of posts labelled R.

See also documentation at:

Installing R and packages

To install R and Rstudio on Debian, see the debian.html#r_and_rstudio page on this site.

To install packages, simply enter the following at an R command prompt:

install.packages("package_name")

Some packages have dependencies that need to be installed at the OS level. Error messages such as:

"Configuration failed because libcurl was not found."
"Configuration failed because libxml-2.0 was not found."
"Configuration failed because openssl was not found."

can be solved by installing these dependencies:

sudo apt install libcurl4-openssl-dev 
sudo apt install libxml2-dev 
sudo apt install libssl-dev

Information about your R system

sessionInfo()
installed.packages()

Files input output

getwd()
list.files(tempdir()) 
dir.create("blabla")
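A short sketch tying these functions together: create a scratch directory under tempdir(), write a couple of files, and list only those matching a pattern (the directory and file names here are made up for illustration).

```r
# Create a scratch directory inside the session's temporary directory
d <- file.path(tempdir(), "io_demo")  # "io_demo" is an arbitrary name
dir.create(d, showWarnings = FALSE)

# Write two small files, then list only the .txt ones
writeLines("hello", file.path(d, "a.txt"))
writeLines("x,y", file.path(d, "b.csv"))
list.files(d, pattern = "\\.txt$")
## [1] "a.txt"
```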

CSV files

Read one csv file with the base R function:

read.csv("data.csv")

Read many csv files with functions from the tidyverse packages.

First write sample csv files to a temporary directory. It’s more complicated than I thought it would be. Note that by_row() lives in the purrrlyr package, not in purrr itself.

library(dplyr)
library(tidyr)
library(purrrlyr)

data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
    # Create a new column used as the grouping variable,
    # so that the Species column stays inside the nested data
    mutate(species_group = Species) %>% 
    group_by(species_group) %>%
    nest() %>%
    by_row(~write.csv(.$data, 
                      file = file.path(data_folder, paste0(.$species_group, ".csv")),
                      row.names = FALSE))

Read these csv files into one data frame. Note the Species column has to be present in the csv files, otherwise we would lose that information.

library(purrr)
library(readr)
iris_csv <- list.files(data_folder, full.names = TRUE) %>% 
    map_dfr(read_csv)
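For comparison, a base-R sketch of the same read-many-files pattern, without any tidyverse dependency (it writes its own throwaway csv files first, so it is self-contained):

```r
# Write one csv per species to a temporary directory
d <- file.path(tempdir(), "iris_base")
dir.create(d, showWarnings = FALSE)
for (sp in levels(iris$Species)) {
    write.csv(iris[iris$Species == sp, ],
              file.path(d, paste0(sp, ".csv")), row.names = FALSE)
}

# Read them back and bind into one data frame
files <- list.files(d, full.names = TRUE)
iris_base <- do.call(rbind, lapply(files, read.csv))
nrow(iris_base)
## [1] 150
```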

write_csv() returned Error in write_delim(...): is.data.frame(x) is not TRUE — write_csv() is stricter than write.csv() about its input being a plain data frame. That’s why we used write.csv() instead.

Vectors

An SO answer concerning indexing up to the end of a vector/matrix: “Sometimes it’s easier to tell R what you don’t want”.

x <- c(5,5,4,3,2,1)
x[-(1:3)]
x[-c(1,3,6)]
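Related base-R helpers: head() and tail() accept a negative n, which is another way to drop elements from one end, and seq() can index up to the end explicitly:

```r
x <- c(5, 5, 4, 3, 2, 1)
head(x, -2)           # drop the last two elements
## [1] 5 5 4 3
tail(x, 3)            # keep the last three elements
## [1] 3 2 1
x[seq(4, length(x))]  # index from position 4 to the end
## [1] 3 2 1
```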

Set operations

x = letters[1:3]
y = letters[3:5]
union(x, y)
## [1] "a" "b" "c" "d" "e"
intersect(x, y)
## [1] "c"
setdiff(x, y)
## [1] "a" "b"
setdiff(y, x)
## [1] "d" "e"
setequal(x, y)
## [1] FALSE
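Membership tests complement these set functions: %in% (a wrapper around match()) returns a logical vector, and is.element() is equivalent:

```r
x <- letters[1:3]
y <- letters[3:5]
x %in% y           # which elements of x are also in y
## [1] FALSE FALSE  TRUE
is.element("c", y)
## [1] TRUE
```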

Lists

Given a list structure x, unlist simplifies it to produce a vector which contains all the atomic components which occur in x.

l1 <- list(a="a", b="2,", c="pi+2i")
str(l1)
## List of 3
##  $ a: chr "a"
##  $ b: chr "2,"
##  $ c: chr "pi+2i"
unlist(l1) # a character vector 
##       a       b       c 
##     "a"    "2," "pi+2i"
str(unlist(l1))
##  Named chr [1:3] "a" "2," "pi+2i"
##  - attr(*, "names")= chr [1:3] "a" "b" "c"
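unlist() also flattens nested lists recursively, composing the names of the levels with a dot:

```r
nested <- list(a = list(b = 1, c = 2), d = 3)
unlist(nested)
## a.b a.c   d 
##   1   2   3
```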

Strings

message("Using the following letters: ", paste(letters, collapse=","), ".")
## Using the following letters: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z.
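sprintf() is an alternative to paste() when the output needs formatting, for example a fixed number of decimals (the values below are made up):

```r
sprintf("%s has %d petals measuring %.1f cm", "setosa", 4L, 1.4)
## [1] "setosa has 4 petals measuring 1.4 cm"
```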

Levenshtein distance between words

Cf. https://en.wikipedia.org/wiki/Levenshtein_distance

adist("kitten", "sitting")
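adist() returns a matrix of distances, so it also works on vectors of words. The distance between “kitten” and “sitting” is 3 (two substitutions and one insertion):

```r
drop(adist("kitten", "sitting"))
## [1] 3
# Pairwise distances between two vectors of words
adist(c("kitten", "mitten"), c("sitting", "fitting"))
```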

S3 methods


List all available methods for a class:

methods(class="lm") 
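A minimal sketch of how S3 dispatch works: set a class attribute on an object and define a print.&lt;class&gt; method; print() then dispatches to it. The class name “temperature” is made up for illustration.

```r
x <- structure(list(value = 21, unit = "C"), class = "temperature")

# The method name is <generic>.<class>; print() finds it by dispatch
print.temperature <- function(x, ...) {
    cat(x$value, "degrees", x$unit, "\n")
    invisible(x)
}

print(x)
## 21 degrees C 
methods(class = "temperature")
```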

One liners

Remove all objects in the workspace except one:

rm(list=ls()[!ls()=="object_to_keep"]) 
rm(list=ls()[!ls()=="con"]) # Remove all except a database connection
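An equivalent, arguably more readable form uses setdiff(). Sketched here inside a throwaway environment so the example does not touch the actual workspace:

```r
e <- new.env()
e$keep_me  <- 1
e$drop_me  <- 2
e$drop_too <- 3

# Same idea as the one-liner above, scoped to the environment e
rm(list = setdiff(ls(e), "keep_me"), envir = e)
ls(e)
## [1] "keep_me"
```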

knitr

kable to create tables

cat(kable(head(iris, 1), format = "html"))
## <table>
##  <thead>
##   <tr>
##    <th style="text-align:right;"> Sepal.Length </th>
##    <th style="text-align:right;"> Sepal.Width </th>
##    <th style="text-align:right;"> Petal.Length </th>
##    <th style="text-align:right;"> Petal.Width </th>
##    <th style="text-align:left;"> Species </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:right;"> 5.1 </td>
##    <td style="text-align:right;"> 3.5 </td>
##    <td style="text-align:right;"> 1.4 </td>
##    <td style="text-align:right;"> 0.2 </td>
##    <td style="text-align:left;"> setosa </td>
##   </tr>
## </tbody>
## </table>
cat(kable(head(iris, 1), format = "latex"))
## 
## \begin{tabular}{r|r|r|r|l}
## \hline
## Sepal.Length & Sepal.Width & Petal.Length & Petal.Width & Species\\
## \hline
## 5.1 & 3.5 & 1.4 & 0.2 & setosa\\
## \hline
## \end{tabular}
cat(kable(head(iris, 1), format = "markdown"))
## | Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species |
## |------------:|-----------:|------------:|-----------:|:-------|
## |          5.1|         3.5|          1.4|         0.2|setosa  |

Setting knitr options

These two commands are different. opts_chunk$set() sets the chunk options from within a knitr chunk inside the .Rmd document:

opts_chunk$set(fig.width=10)

opts_knit$set() sets the options for the knitr package outside the .Rmd document:

opts_knit$set()

R markdown python engine

Rstudio: R Markdown python engine

Rscript

Capture arguments in an Rscript on windows and write them to a file

"C:\Program Files\R\R-3.5.0\bin\Rscript.exe" --verbose -e "args = commandArgs(trailingOnly=TRUE)" -e "writeLines(args,'C:\\Dev\\args.txt')" "file1.csv" "file2.csv" "file3.csv"

Arguments can be extracted one by one with args[1], args[2], and so on. commandArgs() returns a character vector containing the name of the executable and the user-supplied command line arguments.

Tidyverse

dplyr

pipes

library(dplyr)
cars %>%
    group_by(speed) %>%
    print() %>% # works because the print function returns its argument
    summarise(numberofcars = n(),
              min = min(dist),
              mean = mean(dist),
              max = max(dist)) 
## # A tibble: 50 x 2
## # Groups:   speed [19]
##    speed  dist
##    <dbl> <dbl>
##  1     4     2
##  2     4    10
##  3     7     4
##  4     7    22
##  5     8    16
##  6     9    10
##  7    10    18
##  8    10    26
##  9    10    34
## 10    11    17
## # … with 40 more rows
## # A tibble: 19 x 5
##    speed numberofcars   min  mean   max
##    <dbl>        <int> <dbl> <dbl> <dbl>
##  1     4            2     2   6      10
##  2     7            2     4  13      22
##  3     8            1    16  16      16
##  4     9            1    10  10      10
##  5    10            3    18  26      34
##  6    11            2    17  22.5    28
##  7    12            4    14  21.5    28
##  8    13            4    26  35      46
##  9    14            4    26  50.5    80
## 10    15            3    20  33.3    54
## 11    16            2    32  36      40
## 12    17            3    32  40.7    50
## 13    18            4    42  64.5    84
## 14    19            3    36  50      68
## 15    20            5    32  50.4    64
## 16    22            1    66  66      66
## 17    23            1    54  54      54
## 18    24            4    70  93.8   120
## 19    25            1    85  85      85

group_by() creates a tbl_df object, which is a wrapper around a data.frame that enables extra functionality. Note that print() returns its argument when called on a tbl_df object, so print() can be used inside the pipe without stopping the workflow.

Mutate

Mutate multiple variables in the dataframe at once using the vars() helper function to scope the mutation:

iris %>%
    mutate_at(vars(starts_with("Petal")), round) %>%
    head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5            1           0  setosa
## 2          4.9         3.0            1           0  setosa
## 3          4.7         3.2            1           0  setosa
## 4          4.6         3.1            2           0  setosa
## 5          5.0         3.6            1           0  setosa
## 6          5.4         3.9            2           0  setosa
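In dplyr 1.0 and later, mutate_at(vars(...)) is superseded by mutate(across(starts_with("Petal"), round)). For comparison, a base-R sketch of the same scoped transformation:

```r
iris2 <- iris
petal_cols <- grep("^Petal", names(iris2))  # columns whose name starts with "Petal"
iris2[petal_cols] <- lapply(iris2[petal_cols], round)
head(iris2, 1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5            1           0  setosa
```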

purrr

Hadley Wickham’s answer to a SO question Why use purrr::map instead of lapply?

Map a function to nested data sets

Load data

list.files(getwd())
forestEU_wide <- read.csv("Forest-R-EU.csv", stringsAsFactors = FALSE)
head(forestEU_wide)

Pivot to long format

# gather() is superseded by pivot_longer() in newer tidyr versions
forestEU <- forestEU_wide  %>% 
      # select everything except the Year, then pivot all columns and put the value in area
      gather(-Year, key = "Country", value = "Area")

Interpolate for one country

country_interpolation <- function(df) {
    df <- data.frame(approx(df$Year, df$Area, method = "linear", n = 71))
    df <- rename(df, Year = x, Area = y)
    return(df)
}

forestEU %>%  filter(Country=="Austria") %>% country_interpolation()

Interpolate for all countries

See documentation in:

  • many models https://r4ds.had.co.nz/many-models.html
  • blog https://emoriebeck.github.io/R-tutorials/purrr/

forestEU_nested <- forestEU %>%
    # Remove empty area
    filter(!is.na(Area)) %>% 
    group_by(Country) %>%  
    nest() %>% 
    mutate(interpolated = map(data, country_interpolation))

# forestEU_nested %>% unnest(data)
# Unnest the interpolated data to look at it and plotting
forestEU_interpolated <- forestEU_nested %>% unnest(interpolated)

Write to many csv

# dir.create("output")
forestEU_nested <- forestEU_nested %>% 
    mutate(filename = paste0("output/", Country, ".csv"),
           wrote_stuff = map2(interpolated, filename, write.csv))

tidy evaluation

scatter_plot <- function(data, x, y) {
    x <- enquo(x)
    y <- enquo(y)
    ggplot(data) + geom_point(aes(!!x, !!y))
}
scatter_plot(mtcars, disp, drat)

Another example use of metaprogramming to change variables

add1000 <- function(dtf, var){
    varright <- enquo(var)
    varleft <- quo_name(enquo(var))
    dtf %>% 
        mutate(!!varleft := 1000 + (!!varright))
}
add1000(iris, Sepal.Length)
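Since rlang 0.4 the enquo()/quo_name() pattern can be written more compactly with the curly-curly operator {{ }}, which both captures the argument and splices it back in; a sketch of the same function (requires dplyr):

```r
library(dplyr)

# {{ var }} on the left of := reuses the original column name
add1000 <- function(dtf, var) {
    dtf %>%
        mutate({{ var }} := 1000 + {{ var }})
}

head(add1000(iris, Sepal.Length), 1)
```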

tidyr

tidyr vignette on tidy data In the section on “Multiple types in one table”:

Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table. This is closely related to the idea of database normalisation, where each fact is expressed in only one place. It’s important because otherwise inconsistencies can arise.

Normalisation is useful for tidying and eliminating inconsistencies. However, there are few data analysis tools that work directly with relational data, so analysis usually also requires denormalisation, or merging the datasets back into one table.

Example use of tidyr::nest() to generate a group of plots: make ggplot2 purrr.

library(tidyr)
library(dplyr)
library(purrr)
library(ggplot2)
piris <- iris %>%  
    group_by(Species) %>% 
    nest() %>% 
    mutate(plot = map2(data, Species, 
                       ~ggplot(data = .x, 
                               aes(x = Petal.Length, y = Petal.Width)) + 
                           geom_point() + ggtitle(.y)))
piris$plot[1]
piris$plot[3]
piris$plot[2]

Discussions

  • Gavin Simpson, My aversion to pipes, shows a Hadley tweet explaining that pipes might not be good in package development.

Plotting with ggplot2

geom_bar
geom_tile + a gradient produces heat maps

Palettes

Setting up colour palettes in R

To create a RColorBrewer palette, use the brewer.pal function. It takes two arguments: n, the number of colors in the palette; and name, the name of the palette. Let’s make a palette of 8 colors from the qualitative palette, “Set2”.

library(RColorBrewer)
brewer.pal(n = 8, name = "Set2") 
[1] "#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F" "#E5C494" "#B3B3B3"
palette(brewer.pal(n = 8, name = "Set2"))

Use this palette in ggplot2

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) + 
    geom_point() +
    scale_color_brewer(palette = "Set2")

Use a named vector to set a palette in ggplot2 as explained in ggplot2 scale_manual

p <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
        geom_point() 
p + scale_colour_manual(values = c(setosa='black', versicolor='red', virginica='green'))

Create a named palette using R colour brewer:

species_names <- c("setosa", "versicolor", "virginica") 
iris_palette <- setNames(brewer.pal(n=length(species_names), name='Set2'), 
                         species_names)
p + scale_colour_manual(values = iris_palette)

Display palettes

Display qualitative palettes:

display.brewer.all(type="qual") 

Display all palettes

display.brewer.all()

Creating a package

You might want to read the CRAN manual on Writing R Extensions, and its section on Package dependencies. See also Hadley’s book on R packages and its section on Namespaces.

Use the devtools library to start a package folder structure:

devtools::create("package_name")

Use git to track code modifications (shell commands):

$ cd package_name
$ git init

Documentation

The roxygen2 package helps with function documentation. Documentation is written in comments starting with #'. Tags such as @param and @description structure the documentation of each function.

For an introduction to roxygen2, call vignette("roxygen2", package = "roxygen2") at the R prompt.

Since roxygen2 version 6, markdown formatting can be used in the documentation, by specifying the @md tag.

Examples

Examples are crucial to demonstrate the use of a function. They are specified under the @examples tag in a roxygen block:

#' @examples

Wrap the examples in \donttest{} if you don’t want R CMD check to test them at package building time:

#' \donttest{
#' # example code
#' }

It is also possible to wrap them in another statement called \dontrun{}, but this is not recommended on CRAN according to this Stackoverflow question.

Vignettes

Vignettes: long-form documentation

devtools::use_vignette("my-vignette")

Where to put package vignettes for CRAN submission

“You put the .Rnw sources in vignettes/ as you did, but you missed out a critical step; don’t check the source tree. The expected workflow is to build the source tarball and then check that tarball. Building the tarball will create the vignette PDF.”

R CMD build ../foo/pkg
R CMD check ./pkg-0.4.tar.gz

Issues while building vignettes for a package:

“Maybe you’re running R CMD check using the directory name rather than the .tar.gz file?”

“Installing texlive-fonts-extra should take care of it.”

R CMD check reported “checking data for non-ASCII characters ... found 179 marked UTF-8 strings”. No solution for this one, but I guess it’s OK since it concerns country names.

Unit tests

Back in R, add testing infrastructure:

devtools::use_testthat()

When checking the package with R CMD check, you may wonder: How can I handle the “no visible binding for global variable” note? These notes are caused by variables used with dplyr verbs and ggplot2 aesthetics.
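One common fix is the .data pronoun from rlang (re-exported by dplyr), which makes the column reference explicit so the checker no longer sees a bare symbol; another is declaring the names with utils::globalVariables() in a package source file. A sketch of the first approach with dplyr:

```r
library(dplyr)

# `filter(cars, speed > 20)` would trigger the NOTE inside a package,
# because the checker cannot see that `speed` is a column of `cars`.
fast_cars <- filter(cars, .data$speed > 20)
nrow(fast_cars)
## [1] 7
```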

Continuous Integration

It is good to know whether your package can be installed on a fresh system. Continuous integration systems make this possible each time you push a modification to your repository. I have used travis-ci, which is free for public GitHub repositories. Instructions to build an R project on travis. Unit tests are also run on travis, in addition to R CMD check.

Package dependencies can be configured in a .travis.yml file that is read by the travis machine performing the build. For packages that are not on CRAN, it’s possible to specify a dependency field under r_github_packages.

Differences between python and R

l = c(1,2,3)
s = l
s[3]
[1] 3
s[3] = "a"
s
[1] "1" "2" "a"
l
[1] 1 2 3

Using the address() function from pryr to see the addresses of these objects in memory, we can see that s and l share the same address. The address only changes when we assign something to s.

library(pryr)
l = c(1,2,3)
s = l
address(l)
[1] "0x316f718"
address(s)
[1] "0x316f718"
s[3] = "a"
address(s)
[1] "0x36a7d30"
s
[1] "1" "2" "a"
l
[1] 1 2 3
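The same copy-on-modify behaviour can be observed with base R’s tracemem(), which returns an object’s address and prints a message whenever the object is copied (this requires an R build with memory profiling enabled, which is the default for CRAN binaries):

```r
l <- c(1, 2, 3)
s <- l
tracemem(l) == tracemem(s)  # TRUE: both names point at the same memory
s[3] <- "a"                 # the modification triggers a copy (and a coercion)
tracemem(l) == tracemem(s)  # FALSE: s now lives at a new address
untracemem(l); untracemem(s)
```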

Checking string objects

bla = "qsdfmlkj"
address(bla)
[1] "0x38e5120"
bli = bla
address(bli)
[1] "0x38e5120"
bli = paste(bli, "sdf")
address(bli)
[1] "0x38d1d60"

TODO Compare to the same code in python to see the difference between the above and passing by reference.