2 Getting started with H2O in R

2.1 Summary of how it works

When using H2O from R, one launches an H2O instance from R. R then communicates with the instance via a REST API. So in essence all data manipulations and calculations are not done in R but are handled by H2O, one just uses familiar R syntax to send instructions to H2O via the API.

Data sets are not directly transmitted through the REST API. Instead, commands (for example, importing a data set) are sent either through the browser or the REST API to perform the specified task. The data set will then be assigned an identifier (the .hex file type in H2O) which is relayed back to R. H2O does how ever provide the functionality to convert h2oFrames to R dataframes for manipulation in R.

2.2 Installation

The installation of h2o via R seems the most straightforward approach. This approach, for the most part, is uniform on systems running Linux, Windows and OSX. For other installation methods or installation from other platforms like python check out the h2o documentation.

2.2.1 prep the environment

Before the actual installation of H2O there are some dependencies that need to be taken care of. Below we look at the most common ones:

  • H2O runs in a JVM so the first thing to install is java.
  • Next, especially for Linux distributions, we need to install some dependencies outside of R. On Ubuntu run apt-get install libcurl4-openssl-dev in the terminal. This is needed by the RCurl package in R.
  • For a clean install, run the following lines in a R console to remove any previously installed H2O packages
# detach h2o package if loaded
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
# remove h2o package if installed  
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
  • If installing from source (see install and test below) then dependencies for the h2o package may need to be manually installed in R. The following code will achieve that
pkgs <- c("RCurl","jsonlite","utils","tools","graphics","stats","statmod","methods")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

2.2.2 install and test

With the environment prepared we can proceed with the installation. The installation of the package will also trigger the download of the h2o.jar file from H2O. The stable version from cran may be a version behind the one from source (for the notes we use the version from source).

# installing the latest stable release from source
install.packages("h2o", 
                 type="source",
                 repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))

# installing the latest stable release from cran
install.packages("h2o")

Next we test if everything went well by running a demo example. By default the initialisation will also run an instance of h2o Flow with port 54321 exposed on localhost. The default behavior can be changed via parameters of h2o.init() ( have a look at ?h2o.init()). Launching h2o from R ties it to the R session that launched it so termination of the session also terminates the h2o instance.

# load the library into environment
library(h2o)
# initialise an instance of h2o with the maximum number of cores available
localH2O <- h2o.init(nthreads = -1)

# Starts H2O using localhost IP, port 54321, all CPUs,and 4g of memory
# h2o.init(ip = 'localhost', port = 54321, nthreads= -1,max_mem_size = '4g')

# run a demo to test if all went well
demo(h2o.kmeans)