3 Simple Example
This is a small example to demonstrate how one can create an ML workflow in h2o via R. We will use the breast cancer data set from the mlbench package.
3.1 Start an instance of H2O
Start H2O using the localhost IP, port 54321, all CPUs, and 6 GB of memory. The recommendation for max_mem_size is four times the size of the data set, so 6 GB is clearly overkill given the size of the data used in this example.
library(h2o)
# Starts H2O using localhost IP, port 54321, all CPUs, and 6g of memory
h2o.init(ip = 'localhost', port = 54321, nthreads = -1, max_mem_size = '6g')
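As a rough way to apply the four-times rule of thumb mentioned above, the sketch below estimates the in-memory size of the data in R and derives a max_mem_size string from it. The R object size is only a proxy for what H2O will actually hold, and the 1024 MB floor and the variable names are assumptions made for illustration, not part of h2o.

```r
library(mlbench)

data(BreastCancer)

# in-memory size of the data frame in bytes (a proxy, not H2O's own accounting)
data_bytes <- as.numeric(object.size(BreastCancer))

# rule of thumb from the text: roughly four times the data size, in megabytes
recommended_mb <- ceiling(4 * data_bytes / 1024^2)

# assumed floor of 1024 MB so that tiny data sets still get a workable heap
recommended_mb <- max(recommended_mb, 1024)

# h2o.init() takes max_mem_size as a string with an "m" or "g" suffix
max_mem <- sprintf("%dm", as.integer(recommended_mb))
print(max_mem)
```

The resulting string could then be passed as h2o.init(max_mem_size = max_mem).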
3.2 Data preparation
3.2.1 read in data
3.2.1.1 from R dataframe to h2oFrame
The function as.h2o() converts an R data frame to an H2OFrame in the h2o instance. Some features in this data set are stored as ordered factors, which as.h2o() seems to have trouble parsing, so we convert them to unordered factors before the conversion.
library(dplyr)
library(mlbench)
library(purrr)
# load the data set into environment
data(BreastCancer)
# convert data frame to an H2OFrame
BreastCancer_h2o <- map(BreastCancer, factor, ordered = FALSE) %>%
  as.data.frame() %>% # map returns a list, coerce that to a df
  as.h2o() # convert to h2oFrame
# can call head to confirm the data was transferred correctly
h2o.head(BreastCancer_h2o)
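To see what the factor conversion does before anything is sent to h2o, the following base R sketch lists the ordered columns and converts only those to unordered factors. It is an alternative to the purrr pipeline above, not the same code: it leaves the character Id column untouched, and the object names are illustrative.

```r
library(mlbench)

data(BreastCancer)

# which columns are stored as ordered factors?
ordered_cols <- names(BreastCancer)[sapply(BreastCancer, is.ordered)]
print(ordered_cols)

# convert only the ordered factors, keeping their level order
bc_unordered <- BreastCancer
bc_unordered[ordered_cols] <- lapply(bc_unordered[ordered_cols],
                                     factor, ordered = FALSE)

# confirm no ordered factors remain
any(sapply(bc_unordered, is.ordered))
```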
3.2.1.2 reading data directly into h2o
In most cases the data will not be in R but in some other data source, such as a csv file or a Hadoop cluster. In fact, chances are the main reason one considers h2o at all is to scale R to large data sets, so importing the data into R first and then pushing it to an h2o instance wouldn’t be particularly helpful. The code below reads in a csv file stored locally on the host machine; further information on loading data from other sources can be found in the documentation here
data_path <- "./data/BreastCancer.csv"
BreastCancer_h2o <- h2o.importFile(path = data_path,
                                   destination_frame = "BreastCancer_h2o",
                                   header = TRUE,
                                   col.types = rep("enum", 11)) # enum is H2O's factor type
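The import above assumes a csv file already exists at ./data/BreastCancer.csv. One way to produce such a file for experimenting, sketched here under the assumption that the mlbench copy of the data is an acceptable stand-in, is to write it out from R and read it back as a quick sanity check.

```r
library(mlbench)

data(BreastCancer)

data_path <- "./data/BreastCancer.csv"

# create the target directory if it does not exist yet
dir.create(dirname(data_path), showWarnings = FALSE, recursive = TRUE)

# write the data without row names so the file has exactly 11 columns
write.csv(BreastCancer, file = data_path, row.names = FALSE)

# read it back as plain text columns to confirm the shape
csv_check <- read.csv(data_path, colClasses = "character")
dim(csv_check)
```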
Once loaded, the data can also be viewed in H2O Flow at localhost:54321 using the getFlow command.
3.2.2 split data
According to the documentation, the h2o.splitFrame() function performs an approximate split rather than an exact split at the specified proportions. For the most part this isn’t an issue when dealing with large amounts of data. With small or imbalanced data sets, though, it may become something worth paying attention to.
BreastCancer_h2o_split_vec <- h2o.splitFrame(BreastCancer_h2o,
ratios = 0.80,
seed = 820)
train_h20 <- BreastCancer_h2o_split_vec[[1]]
test_h2o <- BreastCancer_h2o_split_vec[[2]]
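To get a feel for why a probabilistic split is only approximate, the base R sketch below assigns each of 699 rows (the size of the BreastCancer data) to the training set with probability 0.80 and reports the realised proportion. This illustrates the general idea only; it is not H2O's implementation and will not reproduce the row counts h2o.splitFrame() returns.

```r
set.seed(820)

n_rows <- 699   # number of rows in the BreastCancer data
ratio  <- 0.80  # requested training proportion

# each row independently lands in the training set with probability `ratio`
in_train <- runif(n_rows) < ratio

n_train <- sum(in_train)
n_test  <- sum(!in_train)

# the realised proportion is close to, but generally not exactly, 0.80
realised_ratio <- n_train / n_rows
print(c(train = n_train, test = n_test))
print(realised_ratio)
```

On the real frames, the same check would be h2o.nrow(train_h20) / h2o.nrow(BreastCancer_h2o).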
3.3 Data exploration
Here we showcase some of the functions h2o offers for data exploration and manipulation. For the most part, wrangling in h2o feels like performing surgery with a butter knife, especially to someone used to relatively small data sets. Just keep in mind that it is designed mainly for efficiency with huge data sets, and that it does beautifully. For data sets small enough to be read into R, it may be less frustrating to pull the h2oFrame into R as a data frame using as.data.frame(h2oFrame_object), wrangle and explore the data using the mighty dplyr package, then push the data back into h2o for training.
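The round trip described above might look like the sketch below. It assumes a local H2O instance can be started, and the wrangling steps (dropping Id, removing rows with a missing Bare.nuclei) and the frame name "bc_wrangled" are arbitrary choices made for illustration.

```r
library(h2o)
library(dplyr)
library(mlbench)

h2o.init()  # assumes a local H2O instance can be started or is already running

data(BreastCancer)
bc <- BreastCancer
bc[] <- lapply(bc, factor, ordered = FALSE)  # unordered factors only
bc_h2o <- as.h2o(bc)

# pull the H2OFrame into R as an ordinary data frame
bc_df <- as.data.frame(bc_h2o)

# wrangle with dplyr (illustrative steps only)
bc_wrangled <- bc_df %>%
  select(-Id) %>%
  filter(!is.na(Bare.nuclei))

# push the result back into h2o under a hypothetical frame name
bc_wrangled_h2o <- as.h2o(bc_wrangled, destination_frame = "bc_wrangled")
h2o.dim(bc_wrangled_h2o)
```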
3.3.1 get dimensions
h2o.dim(train_h20)
[1] 561 11
3.3.2 glimpse
Some R functions (e.g. head, dim) seem to work on h2oFrames as though the operations were done in R. However, it may be wise to keep using the standard h2o. prefix even in such cases, to make it explicit which operations are not done in R.
h2o.head(train_h20,n=5) # controlling number of observations to pull
       Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
1 1000025            5         1          1             1            2
2 1002945            5         4          4             5            7
3 1015425            3         1          1             1            2
4 1016277            6         8          8             1            3
5 1017122            8        10         10             8            7
  Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
1           1           3               1       1    benign
2          10           3               2       1    benign
3           2           3               1       1    benign
4           4           3               7       1    benign
5          10           9               7       1 malignant
3.3.3 get summary statistics
h2o.describe(train_h20)
             Label Type Missing Zeros PosInf NegInf Min Max      Mean
 1              Id enum       0     1      0      0   0 644        NA
 2    Cl.thickness enum       0   119      0      0   0   9        NA
 3       Cell.size enum       0   311      0      0   0   9        NA
 4      Cell.shape enum       0   284      0      0   0   9        NA
 5   Marg.adhesion enum       0   331      0      0   0   9        NA
 6    Epith.c.size enum       0    38      0      0   0   9        NA
 7     Bare.nuclei enum      13   328      0      0   0   9        NA
 8     Bl.cromatin enum       0   115      0      0   0   9        NA
 9 Normal.nucleoli enum       0   354      0      0   0   9        NA
10         Mitoses enum       0   463      0      0   0   8        NA
11           Class enum       0   370      0      0   0   1 0.3404635
      Sigma Cardinality
 1       NA         645
 2       NA          10
 3       NA          10
 4       NA          10
 5       NA          10
 6       NA          10
 7       NA          10
 8       NA          10
 9       NA          10
10       NA           9
11 0.474288           2
h2o.summary(train_h20)
 Id        Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
 1182404:4 1 :119       1 :311    1 :284     1 :331        2:314
 1276091:4 5 :104       10: 53    2 : 49     3 : 46        3: 54
 1198641:3 3 : 86       3 : 41    3 : 47     10: 45        4: 41
 1061990:2 4 : 64       2 : 36    10: 45     2 : 45        1: 38
 1105524:2 10: 54       4 : 33    4 : 34     4 : 23        6: 31
 1114570:2 2 : 41       6 : 25    6 : 25     8 : 23        5: 29
 Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
 1 :328      2:142       1 :354          1 :463  benign   :370
 10:103      3:138       10: 49          2 : 29  malignant:191
 5 : 25      1:115       3 : 36          3 : 25
 2 : 22      7: 57       2 : 29          10: 12
 3 : 22      4: 31       8 : 19          4 :  9
 8 : 18      5: 23       6 : 18          7 :  9
 NA: 13
3.3.4 create summary table
3.3.4.1 one variable
We create a simple bar plot showing the frequency of each level of the Class variable. We build the table of counts in h2o, then pull it into R as a data frame for plotting. This seemed like a nice opportunity to demonstrate the conversion of an h2oFrame into an R data frame.
library(ggplot2)
library(dplyr)
h2o.table(train_h20[,"Class"]) %>% # create table in h2o
as.data.frame() %>% # convert h2oFrame to dataframe and feed to ggplot
ggplot(aes(x=Class,y=Count))+
geom_bar(stat="identity", fill="steelblue")+
labs(title='Number of clinic cases recorded by cancer class',
x='cancer class',
y='number of clinic cases')+
theme(plot.title = element_text(hjust = 0.5))
3.3.4.2 two variables
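The same approach works for a pair of variables. Note that h2o.table() is documented as accepting a frame with at most two columns, so counts over more variables need a different route.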
level_order <- c("1","2","3","4","5","6","7","8","9","10")
h2o.table(train_h20[,c("Class","Cell.size")]) %>% # create table grouping by Class & Cell.size
as.data.frame() %>% # convert h2oFrame to dataframe
mutate(Cell.size=ordered(Cell.size,levels=level_order)) %>% # convert Cell.size to ordered factor
ggplot(aes(x=Cell.size,y=Counts,fill = Class))+
geom_bar(stat="identity",position = "dodge")+
labs(title='Number of clinic cases recorded by cell size',
x='cancer cell size',
y='number of clinic cases')+
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(guide = "none", # guide = FALSE is deprecated in recent ggplot2
values = c(benign="steelblue",
malignant="blue")
)
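For counts over more than two variables, one option when the data is small enough to live in R is base R's table(), which cross-classifies any number of factors. The sketch below works on the mlbench data frame directly rather than on an H2OFrame, and the choice of the three variables is arbitrary.

```r
library(mlbench)

data(BreastCancer)

# cross-classify three factors and flatten the result to one row per combination
counts_3way <- as.data.frame(
  table(BreastCancer[, c("Class", "Cell.size", "Mitoses")]),
  responseName = "Counts"
)

# show the most frequent combinations first
head(counts_3way[order(-counts_3way$Counts), ])
```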
3.3.5 missing values
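The h2o.describe() function revealed that the Bare.nuclei variable has 13 missing values in the training frame. We impute them with the mode. Note that h2o.impute() performs the imputation in place, so the result does not need to be assigned back to train_h20.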
h2o.impute(data=train_h20,
column = "Bare.nuclei",
method = "mode")
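To make explicit what mode imputation does to a categorical column, here is a base R sketch of the same idea applied to the full mlbench data frame. It is an illustration of the technique, not of how h2o computes it, and the object names are made up for the example.

```r
library(mlbench)

data(BreastCancer)

bare <- BreastCancer$Bare.nuclei

# count of missing values before imputation
n_missing <- sum(is.na(bare))

# the mode is the most frequent level (table() ignores NA by default)
level_counts <- table(bare)
mode_level <- names(level_counts)[which.max(level_counts)]

# replace every missing value with the mode
bare_imputed <- bare
bare_imputed[is.na(bare_imputed)] <- mode_level

print(n_missing)
print(mode_level)
sum(is.na(bare_imputed))
```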
3.4 Model building
A number of algorithms have been implemented in h2o. A full list can be found in the documentation here
3.4.1 Random forest model
3.4.1.1 train
h2o_rf <- h2o.randomForest(y = "Class", # specify response variable
training_frame = train_h20,
validation_frame = test_h2o,
ntrees = 500,
nfolds = 10, # specify k in k-fold cv
seed = 3328)
3.4.1.2 check performance metrics
H2O Flow provides a really sweet graphical representation of all training metrics, and it is definitely worth looking at. Flow also allows metrics to be monitored during training through performance graphs. Models can be accessed by running the getModels command in H2O Flow. Below we explore how performance can be viewed from R.
h2o.performance(h2o_rf)
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.01727592
RMSE: 0.1314379
LogLoss: 0.06625494
Mean Per-Class Error: 0.01748267
AUC: 0.9985001
pr_auc: 0.9919338
Gini: 0.9970001
R^2: 0.9230636
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          benign malignant    Error     Rate
benign       359        11 0.029730  =11/370
malignant      1       190 0.005236   =1/191
Totals       360       201 0.021390  =12/561
Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.415569 0.969388 187
2                       max f2  0.415569 0.984456 187
3                 max f0point5  0.691324 0.986770 165
4                 max accuracy  0.691324 0.978610 165
5                max precision  1.000000 1.000000   0
6                   max recall  0.246980 1.000000 193
7              max specificity  1.000000 1.000000   0
8             max absolute_mcc  0.415569 0.953699 187
9   max min_per_class_accuracy  0.558049 0.975676 182
10 max mean_per_class_accuracy  0.415569 0.982517 187
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
One can’t help but marvel at the wealth of information provided by the h2o model object. Perhaps even more exciting is getting this kind of performance from our first model with practically no work done on the data. A thing of beauty indeed!
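The metrics printed above are reported on out-of-bag training samples. Because the model was trained with both a validation frame and cross-validation, the validation and cross-validated metrics should also be retrievable. The self-contained sketch below shows one way to do that. It assumes a local H2O instance can be started, uses fewer trees and folds than the text to keep it quick, and drops the Id column, which is a choice made here and not something the original model does.

```r
library(h2o)
library(mlbench)

h2o.init(nthreads = -1)  # assumes a local H2O instance can be started

data(BreastCancer)
bc <- BreastCancer
bc$Id <- NULL                                # assumption: drop the identifier
bc[] <- lapply(bc, factor, ordered = FALSE)  # unordered factors only
bc_h2o <- as.h2o(bc)

splits <- h2o.splitFrame(bc_h2o, ratios = 0.80, seed = 820)

# smaller settings than in the text, purely to keep the sketch fast
rf_sketch <- h2o.randomForest(y = "Class",
                              training_frame = splits[[1]],
                              validation_frame = splits[[2]],
                              ntrees = 50,
                              nfolds = 5,
                              seed = 3328)

# metrics on the validation frame and from cross-validation
perf_valid <- h2o.performance(rf_sketch, valid = TRUE)
perf_xval  <- h2o.performance(rf_sketch, xval = TRUE)

# AUC for training, validation and cross-validation side by side
auc_all <- h2o.auc(rf_sketch, train = TRUE, valid = TRUE, xval = TRUE)
print(auc_all)

# confusion matrix on the validation frame
h2o.confusionMatrix(perf_valid)
```

Comparing the three AUC values gives a quick sense of whether the training figures are optimistic relative to data the model has not seen.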
plot(h2o_rf)
3.4.2 Extreme gradient boosted model
3.4.2.1 train
h2o_xgb <- h2o.xgboost(y = "Class", # specify response variable
training_frame = train_h20,
validation_frame = test_h2o,
ntrees = 500,
nfolds = 10, # specify k in k-fold cv
seed = 3328)
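The XGBoost backend is not available on every platform H2O runs on, so it can be worth checking before training. The sketch below uses h2o.xgboost.available() for that, assuming a local H2O instance can be started. Falling back to h2o.gbm() is a suggestion made here, not something the original workflow does.

```r
library(h2o)

h2o.init()  # assumes a local H2O instance can be started or is already running

# ask the running H2O instance whether XGBoost models can be built
xgb_ok <- h2o.xgboost.available()

if (xgb_ok) {
  message("XGBoost is available; h2o.xgboost() can be used.")
} else {
  message("XGBoost is not available here; h2o.gbm() is one possible alternative.")
}
```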
3.4.2.2 check performance metrics
h2o.performance(h2o_xgb)
plot(h2o_xgb)
3.5 Some last thoughts
H2O is indeed an awesome tool, with a moderate learning curve relative to something like sparklyr (which provides an R interface for Apache Spark). Though powerful, h2o should probably be thought of as supplementing the language it is used from, not as a substitute for it. If you are interested in learning more about H2O, the documentation is probably the next best place to look.