Table of Model Results using kable and kableExtra

Author

Published

September 13, 2019

I’m at the R/Medicine conference (no it’s not a Reddit thing) and got to help Alison Hill with her R Markdown for Medicine workshop. One of the questions I got to tinker with was making tables used to report model results.

One technique I learned while doing my MPH was to add variables to your model in blocks. It reduces the number of tests you need to perform, and is more meaningful than saying “I ran stepwise”. So, as a researcher, you will add variables in meaningful blocks and then see if that block of variables is “significant” by looking at the F-statistic.

Here is one example of at least putting your model results into a table.

Creating the models

library(reshape2) # using this just for the tips dataset
library(broom)
library(purrr)
library(knitr)
library(kableExtra)

head(tips)

  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4

# create a bumch of models
m1 <- lm(tip ~ total_bill, data = tips)
m2 <- lm(tip ~ total_bill + sex, data = tips)
m3 <- lm(tip ~ total_bill + sex + smoker, data = tips)
m4 <- lm(tip ~ total_bill + smoker + size, data = tips)

Now that we have the models, we can use the broom package to get a nice consistent dataframe of the coefficients

# get a dataframe of the model estimates
models <- list(m1, m2, m3, m4)
broom_outputs <- purrr::map(models, broom::tidy)
broom_outputs

[[1]]
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.920   0.160        5.76 2.53e- 8
2 total_bill     0.105   0.00736     14.3  6.69e-34

[[2]]
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.933    0.174       5.37  1.84e- 7
2 total_bill    0.105    0.00746    14.1   2.33e-33
3 sexMale      -0.0266   0.138      -0.192 8.48e- 1

[[3]]
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.977    0.178       5.48  1.05e- 7
2 total_bill    0.106    0.00748    14.2   1.73e-33
3 sexMale      -0.0281   0.138      -0.203 8.39e- 1
4 smokerYes    -0.149    0.135      -1.10  2.72e- 1

[[4]]
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.709    0.205       3.46  6.38e- 4
2 total_bill    0.0939   0.00933    10.1   4.26e-20
3 smokerYes    -0.0834   0.138      -0.605 5.46e- 1
4 size          0.180    0.0878      2.05  4.11e- 2

What we have now is a list of dataframes, one for each model we fit.

Just the beta estimates

If we just want the coefficients, we only need the terms and estimates columns and do a join. We can use the purrr::reduce function to do this

# join them all together
model_outputs <- purrr::reduce(broom_outputs, dplyr::full_join, by = 'term')
model_outputs

# A tibble: 5 × 17
  term       estimate.x std.error.x statistic.x p.value.x estimate.y std.error.y
  <chr>           <dbl>       <dbl>       <dbl>     <dbl>      <dbl>       <dbl>
1 (Intercep…      0.920     0.160          5.76  2.53e- 8     0.933      0.174  
2 total_bill      0.105     0.00736       14.3   6.69e-34     0.105      0.00746
3 sexMale        NA        NA             NA    NA           -0.0266     0.138  
4 smokerYes      NA        NA             NA    NA           NA         NA      
5 size           NA        NA             NA    NA           NA         NA      
# ℹ 10 more variables: statistic.y <dbl>, p.value.y <dbl>, estimate.x.x <dbl>,
#   std.error.x.x <dbl>, statistic.x.x <dbl>, p.value.x.x <dbl>,
#   estimate.y.y <dbl>, std.error.y.y <dbl>, statistic.y.y <dbl>,
#   p.value.y.y <dbl>

This will join all the dataframes together using the term column as the key. Variables that do not exist in the model will be filled with an NA. Problem now, is that we have a bunch of columns we don’t need. So the next step is to pull out the columns we want, and rename them accordingly.

model_table <- model_outputs %>%
  dplyr::select(term, tidyselect::starts_with('estimate'))

names(model_table) <- c('Variable', 'Model 1', "Model 2", "Model 3", "Model 4")
model_table

# A tibble: 5 × 5
  Variable    `Model 1` `Model 2` `Model 3` `Model 4`
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     0.920    0.933     0.977     0.709 
2 total_bill      0.105    0.105     0.106     0.0939
3 sexMale        NA       -0.0266   -0.0281   NA     
4 smokerYes      NA       NA        -0.149    -0.0834
5 size           NA       NA        NA         0.180

What if I need more variables?

Just reporting the beta coefficients is not how you want to report your findings, typically you’d want to also display some measure of uncertainty along with the results. Here I’m just going to add the std.error column, but you can calculate the 95% confidence intervals using the std.error.

Here’s the reference from the kableExtra library I’m using to build the coefficient table:

# this is the exmaple form the kable docs I used for reference
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

collapse_rows_dt <- data.frame(C1 = c(rep("a", 3), rep("b", 2)),
                               C2 = c(rep("c", 2), rep("d", 1), rep("c", 1), rep("d", 1)),
                               C3 = 1:5,
                               C4 = sample(c(0,1), 5, replace = TRUE))
collapse_rows_dt

  C1 C2 C3 C4
1  a  c  1  1
2  a  c  2  1
3  a  d  3  0
4  b  c  4  1
5  b  d  5  0

kable(collapse_rows_dt, align = "c") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1:2, valign = "top")

C1	C2	C3	C4
a	c	1	1
	c	2	1
	d	3	0
b	c	4	1
b	d	5	0

This example was important to my solution, becuase I would need to format my coefficient table the same way as collapse_rows_dt. This means that:

C1 would be the variables in my model
C2 would need to alternate between the estimate and the std.error
The other columns would be the actual values for each model.

Since the model outputs the estimate and std.error as columns, I know some kind of reshaping/tidying step will need to be involved (melt/gather / pivot_longer).

We still start from the same data steps as the previous example

broom_outputs

[[1]]
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.920   0.160        5.76 2.53e- 8
2 total_bill     0.105   0.00736     14.3  6.69e-34

[[2]]
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.933    0.174       5.37  1.84e- 7
2 total_bill    0.105    0.00746    14.1   2.33e-33
3 sexMale      -0.0266   0.138      -0.192 8.48e- 1

[[3]]
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.977    0.178       5.48  1.05e- 7
2 total_bill    0.106    0.00748    14.2   1.73e-33
3 sexMale      -0.0281   0.138      -0.203 8.39e- 1
4 smokerYes    -0.149    0.135      -1.10  2.72e- 1

[[4]]
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.709    0.205       3.46  6.38e- 4
2 total_bill    0.0939   0.00933    10.1   4.26e-20
3 smokerYes    -0.0834   0.138      -0.605 5.46e- 1
4 size          0.180    0.0878      2.05  4.11e- 2

The trick here is to get the data in the correct format that can be used by kableExtra

broom_stats_long <- purrr::map(broom_outputs, tidyr::gather, 'key', 'value', estimate, std.error)
broom_stats_long

[[1]]
# A tibble: 4 × 5
  term        statistic  p.value key         value
  <chr>           <dbl>    <dbl> <chr>       <dbl>
1 (Intercept)      5.76 2.53e- 8 estimate  0.920  
2 total_bill      14.3  6.69e-34 estimate  0.105  
3 (Intercept)      5.76 2.53e- 8 std.error 0.160  
4 total_bill      14.3  6.69e-34 std.error 0.00736

[[2]]
# A tibble: 6 × 5
  term        statistic  p.value key          value
  <chr>           <dbl>    <dbl> <chr>        <dbl>
1 (Intercept)     5.37  1.84e- 7 estimate   0.933  
2 total_bill     14.1   2.33e-33 estimate   0.105  
3 sexMale        -0.192 8.48e- 1 estimate  -0.0266 
4 (Intercept)     5.37  1.84e- 7 std.error  0.174  
5 total_bill     14.1   2.33e-33 std.error  0.00746
6 sexMale        -0.192 8.48e- 1 std.error  0.138  

[[3]]
# A tibble: 8 × 5
  term        statistic  p.value key          value
  <chr>           <dbl>    <dbl> <chr>        <dbl>
1 (Intercept)     5.48  1.05e- 7 estimate   0.977  
2 total_bill     14.2   1.73e-33 estimate   0.106  
3 sexMale        -0.203 8.39e- 1 estimate  -0.0281 
4 smokerYes      -1.10  2.72e- 1 estimate  -0.149  
5 (Intercept)     5.48  1.05e- 7 std.error  0.178  
6 total_bill     14.2   1.73e-33 std.error  0.00748
7 sexMale        -0.203 8.39e- 1 std.error  0.138  
8 smokerYes      -1.10  2.72e- 1 std.error  0.135  

[[4]]
# A tibble: 8 × 5
  term        statistic  p.value key          value
  <chr>           <dbl>    <dbl> <chr>        <dbl>
1 (Intercept)     3.46  6.38e- 4 estimate   0.709  
2 total_bill     10.1   4.26e-20 estimate   0.0939 
3 smokerYes      -0.605 5.46e- 1 estimate  -0.0834 
4 size            2.05  4.11e- 2 estimate   0.180  
5 (Intercept)     3.46  6.38e- 4 std.error  0.205  
6 total_bill     10.1   4.26e-20 std.error  0.00933
7 smokerYes      -0.605 5.46e- 1 std.error  0.138  
8 size            2.05  4.11e- 2 std.error  0.0878

Now that we have the data formatted the way we need, we now select only the columns we need.

# filter only the columns we need
broom_stats_long_filter <- purrr::map(broom_stats_long, dplyr::select, term, key, value)
broom_stats_long_filter

[[1]]
# A tibble: 4 × 3
  term        key         value
  <chr>       <chr>       <dbl>
1 (Intercept) estimate  0.920  
2 total_bill  estimate  0.105  
3 (Intercept) std.error 0.160  
4 total_bill  std.error 0.00736

[[2]]
# A tibble: 6 × 3
  term        key          value
  <chr>       <chr>        <dbl>
1 (Intercept) estimate   0.933  
2 total_bill  estimate   0.105  
3 sexMale     estimate  -0.0266 
4 (Intercept) std.error  0.174  
5 total_bill  std.error  0.00746
6 sexMale     std.error  0.138  

[[3]]
# A tibble: 8 × 3
  term        key          value
  <chr>       <chr>        <dbl>
1 (Intercept) estimate   0.977  
2 total_bill  estimate   0.106  
3 sexMale     estimate  -0.0281 
4 smokerYes   estimate  -0.149  
5 (Intercept) std.error  0.178  
6 total_bill  std.error  0.00748
7 sexMale     std.error  0.138  
8 smokerYes   std.error  0.135  

[[4]]
# A tibble: 8 × 3
  term        key          value
  <chr>       <chr>        <dbl>
1 (Intercept) estimate   0.709  
2 total_bill  estimate   0.0939 
3 smokerYes   estimate  -0.0834 
4 size        estimate   0.180  
5 (Intercept) std.error  0.205  
6 total_bill  std.error  0.00933
7 smokerYes   std.error  0.138  
8 size        std.error  0.0878

Next, you need to create some kind of ID value for each model you’re creating, that way when you’re combining everything into a single dataframe, the estimates are traveling together with the standard error. Here I’m labeling each model as a number starting with 1.

tidy_tables <- purrr::map2_df(broom_stats_long_filter, 1:length(broom_stats_long_filter),
                              function(x, y) dplyr::mutate(x, mod = y))
tidy_tables

# A tibble: 26 × 4
   term        key          value   mod
   <chr>       <chr>        <dbl> <int>
 1 (Intercept) estimate   0.920       1
 2 total_bill  estimate   0.105       1
 3 (Intercept) std.error  0.160       1
 4 total_bill  std.error  0.00736     1
 5 (Intercept) estimate   0.933       2
 6 total_bill  estimate   0.105       2
 7 sexMale     estimate  -0.0266      2
 8 (Intercept) std.error  0.174       2
 9 total_bill  std.error  0.00746     2
10 sexMale     std.error  0.138       2
# ℹ 16 more rows

Almost there!

Lastly, we need to sort the columns by using arrange. Here is where the ID column, mod get’s used to make sure model values are traveling together.

df <- tidy_tables %>%
  dplyr::arrange(term, mod, key, value) %>%
  tidyr::spread(mod, value, drop = FALSE)
df

# A tibble: 10 × 6
   term        key            `1`      `2`      `3`      `4`
   <chr>       <chr>        <dbl>    <dbl>    <dbl>    <dbl>
 1 (Intercept) estimate   0.920    0.933    0.977    0.709  
 2 (Intercept) std.error  0.160    0.174    0.178    0.205  
 3 sexMale     estimate  NA       -0.0266  -0.0281  NA      
 4 sexMale     std.error NA        0.138    0.138   NA      
 5 size        estimate  NA       NA       NA        0.180  
 6 size        std.error NA       NA       NA        0.0878 
 7 smokerYes   estimate  NA       NA       -0.149   -0.0834 
 8 smokerYes   std.error NA       NA        0.135    0.138  
 9 total_bill  estimate   0.105    0.105    0.106    0.0939 
10 total_bill  std.error  0.00736  0.00746  0.00748  0.00933

kable(df, align='c') %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T) %>%
  collapse_rows(columns = 1, valign = "top")

term	key	1	2	3	4
(Intercept)	estimate	0.9202696	0.9332785	0.9770353	0.7090155
(Intercept)	std.error	0.1597347	0.1737557	0.1781633	0.2048813
sexMale	estimate	NA	-0.0266087	-0.0280926	NA
sexMale	std.error	NA	0.1383340	0.1382793	NA
size	estimate	NA	NA	NA	0.1803316
size	std.error	NA	NA	NA	0.0878033
smokerYes	estimate	NA	NA	-0.1491923	-0.0834326
smokerYes	std.error	NA	NA	0.1354353	0.1380005
total_bill	estimate	0.1050245	0.1052324	0.1059431	0.0938884
total_bill	std.error	0.0073648	0.0074582	0.0074827	0.0093314

kableExtra

There kableExtra is great for tables like this. Here are links to more information

https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
http://haozhu233.github.io/kableExtra/