
# There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in `R` (especially once you add many packages) there is usually more than one way. As an example, we will look at three common `R` functions: `str()`, `head()`, and the `tibble` package’s `glimpse()`.

## `tibble::glimpse()`

Consider the important task of inspecting a `data.frame` to see column types and a few example values. The `dplyr`/`tibble`/`tidyverse` way of doing this is as follows:

```
library("tibble")
glimpse(mtcars)

Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10....
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 27...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65,...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.0...
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3....
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 18...
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2
```

## `utils::str()`

A common “base-`R`” (actually from the `utils` package) way to examine the data is:

```
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
```

However, both of the above summaries have unfortunately obscured an important fact about the `mtcars` `data.frame`: the car names! This is because `mtcars` stores this important key as row-names instead of as a column. Even `base::summary()` will hide this from the analyst.
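
One way to keep the car names visible is to promote the row names to an ordinary column before inspecting the data. The following is a minimal base-`R` sketch; the column name `car` and the variable `mtcars_named` are our own choices for illustration, not part of `mtcars`.

```r
# Copy the row names into a regular column.
# The column name "car" is an assumption made for this example.
mtcars_named <- data.frame(car = rownames(mtcars),
                           mtcars,
                           row.names = NULL,
                           stringsAsFactors = FALSE)

# The car names now appear as the first column of the summary
str(mtcars_named)
```

With the names stored as a column, `str()` and `glimpse()` both report them alongside the other variables.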

## `utils::head()`

The base-`R` command `head()` (again from the `utils` package) provides a good way to examine the first few rows of data:

```
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

We are missing the size of the table and the column types, but those are easy to get with “`dim(mtcars)`” and “`stack(vapply(mtcars, class, character(1)))`”. And we can get something like the “columns on the side” presentation as follows:

```
t(head(mtcars))

     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
mpg      21.00        21.000      22.80         21.400             18.70   18.10
cyl       6.00         6.000       4.00          6.000              8.00    6.00
disp    160.00       160.000     108.00        258.000            360.00  225.00
hp      110.00       110.000      93.00        110.000            175.00  105.00
drat      3.90         3.900       3.85          3.080              3.15    2.76
wt        2.62         2.875       2.32          3.215              3.44    3.46
qsec     16.46        17.020      18.61         19.440             17.02   20.22
vs        0.00         0.000       1.00          1.000              0.00    1.00
am        1.00         1.000       1.00          0.000              0.00    0.00
gear      4.00         4.000       4.00          3.000              3.00    3.00
carb      4.00         4.000       1.00          1.000              2.00    1.00
```
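
The table size and column types mentioned above can be collected in a couple of lines. This is a minimal sketch using only base-`R`; the variable names `size` and `types` are ours.

```r
# Number of rows and columns
size <- dim(mtcars)
print(size)

# One row per column of mtcars: the class lands in `values`,
# the column name in `ind`
types <- stack(vapply(mtcars, class, character(1)))
print(types)
```

Note that `vapply(..., character(1))` assumes each column has exactly one class; a column carrying several classes (for example a `POSIXct` date-time) would raise an error, so this form is best suited to simple tables.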

Also, `head()` is usually adapted (through methods in `R`’s `S3` object system) to work with remote data sources.

```
library("sparklyr")
library("dplyr")  # for copy_to()

sc <- sparklyr::spark_connect(version = '2.0.2',
                              master = "local")

dRemote <- copy_to(sc, mtcars)

head(dRemote)

# Source:   query [6 x 11]
# Database: spark connection master=local[4] app=sparklyr local=TRUE
#
# # A tibble: 6 x 11
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
# 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
# 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
# 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

glimpse(dRemote)

# Observations: 32
# Variables: 11
#
#  Error in if (width[i] <= max_width[i]) next :
#   missing value where TRUE/FALSE needed

broom::glance(dRemote)

#  Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl
```
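
The `S3` dispatch that lets `head()` work on remote tables can be illustrated without a Spark installation. The sketch below uses a made-up class, `toy_remote` (hypothetical, not part of `sparklyr` or `dplyr`), that wraps a local `data.frame` and supplies its own `head()` method.

```r
# A toy stand-in for a remote table handle (hypothetical class name)
toy_remote <- function(d) {
  structure(list(data = d), class = "toy_remote")
}

# S3 method: the head() generic dispatches here for "toy_remote" objects
head.toy_remote <- function(x, n = 6L, ...) {
  # A real remote source would typically push the row limit down to the
  # data store; here we simply delegate to the data.frame method.
  utils::head(x$data, n = n, ...)
}

h <- toy_remote(mtcars)
head(h, n = 3)
```

Functions such as `glimpse()` only work on a new class when someone has written a matching method (or a usable fallback) for it, which is one plausible reason the calls above fail while `head()` succeeds.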

## `replyr::replyr_summary()`

And, as we pointed out before: use `replyr::replyr_summary()` to get summaries of remote data columns.

## Conclusion

`R` often has more than one way to perform nearly the same task. When working in `R`, consider trying a few functions to see which one best fits your needs. Also be aware that base-`R` (`R` with the standard packages) often already has powerful capabilities.