Python_and_R/07-basic-plotting-and-visualization.Rmd at main · uvastatlab/Python_and_R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
# Basic Plotting and Visualization

```{r,echo=FALSE}
knitr::opts_chunk$set(comment = '', prompt = FALSE, collapse = FALSE)
```

This chapter discusses methods for creating plots to explore and understand data. Visualization in Python and R is a gigantic and evolving topic; we don't pretend to present a comprehensive comparison.

The plots below make use of the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) data set, which contains various measurements for 344 penguins across three islands in the Antarctic Palmer Archipelago. The data were collected by [Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and colleagues, and they were made available under a [CC0 public domain license](https://creativecommons.org/share-your-work/public-domain/cc0/) by Allison Horst, Alison Hill, and Kristen Gorman.

For the R sections below, we show how to make each plot with base R and with **ggplot2**.

```{r, echo = F}
library(palmerpenguins)
```

```{python, echo = F}
penguins = r.penguins
```

Here's a glimpse at the data set:
```{r}
head(penguins)
```

## Histograms

Visualizing the distribution of numeric data.

#### Python {-}

The Python library **Matplotlib** provides the `hist()` function for plotting a histogram. There are many arguments that can be specified in the `hist()` function for customizing the histogram plot to fit your needs: Some parameters include the number of bins, the upper and lower bounds on each bin, the coloration, and more.

Below, we show how to generate a histogram of the bill lengths from the `penguins` dataset. We specified 30 bins, each of which is light blue with a black outline of `linewidth = 1`. The `hist()` function defaults to showing no bin outline, which can make it difficult to distinguish bins clearly, so we add in the outlines here.

Note that the bins are left-inclusive and right-exclusive. For example, if a particular bin spans the range of 1 to 3, the bin will include the value 1 but will exclude the value 3 (and will include all values between 1 and 3). In short, bin ranges are [x1,x2), where x1 is the starting point of the bin and x2 is the ending point of the bin.

Notice the semicolon at the end of the `plt.hist()` function. This suppresses the printing of the array generated to create the histogram.

```{python}
import matplotlib.pyplot as plt

plt.clf() # clears the current figure to ensure that multiple plots are not overlaid
plt.hist(penguins.bill_length_mm, bins = 30, range = (30, 60),
         color = 'lightblue', edgecolor = 'k', linewidth = 1);
plt.title("Penguin Bill Lengths")
plt.xlabel("Bill Length (mm)")
plt.ylabel("Count")
plt.show()
```

#### R {-}

Base R's `hist()` function generates histograms, and features of the histogram---like the bar color, number of bins/breaks, and so on---can be easily customized with arguments as demonstrated below.

```{r}
hist(penguins$bill_length_mm, breaks = 25, col = 'lightblue',
     xlim = c(30, 60),
     main = 'Penguin Bill Lengths',
     xlab = 'Bill Length (mm)', ylab = 'Count')
```

The **ggplot2** method for generating histograms follows the standard **ggplot2** syntax: Initialize a plot with `ggplot()`, then add layers thereto. Here, the layer to add is `geom_histogram()`.

```{r, warning = F}
library(ggplot2)

ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(fill = 'lightblue', color = 'black', bins = 25) +
  xlim(30, 60) +
  labs(title = 'Penguin Bill Lengths',
       x = 'Bill Length (mm)', y = 'Count')
```

## Barplots

Visualizing the distribution of categorical data.

#### Python {-}

For this example, we generate a barplot showing how many penguins of each species---Adelie, Chinstrap, Gentoo---are represented in the data. We show two ways of doing so here.

First, we can use the function `bar()` from the **Matplotlib** plotting library. We determine the per-species penguin counts and then use that data to create the bar plot.

```{python}
import matplotlib.pyplot as plt

# Determine the number of each species
adelie_counts = len(penguins.loc[penguins["species"] == "Adelie"])
chinstrap_counts = len(penguins.loc[penguins["species"] == "Chinstrap"])
gentoo_counts = len(penguins.loc[penguins["species"] == "Gentoo"])

# Save the counts information as an arrays for input into the bar() function
spec = ["Adelie", "Chinstrap", "Gentoo"]
counts = [adelie_counts, chinstrap_counts, gentoo_counts]

plt.clf()
plt.bar(spec, counts)
plt.show()
```

The data are stored in a **pandas** Dataframe, which has its own built-in plotting module, `plot`. As an alternative approach to the one above, we can use the **pandas** `bar()` function to recreate the barplot.

```{python}
plt.clf()
penguins["species"].value_counts().plot.bar()
plt.show()
```

Using this second approach, we generated the same barplot with far less effort. Using the built-in **pandas** plotting routine proved to be the more efficient method here.

#### R {-}

To form barplots, we'll first create a summary data frame containing the statistics we're looking to plot. Here, that's simply the sample size of each species in the data set.

```{r}
species_counts <- as.data.frame(xtabs(~ species, data = penguins))
species_counts
```

We can plot those values using the `barplot()` function in base R, specifying arguments along the way to customize the title/axis labeling, bar colors, and range of the y axis. To add values above the bars, we can follow `barplot()` with a `text()` call as below.

```{r}
penguin_plot <- barplot(Freq ~ species, data = species_counts,
                        col = c('lightblue', 'cornflowerblue', 'darkslateblue'),
                        main = 'Species Sample Size',
                        xlab = 'Species', ylab = 'Count', ylim = c(0, 200))
text(x = penguin_plot, y = species_counts$Freq + 10,
     labels = species_counts$Freq)
```

To recreate the bar plot above with **ggplot2**, one can add a `geom_bar()` layer to a plot initialized with `ggplot()`.

```{r, warning = F}
ggplot(species_counts, aes(x = species, y = Freq)) +
  geom_bar(aes(fill = species), stat = 'identity') +

  # scale_fill_manual() is used here for bar-color customization
  scale_fill_manual(values = c('lightblue', 'cornflowerblue', 'darkslateblue')) +
  labs(title = 'Species Sample Size', x = 'Species', y = 'Count') +

  # For simplicity, we omit the legend
  theme(legend.position = 'none') +
  ylim(0, 200) +

  # geom_text() is used here to add counts above the bars
  geom_text(aes(label = Freq, vjust = -0.5))
```

## Scatterplot

Visualizing the relationship between two numeric variables.

#### Python {-}

The `scatter()` function, part of **Matplotlib**, can produce scatterplots in Python. The `x` and `y` arguments specify the points to plot. In the example below, we also use the `c` and `marker` arguments to customize the point color and point shape, respectively. The `xlabel()`, `ylabel()`, and `title()` functions customize plot labels.

```{python}
import matplotlib.pyplot as plt

# Remove NA rows, as we only want to plot present data
penguins_no_na = penguins.dropna()

plt.clf()

# 'd' generates diamond markers (learn more about available marker shapes here: https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers)
plt.scatter(x = penguins_no_na['body_mass_g'],
            y = penguins_no_na['flipper_length_mm'],
            c = 'lightblue', marker = 'd')
plt.xlabel('Body Mass (g)')
plt.ylabel('Flipper Length (mm)')
plt.title('Scatterplot of Body Mass and Flipper Length')

plt.show()
```

#### R {-}

Scatterplots can be generated in base R with the `plot()` function. The `pch` argument below modifies the point shape (e.g., 20 = solid circle; 24 = unfilled triangle; etc.)

```{r}
plot(x = penguins$body_mass_g, y = penguins$flipper_length_mm,
     col = 'navy', pch = 20,
     main = 'Scatterplot of Body Mass and Flipper Length',
     xlab = 'Body Mass (g)', ylab = 'Flipper Length (mm)')
```

To generate a scatterplot with **ggplot2**, initialize a plot with `ggplot()`, then add a layer of points with `geom_point()`.

```{r, warning = F}
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(color = 'navy') +
  labs(title = 'Scatterplot of Body Mass and Flipper Length',
       x = 'Body Mass (g)', y = 'Flipper Length (mm)')
```

## Stripcharts

Stripcharts, or strip plots, are one-dimensional scatterplots. Like boxplots, they reveal the distribution of a numeric variable within levels of a categorical variable.

#### Python {-}

The **seaborn** package provides the `stripplot()` function. The user simply specifies which variables they want on the x and y axes. Below, we specify `island` on the y axis to see the distribution of `bill_depth_mm` horizontally. Specify the **pandas** DataFrame using the `data` argument. Finally, create the plot using `plt.show()`.

```{python}
import seaborn as sns
import matplotlib.pyplot as plt
sns.stripplot(x = "bill_depth_mm", y = "island", data = penguins)
plt.show()
```

The [stripplot](https://seaborn.pydata.org/generated/seaborn.stripplot.html) help page provides more examples.

#### R {-}

Base R offers the `stripchart()` function. To indicate the numeric variable and the grouping variable, you can use formula notation: `numeric_var ~ grouping_var`. Adding `method = 'jitter'` to the arguments spreads the points out slightly within each level of the grouping variable, making it easier to see points that might otherwise be obscured by overlap.

```{r}
stripchart(bill_depth_mm ~ island, data = penguins,
           method = 'jitter',
           ylab = 'Island', xlab = 'Bill Depth (mm)',
           main = 'Stripchart of Bill Depth by Island')
```

Stripcharts can also be made with **ggplot2**'s  `geom_jitter()` function, as shown below. You can control the amount of jitter with a `position` argument in `geom_jitter()`.

```{r}
ggplot(penguins, aes(x = island, y = bill_depth_mm)) +
  geom_jitter(aes(color = island), position = position_jitter(0.1)) +

  # scale_fill_manual() is used to manually specify group colors
  # once aes(color = island) is specified in `geom_jitter()`
  scale_color_manual(values = c('lightblue', 'cornflowerblue', 'darkslateblue')) +

  labs(title = 'Stripchart of Bill Depth by Island',
       x = 'Island', y = 'Bill Depth (mm)') +
  theme(legend.position = 'none')
```

## Boxplots

Visualizing the relationship between a numeric variable and a categorical variable via five-number summaries.

#### Python {-}

The `boxplot()` function in **seaborn** generates boxplots, and functions from **Matplotlib**, like `xlabel()` and `ylabel()`, can be used to control the aesthetics of the plot.

```{python}
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure()
sns.boxplot(x = "island", y = "bill_depth_mm", data = penguins)
plt.xlabel("Island")
plt.ylabel("Bill Depth (mm)")
plt.title("Bill Depth by Island")
plt.show()
```

#### R {-}

The `boxplot()` function in base R generates boxplots. A user specifies the grouping variable and the numeric variable to be plotted in formula notation: `y ~ grouping_var`.

```{r}
boxplot(bill_depth_mm ~ island, data = penguins,
        col = c('lightblue', 'cornflowerblue', 'darkslateblue'),
        main = 'Boxplot of Bill Depth by Island',
        xlab = 'Island', ylab = 'Bill Depth (mm)')
```

To generate a boxplot with **ggplot2**, add a `geom_boxplot()` layer to a plot initialized with `ggplot()`.

```{r, warning = F}
ggplot(penguins, aes(x = island, y = bill_depth_mm)) +
  geom_boxplot(aes(fill = island)) +
  scale_fill_manual(values = c('lightblue', 'cornflowerblue', 'darkslateblue')) +
  labs(title = 'Boxplot of Bill Depth by Island',
       x = 'Island', y = 'Bill Depth (mm)') +
  theme(legend.position = 'none')
```

## Facet plots

Facet plots (also called trellis plots, lattice plots, and conditional plots) are comprised of multiple smaller plots, where each subplot contains a subset of the overall data, with subsets defined by one or more faceting variables.

#### Python {-}

The **seaborn** package provides several functions for creating facet plots, including `relplot()`, `displot()`, `catplot()`, and `lmplot()`. Below we demonstrate the `lmplot()` function, which allows a user to create scatterplots at certain levels of categorical variable. A user specifies the x and y variables as character strings using the `x` and `y` arguments, respectively. The grouping variable is identified with either the `col` or `row` arguments. By default, a least-squares lines is added to each plot. Setting `ci = None` suppresses the confidence interval ribbon.

```{python}
import seaborn as sns
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
           col = "species",
           data = penguins, ci = None)
```

To facet by two variables, provide variables to both the `col` and `row` arguments.

```{python}
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
           col = "species", row = "sex",
           data = penguins, ci = None)
```

Set `lowess = True` for smooth trend lines. Notice also that color can be mapped to the same variable used for faceting.

```{python}
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
           col = "species", hue = "species",
           data = penguins, ci = None, lowess = True)
```


The [1lmplot()1](https://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot) help page showcases other examples.

#### R {-}

The `coplot()` function in base R produces conditioning plots using formula notation: `y ~ x | grouping_var`. The `rows` and `columns` arguments control layout. Below we specify one row of plots. The `panel` argument controls what action is carried out in each plot. The default is a scatterplot. Below we use the base R `panel.smooth` function to create scatter plots with a smooth trend line.

```{r warning=FALSE, message=FALSE, results='hide'}
coplot(bill_depth_mm ~ bill_length_mm | species,
       data = penguins,
       panel = panel.smooth,
       rows = 1, col = "navy")
```

To condition on two variables, use formula notation with syntax: `y ~ x | grp_var1 * grp_var2`.

```{r warning=FALSE, message=FALSE, results='hide'}
coplot(bill_depth_mm ~ bill_length_mm | species * sex,
       data = penguins,
       panel = panel.smooth,
       col = "navy")
```

The labels of the conditioning variables unfortunately use a lot of real estate in the margins, and there is no easy way to modify that. However, this design works quite well when we condition on a _numeric variable_. The `coplot()` function automatically creates overlapping group intervals to condition on, and the stacked layout of the labels helps us visualize how the relationship between x and y changes between the groups. To manually set the number of groups, use the `number` argument. Below, we specify four groups to be generated for `bill_length_mm`.

```{r warning=FALSE, message=FALSE, results='hide'}
coplot(flipper_length_mm ~ body_mass_g | bill_length_mm,
       data = penguins,
       panel = panel.smooth,
       number = 4,
       rows = 1, col = "navy")
```

Alternatively, **ggplot2** provides a intuitive and easy-to-use method for generating facet plots: A user specifies the aesthetics of the plot using standard **ggplot2** syntax (i.e., as a series of added layers) and then adds an additional call, `facet_wrap()` or `facet_grid()` (the differences are discussed below), specifying the faceting variable(s) to split up the plots by.

```{r, warning = F}
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = 'navy') +

  # Add least-squares line
  geom_smooth(method = 'lm', se = FALSE, color = 'salmon') +

  # Use formula notation, a character vector, or vars() to
  # specify faceting variables; e.g., ~species, c('species'),
  # or vars(species)
  facet_wrap(~ species)
```

The number of rows and columns can be manually specified with `nrow` and `ncol` arguments in `facet_wrap()`. By default, the x and y axes of all facet plots will be on the same scale. The axis ranges can be set to vary freely by adding `scales = 'free'` as an argument (or, alternatively, `scales = 'free_x'` or `scales = 'free_y'` to free just the x or y axis).

Both `facet_wrap()` and `facet_grid()` can be used to make facet plots. When faceting based on multiple variables (e.g., species and sex), `facet_wrap()` will drop group combinations for which there are no data points, whereas `facet_grid()` will generate a plot for all possible group combinations:

```{r, warning = F}
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = 'navy') +
  geom_smooth(method = 'lm', se = F, color = 'salmon') +
  facet_wrap(vars(species, sex))

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = 'navy') +
  geom_smooth(method = 'lm', se = F, color = 'salmon') +

  # Note that facet_grid() has separate `rows` and `cols` arguments
  # for specifying faceting variables
  facet_grid(rows = vars(sex), cols = vars(species))
```