### Introduction

This is the third tutorial in the series Elegant Data Visualization with
ggplot2. In the previous tutorial, we learnt how to create plots using the
`qplot()` function. In this tutorial, we will use the `geom_*` functions to
create some of the plots most routinely used to explore data.

### Libraries, Code & Data

We will use the following libraries in this tutorial: ggplot2, readr, and tibble.

All the data sets used in this tutorial can be found here and code can be downloaded from here.
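The code in this tutorial calls `ggplot()` without a package prefix, so ggplot2 needs to be attached first. A minimal setup, assuming the packages are already installed, might look like this; readr and tibble are only called with the `::` prefix, so attaching them is optional.

```r
# Attach ggplot2 so that ggplot(), aes() and the geom_*() functions are available.
library(ggplot2)

# readr and tibble are used via readr::read_csv() and tibble::tibble(),
# so they only need to be installed, not attached.
```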

### Data

```
ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
```

```
## # A tibble: 1,000 x 11
## id referrer device bouncers n_visit n_pages duration country purchase
## <dbl> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <chr> <lgl>
## 1 1 google laptop TRUE 10 1 693 Czech ~ FALSE
## 2 2 yahoo tablet TRUE 9 1 459 Yemen FALSE
## 3 3 direct laptop TRUE 0 1 996 Brazil FALSE
## 4 4 bing tablet FALSE 3 18 468 China TRUE
## 5 5 yahoo mobile TRUE 9 1 955 Poland FALSE
## 6 6 yahoo laptop FALSE 5 5 135 South ~ FALSE
## 7 7 yahoo mobile TRUE 10 1 75 Bangla~ FALSE
## 8 8 direct mobile TRUE 10 1 908 Indone~ FALSE
## 9 9 bing mobile FALSE 3 19 209 Nether~ FALSE
## 10 10 google mobile TRUE 6 1 208 Czech ~ FALSE
## # ... with 990 more rows, and 2 more variables: order_items <dbl>,
## # order_value <dbl>
```

#### Data Dictionary

- id: row id
- referrer: referrer website/search engine
- device: device used to visit the website
- bouncers: whether the visit bounced
- n_visit: frequency of visits
- n_pages: number of pages visited
- duration: time spent on the website (in seconds)
- country: country of origin
- purchase: whether visitor purchased
- order_items: number of items ordered
- order_value: order value of visitor (in dollars)

### Scatter Plot

A scatter plot displays the relationship between two continuous variables. In
ggplot2, we can build a scatter plot using `geom_point()`. A scatter plot can
show you visually:

- the strength of the relationship between the variables
- the direction of the relationship between the variables
- whether outliers exist

#### Point

The variables representing the X and Y axes can be specified either in
`ggplot()` or in `geom_point()`. We will learn to modify the appearance of the
points in a different tutorial.

```
ggplot(ecom, aes(x = n_pages, y = duration)) +
  geom_point()
```
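To illustrate the two places where the mapping can go, here is a small sketch. It uses the built-in `mtcars` data as a stand-in for `ecom` so that it runs without a download; the object names `p_global`, `p_layer` and `same_points` are only for illustration.

```r
library(ggplot2)

# Mapping declared in ggplot(): inherited by every layer added to the plot.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Mapping declared in geom_point(): applies to that layer only.
p_layer <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Both specifications place the points at the same coordinates.
same_points <- all.equal(
  layer_data(p_global)[, c("x", "y")],
  layer_data(p_layer)[, c("x", "y")]
)
```

Declaring the mapping in `ggplot()` is usually more convenient when several layers share the same variables.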

#### Regression Line

A regression line can be fit using either:

- `geom_abline()`
- `geom_smooth()`

If you are using `geom_abline()`, you need to specify the intercept and slope
as shown in the example below:

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = 37.285, slope = -5.344)
```
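The intercept and slope used above agree, to three decimal places, with the coefficients of a linear model of `mpg` on `wt`. Rather than typing the numbers by hand, one option is to fit the model with `lm()` and pass its coefficients along. This is a sketch; `fit`, `coefs` and `p` are illustrative names.

```r
library(ggplot2)

# Fit a simple linear regression of mpg on wt.
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)  # named vector with elements "(Intercept)" and "wt"

# Pass the fitted coefficients to geom_abline() instead of hard-coding them.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["wt"]])
```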

If you are using `geom_smooth()`, you need to specify the method of fitting the
line, such as `lm` or `loess`. The `se` argument controls whether the
confidence interval is displayed; it defaults to `TRUE`.

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'lm', se = TRUE)
```

`## `geom_smooth()` using formula 'y ~ x'`

#### Loess Method

Here we use the `'loess'` method, which fits a locally weighted smooth curve
rather than a single straight line.

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'loess', se = FALSE)
```

`## `geom_smooth()` using formula 'y ~ x'`
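The two methods are easier to compare when both are drawn over the points in one plot. The sketch below does that; passing `formula = y ~ x` explicitly should also suppress the message shown above. The colors are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Straight-line fit, drawn with its confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "blue") +
  # Loess fit without the band, for comparison.
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red")
```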

#### Horizontal/Vertical Lines

Add horizontal or vertical lines using:

- `geom_hline()`
- `geom_vline()`

#### Horizontal Line

To add a horizontal line, the Y axis intercept must be supplied using the
`yintercept` argument.

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = 30)
```

#### Vertical Line

For the vertical line, the X axis intercept must be supplied using the
`xintercept` argument.

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_vline(xintercept = 5)
```
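The intercepts do not have to be fixed numbers. As a sketch, the reference lines below are placed at the mean of each variable, which splits the plot into four quadrants; the dashed line type is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Horizontal line at the mean of mpg.
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed") +
  # Vertical line at the mean of wt.
  geom_vline(xintercept = mean(mtcars$wt), linetype = "dashed")
```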

### Bar Plot

Bar plots present grouped data with rectangular bars. The bars may represent the frequency of the groups or values. Bar plots can be:

- horizontal
- vertical
- grouped
- stacked
- proportional

#### Frequency

```
ggplot(ecom, aes(x = factor(device))) +
  geom_bar()
```

#### Weight

If the bars should represent a continuous variable, map it to the `weight`
aesthetic within `aes()`. In the example below, the bars do not represent the
count of devices; instead, they represent the total order value for each device
type.

```
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(weight = order_value))
```
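To see what `weight` changes, it can help to inspect the values ggplot2 computes for the bars. The sketch below uses a small hypothetical data frame, `orders`, in place of `ecom` so that the numbers are easy to check by hand.

```r
library(ggplot2)

# Hypothetical data: one row per visit, with the value of any order placed.
orders <- data.frame(
  device      = c("laptop", "laptop", "mobile", "mobile", "mobile", "tablet"),
  order_value = c(100, 250, 40, 0, 60, 500)
)

# Without weight, each bar is the number of rows per device.
p_count <- ggplot(orders, aes(x = device)) +
  geom_bar()

# With weight, each bar is the sum of order_value per device.
p_weight <- ggplot(orders, aes(x = device)) +
  geom_bar(aes(weight = order_value))

bar_counts  <- layer_data(p_count)$count   # rows per device
bar_weights <- layer_data(p_weight)$count  # summed order_value per device
```

Here `bar_counts` holds the row counts for laptop, mobile and tablet, while `bar_weights` holds the corresponding order value totals.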

#### Stacked Bar Plot

To create a stacked bar plot, the `fill` aesthetic must be mapped to a
categorical variable.

```
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase))
```
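Grouped and proportional bar plots, listed at the start of this section, can be obtained from the same mapping by changing the `position` argument of `geom_bar()`. The sketch below uses a small hypothetical data frame, `visits`, in place of `ecom`.

```r
library(ggplot2)

# Hypothetical data: device used and whether the visit ended in a purchase.
visits <- data.frame(
  device   = c("laptop", "laptop", "laptop", "mobile", "mobile", "tablet", "tablet"),
  purchase = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Grouped: bars for each purchase value are placed side by side.
p_grouped <- ggplot(visits, aes(x = device, fill = purchase)) +
  geom_bar(position = "dodge")

# Proportional: each stack is rescaled to a total height of 1.
p_prop <- ggplot(visits, aes(x = device, fill = purchase)) +
  geom_bar(position = "fill")
```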

#### Horizontal Bar Plot

A horizontal bar plot can be created by flipping the coordinate axes using the
`coord_flip()` function.

```
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase)) +
  coord_flip()
```

#### Columns

If the data has already been summarized, you can use `geom_col()` instead of
`geom_bar()`. In the example below, we have the total visits for each device
type. Because the data is already summarized, `geom_bar()`, which counts rows
by default, is not appropriate.

```
device <- c('laptop', 'mobile', 'tablet')
visits <- c(30000, 12000, 5000)
traffic <- tibble::tibble(device, visits)
ggplot(traffic, aes(x = device, y = visits)) +
  geom_col(fill = 'blue')
```
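`geom_bar()` can still draw summarized data if its default statistic is overridden with `stat = "identity"`. The sketch below suggests that this gives the same bar heights as `geom_col()`; it uses a plain `data.frame` so that it only depends on ggplot2.

```r
library(ggplot2)

traffic <- data.frame(
  device = c("laptop", "mobile", "tablet"),
  visits = c(30000, 12000, 5000)
)

# geom_col() uses the y values as the bar heights.
p_col <- ggplot(traffic, aes(x = device, y = visits)) +
  geom_col()

# geom_bar() does the same once the counting statistic is switched off.
p_bar <- ggplot(traffic, aes(x = device, y = visits)) +
  geom_bar(stat = "identity")

same_heights <- all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```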

### Box Plot

The box plot is a standardized way of displaying the distribution of data
based on the five number summary: minimum, first quartile, median, third
quartile, and maximum. Box plots are useful for detecting outliers and for
comparing distributions. They show the shape, central tendency and variability
of the data. Use `geom_boxplot()` to create a box plot.

```
ggplot(ecom, aes(x = factor(device), y = n_pages)) +
  geom_boxplot()
```
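The numbers behind each box can be inspected with `layer_data()`. The sketch below uses the built-in `mtcars` data as a stand-in for `ecom`. Note that, by default, the whiskers of `geom_boxplot()` extend only to the most extreme points within 1.5 times the interquartile range of the box, so `ymin` and `ymax` may differ from the true minimum and maximum; points beyond the whiskers are drawn individually.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Statistics ggplot2 computed for each box (one row per cylinder group):
# lower whisker, first quartile, median, third quartile, upper whisker.
box_stats <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]

# Medians computed directly, for comparison with the `middle` column.
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
```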

### Histogram

A histogram is a plot that can be used to examine the shape and spread of
continuous data. It looks very similar to a bar graph and can be used to detect
outliers and skewness in data. Use `geom_histogram()` to create a histogram.

```
ggplot(ecom, aes(x = duration)) +
  geom_histogram()
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

You can control the number of bins using the `bins` argument.

```
ggplot(ecom, aes(x = duration)) +
  geom_histogram(bins = 5)
```
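As the message above suggests, the width of the bins can be set directly with `binwidth` instead of choosing their number. The sketch below uses simulated durations (a hypothetical stand-in for `ecom$duration`) and a bin width of 100 seconds.

```r
library(ggplot2)

# Hypothetical durations in seconds, standing in for ecom$duration.
set.seed(42)
visits <- data.frame(duration = runif(200, min = 10, max = 1000))

# Each bin spans 100 seconds, whatever the range of the data.
p <- ggplot(visits, aes(x = duration)) +
  geom_histogram(binwidth = 100)

# One row per bin, with its limits (xmin, xmax) and its count.
bins <- layer_data(p)
```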

### Line

Line charts are used to examine trends over time. We will use a different data set for exploring line plots.

#### Data

`gdp <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/gdp.csv')`

`## Warning: Missing column names filled in: 'X1' [1]`

`gdp`

```
## # A tibble: 6 x 6
## X1 X year growth india china
## <dbl> <dbl> <date> <dbl> <dbl> <dbl>
## 1 1 1 2000-01-01 6 5 8
## 2 2 2 2001-01-01 9 9 5
## 3 3 3 2002-01-01 8 8 6
## 4 4 4 2003-01-01 9 8 8
## 5 5 5 2004-01-01 9 5 9
## 6 6 6 2005-01-01 8 7 8
```

Use `geom_line()` to create a line chart. In the plot below, we chart the GDP
of India, the fastest growing economy in emerging markets, across years.

```
ggplot(gdp, aes(year, india)) +
  geom_line()
```

The color and line type can be modified using the `color` and `linetype`
arguments. We will explore the different line types in an upcoming tutorial.

```
ggplot(gdp, aes(year, india)) +
  geom_line(color = 'blue', linetype = 'dashed')
```
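Setting `color` and `linetype` per layer is one way to tell two series apart in the same chart. The sketch below rebuilds the `india` and `china` columns from the tibble printed earlier as a plain `data.frame`, so that it runs without a download, and draws one `geom_line()` layer per country; the axis label is an arbitrary choice.

```r
library(ggplot2)

# Values copied from the gdp tibble printed earlier in this section.
gdp <- data.frame(
  year  = as.Date(paste0(2000:2005, "-01-01")),
  india = c(5, 9, 8, 8, 5, 7),
  china = c(8, 5, 6, 8, 9, 8)
)

# x is shared by both layers; each layer maps its own y variable.
p <- ggplot(gdp, aes(x = year)) +
  geom_line(aes(y = india), color = "blue", linetype = "dashed") +
  geom_line(aes(y = china), color = "red", linetype = "solid") +
  labs(y = "value")
```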

### Label

You can label the points using `geom_label()`.

```
ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_label()
```

### Text

`geom_text()` offers another way to add text to the plots. We will learn to
modify the appearance and location of the text in another tutorial.

```
ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_text(check_overlap = TRUE, size = 2)
```
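Text is often combined with the points it describes. The sketch below first copies the row names into a regular column (the names `car_data` and `model` are illustrative), then draws the points and places each label slightly above its point using `vjust`.

```r
library(ggplot2)

# Copy the row names into a column so they can be mapped like any other variable.
car_data <- mtcars
car_data$model <- rownames(mtcars)

p <- ggplot(car_data, aes(x = disp, y = mpg, label = model)) +
  geom_point() +
  # A negative vjust shifts each label upwards, away from its point.
  geom_text(check_overlap = TRUE, size = 2, vjust = -0.8)
```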

### Summary

In this tutorial, we learnt about different `geoms` such as:

- `geom_point()`
- `geom_line()`
- `geom_histogram()`
- `geom_bar()`
- `geom_boxplot()`
- `geom_abline()`
- `geom_text()`

### Up Next

In the next tutorial, we will learn about aesthetics.