Draw a Box Plot for One Column in R


Plot One Variable: Frequency Graph, Density Distribution and More

To visualize one variable, the type of graph to use depends on the type of the variable:

For categorical variables (or grouping variables), you can visualize the count of categories using a bar plot, or use a pie chart to show the proportion of each category.
For continuous variables, you can visualize the distribution of the variable using density plots, histograms and alternatives.

In this R graphics tutorial, you'll learn how to:

Visualize the frequency distribution of a categorical variable using bar plots, dot charts and pie charts
Visualize the distribution of a continuous variable using:
- density and histogram plots,
- other alternatives, such as frequency polygons, area plots, dot plots, box plots, the empirical cumulative distribution function (ECDF) and quantile-quantile plots (QQ plots).
- Density ridgeline plots, which are useful for visualizing changes in the distribution of a continuous variable over time or space.
- Bar plots and modern alternatives, including lollipop charts and Cleveland's dot plots.

Contents:

Prerequisites
One categorical variable
- Bar plot of counts
- Pie charts
- Dot charts
One continuous variable
- Data format
- Basic plots
- Density plots
- Histogram plots
- Alternative to density and histogram plots
- Density ridgeline plots
- Bar plot and modern alternatives
Conclusion
References

Prerequisites

Load the required packages and set the theme function theme_pubr() [in ggpubr] as the default theme:

    library(ggplot2)
    library(ggpubr)
    theme_set(theme_pubr())

One categorical variable

Bar plot of counts

Plot types: Bar plot of the count of group levels
Key function: geom_bar()
Key arguments: alpha, color, fill, linetype and size

Demo data set: diamonds [in ggplot2]. Contains the prices and other attributes of almost 54000 diamonds. The column cut contains the quality of the diamond cut (Fair, Good, Very Good, Premium, Ideal).

The R code below creates a bar plot visualizing the number of elements in each category of diamond cut.

    ggplot(diamonds, aes(cut)) +
      geom_bar(fill = "#0073C2FF") +
      theme_pubclean()

Compute the frequency of each category and add the labels on the bar plot:

The dplyr package is used to summarise the data.
geom_bar() with the option stat = "identity" is used to create the bar plot of the summary output as it is.
geom_text() is used to add text labels. Adjust the position of the labels using hjust (horizontal justification) and vjust (vertical justification). Values are typically in [0, 1]; a value outside this range, such as the vjust = -0.3 used below, pushes the label beyond its anchor point (here, just above the bar).

    # Compute the frequency
    library(dplyr)
    df <- diamonds %>%
      group_by(cut) %>%
      summarise(counts = n())
    df

    ## # A tibble: 5 x 2
    ##         cut counts
    ## 1      Fair   1610
    ## 2      Good   4906
    ## 3 Very Good  12082
    ## 4   Premium  13791
    ## 5     Ideal  21551

    # Create the bar plot. Use theme_pubclean() [in ggpubr]
    ggplot(df, aes(x = cut, y = counts)) +
      geom_bar(fill = "#0073C2FF", stat = "identity") +
      geom_text(aes(label = counts), vjust = -0.3) +
      theme_pubclean()
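The counting step can be written more compactly. The sketch below is one possible alternative, not part of the original tutorial: it uses dplyr's count() and, for comparison, base R's table(). The object name df_counts is chosen here for illustration only.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# count() is shorthand for group_by() followed by summarise(n())
df_counts <- diamonds %>% count(cut, name = "counts")
df_counts

# Base R equivalent: a table of frequencies per level of cut
table(diamonds$cut)
```

Either form gives one row (or entry) per cut level, which can then be passed to geom_bar(stat = "identity") as before.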

Pie charts

A pie chart is just a stacked bar chart in polar coordinates.

First,

Arrange the grouping variable (cut) in descending order. This is important for computing the y coordinates of the labels.
compute the proportion (counts/total) of each category
compute the position of the text labels as the cumulative sum of the proportions. To put the labels in the center of the slices, we'll use cumsum(prop) - 0.5*prop as the label position.

    df <- df %>%
      arrange(desc(cut)) %>%
      mutate(prop = round(counts*100/sum(counts), 1),
             lab.ypos = cumsum(prop) - 0.5*prop)
    head(df, 4)

    ## # A tibble: 4 x 4
    ##         cut counts  prop lab.ypos
    ## 1     Ideal  21551  40.0     20.0
    ## 2   Premium  13791  25.6     52.8
    ## 3 Very Good  12082  22.4     76.8
    ## 4      Good   4906   9.1     92.5
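To see why cumsum(prop) - 0.5*prop lands in the middle of each slice, here is a small worked sketch in base R. The proportions are made-up round numbers, used only to keep the arithmetic easy to follow.

```r
# Hypothetical proportions (in percent), already in stacking order
prop <- c(40, 25, 20, 15)

# Upper edge of each slice along the stacked axis
upper <- cumsum(prop)

# Midpoint of each slice: upper edge minus half the slice width
lab.ypos <- upper - 0.5 * prop
lab.ypos
```

The first slice spans 0 to 40, so its label sits at 20; the second spans 40 to 65, so its label sits at 52.5, and so on.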

Create the pie chart using ggplot2 verbs. Key function: coord_polar().

    ggplot(df, aes(x = "", y = prop, fill = cut)) +
      geom_bar(width = 1, stat = "identity", color = "white") +
      geom_text(aes(y = lab.ypos, label = prop), color = "white") +
      coord_polar("y", start = 0) +
      ggpubr::fill_palette("jco") +
      theme_void()

An alternative way to easily create a pie chart is the function ggpie() [in ggpubr]:

    ggpie(
      df, x = "prop", label = "prop",
      lab.pos = "in", lab.font = list(color = "white"),
      fill = "cut", color = "white",
      palette = "jco"
    )

Dot charts

A dot chart is an alternative to bar plots. Key functions:

geom_linerange(): creates vertical line segments from ymin to ymax at each x
geom_point(): adds dots
ggpubr::color_palette(): changes the color palette.

    ggplot(df, aes(cut, prop)) +
      geom_linerange(
        aes(x = cut, ymin = 0, ymax = prop),
        color = "lightgray", size = 1.5
        ) +
      geom_point(aes(color = cut), size = 2) +
      ggpubr::color_palette("jco") +
      theme_pubclean()

An easy alternative for creating a dot chart is ggdotchart() [in ggpubr]:

    ggdotchart(
      df, x = "cut", y = "prop",
      color = "cut", size = 3,        # Points color and size
      add = "segment",                # Add line segments
      add.params = list(size = 2),
      palette = "jco",
      ggtheme = theme_pubclean()
    )

One continuous variable

Different types of graphs can be used to visualize the distribution of a continuous variable, including density and histogram plots.

Data format

Create some data (wdata) containing the weights by sex (M for male; F for female):

    set.seed(1234)
    wdata <- data.frame(
      sex = factor(rep(c("F", "M"), each = 200)),
      weight = c(rnorm(200, 55), rnorm(200, 58))
    )
    head(wdata, 4)

    ##   sex weight
    ## 1   F   53.8
    ## 2   F   55.3
    ## 3   F   56.1
    ## 4   F   52.7

Compute the mean weight by sex using the dplyr package. First, the data is grouped by sex and then summarized by computing the mean weight of each group. The operator %>% is used to chain multiple operations:

    library("dplyr")
    mu <- wdata %>%
      group_by(sex) %>%
      summarise(grp.mean = mean(weight))
    mu

    ## # A tibble: 2 x 2
    ##      sex grp.mean
    ## 1      F     54.9
    ## 2      M     58.1
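If dplyr is not available, the same group means can be obtained in base R. The sketch below is an illustrative alternative rather than the tutorial's own method; it rebuilds wdata so that it runs on its own, and the name grp_mean is chosen here for illustration.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# tapply() splits weight by sex and applies mean() to each group
grp_mean <- tapply(wdata$weight, wdata$sex, mean)
grp_mean
```

The result is a named numeric vector with one entry per sex, rather than a tibble.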

Basic plots

We start by creating a plot, named a, that we'll finish in the next section by adding a layer.

    a <- ggplot(wdata, aes(x = weight))

Possible layers include: geom_density() (for density plots) and geom_histogram() (for histogram plots).

Key arguments to customize the plots:

color, size, linetype: change the line color, size and type, respectively
fill: change the area fill color (for bar plots, histograms and density plots)
alpha: create a semi-transparent color.

Density plots

Key function: geom_density().

Create basic density plots. Add a vertical line corresponding to the mean value of the weight variable (geom_vline()):

    # y axis scale = ..density.. (default behaviour)
    a + geom_density() +
      geom_vline(aes(xintercept = mean(weight)),
                 linetype = "dashed", size = 0.6)

    # Change y axis to count instead of density
    a + geom_density(aes(y = ..count..), fill = "lightgray") +
      geom_vline(aes(xintercept = mean(weight)),
                 linetype = "dashed", size = 0.6,
                 color = "#FC4E07")

Change the area fill and line color by groups (sex):

Add vertical mean lines using geom_vline(). Data: mu, which contains the mean values of weights by sex (computed in the previous section).
Change colors manually:
- use scale_color_manual() or scale_colour_manual() for changing line color
- use scale_fill_manual() for changing area fill colors.

    # Change line color by sex
    a + geom_density(aes(color = sex)) +
      scale_color_manual(values = c("#868686FF", "#EFC000FF"))

    # Change fill color by sex and add mean line
    # Use semi-transparent fill: alpha = 0.4
    a + geom_density(aes(fill = sex), alpha = 0.4) +
      geom_vline(aes(xintercept = grp.mean, color = sex),
                 data = mu, linetype = "dashed") +
      scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
      scale_fill_manual(values = c("#868686FF", "#EFC000FF"))

A simple way to create ggplot2-based density plots is ggdensity() [in ggpubr].

    library(ggpubr)
    # Basic density plot with mean line and marginal rug
    ggdensity(wdata, x = "weight",
              fill = "#0073C2FF", color = "#0073C2FF",
              add = "mean", rug = TRUE)

    # Change outline and fill colors by groups ("sex")
    # Use a custom palette
    ggdensity(wdata, x = "weight",
              add = "mean", rug = TRUE,
              color = "sex", fill = "sex",
              palette = c("#0073C2FF", "#FC4E07"))

Histogram plots

An alternative to density plots is the histogram, which represents the distribution of a continuous variable by dividing its range into bins and counting the number of observations in each bin.

Key function: geom_histogram(). The basic usage is quite similar to geom_density().

Create a basic plot. Add a vertical line corresponding to the mean value of the weight variable:

    a + geom_histogram(bins = 30, color = "black", fill = "gray") +
      geom_vline(aes(xintercept = mean(weight)),
                 linetype = "dashed", size = 0.6)

Note that, by default:

geom_histogram() uses 30 bins, which might not be a good default. You can change the number of bins (e.g.: bins = 50) or the bin width (e.g.: binwidth = 0.5).
The y axis corresponds to the count of weight values. To put the density on the y axis instead, specify the argument y = ..density.. in aes().
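Both adjustments can be combined in one call. The sketch below is a minimal illustration that rebuilds the weight data so it runs on its own. It uses after_stat(density), which in ggplot2 3.3.0 and later is the recommended spelling of ..density..; on older versions, ..density.. would be needed instead. The name p is chosen here for illustration.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(weight = c(rnorm(200, 55), rnorm(200, 58)))

# Fix the bin width instead of the number of bins,
# and map density (rather than count) to the y axis
p <- ggplot(wdata, aes(x = weight)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 color = "black", fill = "gray")
p
```

With density on the y axis, the bar areas sum to 1, which makes the histogram directly comparable to a density curve drawn on the same plot.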

Change the area fill and line color by groups (sex):

Add vertical mean lines using geom_vline(). Data: mu, which contains the mean values of weights by sex.
Change colors manually:
- use scale_color_manual() or scale_colour_manual() for changing line color
- use scale_fill_manual() for changing area fill colors.
Adjust the position of histogram bars using the argument position. Allowed values: "identity", "stack", "dodge". Default value is "stack".

    # Change line color by sex
    a + geom_histogram(aes(color = sex), fill = "white",
                       position = "identity") +
      scale_color_manual(values = c("#00AFBB", "#E7B800"))

    # Change fill and outline color manually
    a + geom_histogram(aes(color = sex, fill = sex),
                       alpha = 0.4, position = "identity") +
      scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
      scale_color_manual(values = c("#00AFBB", "#E7B800"))
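The examples above use position = "identity", so the two groups overlap. As a hedged sketch of one of the other allowed values, the code below draws the same data with position = "dodge", which places the bars of each group side by side within every bin. It rebuilds wdata so that it runs on its own; the name p_dodge is illustrative.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# position = "dodge": bars of the two groups sit next to each other
p_dodge <- ggplot(wdata, aes(x = weight)) +
  geom_histogram(aes(fill = sex), bins = 30, position = "dodge") +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))
p_dodge
```

Dodged bars avoid overplotting but become narrow with many bins, so a smaller bins value may read better.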

Combine histogram and density plots:

Plot the histogram with density values on the y axis (instead of count values).
Overlay a density plot with a semi-transparent fill.

    # Histogram with density plot
    a + geom_histogram(aes(y = ..density..),
                       color = "black", fill = "white") +
      geom_density(alpha = 0.2, fill = "#FF6666")

    # Color by groups
    a + geom_histogram(aes(y = ..density.., color = sex),
                       fill = "white",
                       position = "identity") +
      geom_density(aes(color = sex), size = 1) +
      scale_color_manual(values = c("#868686FF", "#EFC000FF"))

A simple way to create ggplot2-based histogram plots is gghistogram() [in ggpubr].

    library(ggpubr)
    # Basic histogram plot with mean line and marginal rug
    gghistogram(wdata, x = "weight", bins = 30,
                fill = "#0073C2FF", color = "#0073C2FF",
                add = "mean", rug = TRUE)

    # Change outline and fill colors by groups ("sex")
    # Use a custom palette
    gghistogram(wdata, x = "weight", bins = 30,
                add = "mean", rug = TRUE,
                color = "sex", fill = "sex",
                palette = c("#0073C2FF", "#FC4E07"))

Alternative to density and histogram plots

Frequency polygon. Very close to histogram plots, but it uses lines instead of bars.
- Key function: geom_freqpoly().
- Key arguments: color, size, linetype: change, respectively, line color, size and type.
Area plots. This is a continuous analog of a stacked bar plot.
- Key function: geom_area().
- Key arguments:
  - color, size, linetype: change, respectively, line color, size and type.
  - fill: change area fill color.

In this section, we'll use the theme theme_pubclean() [in ggpubr]. This is a theme without axis lines, to direct more attention to the data. Type this to use the theme:

    theme_set(theme_pubclean())

Create a basic frequency polygon and basic area plots:

    # Basic frequency polygon
    a + geom_freqpoly(bins = 30)

    # Basic area plots, which can be filled by color
    a + geom_area(stat = "bin", bins = 30,
                  color = "black", fill = "#00AFBB")

Change colors by groups (sex):

    # Frequency polygon:
    # Change line colors and types by groups
    a + geom_freqpoly(aes(color = sex, linetype = sex),
                      bins = 30, size = 1.5) +
      scale_color_manual(values = c("#00AFBB", "#E7B800"))

    # Area plots: change fill colors by sex
    # Create a stacked area plot
    a + geom_area(aes(fill = sex), color = "white",
                  stat = "bin", bins = 30) +
      scale_fill_manual(values = c("#00AFBB", "#E7B800"))

As in histogram plots, the default y value is count. To have density values on the y axis, specify y = ..density.. in aes().

Dot plots. These are another alternative to histograms and density plots for visualizing a continuous variable. Dots are stacked, with each dot representing one observation. The width of a dot corresponds to the bin width.

Key function: geom_dotplot().
Key arguments: alpha, color, fill and dotsize.

Create a dot plot colored by groups (sex):

    a + geom_dotplot(aes(fill = sex), binwidth = 1/4) +
      scale_fill_manual(values = c("#00AFBB", "#E7B800"))

Box plot:
- Create a box plot of one continuous variable: geom_boxplot()
- Add jittered points, where each point corresponds to an individual observation: geom_jitter(). Change the color and the shape of points by groups (sex)

    ggplot(wdata, aes(x = factor(1), y = weight)) +
      geom_boxplot(width = 0.4, fill = "white") +
      geom_jitter(aes(color = sex, shape = sex),
                  width = 0.1, size = 1) +
      scale_color_manual(values = c("#00AFBB", "#E7B800")) +
      labs(x = NULL)   # Remove x axis label
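The numbers behind a box plot of a single column can also be inspected directly. The sketch below is a base R illustration, not part of the original tutorial, using boxplot.stats() on a rebuilt copy of the weight column. Note that boxplot.stats() uses Tukey's hinges, so its values may differ slightly from the quartile-based box that geom_boxplot() draws.

```r
set.seed(1234)
weight <- c(rnorm(200, 55), rnorm(200, 58))

# Five values: lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- boxplot.stats(weight)$stats
stats

# Observations beyond the whiskers (more than 1.5 * IQR from the hinges)
boxplot.stats(weight)$out
```

For a quick base graphics version of the plot itself, boxplot(weight) draws the same summary without ggplot2.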

Empirical cumulative distribution function (ECDF). Provides another alternative visualization of distribution. It reports for any given number the percent of individuals that are below that threshold.

For instance, in the following plot, you can see that:

about 50% of females weigh less than 55
about 50% of males weigh less than 58

    # Another option is geom = "point"
    a + stat_ecdf(aes(color = sex, linetype = sex),
                  geom = "step", size = 1.5) +
      scale_color_manual(values = c("#00AFBB", "#E7B800")) +
      labs(y = "f(weight)")
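Values read off an ECDF plot can be checked numerically. The sketch below is a base R illustration that rebuilds wdata so it runs on its own; it uses ecdf(), which returns a function giving the share of observations at or below a threshold, and quantile() for the inverse question. The names Fn_female and Fn_male are chosen here for illustration.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# ecdf() returns a function: Fn(t) is the share of observations <= t
Fn_female <- ecdf(wdata$weight[wdata$sex == "F"])
Fn_male   <- ecdf(wdata$weight[wdata$sex == "M"])

Fn_female(55)  # share of females weighing at most 55
Fn_male(58)    # share of males weighing at most 58

# Inverse question: below which weight do 25% of females fall?
quantile(wdata$weight[wdata$sex == "F"], probs = 0.25)
```

Because the simulated weights are centered near 55 for females and 58 for males, both shares should come out close to one half.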

Quantile-quantile plots (QQ plots). Used to check whether given data follow a normal distribution.

Key function: stat_qq().
Key arguments: color, shape and size to change point color, shape and size.

Create a QQ plot of weight. Change color by groups (sex):

    # Change point colors by groups
    ggplot(wdata, aes(sample = weight)) +
      stat_qq(aes(color = sex)) +
      scale_color_manual(values = c("#00AFBB", "#E7B800")) +
      labs(y = "Weight")
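A QQ plot is easier to judge against a reference line. As a hedged addition to the example above, the sketch below adds stat_qq_line() [in ggplot2, version 3.0.0 or later], which draws a line through the first and third quartiles of each group. It rebuilds wdata so that it runs on its own; the name p_qq is illustrative.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# Points close to the line suggest approximately normal data
p_qq <- ggplot(wdata, aes(sample = weight, color = sex)) +
  stat_qq() +
  stat_qq_line() +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  labs(y = "Weight")
p_qq
```

Mapping color in the top-level aes() gives each sex its own reference line as well as its own points.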

Alternative plot using the function ggqqplot() [in ggpubr]. The 95% confidence band is shown by default.

    library(ggpubr)
    ggqqplot(wdata, x = "weight",
             color = "sex",
             palette = c("#0073C2FF", "#FC4E07"),
             ggtheme = theme_pubclean())

Density ridgeline plots

The density ridgeline plot is an alternative to the standard geom_density() function that can be useful for visualizing changes in the distribution of a continuous variable over time or space. Ridgeline plots are partially overlapping line plots that create the impression of a mountain range.

This functionality is provided in the R package ggridges (Wilke 2017).

Installation:

                        install.packages("ggridges")

Load and set the default theme to theme_ridges() [in ggridges]:

    library(ggplot2)
    library(ggridges)
    theme_set(theme_ridges())

Example 1: Simple distribution plots by groups. Distribution of Sepal.Length by Species using the iris data set. The grouping variable Species will be mapped to the y axis:

    ggplot(iris, aes(x = Sepal.Length, y = Species)) +
      geom_density_ridges(aes(fill = Species)) +
      scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

You can control the overlap between the different densities using the scale option. Default value is 1. Smaller values create a separation between the curves, and larger values create more overlap.

    ggplot(iris, aes(x = Sepal.Length, y = Species)) +
      geom_density_ridges(scale = 0.9)

Example 2: Visualize temperature data.

Data set: lincoln_weather [in ggridges]. Weather in Lincoln, Nebraska in 2016.
Create density ridge plots of the Mean Temperature by Month and change the fill color according to the temperature value (on the x axis). A gradient color is created using the function geom_density_ridges_gradient()

    ggplot(
      lincoln_weather,
      aes(x = `Mean Temperature [F]`, y = `Month`)
      ) +
      geom_density_ridges_gradient(
        aes(fill = ..x..), scale = 3, size = 0.3
        ) +
      scale_fill_gradientn(
        colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
        name = "Temp. [F]"
        ) +
      labs(title = 'Temperatures in Lincoln NE')

For more examples, type the following R code:

                        browseVignettes("ggridges")

Bar plot and modern alternatives

In this section, we'll describe how to easily create basic and ordered bar plots using ggplot2-based helper functions available in the ggpubr R package. We'll also present some modern alternatives to bar plots, including lollipop charts and Cleveland's dot plots.

Load required packages:

                        library(ggpubr)

Load and prepare data:

    # Load data
    dfm <- mtcars
    # Convert the cyl variable to a factor
    dfm$cyl <- as.factor(dfm$cyl)
    # Add the name column
    dfm$name <- rownames(dfm)
    # Inspect the data
    head(dfm[, c("name", "wt", "mpg", "cyl")])

    ##                                name   wt  mpg cyl
    ## Mazda RX4                 Mazda RX4 2.62 21.0   6
    ## Mazda RX4 Wag         Mazda RX4 Wag 2.88 21.0   6
    ## Datsun 710               Datsun 710 2.32 22.8   4
    ## Hornet 4 Drive       Hornet 4 Drive 3.21 21.4   6
    ## Hornet Sportabout Hornet Sportabout 3.44 18.7   8
    ## Valiant                     Valiant 3.46 18.1   6

Create an ordered bar plot of the mpg variable. Change the fill color by the grouping variable "cyl". Sorting will be done globally, but not by groups.

    ggbarplot(dfm, x = "name", y = "mpg",
              fill = "cyl",               # Change fill color by cyl
              color = "white",            # Set bar border colors to white
              palette = "jco",            # jco journal color palette. See ?ggpar
              sort.val = "desc",          # Sort the values in descending order
              sort.by.groups = FALSE,     # Don't sort inside each group
              x.text.angle = 90,          # Rotate x axis texts vertically
              ggtheme = theme_pubclean()
              ) +
      font("x.text", size = 8, vjust = 0.5)

To sort bars inside each group, use the argument sort.by.groups = TRUE.
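As an illustration of that option, the sketch below sorts the bars in ascending order within each cyl group. It rebuilds dfm so that it runs on its own and uses only ggbarplot() arguments that already appear in this tutorial; the name p_grouped is chosen here for illustration.

```r
library(ggpubr)

# Rebuild the demo data
dfm <- mtcars
dfm$cyl <- as.factor(dfm$cyl)
dfm$name <- rownames(dfm)

p_grouped <- ggbarplot(dfm, x = "name", y = "mpg",
          fill = "cyl",
          color = "white",
          palette = "jco",
          sort.val = "asc",          # Sort the values in ascending order
          sort.by.groups = TRUE,     # Sort inside each cyl group
          x.text.angle = 90,
          ggtheme = theme_pubclean()
          )
p_grouped
```

With grouped sorting, all bars of one cyl level appear together, which makes within-group comparisons easier than in the globally sorted plot.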

Create a lollipop chart:
- Color by groups and set a custom color palette.
- Sort values in ascending order.
- Add segments from y = 0 to dots. Change segment color and size.

    ggdotchart(dfm, x = "name", y = "mpg",
               color = "cyl",                                # Color by groups
               palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
               sorting = "asc", sort.by.groups = TRUE,       # Sort values in ascending order
               add = "segments",                             # Add segments from y = 0 to dots
               add.params = list(color = "lightgray", size = 2), # Change segment color and size
               group = "cyl",
               dot.size = 4,
               ggtheme = theme_pubclean()
               ) +
      font("x.text", size = 8, vjust = 0.5)

Read more: Bar Plots and Modern Alternatives

Conclusion

Create a bar plot of a grouping variable:

    ggplot(diamonds, aes(cut)) +
      geom_bar(fill = "#0073C2FF") +
      theme_minimal()

Visualize a continuous variable:

Start by creating a plot, named a, that we'll finish by adding a layer.

    a <- ggplot(wdata, aes(x = weight))

Possible layers include:

geom_density(): density plot
geom_histogram(): histogram plot
geom_freqpoly(): frequency polygon
geom_area(): area plot
geom_dotplot(): dot plot
stat_ecdf(): empirical cumulative distribution function
stat_qq(): quantile-quantile plot

Key arguments to customize the plots:

color, size, linetype: change the line color, size and type, respectively
fill: change the area fill color (for bar plots, histograms and density plots)
alpha: create a semi-transparent color.

References

leckieallepabuse.blogspot.com

Source: http://sthda.com/english/articles/32-r-graphics-essentials/133-plot-one-variable-frequency-graph-density-distribution-and-more

Leckie Allepabuse