Draw a Box Plot for One Column in R
Plot One Variable: Frequency Graph, Density Distribution and More
To visualize one variable, the type of graph to use depends on the type of the variable:
- For categorical variables (or grouping variables), you can visualize the count of each category using a bar plot, or use a pie chart to show the proportion of each category.
- For continuous variables, you can visualize the distribution of the variable using density plots, histograms and alternatives.
In this R graphics tutorial, you'll learn how to:
- Visualize the frequency distribution of a categorical variable using bar plots, dot charts and pie charts
- Visualize the distribution of a continuous variable using:
  - density and histogram plots,
  - other alternatives, such as frequency polygons, area plots, dot plots, box plots, the empirical cumulative distribution function (ECDF) and quantile-quantile plots (QQ plots).
  - Density ridgeline plots, which are useful for visualizing changes in the distribution of a continuous variable over time or space.
- Bar plots and modern alternatives, including lollipop charts and Cleveland's dot plots.
Contents:
- Prerequisites
- One categorical variable
- Bar plot of counts
- Pie charts
- Dot charts
- One continuous variable
- Data format
- Basic plots
- Density plots
- Histogram plots
- Alternative to density and histogram plots
- Density ridgeline plots
- Bar plot and modern alternatives
- Conclusion
- References
Prerequisites
Load the required packages and set the theme function theme_pubr() [in ggpubr] as the default theme:
library(ggplot2)
library(ggpubr)
theme_set(theme_pubr())
One categorical variable
Bar plot of counts
- Plot types: Bar plot of the count of group levels
- Key function: geom_bar()
- Key arguments: alpha, color, fill, linetype and size
Demo data set: diamonds [in ggplot2]. Contains the prices and other attributes of almost 54000 diamonds. The column cut contains the quality of the diamond cut (Fair, Good, Very Good, Premium, Ideal).
The R code below creates a bar plot visualizing the number of elements in each category of diamonds cut.
ggplot(diamonds, aes(cut)) + geom_bar(fill = "#0073C2FF") + theme_pubclean()
Compute the frequency of each category and add the labels on the bar plot:
- The dplyr package is used to summarise the data.
- geom_bar() with the option stat = "identity" is used to create the bar plot of the summary output as it is.
- geom_text() is used to add text labels. Adjust the position of the labels using hjust (horizontal justification) and vjust (vertical justification). Values are typically in [0, 1]; values outside that range, such as vjust = -0.3, push the label beyond its anchor point (here, just above the bar).
# Compute the frequency
library(dplyr)
df <- diamonds %>%
  group_by(cut) %>%
  summarise(counts = n())
df
## # A tibble: 5 x 2
##   cut       counts
## 1 Fair        1610
## 2 Good        4906
## 3 Very Good  12082
## 4 Premium    13791
## 5 Ideal      21551
# Create the bar plot. Use theme_pubclean() [in ggpubr]
ggplot(df, aes(x = cut, y = counts)) +
  geom_bar(fill = "#0073C2FF", stat = "identity") +
  geom_text(aes(label = counts), vjust = -0.3) +
  theme_pubclean()
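As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below is a minimal illustration and not part of the original tutorial: it rebuilds the summary table by hand from the counts printed above (so it does not need dplyr or the diamonds data) and assumes ggplot2 is installed. The object names df_counts and p are made up for this example.

```r
library(ggplot2)

# Summary table rebuilt by hand from the counts printed above
df_counts <- data.frame(
  cut = factor(
    c("Fair", "Good", "Very Good", "Premium", "Ideal"),
    levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")
  ),
  counts = c(1610, 4906, 12082, 13791, 21551)
)

# geom_col() uses the y values as they are, like geom_bar(stat = "identity")
p <- ggplot(df_counts, aes(x = cut, y = counts)) +
  geom_col(fill = "#0073C2FF") +
  geom_text(aes(label = counts), vjust = -0.3)
p
```

Either spelling should give the same bars; geom_col() simply makes the "heights are already computed" intent explicit.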
Pie charts
A pie chart is just a stacked bar chart in polar coordinates.
First,
- Arrange the grouping variable (cut) in descending order. This is important for computing the y coordinates of the labels.
- Compute the proportion (counts/total) of each category.
- Compute the position of the text labels as the cumulative sum of the proportions. To put the labels in the center of the pie slices, we'll use cumsum(prop) - 0.5*prop as the label position.
df <- df %>%
  arrange(desc(cut)) %>%
  mutate(
    prop = round(counts*100/sum(counts), 1),
    lab.ypos = cumsum(prop) - 0.5*prop
  )
head(df, 4)
## # A tibble: 4 x 4
##   cut       counts  prop lab.ypos
## 1 Ideal      21551  40.0     20.0
## 2 Premium    13791  25.6     52.8
## 3 Very Good  12082  22.4     76.8
## 4 Good        4906   9.1     92.5
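To make the label arithmetic concrete, here is a small base R sketch of the same computation, using the counts shown earlier. It is only an illustration: the names pie_df and upper are made up for this example, and no packages are required.

```r
# Counts from the table above, already arranged in descending order of cut
pie_df <- data.frame(
  cut = c("Ideal", "Premium", "Very Good", "Good", "Fair"),
  counts = c(21551, 13791, 12082, 4906, 1610)
)

# Percentage of each category, rounded to one decimal
pie_df$prop <- round(pie_df$counts * 100 / sum(pie_df$counts), 1)

# Upper edge of each slice on the stacked axis
pie_df$upper <- cumsum(pie_df$prop)

# Label position = upper edge minus half the slice = middle of the slice
pie_df$lab.ypos <- pie_df$upper - 0.5 * pie_df$prop

print(pie_df)
```

For the first slice (Ideal, 40%), the upper edge is 40 and the label sits at 40 - 20 = 20, which is the middle of that slice. Because the percentages are rounded, the upper edge of the last slice comes out as 100.1 rather than exactly 100.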
- Create the pie chart using ggplot2 verbs. Key function: coord_polar().
ggplot(df, aes(x = "", y = prop, fill = cut)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  geom_text(aes(y = lab.ypos, label = prop), color = "white") +
  coord_polar("y", start = 0) +
  ggpubr::fill_palette("jco") +
  theme_void()
- Alternative solution to easily create a pie chart: use the function ggpie() [in ggpubr]:
ggpie(
  df, x = "prop", label = "prop",
  lab.pos = "in", lab.font = list(color = "white"),
  fill = "cut", color = "white",
  palette = "jco"
)
Dot charts
A dot chart is an alternative to bar plots. Key functions:
- geom_linerange(): creates vertical line segments from ymin to ymax at each x
- geom_point(): adds dots
- ggpubr::color_palette(): changes the color palette.
ggplot(df, aes(cut, prop)) +
  geom_linerange(
    aes(x = cut, ymin = 0, ymax = prop),
    color = "lightgray", size = 1.5
  ) +
  geom_point(aes(color = cut), size = 2) +
  ggpubr::color_palette("jco") +
  theme_pubclean()
Easy alternative to create a dot chart: use ggdotchart() [in ggpubr]:
ggdotchart(
  df, x = "cut", y = "prop",
  color = "cut", size = 3,       # Points color and size
  add = "segment",               # Add line segments
  add.params = list(size = 2),
  palette = "jco",
  ggtheme = theme_pubclean()
)
One continuous variable
Different types of graphs can be used to visualize the distribution of a continuous variable, including density and histogram plots.
Data format
Create some data (wdata) containing the weights by sex (M for male; F for female):
set.seed(1234)
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each=200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)
head(wdata, 4)
##   sex weight
## 1   F   53.8
## 2   F   55.3
## 3   F   56.1
## 4   F   52.7
Compute the mean weight by sex using the dplyr package. First, the data is grouped by sex and then summarized by computing the mean weight of each group. The operator %>% is used to combine multiple operations:
library("dplyr")
mu <- wdata %>%
  group_by(sex) %>%
  summarise(grp.mean = mean(weight))
mu
## # A tibble: 2 x 2
##   sex   grp.mean
## 1 F         54.9
## 2 M         58.1
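If dplyr is not available, the same group means can be obtained with base R. The following is a minimal sketch that recreates the simulated data and uses tapply() and aggregate(); the names grp_means and mu_base are made up for this example.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# tapply() splits weight by sex and applies mean() to each group
grp_means <- tapply(wdata$weight, wdata$sex, mean)
grp_means

# aggregate() returns the same information as a data frame
mu_base <- aggregate(weight ~ sex, data = wdata, FUN = mean)
names(mu_base)[2] <- "grp.mean"
mu_base
```

The data frame mu_base has the same columns (sex, grp.mean) as the dplyr result, so it could be passed to geom_vline() in the same way.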
Basic plots
We start by creating a plot, named a, that we'll finish in the next section by adding a layer.
a <- ggplot(wdata, aes(x = weight))
Possible layers include: geom_density()
(for density plots) and geom_histogram()
(for histogram plots).
Key arguments to customize the plots:
- color, size, linetype: change the line color, size and type, respectively
- fill: change the area fill color (for bar plots, histograms and density plots)
- alpha: create a semi-transparent color.
Density plots
Key function: geom_density().
- Create basic density plots. Add a vertical line corresponding to the mean value of the weight variable (geom_vline()):
# y axis scale = ..density.. (default behaviour)
a + geom_density() +
  geom_vline(aes(xintercept = mean(weight)),
             linetype = "dashed", size = 0.6)
# Change y axis to count instead of density
a + geom_density(aes(y = ..count..), fill = "lightgray") +
  geom_vline(aes(xintercept = mean(weight)),
             linetype = "dashed", size = 0.6, color = "#FC4E07")
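The difference between the density and count scales can be checked numerically. The sketch below uses base R's density() rather than ggplot2 and relies on the fact that ggplot2 documents its count variable as density multiplied by the number of points. The variable names are made up for this example, and the areas are simple Riemann-sum approximations.

```r
set.seed(1234)
weight <- c(rnorm(200, 55), rnorm(200, 58))

# Kernel density estimate on a regular grid of x values
d <- density(weight)
dx <- diff(d$x)[1]            # spacing of the grid

# The area under a density curve is approximately 1
area_density <- sum(d$y) * dx

# Multiplying by the number of observations rescales to counts,
# so the area becomes approximately n
count_curve <- d$y * length(weight)
area_count <- sum(count_curve) * dx

c(area_density = area_density, area_count = area_count)
```

The shape of the curve is identical on both scales; only the y axis units change.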
- Change area fill and add line color by groups (sex):
  - Add vertical mean lines using geom_vline(). Data: mu, which contains the mean values of weights by sex (computed in the previous section).
  - Change colors manually:
    - use scale_color_manual() or scale_colour_manual() for changing line color
    - use scale_fill_manual() for changing area fill colors.
# Change line color by sex
a + geom_density(aes(color = sex)) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))
# Change fill color by sex and add mean line
# Use semi-transparent fill: alpha = 0.4
a + geom_density(aes(fill = sex), alpha = 0.4) +
  geom_vline(aes(xintercept = grp.mean, color = sex),
             data = mu, linetype = "dashed") +
  scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
- Simple solution to create a ggplot2-based density plot: use ggdensity() [in ggpubr].
library(ggpubr)
# Basic density plot with mean line and marginal rug
ggdensity(wdata, x = "weight",
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
# Change outline and fill colors by groups ("sex")
# Use a custom palette
ggdensity(wdata, x = "weight",
          add = "mean", rug = TRUE,
          color = "sex", fill = "sex",
          palette = c("#0073C2FF", "#FC4E07"))
Histogram plots
An alternative to density plots is the histogram, which represents the distribution of a continuous variable by dividing its range into bins and counting the number of observations in each bin.
Key function: geom_histogram(). The basic usage is quite similar to geom_density().
- Create a basic plot. Add a vertical line corresponding to the mean value of the weight variable:
a + geom_histogram(bins = 30, color = "black", fill = "gray") +
  geom_vline(aes(xintercept = mean(weight)),
             linetype = "dashed", size = 0.6)
Note that, by default:
- geom_histogram() uses 30 bins, which might not be a good default. You can change the number of bins (e.g.: bins = 50) or the bin width (e.g.: binwidth = 0.5).
- The y axis corresponds to the count of weight values. If you want to have the density on the y axis instead, specify the argument y = ..density.. in aes().
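The relationship between counts, bin width and density can be sketched with base R's hist(). This is only an illustration of the arithmetic: ggplot2 chooses its own bin boundaries, so individual bar heights may differ from the ones computed here. The break points and variable names below are assumptions made for this example.

```r
set.seed(1234)
weight <- c(rnorm(200, 55), rnorm(200, 58))

# Bin the data with a fixed bin width of 0.5 (breaks chosen to cover the range)
binwidth <- 0.5
breaks <- seq(floor(min(weight)), ceiling(max(weight)), by = binwidth)
h <- hist(weight, breaks = breaks, plot = FALSE)

# Count scale: bar heights add up to the number of observations
sum(h$counts)

# Density scale: height = count / (n * binwidth), so bar areas add up to 1
dens <- h$counts / (length(weight) * binwidth)
sum(dens * binwidth)
```

Halving the bin width roughly halves the counts per bar but leaves the density scale comparable, which is why the density scale is the one to use when overlaying a density curve.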
- Change area fill and add line color by groups (sex):
  - Add vertical mean lines using geom_vline(). Data: mu, which contains the mean values of weights by sex.
  - Change colors manually:
    - use scale_color_manual() or scale_colour_manual() for changing line color
    - use scale_fill_manual() for changing area fill colors.
  - Adjust the position of histogram bars by using the argument position. Allowed values: "identity", "stack", "dodge". Default value is "stack".
# Change line color by sex
a + geom_histogram(aes(color = sex), fill = "white",
                   position = "identity") +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))
# Change fill and outline color manually
a + geom_histogram(aes(color = sex, fill = sex),
                   alpha = 0.4, position = "identity") +
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))
- Combine histogram and density plots:
  - Plot the histogram with density values on the y axis (instead of count values).
  - Add a density plot with a semi-transparent fill.
# Histogram with density plot
a + geom_histogram(aes(y = ..density..),
                   color = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666")
# Color by groups
a + geom_histogram(aes(y = ..density.., color = sex),
                   fill = "white", position = "identity") +
  geom_density(aes(color = sex), size = 1) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))
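A note for newer ggplot2 releases: the ..density.. notation and the size argument for lines still work, but from ggplot2 3.4.0 onward they emit deprecation warnings, and after_stat(density) and linewidth are the recommended spellings. The following is a minimal sketch of the combined plot in that newer style, assuming ggplot2 3.4.0 or later is installed. The object names wdata2 and p_hist are made up for this example.

```r
library(ggplot2)

set.seed(1234)
wdata2 <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# after_stat(density) replaces ..density..; linewidth replaces size for lines
p_hist <- ggplot(wdata2, aes(x = weight)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666", linewidth = 0.6)
p_hist
```

The same substitution applies to ..count.. (after_stat(count)) in the density examples.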
- Simple solution to create a ggplot2-based histogram plot: use gghistogram() [in ggpubr].
library(ggpubr)
# Basic histogram plot with mean line and marginal rug
gghistogram(wdata, x = "weight", bins = 30,
            fill = "#0073C2FF", color = "#0073C2FF",
            add = "mean", rug = TRUE)
# Change outline and fill colors by groups ("sex")
# Use a custom palette
gghistogram(wdata, x = "weight", bins = 30,
            add = "mean", rug = TRUE,
            color = "sex", fill = "sex",
            palette = c("#0073C2FF", "#FC4E07"))
Alternative to density and histogram plots
- Frequency polygon. Very close to histogram plots, but it uses lines instead of bars.
  - Key function: geom_freqpoly().
  - Key arguments: color, size, linetype: change, respectively, line color, size and type.
- Area plots. This is a continuous analog of a stacked bar plot.
  - Key function: geom_area().
  - Key arguments:
    - color, size, linetype: change, respectively, line color, size and type.
    - fill: change area fill color.
In this section, we'll use the theme theme_pubclean() [in ggpubr]. This is a theme without axis lines, to direct more attention to the data. Type this to use the theme:
theme_set(theme_pubclean())
- Create a basic frequency polygon and basic area plots:
# Basic frequency polygon
a + geom_freqpoly(bins = 30)
# Basic area plots, which can be filled by color
a + geom_area(stat = "bin", bins = 30,
              color = "black", fill = "#00AFBB")
- Change colors by groups (sex):
# Frequency polygon:
# Change line colors and types by groups
a + geom_freqpoly(aes(color = sex, linetype = sex),
                  bins = 30, size = 1.5) +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))
# Area plots: change fill colors by sex
# Create a stacked area plot
a + geom_area(aes(fill = sex), color = "white",
              stat = "bin", bins = 30) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))
As in histogram plots, the default y value is the count. To have density values on the y axis, specify y = ..density.. in aes().
- Dot plots. Another alternative to histograms and density plots that can be used to visualize a continuous variable. Dots are stacked, with each dot representing one observation. The width of a dot corresponds to the bin width.
  - Key function: geom_dotplot().
  - Key arguments: alpha, color, fill and dotsize.
Create a dot plot colored by groups (sex):
a + geom_dotplot(aes(fill = sex), binwidth = 1/4) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))
- Box plot:
  - Create a box plot of one continuous variable: geom_boxplot()
  - Add jittered points, where each point corresponds to an individual observation: geom_jitter(). Change the color and the shape of points by groups (sex)
ggplot(wdata, aes(x = factor(1), y = weight)) +
  geom_boxplot(width = 0.4, fill = "white") +
  geom_jitter(aes(color = sex, shape = sex),
              width = 0.1, size = 1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  labs(x = NULL)   # Remove x axis label
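To see which numbers a box plot summarizes, the sketch below computes them by hand in base R, following the usual Tukey convention (box from the first to the third quartile, whiskers reaching the most extreme observations within 1.5 times the interquartile range). This is believed to match the default behaviour described in the geom_boxplot() documentation, but treat it as an illustration rather than a reproduction of ggplot2 internals. The variable names are made up for this example.

```r
set.seed(1234)
weight <- c(rnorm(200, 55), rnorm(200, 58))

# Box: first quartile, median and third quartile
q <- quantile(weight, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Whiskers: most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(weight[weight >= q[1] - 1.5 * iqr])
upper_whisker <- max(weight[weight <= q[3] + 1.5 * iqr])

# Observations beyond the whiskers are drawn as individual outlier points
outliers <- weight[weight < lower_whisker | weight > upper_whisker]

c(lower = lower_whisker, q1 = q[1], median = q[2],
  q3 = q[3], upper = upper_whisker)
```

Adding the jittered points on top of the box, as in the plot above, shows the individual observations that these five numbers condense.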
- Empirical cumulative distribution function (ECDF). Provides another alternative visualization of a distribution. For any given value, it reports the percentage of individuals that fall at or below that threshold.
For instance, in the following plot, you can see that:
- about 50% of the females weigh less than about 55
- about 50% of the males weigh less than about 58
# Another option: geom = "point"
a + stat_ecdf(aes(color = sex, linetype = sex),
              geom = "step", size = 1.5) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  labs(y = "f(weight)")
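The values read off an ECDF plot can also be computed directly with base R's ecdf(), which returns a function giving the proportion of observations less than or equal to its argument. The sketch below recreates the simulated data; the names ecdf_f and ecdf_m and the threshold 56 are chosen only for illustration.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# ecdf() returns a function: Fn(x) = proportion of observations <= x
ecdf_f <- ecdf(wdata$weight[wdata$sex == "F"])
ecdf_m <- ecdf(wdata$weight[wdata$sex == "M"])

# Proportion of each group weighing at most 56
c(F = ecdf_f(56), M = ecdf_m(56))

# The ECDF evaluated at the sample median is (about) 0.5
ecdf_f(median(wdata$weight[wdata$sex == "F"]))
```

A curve that lies further to the left, as the female curve does here, corresponds to a group with smaller values overall.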
- Quantile-quantile plot (QQ plot). Used to check whether the data follow a normal distribution.
  - Key function: stat_qq().
  - Key arguments: color, shape and size to change point color, shape and size.
Create a QQ plot of weight. Change color by groups (sex):
# Change point colors by groups
ggplot(wdata, aes(sample = weight)) +
  stat_qq(aes(color = sex)) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  labs(y = "Weight")
Alternative plot using the function ggqqplot()
[in ggpubr]. The 95% confidence band is shown by default.
library(ggpubr)
ggqqplot(wdata, x = "weight",
         color = "sex",
         palette = c("#0073C2FF", "#FC4E07"),
         ggtheme = theme_pubclean())
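What a normal QQ plot draws can be sketched in a few lines of base R: the sorted sample values are plotted against the matching quantiles of a standard normal distribution. The plotting positions below come from ppoints(), which is what base R's qqnorm() uses; whether stat_qq() uses exactly the same positions is an assumption here. The variable names are made up for this example.

```r
set.seed(1234)
weight_f <- rnorm(200, 55)    # simulated weights for one group

# Sample quantiles are simply the sorted observations
sample_q <- sort(weight_f)

# Matching theoretical quantiles of the standard normal distribution
theo_q <- qnorm(ppoints(length(weight_f)))

# For normal data the points fall close to a straight line,
# so the correlation between the two sets of quantiles is close to 1
cor(theo_q, sample_q)
```

Systematic curvature in the plotted points, rather than a straight line, is the visual sign of a departure from normality.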
Density ridgeline plots
The density ridgeline plot is an alternative to the standard geom_density()
function that can be useful for visualizing changes in the distribution of a continuous variable over time or space. Ridgeline plots are partially overlapping line plots that create the impression of a mountain range.
This functionality is provided in the R package ggridges
(Wilke 2017).
- Installation:
install.packages("ggridges")
- Load and set the default theme to
theme_ridges()
[in ggridges]:
library(ggplot2)
library(ggridges)
theme_set(theme_ridges())
- Example 1: Simple distribution plots by groups. Distribution of Sepal.Length by Species using the iris data set. The grouping variable Species will be mapped to the y axis:
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(aes(fill = Species)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
You can control the overlap between the different densities using the scale
option. Default value is 1. Smaller values create a separation between the curves, and larger values create more overlap.
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(scale = 0.9)
- Example 4: Visualize temperature data.
  - Data set: lincoln_weather [in ggridges]. Weather in Lincoln, Nebraska in 2016.
  - Create the density ridge plots of the Mean Temperature by Month and change the fill color according to the temperature value (on the x axis). A gradient color is created using the function geom_density_ridges_gradient()
ggplot(
  lincoln_weather,
  aes(x = `Mean Temperature [F]`, y = `Month`)
) +
  geom_density_ridges_gradient(
    aes(fill = ..x..), scale = 3, size = 0.3
  ) +
  scale_fill_gradientn(
    colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
    name = "Temp. [F]"
  ) +
  labs(title = 'Temperatures in Lincoln NE')
For more examples, type the following R code:
browseVignettes("ggridges")
Bar plot and modern alternatives
In this section, we'll describe how to easily create basic and ordered bar plots using ggplot2-based helper functions available in the ggpubr R package. We'll also present some modern alternatives to bar plots, including lollipop charts and Cleveland's dot plots.
- Load required packages:
library(ggpubr)
- Load and prepare data:
# Load data
dfm <- mtcars
# Convert the cyl variable to a factor
dfm$cyl <- as.factor(dfm$cyl)
# Add the name column
dfm$name <- rownames(dfm)
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "cyl")])
##                                name   wt  mpg cyl
## Mazda RX4                 Mazda RX4 2.62 21.0   6
## Mazda RX4 Wag         Mazda RX4 Wag 2.88 21.0   6
## Datsun 710               Datsun 710 2.32 22.8   4
## Hornet 4 Drive       Hornet 4 Drive 3.21 21.4   6
## Hornet Sportabout Hornet Sportabout 3.44 18.7   8
## Valiant                     Valiant 3.46 18.1   6
- Create an ordered bar plot of the mpg variable. Change the fill color by the grouping variable "cyl". Sorting will be done globally, not by groups.
ggbarplot(dfm, x = "name", y = "mpg",
          fill = "cyl",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palette. see ?ggpar
          sort.val = "asc",           # Sort the values in ascending order
          sort.by.groups = FALSE,     # Don't sort inside each group
          x.text.angle = 90,          # Rotate vertically x axis texts
          ggtheme = theme_pubclean()
          ) +
  font("x.text", size = 8, vjust = 0.5)
To sort bars inside each group, use the argument sort.by.groups = TRUE
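The difference between sorting globally and sorting inside each group can be illustrated with base R's order(). This sketch only shows the two orderings on the data; it does not reproduce how ggbarplot() sorts internally. The names global and by_group are made up for this example.

```r
dfm <- mtcars
dfm$cyl <- as.factor(dfm$cyl)
dfm$name <- rownames(dfm)

# Global sort: order all cars by mpg, ignoring the groups
global <- dfm[order(dfm$mpg), c("name", "mpg", "cyl")]

# Sort inside each group: order by cyl first, then by mpg within cyl
by_group <- dfm[order(dfm$cyl, dfm$mpg), c("name", "mpg", "cyl")]

head(global, 4)
head(by_group, 4)
```

With the global ordering, bars of different cyl colors are interleaved along the x axis; with the grouped ordering, each cyl group forms its own contiguous, sorted block.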
- Create a lollipop chart:
  - Color by groups and set a custom color palette.
  - Sort values in ascending order.
  - Add segments from y = 0 to dots. Change segment color and size.
ggdotchart(dfm, x = "name", y = "mpg",
           color = "cyl",
           palette = c("#00AFBB", "#E7B800", "#FC4E07"),
           sorting = "asc", sort.by.groups = TRUE,
           add = "segments",
           add.params = list(color = "lightgray", size = 2),
           group = "cyl",
           dot.size = 4,
           ggtheme = theme_pubclean()
           ) +
  font("x.text", size = 8, vjust = 0.5)
Read more: Bar Plots and Modern Alternatives
Conclusion
- Create a bar plot of a grouping variable:
ggplot(diamonds, aes(cut)) + geom_bar(fill = "#0073C2FF") + theme_minimal()
- Visualize a continuous variable:
Start by creating a plot, named a, that will be finished by adding a layer.
a <- ggplot(wdata, aes(x = weight))
Possible layers include:
- geom_density(): density plot
- geom_histogram(): histogram plot
- geom_freqpoly(): frequency polygon
- geom_area(): area plot
- geom_dotplot(): dot plot
- stat_ecdf(): empirical cumulative distribution function
- stat_qq(): quantile - quantile plot
Key arguments to customize the plots:
- color, size, linetype: change the line color, size and type, respectively
- fill: change the area fill color (for bar plots, histograms and density plots)
- alpha: create a semi-transparent color.
References
Source: http://sthda.com/english/articles/32-r-graphics-essentials/133-plot-one-variable-frequency-graph-density-distribution-and-more