Chapter 2 Visualizing the iris flower data set | Learn R through examples (2024)

Learning objectives:

  • Basic concepts of R graphics
  • Install and load R packages
  • Basic plots with ggplot2
  • Different ways to visualize the iris flower dataset

2.1 Basic concepts of R graphics

We start with base R graphics. The first important distinction should be made abouthigh- and low-level graphics functions in base R.High-level graphics functions initiate new plots, to which new elements could beadded using the low-level functions. See table below.

High-level graphics functions initiate a new plot.
FunctionDescription
plotGeneric plotting function
boxplotboxplot. Can be applied to multiple columns of a matrix, or use equations boxplot( y ~ x)
histhistogram
qqnormQuantile-quantile (Q-Q) plot to check for normality
curveGraph an arithmetic function
barplotBarplot
mosaicplotUsing mosaics to represent the frequencies of tabulated counts.
heatmapUsing colors to visualize a matrix of numeric values.

In contrast, low-level graphics functions do not wipe out the existing plot;they add elements to it. The functions are listed below:

Low-level graphics functions that add elements to an existing plot.
FunctionDescription
pointsAdd points to a plot
linesAdd lines to a plot
ablineAdd a straight line
segmentsAdd line segments
textAdd text to a plot
legendAdd a legend to a plot
arrowsAdd arrows to a plot

Another distinction about data visualization is between plain, exploratory plots andrefined, annotated ones. Both types are essential.Sometimes we generate many graphics for exploratory data analysis (EDA)to get some sense of what the data looks like. We can achieve this by usingplotting functions with default settings to quickly generate a lot of“plain” plots. R is a very powerful EDA tool.

If we find something interesting about a dataset, we want to generatereally “cool”-looking graphics for papers andpresentations. Often we want to use a plot to convey a message to an audience.Making such plots typically requires a bit more coding, as youhave to customize different parameters. For me, it usually involveslots of Google searches, copy-and-paste of example codes, and then lots of trial-and-error.

One of the open secrets of R programming is that you can start from a plainfigure and refine it step by step. Here isan example using the base R graphics. This produces a basic scatter plot withthe petal length on the x-axis and petal width on the y-axis. Since iris is adata frame, we will use the iris$Petal.Length to refer to the Petal.Lengthcolumn.

PL <- iris$Petal.LengthPW <- iris$Petal.Widthplot(PL, PW)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (1)

To hange the type of symbols:

plot(PL, PW, pch = 2) # pch = 2 means the symbol is triangle

Chapter 2 Visualizing the iris flower data set | Learn R through examples (2)

The pch parameter can take values from 0 to 25. Each value correspondsto a different type of symbol. The commonly used values and point symbolsare shown in Figure 2.1.

Chapter 2 Visualizing the iris flower data set | Learn R through examples (3)

Figure 2.1: Points Symbols

There are many other parameters to the plot function in R. You can get thesefrom the documentation:

? plot

We can also change the color of the data points easily with the col = parameter.

plot(PL, PW, pch = 2, col = "green") # change the symbol color to green

Chapter 2 Visualizing the iris flower data set | Learn R through examples (4)

Next, we can use different symbols for different species. First, extract the species information.

iris$Species
## [1] setosa setosa setosa setosa setosa setosa ## [7] setosa setosa setosa setosa setosa setosa ## [13] setosa setosa setosa setosa setosa setosa ## [19] setosa setosa setosa setosa setosa setosa ## [25] setosa setosa setosa setosa setosa setosa ## [31] setosa setosa setosa setosa setosa setosa ## [37] setosa setosa setosa setosa setosa setosa ## [43] setosa setosa setosa setosa setosa setosa ## [49] setosa setosa versicolor versicolor versicolor versicolor## [55] versicolor versicolor versicolor versicolor versicolor versicolor## [61] versicolor versicolor versicolor versicolor versicolor versicolor## [67] versicolor versicolor versicolor versicolor versicolor versicolor## [73] versicolor versicolor versicolor versicolor versicolor versicolor## [79] versicolor versicolor versicolor versicolor versicolor versicolor## [85] versicolor versicolor versicolor versicolor versicolor versicolor## [91] versicolor versicolor versicolor versicolor versicolor versicolor## [97] versicolor versicolor versicolor versicolor virginica virginica ## [103] virginica virginica virginica virginica virginica virginica ## [109] virginica virginica virginica virginica virginica virginica ## [115] virginica virginica virginica virginica virginica virginica ## [121] virginica virginica virginica virginica virginica virginica ## [127] virginica virginica virginica virginica virginica virginica ## [133] virginica virginica virginica virginica virginica virginica ## [139] virginica virginica virginica virginica virginica virginica ## [145] virginica virginica virginica virginica virginica virginica ## Levels: setosa versicolor virginica

This output shows that the 150 observations are classed into threespecies setosa, versicolor, and virginica.If you are read theiris data from a file, like what we did in Chapter 1,the data type of the Species column is character. We need to convert this column into a factor.

iris$Species <- as.factor(iris$Species)str(iris$Species)
## Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Once convertetd into a factor, each observation is represented by one of the three levels of
the three species setosa, versicolor, and virginica. factors are used tostore categorical variables as levels.To completely convert this factor to numbers for plotting, we use the as.numeric function.Since we do not want to change the data frame, we will define a new variable called speciesID.

speciesID <- as.numeric(iris$Species)speciesID
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3## [149] 3 3

We can assign different markers to different species by letting pch = speciesID.

plot(PL, PW, pch = speciesID, col = "green")

Chapter 2 Visualizing the iris flower data set | Learn R through examples (5)What happens here is that the 150 integers stored in the speciesID factor are usedto alter marker types. The first 50 data points (setosa) are represented by opencircles (pch = 1). The next 50 (versicolor) are represented by triangles (pch = 2), while the last50 (virginica) are in crosses (pch = 3).

Similarily, we can set three different colors for three species.

# assign 3 colors red, green, and blue to 3 species *setosa*, *versicolor*,# and *virginica* respectivelyplot(PL, PW, pch = speciesID, col = speciesID)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (6)

This figure starts to looks nice, as the three species are easily separated bycolor and shape. It seems redundant, but it make it easier for the reader.Some people are even color blind.Let us change the x- and y-labels, andadd a main title.

plot(PL, PW, # x and y pch = speciesID, # symbol type col = speciesID, # color xlab = "Petal length (cm)", # x label ylab = "Petal width (cm)", # y label main = "Petal width vs. length" # title) 

Chapter 2 Visualizing the iris flower data set | Learn R through examples (7)Note that this command spans many lines. Very long lines make it hard to read.The benefit of multiple lines is that we can clearly see each line contain a parameter.When you are typing in the Console window, R knows that you are not done andwill be waiting for the second parenthesis.

Note that the indention is by two space characters and this chunk of code ends with a right parenthesis.It is essential to write your code so that it could be easily understood, or reused by others(or your future self). Seebreif anddetailed style guides. This is like checking thedressing code before going to an event.

A true perfectionist never settles. We notice a strong linear correlation betweenpetal length and width. Let’s add a trend line using abline(), a low level graphics function.

abline(lm(PW ~ PL)) # the order is reversed as we need y ~ x.

Chapter 2 Visualizing the iris flower data set | Learn R through examples (8)The lm(PW ~ PL) generates a linear model (lm) of petal width as a function petallength. y ~ x is formula notation that used in many different situations.This linear regression model is used to plot the trend line.

We calculate the Pearson’s correlation coefficient and mark it to the plot.

PCC <- cor(PW, PL) # Pearson's correlation coefficientPCC <- round(PCC, 2) # round to the 2nd place after decimal point.paste("R =", PCC)
## [1] "R = 0.96"

The paste function glues two strings together. Then we use the text function toplace strings at lower right by specifying the coordinate of (x=5, y=0.5).

text(5, 0.5, paste("R=", PCC)) # add text annotation.legend("topleft", # specify the location of the legend levels(iris$Species), # specify the levels of species pch = 1:3, # specify three symbols used for the three species col = 1:3 # specify three colors for the three species) 

Chapter 2 Visualizing the iris flower data set | Learn R through examples (9)

Figure 2.2: A refined scatter plot using base R graphics.

The last expression adds a legend at the top left using the legend function.While plot is a high-level graphics function that starts a new plot,abline, text, and legend are all low-level functions that can beadded to an existing plot.You should be proud of yourself if you are able to generate this plot.This is how we create complex plots step-by-step with trial-and-error.

Such a refinement process can be time-consuming.But another open secret of coding is that we frequently steal others ideas andcode. The book R Graphics Cookbook includes all kinds of R plots andexample code. Some websites list all sorts of R graphics and example codes that you can use.For example, this website: http://www.r-graph-gallery.com/ containsmore than 200 such examples. Anotherone is available here:: http://bxhorn.com/r-graphics-gallery/.If you know what types of graphs you want, it is very easy to start with thetemplate code and swap out the dataset.

2.2 Scatter plot matrix

While data frames can have a mixture of numbers and characters in differentcolumns, a matrix often only contains numbers. Let’s extract the first 4columns from the data frame iris and convert to a matrix:

ma <- as.matrix(iris[, 1:4]) # convert to matrixcolMeans(ma) # column means for matrix
## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 5.843333 3.057333 3.758000 1.199333
colSums(ma)
## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 876.5 458.6 563.7 179.9

The same thing can be done with rows via rowMeans(x) and rowSums(x).

We can generate a matrix of scatter plot by pairs() function.

pairs(ma)
pairs(ma, col = rainbow(3)[speciesID]) # set colors by species

Chapter 2 Visualizing the iris flower data set | Learn R through examples (10)

Figure 2.3: Scatter plot matrix

Exercise 2.1

Interpret the scatter plot matrix of Figure 2.3 .

Exercise 2.2

Plot a scatter plot matrix for dataset mtcars. Set different point symbols and colors based on the types of cyl.

2.3 Star and segment diagrams

Star plot uses stars to visualize multidimensional data. Radar chart is a useful way to display multivariate observations with an arbitrary number of variables. Each observation is represented as a star-shaped figure with one ray for each variable. For a given observation, the length of each ray is made proportional to the size of that variable. The star plot was firstly used by Georg von Mayr in 1877!

df <- iris[, 1:4]stars(df) # do I see any diamonds?stars(df, key.loc = c(17, 0)) # What does this tell you?

Exercise 2.3

Based on the star plot, what is your overall impression regarding the differences among these 3 species of flowers?

The stars() function can also be used to generate segment diagrams, where each variable is used to generate colorful segments. The sizes of the segments are proportional to the measurements.

stars(df, key.loc = c(20, 0.5), draw.segments = TRUE)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (11)

Figure 2.4: Star plots and segments diagrams.

Exercise 2.4

Produce the segments diagram of the state data (state.x77) and offer some interpretation of South Dakota compared with other states. Hints: Convert the matrix to a data frame using df.state.x77 <- as.data.frame(state.x77), then plot the segments diagram based on the dataset df.state.x77.

2.4 The ggplot2 package is intuitive and powerful

In addition to the graphics functions in base R, there are many other packageswe can use to create plots. The most widely used are lattice and ggplot2.Together with base R graphics,sometimes these are referred to as the three independent paradigms of Rgraphics. The lattice package extends base R graphics and enables the creatingof graphs in multiple facets. The ggplot2 is developed based on a Grammar ofGraphics (hence the “gg”), a modular approach that builds complex graphics byadding layers. Intuitive yet powerful, ggplot2 is becoming increasingly popular.

ggplot2 is a modular, intuitive system for plotting, as we use different functions to refine different aspects of a chart step-by-step:

  • Aesthetic mapping from variables to visual characteristics
  • Geometric shapes (points, lines, boxes, bars, histograms, maps, etc)
  • Scales and statistical transformations (log, reverse, count, etc)
  • coordinate systems
  • facets ( multiple plots)
  • Labels and annotations

Detailed tutorials on ggplot2 can be find here andby its author.

The ggplot2 functions is not included in the base distribution of R.These are available as an additional package, on the CRAN website.They need to be downloaded and installed. One of the main advantages of R is that itis open, and users can contribute their code as packages. If you are usingRStudio, you can choose Tools->Install packages from the main menu, andthen enter the name of the package. If you are using R software, you can installadditional packages, by clicking Packages in the main menu, and select amirror site. All these mirror sites work the same, but some may be faster. Afterchoosing a mirror and clicking “OK,” you can scroll down the long list to findyour package. Alternatively, you can type this command to install packages.

# Install the package. You can also do it through the Packages Tabinstall.packages("ggplot2")

Packages only need to be installed once.But every time you need to use the functions or data in a package,you have to load it from your hard drive into memory.

library(ggplot2) # load the ggplot2 package
# define a canvasggplot(data = iris)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (12)

Figure 2.5: Basic scatter plot using the ggplot2 package.

# map data to x and y coordinatesggplot(data = iris) + aes(x = Petal.Length, y = Petal.Width)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (13)

Figure 2.6: Basic scatter plot using the ggplot2 package.

# add data pointsggplot(data = iris) + aes(x = Petal.Length, y = Petal.Width) + geom_point()

Chapter 2 Visualizing the iris flower data set | Learn R through examples (14)

Figure 2.7: Basic scatter plot using the ggplot2 package.

# change color & symbol typeggplot(data = iris) + aes(x = Petal.Length, y = Petal.Width) + geom_point(aes(color = Species, shape = Species))

Chapter 2 Visualizing the iris flower data set | Learn R through examples (15)

Figure 2.8: Basic scatter plot using the ggplot2 package.

# add trend lineggplot(data = iris) + aes(x = Petal.Length, y = Petal.Width) + geom_point(aes(color = Species, shape = Species)) + geom_smooth(method = lm)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (16)

Figure 2.9: Basic scatter plot using the ggplot2 package.

# add annotation text to a specified location by setting coordinates x = , y =ggplot(data = iris) + aes(x = Petal.Length, y = Petal.Width) + geom_point(aes(color = Species, shape = Species)) + geom_smooth(method = lm) + annotate("text", x = 5, y = 0.5, label = "R=0.96")

Chapter 2 Visualizing the iris flower data set | Learn R through examples (17)

Figure 2.10: Basic scatter plot using the ggplot2 package.

Some ggplot2 commands span multiple lines.The first line defines the plotting space.The ending “+” signifies that another layer ( data points) of plotting is added.Also, the ggplot2 package handles a lot of the details for us.We can easily generate many different types of plots.It is also much easier to generate a plot like Figure 2.2.

ggplot(iris) + aes(x = Petal.Length, y = Petal.Width) + # define space geom_point(aes(color = Species, shape = Species)) + # add points geom_smooth(method = lm) + # add trend line annotate("text", x = 5, y = 0.5, label = "R=0.96") + # annotate with text xlab("Petal length (cm)") + # x-axis labels ylab("Petal width (cm)") + # y-axis labels ggtitle("Correlation between petal length and width") # title
## `geom_smooth()` using formula 'y ~ x'

Chapter 2 Visualizing the iris flower data set | Learn R through examples (18)

As you can see, data visualization using ggplot2 is similar to painting:we first find a blank canvas, paint background, sketch outlines, and then add details.Chapter 2 Visualizing the iris flower data set | Learn R through examples (19)

Exercise 2.5

Create a scatter plot of sepal length vs sepal width, change colors and shapes with species, and add trend line.

2.5 Other types of plots with ggplot2

Box plots can be generated by ggplot.

library(ggplot2)ggplot(data = iris) + aes(x = Species, y = Sepal.Length, color = Species) + geom_boxplot()

Chapter 2 Visualizing the iris flower data set | Learn R through examples (20)

Here we use Species, a categorical variable, as x-coordinate. Essentially, weuse it to define three groups of data. The y-axis is the sepal length,whose distribution we are interested in. The benefit of using ggplot2 is evident as we can easily refine it.

ggplot(data = iris) + aes(x = Species, y = Sepal.Length, color = Species) + geom_boxplot() + geom_jitter(position = position_jitter(0.2))

Chapter 2 Visualizing the iris flower data set | Learn R through examples (21)

Figure 2.11: Box plot with raw data points. This is getting increasingly popular.

On top of the boxplot, we add another layer representing the raw datapoints for each of the species. Since lining up data points on astraight line is hard to see, we jittered the relative x-position within each subspecies randomly.We also color-coded three species simply by adding “color = Species.” Many of the low-levelgraphics details are handled for us by ggplot2 as the legend is generated automatically.

ggplot(data = iris) + aes(x = Petal.Length, fill = Species) + geom_density(alpha = 0.3)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (22)

Figure 2.12: Density plot of petal length, grouped by species.

ggplot(data = iris) + aes(x = Petal.Length, fill = Species) + geom_density(alpha = 0.3) + facet_wrap(~Species, nrow = 3)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (23)

Figure 2.13: Density plot by subgroups using facets.

ggplot(iris) + aes(x = Species, y = Sepal.Width, fill = Species) + stat_summary(geom = "bar", fun = "mean") + stat_summary(geom = "errorbar", fun.data = "mean_se", width = .3)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (24)

Figure 2.14: Dynamite Plot

The bar plot with error bar in 2.14 we generated above is calleddynamite plots for its similarity. “The dynamite plots must die!”, arguedrenowned statistician Rafael Irizarry in his blog.Dynamite plots give very little information; the mean and standard errors just could beprinted out. The outliers and overall distribution is hidden. Many scientists have chosen to use this boxplot with jittered points.

Exercise 2.6

Use boxplot, and density plots to investigate the similarity and differences of petal width of three species in the iris dataset.

2.6 Hierarchical clustering and heat map

Hierarchical clustering summarizes observations into trees representing the overall similarities.

ma <- as.matrix(iris[, 1:4]) # convert to matrixdisMatarix <- dist(ma)plot(hclust(disMatarix))

Chapter 2 Visualizing the iris flower data set | Learn R through examples (25)

We first calculate a distance matrix using the dist() function with the default Euclideandistance method. The distance matrix is then used by the hclust1() function to generate ahierarchical clustering tree with the default complete linkage method, which is then plotted in a nested command.

The 150 samples of flowers are organized in this cluster dendrogram based on their Euclideandistance, which is labeled vertically by the bar to the left side. Highly similar flowers aregrouped together in smaller branches, and their distances can be found according to the verticalposition of the branching point. We are often more interested in looking at the overall structureof the dendrogram. For example, we see two big clusters.

First, each of the flower samples is treated as a cluster. The algorithm joinsthe two most similar clusters based on a distance function. This is performediteratively until there is just a single cluster containing all 150 flowers. Ateach iteration, the distances between clusters are recalculated according to oneof the methods—Single linkage, complete linkage, average linkage, and so on.In the single-linkage method, the distance between two clusters is defined bythe smallest distance among the all possible object pairs. This approach puts‘friends of friends’ into a cluster. On the contrary, the complete linkagemethod defines the distance as the largest distance between object pairs. Itfinds similar clusters. Between these two extremes, there are many options inbetween. The linkage method I found the most robust is the average linkagemethod, which uses the average of all distances. However, the default seems tobe the complete linkage. Thus we need to change that in our final version.

Heat maps with hierarchical clustering are my favorite way of visualizing data matrices. The rows and columns are reorganized based on hierarchical clustering, and the values in the matrix are coded by colors. Heat maps can directly visualize millions of numbers in one plot. The hierarchical trees also show the similarity among rows and columns.

heatmap(ma, scale = "column", RowSideColors = rainbow(3)[iris$Species])

Chapter 2 Visualizing the iris flower data set | Learn R through examples (26)

Scaling is handled by the scale() function, which subtracts the mean from eachcolumn and then divides by the standard division. Afterward, all the columnshave the same mean of approximately 0 and standard deviation of 1. This is alsocalled standardization. The default color scheme codes bigger numbers in yellowand smaller numbers in red. The color bar on the left codes for differentspecies. But we still miss a legend and many other things can be polished.Instead of going down the rabbit hole of adjusting dozens of parameters toheatmap function (and it’s improved version heatmap.2 in the ggplots package), Wewill refine this plot using another R package called pheatmap.

install.packages("pheatmap")
library(pheatmap)ma <- as.matrix(iris[, 1:4]) # convert to matrixrow.names(ma) <- row.names(iris) # assign row names in the matrixpheatmap(ma, scale = "column", clustering_method = "average", # average linkage annotation_row = iris[, 5, drop = FALSE], # the 5th column as color bar show_rownames = FALSE)

Chapter 2 Visualizing the iris flower data set | Learn R through examples (27)

Figure 2.15: Heatmap for iris flower dataset.

First, we convert the first 4 columns of the iris data frame into a matrix. Thenthe row names are assigned to be the same, namely, “1” to “150.” This isrequired because row names are used to match with the column annotationinformation, specified by the annotation_row parameter. Even though we onlyneed the 5th column, i.e., Species, this has to be a data frame. To prevent Rfrom automatically converting a one-column data frame into a vector, we useddrop = FALSE option. Multiple columns can be contained in the columnannotation data frame to display multiple color bars. The rows could beannotated the same way.

More information about the pheatmap function can be obtained by reading the helpdocument. But most of the times, I rely on the online tutorials. Beyond theofficial documents prepared by the author, there are many documents created by Rusers across the world. The R user community is uniquely open and supportive. Iwas researching heatmap.2, a more refined version of heatmap part of the gplotspackage and landed on Dave Tang’sblog, whichmentioned that there is a more user-friendly package called pheatmap describedin his otherblog.To figure out the code chuck above, I tried several times and also used KamilSlowikowski’s blog.

We can gain many insights from Figure 2.15. The 150 flowers in the rows are organized into different clusters. I. Setosa samples obviously formed a unique cluster, characterized by smaller (blue) petal length, petal width, and sepal length. The other two subspecies are not clearly separated but we can notice that some I. Virginica samples form a small subcluster showing bigger petals. The columns are also organized into dendrograms, which clearly suggest that petal length and petal width are highly correlated.

2.7 Principal component analysis (PCA)

This section can be skipped, as it contains more statistics than R programming.To visualize high-dimensional data, we use PCA to map data to lower dimensions.

PCA is a linear dimension-reduction method. As illustrated in Figure 2.16,it tries to define a new set of orthogonal coordinates to represent the data such thatthe new coordinates can be ranked by the amount of variation or information it capturesin the dataset. After running PCA, you get many pieces of information:

  • How the new coordinates are defined,
  • The percentage of variances captured by each of the new coordinates,
  • A representation of all the data points onto the new coordinates.

Chapter 2 Visualizing the iris flower data set | Learn R through examples (28)

Figure 2.16: Concept of PCA. Here the first component x’ gives a relatively accurate representation of the data.

Here is an example of running PCA on the first 4 columns of the iris data.Note that “scale = TRUE” in the followingcommand means that the data is normalized before conduction PCA so that eachvariable has unit variance.

pca <- prcomp(iris[, 1:4], scale = TRUE)pca # Have a look at the results.
## Standard deviations (1, .., p=4):## [1] 1.7083611 0.9560494 0.3830886 0.1439265## ## Rotation (n x k) = (4 x 4):## PC1 PC2 PC3 PC4## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971

The first principal component is positively correlated with Sepal length, petal length, and petal width. Recall that these three variables are highly correlated. Sepal width is the variable that is almost the same across three species with small standard deviation. PC2 is mostly determined by sepal width, less so by sepal length.

plot(pca) # plot the amount of variance each principal components captures.

Chapter 2 Visualizing the iris flower data set | Learn R through examples (29)

str(pca) # this shows the structure of the object, listing all parts.
## List of 5## $ sdev : num [1:4] 1.708 0.956 0.383 0.144## $ rotation: num [1:4, 1:4] 0.521 -0.269 0.58 0.565 -0.377 ...## ..- attr(*, "dimnames")=List of 2## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"## $ center : Named num [1:4] 5.84 3.06 3.76 1.2## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"## $ scale : Named num [1:4] 0.828 0.436 1.765 0.762## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"## $ x : num [1:150, 1:4] -2.26 -2.07 -2.36 -2.29 -2.38 ...## ..- attr(*, "dimnames")=List of 2## .. ..$ : NULL## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"## - attr(*, "class")= chr "prcomp"
head(pca$x) # the new coordinate values for each of the 150 samples
## PC1 PC2 PC3 PC4## [1,] -2.257141 -0.4784238 0.12727962 0.024087508## [2,] -2.074013 0.6718827 0.23382552 0.102662845## [3,] -2.356335 0.3407664 -0.04405390 0.028282305## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116

To plot the PCA results, we first construct a data frame with all information, as required by ggplot2.

pcaData <- as.data.frame(pca$x[, 1:2]) # extract first two columns and convert to data framepcaData <- cbind(pcaData, iris$Species) # bind by columnscolnames(pcaData) <- c("PC1", "PC2", "Species") # change column nameslibrary(ggplot2)ggplot(pcaData) + aes(PC1, PC2, color = Species, shape = Species) + # define plot area geom_point(size = 2) # adding data points

Now we have a basic plot. We can add elements one by one using the “+”sign at the end of the first line. We will add details to this plot.

percentVar <- round(100 * summary(pca)$importance[2, 1:2], 0) # compute % variancesggplot(pcaData, aes(PC1, PC2, color = Species, shape = Species)) + # starting ggplot2 geom_point(size = 2) + # add data points xlab(paste0("PC1: ", percentVar[1], "% variance")) + # x label ylab(paste0("PC2: ", percentVar[2], "% variance")) + # y label ggtitle("Principal component analysis (PCA)") + # title theme(aspect.ratio = 1) # width and height ratio

Chapter 2 Visualizing the iris flower data set | Learn R through examples (30)

Figure 2.17: PCA plot of the iris flower dataset using R base graphics (left) and ggplot2 (right).

The result (Figure 2.17) is a projection of the 4-dimensionaliris flowering data on 2-dimensional space using the first two principal components.We can see that the first principal component alone is useful in distinguishing the three species.We could use simple rules like this: If PC1 < -1, then Iris setosa.If PC1 > 1.5 then Iris virginica. If -1 < PC1 < 1, then Iris versicolor.

Exercise 2.7

Create PCA plot of the state.x77 data set (convert matrix to data frame).Use the state.region information to color code the states. Interpret your results.Hint: do not forget normalization using the scale option.

2.8 Classification by predicting the odds of binary outcomes

This section can be skipped, as it contains more statistics than R programming.It is easy to distinguish I. setosa from the other two species, just based onpetal length alone. Here we focus on building a predictive model that canpredict between I. versicolor and I. virginica. For this purpose, we use the logisticregression to model the odds ratio of being I. virginica as a function of allof the 4 measurements:

\[ln(odds)=ln(\frac{p}{1-p}) =a×Sepal.Length + b×Sepal.Width + c×Petal.Length + d×Petal.Width+c+e.\]

df <- iris[51:150, ] # removes the first 50 samples, which represent I. setosadf <- droplevels(df) # removes setosa, an empty levels of species.model <- glm(Species ~ ., # Model: Species as a function of other variables family = binomial(link = "logit"), data = df)summary(model)
## ## Call:## glm(formula = Species ~ ., family = binomial(link = "logit"), ## data = df)## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.01105 -0.00541 -0.00001 0.00677 1.78065 ## ## Coefficients:## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -42.638 25.707 -1.659 0.0972 .## Sepal.Length -2.465 2.394 -1.030 0.3032 ## Sepal.Width -6.681 4.480 -1.491 0.1359 ## Petal.Length 9.429 4.737 1.991 0.0465 *## Petal.Width 18.286 9.743 1.877 0.0605 .## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## (Dispersion parameter for binomial family taken to be 1)## ## Null deviance: 138.629 on 99 degrees of freedom## Residual deviance: 11.899 on 95 degrees of freedom## AIC: 21.899## ## Number of Fisher Scoring iterations: 10

Sepal length and width are not useful in distinguishing versicolor fromvirginica. The most significant (P=0.0465) factor is Petal.Length. One unitincrease in petal length will increase the log-odds of being virginica by9.429. A marginally significant effect is found for Petal.Width.

If you do not fully understand the mathematics behind linear regression orlogistic regression, do not worry about it too much. In this class, Ijust want to show you how to do these analyses in R and interpret the results. Ido not understand how computers work. Yet I use it every day.

The best way to learn R is to use it. After the first two chapters, it is entirelypossible to start working on a your own dataset.If you do not have a dataset, you can find one from sourcessuch as TidyTuesday.
You do not need to finish the rest of this book.Set a goal or a research question. Feel free to search forand steal some example code.

Exercise 2.8

So far, we used a variety of techniques to investigate the iris flower dataset. Recall that in the very beginning, I asked you to eyeball the data and answer two questions:

  1. What distinguishes these three species?
  2. If we have a flower with sepals of 6.5cm long and 3.0cm wide, petals of 6.2cm long, and 2.2cm wide, which species does it most likely belong to?

Review all the analysis we did, examine the raw data, and answer the above questions. Write a paragraph and provide evidence of your thinking. Do more analysis if needed.

References:1 Beckerman, A. (2017). Getting started with r second edition. New York, NY, Oxford University Press.

Chapter 2 Visualizing the iris flower data set | Learn R through examples (2024)
Top Articles
Latest Posts
Article information

Author: Greg O'Connell

Last Updated:

Views: 5599

Rating: 4.1 / 5 (42 voted)

Reviews: 89% of readers found this page helpful

Author information

Name: Greg O'Connell

Birthday: 1992-01-10

Address: Suite 517 2436 Jefferey Pass, Shanitaside, UT 27519

Phone: +2614651609714

Job: Education Developer

Hobby: Cooking, Gambling, Pottery, Shooting, Baseball, Singing, Snowboarding

Introduction: My name is Greg O'Connell, I am a delightful, colorful, talented, kind, lively, modern, tender person who loves writing and wants to share my knowledge and understanding with you.