Data manipulations using the dplyr package
1. Examine the structure of the iris data set. How many observations
and variables are in the data set?
library(tidyverse)
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("iris")
view(iris)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
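The row and column counts reported by glimpse() can also be read off with base R, which is handy when only the numbers are needed. A minimal sketch; the names iris_dims, n_obs, and n_vars are illustrative and not part of the original exercise:

```r
# Base R alternatives to glimpse() for checking the size of a data frame
data("iris")

iris_dims <- dim(iris)  # integer vector: c(number of rows, number of columns)
n_obs <- nrow(iris)     # observations (rows)
n_vars <- ncol(iris)    # variables (columns)

print(iris_dims)
```

These should agree with the "Rows" and "Columns" lines printed by glimpse() above.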
2. Create a new data frame iris1 that contains only the species
virginica and versicolor with sepal lengths greater than 6 cm and sepal
widths greater than 2.5 cm. How many observations and variables are in
the data set?
iris1 <- iris %>%
  filter(Species == "virginica" | Species == "versicolor", Sepal.Length > 6, Sepal.Width > 2.5)
glimpse(iris1)
## Rows: 56
## Columns: 5
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
## $ Petal.Length <dbl> 4.7, 4.5, 4.9, 4.6, 4.7, 4.6, 4.7, 4.4, 4.0, 4.7, 4.3, 4.…
## $ Petal.Width <dbl> 1.4, 1.5, 1.5, 1.5, 1.6, 1.3, 1.4, 1.4, 1.3, 1.2, 1.3, 1.…
## $ Species <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
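When a filter tests one column against several allowed values, the %in% operator is a common alternative to chaining == with |. The sketch below builds the filtered data both ways and checks that they match; iris1_or and iris1_in are illustrative names, not part of the original exercise:

```r
library(dplyr)
data("iris")

# Species condition written with | (the form used above)
iris1_or <- iris %>%
  filter(Species == "virginica" | Species == "versicolor",
         Sepal.Length > 6, Sepal.Width > 2.5)

# Same condition written with %in%
iris1_in <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5)

# Both versions should keep exactly the same rows
print(nrow(iris1_in))
print(all.equal(iris1_or, iris1_in))
```

Conditions separated by commas inside filter() are combined with AND, so each row must satisfy all three conditions.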
3. Now, create an iris2 data frame from iris1 that contains only the
columns for Species, Sepal.Length, and Sepal.Width. How many
observations and variables are in the data set?
iris2 <- iris1 %>%
  select(Species, Sepal.Length, Sepal.Width)
glimpse(iris2)
## Rows: 56
## Columns: 3
## $ Species <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
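Columns that share a prefix can also be picked with a selection helper instead of being named one by one. A sketch using starts_with(); the name iris2_helper is illustrative, and iris1 is rebuilt here so the block runs on its own:

```r
library(dplyr)
data("iris")

# Rebuild iris1 so this example is self-contained
iris1 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5)

# Keep Species plus every column whose name begins with "Sepal"
iris2_helper <- iris1 %>%
  select(Species, starts_with("Sepal"))

print(names(iris2_helper))
```

Listing Species first puts it in the first column, matching the order shown for iris2 above.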
4. Create an iris3 data frame from iris2 that orders the observations
from largest to smallest sepal length. Show the first 6 rows of this
data set.
iris3 <- iris2 %>%
  arrange(desc(Sepal.Length))
head(iris3)
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
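Several rows above tie at a sepal length of 7.7, and their relative order simply follows the order they had before sorting. If a deterministic order within ties is wanted, arrange() accepts additional sort keys. A sketch that breaks ties by sepal width, largest first; the tie-breaker and the name iris3_ties are assumptions added for illustration, not part of the exercise:

```r
library(dplyr)
data("iris")

# Rebuild iris2 so this example is self-contained
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length (descending), then by sepal width (descending) within ties
iris3_ties <- iris2 %>%
  arrange(desc(Sepal.Length), desc(Sepal.Width))

print(head(iris3_ties))
```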
5. Create an iris4 data frame from iris3 that adds a column with the
sepal area (length * width) for each observation. How many
observations and variables are in the data set?
iris4 <- iris3 %>%
  mutate(sepal.area = Sepal.Length * Sepal.Width)
glimpse(iris4)
## Rows: 56
## Columns: 4
## $ Species <fct> virginica, virginica, virginica, virginica, virginica, vi…
## $ Sepal.Length <dbl> 7.9, 7.7, 7.7, 7.7, 7.7, 7.6, 7.4, 7.3, 7.2, 7.2, 7.2, 7.…
## $ Sepal.Width <dbl> 3.8, 3.8, 2.6, 2.8, 3.0, 3.0, 2.8, 2.9, 3.6, 3.2, 3.0, 3.…
## $ sepal.area <dbl> 30.02, 29.26, 20.02, 21.56, 23.10, 22.80, 20.72, 21.17, 2…
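As a quick check on the new column, the first row has a sepal length of 7.9 and a sepal width of 3.8, so its area should be 7.9 * 3.8 = 30.02, which matches the first value printed above. A sketch that rebuilds iris4 from iris and confirms this; the intermediate pipeline is repeated only so the block runs on its own:

```r
library(dplyr)
data("iris")

# Rebuild iris4 from scratch so this example is self-contained
iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(sepal.area = Sepal.Length * Sepal.Width)

# mutate() adds one column and leaves the number of rows unchanged
print(dim(iris4))

# Area of the first (largest sepal length) observation
print(iris4$sepal.area[1])
```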
6. Create iris5 that calculates the average sepal length, the average
sepal width, and the sample size of the entire iris4 data frame and
print iris5.
iris5 <- iris4 %>%
  summarize(meanLength = mean(Sepal.Length), meanWidth = mean(Sepal.Width), Size = n())
print(iris5)
## meanLength meanWidth Size
## 1 6.698214 3.041071 56
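When the same summary function is applied to several columns, across() avoids repeating the call for each one. A sketch assuming a dplyr version that provides across() (1.0.0 or later); the output column names produced by the .names pattern differ from meanLength and meanWidth above, and iris5_across is an illustrative name:

```r
library(dplyr)
data("iris")

# Rebuild iris4 so this example is self-contained
iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(sepal.area = Sepal.Length * Sepal.Width)

# Apply mean() to both sepal columns at once, then add the sample size
iris5_across <- iris4 %>%
  summarize(across(c(Sepal.Length, Sepal.Width), mean, .names = "mean_{.col}"),
            Size = n())

print(iris5_across)
```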
7. Finally, create iris6 that calculates the average sepal length,
the average sepal width, and the sample size for each species in the
iris4 data frame and print iris6.
irisSpecies <- group_by(iris4, Species)
head(irisSpecies)
## # A tibble: 6 × 4
## # Groups: Species [1]
## Species Sepal.Length Sepal.Width sepal.area
## <fct> <dbl> <dbl> <dbl>
## 1 virginica 7.9 3.8 30.0
## 2 virginica 7.7 3.8 29.3
## 3 virginica 7.7 2.6 20.0
## 4 virginica 7.7 2.8 21.6
## 5 virginica 7.7 3 23.1
## 6 virginica 7.6 3 22.8
summarize(irisSpecies, meanLength = mean(Sepal.Length), meanWidth = mean(Sepal.Width), Size = n())
## # A tibble: 2 × 4
## Species meanLength meanWidth Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
iris6 <- iris4 %>%
  group_by(Species) %>%
  summarize(meanLength = mean(Sepal.Length), meanWidth = mean(Sepal.Width), Size = n())
print(iris6)
## # A tibble: 2 × 4
## Species meanLength meanWidth Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
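If only the per-species sample size is needed, count() is a shortcut for group_by() followed by summarize(n = n()). A sketch that should reproduce the Size column above; species_counts is an illustrative name, and iris4 is rebuilt so the block runs on its own:

```r
library(dplyr)
data("iris")

# Rebuild iris4 so this example is self-contained
iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(sepal.area = Sepal.Length * Sepal.Width)

# One row per species that is present, with the count stored in column n
species_counts <- iris4 %>%
  count(Species)

print(species_counts)
```

Setosa does not appear in the result because no setosa rows remain after filtering, even though it is still a level of the Species factor.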
9. Create a ‘longer’ data frame using the original iris data set with
three columns named “Species”, “Measure”, “Value”. The column “Species”
will retain the species names of the data set. The column “Measure” will
include whether the value corresponds to Sepal.Length, Sepal.Width,
Petal.Length, or Petal.Width and the column “Value” will include the
numerical values of those measurements.
irisLong <- iris %>%
  pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
               names_to = "Measure",
               values_to = "Value") %>%
  select(Species, Measure, Value)
print(irisLong)
## # A tibble: 600 × 3
## Species Measure Value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
## 7 setosa Petal.Length 1.4
## 8 setosa Petal.Width 0.2
## 9 setosa Sepal.Length 4.7
## 10 setosa Sepal.Width 3.2
## # ℹ 590 more rows
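The long table has 600 rows because each of the 150 observations contributes one row per measurement (150 * 4 = 600). One reason to reshape this way is that a single grouped summary then covers every measurement at once. A sketch under that assumption; cols = -Species is used here as shorthand for "every column except Species", and measure_means is an illustrative name:

```r
library(dplyr)
library(tidyr)
data("iris")

# Pivot every column except Species into Measure/Value pairs
irisLong <- iris %>%
  pivot_longer(cols = -Species,
               names_to = "Measure",
               values_to = "Value")

# Mean of each measurement within each species
measure_means <- irisLong %>%
  group_by(Species, Measure) %>%
  summarize(meanValue = mean(Value), .groups = "drop")

print(dim(irisLong))
print(measure_means)
```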