How to Use R for Data Science (2024)

This section is designed to kickstart your journey into data science with R through swirl, an R package that offers an interactive learning platform. swirl teaches you R programming and data science interactively, at your own pace, and right in the R console! You get immediate feedback on your progress. If you are new to R, have no fear. swirl will walk you through each of the steps required to use RStudio and R for your purposes.

For those seeking additional or alternative resources beyond swirl, exploring other introductory textbooks and resources on R is highly recommended. Please consider the resources I discuss in Section 1.3. One notable example is Irizarry (2022) who provides a comprehensive and conservative approach to understanding R.

4.1 Set up `swirl`

To install swirl and my learning modules, please follow my instructions precisely!

Open RStudio and type the following into the console:

install.packages("swirl")
library("swirl")
install_course_github("hubchev", "swirl-it")
swirl()

The above four lines of code do the following:

Install the swirl package, ensuring it’s available for use in R.
Load the swirl package, making its functions accessible.
Install my swirl course, which is hosted on GitHub, making its learning modules available within swirl.
Start swirl by entering swirl() into the Console (located at the bottom-left in RStudio) and pressing the Enter key. This begins your interactive learning experience with the package.
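The installation steps above can also be written so that they are safe to run more than once. The following is only a minimal sketch and not part of the swirl course: the helper name `needs_install()` is made up for illustration, and it relies on base R's `requireNamespace()`, which reports whether a package is available without loading it for use. The swirl part is wrapped in `interactive()` so that nothing is downloaded when the code runs as a non-interactive script.

```r
# Hypothetical helper (not part of swirl): TRUE if a package is not yet installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so it never needs installing.
stats_missing <- needs_install("stats")
stats_missing

# Only in an interactive session: install swirl if it is missing, then load it.
if (interactive()) {
  if (needs_install("swirl")) {
    install.packages("swirl")
  }
  library("swirl")
}
```

Checking first avoids downloading the package again every time the setup code is rerun.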

Tip 4.1: If the course has failed to install, you can download the file swirl-it.swc from github.com/hubchev/swirl-it and install the course by loading the swirl package and typing install_course() into the console.

After initiating the swirl environment, follow the instructions displayed in the Console. Specifically, select the swirl-it course and the huber-intro-1 learning module to begin. You can exit swirl at any moment by typing bye() into the Console or pressing the Esc key on your keyboard.

4.2 swirl-it: huber-intro-1


Welcome to this swirl course. If you find any errors or if you have suggestions for improvement, please let me know via stephan.huber@hs-fresenius.de.

The RStudio interface consists of several windows. You can change the size of the windows by dragging the grey bars between the windows. We’ll go through the most important windows now.

Bottom left is the Console window (also called command window/line). Here you can type commands after the > prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.

Top left is the Editor window (also called script window). Here collections of commands (scripts) can be edited and saved. If you do not see this window, you can open it with ‘File’ > ‘New File’ > ‘R Script’.

Just typing a command in the editor window is not enough; it has to be sent to the Console before R executes it. If you want to run a line from the script window (or the whole script), you can click ‘Run’ or press ‘CTRL+ENTER’ to send it to the command window.

The shortcut to send the current line to the console and run it there is _________.

CTRL+SHIFT
CTRL+ENTER
CTRL+SPACE
SHIFT+ENTER

Hint: You find all shortcuts in the menu at Tools > Keyboard Shortcuts Help or by pressing ALT+SHIFT+K. If you are a Mac user, your shortcut is ‘Cmd+Return’ instead of ‘CTRL+ENTER’. To move on, type skip().

Solution

answer: b (CTRL+ENTER)

Top right is the environment window (a.k.a. workspace). Here you can see which data R has in its memory. You can view and edit the values by clicking on them.

Bottom right is the plots / packages / help window. Here you can view plots, install and load packages or use the help function.

The first thing you should do whenever you start RStudio is to check if you are happy with your working directory. That directory is the folder on your computer in which you are currently working. That means, when you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory.

You can check your working directory with the function getwd(). So let’s do that. Type getwd() in the command window.

getwd()

[1] "/home/sthu/Dropbox/hsf/courses/dsr"

Are you happy with that place? If not, you should set your working directory to where all your data and script files are (or will be). Within RStudio you can go to ‘Session’ > ‘Set working directory’ > ‘Choose directory’. Please do this now.

Instead of clicking, you can use the function setwd("/YOURPATH"). For example, setwd("/Users/MYNAME/MYFOLDER") or setwd("C:/Users/jenny/myrstuff"). Make sure that the slashes are forward slashes and that you do not forget the quotation marks around the path. R is case sensitive, so make sure you write capitals where necessary.
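As a small illustration of getwd() and setwd() working together, here is a sketch that changes the working directory and then restores it. The folder name myrstuff and the use of tempdir() are assumptions made only so that the example runs on any machine; in practice you would use your own project folder. file.path() is a base R function that joins path components with forward slashes.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Build a path in a platform-independent way. tempdir() is used here only
# because it exists on every machine; replace it with your own folder.
new_wd <- file.path(tempdir(), "myrstuff")
dir.create(new_wd, showWarnings = FALSE)

# Switch to the new folder and check where R is working now.
setwd(new_wd)
basename(getwd())

# Go back to where we started.
setwd(old_wd)
```

Storing the old directory first makes it easy to undo the change at the end of a script.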

Whenever you want R to do something you need to use a function. It is like a command. All functions of R are organized in so-called packages or libraries. With the standard installation many packages are already installed. However, many more exist and some of them are really cool. For example, with installed.packages() all installed packages are listed. Or, with swirl(), you started swirl.

Of course, you can also go to the Packages window at the bottom right. If the box in front of the package name is ticked, the package is loaded (activated) and can be used. To see via the Console which packages are loaded, type (.packages()) in the console.

(.packages())

[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"

There are many more packages available on the R website. If you want to use a package (for example, the package called geometry) you should first install it. Type install.packages("geometry") in the console. Don’t be alarmed by the many messages. Depending on your PC and your internet connection this may take some time.

install.packages("geometry")

After having installed a package, you need to load the package. That is a bit annoying but essential. Type library("geometry") in the Console. You also did this for the swirl package (otherwise you couldn’t have been doing these exercises).

library("geometry")

Check if the package is loaded by typing (.packages()).

(.packages())
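The difference between an installed package and a loaded one can also be checked in code. The sketch below is an illustration rather than part of the swirl module; the helper name `is_attached()` is made up. It uses only base R: (.packages()) lists the loaded packages, and the row names of installed.packages() are the names of all installed packages.

```r
# Hypothetical helper: is a package currently loaded in this session?
is_attached <- function(pkg) {
  pkg %in% (.packages())
}

# "base" is always loaded, so this is TRUE in every session.
base_attached <- is_attached("base")
base_attached

# Installed is a weaker condition than loaded: check the installed list.
base_installed <- "base" %in% rownames(installed.packages())
base_installed
```

A package that is installed but not loaded appears in the second check only, until you call library() on it.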

Now, let’s get started with the real programming.

R can be used as a calculator. You can just type your equation in the command window after the >. Type 10^2 + 36.

10^2 + 36

[1] 136

And R gave the answer directly. By the way, spaces do not matter.

If you use brackets and forget to add the closing bracket, the > on the command line changes into a +. The + can also mean that R is still busy with some heavy computation. If you want R to quit what it was doing and give back the >, press ESC.

You can also give numbers a name. By doing so, they become so-called variables which can be used later. For example, you can type in the command window A <- 4.

A <- 4

The <- is the so-called assignment operator. It allows you to assign data to a named object in order to store the data.

Don’t be confused by the term object. All sorts of data are stored in so-called objects in R. All objects of a session are shown in the Environment window. In the second part of this course, I will introduce different data types.

You can see that A appeared in the environment window in the top right corner, which means that R now remembers what A is.

You can also ask R what A is. Just type A in the command window.

A

[1] 4

You can also do calculations with A. Type A * 5.

A * 5

[1] 20

If you assign a new value to A, R forgets the value it had before. You can also assign a new value to A using the old one. Type A <- A + 10.

A <- A + 10

You can see that the value in the environment window changed.

To remove all variables from R’s memory, type rm(list = ls()).

rm(list = ls())

You see that the environment window is now empty. Alternatively, you can click the broom icon (clear all) in the environment window, which empties it in the same way. If you only want to remove the variable A, you can type rm(A).
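To see the effect of rm() without looking at the environment window, you can ask R which objects exist with ls(). The following is a minimal sketch; the object B and the two result names are chosen only for illustration.

```r
# Create two objects.
A <- 4
B <- 5

# Remove only A; B stays in memory.
rm(A)

# ls() lists the names of the objects that currently exist.
a_still_there <- "A" %in% ls()
b_still_there <- "B" %in% ls()

a_still_there
b_still_there
```

Removing single objects with rm(A) is the gentler option; rm(list = ls()) clears everything at once.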

Like in many other programs, R organizes numbers in scalars (a single number, 0-dimensional), vectors (a row of numbers, also called arrays, 1-dimensional) and matrices (like a table, 2-dimensional).

The A you defined before was a scalar. To define a vector with the numbers 3, 4 and 5, you need the function c(), which is short for concatenate (paste together). Type B <- c(3, 4, 5).

B <- c(3, 4, 5)

If you would like to compute the mean of all the elements in the vector B from the example above, you could type (3+4+5)/3. Try this.

(3 + 4 + 5) / 3

[1] 4

But when the vector is very long, this is very boring and time-consuming work. This is why things you do often are automated in so-called functions. For example, type mean(x=B) and guess what this function mean() can do for you.

mean(x = B)

[1] 4

Within the brackets you specify the arguments. Arguments give extra information to the function. In this case, the argument x says of which set of numbers (vector) the mean should be computed (namely of B). Sometimes, the name of the argument is not necessary; mean(B) works as well. Try it.

mean(B)

[1] 4
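The three ways of computing the mean shown above can be compared directly. This is a small sketch that uses only base R; the result names are chosen for illustration.

```r
B <- c(3, 4, 5)

# The same mean, computed by hand, with a named argument,
# and with a positional argument.
by_hand     <- (3 + 4 + 5) / 3
by_name     <- mean(x = B)
by_position <- mean(B)

# All three agree.
all.equal(by_hand, by_name)
identical(by_name, by_position)
```

Naming arguments is optional here because x is the first argument of mean(), so R matches B to it by position.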

Compute the sum of 4, 5, 8 and 11 by first combining them into a vector and then using the function sum. Use the function c inside the function sum.

sum(c(4, 5, 8, 11))

[1] 28

The function rnorm, as another example, is a standard R function which creates random samples from a normal distribution. Type rnorm(10) and you will see 10 random numbers.

rnorm(10)

 [1]  0.3755876 -0.1219572 -0.8941009  1.6236759  0.7970290  1.1445139
 [7]  0.9363113  0.9702480  0.2002156  0.5223241

Here rnorm is the function and the 10 is an argument specifying how many random numbers you want - in this case 10 numbers (typing n=10 instead of just 10 would also work). The result is 10 random numbers organised in a vector with length 10.

If you want 10 random numbers from a normal distribution with mean 1.2 and standard deviation 3.4, you can type rnorm(10, mean=1.2, sd=3.4). Try this.

rnorm(10, mean = 1.2, sd = 3.4)

 [1]  1.7127167  0.6255948 -2.2725458  1.2477239 -0.9643572  2.0744850
 [7]  6.1662256  1.0912033 -1.0149063 -3.6364031

This shows that the same function (rnorm()) can be called in different ways and that R has so-called named arguments (in this case mean and sd).

Comparing this example to the previous one also shows that for the function rnorm only the first argument (the number 10) is compulsory, and that R gives default values to the other so-called optional arguments. Use the help function to see which values are used as default by typing ?rnorm.

?rnorm

You see the help page for this function in the help window on the right. RStudio has nice features such as autocompletion and snapshots of the R documentation. For example, when you type rnorm( in the command window and press TAB, RStudio will show the possible arguments.
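Besides the help page, the default values can also be inspected in code. The sketch below uses the base R function formals(), which returns the arguments of a function together with their defaults. The sample size of 10000 is an assumption chosen so that the sample mean and standard deviation land close to the requested values.

```r
# Default values of the optional arguments of rnorm().
default_mean <- formals(rnorm)$mean
default_sd   <- formals(rnorm)$sd
default_mean
default_sd

# With many draws, the sample mean and standard deviation
# come close to the values passed as named arguments.
big_sample <- rnorm(10000, mean = 1.2, sd = 3.4)
mean(big_sample)
sd(big_sample)
```

Because the numbers are random, the last two results differ slightly from 1.2 and 3.4 on every run.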

You can also store the output of the function in a variable. Type x <- rnorm(100).

x <- rnorm(100)

Now 100 random numbers are assigned to the variable x, which becomes a vector by this operation. You can see it appears in the Environment window.

R can also make graphs. Type plot(x) for a very simple example.

plot(x)

The 100 random numbers are now plotted in the plots window on the right.

You are now more familiar with RStudio and you know some basic R stuff. In particular, you know…

…that everything in R is done with functions,

…that functions can but don’t have to have arguments,

…that you can install packages which contain functions,

…that you must load the installed packages every time you start a session in RStudio, and

…that this is just the beginning. Thus, please continue with the second module of this introduction.

After you have successfully finished learning module huber-intro-1 please go ahead with the learning module huber-intro-2 that is also part of my swirl course swirl-it.

4.3 swirl-it: huber-intro-2


Welcome to the second module. Again, if you find any errors or if you have suggestions for improvement, please let me know via stephan.huber@hs-fresenius.de.

Before you start working, you should set your working directory to where all your data and script files are or should be stored. Within RStudio you can go to ‘Session’ > ‘Set working directory’, or you can type setwd("YOURPATH"). Please do this now.

setwd("/home/sthu/Documents/mydir")

Hint: Instead of clicking, you can also type setwd("path"), where you replace "path" with the location of your folder, for example setwd("D:/R/swirl").

R is an interpreter that uses a command line based environment. This means that you have to type commands, rather than use the mouse and menus. This has many advantages. Foremost, it is easy to get a full transcript of everything you did and you can replicate your work easily.

As already mentioned, all commands in R are functions where arguments come (or do not come) in round brackets after the function name.

You can store your workflow in files, the so-called scripts. Script files typically have the extension .R, e.g., foo.R.

You can open an editor window to edit these files by clicking ‘File’ and ‘New File’. Try this. Under ‘File’ you also find the options ‘Open file…’, ‘Save’ and ‘Save as’. Alternatively, just press CTRL+SHIFT+N.

You can run (send to the Console window) part of the code by selecting lines and pressing CTRL+ENTER or click ‘Run’ in the editor window. If you do not select anything, R will run the line your cursor is on.

You can always run the whole script with the console command source, so, e.g., for the script in the file foo.R you type source("foo.R"). You can also click ‘Run all’ in the editor window or press CTRL+SHIFT+S to run the whole script at once.

Make a script called firstscript.R. To do so, open the editor window with ‘File’ > ‘New File’. Type plot(rnorm(100)) in the script and save it as firstscript.R in the working directory. Then type source("firstscript.R") on the command line.

source("firstscript.R")

Run your script again with source("firstscript.R"). The plot will change because new numbers are generated.

source("firstscript.R")

Hint: Type source("firstscript.R") again or type skip() if you are not interested.
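The plot changes on every run because rnorm() draws fresh numbers each time. If you ever want a script to reproduce exactly the same numbers, base R offers set.seed(), which fixes the starting point of the random number generator. This goes slightly beyond the swirl module, and the seed value 123 below is arbitrary.

```r
# Same seed, same random numbers.
set.seed(123)
first_draw <- rnorm(5)

set.seed(123)
second_draw <- rnorm(5)

identical(first_draw, second_draw)

# Without resetting the seed, the next draw is different again.
third_draw <- rnorm(5)
identical(first_draw, third_draw)
```

Putting a set.seed() call at the top of a script such as firstscript.R would make its plot look the same each time it is sourced.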

Vectors were already introduced, but they can do more. Make a vector with numbers 1, 4, 6, 8, 10 and call it vec1.

Hint: Type vec1 <- c(1,4,6,8,10).

vec1 <- c(1, 4, 6, 8, 10)

Elements in vectors can be addressed by standard [i] indexing. Select the 5th element of this vector by typing vec1[5].

vec1[5]

Replace the 3rd element with a new number by typing vec1[3] <- 12.

vec1[3] <- 12

Ask R what the new version of vec1 is.

vec1
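The indexing steps above can be put together in one short sketch. Note that R counts positions starting from 1. The name `fifth` is chosen only for illustration.

```r
vec1 <- c(1, 4, 6, 8, 10)

# Select a single element by its position.
fifth <- vec1[5]
fifth            # 10

# Replace the 3rd element.
vec1[3] <- 12
vec1             # 1 4 12 8 10

# Several positions can be selected at once with a vector of indices.
vec1[c(1, 3)]    # 1 12
```

The same square-bracket notation is used both for reading and for replacing elements.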

You can also see the numbers of vec1 in the environment window. Make a new vector vec2 using the seq() (sequence) function by typing seq(from=0, to=1, by=0.25) and check its values in the environment window.

Hint: Type vec2 <- seq(from=0, to=1, by=0.25).

vec2 <- seq(from = 0, to = 1, by = 0.25)

Type sum(vec1).

sum(vec1)

The function sum sums up the elements within a vector, leading to one number (a scalar). Now use + to add the two vectors.

Hint: Type vec1 + vec2.

vec1 + vec2

If you add two vectors of the same length, the first elements of both vectors are summed, and the second elements, etc., leading to a new vector of length 5 (just like in regular vector calculus).
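The difference between sum() and + can be seen side by side in the following sketch. It assumes vec1 already has its 3rd element replaced by 12, as in the steps above; the result names are chosen for illustration.

```r
vec1 <- c(1, 4, 12, 8, 10)
vec2 <- seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

# sum() collapses a vector into a single number.
total <- sum(vec1)
total                # 35

# + works element by element and returns a vector of the same length.
elementwise <- vec1 + vec2
elementwise          # 1.00 4.25 12.50 8.75 11.00
length(elementwise)  # 5
```

If the two vectors had different lengths, R would reuse (recycle) the elements of the shorter one, which is worth knowing but is not covered in this module.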

Matrices are nothing more than 2-dimensional vectors. To define a matrix, use the function matrix. Make a matrix with matrix(data=c(9,2,3,4,5,6),ncol=3) and call it mat.

Hint: Type mat <- matrix(data=c(9,2,3,4,5,6),ncol=3) or type skip() if you are not interested.

mat <- matrix(data = c(9, 2, 3, 4, 5, 6), ncol = 3)
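It helps to check how R arranged the six numbers. By default, matrix() fills the matrix column by column, so with ncol = 3 the result has 2 rows. The sketch below also shows the byrow argument of matrix() for comparison; the name `by_row` is chosen for illustration.

```r
mat <- matrix(data = c(9, 2, 3, 4, 5, 6), ncol = 3)

dim(mat)       # 2 rows, 3 columns
mat[1, 2]      # row 1, column 2: 3
mat[, 3]       # the whole third column: 5 6

# Filling row by row instead puts the same numbers in different places.
by_row <- matrix(data = c(9, 2, 3, 4, 5, 6), ncol = 3, byrow = TRUE)
by_row[1, 2]   # 2
```

Elements of a matrix are addressed with two indices, [row, column], and leaving one of them empty selects the whole row or column.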

The third type of data structure treated here is the data frame. Time series are often ordered in data frames. A data frame is like a matrix with names above the columns. This is nice, because you can call and use one of the columns without knowing in which position it is. Make a data frame with t <- data.frame(x = c(11,12,14), y = c(19,20,21), z = c(10,9,7)).

t <- data.frame(x = c(11, 12, 14), y = c(19, 20, 21), z = c(10, 9, 7))

Ask R what t is.

Hint: Type t or skip() if you are not interested.

The data frame is called t and the columns have the names x, y and z. You can select one column by typing t$z. Try this.

t$z

Another option is to type t[["z"]]. Try this as well.

t[["z"]]

Compute the mean of column z in data frame t.

Hint: Use function mean or type skip() if you are not interested.

mean(t$z)
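The different ways of selecting a column give the same result, which the following sketch confirms. It uses only base R; colMeans() is a base function that computes the mean of every column at once, and the name `z_mean` is chosen for illustration.

```r
t <- data.frame(x = c(11, 12, 14), y = c(19, 20, 21), z = c(10, 9, 7))

# Selecting by name with $ or [[ ]] returns the same vector.
identical(t$z, t[["z"]])

# Selecting by position also works, but requires knowing the position.
identical(t$z, t[, 3])

# Mean of one column, and of all columns at once.
z_mean <- mean(t$z)
z_mean
colMeans(t)
```

Selecting by name is usually safer than selecting by position, because it keeps working when columns are added or reordered.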

In the following question you will be asked to modify a script that will appear as soon as you move on from this question. When you have finished modifying the script, save your changes to the script and type submit() and the script will be evaluated. There will be some comments in the script that opens up. Be sure to read them!

Make a script file which constructs three random normal vectors of length 100. Call these vectors x1, x2 and x3. Make a data frame called t with three columns (called a, b and c) containing respectively x1, x1+x2 and x1+x2+x3. Call plot(t) for this data frame. Then, save it and type submit() on the command line.

Hint: Type plot(rnorm(100)) in the script, save it and type submit() on the command line.

# Text behind the #-sign is not evaluated as code by R.
# This is useful, because it allows you to add comments explaining what the script does.
# In this script, replace the ... with the appropriate commands.
x1 <- ...
x2 <- ...
x3 <- ...
t <- ...
plot(...)

Result

# Text behind the #-sign is not evaluated as code by R.
# This is useful, because it allows you to add comments explaining what the script does.
# In this script, replace the ... with the appropriate commands.
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
t <- data.frame(a = x1, b = x1 + x2, c = x1 + x2 + x3)
plot(t)

Do you understand the results?

Another basic structure in R is a list. The main advantage of lists is that the columns (they are not really ordered in columns any more, but are more a collection of vectors) don’t have to be of the same length, unlike matrices and data frames. Make this list L <- list(one=1, two=c(1,2), five=seq(0, 1, length=5)).

L <- list(one = 1, two = c(1, 2), five = seq(0, 1, length = 5))

The list L has names and values. You can type L to see the contents.

L also appeared in the environment window. To find out what’s in the list, type names(L).

names(L)

Add 10 to the column called five.

Hint: Type L$five + 10

L$five + 10
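Note that L$five + 10 only prints the result; the list itself stays unchanged until the result is assigned back. The sketch below illustrates this. It writes the seq() argument in full as length.out, and the name `shifted` is chosen for illustration.

```r
L <- list(one = 1, two = c(1, 2), five = seq(0, 1, length.out = 5))

# The elements of a list may have different lengths.
lengths(L)

# This computes a new vector but does not change L.
shifted <- L$five + 10
L$five            # still 0.00 0.25 0.50 0.75 1.00

# Assigning the result back stores it in the list.
L$five <- L$five + 10
L$five            # 10.00 10.25 10.50 10.75 11.00
```

The same pattern applies to vectors and data frame columns: a calculation changes an object only when its result is assigned.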

Plotting is an important statistical activity. So it should not come as a surprise that R has many plotting facilities. Type plot(rnorm(100), type="l", col="gold").

Hint: The symbol between quotes after the type=, is the letter l, not the number 1. To see the result you can also just type skip().

plot(rnorm(100), type = "l", col = "gold")

One hundred random numbers are plotted as a gold line connecting the values.

Another very simple example is the classical statistical histogram plot, generated by the simple command hist. Make a histogram of 100 random numbers.

Hint: Type hist(rnorm(100))

hist(rnorm(100))

The script that opens up is the same as the script you made before, but with more plotting commands. Type submit() on the command line to run it (you don’t have to change anything yet).

Hint: Change plotting parameters in the script, save it and type submit() on the command line.

# Text behind the #-sign is not evaluated as code by R.
# This is useful, because it allows you to add comments explaining what the script does.

# Make data frame
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
t <- data.frame(a = x1, b = x1 + x2, c = x1 + x2 + x3)

# Plot data frame
plot(t$a, type = "l", ylim = range(t), lwd = 3, col = rgb(1, 0, 0, 0.3))
lines(t$b, type = "s", lwd = 2, col = rgb(0.3, 0.4, 0.3, 0.9))
points(t$c, pch = 20, cex = 4, col = rgb(0, 0, 1, 0.3))
# Note that with plot you get a new plot window while points and lines add to the previous plot.

Experiment to find out what rgb, the last argument of rgb, lwd, pch and cex mean. Type play() on the command line to experiment. Modify lines 11, 12 and 13 of the script and run them by putting your cursor there and pressing CTRL+ENTER. When you are finished, type nxt() and then ?par.

Hint: Type ?par or type skip() if you are not interested.

?par

You searched for par in the R help. This is a useful page to learn more about formatting plots. Google ‘R color chart’ for a pdf file with a wealth of color options.

To copy your plot to a document, go to the plots window, click the ‘Export’ button, choose the nicest width and height and click ‘Copy’ or ‘Save’.
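A plot can also be saved from code instead of using the ‘Export’ button. The following is a minimal sketch using the base R graphics device pdf(); the file name myplot.pdf, the use of tempdir() and the width and height values are assumptions chosen for illustration.

```r
# Where to save the plot; tempdir() is used only so the example runs anywhere.
out_file <- file.path(tempdir(), "myplot.pdf")

# Open a PDF device, draw the plot into it, then close the device.
pdf(out_file, width = 6, height = 4)
plot(rnorm(100), type = "l", col = "gold")
dev.off()

# The file now exists on disk.
file.exists(out_file)
```

Saving plots from a script has the advantage that the figure can be recreated later with exactly the same size.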

After having almost completed the second learning module, you are getting closer to becoming a nerd, as you know…

…that everything in R is stored in objects (values, vectors, matrices, lists, or data frames),

…that you should always work in scripts and send code from scripts to the Console,

…that you can do it if you don’t give up.

Please continue by choosing another swirl learning module.

4.4 swirl-it: Data analytical basics

In my swirl modules huber-data-1, huber-data-2, and huber-data-3 I introduce some very basic statistical principles on how to analyse data.

4.5 swirl-it: The `tidyverse` package

I compiled a short swirl module to introduce the tidyverse universe. This is a powerful collection of packages which I discuss later on. The learning module is also part of my swirl-it course.

4.6 Other `swirl` modules

You can also install some other courses. You can find a list of courses at http://swirlstats.com/scn/index.html or at https://github.com/swirldev/swirl_courses.

I recommend this one as it gives a general overview on very basic principles of R:

library(swirl)
install_course_github("swirldev", "R_Programming_E")
swirl()
