Tuesday, March 12, 2013

First Steps in R

If you are coming to R without a programming background, getting started can seem daunting, especially when you are facing that taunting, blinking cursor. In this post, I will point out a few things that can make getting started easier.

Use An IDE

An IDE is an Integrated Development Environment. While it is perfectly reasonable to write code in TextEdit or a similar plain text editor, most programmers use IDEs, in part because they come with tools that make life easier.

(Un)fortunately, there are many IDEs to choose from, so you can pick the one that suits you best. I will highlight a few.

RStudio

RStudio is a new player in the IDE world for R, and it has made a big splash. RStudio has several nice features: a package manager, a single panel that collects your figures (so you aren't searching all over your desktop for missing plots, or having to regenerate a ton of figures), parenthesis matching, syntax highlighting, integration with Rcpp, and more. Everything you would want is in one multipaned window. Since many readers are newbies and this is all brand new, I will just say that RStudio genuinely makes life easier. It is available on Mac, Windows, and Linux, requires nothing more than installing R and RStudio, and has some of the top developers in R contributing to the project (including Hadley Wickham). For newbies, especially those without programming experience, this is where I would go.

Note: RStudio does NOT handle dual monitors

Tinn-R

Tinn-R is a popular editor on the Windows platform. I am not a Windows user, so I have limited experience with it. My understanding is that Tinn-R opens an editor window and a separate R window, and code is sent from the editor to the R window to run. A lot of people love Tinn-R, and quite a few newbies use it. It can also be used for many other languages.

Notepad++

Notepad++ is much like Tinn-R above. I've used it, but frankly I cannot comment on the differences between it and Tinn-R.

Eclipse + StatET

Eclipse is a heavy lifter: the major IDE in the Java world, with so many potential bells and whistles that it will make your head spin. StatET is a nice plugin for Eclipse that sets up a powerful environment for R. I find that this is probably the hardest option to set up, but it's still pretty easy (text windows guide you through every step). There are blog posts that describe using it to develop C++ code for R and to write LaTeX; it really is a powerful set of development tools. I do find that it can be a bit clunky to make changes.

Vi

Vi (or Vim) is one of the popular text editors in Linux/Unix environments. It has a very hardcore user group. I am not a Vi user, but if you are one, yes, you can use Vi with R, and powerfully.

Emacs + ESS

Emacs is the counterpart to Vi, and it has a fairly comprehensive statistical environment, ESS (Emacs Speaks Statistics), that allows for a very powerful integration of R and Emacs. The development seems to be responsive (knitr is fully integrated, for example). I use Emacs for everything: calendars, organization, agendas, writing documents, and writing code in R, C++, Python, Clojure, etc. Like Vi, it's a huge, flexible environment with a devout following (the war between Vi and Emacs is well known; the package that lets you use Vi keybindings in Emacs is called the extensible vi layer, or EVIL). I am learning more and more about how to use Emacs every day. All this said, I never use Windows; in my experience Emacs installs readily on Mac and Linux. You can get it on Windows, but if you aren't already running a Linux-like environment through Cygwin or something similar, I would recommend going with one of the alternatives above, notably RStudio.

(Full disclosure: I was having trouble getting Emacs to compile this blog post, so this post was written in RStudio using R Markdown.)

Plan your analysis/session in advance

I think that part of the initial struggle with the blinking cursor when starting R is not knowing what you want to do. This is understandable, and sure, we would all love something that reads our minds so we could just say “analyze my data.” Unfortunately, we are not there yet.

A useful way to get started is to open your IDE and outline what you want to do. Note that the pound sign '#' starts a comment in R. A comment just means that R will not read anything to the right of the pound sign on that line. In general, my approach is to outline the verbs for your analysis.

# Generate normal data

# Calculate Mean and Variance

# Plot Histogram of Data

Then fill in the gaps:

# Generate normal data

Data = rnorm(1000, mean = 1, sd = 1)

# Calculate Mean and Variance

cat("Mean = ", mean(Data), ", Variance = ", var(Data), fill = TRUE)
## Mean =  0.9867 , Variance =  1.019

# Plot Histogram of Data
hist(Data)

[Figure: histogram of Data]

So an example analysis might have an outline like this:

  1. Import data

  2. Check the import

  3. Plot Data

  4. Generate summary statistics

  5. Check for correlated covariates

  6. Create regression models

  7. Check model assumptions

  8. Check model fit

  9. Generate plots and tables

From here, fill in the gaps. Planning should make you more productive, especially since the planning phase can be used to make sure you aren't missing anything. That is hard to check when you are flying by the seat of your pants. As far as I can tell, this is the approach most programmers use. Sure, lots of people in an exploratory phase will just go to the blinking cursor and play for a little bit, and for quick and dirty tasks they may take shortcuts, but any serious analysis is usually planned out.
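To make that concrete, here is a rough sketch of how the nine-step outline above might look once the gaps are filled in. Everything specific here is my own assumption for illustration: I use R's built-in mtcars data set in place of a real import, and the variable names (cars, fit) and the choice of model are arbitrary. Treat it as one possible filling-in, not the right way to do any particular analysis.

```r
# 1. Import data
#    (assumption: the built-in mtcars data stands in for your own file)
cars <- mtcars

# 2. Check the import
str(cars)
head(cars)

# 3. Plot data
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# 4. Generate summary statistics
summary(cars[, c("mpg", "wt", "hp")])

# 5. Check for correlated covariates
cor(cars$wt, cars$hp)

# 6. Create regression models
fit <- lm(mpg ~ wt + hp, data = cars)

# 7. Check model assumptions (standard diagnostic plots for lm)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# 8. Check model fit
summary(fit)$r.squared

# 9. Generate plots and tables
coef(summary(fit))
```

Notice that the comments are exactly the outline; the code just hangs underneath each step.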

Finding Help

Google

When you have a question, you can typically find the answer through Google. I tend to search for “R cran” followed by whatever I am looking for.
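A web search is not the only place to look. R also ships with its own help lookups, which can be a quick first stop. The sketch below shows a few of the standard ones; the search terms ("histogram", "norm") are just examples I picked.

```r
# Open the help page for a function whose name you already know
?mean
help("hist")

# Search the installed help pages by keyword when you do not know the name
help.search("histogram")   # shorthand: ??histogram

# List objects on the search path whose names match a pattern
apropos("norm")

# Run the examples from the bottom of a help page
example(mean)
```

The help pages can be terse, but the examples at the bottom are often the fastest way to see how a function is meant to be called.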

R Mailing Lists

Another route is to read the R Mailing Lists. I will post a note of caution on this front: if you consider posting, make sure you read the posting guide before you post, and spend a few moments on Google making sure there isn't an obvious answer to your question. This is because “dumb” questions can be harshly treated on these lists. To be fair, one of the famously sharp responders here is one of the lead developers of R, and he takes time out of his development and academic schedule to answer questions, so it is at least courteous to make an effort to be sure that you have a legitimate new question, rather than rehashing something that has been answered many times.
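As I understand the posting guide, a question goes over much better when it comes with a small example that other people can actually run, plus some information about your R setup. Here is a minimal sketch of what that might look like. The data and the model are made up purely for illustration; you would swap in the smallest piece of your own problem.

```r
# Small, self-contained data that anyone can recreate
set.seed(42)   # fix the random numbers so the example repeats
d <- data.frame(x = 1:10, y = rnorm(10))

# The smallest piece of code that shows the problem or question
fit <- lm(y ~ x, data = d)
coef(fit)

# dput() prints data as R code that others can paste straight into R
dput(head(d, 3))

# Report your R version, platform, and loaded packages
sessionInfo()
```

Paste the code and its output into your message as plain text.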