Calling R from Other Software

by Robert A. Muenchen


The R software is powerful, but it takes a long time to learn to use it well. However, you can keep using your current software to access and manage data, then call R for just the things your current software doesn’t do. This paper introduces the minimal set of R commands you need to work this way. It also shows how to call R routines from many other packages, including SAS, SPSS, and JMP.


R, R-Project, Open Source, Data Analysis, SAS, SAS/IML Studio, JMP, SPSS, Stata, Statistica, integration, import, export, correlation, multiple testing, Holm


R (1) is free, open-source software that does statistics and graphics. You can download it from the Comprehensive R Archive Network (CRAN). Written by Robert Gentleman and Ross Ihaka, R is based upon the S language developed by John Chambers (2) and others at Bell Labs in the 1970s. It is a language that is optimized for writing analytic procedures, somewhat similar to SAS/IML, SPSS Matrix, and Stata’s Mata. R also includes a rich array of pre-written procedures called functions. These functions are written in the R language and in compiled languages such as C or FORTRAN (3). They are all open for you to study and, if you like, change. Both the quality of the R language and its openness to change have attracted many developers. These volunteers have written more than 5,000 add-on programs that add new procedures to R. Many data analysis packages can call (i.e., run or execute) these functions.

The R language completely integrates accessing and managing data, running analytic procedures, performing repetitive “macros,” managing output (as with SAS ODS or SPSS OMS), and adding new functions through matrix algebra functions, all in a single consistent style of programming. Other software typically uses a different language for each of those steps (4).

R is free and powerful, but it does have limitations. While its language is powerful and consistent, it is considered by many to be harder to learn than other software. That is because it has more types of data structures than just the data set, and its equivalent of a macro language and output management must be learned from the start. Other software, such as SAS or SPSS, allows you to skip those topics until you need them.

Another factor that makes R somewhat harder to learn is that its help files are written for relatively advanced users. For example, the SAS help file for PROC PRINT provides a readily comprehensible description: “The PRINT procedure prints the observations in a SAS data set, using all or some of the variables. You can create a variety of reports ranging from a simple listing to a highly customized report that groups the data and calculates totals and subtotals for numeric variables.” However, the help file for R’s equivalent print function provides a relatively cryptic description: “print prints its argument and returns it invisibly (via invisible(x)). It is a generic function which means that new printing methods can be easily added for new classes.” That is much less clear. What is it printing, and will the output be invisible? Despite this complicated description, using the function to print your data set is as simple as entering, print(mydata), or even simpler by merely entering mydata.

An important limitation of R is that it must hold all its data in your computer’s main memory. Although that allows it to analyze a few million records, it is not sufficient to handle the massive amounts of data that are becoming increasingly common. R users who analyze such very large data sets usually manage them in a database and then work on samples small enough to fit into memory. Since the field of statistics does a good job of generalizing the results obtained on relatively small samples to large populations, this is not as severe a limitation as it might first appear. Several projects are underway to overcome this memory limitation. A commercially available version of R has overcome this limitation for some of its functions (5).

Installing R

When you purchase commercial software, you receive it on DVD(s), or you download it from the vendor. Every part of it you purchased arrives at once, and you install it all at once. Although commercial software is written in compiled languages such as C, FORTRAN and Java, the vendor compiles that source code and gives you only the binary version.

R is available as a download from the Comprehensive R Archive Network, or CRAN. Most people will want to get the binary version of R to install. However, since R is open-source software, you can download the C and FORTRAN source code version and perhaps even change it to better meet your needs before you compile and install it.

Since R has thousands of add-on packages, they are not all included in the initial installation. There are several ways you can find useful R packages. If you are a SAS or SPSS user, you are probably already familiar with their add-on products. I maintain a table of these add-ons, which shows R packages that are roughly equivalent.

Another good source to learn about packages is the Task Views page on CRAN. Vanderbilt University maintains a similar site. Detailed information about most R packages is also available online. The “packages” page on the main CRAN site is organized alphabetically, making it easy to find a package if you know its name. Another useful site with package information is “Crantastic!” Created and maintained by Hadley Wickham and Bjørn Mæland, the site allows R users to rate the packages and write brief reviews.

There are repositories other than CRAN. One is R-Forge. Another, Bioconductor, focuses on the analysis and comprehension of bioinformatic data.

When using an Internet search engine such as Google, the letter “R” doesn’t help narrow a search down very much, but adding “R package” (including the quotes) to any analytic term is likely to lead you to an R package that performs that analysis.

Once you have found the package you need, you install it by starting R and calling the install.packages function, for example, install.packages("Rcmdr"). You need to do this only once per version of R you install.

R will ask you to choose a CRAN mirror from a list of software repositories that are scattered around the world. When finished, the package is in your R library. Since R’s data must fit into your computer’s main memory, add-on packages are not loaded automatically from your library. Every time you start R and want to use that package, you must load it with the library function, for example, library("Rcmdr").

R is case-sensitive, so be careful to respect every lower or uppercase letter in package names.
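The install-once, load-every-session routine described above can be sketched as a small helper. The function name use_package is hypothetical and not part of R; it installs a package only when it is missing and then loads it. The demonstration uses "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # needed once per R installation
  }
  library(pkg, character.only = TRUE)  # needed in every R session
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so this call loads it without downloading anything.
loaded <- use_package("stats")
print(loaded)
```

Because R is case-sensitive, the package name passed to such a helper must match exactly: "Rcmdr" and "rcmdr" are different names.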

Missing Data

While most commercial software packages use as much data as possible, an R function often yields only a missing result if it finds missing data. R does offer all the usual ways of dealing with missing values, including listwise, pairwise, mean substitution, multiple imputation, and so on. However, if you are calling R from other software, you will probably prefer using that software to eliminate missing values. For example, if you had variables named x1 through x10, in SAS you could select a subset of complete observations to pass to R using:

data subset; set all;
where n(of x1-x10) = 10;
run;

You could then send the subset to R for analysis. SPSS or Stata users could do the same type of selection by using the nvalid or rownonmiss functions, respectively.
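For readers who would rather do this step in R, here is a minimal sketch of the same idea. The tiny data frame q and its values are made up for illustration. It shows a function returning a missing result, the na.rm argument that many R functions accept, and listwise deletion with complete.cases.

```r
# A made-up data frame with two missing values.
q <- data.frame(x1 = c(1, 2, NA, 4),
                x2 = c(2, NA, 6, 8))

plain_mean <- mean(q$x1)                # missing, because x1 contains an NA
clean_mean <- mean(q$x1, na.rm = TRUE)  # drops the NA before averaging

# Listwise deletion: keep only rows with no missing values.
complete <- q[complete.cases(q), ]

print(plain_mean)
print(clean_mean)
print(complete)
```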

Selecting Variables

In most commercial software, you select variables using a fairly trivial process. For example, SAS uses the var statement in its various forms:

var q1 q2 q3 q4;
var q1-q4;
var q: ;
var a--z;

However, you select observations in most software using a completely different approach which is based on logic. For example, SAS uses the where statement:

where gender='f';

In R, you can select both variables and observations using many of the same methods. While that makes it relatively easy to select variables based on complex string searches, it also adds to R’s complexity. We will skip most of that complexity and focus on only three ways of selecting variables: using the attach function, using $ notation, and using formulas.

Let us consider a practice data set named “mydata.” In R, data sets are called data frames, and their variables are called vectors. Here is our example data frame containing the vectors workshop, gender, q1, etc.:

  workshop gender q1 q2 q3 q4
1        1      f  1  1  5  1
2        2      f  2  1  4  1
3        1      f  2  2  4  3
4        2   <NA>  3  1 NA  3
5        1      m  4  5  2  4
6        2      m  5  4  5  5
7        1      m  5  3  4  4
8        2      m  4  5  5  5

As in most software, if we specify the name of the data set and do not tell the function which variables to use, R will often use them all. Here we use the summary function to get basic summary statistics.

> summary(mydata)

   workshop      gender                q1
Min.   :1.0   Length:8           Min.   :1.00
1st Qu.:1.0   Class :character   1st Qu.:2.00
Median :1.5   Mode  :character   Median :3.50
Mean   :1.5                      Mean   :3.25
3rd Qu.:2.0                      3rd Qu.:4.25
Max.   :2.0                      Max.   :5.00

       q2             q3             q4
Min.   :1.00   Min.   :2.00   Min.   :1.00
1st Qu.:1.00   1st Qu.:4.00   1st Qu.:2.50
Median :2.50   Median :4.00   Median :3.50
Mean   :2.75   Mean   :4.14   Mean   :3.25
3rd Qu.:4.25   3rd Qu.:5.00   3rd Qu.:4.25
Max.   :5.00   Max.   :5.00   Max.   :5.00
               NA’s   :1.00

Note that R uses “>” as a prompt. Commands that follow that character are things we are entering. If we continue a command onto another line, R will change the prompt to “+” to let you know it is expecting more. You can continue typing on a new line whenever the part on the first line is not a complete command itself. So we could spread our summary function call across three lines like this:

> summary(
+ mydata
+ )

If you see the “+” prompt when you think you have finished the function call, you can press the Esc key on Windows or CTRL-C on Mac or Linux/UNIX systems. That will return you to the “>” prompt.

When you look at the output, you see information that is mostly useful: means, medians, and the like. In the output, you can see NA’s: 1.00 under the variable q3. R uses those letters, “NA” (Not Available), as its code for missing values. However, for the variable workshop, we get the mean, which is inappropriate for a categorical variable. Gender happens to be a character variable at the moment, so all R tells us is that there are eight observations (Length: 8). We will rectify those issues in a moment.

If we try to get summary statistics on just one variable, R will not find it:

> summary(q1)

Error in summary(q1) : object ‘q1’ not found

This is because the variable (also called a vector) is stored in a data set (called a data frame), and we have not told it which one. One way to specify the names of both the data frame and the variable is to use the form, dataframe$variable. So we can get summary statistics on q1 by requesting it thus:

> summary(mydata$q1)

However, that makes all variable requests longer. If we instead use the attach function, we can specify the data set in advance and then dispense with the “dataframe$” part of the name:

> attach(mydata)
> summary(q1)

So far, we have seen R analyze a data frame and a vector. Both data frames and vectors are single objects in R. In fact, R functions often accept only single objects as their parameters or arguments. Therefore, to analyze just the q variables, we must first combine those vectors into a single object: their own data frame. We can do this using a call to the data.frame function:

> summary( data.frame(q1,q2,q3,q4) )

This also demonstrates a fundamental feature of R: you can nest functions inside one another so long as the output from one is compatible with the input of the next. It is as if the SAS Output Delivery System or SPSS Output Management System were integrated directly into every R function!
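To try these selections yourself, the practice data frame can be typed in directly. The sketch below builds it from the listing shown earlier and then selects variables two ways. The with function is an alternative to attach that this paper does not otherwise use; it evaluates an expression inside the data frame without attaching it.

```r
# The practice data frame, typed in from the listing shown earlier.
mydata <- data.frame(
  workshop = c(1, 2, 1, 2, 1, 2, 1, 2),
  gender   = c("f", "f", "f", NA, "m", "m", "m", "m"),
  q1 = c(1, 2, 2, 3, 4, 5, 5, 4),
  q2 = c(1, 1, 2, 1, 5, 4, 3, 5),
  q3 = c(5, 4, 4, NA, 2, 5, 4, 5),
  q4 = c(1, 1, 3, 3, 4, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Dollar notation names the data frame and the variable together.
q1_summary <- summary(mydata$q1)

# with() looks up q1 to q4 inside mydata, so no attach call is needed.
q_summary <- with(mydata, summary(data.frame(q1, q2, q3, q4)))

print(q1_summary)
print(q_summary)
```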

Categorical Variables

In other software, you tell it which variables are categorical by listing them on statements like CLASS or GROUP. If you used a numerically coded categorical variable as a predictor in a regression equation without declaring it, the result would be nonsense. In R, things are quite different. You specify categorical variables using the factor function:

> mydata$workshop <- factor(mydata$workshop)
> mydata$gender   <- factor(mydata$gender)

This tells R to replace the original workshop with a factor version of it. The two-character sequence “<-” is the assignment operator. That is roughly equivalent to the equals sign in most other software. By specifying “mydata$” before the variable names, we make it clear that we want the variables (vectors) to be stored in our data frame.

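Before, when we asked for summary statistics on workshop, R gave us the mean, which was an inappropriate statistic for categorical data. Now that it is a factor, let us see what R does. Because attach made its copy of the variables before we changed them, we first detach the data frame and attach it again:

> detach(mydata)
> attach(mydata)
> summary(data.frame(workshop, gender))

 workshop  gender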
1:4      f   :3
2:4      m   :4

We see that now R knows not to calculate the mean, and it tells us instead that four students took each workshop. Once a vector is defined as a factor, R functions will usually do the right thing. For example, used as a predictor in a regression analysis, it would create the usual indicator variables, yielding a proper result. If we had desired value labels for our factors, we could have specified them on the factor function call. See help(factor) for details.
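As a minimal sketch of the value labels mentioned above: the labels "R" and "SAS" are hypothetical and chosen only for illustration. The sketch also shows the indicator variable that R builds when a factor is used as a predictor, by inspecting the design matrix with model.matrix.

```r
workshop <- c(1, 2, 1, 2, 1, 2, 1, 2)

# Hypothetical value labels: 1 = "R", 2 = "SAS".
workshop_f <- factor(workshop, levels = c(1, 2), labels = c("R", "SAS"))

counts <- table(workshop_f)           # counts per label instead of a mean
design <- model.matrix(~ workshop_f)  # intercept plus one indicator column

print(counts)
print(colnames(design))
```

The second column of the design matrix is the 0/1 indicator for the "SAS" level; the first level serves as the reference category.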

Modeling Functions

Since your current analytic software already has an extensive array of analyses built-in, you will likely seek out R functions that do modeling. Those functions are particularly easy to control because they have a data argument that lets you specify which data frame to use for your model’s formula. For example, the following code uses the lm function to do a linear model, in this case, linear regression:

> myModel <- lm(q4 ~ q1 + q2 + q3, data = mydata)

Formulas are in the form dependent ~ independent1 + independent2…, and they look only within “mydata” because that is where the data argument tells it to look. However, if we left that off, then we would have to attach the data frame or use the dollar-style names, dataframe$y. In fact, R is capable of doing a single analysis that combines variables from multiple data sets at the same time! See Table 1 for a list of common statistical formulas.

Table 1. Formulas for common models in R.

Model                                           R Formula
Simple Regression                               y ~ x
Multiple Regression with Interaction            y ~ x1 + x2 + x1:x2  or  y ~ x1*x2
Regression without Intercept                    y ~ -1 + x
One-way Analysis of Variance                    y ~ a
Two-way Analysis of Variance with Interaction   y ~ a + b + a:b  or  y ~ a*b
Analysis of Covariance                          y ~ x * a, then y ~ x + a
Analysis of Variance with b Nested within a     y ~ b %in% a  or  y ~ a/b
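One way to check how R reads a formula is to ask for its expanded terms. The sketch below uses the terms function from base R; the helper name expand is hypothetical. No data are needed, because terms only parses the formula.

```r
# Hypothetical helper: list the terms R derives from a formula.
expand <- function(f) attr(terms(f), "term.labels")

short_form <- expand(y ~ x1 * x2)            # the * shorthand
long_form  <- expand(y ~ x1 + x2 + x1:x2)    # the same model written out
nested     <- expand(y ~ a / b)              # a, plus b within a

# The -1 term removes the intercept; terms() records that as 0.
has_intercept <- attr(terms(y ~ -1 + x), "intercept")

print(short_form)
print(nested)
print(has_intercept)
```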

The model we created above was saved in “myModel” but no output appeared! The resulting model is a model object now, which we can manipulate using extractor functions. These are functions that will extract the model information and do things with it. Earlier, we saw what the summary function would do with a data frame. Let’s see what it will do with a model:

> summary(myModel)


Call:
lm(formula = q4 ~ q1 + q2 + q3, data = mydata)

Residuals:
      1       2       3       5       6       7       8
-0.3114 -0.4262  0.9428 -0.1797  0.0766  0.0226 -0.1247

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.324      1.288   -1.03    0.379
q1             0.430      0.262    1.64    0.200
q2             0.631      0.250    2.52    0.086 .
q3             0.315      0.256    1.23    0.306

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.638 on 3 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.93,       Adjusted R-squared: 0.86
F-statistic: 13.3 on 3 and 3 DF,  p-value: 0.0308

Other things we can do with models:

  • plot(myModel) will create a set of diagnostic plots appropriate to the type of model you are doing.
  • anova(myModel) will extract the analysis of variance table from your model.
  • anova(myReducedModel, myFullModel) will perform an ANOVA to compare two models.
  • predict(myModel, newdata) will make predictions in the newdata data frame using myModel, so long as newdata contains variables with the same names.

Since your current analytic software probably has excellent procedures for linear regression, you are unlikely to use R for that analysis. However, because it is one of the most popular analysis methods, most people will find the example easy to follow. Note that the summary function did simple descriptive statistics on our data set, but it also provided a summary of our regression model. Since it can change its output depending upon what you provide it, it is called a generic function. More formally, a generic function has different methods (types of output) for different classes (types) of objects.
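The extractor functions and the idea of classes can be tried with a short sketch. It retypes the q variables from the practice data set; the names myFullModel, myReducedModel, and the values in newdata are hypothetical. One assumption worth noting: the two models are fit to the complete cases only, because anova needs both models to use the same observations.

```r
# The q variables from the practice data set.
mydata <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, 4),
  q2 = c(1, 1, 2, 1, 5, 4, 3, 5),
  q3 = c(5, 4, 4, NA, 2, 5, 4, 5),
  q4 = c(1, 1, 3, 3, 4, 5, 4, 5)
)
complete <- na.omit(mydata)  # both models must use the same rows

myFullModel    <- lm(q4 ~ q1 + q2 + q3, data = complete)
myReducedModel <- lm(q4 ~ q1 + q2,      data = complete)

coefs      <- coef(myFullModel)                    # extract the estimates
comparison <- anova(myReducedModel, myFullModel)   # compare the two models
newdata    <- data.frame(q1 = 3, q2 = 3, q3 = 3)   # a hypothetical new case
predicted  <- predict(myFullModel, newdata)

# The class of each object decides which summary method R runs.
classes <- c(class(mydata), class(myFullModel), class(summary(myFullModel)))

print(coefs)
print(comparison)
print(predicted)
print(classes)
```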

Although R has a very different perspective, we can see it makes some things, like comparing models or making predictions, very easy.

A Popular Example

Let’s examine a more useful feature of R that your current analytic software probably does not provide: p-values in a correlation matrix that are corrected for the number of correlations done. If you performed a one-way analysis of variance with a group factor that had four levels and got a significant result, one thing you might do is perform post-hoc tests to find out which of the groups differed. Most packages provide a wide range of tests for just that circumstance, such as the popular Tukey HSD test. Instead, you could do pairwise t-tests, but each test provides a significance level based on the assumption that you are doing only one. They do not correct for the number of tests you perform.

Unfortunately, this problem with p-values is not specific to the analysis of variance; it applies to any statistical test that you perform repeatedly. R offers a function, p.adjust, which provides an easy way to perform various adjustments. Let us say we have done five chi-squared tests and want to adjust their p-values. You provide them as arguments to the function, but they must first be combined into a vector using the c function. They are sorted in ascending order here, but they don’t need to be:

> p.adjust( c(.001,.01,.02,.049,.049) )

[1] 0.005 0.040 0.060 0.098 0.098

The default method used by the p.adjust function is Holm’s. It multiplies each p-value by the number of tests that remain, in a sequential fashion. So the smallest p-value is multiplied by 5, the next smallest by 4, and so on, and no adjusted value is allowed to be smaller than the one before it, which is why the last two results above are equal. The exact details of the algorithm are beyond our scope, but you can use other methods. See help(p.adjust) in R for the options.
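For readers who want to see the arithmetic, here is a sketch that reproduces the Holm adjustment by hand for the five p-values above and compares it with p.adjust. The variable names are made up for illustration.

```r
p <- c(.001, .01, .02, .049, .049)
m <- length(p)

o        <- order(p)                         # work from smallest to largest
stepwise <- p[o] * (m - seq_along(p) + 1)    # multiply by 5, 4, 3, 2, 1
adjusted <- pmin(1, cummax(stepwise))        # never decrease, never exceed 1
by_hand  <- adjusted[order(o)]               # return to the original order

print(by_hand)
print(p.adjust(p, method = "holm"))
```

The last p-value is multiplied by 1, giving 0.049, but the running maximum raises it to 0.098 so that the adjusted values stay in order.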

When you create a correlation matrix, commercial software will calculate many p-values, but they are not adjusted for the number done. This is just as serious a problem as it would be performing many t-tests, but the commercial software vendors have not yet provided an easy solution.

To make this adjustment, I started with the rcorr function, written by Frank Harrell, author of the Hmisc package for R. I wrote a small function that extracted the correlation p-values he calculated, ran them through p.adjust, and wrote them back. John Fox liked the function, so he added it to his popular R Commander package. That is a graphical user interface to R that looks similar to the SPSS user interface. So now you can use this function by installing R Commander. Note that R uses the pound sign, “#”, to begin comments. They continue until the end of the line. Here’s a three-line program that uses it:

> install.packages("Rcmdr")  # Run this only once.
> library("Rcmdr")  # Run this each time you use R.
> rcorr.adjust( data.frame(q1,q2,q3,q4))

     q1    q2    q3    q4
q1  1.00  0.78 -0.12  0.88
q2  0.78  1.00 -0.27  0.90
q3 -0.12 -0.27  1.00 -0.03
q4  0.88  0.90 -0.03  1.00
n= 7

   q1     q2     q3     q4
q1        0.0385 0.7894 0.0090
q2 0.0385        0.5581 0.0053
q3 0.7894 0.5581        0.9556
q4 0.0090 0.0053 0.9556

Adjusted p-values (Holm’s method)
   q1     q2     q3     q4
q1        0.1542 1.0000 0.0450
q2 0.1542        1.0000 0.0318
q3 1.0000 1.0000        1.0000
q4 0.0450 0.0318 1.0000

We can see that before correction, it appeared we had three significant correlations, but after correction, the q1-q2 p-value moved from 0.0385 to 0.1542, nowhere near the popular 0.05 cutoff for significance.

This is a good example of why R is growing so rapidly. One person wrote the p.adjust function. A second person wrote the rcorr function. A third person combined the two into rcorr.adjust, and then a fourth person included it in the R Commander package. Such loose teamwork is happening every day in the R community.
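The combination of the two ideas can be approximated with base R alone. This is a rough sketch and not the code inside rcorr.adjust: it assumes listwise deletion of missing values and uses cor.test for each pair of variables, then passes the six p-values to p.adjust. The data are the q variables from the practice data set.

```r
mydata <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, 4),
  q2 = c(1, 1, 2, 1, 5, 4, 3, 5),
  q3 = c(5, 4, 4, NA, 2, 5, 4, 5),
  q4 = c(1, 1, 3, 3, 4, 5, 4, 5)
)
complete <- na.omit(mydata)          # listwise deletion leaves 7 rows

pairs <- combn(names(complete), 2)   # every pair of variable names

# One Pearson correlation test per pair; keep only the p-value.
raw_p <- apply(pairs, 2, function(v) {
  cor.test(complete[[v[1]]], complete[[v[2]]])$p.value
})
names(raw_p) <- paste(pairs[1, ], pairs[2, ], sep = "-")

adj_p <- p.adjust(raw_p, method = "holm")

print(round(cbind(raw_p, adj_p), 4))
```

Under these assumptions the q1-q2 row should agree with the rcorr.adjust output shown above.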

Running R from SAS

As a SAS user, the easiest way to run R is through SAS/IML Studio. The command ExportDatasetToR sends your data set to R. Then to execute R code, you bracket it between a “submit /R;” and an “endsubmit;” statement. The following is an example. The R code is indented only to differentiate the R code from the surrounding SAS code:

run ExportDatasetToR("myLib.mydata");
submit / R;
   attach(mydata)   # the first R statement
   install.packages("Rcmdr")  # do this one time
   library("Rcmdr")  # do this every time
   rcorr.adjust( data.frame(q1,q2,q3,q4)) # the last R statement
endsubmit;

For details regarding transferring data or results back and forth between SAS and R, see the SAS/IML Studio 3.4 for SAS/STAT Users (6).

Another way to run R programs from within SAS is to use Philip Rack’s A Bridge to R, available from MineQuest, LLC. That program adds the ability to run R programs from either Base SAS or the compatible World Programming System software. It sends your data from SAS or WPS to R using a SAS transport format data set, which only allows for 8-character variable names. To use it, simply place your R programming statements where our indented example is below and submit your program as usual.

   attach(mydata)   # the first R statement
   install.packages("Rcmdr")  # do this one time
   library("Rcmdr")  # do this every time
   rcorr.adjust( data.frame(q1,q2,q3,q4)) # the last R statement

Finally, if you don’t have either SAS/IML Studio or A Bridge to R, the easiest way to get your data to R is through a transport data set. Save only the variables you need to send to R in their own data set, and eliminate observations with missing values to avoid learning that topic in R. Then write a transport format data set. Here is an example:

LIBNAME myLib 'C:\myRfolder';
LIBNAME To_R xport '\myRfolder\mydata.xpt';
DATA To_R.mydata;
SET  myLib.mydata;
*Keep those with no missing values;
IF N(OF q1-q4)=4;
RUN;

Then in R, you can read the transport SAS data set. This requires the foreign package, which is built into R, and the Hmisc package, which you must install the first time you use it with the install.packages function:

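library("Hmisc")  # provides sasxport.get, which uses foreign
mydata <- sasxport.get("mydata.xpt")  # assumed call; adjust the path to your file
attach(mydata)   # the first R statement
install.packages("Rcmdr")  # do this one time
library("Rcmdr")  # do this every time
rcorr.adjust( data.frame(q1,q2,q3,q4)) # the last R statement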

The X command in SAS can execute any operating system command. You can use it to automatically run R programs that pass data and results back and forth between SAS and R. See (4) for details.
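As a rough sketch of what such a batch R program might contain, the code below reads a file written by the calling software, runs an analysis, and writes the results to a second file for the caller to read back. The file names and the use of CSV are assumptions made for illustration; the first few lines only create a stand-in input file so the sketch runs on its own.

```r
# Hypothetical file names; tempdir() keeps the sketch self-contained.
in_file  <- file.path(tempdir(), "to_r.csv")
out_file <- file.path(tempdir(), "from_r.csv")

# Stand-in for the file your other software would export.
write.csv(data.frame(q1 = c(1, 2, 2, 3, 4, 5, 5, 4),
                     q4 = c(1, 1, 3, 3, 4, 5, 4, 5)),
          in_file, row.names = FALSE)

# The part a batch R program would contain:
dat <- read.csv(in_file)
fit <- lm(q4 ~ q1, data = dat)
results <- data.frame(term     = names(coef(fit)),
                      estimate = unname(coef(fit)))
write.csv(results, out_file, row.names = FALSE)

print(results)
```

Saved in a file such as analysis.R (a hypothetical name), a program like this can be run from the operating system with the command Rscript analysis.R, which is the kind of command the calling software would issue.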

Running R from SPSS

SPSS has included the ability to run R programs at no extra cost since version 16. To use it, you must first install the R plug-in. When you have finished the installation, you can run R programs by including them in an SPSS syntax file and submitting it as usual. In the example below, we use an R plug-in function spssdata.GetDataFromSPSS to bring the data to R from SPSS:

GET FILE='mydata.sav'.
BEGIN PROGRAM R.
mydata <- spssdata.GetDataFromSPSS(
   variables=c("q1 to q4"))
attach(mydata)   # the first R statement
install.packages("Rcmdr")  # do this one time
library("Rcmdr")  # do this every time
rcorr.adjust( data.frame(q1,q2,q3,q4)) # the last R statement
END PROGRAM.

You can also use SPSS to create menus and dialog boxes that you can use to run R functions from the SPSS graphical user interface. The documentation for the R Integration Plug-in is in the SPSS Help system.

Running R from Stata

StataCorp has not yet offered a way to run R programs from within its software. Perhaps the easiest way for a Stata user to use R is to save a Stata file and then import it into R. Here is an example program that does that.

library("Hmisc")  # Contains stata.get function.
mydata <- stata.get("mydata.dta")  # Assumed file name; use your own.
attach(mydata)   # The first R statement.
install.packages("Rcmdr")  # Do this one time.
library("Rcmdr")  # Do this every time.
rcorr.adjust( data.frame(q1,q2,q3,q4)) # The last R statement.

You can also use Stata’s shell command to execute any operating system command. You can use shell to run an R batch program that reads your Stata file, does some analysis, then writes the result to a file that you then read into Stata. That is the approach used by Robert Fornango in a thorough blog post.
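The file exchange itself can be tried without Stata installed. This sketch uses the foreign package, which is distributed with R, to write a small data frame in Stata format and read it back; the file name is hypothetical. Note that read.dta in foreign is documented as handling Stata formats only through version 12, so files saved by newer Stata releases may need to be saved in an older format first.

```r
library("foreign")  # distributed with R; provides read.dta and write.dta

mydata <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, 4),
  q3 = c(5, 4, 4, NA, 2, 5, 4, 5)
)

# Hypothetical file name; tempdir() keeps the sketch self-contained.
dta_file <- file.path(tempdir(), "mydata.dta")

write.dta(mydata, dta_file)      # stands in for a file saved by Stata
fromStata <- read.dta(dta_file)  # read it into R as a data frame

print(summary(fromStata))
```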

Running R from Statistica

Statistica has the ability to run R programs directly within STATISTICA, WebSTATISTICA and STATISTICA Enterprise. The R output can then be returned to the STATISTICA product that ran it, as native spreadsheets and fully editable graphs. You can also add R programs to the menus to make R functions as easy to use as those that come with STATISTICA. For details, see the StatSoft white paper, Integration Options and Features to Leverage Specialized R Functionality in STATISTICA and Statistica Enterprise Server Solutions (7).

Running R from JMP

JMP offers extensive integration of R using both JMP Scripting Language and its graphical user interface. The JMP web site offers several documents and video tutorials on the subject. Here is an example of a JSL program that opens a data table, initializes the connection to R, sends mydata, installs a package then loads it from the library, runs correlations with adjusted p-values and then terminates the connection.

mydata = Open("");
R Init();
R Send(mydata);
R Submit("attach(mydata)");
R Submit("install.packages('Rcmdr')  # Do this one time.");
R Submit("library('Rcmdr')  # Do this every time.");
R Submit("rcorr.adjust( data.frame(q1,q2,q3,q4))");
R Term();

Where to Learn More

SAS and SPSS users can learn more about R by reading the free version of R for SAS and SPSS Users (8), by Robert A. Muenchen. That document focuses on data management and basic statistics, providing much of its explanation in the form of program comments. The Springer edition of R for SAS and SPSS Users (4) offers close to 700 pages of examples, explanations, and coverage of graphics.

SAS users will also enjoy the book SAS and R (9), by Ken Kleinman and Nicholas J. Horton. That book offers brief descriptions of how to use R for a wide range of tasks.

Stata users can learn more by reading R for Stata Users (10), by Robert A. Muenchen and Joseph M. Hilbe. The example programs and data sets shown in this paper are also downloadable from that site.


We have discussed what R is, the basics of how it works and how you can use it as an adjunct to your main data analysis software. Given that you already have a statistics package at your disposal, the amount of R you need to learn to use an occasional R function is fairly minimal. The return on this investment of effort can be great: access to many thousands of additional analytical and graphical methods. Once you see what R offers, perhaps you will become interested in learning more about it.


My thanks go out to Phil Rack, who provided advice and testing of the example program using his Bridge to R.


(1) R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.


(3) Wrathematics, How Much of R is Written in R?, 2011

(4) R. A. Muenchen, R for SAS and SPSS Users, 2nd Edition, Springer, Berlin, Heidelberg, New York, 2011


(6) SAS IML/Studio 3.4 for SAS/STAT Users, SAS Institute Inc., Cary, NC, 2011

(7) Integration Options and Features to Leverage Specialized R Functionality in STATISTICA and Statistica Enterprise Server Solutions, StatSoft Inc.

(8) R. A. Muenchen, R for SAS and SPSS Users, Free early version

(9) K. Kleinman, N. J. Horton, SAS and R, CRC Press, 2009

(10) R. A. Muenchen, J. M. Hilbe, R for Stata Users, Springer, Berlin, Heidelberg, New York, 2010

Copyright 2012, Robert A. Muenchen
