R Function of the Day: tapply

The R Function of the Day series describes in plain language how certain R functions work, using simple examples that you can apply to gain insight into your own data.

Today, I will discuss the tapply function.

What situation is tapply useful in?

In statistics, one of the most basic activities we do is computing summaries of variables. These summaries might be as simple as an average, or more complex. Let’s look at some simple examples.

When you read the results of a medical trial, you will see things such as “The average age of subjects in this trial was 55 years in the treatment group, and 54 years in the control group.”

As another example, let’s look at one from the world of baseball.

Batting Leaders per Team

Team	Player	Batting Average
Minnesota Twins	Joe Mauer	.374
Seattle Mariners	Ichiro Suzuki	.355
Boston Red Sox	Kevin Youkilis	.309
…	…	…

These two examples have a lot in common, even if that is not obvious at first reading. In the first example, we have a dataset from a medical trial. We want to break up the dataset into two groups, treatment and control, and then compute the sample average for age within each group.

In the second example, we want to break up the dataset into 30 groups, one for each MLB team, and then compute the maximum batting average within each group.

So what is in common?

In each case,

We have a dataset that can be broken up into groups
We want to break it up into groups
Within each group, we want to apply a function

The following table summarizes the situation.

Example	Group Variable	Summary Variable	Function
Medical Example	Treatment	age	mean
Baseball Example	Team	batting average	max

The tapply function can solve both of these problems for us!
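
To make the "break up, then apply" idea concrete, here is a minimal sketch that performs the two steps by hand with split and sapply, then repeats them with a single tapply call. The vectors `age` and `group` are toy data invented for this illustration; they are not the data used elsewhere in this post.

```r
## toy data, invented for illustration only
age   <- c(50, 60, 70, 40, 50, 60)
group <- factor(c("Treatment", "Treatment", "Treatment",
                  "Control", "Control", "Control"),
                levels = c("Treatment", "Control"))

## step 1: break the vector up into groups
pieces <- split(age, group)

## step 2: apply a function within each group
by.hand <- sapply(pieces, mean)

## tapply does both steps in one call
one.call <- tapply(age, group, mean)

by.hand
one.call
```

Both approaches give a mean of 60 for Treatment and 50 for Control. The only difference is the container: sapply returns a named vector here, while tapply returns a one-dimensional array labelled with the group names.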

How do I use tapply?

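The tapply function is simple to use. First, we will generate some data. Because the data are drawn at random and no seed is set, your numbers will differ from the output shown here.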


> ## generate data for medical example
> medical.example <-
    data.frame(patient = 1:100,
               age = rnorm(100, mean = 60, sd = 12),
               treatment = gl(2, 50,
                 labels = c("Treatment", "Control")))
> summary(medical.example)
    patient            age             treatment 
 Min.   :  1.00   Min.   : 29.40   Treatment:50  
 1st Qu.: 25.75   1st Qu.: 54.31   Control  :50  
 Median : 50.50   Median : 61.24                 
 Mean   : 50.50   Mean   : 61.29                 
 3rd Qu.: 75.25   3rd Qu.: 66.22                 
 Max.   :100.00   Max.   :102.47                  
> ## generate data for baseball example
> ## 5 teams with 5 players per team
> 
> baseball.example <-
    data.frame(team = gl(5, 5,
                 labels = paste("Team", LETTERS[1:5])),
               player = sample(letters, 25),
               batting.average = runif(25, .200, .400))
> summary(baseball.example)
     team       player   batting.average 
 Team A:5   a      : 1   Min.   :0.2172  
 Team B:5   c      : 1   1st Qu.:0.2553  
 Team C:5   d      : 1   Median :0.2854  
 Team D:5   e      : 1   Mean   :0.2887  
 Team E:5   f      : 1   3rd Qu.:0.3013  
            g      : 1   Max.   :0.3859  
            (Other):19

Now that we have some sample data, using tapply is straightforward. The first comment below shows the general form of the call; the calls that follow apply it to the data we defined above.


> ## Generic Example
> ## tapply(Summary Variable, Group Variable, Function)
> 
> ## Medical Example
> tapply(medical.example$age, medical.example$treatment, mean)
Treatment   Control 
 62.26883  60.30371  
> ## Baseball Example
> tapply(baseball.example$batting.average, baseball.example$team,
         max)
   Team A    Team B    Team C    Team D    Team E 
0.3784396 0.3012680 0.3488655 0.2962828 0.3858841
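
Any arguments given after the function are passed on to that function. As a small sketch, assuming a made-up `scores` vector with one missing value (again toy data, not the baseball data above), this is how `na.rm = TRUE` reaches max:

```r
## toy data with a missing value, invented for illustration only
scores <- c(0.250, NA, 0.300, 0.280, 0.320)
team   <- factor(c("Team A", "Team A", "Team A", "Team B", "Team B"))

## without na.rm the NA propagates into Team A's summary
with.na <- tapply(scores, team, max)

## arguments after the function are handed to it, here na.rm = TRUE
without.na <- tapply(scores, team, max, na.rm = TRUE)

with.na
without.na
```

In the first result Team A's maximum is NA; in the second it is 0.3, because max drops the missing value before comparing.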

Summary of tapply

The tapply function is useful when we need to break up a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form. You can even specify multiple factors as the grouping variable, for example treatment and sex, or team and handedness.
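
For the multiple-factor case, a minimal sketch might look like the following. The `age`, `treatment`, and `sex` vectors are toy data invented for this illustration.

```r
## toy data, invented for illustration only
age       <- c(50, 60, 70, 80, 40, 50, 60, 70)
treatment <- factor(rep(c("Treatment", "Control"), each = 4),
                    levels = c("Treatment", "Control"))
sex       <- factor(rep(c("F", "M"), times = 4))

## pass the grouping factors together as a list
by.both <- tapply(age, list(treatment, sex), mean)

by.both
```

With two factors the result is a matrix, with one row per level of the first factor and one column per level of the second, so `by.both["Treatment", "F"]` picks out the mean age for that combination.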

This entry was posted on Sunday, September 20th, 2009 at 8:19 pm.

6 Responses to R Function of the Day: tapply

R Function of the Day: rle « Blogistic Reflections says:

September 22, 2009 at 8:13 pm

[…] in this case the longest run of heads is 9 and the longest run of tails is 8. The tapply function was discussed in a previous R Function of the Day […]

learnr says:

September 30, 2009 at 4:08 am

I would also suggest looking at the plyr and reshape packages on CRAN.

> library(plyr)
> ddply(baseball.example, .(team), summarise, max = max(batting.average))
    team       max
1 Team A 0.3850099
2 Team B 0.3899625
3 Team C 0.3784054
4 Team D 0.3616533
5 Team E 0.3554805

> library(reshape)
> recast(baseball.example, . ~team, max)
Using team, player as id variables
  value    Team A    Team B    Team C    Team D    Team E
1 (all) 0.3850099 0.3899625 0.3784054 0.3616533 0.3554805

erikr says:

September 30, 2009 at 8:39 am

That’s great! I did take a look at Hadley’s plyr package when it was in its infancy, maybe about a year ago. I think it’s come a long way since then, and it is definitely on my list of things to check out.

By the way, Hadley seems to have a ton of interesting projects on his github page at http://github.com/hadley. I think a lot of them are not very mature yet, but he has some really interesting ideas for extending R!

Reinaldo Parlor says:

April 5, 2010 at 8:49 pm

Baseball is the most interesting game on the planet. I’m excited about the new season. Should be fantastic.

Simon Kiss says:

September 21, 2010 at 6:03 pm

OK, so how do you do two grouping variables, not just one?

- erikr says:
  
  September 21, 2010 at 6:33 pm
  
  Simply pass a list as the second argument. For example,
  
  tapply(df$var1, list(df$var2, df$var3), mean)
  
  You might also be interested in the plyr package for breaking up data into subsets and summarizing them.
  
