Reshaping wide to long with multiple values columns

r reshape reshape2 melt

55,037

Solution 1

reshape does this with the appropriate arguments.

varying lists the columns which exist in the wide format, but are split into multiple rows in the long format. v.names is the long format equivalents. Between the two, a mapping is created.

From ?reshape:

Also, guessing is not attempted if v.names is given explicitly. Notice that the order of variables in varying is like x.1,y.1,x.2,y.2.

Given these varying and v.names arguments, reshape is smart enough to see that I've specified that the index is before the dot here (i.e., order 1.x, 1.y, 2.x, 2.y). Note that the original data has the columns in this order, so we can specify varying=2:5 for this example data, but that is not safe in general.

Given the values of times and v.names, reshape splits the varying columns on a . character (the default sep argument) to create the columns in the output.

times specifies values that are to be used in the created var column, and v.names are pasted onto these values to get column names in the wide format for mapping to the result.

Finally, idvar is specified to be the sbj column, which identifies individual records in the wide format (thanks @thelatemail).

reshape(dw, direction='long', 
        varying=c('f1.avg', 'f1.sd', 'f2.avg', 'f2.sd'), 
        timevar='var',
        times=c('f1', 'f2'),
        v.names=c('avg', 'sd'),
        idvar='sbj')

##      sbj blabla var avg sd
## A.f1   A     bA  f1  10  6
## B.f1   B     bB  f1  12  5
## C.f1   C     bC  f1  20  7
## D.f1   D     bD  f1  22  8
## A.f2   A     bA  f2  50 10
## B.f2   B     bB  f2  70 11
## C.f2   C     bC  f2  20  8
## D.f2   D     bD  f2  22  9

Solution 2

Another option using Hadley's new tidyr package.

library(tidyr)
library(dplyr)

dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
   A   10    6     50     10      bA
   B   12    5     70     11      bB
   C   20    7     20     8       bC
   D   22    8     22     9       bD
 ')

dw %>% 
  gather(v, value, f1.avg:f2.sd) %>% 
  separate(v, c("var", "col")) %>% 
  arrange(sbj) %>% 
  spread(col, value)

Solution 3

melt from the >=1.9.6 version of data.table, does this by specifying the column index in measure.vars as a list.

 melt(setDT(dw), measure.vars=list(c(2,4), c(3,5)), 
     variable.name='var', value.name=c('avg', 'sd'))[, 
      var:= paste0('f',var)][order(sbj)]
#   sbj blabla var avg sd
#1:   A     bA  f1  10  6
#2:   A     bA  f2  50 10
#3:   B     bB  f1  12  5
#4:   B     bB  f2  70 11
#5:   C     bC  f1  20  7
#6:   C     bC  f2  20  8
#7:   D     bD  f1  22  8
#8:   D     bD  f2  22  9

Or you could use the new patterns function:

melt(setDT(dw), 
     measure = patterns("avg", "sd"),
     variable.name = 'var', value.name = c('avg', 'sd'))
#    sbj blabla var avg sd
# 1:   A     bA   1  10  6
# 2:   B     bB   1  12  5
# 3:   C     bC   1  20  7
# 4:   D     bD   1  22  8
# 5:   A     bA   2  50 10
# 6:   B     bB   2  70 11
# 7:   C     bC   2  20  8
# 8:   D     bD   2  22  9

Solution 4

This seems to do what you want except that the f is removed from elements in time.

reshape(dw, idvar = "sbj", varying = list(c(2,4),c(3,5)), v.names = c("ave", "sd"), direction = "long")

    sbj blabla time ave sd
A.1   A     bA    1  10  6
B.1   B     bB    1  12  5
C.1   C     bC    1  20  7
D.1   D     bD    1  22  8
A.2   A     bA    2  50 10
B.2   B     bB    2  70 11
C.2   C     bC    2  20  8
D.2   D     bD    2  22  9

Solution 5

To add to the options available here, you can also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
merged.stack(dw, var.stubs = c("avg", "sd"), sep = "var.stubs", atStart = FALSE)
#    sbj blabla .time_1 avg sd
# 1:   A     bA     f1.  10  6
# 2:   A     bA     f2.  50 10
# 3:   B     bB     f1.  12  5
# 4:   B     bB     f2.  70 11
# 5:   C     bC     f1.  20  7
# 6:   C     bC     f2.  20  8
# 7:   D     bD     f1.  22  8
# 8:   D     bD     f2.  22  9

You can also do a little more cleanup on the ".time_1" variable, like this.

merged.stack(dw, var.stubs = c("avg", "sd"), 
             sep = "var.stubs", atStart = FALSE)[, .time_1 := sub(
               ".", "", .time_1, fixed = TRUE)][]
#    sbj blabla .time_1 avg sd
# 1:   A     bA      f1  10  6
# 2:   A     bA      f2  50 10
# 3:   B     bB      f1  12  5
# 4:   B     bB      f2  70 11
# 5:   C     bC      f1  20  7
# 6:   C     bC      f2  20  8
# 7:   D     bD      f1  22  8
# 8:   D     bD      f2  22  9

You would note the use of the atStart = FALSE argument. This is because your names are in a little bit of a different order than reshape-related functions seem to like. In general, the "stub" is expected to come first, and then the "times", like this:

dw2 <- dw
setnames(dw2, gsub("(.*)\\.(.*)", "\\2.\\1", names(dw2)))
names(dw2)
# [1] "sbj"    "avg.f1" "sd.f1"  "avg.f2" "sd.f2"  "blabla"

If the names were in that format, then both base R's reshape and merged.stack benefit from more direct syntax:

merged.stack(dw2, var.stubs = c("avg", "sd"), sep = ".")
reshape(dw2, idvar = c("sbj", "blabla"), varying = 2:5, 
        sep = ".", direction = "long")

View more solutions

55,037

Author by

Vasily A

Updated on July 14, 2022

Comments

Vasily A almost 2 years

I need to reshape my wide table into long format but keeping multiple fields for each record, for example:

dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
   A   10    6     50     10      bA
   B   12    5     70     11      bB
   C   20    7     20     8       bC
   D   22    8     22     9       bD
 ')

# Now I want to melt this table, keeping both AVG and SD as separate fields for each measurement, to get something like this:

 #    sbj var avg  sd  blabla
 #     A   f1  10  6     bA
 #     A   f2  50  10    bA
 #     B   f1  12  5     bB
 #     B   f2  70  11    bB
 #     C   f1  20  7     bC
 #     C   f2  20  8     bC
 #     D   f1  22  8     bD
 #     D   f2  22  9     bD

I have basic knowledge of using melt and reshape, but it is not obvious for me how to apply such reshaping in my case.

Vasily A almost 10 years

aha, you do almost the same call as in previous answer except for using idvar instead of timevar and times - I will sort out the difference... Thanks!
Mark Miller almost 10 years

@VasilyA The structure of varying is also very different between the answers. Indeed, the only things they seem to have in common is they both use the same v.names and both use direction="long".
Vasily A almost 10 years

indeed, now I see... thanks for pointing out!
thelatemail almost 10 years

Adding idvar='sbj' will avoid the adding of the id column to the result too.
IRTFM almost 10 years

The interesting thing in comparing the two correct answers is that this one uses the structure of the list argument to 'varying', while the Lundberg answer uses 'times' and 'v.names' to convey some of the structure. I've never gotten those options straightened out in my head and usually use trial and error.
Brian D over 7 years

surprisingly (at least to me), the order of the columns matters. If you reorder the columns dw = dw[,c(1,4,3,2,5,6)] and then run this answer, the result is incorrect.
Matthew Lundberg over 7 years

@BrianD So it does, thanks! I was tacitly using an assumption of the data (even without realizing it) and will fix the answer shortly.
Brian D over 7 years

Helpful info about tidyr and how the above code works here: blog.rstudio.org/2014/07/22/introducing-tidyr
Robin over 5 years

To clarify, the code for this answer does not remove the f from the column names to create the elements. Without, the times argument the column names aren't even considered. When creating the dataframe dw, I can replace the "f1" with "f1.alpha" and "f2" with "f2.beta", run the code in this answer, and the "time" column will be the same as in this answer. It won't be "1.alpha" or "2.beta". If someone wants to keep the column names as elements, the column names need to be specified in the times argument (in the same order as in the varying argument).
Robin over 5 years

To make this answer more robust at handling many pairs of "avg" and "sd" columns replace the varying argument with varying=list(grep("avg", colnames(dw), value=T), grep("sd", colnames(dw), value=T))