Reshaping data frame in R
Solution 1
reshape always seems tricky to me too, but it always seems to work with a little trial and error. Here's what I ended up finding:
> library(reshape)
> x
unique_id seq response detailed.name treatment
1 a N1 123.23 dN1 T1
2 a N2 231.12 dN2 T1
3 a N3 231.23 dN3 T1
4 b N1 343.23 dN1 T2
5 b N2 281.13 dN2 T2
6 b N3 901.23 dN3 T2
> x2 <- melt(x, c("seq", "detailed.name", "treatment"), "response")
> x2
seq detailed.name treatment variable value
1 N1 dN1 T1 response 123.23
2 N2 dN2 T1 response 231.12
3 N3 dN3 T1 response 231.23
4 N1 dN1 T2 response 343.23
5 N2 dN2 T2 response 281.13
6 N3 dN3 T2 response 901.23
> cast(x2, seq + detailed.name ~ treatment)
seq detailed.name T1 T2
1 N1 dN1 123.23 343.23
2 N2 dN2 231.12 281.13
3 N3 dN3 231.23 901.23
Your original data was already in long format, but not in the long format that melt/cast uses, so I re-melted it. The second argument (id.vars) is the list of columns to keep as identifiers, i.e. the things not to melt. The third argument (measure.vars) is the list of columns holding the values that vary. Columns named in neither, like unique_id here, are dropped.
Then, cast uses a formula. Left of the tilde are the columns that stay as they are, and right of the tilde are the columns whose values become the new column headers, with the value column filling the cells.
More or less...!
Solution 2
Building on Harlan's answer: the re-melting step can be avoided if the data is already in long format and the column holding the values is specified in the cast call.
> x <- read.table(textConnection(" unique_id seq response detailed.name treatment
+ 1 a N1 123.23 dN1 T1
+ 2 a N2 231.12 dN2 T1
+ 3 a N3 231.23 dN3 T1
+ 4 b N1 343.23 dN1 T2
+ 5 b N2 281.13 dN2 T2
+ 6 b N3 901.23 dN3 T2"))
>
> cast(x, seq + detailed.name ~ treatment, value = "response")
seq detailed.name T1 T2
1 N1 dN1 123.23 343.23
2 N2 dN2 231.12 281.13
3 N3 dN3 231.23 901.23
Solution 3
Another option would be to use spread from tidyr:
library(tidyr)
Wide1 <- spread(x[-1], treatment, response)
Wide1
# seq detailed.name T1 T2
#1 N1 dN1 123.23 343.23
#2 N2 dN2 231.12 281.13
#3 N3 dN3 231.23 901.23
The opposite action is performed by gather:
gather(Wide1, treatment, response, T1:T2)
# seq detailed.name treatment response
#1 N1 dN1 T1 123.23
#2 N2 dN2 T1 231.12
#3 N3 dN3 T1 231.23
#4 N1 dN1 T2 343.23
#5 N2 dN2 T2 281.13
#6 N3 dN3 T2 901.23
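In newer tidyr releases (1.0.0 and later), spread and gather are superseded by pivot_wider and pivot_longer. Here is a minimal sketch of the same round trip, assuming such a version is installed. The sample data is rebuilt inline so the block runs on its own, and the names dat, wide and long are only illustrative:

```r
library(tidyr)

# Sample data from the question, rebuilt inline (unique_id left out,
# matching the x[-1] used with spread() above).
dat <- data.frame(
  seq           = c("N1", "N2", "N3", "N1", "N2", "N3"),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = c("dN1", "dN2", "dN3", "dN1", "dN2", "dN3"),
  treatment     = c("T1", "T1", "T1", "T2", "T2", "T2")
)

# Long -> wide: one column per treatment level, filled from response.
wide <- pivot_wider(dat, names_from = treatment, values_from = response)

# Wide -> long: stack the T1 and T2 columns back into two columns.
long <- pivot_longer(wide, cols = c(T1, T2),
                     names_to = "treatment", values_to = "response")
```

Any column not named in names_from or values_from is treated as an identifier, which is why seq and detailed.name stay in place.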
Also, there is dcast.data.table from data.table:
library(data.table)
dcast.data.table(setDT(x), seq + detailed.name~treatment,
value.var='response')
# seq detailed.name T1 T2
#1: N1 dN1 123.23 343.23
#2: N2 dN2 231.12 281.13
#3: N3 dN3 231.23 901.23
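In current data.table versions, dcast is a generic with a data.table method, so plain dcast can be called on a data.table, and melt goes back the other way. Here is a small sketch under that assumption, with the sample data rebuilt inline (dt, wide and long are illustrative names):

```r
library(data.table)

# Sample data from the question as a data.table (unique_id left out).
dt <- data.table(
  seq           = c("N1", "N2", "N3", "N1", "N2", "N3"),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = c("dN1", "dN2", "dN3", "dN1", "dN2", "dN3"),
  treatment     = c("T1", "T1", "T1", "T2", "T2", "T2")
)

# Long -> wide: dcast dispatches to the data.table method.
wide <- dcast(dt, seq + detailed.name ~ treatment, value.var = "response")

# Wide -> long: melt the treatment columns back into key/value form.
long <- melt(wide, id.vars = c("seq", "detailed.name"),
             variable.name = "treatment", value.name = "response")
```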
data
x <- structure(list(unique_id = structure(c(1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), seq = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("N1", "N2", "N3"), class = "factor"),
response = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23
), detailed.name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("dN1",
"dN2", "dN3"), class = "factor"), treatment = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("T1", "T2"), class = "factor")), .Names =
c("unique_id", "seq", "response", "detailed.name", "treatment"), class =
"data.frame", row.names = c(NA, -6L))
Solution 4
You can also use the reshape function in the stats package. I don't have your sample dataset, but it will look something like this (dropping unique_id first, since otherwise it would be spread across the treatments as well):
reshape(x[-1], idvar=c("seq","detailed.name"), timevar="treatment", direction="wide")
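For reference, here is a self-contained sketch of the base R approach run against the sample data from the question, with unique_id left out. The object names dat and wide are illustrative, and the final renaming step is optional:

```r
# Sample data from the question, rebuilt inline.
dat <- data.frame(
  unique_id     = c("a", "a", "a", "b", "b", "b"),
  seq           = c("N1", "N2", "N3", "N1", "N2", "N3"),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = c("dN1", "dN2", "dN3", "dN1", "dN2", "dN3"),
  treatment     = c("T1", "T1", "T1", "T2", "T2", "T2")
)

# Drop unique_id, then widen: one response column per treatment.
wide <- reshape(dat[, -1],
                idvar     = c("seq", "detailed.name"),
                timevar   = "treatment",
                direction = "wide")

# reshape() prefixes the new columns with the measured variable's name
# (response.T1, response.T2); strip the prefix to get plain T1 / T2.
names(wide) <- sub("^response\\.", "", names(wide))
```

Unlike cast or dcast, reshape needs no extra package, at the cost of a more verbose call and the column-name cleanup.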
Solution 5
If you want to get the same results using reshape2, which is a faster and more memory-efficient rewrite of the reshape package, then the following will work.
The main change is the use of the dcast function when you want to cast with a data.frame as output. This replaces the cast function of reshape.
library(reshape2)
x = read.table(text = "unique_id seq response detailed.name treatment
a N1 123.23 dN1 T1
a N2 231.12 dN2 T1
a N3 231.23 dN3 T1
b N1 343.23 dN1 T2
b N2 281.13 dN2 T2
b N3 901.23 dN3 T2",
sep = "", header = TRUE)
x
y <- dcast(x, seq + detailed.name ~ treatment, value.var = "response")
y
# seq detailed.name T1 T2
# 1 N1 dN1 123.23 343.23
# 2 N2 dN2 231.12 281.13
# 3 N3 dN3 231.23 901.23
# EDIT to show how to return to the original data set:
melt(y, id.vars=c('seq', 'detailed.name'), variable.name='T', value.name='response')
# seq detailed.name T response
# 1 N1 dN1 T1 123.23
# 2 N2 dN2 T1 231.12
# 3 N3 dN3 T1 231.23
# 4 N1 dN1 T2 343.23
# 5 N2 dN2 T2 281.13
# 6 N3 dN3 T2 901.23
Vince
Updated on March 05, 2020
Comments
-
Vince about 4 years
I'm running into difficulties reshaping a large dataframe. And I've been relatively fortunate in avoiding reshaping problems in the past, which also means I'm terrible at it.
My current dataframe looks something like this:
unique_id  seq  response  detailed.name  treatment
a          N1   123.23    descr. of N1   T1
a          N2   231.12    descr. of N2   T1
a          N3   231.23    descr. of N3   T1
...
b          N1   343.23    descr. of N1   T2
b          N2   281.13    descr. of N2   T2
b          N3   901.23    descr. of N3   T2
...
And I'd like:
seq  detailed.name  T1      T2
N1   descr. of N1   123.23  343.23
N2   descr. of N2   231.12  281.13
N3   descr. of N3   231.23  901.23
I've looked into the reshape package, but I'm not sure how I can convert the treatment factors into individual column names.
Thanks!
Edit: I tried running this on my local machine (4 GB dual-core 3.06 GHz iMac) and it keeps failing with:
> d.tmp.2 <- cast(d.tmp, `SEQ_ID` + `GENE_INFO` ~ treatments)
Aggregation requires fun.aggregate: length used as default
R(5751) malloc: *** mmap(size=647168) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
I'll try running this on one of our bigger machines when I get a chance.
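The message "Aggregation requires fun.aggregate: length used as default" is what cast prints when some combination of the formula variables matches more than one row, so it falls back to counting rows instead of returning the values. Here is a minimal base R sketch for spotting such rows before casting. The toy data frame d and its column names are made up for illustration and are not the real d.tmp:

```r
# Toy data (hypothetical): the first two rows share the same
# seq / detailed.name / treatment combination.
d <- data.frame(
  seq           = c("N1", "N1", "N2"),
  detailed.name = c("dN1", "dN1", "dN2"),
  treatment     = c("T1", "T1", "T1"),
  response      = c(1.0, 2.0, 3.0)
)

# The columns that appear in the cast formula.
keys <- d[c("seq", "detailed.name", "treatment")]

# Rows whose key combination occurs more than once (in either direction).
dups <- d[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]

# Number of rows per key combination; any count above 1 would
# trigger aggregation in cast.
counts <- aggregate(response ~ seq + detailed.name + treatment,
                    data = d, FUN = length)
```

If dups is empty, each cell of the wide table gets exactly one value and no aggregation function is needed.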
-
Matt Parker over 14 years
Man, you're fast, Harlan. Vince, I always just try to remember that whatever goes on the right side of the "~" in cast() will end up as columns with values in your final data frame.
-
mnel over 11 years
The package reshape2 is a rewrite of reshape to be faster and more memory efficient. It is not backwards compatible with reshape, hence the new package, not a new version of the old package.
-
andilabs over 10 years
@Mark Miller: what was the biggest data frame you used this tool for?