Understanding `scale` in R

r scale transformation heatmap

Solution 1

log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, calculates the mean and standard deviation of the entire vector, then "scales" each element by subtracting the mean and dividing by the standard deviation. (With scale(x, scale=FALSE), it only subtracts the mean and does not divide by the standard deviation.) For a matrix or data frame, this is done separately for each column.

Note that the following two approaches give the same values (scale returns them as a one-column matrix):

   set.seed(1)
   x <- runif(7)

   # Manually scaling
   (x - mean(x)) / sd(x)

   scale(x)
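
As a quick check of the claim above, here is a minimal sketch that compares the two results programmatically. The variable names (manual, auto, same_values, and so on) are my own and not part of the original answer.

```r
set.seed(1)
x <- runif(7)

manual <- (x - mean(x)) / sd(x)
auto <- scale(x)

# scale() returns a one-column matrix that carries the centering and
# scaling values as attributes, so drop it to a plain vector before comparing.
same_values <- isTRUE(all.equal(manual, as.vector(auto)))
print(same_values)

# The mean and standard deviation that were used are stored on the result.
used_center <- attr(auto, "scaled:center")
used_scale <- attr(auto, "scaled:scale")
```

The attributes are handy if you later need to apply the same centering and scaling to new data.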

Solution 2

It provides nothing more than a standardization of the data. The values it creates go by several different names, one of them being z-scores ("Z" because the standard normal distribution is also known as the "Z distribution").

More can be found here:

http://en.wikipedia.org/wiki/Standard_score
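
To illustrate what standardization means in practice, the sketch below uses made-up skewed data (an assumption, not from the question) and checks that the z-scores have mean 0 and standard deviation 1, and that the original values can be recovered from them.

```r
set.seed(42)
x <- rexp(100, rate = 0.1)  # made-up, positively skewed data

z <- as.vector(scale(x))

z_mean <- mean(z)  # approximately 0, up to floating point error
z_sd <- sd(z)      # approximately 1, up to floating point error

# A z-score counts standard deviations from the mean, so the
# transformation can be undone with the original mean and sd.
x_back <- z * sd(x) + mean(x)
```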

Solution 3

This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.

To refine the response from Ricardo Saporta a little:
Scaling is not always done using the standard deviation, at least not in version 3.6.1 of R. When centering is turned off, the divisor is the root mean square instead. I base this on "Becker, R. (2018). The new S language. CRC Press." and my own experimentation.

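X <- c(3, 7, 8, 12, 20)  # example data (an assumption; any numeric vector works)
X.man.scaled <- X / sqrt(sum(X^2) / (length(X) - 1))  # divide by the root mean square
X.aut.scaled <- scale(X, center = FALSE)  # same values, returned as a one-column matrix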

The results of these two computations are exactly the same; I show it without centering for simplicity.
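
For anyone who wants to see both cases side by side, here is a minimal sketch with made-up data (the vector and the variable names are assumptions of mine). It checks that scale divides by the root mean square when centering is off, and by the standard deviation when centering is on.

```r
X <- c(3, 7, 8, 12, 20)  # made-up example data

# Root mean square with an n - 1 denominator.
rms <- sqrt(sum(X^2) / (length(X) - 1))

# Without centering, scale() divides by the root mean square.
uncentered <- as.vector(scale(X, center = FALSE))
matches_rms <- isTRUE(all.equal(uncentered, X / rms))

# With centering (the default), the divisor is the standard deviation.
centered <- as.vector(scale(X))
matches_sd <- isTRUE(all.equal(centered, (X - mean(X)) / sd(X)))

# The two divisors differ whenever the mean is not zero.
divisors_differ <- !isTRUE(all.equal(rms, sd(X)))
```

Once the data are centered, the root mean square of the centered values is the standard deviation, which is why the default settings agree with the manual formula from Solution 1.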

I would respond in a comment but did not have enough reputation.

Solution 4

I thought I would contribute a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. You may even want to generate a composite score based on the 3 tests for each observation. Your data could look like this:

student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)

Obviously it would not make sense to compare the means of these 3 scores, as the scales of the scores are vastly different. By scaling them, however, you get more comparable scoring units:

z <- scale(df[,2:4],center=TRUE,scale=TRUE)

You could then use these scaled results to create a composite score. For instance, average the values and assign a grade based on the percentiles of this average. Hope this helped!
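
The composite score idea could look roughly like the sketch below. The quartile cutoffs and the letter labels are a hypothetical grading rule chosen for illustration, and the column names composite and grade are my own.

```r
student_id <- seq(1, 10)
math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
english <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
df <- data.frame(student_id, math, science, english)

# Standardize each test so all three are in standard deviation units.
z <- scale(df[, 2:4])

# Composite score: the mean of the three z-scores for each student.
df$composite <- rowMeans(z)

# Hypothetical grading rule: split the composite score at its quartiles.
cutoffs <- quantile(df$composite, probs = c(0.25, 0.5, 0.75))
df$grade <- cut(df$composite,
                breaks = c(-Inf, cutoffs, Inf),
                labels = c("D", "C", "B", "A"))

# Show students from highest to lowest composite score.
print(df[order(-df$composite), c("student_id", "composite", "grade")])
```

Averaging the raw scores instead would let math dominate the composite simply because its numbers are larger.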

Note: I borrowed this example from the book "R in Action". It's a great book! I would definitely recommend it.


Jen


Updated on July 05, 2022

Comments

Jen almost 2 years

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different. Why? What does it mean to scale my data, versus log transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?

Thank you for any help! I've read the definitions but they are whooshing over my head.
Jen over 10 years

Thanks for the answer! But what is the significance of scale()? What could my reasoning be for using it (it makes the data look nicer, etc.)? I'm just trying to understand the 'point' of scale(). Thanks!
Ricardo Saporta over 10 years

scale makes more sense when you have multiple variables measured on different scales, e.g., one variable is of order of magnitude 100 while another is of order of magnitude 1000000.
Ricardo Saporta over 10 years

@Jen: Another (very loose) way to think about it: when using scale, you are not changing the data; rather, you are changing the scale (the axis values when plotting). Think of grabbing the axis at the two ends and stretching or compressing it. That is scale. In contrast, log actually changes the data. The impact of log is "stronger" for larger values and smaller for smaller values.
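
The stretching analogy can be checked numerically. This is only a sketch with made-up skewed data (not the asker's mydata), using correlation with the original values as a rough indicator of whether the shape of the data was changed.

```r
set.seed(1)
x <- rexp(200)  # made-up, positively skewed, strictly positive data

scaled <- as.vector(scale(x))
logged <- log(x)

# scale() is a linear map, so the result correlates perfectly with the
# original and the relative spacing of the points is untouched.
cor_scaled <- cor(x, scaled)

# log() is nonlinear: it compresses large values more than small ones,
# so the correlation with the original drops below 1.
cor_logged <- cor(x, logged)

print(c(scaled = cor_scaled, logged = cor_logged))
```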
Jen over 10 years

@Ricardo Saporta, oh okay, thanks, that makes sense! Especially the idea of looking at multiple variables with different scales, that just clicked for me! Thanks a lot!
sherek_66 almost 5 years

@RicardoSaporta those formulas are not the same! I just checked them.
digestivee about 4 years

From the documentation of scale : "The value of scale determines how column scaling is performed (after centering). If scale is a numeric-alike vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from scale. If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise. If scale is FALSE, no scaling is done." This implies that your formula is correct because you didn't center first