R: adjust scale color gradient in ggplot2


As per my comment and your response, I think the problem is that you have some outliers that are forcing the scale to expand to accommodate them.

From your summary(), 75% of your values of NUM_PICKUPS are between 10 and 59. The remaining 25% then range up to 14243, three orders of magnitude above the minimum!

To summarise, the range of NUM_PICKUPS is so wide that the colour scale cannot show any variation among values below about 1,000.

The solution you choose will depend on your data and what you want to do with it. One option is to show only the values up to the third quartile and treat the highest 25% as outliers. You could do this without altering the data by manually setting the limits of the colour scale (by default, points outside the limits are drawn in grey) with, I think:

g1 + scale_colour_gradient(limits = c(0, 60))
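
If the grey outliers are a problem, one possible variation is to clamp them to the top colour instead of dropping them from the scale. The sketch below is only an illustration: `demo` is a made-up stand-in for `mydata`, the limits are taken from the data's quantiles rather than hard-coded, and `oob = scales::squish` is my suggestion, not something from the original question.

```r
library(ggplot2)

# Hypothetical stand-in for mydata: mostly small counts plus a few large outliers
set.seed(1)
demo <- data.frame(
  LONGITUDE   = runif(200, 121.5, 121.7),
  LATITUDE    = runif(200, 38.9, 39.0),
  NUM_PICKUPS = c(sample(10:59, 190, replace = TRUE), sample(2000:14243, 10))
)

# Limits from the data itself: minimum and third quartile
lims <- unname(quantile(demo$NUM_PICKUPS, c(0, 0.75)))

p <- ggplot(demo, aes(x = LONGITUDE, y = LATITUDE, colour = NUM_PICKUPS)) +
  geom_point() +
  # squish() clamps out-of-range values to the nearest limit, so outliers
  # take the colour of the upper limit instead of the grey na.value
  scale_colour_gradient(limits = lims, oob = scales::squish)
```

Printing `p` draws the plot. The legend then only covers the range in `lims`, so it is worth saying in a caption that the top colour means "at or above the upper limit".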

Another option would be to transform your data (perhaps with log() or log10()). For example, mydata$LOG_PICKUPS <- log10(mydata$NUM_PICKUPS) might help reduce the range sufficiently to plot.
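
As a rough sketch of the log10 idea, here are five rows copied from the question (the data frame name `demo` and the plot object `p_log` are just illustrative). It shows how much the transform compresses the range before the new column is mapped to colour.

```r
library(ggplot2)

# Five rows taken from the data shown in the question
demo <- data.frame(
  LONGITUDE   = c(121.6177, 121.8069, 121.5723, 121.6423, 121.6019),
  LATITUDE    = c(38.9124, 39.0210, 38.9645, 38.9258, 39.0181),
  NUM_PICKUPS = c(21, 16, 38, 622, 14243)
)

# log10 is safe here because every count is positive
demo$LOG_PICKUPS <- log10(demo$NUM_PICKUPS)

range(demo$NUM_PICKUPS)  # 16 to 14243
range(demo$LOG_PICKUPS)  # roughly 1.2 to 4.2

# Map colour to the transformed column; the legend is now in log10 units
p_log <- ggplot(demo, aes(x = LONGITUDE, y = LATITUDE, colour = LOG_PICKUPS)) +
  geom_point() +
  labs(colour = "log10(NUM_PICKUPS)")
```

On the log scale a legend value of 2 corresponds to 100 pickups and 4 to 10,000, so low and high counts both get a usable share of the gradient.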

Author: Ling Zhang

Updated on April 13, 2020

Comments

  • Ling Zhang
    Ling Zhang about 4 years

    First, here is part of mydata (121315 rows × 4 columns):

         LONGITUDE LATITUDE NUM_PICKUPS TOTAL_REVENUE
     1   121.6177  38.9124          21         337.0
     2   121.8069  39.0210          16         454.7
     3   121.5723  38.9645          38         696.9
     4   121.6423  38.9258         622       13609.7
     5   121.5647  38.9129         116        2016.7
     6   121.6429  38.8846         120        2417.3
     7   121.5852  38.9279         117        1975.0
     8   121.6616  38.9189          94        1712.4
     9   121.5812  38.9828          50         981.6
     10  121.6411  38.9255         225        4696.2
    

    As you can see, the first and second columns are the longitude and latitude.

    mydata[1,3] = 21 means that at the place (121.6177, 38.9124) there were 21 pickups.

    Then I sorted mydata by NUM_PICKUPS in descending order:

    LONGITUDE   LATITUDE    NUM_PICKUPS TOTAL_REVENUE
    121.6019    39.0181        14243    514716
    121.5382    38.9609        13244    443754.7
    121.5381    38.9609         9645    325056
    121.5382    38.9608         8846    294345.6
    121.602     39.0181         6556    232254.5
    121.5383    38.9609         6152    208967.6
    121.5383    38.9608         6014    207677.8
    121.5381    38.9608         5544    185398.3
    121.6018    39.018          4546    167662.1
    121.5382    38.9607         4260    143088.9
    121.5827    38.8948         4133    72202.8
    121.6303    38.9183         3837    67683.6
    121.5966    38.9665         3747    56378.7
    

    And here is the summary of mydata:

    summary(mydata)
    
    
     LONGITUDE        LATITUDE      NUM_PICKUPS       TOTAL_REVENUE     
      Min.   :121.1   Min.   :38.76   Min.   :   10.00   Min.   :    92.9  
      1st Qu.:121.6   1st Qu.:38.91   1st Qu.:   15.00   1st Qu.:   289.7  
      Median :121.6   Median :38.92   Median :   27.00   Median :   515.1  
      Mean   :121.6   Mean   :38.93   Mean   :   57.03   Mean   :  1067.6  
      3rd Qu.:121.6   3rd Qu.:38.96   3rd Qu.:   59.00   3rd Qu.:  1089.5  
      Max.   :122.0   Max.   :39.32   Max.   :14243.00   Max.   :514716.0
    

    Now I want to draw a map coloured by NUM_PICKUPS. Here is my code:

    g1 <- ggplot() + geom_point(data = mydata,aes(x = LONGITUDE,y = LATITUDE,color=NUM_PICKUPS))
    


    The code runs and the plot is drawn, but looking at the colours it is hard to identify which places have a high NUM_PICKUPS and which have a low one.

    I tried modifying my code with scale_colour_gradient():

    g1 + scale_colour_gradient(low = "red",high = "white")
    


    But in this plot the colours are still hard to tell apart.

    Third try: this time I added alpha = I(1/100) and the breaks argument:

    g1 <- ggplot() + geom_point(data = mydata,aes(x = LONGITUDE,y = LATITUDE,color=NUM_PICKUPS),alpha=I(1/100))
    g1 + scale_colour_gradient(low = "red",high = "white", breaks=c(0,2000,4000))
    


    But it still doesn't help!

    Fourth try:

    ggplot(data = mydata, aes(x = LONGITUDE,y = LATITUDE, color = NUM_PICKUPS)) + geom_point() + scale_colour_gradient(limits = c(0, 60))
    


    Fifth try: following a post from 3 years ago, "ggplot2 Color Scale Over Affected by Outliers", I modified my code again:

    mydata$NUM_PICKUPS1 <- "> 2000"
    mydata$NUM_PICKUPS1[mydata$NUM_PICKUPS <= 2000] <- NA
    
    g2 <- ggplot() +
      geom_point(data = subset(mydata, NUM_PICKUPS <= 2000),
                 aes(x = LONGITUDE, y = LATITUDE, color = NUM_PICKUPS), size = 2) +
      geom_point(data = subset(mydata, NUM_PICKUPS > 2000),
                 aes(x = LONGITUDE, y = LATITUDE, fill = NUM_PICKUPS1))
    


    The outliers now look different, but the colour scale is still hard to read!

    So, my question is: how can I modify my code to make the colours for NUM_PICKUPS easy to tell apart?

  • Ling Zhang
    Ling Zhang over 8 years
    Yeah, your analysis of NUM_PICKUPS in mydata is quite correct. With your code, g1 + scale_colour_manual(limits = c(0, 60)), there is an error: "Continuous value supplied to discrete scale", so I changed it to g1 + scale_colour_gradient(limits = c(0, 60))
  • Ling Zhang
    Ling Zhang over 8 years
    I have tried both of your suggestions, but it's still helpless; few things have changed in the map
  • Phil
    Phil over 8 years
    You're quite right about scale_colour_gradient(); I've corrected it. How is it 'helpless'? Can you describe what's still wrong with the map?
  • Phil
    Phil over 8 years
    @LingZhang a thought occurred: when you use g1 does it still have the manual limits set? I.e. can you run ggplot(data = mydata, aes(x = LONGITUDE,y = LATITUDE, color = NUM_PICKUPS)) + geom_point() + scale_colour_gradient(limits = c(0, 60)) and see if that corrects it?
  • Ling Zhang
    Ling Zhang over 8 years
    Thanks, I have tried your advice and updated my question again; please take a look at it
  • Ling Zhang
    Ling Zhang over 8 years
    It is somewhat better, but the colour scale is still not easy to read
  • Phil
    Phil over 8 years
    What's the standard deviation of NUM_PICKUPS (sd(mydata$NUM_PICKUPS))? From your updated question it just looks like there's very little variance in your data which would be why there's very little variance in the colour of your plotted points.
  • Ling Zhang
    Ling Zhang over 8 years
    Sir, the sd of NUM_PICKUPS is 126.7398, and the variance is 16062.97