Python: Fast way of MinMax scaling an array

Solution 1

The answer by Mad Physicist can be optimized to avoid needless allocation of temporary arrays:

x -= x.min()
x /= x.ptp()

In-place operators (+=, -=, etc.) don't allocate temporary arrays, so swapping to disk is less likely to occur. Of course, this destroys your initial x, so it's only OK if you don't need x afterwards...

Also, his idea of concatenating the data into a single higher-dimensional matrix is a good one if you have many channels, but again it should be tested whether this big matrix triggers disk swapping, compared to processing small matrices in sequence.
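A minimal sketch of both ideas, assuming float arrays (the in-place division would raise a casting error on integer input); the helper names `minmax_inplace` and `minmax_per_array` are made up for illustration:

```python
import numpy as np

def minmax_inplace(x):
    # Shift and divide in place: no temporary arrays are allocated,
    # but the original values of x are destroyed.
    x -= x.min()
    peak = x.max()              # after the shift this equals the original ptp
    if peak > 0:                # avoid division by zero on constant arrays
        x /= peak
    return x

def minmax_per_array(stacked):
    # Concatenation idea: axis 0 indexes the individual arrays; the
    # remaining axes are reduced with keepdims=True so the per-array
    # minima and peaks broadcast back over the stack.
    axes = tuple(range(1, stacked.ndim))
    stacked -= stacked.min(axis=axes, keepdims=True)
    peaks = stacked.max(axis=axes, keepdims=True)
    # where=peaks > 0 leaves constant arrays (already all zeros) untouched.
    np.divide(stacked, peaks, out=stacked, where=peaks > 0)
    return stacked

x = np.random.rand(24, 24, 24, 9)
minmax_inplace(x)               # x is now scaled to [0, 1]
```

The `keepdims=True` reductions are what let one vectorized pass scale every array in the stack independently, instead of looping over them in Python.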

Solution 2

It's risky to use ptp, i.e. max - min, as it can in theory be 0, leading to a division by zero. It's safer to use minmax_scale, which doesn't have this issue. First, pip install scikit-learn.

from sklearn.preprocessing import minmax_scale

minmax_scale(array)

If using an sklearn Pipeline, use MinMaxScaler instead.
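A short sketch of both forms, assuming scikit-learn is installed (X here is made-up data): `minmax_scale` is the one-off convenience function, while `MinMaxScaler` is the estimator that composes into a `Pipeline`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, minmax_scale

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# One-off call: each column (feature) is mapped to [0, 1] independently.
scaled = minmax_scale(X)

# Estimator form: the same transform as a Pipeline step, so it can be
# fit on training data and later reapplied to test data.
pipe = Pipeline([("scale", MinMaxScaler())])
piped = pipe.fit_transform(X)
```

The estimator form is what you want whenever the scaling parameters learned on training data must be reused unchanged on new data.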


Author: Wise (I know Python, Java, and a little C++ and PHP.)

Updated on June 04, 2022

Comments

  • Wise
    Wise almost 2 years

    I use the following way to scale an n-dimensional array between 0 and 1:

    x_scaled = (x-np.amin(x))/(np.amax(x)-np.amin(x))

But it's very slow for large datasets. I have thousands of relatively large arrays to process. Is there a faster way to do this in Python?

Edit: My arrays have shape (24,24,24,9). For MinMaxScaler in scikit-learn, the input array has to have a specific shape, which mine doesn't, so I can't use it. The documentation says:

    Parameters: 
    X : array-like, shape [n_samples, n_features]
    
    • Sreeram TP
      Sreeram TP about 6 years
what about using MinMaxScaler from sklearn?
    • pault
      pault about 6 years
    • Mad Physicist
      Mad Physicist about 6 years
      Don't compute min twice?
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA about 6 years
      what is the shape of your data set?
    • pault
      pault about 6 years
      Can you show us the output of sklearn.preprocessing.minmax_scale(x)? Is there an error message? Wrong answer?
    • Wise
      Wise about 6 years
      @MaxU The shape is (samples,24,24,24,9)
    • Mad Physicist
      Mad Physicist about 6 years
      What is the meaning of the shape of your array?
    • Wise
      Wise about 6 years
      @MadPhysicist It is a shape used for Keras convolutional layers, 3d images of 24,24,24 with 9 channels.
    • Mad Physicist
      Mad Physicist about 6 years
@Wise. Would you be OK with looking at it as a 24*24*24-by-9 array then?
    • Mad Physicist
      Mad Physicist about 6 years
      @Wise. Please clarify your question with some context. The fact that you are unhappy with all the solutions proposed so far means that your question is incomplete. Instead of leaking important details one by one and wasting everyone's time, please indicate where and how you are getting the multiple arrays, what you are trying to do with them, and the issues you have had with the approaches you tried so far.
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA about 6 years
    • Wise
      Wise about 6 years
      @MaxU No I haven't. I use them, but I don't know how it affects the input data?
    • MaxU - stop genocide of UA
      MaxU - stop genocide of UA about 6 years
      @Wise, same here... I didn't start my "Keras journey" yet...
  • Wise
    Wise about 6 years
Thanks, my problem is that I want to min-max scale a large number of arrays, so it's not one large array, but numerous arrays which are each relatively large.
  • Mad Physicist
    Mad Physicist about 6 years
    @Wise. Then concatenate them and apply the function along a particular axis. Please make your question clear and complete.
  • sciroccorics
    sciroccorics about 6 years
    @MaxU: You're right for this specific case where full broadcasting is used. But using inplace operators is often a life-saver ;-)
  • Asclepius
    Asclepius over 5 years
    This answer is dangerous because ptp() can in theory return 0.
  • Mad Physicist
    Mad Physicist over 5 years
    @A-B-B. Same as manually computing the difference. The question is about how to speed up the code, not catch all the corner cases. That being said, your comment is extremely useful.
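On the shape constraint raised in the question's edit: one workaround (a sketch, not from the thread) is to reshape the array to the [n_samples, n_features] layout minmax_scale expects, scale, and restore the shape. Note this scales each of the 9 channels independently, which is different from scaling the whole array by its global min and max.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

x = np.random.rand(24, 24, 24, 9)

# Flatten voxels into rows and channels into columns: (24*24*24, 9) is
# the [n_samples, n_features] shape that minmax_scale expects.
flat = x.reshape(-1, x.shape[-1])

# Each channel (column) is scaled to [0, 1] independently, then the
# original 4-D shape is restored.
x_scaled = minmax_scale(flat).reshape(x.shape)
```

Whether per-channel or whole-array scaling is right depends on whether the 9 channels are on comparable scales to begin with.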