Can I make a random mask with NumPy?

11,164

Solution 1

Create an array of False values and set the first 1000 elements to True:

import numpy as np

a = np.full(10000, False)
a[:1000] = True

Afterwards, simply shuffle the array in place:

np.random.shuffle(a)

For a slightly faster solution you can also create an array of integer zeros, set some values to 1, shuffle and cast it to bool:

a = np.zeros(10000, dtype=int)
a[:1000] = 1
np.random.shuffle(a)
a = a.astype(bool)

In both cases you will have an array a with exactly 1000 True elements at random positions.
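Since the question is about picking pixels from an image, the shuffle approach can be wrapped in a small helper and reshaped to the image's shape. This is only a sketch: the helper name `random_mask`, the seed, and the random stand-in image are assumptions for illustration and not part of the original answer. It uses NumPy's newer `Generator` interface rather than `np.random.shuffle`.

```python
import numpy as np


def random_mask(shape, num_true, rng=None):
    """Return a boolean array of `shape` with exactly `num_true` True entries."""
    rng = np.random.default_rng() if rng is None else rng
    size = int(np.prod(shape))
    flat = np.zeros(size, dtype=bool)
    flat[:num_true] = True
    rng.shuffle(flat)              # in-place shuffle of the flat 1-D array
    return flat.reshape(shape)


rng = np.random.default_rng(0)     # arbitrary seed, for reproducibility only
image = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)  # stand-in image
mask = random_mask(image.shape, 1000, rng)
pixels = image[mask]               # 1-D array holding the 1000 selected pixels
```

Indexing with a boolean mask returns the selected pixels in row-major order, so `pixels` keeps the image's original ordering.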

If instead you want each element to be individually picked from [True, False] you could use

np.random.choice([True, False], size=10000, p=[0.1, 0.9])

but note you cannot predict the number of True elements in your array. You'll just know that on average you'll have 1000 of them.
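The same per-element draw can also be written as a comparison against uniform random numbers. The sketch below assumes a probability of 0.1 and an arbitrary seed. The number of True elements then follows a binomial distribution with mean 1000 and a standard deviation of 30, so it will usually land near 1000 without being exactly 1000.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
p = 0.1                           # probability that any one element is True
mask = rng.random(10000) < p      # independent draw for every element
count = int(mask.sum())           # near 1000, but varies from run to run
```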

Solution 2

A common solution is creating an array of random integer indices, which can be efficiently done with numpy's random choice.

With this setup:

n_dim = 10_000  # size of the original array
n = 100         # size of the random mask
rng = np.random.default_rng(123)

To create the array of random indices, we can use the generator's choice method, passing the array size as the first argument:

In [5]: %%timeit
   ...: m = rng.choice(n_dim, replace=False, size=n)
   ...:
21.9 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For comparison, the boolean array approach mentioned in other answers (which requires shuffling an array of 0s and 1s) is considerably slower (more than 10x slower in this example):

In [7]: %%timeit
   ...: m = np.hstack([np.ones(n, dtype=bool), np.zeros(n_dim - n, dtype=bool)])
   ...: rng.shuffle(m)
   ...:
261 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

NOTE: Integer indexing works best in the sparse case, i.e. when selecting a small fraction of samples from the original array. In that case the RAM usage of an integer index is much lower than that of a boolean mask. Once the fraction of samples exceeds roughly 10 to 20% of the original array, the bool mask approach becomes more efficient.

NOTE 2: Integer indexing returns samples in random order. To randomly sample an array while maintaining the original order, you need to sort the index. The bool mask naturally returns samples in their original order.

To conclude, if you are performing sparse sampling and you don't care about order of the sampled items, the integer indexing shown here is likely to outperform other approaches.
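To illustrate the ordering point from NOTE 2, here is a minimal sketch that reuses the setup above. The `data` array is a made-up stand-in for whatever is being sampled. Sorting the index restores the original order, and the same index can be turned into a boolean mask if one is needed downstream.

```python
import numpy as np

rng = np.random.default_rng(123)
n_dim, n = 10_000, 100
data = np.arange(n_dim) * 2        # stand-in for the array being sampled

idx = rng.choice(n_dim, size=n, replace=False)  # distinct indices, random order
idx_sorted = np.sort(idx)                       # same indices, ascending order

# Equivalent boolean mask built from the integer index
mask = np.zeros(n_dim, dtype=bool)
mask[idx] = True

# Sorted integer indexing and boolean masking select the same samples
assert np.array_equal(data[idx_sorted], data[mask])
```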

Solution 3

In [7]: import numpy as np 

In [8]: mask=np.array( [False]*10000)

In [9]: inds=np.random.choice(np.arange(10000),size=1000,replace=False)  # replace=False: 1000 distinct indices

In [10]: mask[inds]=True

Now the first 100 elements of your mask are

In [11]: print(mask[:100])
[False False False False False  True False False False False False False
 False False False False False False False False False False  True False
 False False False False False False False  True  True False  True False
 False False False False False False False False False False  True False
 True False False False False False False False False False False False
 False False False False False False  True False False False False False
 False False  True False False False False False False False False False
 False False  True False False False False False False False False False
 False False False False]
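One detail worth checking with this approach is the `replace` argument of `choice`. By default, indices are drawn with replacement, so the same index can come up more than once and the mask can end up with fewer True values than requested. A minimal sketch comparing both settings (the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed
size, num_true = 10000, 1000

# With replacement (the default): repeated indices are possible
inds_dup = rng.choice(size, size=num_true)
mask_dup = np.zeros(size, dtype=bool)
mask_dup[inds_dup] = True

# Without replacement: all indices are distinct
inds_unique = rng.choice(size, size=num_true, replace=False)
mask_unique = np.zeros(size, dtype=bool)
mask_unique[inds_unique] = True

print(mask_dup.sum())     # at most 1000, typically somewhat fewer
print(mask_unique.sum())  # 1000
```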

Solution 4

Similar to Nils Werner's answer, but more directly:

import numpy as np

size = 10000
num_true = 1000
mask = np.concatenate([np.ones(num_true, dtype=bool), np.zeros(size - num_true, dtype=bool)])
np.random.shuffle(mask)

It is equally fast; using IPython's %%timeit magic:

%%timeit
a = np.zeros(size, dtype=int)
a[:num_true] = 1
np.random.shuffle(a)
a = a.astype(bool)

Out: 217 µs ± 2.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
mask = np.concatenate([np.ones(num_true, dtype=bool), np.zeros(size - num_true, dtype=bool)])
np.random.shuffle(mask)

Out: 201 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
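Outside IPython, a similar comparison can be run with the standard-library `timeit` module. This is only a sketch: the function names and the repeat count of 200 are arbitrary choices, and the absolute numbers will depend on the machine and NumPy version.

```python
import timeit

import numpy as np

size, num_true = 10000, 1000


def int_then_cast():
    # Integer zeros, set some to 1, shuffle, then cast to bool
    a = np.zeros(size, dtype=int)
    a[:num_true] = 1
    np.random.shuffle(a)
    return a.astype(bool)


def concat_bool():
    # Build the boolean array directly, then shuffle
    mask = np.concatenate([np.ones(num_true, dtype=bool),
                           np.zeros(size - num_true, dtype=bool)])
    np.random.shuffle(mask)
    return mask


# Average seconds per call over 200 calls
t_int = timeit.timeit(int_then_cast, number=200) / 200
t_concat = timeit.timeit(concat_bool, number=200) / 200
print(f"int + cast: {t_int * 1e6:.0f} µs, concatenate: {t_concat * 1e6:.0f} µs")
```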

Author by

Admin

Updated on September 06, 2022

Comments

  • Admin
    Admin over 1 year

    I'm doing image processing using Python.

    I am trying to randomly extract some pixels from the image.

    Is it possible to make a random mask with NumPy?

    What I have in mind is to set 1000 elements of a 10000-element array to True and all the rest to False. Is that possible?

    If not, is there another way to make a random mask? Thank you.

  • Anton vBR
    Anton vBR over 6 years
    I liked this solution. However, just a thought: we could change the np.full(..) to np.zeros() and set a[:1000] to 1 instead. This should give a speed improvement. We could add an a.astype(bool) at the end too.
  • Nils Werner
    Nils Werner over 6 years
    They are about as fast with the purely boolean array being slightly faster most of the time.
  • Nils Werner
    Nils Werner over 6 years
    Ah, I misread the timings. The int solution is indeed the fastest! Will adjust my answer.
  • Anakhand
    Anakhand almost 5 years
    Could also do np.concatenate([np.ones(1000, dtype=bool), np.zeros(10000 - 1000, dtype=bool)]) to save some lines and the conversion to bool
  • Anakhand
    Anakhand over 3 years
    This only works if you don't care about maintaining the original order of the elements after indexing with the mask m. If you do care (as is OP's case, probably, since they are extracting pixels from an image), you have to sort the mask after creating it with m.sort(). The overall complexity of that is worse than the boolean mask method—i.e., it performs worse for larger mask sizes. See this benchmark.
  • user2304916
    user2304916 over 3 years
    Dear Anakhand, that's a fair point. The requirement to keep the samples sorted is not mentioned in the OP's question. The integer index approach works better in the sparse case, i.e. when the selected samples are a small fraction of the total. In this case, it will be both more efficient (less RAM) and faster than a bool mask. When the fraction of sampled points is larger than ~1/4, I would use a bool mask, as the advantage of an integer mask in terms of RAM would vanish.
  • Safron
    Safron over 3 years
    np.random.choice([True, False], size=10000, p=[0.1, 0.9]) is about 10 times slower than the equivalent np.random.random_sample(10000) < 0.1.