Removing duplicate columns and rows from a NumPy 2D array


Solution 1

Here's one idea. It'll take a little bit of work, but it could be quite fast. I'll give you the 1d case and let you figure out how to extend it to 2d. The following function finds the unique elements of a 1d array:

import numpy as np
def unique(a):
    a = np.sort(a)
    b = np.diff(a)
    b = np.r_[1, b]
    return a[b != 0]

Now, to extend it to 2d you need to change two things. First, you will need to figure out how to do the sort yourself; the important thing about the sort is that two identical entries end up next to each other. Second, you'll need to do something like (b != 0).any(axis), because you want to compare the whole row/column: a row is new if any of its entries differs from the row before it. Let me know if that's enough to get you started.

Updated: With some help from Doug, I think this should work for the 2d case.

import numpy as np
def unique(a):
    order = np.lexsort(a.T)
    a = a[order]
    diff = np.diff(a, axis=0)
    ui = np.ones(len(a), 'bool')
    ui[1:] = (diff != 0).any(axis=1) 
    return a[ui]
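
To see why the sort has to keep rows intact, here is a small check on one of the arrays mentioned in the comments. It is only a sketch: unique_rows_lexsort is a hypothetical name for the 2d function above, repeated so the snippet runs on its own.

```python
import numpy as np

def unique_rows_lexsort(a):
    # Same steps as the 2d function above; the name is made up for this sketch.
    a = np.asarray(a)
    order = np.lexsort(a.T)            # reorders whole rows; the last column is the primary key
    a = a[order]
    keep = np.ones(len(a), dtype=bool)
    keep[1:] = (np.diff(a, axis=0) != 0).any(axis=1)
    return a[keep]

x = np.array([[0, 1], [1, 0], [2, 2]])

# np.sort with axis=0 sorts each column on its own, which scrambles the rows.
print(np.sort(x, axis=0))

# lexsort only reorders whole rows, so every original row survives.
print(unique_rows_lexsort(x))

dup = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_lexsort(dup))
```

Note that the result comes back in lexsort order (last column first), not in the order the rows had in the input.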

Solution 2

This should do the trick:

def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

Example:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique_rows(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
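
On NumPy 1.13 or later, np.unique accepts an axis argument, which may make the structured-view trick unnecessary. A minimal sketch, assuming a recent enough NumPy; the array b is made up for illustration:

```python
import numpy as np

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])

# axis=0 treats each row as one item; the result comes back sorted by row.
unique_a = np.unique(a, axis=0)
print(unique_a)

# axis=1 does the same for columns.
b = np.array([[1, 1, 2], [3, 3, 4]])
unique_cols = np.unique(b, axis=1)
print(unique_cols)
```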

Solution 3

My method turns the 2d array into a 1d complex array, where the real part is the 1st column and the imaginary part is the 2nd column, and then uses np.unique. This will only work with 2 columns, though.

import numpy as np 
def unique2d(a):
    x, y = a.T
    b = x + y*1.0j 
    idx = np.unique(b,return_index=True)[1]
    return a[idx] 

Example:

>>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>> unique2d(a)
array([[1, 1],
       [2, 3],
       [5, 4]])
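
The return_index output of np.unique holds the position of the first occurrence of each unique value, so sorting those indices should keep the rows in their original order. Here is a sketch of that idea. It assumes NumPy 1.13+ and swaps the complex-number trick for np.unique(..., axis=0), so it is not limited to two columns; unique_rows_ordered is a hypothetical name.

```python
import numpy as np

def unique_rows_ordered(a):
    # Hypothetical helper: unique rows, kept in order of first appearance.
    a = np.asarray(a)
    _, idx = np.unique(a, axis=0, return_index=True)  # first occurrence of each unique row
    return a[np.sort(idx)]                            # restore the original ordering

a = np.array([[5, 4], [2, 3], [5, 4], [1, 1], [2, 3]])
result = unique_rows_ordered(a)
print(result)
```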

Solution 4

>>> import numpy as NP
>>> # create a 2D NumPy array with some duplicate rows
>>> A = NP.array([[1, 1, 1, 5, 7],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8],
...               [5, 4, 5, 4, 7],
...               [1, 1, 1, 5, 7],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8],
...               [5, 4, 5, 4, 7],
...               [7, 9, 4, 7, 8]])

>>> # first, sort the 2D NumPy array by whole rows, so dups will be contiguous
>>> # and each row is preserved
>>> a, b, c, d, e = A.T    # the columns are the keys to pass to lexsort
>>> ndx = NP.lexsort((a, b, c, d, e))
>>> ndx
    array([1, 3, 5, 7, 0, 4, 2, 6, 8])
>>> A = A[ndx,]

>>> # now diff by row
>>> A1 = NP.diff(A, axis=0)
>>> A1
    array([[ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [-4, -3, -4,  1,  0],
           [ 0,  0,  0,  0,  0],
           [ 6,  8,  3,  2,  1],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0]])

>>> # flag each row that differs from the row before it (the start of a new group)
>>> ndx = NP.any(A1, axis=1)
>>> ndx
    array([False, False, False,  True, False,  True, False, False])

>>> # the first sorted row always starts a group, so keep it along with the flagged rows
>>> NP.vstack((A[:1], A[1:][ndx]))
    array([[5, 4, 5, 4, 7],
           [1, 1, 1, 5, 7],
           [7, 9, 4, 7, 8]])

Solution 5

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice and tested interface, plus many related features:

import numpy_indexed as npi
npi.unique(coordskeys)
Author: Sergi


Updated on April 03, 2020

Comments

  • Sergi
    Sergi about 4 years

    I'm using a 2D array to store pairs of longitudes and latitudes. At one point, I have to merge two of these 2D arrays and then remove any duplicated entries. I've been searching for a function similar to numpy.unique, but I've had no luck. Every implementation I've thought of looks very "unoptimized". For example, I've tried converting the array to a list of tuples, removing duplicates with set, and then converting back to an array:

    coordskeys = np.array(list(set([tuple(x) for x in coordskeys])))
    

    Are there any existing solutions, so I do not reinvent the wheel?

    To make it clear, I'm looking for:

    >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
    >>> unique_rows(a)
    array([[1, 1], [2, 3],[5, 4]])
    

    BTW, I wanted to use just a list of tuples for it, but the lists were so big that they consumed my 4 GB of RAM + 4 GB of swap (numpy arrays are more memory efficient).

  • Sergi
    Sergi over 12 years
    Yes, the order is not important. The solution of combining list + set is the one I use as an example in the OP (which I admit is quite obfuscated). The problem with it is that it uses lists, so the memory used is huge, which is the same problem I would have if I were working with lists instead of arrays from the beginning.
  • Deepak Mathpal
    Deepak Mathpal over 12 years
    +1 I just posted my answer, then read yours. It looks like mine is a faithful 2D implementation of yours: the same sequence of identical functions (I even had a row concatenation step at first, but I removed it and sliced the first row off the original array instead).
  • Bi Rico
    Bi Rico over 12 years
    Doug, I think you're close, but you're going to run into trouble because NP.sort(A, axis=0) sorts each column independently. Try running your method on the following two arrays: [[0, 0], [1, 1], [2, 2]] and [[0, 1], [1, 0], [2, 2]]. I added a sort function to my answer that keeps the rows intact while sorting.
  • Bi Rico
    Bi Rico over 12 years
    I didn't know about lexsort, I'm going to include it in my answer if that's ok
  • Deepak Mathpal
    Deepak Mathpal over 12 years
    @Bago: absolutely--you were the first to solve the heart of the problem anyway, which is why I up-voted your answer and left a comment to let people know that my answer is just a modified version of yours, posted several hours later.
  • user545424
    user545424 over 10 years
    @user100464, edited so that it will work with transposed arrays.
  • Bi Rico
    Bi Rico about 8 years
    This answer mostly uses numpy, so Python 2 vs. 3 shouldn't matter. If it's not working for you, there is probably something else going on.
  • Ghostkeeper
    Ghostkeeper almost 8 years
    Worked for me in Python3. Note that this doesn't preserve the order.
  • Eelco Hoogendoorn
    Eelco Hoogendoorn over 7 years
    Note that the lexsort solution is limited in how many columns it supports