How to get distinct items from a Scala Iterable, maintaining laziness

10,309

Solution 1

You can use the distinct method on Stream. For example, if you have this Iterable:

val it = new java.lang.Iterable[Int] {
  def iterator = new java.util.Iterator[Int] {
    var i = 0
    var first = true

    def hasNext = true
    def next =
      if (first) { first = false; i } else { first = true; i += 1; i - 1 }
    def remove() { throw new UnsupportedOperationException("Can't remove.") }
  }
}

You can write:

scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._

scala> val s = it.asScala.toStream
s: scala.collection.immutable.Stream[Int] = Stream(0, ?)

scala> s.take(10).toList
res0: List[Int] = List(0, 0, 1, 1, 2, 2, 3, 3, 4, 4)

scala> val s = it.asScala.toStream.distinct
s: scala.collection.immutable.Stream[Int] = Stream(0, ?)

scala> s.take(10).toList
res1: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

We can tell that everything is appropriately lazy since the stream is infinite.

Solution 2

UPDATE Reading questions carefully is good. No laziness in this solution. Sorry.

toSet will do exactly what you want:

  1. Store iterated elements in a collection (not what you want but required)
  2. Drop / Replace duplicates

Example

val it = Seq(1,2,3,4,2,4): Iterable[Int]
it.toSet
// Set(1,2,3,4)

If you feel fancy, you can convert that back to an iterable:

it.toSet.toIterable

Or, pimp the Iterable:

implicit class UniquableIterable[T](t: Iterable[T]) {
  def unique = t.toSet.toIterable
}

And then call

it.unique

Solution 3

Expanding on my comment above, but I can't test it right now:

def unique[A](it: Iterator[A]): Iterator[A] = {
  val seen = mutable.Set[A]()
  it.filter { a =>
    if (seen(a))
      false
    else {
      seen += a
      true
    }
  }
}

You get the idea, at least. You would then apply this to the iterator you get from your iterable, and not get the unnecessary storage behavior of Stream.

Solution 4

Here is the code that adds .disctinct method to Iterator.

implicit class IteratorWrapper[T](it: Iterator[T]) {
    def distinct = new Iterator[T] {
        var seen = Set.empty[T]
        var ahead = Option.empty[T]

        def searchAhead {
            while (ahead.isEmpty && it.hasNext) {
                val v = it.next
                if (!seen(v)) {
                    seen += v
                    ahead = Some(v)
                }
            }
        }

        def hasNext = {
            searchAhead
            ahead.nonEmpty
        }

        def next = {
            searchAhead
            val result = ahead.get
            ahead = None
            result
        }
    }
}

Be aware that, as it is usually so with Iterators, the original iterator is not valid after calling .distinct on it.

Share:
10,309
Dan Gravell
Author by

Dan Gravell

Indie developer and founder of bliss and OneMusicAPI. Started programming when I was seven years old with an Amstrad CPC 6128 and Appendix 3 of the "Amstrad CPC 6128 User Instructions"... In the mid 90s I got into Java via my University CS course, and that has been my main programming environment/ecosystem ever since.

Updated on June 19, 2022

Comments

  • Dan Gravell
    Dan Gravell almost 2 years

    I have a java.lang.Iterable which computes its values lazily. I am accessing it from Scala. Is there a core API way of returning only distinct values? For instance, imaging there was a filter method that also provided all results returned thus far:

    val myLazyDistinctIterable = iterable.filter((previousReturnedItems, newItem) => previousReturnedItems.contains(newItem))
    

    I guess this is not a very general case because it involves storing previously returned items, and that might be why it isn't in the core API.

    I know about List.distinct and Sets but I want something that will not compute its elements until asked.

  • Dan Gravell
    Dan Gravell about 11 years
    No worries, guess you're just not a lazy person ;)
  • Dan Gravell
    Dan Gravell about 11 years
    D'oh, should've checked Stream. Thanks. I remember now that Stream stores its previously returned items. I thought that was an implementation detail originally but I guess it's part of its semantics too.
  • gzm0
    gzm0 about 11 years
    @DanGravell Yep... I don't think you'll get around writing a UniqueIterable with really ugly internal state.
  • Andrew Scott Evans
    Andrew Scott Evans over 8 years
    have you ever tried to use Scalas own distinct? x.toList.distinct
  • gzm0
    gzm0 over 8 years
    This is wrong! Consider: Iterator(1, 1). After calling next for the first time, hasNext will still be true, but calling next will actually throw.
  • Rok Kralj
    Rok Kralj over 8 years
    I have done exactly that and everything is fine as it should be. pastebin.com/NPAjq2aS
  • gzm0
    gzm0 over 8 years
    Well yes... because you are calling next on the original iterator, not the one created by the call to distinct.
  • Rok Kralj
    Rok Kralj over 8 years
    I have fixed the problem. The consequence is that even .hasNext can mutate the iterator, as there is no way of telling whether the distinct iterator has any elements left, you have to peek.