Portmanteau (Spliterator)

For the most part, class names in the standard Java libraries consist of fairly normal words (e.g. a ScheduledExecutorService is an ExecutorService, a service that manages the execution of tasks, that also allows for scheduling). There are exceptions (and I don’t mean the Java ones), of course, such as the OMGVMCID.

Java 8, specifically JSR-335, brought about a bunch of new classes, including some of my Guava favourites (the function package, with Supplier and friends, and Optional) and Streams. Most of these, again, have fairly normal-sounding names. A few might sound a bit hacked together, like BiConsumer, but I think the meaning is generally intuitive (a consumer of two arguments). There is one that was initially fairly puzzling, though: “Spliterator” (rendered as one word). The obvious reading of that is as a portmanteau of “split” and “iterator”, but I wasn’t sure at first how useful it would be. Of course I have written code to partition collections or other iterables, but this was usually done on an ad-hoc basis.

The API seems designed around easily scheduling parallel computation. The key method here is trySplit(). To quote the Java API docs:

If this spliterator can be partitioned, returns a Spliterator covering elements, that will, upon return from this method, not be covered by this Spliterator.

I’ll go over this in a bit more detail.

  • If the spliterator can currently be split, then trySplit() returns a new spliterator, and this “takes over” some of the elements in our existing spliterator. Our existing spliterator, if traversed directly after this method returns (e.g. with tryAdvance()), will not have these elements.
  • Otherwise, trySplit() returns null. Our existing spliterator retains its elements.
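To make the two cases concrete, here is a minimal sketch using the spliterator of a small list. The class name TrySplitDemo and the helper drain() are my own, purely for illustration. Exactly where the split lands is up to the implementation, so I only rely on the documented guarantee that, for an ordered spliterator, the split-off part is a prefix of the elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

class TrySplitDemo {
  /** Drains a spliterator into a list; a null spliterator yields an empty list. */
  static <T> List<T> drain(Spliterator<T> spliterator) {
    List<T> out = new ArrayList<>();
    if (spliterator != null) {
      spliterator.forEachRemaining(out::add);
    }
    return out;
  }

  public static void main(String[] args) {
    Spliterator<String> original = Arrays.asList("a", "b", "c", "d").spliterator();

    // If the split succeeds, the returned spliterator takes over a prefix of the elements.
    Spliterator<String> prefix = original.trySplit();
    System.out.println("split off: " + drain(prefix));

    // The original now only covers whatever was not handed over.
    System.out.println("remaining: " + drain(original));

    // Once drained there is nothing left to hand over, so this list-backed spliterator declines to split.
    System.out.println("split after draining: " + original.trySplit());
  }
}
```

Between them, the two spliterators cover every element exactly once, which is what makes it safe to hand each one to a different thread.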

For example, this is one way of doing a parallel file deletion, which starts from one spliterator and splits it as necessary.

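import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Assume there exists an ExecutorService, EXECUTOR, with enough threads that
// tasks blocked waiting on their delegates cannot starve the pool.

class DeleteTask implements Runnable {
  private final Spliterator<File> filesToDelete;

  DeleteTask(Spliterator<File> filesToDelete) {
    this.filesToDelete = filesToDelete;
  }

  @Override
  public void run() {
    // Hand chunks off to new tasks for as long as the spliterator is willing to split.
    List<Future<?>> delegateFutures = new ArrayList<>();
    Spliterator<File> filesToDelegate = filesToDelete.trySplit();
    while (filesToDelegate != null) {
      delegateFutures.add(EXECUTOR.submit(new DeleteTask(filesToDelegate)));
      filesToDelegate = filesToDelete.trySplit();
    }
    // Delete whatever this task still covers after the splits.
    filesToDelete.forEachRemaining(File::delete);
    // Wait for every delegate; we can use a fancier way of managing completions if needed.
    for (Future<?> delegateFuture : delegateFutures) {
      try {
        delegateFuture.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
        return;
      } catch (ExecutionException e) {
        throw new IllegalStateException(e.getCause());
      }
    }
  }
}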

Viewing this in relation to parallel programming, there are obvious similarities to fork-join based workflows, especially in cases where the work involves a map or filter, or doing some reduction that’s commutative and associative. A traditional way of writing the above wouldn’t be too different; you might write some logic that splits the files into batches (which now lives in the spliterator), and then similarly dispatch deletion jobs to an ExecutorService.
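As far as I can tell, this is also how the streams library itself puts spliterators to work: StreamSupport.stream() accepts a spliterator plus a flag requesting parallel execution. Here is a small sketch of a filter, a map and an associative, commutative reduction (a sum) driven that way. The class and method names (ParallelReduction, totalLength) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.StreamSupport;

class ParallelReduction {
  /** Sums the lengths of the non-empty strings covered by the spliterator. */
  static long totalLength(Spliterator<String> words) {
    return StreamSupport.stream(words, true) // true requests a parallel stream
        .filter(w -> !w.isEmpty())           // filter: drop empty strings
        .mapToLong(String::length)           // map: each word to its length
        .sum();                              // reduce: addition is associative and commutative
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("split", "", "iterator", "portmanteau");
    System.out.println(totalLength(words.spliterator()));
  }
}
```

The splitting decisions are left entirely to the spliterator and the stream framework here, so none of the batching logic shows up in our own code.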

From the above, it seems that one notable difference is where responsibility for deciding how the stream should be split lives. The Spliterator itself has some responsibility for deciding whether it should split. To some extent this makes sense. In cases where we want batches to have rather specific sizes (e.g. if performance is highly sensitive to elements being processed in specific batch sizes), the caller can still guard trySplit() with suitable calls to the estimateSize() or getExactSizeIfKnown() methods. This can also help us avoid fiddly edge cases (where our size bounds pass, but we must work with fewer than a full batch of elements, for instance).
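Here is a rough sketch of what guarding trySplit() with estimateSize() might look like. The names SizeGuardedSplit, collectBatches and maxBatch are hypothetical, and I collect the batches into lists only to make the result easy to inspect. Note that estimateSize() is only an estimate for spliterators that are not SIZED, and a spliterator may refuse to split even when the guard passes, in which case a batch can end up larger than maxBatch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

class SizeGuardedSplit {
  /** Keeps splitting while a piece is estimated to hold more than maxBatch elements. */
  static <T> void collectBatches(Spliterator<T> source, long maxBatch, List<List<T>> batches) {
    // Only ask for a split when the estimated size exceeds the batch bound.
    Spliterator<T> prefix = source.estimateSize() > maxBatch ? source.trySplit() : null;
    if (prefix == null) {
      // Small enough (or unsplittable): consume the remaining elements as one batch.
      List<T> batch = new ArrayList<>();
      source.forEachRemaining(batch::add);
      if (!batch.isEmpty()) {
        batches.add(batch);
      }
      return;
    }
    // Handle the split-off prefix first so encounter order is preserved.
    collectBatches(prefix, maxBatch, batches);
    collectBatches(source, maxBatch, batches);
  }

  public static void main(String[] args) {
    List<Integer> input = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      input.add(i);
    }
    List<List<Integer>> batches = new ArrayList<>();
    collectBatches(input.spliterator(), 30, batches);
    for (List<Integer> batch : batches) {
      System.out.println("batch of " + batch.size());
    }
  }
}
```

This version is sequential to keep it short; each recursive call could just as well be submitted to an executor, as in the deletion example.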

Spliterators are also useful because they report characteristics that describe, among other things, properties of the elements and of the order in which they will be traversed. For example (though this doesn’t quite apply to the file deletion example) we may be able to use a certain optimisation if we can assume that we will see the data in sorted order; spliterators carry this information as the SORTED characteristic (whereas if we used a standard Iterator we might not be able to easily do this without some additional bookkeeping). The same might apply for cases where we can do better if we know elements are unique, which is what the DISTINCT characteristic reports. The collections API also generates spliterators with suitable characteristics (e.g. a NavigableSet would give you a spliterator that already reports SORTED and DISTINCT).
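A quick way to see this is to ask a couple of collections what their spliterators report. In this sketch the class name CharacteristicsDemo and the helper describe() are invented for illustration; SORTED and DISTINCT are the flags Spliterator defines for sorted order and uniqueness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Spliterator;
import java.util.TreeSet;

class CharacteristicsDemo {
  /** Summarises whether a spliterator reports the SORTED and DISTINCT characteristics. */
  static String describe(Spliterator<?> spliterator) {
    return "sorted=" + spliterator.hasCharacteristics(Spliterator.SORTED)
        + ", distinct=" + spliterator.hasCharacteristics(Spliterator.DISTINCT);
  }

  public static void main(String[] args) {
    // A TreeSet (a NavigableSet) knows its elements are sorted and unique.
    System.out.println("TreeSet:   " + describe(new TreeSet<>(Arrays.asList(3, 1, 2)).spliterator()));

    // An ArrayList makes no such promises, even if its contents happen to be sorted.
    System.out.println("ArrayList: " + describe(new ArrayList<>(Arrays.asList(1, 2, 3)).spliterator()));
  }
}
```

Code that wants to take a faster path can branch on hasCharacteristics() and fall back to the general approach otherwise.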

I’m still not sure why this name was decided upon, as opposed to, say, SplittableIterator. Perhaps this was done in the interest of terseness and because the decomposition is fairly obvious, though the name still really doesn’t look like a proper word to me. It does seem that spliterators have use cases beyond standard iterators as well, even when parallel computation is out of the picture: the characteristics, and the ability to dynamically switch on them, seem potentially useful. That said, I still have not used a Spliterator directly in my professional work, and have no intention of doing so unless it fits the task I’m doing!

