If I’m asked to think of a real-life example of a set of sets, I think the first example I would come up with was about job lots of items being sold. Interestingly, it actually took a while – the ideas I came up with before that, which were companies I have investments in (note: mutual funds and ETFs own sets of companies), or messages to my friends were really unions of multiple sets.
How would we model a set of sets in Java? Well, that’s not too hard, since a Set is a type after all:
Set<Set<Item>> items = Sets.newHashSet();
Set<Item> thingsBeingSold = Sets.newHashSet();
thingsBeingSold.add(Item.create("1 oz. Gold", 123456));
thingsBeingSold.add(Item.create("1 oz. Silver", 3456));
That said, we’ve got to be careful. If we continue the above with this:
thingsBeingSold.add(Item.create("a chopstick", 20));
items.stream().anyMatch(x -> x.equals(items));
items.stream().anyMatch(x -> x == items);
What should the output of lines 7, 8 and 9 be? Are they equivalent, even? Well, there are some possible arguments in each direction:
- They should return true, because the reference to the set thingsBeingSold is still there.
- They should return false, because the value of thingsBeingSold isn’t the same as the value inserted.
I would argue that the first point is stronger, and would have been my answer when I first learnt about Java. However, the actual answer is that lines 8 and 9 should return true while line 7 generally returns false, though there are a few special cases:
- It will return true if thingsBeingSold.hashCode() before line 6 has the same hash as thingsBeingSold.hashCode() after line 6 i.e. there is a hash collision, or
- if the Item created in line 6 equals either of the Items created in lines 3 and 4 (unlikely).
The operation of line 8 is relatively simple: we check each element in the set to see if it is equal, and assuming our set’s equals() method does what it’s supposed to do we will always find the match. We can potentially improve on this with parallelStream(). Line 9 does a reference-equality check on the elements instead and would probably be faster as we don’t have to actually inspect the set contents. In this case, it’s also going to be true. However, these are linear time operations, at least. The point of using hash-based structures here, among other things, would be to knock the time complexity of lookups down to amortised constant.
In a sense, we find the object only if we fortuitously (!) happen to be looking for the same hash code. Java HashSets effectively are a key-set based view of a HashMap, and when we look up objects we actually investigate them by their hashes first, only bothering to check the values if we have a match on the hashes. This allows for some fairly interesting optimisations, actually; I previously thought that HashMaps use buckets and store an array of linked lists of mappings + information, resizing appropriately when the ratio of elements to buckets exceeded a load factor. This was correct up to Java 7, but in Java 8 it appears there’s a fun new optimisation that builds red-black trees sorted by hash codes (while objects themselves might not be Comparable, their hashes are ints, which certainly are) if some buckets get too large relative to others in an attempt to cut the worst-case complexity from linear to logarithmic. Still, when we inspect nodes, we look at their hashes first.
Note that this isn’t just a problem with HashSets; you can get into similar debacles with TreeSets, because the underlying red-black trees aren’t going to reorder themselves. In general, the Javadoc for Set does warn you that mutable objects shouldn’t be changed in ways that would affect their equals() method after they’ve been put into a set. HashSet and TreeSet are allowed to behave the way they do because of subsequent invariants between equals() and hashCode(), and comparison-based sets are supposed to use comparators that are consistent with equals() i.e. a.equals(b) iff a.compareTo(b) == 0.
One way to get round this is of course to ensure that the types used in these sets are immutable, or at least the parts that are used in equals() and related methods are immutable. If that’s infeasible, a good solution could be to delete the set that is to be changed from the set, and add it back after changes; add is idempotent and so the set will (correctly) stay as a set. Wiring this up seems rather messy, though; it looks like we need to listen for state changes (so some kind of publish/subscribe mechanism), where the objects being stored publish both their old and new state so that we can modify the set accordingly.