Fast (count) on custom types in Clojure, and how to participate in data abstractions

2013-03-27

It was supposed to be a Stackoverflow question. As often happens, I've resolved the issue while trying to describe it properly. Here they are, the questions answered.

I've got a custom data type which I'd like to implement some core Clojure protocols. This is the kind of implementation recommended in the Clojure Programming book (page 294), and found elsewhere on the web:

(deftype X [x]
  clojure.lang.IPersistentCollection
  (count [this]
    (println :X-count)
    (count (.x this)))
  clojure.lang.Seqable
  (seq [this]
   (println :X-seq)
   (seq (.x this))))

The problem is that if I imlpement c.l.Seqable, than all calls to (count my-coll) are dispatched via (seq my-coll) first.

(count (X. [1 2 3]))
; => :X-seq
; => 3

Obviously, it is inefficient, and I'd like (count) on values of type X to use the known length of .x attribute. If I omit a c.l.Seqable implementation, I get an AbstractMethodError instead:

(count (X. [1 2 3]))
; => AbstractMethodError   clojure.lang.RT.seqFrom (RT.java:480)

It still tries to build a sequence first, doesn't it? By pure chance I discovered clojure.lang.Counted (thanks to auto-completion in nrepl.el), and implementing that (count) seems to do the right thing:

(deftype X [x]
  clojure.lang.Counted
  (count [this]
    (println :Counted-count)
    (count (.x this))))
; => user.X
(count (X. [1 2 3]))
; => :Counted-count
; => 3

The questions

Some Clojure data abstractions implement similarly named methods, and it's unclear which one is going to be used.

1) In the example above, IPersistentCollection offers int count(), and extends Seqable, why a generic Seqable implementation of (count) is used?

2) Counted is not extending anything, why its implementation of (count) is preferred over Seqable?

3) How one is supposed to discover the right protocol and method to implement? (Apart from asking someone else and feeling lucky about random snippets on the web).

And the answers

To explain the question 3), let's say one wants to implement (conj) for a new data type. (source conj) is pretty useless, it appears to be a tautology:

(. clojure.lang.RT (conj coll x)

But it gives a hint. One needs to read RT.java find conj there, and see, that it is actually cons of IPersistentCollection.

Protocol methods are not easily discoverable. It appears that the source of RT.java is the best guide to what protocols Clojure functions depend on.

It answers also the second question. The precedence of Counted over Seqable is hard-coded:

public static int count(Object o){
    if(o instanceof Counted)
        return ((Counted) o).count();
    return countFrom(Util.ret1(o, o = null));
}

RT.java answers also the first question. In countFrom(), all IPersistentCollections are converted to sequences. On true Collections, the .size() method is called. In this case, the hierarchy of interfaces doesn't matter.

Bottom line. (scaffold) function from Clojure Programming is useful, but insufficient to choose the protocols and methods which need to be implemented. The source of RT.java is really helpful.

I suppose a piece of documentation on participating in Clojure data abstractions is missing. Many programmers can benefit from a map of core protocols and their relations. Basically, we need a mapping {:clojure.core.function [:Protocol :method]}. Unfortunately, RT.java is full of special cases, and I don't know how to represent them visually. It may become a mapping of {[:function :type] [:Protocol :method]}.