More Issues about Formal Classes and Methods

These are some miscellaneous points on which we (the community interested in extending programming in the S language) may want to reach some consensus. Some apply only to the R implementation; others more generally.

Documenting Methods and Classes

The chapter in the ``green book'' (Programming with Data; Springer, 1998) on documentation assumes S documentation objects and SGML-based documentation; until some further work is done, this is not directly usable.

The current implementation of special documentation in R is based on a mapping from type-topic pairs into a single string, the value of

   topicName(type, topic)

This corresponds to the binary version of the ? operator (Programming with Data, page 358):

   class ? track

looks for the topic topicName("class", "track"). At the moment the actual topic would be "track-class". The string convention is used in two places:

- as the name to go into an \alias entry; and
- in the (default) name of the file where the documentation is written; e.g., "track-class.Rd".

The choice of "-" as the separator is based on the notion that it's unlikely to be part of a class or method name and is valid as part of a file name, on Unix/Linux and Windows at least.
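As a rough illustration of the naming convention, the sketch below builds the topic string from a type and a topic. The function name topicNameSketch is hypothetical, chosen so as not to suggest that this is the package's own definition of topicName; it only reproduces the "track-class" example given above.

```r
# Hypothetical stand-in for topicName(); not the methods package's own code.
# Joins the topic and the type with "-" to form a single documentation topic.
topicNameSketch <- function(type, topic) {
    paste(topic, type, sep = "-")
}

classTopic <- topicNameSketch("class", "track")   # topic for the class "track"
docFile <- paste(classTopic, "Rd", sep = ".")     # default documentation file name
```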

The planned sequence for generating class and method documentation, using the current Rd format, is similar to that for functions. The programmer implements a class or some methods for a function, and then creates a shell of the documentation by the corresponding prompt function:

    promptClass("track")
    promptMethods("plot")

The output of these would currently go to files "track-class.Rd" and "plot-methods.Rd" respectively.
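A session along the following lines would exercise both prompt functions. This is a sketch only: the slots of class "track" and the generic plotData are made up for illustration, and the filename arguments assume the signatures found in later releases of the methods package, where the destination can be given explicitly. Details may differ in the version discussed here.

```r
library(methods)

# Illustrative class and generic; the names and slots are assumptions.
setClass("track", representation(x = "numeric", y = "numeric"))
setGeneric("plotData", function(object, ...) standardGeneric("plotData"))
setMethod("plotData", "track", function(object, ...) invisible(object))

# Write the documentation shells into a scratch directory.
docDir <- tempfile("docs")
dir.create(docDir)
classFile <- file.path(docDir, "track-class.Rd")
methodsFile <- file.path(docDir, "plotData-methods.Rd")

promptClass("track", filename = classFile)
promptMethods("plotData", filename = methodsFile)
```

The resulting files are shells: the programmer is expected to edit them by hand before they become useful documentation.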

The current documentation for both classes and methods conforms to the existing Rd format. In particular, the special sections, such as that for documenting slots of a class, are handled by the standard \section command. Eventually, it would be better to have documentation files specialized to classes and methods, allowing easier checks of the documentation against the currently defined objects. Having specialized documentation would improve the prompt functions as well, since they currently have much too specific an idea of how the documentation is to be rendered. But good specialized documentation really requires the much-desired change in the overall documentation strategy to XML or some other structured format.

Another issue for consideration is the organization of class and method documentation in relation to an overall project. The general notion behind current software (e.g., promptMethods) is that the programmer may want either to document a function and its methods together, or to document separately some additional methods.

If I own a package that contains the definition of a function, say plotData, along with some methods for that function, I may want to document function and methods together. If you then define another package that requires mine and in that package add some more methods for plotData, the likely scenario is that you will document your methods separately, with a link to my documentation of the function itself. None of this is enforced, but the addTo argument to promptMethods is designed to set up either version.

On the other hand, promptClass also generates a list of the locally-defined methods involving the class. If I followed the design style above, I would likely put a link to the documentation of the corresponding function into the entries in the class documentation. Again, nothing is enforced, and it should be possible to have the other philosophy: put the method documentation into the relevant class documentation file. Philosophically, this is more an OOP approach than a function-based approach.

Representation of Objects and Slots for Formal Classes

(See also the discussion of class extensions.)

This is essentially a request for a new type of object in R (unless there is a solution within the current implementation not so far uncovered).

An important subset of formal classes either extend one of the basic vector types, such as "character", or extend the notion of a structure in S (as would, for example, a formal definition of matrix or time-series as a class). For these classes, one would like to inherit much of the behavior of S3 for vectors and structures. The present implementation satisfies this fairly well, as far as current experience has tested. The prototype of objects in these classes is an object of one of the ``basic classes''. Most existing non-class-based code for such objects should continue to work for classes that extend basic classes.
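A small example may make the intended behavior concrete. The class "money" below is invented for illustration, and the contains= form of setClass is assumed to be available; the point is only that ordinary vector code keeps working on an object whose class extends "numeric".

```r
library(methods)

# "money" is a made-up class that extends the basic class "numeric".
setClass("money", contains = "numeric")
m <- new("money", c(1, 2, 3))

n <- length(m)        # vector behavior inherited from "numeric"
second <- m[2]        # default subsetting applies to the numeric data
doubled <- m * 2      # arithmetic uses the numeric values
total <- sum(m)       # so do summary functions
```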

Other formal classes are defined only in terms of their slots, and they should not behave like vectors, unless the designer of the class defines appropriate methods. Nothing in the class definition says that objects from such classes should inherit primitive functions for subsetting, arithmetic, etc.

The API concepts corresponding to this are fairly simple:

- If the class extends one of the basic vector classes, it inherits methods for arithmetic, subsetting, and other relevant functions from that basic class.
- If the class extends class vector or class structure, it likewise should have methods pre-defined for the vector-like operations. (There are a couple of issues here about what happens with S3-like class/attribute structures. The behavior is mostly, but perhaps not entirely, what we would want. It may be that the case of a formally defined class will need to be detected in the current base implementation and dealt with specially. But this seems likely to be fairly easy.)
- If the class does not extend vector directly or indirectly, it should not permit vector-like operations unless these are given explicit definitions.

The current implementation of the methods package does not implement the last item. Vector operations return results (generally meaningless) if applied to non-vector classes. The reason is that the prototype object from the class is a vector (specifically, an empty list).

What one would like is an object for which all vector-like operations are generally undefined. For example, if t1 is an object from some non-vector class, for which no subsetting methods are defined, an error should result in situations like:

    t1[[1]]
    Error in t1[[1]] : object is not subsettable

In the current implementation, the operation falls through to an operation on an empty list, which may or may not fail.
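Until such an object type exists, a class designer can approximate the desired behavior by defining the failing methods explicitly. The following is a minimal sketch under that assumption; the class "trackOnly" and the wording of the error are illustrative, and one such method would be needed for each vector-like operation to be ruled out.

```r
library(methods)

# A made-up class defined only in terms of its slots.
setClass("trackOnly", representation(x = "numeric", y = "numeric"))

# Explicit method: subsetting by [[ is declared an error for this class.
setMethod("[[", "trackOnly", function(x, i, j, ...) {
    stop("object is not subsettable")
})

t1 <- new("trackOnly", x = c(1, 2, 3), y = c(4, 5, 6))

# Capture the error message rather than stopping the session.
msg <- tryCatch(t1[[1]], error = function(e) conditionMessage(e))
```

This is clearly a workaround rather than a solution: it relies on the designer anticipating every operation that should fail.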

There are non-vector data types in R, but it appears that none of them are suitable as a prototype. The two candidates that might be plausible are environment and closure. But environments are references and do not get duplicated. So ordinary assignment,

  y <- x

will not work according to S semantics. Changes to y will be reflected in x. Closures are partially duplicated, but attempts to use them as prototypes so far seem to fail (in less obvious ways). In addition, they are only partially non-vectors: subsetting fails but length does not.
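The difference in assignment semantics is easy to demonstrate. The sketch below contrasts a list, which behaves according to S semantics, with an environment, which does not; the variable names are arbitrary.

```r
# A list follows S semantics: after assignment, y is independent of x.
x <- list(value = 1)
y <- x
y$value <- 2
listValue <- x$value                    # x is unchanged

# An environment is a reference: both names refer to the same object.
ex <- new.env()
assign("value", 1, envir = ex)
ey <- ex
assign("value", 2, envir = ey)
envValue <- get("value", envir = ex)    # the change made through ey shows up in ex
```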

Tightening up the Object/Class Model

In the initial implementation, the objects representing classes (with class "classRepEnvironment") did not themselves come from a formally defined class; instead, properties in these objects were dealt with in a special way.

It is fairly obvious that the model for a language should apply as uniformly as possible. Exceptions tend to be made for efficiency of basic computations, for better or worse, but otherwise we would like as few special cases as possible. Such uniformity applies particularly for an implementation of the S language, in which much of the computation on the language can be done in the language itself.

The revision in the previous section makes it possible to implement class objects as a true class. Aside from the philosophical desirability, having a true class opens up the possibility of extending that class later on. The bootstrap process for generating the methods package also simplifies substantially: we basically just need to create the initial definition of the ``class class'' and other computations are largely within the model.
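In releases of R where class objects are themselves instances of a formal class, the uniformity can be observed directly. The sketch below assumes such a release; the class name "classRepresentation" is the one used there, not necessarily in the implementation described here, and the class "track" with its slots is again illustrative.

```r
library(methods)

setClass("track", representation(x = "numeric", y = "numeric"))

def <- getClass("track")                     # the class object for "track"
isFormal <- is(def, "classRepresentation")   # it comes from a formal class itself
trackSlots <- names(def@slots)               # its properties are ordinary slots

# The class that describes class objects is described by an object of that same class.
metaDef <- getClass("classRepresentation")
selfDescribing <- is(metaDef, "classRepresentation")
```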

Methods and Conforming Argument Lists

As discussed recently on the r-devel list, there are cases where it would be useful to allow methods defined on a subset of the arguments of the generic (chiefly for subsetting operators).

The first step in implementing this is simple, given the general model. A method is allowed if its argument list conforms to the arguments of the generic in the following sense:

A method conforms to its generic function if the formal arguments of the method are a subset of those in the generic, appearing in the same order, and if the omitted arguments in the method do not appear in the signature associated with the method. In this case, the method signature is treated as if it were augmented by class "missing" for each of the omitted arguments.

Note that with this definition, the extension to conforming arguments only affects method specification, not method dispatch.
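The definition can be written down as a small check. The function conformingSignature below is hypothetical and is not part of the methods package; it assumes the signature is supplied as a character vector named by generic arguments, and it leaves "..." out of the signature since that argument is not used for dispatch.

```r
# Hypothetical helper illustrating the conformance rule; not package code.
# genericArgs, methodArgs: formal argument names of the generic and the method.
# signature: character vector of classes, named by generic arguments.
conformingSignature <- function(genericArgs, methodArgs, signature) {
    pos <- match(methodArgs, genericArgs)
    if (any(is.na(pos)))
        stop("method has arguments that are not in the generic")
    if (is.unsorted(pos, strictly = TRUE))
        stop("method arguments do not appear in the generic's order")
    omitted <- setdiff(genericArgs, methodArgs)
    if (any(omitted %in% names(signature)))
        stop("an omitted argument appears in the signature")

    # Build the full signature in the generic's order, ignoring "...".
    dispatchArgs <- setdiff(genericArgs, "...")
    full <- rep("ANY", length(dispatchArgs))
    names(full) <- dispatchArgs
    full[names(signature)] <- signature
    full[setdiff(omitted, "...")] <- "missing"   # augment for omitted arguments
    full
}

# A subsetting method defined on (x, i) only.
sig <- conformingSignature(c("x", "i", "j", "...", "drop"),
                           c("x", "i"),
                           c(x = "track", i = "numeric"))
```

Because the result is an ordinary complete signature, nothing in the dispatch computation needs to know that the method was written with a shorter argument list.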

The plan is to modify the rule for setting methods so that conforming but not identical argument lists are allowed (there will probably continue to be a message noting the extended interpretation of the signature). Non-conforming method definitions will likely become errors, rather than the current warning (but this should get some discussion).

Representing Class Extensions and Subclasses

R saves an image only of the global environment. Therefore, all information about classes must reside in this environment, in order to be saved and reloaded correctly in a future session. But relationships between classes may involve classes whose definition is not (and should not be) local to the current global environment.

The following is the simplest example:


    setClass("numbersAndStuff") ## a virtual class
    setIs("numeric", "numbersAndStuff")

The purpose here is to create a virtual class of which various actual classes will be subclasses. Slots declared to be "numbersAndStuff" are then constrained to be one of those classes.
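To see the constraint in action, one can declare a slot of the virtual class and try to fill it. The sketch below assumes a version of the package in which the setIs call above is accepted; the class "holder" and its slot are hypothetical, and the virtual class is declared with an explicit "VIRTUAL" marker only to make the intent unmistakable.

```r
library(methods)

setClass("numbersAndStuff", representation("VIRTUAL"))
setIs("numeric", "numbersAndStuff")

# Hypothetical class with a slot constrained to the virtual class.
setClass("holder", representation(value = "numbersAndStuff"))

# "numeric" is a subclass of the virtual class, so this object is accepted.
ok <- new("holder", value = 3.5)

# "character" is not, so creating the object fails the validity check.
bad <- tryCatch(new("holder", value = "a"),
                error = function(e) "rejected")
```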

The definition of class "numeric" resides (currently) in the methods package. Further, the user is not allowed to redefine this class, for fairly obvious reasons. The issue then is how (and where) to store the metadata recording the call to setIs. Previous versions (R 1.4.1 and earlier) of the package stored the information in the class metadata object for "numeric", regardless of where that object resides. That approach produces incorrect results if the user saves the current global environment and then restores it. The setIs information will not be restored, since it was assigned in the environment of the methods package.

There appear to be two approaches that alleviate the difficulty:

1. create an ``extension metadata object'' for class "numeric" in the global environment;
2. store the information in the subclass property of the metadata for class "numbersAndStuff".

The call to setIs defines a link in the graph of inter-class relations. Essentially, the prior approach always stored an ``upward-pointing'' link. The two alternatives respectively create a new kind of metadata to store the link information, and store the information (upward or downward) in whichever class is local. There are some issues with either proposed solution. Overall the first approach is slightly more general (and perhaps more efficient; see comments below), but is a major complication of the metadata structure and has some issues about class versions. The second approach is not quite as general, but is a relatively modest change to the current behavior. For that reason, we will follow the second approach for now.

There are some issues with the second approach, however. Generally, though not always, one or the other of the two classes in the setIs call will be local. If not, using subclass information clearly doesn't solve the basic problem. Should we prohibit setIs calls if the local package or environment does not own either class definition?

For the mechanism to work, however, we must be prepared to modify cached class definitions. The first solution makes it possible to search explicitly for all the extensions of a given class: they must reside in the class's own definition or in one of the extension metadata objects for that class. But the second mechanism leaves the information about some extensions in the definition of another class.

In our example, suppose class "numeric" is needed before class "numbersAndStuff" has been encountered. The methods package caches all the available information about class "numeric" when it is first needed (not caching the information would be a major performance penalty). But there is no practical way to find all the relevant subclass information at the time the class definition of "numeric" is needed: one would have to examine the subclass information for all classes currently visible. So the cached (and nominally complete) definition of class "numeric" fails to say that it extends "numbersAndStuff".

To avoid errors from caching subclasses, one needs to insert new link information into the cached version, once that information is available. In the example, when we do need the definition of class "numbersAndStuff", the code that completes that class's definition must insert extends information into all the subclasses in the definition (and their subclasses as well). Does this ``downward completion'' resolve all the potential errors? It seems to, in the sense that before we can need the information that "numeric" extends "numbersAndStuff" surely we must have encountered class "numbersAndStuff", and the methods package is designed to complete a class definition as soon as that class is needed. This is, however, a distinctly heuristic argument, waiting for counter-examples!
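The idea of downward completion can be mimicked in a toy model. Everything in the sketch below is hypothetical: class metadata is reduced to two character vectors, the cache is a plain environment, and none of the function names correspond to functions in the methods package.

```r
# Toy model of stored and cached class metadata; every name here is hypothetical.
stored <- list(
    numeric         = list(contains = character(), subclasses = character()),
    numbersAndStuff = list(contains = character(), subclasses = "numeric")
)
cache <- new.env()

# Return the cached definition, caching the stored one on first use.
cachedDef <- function(name) {
    if (!exists(name, envir = cache, inherits = FALSE))
        assign(name, stored[[name]], envir = cache)
    get(name, envir = cache, inherits = FALSE)
}

# Insert the link "sub extends super" into the cache, then continue downward.
pushDown <- function(super, sub) {
    def <- cachedDef(sub)
    def$contains <- union(def$contains, super)
    assign(sub, def, envir = cache)
    for (s in def$subclasses) pushDown(super, s)
}

# Completing a class pushes its name into all of its subclasses.
completeClass <- function(name) {
    def <- cachedDef(name)
    for (sub in def$subclasses) pushDown(name, sub)
    invisible(def)
}

before <- cachedDef("numeric")$contains   # cached early: the link is missing
completeClass("numbersAndStuff")
after <- cachedDef("numeric")$contains    # link inserted by downward completion
```

The model also shows where the heuristic argument could break down: any computation that consults the cached definition of "numeric" between the two steps sees the incomplete version.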

The first approach, the use of extension metadata objects, would be a change in the organization of the metadata. It certainly affects more aspects of the package than using subclass information: the process of collecting extension information or subclass information is now decentralized. To implement the solution generally, the code that completes a class definition would need to search for all the extension metadata objects for that class in the current search list, and merge the result into the cached definition. Intuitively, it seems both desirable and straightforward to make the link information fully symmetric in this approach. That is, a setIs call (or the same implication within a class definition) would provide extends information for both the classes: upward-pointing for the first and downward-pointing for the second. It's a straightforward example of doubly-linked list information, and once we agree to store links explicitly, why not be complete? Having this information seems likely to make class completion somewhat more efficient.
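A symmetric link store of this kind might look roughly as follows. The structure and the function recordIs are hypothetical, intended only to show that one setIs-like call can record both directions at once.

```r
# Hypothetical link store: upward ("extends") and downward ("subclasses") links.
links <- list(extends = list(), subclasses = list())

# Record that class1 extends class2, in both directions.
recordIs <- function(links, class1, class2) {
    links$extends[[class1]] <- union(links$extends[[class1]], class2)
    links$subclasses[[class2]] <- union(links$subclasses[[class2]], class1)
    links
}

links <- recordIs(links, "numeric", "numbersAndStuff")

upward <- links$extends[["numeric"]]                # superclasses of "numeric"
downward <- links$subclasses[["numbersAndStuff"]]   # subclasses of "numbersAndStuff"
```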

Separating the link information from the rest of the class definition does further fragment the metadata. In principle there isn't anything radically different needed to keep the information up to date in the two possible solutions. But can link information exist in a package that owns neither of the class definitions? How could we know if the information is correct? (So we don't really escape the similar problem for the other solution.)

John Chambers <jmc@research.bell-labs.com>

Last modified: Fri Feb 1 14:35:16 EST 2002