v1.0 - ATT SCCS compatible

ATT SCCS had the concept of a fork — two deltas with the same parent, but no merge. While it is baffling why they didn’t do graph closure, they brilliantly provided a means to include and exclude deltas.

The original BitKeeper chose to store the merge bookkeeping in a way that, if BitMover went out of business, customers could use ATT SCCS to read the data out of the files. That meant merge nodes carried a list of -i includes (and, as a result of csetprune, could also have a list of -x excludes).

Early BK made a list of all nodes that weren’t in the ancestry of the trunk node. This is not the minimal list; it includes items that don’t need to be included. If a merge node was in the include list, then its own includes counted as well.

That led to the compressmap() function, which computed the minimum set.
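To make the two lists concrete, here is a toy model in Python (invented names, not BitKeeper code) contrasting the early-BK list with a compressmap()-style minimum:

```python
def closure(parent, d):
    """Serials in d's parent-lineage (d itself included)."""
    out = set()
    while d:
        out.add(d)
        d = parent.get(d)
    return out

# Toy graph: 1 <- 2 <- 4 is the trunk; 1 <- 3 <- 5 is a branch.
# Delta 4 merges the branch whose tip is 5.
parent = {2: 1, 3: 1, 5: 3, 4: 2}

trunk = closure(parent, 2)                   # {1, 2}
full = sorted(closure(parent, 5) - trunk)    # early BK list: [3, 5]

# compressmap()-style minimum: drop entries already covered by the
# closure of another entry (3 is an ancestor of 5, so 5 suffices).
minimal = [d for d in full
           if not any(d in closure(parent, e) and d != e for e in full)]

print(full, minimal)    # [3, 5] [5]
```

The full list carries 3 even though including 5 already pulls it in; that redundancy is what compressmap() removed.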

But if you step back, this is still all quite brain dead. The bookkeeping should be just a list of parents, with include and exclude being used for a separate idea of set inclusion and exclusion.

The puzzle: how to map the ATT form to a new BK form?

Csetprune did a graph to graph translation by transforming to a set compression, then pruning the graph, then transforming back. This keeps the set generated by each node the same.

However, in this case that is necessary but not sufficient: we also need to be able to reverse the transform and have it compute identical sets.

In SCCS, it is legal to include something already included or exclude something already excluded. Doing a transform to a set compression and back will drop these excess -i and -x, which can change the way serialmap computes a set.
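A toy illustration of that loss (a simplified model where each -i adds exactly one serial; real serialmap semantics are richer):

```python
def roundtrip(parent_set, includes):
    """Encode an include list as a symmetric difference against the
    parent's set, then decode it back to an include list (toy model:
    each -i adds exactly one serial)."""
    full = set(parent_set) | set(includes)
    return sorted(full ^ set(parent_set))

# Serial 2 is already in the parent's set, so including it again is
# legal SCCS, but it vanishes in the compressed form:
print(roundtrip({1, 2}, [2, 3]))   # [3] -- the excess -i 2 is dropped
```

The stored list [2, 3] and the regenerated list [3] produce the same set here, but a bookkeeping form that cannot reproduce the excess entry is not a faithful round trip.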

Theory

We could augment symdiff format to track duplicate entries.

Conditions for the transforms
Given the forward and backward transform:
BK = f(ATT)
ATT = f'(BK)

For each d in the graph, these need to be true:
  1. serialmap(BK(d)) == serialmap(ATT(d))
  2. serialmap(ATT(d)) == serialmap(f'(BK)(d))
  3. serialmap(BK(d)) == serialmap(f(ATT)(d))
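Read as a property test, with toy stand-ins where a graph is just a map from delta to the set serialmap would produce (not the real data structures):

```python
def holds(att, f, f_inv):
    """Check conditions 1-3 for every delta d, treating each graph
    as a map d -> serialmap(d)."""
    bk = f(att)
    back = f_inv(bk)
    return all(bk[d] == att[d]          # 1: f preserves the sets
               and back[d] == att[d]    # 2: f' restores them
               and f(att)[d] == bk[d]   # 3: f is deterministic
               for d in att)

att = {1: frozenset({1}), 2: frozenset({1, 2}), 3: frozenset({1, 3})}
identity = dict                         # a trivially set-preserving pair
print(holds(att, identity, identity))   # True
```

Any transform pair failing one of the three checks would silently change history as seen through serialmap.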

Unambiguous mapping between domains
SCCS <-> SYMDIFF + DUPS <-> BK

For every SCCS representation A, there is exactly one SYMDIFF + DUPS representation B, and exactly one BK representation C.

The process would be to convert every node to its symdiff + dups representation, then convert every node to bk format.

To do this, the intermediate symdiff form is augmented by a list of duplicates. The original symdiff entries are a list of serials; the enhanced entries get the serials of the duplicates with the high bit set.
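The encoding might be sketched like this (31-bit serials and the helper names are my assumptions, not the real layout):

```python
HI = 1 << 31                  # assumption: serials fit in 31 bits

def mark_dup(serial):
    """Tag a duplicate -i/-x entry by setting the high bit."""
    return serial | HI

def is_dup(entry):
    return bool(entry & HI)

def serial_of(entry):
    return entry & (HI - 1)   # strip the dup tag

# A symdiff list where serial 7 appears once normally and once as a
# recorded duplicate:
entries = [3, 7, mark_dup(7)]
print([(serial_of(e), is_dup(e)) for e in entries])
# [(3, False), (7, False), (7, True)]
```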

This is satisfying because the mapping is pure — all instances transform both ways and no data is lost.

This is not satisfying because every node needs to be translated twice, and real graphs are a strict subset of the general graph.

Improvements

The only nodes that do not need translation (and I’ll probably wind up eating my words here) are non-merge nodes with no -i or -x.

Non-merge nodes with lists could be the result of an edit -r, which can have merge nodes in the list, or at least merge nodes somehow involved in computing the set.

A faster method is to init the file twice, but that’s a lot of duplication. How about just duplicating the list of cludes offsets? Maybe that’s too much work, but that’s what was done.

Also when processing in this order, dups do not need to be saved up like they were in the augmented symdiff form. Instead, the dup list is computed, used, and reset for each item translated.
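Roughly, the per-node flow (a hypothetical helper, not the real code): the dup list is scratch state that lives only while one node is translated.

```python
def translate_node(cludes, emit):
    """Translate one node's -i/-x list (toy model). Duplicates are
    collected into a scratch list, handed to `emit`, then discarded;
    nothing is saved across nodes."""
    seen, kept, dups = set(), [], []
    for s in cludes:
        (dups if s in seen else kept).append(s)
        seen.add(s)
    emit(kept, dups)      # use the dup list for this node ...
    # ... then it simply goes out of scope: the reset is free

out = []
translate_node([3, 7, 7, 9], lambda k, d: out.append((k, d)))
print(out)   # [([3, 7, 9], [7])]
```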

Translation Performance

Well, an upgraded-repo clone of our repo takes about 4.5 seconds, and a clone + translation takes about 8.5 seconds. It’s not free. It doesn’t totally suck. It should scale roughly linearly, with bushy trees being a little worse.

Compatibility

If the receiving side of [r]clone is new, then it can receive a graph in any form and know how to deal with it. The sending side, whether a new bk, or an old bk, just sends the existing files.

If the receiving side of [r]clone is old, and the sending side is dealing with new stuff (merge bookkeeping, binary format, etc), then the sfile needs to be converted before going out in the sfio. This is done in memory so as to leave the source repository alone.

clone is a one pass deal. A request is sent to the bkd, and the bkd sends.

clone Client and Server being Old and New:
C   S
O   O   -- only supports old repos
O   N   -- server running bk_hasFeature() will say no; server will convert
N   O   -- nothing special - server sends what is there, client converts
N   N   -- nothing special - server sends what is there, client converts

rclone is a two-pass protocol: the client does a sanity-check round with the server before sending it an sfio. Here the client doesn’t tell the server anything about the local state of the sfiles, but waits to see what the server can understand. It then uses the same logic as clone to decide who converts.

rclone Client and Server being Old and New:
C   S
O   O   -- only supports old repos
O   N   -- nothing special - client sends what is there, server converts
N   O   -- client running bkd_hasFeature() will say no; client will convert
N   N   -- nothing special - client sends what is there, server converts
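Both tables collapse to one rule, sketched below (the function name is mine): a new receiver copes with any form on receipt; otherwise a new sender converts before shipping.

```python
def who_converts(sender_is_new, receiver_is_new):
    """Which side converts the sfiles for [r]clone, per the tables
    above. In clone the server is the sender; in rclone the client
    is. Old/old pairs only support old repos anyway."""
    if receiver_is_new:
        return "receiver"     # a new receiver handles any form
    if sender_is_new:
        return "sender"       # convert in memory before the sfio
    return None               # old/old

print(who_converts(True, False))   # sender
```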