Remember all the places something exists

URLLIST

The idea behind URLLIST is to 'remember' all the locations to which we have connected and use them either to find a component that we need to populate (nested_populate()) or just to check whether a component is available in some remote location (urllist_find()).

What follows is a brief description of the URLLIST API, the on-disk file formats it uses, and how nested_populate() and urllist_find() make use of them.

The problem of finding which components a given URL has and on which URLs a given component can be found is inherently a ''many-to-many'' problem (or n x m where n is the number of URLs and m is the number of components).
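To make the two directions of the relation concrete, here is a minimal sketch that stores it as an n x m table and answers both questions by scanning a row or a column. The table contents and the names `toy_comps_at_url()` and `toy_urls_with_comp()` are invented for illustration; BitKeeper itself keeps this information in hashes, not in a fixed matrix.

```c
#define	NURLS	3
#define	NCOMPS	4

/* has[u][c] != 0 means URL u was seen to hold component c (toy data). */
const unsigned char has[NURLS][NCOMPS] = {
	{ 1, 1, 0, 0 },
	{ 0, 1, 1, 0 },
	{ 1, 0, 0, 0 },
};

/* How many components does URL u have? (one row of the relation) */
int
toy_comps_at_url(int u)
{
	int	c, n = 0;

	for (c = 0; c < NCOMPS; c++) {
		if (has[u][c]) n++;
	}
	return (n);
}

/* On how many URLs can component c be found? (one column) */
int
toy_urls_with_comp(int c)
{
	int	u, n = 0;

	for (u = 0; u < NURLS; u++) {
		if (has[u][c]) n++;
	}
	return (n);
}
```

Either question costs a full scan here, which is one reason to keep both a per-component view and a per-URL view of the same data.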

On-Disk File Format

For backward compatibility reasons, and also to avoid duplicating information, we keep the information in two separate files.

BitKeeper/log/urllist: a KV file that contains component rootkeys as keys and, as values, all the URLs on which we have seen that component. There is no other information in this file. Occasionally you might see a timestamp at the end of a stored URL; older versions of BK (bk-5.1) stored one there. That timestamp is deprecated.
BitKeeper/log/urlinfo: a KV file that contains URLs as keys and information about each URL as data, one field per line. BitKeeper reads the fields it knows about and ignores any extra fields. The order of the fields must be preserved.
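As a rough illustration of the "read the known fields, ignore the rest" rule, here is a minimal sketch of parsing one urlinfo value. The field order (time, gate, repoID) is taken from the numbered comments in the urlinfo struct later in this document; the `toy_info` struct, the function `toy_parse_urlinfo()`, and the textual encoding of each field are assumptions made for the example, not BitKeeper's actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one urlinfo value. */
typedef struct {
	long	time;		/* field 1: time of last successful connection */
	int	gate;		/* field 2: is it a gate? */
	char	repoID[128];	/* field 3: repository id */
	int	nextra;		/* count of trailing fields we do not parse */
} toy_info;

/*
 * Parse a newline-separated value. Known fields are read by position;
 * anything after field 3 is only counted, never interpreted.
 * Returns the number of known fields that were present.
 */
int
toy_parse_urlinfo(const char *val, toy_info *out)
{
	int	field = 0, known = 0;
	const char *p = val;

	memset(out, 0, sizeof(*out));
	while (*p) {
		size_t	len = strcspn(p, "\n");

		switch (++field) {
		    case 1:
			out->time = strtol(p, 0, 10);
			++known;
			break;
		    case 2:
			out->gate = (atoi(p) != 0);
			++known;
			break;
		    case 3:
			if (len >= sizeof(out->repoID)) {
				len = sizeof(out->repoID) - 1;
			}
			memcpy(out->repoID, p, len);
			out->repoID[len] = 0;
			++known;
			break;
		    default:
			out->nextra++;	/* unknown field: skip it */
			break;
		}
		p += len;
		if (*p == '\n') ++p;
	}
	return (known);
}
```

Because fields are identified only by position, a newer writer can append fields without breaking an older reader, but nobody can reorder or remove the existing ones.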

Commands which use and/or alter state

Whenever a urllist is fetched from a remote source (clone, pull), its URLs are translated using the known URL of that remote. So a remote file URL may get translated to a bk:// URL if the remote was accessed through a bk:// URL.

clone - brings in urllist but not urlinfo.
rclone - updates local urllist and copies to remote
push - updates local urllist
pull - pulls the urllist from the remote and integrates it, considering only the remote
populate - can clean out urllist
unpopulate - without -f, will probe and clean the urllist.

Complexity in a pull

cd A; bk pull ../B

A typical component, say comp1, can vary along several roughly independent dimensions:

 * locally populated, unpopulated, or nonexistent
 * remotely populated or unpopulated
 * has remote changes or not
 * has local changes or not
 * the local repo uses an alias whose change causes a populate, an unpopulate, or no change
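To get a feel for the size of this state space, the dimensions listed above can be multiplied out. The sketch below just counts the combinations, treating the dimensions as fully independent (the text only says "roughly"), so the result is an upper bound: some combinations, such as a locally nonexistent component with local changes, cannot occur. The enum and function names are made up for illustration.

```c
/* One enum per multi-valued dimension; the names are illustrative only. */
enum local_state  { L_POPULATED, L_UNPOPULATED, L_NONEXISTENT, L_COUNT };
enum remote_state { R_POPULATED, R_UNPOPULATED, R_COUNT };
enum alias_effect { A_POPULATE, A_UNPOPULATE, A_NOCHANGE, A_COUNT };

/* Count every combination of the five dimensions by brute force. */
int
toy_count_pull_states(void)
{
	int	l, r, rchg, lchg, a, n = 0;

	for (l = 0; l < L_COUNT; l++) {
		for (r = 0; r < R_COUNT; r++) {
			/* rchg: remote changes or not; lchg: local changes or not */
			for (rchg = 0; rchg < 2; rchg++) {
				for (lchg = 0; lchg < 2; lchg++) {
					for (a = 0; a < A_COUNT; a++) n++;
				}
			}
		}
	}
	return (n);
}
```

That is 3 x 2 x 2 x 2 x 3 = 72 states per component before any pruning, which gives a sense of why a case-by-case analysis of pull is hard to do by hand.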

The urllist is pulled in from the remote, interpreted from a local perspective, and then checked to see whether the pull can be considered safe.

''NOTE: the saved urllist can be wrong.'' The probes done during the safe and populate parts of a pull use the remote namespace, which is correct for the components getting populated, but not correct for components with local changes that are not involved in the pull. The havekeys will cause components that have local changes relative to the pull, but which match a known gate, to have the gate pulled from the urllist.

Now, if something has local and remote changes, the final resolve will cause the component to get flushed from the urllist.

If a remote is populated and the local is not, then the component is part of the pull, but not part of what is actually pulled, so it will be part of the pull-safe test. If the whole pull succeeds, then the urlinfo will be right. If the pull fails, then the urlinfo will represent the remote level, which is a superset of the local level (otherwise the pull would have failed, right?), and that will be fine.

[rick] Has anybody done this type of state space analysis of pull and other operations? The more I understand, the more holes I see, and I see that some of my earlier suggestions were misdirected.

Data Structures

The data structures that support the on-disk format are slightly different and have been designed to be easier to work with than the on-disk hashes.

{{{#!cplusplus numbers=off
typedef struct {
	// normalized url, output of remote_unparse() with any leading
	// file:// removed
	char	*url;

	// map comp struct pointers to "1" if remote has the needed
	// component tipkey or "0" if they have the component, but not
	// the tipkey.
	hash	*pcomps;	/* populated components found in this URL */
	u32	checked:1;	/* have we actually connected? */
	u32	checkedGood:1;	/* was URL probe successful this time? */
	u32	noconnect:1;	/* probeURL failed for connection problem */

	// From URLINFO file, extras are ignored
	time_t	time;		/* 1 time of last successful connection */
	int	gate;		/* 2 is it a gate?*/
	char	*repoID;	/* 3 */
	char	**extra;	/* extra data we don't parse */
} urlinfo;
}}}

Besides the new urlinfo structure, a couple of fields have been added to the nested structure.

{{{#!cplusplus numbers=off
struct nested {
	/* fields elided */
	// urllist
	u32	list_loaded:1;	// urllist file loaded
	u32	list_dirty:1;	// urllist data changed
	char	**urls;		// array of urlinfo *'s
	hash	*urlinfo;	// info about urls
	/* fields elided */
};
}}}

API

urlinfo_load()

Initializes the in-memory data structures from the on-disk KV files. Optionally, it can 'normalize' the URLs in the KV files with respect to some remote structure. This means that a URL of the form file://foo/bar/baz is translated to bk://server//foo/bar/baz. This is similar to what BAM does.
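The string rewrite in that example can be sketched as follows. This is only a model of the one transformation shown in the text, with the host name passed in as a plain string; the helper `toy_normalize()` is hypothetical, and the real code works from a parsed remote structure and handles more cases than this.

```c
#include <stdio.h>
#include <string.h>

/*
 * Rewrite a file URL as seen through a bk:// remote on `host`.
 * Hypothetical helper that mirrors the example in the text:
 * file://foo/bar/baz becomes bk://server//foo/bar/baz.
 * Returns 0 on success, -1 if url is not a file URL or out is too small.
 */
int
toy_normalize(const char *url, const char *host, char *out, size_t outlen)
{
	const char *prefix = "file:/";
	size_t	plen = strlen(prefix);
	int	n;

	if (strncmp(url, prefix, plen) != 0) return (-1);
	/* keep everything after "file:/" so the path keeps its leading '/' */
	n = snprintf(out, outlen, "bk://%s/%s", host, url + plen);
	if ((n < 0) || ((size_t)n >= outlen)) return (-1);
	return (0);
}
```

URLs that are not file URLs are rejected rather than passed through, so a caller in this sketch would keep the original string when the function returns -1.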

urlinfo_buildArray()

This is an auxiliary function that initializes the n->urls array.

urlinfo_urlArgs()

Occasionally we need to check URLs beyond those in the KV files, e.g. URLs that the user passed as arguments to some command (-@<url>). This helper function allows us to add them to the n->urls array.

urlinfo_write()

Saves the in-memory representation of the URLs to the two on-disk KV files.

urlinfo_set() and urlinfo_get()

Accessors for getting at the urlinfo struct corresponding to a given URL. They use a hash, so a lookup is O(1).

urlinfo_setFromEnv()

There are some occasions where the information about the URL is in the environment passed by the BKD. Rather than having a more complicated call to urllist_set(), this helper function digs out the information from the environment and fills in the URL information accordingly (e.g. BKD_GATE).

urlinfo_addURL() and urlinfo_rmURL()

Maintenance functions to add/remove URLs to the list.

urlinfo_probeURL()

This function establishes a connection to the BKD at the given URL and updates what we know about that URL, e.g. which components are populated on the BKD side, whether it is a gate, etc. The information is updated in the n->urls structure and the whole urllist is marked as dirty so that it can be saved.

urllist_find()

This function works as an iterator. Each time it is called, it returns a new URL for the given component. When no more URLs are known for the component, zero is returned. A typical use would be:

{{{#!cplusplus numbers=off
k = 0;
while (url = urllist_find(n, comp, flags, &k)) {
	/* code here */
}
}}}

See `populate.c` for a real example.
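For readers without the source at hand, the cursor-style iteration can be modelled in a few self-contained lines. Everything here is a stand-in: `toy_entry`, `toy_find()` and `toy_count()` are invented for the sketch, and the real urllist_find() takes the nested structure, a component, and flags rather than a flat array.

```c
#include <stddef.h>
#include <string.h>

/* Toy record: one URL and the single component it is known to hold. */
typedef struct {
	const char *url;
	const char *comp;
} toy_entry;

/*
 * Return the next URL for comp, scanning from index *k, and advance *k
 * past the match. Returns NULL (zero) when no more URLs are known.
 */
const char *
toy_find(const toy_entry *list, int n, const char *comp, int *k)
{
	while (*k < n) {
		const toy_entry *e = &list[(*k)++];

		if (strcmp(e->comp, comp) == 0) return (e->url);
	}
	return (NULL);
}

/* Count the URLs for comp by draining the iterator, as a caller would. */
int
toy_count(const toy_entry *list, int n, const char *comp)
{
	int	k = 0, hits = 0;

	while (toy_find(list, n, comp, &k)) hits++;
	return (hits);
}
```

Keeping the cursor in the caller's variable means the function holds no hidden state, so two loops over different components can be interleaved safely.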