From email to dev on Mar 6, 2008:
The fslayer code has morphed alot since we first started thinking about it on Friday morning. I thought it would be a good idea to list out where we started and how it has changed.
Many of these ideas and thoughts came from Larry…
Much of the discussion is high level with details and file formats omitted.
Xilinx (big repos and fragmentation) The initial discussion centered around Xilinx and their repos with 1.3 million files and how to make bk scale in that environment. It was observed that in checkout:edit mode we will have a ~40 byte pfile on disk for every sfile. We also have little dfiles, zfiles, cfiles, etc… Because of block allocation these files will consume a surprisingly large amount of space on the disk. (~42% of the size of the sfiles on linux-2.6) Also on NFS every operation on one of those files will be a couple round trip network packets.
So we considered what if all of those files were collapsed into single file on the disk and bk accessed the data directly. To quote Jobs: Boom!
The way to do this is to wrap all file system calls with a layer of code to intercept and redirect them to our own code. This layer has already been written as a demonstration.
sfile blobs and append only The next thought was that if collapsing the X.files was such a big win what about collapsing all the sfiles into a single blob under the BitKeeper directory. The immediate benefit is that the SCCS directories are removed from among the user’s data. Next is that fragmentation will again allow a significant diskspace savings. (~15% in our tree and ~20% in linux-2.6)
And for simplicity and stability we can make those sfile blobs append-only where new sfiles are added to the end without deleting the old versions. The data can be recompacted in check.c occasionally when the unused space reaches a certain threshold. (The assumption is that only a small percentage on files will have changed in recent history. Unfortunately the largest file changes the most often. Chaining to the rescue!) Add an external index for the blobs and lookups would be very fast.
stdio and hidden compression With the recent stdio changes we can now compress the whole sfile when it is written to the blobs and uncompress as they are read without having to change any code in slib.c. And with a streaming or block based checksum we will remove the current code that has to read the whole sfile first to validate the file checksum before rereading to parse the data.
For environments will slow disks (NFS) or repos bigger than memory these changes will help performance.
transactions and nested Next came the observation that we need to write a transaction layer for nested so repo updates can be reverted after a failure occurs in the middle of a collection of nested pulls. If sfiles updates just append to a blob we just need to remember the size of the blob(s) at the start of our update and then roll back to that point if the update doesn’t work out. (Not counting all those details about tracking gfiles and renames and such.)
So if we built the fslayer/blob code all our transaction support drops out for "free". So since nested is on the fast path to a release this project was given the green light.
coherency and locking The problem with collapsing a bunch of files to a single file on the filesystem is that bk typically has a collection of processes that all need to access the data at the same time. This means we need some method to assure coherency. We can lock the files and only allow one process to access them at a time, but the locks need to be fine-grained or deadlock will occur. Example bk sfiles | bk edit - The sfiles will need to read the info on dfiles/pfiles to create the illusion that they are still on the disk when running getdir(), but edit will be writing new pfiles at the same time.
The only reliable lock we have is sccs_lockfile(), but that is way too heavy to run on every file operation.
We can't portably rely on shared memory and atomic in-memory locks without alot of work writing an API layer.
I considered having each process create a socket handle and write it to the disk so processes can interrupt each other to request access. That would allow fast direct access while you have the lock and a cheap way to handoff the lock between processes with contention. But it requires threading so we can run bk and sit in a select loop waiting for connections at the same time. Again not portable.
using bkd for locking So that suggests we need a 3rd party process to negotiate access to our shared resource. And since we have always wanted a persistent bkd to handle state and track repositories this was a good chance to introduce a global bkd to the system that we can use. Just start a single bkd per-user per-machine and put a pointer to it in $HOME/.bk. (We revised this so we didn’t start the bk in /, but that doesn’t matter at this point)
Then your bk process can contact the global bkd to request access to our embedded filesystem data, and the bkd will negotiate conflicts. But again to prevent deadlocks the locks need to be able to change hands between each access. So every FS access will need a round trip packet to the lock manager. (Using some time-based reservations we can reduce this traffic without risking deadlocks, but again that doesn't matter to the story.)
BTW this approach still has a number of unresolved issues when dealing with multiple users accessing the same repository...
using bkd as shared fs Once you start accessing the bkd for every access you might as well have the bkd just track the data directly and hand back the data for each file as it is requested. (or offset/len for sfiles)
This would be a lightweight NFS server. Back of the envelope calculations show that the overhead of this approach would be acceptable.
use unionfs for transactions At this point the fslayer code is looking like a reasonably complex piece of work and it still needs a chunk of more code to handle transactions. Since transactions are the main short term goal I thought about other ways to achieve the same effect.
We could use the fslayer to implement an application-level unionfs filesystem. When a new transaction is started we create a BitKeeper/transaction directory and make the fslayer redirect pathnames in the repository to look in that directory first. Writes would write to the transaction dir, and reads would read from transaction first and fallback to the repo if no file is found. Deletes would need to white-out that pathname so bk would no longer see the file. The net result is that bk would be able to write anything it wants and the repository wouldn't change at all until the transaction is committed and the data stored off to the side is moving into the repository. Since all state is stored on the filesystem we shouldn't have coherency problems between processes and all bk processes can see the same view of the filesystem.
Here the code for transactions is small and the overhead is tiny. And we can still add the data file collapsing in the future. BTW with this approach only syscall that take a pathname need to be wrapped and stuff like read() and write() work unmodified.
problems with triggers and returning to shell Note that other programs not linked with bk’s fslayer code would not see changes in the repository. This is a big issue for triggers. Consider triggers called out of resolve…
Also programs that bk calls like GNU diff would either need to be linked with the fslayer stuff or called explicitly with code to generate the 'real' pathname to a file on the disk.
reverse union and transactions An alternative for unionfs idea is to reverse it. Just have bk read and write the repository directly, but hook calls to unlink, rename and open-for-write, to save copies the files being modified off to the side. Then we hook even less syscalls, but have the data to revert the repository whenever we want. In this approach non-bk commands like diff and triggers continue to work, but modifications by these non-bk commands won’t be noticed.
This has big appeal for simplicity.