A couple of complications arise when trying to apply the rsync algorithm to a file system, however. First, rsync’s choice of F′ based on filename is too simple. For example, when editing file foo, emacs creates an auto-save file named #foo#. RCS uses even less suggestive temporary file names such as 1v22825. Thus, the recipient would have to choose F′ using something other than file names. It might select F′ based on a fixed-size “sketch” of F, using Broder’s resemblance estimation technique. However, even ignoring the additional cost of this approach, sometimes F can best be reconstructed from chunks of multiple files; consider ar, which outputs software libraries containing many object files.

3.1.1 LBFS Solution

In order to use chunks from multiple files on the recipient, LBFS takes a different approach from that of rsync. It considers only non-overlapping chunks of files and avoids sensitivity to shifting file offsets by setting chunk boundaries based on file contents, rather than on position within a file. Insertions and deletions therefore only affect the surrounding chunks. Similar techniques have been used successfully in the past to segment files for the purpose of detecting unauthorized copying.

To divide a file into chunks, LBFS examines every (overlapping) 48-byte region of the file and with probability 2^-13 over each region’s contents considers it to be the end of a data chunk. LBFS selects these boundary regions, called breakpoints, using Rabin fingerprints. A Rabin fingerprint is the polynomial representation of the data modulo a predetermined irreducible polynomial. We chose fingerprints because they are efficient to compute on a sliding window in a file.
When the low-order 13 bits of a region’s fingerprint equal a chosen value, the region constitutes a breakpoint. Assuming random data, the expected chunk size is 2^13 = 8192 = 8 KBytes (plus the size of the 48-byte breakpoint window). As will be discussed in Section 5.1, we experimented with various window sizes and found that 48 bytes provided good results (though the effect of window size was not huge).

Figure 1 shows how LBFS might divide up a file and what happens to chunk boundaries after a series of edits. Part a shows the original file, divided into variable-length chunks with breakpoints determined by a hash of each 48-byte region. Part b shows the effects of inserting some text into the file. The text is inserted in chunk c4, producing a new, larger chunk c8. However, all other chunks remain the same. Thus, one need only send c8 to transfer the new file to a recipient that already has the old version. Modifying a file can also change the number of chunks. Part c shows the effects of inserting data that contains a breakpoint. Bytes are inserted in c5, splitting that chunk into two new chunks c9 and c10. Again, the file can be transferred by sending only the two new chunks.
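
The chunking scheme described above can be illustrated with a short Python sketch. It is not the LBFS implementation: for brevity it uses a simple polynomial rolling hash in place of a true Rabin fingerprint, and the constants BASE, MOD, and MAGIC, as well as the helper names chunk and chunks_to_send, are assumptions made for illustration. Only the 48-byte window and the 13-bit test come from the text. SHA-1 digests serve here merely as a convenient way to compare chunks in the demo.

```python
import hashlib
import random
from typing import List

WINDOW = 48              # sliding-window size in bytes, as in the text
MASK = (1 << 13) - 1     # selects the low-order 13 bits of the hash
MAGIC = 0x0078           # hypothetical "chosen value" marking a breakpoint
BASE = 257               # assumed rolling-hash base (not from LBFS)
MOD = (1 << 61) - 1      # assumed modulus, a Mersenne prime (not from LBFS)
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window


def chunk(data: bytes) -> List[bytes]:
    """Split data at content-defined breakpoints."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the byte that just left it.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        # A full window whose low 13 bits match MAGIC ends a chunk.
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last breakpoint
    return chunks


def chunks_to_send(old: bytes, new: bytes) -> List[bytes]:
    """Return the chunks of new that do not occur among the chunks of old."""
    known = {hashlib.sha1(c).digest() for c in chunk(old)}
    return [c for c in chunk(new) if hashlib.sha1(c).digest() not in known]


# Demo: insert a few bytes into the middle of a pseudo-random file.
rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(200_000))
new = old[:100_000] + b"some inserted text" + old[100_000:]
pending = chunks_to_send(old, new)
print(len(chunk(new)), "chunks in new file;", len(pending), "must be sent")
```

Because the hash depends only on the last 48 bytes, every breakpoint whose window lies wholly before or after the insertion is found again in the new file, so only the chunk or chunks around the edit differ from those the recipient already holds.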