
A couple of complications arise when trying to apply the rsync algorithm to a file system, however. First, rsync's choice of F′ based on filename is too simple. For example, when editing file foo, emacs creates an auto-save file named #foo#. RCS uses even less suggestive temporary file names such as 1v22825. Thus, the recipient would have to choose F′ using something other than file names. It might select F′ based on a fixed-size "sketch" of F, using Broder's resemblance estimation technique [4]. However, even ignoring the additional cost of this approach, sometimes F can best be reconstructed from chunks of multiple files; consider ar, which outputs software libraries containing many object files.

3.1.1 LBFS Solution

In order to use chunks from multiple files on the recipient, LBFS takes a different approach from that of rsync. It considers only non-overlapping chunks of files and avoids sensitivity to shifting file offsets by setting chunk boundaries based on file contents, rather than on position within a file. Insertions and deletions therefore only affect the surrounding chunks. Similar techniques have been used successfully in the past to segment files for the purpose of detecting unauthorized copying [3].

To divide a file into chunks, LBFS examines every (overlapping) 48-byte region of the file and with probability 2^-13 over each region's contents considers it to be the end of a data chunk. LBFS selects these boundary regions, called breakpoints, using Rabin fingerprints [19]. A Rabin fingerprint is the polynomial representation of the data modulo a predetermined irreducible polynomial. We chose fingerprints because they are efficient to compute on a sliding window in a file. When the low-order 13 bits of a region's fingerprint equal a chosen value, the region constitutes a breakpoint. Assuming random data, the expected chunk size is 2^13 = 8192 bytes = 8 KBytes (plus the size of the 48-byte breakpoint window).
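The 8 KByte figure can be recovered with a short calculation. This is only a sketch: it assumes that each 48-byte region is a breakpoint independently with probability p = 2^-13, which is an idealization because neighboring regions overlap. The symbol N (the number of regions examined up to and including the next breakpoint) is introduced here for illustration. Under that assumption N is geometrically distributed:

\[
\Pr[N = k] = (1 - p)^{k-1}\,p,
\qquad
\mathbb{E}[N] = \sum_{k \ge 1} k\,(1 - p)^{k-1}\,p = \frac{1}{p} = 2^{13} = 8192 .
\]

Since consecutive regions start one byte apart, the expected distance between breakpoints is about 8192 bytes.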
As will be discussed in Section 5.1, we experimented with various window sizes and found that 48 bytes provided good results (though the effect of window size was not huge).

Figure 1 shows how LBFS might divide up a file and what happens to chunk boundaries after a series of edits. (a) shows the original file, divided into variable-length chunks with breakpoints determined by a hash of each 48-byte region. (b) shows the effects of inserting some text into the file. The text is inserted in chunk c4, producing a new, larger chunk c8. However, all other chunks remain the same. Thus, one need only send c8 to transfer the new file to a recipient that already has the old version. Modifying a file can also change the number of chunks. (c) shows the effects of inserting data that contains a breakpoint. Bytes are inserted in c5, splitting that chunk into two new chunks c9 and c10. Again, the file can be transferred by sending only the two new chunks.
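The chunking scheme and its behavior under edits can be sketched in a few lines of Python. This is a minimal illustration, not the LBFS implementation. It swaps in a simple polynomial rolling hash for true Rabin fingerprints, and the constants BASE, MOD, and MAGIC, as well as the function names chunk and chunks_to_send, are assumptions made for the example. Only the 48-byte window and the 13-bit test come from the text.

```python
import random

WINDOW = 48             # sliding-window size in bytes, as in the text
MASK = (1 << 13) - 1    # selects the low-order 13 bits of the hash
MAGIC = 0x78            # hypothetical "chosen value" for a breakpoint
BASE = 257              # assumed base of the stand-in polynomial hash
MOD = (1 << 61) - 1     # assumed prime modulus (not a Rabin polynomial)


def chunk(data: bytes) -> list:
    """Split data into chunks that end at content-defined breakpoints."""
    chunks = []
    start = 0
    h = 0
    # Weight carried by the oldest byte in a full window.
    top = pow(BASE, WINDOW - 1, MOD)
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Test only full windows; the hash depends on the last 48 bytes alone.
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a breakpoint
    return chunks


def chunks_to_send(old: bytes, new: bytes) -> list:
    """Return the chunks of new that a holder of old does not already have."""
    known = set(chunk(old))
    return [c for c in chunk(new) if c not in known]


if __name__ == "__main__":
    rng = random.Random(0)
    old = bytes(rng.randrange(256) for _ in range(200_000))
    # Insert a short string in the middle, as in part (b) of Figure 1.
    new = old[:100_000] + b"some inserted text" + old[100_000:]
    print("chunks in new file:", len(chunk(new)))
    print("chunks to send:    ", len(chunks_to_send(old, new)))
```

Because each breakpoint decision depends only on the 48 bytes in the window, an insertion can change boundaries only near the edit. Chunks that lie entirely before or after the affected region hash to the same boundaries as before, so a recipient holding the old file needs only the few chunks around the insertion. The sketch omits anything beyond what the passage describes, such as limits on chunk size.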