File Systems
INF 551
Wensheng Wu

Roadmap
• Files and directories
  – The file system looks down at the storage system
  – CRUD operations via system calls
  – Local files: files stored locally
  – Network files: files stored on one server
  – HDFS: files stored on multiple servers (Hadoop has a name node, which stores metadata, and data nodes, which store the data)
• Implementing CRUD
  – Data structures, e.g., organization of blocks
  – Access methods: turn system calls into operations on data structures (e.g., an update is issued as a system call)

Files and directories
• File content is stored in blocks on a storage device
  – Has a user-defined name: hello.txt
  – And a low-level name, e.g., inode number: 410689
• Files are organized into directories (folders)
  – Each may have a list of files and/or subdirectories
  – That is, directories can be nested
Note: in a permission string shown in a terminal listing (e.g., rw-), every letter corresponds to one binary bit.

Example
[Figure: a directory tree showing the root directory and an empty directory]

Operations on files
• Create – open(), write()
• Read – open(), read(), lseek()
• Update – write(), lseek()
• Delete – unlink()
Note: system calls are calls to functions in the API provided by the OS. cin >>, cout << and cerr are I/O streams provided by the language (the C++ standard library); they are implemented on top of system calls.

Create
• User interface: via a GUI or the touch command in Linux
• Implementation: e.g., via a C program with a system call, open()
• int fd = open("foo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
  – Open with flags indicating the specifics (| is the bitwise OR operator)
  – O_CREAT: create the file if it does not exist
  – O_WRONLY: write only
  – O_TRUNC: remove existing contents if the file exists
  – 0644: the permission mode, which must be passed as a third argument whenever O_CREAT is given

File descriptor
• Note that open() returns a file descriptor
  – Typically a small integer
  – Reserved fds: stdin 0, stdout 1, stderr 2 (standard input, standard output and standard error of the process)

Read
Sequential
• read(fd, buffer, size)
  – Read <size> bytes from the file referred to by fd
  – And store them in buffer
• Read starts from the current offset of fd
  – Initially 0
Note: the offset of the first character is 0; the offset of the second line equals the number of characters in the first line (including the \n that ends it).

Write
Sequential
• write(fd, buffer, size)
  – Write to file fd the <size> bytes stored in buffer
  – Also starts writing from the current offset
Note: in open(filename, O_WRONLY | O_CREAT, 0644), 0644 is the permission mode written in octal: 6 is binary 110 (rw-) for the owner, 4 is binary 100 (r--) for the group, and 4 is binary 100 (r--) for others. Nobody is given execute permission.

Random read and write
• off_t lseek(int fd, off_t offset, int whence)
  – If whence is SEEK_SET, the offset is set to <offset> bytes from the beginning of the file
  – If whence is SEEK_CUR, the offset is set to its current location plus <offset> bytes
  – If whence is SEEK_END, the offset is set to the size of the file plus <offset> bytes (typically offset is negative, e.g., -8 for 8 bytes from the end)
• whence: from where
Note: lseek() is what lets reads and writes be random rather than purely sequential.

Copy a file
[Program listing from the slide not preserved; the notes below refer to it.]
• read() may return fewer bytes than the size of the buffer, so ret_in can be smaller than BUF_SIZE
• A leading "0" starts an octal number, so the mode gives permissions 110 (owner, rw-), 100 (group, r--), 100 (others, r--)
• With 10KB of data, 2 iterations are needed to copy it, since each iteration handles at most 8KB
• The buffer is passed as a pointer to a character array
• gcc filename.c -o output: -o names the output file, which can then be executed
• File permission mode rw-r--r-- => 110 (owner) 100 (group) 100 (others)

Resources for system calls
• open: (system_call)
• read: (system_call)
• write: (system_call)
• close: (system_call)
• man -S 2 read
  – Finds the entry in Section 2 of the manual

Install gcc on EC2
• sudo yum install gcc
• Usage: compiling a C file
  – gcc -o copy2 copy2.c

File and directory
• When creating a file:
  – A bookkeeping data structure (inode) is created, recording the size of the file, the location of its blocks, etc. (this is the file's metadata, used to find its location)
  – A human-readable name is linked to the file
  – The link is put in a directory

Info about file (stored in inode)
struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of (hard) links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special device file, e.g., /dev/tty) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for filesystem I/O */
    blkcnt_t  st_blocks;  /* number of blocks allocated */
    time_t    st_atime;   /* last time file content was accessed */
    time_t    st_mtime;   /* last time file content was modified */
    time_t    st_ctime;   /* last time inode was changed */
};
Execute "man -S 2 stat" for more details.
Note: when a file is created, the file system allocates a new inode, identified by a new inode number, to store the metadata.

inode
• Stores metadata/attributes about the file
• Also stores the locations of the blocks holding the content of the file

Example
• a.txt contains: abc def abc def abc def
[Figure: stat output for a.txt, labeled with device ID, access permission, block size, # of blocks allocated, inode #, user ID, group ID, # of (hard) links]
• Access time (atime): when the file was last accessed (read)
• Modify time (mtime): when the data of the file was last changed (written)
• Change time (ctime): when the metadata of the file was last changed
• ctime changes while atime and mtime do not when: 1) the file is renamed or moved to another directory; 2) its permissions are changed

noatime
• noatime is a mount option that turns off atime updates. Otherwise every read also requires updating the atime, which is costly because reads happen so often.

Hard links
Note: each partition has its own inode numbers.
Soft link (creating a shortcut is creating a soft link):
1) It points to the file (by name), not to the file's data blocks.
2) It does not share the target's data blocks.
Hard link: a.txt and b.txt (a-hard.txt) both point to the same data.
1) Hard links have the same inode number. A hard link is basically a second name for the same data.
2) The permissions of the original name and of the hard link are always the same, because both names share one inode. If they are changed through one name, they change for both.
3) Hard links are just pointers to the same data blocks. If the data is changed through one name, the change is visible through both; a hard link is just another way to access the same data block storage.
4) Simultaneous changes: whichever change is most recent wins.

Symbolic links
• ln -s a.txt a-soft.txt
• a-soft.txt -> (points to) a.txt. So, instead of pointing to the data, it points to the file for which the symbolic link was created.

Working with directories
• Create: mkdir() system call
  – Used to implement commands, e.g., mkdir xyz
• Read: opendir(), readdir(), closedir()
  – ls xyz
• Delete: rmdir()

Roadmap
• Files and directories
  – CRUD operations
• Implementation
  – Data structures: how to organize the blocks
  – Access methods: turn system calls into operations on data structures

Organization of blocks
• Array-based
  – Disk consists of a list of blocks
  – We will assume this
• Tree-based, e.g., SGI XFS
  – Blocks are organized into variable-length extents
  – Use a B+-tree to quickly find free extents

Blocks
• Consider a disk with 64 blocks
  – 4KB/block
  – 512B/sector (we assume this in this lecture)
• So there are 2^12/2^9 = 2^3 = 8 sectors/block
  – Capacity of disk = 64 * 4KB = 256KB
Note: block size = 2^12 B and sector size = 2^9 B, so sectors per block = 4KB/512B = 2^12/2^9 = 2^3 = 8. With 64 blocks of 8 sectors each, the disk has 64 * 8 = 512 sectors.
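The block arithmetic above can be sketched in C as follows. This is only a minimal illustration under the lecture's assumptions (64 blocks, 4KB per block, 512B per sector); the constant and function names are hypothetical and do not belong to any real file system API.

```c
#include <stdint.h>

/* Disk geometry assumed in the lecture; all names here are illustrative. */
enum {
    BLOCK_SIZE  = 4096, /* 4KB per block, 2^12 bytes   */
    SECTOR_SIZE = 512,  /* 512B per sector, 2^9 bytes  */
    NUM_BLOCKS  = 64    /* blocks on the example disk  */
};

/* Sectors in one block: 2^12 / 2^9 = 2^3. */
static uint32_t sectors_per_block(void) {
    return BLOCK_SIZE / SECTOR_SIZE;
}

/* Total capacity of the disk in bytes. */
static uint32_t disk_capacity_bytes(void) {
    return (uint32_t)NUM_BLOCKS * BLOCK_SIZE;
}

/* Total number of sectors on the disk. */
static uint32_t total_sectors(void) {
    return (uint32_t)NUM_BLOCKS * sectors_per_block();
}

/* First sector occupied by a given block number. */
static uint32_t first_sector_of_block(uint32_t block) {
    return block * sectors_per_block();
}
```

With these assumptions, sectors_per_block() evaluates to 8 and disk_capacity_bytes() to 262144 bytes (256KB), matching the figures on the slide.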
Data region
• 56 blocks (#8-63)
  – Used to store data/content of files
  – But see later: some blocks may store pointers
Note: this is a stand-alone file system, where metadata and data are on the same device. In HDFS, the metadata is on the name node and the data is on the data nodes.

Metadata
• For each file, the file system records its metadata
  – Information in the "stat" struct
  – Location of the blocks that store the content of the file

Metadata of files stored in inodes
• inode = index node
• Stored in blocks #3-#7 (i.e., 5 blocks)
• Together they are called the "inode table"
Note: say each inode is 256B. Then inodes per sector = 512B/256B = 2, and inodes per block = 2 * 8 = 16 (because there are 8 sectors in each block). Since 5 blocks store inodes, total inodes = number of blocks * inodes per block = 5 * 16 = 80. The file system can store at most 80 files.
Another method of calculation: inodes per block = block size / inode size = 4KB/256B = 2^12 B / 2^8 B = 2^4 = 16.

How many inodes are there?
• 256 bytes/inode
• 5 blocks, 4KB/block
Note: 1 inode = 256B; 1 sector = 2 inodes; 1 block = 8 sectors = 16 inodes. Metadata here is stored in 5 blocks, so there are 5 * 16 = 80 inodes.
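The inode-count calculation can be written down as a small C sketch. It assumes the example layout from the slides (inode table in blocks 3-7, data region in blocks 8-63, 4KB blocks) and the 256B inode size chosen in the notes; every name below is hypothetical.

```c
#include <stdint.h>

/* Layout assumed in the lecture example; names are illustrative only. */
enum {
    FS_BLOCK_SIZE        = 4096, /* bytes per block                */
    FS_INODE_SIZE        = 256,  /* bytes per inode (assumed)      */
    FS_INODE_TABLE_START = 3,    /* first block of the inode table */
    FS_INODE_TABLE_END   = 7,    /* last block of the inode table  */
    FS_DATA_START        = 8,    /* first block of the data region */
    FS_DATA_END          = 63    /* last block of the data region  */
};

/* Inodes that fit in one block: block size / inode size. */
static uint32_t inodes_per_block(void) {
    return FS_BLOCK_SIZE / FS_INODE_SIZE;
}

/* Number of blocks that make up the inode table. */
static uint32_t inode_table_blocks(void) {
    return FS_INODE_TABLE_END - FS_INODE_TABLE_START + 1;
}

/* Upper bound on the number of files the file system can hold. */
static uint32_t max_inodes(void) {
    return inode_table_blocks() * inodes_per_block();
}

/* Number of blocks available in the data region. */
static uint32_t data_region_blocks(void) {
    return FS_DATA_END - FS_DATA_START + 1;
}
```

Under these assumptions max_inodes() gives 80 and data_region_blocks() gives 56. Changing FS_INODE_SIZE shows how the file limit depends on the inode size.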
Note: inodes per block = block size / inode size.
=> 16 inodes/block (4K/256 = 2^12/2^8 = 2^4)
=> 5 blocks, 5 * 16 = 80 inodes
=> File system can store at most 80 files
Note: the data region is blocks 8 to 63 = 56 blocks, so tracking it takes 56 bits = 7 bytes.

Free space management using bitmaps
• Bitmap: a vector of bits
  – 0 for free (inode/block), 1 for in-use
• Inode bitmap (imap)
  – Keeps track of which inodes in the inode table are available
• Data bitmap (dmap)
  – Keeps track of which blocks in the data region are available

Bitmaps
• Each bitmap is stored in a block
  – Block "i": keeps track of the 80 inodes (one 4KB block could track 32K)
  – Block "d": keeps track of the 56 data blocks
Note: i = imap: number of bits = number of inodes = 80 bits = 10 bytes. d = dmap: number of bits = number of data blocks = 56 bits = 7 bytes.

Superblock
• Tracks where the i/d blocks and the inode table are
  – E.g., inode table starts at block 3; there are 80 inodes and 56 data blocks, etc.
• Indicates the type of FS and the inumber of its root directory
• Will be read first when the file system is mounted

inumber
• Each inode is identified by a number
  – The low-level name of the file
• Can figure out the location of an inode from its inumber
Note: sector # = offset / sector size, where offset = (number of blocks before the inode table * block size) + (inumber * inode size). The inumber equals the number of inodes stored before this one.
  – inumber 0: offset = 12KB (3 blocks of 4KB each come before it), so sector # = 12KB/512B = 24
  – inumber 1: offset = 12KB + (1 * 256B) = 12KB + 256B, so sector # = floor(24.5) = 24
  – inumber 2: offset = 12KB + (2 * 256B) = 12KB + 512B, so sector # = 25

inumber => location
• inumber = 32 => address: offset in bytes from the beginning => which sector?
Note: offset = 12KB + (32 * 256) B = 3 * 2^12 B + (2^5 * 2^8) B = 3 * 2^12 B + 2^13 B = 2^12 * (3 + 2) B = 5 * 2^12 B. Sector = offset / sector size = (5 * 2^12) / 2^9 = 5 * 2^3 = 40.

inumber => location of inode
• Address: 12K + 32 * 256 = 20K
• Sector #: 20K/512 = 40
• More generally:
  – (inodeStartAddress + inumber * inode size) / sector size
Note: this is for inumber 32. 1 inode = 256B; 1 sector = 2 inodes; 1 block = 8 sectors = 16 inodes; sector size = 512B (4KB/8).

inode => location of data blocks
• A number of direct pointers
  – E.g., 8 pointers, each pointing to a data block
  – Enough for a file of size 8 * 4K = 32K
• Also has a slot for an indirect pointer
  – Points to a data block storing direct pointers
  – Assume 4 bytes per block address (e.g., represented in CHS), so 1024 pointers/block
  – Now a file can have (8 + 1024) blocks, or 4,128KB

Multi-level index
• Pointers may be organized into multiple levels
  – Indirect pointer (as in previous slide)
    • Inode (pointer1, pointer2, …, indirect pointer)
    • Indirect pointer -> a block of direct pointers
  – Double indirect pointers
    • Inode (pointer1, pointer2, …, indirect pointer)
    • Indirect pointer -> a block of indirect pointers instead -> each points to a block of direct pointers
  – Triple indirect pointers
    • Indirect pointer -> a block of indirect pointers -> each points to a block of indirect pointers -> each points to a block of direct pointers
Note: indirect pointers allow a file to grow in size, which usually happens when data is added to the file. This is similar to growing an array.

Double indirect pointers
[Figure: an inode holding direct pointers and an indirect pointer; the indirect pointer leads to a block of indirect pointers, each of which leads to a block of direct pointers]
Note: direct pointers point to data blocks, so 8 direct pointers give 8 data blocks. Here we have 8 direct pointers + 1 indirect pointer. Size of file = sum of the data reachable through all pointers.
The 8 direct pointers reach 8 * 4KB = 32KB. The one indirect pointer points to an entire block that contains direct pointers; pointers per block = block size / pointer size = 4KB/4B = 1024. File size = data reached by the 8 direct pointers + data reached by the indirect pointer (1024 direct pointers) = 32KB + (1024 * 4KB) = 4,128KB (about 4MB). Block size = 4KB; pointer size = 4B.

Advantages of multi-level index
• Grow to more levels as needed
• Direct pointers handle most of the cases
  – Many files are small

Directory organization
• A directory is itself stored as a file
• For each file in the directory, it stores:
  – name, inumber, record length, string length
[Figure: example directory entries, including the entry for the current directory; the string length is the actual length of the name]
Note: every directory entry has an inumber and a name; these are the important parts. Every time a file is added, it gets an inumber and a name.

Record length vs string length
• String length = # of characters in file name + 1 (for \0: end of string)
• Record length >= string length
  – Due to entry reuse

Reusing directory entries
• If a file is deleted (using the rm command) or a name is unlinked (using the unlink command)
  – The file is finally deleted when its last (hard) link is removed
• Then the inumber in its directory entry is set to 0 (reserved for empty entries)
  – So we know it can be reused

Storing a directory
• Also stored as a file with its own inode + data block
• inode:
  – file type: directory (instead of regular file)
  – pointer to block(s) in the data region storing directory entries

Roadmap
• Files and directories
  – CRUD operations
• Implementation
  – Data structures: how to organize blocks, e.g., into array/tree
  – Access methods: turn system calls into operations on data structures

Open for read
• fd = open("/foo/bar", O_RDONLY)
Note: the permissions of bar are stored in bar's inode. To find that inode we need bar's inumber, which we can find in foo's content. To get this, we need foo's inode.
For that, we need foo's inumber, which is in root's content; that requires root's inode, which requires root's inumber, and root's inumber is stored in the superblock. This is a bootstrap process.
Note: to open a file, we must check its permissions. The permissions are metadata and are stored in the inode. Only inodes and directory contents are read; an inumber is used only to find the location of an inode.
Lookup chain: bar's inode <- bar's inumber <- foo's content <- foo's inode <- foo's inumber <- root's content <- root's inode <- root's inumber (stored in the superblock)

Open for read
• fd = open("/foo/bar", O_RDONLY)
  – Need to locate the inode of the file "/foo/bar"
  – Assume the inumber of root, say 2, is known (e.g., when the file system is mounted)

Open for read
1. Read inode and content of / (2 reads)
   – Look for "foo" in / -> foo's inumber
2. Read inode and content of /foo (2 reads)
   – Look for "bar" in /foo -> bar's inumber
3. Read inode of /foo/bar (1 read)
   – Permission check + allocate file descriptor

Cost of open()
• Need 5 reads of inode/data block

File-open table per process
File descriptor   File name   Inumber   Position offset
3                 /foo/bar    32382     0
4                 /foo/more   48482     512
…                 …

Reading the file
• read(fd, buffer, size)
  – Note fd is maintained in the per-process open-file table
  – Table translates fd -> inumber of file
Note: buffer stores the data read from the file through fd. Without caching, we would have to find the inode by going through the bootstrap process again. If we know a file's fd (file descriptor), we can just look it up in the open-file table.

Reading the file
• read(fd, buffer, size)
1. Consult bar's inode to locate a block
2. Read the block
3. Update inode with newest file access time
4. Update open-file table with new offset
5. Repeat above steps until done (with reading data of given size)

Cost for reading a block
• 3 I/O's:
  – read inode, read data block, write inode
Note: every read changes the atime, which adds to the cost; the inode write is done to update the atime.

Open for write
• int fd = open("/foo/bar", O_WRONLY)
  – Or int fd = creat("/foo/bar", 0644)
  – Assume bar is a new file under foo
  – (note the difference from reading chapter!)
Note: because bar is new, a real call must also pass O_CREAT and a mode, e.g., open("/foo/bar", O_CREAT | O_WRONLY, 0644).
Note: when a new file (bar) is created, the metadata of the directory in which it is stored (foo) needs to be updated. All of foo's times (atime, mtime and ctime) will be updated.
Note on dmap and imap: the superblock is the first block; the imap (second block) has 80 bits for the availability of inodes; the dmap (third block) has 56 bits for the availability of data blocks.

Open for write
• int fd = open("/foo/bar", O_WRONLY)
1. Read '/' inode & content
   – obtain foo's inumber
2. Read '/foo' inode & content
   – check if bar exists
3. Read imap, to find a free inode for bar
4. Update imap, setting 1 for allocated inode
5. Write bar's inode
6. Update foo's content block
   – Adding an entry for bar
7. Update foo's inode
   – Update its modification time

Cost for "open for write"
• int fd = open("/foo/bar", O_WRONLY)
• Need 9 I/O's (4 reads of directory inodes and contents, 1 read of the imap, and 4 writes)
Note: we are still just opening the file for write, not writing to it yet; that is why there is no write to bar's data. Open for write is more expensive than open for read. Only the directory right above the new file (foo) is updated; root's inode and content are only read, so nothing is written to root's inode. Writing data later requires reading the dmap, whereas creating the file uses the imap. The dmap and imap cover the entire file system.
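The 5 reads counted for open-for-read and the 9 I/O's counted for open-for-write can be reproduced with a small C sketch. It rests on assumptions that go beyond the slides: no caching, every directory's content fits in a single block, and the root inumber is already known. The function names are hypothetical.

```c
/* Count the non-empty components of an absolute path, e.g. "/foo/bar" -> 2. */
static int path_components(const char *path) {
    int n = 0;
    int in_component = 0;
    for (const char *p = path; *p != '\0'; p++) {
        if (*p == '/') {
            in_component = 0;      /* a separator ends the current component */
        } else if (!in_component) {
            in_component = 1;      /* first character of a new component */
            n++;
        }
    }
    return n;
}

/* Reads needed to open an existing file for reading:
 * inode + content for each directory walked, plus the file's own inode. */
static int open_read_ios(const char *path) {
    return 2 * path_components(path) + 1;
}

/* I/O's needed to create a new file at the given path:
 * inode + content reads for each directory walked, then
 * read imap, write imap, write the new inode,
 * write the parent's content block, write the parent's inode. */
static int create_ios(const char *path) {
    return 2 * path_components(path) + 5;
}
```

For "/foo/bar" the path has two components, so open_read_ios gives 5 and create_ios gives 9, the same totals as the step lists above. Deeper paths add two reads per extra directory under the stated assumptions.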
[Figure: timeline of I/O operations for create() of /foo/bar. Columns: data bitmap, inode bitmap, root inode, foo inode, bar inode, root data, foo data, bar data[0], bar data[1], bar data[2]. The create() row shows 5 reads followed by 4 writes.]

Writing the file: /foo/bar
1. Read inode of bar (by looking up its inumber in the file-open table)
2. Allocate new data block
   – Read and write the data bitmap (dmap)
3. Write to data block of bar
4. Update bar inode
   – new modification time, add pointer to block
Note: allocation can happen for both foo and bar, depending on how much space is left in the blocks already allocated to each.

Cost of writing /foo/bar
• 5 I/O's for writing a block (read bar's inode, read dmap, write dmap, write data block, write bar's inode)
[Figure: the same timeline extended with a write() row showing 2 reads followed by 3 writes.]

Caching for read
• First read may be slow
  – But subsequent ones will speed up
• Good idea to cache popular blocks
  – e.g., determined via LRU strategy

Buffering for delayed write
• Improve write performance via:
  – Batching (e.g., two updates to the same imap)
  – Scheduling (reordering for better performance)
  – Avoiding writes (if a file is created, then quickly deleted)
• Problem: updates may be lost when the system crashes

Example file systems
• NTFS
  – New Technology File System, Microsoft proprietary
• FAT
  – File Allocation Table
  – FAT 16, 32, …
  – 32 bits = # of sectors: 512B/sector => 2TB limit; 4KB/sector => 16TB limit (these bound the volume; a single FAT32 file is further limited to 4GB by its 32-bit size field)
• Ext4
  – Fourth extended file system, common in Linux
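The 2TB and 16TB figures on the FAT slide follow directly from a 32-bit sector count. A minimal C sketch of that arithmetic is given below; the function names are hypothetical and the code models only the multiplication, not any real FAT data structure.

```c
#include <stdint.h>

/* Largest size, in bytes, addressable with a 32-bit sector count. */
static uint64_t max_bytes_with_32bit_sector_count(uint64_t sector_size) {
    return (UINT64_C(1) << 32) * sector_size;
}

/* Convert a byte count to whole terabytes (1TB = 2^40 bytes here). */
static uint64_t bytes_to_tb(uint64_t bytes) {
    return bytes >> 40;
}
```

With 512B sectors the result is 2^32 * 2^9 = 2^41 bytes (2TB), and with 4KB sectors it is 2^32 * 2^12 = 2^44 bytes (16TB).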

  • Fall '14
