.s2 3.6 I/O calls .es The system calls to do I/O are designed to eliminate the differences between the various devices and styles of access. There is no distinction between ``random'' and ``sequential'' I/O, nor is any logical record size imposed by the system. The size of an ordinary file is determined by the highest byte written on it; no predetermination of the size of a file is necessary or possible. .pg To illustrate the essentials of I/O in \*sUNIX\*n, some of the basic calls are summarized below in an anonymous language which will indicate the required parameters without getting into the complexities of machine language programming. Each call to the system may potentially result in an error return, which for simplicity is not represented in the calling sequence. .pg To read or write a file assumed to exist already, it must be opened by the following call: .dc filep = open\|(\|name, flag\|) .ec \fIName\fR indicates the name of the file. An arbitrary path name may be given. The \fIflag\fR argument indicates whether the file is to be read, written, or ``updated,'' that is read and written simultaneously. .pg The returned value .ft I filep .ft R is called a .ft I file descriptor. .ft R It is a small integer used to identify the file in subsequent calls to read, write or otherwise manipulate the file. .pg To create a new file or completely rewrite an old one, there is a \fIcreate\fR system call which creates the given file if it does not exist, or truncates it to zero length if it does exist. \fICreate\fR also opens the new file for writing and, like \fIopen,\fR returns a file descriptor. .pg There are no user-visible locks in the file system, nor is there any restriction on the number of users who may have a file open for reading or writing. Although it is possible for the contents of a file to become scrambled when two users write on it simultaneously, in practice difficulties do not arise. We take the view that locks are neither necessary nor sufficient, in our environment, to prevent interference between users of the same file. They are unnecessary because we are not faced with large, single-file data bases maintained by independent processes. They are insufficient because locks in the ordinary sense, whereby one user is prevented from writing on a file which another user is reading, cannot prevent confusion when, for example, both users are editing a file with an editor which makes a copy of the file being edited. .pg It should be said that the system has sufficient internal interlocks to maintain the logical consistency of the file system when two users engage simultaneously in such inconvenient activities as writing on the same file, creating files in the same directory, or deleting each other's open files. .pg Except as indicated below, reading and writing are sequential. This means that if a particular byte in the file was the last byte written (or read), the next I/O call implicitly refers to the first following byte. For each open file there is a pointer, maintained by the system, which indicates the next byte to be read or written. If \fIn\fR bytes are read or written, the pointer advances by \fIn\fR bytes. .pg Once a file is open, the following calls may be used. .dc n = read\|(\|filep, buffer, count\|) .dc n = write\|(\|filep, buffer, count\|) .ec Up to .ft I count .ft R bytes are transmitted between the file specified by .ft I filep .ft R and the byte array specified by .ft I buffer. .ft R The returned value .ft I n .ft R is the number of bytes actually transmitted. In the .ft I write .ft R case, .ft I n .ft R is the same as .ft I count .ft R except under exceptional conditions like I/O errors or end of physical medium on special files; in a .ft I read, .ft R however, .ft I n .ft R may without error be less than .ft I count. .ft R If the read pointer is so near the end of the file that reading \fIcount\fR characters would cause reading beyond the end, only sufficient bytes are transmitted to reach the end of the file; also, typewriter-like devices never return more than one line of input. When a \fIread\fR call returns with \fIn\fR equal to zero, it indicates the end of the file. For disk files this occurs when the read pointer becomes equal to the current size of the file. It is possible to generate an end-of-file from a typewriter by use of an escape sequence which depends on the device used. .pg Bytes written on a file affect only those implied by the position of the write pointer and the count; no other part of the file is changed. If the last byte lies beyond the end of the file, the file is grown as needed. .pg To do random (direct access) I/O it is only necessary to move the read or write pointer to the appropriate location in the file. .dc location = seek\|(\|filep, offset, base\|) .ec The pointer associated with \fIfilep\fR is moved to a position \fIoffset\fR bytes from the beginning of the file, from the current position of the pointer, or from the end of the file, depending on .ft I base. .ft R .ft I Offset .ft R may be negative. For some devices (e.g. paper tape and typewriters) seek calls are ignored. The actual offset from the beginning of the file to which the pointer was moved is returned in \fIlocation\fR. .s2 3.6.1 Other I/O calls .es There are several additional system entries having to do with I/O and with the file system which will not be discussed. For example: close a file, get the status of a file, change the protection mode or the owner of a file, create a directory, make a link to an existing file, delete a file. .s1 4. Implementation of the file system .es As mentioned in \(sc3.2 above, a directory entry contains only a name for the associated file and a pointer to the file itself. This pointer is an integer called the .ft I i-number .ft (for index number) of the file. When the file is accessed, its i-number is used as an index into a system table (the \fIi-list\|\fR) stored in a known part of the device on which the directory resides. The entry thereby found (the file's \fIi-node\fR\|) contains the description of the file: .sp .tr | .in .5i .ti -\w'1. 'u 1.|its owner; .ti -\w'1. 'u 2.|its protection bits; .ti -\w'1. 'u 3.|the physical disk or tape addresses for the file contents; .ti -\w'1. 'u 4.|its size; .ti -\w'1. 'u 5.|time of last modification; .ti -\w'1. 'u 6.|the number of links to the file; that is, the number of times it appears in a directory; .ti -\w'1. 'u 7.|a bit indicating whether the file is a directory; .ti -\w'1. 'u 8.|a bit indicating whether the file is a special file; .ti -\w'1. 'u 9.|a bit indicating whether the file is ``large'' or ``small.'' .sp .tr || .in 0 The purpose of an .ft I open .ft R or .ft I create .ft R system call is to turn the path name given by the user into an i-number by searching the explicitly or implicitly named directories. Once a file is open, its device, i-number, and read/write pointer are stored in a system table indexed by the file descriptor returned by the .ft I open .ft R or .ft I create. .ft R Thus the file descriptor supplied during a subsequent call to read or write the file may be easily related to the information necessary to access the file. .pg When a new file is created, an i-node is allocated for it and a directory entry is made which contains the name of the file and the i-node number. Making a link to an existing file involves creating a directory entry with the new name, copying the i-number from the original file entry, and incrementing the link-count field of the i-node. Removing (deleting) a file is done by decrementing the link-count of the i-node specified by its directory entry and erasing the directory entry. If the link-count drops to 0, any disk blocks in the file are freed and the i-node is deallocated. .pg The space on all fixed or removable disks which contain a file system is divided into a number of 512-byte blocks logically addressed from 0 up to a limit which depends on the device. There is space in the i-node of each file for eight device addresses. A \fIsmall\fR (non-special) file fits into eight or fewer blocks; in this case the addresses of the blocks themselves are stored. For \fIlarge\fR (non-special) files, seven of the eight device addresses may point to indirect blocks each containing 256 addresses for the data blocks of the file. If required, the eighth word is the address of a double-indirect block containing 256 more addresses of indirect blocks. Thus files may conceptually grow to (7+256)\v'-.3'.\v'.3'256\v'-.3'.\v'.3'512 bytes; actually they are restricted to 16,777,216 (\|2\u\*t24\*n\d\|) bytes. Once opened, a small file (size 1 to 8 blocks) can be accessed directly. A large file (size 9 to 32768 blocks) requires one additional access to read below logical block 1792 (7\v'-.3'.\v'.3'256) and two additional references above 1792. .pg The foregoing discussion applies to ordinary files. When an I/O request is made to a file whose i-node indicates that it is special, the last seven device address words are immaterial, and the first is interpreted as a pair of bytes which constitute an internal .ft I device name. .ft R These bytes specify respectively a device type and subdevice number. The device type indicates which system routine will deal with I/O on that device; the subdevice number selects, for example, a disk drive attached to a particular controller or one of several similar typewriter interfaces. .pg In this environment, the implementation of the .ft I mount .ft R system call (\(sc3.4) is quite straightforward. .ft I Mount .ft R maintains a system table whose argument is the i-number and device name of the ordinary file specified during the .ft I mount, .ft R and whose corresponding value is the device name of the indicated special file. This table is searched for each (i-number, device)-pair which turns up while a path name is being scanned during an .it open or .it create; if a match is found, the i-number is replaced by 1 (which is the i-number of the root directory on all file systems), and the device name is replaced by the table value. .pg To the user, both reading and writing of files appear to be synchronous and unbuffered. That is, immediately after return from a \fIread\fR call the data are available, and conversely after a \fIwrite\fR the user's workspace may be reused. In fact the system maintains a rather complicated buffering mechanism which reduces greatly the number of I/O operations required to access a file. Suppose a \fIwrite\fR call is made specifying transmission of a single byte. U\*sNIX\*n will search its buffers to see whether the affected disk block currently resides in core memory; if not, it will be read in from the device. Then the affected byte is replaced in the buffer and an entry is made in a list of blocks to be written. The return from the \fIwrite\fR call may then take place, although the actual I/O may not be completed until a later time. Conversely, if a single byte is read, the system determines whether the secondary storage block in which the byte is located is already in one of the system's buffers; if so, the byte can be returned immediately. If not, the block is read into a buffer and the byte picked out. .pg The system recognizes when a program has made accesses to sequential blocks of a file, and asynchronously pre-reads the next block. This significantly reduces the running time of most programs while adding little to system overhead. .pg A program which reads or writes files in units of 512 bytes has an advantage over a program which reads or writes a single byte at a time, but the gain is not immense; it comes mainly from the avoidance of system overhead. A program which is used rarely or which does no great volume of I/O may quite reasonably read and write in units as small as it wishes. .pg The notion of the i-list is an unusual feature of \*sUNIX\*n. In practice, this method of organizing the file system has proved quite reliable and easy to deal with. To the system itself, one of its strengths is the fact that each file has a short, unambiguous name which is related in a simple way to the protection, addressing, and other information needed to access the file. It also permits a quite simple and rapid algorithm for checking the consistency of a file system, for example verification that the portions of each device containing useful information and those free to be allocated are disjoint and together exhaust the space on the device. This algorithm is independent of the directory hierarchy, since it need only scan the linearly-organized i-list. At the same time the notion of the i-list induces certain peculiarities not found in other file system organizations. For example, there is the question of who is to be charged for the space a file occupies, since all directory entries for a file have equal status. Charging the owner of a file is unfair in general, since one user may create a file, another may link to it, and the first user may delete the file. The first user is still the owner of the file, but it should be charged to the second user. The simplest reasonably fair algorithm seems to be to spread the charges equally among users who have links to a file. The current version of \*sUNIX\*n avoids the issue by not charging any fees at all. .s2 4.1 Efficiency of the file system .es To provide an indication of the overall efficiency of \*sUNIX\*n and of the file system in particular, timings were made of the assembly of a 8848-line program. The assembly was run alone on the machine; the total clock time was 32 seconds, for a rate of 276 lines per second. The time was divided as follows: 66% assembler execution time, 21% system overhead, 13% disk wait time. We will not attempt any interpretation of these figures nor any comparison with other systems, but merely note that we are generally satisfied with the overall performance of the system.