.\" Copyright (c) 1986 Regents of the University of California.
.\" All rights reserved.  The Berkeley software License Agreement
.\" specifies the terms and conditions for redistribution.
.\"
.\"	@(#)sys.ufs.t	1.5 (Berkeley) 4/11/86
.\"
.NH
Changes in the filesystem
.PP
The major change in the filesystem was the addition of a name
translation cache.
A table of recent name-to-inode translations is maintained by \fInamei\fP,
and used as a lookaside cache when translating each component
of each file pathname.
Each \fInamecache\fP entry contains the parent directory's device and inode,
the length of the name, and the name itself, and is hashed on the name.
It also contains a pointer to the inode for the file whose name it contains.
Unlike most inode pointers, which hold a ``hard'' reference
by incrementing the reference count,
the name cache holds a ``soft'' reference, a pointer to an inode
that may be reused.
In order to validate the inode from a name cache reference,
each inode is assigned a unique ``capability'' when it is brought
into memory.
When the inode entry is reused for another file,
or when the name of the file is changed,
this capability is changed.
This allows the inode cache to be handled normally,
releasing inodes at the head of the LRU list without regard for name
cache references,
and allows multiple names for the same inode to be in the cache simultaneously
without complicating the invalidation procedure.
An additional feature of this scheme is that when opening
a file, it is possible to determine whether the file was previously open.
This is useful when beginning execution of a file, to check whether
the file might be open for writing, and for similar situations.
.PP
Other changes that are visible throughout the filesystem
include greater use of the ILOCK and IUNLOCK macros
rather than the subroutine equivalents.
The inode times are updated on each \fIirele\fP,
not only when the reference count reaches zero,
if the IACC, IUPD or ICHG flags are set.
This is accomplished with the ITIMES macro;
the inode is marked as modified with the new IMOD flag,
which causes it to be written to disk when released, or on the next sync.
.PP
The remainder of this section describes the filesystem changes
that are localized to individual files.
.XP ufs_alloc.c
The algorithm for extending file fragments was changed
to take advantage of the observation that fragments that were once extended
were frequently extended again,
that is, that the file was being written in fragments.
Therefore, the first time a given fragment is allocated,
a best-fit strategy is used.
Thereafter, when this fragment is to be extended,
a full-sized block is allocated,
the fragment removed from it, and the remainder freed
for use in subsequent expansion.
As this policy may result in increased fragmentation,
it is not used when the filesystem becomes excessively fragmented
(i.e. when the number of free fragments falls to 2% of the minfree value);
the policy is stored in the superblock and may be changed with \fItunefs\fP.
The \fIfserr\fP routine was converted to use \fIlog\fP
rather than \fIprintf\fP.
.XP ufs_bio.c
I/O operations traced now include the size where relevant.
.XP ufs_inode.c
The size of the buffer hash table was increased substantially
and changed to a power of two to allow the modulus to be computed
with a mask operation.
\fIIget\fP invalidates the capability in each inode that is flushed
from the inode cache for reuse.
The new \fIigrab\fP routine is used instead of \fIiget\fP
when fetching an inode from a name cache reference;
it waits for the inode to be unlocked if necessary,
and removes it from the free list if it was free.
The caller must check that the inode is still valid after the \fIigrab\fP.
A bug was fixed in \fIitrunc\fP that allowed old contents
to creep back into a file.
When truncating to a location within a block,
\fIitrunc\fP must clear the remainder of the block.
Otherwise, if the file is extended by seeking past the end of file
and then writing, the old contents reappear.
.\" \fIItrunc\fP also waits for
.XP ufs_mount.c
The \fImount\fP system call was modified to return different error numbers
for different types of errors.
\fIMount\fP now examines the superblock more carefully
before using the size field it contains as the amount to copy
into a new buffer.
If a mount fails for a reason other than the device already being mounted,
the device is closed again.
When performing the name lookup for the mount point,
\fImount\fP must prevent the name translation from being left
in the name cache;
\fIumount\fP must flush all name translations for the device.
A bug in \fIgetmdev\fP caused an inode to remain locked
if the specified device was not a block special file;
this has been fixed.
.XP ufs_namei.c
This file was previously called ufs_nami.c.
The \fInamei\fP function has a new calling convention
with its arguments, associated context, and side effects
encapsulated in a single structure.
It has been extensively modified to implement the name cache
and to cache directory offsets for each process.
It may now return ENAMETOOLONG when appropriate,
and returns EINVAL if the 8th bit is set on one of the pathname characters.
Directories may be foreshortened if the last one or more blocks
contain no entries;
this is done when files are being created,
as the entire directory must already be searched.
An entry is provided for invalidating the entire name cache
when the 32-bit prototype for capabilities wraps around.
This is expected to happen after 13 months of operation,
assuming 100 name lookups per second, all of which miss the cache.
.XP
A change in filesystem semantics is the introduction
of ``sticky'' directories.
If the ISVTX (sticky text) bit is set in the mode of a directory,
files may only be removed from that directory by the owner of the file,
the owner of the directory, or the superuser.
This is enforced by \fInamei\fP when the lookup operation is DELETE.
.XP ufs_subr.c
The strategy for \fIsyncip\fP, the internal routine implementing \fIfsync\fP,
has been modified for large files
(those larger than half of the buffer cache).
For large files all modified buffers for the device are written out.
The old algorithm could run for a very long time on a very large file,
which might not actually have many data blocks.
The \fIupdate\fP routine now saves some work by calling \fIiupdate\fP
only for modified inodes.
The C replacements for the special VAX instructions
have been collected in this file.
.XP ufs_syscalls.c
When doing an open with flags O_CREAT and O_EXCL
(create only if the file did not exist),
it is now considered to be an error if the target exists
and is a symbolic link,
even if the symbolic link refers to a nonexistent file.
This behavior is desirable for reasons of security
in programs that create files with predictable names.
\fIRename\fP follows the policy of \fInamei\fP
in disallowing removal of the target of a rename
if the target directory is ``sticky''
and the user is not the owner of the target or the target directory.
A serious bug in the open code which allowed directories
and other unwritable files to be truncated has been corrected.
Interrupted opens no longer lose file descriptors.
The \fIlseek\fP call returns an ESPIPE error when seeking on sockets
(including pipes) for backward compatibility.
The error returned from \fIreadlink\fP when reading something other than
a symbolic link was changed from ENXIO to EINVAL.
Several calls that previously failed silently on read-only filesystems
(\fIchmod\fP, \fIchown\fP, \fIfchmod\fP, \fIfchown\fP and \fIutimes\fP)
now return EROFS.
The \fIrename\fP code was reworked to avoid several races
and to invalidate the name cache.
It marks a directory being renamed with IRENAME
to avoid races due to concurrent renames of the same directory.
\fIMkdir\fP now sets the size of all new directories to DIRBLKSIZE.
\fIRmdir\fP purges the name cache of entries for the removed directory.
.XP ufs_xxx.c
The routines \fIuchar\fP and \fIschar\fP are no longer used
and have been removed.
.XP quota_kern.c
The quota hash size was changed to a power of 2
so that the modulus could be computed with a mask.
.XP quota_ufs.c
If a user has run out of warnings and had the hard limit enforced
while logged in,
but has then brought his allocation below the hard limit,
the quota system reverts to enforcing the soft limit,
and resets the warning count;
users previously were required to log out and in again to get this effect.
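.PP
The power-of-two table sizes noted for ufs_inode.c and quota_kern.c
rely on the identity that, when the table size is a power of two,
the remainder of a division by the size equals the key masked
with the size minus one.
The following is a minimal sketch of that identity, not the kernel code:
the names HASH_SIZE, HASH_MASK, is_power_of_two, hash_mask and hash_mod
are hypothetical, and the size of 64 is chosen only for illustration.

```c
/* Hypothetical table size; it must be a power of two for the mask to work. */
#define HASH_SIZE 64u
#define HASH_MASK (HASH_SIZE - 1u)

/* Returns nonzero if n is a nonzero power of two. */
static int is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Bucket index computed with a mask instead of a division. */
static unsigned int hash_mask(unsigned int key)
{
    return key & HASH_MASK;
}

/* Reference version using the modulus operator, for comparison. */
static unsigned int hash_mod(unsigned int key)
{
    return key % HASH_SIZE;
}
```

For any unsigned key the two functions return the same bucket,
so a key of 1000 maps to bucket 40 either way.
The equivalence fails if the size is not a power of two,
which is why the table sizes had to change along with the hashing code.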