NFS Tracing By Passive Network Monitoring

Matt Blaze
Department of Computer Science
Princeton University
mab@cs.princeton.edu

ABSTRACT

Traces of filesystem activity have proven to be useful for a wide variety of purposes, ranging from quantitative analysis of system behavior to trace-driven simulation of filesystem algorithms. Such traces can be difficult to obtain, however, usually entailing modification of the filesystems to be monitored and runtime overhead for the period of the trace. Largely because of these difficulties, a surprisingly small number of filesystem traces have been conducted, and few sample workloads are available to filesystem researchers.

This paper describes a portable toolkit for deriving approximate traces of NFS [1] activity by non-intrusively monitoring the Ethernet traffic to and from the file server. The toolkit uses a promiscuous Ethernet listener interface (such as the Packetfilter [2]) to read and reconstruct NFS-related RPC packets intended for the server. It produces traces of the NFS activity as well as a plausible set of corresponding client system calls. The tool is currently in use at Princeton and other sites, and is available via anonymous ftp.

1. Motivation

Traces of real workloads form an important part of virtually all analysis of computer system behavior, whether it is program hot spots, memory access patterns, or filesystem activity that is being studied. In the case of filesystem activity, obtaining useful traces is particularly challenging. Filesystem behavior can span long time periods, often making it necessary to collect huge traces over weeks or even months. Modification of the filesystem to collect trace data is often difficult, and may result in unacceptable runtime overhead. Distributed filesystems exacerbate these difficulties, especially when the network is composed of a large number of heterogeneous machines.
As a result of these difficulties, only a relatively small number of traces of Unix filesystem workloads have been conducted, primarily in computing research environments. [3], [4] and [5] are examples of such traces.

Since distributed filesystems work by transmitting their activity over a network, it would seem reasonable to obtain traces of such systems by placing a "tap" on the network and collecting trace data based on the network traffic. Ethernet [6] based networks lend themselves to this approach particularly well, since traffic is broadcast to all machines connected to a given subnetwork.

A number of general-purpose network monitoring tools are available that "promiscuously" listen to the Ethernet to which they are connected; Sun's etherfind [7] is an example of such a tool. While these tools are useful for observing (and collecting statistics on) specific types of packets, the information they provide is at too low a level to be useful for building filesystem traces. Filesystem operations may span several packets, and may be meaningful only in the context of other, previous operations.

Some work has been done on characterizing the impact of NFS traffic on network load. In [8], for example, the results of a study are reported in which Ethernet traffic was monitored and statistics gathered on NFS activity. While useful for understanding traffic patterns and developing a queueing model of NFS loads, these previous studies do not use the network traffic to analyze the file access traffic patterns of the system, focusing instead on developing a statistical model of the individual packet sources, destinations, and types.

This paper describes a toolkit for collecting traces of NFS file access activity by monitoring Ethernet traffic. A "spy" machine with a promiscuous Ethernet interface is connected to the same network as the file server. Each NFS-related packet is analyzed and a trace is produced at an appropriate level of detail.
The tool can record the low-level NFS calls themselves or an approximation of the user-level system calls (open, close, etc.) that triggered the activity.

We partition the problem of deriving NFS activity from raw network traffic into two fairly distinct subproblems: that of decoding the low-level NFS operations from the packets on the network, and that of translating these low-level commands back into user-level system calls. Hence, the toolkit consists of two basic parts, an "RPC decoder" (rpcspy) and the "NFS analyzer" (nfstrace). rpcspy communicates with a low-level network monitoring facility (such as Sun's NIT [9] or the Packetfilter [2]) to read and reconstruct the RPC transactions (call and reply) that make up each NFS command. nfstrace takes the output of rpcspy and reconstructs the system calls that occurred as well as other interesting data it can derive about the structure of the filesystem, such as the mappings between NFS file handles and Unix file names. Since there is not a clean one-to-one mapping between system calls and lower-level NFS commands, nfstrace uses some simple heuristics to guess a reasonable approximation of what really occurred.

1.1. A Spy's View of the NFS Protocols

It is well beyond the scope of this paper to describe the protocols used by NFS; for a detailed description of how NFS works, the reader is referred to [10], [11], and [12]. What follows is a very brief overview of how NFS activity translates into Ethernet packets.

An NFS network consists of servers, to which filesystems are physically connected, and clients, which perform operations on remote server filesystems as if the disks were locally connected. A particular machine can be a client or a server or both. Clients mount remote server filesystems in their local hierarchy just as they do local filesystems; from the user's perspective, files on NFS and local filesystems are (for the most part) indistinguishable, and can be manipulated with the usual filesystem calls.
The interface between client and server is defined in terms of 17 remote procedure call (RPC) operations. Remote files (and directories) are referred to by a file handle that uniquely identifies the file to the server. There are operations to read and write bytes of a file (read, write), obtain a file's attributes (getattr), obtain the contents of directories (lookup, readdir), create files (create), and so forth. While most of these operations are direct analogs of Unix system calls, notably absent are open and close operations; no client state information is maintained at the server, so there is no need to inform the server explicitly when a file is in use. Clients can maintain buffer cache entries for NFS files, but must verify that the blocks are still valid (by checking the last write time with the getattr operation) before using the cached data.

An RPC transaction consists of a call message (with arguments) from the client to the server and a reply message (with return data) from the server to the client. NFS RPC calls are transmitted using the UDP/IP connectionless unreliable datagram protocol [13]. The call message contains a unique transaction identifier which is included in the reply message to enable the client to match the reply with its call. The data in both messages is encoded in an "external data representation" (XDR), which provides a machine-independent standard for byte order, etc. Note that the NFS server maintains no state information about its clients, and knows nothing about the context of each operation outside of the arguments to the operation itself.

2. The rpcspy Program

rpcspy is the interface to the system-dependent Ethernet monitoring facility; it produces a trace of the RPC calls issued between a given set of clients and servers. At present, there are versions of rpcspy for a number of BSD-derived systems, including ULTRIX (with the Packetfilter [2]), SunOS (with NIT [9]), and the IBM RT running AOS (with the Stanford enet filter).
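A passive monitor that reports whole transactions has to pair each observed call with its reply, using the transaction identifier described in Section 1.1. The following is a minimal Python sketch of that pairing, not rpcspy's actual implementation. The names TransactionMatcher, on_call, and on_reply are hypothetical, and keying on the (client, xid) pair and keeping the earliest timestamp of a retransmitted call are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Transaction:
    """One completed call/reply pair as seen by the monitor."""
    client: str
    xid: int
    call_time: float   # seconds, when the call packet was seen
    reply_time: float  # seconds, when the reply packet was seen

    @property
    def elapsed_us(self) -> int:
        # Time between call and reply, in microseconds.
        return round((self.reply_time - self.call_time) * 1_000_000)


class TransactionMatcher:
    """Hypothetical helper: pairs RPC calls with replies by (client, xid)."""

    def __init__(self) -> None:
        # Calls seen so far that have not yet been answered.
        self._pending: Dict[Tuple[str, int], float] = {}

    def on_call(self, client: str, xid: int, when: float) -> None:
        # Assumption: a retransmitted call reuses its xid, so keep the
        # timestamp of the first copy that was observed.
        self._pending.setdefault((client, xid), when)

    def on_reply(self, client: str, xid: int, when: float) -> Optional[Transaction]:
        call_time = self._pending.pop((client, xid), None)
        if call_time is None:
            # The monitor never saw the call (for example, it dropped the packet).
            return None
        return Transaction(client, xid, call_time, when)
```

Because the monitor can miss packets, a reply with no recorded call is simply ignored here; a real tool would also need to expire pending calls that never receive a reply.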
For each RPC transaction monitored, rpcspy produces an ASCII record containing a timestamp, the name of the server, the client, the length of time the command took to execute, the name of the RPC command executed, and the command-specific arguments and return data. Currently, rpcspy understands and can decode the 17 NFS RPC commands, and there are hooks to allow other RPC services (for example, NIS) to be added reasonably easily.

The output may be read directly or piped into another program (such as nfstrace) for further analysis; the format is designed to be reasonably friendly to both the human reader and other programs (such as nfstrace or awk).

Since each RPC transaction consists of two messages, a call and a reply, rpcspy waits until it receives both these components and emits a single record for the entire transaction. The basic output format is seven vertical-bar-separated fields:

    timestamp | execution-time | server | client | command-name | arguments | reply-data

where timestamp is the time the reply message was received, execution-time is the time (in microseconds) that elapsed between the call and reply, server is the name (or IP address) of the server, client is the name (or IP address) of the client followed by the userid that issued the command, command-name is the name of the particular program invoked (read, write, getattr, etc.), and arguments and reply-data are the command-dependent arguments and return values passed to and from the RPC program, respectively. The exact format of the argument and reply data is dependent on the specific command issued and the level of detail the user wants logged. For example, a typical NFS command is recorded as follows:

    690529992.167140 | 11717 | paramount | merckx.321 | read | {"7b1f00000000083c", 0, 8192} | ok, 1871

In this example, uid 321 at client "merckx" issued an NFS read command to server "paramount".
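As a rough illustration of how a downstream program might consume such a record, the Python sketch below splits one line into the seven bar-separated fields that appear in the example record. The names RpcRecord and parse_record are hypothetical and not part of the toolkit. The sketch assumes that the userid follows the last dot of the client field and that the arguments field itself contains no vertical bar.

```python
from typing import NamedTuple


class RpcRecord(NamedTuple):
    timestamp: float   # time the reply was received (Unix time, seconds)
    elapsed_us: int    # microseconds between call and reply
    server: str
    client: str        # client name or IP address, without the userid
    uid: int           # userid that issued the command
    command: str
    arguments: str     # left unparsed: its layout depends on the command
    reply: str         # left unparsed: its layout depends on the command


def parse_record(line: str) -> RpcRecord:
    # Split on the first six bars only, so a bar inside the reply data survives.
    fields = [f.strip() for f in line.split("|", 6)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    timestamp, elapsed, server, client, command, arguments, reply = fields
    # Assumption: the userid is whatever follows the final dot, which also
    # works when the client is given as a dotted IP address.
    host, _, uid = client.rpartition(".")
    return RpcRecord(float(timestamp), int(elapsed), server, host,
                     int(uid), command, arguments, reply)


rec = parse_record(
    '690529992.167140 | 11717 | paramount | merckx.321 | read | '
    '{"7b1f00000000083c", 0, 8192} | ok, 1871'
)
# Since the timestamp marks the reply, the call time can be estimated by
# subtracting the execution time.
call_time = rec.timestamp - rec.elapsed_us / 1_000_000
```

The arguments and reply data are kept as raw strings because their layout varies by command and by the level of detail requested.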
The reply was issued at (Unix time) 690529992.167140 seconds; the call command occurred 11717 microseconds earlier. Three arguments are logged for the read call: the file handle from which to read (repr...