Stanford Computer Systems Laboratory Technical Reports from the 1990s

The authors have given permission to make their technical reports available on this server. If a report waspublished in print and is not here it may be that the author published it elsewhere.

Report Number: CSL-TN-99-1 This is the only TN in the CSL series
Institution: Stanford University, Computer Systems Laboratory
Title: A Graph-Oriented Model for Articulation of Ontology Interdependencies
Author: Mitra, Prasenjit
Author: Wiederhold, Gio
Author: Kersten, Martin L.
Date: August 1999
Abstract: Ontologies are knowledge structures to explicate the contents, essential properties, and relationships between terms in a knowledge source. Many sources are now accessible with associated ontologies. Most prior work on use of ontologies relies on the construction of a single global ontology covering all sources. Such an approach is not scalable and maintainable especially when the sources change frequently. We propose a scalable and easily maintainable approach based on the interoperation of ontologies. To handle user queries crossing the boundaries of the underlying information systems, the interoperation between the ontologies should be precisely defined. Our approach is to use rules that cross the semantic gap by creating an articulation or linkage between the systems. The rules are generated using a semi-automatic articulation tool with the help of a domain expert. To make the ontologies amenable for automatic composition based on the accumulated knowledge rules, we represent them using a graph-oriented model extended with a small algebraic operator set. ONION, a user-friendly toolkit, aids the experts in bridging the semantic gap in real-life settings. Our framework provides a sound foundation to simplify the work of domain experts, enables integration with public semantic dictionaries, like Wordnet, and will derive ODMG-compliant mediators automatically.
CSL-TN-99-1 This is the only TN in the CSL series

Report Number: CSL-TR-90-410
Institution: Stanford University, Computer Systems Laboratory
Title: Tango introduction and tutorial
Author: Goldschmidt, Stephen R.
Author: Davis, Helen
Date: January 1990
Abstract: Tango is a software-based multiprocessor simulator that can generate traces of synchronization events and data references. The system runs on a uniprocessor and provides a simulated multiprocessor environment. The user code is augmented during compilation to produce a compiled simulation system with optional logging. Tango offers flexible and accurate tracing by allowing the user to incorporate various memory and synchronization models. Tango achieves high efficiency by running compiled user code, by focusing on information that is of specific interest to multiprocessing studies and by allowing the user to select the most efficient memory simulation that is appropriate for a set of experiments.
CSL-TR-90-410

Report Number: CSL-TR-90-411
Institution: Stanford University, Computer Systems Laboratory
Title: Branch strategies: modeling and optimization
Author: Dubey, Pradeep K.
Author: Flynn, Michael J.
Date: February 1990
Abstract: Instruction dependency introduced by conditional branch instructions, which is resolved only at run-time, can have severe performance impact on pipelined machines. A variety of strategies are in wide use to minimize this impact. Additional instruction traffic generated by these branch strategies can also have an adverse effect on the system performance. Therefore, in addition to the likely reduction a branch prediction strategy offers in average branch delay, resulting excess i-traffic can be an important parameter in evaluating its overall effectiveness. The objective of this paper is twofold: to develop a model for different approaches to the branch problem and to help select an optimal strategy after taking into account the additional i-traffic generated by the i-buffering. The model presented provides a flexible tool for comparing different branch strategies in terms of the reduction it offers in average branch delay and also in terms of the associated cost of wasted instruction fetches. This additional criterion turns out to be a valuable consideration in choosing between two almost equally performing strategies. More importantly, it provides a better insight into the expected overall system performance. Simple compiler-support-based low implementation-cost strategies can be very effective under certain conditions. An active branch prediction scheme based on loop buffer can be as competitive as a branch-target-buffer based strategy.
CSL-TR-90-411

Report Number: CSL-TR-90-413
Institution: Stanford University, Computer Systems Laboratory
Title: An area-utility model for on-chip memories and its application
Author: Mulder, Johannes M.
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: February 1990
Abstract: Utility can be defined as quality per unit of cost. The utility of a particular function in a microprocessor can be defined as its contribution to the overall processor performance per unit of implementation cost. In the case of on-chip data memory (e.g., registers, caches) the performance contribution can be reduced to its effectiveness in reducing memory traffic or in reducing the average time to fetch operands. An important cost measure for on-chip memory is occupied area. On-chip memory performance, however, is expressed much more easily as a function of size (the storage capacity) than as a function of area. Simple models have been proposed for mapping memory size to occupied area. These models, however, are of unproven validity and only apply when comparing relatively large buffers (³ 128 words for caches, ³ 32 words for register sets) of the same structure (e.g., cache versus cache). In this paper we present an area model for on-chip memories. The area model considers the supplied bandwidth of the individual memory cells and includes such overhead as control logic, driver logic, and tag storage, thereby permitting comparison of data buffers of different organizations and of arbitrary sizes. The model gave less than 10% error when verified against real caches and register files. Using this area-utility measure F(Performance,Area), we first investigated the performance of various cache organizations and then compared the performance of register buffers (e.g., register sets, multiple overlapping sets) and on-chip caches. Comparing cache performance as a function of area, rather than size, leads to a significantly different set of organizational tradeoffs. Caches occupy more area per bit than register buffers for sizes of 128 words or less. For data caches, line size is a primary determinant of performance for small sizes while write policy becomes the primary factor for larger caches. For the same area, multiple register sets have poorer performance than a single register set with cache except when the memory access time is very fast (under 3 processor cycles).
CSL-TR-90-413

Report Number: CSL-TR-90-415
Institution: Stanford University, Computer Systems Laboratory
Title: High-speed addition in CMOS
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: February 1990
Abstract: This paper describes a fully static Complementary Metal-Oxide Semiconductor (CMOS) implementation of a Ling type adder. The implementation described herein saves up to one gate delay and always reduces the number of serial transistors in the worst-case (critical) path over the conventional carry look-ahead (CLA) approach with a negligible increase in hardware.
CSL-TR-90-415

Report Number: CSL-TR-90-418
Institution: Stanford University, Computer Systems Laboratory
Title: Runtime Access to Type Information in C++
Author: Interrante, John A.
Author: Linton, Mark A.
Date: March 1990
Abstract: The C++ language currently does not provide a mechanism for an object to determine its type at runtime. We propose the Dossier class as a standard interface for accessing type information from within a C++ program. We have implemented a tool called mkdossier that automatically generates type information in a form that can be compiled and linked with an application. In the prototype implementation, a class must have a virtual function to access an object's dossier given the object. We propose this access be provided implicitly by the language through a predefined member in all classes.
CSL-TR-90-418

Report Number: CSL-TR-90-419
Institution: Stanford University, Computer Systems Laboratory
Title: HardwareC -- A Language for Hardware Design (Version 2.0)
Author: Ku, David
Author: DeMicheli, Giovanni
Date: April 1990
Abstract: High-level synthesis is the transformation from a behavioral level specification of hardware, through a series of optimizations and translations, to an implementation in terms of logic gates and registers. The success of a high-level synthesis system is heavily dependent on how effectively the high-level language captures the ideas of the designer in a simple and understandable way. Furthermore, as system-level issues such as communication protocols and design partitioning dominate the design process, the ability to specify constraints on the timing requirements and resource utilization of a design is necessary to ensure that the design can integrate with the rest of the system. In this paper, a hardware description language called HardwareC is presented. HardwareC supports both declarative and procedural semantics, has a C-like syntax, and is extended with notion of concurrent processes, message passing, timing constraints via tagging, resource constraints, explicit instantiation of models, and template models. The language is used as the input to the Hercules High-level Synthesis System.
CSL-TR-90-419

Report Number: CSL-TR-90-423
Institution: Stanford University, Computer Systems Laboratory
Title: Implementing a Directory-Based Cache Consistency Protocol
Author: Simoni, Richard
Date: March 1990
Abstract: Directory-based cache consistency protocols have the potential to allow shared-memory multiprocessors to scale to a large number of processors. While many variations of these coherence schemes exist in the literature, they have typically been described at a rather high level, making adequate evaluation difficult. This paper explores the implementation issues of directory-based coherency strategies by developing a design at the level of detail needed to write a memory system functional simulator with an accurate timing model. The paper presents the design of both an invalidation coherency protocol and the associated directory/memory hardware. Support is added to prevent deadlock, handle subtle consistency situations, and implement a proper programming model of multiprocess execution. Extensions are delineated for realizing a multiple-threaded directory that can continue to process commands while waiting for a reply from a cache. The final hardware design is evaluated in the context of the number of parts required for implementation.
CSL-TR-90-423

Report Number: CSL-TR-90-425
Institution: Stanford University, Computer Systems Laboratory
Title: Concurrent runtime monitoring of formally specified programs
Author: Mandal, Manas
Author: Sankar, Sriram
Date: April 1990
Abstract: This paper describes an application of formal specifications after an executable program has been constructed. We describe how high level specifications can be utilized to monitor critical aspects of the behavior of a program continuously while it is executing. This methodology provides a capability to distribute the monitoring of specifications on multi-processor hardware platforms to meet practical time constraints. Typically, runtime checking of formal specifications involves a significant time penalty which makes it impractical during normal production operation of a program. In previous research, runtime checking has been applied during testing and debugging of software, but not on a permanent basis. Crucial to our current methodology is the use of multi-processor machines - hence runtime monitoring can be performed concurrently on different processors. We describe techniques for distributing checks onto different processors. To control the degree of concurrency, we introduce checkpoints - a point in the program beyond which execution cannot proceed until the specified checks have been completed. Error reporting and recovery in a multi-processor environment is complicated and there are various techniques of handling this. We describe a few of these techniques in this paper. An implementation of this methodology for the Anna specification language for Ada programs is described. Results of experiments conducted on this implementation using a 12 processor Sequent Symmetry demonstrate that permanent concurrent monitoring of programs based on formal specifications is indeed feasible.
CSL-TR-90-425

Report Number: CSL-TR-90-426
Institution: Stanford University, Computer Systems Laboratory
Title: A VLSI architecture for the FCHC isometric lattice gas model
Author: Lee, Fung F.
Author: Flynn, Michael J.
Author: Morf, Martin
Date: April 1990
Abstract: Lattice gas models are cellular automata used for the simulation of fluid dynamics. This paper addresses the design issues of a lattice gas collision rule processor for the four-dimensional FCHC isometric lattice gas model. A novel VLSI architecture based on an optimized version of Henon's isometric algorithm is proposed. One of the key concepts behind this architecture is the permutation group representation of the isometry group of the lattice. In contrast to the straightforward table lookup approach which would take 4.5 billion bits to implement this set of collision rules, the size of our processor is only about 5000 gates. With a reasonable number of pipeline stages, the processor can deliver one result per cycle with a cycle time comparable to or less than that of a common commercial DRAM.
CSL-TR-90-426

Report Number: CSL-TR-90-428
Institution: Stanford University, Computer Systems Laboratory
Title: Sub-nanosecond arithmetic
Author: Flynn, Michael J.
Author: DeMicheli, Giovanni
Author: Dutton, Robert
Author: Wooley, Bruce
Author: Pease, R. Fabian
Date: May 1990
Abstract: The SNAP (Stanford Nanosecond Arithmetic Processor) project is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realize a particular arithmetic function.
CSL-TR-90-428

Report Number: CSL-TR-90-431
Institution: Stanford University, Computer Systems Laboratory
Title: Latency and throughput tradeoffs in self-timed speed-independent pipelines and rings
Author: Williams, Ted
Date: August 1990
Abstract: Asynchronous pipelines control the flow of tokens through a sequence of logical stages based on the status of local completion detectors. As in a synchronously clocked circuit, the design of self-timed pipelines can trade off between achieving low latency and high throughput. However, there are more degrees of freedom because of the variances in specific latch and function block styles, and the possibility of varying both the number of latches between function blocks and their connections to the completion detectors. This report demonstrates the utility of a graph-based methodology for analyzing the timing dependencies and uses it to make comparisons of different configurations. It is shown that the extremes for high throughput and low latency differ significantly, the placement of the completion detectors influences timing as much as adding an additional latch, and the choice as to whether precharged or static logic is best is dependent on the cost in complexity of the completion detectors.
CSL-TR-90-431

Report Number: CSL-TR-90-436
Institution: Stanford University, Computer Systems Laboratory
Title: Application of formal specification to software maintenance
Author: Madhav, Neel
Author: Sankar, Sriram
Date: August 1990
Abstract: This paper describes the use of formal specifications and associated tools in addressing various aspects of software maintenance ---corrective, perfective, and adaptive. It also addresses the refinement of the software development process to build programs that are easily maintainable. The task of software maintenance in our case includes the task of maintaining the specification as well as maintaining the program. We focus on the use of Anna, a specification language for formally specifying Ada programs, to aid us in maintaining Ada programs. These techniques are applicable to most other specification language and programming language environments. The tools of interest are: (1) the Anna Specification Analyzer which allows us to analyze the specification for correctness with respect to our informal understanding of program behavior; and (2) the Anna Consistency Checking System which monitors the Ada program at runtime based on the Anna specification.
CSL-TR-90-436

Report Number: CSL-TR-90-438
Institution: Stanford University, Computer Systems Laboratory
Title: A Methodology for Formal Specification and Implementation of Ada Packages Using Anna
Author: Madhav, Neel
Author: Mann, Walter
Date: August 1990
Abstract: This paper presents a methodology for formal specification and prototype implementation of Ada packages using the Anna specification language. Specifications play an important role in the software development cycle. The methodology allows specifiers of Ada packages to follow a sequence of simple steps to formally specify packages. Given the formal specification of a package resulting from the methodology for package specifications, the methodology allows implementors of packages to follow a few simple steps to implement the package. The implementation is meant to be a prototype. This methodology for specification and implementation is applicable to most Ada packages. Limitations of this approach are pointed out at various points in the paper. We present software tools which help the process of specification and implementation.
CSL-TR-90-438

Report Number: CSL-TR-90-439
Institution: Stanford University, Computer Systems Laboratory
Title: Tango: A Multiprocessor Simulation and Tracing System
Author: Davis, Helen
Author: Goldschmidt, Stephen R.
Date: July 1990
Abstract: Tango is a software simulation and tracing system used to obtain data for evaluating parallel programs and multiprocessor systems. The system provides a simulated multiprocessor environment by multiplexing application processes onto a single processor. Tango achieves high efficiency by running compiled user code, and by focusing on the information of greatest interest to multiprocessing studies. The system is being applied to a wide range of investigations, including algorithm studies and a variety of hardware evaluations.
CSL-TR-90-439

Report Number: CSL-TR-90-441
Institution: Stanford University, Computer Systems Laboratory
Title: Computing Types During Program Specialization
Author: Weise, Daniel
Author: Ruf, Erik
Date: October 1990
Abstract: We have developed techniques for obtaining and using type information during program specialization (partial evaluation). Computed along with every residual expression and every specialized program is type information that bounds the possible values that the specialized program will compute at run time. The three keystones of this research are symbolic values that represent both a value and the code for creating the value, generalization of symbolic values, and the use of online fixed-point iterations for computing the type of values returned by specialized recursive functions. The specializer exploits type information to increase the efficiency of specialized functions. This research has two benefits, one anticipated and one unanticipated. The anticipated benefit is that programs that are to be specialized can now be written in a more natural style without losing accuracy during specialization. The unanticipated benefit is the creation of what we term concrete abstract interpretation. This is a method of performing abstract interpretation with concrete values where possible. The specializer abstracts values as needed, instead of requiring that all values be abstracted prior to abstract interpretation.
CSL-TR-90-441

Report Number: CSL-TR-90-442
Institution: Stanford University, Computer Systems Laboratory
Title: An improved algorithm for high-speed floating-point addition
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: August 1990
Abstract: This paper describes an improved, IEEE conforming floating-point addition algorithm. This algorithm has only one addition step involving the significand in the worst-case path, hence offering a considerable speed advantage over the existing algorithms, which typically require two to three addition steps.
CSL-TR-90-442

Report Number: CSL-TR-90-443
Institution: Stanford University, Computer Systems Laboratory
Title: A queuing analysis for disk array systems
Author: Ogata, Mikito
Author: Flynn, Michael J.
Date: August 1990
Abstract: Using a queuing model of disk arrays, we study the performance and tradeoffs in disk array sub-systems and develop guidelines for designing these sub-systems in various CPU environments. Finally, we compare our model with some earlier simulation results.
CSL-TR-90-443

Report Number: CSL-TR-90-453
Institution: Stanford University, Computer Systems Laboratory
Title: Event patterns: A language construct for hierarchical designs of concurrent systems
Author: Luckham, David D.
Author: Gennart, Benoit A.
Date: November 1990
Abstract: Event patterns are a language construct for expressing relationships between specifications at different levels of a hierarchical design of a concurrent system. They provide a facility missing from current hardware design languages such as VHDL, or programming languages with parallel constructs such as Ada. This paper explains the use of event patterns in (1) defining mappings between different levels of a design hierarchy, and (2) automating the comparison of the behavior of different design levels during simulation. It describes the language constructs for defining event patterns and mappings, and shows their use in a design example, a 16-bit CPU.
CSL-TR-90-453

Report Number: CSL-TR-90-454
Institution: Stanford University, Computer Systems Laboratory
Title: Page allocation to reduce access time of physical caches
Author: Bray, Brian K.
Author: Lunch, William L.
Author: Flynn, Michael J.
Date: November 1990
Abstract: A simple modification to an operating system's page allocation algorithm can give physically addressed caches the speed of virtually addressed caches. Colored page allocation reduces the number of bits that need to be translated before cache access, allowing large low-associativity caches to be indexed before address translation, which reduces the latency to the processor. The colored allocation also has other benefits: caches miss less (in general) and more uniformly, and the inclusion principle holds for second level caches with less associativity. However, the colored allocation requires main memory partitioning, and more common bits for shared virtual addresses. Simulation results show high non-uniformity of cache miss rates for normal allocation. Analysis demonstrates the extent of second-level cache inclusion, and the reduction in effective main-memory due to partitioning.
CSL-TR-90-454

Report Number: CSL-TR-91-459
Institution: Stanford University, Computer Systems Laboratory
Title: On fast IEEE rounding
Author: Quach, Nhon
Author: Takagi, Naofumi
Author: Flynn, Michael J.
Date: January 1991
Abstract: A systematic general rounding procedure is proposed. This procedure consists of 2 steps: constructing a rounding table and selecting a prediction scheme. Optimization guidelines are given in each step to minimize the hardware used. This procedure-based rounding method has the additional advantage that verification and generalization are trivial. Two rounding hardware models are described. The first is shown to be identical to that reported by Santoro, et al. The second is more powerful, providing solutions where the first fails. Applying this approach to the IEEE rounding modes for high-speed conventional binary multipliers reveals that round to infinity is more difficult to implement than the round to nearest mode; more adders are potentially needed. Round to zero requires the least amount of hardware. A generalization of this procedure to redundant binary multipliers reveals two major advantages over conventional binary multipliers. First, the computation of the sticky bit consumes considerably less hardware. Second, implementing round to positive and minus infinity modes does not require the examination of the sticky bit, removing a possible worst-case path. A generalization of this approach to addition produces a similar solution to that reported by Quach and Flynn. Although generalizable to other kinds of rounding as well as other arithmetic operations, we only treat the case of IEEE rounding for addition and multiplication; IEEE rounding because it is the current standard on rounding, addition and multiplication because they are the most frequently used arithmetic operations in a typical scientific computation.
CSL-TR-91-459

Report Number: CSL-TR-91-463
Institution: Stanford University, Computer Systems Laboratory
Title: Leading One Detection --- Implementation, Generalization, and Application
Author: Quach, Nhon
Author: Flynn, Michael J.
Date: March 1991
Abstract: This paper presents the concept of leading-one prediction (LOP) in greater detail and describes two existing implementations. The first one is similar to that used in the IBM RS/6000 processor. The second is a distributed version of the first, consuming less hardware when multiple patterns need to be detected. We show how to modify these circuits for sign-magnitude numbers as dictated by the IEEE standard. We then point out that (1) LOP and carry lookahead in parallel addition belong to the same class of problem, that of a bit pattern detection. Such a recognition allows techniques developed for parallel addition to be borrowed for bit pattern detection. And (2) LOP can be applied to compute the sticky bit needed for binary multipliers to perform IEEE rounding.
CSL-TR-91-463

Report Number: CSL-TR-91-465
Institution: Stanford University, Computer Systems Laboratory
Title: Analysis of Power Supply Networks in VLSI Circuits
Author: Stark, Don
Date: March 1991
Abstract: Although the trend toward finer geometries and larger chips has produced faster systems, it has also created larger voltage drops and higher current densities in chip power supply networks. Excessive voltage drops in the power supply lines cause incorrect circuit operation, and high current densities lead to circuit failure via electromigration. Analyzing this power supply noise by hand for large circuits is difficult and error prone; automatic checking tools are needed to make the analysis easier. This thesis describes Ariel, a CAD tool that helps VLSI designers analyze power supply noise. The system consists of three main components, a resistance extractor, a current estimator, and a linear solver, that are used together to determine the voltage drops and current density along the supply lines. The resistance extractor includes two parts: a fast extractor that calculates resistances quickly using simple heuristics, and a slower, more accurate finite element extractor. Despite its simplicity, the fast extractor obtained nearly the same results as the finite element one and is two orders of magnitude faster. The system also contains two current estimators, one for CMOS designs and one for ECL. The CMOS current estimator is based on the switch level simulator Rsim, and produces a time-varying current distribution that includes the effects of charge sharing, image currents, and slope on the gate's inputs. The ECL, estimator does a static analysis of the design, calculating each gate's tail current and tracing through the network to find where it enters the power supplies. Extensions to the estimator allow it to handle more complex circuits, such as shared current lines and diode decoders. Finally, the linear solver applies this current pattern to the resistance network, and efficiently calculates voltages and current densities by taking advantage of topological characteristics peculiar to power supply networks. It removes trees, simple loops, and series sections for separate analysis. These techniques substantially reduce the time required for solution. This report also includes the results of running the system on several large designs, and points out flaws that Ariel uncovered in their power networks.
CSL-TR-91-465

Report Number: CSL-TR-91-468
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient moment-based timing analysis for variable accuracy switch level simulation
Author: Kao, Russell
Author: Horowitz, Mark
Date: April 1991
Abstract: We describe a timing analysis algorithm which can achieve the efficiency of RC tree analysis while retaining much of the generality of Asymptotic Waveform Estimation. RC tree analysis from switch level simulation is generalized to handle piecewise linear transistor models, non tree topologies, floating capacitors, and feedback. For simple switch level models the complexity is O(n). The algorithm allows the user to trade off efficiency vs accuracy through the selection of transistor models of varying complexity.
CSL-TR-91-468

Report Number: CSL-TR-91-469
Institution: Stanford University, Computer Systems Laboratory
Title: SPLASH: Stanford parallel applications for shared-memory
Author: Singh, Jaswinder Pal
Author: Weber, Wolf-Dietrich
Author: Gupta, Anoop
Date: April 1991
Abstract: This report was replaced and updated in CSL-TR-92-526
CSL-TR-91-469

Report Number: CSL-TR-91-470
Institution: Stanford University, Computer Systems Laboratory
Title: Writes caches as an alternative to write buffers
Author: Bray, Brian K.
Author: Flynn, Michael J.
Date: April 1991
Abstract: Write buffers help unbind one level of a memory hierarchy from the next, thus write buffers are used to reduce write stalls. Write buffers are used in write-through systems so that writes can occur at the rate the cache can handle them, but write buffers don't reduce the number of writes, or cluster writes for block transfers. A write cache is a cache that uses an allocate on write miss, write-back, no allocate on read miss strategy. A write cache tries to reduce the total number of writes (write traffic) to the next level by taking advantage of the temporal locality of writes. A write cache also groups writes for block transfers by taking advantage of the spatial locality of writes. We have found that small write caches can significantly reduce the write traffic to the first write-back level after the processor's register set. Systems that would benefit from reduced write traffic to the first write-back level would benefit from using a write cache instead of a write buffer. The temporal and spatial locality of writes is very important in determining what organization the write cache should have.
CSL-TR-91-470

Report Number: CSL-TR-91-475
Institution: Stanford University, Computer Systems Laboratory
Title: Making effective use of shared-memory multiprocessors: the process control approach
Author: Gupta, Anoop
Author: Tucker, Andrew
Author: Stevens, Luis
Date: May 1991
Abstract: We present the design, implementation, and performance of a novel approach for effectively utilizing shared-memory multiprocessors in the presence of multiprogramming. Our approach offers high performance by combining the techniques of process control and processor partitioning. The process control technique is based on the principle that to maximize performance, a parallel application must dynamically match the number of runnable processes associated with it to the effective number of processors available to it. This avoids the problems arising from oblivious preemption of processes and it allows an application to work at a better operating point on its speedup versus processors curve. Processor partitioning is necessary for dealing with realistic multiprogramming environments, where both process controlled and non-controlled applications may be present. It also helps improve the cache performance of applications and removes the bottleneck associated with a single centralized scheduler. Preliminary results from an implementation of the process control approach, with a user-level server, were presented in a previous paper. In this paper, we extend the process control approach to work with processor partitioning and fully integrate the approach with the operating system kernel. This also allows us to address a limitation in our earlierÊimplementation wherein a close correspondence between runnable processes and the available processors was not maintained in the presence of I/O. The paper presents the design decisions and the rationale for the current implementation, along with extensive results from executions on a high-performance Silicon Graphics 4D/340
CSL-TR-91-475

Report Number: CSL-TR-91-480
Institution: Stanford University, Computer Systems Laboratory
Title: Strategies for branch target buffers
Author: Bray, Brian K.
Author: Flynn, M. J.
Date: June 1991
Abstract: Achieving high instruction issue rates depends on the ability to dynamically predict branches. We compare two schemes for dynamic branch prediction: a separate branch target buffer and an instruction cache based branch target buffer. For instruction caches of 4KB and greater, instruction cache based branch prediction performance is a strong function of line size, and a weak function of instruction cache size. An instruction cache based branch target buffer with a line size of 8 (or 4) instructions performs about as well as a separate branch target buffer structure which has 64 (or 256, respectively) entries. Software can rearrange basic blocks in a procedure to reduce the number of taken branches, thus reducing the amount of branch prediction hardware needed. With software assistance, predicting all branches as not branching performs as well as a 4 entry branch target buffer without assistance, and a 4 entry branch target buffer with assistance performs as well as a 32 entry branch target buffer without assistance. The instruction cache based branch target buffer also benefits from the software, but only for line sizes of more than 4 instructions.
CSL-TR-91-480

Report Number: CSL-TR-91-481
Institution: Stanford University, Computer Systems Laboratory
Title: Subnanosecond arithmetic (Second Report)
Author: Flynn, Michael J.
Author: DeMicheli, Giovanni
Author: Dutton, Robert
Author: Pease, R. Fabian
Author: Wooley, Bruce
Date: June 1991
Abstract: The Stanford Nanosecond Arithmetic Project is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realize a particular arithmetic function. This year's report highlights recent results in the area of wave pipelining. We have fabricated a number of prototype die, implementing a multiplier slice. Cycle times below 5ns were realized.
CSL-TR-91-481

Report Number: CSL-TR-91-483
Institution: Stanford University, Computer Systems Laboratory
Title: Suggestions for implementing a fast IEEE multiply-add-fused instruction
Author: Quach, Nhon
Author: Flynn, Michael
Date: July 1991
Abstract: We studied three possible strategies to overlap the operations in a floating-point add (FPA) and a floating-point multiply (FPM) for implementing an IEEE multiply-add-fused (MAF) instruction. The operations in FPM and FPA are: (a) non-overlapped, (b) fully-overlapped, and (c) partially-overlapped. The first strategy corresponds to multiply-add-chained (MAC) widely used in vector processors. The second (Greedy) strategy uses a greedy algorithm, yielding an implementation similar to the IBM RS/6000 one. The third and final (SNAP) strategy uses a less aggressive starting configuration and corresponds to the SNAP implementation. An IEEE MAF delivers the same result as that obtained via a separate IEEE FPM and FPA. Two observations have prompted this study. First, in the IBM RS/6000 implementation, the design tradeoffs have been made for high internal data precision, which facilitates the execution of elementary functions. These tradeoff decisions, however, may not be valid for an IEEE MAF. Second, the RS/6000 implementation assumed a different critical path for FPA and FPM, which does not reflect the current state-of-the-art in FP technology. Using latency and hardware costs as the performance metrics we show that: (1) MAC has the lowest FPA latency and consumes the least hardware. But its MAF latency is the highest. (2) Greedy has a medium MAF latency but the highest FPA latency. And (3) SNAP has the lowest MAF latency and a slightly higher FPA latency than that of MAC, consuming an area that is comparable with that of Greedy. Both Greedy and SNAP have higher design complexity arising from rounding for the IEEE standard. SNAP has an additional wire complexity, which Greedy does not have because of its simpler datapath. If rounding for the IEEE standard is not an issue, the Greedy strategy --- and therefore the RS/6000 --- seems reasonable for applications with a high MAF to FPA ratio.
CSL-TR-91-483

Report Number: CSL-TR-91-484
Institution: Stanford University, Computer Systems Laboratory
Title: Self-Consistency and Transitivity in Self-Calibration Procedures
Author: Raugh, Michael
Date: July 1991
Abstract: Self-calibration refers to the use of an uncalibrated measuring instrument and an uncalibrated object called an artifact, such as a rigid marked plate, to simultaneously measure the artifact and calibrate the instrument. Typically, the artifact is measured in more than one position, and the required information is derived from comparisons of the various measurements. The problems of self-calibration are surprisingly subtle. This paper develops concepts and vocabulary for dealing with such problems in one and two dimensions and uses simple (non-optimal) measurement procedures to reveal the underlying principles. The approach in two dimensions is mathematically constructive: procedures are described for measuring an uncalibrated artifact in several stages, involving progressive transformations of the instrument's uncalibrated coordinate system, until correct coordinates for the artifact are obtained and calibration of the instrument is achieved. Self-consistency and transitivity, as defined within, emerge as key concepts. It is shown that self-consistency and transitivity are necessary conditions for self-calibration. Consequently, in general, it is impossible to calibrate a two dimenstional measuring instrument by simply rotating and measureing a calibration plate about a fixed center.
CSL-TR-91-484

Report Number: CSL-TR-91-485
Institution: Stanford University, Computer Systems Laboratory
Title: A Unidraw-Based User Interface Builder
Author: Vlissides, John M.
Author: Tang, Steven
Date: August 1991
Abstract: Ibuild is a user interface builder that lets a user manipulate simulations of toolkit objects rather than actual toolkit objects. Ibuild is built with Unidraw, a framework for building graphical editors that is part of the InterViews toolkit. Unidraw makes the simulation-based approach attractive. Simulating toolkit objects in Unidraw makes it easier to support editing facilities that are common in other kinds of graphical editors, and it keeps the builder insulated from a particular toolkit implementation. Ibuild supports direct manipulation analogs of InterViews' composition mechanisms, which simplify the specification of an interface's layout and resize semantics. Ibuild also leverages the C++ inheritance mechanism to decouple builder-generated code from the rest of the application. And while current user interface builders stop at the widget level, ibuild incorporates Unidraw abstractions to simplify the implementation of graphical editors.
CSL-TR-91-485

Report Number: CSL-TR-91-488
Institution: Stanford University, Computer Systems Laboratory
Title: The Stanford ADA style checker: an application of the ANNA tools and methodology
Author: Walicki, Michal
Author: Skakkebaek, Jens Ulrik
Author: Sankar, Sriram
Date: August 1991
Abstract: This report describes the Ada style checker, which was designed and constructed in Winter and Spring 1989-90. The style checker is based on the Stanford Anna Tools and has been annotated using Anna. The style checker examines Ada programs for "correct style'' which is defined in a style specification language (SSL). A style checker generator is used to automatically generate a style checker based on a set of style specifications.
CSL-TR-91-488

Report Number: CSL-TR-91-492
Institution: Stanford University, Computer Systems Laboratory
Title: Paging Performance with Page Coloring.
Author: Lynch, William L.
Author: Flynn, Michael J.
Date: October 1991
Abstract: Constraining the mapping of virtual to physical addresses page coloring can speed and/or simplify caches in the presence of virtual memory. For the mapping to hold, physical memory must be partitioned into distinct colors, and virtual pages allocated to a specific color of physical page determined by the mapping. This paper uses and analytical model and simulation to compare the paging effects of colored versus uncolored (conventional) page allocation, and concludes that these effects are small.
CSL-TR-91-492

Report Number: CSL-TR-91-496
Institution: Stanford University, Computer Systems Laboratory
Title: ANNA package specification: case studies
Author: Kenney, John
Author: Mann, Walter
Date: October 1991
Abstract: We present techniques of software specification of Ada* software based on the Anna specification language and examples of Ada packages formally specified in Anna. A package specification for an abstract set type is used to illustrate the techniques and pitfalls involved in the process of software specification and development. This specification not only exemplifies good Anna style and specification approach, but has a secondary goal of teaching the reader how to use Anna and the associated set of Anna tools developed at Stanford University over the past six years. The technical report thus aims to give readers a new way of looking at the software design and development process, synthesizing fifteen years of research in the process. *Ada is a registered trademark of the U.S. Government (Ada Joint Program Office)
CSL-TR-91-496

Report Number: CSL-TR-91-498
Institution: Stanford University, Computer Systems Laboratory.
Title: Spectral Techniques for Technology Mapping
Author: Yang, Jerry Chih-Yuan
Author: DeMicheli, Giovanni
Date: March 1994
Abstract: Technology mapping is the crucial step in logic synthesis where technology dependent optimizations take place. The matching phase of a technology mapping algorithm is generally considered the most computationally intensive task, because it is called on repeatedly. In this work, we investigate applications of spectral techniques in doing matching. In particular, we present an algorithm that will detect NPN-equivalent Boolean functions. We show that while generating the spectra for Boolean functions may be expensive, this algorithm offers significant pruning of the search space and is simple to implement. The algorithm is implemented as part of the Specter technology mapper, and results are compared to other Boolean matching techniques.
CSL-TR-91-498

Report Number: CSL-TR-92-510
Institution: Stanford University, Computer Systems Laboratory
Title: Rapide-0.2 Examples
Author: Hsieh, Alexander
Date: February 1992
Abstract: Rapide-0.2 is an executable language for prototyping distributed, time sensitive systems. We present in this report a series of simple, working example programs in the language. In each example we present one or more new concepts or constructs of the Rapide-0.2 language with later examples drawing on previously presented material. The examples are written for both those who wish to use the Rapide-0.2 language to do serious prototyping and for those who just wish to be familiar with it. The examples were not written for someone who wishes to learn prototyping in general. CSL-TN-92-387 is an informal reference manual, describing the Rapide-0.2 language and tools, which might be helpful to have in conjunction with CSL-TR-92-510 (p. 191).
CSL-TR-92-510

Report Number: CSL-TR-92-515
Institution: Stanford University, Computer Systems Laboratory
Title: Partial orderings of event sets and their application to prototyping concurrent timed systems
Author: Luckham, David C.
Author: Vera, James
Author: Bryan, Doug
Author: Augustin, Larry
Author: Belz, Frank
Date: April 1992
Abstract: Rapide is a concurrent object-oriented language specifically designed for prototyping large concurrent systems. One of the principle design goals has been to adopt a computation model in which the synchronization, concurrency, dataflow, and timing aspects of a prototype are explicitly represented and easily accessible both to the prototype itself and to the prototyper. This paper describes the partially ordered event set (poset) computation model, and the features of Rapide for using posets in reactive prototypes and for automatically checking posets. Some critical issues in the implementation of Rapide are described and our experience with them is summarized. An example prototyping scenario illustrates uses of the poset computation model.
CSL-TR-92-515

Report Number: CSL-TR-92-516
Institution: Stanford University, Computer Systems Laboratory
Title: Opportunities for Online Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: Partial evaluators can be separated into two classes: offline specializers, which make all of their reduce/residualize decisions before specialization, and online specializers, which make such decisions during specialization. The choice of which method to use is driven by a tradeoff between the efficiency of the specializer and the quality of the residual programs that it produces. Existing research describes some of the inefficiencies of online specializers, and how these are avoided using offline methods, but fails to address the price paid in specialization quality. This paper motivates research in online specialization by describing two fundamental limitations of the offline approach, and explains why the online approach does not encounter the same difficulties.
CSL-TR-92-516

Report Number: CSL-TR-92-517
Institution: Stanford University, Computer Systems Laboratory
Title: Preserving Information during Online Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: The degree to which a partial evaluator can specialize a source program depends on how accurately the partial evaluator can represent and maintain information about runtime values. Partial evaluators always lose some accuracy due to their use of finite type systems; however, existing partial evaluation techniques lose information about runtime values even when their type systems are capable of representing such information. This paper describes two sources of such loss in existing specializers, solutions for both cases, and the implementation of these solutions in our partial evaluation system, FUSE.
CSL-TR-92-517

Report Number: CSL-TR-92-518
Institution: Stanford University, Computer Systems Laboratory
Title: Avoiding Redundant Specialization during Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: Existing partial evaluators use a strategy called polyvariant specialization, which involves specializing program points on the known portions of their arguments, and re-using such specializations only when these known portions match exactly. We show that this re-use criterion is overly restrictive, and misses opportunities for sharing in residual programs, thus producing large residual programs containing redundant specializations. We develop a criterion for re-use based on computing the domains of specializations, describe an approximate implementation of this criterion based on types, and show its implementation in our partial evaluation system FUSE. In addition, we describe several extensions to our mechanism to make it compatible with more powerful specialization strategies and to increase its efficiency. After evaluating our algorithm's usefulness, we relate it to existing work in partial evaluation and machine learning.
CSL-TR-92-518

Report Number: CSL-TR-92-520
Institution: Stanford University, Computer Systems Laboratory
Title: An Empirical Study of an Abstract Interpretation of Scheme Programs
Author: Kanamori, Atty
Author: Weise, Daniel
Date: April 1992
Abstract: Abstract Interpretation, a powerful and general framework for performing global program analysis, is being applied to problems whose difficulty far surpasses the traditional "bit-vector'' dataflow problems for which many of the high-speed abstract interpretation algorithms worked so well. Our experience has been that current methods of large scale abstract interpretation are unacceptably expensive. We studied a typical large-scale abstract interpretation problem: computing the control flow of a higher order program. Researchers have proposed various solutions that are designed primarily to improve the accuracy of the analysis. The cost of the analyses, and its relationship to accuracy, is addressed only cursorily in the literature. Somewhat paradoxically, one can view these strategies as attempts to simultaneously improve the accuracy and reduce the cost. The less accurate strategies explore many spurious control paths because many flowgraph paths represent illegal execution paths. For example, the less accurate strategies violate the LIFO constraints on procedure call and return. More accurate analyses investigate fewer control paths, and therefore may be more efficient despite their increased overhead. We empirically studied this accuracy versus efficiency tradeoff. We implemented two fixpoint algorithms, and four semantics (baseline, baseline + stack reasoning, baseline + contour reasoning, baseline + stack reasoning + contour reasoning) for a total of eight control flow analyzers. Our benchmarks test various programming constructs in isolation --- hence, if a certain algorithm exhibits poor performance, the experiment also yields insight into what kind of program behavior results in that poor performance. The results suggest that strategies that increase accuracy in order to eliminate spurious paths often generate unacceptable overhead in the parts of the analysis that do not benefit from the increased accuracy. Furthermore, we found little evidence that the extra effort significantly improves the accuracy of the final result. This suggests that increasing the accuracy of the analysis globally is not a good idea, and that future research shouldÊinvestigate adaptive algorithms that use different amounts of precision on different parts of the problem.
CSL-TR-92-520

Report Number: CSL-TR-92-523
Institution: Stanford University, Computer Systems Laboratory
Title: Architectural and implementation tradeoffs in the design of multiple-context processors
Author: Laudon, James
Author: Gupta, Anoop
Author: Horowitz, Mark
Date: May 1992
Abstract: Multiple-context processors have been proposed as an architectural technique to mitigate the effects of large memory latency in multiprocessors. We examine two schemes for implementing multiple-context processors. The first scheme switches between contexts only on a cache miss, while the other interleaves the contexts on a cycle-by-cycle basis. Both schemes provide the capability for a single context to fully utilize the pipeline. We show that cycle-by-cycle interleaving of contexts provides a performance advantage over switching contexts only at a cache miss. This advantage results from the context interleaving hiding pipeline dependencies and reducing the context switch cost. In addition, we show that while the implementation of the interleaved scheme is more complex, the complexity is not overwhelming. As pipelines get deeper and operate at lower percentages of peak performance, the performance advantage of the interleaved scheme is likely to justify its additional complexity.
CSL-TR-92-523

Report Number: CSL-TR-92-526
Institution: Stanford University, Computer Systems Laboratory
Title: SPLASH: Stanford parallel applications for shared-memory*
Author: Singh, Jaswinder Pal
Author: Weber, Wolf-Dietrich
Author: Gupta, Anoop
Date: June 1992
Abstract: We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. Our goal is to provide a suite of realistic applications that will serve as a well-documented and consistent basis for evaluation studies. We describe the applications currently in the suite in detail, discuss and compare some of their important characteristicsPsuch as data locality, granularity, synchronization, etc.Pand explore their behavior by running them on a real multiprocessor as well as on a simulator of an idealized parallel architecture. We expect the current set of applications to act as a nucleus for a suite that will grow with time. This report replaces and updates CSL-TR-91-469, April 1991.
CSL-TR-92-526

Report Number: CSL-TR-92-528
Institution: Stanford University, Computer Systems Laboratory
Title: Binary multiplication Using Partially Redundant Multiples
Author: Bewick, Gary
Author: Flynn, Michael J.
Date: June 1992
Abstract: This report presents an extension to Booth's algorithm for binary multiplication. Most implementations that utilize Booth's algorithm use the 2 bit version, which reduces the number of partial products required to half that required by a simple add and shift method. Further reduction in the number of partial products can be obtained by using higher order versions of Booth's algorithm, but it is necessary to generate multiples of one of the operands (such as 3 times an operand) by the use of a carry propagate adder. This carry propagate addition introduces significant delay and additional hardware. The algorithm described in this report produces such difficult multiples in a partially redundant form, using a series of small length adders. These adders operate in parallel with no carries propagating between them. As a result, the delay introduced by multiple generation is minimized and the hardware needed for the multiple generation is also reduced, due to the elimination of expensive carry lookahead logic.
CSL-TR-92-528

Report Number: CSL-TR-92-532
Institution: Stanford University, Computer Systems Laboratory
Title: Piecewise Linear Models for Switch-Level Simulation
Author: Kao, Russell
Date: June 1992
Abstract: Rsim is an efficient logic plus timing simulator that employs the switched resistor transistor model and RC tree analysis to simulate efficiently MOS digital circuits at the transistor level. We investigate the incorporation of piecewise linear transistor models and generalized moments matching into this simulation framework. General piecewise linear models allow more accurate MOS models to be used to simulate circuits that are hard for Rsim. Additionally, they enable the simulator to handle circuits containing bipolar transistors such as ECL and BiCMOS. Nonetheless, the switched resistor model has proved to be efficient and accurate for a large class of MOS digital circuits. Therefore, it is retained as just one particular model available for use in this framework. The use of piecewise linear models requires the generalization of RC tree analysis. Unlike switched resistors, more general models may incorporate gain and floating capacitance. Additionally, we extend the analysis to handle non-tree topologies and feedback. Despite the increased generality, for many common MOS and ECL circuits the complexity remains linear. Thus, this timing analysis can be used to simulate, efficiently, those portions of the circuit that are well described by traditional switch level models, while simultaneously simulating, more accurately, those portions that are not. We present preliminary results from a prototype simulator, Mom. We demonstrate its use on a number of MOS, ECL, and BiCMOS circuits.
CSL-TR-92-532

Report Number: CSL-TR-92-534
Institution: Stanford University, Computer Systems Laboratory
Title: On the specialization of online program specializers
Author: Ruf, Erik
Author: Weise, Daniel
Date: July 1992
Abstract: Program specializers improve the speed of programs by performing some of the programs' reductions at specialization time rather than at runtime. This specialization process can be time-consuming; one common technique for improving the speed of the specialization of a particular program is to specialize the specializer itself on that program, creating a custom specializer, or program generator, for that particular program. Much research has been devoted to the problem of generating efficient program generators, which do not perform reductions at program generation time which could instead have been performed when the program generator was constructed. The conventional wisdom holds that only offline program specializers, which use binding time annotations, can be specialized into such efficient program generators. This paper argues that this is not the case, and demonstrates that the specialization of a nontrivial online program specializer similar to the original "naive MIX" can indeed yield an efficient program generator. The key to our argument is that, while the use of binding time information at program generator generation time is necessary for the construction of an efficient custom specializer, the use of explicit binding time approximation techniques is not. This allows us to distinguish the problem at hand (i.e., the use of binding time information during program generator generation) from particular solutions to that problem (i.e., offline specialization). We show that, given a careful choice of specializer data structures, and sufficiently powerful specialization techniques, binding time information can be inferred and utilized without the use of explicit binding time approximation techniques. This allows the construction of efficient, optimizing program generators from online program specializers.
CSL-TR-92-534

Report Number: CSL-TR-92-546
Institution: Stanford University, Computer Systems Laboratory
Title: The accuracy of trace-driven simulations of multiprocessors
Author: Goldschmidt, Stephen R.
Author: Hennessy, John L.
Date: September 1992
Abstract: In trace-driven simulation, traces generated for one set of machine characteristics are used to simulate a machine with different characteristics. However, the execution path of a multiprocessor workload may depend on the ordering of events on different processors, which in turn depends on machine characteristics such as memory system timings. Trace-driven simulations of multiprocessor workloads are inaccurate unless the timing-dependencies are eliminated from the traces. We measure such inaccuracies by comparing trace-driven simulations to direct simulations of the same workloads. The results were identical only for workloads whose timing dependencies were eliminated from the traces. The remaining workloads used either first-come first-served scheduling or non-deterministic algorithms; these characteristics resulted in timing-dependencies that could not be eliminated from the traces. Workloads which used task-queue scheduling had particularly large discrepancies because task-queue operations, unlike other synchronization operations, were not abstracted. Two types of simulation results had especially large discrepancies: those related to synchronization latency and those derived from relatively small numbers of events. Studies that rely on such results should use timing- independent traces or direct simulation.
CSL-TR-92-546

Report Number: CSL-TR-92-548
Institution: Stanford University, Computer Systems Laboratory
Title: System synthesis via hardware-software co-design
Author: Gupta, Rajesh K.
Author: DeMicheli, Giovanni
Date: October 1992
Abstract: Synthesis of circuits containing application-specific as well as re-programmable components such as off-the-shelf microprocessors provides a promising approach to realization of complex systems using a minimal amount of application-specific hardware while still meeting the required performance constraints. We formulate the synthesis problem of complex behavioral descriptions with performance constraints as a hardware-software co-design problem. The target system architecture consists of a software component as a program running on a re-programmable processor assisted by application-specific hardware components. System synthesis is performed by first partitioning the input system description into hardware and software portions and then by implementing each of them separately. We consider the problem of identifying potential hardware and software components of a system described in a high-level modeling language. Partitioning approaches are presented based on decoupling of data and control flow, and based on communication/synchronization requirements of the resulting system design. Synchronization between various elements of a mixed system design is one of the key issues that any synthesis system must address. We present software and interface synchronization schemes that facilitate communication between system components. We explore the relationship between the non-determinism in the system models and the associated synchronization schemes needed in system implementations. The synthesis of dedicated hardware is achieved by hardware synthesis tools, while the software component is generated using software compiling techniques. We present tools to perform synthesis of a system description into hardware and software components. The resulting software component is assumed to be implemented for the DLX machine, a load/store microprocessor. We present design of an ethernet based network coprocessor to demonstrate the feasibility of mixed system synthesis.
CSL-TR-92-548

Report Number: CSL-TR-92-550
Institution: Stanford University, Computer Systems Laboratory
Title: Cache Coherence Directories for Scalable Multiprocessors
Author: Simoni, Richard
Date: October 1992
Abstract: Directory-based protocols have been proposed as an efficient means of implementing cache coherence in large-scale shared-memory multiprocessors. This thesis explores the trade-offs in the design of cache coherence directories by examining the organization of the directory information, the options in the design of the coherency protocol, and the implementation of the directory and protocol. The traditional directory organization that maintains a full valid bit vector per directory entry is unsuitable for large-scale machines due to high storage overhead. This thesis proposes several alternate organizations. Limited pointers directories replace the bit vactor with several pointers that indicate those caches containing the data. Although this scheme performs well across a wide range of workloads, its performance does not improve as the read/write ratio becomes very large. To address this drawback, a dynamic pointer allocation directory is proposed. This directory allocates pointers from a pool to particular memory blocks as they are needed. Since the pointers may be allocated to any block on the memory module, the probability of running short of pointers is very small. Among the set of possible organizations, dynamic pointer allocation lies at an attractive cost/performance point. Measuring the performance impact of three coherency protocol features makes the virtues of simplicity clear. Adding a clean/exclusive state to reduce the time required to write a clean block results in only modest performance improvement. Using request forwarding to transfer a dirty block directly to another cache that has requested it yields similar results. For small cache block sizes, write hits to clean blocks can be simply treated as write misses without incurring significant extra network traffic. Protocol features designed to improve performance must be examined carefully, for they often complicate the protocol without offering substantial benefit. Implementing directory-based coherency presents several challenges. Methods are described for preventing deadlock, maintaining a model of parallel execution, handling subtle situations caused by temporary inconsistencies between cache and directory state, and tolerating out-of-order message delivery. Using these techniques, cache coherence can be added to large-scale multiprocessors in an inexpensive yet effective manner.
CSL-TR-92-550

Report Number: CSL-TR-92-553
Institution: Stanford University, Computer Systems Laboratory
Title: Branch predication using large self history
Author: Johnson, John D.
Date: December 1992
Abstract: Branch prediction is the main method of providing speculative opportunities for new high performance processors, therefore the accuracy of branch prediction is becoming very important. Motivated by this desire to achieve high levels of branch prediction, this study examines methods of using up to 24 bits branch direction history to determine the probable outcome of the next execution of a conditional branch. Using profiling to train a prediction logic function achieves an average branch prediction accuracy of up to 96.9% for the six benchmarks used in this study.
CSL-TR-92-553

Report Number: CSL-TR-93-554
Institution: Stanford University, Computer Systems Laboratory
Title: Using a Floating-Point Multiplier's Internals for High-Radix Division and Square Root
Author: Schwarz, Eric M.
Author: Flynn, Michael J.
Date: January 1993
Abstract: A method for obtaining high-precision approximations of high-order arithmetic operations at low-cost is presented in this study. Specifically, high-precision approximations of the reciprocal (12 bits worst case) and square root (16 bits) operations are obtained using the internal hardware of a floating-point multiplier without the use of look-up tables. The additional combinatorial logic necessary is very small due to the reuse of existing hardware. These low-cost high-precision approximations are used by iterative algorithms to perform the operations of division and square root. The method presented also applies to several other high-order arithmetic operations. Thus, high-radix algorithms for high-order arithmetic operations such as division and square root are possible at low-cost.
CSL-TR-93-554

Report Number: CSL-TR-93-556
Institution: Stanford University, Computer Systems Laboratory
Title: Support for Speculative Execution in High-Performance Processors
Author: Smith, Michael David
Date: November 1992
Abstract: Superscalar and superpipelining techniques increase the overlap between the instructions in a pipelined processor, and thus these techniques have the potential to improve processor performance by decreasing the average number of cycles between the execution of adjacent instructions. Yet, to obtain this potential performance benefit, an instruction scheduler for this high-performance processor must find the independent instructions within the instruction stream of an application to execute in parallel. For non-numerical applications, there is an insufficient number of independent instructions within a basic block, and consequently the instruction scheduler must search across the basic block boundaries for the extra instruction-level parallelism required by the superscalar and superpipelining techniques. To exploit instruction-level parallelism across a conditional branch, the instruction scheduler must support the movement of instructions above a conditional branch, and the processor must support the speculative execution of these instructions. We define boosting, an architectural mechanism for speculative execution, that allows us to uncover the instruction-level parallelism across conditional branches without adversely affecting the instruction count of the application or the cycle time of the processor. Under boosting, the compiler is responsible for analyzing and scheduling instructions, while the hardware is responsible for ensuring that the effects of a speculatively-executed instruction do not corrupt the program state when the compiler is incorrect in its speculation. To experiment with boosting, we built a global instruction scheduler, which is specifically tailored for the non-numerical environment, and a simulator, which determines the cycle-count performance of our globally-scheduled programs. We also analyzed the hardware requirements for boosting in a typical load/store architecture. Through the cycle-count simulations and an understanding of the cycle-time impact of the hardware support for boosting, we found that only a small amount of hardware support for speculative execution is necessary to achieve good performance in a small-issue, superscalar processor.
CSL-TR-93-556

Report Number: CSL-TR-93-560
Institution: Stanford University, Computer Systems Laboratory
Title: The Cramer Rao Bound for Discrete-Time Edge Position
Author: Gatherer, Alan
Date: February 1993
Abstract: The problem of estimating the position of an edge from a series of samples often occurs in the fields of machine vision and signal processing. It is therefore of interest to assess the accuracy of any estimation algorithm. Previous work in this area has produced bounds for the continuous time estimator. In this paper we derive a closed form for the minimum variance bound (or Cramer Rao bound) for estimating the position of an arbitrarily shaped edge in white Gaussian noise for the discrete samples case. We quantify the effects of the sampling rate, the bandwidth of the edge, the shape of the edge and the size of the observation window on the variance of the estimator. We describe a maximum likelihood estimator and show that in practice this estimator requires fewer computations than standard correlation.
CSL-TR-93-560

Report Number: CSL-TR-93-561
Institution: Stanford University, Computer Systems Laboratory
Title: Fetch Caches
Author: Bray, Brian K.
Author: Flynn, Michael J.
Date: February 1993
Abstract: For high performance, data caches must have a low miss rate and provide high bandwidth, while maintaining low latency. Larger and more complex set associative caches provide lower miss rates but at the cost of increased latency. Interleaved data caches can improve the available bandwidth, but the improvement is limited by bank conflicts and increased latency due to the switching networks required to distribute cache addresses and to route the data. We propose using a small buffer to reduce the data read latency or improve the read bandwidth of an on-chip data cache. We call the small read-only buffer a fetch cache. The fetch cache attempts to capture the immediate spatial locality of the data read reference stream by utilizing the large number of bits that can be fetched in a single access of an on-chip cache. There are two ways a processor can issue multiple instructions per cache access: the cache access can require multiple cycles (i.e. superpipelined), or multiple instructions are issued per cycle (i.e. superscalar). In the first section, we show the use of fetch caches with multi-cycle per access data caches. When there is a read hit in the fetch cache, the read request can be serviced in one cycle, otherwise the latency is that of the primary data cache. For a four line, 16 byte wide fetch cache, the hit rate ranged from 40 to 60 percent depending on the application. In the second part, we show the use of fetch caches when multi-accesses per cycle are requested. When there is a read hit in the fetch cache, a read can be satisfied by the fetch cache, while the primary cache performs another read or write request. For a four line, 16 byte wide fetch cache, the cache bandwidth increased by 20 to 30 percent depending on the application.
CSL-TR-93-561

Report Number: CSL-TR-93-562
Institution: Stanford University, Computer Systems Laboratory
Title: An Efficient Top-Down Parsing Algorithm for General Context-Free Grammars
Author: Sankar, Sriram
Date: February 1993
Abstract: This report describes a new algorithm for top-down parsing of general context-free grammars. The algorithm does not require any changes to be made to the grammar, and can parse with respect to any grammar non-terminal as the start symbol. It is possible to generate all possible parse trees of the input string in the presence of ambiguous grammars. The algorithm reduces to recursive descent parsing on LL grammars. This algorithm is ideal for use in software development environments which include tools such as syntax-directed editors and incremental parsers, where the language syntax is an integral part of the user-interface. General context-free grammars can describe the language syntax more intuitively than, for example, LALR(1) grammars. This algorithm is also applicable to batch-oriented language processors, especially during the development of new languages, where frequent changes are made to the language syntax and new prototype parsers need to be developed quickly. A prototype implementation of a parser generator that generates parsers based on this algorithm has been built. Parsing speeds of around 1000 lines per second have been achieved on a Sun SparcStation 2. This demonstrated performance is more than adequate for syntax-directed editors and incremental parsers, and in most cases, is perfectly acceptable for batch-oriented language processors.
CSL-TR-93-562

Report Number: CSL-TR-93-564
Institution: Stanford University, Computer Systems Laboratory
Title: Case Study in Prototyping With Rapide: Shared Memory Multiprocessor System
Author: Santoro, Alexandre
Date: March 1993
Abstract: Rapide is a concurrent object-oriented language designed for prototyping distributed systems. This paper describes the creation of such a prototype, more specifically a shared memory multiprocessor system. The design is presented in an evolutionary manner, starting with a simple CPU + memory model. The paper also presents some simulation results and shows how the partially ordered event sets that Rapide produces can be used both for performance analysis and for an in-depth understanding of the model's behavior.
CSL-TR-93-564

Report Number: CSL-TR-93-566
Institution: Stanford University, Computer Systems Laboratory
Title: Software Testing Using Algebraic Specification Based Test Oracles
Author: Sankar, Sriram
Author: Goyal, Anoop
Author: Sikchi, Prakash
Date: April 1993
Abstract: In TAV4, the first author presented a paper describing an algorithm to perform run-time consistency checking of abstract data types specified using algebraic specifications. This algorithm has subsequently been incorporated into a run-time consistency checking tool for the Anna specification language for Ada, and works on a subset of all possible algebraic specifications. The algorithm implementation can be considered a test oracle for algebraic specifications that performs its activities while the formally specified program is running. This paper presents empirical results on the use of this test oracle on a real-life symbol table implementation. Various issues that arise due to the use of algebraic specifications and the test oracle are discussed. 50 different errors were introduced into the symbol table implementation. On testing using the oracle, 60% of the errors were detected by the oracle, 35% of the errors caused Ada exceptions to be raised, and the remaining 5% went undetected. These results are remarkable, especially since the test input was simply one sequence of symbol table operations performed by a typical client. The cases that went undetected contained errors that required very specific boundary conditions to be met --- an indication that white box test-data generation techniques may be required to detect them. Hence, a combination of white-box test-data generation along with a specification based test oracle may be an extremely versatile combination in detecting errors. This paper does not address test-data generation, rather it illustrates the usefulness of algebraic specification based test oracles during run-time consistency checking. Run-time consistency checking should be considered a complementary approach to unit testing using generated test-data.
CSL-TR-93-566

Report Number: CSL-TR-93-570
Institution: Stanford University, Computer Systems Laboratory
Title: Frequency Domain Volume Rendering
Author: Totsuka, Takashi
Author: Levoy, Marc
Date: April 1993
Abstract: The Fourier projection-slice theorem allos projections of volume data to be generated in O(nsquare log n) time for a volumbe of size ncube. The method operates by extracting and inverse Fourier transforming 2D slices from a 3D frequency domain representation of the volume. Unfortunately, these projections do not exhibit the occlusion that is characteristic of conventional volume renderings. We present a new frequency domain volume rendering algorithm that replaces much of the missing depth and shape cues by performing shading calculations in the frequency domain during slice extraction. In particular, we demonstrate frequency domain methods for computing linear or nonlinear depth cueing and directional diffuse reflection. The resulting images can be generated an order of magnitude faster than volume renderings and may be more useful for many applications.
CSL-TR-93-570

Report Number: CSL-TR-93-573
Institution: Stanford University, Computer Systems Laboratory
Title: Performance of a Three-Stage Banyan-Based Architecture with Input and Output Buffers for Large Fast Packet Switches
Author: Chiussi, Fabio M.
Author: Tobagi, Fouad A.
Date: June 1993
Abstract: Fast packet switching, also referred to as Asynchronous Transfer Mode (ATM), has emerged as the most appropriate switching technique to handle the high data rates and the wide diversity of traffic requirements envisioned in Broadband Integrated Services Digital Networks (B-ISDN). ATM switches capable of meeting the challenges posed by a successful deployment of B-ISDN must be designed and implemented. Such switches should be nonblocking and capable of handling the highly-bursty traffic conditions that future anticipated applications will generate; they should be scalable to the large sizes expected when B-ISDN becomes widely deployed; accordingly, their complexity should be as low as possible; they should be simple to operate; namely, their architecture should facilitate the determination of whether or not a call can be accepted, and the assignment of a route to a call. In this paper, we describe an architecture, referred to as the Memory/Space/Memory switching fabric, which meets these challenges. It combines input and output shared-memory buffer components with space-division banyan networks, making it possible to build a switch with several hundred I/O ports. The MSM achieves output buffering, thus performing very well under a wide variety of traffic conditions, and is self-routing, thus adapting easily to different traffic mixes. Under bursty traffic, by implementing a backpressure mechanism to control the packet flow from input to output queues, and by properly managing the buffers, we can increase the average buffer occupancy; in this way, we can achieve important reductions in total buffer requirements with respect to output-buffer switches (e.g., up to 70% reduction with bursts of average length equal to 100 packets), use input and output buffers of equal sizes, and achieve sublinear increase of the buffer requirements with the burst length.
CSL-TR-93-573

Report Number: CSL-TR-93-577
Institution: Stanford University, Computer Systems Laboratory
Title: Implementation of a Three-Stage Banyan-Based Atchitecture with Input and Output Buffers for Large Fast Packet Switches
Author: Chiussi, Fabio M.
Author: Tobagi, Fouad A.
Date: June 1993
Abstract: Fast packet switching, also referred to as Asynchronous Transfer Mode (ATM), has emerged as the most appropriate switching technique for future Broadband Integrated Services Digital Networks (B-ISDN). A three-stage banyan-based switch architecture with input and output buffers has been recently described [Chi93]. Such architecture, also referred to as the Memory/Space/Memory (MSM) switching fabric, is capable of meeting the challenges posed by a successful deployment of B-ISDN; namely, it is made nonblocking with low complexity, and is scalable to large sizes (>1000 input/output ports); it supports a wide diversity of traffic patterns, including highly-bursty traffic; it maintains packet sequence, is self-routing, and is simple to operate.
CSL-TR-93-577

Report Number: CSL-TR-93-579
Institution: Stanford University, Computer Systems Laboratory
Title: Comparative Studies of Pipelined Circuits
Author: Klass, Fabian
Author: Flynn, Michael J.
Date: July 1993
Abstract: Wave pipelining is an attractive technique used in high-speed computer systems to speed-up pipeline rate without partitioning a system into pipeline stages. Although recent implemetations have reported very high-speed operation rates, a real evaluation of the advantages and disadvantages of wave pipelining requires a comparative study with other techniques, in particular the understanding of the trade-offs between conventional and wave pipelining is very important. This study is an attempt to provide approximate models which can be used as first-order tools for comparative study or sensitivity analysis of conventional and wave pipelined systems with different overheads. The models presented here are for subsystem-level pipelines. The product Latency x Cycle-Time is used as a measure of performance and is evaluated as a function of all the parameters of a design, such as the propagation delay of the combinational logic, the data skew resulting from the difference between maximum and minimum propagation delays through various logic paths, rise and fall time, the setup time, hold time, and propagation delay through registers, and the uncontrollable clock skew. In this way, an analytical basis is provided for a comparison between different approaches and for optimizations.
CSL-TR-93-579

Report Number: CSL-TR-93-580
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Technology Mapping for Generalized Fundamental-Mode Asynchronous Designs
Author: Siegel, Polly
Author: DeMicheli, Giovanni
Author: Dill, David
Date: June 1993
Abstract: The generalized fundamental-mode asynchronous design style is one in which the combinational portions of the circuit design are separated from the storage elements, as with synchronous design styles. Synchronous technology mapping techniques can be adapted to work for this asynchronous design style if hazards are taken into account. First, we examine each step of algorithmic technology mapping for its influence on the hazard behavior of the modified network. We then present modifications to an existing synchronous technology mapper to work for this asynchronous design style. We present efficient algorithms for hazard analysis that are used during the mapping process. These algorithms have been implemented and incorporated into the program CERES to produce a technology mapper suitable for asynchronous designs.
CSL-TR-93-580

Report Number: CSL-TR-93-584
Institution: Stanford University, Computer Systems Laboratory
Title: Optimization of Combinational Logic Circuits Based on Compatible Gates
Author: Damiani, Maurizio
Author: Yang, Jerry Chih-Yuan
Author: DeMicheli, Giovanni
Date: June 1993
Abstract: This paper presents a set of new techniques for the optimization of multiple-level combinational Boolean networks. We describe first a technique based upon the selection of appropriate "multiple-output" subnetworks (consisting of so-called "compatible gates" whose local functions can be optimized simultaneously. We then generalize the method to larger and more arbitrary subsets of gates. Because simultaneous optimization of local functions can take place, our methods are more powerful and general than Boolean optimization methods using "don't cares", where only single-gate optimization can be performed. In addition, our methods represent a more efficient alternative to optimization procedures based on Boolean relations because the problem can be modeled by a "unate" covering problem instead of the more difficult "binate" covering problem. The method is implemented in program ACHILLES and compares favorably to SIS .
CSL-TR-93-584

Report Number: CSL-TR-93-585
Institution: Stanford University, Computer Systems Laboratory
Title: A Rapide-1.0 Definition of the ADAGE Avionics System
Author: Mann, Walter
Author: Belz, Frank C.
Author: Cornell, Paul
Date: September 1993
Abstract: We have used the Rapide prototyping-languages, developed by Stanford and TRW under the ARPA ProtoTech Program, in a series of exercises to model an early version of IBM's ADAGE software architecture for helicopter avionics systems. These exercises, conducted under the ARPA Domain Specific Software Architectures (DSSA) Program, also assisted the evolution of the Rapide languages. The resulting Rapide-1.0 model of the ADAGE architecture in this paper is substantially more succinct and illuminating than the original models, developed in Rapide-0.2 and Preliminary Rapide-1.0. All Rapide versions include these key features: interfaces, by which types of components and their possible interactions with other components are defined; actions, by which the events that can be observed or generated by such components are defined; and pattern-based constraints, which define properties of the computation of interacting components in terms of partially ordered sets of events. Key features of Rapide-1.0 include services, which abstract whole communication patterns between components; behavior rules, which provide a state-transition oriented specification of component behavior and from which computation component instances can be synthesized; and architectures, which describe implementations of components with a particular interface, by showing a composition of subordinate components and their interconnections. The Rapide-1.0 model is illustrated with corresponding diagrammatic representations.
CSL-TR-93-585

Report Number: CSL-TR-93-588
Institution: Stanford University, Computer Systems Laboratory
Title: Update-Based Cache Coherence Protocols for Scalable Shared-Memory Multiprocessors
Author: Glasco, David B.
Author: Delagi, Bruce A.
Author: Flynn, Michael J.
Date: November 1993
Abstract: In this paper, two hardware-controlled update-based cache coherence protocols are presented. The paper discusses the two major disadvantages of the update protocols: inefficiency of updates and the mismatch between the granularity of synchronization and the data transfer. The paper presents two enhancements to the update-based protocols, a write combining scheme and a finer grain synchronization, to overcome these disadvantages. The results demonstrate the effectiveness of these enhancements that, when used together, allow the update-based protocols to significantly improve the execution time of a set of scientific applications when compared to three invalidate-based protocols.
CSL-TR-93-588

Report Number: CSL-TR-93-590
Institution: Stanford University, Computer Systems Laboratory
Title: The Effect of Fault Dropping on Fault Simulation Time
Author: Pan, Rong
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: November 1993
Abstract: The effect of fault dropping on fault simulation time is studied in this paper. An experiment was performed in which fault simulation times, with and without fault dropping, were measured for three different simulators. A speedup approximately between 8 and 50 for random test sets and between 1.5 and 9 for deterministic test sets was observed. The results give some indication about how much fault dropping speeds up fault simulation. These results also show the overhead of an application requiring a complete fault dictionary.
CSL-TR-93-590

Report Number: CSL-TR-93-591
Institution: Stanford University, Computer Systems Laboratory
Title: Logic Synthesis for Concurrent Error Detection
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: November 1993
Abstract: The structure of a circuit determines how the effects of a fault can propagate and hence affects the cost of concurrent error detection. By considering circuit structure during logic optimization, the overall cost of a concurrently checked circuit can be minimized. This report presents a new technique called structure-constrained logic optimization (SCLO) that optimizes a circuit under the constraint that faults in the resulting circuit can produce only a prescribed set of errors. Using SCLO, circuits can be optimized for various concurrent error detection schemes allowing the overall cost for each scheme to be compared. A technique for quickly estimating the size of a circuit under different structural constraints is described. This technique enables rapid exploration of the design space for concurrently checked circuits. A new method for the automated synthesis of self-checking circuit implementations for arbitrary combinational circuits is also presented. It consists of an algorithm that determines the best parity-check code for encoding the output of a given circuit, and then uses SCLO to produce the functional circuit which is augmented with a checker to form a self-checking circuit. This synthesis method provides fully automated design, explores a larger design space than other methods, and uses simple checkers. It has been implemented by making modifications to SIS (an updated version of MIS [Brayton 87a]), and results for several MCNC combinational benchmark circuits are given. In most cases, a substantial reduction in overhead compared to a duplicate-and-compare implementation is achieved.
CSL-TR-93-591

Report Number: CSL-TR-93-593
Institution: Stanford University, Computer Systems Laboratory
Title: The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors
Author: Woo, Steven Cameron
Author: Singh, Jaswinder Pal
Author: Hennessy, John L.
Date: November 1993
Abstract: We examine the performance benefits of integrating a mechanism for block data transfer (message passing) in a cache-coherent shared address space multiprocessor. We do this through a detailed study of five important computations that appear to be likely candidates for block transfer. We find that while the benefits on a realistic architecture are significant in some cases, they are not as substantial as one might initially expect. The main reasons for this are (i) the relatively modest fraction of time that applications spend in communication that is amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains even after block transfer, and (iii) the fact that long cache lines often capture many of the benefits of block transfer. Of the three primary advantages of block transfer, fast pipelined data transfer appears to be the most successful, followed by the ability to overlap computation and communication at a coarse granularity, and finally the benefits of replicating communicated data in main memory. We also examine the impact of varying important network parameters and processor speed on the relative effectiveness of block transfer, and comment on useful features that a block transfer engine should support for real applications.
CSL-TR-93-593

Report Number: CSL-TR-93-596
Institution: Stanford University, Computer Systems Laboratory
Title: Models of Communication Latency in Shared Memory Multiprocessors
Author: Byrd, Gregory T.
Date: December 1993
Abstract: We evaluate various mechanisms for data communication in large-scale shared memory multiprocessors. Data communication involves both data transmission and synchronization, resulting in the transfer of data between computational threads. We use simple analytical models to evaluate the communication latency for each of the mechanisms. The models show that efficient and opportunistic synchronization is the most important determinant of latency, followed by efficient transmission. Producer-initiated mechanisms, in which data is sent by its producer as it is produced, generally achieve lower latencies than consumer-initated mechanisms, in which data is retrieved as and when it is needed.
CSL-TR-93-596

Report Number: CSL-TR-94-599
Institution: Stanford University, Department of Computer Science
Title: The Design and Implementation of a High-Performance Floating-Point Divider
Author: Oberman, Stuart
Author: Quach, Nhon
Author: Flynn, Michael J.
Date: January 1994
Abstract: The increasing computation requirements of modern computer applications have stimulated a large interest in developing extremely high-performance floating- point dividers. A variety of division algorithms are available, with SRT being utilized in many computer systems.A careful analysis of SRT divider topologies has demonstrated that a relatively simple divider designed in anaggressive circuit style can achieve extremely high performance. Further, an aggressive circuit implementation can minimize many of the performance advantages of more complex divider algorithms. This paper presents the tradeoffs of the different divider topologies, the design of the divider, and performance results.
CSL-TR-94-599

Report Number: CSL-TR-94-600
Institution: Stanford University, Computer Systems Laboratory
Title: Environmental Limits on the Performance of CMOS Wave-Pipelined Circuits
Author: Nowka, Kevin J.
Author: Flynn, Michael J.
Date: January 1994
Abstract: Wave-pipelining is a circuit design technique which allows digital synchronous systems to be clocked at rates higher than can be achieved with conventional pipelining techniques. Wave-pipelining has been successfully applied to the design of SSI processor functional units, a Bipolar Population Counter, a CMOS adder, CMOS multipliers, and several simple CMOS circuits. For controlled operating environments, speed-ups of 2 to 10 have been reported for these designs. This report details the effects of temperature variation, supply voltage variation, and process variation on wave-pipelined static CMOS designs, derives limits for the performance of wave-pipelined circuits due to these variations, and compares the performance effects with those of traditional pipelined circuits. This study finds that wave-pipelined circuits designed for commercial operating environments are limited to 2 to 3 waves per pipeline stage when clocked from a fixed frequency source. Variable rate, internal clocking can approach the theoretical limit of waves at a cost of interface complexity.
CSL-TR-94-600

Report Number: CSL-TR-94-601
Institution: Stanford University, Computer Systems Laboratory.
Title: Efficient Scheduling on Multiprogrammed Shared-Memory Multiprocessors
Author: Tucker, Andrew
Date: March 1994
Abstract: Shared-memory multiprocessors are often used as compute servers, with multiple users running applications in a multiprogrammed style. On such systems, naive time-sharing scheduling policies can result in poor performance for parallel applications. Most parallel applications are written with the model of a stable computing environment, where applications are running uninterrupted on a fixed number of processors. On a time-sharing system, processes are interrupted periodically and the number of processors running an application continually varies. The result is an decrease in performance for a number of reasons, including processes being obliviously preempted inside critical sections and cached data being replaced by intervening processes. This thesis explores using more sophisticated scheduling systems to avoid these problems. Robust implementations of previously proposed approaches involving cache affinity scheduling and gang scheduling are developed and evaluated. It then presents the design, implementation, and performance of process control, a novel scheduling approach using explicit cooperation between the application and kernel to minimize context switching. Performance results from a suite of workloads containing both serial and parallel applications, run on a 4-processor Silicon Graphics workstation, confirm the effectiveness of the process control approach.
CSL-TR-94-601

Report Number: CSL-TR-94-602
Institution: Stanford University, Computer Systems Laboratory
Title: Analyzing and Tuning Memory Performance in Sequential and Parallel Programs
Author: Martonosi, Margaret Rose
Date: January 1994
Abstract: Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. When cache misses are common, memory stalls can significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and effectively organize the data collected. These techniques help guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system. This thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data oriented statistics, a novel approach that associates program performance information with application data structures. Data oriented statistics, viewed alone or paired with traditional code oriented statistics, offer a powerful, new dimension for performance analysis. I develop techniques for aggregating statistics on similarly-used data structures and for extracting intuitive source-code names for statistics. The thesis also argues that MemSpy's detailed statistics on the frequency and causes of cache misses are crucial in understanding memory bottlenecks. Common memory performance bugs are often most easily distinguished by noting the causes of their resulting cache misses. Since collecting such detailed information seems, at first glance, to require large execution time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions out of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, having used MemSpy to tune several applications, these experiences demonstrate that MemSpy generates effective memory performance profiles, at speeds competitive with previous, less detailed approaches.
CSL-TR-94-602

Report Number: CSL-TR-94-604
Institution: Stanford University, Department of Computer Science
Title: Integrating multiple communication paradigms in high performance multiprocessors
Author: Heinlein, John
Author: Gharachorloo, Kourosh
Author: Gupta, Anoop
Date: February 1994
Abstract: In the design of FLASH, the successor to the Stanford DASH multiprocessor, we are exploring architectural mechanisms for efficiently supporting both the shared memory and message passing communication models in a single system. The unique feature in the FLASH (FLexible Architecture for SHared memory) system is the use of a programmable controller at each node that replaces the functionality of hardwired cache coherence state machines in systems like DASH. The base coherence protocol is supported by executing appropriate software handlers on the programmable controller to service memory and coherence operations. The same programmable controller is also used to support message passing. This approach is attractive because of the flexibility software provides for implementing different coherence and message passing protocols, and because of the simplification in system design and debugging that arises from the shift of complexity from hardware to software. This paper focuses on the use of the programmable controller to support message passing. Our goal is to provide message passing performance that is comparable to an aggressive hardware implementation dedicated to this task. In FLASH, message data is transferred as a sequence of cache line sized units, thus exploiting the datapath support already present for cache coherence. In addition, we avoid costly interrupts to the main processor by having the programmable engine handle the control for message transfers. Furthermore, in contrast to most earlier work, we provide an integrated solution that handles the interaction of message data with virtual memory, protected multiprogramming, and cache coherence. Our preliminary performance studies indicate that this system can sustain message transfers at a rate of several hundred megabytes per second, efficiently utilizing the available network bandwidth.
CSL-TR-94-604

Report Number: CSL-TR-94-605
Institution: Stanford University, Computer Systems Laboratory
Title: Performance and Area Analysis of Processor Configurations with Scaling of Technology
Author: Fu, Steve
Author: Flynn, Michael J.
Date: March 1994
Abstract: The increasing density of transistors on integrated circuits and the increasing sensitivity toward costs have stimulated interest in developing techniques for relating transistor count to performance. This paper maps different processor configuration to transistor level area models and proposes an optimum evolution path of processor design as minimum feature size of technology is scaled. A parameter for measuring incremental performance improvement with respect to increasing transistor count is proposed.
CSL-TR-94-605

Report Number: CSL-TR-94-607
Institution: Stanford University, Computer Systems Laboratory
Title: Spreadsheets for Images
Author: Levoy, Marc
Date: February 1994
Abstract: We describe a data visualization system based on spreadsheets. Cells in our spreadsheet contain graphical objects such as images, volumes, or movies. Cells may also contain graphical widgets such as buttons, sliders, or movie viewers. Objects are displayed in miniature inside each cell. Formulas for cells are written in a programming language that includes operators for array manipulation, image processing, and rendering. Formulas may also contain control structures, procedure calls, and assignment operators with side effects. Compared to flow chart visualization systems, spreadsheets are more expressive, more scalable, and easier to program. Compared to numerical spreadsheets, spreadsheets for images pose several unique design problems: larger formulas, longer computation times, and more complicated intercell dependencies. We describe an implementation based on the Tcl programming language and the Tk widget set, and we discuss our solutions to t