Stanford Computer Systems Laboratory Technical Reports from the 1990s

The authors have given permission to make their technical reports available on this server. If a report was published in print and is not here, it may be that the author published it elsewhere.

Report Number: CSL-TN-99-1 (the only TN in the CSL series)
Institution: Stanford University, Computer Systems Laboratory
Title: A Graph-Oriented Model for Articulation of Ontology Interdependencies
Author: Mitra, Prasenjit
Author: Wiederhold, Gio
Author: Kersten, Martin L.
Date: August 1999
Abstract: Ontologies are knowledge structures that explicate the contents, essential properties, and relationships between terms in a knowledge source. Many sources are now accessible with associated ontologies. Most prior work on the use of ontologies relies on the construction of a single global ontology covering all sources. Such an approach is neither scalable nor maintainable, especially when the sources change frequently. We propose a scalable and easily maintainable approach based on the interoperation of ontologies. To handle user queries that cross the boundaries of the underlying information systems, the interoperation between the ontologies must be precisely defined. Our approach is to use rules that cross the semantic gap by creating an articulation, or linkage, between the systems. The rules are generated using a semi-automatic articulation tool with the help of a domain expert. To make the ontologies amenable to automatic composition based on the accumulated knowledge rules, we represent them using a graph-oriented model extended with a small set of algebraic operators. ONION, a user-friendly toolkit, aids the experts in bridging the semantic gap in real-life settings. Our framework provides a sound foundation that simplifies the work of domain experts, enables integration with public semantic dictionaries such as WordNet, and will derive ODMG-compliant mediators automatically.
CSL-TN-99-1
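
A minimal sketch of the graph-plus-articulation idea described above, not the ONION toolkit itself: two ontologies are kept as separate graphs and linked by articulation rules rather than merged into one global ontology. All terms, rules, and function names below are invented for illustration.

    # Each ontology is a separate graph: term -> set of (relation, term) edges.
    vehicles = {"Vehicle": {("subclass", "Car"), ("subclass", "Truck")},
                "Car": set(), "Truck": set()}
    transport = {"Transport": {("subclass", "Automobile")}, "Automobile": set()}

    # Articulation rules: expert-confirmed semantic bridges between ontologies.
    articulation = [("Car", "equivalent", "Automobile")]

    def cross_ontology_neighbors(term):
        """Follow articulation edges so queries can cross source boundaries."""
        for src, rel, dst in articulation:
            if term == src:
                yield rel, dst
            elif term == dst:
                yield rel, src

    print(list(cross_ontology_neighbors("Car")))  # [('equivalent', 'Automobile')]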

Report Number: CSL-TR-90-410
Institution: Stanford University, Computer Systems Laboratory
Title: Tango introduction and tutorial
Author: Goldschmidt, Stephen R.
Author: Davis, Helen
Date: January 1990
Abstract: Tango is a software-based multiprocessor simulator that can generate traces of synchronization events and data references. The system runs on a uniprocessor and provides a simulated multiprocessor environment. The user code is augmented during compilation to produce a compiled simulation system with optional logging. Tango offers flexible and accurate tracing by allowing the user to incorporate various memory and synchronization models. Tango achieves high efficiency by running compiled user code, by focusing on information that is of specific interest to multiprocessing studies and by allowing the user to select the most efficient memory simulation that is appropriate for a set of experiments.
CSL-TR-90-410

Report Number: CSL-TR-90-411
Institution: Stanford University, Computer Systems Laboratory
Title: Branch strategies: modeling and optimization
Author: Dubey, Pradeep K.
Author: Flynn, Michael J.
Date: February 1990
Abstract: Instruction dependency introduced by conditional branch instructions, which is resolved only at run time, can have a severe performance impact on pipelined machines. A variety of strategies are in wide use to minimize this impact. Additional instruction traffic generated by these branch strategies can also have an adverse effect on system performance. Therefore, in addition to the reduction a branch prediction strategy offers in average branch delay, the resulting excess i-traffic can be an important parameter in evaluating its overall effectiveness. The objective of this paper is twofold: to develop a model for different approaches to the branch problem and to help select an optimal strategy after taking into account the additional i-traffic generated by i-buffering. The model presented provides a flexible tool for comparing different branch strategies, both in terms of the reduction each offers in average branch delay and in terms of the associated cost of wasted instruction fetches. This additional criterion turns out to be a valuable consideration in choosing between two almost equally performing strategies. More importantly, it provides a better insight into the expected overall system performance. Simple compiler-support-based low-implementation-cost strategies can be very effective under certain conditions. An active branch prediction scheme based on a loop buffer can be as competitive as a branch-target-buffer based strategy.
CSL-TR-90-411
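
The two-criterion evaluation the abstract describes, average branch delay plus excess instruction traffic, can be sketched with a back-of-the-envelope model. The formula and every parameter value below are illustrative assumptions, not the report's actual model.

    def branch_cost(accuracy, miss_penalty, wasted_fetches_per_miss):
        """Average delay per branch and excess i-traffic per branch."""
        avg_delay = (1 - accuracy) * miss_penalty          # correct predictions are free here
        excess_itraffic = (1 - accuracy) * wasted_fetches_per_miss
        return avg_delay, excess_itraffic

    # Two hypothetical strategies with identical average delay but different
    # i-traffic: the situation where the second criterion is the tie-breaker.
    print(branch_cost(accuracy=0.85, miss_penalty=3.0, wasted_fetches_per_miss=4))
    print(branch_cost(accuracy=0.82, miss_penalty=2.5, wasted_fetches_per_miss=2))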

Report Number: CSL-TR-90-413
Institution: Stanford University, Computer Systems Laboratory
Title: An area-utility model for on-chip memories and its application
Author: Mulder, Johannes M.
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: February 1990
Abstract: Utility can be defined as quality per unit of cost. The utility of a particular function in a microprocessor can be defined as its contribution to the overall processor performance per unit of implementation cost. In the case of on-chip data memory (e.g., registers, caches) the performance contribution can be reduced to its effectiveness in reducing memory traffic or in reducing the average time to fetch operands. An important cost measure for on-chip memory is occupied area. On-chip memory performance, however, is expressed much more easily as a function of size (the storage capacity) than as a function of area. Simple models have been proposed for mapping memory size to occupied area. These models, however, are of unproven validity and only apply when comparing relatively large buffers (≥ 128 words for caches, ≥ 32 words for register sets) of the same structure (e.g., cache versus cache). In this paper we present an area model for on-chip memories. The area model considers the supplied bandwidth of the individual memory cells and includes such overhead as control logic, driver logic, and tag storage, thereby permitting comparison of data buffers of different organizations and of arbitrary sizes. The model gave less than 10% error when verified against real caches and register files. Using this area-utility measure F(Performance, Area), we first investigated the performance of various cache organizations and then compared the performance of register buffers (e.g., register sets, multiple overlapping sets) and on-chip caches. Comparing cache performance as a function of area, rather than size, leads to a significantly different set of organizational tradeoffs. Caches occupy more area per bit than register buffers for sizes of 128 words or less. For data caches, line size is a primary determinant of performance for small sizes while write policy becomes the primary factor for larger caches. For the same area, multiple register sets have poorer performance than a single register set with cache except when the memory access time is very fast (under 3 processor cycles).
CSL-TR-90-413
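
A hedged sketch of the utility notion defined above, quality per unit of cost: performance contribution divided by occupied area. The access-time formula and all numbers below are placeholders; the report's F(Performance, Area) and its area model are far more detailed.

    def utility(miss_rate, miss_penalty_cycles, area_mm2):
        """Utility = performance contribution per unit of occupied area."""
        avg_access_cycles = 1 + miss_rate * miss_penalty_cycles
        performance = 1.0 / avg_access_cycles
        return performance / area_mm2

    # Ranking by area rather than by size can reorder design choices:
    print(utility(miss_rate=0.05, miss_penalty_cycles=10, area_mm2=4.0))  # small cache
    print(utility(miss_rate=0.02, miss_penalty_cycles=10, area_mm2=9.0))  # large cache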

Report Number: CSL-TR-90-415
Institution: Stanford University, Computer Systems Laboratory
Title: High-speed addition in CMOS
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: February 1990
Abstract: This paper describes a fully static Complementary Metal-Oxide Semiconductor (CMOS) implementation of a Ling-type adder. The implementation described herein saves up to one gate delay and always reduces the number of serial transistors in the worst-case (critical) path over the conventional carry look-ahead (CLA) approach with a negligible increase in hardware.
CSL-TR-90-415
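
One common textbook formulation of the Ling pseudo-carry can be checked in a few lines; no claim is made that it matches the report's CMOS circuit exactly. Ling replaces the carry recurrence c[i+1] = g[i] | (t[i] & c[i]) with a pseudo-carry h satisfying c[i+1] = t[i] & h[i+1], where h[i+1] = g[i] | (t[i-1] & h[i]), shortening the critical recurrence by one literal.

    import random

    def ling_add(a, b, width=16):
        g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # generate
        t = [((a >> i) | (b >> i)) & 1 for i in range(width)]   # transmit
        h = [0] * (width + 1)                                   # pseudo-carries
        c = [0] * (width + 1)                                   # real carries
        for i in range(width):
            prev = (t[i - 1] & h[i]) if i > 0 else 0
            h[i + 1] = g[i] | prev          # h[i+1] = g[i] | (t[i-1] & h[i])
            c[i + 1] = t[i] & h[i + 1]      # recover the real carry
        s = 0
        for i in range(width):
            s |= (((a >> i) ^ (b >> i) ^ c[i]) & 1) << i
        return s

    for _ in range(1000):
        x, y = random.randrange(1 << 15), random.randrange(1 << 15)
        assert ling_add(x, y) == x + y      # pseudo-carries reproduce addition
    print("ok")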

Report Number: CSL-TR-90-418
Institution: Stanford University, Computer Systems Laboratory
Title: Runtime Access to Type Information in C++
Author: Interrante, John A.
Author: Linton, Mark A.
Date: March 1990
Abstract: The C++ language currently does not provide a mechanism for an object to determine its type at runtime. We propose the Dossier class as a standard interface for accessing type information from within a C++ program. We have implemented a tool called mkdossier that automatically generates type information in a form that can be compiled and linked with an application. In the prototype implementation, a class must have a virtual function to access an object's dossier given the object. We propose this access be provided implicitly by the language through a predefined member in all classes.
CSL-TR-90-418

Report Number: CSL-TR-90-419
Institution: Stanford University, Computer Systems Laboratory
Title: HardwareC -- A Language for Hardware Design (Version 2.0)
Author: Ku, David
Author: DeMicheli, Giovanni
Date: April 1990
Abstract: High-level synthesis is the transformation from a behavioral-level specification of hardware, through a series of optimizations and translations, to an implementation in terms of logic gates and registers. The success of a high-level synthesis system is heavily dependent on how effectively the high-level language captures the ideas of the designer in a simple and understandable way. Furthermore, as system-level issues such as communication protocols and design partitioning dominate the design process, the ability to specify constraints on the timing requirements and resource utilization of a design is necessary to ensure that the design can integrate with the rest of the system. In this paper, a hardware description language called HardwareC is presented. HardwareC supports both declarative and procedural semantics, has a C-like syntax, and is extended with the notions of concurrent processes, message passing, timing constraints via tagging, resource constraints, explicit instantiation of models, and template models. The language is used as the input to the Hercules High-level Synthesis System.
CSL-TR-90-419

Report Number: CSL-TR-90-423
Institution: Stanford University, Computer Systems Laboratory
Title: Implementing a Directory-Based Cache Consistency Protocol
Author: Simoni, Richard
Date: March 1990
Abstract: Directory-based cache consistency protocols have the potential to allow shared-memory multiprocessors to scale to a large number of processors. While many variations of these coherence schemes exist in the literature, they have typically been described at a rather high level, making adequate evaluation difficult. This paper explores the implementation issues of directory-based coherency strategies by developing a design at the level of detail needed to write a memory system functional simulator with an accurate timing model. The paper presents the design of both an invalidation coherency protocol and the associated directory/memory hardware. Support is added to prevent deadlock, handle subtle consistency situations, and implement a proper programming model of multiprocess execution. Extensions are delineated for realizing a multiple-threaded directory that can continue to process commands while waiting for a reply from a cache. The final hardware design is evaluated in the context of the number of parts required for implementation.
CSL-TR-90-423

Report Number: CSL-TR-90-425
Institution: Stanford University, Computer Systems Laboratory
Title: Concurrent runtime monitoring of formally specified programs
Author: Mandal, Manas
Author: Sankar, Sriram
Date: April 1990
Abstract: This paper describes an application of formal specifications after an executable program has been constructed. We describe how high-level specifications can be utilized to monitor critical aspects of the behavior of a program continuously while it is executing. This methodology provides a capability to distribute the monitoring of specifications on multi-processor hardware platforms to meet practical time constraints. Typically, runtime checking of formal specifications involves a significant time penalty, which makes it impractical during normal production operation of a program. In previous research, runtime checking has been applied during testing and debugging of software, but not on a permanent basis. Crucial to our current methodology is the use of multi-processor machines; hence, runtime monitoring can be performed concurrently on different processors. We describe techniques for distributing checks onto different processors. To control the degree of concurrency, we introduce checkpoints: points in the program beyond which execution cannot proceed until the specified checks have been completed. Error reporting and recovery in a multi-processor environment are complicated, and there are various techniques for handling them. We describe a few of these techniques in this paper. An implementation of this methodology for the Anna specification language for Ada programs is described. Results of experiments conducted on this implementation using a 12-processor Sequent Symmetry demonstrate that permanent concurrent monitoring of programs based on formal specifications is indeed feasible.
CSL-TR-90-425

Report Number: CSL-TR-90-426
Institution: Stanford University, Computer Systems Laboratory
Title: A VLSI architecture for the FCHC isometric lattice gas model
Author: Lee, Fung F.
Author: Flynn, Michael J.
Author: Morf, Martin
Date: April 1990
Abstract: Lattice gas models are cellular automata used for the simulation of fluid dynamics. This paper addresses the design issues of a lattice gas collision rule processor for the four-dimensional FCHC isometric lattice gas model. A novel VLSI architecture based on an optimized version of Henon's isometric algorithm is proposed. One of the key concepts behind this architecture is the permutation group representation of the isometry group of the lattice. In contrast to the straightforward table lookup approach which would take 4.5 billion bits to implement this set of collision rules, the size of our processor is only about 5000 gates. With a reasonable number of pipeline stages, the processor can deliver one result per cycle with a cycle time comparable to or less than that of a common commercial DRAM.
CSL-TR-90-426

Report Number: CSL-TR-90-428
Institution: Stanford University, Computer Systems Laboratory
Title: Sub-nanosecond arithmetic
Author: Flynn, Michael J.
Author: DeMicheli, Giovanni
Author: Dutton, Robert
Author: Wooley, Bruce
Author: Pease, R. Fabian
Date: May 1990
Abstract: The SNAP (Stanford Nanosecond Arithmetic Processor) project is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realizes a particular arithmetic function.
CSL-TR-90-428

Report Number: CSL-TR-90-431
Institution: Stanford University, Computer Systems Laboratory
Title: Latency and throughput tradeoffs in self-timed speed-independent pipelines and rings
Author: Williams, Ted
Date: August 1990
Abstract: Asynchronous pipelines control the flow of tokens through a sequence of logical stages based on the status of local completion detectors. As in a synchronously clocked circuit, the design of self-timed pipelines can trade off between achieving low latency and high throughput. However, there are more degrees of freedom because of variations in latch and function-block styles and the possibility of varying both the number of latches between function blocks and their connections to the completion detectors. This report demonstrates the utility of a graph-based methodology for analyzing the timing dependencies and uses it to make comparisons of different configurations. It is shown that the configurations for highest throughput and lowest latency differ significantly, that the placement of the completion detectors influences timing as much as adding an additional latch, and that the choice between precharged and static logic depends on the complexity cost of the completion detectors.
CSL-TR-90-431

Report Number: CSL-TR-90-436
Institution: Stanford University, Computer Systems Laboratory
Title: Application of formal specification to software maintenance
Author: Madhav, Neel
Author: Sankar, Sriram
Date: August 1990
Abstract: This paper describes the use of formal specifications and associated tools in addressing various aspects of software maintenance --- corrective, perfective, and adaptive. It also addresses the refinement of the software development process to build programs that are easily maintainable. In our case, the task of software maintenance includes maintaining the specification as well as maintaining the program. We focus on the use of Anna, a specification language for formally specifying Ada programs, to aid us in maintaining Ada programs. These techniques are applicable to most other specification-language and programming-language environments. The tools of interest are: (1) the Anna Specification Analyzer, which allows us to analyze the specification for correctness with respect to our informal understanding of program behavior; and (2) the Anna Consistency Checking System, which monitors the Ada program at runtime based on the Anna specification.
CSL-TR-90-436

Report Number: CSL-TR-90-438
Institution: Stanford University, Computer Systems Laboratory
Title: A Methodology for Formal Specification and Implementation of Ada Packages Using Anna
Author: Madhav, Neel
Author: Mann, Walter
Date: August 1990
Abstract: This paper presents a methodology for formal specification and prototype implementation of Ada packages using the Anna specification language. Specifications play an important role in the software development cycle. The methodology allows specifiers of Ada packages to follow a sequence of simple steps to formally specify packages. Given the formal specification of a package resulting from the methodology for package specifications, the methodology allows implementors of packages to follow a few simple steps to implement the package. The implementation is meant to be a prototype. This methodology for specification and implementation is applicable to most Ada packages. Limitations of this approach are pointed out at various points in the paper. We present software tools which help the process of specification and implementation.
CSL-TR-90-438

Report Number: CSL-TR-90-439
Institution: Stanford University, Computer Systems Laboratory
Title: Tango: A Multiprocessor Simulation and Tracing System
Author: Davis, Helen
Author: Goldschmidt, Stephen R.
Date: July 1990
Abstract: Tango is a software simulation and tracing system used to obtain data for evaluating parallel programs and multiprocessor systems. The system provides a simulated multiprocessor environment by multiplexing application processes onto a single processor. Tango achieves high efficiency by running compiled user code, and by focusing on the information of greatest interest to multiprocessing studies. The system is being applied to a wide range of investigations, including algorithm studies and a variety of hardware evaluations.
CSL-TR-90-439

Report Number: CSL-TR-90-441
Institution: Stanford University, Computer Systems Laboratory
Title: Computing Types During Program Specialization
Author: Weise, Daniel
Author: Ruf, Erik
Date: October 1990
Abstract: We have developed techniques for obtaining and using type information during program specialization (partial evaluation). Computed along with every residual expression and every specialized program is type information that bounds the possible values that the specialized program will compute at run time. The three keystones of this research are symbolic values that represent both a value and the code for creating the value, generalization of symbolic values, and the use of online fixed-point iterations for computing the type of values returned by specialized recursive functions. The specializer exploits type information to increase the efficiency of specialized functions. This research has two benefits, one anticipated and one unanticipated. The anticipated benefit is that programs that are to be specialized can now be written in a more natural style without losing accuracy during specialization. The unanticipated benefit is the creation of what we term concrete abstract interpretation. This is a method of performing abstract interpretation with concrete values where possible. The specializer abstracts values as needed, instead of requiring that all values be abstracted prior to abstract interpretation.
CSL-TR-90-441

Report Number: CSL-TR-90-442
Institution: Stanford University, Computer Systems Laboratory
Title: An improved algorithm for high-speed floating-point addition
Author: Quach, Nhon T.
Author: Flynn, Michael J.
Date: August 1990
Abstract: This paper describes an improved, IEEE conforming floating-point addition algorithm. This algorithm has only one addition step involving the significand in the worst-case path, hence offering a considerable speed advantage over the existing algorithms, which typically require two to three addition steps.
CSL-TR-90-442

Report Number: CSL-TR-90-443
Institution: Stanford University, Computer Systems Laboratory
Title: A queuing analysis for disk array systems
Author: Ogata, Mikito
Author: Flynn, Michael J.
Date: August 1990
Abstract: Using a queuing model of disk arrays, we study the performance and tradeoffs in disk array sub-systems and develop guidelines for designing these sub-systems in various CPU environments. Finally, we compare our model with some earlier simulation results.
CSL-TR-90-443
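
As a hedged illustration of the kind of tradeoff such a queueing model captures, the sketch below treats each disk of an n-wide array as an M/M/1 server: striping divides the per-disk load but adds a small per-request overhead. The report's model is more elaborate; every parameter here is invented.

    def mean_response_time(arrival_rate, service_time, n_disks, overhead=0.002):
        """Mean response time per request, modeling each disk as M/M/1."""
        per_disk_rate = arrival_rate / n_disks       # striping spreads the load
        s = service_time + overhead                  # striping adds bookkeeping
        utilization = per_disk_rate * s
        assert utilization < 1, "disk saturated"
        return s / (1 - utilization)                 # classic M/M/1 result

    for n in (1, 2, 4, 8):
        print(n, "disks:", round(mean_response_time(80.0, 0.010, n), 4))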

Report Number: CSL-TR-90-453
Institution: Stanford University, Computer Systems Laboratory
Title: Event patterns: A language construct for hierarchical designs of concurrent systems
Author: Luckham, David D.
Author: Gennart, Benoit A.
Date: November 1990
Abstract: Event patterns are a language construct for expressing relationships between specifications at different levels of a hierarchical design of a concurrent system. They provide a facility missing from current hardware design languages such as VHDL, or programming languages with parallel constructs such as Ada. This paper explains the use of event patterns in (1) defining mappings between different levels of a design hierarchy, and (2) automating the comparison of the behavior of different design levels during simulation. It describes the language constructs for defining event patterns and mappings, and shows their use in a design example, a 16-bit CPU.
CSL-TR-90-453

Report Number: CSL-TR-90-454
Institution: Stanford University, Computer Systems Laboratory
Title: Page allocation to reduce access time of physical caches
Author: Bray, Brian K.
Author: Lynch, William L.
Author: Flynn, Michael J.
Date: November 1990
Abstract: A simple modification to an operating system's page allocation algorithm can give physically addressed caches the speed of virtually addressed caches. Colored page allocation reduces the number of bits that need to be translated before cache access, allowing large low-associativity caches to be indexed before address translation, which reduces the latency to the processor. The colored allocation also has other benefits: caches miss less (in general) and more uniformly, and the inclusion principle holds for second-level caches with less associativity. However, colored allocation requires partitioning of main memory, and requires shared virtual addresses to have more bits in common. Simulation results show high non-uniformity of cache miss rates for normal allocation. Analysis demonstrates the extent of second-level cache inclusion, and the reduction in effective main memory due to partitioning.
CSL-TR-90-454
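
The core allocation rule is easy to sketch: make the cache-index bits above the page offset agree in the virtual and physical addresses, so the cache can be indexed before translation completes. The sizes and the free-list policy below are illustrative assumptions, not the report's configuration.

    PAGE_SIZE = 4096                         # 4 KB pages (assumed)
    CACHE_SIZE = 64 * 1024                   # 64 KB direct-mapped cache (assumed)
    NUM_COLORS = CACHE_SIZE // PAGE_SIZE     # 16 page colors

    def color(page_number):
        return page_number % NUM_COLORS      # the index bits above the offset

    def allocate(virtual_page, free_physical_pages):
        """Pick a free physical page whose color matches the virtual page."""
        want = color(virtual_page)
        for p in free_physical_pages:
            if color(p) == want:
                free_physical_pages.remove(p)
                return p
        raise MemoryError("no free page of this color")  # the partitioning cost

    free = list(range(64))
    print(allocate(5, free))                 # returns a physical page of color 5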

Report Number: CSL-TR-91-459
Institution: Stanford University, Computer Systems Laboratory
Title: On fast IEEE rounding
Author: Quach, Nhon
Author: Takagi, Naofumi
Author: Flynn, Michael J.
Date: January 1991
Abstract: A systematic general rounding procedure is proposed. This procedure consists of two steps: constructing a rounding table and selecting a prediction scheme. Optimization guidelines are given in each step to minimize the hardware used. This procedure-based rounding method has the additional advantage that verification and generalization are trivial. Two rounding hardware models are described. The first is shown to be identical to that reported by Santoro, et al. The second is more powerful, providing solutions where the first fails. Applying this approach to the IEEE rounding modes for high-speed conventional binary multipliers reveals that the round-to-infinity modes are more difficult to implement than the round-to-nearest mode; more adders are potentially needed. Round-to-zero requires the least amount of hardware. A generalization of this procedure to redundant binary multipliers reveals two major advantages over conventional binary multipliers. First, the computation of the sticky bit consumes considerably less hardware. Second, implementing the round-to-positive-infinity and round-to-minus-infinity modes does not require the examination of the sticky bit, removing a possible worst-case path. A generalization of this approach to addition produces a similar solution to that reported by Quach and Flynn. Although generalizable to other kinds of rounding as well as other arithmetic operations, we only treat the case of IEEE rounding for addition and multiplication; IEEE rounding because it is the current standard on rounding, addition and multiplication because they are the most frequently used arithmetic operations in a typical scientific computation.
CSL-TR-91-459
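
As a small concrete instance of the rounding-table idea, the sketch below implements round-to-nearest-even from the guard and sticky bits of a truncated significand. The report's tables cover all IEEE modes and redundant representations, which this does not attempt; the function name and interface are invented.

    def round_nearest_even(significand, discarded_bits):
        """Round an integer significand, dropping its low discarded_bits."""
        if discarded_bits == 0:
            return significand
        kept = significand >> discarded_bits
        guard = (significand >> (discarded_bits - 1)) & 1          # first lost bit
        sticky = 1 if significand & ((1 << (discarded_bits - 1)) - 1) else 0
        lsb = kept & 1
        # Round up iff guard is set and (sticky or lsb): the "table" entry.
        return kept + (guard & (sticky | lsb))

    for x in range(256):                     # agrees with round-half-to-even
        assert round_nearest_even(x, 3) == round(x / 8)
    print("ok")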

Report Number: CSL-TR-91-463
Institution: Stanford University, Computer Systems Laboratory
Title: Leading One Detection --- Implementation, Generalization, and Application
Author: Quach, Nhon
Author: Flynn, Michael J.
Date: March 1991
Abstract: This paper presents the concept of leading-one prediction (LOP) in greater detail and describes two existing implementations. The first one is similar to that used in the IBM RS/6000 processor. The second is a distributed version of the first, consuming less hardware when multiple patterns need to be detected. We show how to modify these circuits for sign-magnitude numbers as dictated by the IEEE standard. We then point out (1) that LOP and carry lookahead in parallel addition belong to the same class of problem, that of bit-pattern detection, a recognition that allows techniques developed for parallel addition to be borrowed for bit-pattern detection; and (2) that LOP can be applied to compute the sticky bit needed for binary multipliers to perform IEEE rounding.
CSL-TR-91-463
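
One application mentioned above, computing a multiplier's sticky bit without examining the low half of the product, rests on the fact that the trailing-zero count of a product is the sum of the operands' trailing-zero counts. The sketch below demonstrates that fact as a generic illustration; it is not the report's circuit.

    import random

    def trailing_zeros(x):
        return (x & -x).bit_length() - 1     # position of the lowest set bit

    def sticky_from_operands(a, b, discarded_bits):
        """Sticky = 1 iff any of the product's low discarded bits is set."""
        return 1 if trailing_zeros(a) + trailing_zeros(b) < discarded_bits else 0

    for _ in range(1000):
        a = random.randrange(1, 1 << 12)     # nonzero operands
        b = random.randrange(1, 1 << 12)
        direct = 1 if (a * b) & ((1 << 8) - 1) else 0
        assert sticky_from_operands(a, b, 8) == direct
    print("ok")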

Report Number: CSL-TR-91-465
Institution: Stanford University, Computer Systems Laboratory
Title: Analysis of Power Supply Networks in VLSI Circuits
Author: Stark, Don
Date: March 1991
Abstract: Although the trend toward finer geometries and larger chips has produced faster systems, it has also created larger voltage drops and higher current densities in chip power supply networks. Excessive voltage drops in the power supply lines cause incorrect circuit operation, and high current densities lead to circuit failure via electromigration. Analyzing this power supply noise by hand for large circuits is difficult and error prone; automatic checking tools are needed to make the analysis easier. This thesis describes Ariel, a CAD tool that helps VLSI designers analyze power supply noise. The system consists of three main components, a resistance extractor, a current estimator, and a linear solver, that are used together to determine the voltage drops and current density along the supply lines. The resistance extractor includes two parts: a fast extractor that calculates resistances quickly using simple heuristics, and a slower, more accurate finite element extractor. Despite its simplicity, the fast extractor obtained nearly the same results as the finite element one and is two orders of magnitude faster. The system also contains two current estimators, one for CMOS designs and one for ECL. The CMOS current estimator is based on the switch level simulator Rsim, and produces a time-varying current distribution that includes the effects of charge sharing, image currents, and the slope of the gate's inputs. The ECL estimator does a static analysis of the design, calculating each gate's tail current and tracing through the network to find where it enters the power supplies. Extensions to the estimator allow it to handle more complex circuits, such as shared current lines and diode decoders. Finally, the linear solver applies this current pattern to the resistance network, and efficiently calculates voltages and current densities by taking advantage of topological characteristics peculiar to power supply networks. It removes trees, simple loops, and series sections for separate analysis. These techniques substantially reduce the time required for solution. This report also includes the results of running the system on several large designs, and points out flaws that Ariel uncovered in their power networks.
CSL-TR-91-465

Report Number: CSL-TR-91-468
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient moment-based timing analysis for variable accuracy switch level simulation
Author: Kao, Russell
Author: Horowitz, Mark
Date: April 1991
Abstract: We describe a timing analysis algorithm that can achieve the efficiency of RC tree analysis while retaining much of the generality of Asymptotic Waveform Evaluation. The RC tree analysis used in switch-level simulation is generalized to handle piecewise-linear transistor models, non-tree topologies, floating capacitors, and feedback. For simple switch-level models the complexity is O(n). The algorithm allows the user to trade off efficiency versus accuracy through the selection of transistor models of varying complexity.
CSL-TR-91-468

Report Number: CSL-TR-91-469
Institution: Stanford University, Computer Systems Laboratory
Title: SPLASH: Stanford parallel applications for shared-memory
Author: Singh, Jaswinder Pal
Author: Weber, Wolf-Dietrich
Author: Gupta, Anoop
Date: April 1991
Abstract: This report was replaced and updated by CSL-TR-92-526.
CSL-TR-91-469

Report Number: CSL-TR-91-470
Institution: Stanford University, Computer Systems Laboratory
Title: Write caches as an alternative to write buffers
Author: Bray, Brian K.
Author: Flynn, Michael J.
Date: April 1991
Abstract: Write buffers help unbind one level of a memory hierarchy from the next and are thus used to reduce write stalls. In write-through systems, write buffers let writes occur at the rate the cache can handle them, but they neither reduce the number of writes nor cluster writes for block transfers. A write cache is a cache that uses an allocate-on-write-miss, write-back, no-allocate-on-read-miss strategy. A write cache tries to reduce the total number of writes (write traffic) to the next level by taking advantage of the temporal locality of writes. It also groups writes for block transfers by taking advantage of the spatial locality of writes. We have found that small write caches can significantly reduce the write traffic to the first write-back level after the processor's register set. Systems that would benefit from reduced write traffic to that level would benefit from using a write cache instead of a write buffer. The temporal and spatial locality of writes is very important in determining what organization the write cache should have.
CSL-TR-91-470
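
A toy model of the write-cache policy named above (allocate on write miss, write back on eviction, no allocate on read miss) makes the traffic reduction visible. The line size, capacity, and FIFO-style eviction below are arbitrary choices for illustration.

    class WriteCache:
        """Allocate on write miss, write back on eviction, no read allocate."""
        def __init__(self, n_lines=8, line_size=16):
            self.lines = {}                  # line number -> dirty flag
            self.capacity = n_lines
            self.line_size = line_size
            self.writes_to_next_level = 0

        def write(self, addr):
            line = addr // self.line_size
            if line not in self.lines and len(self.lines) == self.capacity:
                self.lines.pop(next(iter(self.lines)))   # evict oldest line
                self.writes_to_next_level += 1           # one block write-back
            self.lines[line] = True                      # allocate, mark dirty

    cache = WriteCache()
    for addr in [0, 4, 8, 64, 68] * 20:      # writes with temporal locality
        cache.write(addr)
    print(cache.writes_to_next_level)        # 0: the write cache absorbs them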

Report Number: CSL-TR-91-475
Institution: Stanford University, Computer Systems Laboratory
Title: Making effective use of shared-memory multiprocessors: the process control approach
Author: Gupta, Anoop
Author: Tucker, Andrew
Author: Stevens, Luis
Date: May 1991
Abstract: We present the design, implementation, and performance of a novel approach for effectively utilizing shared-memory multiprocessors in the presence of multiprogramming. Our approach offers high performance by combining the techniques of process control and processor partitioning. The process control technique is based on the principle that to maximize performance, a parallel application must dynamically match the number of runnable processes associated with it to the effective number of processors available to it. This avoids the problems arising from oblivious preemption of processes and it allows an application to work at a better operating point on its speedup versus processors curve. Processor partitioning is necessary for dealing with realistic multiprogramming environments, where both process controlled and non-controlled applications may be present. It also helps improve the cache performance of applications and removes the bottleneck associated with a single centralized scheduler. Preliminary results from an implementation of the process control approach, with a user-level server, were presented in a previous paper. In this paper, we extend the process control approach to work with processor partitioning and fully integrate the approach with the operating system kernel. This also allows us to address a limitation in our earlier implementation wherein a close correspondence between runnable processes and the available processors was not maintained in the presence of I/O. The paper presents the design decisions and the rationale for the current implementation, along with extensive results from executions on a high-performance Silicon Graphics 4D/340.
CSL-TR-91-475

Report Number: CSL-TR-91-480
Institution: Stanford University, Computer Systems Laboratory
Title: Strategies for branch target buffers
Author: Bray, Brian K.
Author: Flynn, M. J.
Date: June 1991
Abstract: Achieving high instruction issue rates depends on the ability to dynamically predict branches. We compare two schemes for dynamic branch prediction: a separate branch target buffer and an instruction cache based branch target buffer. For instruction caches of 4KB and greater, instruction cache based branch prediction performance is a strong function of line size, and a weak function of instruction cache size. An instruction cache based branch target buffer with a line size of 8 (or 4) instructions performs about as well as a separate branch target buffer structure which has 64 (or 256, respectively) entries. Software can rearrange basic blocks in a procedure to reduce the number of taken branches, thus reducing the amount of branch prediction hardware needed. With software assistance, predicting all branches as not branching performs as well as a 4 entry branch target buffer without assistance, and a 4 entry branch target buffer with assistance performs as well as a 32 entry branch target buffer without assistance. The instruction cache based branch target buffer also benefits from the software, but only for line sizes of more than 4 instructions.
CSL-TR-91-480
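
A minimal direct-mapped branch target buffer can be sketched as follows: predict taken with the stored target on a hit, predict fall-through on a miss. The entry count, indexing, and update policy are simplifying assumptions, not the configurations measured in the report.

    class BTB:
        def __init__(self, n_entries=64):
            self.n = n_entries
            self.tags = [None] * n_entries
            self.targets = [0] * n_entries

        def predict(self, pc):
            i = (pc // 4) % self.n                 # word-aligned instructions
            if self.tags[i] == pc:
                return self.targets[i]             # predict taken, to this target
            return pc + 4                          # predict fall-through

        def update(self, pc, taken, target):
            i = (pc // 4) % self.n
            if taken:
                self.tags[i], self.targets[i] = pc, target
            elif self.tags[i] == pc:
                self.tags[i] = None                # stop predicting taken

    btb = BTB(n_entries=4)
    btb.update(pc=100, taken=True, target=40)
    print(btb.predict(100))   # 40 (hit)
    print(btb.predict(104))   # 108 (miss: fall through)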

Report Number: CSL-TR-91-481
Institution: Stanford University, Computer Systems Laboratory
Title: Subnanosecond arithmetic (Second Report)
Author: Flynn, Michael J.
Author: DeMicheli, Giovanni
Author: Dutton, Robert
Author: Pease, R. Fabian
Author: Wooley, Bruce
Date: June 1991
Abstract: The Stanford Nanosecond Arithmetic Project is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realizes a particular arithmetic function. This year's report highlights recent results in the area of wave pipelining. We have fabricated a number of prototype dies implementing a multiplier slice. Cycle times below 5 ns were realized.
CSL-TR-91-481

Report Number: CSL-TR-91-483
Institution: Stanford University, Computer Systems Laboratory
Title: Suggestions for implementing a fast IEEE multiply-add-fused instruction
Author: Quach, Nhon
Author: Flynn, Michael
Date: July 1991
Abstract: We studied three possible strategies to overlap the operations in a floating-point add (FPA) and a floating-point multiply (FPM) for implementing an IEEE multiply-add-fused (MAF) instruction. The operations in FPM and FPA are: (a) non-overlapped, (b) fully-overlapped, and (c) partially-overlapped. The first strategy corresponds to multiply-add-chained (MAC) widely used in vector processors. The second (Greedy) strategy uses a greedy algorithm, yielding an implementation similar to the IBM RS/6000 one. The third and final (SNAP) strategy uses a less aggressive starting configuration and corresponds to the SNAP implementation. An IEEE MAF delivers the same result as that obtained via a separate IEEE FPM and FPA. Two observations have prompted this study. First, in the IBM RS/6000 implementation, the design tradeoffs have been made for high internal data precision, which facilitates the execution of elementary functions. These tradeoff decisions, however, may not be valid for an IEEE MAF. Second, the RS/6000 implementation assumed a different critical path for FPA and FPM, which does not reflect the current state-of-the-art in FP technology. Using latency and hardware costs as the performance metrics we show that: (1) MAC has the lowest FPA latency and consumes the least hardware. But its MAF latency is the highest. (2) Greedy has a medium MAF latency but the highest FPA latency. And (3) SNAP has the lowest MAF latency and a slightly higher FPA latency than that of MAC, consuming an area that is comparable with that of Greedy. Both Greedy and SNAP have higher design complexity arising from rounding for the IEEE standard. SNAP has an additional wire complexity, which Greedy does not have because of its simpler datapath. If rounding for the IEEE standard is not an issue, the Greedy strategy --- and therefore the RS/6000 --- seems reasonable for applications with a high MAF to FPA ratio.
CSL-TR-91-483

Report Number: CSL-TR-91-484
Institution: Stanford University, Computer Systems Laboratory
Title: Self-Consistency and Transitivity in Self-Calibration Procedures
Author: Raugh, Michael
Date: July 1991
Abstract: Self-calibration refers to the use of an uncalibrated measuring instrument and an uncalibrated object called an artifact, such as a rigid marked plate, to simultaneously measure the artifact and calibrate the instrument. Typically, the artifact is measured in more than one position, and the required information is derived from comparisons of the various measurements. The problems of self-calibration are surprisingly subtle. This paper develops concepts and vocabulary for dealing with such problems in one and two dimensions and uses simple (non-optimal) measurement procedures to reveal the underlying principles. The approach in two dimensions is mathematically constructive: procedures are described for measuring an uncalibrated artifact in several stages, involving progressive transformations of the instrument's uncalibrated coordinate system, until correct coordinates for the artifact are obtained and calibration of the instrument is achieved. Self-consistency and transitivity, as defined within, emerge as key concepts. It is shown that self-consistency and transitivity are necessary conditions for self-calibration. Consequently, in general, it is impossible to calibrate a two-dimensional measuring instrument simply by rotating a calibration plate about a fixed center and measuring it.
CSL-TR-91-484

Report Number: CSL-TR-91-485
Institution: Stanford University, Computer Systems Laboratory
Title: A Unidraw-Based User Interface Builder
Author: Vlissides, John M.
Author: Tang, Steven
Date: August 1991
Abstract: Ibuild is a user interface builder that lets a user manipulate simulations of toolkit objects rather than actual toolkit objects. Ibuild is built with Unidraw, a framework for building graphical editors that is part of the InterViews toolkit. Unidraw makes the simulation-based approach attractive. Simulating toolkit objects in Unidraw makes it easier to support editing facilities that are common in other kinds of graphical editors, and it keeps the builder insulated from a particular toolkit implementation. Ibuild supports direct manipulation analogs of InterViews' composition mechanisms, which simplify the specification of an interface's layout and resize semantics. Ibuild also leverages the C++ inheritance mechanism to decouple builder-generated code from the rest of the application. And while current user interface builders stop at the widget level, ibuild incorporates Unidraw abstractions to simplify the implementation of graphical editors.
CSL-TR-91-485

Report Number: CSL-TR-91-488
Institution: Stanford University, Computer Systems Laboratory
Title: The Stanford Ada style checker: an application of the Anna tools and methodology
Author: Walicki, Michal
Author: Skakkebaek, Jens Ulrik
Author: Sankar, Sriram
Date: August 1991
Abstract: This report describes the Ada style checker, which was designed and constructed in Winter and Spring 1989-90. The style checker is based on the Stanford Anna Tools and has been annotated using Anna. The style checker examines Ada programs for "correct style", which is defined in a style specification language (SSL). A style checker generator is used to automatically generate a style checker based on a set of style specifications.
CSL-TR-91-488

Report Number: CSL-TR-91-492
Institution: Stanford University, Computer Systems Laboratory
Title: Paging Performance with Page Coloring
Author: Lynch, William L.
Author: Flynn, Michael J.
Date: October 1991
Abstract: Constraining the mapping of virtual to physical addresses (page coloring) can speed and/or simplify caches in the presence of virtual memory. For the mapping to hold, physical memory must be partitioned into distinct colors, and virtual pages allocated to a specific color of physical page determined by the mapping. This paper uses an analytical model and simulation to compare the paging effects of colored versus uncolored (conventional) page allocation, and concludes that these effects are small.
CSL-TR-91-492

Report Number: CSL-TR-91-496
Institution: Stanford University, Computer Systems Laboratory
Title: ANNA package specification: case studies
Author: Kenney, John
Author: Mann, Walter
Date: October 1991
Abstract: We present techniques for the specification of Ada* software based on the Anna specification language, together with examples of Ada packages formally specified in Anna. A package specification for an abstract set type is used to illustrate the techniques and pitfalls involved in the process of software specification and development. This specification not only exemplifies good Anna style and specification approach, but has a secondary goal of teaching the reader how to use Anna and the associated set of Anna tools developed at Stanford University over the past six years. The technical report thus aims to give readers a new way of looking at the software design and development process, synthesizing fifteen years of research in the process. *Ada is a registered trademark of the U.S. Government (Ada Joint Program Office)
CSL-TR-91-496

Report Number: CSL-TR-91-498
Institution: Stanford University, Computer Systems Laboratory.
Title: Spectral Techniques for Technology Mapping
Author: Yang, Jerry Chih-Yuan
Author: DeMicheli, Giovanni
Date: March 1994
Abstract: Technology mapping is the crucial step in logic synthesis where technology-dependent optimizations take place. The matching phase of a technology mapping algorithm is generally considered the most computationally intensive task, because it is invoked repeatedly. In this work, we investigate applications of spectral techniques in doing matching. In particular, we present an algorithm that will detect NPN-equivalent Boolean functions. We show that while generating the spectra for Boolean functions may be expensive, this algorithm offers significant pruning of the search space and is simple to implement. The algorithm is implemented as part of the Specter technology mapper, and results are compared to other Boolean matching techniques.
CSL-TR-91-498

Report Number: CSL-TR-92-510
Institution: Stanford University, Computer Systems Laboratory
Title: Rapide-0.2 Examples
Author: Hsieh, Alexander
Date: February 1992
Abstract: Rapide-0.2 is an executable language for prototyping distributed, time-sensitive systems. We present in this report a series of simple, working example programs in the language. In each example we present one or more new concepts or constructs of the Rapide-0.2 language, with later examples drawing on previously presented material. The examples are written both for those who wish to use the Rapide-0.2 language to do serious prototyping and for those who just wish to become familiar with it. The examples were not written for someone who wishes to learn prototyping in general. CSL-TN-92-387 is an informal reference manual, describing the Rapide-0.2 language and tools, which might be helpful to have in conjunction with CSL-TR-92-510 (p. 191).
CSL-TR-92-510

Report Number: CSL-TR-92-515
Institution: Stanford University, Computer Systems Laboratory
Title: Partial orderings of event sets and their application to prototyping concurrent timed systems
Author: Luckham, David C.
Author: Vera, James
Author: Bryan, Doug
Author: Augustin, Larry
Author: Belz, Frank
Date: April 1992
Abstract: Rapide is a concurrent object-oriented language specifically designed for prototyping large concurrent systems. One of the principal design goals has been to adopt a computation model in which the synchronization, concurrency, dataflow, and timing aspects of a prototype are explicitly represented and easily accessible both to the prototype itself and to the prototyper. This paper describes the partially ordered event set (poset) computation model, and the features of Rapide for using posets in reactive prototypes and for automatically checking posets. Some critical issues in the implementation of Rapide are described and our experience with them is summarized. An example prototyping scenario illustrates uses of the poset computation model.
CSL-TR-92-515

Report Number: CSL-TR-92-516
Institution: Stanford University, Computer Systems Laboratory
Title: Opportunities for Online Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: Partial evaluators can be separated into two classes: offline specializers, which make all of their reduce/residualize decisions before specialization, and online specializers, which make such decisions during specialization. The choice of which method to use is driven by a tradeoff between the efficiency of the specializer and the quality of the residual programs that it produces. Existing research describes some of the inefficiencies of online specializers, and how these are avoided using offline methods, but fails to address the price paid in specialization quality. This paper motivates research in online specialization by describing two fundamental limitations of the offline approach, and explains why the online approach does not encounter the same difficulties.
CSL-TR-92-516
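
The defining property of online specialization, making reduce/residualize decisions during specialization by inspecting the operands actually present, fits in a few lines. The expression language, environment handling, and simplifications below are invented for illustration and are far simpler than systems such as FUSE.

    def spec(exp, env):
        """Specialize exp: return an int (known) or a residual expression."""
        if isinstance(exp, int):
            return exp
        if isinstance(exp, str):                      # a variable
            return env.get(exp, exp)                  # unknown stays symbolic
        op, left, right = exp
        lv, rv = spec(left, env), spec(right, env)
        if isinstance(lv, int) and isinstance(rv, int):
            return lv * rv if op == "*" else lv + rv  # reduce: decided online
        if op == "*" and 1 in (lv, rv):               # algebraic simplification
            return rv if lv == 1 else lv
        return (op, lv, rv)                           # residualize

    # power(x, 3) written as x * (x * (x * 1)), with the exponent known:
    body = ("*", "x", ("*", "x", ("*", "x", 1)))
    print(spec(body, {}))          # ('*', 'x', ('*', 'x', 'x'))
    print(spec(body, {"x": 2}))    # 8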

Report Number: CSL-TR-92-517
Institution: Stanford University, Computer Systems Laboratory
Title: Preserving Information during Online Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: The degree to which a partial evaluator can specialize a source program depends on how accurately the partial evaluator can represent and maintain information about runtime values. Partial evaluators always lose some accuracy due to their use of finite type systems; however, existing partial evaluation techniques lose information about runtime values even when their type systems are capable of representing such information. This paper describes two sources of such loss in existing specializers, solutions for both cases, and the implementation of these solutions in our partial evaluation system, FUSE.
CSL-TR-92-517

Report Number: CSL-TR-92-518
Institution: Stanford University, Computer Systems Laboratory
Title: Avoiding Redundant Specialization during Partial Evaluation
Author: Ruf, Erik
Author: Weise, Daniel
Date: April 1992
Abstract: Existing partial evaluators use a strategy called polyvariant specialization, which involves specializing program points on the known portions of their arguments, and re-using such specializations only when these known portions match exactly. We show that this re-use criterion is overly restrictive, and misses opportunities for sharing in residual programs, thus producing large residual programs containing redundant specializations. We develop a criterion for re-use based on computing the domains of specializations, describe an approximate implementation of this criterion based on types, and show its implementation in our partial evaluation system FUSE. In addition, we describe several extensions to our mechanism to make it compatible with more powerful specialization strategies and to increase its efficiency. After evaluating our algorithm's usefulness, we relate it to existing work in partial evaluation and machine learning.
CSL-TR-92-518

Report Number: CSL-TR-92-520
Institution: Stanford University, Computer Systems Laboratory
Title: An Empirical Study of an Abstract Interpretation of Scheme Programs
Author: Kanamori, Atty
Author: Weise, Daniel
Date: April 1992
Abstract: Abstract Interpretation, a powerful and general framework for performing global program analysis, is being applied to problems whose difficulty far surpasses the traditional "bit-vector" dataflow problems for which many of the high-speed abstract interpretation algorithms worked so well. Our experience has been that current methods of large scale abstract interpretation are unacceptably expensive. We studied a typical large-scale abstract interpretation problem: computing the control flow of a higher order program. Researchers have proposed various solutions that are designed primarily to improve the accuracy of the analysis. The cost of the analyses, and its relationship to accuracy, is addressed only cursorily in the literature. Somewhat paradoxically, one can view these strategies as attempts to simultaneously improve the accuracy and reduce the cost. The less accurate strategies explore many spurious control paths because many flowgraph paths represent illegal execution paths. For example, the less accurate strategies violate the LIFO constraints on procedure call and return. More accurate analyses investigate fewer control paths, and therefore may be more efficient despite their increased overhead. We empirically studied this accuracy versus efficiency tradeoff. We implemented two fixpoint algorithms, and four semantics (baseline, baseline + stack reasoning, baseline + contour reasoning, baseline + stack reasoning + contour reasoning) for a total of eight control flow analyzers. Our benchmarks test various programming constructs in isolation --- hence, if a certain algorithm exhibits poor performance, the experiment also yields insight into what kind of program behavior results in that poor performance. The results suggest that strategies that increase accuracy in order to eliminate spurious paths often generate unacceptable overhead in the parts of the analysis that do not benefit from the increased accuracy. Furthermore, we found little evidence that the extra effort significantly improves the accuracy of the final result. This suggests that increasing the accuracy of the analysis globally is not a good idea, and that future research should investigate adaptive algorithms that use different amounts of precision on different parts of the problem.
CSL-TR-92-520

Report Number: CSL-TR-92-523
Institution: Stanford University, Computer Systems Laboratory
Title: Architectural and implementation tradeoffs in the design of multiple-context processors
Author: Laudon, James
Author: Gupta, Anoop
Author: Horowitz, Mark
Date: May 1992
Abstract: Multiple-context processors have been proposed as an architectural technique to mitigate the effects of large memory latency in multiprocessors. We examine two schemes for implementing multiple-context processors. The first scheme switches between contexts only on a cache miss, while the other interleaves the contexts on a cycle-by-cycle basis. Both schemes provide the capability for a single context to fully utilize the pipeline. We show that cycle-by-cycle interleaving of contexts provides a performance advantage over switching contexts only at a cache miss. This advantage results from the context interleaving hiding pipeline dependencies and reducing the context switch cost. In addition, we show that while the implementation of the interleaved scheme is more complex, the complexity is not overwhelming. As pipelines get deeper and operate at lower percentages of peak performance, the performance advantage of the interleaved scheme is likely to justify its additional complexity.
CSL-TR-92-523

Report Number: CSL-TR-92-526
Institution: Stanford University, Computer Systems Laboratory
Title: SPLASH: Stanford parallel applications for shared-memory*
Author: Singh, Jaswinder Pal
Author: Weber, Wolf-Dietrich
Author: Gupta, Anoop
Date: June 1992
Abstract: We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. Our goal is to provide a suite of realistic applications that will serve as a well-documented and consistent basis for evaluation studies. We describe the applications currently in the suite in detail, discuss and compare some of their important characteristics, such as data locality, granularity, synchronization, etc., and explore their behavior by running them on a real multiprocessor as well as on a simulator of an idealized parallel architecture. We expect the current set of applications to act as a nucleus for a suite that will grow with time. This report replaces and updates CSL-TR-91-469, April 1991.
CSL-TR-92-526

Report Number: CSL-TR-92-528
Institution: Stanford University, Computer Systems Laboratory
Title: Binary Multiplication Using Partially Redundant Multiples
Author: Bewick, Gary
Author: Flynn, Michael J.
Date: June 1992
Abstract: This report presents an extension to Booth's algorithm for binary multiplication. Most implementations that utilize Booth's algorithm use the 2 bit version, which reduces the number of partial products required to half that required by a simple add and shift method. Further reduction in the number of partial products can be obtained by using higher order versions of Booth's algorithm, but it is necessary to generate multiples of one of the operands (such as 3 times an operand) by the use of a carry propagate adder. This carry propagate addition introduces significant delay and additional hardware. The algorithm described in this report produces such difficult multiples in a partially redundant form, using a series of small length adders. These adders operate in parallel with no carries propagating between them. As a result, the delay introduced by multiple generation is minimized and the hardware needed for the multiple generation is also reduced, due to the elimination of expensive carry lookahead logic.
CSL-TR-92-528
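
For reference, the radix-4 (2-bit) Booth recoding that the report takes as its starting point can be written out directly: each digit in {-2, -1, 0, 1, 2} selects an easy multiple, halving the number of partial products, while higher radices would require hard multiples such as 3x. This is the textbook recoding, not the report's partially redundant scheme.

    import random

    def booth_radix4_digits(multiplier, bits=16):
        """Recode an unsigned multiplier into radix-4 digits in {-2,...,2}."""
        digits, prev = [], 0
        for i in range(0, bits, 2):
            b0 = (multiplier >> i) & 1
            b1 = (multiplier >> (i + 1)) & 1
            digits.append(b0 + prev - 2 * b1)   # classic Booth recoding rule
            prev = b1
        digits.append(prev)          # carry-out digit so unsigned inputs work
        return digits                # least-significant digit first

    def booth_multiply(a, b, bits=16):
        """Sum the easy multiples (0, +/-a, +/-2a) selected by each digit."""
        return sum(d * (a << (2 * i))
                   for i, d in enumerate(booth_radix4_digits(b, bits)))

    for _ in range(1000):
        x, y = random.randrange(1 << 16), random.randrange(1 << 16)
        assert booth_multiply(x, y) == x * y
    print("ok")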

Report Number: CSL-TR-92-532
Institution: Stanford University, Computer Systems Laboratory
Title: Piecewise Linear Models for Switch-Level Simulation
Author: Kao, Russell
Date: June 1992
Abstract: Rsim is an efficient logic plus timing simulator that employs the switched resistor transistor model and RC tree analysis to efficiently simulate MOS digital circuits at the transistor level. We investigate the incorporation of piecewise linear transistor models and generalized moments matching into this simulation framework. General piecewise linear models allow more accurate MOS models to be used to simulate circuits that are hard for Rsim. Additionally, they enable the simulator to handle circuits containing bipolar transistors, such as ECL and BiCMOS designs. Nonetheless, the switched resistor model has proved to be efficient and accurate for a large class of MOS digital circuits. Therefore, it is retained as just one particular model available for use in this framework. The use of piecewise linear models requires the generalization of RC tree analysis. Unlike switched resistors, more general models may incorporate gain and floating capacitance. Additionally, we extend the analysis to handle non-tree topologies and feedback. Despite the increased generality, for many common MOS and ECL circuits the complexity remains linear. Thus, this timing analysis can efficiently simulate those portions of the circuit that are well described by traditional switch-level models, while simultaneously simulating more accurately those portions that are not. We present preliminary results from a prototype simulator, Mom. We demonstrate its use on a number of MOS, ECL, and BiCMOS circuits.
CSL-TR-92-532

Report Number: CSL-TR-92-534
Institution: Stanford University, Computer Systems Laboratory
Title: On the specialization of online program specializers
Author: Ruf, Erik
Author: Weise, Daniel
Date: July 1992
Abstract: Program specializers improve the speed of programs by performing some of the programs' reductions at specialization time rather than at runtime. This specialization process can be time-consuming; one common technique for improving the speed of the specialization of a particular program is to specialize the specializer itself on that program, creating a custom specializer, or program generator, for that particular program. Much research has been devoted to the problem of generating efficient program generators, which do not perform reductions at program generation time that could instead have been performed when the program generator was constructed. The conventional wisdom holds that only offline program specializers, which use binding time annotations, can be specialized into such efficient program generators. This paper argues that this is not the case, and demonstrates that the specialization of a nontrivial online program specializer similar to the original "naive MIX" can indeed yield an efficient program generator. The key to our argument is that, while the use of binding time information at program generator generation time is necessary for the construction of an efficient custom specializer, the use of explicit binding time approximation techniques is not. This allows us to distinguish the problem at hand (i.e., the use of binding time information during program generator generation) from particular solutions to that problem (i.e., offline specialization). We show that, given a careful choice of specializer data structures, and sufficiently powerful specialization techniques, binding time information can be inferred and utilized without the use of explicit binding time approximation techniques. This allows the construction of efficient, optimizing program generators from online program specializers.
CSL-TR-92-534

Report Number: CSL-TR-92-546
Institution: Stanford University, Computer Systems Laboratory
Title: The accuracy of trace-driven simulations of multiprocessors
Author: Goldschmidt, Stephen R.
Author: Hennessy, John L.
Date: September 1992
Abstract: In trace-driven simulation, traces generated for one set of machine characteristics are used to simulate a machine with different characteristics. However, the execution path of a multiprocessor workload may depend on the ordering of events on different processors, which in turn depends on machine characteristics such as memory system timings. Trace-driven simulations of multiprocessor workloads are inaccurate unless the timing-dependencies are eliminated from the traces. We measure such inaccuracies by comparing trace-driven simulations to direct simulations of the same workloads. The results were identical only for workloads whose timing dependencies were eliminated from the traces. The remaining workloads used either first-come first-served scheduling or non-deterministic algorithms; these characteristics resulted in timing-dependencies that could not be eliminated from the traces. Workloads which used task-queue scheduling had particularly large discrepancies because task-queue operations, unlike other synchronization operations, were not abstracted. Two types of simulation results had especially large discrepancies: those related to synchronization latency and those derived from relatively small numbers of events. Studies that rely on such results should use timing-independent traces or direct simulation.
CSL-TR-92-546

Report Number: CSL-TR-92-548
Institution: Stanford University, Computer Systems Laboratory
Title: System synthesis via hardware-software co-design
Author: Gupta, Rajesh K.
Author: DeMicheli, Giovanni
Date: October 1992
Abstract: Synthesis of circuits containing application-specific as well as re-programmable components such as off-the-shelf microprocessors provides a promising approach to realization of complex systems using a minimal amount of application-specific hardware while still meeting the required performance constraints. We formulate the synthesis problem of complex behavioral descriptions with performance constraints as a hardware-software co-design problem. The target system architecture consists of a software component as a program running on a re-programmable processor assisted by application-specific hardware components. System synthesis is performed by first partitioning the input system description into hardware and software portions and then by implementing each of them separately. We consider the problem of identifying potential hardware and software components of a system described in a high-level modeling language. Partitioning approaches are presented based on decoupling of data and control flow, and based on communication/synchronization requirements of the resulting system design. Synchronization between various elements of a mixed system design is one of the key issues that any synthesis system must address. We present software and interface synchronization schemes that facilitate communication between system components. We explore the relationship between the non-determinism in the system models and the associated synchronization schemes needed in system implementations. The synthesis of dedicated hardware is achieved by hardware synthesis tools, while the software component is generated using software compiling techniques. We present tools to perform synthesis of a system description into hardware and software components. The resulting software component is assumed to be implemented for the DLX machine, a load/store microprocessor. We present the design of an Ethernet-based network coprocessor to demonstrate the feasibility of mixed system synthesis.
CSL-TR-92-548

Report Number: CSL-TR-92-550
Institution: Stanford University, Computer Systems Laboratory
Title: Cache Coherence Directories for Scalable Multiprocessors
Author: Simoni, Richard
Date: October 1992
Abstract: Directory-based protocols have been proposed as an efficient means of implementing cache coherence in large-scale shared-memory multiprocessors. This thesis explores the trade-offs in the design of cache coherence directories by examining the organization of the directory information, the options in the design of the coherency protocol, and the implementation of the directory and protocol. The traditional directory organization that maintains a full valid bit vector per directory entry is unsuitable for large-scale machines due to high storage overhead. This thesis proposes several alternate organizations. Limited pointers directories replace the bit vector with several pointers that indicate those caches containing the data. Although this scheme performs well across a wide range of workloads, its performance does not improve as the read/write ratio becomes very large. To address this drawback, a dynamic pointer allocation directory is proposed. This directory allocates pointers from a pool to particular memory blocks as they are needed. Since the pointers may be allocated to any block on the memory module, the probability of running short of pointers is very small. Among the set of possible organizations, dynamic pointer allocation lies at an attractive cost/performance point. Measuring the performance impact of three coherency protocol features makes the virtues of simplicity clear. Adding a clean/exclusive state to reduce the time required to write a clean block results in only modest performance improvement. Using request forwarding to transfer a dirty block directly to another cache that has requested it yields similar results. For small cache block sizes, write hits to clean blocks can be simply treated as write misses without incurring significant extra network traffic. Protocol features designed to improve performance must be examined carefully, for they often complicate the protocol without offering substantial benefit. Implementing directory-based coherency presents several challenges. Methods are described for preventing deadlock, maintaining a model of parallel execution, handling subtle situations caused by temporary inconsistencies between cache and directory state, and tolerating out-of-order message delivery. Using these techniques, cache coherence can be added to large-scale multiprocessors in an inexpensive yet effective manner.
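As a hedged illustration of the limited-pointer organization discussed above (not the report's dynamic pointer allocation scheme), the toy directory entry below tracks a fixed number of sharers and falls back to broadcasting invalidations on overflow; the class name and the overflow policy are assumptions.

    class LimitedPointerDirectoryEntry:
        """Directory entry with a fixed number of sharer pointers; when the
        pointers are exhausted, fall back to broadcasting invalidations."""
        def __init__(self, pointers: int = 4):
            self.max_pointers = pointers
            self.sharers = set()
            self.overflow = False

        def add_sharer(self, cache_id: int):
            if self.overflow:
                return                      # already beyond what we can track
            if len(self.sharers) < self.max_pointers:
                self.sharers.add(cache_id)
            else:
                self.overflow = True        # more sharers than pointers

        def invalidation_targets(self, all_caches):
            # On a write, invalidate either the tracked sharers or everyone.
            return set(all_caches) if self.overflow else set(self.sharers)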
CSL-TR-92-550

Report Number: CSL-TR-92-553
Institution: Stanford University, Computer Systems Laboratory
Title: Branch prediction using large self history
Author: Johnson, John D.
Date: December 1992
Abstract: Branch prediction is the main method of providing speculative opportunities for new high performance processors; therefore, the accuracy of branch prediction is becoming very important. Motivated by this desire to achieve high branch prediction accuracy, this study examines methods of using up to 24 bits of branch direction history to determine the probable outcome of the next execution of a conditional branch. Using profiling to train a prediction logic function achieves an average branch prediction accuracy of up to 96.9% for the six benchmarks used in this study.
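The report trains its prediction function offline from profiles; as a hedged sketch of the underlying idea of predicting a branch from its own direction history, here is a small online predictor in which each branch's history register indexes 2-bit saturating counters. The class name, table organization, and sizes are illustrative assumptions, not the report's mechanism.

    class LocalHistoryPredictor:
        """Predict each branch from its own past outcomes: an h-bit shift
        register of prior directions selects a 2-bit saturating counter."""
        def __init__(self, history_bits: int = 8):
            self.h = history_bits
            self.history = {}    # branch PC -> h-bit outcome history
            self.counters = {}   # (PC, history) -> counter in 0..3

        def predict(self, pc: int) -> bool:
            hist = self.history.get(pc, 0)
            return self.counters.get((pc, hist), 1) >= 2

        def update(self, pc: int, taken: bool):
            hist = self.history.get(pc, 0)
            c = self.counters.get((pc, hist), 1)
            self.counters[(pc, hist)] = min(3, c + 1) if taken else max(0, c - 1)
            self.history[pc] = ((hist << 1) | int(taken)) & ((1 << self.h) - 1)

Profiling, as in the report, amounts to fixing the per-history prediction bits ahead of time rather than adapting counters at run time.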
CSL-TR-92-553

Report Number: CSL-TR-93-554
Institution: Stanford University, Computer Systems Laboratory
Title: Using a Floating-Point Multiplier's Internals for High-Radix Division and Square Root
Author: Schwarz, Eric M.
Author: Flynn, Michael J.
Date: January 1993
Abstract: A method for obtaining high-precision approximations of high-order arithmetic operations at low-cost is presented in this study. Specifically, high-precision approximations of the reciprocal (12 bits worst case) and square root (16 bits) operations are obtained using the internal hardware of a floating-point multiplier without the use of look-up tables. The additional combinatorial logic necessary is very small due to the reuse of existing hardware. These low-cost high-precision approximations are used by iterative algorithms to perform the operations of division and square root. The method presented also applies to several other high-order arithmetic operations. Thus, high-radix algorithms for high-order arithmetic operations such as division and square root are possible at low-cost.
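A minimal sketch of how such a low-cost reciprocal approximation drives an iterative divider; the linear seed below is a stand-in assumption for the report's multiplier-internal approximation. Since Newton-Raphson roughly doubles the number of correct bits per step, a 12-bit seed would need only two to three multiply-based iterations to reach double precision.

    import math

    def nr_divide(n: float, d: float, iterations: int = 4) -> float:
        """Division as reciprocal refinement: seed x ~ 1/d, then iterate
        x <- x * (2 - d*x), which converges quadratically. Assumes d > 0."""
        m, e = math.frexp(d)                  # d = m * 2**e with 0.5 <= m < 1
        x = (48/17 - 32/17 * m) * 2.0 ** -e   # crude linear seed, ~4-5 good bits
        for _ in range(iterations):
            x = x * (2.0 - d * x)
        return n * x

    print(nr_divide(1.0, 3.0))   # ~0.3333333333333333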
CSL-TR-93-554

Report Number: CSL-TR-93-556
Institution: Stanford University, Computer Systems Laboratory
Title: Support for Speculative Execution in High-Performance Processors
Author: Smith, Michael David
Date: November 1992
Abstract: Superscalar and superpipelining techniques increase the overlap between the instructions in a pipelined processor, and thus these techniques have the potential to improve processor performance by decreasing the average number of cycles between the execution of adjacent instructions. Yet, to obtain this potential performance benefit, an instruction scheduler for this high-performance processor must find the independent instructions within the instruction stream of an application to execute in parallel. For non-numerical applications, there is an insufficient number of independent instructions within a basic block, and consequently the instruction scheduler must search across the basic block boundaries for the extra instruction-level parallelism required by the superscalar and superpipelining techniques. To exploit instruction-level parallelism across a conditional branch, the instruction scheduler must support the movement of instructions above a conditional branch, and the processor must support the speculative execution of these instructions. We define boosting, an architectural mechanism for speculative execution, that allows us to uncover the instruction-level parallelism across conditional branches without adversely affecting the instruction count of the application or the cycle time of the processor. Under boosting, the compiler is responsible for analyzing and scheduling instructions, while the hardware is responsible for ensuring that the effects of a speculatively-executed instruction do not corrupt the program state when the compiler is incorrect in its speculation. To experiment with boosting, we built a global instruction scheduler, which is specifically tailored for the non-numerical environment, and a simulator, which determines the cycle-count performance of our globally-scheduled programs. We also analyzed the hardware requirements for boosting in a typical load/store architecture. Through the cycle-count simulations and an understanding of the cycle-time impact of the hardware support for boosting, we found that only a small amount of hardware support for speculative execution is necessary to achieve good performance in a small-issue, superscalar processor.
CSL-TR-93-556

Report Number: CSL-TR-93-560
Institution: Stanford University, Computer Systems Laboratory
Title: The Cramer Rao Bound for Discrete-Time Edge Position
Author: Gatherer, Alan
Date: February 1993
Abstract: The problem of estimating the position of an edge from a series of samples often occurs in the fields of machine vision and signal processing. It is therefore of interest to assess the accuracy of any estimation algorithm. Previous work in this area has produced bounds for the continuous time estimator. In this paper we derive a closed form for the minimum variance bound (or Cramer Rao bound) for estimating the position of an arbitrarily shaped edge in white Gaussian noise for the discrete samples case. We quantify the effects of the sampling rate, the bandwidth of the edge, the shape of the edge and the size of the observation window on the variance of the estimator. We describe a maximum likelihood estimator and show that in practice this estimator requires fewer computations than standard correlation.
CSL-TR-93-560

Report Number: CSL-TR-93-561
Institution: Stanford University, Computer Systems Laboratory
Title: Fetch Caches
Author: Bray, Brian K.
Author: Flynn, Michael J.
Date: February 1993
Abstract: For high performance, data caches must have a low miss rate and provide high bandwidth, while maintaining low latency. Larger and more complex set associative caches provide lower miss rates but at the cost of increased latency. Interleaved data caches can improve the available bandwidth, but the improvement is limited by bank conflicts and increased latency due to the switching networks required to distribute cache addresses and to route the data. We propose using a small buffer to reduce the data read latency or improve the read bandwidth of an on-chip data cache. We call the small read-only buffer a fetch cache. The fetch cache attempts to capture the immediate spatial locality of the data read reference stream by utilizing the large number of bits that can be fetched in a single access of an on-chip cache. There are two ways a processor can issue multiple instructions per cache access: the cache access can require multiple cycles (i.e. superpipelined), or multiple instructions are issued per cycle (i.e. superscalar). In the first section, we show the use of fetch caches with multi-cycle per access data caches. When there is a read hit in the fetch cache, the read request can be serviced in one cycle, otherwise the latency is that of the primary data cache. For a four line, 16 byte wide fetch cache, the hit rate ranged from 40 to 60 percent depending on the application. In the second part, we show the use of fetch caches when multi-accesses per cycle are requested. When there is a read hit in the fetch cache, a read can be satisfied by the fetch cache, while the primary cache performs another read or write request. For a four line, 16 byte wide fetch cache, the cache bandwidth increased by 20 to 30 percent depending on the application.
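A hypothetical, timing-free model of the fetch cache described above (four lines of 16 bytes, LRU replacement); it captures only which reads hit the small buffer, not the one-cycle versus multi-cycle latency difference.

    class FetchCache:
        """Small read-only buffer in front of the data cache: each entry
        holds one wide line fetched from the primary cache."""
        def __init__(self, lines: int = 4, line_bytes: int = 16):
            self.lines, self.line_bytes = lines, line_bytes
            self.tags = []                     # LRU order, most recent last

        def read(self, addr: int) -> bool:
            tag = addr // self.line_bytes
            if tag in self.tags:               # spatial-locality hit
                self.tags.remove(tag)
                self.tags.append(tag)
                return True
            self.tags.append(tag)              # fill a line from the primary cache
            if len(self.tags) > self.lines:
                self.tags.pop(0)               # evict the least recently used line
            return False

    fc = FetchCache()
    hits = sum(fc.read(a) for a in range(256))  # purely sequential byte reads
    print(hits / 256)                           # 0.9375: one miss per 16-byte line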
CSL-TR-93-561

Report Number: CSL-TR-93-562
Institution: Stanford University, Computer Systems Laboratory
Title: An Efficient Top-Down Parsing Algorithm for General Context-Free Grammars
Author: Sankar, Sriram
Date: February 1993
Abstract: This report describes a new algorithm for top-down parsing of general context-free grammars. The algorithm does not require any changes to be made to the grammar, and can parse with respect to any grammar non-terminal as the start symbol. It is possible to generate all possible parse trees of the input string in the presence of ambiguous grammars. The algorithm reduces to recursive descent parsing on LL grammars. This algorithm is ideal for use in software development environments which include tools such as syntax-directed editors and incremental parsers, where the language syntax is an integral part of the user-interface. General context-free grammars can describe the language syntax more intuitively than, for example, LALR(1) grammars. This algorithm is also applicable to batch-oriented language processors, especially during the development of new languages, where frequent changes are made to the language syntax and new prototype parsers need to be developed quickly. A prototype implementation of a parser generator that generates parsers based on this algorithm has been built. Parsing speeds of around 1000 lines per second have been achieved on a Sun SparcStation 2. This demonstrated performance is more than adequate for syntax-directed editors and incremental parsers, and in most cases, is perfectly acceptable for batch-oriented language processors.
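The sketch below illustrates only the flavor of top-down search over grammar alternatives; it is not the report's algorithm (which needs no grammar changes and handles any context-free grammar), and in particular this naive backtracking recognizer loops forever on left-recursive rules.

    def derives(grammar, symbol, text, pos=0):
        """Yield every end position at which `symbol` can derive text[pos:end].
        The grammar maps each nonterminal to a list of alternative right-hand
        sides; any other symbol is treated as a literal terminal string."""
        if symbol not in grammar:                       # terminal symbol
            if text.startswith(symbol, pos):
                yield pos + len(symbol)
            return
        for alternative in grammar[symbol]:             # try each production
            ends = {pos}
            for sym in alternative:                     # thread positions through
                ends = {e2 for e1 in ends for e2 in derives(grammar, sym, text, e1)}
            yield from ends                             # may repeat across alternatives

    G = {"S": [["a", "S"], ["a"]]}                      # right-recursive: S -> a S | a
    print(sorted(set(derives(G, "S", "aaa"))))          # [1, 2, 3]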
CSL-TR-93-562

Report Number: CSL-TR-93-564
Institution: Stanford University, Computer Systems Laboratory
Title: Case Study in Prototyping With Rapide: Shared Memory Multiprocessor System
Author: Santoro, Alexandre
Date: March 1993
Abstract: Rapide is a concurrent object-oriented language designed for prototyping distributed systems. This paper describes the creation of such a prototype, more specifically a shared memory multiprocessor system. The design is presented in an evolutionary manner, starting with a simple CPU + memory model. The paper also presents some simulation results and shows how the partially ordered event sets that Rapide produces can be used both for performance analysis and for an in-depth understanding of the model's behavior.
CSL-TR-93-564

Report Number: CSL-TR-93-566
Institution: Stanford University, Computer Systems Laboratory
Title: Software Testing Using Algebraic Specification Based Test Oracles
Author: Sankar, Sriram
Author: Goyal, Anoop
Author: Sikchi, Prakash
Date: April 1993
Abstract: In TAV4, the first author presented a paper describing an algorithm to perform run-time consistency checking of abstract data types specified using algebraic specifications. This algorithm has subsequently been incorporated into a run-time consistency checking tool for the Anna specification language for Ada, and works on a subset of all possible algebraic specifications. The algorithm implementation can be considered a test oracle for algebraic specifications that performs its activities while the formally specified program is running. This paper presents empirical results on the use of this test oracle on a real-life symbol table implementation. Various issues that arise due to the use of algebraic specifications and the test oracle are discussed. 50 different errors were introduced into the symbol table implementation. On testing using the oracle, 60% of the errors were detected by the oracle, 35% of the errors caused Ada exceptions to be raised, and the remaining 5% went undetected. These results are remarkable, especially since the test input was simply one sequence of symbol table operations performed by a typical client. The cases that went undetected contained errors that required very specific boundary conditions to be met --- an indication that white box test-data generation techniques may be required to detect them. Hence, a combination of white-box test-data generation and a specification based test oracle may be extremely versatile in detecting errors. This paper does not address test-data generation; rather, it illustrates the usefulness of algebraic specification based test oracles during run-time consistency checking. Run-time consistency checking should be considered a complementary approach to unit testing using generated test-data.
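As a hedged illustration of such a run-time test oracle (a toy, not the Anna tool), the check below validates one algebraic axiom of a mapping abstraction, lookup(insert(t, k, v), k) = v, plus non-interference with other keys, every time an insert is exercised; all names are hypothetical.

    import copy

    def checked_insert(table: dict, key, value) -> dict:
        """Perform an insert and act as a test oracle for two axioms of the
        mapping ADT: the new binding is visible, and old bindings survive."""
        before = copy.deepcopy(table)
        table[key] = value
        assert table[key] == value, "axiom lookup(insert(t,k,v),k) = v violated"
        for other, old in before.items():
            if other != key:
                assert table[other] == old, "insert disturbed an unrelated binding"
        return table

    symtab = checked_insert({}, "x", 1)
    checked_insert(symtab, "y", 2)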
CSL-TR-93-566

Report Number: CSL-TR-93-570
Institution: Stanford University, Computer Systems Laboratory
Title: Frequency Domain Volume Rendering
Author: Totsuka, Takashi
Author: Levoy, Marc
Date: April 1993
Abstract: The Fourier projection-slice theorem allows projections of volume data to be generated in O(n^2 log n) time for a volume of size n^3. The method operates by extracting and inverse Fourier transforming 2D slices from a 3D frequency domain representation of the volume. Unfortunately, these projections do not exhibit the occlusion that is characteristic of conventional volume renderings. We present a new frequency domain volume rendering algorithm that replaces much of the missing depth and shape cues by performing shading calculations in the frequency domain during slice extraction. In particular, we demonstrate frequency domain methods for computing linear or nonlinear depth cueing and directional diffuse reflection. The resulting images can be generated an order of magnitude faster than volume renderings and may be more useful for many applications.
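The projection-slice theorem underlying the method is easy to verify in 2D with NumPy: the row through the origin of an image's 2D FFT, inverse transformed, equals the image's projection along one axis. The frequency-domain shading and oblique-slice resampling that constitute the paper's contribution are omitted here.

    import numpy as np

    def projection_via_slice(image):
        """Projection-slice check in 2D: extract the frequency-domain slice
        through the origin and inverse transform it."""
        F = np.fft.fft2(image)
        return np.fft.ifft(F[0, :]).real   # slice along the zero-frequency row

    img = np.random.rand(8, 8)
    assert np.allclose(img.sum(axis=0), projection_via_slice(img))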
CSL-TR-93-570

Report Number: CSL-TR-93-573
Institution: Stanford University, Computer Systems Laboratory
Title: Performance of a Three-Stage Banyan-Based Architecture with Input and Output Buffers for Large Fast Packet Switches
Author: Chiussi, Fabio M.
Author: Tobagi, Fouad A.
Date: June 1993
Abstract: Fast packet switching, also referred to as Asynchronous Transfer Mode (ATM), has emerged as the most appropriate switching technique to handle the high data rates and the wide diversity of traffic requirements envisioned in Broadband Integrated Services Digital Networks (B-ISDN). ATM switches capable of meeting the challenges posed by a successful deployment of B-ISDN must be designed and implemented. Such switches should be nonblocking and capable of handling the highly-bursty traffic conditions that future anticipated applications will generate; they should be scalable to the large sizes expected when B-ISDN becomes widely deployed; accordingly, their complexity should be as low as possible; they should be simple to operate; namely, their architecture should facilitate the determination of whether or not a call can be accepted, and the assignment of a route to a call. In this paper, we describe an architecture, referred to as the Memory/Space/Memory (MSM) switching fabric, which meets these challenges. It combines input and output shared-memory buffer components with space-division banyan networks, making it possible to build a switch with several hundred I/O ports. The MSM achieves output buffering, thus performing very well under a wide variety of traffic conditions, and is self-routing, thus adapting easily to different traffic mixes. Under bursty traffic, by implementing a backpressure mechanism to control the packet flow from input to output queues, and by properly managing the buffers, we can increase the average buffer occupancy; in this way, we can achieve important reductions in total buffer requirements with respect to output-buffer switches (e.g., up to 70% reduction with bursts of average length equal to 100 packets), use input and output buffers of equal sizes, and achieve sublinear increase of the buffer requirements with the burst length.
CSL-TR-93-573

Report Number: CSL-TR-93-577
Institution: Stanford University, Computer Systems Laboratory
Title: Implementation of a Three-Stage Banyan-Based Architecture with Input and Output Buffers for Large Fast Packet Switches
Author: Chiussi, Fabio M.
Author: Tobagi, Fouad A.
Date: June 1993
Abstract: Fast packet switching, also referred to as Asynchronous Transfer Mode (ATM), has emerged as the most appropriate switching technique for future Broadband Integrated Services Digital Networks (B-ISDN). A three-stage banyan-based switch architecture with input and output buffers has been recently described [Chi93]. Such architecture, also referred to as the Memory/Space/Memory (MSM) switching fabric, is capable of meeting the challenges posed by a successful deployment of B-ISDN; namely, it is made nonblocking with low complexity, and is scalable to large sizes (>1000 input/output ports); it supports a wide diversity of traffic patterns, including highly-bursty traffic; it maintains packet sequence, is self-routing, and is simple to operate.
CSL-TR-93-577

Report Number: CSL-TR-93-579
Institution: Stanford University, Computer Systems Laboratory
Title: Comparative Studies of Pipelined Circuits
Author: Klass, Fabian
Author: Flynn, Michael J.
Date: July 1993
Abstract: Wave pipelining is an attractive technique used in high-speed computer systems to increase the pipeline rate without further partitioning a system into pipeline stages. Although recent implementations have reported very high-speed operation rates, a real evaluation of the advantages and disadvantages of wave pipelining requires a comparative study with other techniques; in particular, understanding the trade-offs between conventional and wave pipelining is very important. This study is an attempt to provide approximate models which can be used as first-order tools for comparative study or sensitivity analysis of conventional and wave pipelined systems with different overheads. The models presented here are for subsystem-level pipelines. The product Latency x Cycle-Time is used as a measure of performance and is evaluated as a function of all the parameters of a design, such as the propagation delay of the combinational logic, the data skew resulting from the difference between maximum and minimum propagation delays through various logic paths, rise and fall times, the setup time, hold time, and propagation delay through registers, and the uncontrollable clock skew. In this way, an analytical basis is provided for a comparison between different approaches and for optimizations.
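A first-order instance of the kind of model described above, under simplified assumptions (the parameter names and the formula are illustrative, not the report's exact model): the clock period of a wave-pipelined stage is bounded below by the path-delay spread plus register and skew overheads, while the maximum logic delay fixes how many waves are in flight.

    def min_cycle_time(d_max, d_min, t_setup, t_hold, clk_skew):
        """Lower bound on the clock period of a wave-pipelined stage: adjacent
        waves must not collide, so the period must cover the data skew
        (d_max - d_min) plus setup/hold times and uncontrolled clock skew."""
        return (d_max - d_min) + t_setup + t_hold + 2 * clk_skew

    # e.g., 10 ns max / 8 ns min logic delay, 0.3 ns setup and hold, 0.2 ns skew:
    t = min_cycle_time(10.0, 8.0, 0.3, 0.3, 0.2)
    print(t, 10.0 / t)   # ~3.0 ns period, ~3.3 waves simultaneously in the logic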
CSL-TR-93-579

Report Number: CSL-TR-93-580
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Technology Mapping for Generalized Fundamental-Mode Asynchronous Designs
Author: Siegel, Polly
Author: DeMicheli, Giovanni
Author: Dill, David
Date: June 1993
Abstract: The generalized fundamental-mode asynchronous design style is one in which the combinational portions of the circuit design are separated from the storage elements, as with synchronous design styles. Synchronous technology mapping techniques can be adapted to work for this asynchronous design style if hazards are taken into account. First, we examine each step of algorithmic technology mapping for its influence on the hazard behavior of the modified network. We then present modifications to an existing synchronous technology mapper to work for this asynchronous design style. We present efficient algorithms for hazard analysis that are used during the mapping process. These algorithms have been implemented and incorporated into the program CERES to produce a technology mapper suitable for asynchronous designs.
CSL-TR-93-580

Report Number: CSL-TR-93-584
Institution: Stanford University, Computer Systems Laboratory
Title: Optimization of Combinational Logic Circuits Based on Compatible Gates
Author: Damiani, Maurizio
Author: Yang, Jerry Chih-Yuan
Author: DeMicheli, Giovanni
Date: June 1993
Abstract: This paper presents a set of new techniques for the optimization of multiple-level combinational Boolean networks. We describe first a technique based upon the selection of appropriate "multiple-output" subnetworks (consisting of so-called "compatible gates") whose local functions can be optimized simultaneously. We then generalize the method to larger and more arbitrary subsets of gates. Because simultaneous optimization of local functions can take place, our methods are more powerful and general than Boolean optimization methods using "don't cares", where only single-gate optimization can be performed. In addition, our methods represent a more efficient alternative to optimization procedures based on Boolean relations because the problem can be modeled by a "unate" covering problem instead of the more difficult "binate" covering problem. The method is implemented in program ACHILLES and compares favorably to SIS.
CSL-TR-93-584

Report Number: CSL-TR-93-585
Institution: Stanford University, Computer Systems Laboratory
Title: A Rapide-1.0 Definition of the ADAGE Avionics System
Author: Mann, Walter
Author: Belz, Frank C.
Author: Cornell, Paul
Date: September 1993
Abstract: We have used the Rapide prototyping languages, developed by Stanford and TRW under the ARPA ProtoTech Program, in a series of exercises to model an early version of IBM's ADAGE software architecture for helicopter avionics systems. These exercises, conducted under the ARPA Domain Specific Software Architectures (DSSA) Program, also assisted the evolution of the Rapide languages. The resulting Rapide-1.0 model of the ADAGE architecture in this paper is substantially more succinct and illuminating than the original models, developed in Rapide-0.2 and Preliminary Rapide-1.0. All Rapide versions include these key features: interfaces, by which types of components and their possible interactions with other components are defined; actions, by which the events that can be observed or generated by such components are defined; and pattern-based constraints, which define properties of the computation of interacting components in terms of partially ordered sets of events. Key features of Rapide-1.0 include services, which abstract whole communication patterns between components; behavior rules, which provide a state-transition oriented specification of component behavior and from which computation component instances can be synthesized; and architectures, which describe implementations of components with a particular interface, by showing a composition of subordinate components and their interconnections. The Rapide-1.0 model is illustrated with corresponding diagrammatic representations.
CSL-TR-93-585

Report Number: CSL-TR-93-588
Institution: Stanford University, Computer Systems Laboratory
Title: Update-Based Cache Coherence Protocols for Scalable Shared-Memory Multiprocessors
Author: Glasco, David B.
Author: Delagi, Bruce A.
Author: Flynn, Michael J.
Date: November 1993
Abstract: In this paper, two hardware-controlled update-based cache coherence protocols are presented. The paper discusses the two major disadvantages of the update protocols: inefficiency of updates and the mismatch between the granularity of synchronization and the data transfer. The paper presents two enhancements to the update-based protocols, a write combining scheme and a finer grain synchronization, to overcome these disadvantages. The results demonstrate the effectiveness of these enhancements that, when used together, allow the update-based protocols to significantly improve the execution time of a set of scientific applications when compared to three invalidate-based protocols.
CSL-TR-93-588

Report Number: CSL-TR-93-590
Institution: Stanford University, Computer Systems Laboratory
Title: The Effect of Fault Dropping on Fault Simulation Time
Author: Pan, Rong
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: November 1993
Abstract: The effect of fault dropping on fault simulation time is studied in this paper. An experiment was performed in which fault simulation times, with and without fault dropping, were measured for three different simulators. A speedup of approximately 8 to 50 for random test sets and of 1.5 to 9 for deterministic test sets was observed. The results give some indication of how much fault dropping speeds up fault simulation. They also show the overhead incurred by an application requiring a complete fault dictionary.
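A toy harness in the spirit of the experiment: with fault dropping, a fault detected by an early test is removed from all later evaluations, so the saved work grows with how quickly the test set detects faults. The harness and the stuck-at AND-gate example are hypothetical stand-ins for a real fault simulator.

    def fault_simulate(outputs, faults, tests, drop=True):
        """Count per-fault test evaluations with and without fault dropping."""
        remaining, detected, work = set(faults), set(), 0
        for t in tests:
            for f in list(remaining if drop else faults):
                work += 1
                if outputs(t, f) != outputs(t, None):   # fault visible at output?
                    detected.add(f)
                    remaining.discard(f)
        return detected, work

    def and_gate(test, fault):
        a, b = test
        if fault == "a/0": a = 0                        # stuck-at-0 faults
        if fault == "b/0": b = 0
        return a & b

    tests = [(1, 1), (1, 0), (0, 1)]
    print(fault_simulate(and_gate, ["a/0", "b/0"], tests, drop=True)[1])   # 2
    print(fault_simulate(and_gate, ["a/0", "b/0"], tests, drop=False)[1])  # 6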
CSL-TR-93-590

Report Number: CSL-TR-93-591
Institution: Stanford University, Computer Systems Laboratory
Title: Logic Synthesis for Concurrent Error Detection
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: November 1993
Abstract: The structure of a circuit determines how the effects of a fault can propagate and hence affects the cost of concurrent error detection. By considering circuit structure during logic optimization, the overall cost of a concurrently checked circuit can be minimized. This report presents a new technique called structure-constrained logic optimization (SCLO) that optimizes a circuit under the constraint that faults in the resulting circuit can produce only a prescribed set of errors. Using SCLO, circuits can be optimized for various concurrent error detection schemes allowing the overall cost for each scheme to be compared. A technique for quickly estimating the size of a circuit under different structural constraints is described. This technique enables rapid exploration of the design space for concurrently checked circuits. A new method for the automated synthesis of self-checking circuit implementations for arbitrary combinational circuits is also presented. It consists of an algorithm that determines the best parity-check code for encoding the output of a given circuit, and then uses SCLO to produce the functional circuit which is augmented with a checker to form a self-checking circuit. This synthesis method provides fully automated design, explores a larger design space than other methods, and uses simple checkers. It has been implemented by making modifications to SIS (an updated version of MIS [Brayton 87a]), and results for several MCNC combinational benchmark circuits are given. In most cases, a substantial reduction in overhead compared to a duplicate-and-compare implementation is achieved.
CSL-TR-93-591

Report Number: CSL-TR-93-593
Institution: Stanford University, Computer Systems Laboratory
Title: The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors
Author: Woo, Steven Cameron
Author: Singh, Jaswinder Pal
Author: Hennessy, John L.
Date: November 1993
Abstract: We examine the performance benefits of integrating a mechanism for block data transfer (message passing) in a cache-coherent shared address space multiprocessor. We do this through a detailed study of five important computations that appear to be likely candidates for block transfer. We find that while the benefits on a realistic architecture are significant in some cases, they are not as substantial as one might initially expect. The main reasons for this are (i) the relatively modest fraction of time that applications spend in communication that is amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains even after block transfer, and (iii) the fact that long cache lines often capture many of the benefits of block transfer. Of the three primary advantages of block transfer, fast pipelined data transfer appears to be the most successful, followed by the ability to overlap computation and communication at a coarse granularity, and finally the benefits of replicating communicated data in main memory. We also examine the impact of varying important network parameters and processor speed on the relative effectiveness of block transfer, and comment on useful features that a block transfer engine should support for real applications.
CSL-TR-93-593

Report Number: CSL-TR-93-596
Institution: Stanford University, Computer Systems Laboratory
Title: Models of Communication Latency in Shared Memory Multiprocessors
Author: Byrd, Gregory T.
Date: December 1993
Abstract: We evaluate various mechanisms for data communication in large-scale shared memory multiprocessors. Data communication involves both data transmission and synchronization, resulting in the transfer of data between computational threads. We use simple analytical models to evaluate the communication latency for each of the mechanisms. The models show that efficient and opportunistic synchronization is the most important determinant of latency, followed by efficient transmission. Producer-initiated mechanisms, in which data is sent by its producer as it is produced, generally achieve lower latencies than consumer-initiated mechanisms, in which data is retrieved as and when it is needed.
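An illustrative back-of-the-envelope contrast (not the report's actual models): consumer-initiated transfer pays a request/response round trip at the time of use, while producer-initiated transfer can hide the transmission behind whatever slack exists between production and consumption.

    def consumer_initiated(net_delay, transfer):
        """Pull model: the request travels out, the data travels back."""
        return 2 * net_delay + transfer

    def producer_initiated(net_delay, transfer, slack):
        """Push model: data sent `slack` cycles before the consumer needs it."""
        return max(0, net_delay + transfer - slack)

    # e.g., 50-cycle network, 20-cycle transfer, data produced 60 cycles early:
    print(consumer_initiated(50, 20))       # 120 cycles stalled
    print(producer_initiated(50, 20, 60))   # 10 cycles stalled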
CSL-TR-93-596

Report Number: CSL-TR-94-599
Institution: Stanford University, Department of Computer Science
Title: The Design and Implementation of a High-Performance Floating-Point Divider
Author: Oberman, Stuart
Author: Quach, Nhon
Author: Flynn, Michael J.
Date: January 1994
Abstract: The increasing computation requirements of modern computer applications have stimulated a large interest in developing extremely high-performance floating-point dividers. A variety of division algorithms are available, with SRT being utilized in many computer systems. A careful analysis of SRT divider topologies has demonstrated that a relatively simple divider designed in an aggressive circuit style can achieve extremely high performance. Further, an aggressive circuit implementation can minimize many of the performance advantages of more complex divider algorithms. This paper presents the tradeoffs of the different divider topologies, the design of the divider, and performance results.
CSL-TR-94-599

Report Number: CSL-TR-94-600
Institution: Stanford University, Computer Systems Laboratory
Title: Environmental Limits on the Performance of CMOS Wave-Pipelined Circuits
Author: Nowka, Kevin J.
Author: Flynn, Michael J.
Date: January 1994
Abstract: Wave-pipelining is a circuit design technique which allows digital synchronous systems to be clocked at rates higher than can be achieved with conventional pipelining techniques. Wave-pipelining has been successfully applied to the design of SSI processor functional units, a Bipolar Population Counter, a CMOS adder, CMOS multipliers, and several simple CMOS circuits. For controlled operating environments, speed-ups of 2 to 10 have been reported for these designs. This report details the effects of temperature variation, supply voltage variation, and process variation on wave-pipelined static CMOS designs, derives limits for the performance of wave-pipelined circuits due to these variations, and compares the performance effects with those of traditional pipelined circuits. This study finds that wave-pipelined circuits designed for commercial operating environments are limited to 2 to 3 waves per pipeline stage when clocked from a fixed frequency source. Variable-rate, internal clocking can approach the theoretical limit on the number of waves at a cost of interface complexity.
CSL-TR-94-600

Report Number: CSL-TR-94-601
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient Scheduling on Multiprogrammed Shared-Memory Multiprocessors
Author: Tucker, Andrew
Date: March 1994
Abstract: Shared-memory multiprocessors are often used as compute servers, with multiple users running applications in a multiprogrammed style. On such systems, naive time-sharing scheduling policies can result in poor performance for parallel applications. Most parallel applications are written with the model of a stable computing environment, where applications are running uninterrupted on a fixed number of processors. On a time-sharing system, processes are interrupted periodically and the number of processors running an application continually varies. The result is a decrease in performance for a number of reasons, including processes being obliviously preempted inside critical sections and cached data being replaced by intervening processes. This thesis explores using more sophisticated scheduling systems to avoid these problems. Robust implementations of previously proposed approaches involving cache affinity scheduling and gang scheduling are developed and evaluated. It then presents the design, implementation, and performance of process control, a novel scheduling approach using explicit cooperation between the application and kernel to minimize context switching. Performance results from a suite of workloads containing both serial and parallel applications, run on a 4-processor Silicon Graphics workstation, confirm the effectiveness of the process control approach.
CSL-TR-94-601

Report Number: CSL-TR-94-602
Institution: Stanford University, Computer Systems Laboratory
Title: Analyzing and Tuning Memory Performance in Sequential and Parallel Programs
Author: Martonosi, Margaret Rose
Date: January 1994
Abstract: Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. When cache misses are common, memory stalls can significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and effectively organize the data collected. These techniques help guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system. This thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data oriented statistics, a novel approach that associates program performance information with application data structures. Data oriented statistics, viewed alone or paired with traditional code oriented statistics, offer a powerful, new dimension for performance analysis. I develop techniques for aggregating statistics on similarly-used data structures and for extracting intuitive source-code names for statistics. The thesis also argues that MemSpy's detailed statistics on the frequency and causes of cache misses are crucial in understanding memory bottlenecks. Common memory performance bugs are often most easily distinguished by noting the causes of their resulting cache misses. Since collecting such detailed information seems, at first glance, to require large execution time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, experience using MemSpy to tune several applications demonstrates that it generates effective memory performance profiles, at speeds competitive with previous, less detailed approaches.
CSL-TR-94-602

Report Number: CSL-TR-94-604
Institution: Stanford University, Department of Computer Science
Title: Integrating multiple communication paradigms in high performance multiprocessors
Author: Heinlein, John
Author: Gharachorloo, Kourosh
Author: Gupta, Anoop
Date: February 1994
Abstract: In the design of FLASH, the successor to the Stanford DASH multiprocessor, we are exploring architectural mechanisms for efficiently supporting both the shared memory and message passing communication models in a single system. The unique feature in the FLASH (FLexible Architecture for SHared memory) system is the use of a programmable controller at each node that replaces the functionality of hardwired cache coherence state machines in systems like DASH. The base coherence protocol is supported by executing appropriate software handlers on the programmable controller to service memory and coherence operations. The same programmable controller is also used to support message passing. This approach is attractive because of the flexibility software provides for implementing different coherence and message passing protocols, and because of the simplification in system design and debugging that arises from the shift of complexity from hardware to software. This paper focuses on the use of the programmable controller to support message passing. Our goal is to provide message passing performance that is comparable to an aggressive hardware implementation dedicated to this task. In FLASH, message data is transferred as a sequence of cache line sized units, thus exploiting the datapath support already present for cache coherence. In addition, we avoid costly interrupts to the main processor by having the programmable engine handle the control for message transfers. Furthermore, in contrast to most earlier work, we provide an integrated solution that handles the interaction of message data with virtual memory, protected multiprogramming, and cache coherence. Our preliminary performance studies indicate that this system can sustain message transfers at a rate of several hundred megabytes per second, efficiently utilizing the available network bandwidth.
CSL-TR-94-604

Report Number: CSL-TR-94-605
Institution: Stanford University, Computer Systems Laboratory
Title: Performance and Area Analysis of Processor Configurations with Scaling of Technology
Author: Fu, Steve
Author: Flynn, Michael J.
Date: March 1994
Abstract: The increasing density of transistors on integrated circuits and the increasing sensitivity toward costs have stimulated interest in developing techniques for relating transistor count to performance. This paper maps different processor configurations to transistor-level area models and proposes an optimum evolution path for processor design as the minimum feature size of the technology is scaled. A parameter for measuring incremental performance improvement with respect to increasing transistor count is proposed.
CSL-TR-94-605

Report Number: CSL-TR-94-607
Institution: Stanford University, Computer Systems Laboratory
Title: Spreadsheets for Images
Author: Levoy, Marc
Date: February 1994
Abstract: We describe a data visualization system based on spreadsheets. Cells in our spreadsheet contain graphical objects such as images, volumes, or movies. Cells may also contain graphical widgets such as buttons, sliders, or movie viewers. Objects are displayed in miniature inside each cell. Formulas for cells are written in a programming language that includes operators for array manipulation, image processing, and rendering. Formulas may also contain control structures, procedure calls, and assignment operators with side effects. Compared to flow chart visualization systems, spreadsheets are more expressive, more scalable, and easier to program. Compared to numerical spreadsheets, spreadsheets for images pose several unique design problems: larger formulas, longer computation times, and more complicated intercell dependencies. We describe an implementation based on the Tcl programming language and the Tk widget set, and we discuss our solutions to these design problems. We also point out some unexpected uses for our spreadsheets: as a visual database browser, as a graphical user interface builder, as a smart clipboard for the desktop, and as a presentation tool.
CSL-TR-94-607

Report Number: CSL-TR-94-613
Institution: Stanford University, Department of Computer Science
Title: Design and Validation of Update-Based Cache Coherence Protocols
Author: Glasco, David B.
Author: Delagi, Bruce A.
Author: Flynn, Michael J.
Date: March 1994
Abstract: In this paper, we present the details of the two update-based cache coherence protocols for scalable shared-memory multiprocessors that were studied in our previous work. First, the directory structures required for the protocols are briefly reviewed. Next, the state diagrams and some examples of the two update-based protocols are presented; one of the protocols is based on a centralized directory, and the other is based on a singly-linked distributed directory. Protocol deadlock and the additional requirements placed on the protocols to avoid such deadlock are also examined. Finally, protocol validation using an exhaustive validation tool known as Murphi is discussed.
CSL-TR-94-613

Report Number: CSL-TR-94-614
Institution: Stanford University, Computer Systems Laboratory
Title: Co-Synthesis of Hardware and Software for Digital Embedded Systems
Author: Gupta, Rajesh Kumar
Date: December 1993
Abstract: As the complexity of systems being subject to computer-aided synthesis and optimization techniques increases, so does the need to find ways to incorporate predesigned components into the final system implementation. In this context, a general-purpose microprocessor provides a sophisticated low-cost component that can be tailored to realize most system functions through appropriate software. This approach is particularly useful in the design of embedded systems that have a relatively simple target architecture, when compared to general-purpose computing systems such as workstations. In embedded systems the processor is used as a resource dedicated to implement specific functions. However, the design issues in embedded systems are complicated since most of these systems operate in a time-constrained environment. Recent advances in chip-level synthesis have made it possible to synthesize application-specific circuits under strict timing constraints. This dissertation formulates the problem of computer-aided design of embedded systems using both application-specific as well as general-purpose reprogrammable components under timing constraints. Given a specification of system functionality and constraints in a hardware description language, we model the system as a set of bilogic flow graphs, and formulate the co-synthesis problem as a partitioning problem under constraints. Timing constraints are used to determine the parts of the system functionality that are delegated to application-specific hardware and the software that runs on the processor. The software component of such a 'mixed' system poses an interesting problem due to its interaction with concurrently operating hardware. We address this problem by generating software as a set of concurrent fixed-latency serialized operations called threads. The satisfaction of the imposed performance constraints is then ensured by exploiting concurrency between program threads, achieved by an inter-leaved execution on a single processor system. This co-synthesis of hardware and software from behavioral specifications makes it possible to build time-constrained embedded systems by using off-the-shelf parts and application-specific circuitry. Due to the reduction in size of application-specific hardware needed compared to an all-hardware solution, the needed hardware component can be easily mapped to semicustom VLSI such as gate arrays, thus shortening the design time. In addition, the ability to perform a detailed analysis of timing performance provides an opportunity to improve the system definition by creating better prototypes. The algorithms and techniques described have been implemented in a framework called Vulcan, which is integrated with the Stanford Olympus Synthesis System and provides a path from chip-level synthesis to system-level synthesis.
CSL-TR-94-614

Report Number: CSL-TR-94-616
Institution: Stanford University, Computer Systems Laboratory
Title: Reuse of High Precision Arithmetic Hardware to Perform Multiple Low Precision Calculations
Author: Zucker, Daniel
Author: Lee, Ruby
Date: April 1994
Abstract: Many increasingly important applications, such as video compression, graphics, or multimedia, require only low-precision arithmetic. However, because the widespread adoption of the IEEE floating point standard has led to the ubiquity of IEEE double precision hardware, this double precision hardware is frequently used to do the low precision calculations. Naturally, it seems an inefficient use of resources to use 54 bits of hardware to perform an 8 or 12 bit calculation. This paper presents a method for packing operands to perform multiple low precision arithmetic operations using regular high precision hardware. Using only source level software modification, a speedup of 15% is illustrated for the Discrete Cosine Transform. Since no machine-specific optimizations are required, this method will work on any machine that supports IEEE arithmetic. Finally, an analysis of speedup and suggestions for future work are presented.
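A hedged sketch of the packing idea using integer values held exactly in doubles: two narrow operands share one floating-point word, a single double-precision multiply produces both products, and unpacking separates them, provided every field (including the products) stays within the 53-bit significand. The field width and names are illustrative.

    def packed_multiply(a0: int, a1: int, c: int, field_bits: int = 16):
        """One double-precision multiply computing two low-precision products.
        Requires a0*c < 2**field_bits and the packed result < 2**53."""
        K = 1 << field_bits
        packed = float(a1 * K + a0)    # two operands in one FP word
        prod = int(packed * c)         # a1*c lands in the high field, a0*c in the low
        return prod % K, prod // K

    assert packed_multiply(7, 11, 13) == (91, 143)   # 7*13 and 11*13 at once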
CSL-TR-94-616

Report Number: CSL-TR-94-617
Institution: Stanford University, Computer Systems Laboratory
Title: Fast Multiplication: Algorithms and Implementations
Author: Bewick, Gary W.
Date: April 1994
Abstract: This thesis investigates methods of implementing binary multiplication with the smallest possible latency. The principal area of concentration is on multipliers with lengths of 53 bits, which makes the results suitable for IEEE-754 double precision multiplication. Low latency demands high performance circuitry, and small physical size to limit propagation delays. VLSI implementations are the only available means for meeting these two requirements, but efficient algorithms are also crucial. An extension to Booth's algorithm for multiplication (redundant Booth) has been developed, which represents partial products in a partially redundant form. This redundant representation can reduce or eliminate the time required to produce "hard" multiples (multiples that require a carry propagate addition) required by the traditional higher order Booth algorithms. This extension reduces the area and power requirements of fully parallel implementations, but is also as fast as any multiplication method yet reported. In order to evaluate various multiplication algorithms, a software tool has been developed which automates the layout and optimization of parallel multiplier trees. The tool takes into consideration wire and asymmetric input delays, as well as gate delays, as the tree is built. The tool is used to design multipliers based upon various algorithms, using Booth encoded, non-Booth encoded, and the new extended Booth algorithms. The designs are then compared on the basis of delay, power, and area. For maximum speed, the designs are based upon a 0.6 micron BiCMOS process using emitter coupled logic (ECL). The algorithms developed in this thesis make possible 53x53 multipliers with a latency of less than 2.6 nanoseconds at 10.5 Watts and a layout area of 13 mm^2. Smaller and lower power designs are also possible, as illustrated by an example with a latency of 3.6 nanoseconds at 5.8 W, and an area of 8.9 mm^2. The conclusions based upon ECL designs are extended where possible to other technologies (CMOS). Crucial to the performance of multipliers are high speed carry propagate adders. A number of high speed adder designs have been developed, and the algorithms and design of these adders are discussed. The implementations developed for this study indicate that traditional Booth encoded multipliers are superior in layout area, power, and delay to non-Booth encoded multipliers. Redundant Booth encoding further reduces the area and power requirements. Finally, only half of the total multiplier delay was found to be due to the summation of the partial products. The remaining delay was due to wires and carry propagate adder delays.
CSL-TR-94-617

Report Number: CSL-TR-94-618
Institution: Stanford University, Computer Systems Laboratory
Title: Optimum Routing of Multicast Audio and Video Streams in Communications Networks
Author: Noronha, Ciro A., Jr.
Author: Tobagi, Fouad A.
Date: April 1994
Abstract: In this report, we consider the problem of routing multicast audio and video streams in a communications network. After describing the previous work in the area and identifying its shortcomings, we show that the problem of optimally routing multicast streams can be formulated as an integer programming problem. We propose an efficient solution technique, composed of two parts: (i) an extension to the decomposition principle, to speed up the linear relaxation of the problem, and (ii) enhanced value-fixing rules, to prune the search space for the integer problem. We characterize the reduction in run time gained using these techniques. Finally, we compare the run times for the optimum multicast routing algorithm and for existing heuristic algorithms.
CSL-TR-94-618

Report Number: CSL-TR-94-619
Institution: Stanford University, Computer Systems Laboratory
Title: Evaluation of Multicast Routing Algorithms for Multimedia Streams
Author: Noronha, Ciro A., Jr.
Author: Tobagi, Fouad A.
Date: April 1994
Abstract: Multimedia applications place new requirements on networks as compared to traditional data applications: (i) they require relatively high bandwidths on a continuous basis for long periods of time; (ii) involve multipoint communications and thus are expected to make heavy use of multicasting; and (iii) tend to be interactive and thus require low latency. These requirements must be taken into account when routing multimedia traffic in a network. This report presents a performance evaluation of routing algorithms in the multimedia environment, where the requirements of multipoint communications, bandwidth and latency must be satisfied. We present an exact solution to the optimum multicast routing problem, based on integer programming, and use this solution as a benchmark to evaluate existing heuristic algorithms, considering both performance and cost of implementation (as measured by the average run time), under realistic network and traffic scenarios.
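For concreteness, here is one representative of the heuristic class that such a benchmark is typically compared against: a shortest-path tree-growing heuristic in the style of Takahashi and Matsuyama. It is an assumption for illustration, not necessarily one of the algorithms the report evaluates, and it assumes a connected graph.

    import heapq

    def dijkstra(adj, src):
        """Shortest-path costs and predecessors from src; adj: {u: {v: cost}}."""
        dist, prev, pq = {src: 0}, {}, [(0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(pq, (d + w, v))
        return dist, prev

    def multicast_tree(adj, source, receivers):
        """Grow a tree from the source by repeatedly splicing in the receiver
        currently cheapest to reach from any node already in the tree."""
        tree, edges, todo = {source}, set(), set(receivers) - {source}
        while todo:
            best = None
            for root in tree:
                dist, prev = dijkstra(adj, root)
                for r in todo:
                    if r in dist and (best is None or dist[r] < best[0]):
                        best = (dist[r], r, prev)
            _, r, prev = best
            while r not in tree:               # walk the path back into the tree
                edges.add((prev[r], r))
                tree.add(r)
                r = prev[r]
            todo -= tree
        return edges

    adj = {"s": {"a": 1, "b": 4}, "a": {"s": 1, "b": 1, "r1": 3},
           "b": {"s": 4, "a": 1, "r2": 1}, "r1": {"a": 3}, "r2": {"b": 1}}
    print(multicast_tree(adj, "s", {"r1", "r2"}))
    # tree edges, e.g. {('s', 'a'), ('a', 'b'), ('b', 'r2'), ('a', 'r1')}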
CSL-TR-94-619

Report Number: CSL-TR-94-620
Institution: Stanford University, Computer Systems Laboratory
Title: The SUIF Compiler System: a Parallelizing and Optimizing Research Compiler
Author: Wilson, Robert
Author: French, Robert
Author: Wilson, Christopher
Author: Amarasinghe, Saman
Author: Anderson, Jennifer
Author: Tjiang, Steve
Author: Liao, Shih-Wei
Author: Tseng, Chau-Wen
Author: Hall, Mary
Author: Lam, Monica
Author: Hennessy, John
Date: May 1994
Abstract: Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts. SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use. Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. The SUIF software is freely available via anonymous ftp from suif.Stanford.EDU. Additional information about SUIF can be found on the World-Wide Web at http://suif.Stanford.EDU.
CSL-TR-94-620

Report Number: CSL-TR-94-621
Institution: Stanford University, Computer Systems Laboratory
Title: Synthesis for Scan Dependence in Built-In Self-Testable Designs
Author: Avra, LaNae J.
Author: McCluskey, Edward J.
Date: May 1994
Abstract: This report introduces new design and synthesis techniques that reduce the area and improve the performance of embedded built-in self-test (BIST) architectures such as circular BIST and parallel BIST. Our goal is to arrange the system bistables into scan paths so that some of the BIST and scan logic is shared with the system logic. Logic sharing is possible when scan dependence is introduced in the design. Other BIST design techniques attempt to avoid all types of scan dependence because it can reduce the fault coverage of embedded, multiple input signature registers (MISRs). We show that introducing certain types of scan dependence in embedded MISRs can result in reduced overhead and improved fault coverage, and we describe synthesis techniques that maximize the amount of this beneficial scan dependence. Finally, we present fault simulation, layout area, and delay results for circular BIST versions of benchmark circuits that have been synthesized with our techniques.
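
For readers unfamiliar with MISRs, a behavioral Python sketch of signature compaction follows; the tap set is a common maximal-length choice for a 16-bit register, not taken from the report.

    def misr_signature(responses, taps=(15, 14, 12, 3), width=16):
        # Multiple input signature register: each clock shifts the LFSR,
        # feeds the XOR of the tapped stages back in, and folds the parallel
        # test-response bits into the stages. Scan dependence matters because
        # correlated inputs can cancel fault effects in this XOR folding.
        state, mask = 0, (1 << width) - 1
        for r in responses:
            feedback = 0
            for t in taps:
                feedback ^= (state >> t) & 1
            state = (((state << 1) | feedback) & mask) ^ r
        return state
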
CSL-TR-94-621

Report Number: CSL-TR-94-622
Institution: Stanford University, Computer Systems Laboratory
Title: A Synthesis-for-Test Design System
Author: Avra, LaNae J.
Author: Gerbaux, Laurent
Author: Giomi, Jean-Charles
Author: Martinolle, Francoise
Author: McCluskey, Edward J.
Date: May 1994
Abstract: Hardware synthesis techniques automatically generate a structural hardware implementation given an abstract (e.g., functional, behavioral, register transfer) description of the behavior of the design. Existing hardware synthesis systems typically use cost and performance as the main criteria for selecting the best hardware implementation, and seldom even consider test issues during the synthesis process. We have developed and implemented a computer-aided design tool whose primary objective is to generate the lowest-cost, highest-performance hardware implementation that also meets specified testability requirements. By considering testability during the synthesis process, the tool is able to generate designs that are optimized for specific test techniques. The input to the tool is a behavioral VHDL specification that consists of high-level software language constructs such as conditional statements, assignment statements, and loops, and the output is a structural VHDL description of the design. Implemented synthesis procedures include compiler optimizations, inter-process analysis, high-level synthesis operations (scheduling, allocation, and binding) and control logic generation. The purpose of our design tool is to serve as a platform for experimentation with existing and future synthesis-for-test techniques, and it can currently generate designs optimized for both parallel and circular built-in self-test architectures.
CSL-TR-94-622

Report Number: CSL-TR-94-623
Institution: Stanford University, Computer Systems Laboratory
Title: Communication Mechanisms in Shared Memory Multiprocessors
Author: Byrd, Gregory T.
Author: Delagi, Bruce A.
Author: Flynn, Michael J.
Date: May 1994
Abstract: Shared memory systems generally support consumer-initiated communication; when a process needs data, it is retrieved from the global memory. Systems that were designed around the message passing model, on the other hand, support producer-initiated communication mechanisms; the producer of data sends it directly to the other processes that require it. Parallel applications require both kinds of communication. In this paper, we examine the performance of five shared-memory communication mechanisms -- invalidate-based cache coherence, prefetch, locks, deliver, and StreamLine -- to determine the effectiveness of architectural support for efficient producer-initiated communication. We find that StreamLine, a cache-based message passing mechanism, offers the best performance on our simulated benchmarks. In addition, StreamLine is much less sensitive to system parameters such as cache line size and network performance.
CSL-TR-94-623

Report Number: CSL-TR-94-624
Institution: Stanford University, Computer Systems Laboratory
Title: WSIM: A Symbolic Waveform Simulator
Author: Franco, Piero
Author: McCluskey, Edward J.
Date: June 1994
Abstract: A symbolic waveform simulator is proposed in this report. The delay of a faulty element is treated as a variable in the generation of the output waveform. Therefore, many timing simulations with different delay values do not have to be done to analyze the behavior of the circuit-under-test with the timing fault. The motivation for this work was to investigate delay testing by Output Waveform Analysis, where an accurate representation of the actual waveforms is required, although the simulator can be used for other applications as well (such as power analysis). Output Waveform Analysis will be briefly reviewed, followed by a description of both a simplified and a complete implementation of the waveform simulator, and simulation results.
CSL-TR-94-624

Report Number: CSL-TR-94-625
Institution: Stanford University, Computer Systems Laboratory
Title: An Experimental Chip to Evaluate Test Techniques Part 1: Description of Experiment
Author: Franco, Piero
Author: Stokes, Robert L.
Author: Farwell, William D.
Author: McCluskey, Edward J.
Date: June 1994
Abstract: A Test Chip has been designed and manufactured to evaluate different testing techniques for combinational or full-scan circuits. The Test Chip is a 25k gate CMOS gate-array using LSI Logic's LFT150K technology, and includes support (design for testability) circuitry and five types of circuits-under-test (CUT). Over 5,000 die have been manufactured. The five circuits-under-test include both datapath and synthesized control logic. The tests include design verification (simulation), exhaustive, pseudo-random, and deterministic vectors for various fault models (stuck-at, transition, delay faults, and IDDQ Testing). The chip will also be tested using the CrossCheck methodology, as well as other new techniques, including Stability Checking and Very-Low-Voltage Testing. The experiment includes an investigation of both serial and parallel signature analysis. This report describes the Test Evaluation Chip Experiment, including the design of the Test Chip and the tests applied. A future report will cover the experimental results and data analysis.
CSL-TR-94-625

Report Number: CSL-TR-94-626
Institution: Stanford University, Computer Systems Laboratory
Title: Synthesis and Optimization of Synchronous Logic Circuits
Author: Damiani, Maurizio
Date: June 1994
Abstract: The design automation of complex digital circuits offers important benefits. It allows the designer to reduce design time and errors, to explore the design space more thoroughly, and to cope effectively with an ever-increasing project complexity. This dissertation presents new algorithms for the logic optimization of combinational and synchronous digital circuits. These algorithms rely on a common paradigm. Namely, global optimization is achieved by the iterative local optimization of small subcircuits. The dissertation first explores the combinational case. Chapter 2 presents algorithms for the optimization of subnetworks consisting of a single-output subcircuit. The design space for this subcircuit is described implicitly by a Boolean function, a so-called don't care function. Efficient methods for extracting this function are presented. Chapter 3 is devoted to a novel method for the optimization of multiple-output subcircuits. There, we introduce the notion of compatible gates. Compatible gates represent subsets of gates whose optimization is particularly simple. The other three chapters are devoted to the optimization of synchronous circuits. Following the lines of the combinational case, we attempt the optimization of the gate-level (rather than the state-diagram-level) representation. In Chapter 4 we focus on extending combinational techniques to the sequential case. In particular, we present algorithms for finding a synchronous function that can be used in the optimization process. Unlike the combinational case, however, this approach is exact only for pipeline-like circuits. Exact approaches for general, acyclic circuits are presented in Chapter 5. There, we introduce the notion of a synchronous recurrence equation. Finally, Chapter 6 presents methods for handling feedback interconnection.
CSL-TR-94-626

Report Number: CSL-TR-94-627
Institution: Stanford University, Computer Systems Laboratory
Title: An Efficient Shared Memory Layer for Distributed Memory Machines.
Author: Scales, Daniel J.
Author: Lam, Monica S.
Date: July 1994
Abstract: This paper describes a system called SAM that simplifies the task of programming machines with distributed address spaces by providing a shared name space and dynamic caching of remotely accessed data. SAM makes it possible to utilize the computational power available in networks of workstations and distributed memory machines, while getting the ease of programming associated with a single address space model. The global name space and caching are especially important for complex scientific applications with irregular communication and parallelism. SAM is based on the principle of tying synchronization with data accesses. Precedence constraints are expressed by accesses to single-assignment values, and mutual exclusion constraints are represented by access to data items called accumulators. Programmers easily express the communication and synchronization between processes using these operations; they can also use alternate paradigms that are built with the SAM primitives. Operations for prefetching data and explicitly sending data to another processor integrate cleanly with SAM's shared memory model and allow the user to obtain the efficiency of message passing when necessary. We have built implementations of SAM for the CM-5, the Intel iPSC/860, the Intel Paragon, the IBM SP1, and heterogeneous networks of Sun, SGI, and DEC workstations (using PVM). In this report, we describe the basic functionality provided by SAM, discuss our experience in using it to program a variety of scientific applications and distributed data structures, and provide performance results for these complex applications on a range of machines. Our experience indicates that SAM significantly simplifies the programming of these parallel systems, supports the necessary functionality for developing efficient implementations of sophisticated applications, and provides portability across a range of distributed memory environments.
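
A minimal mock-up of the first of these primitives, a write-once single-assignment cell on which readers block, using Python threads; SAM's actual interfaces and implementation differ, so this is only a sketch of the idea.

    import threading

    class SingleAssignment:
        # Write-once cell: read() blocks until the single write() arrives,
        # so a precedence constraint is expressed purely by data access.
        def __init__(self):
            self._ready = threading.Event()
            self._value = None

        def write(self, value):
            assert not self._ready.is_set(), "single-assignment violated"
            self._value = value
            self._ready.set()

        def read(self):
            self._ready.wait()
            return self._value
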
CSL-TR-94-627

Report Number: CSL-TR-94-628
Institution: Stanford University, Computer Systems Laboratory
Title: Tolerating Latency Through Software-Controlled Data Prefetching
Author: Mowry, Todd C.
Date: June 1994
Abstract: The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, the technology trends indicate that this gap between processor and memory speeds is likely to increase in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions cause latency to be an inherent problem. To cope with the memory latency problem, the basic solution that nearly all computer systems rely on is their cache hierarchy. While caches are useful, they are not a panacea. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of either the programmer, runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits. This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two, both on uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
CSL-TR-94-628

Report Number: CSL-TR-94-629
Institution: Stanford University, Computer Systems Laboratory
Title: Precise Delay Generation Using Coupled Oscillators
Author: Maneatis, John George
Date: June 1994
Abstract: This thesis describes a new class of delay generation structures which can produce precise delays with sub-gate-delay resolution. These structures are based on coupled ring oscillators which oscillate at the same frequency. One such structure, called an array oscillator, consists of a linear array of ring oscillators. A unique coupling arrangement forces the outputs of the ring oscillators to be uniformly offset in phase by a precise fraction of a buffer delay. This arrangement enables the array oscillator to achieve a delay resolution equal to a buffer delay divided by the number of rings. Another structure, called a delay line oscillator, consists of a series of delay stages, each based on a single coupled ring oscillator. These delay stages uniformly span the delay interval to which they are phase locked. Each delay stage is capable of generating a phase shift that varies over a positive and negative range. These characteristics allow the structure to precisely subdivide delays into arbitrarily small intervals. The buffer stages used in the ring oscillators must have high supply noise rejection to avoid losing precision to output jitter. This thesis presents several types of buffer stage designs for achieving high supply noise rejection and low supply voltage operation. These include a differential buffer stage design based on a source coupled pair using load elements with symmetric I-V characteristics and a single-ended buffer stage design based on a diode clamped common source device. The thesis also discusses techniques for achieving low jitter phase-locked loop performance which is important to achieving high precision. Based on the concepts developed in this thesis, an experimental differential array oscillator delay generator was designed and fabricated in a 1.2-um N-well CMOS technology. The delay generator achieved a delay resolution of 43ps while operating at 331MHz with peak delay error of 47ps.
CSL-TR-94-629

Report Number: CSL-TR-94-630
Institution: Stanford University, Department of Computer Science
Title: Expansion Caches For Superscalar Processors
Author: Johnson, John D.
Date: June 1994
Abstract: Superscalar implementations present increased demands on instruction caches as well as instruction decoding and issuing mechanisms, leading to very complex hardware requirements. This work proposes utilizing an expanded instruction cache to reduce the complexity of the hardware required to implement a superscalar machine. Trace driven simulation is used for evaluating the presented Expanded Parallel Instruction Cache (EPIC) machine, and its performance is found to be comparable to a dynamically scheduled superscalar model.
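
The evaluation technique, trace-driven simulation, is simple to sketch; the direct-mapped model below is a generic illustration with parameters of our choosing, not the EPIC cache itself.

    def miss_ratio(trace, cache_bytes=8192, line_bytes=32):
        # Trace-driven simulation of a direct-mapped cache: replay the
        # recorded addresses through the cache model and count misses.
        n_lines = cache_bytes // line_bytes
        tags = [None] * n_lines
        misses = 0
        for addr in trace:
            line = addr // line_bytes
            index, tag = line % n_lines, line // n_lines
            if tags[index] != tag:
                misses += 1
                tags[index] = tag
        return misses / len(trace)
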
CSL-TR-94-630

Report Number: CSL-TR-94-631
Institution: Stanford University, Computer Systems Laboratory
Title: SimOS: A Fast Operating System Simulation Environment
Author: Rosenblum, Mendel
Author: Varadarajan, Mani
Date: July 1994
Abstract: In this paper we describe techniques for building a software development environment for operating system software. These techniques allow an operating system to be run at user-level on a general-purpose operating system such as System V R4 Unix. The approach used in this work is to simulate a machine's hardware using services provided by the underlying operating system. We describe how to simulate the CPU using the operating system's process abstraction, the memory management unit using file mapping operations, and the I/O devices using separate processes. The techniques we present allow the simulator to run with sufficient speed and detail that workloads that exercise bugs on the real machine can be transferred and run in near real-time on the simulated machine. The speed of the simulation depends on the quantity and the cost of the simulated operations. Real programs usually run in the simulated environment at between 50% and 100% of the speed of the underlying machine. The simulation detail we provide allows an operating system running in the simulated environment to be nearly indistinguishable from the real machine from a user perspective.
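
The file-mapping idea can be mimicked with Python's mmap module. This toy sketch of ours backs a simulated physical memory with a file; SimOS itself works directly against the host Unix services, so everything below (names, page count, word helpers) is illustrative only.

    import mmap, tempfile

    PAGE = 4096
    backing = tempfile.TemporaryFile()           # file stands in for physical memory
    backing.truncate(64 * PAGE)                  # 64 simulated physical pages
    phys = mmap.mmap(backing.fileno(), 64 * PAGE)

    def store_word(paddr, value):
        phys[paddr:paddr + 4] = value.to_bytes(4, "little")

    def load_word(paddr):
        return int.from_bytes(phys[paddr:paddr + 4], "little")

    store_word(0x2000, 0xDEADBEEF)
    assert load_word(0x2000) == 0xDEADBEEF
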
CSL-TR-94-631

Report Number: CSL-TR-94-632
Institution: Stanford University, Computer Systems Laboratory
Title: The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation
Author: Erlichson, Andrew
Author: Nayfeh, Basem A.
Author: Singh, Jaswinder Pal
Author: Olukotun, Kunle
Date: October 1994
Abstract: Clustering processors together at a level of the memory hierarchy in shared address space multiprocessors appears to be an attractive technique from several standpoints: Resources are shared, packaging technologies are exploited, and processors within a cluster can share data more effectively. We investigate the performance benefits that can be obtained by clustering on a range of important scientific and engineering applications. We find that in general clustering is not very effective in reducing the inherent communication to computation ratios. Clustering is more useful in reducing working set requirements in unstructured applications, and can improve performance substantially when small first level caches are clustered in these cases. This suggests that clustering at the first level cache might be useful in highly-integrated, relatively fine-grained environments. For less integrated machines such as current distributed shared memory multiprocessors, our results suggest that clustering is not very useful in improving application performance, and the decision about whether or not to cluster should be made on the basis of engineering and packaging constraints.
CSL-TR-94-632

Report Number: CSL-TR-94-633
Institution: Stanford University, Computer Systems Laboratory
Title: Synthesis Techniques for Built-In Self-Testable Designs
Author: Avra, LaNae Joy
Date: July 1994
Abstract: This technical report contains the text of LaNae Joy Avra's thesis "Synthesis Techniques for Built-In Self-Testable Designs."
CSL-TR-94-633

Report Number: CSL-TR-94-634
Institution: Stanford University, Computer Systems Laboratory
Title: Architectural and Implementation Tradeoffs for Multiple-Context Processors
Author: Laudon, James P.
Date: September 1994
Abstract: Tolerating memory latency is essential to achieving high performance in scalable shared-memory multiprocessors. In addition, tolerating instruction (pipeline dependency) latency is essential to maximize the performance of individual processors. Multiple-context processors have been proposed as a universal mechanism to mitigate the negative effects of latency. These processors tolerate latency by switching to a concurrent thread of execution whenever one of the threads blocks due to a high-latency operation. Multiple-context processors built so far, however, either have a high context-switch cost which disallows tolerance of short latencies (e.g., due to pipeline dependencies), or alternatively they require excessive concurrency from the software. We propose a multiple-context architecture that combines full single-thread support with cycle-by-cycle context interleaving to provide lower switch costs and the ability to tolerate short latencies. We compare the performance of our proposal with that of earlier approaches, showing that our approach offers substantially better performance for parallel applications. We also explore using our approach for uniprocessor workstations --- an important environment for commodity microprocessors. We show that our approach also offers much better performance for multiprogrammed uniprocessor workloads. Finally, we explore the implementation issues for both our proposed and existing multiple-context architectures. One of the larger costs for a multiple-context processor arises in providing a cache capable of handling multiple outstanding requests, and we propose a lockup-free cache which provides high performance at a reasonable cost. We also show that the amount of processor state that needs to be replicated to support multiple contexts is modest, and that the extra complexity required to control the multiple contexts under both our proposed and existing approaches is manageable. The performance benefits and reasonable implementation cost of our approach make it a promising candidate for addition to future microprocessors.
CSL-TR-94-634

Report Number: CSL-TR-94-635
Institution: Stanford University, Department of Computer Science
Title: A Performance/Area Workbench for Cache Memory Design
Author: Okuzawa, Osamu
Author: Flynn, Michael J.
Date: August 1994
Abstract: For high performance processor design, cache memory size is an important parameter which directly affects performance and chip area. Modeling performance and area is required for cache memory design tradeoffs. This paper describes a tool which calculates cache memory performance and area. A designer can try a variety of cache parameters to complete the specification of a cache memory. Data examples calculated using this tool are shown.
CSL-TR-94-635

Report Number: CSL-TR-94-636
Institution: Stanford University, Department of Computer Science
Title: Mable: A Technique for Efficient Machine Simulation
Author: Davies, Peter
Author: Lacroute, Philippe
Author: Heinlein, John
Author: Horowitz, Mark
Date: October 1994
Abstract: We present a framework for an efficient instruction-level machine simulator which can be used with existing software tools to develop and analyze programs for a proposed processor architecture. The simulator exploits similarities between the instruction sets of the emulated machine and the host machine to provide fast simulation. Furthermore, existing program development tools on the host machine such as debuggers and profilers can be used without modification on the emulated program running under the simulator. The simulator can therefore be used to debug and tune application code for the new processor without building a whole new set of program development tools. The technique has applicability to a diverse set of simulation problems. We show how the framework has been used to build simulators for a shared-memory multiprocessor, a superscalar processor with support for speculative execution, and a dual-issue embedded processor.
CSL-TR-94-636

Report Number: CSL-TR-94-637
Institution: Stanford University, Computer Systems Laboratory
Title: Testing Digital Circuits for Timing Failures by Output Waveform Analysis
Author: Franco, Piero
Date: September 1994
Abstract: Delay testing is done to ensure that a digital circuit functions at the designed speed. Delay testing is complicated by test invalidation and fault detection size. Furthermore, we show that simple delay models are not sufficient to provoke the longest delay through a circuit. Even if all paths are robustly tested, path delay testing cannot guarantee that the circuit functions at the desired speed. Output Waveform Analysis is a new approach for detecting timing failures in digital circuits. Unlike conventional testing where the circuit outputs are sampled, the waveform between samples is analyzed. The motivation is that delay changes affect the shape of the output waveform, and information can be extracted from the waveform to detect timing failures. This is especially useful as a Design-for-Testability technique for Built-In Self-Test or pseudo-random testing environments, where delay tests are difficult to apply and test invalidation is a problem. Stability Checking is a simple form of Output Waveform Analysis. In a fault-free circuit, the outputs are expected to have reached the desired logic values by the time they are sampled, so delay faults can be detected by observing the outputs for any changes after the sampling time. Apart from traditional delay testing, Stability Checking is also useful for on-line or concurrent testing under certain timing restrictions. A padding algorithm was implemented to show that circuits can be efficiently modified to meet the required timing constraints. By analyzing the output waveform before the sampling time, circuits with timing flaws can be detected even before the circuit fails. This is useful in high reliability applications as a screening technique that does not stress the circuit, and for wear-out prediction. A symbolic waveform simulator has been implemented to show the benefits of the proposed Output Waveform Analysis techniques. Practical test architectures have been designed, and various waveform analyzers have been manufactured and tested. These include circuits implemented using the Stanford BiCMOS process, and a design implemented in a 25k gate Test Evaluation Chip Experiment.
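
The stability-checking idea reduces to a small predicate. The sketch below, with made-up times and names, flags any output transition observed after the sampling instant:

    def stable_after(transitions, sample_time, observe_window):
        # Stability checking: a fault-free output has settled by the sample
        # point, so any toggle in (sample_time, sample_time + window] flags
        # a timing failure. `transitions` is a list of (time, value) pairs.
        return not any(sample_time < t <= sample_time + observe_window
                       for t, _ in transitions)

    # An output still toggling 0.4 ns after a 10 ns sample point fails:
    assert not stable_after([(2.1, 1), (7.8, 0), (10.4, 1)], 10.0, 1.0)
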
CSL-TR-94-637

Report Number: CSL-TR-94-638
Institution: Stanford University, Computer Systems Laboratory
Title: The Design, Implementation and Evaluation of Jade: A Portable, Implicitly Parallel Programming Language
Author: Rinard, Martin C.
Date: August 1994
Abstract: Over the last decade, research in parallel computer architecture has led to the development of many new parallel machines. These machines have the potential to dramatically increase the resources available for solving important computational problems. The widespread use of these machines, however, has been limited by the difficulty of developing useful parallel software. This thesis presents the design, implementation and evaluation of Jade, a new programming language for parallel computations that exploit task-level concurrency. Jade is structured as a set of constructs that programmers use to specify how a program written in a standard sequential, imperative language accesses data. The implementation dynamically analyzes these specifications to automatically extract the concurrency and map the computation onto the parallel machine. The resulting parallel execution preserves the semantics of the original serial program. We have implemented Jade on a wide variety of parallel computing platforms: shared-memory multiprocessors such as the Stanford DASH machine, homogeneous message-passing machines such as the Intel iPSC/860, and on heterogeneous networks of workstations. Jade programs port without modification between all of these platforms. We evaluate the design and implementation of Jade by parallelizing several complete scientific and engineering applications in Jade and executing these applications on several computational platforms. We analyze how well Jade supports the process of developing these applications and present results that characterize how well they perform.
CSL-TR-94-638

Report Number: CSL-TR-94-639
Institution: Stanford University, Computer Systems Laboratory
Title: Two Case Studies in Latency Tolerant Architectures
Author: Bennett, James E.
Author: Flynn, Michael J.
Date: October 1994
Abstract: Researchers have proposed a variety of techniques for dealing with memory latency, such as dynamic scheduling, hardware prefetching, software prefetching, and multiple contexts. This paper presents the results of two case studies on the usefulness of some simple techniques for latency tolerance. These techniques are nonblocking caches, reordering of loads and stores, and basic block scheduling for the expected latency of loads. The effectiveness of these techniques was found to vary according to the type of application. While nonblocking caches and load/store reordering consistently improved performance, scheduling based on expected latency was found to decrease performance in most cases. This result shows that the assumption of a uniform miss rate used by the scheduler is incorrect, and suggests that techniques for estimating the miss rates of individual loads are needed. These results were obtained using a new simulation environment, MXS, currently under development.
CSL-TR-94-639

Report Number: CSL-TR-94-640
Institution: Stanford University, Computer Systems Laboratory
Title: Transformed Pseudo-Random Patterns for BIST
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: October 1994
Abstract: This paper presents a new approach for on-chip test pattern generation. The set of test patterns generated by a pseudo-random pattern generator (e.g., an LFSR) is transformed into a new set of patterns that provides the desired fault coverage. The transformation is performed by a small amount of mapping logic that decodes sets of patterns that don't detect any new faults and maps them into patterns that detect the hard-to-detect faults. The mapping logic is purely combinational and is placed between the pseudo-random pattern generator and the circuit under test (CUT). A procedure for designing the mapping logic so that it satisfies test length and fault coverage requirements is described. Results are shown for benchmark circuits which indicate that an LFSR plus a small amount of mapping logic reduces the test length required for a particular fault coverage by orders of magnitude compared with using an LFSR alone. These results are compared with previously published results for other methods, and it is shown that the proposed method requires much less overhead to achieve the same fault coverage for the same test length.
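
A behavioral sketch of the scheme follows; the tap set is a common maximal-length choice for 16 bits, and the remap table is a made-up stand-in for the mapping logic that the report's procedure would actually synthesize.

    def lfsr_patterns(seed, taps=(15, 14, 12, 3), width=16, count=1000):
        # External-XOR LFSR pseudo-random pattern source.
        state, mask = seed, (1 << width) - 1
        for _ in range(count):
            yield state
            feedback = 0
            for t in taps:
                feedback ^= (state >> t) & 1
            state = ((state << 1) | feedback) & mask

    # The mapping logic sits between the LFSR and the CUT and is purely
    # combinational; here a lookup table remaps a few patterns that detect
    # no new faults onto deterministic patterns for hard-to-detect faults.
    REMAP = {0x7FF3: 0x0A5C, 0x1B2E: 0xFFFE}     # illustrative entries only

    def transformed_patterns(seed=1):
        for p in lfsr_patterns(seed):
            yield REMAP.get(p, p)
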
CSL-TR-94-640

Report Number: CSL-TR-94-641
Institution: Stanford University, Computer Systems Laboratory
Title: Using Checking Experiments to Test Two-State Latches
Author: Makar, Samy R.
Author: McCluskey, Edward J.
Date: November 1995
Abstract: Necessary and sufficient conditions for an exhaustive functional test (checking experiment) of various latches are derived. These conditions are used to derive minimum-length checking experiments. The checking experiment for the D-latch is simulated using an HSpice implementation of the transmission gate latch. All detectable stuck-at, stuck-open, stuck-on, and bridging faults are detected.
CSL-TR-94-641

Report Number: CSL-TR-94-642
Institution: Stanford University, Computer Systems Laboratory
Title: An Apparatus for Pseudo-Deterministic Testing
Author: Mukund, Shridhar K.
Author: McCluskey, Edward J.
Author: Rao, T.R.N.
Date: October 1994
Abstract: Pseudo-random testing is popularly used, particularly in Built-In Self Test (BIST) applications. To achieve a desired fault coverage, pseudo-random patterns are often supplemented with a few deterministic patterns. When positions of deterministic patterns in the pseudo-random sequence are known a priori, pseudo-random sub-sequences can be chosen such that they also cover these deterministic patterns. We call this method of test application pseudo-deterministic testing. The theory of discrete logarithms has been applied to determine positions of bit-patterns in the pseudo-random sequence generated by a modular form or internal-XOR Linear Feedback Shift Register (LFSR) [5,7]. However, the scheme requires that all the inputs of the combinational logic block (CLB) under test come from the same LFSR source. This constraint in circuit configuration severely limits its application. In this paper, we propose a practical and cost effective technique for pseudo-deterministic testing. For the most part, the problem of circuit configuration has been simplified to one of scan path insertion, by employing LFSR/SR (an arbitrary length shift register driven by a standard form or external-XOR LFSR). To enable the usage of LFSR/SR as a pseudo-deterministic pattern source, we propose a method to determine positions of bit-patterns, at arbitrarily chosen tap configurations, in the LFSR/SR sequence.
CSL-TR-94-642

Report Number: CSL-TR-94-643
Institution: Stanford University, Computer Systems Laboratory
Title: Design-for-Current-Testability (DFCT) for Dynamic CMOS Logic
Author: Ma, Siyad C.
Author: McCluskey, Edward J.
Date: November 1994
Abstract: The applicability of quiescent current monitoring (IDDQ testing) to dynamic logic is discussed here. IDDQ testing is very useful in detecting some defects that can escape functional and delay tests; however, we show that some defects in domino logic cannot be detected by either voltage or current measurements. A design-for-current-testability (DFCT) modification for dynamic logic is presented and shown to enable detection of these defects. The DFCT circuitry is designed with a negligible performance impact during normal operation. This is particularly important since the main reason for using dynamic logic is its speed.
CSL-TR-94-643

Report Number: CSL-TR-94-644
Institution: Stanford University, Computer Systems Laboratory
Title: Synthesis of Asynchronous Controllers for Heterogeneous Systems
Author: Yun, Kenneth Yi
Date: August 1994
Abstract: There are two synchronization mechanisms used in digital systems: synchronous and asynchronous. Synchronous or asynchronous refers to whether the system events occur in lock-step based on a clock or not. Today's system components typically employ the synchronous paradigm primarily because of the availability of the rich set of design tools and algorithms and, perhaps, because of the designers' perception of "ease of design" and the lack of alternatives. Even so, the interfaces among the system components do not strictly adhere to the synchronous paradigm because of the cost benefit of mixing modules operating at different clock rates and modules with asynchronous interfaces. This thesis addresses the problem of how to synthesize controllers operating in heterogeneous systems - systems with components employing different synchronization mechanisms. We introduce a new design style called extended-burst-mode. The extended-burst-mode design style covers a wide spectrum of sequential circuits ranging from delay-insensitive to synchronous. We can synthesize multiple-input change asynchronous finite state machines, and many circuits that fall in the gray area between synchronous and asynchronous which are difficult or impossible to synthesize automatically using existing methods. Our implementation of extended-burst-mode machines uses standard combinational logic, generates low-latency outputs and guarantees freedom from hazards at the gate level. We present a complete set of automated sequential synthesis algorithms: hazard-free state assignment, hazard-free state minimization, and critical-race-free state encoding. We also describe two radically different hazard-free combinational synthesis methods: two-level sums-of-products implementation and multiplexor trees implementation. Existing theories for hazard-free combinational synthesis are extended to handle non-monotonic input changes. A set of requirements for freedom from logic hazards is presented for each combinational synthesis method. Experimental data from a large set of examples are presented and compared to competing methods, whenever possible. To demonstrate the effectiveness of the design style and the synthesis tool, the design of a commercial-scale SCSI controller data path is presented. This design is functionally compatible with an existing high performance commercial chip and meets the ANSI SCSI-2 standard.
CSL-TR-94-644

Report Number: CSL-TR-94-645
Institution: Stanford University, Computer Systems Laboratory
Title: Rationale, Design and Performance of the Hydra Multiprocessor
Author: Olukotun, Kunle
Author: Bergmann, Jules
Author: Chang, Kun-Yung
Author: Nayfeh, Basem A.
Date: November 1994
Abstract: In Hydra, four high-performance processors communicate via a shared secondary cache. The shared cache is implemented using multichip module (MCM) packaging technology. The Hydra multiprocessor is designed to efficiently support automatically parallelized programs that have high degrees of fine-grained sharing. This paper motivates the Hydra multiprocessor design by reviewing current trends in architecture, parallelizing compiler technology, and implementation technology. The design of the Hydra multiprocessor is described and explained. Initial estimates of the interprocessor communication latencies show them to be much better than those of current bus-based multiprocessors. These lower latencies result in higher performance on applications with fine-grained parallelism.
CSL-TR-94-645

Report Number: CSL-TR-94-646
Institution: Stanford University, Computer Systems Laboratory
Title: Technology Mapping for VLSI Circuits Exploiting Boolean Properties and Operations
Author: Mailhot, Frederic
Date: December 1994
Abstract: Automatic synthesis of digital circuits has gained increasing importance. The synthesis process consists of transforming an abstract representation of a system into an implementation in a target technology. The set of transformations has traditionally been broken into three steps: high-level synthesis, logic synthesis and physical design. This dissertation is concerned with logic synthesis. More specifically, we study technology mapping, which is the link between logic synthesis and physical design. The object of technology mapping is to transform a technology-independent logic description into an implementation in a target technology. One of the key operations during technology mapping is to recognize logic equivalence between a portion of the initial logic description and an element of the target technology. We introduce new methods for establishing logic equivalence between two logic functions. The techniques, based on Boolean comparisons, use Binary Decision Diagrams (BDDs). An algorithm for dealing with completely specified functions is first presented. Then we introduce a second algorithm, which is applicable to incompletely specified functions. We also present an ensemble of techniques for optimizing delay, which rely on an iterative approach. All these methods have proven to be efficient both in run-times and quality of results, when compared to other existing technology mapping systems. The algorithms presented have been implemented in a technology mapping program, Ceres. Results are shown that highlight the application of the different algorithms.
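
The core Boolean-matching question, whether two functions compute the same thing, can be illustrated without BDDs by exhaustive truth-table comparison; Ceres's BDD-based comparison answers the same question far more scalably. The helper below is our illustration, not the dissertation's algorithm.

    from itertools import product

    def equivalent(f, g, n_inputs):
        # Logic equivalence of two completely specified functions, checked
        # by brute-force truth-table comparison (exponential in n_inputs).
        return all(f(*bits) == g(*bits)
                   for bits in product((0, 1), repeat=n_inputs))

    # A NAND gate matches a library OR gate with inverted inputs (De Morgan):
    assert equivalent(lambda a, b: 1 - (a & b),
                      lambda a, b: (1 - a) | (1 - b), 2)
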
CSL-TR-94-646

Report Number: CSL-TR-94-647
Institution: Stanford University, Computer Systems Laboratory
Title: Design Issues in Floating-Point Division
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: December 1994
Abstract: Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, the increasing emphasis on high performance graphics and the industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay close attention to all aspects of floating-point computation. This paper presents the algorithms often utilized for floating-point division, and it also presents implementation alternatives available for designers. Using a system level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and trade-offs.
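
One family of algorithms such a study covers is functional iteration. Below is a sketch of Newton-Raphson reciprocal iteration; the seed constant and iteration count are illustrative, where a hardware divider would use a table-lookup seed and a fixed iteration count.

    def newton_raphson_divide(a, b, iterations=5):
        # Approximate a/b via x_{k+1} = x_k * (2 - b*x_k), which converges
        # quadratically to 1/b. b is first scaled into [0.5, 1) so a simple
        # linear seed is accurate enough to start.
        assert b > 0
        scale = 1.0
        while b >= 1.0:
            b *= 0.5
            scale *= 0.5
        while b < 0.5:
            b *= 2.0
            scale *= 2.0
        x = 2.9142 - 2.0 * b                     # linear initial approximation
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return a * x * scale

    assert abs(newton_raphson_divide(355.0, 113.0) - 355.0 / 113.0) < 1e-12
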
CSL-TR-94-647

Report Number: CSL-TR-94-648
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Synthesis of Gate-Level Speed-Independent Circuits
Author: Beerel, Peter A.
Author: Myers, Chris J.
Author: Meng, Teresa H.-Y.
Date: December 1994
Abstract: This paper presents a CAD tool for the synthesis of robust asynchronous control circuits using limited-fanin basic gates such as AND gates, OR gates, and C-elements. The synthesized circuits are speed-independent; that is, they work correctly regardless of individual gate delays. Included in our synthesis procedure is an efficient procedure for logic optimizations using observability don't cares and incremental verification. We apply the procedure to a variety of specifications taken from industry and previously published examples and compare our speed-independent implementations to those generated using a non-speed-independent synthesis procedure included in Berkeley's SIS. Our implementations are not only more robust to delay variations (those produced by SIS rely on bounded delay lines to avoid circuit hazards) but are also on average 13 percent faster, with an area penalty of only 14 percent.
CSL-TR-94-648

Report Number: CSL-TR-94-649
Institution: Stanford University, Computer Systems Laboratory
Title: Routing of Streams in WDM Reconfigurable Networks
Author: Noronha, Ciro A., Jr.
Author: Tobagi, Fouad A.
Date: December 1994
Abstract: Due to its low attenuation, fiber has become the medium of choice for point-to-point links. Using Wavelength-Division Multiplexing (WDM), many channels can be created in the same fiber. A network node equipped with a tunable optical transmitter can select any of these channels for sending data. An optical interconnection combines the signals from the various transmitters in the network and makes them available to the optical receivers, which may also be tunable. By properly tuning transmitters and/or receivers, point-to-point links can be dynamically created and destroyed. Therefore, in a WDM network, the routing algorithm has an additional degree of freedom compared to traditional networks: it can modify the network topology to create the routes. In this report, we consider the problem of routing audio/video streams in WDM networks. We present a general linear integer programming formulation for the problem. However, since this is a complex solution, we propose simpler heuristic algorithms, both for the unicast case and for the multicast case. The performance of these heuristics is evaluated in a number of scenarios, with a realistic traffic model, and from the evaluation we derive guidelines for usage of the heuristic algorithms.
CSL-TR-94-649

Report Number: CSL-TR-94-650
Institution: Stanford University, Computer Systems Laboratory
Title: A Uniform Approach to the Synthesis of Synchronous and Asynchronous Circuits
Author: Myers, Chris J.
Author: Meng, Teresa H.-Y.
Date: December 1994
Abstract: In this paper we illustrate the application of a synthesis procedure used for timed asynchronous circuits to the design of synchronous circuits. In addition to providing a uniform synthesis approach, our procedure results in circuits that are significantly smaller and faster than those designed using the synchronous design tool SIS.
CSL-TR-94-650

Report Number: CSL-TR-94-651
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Hazard-Free Decomposition of High-Fanin Gates in Asynchronous Circuit Synthesis
Author: Myers, Chris J.
Author: Meng, Teresa H.-Y.
Date: December 1994
Abstract: In this paper we present an automated procedure to decompose high-fanin gates generated by asynchronous circuit synthesis procedures for technology mapping to practical gate libraries. Our procedure begins with a specification in the form of an event-rule system, a circuit implementation in the form of a production rule set, and a given gate library. For each gate in the implementation that has a fanin larger than the maximum in the library, a new signal is added to the specification. Each valid decomposition of the high-fanin gates using these new signals is examined by resynthesis until all gates have been successfully decomposed, or it has been determined that a solution does not exist. The procedure has been automated and used to decompose high-fanin gates from several examples generated by the synthesis tools ATACS and SYN. Our resulting implementations using ATACS, when compared with SIS which uses synchronous technology mapping and adds delay elements to remove hazards, are up to 50 percent smaller and have less than half the latency using library delays generated by HSPICE.
CSL-TR-94-651

Report Number: CSL-TR-94-652
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Synthesis and Verification of Gate-Level Timed Circuits
Author: Myers, Chris J.
Author: Rokicki, Tomas G.
Author: Meng, Teresa H.-Y.
Date: December 1994
Abstract: This paper presents a CAD system for the automatic synthesis and verification of gate-level timed circuits. Timed circuits are a class of asynchronous circuits which incorporate explicit timing information in the specification which is used throughout the synthesis procedure to optimize the design. This system accepts a textual specification capable of specifying general circuit behavior and timing requirements. This specification is systematically transformed to a graphical representation that can be analyzed using an exact and efficient timing analysis algorithm to find the reachable state space. From this state space, our synthesis procedure derives a timed circuit that is hazard-free using only basic gates to facilitate the mapping to semi-custom components, such as standard-cells and gate-arrays. The resulting gate-level timed circuit implementations are up to 40 percent smaller and 50 percent faster than those produced using other asynchronous design methodologies. We also demonstrate that our timed designs can be smaller and faster than their synchronous counterparts. To address verification, we have applied our timing analysis algorithm to verify efficiently not only our synthesized circuits but also a wide collection of reasonable-sized, highly concurrent timed circuits that could not previously be verified using traditional techniques.
CSL-TR-94-652

Report Number: CSL-TR-94-653
Institution: Stanford University, Computer Systems Laboratory
Title: Routing of Video/Audio Streams In Packet-Switched Networks
Author: Noronha, Ciro A., Jr.
Date: December 1994
Abstract: The transport of multimedia streams in computer communication networks raises issues at all layers of the OSI model. This thesis considers some of the issues related to supporting multimedia streams at the network layer; in particular, the issue of appropriate routing algorithms. New routing algorithms, capable of efficiently meeting multimedia requirements, are needed. We formulate the optimum multipoint stream routing problem as a linear integer programming problem and propose an efficient solution technique. We show that the proposed solution technique significantly decreases the time to compute the solution, when compared to traditional methods. We use the optimum multicast stream routing problem as a benchmark to characterize the performance of existing heuristic algorithms under realistic network and traffic scenarios, and derive guidelines for their usage and for upgrading the network capacity. We also consider the problem of routing multimedia streams in a Wavelength-Division Multiplexing (WDM) optical network, which has an additional degree of freedom over traditional networks - its topology can be changed by the routing algorithm to create routes as needed, by tuning optical transmitters and/or receivers. We show that the optimum reconfiguration and routing problem can be formulated as a linear integer programming problem. Since this is a complex solution, we also propose a set of heuristic algorithms, both for unicast and multicast routing. We evaluate the performance of the proposed heuristics and derive guidelines for their usage.
CSL-TR-94-653

Report Number: CSL-TR-94-654
Institution: Stanford University, Computer Systems Laboratory
Title: Multipliers and Datapaths
Author: Al-Twaijry, Hesham
Author: Flynn, Michael J.
Date: December 1994
Abstract: People have traditionally considered the number of counters in the critical path as the metric for the performance of a multiplier. This report presents the view that tree topologies which have the least number of levels do not always give the fastest possible multiplier when constrained to be part of a microprocessor. It proposes two new topologies, hybrid structures and higher-order arrays, which are faster than conventional tree topologies for typical datapaths.
CSL-TR-94-654

Report Number: CSL-TR-94-655
Institution: Stanford University, Computer Systems Laboratory
Title: I/O Characterization and Attribute Caches for Improved I/O System Performance
Author: Richardson, Kathy J.
Date: December 1994
Abstract: Workloads generate a variety of disk I/O requests to access file information, execute programs, and perform computation. I/O caches capture most of these requests, reducing execution time, providing high I/O rates, and decreasing the disk bandwidth needed by each workload. A cache has difficulty capturing the full range of I/O behavior, however, when it treats the requests as a single stream of uniform tasks. The single stream contains I/O requests for data with vastly different reuse rates and access patterns. Disk files can be classified as accesses to inodes, directories, datafiles or executables. The combined cache behavior of all four taken together provides few clues for improving performance of the I/O cache. But individually, the cache behavior of each reveals the distinct components that make up aggregate I/O behavior. Inodes and directories turn out to be small, highly reused files. Datafiles and executable files have more diverse characteristics. The smaller ones exhibit moderate reuse and have little sequential access, while the larger files tend to be accessed sequentially and not reused. Properly used, file type and file size information improves cache performance. The dissertation introduces attribute caches to improve I/O cache performance. Attribute caches use file attributes to selectively cache I/O data with a cache scheme tailored to the expected behavior of the file type. Inodes and directories are cached in very small blocks, capitalizing on their high reuse rate and small space requirements. Large files are cached in large cache blocks, capitalizing on their sequential access patterns. Small and medium sized files are cached in average-sized 4-kbyte blocks, which minimizes the memory required to service the bulk of requests. The portion of cache dedicated to each group varies with total cache size. This allows the important features of the workload to be captured at the appropriate cache size, and increases the total cache utilization. For a set of 11 measured workloads an attribute cache scheme reduced the miss ratio by 25-60% depending on cache size, and required only about 1/8 as much memory as a typical I/O cache implementation achieving the same miss ratio.
CSL-TR-94-655

Report Number: CSL-TR-94-656
Institution: Stanford University, Computer Systems Laboratory
Title: I/O Characterization and Attribute Cache Data for Eleven Measured Workloads
Author: Richardson, Kathy J.
Date: December 1994
Abstract: Workloads generate a variety of disk I/O requests to access file information, execute programs, and perform computation. Workload characterization is crucial to optimizing I/O system performance. This report contains detailed workload characterization data for eleven measured workloads. It includes numerous tables, and cache behavior plots for each workload. The workload I/O traces, from which the characterization is derived, include both file system information and I/O system information, where previous traces only included one or the other. The additional information allows I/O characterization at the system level, and greatly increases the body of knowledge about the make-up and type of disk I/O requested. The new information shows that the I/O request stream contains statistically diverse components that can be separated. This allows the important features of the workload to be captured at the appropriate cache size, and increases the total cache utilization. Note: This technical report is a companion report to the dissertation "I/O Characterization and Attribute Caches for Improved I/O System Performance" (CSL-TR-94-655). While the dissertation is self contained, this report is not; it presents data that is analyzed and discussed only in the dissertation.
CSL-TR-94-656

Report Number: CSL-TR-94-657
Institution: Stanford University, Computer Systems Laboratory
Title: Instruction Level Parallel Processors---A New Architectural Model for Simulation and Analysis
Author: Rudd, Kevin W.
Date: December 1994
Abstract: Trends in high-performance computer architecture have led to the development of increased clock-rate and dynamic multiple-instruction issue processor designs. There have been problems combining both these techniques due to the pressure that the complex scheduling and issue logic puts on the cycle time. This problem has limited the performance of multiple-instruction issue architectures. The alternative approach of static multiple-operation issue avoids the clock-rate problem by allowing the hardware to concurrently issue only those operations that the compiler scheduled to be issued concurrently. Since there is no hardware support required to achieve multiple-operation issue (there are multiple operations in a single instruction and the hardware issues a single instruction at a time), these designs can be effectively scaled to high clock rates. However, these designs have the problem that the scheduling of operations into instructions is rigid and to increase the performance of the system the entire system must be scaled uniformly so that the static schedule is not compromised. This report describes an architectural model that allows a range of hybrid architectures to be studied.
CSL-TR-94-657

Report Number: CSL-TR-95-658
Institution: Stanford University, Computer Systems Laboratory
Title: RYO: a Versatile Instruction Instrumentation Tool for PA-RISC
Author: Zucker, Daniel F.
Author: Karp, Alan H.
Date: January 1995
Abstract: RYO (Roll Your Own) is actually a family of novel instrumentation tools for the PA-RISC family of processors. Relatively simple awk scripts, these tools instrument PA-RISC assembly instruction sequences by replacing individual machine instructions with calls to user written routines. Examples are presented showing how to generate address traces by replacing memory instructions, and how to analyze floating point arithmetic by replacing floating point instructions. This paper introduces the overall structure and design of RYO, as well as giving detailed instructions on its use.
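
The flavor of the technique is easy to mimic: rewrite an assembly stream so that each memory instruction becomes a call to a user routine. The sketch below is Python over a toy fragment; RYO itself is awk over real PA-RISC assembly, and the register-saving glue a real tool needs is omitted here.

    import re

    MEM_OP = re.compile(r"^\s*(ldw|stw)\s+(.+)$")    # illustrative pattern

    def instrument(asm_lines):
        # Replace each matched memory instruction with a call to a user
        # trace routine, keeping the original as a comment so the stub can
        # recover the operands.
        out = []
        for line in asm_lines:
            m = MEM_OP.match(line)
            if m:
                out.append(f"\t; was: {m.group(1)} {m.group(2)}")
                out.append("\tbl trace_mem,%r2")
            else:
                out.append(line)
        return out
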
CSL-TR-95-658

Report Number: CSL-TR-95-659
Institution: Stanford University, Computer Systems Laboratory
Title: High-Speed BiCMOS Memories
Author: Wingard, Drew Eric
Date: December 1994
Abstract: Existing BiCMOS static memories do not simultaneously combine the speed of bipolar memories with the low power and density of CMOS memories. Beginning with fundamentally fast low-swing bipolar circuits and zero-power CMOS storage latches, we introduce CMOS devices into the bipolar circuits to reduce the power dissipation without compromising speed, and insert bipolar transistors into CMOS storage arrays to improve the speed without power or density penalties. Replacing passive load resistors with switched PMOS transistors reduces the amount of power required to keep bipolar decoder outputs low. The access delay need not increase because the load resistance is quickly reduced via a low-swing signal when the decoder could switch. For ECL NOR decoders, we apply a variable BiCMOS current source that is simplified by carefully regulating the negative supply. We also develop techniques that improve the reading and writing characteristics of the CMOS-storage, emitter-access memory cell. The 16K-word 4-bit asynchronous CSEA memory was fabricated in a 0.8-micron BiCMOS technology and accesses in 3.7ns while using 1.75 W. An improved 64Kx4 design is simulated to run at 3.4ns and 2.3W. Finally, a synchronous 4Kx64 CSEA memory is estimated to operate at 2.5ns and 2.4W in the same process technology.
CSL-TR-95-659

Report Number: CSL-TR-95-660
Institution: Stanford University, Computer Systems Laboratory
Title: The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors
Author: Holt, Chris
Author: Heinrich, Mark
Author: Singh, Jaswinder Pal
Author: Rothberg, Edward
Author: Hennessy, John
Date: January 1995
Abstract: Distributed shared memory (DSM) machines can be characterized by four parameters, based on a slightly modified version of the logP model. The l (latency) and o (occupancy of the communication controller) parameters are the keys to performance in these machines, and are largely determined by major architectural decisions about the aggressiveness and customization of the node and network. For recent and upcoming machines, the g (gap) parameter that measures node-to-network bandwidth does not appear to be a bottleneck. Conventional wisdom is that latency is the dominant factor in determining the performance of a DSM machine. We show, however, that controller occupancy--which causes contention even in highly optimized applications--plays a major role, especially at low latencies. When latency hiding is used, occupancy becomes more critical, even in machines with high latency networks. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. We show that in many structured computations occupancy-induced contention is not alleviated by increasing problem size, and that there are important classes of applications for which the performance lost by using higher latency networks or higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
CSL-TR-95-660

Report Number: CSL-TR-95-661
Institution: Stanford University, Computer Systems Laboratory
Title: Performance Factors for Superscalar Processors
Author: Bennett, James E.
Author: Flynn, Michael J.
Date: February 1995
Abstract: This paper introduces three performance factors for dynamically scheduled superscalar processors. These factors, availability, efficiency, and utility, are then used to explain the variations in performance that occur with different processor and memory system features. The processor features that are investigated are branch prediction depth and following multiple branch paths. The memory system features that are investigated are cache size, associativity, miss penalty, and memory bus bandwidth. Dynamic scheduling with appropriate levels of bus bandwidth and branch prediction is shown to be remarkably effective at achieving good performance over a range of differing application types and over a range of cache miss rates. These results were obtained using a new simulation environment, MXS, which directly executes the benchmarks.
CSL-TR-95-661

Report Number: CSL-TR-95-662
Institution: Stanford University, Computer Systems Laboratory
Title: Limits of Scaling MOSFETs
Author: McFarland, Grant
Author: Flynn, Michael J.
Date: January 1995
Abstract: The fundamental electrical limits of MOSFETs are discussed and modeled to predict the scaling limits of digital bulk CMOS circuits. Limits discussed include subthreshold currents, time dependent dielectric breakdown (TDDB), hot electron effects, and drain induced barrier lowering (DIBL). This paper predicts that the scaling of bulk CMOS MOSFETs will reach its limits at drawn dimensions of approximately 0.1um. These electrical limits are used to find scaling factors for SPICE Level 3 model parameters, and a scalable Level 3 device model is presented. Current trends in scaling interconnects are also discussed.
CSL-TR-95-662

Report Number: CSL-TR-95-663
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Technology Mapping for Asynchronous Designs
Author: Siegel, Polly Sara Kay
Date: March 1995
Abstract: Asynchronous design styles have been increasing in popularity as device sizes shrink and concurrency is exploited to increase system performance. However, asynchronous designs are difficult to implement correctly because the presence of hazards, which are of little consequence to most parts of synchronous systems, can cause improper circuit operation. Many asynchronous design styles, together with accompanying automated synthesis algorithms, address the issues of design complexity and correctness. Typically, these synthesis systems take a high-level description of an asynchronous system and produce a logic-level description of the resultant design that is hazard-free for transitions of interest. The designer then must manually translate this logic-level description into a technology-specific implementation composed of an interconnection of elements from a semi-custom cell library. At this stage, the designer must be careful not to introduce new hazards into the design. The size of designs is limited in part by the inability to safely (and reliably) map the technology-independent description into an implementation. In this thesis, we address the problem of technology mapping for two different asynchronous design styles. We first address the problem for burst-mode designs. We developed theorems and algorithms for hazard-free mapping of burst-mode designs, and implemented these algorithms on top of an existing synchronous technology mapper. We incorporated this mapper into a toolkit for asynchronous design, and used the toolkit to implement a low-power infrared communications chip. We then extended this work to apply to the problem of hazard-free technology mapping of speed-independent designs. The difficulty in this design style is in the decomposition phase of the mapping algorithm, and we developed theory and algorithms for correct hazard-free decomposition of this design style. We also developed an exact covering algorithm which takes advantage of logic sharing within the design. These algorithms were then applied to benchmark circuits.
CSL-TR-95-663

Report Number: CSL-TR-95-664
Institution: Stanford University, Computer Systems Laboratory
Title: Nondeterministic Operators in Algebraic Frameworks
Author: Meldal, Sigurd
Author: Walicki, Michal Antonin
Date: March 1995
Abstract: A major motivating force behind research into abstract data types is the realization that software should be described in an abstract manner - on the one hand leaving open decisions regarding further refinement and on the other allowing for substitutivity of modules as long as they satisfy a particular specification. The use of nondeterministic operators is a useful abstraction tool: nondeterminism represents a natural abstraction whenever there is a hidden state or other components of a system description which are, methodologically, conceptually or technically, inaccessible at a particular level of specification granularity. In this report we explore the various approaches to dealing with nondeterminism within the framework of algebraic specifications. The basic concepts involved in the study of nondeterminism are introduced. The main alternatives for the interpretation of nondeterministic operations, homomorphisms between nondeterministic structures and equivalence of nondeterministic terms are sketched, and we discuss various proposals for initial and terminal semantics. We offer some comments on the continuous semantics of nondeterminism and the problem of solving recursive equations over signatures with binary nondeterministic choice. We also present the attempts at reducing reasoning about nondeterminism to reasoning in first order logic, and present a calculus dealing directly with nondeterministic terms. Finally, rewriting with nondeterminism is discussed: primarily as a means of reasoning, but also as a means of assigning operational semantics to nondeterministic specifications.
CSL-TR-95-664

Report Number: CSL-TR-95-665
Institution: Stanford University, Computer Systems Laboratory
Title: Interprocedural Parallelization Analysis: Preliminary Results
Author: Hall, Mary W.
Author: Amarasinghe, Saman P.
Author: Murphy, Brian R.
Author: Liao, Shih-Wei
Author: Lam, Monica S.
Date: March 1995
Abstract: This paper describes a fully interprocedural automatic parallelization system for Fortran programs, and presents the results of extensive experiments obtained using this system. The system incorporates a comprehensive and integrated collection of analyses including dependence, privatization and reduction recognition for both array and scalar variables, and scalar symbolic analysis to support these. All the analyses have been implemented in the SUIF (Stanford University Intermediate Format) compiler system, with the aid of an interprocedural analysis construction tool known as FIAT. Our interprocedural analysis is uniquely designed to provide the same quality of information as if the program were analyzed as a single procedure, while managing the complexity of the analysis. We have implemented a robust system that has parallelized, completely automatically, loops containing over a thousand lines of code. This work makes possible the first comprehensive empirical evaluation of state-of-the-art automatic parallelization technology. This paper reports evaluation numbers on programs from standard benchmark suites. The results demonstrate that all the interprocedural analyses taken together can substantially advance the capability of current automatic parallelization technology.
CSL-TR-95-665

Report Number: CSL-TR-95-666
Institution: Stanford University, Computer Systems Laboratory
Title: On Division and Reciprocal Caches
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: April 1995
Abstract: Floating-point division is generally regarded as a high latency operation in typical floating-point applications. Many techniques exist for increasing division performance, often at the cost of increasing either chip area, cycle time, or both. This paper presents two methods for decreasing the latency of division. Using applications from the SPECfp92 and NAS benchmark suites, these methods are evaluated to determine their effects on overall system performance. The notion of recurring computation is presented, and it is shown how recurring division can be exploited using an additional, dedicated division cache. Additionally, for multiplication-based division algorithms, reciprocal caches can be utilized to store recurring reciprocals. Due to the similarity between the algorithms typically used to compute division and square root, the performance of square root caches is also investigated. Results show that reciprocal caches can achieve nearly a 2X reduction in effective division latency for reasonable cache sizes.
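
As a hedged illustration of the reciprocal-cache mechanism (the direct-mapped geometry, entry count, and keying below are assumptions, not the paper's configuration):

    # On a hit, division b/a collapses to one multiply by the cached 1/a;
    # on a miss, the full division latency is paid and 1/a is stored.
    class ReciprocalCache:
        def __init__(self, n_entries=128):
            self.n = n_entries
            self.tags = [None] * n_entries
            self.recips = [0.0] * n_entries

        def divide(self, b, a):
            idx = hash(a) % self.n
            if self.tags[idx] == a:      # hit: short multiply latency
                return b * self.recips[idx]
            recip = 1.0 / a              # miss: full divide latency
            self.tags[idx], self.recips[idx] = a, recip
            return b * recip
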
CSL-TR-95-666

Report Number: CSL-TR-95-667
Institution: Stanford University, Computer Systems Laboratory
Title: Better Optical Triangulation through Spacetime Analysis
Author: Curless, Brian
Author: Levoy, Marc
Date: April 1995
Abstract: The standard methods for extracting range data from optical triangulation scanners are accurate only for planar objects of uniform reflectance illuminated by an incoherent source. Using these methods, curved surfaces, discontinuous surfaces, and surfaces of varying reflectance cause systematic distortions of the range data. Coherent light sources such as lasers introduce speckle artifacts that further degrade the data. We present a new ranging method based on analyzing the time evolution of the structured light reflections. Using our spacetime analysis, we can correct for each of these artifacts, thereby attaining significantly higher accuracy using existing technology. We present results that demonstrate the validity of our method using a commercial laser stripe triangulation scanner.
CSL-TR-95-667

Report Number: CSL-TR-95-668
Institution: Stanford University, Computer Systems Laboratory
Title: Architecture Evaluator's Work Bench and its Application to Microprocessor Floating Point Units
Author: Fu, Steve
Author: Quach, Nhon
Author: Flynn, Michael
Date: June 1995
Abstract: This paper introduces the Architecture Evaluator's Workbench (AEWB), a high-level design space exploration methodology, and its application to floating point units (FPUs). In applying AEWB to FPUs, a metric for optimizing and comparing FPU implementations is developed. The metric, FUPA, incorporates four aspects of AEWB: latency, cost, technology, and profiles of target applications. FUPA models latency in terms of delay, cost in terms of area, and profile in terms of the percentage of different floating point operations. We utilize sub-micron device models, interconnect models, and actual microprocessor scaling data to develop models used to normalize both latency and area, enabling technology-independent comparison of implementations. This report also surveys most state-of-the-art microprocessors and compares them using FUPA. Finally, we correlate the FUPA results with reported SPECfp92 results and demonstrate the effect of circuit density on FUPA implementations.
CSL-TR-95-668

Report Number: CSL-TR-95-669
Institution: Stanford University, Computer Systems Laboratory
Title: Testing BiCMOS and Dynamic CMOS Logic
Author: Ma, Siyad
Date: June 1995
Abstract: In a normal integrated circuit (IC) production cycle, manufactured ICs are tested to remove defective parts. The purpose of this research is to study the effects of real defects in BiCMOS and Dynamic CMOS circuits, and to propose better test solutions to detect these defects. BiCMOS and Dynamic CMOS circuits are used in many new high-performance VLSI ICs. Fault models for BiCMOS and Dynamic CMOS circuits are discussed first. Shorted and open transistor terminals, the most common failure modes in MOS and bipolar transistors, are simulated for BiCMOS and Dynamic CMOS logic gates. Simulations show that a faulty behavior similar to data retention faults in memory cells can occur in BiCMOS and Dynamic CMOS logic gates. We explain here why it is important to test for these faults, and present test techniques that can detect these faults. Simulation results also show that shorts and opens in Dynamic CMOS and BiCMOS circuits are harder to test than their counterparts in Static CMOS circuits. Simulation results also show that the testability of opens in BiCMOS gates can be predicted without time-consuming transistor-level simulations. We present a prediction method based on an extended switch-level model for BiCMOS gates. To improve the testability of dynamic CMOS circuits, design-for-testability circuitry is proposed. Scan cell designs add scan capabilities to dynamic latches and flip-flops with negligible performance overhead, while design-for-current-testability circuitry allows quiescent supply current (IDDQ) measurements for dynamic CMOS circuits.
CSL-TR-95-669

Report Number: CSL-TR-95-670
Institution: Stanford University, Computer Systems Laboratory
Title: Design and Analysis of Update-Based Cache Coherence Protocols for Scalable Shared-Memory Multiprocessors
Author: Glasco, David Brian
Date: June 1995
Abstract: This dissertation examines the performance difference between invalidate-based and update-based cache coherence protocols for scalable shared-memory multiprocessors. The first portion of the dissertation reviews cache coherence. First, chapter 1 describes the cache coherence problem and identifies the two classes of cache coherence protocols, invalidate-based and update-based. The chapter also reviews bus-based protocols and reviews the additional requirements placed on the protocols to extend them to scalable systems. Next, chapter 2 reviews two latency tolerating techniques, relaxed memory consistency models and software-controlled data prefetch, and examines their impact on the cache coherence protocols. Finally, chapter 3 reviews the details of three invalidate-based protocols defined in the literature and defines two new update-based protocols. The second portion of this dissertation examines the performance differences between invalidate-based and update-based protocols. First, chapter 4 presents the methodology used to examine the performance of the protocols. This presentation includes a discussion of the simulation environment, the simulated architecture and the scientific applications. Next, chapter 5 describes and analyzes the performance of two enhancements to the update-based cache coherence protocols. The first enhancement, a fine-grain or word-based synchronization scheme, combines data synchronization with the data. This allows the system to take advantage of the fine-grain data updates which result from the update-based protocols. The second enhancement, a write grouping scheme, is necessary to reduce the network traffic generated by the update-based protocols. Next, chapter 6 presents and discusses the simulated results that demonstrate that update-based protocols, with the two enhancements, can significantly improve the performance of the fine-grain scientific applications examined compared to invalidate-based protocols. Chapter 7 examines the sensitivity of the protocols to changes in the architectural parameters and to migratory data. Finally, chapter 8 discusses how the choice of protocol affects the correctness, cost and efficiency of the cache coherence mechanism. Overall, this work demonstrates that update-based protocols can be used not only as a coherence mechanism, but also as a latency reducing and tolerating technique to improve the performance of a set of fine-grain scientific applications. But as with other latency reducing techniques, such as data prefetch, the technique must be used with an understanding of its consequences.
CSL-TR-95-670

Report Number: CSL-TR-95-671
Institution: Stanford University, Computer Systems Laboratory
Title: Characterization and reduction of metastability errors in CMOS interface circuits
Author: Portmann, Clemenz Lenard
Date: June 1995
Abstract: In synchronous digital logic systems, asynchronous external signals must be referenced to the system clock or synchronized. Synchronization of asynchronous signals, however, inevitably leads to metastability errors. Metastability error rates can increase by orders of magnitude as clock frequencies increase in high performance designs, and supply voltages decrease in low-power designs. This research focuses on the characterization of metastability parameters and error reduction with no penalty in circuit performance. Two applications, high-speed flash analog-to-digital conversion and synchronization of asynchronous binary signals in application-specific integrated circuits, have been investigated. Applications such as telecommunications and instrumentation for time-domain analysis require analog-to-digital converters with metastability error probabilities on the order of 10^-10 errors/cycle, achievable in high performance designs only through the use of dedicated circuitry for error reduction. A power and area efficient externally pipelined metastability error reduction technique for flash converters has been developed. Unresolved comparator outputs are held valid, causing the encode logic to fail benignly in the presence of metastability. In an n bit converter, errors are passed as a single unsettled bit to the converter output and are reduced with an external pipeline of only n latches per stage rather than an internal pipeline of 2^n-1 latches per stage. An 80-MHz, externally pipelined, 7-bit flash analog-to-digital converter was fabricated in 1.2-um CMOS. Measured error rates were less than 10^-12 errors/cycle. Using internal pipelining with two levels of 127 latches to achieve equivalent performance would require 3.48 times more power for the error reduction circuitry with a Nyquist frequency input. This corresponds to a reduction in the total power for the implemented converter of 1.24 times compared with the internally pipelined converter. In synchronizers and arbiters, general purpose applications require mean time between failures on the order of a year or tens of years. Comparison of previous designs has been difficult due to varying technologies, test setups, and test conditions. To address this problem, a test circuit for synchronizers was implemented in 2-um and 1.2-um CMOS technologies. Using the test setup, the evaluation and comparison of synchronizer performance in varying environments and technologies is possible. The effects of loading, output buffering, supply scaling, supply noise, and technology scaling on synchronizer performance are discussed.
CSL-TR-95-671

Report Number: CSL-TR-95-672
Institution: Stanford University, Computer Systems Laboratory
Title: Delay Models for CMOS Circuits
Author: McFarland, Grant
Author: Flynn, Michael
Date: June 1995
Abstract: Four different CMOS inverter delay models are derived and compared. It is shown that inverter delay can be estimated with fair accuracy over a wide range of input rise times and loads as the sum of two terms, one proportional to the input rise time, and one proportional to the capacitive load. Methods for estimating device capacitance from HSPICE parameters are presented, as well as means of including added delay due to wire resistance and the use of series transistors.
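
The two-term model can be written directly; in this sketch the coefficients are placeholder fit constants, not values from the report:

    # Two-term inverter delay model described in the abstract:
    #     t_delay ~= a * t_rise_in + b * C_load
    # 'a' (dimensionless) and 'b' (delay per unit load) would be fitted per
    # process and gate size; the defaults below are illustrative only.
    def inverter_delay_ps(t_rise_in_ps, c_load_ff, a=0.25, b=4.0):
        """Estimated delay in ps from input rise time (ps) and load (fF)."""
        return a * t_rise_in_ps + b * c_load_ff

    # Example: a 200 ps input ramp driving 30 fF -> 0.25*200 + 4*30 = 170 ps.
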
CSL-TR-95-672

Report Number: CSL-TR-95-673
Institution: Stanford University, Computer Systems Laboratory
Title: Informing Loads: Enabling Software to Observe and React to Memory Behavior
Author: Horowitz, Mark
Author: Martonosi, Margaret
Author: Mowry, Todd C.
Author: Smith, Michael D.
Date: July 1995
Abstract: Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism to observe memory behavior directly. To fill this need, we propose a new set of memory operations called informing memory operations, and in particular, we describe the design and functionality of an informing load instruction. This instruction serves as a primitive that allows the software to observe cache misses and to act upon this information inexpensively (i.e. under the miss, when the processor would typically be idle) within the current software context. Informing loads enable new solutions to several important software problems. We demonstrate this through examples that show their usefulness in (i) the collection of fine-grained memory profiles with high precision and low overhead and (ii) the automatic improvement of memory system performance through compiler techniques that take advantage of cache-miss information. Overall, we find that the apparent benefit of an informing load instruction is quite high, while the hardware cost of this functionality is quite modest. In fact, the bulk of the required hardware support is already present in today's high-performance processors.
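
A toy sketch of the primitive's contract (the Python interface below is invented for illustration; the report defines the actual instruction-level semantics):

    # An informing load behaves like an ordinary load but also tells the
    # software whether the access missed, so a handler can run "under" the
    # miss. 'cache' and 'memory' are toy stand-ins.
    def informing_load(addr, cache, memory, on_miss=None):
        hit = addr in cache
        if not hit:
            cache.add(addr)              # model the line fill
            if on_miss is not None:
                on_miss(addr)            # e.g., fine-grained miss profiling
        return memory[addr], hit

    # Usage: count misses per address with low overhead.
    memory = {0x100: 42}
    cache = set()
    miss_counts = {}
    value, hit = informing_load(0x100, cache, memory,
                                on_miss=lambda a: miss_counts.update(
                                    {a: miss_counts.get(a, 0) + 1}))
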
CSL-TR-95-673

Report Number: CSL-TR-95-674
Institution: Stanford University, Computer Systems Laboratory
Title: Three Concepts of System Architecture
Author: Luckham, David C.
Author: Vera, James
Author: Meldal, Sigurd
Date: July 1995
Abstract: An architecture is a specification of the components of a system and the communication between them. Systems are constrained to conform to an architecture. An architecture should guarantee certain behavioral properties of a conforming system, i.e., one whose components are configured according to the architecture. An architecture should also be useful in various ways during the process of building a system. This paper presents three alternative concepts of architecture: object connection architecture, interface connection architecture, and plug and socket architecture. We describe different concepts of interface and connection that are needed for each of the three kinds of architecture, and different conformance requirements of each kind. Simple examples are used to compare the usefulness of each kind of architecture in guaranteeing properties of conforming systems, and in correctly modifying a conforming system. In comparing the three architecture concepts the principle of communication integrity becomes central, and two new architecture concepts, duality of sub-interfaces (services) and connections of dual services (service connection), are introduced to define plug and socket architecture. We describe how these concepts reduce the complexity of architecture definitions, and can in many cases help guarantee that the components of a conforming system communicate correctly. The paper is presented independently of any particular formalism, since the concepts can be represented in widely differing architecture definition formalisms, varying from graphical languages to event-based simulation languages.
CSL-TR-95-674

Report Number: CSL-TR-95-675
Institution: Stanford University, Computer Systems Laboratory
Title: An Analysis of Division Algorithms and Implementations
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: July 1995
Abstract: Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, the increasing emphasis on high performance graphics and the industry-wide usage of performance benchmarks forces processor designers to pay close attention to all aspects of floating-point computation. Many algorithms are suitable for implementing division in hardware. This paper presents four major classes of algorithms in a unified framework, namely digit recurrence, functional iteration, very high radix, and variable latency. Digit recurrence algorithms, the most common of which is SRT, use subtraction as the fundamental operator, and they converge to a quotient linearly. Division by functional iteration converges to a quotient quadratically using multiplication. Very high radix division algorithms are similar to digit recurrence algorithms, but they incorporate multiplication to reduce the latency. Variable latency division algorithms reduce the average latency to form the quotient. These algorithms are explained and compared in this work. It is found that for low-cost implementations where chip area must be minimized, digit recurrence algorithms are suitable. An implementation of division by functional iteration can provide the lowest latency for typical multiplier latencies. Variable latency algorithms show promise for simultaneously minimizing average latency while also minimizing area.
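
For the functional-iteration class, the standard Newton-Raphson reciprocal iteration (textbook material, not code from the paper) shows the quadratic convergence the abstract mentions:

    # Newton-Raphson iteration for 1/d:  x_{k+1} = x_k * (2 - d * x_k).
    # The relative error squares on each pass, so a few multiplies replace
    # the many subtract/shift steps of a digit-recurrence (e.g., SRT) divider.
    def nr_reciprocal(d, iterations=3):
        """Reciprocal of d for d normalized into [1, 2)."""
        assert 1.0 <= d < 2.0
        x = 48.0 / 17.0 - (32.0 / 17.0) * d   # classic linear seed,
        for _ in range(iterations):           # max relative error 1/17
            x = x * (2.0 - d * x)
        return x

    # Division n/d then costs one extra multiply: n * nr_reciprocal(d).
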
CSL-TR-95-675

Report Number: CSL-TR-95-676
Institution: Stanford University, Computer Systems Laboratory
Title: The COOL Parallel Programming Language: Design, Implementation, and Performance
Author: Chandra, Rohit
Date: January 1995
Abstract: Effective utilization of multiprocessors requires that a program be partitioned for parallel execution, and that it execute with good data locality and load balance. Although automatic compiler-based techniques to address these concerns are attractive, they are often limited by insufficient information about the application. Explicit programmer participation is therefore necessary for programs that exploit unstructured task-level parallelism. However, support for such intervention must address the tradeoff between ease of use and providing a sufficient degree of control to the programmer. In this thesis we present the programming language COOL, which extends C++ with simple and efficient constructs for writing parallel programs. COOL is targeted towards programming shared-memory multiprocessors. Our approach emphasizes the integration of concurrency and synchronization with data abstraction. Concurrent execution is expressed through parallel functions that execute asynchronously when invoked. Synchronization for shared objects is expressed through monitors, and event synchronization is expressed through condition variables. This approach provides several benefits. First, integrating concurrency with data abstraction allows construction of concurrent data structures that have most of the complex details suitably encapsulated. Second, monitors and condition variables integrated with objects offer a flexible set of building blocks that can be used to build more complex synchronization abstractions. Synchronization operations are clearly identified through attributes and can be optimized by the compiler to reduce synchronization overhead. Finally, the object framework supports abstractions to improve the load distribution and data locality of the program. Besides these mechanisms for exploiting parallelism, COOL also provides support for the programmer to address performance issues, in the form of abstractions that can be used to supply hints about the objects referenced by parallel tasks. These hints are used by the runtime system to schedule tasks close to the objects they reference, and thereby improve data locality. The hints are easily supplied by the programmer in terms of the objects in the program, while the details of task creation and scheduling are managed transparently within the runtime system. Furthermore, the hints do not affect the semantics of the program and allow the programmer to easily experiment with different optimizations. COOL has been implemented on several shared-memory machines, including the Stanford DASH multiprocessor. We have programmed a variety of applications in COOL, including many from the SPLASH parallel benchmark suite. Our experience has been promising: the applications are easily expressed in COOL, and perform as well as hand-tuned codes using lower-level primitives. Furthermore, supplying hints has proven to be an easy and effective way of improving program performance. This thesis therefore demonstrates that (a) the simple but powerful constructs in COOL can effectively exploit task-level parallelism across a variety of application programs, (b) an object-based approach improves both the expressiveness and the performance of parallel programs, and (c) improving data locality can be simple through a combination of programmer abstractions and smart scheduling mechanisms.
CSL-TR-95-676

Report Number: CSL-TR-95-677
Institution: Stanford University, Computer Systems Laboratory
Title: SPARC-V9 Architecture Specification with Rapide
Author: Santoro, Alexandre
Author: Park, Woosang
Author: Luckham, David
Date: September 1995
Abstract: This report presents an approach to creating an executable standard for the SPARC-V9 instruction set architecture using Rapide-1.0, a language for modeling and prototyping distributed systems. It describes the desired characteristics of a formal specification of the architecture and shows how Rapide can be used to build a model with these characteristics. This is followed by the description of a simple prototype of the proposed model, and a discussion of the issues involved in building and testing the complete specification (with emphasis on some Rapide-specific features such as constraints, causality and mapping). The report concludes with a brief evaluation of the proposed model and suggestions on future areas of research.
CSL-TR-95-677

Report Number: CSL-TR-95-678
Institution: Stanford University, Computer Systems Laboratory
Title: Fast Volume Rendering Using a Shear-Warp Factorization of the Viewing Transformation
Author: Lacroute, Philippe
Date: September 1995
Abstract: Volume rendering is a technique for visualizing 3D arrays of sampled data. It has applications in areas such as medical imaging and scientific visualization, but its use has been limited by its high computational expense. Early implementations of volume rendering used brute-force techniques that require on the order of 100 seconds to render typical data sets on a workstation. Algorithms with optimizations that exploit coherence in the data have reduced rendering times to the range of ten seconds but are still not fast enough for interactive visualization applications. In this thesis we present a family of volume rendering algorithms that reduces rendering times to one second. First we present a scanline-order volume rendering algorithm that exploits coherence in both the volume data and the image. We show that scanline-order algorithms are fundamentally more efficient than commonly-used ray casting algorithms because the latter must perform analytic geometry calculations (e.g. intersecting rays with axis-aligned boxes). The new scanline-order algorithm simply streams through the volume and the image in storage order. We describe variants of the algorithm for both parallel and perspective projections and a multiprocessor implementation that achieves frame rates of over 10 Hz. Second we present a solution to a limitation of existing volume rendering algorithms that use coherence accelerations: they require an expensive preprocessing step every time the volume is classified (i.e. when opacities are assigned to the samples), thereby limiting the usefulness of the algorithms for interactive applications. We introduce a data structure for encoding spatial coherence in unclassified volumes. When combined with our rendering algorithm this data structure allows us to build a fully-interactive volume visualization system.
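
The factorization itself is compact; below is a hedged numpy sketch for the parallel-projection case (the axis choice and matrix conventions are assumptions):

    import numpy as np

    # Factor a parallel-projection viewing matrix as M_view = M_warp @ M_shear:
    # the shear makes viewing rays perpendicular to the volume slices, so
    # compositing can stream through memory in storage order; the leftover
    # distortion is fixed by a cheap 2D warp of the intermediate image.
    def shear_warp_factor(m_view, view_dir):
        dx, dy, dz = view_dir
        assert dz != 0, "pick the principal axis with the largest |component|"
        m_shear = np.array([[1.0, 0.0, -dx / dz, 0.0],
                            [0.0, 1.0, -dy / dz, 0.0],
                            [0.0, 0.0,      1.0, 0.0],
                            [0.0, 0.0,      0.0, 1.0]])
        m_warp = m_view @ np.linalg.inv(m_shear)
        return m_shear, m_warp
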
CSL-TR-95-678

Report Number: CSL-TR-95-679
Institution: Stanford University, Computer Systems Laboratory
Title: Measuring the Complexity of SRT Tables
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: November 1995
Abstract: This paper presents an analysis of the complexity of quotient-digit selection tables in SRT division implementations. SRT dividers use a fixed number of partial remainder and divisor bits to consult a table to select the next quotient-digit in each iteration. The complexity of these tables is a function of the radix, the redundancy, and the number of bits in the estimates of the divisor and partial remainder. This analysis derives the allowable divisor and partial remainder truncations for radix 2 through radix 32, and it quantifies the relationship between table parameters and the number of product terms in the logic equations defining the tables. By mapping the tables to a library of standard-cells, delay and area values were measured and are presented for table configurations through radix 32. The results show that: 1) Gray-coding of the quotient-digits allows for the automatic minimization of the quotient-digit selection logic equations. 2) Using a short carry-assimilating adder with a few more input bits than output bits can reduce table complexity. 3) Reducing the number of bits in the partial remainder estimate and increasing the length of the divisor estimate increases the size and delay of the table, offsetting any performance gain due to the shorter external adder. 4) While delay increases nearly linearly with radix, area increases quadratically, limiting practical table implementations to radix 2 and radix 4.
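
A back-of-the-envelope sketch of the size relationship (the specific bit counts below are illustrative; the report derives the allowable truncations per radix and redundancy):

    import math

    # An SRT selection table is indexed by truncated estimates of the
    # partial remainder and divisor; each entry is one signed quotient digit.
    def table_entries(pr_bits, div_bits):
        return 2 ** (pr_bits + div_bits)

    def digit_bits(a):
        # a redundant digit set {-a, ..., +a} has 2a + 1 values
        return math.ceil(math.log2(2 * a + 1))

    # e.g., a 7-bit remainder estimate with 3 divisor bits indexes
    # table_entries(7, 3) == 1024 entries (illustrative numbers only).
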
CSL-TR-95-679

Report Number: CSL-TR-95-680
Institution: Stanford University, Computer Systems Laboratory
Title: Designing a Multicast Switch Scheduler
Author: Prabhakar, Balaji
Author: McKeown, Nick W.
Date: November 1995
Abstract: This paper presents the design of the scheduler for an M x N input-queued switch. It is assumed that each input maintains a single queue for arriving multicast cells and that only the cell at the head of line (HOL) can be observed and scheduled at one time. The scheduler is required to be work-conserving, which means that no output port may be idle as long as there is an input cell destined to it. Furthermore, the scheduler is required to be fair, which means that no input cell may be held at HOL for more than M cell times (M is the number of input ports). The aim is to find a work-conserving, fair policy that delivers maximum throughput and minimizes input queue latency. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. It is demonstrated that a policy which always concentrates the residue, subject to our fairness constraint, always outperforms all other policies. We present one such policy, called TATRA, and analyze it geometrically. We also present a heuristic round-robin policy called mRRM that is simple to implement in hardware, fair, and performs quite well when compared to a concentrating algorithm.
CSL-TR-95-680

Report Number: CSL-TR-95-681
Institution: Stanford University, Computer Systems Laboratory
Title: Netlist Processing for Custom VLSI via Pattern Matching
Author: Chanak, Thomas Stephen
Date: November 1995
Abstract: A vast array of CAD tools are available to support the design of integrated circuits. Unfortunately, tool development lags advances in technology and design methodology - the newest, most aggressive custom chips confront design issues that were not anticipated by the currently available set of tools. When existing tools cannot fill a custom design's needs, a new tool must be developed, often in a hurry. This situation arises fairly often, and many of the tools created use, or imply, some method of netlist pattern recognition. If the pattern-oriented facet of these tools could be isolated and unified among a variety of tools, custom tool writers would have a useful building block to start with when confronted with the urgent need for a new tool. Starting with the UNIX pattern-matching, text-processing tool AWK as a model, a pattern-action processing environment was built to test the concept of writing CAD tools by specifying patterns and actions. After implementing a wide variety of netlist processing applications, the refined pattern-action system proved to be a useful and fast way to implement new tools. Previous work in this area had reached the same conclusion, demonstrating the usefulness of pattern recognition for electrical rules checking, simulation, database conversion, and more. Our experiments identified a software building block, the "pattern object", that can construct the operators proposed in other works while maintaining flexibility in the face of changing requirements through the decoupling of global control from a pattern matching engine. The implicit computation of subgraph isomorphism common to pattern matching systems was thought to be a potential runtime performance issue. Our experience contradicts this concern. VLSI netlists tend to be sparse enough that runtimes do not grow unreasonably when a sensible amount of care is taken. Difficulties with the verification of pattern based tools, not performance, present the greatest obstacle to pattern-matching tools. Pattern objects that modify netlists raise the prospect of order dependencies and subtle interactions among patterns, and this interaction is what causes the most difficult verification problems. To combat this problem, a technique that considers an application's entire set of pattern objects and a specific target netlist together can perform analyses that expose otherwise subtle errors. This technique, along with debugging tools built specifically for pattern objects and netlists, allows the construction of trustworthy applications.
CSL-TR-95-681

Report Number: CSL-TR-95-682
Institution: Stanford University, Computer Systems Laboratory
Title: High Performance Cache Architectures to Support Dynamic Superscalar Microprocessors
Author: Wilson, Kenneth M.
Author: Olukotun, Kunle
Date: June 1995
Abstract: Simple cache structures are not sufficient to provide the memory bandwidth needed by a dynamic superscalar computer, so more sophisticated memory hierarchies such as non-blocking and pipelined caches are required. To provide direction for the designers of modern high performance microprocessors, we investigate the performance tradeoffs of the combinations of cache size, blocking and non-blocking caches, and pipeline depth of caches within the memory subsystem of a dynamic superscalar processor for integer applications. The results show that the dynamic superscalar processor can hide about two-thirds of the additional latency of caches pipelined two and three deep, and that a non-blocking cache is always beneficial. A pipelined cache will only outperform a non-pipelined cache if the miss penalty and miss rates are large.
CSL-TR-95-682

Report Number: CSL-TR-95-683
Institution: Stanford University, Computer Systems Laboratory
Title: A Comparison of Hardware Prefetching Techniques For Multimedia Benchmarks
Author: Zucker, Daniel F.
Author: Flynn, Michael J.
Author: Lee, Ruby B.
Date: December 1995
Abstract: Data prefetching is a well known technique for improving cache performance. While several studies have examined prefetch strategies for scientific and commercial applications, no published work has studied the special memory requirements of multimedia applications. This paper presents data for three types of hardware prefetching schemes: stream buffers, stride prediction tables, and a hybrid combination of the two, the stream cache. Use of the stride prediction table is shown to eliminate up to 90% of the misses that would otherwise be incurred in a moderate or large sized cache with no prefetching hardware. The stream cache, proposed for the first time in this paper, has the potential to cut execution times by more than half by the addition of a relatively small amount of additional hardware.
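
A minimal sketch of one of the three schemes, the stride prediction table (the table geometry and the repeat-to-confirm rule here are assumptions):

    # Stride prediction table indexed by load PC: remember the last address
    # and stride; once the same stride repeats, prefetch last + stride.
    class StridePredictionTable:
        def __init__(self):
            self.entries = {}   # pc -> (last_addr, stride, confident)

        def access(self, pc, addr):
            prefetch = None
            if pc in self.entries:
                last, stride, confident = self.entries[pc]
                new_stride = addr - last
                if confident and new_stride == stride:
                    prefetch = addr + stride          # issue a prefetch
                self.entries[pc] = (addr, new_stride, new_stride == stride)
            else:
                self.entries[pc] = (addr, 0, False)
            return prefetch

    # A regular a[i] walk trains an entry in two accesses, then prefetches.
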
CSL-TR-95-683

Report Number: CSL-TR-95-684
Institution: Stanford University, Computer Systems Laboratory
Title: Performance/Area Tradeoffs in Booth Multipliers
Author: Al-Twaijry, Hesham
Author: Flynn, Michael J.
Date: November 1995
Abstract: Booth encoding is a method of reducing the number of summands required to produce the multiplication result. This paper compares the performance/area tradeoffs for the different Booth algorithms when trees are used as the summation network. This paper shows that the simple non-Booth algorithm is not a viable design, and that currently Booth 2 is the best design. It also points out that in the future Booth 3 may offer the best performance/area ratio.
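
For reference, radix-4 (Booth 2) recoding is standard: overlapping 3-bit windows of the multiplier become digits in {-2, ..., 2}, halving the summand count. The sketch below is the textbook formulation, not code from the paper:

    # Radix-4 Booth recoding of an n-bit two's-complement multiplier y:
    # window (y_{i+1}, y_i, y_{i-1}) yields digit -2*y_{i+1} + y_i + y_{i-1}.
    def booth2_digits(y, n_bits):
        assert n_bits % 2 == 0
        shifted = (y & ((1 << n_bits) - 1)) << 1   # make y_{-1} = 0 addressable
        digits = []
        for i in range(0, n_bits, 2):
            b0 = (shifted >> i) & 1          # y_{i-1}
            b1 = (shifted >> (i + 1)) & 1    # y_i
            b2 = (shifted >> (i + 2)) & 1    # y_{i+1}
            digits.append(-2 * b2 + b1 + b0)
        return digits                        # digit k has weight 4**k

    # Check: sum(d * 4**k for k, d in enumerate(booth2_digits(6, 8))) == 6
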
CSL-TR-95-684

Report Number: CSL-TR-95-685
Institution: Stanford University, Computer Systems Laboratory
Title: Memory Consistency Models for Shared-Memory Multiprocessors
Author: Gharachorloo, Kourosh
Date: December 1995
Abstract: The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of sacrificing performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems. This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher level abstraction to the programmer. We show that with only a few types of information supplied by the programmer, an implementation can exploit the full range of optimizations enabled by previous models. Furthermore, the same information enables automatic and efficient portability across a wide range of implementations. To expose the optimizations enabled by a model, we have developed a formal framework for specifying the low-level ordering constraints that must be enforced by an implementation. Based on these specifications, we present a wide range of architecture and compiler implementation techniques for efficiently supporting a given model. Finally, we evaluate the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications. Our results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads (i.e., processor stalls on reads), with gains as high as 80%. Architectures with non-blocking reads can further exploit relaxed models to hide a substantial fraction of the read latency as well, leading to a larger overall performance benefit. Furthermore, these optimizations complement gains from other latency hiding techniques such as prefetching and multiple contexts. We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems.
CSL-TR-95-685

Report Number: CSL-TR-95-686
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Synthesis of Burst-Mode Asynchronous Controllers
Author: Nowick, Steven Mark
Date: December 1995
Abstract: Asynchronous design has enjoyed a revival of interest recently, as designers seek to eliminate penalties of traditional synchronous design. In principle, asynchronous methods promise to avoid overhead due to clock skew, worst-case design assumptions and resynchronization of asynchronous external inputs. In practice, however, many asynchronous design methods suffer from a number of problems: unsound algorithms (implementations may have hazards), harsh restrictions on the range of designs that can be handled (single-input changes only), incompatibility with existing design styles and inefficiency in the resulting circuits. This thesis presents a new locally-clocked design method for the synthesis of asynchronous controllers. The method has been automated, is proven correct and produces high-performance implementations which are hazard-free at the gate-level. Implementations allow multiple-input changes and handle a relatively unconstrained class of behaviors (called "burst-mode" specifications). The method produces state-machine implementations with a minimal or near-minimal number of states. Implementations can be easily built in such common VLSI design styles as gate-array, standard cell and full-custom. Realizations typically have the latency of their combinational logic. A complete set of state and logic minimization algorithms has been developed and automated for the synthesis method. The logic minimization algorithm differs from existing algorithms since it generates two-level minimized logic which is also hazard-free. The synthesis program is used to produce competitive implementations for several published designs. In addition, a large real-world controller is designed as a case study: an asynchronous second-level cache controller for a new RISC processor.
CSL-TR-95-686

Report Number: CSL-TR-96-687
Institution: Stanford University, Computer Systems Laboratory
Title: Latency Tolerance for Dynamic Processors
Author: Bennett, James E.
Author: Flynn, Michael J.
Date: January 1996
Abstract: While a number of dynamically scheduled processors have recently been brought to market, work on hardware techniques for tolerating memory latency has mostly targeted statically scheduled processors. This paper attempts to remedy this situation by examining the applicability of hardware latency tolerance techniques to dynamically scheduled processors. The results so far indicate that the inherent ability of the dynamically scheduled processor to tolerate memory latency reduces the need for additional hardware such as stream buffers or stride prediction tables. However, the technique of victim caching, while not usually considered as a latency tolerating technique, proves to be quite effective in aiding the dynamically scheduled processor in tolerating memory latency. For a fixed size investment in microprocessor chip area, the victim cache outperforms both stream buffers and stride prediction.
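
A small sketch of the victim-cache mechanism (fully associative, FIFO replacement; the size and policy are illustrative, not the paper's fixed-area configuration):

    from collections import OrderedDict

    # A victim cache is a small fully associative buffer holding lines just
    # evicted from the main cache; it turns many conflict misses into fast
    # swaps instead of full memory accesses.
    class VictimCache:
        def __init__(self, n_lines=8):
            self.n = n_lines
            self.lines = OrderedDict()       # insertion order ~ FIFO

        def insert_victim(self, tag):
            if len(self.lines) >= self.n:
                self.lines.popitem(last=False)   # drop the oldest victim
            self.lines[tag] = True

        def probe(self, tag):
            """Called on a main-cache miss; True means the line was found
            here and can be swapped back without going to memory."""
            return self.lines.pop(tag, None) is not None
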
CSL-TR-96-687

Report Number: CSL-TR-96-688
Institution: Stanford University, Computer Systems Laboratory
Title: OS Support for Improving Data Locality on CC-NUMA Compute Servers
Author: Verghese, Ben
Author: Devine, Scott
Author: Gupta, Anoop
Author: Rosenblum, Mendel
Date: February 1996
Abstract: The dominant architecture for the next generation of cache-coherent shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote memory. However, the access latency to remote memory is 3-5 times the latency to local memory. Given the large remote access latencies, data locality is potentially the most important performance issue. In compute-server workloads, where processes are moved between nodes for load balancing, the OS needs to perform page migration and page replication to maintain data locality. Through trace analysis and actual runs of realistic workloads, we study the potential improvements in performance provided by OS-supported dynamic migration and replication. Analyzing our kernel-based implementation of the policy, we provide a detailed breakdown of the costs and point out the functions using the most time. We study alternatives to using full cache-miss information to drive the policy, and show that sampling of cache misses can be used to reduce cost without compromising performance, and that TLB misses are an inconsistent approximation for cache misses. Finally, our workload runs show that OS-supported dynamic page migration and page replication can substantially increase performance, by as much as 29% in some workloads.
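
A hedged sketch of the general shape of such a policy (the thresholds and the dominance test are placeholders; the paper analyzes a real kernel implementation):

    # Per-page, per-node miss counters drive the decision: migrate a page
    # dominated by one remote node, replicate a page read-shared by many.
    def page_action(miss_counts, home_node, hot_threshold=64, read_mostly=True):
        # miss_counts: node id -> cache misses observed for this page
        total = sum(miss_counts.values())
        if total < hot_threshold:
            return "leave"                   # not hot enough to act on
        top_node, top = max(miss_counts.items(), key=lambda kv: kv[1])
        if top_node == home_node:
            return "leave"                   # already local to main sharer
        if top > 0.8 * total:
            return ("migrate", top_node)     # one dominant remote node
        if read_mostly:
            return "replicate"               # safe only for read-shared pages
        return "leave"
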
CSL-TR-96-688

Report Number: CSL-TR-96-689
Institution: Stanford University, Computer Systems Laboratory
Title: A Variable Latency Pipelined Floating-Point Adder
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: February 1996
Abstract: Addition is the most frequent floating-point operation in modern microprocessors. Due to its complex shift-add-shift-round dataflow, floating-point addition can have a long latency. To achieve maximum system performance, it is necessary to design the floating-point adder to have minimum latency, while still providing maximum throughput. This paper proposes a new floating-point addition algorithm which exploits the ability of dynamically-scheduled processors to utilize functional units which complete in variable time. By recognizing that certain operand combinations do not require all of the steps in the complex addition dataflow, the average latency is reduced. Simulation on SPECfp92 applications demonstrates that a speedup in average addition latency of 1.33 can be achieved using this algorithm, while still maintaining single cycle throughput.
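
A toy classification of the idea (the specific cases and cycle classes below are invented for illustration; the paper's analysis of which operand combinations skip which stages is more careful):

    # Classify an FP addition by how much of the shift-add-shift-round
    # dataflow it actually needs; a dynamically scheduled core can retire
    # the cheap cases early. Assumes double precision (53-bit significands).
    def fp_add_latency_class(exp_a, exp_b, effective_subtract):
        diff = abs(exp_a - exp_b)
        if diff > 53:
            return 1   # result is essentially the larger operand: the
                       # alignment shift and wide add can be skipped
        if effective_subtract and diff <= 1:
            return 2   # "near" case: massive cancellation is possible,
                       # but only a tiny alignment shift is needed
        return 3       # general case: full align, add, normalize, round
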
CSL-TR-96-689

Report Number: CSL-TR-96-690
Institution: Stanford University, Computer Systems Laboratory
Title: Analysis and Synthesis of Concurrent Digital Systems Using Control-Flow Expressions
Author: Coelho, Claudionor Jose Nunes Jr.
Date: March 1996
Abstract: We present in this thesis a modeling style and control synthesis technique for system-level specifications that are better described as a set of concurrent descriptions, their synchronizations and complex constraints. For these types of specifications, conventional synthesis tools will not be able to enforce design constraints because these tools are targeted to sequential components with simple design constraints. In order to generate controllers satisfying the constraints of system-level specifications, we propose a synthesis tool called Thalia that considers the degrees of freedom introduced by the concurrent models and by the system's environment. The synthesis procedure will be subdivided into the following steps: We first model the specification in an algebraic formalism called control-flow expressions, that considers most of the language constructs used to model systems reacting to their environment, i.e. sequential, alternative, concurrent, iterative, and exception handling behaviors. Such constructs are found in languages such as C, Verilog HDL, VHDL, Esterel and StateCharts. Then, we convert this model and a suitable representation for the environment into a finite-state machine, where the system is analyzed, and design constraints such as timing, resource and synchronization are incorporated. In order to generate the control-units for the design, we present two scheduling procedures. The first procedure, called static scheduling, attempts to find fixed schedules for operations satisfying system-level constraints. The second procedure, called dynamic scheduling, attempts to synchronize concurrent parts of a circuit description by dynamically selecting schedules according to a global view of the system.
CSL-TR-96-690

Report Number: CSL-TR-96-691
Institution: Stanford University, Computer Systems Laboratory
Title: PPP: A Gate-Level Power Simulator - A World Wide Web Application
Author: Bogliolo, Alessandro
Author: Benini, Luca
Author: DeMicheli, Giovanni
Author: Ricco, Bruno
Date: March 1996
Abstract: Power consumption is an increasingly important constraint for complex ICs. Accurate and efficient power estimates are required at every level of abstraction to steer the design process. PPP is a Web-based integrated environment for synthesis and simulation of low-power CMOS circuits. We describe the simulation engine of PPP and propose a new paradigm for tool integration. The simulation engine of PPP is a gate-level simulator that achieves accuracy comparable to electrical simulation, while keeping performance competitive with traditional gate-level techniques. This is done by using advanced symbolic models of the basic library cells that exploit an understanding of the main phenomena involved in power consumption. To maintain full compatibility with gate-level design tools, we use VERILOG-XL as the simulation platform. The accuracy obtained on benchmark circuits is always within 6% of SPICE, even for single-gate/single-pattern power analysis, thus providing the local information needed to optimize the design. Interface and tool integration issues have been addressed using a Web-based approach. The graphical interface of PPP is a dynamically generated tree of interactive HTML pages that allow the user to access and execute the tool over the Internet using his/her own Web browser. No software installation is required, and all details of data transfer and tool communication are hidden from the user.
CSL-TR-96-691

Report Number: CSL-TR-96-694
Institution: Stanford University, Computer Systems Laboratory
Title: Analysis and Synthesis of Concurrent Digital Circuits Using Control-Flow Expressions
Author: Coelho, Claudionor Nunes Jr.
Author: DeMicheli, Giovanni
Date: April 1996
Abstract: We present in this paper a novel modeling style and control synthesis technique for system-level specifications that are better described as a set of concurrent descriptions, their synchronizations and constraints. The proposed synthesis procedure considers the degrees of freedom introduced by the concurrent models and by the environment in order to satisfy the design constraints. Synthesis is divided into two phases. In the first phase, the original specification is translated into an algebraic system, for which complex control-flow constraints and quantifiers of the design are introduced. In the second phase, we translate the algebraic formulation into a finite-state representation, and we derive an optimal control-unit implementation for each individual concurrent part. In the implementation of the controllers from the finite-state representation, we use flexible objective functions, which allow designers to better control the goals of the synthesis tool, and thus incorporate as much as possible their knowledge about the environment and the design.
CSL-TR-96-694

Report Number: CSL-TR-96-692
Institution: Stanford University, Computer Systems Laboratory
Title: Delay Balancing of Wave Pipelined Multiplier Counter Trees Using Pass Transistor Multiplexers
Author: Kishigami, Hidechika
Author: Nowka, Kevin J.
Author: Flynn, Michael J.
Date: January 1996
Abstract: Wave pipelining is an attractive technique used in high-speed digital circuits to increase the pipeline clock rate by eliminating the synchronizing elements between pipeline stages. Wave pipelining has been successfully applied to the design of CMOS multipliers, which have demonstrated clock-rate speed-ups of 4 to 7 times over their non-pipelined designs. In order to achieve a high clock rate using wave-pipelining techniques, it is necessary to equalize (balance) all signal path delays of the circuit. In an earlier study, a multiplier was designed using only 2-input NAND gates and inverters as primitives in order to reduce the delay variations of the circuit. Alternatively, several reports use pass-transistor logic as the primitive for multipliers to achieve very low latency, and pass-transistor logic seems attractive for reducing circuit delay variations. In this report we describe the design of a wave-pipelined counter tree, the central part of a parallel multiplier, and detail a method to balance the delay of a (4,2) counter using pass-transistor multiplexers (PTMs) as primitives to achieve both a higher clock rate and smaller latency. Simulations of the wave-pipelined counter tree demonstrated a 0.8ns clock cycle and 2.33ns latency through the use of pass-transistor multiplexers for a 0.8$\mu$m CMOS process. This data suggests that using pass-transistor multiplexers as primitives for wave-pipelined circuits helps achieve both a higher clock rate and lower latency. (A functional sketch of a mux-based (4,2) counter follows this entry.)
CSL-TR-96-692
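
The delay-balancing idea is easiest to see in a functional model where every logic node is a 2:1 multiplexer, the PTM primitive. This is a minimal sketch of one standard (4,2) counter decomposition, not the report's transistor-level design:

    from itertools import product

    def mux(sel, d0, d1):          # models a pass-transistor 2:1 multiplexer
        return d1 if sel else d0

    def xor(a, b):                 # XOR folded onto one mux with a and its complement
        return mux(b, a, 1 - a)

    def counter42(x1, x2, x3, x4, cin):
        """(4,2) counter: x1 + x2 + x3 + x4 + cin == sum + 2*carry + 2*cout."""
        p = xor(x1, x2)
        t = xor(xor(p, x3), x4)
        s = xor(t, cin)            # sum output
        carry = mux(t, x4, cin)    # every output is a short, similar mux path
        cout = mux(p, x1, x3)
        return s, carry, cout

    # Exhaustive check of the arithmetic identity.
    for bits in product((0, 1), repeat=5):
        s, c, co = counter42(*bits)
        assert sum(bits) == s + 2 * c + 2 * co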

Report Number: CSL-TR-96-693
Institution: Stanford University, Computer Systems Laboratory
Title: High-Performance CMOS System Design Using Wave Pipelining
Author: Nowka, Kevin J.
Date: January 1996
Abstract: Wave pipelining, or maximum rate pipelining, is a circuit design technique that allows digital synchronous systems to be clocked at rates higher than can be achieved with conventional pipelining techniques. It relies on the predictable, finite signal propagation delay through combinational logic for virtual data storage. Wave pipelining of combinational circuits has been shown to achieve clock rates 2 to 7 times those possible for the same circuits with conventional pipelining. Conventional pipelined systems allow data to propagate from a register through the combinational network to another register prior to initiating the subsequent data transfer. Thus, the maximum operating frequency is determined by the maximum propagation delay through the longest pipeline stage. Wave pipelined systems apply the subsequent data to the network as soon as it can be guaranteed that it will not interfere with the current data wave. The maximum operating frequency of a wave pipeline is therefore determined by the difference between the maximum and minimum propagation delays through the combinational logic; by minimizing variations in delay, the performance of wave pipelining is maximized. Data wave interference in CMOS VLSI circuits is the result of variation in the propagation delay due to path length differences, differences in the state of the network inputs and intermediate nodes, and differences in fabrication and environmental conditions. To maximize the performance of wave pipelined circuits, the path length variations through the combinational logic must be minimized. A method of modifying the transistor geometries of individual static CMOS gates so as to tune their delays has been developed. This method is used by CAD tools that minimize the path length variation. These tools are used to equalize delays within a wave pipelined logic block and to synchronize separate wave pipelined units which share a common reference clock. This method has been demonstrated to limit the variation in delay of CMOS circuits to less than 20%. Delay models have demonstrated that temperature variation, power supply variations, and noise limit the number of concurrent waves in CMOS wave pipelined systems to three or fewer. Run-to-run process variation can have a significant impact on CMOS VLSI signal propagation delay. The ratio of maximum to minimum delay along the same path for seven different runs of a 0.8-micron feature size fabrication process was found to be 1.35. Unless this variation is controlled, the speedup of wave pipelining is limited to two to three to ensure that devices from any of these runs will operate. When aggregated with variations due to environmental factors, the maximum speed-up of a wave pipeline is less than two. To counteract the effects of process variation, an adaptive supply voltage technique has been developed. An on-chip detector circuit determines when delays are faster than the nominal delays, and the power supply is lowered accordingly. In this manner, ICs fabricated with fast processes are run at a lower supply voltage to ensure correct operation at the design target frequency. To demonstrate that wave pipeline technology can be applied to VLSI system design, a CMOS wave pipelined vector unit has been developed. Extensive use of wave pipelining was employed to achieve high clock rates in the functional units. The VLSI processor consists of a wave pipelined vector register file, a wave pipelined adder, a wave pipelined multiplier, load and store units, an instruction buffer, a scoreboard, and control logic. The VLSI vector unit contains approximately 47000 transistors and occupies an area of 43 sq mm. It has been fabricated in a 0.8-micron CMOS technology. Tests indicate wave pipelined operation at a maximum rate of 303MHz. An equivalent vector unit design using traditional latch-based pipelining was designed and simulated. The latch-based design occupied 2% more die area, operated with a 35% longer clock period, and had a multiply latency 8% longer and an add latency 11% longer than the wave pipelined vector unit. (A back-of-the-envelope sketch of the wave-pipelining rate limit follows this entry.)
CSL-TR-96-693
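
A back-of-the-envelope sketch of the rate limit described above, assuming the usual first-order wave-pipelining constraint; the register overhead, skew, and path delays below are invented for illustration:

    def clock_rates_mhz(t_max_ns, t_min_ns, t_reg_ns=0.3, t_skew_ns=0.2):
        """Conventional pipelining is paced by the maximum delay; wave
        pipelining by the delay *spread* (t_max - t_min) plus overheads."""
        conventional = 1e3 / (t_max_ns + t_reg_ns)
        wave = 1e3 / (t_max_ns - t_min_ns + t_reg_ns + t_skew_ns)
        waves_in_flight = t_max_ns * wave / 1e3
        return conventional, wave, waves_in_flight

    conv, wave, n = clock_rates_mhz(t_max_ns=8.0, t_min_ns=6.5)
    print(f"conventional {conv:.0f} MHz, wave {wave:.0f} MHz, {n:.1f} waves in flight")
    # Tight delay balancing (small t_max - t_min) is what makes 'wave' large;
    # the thesis finds process and environmental variation cap the useful
    # speedup near 2-3x unless the supply voltage is adapted.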

Report Number: CSL-TR-96-695
Institution: Stanford University, Computer Systems Laboratory
Title: Producer-Oriented versus Consumer-Oriented Prefetching: A Comparison and Analysis of Parallel Application Programs
Author: Ohara, Moriyoshi
Date: June 1996
Abstract: Due to large remote-memory latencies, reducing the impact of cache misses is critical for large-scale shared-memory multiprocessors. This thesis quantitatively compares two classes of software-controlled prefetch schemes for reducing this impact: consumer-oriented and producer-oriented schemes. Examining the behavior of these schemes leads us to characterize the communication behavior of parallel application programs. Consumer-oriented prefetch has been shown to be effective at hiding large memory latencies. Producer-oriented prefetch (called deliver), on the other hand, has not been extensively studied. Our implementation of deliver uses a hardware mechanism that tracks the set of potential consumers based on past sharing patterns. Qualitatively, deliver has an advantage because the producer sends the datum as soon as, but not before, it is ready for use. In contrast, prefetch may fetch the datum too early, so that it is invalidated before use, or too late, so that the datum is not yet available when the consumer needs it. Our simulation results indeed show that the qualitative advantage of deliver can yield a slight performance advantage when the cache size and the memory latency are very large. Overall, however, deliver turns out to be less effective than prefetch for two reasons. First, prefetch benefits from a "filtering effect" and thus generates less traffic than deliver. Second, deliver suffers more from cache interference than prefetch. The sharing and temporal characteristics of a set of parallel applications are shown to account for the different behavior of the two schemes. This analysis shows the inherent difficulty of predicting the future communication behavior of parallel applications from the recent history of application behavior, and suggests that cache accesses involved in coherence are, in general, much less predictable from past behavior than other types of cache accesses.
CSL-TR-96-695

Report Number: CSL-TR-96-696
Institution: Stanford University, Computer Systems Laboratory
Title: Computer Assisted Analysis of Multiprocessor Memory Systems
Author: Park, Seungjoon
Date: June 1996
Abstract: In a shared memory multiprocessor architecture, a memory model describes the behavior of the memory system as observed at the user level. A cache coherence protocol aims to conform to a memory model by maintaining consistency among the multiple copies of cached data and the data in main memory. Memory models and cache coherence protocols can be quite complex and subtle, creating a real possibility of misunderstandings and actual design errors. In this thesis, we present solutions to these problems. Though weaker memory models for multiprocessor systems allow higher-performance implementation techniques, they are also very subtle. Hence, it is vital to specify memory models precisely and to verify that the programs running under a memory model satisfy desired properties. Our approach to these problems is to write an executable specification of the memory model using a high-level description language for concurrent systems. This executable description provides a precise specification of the machine architecture for implementors and programmers. Moreover, the availability of formal verification tools allows users to experiment with the effects of the memory model on small assembly-language routines. Running the verifier can be very effective at clarifying the subtle details of the models and synchronization routines. Cache coherence protocols, like other protocols for distributed systems, simulate atomic transactions in environments where atomic implementations are impossible. Based on this observation, we propose a verification method which compares an implementation with a specification representing the desired abstract behavior. The comparison is done through an aggregation function, which maps the sequence of implementation steps for each transaction to the corresponding transaction step in the specification. The aggregation approach is applied to the verification of the cache coherence protocol in the FLASH multiprocessor system. The protocol, consisting of more than a hundred implementation steps, is proved to conform to a reduced description with six kinds of atomic transactions. From the reduced behavior, it is very easy to prove crucial properties of the protocol, including data consistency of cached copies. The aggregation method is also used to prove that the reduced protocol satisfies a desired memory consistency model.
CSL-TR-96-696

Report Number: CSL-TR-96-697
Institution: Stanford University, Computer Systems Laboratory
Title: The Design of SMART: A Scheduler for Multimedia Applications
Author: Nieh, Jason
Author: Lam, Monica S.
Date: June 1996
Abstract: We have created SMART, a Scheduler for Multimedia And Real-Time applications. SMART supports both real-time and conventional computations and provides flexible and accurate control over the sharing of processor time. SMART is able to satisfy real-time constraints in an optimal manner and to provide proportional sharing across all real-time and conventional tasks. Furthermore, when not all real-time constraints can be met, SMART satisfies each real-time task's proportional share of deadlines and adjusts its execution rate dynamically. This technique is especially important for multimedia applications that can operate at different rates depending on the load. This paper presents the design of SMART and provides measured performance results of its effectiveness based on a prototype implementation in the Solaris operating system. (A toy sketch of the underlying scheduling ideas follows this entry.)
CSL-TR-96-697
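
A minimal sketch of the two policies the abstract combines, proportional sharing via virtual time plus deadline preference among feasible real-time tasks; this illustrates the idea only, not the actual SMART algorithm, and the task parameters are invented:

    class Task:
        def __init__(self, name, share, deadline=None, need=0.0):
            self.name, self.share = name, share
            self.deadline, self.need = deadline, need  # constraint, CPU time still required
            self.vtime = 0.0                           # CPU time consumed, weighted by share

    def pick(tasks, now):
        """Earliest deadline among real-time tasks that can still meet their
        constraint; otherwise fall back to virtual-time (proportional) fairness."""
        feasible = [t for t in tasks
                    if t.deadline is not None and t.need > 0
                    and now + t.need <= t.deadline]
        if feasible:
            return min(feasible, key=lambda t: (t.deadline, t.vtime))
        return min(tasks, key=lambda t: t.vtime)

    def run(tasks, quantum=1.0, ticks=10):
        now = 0.0
        for _ in range(ticks):
            t = pick(tasks, now)
            t.vtime += quantum / t.share   # larger shares accumulate vtime more slowly
            t.need = max(0.0, t.need - quantum)
            now += quantum
            print(f"t={now:4.1f}  ran {t.name}")

    run([Task("video", share=2, deadline=8.0, need=3.0),
         Task("compile", share=1)])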

Report Number: CSL-TR-96-698
Institution: Stanford University, Computer Systems Laboratory
Title: Technology Scaling Effects on Multipliers
Author: Al-Twaijry, Hesham
Author: Flynn, Michael J.
Date: July 1996
Abstract: Booth encoding is a method of reducing the number of summands required to produce a multiplication result. This paper compares the performance/area tradeoffs of the different Booth algorithms when trees are used as the summation network. It shows that the simple non-Booth algorithm is not an efficient design and that, for small feature sizes, the performance of the different Booth encoding schemes is comparable in terms of delay. The report also quantifies the effect of wires on the multiplier: as feature sizes continue to decrease, wires contribute an ever increasing portion of the total delay, and Booth 3 becomes more attractive since it is smaller. (A sketch of Booth 2 recoding follows this entry.)
CSL-TR-96-698
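
Booth recoding itself is compact enough to show directly. A minimal sketch of radix-4 (Booth 2) recoding, which roughly halves the number of summands; Booth 3 would emit one digit per three bits but requires a hard +-3x multiple:

    def booth2_digits(y, bits=8):
        """Radix-4 recoding: one digit in {-2,-1,0,1,2} per two bits,
        computed as y[i-1] + y[i] - 2*y[i+1]."""
        y &= (1 << bits) - 1
        digits, prev = [], 0
        for i in range(0, bits, 2):
            b0 = (y >> i) & 1
            b1 = (y >> (i + 1)) & 1
            digits.append(prev + b0 - 2 * b1)
            prev = b1
        return digits

    def multiply(x, y, bits=8):
        """bits/2 summands instead of bits; each is 0, +-x, or +-2x, shifted."""
        return sum((d * x) << (2 * k) for k, d in enumerate(booth2_digits(y, bits)))

    assert all(multiply(x, y) == x * y for x in range(-9, 10) for y in range(128))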

Report Number: CSL-TR-96-699
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient Multiprocessor Communications: Networks, Algorithms, Simulation, and Implementation
Author: Lu, Yen-Wen
Date: July 1996
Abstract: As technology and processing power continue to improve, inter-processor communication becomes a performance bottleneck in multiprocessor networks. In this dissertation, an enhanced 2-D torus with segmented reconfigurable buses (SRB) is proposed and analyzed to overcome the delay of long-distance communications. A procedure for selecting an optimal segment length and segment alignment, based on minimizing the lifetime of a packet and reducing the interaction between segments, was developed to design an SRB network. Simulation shows that a torus with SRB is more than twice as efficient as a traditional torus. Efficient use of channel bandwidth is an important issue in improving network performance. The communication links between two adjacent nodes can be organized as a pair of opposite uni-directional channels or combined into a single bi-directional channel. A modified channel arbitration scheme with hidden delay, called "token-exchange," was designed for the bi-directional channel configuration. In spite of the overhead of channel arbitration, simulation shows that bi-directional channels have significantly better latency-throughput performance and can sustain higher data bandwidth than uni-directional channels of the same channel width. For example, under 2% hot-spot traffic, bi-directional channels can support 80% more bandwidth without saturation than uni-directional channels. An efficient, low-power, wormhole data router chip for 2-D mesh and torus networks with bi-directional channels and token-exchange arbitration was designed and implemented. The token-exchange delay is fully hidden, and no latency penalty occurs when there is no traffic contention; the token-exchange delay is also negligible when contention is high. Distributed decoders and arbiters are provided for each of the four IO ports, and a fully-connected 5x6 crossbar switch increases the parallelism of data routing. The router also provides special hardware, such as flexible header decoding and switching, to support path-based multicasting. From measured results, multicasting with two destinations used only 1/3 of the energy required for unicasting. The wormhole router was fabricated using MOSIS/HP 0.6um technology. It delivers 1.6Gb/s (50MHz) at Vdd=2.1V, consuming an average power of 15mW.
CSL-TR-96-699

Report Number: CSL-TR-96-700
Institution: Stanford University, Computer Systems Laboratory
Title: Fast IEEE Rounding for Division by Functional Iteration
Author: Oberman, Stuart F.
Author: Flynn, Michael J.
Date: July 1996
Abstract: One class of high-performance division algorithms is functional iteration. Division by functional iteration uses multiplication as the fundamental operator. The main advantage of division by functional iteration is quadratic convergence to the quotient. However, unlike non-restoring division algorithms such as SRT division, functional iteration does not directly provide a final remainder, which makes fast and exact rounding difficult. This paper clarifies the methodology for correct IEEE-compliant rounding of quadratically-converging division algorithms. It proposes an extension to previously reported techniques that uses extended precision in the computation to reduce the frequency of the back multiplications required to obtain the final remainder. Further, a technique applicable to all IEEE rounding modes is presented which replaces the final subtraction for remainder computation with very simple combinational logic. (A toy sketch of iteration plus a back-multiplication rounding check follows this entry.)
CSL-TR-96-700
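
The rounding problem the paper attacks can be seen in a toy version of the scheme. A minimal sketch using Python doubles in place of the extended-precision hardware datapath, with the back multiplication the paper works to make infrequent:

    import math

    def reciprocal(b, iterations=5):
        """Newton-Raphson: x' = x * (2 - b*x); the error squares each step."""
        assert 1.0 <= b < 2.0              # normalized significand
        x = 1.0 / 1.5                      # crude seed; hardware uses a lookup table
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return x

    def divide(a, b):
        q = a * reciprocal(b)              # quadratically converging quotient estimate
        # No remainder falls out of the iteration, so settle round-to-nearest by
        # back multiplication: pick the candidate whose product lands closest to a.
        ulp = math.ulp(q)
        return min((q - ulp, q, q + ulp), key=lambda c: abs(a - b * c))

    print(divide(1.0, 1.25), 1.0 / 1.25)   # both 0.8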

Report Number: CSL-TR-96-701
Institution: Stanford University, Computer Systems Laboratory
Title: Characterization of Quality and Traffic for Various Video Encoding Schemes and Various Encoder Control Schemes
Author: Dalgic, Ismail
Author: Tobagi, Fouad A.
Date: August 1996
Abstract: Lossy video compression algorithms, such as those used in the H.261, MPEG, and JPEG standards, result in quality degradation seen in the form of digital tiling, edge busyness, and mosquito noise. The encoder parameters (typically, the so-called quantizer scale) can be adjusted to trade off encoded video quality against bit rate: when more bits are used to represent a given scene, the quality gets better. However, for a given set of encoder parameter values, both the generated traffic and the resulting quality depend on the scene content. Therefore, in order to achieve certain quality and traffic objectives at all times, the encoder parameters must be adjusted appropriately according to the scene content. Currently, two schemes exist for setting the encoder parameters. The most commonly used scheme today is called Constant Bit Rate (CBR), where the encoder parameters are controlled to achieve a target bit rate over time by considering a hypothetical rate-control buffer at the encoder's output which is drained at the target bit rate; the buffer occupancy level is used as feedback to control the quantizer scale. In a CBR-encoded video stream, the quality varies over time, since the quantizer scale is controlled to achieve a constant bit rate regardless of the scene complexity. In the other existing scheme, called Open-Loop Variable Bit Rate (OL-VBR), all encoder parameters are simply kept fixed at all times; the motivation is presumably to provide a more consistent video quality than CBR encoding. In this report, we characterize the traffic and quality of the CBR and OL-VBR schemes by using several video sequences of different spatial and temporal characteristics, encoded using the H.261, MPEG, and motion-JPEG standards. We investigate the effect of the controller parameters (for CBR, the target bit rate and rate-control buffer size; for OL-VBR, the fixed quantizer scale) and of the video content on the resulting traffic and quality. We show that with the CBR and OL-VBR schemes, the encoder control parameters can be chosen so as to achieve or exceed a given quality objective at all times; however, this can only be done by producing more bits than needed during some of the scenes. In order to produce only as many bits as needed to achieve a given quality objective, we propose a video encoder control scheme which maintains the quality of the encoded video at a constant level, referred to as Constant Quality VBR (CQ-VBR). This scheme is based on a quantitative video quality metric which is used in a feedback control mechanism to adjust the encoder parameters. We determine the appropriate feedback functions for the H.261, MPEG, and motion-JPEG standards. We show that this scheme is indeed able to achieve constant quality at all times; however, the resulting traffic occasionally contains bursts of relatively high magnitude (5-10 times the average) but short duration (5-15 frames). We then introduce a modification to this scheme in which, in addition to the quality, the peak rate of the traffic is also controlled. We show that with the modified scheme, it is possible to achieve nearly constant video quality while keeping the peak rate within 2-3 times the average. (A toy sketch of a CBR-style feedback loop follows this entry.)
CSL-TR-96-701
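
A minimal sketch of the CBR feedback loop described above: a virtual rate-control buffer drained at the target rate steers the quantizer scale, so quality (the quantizer) varies with scene complexity. The bits-versus-quantizer model and the gain are invented for illustration:

    def cbr_control(frame_complexities, target_bits, buf_size):
        """Yield (bits, quantizer) per frame; buffer fullness drives the quantizer."""
        q, buf = 16, buf_size / 2.0
        for c in frame_complexities:
            bits = c / q                    # toy model: coarser quantizer -> fewer bits
            buf = max(0.0, min(buf + bits - target_bits, buf_size))
            q = max(1, min(31, round(31 * buf / buf_size)))
            yield bits, q

    # A complex scene in mid-sequence forces the quantizer up (quality down).
    for bits, q in cbr_control([8e5, 8e5, 4e6, 4e6, 8e5],
                               target_bits=4e4, buf_size=4e5):
        print(f"{bits:9.0f} bits  quantizer={q}")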

Report Number: CSL-TR-96-702
Institution: Stanford University, Computer Systems Laboratory
Title: Performance Evaluation of Ethernets and ATM Networks Carrying Video Traffic
Author: Dalgic, Ismail
Author: Tobagi, Fouad A.
Date: August 1996
Abstract: In this report, the performance of Ethernets (10Base-T and 100Base-T) and ATM networks carrying multimedia traffic is presented. End-to-end delay requirements suitable for a wide range of multimedia applications are considered (ranging from 20 ms to 500 ms). Given the specific nature of the network considered and the maximum latency requirement, some data is lost. Data loss at the receiver causes quality degradations in the displayed video in the form of discontinuities, referred to as glitches. We define various quantities characterizing the glitches, namely, the total amount of information lost in glitches, their duration, and the rate at which glitches occur. We study these quantities for various network and traffic scenarios, using a computer simulation model driven by real video traffic generated by encoding video sequences. We also determine the maximum number of video streams that can be supported for a given maximum delay requirement and glitch rate. We consider and compare the results for various types of video content (video conferencing, motion pictures, commercials), two encoding schemes (H.261 and MPEG-1), and two encoder control schemes [Constant Bit Rate (CBR) and Constant-Quality Variable Bit Rate (CQ-VBR)], considering also scenarios where the traffic consists of various mixtures of the above. We show that when the video content is highly variable, both 100Base-T Ethernet and ATM can support many more CQ-VBR streams than CBR streams. When the video content is not very variable, as in a videoconferencing sequence, the number of CBR and CQ-VBR streams that can be supported is comparable. For low values of the end-to-end delay requirement, we show that ATM networks can support up to twice as many video streams of a given type as Ethernets for a channel capacity of 100Mb/s. For relaxed end-to-end delay requirements, both networks can support about the same number of video streams of a given type. We also determine the number of streams supportable for traffic scenarios consisting of mixtures of heterogeneous video traffic sources, in terms of the video content, video encoding scheme, and encoder control scheme, as well as the end-to-end delay requirement. We then consider multihop ATM network scenarios and provide admission control guidelines for video when the network topology is an arbitrary mesh. Finally, we consider scenarios with mixtures of video and data traffic (with various degrees of burstiness) and determine the effect of one traffic type on the other.
CSL-TR-96-702

Report Number: CSL-TR-96-703
Institution: Stanford University, Computer Systems Laboratory
Title: Test Point Insertion for Non-Feedback Bridging Faults
Author: Touba, Nur A.
Author: McCluskey, Edward J.
Date: August 1996
Abstract: This paper studies pseudo-random pattern testing of bridging faults. Although bridging faults are generally more random pattern testable than stuck-at faults, examples are shown to illustrate that some bridging faults can be much less random pattern testable than stuck-at faults. A fast method for identifying these random-pattern-resistant bridging faults is described. It is shown that state-of-the-art test point insertion techniques, which are based on the stuck-at fault model, are inadequate. Data is presented which indicates that even after inserting test points that result in 100% single stuck-at fault coverage, many bridging faults are still not detected. A test point insertion procedure that targets both single stuck-at faults and non-feedback bridging faults is presented. It is shown that by considering both types of faults when selecting the location for test points, higher fault coverage can be obtained with little or no increase in overhead. Thus, the test point insertion procedure described here is a low-cost way to improve the quality of built-in self-test. While this paper considers only non-feedback bridging faults, the techniques that are described can be applied to feedback bridging faults in a straightforward manner.
CSL-TR-96-703

Report Number: CSL-TR-96-704
Institution: Stanford University, Computer Systems Laboratory
Title: Synthesis Techniques for Pseudo-Random Built-In Self-Test
Author: Touba, Nur A.
Date: August 1996
Abstract: Built-in self-test (BIST) techniques enable an integrated circuit (IC) to test itself. BIST reduces test and maintenance costs for an IC by eliminating the need for expensive test equipment and by allowing fast location of failed ICs in a system. BIST also allows an IC to be tested at its normal operating speed, which is very important for detecting timing faults. Despite all of these advantages, BIST has seen limited use in industry because of area and performance overhead and increased design time. This dissertation presents automated techniques for implementing BIST in a way that minimizes area and performance overhead. A low-overhead approach to BIST is to use a linear feedback shift register (LFSR) to apply pseudo-random test patterns to the circuit under test. Unfortunately, many circuits contain random-pattern-resistant faults which limit the fault coverage that can be obtained with pseudo-random BIST. Several different approaches for solving this problem are presented. A logic synthesis procedure is presented that performs testability-driven factoring to generate a random-pattern-testable design; by considering random pattern testability during the factoring process, the overhead can be minimized. For hand-designed circuits or circuits that are not synthesizable, an innovative test point insertion procedure is described for inserting test points to make the circuit random pattern testable. A path tracing procedure is used for test point placement, and a few of the existing primary inputs are ANDed together to form signals that drive the control points. These innovations result in fewer test points than previous methods. If it is not possible or not desirable to modify the circuit under test, then a procedure is described for synthesizing mapping logic that can be placed at the output of the LFSR to transform the pseudo-random patterns so that they provide the required fault coverage; much less overhead is required compared with weighted pattern testing methods. Lastly, a technique is described for placing bit-fixing logic at the serial output of an LFSR to embed deterministic test patterns for the random-pattern-resistant faults in the pseudo-random bit sequence. This method does not require any performance overhead beyond what is needed for scan. (A toy sketch of an LFSR pattern source follows this entry.)
CSL-TR-96-704
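
The pseudo-random source underlying all of these techniques is a linear feedback shift register. A minimal sketch; the taps below correspond to a standard maximal-length polynomial for 8 bits, and the mapping/bit-fixing logic the dissertation synthesizes around such a generator is not shown:

    def lfsr_patterns(seed=0x5A, taps=(7, 5, 4, 3), width=8):
        """Fibonacci LFSR: feedback is the XOR of the tap bits. A maximal-length
        polynomial visits all 2^width - 1 nonzero states before repeating."""
        state = seed
        while True:
            yield state
            feedback = 0
            for t in taps:
                feedback ^= (state >> t) & 1
            state = ((state << 1) | feedback) & ((1 << width) - 1)

    gen = lfsr_patterns()
    print([format(next(gen), "08b") for _ in range(5)])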

Report Number: CSL-TR-96-705
Institution: Stanford University, Computer Systems Laboratory
Title: Rapide: A Language and Toolset for Simulation of Distributed Systems by Partial Orderings of Events
Author: Luckham, David C.
Date: September 1996
Abstract: This paper describes the RAPIDE concepts of system architecture, causal event simulation, and some of the tools for viewing and analyzing causal event simulations. The language and tools are illustrated with a small, detailed example.
CSL-TR-96-705

Report Number: CSL-TR-96-706
Institution: Stanford University, Computer Systems Laboratory
Title: Optimum Placement and Routing of Multiplier Partial Product Trees
Author: Al-Twaijry, Hesham
Author: Flynn, Michael J.
Date: September 1996
Abstract: An algorithm is presented that builds a multiplier under the constraint of a limited number of wiring tracks. The algorithm has been implemented, and the resulting program is used to compare several designs of an IEEE floating-point multiplier using several delay models.
CSL-TR-96-706

Report Number: CSL-TR-96-707
Institution: Stanford University, Computer Systems Laboratory
Title: Reducing Cache Miss Rates Using Prediction Caches
Author: Bennett, James E.
Author: Flynn, Michael J.
Date: October 1996
Abstract: Processor cycle times are currently much faster than memory cycle times, and the trend has been for this gap to increase over time. The problem of increasing memory latency, relative to processor speed, has been dealt with by adding high-speed cache memory. However, it is difficult to make a cache both large and fast, so cache misses are expected to continue to have a significant performance impact. Prediction caches use a history of recent cache misses to predict future misses and thereby reduce the overall cache miss rate. This paper describes several prediction caches and introduces a new kind of prediction cache that combines the features of prefetching and victim caching. This new cache is shown to be more effective at reducing the miss rate and improving performance than existing prediction caches. (A toy sketch of such a combination follows this entry.)
CSL-TR-96-707
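
A minimal sketch of the two mechanisms being combined, victim caching and miss-triggered prefetching, on a direct-mapped cache; this is a generic illustration, not the paper's specific prediction-cache organization:

    from collections import OrderedDict

    class PredictionCache:
        """Direct-mapped cache + small victim buffer + prefetch on miss."""
        def __init__(self, n_sets=64, victim_slots=4):
            self.n_sets, self.victim_slots = n_sets, victim_slots
            self.lines = {}               # set index -> resident block number
            self.victim = OrderedDict()   # FIFO of recently evicted blocks
            self.misses = 0

        def _install(self, block):
            idx = block % self.n_sets
            old = self.lines.get(idx)
            if old is not None and old != block:
                self.victim[old] = True   # victim caching: keep the evictee nearby
                if len(self.victim) > self.victim_slots:
                    self.victim.popitem(last=False)
            self.lines[idx] = block

        def access(self, block):
            hit = self.lines.get(block % self.n_sets) == block
            if not hit and block in self.victim:
                del self.victim[block]    # conflict miss rescued by the victim buffer
                hit = True
            if not hit:
                self.misses += 1
                self._install(block + 1)  # sequential prefetch triggered by the miss
            self._install(block)

    cache = PredictionCache()
    for block in list(range(100)) * 2:    # two sequential sweeps over 100 blocks
        cache.access(block)
    print(cache.misses, "misses in 200 accesses")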

Report Number: CSL-TR-96-708
Institution: Stanford University, Computer Systems Laboratory
Title: Validation Tools for Complex Digital Designs
Author: Ho, Chian-Min Richard
Date: December 1996
Abstract: The functional validation of a complex digital design is a laborious, ad-hoc, and open-ended task. Many circuits are too complex to be formally verified in their entirety; instead, simulation of a register transfer level (RTL) model is used. This research explores techniques to make the validation task more systematic, automated, and efficient. This can be accomplished by using information embedded in the RTL model to extract the set of "interesting behaviors" of the design, represented as interacting finite state machines (FSMs). If all such interesting behaviors of the RTL could be tested in simulation, the degree of confidence that the design is correct would be substantially higher. This work provides two tools towards this goal. First, a test vector generator is described that uses this information to produce a series of test vectors that exercise all the implemented behaviors of the design in RTL simulation. Second, the information can be used as the basis for coverage analysis of a pre-existing test vector suite. Previous coverage metrics, such as toggles on a node in the circuit or code block execution counts, often give good first-order indications of how thoroughly a circuit has been exercised, but do not usually give an accurate picture of whether multiple or concurrent events have been exercised. In this thesis, a new method is proposed for analyzing test vector suite coverage based on projecting a minimized control state graph onto control signals that enter the datapath part of the design. The fundamental problem facing any technique that uses state exploration is state space explosion. Two techniques are proposed to mitigate this problem: first, a dynamic state graph pruning algorithm, based on static analysis of the model structure, that provides an exact minimization; and second, approximation of the state graph with an estimate of the state space in a more compact representation. These techniques help delay the onset of state explosion, allowing useful information to be obtained and utilized even for complex designs. Results and practical experiences of applying these techniques to the design of the node controller (MAGIC) of the Stanford FLASH Multiprocessor project are given.
CSL-TR-96-708

Report Number: CSL-TR-96-710
Institution: Stanford University, Computer Systems Laboratory
Title: Executable Formal Models of Distributed Transaction Systems Based on Event Processing
Author: Kenney, John
Date: November 1996
Abstract: This dissertation presents formal models of distributed transaction processing (DTP) that are executable and testable. These models apply a new technology, Rapide, an object-oriented executable architecture description language designed for specifying and prototyping distributed, time-sensitive systems. This dissertation shows how the Rapide technology can be applied to specify, prototype, and test DTP models. In particular, this dissertation specifies a reference architecture for the X/Open DTP industry standard. The reference architecture, written in Rapide, defines architectures and behaviors of systems that comply with the X/Open standard. This dissertation also applies a technique developed previously by Gennart and Luckham for testing applications for conformance with reference architectures.
CSL-TR-96-710

Report Number: CSL-TR-96-711
Institution: Stanford University, Computer Systems Laboratory
Title: Design Issues in High Performance Floating Point Arithmetic Units
Author: Oberman, Stuart Franklin
Date: December 1996
Abstract: In recent years, computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to the implementation of the floating point unit, or FPU. Special-purpose applications, such as high-performance graphics rendering systems, have placed further demands on processors, and high-speed floating point hardware is a requirement to meet these increasing demands. This work examines the state of the art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency. While division is an infrequent operation even in floating point intensive applications, it is shown that ignoring its implementation can result in system performance degradation. A high performance FPU requires a fast and efficient adder, multiplier, and divider. The design question becomes how best to implement the FPU in order to maximize performance given the constraints of silicon die area. The system performance and area impact of functional unit latency is examined for varying instruction issue rates in the context of the SPECfp92 application suite. Performance implications are investigated for shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Due to the importance of low-latency FP addition, a variable-latency FP addition algorithm has been developed which improves average addition latency by 33% while maintaining single-cycle throughput. To improve the performance and area of linearly-converging division algorithms, an automated process is proposed for minimizing the complexity of SRT tables. To reduce the average latency of quadratically-converging division algorithms, the technique of reciprocal caching is proposed, along with a method to reduce the latency penalty for exact rounding. A combination of the proposed techniques provides a basis for future high performance floating point units.
CSL-TR-96-711

Report Number: CSL-TR-97-712
Institution: Stanford University, Computer Systems Laboratory
Title: Hive: Operating System Fault Containment for Shared-Memory Multiprocessors
Author: Chapin, John
Date: July 1997
Abstract: Reliability and scalability are major concerns when designing general-purpose operating systems for large-scale shared-memory multiprocessors. This dissertation describes Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This architecture improves reliability because a hardware or software error damages only one cell rather than the whole system. The architecture improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. The research described in the dissertation makes three primary contributions: (1) it demonstrates that distributed system mechanisms can be used to provide fault containment inside a shared-memory multiprocessor; (2) it provides a specification for a set of hardware features, implemented in the Stanford FLASH, that are sufficient to support fault containment; and (3) it demonstrates how to take advantage of shared-memory hardware across cell boundaries at both application and kernel levels while preserving fault containment. The dissertation also analyzes the architectural and performance tradeoffs of multicellular kernels. Fault injection experiments conducted using the SimOS machine simulator demonstrate the reliability of the Hive prototype. Studies using both general-purpose and scientific workloads illustrate the performance tradeoffs of the multicellular kernel architecture.
CSL-TR-97-712

Report Number: CSL-TR-97-713
Institution: Stanford University, Computer Systems Laboratory
Title: From the Valley of Heart's Delight to Silicon Valley: A Study of Stanford University's Role in the Transformation
Author: Tajnai, Carolyn
Date: January 1997
Abstract: This study examines the role of Stanford University in the transformation of the Valley of Heart's Delight into Silicon Valley. At the dawn of the twentieth century, California's Santa Clara County was an agricultural paradise. Because of the benign climate and thousands of acres of fruit orchards, the area became known as the Valley of Heart's Delight. In the early 1890s, Leland and Jane Stanford donated land in the valley to build a university in memory of their son. Thus, Leland Stanford, Jr., University was founded. In the early 1930s, there were almost no jobs for young Stanford engineering graduates. This was about to change. Although there was no organized plan to help develop the economic base of the area around Stanford University, concern about the lack of job opportunities for their graduates motivated Stanford faculty to begin the chain of events that led to the birth of Silicon Valley. Stanford University's role in the transformation of the Valley of Heart's Delight into Silicon Valley is history, but it is enduring history. Stanford continues to affect the local economy by spawning new and creative ideas, dreams, and ambitions.
CSL-TR-97-713

Report Number: CSL-TR-97-714
Institution: Stanford University, Computer Systems Laboratory
Title: Parallelizing Compiler Techniques Based on Linear Inequalities
Author: Amarasinghe, Saman Prabhath
Date: January 1997
Abstract: Shared-memory multiprocessors, built out of the latest microprocessors, are becoming a widely available class of computationally powerful machines. These affordable multiprocessors can potentially deliver supercomputer-like performance to the general public. To effectively harness the power of these machines, it is important to find all the available parallelism in programs. The Stanford SUIF interprocedural parallelizer we have developed is capable of detecting a coarser granularity of parallelism in sequential scientific applications than previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of code, frequently requiring modifications to array data structures such as array privatization. Measurements from several standard benchmark suites demonstrate that aggressive interprocedural analyses can substantially advance the capability of automatic parallelization technology. However, locating parallelism is not sufficient for achieving high performance; it is also critical to make effective use of the memory hierarchy. In parallel applications, false sharing and cache conflicts between processors can significantly reduce performance. We have developed the first compiler that automatically performs a full suite of data transformations (a combination of transposing, strip-mining, and padding). The performance of many benchmarks improves drastically after the data transformations. We introduce a framework based on systems of linear inequalities for developing compiler algorithms. Many of the whole-program analyses and aggressive optimizations in our compiler employ this framework. Using this framework, general solutions to many compiler problems can be found systematically. (A toy sketch of a dependence test in this style follows this entry.)
CSL-TR-97-714
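
The linear-inequality view can be illustrated with the simplest case, a dependence test: a loop carries a dependence iff an integer point satisfies a small system of equalities and bounds. Brute-force enumeration below stands in for the Fourier-Motzkin / integer-programming machinery a real compiler would use:

    from itertools import product

    def depends(n, write, read):
        """Affine accesses (a, b) denote a*i + b. A cross-iteration dependence
        exists iff some i1 != i2 in [0, n) satisfies a_w*i1 + b_w == a_r*i2 + b_r."""
        (aw, bw), (ar, br) = write, read
        return any(aw * i1 + bw == ar * i2 + br
                   for i1, i2 in product(range(n), repeat=2) if i1 != i2)

    # for i: A[2*i] = ... A[2*i + 1] ...  -> no solution: iterations are parallel
    print(depends(100, write=(2, 0), read=(2, 1)))   # False
    # for i: A[i]   = ... A[i - 1]   ...  -> solution exists: loop-carried dependence
    print(depends(100, write=(1, 0), read=(1, -1)))  # True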

Report Number: CSL-TR-97-715
Institution: Stanford University, Computer Systems Laboratory
Title: Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor
Author: Oplinger, Jeffrey
Author: Heine, David
Author: Liao, Shih-Wei
Author: Nayfeh, Basem A.
Author: Lam, Monica S.
Author: Olukotun, Kunle
Date: February 1997
Abstract: Thread-level speculation (TLS) makes it possible to parallelize general-purpose C programs. This paper proposes software and hardware mechanisms that support speculative thread-level execution on a single-chip multiprocessor. A detailed analysis of programs using the TLS execution model shows a promising bound on the performance of a TLS machine. In particular, TLS makes it feasible to find speculative doacross parallelism in outer loops that can greatly improve the performance of general-purpose applications. Exploiting speculative thread-level parallelism on a multiprocessor requires the compiler to determine where to speculate and to generate SPMD (single program multiple data) code. We have developed a fully automatic compiler system that uses profile information to determine the best loops to execute speculatively, and to generate the synchronization code that improves the performance of speculative execution. The hardware mechanisms required to support speculation are simple extensions to the cache hierarchy of a single-chip multiprocessor. We show that with our proposed mechanisms, thread-level speculation provides significant performance benefits.
CSL-TR-97-715

Report Number: CSL-TR-97-716
Institution: Stanford University, Computer Systems Laboratory
Title: Checking Experiments for Scan Chain Latches and Flip-Flops
Author: Makar, Samy
Date: August 1997
Abstract: New digital designs often include scan chains; the reason is high-quality, economical test. A scan chain allows easy access to internal combinational logic by converting bistable elements, latches and flip-flops, into a shift register. Test patterns are scanned in, applied to the internal circuitry, and the results are scanned out for comparison. While many techniques exist for testing the combinational circuitry, little attention has been paid to testing the bistable elements themselves. The bistable elements are typically tested by shifting in a sequence of zeroes and ones; this test can miss many defects inside the bistable elements. A checking experiment is a sequence of inputs and outputs that contains enough information to extract the functionality of the circuit. A new approach, based on such sequences, can significantly reduce the number of defects missed. Simulation results show that as many as 20 percent of the faults in bistable elements can be missed by typical tests; essentially all of these missed faults are detected by checking experiments. Since the checking experiment is a functional test, it is independent of the implementation of the bistable element. This is especially useful since designers often use different implementations of bistable elements to optimize their circuits for area and performance. Another benefit of a functional test is that it avoids the need for generating test patterns at the transistor level. Applying a complete checking experiment to a bistable element embedded inside a circuit can be very difficult, if not impossible. The new approach breaks up the checking experiment into a set of small sub-sequences. For each of these sub-sequences, a test pattern is generated. These test patterns are scanned in, as in the case of the tests for combinational logic; appropriate changes to the control inputs of the bistable elements are applied, and the results are scanned out. The process of generating the patterns is automated by modifying an existing stuck-at test generator. A designer or test engineer need only provide a gate-level description of the circuit to generate tests that guarantee a checking experiment for each bistable element in the design. Test size is an important economic factor in circuit design. The size of the checking-experiment-based test increases with circuit size at about the same rate as the traditional test, indicating that it is practical for large circuits. Checking-experiment-based tests are an effective, economical means of testing the bistable elements in scan chain designs.
CSL-TR-97-716

Report Number: CSL-TR-97-717
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Synthesis of Sequential Circuits for Low Power Dissipation
Author: Benini, Luca
Date: February 1997
Abstract: In high-performance digital CMOS systems, excessive power dissipation reduces reliability and increases the cost imposed by cooling systems and packaging. Power is obviously the primary concern for portable applications, since battery technology cannot keep the fast pace imposed by Moore's Law, and there is a large demand for devices with light batteries and long times between recharges. Computer-aided engineering is probably the only viable paradigm for designing state-of-the-art VLSI and ULSI systems, because it allows the designer to focus on the high-level trade-offs and to concentrate the human effort on the most critical parts of the design. We present a framework for the computer-aided design of low-power digital circuits. We propose several techniques for automatic power reduction based on paradigms which are widely used by designers. Our main purpose is to provide the foundation for a new generation of CAD tools for power optimization under performance constraints. In the last decade, the automatic synthesis and optimization of digital circuits for minimum area and maximum performance has been extensively investigated. We leverage the knowledge base created by such research, but we acknowledge the distinctive characteristics of power as an optimization target.
CSL-TR-97-717

Report Number: CSL-TR-97-718
Institution: Stanford University, Computer Systems Laboratory
Title: Fault Tolerance: Methods of Rollback Recovery
Author: Sunada, Dwight
Author: Glasco, David
Author: Flynn, Michael
Date: March 1997
Abstract: This paper describes the latest methods of rollback recovery for fault-tolerant distributed shared memory (DSM) multiprocessors. It discusses (1) the theoretical issues that rollback recovery addresses, (2) the three major classes of methods for recovery, and (3) the relative merits of each class.
CSL-TR-97-718

Report Number: CSL-TR-97-719
Institution: Stanford University, Computer Systems Laboratory
Title: Automatic Computation and Data Decomposition for Multiprocessors
Author: Anderson, Jennifer-Ann Monique
Date: March 1997
Abstract: Memory subsystem efficiency is critical to achieving high performance on parallel machines. The memory subsystem organization of modern multiprocessor architectures makes their performance highly sensitive to both the distribution of the computation and the layout of the data. A key issue in programming these machines is selecting the computation and data decomposition, the mapping of the computation and data, respectively, across the processors of the machine. A popular approach to the decomposition problem is to require programmers to perform the decomposition analysis themselves, and to communicate that information to the compiler using language extensions. This thesis presents a new compiler algorithm that automatically calculates computation and data decompositions for dense-matrix scientific codes. The core of the algorithm is based on a linear algebra framework for expressing and calculating decompositions. Since the best decompositions may change as different phases of the program are executed, the algorithm also considers re-organizing the data dynamically. The analysis is performed both within and across procedure boundaries so that entire programs can be analyzed. We evaluated the effectiveness of the algorithm by applying it to a suite of benchmark programs. We found that our decomposition analysis and optimization can lead to significant increases in program performance.
CSL-TR-97-719

Report Number: CSL-TR-97-720
Institution: Stanford University, Computer Systems Laboratory
Title: A Simulation Study of IP Switching
Author: Lin, Steven
Author: McKeown, Nick
Date: April 1997
Abstract: Recently there has been much interest in combining the speed of layer-2 switching with the features of layer-3 routing. This has been prompted by numerous proposals, including IP Switching, Tag Switching, ARIS, CSR, and IP over ATM. In this paper, we study IP Switching and evaluate the performance claims made by Newman et al. In particular, using nine network traces, we study how well IP Switching performs with traffic found in campus, corporate, and Internet Service Provider (ISP) environments. Our main finding is that IP Switching leads to a high proportion of datagrams being switched: over 75% in all of the environments we studied. We also investigate the effects that different flow classifiers and various timer values have on performance, and note that some choices can result in a large VC space requirement. Finally, we present recommendations for the flow classifier and timer values as a function of the VC space of the switch and the network environment being served. (A toy sketch of such a flow classifier follows this entry.)
CSL-TR-97-720
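
A minimal sketch of the kind of flow classifier whose parameters the paper studies: a flow is promoted from hop-by-hop routing to a switched VC once it shows X packets within Y seconds. The thresholds and flow-id fields below are illustrative, and VC space accounting and idle-timeout teardown are omitted:

    class FlowClassifier:
        def __init__(self, x_packets=10, y_seconds=1.0):
            self.x, self.y = x_packets, y_seconds
            self.flows = {}   # flow id -> [window start, packet count, switched?]

        def packet(self, flow_id, now):
            start, count, switched = self.flows.get(flow_id, [now, 0, False])
            if not switched and now - start > self.y:
                start, count = now, 0            # stale window: restart detection
            count += 1
            if not switched and count >= self.x:
                switched = True                  # install a VC for this flow
            self.flows[flow_id] = [start, count, switched]
            return "switched" if switched else "routed"

    clf = FlowClassifier()
    flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)   # src, dst, proto, ports
    for i in range(12):
        print(clf.packet(flow, now=0.05 * i))      # routed x9, then switched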

Report Number: CSL-TR-97-723
Institution: Stanford University, Computer Systems Laboratory
Title: Hierarchical Storage Systems for Interactive Video-On-Demand
Author: Chan, Shueng-Han Gary
Author: Tobagi, Fouad A.
Date: April 1997
Abstract: On-demand video servers based on hierarchical storage systems are able to offer high-capacity, low-cost video storage. In such a system, video files are stored at the tertiary level and transferred to the secondary level to be displayed. Designing such servers to allow user interaction with the video being played back is of great interest. We have conducted a comprehensive study of the architecture and operation of such a VOD server. Our objective is to understand its performance characteristics so as to design a video server that meets specific application requirements. Applications of interest include distance learning, movies on demand, interactive news, home shopping, etc. The design of such a server involves many choices pertaining to both architecture and operational procedures. We first study, through simulation, a baseline system that captures the essential performance characteristics of a hierarchical storage system. We then extend our study beyond the baseline to cover numerous other system variations in terms of architectural parameters and operational procedures. We have also examined the effect of various application characteristics, such as file size and video popularity, on system performance. We demonstrate the usefulness of our results by applying them to the design of a video server that takes into account current storage technologies.
CSL-TR-97-723

Report Number: CSL-TR-97-724
Institution: Stanford University, Computer Systems Laboratory
Title: State Reduction Methods for Automatic Formal Verification
Author: Ip, C. Norris
Date: December 1996
Abstract: Validation of industrial designs is becoming more challenging as technology advances. One of the most suitable debugging aids is automatic formal verification. This thesis presents several techniques for reducing the state explosion problem, that is, for reducing the number of states that must be examined. A major contribution of this thesis is the design of simple extensions to the Murphi description language which enable us to convert two existing abstraction strategies into fully automatic algorithms, making these strategies easy to use and safe to apply. These two algorithms rely on two facts about high-level designs: they frequently exhibit structural symmetry, and their behavior is often independent of the exact number of replicated components they contain. Another contribution is the design of a new state reduction algorithm which relies on reversible rules (transitions that do not lose information) in a system description. This new reduction algorithm can be used simultaneously with the other two. These techniques, implemented in the Murphi verification system, have been applied to many applications, such as cache coherence protocols and distributed algorithms. For two important classes of infinite systems, infinite state graphs can be automatically converted to small finite state graphs. (A toy sketch of symmetry reduction follows this entry.)
CSL-TR-97-724
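
A minimal sketch of symmetry reduction during explicit state search: states are canonicalized up to permutation of replicated components, so symmetric states are explored only once. The two-state toy "protocol" below is invented for illustration:

    from itertools import permutations

    def canonical(state):
        """Representative of the state's symmetry class: the minimum over all
        permutations of the replicated processes."""
        return min(permutations(state))

    def successors(state):
        # toy rule: any process may flip between 'idle' and 'busy'
        for i, s in enumerate(state):
            nxt = list(state)
            nxt[i] = "busy" if s == "idle" else "idle"
            yield tuple(nxt)

    def explore(initial):
        start = canonical(initial)
        seen, work = {start}, [start]
        while work:
            state = work.pop()
            for nxt in successors(state):
                c = canonical(nxt)
                if c not in seen:
                    seen.add(c)
                    work.append(c)
        return seen

    full = 2 ** 4                    # 16 raw states for 4 two-state processes
    reduced = explore(("idle",) * 4)
    print(len(reduced), "of", full)  # 5 canonical states: 0..4 processes busy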

Report Number: CSL-TR-97-725
Institution: Stanford University, Computer Systems Laboratory
Title: Designing reliable programs with RAPIDE
Author: Madhav, Neel
Author: Luckham, David C.
Date: April 1997
Abstract: Rapide is a language for prototyping large, distributed systems. Rapide allows a scale model of a system's design to be constructed and analyzed before resources are applied to the construction of the actual system. Two important facets of designing reliable systems are (1) system architecture -- the components in the system and the communication paths between the components -- and (2) system behavior -- the requirements on the components and the communication. Rapide facilitates the design of system architecture and behavior by (1) providing language features to realize system designs, (2) providing an expressive model for capturing the execution behavior of systems, and (3) providing techniques and tools for analyzing system execution behavior. This paper introduces the essential concepts of Rapide and gives an example of system design using Rapide. Rapide has four sublanguages: (1) a type language, (2) an architecture definition language, (3) a constraint language, and (4) an executable language. The paper introduces the Rapide architecture sublanguage and the Rapide constraint sublanguage. The Rapide model of system execution is a set of significant events partially ordered by causality (also called posets). This paper discusses Rapide execution models and compares them with totally ordered event-based models. Rapide provides tools to check constraints on posets, to browse posets, and to animate events on a system architecture. This paper briefly discusses the Rapide analysis tools.
CSL-TR-97-725

Report Number: CSL-TR-97-726
Institution: Stanford University, Computer Systems Laboratory
Title: Low-Power Processor Design
Author: Gonzalez, Ricardo E.
Date: June 1997
Abstract: Power has become an important aspect of the design of general-purpose processors. This thesis explores how design tradeoffs affect the power and performance of the processor. Scaling the technology is an attractive way to improve the energy efficiency of a processor: in a scaled technology, a processor dissipates less power for the same performance, or delivers higher performance for the same power. Some micro-architectural changes, such as pipelining and caching, can significantly improve efficiency. Unfortunately, many other architectural tradeoffs leave efficiency unchanged, because a large fraction of the energy is dissipated in essential functions and is unaffected by the internal organization of the processor. Another attractive technique for reducing power dissipation is scaling the supply and threshold voltages. Unfortunately, this makes the processor more sensitive to variations in process and operating conditions, so design margins must increase to guarantee operation, which reduces the efficiency of the processor. One way to shrink these design margins is to use feedback control to regulate the supply and threshold voltages. Adaptive techniques can also be used to dynamically trade excess performance for lower power, resulting in lower average power and therefore longer battery life. Improvements are limited, however, by the energy dissipation of the rest of the system. (A back-of-the-envelope sketch of the voltage-scaling trade-off follows this entry.)
CSL-TR-97-726
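
A back-of-the-envelope sketch of the supply/threshold scaling trade-off the thesis studies: energy per operation falls quadratically with Vdd while gate delay grows as Vdd/(Vdd - Vt)^alpha. All constants below are invented for illustration:

    def energy_delay(vdd, vt=0.6, c_eff=1.0, alpha=1.5):
        energy = c_eff * vdd ** 2             # E ~ C * Vdd^2 per switched node
        delay = vdd / (vdd - vt) ** alpha     # alpha-power-law MOSFET delay model
        return energy, delay

    for vdd in (3.3, 2.5, 1.8, 1.2):
        e, d = energy_delay(vdd)
        print(f"Vdd={vdd:.1f}V  energy={e:5.2f}  delay={d:5.2f}  E*D={e * d:6.2f}")
    # Lowering Vdd buys energy at the cost of speed; adaptive regulation of
    # Vdd and Vt aims to give back only the performance margin that is not needed.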

Report Number: CSL-TR-97-727
Institution: Stanford University, Computer Systems Laboratory
Title: Towards an Abstraction Hierarchy for CAETI Architectures, and Possible Applications
Author: Luckham, David
Author: Vera, James
Author: Belz, Frank
Date: April 1997
Abstract: This document proposes a four-level abstraction hierarchy for CAETI system architectures, for review and discussion by the CAETI community. Some possible applications are described briefly.
CSL-TR-97-727

Report Number: CSL-TR-97-728
Institution: Stanford University, Computer Systems Laboratory
Title: Defining a Security Reference Architecture
Author: Meldal, Sigurd
Author: Luckham, David
Date: June 1997
Abstract: This report discusses the definition and modeling of reference architectures that specify the security aspects of distributed systems. NSA's MISSI (Multilevel Information System Security Initiative) security reference architecture is used as an illustrative example. We show how one would define such a reference architecture, and how one could use such a definition to model as well as check implementations for compliance with the reference. We demonstrate that an ADL should have not only the capability to specify interfaces, connections, and operational constraints, but also the capability to specify how an architecture is related to other architectures or to implementations. A reference architecture such as MISSI is defined in Rapide [10] as a set of hierarchical interface connection architectures [9]. Each Rapide interface connection architecture is a reference architecture - an abstract architecture that allows a number of different implementations, but which enforces common structure and communication rules. The hierarchical reference architecture defines the MISSI policies at different levels - at the level of enclaves communicating through a network, at the level of each enclave being a local area network with firewalls and workstations, and at the level of the individual workstations. The reference architecture defines standard components, communication patterns, and policies common to MISSI-compliant networks of computer systems. A network of computers may be checked for conformance against the reference architecture. The report also shows how one can generate architecture scenarios of networks of communicating computers. The scenarios are constructed as Rapide executable models, and the behaviors of the models can be checked for conformance with the reference architecture in these scenarios. The executable models demonstrate how the structure and security policies in the reference architecture may apply to networks of computers.
CSL-TR-97-728

Report Number: CSL-TR-97-729
Institution: Stanford University, Computer Systems Laboratory
Title: Remote Memory Access in Workstation Clusters
Author: Verghese, Ben
Author: Rosenblum, Mendel
Date: July 1997
Abstract: Efficient sharing of memory resources in a cluster of workstations has the promise of greatly improving the performance and cost-effectiveness of the cluster when running large memory-intensive jobs. A point of interest is the hardware support required for good memory sharing performance. We evaluate the performance of two models: the software-only model that runs on a traditional distributed system configuration, and requires support from the operating system to access remote memory; and the hardware-intensive model that uses a specialized network interface to extend the memory system to allow direct access to remote memory. Using SimOS, we do a fair comparison of the performance of the two memory-sharing models for a set of interesting compute-server workloads. We find that the software-only model, with current remote page-fault latencies, does not provide acceptable memory-sharing performance. The hardware shared-memory system is able to provide stable performance across a range of latencies. If the remote page-fault latency can be reduced to 100 microseconds, the performance of the software-only model becomes acceptable for many, though not all, workloads. Considering the interconnection bandwidth required to sustain the software-only page-level memory sharing, our experiments show that a gigabit network is necessary for good performance.
CSL-TR-97-729

Report Number: CSL-TR-97-730
Institution: Stanford University, Computer Systems Laboratory
Title: Performance Isolation and Resource Sharing on Shared-Memory Multiprocessors
Author: Verghese, Ben
Author: Gupta, Anoop
Author: Rosenblum, Mendel
Date: July 1997
Abstract: Shared-memory multiprocessors are attractive as general-purpose compute servers. On the software side, they present programmers with the same programming paradigm as uniprocessors, and they can run unmodified uniprocessor binaries. On the hardware side, the tight coupling of multiple processors, memory, and I/O enables efficient fine-grain sharing of resources on these systems. This fine-grain sharing is important in compute servers because it allows idle resources to be easily utilized by active jobs, leading to better system throughput. However, current SMP operating systems do not provide an important feature that users of workstations enjoy, namely the lack of interference from the jobs of unrelated users. We show that this lack of isolation is caused by the resource allocation model carried over from single-user workstations, which is inappropriate for multi-user multiprocessor systems. We propose "performance isolation", a new resource allocation model for multi-user multiprocessor compute servers. This model allows the isolation of the performance of groups of processes from the load on the rest of the system, provides performance comparable to a smaller system that corresponds to the resources used, and allows the sharing of idle resources for throughput comparable to an SMP OS. We implement the performance isolation model in the IRIX 5.3 operating system for three important system resources: CPU time, memory, and disk bandwidth. Our implementation of fairness for disk bandwidth is novel. Running a number of workloads, we show that this model is very successful at providing workstation-like latencies under heavy load and SMP-like latencies under light load.
CSL-TR-97-730

Report Number: CSL-TR-97-731
Institution: Stanford University, Computer Systems Laboratory
Title: A Single Chip Multiprocessor Integrated with High Density DRAM
Author: Yamauchi, Tadaaki
Author: Hammond, Lance
Author: Olukotun, Kunle
Date: August 1997
Abstract: A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing memory latency and improving memory bandwidth. In this paper we evaluate the performance of a single chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip which only has SRAM-based on-chip cache. The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating point applications which are effectively parallelized and have large working sets. This performance difference is significantly better than that possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor in a single chip multiprocessor, the large bandwidth of on-chip DRAM can handle the inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by the on-chip DRAM than a uniprocessor.
CSL-TR-97-731

Report Number: CSL-TR-97-732
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient Exception Handling Techniques for High-Performance Processor Architectures
Author: Rudd, Kevin W.
Date: October 1997
Abstract: Providing precise exceptions has driven much of the complexity in modern processor designs. While this complexity is required to maintain the illusion of a processor based on a sequential architectural model, it also results in reduced performance during normal execution. The existing notion of precise exceptions is limited to processors based on a sequential architectural model and there have been few techniques developed that are applicable to processors that are not based on this model. Processors with exposed pipelines (typical of VLIW processors) do not conform to the sequential execution model. These processors have explicit overlaps in operation execution and thus cannot support the traditional notion of precise exceptions; most exception handling techniques for these processors require restrictive software scheduling. In this report, we generalize the notion of a precise exception and extend the applicability of precise exceptions to a wider range of architectures. We propose precise exception handling techniques that solve the problem of efficient exception handling for both sequential architectures as well as exposed pipeline architectures. We also show how these techniques can provide efficient support for speculative execution past multiple branches for both architectures as well as latency tolerance for exposed pipeline architectures.
CSL-TR-97-732

Report Number: CSL-TR-97-733
Institution: Stanford University, Computer Systems Laboratory
Title: New Methods for Surface Reconstruction from Range Images
Author: Curless, Brian Lee
Date: June 1997
Abstract: The digitization and reconstruction of 3D shapes has numerous applications in areas that include manufacturing, virtual simulation, science, medicine, and consumer marketing. In this thesis, we address the problem of acquiring accurate range data through optical triangulation, and we present a method for reconstructing surfaces from sets of data known as range images. The standard methods for extracting range data from optical triangulation scanners are accurate only for planar objects of uniform reflectance. Using these methods, curved surfaces, discontinuous surfaces, and surfaces of varying reflectance cause systematic distortions of the range data. We present a new ranging method based on analysis of the time evolution of the structured light reflections. Using this spacetime analysis, we can correct for each of these artifacts, thereby attaining significantly higher accuracy using existing technology. When using coherent illumination such as lasers, however, we show that laser speckle places a fundamental limit on accuracy for both traditional and spacetime triangulation. The range data acquired by 3D digitizers such as optical triangulation scanners commonly consists of depths sampled on a regular grid, a sample set known as a range image. A number of techniques have been developed for reconstructing surfaces by integrating groups of aligned range images. A desirable set of properties for such algorithms includes: incremental updating, representation of directional uncertainty, the ability to fill gaps in the reconstruction, and robustness in the presence of outliers and distortions. Prior algorithms possess subsets of these properties. In this thesis, we present an efficient volumetric method for merging range images that possesses all of these properties. Using this method, we are able to merge a large number of range images (as many as 70) yielding seamless, high-detail models of up to 2.6 million triangles.
CSL-TR-97-733

Report Number: CSL-TR-97-734
Institution: Stanford University, Computer Systems Laboratory
Title: Packet Switching Photonic Network Switch Design and Routing Algorithm
Author: Lee, Hyuk-Jun
Author: Morf, Martin
Author: Flynn, Michael
Date: December 1997
Abstract: The maturity of photonic technology makes it possible to construct an all-optical network switch that avoids optical-to-electrical signal conversion for routing. To realize all-optical packet switching, current network topologies and routing algorithms have to be reexamined and modified to satisfy the requirements of all-optical network switching, such as fast routing decisions, consideration of hardware implementation, buffering, etc. In this paper, we first review various switching architectures, including crossbar, Benes, and Batcher/Banyan. Secondly, an optical implementation of a multiple-output-port network switch is presented. At many levels of networking, from multiprocessor interconnection to wide-area networking, the multiple latencies resulting from this scheme could improve overall performance when combined with smart routing schemes. Finally, we present an interpretation of multistage networks using a symmetric group. A Cayley graph for a symmetric group and its coset graphs suggest an interesting alternative way to construct a new multistage interconnection network.
CSL-TR-97-734

Report Number: CSL-TR-97-735
Institution: Stanford University, Computer Systems Laboratory
Title: Flexible Connectivity Management for Mobile Hosts
Author: Zhao, Xinhua
Author: Baker, Mary G.
Date: September 1997
Abstract: Powerful light-weight portable computers, the availability of wireless networks, and the popularity of the Internet are driving the need for better networking support for mobile hosts. Users should be able to connect their portable computers to the Internet at any time and in any place, but the dynamic nature of such connectivity requires more flexible network management than has typically been available for stationary workstations. This report proposes techniques to address a unique feature of connectivity management on mobile hosts: its multiplicity, i.e. the need to support multiple packet delivery methods simultaneously and to support the use of multiple network devices for both availability and efficiency reasons. We have developed a set of techniques in the context of mobile IP for flexible, automatic network connectivity management for mobile hosts. We augment the routing layer of the network protocol stack with a Mobile Policy Table (MPT) to support multiple packet delivery mechanisms for different simultaneous flows based on the nature of the traffic. We also devise a set of mechanisms, including a backwards-compatible extension to the routing table, to facilitate the use of multiple network devices. We include performance results showing some of the potential benefits such increased flexibility provides for mobile hosts.
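As a concrete illustration of the Mobile Policy Table idea, consider the following hypothetical sketch: the match fields, method names, and table entries below are invented for illustration and are not taken from the report.

    # Hypothetical sketch of a Mobile Policy Table (MPT): a routing-layer
    # table consulted per flow to pick a packet delivery method. All
    # field names, methods, and entries here are illustrative.
    MPT = [
        # (match predicate over a flow, delivery method)
        (lambda f: f["proto"] == "tcp" and f["dport"] == 80, "reverse-tunnel"),
        (lambda f: f["proto"] == "udp",                      "direct"),
        (lambda f: True,                                     "home-agent-tunnel"),
    ]

    def delivery_method(flow):
        # First matching entry wins, so specific rules precede the default.
        for match, method in MPT:
            if match(flow):
                return method

    print(delivery_method({"proto": "tcp", "dport": 80}))  # reverse-tunnel
    print(delivery_method({"proto": "udp", "dport": 53}))  # direct

The point of such a table is that two simultaneous flows from the same mobile host can use different delivery mechanisms.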
CSL-TR-97-735

Report Number: CSL-TR-97-737
Institution: Stanford University, Computer Systems Laboratory
Title: Stochastic Congestion Model for VLSI Systems
Author: Hung, Patrick
Author: Flynn, Michael J.
Date: October 1997
Abstract: Designing with deep submicron feature size presents new challenges in complexity, performance, and productivity. Information on routing congestion and interconnect area is critical in the pre-RTL stage in order to forecast the whole die size, define the timing specifications, and evaluate the chip power consumption. In this report, we propose a stochastic model for VLSI interconnect routing, which can be used to estimate the routing congestion and the interconnect area in the pre-RTL stage. First, we define the uniform and geometric routing distributions, and introduce a simple and efficient algorithm to calculate the routing probabilities. We then derive the routing probabilities among multiple functional blocks, and investigate the effects of routing obstacles. Finally, we map the chip to a Cartesian coordinate system, and model routability based on the supply and demand distributions of routing channels.
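To illustrate what a routing distribution can look like, here is a small sketch under one simple assumption (not necessarily the report's model): a net from (0,0) to (m,n) takes a route drawn uniformly from all shortest Manhattan paths, and we ask for the probability that the route crosses a given grid point.

    # Sketch: probability that a route, drawn uniformly from all shortest
    # Manhattan paths from (0,0) to (m,n), passes through grid point (i,j).
    # This is only an illustration of a "uniform routing distribution";
    # the report's actual definitions and algorithm may differ.
    from math import comb

    def pass_probability(m, n, i, j):
        paths_via = comb(i + j, i) * comb((m - i) + (n - j), m - i)
        return paths_via / comb(m + n, m)

    # Expected congestion peaks near the middle of the bounding box:
    print(pass_probability(4, 4, 2, 2))  # 0.514...
    print(pass_probability(4, 4, 0, 4))  # 0.014...

Summing such probabilities over all nets crossing a region gives an expected channel demand, which is the kind of quantity a congestion model compares against channel supply.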
CSL-TR-97-737

Report Number: CSL-TR-97-738
Institution: Stanford University, Computer Systems Laboratory
Title: On the Speedup Required for Combined Input and Output Queued Switching
Author: Prabhakar, Balaji
Author: McKeown, Nick
Date: November 1997
Abstract: Architectures based on a non-blocking fabric, such as a crosspoint switch, are attractive for use in high-speed LAN switches, ATM switches and IP routers. These fabrics, coupled with memory bandwidth limitations, dictate that queues be placed at the input of the switch. But it is well known that input-queueing can lead to low throughput, and does not allow the control of latency through the switch. This is in contrast to output-queueing, which maximizes throughput, and permits the accurate control of packet latency through scheduling. We ask the question: Can a switch with combined input and output queueing be designed to behave identically to an output-queued switch? In this paper, we prove that if the switch uses virtual output queueing, and has an internal speedup of just four, it is possible for it to behave identically to an output queued switch, regardless of the nature of the arriving traffic. Our proof is based on a novel scheduling algorithm, known as Most Urgent Cell First. This result makes possible switches that perform as if they were output-queued, yet use memories that run more slowly.
CSL-TR-97-738

Report Number: CSL-TR-97-739
Institution: Stanford University, Computer Systems Laboratory
Title: Hardware/Software Co-Design of Run-Time Schedulers for Real-Time Systems
Author: Mooney, Vincent John III
Author: Micheli, Giovanni De
Date: November 1997
Abstract: We present the SERRA Run-Time Scheduler Synthesis and Analysis Tool which automatically generates a run-time scheduler from a heterogeneous system-level specification in both Verilog HDL and C. Part of the run-time scheduler is implemented in hardware, which allows the scheduler to be predictable in being able to meet hard real-time constraints, while part is implemented in software, thus supporting features typical of software schedulers. SERRA's real-time analysis generates a priority assignment for the software tasks in the mixed hardware-software system. The tasks in hardware and software have precedence constraints, resource constraints, relative timing constraints, and a rate constraint. A heuristic scheduling algorithm assigns the static priorities such that a hard real-time rate constraint can be predictably met. SERRA supports the specification of critical regions in software, thus providing the same functionality as semaphores. We describe the task control/data-flow extraction, synthesis of the control portion of the run-time scheduler in hardware, real-time analysis and priority scheduler template. We also show how our approach fits into an overall tool flow and target architecture. Finally, we conclude with a sample application of the novel run-time scheduler synthesis and analysis tool to a robotics design example.
CSL-TR-97-739

Report Number: CSL-TR-97-744
Institution: Stanford University, Computer Systems Laboratory
Title: The FLASH Multiprocessor: Designing a Flexible and Scalable System
Author: Kuskin, Jeffrey Scott
Date: November 1997
Abstract: The choice of a communication paradigm, or protocol, is central to the design of a large-scale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all protocol processing. The architecture of the MAGIC chip allows FLASH to support multiple communication paradigms - in particular, cache-coherent shared memory and high-performance message passing - while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and MAGIC, the custom node controller. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The result is a system that is flexible and scalable, yet competitive in performance with a traditional multiprocessor that implements a single communication paradigm completely in hardware. The focus of this dissertation is the architecture, design, and performance of FLASH. Much of the motivation behind the FLASH system and the MAGIC node controller design stems from an examination of the characteristics of protocol code and the architecture of the DASH system, the predecessor to FLASH. This examination led to two major design goals: development of a node controller architecture that can attain high protocol processing performance while still maintaining flexibility and a need to reduce the logic and memory overheads associated with cache coherence. The MAGIC design achieves these goals by implementing on a single chip a programmable protocol engine with an instruction set optimized for the characteristics of protocol code, along with dedicated support logic to alleviate the most serious protocol processing performance bottlenecks - data movement, message dispatch, and lack of close coupling to the node board components. The design of the FLASH node complements the MAGIC design, matching the close coupling and high bandwidth support in MAGIC to provide a balanced node architecture. Next, the dissertation investigates the performance of cache-coherence on FLASH. Performance results are presented from microbenchmarks run on the Verilog RTL of the MAGIC chip and from complete applications run on FlashLite, the FLASH system-level simulator. The microbenchmarks demonstrate that the architectural extensions added to the MAGIC design - particularly the instruction set optimizations to the programmable protocol processor - yield significantly lower latencies and protocol processor occupancies to service the most common types of memory operations. The application results are used to evaluate the performance costs of flexibility by comparing the performance of FLASH to that of a hardwired machine on representative parallel applications and multiprogramming workloads. These results show that poor application memory reference or load balancing characteristics cause the performance of the FLASH system to degrade more rapidly than the performance of the hardwired system; that is, FLASH's performance is less robust. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. 
Overall, however, the performance of FLASH can be competitive with the performance of the hardwired machine. Specifically, for a range of optimized parallel applications, the performance differences between the hardwired machine and FLASH are small, typically less than 10% at 32 processors and less than 15% at 64 processors. For these programs, either the processor cache miss rates are small or the latency of the programmable protocol processing can be hidden behind the memory access time.
CSL-TR-97-744

Report Number: CSL-TR-97-745
Institution: Stanford University, Computer Systems Laboratory
Title: Selection of Recent Advances in Computer Systems
Author: Mencer, Oskar
Author: Flynn, Michael
Date: July 1999
Abstract: This paper presents a selection of recent research results in computer systems. The roadmap for CMOS technology for the next ten years shows a theoretical limit of 0.1 um for the channel of a MOSFET transistor, reached by 2007. Mainstream processors are adapting to multimedia applications with subword parallel instructions like Intel's MMX or HP's MAX instruction set extensions. Coprocessors and embedded processors are moving towards VLIW in order to save hardware costs. The memory system of the future is going to be the next generation of Rambus/RDRAM. Finally, Custom Computing Machines based on Field Programmable Gate Arrays are one of the promising future technologies for computing -- offering very high performance for highly parallelizable and pipelinable applications.
CSL-TR-97-745

Report Number: CSL-TR-97-748
Institution: Stanford University, Computer Systems Laboratory
Title: Decision Diagrams and Pass Transistor Logic Synthesis
Author: Bertacco, V.
Author: Minato, S.
Author: Verplaetse, P.
Author: Benini, L.
Author: Micheli, G. De
Date: December 1997
Abstract: Since the relative importance of interconnections increases as feature size decreases, standard-cell based synthesis becomes less effective when deep-submicron technologies become available. Intra-cell connectivity can be decreased by the use of macro-cells. In this work we present methods for the automatic generation of macro-cells using pass transistors and domino logic. The synthesis of these cells is based on BDD and ZBDD representations of the logic functions. We address specific problems associated with the BDD approach (level degradation, long paths) and the ZBDD approach (sneak paths, charge sharing, long paths). We compare the performance of the macro-cells approach versus the conventional standard-cell approach based on accurate electrical simulation. This shows that the macro-cells perform well up to a certain complexity of the logic function. Functions of high complexity must be decomposed into smaller logic blocks that can directly be mapped to macro-cells.
CSL-TR-97-748

Report Number: CSL-TR-98-749
Institution: Stanford University, Computer Systems Laboratory
Title: Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture
Author: Hammond, Lance
Author: Olukotun, Kunle
Date: February 1998
Abstract: As more transistors are integrated onto larger dies, single-chip multiprocessors integrated with large amounts of cache memory will soon become a feasible alternative to the large, monolithic uniprocessors that dominate today's microprocessor marketplace. Hydra offers a promising way to build a small-scale MP-on-a-chip using a fairly simple design that still maintains excellent performance on a wide variety of applications. This report examines key parts of the Hydra design -- the memory hierarchy, the on-chip buses, and the control and arbitration mechanisms -- and explains the rationale for some of the decisions made in the course of finalizing the design of this memory system, with particular emphasis given to applications that stress the memory system with numerous memory accesses. With the balance between complexity and performance that we obtain, we feel Hydra offers a promising model for future MP-on-a-chip designs.
CSL-TR-98-749

Report Number: CSL-TR-98-751
Institution: Stanford University, Computer Systems Laboratory
Title: Vitis Propulsion: Theory and Practice
Author: Baker, Mary
Author: Honig, Sue
Author: Kercheval, Berry
Author: Seltzer, Margo
Date: February 1998
Abstract: We have proof that red grapes scoot around more than green grapes when microwaved.
CSL-TR-98-751

Report Number: CSL-TR-98-752
Institution: Stanford University, Computer Systems Laboratory
Title: Smoothness of Stationary Subdivision on Irregular Meshes
Author: Zorin, Denis
Date: January 1998
Abstract: We derive necessary and sufficient conditions for tangent plane and C^k-continuity of stationary subdivision schemes near extraordinary vertices. Our criteria generalize most previously known conditions. We introduce a new approach to analysis of subdivision surfaces based on the idea of the universal surface. Any subdivision surface can be locally represented as a projection of the universal surface, which is uniquely defined by the subdivision scheme. This approach provides us with a more intuitive geometric understanding of subdivision near extraordinary vertices.
CSL-TR-98-752

Report Number: CSL-TR-98-753
Institution: Stanford University, Computer Systems Laboratory
Title: Resource Management Issues for Shared-Memory Multiprocessors
Author: Verghese, Ben
Date: March 1998
Abstract: Shared-memory multiprocessors (SMPs) are attractive as general-purpose compute servers. On the software side, they present the same programming paradigm as uniprocessors, and they can run unmodified uniprocessor binaries. On the hardware side, the tight coupling of multiple processors, memory, and I/O provides enormous computing power in a single system, and enables the efficient sharing of these resources. As a compute server, this power can be exploited both by a collection of uniprocessor programs and by explicitly or automatically parallelized applications. This thesis addresses two important performance-related issues encountered in such systems, performance isolation and data locality. The solutions presented in this dissertation address these issues through careful resource management in the operating system. Current shared-memory multiprocessor operating systems provide very few controls for sharing the resources of the system among the active tasks or users. This is a serious limitation for a compute server that is to be used for multiple tasks or by multiple users. The current unconstrained sharing scheme allows the load placed by one user or task to adversely affect the performance seen by another. We show that this lack of isolation is caused by the resource allocation scheme (or lack thereof) carried over from single-user workstations. Multi-user multiprocessor systems require more sophisticated resource management, and we propose "performance isolation", a new resource management scheme for such systems.
CSL-TR-98-753

Report Number: CSL-TR-98-755
Institution: Stanford University, Computer Systems Laboratory
Title: ABSS v2.0: a SPARC Simulator
Author: Sunada, Dwight
Author: Glasco, David
Author: Flynn, Michael
Date: April 1998
Abstract: This paper describes various aspects of the augmentation-based SPARC simulator (ABSS). We discuss (1) the problems that we solved in porting AugMINT to the SPARC platform to create ABSS, (2) the major sections of ABSS, and (3) the limitations of ABSS.
CSL-TR-98-755

Report Number: CSL-TR-98-756
Institution: Stanford University, Computer Systems Laboratory
Title: Hardware-assisted Algorithms for Checkpoints
Author: Sunada, Dwight
Author: Glasco, David
Author: Flynn, Michael
Date: July 1998
Abstract: We can classify the algorithms for establishing checkpoints on distributed-shared-memory multiprocessors (DSMMs) into 3 broad classes: the tightly synchronized method (TSM), the loosely synchronized method (LSM), and the unsynchronized method (USM). TSM-type algorithms force the immediate establishment of a checkpoint whenever a dependency between 2 processors arises. LSM-type algorithms record this dependency and, hence, do not require the immediate establishment of a checkpoint if a dependency does arise; when a processor chooses to establish a checkpoint, the processor will query the dependency records to determine other processors that must also establish a checkpoint. USM-type algorithms allow a processor to establish a checkpoint without regard to any other processor. Within this framework, we developed 4 hardware-based algorithms: distributed recoverable shared memory (DRSM), DRSM for communication checkpoints (DRSM-C), DRSM with a hybrid method (DRSM-H), and DRSM with logs (DRSM-L). DRSM-C is a TSM-type algorithm, and DRSM and DRSM-H are LSM-type algorithms. DRSM-L is a USM-type algorithm and is the first of its kind for a tightly-coupled DSMM where hardware in the form of a directory maintains cache coherence. We find that DRSM has the best performance in terms of minimizing the impact of establishing checkpoints (or logs) on the running applications, but DRSM along with DRSM-C has the most expensive hardware requirements. DRSM-L has the second best performance but has the least expensive hardware requirement. We conclude that DRSM-L is the best algorithm in terms of cost and performance.
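A minimal sketch of the LSM bookkeeping described above (illustrative data structures only; the DRSM schemes implement such records in hardware):

    # Sketch of the loosely synchronized method (LSM): inter-processor
    # dependencies are recorded as they arise, and a checkpoint request
    # transitively pulls in every recorded dependent. Illustrative only.
    deps = {}  # processor id -> set of processors it is entangled with

    def record_dependency(p, q):
        deps.setdefault(p, set()).add(q)
        deps.setdefault(q, set()).add(p)

    def checkpoint_set(p):
        """All processors that must establish a checkpoint along with p."""
        group, frontier = {p}, [p]
        while frontier:
            for q in deps.get(frontier.pop(), ()):
                if q not in group:
                    group.add(q)
                    frontier.append(q)
        return group

    record_dependency(0, 1)   # e.g. P1 consumed a line P0 modified
    record_dependency(1, 2)
    print(checkpoint_set(0))  # {0, 1, 2}: the dependency closure

Under TSM the checkpoint would have been forced at record_dependency time; under USM (as in DRSM-L) per-processor logs make computing this closure unnecessary.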
CSL-TR-98-756

Report Number: CSL-TR-98-758
Institution: Stanford University, Computer Systems Laboratory
Title: Matching Output Queueing with a Combined Input Output Queued Switch
Author: Chuang, Shang-Tse
Author: Goel, Ashish
Author: McKeown, Nick
Author: Prabhakar, Balaji
Date: April 1998
Abstract: The Internet is facing two problems simultaneously: we need a faster switching/routing infrastructure, and we need to introduce guaranteed qualities of service (QoS). As a community, we have solutions to each: we can make the routers faster by using input-queued crossbars instead of shared memory systems; and we can introduce QoS using WFQ-based packet scheduling. But we don't know how to do both at the same time. Until now, the two solutions have been mutually exclusive - all of the work on WFQ-based scheduling algorithms has required that switches/routers use output-queueing, or centralized shared memory. We demonstrate that a Combined Input Output Queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet scheduling algorithms, including WFQ and strict priorities. More precisely, we show that a "speedup" of 2 - 1/N is both necessary and sufficient for this precise emulation. We introduce a variety of algorithms that configure the crossbar so that emulation is achieved with a speedup of two, and consider their running time and implementation complexity. We believe that, in the future, these results will make possible the support of QoS in very high bandwidth routers.
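Written out, the bound above is

    \[
      S(N) \;=\; 2 - \frac{1}{N},
    \]

so for a 32-port switch the necessary and sufficient speedup is $2 - 1/32 \approx 1.97$; this is why the crossbar-configuration algorithms in the paper target a speedup of exactly two, which suffices for any port count N.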
CSL-TR-98-758

Report Number: CSL-TR-98-759
Institution: Stanford University, Computer Systems Laboratory
Title: Optimized Multiprocessor Communication and Synchronization Using a Programmable Protocol Engine
Author: Heinlein, John
Date: March 1998
Abstract: In recent years, multiprocessor designs have converged towards a unified hardware architecture despite supporting different communication abstractions. The implementation of these communication abstractions and the associated protocols in hardware is complex, inflexible, and error prone. For these reasons, some recent designs have employed a programmable controller to manage system communication. One particular focus of these designs is implementing cache coherence protocols in software. This dissertation argues that a programmable communication controller that provides cache coherence can also effectively support block transfer and synchronization protocols. This research is part of the FLASH project, a major focus of which is exploring the integration of multiple communication protocols in a single multiprocessor architecture. In our analysis, we examine the needs of protocols other than cache coherence to identify the requirements they share. The interface between the processor and controller is one critical issue in these protocols, so we propose techniques to export such protocols reliably, at low overhead, and without system calls. Unlike most prior studies, our approach supports a modern operating system with features like multiprogramming, protection, and virtual memory. Our study focuses in detail on two classes of communication that are important for large scale multiprocessors: block transfer and synchronization using locks and barriers. In particular, we attempt to improve the performance of these classes of communication as compared to implementations using only software on top of shared memory. For each protocol we identify the critical metrics of performance, explore the limitations of existing techniques, then present our implementation, which is tailored to leverage the programmable communication controller. We evaluate each protocol in isolation, in the context of microbenchmarks, and within a variety of applications. We find that embedding advanced communication and synchronization features in a programmable controller has a number of advantages. For example, the block transfer protocol improves transfer performance in some cases, enables the processor to perform other work in parallel, and reduces processor cache pollution caused by the transfer. The synchronization protocols reduce overhead and eliminate bottlenecks associated with synchronization primitives implemented using software on top of shared memory. Simulations of scientific applications running on FLASH show that, in many cases, synchronization support improves performance and increases the range of machine sizes over which the applications scale. Our study shows that embedded programmability is a convenient approach for supporting block transfer and synchronization, and that the FLASH system design effectively supports this approach.
CSL-TR-98-759

Report Number: CSL-TR-98-760
Institution: Stanford University, Computer Systems Laboratory
Title: High Performance Inter-Chip Signalling
Author: Sidiropoulos, Stefanos
Date: April 1998
Abstract: The achievable off-chip bandwidth of digital ICs is a crucial and often limiting factor in the performance of digital systems. In intra-system interfaces where both latency and bandwidth are important, source-synchronous parallel channels have been adopted as the most effective solution. This work investigates receiver and clocking circuit design techniques for increasing the signalling rate and robustness of such channels. One of the main problems arising in the reception of high-speed signals is the adverse effect of high-frequency noise. To alleviate these effects, a new class of receiver structures that utilize current integration is proposed. The integration of current on a capacitor based on the incoming signal polarity effectively averages the signal over its valid time period, therefore filtering out high-frequency noise. An experimental transceiver prototype utilizing current-integrating receivers was designed and fabricated in a 0.8-µm CMOS technology. The prototype achieves a signaling rate of 740 Mbps/pin operating from a 3.3-V supply with a bit error rate of less than 10^-14. The second major challenge of inter-chip communication is the design of clock generation and synchronization circuits. Delay-locked loops are an attractive alternative to VCO-based phase-locked loops due to their simpler design, intrinsic stability, and absence of phase error accumulation. One of their main problems, however, is their limited phase capture range. A dual-loop architecture that eliminates this problem is proposed. This architecture employs a core loop to generate finely spaced clock edges, which are then used by a peripheral loop to generate the output clock through phase interpolation. Due to its digital control, the dual loop can offer great flexibility in the implementation of phase acquisition algorithms. A dual-DLL prototype was fabricated in a 0.8-µm CMOS technology. The prototype achieves an 80 kHz-400 MHz operating range, 12-ps rms jitter, and 0.4-ps/mV jitter supply sensitivity.
CSL-TR-98-760

Report Number: CSL-TR-98-761
Institution: Stanford University, Computer Systems Laboratory
Title: An Output Encoding Problem and A Solution Technique
Author: Mitra, Subhasish
Author: Avra, LaNae J.
Author: McCluskey, Edward J.
Date: November 1997
Abstract: We present a new output encoding problem as follows: Given a specification table, such as a truth table or a finite state machine state table, where some of the outputs are specified in terms of 1s, 0s and don't cares, and others are specified symbolically, determine a binary code for each symbol of the symbolically specified output column such that the total number of output functions to be implemented after encoding the symbolic outputs and compacting the output columns is minimum. There are several applications of this output encoding problem, one of which is to reduce the area overhead while implementing scan or pseudo-random BIST in a circuit with one-hot signals. This algorithm can also be used as a pre-processing step during FSM state encoding. In this paper, we develop an exact algorithm to solve the above problem, prove its correctness, analyze the worst case time complexity of the algorithm, and present experimental data to validate the claim that our encoding strategy helps to reduce the area of a synthesized circuit. In addition, we have investigated the possibility of using elementary gates to facilitate further merging of the output functions generated by the encoding bits with the output functions generated by the elementary gates.
CSL-TR-98-761

Report Number: CSL-TR-98-762
Institution: Stanford University, Computer Systems Laboratory
Title: Hardware/Software Co-Design of Run-Time Systems
Author: Mooney, Vincent John III
Date: June 1998
Abstract: Trends in system-level design show a clear move towards core-based design, where processors, controllers and other proprietary cores are reused and constitute essential building blocks. Thus, areas such as embedded system design and system-on-a-chip design are changing dramatically, requiring new design methodologies and Computer-Aided Design (CAD) tools. This thesis presents a novel system-level scheduling methodology and CAD environment, the SERRA Run-Time Scheduler Synthesis and Analysis Tool. Unlike previous approaches to run-time scheduling, we split our run-time scheduler between hardware and software, as opposed to placing the scheduler all in one or the other. Thus, given an already partitioned input system specification in an HDL and a software language, SERRA automatically generates a run-time scheduler partly in hardware and partly in software, for a target architecture of a microprocessor core together with multiple hardware cores or modules. A heuristic scheduling algorithm solves for priorities of software tasks executing on a single microprocessor with a custom priority scheduler, interrupt service routine, and context switch code. Real-time analysis takes into account the split hardware/software implementation both of the scheduler and of the tasks. The scheduler supports standard requirements of both domains, such as relative timing constraints in hardware and semaphores in software. A designer who uses the SERRA CAD tool gains the advantage of efficient satisfaction of timing constraints for hardware/software systems within a framework that enables different hardware/software partitions to be quickly evaluated. Thus, a hardware/software partitioning tool could easily sit on top of SERRA, which would generate run-time systems for different hardware/software partitions chosen for evaluation. In addition, SERRA's more efficient design space exploration can improve time-to-market for a product. Finally, we present two case studies. First, we show a full analysis, synthesis, and simulation of a hardware/software implementation of a robotics control system for a PUMA arm. Second, we describe a sample prototype of the split run-time scheduler in an actual design, a force-feedback real-time Haptic robot. For this application, the hardware part of the scheduler was implemented on programmable logic communicating with software using a standard communication protocol.
CSL-TR-98-762

Report Number: CSL-TR-98-764
Institution: Stanford University, Computer Systems Laboratory
Title: A Method for Analysis of C^1-Continuity of Subdivision Surfaces
Author: Zorin, Denis
Date: May 1998
Abstract: A sufficient condition for C^1-continuity of subdivision surfaces was proposed by Reif [17] and extended to a more general setting in [22]. In both cases, the analysis of C^1-continuity is reduced to establishing injectivity and regularity of a characteristic map. In all known proofs of C^1-continuity, explicit representation of the limit surface on an annular region was used to establish injectivity. We propose a new approach to this problem: we show that for a general class of subdivision schemes, regularity can be inferred from the properties of a sufficiently close linear approximation, and injectivity can be verified by computing the index of a curve. An additional advantage of our approach is that it allows us to prove C^1-continuity for all valences of vertices, rather than for an arbitrarily large, but finite number of valences. As an application, we use our method to analyze C^1-continuity of most stationary subdivision schemes known to us, including the interpolating Butterfly and Modified Butterfly schemes, as well as Kobbelt's interpolating scheme for quadrilateral meshes.
CSL-TR-98-764

Report Number: CSL-TR-98-768
Institution: Stanford University, Computer Systems Laboratory
Title: EJAVA - Causal Extensions for Java
Author: Santoro, Alexandre
Author: Mann, Walter
Author: Madhav, Neel
Author: Luckham, David
Date: August 1998
Abstract: Programming languages like Java provide designers with a variety of classes that simplify the process of building multithreaded programs. Though useful, especially in the creation of reactive systems, multithreaded programs present challenging problems such as race conditions and synchronization issues. Validating these programs against a specification is also not trivial, since Java does not clearly indicate thread interaction. These problems can be solved by modifying Java so that it produces computations, collections of events with both causal and temporal ordering relations defined for them. Specifically, the causal ordering is ideal for identifying thread interaction. This paper presents eJava, an extension to Java that is both event-based and causally aware, and shows how it simplifies the process of understanding and debugging multithreaded programs.
CSL-TR-98-768

Report Number: CSL-TR-98-769
Institution: Stanford University, Computer Systems Laboratory
Title: Resource Discovery in Ad hoc Networks
Author: Tang, Diane
Author: Chang, Chih-Yuan
Author: Tanaka, Kei
Author: Baker, Mary
Date: August 1998
Abstract: Much of the current research in mobile networking investigates how to support a mobile user within an established infrastructure of routers and servers. Ad hoc networks come into play when no such established infrastructure exists. This paper presents a two-stage protocol to solve the resource discovery problem in ad hoc networks: how hosts discover what resources are available in the network and how they discover how to use the resources. This protocol does not require any established servers or other infrastructure. It only requires routing capabilities in the network.
CSL-TR-98-769

Report Number: CSL-TR-98-772
Institution: Stanford University, Computer Systems Laboratory
Title: Designing a Partitionable Multiplier
Author: Lee, Hyuk-Jun
Author: Flynn, Michael
Date: October 1998
Abstract: This report presents the design of a 64-bit integer multiplier core that can perform 32-bit or 16-bit parallel integer multiplications (PMUL) and 32-bit or 16-bit parallel integer multiplications followed by additions (PMADD). The proposed multiplier removes sign and constant bits from its core and projects them to the boundaries to minimize the complexity of base cells. It also adopts an array-of-arrays architecture with unequal array sizes by decoupling partial product generation from carry save addition. This makes it possible to achieve high speed for 64-bit multiplication. Two architectures, which are done in dual-rail domino, are tested for functionality in Verilog and simulated in HSPICE for a TSMC 0.35-um process. The first architecture is capable of both PMUL and PMADD. The estimated delay is 4.9 ns (excluding a final adder) at a 3.3-V supply and 25°C, and its estimated area is 6.5 mm^2. The estimated delay of the second architecture, only capable of PMUL, is 4.5 ns. Its estimated area is 5.2 mm^2.
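For readers unfamiliar with partitioned multiplication, the following behavioral sketch shows the arithmetic a 16-bit PMUL performs on 64-bit operands (a software model of the lane semantics only; how the hardware packs or truncates the products is a design choice of the report, and this sketch simply returns all four full products):

    # Behavioral sketch of a 16-bit partitioned multiply (PMUL) on 64-bit
    # words: the four 16-bit lanes are multiplied pairwise. Result packing
    # is illustrative; the report describes the hardware datapath.
    def pmul16(a, b):
        lanes = lambda x: [(x >> (16 * k)) & 0xFFFF for k in range(4)]
        return [x * y for x, y in zip(lanes(a), lanes(b))]

    print([hex(p) for p in pmul16(0x0004000300020001, 0x0010001000100010)])
    # ['0x10', '0x20', '0x30', '0x40']  (lane 0 first)

A PMADD would additionally sum such lane products into accumulators.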
CSL-TR-98-772

Report Number: CSL-TR-98-773
Institution: Stanford University, Computer Systems Laboratory
Title: Scalable Services for Video-On-Demand
Author: Chan, Shuen-Han Gary
Author: Tobagi, Fouad A.
Date: December 1998
Abstract: Video-on-demand (VOD) refers to video services in which users can request any video program from a server at any time. VOD has important applications in entertainment, education, information, and advertising, such as movie-on-demand, distance learning, home shopping, interactive news, etc. In order to provide VOD services accommodating a large number of video titles and concurrent users, a VOD system has to be scalable -- scalable in storage and scalable in streaming capacity. Our goal is to design such a system with low cost, low complexity, and offering a high level of service quality (in terms of, for example, user delay experienced or user loss rate). Storage scalability is achieved by using a hierarchical storage system, in which video files are stored in tertiary libraries or jukeboxes and transferred to a secondary level (of magnetic or optical disks) for display. We address the design of such a system by specifying the required architectural parameters (the bandwidth and storage capacity in each level) and operational procedures (such as request scheduling and file replacement schemes) in order to meet certain performance goals. Scalability in streaming capacity can be achieved by means of request batching, in which requests for a video arriving within a period of time are grouped together (i.e., "batched") and served with a single multicast stream. The goal here is to achieve the trade-off between the multicasting cost and user delay in the system. We study a number of batching schemes (in terms of user delay experienced, the number of users collected in each batch, etc.), and how system profit can be maximized given users' reneging behaviour. Both storage and streaming scalabilities can be achieved with a distributed servers architecture, in which video files are accessed from servers distributed in a network. We examine a number of caching schemes in terms of their requirements in storage and streaming bandwidth. Given certain cost functions in storage and streaming, we address when and how much a video file should be cached in order to minimize the system cost. We show that a distributed servers architecture can achieve great cost savings while offering users low start-up delay.
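The batching idea in the middle of this abstract reduces to a few lines of logic; the sketch below (an illustrative fixed-window policy, not one of the report's schemes) groups requests for a title that arrive within a window onto one multicast stream:

    # Sketch of request batching: requests for the same video title that
    # arrive within WINDOW seconds of the first request share a single
    # multicast stream. The window policy here is illustrative; the
    # report studies and compares several batching schemes.
    from collections import defaultdict

    WINDOW = 30.0  # seconds; illustrative batching interval

    def batch(requests):
        """requests: iterable of (arrival_time, title) -> batches dict."""
        batches, window_start = defaultdict(list), {}
        for t, title in sorted(requests):
            if title not in window_start or t - window_start[title] > WINDOW:
                window_start[title] = t  # open a new batch -> new stream
            batches[(title, window_start[title])].append(t)
        return batches

    reqs = [(0, "A"), (5, "A"), (12, "B"), (40, "A")]
    for (title, start), times in batch(reqs).items():
        print(f"{title}: stream at t={start} serves {len(times)} user(s)")

The trade-off the abstract mentions is visible here: a larger WINDOW collects more users per stream but makes the first requester wait longer.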
CSL-TR-98-773

Report Number: CSL-TR-98-774
Institution: Stanford University, Computer Systems Laboratory
Title: Fault-Tolerant Systems in A Space Environment: The CRC ARGOS Project
Author: Shirvani, Philip P.
Author: McCluskey, Edward J.
Date: December 1998
Abstract: This report describes the ARGOS project at Stanford CRC. The primary goals of this project are to collect data on the errors that occur in digital integrated circuits in a space environment, to determine the tradeoffs between fault-avoidance and fault-tolerance, and to see if radiation hardening can be avoided by using fault tolerance techniques. Our experiments will be carried out on two processor boards on the ARGOS experimental satellite. One of the boards uses radiation-hardened components while the other uses only commercial off-the-shelf (COTS) parts. Programs and data can be uploaded to the boards during the mission. This capability allows us to evaluate different software fault-tolerance techniques. This report reviews various error detection techniques. Software techniques that do not require any special hardware are discussed. The framework of the software that we are developing for error data collection is presented.
CSL-TR-98-774

Report Number: CSL-TR-98-775
Institution: Stanford University, Computer Systems Laboratory
Title: Design of High-Speed Serial Links in CMOS
Author: Yang, Chih-Kong Ken
Date: December 1998
Abstract: Demand for bandwidth in serial links has been increasing as the communications industry demands higher quantity and quality of information. Whereas traditional gigabit-per-second links have been built in bipolar or GaAs technologies, this research aims to push the use of CMOS process technology in such links. Intrinsic gate speed limitations are overcome by parallelizing the data. The on-chip frequency is maintained at a fraction (1/16) of the off-chip data rate. Clocks with carefully controlled phases tapped from a local ring oscillator are driven to a bank of input samplers to convert the serial bit stream into parallel data. Similarly, the overlap of multiple-phased clocks is used to synchronize the multiplexing of the parallel data onto the transmission line. To perform clock/data recovery, data is further oversampled with finer phase separation and passed to digital logic. The digital logic operates upon the samples to detect transitions in the bit stream to track the bit boundaries. This tracking can operate at the cycle rate of the digital logic, allowing robustness to systematic phase noise. The challenge lies in capturing the high frequency data stream and generating low jitter, accurately spaced clock edges. A test chip is built demonstrating the transmission and recovery of a 4.0-Gb/s bit stream with < 10^-14 bit-error rate using a 3x oversampled system in a 0.5-um MOSIS CMOS process.
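The oversampling recovery scheme lends itself to a short software model (a simplification: the chip tracks bit boundaries continuously in digital logic, whereas this sketch picks one fixed sampling phase):

    # Simplified model of 3x-oversampled data recovery: each bit period
    # is sampled three times; transitions between adjacent samples locate
    # the bit boundaries, and the sample position farthest from the
    # transitions is taken as the data. The chip's phase tracking is
    # continuous and more sophisticated than this fixed-phase pick.
    def recover(samples):
        edges = [0, 0, 0]
        for i in range(1, len(samples)):
            if samples[i] != samples[i - 1]:
                edges[i % 3] += 1        # which phase the edge falls on
        phase = min(range(3), key=lambda p: edges[p])
        return samples[phase::3]

    bits = [1, 0, 1, 1, 0]
    samples = [b for b in bits for _ in range(3)]  # ideal 3x oversampling
    print(recover(samples) == bits)                # True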
CSL-TR-98-775

Report Number: CSL-TR-99-776
Institution: Stanford University, Computer Systems Laboratory
Title: Novel Checkpointing Algorithm for Fault Tolerance on a Tightly-Coupled Multiprocessor
Author: Sunada, Dwight
Author: Glasco, David
Author: Flynn, Michael
Date: January 1999
Abstract: The tightly-coupled multiprocessor (TCMP), where specialized hardware maintains the image of a single shared memory, offers the highest performance in a computer system. In order to deploy a TCMP in the commercial world, the TCMP must be fault tolerant. Researchers have designed various checkpointing algorithms to implement fault tolerance in a TCMP. To date, these algorithms fall into 2 principal classes, where processors can be checkpoint dependent on each other. We introduce a new apparatus and algorithm that represents a 3rd class of checkpointing scheme. Our algorithm is distributed recoverable shared memory with logs (DRSM-L) and is the first of its kind for TCMPs. DRSM-L has the desirable property that a processor can establish a checkpoint or roll back to the last checkpoint in a manner that is independent of any other processor. In this paper, we describe DRSM-L, show the optimal value of its principal design parameter, and present results indicating its performance under simulation.
CSL-TR-99-776

Report Number: CSL-TR-99-777
Institution: Stanford University, Computer Systems Laboratory
Title: The Mobile People Architecture
Author: Appenzeller, Guido
Author: Lai, Kevin
Author: Maniatis, Petros
Author: Roussopoulos, Mema
Author: Swierk, Edward
Author: Zhao, Xinhua
Author: Baker, Mary
Date: January 1999
Abstract: People are the outsiders in the current communications revolution. Computer hosts, pager terminals, and telephones are addressable entities throughout the Internet and telephony systems. Human beings, however, still need application-specific tricks to be identified, like email addresses, telephone numbers, and ICQ IDs. The key challenge today is to find people and communicate with them personally, as opposed to communicating merely with their possibly inaccessible machines---cell phones that are turned off, or PCs on faraway desktops. We introduce the Mobile People Architecture, designed to meet this challenge. The main goal of this effort is to put the person, rather than the devices that the person uses, at the endpoints of a communication session. This architecture introduces the concept of routing between people. To that effect, we define the Personal Proxy, which has a dual role: as a Tracking Agent, the proxy maintains the list of devices or applications through which a person is currently accessible; as a Dispatcher, the proxy directs communications and uses Application Drivers to massage communication bits into a format that the recipient can see immediately. It does all this while protecting the location privacy of the recipient from the message sender. Finally, we substantiate our architecture with ideas about a future prototype that allows the easy integration of new application protocols.
CSL-TR-99-777

Report Number: CSL-TR-99-778
Institution: Stanford University, Computer Systems Laboratory
Title: Analysis of HTTP/1.1 Performance on a Wireless Network
Author: Cheng, Stephen
Author: Lai, Kevin
Author: Baker, Mary
Date: February 1999
Abstract: We compare the performance of HTTP/1.0 and 1.1 on a high latency, low bandwidth wireless network. HTTP/1.0 is known to have low throughput and consume excessive network and server resources on today's graphics-intensive web pages. A high latency, low bandwidth network only magnifies these problems. HTTP/1.1 was developed to remedy these problems. We show that on a Ricochet wireless network, HTTP/1.1 doubles throughput over HTTP/1.0 and decreases the number of packets sent by 60%.
CSL-TR-99-778

Report Number: CSL-TR-99-779
Institution: Stanford University, Computer Systems Laboratory
Title: CHOKe - A simple approach for providing Quality of Service through stateless approximation of fair queueing
Author: Pan, Rong
Author: Prabhakar, Balaji
Date: March 1999
Abstract: We consider the problem of providing a fair bandwidth allocation to each of n flows that share an outgoing link at a congested router. The buffer at the outgoing link is a simple FIFO, commonly shared by packets belonging to the n flows. We devise a simple packet dropping scheme, CHOKe, that discriminates against the flows which submit more packets/sec than is allowed by their fair share. By doing this, the scheme aims to approximate the fair queueing policy.
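The drop rule itself is compact; the sketch below follows the published description of CHOKe (the congestion test and the omission of the RED-style admission probability are simplifications):

    # Sketch of the CHOKe drop rule: when a packet arrives at a congested
    # shared FIFO, compare it with a packet drawn at random from the
    # buffer; if both belong to the same flow, drop both. Flows exceeding
    # their fair share occupy more of the buffer and so are matched (and
    # dropped) more often. The congestion test and the omitted RED-style
    # admission step are simplifications of the actual scheme.
    import random
    from dataclasses import dataclass

    @dataclass
    class Packet:
        flow: int

    def on_arrival(queue, pkt, capacity):
        if queue and len(queue) > capacity // 2:   # "congested" (illustrative)
            victim = random.randrange(len(queue))
            if queue[victim].flow == pkt.flow:
                del queue[victim]                  # drop matched packet...
                return                             # ...and the arriving one
        if len(queue) < capacity:
            queue.append(pkt)

    q = []
    for f in [1, 1, 1, 1, 2, 1, 1, 3]:  # flow 1 is the heavy flow
        on_arrival(q, Packet(f), capacity=6)
    print([p.flow for p in q])          # flow 1 tends to lose the most packets

Because no per-flow state is kept, the scheme stays stateless while approximating fair queueing.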
CSL-TR-99-779

Report Number: CSL-TR-99-780
Institution: Stanford University, Computer Systems Laboratory
Title: Coarse Grain Carry Architecture for FPGA
Author: Lee, Hyuk-Jun
Author: Flynn, Michael
Date: February 1999
Abstract: In this report we investigated several methods to improve the performance of FPGAs for general-purpose computing. In the early stage of this research we identified the fine grain size of current FPGAs as the major performance bottleneck. To increase the grain size, we introduced a coarse grain carry architecture that can increase the granularity of arithmetic operations, including addition and multiplication. We used throughput density as a cost/performance metric to justify the benefit of the new architecture. We could achieve roughly up to 5 times larger throughput density for selected applications. Along with that, we also introduced a dual-rail carry structure to improve the performance of a carry chain, which usually sets the cycle time of an FPGA design. A carry select adder built from the dual-rail carry structure reduces the carry chain delay by a factor of two.
CSL-TR-99-780

Report Number: CSL-TR-99-781
Institution: Stanford University, Computer Systems Laboratory
Title: An Architecture for Distributed, Interactive, Multi-Stream, Multi-Participant Audio and Video
Author: Schmidt, Brian K.
Date: April 1999
Abstract: Today's computer users are becoming increasingly sophisticated, demanding richer and fuller machine interfaces. This is evidenced by the fact that viewing and manipulating a single stream of full-size video along with its associated audio stream is becoming commonplace. However, multiple media streams will become a necessity to meet the increasing demands of future applications. An example which requires multiple media streams is an application that supports multi-viewpoint audio and video, which allows users to observe a remote scene from many different perspectives so that a sense of immersion is experienced. Although desktop audio and video open many exciting possibilities, their use in a computer environment only becomes interesting when computational resources are expended to manipulate them in an interactive manner. We feel that user interaction will also become increasingly complex. In addition, future applications will make significant demands on the network in terms of bandwidth, quality of service guarantee, latency, and connection management. Based on these trends we feel that an architecture designed to support future multimedia applications must provide support for several key features. The need for numerous media streams is clearly the next step forward in terms of creating a richer environment. Support for non-trivial, fine-grain interaction with the media data is another important requirement, and distributing the system across a network is imperative so that multiple participants can become involved. Finally, as a side effect of the network and multi-participant requirements, integral support for and use of multicast will be a prime architectural component. The goal of our work is to design and implement a complete system architecture capable of supporting applications with these requirements.
CSL-TR-99-781

Report Number: CSL-TR-99-782
Institution: Stanford University, Computer Systems Laboratory
Title: A Compiler for Creating Evolutionary Software and Application Experience
Author: Schmidt, Brian K.
Author: Lam, Monica S.
Date: April 1999
Abstract: Recent studies have shown that significant amounts of value repetition occur in modern applications. Due to global initialized data, immediate values, address calculations, redundancy in external input, etc., the same value is used at the same program point as much as 80% of the time. Naturally, attention has begun to focus on how compilers and specialized hardware can take advantage of this value locality. Unfortunately, there is significant overhead associated with dynamically recognizing predictable values and optimizing for them, and all too often this cost dramatically outweighs the benefits. There are various levels at which value locality can be observed and used for optimization, ranging from register value re-use to function memoization. We are concerned with predictability of program variable values across multiple runs of a given program. In this paper we present a complete system that automatically translates ordinary sequential programs into evolutionary software, software that evolves to improve its performance using execution information from previous runs. This concept can have a significant impact on software engineering, as it can be used to replace the manual performance tuning phase in the application development lifecycle. Not only does it relieve the developer of a tedious and error-prone task, but it also has the important side effect of keeping applications free from obscure hand optimizations which muddle the code and make it difficult to maintain or port. This concept can also be used to produce efficient applications where static performance tuning is not adequate. Our system automatically identifies targets for program specialization and instruments the code to gather high-level profiling information. Upon completion of a run, the program automatically re-compiles itself when the new profile information suggests that recompilation will be profitable. The programmer is completely unaware of this process, as the software tailors itself to its environment. We have demonstrated the utility of our system by using it to optimize graphics applications that are built upon a general-purpose graphics library. While much of this work is based on well-established techniques, this is the first practical system which takes advantage of predictability in a way such that the overhead does not overwhelm the benefit.
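The run-to-run feedback loop the abstract describes can be caricatured in a few lines. The decorator below is our illustration, not the paper's compiler-based system: it persists a value profile across runs and, once one argument value dominates, installs a precomputed result behind a cheap guard.

```python
import json, os
from functools import wraps

PROFILE = "value_profile.json"       # hypothetical cross-run profile store

def evolving(fn):
    """Sketch (ours, not the paper's system) of cross-run value-locality
    specialization: count argument values over previous runs; if one value
    dominated, precompute its result and guard it with an equality test."""
    profile = json.load(open(PROFILE)) if os.path.exists(PROFILE) else {}
    counts = profile.setdefault(fn.__name__, {})
    total = sum(counts.values())
    hot = max(counts, key=counts.get) if counts else None
    fast = {hot: fn(int(hot))} if hot and counts[hot] > 0.8 * total else {}

    @wraps(fn)
    def wrapper(x):
        counts[str(x)] = counts.get(str(x), 0) + 1      # profile this run
        with open(PROFILE, "w") as f:
            json.dump(profile, f)                       # persist for next run
        return fast[str(x)] if str(x) in fast else fn(x)
    return wrapper

@evolving
def render(quality):                 # stands in for a tunable hot routine
    return sum(i * quality for i in range(10_000))

print(render(3))                     # repeated runs with 3 evolve a fast path
```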
CSL-TR-99-782

Report Number: CSL-TR-99-783
Institution: Stanford University, Computer Systems Laboratory
Title: Optimum Instruction-level Parallelism (ILP) for Superscalar and VLIW Processors
Author: Hung, Patrick
Author: Flynn, Michael J.
Date: July 1999
Abstract: Modern superscalar and VLIW processors fetch, decode, issue, execute, and retire multiple instructions per cycle. By taking advantage of instruction-level parallelism (ILP), processor performance can be improved substantially. However, increasing the level of ILP may eventually result in diminishing and negative returns due to control and data dependencies among subsequent instructions as well as resource conflicts within a processor. Moreover, the additional ILP complexity can impose significant overhead on cycle time and latency. This technical report uses a generic processor model to investigate the optimum level of ILP for superscalar and VLIW processors.
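The diminishing-returns argument invites a toy model. The formula below is our illustration of the trade-off, not the report's generic processor model: sustained IPC saturates at the program's available parallelism while issue width stretches the cycle time, so throughput IPC(w)/t(w) peaks at a finite width.

```python
# Toy optimum-ILP model (ours, not the report's): IPC is capped by the
# program's available parallelism, while wider issue lengthens the cycle.

def throughput(width, avail_ilp=4.0, t0=1.0, overhead=0.08):
    ipc = min(width, avail_ilp)            # dependences cap sustained IPC
    cycle = t0 * (1 + overhead * width)    # ILP complexity stretches t(w)
    return ipc / cycle                     # instructions per unit time

best = max(range(1, 17), key=throughput)
print("optimum issue width under these assumptions:", best)
```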
CSL-TR-99-783

Report Number: CSL-TR-99-784
Institution: Stanford University, Computer Systems Laboratory
Title: The M-log-Fraction Transform (MFT) for Computer Arithmetic
Author: Mencer, Oskar
Author: Flynn, Michael J.
Author: Morf, Martin
Date: July 1999
Abstract: State-of-the-art continued fraction (CF) arithmetic enables us to compute rational functions so that input and output values are represented as simple continued fractions. The main problem of previous work is the conversion between simple continued fractions and binary numbers. The M-log-Fraction Transform (MFT), introduced in this work, enables us to instantly convert between binary numbers and M-log-Fractions. Conversion is related to the distance between the '1's of the binary number. Applying M-log-Fractions to continued fraction arithmetic algorithms reduces the complexity of the CF algorithm to shift-and-add structures, and more specifically, digit-serial arithmetic algorithms for (homographic) rational functions. We show two applications of the MFT: (1) a high radix rational arithmetic unit computing (ax+b)/(cx+d) in a shift-and-add structure. (2) the evaluation of rational approximations (or continued fraction approximations) in a multiplication-based structure. In (1) we obtain algebraic formulations of the entire computation, including the next-digit-selection function. For high radix operation, we can therefore partition the selection table into arithmetic blocks, making high radix implementations feasible. (2) overlaps the final division of a rational approximation with the multiply-add iterations. The MFT bridges the gap between continued fractions and the binary number representation, enabling the design of a new class of efficient rational arithmetic units and efficient evaluation of rational approximations.
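The remark that conversion "is related to the distance between the '1's of the binary number" at least pins down the quantity involved. The snippet below only extracts those gaps; it is our pointer to the flavor of the transform, not the MFT algorithm itself.

```python
def one_gaps(n):
    """Distances between consecutive '1' bits of n, most significant first;
    the quantity the MFT abstract relates conversion to (our illustration,
    not the transform itself)."""
    positions = [i for i, b in enumerate(bin(n)[2:]) if b == "1"]
    return [j - i for i, j in zip(positions, positions[1:])]

print(bin(0b10010001), "->", one_gaps(0b10010001))   # gaps [3, 4]
```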
CSL-TR-99-784

Report Number: CSL-TR-99-785
Institution: Stanford University, Computer Systems Laboratory
Title: Checkpointing Apparatus and Algorithms for Fault-Tolerant Tightly-Coupled Multiprocessors
Author: Sunada, Dwight
Date: July 1999
Abstract: The apparatus and algorithms for establishing checkpoints on a tightly-coupled multiprocessor (TCMP) fall naturally into three broad classes: tightly synchronized method, loosely synchronized method, and unsynchronized method. The algorithms in the class of the tightly synchronized method force the immediate establishment of a checkpoint whenever a dependency between two processors arises. The algorithms in the class of the loosely synchronized method record this dependency and, hence, do not require the immediate establishment of a checkpoint if a dependency does arise; when a processor chooses to establish a checkpoint, the processor will query the dependency records to determine other processors that must also establish a checkpoint. The algorithms in the class of the unsynchronized method allow a processor to establish a checkpoint without regard to any other processor. Within this framework, we develop four apparatus and algorithms: distributed recoverable shared memory (DRSM), DRSM for communication checkpoints (DRSM-C), DRSM with half of the memory (DRSM-H), and DRSM with logs (DRSM-L). DRSM-C is an algorithm in the class of the tightly synchronized method, and DRSM and DRSM-H are algorithms in the class of the loosely synchronized method. DRSM-L is an algorithm in the class of the unsynchronized method and is the first of its kind for a TCMP. DRSM-L has the best performance in terms of minimizing the impact of establishing checkpoints (or logs) on the running applications and has the least expensive hardware.
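In the loosely synchronized class, the key step is the query of the dependency records: the initiating processor must checkpoint together with the transitive closure of processors it has interacted with since the last checkpoint. A small sketch of that closure (our code, not the DRSM apparatus):

```python
def checkpoint_set(initiator, deps):
    """Loosely synchronized checkpointing (sketch): `deps` maps each
    processor to the processors it has depended on since the last
    checkpoint; the whole transitive closure must checkpoint together."""
    todo, group = [initiator], set()
    while todo:
        p = todo.pop()
        if p not in group:
            group.add(p)
            todo.extend(deps.get(p, ()))
    return group

# P0 consumed data from P1, and P1 from P2: all three checkpoint together.
print(checkpoint_set("P0", {"P0": {"P1"}, "P1": {"P2"}}))
```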
CSL-TR-99-785

Report Number: CSL-TR-99-786
Institution: Stanford University, Computer Systems Laboratory
Title: High-Speed Interconnect Schemes for a Pipelined FPGA
Author: Lee, Hyuk-Jun
Author: Flynn, Michael J.
Date: August 1999
Abstract: This paper presents two high-speed interconnect schemes for a pipelined FPGA utilizing a locally synchronized postcharging technique. By avoiding a globally synchronized clock, we reduce the power consumption significantly. Through postcharging the interconnect and overlapping the postcharging delay with the logic delay, we successfully hide the postcharge time. The long channel devices reduce the area penalty due to delay elements significantly. The timing simulation is done using Hspice for a TSMC 0.35 um process, and area is measured by drawing key elements in MAGIC and using the area model developed in [2]. The postcharge scheme shows a 30% delay reduction over the precharge scheme and up to 310% and 230% delay reductions over the conventional NMOS pass transistor scheme and the tri-state buffer scheme.
CSL-TR-99-786

Report Number: CSL-TR-99-787
Institution: Stanford University, Computer Systems Laboratory
Title: CHOKe - A stateless queue management scheme for approximating fair bandwidth allocation
Author: Pan, Rong
Author: Prabhakar, Balaji
Author: Psounis, Konstantinos
Date: September 1999
Abstract: We investigate the problem of providing a fair bandwidth allocation to each of n flows that share an outgoing link at a congested router. The buffer at the outgoing link is a simple FIFO, commonly shared by packets belonging to the n flows. We devise a simple packet dropping scheme, CHOKe, that discriminates against the flows which submit more packets/sec than is allowed by their fair share. By doing this, the scheme aims to approximate the fair queueing policy. Since it is stateless and easy to implement, CHOKe controls unresponsive or misbehaving flows with a minimum overhead.
CSL-TR-99-787

Report Number: CSL-TR-99-788
Institution: Stanford University, Computer Systems Laboratory
Title: Managing Event Processing Networks
Author: Perrochon, Louis
Author: Kasriel, Stephane
Author: Luckham, David C.
Date: October 1999
Abstract: This technical report presents Complex Event Processing (CEP), a fundamental new technology that will enable the next generation of middleware-based distributed applications. CEP gains information on distributed systems and uses this knowledge for monitoring, failure analysis, or prediction of activities. A very promising route in CEP research is that of Event Processing Networks, which is one of the main areas of research of the Program and Analysis Group at Stanford University. Event Processing Networks are one way of describing and building CEP, by successively filtering meaningful information and aggregating the corresponding events into higher levels of abstraction. This report describes in detail the foundations and aims of Complex Event Processing. Then we will introduce the concept of Event Processing. Finally, we will describe the architecture of the CEP system.
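The filter-then-aggregate pattern maps naturally onto a pipeline of stages. The sketch below is a generic illustration of an event processing network in that spirit, not the Stanford CEP system's actual interfaces.

```python
# Generic event-processing-network sketch (ours, not the CEP system's API):
# stages filter low-level events and aggregate them into abstract ones.

def filter_stage(events, predicate):
    return (e for e in events if predicate(e))

def aggregate_stage(events, window, combine):
    """Fuse every `window` consecutive events into one higher-level event."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == window:
            yield combine(batch)
            batch = []

raw = [{"type": "disk_retry", "host": "h1"}] * 6
retries = filter_stage(raw, lambda e: e["type"] == "disk_retry")
alarms = aggregate_stage(
    retries, 3, lambda b: {"type": "disk_failing", "host": b[0]["host"]})
print(list(alarms))          # two abstract "disk_failing" events emerge
```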
CSL-TR-99-788

Report Number: CSL-TR-99-789
Institution: Stanford University, Computer Systems Laboratory
Title: Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors
Author: Soundararajan, Vijayaraghavan
Date: November 1999
Abstract: Shared-memory multiprocessors are being used increasingly as compute servers. These systems enable efficient usage of computing resources through the aggregation and tight coupling of CPUs, memory, and I/O. One popular design for such machines is a bus-based architecture. However, as processors get faster, the shared bus becomes a bandwidth bottleneck. CC-NUMA (Cache-Coherent with Non-Uniform Memory Access time) machines remove this architectural limitation and provide a scalable shared-memory architecture. One significant characteristic of the CC-NUMA architecture is that the latency to access remote data is considerably larger than the latency to access local data. On such machines, good data locality can reduce memory stall time and is therefore critical for high performance. In this thesis we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. We utilize the programmability of the FLASH memory controller to explore a number of techniques for improving data locality: base cache-coherence (CC); a Remote Access Cache (RAC), in which a portion of local memory is used to cache remotely-allocated data at cache-line granularity; a Cache-Only Memory Architecture (COMA-F), in which all of local memory is used as a cache under hardware control; and OS-assisted page migration/replication (MigRep), in which the operating system migrates or replicates pages according to observed cache miss patterns. We then propose a novel hybrid scheme, MIGRAC, that combines the benefits of RAC and MigRep. We evaluate complete implementations of these schemes on the same platform using compute-server workloads (including OS effects), thereby providing a more consistent and detailed evaluation than has been done before. We find that a simple RAC can improve performance significantly over CC (up to 64% gains). COMA-F improves locality but its additional complexity limits its gains versus CC (only 14% improvement). MigRep performs well (up to 33% gains) but does not handle fine-grain sharing as effectively as RAC or COMA-F. Finally, our MIGRAC approach performs well relative to RAC (up to 57% faster) and MigRep (up to 24% faster) and is robust.
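The MigRep side of the design reduces to a counter-driven placement decision. The sketch below caricatures that policy (our code, not the FLASH controller's): heavy one-node misses argue for migration, heavy read-only sharing for replication, and write-sharing for leaving the page alone.

```python
def page_policy(miss_counts, write_shared, threshold=64):
    """Counter-driven page placement (sketch, not the FLASH implementation):
    miss_counts maps page -> {node: remote-miss count since last decision}."""
    decisions = {}
    for page, per_node in miss_counts.items():
        hot = sorted(n for n, c in per_node.items() if c >= threshold)
        if not hot:
            continue                          # too cold to act on
        if page in write_shared:
            decisions[page] = "stay"          # replication would thrash
        elif len(hot) == 1:
            decisions[page] = "migrate to " + hot[0]
        else:
            decisions[page] = "replicate on " + ", ".join(hot)
    return decisions

print(page_policy({"p1": {"n2": 80}, "p2": {"n1": 70, "n3": 90}},
                  write_shared=set()))
```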
CSL-TR-99-789

Report Number: CSL-TR-00-790
Institution: Stanford University, Computer Systems Laboratory
Title: Reciprocal Approximation Theory with Table Compensation
Author: Liddicoat, Albert A.
Author: Flynn, Michael J.
Date: January 2000
Abstract: [Sch93] demonstrates the reuse of a multiplier partial product array (PPA) to approximate higher order functions such as the reciprocal, division, and square root. Schwarz generalizes this technique to any higher order function that can be expressed as A*B=C. Using this technique, the height of the PPA increases exponentially with the desired result precision. Schwarz added compensation terms within the PPA to reduce the worst case error. This work investigates the approximation theory of higher order functions without the bounds of multiplier reuse. Additional techniques are presented to increase the worst case precision for a fixed height PPA. A compensation table technique is presented in this work. This technique combines the approximation computation with a compensation table to produce a result with fixed precision. The area-time tradeoff for three design points is studied. Increasing the computation decreases the area needed to implement the function but also increases the latency. Finally, the applicability of this technique to the bipartite ROM reciprocal table is discussed. We expect that this technique can be applied to the bipartite ROM reciprocal table to significantly reduce the hardware area needed at a minimal increase in latency. In addition, this work focuses on hardware reconfigurability and the ability of the hardware unit to be used to perform multiple higher order functions efficiently. The PPA structure can be used to approximate several higher order functions that can be expressed as a multiply.
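The compensation-table idea can be demonstrated numerically: pair a cheap low-order reciprocal approximation with a small table, indexed by the leading bits of the input, holding the average residual error on each interval. The code below is our numerical illustration of that combination, not Schwarz's PPA construction.

```python
# Numerical illustration (ours, not the report's PPA) of table compensation:
# a linear reciprocal seed 2 - x on [1, 2), plus a 16-entry table of average
# residuals indexed by the four leading fraction bits of the input.

BITS = 4
seed = lambda x: 2.0 - x                       # first-order guess at 1/x

TABLE = []
for i in range(1 << BITS):                     # average residual per interval
    lo = 1 + i / (1 << BITS)
    xs = [lo + k / (100 << BITS) for k in range(100)]
    TABLE.append(sum(1 / x - seed(x) for x in xs) / len(xs))

def recip(x):                                  # approximation + compensation
    return seed(x) + TABLE[int((x - 1.0) * (1 << BITS))]

worst = max(abs(recip(1 + k / 1000) - 1 / (1 + k / 1000)) for k in range(1000))
print("worst-case error after compensation:", round(worst, 4))
```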
CSL-TR-00-790

Report Number: CSL-TR-00-791
Institution: Stanford University, Computer Systems Laboratory
Title: Precision of Semi-Exact Redundant Continued Fraction Arithmetic for VLSI
Author: Mencer, Oskar
Author: Morf, Martin
Author: Flynn, Michael J.
Date: February 2000
Abstract: Continued fractions (CFs) enable straightforward representation of elementary functions and rational approximations. We improve the positional algebraic algorithm, which computes homographic functions. The improved algorithm for the linear fractional transformation produces exact results, given regular continued fraction input. In case the input is in redundant continued fraction form, our improved linear algorithm increases the percentage of exact results with 12-bit state registers from 78% to 98%. The maximal error of non-exact results is improved. Indeed, by detecting a small number of cases, we can add a final correction step to improve the guaranteed accuracy of non-exact results. We refer to the fact that a few results may not be exact as "Semi-Exact" arithmetic. We detail the adjustments to the positional algebraic algorithm concerning register overflow, the virtual singularities that occur during the computation, and the errors due to non-regular, redundant CF inputs.
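For reference, the homographic (linear fractional) map is y = (ax + b)/(cx + d), and consuming one regular CF term t of x corresponds to the substitution x -> t + 1/x'. A minimal evaluator over a finite continued fraction (ours, using exact rationals where the paper uses fixed-width state registers):

```python
from fractions import Fraction

def homographic(cf_terms, a, b, c, d):
    """Evaluate (a*x + b) / (c*x + d) for x given as a finite regular
    continued fraction [t0; t1, t2, ...]. Absorbing a term t applies
    x -> t + 1/x', i.e. (a, b, c, d) -> (a*t + b, a, c*t + d, c)."""
    for t in cf_terms:
        a, b, c, d = a * t + b, a, c * t + d, c
    return Fraction(a, c)      # after the last term x' is infinite: a/c

# x = [1; 2, 2, 2] = 17/12; check (x + 1) / (2x):
x = Fraction(17, 12)
assert homographic([1, 2, 2, 2], 1, 1, 2, 0) == (x + 1) / (2 * x)
```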
CSL-TR-00-791

Report Number: CSL-TR-00-792
Institution: Stanford University, Computer Systems Laboratory
Title: Performance of Data-Intensive Algorithms on FPGAs in an Object-oriented Programming Environment
Author: Mencer, Oskar
Author: Morf, Martin
Author: Flynn, Michael J.
Date: February 2000
Abstract: Recently, we have seen academic and industrial efforts to combine traditional computing environments with reconfigurable logic. Each application, or part of an application, has an optimal implementation within the design space of microprocessors, reconfigurable logic, and hardwired VLSI circuits. Programmability, Performance, and Power are the major metrics that have to be taken into account when deciding between the available technologies. Performance advantages of FPGAs over processors for specific applications have been shown in previous research. We show the potential of current low-power FPGAs to outperform current state-of-the-art processors in Performance over Power by more than half an order of magnitude. Programmability remains a tough issue. As a starting point, we define a hardware object interface in C++, PAM-Blox. PAM-Blox is an open, object-oriented environment for programming FPGAs that encourages design sharing and code reuse. PAM-Blox simplifies the creation of optimized high-performance designs. Encouraging a distributed effort to share hardware objects over the internet in the spirit of open software is a first step towards improving the programmability of FPGAs.
CSL-TR-00-792

Report Number: CSL-TR-00-793
Institution: Stanford University, Computer Systems Laboratory
Title: Allocation and Interface Synthesis Algorithms for Component-based Design
Author: Smith, James
Date: February 2000
Abstract: Since 1965, the size of transistors has been halved and their speed of operation has been doubled, every 18 to 24 months, a phenomenon known as Moore's Law. This has allowed rapid increases in the amount of circuitry that can be included on a single die. However, as the availability of hardware real estate escalates at an exponential rate, the complexity involved in creating circuitry that utilizes that real estate grows at an exponential, or higher, rate. Component-based design methodologies promise to reduce the complexity of this task and the time required to design integrated circuits by raising the level of abstraction at which circuitry is specified, synthesized, verified, or physically implemented. This thesis develops algorithms for synthesizing integrated circuits by mapping high-level specifications onto existing components. To perform this task, word-level polynomial representations are introduced as a mechanism for canonically and compactly representing the functionality of complex components. Polynomial representations can be applied to a broad range of circuits, including combinational, sequential, and datapath dominated circuits. They provide the basis for efficiently comparing the functionality of a circuit specification and a complex component. Once a set of existing components is determined to be an appropriate implementation of a specification, interfaces between these components must be designed. This thesis also presents an algorithm for automatically deriving an HDL model of an interface between two or more components given an HDL model of those components. The combination of polynomial representations and interface synthesis algorithms provides the basis for a component-based design methodology.
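Canonical word-level polynomials turn the matching step into a coefficient comparison. A toy version of that idea (ours; the thesis's representation is considerably richer): expand both the specification and the candidate component into a normal form mapping monomials to coefficients, then test equality.

```python
from collections import defaultdict
from itertools import product

def poly_mul(p, q):
    """Multiply polynomials in {monomial: coefficient} form, where a
    monomial is a sorted tuple of variable names."""
    out = defaultdict(int)
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return {m: c for m, c in out.items() if c}

def poly_add(p, q):
    out = defaultdict(int, p)
    for m, c in q.items():
        out[m] += c
    return {m: c for m, c in out.items() if c}

var = lambda name: {(name,): 1}

# Specification y = (a + b)^2 versus a component computing a*a + 2*a*b + b*b:
ab = poly_add(var("a"), var("b"))
spec = poly_mul(ab, ab)
comp = poly_add(poly_mul(var("a"), var("a")),
                poly_add({("a", "b"): 2}, poly_mul(var("b"), var("b"))))
assert spec == comp    # canonical forms agree: the component matches the spec
```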
CSL-TR-00-793

Report Number: CSL-TR-00-804
Institution: Stanford University, Computer Systems Laboratory
Title: IdentiScape: Tackling the Personal Online Identity Crisis
Author: Maniatis, Petros
Author: Baker, Mary
Date: June 2000
Abstract: Traditional systems refer to a mobile person using the name or address of that person's communication device. As personal communications become more diverse and popular, this solution is no longer adequate, since mobile people frequently move between different devices and use different communications applications. This lack of identifiers for mobile people causes problems ranging from the inconvenient to the downright dangerous: to locate a person, callers must use potentially multiple email addresses, cell phone numbers, land line phone numbers or instant messaging IDs; callers leave sensitive messages on shared voicemail boxes; and they send communications intended for the previous owner of a telephone number to the next owner. To solve this naming problem, we should be able to name people as the ultimate endpoints of personal communications, regardless of the applications or devices they use. In this paper, we develop a naming scheme for mobile people: we derive its requirements and describe its design and implementation in the context of personal communications. IdentiScape, our prototype personal naming scheme, includes a name service which provides globally available identifiers that persist over time and an online identity repository service which can be locally owned and managed.
CSL-TR-00-804

Report Number: CSL-TR-00-807
Institution: Stanford University, Computer Systems Laboratory
Title: UIF Explorer: An Interactive and Interprocedural Parallelizer
Author: Liao, Shih-Wei
Date: August 2000
Abstract: Shared-memory multiprocessors that use the latest microprocessors are becoming widely used both as compute servers and as desktop computers. But the difficulty of developing parallel software is a major obstacle to the effective use of multiprocessors to solve a single task. To increase the productivity of multiprocessor programmers, we developed an interactive interprocedural parallelizer called SUIF Explorer. Our experience with SUIF Explorer also helps to identify missing interprocedural analyses that can significantly improve an automatic parallelizer. As a parallel programming tool, the Explorer actively guides the programmers in the parallelization process using a set of advanced static and dynamic analyses and visualization techniques. Our interprocedural program analyses provide high-quality information that reduces the need for user assistance. The Explorer is also the first tool to apply slicing analysis to aid the programmer in uncovering program properties for interactive parallelization. These static and dynamic analyses minimize the number of lines of code requiring programmer assistance to produce parallel codes for real-world applications. As a tool for finding missing compiler techniques, SUIF Explorer helps compiler researchers design the next-generation parallelizer. Our experience with the Explorer shows that interprocedural array liveness analysis is an enabler of several important optimizations, such as privatization and array contraction. We developed and evaluated an efficient context-sensitive and flow-sensitive interprocedural array liveness algorithm and integrated it into the parallelizer. We use the liveness information to enable contraction of arrays that are not live at loop exits, which results in a smaller memory footprint and better cache utilization. The resulting codes run faster on both uni- and multi-processors. Another key interprocedural analysis which we developed and evaluated is the array reduction analysis. Our reduction algorithm extends beyond previous approaches in its ability to locate reductions to array regions, even in the presence of arbitrarily complex data dependences. To exploit multiprocessors effectively, the algorithm can locate interprocedural reductions, reduction operations that span multiple procedures. In summary, we successfully apply the Explorer to help the user develop parallel codes effectively and to help the compiler researcher develop the next-generation parallelizer.
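Array contraction, the optimization the liveness analysis enables, is easy to show in miniature: if an array carries no values past the loop, each element collapses to a scalar. A before/after sketch of ours, illustrating the transformation rather than SUIF's IR:

```python
# Before: the temporary array never outlives the loop, so it is dead at exit.
def before(xs):
    tmp = [0.0] * len(xs)
    total = 0.0
    for i in range(len(xs)):
        tmp[i] = xs[i] * xs[i]   # tmp[i] used only within this iteration
        total += tmp[i]
    return total                 # tmp is not live here

# After contraction: the dead array becomes one scalar, shrinking the memory
# footprint and improving cache behavior, exactly as the abstract describes.
def after(xs):
    total = 0.0
    for x in xs:
        t = x * x
        total += t
    return total

assert before([1.0, 2.0, 3.0]) == after([1.0, 2.0, 3.0])
```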
CSL-TR-00-807