Archer:SESC-Cacti EEL4713C Spring 2008

Cache simulation assignment

Author: Renato Figueiredo
Level: Undergraduate
Effort: Several hours; appropriate as lab part of a 2-3 week assignment on cache hierarchies

Introduction

Today’s computers (including memory hierarchies) are designed with the aid of extensive simulations that provide quantitative data to justify design choices. In this exercise, you will use a cache hierarchy simulator to investigate tradeoffs in the design of a level-1 cache.

In this problem, you will use the SESC and Cacti tools, both of which run on the Grid appliance. SESC is a microprocessor simulator that also simulates the behavior of two-level caches, while Cacti is a program that estimates the access time of a cache from parameters such as size and associativity.

In this assignment you will consider a base configuration with two levels of caches: the L1 cache is split into instruction+data caches; the base configuration has both I-cache and D-cache configured with the same parameters: 8Kbytes, 32-byte blocks, direct-mapped. Use default parameters for the L2 cache.

Note: each SESC simulation will take 10 or more minutes, so start working on this assignment early! The Grid appliance allows you to run multiple simulations concurrently on remote computers, so you can have more than one simulation running at the same time.

Installation

This assignment uses the Grid appliance and the SESC and Cacti simulators. Here are the steps to install them:

Installing the Grid Appliance

Install the Archer virtual machine Grid appliance on your computer. This creates a Linux-based virtual environment that will be used for various kinds of simulations.

Browse to the Grid appliance portal and follow the “quick start guide” instructions to install a virtual machine appliance on your laptop and go through its tutorial.

Installing SESC

Download the sesc_demo.tgz file; drag it to the appliance network folder on your host as described in the quick start guide, or download it directly to the appliance with the commands:

cd ~
sudo wget http://www.acis.ufl.edu/~ipop/files/apps/sesc_demo.tgz
sudo chown griduser sesc_demo.tgz

Within the appliance, expand the archive using the command:

tar -xzf sesc_demo.tgz
cd sesc_demo

Installing Cacti 3.2

Download the cacti.tgz file; drag it to the appliance network folder on your host as described in the quick start guide, or download it directly to the appliance with the commands:

cd ~
sudo wget http://www.acis.ufl.edu/~ipop/files/apps/cacti.tgz
sudo chown griduser cacti.tgz

Within the appliance, expand the archive using the command:

tar -xzf cacti.tgz

Warming up: Cacti

Cacti is a tool that helps designers estimate cache access time and power/energy consumption, among other parameters, from the design points given as inputs. It is a fast tool that uses analytical models to estimate its outputs. As an example, let’s use Cacti to determine the access time of the L1 cache configured as above, with a 0.09µm (90nm) technology. The command for a Cacti simulation is:

cd cacti
./cacti C B A TECH Nsubbanks

Where C is the cache size in bytes, B is the block size in bytes, A is the associativity (1=direct-mapped), TECH is the transistor technology in µm, and Nsubbanks is the number of sub-banks. (A .pdf document with the full description of the CACTI simulator version 3.2 is included.)

For example, the command:

./cacti 8192 32 1 0.09 1 > cacti.out

Writes the output of cacti to cacti.out, and the command:

grep Access cacti.out

Filters the output cacti.out for lines containing the string “Access”.

Before moving to the next step, double-check that the access time you obtain is approximately 479ps. Assume a processor clock rate of 3.252GHz throughout this lab. At that rate one clock cycle lasts about 307.5ps, so the cache above has a 2-cycle hit latency.
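The conversion from a Cacti access time to a hit latency in cycles can be sketched in Python as follows. This is only an illustrative helper, not part of SESC or Cacti; the function name `hit_latency_cycles` is made up for this example, and it applies the rounding-up rule used in this lab (access time divided by clock cycle time, rounded up to an integer).

```python
import math

def hit_latency_cycles(access_time_ps: float, clock_ghz: float) -> int:
    """Hit latency in whole clock cycles: access time / cycle time, rounded up."""
    cycle_time_ps = 1000.0 / clock_ghz  # a 1 GHz clock has a 1000 ps cycle
    return math.ceil(access_time_ps / cycle_time_ps)

# Base L1 configuration from the text: about 479 ps at 3.252 GHz.
# 479 / 307.5 is roughly 1.56, which rounds up to 2 cycles.
print(hit_latency_cycles(479, 3.252))
```

The same helper can be reused for every cache configuration you evaluate later: feed it the access time reported by Cacti and enter the result as the hit delay in the SESC configuration file.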

Warming up: SESC

SESC is an open-source execution-driven simulator of superscalar and multi-core processors. It takes as inputs a MIPS executable and a configuration file that determines the computer architecture parameters to be used in the simulation (e.g. cache size, block size, number of processors). It simulates the execution of the MIPS executable instruction by instruction and, at the end of the simulation, provides a summary of various simulated parameters (e.g. cache miss rate). Because it simulates every instruction in great detail, SESC simulations can take a very long time to finish; for many benchmarks, simulation times of hours or days are not uncommon. For this assignment, the simulations have been chosen to take a few minutes.

To run your first SESC simulation:

cd ~
cd sesc_demo
./sesc.smp -itest.in -csesc32.conf crafty.mipseb

This command runs a simulation of the application “crafty.mipseb” (a chess simulator benchmark from SPEC CPU 2000, http://www.spec.org/cpu/CINT2000/186.crafty/docs/186.crafty.html), with input file test.in, and SESC configuration parameters stored in sesc32.conf. As the simulation progresses, you will see outputs from the crafty application coming out line by line in the X terminal. At the end of the simulation, a file named:

sesc_crafty.mipseb.XYZABC

will be created (XYZABC is replaced by a random string). This output file summarizes all the inputs that were configured in the sesc32.conf file (see below) as well as shows summary results from the simulation. For example, if you type:

grep DL1:readMiss sesc_crafty*
grep IL1:readMiss sesc_crafty*
grep DL1:readHit sesc_crafty*
grep IL1:readHit sesc_crafty*

You will see the total number of cache hits and misses for the I- and D-caches, for processors P(0) and P(1). You only need to care about processor P(0) in this assignment. And:

grep clockTicks sesc_crafty*

Shows the total number of simulated clock cycles. In this example, the simulation covers about 113E6 clock cycles.
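The hit and miss counters reported above combine into a miss rate as misses/(hits+misses). A minimal Python sketch of that calculation follows; the function name and the counter values are hypothetical and chosen only for illustration, so substitute the P(0) numbers from your own SESC output file.

```python
def miss_rate(read_hits: int, read_misses: int) -> float:
    """Miss rate as misses / (hits + misses)."""
    total = read_hits + read_misses
    if total == 0:
        raise ValueError("no cache accesses recorded")
    return read_misses / total

# Hypothetical counter values, NOT real SESC output.
example_hits = 950_000
example_misses = 50_000
print(f"miss rate = {miss_rate(example_hits, example_misses):.2%}")
```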

Note: the main file of interest to you in this assignment is sesc32.conf. Read through it to understand how this simulation has been configured, for example:

cacheLineSize: the block size (32 Bytes in this file)
frequency: processor frequency (5GHz in this file)
[IMemory]: parameters for L1 instruction cache (32Kbytes, 2-way associative, Write-through (WT), LRU replacement, 2 ports, 2-cycle hit delay)
[DMemory]: parameters for L1 data cache

Condor simulation

The SESC configuration file is the only file you will need to configure in this assignment. You can create multiple configuration files with different names for all the simulations you will perform. You can then “batch” their execution through Condor in the Grid appliance. Check out the submit_sesc.condor file for an example of how to submit three jobs in parallel to Condor. The only difference among these sample jobs is the cache line size (32 Bytes in sesc32.conf, 64 and 128 in sesc64.conf and sesc128.conf).

Feel free to use Condor or to run simulations on your own machine for this assignment. For Condor to work, you need to be connected to the Internet; check that condor_status is responding correctly.

Assignment

Block size tradeoffs

In this problem you will simulate caches with different block sizes to investigate tradeoffs between this cache parameter and performance.

a) Use SESC to simulate the system with the base cache configuration described above, and seven other configurations with larger block sizes in the L1 instruction cache. Start with the file sesc32.conf; if needed, adjust the block size of the L2 cache in your simulations to be equal to or larger than the L1 block size. Make sure that for each cache configuration you select you simulate its access time using Cacti and take that into consideration when entering a value for the cache hit latency in SESC. Assume the hit latency is the ratio of cache access time to clock cycle, rounded up to an integer value.

  • Plot two graphs: L1 I-cache miss rate (compute it by taking the ratio of misses/(hits+misses) from the SESC simulation output) versus block size, and total execution time (clockTicks) versus block size. Discuss your results.

b) Repeat a), but now changing the block size of the L1 data cache.

  • Plot the same two graphs. Discuss your results, comparing to the results obtained in a).
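Each configuration in a sweep needs its own Cacti run before it is simulated in SESC. The Python sketch below only prints the command lines, following the `./cacti C B A TECH Nsubbanks` form shown earlier; the function name and the output file naming scheme are assumptions made for this example, and the three block sizes are simply the ones used by the sample sesc32.conf, sesc64.conf and sesc128.conf files.

```python
def cacti_command(cache_bytes: int, block_bytes: int, assoc: int,
                  tech_um: float = 0.09, subbanks: int = 1) -> str:
    """Build a Cacti command line for one cache configuration."""
    # Hypothetical naming scheme for the output file, one file per configuration.
    out_file = f"cacti_{cache_bytes}_{block_bytes}_{assoc}.out"
    return f"./cacti {cache_bytes} {block_bytes} {assoc} {tech_um} {subbanks} > {out_file}"

# 8 KB direct-mapped cache with the block sizes of the sample configuration files.
for block in (32, 64, 128):
    print(cacti_command(8192, block, 1))
```

The printed lines can be pasted into the appliance terminal from the cacti directory, and `grep Access` can then be run on each output file as before.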

Associativity tradeoffs

In this problem you will simulate caches with different associativities to investigate tradeoffs between this cache parameter and performance.

a) Use SESC to simulate the system with the base cache configuration described above, and five other configurations with increased associativity in the L1 data cache. Make sure that for each cache configuration you select you simulate its access time using Cacti and take that into consideration when entering a value for the cache hit latency in SESC.

  • Plot two graphs: L1 D-cache miss rate versus associativity, and total execution time (clock cycles) versus cache associativity. Discuss your results.

Cache size tradeoffs

In this problem you will simulate caches with different sizes to investigate tradeoffs between this cache parameter and performance.

a) Use SESC to simulate the system with the base cache configuration described above, and five other configurations with increased size in the L1 instruction cache. Make sure that for each cache configuration you select you simulate its access time using Cacti and take that into consideration when entering a value for the cache hit latency in SESC.

  • Plot two graphs: L1 I-cache miss rate versus cache size, and total execution time (clock cycles) versus cache size. Discuss your results.

Overall cache performance experiment

Given your results from the previous three experiments, select three candidate cache configurations that you believe will show improved performance over the base configuration when you combine improvements in size, associativity and block size. Briefly explain the reasoning behind your selections. Obtain access times for these configurations using Cacti and simulate them using SESC.

  • Summarize the relevant results from your simulations. Which cache configuration, among those you selected, yields the best performance (i.e. lowest execution time)?

Impact of pipeline organization

Starting from the best configuration found in the previous experiment, let us observe trade-offs in pipeline configurations. Vary the maximum number of instructions issued per cycle (check for the parameter “issue”) to 2 and 8 (the default is 4). For each configuration, run the processor both as out-of-order issue (the default configuration) and in-order issue (set fetchPolicy = “inorder” and inorder = true in the configuration file). Assume the clock cycle remains the same for all configurations.

  • What are the CPIs for each configuration? (check the nGradInsts value in the SESC output for the number of graduated instructions).
  • Using the 2-issue, in-order processor as baseline, compute the speedups of the five other configurations. Does the speedup due to increased instructions-per-cycle grow linearly?
  • How much performance improvement is achieved in the best case due to out-of-order vs. in-order issue?
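The CPI and speedup calculations asked for above can be sketched in Python as follows. The function names and all counter values are hypothetical and only illustrate the arithmetic: CPI is taken as clockTicks divided by nGradInsts, and because the clock cycle is assumed to be the same for all configurations, speedup reduces to a ratio of cycle counts.

```python
def cpi(clock_ticks: int, graduated_insts: int) -> float:
    """Cycles per instruction: simulated cycles / graduated instructions."""
    return clock_ticks / graduated_insts

def speedup(baseline_ticks: int, ticks: int) -> float:
    """Speedup over the baseline as a cycle-count ratio (same clock cycle assumed)."""
    return baseline_ticks / ticks

# Hypothetical (clockTicks, nGradInsts) pairs, NOT real SESC results.
runs = {
    ("in-order", 2): (200_000_000, 100_000_000),      # baseline
    ("out-of-order", 4): (80_000_000, 100_000_000),
}
baseline_ticks = runs[("in-order", 2)][0]
for (policy, width), (ticks, insts) in runs.items():
    print(f"{policy}, issue {width}: "
          f"CPI = {cpi(ticks, insts):.2f}, "
          f"speedup = {speedup(baseline_ticks, ticks):.2f}")
```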