科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機 **題號: 434001** 共4頁第1頁 NOTE: If some questions are unclear or not well defined to you, you can make your own assumptions and state them clearly in the answer sheet. - 1. (18% total) You are the lead designer of a new processor. The processor design and compiler are complete, and now you must decide whether to produce the current design as it stands or spend additional time to improve it. Please consider and answer the following problems. - 1.1 (10%) You discuss this problem with your hardware engineering team and arrive at the following options: - (a) Leave the design as it stands. Call this base machine Mbase. It has a clock rate of 500 MHz. - (b) Optimize the hardware. The hardware team claims that it can improve the processor design to give it a clock rate of 600 MHz. Call this machine *Mopt*. The following measurements have been made using a simulator for Mbase and Mopt. | Instruction class | Enggranav | CPI | | | | |-------------------|-----------|-------|------|--|--| | Instruction class | Frequency | Mbase | Mopt | | | | A | 40% | 2 | 2 | | | | В | 25% | 3 | 2 | | | | C | 25% | 3 | 3 | | | | D | 10% | 5 | 4 | | | What are the CPI and MIPS (million instructions per second) rate for *Mbase* and *Mopt*? Copy Table 1-1 to your answer sheet and fill in the blanks with your answers. 1.2 (8%) The compiler team proposes to improve the compiler for the machine to further enhance performance. Call this combination of the improved compiler and the base machine *Mcomp*. The instruction improvements from this enhanced compiler have been estimated as follows. | Instruction class | Percentage of instruction executed vs. base machine | |-------------------|-----------------------------------------------------| | A | 90% | | В | 90% | | С | 85% | | D | 95% | For example, if the base machine executed 500 class A instructions, Mcomp would execute $0.9 \times 500 = 450$ class A instructions for the same program. What is the CPI for Mcomp? How much faster is Mcomp than Mbase? Copy Table 1-2 to your answer sheet and fill in the blanks with your answers. ### Table 1-1: | Machine | CPI | MIPS rate | |---------|-----|-----------| | Mbase | | | | Mopt | | | #### Table 1-2: | Machine | CPI | Execution time Mbase / Execution time Mcomp | | | | | |---------|-----|---------------------------------------------|--|--|--|--| | Mcomp | | | | | | | 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機 **題號: 434001** 共4頁第2頁 2. (10%) Here is a loop in C: Loop: g = g + A[i]; i = i + j; if (i != h) goto Loop; Assume that A is an array of 100 elements and the variables g, h, i, and j are in registers \$s1, \$s2, \$s3, and \$s4, respectively. In addition, assume that the base of the array A is in \$s5. Other registers that can be used are \$t0 and \$t1. A possible MISP assembly code corresponding to this C loop is showed as follows: Loop: add \$t1, \$s3, OPA add \$t1, \$t1, \$t1 add \$t1, \$t1, OPB lw \$t0, 0(OPC) add OPD, \$s1, \$t0 add \$s3, \$s3, \$s4 bne \$s3, OPE, Loop Please determine proper values for the operands (OPA, OPB, OPC, OPD, OPE). Copy the Table 2 to your answer sheet and fill in the operand values. #### Table 2: | Operand | OPA | OPB | OPC | OPD | OPE | |---------|-----|-----|-----|-----|-----| | Value | | | | | | 3. (10%) Figure 1 shows a 16-bit adder consisting of four 4-bit ALUs using carry lookahead. Assume that $gi = ai \cdot bi$ (generate) and pi = ai + bi (propagate). Determine the value of P0, P1, P2, P3, G0, G1, G2, G3, and C4 (CarryOut) when adding two 16-bit numbers 0001101000110011 and 1110010111101011 with Carry\_In bit equal to 0. Please copy Table 3 to your answer sheet and fill in the blanks with your answers. ### Table 3: | Signal | P0 | P1 | P2 | Р3 | G0 | G1 | G2 | G3 | C4 | |--------|----|----|----|----|----|----|----|----|----| | Value | | | | | | | | | | - 4. (15% total) Compare the single-cycle implementation, in which all instructions take 1 clock cycle, with the five-stage (IF, ID, EX, MEM, WB) pipelined implementation using the following eight instructions: load word (lw), store word (sw), subtract (sub), and (and), or (or), set-less-than (slt) and branch-on-equal (beq). The operation times for the major functional units are 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write. - 4.1 (5%) Please complete the Table 4-1 by filling in the time for each component to calculate the total time of each instruction executed in the single-cycle implementation. Assume that the multiplexors, control unit, PC accesses, and sign extension unit have no delay. Please copy Table 4-1 to your answer sheet and fill in the blanks with your answers. - 4.2 (5%) What is the clock cycle time for the single-cycle implementation and the pipelined implementation? Please copy Table 4-2 to your answer sheet and fill in the blanks with your answers. 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機 題號:434001 共4頁第3頁 4.3 (5%) Consider a program consisting of 100 lw instructions and in which each instruction is dependent upon the instruction before it. What would the actual CPI be if the program were run on the single-cycle implementation and the pipelined implementation with a forwarding unit and a hazard detection unit? Please copy Table 4-2 to your answer sheet and fill in the blanks with your answers. Table 4-1: | 1 abit 4-1. | | | | | | | |-----------------------------------|-------------------|------------------|---------------|-------------|-------------------|---------------| | Instruction class | Instruction fetch | Register<br>read | ALU operation | Data access | Register<br>write | Total<br>time | | Load word (lw) | | | | | | | | Store word (sw) | | | | | | | | R-format (add, sub, and, or, slt) | 2 ns | 1 ns | 2 ns | | 1 ns | 6 ns | | Branch (beq) | | | | | | | Table 4-2: | Design | Clock cycle time | Actual CPI | |-----------------------------|------------------|------------| | Single-cycle implementation | | | | Pipelined implementation | | | 5. (12%) Increasing associativity requires more comparators, as well as more tag bits per cache block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit address, find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative. Please copy Table 5 to your answer sheet and fill in your answers. Table 5: | Cache | The total number of sets | The total number of tag bits | |--------------------------|--------------------------|------------------------------| | Direct mapped | | | | Two-way set associative | | | | Four-way set associative | | | | Fully associative | | | 6. (15%) Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assuming a direct-mapped cache with four-word blocks and a total size of 16 words that is initially empty, label each reference in the list as a hit or a miss and show the final contents of the cache. Please copy Table 6-1 and Table 6-2 to your answer sheet and fill in your answers. Table 6-1: | 14010 0 1. | | | | | | | | | | | | | | | | | |-------------|------|------|---|---|----|----|----|-----|---|----|---|----|---|---|---|-----| | Reference | 1 | 4 | 8 | 5 | 20 | 17 | 19 | .56 | 9 | 11 | 4 | 43 | 5 | 6 | 9 | 17_ | | Hit or miss | Miss | Miss | | | | | | | | | | | | | | | Table 6-2: | Block# | 0 | 1 | 2 | 3 | |------------------|---|---|---|---| | Starting address | | | | | 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機 **題號:434001** 共4頁第4頁 - 7. (15% total) Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 1 GHz. Assume a main memory access time of 100 ns, including all the miss handling. - 7.1 (7%) Suppose the miss rate per instruction at the primary cache is 5%. What is the effective CPI for the machine with one level of caching? - 7.2 (8%) If we add a secondary cache that has a 10-ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 2%. What is the effective CPI for the machine with the two-level cache? - 8. (5%) Suppose you want to achieve a speedup of 50 with 100 processors. What fraction of the original computation can be sequential?