交通大學電子研究所(乙A組、乙B組) 交通大學電控工程研究所(乙組) 清華大學電機工程學系(丙組) 清華大學通訊工程研究所 ## 一. [10%] - (1) [2%] Some multicore designers claim their chips can attain a linear speedup with processor numbers. Please describe the possible problem set that fit such claim. - (2) [2%] True or false, RAID systems rely on redundancy to achieve high reliability. - (3) [2%] True or false, in the simultaneous multithreading case, both thread-level parallelism and data parallelism are exploited for parallel execution. - (4) [4%] Given the following instruction sequence of three threads, (the vertical axis is time, the horizontal axis is issue slot) how many clock cycles will fine-grained multithreading, and simultaneous multithreading use respectively? Annotation [] denotes the issue slot. Assume maximum four issue slots are allowed per cycle. | Thread 1: | Thread 2: | Thread 3: | |-----------|-----------|-----------| | | | | | | | | | | | (stall) | | | (stall) | (stall) | | (stall) | | | | | | | | | (stall) | | | | | | | | (stall) | | | | | | ## 二. [10%] - (1) [5%] Please describe the Amdahl's law, and use the Amdahl's law to explain why the multicore processor cannot get the similar improvement factor as the core numbers. - (2) [5%] Suppose a program segment consists of a purely sequential part which takes 20 cycles to execute, and an iterated loop which takes 200 cycles per iteration. Assume the loop iterations are independent, and cannot be further parallelized. If the loop is to be executed 200 times, what is the maximum speedup possible using an infinite number of processors (compared to a single processor)? - 三. [9%] Assume variable key is associated with register \$s1, variable h is associated with register \$s2 and the base address of the array A is in \$s3. Therefore, the following C program segment can be converted to the MIPS assembly code as shown right. (1) [6%] Please fill in the blanks (a), (b), (c) to complete this assembly code. add \$s2,\$zero,\$zero L1: slti \$t0,\$s2,10 (a) \$t0,\$zero,Exit (b) \$t0,\$s2,2 add \$t0,\$t0,\$s3 lw \$t1,(c) beq \$t1,\$s1,Exit addi \$s2,\$s2,1 j L1 Exit: (2) [3%] Assume this code is put at the memory address starting from 0x00000600, what is the machine code of the instruction beq \$t1,\$s1,Exit in hexadecimal form? (beq → opcode=4,\$t1 → 9,\$s1 → 17) 主背面有試題 交通大學電子研究所(乙A組、乙B組) 交通大學電控工程研究所(乙組) 清華大學電機工程學系 (丙組) 清華大學通訊工程研究所 四. [11%] Assume we are designing a 16-bit MIPS CPU with 16-bit instruction words. Please adjust the instruction formats according to the following specifications. (1) [3%] Assume this CPU has 6 arithmetic/logic instructions and 10 other instructions. In addition, assume there are 16 registers in this CPU; each is 16-bit long. If the 3-operand format (R-type format) is still adopted, what are the numbers of bits of the fields (a), (b), (c) in this 16-bit instruction format? Please explain your reasons. | ор | rs | rt | rd | shamt | funct | |-----|-----|----|----|-------|-------| | (a) | (b) | | | | (c) | - (2) [3%] In the original 32-bit MIPS CPU, there is an instruction addi that can add a 16-bit immediate number with the content of a register. Using the specifications in (1), how to implement addi instruction in this 16-bit MIPS CPU if the bit width of immediate numbers should be kept as 16? Please propose one possible solution with brief explanations. - (3) [3%] Assume the IEEE-754 floating-point representation is also adjusted to 16-bit long for this 16-bit MIPS CPU. If the floating-point numbers are required to represent the values within $\pm 10^{18}$ , what are the numbers of bits of the fields (d), (e), (f) in this 16-bit floating-point format? Please explain your reasons. (assume $\log_{10} 2 = 0.3$ ) | S | ign | exponent | mantissa | |---|-----|----------|----------| | | (d) | (e) | (f) | - (4) [2%] In the IEEE-754 floating-point representation, the exponent field employs the excess-N coding. What should be N in your 16-bit floating-point format if it follows the similar mechanism of IEEE 32-bit standard? - 五. [6%] Pipelining affects the clock cycle time of the processor. Following the three assumptions listed below, let's compare clock cycle times and execution times with single-cycle, multi-cycle, and pipelined organization of this processor. (A multi-cycle organization means each instruction takes multiple cycles but one instruction finishes before another is fetched. In this organization, an instruction only goes through stages it actually needs.) - (i) individual stages (IF, ID, EX, MEM, WB) of the datapath of a simplified implementation (covering only lw, sw, beq and ALU instructions) have the following latencies: | IF | ID | EX | MEM | WB | |-------|-------|-------|-------|-------| | 200ps | 150ps | 120ps | 190ps | 140ps | (ii) the instructions executed by the processor are broken down as follows: | ALU | beq | lw | sw | |-----|-----|-----|-----| | 30% | 25% | 30% | 15% | - (iii) there are no stalls or hazards. - (1) [2%] What is the clock cycle time in a pipelined and non-pipelined single-cycle processor? - (2) [4%] If the execution time of the pipelined organization is denoted as T, what are the execution times of the single-cycle and multi-cycle organizations in terms of T? 交通大學電子研究所(乙A組、乙B組) 交通大學電控工程研究所 (乙組) 清華大學電機工程學系(丙組) 清華大學通訊工程研究所 六. [14%] Assume we want to execute the following addi instruction in the single-cycle datapath: addi \$19, \$29, 16 The single-cycle datapath diagram below shows the execution of this instruction. (1) [4%] Please set the control signals in the following table so that addican be executed properly. Use 'x' for 'don't care' if necessary. | RegDst | ALUSrc | MemToReg | PCSrc | |--------|--------|----------|-------| | | | | | (2) [10%] Several of the datapath values are already shown in the diagram. You are to provide values for the ten remaining signals, which are labeled with a question mark (?) followed by a character (from a to j). Please write their decimal values in the following table. Assume register \$29 and PC initially contain the value of 145 and 32, respectively. If a value cannot be determined, mark it as 'X'. | a | b | С | d | e | f | g | lì | i | j | |---|---|---|---|---|---|---|----|---|---| | | | | | | | | | · | | ±. [20%] Given a multiple-issue processor with 10 pipeline stages and issue width of 4. Assume that the processor always execute the maximum number of instructions per cycle if there is no control hazards. For control dependencies, the processor uses branch prediction and continues fetching from the predicted path. If the branch has been mispredicted, the instructions fetched after the mispredicted branch are discarded when the branch outcome is resolved in stage 8. In the next cycle, the processor starts fetching from the correct path. Assume 20% of all executed instructions are branches with 90% prediction accuracy. 交通大學電子研究所(乙A組、乙B組) 交通大學電控工程研究所(乙組) 清華大學電機工程學系(丙組) 清華大學通訊工程研究所 - (1) [3%] How many register read ports should the processor have to avoid any resource hazards due to the register read? - (2) [3%] If there are no branch mispredictions and no data dependences, what is the expected performance improvement over a 1-issue processor with the classical 5-stage pipeline? Assume that the clock cycle time decreases in proportional to the number of pipeline stages. - (3) [4%] With branch mispredictions, how many instructions are expected to be executed between mispredictions? - (4) [3%] How many branch instructions are expected to be in progress (i.e. fetch but not yet committed) at any given time? - (5) [3%] What percentage of all cycles are entirely spent on fetching wrong-path instructions? - (6) [4%] If we want to limit the stalls due to mispredicted branches to zero (i.e. no stalls), what should be the required branch prediction accuracy? - N. [8%] The following table summarizes the memory hierarchy parameters and their effects on performance. Copy this table to your exam paper and fill in each missing cell with the most proper statement from below. | Design Change | Potential advantages | Possible Disadvantages | |------------------------|----------------------|------------------------| | Increase cache size | • | | | Increase block size | | | | Increase associativity | | | | More sophisticated | | | | replacement policy | | | - (A) Decrease capacity misses - (G) Increase capacity misses - (B) Decrease conflict misses - (H) Increase conflict misses - (C) Decrease compulsory misses Decrease decision time - (I) Increase compulsory misses - (D) Decrease access time - (J) Increase access time - (F) Decrease miss penalty - (L) Increase miss penalty (K) Increase decision time 九. [12%] Consider the virtual memory system of Processor XYZ described in the following table. | Characteristic | Processor XYZ | |------------------|---------------------------------------------------------| | Virtual address | 48 bits | | Physical address | 40 bits | | Page size | 4 KB | | TLB Organization | 1 L1 TLB for instructions and 1 L1 TLB for data | | | Both L1 TLBs are fully associative with 64 entries | | | 1 L2 TLB for instructions and 1 L2 TLB for data | | | Both L2 TLBs are 4-way set associative with 256 entries | 交通大學電子研究所(乙A組、乙B組) 交通大學電控工程研究所 (乙組) 清華大學電機工程學系(丙組) 清華大學通訊工程研究所 Characteristics of the first-level cache (L1) and the second-level cache (L2) in Processor XYZ are summarized in the following table. Note that the cache size only includes the actual storage of instructions/data. | Characteristic | Processor XYZ | |------------------------|-----------------------------------| | L1 cache organization | Split instruction and data caches | | L1 block size | 64 bytes | | L1 cache associativity | 2-way set associative | | L1 cache size | 16 KB each for instructions/data | | L2 cache organization | Unified (instruction and data) | | L2 block size | 64 bytes | | L2 cache associativity | 4-way set associative | | L2 cache size | 256 KB | - (1) [3%] What is the number of tag bits in the virtual address feed to L1 data TLB? - (2) [3%] What is the number of tag bits in the virtual address feed to L2 data TLB? - (3) [3%] What is the number of index bits in the physical address feed to L1 data cache? - (4) [3%] What is the number of tag bits in the physical address feed to L2 cache?