科目名稱:計算機結構【資工系碩士班甲組、乙組】 #### 一作答注意事項- 考試時間:100分鐘 - 考試開始鈴響前不得翻閱試題,並不得書寫、劃記、作答。請先檢查答案卷(卡)之應考證號碼、桌角號碼、應試科目是否正確,如有不同立即請監試人員處理。 - 答案卷限用藍、黑色筆(含鉛筆)書寫、繪圖或標示,可攜帶橡皮擦、無色透明無文字墊板、尺規、修正液(帶)、手錶(未附計算器者)。每人每節限使用一份答案卷,請衡酌作答。 - 答案卡請以2B鉛筆劃記,不可使用修正液(帶)塗改,未使用2B鉛筆、劃記太輕或污損致光學閱讀機無法辨識答案者,後果由考生自負。 - 答案卷(卡)應保持清潔完整,不得折疊、破壞或塗改應考證號碼及條碼,亦不得書寫考生姓名、應考證號碼或與答案無關之任何文字或符號。 - 可否使用計算機請依試題資訊內標註為準,如「可以」使用,廠牌、功能不拘,唯不得攜帶書籍、紙張(應考證不得做計算紙書寫)、具有通訊、記憶、傳輸或收發等功能之相關電子產品或其他有礙試場安寧、考試公平之各類器材入場。 - 試題及答案卷(卡)請務必繳回,未繳回者該科成績以零分計算。 - 試題採雙面列印,考生應注意試題頁數確實作答。 - 違規者依本校招生考試試場規則及違規處理辦法處理。 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機(問答申論題) **題號:434001** 共3頁第1頁 NOTE: If some questions are unclear or not well defined to you, you can make your own assumptions and state them clearly in the answer sheet. - 1. (8% total) When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: percentage of the application that can be parallelized and the cost of communication. If 80% of the application is parallelizable, answer the following questions. - 1.1 (4%) What is the speedup with 8 processors, ignoring the cost of communication? - 1.2 (4%) What is the speedup with 8 processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time? - 2. (8% total) Assume a hypothetical GPU with the following characteristics: - Clock rate 2.0 GHz - Contains 16 SIMD processors, each containing 16 single-precision floating-point units - Has 600 GB/sec off-chip memory bandwidth - 2.1 (4%) Without considering memory bandwidth, what is the peak single-precision floating-point throughput for this GPU in GFLOPS/sec, assuming that all memory latencies can be hidden? - 2.2 (4%) Is this throughput sustainable given the memory bandwidth limitation? Please explain your reasons clearly. - 3. (18% total) Assume that individual stages of the datapath have the following latencies, and instructions executed by the processor are broken down as the following percentages: | ĬF | ID | EX | MEM | WB | alu_ | |--------|--------|--------|--------|--------|------| | 250 ps | 150 ps | 350 ps | 300 ps | 200 ps | 45% | | alu | beq | lw sw | | |-----|-----|-------|-----| | 45% | 10% | 25% | 20% | - 3.1 (4%) What is the clock cycle time in a non-pipelined and pipelined processor, respectively? - 3.2 (4%) What is the individual total latency of the following two instructions: **Iw**, **beq** in a non-pipelined and pipelined processor? - 3.3 (6%) Assuming there are no stalls or hazards, what is the total execution time of the following four instructions: Iw, sw, add, beq in a non-pipelined and pipelined processor, respectively? - 3.4 (4%) Assuming there are no stalls or hazards, what is the utilization of the data memory and the write-register port of the "Register" unit? - 4. (15% total) The importance of having a good branch predictor depends on how often conditional branches are executed. Together with branch predictor accuracy, this will determine how much time is spent stalling due to mispredicted branches. Assume that the breakdown of dynamic instructions into various instruction categories is as follows: | R-Type | BEQ | JMP | LW | SW | |--------|-----|-----|-----|-----| | 40% | 20% | 5% | 25% | 10% | Also, assume the following branch predictor accuracies: | Always-Taken | Always-Not-Taken | 2-Bit | |--------------|------------------|-------| | 40% | 60% | 80% | 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機(問答申論題) **題號: 434001** 共 3 頁第 2 頁 4.1 (5%) Assume no stall cycle is required when branch prediction is correct. Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are determined in the EX stage of a basic 5-stage pipelined processor, that there are no data hazards, and that no delay slots are used. 4.2 (5%) Repeat 4.1 for the 2-bit predictor. - 4.3 (5%) With the always-taken predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced. - 5. (15% total) Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses. 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253 5.1 (9%) For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty. | Word Address | Binary Address | Tag | Index | Hit/Miss | |--------------|----------------|-----|-------|----------| | 3 | 0000 0011 | 0 | 3 | Miss | | 180 | | | | | | 43 | | | | | | 2 | | | | | | 191 | | | | | | ••• | | | | | | 181 | | | | | | 44 | | | | | | 186 | | | | | | 253 | | | | | - 5.2 (6%) How many total bits are required for a direct-mapped cache with 32 KiB of data, 8-word blocks, and 32-bit address? - 6. (16% total) Assume that the CPI with a perfect cache is 2.0, the clock cycle time is 1.0 ns, there are 1.5 memory references per instruction, the size of both caches is 64 KB, and both have a block size of 64 bytes. One cache is direct mapped and the other is two-way set associative. Since the speed of the CPU is tied directly to the speed of a cache hit, assume the CPU clock cycle time must be stretched 1.3 times to accommodate the selection multiplexor of the set-associative cache. To the first approximation, the cache miss penalty is 80 ns for either cache organization. Assume the hit time is 1 clock cycle, the miss rate of a direct-mapped 64 KB cache is 1.5%, and the miss rate for a two-way set-associative cache of the same size is 1.1%. - 6.1 (8%) Calculate the average memory access time for these two different cache organizations. - 6.2 (8%) Calculate the CPU performance in terms of CPU time. - 7. (20% total) Consider the virtual memory system. - 7.1 (8%) The following list provides parameters of a virtual memory system. | Virtual Address (bits) | Physical DRAM Installed | Page Size | PTE Size (byte) | |------------------------|-------------------------|-----------|-----------------| | 42 | 16 GiB | 4 KiB | 4 | 科目名稱:計算機結構【資工系碩士班甲組、乙組】 ※本科目依簡章規定「不可以」使用計算機(問答申論題) 題號: 434001 共3頁第3頁 For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table? 7.2 (12%) The following data constitutes a stream of virtual address as seen on a system. Assume 4 KiB pages, a 4-entry fully associative TLB, and true LRU replacement. If pages must be brought in from disk, increment the next largest page number. TLB | - | 122 | | | | | | |---|-----------|----|----------------------|--|--|--| | i | Valid Tag | | Physical Page Number | | | | | | 1 | 11 | 12 | | | | | | 1 | 7 | 4 | | | | | | 1 | 3 | 6 | | | | | | 0 | 4 | 9 | | | | Page table | Valid | Physical Page or in Disk | | | |-------|--------------------------|--|--| | 1 | 5 | | | | 0 | Disk | | | | 0 | Disk | | | | 1 | 6 | | | | 1 | 9 | | | | 1 | 11 | | | | 0 | Disk<br>4<br>Disk | | | | 1 | | | | | 0 | | | | | 0 | Disk | | | | 1 | 3 | | | | 1 | 12 | | | | | | | | Given the following address stream: 4669, 13916, ..., and the initial TLB and page table states provided above, show the final state of the system. Also list for each reference if it is a hit in the TLB, a hit in the page table, or a page fault. | Address | Virtual Page | TLB H/M | TLB | | | |---------|--------------|---------|-------|-----|----------------------| | Addless | | | Valid | Tag | Physical Page Number | | | | | | | | | 4669 | | | | | | | | | | | | | | | | | | | | | 4.504.6 | | | | | | | 13916 | | | | | | | | | | | | |