Best performance defined as smallest time * area
Init state
From 1 to 10. Improve stop at memory = 2.
For three parameter. Each sweep {1..3}, find the best combination. Found that 2, 1, 1 has the best performance. Now the configuration: 4 2 1 1 4 2 1 0 0 64 8
Decrease the number of Mul at the same time. Alu from 2 to 6. (When Alu = 1, crash) Mul from 0 to 3. Found that 4, 0 has the best performance. Now the configuration: 4 2 1 1 4 2 1 0 0 64 8 (What happens when Mul is 0, but mul is still needed? Explain.)
Sweep Alu from 4 to 10. And sweep issue width from 4 to 10. Performace did not change much, best still: 4 2 1 1 4 2 1 0 0 64 8
Sweep r0 among 32 64 128 160 Sweep b0 among 4 8 12 16 20 24 Performance is at: 4 2 1 1 4 2 1 0 0 32 4
The performance below means that Time * Area is the smallest
With init values: 4 1 1 1 4 2 1 0 0 64 8 Set.
From 1 to 10. Performanace stops increasing after exceeding 4. We keep Mul at 4. Configuration: 4 1 1 1 4 4 1 0 0 64 8
From 2 to 10. Performance stops increasing after Alu exceeds 6. We keep Alu at 6. Configue: 4 1 1 1 6 4 1 0 0 64 8
From 4 to 30. (Seems very parallelizable) Performance stops to increase after exceeding 6. Configue: 6 1 1 1 6 4 1 0 0 64 8
Each from 1 to 5. Performance does not change. Still: 6 1 1 1 6 4 1 0 0 64 8
r0: 32 64 96 128 160 192 224 256 288 310 b0: 4 8 12 16 20 24 28 32 36 40 After r0 exceeding 64 and b0 exceeding 4, performance does not improve. Configure: 6 1 1 1 6 4 1 0 0 64 4
From 1 to 10. Still: 6 1 1 1 6 4 1 0 0 64 4
Mul, Alu and IssueWidth might have coupled influence on the performance. Sweep three parameters: Mul. Alu and IssueWidth Mul from: 4 to 10 Alu from: 6 to 10 IssueWidth: 6 to 15
Analyze the result:
- When IssueWidth (call it IW from now on) = 6, Alu and Mul sweeping. No improvement.
- When IW = 7, and since then, the improvement starts.
- IW = 7, Alu = 7, Mpy = 4. Best
- IW = 8, Alu = 8, Mpy = 4. Best, however, result a bit wierd. As shown in figure below:
- IW = 9, ALu = 10, Mpy = 4. Best.
- IW = 10, ALu = 10, Mpy = 4. Best.
- IW = 11, ALu = 10, Mpy = 4. Best. (But same as above)
- IW = 12, ALu = 10, Mpy = 4. Best. (But same as well)
- IW = 13, ALu = 10, Mpy = 4. Best. (Same too)
- IW = 14, ALu = 10, Mpy = 4. Best. (Same)
- IW = 15, ALu = 10, Mpy = 4. Best. (Same) Set: 10 1 1 1 10 4 1 0 0 64 4
Mul constant = 4 Alu from 10 to 20. IW from 10 to 20.
Analyze the result:
- IW = 10, ALu = 11. Best.
- IW = 11, ALu = 12. Best. (Perf. same as above)
- IW = 12, ALu = 13. Best. (Same)
- IW = 13, ALu = 13. Best. Better than above
- IW = 14, ALu = 13. Best. (Perf. same as above)
- IW = 15, ALu = 13. Best. (Perf. same as above)
- IW = 16, ALu = 13. Best. (Perf. same as above)
- IW = 17, ALu = 13. Best. (Perf. same as above)
- IW = 18, ALu = 13. Best. (Perf. same as above)
- IW = 19, ALu = 13. Best. (Perf. same as above)
- IW = 20, ALu = 13. Best. (Perf. same as above) Set: 13 1 1 1 13 4 1 0 0 64 4
Mul sweep 4 to 20. Mul has no more influence. Mul always stays at 4. Set: 13 1 1 1 13 4 1 0 0 64 4
IW, Mul and Alu kinda the best combination already. Adjust the rest. Memload:1 2 3 Memstore:1 2 3 MemPft:1 2 3 r0:16 32 48 80 b0:2 6 10 14 Analyze the result: (First of all, the best performance remains the same: 654627)
- The three: 1 1 1, r0: 48, b0: 6 The three connections to the data cache are basically unrelated. Set: 13 1 1 1 13 4 1 0 0 48 6
Check with set: 13 1 1 1 13 4 1 0 0 48 4 (Same)
Use: 13 1 1 1 13 4 1 0 0 48 4
Useful information ends at 1203.