
Homework 1

info

Q1 Loop-carried dependences For this question, we will describe a loop-carried dependence as coming from some statement in iteration (i,j) to some other iteration (x,y), where x and y are written as expressions involving i and j (e.g., (i,j+1)). For example, there could be a loop-carried dependence from S4(i,j) to S5(i+1,j). Note that the first statement must come before the second statement in program order. Consider the code below, where all variables have been previously declared.

...
for (i=1; i <= N; i++) {
  for (j=1; j <= i; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j-1] * b[i+1][j-1] * c[i-1][j];
    S3: c[i+1][j] = a[i][j];
  }
}
...

Thinking

  • Finding dependences inside an iteration:
    • $S_{1} \mathop{\to}^{T} S_{3}$
    • $S_{1} \mathop{\to}^{A} S_{2}$
  • Finding dependences across iterations:
    • $S_{1}[i,j] \mathop{\to}^{T} S_{2}[i+1,j+1]$
    • $S_{2}[i,j] \mathop{\to}^{A} S_{2}[i+1,j-1]$
    • $S_{3}[i,j] \mathop{\to}^{T} S_{1}[i+1,j]$
    • $S_{3}[i-1,j] \mathop{\to}^{T} S_{2}[i+1,j]$, normalized to $S_{3}[i,j] \mathop{\to}^{T} S_{2}[i+2,j]$
info

Q1.1 LDG Draw the LDG for this loop nest.

The LDG does not show loop-independent dependences; each node in the LDG represents all statements in a particular iteration. Therefore, only dependences across iterations are included:

Figure: LDG of Q1 (CSC 506 HW1 Q1.excalidraw)

danger

Notice that the inner loop bound is j <= i; therefore, there are no data dependences below the diagonal of the graph.

[!success] Answer from TA Pasted image 20241008190149

info

Q2 Variable scope analysis Consider the code below. All matrices have dimension $N \times N$, all vectors have dimension $N$, and assume that $N$ is divisible by $2$.

...
for (i=0; i<N; i++) {
  k = C[N-1-i];
  for (j=0; j<N; j++) {
    A[i][j] = k * A[i][j] * B[i/2][j];
  }
}
...
info

Q2.1 $i$-loop By default, all variables are shared. If the $i$-loop is converted to a parallel loop, indicate which variables should be private.

Thinking: When the $i$-loop is parallelized, $i$ must be private to each thread, and so must anything written inside the $i$-loop body.

  • $i$: each thread needs its own $i$. Make $i$ private.
  • $j$: every thread runs its own copy of the inner loop, so sharing $j$ would cause a race. Make $j$ private.
  • $k$: it is assigned inside the $i$-loop (but outside the $j$-loop), so each thread needs its own copy. Make $k$ private.
  • $A$: it is both read and written, but each iteration accesses only A[i][j], and the index pair $(i,j)$ is private, so no two threads touch the same element. $A$ can be shared.
  • $B$: read-only, can be shared.
  • $C$: read-only, can be shared.

Therefore, $i$, $j$, $k$ are private.
info

Q2.2 $j$-loop By default, all variables are shared. If the $j$-loop is converted to a parallel loop, indicate which variables should be private.

Thinking: When the $j$-loop is parallelized, $j$ must be private to each thread, and so must anything written inside the $j$-loop body.

  • $i$: within one outer iteration it has a single value and is only read by the parallel $j$-loop; it can be shared.
  • $j$: each thread needs its own $j$. Make $j$ private.
  • $k$: like $i$, it is assigned before the $j$-loop and only read inside it; it can be shared.
  • $A$: as in the $i$-loop case, each iteration writes a distinct A[i][j], so there is no race between different $j$ values. $A$ can be shared.
  • $B$: read-only, can be shared.
  • $C$: read-only, can be shared.

Therefore, only $j$ is private.
info

Q3 Average access latency Suppose you have a one-level (L1) cache. The L1 latency is 2 cycles, and memory latency is 80 cycles.

info

Q3.1 Calculate AAT If the L1 miss rate is 0.05, what is the average latency for a memory operation?

$AAT = T_{L_{1}} + M_{L_{1}} \cdot T_{Mem} = 2 + 0.05 \times 80 = 6$ cycles
info

Q3.2 Calculate miss rate What must the miss rate be in order to achieve an average access latency of 10 cycles?

$$\begin{align} AAT &= T_{L_{1}} + M_{L_{1}} \cdot T_{Mem} = 2 + M_{L_{1}} \cdot 80 = 10 \\ M_{L_{1}} &= \frac{10-2}{80} = 0.1 \end{align}$$
info

Q4 Average access latency (two-level) Suppose you have a two-level cache. The L1 access latency is 2 cycles. The L2 access latency is 10 cycles. The memory access latency is 80 cycles.

info

Q4.1 Calculate AAT If the L1 miss rate is 0.1, and the L2 miss rate is 0.5, what is the average access latency?

$$\begin{align} AAT &= T_{L_{1}} + M_{L_{1}} \cdot T_{L_{2}} + M_{L_{1}} \cdot M_{L_{2}} \cdot T_{Mem} \\ &= 2 + 0.1 \times 10 + 0.1 \times 0.5 \times 80 \\ &= 7 \text{ cycles} \end{align}$$
info

Q4.2 Calculate L2 miss rate If the L1 miss rate is 0.05, what must the L2 miss rate be to achieve an average access latency of 4 cycles?

$$\begin{align} AAT &= T_{L_{1}} + M_{L_{1}} \cdot T_{L_{2}} + M_{L_{1}} \cdot M_{L_{2}} \cdot T_{Mem} = 4 \\ 4 &= 2 + 0.05 \times 10 + 0.05 \times M_{L_{2}} \times 80 \\ M_{L_{2}} &= \frac{4 - 2 - 0.05 \times 10}{80 \times 0.05} = 0.375 \end{align}$$
info

Q5 Replacement policy You have a fully-associative cache with a capacity of four blocks. Blocks are stored in positions 0, 1, 2, 3 in the set, from left to right in the figure. To write the content of the cache, list the block addresses in left-to-right order, and use a single hyphen (-) to denote a block position that does not contain a valid block. Pasted image 20240921215725 Empty blocks are filled from left to right. As shown, the initial cache state is A,-,-,-.

The cache uses LRU replacement and has a write-no-allocate policy. After each memory operation below, show the state of the cache. Access sequence: Read B, Write A, Read C, Write D, Read C, Read E, Read D, Write C, Read B, Read F

The cache is write-no-allocate, so a write miss does not load the block into the cache; only a read miss does.

Figure: cache state after each access (CSC 506 HW1 Q5.excalidraw)