I am also trying to understand how to use streams. I believe that currently my model is getting much less out of the GPU than it could. (It is hard to understand where the bottlenecks are, but one tipoff is that nvidia-smi reports SM usage of only around 33%.)

A prior confusion I have about pytorch, before even getting to the topic of streams, is about when pytorch waits for kernels to finish running. I have tried line profiling code using the python line profiler, and the numbers I get for how much time is spent on each line roughly correspond to how long I would expect the corresponding computation to take on the GPU (but, as noted below, simple operations are not as much faster than complex ones as I might expect, and a further caveat is that the line profiler doesn't provide any kind of variance estimate). While seeing such numbers is good for helping me see where the bottlenecks are, it seems to imply that pytorch is waiting for computations to finish after each line (and not just when, say, I try to print out the result of a computation). And I would think that if pytorch just did an asynchronous kernel launch and immediately returned, it would be faster.

If I am right that pytorch waits, that explains why my naive attempt to use streams below fails to improve performance. What I tried was making a simple class to run code in parallel on different streams, like so:

```python
class StreamSpreader():
    ...
```

One question I have is: if this implementation doesn't work because pytorch will wait for each operation to complete before launching the next, is it at least possible to make a working StreamSpreader class with the same API? (One possible implementation is sketched at the end of this post.)

I tried three implementations of an LSTM. In the first, I do separate matrix multiplies of the hidden state and the input and add the results; this is what it looks like pytorch does when not using cudnn, I believe. In the second implementation, I use streams via my StreamSpreader class. In the third, I concatenate the hidden state and the input and do one matrix multiply. (By the way, if you are wondering why I don't just use the built-in LSTM, it is because I actually want to use a somewhat different architecture that is not supported as a built-in.) The key lines of the three variants:

```python
# first (baseline): separate matrix multiplies whose results are added
Wh_b = torch.addmm(bias_batch, h_0, self.weight_hh)

# second (streams): the same operation wrapped as a thunk for the StreamSpreader
lambda: torch.addmm(bias_batch, h_0, self.weight_hh),

# third (fused): one concatenation, then a single matrix multiply
combined_inputs = torch.cat((h_0, input), 1)
preactivations = torch.addmm(bias_batch, combined_inputs, combined_weights)
```

I found that the last approach improved performance significantly, but that the stream approach actually decreased performance slightly. I was a bit surprised at the line-by-line timings, though, which makes me wonder if I have indeed misunderstood the pytorch model and the truth is that it does do something more async. In the baseline model, the two matrix multiplies together took an average of 208 microseconds (121 + 87), and adding them to compute the preactivations took an average of 58 microseconds – it surprises me that the simple add operation takes so long. In the fused version, concatenating the hidden state and the input takes an average of 65 microseconds, but the single matrix multiply takes only 117 microseconds. For the stream version, adding the preactivations took an average of 53 microseconds, but doing the matrix multiplies took 262 microseconds.

This is probably obvious, but even if streams don't make sense for the particular case of implementing an LSTM block, I am still really interested in learning to use them effectively – I just had to choose something as a test case, and this is what I chose.
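For reference, here is a minimal sketch of what a working StreamSpreader with this API might look like. This is a reconstruction under assumptions, not the code from the post: it lazily creates one torch.cuda.Stream per task, launches each task on its own side stream, and makes the default stream wait for the side streams before the results are consumed. The stream_preactivations helper and the weight_ih name are illustrative as well.

```python
import torch

class StreamSpreader:
    """Run each zero-argument task on its own CUDA stream."""

    def __init__(self):
        self.streams = []

    def __call__(self, *tasks):
        if not torch.cuda.is_available():
            return [t() for t in tasks]  # no GPU: run serially
        # Lazily create one stream per task.
        while len(self.streams) < len(tasks):
            self.streams.append(torch.cuda.Stream())
        current = torch.cuda.current_stream()
        results = []
        for stream, task in zip(self.streams, tasks):
            # The inputs may have been produced on the default stream,
            # so each side stream must wait for it first.
            stream.wait_stream(current)
            with torch.cuda.stream(stream):
                results.append(task())
        # Symmetrically, the default stream must wait for the side
        # streams before downstream ops read the results.
        for stream in self.streams[:len(tasks)]:
            current.wait_stream(stream)
        return results

spreader = StreamSpreader()

def stream_preactivations(input, h_0, bias_batch, weight_ih, weight_hh):
    # The baseline's two matrix multiplies, launched on separate streams.
    Wh_b, Wx = spreader(
        lambda: torch.addmm(bias_batch, h_0, weight_hh),
        lambda: torch.mm(input, weight_ih),
    )
    return Wh_b + Wx
```

The two wait_stream calls are what keep this correct: without them, a kernel on one stream could read a tensor before the stream that produces it has finished writing. Correctness aside, whether this helps at all depends on each individual kernel leaving enough of the GPU idle for another kernel to run alongside it.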
The way you arrange the data in memory is independent of how you configure the threads of your kernel. The memory is always a 1D continuous space of bytes. However, the access pattern depends on how you are interpreting your data, and also on how you are accessing it with 1D, 2D, and 3D blocks of threads.

`dim3` is an integer vector type based on `uint3` that is used to specify dimensions. When defining a variable of type `dim3`, any component left unspecified is initialized to 1. The same happens for the blocks and the grid. So in both cases, `dim3 blockDims(512)` and `myKernel<<<gridDims, 512>>>(...)`, you will always have access to `threadIdx.y` and `threadIdx.z`.

As the thread ids start at zero, you can calculate a memory position in row-major order using the y dimension as well:

```cuda
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int pos = y * width + x;  // row-major position; width (the row length) is an assumed name
```

This still works with a 1D launch, because `blockIdx.y` and `threadIdx.y` will be zero.

To sum up, it does not matter whether you use a `dim3` structure. I would be clear about where the configuration of the threads has been defined, and the 1D, 2D, and 3D access pattern depends on how you are interpreting your data and on how you are accessing it with 1D, 2D, and 3D blocks of threads.
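As a concrete illustration of the point above, here is a small sketch (the kernel name, sizes, and the `width` parameter are illustrative assumptions, not taken from the answer) showing that a launch written with a `dim3` and one written with a plain integer are the same configuration:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: computes its row-major position; threadIdx.y and
// blockIdx.y are simply 0 when the launch is one-dimensional.
__global__ void myKernel(int *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // 0 in a 1D launch
    out[y * width + x] = x;
}

int main()
{
    const int width = 512;
    int *d_out;
    cudaMalloc(&d_out, width * sizeof(int));

    // These two launches are equivalent: a plain int converts to
    // dim3(n, 1, 1), and unspecified dim3 components default to 1.
    dim3 blockDims(512);
    myKernel<<<1, blockDims>>>(d_out, width);
    myKernel<<<1, 512>>>(d_out, width);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```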