Channel: Active questions tagged kernel - Stack Overflow

How to get "sum" of parallel arrays in cuda?


My problem is about computing row sums for several arrays of the same length. For example, I have an M*N (100 * 2000) float array in total, and I would like to get M (100) sum values, one for every N (2000) floats. I found two ways to do this job. One is calling a cuBLAS function, such as cublasSasum, in a for loop over M. The other is a self-written kernel function that adds the numbers in a loop. My question is about the speed of these two ways and how to choose between them.

For the cuBLAS method, no matter how big N is (4000 to 2E6), the time consumed depends mainly on M, the loop count.

For the self-written kernel function, the speed varies a lot with N. In detail, if N is small (below 5000), it runs much faster than the cuBLAS way. Then the time consumption grows as N increases.

N = 4000  | 10000 | 40000  | 80000  | 1E6    | 2E6
t = 254ms | 422ms | 1365ms | 4361ms | 5399ms | 10635ms

If N is big enough, it runs much slower than the cuBLAS way. My question is: how can I make a prediction from M or N to decide which way I should use? My code might run on different GPU devices. Must I compare the speeds in a parameter sweep and then "guess" to make a choice on every GPU device, or can I infer it from the GPU device information?
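Since the crossover point depends on the device, one practical option (a sketch, not from the question: the event-timing harness and the 256-thread launch configuration are my assumptions) is to time both approaches once at startup on the actual device and cache the choice:

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// getSum is assumed to be the kernel from the question.
__global__ void getSum(int M, int N, float* in, float* out);

// Run both methods once with CUDA event timing and report which is faster.
// Assumes d_in (M*N floats) and d_out (M floats) are device allocations, and
// that the handle uses CUBLAS_POINTER_MODE_DEVICE so cublasSasum can write
// its result to d_out + j, as in the question's loop.
bool cublasIsFaster(cublasHandle_t handle, int M, int N,
                    float* d_in, float* d_out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the cuBLAS loop.
    cudaEventRecord(start);
    for (int j = 0; j < M; j++)
        cublasSasum(handle, N, d_in + (size_t)N * j, 1, d_out + j);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tCublas = 0.0f;
    cudaEventElapsedTime(&tCublas, start, stop);

    // Time the hand-written kernel (one thread per row, 256 threads per block).
    cudaEventRecord(start);
    getSum<<<(M + 255) / 256, 256>>>(M, N, d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tKernel = 0.0f;
    cudaEventElapsedTime(&tKernel, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return tCublas < tKernel;
}
```

The first timed call may include one-time initialization cost, so in practice a warm-up call of each method before timing gives a fairer comparison.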

Besides, for the kernel-function way, I also have a problem deciding the blockSize and gridSize. I found that a bigger gridSize is always better, and I don't know the reason. Is it about the number of registers used per thread? But I don't think this kernel function needs many registers per thread.
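Rather than sweeping blockSize by hand, the CUDA runtime can suggest a block size that maximizes occupancy for a given kernel on the current device. A minimal sketch, assuming the `getSum` kernel from this question:

```cuda
#include <cuda_runtime.h>

__global__ void getSum(int M, int N, float* in, float* out);

// Ask the runtime for an occupancy-maximizing block size for getSum on the
// current device, then derive the grid size from the problem size M
// (one thread per row, so we need at least M threads in total).
void launchGetSum(int M, int N, float* d_in, float* d_out) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, getSum, 0, 0);
    int gridSize = (M + blockSize - 1) / blockSize;  // round up to cover M rows
    getSum<<<gridSize, blockSize>>>(M, N, d_in, d_out);
}
```

The suggested blockSize accounts for per-thread register and shared-memory usage on the actual device, which is exactly the device-dependent information a hand-tuned constant misses.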

Any suggestion or information would be appreciated. Thank you.

The cuBLAS way:

for (int j = 0; j < M; j++)
    cublasStatus = cublasSasum(cublasHandle, N, d_in + N * j, 1, d_out + j);

The self-written kernel way:

__global__ void getSum(int M, int N, float* in, float* out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < M) {
        float tmp = 0;
        for (int ii = 0; ii < N; ii++) {
            tmp += *(in + N * i + ii);  // thread i sums row i
        }
        out[i] = tmp;
    }
}

A bigger gridSize is faster, and I don't know the reason:

getSum<<<M, 1>>>(M, N, d_in, d_out);  // faster
getSum<<<1, M>>>(M, N, d_in, d_out);
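One likely contributor to the kernel's slowdown at large N: with one thread per row, neighboring threads read addresses N floats apart, so global-memory reads are uncoalesced. A common alternative (a sketch of a standard pattern, not code from this question) assigns one block per row, so consecutive threads read consecutive elements and then combine partial sums with a shared-memory tree reduction:

```cuda
// One block per row. Threads stride over the row so that consecutive
// threads touch consecutive addresses (coalesced reads), then a
// shared-memory tree reduction combines the per-thread partial sums.
// blockDim.x must be a power of two (e.g. 256) for the reduction loop.
__global__ void getSumCoalesced(int N, const float* in, float* out) {
    extern __shared__ float sdata[];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    float tmp = 0.0f;
    for (int k = tid; k < N; k += blockDim.x)
        tmp += in[(size_t)row * N + k];
    sdata[tid] = tmp;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[row] = sdata[0];
}

// Launch: one block per row, blockDim.x floats of dynamic shared memory.
// getSumCoalesced<<<M, 256, 256 * sizeof(float)>>>(N, d_in, d_out);
```

This keeps the work on the GPU well parallelized in both M and N, which may flatten the growth in runtime that the table above shows for large N.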



