Channel: Active questions tagged kernel - Stack Overflow

How to get "sum" of parallel arrays in cuda?


My problem is about computing row sums for several arrays of the same length. For example, I have an M*N (100 * 2000) float array in total, and I would like to get M (100) sum values, one for every N (2000) floats. I found two ways to do this job. One is calling a cuBLAS function, such as cublasSasum, in a for loop over M. The other is a self-written kernel function that adds the numbers in a loop. My question is about the speed of these two ways and how to choose between them.

For the cuBLAS method, no matter how big N is (4000 to 2E6), the time consumed depends mainly on M, the loop count.

For the self-written kernel function, the speed varies a lot with N. In detail, if N is small (below 5000), it runs much faster than the cuBLAS way. Then the time consumption grows as N increases.

N = 4000  | 10000 | 40000  | 80000  | 1E6    | 2E6
t = 254ms | 422ms | 1365ms | 4361ms | 5399ms | 10635ms

If N is big enough, it runs much slower than the cuBLAS way. My question is: how can I make a prediction from M or N to decide which way I should use? My code might run on different GPU devices. Must I compare the speeds in a parameter sweep and then "guess" to make a choice on every GPU device, or can I infer it from the GPU device information?
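Since the crossover point depends on the device, one practical option (a sketch, not from the question: the event-timing harness and the 256-thread launch configuration are my assumptions) is to time both approaches once at startup on the actual device and cache the choice:

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// getSum is assumed to be the kernel from the question.
__global__ void getSum(int M, int N, float* in, float* out);

// Run both methods once with CUDA event timing and report which is faster.
// Assumes d_in (M*N floats) and d_out (M floats) are device allocations, and
// that the handle uses CUBLAS_POINTER_MODE_DEVICE so cublasSasum can write
// its result to d_out + j, as in the question's loop.
bool cublasIsFaster(cublasHandle_t handle, int M, int N,
                    float* d_in, float* d_out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the cuBLAS loop.
    cudaEventRecord(start);
    for (int j = 0; j < M; j++)
        cublasSasum(handle, N, d_in + (size_t)N * j, 1, d_out + j);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tCublas = 0.0f;
    cudaEventElapsedTime(&tCublas, start, stop);

    // Time the hand-written kernel (one thread per row, 256 threads per block).
    cudaEventRecord(start);
    getSum<<<(M + 255) / 256, 256>>>(M, N, d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tKernel = 0.0f;
    cudaEventElapsedTime(&tKernel, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return tCublas < tKernel;
}
```

The first timed call may include one-time initialization cost, so in practice a warm-up call of each method before timing gives a fairer comparison.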

Besides, for the kernel-function way, I also have a problem deciding the blockSize and gridSize. I found that a bigger gridSize is always better, and I don't know the reason. Is it about the number of registers used per thread? But I don't think this kernel function needs many registers per thread.
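Rather than sweeping blockSize by hand, the CUDA runtime can suggest a block size that maximizes occupancy for a given kernel on the current device. A minimal sketch, assuming the `getSum` kernel from this question:

```cuda
#include <cuda_runtime.h>

__global__ void getSum(int M, int N, float* in, float* out);

// Ask the runtime for an occupancy-maximizing block size for getSum on the
// current device, then derive the grid size from the problem size M
// (one thread per row, so we need at least M threads in total).
void launchGetSum(int M, int N, float* d_in, float* d_out) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, getSum, 0, 0);
    int gridSize = (M + blockSize - 1) / blockSize;  // round up to cover M rows
    getSum<<<gridSize, blockSize>>>(M, N, d_in, d_out);
}
```

The suggested blockSize accounts for per-thread register and shared-memory usage on the actual device, which is exactly the device-dependent information a hand-tuned constant misses.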

Any suggestion or information would be appreciated. Thank you.

The cuBLAS way:

for (int j = 0; j < M; j++)
    cublasStatus = cublasSasum(cublasHandle, N, d_in + N * j, 1, d_out + j);

The self-written kernel way:

__global__ void getSum(int M, int N, float* in, float* out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < M) {
        float tmp = 0;
        for (int ii = 0; ii < N; ii++) {
            tmp += *(in + N * i + ii);  // thread i sums row i
        }
        out[i] = tmp;
    }
}

A bigger gridSize is faster, and I don't know the reason:

getSum<<<M, 1>>>(M, N, d_in, d_out);  // faster
getSum<<<1, M>>>(M, N, d_in, d_out);
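One likely contributor to the kernel's slowdown at large N: with one thread per row, neighboring threads read addresses N floats apart, so global-memory reads are uncoalesced. A common alternative (a sketch of a standard pattern, not code from this question) assigns one block per row, so consecutive threads read consecutive elements and then combine partial sums with a shared-memory tree reduction:

```cuda
// One block per row. Threads stride over the row so that consecutive
// threads touch consecutive addresses (coalesced reads), then a
// shared-memory tree reduction combines the per-thread partial sums.
// blockDim.x must be a power of two (e.g. 256) for the reduction loop.
__global__ void getSumCoalesced(int N, const float* in, float* out) {
    extern __shared__ float sdata[];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    float tmp = 0.0f;
    for (int k = tid; k < N; k += blockDim.x)
        tmp += in[(size_t)row * N + k];
    sdata[tid] = tmp;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[row] = sdata[0];
}

// Launch: one block per row, blockDim.x floats of dynamic shared memory.
// getSumCoalesced<<<M, 256, 256 * sizeof(float)>>>(N, d_in, d_out);
```

This keeps the work on the GPU well parallelized in both M and N, which may flatten the growth in runtime that the table above shows for large N.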



