Low speedup of written CUDA code


I have written two separate versions of the same program, one for the CPU (C++) and one in CUDA. I don't know why the CUDA code is slower than the CPU code.

I have three matrices h, e and f, and the operations are performed on these. The execution time of the CPU code is 0.004 s and that of the CUDA code is 0.006 s, where the dimensions of the matrices are 32*32. In the kernel code I defined three shared memory variables, matrix_h, matrix_e and matrix_f, and copied the dev_h, dev_e and dev_f values from global memory to shared memory to speed up memory access, and finally copied the calculated shared memory variables back to global memory.

Is it because of the large number of parameters in the kernel call, or is the cause elsewhere?

    __global__ void kernel_scorematrix(char *dev_seqa, char *dev_seqb,
        int *dev_h, int *dev_e, int *dev_f, int *dev_i_side, int *dev_j_side,
        int *dev_lena, int *dev_idx_array, int *dev_array_length)
    {
        __shared__ int matrix_h[1024];
        __shared__ int matrix_e[1024];
        __shared__ int matrix_f[1024];

        int x = threadIdx.x;
        int y = threadIdx.y;

        // calculate the current_cell that this thread works on
        int current_cell = *(dev_lena) * (y) + x;

        // stage the three matrices from global memory into shared memory
        matrix_h[current_cell] = dev_h[current_cell];
        matrix_e[current_cell] = dev_e[current_cell];
        matrix_f[current_cell] = dev_f[current_cell];

        // barrier so every thread's shared-memory loads are complete before
        // any thread reads cells loaded by another thread
        __syncthreads();

        int index = 0;
        int scorematrix[4];

        // determine which cells must be computed at this time
        for (int i = 0; i < *(dev_array_length); i++)
            if (current_cell == dev_idx_array[i]) {
                scorematrix[0] = h_matrix(current_cell, x, y, matrix_h, dev_seqa, dev_seqb, dev_lena);
                scorematrix[1] = e_matrix(current_cell, matrix_e, matrix_h, dev_lena);
                scorematrix[2] = f_matrix(current_cell, matrix_f, matrix_h, dev_lena);
                scorematrix[3] = 0;
                dev_h[current_cell] = findmax(scorematrix, 4, index);
            }
    }

In the main function:

    dim3 threadsperblock(32, 32);
    kernel_scorematrix<<<1, threadsperblock>>>(dev_seqa, dev_seqb, dev_h, dev_e, dev_f,
        dev_i_side, dev_j_side, dev_lena, dev_idx_array, dev_array_length);

A threadblock by definition executes on a single SM. Regardless of how many threads the threadblock contains, the only execution resources available to that particular threadblock are the resources of that (single) SM. Since most NVIDIA GPUs contain more than a single SM, in order to keep the GPU busy (which is necessary to get the most performance), it's necessary to launch grids with more than 1 threadblock. A reasonable rule of thumb is to have at least 2-4x as many threadblocks as you have SMs, and there is usually little harm in having a lot more threadblocks than that.
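To put a number on that rule of thumb for a particular card, the SM count can be read from the CUDA runtime. The following is only a minimal sketch: it assumes device 0 is the GPU of interest, and the helper name suggested_min_blocks is made up for illustration and is not part of CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): number of threadblocks suggested by
// the "factor x number of SMs" rule of thumb.
int suggested_min_blocks(int sm_count, int factor) {
    return sm_count * factor;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU being used.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    int sms = prop.multiProcessorCount;  // number of SMs on this device
    std::printf("SMs: %d, aim for at least %d to %d threadblocks\n",
                sms,
                suggested_min_blocks(sms, 2),
                suggested_min_blocks(sms, 4));
    return 0;
}
```

Comparing the printed range with the grid size of an actual launch shows how far a given kernel is from keeping every SM occupied.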

But if you launch a kernel with only 1 threadblock, you are limited to 1 SM, and therefore you are getting approximately 1/(number of SMs in your GPU) of the performance available from the machine. The number of threads in that threadblock does not affect this factor.
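For contrast, here is a rough sketch of a launch that spreads work over many threadblocks. It is not the scoring kernel from the question: add_one is a hypothetical element-wise kernel and blocks_for is a hypothetical helper, both invented here only to show a grid with more than one block. It assumes the work items are independent of each other, which may not hold for a computation with dependencies between cells.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: each thread handles one element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks
    if (i < n) data[i] += 1;                        // guard the tail of the last block
}

// Hypothetical helper: number of blocks needed to cover n elements (ceiling division).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    const int n = 1 << 20;        // illustrative problem size
    const int threads = 256;      // threads per block
    const int blocks = blocks_for(n, threads);

    std::vector<int> host(n, 0);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Many blocks, so the hardware can distribute them over all SMs.
    add_one<<<blocks, threads>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("blocks=%d first=%d last=%d\n", blocks, host[0], host[n - 1]);
    return 0;
}
```

The only structural difference from a single-block launch is the first launch parameter and the use of blockIdx.x in the index calculation.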

