gpu - Linking with 3rd party CUDA libraries slows down cudaMalloc
It is no secret that on CUDA 4.x the first call to cudaMalloc
can be ridiculously slow (which has been reported several times), seemingly due to a bug in the CUDA drivers.
Recently, I noticed weird behaviour: the running time of cudaMalloc
directly depends on how many 3rd-party CUDA libraries are linked to the program (note that I do not use these libraries, I just link the program with them).
I ran tests using the following program:
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(0);

    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);

    return 1;
}
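The question does not show the build command; a plausible reconstruction for the slowest case, with a hypothetical source file name test.cu, would be something like:

nvcc -o test test.cu -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand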
The results are as follows:
linked with: -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand    running time: 5.852449 s
linked with: -lcudart -lnpp -lcufft -lcublas                        running time: 1.425120 s
linked with: -lcudart -lnpp -lcufft                                 running time: 0.905424 s
linked with: -lcudart                                               running time: 0.394558 s
According to gdb, the time is indeed spent inside cudaMalloc, so it is not caused by any library initialization routine.
I wonder if anyone has a plausible explanation for this?
In your example, the cudaMalloc call initiates lazy context establishment on the GPU. When runtime API libraries are included, their binary payloads have to be inspected, and the GPU ELF symbols and objects they contain merged into the context. The more libraries there are, the longer you can expect that process to take. Further, if there is an architecture mismatch in any of the cubins and you have a backwards-compatible GPU, it can trigger driver recompilation of the device code for the target GPU. In a very extreme case, I have seen an old application linked with an old version of CUBLAS take tens of seconds to load and initialise when run on a Fermi GPU.
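As an aside, for device code you compile yourself (it cannot help with the payloads shipped inside prebuilt libraries like CUBLAS), that recompilation penalty can be avoided by embedding binary code for the target architecture at build time. A sketch for a Fermi (sm_20) target, also embedding PTX for forward compatibility; the file name test.cu is again hypothetical:

nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 -o test test.cu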
You can explicitly force the lazy context establishment by issuing a cudaFree
call like this:
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(0);
    cudaFree(0); // context establishment happens here

    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);

    return 1;
}
If you profile or instrument this version with timers, you should find that the first cudaFree
call consumes most of the runtime and the cudaMalloc
call becomes almost free.
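For completeness, here is a minimal sketch of such instrumentation. It times the host-side wall-clock cost of each call with POSIX gettimeofday (assuming a Linux host; the timer choice is an illustrative assumption, not part of the original answer):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* Returns wall-clock time in seconds. */
static double wallclock(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
}

int main()
{
    cudaSetDevice(0);

    double t0 = wallclock();
    cudaFree(0); /* context establishment happens here */
    double t1 = wallclock();

    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    double t2 = wallclock();

    cudaFree(ptr);

    printf("cudaFree(0): %f s\n", t1 - t0);
    printf("cudaMalloc:  %f s\n", t2 - t1);
    return 0;
}

With this, the cost that previously appeared inside cudaMalloc should show up almost entirely in the cudaFree(0) measurement.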