c++ - No speedup achieved using the parallel_scan component of Intel Threading Building Blocks (TBB)
I am exploring the parallel_scan component of Intel Threading Building Blocks, which can be used for associative operations, and I am finding that parallel_scan takes about 10 times longer than the same work done serially.
The code I have written to check this is:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;

template <class T>
class Body {
    T reduced_result;
    T* const y;
    const T* const x;
public:
    Body( T y_[], const T x_[] ) : reduced_result(0), x(x_), y(y_) {}
    T get_reduced_result() const { return reduced_result; }
    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag ) {
        T temp = reduced_result;
        for( int i=r.begin(); i<r.end(); ++i ) {
            temp = temp + x[i];
            if( Tag::is_final_scan() )
                y[i] = temp;
        }
        reduced_result = temp;
    }
    Body( Body& b, split ) : x(b.x), y(b.y), reduced_result(10) {}
    void reverse_join( Body& a ) { reduced_result = a.reduced_result + reduced_result; }
    void assign( Body& b ) { reduced_result = b.reduced_result; }
};

template<class T>
float doParallelScan( T y[], const T x[], int n ) {
    Body<int> body(y,x);
    tick_count t1,t2;
    t1 = tick_count::now();
    parallel_scan( blocked_range<int>(0,n), body, auto_partitioner() );
    t2 = tick_count::now();
    cout<<"time taken for parallel scan \t"<<(t2-t1).seconds()<<endl;
    return body.get_reduced_result();
}

template<class T1>
float serialScan( T1 y[], const T1 x[], int n ) {
    tick_count t3,t4;
    t3 = tick_count::now();
    T1 temp = 10;
    for( int i=1; i<n; ++i ) {
        temp = temp + x[i];
        y[i] = temp;
    }
    t4 = tick_count::now();
    cout<<"time taken for serial scan \t"<<(t4-t3).seconds()<<endl;
    return temp;
}

int main() {
    task_scheduler_init init1;
    int y1[100000], x1[100000];
    for( int i=0; i<100000; i++ )
        x1[i] = i;
    cout<<fixed;
    cout<<"\n serial scan output \t"<<serialScan(y1,x1,100000)<<endl;
    cout<<"\n parallel scan output \t"<<doParallelScan(y1,x1,100000)<<endl;
    return 0;
}
Please help me find what is going wrong.
I'm the original author of tbb::parallel_scan.
Getting speedup out of a parallel scan on multicore systems that use "big cores" can be hard. The reason is that a parallel scan is inherently a two-pass algorithm. If the data does not fit in the outer-level cache, the parallel scan has to stream the data in from memory twice, whereas the serial algorithm has to do so only once. For an operation as simple as integer addition, memory traffic, not the ALU, is the bottleneck for a "big core" that devotes a lot of hardware resources to fast serial execution. If the data does fit in the outer-level cache, there might not be enough work to amortize the parallel overheads.
I was able to get parallel speedup (about 2x) for the example with the following changes and conditions:
I hoisted the read of r.end() into a local variable before the loop, like this:
int rend = r.end();
for( int i=r.begin(); i<rend; ++i )
Doing this helps the compiler generate better code because it then knows that rend is loop-invariant. Without the hoisting, the compiler has to assume that the writes to y[i] might overwrite the field of r that r.end() reads. It might also help to hoist the reads of the fields x and y into local variables, though a compiler should be able to tell via type-based alias disambiguation that the writes to y[i] do not affect those fields.
I increased the input arrays to have 10,000,000 elements, so there is more work to do and the parallel scheduling overheads are better amortized. To avoid stack overflow, I allocated the arrays on the heap.
I warmed up the TBB run-time. In general, when doing this sort of timing exercise, it is good to have a "throw away" run first so that one-time startup costs are not counted. To do the warm-up (for both the serial and parallel code), I wrapped a three-trip loop around the timing logic, like this:
for( int k=0; k<3; ++k ) {
    cout<<"\n serial scan output \t"<<serialScan(y1,x1,n)<<endl;
    cout<<"\n parallel scan output \t"<<doParallelScan(y1,x1,n)<<endl;
}
This is good practice for timing experiments anyway, since you can see whether the first-time costs are significant and whether there is other variation of interest.
I compiled with "gcc -O2 -ltbb".
I ran on a 16-core system with two "Sandy Bridge" chips.
A good way to see the impact of memory bandwidth is to change T in the example to a smaller type. When I edited the example to change T from int to char (thus reducing the memory bandwidth demands by 4x), the parallel speedup increased. (Aside: there is a "Body<int>" in the example that should be "Body<T>".)
Another way to see the impact of memory bandwidth is to try the example on a system with many small cores. I tried the example, modified as described with type int, on an Intel(R) Xeon Phi(TM), which has high memory bandwidth and many small cores. I got 4x-7x parallel speedup. Cranking the problem size up to 100,000,000 elements got me a 10x-20x speedup.
To summarize: a multi-threaded scan can pay off when the benefits of parallel computation outweigh the overhead of making two passes over the data.