algorithm - Get N samples given iterator -
given iterator it on data points, number of data points have n, , maximum number of samples want use calculations (maxsamples).
imagine function calculatestatistics(iterator it, int n, int maxsamples). function should use iterator retrieve data , (heavy) calculations on data element retrieved.
- if
n <= maxsamplesof course use each element iterator - if
n > maxsampleshave choose elements @ , skip
i've been spending quite time on this. problem of course how choose when skip element , when keep it. approaches far:
- i don't want take first
maxsamplescoming iterator, because values might not evenly distributed. - another idea use random number generator , let me create
maxsamples(distinct) random numbers between0,n, take elements @ these positions. if e.g.n = 101,maxsamples = 100gets more , more difficult find new distinct number not yet in list, loosing lot of time in random number generation - my last idea contrary: generate
n - maxsamplesrandom numbers , exclude data elements @ these positions elements. doesn't seem solution.
do have idea problem? there maybe standard known algorithms this?
to provide answer, way collect set of random numbers given collection size > elements needed, following. (in c++ ish pseudo code).
edit: may need iterate on , create "someelements" vector first. if elements large can "pointers" these elements save space.
vector randomcollectionfromvector(someelements, numelementstograb) { while(numelementstograb--) { randposition = rand() % someelements.size(); resultvector.push(someelements.get(randposition)) someelements.remove(randposition); } return resultvector; } if don't care changing vector of elements, remove random elements someelements, mentioned. algorithm similar, , again, conceptually same idea, pass someelements reference, , manipulate it.
something worth noting, quality of psuedo random distributions far how random are, grows size of distribution used increases. so, may tend better results if pick method use based on method results in use of more random numbers. example: if have 100 values, , need 99, should pick 99 values, result in using 99 pseudo random numbers, instead of 1. conversely, if have 1000 values, , need 99, should prefer version remove 901 values, because use more numbers psuedo random distribution. if want solid random distribution, simple optimization, increase quality of "fake randomness" see. alternatively, if performance matters more distribution, take alternative or grab first 99 values approach.
Comments
Post a Comment