Rewriting slow R function in C++ & Rcpp -
i have line of r code:
croppeddna <- completedna[,apply(completedna,2,function(x) any(c(false,x[-length(x)]!=x[-1])))]
what identify sites (cols) in matrix of dna sequences (1 row = 1 seq) not universal (informative) , subsets them matrix make new 'cropped matrix' i.e. rid of columns in values same. big dataset takes 6 seconds. don't know if can faster in c++ (still beginner in c++) me try. idea use rcpp, loop through columns of charactermatrix, pull out column (the site) charactervector check if same. if same, record column number/index, continue columns. @ end make new charactermatrix includes columns. important keep rownames , column names in th "r version" of matrix i.e. if column goes, should colname.
i've been writing 2 minutes, far have (not finished):
#include <rcpp.h> #include <vector> using namespace rcpp; // [[rcpp::export]] charactermatrix reduce_sequences(charactermatrix completedna) { std::vector<bool> informativesites; for(int = 0; < completedna.ncol(); i++) { charactervector bpsite = completedna(,i); if(all(bpsite == bpsite[1]) { informativesites.push_back(i); } } charactermatrix cutdna = completedna(,informativesites); return cutdna; }
am going right way this? there easier way. understanding need std::vector because it's easy grow them (since don't know in advance how many cols going want keep). indexing need +1 informativesites vector @ end (because r indexes 1 , c++ 0)?
thanks, ben w.
sample data:
set.seed(123) z <- matrix(sample(c("a", "t", "c", "g", "n", "-"), 3*398508, true), 3, 398508)
op's solution:
system.time(y1 <- z[,apply(z,2,function(x) any(c(false,x[-length(x)]!=x[-1])))]) # user system elapsed # 4.929 0.043 4.976
a faster version using base r:
system.time(y2 <- (z[, colsums(z[-1,] != z[-nrow(z), ]) > 0])) # user system elapsed # 0.087 0.011 0.098
the results identical:
identical(y1, y2) # [1] true
it's possible c++ beat it, necessary?
Comments
Post a Comment