R - Subset a data.frame into a list in a more efficient way
I have a data.frame with 2 columns representing interactions between 2 genes. Here is an example of how the data.frame looks:
    head(df)
      v1      v2
    a1bg    a1bg
    a1bg  crisp3
    a1cf    a1cf
    a1cf apobec1
    a1cf  cugbp2
    a1cf   khsrp
I want to split the data.frame based on the values in the first column, and I've used the following command:
out <- split(df, df$v1)
The desired output should be:

    out
    $a1bg
    [1] a1bg crisp3

    $a1cf
    [1] a1cf apobec1 cugbp2 khsrp
However, the process using split takes a very long time since the file is big (around 200,000 rows).

Many thanks.
To speed this up, if you only need df$v2 split apart on the basis of df$v1, then use that vector in the call to split, not the entire data frame df. E.g.:
    ## dummy data
    df <- read.table(text = "v1 v2
    a1bg a1bg
    a1bg crisp3
    a1cf a1cf
    a1cf apobec1
    a1cf cugbp2
    a1cf khsrp", header = TRUE)

    ## make it big!
    df <- with(df, cbind.data.frame(v1 = rep(v1, length.out = 1e5),
                                    v2 = rep(v2, length.out = 1e5)))

    ## timings
    system.time(sp1 <- split(df, df$v1))
    system.time(sp2 <- split(df$v2, df$v1))

    > system.time(sp1 <- split(df, df$v1))
       user  system elapsed
      0.024   0.000   0.016
    > system.time(sp2 <- split(df$v2, df$v1))
       user  system elapsed
      0.008   0.000   0.005
This is on an example with only a few levels though. With many levels, the inefficiency of splitting the entire data frame starts to weigh heavily on compute time, e.g. with a factor of around 10000 levels:
    df2 <- data.frame(v1 = factor(sample(10000, 1e5, replace = TRUE)),
                      v2 = rnorm(1e5))
    system.time(sp3 <- split(df2, df2$v1))
    system.time(sp4 <- split(df2$v2, df2$v1))

    > system.time(sp3 <- split(df2, df2$v1))
       user  system elapsed
      5.332   0.000   4.216
    > system.time(sp4 <- split(df2$v2, df2$v1))
       user  system elapsed
      0.008   0.000   0.005
The reason is that in the split(df, df$v1) case, the split.data.frame method is called, which does an lapply() on the vector 1:nrow(df) split into groups by f (df$v1), and applies the function (function(ind) x[ind, , drop = FALSE]) to each component. Hence, as the number of levels grows large, the number of calls to the anonymous function grows with it, and that inflates compute time.
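The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the actual source of split.data.frame: the helper name split_df_sketch is hypothetical, and the small data frame simply reuses the values from the question.

```r
# Small example data, same shape as in the question
df <- data.frame(
  v1 = c("a1bg", "a1bg", "a1cf", "a1cf", "a1cf", "a1cf"),
  v2 = c("a1bg", "crisp3", "a1cf", "apobec1", "cugbp2", "khsrp"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the data-frame approach: split the row
# indices by f, then subset the data frame once per group.
split_df_sketch <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)
  # One R-level function call (and one `[` on the data frame) per group
  lapply(idx, function(ind) x[ind, , drop = FALSE])
}

manual  <- split_df_sketch(df, df$v1)
builtin <- split(df, df$v1)

# Compare the sketch with the built-in result
print(all.equal(manual, builtin))

# The anonymous function runs once per group, so its cost scales
# with the number of levels in f
print(length(manual))
```

With two groups the per-group subsetting is negligible; with around 10000 levels the same loop runs around 10000 times, which is where the time goes.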
In the split(df$v2, df$v1) case the split.default method is used, which, if called with a factor f, only needs to call the fast C implementation of split. As such it doesn't incur any of the overhead of calling the anonymous function, nor the repeated calls to [.
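As a quick check that the vector form gives the list the question asks for, the sketch below runs both approaches on the small example and compares them. The variable names by_vec and by_df are only illustrative, and the data frame is built with stringsAsFactors = FALSE so that the components are plain character vectors.

```r
# Small example data, same values as in the question
df <- data.frame(
  v1 = c("a1bg", "a1bg", "a1cf", "a1cf", "a1cf", "a1cf"),
  v2 = c("a1bg", "crisp3", "a1cf", "apobec1", "cugbp2", "khsrp"),
  stringsAsFactors = FALSE
)

# Fast route: split only the vector of interest by the grouping column
by_vec <- split(df$v2, df$v1)

# Slow route: split the whole data frame, then pull out v2 from each piece
by_df <- lapply(split(df, df$v1), function(d) d$v2)

# Both routes should describe the same named list of partners per gene
print(all.equal(by_vec, by_df))
print(by_vec)
```

If other columns are needed later, the same pattern applies column by column, or the row indices can be split once and reused; whether that pays off depends on how many groups there are.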