r - Subset a data.frame into a list in a more efficient way -


I have a data.frame with 2 columns representing interactions between 2 genes. Here is an example of what the data.frame looks like:

head(df)
    v1      v2
  a1bg    a1bg
  a1bg  crisp3
  a1cf    a1cf
  a1cf apobec1
  a1cf  cugbp2
  a1cf   khsrp

I want to split the data.frame based on the values in the first column, and I've used the following command:

out <- split(df, df$v1) 

The desired output is:

out
$a1bg
[1] a1bg   crisp3

$a1cf
[1] a1cf    apobec1 cugbp2  khsrp

However, split() takes a very long time because the file is big (around 200,000 rows).

Many thanks.

To speed this up, if you only need df$v2 split apart on the basis of df$v1, pass that vector to split() rather than the entire data frame df. E.g.:

## dummy data
df <- read.table(text = "v1       v2
a1bg     a1bg
a1bg    crisp3
a1cf     a1cf
a1cf   apobec1
a1cf    cugbp2
a1cf     khsrp", header = TRUE)

## make it big!
df <- with(df, cbind.data.frame(v1 = rep(v1, length.out = 1e5),
                                v2 = rep(v2, length.out = 1e5)))

## time it
system.time(sp1 <- split(df, df$v1))
system.time(sp2 <- split(df$v2, df$v1))

> system.time(sp1 <- split(df, df$v1))
   user  system elapsed
  0.024   0.000   0.016
> system.time(sp2 <- split(df$v2, df$v1))
   user  system elapsed
  0.008   0.000   0.005

This is an example with only a few levels, though. With many levels, the inefficiency of splitting the entire data frame starts to weigh heavily on compute time, e.g. with a factor of around 10000 levels:

df2 <- data.frame(v1 = factor(sample(10000, 1e5, replace = TRUE)),
                  v2 = rnorm(1e5))

system.time(sp3 <- split(df2, df2$v1))
system.time(sp4 <- split(df2$v2, df2$v1))

> system.time(sp3 <- split(df2, df2$v1))
   user  system elapsed
  5.332   0.000   4.216
> system.time(sp4 <- split(df2$v2, df2$v1))
   user  system elapsed
  0.008   0.000   0.005

The reason is that in the split(df, df$v1) case, the split.data.frame method is called. It does an lapply() over the vector 1:nrow(df) split into groups by f (here df$v1), and applies a function (function(ind) x[ind, , drop = FALSE]) to each component. Hence, as the number of levels grows large, the number of calls to the anonymous function grows with it and inflates compute time.
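As a rough illustration of the mechanism described above, the data frame method behaves much like the sketch below. This is a simplified stand-in, not the actual base R source; the helper name split_df_sketch and the toy data are made up for this example.

```r
# Hypothetical helper mimicking the essence of split.data.frame:
# split the row indices by the grouping vector, then subset the
# data frame once per group through an anonymous function.
split_df_sketch <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)                  # cheap: splits an integer vector
  lapply(idx, function(ind) x[ind, , drop = FALSE])  # costly: one call per level
}

# Toy data, made up for illustration
toy <- data.frame(v1 = c("a1bg", "a1bg", "a1cf", "a1cf"),
                  v2 = c("a1bg", "crisp3", "a1cf", "apobec1"),
                  stringsAsFactors = FALSE)

res <- split_df_sketch(toy, toy$v1)
names(res)   # one data frame per distinct value of v1
```

With around 10000 levels, the lapply() step makes around 10000 separate data frame subsetting calls, which is where the time goes.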

In the split(df$v2, df$v1) case, the split.default method is used, which, if called with a factor f, only needs to call the fast C implementation of split. As such, it doesn't incur any of the overhead of calling the anonymous function, nor of the repeated calls to [ on the data frame.
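To check that the vector-based call yields the shape of output asked for in the question, a small comparison along these lines may help. It is only a sketch on made-up toy data; the object names by_vec and by_df are assumptions for this example.

```r
# Toy data, made up for illustration
toy <- data.frame(v1 = c("a1bg", "a1bg", "a1cf", "a1cf", "a1cf", "a1cf"),
                  v2 = c("a1bg", "crisp3", "a1cf", "apobec1", "cugbp2", "khsrp"),
                  stringsAsFactors = FALSE)

# Fast route: split only the column of interest
by_vec <- split(toy$v2, toy$v1)

# Slow route: split the whole data frame, then pull out v2 per group
by_df <- lapply(split(toy, toy$v1), function(d) d$v2)

# Both routes should hold the same v2 values under the same names
all.equal(by_vec, by_df)
by_vec
```

If the other columns are also needed per group, splitting the whole data frame is still required; the speed-up applies when one vector per group is enough.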

