r - Subset a data.frame into a list in a more efficient way

June 15, 2010

I have a data.frame with 2 columns representing interactions between 2 genes. Here is an example of how the data.frame looks:

head(df)
v1      v2
a1bg    a1bg
a1bg    crisp3
a1cf    a1cf
a1cf    apobec1
a1cf    cugbp2
a1cf    khsrp

I want to split the data.frame based on the values in the first column, and I've used the following command:

out <- split(df, df$v1)

The desired output should be:

out
$a1bg
[1] a1bg crisp3

$a1cf
[1] a1cf apobec1 cugbp2 khsrp

However, the process using split takes a very long time because the file is big (around 200,000 rows).

Many thanks

To speed this up, if you only need df$v2 split apart on the basis of df$v1, pass that vector in the call to split rather than the entire data frame df. E.g.:

## dummy data
df <- read.table(text = "v1       v2
a1bg     a1bg
a1bg    crisp3
a1cf     a1cf
a1cf   apobec1
a1cf    cugbp2
a1cf     khsrp", header = TRUE)

## make it big!
df <- with(df, cbind.data.frame(v1 = rep(v1, length.out = 1e5),
                                v2 = rep(v2, length.out = 1e5)))

## time both approaches
system.time(sp1 <- split(df, df$v1))
system.time(sp2 <- split(df$v2, df$v1))

> system.time(sp1 <- split(df, df$v1))
   user  system elapsed
  0.024   0.000   0.016
> system.time(sp2 <- split(df$v2, df$v1))
   user  system elapsed
  0.008   0.000   0.005

This is on an example with only a few levels though. With many levels, the inefficiency of splitting the entire data frame starts to weigh heavily on compute time, e.g. with a factor of around 10000 levels:

df2 <- data.frame(v1 = factor(sample(10000, 1e5, replace = TRUE)),
                  v2 = rnorm(1e5))

system.time(sp3 <- split(df2, df2$v1))
system.time(sp4 <- split(df2$v2, df2$v1))

> system.time(sp3 <- split(df2, df2$v1))
   user  system elapsed
  5.332   0.000   4.216
> system.time(sp4 <- split(df2$v2, df2$v1))
   user  system elapsed
  0.008   0.000   0.005
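As a sanity check, the two approaches can be compared to confirm that splitting only the vector loses nothing if v2 is all you need. The sketch below uses a smaller data set than the timings above so it runs instantly; the seed, the sizes, and the names sp_df, sp_vec and from_df are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Smaller stand-in for df2: 1000 rows, up to 100 levels.
df2 <- data.frame(v1 = factor(sample(100, 1e3, replace = TRUE)),
                  v2 = rnorm(1e3))

sp_df  <- split(df2, df2$v1)     # list of data frames
sp_vec <- split(df2$v2, df2$v1)  # list of plain numeric vectors

# Pull the v2 column out of each per-level data frame ...
from_df <- lapply(sp_df, function(d) d$v2)

# ... and compare it with the result of splitting the vector directly.
same <- isTRUE(all.equal(from_df, sp_vec))
print(same)
```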

The reason is that in the split(df, df$v1) case, the split.data.frame method is called. It runs lapply() over the vector 1:nrow(df) split into groups by f (here df$v1), and applies the function (function(ind) x[ind, , drop = FALSE]) to each component. Hence, as the number of levels grows large, the number of calls to that anonymous function grows with it and inflates the compute time.
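The mechanism described above can be reproduced by hand. This is a sketch of the idea rather than the actual source of split.data.frame; the names idx and manual are made up for illustration, and the data are the six example rows from the question.

```r
# Example rows from the question.
df <- data.frame(
  v1 = c("a1bg", "a1bg", "a1cf", "a1cf", "a1cf", "a1cf"),
  v2 = c("a1bg", "crisp3", "a1cf", "apobec1", "cugbp2", "khsrp"),
  stringsAsFactors = TRUE
)

# Step 1: split the row indices (not the data) by the grouping column.
idx <- split(seq_len(nrow(df)), df$v1)

# Step 2: subset the data frame once per group. This is one function call
# and one `[` call per level, which is what gets slow with many levels.
manual <- lapply(idx, function(ind) df[ind, , drop = FALSE])

# The hand-rolled version should agree with the built-in method.
agrees <- isTRUE(all.equal(manual, split(df, df$v1)))
print(agrees)
```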

In the split(df$v2, df$v1) case the split.default method is used, which, if called with a factor f, only needs to call a fast C implementation of split. As such it doesn't incur any of the overhead of calling the anonymous function, nor the repeated calls to [.
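Applied to the example rows from the question, the vector-based call gives a named list of character vectors, which matches the shape of the desired output. A minimal sketch, with the data typed in by hand:

```r
# Example rows from the question, kept as character strings.
df <- data.frame(
  v1 = c("a1bg", "a1bg", "a1cf", "a1cf", "a1cf", "a1cf"),
  v2 = c("a1bg", "crisp3", "a1cf", "apobec1", "cugbp2", "khsrp"),
  stringsAsFactors = FALSE
)

# Split only the second column, grouped by the first.
out <- split(df$v2, df$v1)

names(out)  # "a1bg" "a1cf"
out$a1bg    # "a1bg" "crisp3"
out$a1cf    # "a1cf" "apobec1" "cugbp2" "khsrp"
```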
