r - Purging observations from a dataset based on number of occurrences -

August 15, 2012

I have a data frame in which one variable is categorical and has a large number of possible values. I'm trying to process the data frame in a way that removes every instance of the categorical variable that occurs fewer than x times.

For example, if I'm dealing with car makes as the variable, it may look like:

toyota ford lexus ford acura subaru dodge ford ford lexus ... ... ...

I want to remove the observations in which the car make occurs fewer than ten times. For example, if Ford, Lexus, and Toyota appear 30, 20, and 15 times, and all the others fewer than ten, I want to remove the entries associated with the other makes.

I know that a command like

cars.processed <- which(table(cars$make) > 10)

does produce a named integer vector of the makes that meet the required criterion, but I don't know how to move on from there.

Thanks for the help!

Let's assume df is your data.frame, x is the column in question, and thr is the threshold:

thr <- 3
keep <- names(which(table(df$x) > thr))
df <- df[df$x %in% keep, ]
# optionally, drop levels
df$x <- droplevels(df$x)
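To see the steps above end to end, here is a minimal sketch on made-up data. The `cars` data frame, its contents, and the threshold value are assumptions for illustration only. It also uses `>=` instead of `>`, on the assumption that "fewer than `thr` times" means a make occurring exactly `thr` times should be kept.

```r
# Hypothetical toy data standing in for the real cars data frame.
cars <- data.frame(
  make = c(rep("ford", 4), rep("lexus", 3), "toyota", "acura"),
  stringsAsFactors = TRUE
)
thr <- 3  # assumed threshold: keep makes that occur at least thr times

# Count how often each make occurs.
counts <- table(cars$make)

# Names of the makes that reach the threshold.
keep <- names(counts)[counts >= thr]

# Keep only rows whose make is in that set; drop = FALSE keeps a
# one-column data frame from collapsing into a vector.
cars.processed <- cars[cars$make %in% keep, , drop = FALSE]

# Remove factor levels that no longer occur in the data.
cars.processed <- droplevels(cars.processed)

table(cars.processed$make)
```

Calling `droplevels()` on the whole data frame rather than on a single column also works when the column is a character vector, since non-factor columns are simply left alone.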

Here is a data.table solution as well:

library(data.table)
dt <- data.table(df)
dt[x %in% names(which(table(x) > thr))]

Or, if you don't mind reordering the rows according to x, it gets more succinct:

dt <- data.table(df, key = "x")
dt[.(names(which(table(x) > thr)))]
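As a possible variation that is not part of the answer above, the counting can also be done by data.table itself with the `.N` symbol instead of `table()`. The toy data below is made up for illustration, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data; the column name x and the threshold thr
# follow the naming used in the answer.
df <- data.frame(
  x  = c("ford", "ford", "ford", "ford", "lexus", "lexus", "toyota"),
  id = 1:7
)
thr <- 3

dt <- as.data.table(df)

# For each value of x, return the group's rows (.SD) only when the
# group size (.N) exceeds the threshold; smaller groups return NULL
# and are dropped from the result.
kept <- dt[, if (.N > thr) .SD, by = x]

kept
```

The result comes back grouped by x, with x as the first column, so the original row order is not guaranteed to be preserved.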
