r - Purging observations from a dataset based on number of occurrences
I have a data frame in which one variable is categorical and has a large number of possible values. I'm trying to process the data frame in a way that removes every instance of the categorical variable that occurs fewer than x times.
For example, if I'm dealing with car makes as a variable, it may look like:
toyota ford lexus ford acura subaru dodge ford ford lexus ... ... ...
I want to remove the observations whose car make occurs fewer than ten times. For example, if Ford, Lexus, and Toyota appear 30, 20, and 15 times, and all the others appear fewer than ten times, I would remove the entries associated with those other makes.
I know that a command like
cars.processed <- which(table(cars$make) > 10)
does produce a named integer vector identifying the classifiers that meet the required criterion, but I don't know how to move on from there.
Thanks for the help!
Let's assume df is your data.frame, x is the column in question, and thr is your threshold:
thr <- 3
keep <- names(which(table(df$x) > thr))
df <- df[df$x %in% keep, ]
# optionally, drop levels (x must be a factor)
df$x <- droplevels(df$x)
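To make the base R approach concrete, here is a minimal sketch on made-up data. The values in `df`, the extra `id` column, and the choice `thr <- 2` are assumptions for illustration only. Note that it uses `>=` so that values occurring exactly `thr` times are kept, which matches "remove anything occurring fewer than `thr` times"; with `>` those values would be dropped as well.

```r
# Hypothetical toy data: "ford" occurs 4 times, "lexus" 2, "toyota" 1
df <- data.frame(
  x  = factor(c("ford", "lexus", "ford", "toyota", "ford", "lexus", "ford")),
  id = 1:7
)
thr <- 2  # keep values that occur at least `thr` times

counts <- table(df$x)                  # occurrences per level of x
keep   <- names(counts)[counts >= thr] # levels that are frequent enough
res    <- df[df$x %in% keep, , drop = FALSE]

# Drop the now-empty levels so they no longer show up in table()/summary()
res$x <- droplevels(res$x)

table(res$x)
```

Here the single "toyota" row is removed and the remaining six rows are kept; after `droplevels()`, "toyota" is no longer listed as a level.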
Here is a data.table solution as well:
library(data.table)
dt <- data.table(df)
dt[x %in% names(which(table(x) > thr))]
Or, if you don't mind reordering the rows according to x, it gets more succinct:
dt <- data.table(df, key = "x")
dt[.(names(which(table(x) > thr)))]