r - Purging observations from a dataset based on number of occurrences
I have a data frame in which one variable is categorical and has a large number of possible values. I'm trying to process the data frame in a way that removes every instance of the categorical variable that occurs fewer than x times.

For example, if I'm dealing with a car make variable, it may look like:

    toyota ford lexus ford acura subaru dodge ford ford lexus ... ... ...

I would like to remove the observations whose car make occurs fewer than ten times. For example, if ford, lexus, and toyota appear 30, 20, and 15 times, and all the others fewer than ten, I would remove the entries associated with those other makes.
I know that a command like

    cars.processed <- which(table(cars$make) > 10)

does produce a named integer vector identifying the classifiers that meet the required criteria, but I don't know how to move on from there.

Thanks for the help!
Let's assume df is your data.frame, x is the column in question, and thr is the threshold:

    thr <- 3
    # names of the values that occur more than thr times
    keep <- names(which(table(df$x) > thr))
    # keep only the rows whose x is one of those values
    df <- df[df$x %in% keep, ]

    # optionally, drop unused levels (only applies if x is a factor)
    df$x <- droplevels(df$x)

Here is a data.table solution as well:

    library(data.table)

    dt <- data.table(df)
    dt[x %in% names(which(table(x) > thr))]

Or, if you don't mind reordering the rows according to x, it gets more succinct:

    dt <- data.table(df, key = "x")
    dt[.(names(which(table(x) > thr)))]
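
To see the base R approach end to end, here is a minimal sketch on made-up data. The `cars` data frame and the `drop_rare()` helper are assumptions for illustration only, not something from base R or from the question. It uses `>=` rather than `>`, so values that occur exactly `min_n` times are kept, which matches the "fewer than ten" wording in the question.

```r
# Toy data: 'cars' and its columns are invented for this example.
cars <- data.frame(
  make = c(rep("ford", 4), rep("lexus", 2), "toyota", "acura", "subaru", "dodge"),
  id   = 1:10,
  stringsAsFactors = TRUE
)

# drop_rare() is a hypothetical helper.
# It keeps the rows whose value in column `col` occurs at least `min_n` times.
drop_rare <- function(df, col, min_n) {
  counts <- table(df[[col]])                    # occurrences of each value
  keep   <- names(counts)[counts >= min_n]      # values that are frequent enough
  out    <- df[df[[col]] %in% keep, , drop = FALSE]
  # Remove factor levels that no longer appear in the data.
  if (is.factor(out[[col]])) out[[col]] <- droplevels(out[[col]])
  out
}

cars_common <- drop_rare(cars, "make", min_n = 2)
table(cars_common$make)
```

With this toy data only the ford and lexus rows should remain, since every other make appears once. For the original question the call would be along the lines of `drop_rare(cars, "make", min_n = 10)`.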