r - Set a variable using colnames(), update data.table using := operator, variable is silently updated?
Well, this one's a bit strange... It seems that when creating a new column in a data.table using the := operator, a previously assigned variable (created using colnames) changes silently.
Is this expected behaviour? If not, what's at fault?
    # Let's make a simple data table
    require(data.table)
    dt <- data.table(fruit=c("apple","banana","cherry"), quantity=c(5,8,23))
    dt
    #     fruit quantity
    # 1:  apple        5
    # 2: banana        8
    # 3: cherry       23

    # ... and assign the column names to a variable
    colsdt <- colnames(dt)
    str(colsdt)
    # chr [1:2] "fruit" "quantity"

    # Now let's add a column to the data table using the := operator
    dt[, double_quantity := quantity*2]
    dt
    #     fruit quantity double_quantity
    # 1:  apple        5              10
    # 2: banana        8              16
    # 3: cherry       23              46

    # ... and, without explicitly changing 'colsdt', let's take another look:
    str(colsdt)
    # chr [1:3] "fruit" "quantity" "double_quantity"
    # ... colsdt has been silently updated!
For comparison's sake, I thought I'd see whether adding a new column via the data.frame method has the same issue. It doesn't:
    dt$triple_quantity = dt$quantity*3
    dt
    #     fruit quantity double_quantity triple_quantity
    # 1:  apple        5              10              15
    # 2: banana        8              16              24
    # 3: cherry       23              46              69

    # ... again, make no explicit changes to colsdt, and let's take a look:
    str(colsdt)
    # chr [1:3] "fruit" "quantity" "double_quantity"
    # ... and this time it is not silently updated
So is this a bug in the data.table := operator, or is it expected behaviour?
Thanks!
Short answer: use copy.

    colsdt <- copy(colnames(dt))

Then all is good:

    dt[, double_quantity := quantity*2]
    str(colsdt)
    # chr [1:2] "fruit" "quantity"
What's going on is that, in general (i.e., in base R), the assignment operator <- effectively creates a new copy of an object when you assign a modified value to it. This is true even when assigning to the same object name, as in x <- x + 1, or, a lot more costly, df$newcol <- df$a + df$b. With large objects (think 100k+ rows and dozens or hundreds of columns; the more columns, the worse it gets), this can become costly.
data.table, through pure wizardry (read: C code), avoids this overhead. Instead, it sets a pointer to the same memory location where the object's value is already stored. This is what offers the huge efficiency and speed boost.
But it also means that you often have objects that might otherwise appear to be different and independent, but that are in fact one and the same.
This is where copy comes in. It creates a new copy of an object, as opposed to passing it by reference.
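As a minimal sketch of the difference (the variable names addr_before, addr_after and dt_copy are illustrative, not from the original post), data.table's address() function can be used to check whether := moved the table in memory and whether copy() really produced a separate object:

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"),
                 quantity = c(5, 8, 23))

# Record where dt lives before and after adding a column with :=
addr_before <- address(dt)
dt[, double_quantity := quantity * 2]   # adds the column by reference
addr_after <- address(dt)

identical(addr_before, addr_after)      # compare: did := move the table?

# copy() asks for an independent object
dt_copy <- copy(dt)
identical(address(dt), address(dt_copy))

# Changes to the copy should not show up in dt
dt_copy[, quantity := NULL]
names(dt)
names(dt_copy)
```

If the addresses before and after := match, the column was added in place; the copy, living at a different address, can then be changed without affecting dt.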
Some more detail as to why this is happening.
Note: I am using the terms "source" and "destination" loosely; they refer to the assignment relationship destination <- source.
This is in fact expected behavior, although admittedly it is a bit obfuscated.
In base R, when we assign via <-, the two objects point to the same memory location until one of them changes. This way of handling memory has many benefits, namely that as long as the two objects have the exact same value, there is no need to duplicate the memory. That step is held off for as long as possible.
    a <- 1:5
    b <- a
    .Internal(inspect(a))
    # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    .Internal(inspect(b))
    # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    #  ^^^^ notice the same memory location
Once either of the two objects changes, this "bond" is broken. That is, changing either the "source" or the "destination" object will cause that object to be reassigned to a new memory location.
    a[[3]] <- a[[3]] + 1
    .Internal(inspect(a))
    # @11004bc38 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,4,4,5
    #  ^^^^ new location
    .Internal(inspect(b))
    # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
    #  ^^^^^ still the same as before; note the actual value (what `a` _had_ been)
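The problem in data.table's case is that := modifies the actual data.table object in place rather than reassigning it, so the "source" never moves to a new memory location. Notice, however, that if we modify the "destination" object, it gets moved (copied) off of that memory location.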
    colsdt <- colnames(dt)
    .Internal(inspect(colnames(dt)))
    # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
    .Internal(inspect(colsdt))
    # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
    #  ^^^^ notice the same memory location

    # an insignificant change
    colsdt[] <- colsdt
    .Internal(inspect(colsdt))
    # @100aa4a40 16 STRSXP g0c2 [NAM(1)] (len=2, tl=100)

    # We can now test the original issue from the OP:
    dt[, newcol := quantity*2]
    str(colnames(dt))
    # chr [1:3] "fruit" "quantity" "newcol"
    str(colsdt)
    # chr [1:2] "fruit" "quantity"
The situation to avoid:
However, since when working with data.table we are (almost) always modifying by reference, this can cause unexpected results. Namely, the situation where:
- we assign from a data.table object using the standard <- assignment operator
- we then subsequently change the value of the "source" data.table
- we expect (and our code might depend on) the "destination" object to still have the value previously assigned to it.
This will of course cause an issue.
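The three steps above can be sketched with a whole data.table rather than just its column names. The names alias and snapshot are made up for this illustration; the sketch assumes the same small fruit table used in the question:

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"),
                 quantity = c(5, 8, 23))

alias    <- dt        # plain assignment: another name for the same data.table
snapshot <- copy(dt)  # explicit copy: independent of dt from here on

# Change the "source" by reference
dt[, double_quantity := quantity * 2]

names(alias)     # follows dt, so it includes the new column
names(snapshot)  # still the columns that existed when the copy was taken
```

Code that expects alias to keep the old two-column layout would break here, whereas snapshot behaves the way a base R user would expect.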
data.table is an amazingly powerful package. The source of its strength is not long hair, but rather the fact that it avoids making copies whenever possible.
Best practice:
This shifts the onus onto the user to be deliberate and judicious about when to copy and when to expect a copy to have been made.
In other words, the best practice is: when you expect a copy to exist, use the copy function.
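One place where this practice tends to matter is inside functions. The helper below, with_double_quantity, is a hypothetical example rather than anything from data.table itself; it is a sketch of a function that copies its input deliberately so the caller's table is left alone:

```r
library(data.table)

# Hypothetical helper: returns a modified table without touching its input
with_double_quantity <- function(x) {
  x <- copy(x)                           # deliberate copy; := below only affects it
  x[, double_quantity := quantity * 2]
  x[]                                    # [] so the result prints when returned
}

dt <- data.table(fruit = c("apple", "banana", "cherry"),
                 quantity = c(5, 8, 23))

result <- with_double_quantity(dt)

names(dt)      # the caller's table keeps its original columns
names(result)  # the returned table carries the new column
```

Dropping the copy(x) line would make the function update the caller's dt by reference, which is sometimes what you want for speed, but it should be a conscious choice.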