c# - Why does my application spend 24% of its life doing a null check? -


i've got performance critical binary decision tree, , i'd focus question on single line of code. code binary tree iterator below results running performance analysis against it.

        public sctreenode getnodeforstate(int rootindex, float[] inputs)         { 0.2%        sctreenode node = rootnodes[rootindex].treenode;  24.6%       while (node.branchdata != null)             { 0.2%            branchnodedata b = node.branchdata; 0.5%            node = b.child2; 12.8%           if (inputs[b.splitinputindex] <= b.splitvalue) 0.8%                node = b.child1;             }  0.4%        return node;         } 

branchdata field, not property. did prevent risk of not being inlined.

the branchnodedata class follows:

public sealed class branchnodedata {     /// <summary>     /// index of data item in input array on need split     /// </summary>     internal int splitinputindex = 0;      /// <summary>     /// value should split on     /// </summary>     internal float splitvalue = 0;      /// <summary>     /// nodes children     /// </summary>     internal sctreenode child1;     internal sctreenode child2; } 

as can see, while loop / null check massive hit on performance. tree massive, expect searching leaf take while, i'd understand disproportionate amount of time spent on 1 line.

i've tried:

  • separating null check while - it's null check that's hit.
  • adding boolean field object , checking against that, made no difference. doesn't matter what's being compared, it's comparison that's issue.

is branch prediction issue? if so, can it? if anything?

i won't pretend understand cil, i'll post can try scrape information it.

.method public hidebysig instance class optimaltreesearch.sctreenode getnodeforstate (     int32 rootindex,     float32[] inputs ) cil managed {     // method begins @ rva 0x2dc8     // code size 67 (0x43)     .maxstack 2     .locals init (         [0] class optimaltreesearch.sctreenode node,         [1] class optimaltreesearch.branchnodedata b     )      il_0000: ldarg.0     il_0001: ldfld class [mscorlib]system.collections.generic.list`1<class optimaltreesearch.scrootnode> optimaltreesearch.scsearchtree::rootnodes     il_0006: ldarg.1     il_0007: callvirt instance !0 class [mscorlib]system.collections.generic.list`1<class optimaltreesearch.scrootnode>::get_item(int32)     il_000c: ldfld class optimaltreesearch.sctreenode optimaltreesearch.scrootnode::treenode     il_0011: stloc.0     il_0012: br.s il_0039     // loop start (head: il_0039)         il_0014: ldloc.0         il_0015: ldfld class optimaltreesearch.branchnodedata optimaltreesearch.sctreenode::branchdata         il_001a: stloc.1         il_001b: ldloc.1         il_001c: ldfld class optimaltreesearch.sctreenode optimaltreesearch.branchnodedata::child2         il_0021: stloc.0         il_0022: ldarg.2         il_0023: ldloc.1         il_0024: ldfld int32 optimaltreesearch.branchnodedata::splitinputindex         il_0029: ldelem.r4         il_002a: ldloc.1         il_002b: ldfld float32 optimaltreesearch.branchnodedata::splitvalue         il_0030: bgt.un.s il_0039          il_0032: ldloc.1         il_0033: ldfld class optimaltreesearch.sctreenode optimaltreesearch.branchnodedata::child1         il_0038: stloc.0          il_0039: ldloc.0         il_003a: ldfld class optimaltreesearch.branchnodedata optimaltreesearch.sctreenode::branchdata         il_003f: brtrue.s il_0014     // end loop      il_0041: ldloc.0     il_0042: ret } // end of method scsearchtree::getnodeforstate 

edit: decided branch prediction test, added identical if within while, have

while (node.branchdata != null) 

and

if (node.branchdata != null) 

inside that. ran performance analysis against that, , took 6 times longer execute first comparison did execute second comparison returned true. looks indeed branch prediction issue - , i'm guessing there's nothing can it?!

another edit

the above result occur if node.branchdata had loaded ram while check - cached if statement.


this third question on similar topic. time i'm focusing on single line of code. other questions on subject are:

the tree massive

by far expensive thing processor ever not executing instructions, accessing memory. execution core of modern cpu many times faster memory bus. problem related distance, further electrical signal has travel, harder gets signal delivered other end of wire without being corrupted. cure problem make go slower. big problem wires connect cpu ram in machine, can pop case , see wires.

processors have counter-measure problem, use caches, buffers store copy of bytes in ram. important 1 l1 cache, typically 16 kilobytes data , 16 kilobytes instructions. small, allowing close execution engine. reading bytes l1 cache typically takes 2 or 3 cpu cycles. next l2 cache, bigger , slower. upscale processors have l3 cache, bigger , slower yet. process technology improves, buffers take less space , automatically becomes faster closer core, big reason why newer processors better , how manage use ever increasing number of transistors.

those caches not perfect solution. processor still stall on memory access if data not available in 1 of caches. cannot continue until slow memory bus has supplied data. losing fat hundred cpu cycles possible on single instruction.

tree structures problem, not cache friendly. nodes tend scattered throughout address space. fastest way access memory reading sequential addresses. unit of storage l1 cache 64 bytes. or in other words, once processor reads one byte, next 63 fast since they'll present in cache.

which makes array far efficient data structure. reason .net list<> class isn't list @ all, uses array storage. same other collection types, dictionary, structurally not remotely similar array, internally implemented arrays.

so while() statement suffering cpu stalls because dereferencing pointer access branchdata field. next statement cheap because while() statement did heavy lifting of retrieving value memory. assigning local variable cheap, processor uses buffer writes.

not otherwise simple problem solve, flattening tree arrays unpractical. not in least because typically cannot predict in order nodes of tree visited. red-black tree might help, isn't clear question. simple conclusion draw is running fast can hope for. , if need go faster you'll need better hardware faster memory bus. ddr4 going mainstream year.


Comments

Popular posts from this blog

Change php variable from jquery value using ajax (same page) -

Pull out data related to my apps from Android Play Store and iOS App Store -

How can I fetch data from a web server in an android application? -