Converting ARM code to use NEON intrinsics

February 15, 2010

I have been trying to modify the code below to work with NEON intrinsics, hoping to get a speedup. Unfortunately, nothing seems to work correctly. Does anyone have an idea what is going wrong? I have already changed the doubles to single-precision floating-point elements.

    #include <math.h>   /* fabsf */

    typedef float real;
    typedef real  vec3[3];

    typedef struct driehoek {
        vec3  norm;        /* face normal. */
        real  d;           /* plane equation d. */
        vec3  *vptr;       /* global vertex list pointer. */
        vec3  *nptr;       /* global normal list pointer. */
        int   vindex[3];   /* index of vertices. */
        int   indx;        /* normal component max flag. */
        bool  norminterp;  /* normal interpolation? */
        bool  vorder;      /* vertex order orientation. */
    } driehoek;

    typedef struct element {
        int           index;
        struct object *parent;  /* ptr to parent object. */
        char          *data;    /* pointer to data info. */
        bbox          bv;       /* element bounding volume. */
    } element;

    int triangleintersection(ray *pr, element *pe, irecord *hit)
    {
        float    rd_dot_pn;      /* polygon normal dot ray direction. */
        float    ro_dot_pn;      /* polygon normal dot ray origin. */
        float    q1, q2;
        float    tval;           /* intersection t distance value. */
        vec3     *v1, *v2, *v3;  /* vertex list pointers. */
        vec3     e1, e2, e3;     /* edge vectors. */
        driehoek *pt;            /* ptr to triangle data. */

        pt = (driehoek *)pe->data;

        rd_dot_pn = vecdot(pt->norm, pr->d);

        /* fabsf rather than abs: in C, abs() takes an int, so it would
           truncate the float toward zero and small non-zero values could
           be reported as parallel. */
        if (fabsf(rd_dot_pn) < rayeps)   /* ray parallel. */
            return (0);

        /* ... the middle of the routine is missing from the original post ... */

            hit->b3 = e1[0] * (q2 - (*v1)[1]) - e1[1] * (q1 - (*v1)[0]);
            if (!inside(hit->b3, pt->norm[2]))
                return (0);
            break;
        }

        return (1);
    }

An array like float vec[3] is not enough of a hint for the compiler to use NEON instructions. The issue is that float vec[3] has each element individually addressable, so the compiler must keep each one in its own floating-point register. See the GCC NEON intrinsics documentation.
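A minimal sketch of the alternative, assuming a GCC or Clang toolchain that provides <arm_neon.h>: each 3-vector is stored in a four-lane float32x4_t with the unused w lane set to zero, so subtraction and the dot product operate on one quad register instead of three separate floats. The names vec4, vec4_load3, vec4_sub and vec4_dot3 are hypothetical, not part of any library; on non-NEON targets the code falls back to plain C so it still compiles.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
typedef float32x4_t vec4;              /* one quad register: x, y, z, w */
#else
#define HAVE_NEON 0
typedef struct { float v[4]; } vec4;   /* portable stand-in for non-NEON builds */
#endif

/* Load a 3-component vector into four lanes, padding w with 0 so that
   it never contributes to a dot product. */
static vec4 vec4_load3(const float p[3])
{
    assert(p != NULL);
#if HAVE_NEON
    float tmp[4] = { p[0], p[1], p[2], 0.0f };
    return vld1q_f32(tmp);
#else
    vec4 r = { { p[0], p[1], p[2], 0.0f } };
    return r;
#endif
}

/* Lane-wise a - b, e.g. an edge vector e1 = v2 - v1. */
static vec4 vec4_sub(vec4 a, vec4 b)
{
#if HAVE_NEON
    return vsubq_f32(a, b);
#else
    vec4 r = { { a.v[0] - b.v[0], a.v[1] - b.v[1],
                 a.v[2] - b.v[2], a.v[3] - b.v[3] } };
    return r;
#endif
}

/* Dot product of the x, y, z lanes (w is assumed to be 0). */
static float vec4_dot3(vec4 a, vec4 b)
{
#if HAVE_NEON
    float32x4_t prod = vmulq_f32(a, b);              /* x*x', y*y', z*z', w*w' */
    float32x2_t sum  = vadd_f32(vget_low_f32(prod),  /* (x+z, y+w) */
                                vget_high_f32(prod));
    sum = vpadd_f32(sum, sum);                       /* (x+z)+(y+w) in both lanes */
    return vget_lane_f32(sum, 0);
#else
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2];
#endif
}
```

For example, with v1 = {1, 1, 1} and v2 = {2, 3, 4}, vec4_dot3(vec4_sub(vec4_load3(v2), vec4_load3(v1)), vec4_load3(d)) computes e1 . d without leaving vector registers until the final lane extract. On AArch64 the three reduction steps could be replaced by vaddvq_f32(prod); the version above also works on 32-bit ARMv7. The horizontal add is the price of this array-of-structures layout.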

Although three dimensions are common in our universe, our friends the computers are binary. For single-precision floats you have two data types that can be used with NEON intrinsics: float32x4_t and float32x2_t. You need to use intrinsics such as vfmaq_f32, vsubq_f32, etc. These intrinsics differ between compilers; I guess you are using GCC. You should use the intrinsic data types, because combining a float32x2_t with a single float can result in movement between register types, which is expensive. If your algorithm can treat each dimension separately, you might be able to combine the types. However, I don't think you have register pressure, and the SIMD speed-up should be beneficial. I would keep everything in float32x4_t to begin with. You may be able to use the spare fourth lane for the 3D projection when it comes to the rendering phase.
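One way to read "treat each dimension separately" is a structure-of-arrays layout, sketched below under that assumption: each float32x4_t holds the same component (x, y or z) of four different triangle normals, so one ray direction is dotted with four normals using only multiply and multiply-accumulate, with no horizontal adds. The tri4_normals type and tri4_rd_dot_pn function are hypothetical. vfmaq_f32 is used only when the compiler reports FMA support (__ARM_FEATURE_FMA, e.g. VFPv4 on 32-bit ARM or any AArch64 target); otherwise the sketch falls back to the non-fused vmlaq_n_f32, and to plain C on non-NEON targets.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical SoA layout: lane i of each array belongs to triangle i. */
typedef struct {
    float nx[4], ny[4], nz[4];
} tri4_normals;

/* For four triangles at once: dots[i] = norm_i . dir, and
   parallel[i] = 1 when |dots[i]| < eps (ray parallel to triangle i). */
static void tri4_rd_dot_pn(const tri4_normals *t, const float dir[3],
                           float eps, float dots[4], int parallel[4])
{
    assert(t != NULL && dir != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t nx = vld1q_f32(t->nx);
    float32x4_t ny = vld1q_f32(t->ny);
    float32x4_t nz = vld1q_f32(t->nz);

    float32x4_t d = vmulq_n_f32(nx, dir[0]);        /* nx * dx */
#if defined(__ARM_FEATURE_FMA)
    d = vfmaq_f32(d, ny, vdupq_n_f32(dir[1]));      /* + ny * dy, fused */
    d = vfmaq_f32(d, nz, vdupq_n_f32(dir[2]));      /* + nz * dz, fused */
#else
    d = vmlaq_n_f32(d, ny, dir[1]);                 /* + ny * dy */
    d = vmlaq_n_f32(d, nz, dir[2]);                 /* + nz * dz */
#endif
    /* All-ones lanes where |d| < eps, zero elsewhere. */
    uint32x4_t mask = vcltq_f32(vabsq_f32(d), vdupq_n_f32(eps));
    uint32_t m[4];
    vst1q_f32(dots, d);
    vst1q_u32(m, mask);
    for (int i = 0; i < 4; ++i)
        parallel[i] = (m[i] != 0);
#else
    for (int i = 0; i < 4; ++i) {
        dots[i] = t->nx[i] * dir[0] + t->ny[i] * dir[1] + t->nz[i] * dir[2];
        parallel[i] = fabsf(dots[i]) < eps;
    }
#endif
}
```

Because each lane carries a different triangle, the per-lane result is available without any cross-lane reduction. In a full vectorized intersection loop the comparison mask would more likely stay in a register and feed vbslq_f32 than be stored to memory; it is copied out here only to keep the interface simple.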

Here is the source of a cmath library called math-neon, released under the LGPL. Instead of using GCC intrinsics, it uses inline assembler. (See also: NEON intrinsics vs. assembly.)

See also: armcc NEON intrinsics, if you are using the ARM compiler.
