Converting ARM code to use NEON intrinsics
I have been trying to modify the code below to work with NEON intrinsics, thereby creating a speedup. Unfortunately, nothing seems to work correctly. Does anyone have an idea of what is going wrong? I have already changed the doubles to single-precision floating-point elements.
typedef float real;
typedef real vec3[3];

typedef struct driehoek {
    vec3 norm;          /* face normal. */
    real d;             /* plane equation d. */
    vec3 *vptr;         /* global vertex list pointer. */
    vec3 *nptr;         /* global normal list pointer. */
    int vindex[3];      /* index of vertices. */
    int indx;           /* normal component max flag. */
    bool norminterp;    /* normal interpolation? */
    bool vorder;        /* vertex order orientation. */
} driehoek;

typedef struct element {
    int index;
    struct object *parent;  /* ptr parent object. */
    char *data;             /* pointer data info. */
    bbox bv;                /* element bounding volume. */
} element;

int triangleintersection(ray *pr, element *pe, irecord *hit)
{
    float rd_dot_pn;        /* polygon normal dot ray direction. */
    float ro_dot_pn;        /* polygon normal dot ray origin. */
    float q1, q2;
    float tval;             /* intersection t distance value. */
    vec3 *v1, *v2, *v3;     /* vertex list pointers. */
    vec3 e1, e2, e3;        /* edge vectors. */
    driehoek *pt;           /* ptr triangle data. */

    pt = (driehoek *)pe->data;
    rd_dot_pn = vecdot(pt->norm, pr->d);

    if (abs(rd_dot_pn) < rayeps)    /* ray parallel. */
        return (0);

    hit->b3 = e1[0] * (q2 - (*v1)[1]) - e1[1] * (q1 - (*v1)[0]);
    if (!inside(hit->b3, pt->norm[2]))
        return (0);
    break;
    }

    return (1);
}
An array such as float vec[3] is not enough of a hint to the compiler that NEON intrinsics can be used. The issue is that float vec[3] has each element individually addressable, so the compiler must store each one in its own floating-point register. See the GCC NEON intrinsics documentation.
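To make the layout point concrete, here is a minimal sketch of one possible alternative: padding the three components out to four floats and aligning them to 16 bytes, so that the whole vector matches the width of a 128-bit NEON register. The type name vec4 and the helper vec4_from_vec3 are hypothetical and do not appear in the original code; the zeroed fourth element is an assumption that keeps the padding harmless in sums and dot products.

```c
#include <stddef.h>

typedef float vec3[3];              /* original layout: three separate floats */

/* Hypothetical padded type (not in the original code): x, y, z plus one
 * unused element, aligned to 16 bytes to match a 128-bit NEON register. */
typedef struct {
    _Alignas(16) float v[4];
} vec4;

/* Copy a vec3 into the padded layout; the fourth element is set to 0
 * so that it contributes nothing to later sums or dot products. */
static vec4 vec4_from_vec3(const vec3 in)
{
    vec4 out = {{ in[0], in[1], in[2], 0.0f }};
    return out;
}
```

Converting once, when the vertex and normal lists are built, would avoid repeating the copy inside the intersection routine.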
Although three dimensions are common in this universe, our friends the computers prefer binary. So you have two data types that can be used with NEON intrinsics: float32x4_t and float32x2_t. You need to use intrinsics such as vfmaq_f32, vsubq_f32, etc. These intrinsics are different for each compiler; I guess you are using GCC. You should use only the intrinsic data types, because combining a float32x2_t with a single float can result in movement between register types, which is expensive. If your algorithm can treat each dimension separately, you might be able to combine types. However, I don't think you will have register pressure, and the SIMD speed-up should be beneficial. I would keep everything in float32x4_t to begin with. You may be able to use the extra dimension for the 3D projection when it comes to the rendering phase.
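As a rough illustration of keeping everything in float32x4_t, the sketch below computes a vector difference and a dot product on a padded four-float vector. It is not the asker's vecdot: the names vec4, vec4_sub and vec4_dot are hypothetical, and the fourth lane is assumed to be 0. It uses vmulq_f32 followed by pairwise adds rather than vfmaq_f32, because fused multiply-add is not available on every NEON target. The scalar branch is only there so the sketch also compiles on machines without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical padded vector: x, y, z and a fourth lane assumed to be 0. */
typedef struct {
    _Alignas(16) float v[4];
} vec4;

/* Component-wise a - b, computed in one 128-bit register when NEON exists. */
static vec4 vec4_sub(const vec4 *a, const vec4 *b)
{
    vec4 r;
#if HAVE_NEON
    vst1q_f32(r.v, vsubq_f32(vld1q_f32(a->v), vld1q_f32(b->v)));
#else
    for (size_t i = 0; i < 4; i++)
        r.v[i] = a->v[i] - b->v[i];
#endif
    return r;
}

/* Dot product over all four lanes; with a zero fourth lane this equals
 * the ordinary 3D dot product. */
static float vec4_dot(const vec4 *a, const vec4 *b)
{
#if HAVE_NEON
    float32x4_t prod = vmulq_f32(vld1q_f32(a->v), vld1q_f32(b->v));
    /* Add the high half to the low half: (p0 + p2, p1 + p3). */
    float32x2_t sum = vadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    /* Pairwise add leaves the full sum in both lanes. */
    sum = vpadd_f32(sum, sum);
    return vget_lane_f32(sum, 0);
#else
    return a->v[0] * b->v[0] + a->v[1] * b->v[1]
         + a->v[2] * b->v[2] + a->v[3] * b->v[3];
#endif
}
```

Note that vec4_dot still moves the result into a scalar float at the end, which is the kind of register transfer warned about above. In a hot loop it may be better to keep intermediate results in vector registers and extract a scalar only when a branch actually needs it.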
Here is the source of a cmath library called math-neon, released under the LGPL. Instead of using GCC intrinsics, it uses inline assembler.
See also: NEON intrinsics vs assembly, and ARMCC NEON intrinsics if you are using the ARM compiler.