python (12.9k questions)
javascript (9.2k questions)
reactjs (4.7k questions)
java (4.2k questions)
c# (3.5k questions)
html (3.3k questions)
What is the most efficient way to handle integer multiplication overflow with saturation with ARM Neon intrinsics?
I have the following multiplication between two 16-bit vectors:
int16x8_t dx;
int16x8_t dy;
int16x8_t dxdy = vmulq_s16(dx, dy);
If dx and dy are both large enough, the result will overflow.
I woul...

Elad Maimoni
Votes: 0
Answers: 1
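One common way to get saturation here, offered as a sketch and not as a claim about the most efficient option, is to widen the product to 32 bits and then narrow with saturation. The helper names `sat_mul_s16` and `sat_mul_s16x8` are hypothetical; the NEON path uses `vmull_s16` and `vqmovn_s32`, and a scalar fallback is included so the block also builds on non-ARM targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: multiply in 32 bits, then clamp to the int16_t range. */
static int16_t sat_mul_s16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    if (p > INT16_MAX) return INT16_MAX;
    if (p < INT16_MIN) return INT16_MIN;
    return (int16_t)p;
}

/* Hypothetical helper: saturating multiply of 8 lanes stored in arrays. */
static void sat_mul_s16x8(const int16_t dx[8], const int16_t dy[8],
                          int16_t out[8])
{
#if defined(__ARM_NEON)
    int16x8_t vx = vld1q_s16(dx);
    int16x8_t vy = vld1q_s16(dy);
    /* Widening multiply: each half yields four exact 32-bit products. */
    int32x4_t lo = vmull_s16(vget_low_s16(vx), vget_low_s16(vy));
    int32x4_t hi = vmull_s16(vget_high_s16(vx), vget_high_s16(vy));
    /* Saturating narrow back to 16 bits, then rejoin the halves. */
    vst1q_s16(out, vcombine_s16(vqmovn_s32(lo), vqmovn_s32(hi)));
#else
    for (int i = 0; i < 8; ++i)
        out[i] = sat_mul_s16(dx[i], dy[i]);
#endif
}
```

The trade-off is two widening multiplies and two narrowing moves per 8 lanes in place of a single `vmulq_s16`.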
ARMv7 NEON: Unpack 32 bit mask to 64 bit mask
I have a 32-bit NEON mask that I need to unpack to 64 bits like so:
uint32x4_t mask = { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 };
uint64x2_t mask_lo = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
uint64x2_t mask...
simonlet
Votes: 0
Answers: 1
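A minimal sketch of one way to widen such a mask, assuming every 32-bit lane is either all ones or all zeros: sign-extending each lane to 64 bits preserves that property. The excerpt is truncated, so the second output (`hi`, covering lanes 2 and 3) is an assumption, as is the hypothetical helper name `widen_mask_u32x4`. The NEON path uses `vmovl_s32`; a scalar fallback keeps the block buildable elsewhere.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen a 4-lane 32-bit mask into two 2-lane 64-bit
   masks. lo receives lanes 0..1, hi receives lanes 2..3 (assumed layout). */
static void widen_mask_u32x4(const uint32_t mask[4],
                             uint64_t lo[2], uint64_t hi[2])
{
#if defined(__ARM_NEON)
    /* Reinterpret as signed so the widening move sign-extends each lane. */
    int32x4_t m = vreinterpretq_s32_u32(vld1q_u32(mask));
    vst1q_u64(lo, vreinterpretq_u64_s64(vmovl_s32(vget_low_s32(m))));
    vst1q_u64(hi, vreinterpretq_u64_s64(vmovl_s32(vget_high_s32(m))));
#else
    for (int i = 0; i < 2; ++i) {
        /* Scalar sign extension: 0xFFFFFFFF becomes 0xFFFFFFFFFFFFFFFF. */
        lo[i] = (uint64_t)(int64_t)(int32_t)mask[i];
        hi[i] = (uint64_t)(int64_t)(int32_t)mask[i + 2];
    }
#endif
}
```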
Organizing multiple implementations (for SIMD)
This is admittedly an open-ended/subjective question but I am looking for different ideas on how to "organize" multiple alternative implementations of the same functions.
I have a set of sev...

Matthew M.
Votes: 0
Answers: 1
transpose 8x16 matrix on AVX 512
I have an 8x16 uint32_t matrix already loaded in 8 zmm registers. They have the layout
zmmi = {ai_15, ai_14, ..., ai_1, ai_0}
where i goes from 0 to 7 and ai_j are 32-bit integers for each j from 0 to...
potuz
Votes: 0
Answers: 0