Only Coders - Where knowledge meets opportunity

python (12.9k questions)

javascript (9.2k questions)

reactjs (4.7k questions)

java (4.2k questions)

c# (3.5k questions)

html (3.3k questions)

Questions - intrinsics

What is the most efficient way to handle integer multiplication overflow with saturation with ARM Neon intrinsics?

I have the following multiplication between two 16-bit vectors: int16x8_t dx; int16x8_t dy; int16x8_t dxdy = vmulq_s16(dx, dy); If dx and dy are both large enough, the result will overflow. I woul...

Elad Maimoni

arm

simd

intrinsics

neon

saturation-arithmetic

Votes: 0

Answers: 1

Latest Answer

Here’s another version. It does pretty much the same as your code but uses fewer instructions; for example, NEON has a widening multiplication. I’m not sure if it’s faster or slower (apparently there’...

Soonts
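The answer above is truncated, so its actual code is not shown here. As a rough illustration of the idea it mentions (widening multiplication), here is a minimal sketch: multiply 16-bit lanes into 32-bit lanes, then narrow back with saturation. The function names mul_sat_s16 and mul_sat_s16_scalar are hypothetical, and this is not necessarily what the answer does. The NEON path is only compiled on ARM targets; the scalar function shows the same per-lane semantics on any platform.

```cpp
#include <algorithm>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: saturating 16-bit multiply on 8 lanes.
// vmull_s16 widens each 16x16 product to 32 bits, so nothing overflows;
// vqmovn_s32 then narrows back to 16 bits with signed saturation.
inline int16x8_t mul_sat_s16(int16x8_t a, int16x8_t b) {
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b));
    return vcombine_s16(vqmovn_s32(lo), vqmovn_s32(hi));
}
#endif

// Scalar reference with the same per-lane behaviour (hypothetical name).
inline std::int16_t mul_sat_s16_scalar(std::int16_t a, std::int16_t b) {
    // The product of two int16 values always fits in int32.
    std::int32_t wide = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    // Clamp to the int16 range instead of wrapping around.
    wide = std::max<std::int32_t>(wide, INT16_MIN);
    wide = std::min<std::int32_t>(wide, INT16_MAX);
    return static_cast<std::int16_t>(wide);
}
```

Whether this beats other approaches likely depends on the target core, so it would need to be measured.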

ARMv7 NEON: Unpack 32 bit mask to 64 bit mask

I have a 32-bit NEON mask that I need to unpack to 64 bits like so: uint32x4_t mask = { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 }; uint64x2_t mask_lo = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF }; uint64x2_t mask...

simonlet

c++

arm

simd

intrinsics

neon

Votes: 0

Answers: 1

Latest Answer

It’s unclear why you want 0xFFFFFFFF to unpack into 0xFFFFFFFFFFFFFFFF. If you want sign extension, use the reinterpret intrinsics and vmovl_s32 for the unpacking. This will unpack 0x80000000 into 0xFFFFF...

Soonts
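The sign-extension approach suggested in the answer could look roughly like the sketch below. The names unpack_mask and widen_mask_lane are hypothetical, as is the choice to return the two halves through reference parameters. The scalar helper shows what happens to a single lane and works on any platform; the NEON path is only compiled on ARM targets.

```cpp
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: sign-extend each 32-bit mask lane to 64 bits.
// Lanes 0..1 go to `lo`, lanes 2..3 go to `hi`.
inline void unpack_mask(uint32x4_t mask, uint64x2_t& lo, uint64x2_t& hi) {
    // Reinterpret as signed so that vmovl_s32 sign-extends.
    int32x4_t s = vreinterpretq_s32_u32(mask);
    lo = vreinterpretq_u64_s64(vmovl_s32(vget_low_s32(s)));
    hi = vreinterpretq_u64_s64(vmovl_s32(vget_high_s32(s)));
}
#endif

// Scalar equivalent for one lane (hypothetical name).
// Assumes two's complement conversion, which holds on mainstream
// compilers and is guaranteed from C++20 onwards.
inline std::uint64_t widen_mask_lane(std::uint32_t lane) {
    std::int32_t as_signed = static_cast<std::int32_t>(lane);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_signed));
}
```

For all-ones and all-zeros lanes this yields all-ones and all-zeros 64-bit lanes. Any other value with the top bit set is sign-extended rather than duplicated, which matters if the mask is not strictly all-or-nothing.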

Organizing multiple implementations (for SIMD)

This is admittedly an open-ended/subjective question but I am looking for different ideas on how to "organize" multiple alternative implementations of the same functions. I have a set of sev...

Matthew M.

c++

simd

intrinsics

instruction-set

Votes: 0

Answers: 1

Latest Answer

The following are just some ideas that I came up with while thinking about it; there might be better solutions that I'm not aware of. 1. Tag-Dispatch: Using Tag-Dispatch, you can define an order in wh...

Turtlefight
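The answer's own code is cut off, so the following is only a minimal sketch of one common form of tag dispatch, assuming the goal is to pick the best available implementation at compile time. All names here (priority_tag, kHasAvx2, kHasSse2, sum_impl, sum) are hypothetical, and the "SIMD" overloads are scalar stand-ins so the example runs anywhere.

```cpp
#include <cstddef>
#include <type_traits>

// priority_tag<N> derives from priority_tag<N-1>, so overload resolution
// prefers the overload taking the highest tag that is still viable.
template <int N> struct priority_tag : priority_tag<N - 1> {};
template <> struct priority_tag<0> {};

// Hypothetical compile-time feature flags derived from compiler macros.
#if defined(__AVX2__)
constexpr bool kHasAvx2 = true;
#else
constexpr bool kHasAvx2 = false;
#endif

#if defined(__SSE2__) || defined(_M_X64)
constexpr bool kHasSse2 = true;
#else
constexpr bool kHasSse2 = false;
#endif

// Shared scalar loop, used here as a stand-in for the real kernels.
inline long scalar_sum(const int* p, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

// Highest priority: only participates in overload resolution with AVX2.
template <bool On = kHasAvx2, typename std::enable_if<On, int>::type = 0>
long sum_impl(const int* p, std::size_t n, priority_tag<2>) {
    return scalar_sum(p, n);  // a real AVX2 kernel would go here
}

// Middle priority: only participates with SSE2.
template <bool On = kHasSse2, typename std::enable_if<On, int>::type = 0>
long sum_impl(const int* p, std::size_t n, priority_tag<1>) {
    return scalar_sum(p, n);  // a real SSE2 kernel would go here
}

// Lowest priority: always-available portable fallback.
inline long sum_impl(const int* p, std::size_t n, priority_tag<0>) {
    return scalar_sum(p, n);
}

// Public entry point: start at the highest tag and let the compiler
// fall through to the best overload that is enabled.
inline long sum(const int* p, std::size_t n) {
    return sum_impl(p, n, priority_tag<2>{});
}
```

This selects an implementation at compile time only. Choosing at run time (for example via CPUID) would need a different mechanism, such as function pointers.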

transpose 8x16 matrix on AVX 512

I have an 8x16 uint32_t matrix already loaded in 8 zmm registers. They have the layout zmmi = {ai_15, ai_14, ..., ai_1, ai_0}, where i goes from 0 to 7 and the ai_j are 32-bit integers for each j from 0 to...

potuz

assembly

intrinsics

avx512

Votes: 0

Answers: 0
