Only Coders - Where knowledge meets opportunity

Questions - neon

ARM NEON: Convert a binary 8-bit-per-pixel image (only 0/1) to 1-bit-per-pixel?

I am working on a task to convert a large binary label image, stored at 8 bits (uint8_t) per pixel where each pixel can only be 0 or 1 (or 255), into an array of uint64_t numbers, with each bit in a uint64_t ...

debug_all_the_time

arm

neon

Votes: 0

Answers: 3

Latest Answer

Assuming the input value is either 0 or 255, below is the basic version which is rather straightforward, especially for people with Intel SSE/AVX experience. void foo_basic(uint8_t *pDst, uint8_t *pSr...

Jake 'Alquimista' LEE
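
For readers who want a baseline to compare a NEON version against, here is a minimal portable scalar sketch of the 8-bit-to-1-bit packing the question describes. It is not the answer's NEON code, which is truncated above. The function name pack_bits_scalar and the bit order (pixel i goes to bit i % 64 of word i / 64, least significant bit first) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, not the NEON routine from the answer.
 * Packs n_pixels bytes (each zero or nonzero) into 64-bit words.
 * Assumption: pixel i maps to bit (i % 64) of word (i / 64), LSB first.
 * dst must have room for (n_pixels + 63) / 64 words. */
void pack_bits_scalar(uint64_t *dst, const uint8_t *src, size_t n_pixels)
{
    size_t n_words = (n_pixels + 63) / 64;

    for (size_t w = 0; w < n_words; ++w) {
        uint64_t word = 0;
        size_t base = w * 64;
        /* The last word may cover fewer than 64 pixels. */
        size_t remaining = n_pixels - base;
        size_t count = remaining < 64 ? remaining : 64;

        for (size_t b = 0; b < count; ++b) {
            /* Any nonzero byte (1 or 255) counts as a set pixel. */
            word |= (uint64_t)(src[base + b] != 0) << b;
        }
        dst[w] = word;
    }
}
```

A vectorized version can be checked against this routine on random input, provided both use the same bit order.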

How to load vector registers from integer registers in Arm64? (M1)

This is a question about SIMD instructions on AArch64 on an M1. I am working on a routine that works entirely inside the registers. All the memory reads and writes occur outside of the main loop. The ...

JON-ERIK STORM

assembly

arm64

neon

Votes: 0

Answers: 1

Latest Answer

Already answered in comments by Peter Cordes, just promoting to an answer: You want the ins instruction. It moves a general-purpose register into a specified element of a vector register, leaving oth...

Nate Eldredge

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of...

TrentP

gcc

simd

sse

neon

Votes: 0

Answers: 3

Latest Answer

As commented by chtz, the most generic and typical method is to have another mask to gather indices: Vec8s indices = { 0,1,2,3,4,5,6,7}; Vec8s max_idx = indices; Vec8f max_abs = abs(load8(ptr)); for...

Aki Suihkonen
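
The idea in the answer, keeping a vector of indices next to the vector of running maxima, can be sketched in plain C by emulating the lanes with small arrays. This is an illustration only, not the answer's code: the name max_abs_index, the lane count of 8, the requirement that n be a nonzero multiple of 8, and the lowest-index tie-break are all assumptions.

```c
#include <stddef.h>

#define LANES 8

/* Hypothetical lane-by-lane emulation of a SIMD "max |x| with index" search.
 * Assumes n is a nonzero multiple of LANES. Ties resolve to the lowest index.
 * Returns the index of the largest |x[i]| and stores that value in *out_max. */
size_t max_abs_index(const float *x, size_t n, float *out_max)
{
    float  max_abs[LANES];
    size_t max_idx[LANES];

    /* Initialise each lane from the first block of input. */
    for (size_t l = 0; l < LANES; ++l) {
        max_abs[l] = x[l] < 0.0f ? -x[l] : x[l];
        max_idx[l] = l;
    }

    /* Per-lane compare and select; a SIMD version would blend with a mask. */
    for (size_t i = LANES; i < n; i += LANES) {
        for (size_t l = 0; l < LANES; ++l) {
            float a = x[i + l] < 0.0f ? -x[i + l] : x[i + l];
            if (a > max_abs[l]) {
                max_abs[l] = a;
                max_idx[l] = i + l;
            }
        }
    }

    /* Horizontal reduction across lanes, done once after the loop. */
    size_t best = 0;
    for (size_t l = 1; l < LANES; ++l) {
        if (max_abs[l] > max_abs[best] ||
            (max_abs[l] == max_abs[best] && max_idx[l] < max_idx[best])) {
            best = l;
        }
    }

    *out_max = max_abs[best];
    return max_idx[best];
}
```

The inner loop has no cross-lane dependency, which is what lets a compiler or hand-written intrinsics map it onto SSE, AVX, or NEON registers. Only the final reduction needs to look across lanes.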

What is the most efficient way to handle integer multiplication overflow with saturation with ARM Neon intrinsics?

I have the following multiplication between two 16-bit vectors: int16x8_t dx; int16x8_t dy; int16x8_t dxdy = vmulq_s16(dx, dy); If dx and dy are both large enough, the result will overflow. I woul...

Elad Maimoni

arm

simd

intrinsics

neon

saturation-arithmetic

Votes: 0

Answers: 1

Latest Answer

Here’s another version. It does pretty much the same as your code but uses fewer instructions, e.g. by relying on NEON's widening multiplication. I’m not sure if it’s faster or slower (apparently there’...

Soonts
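
The "widen, then saturate" approach the answer alludes to can be shown per element in portable C. This is a scalar sketch, not the answer's intrinsics, which are truncated above. The function names are hypothetical. In NEON terms it would plausibly correspond to a widening multiply such as vmull_s16 followed by a saturating narrow such as vqmovn_s32, but the exact sequence the answer uses is not visible here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a saturating 16-bit multiply.
 * The product of two int16_t values always fits in int32_t,
 * so the widened multiply itself cannot overflow. */
int16_t mul_sat_s16(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;

    /* Clamp to the int16_t range instead of wrapping. */
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return (int16_t)wide;
}

/* Element-wise version over arrays, mirroring what a vector loop would do. */
void mul_sat_s16_array(int16_t *dst, const int16_t *dx, const int16_t *dy,
                       size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = mul_sat_s16(dx[i], dy[i]);
    }
}
```

A scalar model like this is mainly useful as a reference when testing an intrinsics version on edge cases such as -32768 * -32768.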