Add packed double-precision (64-bit) floating-point elements in a and b.
Add packed single-precision (32-bit) floating-point elements in a and b.
Alternately add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b.
Alternately add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b.
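The alternating pattern is easy to get backwards, so here is a minimal scalar sketch of it in plain C. The helper name `addsub_model` is hypothetical (it is not an intrinsic); it assumes the usual convention that even-indexed elements are subtracted and odd-indexed elements are added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference model, not a real intrinsic:
   even-indexed elements compute a - b, odd-indexed elements compute a + b. */
static void addsub_model(const double *a, const double *b, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Usage example with four doubles, the element count of a 256-bit vector. */
static void addsub_example(void)
{
    const double a[4] = {10.0, 20.0, 30.0, 40.0};
    const double b[4] = {1.0, 2.0, 3.0, 4.0};
    double r[4];
    addsub_model(a, b, r, 4);
    assert(r[0] == 9.0 && r[1] == 22.0 && r[2] == 27.0 && r[3] == 44.0);
}
```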
Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b.
Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b.
Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b.
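Note that the NOT applies to the first operand only. A common use is clearing the sign bit to obtain an absolute value. The sketch below models one element on its raw bit pattern; `andnot_model` is a hypothetical helper, and the code assumes `double` is a 64-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one element: result bits = (~bits(a)) & bits(b).
   Assumes double is a 64-bit IEEE 754 value. */
static double andnot_model(double a, double b)
{
    uint64_t ua, ub, ur;
    double r;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = ~ua & ub;
    memcpy(&r, &ur, sizeof r);
    return r;
}

/* Usage example: -0.0 has only the sign bit set, so it acts as a sign mask. */
static void andnot_example(void)
{
    assert(andnot_model(-0.0, -3.5) == 3.5);
    assert(andnot_model(-0.0, 2.0) == 2.0);
}
```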
Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8.
Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8.
Blend packed double-precision (64-bit) floating-point elements from a and b using mask.
Blend packed single-precision (32-bit) floating-point elements from a and b using mask.
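The two blend flavours differ in where the selector comes from. The scalar sketch below assumes the usual convention: with an immediate, bit i of imm8 selects element i from b; with a vector mask, the sign bit of each mask element selects b. The helper names are hypothetical, and `float` is assumed to be a 32-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an immediate blend: bit i of imm8 picks b[i], else a[i]. */
static void blend_imm_model(const float a[8], const float b[8],
                            unsigned imm8, float dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = ((imm8 >> i) & 1u) ? b[i] : a[i];
}

/* Hypothetical model of a variable blend: the sign (top) bit of mask[i]
   picks b[i], else a[i]. Assumes float is 32-bit IEEE 754. */
static void blendv_model(const float a[8], const float b[8],
                         const float mask[8], float dst[8])
{
    for (int i = 0; i < 8; ++i) {
        uint32_t bits;
        memcpy(&bits, &mask[i], sizeof bits);
        dst[i] = (bits >> 31) ? b[i] : a[i];
    }
}

/* Usage example. */
static void blend_example(void)
{
    const float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    const float m[8] = {-1.0f, 0.0f, -0.0f, 1.0f, -1.0f, 0.0f, -0.0f, 1.0f};
    float r[8];

    blend_imm_model(a, b, 0x0Fu, r);      /* low four elements come from b */
    assert(r[0] == 10 && r[3] == 13 && r[4] == 4 && r[7] == 7);

    blendv_model(a, b, m, r);             /* -0.0f also selects b */
    assert(r[0] == 10 && r[1] == 1 && r[2] == 12 && r[3] == 3);
}
```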
Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements. This effectively duplicates the 128-bit vector.
Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements. This effectively duplicates the 128-bit vector.
Broadcast a single-precision (32-bit) floating-point element from memory to all elements.
Cast vector of type __m128d to type __m256d; the upper 128 bits of the result are undefined.
Cast vector of type __m256d to type __m128d; the upper 128 bits of a are lost.
Cast vector of type __m256d to type __m256.
Cast vector of type __m256d to type __m256i.
Cast vector of type __m128 to type __m256; the upper 128 bits of the result are undefined.
Cast vector of type __m256 to type __m128; the upper 128 bits of a are lost.
Cast vector of type __m256 to type __m256d.
Cast vector of type __m256 to type __m256i.
Cast vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined.
Cast vector of type __m256i to type __m256d.
Cast vector of type __m256i to type __m256.
Cast vector of type __m256i to type __m128i; the upper 128 bits of a are lost.
Convert packed signed 32-bit integers in a to packed double-precision (64-bit) floating-point elements.
Convert packed signed 32-bit integers in a to packed single-precision (32-bit) floating-point elements.
Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements.
Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements.
Return the lower double-precision (64-bit) floating-point element of a.
Return the lower 32-bit integer in a.
Return the lower single-precision (32-bit) floating-point element of a.
Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation.
Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation.
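"With truncation" means rounding toward zero regardless of the current rounding mode. The following scalar sketch models one element. `cvtt_model` is a hypothetical helper; returning `INT32_MIN` for NaN and out-of-range inputs mirrors the x86 "integer indefinite" value and should be treated as an assumption about the target rather than a portable guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a truncating float -> int32 conversion.
   NaN and out-of-range inputs yield INT32_MIN (assumed x86 behaviour). */
static int32_t cvtt_model(float x)
{
    if (!(x >= -2147483648.0f && x < 2147483648.0f))
        return INT32_MIN;   /* also taken for NaN: both comparisons are false */
    return (int32_t)x;      /* C casts round toward zero */
}

/* Usage example. */
static void cvtt_example(void)
{
    assert(cvtt_model(2.9f) == 2);
    assert(cvtt_model(-2.9f) == -2);
    assert(cvtt_model(1e10f) == INT32_MIN);
}
```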
Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b.
Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b.
Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum using the low 4 bits of imm8.
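The two halves of imm8 play different roles, which a scalar sketch makes concrete. `dp_lane_model` is a hypothetical helper that models a single group of four elements; on 256-bit vectors the same operation is applied to each 128-bit lane independently. The summation order here is simplified and may differ from hardware in the last bit for inexact inputs.

```c
#include <assert.h>

/* Hypothetical scalar model of a conditional dot product on four floats.
   Bits 4..7 of imm8 choose which products enter the sum;
   bits 0..3 choose which output elements receive the sum (others get 0). */
static void dp_lane_model(const float a[4], const float b[4],
                          unsigned imm8, float dst[4])
{
    float sum = 0.0f;
    for (int j = 0; j < 4; ++j)
        if ((imm8 >> (4 + j)) & 1u)
            sum += a[j] * b[j];
    for (int j = 0; j < 4; ++j)
        dst[j] = ((imm8 >> j) & 1u) ? sum : 0.0f;
}

/* Usage example: 0x71 sums products 0..2 and writes the sum to element 0. */
static void dp_example(void)
{
    const float a[4] = {1, 2, 3, 4};
    const float b[4] = {5, 6, 7, 8};
    float r[4];
    dp_lane_model(a, b, 0x71u, r);
    assert(r[0] == 38.0f && r[1] == 0.0f && r[2] == 0.0f && r[3] == 0.0f);
}
```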
Extract a 32-bit integer from a, selected with imm8.
Extract a 64-bit integer from a, selected with index.
Extract a 128-bit lane from a, selected with index (0 or 1).
Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b.
Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b.
Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b.
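For 256-bit vectors the horizontal operations work per 128-bit lane, so results from a and b interleave rather than concatenate. The scalar sketch below shows the assumed output layout for four doubles; the helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical scalar model of a 256-bit horizontal add of doubles.
   Output layout: { a0+a1, b0+b1, a2+a3, b2+b3 }. */
static void hadd_pd_model(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0] + a[1];
    dst[1] = b[0] + b[1];
    dst[2] = a[2] + a[3];
    dst[3] = b[2] + b[3];
}

/* Hypothetical scalar model of a 256-bit horizontal subtract of doubles.
   Output layout: { a0-a1, b0-b1, a2-a3, b2-b3 }. */
static void hsub_pd_model(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0] - a[1];
    dst[1] = b[0] - b[1];
    dst[2] = a[2] - a[3];
    dst[3] = b[2] - b[3];
}

/* Usage example. */
static void hadd_example(void)
{
    const double a[4] = {1, 2, 3, 4};
    const double b[4] = {10, 20, 30, 40};
    double r[4];
    hadd_pd_model(a, b, r);
    assert(r[0] == 3 && r[1] == 30 && r[2] == 7 && r[3] == 70);
    hsub_pd_model(a, b, r);
    assert(r[0] == -1 && r[1] == -10 && r[2] == -1 && r[3] == -10);
}
```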
Copy a, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b at the location specified by imm8.
Copy a, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b at the location specified by imm8.
Copy a, then insert 128 bits from b at the location specified by imm8.
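As a rough illustration of the copy-then-insert behaviour, here is a scalar sketch on eight floats. `insert_lane_model` is a hypothetical helper, and it assumes only the lowest bit of imm8 is used to pick the 128-bit lane.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar model: copy all of a, then overwrite one 128-bit lane
   (four floats) with b. Lane 0 is the low half, lane 1 the high half. */
static void insert_lane_model(const float a[8], const float b[4],
                              int imm8, float dst[8])
{
    memcpy(dst, a, 8 * sizeof(float));
    memcpy(dst + 4 * (imm8 & 1), b, 4 * sizeof(float));
}

/* Usage example: replace the high lane. */
static void insert_example(void)
{
    const float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float b[4] = {100, 101, 102, 103};
    float r[8];
    insert_lane_model(a, b, 1, r);
    assert(r[0] == 0 && r[3] == 3 && r[4] == 100 && r[7] == 103);
}
```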
Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load 256-bits of integer data from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value. hiaddr and loaddr do not need to be aligned on any particular boundary.
Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Load 256-bits of integer data from memory. mem_addr does not need to be aligned on any particular boundary.
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed maximum values.
Compare packed single-precision (32-bit) floating-point elements in a and b, and return packed maximum values.
Compare packed double-precision (64-bit) floating-point elements in a and b, and return packed minimum values.
Compare packed single-precision (32-bit) floating-point elements in a and b, and return packed minimum values.
Multiply packed double-precision (64-bit) floating-point elements in a and b.
Multiply packed single-precision (32-bit) floating-point elements in a and b.
Compute the bitwise NOT of 256 bits in a. #BONUS
Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b.
Broadcast 16-bit integer a to all elements of the return value.
Broadcast 32-bit integer a to all elements.
Broadcast 64-bit integer a to all elements of the return value.
Broadcast 8-bit integer a to all elements of the return value.
Broadcast double-precision (64-bit) floating-point value a to all elements of the return value.
Broadcast single-precision (32-bit) floating-point value a to all elements of the return value.
Set packed 16-bit integers with the supplied values.
Set packed 32-bit integers with the supplied values.
Set packed 64-bit integers with the supplied values.
Set packed 8-bit integers with the supplied values.
Set packed __m256 vector with the supplied values.
Set packed __m256d vector with the supplied values.
Set packed __m256i vector with the supplied values.
Set packed double-precision (64-bit) floating-point elements with the supplied values.
Set packed single-precision (32-bit) floating-point elements with the supplied values.
Set packed 16-bit integers with the supplied values in reverse order.
Set packed 32-bit integers with the supplied values in reverse order.
Set packed 64-bit integers with the supplied values in reverse order.
Set packed 8-bit integers with the supplied values in reverse order.
Set packed __m256 vector with the supplied values.
Set packed __m256d vector with the supplied values.
Set packed __m256i vector with the supplied values.
Set packed double-precision (64-bit) floating-point elements with the supplied values in reverse order.
Set packed single-precision (32-bit) floating-point elements with the supplied values in reverse order.
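The difference between the plain and "reverse order" setters is only the argument order relative to memory order. A scalar sketch, with hypothetical helper names and assuming the usual convention that the plain setter takes the highest element first:

```c
#include <assert.h>

/* Hypothetical model of the plain setter: arguments run from the highest
   element down to the lowest, so the LAST argument lands in dst[0]. */
static void set_pd_model(double hi3, double hi2, double lo1, double lo0,
                         double dst[4])
{
    dst[0] = lo0;
    dst[1] = lo1;
    dst[2] = hi2;
    dst[3] = hi3;
}

/* Hypothetical model of the reverse-order setter: arguments are given in
   memory order, so the FIRST argument lands in dst[0]. */
static void setr_pd_model(double first, double second, double third,
                          double fourth, double dst[4])
{
    dst[0] = first;
    dst[1] = second;
    dst[2] = third;
    dst[3] = fourth;
}

/* Usage example: both calls produce the memory layout {0, 1, 2, 3}. */
static void set_example(void)
{
    double s[4], r[4];
    set_pd_model(3.0, 2.0, 1.0, 0.0, s);
    setr_pd_model(0.0, 1.0, 2.0, 3.0, r);
    for (int i = 0; i < 4; ++i)
        assert(s[i] == (double)i && r[i] == (double)i);
}
```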
Return vector of type __m256d with all elements set to zero.
Return vector of type __m256 with all elements set to zero.
Return vector of type __m256i with all elements set to zero.
Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8.
Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8.
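To make "within 128-bit lanes" concrete, here is a scalar sketch of the single-precision shuffle. It assumes the two-source form, where the two low results of each lane come from a and the two high results come from b, each picked by a 2-bit field of imm8; the same selectors are reused for both lanes. `shuffle_ps_model` is a hypothetical helper name.

```c
#include <assert.h>

/* Hypothetical scalar model of a lane-wise shuffle of eight floats.
   Per lane: dst[0], dst[1] are chosen from a; dst[2], dst[3] from b.
   Selectors never cross the 128-bit lane boundary. */
static void shuffle_ps_model(const float a[8], const float b[8],
                             unsigned imm8, float dst[8])
{
    for (int lane = 0; lane < 2; ++lane) {
        int base = lane * 4;
        dst[base + 0] = a[base + (int)((imm8 >> 0) & 3u)];
        dst[base + 1] = a[base + (int)((imm8 >> 2) & 3u)];
        dst[base + 2] = b[base + (int)((imm8 >> 4) & 3u)];
        dst[base + 3] = b[base + (int)((imm8 >> 6) & 3u)];
    }
}

/* Usage example: imm8 = 0x1B encodes the selectors 3, 2, 1, 0. */
static void shuffle_example(void)
{
    const float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    float r[8];
    shuffle_ps_model(a, b, 0x1Bu, r);
    assert(r[0] == 3 && r[1] == 2 && r[2] == 11 && r[3] == 10);
    assert(r[4] == 7 && r[5] == 6 && r[6] == 15 && r[7] == 14);
}
```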
Compute the square root of packed double-precision (64-bit) floating-point elements in a.
Compute the square root of packed single-precision (32-bit) floating-point elements in a.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store the high and low 128-bit halves (each composed of integer data) from a into two different 128-bit memory locations. hiaddr and loaddr do not need to be aligned on any particular boundary.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. Note: there isn't any particular instruction in AVX to do that. It just defers to SSE2. Note: non-temporal stores should be followed by _mm_sfence() for reader threads.
Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a.
Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a.
Return vector of type __m256d with undefined elements.
Return vector of type __m256 with undefined elements.
Return vector of type __m256i with undefined elements.
Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b.
Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b.
Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b.
Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b.
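The interleaving pattern is easiest to see element by element. Below is a scalar sketch for the single-precision case on eight floats; `unpack_ps_model` is a hypothetical helper whose `high` flag selects between the high-half and low-half variants.

```c
#include <assert.h>

/* Hypothetical scalar model of unpack-and-interleave on eight floats.
   For each 128-bit lane, take two elements from the chosen half of a and b
   and interleave them as { a, b, a, b }. */
static void unpack_ps_model(const float a[8], const float b[8],
                            int high, float dst[8])
{
    for (int lane = 0; lane < 2; ++lane) {
        int base = lane * 4;
        int src = base + (high ? 2 : 0);
        dst[base + 0] = a[src];
        dst[base + 1] = b[src];
        dst[base + 2] = a[src + 1];
        dst[base + 3] = b[src + 1];
    }
}

/* Usage example. */
static void unpack_example(void)
{
    const float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    float lo[8], hi[8];
    unpack_ps_model(a, b, 0, lo);   /* {0,10,1,11, 4,14,5,15} */
    unpack_ps_model(a, b, 1, hi);   /* {2,12,3,13, 6,16,7,17} */
    assert(lo[0] == 0 && lo[1] == 10 && lo[2] == 1 && lo[3] == 11);
    assert(lo[4] == 4 && lo[5] == 14 && lo[6] == 5 && lo[7] == 15);
    assert(hi[0] == 2 && hi[1] == 12 && hi[2] == 3 && hi[3] == 13);
    assert(hi[4] == 6 && hi[5] == 16 && hi[6] == 7 && hi[7] == 17);
}
```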
Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b.
Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b.
Cast vector of type __m128d to type __m256d; the upper 128 bits of the result are zeroed.
Cast vector of type __m128 to type __m256; the upper 128 bits of the result are zeroed.
Cast vector of type __m128i to type __m256i; the upper 128 bits of the result are zeroed.
Broadcast a single-precision (32-bit) floating-point element from memory to all elements.
AVX intrinsics. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX