Broadcast 128 bits of integer data from a to all 128-bit lanes in result. Note: also exists under the name _mm256_broadcastsi128_si256, which is identical.
Shift each 128-bit lane in a left by the number of bytes given in bytes, shifting in zeroes.
Shift each 128-bit lane in a right by the number of bytes given in bytes, shifting in zeroes.
Compute the absolute value of packed signed 16-bit integers in a.
Compute the absolute value of packed signed 32-bit integers in a.
Compute the absolute value of packed signed 8-bit integers in a.
Add packed 16-bit integers in a and b.
Add packed 32-bit integers in a and b.
Add packed 64-bit integers in a and b.
Add packed 8-bit integers in a and b.
Add packed 16-bit signed integers in a and b using signed saturation.
Add packed 8-bit signed integers in a and b using signed saturation.
Add packed 16-bit unsigned integers in a and b using unsigned saturation.
Add packed 8-bit unsigned integers in a and b using unsigned saturation.
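The saturating additions above clamp to the range of the element type instead of wrapping around. A rough scalar sketch of what happens to one element follows; the helper names adds_i16 and adds_u8 are invented for illustration and are not part of any intrinsics API.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of a signed saturating 16-bit add. */
int16_t adds_i16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp instead of wrapping */
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Hypothetical scalar model of one lane of an unsigned saturating 8-bit add. */
uint8_t adds_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}
```

The vector forms apply the same rule independently to every packed element.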
Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by imm8 bytes, and return the low 16 bytes of that in each lane.
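The per-lane concatenate-and-shift above is easy to get backwards, so here is a scalar sketch of it. It assumes, following the underlying instruction definition, that the 16-byte block from a forms the high half of the temporary and the block from b the low half. The function name alignr_model is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-level model: a, b and out each hold 32 bytes (two 16-byte lanes). */
void alignr_model(uint8_t out[32], const uint8_t a[32], const uint8_t b[32], unsigned imm8) {
    for (int lane = 0; lane < 2; ++lane) {
        uint8_t tmp[32];
        memcpy(tmp, b + 16 * lane, 16);        /* low 16 bytes come from b (assumed order) */
        memcpy(tmp + 16, a + 16 * lane, 16);   /* high 16 bytes come from a */
        for (unsigned i = 0; i < 16; ++i) {
            unsigned src = i + imm8;           /* shifting right by imm8 bytes */
            out[16 * lane + i] = src < 32 ? tmp[src] : 0;  /* zeroes shift in past the end */
        }
    }
}
```

Note that the two lanes never exchange bytes: lane 1 of the result only sees lane 1 of a and b.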
Compute the bitwise AND of 256 bits (representing integer data) in a and b.
Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b.
Average packed unsigned 16-bit integers in a and b.
Average packed unsigned 8-bit integers in a and b.
Blend packed 16-bit integers from a and b within 128-bit lanes using 8-bit control mask imm8, in each of the two lanes. Note: this is functionally equivalent to two _mm_blend_epi16.
Blend packed 32-bit integers from a and b using 8-bit control mask imm8.
Blend packed 8-bit integers from a and b using mask. Select from b if the high-order bit of the corresponding 8-bit element in mask is set, else select from a.
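For the variable byte blend just described, only the top bit of each mask byte matters. A minimal scalar sketch, with a made-up function name blendv_u8_model and an explicit element count n that the real intrinsic does not take:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: pick b[i] when the high bit of mask[i] is set, else a[i]. */
void blendv_u8_model(uint8_t *out, const uint8_t *a, const uint8_t *b,
                     const uint8_t *mask, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = (mask[i] & 0x80) ? b[i] : a[i];   /* the low 7 mask bits are ignored */
    }
}
```

A mask produced by a comparison (all ones or all zeroes per element) therefore works directly as a selector.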
Broadcast the low packed 8-bit integer from a to all elements of result.
Broadcast the low packed 32-bit integer from a to all elements of result.
Broadcast the low packed 64-bit integer from a to all elements of result.
Broadcast the low double-precision (64-bit) floating-point element from a to all elements of result.
Broadcast the low single-precision (32-bit) floating-point element from a to all elements of result.
Broadcast the low packed 16-bit integer from a to all elements of result.
Shift each 128-bit lane in a left by the number of bytes given in bytes, shifting in zeroes.
Shift each 128-bit lane in a right by the number of bytes given in bytes, shifting in zeroes.
Compare packed 16-bit integers in a and b for equality.
Compare packed 32-bit integers in a and b for equality.
Compare packed 64-bit integers in a and b for equality.
Compare packed 8-bit integers in a and b for equality.
Compare packed signed 16-bit integers in a and b for greater-than.
Compare packed signed 32-bit integers in a and b for greater-than.
Compare packed signed 8-bit integers in a and b for greater-than.
Sign extend packed 16-bit integers in a to packed 32-bit integers.
Sign extend packed 16-bit integers in a to packed 64-bit integers.
Sign extend packed 32-bit integers in a to packed 64-bit integers.
Sign extend packed 8-bit integers in a to packed 16-bit integers.
Sign extend packed 8-bit integers in a to packed 32-bit integers.
Sign extend packed 8-bit integers in the low 4 bytes of a to packed 64-bit integers.
Zero-extend packed unsigned 16-bit integers in a to packed 32-bit integers.
Zero-extend packed unsigned 16-bit integers in a to packed 64-bit integers.
Zero-extend packed unsigned 32-bit integers in a to packed 64-bit integers.
Zero-extend packed unsigned 8-bit integers in a to packed 16-bit integers.
Zero-extend packed unsigned 8-bit integers in a to packed 32-bit integers.
Zero-extend packed unsigned 8-bit integers in a to packed 64-bit integers.
Extract a 16-bit integer from a, selected with index.
Extract an 8-bit integer from a, selected with index.
Extract 128 bits (composed of integer data) from a, selected with imm8.
Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results.
Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results.
Horizontally add adjacent pairs of signed 16-bit integers in a and b using saturation, and pack the signed 16-bit results.
Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results.
Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results.
Horizontally subtract adjacent pairs of signed 16-bit integers in a and b using saturation, and pack the signed 16-bit results.
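The horizontal operations above also work per 128-bit lane, which the one-line summaries do not spell out. The sketch below models the 16-bit horizontal add on a 256-bit vector, assuming the usual layout where each result lane holds the four pair sums from a followed by the four pair sums from b. The name hadd_i16_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: a, b and out each hold 16 int16_t (two lanes of 8). */
void hadd_i16_model(int16_t out[16], const int16_t a[16], const int16_t b[16]) {
    for (int lane = 0; lane < 2; ++lane) {
        int base = 8 * lane;
        for (int i = 0; i < 4; ++i) {
            /* Sums wrap on overflow in the non-saturating form; the cast back to
               int16_t relies on the usual two's complement conversion. */
            out[base + i]     = (int16_t)(a[base + 2 * i] + a[base + 2 * i + 1]);
            out[base + 4 + i] = (int16_t)(b[base + 2 * i] + b[base + 2 * i + 1]);
        }
    }
}
```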
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Return gathered elements. scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
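The gathers above all share one addressing rule: element i is read from base_addr plus vindex[i] times scale, measured in bytes. A scalar sketch for 32-bit elements with 32-bit indices follows; the function name gather_i32_model and the explicit element count n are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a gather of n 32-bit integers using 32-bit indices. */
void gather_i32_model(int32_t *out, const void *base_addr,
                      const int32_t *vindex, int scale, int n) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *base = (const unsigned char *)base_addr;
    for (int i = 0; i < n; ++i) {
        /* The byte offset is index * scale; indices are signed. */
        memcpy(&out[i], base + (int64_t)vindex[i] * scale, sizeof out[i]);
    }
}
```

With scale set to 4, the indices act as ordinary array subscripts into an int32_t table; with scale set to 1 they are raw byte offsets.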
Copy a to result, then insert 128 bits from b into result at the location specified by imm8.
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in destination.
Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results.
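The two multiply-and-add entries above differ in operand signedness and in overflow handling. As a hedged scalar sketch of one output element of each (the names madd_pair and maddubs_pair are invented here):

```c
#include <stdint.h>

/* Hypothetical model of one 32-bit output of the signed 16-bit multiply-add.
   The only overflowing case is all four inputs equal to -32768; the sum is
   done in unsigned arithmetic so it wraps, as the hardware result does. */
int32_t madd_pair(int16_t a0, int16_t b0, int16_t a1, int16_t b1) {
    uint32_t p0 = (uint32_t)((int32_t)a0 * b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * b1);
    return (int32_t)(p0 + p1);   /* conversion wraps on common compilers */
}

/* Hypothetical model of one 16-bit output of the unsigned-by-signed 8-bit
   multiply-add: a is unsigned, b is signed, and the pair sum saturates. */
int16_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Swapping the operands of the 8-bit form changes the result, because only the first operand is treated as unsigned.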
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
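The masked gathers above add a per-element choice between loading from memory and keeping the value from src. A scalar sketch for 32-bit elements with 32-bit indices is shown here; mask_gather_i32_model and the count n are hypothetical, and the sketch assumes that masked-off elements are never read from memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: load element i only when the highest bit of mask[i] is set. */
void mask_gather_i32_model(int32_t *out, const int32_t *src, const void *base_addr,
                           const int32_t *vindex, const int32_t *mask,
                           int scale, int n) {
    const unsigned char *base = (const unsigned char *)base_addr;
    for (int i = 0; i < n; ++i) {
        if (mask[i] < 0) {   /* highest bit set: gather from memory */
            memcpy(&out[i], base + (int64_t)vindex[i] * scale, sizeof out[i]);
        } else {             /* highest bit clear: copy from src, index is ignored */
            out[i] = src[i];
        }
    }
}
```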
Load packed 32-bit integers from memory using mask (elements are zeroed out when the highest bit is not set in the corresponding element). Warning: See "Note about mask load/store" to know why you must address valid memory only.
Load packed 64-bit integers from memory using mask (elements are zeroed out when the highest bit is not set in the corresponding element). Warning: See "Note about mask load/store" to know why you must address valid memory only.
Compare packed signed 16-bit integers in a and b, and return packed maximum values.
Compare packed signed 32-bit integers in a and b, and return packed maximum values.
Compare packed signed 8-bit integers in a and b, and return packed maximum values.
Compare packed unsigned 16-bit integers in a and b, and return packed maximum values.
Compare packed unsigned 32-bit integers in a and b, and return packed maximum values.
Compare packed unsigned 8-bit integers in a and b, and return packed maximum values.
Compare packed signed 32-bit integers in a and b, and return packed minimum values.
Compare packed signed 8-bit integers in a and b, and return packed minimum values.
Compare packed unsigned 16-bit integers in a and b, and return packed minimum values.
Compare packed unsigned 32-bit integers in a and b, and return packed minimum values.
Compare packed unsigned 8-bit integers in a and b, and return packed minimum values.
Create mask from the most significant bit of each 8-bit element in a.
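The byte mask described above packs one bit per 8-bit element into an integer, with element 0 in bit 0. A small scalar sketch for a 256-bit input (32 bytes); the name movemask_u8_model is an assumption:

```c
#include <stdint.h>

/* Hypothetical model: bit i of the result is the most significant bit of a[i]. */
uint32_t movemask_u8_model(const uint8_t a[32]) {
    uint32_t m = 0;
    for (int i = 0; i < 32; ++i) {
        m |= (uint32_t)(a[i] >> 7) << i;
    }
    return m;
}
```

A common use is to follow a byte comparison with this operation and then test or scan the resulting integer.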
Equivalent to two _mm_mpsadbw_epu8 operations in parallel, one per 128-bit lane.
Multiply the low signed 32-bit integers from each packed 64-bit element in a and b, and return the signed 64-bit results.
Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and return the unsigned 64-bit results.
Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and return the high 16 bits of the intermediate integers.
Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and return the high 16 bits of the intermediate integers.
Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and return bits [16:1].
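The truncate-round-extract sequence above amounts to a rounded Q15 fixed-point multiply. A scalar sketch of one element follows; mulhrs_i16_model is a made-up name, and the code assumes arithmetic right shift of negative values and wrapping narrowing conversions, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical model of one element of the rounded high multiply. */
int16_t mulhrs_i16_model(int16_t a, int16_t b) {
    int32_t prod = (int32_t)a * (int32_t)b;   /* full 32-bit product */
    int32_t top18 = prod >> 14;               /* keep the 18 most significant bits */
    int32_t rounded = top18 + 1;              /* round by adding 1 */
    return (int16_t)(rounded >> 1);           /* bits [16:1] of the rounded value */
}
```

Reading the inputs as Q15 fractions, 16384 stands for 0.5, and 0.5 times 0.5 yields 8192, which stands for 0.25.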
Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and return the low 16 bits of the intermediate integers.
Multiply the packed signed 32-bit integers in a and b, producing intermediate 64-bit integers, and return the low 32 bits of the intermediate integers.
Compute the bitwise OR of 256 bits (representing integer data) in a and b.
Convert packed signed 16-bit integers from a and b to packed 8-bit integers using signed saturation. Warning: a and b are interleaved per-lane. Result has: a lane 0, b lane 0, a lane 1, b lane 1.
Convert packed signed 32-bit integers from a and b to packed 16-bit integers using signed saturation. Warning: a and b are interleaved per-lane. Result has: a lane 0, b lane 0, a lane 1, b lane 1.
Convert packed signed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation. Warning: a and b are interleaved per-lane. Result has: a lane 0, b lane 0, a lane 1, b lane 1.
Convert packed signed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation. Warning: a and b are interleaved per-lane. Result has: a lane 0, b lane 0, a lane 1, b lane 1.
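The per-lane interleaving warned about above is the main surprise of the 256-bit pack operations. The sketch below models the signed 16-bit to 8-bit pack; packs_i16_model and sat_i8 are hypothetical names.

```c
#include <stdint.h>

/* Clamp a 16-bit value to the signed 8-bit range. */
int8_t sat_i8(int16_t x) {
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Hypothetical model: a and b hold 16 int16_t each, out holds 32 int8_t. */
void packs_i16_model(int8_t out[32], const int16_t a[16], const int16_t b[16]) {
    for (int lane = 0; lane < 2; ++lane) {
        for (int i = 0; i < 8; ++i) {
            out[16 * lane + i]     = sat_i8(a[8 * lane + i]);  /* a lane, then ... */
            out[16 * lane + 8 + i] = sat_i8(b[8 * lane + i]);  /* ... b lane */
        }
    }
}
```

To obtain all of a followed by all of b, the result needs a further cross-lane permutation of its 64-bit quarters.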
Shuffle 128 bits (composed of integer data) selected by imm8 from a and b. See the documentation, as the imm8 format is quite complex.
Shuffle 64-bit integers in a across lanes using the control in imm8.
Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8.
Shuffle 32-bit integers in a across lanes using the corresponding index in idx.
Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx.
Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in result.
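The sum of absolute differences above can be sketched in scalar form as follows. For a 256-bit input there are four groups of 8 bytes, hence four sums, each stored in its own 64-bit element. The name sad_u8_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: a and b hold 32 bytes, out holds four 64-bit elements. */
void sad_u8_model(uint64_t out[4], const uint8_t a[32], const uint8_t b[32]) {
    for (int g = 0; g < 4; ++g) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; ++i) {
            int d = (int)a[8 * g + i] - (int)b[8 * g + i];
            sum += (uint64_t)(d < 0 ? -d : d);
        }
        out[g] = sum;   /* at most 8 * 255 = 2040, so the upper 48 bits stay zero */
    }
}
```

Using an all-zero b turns this into a cheap horizontal sum of bytes.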
Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and return the results.
Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b.
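The byte shuffle above uses each control byte of b as an index into the same 128-bit lane of a. The sketch below also models a detail taken from the underlying instruction definition rather than from the summary: a control byte with its high bit set produces zero. The name shuffle_u8_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: a, b and out each hold 32 bytes (two 16-byte lanes). */
void shuffle_u8_model(uint8_t out[32], const uint8_t a[32], const uint8_t b[32]) {
    for (int i = 0; i < 32; ++i) {
        int lane_base = i & 16;            /* 0 for lane 0, 16 for lane 1 */
        if (b[i] & 0x80) {
            out[i] = 0;                    /* high bit set: zero the element */
        } else {
            out[i] = a[lane_base + (b[i] & 0x0F)];   /* low 4 bits index within the lane */
        }
    }
}
```

Because the index is only 4 bits wide, a byte can never be pulled from the other lane with this operation.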
Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of result, with the low 64 bits of 128-bit lanes being copied from a. See also: _MM_SHUFFLE.
Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of result, with the high 64 bits of 128-bit lanes being copied from a. See also: _MM_SHUFFLE.
Negate packed signed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative. Elements in result are zeroed out when the corresponding element in b is zero.
Negate packed signed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative. Elements in result are zeroed out when the corresponding element in b is zero.
Negate packed signed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative. Elements in result are zeroed out when the corresponding element in b is zero.
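The three sign operations above apply a three-way rule per element. A scalar sketch for the 16-bit case, with the invented name sign_i16_model:

```c
#include <stdint.h>

/* Hypothetical model of one element: negate, zero, or pass through a,
   depending on the sign of the matching element of b. */
int16_t sign_i16_model(int16_t a, int16_t b) {
    if (b < 0) return (int16_t)(-(int32_t)a);   /* -32768 stays -32768 after narrowing */
    if (b == 0) return 0;
    return a;
}
```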
Shift packed 16-bit integers in a left by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 15, result is defined to be all zeroes. Note: prefer _mm256_slli_epi16, less of a trap.
Shift packed 32-bit integers in a left by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 31, result is defined to be all zeroes. Note: prefer _mm256_slli_epi32, less of a trap.
Shift packed 64-bit integers in a left by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 63, result is defined to be all zeroes. Note: prefer _mm256_slli_epi64, less of a trap.
Shift packed 16-bit integers in a left by imm8 while shifting in zeros.
Shift packed 32-bit integers in a left by imm8 while shifting in zeros.
Shift packed 64-bit integers in a left by imm8 while shifting in zeros.
Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeroes.
Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeroes.
Shift packed 16-bit integers in a right by count while shifting in sign bits. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 15, result is defined to be all sign bits. Warning: prefer _mm256_srai_epi16, less of a trap.
Shift packed 32-bit integers in a right by count while shifting in sign bits. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 31, result is defined to be all sign bits. Warning: prefer _mm256_srai_epi32, less of a trap.
Shift packed 16-bit integers in a right by imm8 while shifting in sign bits.
Shift packed 32-bit integers in a right by imm8 while shifting in sign bits.
Shift packed 16-bit integers in a right by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 15, result is defined to be all zeroes. Note: prefer _mm256_srli_epi16, less of a trap.
Shift packed 32-bit integers in a right by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 31, result is defined to be all zeroes. Note: prefer _mm256_srli_epi32, less of a trap.
Shift packed 64-bit integers in a right by count while shifting in zeroes. Bit-shift is a single value in the low-order 64-bit of count. If bit-shift > 63, result is defined to be all zeroes. Note: prefer _mm256_srli_epi64, less of a trap.
Shift packed 16-bit integers in a right by imm8 while shifting in zeros.
Shift packed 32-bit integers in a right by imm8 while shifting in zeros.
Shift packed 64-bit integers in a right by imm8 while shifting in zeros.
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeroes.
Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeroes.
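The shift-by-count entries above define what happens when the count exceeds the element width, unlike the C shift operators, where that is undefined behavior. A scalar sketch of one 16-bit element for the logical left and arithmetic right forms follows; the function names are hypothetical, and the right shift of a negative value assumes the arithmetic shift that common compilers provide.

```c
#include <stdint.h>

/* Hypothetical model: logical left shift, all zeroes once count exceeds 15. */
uint16_t sll_i16_model(uint16_t a, uint64_t count) {
    if (count > 15) return 0;
    return (uint16_t)((uint32_t)a << count);
}

/* Hypothetical model: arithmetic right shift, all sign bits once count exceeds 15. */
int16_t sra_i16_model(int16_t a, uint64_t count) {
    if (count > 15) return (int16_t)(a < 0 ? -1 : 0);
    return (int16_t)(a >> count);   /* implementation-defined for negative a */
}
```

The count is taken as a full 64-bit value, so a count vector whose low 64 bits hold a large number yields the out-of-range result even if its low byte looks small.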
Load 256 bits of integer data from memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
Subtract packed 16-bit integers in b from packed 16-bit integers in a.
Subtract packed 32-bit integers in b from packed 32-bit integers in a.
Subtract packed 64-bit integers in b from packed 64-bit integers in a.
Subtract packed 8-bit integers in b from packed 8-bit integers in a.
Subtract packed signed 16-bit integers in b from packed 16-bit integers in a using saturation.
Subtract packed signed 8-bit integers in b from packed 8-bit integers in a using saturation.
Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation.
Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation.
Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b.
Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b.
Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b.
Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b.
Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b.
Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b.
Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b.
Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b.
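The unpack operations above interleave within each 128-bit lane rather than across the whole vector. A scalar sketch of the 8-bit low-half form, with the made-up name unpacklo_u8_model:

```c
#include <stdint.h>

/* Hypothetical model: a, b and out each hold 32 bytes (two 16-byte lanes). */
void unpacklo_u8_model(uint8_t out[32], const uint8_t a[32], const uint8_t b[32]) {
    for (int lane = 0; lane < 2; ++lane) {
        int base = 16 * lane;
        for (int i = 0; i < 8; ++i) {
            out[base + 2 * i]     = a[base + i];   /* even positions come from a */
            out[base + 2 * i + 1] = b[base + i];   /* odd positions come from b */
        }
    }
}
```

The high-half forms follow the same pattern but read elements 8 to 15 of each lane.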
Compute the bitwise XOR of 256 bits (representing integer data) in a and b.
Blend packed 32-bit integers from a and b using 4-bit control mask imm8.
Broadcast the low packed 8-bit integer from a to all elements of result.
Broadcast the low packed 32-bit integer from a to all elements of result.
Broadcast the low packed 64-bit integer from a to all elements of result.
Broadcast the low double-precision (64-bit) floating-point element from a to all elements of result.
Broadcast 128 bits of integer data from a to all 128-bit lanes in result. Note: also exists under the name _mm256_broadcastsi128_si256, which is identical.
Broadcast the low single-precision (32-bit) floating-point element from a to all elements of result.
Broadcast the low packed 16-bit integer from a to all elements of result.
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Return gathered elements. scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Return gathered elements. scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are returned. scale should be 1, 2, 4 or 8.
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Load packed 32-bit integers from memory using mask (elements are zeroed out when the highest bit is not set in the corresponding element). Warning: See "Note about mask load/store" to know why you must address valid memory only.
Load packed 64-bit integers from memory using mask (elements are zeroed out when the highest bit is not set in the corresponding element). Warning: See "Note about mask load/store" to know why you must address valid memory only.
Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeroes.
Shift packed 64-bit integers in a left by the amount specified by the corresponding element in b while shifting in zeros.
Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeroes.
Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeroes.
AVX2 intrinsics. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX2