Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices.
32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit
element in vindex (each index is scaled by the factor in scale). Gathered elements are merged using mask
(elements are copied from src when the highest bit is not set in the corresponding element).
scale should be 1, 2, 4 or 8.
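
The masked merge described above can be sketched as a scalar reference model. This is not the intrinsic itself. It is a hypothetical helper (`mask_gather_f32_model`) that assumes a four-lane (128-bit) variant, treats the scaled index as a byte offset from base_addr, and takes the mask as raw 32-bit lane patterns instead of a vector type:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 4 /* assumption: four 32-bit lanes (128-bit variant) */

/* Hypothetical scalar model of the masked gather; names are illustrative. */
static void mask_gather_f32_model(float dst[LANES], const float src[LANES],
                                  const void *base_addr,
                                  const int32_t vindex[LANES],
                                  const uint32_t mask[LANES], int scale)
{
    for (int i = 0; i < LANES; i++) {
        if (mask[i] & 0x80000000u) {
            /* Highest bit set: load from base_addr + vindex[i] * scale,
               with the offset assumed to be in bytes. */
            const char *p = (const char *)base_addr
                          + (ptrdiff_t)vindex[i] * scale;
            memcpy(&dst[i], p, sizeof(float));
        } else {
            /* Highest bit clear: keep the corresponding element of src. */
            dst[i] = src[i];
        }
    }
}
```

Under these assumptions, gathering from a float array with scale set to 4 makes each index select one array element, and only the highest bit of each mask lane decides whether that lane is loaded or copied from src.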