AVX2：计算512个浮点数组的点积

首先，我将是SIMD内部函数的完整入门者。

从本质上讲，我有一个支持AVX2内部（Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz）的CPU。我想知道计算大小为std::vector<float>的两个512的点积的最快方法。

我已经在线进行了一些挖掘，发现this和this，并且this堆栈溢出问题建议使用以下函数__m256 _mm256_dp_ps(__m256 m1,__m256 m2,const int mask);，但是，这些建议的方式都不相同点产品的执行，我不确定什么是正确（最快）的方法。

特别是，我正在寻找对尺寸为512的矢量执行点积运算的最快方法（因为我知道矢量尺寸会影响实现）。

谢谢您的帮助

编辑1 ：我对-mavx2 gcc标志也有些困惑。如果使用这些AVX2函数，在编译时是否需要添加标志？另外，如果我编写了一个简单的点积实现，gcc是否可以为我做这些优化（例如是否使用-OFast gcc标志）？

编辑2 如果有时间和精力，那么如果您能编写完整的实现，我将不胜感激。我相信其他初学者也会重视此信息。

_mm256_dp_ps仅对2到4个元素的点积有用。对于更长的向量，请在循环中使用垂直SIMD，最后减少为标量。循环使用_mm256_dp_ps和_mm256_add_ps会慢很多。

与MSVC和ICC不同，GCC和clang要求您启用（使用命令行选项）使用内部函数的ISA扩展。

下面的代码可能接近CPU的理论性能极限。未经测试。

使用clang或gcc -O3 -march=native进行编译。（至少需要-mavx -mfma，但是-mtune隐含的-march选项也很好，其他-mpopcnt和其他arch=native启用的选项也是如此。对于大多数具有FMA的CPU，选项对于有效地进行编译至关重要，特别是-mno-avx256-split-unaligned-load：Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?）

或使用MSVC -O2 -arch:AVX2

进行编译

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1,const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a,b );
}

// Returns acc + ( p1 * p2 ),for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc,const float* p1,const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a,b,acc );
}

// Compute dot product of float vectors,using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a,const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet,just multiplying.
    __m256 dot0 = mul8<0>( p1,p2 );
    __m256 dot1 = mul8<1>( p1,p2 );
    __m256 dot2 = mul8<2>( p1,p2 );
    __m256 dot3 = mul8<3>( p1,p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate,consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register,every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often,therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0,p1,p2 );
        dot1 = fma8<1>( dot1,p2 );
        dot2 = fma8<2>( dot2,p2 );
        dot3 = fma8<3>( dot3,p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0,dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2,dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01,dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ),_mm256_extractf128_ps( dot0123,1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4,_mm_movehl_ps( r4,r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2,_mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop,modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r1 );
}

可能的进一步改进：

解开8个向量寄存器，而不是4个。我检查了gcc 9.2 asm output，编译器只使用了16个可用向量寄存器中的8个向量寄存器。
确保两个输入向量对齐，例如使用自定义分配器，该分配器在msvc上调用_aligned_malloc / _aligned_free，在gcc和clang上调用aligned_alloc / free。然后将_mm256_loadu_ps替换为_mm256_load_ps。

要自动向量化一个简单的标量点积，您还需要OpenMP SIMD或-ffast-math（由-Ofast表示），以使编译器将FP数学视为关联（即使不是）（因为四舍五入）。但是GCC在自动矢量化时不会使用多个累加器，即使它确实展开了，因此您将成为FMA延迟的瓶颈，而不是吞吐量。

（每个FMA 2个负载意味着此代码的吞吐量瓶颈是矢量负载，而不是实际的FMA操作。）

AVX2：计算512个浮点数组的点积

xkdtj 回答：AVX2：计算512个浮点数组的点积

大家都在问