使用SIMD

2024-05-01 • 问答

我有两个向量。大小为a的双精度N的向量和大小为b的无符号字符ceil(N/8)的向量。目的是计算a的某些值的乘积。将b逐位读取，其中每个位指示是否要在产品中考虑从double给定的a。

  // Let's create some data      
  unsigned nbBits  = 1e7;
  unsigned nbBytes = nbBits / 8;
  unsigned char nbBitsinlastByte = nbBits % 8;
  assert(nbBits == nbBytes * 8 + nbBitsinlastByte);
  std::vector<double> a(nbBits,0.999999);   // In practice a values will vary. It is just an easy to build example I am showing here
  std::vector<unsigned char> b(nbBytes,false); // I am not using `vector<bool>` nor `bitset`. I've got my reasons!
  assert(a.size() == b.size() * 8);

  // Set a few bits to true
  for (unsigned byte = 0 ; byte < (nbBytes-1) ; byte+=2)
  {
    b[byte] |= 1 << 2; // set second (zero-based counting) bit to 'true'
    b[byte] |= 1 << 7; // set last bit to 'true'
                //  ^ This is the bit index
  }

如上所述，我的目标是在a为真时计算b中值的乘积。可以通过

  // Initialize the variable we want to compute
  double product = 1.0;

  // Product for the first nbByts-1 bytes
  for (unsigned byte = 0 ; byte < (nbBytes-1) ; ++byte)
  {
    for (unsigned bit = 0 ; bit < 8 ; ++bit) // inner loop could be manually unrolled
    {
      if((b[byte] >> bit) & 1) // gets the bit value
        product *= a[byte*8+bit];
    }
  }

  // Product for the last byte
  for (unsigned bit = 0 ; bit < nbBitsinlastByte ; ++bit)
  {
    if((b[nbBytes-1] >> bit) & 1) // gets the bit value
      product *= a[(nbBytes-1)*8+bit];
  }

此乘积计算是我代码中速度较慢的部分。我想知道是否显式向量化（SIMD）这个过程在这里是否有帮助？我一直在看'xmmintrin.h'中提供的功能，但对SIMD知之甚少，未能找到有帮助的东西。你能帮我吗？

// Load 2 double values from source pointer,and conditionally multiply with the product. // Returns the new product. template<int startIdx> inline __m128d product2( const double* pSource,__m128i mask,__m128d oldProduct ) { // Multiply values unconditionally const __m128d source = _mm_loadu_pd( pSource + startIdx ); const __m128d newProduct = _mm_mul_pd( source,oldProduct ); // We only calling product2 with 4 different template arguments. // There are 16 vector registers in total,enough for all 4 different `maskAndBits` values. constexpr int64_t bit1 = 1 << startIdx; constexpr int64_t bit2 = 1 << ( startIdx + 1 ); const __m128i maskAndBits = _mm_setr_epi64x( bit1,bit2 ); mask = _mm_and_si128( mask,maskAndBits ); // NAN if the mask is 0 after the above AND i.e. the bit was not set,0.0 if the bit was set const __m128d maskDouble = _mm_castsi128_pd( _mm_cmpeq_epi64( mask,_mm_setzero_si128() ) ); // This instruction actually does the masking,it's from SSE 4.1 return _mm_blendv_pd( newProduct,oldProduct,maskDouble ); } double conditionalProducts( const double* ptr,const uint8_t* masks,size_t size ) { // Round down the size of your input vector,and multiply last couple values the old way. assert( 0 == size % 8 ); __m128d prod = _mm_set1_pd( 1.0 ); const double* const end = ptr + size; while( ptr < end ) { // Broadcast the mask byte into 64-bit integer lanes const __m128i mask = _mm_set1_epi64x( *masks ); // Compute the conditional products of 8 values prod = product2<0>( ptr,mask,prod ); prod = product2<2>( ptr,prod ); prod = product2<4>( ptr,prod ); prod = product2<6>( ptr,prod ); // Advance the pointers ptr += 8; masks++; } // Multiply two lanes together prod = _mm_mul_sd( prod,_mm_shuffle_pd( prod,prod,0b11 ) ); return _mm_cvtsd_f64( prod ); }

使用SIMD

Jackson0417 回答：使用SIMD

大家都在问