I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100,%eax\n"
"mov $200,%ebx\n"
"mov $" ITERATIONS ",%ecx\n"
"lp_test_time1:\n"
" add %eax,%ebx\n" // 1
" add %eax,%ebx\n" // 2
" add %eax,%ebx\n" // 3
" add %eax,%ebx\n" // 4
" add %eax,%ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100,%eax\n"
"mov $" ITERATIONS ",%ecx\n"
"lp_test_time2:\n"
" mov $200,%ebx\n"
" add %eax,%ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
I compiled it with
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and its output (in nanoseconds) is
14474
5837
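I also compiled it with Clang, with the same result.
So why is the second case faster (about 2.5x here: 14474 ns vs. 5837 ns)? Is it actually related to some microarchitectural detail? If it matters, I have an AMD CPU: "AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G".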
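To put the two timings on a per-iteration footing, here is a minimal sketch of the arithmetic. The helper names `ns_per_iteration` and `slowdown` are hypothetical and not part of the program above; the sketch assumes the printed values are total nanoseconds for all 10000 iterations of each loop, timer overhead included.

```cpp
#include <cstdint>

// Hypothetical helper: average time per loop iteration, in nanoseconds.
// total_ns is the value printed by the program, iterations is ITERATIONS.
constexpr double ns_per_iteration(std::int64_t total_ns, std::int64_t iterations) {
    return static_cast<double>(total_ns) / static_cast<double>(iterations);
}

// Hypothetical helper: how many times longer the slow run took than the fast run.
constexpr double slowdown(std::int64_t slow_ns, std::int64_t fast_ns) {
    return static_cast<double>(slow_ns) / static_cast<double>(fast_ns);
}
```

With the output above, `ns_per_iteration(14474, 10000)` is about 1.45 ns for the first loop, `ns_per_iteration(5837, 10000)` is about 0.58 ns for the second, and `slowdown(14474, 5837)` is about 2.48. These come from a single run, so they are only a rough indication.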