我目前正在用C编写素数生成器.我先制作了一个单线程版本,后来又制作了一个多线程版本.
我发现如果我的程序生成的值小于100’000,则单线程版本比多线程版本更快.显然我做错了什么.
我的代码如下:
- #include <iostream>
- #include <fstream>
- #include <set>
- #include <string>
- #include <thread>
- #include <mutex>
- #include <shared_mutex>
- using namespace std;
- set<unsigned long long> primeContainer;
- shared_mutex m;
- void checkPrime(const unsigned long long p)
- {
- if (p % 3 == 0)
- return;
- bool isPrime = true;
- for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
- {
- if (p % *it == 0)
- {
- isPrime = false;
- break;
- }
- if (*it * *it > p) // check only up to square root
- break;
- }
- if (isPrime)
- primeContainer.insert(p);
- }
- void checkPrimeLock(const unsigned long long p)
- {
- if (p % 3 == 0)
- return;
- bool isPrime = true;
- try
- {
- shared_lock<shared_mutex> l(m);
- for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
- {
- if (p % *it == 0)
- {
- isPrime = false;
- break;
- }
- if (*it * *it > p)
- break;
- }
- }
- catch (exception& e)
- {
- cout << e.what() << endl;
- system("pause");
- }
- if (isPrime)
- {
- try
- {
- unique_lock<shared_mutex> l(m);
- primeContainer.insert(p);
- }
- catch (exception& e)
- {
- cout << e.what() << endl;
- system("pause");
- }
- }
- }
- void runLoopThread(const unsigned long long& l)
- {
- for (unsigned long long i = 10; i < l; i += 10)
- {
- thread t1(checkPrimeLock,i + 1);
- thread t2(checkPrimeLock,i + 3);
- thread t3(checkPrimeLock,i + 7);
- thread t4(checkPrimeLock,i + 9);
- t1.join();
- t2.join();
- t3.join();
- t4.join();
- }
- }
- void runLoop(const unsigned long long& l)
- {
- for (unsigned long long i = 10; i < l; i += 10)
- {
- checkPrime(i + 1);
- checkPrime(i + 3);
- checkPrime(i + 7);
- checkPrime(i + 9);
- }
- }
- void printPrimes(const unsigned long long& l)
- {
- if (1U <= l)
- cout << "1 ";
- if (2U <= l)
- cout << "2 ";
- if (3U <= l)
- cout << "3 ";
- if (5U <= l)
- cout << "5 ";
- for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
- {
- if (*it <= l)
- cout << *it << " ";
- }
- cout << endl;
- }
- void writeToFile(const unsigned long long& l)
- {
- string name = "primes_" + to_string(l) + ".txt";
- ofstream f(name);
- if (f.is_open())
- {
- if (1U <= l)
- f << "1 ";
- if (2U <= l)
- f << "2 ";
- if (3U <= l)
- f << "3 ";
- if (5U <= l)
- f << "5 ";
- for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it)
- {
- if (*it <= l)
- f << *it << " ";
- }
- }
- else
- {
- cout << "Error opening file." << endl;
- system("pause");
- }
- }
- int main()
- {
- unsigned int n = thread::hardware_concurrency();
- std::cout << n << " concurrent threads are supported." << endl;
- unsigned long long limit;
- cout << "Please enter the limit of prime generation: ";
- cin >> limit;
- primeContainer.insert(7);
- if (10 < limit)
- {
- //runLoop(limit); //single-threaded
- runLoopThread(limit); //multi-threaded
- }
- printPrimes(limit);
- //writeToFile(limit);
- system("pause");
- return 0;
- }
在main函数中,您将找到关于哪个函数是单线程和多线程的注释.
它们之间的主要区别在于锁的使用,容器迭代的共享以及插入的唯一性.如果重要,我的cpu有4个核心.
为什么单线程版本更快?
解决方法
对我来说,似乎你正在为每个单一的素数检查开始一个新线程.那是不好的恕我直言,因为线程启动/关闭加上同步增加了每个素数的计算.启动一个线程可能会很慢.
我建议在main for循环之外启动那4个线程,并在每个线程中处理范围的1/4.但这可能需要一些额外的同步,因为要检查素数,上面的代码显然需要首先具有可用的sqrt N的素数.
从我的角度来看,使用Sieve of Erastothenes算法可能更容易,这可能更容易并行化而没有任何锁定(但是仍然可能遇到称为“false sharing”的问题).
编辑
在这里,我使用Siera of Erastothenes快速创建了一个版本:
- void processSieve(const unsigned long long& l,const unsigned long long& start,const unsigned long long& end,const unsigned long long& step,vector<char> &is_prime)
- {
- for (unsigned long long i = start; i <= end; i += step)
- if (is_prime[i])
- for (unsigned long long j = i + i; j <= l; j += i)
- is_prime[j] = 0;
- }
- void runSieve(const unsigned long long& l)
- {
- vector<char> is_prime(l + 1,1);
- unsigned long long end = sqrt(l);
- processSieve(l,2,end,1,is_prime);
- primeContainer.clear();
- for (unsigned long long i = 1; i <= l; ++i)
- if (is_prime[i])
- primeContainer.insert(i);
- }
- void runSieveThreads(const unsigned long long& l)
- {
- vector<char> is_prime(l + 1,1);
- unsigned long long end = sqrt(l);
- vector<thread> threads;
- threads.reserve(cpuCount);
- for (unsigned long long i = 0; i < cpuCount; ++i)
- threads.emplace_back(processSieve,l,2 + i,cpuCount,ref(is_prime));
- for (unsigned long long i = 0; i < cpuCount; ++i)
- threads[i].join();
- primeContainer.clear();
- for (unsigned long long i = 1; i <= l; ++i)
- if (is_prime[i])
- primeContainer.insert(i);
- }
测量结果,最高可达1 000 000(MSVC 2013,Release):
- runLoop: 204.02 ms
- runLoopThread: 43947.4 ms
- runSieve: 30.003 ms
- runSieveThreads (8 cores): 24.0024 ms
高达10 0000 000:
- runLoop: 4387.44 ms
- // runLoopThread disabled,taking too long
- runSieve: 350.035 ms
- runSieveThreads (8 cores): 285.029 ms
时间包括向量的最终处理并将结果推送到素数集.
正如您所看到的,即使在单线程版本中,Sieve版本也比您的版本快得多(对于您的互斥锁版本,我必须将锁定更改为常规互斥锁,因为MSVC 2013没有shared_lock,因此结果可能更糟糕比你的).
但是你可以看到筛网的多线程版本运行速度仍然没有预期的那么快(8个核心,即8个线程,线性加速比单线程快8倍),虽然没有锁定(折衷一些数字可能如果它们没有被其他线程标记为“无素数”,则不必要地运行,但通常结果应该是稳定的,因为每次只设置为0,如果多个线程同时设置则无关紧要).加速不是线性的原因很可能是因为我之前提到的“false sharing”问题 – 写入零的线程使每个其他高速缓存行无效.