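I am trying to run several machine learning algorithms (from scikit-learn) in parallel, using the Process class together with a variable shared between the processes to store the results.
Unfortunately, my code never finishes. Since I am running 10 fairly heavy algorithms, could this be a memory problem, or is it just slow?
I tried splitting the whole code into two parts (I thought that would be faster), but it didn't change anything...
Note that train_bow and test_bow are just vectors of floats.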
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB,ComplementNB,BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,VotingClassifier,ExtraTreesClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier as Knn
from sklearn.feature_extraction.text import TfidfVectorizer
#Custom class
from utilities.db_handler import *
from utilities.utils import *
from multiprocessing import Process,Manager
import json
import pickle as pkl
import os
import numpy as np
import pandas as pd
manager = Manager()
return_dict = manager.dict()
# Use a shared variable in order to get the results
proc = []
fncs1 = [random_forest_classification,SVC_classification,LinearSVC_classification,MultinomialNB_classification,LogisticRegression_classification]
fncs2 = [bernoulliNB_classification,GradientBoosting_classification,AdaBoost_classification,VotingClassifier_classification,ComplementNB_classification,ExtrExtraTrees_classification]
# Instantiate two sets of processes with their arguments. Each function
# writes its result to return_dict
for fn in fncs1:
    p = Process(target=fn,args=(train_bow,test_bow,label_train,label_test,return_dict))
    proc.append(p)
    p.start()
for p in proc:
    p.join()
for fn in fncs2:
    p = Process(target=fn,args=(train_bow,test_bow,label_train,label_test,return_dict))
    proc.append(p)
    p.start()
for p in proc:
    p.join()
# then pick the best of the results from return_dict and save them
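This code gives me some warnings that come from the algorithms themselves, but it does not show any error or warning related to multiprocessing.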
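
To make the setup concrete, the pattern described above (one Process per function, all writing into a Manager dict) can be reduced to the following minimal sketch. It is only an illustration under assumptions: `mean_score`, `max_score`, and `run_parallel` are hypothetical stand-ins for the `*_classification` functions in `utilities`, which are not shown here, and the toy list of floats replaces `train_bow`/`test_bow`. It uses only the standard library.

```python
from multiprocessing import Manager, Process


def mean_score(name, values, results):
    """Hypothetical stand-in for one *_classification function."""
    results[name] = sum(values) / len(values)


def max_score(name, values, results):
    """Second hypothetical worker, so more than one process is started."""
    results[name] = max(values)


def run_parallel(workers, values):
    """Start one process per worker and collect results in a shared dict."""
    with Manager() as manager:
        results = manager.dict()
        procs = [
            Process(target=fn, args=(fn.__name__, values, results))
            for fn in workers
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy into a plain dict before the manager process shuts down.
        return dict(results)


if __name__ == "__main__":
    # The guard matters when child processes are started by re-importing
    # this module (the "spawn" start method, e.g. on Windows): unguarded
    # process-creation code at module level would run again in each child.
    print(run_parallel([mean_score, max_score], [0.25, 0.5, 0.75]))
```

In this sketch each worker receives the shared dict as its last argument, the same way `return_dict` is passed in the code above, and the processes are created inside a function called under the `__main__` guard rather than at module level.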