Filling ndarrays in parallel from a pandas Series and a CSR matrix

2024-05-02 • Q&A

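I currently use a for loop to fill ndarrays with values taken from a pandas Series (category/object dtype) and a CSR matrix (numpy-backed), and I have been trying to speed it up.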

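What I have tried so far: a sequential loop (works), numba (does not like Series and strings), joblib (slower than the sequential loop), and swifter.apply (much slower, because I have to go through pandas, although it does parallelize).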

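import string

import numpy as np
import pandas as pd
from scipy.sparse import rand

nr_matches = 10**5

# Random 10-character names. The original post used pd.util.testing.rands_array,
# which is not available in recent pandas versions, so the strings are built with numpy.
rng = np.random.default_rng()
letters = np.array(list(string.ascii_letters + string.digits))
name_vector = pd.Series(["".join(chars) for chars in rng.choice(letters, size=(nr_matches, 10))])

# Sparse similarity matrix and the (row, col) positions of its nonzero entries
matches = rand(nr_matches, 10, density=0.2, format='csr')
non_zeros = matches.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]

# Output arrays: two object arrays for the names, one float array for the scores
left_side = np.empty([nr_matches], dtype=object)
right_side = np.empty([nr_matches], dtype=object)
similarity = np.zeros(nr_matches)

# Note: matches holds about 2 * nr_matches stored entries (density 0.2 x 10 columns),
# so this loop only copies the first nr_matches of them.
for index in range(0, nr_matches):
    left_side[index] = name_vector.iat[sparserows[index]]
    right_side[index] = name_vector.iat[sparsecols[index]]
    similarity[index] = matches.data[index]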

There is no error message, but this is slow because it only uses a single thread!
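One direction that may be worth trying before parallelizing, and which is not part of the original post: the loop body is a pure element-wise lookup, so it can probably be replaced by NumPy integer-array indexing. The sketch below is a minimal illustration under that assumption. The helper name `fill_vectorized` and the small demo data are hypothetical; it relies on `tocoo()` returning `row`, `col` and `data` arrays that are aligned entry by entry.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def fill_vectorized(name_vector, matches, n):
    """Return (left_side, right_side, similarity) for the first n stored entries.

    Hypothetical helper: replaces the per-element loop with array indexing.
    """
    coo = matches.tocoo()  # coo.row, coo.col, coo.data are aligned per entry
    names = name_vector.to_numpy(dtype=object)  # plain object array of names
    left_side = names[coo.row[:n]]    # name looked up by row index
    right_side = names[coo.col[:n]]   # name looked up by column index
    similarity = coo.data[:n].copy()  # matching stored values
    return left_side, right_side, similarity


# Tiny demo input (made up for illustration)
demo_names = pd.Series(["a", "b", "c", "d"])
demo_matches = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
]))
left, right, sim = fill_vectorized(demo_names, demo_matches, 4)
```

With the variables from the question, the call would be `fill_vectorized(name_vector, matches, nr_matches)`. Whether this is fast enough to make multiple threads unnecessary depends on the data size, so it should be timed against the loop on real input.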


calvin1 answered: Filling ndarrays in parallel from a pandas Series and a CSR matrix
