Context
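There are several questions about how to train gensim's Word2Vec with streamed data. However, none of them addresses the problem that streaming cannot use multiple workers, because there is no array to split between the threads.
So I wanted to create a generator that provides this functionality for gensim. My result looks like this: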
from gensim.models import Word2Vec as w2v

# The data is stored in a Python list and is not split.
# It is too much data to store in split form, so I have to do the split while streaming.
data = ['this is document one','this is document two',...]

# Now the generator class
import threading
import numpy as np  # needed for np.ceil in __len__

class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size: int = 40):
        """Initialize generator and pass data."""
        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()

    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))

    def __iter__(self) -> list([]):
        """
        Iterator wrapper for generator functionality (since generators cannot be used directly).
        Allows for data streaming.
        """
        for idx in range(len(self)):
            yield self[idx]

    def __getitem__(self, idx):
        # Hold the lock so that batch access is thread-safe
        with self.lock:
            # Return the current batch by slicing the data and splitting each document.
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]

# And now do the training
model = w2v(
    sentences=dataGenerator(data), size=300, window=5, min_count=1, workers=4
)
This results in the error
TypeError: unhashable type: 'list'
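The same error can be reproduced without gensim. The sketch below assumes that a vocabulary scan uses every token as a dictionary key. `count_vocab` is a hypothetical stand-in written for illustration, not gensim's actual code. It shows that one tokenized document per item works, while one batch per item turns each "token" into a list, which cannot be hashed.

```python
from collections import defaultdict


def count_vocab(sentences):
    """Hypothetical stand-in for a vocabulary scan: count every token."""
    counts = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            counts[word] += 1  # a token must be hashable to serve as a dict key
    return dict(counts)


docs = ["this is document one", "this is document two"]

# One tokenized document per item: every token is a str, so counting works.
per_document = [doc.split(" ") for doc in docs]
vocab = count_vocab(per_document)

# One batch per item: every "token" is now a whole list of tokens.
per_batch = [per_document]
try:
    count_vocab(per_batch)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If this assumption holds, the error comes from the extra nesting level of the batches rather than from the use of multiple workers.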
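Since dataGenerator(data) works if I yield only a single split document at a time, I assume that gensim's Word2Vec wraps the generator in an additional list. In that case, __iter__ would look like this: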
def __iter__(self) -> list:
    """
    Iterator wrapper for generator functionality (since generators cannot be used directly).
    Allows for data streaming.
    """
    for text in self.data:
        yield text.split(" ")
Hence, my batches would also be wrapped, resulting in something like [[['this','...'],['this','...']],[[...],[...]]] (=> a list of lists of lists), which cannot be processed by gensim.
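One way to keep the batch-wise splitting while avoiding the extra nesting level is to flatten each batch before yielding. The following is a minimal sketch, not a confirmed solution. `SentenceStream` is a hypothetical helper name, and it assumes that the consumer expects a restartable iterable whose items are plain lists of string tokens.

```python
class SentenceStream:
    """Restartable iterable that yields one tokenized document at a time.

    Hypothetical helper: documents are still split batch by batch, but each
    batch is flattened so that the consumer only ever sees list[str] items.
    """

    def __init__(self, data, batch_size=40):
        self.data = data
        self.batch_size = batch_size

    def __iter__(self):
        # A fresh generator is created on every call, so the object can be
        # iterated several times (e.g. one pass per training epoch).
        for start in range(0, len(self.data), self.batch_size):
            batch = self.data[start:start + self.batch_size]
            tokenized = [text.split(" ") for text in batch]  # split one batch
            yield from tokenized  # hand out documents individually


docs = ["this is document one", "this is document two", "this is document three"]
stream = SentenceStream(docs, batch_size=2)
first_pass = list(stream)
second_pass = list(stream)
```

An object like this could be passed as `sentences=SentenceStream(data)`. As far as I know, gensim groups incoming sentences into jobs for its worker threads on its own (see the `batch_words` parameter), so batching on the caller's side may not be needed for `workers` to take effect.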
My question:
Can I pass batches in a streaming fashion in order to use multiple workers? How can I change my code accordingly?