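I am currently working with a Pandas DataFrame. I iterate over the rows of each column and assign a dtype to the column based on how many of its values are of a given data type. Suppose I have a DataFrame like this: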
     column1    column2   column3
0    1.43816       lots    1.9837
1   -0.28378         of   0.01758
2   0.552564     string  0.257276
3      dummy     inthis  -1.34906
4     string     column   1.33308
5   0.944862  -0.657849    dadada
My code is as follows (a working example without Dask):
import numpy as np
import pandas as pd

def is_number(column, column_length):
    # Count the values that are real ints or floats.
    count = 0
    for row in column:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(row, (int, np.integer)) and not isinstance(row, (bool, np.bool_)):
            count += 1
        elif isinstance(row, (float, np.floating)):
            count += 1
    # Convert the column if at least 51% of its values are numeric;
    # entries that cannot be parsed become NaN.
    if count >= column_length * 0.51:
        column = pd.to_numeric(column, errors='coerce')
    return column

data = {'column1': [1.438161, -0.283780, 0.552564, 'dummy', 'string', 0.944862],
        'column2': ['lots', 'of', 'string', 'inthis', 'column', -0.657849],
        'column3': [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, 'dadada']}
df = pd.DataFrame(data)
print(df)
print(df.dtypes)

column_names = df.columns
for column in column_names:
    column_length = len(df[column])
    df[column] = is_number(df[column], column_length)
print(df.dtypes)
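For comparison, the same 51% rule can probably be written without the explicit row loop. The following is only a minimal sketch under that assumption: the helper names numeric_share and coerce_if_mostly_numeric are hypothetical, and the data is the sample frame from above.

```python
import numpy as np
import pandas as pd


def numeric_share(column: pd.Series) -> float:
    """Fraction of values that are real int/float objects (bools excluded)."""
    def is_num(value):
        return (isinstance(value, (int, float, np.integer, np.floating))
                and not isinstance(value, (bool, np.bool_)))

    if len(column) == 0:
        return 0.0
    # Element-wise boolean mask; its mean is the share of numeric values.
    return float(column.map(is_num).astype(bool).mean())


def coerce_if_mostly_numeric(column: pd.Series, threshold: float = 0.51) -> pd.Series:
    """Convert the column to numeric if enough of its values are numbers."""
    if numeric_share(column) >= threshold:
        # Values that cannot be parsed become NaN.
        return pd.to_numeric(column, errors="coerce")
    return column


df = pd.DataFrame({
    "column1": [1.438161, -0.283780, 0.552564, "dummy", "string", 0.944862],
    "column2": ["lots", "of", "string", "inthis", "column", -0.657849],
    "column3": [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, "dadada"],
})
converted = pd.DataFrame(
    {name: coerce_if_mostly_numeric(df[name]) for name in df.columns}
)
print(converted.dtypes)
```

Expressing the check as a mask plus a count matters mostly because a count can be computed chunk by chunk and summed, which is the shape a partitioned computation needs.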
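Because my actual data is very large, I want to use Dask to improve scalability and reduce memory usage: rather than loading the whole dataset into memory, I would determine each column's dtype and then discard the data (this should also speed up the process). However, when I try to iterate over the rows of the Dask DataFrame, the line "for row in column" raises an error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Dask DataFrames do not support row slicing. How can I achieve the same result with a Dask DataFrame? I am also considering splitting the DataFrame by column and processing the columns in parallel. How can I parallelize this for loop in Dask (dask.distributed, since I am thinking of using a cluster of machines)?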
And here is my "non-working" Dask code:
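import numpy as np
import pandas as pd
import dask.dataframe as dd

def is_number(column, column_length):
    count = 0
    for row in column:  # fails here: a Dask Series cannot be iterated like this
        if isinstance(row, (int, np.integer)) and not isinstance(row, (bool, np.bool_)):
            count += 1
        elif isinstance(row, (float, np.floating)):
            count += 1
    if count >= column_length * 0.51:
        column = pd.to_numeric(column, errors='coerce')
    return column

data = {'column1': [1.438161, -0.283780, 0.552564, 'dummy', 'string', 0.944862],
        'column2': ['lots', 'of', 'string', 'inthis', 'column', -0.657849],
        'column3': [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, 'dadada']}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=8)
df = df.repartition(partition_size="100MB")
print(df)
print(df.dtypes)
column_names = df.columns
for column in column_names:
    column_length = len(df[column])
    df[column] = is_number(df[column], column_length)
print(df.dtypes)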
And the full traceback:
Traceback (most recent call last):
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py", line 28, in <module>
    df[column] = is_number(df[column],column_length)
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py", line 7, in is_number
    for row in column:
  File "/home/dodzilla-ai/Projects/project/venv/lib/python3.6/site-packages/dask/dataframe/core.py", line 2673, in __getitem__
    "Series getitem in only supported for other series objects "
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure