Detecting and assigning column dtypes in a Dask DataFrame

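I am currently working with a Pandas DataFrame. For each column, I iterate over its rows, count how many values are numeric, and assign the column's dtype based on that count. Suppose I have a DataFrame like this: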

    column1   column2   column3
0   1.43816      lots    1.9837
1  -0.28378        of   0.01758
2  0.552564    string  0.257276
3     dummy    inthis  -1.34906
4    string    column   1.33308
5  0.944862 -0.657849    dadada

My code is as follows (a working example without Dask):

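import numpy as np
import pandas as pd

def is_number(column, column_length):
    # Count the cells that hold a real int or float (booleans are excluded).
    # np.int and np.float were removed from NumPy, so test against the
    # builtin types and NumPy's abstract scalar types instead.
    count = 0
    for row in column:
        if isinstance(row, (int, np.integer)) and not isinstance(row, bool):
            count += 1
        elif isinstance(row, (float, np.floating)):
            count += 1
    # Coerce the column to numeric when at least 51% of its cells are numbers.
    if count >= column_length * 0.51:
        column = pd.to_numeric(column, errors='coerce')
    return column

# Every column needs the same number of values (six rows, as in the table above).
data = {
    'column1': [1.438161, -0.283780, 0.552564, 'dummy', 'string', 0.944862],
    'column2': ['lots', 'of', 'string', 'inthis', 'column', -0.657849],
    'column3': [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, 'dadada'],
}
df = pd.DataFrame(data)
print(df)
print(df.dtypes)
column_names = df.columns
for column in column_names:
    column_length = len(df[column])
    df[column] = is_number(df[column], column_length)
print(df.dtypes)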

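Because my real dataset is very large, I want to use Dask to improve scalability and reduce memory usage (and to speed up the process): the idea is to determine each column's dtype and then discard the data, without loading the entire dataset into memory. However, when I try to iterate over the rows of the Dask DataFrame, the line "for row in column:" raises NotImplementedError: Series getitem in only supported for other series objects with matching partition structure. Dask DataFrames do not support this kind of row-wise access.

How can I achieve the same result with a Dask DataFrame? I am also considering splitting the DataFrame by column and processing the columns in parallel. How can I parallelize this operation (the for loop) in Dask (dask.distributed, since I plan to use a cluster of machines)?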

Here is my non-working Dask code:

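import numpy as np
import pandas as pd
import dask.dataframe as dd

def is_number(column, column_length):
    count = 0
    for row in column:  # this loop raises the NotImplementedError shown in the traceback
        if isinstance(row, (int, np.integer)) and not isinstance(row, bool):
            count += 1
        elif isinstance(row, (float, np.floating)):
            count += 1
    if count >= column_length * 0.51:
        column = pd.to_numeric(column, errors='coerce')
    return column

data = {
    'column1': [1.438161, -0.283780, 0.552564, 'dummy', 'string', 0.944862],
    'column2': ['lots', 'of', 'string', 'inthis', 'column', -0.657849],
    'column3': [1.983704, 0.017580, 0.257276, -1.349062, 1.333079, 'dadada'],
}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=8)
df = df.repartition(partition_size="100MB")
print(df)
print(df.dtypes)
column_names = df.columns
for column in column_names:
    column_length = len(df[column])
    df[column] = is_number(df[column], column_length)
print(df.dtypes)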

And the full traceback:

Traceback (most recent call last):
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py",line 28,in <module>
    df[column] = is_number(df[column],column_length)
  File "/home/dodzilla-ai/.PyCharm2019.2/config/scratches/scratch_1.py",line 7,in is_number
    for row in column:
  File "/home/dodzilla-ai/Projects/project/venv/lib/python3.6/site-packages/dask/dataframe/core.py",line 2673,in __getitem__
    "Series getitem in only supported for other series objects "
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
