Expand a dataset based on a column value

2024-05-19 • Q&A

I have a DataFrame df1:

Date_1     Date_2       i_count c_book
01/09/2019  02/08/2019  2       204
01/09/2019  03/08/2019  2       211
01/09/2019  04/08/2019  2       218
01/09/2019  05/08/2019  2       226
01/09/2019  06/08/2019  2       234
01/09/2019  07/08/2019  2       242
01/09/2019  08/08/2019  2       251
01/09/2019  09/08/2019  2       259
01/09/2019  10/08/2019  3       269
01/09/2019  11/08/2019  3       278
01/09/2019  12/08/2019  3       288
01/09/2019  13/08/2019  3       298
01/09/2019  14/08/2019  3       308
01/09/2019  15/08/2019  3       319
01/09/2019  16/08/2019  4       330
01/09/2019  17/08/2019  4       342
01/09/2019  18/08/2019  4       354
01/09/2019  19/08/2019  4       366
01/09/2019  20/08/2019  4       379
01/09/2019  21/08/2019  5       392
01/09/2019  22/08/2019  5       406
01/09/2019  23/08/2019  6       420
01/09/2019  24/08/2019  6       435
01/09/2019  25/08/2019  7       450
01/09/2019  26/08/2019  8       466
01/09/2019  27/08/2019  9       483
01/09/2019  28/08/2019  10      500
01/09/2019  29/08/2019  11      517
01/09/2019  30/08/2019  12      535
01/09/2019  31/08/2019  14      554

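I want to expand the dataset based on i_count, which is the number of copies each row should have. For example, i_count = 2 means that row should appear 2 times in the expanded dataset.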

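I also want to create a new column c_book_i that splits c_book across the repeated rows. For example, if i_count = 2, the new DataFrame should have 2 rows for that entry, with 2 c_book_i values such that sum(c_book_i) = c_book. The last constraint is that c_book_i > 10 must hold in every case.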
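One consequence of these constraints is that a row can only be split when c_book > 10 * i_count, otherwise no set of shares can all exceed 10 and still sum to c_book. A minimal sketch of that check, assuming pandas and using three rows copied from the table above (the helper name `is_feasible` is hypothetical):

```python
import pandas as pd

# Three rows copied from the table above.
df1 = pd.DataFrame({
    "Date_1": ["01/09/2019"] * 3,
    "Date_2": ["02/08/2019", "10/08/2019", "31/08/2019"],
    "i_count": [2, 3, 14],
    "c_book": [204, 269, 554],
})

MINIMUM = 10  # every c_book_i must be strictly greater than this


def is_feasible(frame, minimum=MINIMUM):
    """Return a boolean Series: True where c_book can be split into
    i_count parts that are each strictly greater than `minimum`."""
    return frame["c_book"] > minimum * frame["i_count"]


feasible = is_feasible(df1)
```

Every row in the sample data passes this check, so the split is always possible here.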

What I have so far:

import numpy as np

def f(x):
    # Random weights, normalized to sum to 1, then scaled by c_book
    i = np.random.random(len(x))
    j = i/sum(i) * x
    return j

joined_df2 = df1.reindex(df1.index.repeat(df1['i_count']))
joined_df2['c_book_i'] = joined_df2.groupby(['Date_1','Date_2'])['c_book'].transform(f)

This gives me roughly what I want, except that it does not enforce c_book_i > 10; many of the values are less than 10.

Can anyone help?

Thanks

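import random

def f(x):
    total = x.iloc[0].astype(int)
    minimum = 10
    # Pick len(x) - 1 distinct cut points on multiples of `minimum`
    dividers = sorted(random.sample(range(minimum, total - minimum, minimum), len(x) - 1))
    # The gaps between consecutive cut points are the shares; each is at least `minimum`
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]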
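The cut-point function above produces shares that are at least 10, so a share of exactly 10 is possible. As a hedged variant of the same idea, the sketch below reserves 11 for every part and only randomizes the remainder, which keeps every integer share strictly above the minimum. The name `split_integer` and its parameters are hypothetical, and the sketch does not aim for any particular probability distribution over the possible splits.

```python
import random


def split_integer(total, parts, minimum=10, rng=random):
    """Split `total` into `parts` integers, each strictly greater than
    `minimum`, that sum exactly to `total`."""
    floor = minimum + 1            # smallest allowed integer share
    spare = total - parts * floor  # what is left after reserving the floor
    if spare < 0:
        raise ValueError("total is too small for the requested number of parts")
    # Choose parts - 1 cut points in [0, spare] (repeats allowed) and
    # use the gaps between them as the extra amount for each part.
    cuts = sorted(rng.randint(0, spare) for _ in range(parts - 1))
    bounds = [0] + cuts + [spare]
    extras = [b - a for a, b in zip(bounds, bounds[1:])]
    return [floor + extra for extra in extras]


# Example: the last row of the table (c_book = 554, i_count = 14).
shares = split_integer(554, 14)
```

Because the function returns a plain list of the right length, it could be called from a groupby transform with `total` taken from the first value of each group.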

import numpy as np

def distribute_randomly(array):
    # This is the minimum to give each:
    minimum = 10
    # This means we have to reserve this amount:
    min_value_sum = len(array)*minimum
    # The rest we can distribute (c_book is repeated on every row of the
    # group, so take it once rather than summing the repeats):
    to_distribute = array.iloc[0] - min_value_sum
    # Get random values that all sum up to 1:
    random_values = np.random.rand(len(array))
    random_values = random_values/random_values.sum()
    # Return the minimum + a part of what is left to distribute
    return random_values*to_distribute + minimum

# Expand rows based on length of i_count:
df1 = df1.join(df1['i_count'].apply(lambda x: range(x)).explode().rename('dummy'))
# Transform c_book to randomize, one group per original row (index label):
df1['c_book_2'] = df1.groupby(level=0)['c_book'].transform(distribute_randomly)
# Finally make sure they are not below 10:
df1['c_book_i'] = df1['c_book_2'].where(df1['c_book_2']>10,10)
# If needed:
df1 = df1.reset_index()
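For reference, the reserve-then-distribute idea can be put together end to end and checked against the constraints in the question. This is a minimal sketch, not a definitive implementation: it assumes three rows copied from the table, groups by a helper column `source_row` (a hypothetical name) so that each original row is split on its own, and seeds the generator only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table in the question.
df1 = pd.DataFrame({
    "Date_1": ["01/09/2019"] * 3,
    "Date_2": ["02/08/2019", "10/08/2019", "31/08/2019"],
    "i_count": [2, 3, 14],
    "c_book": [204, 269, 554],
})

MINIMUM = 10
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable


def split_with_floor(group):
    """Split the group's c_book (identical on every repeated row) into
    len(group) float shares, each at least MINIMUM."""
    total = group.iloc[0]
    spare = total - MINIMUM * len(group)
    weights = rng.random(len(group))
    weights = weights / weights.sum()
    return MINIMUM + weights * spare  # NumPy array, one share per row


# Repeat each row i_count times and remember which row it came from.
expanded = (
    df1.loc[df1.index.repeat(df1["i_count"])]
    .rename_axis("source_row")
    .reset_index()
)
expanded["c_book_i"] = (
    expanded.groupby("source_row")["c_book"].transform(split_with_floor)
)

# Check the constraints: the shares of each original row sum to its c_book.
sums = expanded.groupby("source_row")["c_book_i"].sum()
```

Each share is the minimum plus a non-negative part of the remainder, so it is strictly above 10 whenever its random weight is nonzero, and no clipping step is needed that would disturb the per-row sums.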


qq845402721 answered: Expand a dataset based on a column value
