How to iterate over a DataFrame using a publish date column to build a daily mapping table

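I am using Python 3.7 for this task. I have a DataFrame that stores blog post IDs, blog names, and publish dates. I need to convert it into a new DataFrame that maps every day and URL (the Blog Name column in the example) to the ID of the post that was current on that day. The mapping has to run through the previous day's date (20191103 at the time of writing). Assumptions: a post cannot be republished twice on the same day, and no dates are needed before a blog's first publish date.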

Example:

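import pandas as pd

# Every row needs all three fields: ID, Blog Name, Publish Date
data = [[1234, 'Blog1', 20191030], [1235, 'Blog1', 20191101], [1237, 'Blog1', 20191102], [1236, 'Blog2', 20191101]]
df = pd.DataFrame(data, columns=['ID', 'Blog Name', 'Publish Date'])
df.head()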

Starting DataFrame:

     ID  Blog Name  Publish Date
0   1234    Blog1   20191030
1   1235    Blog1   20191101
2   1237    Blog1   20191102
3   1236    Blog2   20191101

Goal (final DataFrame):

   Blog Name  Date    ID
0   Blog1   20191030 1234
1   Blog1   20191031 1234
2   Blog1   20191101 1235
3   Blog1   20191102 1237
4   Blog1   20191103 1237
5   Blog2   20191101 1236
6   Blog2   20191102 1236
7   Blog2   20191103 1236

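I am mainly unsure how best to iterate over the DataFrame. Should I add a "next publish date" column to the original DataFrame, and then create one row in the new DataFrame for every date between "publish date" and "next publish date"?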
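The "next publish date" idea can be sketched in plain Python to make the logic explicit before choosing a pandas idiom. This is only an illustration under the question's stated assumptions (at most one post per blog per day); the names `parse` and `daily_mapping` are hypothetical, and the end date is fixed at 20191103 so the output is reproducible.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def parse(yyyymmdd):
    """Turn an integer such as 20191030 into a datetime.date."""
    return datetime.strptime(str(yyyymmdd), "%Y%m%d").date()


def daily_mapping(posts, end_date):
    """posts: iterable of (post_id, blog_name, publish_date) tuples.

    Returns (blog_name, 'YYYYMMDD', post_id) rows, one per blog per day,
    from each blog's first publish date through end_date.
    """
    rows = []
    ordered = sorted(posts, key=lambda p: (p[1], p[2]))
    for blog, group in groupby(ordered, key=itemgetter(1)):
        group = list(group)
        for i, (post_id, _, published) in enumerate(group):
            # A post stays current until the day before the next post,
            # or until end_date if it is the blog's latest post.
            if i + 1 < len(group):
                last_day = group[i + 1][2] - timedelta(days=1)
            else:
                last_day = end_date
            day = published
            while day <= last_day:
                rows.append((blog, day.strftime("%Y%m%d"), post_id))
                day += timedelta(days=1)
    return rows


posts = [
    (1234, "Blog1", parse(20191030)),
    (1235, "Blog1", parse(20191101)),
    (1237, "Blog1", parse(20191102)),
    (1236, "Blog2", parse(20191101)),
]
rows = daily_mapping(posts, end_date=parse(20191103))
for row in rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=['Blog Name', 'Date', 'ID'])` if a DataFrame is needed. For large inputs a vectorized pandas approach is likely to be faster than this explicit loop.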

Solution (provided by Code Different):

Answer by my183300:

This is an ideal use case for merge_asof:

import pandas as pd

# The Publish Date column holds integers such as 20191030; convert it to Timestamps
df['Publish Date'] = pd.to_datetime(df['Publish Date'].astype(str), format='%Y%m%d')

def summarize(g):
    # A daily date range from the blog's first Publish Date through the current day
    d = pd.date_range(g['Publish Date'].min(), pd.Timestamp.now(), name='Publish Date').to_frame(index=False)

    # merge_asof matches each day to the most recent post published on or
    # before it, which fills the gaps between publish dates
    return pd.merge_asof(d, g, on='Publish Date')


# merge_asof requires the key to be sorted, so sort before grouping
df.sort_values(['Blog Name', 'Publish Date']) \
    .groupby('Blog Name').apply(summarize) \
    .reset_index(drop=True)

Result (assuming today = 2019-11-04):

  Publish Date    ID Blog Name
0   2019-10-30  1234     Blog1
1   2019-10-31  1234     Blog1
2   2019-11-01  1235     Blog1
3   2019-11-02  1237     Blog1
4   2019-11-03  1237     Blog1
5   2019-11-04  1237     Blog1
6   2019-11-01  1236     Blog2
7   2019-11-02  1236     Blog2
8   2019-11-03  1236     Blog2
9   2019-11-04  1236     Blog2
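
Note that this result runs through the current day (2019-11-04), while the question asks for the mapping to stop at the previous day (20191103). The following is a minimal sketch of one way to apply that cutoff. It swaps in a different technique, `reindex` with `method='ffill'`, instead of `merge_asof`, and it assumes at most one post per blog per day, as stated in the question. The variable names `end`, `parts`, and `daily` are illustrative, and `end` is hard-coded so the output is reproducible.

```python
import pandas as pd

data = [[1234, 'Blog1', 20191030], [1235, 'Blog1', 20191101],
        [1237, 'Blog1', 20191102], [1236, 'Blog2', 20191101]]
df = pd.DataFrame(data, columns=['ID', 'Blog Name', 'Publish Date'])
df['Publish Date'] = pd.to_datetime(df['Publish Date'].astype(str), format='%Y%m%d')

# Assumed cutoff for this example. In a live job this could instead be
# pd.Timestamp.now().normalize() - pd.Timedelta(days=1), i.e. yesterday.
end = pd.Timestamp('2019-11-03')

parts = []
for blog, g in df.groupby('Blog Name'):
    # One entry per day from the blog's first post through the cutoff
    days = pd.date_range(g['Publish Date'].min(), end)
    filled = (
        g.set_index('Publish Date')['ID']
         .sort_index()
         .reindex(days, method='ffill')  # carry the latest post ID forward
         .rename_axis('Date')
         .reset_index()
    )
    filled.insert(0, 'Blog Name', blog)
    parts.append(filled)

daily = pd.concat(parts, ignore_index=True)
print(daily)
```

With the sample data this yields eight rows, ending on 2019-11-03 for both blogs, which matches the goal table in the question.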