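This is a bit challenging only because "the first Sunday in April of any year" takes a few steps to compute. There are several ways to approach it. Here I first compute, for each row, the target date in that row's own year and in the following year, then take the minimum absolute difference to either target. The previous year's target never needs checking: April falls in the first part of the year, so no date is ever closer to the previous year's April than to its own year's.
As a first step, I'll extend your MWE with random offsets (+/- 30 minutes) and a longer time series. I also added a value column so that df displays as a frame (the session assumes numpy imported as np and pandas as pd):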
In [26]: df = pd.DataFrame(
    ...:     {'val': np.arange(24*366*10)},
    ...:     index=(
    ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
    ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
    ...:     ),
    ...: )
In [27]: df
Out[27]:
val
2010-01-01 00:29:00 0
2010-01-01 01:09:00 1
2010-01-01 01:43:00 2
2010-01-01 03:14:00 3
2010-01-01 03:54:00 4
... ...
2020-01-08 18:31:00 87835
2020-01-08 20:21:00 87836
2020-01-08 20:54:00 87837
2020-01-08 21:47:00 87838
2020-01-08 23:11:00 87839
Next, I find the first Sunday in April (at 2 AM) for the year of each row:
In [28]: apr1 = pd.to_datetime({'year': df.index.year,'month': 4,'day': 1,'hour': 2})
In [29]: apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday,unit='day')
In [30]: apr_first_sun
Out[30]:
0 2010-04-04 02:00:00
1 2010-04-04 02:00:00
2 2010-04-04 02:00:00
3 2010-04-04 02:00:00
4 2010-04-04 02:00:00
...
87835 2020-04-05 02:00:00
87836 2020-04-05 02:00:00
87837 2020-04-05 02:00:00
87838 2020-04-05 02:00:00
87839 2020-04-05 02:00:00
Length: 87840,dtype: datetime64[ns]
The same calculation, shifted by one year, gives the following year's target:
In [31]: apr1 = pd.to_datetime({'year': df.index.year + 1,'month': 4,'day': 1,'hour': 2})
In [32]: next_apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday,unit='day')
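As a sanity check on the weekday arithmetic above, the same rule can be written for a single year with the standard library. This is only a minimal sketch; first_sunday_of_april is a hypothetical helper name, not something used in the session above.

```python
from datetime import datetime, timedelta


def first_sunday_of_april(year: int, hour: int = 2) -> datetime:
    """Return the first Sunday in April of `year` at the given hour."""
    apr1 = datetime(year, 4, 1, hour)
    # weekday(): Monday == 0 ... Sunday == 6, so 6 - weekday() is the number
    # of days from April 1 to the first Sunday (0 if April 1 is a Sunday).
    return apr1 + timedelta(days=6 - apr1.weekday())


for year in (2010, 2012, 2013, 2020):
    print(year, first_sunday_of_april(year))
```

The offset is always between 0 and 6 days, so the result always falls on April 1 through April 7.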
Next, find the absolute difference to whichever target is nearer:
In [36]: nearer_abs_diff = np.minimum(abs(df.index - apr_first_sun.values),abs(df.index - next_apr_first_sun.values))
In [37]: nearer_abs_diff
Out[37]:
TimedeltaIndex(['93 days 01:31:00', '93 days 00:51:00', '93 days 00:17:00',
                '92 days 22:46:00', '92 days 22:06:00', '92 days 20:54:00',
                '92 days 20:23:00', '92 days 19:25:00', '92 days 18:12:00',
                '92 days 16:48:00',
                ...
                '87 days 12:19:00', '87 days 11:12:00', '87 days 09:36:00',
                '87 days 08:31:00', '87 days 07:36:00', '87 days 07:29:00',
                '87 days 05:39:00', '87 days 05:06:00', '87 days 04:13:00',
                '87 days 02:49:00'],
               dtype='timedelta64[ns]', length=87840, freq=None)
Finally, find the positional index of the minimum absolute difference and use it to index the dataframe:
In [38]: idx = np.argmin(nearer_abs_diff)
In [39]: df.iloc[idx]
Out[39]:
val 37346
Name: 2014-04-06 02:14:00,dtype: int64
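Based on your comment, it seems you can rely on having data within an hour of the time you want (2 AM on the first Sunday in April) in every year. In that case, you can take a simpler approach.
Using an example dataset with varying times: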
In [4]: df = pd.DataFrame(
   ...:     {'val': np.arange(24*366*10)},
   ...:     index=(
   ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
   ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
   ...:     ),
   ...: )
In [5]: df
Out[5]:
val
2010-01-01 00:14:00 0
2010-01-01 01:20:00 1
2010-01-01 01:46:00 2
2010-01-01 03:20:00 3
2010-01-01 03:51:00 4
... ...
2020-01-08 18:48:00 87835
2020-01-08 19:46:00 87836
2020-01-08 21:07:00 87837
2020-01-08 22:06:00 87838
2020-01-08 23:11:00 87839
[87840 rows x 1 columns]
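We can filter on the time rounded to the nearest 2 hours. The mask keeps rows that fall in April, within the first seven days of the month, on a Sunday, and whose time rounds to 02:00 (that is, within an hour of 2 AM):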
within_an_hour = df[
(df.index.month==4)
& (df.index.day<=7)
& (df.index.day_of_week == 6)
& (df.index.round('2H').hour == 2)
]
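To see why rounding to 2 hours selects times within an hour of 02:00, here is a small check on a handful of hypothetical timestamps (they are made up for illustration and are not part of the dataset above). It passes pd.offsets.Hour(2), the offset-object form of '2H', to round.

```python
import pandas as pd

# Hypothetical timestamps on Sunday 2014-04-06, chosen to straddle
# the 01:00 and 03:00 boundaries.
times = pd.DatetimeIndex([
    "2014-04-06 00:59",
    "2014-04-06 01:01",
    "2014-04-06 02:59",
    "2014-04-06 03:01",
])

# Round each timestamp to the nearest multiple of 2 hours.
rounded = times.round(pd.offsets.Hour(2))

# True where the rounded time is 02:00, i.e. the original time
# lies within an hour of 02:00.
mask = rounded.hour == 2

print(pd.DataFrame({"rounded": rounded, "keep": mask}, index=times))
```

Only the two middle timestamps survive: 00:59 rounds down to 00:00 and 03:01 rounds up to 04:00.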
We can then select the closest index for each year by taking the minimum absolute difference from the 2-hour rounded value:
In [15]: closest_indices = (
...: within_an_hour
...: .groupby(within_an_hour.index.year)
...: .apply(
...: lambda x: x.index.values[np.argmin(abs(x.index - x.index.round('2H')))]
...: )
...: )
In [16]: closest_indices
Out[16]:
2010 2010-04-04 02:17:00
2011 2011-04-03 02:22:00
2012 2012-04-01 01:49:00
2013 2013-04-07 01:39:00
2014 2014-04-06 02:01:00
2015 2015-04-05 01:58:00
2016 2016-04-03 02:12:00
2017 2017-04-02 01:54:00
2018 2018-04-01 02:22:00
2019 2019-04-07 02:13:00
dtype: datetime64[ns]
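The result above holds timestamps rather than rows. One way to get the matching rows is to pass those timestamps to df.loc. The sketch below shows this end to end on a small hypothetical frame (the data is invented for illustration), and it swaps in groupby(...).idxmin() as a variation on the apply/argmin step above; treat it as an illustration, not as the method used in the session.

```python
import pandas as pd

# Hypothetical toy data: a few readings around the first Sundays of
# April 2014 and 2015, plus two rows that the filter should drop.
df = pd.DataFrame(
    {"val": range(7)},
    index=pd.DatetimeIndex([
        "2014-04-05 02:00",  # Saturday, dropped
        "2014-04-06 01:40",
        "2014-04-06 02:05",
        "2014-04-06 02:50",
        "2014-04-13 02:00",  # second Sunday, dropped
        "2015-04-05 01:58",
        "2015-04-05 02:30",
    ]),
)

two_hours = pd.offsets.Hour(2)
within_an_hour = df[
    (df.index.month == 4)
    & (df.index.day <= 7)
    & (df.index.dayofweek == 6)
    & (df.index.round(two_hours).hour == 2)
]

# Distance of each remaining row from its rounded (02:00) target, in seconds.
target = within_an_hour.index.round(two_hours)
offset_seconds = pd.Series(
    abs(within_an_hour.index - target).total_seconds().to_numpy(),
    index=within_an_hour.index,
)

# idxmin returns the index label (a timestamp) of the smallest offset per year.
closest = offset_seconds.groupby(within_an_hour.index.year).idxmin()

# Look the timestamps up in the original frame to recover the rows.
rows = df.loc[closest.tolist()]
print(rows)
```

Working in seconds keeps the grouped reduction on plain floats; the timestamps come back as the index labels of the minima.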