Chunking performance in xarray / dask when opening and re-chunking a dataset

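Should the chunking behaviour in xarray / dask be similar in the following two use cases?

(a) when opening a dataset from a netCDF file with the chunks option;

(b) when re-chunking an existing dataset with the Dataset.chunk method.

I am interested in how chunking affects slicing along different dimensions. In my case the performance is very different; compare (Case 1) and (Case 3) below:

(Case 1): Open the dataset with a single chunk along the station dimension (slicing one time is fast)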

In [1]: import xarray as xr

In [2]: dset = xr.open_dataset(
    ...:     "/tmp/spectra.nc",
    ...:     chunks={"station": None}
    ...: )

In [3]: dset
Out[3]: 
<xarray.Dataset>
Dimensions:       (direction: 24,frequency: 25,station: 14048,time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time,station) float32 dask.array<chunksize=(249,14048),meta=np.ndarray>
    latitude      (time,station) float32 dask.array<chunksize=(249,14048),meta=np.ndarray>
    efth          (time,station,frequency,direction) float32 dask.array<chunksize=(249,14048,25,24),meta=np.ndarray>

In [4]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 171 ms,sys: 49.2 ms,total: 220 ms
Wall time: 219 ms

(Case 2): Open the dataset with many size=1 chunks along the station dimension (slicing one time is slow, slicing one station is fast)

In [5]: dset = xr.open_dataset(
    ...:     "/tmp/spectra.nc",
    ...:     chunks={"station": 1}
    ...: )

In [6]: dset
Out[6]: 
<xarray.Dataset>
Dimensions:       (direction: 24,frequency: 25,station: 14048,time: 249)
...

In [7]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 13.1 s,sys: 1.94 s,total: 15 s
Wall time: 11.1 s
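
The difference between slicing one time and slicing one station in Case 2 can be estimated by counting how many chunks a selection intersects. The following is a minimal sketch in plain Python; the helper name `chunks_touched` is hypothetical (it is not part of xarray or dask), and the shape is the (time, station) shape from the dataset repr above.

```python
from math import prod


def chunks_touched(shape, chunk_shape, index):
    """Count the chunks intersected by a selection.

    `index` maps an axis number to an integer position (like isel on that
    axis); axes that are not listed are selected in full.
    """
    counts = []
    for axis, (size, chunk) in enumerate(zip(shape, chunk_shape)):
        n_chunks = -(-size // chunk)  # ceiling division
        # An integer index falls inside exactly one chunk along that axis.
        counts.append(1 if axis in index else n_chunks)
    return prod(counts)


shape = (249, 14048)       # (time, station), from the dataset repr
one_chunk = (249, 14048)   # Case 1: a single chunk along station
per_station = (249, 1)     # Case 2: one chunk per station

print(chunks_touched(shape, one_chunk, {0: 0}))    # isel(time=0), Case 1 -> 1
print(chunks_touched(shape, per_station, {0: 0}))  # isel(time=0), Case 2 -> 14048
print(chunks_touched(shape, per_station, {1: 0}))  # isel(station=0), Case 2 -> 1
```

With one chunk per station, a single time step needs every one of the 14048 chunks, while a single station needs only one.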

(Case 3): Try to re-chunk station into a single chunk (slicing one time is still slow; should it be faster?)

In [8]: dset = dset.chunk({"station": None})

In [8]: dset
Out[8]: 
<xarray.Dataset>
Dimensions:       (direction: 24,frequency: 25,station: 14048,time: 249)
...

In [9]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 9.06 s,sys: 1.13 s,total: 10.2 s
Wall time: 7.7 s
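
One way to look into this is to compare the size of the dask task graph before and after re-chunking. The sketch below uses a synthetic `dask.array` rather than the file above, with a hypothetical 200 stations instead of 14048, and it passes -1 (dask's "full axis" value) rather than None. It only illustrates that re-chunking a lazy array adds tasks on top of the existing per-chunk tasks; whether this accounts for the Case 3 timing is an assumption, not something confirmed here.

```python
import dask.array as da

# Hypothetical stand-in for a lazily opened variable: one chunk per
# "station", as in Case 2 (200 stations here, not the real 14048).
x = da.ones((249, 200), chunks=(249, 1))

# Merge the station axis into a single chunk; -1 means "the full axis".
y = x.rechunk({1: -1})

# Unoptimized task graphs: the re-chunked array still carries the tasks
# that produce every original chunk, plus the tasks that merge them.
tasks_before = len(x.__dask_graph__())
tasks_after = len(y.__dask_graph__())

print("blocks before:", x.numblocks, "tasks:", tasks_before)
print("blocks after: ", y.numblocks, "tasks:", tasks_after)
```

If the same holds for a dataset opened with chunks={"station": 1}, the per-station reads would still be in the graph after Dataset.chunk, and only the resulting blocks would be merged.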

For reference, this dataset is stored on disk as netCDF-4, with size=1 chunks along the station dimension:

$ ncdump -hs /tmp/spectra.nc
netcdf spectra {
dimensions:
        time = UNLIMITED ; // (249 currently)
        station = 14048 ;
        frequency = 25 ;
        direction = 24 ;
variables:
        double time(time) ;
                time:long_name = "julian day (UT)" ;
                time:standard_name = "time" ;
                time:units = "days since 1990-01-01 00:00:00" ;
                time:_Storage = "chunked" ;
                time:_ChunkSizes = 512 ;
                time:_DeflateLevel = 9 ;
                time:_Shuffle = "true" ;
                time:_Endianness = "little" ;
        int station(station) ;
                station:long_name = "station id" ;
                station:_FillValue = -2147483647 ;
                station:_Storage = "chunked" ;
                station:_ChunkSizes = 1 ;
                station:_DeflateLevel = 9 ;
                station:_Shuffle = "true" ;
                station:_Endianness = "little" ;
        short longitude(time,station) ;
                longitude:long_name = "longitude" ;
                longitude:standard_name = "longitude" ;
                longitude:units = "degree_east" ;
                longitude:_FillValue = 9.96921e+36f ;
                longitude:scale_factor = -0.00547824f ;
                longitude:add_offset = 180.f ;
                longitude:_Storage = "chunked" ;
                longitude:_ChunkSizes = 249,1 ;
                longitude:_DeflateLevel = 9 ;
                longitude:_Shuffle = "true" ;
                longitude:_Endianness = "little" ;
        short latitude(time,station) ;
                latitude:long_name = "latitude" ;
                latitude:standard_name = "latitude" ;
                latitude:units = "degree_north" ;
                latitude:_FillValue = 9.96921e+36f ;
                latitude:scale_factor = -0.0006866874f ;
                latitude:add_offset = -54.f ;
                latitude:_Storage = "chunked" ;
                latitude:_ChunkSizes = 249,1 ;
                latitude:_DeflateLevel = 9 ;
                latitude:_Shuffle = "true" ;
                latitude:_Endianness = "little" ;
        float frequency(frequency) ;
                frequency:long_name = "frequency of center band" ;
                frequency:standard_name = "sea_surface_wave_frequency" ;
                frequency:units = "s-1" ;
                frequency:scale_factor = 1.f ;
                frequency:add_offset = 0.f ;
                frequency:_FillValue = 9.96921e+36f ;
                frequency:_Storage = "chunked" ;
                frequency:_ChunkSizes = 25 ;
                frequency:_DeflateLevel = 9 ;
                frequency:_Shuffle = "true" ;
                frequency:_Endianness = "little" ;
        float direction(direction) ;
                direction:long_name = "sea surface wave to direction" ;
                direction:standard_name = "sea_surface_wave_to_direction" ;
                direction:units = "degree" ;
                direction:scale_factor = 1.f ;
                direction:add_offset = 0.f ;
                direction:_FillValue = 9.96921e+36f ;
                direction:_Storage = "chunked" ;
                direction:_ChunkSizes = 24 ;
                direction:_DeflateLevel = 9 ;
                direction:_Shuffle = "true" ;
                direction:_Endianness = "little" ;
        short efth(time,station,frequency,direction) ;
                efth:long_name = "sea surface wave directional variance spectral density" ;
                efth:standard_name = "sea_surface_wave_directional_variance_spectral_density" ;
                efth:units = "m2 s rad-1" ;
                efth:_FillValue = 9.96921e+36f ;
                efth:scale_factor = -0.004410254f ;
                efth:add_offset = 144.5064f ;
                efth:_Storage = "chunked" ;
                efth:_ChunkSizes = 249,1,25,24 ;
                efth:_DeflateLevel = 9 ;
                efth:_Shuffle = "true" ;
                efth:_Endianness = "little" ;

// global attributes:
                :nco_openmp_thread_number = 1 ;
                :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
                :_SuperblockVersion = 2 ;
                :_Isnetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}
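
To compare the on-disk chunking with the dask chunks more easily, the `_ChunkSizes` attributes can be pulled out of the `ncdump -hs` text. This is a minimal sketch using only the standard library; the function name `parse_chunk_sizes` is hypothetical, and the excerpt holds just three of the attribute lines from the output above.

```python
import re

# Excerpt of the `ncdump -hs` output above (chunk-size lines only).
NCDUMP_EXCERPT = """
                station:_ChunkSizes = 1 ;
                longitude:_ChunkSizes = 249,1 ;
                latitude:_ChunkSizes = 249,1 ;
"""


def parse_chunk_sizes(text):
    """Return {variable name: tuple of chunk sizes} from ncdump -hs text."""
    pattern = re.compile(
        r"^\s*(\w+):_ChunkSizes\s*=\s*([\d,\s]+?)\s*;", re.MULTILINE
    )
    return {
        name: tuple(int(n) for n in sizes.split(","))
        for name, sizes in pattern.findall(text)
    }


chunk_sizes = parse_chunk_sizes(NCDUMP_EXCERPT)
for name, sizes in chunk_sizes.items():
    print(name, sizes)
```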