I'm working with some chunked data; the chunk dimensions are (1, 510, 1068). The file was originally shuffled and compressed. After observing miserable read performance, I decompressed it and, to test the hypothesis that the shuffle filter was the culprit, modified the nccopy utility to deshuffle it as well.
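For reference, the shuffle filter in question only transposes bytes: it groups byte 0 of every element, then byte 1, and so on, which tends to help the compressor. A minimal sketch of the transformation (my own illustration, not the library's implementation):

```python
import struct

def shuffle(data: bytes, elem_size: int) -> bytes:
    """HDF5-style shuffle: emit byte 0 of every element,
    then byte 1 of every element, and so on."""
    n = len(data) // elem_size
    return bytes(data[e * elem_size + b]
                 for b in range(elem_size)
                 for e in range(n))

def deshuffle(data: bytes, elem_size: int) -> bytes:
    """Inverse of shuffle(): reassemble each element's bytes."""
    n = len(data) // elem_size
    return bytes(data[b * n + e]
                 for e in range(n)
                 for b in range(elem_size))

# Three little-endian floats packed as raw bytes round-trip cleanly.
raw = struct.pack('<3f', 1.0, 2.0, 3.0)
assert deshuffle(shuffle(raw, 4), 4) == raw
```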
Unfortunately, even without shuffling, performance when reading small sections of this data remains dismal. On further digging, it appears that the NetCDF library reads in the entire field to extract just a few values. This can be seen by stracing ncks while it extracts a small subset of the data (it is observable with other software too) and noting the size of the reads:
$ strace -f ncks -d lat,0,1 -d lon,0,1 -d time,0,4 ~/public_html/pr+tasmax+tasmin_day_BCCAQ+ANUSPLIN300+MRI-CGCM3_historical+rcp85_r1i1p1_19500101-21001231.nc.sub
--- CUT ---
lseek(3, 17462119, SEEK_SET) = 17462119
read(3, "TREE\1\0\5\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\240>!\0\0\0\0\0"..., 3136) = 3136
brk(0x1e8d000) = 0x1e8d000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2178720) = 2178720
brk(0x20a1000) = 0x20a1000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2178720) = 2178720
brk(0x22b5000) = 0x22b5000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2178720) = 2178720
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2178720) = 2178720
lseek(3, 30539671, SEEK_SET) = 30539671
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2178720) = 2178720
write(1, "time[0]=0 lat[0]=41.041666665 lo"..., 74time[0]=0 lat[0]=41.041666665 lon[0]=-140.958333335 tasmin[0]=-32768 degC
) = 74
write(1, "time[0]=0 lat[0]=41.041666665 lo"..., 74time[0]=0 lat[0]=41.041666665 lon[1]=-140.875000005 tasmin[1]=-32768 degC
) = 74
write(1, "time[0]=0 lat[1]=41.124999995 lo"..., 77time[0]=0 lat[1]=41.124999995 lon[0]=-140.958333335 tasmin[1068]=-32768 degC
) = 77
write(1, "time[0]=0 lat[1]=41.124999995 lo"..., 77time[0]=0 lat[1]=41.124999995 lon[1]=-140.875000005 tasmin[1069]=-32768 degC
) = 77
write(1, "time[1]=1 lat[0]=41.041666665 lo"..., 79time[1]=1 lat[0]=41.041666665 lon[0]=-140.958333335 tasmin[544680]=-32768 degC
) = 79
write(1, "time[1]=1 lat[0]=41.041666665 lo"..., 79time[1]=1 lat[0]=41.041666665 lon[1]=-140.875000005 tasmin[544681]=-32768 degC
) = 79
write(1, "time[1]=1 lat[1]=41.124999995 lo"..., 79time[1]=1 lat[1]=41.124999995 lon[0]=-140.958333335 tasmin[545748]=-32768 degC
) = 79
write(1, "time[1]=1 lat[1]=41.124999995 lo"..., 79time[1]=1 lat[1]=41.124999995 lon[1]=-140.875000005 tasmin[545749]=-32768 degC
) = 79
write(1, "time[2]=2 lat[0]=41.041666665 lo"..., 80time[2]=2 lat[0]=41.041666665 lon[0]=-140.958333335 tasmin[1089360]=-32768 degC
) = 80
write(1, "time[2]=2 lat[0]=41.041666665 lo"..., 80time[2]=2 lat[0]=41.041666665 lon[1]=-140.875000005 tasmin[1089361]=-32768 degC
) = 80
write(1, "time[2]=2 lat[1]=41.124999995 lo"..., 80time[2]=2 lat[1]=41.124999995 lon[0]=-140.958333335 tasmin[1090428]=-32768 degC
) = 80
write(1, "time[2]=2 lat[1]=41.124999995 lo"..., 80time[2]=2 lat[1]=41.124999995 lon[1]=-140.875000005 tasmin[1090429]=-32768 degC
) = 80
write(1, "time[3]=3 lat[0]=41.041666665 lo"..., 80time[3]=3 lat[0]=41.041666665 lon[0]=-140.958333335 tasmin[1634040]=-32768 degC
) = 80
write(1, "time[3]=3 lat[0]=41.041666665 lo"..., 80time[3]=3 lat[0]=41.041666665 lon[1]=-140.875000005 tasmin[1634041]=-32768 degC
) = 80
write(1, "time[3]=3 lat[1]=41.124999995 lo"..., 80time[3]=3 lat[1]=41.124999995 lon[0]=-140.958333335 tasmin[1635108]=-32768 degC
) = 80
write(1, "time[3]=3 lat[1]=41.124999995 lo"..., 80time[3]=3 lat[1]=41.124999995 lon[1]=-140.875000005 tasmin[1635109]=-32768 degC
) = 80
write(1, "time[4]=4 lat[0]=41.041666665 lo"..., 80time[4]=4 lat[0]=41.041666665 lon[0]=-140.958333335 tasmin[2178720]=-32768 degC
) = 80
write(1, "time[4]=4 lat[0]=41.041666665 lo"..., 80time[4]=4 lat[0]=41.041666665 lon[1]=-140.875000005 tasmin[2178721]=-32768 degC
) = 80
write(1, "time[4]=4 lat[1]=41.124999995 lo"..., 80time[4]=4 lat[1]=41.124999995 lon[0]=-140.958333335 tasmin[2179788]=-32768 degC
) = 80
write(1, "time[4]=4 lat[1]=41.124999995 lo"..., 80time[4]=4 lat[1]=41.124999995 lon[1]=-140.875000005 tasmin[2179789]=-32768 degC
Note that each read() call returns a block of 2178720 bytes, which is exactly the product of the X and Y dimension sizes and the size of the storage type (510 × 1068 × 4 bytes for a float). Each read should be 8 bytes, which means roughly 270,000 times as much I/O bandwidth is being used as should be required.
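The arithmetic behind those figures can be checked directly:

```python
# Sanity-check the numbers seen in the strace output above.
chunk_y, chunk_x = 510, 1068   # spatial chunk dimensions
value_size = 4                 # bytes per stored float

chunk_bytes = chunk_y * chunk_x * value_size
assert chunk_bytes == 2178720  # matches the read() sizes in the trace

# Only 8 bytes (two adjacent floats) are actually needed per read,
# so the I/O amplification factor is:
amplification = chunk_bytes // 8
print(amplification)           # 272340
```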
I have also straced reads of other data with different chunk sizes (and possibly different internal metadata); these full-field reads are not present in the other files I have examined.
I have also experimented with reading the data through native HDF5 code, specifically the rhdf5 library for R; the read pattern there is what one would expect: only the values that are actually required are read (that is, 8 bytes at a time).
I'm at a loss for where to go from here. I've dug into the NetCDF library source code, but I have no prior experience with either the NetCDF or the HDF5 internals, so it's very slow going. My guess is that this has something to do with non-ideal chunk-caching behaviour in the NetCDF library, but that's just a guess. However, given how reproducible the problem is, and given that it disappears when HDF5 is used directly without NetCDF in the middle, it looks like a bug in the NetCDF library, whatever the underlying cause.
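If the caching hypothesis is right, one cheap experiment would be enlarging the per-variable chunk cache so that a decompressed chunk survives between accesses instead of being re-read. The sketch below is pure sizing arithmetic; the nc_set_var_chunk_cache() call and its ncid/varid handles are only indicated in a comment, and I can't confirm this actually fixes the behaviour:

```python
# Hypothetical sizing of the per-variable chunk cache: make it large
# enough to hold several whole uncompressed chunks at once.
chunk_shape = (1, 510, 1068)   # chunk dimensions from this file
value_size = 4                 # bytes per stored float

chunk_bytes = value_size
for dim in chunk_shape:
    chunk_bytes *= dim         # 2178720 bytes per chunk

chunks_to_hold = 4             # assumption: tune to the access pattern
cache_bytes = chunks_to_hold * chunk_bytes
print(cache_bytes)             # 8714880 bytes, about 8.3 MiB

# In the NetCDF C API this value would be applied with
#   nc_set_var_chunk_cache(ncid, varid, cache_bytes, nelems, preemption);
# where ncid and varid are handles for the (hypothetical) open file
# and variable.
```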