Hi Julien,
I hope you are keeping well and healthy :)
I was wondering if you could help me with a spark-fits issue that I have been having when reading large FITS files of ~1 GB (e.g. https://portal.nersc.gov/project/cosmo/data/legacysurvey/dr8/north/sweep/8.0/sweep-100p030-110p035.fits).
It seems that even when I remove all NaNs from the dataframe, NaNs still appear when a collect/show/toPandas is called. The fraction of NaNs is small (<0.1%), but it causes some data loss and means that I have to filter the data a second time after collecting from the JVM. It also causes summary/aggregation statistics to be incorrect.
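For reference, the second pass I currently do after collection looks roughly like this (a minimal sketch; the mag_g/mag_r/mag_z column names are just illustrative):

```python
# Minimal sketch of the post-collection workaround: even after filtering on
# the Spark side, the collected pandas DataFrame has to be cleaned again.
# The mag_* column names here are illustrative.
pdf = df.toPandas()
pdf = pdf.dropna(subset=["mag_g", "mag_r", "mag_z"])
```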
Troubleshooting
The issue is random: different elements of the dataframe are affected each time the same collect/show/toPandas call is made on the same file and the same dataframe, so the total number of NaNs also changes each run. The data is read directly from the FITS file(s) using spark-fits each time the call is made (rather than from a cached dataframe), which is why I think the FITS reading stage could be causing the problem.
Perhaps it is related to how spark-fits splits up the file. I have tried various record length settings (recordlength=1 * 1024, recordlength=10 * 1024, recordlength=100 * 1024), but I am not sure what is optimal or whether it could alleviate this issue.
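For completeness, the read itself looks roughly like this (a sketch; I am assuming the table sits in HDU 1, and the recordlength value shown is just one of the settings I tried):

```python
# Sketch of the spark-fits read, repeated on every call rather than cached.
# Assumes the binary table is in HDU 1; recordlength is one of the values tried above.
df = (
    spark.read.format("fits")
    .option("hdu", 1)
    .option("recordlength", 1 * 1024)
    .load("sweep-100p030-110p035.fits")
)
```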
The columns that I filter on are mag_*, which I derive from the FLUX_* and FLUX_IVAR_* columns in the example FITS file using a simple pyspark SQL expression.
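The derivation is roughly the following (a sketch; my real expression may differ slightly, and it assumes the usual Legacy Surveys nanomaggy conversion mag = 22.5 - 2.5 * log10(flux)):

```python
from pyspark.sql import functions as F

# Sketch of the mag_* derivation from FLUX_*; assumes the standard nanomaggy
# conversion and uses illustrative band names.
for band in ["G", "R", "Z"]:
    df = df.withColumn(
        f"mag_{band.lower()}",
        F.expr(f"22.5 - 2.5 * log10(FLUX_{band})").cast("float"),
    )
```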
I filter out the following cases for each mag_* column (see the sketch after the list):
'mag_* > 27.0'
'mag_* <= 0.0'
'isnan(mag_*)'
'isnull(mag_*)'
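In pyspark the filter is applied roughly as follows (a sketch with a hypothetical mag_g column; the same conditions are repeated for every mag_* column):

```python
from pyspark.sql import functions as F

# Sketch of the filtering: remove out-of-range, NaN and null magnitudes.
# mag_g is illustrative; the same filter is applied for each mag_* column.
df = df.filter(
    ~(
        (F.col("mag_g") > 27.0)
        | (F.col("mag_g") <= 0.0)
        | F.isnan("mag_g")
        | F.col("mag_g").isNull()
    )
)
```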
These expressions are evaluated successfully and appear in the logical plan. However, on collection, some NaNs that should not be there have somehow been generated. The same code also works with smaller FITS files.
I also checked whether any columns were incorrectly cast as DecimalType() (which could also produce NaNs by ending up as the numpy object dtype). But since I explicitly cast all FLUX_* and FLUX_IVAR_* columns as FloatType(), the resulting mag_* columns are also FloatType(), so this is not the cause of the problem.
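The casts are applied along these lines (a sketch with illustrative band suffixes; the real code covers every FLUX_* / FLUX_IVAR_* column in the file):

```python
from pyspark.sql.types import FloatType

# Sketch of the explicit casts: every FLUX_* and FLUX_IVAR_* column is forced
# to FloatType so that the derived mag_* columns are also floats.
for band in ["G", "R", "Z"]:
    for prefix in ["FLUX", "FLUX_IVAR"]:
        name = f"{prefix}_{band}"
        df = df.withColumn(name, df[name].cast(FloatType()))
```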
Please let me know if you have any idea what could be causing this problem or if you require any more detailed screenshots/code samples to help troubleshoot.
I thought it could be related to this spark-fits bug I raised last year: #84.
Thanks @jacobic, it helps a lot to understand.
Actually, I dug deep inside spark-fits. When the file is read on my computer, it basically makes chunks of 33,554,432 bytes. Guess what? 33,554,432 is very close to 162,829 [rows] * 206 [bytes/row]... So it seems there is a mismatch in the starting index when reading a new chunk of this file. Looking deeper, I found the bug. It was introduced by an ugly fix for Image HDUs:
```scala
// Here is an attempt to fix a bug when reading images:
//
// I noticed that when an image is split across several HDFS blocks,
// the transition is not done correctly, and there is one line typically
// missing. After some manual inspection, I found that introducing a shift
// in the starting index helps removing the bug. The shift is function of
// the HDU index, and depends whether the primary HDU is empty or not.
// By far I'm not convinced about this fix in general, but it works for
// the few examples that I tried. If you face a similar problem,
// or find a general solution, let me know!
var shift = if (primaryfits.empty_hdu) {
  FITSBLOCK_SIZE_BYTES * (conf.get("hdu").toInt - 3)
} else {
  FITSBLOCK_SIZE_BYTES * (conf.get("hdu").toInt - 1)
}
```
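As a quick sanity check with plain Python (using only the numbers above), the chunk size is not a whole number of rows, so a fixed-size split has to cut a row unless the starting index of each chunk is corrected:

```python
# Sanity check: a 33,554,432-byte chunk is not a whole number of 206-byte rows,
# so a naive fixed-size split eventually starts a chunk in the middle of a row.
chunk_size = 33_554_432  # bytes per chunk observed when reading this file
row_size = 206           # bytes per row in this binary table HDU
print(chunk_size % row_size)  # -> 122, not a multiple of the row size
```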
I'm still not 100% sure this is the final word, but fixing it seems to give the correct answer:
# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+
# |summary| object_id| d_pos| d_mag| specz_name| specz_ra| specz_dec| specz_redshift|specz_redshift_err|specz_original_flag|specz_mag_i|
# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+
# | count| 2299625| 2299625| 2299625| 2299625| 2299625| 2299625| 2299625| 2299625| 2299625| 2299625|
# | mean|4.325592616529654...|0.20328246733677513| NaN| null|157.35647009225943|2.6757728602627218|0.6220279768352934| NaN| 2.8340504323962143| NaN|
# | stddev|8.044115412593749E15|0.19084386174782547| NaN| null|119.00443164434095|12.329133357718852|1.1161051065508547| NaN| 23.447103620806015| NaN|
# | min| 36407037608854089| 5.951991E-4|-Infinity| 3DHST_AEGIS_10002| 6.86E-4| -7.282074| -9.99999| 0.0| -1| -9999.166|
# | max| 76557903720366950| 1.0048198| NaN|ZCOSMOS-DR3-950074| 359.99936| 56.899819| 9.9999| NaN| B,1| NaN|
# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+
I will make a fix tomorrow.
Originally posted by @JulienPeloton in #84 (comment)
Thanks as always,
Jacob