
Finalising results about rfi-nln (11 comments, closed)


mesarcik commented on June 6, 2024

Replication of HERA sim paper:

  • Each training sample is 6.4 MHz x 60 time samples (i.e. each training sample uses a bandwidth of 6.4 MHz in the range of 100-200 MHz).
  • The passband for the simulator according to the paper is 100-200 MHz (hence the notches in the plot above at 100 and 200 MHz).
  • The sources shown above are from a "pseudo-sky model" based on the GSM models. However, these models have no relation to real astronomical entities.
  • A discrete visibility equation is used to model them; this is taken from the discrete model in the paper (equation reconstructed below).

Here, we see a measured visibility expressed as a discrete sum over point sources, each entering at a different delay with a different inherent frequency spectrum. The delay transform maps flux from each celestial source to a Dirac delta function, δ_D, centered at the corresponding group delay, convolved by a kernel representing the Fourier transforms of the frequency-dependent interferometer gains, Ã(τ, ŝ_n), and the inherent spectrum of each source, S̃_n.
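In symbols, a hedged reconstruction from the description above (not copied from the paper verbatim):

```latex
\tilde{V}(\tau) \;\approx\; \sum_{n} \tilde{S}_n(\tau) \ast \tilde{A}(\tau, \hat{s}_n) \ast \delta_D(\tau - \tau_n)
```

where ∗ denotes convolution in delay and τ_n is the group delay of source n.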

  • Essentially, each telescope has a gain pattern that changes with frequency, given by A(τ, ŝ), and the source spectra are distributed by a power law with a lower bound of 0.3 Jy, as given by:
    [equation image: power-law source flux distribution]
  • In the "foreground" the number of sources is between 1000 and 10000 according to the model, and the positions are randomly sampled from a uniform distribution (see the sketch after this list).
  • The baseline-dependent effects such as fringes are then convolved with the input visibilities.
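To make the source model concrete, here is a minimal numpy sketch of drawing such a population. The power-law index and the 300 Jy upper bound are illustrative assumptions; the 0.3 Jy lower bound, the 1000-10000 source count, and the uniform positions are from the model description above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of foreground sources: uniform between 1000 and 10000 (per the model).
n_src = int(rng.integers(1000, 10001))

# Positions sampled uniformly, per the description.
ra = rng.uniform(0.0, 2.0 * np.pi, n_src)             # radians
dec = rng.uniform(-np.pi / 2, np.pi / 2, n_src)       # radians

# Fluxes from a power law p(S) ∝ S^-(alpha+1) with a 0.3 Jy lower bound,
# sampled by inverting the CDF. alpha and s_max are assumptions.
alpha, s_min, s_max = 1.5, 0.3, 300.0
u = rng.uniform(size=n_src)
fluxes = (s_min**-alpha + u * (s_max**-alpha - s_min**-alpha)) ** (-1.0 / alpha)
```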

Finalised training set:

  • We simulate HERA visibilities with the following parameters:
    • Duration 30 minutes, integration time 3.52 s (this is done to keep the visibilities square).
    • Bandwidth 90 MHz, from 105 MHz to 195 MHz.
    • Hexagonal array layout with a distance of 14.6 m between antennas.
    • We use the H1C "observation season" as specified by the simulator.
    • We use the default "diffuse foregrounds" with the default parameters specified above.
    • We add thermal noise using the default parameters (see the sketch after this list), such that

      power-law temperature with 180 K at 180 MHz and spectral index of -2.5

  • We add the RFI station models as defined in the simulator, generated by the ORBCOMM satellites.
  • We add DTV RFI with default parameters.
  • We add impulse and noise-like RFI, also with default parameters.
  • Cross-talk is then added, modelled by the convolution between noise and the simulated visibilities (as described in the paper).
  • Finally, we apply a bandpass model (with varying gains and group delays per station).
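For reference, the quoted thermal-noise default corresponds to a simple power-law temperature; a minimal sketch of that scaling (plain numpy, not the hera_sim API):

```python
import numpy as np

def noise_temperature(freq_mhz: np.ndarray) -> np.ndarray:
    """Power-law temperature: 180 K at 180 MHz with spectral index -2.5."""
    return 180.0 * (freq_mhz / 180.0) ** -2.5

freqs = np.linspace(105.0, 195.0, 512)  # the 90 MHz band used above
temps = noise_temperature(freqs)        # ~693 K at 105 MHz down to ~147 K at 195 MHz
```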

Diffuse foregrounds: [figure]

RFI: [figure]

Instrumentation noise: [figure]

Cross coupling: [figure]

Bandpass effects: [figure]

mesarcik commented on June 6, 2024

Experiments run for HERA

  • AOFlagger threshold vs metric sensitivity
    • Expand results to thresholds of 100 and 200
  • OOD RFI vs metric sensitivity
  • Table showing AUROC vs threshold for AOFlagger

OOD RFI

  • Currently running on a patch size of 8x8; I think the performance should increase for NLN-based models.
  • Note: for AOFlagger I find the average maximum threshold over all OOD RFI runs and use that (in this case it is 2).
    [figure: OOD RFI results]

Analysis:

  • On AUROC alone, NLN performs best for OOD RFI detection; however, AFAIK AUROC is not the best metric for class-imbalance problems such as we have here.
  • It is interesting to me that UNET's AUPRC and IOU are so much better than NLN's for impulse RFI.
    • One explanation is that the impulse-based RFI waveforms are similar to the station-based waveforms, so the UNET can detect the RFI even at low SNR.

Metric sensitivity to AOFlagger threshold:

  • I am currently running the experiments to add thresholds of 0.25, 100 and 500 to see what the extreme effects are.
  • It seems that at the extremes the NLN-based methods perform best; however, this is not the case for the AE at thresholds 20 and 50. I am not 100% sure why this is, and am looking into it.

[figure: metrics vs threshold]

  • Below we can see the effect of the threshold on the actual ground-truth metrics; it can be seen that the optimal threshold is at approximately 10.
  • Interestingly, this threshold is not the same as the optimal one for the OOD RFI situation.
  • Note that these "scores" are misleading, because AOFlagger outputs already-thresholded values, so we only really have 1 point in our "curves".

| Metric | 0.5   | 1.0   | 2.0   | 4.0   | 5.0   | 6.0   | 7.0   | 8.0   | 9.0   | 10.0  | 20.0  | 50.0  |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| AUROC  | 0.696 | 0.965 | 0.975 | 0.978 | 0.977 | 0.977 | 0.979 | 0.977 | 0.977 | 0.978 | 0.977 | 0.964 |
| AUPRC  | 0.493 | 0.620 | 0.663 | 0.692 | 0.712 | 0.738 | 0.774 | 0.777 | 0.779 | 0.784 | 0.788 | 0.752 |
| IOU    | 0.044 | 0.297 | 0.384 | 0.445 | 0.492 | 0.547 | 0.619 | 0.626 | 0.629 | 0.639 | 0.648 | 0.592 |

mesarcik commented on June 6, 2024

Boosting results

Analysis of metrics

  • Inspecting previous results, it always seemed strange to me that we can get good AUROC yet our AUPRC and IOU scores were always far behind UNET's.
  • As far as I understand, IOU is the most sensitive metric, as any subtle change in the alignment of the output masks decreases the performance (this is the same for both classes).
  • AUROC is the area under the curve of TPR vs FPR across thresholds.
  • AUPRC is the area under the curve of precision vs recall across thresholds, where precision = TP/(TP+FP) and recall = TP/(TP+FN); see the sketch after this list.
    • In effect, AUPRC is much more sensitive to false negatives (i.e. it really is RFI but we say it's not).
  • I've somewhat solved this problem by removing the logarithmic normalisation.
  • Changed BCE to MSE.
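For reference, a minimal sketch of how these three metrics can be computed from a ground-truth mask and a per-pixel error map (scikit-learn; the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def segmentation_metrics(mask: np.ndarray, error: np.ndarray, thresh: float):
    """mask: boolean ground-truth RFI mask; error: per-pixel anomaly score."""
    y_true = mask.ravel().astype(int)
    y_score = error.ravel()

    auroc = roc_auc_score(y_true, y_score)            # area under TPR vs FPR
    auprc = average_precision_score(y_true, y_score)  # area under precision vs recall

    # IOU needs a hard decision, so threshold the error map first.
    y_pred = y_score > thresh
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return auroc, auprc, intersection / union
```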

[figure]

More work:

  • In bringing together the AUPRC and AUROC scores I introduced another problem: our models are hallucinating RFI even when it's not there.
    [figure]
  • In the figure you can see that when the "station RFI" is present, the AE reconstructs the patch containing the RFI with a slightly higher magnitude than when it is not. However, it also does this for patches near the RFI and patches that contain the edges of the RFI.
  • The effect of this is that the thresholded error we calculate has hallucinated RFI for the stations class (only).
  • I think it has to do with the way patches are constructed.
  • Another thing to investigate is whether it has to do with the magnitude of the RFI (i.e. whether stronger RFI leads to more hallucinated features).
    • This may mean a way to fix this is to clip the training data so that any RFI above some threshold is clipped (see the sketch after this list).
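A minimal sketch of the clipping fix, assuming we clip the visibility magnitudes before training:

```python
import numpy as np

def clip_training_data(vis: np.ndarray, clip_max: float = 100.0) -> np.ndarray:
    """Clip visibility magnitudes so very bright RFI cannot dominate the AE.

    clip_max = 100 is the first value tried below; 200 and the range
    (0.5, 50) are experimented with later in the thread.
    """
    return np.clip(np.abs(vis), None, clip_max)
```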

Clipping at 100

[figure]

  • This seems to make things worse, as other interference becomes equally strong.
  • I have increased the clip to 200 to see if that brings down the maximum without inflating other RFI.

Illustration of potential fixes

[figure]

  • There's a bug in the roll operation, but I think if we shift the patches and do the NLN algorithm twice we can maybe resolve the "shadows" that are created.

Adding a roll:

  • It fixes the problem, but introduces another one: AUPRC and AUROC decrease, but IOU increases by about 10 percent.

  • The way I've implemented it is by finding the minimum between the rolled and unrolled errors and then using that for computation, as sketched below.
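A minimal sketch of that combination, assuming the NLN error map is computed twice, once on the original input and once on an input shifted by half a patch (the shift amount is an assumption; the thread only says the patches are shifted):

```python
import numpy as np

def combine_rolled_errors(err: np.ndarray, err_rolled: np.ndarray,
                          shift: int = 4) -> np.ndarray:
    """err: NLN error map on the original input.
    err_rolled: NLN error map on the input rolled by `shift` pixels.

    Roll the second map back so both are pixel-aligned, then take the
    elementwise minimum: a patch-aligned "shadow" in one pass is unlikely
    to reappear at the same pixels once the patch grid is shifted.
    """
    aligned = np.roll(err_rolled, (-shift, -shift), axis=(0, 1))
    return np.minimum(err, aligned)
```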

Fixed

[figure]

More problems

  • It seems that doing the min operation in this setting increases incorrect predictions on the autocorrelations.
  • I think this is happening because the minimum operation "brings down" the higher values, such that we introduce some weird artifacts.
    [figure]

mesarcik commented on June 6, 2024

Some HERA results to discuss:

  • I have regenerated the HERA data; this is the distribution of magnitudes:
    [figure: magnitude scatter]
  • 61.5% of the RFI lies in the same magnitude range as the astronomical data; 38.5% is higher.
  • This means that, theoretically, a single threshold could detect all of that 38.5% of the RFI.
  • What I find is that with a single threshold we can achieve the following on the dataset (AUROC, AUPRC, IOU): 0.828, 0.470, 0.059.
  • The reason that AUROC is high is that the dataset is so imbalanced: we can easily detect non-RFI (and only 2.76% of the data contains RFI). Note I think the percentage contamination is lower than in the original paper because we are simulating much larger spectrograms.
  • Looking at the naive threshold, we get the following precision/recall breakdown:

| Class  | Precision | Recall | F1-score | Support   |
|--------|-----------|--------|----------|-----------|
| No RFI | 0.99      | 0.64   | 0.78     | 142745554 |
| RFI    | 0.06      | 0.81   | 0.11     | 4055086   |

  • So the naive threshold can almost perfectly detect non-RFI, i.e. the number of false positives for the non-RFI class is very low.
  • However, the recall shows that there are many false negatives (i.e. many non-RFI samples are missed and get flagged).
  • For detecting RFI there are many false positives (low precision), but few false negatives (high recall).
  • **To summarise, I don't think that AUROC is a very good metric for this class-imbalanced problem; here it is far too optimistic.**
  • Below it is clear that the threshold (argmax(tpr − fpr)) fails for autocorrelations; a sketch of how this threshold is derived follows below.
    [figures]
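A minimal sketch of how that argmax(tpr − fpr) threshold can be derived with scikit-learn (names illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def naive_threshold(mask: np.ndarray, magnitudes: np.ndarray) -> float:
    """Single magnitude threshold maximising TPR - FPR (Youden's J)."""
    fpr, tpr, thresholds = roc_curve(mask.ravel().astype(int),
                                     magnitudes.ravel())
    return float(thresholds[np.argmax(tpr - fpr)])

# Usage: flags = magnitudes > naive_threshold(mask, magnitudes)
```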

New fixed results:

  • Here we can see that I have clearly improved the performance on the AUPRC and IOU metrics.
  • This was done by using a discriminative loss, evaluating without absolute error, and clipping the training data between 0.5 and 50.
    [figures: OOD RFI and threshold results]

mesarcik commented on June 6, 2024

AOFlagger threshold results analysis

[figure: metrics vs threshold]

  • The plot above should be in relative performance.
  • AOFlagger produces the best results when the threshold is between 3 and 10.
  • In this range it is clear that AOFlagger obtains the best AUROC performance; however, this does not translate into improved AUPRC or IOU scores.
  • I think this is because the AUROC score is not sensitive enough in this class-imbalanced setting: the morphological operation of joining together RFI emissions doesn't really impact the AUROC result. However, the PRC metric is far more sensitive to false positives and shows a degradation in performance, similarly to IOU. Below I illustrate this with a particular baseline from HERA; the first number in the last plot is the AUROC and the second is the AUPRC.

[figure]

  • It is clear that UNET is more sensitive to under-flagging than the NLN algorithm. This makes sense: when we over-flag, we have less data to train on for NLN, but output more false positives for UNET.
  • However, when we increase the threshold above 50, it is clear that the performance of the NLN algorithm degrades quickly. This is because our NLN algorithm starts producing false negatives (i.e. flagging RFI as not RFI). This is shown in the plot below for a threshold of 200.
    [figure]

mesarcik commented on June 6, 2024

Breaking points:

  • It seems that the NLN predictions break only at certain positions, namely at the edges of the light and dark areas.

  • I think this is because the model is trained with MSE, so it produces low-amplitude patches for such areas (as the mean would be low). This means that when subtracting the mean from the "boundary" areas we produce a "high" output.

  • It could also be because of the "shadow" idea I mentioned before, but in conjunction with these edge effects.

  • I.e. why is this only happening at the edges with a particular type of RFI?

  • We detect RFI perfectly except for a few cases, such as the one shown below.
    [figure]

  • Things tried to improve the performance further:

    • Denoising AE
      • This seems to exacerbate the problem
    • Changing the clip (before and after the logarithm)
    • Increasing the patch size to 64x64
    • I can't seem to beat (AUROC, AUPRC, IOU) = (0.96, 0.94, 0.88)

Potential reason for low AUROC

  • The RFI stations model contains some very low amplitude RFI.

  • I never realised it, but we mostly do not detect it, or we need to set the threshold lower to accommodate it, which results in more problems.
    [screenshot]

  • It seems that clipping at 0.5 removes it.

mesarcik commented on June 6, 2024

Fixes to clipping issues:

[figures: threshold and OOD RFI results]

mesarcik commented on June 6, 2024

LOFAR Results:

First experiment

  • Need to tune the alpha parameter properly.
  • Need to change the preprocessing, i.e. the clipping and normalisation (the AEs are much more sensitive to data processing).
  • I had this same problem with HERA, which was fixed through more accurate preprocessing.
  • In the plot below, alpha is 0.1, patch size is 32x32, n-neighbours = 20.
  • Clearly we get improved AUROC, but low AUPRC and IOU.
    • This is because we are more heavily weighting the latent distances, which typically decreases AUPRC and IOU due to the mismatch between patch size and RFI size; see the sketch after this list.
  • Here it is also interesting that UNET has good AUPRC scores but very low IOU (relative to AOFlagger).
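For reference, a minimal sketch of how an alpha parameter can weight the latent distances against the reconstruction error; this is a hedged sketch of a standard convex blend, and which term alpha multiplies in the actual code is an assumption:

```python
import numpy as np

def combined_score(recon_error: np.ndarray, latent_dist: np.ndarray,
                   alpha: float = 0.1) -> np.ndarray:
    """Blend the per-pixel reconstruction error with the latent
    nearest-neighbour distance (broadcast to pixel resolution).
    A larger alpha puts more weight on the latent distances; both
    inputs are assumed normalised to comparable ranges.
    """
    return alpha * latent_dist + (1.0 - alpha) * recon_error
```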

[figure: metrics vs threshold]

mesarcik commented on June 6, 2024

Distributions of labelled data

  • I'm trying to figure out how to preprocess the LOFAR data; the distributions are very skewed.
  • In the plot below you can see the log-scale plots of the labelled training set, with the RFI and non-RFI classes separated.
  • From this it seems sensible to threshold at 1e7, such that we have enough "headroom" for both the RFI and astronomical signals (see the sketch below).

[figure: histogram of labelled data]

  • For the training data (using the magnitude-based AOFlagger masks):
    [figure: histogram of training data]
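A minimal matplotlib sketch of the histogram used to pick that threshold (names illustrative; the 1e7 line marks the proposed clip):

```python
import numpy as np
import matplotlib.pyplot as plt

def magnitude_histograms(vis: np.ndarray, mask: np.ndarray) -> None:
    """Log-log histograms of visibility magnitudes, RFI vs non-RFI."""
    mags = np.abs(vis).ravel()
    rfi = mask.ravel().astype(bool)
    bins = np.logspace(np.log10(mags.min() + 1e-12), np.log10(mags.max()), 100)
    plt.hist(mags[~rfi], bins=bins, alpha=0.5, label="non-RFI")
    plt.hist(mags[rfi], bins=bins, alpha=0.5, label="RFI")
    plt.xscale("log")
    plt.yscale("log")
    plt.axvline(1e7, linestyle="--", label="proposed threshold 1e7")
    plt.legend()
    plt.show()
```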

mesarcik commented on June 6, 2024

LOFAR Results:

  • Here I evaluate NLN and UNET on 2 different datasets from the LTA.
  • They are both calibration sets with few time samples, but this is done to show that regardless of the training data we can still obtain good performance on the hand-labelled dataset.
  • I have also decided to exclude the IOU score from our evaluation; I will go back and calculate the F1 score based on the maximisation of precision and recall.
  • Note that the AOFlagger labels are those taken from the original datasets and not ones that I computed.
  • Here we use a patch size of 32x32 for both UNET and NLN.
  • The NLN backbone is a discriminative AE.

| Training set | Model     | AUROC  | AUPRC  |
|--------------|-----------|--------|--------|
| N/A          | AOFlagger | 0.7883 | 0.5716 |
| L631961      | UNET      | 0.7332 | 0.6070 |
| L631961      | NLN       | 0.8525 | 0.6000 |
| L629174      | UNET      | 0.7948 | 0.5220 |
| L629174      | NLN       | 0.8893 | 0.6142 |

(note results taken from gpu-01: outputs/results_LOFAR_04-21-2022-04-29_26c2a9.csv and gpu-02: outputs/results_LOFAR_04-21-2022-06-25_f4a439.csv)

NLN modifications

  • Here I have modified the NLN algorithm to work for the LOFAR data in the following ways (see the sketch below):
    1. I threshold the distance-based metrics such that all values above 2*median(dists) are True and those below are False.
    2. I clip the NLN reconstructions, i.e. clip(5*std(nln_recon), 1.0).
    3. I multiply the clipped reconstructions by the distance-based masks.
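A minimal sketch of those three steps (names illustrative; the order of the clip bounds in step 2 is an assumption, as the original note is ambiguous):

```python
import numpy as np

def modified_nln_output(dists: np.ndarray, nln_recon: np.ndarray) -> np.ndarray:
    """LOFAR-specific NLN post-processing, per the three steps above."""
    # 1. Threshold the distance metric: True above 2 * median, False below.
    dist_mask = dists > 2.0 * np.median(dists)

    # 2. Clip the NLN reconstructions (bounds order assumed: [1.0, 5 * std]).
    recon = np.clip(nln_recon, 1.0, 5.0 * np.std(nln_recon))

    # 3. Multiply the clipped reconstructions by the distance-based mask.
    return recon * dist_mask
```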

[figure]

mesarcik commented on June 6, 2024

Final results:

  • L629174 : outputs/results_LOFAR_05-28-2022-01-32_96c554.csv
  • L631961 : outputs/results_LOFAR_05-30-2022-01-57_660653.csv
  • All: outputs/results_LOFAR_06-14-2022-09-54_c3e64c.csv
  • HERA: ``
