
s04_stack2 bug fixes + mov_stack improvement #372

Open · wants to merge 14 commits into master

Conversation

@asyates asyates commented Sep 27, 2024

The following things are fixed with these changes (mentioned in issue #371):

  • bugfix: reference function creation is no longer based only on Todo jobs; it now calls get_results_all using a datelist generated from build_ref_datelist.
  • bugfix: the overwrite boolean is set to False in xr_save_ccf, to avoid overwriting the previous STACK output with only the Todo jobs.
  • when computing moving stacks (-m), gaps in the day list (generated from the current jobs) are identified and filled with the dates necessary to correctly apply the rolling mean, taking the mov_stack size into account.

e.g. for days = ['2006-02-01', '2006-02-02', '2006-02-03', '2006-02-10'] and a stack size (max(mov_rolling)) of two days, the days '2006-01-30', '2006-01-31', '2006-02-08', and '2006-02-09' would be added prior to calling get_results_all.

i.e.:

                # requires: import datetime; import numpy as np; import pandas as pd
                # Calculate the maximum mov_rolling value (in days)
                max_mov_rolling = max(pd.to_timedelta(mov_stack[0]).total_seconds()
                                      for mov_stack in mov_stacks)
                # round up and cast to int so the value can feed range() below
                max_mov_rolling_days = max(1, int(np.ceil(max_mov_rolling / 86400)))

                days = sorted(days)
                days = [datetime.datetime.strptime(day, '%Y-%m-%d') for day in days]
                day_diffs = np.diff(days)
                gaps = [i + 1 for i, diff in enumerate(day_diffs) if diff.days > 1]  # indices of days preceded by a gap
                gaps.insert(0, 0)  # index 0 is also a 'gap' (previous data needed for stacking)

                all_days = list(days)
                added_days = []  # track the added days for removal before saving the CCFs

                for gap_idx in gaps:
                    start = days[gap_idx]
                    # add the preceding days needed by the rolling mean
                    for j in range(1, max_mov_rolling_days + 1):
                        preceding_day = start - datetime.timedelta(days=j)
                        if preceding_day not in all_days:
                            all_days.append(preceding_day)
                            added_days.append(preceding_day)

                added_dates = pd.to_datetime(added_days).values
                # get the CCFs needed for -m stacking
                c = get_results_all(db, sta1, sta2, filterid, components, all_days,
                                    format="xarray")

These additional days are then removed prior to saving via:

                mask = xx.times.dt.floor('D').isin(added_dates)
                xx_cleaned = xx.where(~mask, drop=True)  # remove days not associated with current jobs
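
For illustration, here is a minimal self-contained sketch of that removal step on toy data (the array and dates are made up; only the mask/where logic mirrors the snippet above):

    import numpy as np
    import pandas as pd
    import xarray as xr

    times = pd.date_range('2006-01-30', '2006-02-03', freq='D')
    xx = xr.DataArray(np.random.rand(len(times), 5),
                      dims=('times', 'taxis'),
                      coords={'times': times})

    # pretend the first two days were only pulled in for stacking
    added_dates = pd.to_datetime(['2006-01-30', '2006-01-31']).values
    mask = xx.times.dt.floor('D').isin(added_dates)
    xx_cleaned = xx.where(~mask, drop=True)  # only 2006-02-01..03 remain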

Will use this same branch to implement the Wiener filter... but these changes/fixes are more fundamental.

@asyates asyates commented Oct 2, 2024

Added:

  • reference function no longer uses jobs, i.e. running msnoise cc stack -r will always compute the reference function regardless of whether STACK jobs are Todo or Done.

To do:

  • Add the Wiener filter. I think for now I will do this without the SVD component and test, as I'm not sure how to implement the SVD decomposition cleanly in a way that all data is processed 'equally', i.e. the number of eigenvectors available could vary a lot depending on how many CCFs are being processed (see the sketch after this list).
  • update documentation
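
A quick toy illustration of that eigenvector-count concern (not code from this branch):

    import numpy as np

    for n_ccfs in (5, 50, 500):
        block = np.random.rand(n_ccfs, 200)  # n_ccfs stacks x 200 lag samples
        s = np.linalg.svd(block, compute_uv=False)
        print(n_ccfs, 'CCFs ->', s.size, 'singular values')
    # 5 -> 5, 50 -> 50, 500 -> 200: the available basis size depends on the
    # batch size, so a fixed eigenvector cut-off would treat small and large
    # batches very differently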

@ThomasLecocq (Member) commented

re: docs: this documentation approach should work, but you'll need to set up a big test folder with some data (esp. for the examples to build). My idea (didn't have time yet) is to provide two big "pooch" payloads: 1) only the raw test data & the recipe to build it into the same state as 2) the environment with all the data processed, for running the examples etc.

@asyates asyates commented Oct 3, 2024

Okay, I think it's working nicely now. I did quite a few different tests to check it works properly with gaps, with jobs processed in stages rather than all together, etc.

As said, no SVD component for now, as I'm unsure how it would work cleanly when not processing all data at the same time (i.e. to ensure all data is processed equally).

Functionality of the Wiener filter right now is:

  • if the Wiener filter is applied, the current jobs are also padded with the previous/future 2*M of CCFs, where M is the smoothing duration along the datetime axis.
  • it checks for 'continuous' CCFs to apply the Wiener filter to. Gaps shorter than M are 'ignored', i.e. the filter treats those CCFs as adjacent; otherwise the filter is applied separately to the different groups of adjacent stacks. Note, gaps are restored post-Wiener (pre-stacking).
  • for saving, the first M duration of pre-job stacks (previously pulled in for the purpose of stacking) is removed, as it sits at the 'edge' of the 'image' where the filter has fewer neighbouring points to use. The second M duration of pre-job stacks stays, however, and overwrites the previously saved stacks. The idea is that this older data, if processing in real time for example, may not have had neighbouring (future) points when first processed, but can now be updated based on the data that has since come in (to be consistent with the other data processed).

Users set three params in the config (a rough sketch of how these could map onto a 2-D Wiener window follows below):

wienerfilt: bool, False by default
wiener_Mlen: str (timedelta), smoothing along the datetime axis
wiener_Nlen: str (timedelta), smoothing along the lag-time axis
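
A minimal sketch of that mapping, assuming scipy.signal.wiener and made-up sampling intervals (the conversion is my reading of the parameters, not necessarily this branch's code):

    import numpy as np
    import pandas as pd
    from scipy.signal import wiener

    wiener_Mlen = '3d'    # smoothing along the datetime axis
    wiener_Nlen = '0.5s'  # smoothing along the lag-time axis

    dt_time = pd.Timedelta('1d')  # assumed spacing between stacked CCFs
    dt_lag = 0.05                 # assumed lag-time sampling (s)

    # window sizes in samples along each axis
    M = max(1, round(pd.to_timedelta(wiener_Mlen) / dt_time))
    N = max(1, round(pd.to_timedelta(wiener_Nlen).total_seconds() / dt_lag))

    ccf_block = np.random.rand(30, 200)  # toy block: 30 days x 200 lag samples
    filtered = wiener(ccf_block, mysize=(M, N))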

Quick example showing the final dv/v for one month of data at Ruapehu, 2-day stacks:

[image: dv/v example, Ruapehu]

Still to do: documentation

@asyates asyates commented Oct 4, 2024

A couple of smaller changes, and started adding documentation (just in s04_stack2 for now).

Ended up going down a rabbit hole regarding how much padding with data outside of the Todo jobs actually reduces the edge influence: even if you pad well beyond the width of the filter, subtle differences can still propagate to the middle of the 'image' despite the neighbouring points not changing.

Given that, I added a warning in the documentation that caution should be exercised if processing data in steps (i.e. not all at the same time).

Below is an example where I tested the dv/v after applying the Wiener filter to all data together versus in separate stages (cut-offs indicated by dashed lines; at the end of the month, for example, this simulates reading in 1 day of new data each time). Similar, but with some differences.

[image: dv/v comparison, all data at once vs. staged processing]

@asyates asyates commented Oct 4, 2024

A demonstration that padding doesn't fully prevent values further away from the edge of the image changing slightly. In the top row I add 2 rows (of 1s) at a time, e.g. to reflect new CCF data coming in, and then apply a Wiener filter of length (2, 2). You can see that the values corresponding to the original pattern show subtle differences even when four or six rows of constant value have been added.

So I am not sure there is a 'perfect' way to do it, other than maybe applying the Wiener filter over a fixed moving window, i.e. every N days, so that it is consistent when processing in different stages. But that would be pretty horrid for computation time, I imagine, so maybe just having the warning is best for cases where not all data is processed together. A small numeric reconstruction of the effect follows the figure below.

[image: Wiener filter edge-effect test]
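
A small numeric reconstruction of that test, assuming scipy.signal.wiener (when noise=None, scipy estimates the noise power globally from the whole image, which is one reason appending rows shifts every filtered value slightly):

    import numpy as np
    from scipy.signal import wiener

    rng = np.random.default_rng(0)
    base = rng.random((10, 10))        # the original 'image'
    ref = wiener(base, mysize=(2, 2))  # filtered without any extra rows

    for extra in (2, 4, 6):
        padded = np.vstack([base, np.ones((extra, 10))])  # append rows of 1s
        filt = wiener(padded, mysize=(2, 2))
        # largest change anywhere within the original block
        print(extra, 'rows added -> max change:', np.abs(filt[:10] - ref).max())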
