
ApproximateGradientSumFunction and SGFunction #1345

Closed

Conversation

Contributor

@epapoutsellis epapoutsellis commented Aug 17, 2022

Describe your changes

SubsetSumFunction

The SubsetSumFunction is defined as

$$ \frac{1}{n} \sum_{i=1}^{n}f_{i}(x)$$

where $n$ is the number of subsets and $f_{i}$ are smooth functions.

Example

$$ \frac{1}{n}\sum_{i=1}^{n} F_{i}(x) = \frac{1}{n}\sum_{i=1}^{n}||A_{i} x - b_{i}||^{2} $$

f = SubsetSumFunction([LeastSquares(Ai, b=bi)]*n for Ai,bi in zip(A_subsets, b_subsets)) 

Note

At the moment, there are 2 sampling methods: random and sequential. For the random case, we have the following.

  • sampling=random, replacement=True : means that the next_subset method can select the same subset_num
  • sampling=random, replacement=False : means that the next_subset method selects each subset_num uniquely.

Example

a = [1,2,3,4,5,6]
num_subset = 2
  • sampling=random, replacement=True: [3,1,2], [4,2,6]
  • sampling=random, replacement=False: [1,3,2], [6,4,5]
  • sampling=sequential: [1,2,3], [4,5,6]

SGDFunction

Example

f = SGDFunction([LeastSquares(Ai, b=bi)]*n for Ai,bi in zip(A_subsets, b_subsets), sampling="random", replacement=False) 

GD vs SGD in CIL

We can solve the LeastSquares minimisation problem $$\min_{x}\|Ax - b\|^{2}$$
  • Deterministically, using GD with objective_function = LeastSquares
  • Stochastically, using GD with objective_function = SGDFunction(list of LeastSquares); a sketch of both set-ups follows below
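
A minimal sketch of the two set-ups (the variables A, b, A_subsets, b_subsets and the image geometry ig are assumed from this description, not defined here):

# Deterministic: GD on the full least-squares objective
f_full = LeastSquares(A, b=b)
gd = GD(initial=ig.allocate(0), objective_function=f_full, max_iteration=100)
gd.run(verbose=1)

# Stochastic: the same GD algorithm, but the objective is an SGDFunction
f_subsets = [LeastSquares(Ai, b=bi) for Ai, bi in zip(A_subsets, b_subsets)]
f_sgd = SGDFunction(f_subsets, sampling="random", replacement=False)
sgd = GD(initial=ig.allocate(0), objective_function=f_sgd, max_iteration=100)
sgd.run(verbose=1)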

Proximal Gradient Algorithms in CIL

For non-smooth objectives we can solve $$\min_{x\geq 0}||Ax - b||^{2}$$

  • Deterministically, using ISTA/FISTA with f = LeastSquares, g = IndicatorBox(lower=0)
  • Stochastically, using ISTA/FISTA with f = SGDFunction(list of LeastSquares), g = IndicatorBox(lower=0)
G = IndicatorBox(lower=0)
F = LeastSquares(A, b = noisy_sino)
initial = ig.allocate(0)
ista = ISTA(initial=initial, f=F, g=G, update_objective_interval = 1, 
            max_iteration = 50)
ista.run(verbose=1)

F_sgd1 = SGDFunction(f_subsets, sampling="random")
G = IndicatorBox(lower=0) 
num_epochs = 10
initial = ig.allocate(0)
s_ista1 = ISTA(initial=initial, 
            f=F_sgd1,
            g=G, update_objective_interval = n_subsets, 
            max_iteration = num_epochs * n_subsets)
s_ista1.run(verbose=1)

Describe any testing you have performed

  • Add tests for SubsetSumFunction
  • Add tests for SGDFunction [wip]

Demos

Link relevant issues

Close #1341
Close #1343

From the Hackathon-000-Stochastic-Algorithms repository
Close 11
Close 10
Close 9

Checklist when you are ready to request a review

  • I have performed a self-review of my code
  • I have added docstrings in line with the guidance in the developer guide
  • I have implemented unit tests that cover any new or modified functionality
  • CHANGELOG.md has been updated with any functionality change
  • Request review from all relevant developers
  • Change pull request label to 'Waiting for review'

Contribution Notes

Please read and adhere to the developer guide and local patterns and conventions.

  • The content of this Pull Request (the Contribution) is intentionally submitted for inclusion in CIL (the Work) under the terms and conditions of the Apache-2.0 License.
  • I confirm that the contribution does not violate any intellectual property rights of third parties

@epapoutsellis
Contributor Author

@KrisThielemans (could not add you in the reviewers list)

epapoutsellis added a commit to epapoutsellis/CIL that referenced this pull request Aug 23, 2022
epapoutsellis added a commit to epapoutsellis/CIL that referenced this pull request Aug 25, 2022
Contributor

@KrisThielemans KrisThielemans left a comment


I suggest some changes, but in particular to the design:

  • I really don't think using 1/n is a good idea. It means that an SGDFunction is not a direct replacement of the original, but a replacement of 1/num_subsets orig_function. This means that every derived algorithm needs to know this, and adjust step size etc.
  • I suggest a change to have a subset_gradient function (without next_subset())
  • There is no reason to put the preconditioner in SGDFunction. In principle, it could sit in SubsetSumFunction, but in fact, it could sit in Function. I'm not so sure you want to have it in this hierarchy at all though. A gradient preconditioner is something for gradient algorithms, not for functions.

self.num_functions = len(functions)
if self.num_functions < 2:
    raise ValueError('At least 2 functions need to be passed')
Contributor

not sure why we'd want to throw an error here. Seems more useful to handle only 1 function, or even zero.


r"""SubsetSumFunction represents the following sum

.. math:: \frac{1}{n}\sum_{i=1}^{n} F_{i} = \frac{1}{n}(F_{1} + F_{2} + ... + F_{n})
Contributor

why did we choose the 1/n normalisation? I don't find it very natural in some cases (not in KL, but also not when the penalty is one of the functions). Also, it doesn't really fit the "SumFunction" class.

return (1/self.num_subsets) * super(SubsetSumFunction, self).__call__(x)

def _full_gradient(self, x, out=None):
r""" Computes the full gradient at :code:`x`. It is the sum of all the gradients for each function. """
Contributor

why _full_gradient and not full_gradient?

Also, currently doc is wrong as it is the average of the gradients of all functions

Contributor

doc is correct now. still remains why we'd call this _full_gradient

Contributor Author

_full_gradient is used in SVRG although I am not sure if it is needed, see here, @zeljkozeljko any ideas?


the "old" SubsetSumFunction class had a _full_gradient method, and SVRG had an attribute called full_gradient, which was computed (updated) once every couple of epochs and is the estimator of the full gradient of the objective at the snapshot/anchored image. This attribute is then used at every update, and I would think we want to store it (or at least have the option of storing it), and not recompute it at every update (otherwise why not use full GD). As to the naming, I'm not sure if the choice of the underscore in the method was because of some CIL convention; as regards the attribute in SVRG, it can be renamed (e.g. we could have snapshot_full_gradient and then also snapshot_image, or something similar), and reserve full_gradient for the SubsetSumFunction class.

Contributor

can be resolved now


.. math:: \frac{1}{n}\sum_{i=1}^{n} F_{i} = \frac{1}{n}(F_{1} + F_{2} + ... + F_{n})

where :math:`n` is the number of subsets.
Contributor

Should add some doc on what this intends to represent, i.e. the gradient is an approximation using some subsets/history etc.

return (1/self.num_subsets) * super(SubsetSumFunction, self).gradient(x, out=out)

def gradient(self, x, out=None):
""" Computes the gradient for each subset function at :code:`x`."""
Contributor

wrong doc

Contributor

In fact, I think it would be better to let this function call next_subset() and then an (undefined) subset_gradient(). All derived classes then have to define subset_gradient(), and never need to know anything about next_subset() (I think). It also means that a user can call subset_gradient() without "advancing" the subset
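
A minimal sketch of that split, keeping the naming already used in this PR (SubsetSumFunction, next_subset); this is illustrative only, not the PR's code:

class SubsetSumFunction(SumFunction):

    def gradient(self, x, out=None):
        """Approximate gradient: advance to the next subset, then delegate."""
        self.next_subset()
        return self.subset_gradient(x, out=out)

    def subset_gradient(self, x, out=None):
        """Gradient estimate for the current subset; defined by derived classes."""
        raise NotImplementedError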

@paskino
Contributor

paskino commented Sep 15, 2022

In your description you have 2 examples @epapoutsellis

f = SubsetSumFunction([LeastSquares(Ai, b=bi)]*n for Ai,bi in zip(A_subsets, b_subsets)) 

and

f = SGDFunction([LeastSquares(Ai, b=bi)]*n for Ai,bi in zip(A_subsets, b_subsets), sampling="random", replacement=False) 

I think they should be as follows:

f = SubsetSumFunction(*[LeastSquares(Ai, b=bi) for Ai,bi in zip(A_subsets, b_subsets)]) 

and

f = SGDFunction(*[LeastSquares(Ai, b=bi) for Ai,bi in zip(A_subsets, b_subsets)], \
    sampling="random", replacement=False) 

Is it true?

@KrisThielemans
Contributor

I think the *n has to be there to compensate for the 1/n in SubsetSumFunction (which is one reason why I think the latter is a bad idea)

@epapoutsellis
Contributor Author

epapoutsellis commented Sep 16, 2022

I suggest some changes, but in particular to the design:

  • I really don't think using 1/n is a good idea. It means that an SGDFunction is not a direct replacement of the original, but a replacement of 1/num_subsets orig_function. This means that every derived algorithm needs to know this, and adjust step size etc.
  • I suggest a change to have a subset_gradient function (without next_subset())
  • There is no reason to put the preconditioner in SGDFunction. In principle, it could sit in SubsetSumFunction, but in fact, it could sit in Function. I'm not so sure you want to have it in this hierarchy at all though. A gradient preconditioner is something for gradient algorithms, not for functions.

For the preconditioner

Short answer: Totally agree, the preconditioner should be a parameter for each of the algorithms, e.g., GD, ISTA, FISTA that will follow the f.gradient(x, out=out) step. This is something we discussed with @zeljkozeljko .

Long answer: However, we do not have this option at the moment for our proximal gradient algorithms. In addition, the preconditioner should be a callable method. With this setup, we can cover the cases where

$$ x_{k+1}= x_{k} - \gamma_{k}D(x_{k}, iteration) \nabla f(x_{k}) $$

a) preconditioner is fixed $\frac{1}{A^{T}\mathbb{1}}$
b) depends on the iterates $\frac{x_{k}+\delta}{A^{T}\mathbb{1}}$
c) depends on the subsets_num, i.e., activate/deactivate it after some epochs or iterations.
d) None

Notice that in case c) the (deterministic) proximal gradient algorithms have no info about the number of subsets. At the moment, this is known only by the SubsetSumFunction (SGDFunction, SAGFunction, etc). So if we have a callable method for the precond step that depends on (x, iteration), the user has all the information to define it.

self.objective_function.gradient(self.x, out=self.x_update)
self.x_update *= apply_precond(self.x, self.iteration)
self.x.sapyb(1.0, self.x_update, -self.step_size, out=self.x)

Notice that the same can be applied to adaptive step sizes, e.g., a decreasing step size in SGD that satisfies certain conditions. Unless we have a fixed step size or another method such as the Armijo rule, it should be a callable method. See for instance SGDClassifier. Finally, all of the above should and can work in ISTA and FISTA.
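
As an illustration of the callable form described above, case b) could look roughly like this (a sketch only; it assumes a CIL-like linear operator with adjoint() and range_geometry(), and the helper name make_preconditioner is not part of this PR):

def make_preconditioner(A, delta=1e-6):
    # fixed sensitivity image A^T 1
    sensitivity = A.adjoint(A.range_geometry().allocate(1.0))
    def apply_precond(x, iteration):
        # D(x_k, k) = (x_k + delta) / (A^T 1)
        return (x + delta) / sensitivity
    return apply_precond

# inside the algorithm update, following the snippet above:
# self.x_update *= apply_precond(self.x, self.iteration)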

@epapoutsellis
Contributor Author

I suggest some changes, but in particular to the design:

  • I really don't think using 1/n is a good idea. It means that an SGDFunction is not a direct replacement of the original, but a replacement of 1/num_subsets orig_function. This means that every derived algorithm needs to know this, and adjust step size etc.
  • I suggest a change to have a subset_gradient function (without next_subset())
  • There is no reason to put the preconditioner in SGDFunction. In principle, it could sit in SubsetSumFunction, but in fact, it could sit in Function. I'm not so sure you want to have it in this hierarchy at all though. A gradient preconditioner is something for gradient algorithms, not for functions.

For the 1/n convention

Short answer: I think it is important to have the 1/n weight in the base class SubsetSumFunction (which inherits from SumFunction). In this way we will not have problems when comparing with other algorithms, e.g., deterministic ones, and will not have to adjust future algorithms from the literature where the 1/n is included when finding the argmin.

Long answer: In the first hackathon, we implemented SAG based on this review paper. Then I think an AverageFunction class was created, but we decided with @ClaireDelplancke @zeljkozeljko @Imraj-Singh @BillyTang to remove it and fix the scalings in the SubsetSumFunction.

Example: In CIL we can solve the following LS problem

$$ \text{argmin}_{x} f(x) = \frac{1}{2}||Ax - d||^{2} $$

  1. using the LeastSquares function and the GD algorithm (Deterministic)
  2. using the SumFunction function and the GD algorithm (Deterministic)
  3. using the SGDFunction function and the GD algorithm (Stochastic)

For 3) the SGDFunction has the $\frac{1}{n}$ in the call method. Therefore to match the min for the deterministic cases we need $\frac{1}{n}$ in front of the LeastSquares and the SumFunction.

For 1) :

f = LeastSquares(Aop, b=bop, c = 0.5/n) 
initial = ig.allocate()
gd = GD(initial = initial, objective_function = f, max_iteration=n_epochs,
       update_objective_interval=1)
gd.run(verbose=1)

The default step_size of GD is $\frac{2*0.99}{L}$ and we have

$$ x_{k+1} = x_{k} - \frac{2 * 0.99}{L} * \frac{1}{n}A^{T}(Ax_{k}-d) = x_{k} - \frac{2 * 0.99}{\frac{||A^{T}A||}{n}} * \frac{1}{n}A^{T}(Ax_{k}-d) = x_{k} - \frac{2 * 0.99}{||A^{T}A||} * A^{T}(Ax_{k}-d)$$

For 2) : Here, we assume that $$f(x) : = \frac{1}{n}\sum_{i=1}^{n} f_{i}(x), \quad f_{i}(x) = \frac{1}{2}||A_{i}x-d_{i}||^{2} $$
Assuming that we have split $A$ into $A_{i}$'s and $d$ into $d_{i}$'s, we can have

average_sum = (1./n)*SumFunction(*fi_cil)
average_sum_gd = GD(initial = initial, objective_function = average_sum,  max_iteration=n_epochs,
       update_objective_interval=1)
average_sum_gd.run(verbose=1)

Here we have,

$$ L = \frac{1}{n}\sum_{i=1}^{n} L_{i} = \frac{1}{n}\sum_{i=1}^{n} ||A_{i}^{T}A_{i}|| $$

$$ x_{k+1} = x_{k} - \frac{2 * 0.99}{L} * \frac{1}{n}\sum_{i=1}^{n}A_{i}^{T}(A_{i}x_{k}-d_{i}) = x_{k} - \frac{2 * 0.99}{ \sum_{i=1}^{n} ||A_{i}^{T}A_{i}|| } \sum_{i=1}^{n}A_{i}^{T}(A_{i}x_{k}-d_{i})$$

Finally for the SGD case, we have

sgd_fun = SGDFunction(fi_cil)
step_size = 0.005
sgd = GD(initial = initial, objective_function = sgd_fun, 
        step_size = step_size, max_iteration=n_epochs*n_subsets,
       update_objective_interval=n_subsets)
sgd.run(verbose=1)

and

$$ x_{k+1} = x_{k} - \gamma_{k} \nabla f_{i_{k}}(x_{k}) = x_{k} - \gamma_{k} A_{i_{k}}^{T}(A_{i_{k}} x_{k} - d_{i_{k}}) $$

Notice that in cases 1) and 2), there is no dependence on the number of subsets, so finding the argmin is fine. However, when comparing the objective values of the deterministic vs stochastic algorithms, we need the $\frac{1}{n}$ in front of LeastSquares and SumFunction.

Comment on lines 363 to 364
self.data_passes = [0]
self.sampling = sampling

@zeljkozeljko zeljkozeljko Sep 22, 2022


Might be useful to also track the subset_num used per iteration, by say creating an attribute self.subset_use = [subset_init], and then appending to it in each iteration


The current implementation does not seem to track the subset_use. One can create an attribute self.subset_use here, and then append to it after calling next_subset with a given function (or appending to it within the next_subset method, but then making sure that svrg does not call next_subset directly after it has updated the snapshots)


Also, data_passes doesn't seem to be tracked/updated (even though it is initialised here). Was this the intended use?

Comment on lines 408 to 410
else:
raise NotImplementedError("Only {} and {} are implemented at the moment.".format("random","sequential"))


this is the point where one could append self.subset_num to the list self.subset_use

@paskino
Contributor

paskino commented Sep 29, 2022

I'm trying to resume the conversation we had with @KrisThielemans @epapoutsellis @gfardell

With this PR we want to propose a structure in CIL for stochastic optimisation with gradient-type algorithms. A number of choices have already been made, the most relevant being that we decided not to develop stochastic algorithms but rather a special class of functions that, plugged into an algorithm like GD, makes it behave like SGD. The same applies to ISTA and FISTA.

As @epapoutsellis wrote we want to be able to solve optimisation problems of the form

$$ \text{argmin}_u F(u, b) + \alpha R(u) $$

where $F$ is a data fitting term, $b$ is the measured data, $R$ is the regulariser and $\alpha$ the regularisation parameter. $F$ is supposed to be smooth, while the regulariser might not be, and in such case we'd use an appropriate algorithm.

At the bottom of this PR we rewrite $F$ as a separable sum

$$ F(u,b) = \frac{1}{n} \sum_{i=1}^{n} f_i(u,b) $$

The base class for this in CIL is the SubsetSumFunction. @KrisThielemans argued that the presence of the $1/n$ multiplicative factor makes this not exactly a separable sum but an averaged separable sum, so the name SubsetSumFunction would not describe the object correctly, and proposed a name containing average.

Most of the algorithms we are planning to implement have been published following this convention. The optimisation problem becomes

$$ \text{argmin}_u \frac{1}{n} \sum_{i=1}^{n} f_i(u,b) + \alpha R(u) $$

I quote @epapoutsellis from a private communication,

If you change the number of subsets, you change the strength of your fidelity term, then you need to adjust the strength of your regulariser, i.e., alpha, in order to have the same argmin in every optimisation problem.

Also, the objective value is not immediately comparable between two setups with different numbers of subsets; it needs to be scaled accordingly.

This means that with the convention

$$ F(u,b) = \frac{1}{n} \sum_{i=1}^{n} f_i(u,b) $$

the user will need to be extra careful in scaling the regularisation parameter and the objective value with the number of subsets.

However, if we were to choose the convention in which the separable sum really is a separable sum, i.e. without the $1/n$ scaling:

$$ F(u,b) = \sum_{i=1}^{n} f_i(u,b) $$

  1. the name SubsetSumFunction makes sense (to @KrisThielemans )
  2. we would not have any problem comparing algorithms with different numbers of subsets

Conversely, it would add a burden on the developer team as:

  1. each subset function needs to take into account the missing $1/n$, which seems to be the convention in the papers
  2. comparison with literature results (which follow the $1/n$ convention) will need to be appropriately scaled
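
For concreteness, the two conventions differ only by an overall factor of $n$ in the objective, so they share the same minimiser provided the regularisation parameter is rescaled accordingly:

$$ \text{argmin}_u \frac{1}{n} \sum_{i=1}^{n} f_i(u,b) + \alpha R(u) = \text{argmin}_u \sum_{i=1}^{n} f_i(u,b) + n\alpha R(u) $$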

@epapoutsellis
Contributor Author

I wonder if we can agree on how the user sets the sampling strategies, as these are also useful to SPDHG, see #1345 (comment).

Come to think of it, I think it might be better to isolate the "subset number generator" in a separate class. At some point, you'd want to use non-uniform sampling (in fact, already used in the spdhg paper if the penalty is a "subset" I believe). You don't really want people to have to modify SubsetSumFunction to enter their favorite sampler. It's probably also clearer if we just accept a Python discrete RNG in the constructor.

Yes, this is what we discussed in the previous meeting. The new class (suggested name: SubsetGenerator) will implement the next_subset method. And the new next_subset method will have only

self.subset_num = SubsetGenerator ( num_subset, `sampling_method`)

@epapoutsellis
Contributor Author

updated the link in my comment, which now points to #1344 (comment) which describes a "subset number generator" class, in fact.

that class is somewhat different. It points to an "all angles in all subsets" generator. That is something useful as well, but it is related to the generation of the Ai etc. This isn't quite what we need here I think, as this is just about "which subset (or 'i') do I choose next". This is just a random (or sequential or whatever) number from 0....N.

Yes, the class in #1344 (comment) will be used to generate operators (Ai) and data (di) and is prior to running the algorithm. The next_subset method which has the two basic strategies (replacement=True or False) is related to which function (fi) is going to be selected and how in every iteration/epoch.

@paskino
Contributor

paskino commented Nov 3, 2022

Why closing this?

@epapoutsellis
Contributor Author

Why closing this?

I reopened it

@epapoutsellis epapoutsellis reopened this Nov 3, 2022
@KrisThielemans
Contributor

this is what we discussed in the previous meeting. The new class (suggested name: SubsetGenerator) will implement the next_subset method. And the new next_subset method will have only

self.subset_num = SubsetGenerator ( num_subset, `sampling_method`)

I think this is slightly wrong. I suggest something like

generator =  SubsetGenerator ( num_subsets, `sampling_method`)
subset_func = SGFunction(fi_cil, generator)
...

# somewhere in SGFunction
 subset_num = self.subset_num_generator()

@epapoutsellis
Contributor Author

@KrisThielemans , @paskino is the following ok?

import numpy as np

class SubsetGenerator(object):

    def __init__(self, num_subsets, sampling_method = "random", replacement = True, shuffle ="random"):

        self.num_subsets = num_subsets
        self.sampling_method = sampling_method
        self.replacement = replacement
        self.shuffle = shuffle        
        self.subset_num = -1
        self.index = 0       

        if self.replacement is False:
                        
            # create a list of subsets without replacement, first permutation
            self.list_of_subsets = np.random.choice(range(self.num_subsets),self.num_subsets, replace=False)            
                       
            if self.shuffle not in ["random","single"]:
                raise NotImplementedError("Only {} and {} are implemented for the replacement=False case.".format("random","single"))            
                         
    def next_subset(self):
            
            if self.sampling_method=="random" :
                
                if self.replacement is False:
                    
                    # list of subsets already permuted
                    self.subset_num = self.list_of_subsets[self.index]
                    self.index+=1                
                                    
                    if self.index == self.num_subsets:
                        self.index=0
                        
                        # For random shuffle, at the end of each epoch, we permute the list again                    
                        if self.shuffle=="random":                    
                            self.list_of_subsets = np.random.choice(range(self.num_subsets),self.num_subsets, replace=False)                                                                                          
                else:
                    
                    # at each iteration (not epoch) a subset is randomly selected
                    self.subset_num = np.random.randint(0, self.num_subsets)
                
            elif self.sampling_method=="sequential":    
                
                if self.subset_num + 1 >= self.num_subsets:
                    self.subset_num = 0                
                else:
                    self.subset_num += 1         
            else:
                raise NotImplementedError("Only {} and {} are implemented at the moment.".format("random","sequential"))          
            
            return self.subset_num

and is used as

generator = SubsetGenerator(self.n_subsets)
self.F_SG = SGFunction(self.fi_cil, generator) 

and the next_subset method of SubsetSumFunction is

    def next_subset(self):
        
        # Select the next subset using the subset_generator
        
        self.subset_num = self.subset_generator.next_subset()
        self.subsets_used.append(self.subset_num)

@KrisThielemans
Contributor

I think this is nice. What you wrote is not a SubsetGenerator, but a SubsetNumberGenerator.

However, it could be made even nicer if you'd be able to make it a Python generator, close to the C++ "distributions".

Usage would be

generator = make_subset_num_generator(6)
self.F_SG = SGFunction(self.fi_cil, generator)
...
def next_subset(self):
    self.subset_num = next(self.subset_generator)

I had hoped that the Python random number generator classes would follow this paradigm, such that we could just as well have said

rng = numpy.random.default_rng()
# doesn't exist
generator = np.random.int_distribution(rng,0, num_subsets) 

but it seems that the NumPy people don't like this (or are thinking mostly in terms of returning arrays, which isn't needed here).

Supposing you like this design, I'm not so sure how to implement it... Possibly something like this

def make_subset_num_generator(self, num_subsets, sampling_method = "random", replacement = True, shuffle ="random"):
     # code from __init__
      self.num_subsets = num_subsets
      etc      

     while(True):
        # code from next_subset 
        if self.replacement is False:
             etc
       yield subset_num

It would probably be more efficient to put the yield statement wherever you currently assign, but who cares about efficiency here?

Note that this is no longer a class, so I've changed naming convention (using the one from STIR/SIRF, not sure what CIL uses for functions).

@epapoutsellis
Contributor Author

No problem to make it function/generator, something like

def subset_num_generator(num_subsets, sampling_method = "random", replacement = True, shuffle ="random"):

        subset_num = -1
        index = 0       

        if replacement is False:
                        
            # create a list of subsets without replacement, first permutation
            list_of_subsets = np.random.choice(range(num_subsets),num_subsets, replace=False)            
                       
            if shuffle not in ["random","single"]:
                raise NotImplementedError("Only {} and {} are implemented for the replacement=False case.".format("random","single"))            

        while(True):                
            
            if sampling_method=="random" :
            
                if replacement is False:
                    
                    # list of subsets already permuted
                    subset_num = list_of_subsets[index]
                    index+=1                
                                    
                    if index == num_subsets:
                        index=0
                        
                        # For random shuffle, at the end of each epoch, we permute the list again                    
                        if shuffle=="random":                    
                            list_of_subsets = np.random.choice(range(num_subsets),num_subsets, replace=False)                                                                                          
                else:
                    
                    # at each iteration (not epoch) a subset is randomly selected
                    subset_num = np.random.randint(0, num_subsets)
                
            elif sampling_method=="sequential":    
                
                if subset_num + 1 >= num_subsets:
                    subset_num = 0                
                else:
                    subset_num += 1         
            else:
                raise NotImplementedError("Only {} and {} are implemented at the moment.".format("random","sequential"))          
            
            yield subset_num

sng = subset_num_generator(10, replacement=False)
for i in range(20):

    if i>=1 and np.mod(i,10)==0:
        print("new epoch")
        print(next(sng))
    else:
        print(next(sng))
4
0
1
6
8
2
9
7
5
3
new epoch
5
9
4
8
7
0
1
3
6
2

@paskino is this ok? Where should we add this function? At the moment, I have created a new script inside the Functions directory.

@epapoutsellis epapoutsellis changed the title from Subset sum function and SGDFunction to ApproximateGradientSumFunction and SGFunction Nov 18, 2022
@epapoutsellis
Contributor Author

epapoutsellis commented Nov 18, 2022

@KrisThielemans, @paskino, @jakobsj @zeljkozeljko

Changes after Hackathon meeting (17/11/2022) and CIL-Dev meeting (18/11/2022):

  • A utilities directory is added in cil.optimisation; it contains FunctionNumberGenerator.py

  • The accepted/implemented sampling_method values for FunctionNumberGenerator.py are: random, random_permutation, fixed_permutation.

    • random: every function is selected randomly, with replacement.
    fng = FunctionNumberGenerator(10, sampling_method="random")
    generated_numbers=[print(next(fng),end=' ') for _ in range(2*10)] # epochs = 2
    0 8 3 0 3 0 1 0 9 6 0 8 6 4 0 3 6 0 1 7 
    • random_permutation: every function is selected randomly without replacement. At the end of each epoch, i.e., after all the functions have been selected, the list of functions is permuted again.
    fng = FunctionNumberGenerator(10, sampling_method="random_permutation")
    generated_numbers=[print(next(fng),end=' ') for _ in range(2*10)]
    8 1 2 6 9 3 7 0 4 5 0 6 1 9 3 8 5 2 7 4 
    • fixed_permutation: the list of functions is permuted one time at the beginning (without replacement) and is used for all epochs
    fng = FunctionNumberGenerator(10, sampling_method="fixed_permutation")
    generated_numbers=[print(next(fng),end=' ') for _ in range(2*10)]
    5 6 4 1 0 9 3 7 8 2 5 6 4 1 0 9 3 7 8 2
  • SubsetSumFunction is renamed to ApproximateGradientSumFunction

  • ApproximateGradientSumFunction can accept either a string for the selection of functions or an instance of the FunctionNumberGenerator

  • The subset term is removed from ApproximateGradientSumFunction and SGFunction

  • Unittests for ApproximateGradientSumFunction and SGFunction are updated

Example code for the LS problem

import numpy as np
from cil.optimisation.algorithms import GD
from cil.optimisation.functions import ApproximateGradientSumFunction, LeastSquares, SGFunction
from cil.optimisation.utilities import FunctionNumberGenerator
from cil.optimisation.operators import MatrixOperator
from cil.framework import VectorData


# create LS problem (Ax = b)
np.random.seed(10)
n = 50
m = 500
n_subsets = 10

Anp = np.random.uniform(0,1, (m, n)).astype('float32')
xnp = np.random.uniform(0,1, (n,)).astype('float32')
bnp = Anp.dot(xnp)

Ai = np.vsplit(Anp, n_subsets) 
bi = [bnp[i:i+n] for i in range(0, m, n)]     

Aop = MatrixOperator(Anp)
bop = VectorData(bnp) 
ig = Aop.domain        
x_cil = ig.allocate('random')

fi_cil = []
for i in range(n_subsets):   
    Ai_cil = MatrixOperator(Ai[i])
    bi_cil = VectorData(bi[i])
    fi_cil.append(LeastSquares(Ai_cil, bi_cil, c=1.0))

    
F_SG = SGFunction(fi_cil)  # default selection = "random" 

# or 
# F_SG = SGFunction(fi_cil, selection="random_permutation")  

# or 
# generator = FunctionNumberGenerator(n_subsets, sampling_method="fixed_permutation")
# F_SG = SGFunction(fi_cil, generator)

step_size = 1./F_SG.L
initial = ig.allocate()
epochs = 200
sgd = GD(initial = initial, objective_function = F_SG, step_size = step_size,
            max_iteration = epochs * n_subsets, 
            update_objective_interval = 10*n_subsets)
sgd.run(verbose=1)   

print("Function selected")
k=0
for i in range(0, epochs, 20):
    print("Epoch : {}, function used : {}".format(i, F_SG.functions_used[k:n_subsets+k]))
    k+=n_subsets

# import cvxpy
# u_cvxpy = cvxpy.Variable(ig.shape[0])
# objective = cvxpy.Minimize( 0.5*cvxpy.sum_squares(Aop.A @ u_cvxpy - bop.array))
# p = cvxpy.Problem(objective)
# p.solve(verbose=True, solver=cvxpy.SCS, eps=1e-4)     

# np.testing.assert_allclose(p.value, sgd.objective[-1], atol=1e-1)
# np.testing.assert_allclose(u_cvxpy.value, sgd.solution.array, atol=1e-1)    
     Iter   Max Iter     Time/Iter            Objective
                               [s]                     
        0       2000         0.000          9.99876e+04
      100       2000         0.006          4.66201e+01
      200       2000         0.006          1.37573e+01
      300       2000         0.006          4.64783e+00
      400       2000         0.006          1.73196e+00
      500       2000         0.006          7.08509e-01
      600       2000         0.005          2.99555e-01
      700       2000         0.005          1.28043e-01
      800       2000         0.005          5.58902e-02
      900       2000         0.005          2.55412e-02
     1000       2000         0.005          1.11664e-02
     1100       2000         0.005          5.23994e-03
     1200       2000         0.005          2.30367e-03
     1300       2000         0.005          1.05134e-03
     1400       2000         0.005          4.96243e-04
     1500       2000         0.005          2.25485e-04
     1600       2000         0.005          1.04333e-04
     1700       2000         0.005          4.82236e-05
     1800       2000         0.005          2.21038e-05
     1900       2000         0.005          1.01833e-05
     2000       2000         0.005          4.69173e-06
-------------------------------------------------------
     2000       2000         0.005          4.69173e-06
Stop criterion has been reached.

Function selected
Epoch : 0, function used : [4, 0, 9, 6, 0, 1, 5, 1, 5, 5]
Epoch : 20, function used : [1, 3, 6, 4, 6, 3, 8, 5, 6, 8]
Epoch : 40, function used : [5, 4, 8, 9, 3, 1, 8, 5, 6, 0]
Epoch : 60, function used : [3, 4, 3, 6, 1, 6, 0, 9, 5, 8]
Epoch : 80, function used : [6, 5, 6, 3, 2, 0, 9, 1, 9, 7]
Epoch : 100, function used : [1, 0, 8, 6, 3, 0, 6, 3, 5, 2]
Epoch : 120, function used : [5, 9, 7, 7, 2, 0, 8, 8, 3, 7]
Epoch : 140, function used : [9, 3, 6, 0, 7, 7, 1, 1, 8, 8]
Epoch : 160, function used : [8, 7, 8, 0, 4, 7, 5, 9, 1, 5]
Epoch : 180, function used : [2, 2, 3, 1, 0, 0, 0, 0, 5, 5]

@epapoutsellis
Contributor Author

@zeljkozeljko thank you for your quick comments. In my latest commit, I implemented them as you suggested, but I changed them again today after the discussion we had with @paskino and @jakobsj. I have not spent time fixing the docs.

@KrisThielemans, @paskino, @jakobsj, @zeljkozeljko before starting to review this PR, could we agree on the following:

  • Name and signature of the FunctionNumberGenerator.
  • Name and signature of the ApproximateGradientSumFunction.

Contributor

@KrisThielemans KrisThielemans left a comment


Looks mostly ok to me. I don't know CIL naming conventions. In STIR/SIRF, CamelCase like FunctionNumberGenerator implies it is a class (which it isn't). That is one for @paskino .

One problem is duplication of doc on the generator in all derived classes, which will be terrible.
Possible solutions:

  • refer to doc of FunctionNumberGenerator (or whatever that ends up being called).

  • add a member set_selection_generator to ApproximateGradientSumFunction. Someone might want to change this over epochs. Doc of all __init__ functions could then refer to the doc of that member (which is only written once).

  • give up passing strings in all those constructors and enforce that it has to be a generator.

-----------
functions : list(functions)
A list of functions: :code:`[F_{1}, F_{2}, ..., F_{n}]`. Each function is assumed to be smooth function with an implemented :func:`~Function.gradient` method.
selection : :obj:`string`, Default = :code:`random`. It can be :code:`random`, :code:`random_permutation`, :code:`fixed_permutation`.
Contributor

good idea to allow a string here, but the doc needs to mention the generator option. Also, I recommend referring to the doc of FunctionNumberGenerator for allowed strings. Otherwise things are going to desynchronise.

Comment on lines +31 to +34
sampling : :obj:`string`, Default = :code:`random`
Selection process for each function in the list. It can be :code:`random` or :code:`sequential`.
replacement : :obj:`boolean`. Default = :code:`True`
The same subset can be selected when :code:`replacement=True`.
Contributor

somehow avoid duplication as this is incorrect


.. math:: \partial F_{i}(x) .

The ith function is selected from the :meth:`~SubsetSumFunction.next_subset` method.
Contributor

incorrect

Note
----

The :meth:`~SGFunction.gradient` computes the `gradient` of one function from the list :math:`[F_{1},\cdots,F_{n}]`,
Contributor

times N


def approximate_gradient(self,subset_num, x, out):

""" Returns the gradient of the selected function at :code:`x`. The function is selected using the :meth:`~SubsetSumFunction.next_subset`
Contributor

times N


initialise_tests()

class TestApproximateGradientSumFunction(unittest.TestCase):
Contributor

no test for the string argument


if replacement is False:
# create a list of functions without replacement, first permutation
list_of_functions = np.random.choice(range(num_functions),num_functions, replace=False)
Contributor

I guess this variable should be called list_of_function_nums. I'd also prefix it with _ to make it "private"

Selection process for each function in the list. It can be :code:`random`, :code:`random_permutation`, :code:`fixed_permutation`.
- :code:`random`: Every function is selected randomly with replacement.
- :code:`random_permutation`: Every function is selected randomly without replacement. After selecting all the functions in the list, i.e., after one epoch, the list is randomly permuted.
- :code:`fixed_permuation`: Every function is selected randomly without replacement and the list of function is permuted only once.
Contributor

I don't see why anyone would want to use fixed_permutation. If anything, they'd want to specify an order. No harm in keeping it of course.

Probably the only 2 fixed orders that make sense are 0, 1, 2, 3, ... and 0, n/2, n/4, 3n/4, ... (with the last appropriate when the functions correspond to projections ordered by angle). Maybe you could call these "fixed_sequential" and "fixed_staggered"?

@robbietuk would have code to implement the "staggered" approach
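
For illustration, a staggered order of this kind could be generated as follows (a sketch only; the name staggered_order is hypothetical and not part of this PR):

def staggered_order(n):
    """Indices 0..n-1 ordered so that consecutive entries are far apart,
    e.g. staggered_order(8) -> [0, 4, 2, 6, 1, 3, 5, 7]."""
    order = []
    step = n
    while step >= 1:
        for i in range(0, n, step):
            if i not in order:
                order.append(i)
        step //= 2
    return order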


# Select the next function using the selection:=function_number_generator
self.function_num = next(self.selection)
self.functions_used.append(self.function_num)
Contributor

this should do a range check and write a nice error message if out of range
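
For example (a sketch only; it reuses the names from the snippet above and assumes the num_functions attribute set in the constructor):

# Select the next function and validate the index before using it
self.function_num = next(self.selection)
if not (0 <= self.function_num < self.num_functions):
    raise ValueError(
        "Selection generator returned {}, expected an index in [0, {})".format(
            self.function_num, self.num_functions))
self.functions_used.append(self.function_num)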

@epapoutsellis
Contributor Author

Thank you @KrisThielemans for your comments. Before addressing your comments and @zeljkozeljko's, I think we need to have a final decision on the design. Please have a look at #1377.

@KrisThielemans
Contributor

You might consider using the new interface of numpy.random which uses (quite confusingly named) Generators, see https://numpy.org/doc/stable/reference/random/index.html

rng = np.random.default_rng(optional_seed)
rints = rng.integers(low=0, high=10, size=3)

it would allow passing a seed such that the generated sequence is reproducible (cannot do that with np.random.choice as you don't know who else is going to call it). Of course, that could be done later.
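
For instance, the uniform-with-replacement branch could draw from a seeded numpy Generator (a sketch, not the current implementation):

import numpy as np

def function_num_generator(num_functions, seed=None):
    # reproducible uniform sampling with replacement
    rng = np.random.default_rng(seed)
    while True:
        yield int(rng.integers(0, num_functions))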

@KrisThielemans
Contributor

Recording part of an email conversation here on adding capabilities for non-uniform random sampling. @epapoutsellis had a proposal but noted that algorithms might need to ask what those probabilities are. I think that leads back to the need of having a class. At that point, a "generator" is probably not useful anymore and we can revert to having an iterator interface

class IndexGenerator:
  def __init__(lots of arguments here)
  def get_probabilities(self)
  def uses_replacement(self)
  # either this is an iterable object 
  def iter(self) # could return a generator as far as I know
  # or it is already an iterator
  def __iter__(self)
  def __next__(self)

and we'd have to decide to use a string + extra arguments

ig = IndexGenerator(num_functions, "random", probabilities=[.1, .3, .2, …])
gen = ig.iter()
next(gen)

or use a hierarchy of derived classes, e.g.

  class RandomIndexGenerator(IndexGenerator):
     def __init__(either_num_functions_or_list_of_probabilities, replacement=True)
     ...
  
 ig = RandomIndexGenerator (5)

where I think the latter design is more easily extendable.
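
A compact runnable sketch of the derived-class design (the class names are the ones proposed above, not existing CIL classes; only the with-replacement case is shown):

import numpy as np

class IndexGenerator:
    def get_probabilities(self):
        raise NotImplementedError
    def uses_replacement(self):
        raise NotImplementedError
    def __iter__(self):
        return self

class RandomIndexGenerator(IndexGenerator):
    def __init__(self, num_functions, probabilities=None, replacement=True, seed=None):
        self.num_functions = num_functions
        # default to uniform probabilities
        self.probabilities = probabilities or [1.0 / num_functions] * num_functions
        self.replacement = replacement
        self._rng = np.random.default_rng(seed)
    def get_probabilities(self):
        return self.probabilities
    def uses_replacement(self):
        return self.replacement
    def __next__(self):
        # draw one index according to the stored probabilities (with replacement)
        return int(self._rng.choice(self.num_functions, p=self.probabilities))

# usage
gen = RandomIndexGenerator(5)
index = next(gen)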

@epapoutsellis
Contributor Author

@KrisThielemans @paskino @jakobsj see below my proposal. I created a base class RandomSampling that creates random indices or batches (e.g. mini-batch SGD). Depending on the batch_size, we control whether a RandomIndex or a RandomBatch instance is created. The signature is the same for these two classes and only the __next__ method differs. I also included some static methods in the base class that work for different batch_sizes. I guess we need to decide on the names; I have no objection.

import numpy as np
import random

class RandomSampling():
    
    def __new__(cls, num_indices, batch_size=1, prob = None, replace = True, shuffle=False, seed = None):
        
        cls.batch_size = batch_size
        
        if cls.batch_size == 1:
            return super(RandomSampling, cls).__new__(RandomIndex)
        else:
            return super(RandomSampling, cls).__new__(RandomBatch)
    
    def __init__(self, num_indices, batch_size=1, prob = None, replace = True, shuffle=False, seed = None ):
        
        self.num_indices = num_indices
        self.batch_size = batch_size
        self.equal_size_batches = self.num_indices%self.batch_size==0        
        if self.equal_size_batches:
            self.num_batches = self.num_indices//self.batch_size
        else:
            self.num_batches = (self.num_indices//self.batch_size)+1        
        self.prob = prob
        self.replace = replace
        self.shuffle = shuffle
        self.indices_used = []
        self.index = 0
        np.random.seed(seed)
                            
        if self.replace is False: 
            self.list_of_indices = np.random.choice(num_indices, size=self.num_indices, p=prob, replace=False)                                        
        else:
            if shuffle is True and self.batch_size==1:
                raise ValueError("Shuffle is used only with replace=False")   
            self.list_of_indices = np.random.choice(num_indices, size=self.num_indices, p=prob, replace=True)         
            
        if self.batch_size>1: 
            self.partition_list = [self.list_of_indices[i:i + self.batch_size] for i in range(0, self.num_indices, self.batch_size)]             
            
    def show_epochs(self, epochs):
        
        total_its = epochs * self.num_indices
        for _ in range(total_its):
            next(self)
                    
        if self.batch_size==1:
            k = 0
            for i in range(epochs):
                print(" Epoch : {}, indices used : {} ".format(i, self.indices_used[k:k+self.num_indices]))    
                k+=self.num_indices 
            print("")   
        else:
            k=0
            for i in range(epochs):
                print(" Epoch : {}, batchs used : {} ".format(i, self.indices_used[k:k+self.num_batches]))                 
                k += self.num_batches
                
            
    @staticmethod    
    def uniform(num_indices, batch_size = 1, seed=None):
        return RandomSampling(num_indices,  batch_size=batch_size, prob=None, replace = True, shuffle = False, seed = seed)
    
    @staticmethod
    def non_uniform(num_indices, prob, batch_size = 1, seed=None):
        return RandomSampling(num_indices, batch_size=batch_size, replace = True, prob=prob, shuffle = False, seed = seed) 
    
    @staticmethod    
    def uniform_no_replacement(num_indices, batch_size = 1, shuffle=False, seed=None):
        return RandomSampling(num_indices, batch_size=batch_size, prob=None, replace = False, shuffle = shuffle, seed = seed) 
    
    @staticmethod    
    def non_uniform_no_replacement(num_indices, prob, batch_size = 1, shuffle=False, seed=None):
        return RandomSampling(num_indices, batch_size=batch_size, prob=prob, replace = False, shuffle = shuffle, seed = seed)     
        
    @staticmethod
    def single_shuffle(num_indices, batch_size = 1, seed=None):
        return RandomSampling(num_indices, batch_size = batch_size, replace = False, shuffle = False, seed = seed)
    
    @staticmethod
    def random_shuffle(num_indices, batch_size = 1, seed=None):
        return RandomSampling(num_indices, batch_size = batch_size, replace = False, shuffle = True, seed = seed)              
            

class RandomBatch(RandomSampling):
            
    def __next__(self):
        
        tmp = list(self.partition_list[self.index])
        self.indices_used.append(tmp)         
        self.index+=1
        
        if self.index==len(self.partition_list):
            self.index=0            
            
            if self.shuffle is True:
                self.list_of_indices = np.random.choice(self.num_indices, size=self.num_indices, p=self.prob, replace=self.replace)        
                self.partition_list = [self.list_of_indices[i:i + self.batch_size] for i in range(0, self.num_indices, self.batch_size)]
                                               
        return tmp
    
class RandomIndex(RandomSampling):   
    
    def __next__(self):
        
        if self.replace is False:

            index_num = self.list_of_indices[self.index]
            self.index+=1                

            if self.index == self.num_indices:
                self.index = 0                
                if self.shuffle is True:                    
                    self.list_of_indices = np.random.choice(self.num_indices, size=self.num_indices, p=self.prob, replace=False)                                                                                         
        else:

            index_num = np.random.choice(self.num_indices, size=1, p=self.prob, replace=True).item()

        self.indices_used.append(index_num)

        return index_num  

Examples

Our users can do, in the case of batch_size=1

rs = RandomSampling.uniform(10)
rs.show_epochs(epochs=3)
 Epoch : 0, indices used : [8, 1, 4, 8, 5, 9, 1, 9, 0, 3] 
 Epoch : 1, indices used : [3, 4, 9, 2, 4, 0, 7, 9, 4, 1] 
 Epoch : 2, indices used : [9, 6, 6, 8, 9, 6, 2, 0, 1, 5] 
rs = RandomSampling.random_shuffle(10, seed=100)
rs.show_epochs(epochs=3)
 Epoch : 0, indices used : [7, 6, 1, 5, 4, 2, 0, 3, 9, 8] 
 Epoch : 1, indices used : [7, 8, 5, 3, 4, 6, 0, 1, 9, 2] 
 Epoch : 2, indices used : [9, 7, 1, 8, 5, 4, 3, 2, 6, 0] 
tmp = [random.random() for i in range(10)]
tmp_sum = sum(tmp)
prob = [i/tmp_sum for i in tmp]
rs = RandomSampling.non_uniform(10, prob=prob)
rs.show_epochs(epochs=3)
 Epoch : 0, indices used : [2, 5, 7, 7, 2, 1, 6, 6, 2, 7] 
 Epoch : 1, indices used : [7, 4, 9, 9, 7, 1, 6, 1, 4, 9] 
 Epoch : 2, indices used : [6, 5, 6, 2, 5, 2, 8, 2, 2, 2] 

And for the case where batch_size>1

rs = RandomSampling.uniform(10, batch_size=5)
rs.show_epochs(epochs=3)
 Epoch : 0, batchs used : [[5, 7, 9, 5, 9], [1, 4, 6, 9, 6]] 
 Epoch : 1, batchs used : [[5, 7, 9, 5, 9], [1, 4, 6, 9, 6]] 
 Epoch : 2, batchs used : [[5, 7, 9, 5, 9], [1, 4, 6, 9, 6]] 
rs = RandomSampling.single_shuffle(10, batch_size=2, seed=100)
rs.show_epochs(epochs=3)
 Epoch : 0, batchs used : [[7, 6], [1, 5], [4, 2], [0, 3], [9, 8]] 
 Epoch : 1, batchs used : [[7, 6], [1, 5], [4, 2], [0, 3], [9, 8]] 
 Epoch : 2, batchs used : [[7, 6], [1, 5], [4, 2], [0, 3], [9, 8]] 

Comparison ISTA, proxSGD, proxMBSGD

For the 3D walnut dataset, I reconstructed one 2D slice with

 Acquisition Geometry 2D: (1601, 392) with labels ('angle', 'horizontal')
 Image Geometry 2D: (392, 392) with labels ('horizontal_y', 'horizontal_x')

For the optimal $x^{*}$, I used FISTA with 1000 iterations.

  • num_subsets=100 and n_epochs=40.
  • The subsets are generated using AcquisitionGeometrySubsetGenerator with random_permutation.
  • For both proxSGD and proxMBSGD, I use uniform sampling with replacement and batch_size=5 for the latter.
  • step_size is the same for both.
rs2 = RandomSampling(len(f_subsets), batch_size=5) 
F2 = SGFunctionBase(f_subsets, selection=rs2)

For the proxMBSGD we need to add some lines in the approximate_gradient method as

    if isinstance(self.function_num, int):
        self.functions[self.function_num].gradient(x, out=out)
        out*=self.num_functions        
    else:
        tmp = 0
        # function_num is a list of lists, prange???
        for i in self.function_num:
            tmp += self.functions[i].gradient(x)        
        out.fill(tmp)
        out *= self.selection.num_batches

[Plots of the objective decrease and reconstructed images comparing ISTA, proxSGD and proxMBSGD.]

@MargaretDuff
Member

Superseded by #1550
