Replies: 3 comments 5 replies
-
I actually developed an adaptive selection of `gamma`:

```python
# adaptive gamma
# assumes F, G, K and normK are defined as in the PDHG notebook
from cil.optimisation.algorithms import PDHG

algo = PDHG(f=F, g=G, operator=K)
algo.max_iteration = 5000
algo.update_objective_interval = 20
for i in range(100):
    algo.run(20)
    # re-estimate gamma from the current iterate and rescale the step sizes
    gamma = algo.solution.norm()
    sigma = gamma / normK
    tau = 1 / (gamma * normK)
    algo.sigma = sigma
    algo.tau = tau
```

This is the comparison of the objective value of the adaptive vs the default choice of `gamma`:

[plot: objective value, adaptive vs default `gamma`]
-
This is an interesting idea! How do you initialize the primal and dual variables of PDHG? One of the proven convergence rates for PDHG bounds the (ergodic) primal-dual gap after $N$ iterations by

$$\frac{1}{N}\left(\frac{\|x_0 - x^\star\|^2}{2\tau} + \frac{\|y_0 - y^\star\|^2}{2\sigma}\right).$$

If `tau = gamma * beta` and `sigma = beta / gamma`, then the "optimal" `gamma` (which minimizes the constant in the rate) is

$$\gamma_2 = \frac{\|x_0 - x^\star\|}{\|y_0 - y^\star\|}.$$

How does `gamma2` relate to `gamma1 = norm(x*)`? In general these are probably quite different, but perhaps they are similar for your examples?
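For reference, a sketch of the minimization that gives $\gamma_2$ under the parametrization above:

$$C(\gamma) = \frac{\|x_0 - x^\star\|^2}{2\gamma\beta} + \frac{\gamma\,\|y_0 - y^\star\|^2}{2\beta}, \qquad C'(\gamma) = 0 \;\Longrightarrow\; \gamma_2 = \frac{\|x_0 - x^\star\|}{\|y_0 - y^\star\|}.$$

In particular, with the common initialization $x_0 = 0$, $y_0 = 0$, this reduces to $\gamma_2 = \|x^\star\| / \|y^\star\|$, which shares the numerator of $\gamma_1 = \|x^\star\|$ but is scaled by the norm of the dual solution.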
-
Do you see the same improvement with `KullbackLeibler`? Also, one more thing: I noticed a while ago some problems with the primal/dual/pdgap objectives when using ASTRA on the GPU. I believe this is due to the adjoint mismatch, so for the comparison I would choose the CPU backend.
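A quick way to check for such an adjoint mismatch is the dot-product test; a minimal sketch, assuming a CIL-style operator `K` (e.g. the projection operator used in the experiments):

```python
# Dot-product (adjoint) test: for an exact adjoint pair,
# <K x, y> == <x, K^T y> for any x, y. A large relative
# discrepancy points to an adjoint mismatch in the backend.
x = K.domain_geometry().allocate('random')
y = K.range_geometry().allocate('random')
lhs = K.direct(x).dot(y)    # <K x, y>
rhs = x.dot(K.adjoint(y))   # <x, K^T y>
print(abs(lhs - rhs) / abs(lhs))
```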
-
After discussing with @gschramm at Fully3D about his heuristic choice of step sizes in SPDHG, I made a few experiments on PDHG based on the PDHG notebook.
His idea is rather simple. One can introduce a parameter `gamma` that controls the ratio between the step sizes for the primal and dual variables, `tau` and `sigma` respectively. Actually, the docstring says so:

- `CIL/Wrappers/Python/cil/optimisation/algorithms/PDHG.py`, lines 89 to 90 in `39b6f7a`
- `CIL/Wrappers/Python/cil/optimisation/algorithms/PDHG.py`, line 150 in `39b6f7a`
- `CIL/Wrappers/Python/cil/optimisation/algorithms/PDHG.py`, line 163 in `39b6f7a`
What I had not realised when talking to @gschramm is that he was using preconditioning. However, one can use the following step sizes:
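A sketch of that choice, consistent with the adaptive snippet earlier in the thread and with the docstring:

```python
# gamma-parametrized step sizes: sigma * tau * normK**2 == 1,
# which satisfies the PDHG convergence condition sigma * tau <= 1 / normK**2
sigma = gamma / normK
tau = 1 / (gamma * normK)
```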
OK, so now the problem of the choice of step sizes has become the problem of the choice of `gamma`. @gschramm's idea is that the fastest convergence is obtained for `gamma` equal to the norm of the solution. So one should first run another algorithm, like OSEM or FBP, calculate that norm, and then use it. If I understood correctly, this makes the SPDHG iteration similar to OSEM when the data fidelity is `KullbackLeibler`.
In the case of CT the data fidelity is normally `L2NormSquared`, given the different noise model. At any rate, with all the differences (data fidelity, no preconditioning), I tried to choose `gamma` as @gschramm does. Actually, I ran a grid search over values of `gamma` starting around the norm of the solution. So this is what I did:
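A minimal sketch of the grid search, assuming `F`, `G`, `K` and `normK` as above, and a hypothetical reference reconstruction `recon` (e.g. from FBP) to estimate the solution norm; the names `gs` and `g` match the plot below:

```python
import numpy as np

# grid of gamma values around the norm of a reference solution
gamma0 = recon.norm()
gs = gamma0 * np.logspace(-2, 2, 9)

objectives = {}
for g in gs:
    algo = PDHG(f=F, g=G, operator=K,
                sigma=g / normK, tau=1 / (g * normK))
    algo.max_iteration = 5000
    algo.update_objective_interval = 100
    algo.run(100)
    # record the objective reached after 100 iterations for this gamma
    objectives[g] = algo.objective[-1]
```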
Below I plot the objective value of PDHG after 100 iterations as a function of `gamma`, or rather of the values `gs`. As you can see, there is a minimum at `g = 3e-1`, and the default choice of `sigma=1.` is marked with a red dot.

[plot: objective value after 100 iterations vs `gs`, red dot marking the default `sigma=1.`]

If you choose the `gamma` at the minimum, you get the same objective value (and reconstruction) as the default choice in 1/5 of the iterations, i.e. 1000 vs 5000.
Below I show the ratio of the objective values of the two PDHG optimisations. As you can see, the default choice is always higher than the optimal choice. (The x axis is in hundreds of iterations.)

[plot: ratio of the objective values, default vs optimal `gamma`]