Backoff still not working well #769

MattWellie · 2024-05-09T01:32:47Z

From the logging messages emitted during graphQL queries it looks like the backoff isn't running for nearly as long as the code suggests it should.

Ref:

metamist/metamist/graphql/__init__.py

Line 145 in f1b9811

@backoff.on_exception(

Batch example: https://batch.hail.populationgenomics.org.au/batches/445352/jobs/1

Logging:

2024-05-09 00:39:35 INFO (backoff 105): Backing off query(...) for 0.4s (gql.transport.exceptions.TransportServerError: 504 Server Error: Gateway Timeout for url: https://sample-metadata-api-mnrpw3mdza-ts.a.run.app/graphql)
2024-05-09 00:44:36 ERROR (backoff 120): Giving up query(...) after 2 tries (gql.transport.exceptions.TransportServerError: 504 Server Error: Gateway Timeout for url: https://sample-metadata-api-mnrpw3mdza-ts.a.run.app/graphql)

> docker run australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_workflows:latest pip show metamist
Name: metamist
Version: 6.10.1
...

The text was updated successfully, but these errors were encountered:

nevoodoo · 2024-05-30T03:47:25Z

Debugging

So I think the reason for this could be that the backoff decorator only works when you have a session instantiated like:

session = await client.connect_async(reconnecting=True)

result = await session.execute(query)

Should also be noted that decorator effects don't get passed to methods that are called inside the the wrapped method so it is possible that the code in

metamist/metamist/graphql/__init__.py

Lines 166 to 169 in f1b9811

    
           response = (client or configure_sync_client()).execute_sync( 
        
               _query if isinstance(_query, DocumentNode) else gql(_query), 
        
               variable_values=variables, 
        
           )

does not pass the decorator effects to the session object that is initiated inside execute_sync as needed.

Potential Solution

An approach that could be worth trying is to modify the RequestsHTTPTransport instance here:

metamist/metamist/graphql/__init__.py

Lines 71 to 74 in f1b9811

    
           transport = RequestsHTTPTransport( 
        
               url=url or get_sm_url(), 
        
               headers={'Authorization': f'Bearer {token}'}, 
        
           )

to

transport = RequestsHTTPTransport(
            url=url or get_sm_url(),
            headers={'Authorization': f'Bearer {token}'},
            retries=6,
            retry_backoff_factor=1.0,
        )

This would make 6 retries, spaced at 1s, 2s, 4s, 8s, 16s, 32s before giving up. I believe this can be tried without the backoff decorator.

Alternatively, we could just instantiate the session variable in the code and try to use the decorator like we are right now and see if that helps.

MattWellie · 2024-05-30T04:12:54Z

Happy to try out the retries/backoff in that object/client. I only went with the backoff decorator as it was working A-ok in my local experimentation and looked to be simple. After a few revisions I'm kinda sick of it now 😓

The backoff/retry was working, just not necessarily according to the timing options I set. It did seem to recognise the error types to catch etc. so I'm unsure about how some but not all of the arguments are being respected...

nevoodoo · 2024-05-30T04:37:53Z

Mmmmm, yeah this does look tricky. I'd investigate it further but I'm not too clear on how to reproduce this, and I also do not have the perms to see that batch so most of this is a shot in the dark for me. If you have the bandwidth to try the retries/backoff in the object/client, we can rule out/assess next steps from there and I'll look into it a bit more then :(

illusional · 2024-06-02T11:30:25Z

Can we just roll our own? This should be fairly straightforward to implement.

MattWellie added the low priority don't sweat it label May 9, 2024

nevoodoo self-assigned this May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backoff still not working well #769

Backoff still not working well #769

MattWellie commented May 9, 2024 •

edited

Loading

nevoodoo commented May 30, 2024

MattWellie commented May 30, 2024

nevoodoo commented May 30, 2024

illusional commented Jun 2, 2024

Backoff still not working well #769

Backoff still not working well #769

Comments

MattWellie commented May 9, 2024 • edited Loading

nevoodoo commented May 30, 2024

Debugging

Potential Solution

MattWellie commented May 30, 2024

nevoodoo commented May 30, 2024

illusional commented Jun 2, 2024

MattWellie commented May 9, 2024 •

edited

Loading