Use gcloud storage instead of gsutil #246

Open
carbocation opened this issue Oct 7, 2022 · 8 comments

Comments

@carbocation

It seems that gcloud storage will be substantially faster than gsutil for localization/delocalization. It would make sense either to apply the shim or to transition to using gcloud storage in place of gsutil in dsub.

@mbookman
Contributor

Thanks for the pointer @carbocation! We will take a look.

@carbocation
Author

So far in my tests, gcloud storage has been a successful drop-in replacement for gsutil (including the various tasks like ls, cat, cp, rm, etc, as well as with flags like -J, -n, etc). The only occasionally tricky bit (other than making sure the host machine used to launch dsub is upgraded and can use gcloud storage) has been to make sure that the docker instance has an acceptable version of google cloud tools to allow the same.
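A minimal sketch of how a launcher might choose between the two CLIs, assuming only that `gcloud storage cp` and `gsutil cp` accept the same source and destination arguments, as reported above. The function name `build_copy_command` and its `which` parameter are hypothetical and are not part of dsub.

```python
import shutil
from typing import Callable, List, Optional


def build_copy_command(
    src: str,
    dst: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a copy command, preferring `gcloud storage` when gcloud is on PATH.

    Hypothetical helper for illustration; it only builds the argument list
    and does not run anything.
    """
    if which("gcloud"):
        return ["gcloud", "storage", "cp", src, dst]
    if which("gsutil"):
        return ["gsutil", "cp", src, dst]
    raise FileNotFoundError("neither gcloud nor gsutil was found on PATH")
```

This only checks that a binary exists. An older gcloud inside a Docker image may lack the `storage` command group, so a version check would still be needed there.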

@mbookman
Contributor

Overall, gcloud storage looks pretty good. The performance improvements are real and minimal code changes are needed to get them. That's pretty exciting.

That said, I have twice (in only a limited number of total tests) had downloads fail with errors like:

ERROR: Source hash fjoXWA== does not match destination hash ELZtmQ== for object ./NA12878.cg.bam_.gstmp.

I've filed a bug for this and will post back here what I learn.

FWIW, the test was to pull down 11 files (1.2 TB) to a GCE VM:

$ gcloud storage cp gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/* .
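The hashes in the error above look like base64-encoded 4-byte values, which matches how Cloud Storage reports CRC32C checksums (big-endian, base64). A minimal sketch, under that assumption, for recomputing the checksum of a local file in pure Python so it can be compared with the value in the object's metadata. The function names are hypothetical, and the bitwise loop is slow: it is meant as an illustration, not as something to run over 1.2 TB.

```python
import base64
import struct

_POLY = 0x82F63B78  # reflected Castagnoli polynomial used by CRC32C


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C; pass the previous result as `crc` to process chunks."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def gcs_style_crc32c(path: str, chunk_size: int = 1 << 20) -> str:
    """Base64 of the big-endian CRC32C of a file, read in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = crc32c(chunk, crc)
    return base64.b64encode(struct.pack(">I", crc)).decode("ascii")
```

If the recomputed value for a finished download matches the object's stored CRC32C, the local copy is intact regardless of what an earlier attempt reported.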

@carbocation
Author

carbocation commented Oct 14, 2022

Hearing about the error you encountered, I ran another test and got an "exciting" result after about 8,000 ~1 MB files were copied (out of ~80,000 files):

Pausing command execution:

This command requires the `gcloud-crc32c` component to be installed. Would you like to install the `gcloud-crc32c` component to continue command execution? (Y/n)?  
Copying gs://bucket/path/to/file.xml to file://./file.xml
Pausing command execution:

This command requires the `gcloud-crc32c` component to be installed. Would you like to install the `gcloud-crc32c` component to continue command execution? (Y/n)? 

I was not watching this happen, and it then proceeded:

For the latest full release notes, please visit:
  https://cloud.google.com/sdk/release_notes

Do you want to continue (Y/n)?  
╔════════════════════════════════════════════════════════════╗
╠═ Creating update staging area                             ═╣
⠏ Completed files 8357 | 6.5GiB | 99.2MiB/s

Your current Google Cloud CLI version is: 405.0.0
Installing components from version: 405.0.0

┌───────────────────────────────────────────────────┐
│        These components will be installed.        │
Copying gs://bucket/path/to/anotherfile.xml to file://./anotherfile.xml
├───────────────────────────────┬─────────┬─────────┤
│              Name             │ Version │   Size  │
├───────────────────────────────┼─────────┼─────────┤
│ Google Cloud CRC32C Hash Tool │   1.0.0 │ 1.2 MiB │
└───────────────────────────────┴─────────┴─────────┘

And ultimately it failed:

⠼WARNING: Post processing failed.  Run `gcloud info --show-log` to view the failures.

==> Start a new shell for the changes to take effect.


Update done!

And is hanging:

⠧ Completed files 8608 | 6.7GiB | 99.2MiB/s

(That leftmost character is being animated around.)

And this was a very exciting failure mode indeed, because gcloud is no longer installed at its usual $PATH:

$ gcloud info --show-log
bash: /home/james/applications/google-cloud-sdk/bin/gcloud: No such file or directory
james@host:/mnt/storage 
$ which gcloud

So I guess this is just to say, it may require special care (e.g., making sure the Google Cloud CRC32C Hash Tool is installed) to make sure there is not unexpected behavior...
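One way to avoid an interactive prompt in the middle of a transfer is to install the component before copying. A minimal sketch, assuming the component ID is `gcloud-crc32c` as shown in the prompt above, and assuming gcloud was installed in a way that permits `gcloud components install` (package-manager installs typically do not). The helper names are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def noninteractive_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy of the environment with gcloud prompts disabled."""
    env = dict(os.environ if base is None else base)
    # CLOUDSDK_CORE_DISABLE_PROMPTS maps to gcloud's core/disable_prompts property.
    env["CLOUDSDK_CORE_DISABLE_PROMPTS"] = "1"
    return env


def crc32c_install_command() -> List[str]:
    """Command that installs the hash tool up front; --quiet accepts defaults."""
    return ["gcloud", "components", "install", "gcloud-crc32c", "--quiet"]


def prepare_and_copy(src: str, dst: str) -> None:
    """Install the component first, then copy with prompts disabled."""
    env = noninteractive_env()
    subprocess.run(crc32c_install_command(), env=env, check=True)
    subprocess.run(["gcloud", "storage", "cp", src, dst], env=env, check=True)
```

Disabling prompts on its own would just accept the default answer, which here means installing mid-transfer, so the up-front install is the part that matters.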

@carbocation
Author

After reinstalling gcloud so I could finally finish my download*, I was also able to get a couple of hash mismatches, though this only occurred in about 1 out of every ~25,000 files for me:

ERROR: Source hash kfmcAw== does not match destination hash ebdNqw== for object ./file.xml_.gstmp.
\* The download stalled at 99.97% and did not finish until almost exactly as much time had passed after the stall as before it, so the last 0.03% took about 50% of the total download time.

Maybe not quite ready for prime time.

@mbookman
Contributor

mbookman commented Nov 7, 2022

Hi @carbocation !

Wanted to give an update on this. The Cloud team was able to root cause the problem; it was a fairly straight-forward failure case that needed a retry. The report is that the fix is targeted for Cloud SDK 410.0.0. You can keep an eye out for releases here:

https://cloud.google.com/sdk/docs/release-notes

We'll give dsub integration another pass when we see that version drop.

-Matt
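Until an SDK with the fix is installed, a caller could wrap the copy in its own retry. A minimal sketch, assuming a non-zero exit status is how the failed download shows up to the caller (not confirmed in this thread). `run_with_retries` and its parameters are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(
    command: Sequence[str],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    runner: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command` until it exits 0 or `attempts` runs have been made.

    Returns the number of runs that were needed; raises RuntimeError otherwise.
    """
    for attempt in range(1, attempts + 1):
        if runner(command) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds * attempt)  # simple linear backoff
    raise RuntimeError(f"command failed after {attempts} attempts: {list(command)}")
```

For example, `run_with_retries(["gcloud", "storage", "cp", src, dst])` would rerun the whole copy, which is coarse for large transfers but simple to reason about.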

@carbocation
Author

[...] The Cloud team was able to root cause the problem; it was a fairly straight-forward failure case that needed a retry. The report is that the fix is targeted for Cloud SDK 410.0.0. [...]

Thanks for that update! 410.0.0 just came out and I don't see a mention of cloud storage, so I am guessing this fix didn't make it into 410. I'll keep my eyes peeled for when the fix eventually does make its way in.

@mbookman
Contributor

The report from Google engineering is that while they missed adding a release note update, the code fix is indeed in 410.0.0.
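Since the fix is not mentioned in the release notes, a launcher could check the installed version directly. A minimal sketch, assuming the output of `gcloud version` contains a line of the form `Google Cloud SDK 410.0.0`. The names `FIXED_IN`, `parse_sdk_version`, and `has_retry_fix` are hypothetical.

```python
import re
from typing import Optional, Tuple

FIXED_IN = (410, 0, 0)  # version reported in this thread as containing the fix


def parse_sdk_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from `gcloud version` output, if present."""
    match = re.search(r"Google Cloud SDK (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_retry_fix(version_output: str) -> bool:
    """True when the parsed SDK version is at least FIXED_IN."""
    version = parse_sdk_version(version_output)
    return version is not None and version >= FIXED_IN
```

The text passed in would come from something like `subprocess.run(["gcloud", "version"], capture_output=True, text=True).stdout`, run both on the launching host and inside the Docker image.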
