Uploading multiple files in parallel causes some to fail #975
Comments
A local development setup (…)
**Update**

It looks like the … The setting's description reads: …
The way I read it, this variable is only relevant if there are several uploads going on at the same time.

**Tests**

Tested on a local …
Another update: The above-mentioned fix helps for a deployment on my local machine, but not on our test deployment 🥴

Firefox shows …

Gunicorn logs a 400 response: …

Nginx logs a 408 response: …

Perhaps the different timestamps are interesting?

Edit: The nginx logs pop up as soon as the upload is aborted, but the gunicorn logs for all the files pop up at the same time (in my case, once the initial upload succeeds, as that takes long).
Thanks for debugging this. I remember we've done some nginx fine-tuning from time to time when switching between infras, but after the migration to InvenioRDM we sort of ended up in a "stable-enough" setup and left it untouched. I see that right now, we don't set the …
I think the overall issue, though, is that the parallel HTTP requests are not really taking place correctly on the client side, and configuring the server to wait longer just gives some extra time for one of the other uploads to finish. We have this 3-step request flow for each file upload, but I feel that some of the HTTP requests are initiated prematurely, which ends up parallelizing the actual sending of bytes on all the open connections instead of having some sort of queueing in place... This has to be verified in the client-side logic though.
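To make the queueing idea concrete, here is a minimal sketch in Python (the real client is JavaScript, so this is only an illustration; the API URL, record id, and token are hypothetical placeholders following the register → upload content → commit flow): it caps the number of in-flight uploads, so remaining files wait in a queue instead of all sending bytes at once.

```python
# Minimal sketch of queued uploads with bounded concurrency. All names
# (API URL, record id, token) are hypothetical placeholders; the point is
# that at most MAX_PARALLEL uploads send bytes at any one time, while the
# rest wait in the executor's queue.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

API = "https://example.org/api/records/abcd-1234/draft/files"  # placeholder
HEADERS = {"Authorization": "Bearer <TOKEN>"}  # placeholder
MAX_PARALLEL = 3


def upload(path: Path) -> str:
    # Step 1: register the file on the draft
    requests.post(API, json=[{"key": path.name}], headers=HEADERS).raise_for_status()
    # Step 2: stream the actual bytes
    with path.open("rb") as fp:
        requests.put(f"{API}/{path.name}/content", data=fp, headers=HEADERS).raise_for_status()
    # Step 3: commit the upload
    requests.post(f"{API}/{path.name}/commit", headers=HEADERS).raise_for_status()
    return path.name


with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    for name in pool.map(upload, Path("uploads").iterdir()):
        print(f"uploaded {name}")
```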
**Another update**

Good news: I made a little …

**Config**

The relevant part of the config is:

```nginx
## API files
# Another location is defined in order to allow large file uploads in the files
# API without exposing the other parts of the application to receive huge request bodies.
location ~ /api/records/.+/draft/files(/.+(/(content|commit))?)? {
gzip off;
uwsgi_pass api_files_server;
include uwsgi_params;
uwsgi_buffering off;
uwsgi_request_buffering off;
# <new stuff>
# Allow file uploads to take 24h - this should only be available to trusted users
# After this timeout, parallel file uploads start failing
uwsgi_read_timeout 24h;
uwsgi_send_timeout 24h;
client_body_timeout 24h;
# The docs for the `uwsgi_request_buffering` directive say that if HTTP/1.1 chunked transfer encoding
# is enabled, the request will always be buffered - we really don't want that!
proxy_http_version 1.0;
chunked_transfer_encoding off;
# </new stuff>
# ... the rest is the headers & client_max_body_size 0 ...
}
```

**Rationale**

The timeouts helped with some of the parallel uploads on a single deposit page. I looked into the documentation for nginx's uwsgi module, and found something interesting:
After setting the … Perhaps there was buffering going on, and that caused hiccups between nginx and uWSGI?

**Current status**

At the time of writing, I have 7 deposit forms for our test instance open, uploading 5 files of ~400 MB each in parallel, with "Wi-Fi" throttling in Firefox.

**First failures after ~63 mins**

For each deposit form (i.e. tab) there was one failure now (on the third PUT request, after two succeeded). The uWSGI logs state that some workers were "cheaped" (i.e. killed), and some had to be killed more brutally:
I'm not entirely sure why, as those workers were still busy. Nginx logged the failed uploads as a 400 reply: …

**Some more failures ~24 mins further in**

The 4th PUT request per deposit page failed now (at 75% done), about 55 mins after they were started, according to the "timings" tab in the browser's dev tools.

**Further info**

In a previous run of the same experiment an hour ago, the (first) uploads for each deposit form got stuck at ~40%.

**Footnotes**

[1] As a side note, this seems to contradict a statement in the nginx config from the InvenioRDM cookiecutter.
**Yet another update**

After disabling threading in uWSGI (since it has been reported by others to cause issues, see the last comment), all uploads worked.

```ini
[uwsgi]
# reasonable defaults
strict = true
master = true
vacuum = true
enable-threads = false
single-interpreter = true
die-on-term = true
need-app = true
auto-procname = true
# misc
wsgi-disable-file-wrapper = true
buffer-size = 8192
# worker management
# start with 4 worker processes, keep at least 6, grow in steps of 4 up to 64
# check every 3s if an increase is needed
# (note: this feature works better with threading disabled!)
processes = 64
threads = 1
cheaper = 6
cheaper-initial = 4
cheaper-algo = spare
cheaper-overload = 3
cheaper-step = 4
# specific for API (Files)
socket = 0.0.0.0:5000
stats = 0.0.0.0:9002
module = invenio_app.wsgi_rest:application
mount = /api=invenio_app.wsgi_rest:application
manage-script-name = true
procname-prefix = "api-files "
```

Edit: 10×5 files. Success. ✔️
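As a side note: since this config exposes a stats server on port 9002, the worker scaling/cheaping described above can be watched directly. The uWSGI stats server dumps one JSON document on connect and then closes the connection; below is a minimal sketch of reading it (host and port are assumed from the config above; `uwsgitop` offers the same view interactively).

```python
# Minimal sketch: poll the uWSGI stats server (stats = 0.0.0.0:9002 above).
# The server writes a single JSON document on connect, then closes the socket.
import json
import socket


def uwsgi_stats(host: str = "127.0.0.1", port: int = 9002) -> dict:
    chunks = []
    with socket.create_connection((host, port)) as sock:
        while data := sock.recv(4096):
            chunks.append(data)
    return json.loads(b"".join(chunks))


if __name__ == "__main__":
    workers = uwsgi_stats().get("workers", [])
    busy = sum(1 for w in workers if w.get("status") == "busy")
    print(f"{len(workers)} workers total, {busy} busy")
```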
**Nginx**

I think the contradicting nginx setting is from this SO answer, which mentions the …

**uWSGI**

Oof, that's bad news for uWSGI... Switching from threads to processes also means more resources (mainly memory-wise) are needed to serve the same traffic (though with the small benefit of not randomly failing requests 😅). Initializing the app and then forking can somewhat help with the memory usage because of CoW magic, though it's also a famous cause of mysterious bugs when e.g. DB connection pools are forked (a sketch of the usual workaround follows the config below). But anyway, that's a resources/scaling issue for each instance. If switching to …

For reference, here's our uWSGI setup on Zenodo production, where each node can serve …

```ini
[uwsgi]
socket = 0.0.0.0:5000
module = invenio_app.wsgi:application
stats = /tmp/stats.socket
master = true
vacuum = true
enable-threads = true
processes = 8
threads = 5
thunder-lock = true # https://marc.info/?l=uwsgi&m=140473636200986&w=2
# https://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html
# disable-logging = true
log-4xx = true
log-5xx = true
log-ioerror = true
log-x-forwarded-for = true
# Fork, then initialize application. This is to avoid issues with shared
# DB connections pools.
lazy = true
lazy-apps = true
single-interpreter = true
need-app = true
# Silence write errors for misbehaving clients
# https://github.com/getsentry/raven-python/issues/732
ignore-sigpipe = true
ignore-write-errors = true
disable-write-exception = true
# post-buffering = true
buffer-size = 65535
socket-timeout = 120
socket-write-timeout = 120
so-write-timeout = 120
so-send-timeout = 120
socket-send-timeout = 120
# NOTE: Disabled since we now use PgBouncer for connection pooling
# Automatically respawn processes after serving
# max-requests = 3000
# max-requests-delta = 30
# fix up signal handling
die-on-term = true
```
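Regarding the forked-connection-pool bugs mentioned above: if one wanted to keep preforking (i.e. drop `lazy-apps`) for the CoW memory savings, a common workaround is to reset pooled DB connections right after the fork. A hypothetical sketch using uWSGI's `postfork` hook with an SQLAlchemy engine (the database URL is a placeholder):

```python
# Hypothetical sketch: keep preforking (no lazy-apps) for CoW memory savings,
# but dispose of the inherited SQLAlchemy connection pool right after the
# fork, so workers don't share the master process's sockets.
from sqlalchemy import create_engine
from uwsgidecorators import postfork  # only available when running under uWSGI

engine = create_engine("postgresql://localhost/invenio")  # placeholder URL


@postfork
def reset_db_pool():
    # Drop all connections inherited from the master process; each worker
    # lazily opens fresh ones on first use.
    engine.dispose()
```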
To be fair, I haven't tried …
Trying to upload several large files at the same time to Zenodo will usually start the upload of three files in parallel.
It seems like two out of those three uploads make progress pretty slowly, and after ~60s the upload fails.
Since the upload form tries to keep three uploads running in parallel, the next two files start uploading.
These likely time out after little progress again (unless the first upload finished in the meantime, in which case another file starts uploading successfully).