
Large number of targeted Installations makes the servers fail (499 errors) #123

Open · SebC99 opened this issue Mar 14, 2019 · 23 comments

SebC99 commented Mar 14, 2019

Hello,
We want to send a push notification to all of our users in an area (more than 100k installations), but doing so overwhelms the servers with far too many connections (nginx's 1024 worker_connections are not enough), and all standard requests to the servers end up with 499 errors.

Our Parse servers run on Elastic Beanstalk, and we pass a simple query, new Parse.Query("Parse.Installation").exists("deviceToken"), to the Parse.Push.send method.
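
Roughly, the call looks like this (payload values are placeholders):

// Minimal sketch of the push call (Node.js, Parse JS SDK);
// payload values are placeholders.
const query = new Parse.Query(Parse.Installation);
query.exists("deviceToken");

await Parse.Push.send(
  { where: query, data: { alert: { title: "XXXX", body: "XXXX" } } },
  { useMasterKey: true }
);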

SebC99 (Author) commented Apr 30, 2019

Anyone here?

@acinader (Contributor)

Hi @SebC99

Not an issue I have run into personally.

Given the large number, can you use a queue and send them in smaller batches?
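
Something along these lines, for example (an untested sketch; the batch size and helper name are illustrative):

// Untested sketch: page through matching installations by objectId
// and send one push per chunk; batch size and helper name are illustrative.
async function pushInBatches(data, batchSize = 1000) {
  let lastId = null;
  for (;;) {
    const page = new Parse.Query(Parse.Installation);
    page.exists("deviceToken");
    page.ascending("objectId");
    page.limit(batchSize);
    if (lastId) page.greaterThan("objectId", lastId);

    const installations = await page.find({ useMasterKey: true });
    if (installations.length === 0) break;

    const where = new Parse.Query(Parse.Installation);
    where.containedIn("objectId", installations.map((i) => i.id));
    await Parse.Push.send({ where, data }, { useMasterKey: true });

    lastId = installations[installations.length - 1].id;
  }
}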

SebC99 (Author) commented Apr 30, 2019

Why not, but I honestly don't know how to use this kind of queue ;)
And I've tried the batchSize parameter in the query or push methods, but with no better results. What batch size would you recommend anyway? Even 10,000 pushes take a long time (more than an hour).

dplewis (Member) commented Apr 30, 2019

Have you tried PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1? I recently hit the max open files (TCP connections) limit on a completely separate issue.

Do you know where your connections are coming from / going to?
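
For reference, that flag maps to the directAccess option when constructing the server programmatically; a sketch, with the other options elided:

// Sketch: programmatic equivalent of
// PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1.
const { ParseServer } = require("parse-server");

const api = new ParseServer({
  appId: "APP_ID",
  masterKey: "MASTER_KEY",
  databaseURI: "mongodb://localhost:27017/dev",
  directAccess: true, // JS SDK calls skip the internal HTTP round trip
});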

@acinader (Contributor)

I don't know what the batch size should be. For large pushes, ideally, you could parallelize.

@flovilmart looked into this in the past, and I wrote https://github.com/parse-community/parse-server-sqs-mq-adapter, but I never used it.

SebC99 (Author) commented Apr 30, 2019

@dplewis what do you mean? I guess the push adapter just opens too many connections, so there is no room left for any other requests. The batch feature of the adapter for identical payloads doesn't seem to work... But the push and queue code is very hard to understand ;)

davimacedo (Member) commented Apr 30, 2019

I understand that the problem is not about sending the pushes. It seems that the pushes are successfully sent, right @SebC99? Can you see in the push status whether they were all sent?

What I have sometimes seen is this: the pushes are successfully sent, but as the clients receive them, they hit the Parse API back and make the server crash. Since you are seeing the worker_connections error in nginx, that might be the problem.

I see two possible solutions:

  • As @acinader suggested, send the pushes in batches. You don't necessarily need a queue; an approach as simple as first pushing to everyone on iOS and then to everyone on Android, or splitting by installation date, would work (see the sketch below).
  • Scale your servers horizontally so they can handle more requests from the clients at the peak.
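
A minimal sketch of the per-platform split (data is whatever payload you already send):

// Sketch: one push per platform instead of a single 100k+ push.
// `data` is the existing payload.
for (const deviceType of ["ios", "android"]) {
  const query = new Parse.Query(Parse.Installation);
  query.exists("deviceToken");
  query.equalTo("deviceType", deviceType);
  await Parse.Push.send({ where: query, data }, { useMasterKey: true });
}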

SebC99 (Author) commented Apr 30, 2019

@davimacedo Not at all!! Only a very small number are sent, like 5000.

@davimacedo (Member)

What status do you see in your push status? Sending forever? How are you running your Parse Server process? Is it a Docker container? A service? Have you noticed this process crashing when sending the pushes?

@davimacedo (Member)

Have you tried batchSize < 5000?

SebC99 (Author) commented Apr 30, 2019

Here's what is in the DB for the last try:

{ 
    "_id" : "YsXd23ED", 
    "pushTime" : "2019-03-09T12:24:30.138Z", 
    "query" : "{\"deviceToken\":{\"$exists\":true}}", 
    "payload" : "{
        \"alert-fr\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"alert\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"category\":\"update\",
        \"channel\":\"remote_notifications\",
        \"campaign\":\"marketing\"
    \"}", 
    "source" : "rest", 
    "status" : "running", 
    "numSent" : NumberInt(1496), 
    "pushHash" : "c4bf3a4c2e953169ead4d9c034576006", 
    "_wperm" : [

    ], 
    "_rperm" : [

    ], 
    "_acl" : {

    }, 
    "_created_at" : ISODate("2019-03-09T12:24:30.140+0000"), 
    "_updated_at" : ISODate("2019-03-09T12:28:00.645+0000"), 
    "count" : NumberInt(3595), 
    "failedPerType" : {
        "android" : NumberInt(327), 
        "ios" : NumberInt(40)
    }, 
    "numFailed" : NumberInt(367), 
    "sentPerType" : {
        "android" : NumberInt(537), 
        "ios" : NumberInt(959)
    }
}

dplewis (Member) commented Apr 30, 2019

@SebC99 This is what I was talking about: parse-community/parse-server#4173

With direct access there isn't any overhead; otherwise every internal request goes through the HTTP interface, which opens another connection. I think that's where your issue is coming from.

SebC99 (Author) commented Apr 30, 2019

Thanks, I hadn't noticed that one; I'll give it a try (direct access has failed me before for cloud functions, so I hadn't tried it for push yet).

dplewis (Member) commented Apr 30, 2019

Ignore that last comment; it looks like that has been updated. I don't know much about the push and queue code. I can try to run it locally and see what's causing the issue. I think it is similar to what @davimacedo mentioned: something might be hitting the Parse API.

SebC99 (Author) commented Apr 30, 2019

I'll try to investigate too.
If I remember correctly, a lot of beforeFind or beforeSave calls were showing up in the logs, and I think they concerned the _User class, but I'm not sure.

SebC99 (Author) commented May 1, 2019

After some tests, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS does seem to decrease the load on the server.
But:

  • with content-available: 1 and no payload, the push is sent quite fast (it is still marked as running hours later, but 300,000 pushes were sent within 5 minutes)
  • with a payload, it is very, very slow and seems to stop after 3,000 sent pushes (in 5 minutes)

BTW, I understand the numSent and numFailed values, but what is the count value?

SebC99 (Author) commented May 1, 2019

And with VERBOSE logging, I can clearly see:
MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 connect listeners added. Use emitter.setMaxListeners() to increase limit

I get the exact same warning with batchSize set to 10 as with batchSize set to 5000.
Even for a push with only 50 device tokens!
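
One way to trace where the listeners pile up (a diagnostic sketch, not a fix):

// Diagnostic sketch: surface the emitter/event behind the warning.
const { EventEmitter } = require("events");
EventEmitter.defaultMaxListeners = 50; // default is 10; raising it only hides the symptom

process.on("warning", (warning) => {
  if (warning.name === "MaxListenersExceededWarning") {
    console.warn(warning.stack); // the stack shows where listeners were added
  }
});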

SebC99 (Author) commented May 1, 2019

I also noticed this weird error from node-apn:
node-apn/node-apn#653 (comment)

SebC99 (Author) commented May 1, 2019

If it helps, I keep testing things:

  • it works much better with Android devices than with iOS devices
  • removing the database maxTimeMS on the Parse Server helps (see the config sketch below)
  • removing invalid device tokens takes a very long time; when maxTimeMS is set to 5000 ms it can fail, and it seems the promise chain stops there and the push just hangs (which explains the forever "pending" status)
  • there's clearly an issue with iOS pushes, where a far too long promise chain overwhelms Node.js
  • without a data payload it's much faster in any case
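
For context, maxTimeMS here is the Mongo query time limit in Parse Server's databaseOptions; a minimal sketch of the knob (values illustrative):

// Sketch: relaxing the Mongo query time limit in the Parse Server config
// (values illustrative).
const api = new ParseServer({
  appId: "APP_ID",
  masterKey: "MASTER_KEY",
  databaseURI: "mongodb://localhost:27017/dev",
  databaseOptions: {
    maxTimeMS: 30000, // was 5000; short limits abort the batch/cleanup queries
  },
});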

dplewis (Member) commented May 1, 2019

If you look here, the promises are serialized.

Maybe do something similar to parse-community/parse-server#5420 to prevent a bottleneck.

Enqueue by PushStatusId or pushStatus.objectId.
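
Roughly the difference in pattern, as a sketch (not the actual parse-server code):

// Sketch of the pattern only, not the actual parse-server code:
// serialized chaining vs. bounded parallelism over push batches.
async function sendAll(batches, sendBatch, concurrency = 5) {
  // Serialized (current behavior): each batch waits for the previous one.
  // await batches.reduce((p, b) => p.then(() => sendBatch(b)), Promise.resolve());

  // Bounded parallelism: `concurrency` batches in flight at a time.
  for (let i = 0; i < batches.length; i += concurrency) {
    await Promise.all(batches.slice(i, i + concurrency).map(sendBatch));
  }
}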

@davimacedo (Member)

@SebC99 You said it's much better without a payload, and that's interesting. I am wondering if the problem is the payload as a whole. Can you please try without sending "alert-fr":{"title":"XXXX","body":"XXXX"} in the payload? I am wondering if the problem is related to the locale feature.

BTW, count is the total number of pushes that should be sent, numSent is how many succeeded, and numFailed is how many failed. Ideally count should equal numSent + numFailed. In your case, the status stays "running" forever. That usually happens when some of your batches fail to send because of a server crash (and are never sent again): numSent + numFailed then stays below count, and the status never changes. The 3 most common causes I've seen:

  1. The one I mentioned before: something hitting the server back, crashing the server process and therefore stopping the batches from being sent
  2. The query submitted to MongoDB for each batch times out: Parse Server uses skip/limit to build the batches, and that sometimes doesn't perform well
  3. While building the batches, the process running Parse Server hits its RAM limit and crashes

Would you be able to check whether any of these is happening?
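
One way to check is to compare the _PushStatus counters directly (a sketch; the master key is required):

// Sketch: inspect recent _PushStatus rows (master key required).
const query = new Parse.Query("_PushStatus");
query.descending("createdAt");
query.limit(5);
const statuses = await query.find({ useMasterKey: true });
for (const s of statuses) {
  const sent = s.get("numSent") || 0;
  const failed = s.get("numFailed") || 0;
  const count = s.get("count") || 0;
  // If sent + failed < count and status stays "running", a batch was lost.
  console.log(s.id, s.get("status"), sent + "+" + failed + "/" + count);
}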

SebC99 (Author) commented May 1, 2019

@davimacedo I tried with a simple "alert" payload (not localized) and I get exactly the same thing.

  1. It's not this one, as I ran my test on a standalone server without any other incoming requests.
  2. The timeout is clearly an issue, since removing maxTimeMS improves the result.
  3. I reach the MaxListenersExceededWarning limit but not a RAM limit, and in every case there is no crash on the server side, just infinite hangs.

Again, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS removes the saturation of the server, but the speed/hang issues are still there.
I think the serialized promises plus the long request timeouts (to delete deviceTokens?) can explain a lot, but there's still the payload impact, which I can't explain...

dplewis (Member) commented May 1, 2019

@SebC99 Thank you for providing detailed feedback. We now have a general idea of where the issue may be coming from, and some suggestions.

Would you like to take a look at the serialized promises I pointed out in #123 (comment) and submit a fix?
