
Large number of targeted Installations makes the servers fail (499 errors) #123

Open · SebC99 opened this issue Mar 14, 2019 · 23 comments

SebC99 commented Mar 14, 2019

Hello,
We want to send a push notification to all of our users in an area (more than 100k installations), but doing so overwhelms the servers with far too many connections (nginx's 1024 worker_connections are not enough), and all standard requests to the servers end up with 499 errors.

Our Parse servers run on Elastic Beanstalk, and we pass a simple query, new Parse.Query("Parse.Installation").exists("deviceToken"), to the Parse.Push.send method.
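
Roughly, the call looks like this (payload values are placeholders):

// Minimal sketch of the push call (Node.js, Parse JS SDK);
// payload values are placeholders.
const query = new Parse.Query(Parse.Installation);
query.exists("deviceToken");

await Parse.Push.send(
  { where: query, data: { alert: { title: "XXXX", body: "XXXX" } } },
  { useMasterKey: true }
);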

SebC99 (Author) commented Apr 30, 2019

Anyone here?

@acinader (Contributor)

Hi @SebC99

Not an issue I have run into personally.

Given the large number, can you use a queue and send them in smaller batches?
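
Something along these lines, for example (an untested sketch; the batch size and helper name are illustrative):

// Untested sketch: page through matching installations by objectId
// and send one push per chunk; batch size and helper name are illustrative.
async function pushInBatches(data, batchSize = 1000) {
  let lastId = null;
  for (;;) {
    const page = new Parse.Query(Parse.Installation);
    page.exists("deviceToken");
    page.ascending("objectId");
    page.limit(batchSize);
    if (lastId) page.greaterThan("objectId", lastId);

    const installations = await page.find({ useMasterKey: true });
    if (installations.length === 0) break;

    const where = new Parse.Query(Parse.Installation);
    where.containedIn("objectId", installations.map((i) => i.id));
    await Parse.Push.send({ where, data }, { useMasterKey: true });

    lastId = installations[installations.length - 1].id;
  }
}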

SebC99 (Author) commented Apr 30, 2019

Why not, but I honestly don't know how to use this kind of queue ;)
And I've tried the batchSize parameter in the query or push methods, but with no better results. What batch size would you recommend anyway? Even 10,000 pushes take a long time (more than an hour).

dplewis (Member) commented Apr 30, 2019

Have you tried PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1? I recently hit the max open files (TCP connections) limit on a completely separate issue.

Do you know where your connections are coming from / going to?
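
For reference, that flag maps to the directAccess option when constructing the server programmatically; a sketch, with the other options elided:

// Sketch: programmatic equivalent of
// PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1.
const { ParseServer } = require("parse-server");

const api = new ParseServer({
  appId: "APP_ID",
  masterKey: "MASTER_KEY",
  databaseURI: "mongodb://localhost:27017/dev",
  directAccess: true, // JS SDK calls skip the internal HTTP round trip
});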

@acinader (Contributor)

I don't know what the batch size should be. For large pushes, ideally, you could parallelize.

@flovilmart looked into this in the past, and I wrote https://github.com/parse-community/parse-server-sqs-mq-adapter, but I never used it.

SebC99 (Author) commented Apr 30, 2019

@dplewis what do you mean? I guess the push adapter just opens too many connections, so there is no room left for any other requests. The batch feature of the adapter for identical payloads doesn't seem to work... But the push and queue code is very hard to understand ;)

davimacedo (Member) commented Apr 30, 2019

I understand that the problem is not about sending the pushes. It seems that the pushes are successfully sent, right @SebC99? Can you see in the push status whether they were all sent?

What I have sometimes seen is this: the pushes are successfully sent, but as the clients receive them, they hit the Parse API back and make the server crash. Since you are seeing the worker_connections error in nginx, that might be the problem.

I see two possible solutions:

  • As @acinader suggested, send the pushes in batches. You don't necessarily need a queue; an approach as simple as first pushing to everyone on iOS and then to everyone on Android, or splitting by installation date, would work (see the sketch below).
  • Scale your servers horizontally so they can handle more requests from the clients at the peak.
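
A minimal sketch of the per-platform split (data is whatever payload you already send):

// Sketch: one push per platform instead of a single 100k+ push.
// `data` is the existing payload.
for (const deviceType of ["ios", "android"]) {
  const query = new Parse.Query(Parse.Installation);
  query.exists("deviceToken");
  query.equalTo("deviceType", deviceType);
  await Parse.Push.send({ where: query, data }, { useMasterKey: true });
}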

SebC99 (Author) commented Apr 30, 2019

@davimacedo Not at all!! Only a very small number are sent, like 5000.

@davimacedo (Member)

What status do you see in your push status? Sending forever? How are you running your Parse Server process? Is it a Docker container? A service? Have you noticed this process crashing when sending the pushes?

@davimacedo (Member)

Have you tried batchSize < 5000?

SebC99 (Author) commented Apr 30, 2019

Here's what is in the DB for the last try:

{ 
    "_id" : "YsXd23ED", 
    "pushTime" : "2019-03-09T12:24:30.138Z", 
    "query" : "{\"deviceToken\":{\"$exists\":true}}", 
    "payload" : "{
        \"alert-fr\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"alert\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"category\":\"update\",
        \"channel\":\"remote_notifications\",
        \"campaign\":\"marketing\"
    \"}", 
    "source" : "rest", 
    "status" : "running", 
    "numSent" : NumberInt(1496), 
    "pushHash" : "c4bf3a4c2e953169ead4d9c034576006", 
    "_wperm" : [

    ], 
    "_rperm" : [

    ], 
    "_acl" : {

    }, 
    "_created_at" : ISODate("2019-03-09T12:24:30.140+0000"), 
    "_updated_at" : ISODate("2019-03-09T12:28:00.645+0000"), 
    "count" : NumberInt(3595), 
    "failedPerType" : {
        "android" : NumberInt(327), 
        "ios" : NumberInt(40)
    }, 
    "numFailed" : NumberInt(367), 
    "sentPerType" : {
        "android" : NumberInt(537), 
        "ios" : NumberInt(959)
    }
}

dplewis (Member) commented Apr 30, 2019

@SebC99 This is what I was talking about: parse-community/parse-server#4173

With direct access there isn't any overhead; otherwise every internal request goes through the HTTP interface, which opens another connection. I think that's where your issue is coming from.

SebC99 (Author) commented Apr 30, 2019

Thanks, I hadn't noticed that one; I'll give it a try (direct access has failed me before for cloud functions, so I hadn't tried it for push yet).

dplewis (Member) commented Apr 30, 2019

Ignore that last comment; it looks like that has been updated. I don't know much about the push and queue code. I can try to run it locally and see what's causing the issue. I think it is similar to what @davimacedo mentioned: something might be hitting the Parse API.

SebC99 (Author) commented Apr 30, 2019

I'll try to investigate too.
If I remember correctly, a lot of beforeFind or beforeSave calls were showing up in the logs, and I think they concerned the _User class, but I'm not sure.

SebC99 (Author) commented May 1, 2019

After some tests, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS does seem to decrease the load on the server.
But:

  • with content-available: 1 and no payload, the push is sent quite fast (it is still marked as running hours later, but 300,000 pushes were sent within 5 minutes)
  • with a payload, it is very, very slow and seems to stop after 3,000 sent pushes (in 5 minutes)

BTW, I understand the numSent and numFailed values, but what is the count value?

SebC99 (Author) commented May 1, 2019

And with VERBOSE logging, I can clearly see:
MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 connect listeners added. Use emitter.setMaxListeners() to increase limit

I get the exact same warning with batchSize set to 10 as with batchSize set to 5000.
Even for a push with only 50 device tokens!
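
One way to trace where the listeners pile up (a diagnostic sketch, not a fix):

// Diagnostic sketch: surface the emitter/event behind the warning.
const { EventEmitter } = require("events");
EventEmitter.defaultMaxListeners = 50; // default is 10; raising it only hides the symptom

process.on("warning", (warning) => {
  if (warning.name === "MaxListenersExceededWarning") {
    console.warn(warning.stack); // the stack shows where listeners were added
  }
});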

SebC99 (Author) commented May 1, 2019

I also noticed this weird error from node-apn:
node-apn/node-apn#653 (comment)

SebC99 (Author) commented May 1, 2019

If it helps, I keep testing things:

  • it works much better with Android devices than with iOS devices
  • removing the database maxTimeMS on the Parse Server helps (see the config sketch below)
  • removing invalid device tokens takes a very long time; when maxTimeMS is set to 5000 ms it can fail, and it seems the promise chain stops there and the push just hangs (which explains the forever "pending" status)
  • there's clearly an issue with iOS pushes, where a far too long promise chain overwhelms Node.js
  • without a data payload it's much faster in any case
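
For context, maxTimeMS here is the Mongo query time limit in Parse Server's databaseOptions; a minimal sketch of the knob (values illustrative):

// Sketch: relaxing the Mongo query time limit in the Parse Server config
// (values illustrative).
const api = new ParseServer({
  appId: "APP_ID",
  masterKey: "MASTER_KEY",
  databaseURI: "mongodb://localhost:27017/dev",
  databaseOptions: {
    maxTimeMS: 30000, // was 5000; short limits abort the batch/cleanup queries
  },
});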

dplewis (Member) commented May 1, 2019

If you look here, the promises are serialized.

Maybe do something similar to parse-community/parse-server#5420 to prevent a bottleneck.

Enqueue by PushStatusId or pushStatus.objectId.
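
Roughly the difference in pattern, as a sketch (not the actual parse-server code):

// Sketch of the pattern only, not the actual parse-server code:
// serialized chaining vs. bounded parallelism over push batches.
async function sendAll(batches, sendBatch, concurrency = 5) {
  // Serialized (current behavior): each batch waits for the previous one.
  // await batches.reduce((p, b) => p.then(() => sendBatch(b)), Promise.resolve());

  // Bounded parallelism: `concurrency` batches in flight at a time.
  for (let i = 0; i < batches.length; i += concurrency) {
    await Promise.all(batches.slice(i, i + concurrency).map(sendBatch));
  }
}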

@davimacedo (Member)

@SebC99 You said it's much better without a payload, and that's interesting. I am wondering if the problem is the payload as a whole. Can you please try without sending "alert-fr":{"title":"XXXX","body":"XXXX"} in the payload? I am wondering if the problem is related to the locale feature.

BTW, count is the total number of pushes that should be sent, numSent is how many succeeded, and numFailed is how many failed. Ideally count should equal numSent + numFailed. In your case, the status stays "running" forever. That usually happens when some of your batches fail to send because of a server crash (and are never sent again): numSent + numFailed then stays below count, and the status never changes. The 3 most common causes I've seen:

  1. The one I mentioned before: something hitting the server back, crashing the server process and therefore stopping the batches from being sent
  2. The query submitted to MongoDB for each batch times out: Parse Server uses skip/limit to build the batches, and that sometimes doesn't perform well
  3. While building the batches, the process running Parse Server hits its RAM limit and crashes

Would you be able to check whether any of these is happening?
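
One way to check is to compare the _PushStatus counters directly (a sketch; the master key is required):

// Sketch: inspect recent _PushStatus rows (master key required).
const query = new Parse.Query("_PushStatus");
query.descending("createdAt");
query.limit(5);
const statuses = await query.find({ useMasterKey: true });
for (const s of statuses) {
  const sent = s.get("numSent") || 0;
  const failed = s.get("numFailed") || 0;
  const count = s.get("count") || 0;
  // If sent + failed < count and status stays "running", a batch was lost.
  console.log(s.id, s.get("status"), sent + "+" + failed + "/" + count);
}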

SebC99 (Author) commented May 1, 2019

@davimacedo I tried with a simple "alert" payload (not localized) and I get exactly the same thing.

  1. It's not this one, as I ran my test on a standalone server without any other incoming requests.
  2. The timeout is clearly an issue, since removing maxTimeMS improves the result.
  3. I reach the MaxListenersExceededWarning limit but not a RAM limit, and in every case there is no crash on the server side, just infinite hangs.

Again, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS removes the saturation of the server, but the speed/hang issues are still there.
I think the serialized promises plus the long request timeouts (to delete deviceTokens?) can explain a lot, but there's still the payload impact, which I can't explain...

dplewis (Member) commented May 1, 2019

@SebC99 Thank you for providing detailed feedback. We now have a general idea of where the issue may be coming from, and some suggestions.

Would you like to take a look at the serialized promises I pointed out in #123 (comment) and submit a fix?
