Skip to content
This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

SOCKETS_Recv returning an error does not close the MQTT connection #2356

Open
lightblu opened this issue Aug 12, 2020 · 8 comments
Open

SOCKETS_Recv returning an error does not close the MQTT connection #2356

lightblu opened this issue Aug 12, 2020 · 8 comments

Comments

@lightblu
Copy link

Describe the bug

SOCKETS_Recv returning an error does not close the MQTT connection

The NetRecv task just stops polling SOCKETS_Recv until the keep alive tasks kicks in, tries to send a pingreq, and then the send fails. Only then the connection is considered "closed" and the disconnection is reported up the layers.

This is really a blocker if mqttconfigKEEP_ALIVE_* are configured for pretty high values, while it is still annoying with lower values.

That NetRecv starts just stops when encountering a recv error I think underlines that this is wrong, the connection is already (rightly) considered broken from NetRecv pov, it is just not acted further on.

Note: This issue was also observed and described here #2155 - though the reporter saw sending QoS0 messages (which there didn't even return an error in this situation) as a problem, and that was then fixed.
(Right, if you send something and notice a publish error, you can close and reopen the connection from the application layer; but if you have nothing to send you won't notice.)

Note2: This behaviour was also different (and imo correct) in the older version https://github.com/aws/amazon-freertos/blob/master/CHANGELOG.md#v148-05212019 I am coming from. There, when SOCKETS_Recv returned an error, that mqtt connection was immediatly considered remotely closed, SOCKETS_Shutdown and SOCKETS_Close was called, and the disconnect callback to the upper layer was called.

System information

  • Project Custom Application with MQTT and own SOCKETS_ implementation

Expected behavior

When the NetRecv task sees an error from SOCKETS_Recv, it should consider the connection broken and act immediatly (triggering SOCKETS_Shutdown, SOCKETS_Close and calling the disconnected callback).

To reproduce
Steps to reproduce the behavior:

  1. Force a remote disconnect somehow (e.g. by connecting with the ""same""" thing to AWS IoT a second time)

  2. Have a SOCKETS implementation that would then return an error in SOCKETS_Recv on the connection closure (I would think any SOCKETS_ implementation should do that, MQTT Publish message with QoS0 always return success even if there is some error at network level #2155 hints at that this also happens with the mbedTLS implementation).

  3. Have higher values configured for mqttconfigKEEP_ALIVE_INTERVAL_SECONDS and mqttconfigKEEP_ALIVE_ACTUAL_INTERVAL_TICKS to see the closure reported up the layers only when keepalive hits.

@yanjos-dev
Copy link
Contributor

Hello @lightblu, thank you for raising this issue with us and providing detailed information. I agree that this is not a desirable behavior and have started looking into this.

@yanjos-dev
Copy link
Contributor

@lightblu I found that we have addressed a similar issue in the aws/aws-iot-device-sdk-embedded-C repository in this PR. The PR enables the user to define an optional callback function that is called when the underlying socket gets disconnected.

The application writer can take appropriate action in the callback - possibly re-establishing the MQTT connection. Would this solve your problem?

@lightblu
Copy link
Author

Need to check this more closely but looks like it would need switching/porting to IotNetworkInterface_t. Currently not yet fully understood how new IotNetworkInterface_t is to be used vs the SOCKETS_ (I see IotNetworkAfr in platform/iot_network_freertos.h is the instantiation of IotNetworkInterface_t with the SOCKETS_ interface from abstractions/secure_sockets/include/iot_secure_sockets.h behind) and where this callback fits in? Is it from network interface to iot/mqtt lib then?

Still, also the recv in IotNetworkInterface_t is documented with

     * @return The number of bytes successfully received. This should be
     * `bytesRequested` when successful. Any other value may indicate an error.

and I would think the Recv thread on seeing an error indication from that function should initiate closing the socket? (and not need this callback? But will double check).

@aggarg
Copy link
Member

aggarg commented Aug 25, 2020

I see that Net_Recv task does call SOCKETS_Close: https://github.com/aws/amazon-freertos/blob/master/libraries/abstractions/platform/freertos/iot_network_freertos.c#L124

Is this what you are talking about?

Thanks.

@lightblu
Copy link
Author

Indeed, you are right. But that _networkReceiveTask -> _destroyConnection -> SOCKETS_Close after getting an error from SOCKETS_Recv only happens if destroyConnection had been set ( https://github.com/aws/amazon-freertos/blob/master/libraries/abstractions/platform/freertos/iot_network_freertos.c#L202 ), and that depends on the receiveCallback having destroyed the connection ( https://github.com/aws/amazon-freertos/blob/master/libraries/abstractions/platform/freertos/iot_network_freertos.c#L184 and following lines).

I do not overlook from the distance right now where this receiveCallback goes (likely into IotNetworkAfr/IotMqtt ?), but I would assume then this is not happening there? Cannot exclude error or misobservation on my side; if you are convinced that this destroyConnection should happen I will try to find the time soon to reprovoke and look at this situation.

@muneebahmed10
Copy link
Contributor

muneebahmed10 commented Sep 5, 2020

Hello @lightblu,

Is this still an issue for you? Do you know if MQTT's receive callback IotMqtt_ReceiveCallback is managing to call _IotMqtt_CloseNetworkConnection? Is the issue that the network interface's close function is called but destroy is not?

Also, I noticed that in this, MQTT calls its disconnect callback. Have you tried defining your own disconnect callback before making the call to IotMqtt_Connect, where you call the network interface's destroy - or call IotMqtt_Disconnect with the IOT_MQTT_FLAG_CLEANUP_ONLY flag - in cases when the disconnect reason is not IOT_MQTT_DISCONNECT_CALLED?

Thanks

@lightblu
Copy link
Author

lightblu commented Sep 11, 2020

Hi,

I just bumped our code to using the new IotMqtt_xxx v2 API. I am still seeing that behaviour that an underlying SOCKETS_Recv returning an error stops the NetRecv thread, but the disconnect is only happening when the next pingreq/resp is attempted and also the SOCKETS_Send fails (as before).

I did not get to digging that far to figure out what you ask - maybe this weekend, sorry.

(Overlooked the second question, yes, with v2 API we now also have the disconnect callback defined (that was another callback in v1 before). It gets called as described above only after the pingreq/resp fails - so with a OT_MQTT_KEEP_ALIVE_TIMEOUT).

@sarenameas
Copy link
Contributor

sarenameas commented Sep 18, 2020

Hello @lightblu ,

I have been reading through the discussion and what you are concerned about. You do not like that _networkReceiveTask() simply stops and exits without closing the connection after a SOCKETS_Recv() returns an error.

So _networkReceiveTask() will keep looping on SOCKETS_Recv until these conditions:

  • the application calls IotMqtt_Disconnect() -> calls SOCKET_Shutdown
  • the IotMqtt_ReceiveCallback() fails -> calls SOCKET_Shutdown
  • the keep alive fails in _IotMqtt_ProcessKeepAlive() -> calls SOCKET_Shutdown
  • There is an error on SOCKETS_Recv() -> Does nothing

If an error is returned from SOCKETS_Recv(), there is no chance for the IotMqtt_ReceiveCallback() to be invoked and the socket to be shutdown until we try to send something on the socket.

When the socket is shutdown, we then need to destroy the socket and clear all resources taken by the connection by calling SOCKETS_Close with IotNetworkAfr_Destroy(). IotNetworkAfr_Destroy() can be called only given that all references to the connection are not being used, so the it is the application's responsibility to destroy the connection.

The thing is also that the loop in _networkReceiveTask() is temporary work around for the lack of a socket poll and select feature. The transport layer is not supposed to shutdown the socket, it is the application layer's responsibility given return values from the socket. And we can't possibly know the server has disconnected us or some other network error until we try to read on the socket. The issue with this workaround is that it performs the applications responsibility of reading from the socket.

Let's say we did call IotNetworkAfr_Close() when SOCKETS_Revc returns an error. We cannot immediately call _destroyConnection() because there could be outstanding MQTT operations that are referencing the connection.

So now it comes down to letting the application know the connection was closed in the transport layer. This mechanism doesn't currently exist in the iot_network.h API. But as @yanjos-dev pointed out, in the latest CSDK v4_beta_deprecated branch this has: aws/aws-iot-device-sdk-embedded-C#634. If we put this connection close infrastructure into the FreeRTOS iot_network.h port, then we can add code to let MQTT know that the connection was closed. MQTT can then try and clear all of it's operations that have been queued in the taskpool (stored in _mqttConnection_t.pendingProcessing and _mqttConnection_t) and try to destroy the connection. There may also be other implications with QoS1 that I am not addressing when thinking of this change.

This would be a hefty change. It is indeed annoying to not know the connection is closed until the keep alive lets you know, but given that the overall MQTT API will return the expected results, we will consider it in our future planning. Thank you so much for bringing this issue to the light.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants