AWS, SQS trigger to Lambda is automatically disabled when Lambda fails


Solution 1

AWS support says the trigger can be disabled because of insufficient permissions on the Lambda execution role.

My question:

Where are the conditions under which the Lambda trigger can be automatically disabled documented? Or where can I find out why the trigger was disabled (some kind of Lambda service logs)?

AWS support answer:

Currently, there is no such public documentation which mentions the possible reasons for the Lambda trigger being disabled automatically. However, as I mentioned earlier, the most probable reason for the SQS Lambda trigger being disabled is that the Lambda function execution role does not have one or more of the following required permissions:

  • sqs:ChangeMessageVisibility
  • sqs:DeleteMessage
  • sqs:GetQueueAttributes
  • sqs:ReceiveMessage
  • Access to relevant KMS keys
  • Any applicable cross account permissions
  • Also, if the Lambda function is in a VPC, then the Lambda function should have all the permissions to list, create, and delete ENIs

Also, the reason for the trigger being disabled will not be mentioned in the Lambda function logs. So, I request you to please make sure that the Lambda function execution role has all the required permissions. If the Lambda function execution role has all the required permissions, the SQS trigger should not get disabled automatically.

In my case we actually missed the VPC permissions, i.e. we didn't attach the AWSLambdaVPCAccessExecutionRole policy to the Lambda execution role. (I have no idea how the Lambda worked without this policy.) Five days have passed since we fixed the role, and no trigger has been disabled. So, it works.
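For reference, a minimal sketch (Python/boto3) of granting those permissions; the role name, queue ARN, and policy name here are hypothetical placeholders, not the actual values from my setup:

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "my-lambda-execution-role"  # hypothetical role name
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:my-queue"  # hypothetical queue ARN

# The SQS permissions AWS support listed as required for the trigger.
sqs_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
                "sqs:ReceiveMessage",
            ],
            "Resource": QUEUE_ARN,
        }
    ],
}

iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="sqs-trigger-permissions",  # hypothetical policy name
    PolicyDocument=json.dumps(sqs_policy),
)

# The managed policy that grants the ENI permissions needed when the
# function runs inside a VPC (the one we had missed).
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
)
```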


As for DynamoDB and "backpressure", MLu's idea is correct.

If you have only one write to DynamoDB per SQS message, you can just fail in the Lambda if the write fails. The message stays in SQS and will be received by the Lambda again after the visibility timeout. It's better to use a batch size of 1 and process the messages one by one in this case, as sketched below.
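A minimal handler sketch of this approach (the table name is a hypothetical placeholder): simply let the exception propagate, and the failed invocation leaves the message in the queue.

```python
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

def handler(event, context):
    # With a batch size of 1 there is exactly one record per invocation.
    record = event["Records"][0]
    # parse_float=Decimal: the DynamoDB resource API rejects plain floats.
    item = json.loads(record["body"], parse_float=Decimal)
    # Deliberately no try/except around the write: a failed invocation
    # leaves the message in the queue, and SQS redelivers it after the
    # visibility timeout.
    table.put_item(Item=item)
```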

If you have multiple writes to DynamoDB per SQS message (write multiplication), it's better to catch ProvisionedThroughputExceededException in the Lambda and put the failed writes into another queue with a delay, so another Lambda can retry them later. Note that it's important to retry each individual failed write, not the original message.

The dataflow will be like this:

[Diagram: dataflow with a delay queue]
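A rough sketch of the first Lambda in that dataflow (all names are hypothetical, and the payload shape is assumed to be a JSON object with an "items" list):

```python
import json
from decimal import Decimal

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name
sqs = boto3.client("sqs")
# Hypothetical delay queue for failed writes.
RETRY_QUEUE_URL = sqs.get_queue_url(QueueName="retry-writes")["QueueUrl"]

def handler(event, context):
    for record in event["Records"]:
        # One incoming message fans out into several DynamoDB writes.
        for item in json.loads(record["body"], parse_float=Decimal)["items"]:
            try:
                table.put_item(Item=item)
            except ClientError as e:
                if e.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                    raise
                # Re-queue only the failed write, delayed so DynamoDB has
                # time to scale up; another Lambda drains this queue and
                # applies the writes without the multiplication.
                sqs.send_message(
                    QueueUrl=RETRY_QUEUE_URL,
                    MessageBody=json.dumps(item, default=str),
                    DelaySeconds=60,
                )
    # Returning normally deletes the original message from the source queue;
    # only the individual failed writes are retried.
```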

Note that delayed retrying of the writes is acceptable only if you really can delay and retry them. The writes should be idempotent and should not contain real-time data. Otherwise, it may be better to silently swallow any exceptions to avoid failing the Lambda, which removes the message from SQS and forgets it.

Solution 2

I'm not sure why the Lambda stops working. I suspect the Lambda service notices that it keeps failing and temporarily suspends the trigger, but I'm not sure.

You can try a number of workarounds:

  • Use DynamoDB on-demand capacity - AWS says it scales instantly.
  • Alternatively, if you use provisioned capacity and get a ProvisionedThroughputExceededException, don't abort the Lambda execution; instead, re-insert the message into the SQS queue and exit successfully. That way the Lambda service won't see any failures, and no SQS messages will get lost either. (See the sketch after this list.)

Something along these lines could help :)
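A rough sketch of the second workaround (table and queue names are hypothetical placeholders): on throttling, the message is re-queued with a delay and the invocation still exits successfully.

```python
import json
from decimal import Decimal

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name
sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="my-queue")["QueueUrl"]  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        try:
            table.put_item(Item=json.loads(record["body"], parse_float=Decimal))
        except ClientError as e:
            if e.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Throttled: put the message back with a delay instead of
            # failing, so the Lambda service never sees an error.
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=record["body"],
                DelaySeconds=60,
            )
    # Exiting successfully deletes the original messages from the queue;
    # the throttled ones live on as fresh, delayed copies.
```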

Comments

  • gelin over 1 year

    We have some Lambdas triggered by SQS queues. The Lambdas do intensive inserts into DynamoDB tables. The DynamoDB tables have autoscaling write capacity.

    On peak loads, a large number of messages arrive at the Lambdas and they start to fail with ProvisionedThroughputExceededException. DynamoDB needs minutes to scale up.

    We expect that when the Lambda fails, the messages return to SQS and are processed again after the visibility timeout. This looks correct, because by then DynamoDB has scaled up and should be able to handle the increased writes.

    However, we see a strange thing. When the number of execution errors for the Lambda grows, the SQS trigger is automatically disabled. The Lambda stops executing, and the messages accumulate in the queue.

    Manually enabling the trigger causes even more failures, because DynamoDB is still not scaled up but the number of messages to process from the queue has dramatically increased.

    Only manually increasing the write capacity of DynamoDB helps.

    Why does the SQS trigger get disabled? This behavior is not documented.

    How can we avoid the trigger being disabled?

    In general, what is the recommended way to apply "backpressure", i.e. to limit the rate at which a Lambda polls messages from SQS?

    • gelin almost 5 years
      Even more upsetting: the Lambda may fail not only because of errors in my code, but simply because it's redeployed or needs to increase concurrency.
  • gelin almost 5 years
    The second approach is actually what I did. Unfortunately, it doesn't work well if you have write multiplication in the Lambda. For example, in one of my cases each input message produces 12 writes to DynamoDB. If any of the writes fails, I need to repeat all 12, which increases the load on DynamoDB even more, etc. I've added another queue for delayed write operations and another Lambda to apply those operations without the multiplication.
  • gelin almost 5 years
    Unfortunately again, the Lambda may fail, and FAILS, because of issues in the Lambda service itself, not my code (I don't see any exceptions in the logs, but I do see failed executions of the Lambda). And the SQS trigger may be disabled in this case too. I've solved this (hopefully temporarily) by adding a Lambda which, every 3 minutes, checks all SQS event source mappings in the account and enables any that are disabled.
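    For illustration, such a watchdog could look roughly like this (a minimal sketch, assuming it is invoked on a schedule, e.g. by an EventBridge rule):

    ```python
    import boto3

    lambda_client = boto3.client("lambda")

    def handler(event, context):
        # Walk all event source mappings in the account.
        paginator = lambda_client.get_paginator("list_event_source_mappings")
        for page in paginator.paginate():
            for mapping in page["EventSourceMappings"]:
                # Only touch SQS-backed mappings that have been disabled.
                if ":sqs:" in mapping.get("EventSourceArn", "") and mapping["State"] == "Disabled":
                    lambda_client.update_event_source_mapping(
                        UUID=mapping["UUID"],
                        Enabled=True,
                    )
    ```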