Delivery Delays and Lost Deliveries

IMPORTANT: Upcoming change that could cause some webhooks to fail because of timeouts.

Hi everyone,

Some users have recently reported unexplained delays in the triggering of their webhook events, or even losing some events entirely.

While digging into these issues we found that the maximum amount of time allowed for each trigger (currently 6 seconds) was not being enforced. This allowed webhook triggers to take much longer than expected to complete. Under load, this can slow down webhook deliveries on other sites.

Next Monday we will release a fix to restore the time-limit check. This could cause some existing webhooks to fail on some events (if the webhook endpoint takes more than 6 seconds to respond to Shotgun).

We are also rolling out a fix for lost deliveries. So, starting today, no delivery should ever be lost.
Please do let us know if you see missing deliveries after this date.
(Note: We built the system so no delivery would ever be lost. The tradeoff for this is that it is possible, in extreme cases, that the same delivery is sent twice to a webhook.)
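For endpoint authors, the duplicate-delivery tradeoff is easy to absorb with an idempotency check. A minimal sketch (the `id` key and the in-memory set are illustrative; a real endpoint would key on whatever unique delivery identifier the payload carries and use a persistent store such as Redis or a database):

```python
# Remember delivery IDs we have already handled so a repeated
# delivery of the same event is detected and skipped.
seen_deliveries = set()

def handle_delivery(payload):
    """Process a webhook delivery at most once per delivery ID."""
    delivery_id = payload["id"]  # assumed unique per delivery
    if delivery_id in seen_deliveries:
        return "duplicate-skipped"
    seen_deliveries.add(delivery_id)
    # ... actual event handling goes here ...
    return "processed"
```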



PS: We are still investigating an issue which can also cause events to be delayed. We are working on getting this fixed asap and will provide an update here when the fix is ready.


Hi everyone, just a quick update here to let you know what’s going on…
First, we delayed the plan to enforce the maximum of 6 seconds allowed to process a webhook trigger. This is because we’re folding in a fix for deliveries tagged as “failed” even though they never reached the webhook endpoint (because of network hiccups), and a fix for random delays that we noticed.
We’ve been testing this and things look pretty good, so the plan is now to roll out the new stack next Monday (July 6th).

Note that, although this should not occur very often, as part of this change you may see a few deliveries get repeated. We had to choose between deliveries possibly never reaching the webhook endpoint and deliveries occasionally being repeated, so we chose the latter.




Hi everyone!
This was rolled out a few minutes ago… The 6-second limit is now enforced on webhook triggers and there should not be any missing deliveries anymore.

Please do let us know if you see any issues!

Thank you!


Hi everyone!
New updates on delays and lost deliveries…
About a week ago we fixed an issue that could cause events to get stuck in the pipeline. It first occurred a few weeks ago and caused some events to stall for more than a day (ouch!). This happened because the issue flew under our radar.
So, we did the following:

  1. Fixed the underlying issue
  2. We added a “canary” system that tells us right away if events are stuck in the pipeline.
  3. We’ve been monitoring the change
    From what we can tell, events do not get stuck in the pipeline anymore. Please do let us know if you encounter any issue like this in the future.
    (Specifically, @Stephen I know you hit this one. Things should be back to normal now…)
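A canary of the kind described above can be sketched as a heartbeat check: inject a known event periodically and alert when it takes too long to come back. A minimal illustration (the threshold and function name are made up, not the actual monitoring system):

```python
import time

def pipeline_is_stuck(last_delivered_ts, now=None, max_lag_s=300):
    """Return True if the latest canary event took longer than
    max_lag_s seconds to come back (threshold is illustrative)."""
    if now is None:
        now = time.time()
    return (now - last_delivered_ts) > max_lag_s
```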


It seems our webhook environment outputs this error quite often. The webhook event triggers twice in most cases (rarely it does not happen, and our other webhook functions do not output the errors).
So a real problem occurs: the exact same Slack notification message is sent twice for a single webhook event. Is it possible to fix this?

Hi @kishikawa_takanori,
This should not happen very often. Definitely not on a regular basis.
Can you give me the name of your site and webhook, and also an example of a delivery that occurred twice?

Hi @daigles,

I sent a direct message. Please check the info.

Hi there,

We have integrated webhooks to our Django tool stack, and we are currently doing some stress tests.

We are testing this on a test site, with very few events being created except our test ones. Even in these conditions, I sometimes see deliveries delayed by about 15 minutes from the moment they were created. This of course defeats the purpose of switching from loops to webhooks, which we would do only for latency reasons. Is there an explanation for these huge time differences between when an event log entry is created and when it’s sent to the webhook endpoint?
Is it going to get worse if we have many webhooks?

Also, would this number be reduced by using deliveries in batch format or doesn’t it make any difference?
Also, the entities/fields that you configure for a webhook, do they influence the speed of the delivery, i.e. if you put more entity types/fields, does it impact the delivery rate?
I’m consistently seeing lapses of 10–15 minutes between actions made and events sent, and this is on a test site with very few events; this is for clients that can have more than 100k events per day.

Hi @kevinsallee,
Thanks for reporting this. We have been seeing delays in the pipeline that should not be there.
We are keeping webhooks in beta because we want to make sure the system is stable under all conditions before releasing it officially. We just hit one of these conditions last week. The good news is that we have a fix coming for it!

For reference, our goal is to be able to deliver events within a minute. We actually typically deliver events faster than this (within seconds) and it can happen that it takes a few minutes for an event to be fired but that should really be an exception.

I will update this thread when I get the latest news about the coming performance fix.

To answer your other questions:
It definitely is a good idea to only register webhooks on the entity/fields you need to track (e.g. not registering on all fields and then filtering in the webhook endpoint itself).
Using batch is more efficient because all waiting events get sent at once instead of the pipeline having to wait for an ack from the webhook to send the next event. Be aware though that timeouts are more aggressive when using batch deliveries. Batch deliveries are really meant for cases where processing all events is super fast.
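For illustration, a batch endpoint receives all pending events in one request, so the handler loops over a list instead of processing one event per call. A rough sketch (the `events` key and `event_type` field are assumptions for the example, not the documented payload shape):

```python
def handle_batch(payload):
    """Handle a batch delivery: every pending event arrives in one
    request, so keep per-event work trivial and return quickly."""
    handled = []
    for event in payload["events"]:  # field name is an assumption
        # Only cheap, fast work belongs here; defer anything slow,
        # since batch timeouts are more aggressive.
        handled.append(event["event_type"])
    return handled
```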

I hope this helps!

We’ve made progress. A fix was deployed early in the week and we have been monitoring it. Things are better and you should see far fewer sporadic delays, but we are still digging into one remaining issue that causes some delays to still occur.

Thanks Stephane, I will check if the webhook is responding faster to hundreds of events (e.g. flipping Version statuses for 200-1k versions at the same time) and will let you know how that goes.

oh, just be careful… if you generate large bursts of events, it is possible that you get throttled and this will cause delays.
See here for details…

Our backend is set up to acknowledge immediately and then treat the event in the background.
When I change 200 task statuses back and forth between two statuses (rev and apr, for example), the first burst works fine, but subsequent changes still take up to 15 minutes.
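The acknowledge-immediately pattern described above can be sketched with a thread and a queue (a minimal illustration, not the actual Django implementation):

```python
import queue
import threading

work_queue = queue.Queue()

def process_event(event):
    # Slow, real processing would go here (Slack posts, DB writes...).
    pass

def worker():
    # Background worker: drains events so the HTTP handler never
    # waits on slow work before acknowledging.
    while True:
        event = work_queue.get()
        process_event(event)
        work_queue.task_done()

def webhook_view(payload):
    """Acknowledge within the 6-second limit; defer the real work."""
    work_queue.put(payload)
    return 200  # immediate ack; actual handling happens in worker()

threading.Thread(target=worker, daemon=True).start()
```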

Our response time is pretty fast. I don’t have an average, but looking at the treated events, I see:

Process time (milliseconds) 617

It’s always under 1.5 seconds.

We have statistics for the background process, and I see it takes on average 0.36s so this is not even a bottleneck.

And it’s worth noting that large bursts of events are frequent for our clients. If you use some tool to create many entities at the same time (e.g. CSV Import, or some of our tools that can create thousands of versions at once), then you would have thousands of events that need to be sent to the webhook’s endpoint.

It definitely seems you have a setup that would be a good candidate for batch webhooks!

The way things work, it is not just about the time it takes to handle each event. It is also about the time it takes to process all the events generated at once. The doc says each site has 60 seconds of event processing time per minute (wall-clock minutes) across all webhooks. So, if you generate X events at the same time, they need to be processed within 60 seconds (cumulative), otherwise event processing will get throttled for this site. In your case, this could explain why the first batch goes through quickly (before throttling kicks in) but the next ones take more time.
If you give me the name of your site I might be able to get more info.
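The budget arithmetic above can be made concrete with a small sketch (the per-event processing cost is an assumption for the example):

```python
# Illustrative arithmetic for the 60-seconds-per-minute processing
# budget described above. Integer milliseconds avoid float surprises.
BUDGET_MS_PER_MINUTE = 60_000  # 60 s of processing allowed per minute

def events_before_throttle(budget_ms=BUDGET_MS_PER_MINUTE, cost_ms=600):
    """How many events fit in one minute's budget before the site
    gets throttled (cost_ms is an assumed per-event pipeline cost)."""
    return budget_ms // cost_ms

# 200 events at ~600 ms each need 120 s of budget: roughly the first
# 100 go through right away, the rest are throttled into later minutes.
```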

Hey Stephane, the site is
I guess we will have to implement the endpoint with batch requests then, that might be the culprit, thanks.

Hello Stéphane, it sounds like the current webhook implementation is not friendly for simple setups (think AWS Lambda or “if this then that” lightweight workflows). If we have to implement some kind of caching on our backend to collect and dispatch batches of events, it kind of defeats the appeal of using webhooks in the first place.
We were expecting something as simple as Jira webhooks where the only thing we have to worry about is to ensure that our endpoint is alive and reachable. If our processing is slow, then it is understood that it will take time to process all pending events, but that’s it.
It sounds like the Shotgrid implementation puts a lot of complexity on clients that should be handled server-side? I mean, even with a simple endpoint that would just acknowledge events, it seems we could still experience delays?

Hi Stéphane!
Jira does have a limit of 10 seconds (Shotgrid has a limit of 6 seconds).
Oh, but I guess you are talking about event bursts.
I did not try this with Jira. Can you confirm no throttling happens after flooding Jira with a burst of events?

They do also have this post that talks about delays that can take 30 minutes (New Jira Cloud Webhook Retry Policy - Jira Cloud Announcements - The Atlassian Developer Community).

To be clear: On the Shotgrid side, the events will not be lost but can be delayed when bursts occur.

Hi Stéphane!
Well my point would be that I never had to worry about it (and read these documents) :wink:
To be fair, I think Jira generates a very small number of events compared to SG. Maybe this is where things could be improved? Have « create » events, and a single update event when multiple values are changed, with a list of the fields that were updated? And include the current field values in the payload, so you don’t have to fetch the entity if your logic needs field values other than the ones that were updated?
Something else that could be useful is the ability to filter events before they are sent. This could reduce the number of events fired, by simply telling the server not to fire them when your logic is not interested in them, instead of ignoring them in the endpoint.