IMPORTANT: Upcoming change that could cause some webhooks to fail because of timeouts.
Hi everyone,
Some users have recently reported unexplained delays in the triggering of their webhook events, or even lost events.
While digging into these issues, we found that the maximum time allowed for each trigger (currently 6 seconds) was not being enforced, so webhook triggers could take much longer than expected to complete. Under load, this can slow down webhook deliveries for other sites.
Next Monday we will release a fix that restores the time-limit check. This could cause some existing webhooks to fail on some events (if the webhook endpoint takes more than 6 seconds to respond to Shotgun).
We are also rolling out a fix for lost deliveries. So, starting today, no delivery should ever be lost.
Please do let us know if you see missing deliveries after this date.
(Note: We built the system so no delivery would ever be lost. The tradeoff for this is that it is possible, in extreme cases, that the same delivery is sent twice to a webhook.)
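Because the system favors at-least-once delivery, an endpoint whose side effects must not run twice (posting a notification, for instance) can deduplicate incoming deliveries itself. A minimal sketch, assuming each delivery payload carries a unique identifier (the field name `delivery_id` here is illustrative; check what your payloads actually contain):

```python
# Idempotent webhook handler sketch: ignore a delivery we have already seen.
# "delivery_id" is an assumed field name, not confirmed by this thread.
seen_deliveries = set()  # use a persistent store (e.g. Redis) in production


def handle_delivery(payload: dict) -> str:
    delivery_id = payload["delivery_id"]
    if delivery_id in seen_deliveries:
        return "duplicate"  # already processed: acknowledge and do nothing
    seen_deliveries.add(delivery_id)
    # ... do the real work here (e.g. post the Slack message) ...
    return "processed"
```

With this in place, a repeated delivery is acknowledged but has no second side effect.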
PS: We are still investigating an issue that can also cause events to be delayed. We are working on getting this fixed ASAP and will provide an update here when the fix is ready.
Hi everyone, just a quick update here to let you know what’s going on…
First, we delayed the plan to enforce the maximum of 6 seconds allowed to process a webhook trigger. This is because we’re folding in a fix for deliveries that were tagged as “failed” even though they never reached the webhook endpoint (because of network hiccups), along with a fix for random delays that we noticed.
We’ve been testing this and things look pretty good, so the plan is now to roll out the new stack next Monday (July 6th).
Note that, although this should not happen very often, as part of this change you may see a few deliveries repeated. We had to choose between deliveries possibly never reaching the webhook endpoint and repeated deliveries, and we chose the latter.
Hi everyone!
This was rolled out a few minutes ago… The 6-second limit is now enforced on webhook triggers, and there should not be any missing deliveries anymore.
Hi everyone!
New updates on delays and lost deliveries…
About a week ago we fixed an issue that could cause events to get stuck in the pipeline. It first occurred a few weeks ago and caused some events to stall for more than a day (ouch!), because the issue flew under our radar.
So, we did the following:
We fixed the underlying issue.
We added a “canary” system that tells us right away if events are stuck in the pipeline.
We’ve been monitoring the change.
From what we can tell, events do not get stuck in the pipeline anymore. Please do let us know if you encounter any issue like this in the future.
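The canary idea described above can be sketched as a small watchdog: emit a timestamped heartbeat event into the pipeline, record when its delivery comes back, and alert on any canary still pending past a threshold. The names below (`emit_canary`, `record_delivery`) are ours for illustration, not part of the product:

```python
# Canary watchdog sketch: flag heartbeat events whose delivery has not been
# observed within a threshold. All names here are illustrative assumptions.
CANARY_THRESHOLD_S = 60.0  # alert if a canary takes longer than this

pending_canaries: dict = {}  # canary id -> emission timestamp


def emit_canary(canary_id: str, now: float) -> None:
    pending_canaries[canary_id] = now


def record_delivery(canary_id: str) -> None:
    pending_canaries.pop(canary_id, None)  # canary made it through


def stuck_canaries(now: float) -> list:
    # Any canary still pending past the threshold means events are stuck.
    return [cid for cid, t in pending_canaries.items()
            if now - t > CANARY_THRESHOLD_S]
```

Run `stuck_canaries` on a schedule and page someone whenever it returns a non-empty list.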
(Specifically, @Stephen I know you hit this one. Things should be back to normal now…)
Our webhook environment seems to hit this error quite often. In most cases a webhook event triggers twice (occasionally it doesn’t, and our other webhook functions don’t output the errors).
So a real problem results: the exact same Slack notification message is sent twice by a single webhook event. Can this be fixed?
Hi @kishikawa_takanori,
This should not happen very often. Definitely not on a regular basis.
Can you give me the name of your site and webhook, and also an example of a delivery that occurred twice?
We have integrated webhooks to our Django tool stack, and we are currently doing some stress tests.
We are testing this on a test site, with very few events being created other than our test ones. Even under these conditions, I sometimes see the emission of deliveries delayed by about 15 minutes from the moment they were created. This of course defeats the purpose of switching from loops to webhooks, which we would do only for latency reasons. Is there an explanation for these huge time differences between when an event log entry is created and when it’s sent to the webhook endpoint?
Is it going to get worse if we have many webhooks?
thanks
Also, would this delay be reduced by using deliveries in batch format, or doesn’t it make any difference?
Also, the entities/fields that you configure for a webhook, do they influence the speed of the delivery, i.e. if you put more entity types/fields, does it impact the delivery rate?
I’m consistently seeing lapses of 10–15 minutes between actions made and events sent, and that’s on a test site with very few events; this is for clients that can generate more than 100k events per day.
Hi @kevinsallee,
Thanks for reporting this. We have been seeing delays in the pipeline that should not be there.
We are keeping webhooks in beta because we want to make sure the system is stable under all conditions before releasing it officially. We just hit one of these conditions last week. The good news is that we have a fix coming for it!
For reference, our goal is to be able to deliver events within a minute. We actually typically deliver events faster than this (within seconds) and it can happen that it takes a few minutes for an event to be fired but that should really be an exception.
I will update this thread when I get the latest news about the coming performance fix.
To answer your other questions:
It definitely is a good idea to only register webhooks on the entity/fields you need to track (e.g. not registering on all fields and then filtering in the webhook endpoint itself).
Using batch is more efficient because all waiting events are sent at once, instead of the pipeline having to wait for an ack from the webhook before sending the next event. Be aware, though, that timeouts are more aggressive when using batch deliveries. Batch deliveries are really meant for cases where processing all events is super fast.
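In batch mode one HTTP request carries many events, so the handler should do only trivial per-event work and return fast. A sketch assuming the batch payload looks like `{"data": [event, ...]}` (verify against the actual batch payloads your endpoint receives):

```python
# Batch-delivery handler sketch: queue each event and acknowledge quickly.
# The {"data": [...]} payload shape is an assumption, not confirmed here.
work_queue: list = []  # stand-in for a real task queue


def enqueue(event: dict) -> None:
    work_queue.append(event)


def handle_batch(payload: dict) -> int:
    events = payload.get("data", [])
    for event in events:
        enqueue(event)   # keep per-event work trivial: just queue it
    return len(events)   # then respond 200 right away
```

Keeping the loop this cheap is what makes the tighter batch timeouts safe.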
We’ve made progress. A fix was deployed early in the week and we have been monitoring. Things are better and you should see far fewer sporadic delays, but we are still digging into one remaining issue that causes some delays to occur.
Thanks Stephane, I will check if the webhook is responding faster to hundreds of events (e.g. flipping Version statuses for 200-1k versions at the same time) and will let you know how that goes.
oh, just be careful… if you generate large bursts of events, it is possible that you get throttled and this will cause delays.
See here for details…
Our backend is set up to acknowledge immediately, and then process the event in the background.
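That ack-fast pattern can be sketched with a thread and a queue; this is a minimal illustration under our own assumptions, not their actual Django implementation:

```python
import queue
import threading

# Ack-fast sketch: the HTTP handler only enqueues the event and returns, so
# the sender sees a response well under the 6-second limit; a worker thread
# does the slow processing off the request path.
events: queue.Queue = queue.Queue()
processed: list = []  # stand-in for the results of real background work


def webhook_view(payload: dict) -> int:
    events.put(payload)  # cheap and fast
    return 200           # acknowledge immediately


def process(payload: dict) -> None:
    processed.append(payload)  # the real work would happen here


def worker() -> None:
    while True:
        payload = events.get()
        if payload is None:  # sentinel: shut down
            break
        process(payload)
        events.task_done()
```

A real deployment would use a proper task queue (Celery, RQ, etc.) instead of an in-process thread, but the shape is the same.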
When I change 200 task statuses back and forth between two statuses (rev and apr, for example), the first burst works fine, but subsequent changes still take up to 15 minutes.
Our response time is pretty fast. I don’t have an average, but looking at the treated events I see:
Process time (milliseconds) 617
It’s always under 1.5 seconds.
We have statistics for the background process, and I see it takes on average 0.36s so this is not even a bottleneck.
And it’s worth noting that large bursts of events are frequent for our clients. If you use a tool that creates many entities at the same time (e.g. CSV Import, or some of our tools that can create thousands of versions at once), you end up with thousands of events that need to be sent to the webhook’s endpoint.
Hey Stephane, the site is https://gpltechdemos.shotgunstudio.com
I guess we will have to implement the endpoint with batch requests then, that might be the culprit, thanks.