I am not sure if we just have too many events registered or something else, but it is taking 40+ minutes from when an event happens to when our event daemon finally acts on it.
I was curious if anyone else had run into this, or had any best practices for daemon performance.
Just to add my thoughts as well. It sounds as though your event daemon is running behind, and has a backlog of events to catch up on.
If you have many event plugins, or your events are triggered frequently and your plugins respond to large volumes of events, you need to make sure they are written as efficiently as possible.
You should limit the number of calls you make to Shotgun and use batch calls where possible.
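For example, if a plugin needs to touch several entities, one batch call is usually much cheaper than a loop of individual updates. A rough sketch using the Shotgun Python API (the site URL, script credentials, entity type and field here are just placeholders):

```python
import shotgun_api3

# Placeholder credentials -- substitute your own site and script details.
sg = shotgun_api3.Shotgun(
    "https://yourstudio.shotgunstudio.com",
    script_name="event_plugin",
    api_key="xxxx",
)

# One round trip instead of one sg.update() call per Shot.
requests = [
    {
        "request_type": "update",
        "entity_type": "Shot",
        "entity_id": shot_id,
        "data": {"sg_status_list": "ip"},
    }
    for shot_id in (1001, 1002, 1003)
]
results = sg.batch(requests)
```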
When not debugging, try to ensure they produce as little logging output as possible.
You can also enable timing logging on your plugins to see which ones are taking the most time:
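In recent versions of shotgunEvents this is switched on from shotgunEventDaemon.conf; the exact section and syntax may differ in older copies of the config (which is why it may be missing entirely), but it looks roughly like this:

```ini
# In shotgunEventDaemon.conf -- check the sample conf shipped with your
# version for the exact section this setting belongs in.
timing_log = on
```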
If you have any plugins that need to do heavy lifting, consider threading them out or sending them as jobs to a farm to process.
Shotgun API calls can be configured as to whether they generate events, by setting this on the Script entity you’re using to authenticate with Shotgun. If your plugins are making changes to the Shotgun site and those changes are needlessly triggering further plugins, consider using a script that doesn’t generate events.
This may not apply in your situation, but when I was using Shotgun in production we had a tool that ingested a load of data into Shotgun, creating upwards of 300 assets in one go. We had an event daemon plugin that we wanted to respond to some of the changes, but whenever we ingested a large amount of data this would create a backlog. So instead we made it so the tool didn’t generate events automatically, and at the end of ingestion it manually created a custom event in Shotgun and provided a dictionary of the entities it created via the metadata field. Then our plugin would respond to that single custom event, read the assets out of the metadata, and act accordingly.
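For reference, the “single summary event” part looked roughly like this (a sketch from memory rather than our production code; the event_type name and metadata layout are made up for illustration, and you should check the EventLogEntry schema on your own site and confirm that creating these records via the API works for you):

```python
# Assumptions for this sketch: `sg` is an authenticated shotgun_api3.Shotgun
# handle for a script user whose "Generate Events" setting is off, and
# `ingested_assets` / `project_id` come from the ingest tool itself.
created_assets = [
    {"type": "Asset", "id": asset["id"], "code": asset["code"]}
    for asset in ingested_assets
]

sg.create("EventLogEntry", {
    "event_type": "MyStudio_Ingest_BatchComplete",   # hypothetical custom type
    "description": "Ingest created %d assets" % len(created_assets),
    "project": {"type": "Project", "id": project_id},
    "meta": {"created_assets": created_assets},       # read back by the plugin
})
```

Our daemon plugin then registered for that one event type and iterated over meta["created_assets"] instead of reacting to hundreds of individual creation events.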
You could argue: why didn’t the ingest tool handle the logic itself rather than deferring it to the event daemon? The reason was that we had other tools where similar operations could occur, so we opted to have these tools create custom events representing multiple changes and have them processed in one place.
We have about 23 callbacks registered. The Shotgun API performance seems pretty quick.
Looking into Philip’s point, the evidence does seem to point to the daemon lagging behind.
Thank you very much for the helpful tips! Will try to implement and post back if we find any interesting notes.
I do not seem to have such a timing_log option in my current configuration; is it something that can be added manually, or do I need to update my configuration?
I would like to have some input about two parameters I noticed in shotgunEventDaemon.conf: fetch_interval = 5 and max_event_batch_size = 500
If I understand it correctly, the _mainLoop of shotgunEventDaemon:
- fetches events with _getNewEvents, which takes the last processed event as a starting point and fetches all events with IDs above this reference,
- processes these events with the plugins,
- checks the number of processed events and sleeps for fetch_interval seconds if this number is under max_event_batch_size (see the sketch below).
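In pseudo-Python, I picture that loop behaving roughly like this (my own simplification for the sake of the question, not the daemon’s actual code):

```python
import time

FETCH_INTERVAL = 5          # seconds, from shotgunEventDaemon.conf
MAX_EVENT_BATCH_SIZE = 500  # from shotgunEventDaemon.conf

def main_loop(fetch_new_events, process):
    """Simplified fetch / process / sleep cycle as described above."""
    last_id = None
    while True:
        # Fetch up to MAX_EVENT_BATCH_SIZE events with IDs above last_id.
        events = fetch_new_events(after_id=last_id, limit=MAX_EVENT_BATCH_SIZE)
        for event in events:
            process(event)            # run every registered plugin callback
            last_id = event["id"]
        if len(events) < MAX_EVENT_BATCH_SIZE:
            # Fewer events than a full batch: assume we are caught up and wait.
            time.sleep(FETCH_INTERVAL)
```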
So now I’m wondering how changing max_event_batch_size to 300 (for example) would impact the event handling. It would sleep less, so from my point of view there would be less risk of lag; but I guess there are reasons, not visible from the shotgunEventDaemon code, for choosing these two values of 500 events and 5 s. Would it remain stable? Would the batch operations still be optimal?
The max_event_batch_size is a default value for the number of events to fetch; it is a compromise between the time it takes for the app to produce results and the time it takes for them to be sent over the wire. You don’t want to fetch so many events that your request blocks one of the app’s threads for too long, but you don’t want to fetch so few that you spend your time waiting on your network link’s latency for tiny batches of data.
This being said, the key parameter will be the fetch_interval. The last time the daemon loaded data, if it got fewer than max_event_batch_size events, it’ll wait the amount of time prescribed by fetch_interval, so increasing this value will process things slightly faster and will increase the probability your next fetch gets fewer than max_event_batch_size.
There’s a third value you can tweak, but it isn’t exposed in the config as it is much more delicate.
There is a phenomenon by which there can be a gap in the event sequence, and sometimes that gap can be filled in later. Gaps aren’t super frequent but are regular; gaps that get filled in are rarer. Even though this latter phenomenon is rare, we need to ensure that events are processed linearly, otherwise we can introduce bad data. To do this, each missing event is given a time window in which it can reappear. We’ll process all events up to a gap, wait there, and if and when the gap fills up we continue until the next gap. This builds up what we call a backlog. The amount of time we wait is defined by a constant you can find in the source code called BACKLOG_TIMEOUT. If you decrease it, gaps won’t slow down your daemon as much, but the probability you miss a gap that fills up will increase. Ultimately, tweaking this value depends on your level of risk aversion.
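To make that mechanism concrete, here is a very rough sketch of the idea (illustrative only, not the daemon’s real code, and the timeout value shown is just an example):

```python
import time

BACKLOG_TIMEOUT = 5 * 60  # seconds to wait for a missing ID; example value only

backlog = {}  # missing event ID -> time we first noticed the gap

def note_missing(missing_ids):
    """Remember IDs that should exist (a gap in the sequence) with a deadline."""
    now = time.time()
    for event_id in missing_ids:
        backlog.setdefault(event_id, now)

def expired(event_id):
    """True once we've waited long enough on this ID and give up on it."""
    first_seen = backlog.get(event_id)
    return first_seen is not None and time.time() - first_seen > BACKLOG_TIMEOUT

# The daemon processes events up to the first unexpired missing ID, waits
# there, and only moves past the gap once it is filled or its timeout expires.
```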
Hi François, apologies for the delay and no offence taken to your pinging this thread.
You’re right. I did make a mistake. My original intention was to write:
The last time the daemon loaded data, if it got fewer than max_event_batch_size events, it’ll wait the amount of time prescribed by fetch_interval so decreasing this value will process things slightly faster and will increase the probability your next fetch gets fewer than max_event_batch_size.
Thanks for noticing that and calling it out (I won’t edit the original post for posterity as the correction is in your message and this one as well).
For anyone curious about how tweaking the fetch_interval value would affect things, here’s an extremely simplified example with some ridiculous numbers for illustrative purposes.
If your site generates 100 events per minute and the max_event_batch_size is 500 with a fetch_interval of 1 minute, here is how that would play out.
The site generates 100 events in one minute
The daemon asks for 500 events and gets 100
Because the daemon received fewer than 500 events, it will now wait 1 minute
During this minute, 100 events are generated
And so on… By the end of the 5th minute, you would have processed 500 events.
Let’s have another example with all values being equal but your fetch_interval is 8 minutes.
The site generates 100 events in one minute
The daemon asks for 500 events and gets 100
Because the daemon received fewer than 500 events, it will now wait 8 minutes
During these 8 minutes, 800 events are generated
The daemon asks for 500 events and gets 500.
After processing, the daemon immediately asks for another 500 events and gets 300.
At the end of the 9th minute, you would have processed 900 events.
In both scenarios, you’ll have processed, on average, the same 100 events per minute that your Shotgun site generates. In the second scenario, with the longer fetch interval, there is a bigger gap during which the daemon waits for events to be generated, and it has to catch up afterwards by fetching two consecutive batches.
In the first scenario, the daemon waits for shorter periods and never needs to catch up as much as in the second scenario because it can request events and process them almost as fast as they are generated.
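If it helps, here is a tiny simulation of those two scenarios (the same deliberately simplified model as above, not the real daemon):

```python
def simulate(minutes, rate=100, batch=500, fetch_interval=1):
    """Count events processed over `minutes`, with fetch_interval in minutes."""
    pending = processed = 0
    next_fetch = 1  # minute at which the daemon will fetch next
    for minute in range(1, minutes + 1):
        pending += rate                       # the site generates `rate` events/minute
        if minute < next_fetch:
            continue                          # still sleeping after a partial batch
        while pending >= batch:               # full batches are fetched back to back
            pending -= batch
            processed += batch
        processed += pending                  # final, partial batch
        pending = 0
        next_fetch = minute + fetch_interval  # partial batch => sleep fetch_interval
    return processed

print(simulate(5, fetch_interval=1))  # 500 events processed by the end of minute 5
print(simulate(9, fetch_interval=8))  # 900 events processed by the end of minute 9
```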
Digging this out 5 years later, I’m really tempted to mess with the backlog timeout.
Our daemons are nice and snappy, up until they sleep for 5min whenever there’s a backlog.
As said above, it’s rare that the event actually appears, so we have to endure the entire timeout.
When it does eventually appear, it’s usually within a few seconds of being added to the backlog.
I’m curious to hear more about the root cause of these events skipping, so that I can better judge whether I can reduce the 5min timeout, and by how much.
Hey @ELeroyCA, when it comes to the gaps and the backlog behaviour, I got the following information from support.
It is normal and it is not something that we can do anything about… In short, if two processes ask for a new ID at the same time and one of the processes is in a transaction that gets aborted, then we will “lose” those IDs… if you want more details => FAQ: Using Sequences in PostgreSQL
Many things can go wrong and cause a transaction to abort… on the user side, I believe the main thing to pay attention to is being careful about the parameters they are passing to the API.
The idea is the following:
There is no guarantee that the events will be added to the database sequentially.
Example:
When two processes are executed at almost the same time:
Let’s say process A starts a transaction, wants to write an EventLogEntry, and continues doing something else.
Now process B is doing something that creates another EventLogEntry and finishes before process A.
So if the event daemon asks for new events after process B finishes but before process A finishes, it will miss events. That’s why a “backlog” process has been put in place.
In reality, and in my experience, the gaps can be huge (tens of thousands of IDs) depending on the nature of what caused them. This line specifically in the event daemon is problematic:
Because it means that a large data set will be written to the shotgunEvents.id pickle data in this call:
Then, when reading this file and writing to it, each iteration of the standard processing loop transfers a significant amount of data and stresses the system I/O simply to maintain the timestamp for each theoretical, yet-to-be-confirmed-valid event ID, test it against the BACKLOG_TIMEOUT, and finally drop it.
When we find we have a slow event daemon, the shotgunEvents.id file is usually large due to this backlog data, indicating a fairly large gap and range of missing EventLogEntry IDs. We reset this data to remove the backlog IDs and timestamps, then restart the daemon so it picks up the next real EventLogEntry ID. Dropping theoretical events is, in some cases, better than having huge downtime in event processing.
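If you want to see how much backlog state your daemon is actually carrying, you can simply load the pickle and look at it (the structure varies between shotgunEvents versions, so treat this as a quick inspection sketch rather than a documented format):

```python
import pickle
import pprint

# Adjust the path to wherever your daemon writes its event ID file.
with open("shotgunEvents.id", "rb") as f:
    state = pickle.load(f)

# A very large structure here usually means a big backlog of missing IDs and
# timestamps being carried around on every iteration of the processing loop.
pprint.pprint(state)
```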
Thanks for your answer @Halil. I think I’ll reduce the timeout; I have yet to find a case where a real event takes more than a few seconds to appear. I thought the pickled file wouldn’t grow past the maximum number of entries it queries (500 by default, I think), since it doesn’t seem to query further once it has a backlog?
That file has been a pain for us, though: there’s a non-zero chance that it gets corrupted when restarting a daemon (I’d say about a 10% chance for us, going by how often it happens), and the only way to get the daemon going again is to delete the file. Maybe it getting big and taking a while to read could be the cause.
It feels like reducing the timeout would help by accumulating fewer entries in the backlog before the daemon moves on.