Any recommendations welcome. This has been an ongoing issue for a very long time and we are eager to find a resolution.
We have one error in our event plugin logs that is driving us nuts:
Event 117314756 is too old. Last event processed was (117315845)
Can ANYONE provide a reason for what would trigger this behavior??
And now for the long-winded context:
We periodically run into delays with our event plugin system. Its becoming so common place, users are predicting nearly daily slow downs in the afternoon; events run great and immediately in the morning but may take several minutes to respond in the afternoon.
We’re constantly looking for ways to mitigate the delays: splitting our plugins up over multiple daemons, looking for long running scripts to optimize or isolate, distributing longer running processes to run asynchronously on our farm, responding quickly to plugins that trigger critical error emails etc. The above error is one we see often, but don’t understand what is triggering or how to prevent that behavior.
In an effort to reduce variables we have one Daemon that has only one event plugin script that it processes. This particular plugin is important to production and every production person probably runs it at least a couple times per day. Users notice and are usually good about reporting quickly when it isn’t responsive, but they don’t trigger dozens of events a time, so it isn’t one we expect to have bottlenecks based on high activity. When users report this plugin isn’t responding, its almost always the above error that we see filling the logs.
The daemon tracks what the last processed event was, so why would it go back and recheck old event ids as indicated in the error message? Whatever the reason, we get long stretches in our logs (sometimes a few minutes, sometimes 20 or 45 minutes) where all it is doing is rechecking old events. And we all just wait.
One explanation we have heard was that if multiple plugins are on a daemon, and one has an issue and is then fixed, that the daemon has to re-evaluate event IDs for all the other plugins that kept running to get the fixed one caught up to the latest event ID. While that explanation makes since in general, I’m not clear how that scenario could be triggered if there is only one plugin on the daemon.
I don’t expect this is the only daemon or plugin having these issues, but given its isolation, it seems like an easier one to test and monitor. I hope that whatever we learn/resolve here can be applied to other daemons.