Connection delay issues: solution (TCP keepalive)

Hello community,

We have been in contact with Autodesk support for the last couple of weeks regarding a strange issue that occurred at our studio. We found a solution and wanted to share it with the community, in case anybody else runs into the same problem :wink:

We were experiencing delays with anything related to the API (shotgun_api3), that is, every time the pipeline established a connection with our FPT site. So basically everywhere in the pipeline. For example, we use an automatic Versions upload script that runs from SG Desktop. When we ran it, it started instantly. But if we ran it again after 4 minutes, it took 20 seconds to start. This also happened in Nuke, Houdini, etc. For instance, opening the tk-workfiles2 app after 4 minutes of inactivity (that is, without doing anything that causes the API to connect) took 20 seconds.

It was always 20 seconds after 4 minutes of inactivity. So you can imagine how much impact this had on our day-to-day workflow.

So we started digging with the help of the Autodesk FPT devs and managed to narrow it all the way down to our internet connection itself. Our modem, to be precise. It turned out our modem had a preset tcp_idle_timeout of 4 minutes. Meaning, it would silently close idle connections, so our pipeline still thought the connection to our FPT site was open while the modem had already closed it, causing a delay of exactly 20 seconds every time.

We found out that enabling TCP keepalive in the core of FPT solved the issue.

This code essentially tells the operating system to send a keepalive packet on the connection every X seconds, so the modem (or any other device along the network path) keeps seeing traffic and keeps the connection open.
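For anyone curious what enabling keepalive looks like at the socket level, here is a minimal Python sketch. It is not the contents of the attached file; the helper name `enable_tcp_keepalive` and the timing values are assumptions chosen for illustration, with the idle time kept well below the modem's 4 minute timeout.

```python
import socket
import sys


def enable_tcp_keepalive(sock, idle_secs=60, interval_secs=30, probe_count=5):
    """Turn on TCP keepalive for a single socket (hypothetical helper).

    idle_secs:     idle time before the first keepalive probe is sent
    interval_secs: time between probes when there is no reply
    probe_count:   unanswered probes before the OS drops the connection
    """
    # Portable switch that enables keepalive on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if sys.platform.startswith("win"):
        # Windows takes (on/off, idle ms, interval ms) in a single ioctl call.
        sock.ioctl(
            socket.SIO_KEEPALIVE_VALS,
            (1, idle_secs * 1000, interval_secs * 1000),
        )
        return

    if hasattr(socket, "TCP_KEEPIDLE"):
        # Linux and most other Unix systems.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    elif sys.platform == "darwin":
        # macOS calls the idle option TCP_KEEPALIVE (0x10); older Python
        # versions do not expose the constant, hence the fallback value.
        option = getattr(socket, "TCP_KEEPALIVE", 0x10)
        sock.setsockopt(socket.IPPROTO_TCP, option, idle_secs)

    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)
```

The options only affect the socket they are set on, so other applications on the machine keep their default behaviour.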

I attached a version of the init.py file with tcp_keepalive enabled. It lives in the core folder here:

\install\core\python\tank_vendor\shotgun_api3\lib\httplib2\python3

init.py (71.9 KB)


Thanks for the info.
Curious what kind of error messages got you to this point?


There were no errors, which made it difficult to find. There were only timestamps in the log at the moments where Toolkit made a connection after an idle time of 5 mins, and the gap was always 20 secs. We narrowed things down systematically: removing our server from the equation, starting from a default config.

And eventually we made a test script using only shotgun_api3 locally that just connects, does a query, pauses for 5 mins and queries again. The delay didn’t happen if we set the pause to 4 mins. I then did the same tests at home, where I have a different provider and a different modem, and there it didn’t occur.
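A test along those lines could look roughly like the sketch below. This is not the original script: the function names, the site URL and the credentials are placeholders, and the `Project` query is just an arbitrary cheap call to force a round trip.

```python
import time


def time_query_after_idle(query, idle_secs, sleep=time.sleep):
    """Run query(), stay idle for idle_secs, run query() again.

    Returns (first_duration, second_duration) in seconds. A second
    duration far larger than the first points at a connection that
    was dropped while idle.
    """
    start = time.monotonic()
    query()
    first = time.monotonic() - start

    sleep(idle_secs)  # no API traffic during this pause

    start = time.monotonic()
    query()
    second = time.monotonic() - start
    return first, second


def run_against_site(site_url, script_name, api_key, idle_secs=300):
    """Hypothetical driver; needs shotgun_api3 and real script credentials."""
    import shotgun_api3  # imported here so the timing helper has no dependencies

    sg = shotgun_api3.Shotgun(site_url, script_name=script_name, api_key=api_key)
    first, second = time_query_after_idle(
        lambda: sg.find_one("Project", []), idle_secs
    )
    print("first query: %.1fs, after %ds idle: %.1fs" % (first, idle_secs, second))
```

Running it once with a pause above the suspected timeout and once below it, on more than one network, helps separate a local network problem from a problem with the site itself.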

So in the end the issue was a TCP idle timeout that our internet provider had configured in our modem. We just received a new modem (different brand) and that also resolved the issue. But we also found out that adding the tcp_keep_alive function to the API solved it too. It appears tcp_keep_alive is set on a per-application basis. It can also be set system-wide, but that might create issues for other apps.
