CCP Veritas has a post up describing how he and his colleagues tracked down some issues with module reactivation in EVE. During my life as a programmer, I’ve come to love stories about tracking down elusive bugs, and this is a really great example.
We’ll start at the first thing we noticed when digging at it: the system responsible for telling the server when modules should be turned off or repeated would get minutes behind in processing when fleet fights happen while other systems remain reasonably responsive. This system, named Dogma, handles module activation/repeat/deactivation, as well as the actual effects of those modules. [...]
Tasks on the EVE servers use a time-sharing technique called cooperative multitasking which, in short, means that a task has to willingly yield execution to other tasks, otherwise it will run forever. In this case, it would seem the part of Dogma handling module repeat and deactivation was being too nice – yielding execution too much.
Looking at the code some, a theory emerged as to why. There was an error case that stuck out as odd – if an effect was supposed to be stopped or repeated, but the effect system itself didn’t agree that it was time yet, the code would throw up its hands and give up. If that error case gets hit, the processing loop would yield to other systems early. A code comment was very reassuring though – this error was supposedly “rare.”
The “rare” error happened 1.5 million times in the month of June, 2010 on TQ.
Great story, and an interesting followup. By all means, read it. After I read it, I did some calculations, presented below:
So, it seems to me that the problem was really quite rare. And it happened all the time.