When looking at code which juggles threads, there are definitely a few red flags. sleep
s spammed through the code is an obvious one- either wait on a shared resource or block on another thread or something, but spurious sleep
s are suspicious.
Until today, I would not have included "atomic operations" as a code smell in multi-threaded code, but we all learn new things.
This one comes from Mike J. We'll break this function up into two sections, because I've got a lot to say.
os_uint32 initCount;
/* in case multiple participants are created at the same time only 1 can start spliced */
initCount = pa_inc32_nv(&_ospl_SplicedInitCount);
if (initCount == 1) {
result = startSplicedWithinProcess(&spliced_thread, domainCfg->uri);
if (result == U_RESULT_OK) {
/* wait for up to 10 seconds for the domain to be available in this process */
while ((domain = u_userLookupDomain(domainCfg->id)) == NULL && (++sleepCounter < 100)) {
ospl_os_sleep(delay_100ms);
}
if (domain) {
domain->spliced_thread = spliced_thread;
}
}
pa_dec32(&_ospl_SplicedInitCount);
//I'm skipping over a bit here so the flow of the code is clear- we'll come back to this in a moment
} else {
pa_dec32(&_ospl_SplicedInitCount);
/* wait for up to 30 seconds for the spliced to be available in this process */
while ((domain = u_userLookupDomain(domainCfg->id)) == NULL && (++sleepCounter < 300)) {
ospl_os_sleep(delay_100ms);
}
}
This code starts a worker thread. In the middle, you can see one of those very fun suspicious sleep calls, where we poll a u_userLookupDomain
which can take up to 10 seconds to interact with the thread we just started. That's almost certainly bad, or at least hinting at something bad.
But the start of this block is the special one. For some reason, this code doesn't use mutexes to avoid two threads from entering the block. Instead it increments an integer using an atomic operation. This works, but it's definitely not the correct way to do this.
Once we've got our thread spun up, we decrement that counter release the mutex. If there was another thread trying to access this block, it just falls through into decrementing and then doing another sleep, this time for up to 30 seconds.
Now, this is all ugly and makes me unhappy, but I skipped over a bit. So this is the block that goes in the elided section.
#if 1
{
/* This piece of code is black magic, without it durability sometimes fail to become complete.
* This code replaces historical code which was already in without a clear description.
* Must be replaced to avoid unwanted delays and possible failures by a proper synchronisation.
*/
int n;
for (n=0; n<10; n++) {
ospl_os_sleep(delay_100ms);
}
}
#endif
The code is just a sleep, but the comment is perhaps the perfect comment for bad code. In three lines, we have so much story.
"This piece of code is black magic but major features break without it" is one of my favorite kinds of comments to see. You can see the #if
guard- the developer who put this here didn't want it here, they wanted to be able to easily disable the code. But they couldn't. And it's easy to guess why: it's some sort of threading race condition or synchronization problem that we can avoid by just waiting long enough.
The second line of the comment is the one that really elevates things, though. This mysterious sleep that fixes a problem is just a replacement for an even more mysterious synchronization attempt that we didn't understand.
And then finally, we admit we know this code has to go away. We don't know how, we don't know when, we have no plan to do it- but we know it must be replaced. We even know what to replace it with- proper synchronization.
Now, there is good news: this code is in a product that is largely deceased. The code was donated to a FOSS foundation, and after a few releases, the foundation in question rightly junked the code and replaced it with code that does thread management properly. That's good news for us, anyway. Poor Mike is still stuck using it, for legacy reasons.