[Voiceglue] Voiceglue hang - debugging so far

James Green james.green at mjog.com
Fri Jul 3 17:00:16 UTC 2009


As always, things are more complicated than they first look.

So we followed the idea that the r=0 loop within the IPC read() was
being caused by something disconnecting the remote end's socket.
Interestingly sometimes we see a write() failure on the socket with
errno 32 (broken pipe) supporting the theory.

We added logging anywhere that sockets / fds were being created or
closed by voiceglue, Scom or VXI (via SWI). This had to include
microsecond timings to follow the sequence of events truly.

Of specific interest we noticed the following:
1. Within the main VG getevents() loop:
2. A call would cease and VG calls IPC unregister which close()s the
VXML FD (say 19). This we verified as close returns 0.
3. SWIsocket.cpp created a new FD 0.01s later and gets the now closed FD
(19)
4. Scom sends a stop on the PERL_VXML_callid handle which VG now
processes and close(19) it within fh_stopped. At this point the call is
closed down including file descriptors however another thread's VXI
obviously has the FD number (19) registered itself.
5. Another incoming is processed, Scom registers the FD (19) again. IPC
gets a new FD (24).
6. VG finishes getevents() and starts again
7. SWI then closes the FD (19) created in step 3. The VXI instance
created in part 5 is now read() looping on r=0 on FD 24.
8. Scom notes that the FD (19) is now closed and stops the pipe.
9. VG attempts to send_bytes to the call createfd in step 5 but write()
on FD 19 returns that the FD is not open for writing.

So, should we be close()ing the IPC FD both within the unregister AND
fh_stopped? There's a clearly a race condition here.

Second, after a write() comes with with a broken pipe, we return -1 but
clearly a read() follows which doesn't sound right.

Third, within fh_stopped(), just before we close() a VXML IPC FD. At
this point have we not unregistered ourselves at the IPC layer? I ask
because a dup() of the FD succeeds and the unregister code clearly
closes the FD.

It is worth noting that we repeated the same test shortly after the
above, and exactly the same problem occurred. Given enough simultaneous
calls and sufficient time, the race condition is repeatable.

We continue the debugging but we're confident that if the above isn't
always the precise cause it's clearly not helping matters either... All
evidence collected so far points to corruption of the FD maps through
race conditions between VG/Scom and IPC/VXI and VXI/SWI.

We're going to remove the close() within the IPC unregister on Monday on
our test kit (there being no need to close() twice and as VG opens the
FD it might as well close it exclusively too) and hope everything does
not blow up. Would appreciate your thoughts on that one!

Look forward to hearing from you :-)

James

-----Original Message-----
From: voiceglue-bounces at voiceglue.org
[mailto:voiceglue-bounces at voiceglue.org] On Behalf Of Doug Campbell
Sent: 01 July 2009 15:11
To: General discussion about voiceglue
Subject: Re: [Voiceglue] Voiceglue hang - debugging so far

> Current theory is that Scom is not catching a closed file descriptor 
> which is then re-allocated.

That's great.  If true, I should be able to replicate it.

> How's your debugging coming along, Doug?

I am currently held hostage by a 15-hour-per-day assignment.
Hopefully I will get a reprieve soon.

Doug Campbell
_______________________________________________
Voiceglue mailing list
Voiceglue at voiceglue.org
http://www.voiceglue.org/mailman/listinfo/voiceglue

Checked by AVG - www.avg.com
Version: 8.5.375 / Virus Database: 270.13.1/2212 - Release Date:
07/01/09 05:53:00

Checked by AVG - www.avg.com 
Version: 8.5.375 / Virus Database: 270.13.2/2215 - Release Date:
07/03/09 05:53:00
-------------- next part --------------

No virus found in this outgoing message.
Checked by AVG - www.avg.com 
Version: 8.5.375 / Virus Database: 270.13.3/2216 - Release Date: 07/03/09 05:53:00


More information about the Voiceglue mailing list