[Voiceglue] Voiceglue hang - debugging so far

James Green james.green at mjog.com
Tue Jul 7 12:59:23 UTC 2009


OK We have Voiceglue stable - it no longer hangs.

Keep the close() within the c IPC unregister, and lose the one from
fh_stopped that closes the VXML interpreter side.

But... Now it leaks file descriptors.

The leak occurs when unregistering the perl_fh within Scom. We reach the
close() and complete the subroutine having cleaned the internal data
structures. Except sometimes that FD has not actually closed.

I tested by placing the close() within a while () loop. Inside it:

1.  Calls threads->suspend() to gain a mutex
2.  closes($fh->{handle})
3.  Checks is_fd_used(). If it is not used we break out, otherwise we:
3a. Increment a counter
3b. Loop
4.  After the loop if the counter is above 1 we log it
5.  Calls threads->resume() to unlock the mutex

Each log entry says it was closed twice before _is_fd_used actually
fails. Yet within /proc/<pid>/fd/ that FD is still present.

Replace close() with POSIX::close(fileno($fh->{handle})) and it seems to
close reliably.

I invite ideas as to why close() never fails at the perl level, yet
sometimes leaves the FD open. One possibility is that there's a second
(or more) references to it.

Note that kernel 2.6.28+ has some file descriptor handling changes. I
have no evidence to suggest this is having an impact here but it may be
worth a trip to Google.

James

-----Original Message-----
From: voiceglue-bounces at voiceglue.org
[mailto:voiceglue-bounces at voiceglue.org] On Behalf Of James Green
Sent: 03 July 2009 18:00
To: General discussion about voiceglue
Subject: Re: [Voiceglue] Voiceglue hang - debugging so far

As always, things are more complicated than they first look.

So we followed the idea that the r=0 loop within the IPC read() was
being caused by something disconnecting the remote end's socket.
Interestingly sometimes we see a write() failure on the socket with
errno 32 (broken pipe) supporting the theory.

We added logging anywhere that sockets / fds were being created or
closed by voiceglue, Scom or VXI (via SWI). This had to include
microsecond timings to follow the sequence of events truly.

Of specific interest we noticed the following:
1. Within the main VG getevents() loop:
2. A call would cease and VG calls IPC unregister which close()s the
VXML FD (say 19). This we verified as close returns 0.
3. SWIsocket.cpp created a new FD 0.01s later and gets the now closed FD
(19)
4. Scom sends a stop on the PERL_VXML_callid handle which VG now
processes and close(19) it within fh_stopped. At this point the call is
closed down including file descriptors however another thread's VXI
obviously has the FD number (19) registered itself.
5. Another incoming is processed, Scom registers the FD (19) again. IPC
gets a new FD (24).
6. VG finishes getevents() and starts again 7. SWI then closes the FD
(19) created in step 3. The VXI instance created in part 5 is now read()
looping on r=0 on FD 24.
8. Scom notes that the FD (19) is now closed and stops the pipe.
9. VG attempts to send_bytes to the call createfd in step 5 but write()
on FD 19 returns that the FD is not open for writing.

So, should we be close()ing the IPC FD both within the unregister AND
fh_stopped? There's a clearly a race condition here.

Second, after a write() comes with with a broken pipe, we return -1 but
clearly a read() follows which doesn't sound right.

Third, within fh_stopped(), just before we close() a VXML IPC FD. At
this point have we not unregistered ourselves at the IPC layer? I ask
because a dup() of the FD succeeds and the unregister code clearly
closes the FD.

It is worth noting that we repeated the same test shortly after the
above, and exactly the same problem occurred. Given enough simultaneous
calls and sufficient time, the race condition is repeatable.

We continue the debugging but we're confident that if the above isn't
always the precise cause it's clearly not helping matters either... All
evidence collected so far points to corruption of the FD maps through
race conditions between VG/Scom and IPC/VXI and VXI/SWI.

We're going to remove the close() within the IPC unregister on Monday on
our test kit (there being no need to close() twice and as VG opens the
FD it might as well close it exclusively too) and hope everything does
not blow up. Would appreciate your thoughts on that one!

Look forward to hearing from you :-)

James

-----Original Message-----
From: voiceglue-bounces at voiceglue.org
[mailto:voiceglue-bounces at voiceglue.org] On Behalf Of Doug Campbell
Sent: 01 July 2009 15:11
To: General discussion about voiceglue
Subject: Re: [Voiceglue] Voiceglue hang - debugging so far

> Current theory is that Scom is not catching a closed file descriptor 
> which is then re-allocated.

That's great.  If true, I should be able to replicate it.

> How's your debugging coming along, Doug?

I am currently held hostage by a 15-hour-per-day assignment.
Hopefully I will get a reprieve soon.

Doug Campbell
_______________________________________________
Voiceglue mailing list
Voiceglue at voiceglue.org
http://www.voiceglue.org/mailman/listinfo/voiceglue

Checked by AVG - www.avg.com
Version: 8.5.375 / Virus Database: 270.13.1/2212 - Release Date:
07/01/09 05:53:00

Checked by AVG - www.avg.com
Version: 8.5.375 / Virus Database: 270.13.2/2215 - Release Date:
07/03/09 05:53:00

Checked by AVG - www.avg.com
Version: 8.5.375 / Virus Database: 270.13.2/2215 - Release Date:
07/03/09 05:53:00

Checked by AVG - www.avg.com 
Version: 8.5.387 / Virus Database: 270.13.6/2221 - Release Date:
07/07/09 05:53:00
-------------- next part --------------

No virus found in this outgoing message.
Checked by AVG - www.avg.com 
Version: 8.5.387 / Virus Database: 270.13.7/2222 - Release Date: 07/07/09 05:53:00


More information about the Voiceglue mailing list