
Summary: ASTERISK-04093: [patch] tweak zapata.conf to explain purpose of 'jitterbuffers' directive and its relationship to EAGAIN errors.
Reporter: kb1_kanobe2 (kb1_kanobe2)
Labels:
Date Opened: 2005-05-06 03:36:07
Date Closed: 2011-06-07 14:11:58
Priority: Trivial
Regression? No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) zapata.conf.sample.patch
Description: When calls travel between an iax2 channel and a zaptel channel across an asterisk server (i.e. <-(iax2)->asterisk<-(zaptel)->), something inside chan_iax2.c interferes with the chan_zap.c side of the call, causing the write() on the zaptel side to fail with EAGAIN. This causes chan_zap to throw away the 320 bytes from the pending write and carry on without them, leading to pops and clicks on audio going out the zaptel interface.

If there are other calls up concurrently on the zaptel interface that are terminating inside the asterisk core (ie. app_milliwatt<-asterisk<-(zaptel)->) they do not experience EAGAIN errors. Similarly, if chan_iax2.so is not loaded, the write() failures do not occur.

Simply issuing 'reload chan_iax2.so' in the console has about a 60/40 chance of either substantially increasing or almost stopping the occurrences of EAGAIN until the next 'reload chan_iax2.so' or until the call is hung up and restarted.

The issue also seems to be affected by the latency and/or jitter reaching the far end of the iax connection: the 'closer' the remote host, the less frequently EAGAIN occurs overall on a problem call. I am running trunking and, if a second test call is placed to the same remote server as the first, it also inherits the EAGAIN errors. If instead it's placed to a different server, it has a 50/50 chance of developing its own errors. Disabling trunking, reloading and restarting the test call (to tear down the trunk) appears to almost, but not completely, stop the errors from occurring.

Although this issue does not prevent the channel from functioning, the rate of data loss on some calls, the fact that it cannot be mitigated by jitterbuffer or genericplc (as it is on the outbound side of chan_zap), and the sheer volume of complaints I've been receiving from my users over the last while have motivated me to mark this major.

****** ADDITIONAL INFORMATION ******

This issue was initially exposed in bug ASTERISK-4010 and is occurring on multiple Asus P4BG-MX based servers, each running a Celeron 2.4 and 256MB of RAM with one t100p card and two Ethernet cards. Kernel is 2.6.11.8 with Ingo Molnar's 'realtime' patches applied. Priority of the interrupt servicing thread for the t100p card has been raised to be the highest priority process (-52, using chrt) and asterisk (cvs-head from 3 May) is running with the priority boost flag (-p) applied.

I have followed the general debugging procedures outlined in http://www.mail-archive.com/asterisk-users@lists.digium.com/msg87960.html and can confirm that all these potential hotspots have been addressed, though my timer still does not score '100%' 100% of the time according to zttest.
Comments:
By: richard (richard) 2005-05-07 15:54:55

Can you tell me what codecs you are using?

All the cases I have seen where people are getting the write failure message, are using T1/E1 cards. I am using TDMs and although I do not get this message, I seem to be seeing similar problems.

By: kb1_kanobe2 (kb1_kanobe2) 2005-05-07 16:17:53

My systems normally run ulaw/T1 and G.726 over IAX. I have just tested the issue between the same two reference peers using both ulaw over IAX and gsm over IAX and the behaviour was as with G.726.

I also heard a report last night on IRC of similar behaviour with zap<->sip calls. I have just tested this and can reproduce EAGAIN errors that way too (at a much higher rate), but not the bizarre 'reload' behaviour.

By: richard (richard) 2005-05-07 16:43:15

Thanks. My setup is ulaw SIP -> IAX gsm -> TDM Zap ulaw. If I change to IAX ulaw the audio disruption is no longer apparent.


Edited:
I stated above that I had not seen the "my_zt_write" errors.

I have now, for the first time, tried calling from the TDM Zap to the SIP (the opposite of the above), and I get 7 or 8 of these errors at the instant SIP answers:

   -- Executing Dial("Zap/1-1", "IAX2/pbxwn@192.168.4.227/605") in new stack
   -- Called pbxwn@192.168.4.227/605
   -- Call accepted by 192.168.4.227 (format gsm)
   -- Format for call is gsm
   -- IAX2/pbxbr-2 is ringing
   -- IAX2/pbxbr-2 stopped sounds
   -- IAX2/pbxbr-2 answered Zap/1-1
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost
May  8 11:59:56 WARNING[25585]: chan_zap.c:4409 my_zt_write: Write returned -1 (Resource temporarily unavailable) on channel 1 - audio may have been lost


These errors also occur if the IAX codec is changed from gsm to ulaw although, as mentioned previously, there is then no audible disruption to the call, unlike with gsm, where objectionable crackling is heard. I guess this is due to the compressed format being more sensitive to data loss.



By: kb1_kanobe2 (kb1_kanobe2) 2005-05-07 17:23:37

An interesting, but possibly dangerous workaround to this problem has presented itself: simply remove O_NONBLOCK from the flags used when opening the fd at line 849 of chan_zap.c

This prevents the loss of data and, voila, provides a clear, high quality call. Obviously this is masking the deeper problem (ie. the source of the blocks causing EAGAIN errors and/or the lack of a retry strategy for the failed write()) however it does provide a workaround for all the problems encountered thus far...

Thoughts, comments, observations? Anyone?
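
The "retry strategy for the failed write()" mentioned above could look something like the following. This is only a minimal sketch and not what chan_zap.c does: set_nonblocking() and write_with_retry() are hypothetical helper names, and the retry count and timeout are illustrative assumptions. It uses only standard POSIX calls (fcntl, write, poll).

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch an already-open descriptor to non-blocking
 * mode. (Per the comment above, chan_zap passes O_NONBLOCK at open time.) */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: write len bytes to a non-blocking descriptor.
 * When write() fails with EAGAIN, wait up to timeout_ms for the descriptor
 * to become writable and try again, at most max_retries times.
 * Returns the number of bytes actually written (possibly fewer than len if
 * the retries ran out), or -1 on any error other than EAGAIN/EINTR. */
ssize_t write_with_retry(int fd, const void *buf, size_t len,
                         int max_retries, int timeout_ms)
{
    const char *p = buf;
    size_t done = 0;
    int retries = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: just retry */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* a real error, not "try again later" */
        if (retries++ >= max_retries)
            break;                  /* give up; caller sees a short count */

        /* The descriptor cannot take more data right now: wait for room. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) < 0 && errno != EINTR)
            return -1;
    }
    return (ssize_t)done;
}
```

A caller would replace a bare write(fd, buf, 320) with write_with_retry(fd, buf, 320, 2, 10) and treat a short count as lost audio. The trade-off is the same one the workaround above makes: any time spent waiting delays the audio path, so the retry budget has to stay well under one frame period.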



By: richard (richard) 2005-05-07 20:28:25

Perhaps I am looking at a different issue, as removing O_NONBLOCK does not improve the calls for me.

For what it's worth, when I disable the jitterbuffer or go back to the old jitterbuffer, the string of error messages on answer is absent.

Testing is being done on a switched local LAN with no other traffic and a 10Mb WAN with minimal traffic and no packet loss - call disruption is more frequent on the latter.

By: kb1_kanobe2 (kb1_kanobe2) 2005-05-09 22:20:21

I was advised today by tzanger on IRC to increase the value of 'jitterbuffers' in zapata.conf to a larger number. Raising to '24' and compiling O_NONBLOCK back in has apparently resolved this issue, though the 'reload chan_iax2.so' behaviour experienced still confuses me.

The attached patch tweaks the sample zapata.conf to explain the purpose of the previously anonymous 'jitterbuffers' directive to help others avoid this pitfall in the future.

Disclaimer on file.
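
The change described above amounts to a one-line edit. The fragment below is only a sketch: it assumes the directive lives in the [channels] section of zapata.conf, and the value 24 is simply the one reported in this comment, not a general recommendation. The comment text is illustrative and is not the wording of the attached patch.

```
; zapata.conf (fragment)
[channels]
; Number of buffers chan_zap asks the zaptel driver to queue per channel.
; Too few can make write() fail with EAGAIN and drop audio; more buffers
; add latency on the zaptel side of the call.
jitterbuffers=24
```

Assuming each buffer holds about 20 ms of audio (an assumption, not confirmed in this thread), 24 buffers would allow up to roughly 480 ms of queued audio, so the value is worth lowering again once the errors stop.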



By: Andrew Kohlsmith (akohlsmith) 2005-05-10 06:32:47

I get these errors even on a TDM420 on my wunderboard (Epox Cu-133a)  :-)

By: lesnet (lesnet) 2005-05-10 09:46:53

I still get my_zt_write errors even after adjusting jitterbuffer=24
Using 1 span on a TE405P

By: Mark Spencer (markster) 2005-05-10 15:17:05

This is just a documentation issue.  The -EAGAIN is clearly only an issue when there is too much jitter in the line or when there is a slip caused by the timing mismatch which is essentially always present in VoIP.  I don't know why someone turned that message into a warning when it clearly is only of debug value, but that has since been corrected in CVS.  If there are documentation updates that people feel would be appropriate, feel free to add them; otherwise, I'll go ahead and clear this one out and allow more discussion to take place if necessary on the mailing list.

By: Andrew Kohlsmith (akohlsmith) 2005-05-10 15:24:45

Uh, Mark?  The patch is nothing more than a patch *to* the documentation (no code changes), and the new text seems pretty valid, does it not?

By: kb1_kanobe2 (kb1_kanobe2) 2005-05-10 23:14:22

Actually, Mark has a point here. Now that the discarding of data when write() fails has been deemed a feature, not a bug, and the associated message is back in the debug quagmire under all circumstances, this amendment to the documentation is of little value, as tuning the directive remains a dark art.

Perhaps if the zaptel driver were able to communicate back to the calling function /why/ it had refused the data (which is, I assume, what is happening, rather than an OS-related block on the write()), then one could reasonably catch the relevant signals and generate a meaningful error when numbufs is too small. However, until that happens we simply need to trust that information is being discarded rather than absorbed, and that it's for a good reason.

Accordingly, patch withdrawn.
:-)