GroupWise GWIA 420 TCP Read and TCP Write Error

A customer in Penang recently replaced their Sendmail server with GroupWise 8.0. For the past month, however, they have had problems sending email to certain users at certain email domains, even though the same mail went out without trouble under Sendmail.

We saw 420 TCP read errors, 420 TCP write errors and 421 timeout errors in the GWIA logs. The following excerpts from the GWIA logs show the errors:

[Winwamedical.com]
16:00:13 896 MSG 58050 Command: [202.75.48.118]
16:00:13 896 MSG 58050 Response: 220 smtp1.mschosting.com
16:00:13 896 MSG 58050 Command: EHLO mail.sunrisepaper.com.my
16:00:13 896 MSG 58050 Response: 250 ok
16:00:13 896 MSG 58050 Command: MAIL FROM:
16:00:13 896 MSG 58050 Response: 250 OK Sender ok
16:00:13 896 MSG 58050 Command: RCPT TO:
16:00:13 896 MSG 58050 Response: 250 OK Recipient ok
16:00:13 896 MSG 58050 Command: DATA
16:00:13 896 MSG 58050 Response: 354 Start mail input; end with .
16:00:13 896 MSG 58050 Detected error on SMTP command
16:00:13 896 MSG 58050 Command: Data...
16:00:13 896 MSG 58050 Response: 420 TCP write error

[Escatec.com]
16:41:20 880 MSG 58121 File: /root/sunriseEmail/sunDom/wpgate/gwia/wpcsout/gwi3f3a/4/4a118fb0.000 Message Id: (4A112097.908:101:48786) Size: 86.5 Kb
16:41:20 880 MSG 58121 Sender: alicia_chua@sunrisepaper.com.my
16:41:20 880 MSG 58121 Converting message to MIME: /root/sunriseEmail/sunDom/wpgate/gwia/send/xa118fb0.024
16:41:20 880 MSG 58121 Recipient: Saravana@escatec.com
16:41:20 880 MSG 58121 Recipient: Yeoh.KengHong@escatec.com
16:41:20 880 MSG 58121 Queuing message to daemon: /root/sunriseEmail/sunDom/wpgate/gwia/send/sa118fb0.024
16:41:20 352 DMN: MSG 58121 Sending file: /root/sunriseEmail/sunDom/wpgate/gwia/send/pa118fb0.024
16:41:22 352 DMN: MSG 58121 Attempting to connect to mailserver.escatec.com
16:41:23 352 DMN: MSG 58121 Connected to [203.106.231.124] (mailserver.escatec.com)
16:47:25 352 DMN: MSG 58121 SMTP session ended: [203.106.231.124] (mailserver.escatec.com)
16:47:25 352 DMN: MSG 58121 Send Failure: 420 TCP write error
16:47:32 896 MSG 58121 Analyzing result file: /root/sunriseEmail/sunDom/wpgate/gwia/result/ra118fb0.024
16:47:32 896 MSG 58121 Command: escatec.com
16:47:32 896 MSG 58121 Response: 220 escatec.com [ESMTP Server] service ready;ESMTP Server; 05/18/09 16:45:04
16:47:32 896 MSG 58121 Command: EHLO mail.sunrisepaper.com.my
16:47:32 896 MSG 58121 Response: 250 ok
16:47:32 896 MSG 58121 Command: MAIL FROM:
16:47:32 896 MSG 58121 Response: 250 Sender OK
16:47:32 896 MSG 58121 Command: RCPT TO:
16:47:32 896 MSG 58121 Response: 250 Recipient OK
16:47:32 896 MSG 58121 Command: RCPT TO:
16:47:32 896 MSG 58121 Response: 250 Recipient OK
16:47:32 896 MSG 58121 Command: DATA
16:47:32 896 MSG 58121 Response: 354 Start mail input; end with .
16:47:32 896 MSG 58121 Detected error on SMTP command
16:47:32 896 MSG 58121 Command: Data...
16:47:32 896 MSG 58121 Response: 420 TCP write error
16:47:32 896 MSG 58121 Deferring message: /root/sunriseEmail/sunDom/wpgate/gwia/defer/sa118fb0.024

[kingston.com.my]
10:51:26 184 Queuing deferred message: /root/sunriseEmail/sunDom/wpgate/gwia/send/sa113669.006
10:51:26 184 MSG 57671 Analyzing result file: /root/sunriseEmail/sunDom/wpgate/gwia/result/ra1133ec.002
10:51:26 184 MSG 57671 Command: kingston.com.my
10:51:26 184 MSG 57671 Response: 220 MYMF1 ESMTP SonicWALL (7.0.0.1393)
10:51:26 184 MSG 57671 Command: EHLO mail.sunrisepaper.com.my
10:51:26 184 MSG 57671 Response: 250 ok
10:51:26 184 MSG 57671 Command: MAIL FROM:
10:51:26 184 MSG 57671 Response: 250 2.1.0 MAIL ok
10:51:26 184 MSG 57671 Command: RCPT TO:
10:51:26 184 MSG 57671 Response: 250 2.0.0 Ok
10:51:26 184 MSG 57671 Command: RCPT TO:
10:51:26 184 MSG 57671 Response: 250 2.0.0 Ok
10:51:26 184 MSG 57671 Command: DATA
10:51:26 184 MSG 57671 Response: 354 3.0.0 End Data with .
10:51:26 184 MSG 57671 Detected error on SMTP command
10:51:26 184 MSG 57671 Command: Data...
10:51:26 184 MSG 57671 Response: 421 4.0.0 Error: timeout
10:51:26 184 MSG 57671 Deferring message: /root/sunriseEmail/sunDom/wpgate/gwia/defer/sa1133ec.002
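
To see at a glance which messages hit these errors, a log like the excerpts above can be scanned for 420 and 421 responses. The following is only a minimal Python sketch: the regular expression is an assumption based on the line layout shown in these excerpts, and the function name find_transient_failures is made up for illustration, not part of GroupWise.

```python
import re
from collections import defaultdict

# Matches lines shaped like the excerpts above, for example:
#   16:00:13 896 MSG 58050 Response: 420 TCP write error
# The layout (time, thread id, optional "DMN:", message number) is an
# assumption taken from these excerpts only.
RESPONSE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) \w+ (?:DMN: )?MSG (?P<msg>\d+) "
    r"Response: (?P<code>42[01]) (?P<text>.*)$"
)


def find_transient_failures(lines):
    """Group 420/421 SMTP responses by GWIA message number."""
    failures = defaultdict(list)
    for line in lines:
        match = RESPONSE_RE.match(line.strip())
        if match:
            failures[match["msg"]].append(
                (match["time"], int(match["code"]), match["text"])
            )
    return dict(failures)


# A few lines copied from the excerpts above.
sample = [
    "16:00:13 896 MSG 58050 Command: DATA",
    "16:00:13 896 MSG 58050 Response: 354 Start mail input; end with .",
    "16:00:13 896 MSG 58050 Response: 420 TCP write error",
    "10:51:26 184 MSG 57671 Response: 421 4.0.0 Error: timeout",
]

for msg_id, hits in find_transient_failures(sample).items():
    print(msg_id, hits)
```

Lines such as "Send Failure: 420 TCP write error" are deliberately not matched here, so each failed DATA phase is counted once per result file.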

We knew that these errors pointed to communication issues, and we tried everything we could find in the Novell Support Knowledgebase and even the user support forums. Nothing helped until we logged a support incident with Novell Technical Support.

The Novell Technical Support's Chat feature was awesome. From the Novell Customer Center, you can access the Chat feature in the Service Request Details page, and a Technical Support Engineer for the product concerned will assist you.

Since this was a communication issue, the only way to find the root of the problem was to packet trace the whole email sending process. We installed Ethereal on the SUSE Linux Enterprise Server 10 SP2 machine that runs the GroupWise system and captured a packet trace while sending emails to those three domains.

Within a day or two, Novell Technical Support came back with the root cause of the problem and a solution. The suggested solution rocks, and emails went through to those users. The cause was Path MTU Discovery: the server never received the ICMP 3-4 messages from routers connected to a link with a smaller MTU, so its full-sized packets were silently dropped. Disabling Path MTU Discovery on the SLES 10 server resolved the issue.

To Disable Path MTU Discovery
# sysctl -w net.ipv4.ip_no_pmtu_disc=1

To Disable Path MTU Discovery permanently
Put the line 'net.ipv4.ip_no_pmtu_disc = 1' (without quotes) in /etc/sysctl.conf so that the setting persists after a server reboot.
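
To double-check that the line really is in place, the file can be parsed for the key. This is a minimal Python sketch, assuming the usual "key = value" layout of /etc/sysctl.conf with comment lines starting with # or ;. The function name sysctl_value and the sample text are made up for illustration.

```python
def sysctl_value(conf_text, key):
    """Return the last value assigned to `key` in sysctl.conf-style text.

    Returns None when the key is absent or only appears in a comment.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # a later assignment replaces an earlier one
    return value


# Hypothetical file content; on a real server, read /etc/sysctl.conf instead.
example = """
# Disable Path MTU Discovery
net.ipv4.ip_no_pmtu_disc = 1
"""

print(sysctl_value(example, "net.ipv4.ip_no_pmtu_disc"))
```

The value the kernel is currently using can also be read from /proc/sys/net/ipv4/ip_no_pmtu_disc.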


The following is the original message from Novell Technical Support explaining the root of the problem and the suggested solutions:

I just picked up your Service Request from the unassigned queue and checked the problem description from the customer and the LAN trace called PACKETTRACE from the compressed archive PACKETTRACE.TAR.GZ that is attached to the Service Request.

If you have Ethereal or Wireshark installed, you can open the trace and follow my analysis below:

The trace shows two problems:

1) Path MTU Discovery - Server does not receive ICMP 3-4 (Destination Unreachable / Fragmentation Needed and DF Set) from routers that are connected to a link with a smaller MTU.

An SMTP connection where the GWIA fails to deliver a mail message to recipient JulianaChooi@kingston.com.my because of this problem is the TCP connection between sockets 192.168.1.254:51064 (GWIA) and 202.188.165.2:25 (mailer daemon at mymf1.kingston.com.my).

You can extract this connection and all ICMP messages from the trace in Wireshark or Ethereal with the following display filter:

(ip.addr==192.168.1.254 && ip.addr==203.106.231.124 && tcp.port==55400 && tcp.port==25) || icmp

In frames #9438, #9446 and #9447 you can see the initiation of the SMTP connection. In the TCP options of the SYN packets you can see that the TCP at each side negotiated a Maximum Segment Size of 1460 bytes.

The trace was captured at the GWIA host (at IP address 192.168.1.254) and the remote SMTP server at IP address 203.106.231.124 seems to be 10 routers away from the GWIA.

Communication between the two TCP ends works well until the TCP at 192.168.1.254 sends a full data segment of 1448 bytes to 203.106.231.124 in frame #9484. The TCP at 203.106.231.124 never ACKnowledges receipt of this data segment, probably because it never received it. You can see in frame #9483 the last data segment that the TCP at 192.168.1.254 received from the TCP at 203.106.231.124 before the problem occurs. The sequence number of the first byte in this segment is 3246232251 and because it carries 46 bytes of data, the sequence number of the last byte in this segment is 3246232296 and hence the TCP at 192.168.1.254 would return ACKnowledgment number 3246232297 to confirm receipt of this segment.
You can see in frame #9484 the full data segment that is sent by the TCP at 192.168.1.254. As you can see, the ACKnowledgment number in this segment is 3246232297 to confirm receipt of the segment in frame #9483, and the sequence number of the first data byte in this segment is 4130583027. The total number of bytes in this segment is 1448 and hence the sequence number of the last byte in this segment would be 4130584474, so the TCP at 203.106.231.124 would return ACKnowledgment number 4130584475 to confirm receipt of this segment. Please note that this is the first fully sized TCP segment that was sent on this connection from the TCP at 192.168.1.254 to the TCP at 203.106.231.124.
In frame #9485, the TCP at 192.168.1.254 continues transmission with a small segment of only 16 bytes.
Because the TCP at 192.168.1.254 does not receive an ACKnowledgment from the TCP at 203.106.231.124 that it received these two segments, it starts retransmitting the full segment in frames #9530 and #9561.
Frame #9635 shows that the TCP at 203.106.231.124 still did not receive the full TCP segment from 192.168.1.254. It retransmitted its data, because it did not receive an ACKnowledgment for it (the ACKnowledgment number is in the full segment from the TCP at 192.168.1.254). Further proof that it did not receive the full segment is that the ACKnowledgment number in the retransmitted TCP segment in frame #9635 remains 4130583027. The TCP at 203.106.231.124 would have increased the ACKnowledgment number to 4130584475 if it had received the full segment from the TCP at 192.168.1.254.
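
(An aside that is not part of the Novell message: the sequence and acknowledgment numbers quoted for frames #9483 and #9484 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration.)

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def last_byte_seq(first_seq, payload_len):
    """Sequence number of the last data byte in a segment."""
    return (first_seq + payload_len - 1) % SEQ_MODULUS


def expected_ack(first_seq, payload_len):
    """ACK number the receiver returns once it has the whole segment."""
    return (first_seq + payload_len) % SEQ_MODULUS


# Frame #9483: 46 bytes from 203.106.231.124, first byte 3246232251
print(last_byte_seq(3246232251, 46), expected_ack(3246232251, 46))

# Frame #9484: 1448 bytes from 192.168.1.254, first byte 4130583027
print(last_byte_seq(4130583027, 1448), expected_ack(4130583027, 1448))
```

The results match the numbers in the analysis: 3246232296 and 3246232297 for frame #9483, 4130584474 and 4130584475 for frame #9484. An ACK number that stays at 4130583027, as in frame #9635, therefore means not a single byte of the 1448-byte segment arrived.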

The Don't Fragment flag in the IP header of the datagram in frame # 9484 is set, to indicate that routers are not supposed to fragment it when they need to forward the IP datagram on to a link with a smaller MTU (Maximum Transmission Unit).
A router should return an ICMP 3-4 message when it drops a datagram that is too big to forward and should be fragmented, but has the Don't Fragment flag enabled.
Because the trace does not show such an ICMP message, it is either not sent by a router in the network path from 192.168.1.254 to 203.106.231.124 or it has been blocked by a firewall.
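
(Another aside that is not part of the Novell message: a rough size calculation shows why only the full segment gets stuck. This Python sketch assumes 20-byte IP and TCP headers plus 12 bytes of TCP options, which is what a 1448-byte payload under a 1460-byte MSS would suggest if TCP timestamps are in use. The trace analysis does not state the options, and the 1492-byte link MTU is purely hypothetical.)

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20    # bytes, fixed part of the TCP header
TCP_OPTIONS = 12   # ASSUMPTION: timestamps option plus padding


def datagram_size(payload, tcp_options=TCP_OPTIONS):
    """Total IP datagram size for a TCP segment with `payload` data bytes."""
    return IPV4_HEADER + TCP_HEADER + tcp_options + payload


def fits(payload, link_mtu, tcp_options=TCP_OPTIONS):
    """True if the datagram can cross a link without fragmentation."""
    return datagram_size(payload, tcp_options) <= link_mtu


print(datagram_size(1448))   # full segment from frame #9484
print(datagram_size(16))     # small segment from frame #9485
print(fits(1448, 1500), fits(1448, 1492), fits(16, 1492))
```

Under these assumptions the full segment makes a 1500-byte datagram, which fits Ethernet but not any smaller link along the path, while the 16-byte segment and the SMTP commands before it pass everywhere. That would explain why the session works up to the DATA phase and then stalls.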

Please make sure that routers in the network path from 192.168.1.254 to 203.106.231.124 send ICMP 3-4 messages when they have to forward a datagram that has the DF bit set but needs to be fragmented to fit the MTU of the next link in the network path. Also make sure that firewalls do not block ICMP messages of type 3, code 4.

If you cannot change the router and firewall configuration in your network, you can disable Path MTU Discovery at the GWIA host (192.168.1.254) with '# sysctl -w net.ipv4.ip_no_pmtu_disc=1'. Please do not forget to put the line 'net.ipv4.ip_no_pmtu_disc = 1' (without quotes) in /etc/sysctl.conf to keep this setting configured after a server reboot.
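
(A last aside that is not part of the Novell message: this is roughly what the missing ICMP 3-4 message looks like on the wire. The Python sketch below decodes the 8-byte ICMP header using the layout from RFC 792 and RFC 1191, where the next-hop MTU sits in the last two bytes. The function name and the sample bytes are made up for illustration.)

```python
import struct

ICMP_DEST_UNREACHABLE = 3   # ICMP type
ICMP_FRAG_NEEDED = 4        # ICMP code: fragmentation needed and DF set


def parse_frag_needed(icmp_bytes):
    """Return the next-hop MTU if this is an ICMP 3-4 message, else None."""
    if len(icmp_bytes) < 8:
        return None
    # type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2)
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_FRAG_NEEDED:
        return next_hop_mtu
    return None


# Hypothetical header: type 3, code 4, checksum left at 0, next-hop MTU 1492.
example = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(parse_frag_needed(example))
```

A sender that receives such a message lowers its packet size for that destination to the reported MTU. Routers that predate RFC 1191 leave the MTU field at 0, in which case this function returns 0 and the sender has to guess a smaller size.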


Kudos to the Novell Technical Support team for a great job in getting the problem solved.

3 comments:

Wah happy happy put there d hor. Some more with people company domain name also u publish. Kanasai...btw something good to share with...KUDOS TO YOU 2

Was this box in a virtual machine, by chance?
I'm experiencing the same problem since moving my groupwise 8 domain/gwia/webaccess over to linux (also in a vm). Wondering if I should head down this same path... I'm getting a TON of Tcp Read errors, not many write errors though

We had this error and it ended up being the mail scanning feature of a UTM router that was causing issues. Stopped the outbound AV scanning of SMTP traffic and it all began to work OK again. Inbound scanning was still enabled.
