Random Croatian countryside #1

My company uses Postfix mail server to send mail.

Recently some users complained that they have received the same e-mail multiple times. I have inspected our logs and confirmed the problem. In short, the problem was that all messages should have unique message-id. We had a bug. While sending the same message to multiple people (for example in a mailing list), all messages had the same message-id and that caused some mails to be sent multiple times.

The developers fixed the problem and I wanted to make sure it is really fixed. After checking in a virtual machine we use for testing we were pretty sure the problem is fixed. All messages had unique message-id, even if the same mail was sent to multiple people. We have updated the code in production. After a few days I wanted to check Postfix logs to make sure the problem is really fixed.

I have found good (and short) article on Postfix logs, Troubleshooting with Postfix Logs that should provide enough information to get you started if you are new to Postfix. In short, Postfix has great logs. You will find everything you need there.

To further improve logs, we use variable envelope return path (VERP) to track all mail. In short, VERP makes it possible to know exactly which message bounced. Our VERP looks like this: vyx1psnf=qf3kay0g=user=example.net@example.com. We have sent vyx1psnf=qf3kay0g message (our tracking ID) to user@example.net. We own example.com domain and bounced mail will be returned there. (Please notice that example.net and example.com are not the same domain.) When we get mail sent to vyx1psnf=qf3kay0g=user=example.net@example.com we know it is a bounce and we know exactly which mail has bounced.

Let's take a look at the logs. I have just checked logs for the last two months and files for each day are from 64 to 358 MB in size. I have picked one file, Mac OS says it is 109.1 MB on disk. We will call the file log.

How does the file look like? (Some portions are anonymized.)

$ cat log | head
Dec 19 00:06:46 cdn postfix/qmgr[11196]: 051452B98BB: from=<19=zrnapvcy=user=example.net@example.com>, size=290769, nrcpt=1 (queue active)
Dec 19 00:06:46 cdn postfix/qmgr[11196]: 0E5B82B8EE4: from=<cqpygzdq=5wtz5664=user=example.net@example.com>, size=30125, nrcpt=1 (queue active)
...

(A big thank you to Davor Banović for helping me with Bash.)

Every line in log file has four parts. Let's inspect the first line in log in detail.

date and time Dec 19 00:06:46
host name cdn
Postfix component ID postfix/qmgr[11196]:
message 051452B98BB: from=<19=zrnapvcy=user=example.net@example.com>, size=290769, nrcpt=1 (queue active)

We are only interested in message, and it is freeform.

Let's get some statistics about the log file before we dive into one mail.

$ ls -l log
-rw-r--r-- 1 zeljko staff 109127473 Dec 20 11:55 19

Google says 109 127 473 bytes = 104.072068 megabytes.

All lines start with Dec so we can count number of lines.

$ grep -c Dec log
791311

So, 791k lines. 154k lines report status (about 19%).

$ grep -c status= log
153573

There are four statuses that are possible: deferred, sent, expired and bounced. Let's see how many of each there is.

$ grep -c status=deferred log
56609
$ grep -c status=sent log
92695
$ grep -c status=expired log
978
$ grep -c status=bounced log
3291

So, 56k deferred (37% of all statuses), 93k sent (60%), 1k expired (1%), 3k bounced (2%).

How many lines are there that contain VERP in from header and how many lines contain message-id?

$ grep -c ": from=<.*=.*=.*@.*>, " log
149133
$ grep -c message-id log
142528

(A big thank you to Rubular for helping me with regular expressions.)

Similar numbers. 149k lines contain VERP in from header (19% of all lines), 143k lines contain message-id (18%).

4k lines contain VERP in to header (1%). Those are bounces, remember?

$ grep -c ": to=<.*=.*=.*@.*>, " log
4279

There are 153k VERPs in either from or to headers, and VERPs appear only in mentioned headers and nowhere else.

$ grep -c ": \w*=<.*=.*=.*@.*>, " log
153412
$ grep -c "<.*=.*=.*@.*>, " log
153412

Random Croatian countryside #2

Enough of statistics, let's inspect one mail. A few lines from the log look like this.

...
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 309332B8F9A: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=13803, nrcpt=1 (queue active)
Dec 19 12:25:07 cdn postfix/cleanup[31905]: 321512B8F5A: message-id=<801daf8a02064f0789a28289dd0fde67@example.com>
Dec 19 12:25:07 cdn postfix/smtpd[31881]: 63CBD2B8EF4: client=example.com[127.0.0.1]
Dec 19 12:25:07 cdn postfix/cleanup[31807]: 3233C2B8FA0: message-id=<4bad537e1cb14b9b86f693668a2a56ed@example.com>
Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)
...

As you can see, it is a mess because Postfix logs stuff as it happens, and a lot of things happen almost at the same time. Since we want to know if any mail is sent multiple times, for now we are only interested in lines that have text status=sent, so we will start there.

$ grep status=sent log
...
Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)
...

We need ID (419122B8FC8) from that line.

$ echo "Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)" | cut -d " " -f6 | cut -d ":" -f1
419122B8FC8

So, mail with Postfix ID 419122B8FC8 is sent.

$ grep 419122B8FC8 log
Dec 19 12:25:07 cdn postfix/smtpd[31793]: 419122B8FC8: client=example.com[127.0.0.1]
Dec 19 12:25:07 cdn postfix/cleanup[31809]: 419122B8FC8: message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 419122B8FC8: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=13803, nrcpt=1 (queue active)
Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 419122B8FC8: removed

We also need message-id (a64ea1f30f964980a4c27593daaa9a8e@example.com), VERP (from=<vyx1psnf=qf3kay0g=user=example.net@example.com>) and relay (127.0.0.1).

$ grep 419122B8FC8 log | grep message-id | cut -d "<" -f2 | cut -d ">" -f1
a64ea1f30f964980a4c27593daaa9a8e@example.com

$ grep 419122B8FC8 log | grep ": from=<.*=.*=.*@.*>, " | cut -d "<" -f2 | cut -d ">" -f1
vyx1psnf=qf3kay0g=user=example.net@example.com

$ grep 419122B8FC8 log | grep relay= | cut -d "=" -f3 | cut -d "[" -f1
127.0.0.1

Now we have all data we need.

VERP vyx1psnf=qf3kay0g=user=example.net@example.com
message-id a64ea1f30f964980a4c27593daaa9a8e@example.com
Postfix ID 419122B8FC8
relay 127.0.0.1

How many times do message-id and VERP appear in the file?

$ grep vyx1psnf=qf3kay0g=user=example.net@example.com
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 419122B8FC8: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=13803, nrcpt=1 (queue active)
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 8D2632B8F9A: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=14415, nrcpt=1 (queue active)

$ grep a64ea1f30f964980a4c27593daaa9a8e@dgroups.org log
Dec 19 12:25:07 cdn postfix/cleanup[31809]: 419122B8FC8: message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>
Dec 19 12:25:07 cdn dkimproxy.out[32050]: DKIM signing - signed; message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>, signer=<@example.com>, from=<user@example.net>
Dec 19 12:25:07 cdn postfix/cleanup[31930]: 8D2632B8F9A: message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>

Looks like 419122B8FC8 and 8D2632B8F9A are connected, since they always appear together.

$ grep 419122B8FC8 log
Dec 19 12:25:07 cdn postfix/smtpd[31793]: 419122B8FC8: client=example.com[127.0.0.1]
Dec 19 12:25:07 cdn postfix/cleanup[31809]: 419122B8FC8: message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 419122B8FC8: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=13803, nrcpt=1 (queue active)
Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 419122B8FC8: removed

$ grep 8D2632B8F9A log
Dec 19 12:25:07 cdn postfix/smtpd[31842]: 8D2632B8F9A: client=example.com[127.0.0.1]
Dec 19 12:25:07 cdn postfix/cleanup[31930]: 8D2632B8F9A: message-id=<a64ea1f30f964980a4c27593daaa9a8e@example.com>
Dec 19 12:25:07 cdn postfix/qmgr[11196]: 8D2632B8F9A: from=<vyx1psnf=qf3kay0g=user=example.net@example.com>, size=14415, nrcpt=1 (queue active)
Dec 19 12:25:07 cdn postfix/smtp[31781]: 419122B8FC8: to=<user@example.net>, relay=127.0.0.1[127.0.0.1]:10027, conn_use=3, delay=0.54, delays=0.41/0/0/0.13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8D2632B8F9A)
Dec 19 12:25:09 cdn postfix/smtp[31964]: 8D2632B8F9A: to=<user@example.net>, relay=example.net[127.0.0.1]:25, delay=1.9, delays=0.13/0/0.43/1.3, dsn=2.0.0, status=sent (250 OK id=1PUIFI-0002Yf-TL)
Dec 19 12:25:09 cdn postfix/qmgr[11196]: 8D2632B8F9A: removed

So, 419122B8FC8 used relay 127.0.0.1 (localhost) and 8D2632B8F9A used relay example.net. That is normal. Mail is first sent to 127.0.0.1 (localhost) and then to destination mail server (example.net). It is just the way things are set up. Maybe I get into more detail in another blog post.

We have more data now, and everything looks fine so far.

VERP vyx1psnf=qf3kay0g=user=example.net@example.com
message-id a64ea1f30f964980a4c27593daaa9a8e@example.com
Postfix ID 1 419122B8FC8
relay 1 127.0.0.1
Postfix ID 2 8D2632B8F9A
relay 2 example.net

Now, let's finally inspect the whole file and find out if any mail is sent multiple times. There are more than one way to do it. A couple of them are:

  • Get all lines with status sent. From each line grab Postfix ID. Get all lines with the ID. Get message-id, relay and VERP from lines with the ID. Ignore if VERP is in to header. (Those are bounces. We are interested only in VERPs in from header.)
  • Get all lines with a VERP. From each line grab Postfix ID. Get all lines with the ID. Get message-id, relay and status from lines with the ID. Ignore if status is not sent.

Since I have no idea how would I do that in Bash, I will use Ruby for that. This post got pretty long, that will be covered in another post.