My company uses Postfix mail server to send mail.
Recently some users complained that they have received the same e-mail multiple times. I have inspected our logs and confirmed the problem. In short, the problem was that all messages should have unique message-id. We had a bug. While sending the same message to multiple people (for example in a mailing list), all messages had the same message-id and that caused some mails to be sent multiple times.
The developers fixed the problem and I wanted to make sure it is really fixed. After checking in a virtual machine we use for testing we were pretty sure the problem is fixed. All messages had unique message-id, even if the same mail was sent to multiple people. We have updated the code in production. After a few days I wanted to check Postfix logs to make sure the problem is really fixed.
I have found good (and short) article on Postfix logs, Troubleshooting with Postfix Logs that should provide enough information to get you started if you are new to Postfix. In short, Postfix has great logs. You will find everything you need there.
To further improve logs, we use variable envelope return path (VERP) to track all mail. In short, VERP makes it possible to know exactly which message bounced. Our VERP looks like this:
email@example.com. We have sent
vyx1psnf=qf3kay0g message (our tracking ID) to
firstname.lastname@example.org. We own
example.com domain and bounced mail will be returned there. (Please notice that
example.com are not the same domain.) When we get mail sent to
email@example.com we know it is a bounce and we know exactly which mail has bounced.
Let's take a look at the logs. I have just checked logs for the last two months and files for each day are from 64 to 358 MB in size. I have picked one file, Mac OS says it is 109.1 MB on disk. We will call the file
How does the file look like? (Some portions are anonymized.)
Every line in log file has four parts. Let's inspect the first line in log in detail.
|date and time||
|Postfix component ID||
We are only interested in message, and it is freeform.
Let's get some statistics about the log file before we dive into one mail.
Google says 109 127 473 bytes = 104.072068 megabytes.
All lines start with
Dec so we can count number of lines.
So, 791k lines. 154k lines report status (about 19%).
There are four statuses that are possible: deferred, sent, expired and bounced. Let's see how many of each there is.
So, 56k deferred (37% of all statuses), 93k sent (60%), 1k expired (1%), 3k bounced (2%).
How many lines are there that contain VERP in
from header and how many lines contain message-id?
(A big thank you to Rubular for helping me with regular expressions.)
Similar numbers. 149k lines contain VERP in from header (19% of all lines), 143k lines contain message-id (18%).
4k lines contain VERP in
to header (1%). Those are bounces, remember?
There are 153k VERPs in either
to headers, and VERPs appear only in mentioned headers and nowhere else.
Enough of statistics, let's inspect one mail. A few lines from the log look like this.
As you can see, it is a mess because Postfix logs stuff as it happens, and a lot of things happen almost at the same time. Since we want to know if any mail is sent multiple times, for now we are only interested in lines that have text
status=sent, so we will start there.
419122B8FC8) from that line.
So, mail with Postfix ID
419122B8FC8 is sent.
We also need
Now we have all data we need.
How many times do message-id and VERP appear in the file?
Looks like 419122B8FC8 and 8D2632B8F9A are connected, since they always appear together.
So, 419122B8FC8 used relay 127.0.0.1 (localhost) and 8D2632B8F9A used relay example.net. That is normal. Mail is first sent to 127.0.0.1 (localhost) and then to destination mail server (example.net). It is just the way things are set up. Maybe I get into more detail in another blog post.
We have more data now, and everything looks fine so far.
|Postfix ID 1||419122B8FC8|
|Postfix ID 2||8D2632B8F9A|
Now, let's finally inspect the whole file and find out if any mail is sent multiple times. There are more than one way to do it. A couple of them are:
- Get all lines with status
sent. From each line grab Postfix ID. Get all lines with the ID. Get message-id, relay and VERP from lines with the ID. Ignore if VERP is in
toheader. (Those are bounces. We are interested only in VERPs in
- Get all lines with a VERP. From each line grab Postfix ID. Get all lines with the ID. Get message-id, relay and status from lines with the ID. Ignore if status is not
Since I have no idea how would I do that in Bash, I will use Ruby for that. This post got pretty long, that will be covered in another post.