Having a long list of email addresses of your potential customers and partners is very common today. And if you run an online business, email marketing is probably one of the tools that you use often. It is thus handy to keep your mailing list clean. The addresses in your database might have been collected months, or even years, ago, and many of them might be invalid by now.
In this post we will uncover techniques and tricks that are used by email verification tools and scripts. Email address verification is not as simple as it might seem. Here are some of the questions that we will answer in this post: How do you verify an email address using the SMTP protocol? How do you recognize a good email address, to which a message will be delivered? How do you clean up a mailing list?
We mention the SMTP protocol a lot in this article. Its current specification is in RFC 5321. Do not hesitate to open and read this RFC while reading this post.
Definition, SMTP Basics and Core of Verification
What is a valid email address? Let’s say that an email address is valid if there is a mail server that will accept a message to that address. By accepting a message we mean that the server has read the contents of the message from the sender, but this does not imply that the message will be delivered to the recipient’s mailbox. And of course, this says nothing about whether the message will be read by its recipient.
Verification of an email address is performed using the SMTP protocol. Several SMTP commands can be used for this: EXPN, VRFY, and RCPT TO. EXPN and VRFY are rarely enabled in today’s mail systems, which is why the most common verification command is RCPT TO. There are several steps that have to be done before a client can use the RCPT TO command:
- Find the IP address of the target mail server.
- Establish a connection to the target mail server.
- Send the HELO or EHLO command.
- Send the MAIL FROM command.
- Now we can send the RCPT TO command.
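As a rough illustration, the sequence above could be driven with Python's standard smtplib module. This is a minimal sketch, not a complete verifier: the function name probe_rcpt, its parameters, and the smtp_factory hook (present only so the function can be exercised without a network) are assumptions made for this example, and the mail server's host name is expected to have been looked up beforehand.

```python
import smtplib


def probe_rcpt(mx_host, address, helo_name, mail_from,
               timeout=300, smtp_factory=smtplib.SMTP):
    """Run EHLO, MAIL FROM and RCPT TO against one mail server.

    Returns the numeric reply code of RCPT TO, or None when the server
    could not be reached or failed at an earlier step.
    All names here are illustrative, not part of any existing tool.
    """
    try:
        # smtplib connects and waits for the 220 greeting in the constructor.
        client = smtp_factory(mx_host, 25, timeout=timeout)
    except (OSError, smtplib.SMTPException):
        return None
    try:
        code, _ = client.ehlo(helo_name)
        if code != 250:
            return None
        code, _ = client.mail(mail_from)
        if code != 250:
            return None
        # No message body is sent; we only look at the reply to RCPT TO.
        code, _ = client.rcpt(address)
        return code
    except (OSError, smtplib.SMTPException):
        return None
    finally:
        try:
            client.quit()
        except (OSError, smtplib.SMTPException):
            pass
```

A call such as probe_rcpt("mx.example.com", "user@example.com", "mail.example.com", "verify@example.com") would return a code like 250 or 550, or None if this particular server was unusable. All four values in that call are placeholders.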
We will discuss these steps in detail later in this post. Let’s focus on the RCPT TO command now. In our case, the only interesting form of the RCPT TO command is the one whose argument is the target email address. For example:
    RCPT TO:<email@example.com>
When a client sends this command to the server, the server will either accept the address or report an error. If the server recognizes the address and accepts it, it will return either code 250 or 251. When we receive one of these codes, it means the email address is valid. Otherwise, it may or may not be valid. Return codes starting with 4 (codes between 400 and 499) are considered temporary errors. These are usually not returned if an email address does not exist at all. Return codes starting with 5 (codes between 500 and 599) are reserved for permanent errors, which imply that the email address does not exist or that a message sent to it will not be accepted.
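The reading of reply codes described above can be written down as a small helper. This is only a sketch; the function name and the returned labels are made up for illustration.

```python
def classify_rcpt_reply(code):
    """Map the numeric reply to RCPT TO onto the categories described above."""
    if code in (250, 251):
        return "accepted"      # the address is valid
    if 400 <= code <= 499:
        return "temporary"     # may or may not be valid, worth a retry
    if 500 <= code <= 599:
        return "permanent"     # address missing or mail will be refused
    return "unexpected"        # not a code we expect after RCPT TO
```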
There are five possible results of the email verification process:
- The email address is valid – The server returned 250 or 251 to the RCPT TO command.
- The email address is probably valid, but a temporary error prevents immediate delivery – This usually occurs when a greylisting mechanism is installed on the target mail server, and it is likely that a repeated attempt to send a message to this address will succeed. It can also occur when the target mailbox is full.
- The email address is valid, but so is every other email address on its domain – There is a catch-all address enabled for the target domain. This means that we are unable to decide whether the email address represents a real user’s mailbox or whether messages to the address are sent to the catch-all address.
- It is not possible to determine whether the email address is valid, but messages will not be delivered – This happens when the target mail server does not respond.
- The email address is not valid – There is no mail server for the given domain, or the mail server returned a permanent 5xx error to the RCPT TO command.
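The five results listed above can be sketched as an enumeration plus a decision function. The names below are invented for this example. It also assumes that a catch-all domain is detected by sending a second RCPT TO for a made-up address on the same domain; that is a common approach, but it is an assumption here, not something prescribed above.

```python
from enum import Enum


class Result(Enum):
    VALID = "valid"
    PROBABLY_VALID = "probably valid, temporary error"
    CATCH_ALL = "valid, but the domain accepts everything"
    UNKNOWN = "unknown, server does not respond"
    INVALID = "not valid"


def decide(has_mail_server, rcpt_code, random_rcpt_code=None):
    """Combine the probe outcomes into one of the five results.

    has_mail_server  -- False when the domain has no mail server at all
    rcpt_code        -- reply code for the tested address, None if no server answered
    random_rcpt_code -- reply code for a made-up address on the same domain
                        (assumed catch-all check, optional)
    """
    if not has_mail_server:
        return Result.INVALID
    if rcpt_code is None:
        return Result.UNKNOWN
    if rcpt_code in (250, 251):
        if random_rcpt_code in (250, 251):
            return Result.CATCH_ALL
        return Result.VALID
    if 400 <= rcpt_code <= 499:
        return Result.PROBABLY_VALID
    if 500 <= rcpt_code <= 599:
        return Result.INVALID
    return Result.UNKNOWN
```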
This may not sound very complicated, but the real problem is to determine these results correctly and not to be misled by the various anti-spam mechanisms that today’s mail servers implement.
Target Mail Servers
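The first step in the process is to check whether MX records exist for the domain of the email address that we want to verify. MX records are DNS records that hold information about the mail servers that handle email for a particular domain. If no MX records exist for a domain, it has no dedicated mail servers, and we treat its addresses as invalid. (RFC 5321 does let a sender fall back to the domain’s address record when there is no MX record, but a domain that does not exist in DNS at all, as in the second example below, cannot receive mail either way.) Let’s take a look at two examples using the nslookup command. Note that you can similarly use other commands, such as dig.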
Our sample emails to test will be email@example.com and firstname.lastname@example.org and we will be using Google’s public DNS server 8.8.8.8 to get DNS records.
    > nslookup
    > server 8.8.8.8
    Default Server: google-public-dns-a.google.com
    Address: 8.8.8.8
    > set q=mx
    > gmail.com
    Server: google-public-dns-a.google.com
    Address: 8.8.8.8
    Non-authoritative answer:
    gmail.com MX preference = 40, mail exchanger = alt4.gmail-smtp-in.l.google.com
    gmail.com MX preference = 30, mail exchanger = alt3.gmail-smtp-in.l.google.com
    gmail.com MX preference = 10, mail exchanger = alt1.gmail-smtp-in.l.google.com
    gmail.com MX preference = 5, mail exchanger = gmail-smtp-in.l.google.com
    gmail.com MX preference = 20, mail exchanger = alt2.gmail-smtp-in.l.google.com
    > nonexistingdomainname123456.com
    Server: google-public-dns-a.google.com
    Address: 8.8.8.8
    *** google-public-dns-a.google.com can't find nonexistingdomainname123456.com: Non-existent domain
For the email address on the non-existing domain, no MX records were found at all. This automatically means that email@example.com is not a valid email address. For firstname.lastname@example.org, the situation is different. We can see that there are 5 MX records for the gmail.com domain. It is common to see only 1 MX record for domains with a small number of mailboxes, but larger services may have more than one mail server. It is important not to pick only a single mail server here. In order to verify an email address properly, we may have to work with all servers from all MX records that were found.
Why is this so important? Imagine that we picked just a single MX server from the list. It could happen that when we try to connect to it in order to verify an email address, the server does not respond for whatever reason, or its response reports a temporary error because of a recent problem with this particular server. If we do not try the other available mail servers, we could end up with a wrong result. Another server may well be perfectly fine and give us correct answers. So, in case of any problem with one server, just try another one. Such problems include an inability to connect to the SMTP port, no greeting received, and non-OK responses to the HELO/EHLO or MAIL FROM commands.
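The fallback described above might look like the following sketch. It assumes the MX records are already available as (preference, host) pairs and that a hypothetical probe(host) callable returns the RCPT TO reply code, or None when the server is unusable (no connection, no greeting, or a bad reply to EHLO or MAIL FROM). Trying servers in order of MX preference, lowest first, is how senders normally select MX hosts.

```python
def verify_with_fallback(mx_records, probe):
    """Try each mail server in order of MX preference until one gives a firm answer.

    mx_records -- iterable of (preference, host) tuples
    probe      -- callable taking a host name, returning a reply code or None
    Returns the first definite reply code, the last temporary (4xx) code
    if no server was definite, or None if no server was usable.
    """
    last_temporary = None
    for _, host in sorted(mx_records):
        code = probe(host)
        if code is None:
            continue                 # server unusable, try the next one
        if 400 <= code <= 499:
            last_temporary = code    # may be a local hiccup, keep looking
            continue
        return code                  # 250/251 or a permanent 5xx error
    return last_temporary
```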
Handling Greetdelay and HELO/EHLO
At this step, we assume that we have a mail server from a DNS MX record of the target domain. The next step is to connect to the target mail server on its SMTP port, which is port 25/TCP. We just mentioned that if a connection cannot be established, we should try another server if one is available.
If a TCP connection is established, it is now important to wait until the mail server sends us its greeting. We cannot simply send our commands to the server, because the SMTP protocol prescribes that the server talks first, and some servers are configured to refuse to work with clients that do not respect that. This is one of the many anti-spam techniques. Spam-sending programs are sometimes written to be as simple and as fast as possible. This is why they do not wait for the server’s greeting and start sending commands right after the connection is established. Servers that reject this kind of client get rid of these spammers at the earliest stage possible.
Another anti-spam technique, which is relatively new, is called Greetdelay. The basic idea builds on the same fact that we already mentioned: many programs used by spammers to send emails are very simple and coded to work as fast as possible. And they have to be fast if they want to send millions of emails to get some hits and sales. The Greetdelay idea is to use the time allowed by the RFC specification of the SMTP protocol to delay the message transfer agent (MTA, the software that wants to send an email) a bit. For example, a server implementing Greetdelay waits 30 seconds before it sends the greeting to the client. Or 60 seconds. Or 90. Spam software that does not fully comply with the protocol may give up before it is allowed to send a message. A similar strategy can be implemented in other stages of the SMTP communication. The before-greeting stage is just the most common.
So, in order to pass the Greetdelay mechanism, the email verifier engine has to set its timeouts to at least 5 minutes. The greeting should come with SMTP response code 220. Any other code may mean that we should try another MX server.
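Waiting patiently for the greeting can be sketched with a plain socket. This is a simplified illustration with an invented function name: the timeout applies to each read rather than to the whole wait, and the socket is assumed to be already connected to port 25 of the mail server.

```python
import socket


def read_greeting(sock, timeout=300.0):
    """Wait for the server's greeting without sending anything first.

    Returns the numeric reply code (220 is expected), or None when the
    server stays silent past the timeout or closes the connection.
    """
    sock.settimeout(timeout)
    buffer = b""
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return None              # connection closed by the server
            buffer += chunk
            # Only complete lines count; the last element may be partial.
            for line in buffer.split(b"\r\n")[:-1]:
                # "220-" marks a continuation line, "220 " the final one.
                if len(line) >= 3 and line[:3].isdigit() and line[3:4] != b"-":
                    return int(line[:3])
    except OSError:                      # includes socket timeouts
        return None
```

A caller would treat anything other than 220 as a reason to move on to the next MX server.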
After the greeting is received, the client (our email verification engine) should send the HELO or EHLO command. Since the HELO command exists only for compatibility with very old software, use EHLO every time. Its only argument is the fully qualified domain name (FQDN) of the client. We will cover the importance of having your own domain name for the server where you run the email verification client in the next section, just after we send the MAIL FROM command. But before we can do that, we have to wait for the reply to our EHLO. The SMTP code here should be 250; otherwise we can assume that something is wrong with this server.
Let’s see how this goes when we talk to one of Gmail’s servers (lines starting with S: came from the server, lines starting with C: came from our client):
    > nc alt4.gmail-smtp-in.l.google.com 25
    S: 220 mx.google.com ESMTP m14si19812165icp.91 - gsmtp
    C: EHLO mail.example.com
    S: 250-mx.google.com at your service, [198.51.100.123]
    S: 250-SIZE 35882577
    S: 250-8BITMIME
    S: 250-STARTTLS
    S: 250-ENHANCEDSTATUSCODES
    S: 250-PIPELINING
    S: 250-CHUNKING
    S: 250 SMTPUTF8
We used Netcat as a client that allowed us to type the SMTP commands manually. Our IP address was 198.51.100.123. The MX server sent us a correct greeting and replied to our EHLO command. The server’s response code was 250, and it sent us a list of supported extensions, which are not interesting for us here. We can go on.
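For completeness, here is one way a verifier might read such a multi-line EHLO reply. It is a small sketch with an invented function name, and it assumes the reply lines have already been received and stripped of the S: prefix and line endings. Every line carries the same code; a hyphen after the code means more lines follow, and a space marks the last line.

```python
def parse_ehlo_reply(lines):
    """Split a multi-line EHLO reply into code, greeting text and extensions.

    lines -- reply lines such as "250-SIZE 35882577" and "250 SMTPUTF8"
    Returns (code, greeting, extensions), where extensions maps each
    keyword to its parameter string (empty when there is none).
    """
    code = int(lines[0][:3])
    greeting = lines[0][4:]
    extensions = {}
    for line in lines[1:]:
        keyword, _, params = line[4:].partition(" ")
        extensions[keyword.upper()] = params
    return code, greeting, extensions
```

Fed with the reply shown above, it yields code 250 and an extensions mapping that contains, among others, SIZE with the value 35882577.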