Skip to content

Scraping Gmail IMAP Messages to MySQL Database

2009 April 23
by Tedb0t

For my ITP Student List visualization endeavors, I discovered that there’s a gmail account someone started in 2006 that is subscribed to the list and receives all its messages.  This is, to my knowledge, the only permanent archive of the list.  So I wanted to offload all those messages to a database so I could make a nice handy API for future visualizations.

It’s taken a week or two of solid hacking and research.  First I heard about a Python module to do it, which turned out to be really old, then I turned to Perl and, though I did get connected via Mail::IMAPClient (which I recommend), I then discovered a class called emailtodb for PHP.

This class turned out to be really busted; unless I’m missing some crazy PHP principles, it was bizarrely illogical.  I have heavily modified it and it’s a lot faster.  Download my zip of the class and example usage.  I haven’t really documented it, but leave a comment if you are confused about something.

So!  Here’s how you connect to Gmail IMAP with PHP:

$mbox = imap_open ("{imap.gmail.com:993/imap/ssl/novalidate-cert}[Gmail]/All Mail", "user_id", "password");

Here I’m connecting to the “All Mail” mailbox since I wanted to offload ALL the messages on the server.  You can also replace that with INBOX, which is the more typical usage.  I discovered /novalidate-cert by trial and error; I’m guessing you don’t need it if your client domain is SSL certified (or something? I still don’t really get SSL).

The biggest challenge was dealing with multipart MIME messages.  In the end it turned out to be fairly simple but the original code looks like it was written on crack, so I had to spend a lot of time fixing it.

Efficiency: Although I’m sure I could optimize the hell out of it, right now it goes at about .3 seconds on average for non-multipart messages and anywhere from 1-2 seconds or more for multipart messages (which usually just means it’s extracting an embedded attachment like an image).

Related Posts:


17 Responses leave one →
  1. Frankel permalink
    August 18, 2009

    Hi Ted,
    Thanks for posting this and making sense of the emailtodb class. I seem to be having trouble with attachments. I get this warning and thought you might point me in the right direction:
    —–
    Warning: mkdir() [function.mkdir]: No such file or directory in /home/.lacerations/userme/domain.com/emailtodb/class.emailtodb_tedb0t.php on line 665
    *Multipart* making path
    —–
    Thanks!

    • August 18, 2009

      You need to make a /files/email directory in the emailtodb directory, if you haven’t done that, and it should be writable and readable by the http server. Probably easiest to just make it 777.

  2. Chris permalink
    November 30, 2009

    This is a great improvement over the original. One thing I can’t figure out though – how to get the delete to work. Basically I’m just trying to dump the inbox into a db and delete the messages. Can’t get it to work using gmail or another imap server. Any ideas?

  3. November 30, 2009

    Hm, not sure, I haven’t tried to do any deleting…

  4. Ron permalink
    April 14, 2010

    I am new at this and am wondering. Where is this installed? How does this start up besides with the click of a button? I noticed the last entry was 11/2009.
    Have you moved beyond this?

  5. Martin permalink
    May 13, 2010

    Hi I am using your class but i am having problems with the attachments, if there is only one mail the attach downloads successfully but if there are more than one mail the attachments are not download. Do you know what can be wrong?
    Thanks in advance.

  6. matt frank permalink
    September 23, 2010

    great script one problem i found is it add email from and email to as the same address in the db even though it’s not. looking through script to see if i can fix but hopefully you can reply and tell me where to look

  7. Kerry permalink
    December 18, 2010

    Brilliant!!!!!!!!! I spent hours searching the net ……… Slight issue with $to names… in class change the following:

    if(is_array($to)){
    foreach ($from as $id => $object) {
    $toaddress = $object->mailbox . “@” . $object->host;
    }
    }

    to be:

    if(is_array($to)){
    foreach ($to as $id => $object) {
    $toname = $object->personal;
    $toaddress = $object->mailbox . “@” . $object->host;
    }
    }

    else the FROM address ends up in the database!

    also add:
    $email[‘TO_NAME’] = $this->mime_text_decode($toname);

    And you have to add another column into the database of EmailToP

    Then you can store the name of the To…

    I am modifying this class to differentiate between my INBOX and SENT items…

    I originally had a lookup happening using calls to IMAP but now in MYSQL its 10x faster!!!!

    Just have to sort out my select queries!

    Thanks again for this great bit of code….

  8. Kevin permalink
    April 13, 2011

    This is some great code and may just save me a ton or work.
    Couple questions though..

    Are the 2 scripts, check_mail.php and dump_to_db.php meant to be independant of one another? I am finding that the dump_to_db.php script seems to do pretty much what i am looking for with exception to truncating the database each time… I want to modify the code to only pull in “new” or “unread” messages… and just leave the rest of the database in tact… anyone know if this is easily doable? This way, i could cron the script to run every 5 minutes or so to pull the most recent messages if any.

    Thanks in advance.

  9. Jason permalink
    May 26, 2011

    I’m am not able to to get this to completely work. I am able to the checkmail code, and it’ll show that there is 103 messages. But when I run the dumptodb, total messages isn’t being populated and nothing is being entered into the db. I’ve trailed through the code and I can’t figure out why this isn’t working. The checkmail seems to connect alright, but something about the dump isn’t seeing the msgs. Any hints?

  10. Todd permalink
    June 10, 2012

    hi, is it possible to just send new messages to the database by reading the header of the email, I saw a reference in the class but was commented out and didn’t look finished

  11. ace permalink
    August 14, 2012

    hey community i dont get my message body in the database i am experiencing some problems

    • Brett permalink
      August 30, 2012

      I’m having that issue with E-mails that don’t have HTML body. Try emails that are HTML composed.

  12. Brett permalink
    August 30, 2012

    I, too, want to grab only new messages. I have an issue though where the last message to be imported to DB somehow has the wront HTML body of other messages instead of it’s own. It’s really strange.

  13. October 19, 2014

    I am getting this

    connect(‘imap.gmail.com’, ‘:993/imap/ssl/novalidate-cert’, ‘something@gmail.com’, ‘password’); //}[Gmail]/All Mail
    $edb->do_action(); ?>

    when i tried to run the visit the php file in browser

  14. Dan permalink
    January 22, 2015

    If anyone has made this completely work, please contact me at danhallchc at yahoo dot com. Thanks!

  15. Sandbird permalink
    April 18, 2016

    I thought i was going crazy, but i am not :)
    $message_id is never set…which is basically the message id number that an email has…so table emailtodb_email never gets IDEmail set.

    Using select LAST_INSERT_ID() as UID to create a ‘fake’ email id is totally wrong.
    You have to get the proper email id for each email. Every email has a unique email id…Thats the proper value.

    I am surprised $header=imap_headerinfo($this->link, $this->msgid, 80,80); even works.
    Because that just uses an incremental msgid number…Not the real ID of the email.

    You can use:
    public function searchMailbox($criteria = ‘ALL’) {
    $mailsIds = imap_search($this->link, $criteria, SE_UID, ‘UTF-8’);
    return $mailsIds ? $mailsIds : array();
    }

    $emails = $edb->searchMailbox(‘ALL’); //Or UNSEEN etc
    print_r($emails);

    To get a list of the proper email IDs

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS