Skip to content

Scraping Gmail IMAP Messages to MySQL Database

2009 April 23
by Tedb0t

For my ITP Student List visualization endeavors, I discovered that there’s a gmail account someone started in 2006 that is subscribed to the list and receives all its messages.  This is, to my knowledge, the only permanent archive of the list.  So I wanted to offload all those messages to a database so I could make a nice handy API for future visualizations.

It’s taken a week or two of solid hacking and research.  First I heard about a Python module to do it, which turned out to be really old, then I turned to Perl and, though I did get connected via Mail::IMAPClient (which I recommend), I then discovered a class called emailtodb for PHP.

This class turned out to be really busted; unless I’m missing some crazy PHP principles, it was bizarrely illogical.  I have heavily modified it and it’s a lot faster.  Download my zip of the class and example usage.  I haven’t really documented it, but leave a comment if you are confused about something.

So!  Here’s how you connect to Gmail IMAP with PHP:

$mbox = imap_open ("{imap.gmail.com:993/imap/ssl/novalidate-cert}[Gmail]/All Mail", "user_id", "password");

Here I’m connecting to the “All Mail” mailbox since I wanted to offload ALL the messages on the server.  You can also replace that with INBOX, which is the more typical usage.  I discovered /novalidate-cert by trial and error; I’m guessing you don’t need it if your client domain is SSL certified (or something? I still don’t really get SSL).

The biggest challenge was dealing with multipart MIME messages.  In the end it turned out to be fairly simple but the original code looks like it was written on crack, so I had to spend a lot of time fixing it.

Efficiency: Although I’m sure I could optimize the hell out of it, right now it goes at about .3 seconds on average for non-multipart messages and anywhere from 1-2 seconds or more for multipart messages (which usually just means it’s extracting an embedded attachment like an image).

Related Posts:


  • Frankel

    Hi Ted,
    Thanks for posting this and making sense of the emailtodb class. I seem to be having trouble with attachments. I get this warning and thought you might point me in the right direction:
    —–
    Warning: mkdir() [function.mkdir]: No such file or directory in /home/.lacerations/userme/domain.com/emailtodb/class.emailtodb_tedb0t.php on line 665
    *Multipart* making path
    —–
    Thanks!

    • http://www.liminastudio.com Tedb0t

      You need to make a /files/email directory in the emailtodb directory, if you haven’t done that, and it should be writable and readable by the http server. Probably easiest to just make it 777.

  • Chris

    This is a great improvement over the original. One thing I can’t figure out though – how to get the delete to work. Basically I’m just trying to dump the inbox into a db and delete the messages. Can’t get it to work using gmail or another imap server. Any ideas?

  • http://www.liminastudio.com Tedb0t

    Hm, not sure, I haven’t tried to do any deleting…

  • Ron

    I am new at this and am wondering. Where is this installed? How does this start up besides with the click of a button? I noticed the last entry was 11/2009.
    Have you moved beyond this?

  • Martin

    Hi I am using your class but i am having problems with the attachments, if there is only one mail the attach downloads successfully but if there are more than one mail the attachments are not download. Do you know what can be wrong?
    Thanks in advance.

  • matt frank

    great script one problem i found is it add email from and email to as the same address in the db even though it’s not. looking through script to see if i can fix but hopefully you can reply and tell me where to look

  • Kerry

    Brilliant!!!!!!!!! I spent hours searching the net ……… Slight issue with $to names… in class change the following:

    if(is_array($to)){
    foreach ($from as $id => $object) {
    $toaddress = $object->mailbox . “@” . $object->host;
    }
    }

    to be:

    if(is_array($to)){
    foreach ($to as $id => $object) {
    $toname = $object->personal;
    $toaddress = $object->mailbox . “@” . $object->host;
    }
    }

    else the FROM address ends up in the database!

    also add:
    $email['TO_NAME'] = $this->mime_text_decode($toname);

    And you have to add another column into the database of EmailToP

    Then you can store the name of the To…

    I am modifying this class to differentiate between my INBOX and SENT items…

    I originally had a lookup happening using calls to IMAP but now in MYSQL its 10x faster!!!!

    Just have to sort out my select queries!

    Thanks again for this great bit of code….

  • Kevin

    This is some great code and may just save me a ton or work.
    Couple questions though..

    Are the 2 scripts, check_mail.php and dump_to_db.php meant to be independant of one another? I am finding that the dump_to_db.php script seems to do pretty much what i am looking for with exception to truncating the database each time… I want to modify the code to only pull in “new” or “unread” messages… and just leave the rest of the database in tact… anyone know if this is easily doable? This way, i could cron the script to run every 5 minutes or so to pull the most recent messages if any.

    Thanks in advance.

  • Jason

    I’m am not able to to get this to completely work. I am able to the checkmail code, and it’ll show that there is 103 messages. But when I run the dumptodb, total messages isn’t being populated and nothing is being entered into the db. I’ve trailed through the code and I can’t figure out why this isn’t working. The checkmail seems to connect alright, but something about the dump isn’t seeing the msgs. Any hints?