Scraping Gmail IMAP Messages to MySQL Database
For my ITP Student List visualization endeavors, I discovered that there’s a Gmail account someone started in 2006 that is subscribed to the list and receives all its messages. This is, to my knowledge, the only permanent archive of the list. So I wanted to offload all those messages to a database so I could make a nice handy API for future visualizations.
It’s taken a week or two of solid hacking and research. First I heard about a Python module to do it, which turned out to be really old. Then I turned to Perl and did get connected via Mail::IMAPClient (which I recommend), but in the end I discovered a class called emailtodb for PHP and went with that.
This class turned out to be really busted; unless I’m missing some crazy PHP principles, it was bizarrely illogical. I have heavily modified it and it’s a lot faster. Download my zip of the class and example usage. I haven’t really documented it, but leave a comment if you are confused about something.
So! Here’s how you connect to Gmail IMAP with PHP:
$mbox = imap_open ("{imap.gmail.com:993/imap/ssl/novalidate-cert}[Gmail]/All Mail", "user_id", "password");
Here I’m connecting to the “All Mail” mailbox since I wanted to offload ALL the messages on the server. You can also replace that with INBOX, which is the more typical usage. I discovered /novalidate-cert by trial and error. It tells PHP’s IMAP library to skip validating the server’s certificate, so you shouldn’t need it if your PHP host can verify Gmail’s certificate on its own.
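If you want the connection to fail loudly instead of silently handing back false, something like this rough sketch might help. The function names gmail_mailbox and open_gmail are made up for illustration (they are not part of emailtodb), and it assumes PHP's IMAP extension is installed:

```php
<?php
// Hypothetical helper: builds the mailbox string that imap_open() expects.
// Pass $validateCert = false to append /novalidate-cert.
function gmail_mailbox(string $folder = 'INBOX', bool $validateCert = true): string
{
    $flags = '/imap/ssl' . ($validateCert ? '' : '/novalidate-cert');
    return '{imap.gmail.com:993' . $flags . '}' . $folder;
}

// Hypothetical wrapper: opens the mailbox and throws if the connection fails.
function open_gmail(string $user, string $password, string $folder = 'INBOX', bool $validateCert = true)
{
    $mbox = @imap_open(gmail_mailbox($folder, $validateCert), $user, $password);
    if ($mbox === false) {
        // imap_last_error() returns the most recent IMAP error message, if any.
        throw new RuntimeException('IMAP connection failed: ' . imap_last_error());
    }
    return $mbox;
}

// Example (placeholder credentials, needs network access):
// $mbox = open_gmail('user_id', 'password', '[Gmail]/All Mail', false);
// echo imap_num_msg($mbox) . " messages\n";
// imap_close($mbox);
```

Trying it first with certificate validation on, and only falling back to /novalidate-cert if that fails, is probably the safer order.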
The biggest challenge was dealing with multipart MIME messages. In the end it turned out to be fairly simple but the original code looks like it was written on crack, so I had to spend a lot of time fixing it.
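To give a feel for the multipart handling, here is a minimal sketch of the general idea. It is not the code from my class: flatten_parts and decode_part are hypothetical names. It walks the object that imap_fetchstructure() returns, works out the section number of each leaf part, and decodes a body according to its encoding code. It assumes plain nested multiparts and ignores the special numbering of embedded message/rfc822 parts.

```php
<?php
// Hypothetical helper: flattens an imap_fetchstructure() result into a list of
// leaf parts, each with its IMAP section number ("1", "2", "2.1", ...).
function flatten_parts(object $structure, string $prefix = ''): array
{
    // No sub-parts: this is a leaf. A non-multipart message keeps its body in section "1".
    if (empty($structure->parts)) {
        return [['section' => $prefix === '' ? '1' : $prefix, 'part' => $structure]];
    }
    $leaves = [];
    foreach ($structure->parts as $i => $part) {
        // Sections are 1-based and nest with dots.
        $section = ($prefix === '' ? '' : $prefix . '.') . ($i + 1);
        $leaves = array_merge($leaves, flatten_parts($part, $section));
    }
    return $leaves;
}

// Decodes a raw body according to the part's "encoding" property.
// 3 and 4 are the codes for base64 and quoted-printable (ENCBASE64, ENCQUOTEDPRINTABLE).
function decode_part(string $raw, int $encoding): string
{
    switch ($encoding) {
        case 3:
            return base64_decode($raw);
        case 4:
            return quoted_printable_decode($raw);
        default:
            return $raw; // 7bit, 8bit, binary: use as-is
    }
}

// Example (needs an open $mbox and a message number $n):
// foreach (flatten_parts(imap_fetchstructure($mbox, $n)) as $leaf) {
//     $raw  = imap_fetchbody($mbox, $n, $leaf['section']);
//     $body = decode_part($raw, $leaf['part']->encoding ?? 0);
// }
```

Returning a flat list keeps the calling loop simple: every entry is something you can fetch and either store as text or write out as an attachment.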
Efficiency: Although I’m sure I could optimize the hell out of it, right now it averages about 0.3 seconds per message for non-multipart messages and anywhere from 1 to 2 seconds or more for multipart messages (which usually just means it’s extracting an embedded attachment like an image).
Hi Ted,
Thanks for posting this and making sense of the emailtodb class. I seem to be having trouble with attachments. I get this warning and thought you might point me in the right direction:
-----
Warning: mkdir() [function.mkdir]: No such file or directory in /home/.lacerations/userme/domain.com/emailtodb/class.emailtodb_tedb0t.php on line 665
*Multipart* making path
-----
Thanks!
You need to make a /files/email directory in the emailtodb directory, if you haven’t done that, and it should be writable and readable by the http server. Probably easiest to just make it 777.
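If you'd rather have PHP create the directory for you, a sketch like this might do it. ensure_dir is a hypothetical helper, not something in the class, and the 0775 mode is just an assumption; loosen it if your server needs that.

```php
<?php
// Hypothetical helper: creates a directory and any missing parent directories.
function ensure_dir(string $path, int $mode = 0775): bool
{
    if (is_dir($path)) {
        return true;
    }
    // Third argument true = recursive. The mode is still filtered by the process umask.
    return mkdir($path, $mode, true) || is_dir($path);
}

// Example, assuming the script sits in the emailtodb directory:
// ensure_dir(__DIR__ . '/files/email');
```

The recursive flag matters here: without it, mkdir() gives the "No such file or directory" warning whenever a parent directory is missing.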
This is a great improvement over the original. One thing I can’t figure out though – how to get the delete to work. Basically I’m just trying to dump the inbox into a db and delete the messages. Can’t get it to work using gmail or another imap server. Any ideas?
Hm, not sure, I haven’t tried to do any deleting…
I am new at this and am wondering. Where is this installed? How does this start up besides with the click of a button? I noticed the last entry was 11/2009.
Have you moved beyond this?
Hi, I am using your class but I am having problems with attachments. If there is only one mail, the attachment downloads successfully, but if there is more than one mail, the attachments are not downloaded. Do you know what could be wrong?
Thanks in advance.