Scraping Gmail IMAP Messages to MySQL Database
For my ITP Student List visualization endeavors, I discovered a Gmail account, started by someone in 2006, that is subscribed to the list and receives all its messages. To my knowledge it is the only permanent archive of the list, so I wanted to offload all those messages to a database and build a handy API for future visualizations.
It’s taken a week or two of solid hacking and research. First I heard about a Python module for this, but it turned out to be really old. Then I turned to Perl and did get connected via Mail::IMAPClient (which I recommend). Finally I discovered a PHP class called emailtodb.
This class turned out to be really busted; unless I’m missing some crazy PHP principles, it was bizarrely illogical. I have heavily modified it and it’s a lot faster. Download my zip of the class and example usage. I haven’t really documented it, but leave a comment if you are confused about something.
So! Here’s how you connect to Gmail IMAP with PHP:
// Returns an IMAP stream on success, or false on failure.
$mbox = imap_open("{imap.gmail.com:993/imap/ssl/novalidate-cert}[Gmail]/All Mail", "user_id", "password");
Here I’m connecting to the “All Mail” mailbox since I wanted to offload ALL the messages on the server. You can also replace that with INBOX, which is the more typical usage. I discovered /novalidate-cert by trial and error. It tells PHP not to validate the server’s SSL certificate, so you should only need it when that validation fails on your machine (a self-signed certificate is the usual case); if the connection works without it, leave it off.
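To make the pieces of that mailbox string easier to see, here is a minimal sketch that assembles it from its parts and checks the result of imap_open(). The helper names gmail_mailbox() and open_mailbox() are made up for this illustration and are not part of emailtodb; the sketch assumes PHP 7 or later with the IMAP extension installed.

```php
<?php
// Hypothetical helper: builds the mailbox string that imap_open() expects.
// Format: {host:port/flags}folder
function gmail_mailbox(string $folder = 'INBOX', bool $validateCert = true): string
{
    // /novalidate-cert is appended only when certificate validation is turned off.
    $flags = '/imap/ssl' . ($validateCert ? '' : '/novalidate-cert');
    return '{imap.gmail.com:993' . $flags . '}' . $folder;
}

// Hypothetical helper: opens the mailbox and turns a failed connection
// into an exception instead of a silent false.
function open_mailbox(string $mailbox, string $user, string $password)
{
    if (!function_exists('imap_open')) {
        throw new RuntimeException('The PHP IMAP extension is not loaded.');
    }
    // Suppress the PHP warning; the return value is checked instead.
    $mbox = @imap_open($mailbox, $user, $password);
    if ($mbox === false) {
        throw new RuntimeException('IMAP connection failed: ' . imap_last_error());
    }
    return $mbox;
}
```

Calling open_mailbox(gmail_mailbox('[Gmail]/All Mail', false), 'user_id', 'password') should behave like the one-liner above, except that a bad login or an unreachable server raises an exception you can catch.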
The biggest challenge was dealing with multipart MIME messages. In the end it turned out to be fairly simple but the original code looks like it was written on crack, so I had to spend a lot of time fixing it.
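For anyone wondering what “dealing with multipart” involves, here is a rough sketch of the general idea. It is not the code from my modified class. It walks a structure object shaped like the one imap_fetchstructure() returns, lists each leaf part with the section number that imap_fetchbody() takes, and decodes a part body based on its encoding code. The function names list_parts() and decode_part() are invented for this example, and the sketch assumes PHP 7.2 or later and ignores forwarded messages nested as message/rfc822, which number their sections a little differently.

```php
<?php
// Hypothetical helper: flattens a MIME structure into its leaf parts.
// Top-level parts are numbered "1", "2", ...; nested parts "1.1", "1.2", ...
function list_parts(object $structure, string $prefix = ''): array
{
    // No sub-parts: this is a leaf. A message that is not multipart
    // at all has a single body, which is section "1".
    if (empty($structure->parts)) {
        return [[
            'section'  => $prefix === '' ? '1' : $prefix,
            'subtype'  => strtoupper($structure->subtype ?? ''),
            'encoding' => $structure->encoding ?? 0,
        ]];
    }

    $leaves = [];
    foreach ($structure->parts as $i => $part) {
        // The parts array is 0-based, section numbers are 1-based.
        $section = ($prefix === '' ? '' : $prefix . '.') . ($i + 1);
        $leaves = array_merge($leaves, list_parts($part, $section));
    }
    return $leaves;
}

// Hypothetical helper: decodes a raw part body using the encoding code
// from the structure (3 = base64, 4 = quoted-printable).
function decode_part(string $raw, int $encoding): string
{
    switch ($encoding) {
        case 3:
            return (string) base64_decode($raw);
        case 4:
            return quoted_printable_decode($raw);
        default:
            // 7bit, 8bit, binary and "other" are passed through untouched.
            return $raw;
    }
}
```

With a live connection you would feed it the real thing: call list_parts(imap_fetchstructure($mbox, $msgno)), then for each entry pass imap_fetchbody($mbox, $msgno, $part['section']) through decode_part() along with $part['encoding'].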
Efficiency: Although I’m sure I could optimize the hell out of it, right now it averages about 0.3 seconds per non-multipart message and anywhere from 1 to 2 seconds or more per multipart message (which usually just means it’s extracting an embedded attachment like an image).