I’ve been self-hosting email for at least three years now, and since the start I’ve wanted one particular feature. Many may remember when, years ago, Google announced that Gmail would start prefetching images and serving them from its own servers. The feature was announced as privacy protecting, but in classic Google fashion it really just meant they wanted to increase the value of your data by preventing others from also tracking it. Google’s awful privacy track record aside, it’s still a pretty good idea. I wanted that feature. I wanted Dovecot to automagically fetch remote images to a local server and rewrite the URLs. That method had downsides, though: running a server, plus a small risk of someone randomly guessing image URLs. What I ended up settling on was, instead of serving the images from my own server, simply inlining them with base64 encoding, attached as extra parts of a multipart message.

Sieves

I’ve checked out Sieve a couple of times throughout the years, particularly while trying to solve this problem, but couldn’t get it to work properly. I believe this was due to a misconfiguration in my mail server setup, which seems to have corrected itself during an update or rebuild.

Sieve scripts are pretty neat and allow you to do all sorts of customizations, and I am excited to start using them more. For this particular problem, though, the bulk is written in Python, with only a ‘1 liner’ Sieve script (plus its require line) to invoke it.


require ["vnd.dovecot.filter"];
filter "image_stripper.py";

NOTE: I personally found the naming and documentation for the Dovecot extprograms confusing. I would have expected filter to be purely about accepting or rejecting messages, and pipe to be the one that alters them. I never did quite figure out what pipe is for, but whatever a filter call writes to standard output replaces the contents of the email. This includes all of the email headers.
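To make that contract concrete, here is a minimal sketch of a pass-through filter. It is not the script used in this post. It assumes the behaviour described above: the program receives the complete message on standard input, and whatever it writes to standard output becomes the new message. The names tag_message and main and the X-Filtered header are made up for illustration.

```python
import sys
from email import message_from_string
from email.policy import default


def tag_message(raw: str) -> str:
    """Return the complete message, headers included, with one header added."""
    msg = message_from_string(raw, policy=default)
    msg['X-Filtered'] = 'yes'  # hypothetical header, for illustration only
    return msg.as_string()


def main() -> None:
    # Everything written to stdout replaces the email, so the headers must be
    # written back too, and debug output has to go elsewhere (stderr, a log).
    sys.stdout.write(tag_message(sys.stdin.read()))
```

Calling main() at the bottom of such a script would turn it into a filter program. The point to notice is that the function returns the whole message rather than just a modified body.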

#!/usr/bin/python3
import base64
import hashlib
import sys
import urllib.request
from email.message import MIMEPart
from email.parser import Parser
from email.policy import default
from email.utils import parseaddr
from html import escape
from html.parser import HTMLParser


class HTMLImageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Per-instance store: sha256 hex digest -> (src, image bytes, content type)
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'img': return
        src = dict(attrs).get('src')
        # Only remote images are fetched; cid:, data: and relative URLs are left alone
        if not src or not src.startswith('http'): return

        # The timeout keeps a slow image host from stalling mail delivery
        with urllib.request.urlopen(src, timeout=10) as resp:
            img = resp.read()
            hashid = hashlib.sha256(img).hexdigest()

            self.images[hashid] = (src, img, resp.getheader('Content-Type'), )


def main():
    # Read the whole message first so the fallback can return it unchanged
    raw = sys.stdin.read()

    try:
        msg = Parser(policy=default).parsestr(raw)
        body = msg.get_body()
        # parseaddr handles both "Name <user@host>" and a bare "user@host"
        from_domain = parseaddr(str(msg.get('From', '')))[1].rpartition('@')[2]

        # get_content() returns the decoded text, whatever the transfer encoding
        body_str = body.get_content()
        html_parser = HTMLImageParser()
        html_parser.feed(body_str)

        if not html_parser.images:
            sys.stdout.write(raw)  # nothing to inline, leave the message alone
            return

        # Drop only the Content-* headers and the payload; clear() would also
        # wipe From, To and Subject when the body is the top-level message
        body.clear_content()

        parts = []
        for number, h in enumerate(html_parser.images, start=1):
            src, img, ct = html_parser.images[h]
            filename = src.split('/')[-1] or 'image'
            # Log to stderr: anything written to stdout becomes part of the email
            print(ct, src, file=sys.stderr)

            cid = f'part{number}.{h[:32]}@{from_domain}'

            # The parser unescapes attribute values, so match the &amp; form too
            body_str = body_str.replace(escape(src, quote=False), f'cid:{cid}')
            body_str = body_str.replace(src, f'cid:{cid}')

            part = MIMEPart()
            part.add_header('Content-Type', ct or 'application/octet-stream', name=filename)
            part.add_header('Content-Disposition', 'inline', filename=filename)
            part.add_header('Content-Id', f'<{cid}>')
            part.add_header('Content-Transfer-Encoding', 'base64')
            part.add_header('X-Stripper', '1')
            # encodebytes wraps the base64 text at 76 characters per line
            part.set_payload(base64.encodebytes(img).decode('ascii'))
            parts.append(part)

        # Rewrite the body, then wrap it in multipart/related with the images
        body.set_content(body_str, subtype='html')
        body.make_related()
        for part in parts:
            body.attach(part)

        output = msg.as_string()

    except Exception as e:
        print(f'image_stripper failed: {e!r}', file=sys.stderr)
        sys.stdout.write(raw)  # fallback: hand the original message back
    else:
        sys.stdout.write(output)

if __name__ == "__main__":
    main()
    sys.exit(0)

This uses only built-in libraries, starting with email to parse the incoming message. From the parsed message we need two things: the message body and the domain of the From address. The body gets rewritten at the end, but it also gets cleared out early on. The body string is passed to the built-in HTML parser library, which finds all the image tags; if a tag’s src starts with http, the image is fetched with urllib.request. When the HTML parsing is done, we’re left with a collection of hashes, each mapped to an image URL, the image itself and its content type. The main loop then processes these, creating a new MIMEPart for each image in base64 encoding, with a Content-ID matching the cid: URL that replaces the image’s original URL in the body. That’s most of the work.

I’ve been running this off and on for a week now, and for the most part it seems to be working. I have seen a particular set of glitchy messages, but as they’re Craigslist saved searches… yeah, that could be anything. More testing to come.
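For comparison, the standard library can do the multipart/related wrapping itself through EmailMessage.add_related. What follows is a rough sketch of that alternative, not the script above. It assumes the images have already been fetched and are passed in as a dict; inline_images and its parameters are hypothetical names, and make_msgid is used here only as a convenient way to generate unique Content-IDs.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def inline_images(msg: EmailMessage, images: dict[str, tuple[bytes, str]],
                  domain: str) -> None:
    """Swap each remote URL in the HTML body for a cid: URL and attach the image.

    `images` maps a source URL to (image bytes, image subtype such as 'png').
    """
    body = msg.get_body(preferencelist=('html',))
    html = body.get_content()

    # First pass: pick a Content-ID per image and rewrite the URLs in the HTML
    cids = {}
    for src in images:
        cids[src] = make_msgid(domain=domain)            # looks like <...@domain>
        html = html.replace(src, f'cid:{cids[src][1:-1]}')  # cid: URLs omit the <>

    # Second pass: store the rewritten HTML, then attach the images. The first
    # add_related call converts the body into multipart/related in place.
    body.set_content(html, subtype='html')
    for src, (img, subtype) in images.items():
        body.add_related(img, maintype='image', subtype=subtype, cid=cids[src])
```

A harness like this is also handy for testing without Dovecot in the loop: build or parse a sample message, call the function with a fake image, and inspect the resulting parts with msg.iter_parts().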