Part 1: Even maildirs need their personal space
I migrated my mail server over the weekend and ran into a strange error when I brought things back online. My system was receiving mail fine, but deliveries were being deferred in the postfix queue with the message:
Command output: /usr/bin/maildrop: Unable to open mailbox.
This was very odd, because the bulk of the files copied over to my new mail server VM were in fact mailboxes; I had double-verified the copy before I shut down the old VM. Yet all my mails were being queued due to this error.
It was late at night when I finished the VM copy and everything else was working fine, so I left it until morning. When I started to work on it again the next day, I noticed something odd: not all of the users on the system were being affected - my wife's email and my dedicated account for my phone were working fine. So it was time to put on my Mark Watney pants and start digging into the technical detail.
A lot of the hits I found when doing my initial web searches pointed to various permissions problems on the mailboxes themselves, which I had already ruled out. One post even suggested:
Then the mailbox does not exist or has the wrong permissions. The simplest solution is to delete the mailbox and create it again.
Um, not gonna happen; some of my mail folders have gigabytes of emails. My interest in digging through forums with advice like that quickly faded. At this point, I spent several hours reading over the postfix maildrop howto to confirm my setup was correct, and fiddling with chroot and suid settings in /etc/postfix/main.cf and /etc/postfix/master.cf, even though I knew they had been working prior to the move. I ended up reverting basically all of the changes I made during that period.
As the troubleshooting progressed, I noticed another strange data point: some of my emails were being delivered. I couldn't see any obvious pattern with which ones were failing and which were succeeding, but I knew there had to be one.
It was time to pull out a bigger gun: strace. For those not familiar with it, strace shows all of the system calls (entrypoints into the kernel) that a process makes as it runs. I waited until the mail server wasn't receiving anything, then ran strace on the running postfix master process:
strace -f -p $(pidof master) |& tee postfix-flush.log
Because there were quite a few mails backed up, I ended up with output from a number of different child processes interspersed, but eventually I came upon my smoking gun:
[pid 21070] stat("/home/spam/Maildir/tmp", 0x7fff88ea8280) = -1 ENOENT (No such file or directory)
[pid 21070] stat("/home/spam/Maildir", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
The maildrop process was looking for a directory called tmp inside the user's Maildir, and it wasn't there. And then I remembered: when I copied the files across from my old VM to my new VM, I excluded the directory /tmp. But because I was doing this from within the VM's file system mounted from the host, I used relative directory names. So rsync dutifully ignored exactly what I told it to: every file and directory with the name tmp.
A quick check of the old VM's file system confirmed that every user Maildir had a tmp folder, and its absence on the new VM was causing maildrop to consider every Maildir missing. Unfortunately, maildirmake is non-idempotent (it refuses to run if the Maildir already exists), so I couldn't just re-run it on every Maildir. Instead, a short bash script took care of things:
cd /home
find */Maildir -type d -name cur | \
  sed -e 's/\/cur$/\/tmp/' | \
  while read dir; do
    mkdir "$dir"
  done
# mkdir ran as root, so hand the new directories back to their owners
for u in *; do
  chown -R $u $u/Maildir
done
(At this point I did go looking for the maildrop web site to see if I could submit a patch to make the error message less generic, but it is hosted at SourceForge and I couldn't get a git clone to work immediately, so I gave up. I should probably come back to this, but I fear the error is so vague because it is issued at the end of a large block of code which could have multiple reasons for failing, and restructuring 20-year-old C++ code is not a thing that brings me great joy. But I really should come back to it. I've just added it to my personal todo backlog. Honest.)
The astute reader might be asking at this point: if your rsync copy excluded all directories named tmp, why was this bug affecting only some of your users and some of your mailboxes? It turns out that while maildrop refuses to consider the mailbox as even existing if the tmp subdirectory is missing, the dovecot IMAP server knows how to detect this and knows that it is safe to take the simple corrective action of creating the directory. So every mailbox which had been accessed by a user since the migration had an appropriate tmp directory in place.
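For reference, the structure both programs are checking for is the standard Maildir triplet, sketched here under a hypothetical path. Recreating tmp is safe because deliveries are written into tmp and then renamed into new, so an empty tmp carries no state:

```shell
# Every Maildir (and every subfolder) needs these three directories.
# tmp/ holds messages mid-delivery; new/ holds delivered, unread mail;
# cur/ holds mail a client has seen. An empty tmp/ loses nothing.
mkdir -p demo/Maildir/tmp demo/Maildir/new demo/Maildir/cur
chmod 700 demo/Maildir
ls demo/Maildir   # cur  new  tmp
```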
Mail delivery sorted; on to great victory!
Part 2: DevOps' dirty little Docker secret
Little did I know that another equally silly bug in /tmp handling would bite me only a few days later, when I was migrating the last VM away from my Xen VM server. This one was my internal file server, which runs Jellyfin, a community fork of the Emby media server.
Spoiler: DevOps' dirty little Docker secret is that containers are just Linux processes as a service, and whatever is ugly in the Linux you put in is ugly in the result you get out.
But then we would, okay, I’m going to get this application that is in a container from development. Cool. It’s—don’t look inside of it, it’s just going to make you sad, but take these containers and put them into production and you can manage them regardless of what that application is actually doing. It felt like it wasn’t so much breaking down a wall, as it was giving a mechanism to hurl things over that wall. Is that just because I worked in terrible places with bad culture? If so, I don’t know that I’m very alone in that, but that’s what it felt like.
—Corey Quinn (emphasis added)
I run Jellyfin in a Docker container on my file server. This container gets read-only access to my actual media files so that I know it's not going to modify or delete anything, and Jellyfin handles any necessary media conversion in its writable cache.
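For the curious, the read-only guarantee is just a bind-mount flag on the container. A sketch of that kind of invocation, with hypothetical host paths rather than my actual setup:

```shell
# Media is mounted read-only (:ro); config and cache stay writable.
docker run -d --name jellyfin \
  -p 8096:8096 \
  -v /srv/media:/media:ro \
  -v /srv/jellyfin/config:/config \
  -v /srv/jellyfin/cache:/cache \
  jellyfin/jellyfin
```

Inside the container, any write attempt under /media then fails with EROFS, regardless of how Jellyfin is behaving.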
This worked well for me until I pulled the latest Docker image down and found that it was perpetually stopping and restarting, with the rather perplexing error message:
Failed to create CoreCLR, HRESULT: 0x80004005
I found a Jellyfin bug report explaining exactly this behaviour, but the resolution was a little unsatisfying:
This seemed to be an issue with the container, not Jellyfin itself. Closing. Thanks for your insight...
The most recent comment suggested reverting to a previous version, which is not a viable long-term solution. Digging into the container itself, I found that I could start it with bash as my entrypoint:
docker run --rm -ti --entrypoint /bin/bash jellyfin/jellyfin
But running Jellyfin itself kept giving the same error. Running apt update so that I could add a couple of helpful packages gave the first clue:
root@8f5667a9fd17:/# apt update
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Err:1 http://deb.debian.org/debian bullseye InRelease
  Couldn't create temporary file /tmp/apt.conf.1w476m for passing config to apt-key
...
That didn't seem right, and sure enough, /tmp was just plain missing:
root@8f5667a9fd17:/# ls -la /tmp
ls: cannot access '/tmp': No such file or directory
I created /tmp and Jellyfin kicked into life. So then it was just a matter of creating my own Dockerfile:
FROM jellyfin/jellyfin:latest
RUN mkdir /tmp; chmod 1777 /tmp
And building that for my Docker start scripts to use:
docker build -t jellyfin:local .
I'll update that bug shortly with an explanation and link here.