A long-standing AppArmor bug, lp:1597017, can cause a whole LXC container, or some systemd services inside it, to fail because of restrictions on allowed mount paths and options. It mostly affects non-root containers, since containers started by root may use the lxc.apparmor.profile = generated configuration. As of the end of 2023 the AppArmor bug has been fixed in upstream releases, but the fixes have not been backported to the apparmor and lxc packages in the bookworm release.

On the host, the symptoms are "failed flags match" messages appearing in the system logs (journalctl output) when a container is being started:

AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-with-nesting" name="/run/systemd/unit-root/proc/" pid=10210 comm="(d-logind)" flags="rw, nosuid, nodev, noexec, remount, bind"
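
To spot such denials in a large journal dump, the record can be split into its key/value fields. The following is a minimal sketch, not part of any LXC or AppArmor tooling; the function names parse_avc and is_mount_flags_denial are hypothetical, and the sample record is the one quoted above.

```python
import re

# Matches key="quoted value" and key=bare_value pairs in an audit record.
_PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_avc(line):
    """Return the key/value fields of an AppArmor audit line as a dict."""
    fields = {}
    for key, quoted, bare in _PAIR.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


def is_mount_flags_denial(line):
    """True for a denied mount whose info field is 'failed flags match'."""
    f = parse_avc(line)
    return (
        f.get("apparmor") == "DENIED"
        and f.get("operation") == "mount"
        and f.get("info") == "failed flags match"
    )


# Sample record copied from the host log shown above.
sample = (
    'AVC apparmor="DENIED" operation="mount" info="failed flags match" '
    'error=-13 profile="lxc-container-default-with-nesting" '
    'name="/run/systemd/unit-root/proc/" pid=10210 comm="(d-logind)" '
    'flags="rw, nosuid, nodev, noexec, remount, bind"'
)
```

The profile, name and flags fields of the parsed record show which AppArmor profile rejected which mount, which helps when deciding between the workarounds described on this page.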

Inside containers, some systemd services fail, the network is not configured, and login attempts may be delayed:

(d-logind)[135]: systemd-logind.service: Failed to set up mount namespacing: Permission denied
(d-logind)[135]: systemd-logind.service: Failed at step NAMESPACE spawning /lib/systemd/systemd-logind: Permission denied
systemd[1]: systemd-logind.service: Main process exited, code=exited, status=226/NAMESPACE
systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
systemd[1]: Failed to start systemd-logind.service - User Login Management.
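
A quick way to list the affected units is to search the container journal for the 226/NAMESPACE exit status. This is a rough sketch that assumes the message format shown above; the helper name units_failed_namespacing is hypothetical.

```python
import re

# Message format taken from the journal excerpt above (an assumption:
# other systemd versions may word the message differently).
_NAMESPACE_FAIL = re.compile(
    r"systemd\[1\]: (?P<unit>\S+): Main process exited, "
    r"code=exited, status=226/NAMESPACE"
)


def units_failed_namespacing(journal_text):
    """Return the sorted unit names whose main process exited with 226/NAMESPACE."""
    return sorted({m.group("unit") for m in _NAMESPACE_FAIL.finditer(journal_text)})
```

Feeding it the text captured from journalctl inside the container gives the units that need either a different AppArmor profile or sandboxing overrides.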

See also the note on a kernel bug that requires either a permissive AppArmor profile or systemd unit setting overrides, such as the following drop-in /etc/systemd/system/service.d/private-network-no.conf:

[Service]
PrivateNetwork=no

Systemd has a number of sandboxing options that limit which parts of the system are accessible to specific units. This bug affects options that rely on, e.g., read-only bind mounts or hiding some filesystem subtrees.

AppArmor profiles shipped in the lxc package allow only a limited set of mount options and paths. Otherwise privileged containers would be able to remount /proc or /sys with options that allow changing the state of the host system. Some of the rules are commented out even though they are required for systemd; see the lines in /etc/apparmor.d/abstractions/lxc/container-base after

# FIXME: This currently doesn't work due to the apparmor parser treating those as allowing all mounts.

This has been fixed upstream after the lxc-5.0.3 release.

Privileged containers are not recommended by LXC developers due to security issues.

Unprivileged containers are less likely to affect the host system, so the alternatives are to use a less restrictive AppArmor profile or to disable sandboxing settings for systemd units inside containers.

Generated AppArmor profile

Containers run by root may use

lxc.apparmor.profile = generated

In this case the profile has rules with specific mount paths instead of the wildcards that confuse the AppArmor parser. It is not suitable for containers started by regular users, since they do not have enough privileges to load a custom AppArmor profile.

Permissive AppArmor profile

This approach is acceptable for unprivileged containers and risky for privileged ones. AppArmor may be disabled completely:

lxc.apparmor.profile = unconfined

The nesting profile effectively allows any mount due to the parser bug, but other rules are still enforced:

lxc.apparmor.profile = lxc-container-default-with-nesting

Default AppArmor profile breaks systemd >= 253

Since version 253, systemd isolates unit generators, and there is no way to disable this by configuration inside the container. The only alternative to the generated or a permissive AppArmor profile is a fully unprivileged container. An example is an Ubuntu 23.10 (mantic) guest. The early boot error with lxc-container-default-cgns is

Failed to fork off sandboxing environment for executing generators: Protocol error

inside the container and the following one on the host

audit[15446]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=15446 comm="(sd-gens)" flags="rw, rslave"

Disabling systemd sandboxing

If a container has systemd < 253, you may use the lxc-container-default-cgns AppArmor profile and override some systemd options inside the container. E.g. create /etc/systemd/system/service.d/disable-sandboxing.conf

[Service]
ProcSubset=all
ProtectProc=default
ProtectControlGroups=no
ProtectKernelTunables=no
NoNewPrivileges=no
LoadCredential=

ProtectSystem=no
ProtectHome=no
PrivateDevices=no
PrivateTmp=no
ReadWritePaths=
ProtectKernelLogs=no
ProtectKernelModules=no
PrivateMounts=no
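
When several containers need the same override, the drop-in can be written from the host into each container root filesystem. The sketch below is only an illustration: render_dropin and install_dropin are hypothetical helper names, the rootfs argument is whatever directory holds the container filesystem on your host, and the directives are the second block of the drop-in above.

```python
from pathlib import Path

# Directives copied from the second block of the drop-in above.
MOUNT_SANDBOXING_OFF = {
    "ProtectSystem": "no",
    "ProtectHome": "no",
    "PrivateDevices": "no",
    "PrivateTmp": "no",
    "ReadWritePaths": "",
    "ProtectKernelLogs": "no",
    "ProtectKernelModules": "no",
    "PrivateMounts": "no",
}


def render_dropin(directives):
    """Return the text of a [Service] drop-in for the given directives."""
    lines = ["[Service]"]
    lines += [f"{key}={value}" for key, value in directives.items()]
    return "\n".join(lines) + "\n"


def install_dropin(rootfs, directives, name="disable-sandboxing.conf"):
    """Write the drop-in below rootfs and return the path of the new file."""
    target = Path(rootfs) / "etc/systemd/system/service.d" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_dropin(directives))
    return target
```

Inside a running container the same text can simply be saved to the path given above; either way the container (or at least systemd inside it) has to be restarted or reloaded before the override takes effect.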

Generator added by the "download" template

Containers created using the "download" template have the /etc/systemd/system-generators/lxc generator, which creates systemd configuration drop-ins. Its effect depends on the runtime: the LXC container configuration (files included from /usr/share/lxc/config), the Linux distribution, and whether the container is privileged. As a result, containers with similar systemd versions and similar overrides may behave differently. You may still need to add more of the directives listed in the previous section; e.g. only the second block is required for Debian 12 bookworm. This generator may create the /run/systemd/system/systemd-resolved.service.d/zzz-lxc-ropath.conf and /run/systemd/system/systemd-networkd.service.d/zzz-lxc-ropath.conf files. You may need to override them with

[Service]
BindReadOnlyPaths=

Ubuntu systemd patches

Ubuntu systemd packages are built with a patch that allows errors to be ignored during attempts to create an isolated environment for a unit. As a result, no overrides are required for mount-related sandboxing up to Ubuntu 23.10 mantic. See the note on systemd 253 above for why this is not enough for the latest releases.

Fully unprivileged container

Systemd has a notion of a fully unprivileged container, in which some features are disabled. Runtime environments like Docker are detected by dropped capabilities, read-only /sys, etc. This kind of runtime is not recommended by systemd developers. The lack of the CAP_SYS_ADMIN capability may break some applications inside containers.

Do not include userns.conf and nesting.conf. Avoid lxc.mount.auto = sys:rw. E.g. /usr/share/lxc/config/common.conf sets a suitable sys:mixed. Due to the lack of capabilities, all necessary mounts must be specified in the container configuration.

lxc.include = /usr/share/lxc/config/common.conf
lxc.cap.drop = sys_admin mknod sys_module
lxc.mount.entry = shm dev/shm tmpfs nodev,nosuid,mode=1777,strictatime,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs nodev,noexec,nosuid,mode=755,size=20%,nr_inodes=800k
lxc.mount.entry = mqueue dev/mqueue mqueue defaults,optional,create=dir 0 0
# Debian specific
lxc.mount.entry = tmpfs run/lock tmpfs nodev,noexec,nosuid,mode=1777,size=5242880,create=dir 0 0

Add the usual options such as lxc.idmap, lxc.net, etc. To isolate users inside the container, add tmpfs mounts for run/user/UID, which is normally the responsibility of systemd-logind.
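
The per-user runtime directory mounts (one tmpfs per numeric user ID below run/user) can be generated rather than typed by hand. This is a sketch under stated assumptions: the helper name run_user_entries is hypothetical, and the mount options (mode=700, uid=, gid=, size=10%, create=dir) are guesses modelled on the entries above, with the group ID assumed equal to the user ID. Check them against your container before relying on them.

```python
def run_user_entries(uids, size="10%"):
    """Build one lxc.mount.entry line per UID for a private run/user/UID tmpfs.

    Assumptions (not taken from the LXC documentation): each user has a
    private group with the same numeric ID, and the option string below
    is acceptable for the container in question.
    """
    entries = []
    for uid in uids:
        opts = f"nodev,nosuid,mode=700,uid={uid},gid={uid},size={size},create=dir"
        entries.append(f"lxc.mount.entry = tmpfs run/user/{uid} tmpfs {opts} 0 0")
    return entries
```

For example, "\n".join(run_user_entries([1000, 1001])) produces two lines that can be appended to the container configuration after the run mount, so that the parent tmpfs already exists when these entries are processed.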

Troubleshooting

See also