A long-standing AppArmor bug, lp:1597017, may cause a whole LXC container, or some systemd services inside it, to fail because of restrictions on allowed mount paths and options. It mostly affects containers started by regular users, since containers started by root may use the lxc.apparmor.profile = generated configuration. As of the end of 2023 the AppArmor bug has been fixed in upstream releases, but the fixes have not been backported to the apparmor and lxc packages in the bookworm release.
On the host, the symptoms are "failed flags match" messages that appear in the system logs (journalctl output) during attempts to start a container:
AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-with-nesting" name="/run/systemd/unit-root/proc/" pid=10210 comm="(d-logind)" flags="rw, nosuid, nodev, noexec, remount, bind"
Inside containers, some systemd services fail, the network is not configured, and login attempts may be delayed.
(d-logind)[135]: systemd-logind.service: Failed to set up mount namespacing: Permission denied
(d-logind)[135]: systemd-logind.service: Failed at step NAMESPACE spawning /lib/systemd/systemd-logind: Permission denied
systemd[1]: systemd-logind.service: Main process exited, code=exited, status=226/NAMESPACE
systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
systemd[1]: Failed to start systemd-logind.service - User Login Management.
See also the note on a kernel bug that requires either a permissive AppArmor profile or systemd unit setting overrides, such as the following drop-in, /etc/systemd/system/service.d/private-network-no.conf:
[Service]
PrivateNetwork=no
Systemd has a number of sandboxing options that limit which parts of the system are accessible to specific units. This bug affects options that rely on, for example, read-only bind mounts or hiding some filesystem subtrees.
AppArmor profiles shipped in the lxc package allow only a limited set of mount options and paths. Otherwise privileged containers would be able to remount /proc or /sys with options that allow changing the state of the host system. Some of the rules are commented out even though systemd requires them; see the lines in /etc/apparmor.d/abstractions/lxc/container-base after
# FIXME: This currently doesn't work due to the apparmor parser treating those as allowing all mounts.
This was fixed upstream after the lxc-5.0.3 release.
Privileged containers are not recommended by LXC developers due to security issues.
Unprivileged containers are less likely to affect the host system, so the alternatives are using a less restrictive AppArmor profile or disabling sandboxing settings for systemd units inside containers.
Generated AppArmor profile
Containers started by root may use
lxc.apparmor.profile = generated
In this case the profile has rules with specific mount paths instead of the wildcards that confuse the AppArmor parser. It is not suitable for containers started by regular users, since they do not have enough privileges to load a custom AppArmor profile.
Permissive AppArmor profile
This approach is acceptable for unprivileged containers and risky for privileged ones. AppArmor may be disabled completely:
lxc.apparmor.profile = unconfined
The nesting profile effectively allows any mount due to the parser bug, but other rules are still enforced:
lxc.apparmor.profile = lxc-container-default-with-nesting
Default AppArmor profile breaks systemd >= 253
Since version 253, systemd runs unit generators in an isolated environment, and there is no way to disable this by configuration inside the container. The only alternative to the generated profile or a permissive AppArmor profile is a fully unprivileged container. An example is an Ubuntu-23.10 mantic guest. The early boot error with lxc-container-default-cgns is
Failed to fork off sandboxing environment for executing generators: Protocol error
inside the container, and the following one on the host:
audit[15446]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=15446 comm="(sd-gens)" flags="rw, rslave"
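Which case applies depends on the systemd major version in the guest. The following is a minimal shell sketch, assuming the first line of `systemctl --version` has the usual `systemd NNN (...)` form; both function names are hypothetical.

```shell
# Hypothetical helpers, not part of LXC or systemd.

# Print the major version from `systemctl --version` output read on stdin.
systemd_major_version() {
    sed -n -E '1s/^systemd ([0-9]+).*/\1/p'
}

# Succeed (exit status 0) when the given major version isolates unit
# generators, i.e. it is 253 or later.
generators_are_sandboxed() {
    [ "$1" -ge 253 ]
}

# Possible use inside a guest that is still reachable:
#   v=$(systemctl --version | systemd_major_version)
#   generators_are_sandboxed "$v" && echo "default profile will not work"
```

The helpers read text from stdin or arguments, so they can also be tried on saved output from a guest that no longer boots.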
Disabling systemd sandboxing
If a container has systemd < 253, you may use the lxc-container-default-cgns AppArmor profile and override some systemd options inside the container. For example, create /etc/systemd/system/service.d/disable-sandboxing.conf:
[Service]
ProcSubset=all
ProtectProc=default
ProtectControlGroups=no
ProtectKernelTunables=no
NoNewPrivileges=no
LoadCredential=
ProtectSystem=no
ProtectHome=no
PrivateDevices=no
PrivateTmp=no
ReadWritePaths=
ProtectKernelLogs=no
ProtectKernelModules=no
PrivateMounts=no
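After adding such a drop-in and running systemctl daemon-reload, it can help to check which sandboxing properties a unit still has enabled. The sketch below assumes that `systemctl show UNIT` prints one NAME=VALUE line per property; the function name and the choice of property prefixes are illustrative, not an official list.

```shell
# Hypothetical helper. Read `systemctl show` output on stdin and print the
# properties whose names start with Proc, Protect, Private or
# NoNewPrivileges and whose value is not one of the "off" values used in
# the drop-in above (no, default, all, or empty).
active_sandboxing() {
    grep -E '^(Proc|Protect|Private|NoNewPrivileges)' \
        | grep -E -v '=(no|default|all)?$'
}

# Possible use inside the container:
#   systemctl show systemd-logind.service | active_sandboxing
```

An empty result suggests that the overrides took effect for that unit; any line that is printed names a setting that may still need an override.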
Generator added by the "download" template
Containers created using the "download" template have the /etc/systemd/system-generators/lxc generator, which creates systemd configuration drop-ins. Its effect depends on the runtime environment: the LXC container configuration (files included from /usr/share/lxc/config), the Linux distribution, and whether the container is privileged. As a result, containers with similar systemd versions and similar overrides may behave differently. You may still need to add more of the directives specified in the previous section, e.g. only the second block is required for Debian 12 bookworm. This generator may create the /run/systemd/system/systemd-resolved.service.d/zzz-lxc-ropath.conf and /run/systemd/system/systemd-networkd.service.d/zzz-lxc-ropath.conf files. You may need to override them with
[Service]
BindReadOnlyPaths=
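Where the override file is placed matters, because drop-ins are applied in file name order and a file in /etc replaces a same-named file in /run. The sketch below relies on that precedence rule and gives the override the same name as the generated file; the function name is hypothetical, and the root directory argument exists only so that the helper can be tried on a scratch tree.

```shell
# Hypothetical helper. $1 is the root directory: "/" inside the container.
# For every zzz-lxc-ropath.conf found under run/systemd/system, write a
# same-named drop-in under etc/systemd/system that resets BindReadOnlyPaths=.
reset_lxc_ropath() {
    root="${1%/}"
    src="$root/run/systemd/system"
    for f in "$src"/*.d/zzz-lxc-ropath.conf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        unit_dir=$(basename "$(dirname "$f")")
        dst="$root/etc/systemd/system/$unit_dir"
        mkdir -p "$dst" || return 1
        printf '[Service]\nBindReadOnlyPaths=\n' > "$dst/zzz-lxc-ropath.conf"
    done
    return 0
}

# Possible use inside the container, as root:
#   reset_lxc_ropath / && systemctl daemon-reload
```

The affected services need a restart, or the container a reboot, before the change is visible.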
Ubuntu systemd patches
Ubuntu systemd packages are built with a patch that makes systemd ignore errors when it fails to create an isolated environment for a unit. As a result, no overrides for mount-related sandboxing are required up to Ubuntu-23.10 mantic. See the note on systemd-253 above for why this is not enough for the latest releases.
Fully unprivileged container
Systemd has a notion of a fully unprivileged container, in which some features are disabled. Runtime environments like Docker are detected by dropped capabilities, a read-only /sys, etc. This kind of runtime is not recommended by systemd developers. Lack of the CAP_SYS_ADMIN capability may break some applications inside containers.
Do not include userns.conf and nesting.conf. Avoid lxc.mount.auto = sys:rw; for example, /usr/share/lxc/config/common.conf sets a suitable sys:mixed. Because of the missing capabilities, all necessary mounts must be specified in the container configuration.
lxc.include = /usr/share/lxc/config/common.conf
lxc.cap.drop = sys_admin mknod sys_module
lxc.mount.entry = shm dev/shm tmpfs nodev,nosuid,mode=1777,strictatime,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs nodev,noexec,nosuid,mode=755,size=20%,nr_inodes=800k
lxc.mount.entry = mqueue dev/mqueue mqueue defaults,optional,create=dir 0 0
# Debian specific
lxc.mount.entry = tmpfs run/lock tmpfs nodev,noexec,nosuid,mode=1777,size=5242880,create=dir 0 0
Add the usual options such as lxc.idmap, lxc.net, etc. To isolate users inside the container, add tmpfs mounts for run/user/UID, which is normally the responsibility of systemd-logind.
Troubleshooting
- Check the container configuration: lxc.apparmor.profile and other options, e.g.
lxc-info -n CONTAINER -c lxc.mount.auto
usually should have sys:ro or sys:mixed to disable udevd.
- Inspect host logs (journalctl -b for the current boot) for AppArmor "denied" messages
- If the container does not start, try to run it in the foreground
lxc-unpriv-start -n CONTAINER -F
- In some cases verbose systemd logs might help
lxc-unpriv-start -n CONTAINER -F -s lxc.init.cmd='/sbin/init --log-level=debug'
- If the container can start, attach to it
lxc-unpriv-attach -n CONTAINER --clear-env --keep-var TERM
and inspect its state
systemctl --failed
and logs for failed units
journalctl -b -u systemd-hostnamed
- To read logs while the container is stopped, specify the journal directory and ensure that the user has enough permissions to read the files
lxc-usernsexec -- journalctl -D ~/.local/share/lxc/CONTAINER/rootfs/var/log/journal/
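The first two checks in the list above can be partly scripted. The following is a rough sketch with hypothetical helper names. It assumes that the denial messages carry the same fields as the examples on this page, and that the lxc.mount.auto value is a space-separated list of tokens.

```shell
# Hypothetical helpers. Both read text on stdin, so they work on saved
# output as well as on live commands.

# Keep AppArmor mount denials caused by the flags check and print the
# profile, mount point and flags of each one.
apparmor_mount_denials() {
    grep -F 'apparmor="DENIED"' \
        | grep -F 'operation="mount"' \
        | grep -F 'info="failed flags match"' \
        | sed -n -E 's/.* profile="([^"]*)".* name="([^"]*)".* flags="([^"]*)".*/profile=\1 name=\2 flags=\3/p'
}

# Print the sys or sys:* token from "lxc.mount.auto = ..." text, if any.
sys_mount_mode() {
    tr ' \t' '\n\n' | grep -E '^sys(:|$)'
}

# Possible use on the host:
#   journalctl -b --no-pager | apparmor_mount_denials
#   lxc-info -n CONTAINER -c lxc.mount.auto | sys_mount_mode
```

Reading the system journal may require root or membership in a group that is allowed to read it.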
See also
lp:1597017 AppArmor bug "mount rules grant excessive permissions" fixed in 3.1.6, 3.0.12, 2.13.9 (2023-06-21). Also known as CVE-2016-1585 and 929990
https://github.com/lxc/lxc/issues/4280 and https://github.com/lxc/lxc/pull/4295 are the LXC bug and the pull request that allows the required mount options. Unsafe if the AppArmor bug is not fixed. The fix is not included in lxc-5.0.3. See the pull request description or the commit message for details.
https://systemd.io/CONTAINER_INTERFACE/#fully-unprivileged-container-payload "Fully Unprivileged Container Payload" in The Container Interface, a systemd document. Note the "What You Shouldn’t Do" section of this document.
https://blog.iwakd.de/lxc-cap_sys_admin-jessie Christian Seiler. LXC containers without CAP_SYS_ADMIN under Debian Jessie. 2015
