FreeBSD: Maintaining sysutils/slurm-wlm [Part 2: Cgroups and FreeBSD?!]

The push in the right direction

In my last post, I described how I fixed a bug in the slurm-wlm port on FreeBSD, in which slurm tried to use the wrong socket address length (bug #288593). In that bug report, a committer pointed out to me that the slurm port was not creating /var/spool/slurmctld at install time and that a patch to fix this would be welcome. I also remembered that I had created the directory manually when first trying to start slurmctld, because it complained about the directory not existing.

So I promptly got to work on a patch for this issue. Luckily, the solution was obvious: use the Makefile and pkg-plist to create the directory during installation and set the correct permissions. This time I will not show the diff in this post, since the change is so minor that just explaining it should be enough.

I added one new line to pkg-plist to create the directory and set the correct permissions:

@dir(%%USERS%%,%%GROUPS%%,700) /var/spool/slurmctld

A very simple change that only has to be slotted in alphabetically among the other directories in pkg-plist. It creates /var/spool/slurmctld during installation, sets its owner to the user and group that slurmctld will run as, and restricts its mode to 700, so that only the slurm user can access it.

I also modified PLIST_SUB in the port’s Makefile to pass the USERS and GROUPS variables to pkg-plist by simply appending:

USERS=${USERS} GROUPS=${GROUPS}

I also added a few post-install lines to create the directory in the stage directory and set the correct permissions:

@${MKDIR} ${STAGEDIR}/var/spool/slurmctld
@${CHOWN} ${USERS}:${GROUPS} ${STAGEDIR}/var/spool/slurmctld
@${CHMOD} 700 ${STAGEDIR}/var/spool/slurmctld

I packed my changes into a patch file and submitted it as bug #288612, which was promptly committed. The committer pointed out to me, though, that running chown during post-install is not allowed, so that line was removed from the patch. I had completely overlooked make complaining about the illegal chown, but was happy that the committer noticed it!

A good salmon swims upstream!

In my previous post I submitted a FreeBSD-specific patch that corrects the address length of the socket slurmctld sets up for communication. The committer who committed the change for me reminded me not to forget to submit it upstream. So, of course, I went ahead and opened a ticket with SchedMD, the developers of slurm, as ticket ID 23388. I quickly noticed a difference in activity between FreeBSD and SchedMD: a FreeBSD ticket is usually answered within 24 hours, while my ticket at SchedMD has not been answered as of the time of writing. This once again illustrated the size of FreeBSD as an open-source project to me. I know this is probably normal and not too important, but I just wanted to highlight the speed at which bug reports at FreeBSD are noticed and answered!

The cgroups surgery

Now to the main part of this post and the next major problem I am facing in getting slurm to run on FreeBSD: slurm is tied to cgroups.

kavex@FreeBSD ~ $ sudo slurmd -Dvvv -f /usr/local/etc/slurm.conf
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:4 Boards:1 Sockets:1 CoresPerSocket:4 ThreadsPerCore:1
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

Cgroups, developed at Google starting in 2006 and merged into the Linux kernel in 2008, is a feature that makes it possible to organize processes into hierarchical groups and limit the resources (CPU, RAM, etc.) available to each group. FreeBSD lacks this feature, but it does have POSIX-compliant process groups (identified by a PGID), which were added to UNIX in the 1980s. Sadly, process groups are not a drop-in replacement for cgroups: cgroups can not only group processes into a context, but also arrange those groups hierarchically and limit the resources assigned to them. Process groups can only group processes together, so that they can all be signaled (for example, killed) with a single kill to the process group ID, but nothing more.
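To make that "one kill for the whole group" property concrete, here is a minimal sketch that is not taken from slurm. The function name kill_group_with_one_signal is made up for this illustration. A child becomes the leader of a new process group and forks a grandchild, and the parent then terminates both with a single killpg(). A pipe is used only so the parent can tell when every group member is really gone.

```c
#define _XOPEN_SOURCE 700 /* expose fork(), setpgid(), killpg() in strict -std modes */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: a child becomes a process group leader and forks a
 * grandchild, then the parent terminates both with a single killpg().
 * Returns 0 if the whole group was terminated, -1 on any error.
 */
int kill_group_with_one_signal(void)
{
    int fds[2];
    char c;

    if (pipe(fds) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        close(fds[0]);
        setpgid(0, 0);             /* new group; its PGID equals the child's PID */
        pid_t grandchild = fork(); /* the grandchild inherits the PGID */
        if (grandchild < 0)
            _exit(1);
        /* The grandchild reports that it exists; on failure it kills its own group. */
        if (grandchild == 0 && write(fds[1], "x", 1) != 1)
            kill(0, SIGKILL);
        for (;;)
            pause();               /* child and grandchild wait to be signaled */
    }

    close(fds[1]);
    setpgid(child, child);         /* same call from the parent closes the fork race */

    int ok = 0;
    /* Wait until the grandchild exists, then signal the group exactly once. */
    if (read(fds[0], &c, 1) == 1 && killpg(child, SIGTERM) == 0) {
        /* EOF (0) means every process holding the write end is gone. */
        ok = (read(fds[0], &c, 1) == 0);
    } else {
        killpg(child, SIGKILL);    /* best-effort cleanup */
    }
    close(fds[0]);

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (ok && WIFSIGNALED(status)) ? 0 : -1;
}
```

Note what is missing compared to cgroups: nothing here limits CPU or memory, and any member could leave the group again by calling setpgid() itself.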

So to fully port slurm to FreeBSD, I will also have to find a replacement for that functionality. But for now, the most important thing is that slurm is able to group all its processes and terminate them together. Before dealing with how to replace cgroups, though, I first needed to surgically remove cgroups from slurm’s code, again via my beloved #if defined(__FreeBSD__) safeguards.
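The guard pattern itself is tiny. The following is only a sketch of the idea with made-up names (platform_init and init_resource_groups are not slurm functions): on FreeBSD the Linux-only step is compiled out and replaced by a log line, everywhere else it runs as before.

```c
#include <stdio.h>

#if !defined(__FreeBSD__)
/* Hypothetical stand-in for a Linux-only setup step such as cgroup init. */
static int init_resource_groups(void)
{
    return 0; /* pretend the setup succeeded */
}
#endif

/* Returns 1 if the setup ran, 0 if it was skipped, -1 if it failed. */
int platform_init(void)
{
#if defined(__FreeBSD__)
    fprintf(stderr, "FreeBSD: skipping resource group setup (unsupported)\n");
    return 0;
#else
    if (init_resource_groups() != 0) {
        fprintf(stderr, "unable to initialize resource groups\n");
        return -1;
    }
    return 1;
#endif
}
```

One detail worth knowing: #ifdef takes a bare macro name, while #if defined(...) accepts the parenthesized form and can be combined with other conditions later on.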

Using grep, I identified three cgroup initialization steps that I had to skip on FreeBSD to avoid running into “Unable to initialize cgroup plugin” errors. The final patch I came up with is the following:

--- /usr/ports/sysutils/slurm-wlm/work/slurm-23.11.7/src/slurmd/slurmd/slurmd.c.orig	2025-08-03 00:53:28.293537000 +0200
+++ /usr/ports/sysutils/slurm-wlm/work/slurm-23.11.7/src/slurmd/slurmd/slurmd.c	2025-08-03 01:13:23.233553000 +0200
@@ -2191,10 +2191,17 @@
 	build_all_frontend_info(true);

 	/*
+	 * cgroups is unsupported on FreeBSD and would prevent slurmd from starting
+	 */
+	#if defined(__FreeBSD__)
+	info("FreeBSD: Skipping cgroup_conf_init() - cgroups unsupported");
+	#else
+	/*
 	 * This needs to happen before _read_config where we will try to read
 	 * cgroup.conf values
 	 */
 	cgroup_conf_init();
+	#endif

 	xcpuinfo_refresh_hwloc(original);

@@ -2214,6 +2221,10 @@
 	 * defaults and command line.
 	 */
 	_read_config();
+
+     	#if defined(__FreeBSD__)
+	info("FreeBSD: Skipping cgroup_g_init() - cgroups unsupported");
+	#else
 	/*
 	 * This needs to happen before _resource_spec_init where we will try to
 	 * attach the slurmd pid to system cgroup, and after _read_config to
@@ -2223,6 +2234,7 @@
 		error("Unable to initialize cgroup plugin");
 		return SLURM_ERROR;
 	}
+	#endif

 #ifndef HAVE_FRONT_END
 	if (!find_node_record(conf->node_name))
@@ -2562,6 +2574,10 @@
  */
 static int _resource_spec_init(void)
 {
+	#if defined(__FreeBSD__)
+	debug("FreeBSD: Skipping system cpuset and memory cgroup setup");
+	return SLURM_SUCCESS;
+	#endif
 	fini_system_cgroup();	/* Prevent memory leak */
 	if (_core_spec_init() != SLURM_SUCCESS)
 		error("Resource spec: core specialization disabled");

I submitted it as bug #288617, but at the time of writing it has not been committed yet. I suspect this is because it removes functionality and I do not yet have anything at hand to bring that functionality back.

Which brings us to the next chapter:

Injecting a task/PGID plugin into slurm’s source code

Now, to give slurm at least minimal control over the processes it launches, I implemented a small task/pgid plugin, which complements the already present proctrack/pgid plugin and uses FreeBSD’s process group implementation.

My plans for the future also include task/rctl and task/jail plugins, to allow for more fine-grained control and to emulate the missing cgroups functionality as closely as possible.
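As a rough idea of what a future task/rctl plugin could build on (nothing below exists in slurm or in the patches from this post): FreeBSD’s rctl(8) framework accepts rules of the form subject:subject-id:resource:action=amount. The sketch assumes a kernel with RACCT/RCTL enabled (kern.racct.enable=1) and root privileges. Both function names are hypothetical, and only the rule formatting is portable. The actual rctl_add_rule() call is compiled on FreeBSD only.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#if defined(__FreeBSD__)
#include <sys/rctl.h>
#endif

/*
 * Hypothetical helper: format a rule such as
 * "process:1234:vmemoryuse:deny=1073741824" into buf.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_memory_rule(char *buf, size_t len, pid_t pid,
                       unsigned long long bytes)
{
    int n = snprintf(buf, len, "process:%ld:vmemoryuse:deny=%llu",
                     (long)pid, bytes);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: cap the virtual memory of one process.
 * Returns 0 on success, -1 on failure or on non-FreeBSD systems.
 */
int limit_process_memory(pid_t pid, unsigned long long bytes)
{
    char rule[128];

    if (format_memory_rule(rule, sizeof(rule), pid, bytes) != 0)
        return -1;
#if defined(__FreeBSD__)
    /* The length passed to rctl_add_rule() includes the terminating NUL. */
    return rctl_add_rule(rule, strlen(rule) + 1, NULL, 0);
#else
    return -1; /* rctl is FreeBSD-only */
#endif
}
```

Unlike a process group, such a rule actually limits a resource, which is the part of cgroups that task/pgid cannot cover.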

But for now, to get immediate functionality, I am focusing on task/pgid. We already have proctrack/pgid, so task/pgid should complement it as much as possible. Reviewing the source code of proctrack/pgid makes it clear that it treats the PGID as the container ID and then uses that container ID for signaling, waiting, etc. It expects one PGID per step, with the step’s tasks grouped as child processes in that PGID, so it provides PGID-based tracking of steps and tasks. For task/pgid, I therefore only need to set up one PGID per step and put the tasks into that PGID, so that proctrack/pgid has something to track. The plugin should of course also cache the PGID so the processes can be killed later, and it has to provide stub implementations of all the functions a slurm task plugin is required to export.
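Stripped of everything slurm-specific, the "one PGID per step" idea can be sketched as a standalone function. This is an illustration under my own assumptions, not the plugin: run_and_cancel_step and MAX_TASKS are made-up names, the "tasks" are idle processes, and the launcher calls setpgid() itself in addition to each task, which is the usual way to avoid racing a freshly forked child.

```c
#define _XOPEN_SOURCE 700 /* expose fork(), setpgid(), killpg() in strict -std modes */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TASKS 16 /* arbitrary limit for this sketch */

/*
 * Hypothetical stand-in for one step: fork `ntasks` idle tasks, put them all
 * into a single process group led by the first task, then cancel the step
 * with one killpg(). Returns the number of tasks killed by SIGTERM, or -1.
 */
int run_and_cancel_step(int ntasks)
{
    pid_t pids[MAX_TASKS];
    pid_t step_pgid = 0; /* 0 = no group yet; setpgid() then creates one */
    int started = 0, killed = 0;

    if (ntasks < 1 || ntasks > MAX_TASKS)
        return -1;

    for (int i = 0; i < ntasks; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            setpgid(0, step_pgid); /* first task becomes leader, later ones join */
            for (;;)
                pause();
        }
        if (step_pgid == 0)
            step_pgid = pid;       /* cache the PGID for the later kill */
        setpgid(pid, step_pgid);   /* repeat in the parent to avoid racing the child */
        pids[started++] = pid;
    }

    if (started == 0)
        return -1;

    /* On a partial launch SIGKILL just cleans up; otherwise cancel politely. */
    killpg(step_pgid, started == ntasks ? SIGTERM : SIGKILL);

    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
            killed++;
    }
    return started == ntasks ? killed : -1;
}
```

The important part is that the PGID is remembered in the process that later sends the signal. A value stored only inside a forked task is not visible to its parent or siblings.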

The final plugin I came up with looks like this:

#include "slurm_xlator.h"
#include "log.h"
#include "xmalloc.h"
#include "task.h"  

#include "src/common/slurm_protocol_api.h"
#include "src/slurmd/slurmstepd/slurmstepd_job.h"

#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

/* Required plugin identifiers (exported) */
__attribute__((visibility("default"))) const char     plugin_name[]    = "PGID task plugin for FreeBSD";
__attribute__((visibility("default"))) const char     plugin_type[]    = "task/pgid";
__attribute__((visibility("default"))) const uint32_t plugin_version   = SLURM_VERSION_NUMBER;

/* Required generic plugin entry points */
int init(void)  { slurm_info("task/pgid: init");  return SLURM_SUCCESS; }
int fini(void)  { slurm_info("task/pgid: fini");  return SLURM_SUCCESS; }

static pid_t job_pgid = -1;

/* ==== Required task_* API (must all be present) ==== */

/* Called when slurmd receives a batch launch request */
int task_p_slurmd_batch_request(batch_job_launch_msg_t *req)
{
    (void)req;
    return SLURM_SUCCESS;
}

/* Called when slurmd receives a general launch request */
int task_p_slurmd_launch_request(launch_tasks_request_msg_t *req,
                                 uint32_t node_id, char **err_msg)
{
    (void)req; (void)node_id; (void)err_msg;
    return SLURM_SUCCESS;
}

int task_p_slurmd_suspend_job(uint32_t job_id)
{
    (void)job_id;
    return SLURM_SUCCESS;
}

int task_p_slurmd_resume_job(uint32_t job_id)
{
    (void)job_id;
    return SLURM_SUCCESS;
}

/* Before setuid to the job user */
int task_p_pre_setuid(stepd_step_rec_t *step)
{
    (void)step;
    return SLURM_SUCCESS;
}

/* Called in privileged context before launch */
int task_p_pre_launch_priv(stepd_step_rec_t *step,
                           uint32_t node_tid, uint32_t global_tid)
{
    (void)step; (void)node_tid; (void)global_tid;
    return SLURM_SUCCESS;
}

int task_p_pre_launch(stepd_step_rec_t *step)
{
    pid_t cur = getpid();

    /* Case A: no PGID recorded yet for this step -> become the group leader */
    if (step->pgid <= 0) {
        if (setpgid(0, 0) < 0) {
            /* If a sibling beat us to it, join that PGID instead */
            if (errno == EACCES || errno == EPERM || errno == EEXIST) {
                /* Someone created a group already; query our pgid and store it */
                pid_t pg = getpgid(0);
                if (pg < 0) {
                    slurm_error("task/pgid: getpgid failed after race: %s", strerror(errno));
                    return SLURM_ERROR;
                }
                step->pgid = pg;
                slurm_debug("task/pgid: joined existing PGID %d (race)", step->pgid);
                return SLURM_SUCCESS;
            }
            slurm_error("task/pgid: setpgid(0,0) failed for leader pid=%d: %s", (int)cur, strerror(errno));
            return SLURM_ERROR;
        }
        step->pgid = getpgid(0);
        if (step->pgid < 0) {
            slurm_error("task/pgid: getpgid failed after creating group: %s", strerror(errno));
            return SLURM_ERROR;
        }
        slurm_debug("task/pgid: created step PGID %d (leader pid=%d)", step->pgid, (int)cur);
        return SLURM_SUCCESS;
    }

    /* Case B: PGID exists -> join it */
    if (setpgid(0, step->pgid) < 0) {
        /* ESRCH: parent/leader not visible yet; tiny retry helps on fast forks */
        if (errno == ESRCH) {
            usleep(1000); /* 1 ms backoff */
            if (setpgid(0, step->pgid) == 0) {
                slurm_debug("task/pgid: joined PGID %d after retry", step->pgid);
                return SLURM_SUCCESS;
            }
        }
        slurm_error("task/pgid: setpgid(0,%d) failed: %s", step->pgid, strerror(errno));
        return SLURM_ERROR;
    }
    slurm_debug("task/pgid: joined existing PGID %d", step->pgid);
    return SLURM_SUCCESS;
}

/* After a task terminates */
int task_p_post_term(stepd_step_rec_t *step, stepd_step_task_info_t *task)
{
    (void)step; (void)task;
    return SLURM_SUCCESS;
}

/* After the whole step finishes */
int task_p_post_step(stepd_step_rec_t *step)
{
    (void)step;
    return SLURM_SUCCESS;
}

/* Allow plugin to track additional PIDs if needed */
int task_p_add_pid(pid_t pid)
{
    (void)pid;
    return SLURM_SUCCESS;
}

int task_p_signal(stepd_step_rec_t *step, int sig)
{
    if (step && step->pgid > 1) {
        slurm_debug("task/pgid: sending signal %d to PGID %d", sig, step->pgid);
        if (killpg((pid_t)step->pgid, sig) < 0) {
            slurm_error("task/pgid: killpg(%d) failed: %s", step->pgid, strerror(errno));
            return SLURM_ERROR;
        }
    }
    return SLURM_SUCCESS;
}

int task_p_fini(stepd_step_rec_t *step)
{
    (void)step;
    return SLURM_SUCCESS;
}

In that C file you can see many stub functions that only return SLURM_SUCCESS. These functions are simply needed for the plugin to compile correctly and, for now, only exist to make the compiler happy. I plan to add functionality to them in the future, but for now I am focusing on getting slurm to run at all.

The other functions, the ones that actually do something, should be commented well enough to be understood while reading through the source code. They just add simple task management via process groups.

For the injection process, I had to patch slurm’s configure script and the Makefile.in for the task plugins, and also copy a Makefile.in from another plugin and adapt it to serve as the pgid plugin’s own Makefile.in. Together, these changes got the plugin to build as part of slurm’s own build process.

The development process of the pgid plugin, or at least the beginning of it, can be tracked via https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=288668

That’s it for this post! I hope you enjoyed reading it and were able to get a glimpse into my FreeBSD submissions. Thank you for reading my post and have a nice day!

– Generic Rikka
