TL;DR: Use irqbalance.


Yesterday was a good day. When I was taking a shower, I started to wonder: How does interrupt handling work in an SMP system? 🙂

Life is easy in a uniprocessor system: I press a key, the keyboard controller tells my PIC, who tells my CPU – That’s it.

But what if I have more than one CPU? My machine has 4 CPUs, and when I press A on my keyboard, I expect to see one A, not four. More importantly, how does the system decide which CPU should handle this interrupt? Keeping one CPU busy while the others sit idle doesn't sound good, so maybe it should be done in a “balanced” (maybe round-robin) fashion? Is it configurable?…

Then I ran out of hot water. 🙂 Anyway, I did some research on it, and here’s what I’ve learned.

The “IRQs and Interrupts” section from Chapter 4 of Understanding the Linux Kernel is a very good read on this topic.

All examples in this article are based on Linux kernel version 5.7.0-rc6 (Kleptomaniac Octopus). You can download a tarball of the source code here.


/proc/interrupts is an interesting file containing information about which interrupts are currently in use, and what they are used for:

peilin@PWN:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  1:          0          0        123          0   IO-APIC   1-edge      i8042
 12:          0        290          0          0   IO-APIC   12-edge     i8042
 19:          0         30          0     179784   IO-APIC   19-fasteoi  enp0s3

IRQ 1 is the “8042” PS/2 keyboard controller, and IRQ 12 is the mouse. All 123 interrupts generated by the keyboard went to CPU2, while all 290 interrupts generated by the mouse went to CPU1. Normally these numbers would be far greater than a few hundred, but I’m SSHing into this VM from my host Mac, so its own keyboard and mouse are rarely used.

enp0s3 is my Ethernet card. It generates a lot of interrupts, most of which go to CPU3.
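To make the layout of the file concrete, here is a minimal sketch that parses /proc/interrupts text into per-CPU counts. The function names (parse_interrupts, busiest_cpu) are mine, not part of any tool, and the parser assumes the simple layout shown above: a header row of CPU names followed by rows of the form "label: counts... description".

```python
# Sample text copied from the output above, so the sketch runs anywhere.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  1:          0          0        123          0   IO-APIC   1-edge      i8042
 12:          0        290          0          0   IO-APIC   12-edge     i8042
 19:          0         30          0     179784   IO-APIC   19-fasteoi  enp0s3
"""


def parse_interrupts(text):
    """Return {irq_label: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        fields = rest.split()
        # Rows without one numeric column per CPU (some summary rows) are skipped.
        if len(fields) < ncpus or not all(f.isdigit() for f in fields[:ncpus]):
            continue
        counts = [int(f) for f in fields[:ncpus]]
        table[label.strip()] = (counts, " ".join(fields[ncpus:]))
    return table


def busiest_cpu(counts):
    """Index of the CPU that handled the most interrupts for one IRQ."""
    return max(range(len(counts)), key=counts.__getitem__)


for irq, (counts, desc) in parse_interrupts(SAMPLE).items():
    print(f"IRQ {irq} ({desc}): mostly CPU{busiest_cpu(counts)}")
```

On a real Linux box the same function can be fed the contents of /proc/interrupts instead of the sample string.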

So is there a way for users to configure this? The answer is yes, through something called SMP IRQ affinity:

Each IRQ has its own directory under /proc/irq:

peilin@PWN:~$ ls /proc/irq
0  1  10  11  12  13  14  15  18  19  2  21  22  3  4  5  6  7  8  9  default_smp_affinity
peilin@PWN:~$ cat /proc/irq/19/smp_affinity
8

The smp_affinity file for IRQ 19 contains a single hexadecimal digit, 8. This value works as a bitmask: hexadecimal 8 equals binary 1000, which means only the fourth CPU (CPU3) receives IRQ 19. You may notice that CPU1 also received 30 interrupts from it. I don’t know why. Since the counters in /proc/interrupts accumulate from boot, maybe smp_affinity for IRQ 19 has changed since the last boot.
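Decoding the mask by hand gets tedious on bigger machines, so here is a small sketch of the arithmetic. It assumes the file holds a hexadecimal bitmask and that wide masks are printed as comma-separated 32-bit groups, most significant group first; mask_to_cpus is a name I made up for illustration.

```python
def mask_to_cpus(mask_text):
    """Decode an smp_affinity-style hex bitmask into a sorted list of CPU numbers."""
    # Assumption: commas only separate 32-bit groups, so they can be dropped.
    value = int(mask_text.strip().replace(",", ""), 16)
    # Bit n set means CPU n is allowed to receive the IRQ.
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


print(mask_to_cpus("8"))  # [3]
print(mask_to_cpus("f"))  # [0, 1, 2, 3]
```

So a mask of f on my 4-CPU machine would allow every CPU, while 8 pins the IRQ to CPU3 only.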

An alternative is to use smp_affinity_list, which simply lists the CPU number(s) allowed to receive an IRQ:

peilin@PWN:~$ cat /proc/irq/19/smp_affinity_list
3
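The two files describe the same set of CPUs in different notations. As a rough sketch of the relationship, the helpers below expand a list such as "0-2,5" and build the matching hex mask. Both function names are hypothetical, and the sketch only handles plain numbers and simple ranges, not any fancier syntax the kernel may accept.

```python
def list_to_cpus(list_text):
    """Expand an smp_affinity_list-style string such as "0-2,5" into CPU numbers."""
    cpus = set()
    for part in list_text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A lone number like "3" is treated as the one-element range 3-3.
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def cpus_to_mask(cpus):
    """Build the equivalent hex bitmask string (no 32-bit comma grouping)."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    return format(value, "x")


print(list_to_cpus("3"), cpus_to_mask(list_to_cpus("3")))          # [3] 8
print(list_to_cpus("0-2,5"), cpus_to_mask(list_to_cpus("0-2,5")))  # [0, 1, 2, 5] 27
```

The first line reproduces the pair of values shown above for IRQ 19: list 3, mask 8.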

Okay. So what’s the mechanism behind all of this? Who are we talking to when we modify smp_affinity? Here’s a nice diagram from the Intel 82093AA datasheet:

Basically, in an APIC system, each CPU has its own local APIC (LAPIC) attached to it. The motherboard also contains a “global” APIC, called the I/O APIC (actually a motherboard may contain several I/O APICs, but let’s assume there’s only one for simplicity). All external IRQs are first sent to the I/O APIC, which is responsible for “routing” them to the different LAPICs according to its Interrupt Redirection Table. When we modify smp_affinity, we are basically trying to configure the I/O APIC.


Things seem to be working beautifully. However, smp_affinity is not always consistent with /proc/interrupts for me, and I have no idea why. The I/O APIC may use “routing policies” other than simply hardcoding a bitmask, so maybe it has something to do with that.
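To spot this kind of mismatch quickly, here is a minimal sketch that compares one row of per-CPU counts against a hex affinity mask. The function name cpus_outside_mask is hypothetical, and keep in mind that the counters are cumulative, so a hit only says that a CPU outside the current mask handled the IRQ at some point since boot.

```python
def cpus_outside_mask(counts, mask_text):
    """Return CPUs that handled interrupts although the hex mask excludes them."""
    allowed = int(mask_text.strip().replace(",", ""), 16)
    return [
        cpu
        for cpu, handled in enumerate(counts)
        if handled > 0 and not (allowed >> cpu) & 1
    ]


# IRQ 19 from my machine: counts per CPU, and smp_affinity = 8 (CPU3 only).
print(cpus_outside_mask([0, 30, 0, 179784], "8"))  # [1]
```

For IRQ 19 it flags CPU1, which is exactly the 30 stray interrupts I could not explain.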

I believe this is why I should use tools like irqbalance, instead of trying to configure everything manually – especially when I don’t have a good reason to do so. 🙁

More IA64-specific information can be found in Documentation/ia64/irq-redir.rst for the (really) curious readers.


That’s pretty much it! Finally, I would like to briefly walk through how the Linux kernel uses smp_affinity.

smp_affinity is created in register_irq_proc() in kernel/irq/proc.c:

#ifdef CONFIG_SMP
	/* create /proc/irq/<irq>/smp_affinity */
	proc_create_data("smp_affinity", 0644, desc->dir,
			 &irq_affinity_proc_ops, irqp);

Where irq_affinity_proc_ops is a struct proc_ops defined as:

static const struct proc_ops irq_affinity_proc_ops = {
	.proc_open	= irq_affinity_proc_open,
	.proc_read	= seq_read,
	.proc_lseek	= seq_lseek,
	.proc_release	= single_release,
	.proc_write	= irq_affinity_proc_write,
};

…which contains some handlers. It seems that whenever we write to smp_affinity, we call irq_affinity_proc_write(), which is a wrapper:

static ssize_t irq_affinity_proc_write(struct file *file,
		const char __user *buffer, size_t count, loff_t *pos)
{
	return write_irq_affinity(0, file, buffer, count, pos);
}

…of write_irq_affinity(), a longer function. In this case, write_irq_affinity() first calls cpumask_parse_user(buffer, count, new_value) to parse the user-provided string written to smp_affinity, then calls irq_set_affinity(irq, new_value). Defined in include/linux/interrupt.h, irq_set_affinity() is yet another wrapper function:

/**
 * irq_set_affinity - Set the irq affinity of a given irq
 * @irq:	Interrupt to set affinity
 * @cpumask:	cpumask
 *
 * Fails if cpumask does not contain an online CPU
 */
static inline int
irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
{
	return __irq_set_affinity(irq, cpumask, false);
}

…which calls __irq_set_affinity() in kernel/irq/manage.c:

int __irq_set_affinity(unsigned int irq, const struct cpumask *mask, bool force)
{
	struct irq_desc *desc = irq_to_desc(irq);
	unsigned long flags;
	int ret;

	if (!desc)
		return -EINVAL;

	raw_spin_lock_irqsave(&desc->lock, flags);
	ret = irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask, force);
	raw_spin_unlock_irqrestore(&desc->lock, flags);
	return ret;
}

Inside the critical region protected by desc->lock, it calls irq_set_affinity_locked(), which calls irq_try_set_affinity(), which, in turn, calls irq_do_set_affinity(), where it finally calls a chip-specific handler to do the real work:

		ret = chip->irq_set_affinity(data, mask, force);

In this case, I believe it is ioapic_set_affinity() in arch/x86/kernel/apic/io_apic.c:

static struct irq_chip ioapic_chip __read_mostly = {
	.name			= "IO-APIC",
	.irq_startup		= startup_ioapic_irq,
	.irq_mask		= mask_ioapic_irq,
	.irq_unmask		= unmask_ioapic_irq,
	.irq_ack		= irq_chip_ack_parent,
	.irq_eoi		= ioapic_ack_level,
	.irq_set_affinity	= ioapic_set_affinity,
	.irq_retrigger		= irq_chip_retrigger_hierarchy,
	.irq_get_irqchip_state	= ioapic_irq_get_chip_state,
	.flags			= IRQCHIP_SKIP_SET_WAKE,
};
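The shape of this call chain (take the descriptor lock, then dispatch through a per-chip function pointer) can be mimicked in a few lines. The following is only a toy model: every class and function name in it is made up, the handler just records the mask instead of touching hardware, and -22 stands in for -EINVAL.

```python
import threading

EINVAL = 22


class ToyIrqChip:
    """Toy stand-in for struct irq_chip: a name plus an optional set_affinity hook."""

    def __init__(self, name, set_affinity=None):
        self.name = name
        self.set_affinity = set_affinity


class ToyIrqDesc:
    """Toy stand-in for struct irq_desc: holds the chip, a lock and the current mask."""

    def __init__(self, chip):
        self.chip = chip
        self.lock = threading.Lock()
        self.affinity = None


def toy_ioapic_set_affinity(desc, mask, force):
    # A real handler would reprogram the interrupt controller; the toy records the mask.
    desc.affinity = mask
    return 0


def toy_set_affinity(desc, mask, force=False):
    """Lock the descriptor, then dispatch to whatever handler the chip provides."""
    if desc is None:
        return -EINVAL
    with desc.lock:
        if desc.chip.set_affinity is None:
            return -EINVAL
        return desc.chip.set_affinity(desc, mask, force)


desc = ToyIrqDesc(ToyIrqChip("IO-APIC", toy_ioapic_set_affinity))
print(toy_set_affinity(desc, 0x8), hex(desc.affinity))  # 0 0x8
```

The generic code never needs to know which interrupt controller it is talking to. It only follows the pointer stored in the chip structure, which is why the same smp_affinity file works regardless of the underlying hardware.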
