Processes and threads

A process is an operating system abstraction that groups together multiple resources:

  • An address space
  • One or more threads
  • Open files
  • Sockets
  • Semaphores
  • Shared memory regions
  • Timers
  • Signal handlers
  • Many other resources and status information

All of this information is grouped in the Process Control Block (PCB). In Linux the PCB is struct task_struct.

struct task_struct

Let's take a closer look at struct task_struct. We could inspect the source code directly, but here we will use a tool called pahole (part of the dwarves package) to get some insight into this structure:

$ pahole -C task_struct vmlinux
 
struct task_struct {
    struct thread_info thread_info;                  /*     0     8 */
    volatile long int          state;                /*     8     4 */
    void *                     stack;                /*    12     4 */
 
    ...
 
    /* --- cacheline 45 boundary (2880 bytes) --- */
    struct thread_struct thread __attribute__((__aligned__(64))); /*  2880  4288 */
 
    /* size: 7168, cachelines: 112, members: 155 */
    /* sum members: 7148, holes: 2, sum holes: 12 */
    /* sum bitfield members: 7 bits, bit holes: 2, sum bit holes: 57 bits */
    /* paddings: 1, sum paddings: 2 */
    /* forced alignments: 6, forced holes: 2, sum forced holes: 12 */
} __attribute__((__aligned__(64)));

As you can see, it is a pretty large data structure: almost 8 KB in size, with 155 fields.

Threads

A thread is the basic unit that the kernel process scheduler uses to allow applications to run on the CPU. A thread has the following characteristics:

  • Each thread has its own stack, which together with the register values determines the thread's execution state
  • A thread runs in the context of a process, and all threads in the same process share its resources
  • The kernel schedules threads, not processes; user-level threads (e.g. fibers, coroutines) are not visible at the kernel level

The typical implementation makes the thread a separate data structure that is linked to the process data structure. The Windows kernel, for example, uses such an implementation:

[diagram: Windows-style implementation - thread structures linked to the process structure]

Linux uses a different implementation for threads. The basic unit is called a task (hence struct task_struct), and it is used for both threads and processes. Instead of embedding resources in the task structure, it holds pointers to these resources.

Thus, if two threads belong to the same process, they point to the same resource structure instances. If two threads are in different processes, they point to different resource structure instances.
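The sketch below (abridged and reordered from struct task_struct in include/linux/sched.h; only a handful of the many fields are shown) illustrates these resource pointers:

struct task_struct {
    ...
    struct mm_struct *mm;          /* address space */
    struct fs_struct *fs;          /* filesystem info: root, current directory */
    struct files_struct *files;    /* open files */
    struct nsproxy *nsproxy;       /* namespaces */
    struct signal_struct *signal;  /* signal state shared by the threads of a process */
    ...
};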

[diagram: Linux tasks pointing to shared or separate resource structure instances]

The clone system call

In Linux, a new thread or process is created with the clone() system call. Both the fork() system call and the pthread_create() function are implemented on top of clone().

It allows the caller to decide which resources should be shared with the parent and which should be copied or isolated:

  • CLONE_FILES - shares the file descriptor table with the parent
  • CLONE_VM - shares the address space with the parent
  • CLONE_FS - shares the filesystem information (root directory, current directory) with the parent
  • CLONE_NEWNS - does not share the mount namespace with the parent
  • CLONE_NEWIPC - does not share the IPC namespace (System V IPC objects, POSIX message queues) with the parent
  • CLONE_NEWNET - does not share the networking namespace (network interfaces, routing tables) with the parent

For example, if the caller uses CLONE_FILES | CLONE_VM | CLONE_FS, then effectively a new thread is created. If these flags are not used, a new process is created.
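To make the distinction concrete, here is a minimal user-space sketch using the glibc clone() wrapper (error handling omitted for brevity):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    printf("child running\n");
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);

    /* Share the address space, open files and filesystem info with the
     * parent: effectively a new thread. Dropping the CLONE_* flags
     * (keeping only SIGCHLD) would create a fork()-like new process. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,  /* stack grows down */
                      CLONE_VM | CLONE_FILES | CLONE_FS | SIGCHLD, NULL);

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}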

Namespaces and "containers"

"Containers" are a form of lightweight virtual machines that share the same kernel instance, as opposed to normal virtualization, where a hypervisor runs multiple VMs, each with its own kernel instance.

Examples of container technologies are LXC, which allows running lightweight "VM"s, and docker, a specialized container for running a single application.

Containers are built on top of a few kernel features, one of which is namespaces. Namespaces allow the isolation of resources that would otherwise be globally visible. For example, without containers, all processes are visible in /proc. With containers, processes in one container are not visible to other containers (they do not show up in /proc and cannot be killed from there).

To achieve this partitioning, the struct nsproxy structure is used to group the types of resources that we want to partition. It currently supports IPC, networking, cgroup, mount, PID and time namespaces. For example, instead of a global list of network interfaces, the list is part of a struct net. The system initializes with a default network namespace (init_net), which all processes share by default. When a new network namespace is created, new processes can point to it instead of the default one.
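For reference, here is the structure itself (abridged from include/linux/nsproxy.h; the exact fields and the type of the reference counter vary between kernel versions):

struct nsproxy {
    atomic_t count;                              /* reference count */
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net *net_ns;
    struct time_namespace *time_ns;
    struct time_namespace *time_ns_for_children;
    struct cgroup_namespace *cgroup_ns;
};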

Accessing the current process

Accessing the current process is a frequent operation:

  • opening a file requires access to struct task_struct's files field
  • mapping a new file requires access to struct task_struct's mm field
  • over 90% of system calls need to access the current process structure, so it must be fast
  • the current macro is available for accessing the current process's struct task_struct

To support fast access in multiprocessor configurations, a per-CPU variable is used to store and retrieve the pointer to the current struct task_struct:

[diagram: per-CPU variable holding the pointer to the current struct task_struct]
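On x86, for example, the implementation simply reads the per-CPU variable (abridged from arch/x86/include/asm/current.h; newer kernels restructure the per-CPU data, but the idea is the same):

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()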

Previously, the current macro was implemented with the following sequence:

/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __attribute_used__;
 
/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
   return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
}
 
#define current current_thread_info()->task

Context switching

The following diagram shows an overview of the Linux kernel context switch process:

[diagram: overview of the Linux kernel context switch process]

Note that before a context switch can occur, a kernel transition must take place, either through a system call or an interrupt. At that point the user-space registers are saved on the kernel stack. At some point the schedule() function is called, which can decide that a context switch must occur from T0 to T1 (e.g. because the current thread is blocking while waiting for an I/O operation to complete, or because its allocated time slice has expired).

At that point context_switch() performs the architecture-specific operations and switches the address space if needed:

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
         struct task_struct *next, struct rq_flags *rf)
{
    prepare_task_switch(rq, prev, next);
 
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);
 
    /*
     * kernel -> kernel   lazy + transfer active
     *   user -> kernel   lazy + mmgrab() active
     *
     * kernel ->   user   switch + mmdrop() active
     *   user ->   user   switch
     */
    if (!next->mm) {                                // to kernel
        enter_lazy_tlb(prev->active_mm, next);
 
        next->active_mm = prev->active_mm;
        if (prev->mm)                           // from user
            mmgrab(prev->active_mm);
        else
            prev->active_mm = NULL;
    } else {                                        // to user
        membarrier_switch_mm(rq, prev->active_mm, next->mm);
        /*
         * sys_membarrier() requires an smp_mb() between setting
         * rq->curr / membarrier_switch_mm() and returning to userspace.
         *
         * The below provides this either through switch_mm(), or in
         * case 'prev->active_mm == next->mm' through
         * finish_task_switch()'s mmdrop().
         */
        switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
        if (!prev->mm) {                        // from kernel
            /* will mmdrop() in finish_task_switch(). */
            rq->prev_mm = prev->active_mm;
            prev->active_mm = NULL;
        }
    }
 
    rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
    prepare_lock_switch(rq, next, rf);
 
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    barrier();
 
    return finish_task_switch(prev);
}

It then calls the architecture-specific switch_to implementation to switch the register state and the kernel stack. Note that the registers are saved on the stack, while the stack pointer itself is saved in the task structure:

#define switch_to(prev, next, last)               \
do {                                              \
    ((last) = __switch_to_asm((prev), (next)));   \
} while (0)
 
/*
 * %eax: prev task
 * %edx: next task
 */
.pushsection .text, "ax"
SYM_CODE_START(__switch_to_asm)
    /*
     * Save callee-saved registers
     * This must match the order in struct inactive_task_frame
     */
    pushl   %ebp
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    /*
     * Flags are saved to prevent AC leakage. This could go
     * away if objtool would have 32bit support to verify
     * the STAC/CLAC correctness.
     */
    pushfl
 
    /* switch stack */
    movl    %esp, TASK_threadsp(%eax)
    movl    TASK_threadsp(%edx), %esp

#ifdef CONFIG_STACKPROTECTOR
    movl    TASK_stack_canary(%edx), %ebx
    movl    %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
#endif

#ifdef CONFIG_RETPOLINE
    /*
     * When switching from a shallower to a deeper call stack
     * the RSB may either underflow or use entries populated
     * with userspace addresses. On CPUs where those concerns
     * exist, overwrite the RSB with entries which capture
     * speculative execution to prevent attack.
     */
    FILL_RETURN_BUFFER %ebx, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
#endif

    /* Restore flags or the incoming task to restore AC state. */
    popfl
    /* restore callee-saved registers */
    popl    %esi
    popl    %edi
    popl    %ebx
    popl    %ebp

    jmp     __switch_to
SYM_CODE_END(__switch_to_asm)
.popsection

You will notice that the instruction pointer is not explicitly saved. It is not needed because:

  • a task will always resume inside this function
  • the return address to the caller of schedule() (context_switch() is always inlined) is saved on the kernel stack
  • a jmp is used to execute __switch_to(), which is a function; when it returns, the original (next task's) return address is popped from the stack (see the stack frame layout below)
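The layout that __switch_to_asm builds on the stack is described by struct inactive_task_frame (abridged from arch/x86/include/asm/switch_to.h; only the 32-bit fields are shown). The ret_addr slot, pushed by the call to __switch_to_asm, is what the final ret of __switch_to pops:

struct inactive_task_frame {
    unsigned long flags;     /* pushed last by pushfl, so lowest address */
    unsigned long si;
    unsigned long di;
    unsigned long bx;
    unsigned long bp;
    unsigned long ret_addr;  /* pushed by the call instruction itself */
};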

Blocking and waking up tasks

Task states

The following diagram shows the task (thread) states and the possible transitions between them:

[diagram: task (thread) states and the possible transitions between them]

Blocking the current thread

Blocking the current thread is an important operation we need to perform to implement efficient task scheduling - we want to run other threads while I/O operations complete.

To accomplish this, the following operations take place (a driver-level usage sketch follows the list):

  • set the current thread state to TASK_UNINTERRUPTIBLE or TASK_INTERRUPTIBLE
  • add the task to a waiting queue
  • call the scheduler, which will pick a new task from the READY queue
  • perform the context switch to the new task
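As an illustration, here is a minimal sketch of how a driver typically pairs these primitives (names such as my_wq and data_ready are hypothetical):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;     /* the condition, set by the producer */

/* consumer side: runs in process context and may sleep */
static void my_consume(void)
{
    wait_event(my_wq, data_ready != 0);  /* blocks until the condition is true */
    /* ... consume the data ... */
    data_ready = 0;
}

/* producer side: e.g. called from an interrupt handler */
static void my_produce(void)
{
    data_ready = 1;    /* change the condition first... */
    wake_up(&my_wq);   /* ...then wake up the waiters */
}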

Below are some snippets from the wait_event implementation. Note that the waiting queue is a list carrying some extra information, such as a pointer to the task structure.

Also note that significant effort goes into making sure no deadlock can occur between wait_event and wake_up: the task is added to the list before checking condition, and signals are checked before calling schedule().

/**
 * wait_event - sleep until a condition gets true
 * @wq_head: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq_head is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 */
#define wait_event(wq_head, condition)            \
do {                                              \
  might_sleep();                                  \
  if (condition)                                  \
          break;                                  \
  __wait_event(wq_head, condition);               \
} while (0)
 
#define __wait_event(wq_head, condition)                                  \
    (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0,   \
                        schedule())
 
/*
 * The below macro ___wait_event() has an explicit shadow of the __ret
 * variable when used from the wait_event_*() macros.
 *
 * This is so that both can use the ___wait_cond_timeout() construct
 * to wrap the condition.
 *
 * The type inconsistency of the wait_event_*() __ret variable is also
 * on purpose; we use long where we can return timeout values and int
 * otherwise.
 */
#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)    \
({                                                                       \
    __label__ __out;                                                     \
    struct wait_queue_entry __wq_entry;                                  \
    long __ret = ret;       /* explicit shadow */                        \
                                                                         \
    init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);     \
    for (;;) {                                                           \
        long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
                                                                         \
        if (condition)                                                   \
            break;                                                       \
                                                                         \
        if (___wait_is_interruptible(state) && __int) {                  \
            __ret = __int;                                               \
            goto __out;                                                  \
        }                                                                \
                                                                         \
        cmd;                                                             \
    }                                                                    \
    finish_wait(&wq_head, &__wq_entry);                                  \
   __out:  __ret;                                                        \
 })
 
 void init_wait_entry(struct wait_queue_entry *wq_entry, int flags)
 {
    wq_entry->flags = flags;
    wq_entry->private = current;
    wq_entry->func = autoremove_wake_function;
    INIT_LIST_HEAD(&wq_entry->entry);
 }
 
 long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
 {
     unsigned long flags;
     long ret = 0;
 
     spin_lock_irqsave(&wq_head->lock, flags);
     if (signal_pending_state(state, current)) {
         /*
          * Exclusive waiter must not fail if it was selected by wakeup,
          * it should "consume" the condition we were waiting for.
          *
          * The caller will recheck the condition and return success if
          * we were already woken up, we can not miss the event because
          * wakeup locks/unlocks the same wq_head->lock.
          *
          * But we need to ensure that set-condition + wakeup after that
          * can't see us, it should wake up another exclusive waiter if
          * we fail.
          */
         list_del_init(&wq_entry->entry);
         ret = -ERESTARTSYS;
     } else {
         if (list_empty(&wq_entry->entry)) {
             if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
                 __add_wait_queue_entry_tail(wq_head, wq_entry);
             else
                 __add_wait_queue(wq_head, wq_entry);
         }
         set_current_state(state);
     }
     spin_unlock_irqrestore(&wq_head->lock, flags);
 
     return ret;
 }
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
     list_add(&wq_entry->entry, &wq_head->head);
 }
 
 static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
     list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
 /**
  * finish_wait - clean up after waiting in a queue
  * @wq_head: waitqueue waited on
  * @wq_entry: wait descriptor
  *
  * Sets current thread back to running state and removes
  * the wait descriptor from the given waitqueue if still
  * queued.
  */
 void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
     unsigned long flags;
 
     __set_current_state(TASK_RUNNING);
     /*
      * We can check for list emptiness outside the lock
      * IFF:
      *  - we use the "careful" check that verifies both
      *    the next and prev pointers, so that there cannot
      *    be any half-pending updates in progress on other
      *    CPU's that we haven't seen yet (and that might
      *    still change the stack area.
      * and
      *  - all other users take the lock (ie we can only
      *    have _one_ other CPU that looks at or modifies
      *    the list).
      */
     if (!list_empty_careful(&wq_entry->entry)) {
         spin_lock_irqsave(&wq_head->lock, flags);
         list_del_init(&wq_entry->entry);
         spin_unlock_irqrestore(&wq_head->lock, flags);
     }
 }

Waking up a task

We can wake up tasks using the wake_up primitive. The following high-level operations are performed to wake up a task:

  • select a task from the waiting queue
  • set the task state to TASK_READY
  • insert the task into the scheduler's READY queue
  • on SMP systems this is a complex operation: each processor has its own queue, queues need to be balanced, and CPUs need to be signaled

#define wake_up(x)                        __wake_up(x, TASK_NORMAL, 1, NULL)
 
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
 
static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                  int nr_exclusive, int wake_flags, void *key)
{
  unsigned long flags;
  wait_queue_entry_t bookmark;
 
  bookmark.flags = 0;
  bookmark.private = NULL;
  bookmark.func = NULL;
  INIT_LIST_HEAD(&bookmark.entry);
 
  do {
          spin_lock_irqsave(&wq_head->lock, flags);
          nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                          wake_flags, key, &bookmark);
          spin_unlock_irqrestore(&wq_head->lock, flags);
  } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}
 
/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
                            int nr_exclusive, int wake_flags, void *key,
                  wait_queue_entry_t *bookmark)
{
    wait_queue_entry_t *curr, *next;
    int cnt = 0;
 
    lockdep_assert_held(&wq_head->lock);
 
    if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
          curr = list_next_entry(bookmark, entry);
 
          list_del(&bookmark->entry);
          bookmark->flags = 0;
    } else
          curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
 
    if (&curr->entry == &wq_head->head)
          return nr_exclusive;
 
    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
          unsigned flags = curr->flags;
          int ret;
 
          if (flags & WQ_FLAG_BOOKMARK)
                  continue;
 
          ret = curr->func(curr, mode, wake_flags, key);
          if (ret < 0)
                  break;
          if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                  break;
 
          if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
                          (&next->entry != &wq_head->head)) {
                  bookmark->flags = WQ_FLAG_BOOKMARK;
                  list_add_tail(&bookmark->entry, &next->entry);
                  break;
          }
    }
 
    return nr_exclusive;
}
 
int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
{
    int ret = default_wake_function(wq_entry, mode, sync, key);
 
    if (ret)
        list_del_init_careful(&wq_entry->entry);
 
    return ret;
}
 
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
                    void *key)
{
    WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
    return try_to_wake_up(curr->private, mode, wake_flags);
}
 
/**
 * try_to_wake_up - wake up a thread
 * @p: the thread to be awakened
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
 * Conceptually does:
 *
 *   If (@state & @p->state) @p->state = TASK_RUNNING.
 *
 * If the task was not queued/runnable, also place it back on a runqueue.
 *
 * This function is atomic against schedule() which would dequeue the task.
 *
 * It issues a full memory barrier before accessing @p->state, see the comment
 * with set_current_state().
 *
 * Uses p->pi_lock to serialize against concurrent wake-ups.
 *
 * Relies on p->pi_lock stabilizing:
 *  - p->sched_class
 *  - p->cpus_ptr
 *  - p->sched_task_group
 * in order to do migration, see its use of select_task_rq()/set_task_cpu().
 *
 * Tries really hard to only take one task_rq(p)->lock for performance.
 * Takes rq->lock in:
 *  - ttwu_runnable()    -- old rq, unavoidable, see comment there;
 *  - ttwu_queue()       -- new rq, for enqueue of the task;
 *  - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
 *
 * As a consequence we race really badly with just about everything. See the
 * many memory barriers and their comments for details.
 *
 * Return: %true if @p->state changes (an actual wakeup was done),
 *           %false otherwise.
 */
 static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
     ...

Preempting tasks

Up to this point we have looked at how context switches occur voluntarily between threads. Next we will look at how preemption is handled. We will start with the simpler case where the kernel is configured as non-preemptive, and then move on to the preemptive kernel case.

Non-preemptive kernels

  • at every tick the kernel checks whether the current process has consumed its time slice
  • if so, a flag is set in interrupt context
  • before returning to user space the kernel checks this flag and calls schedule() if needed
  • in this case a task is not preempted while running in kernel mode (e.g. during a system call), so there are no synchronization issues (a condensed sketch of this flow follows)
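A condensed sketch of that flow: set_tsk_need_resched() and need_resched() are the real kernel helpers, while the surrounding functions and the time-slice check are simplified placeholders:

/* timer interrupt path: only marks the task for rescheduling */
void tick_handler(void)
{
    if (time_slice_exhausted(current))   /* hypothetical check */
        set_tsk_need_resched(current);   /* sets TIF_NEED_RESCHED */
}

/* exit-to-user-space path: acts on the flag */
void return_to_user_space(void)
{
    if (need_resched())
        schedule();
}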

Preemptive kernels

In this case the current task can be preempted even while it is running in kernel mode and executing a system call. This requires special synchronization primitives: preempt_disable and preempt_enable.
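For example, here is a sketch of guarding per-CPU data with these primitives (my_counter and my_update are hypothetical):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(int, my_counter);

static void my_update(void)
{
    preempt_disable();            /* no preemption or migration from here on */
    __this_cpu_inc(my_counter);   /* safe to touch this CPU's data */
    preempt_enable();             /* may reschedule if preemption became pending */
}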

To simplify handling preemptive kernels, and since synchronization primitives are needed for SMP anyway, preemption is automatically disabled while a spinlock is held.

As before, if we run into a condition that requires preempting the current task (e.g. its time slice has expired), a flag is set. This flag is checked whenever preemption is re-enabled, e.g. when exiting a critical section through spin_unlock(), and the scheduler is called to select a new task if needed.
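This check is visible in the preempt_enable() implementation itself (abridged from include/linux/preempt.h, for kernels built with preemption enabled):

#define preempt_enable() \
do { \
    barrier(); \
    if (unlikely(preempt_count_dec_and_test())) \
        __preempt_schedule(); \
} while (0)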

Process context

Now that we have examined the implementation of processes and threads (tasks), how context switching occurs, and how we can block, wake up and preempt tasks, we can finally define what the process context is and what its properties are:

The kernel is executing in process context when it is running a system call.

In process context there is a well-defined context and we can access the current process data with current.

In process context we can sleep (wait on a condition).

In process context we can access user space (unless we are running in a kernel thread context).
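A sketch of typical process-context code, e.g. the body of a syscall or ioctl handler (my_handler and its argument are hypothetical):

#include <linux/sched.h>
#include <linux/uaccess.h>

static long my_handler(int __user *uptr)
{
    int val = current->pid;                     /* current is well defined */

    if (copy_to_user(uptr, &val, sizeof(val)))  /* may fault and sleep */
        return -EFAULT;                         /* user space is accessible */
    return 0;
}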

Kernel threads

Sometimes the kernel core or device drivers need to perform blocking operations, and thus they need to run in process context.

Kernel threads are used exactly for this purpose - they are a special class of tasks that don't "use" user-space resources (e.g. they have no address space or open files).
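A typical way to create one is the kthread API; a minimal sketch (my_thread_fn and the thread name are illustrative):

#include <linux/delay.h>
#include <linux/kthread.h>

static int my_thread_fn(void *data)
{
    while (!kthread_should_stop()) {
        /* blocking operations are fine here: we run in process context */
        msleep(1000);
    }
    return 0;
}

/* in init code:
 *     struct task_struct *t = kthread_run(my_thread_fn, NULL, "my-kthread");
 * on cleanup:
 *     kthread_stop(t);
 */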