
High-Performance Computing on Multi/Many-Core Systems

As the capability of CPUs and accelerators (e.g., GPUs and DPUs) in High-Performance Computing (HPC) systems keeps increasing, it is important to fully parallelize large numbers of processes and threads across multi/many-core processors and to provide scalable performance.

High-Performance Message Passing Interface (MPI)

LiMIC enables high-performance MPI intra-node communication on multi-core systems. LiMIC achieves very good performance because MPI libraries no longer need to copy messages into and out of shared-memory regions for intra-node communication; instead, LiMIC copies messages directly from the sending process's memory into the receiving process's memory. Thus, LiMIC eliminates unnecessary message copies and improves cache utilization. LiMIC has two different designs, LiMIC1 [ICPP2005] and LiMIC2. In LiMIC2, the kernel extension provides lightweight primitives that perform memory mapping between different processes [Cluster2007, ICPP2008]. As emerging many-core processors provide a large number of cores, we are improving the scalability of LiMIC2 by exploiting features of modern memory systems and interconnects [EuroMPI2017, SUPE2023]. We are also working to solve imbalance problems in exascale computing systems [SC2021].

Emerging Data Processing Units (DPUs) incorporate fully programmable general-purpose processors that enable Network Interface Cards (NICs) to handle both control-plane and data-plane functions. Accordingly, there have been recent attempts to offload to the DPU tasks beyond the network protocols traditionally performed on the CPU. In particular, the Message Passing Interface (MPI) is attracting attention as a key offloading target in supercomputing systems. To address the lack of an implementation-agnostic MPI offloading framework that avoids changes to existing MPI libraries and applications, we propose Mammoth, a macro-level MPI offloading framework that can offload several existing MPI implementations to the off-path accelerator in the DPU [CCGrid2026].

Dynamic Core Affinity for High-throughput I/O

The core affinity defines the set of cores on which a given task can run. Core affinity can be classified into two types: (1) interrupt-core affinity, the mapping between an I/O device and the set of cores that handle interrupts from that device, and (2) process-core affinity, the mapping between a process and the set of cores allowed to run it. We showed the impact of core affinity on network performance [P2S22008] and suggested a dynamic core affinity mechanism that decides the best core affinity for networking processes at the kernel level based on the cache layout, communication intensiveness, and core loads [HotI2009].

We are expanding the basic idea of dynamic core affinity to big data analytics frameworks. Hadoop threads perform a combination of network I/O, disk I/O, and computation. In the case of HDFS, the majority of operations are network- and disk-I/O intensive; thus, it is beneficial to run HDFS threads on I/O-favorable cores [ParCo2014]. We are also trying to further improve the locality of network and storage I/O operations on many-core systems by partitioning cores between I/O system calls and event handlers. To implement such fine-grained many-core partitioning, we decouple the system call context from the user-level process [CCPE2020, Cluster2021].
