
High-Performance Computing on Multi/Many-Core Systems

As the capability of CPUs and accelerators (e.g., GPUs and DPUs) in High-Performance Computing (HPC) systems keeps increasing, it is important to fully parallelize large numbers of processes and threads across multi/many-core processors and to provide scalable performance.

High-Performance Message Passing Interface (MPI)

LiMIC enables high-performance MPI intra-node communication on multi-core systems. LiMIC achieves very good performance because MPI libraries no longer need to copy messages into and out of shared-memory regions for intra-node communication; instead, LiMIC copies messages directly from the sending process's memory into the receiving process's memory. Thus, LiMIC eliminates unnecessary message copies and improves cache utilization. LiMIC has two different designs, LiMIC1 [ICPP2005] and LiMIC2. In LiMIC2, the kernel extension provides lightweight primitives that perform memory mapping between different processes [Cluster2007, ICPP2008]. As emerging many-core processors provide a large number of cores, we are improving the scalability of LiMIC2 by exploiting features of modern memory systems and interconnects [EuroMPI2017, SUPE2023]. We are also working to solve imbalance problems in exascale computing systems [SC2021].

Emerging Data Processing Units (DPUs) incorporate fully programmable general-purpose processors that enable Network Interface Cards (NICs) to handle both control-plane and data-plane functions. Accordingly, there have been recent attempts to offload to the DPU tasks beyond the network protocols traditionally performed on the CPU. In particular, the Message Passing Interface (MPI) is attracting attention as a key offloading target in supercomputing systems. To address the lack of an implementation-agnostic MPI offloading framework that avoids changes to existing MPI libraries and applications, we propose Mammoth, a macro-level MPI offloading framework that can offload several existing MPI implementations to the off-path accelerator in the DPU [CCGrid2026].

Dynamic Core Affinity for High-throughput I/O

The core affinity defines the set of cores on which a given task can run. Core affinity can be classified into two types: (1) interrupt-core affinity, the mapping between an I/O device and the set of cores that handle interrupts from that device, and (2) process-core affinity, the mapping between a process and the set of cores allowed to run it. We showed the impact of core affinity on network performance [P2S22008] and suggested a dynamic core affinity mechanism that decides the best core affinity for networking processes at the kernel level based on the cache layout, communication intensiveness, and core loads [HotI2009].

We are expanding the basic idea of dynamic core affinity to big data analytics frameworks. Hadoop threads perform a combination of network I/O, disk I/O, and computation. In the case of HDFS, the majority of operations are network- and disk-I/O intensive; thus, it is beneficial to run HDFS threads on I/O-favorable cores [ParCo2014]. We are also trying to further improve the locality of network and storage I/O operations on many-core systems by partitioning cores between I/O system calls and event handlers. To implement such fine-grained many-core partitioning, we decouple the system call context from the user-level process [CCPE2020, Cluster2021].
