Software Architecture for Fault-Tolerant Multicore Computing with Hybridized Non-Volatile Memories
(PI: Prof. C.L. Wang, 9/2015-8/2018)
Two-level memory hierarchy (on-chip and off-chip) containing both non-volatile STT-RAM and volatile SRAM/DRAM

Members:

  • Dr. King Tin Lam
  • Mr. Mingzhe Zhang
  • Ms. Zhourui Zhang
  • Mr. Xin Yau

Several emerging non-volatile memory (NVM) technologies, such as Spin-Transfer Torque Magnetic RAM (STT-MRAM, or simply STT-RAM) and Phase-Change Memory (PCM), have moved from laboratory samples to integrated products. An industry-wide shift to one of them is no longer a distant prospect. Such memory chips are expected to provide an alternative to flash memory and eventually to replace SRAM/DRAM in the existing memory hierarchy of computers and electronic devices. STT-RAM has particular potential to outpace the others because it offers the most complete set of features desired in a universal memory: non-volatility, high density (scalability), low read latency and energy, ultra-low leakage power, and sufficiently long endurance. In the near future, we can expect a hybridized memory hierarchy composed of both volatile SRAM/DRAM and non-volatile STT-RAM.

With non-volatile main memory that always preserves data, we see new opportunities to bring native support for fault tolerance to program execution, decoupling data loss from power loss and from transient software or hardware failures. Like the scratchpad memory found in most embedded systems, future mainstream (server) processor chips can include programmable on-chip memory alongside caches that software cannot control. Software-controlled data caching/prefetching and anti-caching through this new programmable datapath (the scratchpad) can make performance both higher and more predictable.

In this project, we propose a new multicore architecture with a two-level memory hierarchy (on-chip and off-chip) containing both non-volatile STT-RAM and volatile SRAM/DRAM. We will investigate the challenges this hybridized memory hardware poses for the design of system software architectures and the associated programming model for reliable big data computing. Specifically, we hope to modify the Linux kernel to provide native NVM management for use by the upper levels of the software stack, and to develop a data-centric fault-tolerant software system for reliable MapReduce-like programming. Several ideas are proposed, including:

  • The design of a persistent process repository and persistent page table in the OS kernel.
  • Multi-temperature data management across the memory hierarchy for attaining the best data locality based on exploiting the programmable on-chip scratchpad.
  • Software support for effortlessly embedding fault tolerance into an in-memory computing application.

These innovations face many challenges, including complex atomicity and consistency problems, dangling references, and the design of efficient logging and commit mechanisms. Our research will be carried out in a simulated environment based on gem5. This research could influence the future design of server systems, mobile devices, and wearable electronics (by making reliability easy to include), and could enable innovations that minimize downtime in mission-critical applications.
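As an illustration of the atomicity and logging problem mentioned above, the following is a minimal sketch of undo logging, one common way to make a group of in-place updates all-or-nothing. It is not the project's mechanism. The names `PersistentStore`, `begin`, `write`, `commit`, and `recover` are hypothetical, and ordinary Python objects stand in for NVM contents; the cache-line flushes and ordering fences that real NVM would need are not modelled.

```python
class PersistentStore:
    """Toy model of an NVM region (hypothetical, for illustration only).

    Both `data` and `log` are assumed to survive a crash; everything else
    is treated as volatile state that is lost.
    """

    def __init__(self):
        self.data = {}  # durable key/value state, updated in place
        self.log = []   # durable undo log: (key, old_value, key_existed)

    def begin(self):
        # Start a transaction with an empty undo log.
        self.log.clear()

    def write(self, key, value):
        # Record the old value BEFORE overwriting it, so that a crash
        # between the two steps can still be rolled back.
        self.log.append((key, self.data.get(key), key in self.data))
        self.data[key] = value

    def commit(self):
        # Discarding the undo log is the commit point: after this,
        # recovery has nothing to roll back.
        self.log.clear()

    def recover(self):
        # Run after a restart: undo the writes of an uncommitted
        # transaction, newest first.
        for key, old_value, existed in reversed(self.log):
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)
        self.log.clear()


def demo():
    store = PersistentStore()
    store.begin()
    store.write("a", 1)
    store.commit()

    store.begin()
    store.write("a", 2)
    store.write("b", 3)
    # Simulated crash here: commit() is never reached.
    store.recover()
    return store.data
```

In this sketch, `demo()` ends with only the committed write visible, because recovery rolls back the two writes of the interrupted transaction. Redo logging or shadow copies would be alternative designs with different read and write costs.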


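Data stored in non-volatile memory (NVM) is not lost on a power failure or outage and can be preserved permanently. We can use NVM to support program fault tolerance and ensure system reliability. Emerging NVM technologies such as phase-change memory (PCM) and spin-transfer torque magnetic RAM (STT-MRAM) have moved beyond laboratory samples and have been integrated into some products on the market. In the near future, NVM will gradually replace flash memory, see wide use across industry, and eventually replace existing SRAM/DRAM. Among these technologies, STT-MRAM is the most likely to surpass the others: it has many of the characteristics of a universal memory, including high density (scalability), unlimited read/write cycles (endurance), fast reads and writes, and low energy consumption.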

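We foresee that, until STT-MRAM fully replaces SRAM/DRAM and flash, a hybrid SRAM/DRAM and STT-MRAM memory architecture will be the most feasible and economical option. We have therefore designed a multicore architecture with a two-level (on-chip and off-chip) memory hierarchy. Each level contains both non-volatile STT-MRAM and conventional volatile SRAM/DRAM. Our design combines hardware and software: on top of the two-level hybrid memory architecture, we design a corresponding memory management system and contribute a fault-tolerant programming model for big data computing. The work covers several innovative techniques and designs: (1) using off-chip memory as a swap partition for on-chip memory, and using STT-MRAM, which is more than a thousand times faster than flash, to store system state and critical data, reducing fault-tolerance overhead and speeding up system recovery; (2) deciding where data is staged (on-chip or off-chip, DRAM or STT-MRAM) according to its access frequency at each memory level and its criticality (temperature awareness), to achieve optimal data locality; (3) building on programmable scratchpad memory to implement a software-controlled caching mechanism that divides data management into "caching" and "anti-caching" modes and separates the read/write channels of each memory level by data criticality and timeliness, so that each non-volatile memory is used as effectively as possible. These techniques and designs will be tested on the gem5 simulator.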
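The temperature-aware placement in item (2) can be pictured with a small sketch. The policy below is an assumption made for illustration, not the project's actual algorithm: it treats a page as "hot" when its access count in a sampling window reaches a made-up threshold, places hot data on-chip and cold data off-chip, and sends critical data to the non-volatile STT-RAM at that level. The names `Tier`, `PageStats`, `place`, and `HOT_THRESHOLD` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # The four places a page can live in a two-level hybrid hierarchy.
    ONCHIP_SRAM = "on-chip SRAM"
    ONCHIP_STTRAM = "on-chip STT-RAM"
    OFFCHIP_DRAM = "off-chip DRAM"
    OFFCHIP_STTRAM = "off-chip STT-RAM"


@dataclass
class PageStats:
    accesses: int   # accesses seen in the most recent sampling window
    critical: bool  # True if the data must survive a failure


# Hypothetical cut-off between "hot" and "cold"; a real system would tune it.
HOT_THRESHOLD = 100


def place(stats: PageStats, hot_threshold: int = HOT_THRESHOLD) -> Tier:
    """Pick a tier from access frequency (temperature) and criticality."""
    hot = stats.accesses >= hot_threshold
    if hot:
        # Hot data goes on-chip for locality; critical data goes non-volatile.
        return Tier.ONCHIP_STTRAM if stats.critical else Tier.ONCHIP_SRAM
    # Cold data stays off-chip; critical data still goes non-volatile.
    return Tier.OFFCHIP_STTRAM if stats.critical else Tier.OFFCHIP_DRAM
```

For example, under these assumptions `place(PageStats(accesses=500, critical=True))` selects the on-chip STT-RAM tier, while a rarely touched, non-critical page is left in off-chip DRAM. A fuller policy would also have to account for tier capacity and for STT-RAM write cost, which this sketch ignores.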

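This research aims to deliver a high-performance fault-tolerance solution with zero in-memory data loss and zero downtime, and in particular to achieve breakthroughs in techniques that minimize downtime in critical applications. The two-level hybrid memory architecture we propose will influence the design of future products such as servers, mobile devices, and portable electronic instruments.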




Last Modification: August 30, 2015, by Dr. C.L. Wang
