Academic Talk: Practical Reliability Analysis of GPGPUs in the Wild:
2019-05-31

State Key Laboratory for Novel Software Technology

     

Abstract:

General Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance within a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. Because these applications are normally long-running, investigating the characteristics of GPU errors is imperative for understanding the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Laboratory. Workload characteristics, certain GPU cards, temperature, and power consumption can be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. These findings raise the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.

Speaker Biography:

Evgenia Smirni received the Diploma degree in Computer Science and Informatics from the University of Patras, Greece, in 1987 and the Ph.D. degree in Computer Science from Vanderbilt University in 1995. She is the Sidney P. Chockley Professor of Computer Science at the College of William and Mary, Williamsburg, VA, USA. Her research interests include queuing networks, stochastic modeling, Markov chains, resource allocation policies, storage systems, data centers and cloud computing, workload characterization, models for performance prediction, and reliability of distributed systems and applications. She has served as the Program co-Chair of QEST’05, ACM Sigmetrics/Performance’06, HotMetrics’10, ICPE’17, DSN’17, SRDS’19, and HPDC’19. She also served as the General co-Chair of QEST’10 and NSMC’10. She is an ACM Distinguished Scientist.

Venue: Room 229, Computer Science and Technology Building

Time: June 11, 10:00-10:40

 

 

Copyright: Jiangsu Computer Society
ICP filing No.: 苏ICP备14049275号-1