As semiconductor manufacturing processes scale to smaller feature sizes, manufacturing defects and wear-out-induced permanent component failures are challenging how systems are traditionally designed. Historically, a combination of careful process tuning and design rule specification has been sufficient to cost-effectively ensure that deterministic design practices eventually result in acceptable system yield and lifetime. However, as transistors and interconnect shrink, they are becoming more prone both to complete or parametric failure at manufacturing time and to early degradation and total breakdown in the field. The resulting systems are increasingly expensive to produce and less likely to function correctly for as long as intended. Stochastic design approaches are therefore needed to enable designers to quickly perform the complex modeling and analysis required to design and evaluate emerging defect- and permanent-failure-tolerant architectures.
In this talk, I will describe our recent research on the design-time architectural optimization of cost and lifetime in embedded network-on-chip-based multi-processor systems-on-chip (NoC-based MPSoCs). If a system has sufficient slack (excess computation, storage, and communication resources), then when components fail it may be possible to re-map tasks and data and re-route traffic so that performance constraints remain satisfied, averting system failure. Given a fixed NoC communication architecture, our goal is to cost-effectively perform slack allocation: distributing excess system-level resources (e.g., replacing low-performance processors with high-performance processors, or small memories with larger memories) such that when permanent component failure occurs, sufficient resources remain with high probability for the system to continue to satisfy performance constraints. Our novel and scalable slack allocation technique, Critical Quantity Slack Allocation (CQSA), uses information about the system architecture and the target application to jointly optimize lifetime and cost efficiently and effectively. As a result, CQSA is able to find a wide variety of designs that are, on average, within 1.4% of the optimal while exploring only 1.4% of the design space.
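To make the notion of slack concrete, the following is a minimal sketch of how the lifetime of a candidate allocation might be estimated by Monte Carlo simulation. It is not CQSA and is not taken from the talk. It assumes exponentially distributed processor failure times, a single scalar capacity per processor, and a first-fit-decreasing re-mapping heuristic; it ignores memories, NoC traffic, and cost. All function and parameter names (`can_remap`, `sample_lifetime`, `mean_lifetime`, `mttf`) are hypothetical.

```python
import random


def can_remap(capacities, demands):
    """Check whether all tasks fit on the surviving processors.

    Uses first-fit decreasing, a heuristic: it may reject a mapping
    that an exhaustive search would find, so the check is conservative.
    """
    free = sorted(capacities, reverse=True)
    for demand in sorted(demands, reverse=True):
        for i, cap in enumerate(free):
            if cap >= demand:
                free[i] = cap - demand
                break
        else:
            return False  # no surviving processor can host this task
    return True


def sample_lifetime(capacities, demands, mttf, rng):
    """Draw one system lifetime: the time of the first processor failure
    after which the tasks can no longer be re-mapped.

    Assumption: failure times are independent and exponential with mean mttf.
    """
    failures = sorted(
        (rng.expovariate(1.0 / mttf), i) for i in range(len(capacities))
    )
    alive = set(range(len(capacities)))
    for time, index in failures:
        alive.discard(index)
        if not can_remap([capacities[j] for j in alive], demands):
            return time
    return failures[-1][0]  # only reached when there are no tasks to place


def mean_lifetime(capacities, demands, mttf=10.0, samples=2000, seed=0):
    """Average the sampled lifetimes for one candidate allocation."""
    rng = random.Random(seed)
    total = sum(
        sample_lifetime(capacities, demands, mttf, rng) for _ in range(samples)
    )
    return total / samples


if __name__ == "__main__":
    tasks = [1.0, 1.0]
    no_slack = mean_lifetime([1.0, 1.0], tasks)    # any failure is fatal
    with_slack = mean_lifetime([2.0, 2.0], tasks)  # survives one failure
    print(f"no slack:   {no_slack:.2f}")
    print(f"with slack: {with_slack:.2f}")
```

In this toy setup, upgrading both processors lets the system tolerate one failure, so the estimated lifetime rises. A slack allocation technique would additionally have to weigh that gain against the cost of the larger components.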
Brett H. Meyer is a post-doctoral researcher in the Department of Computer Science at the University of Virginia. He received his BS in Electrical Engineering from the University of Wisconsin-Madison in 2003, and his MS and PhD in Electrical and Computer Engineering from Carnegie Mellon University in 2005 and 2009, respectively. His research is broadly focused on system-level modeling and design automation for embedded multiprocessor systems-on-chip, with an emphasis on increasing lifetime and yield in the presence of permanent component failures and manufacturing defects.