Improving System-Level Lifetime Reliability of Multicore Soft Real-Time Systems

Yue Ma, Student Member, IEEE, Thidapat Chantem, Member, IEEE, Robert P. Dick, Member, IEEE, and Xiaobo Sharon Hu, Fellow, IEEE

Abstract—This paper studies the problem of maximizing multicore system lifetime reliability, an important design consideration for many real-time embedded systems. Existing work has investigated the problem, but has neglected important failure mechanisms. Furthermore, most existing algorithms are too slow for online use, and thus cannot address runtime workload and environment variations. This paper presents an online framework that maximizes system lifetime reliability through reliability-aware utilization control. It focuses on homogeneous multicore soft real-time systems. It selectively employs a comprehensive reliability estimation tool to deal with a variety of failure mechanisms at the system level. A model-predictive controller adjusts utilization by manipulating core frequencies, thereby reducing temperature, and an online heuristic adjusts the controller sampling window length to decrease the reliability effects of thermal cycling. Experiments with a real quad-core ARM processor and a simulator demonstrate that the proposed approach improves system mean time to failure by 50% on average and 141% in the best case, compared with existing techniques.

Index Terms—Dynamic voltage and frequency scaling (DVFS), online control, real-time embedded system, system-level reliability optimization.

## I. INTRODUCTION

MULTICORE systems offer good performance with reasonable power consumption and are widely used in many real-time applications such as automotive electronics, industrial automation, and avionics. Ongoing increases in device densities and operating frequencies increase microprocessor power density and temperature. This increased chip temperature accelerates the aging of devices, resulting in permanent faults and reducing operating lifespans. Many embedded systems, e.g., automotive infotainment, operate in environments with varying temperatures, as well as varying task execution times and task periods. Online reliability optimization approaches have advantages for such systems.

System reliability is related to the reliability of each transistor, the number of transistors, the distribution of permanent failures, and the mechanisms of failure propagation [1]. For a given transistor, lifetime is temperature dependent. In general, a higher temperature increases the effects of the gradual displacement and mass transportation of the atoms in metal wires, and the deterioration of the gate dielectric layer. Changes in temperature over time also influence reliability. When integrated circuit (IC) materials with differing thermal expansion coefficients abut, thermal fluctuation produces mechanical stress, which leads to failures [2]. Reliability modeling and optimization techniques should consider these divergent wear processes.

Most existing work on temperature management focuses on improving lifetime reliability by minimizing the chip’s peak temperature [3]–[6] or increasing the quality of service given a temperature constraint [7]–[10]. However, minimizing peak temperature sometimes implies increasing thermal cycling (TC), so it may, or may not, have a net positive impact on reliability [11]. The lifetime reliability maximization problem has its own unique characteristics and requires an in-depth analysis of the temporal temperature profile, including its peaks and valleys.

Several efforts directly address increasing lifetime reliability, measured by mean time to failure (MTTF), via task mapping, scheduling [12]–[15], and dynamic voltage and frequency scaling (DVFS) [16], [17]. However, the complexity of the reliability and thermal models (often based on Monte Carlo simulations [1], [18]–[20]) results in high execution overheads, making them impractical for online use. In order to reduce overhead, a common strategy is to use a device-level MTTF model to directly replace the system-level model [12]–[15], [21]. This, however, is a potentially inaccurate approximation as the temporal fault distributions of individual devices and million-device systems differ greatly. Only system-level models can accurately determine system-level MTTF.

In this paper, we introduce an online framework called reliability-aware utilization control (RUC) to improve the lifetime reliability in the presence of wear caused by failure mechanisms that strongly depend on temperature and temperature variation, such as electromigration (EM), time-dependent dielectric breakdown (TDDB), stress migration (SM), and TC. We pay special attention to reducing the complexity of reliability control and lifetime reliability estimation, enabling online in-system reliability estimation. Our main contributions are as follows.

1) Based on the observation that increasing utilization via decreasing core frequencies reduces core temperature, we designed a model-predictive controller (MPC) that keeps system utilization at a desired value called the utilization set point. Our MPC algorithm incorporates various design considerations including real-time constraints and MTTF dependency on peak temperature and load balancing.

Digital Object Identifier 10.1109/TVLSI.2017.2669144

2) We developed a heuristic algorithm to dynamically adjust the sampling window length of the MPC, thereby influencing: a) system reliability via peak temperature and TC and b) the MPC computational overhead. Our heuristic algorithm adjusts sampling window length to balance these considerations.

3) We implemented our online framework both in a simulation environment and on a hardware board (NVIDIA’s Jetson TK1 board containing a Tegra K1 (TK1) processor with four Cortex-A15 ARM cores [22]). The simulator’s parameters are calibrated based on measurements from the TK1 board.

We conducted a large set of experiments on the hardware board and the simulator to validate our approach and compared it with two existing control methods: temperature-aware (TA) and utilization control (UC). For ten MiBench benchmarks [23], RUC improves MTTF by 50% on average and by up to 141% for processors with high power density. Improvements are small for very low power density processors.

The rest of this paper is organized as follows. We review related work in Section II. Section III introduces hardware and software models and describes the system-level reliability model. Section IV formulates the problem and provides an overview of the RUC framework. Section V describes RUC in detail. Sections VI and VII describe our experimental setup and results. Section VIII concludes this paper.

## II. RELATED WORK

Researchers have modeled system lifetime reliability in several ways. An architecture-level model was designed to calculate a processor’s lifetime due to EM, SM, TDDB, and TC [24]. The model was used to analyze the effects of CMOS technology scaling on system lifetime [2]. A statistical model and simulation methodology were used to determine multicore system-on-chip reliability [25]. System lifetime has also been estimated in the presence of simulated sequences of failures [18]. A fast and accurate Monte Carlo-based modeling framework that integrates device-, component-, and system-level models was proposed in [1].

Several papers have described using the above models and hardware to improve lifetime reliability through offline optimization strategies. For periodic tasks running on a multiprocessor system on chip (MPSoC), Huang et al. [12] proposed an analytical model to estimate the lifetime reliability of MPSoCs and a task mapping and scheduling algorithm to guard against aging effects. Based on the same model, Das et al. [14], [15] extended the work to improve the lifetime of networks-on-chip and also solve the energy–reliability tradeoff problem for multimedia MPSoCs. Based on a system-level lifetime reliability model, Zhou et al. [20] presented a unified metric combining system-level lifetime reliability and soft-error reliability, and maximized the soft-error reliability under the lifetime reliability constraint.

These approaches can improve lifetime reliability but require that tasks and their timing and power characteristics be known at design time, and remain static. Our work supports tasks with characteristics that are unknown at design time and change during system operation, using DVFS as a mechanism.

Several dynamic reliability management approaches have been developed. Hartman and Thomas [19] describe a runtime task mapping technology to maximize system-level lifetime reliability by utilizing wear sensors. However, wear sensors are very frequently unavailable or inaccessible to system software. Bolchini et al. [26] dynamically determined the most effective mapping of tasks to minimize network-on-chip energy consumption and maximize the lifetime. Mandelli et al. [27] proposed a runtime distributed energy-aware mapping technique that balances thermal distribution and improves reliability. Bolchini et al. [28] analyzed how load distribution strategies influence system lifetime reliability. Das et al. [21] proposed a machine learning-based algorithm to handle inter- and intra-application variations and reduce peak temperature and TC. All the above work improves lifetime reliability and captures variations in workload and runtime environments. However, none of them use a system-level reliability model, potentially undermining accuracy and fidelity. In this paper, we control a core’s frequency and dynamically determine the length of the control period to improve the system-level reliability of real-time multicore systems.

### III. MODELS

In this section, we present the hardware platform as well as the task and reliability models used in the RUC framework.

#### A. Hardware Platform

We focus on homogeneous multicore systems in this paper. Let $M = \{\rho_1, \rho_2, \ldots, \rho_m\}$ denote the set of $m$ cores in a given system. Power consumption is the fundamental cause of rising temperature and hence system failure. A core dissipates idle power when it is idle and consumes additional active power when it performs operations [6]. Both active and idle powers are related to the core’s frequency. Let the utilization of a core in a given time interval $\Delta t$ be $U = (\Delta t_\rho / \Delta t)$, where $\Delta t_\rho$ is the amount of time that the core executes operations [6]. Runtime temperature can be estimated using an RC thermal modeling tool or measured by thermal sensors. We use both thermal sensors and Hotspot [29], an IC thermal modeling tool, in experimental evaluations.

#### B. Task Model

We consider independent real-time tasks with soft deadlines. A task that misses its deadline is immediately terminated. Tasks are periodic and follow partitioned scheduling, i.e., they are mapped to cores statically and no migration is allowed. Furthermore, tasks can be arbitrarily preempted. Such an execution model is prevalent in real-time systems since it incurs low development cost and runtime overhead. Tasks on each core are scheduled according to a real-time scheduling policy such as earliest deadline first or rate monotonic scheduling [30].
We use $\tau_i$ to denote the $i$th task. Since all the jobs of the $i$th task have the same properties, $\tau_i$ also denotes the jobs of the $i$th task. $\tau_i$ is associated with a tuple $\{e_i, d_i\}$, where $e_i$ is the execution time and $d_i$ is the relative deadline, as well as the period. For a given duration, hereafter referred to as sampling window $k$ ($SW_k$), if core $\rho_i$’s frequency is fixed at $f_i(k)$, its utilization can be calculated as

$$U_i(k) = \sum_{\tau_j \in \Gamma(k)} \frac{e_j}{d_j} = \sum_{\tau_j \in \Gamma(k)} \frac{e_j^r + e_j^l / f_i(k)}{d_j} \tag{1}$$

where $e_j^l$ reflects the portion of job execution that is dependent on the core’s frequency and $e_j^r$ is the independent portion. $\Gamma(k)$ is the set of jobs executed on core $\rho_i$ in $SW_k$.
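The utilization model can be illustrated with a short sketch. The `Task` type, its field names, and the numeric values below are hypothetical and chosen only for illustration; the frequency-dependent portion is assumed to be expressed as work in seconds times GHz, so that dividing by a frequency in GHz yields seconds.

```python
from dataclasses import dataclass


@dataclass
class Task:
    e_indep: float  # frequency-independent execution time (s)
    e_dep: float    # frequency-dependent work (s * GHz), assumed unit
    period: float   # period, equal to the relative deadline d_j (s)


def core_utilization(tasks, freq_ghz):
    """Sum of (e_indep + e_dep / f) / d over the tasks mapped to one core."""
    if freq_ghz <= 0:
        raise ValueError("frequency must be positive")
    return sum((t.e_indep + t.e_dep / freq_ghz) / t.period for t in tasks)


# Hypothetical task set: utilization rises when the core is slowed down.
tasks = [Task(0.1, 0.6, 2.0), Task(0.05, 0.3, 2.0)]
u_at_1ghz = core_utilization(tasks, 1.0)
u_at_2ghz = core_utilization(tasks, 2.0)
```

Halving the frequency only scales the frequency-dependent term, which is why utilization is a nonlinear function of frequency but a linear function of the clock period.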

#### C. System-Level Reliability

We consider four main IC failure mechanisms in this paper: EM, TDDB, SM, and TC [24]. EM is the dislocation of metal atoms, and TDDB is the deterioration of the gate oxide layer. SM is caused by directionally biased motion of atoms in metal wires. Wear due to EM, SM, and TDDB is strongly dependent on temperature. TC refers to IC fatigue failures caused by thermal mismatch deformation. We focus on short-term TC in this paper, as it dramatically decreases reliability [11] and accelerates system failure [31]. In this paper, we propose a framework to improve system-level reliability by reducing the effects of both temperature and TC. System-level reliability is calculated by a system-level Monte Carlo reliability modeling tool [1].

Wear due to EM, SM, and TDDB is exponentially dependent on temperature. However, the wear due to TC depends on the amplitude (e.g., the difference between the peak and valley temperature), period, and maximum temperature of thermal cycles. Fig. 1 summarizes some system MTTF data obtained from the system-level reliability tool with default settings [1]. Fig. 1(a)–(c) depicts the MTTF of an example system as a function of the amplitude, period, and peak temperature of thermal cycles, respectively. As a comparison, Fig. 1(d) shows the system MTTF due to temperature alone without thermal cycles. As can be seen from Fig. 1, system MTTF generally increases at lower temperatures and with smaller thermal cycles, but the precise relationship is complicated.
resource management [33] and TA control strategies [34]. The length of $SW$ is chosen such that a core’s temperature can be considered constant within a sampling window. Before formulating the UC problem, we introduce the following notation.

- $U_i(k)$ is the utilization of core $\rho_i$ during the $k$th sampling window, $SW_k$.
- $U^s$ is the utilization set point.
- $U^\text{max}$ is the upper bound on the utilization to ensure schedulability.
- $f_i(k)$ is the frequency of core $\rho_i$ during $SW_k$.
- $f_{\min}$ and $f_{\max}$ are the lowest and highest allowed core frequencies, respectively.
- $\bar{f}(k)$ is the average frequency over all cores during $SW_k$.

We aim to solve the following problem:

$$
\min_{\rho_i \in M} \sum_{\rho_i \in M} (U^s - U_i(k))^2. \tag{2}
$$

The solution to (2) must satisfy these constraints

$$
U_i(k) \leq U^\text{max} \text{ for } \rho_i \in M, \tag{3}
$$

$$
-f_{\text{th}} \leq f_i(k) - \bar{f}(k) \leq f_{\text{th}} \text{ for } \rho_i \in M, \tag{4}
$$

$$
f_{\min} \leq f_i(k) \leq f_{\max} \text{ for } \rho_i \in M. \tag{5}
$$

The first constraint ensures that no core exceeds the schedulability bound. Based on the discussion in Section III-C, the second constraint is introduced to bound the differences in the core frequencies, which in turn bounds the core temperature differences. This constraint requires the difference between each core’s frequency and the average frequency of all cores, $\bar{f}(k)$, to be no larger in magnitude than a threshold $f_{\text{th}}$. The third constraint limits core frequencies.
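As a minimal sketch, the objective in (2) and the three constraints in (3)–(5) can be checked for a candidate frequency assignment as follows. The function names and the numeric values are hypothetical; `f_th` stands for the frequency-difference threshold in (4).

```python
def objective(utils, u_set):
    """Sum of squared deviations from the utilization set point, as in (2)."""
    return sum((u_set - u) ** 2 for u in utils)


def satisfies_constraints(utils, freqs, u_max, f_min, f_max, f_th):
    """Check constraints (3)-(5) for one sampling window."""
    f_avg = sum(freqs) / len(freqs)
    within_bound = all(u <= u_max for u in utils)               # (3)
    balanced = all(abs(f - f_avg) <= f_th for f in freqs)       # (4)
    in_range = all(f_min <= f <= f_max for f in freqs)          # (5)
    return within_bound and balanced and in_range


# Hypothetical two-core example (frequencies in GHz).
ok = satisfies_constraints([0.6, 0.7], [1.5, 1.6], 1.0, 1.2, 2.3, 0.2)
unbalanced = satisfies_constraints([0.6, 0.7], [1.2, 2.3], 1.0, 1.2, 2.3, 0.2)
cost = objective([0.6, 0.7], 0.7)
```

The second call fails only constraint (4): both frequencies are within range, but each deviates from the average by more than the threshold.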

Solving the above problem by controlling a core’s frequency reduces operating temperature under the real-time constraints. A shorter sampling window results in a lower temperature, but frequently altering a core’s frequency causes more fluctuation in operating temperature and increases the effects of TC. We tune the length of the sampling window to balance peak temperature and TC-dependent wear.

#### B. Overview of Reliability-Aware Utilization Control

Real computing systems (and especially embedded systems) frequently operate in environments with ambient temperature and task execution time variations that cannot be predicted at design time, motivating us to develop an online reliability optimization framework. This framework improves system MTTF by solving the optimization problem defined in (2)–(5). The solution sets core frequencies for the next sampling window, which are sent (see the dashed lines in Fig. 3) to each core before the start of $SW_{k+1}$. Unlike the global utilization controller (GUC), the sampling window controller (SWC) stores temperature history. Specifically, the temperatures measured in the most recent $s$ sampling windows are saved. We refer to these $s$ sampling windows as one profiling window. At the end of each profiling window, a temperature profile is generated. The SWC then analyzes this temperature profile and determines whether to increase or reduce the sampling window length for the next profiling window.

### V. RESOURCE UTILIZATION CONTROLLER DETAILS

This section presents the detailed design of the GUC and SWC. We introduce a new utilization model describing how a core’s utilization changes with frequency. Based on this model, we discuss how the GUC solves the problem defined in (2)–(5) using a controller. Finally, we elaborate on how the heuristic used in the SWC adjusts the length of the sampling window dynamically at runtime to improve system MTTF.

#### A. Dynamic Utilization Model

In order to control utilization, we must first model it. We use a dynamic model to determine utilization as a function of core frequency. We first rewrite the nonlinear utilization model in (1) in linear form

$$
u_i(k) = \sum_{\tau_j \in \Gamma(k)} \frac{e_j^r}{d_j} + \sum_{\tau_j \in \Gamma(k)} \frac{e_j^l}{d_j} \times F_i(k) \tag{6}
$$

where $F_i(k) = 1/f_i(k)$ is the clock period, which we refer to as the manipulated variable. The dependence of utilization on frequency can be modeled as

$$
u_i(k+1) = u_i(k) + g_i(k) \times \Delta F_i(k) \tag{7}
$$

where $g_i(k) = \sum_{\tau_j \in \Gamma(k)} (e_j^l/d_j)$ and $\Delta F_i(k) = F_i(k+1) - F_i(k)$. Since we have $m$ cores, we use a matrix to describe the dynamic utilization model as

$$
U(k+1) = U(k) + G(k) \times \Delta F(k). \tag{8}
$$

$G(k)$ is a diagonal matrix with $G(k)_{i,i} = g_i(k)$, $\Delta F(k) = [\Delta F_1(k), \ldots, \Delta F_m(k)]^T$, and $U(k) = [u_1(k), \ldots, u_m(k)]^T$.

Fig. 3. High-level overview of RUC.

Although (8) indicates the behavior of utilization as a function of the manipulated variable, it cannot be used directly because \( G(k) \) is only known at runtime. According to feedback control theory, if stable control can be achieved, the value of \( G(k) \) does not affect the final result [33]. Therefore, assuming that the system is stable, we can set \( G(k) \) to constant \( G \) and rewrite (8) as

\[
U(k + 1) = U(k) + G \times \Delta F(k). \tag{9}
\]
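The following sketch illustrates the linear model and the claim that a stable feedback loop reaches the set point even when the assumed gain differs from the true one. It uses a simple one-core integral-style update rather than the MPC described in this paper, and all names and numbers are hypothetical.

```python
def predict_utilization(u, g, f_now, f_next):
    """Per-core prediction U(k+1) = U(k) + G * dF(k), with dF = 1/f_next - 1/f_now."""
    return [
        ui + gi * (1.0 / fn - 1.0 / fc)
        for ui, gi, fc, fn in zip(u, g, f_now, f_next)
    ]


def closed_loop_utilization(u_set, c, g_true, g_assumed, f_start, steps):
    """Drive one core toward u_set using an assumed gain instead of the true one.

    The plant is u = c + g_true * F, where F is the clock period. The loop
    converges when 0 < g_true / g_assumed < 2.
    """
    period = 1.0 / f_start
    for _ in range(steps):
        u = c + g_true * period                # measured utilization
        period += (u_set - u) / g_assumed      # correction based on assumed gain
    return c + g_true * period


predicted = predict_utilization([0.5], [0.4], [2.0], [1.0])
final_u = closed_loop_utilization(0.7, 0.3, 0.4, 0.6, 2.0, 30)
```

In this example the assumed gain is 50% larger than the true gain, yet the tracking error shrinks by a constant factor each step, so the utilization still settles at the set point.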

#### B. Global Utilization Controller Design

We solve the constrained optimization problem defined in (2)–(5) using an MPC. The basic idea behind any MPC is to optimize a cost function. Hence, we first find similarities between the constrained optimization problem defined in (2)–(5) and an MPC cost function optimization problem.

After this, we transform the MPC cost function optimization problem to a standard quadratic programming problem and solve it using existing solvers.

Inside the \( k \)th sampling window, MPC minimizes the cost function

\[
J(k) = \sum_{i=1}^{N_1} \delta_i [U(k + i) - \xi(k + i)]^2 + \sum_{j=1}^{N_2} \lambda_j [\Delta F(k + j - 1)]^2 \tag{10}
\]

where \( N_1 \) is the prediction horizon and \( N_2 \) is the control horizon. \( \delta_i \) is the tracking error weight and \( \lambda_j \) is the control penalty weight [6]. The user-specified reference trajectory \( \xi(k + i) \) defines the ideal trajectory along which the utilization should converge to the set point.

Although this standard cost function in (10) is not our proposed optimization function in (2), the solution of the cost function is also the solution of (2). The first term in the cost function (10) is a variation of the function in (2). The second term in (10) minimizes the changes in the manipulated variable and does not affect the final result of the optimization problem. It only helps to make the control more stable. Hence, minimizing the cost function (10) under the constraints in (3)–(5) would also lead to an optimal solution to the problem defined in (2)–(5).

We propose using a quadratic programming solver to solve the optimization problem. A standard quadratic programming problem can be written as

\[
\min_{\varepsilon} \left\{ \frac{1}{2} \varepsilon^T \Omega \varepsilon + \zeta^T \varepsilon \right\}, \quad \text{s.t.} \quad \begin{cases} A_0 \times \varepsilon \leq b_0 \\ A_1 \times \varepsilon = b_1 \\ b_l \leq \varepsilon \leq b_u \end{cases} \tag{11}
\]

where \( \Omega \), \( A_0 \), and \( A_1 \) are matrices, \( \zeta \), \( b_0 \), \( b_1 \), \( b_l \), and \( b_u \) are vectors, and \( \varepsilon \) denotes the change in the manipulated variable (a core’s clock period, and hence its frequency).

Since this standard quadratic programming problem can be directly solved by existing tools, the key point in MPC design is to transform the cost function defined in (10) and the constraints defined in (3)–(5) to (11). Based on \( \varepsilon \) and the core’s frequency inside \( SW_k \), the core’s frequency at \( SW_{k+1} \) can be directly calculated. Because the solution to (10) is also the solution to (2), the optimal solution to the quadratic programming problem is also the optimal solution to the proposed optimization problem defined in (2)–(5).

We first transform the MPC cost function in (10) to the objective function of the standard quadratic programming problem. Let \( h = \max\{N_1, N_2\} \); then the cost function can be rewritten as

\[
J(k) = [Y'(k) - \Phi(k)]^T \Pi [Y'(k) - \Phi(k)] + \varepsilon^T \Theta \varepsilon \tag{12}
\]

where \( Y'(k) = [U(k + 1), \ldots, U(k + h)]^T \) and \( \Phi(k) = [\xi(k + 1), \ldots, \xi(k + h)]^T \). \( \Pi \) and \( \Theta \) are two diagonal matrices where \( \Pi_{i,i} = \delta_i \) and \( \Theta_{i,i} = \lambda_i \). Based on (9)

\[
Y'(k) = Y(k) + \Lambda \times \varepsilon \tag{13}
\]

where \( Y(k) = [U(k), \ldots, U(k)]^T \). \( \Lambda \) is a lower triangular matrix with \( \Lambda_{i,j} = G \) if \( i \geq j \) and 0 otherwise. Therefore, (10) can be simplified as

\[
J(k) = \varepsilon^T (\Lambda^T \Pi \Lambda + \Theta) \varepsilon + 2[Y(k)^T \Pi \Lambda - \Phi(k)^T \Pi \Lambda] \varepsilon + H. \tag{14}
\]

\( H \) is independent of the manipulated variable \( \varepsilon \). The cost function is thus transformed to the objective function of the standard quadratic programming problem in (11), where \( \Omega = 2(\Lambda^T \Pi \Lambda + \Theta) \) and \( \zeta^T = 2[Y(k)^T \Pi \Lambda - \Phi(k)^T \Pi \Lambda] \).
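For intuition, the transformation can be worked through in the simplest case of one core and a horizon of one, where every matrix reduces to a scalar. This is a sketch under that assumption, not the multicore formulation; the function names, weights, and bounds are hypothetical.

```python
def qp_terms(u, xi, g, delta, lam):
    """Scalar analogues of Omega and zeta for one core and horizon h = 1."""
    omega = 2.0 * (g * delta * g + lam)
    zeta = 2.0 * (u * delta * g - xi * delta * g)
    return omega, zeta


def cost(eps, u, xi, g, delta, lam):
    """Scalar cost: delta * (u + g*eps - xi)^2 + lam * eps^2."""
    return delta * (u + g * eps - xi) ** 2 + lam * eps ** 2


def best_step(u, xi, g, delta, lam, eps_lo, eps_hi):
    """Minimize 0.5*omega*eps^2 + zeta*eps subject to eps_lo <= eps <= eps_hi.

    For a convex scalar problem, clipping the unconstrained minimizer
    -zeta/omega to the box gives the constrained optimum.
    """
    omega, zeta = qp_terms(u, xi, g, delta, lam)
    return min(max(-zeta / omega, eps_lo), eps_hi)


# Hypothetical numbers: utilization 0.5, reference 0.7, gain 0.4.
eps_free = best_step(0.5, 0.7, 0.4, 1.0, 0.04, -1.0, 1.0)
eps_boxed = best_step(0.5, 0.7, 0.4, 1.0, 0.04, -1.0, 0.25)
```

The control penalty keeps the step smaller than the value that would reach the reference in a single window, which is the smoothing role the second term of the cost function plays.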

The next step is to transform the constraints to the standard form. The first constraint in (3) can be described as

\[
\Lambda \times \varepsilon \leq U^* - Y(k) \tag{15}
\]

where \( U^* = [U^{\text{max}}, \ldots, U^{\text{max}}]^T \). This is a standard form where \( \Lambda \) and \( U^* - Y(k) \) correspond to \( A_0 \) and \( b_0 \) in (11), respectively.

The second constraint in (4) enforces the even distribution of core frequencies. We express this constraint on core frequencies as a constraint on core clock periods

\[
-F_{\text{th}} \leq F_i(k) - \bar{F}(k) \leq F_{\text{th}} \tag{16}
\]

where \( \bar{F}(k) \) is the average clock period over all cores and \( F_{\text{th}} \) is the threshold related to \( f_{\text{th}} \). The clock periods of all cores over the next \( h \) sampling windows, \( F_i(k + 1), \ldots, F_i(k + h) \), can be written in stacked form as

\[
\Psi_0 = \Psi_2 + \Psi_1 \times \varepsilon \tag{17}
\]

where \( \Psi_1 \) is a block matrix whose block \( (i, j) \) is a unit matrix if \( j \leq i \), and 0 otherwise. \( \Psi_0 = [F_1(k + 1), \ldots, F_m(k + 1), \ldots, F_m(k + h)]^T \) and \( \Psi_2 = [F_1(k), \ldots, F_m(k), \ldots, F_m(k)]^T \). Based on (16) and (17), the second constraint can be expressed as

\[
(\Psi_3 \Psi_1 - \Psi_3 \Psi_5 \Psi_1) \varepsilon \leq \Psi_4 - \Psi_3 \Psi_2 + \Psi_3 \Psi_5 \Psi_2 \tag{18}
\]

where \( \Psi_3 = [E, -E]^T \), with \( E \) the unit matrix, and \( \Psi_4 = [F_{\text{th}}, \ldots, F_{\text{th}}]^T \). Every entry of \( \Psi_4 \) is \( F_{\text{th}} \) because the lower bound in (16) is rewritten as \( -(F_i(k) - \bar{F}(k)) \leq F_{\text{th}} \). \( \Psi_5 \) is a block-diagonal matrix whose diagonal blocks all equal \( \Psi_6 \), an \( m \times m \) matrix where all elements are equal to \( 1/m \).
The last constraint in (5) sets the upper and lower bounds on core frequencies. Based on (17), it can be expressed as

$$\Psi_3 \Psi_1 \varepsilon \leq \Psi_7 - \Psi_3 \Psi_2 \tag{19}$$

where $\Psi_7 = [F_{\text{max}}, \ldots, F_{\text{max}}, -F_{\text{min}}, \ldots, -F_{\text{min}}]^T$. Because the clock period is the reciprocal of the frequency, the largest allowed clock period is $F_{\text{max}} = (1/f_{\text{min}})$ and the smallest is $F_{\text{min}} = (1/f_{\text{max}})$.

After the above transformations, the problem in (2)–(5) is a standard quadratic programming problem and can be directly solved using standard solvers, e.g., COPL_QP [35] or Python’s package CVXOPT [36].

GUC aims to reduce the peak and average temperature for the entire multicore chip and to balance the temperatures across cores. Although a lower temperature tends to increase the system MTTF, the latter also depends on the amplitudes, periods, and peak and valley temperature of thermal cycles. Hence, we propose an SWC to reduce the aging effect of TC.

#### C. Sampling Window Controller Design

We design the SWC to minimize the aging effect of TC by dynamically changing the length of the sampling window, $L_{\text{sw}}$. Since a core’s frequency can change from one sampling window to the next, the length of the sampling window directly impacts TC. In general, a shorter sampling window means scaling a core’s frequency more frequently, which helps to reduce the peak temperature and amplitude of thermal cycles, but may increase the frequency of thermal cycles.

It is difficult to precisely model how the sampling window affects the system reliability, and the complicated reliability model also increases the overhead of searching for the best sampling period. Hence, to balance the impact of peak temperature against that of thermal cycles on the system MTTF, we design an efficient online heuristic based on binary search to adjust $L_{\text{sw}}$ at runtime.

Our heuristic algorithm in the SWC uses the concept of a profiling window to adjust $L_{\text{sw}}$. A profiling window is composed of $s$ equal-length sampling windows. In each profiling window, the $s$ temperature points make up a temperature profile. At the end of the profiling window, we calculate the system MTTF from the temporal temperature profile using a reliability modeling tool [1]. Since this tool assumes the input temperature profile is repeated until the chip fails, the established temperature profile must have enough temperature points. Although a larger $s$ makes the calculation of system MTTF more precise, it increases the overhead of the reliability modeling tool. In our experimental evaluations, we find that setting $s = 50$ achieves an accurate system MTTF with an acceptable overhead. We store the history of the system MTTF for the various sampling window lengths and rely on binary search to find the most appropriate $L_{\text{sw}}$ for the sampling windows in the next profiling window.

To perform binary search, we need to determine the lower and upper bounds of $L_{\text{sw}}$. We set the lower bound of $L_{\text{sw}}$ ($L_{\text{sw}}^{\text{lb}}$) to 1 s. Since the quadratic programming solver executes at the end of each sampling window, and the execution time of a typical solver, e.g., COPL_QP [35], is less than 10 ms on our platform (with ARMv7 1.23 GHz), setting $L_{\text{sw}}^{\text{lb}}$ to 1 s keeps the overhead due to the GUC under 1% of the sampling window length. To determine the upper bound, $L_{\text{sw}}^{\text{ub}}$, we consider two factors. First, since the GUC expects a constant temperature in a given sampling window, the sampling window length must be smaller than some constant, say $C_{\text{HW}}$, which depends on the hardware platform: a low power density chip leads to a large $C_{\text{HW}}$ and vice versa. Second, since thermal cycles can essentially be avoided if $L_{\text{sw}}$ equals the hyperperiod ($HP$) of the periodic task set, any length larger than the hyperperiod increases temperature and TC. Hence, we need to consider only sampling window lengths less than or equal to the hyperperiod. As a result, we set $L_{\text{sw}}^{\text{ub}} = \min(HP, C_{\text{HW}})$.
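The bounds can be computed as in the sketch below. Periods are given as integer milliseconds so that the hyperperiod is an exact least common multiple; the function names, the task periods, and the value used for $C_{\text{HW}}$ are hypothetical.

```python
from functools import reduce
from math import gcd


def hyperperiod_ms(periods_ms):
    """Least common multiple of the task periods (integer milliseconds)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods_ms)


def sampling_window_bounds(periods_ms, c_hw_ms, lower_ms=1000):
    """Return (lower, upper) bounds on the sampling window length in ms.

    The lower bound defaults to 1 s; the upper bound is min(HP, C_HW).
    """
    upper_ms = min(hyperperiod_ms(periods_ms), c_hw_ms)
    if upper_ms < lower_ms:
        raise ValueError("upper bound is below the lower bound")
    return lower_ms, upper_ms


# Hypothetical periods of 1.6 s, 2.0 s, and 2.4 s; hypothetical C_HW of 20 s.
hp = hyperperiod_ms([1600, 2000, 2400])
bounds = sampling_window_bounds([1600, 2000, 2400], 20000)

# Solver overhead at the lower bound: 10 ms out of a 1000 ms window.
overhead_fraction = 10 / 1000
```

With these numbers the hardware constant, not the hyperperiod, determines the upper bound.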

The pseudocode for the SWC is given in Algorithm 1. Lines 1 and 2 initialize the needed variables, where $L_{\text{sw}}^t$ is the length of the current sampling window. The infinite while loop (lines 3–25) allows the algorithm to run throughout system lifetime. At the end of each sampling window, the temperature of each core is read from the thermal sensors (TSs) (lines 5–7). At the end of the $j$th profiling window ($i = s$), a temperature profile, $TP(j)$, is constructed based on the collected temperature of the $m$ cores in the current $s$ sampling windows (line 10). The corresponding MTTF, $MTTF_j$, is obtained using a reliability modeling tool [1] (line 11). $L_{\text{sw}}^t$ is next determined for the sampling windows in the upcoming profiling window (lines 12–23). $L_{\text{sw}}^t$ is set to $(L_{\text{sw}}^{\text{lb}} + L_{\text{sw}}^{\text{ub}})/2$ at the second sampling window (lines 12 and 13), and is then dynamically changed based on the comparison of MTTFs (lines 15–20). $L_{\text{sw}}^{\text{lb}}$ and $L_{\text{sw}}^{\text{ub}}$ are updated, and $L_{\text{sw}}^t$ converges to the most appropriate value. Finally, $i$ and $j$ are updated to indicate the beginning of the next profiling window (line 22).
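The sketch below shows one way a bound-narrowing search over the sampling window length could look. It is not a transcription of Algorithm 1: it assumes that MTTF is a unimodal function of the window length, and `measure_mttf` is a hypothetical callback standing in for running one profiling window at a given length and evaluating the resulting temperature profile with the reliability tool.

```python
def search_window_length(measure_mttf, lb, ub, probe_step, tol):
    """Narrow [lb, ub] toward the window length with the highest MTTF.

    Each iteration compares the MTTF at the midpoint with the MTTF at a
    slightly longer window and discards the half that looks worse.
    """
    while ub - lb > tol:
        mid = (lb + ub) / 2.0
        probe = min(mid + probe_step, ub)
        if measure_mttf(probe) > measure_mttf(mid):
            lb = mid   # MTTF still rising: the best length is longer
        else:
            ub = mid   # MTTF falling: the best length is shorter
    return (lb + ub) / 2.0


def toy_mttf(length_s):
    """Synthetic stand-in for the reliability tool, peaking at 6 s."""
    return 100.0 - (length_s - 6.0) ** 2


best_length = search_window_length(toy_mttf, 1.0, 20.0, 0.1, 0.2)
```

Each call to `measure_mttf` corresponds to one full profiling window in a real system, so the number of comparisons, rather than the arithmetic, dominates the time the search needs to settle.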

### VI. EXPERIMENTAL SETUP

To evaluate the proposed RUC framework, we conducted experiments to compare the proposed framework with two existing approaches. In this section, we present the platforms, benchmarks, and the existing frameworks used for comparison in our experiments.

#### A. Benchmarks

A total of ten benchmarks were chosen from MiBench [23], which include different tasks from automotive, network, office, security, telecommunication, and consumer applications. We measured the execution times of the benchmarks on a quad-core ARM Cortex-A15 chip, Nvidia TK1 [22], when a core’s frequency is 1.24 GHz, and then assigned them to appropriate cores (see Table I). Note that core 0 is reserved for the operating system. Based on the execution time data, we designed two groups of task setups. In the first group, tasks are frame based and share the same period and deadline. We measure the system-level MTTF for deadlines of 1.6, 1.8, 2.0, 2.2, and 2.4 s. In the second group, each task’s deadline and period are drawn at random from the ranges 1.2–1.6, 1.6–2.0, 2.0–2.4, and 2.4–2.8 s.

#### B. Comparison Targets

We compared our proposed RUC with two existing frameworks: pure UC [33] and TA control [6], [9]. In contrast with RUC, the length of sampling window in UC is fixed. In general, it is set to 1 s [33]. In addition, since UC is not designed for lifetime reliability, it does not bound the differences in core frequencies. TA has a similar design to UC, but the goal is to control the temperature to a specific set point, which is lower than the maximum temperature allowed by the chip. Hence, TA is an effective way to guarantee that the operating temperature is safe.

Two metrics are considered in the comparison. Reliability is quantified by the system MTTF, which is obtained using the system-level reliability modeling tool [1]. Real-time performance is quantified as the percentage of feasible solutions (FSs), which is the ratio of the number of tasks satisfying their real-time requirements to the total number of tasks.
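The real-time metric is a simple ratio, sketched below. The function name and the sample finish times are hypothetical; a task counts as feasible when it finishes no later than its deadline.

```python
def feasible_solution_ratio(finish_times, deadlines):
    """Fraction of tasks that complete by their deadlines."""
    if len(finish_times) != len(deadlines) or not deadlines:
        raise ValueError("need one finish time per deadline")
    met = sum(1 for f, d in zip(finish_times, deadlines) if f <= d)
    return met / len(deadlines)


# Hypothetical run: two of three tasks meet a 2.0 s deadline.
fs = feasible_solution_ratio([1.0, 2.5, 1.9], [2.0, 2.0, 2.0])
```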

#### C. Experimental Platforms

We conducted experiments on two platforms: an ARMv7 chip and a simulator. The first experimental platform is a quad-core Cortex-A15 ARM chip: Nvidia’s TK1 [22]. This Nvidia TK1 SoC is designed for mobile and automotive applications. The quad-core ARM Cortex-A15 CPU and 192 Kepler GPU cores provide high performance with low power requirements. As a low-power chip, TK1 supports 12 different frequencies from 1.24 to 2.32 GHz, but all cores are required to have the same frequency. Except for the primary core (core 0), all cores can be powered on and off dynamically. There are 11 thermal sensors to sample the CPU, GPU, and other component temperatures every 50 ms. However, the default interface only provides one CPU temperature for all CPU cores. TK1 is shipped with an operating system based on Ubuntu 14.04 LTS. Hence, most desktop-level benchmarks can be directly executed on TK1.

In order to evaluate RUC on different platforms, we constructed a hardware platform simulator. We extracted the parameters from TK1 to build a power model. In the simulator, the temperature is obtained using Hotspot, a thermal modeling tool [29]. The simulator is validated by comparison with the TK1.

We first obtained the parameter values for Hotspot. Since TK1 reports only one temperature for all cores, we assumed that all cores have the same thermal capacitance, C, and thermal resistance, R. We used no fan when measuring the temperature. The temperature profile for four fully loaded cores running at the highest frequency is shown in Fig. 4. Note that due to power variation and other uncontrollable parameters, e.g., small changes in ambient temperature, the obtained temperature profile fluctuated slightly. In this configuration, the chip’s power is 6.98 W [37], so R can be estimated as 10.24 °C/W. For the same reasons, we estimated C to be in the range of 2.34–7.34 J/°C. The capacitance used in Hotspot is randomly generated in this range.
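To make the roles of R and C concrete, the following sketch uses a single-node lumped RC model, which is a simplification of the thermal network that Hotspot solves. It assumes the steady-state rise equals the product of power and R and that heating starts from ambient; the 25 °C ambient temperature and the function names are hypothetical, and the text does not state which reference temperature was used when estimating R.

```python
import math

R_TK1 = 10.24   # thermal resistance estimated in the text, in degC/W
P_FULL = 6.98   # chip power with four fully loaded cores, in W [37]


def steady_state_rise(power_w, r_c_per_w=R_TK1):
    """Steady-state temperature rise of a lumped model: delta_T = P * R."""
    return power_w * r_c_per_w


def temperature(t_s, power_w, c_j_per_c, t_amb_c=25.0, r_c_per_w=R_TK1):
    """Step response of a single-node RC model that starts at ambient."""
    tau_s = r_c_per_w * c_j_per_c  # thermal time constant R * C
    rise = steady_state_rise(power_w, r_c_per_w)
    return t_amb_c + rise * (1.0 - math.exp(-t_s / tau_s))


if __name__ == "__main__":
    # Both ends of the estimated capacitance range, 2.34 and 7.34 J/degC.
    for c in (2.34, 7.34):
        print(c, round(R_TK1 * c, 2), round(temperature(60.0, P_FULL, c), 2))
```

In this simplified model, the estimated capacitance range corresponds to time constants of roughly 24 to 75 s, so a larger C only slows the approach to the same steady state.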

Based on R and C, we determined the power model for the simulator. Let $P_{\text{act}}(f)$ be the core’s active power at frequency $f$, and let $P_{\text{oth}}(f)$ be the power that is independent of core utilization but related to core frequency. $P_{\text{oth}}(f)$ includes the core’s idle power and other components’ power, e.g., GPU and memory. The average power ($\bar{P}$) can be calculated as

$$\bar{P} = U \times P_{\text{act}}(f) + P_{\text{oth}}(f). \quad (20)$$
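As a minimal sketch, (20) can be evaluated for a single core by looking up the two parameters in Table II. The function name `average_power` is ours, utilization is taken as a fraction between 0 and 1, and the tabulated values are assumed here to be in watts.

```python
# Table II: frequency (GHz) -> (P_act, P_oth)
TABLE_II = {
    1.24: (0.14648, 2.24609),
    1.33: (0.19124, 2.26847),
    1.43: (0.22108, 2.31595),
    1.53: (0.23600, 2.38851),
    1.63: (0.26991, 2.48684),
    1.73: (0.37164, 2.51329),
    1.84: (0.46251, 2.54245),
    1.94: (0.53033, 2.56822),
    2.01: (0.59001, 2.58993),
    2.12: (0.69851, 2.61163),
    2.22: (0.86806, 2.65102),
    2.32: (1.11626, 2.64147),
}


def average_power(utilization, freq_ghz):
    """Average power from (20): U * P_act(f) + P_oth(f)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    p_act, p_oth = TABLE_II[freq_ghz]
    return utilization * p_act + p_oth


if __name__ == "__main__":
    for f in (1.24, 2.32):
        print(f, round(average_power(0.7, f), 5))
```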

The values of $P_{\text{act}}(f)$ and $P_{\text{oth}}(f)$ are extracted from measurements on TK1. It is difficult to measure TK1 power consumption directly. Instead, we used steady-state temperatures to determine the values of $P_{\text{oth}}(f)$ and $P_{\text{act}}(f)$. Fig. 5(a) shows the steady-state temperature with different numbers of fully loaded cores running at different frequencies. Fig. 5(b) depicts the

---

**Table I: Benchmarks**

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Application</th>
<th>Mapping to</th>
<th>Execution time</th>
</tr>
</thead>
<tbody>
<tr>
<td>bitcount</td>
<td>Automotive</td>
<td>Core 2</td>
<td>960 ms–969 ms</td>
</tr>
<tr>
<td>susan</td>
<td>Automotive</td>
<td>Core 1</td>
<td>129 ms–151 ms</td>
</tr>
<tr>
<td>qsort</td>
<td>Automotive</td>
<td>Core 1</td>
<td>176 ms–195 ms</td>
</tr>
<tr>
<td>dijkstra</td>
<td>Network</td>
<td>Core 3</td>
<td>130 ms–141 ms</td>
</tr>
<tr>
<td>patricia</td>
<td>Network</td>
<td>Core 3</td>
<td>445 ms–459 ms</td>
</tr>
<tr>
<td>stringsearch</td>
<td>Office</td>
<td>Core 2</td>
<td>3 ms–10 ms</td>
</tr>
<tr>
<td>blowfish</td>
<td>Security</td>
<td>Core 1</td>
<td>1352 ms–1369 ms</td>
</tr>
<tr>
<td>sha</td>
<td>Security</td>
<td>Core 2</td>
<td>54 ms–60 ms</td>
</tr>
<tr>
<td>crc32</td>
<td>Telecommunication</td>
<td>Core 3</td>
<td>843 ms–874 ms</td>
</tr>
<tr>
<td>cjpeg</td>
<td>Consumer</td>
<td>Core 3</td>
<td>16 ms–20 ms</td>
</tr>
</tbody>
</table>
Fig. 5. Steady-state temperature with (a) different numbers of fully loaded cores running at different frequencies and (b) one core running at different frequencies and utilizations.

TABLE II
PARAMETERS IN POWER MODEL

<table>
<thead>
<tr>
<th>Frequency (GHz)</th>
<th>$P_{\text{act}}(f)$</th>
<th>$P_{\text{oth}}(f)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.24</td>
<td>0.14648</td>
<td>2.24609</td>
</tr>
<tr>
<td>1.33</td>
<td>0.19124</td>
<td>2.26847</td>
</tr>
<tr>
<td>1.43</td>
<td>0.22108</td>
<td>2.31595</td>
</tr>
<tr>
<td>1.53</td>
<td>0.23600</td>
<td>2.38851</td>
</tr>
<tr>
<td>1.63</td>
<td>0.26991</td>
<td>2.48684</td>
</tr>
<tr>
<td>1.73</td>
<td>0.37164</td>
<td>2.51329</td>
</tr>
<tr>
<td>1.84</td>
<td>0.46251</td>
<td>2.54245</td>
</tr>
<tr>
<td>1.94</td>
<td>0.53033</td>
<td>2.56822</td>
</tr>
<tr>
<td>2.01</td>
<td>0.59001</td>
<td>2.58993</td>
</tr>
<tr>
<td>2.12</td>
<td>0.69851</td>
<td>2.61163</td>
</tr>
<tr>
<td>2.22</td>
<td>0.86806</td>
<td>2.65102</td>
</tr>
<tr>
<td>2.32</td>
<td>1.11626</td>
<td>2.64147</td>
</tr>
</tbody>
</table>

Fig. 6. Temperature readings from thermal sensors and estimated by Hotspot for various core frequencies.

steady-state temperatures for one core running at different frequencies and utilizations. Note that since the resolution of the thermal sensors is 0.5 °C, manipulating a core’s frequency may not change the measured steady-state temperature, especially when the workload is light. Based on these measurements, the power values are derived and summarized in Table II.
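One plausible way to turn steady-state temperatures into the two power parameters is sketched below. It is not necessarily the procedure used for Table II: it assumes a lumped model in which power equals the temperature rise over a known ambient divided by R, and then fits the straight line in (20) by least squares over utilization. The function name, the ambient temperature, and the synthetic data are hypothetical.

```python
def fit_power_params(utilizations, steady_temps_c, r_c_per_w, t_amb_c):
    """Least-squares fit of P = U * P_act + P_oth at one frequency.

    Power is inferred from each steady-state temperature as
    (T - T_amb) / R, which is an assumption of this sketch.
    """
    n = len(utilizations)
    if n != len(steady_temps_c) or n < 2:
        raise ValueError("need at least two (utilization, temperature) pairs")
    powers = [(t - t_amb_c) / r_c_per_w for t in steady_temps_c]
    mean_u = sum(utilizations) / n
    mean_p = sum(powers) / n
    sxx = sum((u - mean_u) ** 2 for u in utilizations)
    if sxx == 0.0:
        raise ValueError("utilizations must not all be equal")
    sxy = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilizations, powers))
    p_act = sxy / sxx                  # slope of the fitted line
    p_oth = mean_p - p_act * mean_u    # intercept at zero utilization
    return p_act, p_oth


if __name__ == "__main__":
    # Synthetic, noise-free data generated from P_act = 0.5 and P_oth = 2.0.
    utils = [0.0, 0.25, 0.5, 0.75, 1.0]
    temps = [25.0 + 10.24 * (u * 0.5 + 2.0) for u in utils]
    print(fit_power_params(utils, temps, 10.24, 25.0))
```

With real sensor data, the 0.5 °C resolution noted above limits how precisely the slope can be recovered at light workloads.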

In order to validate this power model, we compared the temperature readings from the thermal sensors with those estimated by Hotspot (see Fig. 6). As can be seen from Fig. 6, the parameter values that we obtained result in accurate temperature estimation. The average temperature difference between TK1 and the simulator is less than 2 °C, and Fig. 7 shows that this small difference does not impact the system-level MTTF. The data in Fig. 7 indicate that RUC has a similar performance on both TK1 and the simulator. The maximum difference is about 2%, and the average is less than 1%. These experiments demonstrate that the extracted parameter values make the simulator sufficiently accurate.

VII. RELIABILITY-AWARE UTILIZATION CONTROL EVALUATION

In this section, we examine the performance of the proposed RUC compared with the UC and TA control approaches.

A. Experiments on the Tegra Chip

We compared the proposed RUC with UC and TA to determine whether RUC can improve lifetime reliability without sacrificing real-time performance. For RUC and UC, the utilization set point is close to the schedulability boundary, which was set to 70%. We used a temperature set point of 70 °C in TA, which is lower than the peak operating temperature.

We first studied the overhead of the solver for the optimization problem defined in (2)–(5). If all four cores are required to have the same frequency, as on TK1 [22], a brute-force search solver takes less than 1 ms to find the optimal solution. This overhead is tolerable in our RUC. However, if cores can run at different frequencies, the overhead of brute-force search increases to 74 ms. In this case, brute-force search is impractical. The overhead can be reduced if we use standard solvers, e.g., COPL_QP [35] or CVXOPT [36]. The overhead of COPL_QP is about 4 ms, and that of CVXOPT is about 25 ms. Both overheads are small, and we can choose one of the solvers offline.
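The gap between the two brute-force cases follows from the size of the search space, as the following sketch illustrates. It does not implement the problem in (2)–(5); the cost function is a hypothetical stand-in that penalizes the squared distance of each core’s utilization from the set point, and utilization is assumed to scale inversely with frequency relative to 1.24 GHz. The frequency levels are those listed in Table II.

```python
from itertools import product

FREQS_GHZ = (1.24, 1.33, 1.43, 1.53, 1.63, 1.73,
             1.84, 1.94, 2.01, 2.12, 2.22, 2.32)
REF_GHZ = 1.24


def tracking_cost(freqs, ref_utils, set_point):
    """Hypothetical cost: squared utilization error summed over cores."""
    return sum((u * REF_GHZ / f - set_point) ** 2
               for u, f in zip(ref_utils, freqs))


def brute_force(ref_utils, set_point=0.7, shared=False):
    """Exhaustively search frequency assignments.

    shared=True  -> all cores use one frequency (12 candidates).
    shared=False -> per-core frequencies (12 ** n candidates).
    """
    n = len(ref_utils)
    if shared:
        candidates = ((f,) * n for f in FREQS_GHZ)
    else:
        candidates = product(FREQS_GHZ, repeat=n)
    return min(candidates,
               key=lambda fs: tracking_cost(fs, ref_utils, set_point))


if __name__ == "__main__":
    utils_at_ref = [0.843, 0.514, 0.732]  # hypothetical loads at 1.24 GHz
    print(brute_force(utils_at_ref, shared=True))
    print(brute_force(utils_at_ref, shared=False))
```

With a shared frequency there are only 12 candidates, whereas four independently clocked cores give 12^4 = 20 736 combinations, which is consistent with the much larger brute-force overhead reported above.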

Fig. 8 shows the experimental results when tasks are frame based. RUC achieves a higher MTTF than UC and TA in all the cases. Thanks to the proposed sampling window control, when compared with UC, RUC improves the MTTF by 9.4% on average, and up to 32%. TA keeps the runtime temperature at 70 °C, and the corresponding MTTF is much smaller than those of RUC and UC when the workload is light. RUC doubles the MTTF when compared with TA. In terms of real-time performance, all the tested frameworks have a large and similar percentage of FSs. Since RUC may increase the length of the sampling window and reduce the DVFS frequency, it may lead to more tasks missing their deadlines. However, its percentage of FS is only 2% smaller than that of TA, and is also close to 100%. In most soft real-time systems, this is tolerable [33].

In Fig. 9, each task’s deadline is equal to its period, which is randomly generated in different ranges for different tasks. The average MTTF improvement of RUC over UC is 6.4%, and up to 9.5%. Compared with TA, this improvement increases to over 100%. For real-time performance, RUC does not have the highest percentage of FS, but the small loss in the number of FS does not limit its application in many soft real-time systems.

B. Simulation Results

Leveraging the simulator, we conducted two sets of experiments to further assess the pros and cons of RUC in a more general setting. Tasks were assigned to appropriate cores, and each task’s execution time is the average of its measured execution times on TK1 (see Table I). We extended the simulator so that different cores can operate at different frequencies. In the first set of experiments, we assumed that the chip has the same low power density as TK1. We increased the power density in the second set of experiments.

All the above experiments demonstrate that RUC has a better system reliability performance. However, for a low-power chip, different core frequencies do not lead to significant power changes, which weakens the effect of DVFS and also the sampling window control. For this kind of chip, the MTTF improvement of RUC over UC is less than 32%.

In order to study the effectiveness of RUC on a high power density chip, we used the same simulator as in the previous experiments, but increased the chip’s power density by 3×. We assumed that the ConnectLand 1504002 heatsink [38] was used. With this heatsink, the thermal resistance is estimated to be 7.02 °C/W, and the thermal capacitance is in the range of 1.92–25.87 J/°C. For RUC and UC, we used the same solver, CVXOPT [36], to solve the quadratic programming problem.

The experimental results are shown in Fig. 12. On average, RUC improves the MTTF by 39.5% for frame-based tasks and by 54.9% when tasks’ periods are randomly generated. At higher power density, reducing core frequency dramatically reduces runtime temperature. At the same time, frequent changes in DVFS settings introduce more temperature fluctuations, which increase aging from TC. Hence, sampling window control in RUC benefits lifetime reliability. RUC can achieve a more stable temperature profile than UC, which is one of the most important factors that positively impact the system-level MTTF. RUC and UC exhibit real-time performance similar to that in the previous experiments. The percentage of FS is close to 100%. RUC allows most tasks to complete before their deadlines even when the workload is heavy.

VIII. CONCLUSION

We proposed an RUC framework to maximize the lifetime of multicore systems under soft real-time constraints. Based on analysis of the influence of operating temperature on TC and system-level lifetime reliability, we designed an online framework that lowers peak temperature, balances the temperature differences among cores, and reduces TC. Our framework uses an MPC to reduce peak temperature by adjusting core frequencies and voltages within each sampling window. A heuristic was developed to dynamically adapt the sampling window length to reduce TC and improve reliability. We conducted experiments with different types of tasks on multiple platforms, including a quad-core ARM chip and a simulator, and considered chips with different power densities. The results revealed that, on all the platforms and with a variety of setups, our approach was effective in increasing the lifetime of soft real-time systems compared with existing TA and utilization control approaches.

REFERENCES


Yue Ma (S’16) received the B.S. degree from the Chengdu University of Technology, Chengdu, China, and the M.S. degree from the University of Electronic Science and Technology of China, Chengdu. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA. His current research interests include real-time embedded systems, reliable system design, power efficiency, and temperature-aware resources management.

Thidapat Chantem (S’05–M’11) received the bachelor’s degree from Iowa State University, Ames, IA, USA, in 2005, and the master’s and Ph.D. degrees from the University of Notre Dame, Notre Dame, IN, USA, in 2011. She is an Assistant Professor of Electrical and Computer Engineering at Virginia Tech, Blacksburg, VA, USA. Her current research interests include real-time embedded systems, energy-aware and thermal-aware system-level design, cyber-physical system design, and intelligent transportation systems.

Robert P. Dick (S’95–M’02) received the B.S. degree from Clarkson University, Potsdam, NY, USA, in 1996, and the Ph.D. degree from Princeton University, Princeton, NJ, USA, in 2002. He was a Visiting Researcher at the NEC Labs, Princeton, NJ, USA, in 1999, a Visiting Professor at the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2002, and was on the faculty of Northwestern University, Evanston, IL, USA, from 2003 to 2008. He is currently an Associate Professor of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor, MI, USA.

Dr. Dick was a recipient of the NSF CAREER Award and won his department’s Best Teacher of the Year Award in 2004. In 2007, his technology won a Computerworld Horizon Award and his paper was selected as one of the 30 in a special collection of DATE papers appearing during the past 10 years. His 2010 work won the Best Paper Award at DATE. He served as the Technical Program Committee Co-Chair of the 2011 International Conference on Hardware/Software Codesign and System Synthesis, as an Associate Editor of the IEEE TRANSACTIONS ON VLSI SYSTEMS, and as a Guest Editor of ACM Transactions on Embedded Computing Systems. He is also the Chief Executive Officer of Stryd, Inc., Boulder, CO, which produces wearable electronics for athletes.

Xiaobo Sharon Hu (S’85–M’89–SM’02–F’16) received the B.S. degree from Tianjin University, Tianjin, China, the M.S. degree from the New York University Tandon School of Engineering, Brooklyn, NY, USA, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA. She is currently a Professor with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA. She has authored more than 250 papers in the related areas. Her current research interests include real-time embedded systems, low-power system design, and computing with emerging technologies. Dr. Hu served as an Associate Editor of the IEEE TRANSACTIONS ON VLSI, ACM Transactions on Design Automation of Electronic Systems, and ACM Transactions on Embedded Computing. She is the Program Chair of the 2016 Design Automation Conference (DAC) and the TPC Co-Chair of the 2014 and 2015 DAC. She was a recipient of the NSF CAREER Award in 1997, and the Best Paper Award from the Design Automation Conference in 2001 and the IEEE Symposium on Nanoscale Architectures in 2009.