TY - JOUR
T1 - Fault-tolerant parallel scheduling of tasks on a heterogeneous high-performance workstation cluster
AU - Kwok, Yu Kwong
N1 - Funding Information:
The author would like to sincerely thank the reviewers for their constructive and insightful comments, which improve the paper considerably. This research was supported by a competitive earmarked research grant from the Research Grants Council of the Hong Kong SAR under project number HKU 7124/99E, a research initiation grant from the HKU CRCG under contract number 10202518, and a research grant from the HKU URC under contract number 10203010. A preliminary version of portions of this article appeared in the proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’99), November 1999, Boston, USA.
PY - 2001/7
Y1 - 2001/7
N2 - We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., using an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search which works by refining multiple different initial schedules simultaneously using different workstations. The workstations communicate periodically to exchange their best solutions found thus far in order to direct the search to more promising regions in the search space. Heterogeneity of machines is exploited by the biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to other workstations so that the search can continue. We have implemented the CBS algorithm as a core function of our on-going development of SSI middleware for a Sun workstation cluster.
AB - We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., using an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search which works by refining multiple different initial schedules simultaneously using different workstations. The workstations communicate periodically to exchange their best solutions found thus far in order to direct the search to more promising regions in the search space. Heterogeneity of machines is exploited by the biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to other workstations so that the search can continue. We have implemented the CBS algorithm as a core function of our on-going development of SSI middleware for a Sun workstation cluster.
KW - Cluster computing
KW - Fault-tolerant scheduler
KW - Heterogeneous systems
KW - Neighborhood search
KW - Parallel algorithms
KW - Task graphs
UR - http://www.scopus.com/inward/record.url?scp=0035392842&partnerID=8YFLogxK
U2 - 10.1023/A:1011186732749
DO - 10.1023/A:1011186732749
M3 - Article
AN - SCOPUS:0035392842
SN - 0920-8542
VL - 19
SP - 299
EP - 314
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 3
ER -