Summary: | HTCondor has been widely adopted by HEP clusters to provide high-performance scheduling. Unlike other schedulers, however, HTCondor provides only loose management of the worker nodes. We developed a maintenance automation tool, “HTCondor MAT”, that focuses on dynamic resource management and automatic error handling. A central database records the information for every worker node, and this information is sent to each worker node to generate its startd configuration. If an error occurs on a worker node, the node information stored in the database is updated and the worker node is reconfigured with the new information. The new configuration stops the startd from accepting jobs related to the error until the worker node recovers. The MAT has been deployed in the IHEP HTC cluster, where it provides a central way to manage the worker nodes and automatically removes the impact of worker node errors.
|
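The reconfiguration loop described in the summary can be illustrated with a minimal sketch. It is not the MAT implementation. It assumes that the node record tracks a set of failed services and that jobs advertise a hypothetical ClassAd attribute named RequiredService; the names NodeRecord, render_start_expression, report_error, report_recovery, and the hostname wn001 are likewise invented for illustration. Only the START configuration macro and the ClassAd =!= operator come from HTCondor itself.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """One entry of the central node database (hypothetical schema)."""
    hostname: str
    # Services currently failing on this node, e.g. a file system mount.
    failed_services: set = field(default_factory=set)


def render_start_expression(node: NodeRecord) -> str:
    """Build a startd START line that rejects jobs needing a failed service."""
    if not node.failed_services:
        # Healthy node: accept every job.
        return "START = True"
    # RequiredService is an assumed job attribute, not a built-in one.
    # =!= is the ClassAd "is not identical to" operator, so jobs that
    # leave the attribute undefined are still accepted.
    clauses = [
        f'(TARGET.RequiredService =!= "{service}")'
        for service in sorted(node.failed_services)
    ]
    return "START = " + " && ".join(clauses)


def report_error(node: NodeRecord, service: str) -> str:
    """Record a failure in the node record and return the new configuration."""
    node.failed_services.add(service)
    return render_start_expression(node)


def report_recovery(node: NodeRecord, service: str) -> str:
    """Clear a failure in the node record and return the new configuration."""
    node.failed_services.discard(service)
    return render_start_expression(node)


if __name__ == "__main__":
    node = NodeRecord("wn001")
    print(report_error(node, "lustre"))
    print(report_recovery(node, "lustre"))
```

In a real deployment the returned line would be written to the worker node's configuration and applied with a reconfiguration of the startd; how the MAT transports and applies it is not specified in the summary.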