University of Maryland Improves Heterogeneous Cluster Environment using Bright Cluster Manager

December 22, 2015

The University of Maryland, Baltimore County operates the UMBC High Performance Computing Facility (HPCF), a community-based, interdisciplinary core facility for scientific computing and research. The facility is used in a wide variety of research involving numerical simulations, statistical comparison, computational models, and atmospheric remote sensing.

The HPCF was started in 2008 by more than 20 researchers from ten academic departments and research centers from UMBC’s three colleges, and is supported by faculty contributions, federal grants, and the UMBC administration.

More than 80 projects are using or have used the HPCF for research across various disciplines. That research includes investigations into high-performance computing itself.

One investigator runs sample scripts and custom code to determine the most effective ways to make use of cluster computers. That research compares techniques for maximizing performance through CPU and memory tuning and evaluates new types of high-performance parallel processing.

Another researcher uses the cluster for atmospheric science research, analyzing more than a decade’s worth of data to perform atmospheric modeling. Other examples include numerical simulation of calcium waves in human heart cells and a visualization of early Washington, DC.

The Challenge

UMBC was seeking ways to simplify and shorten the deployment process for its HPCF. Its latest machine, a 240-node high-performance computing (HPC) cluster, began operations in 2014. It joined the two previous generations of hardware, resulting in an infrastructure with a mix of traditional CPU nodes, nodes with GPUs, and nodes with Intel Xeon Phi coprocessors. The multiple generations of hardware also meant different InfiniBand cards, two different InfiniBand speeds, and two different InfiniBand switches within the cluster.

UMBC cluster administrators were concerned that the heterogeneous nature of the hardware environment was complicating its use. “We knew that deployment and management wouldn’t be as straightforward as it is with other clusters. We wanted to be sure we had time to work with researchers on their projects on the cluster, and not just spend all of our time performing baseline maintenance,” said Damian Doyle, Director of Enterprise Infrastructure and Support in UMBC’s Division of Information Technology.

The Solution

Doyle explains that UMBC had run clusters in the past without proprietary management software, but knew that Bright Cluster Manager software would be a good choice for such a unique, heterogeneous cluster, and would make a big difference in the ease of deployment and troubleshooting down the road. When they purchased the components for the newest cluster, they knew they were going to have a hybrid environment and were concerned about the difficulty of managing multiple generations of hardware and software. They wanted to leverage Bright Cluster Manager to troubleshoot cluster issues, and make use of its built-in libraries. This would allow the admin staff to concentrate on integrating the different hardware, work with researchers to set scheduling, and tailor jobs to researchers’ project needs.

The Result

Seamless deployment and management

The UMBC administrators deployed their latest HPC cluster in 2014 with Bright Cluster Manager, and report that it allows the cluster to function optimally. Starting it up was even easier than anticipated. Administrators are convinced that without Bright Cluster Manager the learning curve would have been much steeper and the cluster would have taken far longer to put in place. Since deployment, Bright has proven valuable to the cluster administrators, who believe the UMBC cluster runs better and is more reliable than clusters without Bright.

Bright Cluster Manager makes cluster management easier and simplifies installing and updating nodes. Rather than being bogged down performing time-consuming baseline maintenance and recompiling libraries, administrators are now free to spend more time doing more productive work.

“The functionality and ease of use that Bright provides us with is fantastic,” Doyle said. “Just as we had hoped, Bright Cluster Manager takes care of managing the cluster, installing and updating nodes, compiling, and more. Because of that, the cluster administrators are free to work directly with researchers to help them get the most out of the cluster, and don’t have to worry about things like recompiling libraries and performing basic maintenance on the cluster.”

Very high level of support helps UMBC avoid errors and fix problems quickly

Perhaps the most valuable impact on the UMBC cluster has been the support Bright provides to the primary cluster administrators: patching smaller issues directly as they come up, going out of its way to help with issues that don’t technically fall within its scope, and providing in-depth support during a full upgrade of the operating system.

According to Doyle, Bright provides excellent support, even providing information about issues not directly related to Bright Cluster Manager. Bright maintains a test cluster, which has helped UMBC avoid errors and fix problems quickly. In addition, Bright experts have been easy to reach whenever problems have arisen.

After more than a year of using Bright Cluster Manager for deployment and management of their cluster, the UMBC team is sure they want to keep it. Asked by UMBC’s chief information officer about the value of continuing to use Bright, Doyle said, “What I always go back to is that Bright is an invaluable partner because of the support they provide and the reliability of their product.”

He concludes, “They’re available to troubleshoot when we need support, going above and beyond to even troubleshoot our hardware or researcher code. We know they’ve tested their product before they roll it out, and there will be no need to troubleshoot the manager itself. Bright adds value to our cluster, reducing the workload on administrators and ensuring that we’re running smoothly.”

Via CTO Vision