Cluster Computing At Home
High performance computations using a cluster of compute nodes can be prohibitively expensive. Even with the various cloud Platform-as-a-Service (PaaS) options, spinning up a cluster can become cost prohibitive for an individual, an open source project, or a non-profit. Raspberry Pis, on the other hand, are inexpensive, fully functional computing boards that can be combined to build a cluster of your own.
For his term paper in a Parallel Programming graduate course, BJ built a small 4-node cluster using the most recent, as of 2021, Raspberry Pi model. The Raspberry Pi 4 Model B he used contains 8GB of RAM and a 1.5 GHz quad-core ARM processor. After building the Raspberry Pi cluster, he compared it to 4 nodes of a larger, more expensive university cluster. For testing the clusters, BJ modified an md5 hash cracking algorithm to use a hybrid model of parallelism combining MPI and OpenMP. He tested each cluster at 2, 3, and 4 nodes, with 1–4 threads per node, using hashed strings of four-, five-, and six-character length.
Both the Raspberry Pi cluster and the university cluster showed similar behavior when comparing string sizes across the different node and thread combinations. In tests with fewer nodes per cluster, the more powerful university cluster outperforms the Raspberry Pi cluster. However, as more nodes are added, the Raspberry Pi cluster showed similar speeds to the university cluster, even outperforming it in a few places.
BJ enjoyed building and testing this Raspberry Pi cluster. It is an example of some of the fun things we get to do when learning in the field. It did turn out to be a little more expensive than expected, mainly because he chose to use the top Raspberry Pi 4 with 8GB RAM. That decision came from BJ's desire to reuse the Raspberry Pis in the cluster this summer to build other projects. He could have saved money and still accomplished the hybrid parallelization using less RAM or even a Raspberry Pi 3. Whether you want to build the best or save money, use the information here as an example of the many fun things we get to do as programmers. If building a cluster computer or playing with Raspberry Pis doesn't inspire you, find something that does to help you find joy in learning within the field.
Episode Breakdown
Background Information
Cluster Computing
Cluster computing involves distributing computational workload across a group of two or more interconnected computers. Each connected computer, or node, in the cluster is an independent, self-contained system that may contain a single-core or multi-core processor. A node can function as a complete computer containing its own operating system, memory, and input/output (I/O) drivers.
Typically clusters use distributed memory, as each node in the cluster has its own memory, though the internal architecture may vary from node to node as to how memory is shared between cores in a multiprocessor node. It is also possible to mimic shared memory in a cluster using a distributed shared memory architecture. In addition to the compute nodes and, depending on the architecture, a control node, the cluster will also contain a network switch to allow for internode communication. Nodes will also need a message-passing library, such as an implementation of the Message Passing Interface (MPI).
Clusters allow for high performance and high availability without the higher cost of a high powered single system. They also add redundancy and safety from failure as they can be configured to survive the failure of an individual node.
Message Passing Interface
The Message Passing Interface (MPI) is a committee-developed specification for message-passing library interfaces. It was created by a group of vendors, specialists, and library developers working together on the MPI Forum.
MPI is not a language, an implementation, or a particular set of libraries for passing messages; instead it is a specification with multiple implementations, including Open MPI and MPICH among others. Open MPI is an open source project running on many of the Top 500 supercomputers. It combines three of the most popular earlier MPI implementations: FT-MPI, LA-MPI, and LAM/MPI. MPICH is another popular open source implementation of MPI focusing on variations of Unix operating systems, including Mac OS X, and Microsoft's MS-MPI is also derived from it.
The specification deals with passing information between processes in parallel programming, with the goal of creating a standard that improves the ease of passing messages and the ability to move code between cluster architectures. The standard is designed to be used with the Fortran and C languages, but it can also be used in C++ by accessing the C bindings. Not limited by architecture, it can be used with shared or distributed memory clusters and with high performance multi-core processors. Libraries or bindings exist to extend MPI into other languages, including the .NET Common Language Infrastructure (CLI), Python, MATLAB, R, and OCaml, just to name a few. There have even been a few unofficial attempts to port MPI to Java.
OpenMP
While originally designed for Fortran, OpenMP also extends the C and C++ languages via compiler directives and callable runtime libraries. OpenMP enables multithreaded programming, allowing a node to use multiple cores and multiple threads per core. OpenMP is compiler agnostic, allowing it to be used in a variety of instances where shared memory parallelism is needed, specifically in shared memory architectures.
OpenMP is considered easier to use in shared memory systems, such as compute nodes with multi-core processors, than implementations of the Message Passing Interface. Going beyond shared memory parallelism, OpenMP has also been shown to be useful in distributed memory systems.
Researchers at Purdue University's School of Electrical and Computer Engineering describe two different techniques for using OpenMP in distributed memory parallel computing. In the first, they showed a compile-time/run-time combination for using OpenMP with Software Distributed Shared Memory (DSM) systems to proactively move data between systems. In the second technique, they translated OpenMP directly to MPI. In both cases they found that OpenMP improved the performance of distributed systems.
Other Forms of Node Communication
In addition to OpenMP and specific implementations of the MPI specification, other forms of parallel processing communication exist. Older than MPI, the Parallel Virtual Machine (PVM) was created at Oak Ridge in 1989 for use with C, C++, and Fortran. Unlike MPI, it has a concrete implementation. PVM uses a set of libraries that allow the nodes in a cluster to act together as a single parallel virtual machine, meaning that each node must have all of the PVM resources installed.
Simple Linux Utility for Resource Management (SLURM) (not to be confused with the cola) is an open source workload manager used for scheduling tasks across multiple nodes. It is the workload manager for a majority of the Top 500 supercomputers. Slurm provides workload management with no single point of failure and fault-tolerant job options. It is designed for high performance and scalability as well as resource management, so that idle nodes can be powered down when not in use.
The Sun Grid Engine enables execution of shell scripts or UNIX batch jobs on a cluster of nodes. The jobs are scheduled to run when the node is idle or under light load.
Comparing Clusters
Raspberry Pi Cluster
Inexpensive yet powerful, the Raspberry Pi has grown since its introduction in 2012 from a single-core processor with 256 MB of SDRAM to the current quad-core Pi 4 Model B with options ranging from 2GB to 8GB of RAM. Researchers in the Electrical and Computer Engineering Department at the University of Maine found it possible to build a low cost, power efficient cluster of 25 Raspberry Pis capable of 15 GFLOPS of performance using older Raspberry Pi 2 Model B nodes.
BJ used four CanaKit Raspberry Pi 4 Model B (8 GB) Basic Kits to create the cluster. The Raspberry Pi 4 Model B contains a 1.5GHz 64-bit quad-core ARM Cortex-A72 CPU, 8GB LPDDR4-3200 SDRAM, Gigabit Ethernet, 2 USB 3.0 ports, 2 USB 2.0 ports, and 2 micro-HDMI ports. He connected each Raspberry Pi to a Zyxel GS1200-5 5-port Gigabit Managed Switch via 3ft Cat6 cables. The GS1200-5 is a managed switch with a fanless design for quieter running, a switching capacity of up to 10 Gbps, and a 128 KB packet buffer.
University Cluster
In the paper, BJ compared his Raspberry Pi cluster to the Tennessine Oneseventeen (ts) cluster at the University of Tennessee at Chattanooga's Sim Center. The Tennessine cluster contains 33 compute nodes of Dell R730s with 128 GB RAM and 28 cores each. It has options for Slurm, Open MPI, and the Sun Grid Engine scheduler. For testing, BJ used modules containing Open MPI, gcc with C++17, and mpiexec.
Testing Code
The clusters were tested using a brute force md5 hash cracking program written in C++. The original code, found on GitHub, used Open MPI to spread the work of creating and comparing hashes of every possible string, using only lower case characters and knowing the length of the original string. The Raspbian Buster Lite OS could not connect to GitHub, so BJ cloned the repo, uploaded his changes to Bitbucket, and cloned from there onto the Raspberry Pi cluster.
MPI was set up in the code to work with one control node and N − 1 worker nodes, where N is the total number of nodes. In the case of BJ's cluster N = 4, so he had 1 control and 3 worker nodes. The code first calculates the total number of possible strings based on the number of characters in the string. Then it divides that by the number of nodes in the cluster.
Each worker node is sent a starting point from which to generate strings of the correct length. A clock is started on the control node (rank 0) before the worker nodes are called. Once one of the worker nodes finds the correct string, it is sent back to the control node (rank 0) and the clock is stopped.
OpenMP was added to the worker node code to further break down the work and access multiple threads on each node for faster processing.
Cluster Testing
Each cluster ran 216 tests, consisting of 36 tests for each of six different hashes. In order to see the difference in load between clusters, BJ hashed different sized strings, ranging from four-character to six-character strings.
Because of the nature of brute force hash cracking, different strings of the same size will take different amounts of time to crack. To mitigate this, two strings of each length were pre-tested, with one representing a faster crack and the other a slower one. The two four-character strings were tung and bird. The two five-character strings were mandy and burns. The two six-character strings were amanda and openmp.
Results were taken from the average of six tests per node and thread count combination: three from each of the two strings.
Results
Overall Performance
The 216 lines of data from each test run were put into a CSV file for each cluster tested. The three test runs for each hash in each node/thread combination were then averaged with the other hash of the same size to create 36 averages, one for each combination of character length, number of computing nodes, and number of threads per node.
Overall the average speeds were as expected: configurations with fewer nodes and threads performed slower, whereas those with more nodes and threads performed faster. The six-character strings create a similar 'W' shape in both clusters when plotted on a graph. This is likely a result of the method of splitting the potential strings between nodes and threads in the two-node, three-thread configuration, placing the correct string further from an entry point than in the other cluster settings.
A spike can be seen going from 4 threads in a smaller set of nodes to 1 thread in a larger set of nodes. This is explained by the fact that more threads on fewer nodes can mean more compute processes than more nodes with fewer threads. A 2-node cluster, one control and one worker, with 4 threads has 4 compute processes, whereas a 3-node cluster, one control and two workers, with only 1 thread per node has 2 compute processes.
Average Speed per Node by String Size
In addition, BJ compared the average speeds for each number of nodes in the Raspberry Pi cluster based on the number of characters in the hashed string. The four-character hash performed as expected, showing an increase in speed with the increase in the number of threads for each node count.
However, the two-node cluster showed varying results for both the five- and six-character string hashes. This may be due in part to the way the code splits the possible strings between nodes and threads. Both the three-node and four-node clusters performed as expected, increasing speed with the increase in threads per node.
Raspberry Pi vs University Speed Comparison
Finally, BJ compared the Raspberry Pi cluster and the Sim Center cluster for each node/thread combination. At the two-node level, all three sizes of strings perform as expected, with the Sim Center drastically beating the Raspberry Pi cluster. However, as more nodes are added, the Raspberry Pi cluster evens out with the Sim Center, even surpassing it in some combinations.
Lessons Learned
Hash Cracking Not A Good Measure
Brute force hash cracking, while fascinating, may not be the best measure of performance. Each thread is given a starting point for generating potential hashes based on the size of the original string and the number of nodes and threads.
The speed of finding the hash depends on where the string falls in the set of generated candidates. This location varies based on the way the work is broken down between nodes and threads, creating inconsistencies between the different combinations of nodes and threads per node.
Matrix multiplication or benchmark testing would be a more consistent way to compare cluster speeds between the different node and thread combinations.
More Data Needed for Statistical Analysis
A better design for testing would be to use a large number of hashed strings and calculate the speed differences between no parallelism and each of the node/thread combinations for each cluster. With this larger set of similar data, statistical analysis could be run to find the standard deviation within the set of hashes. This could be used to compare the two clusters to see if there is a significant difference in processing speed.
More tests per node/thread combination and fewer hash sizes would allow for better sets of data without adding more time to testing. Since there is variation in the time for same-size words between node/thread combinations, more different hashes of the same size would allow for a better average of performance.
Tricks of the Trade
Parallel computing has a lot of interesting concepts that don't just apply to computers. A lot of the ideas about parallelizing code also apply to teams of programmers. There is a lot to unpack here that can help you greatly increase efficiency in a development environment, even if you aren't using parallel programming directly.