Weak Scaling vs Strong Scaling

Abheeshta Vemuri
16 min read · Dec 5, 2022

High Performance Computing (HPC) clusters allow many processors to work together on big problems. Also known as parallel computing, this approach uses multiple processors working simultaneously to provide superior processing power and significantly reduce computation time. Scalability, or scaling, is a commonly used term in this context and describes the ability of hardware and software to deliver more computing power as more resources are added. Scalability, the ability to proportionally increase overall system capacity by adding hardware, is important for clusters. Software scalability is also called parallelization efficiency: the ratio of the actual speedup achieved to the ideal speedup for a given number of processors.

What is strong scaling?

Strong scaling is a measure of how effective parallelization is with respect to the number of cores used, for a fixed total problem size. It is calculated by dividing the run time of the program on a single core by the run time of the program on multiple cores. The higher the ratio, the more effective the parallelization.

Strong scaling increases the number of processors while keeping the problem size constant. This reduces the load on each processor. Strong scaling is typically used to find a configuration that produces a good runtime with reasonable resource overhead for long-running, CPU-bound applications. The workload of each processor should be kept high enough to keep all processors busy, and parallel efficiency usually degrades more or less continuously as the number of processes increases.
In an ideal world where the problem scales linearly, running a program on a system with N nodes would make the program N times faster. Of course, as N grows, communication overhead starts to dominate and this proportionality no longer holds. The goal of a strong scaling study is therefore to find the configuration in which a more powerful machine still reduces the time to solution at a reasonable resource cost; problems that come anywhere close to linear speedup are usually CPU bound.

Amdahl’s law and strong scaling:

Amdahl’s law is a well-known mathematical law that describes how the speedup of a program or system scales with the number of processors used. It is named after computer scientist Gene Amdahl and was first proposed in 1967. Amdahl’s law states that the maximum speedup that can be achieved is limited by the fraction of the program that must be executed sequentially. In other words, no matter how many processors are added, some part of the program must still be executed on a single processor. Strong scaling, on the other hand, measures how well a real system actually speeds up as more processors are added: rather than starting from the fraction of the program that must remain sequential, it looks at the total speedup achieved as the number of processors is increased. Amdahl’s law provides the theoretical upper bound on that speedup.

Amdahl observed in 1967 that the serial component of a program, which cannot be parallelized, limits the achievable speedup. Amdahl’s law is expressed as:

Speedup = 1 / (s + p/N)

Where N is the number of processors, p is the fraction of execution time that can be parallelized, and s is the fraction of execution time spent in the serial section. According to Amdahl’s law, the serial portion of the code determines the maximum speedup for a fixed problem size. The following example illustrates strong scaling. Imagine software that takes 20 hours to complete on a single processor core. If a particular section of the program that takes 1 hour to run cannot be parallelized (s = 1/20 = 0.05) and the code that takes the remaining 19 hours to run can be parallelized (p = 1 − s = 0.95), then no matter how many processors are used, the execution time will never drop below that critical hour. The maximum possible speedup is therefore 20 times (as N → ∞, speedup → 1/s = 20). As a result, parallelization becomes less efficient as the amount of resources increases. Because of this, only highly parallelized programs can benefit from parallel computing with many processors.
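As a quick check of these numbers, here is a minimal C sketch (not part of the original article) that evaluates Amdahl’s formula for s = 0.05; the printed speedup approaches, but never exceeds, 1/s = 20:

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / (s + p / N), where s is the serial fraction
   and p = 1 - s is the parallelizable fraction. */
static double amdahl_speedup(double s, int n_procs)
{
    double p = 1.0 - s;
    return 1.0 / (s + p / n_procs);
}

int main(void)
{
    const double s = 0.05;   /* 1 hour out of 20 is serial */
    const int counts[] = { 1, 2, 4, 8, 16, 64, 256, 1024 };
    const int n = (int)(sizeof counts / sizeof counts[0]);

    for (int i = 0; i < n; i++)
        printf("N = %4d  speedup = %6.2f\n", counts[i], amdahl_speedup(s, counts[i]));

    return 0;
}
```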

Calculating Strong Scaling Speedup:

Let t(1) be the time it takes to complete a unit of work with one processing element, and t(N) the time it takes to complete the same unit of work with N processing elements (parallel tasks). The speedup is then given by:

Speedup=t(1)/t(N)
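For example, given a set of measured wall-clock times (the values below are hypothetical, not measurements from this article), the speedup and the related strong scaling efficiency, speedup/N, can be computed as follows:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical measured wall-clock times t(N), in seconds, for N cores. */
    const int    cores[] = { 1, 2, 4, 8, 16 };
    const double times[] = { 120.0, 62.0, 33.0, 18.5, 11.0 };
    const int n = (int)(sizeof cores / sizeof cores[0]);

    for (int i = 0; i < n; i++) {
        double speedup    = times[0] / times[i];   /* t(1) / t(N) */
        double efficiency = speedup / cores[i];    /* speedup per core */
        printf("N = %2d  t = %6.1f s  speedup = %5.2f  efficiency = %4.2f\n",
               cores[i], times[i], speedup, efficiency);
    }
    return 0;
}
```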

What is Weak Scaling?

Weak scaling describes the ability of a system to maintain roughly the same level of performance when additional resources are added along with a proportionally larger workload. In terms of computing, this means the application can take advantage of additional computing resources to serve more clients or process more data while still achieving the same time to solution. This is in contrast to strong scaling, where the problem size stays fixed and performance increases with the addition of resources.

Weak scaling increases both the number of processors and the problem size, keeping the load on each CPU constant. Large memory-bound applications that require more memory than a single node can provide typically use weak scaling. Because memory access patterns tend to favour nearby (local) data and rarely touch data that is far away, such applications typically scale well as the number of cores increases.
The maximum problem size or the available resources are usually the only factors limiting growth.
For a perfectly weakly scaling application, the work done by each node remains constant as the machine scales up.

Gustafson’s law and weak scaling:

Gustafson’s law states that if the problem size grows in proportion to the number of processors, the achievable (scaled) speedup grows almost linearly with the number of processors. In other words, if you double both the number of processors and the size of the problem, the amount of work completed per unit time should roughly double as well. This is the regime of weak scaling, in contrast to strong scaling, where the problem size stays fixed as processors are added.

The maximum speedup of a fixed-size problem is determined by Amdahl’s law. Amdahl’s law stipulates that to achieve a 500x speedup on 1000 processors, the serial fraction must not exceed roughly 0.1%, which makes the serial portion the bottleneck of parallel computing. Gustafson observed, however, that in practice the problem size tends to grow in proportion to the amount of available resources. Using a large amount of resources for a computation is not beneficial if the problem only requires a few; it makes more sense to use fewer resources for smaller problems and more resources for larger problems.
Gustafson’s law, established in 1988, is based on the assumption that the parallel component scales with the available resources while the serial component does not grow with the problem size.
The scaled speedup formula is:

Scaled speedup = s + p × N

where s, p, and N have the same meaning as in Amdahl’s law. According to Gustafson’s law, the scaled speedup has no upper bound and increases linearly (with slope less than 1) with the number of processors. Unlike Amdahl’s law, which focuses on a fixed problem size, weak scaling determines the scaled speedup as a function of the amount of work completed for a scaled problem size. In the example above, s = 0.05 and p = 0.95, so Gustafson’s law predicts that the scaled speedup grows without bound as more processors are used. For N = 1000, the scaled speedup is about 950.
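A minimal sketch (a companion to the Amdahl snippet above, again not from the original article) that evaluates Gustafson’s formula for the same serial fraction:

```c
#include <stdio.h>

/* Gustafson's law: scaled speedup = s + p * N, with p = 1 - s. */
static double gustafson_speedup(double s, int n_procs)
{
    return s + (1.0 - s) * n_procs;
}

int main(void)
{
    const double s = 0.05;   /* same serial fraction as in the Amdahl example */

    for (int n = 1; n <= 1000; n *= 10)
        printf("N = %5d  scaled speedup = %8.2f\n", n, gustafson_speedup(s, n));

    /* For N = 1000 this prints about 950, matching the text. */
    return 0;
}
```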

Weak scaling efficiency calculation:

Weak scaling efficiency is given by the ratio of the time to complete one unit of work on one processing element, t(1), to the time to complete N of the same units of work on N processing elements, t(N):

Efficiency = t(1)/t(N)

Real applications often contain elements of both weak and strong scaling, and the ideal is rarely fully realized. Additionally, the type of scaling depends on the interaction between the application and the computer architecture: for example, distributed-memory (message-passing) systems and shared-memory systems scale differently. Furthermore, data-parallel applications (those in which each node works on a different portion of the data) by definition scale weakly. Before going further and giving some scaling examples, a few caveats are in order.

Practical applications usually have varying levels of complexity, so it can be difficult to quantify the growing “size” of a problem. For example, solving a set of N linear equations using Gaussian elimination requires O(N³) floating-point operations (flops). In other words, doubling the number of equations increases the “problem size” by a factor of 8, not 2. Similarly, solving a PDE on a three-dimensional spatial grid plus a one-dimensional time grid yields a problem of size N⁴.
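To make the growth concrete, here is a small sketch that uses the standard ≈(2/3)·N³ flop count for Gaussian elimination (that constant is an assumption on my part; the text only states the O(N³) behaviour): each doubling of N multiplies the work by eight.

```c
#include <stdio.h>

int main(void)
{
    /* Approximate flop count for Gaussian elimination: ~(2/3) * N^3.
       Doubling N multiplies the flop count by 8, not 2. */
    for (long n = 1000; n <= 8000; n *= 2)
        printf("N = %5ld  flops ~ %.3e\n", n, (2.0 / 3.0) * (double)n * n * n);
    return 0;
}
```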

We can also say,

The relative scaling efficiency between two configurations can be calculated as: Efficiency = (time at old scale / time at new scale) / (fold increase in scale). For example, if a system with a fixed problem size had a run time of 10 minutes on 10 nodes and a run time of 20 minutes on 20 nodes, the scaling efficiency would be: Efficiency = (10 min / 20 min) / (2-fold increase) = 0.25.
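A minimal sketch of this calculation, using the numbers from the example above:

```c
#include <stdio.h>

/* Relative scaling efficiency between two measured configurations of the
   same (fixed-size) problem: (t_old / t_new) / (nodes_new / nodes_old). */
static double relative_efficiency(double t_old, int nodes_old,
                                  double t_new, int nodes_new)
{
    double speedup = t_old / t_new;
    double fold    = (double)nodes_new / (double)nodes_old;
    return speedup / fold;
}

int main(void)
{
    /* Example from the text: 10 minutes on 10 nodes vs. 20 minutes on 20 nodes. */
    printf("efficiency = %.2f\n", relative_efficiency(10.0, 10, 20.0, 20));
    return 0;
}
```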

Measuring parallel scaling performance:

Measuring parallel scaling performance involves testing how the performance of a system changes as more processors or cores are added. This is done by running the same set of tasks on different numbers of processors or cores and measuring the overall time taken to complete them, then comparing those times with the time taken on a single processor or core. The comparison gives the overall efficiency of the system for different processor or core counts.

Parallel scaling performance can be measured using metrics such as speedup, scalability, and efficiency. Speedup measures how much faster a parallelized process runs compared to a sequential process. Scalability measures how well a system can increase performance when more resources are added. Efficiency measures how well resources are used, usually expressed as a percentage.

When using an HPC cluster, it is almost always useful to measure the parallel scaling of your jobs. Strong scaling measures how the total computation time of a job varies with the number of processing elements (threads or MPI processes), while weak scaling tests increase both the work size and the number of processing elements. Parallel scaling test results can help you understand how many resources to request for a given amount of work.

To demonstrate the parallel scaling evaluation process, we use a sample OpenMP code that generates a Julia set image, julia_set_openmp.c. Each pixel can be mapped onto a point in the complex plane, and the image is produced by iterating a complex quadratic function of the form z → z² + c at each point.

Below is a picture of the Julia set image.

The julia_set_openmp.c file has two OpenMP directives beginning with “#pragma omp” (see the OpenMP Quick Reference card for more information on OpenMP syntax). These two “#pragma omp” lines can actually be combined into one, as shown in the sketch below. A schedule(dynamic) clause is also added: it improves workload distribution at the cost of additional overhead, and this choice mimics the parallelization overhead present in real applications (such as communication and load balancing in MPI-parallelized programs).

The main function has also been changed to accept the image height and width as integer arguments from the command line.
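As a rough sketch of what the combined directive and the modified main might look like (this is not the actual julia_set_openmp.c source: the julia_iterations kernel, the iteration limit, and the constant c = -0.8 + 0.156i are all assumptions made for illustration), with schedule(dynamic) handing out rows to threads at run time:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Hypothetical per-pixel kernel: iterate z -> z*z + c and return the
   iteration count, which is later mapped to a colour. */
static int julia_iterations(double zx, double zy)
{
    const double cx = -0.8, cy = 0.156;   /* assumed Julia set constant */
    int k;
    for (k = 0; k < 200 && zx * zx + zy * zy < 4.0; k++) {
        double t = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = t;
    }
    return k;
}

int main(int argc, char *argv[])
{
    int h = (argc > 1) ? atoi(argv[1]) : 1000;   /* image height from the command line */
    int w = (argc > 2) ? atoi(argv[2]) : 1000;   /* image width from the command line  */
    int *counts = malloc((size_t)h * w * sizeof *counts);
    double t0 = omp_get_wtime();

    /* The two directives combined into a single "parallel for";
       dynamic scheduling balances the uneven per-pixel work. */
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < h; j++)
        for (int i = 0; i < w; i++)
            counts[(size_t)j * w + i] =
                julia_iterations(-1.5 + 3.0 * i / w, -1.5 + 3.0 * j / h);

    printf("%d x %d pixels, %d threads, %.3f s\n",
           w, h, omp_get_max_threads(), omp_get_wtime() - t0);
    free(counts);
    return 0;
}
```

The timing printed via omp_get_wtime() is the quantity that the strong and weak scaling runs described below would record for each thread count.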

Strong scaling is tested by running the code with different numbers of threads at constant height and width, recording the time each run takes to generate the Julia set. The results are shown in the tables and figures below. Figure 1 shows a fitted curve based on Amdahl’s law.

Weak scaling is measured by running the code with different thread counts while scaling the image height in proportion to the number of threads and keeping the width constant.

Table 2 and Figure 2 show the results of the weak scaling test. Figure 2 shows the approximate curve based on Gustafson’s law.

Table 1: Strong scaling of the Julia set generator code

Table 2: Weak scaling of the Julia set generator code

Figure 1: Strong Scaling Julia Set Generation Code Plot

Amdahl’s Law trendlines are represented by dashed lines.

Figure 2: Weak Scaling Julia Set Generation Code Plot

The fitted curve based on Gustafson’s law is represented by the dashed line.

Combining the strong scaling and weak scaling results, the formulas for Amdahl’s law and Gustafson’s law can be used to estimate the serial fraction (s) and the parallel fraction (p). Fitting the data in Figures 1 and 2 gives a serial fraction s of 0.03 for Amdahl’s law and 0.1 for Gustafson’s law. The discrepancy in s arises because both laws approximate reality by assuming that the serial component stays constant while the parallel component speeds up in proportion to the number of processing elements (processes/threads).

Parallelization overhead in real-world applications (for example, from OpenMP dynamic loop scheduling) also varies with job size.

Scaling Measurement Guidelines:

In addition to basic code performance and optimization considerations (that is, single-thread performance), you should consider the following factors when timing your application:

  1. Establish a baseline: Before implementing any scaling strategy, it is important to establish a baseline of performance metrics. This can be done by measuring the current performance metrics of the system, such as latency, throughput, and resource utilization.
  2. Determine scaling needs: Once the baseline has been established, it is important to determine what kind of scaling is necessary. This can be done by analyzing the system’s performance metrics and determining if the current performance is sufficient for the current workloads or if additional scaling is necessary.
  3. Create a scaling plan: Once the scaling needs have been determined, it is important to create a scaling plan. This plan should include the types of scaling that need to be implemented, the resources that will be needed, and the timeline for implementation.
  4. Test the plan: Once the scaling plan has been created, it is important to test the plan to ensure that it will work as expected. This can be done by creating a test environment and running simulations of the system with the proposed scaling plan.
  5. Monitor and adjust: Once the scaling plan has been tested and implemented, it is important to monitor the system’s performance metrics and adjust the plan as necessary. This can be done by analyzing the system’s performance metrics and determining if the scaling plan is having the desired effect.
  6. Evaluate the results: Finally, it is important to evaluate the results of the scaling plan. This can be done by comparing the system’s performance metrics before and after the scaling plan was implemented and determining if the plan was successful.

In addition, when taking the measurements themselves:

1. Use wall-clock time or an equivalent time unit.

o Example: the time per simulation time step (for instance, one time step completes every second).

2. For threaded jobs, use job sizes ranging from 1 up to the number of processing elements in a node; for MPI jobs, scale up to the total number of MPI processes required.

Job size increments should be powers of 2 or the equivalent (for example, cubes of powers of 2 for weakly scaling 3D simulations).

Note: Scaling values measured starting from multiple CPUs are not true baselines.

– When memory requirements exceed what is available on a single node, also provide scaling results for a smaller dataset (lower resolution), so that scaling performance can be compared over the full range from 1 process up to the number of processes you want to use, in addition to results at (or as close as possible to) the required problem size.

3. Perform a number of individual runs for each problem size.

o Average the data and remove outliers if necessary.

4. Select the problem state or configuration that most closely resembles the production runs you intend to perform.

Scaling should be evaluated based on the overall performance of the application, without favouring particular kernels or using unrepresentative test cases.

5. When using many nodes, the following should be considered:

a) Link speed and latency

b) Maximum memory per node

c) Maximum number of processors

d) Number of processors per node

e) System parameters and limits (e.g., stack size)

Note: For programs that use MPI, adjusting the MPI settings can also significantly improve performance. MPI programs also require a certain amount of memory per MPI process, and this overhead grows as more MPI processes are launched.

6. Where practical (but not required), take additional measurements on systems that differ significantly in processor and network balance (that is, in CPU speed relative to interconnect speed).

Note: The last item is not required, but it can be useful if you need to optimize your code.

Strong Scaling:

To refresh your memory, strong scaling increases the number of processors while keeping the problem size constant. This reduces the load on each CPU.

Note: The time required to complete a serial task is shown in the table below as T(1) (here for one processor).

The time required to perform the same unit of work on Np processing elements is expressed in the notation T(Np) (here representing Np different processors).

N1 is the number of processors required to complete the serial task (1 processor here).

Np is the number of processors required to complete the task in parallel.

Ideal speedup suggests that doubling the number of CPUs should essentially double the speedup, but the actual speedup falls short of this, as shown in the table above. Strong scaling efficiency can also be calculated using the efficiency formula given earlier, although it is not shown here; for this application, reporting the speedup is sufficient. As mentioned earlier, the aim is to find the sweet spot for running the code. As can be seen from the graph and table, for the given problem size the maximum speedup is reached at 4 nodes (128 cores), after which a slowdown is observed.

The following factors can cause the speedup to drop and the actual speedup to deviate from the ideal speedup:

  1. Amdahl’s law (the ideal speedup) differs from the actual speedup in that it assumes an infinitely fast network, i.e., that data can move between processes with no latency and unlimited bandwidth. This is not true in real life: all networks have non-zero latency and limited bandwidth, so data cannot be moved infinitely fast. Amdahl’s law can also be viewed as a recommendation on where to focus your optimization efforts. Suppose we have already done our best to improve the parallel section of the code: profiled the program to understand the MPI portion, used asynchronous data transfers, limited the number of collective operations, and run on a very low-latency, high-bandwidth network. Amdahl’s law then tells us to focus on the serial portion of the application to improve performance further, because the serial portion determines how far the parallel portion can be scaled.
  2. Beyond 4 nodes the actual speedup decreases. Adding nodes increases traffic between compute nodes, which can slow down the application. It can also be argued that the computation time cannot be reduced any further because the problem cannot be divided more effectively to reduce communication and speed up I/O.

Weak Scaling:

Remember that weak scaling increases both the problem size and the number of processors. It also keeps the load on each CPU constant.

Note: The time required to complete a serial task is shown in the table below as T(1) (here for one processor).

The time required to perform the same unit of work on Np processing elements is expressed in the notation T(Np) (here representing Np different processors).

The efficiency for an increasing number of processors is calculated using the efficiency formula shown in the weak scaling table above. The base problem size is 16,000,000 (N x N, with N = 4000 for 1 CPU), and the problem size grows linearly with the number of processors: doubling the number of processors doubles the problem size. This keeps the computational load on each CPU constant as the number of processors increases. The ideal efficiency remains 1, indicating that each processor does the same amount of computation regardless of processor count, so the wall-clock time stays the same as for one processor. However, the actual efficiency decreases as the number of processors increases.
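For illustration (a sketch based only on the numbers quoted above, not code from the benchmark itself), keeping the per-CPU load constant for an N x N problem means growing N with the square root of the processor count:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double base_n = 4000.0;   /* N for 1 CPU: 4000 x 4000 = 16,000,000 */

    for (int procs = 1; procs <= 64; procs *= 2) {
        double total  = base_n * base_n * procs;        /* grows linearly with procs   */
        double n_side = base_n * sqrt((double)procs);   /* edge length of the N x N grid */
        printf("P = %2d  total size = %.0f  N = %.0f  per-CPU size = %.0f\n",
               procs, total, n_side, total / procs);
    }
    return 0;
}
```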

The actual efficiency trend in the graph can be interpreted as follows.

  1. For execution within a single node, as the number of processors increases from 1, more MPI setup time is required and more data needs to be transferred between processors. Communication delays increase due to the limited capacity of communication networks.
  2. In multi-node systems, communication network bandwidth is limited, which increases the latency of data transfers between nodes. Since this is a basic implementation of the CG algorithm and little optimization of the code for parallel execution has been done, load balancing across nodes is also a potential factor.
  3. The granularity of the CG code, which determines the communication frequency between parallel tasks (the finer the granularity, the more frequent the communication), is unknown; assuming fine granularity, however, communication latency, in addition to I/O, undoubtedly becomes an increasing bottleneck as more data is processed. Weak scaling preserves the per-processor problem size for large problems (such as global satellite imagery analysis or streaming-data problems), and thus helps determine how well a parallel program scales to large spatial problems.

Overview:

Strong scaling is a type of scaling in which the total amount of work remains constant as the number of processors increases. It is often used for large parallel tasks to reduce the time it takes to complete a job: the more processors are added, the faster the job finishes. Weak scaling is a type of scaling in which the amount of work grows in proportion to the number of processors. It is often used for tasks with inherent data parallelism, allowing new processors to work on new parts of the problem. With weak scaling, adding processors lets a proportionally larger problem be solved in roughly the same amount of time, so the overall throughput increases.

The two common types of software scaling, strong scaling and weak scaling, are discussed in this article. Below is a summary of some key points.

Scalability is key to effective parallel computing.

Amdahl’s law governs strong scaling: the speedup for a fixed problem size as a function of the number of processors.

Gustafson’s law governs weak scaling: the speedup for a problem size that is scaled with the number of processors.

When using an HPC cluster, it is almost always useful to measure job scaling.

The results of both the strong scaling test and the weak scaling test provide useful guidance on the ideal match between the problem size and the amount of resources that should be requested for a job.
