This is a translated essay from the original Chinese text in this blog.

The essay was finished one day after the end of ASC24, potentially exist with some inaccuracies due to my limited ablility. And this is the first time I try to write MPI program and HPC application. Should there be any inaccuracies, your understanding is greatly appreciated. While my previous areas of study may intersect only minimally with this realm, I warmly welcome any corrections and enhancements from all.

The background of OpenCAEPoro, I believe, requires no explain again. To cut to the chase, it has become apparent to all that the dgemm function stands out as a focal point. Enhancing this function could achieve notably effects, thereby constituting our primary optimization target. Besides, in OpenCAEPoro, the algorithm for searching left and right neighbors exhibits particularly high time complexity, predominantly residing in O(N) or O(N^2) time complexity, so this is one of our points for optimization. The least point is about PETSc, by optimized the memory access speed, decoup function, we can reach greater speed. Meanwhile, we are trying to use asynchronous computing, which allows the program to fully maximize both CPU computational performance and remote access capabilities.

Next, this is our score list in finial competition. (Because the IB verbs drive problem, we got longer time than we expect.)
OpenCAEPoro成绩单
Accelration ratio in preliminary competition is show below.
初赛加速比


dgemm algorithm

First and foremost is the dgemm algorithm. In the unofficial ASC communication group (a QQ group number: 775322734), an expert has provided what appears to be an optimal solution, a technological marvel from Intel. we dare not claim that our code is faster. Here is the direct link:https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
In contrast, our approach utilizing block caching and compile-time calculations with C++ templates may seem somewhat inadequate. Some may argue that why we do not employ the dgemm algorithm form MKL. Through our testing, we have observed that MKL performs poorly on small matrices. The matrices calculated in testcase1 within OpenCAEPoro are relatively small, possibly within the range of 20x20 (the exact size in the finial competition is unknown). Furthermore, I provide the code we referenced, which stems from a tutorial by an NVIDIA engineer, serving as an exemplary guide for HPC beginners:https://github.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
In our rendition of the code, the matrix multiplication for small matrices offers an approximate 30% performance enhancement compared to the original version by the author (due to certain scaling operations not being compiled).

#define A(i, j) A[(i) * LDA + (j)]
#define B(i, j) B[(i) * LDB + (j)]
#define C(i, j) C[(i) * LDC + (j)]
#include "immintrin.h"
#include <stdint.h>
#include <cstring>

void scale_c_k4(double *C, int M, int N, int LDC, double scalar)
{
    int i, j;
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < N; j++)
        {
            C(i, j) *= scalar;
        }
    }
}

void mydgemm_cpu_opt_k4(int M, int N, int K, double alpha, double *A, int LDA, double *B, int LDB, double beta, double *C, int LDC)
{
    int i, j, k;
    if (beta != 1.0)
        scale_c_k4(C, M, N, LDC, beta);
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < N; j++)
        {
            double tmp = C(i, j);
            for (k = 0; k < K; k++)
            {
                tmp += alpha * A(i, k) * B(k, j);
            }
            C(i, j) = tmp;
        }
    }
}

template <typename BETA, typename ALPHA>
void mydgemm_cpu_v4(int M, int N, int K, ALPHA alpha, double *A, int LDA, double *B, int LDB, BETA beta, double *C, int LDC)
{
    int i, j, k;
    if constexpr (beta == 0) // optimize for beta == 0
        memset(C, 0, sizeof(double) * M * N);
    else
    {
        if constexpr (beta != 1.0)
            scale_c_k4(C, M, N, LDC, beta);
    }

    int M4 = M & -4, N4 = N & -4;
    for (i = 0; i < M4; i += 4)
    {
        for (j = 0; j < N4; j += 4)
        {
            double c00 = C(i, j);
            double c01 = C(i, j + 1);
            double c02 = C(i, j + 2);
            double c03 = C(i, j + 3);
            double c10 = C(i + 1, j);
            double c11 = C(i + 1, j + 1);
            double c12 = C(i + 1, j + 2);
            double c13 = C(i + 1, j + 3);
            double c20 = C(i + 2, j);
            double c21 = C(i + 2, j + 1);
            double c22 = C(i + 2, j + 2);
            double c23 = C(i + 2, j + 3);
            double c30 = C(i + 3, j);
            double c31 = C(i + 3, j + 1);
            double c32 = C(i + 3, j + 2);
            double c33 = C(i + 3, j + 3);
            for (k = 0; k < K; k++)
            {
                double a0, a1, a2, a3;
                if constexpr (alpha != 1.0)
                {
                    a0 = alpha * A(i, k);
                    a1 = alpha * A(i + 1, k);
                    a2 = alpha * A(i + 2, k);
                    a3 = alpha * A(i + 3, k);
                }
                else
                {
                    a0 = A(i, k);
                    a1 = A(i + 1, k);
                    a2 = A(i + 2, k);
                    a3 = A(i + 3, k);
                }

                double b0 = B(k, j);
                double b1 = B(k, j + 1);
                double b2 = B(k, j + 2);
                double b3 = B(k, j + 3);
                c00 += a0 * b0;
                c01 += a0 * b1;
                c02 += a0 * b2;
                c03 += a0 * b3;
                c10 += a1 * b0;
                c11 += a1 * b1;
                c12 += a1 * b2;
                c13 += a1 * b3;
                c20 += a2 * b0;
                c21 += a2 * b1;
                c22 += a2 * b2;
                c23 += a2 * b3;
                c30 += a3 * b0;
                c31 += a3 * b1;
                c32 += a3 * b2;
                c33 += a3 * b3;
            }
            C(i, j) = c00;
            C(i, j + 1) = c01;
            C(i, j + 2) = c02;
            C(i, j + 3) = c03;
            C(i + 1, j) = c10;
            C(i + 1, j + 1) = c11;
            C(i + 1, j + 2) = c12;
            C(i + 1, j + 3) = c13;
            C(i + 2, j) = c20;
            C(i + 2, j + 1) = c21;
            C(i + 2, j + 2) = c22;
            C(i + 2, j + 3) = c23;
            C(i + 3, j) = c30;
            C(i + 3, j + 1) = c31;
            C(i + 3, j + 2) = c32;
            C(i + 3, j + 3) = c33;
        }
    }
    if (M4 == M && N4 == N)
        return;
    // boundary conditions
    if (M4 != M)
        mydgemm_cpu_opt_k4(M - M4, N, K, alpha, A + M4, LDA, B, LDB, 1.0, &C(M4, 0), LDC);
    if (N4 != N)
        mydgemm_cpu_opt_k4(M4, N - N4, K, alpha, A, LDA, &B(0, N4), LDB, 1.0, &C(0, N4), LDC);
}

Optimize for searching neighbor

Vtune Profile can not give useful infomation expect the dgemm algorithm is a hotpot. Fortunately, the code being relatively compact that allows us to initiate optimization efforts from the most time-consuming areas. In the Partition section, we found a weird code in the partion of search and construct connections between different components. This approach, accomplished through array traversal, results in the need to run 24 times to obtain connection information for finding left and right children in the 2x2x2 components. Is it feasible to employ a hash table?

Subsequently, the two individuals responsible for this competition topic diligently inspected the source code for two weeks and eventually uncovered areas amenable to straightforward optimization. Found within the SetDistribution function in the Partition.cpp file, the following represents a mere excerpt of the code, with two instances suitable for such modifications. In this section, the performance was enhanced by a factor of 7.

/// Get elements and their neighbors belonging to current process,
	/// Also, process of them are needed.
	/// Setup elements those will be sent to corresponding process
	//  target process, num of elements, num of edges, global index of elements, xadj, adjncy, adjproc
	vector<vector<idx_t>>  send_buffer;
	vector<vector<idx_t>>& recv_buffer = elementCSR;
	recv_buffer.push_back(vector<idx_t> {myrank, 0, 0});

	// euphoria
	unordered_map<idx_t, idx_t> ump_send_buffer;

	// Calculate necessary memory first
	for (idx_t i = 0; i < numElementLocal; i++) {
		if (part[i] == myrank) {
			recv_buffer[0][1] ++;
			recv_buffer[0][2] += (xadj[i + 1] - xadj[i]);
		}
		else {
			// euphoria
			if(ump_send_buffer[part[i]]){
				idx_t k = ump_send_buffer[part[i]] - 1;
				send_buffer[k][1] ++;
				send_buffer[k][2] += (xadj[i + 1] - xadj[i]);
			}
			else{
				send_buffer.push_back(vector<idx_t>{part[i], 1, xadj[i + 1] - xadj[i]});
				ump_send_buffer[part[i]] = send_buffer.size();
			}

			// idx_t k;
			// for (k = 0; k < send_buffer.size(); k++) {
			// 	if (part[i] == send_buffer[k][0]) {
			// 		// existing process
			// 		send_buffer[k][1] ++;
			// 		send_buffer[k][2] += (xadj[i + 1] - xadj[i]);
			// 		break;
			// 	}
			// }
			// if (k == send_buffer.size()) {
			// 	// new process
			// 	send_buffer.push_back(vector<idx_t>{part[i], 1, xadj[i + 1] - xadj[i]});
			// }
		}
	}

Optimization in PETSc LinerSovler

The first optimization we try was rewrite the decoup function. This is one of its subsidiary functions are show below. We modified the repeated memory deallocation operations to static memory, thereby alleviating the pressure on the memory pool from repeated allocate and free operations. Simultaneously, we allocated memory with alignment, thus improving access efficiency.

void inverse1(double *A, int N)
{
    static int *IPIV = static_cast<int *>(aligned_alloc(64, N * sizeof(int)));
    int LWORK = N * N;
    static int last_N = N;
    static double *WORK = static_cast<double *>(aligned_alloc(64, LWORK * sizeof(double)));
    int INFO;

    if (last_N < N)
    {
        last_N = N;

        std::free(IPIV);
        std::free(WORK);

        IPIV = static_cast<int *>(aligned_alloc(64, N * sizeof(int)));
        WORK = static_cast<double *>(aligned_alloc(64, LWORK * sizeof(double)));
    }

    dgetrf_(&N, &N, A, &N, IPIV, &INFO);

    if (INFO > 0)
    {
        printf("WARNING: The factorization has been completed, but the factor U is exactly singular");
    }
    else if (INFO < 0)
    {
        printf("WARNING: There is an illegal value");
    }

    dgetri_(&N, A, &N, IPIV, WORK, &LWORK, &INFO);
}

Subsequently, we realized that the same approach could be applied to the entire LS solving function, starting from minor variables like "args" to relatively larger ones such as "nNDCount". These modifications are estimated to yield a 10% enhancement in Object Time. Furthermore, during our documentation reading, we surprisely found PETSc's own implementation of memory pool operations, promising an additional 10% performance boost.
PetSc内存操作

After consulting the PetSc documentation, we confirmed that variables such as Mat, Vector, and Ksp in PetSc could be reused. With this understanding, we converted them into static type variables, aiming for further performance gains. Unfortunately, we did not observe any significant improvements, speculating that the compiler might have concealed any potential enhancements through magical optimizations.

Within the solving function, we noticed functions like MatAssemblyBegin and MatAssemblyEnd, indicating asynchronous operations. Upon consulting the PETSc documentation, this was affirmed. Despite injecting computations into these areas, there were no discernible changes in Object Time, possibly due to the small scale of the test case. Similarly, drawing inspiration from this concept, we utilized C++ 20 features for asynchronous data copying during result transfer (While the operation is outside of Object Time, only for save time in finial competition).

Next, I would like to discuss some unsuccessful optimization attempts. We experimented with altering matrix indexing, matrix types, modifying preconditioning methods to accelerate least squares solving, and even attempting to accelerate linear solving processes using CUDA (though outside Object Time, serving to effectively reduce runtime during the finial competition). These endeavors were abandoned due to a lack of noticeable performance enhancements or computational errors. These endeavors consumed over half of our preparation time, yet regrettably did not yield any fruitful results.

Modify file read/write IO in the summary

** Importance Notice: There is a bug in my implemented code. The logic is sound, but the main thread's output is corrupted, causing the SUMMAY.out file is disappear. I made this modification on my own and only discovered it during the competition. I take full responsibility and have already pleaded for forgiveness from my teammates. Unfortunately, due to this error, almost all cases were run twice during the final competition. I offer my sincere public apology for my mistake. I am truly sorry for my error. **

In OpenCAEPoro, during the summarization phase, the program writes "SUMMAY_proc_xxx" and "Fast_proc_xxx" files to the disk, which are then read and processed by the main thread to generate the final data. This approach limits scalability and may lead to performance degradation when dealing with hundreds of parallel cores. After reading the code, I decided to modify this section by avoiding disk writes and instead summarizing the data through MPI communication. This resulted in a 10% performance improvement, with even greater enhancements as the number of threads increases. It's estimated that with thousands of threads, the performance gains could reach several hundredfold.

The original process looked something like this::
原版汇总方法

I changed the process to have each thread send its data to the main thread by MPI Communication(first sending the length, then the content). As I am not very familiar with MPI writing, it seems that using the Gather method would be more appropriate.
新版汇总方法

Other little Optimizations

Some methods are come form OI competition.

  1. Input&output acceleration
    Removing the synchronization with the C language library for input and output can enhance the I/O streams to reach the level of Printf:
std::ios::sync_with_stdio(false);
std::cin.tie(0); 
std::cout.tie(0);
  1. Preallocate space for the vector
    Origin code prefer not to preallocate space using the "reserve" function for vectors. However, some parts can be computed in advance to reserve an appropriate space. If calculated incorrectly, it can have a negative impact. By default, a vector automatically doubles its size when it needs to expand.

  2. More beautiful acceleration ratio
    After multiple tests, it has been found that the fewer cores there are, the higher the acceleration ratio (higher communication and memory overhead). This code has an acceleration ratio of approximately 1.68 on a single 48 core machine.

  3. Manual unfolding
    We manually expanded a portion of the loops in our code. When comparing the manually expanded code to the non-expanded code at the assembly level, the manually expanded code will be longer, but there is a performance improvement of 1-2% (which can be considered within the margin of error). We seek an explanation from experts as we ourselves do not have a clear conclusion.

Self-assessment

If the input and output are not incorrect, I am satisfied with my own modifications. Any more advanced techniques are beyond my current capabilities and represent areas for further learning. This achievement is the result of the collaborative efforts of our team of five. While we believe we have done our best, we are all convinced that there are likely even better methods out there. We welcome the insights and corrections of experts in the field.

Perhaps points for optimization could include:

  1. PetSc matrix types
  2. PetSc matrix indexing
  3. The reason for assembling the matrix twice in OpenCAEPoro? (Not certain if this is a valid point)