Proof of Concepts

php 7 performance boost using parallel computing
Posted by pk on Tuesday, November 24 2015 - 14:25:20
5 stars levelphp languageC Language

Goal

Consider parallel computing as a credible option for a future php release... php8? php9? php10?

What has be done

Parallel recoding of 2 php functions ( array_sum() and in_array() ) using openmp.

Result

php is 3.5 to 3.8 times faster using 4 cores.

Environment

Tests have been done on Ubuntu 12.04 LTS with php 5.3.10.
php 7.0RC7 has been patched.
the -fopenmp flag has been added to the php Makefile.

Tests files

array_sum.php

$n      = 40000;
$t      = array();
$sum    = $n*($n+1)/2;

$time_mem = microtime(true);
for($i=1;$i<=$n;++$i)   $t[$i] = $i;
printf("Initialize:\t   %.4f\n",microtime(true)- $time_mem);

$time_mem = microtime(true);
for($i=0;$i<$n;++$i)
{
    if ($sum!=($v=array_sum($t))) { echo "Error: [$sum] != $v ...!\n"; break;}
}
printf("duration:\t   %.4f\n",microtime(true)- $time_mem);


in_array.php

$n = 20000;
$t = array();

$time_mem = microtime(true);
for($i=0;$i<$n;++$i) { $v = sprintf("%030d",$i); $t[$v] = $v; }
printf("Initialize:\t   %.4f\n",microtime(true)- $time_mem);

$time_mem = microtime(true);
for($i=0;$i<$n;++$i)
{
    $v = sprintf("%030d",$i); if (!in_array($v,$t,true)) echo "Error: [$v] not found!\n";
}
for($i=$n;$i<2*$n;++$i)
{
    $v = sprintf("%030d",$i); if (in_array($v,$t,true)) echo "Error: [$v] wrongly found!\n";
}
printf("duration:\t   %.4f\n",microtime(true)- $time_mem);

 

Original source code of array_sum(), extracted from ext/standard/array.c file

PHP_FUNCTION(array_sum)
{
	zval *input,
		 *entry,
		 entry_n;

	if (zend_parse_parameters(ZEND_NUM_ARGS(), "a", &input) == FAILURE) {
		return;
	}

	ZVAL_LONG(return_value, 0);

	ZEND_HASH_FOREACH_VAL(Z_ARRVAL_P(input), entry) {
		if (Z_TYPE_P(entry) == IS_ARRAY || Z_TYPE_P(entry) == IS_OBJECT) {
			continue;
		}
		ZVAL_COPY(&entry_n, entry);
		convert_scalar_to_number(&entry_n);
		fast_add_function(return_value, return_value, &entry_n);
	} ZEND_HASH_FOREACH_END();
}

Modified source code of array_sum(), to be replaced within the ext/standard/array.c file

PHP_FUNCTION(array_sum)
{
    register int nb_cores   = 1;
    register int n;
    
    zval *input;
    
    if (zend_parse_parameters(ZEND_NUM_ARGS(), "a", &input) == FAILURE) {
        return;
    }

    ZVAL_LONG(return_value, 0);

// TODO:    nb_cores has to be a global value, computed only once at the init of php, and get a max value from php.ini !
//          according to my first tests the best value to choose is not the number of threads in multithreading processors,
//          but be smaller or equal to the number of real physical cores. 
//          Use :  "OMP_NUM_THREADS=xx /path_to/php test_file.php" to specify xx, the number of threads to use in your test
#pragma omp parallel sections
{
    #pragma omp section
    {
        nb_cores = omp_get_num_threads();
    }
}

    if ( (nb_cores==1) || ((n=(Z_ARRVAL_P(input))->nNumUsed) < 3*nb_cores) )    // classical single threaded computing
    {
        zval *entry, entry_n;
        
        ZEND_HASH_FOREACH_VAL(Z_ARRVAL_P(input), entry) {
            if (Z_TYPE_P(entry) == IS_ARRAY || Z_TYPE_P(entry) == IS_OBJECT) {
                continue;
            }
            ZVAL_COPY(&entry_n, entry);
            convert_scalar_to_number(&entry_n);
            fast_add_function(return_value, return_value, &entry_n);
        } ZEND_HASH_FOREACH_END();
    }
    else                                                                        // multi threaded computing
    {
        do
        {
            register int rank;
            register int q,r;
            register Bucket *_p0 = (Z_ARRVAL_P(input))->arData;
            register Bucket *_end = _p0 + n;
            zval     t_result[nb_cores];
            
            q = n/nb_cores;
            r = n%nb_cores;
            
            #pragma omp parallel for                    // firstprivate(q,r,_p0) shared(t_result) 
            for(rank=0; rank < nb_cores; ++rank)        // compute sum  for each chunk (1 chunk per core)
            {
                register int    i;
                register int    m       =            q  + (rank < r ? 1    : 0);  // chunk size = q+1 for the r first ones ... = q for the others .
                register Bucket *_prank = _p0 + rank*q  + (rank < r ? rank : r);  // pointer to the first bucket of the rank's chunk
                zval     local_result;
                
                ZVAL_LONG(&local_result, 0);
                for (i=0; i < m; ++i)
                {
                    register Bucket *_p;
                    zval            entry_n;
                    
                    _p = _prank + i;
                    zval *_z = &_p->val;
                    
                    if (UNEXPECTED(Z_TYPE_P(_z) == IS_UNDEF))                   continue;
                    if (Z_TYPE_P(_z) == IS_ARRAY || Z_TYPE_P(_z) == IS_OBJECT)  continue;
                    
                    ZVAL_COPY(&entry_n, _z);
                    convert_scalar_to_number(&entry_n);
                    fast_add_function(&local_result, &local_result, &entry_n);
                }
                ZVAL_COPY(&t_result[rank], &local_result);
            }
            
            for(rank=0; rank < nb_cores; ++rank)  // consolidate chunks into result
            {
                fast_add_function(return_value, return_value, &t_result[rank]);
            }
        } while (0);
    }
}

 

Original source code of in_array(), extracted from ext/standard/array.c file

static inline void php_search_array(INTERNAL_FUNCTION_PARAMETERS, int behavior) /* {{{ */
{
	zval *value,				/* value to check for */
		 *array,				/* array to check in */
		 *entry;				/* pointer to array entry */
	zend_ulong num_idx;
	zend_string *str_idx;
	zend_bool strict = 0;		/* strict comparison or not */

#ifndef FAST_ZPP
	if (zend_parse_parameters(ZEND_NUM_ARGS(), "za|b", &value, &array, &strict) == FAILURE) {
		return;
	}
#else
	ZEND_PARSE_PARAMETERS_START(2, 3)
		Z_PARAM_ZVAL(value)
		Z_PARAM_ARRAY(array)
		Z_PARAM_OPTIONAL
		Z_PARAM_BOOL(strict)
	ZEND_PARSE_PARAMETERS_END();
#endif

	if (strict) {
		ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
			ZVAL_DEREF(entry);
			if (fast_is_identical_function(value, entry)) {
				if (behavior == 0) {
					RETURN_TRUE;
				} else {
					if (str_idx) {
						RETVAL_STR_COPY(str_idx);
					} else {
						RETVAL_LONG(num_idx);
					}
					return;
				}
			}
		} ZEND_HASH_FOREACH_END();
	} else {
		ZVAL_DEREF(value);
		if (Z_TYPE_P(value) == IS_LONG) {
			ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
				if (fast_equal_check_long(value, entry)) {
					if (behavior == 0) {
						RETURN_TRUE;
					} else {
						if (str_idx) {
							RETVAL_STR_COPY(str_idx);
						} else {
							RETVAL_LONG(num_idx);
						}
						return;
					}
				}
			} ZEND_HASH_FOREACH_END();
		} else if (Z_TYPE_P(value) == IS_STRING) {
			ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
				if (fast_equal_check_string(value, entry)) {
					if (behavior == 0) {
						RETURN_TRUE;
					} else {
						if (str_idx) {
							RETVAL_STR_COPY(str_idx);
						} else {
							RETVAL_LONG(num_idx);
						}
						return;
					}
				}
			} ZEND_HASH_FOREACH_END();
		} else {
			ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
				if (fast_equal_check_function(value, entry)) {
					if (behavior == 0) {
						RETURN_TRUE;
					} else {
						if (str_idx) {
							RETVAL_STR_COPY(str_idx);
						} else {
							RETVAL_LONG(num_idx);
						}
						return;
					}
				}
			} ZEND_HASH_FOREACH_END();
 		}
	}

	RETURN_FALSE;
}

Modified source code of in_array(), to be replaced within the ext/standard/array.c file


in_array, modified only within the strict comparaison case!

static inline void php_search_array(INTERNAL_FUNCTION_PARAMETERS, int behavior) /* {{{ */
{
    register int nb_cores   = 1;
    register int n;
    
  
    zval *value,                /* value to check for */
         *array,                /* array to check in */
         *entry;                /* pointer to array entry */
    zend_ulong num_idx;
    zend_string *str_idx;
    zend_bool strict = 0;       /* strict comparison or not */

#ifndef FAST_ZPP
    if (zend_parse_parameters(ZEND_NUM_ARGS(), "za|b", &value, &array, &strict) == FAILURE) {
        return;
    }
#else
    ZEND_PARSE_PARAMETERS_START(2, 3)
        Z_PARAM_ZVAL(value)
        Z_PARAM_ARRAY(array)
        Z_PARAM_OPTIONAL
        Z_PARAM_BOOL(strict)
    ZEND_PARSE_PARAMETERS_END();
#endif


// TODO:    nb_cores has to be a global value, computed only once at the init of php, and get a max value from php.ini !
//          according to my first tests the best value to choose is not the number of threads in multithreading processors,
//          but be smaller or equal to the number of real physical cores. 
//          Use :  "OMP_NUM_THREADS=xx /path_to/php test_file.php" to specify xx, the number of threads to use in your test
#pragma omp parallel sections
{
    #pragma omp section
    {
        nb_cores = omp_get_num_threads();
    }
}
    // TODO :   only strict mode is currently using openmp, some cut'n'paste to do!!!
    if (strict)
    {
        if ( (nb_cores==1) || ((n=(Z_ARRVAL_P(array))->nNumUsed) < 3*nb_cores) )    // classical single threaded computing
        {
            ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
                ZVAL_DEREF(entry);
                if (fast_is_identical_function(value, entry)) {
                    if (behavior == 0) {
                        RETURN_TRUE;
                    } else {
                        if (str_idx) {
                            RETVAL_STR_COPY(str_idx);
                        } else {
                            RETVAL_LONG(num_idx);
                        }
                        return;
                    }
                }
            } ZEND_HASH_FOREACH_END();
        }
        else                                                                        // multi threaded computing
        {
            do
            {
                register int rank;
                register int q,r;
                register int found = -1;
                register Bucket *_p0 = (Z_ARRVAL_P(array))->arData;
                register Bucket *_end = _p0 + n;
                zval     t_result[nb_cores];
                
                q = n/nb_cores;
                r = n%nb_cores;
                
                #pragma omp parallel for                    // firstprivate(q,r,_p0) shared(found) 
                for(rank=0; rank < nb_cores; ++rank)        // search within each chunk (1 chunk per core)
                {
                    register int    i;
                    register int    m       =            q  + (rank < r ? 1    : 0);  // chunk size = q+1 for the r first ones ... = q for the others .
                    register Bucket *_prank = _p0 + rank*q  + (rank < r ? rank : r);  // pointer to the first bucket of the rank's chunk
                    int     local_found     = -1;
                    
                    for (i=0; i < m; ++i)
                    {
                        register Bucket *_p;
                        
                        if (local_found >= 0)         continue;                           // no break is allowed in openmp nor abort in my version : do a continue on check to skip!
                        if (!(i%100) && (found >= 0)) { local_found = found; continue; }  // check only every 100 iteration to minimize shared memory access
                        
                        
                        _p = _prank + i;
                        zval *_z = &_p->val;
                        if (UNEXPECTED(Z_TYPE_P(_z) == IS_UNDEF))                   continue;
                        ZVAL_DEREF(_z);
                        if (fast_is_identical_function(value, _z))
                        {
                            found = local_found = (_prank-_p0) + i;
                            continue;
                        }
                    }
                }
                if (found >= 0)
                {
                    if (behavior == 0) {
                        RETURN_TRUE;
                    } else {
                        register Bucket *_p;
                        _p = _p0 +found;
                        num_idx = _p->h;
                        str_idx = _p->key;

                        if (str_idx) {
                            RETVAL_STR_COPY(str_idx);
                        } else {
                            RETVAL_LONG(num_idx);
                        }
                        return;
                    }
                }
            } while (0);
        }

    } else {
        ZVAL_DEREF(value);
        if (Z_TYPE_P(value) == IS_LONG) {
            ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
                if (fast_equal_check_long(value, entry)) {
                    if (behavior == 0) {
                        RETURN_TRUE;
                    } else {
                        if (str_idx) {
                            RETVAL_STR_COPY(str_idx);
                        } else {
                            RETVAL_LONG(num_idx);
                        }
                        return;
                    }
                }
            } ZEND_HASH_FOREACH_END();
        } else if (Z_TYPE_P(value) == IS_STRING) {
            ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
                if (fast_equal_check_string(value, entry)) {
                    if (behavior == 0) {
                        RETURN_TRUE;
                    } else {
                        if (str_idx) {
                            RETVAL_STR_COPY(str_idx);
                        } else {
                            RETVAL_LONG(num_idx);
                        }
                        return;
                    }
                }
            } ZEND_HASH_FOREACH_END();
        } else {
            ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_idx, str_idx, entry) {
                if (fast_equal_check_function(value, entry)) {
                    if (behavior == 0) {
                        RETURN_TRUE;
                    } else {
                        if (str_idx) {
                            RETVAL_STR_COPY(str_idx);
                        } else {
                            RETVAL_LONG(num_idx);
                        }
                        return;
                    }
                }
            } ZEND_HASH_FOREACH_END();
        }
    }

    RETURN_FALSE;
}

 

Benchmark on a Standard PC

Test Machine:  i7 4765T  4 cores 8 threads OS: Ubuntu 12.04.5 LTS
in_array (   , strict ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 thread 2 threads 3 threads 4 threads 5 threads 6 threads 7 threads 8 threads 9 threads 16 threads
duration 14.48 s 6.10 s 6.43 s 3.55 s 2.52 s 2.01 s 2.50 s 2.28 s 2.08 s 1.95 s 3.65 s 4.37 s
scaling     0.95 1.72 2.42 3.03 2.44 2.68 2.93 3.13 1.67 1.40
scaling/optimal scaling     94.87% 85.92% 80.69% 75.87% 48.80% 44.59% 41.90% 39.10% 18.57% 8.72%
 
array_sum ( long values ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 thread 2 threads 3 threads 4 threads 5 threads 6 threads 7 threads 8 threads 9 threads 16 threads
duration 22.37 s 7.36 s 6.83 s 3.61 s 2.58 s 2.06 s 3.10 s 2.60 s 2.30 s 2.09 s 4.08 s 4.80 s
scaling     1.08 2.04 2.85 3.57 2.37 2.83 3.20 3.52 1.80 1.53
scaling/optimal scaling     107.76% 101.94% 95.09% 89.32% 47.48% 47.18% 45.71% 44.02% 20.04% 9.58%
 
Test Machine:  Intel Xeon E3-1220 V2  4 cores OS: Ubuntu 12.04.5 LTS
in_array (   , strict ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 thread 2 threads 3 threads 4 threads 5 threads 6 threads 7 threads 8 threads 9 threads 16 threads
duration 13.60 s 5.91 s 6.44 s 3.65 s 2.53 s 1.99 s 3.59 s 3.26 s 3.35 s 3.60 s 3.57 s 4.67 s
scaling     0.92 1.62 2.34 2.97 1.65 1.81 1.76 1.64 1.66 1.27
scaling/optimal scaling     91.77% 80.96% 77.87% 74.25% 32.92% 30.21% 25.20% 20.52% 18.39% 7.91%
 
array_sum ( long values ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 thread 2 threads 3 threads 4 threads 5 threads 6 threads 7 threads 8 threads 9 threads 16 threads
duration 20.75 s 7.80 s 6.89 s 3.73 s 2.58 s 2.02 s 3.87 s 3.45 s 3.57 s 3.89 s 3.82 s 4.96 s
scaling     1.13 2.09 3.02 3.86 2.02 2.26 2.18 2.01 2.04 1.57
scaling/optimal scaling     113.21% 104.56% 100.78% 96.53% 40.31% 37.68% 31.21% 25.06% 22.69% 9.83%

 

Benchmark on a Supercomputer

 Test Machine: 2 x Intel Xeon E5-2680 V3  12 cores
=> 24 cores
OS: CentOS release 6.6 (Final)
in_array (   , strict ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 2 3 4 5 6 7 8 9 10 11 12
duration 35.54 s 25.17 s 25.41 s 13.52 s 9.11 s 6.95 s 5.63 s 4.77 s 4.16 s 3.71 s 3.39 s 3.12 s 2.89 s 2.68 s
scaling     0.99 1.86 2.76 3.62 4.47 5.28 6.05 6.78 7.42 8.07 8.71 9.39
scaling/optimal scaling     99.06% 93.08% 92.10% 90.54% 89.41% 87.95% 86.44% 84.80% 82.50% 80.67% 79.18% 78.26%
 
  13 14 15 16 17 18 19 20 21 22 23 24
  2.56 s 2.46 s 2.39 s 2.33 s 2.24 s 2.18 s 2.15 s 2.15 s 2.14 s 2.11 s 2.07 s 2.46 s
  9.83 10.23 10.53 10.80 11.24 11.55 11.71 11.71 11.76 11.93 12.16 10.23
  75.63% 73.08% 70.21% 67.52% 66.10% 64.14% 61.62% 58.53% 56.01% 54.22% 52.87% 42.63%
 
array_sum ( long values ) php 5.3.10 php 7.0RC7 php 7.0RC7    gcc -fopenmp      OMP_NUM_THREADS=x
1 2 3 4 5 6 7 8 9 10 11 12
duration 18.98 s 7.30 s 7.28 s 3.58 s 2.40 s 1.91 s 1.56 s 1.36 s 1.23 s 1.14 s 1.08 s 1.06 s 1.03 s 1.00 s
scaling     1.00 2.04 3.04 3.82 4.68 5.37 5.93 6.40 6.76 6.89 7.09 7.30
scaling/optimal scaling     100.27% 101.96% 101.39% 95.55% 93.59% 89.46% 84.79% 80.04% 75.10% 68.87% 64.43% 60.83%
 
  13 14 15 16 17 18 19 20 21 22 23 24
  0.94 s 0.96 s 0.98 s 1.01 s 0.97 s 0.99 s 0.97 s 1.06 s 1.13 s 1.12 s 1.13 s 1.18 s
  7.77 7.60 7.45 7.23 7.53 7.37 7.53 6.89 6.46 6.52 6.46 6.19
  59.74% 54.32% 49.66% 45.17% 44.27% 40.97% 39.61% 34.43% 30.76% 29.63% 28.09% 25.78%

Faq: Is it usefull on a heavy traffic web site?

Answer: It is not clear.

We could think that the requests distribution on each core would be enough to handle the parallelism naturally and efficiently.
But this is not as obvious as our first intuition might suggest.

 It depends on several parameters:

  • The requests arrival distribution law.
  • The ratio between a query’s execution time and the requests arrival rate.

Anyway, we can pre-suppose that the processor is able to handle all requests that arrive in the case of a single-core use.
Having noticed that, can we expect any gain by performing multi-core computing?

I'll show you a scenario where this is obvious.

Suppose in this case, that each task takes 4 ms of monocore computing, and that a new request is made each 1.1 ms.
With our standard 4 cores PC, with an overall computing time of 1.1 ms in multicore mode, we get the following graph:

This shows that after 5.5 ms:

  • In multicore, 5 clients have already been served, and it very quickly.
  • In singlecore, only 2 clients have been completely served, and for each, much more slower.

In every queue management problem, it is generally better to have a short time of service.

Of course, we also could have found cases where the opposite is true.
Therefore, if multicore implementation ideas would be retained, it would be interesting to have a new php.ini parameter, let’s say max_core_number, whose default value would be 1 for web mode, and 4 in cli mode...
The administrator would then be able to optimally configure as needed.

Faq: What can be parallelized within php ?

Answer: We can imagine to parallelize php at several levels:

  • At the php engine level.
    • php functions ( as I have done in this  “proof of concept” ).
    • the php interpreter itself, for example , by prefetching next opcode while excuting the current one.
  • In userland.
    • Imagine and implement simple and efficient primitives to make the php developer comfortable with parallel programming.
      For example, we could define a function to be run asynchronously:
             parallel function my_func($a,$b) { … }
      (parallel or async or …), with no global nor static variable be allowed within the function.
      …and an implicit barrier when using the result of the function.

      $a = my_func($x,$y);
      $a will be assign a new ‘not_yet_available’ zval type and a reference to what…
      using the value of $a will wait for the function to complete.
      so the call to a parallel function of a library would be transparent to the user.

      Of course, this will require a significant development effort, and rewriting a lot of code (the allocator for example) in order to obtain the smallest granularity for concurrent access.

Faq: Why openmp ?

Answer: it was directly available on my dev environment.

My dev environment is ubuntu 12.04 LTS
and openmp  is the only multi-threaded tool directly available ( gcc 4.6.3 ).
If I had gcc 5.x, I would have try with cilk plus.
 

yakpro rulez!

Site has been updated on Saturday, November 05 2016 - 20:02:15