The Erroneous Basis of base_convert()

29 01 2016

July (Photo credit: kurafire)

Some bugs linger from one version of PHP to the next, such as the one associated with base_convert() and other base conversion functions. It has existed at least since the long-gone days of PHP 4. It even managed to escape notice in Eevee's encyclopedic rant railing against PHP a few years ago. An understandable oversight, given that nowadays developers usually apply themselves to endeavors other than converting values from one base to another.

Introduction

The manual lists base_convert() as appearing in PHP 4 and 5. In truth, this function has inhabited PHP at least since version 3.0.18 (see math.c). A comment preceding the function definition states its ambitious objective, as follows:

“Converts a number in a string from any base <= 36 to any base <= 36.”

The function generally delivers well provided that it works with certain data, i.e. positive integers whose digits are all valid in the base from which to perform the conversion. Before proceeding with a fuller discussion of base_convert(), consider the following snippet in order to refresh your knowledge about how computers deal with numbers:

<?php
$a = 0x40;            
var_dump($a);     

See live code.

So, what value does $a hold? Computers typically store values in binary form and display them in decimal. The literals 0x40 and 64 are two notations for the same value, which a computer’s memory holds as 1000000 (binary), so var_dump() reports int(64).
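
As a small illustration of that point, the sketch below writes the same integer in four notations that PHP accepts in source code (the binary literal needs PHP 5.4 or later) and confirms that they all denote one stored value. The variable names are only for illustration.

```php
<?php
// Four source-code notations for one and the same integer value.
$hex = 0x40;        // hexadecimal literal
$dec = 64;          // decimal literal
$oct = 0100;        // octal literal (leading zero)
$bin = 0b1000000;   // binary literal (PHP 5.4+)

var_dump($hex === $dec, $dec === $oct, $oct === $bin);  // bool(true) three times

// The notation only matters again when the value is displayed.
printf("%d %x %o %b\n", $hex, $hex, $hex, $hex);        // 64 40 100 1000000
```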

Using base_convert()

If a developer should wish to convert 0x40 into binary manually, base_convert() superficially seems optimal for this task, as follows:

<?php
$a = 0x40;
var_dump(base_convert($a, 10, 2));  // string(7) "1000000"
var_dump(base_convert(40, 16, 2));  // string(7) "1000000"
var_dump(decbin($a));               // string(7) "1000000"

See live code.

The second parameter of base_convert() specifies the base in which the first parameter’s value is to be read, as the next example demonstrates:

<?php
$a = 0x40;
var_dump( base_convert($a, 16, 2) ); // string(7) "1100100"

See live code.

The function turns the integer 64 into the string “64” and interprets those digits as hexadecimal, which accordingly evaluates as 100 (decimal) given the following logic:

4 * 16 ** 0 ==   4  
6 * 16 ** 1 ==  96
               100

In order to have the hexadecimal representation evaluate as 64, one must specify base 10 for the second parameter as in the PHP snippet preceding this last one.
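
The positional arithmetic above can be sketched in a few lines of PHP. positional_value() is a hypothetical helper written for this illustration, not a PHP function, and it only understands the digits 0 through 9.

```php
<?php
// Hypothetical helper (not part of PHP): evaluate a string of decimal
// digits positionally in the given base, least significant digit first.
function positional_value(string $digits, int $base): int
{
    $value = 0;
    $power = 0;
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $digit  = (int) $digits[$i];           // handles only '0'..'9'
        $value += $digit * $base ** $power;    // e.g. 6 * 16 ** 1
        $power++;
    }
    return $value;
}

echo positional_value("64", 16), "\n";  // 100
echo positional_value("64", 10), "\n";  // 64
```

Reading the same two digits in base 16 and in base 10 gives 100 and 64 respectively, which is the whole difference between the two base_convert() calls.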

The Manual proclaims that the first parameter of base_convert() must be a numeric string, but an int or a float also suffices. Internally, the implementation forces the first argument into a string, as the source code reveals:


PHP_FUNCTION(base_convert)
{
    zval *number, temp;
    zend_long frombase, tobase;
    zend_string *result;

    if (zend_parse_parameters(ZEND_NUM_ARGS(), "zll", &number, &frombase, &tobase) == FAILURE) {
        return;
    }
    convert_to_string_ex(number);
    […. snip]

After parsing the parameters, the code explicitly converts number to a string if it happens to be another data type such as an integer or a float. (See internal code.) Despite the Manual insisting that the first argument must be a string, PHP attempts to transform input of whatever data type into some kind of meaningful string, such that FALSE yields an empty string whereas TRUE produces “1”. Floats and integers transform into corresponding numeric strings.
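
The conversions just described can be observed from userland with an ordinary string cast. This sketch assumes that the cast mirrors what the internal conversion does for these scalar types.

```php
<?php
// How PHP's ordinary string conversion treats a few non-string inputs.
var_dump((string) false);  // string(0) ""
var_dump((string) true);   // string(1) "1"
var_dump((string) 64);     // string(2) "64"
var_dump((string) 1.5);    // string(3) "1.5"

// base_convert() then works on those strings.
var_dump(base_convert(true, 10, 2));  // string(1) "1"
var_dump(base_convert(64, 10, 16));   // string(2) "40"
```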

The key point about the first parameter is that it should represent a positive integer that lies within the numeric range for integers (see this bug report and a related one).

Forget the Frills

If there is anything extra, i.e. a unary minus sign or a decimal point associated with the numeric value, such minutiae can skew the result. The next example displays the inability of base_convert() to correctly process negative numbers:

<?php
$a = 65;
$b = ~$a + 1;   // get 2s complement
echo $b, "\n";                 
printf("0x%x\n\n",$b);            
$c = base_convert($b,10,2);
echo $c, "\n";
$i = intval($c,2);
echo $i;

/*  Output
 *  -65
 *  0xffffffffffffffbf (2s complement notation in hex)
 *
 *  1000001 (binary)   
 *  65
 */

See live code.

Relying on base_convert() to transform the negative number into a binary representation produces an incorrect numeric string, such that when the code uses it to yield an integer value, 65 displays instead of -65. A bug report filed in 2011 concerning this issue remains open.
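
One hedged workaround, assuming sign-and-magnitude output is acceptable, is to strip the sign yourself, convert the magnitude, and put the sign back. signed_base_convert() is a hypothetical helper name, not a PHP function, and it inherits every other limitation of base_convert().

```php
<?php
// Hypothetical wrapper: base_convert() drops a leading '-', so handle the
// sign separately and convert only the magnitude.
function signed_base_convert(string $number, int $from, int $to): string
{
    $negative  = strncmp($number, '-', 1) === 0;
    $magnitude = $negative ? (string) substr($number, 1) : $number;
    $converted = base_convert($magnitude, $from, $to);

    // Avoid producing "-0".
    return ($negative && $converted !== '0') ? '-' . $converted : $converted;
}

$c = signed_base_convert('-65', 10, 2);
echo $c, "\n";             // -1000001
echo intval($c, 2), "\n";  // -65
```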

Another problem of base_convert() is its inability to properly evaluate a number with a decimal point. PHP uses an internal function to implement base_convert() and it deviates from the familiar C-library function strtol() significantly. A bug report filed in 2012 pertaining to this matter also remains open.

Core contributor Sara Golemon describes a very weird issue (see here) that occurs if the first parameter contains characters that cannot be expressed in the base to convert from. The function simply skips them and works with whatever valid digits remain. So, if one were to pass an array as the first parameter, a Notice appears complaining about array to string conversion, but the function otherwise continues unimpeded, as the following demonstrates:

<?php

$a = array(1,2,3);
var_dump( base_convert($a,16,10) );  // 170

See live code here.

To understand the result, one must recall what array to string conversion produces besides the Notice: a string, namely “Array”. In terms of base 16, the only valid digits in it are “A” and “a”. So, PHP evaluates “Aa”, which in hexadecimal is 170, a decidedly weird but logical outcome.
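
The same reduction can be reproduced by hand. The regular expression below is only an illustration of the effect, not how PHP implements it internally.

```php
<?php
// What remains of "Array" once every character that is not a hex digit
// has been discarded, which is in effect what base_convert() does.
$kept = preg_replace('/[^0-9a-f]/i', '', 'Array');
var_dump($kept);           // string(2) "Aa"
var_dump(hexdec($kept));   // int(170)
```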

To demonstrate just how odd this behavior is, consider the following snippet:

<?php
echo base_convert("2a4",10,8);  	// 30 (octal)

See live code.

The equivalent C library function strtol() in contrast ignores all input after and including the first invalid character, as follows:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = "2a4";
    long li;

    /* base 8: parsing stops at 'a', so only "2" is converted */
    li = strtol(s, NULL, 8);    // 2 (octal)
    printf("%ld\n", li);
    return 0;
}

See live code.

According to Golemon, the bizarre behavior of base_convert() also occurs with bindec(), octdec() and hexdec(), which the following snippet confirms:

<?php
echo base_convert("1.5",16,10), "\n";	// 21
echo base_convert("1.5",10,16), "\n"; 	// f
echo bindec("1.5"),"\n";				// 1
echo octdec("1.5"),"\n";				// 13
echo hexdec("1.5"),"\n";				// 21

See live code.

A View from Within

Internally, base_convert() relies on an internal C function with a faulty design, as a bug report, still open and unassigned, reveals:

“_php_math_basetozval appears to be source of problem b/c while it mimics strtol, unlike strtol it fails to adequately deal with ‘.’ being an illegal character. strtol disregards ‘.’ and all subsequent input whereas _php_math_basetozval just skips the ‘.’ and processes the next character.”

An old article shows the important role that binary numbers play in accomplishing base conversion. It mentions that the source code implementing base_convert() uses an intermediate numeric binary value. This situation still exists in the source code of PHP 7. The resulting intermediate value may lack validity owing to a couple of conditions that may occur in a loop in _php_math_basetozval():

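if (c >= '0' && c <= '9')
        c -= '0';
else if (c >= 'A' && c <= 'Z')
        c -= 'A' - 10;
else if (c >= 'a' && c <= 'z')
        c -= 'a' - 10;
else
        continue;

if (c >= base)
        continue;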

Instead of bailing on a unary minus character or a decimal point, the loop conveniently ignores any such characters and carries on processing. And if a digit is larger than the base can accommodate, that character, too, is ignored and the loop continues iterating. The resulting number passes to _php_math_zvaltobase(), which converts this value, albeit invalid, into the specified base, unless the number is too extreme (such as infinity) or the target base is inappropriate.
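
To make the effect of those two skip conditions tangible, here is a userland sketch of the loop's logic. lenient_to_int() is a hypothetical name; the real work happens in C inside _php_math_basetozval(), and this simplified version does not handle values beyond PHP_INT_MAX.

```php
<?php
// Userland sketch of the digit loop: invalid characters and digits that are
// too large for the base are skipped rather than ending the parse.
function lenient_to_int(string $str, int $base): int
{
    $num = 0;
    for ($i = 0, $len = strlen($str); $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c >= ord('0') && $c <= ord('9')) {
            $c -= ord('0');
        } elseif ($c >= ord('A') && $c <= ord('Z')) {
            $c -= ord('A') - 10;
        } elseif ($c >= ord('a') && $c <= ord('z')) {
            $c -= ord('a') - 10;
        } else {
            continue;               // '-', '.', spaces: silently skipped
        }
        if ($c >= $base) {
            continue;               // digit too large for the base: skipped
        }
        $num = $num * $base + $c;
    }
    return $num;
}

echo lenient_to_int('-65', 10), "\n";    // 65
echo lenient_to_int('1.5', 16), "\n";    // 21
echo lenient_to_int('2a4', 10), "\n";    // 24
echo lenient_to_int('Array', 16), "\n";  // 170
```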

The obvious solution would appear to be to do as the C library function strtol() does and replace each “continue” with a “break”, so that no further processing would occur. Perhaps such a solution is inadvisable with respect to backwards compatibility (BC), but if PHP 7 was able to intentionally break BC regarding parsing expressions (uniform variable syntax), then why not do it in this case, too?
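
What the “break” behavior would look like can be sketched in userland. strict_to_int() is a hypothetical function for illustration only; it leaves out sign handling and overflow checks, and it is not a patch to PHP's C source.

```php
<?php
// Hypothetical strict variant: stop at the first character that is not a
// valid digit in $base, the way strtol() does, instead of skipping it.
function strict_to_int(string $str, int $base): int
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyz';
    $num = 0;
    for ($i = 0, $len = strlen($str); $i < $len; $i++) {
        $digit = strpos($alphabet, strtolower($str[$i]));
        if ($digit === false || $digit >= $base) {
            break;                  // stop parsing instead of skipping
        }
        $num = $num * $base + $digit;
    }
    return $num;
}

echo strict_to_int('2a4', 8), "\n";    // 2
echo strict_to_int('1.5', 16), "\n";   // 1
echo strict_to_int('ff', 16), "\n";    // 255
```

With this logic “2a4” in base 8 yields 2, matching the strtol() result shown earlier, whereas skipping would have produced 24 in octal.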

Resolution, Maybe?

In 2013, Golemon created a formal proposal (also known as an RFC, i.e. “Request for Comments”) offering various ways to ameliorate the buggy behavior of base_convert() and related functions that depend on the misbehaving internal function. Instead of revising it, she takes a more conservative approach by suggesting ways to incorporate error messages.

To date, Golemon’s RFC has yet to appear in the PHP project’s RFC Listing. The document indicates its status as “Under Discussion” despite the latest discussion occurring the same year that the RFC surfaced, briefly attracting just a few discussants (see Internals List). One participant opined that adding error messages might have disruptive consequences. Others favored adding error reporting. Then, nothing further happened.

Clearly, _php_math_basetozval() needs some measure of fixing so that it ceases to have an adverse effect on base conversions in PHP. Or, the limitations of PHP’s base conversion functions need proper documentation so that users may set their expectations realistically and use the functionality prudently instead of naively expecting the absence of an error message to confirm that all is well with the resulting conversion.

In the meantime, users need to validate the first parameter of base_convert(), as Golemon suggests, offering the following code:

<?php
if (strcmp($val, base_convert($val, $base, $base))) {
  /* $val isn't purely in base $base */
} else {
  $newval = base_convert($val, $base, $newbase);
}

While a feasible solution, it seems silly to have to call a function twice in order to use it successfully. Rather than reassuring, this snippet underscores that base_convert() remains a less than reliable function. In truth, base_convert() and the other related base conversion functions need a more solid basis. Until that happens, one may wish to consider the GMP extension and get accustomed to using intval() as needed, as follows:

<?php

function gmp_base_convert($numStr,$base){
      $gnum = gmp_init( $numStr );
      return gmp_strval( $gnum, $base );
}

$n = -65;
$b = gmp_base_convert("$n",2);
echo "$n (decimal) is $b (binary) ";

$i = intval($b, 2);
$d = gmp_base_convert($i,10);
echo "\n$b (binary) is $d (decimal) ";

$n = 0.5;       // should err out
echo "\n$n is ",gmp_base_convert("$n",16);

$n = "2A";      // should partially work
$i = intval("$n",8);
echo "\n$n in decimal is ",gmp_base_convert("$i",10);

/*  Output:
 *
 *  -65 (decimal) is -1000001 (binary)
 *  -1000001 (binary) is -65 (decimal)
 *
 *  0.5 is 
 *  Warning: gmp_init(): Unable to convert variable to GMP - string is not an integer in 
 *  /in/DvusS on line 5
 *  0
 *
 *  2A in decimal is 2
 */

See live code.

Note that the GMP extension prefixes a unary minus sign to indicate a negative number instead of representing it in 2s complement notation.
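
Where GMP is not installed, the input can instead be validated in a single pass before base_convert() is called. The sketch below is one way to do that; is_valid_in_base() and checked_base_convert() are hypothetical helper names, and rejecting the empty string is an assumption about what should count as valid.

```php
<?php
// Hypothetical helper: true only if every character of $val is a valid
// digit in $base (2..36). Letters are accepted in either case.
function is_valid_in_base(string $val, int $base): bool
{
    if ($val === '' || $base < 2 || $base > 36) {
        return false;
    }
    $digits = substr('0123456789abcdefghijklmnopqrstuvwxyz', 0, $base);
    return strspn(strtolower($val), $digits) === strlen($val);
}

// Hypothetical wrapper: refuse to convert instead of guessing.
function checked_base_convert(string $val, int $from, int $to): string
{
    if (!is_valid_in_base($val, $from)) {
        throw new InvalidArgumentException("'$val' is not a base-$from number");
    }
    return base_convert($val, $from, $to);
}

echo checked_base_convert('2A', 16, 10), "\n";   // 42
var_dump(is_valid_in_base('2A', 8));             // bool(false)
var_dump(is_valid_in_base('1.5', 10));           // bool(false)
var_dump(is_valid_in_base('-65', 10));           // bool(false)
```

Unlike a comparison against the round-tripped string, this check does not reject upper-case digits or leading zeros, both of which base_convert() normalizes in its output. It still does nothing about signs or about numbers too large for base_convert() to handle precisely.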

If your script merely needs to transform a decimal number into binary, octal or hexadecimal, printf() and its companion sprintf() still hold up admirably:

<?php
$a = 65;
$x = ~$a + 1;       // -65 in 2s complement

printf("%3d = 0x%x\n\n",$a,$a); 
printf("%d = 0x%x\n",$x,$x);

/*  Output:
 *   65 = 0x41
 *  -65 = 0xffffffffffffffbf
 */

See live code.
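
The width of that hexadecimal pattern comes from the platform's integer size. If a narrower 2s complement pattern is wanted, masking before formatting is one option. A minimal sketch, assuming a 16-bit width is what you need:

```php
<?php
$x = -65;

// Keep only the low 16 bits, then format them zero-padded.
printf("%04x\n",  $x & 0xffff);   // ffbf
printf("%016b\n", $x & 0xffff);   // 1111111110111111
```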

A difficulty arises if you wish to convert a negative number represented in 2s complement into a decimal. A user on Stack Overflow suggests a clever solution which involves some bit twiddling. Since it relies on the potentially problematic bindec(), I replace it with intval() and modify the code in other ways, too, as follows:

<?php
/*  convert 16-bit numeric string to decimal 
 */
function showDecimal( $bin, $base ) {
    
    $num = intval( $bin, $base );
    
    $num &= 0xffff;                   // using only last 16 bits
    if ( 0x8000 & $num ) {            // true if $num is negative ...
      $num = ( $num - 0x010000  ); 
    }
    
    return $num;
}

$b = "ffffffbf";                      //  2s complement (hex) for
echo showDecimal($b,16), "\n";         // -65

$c = "1111111111111101";              //  2s complement (binary) for
echo showDecimal($c,2), "\n";          // -3

See live code.

The binary string is 16 bits wide while the hexadecimal one is 32 (each hexadecimal digit accounts for 4 bits). The AND operation with 0xffff keeps only the last 16 bits, so the function treats a hex string like “ffffffbf” as if it were “ffbf”, thereby reducing the scale of the conversion from 32 to 16 bits. As a result of the AND, all the higher-order bits turn into zero values.

The next AND determines whether $num is a negative number. In a 16-bit scale, negative numbers range from 0x8000 (-32,768) to 0xFFFF (-1) in 2s complement format.

An interesting article about Java explains that to obtain the 2s complement for a negative number one would add the negative number to the highest power of 2 in binary which for 16 bits is 2 to the power of 16, expressed in hex as 0x010000. Computer scientist Donald Knuth in The Art of Computer Programming describes a reverse process for obtaining a decimal from 2s complement notation. According to him, if the leftmost bit for an n-bit pattern is set, then that implies a negative sign so the code should subtract 2 to the power of n, from the number expressed as a binary of n bits, as follows:

<?php

$x = 0xFFFD;             // unsigned 16-bit: 65533

// left bit set, so  $x as a 2s Complement:
$n = $x - (2 ** 16);    // 65533 -  65536
echo $n;                // -3 

See live code.
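
The same subtraction generalizes to other widths. from_twos_complement() below is a hypothetical helper for illustration; it assumes the bit width is at least 1 and smaller than the platform's integer width, so that the shift does not overflow.

```php
<?php
// Hypothetical helper: interpret the low $bits bits of $x as a 2s
// complement number. Assumes 1 <= $bits < PHP_INT_SIZE * 8.
function from_twos_complement(int $x, int $bits): int
{
    $x &= (1 << $bits) - 1;              // keep the n-bit pattern only
    if ($x & (1 << ($bits - 1))) {       // leftmost bit set: negative
        $x -= 1 << $bits;                // subtract 2 to the power of n
    }
    return $x;
}

echo from_twos_complement(0xFFFD, 16), "\n";      // -3
echo from_twos_complement(0xFFBF, 16), "\n";      // -65
echo from_twos_complement(0x7FFF, 16), "\n";      // 32767
echo from_twos_complement(0b11111101, 8), "\n";   // -3
```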

Crunch Time

The burden of creating valid output with respect to base conversion has shifted more towards Userland, notwithstanding the availability of GMP and the more limited possibilities of printf() and sprintf(). A user may feel compelled to acquire expertise on this relatively esoteric topic in order to create functionality for effecting base conversion. PHP originally had a noble purpose: free users from having to immerse themselves in excessive technical details. Unless a core contributor embarks on revising the internal code that impairs base_convert() and related functions, users out of necessity may have to chomp bits.

This work is licensed under a Creative Commons License
