Testing 1,2,3 …

18 03 2011

by Danny van Lieshout

Had an in-person interview yesterday.  Yay! The technical questions were loads of fun, and I’d like to share one of them with you.

Consider a user-defined function in PHP that is similar to the UNIX command, grep.  For those of you who are unfamiliar with this command or need a refresher, grep is a dandy command-line utility that returns all the lines of one or more files that match a certain pattern. grep is an acronym of arguable meaning  (see  http://www.vicchi.org/tag/grep/).

In the case of the above question, the idea was to apply the “grep” principle to a string of text and have the function return all the positions in the string that match a pattern. Here’s the function signature:

function grep($needle, $haystack)

Now with the benefit of your brains, plus pen and paper, figure out algorithmically how to define the above function so that it returns all the positions of the needle in the haystack, using the following values:

$needle = "aaa";
$haystack = "bbbccaaaccaaab";

I’ll even give you a hint about what the results will look like:

var_dump( array(5,10) );

Now don’t get substr happy — you need to make the function body work with whatever value needle may have, which could even be something like “aaab”!

The Algorithmic Approach

This means we’ll avoid using the easy PHP built-in preg_match_all. We’re going to solve this problem the anti-PHP way and forget about all that advice we’re heard about the built-ins being superior b/c they are written in C, and are so much faster than any code we might devise, etc. I am less than enthusiastic about this anti-PHP trend, and mention it only for the purpose of enlightening others about this technique which seems to fly in the face of what you may have previously learned about PHP.

So, to get the position of all matching strings, here’s how I solved it initially:

function grep($needle, $haystack)
{
   $results = null;
   $p = FALSE;
   $offset = 0;
   while( ( $p = strpos($haystack, $needle,$offset)) !== FALSE )
   {
	 	$results[] = $p;
		$offset = $p + strlen($needle);
   }
   var_dump($results);
}

$needle = "aaa";
$haystack = "bbbccaaaccaaab";
grep( $needle, $haystack);

The only tricky part here is recalling the correct order of the needle and haystack parameter positions because their order is inconsistent in PHP functions. But Dominik Jungowski has analyzed this situation and observes that unless using the array functions pertaining to string data, the order is usually haystack, needle (see http://www.phpdevblog.net/2009/08/haystack-needle-sheet.html and http://www.phpdevblog.net/uploads/files/haystack-needle.pdf for tipsheet.

In the above code, I use strpos(), speaking of which there is a good discussion of this function available at http://tiffanybbrown.com/2004/06/22/using_strpos_versus_stristrstrstr/. In addition I use strlen() to dynamically change the value of the offset, so that the parser can traverse the string and identify all instances of the needle’s position. But, is strlen() really necessary to achieve the results we seek? Actually, we can get by without it (see note at http://us.php.net/manual/en/function.strpos.php#60037, as follows:

function better_grep($needle, $haystack)
{
   $results = null;
   $p = -1;
   $i=0; 
   while (($p = strpos($haystack,$needle,$p + 1)) !== false) { 
            $results[$i++] = $p; 
   }
   var_dump($results);
}
$needle = "aaa";
$haystack = "bbbccaaaccaaab";
better_grep( $needle, $haystack);

All we need is to proceed from the start of the string and then next time through the loop from the position of the character after the starting position of the found $needle. So, if the $needle in this case is at position 5, then we need to search again starting from sixth position in the string.

Of course, to be really algorithmic we could even do away with strpos() to extract all the instances of the needle in the haystack. Here’s the code:

function doCount($str) {
  $count=0;
  for($c = 0; isset($str[$c]); $c++) {
    $count++;
  }
  return $count;
}

$haystack = "bybobbaaabbbcccbbbbobby";
$haycount = doCount($haystack);
$arrPos = null;
$needle = 'bb';
$needlecount = doCount($needle);

for ($i=0; $i < $haycount; $i++)  // traverse haystack
{
  if( $haystack[$i] == $needle[0] ){ // if element == start of needle
	
    if ( ($needlecount == 1)) {    // if 1-letter needle
      $arrPos[] = $i;
      continue;
    }
    else   // else $needlecount > 1         
   {
      //  successive letters match the needle?
      for ($x=$i+1,$y = 1; ($x <= $haycount-1) && ($y <= $needlecount-1); $x++,$y++) 
      {
	 if  (  $needle[$y] == $haystack[$x]  ) {   
	       if ($y == ($needlecount - 1) )  // at needle end?
	       {
		     $arrPos[] = $i; // store position contained in $i
		 // decomment line below for preg_match_all behavior
		 //$i += ($needlecount-1); 
	             break;
		}// inermost if
	  }//iner if
	 else
	 {
	      break;
	 }
					
      }//inner for	  
		
    }// inner if
  }//inner if	
}// for
var_dump($arrPos);

The above code uses a different needle — variety can be good! The one oddity is that you’d need to decomment the line “$i += ($needlecount-1);” in order for my algorithmic script to produce results like the PHP built-in preg_match_all() function. Otherwise, my code is able to perceive more instances of a needle like “bb” than preg_match_all().

The PHP Way
But, wait, that was only half the question.  The other half involves solving the same problem again but this time use an appropriate  PHP built-in function:

function grep2($needle, $haystack)
{
	if ( preg_match_all("/$needle/",$haystack,$matches,PREG_OFFSET_CAPTURE) > 0)
	{
              var_dump($matches);
	}
}
grep2( $needle, $haystack);

Of course the results in grep2() require some work to extract the values which are nested 2 levels deep. Here’s one solution:


function grep2a($needle, $haystack)
{
        $results = null;
	if ( preg_match_all("/$needle/",$haystack,$matches,PREG_OFFSET_CAPTURE) > 0)
	{
	   for( $i=0, $max=count($matches[0]); $i < $max; $i++ ) {
  	        $results[] = $matches[0][$i][1]; 	   
            }
   	} 	
        echo  var_dump($results); 
} 
grep2a( $needle, $haystack);	

 

Here’s another way with less hard-coding:

 
function grep2b($needle, $haystack) {
     $results = null;
     if ( preg_match_all("/$needle/",$haystack,$matches,PREG_OFFSET_CAPTURE) > 0)
    {
	    foreach ($matches[0] as $key => $value) {
	        if (is_array($value)) {
	      	   foreach( $value as $k => $v) {
		      	if (ctype_alpha($v)) continue;
	                $results[] = $v;
		    }// end inner-foreach
	        }//end inner-if
	    }//end foreach
      }// end if
     var_dump($results);

}
grep2b( $needle, $haystack);

In the above code, the second foreach may encounter one of two values, either the string value of the needle or an integer representing ts position in the larger string. That is why I use ctype_alpha() to disregard the string value so that the parser can locate the integer instead.

This work is licensed under a Creative Commons License

Advertisements

Actions

Information

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




%d bloggers like this: