Thursday, February 19, 2009

Parsing unescaped urls in django

Modern browsers escape urls automatically before sending them to the server, but what happens if your application serves http requests to clients that don't escape urls?

The answer is that you can get unexpected results if your server runs on Django (and probably on any python framework/application). That's because python's BaseHTTPServer.BaseHTTPRequestHandler handles urls according to the standards, not from a human point of view.

Let's see it with an example. Consider the following request: Garcia&country=Catalonia

if you request it with a browser, it will escape the space in the url, so the server will get:

but what if the client uses, for example, python's urllib2.urlopen without escaping the url first (with urllib.quote)? Of course that is a mistake, but you, as a server-side developer, can't control your clients.

In that case the whole request line that the server receives is:
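For reference, this is roughly what a well-behaved client would do before sending the request. It is only a minimal sketch: it uses the Python 3 names (urllib.parse.quote and urllib.parse.urlencode) rather than the Python 2 urllib.quote mentioned above, and the parameter values are taken from the example in this post.

```python
from urllib.parse import quote, urlencode

name = 'Marc Garcia'

# quote() percent-encodes the space in a single value.
escaped_name = quote(name)
print(escaped_name)  # Marc%20Garcia

# urlencode() builds a whole query string; by default it encodes
# spaces as '+', which servers decode back to a space.
query = urlencode([('name', name), ('country', 'Catalonia')])
print(query)  # name=Marc+Garcia&country=Catalonia
```

Either form reaches the server as a single token, so the request line is never cut in the middle of the query string.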

GET Garcia&country=Catalonia HTTP/1.1

and after being processed (split) by python's BaseHTTPServer.BaseHTTPRequestHandler, what we get from django is:

request.method == 'GET'
request.META['QUERY_STRING'] == 'name=Marc'
request.META['SERVER_PROTOCOL'] == 'Garcia&country=Catalonia HTTP/1.1'

so our request.GET dictionary will look like:

request.GET == {'name': 'Marc'}

which is not the expected value (from a human point of view).

So, what we can do to avoid this result is quite easy (and of course tricky): get the GET values not from django's request.GET dictionary but from the one returned by this function:
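To see where those values come from, here is a small illustration that splits the request line on its first two spaces. It is a sketch of the effect only, not the actual server code, and the '/people/' path is made up because the original url is not shown here.

```python
def split_request_line(request_line):
    """Naively split an HTTP request line on its first two spaces."""
    method, target, protocol = request_line.split(' ', 2)
    # Everything after '?' in the target is treated as the query string.
    path, _, query_string = target.partition('?')
    return method, path, query_string, protocol


# '/people/' is a hypothetical path used only for this illustration.
parts = split_request_line(
    'GET /people/?name=Marc Garcia&country=Catalonia HTTP/1.1')
print(parts)
# ('GET', '/people/', 'name=Marc', 'Garcia&country=Catalonia HTTP/1.1')
```

The unescaped space ends the url early, so the rest of the query string is carried along with the protocol.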

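def _manual_GET(request):
    # An unescaped space cuts the url short, so the rest of the query
    # string ends up at the start of SERVER_PROTOCOL.
    if ' ' in request.META['SERVER_PROTOCOL']:
        # Rejoin the pieces, dropping the real protocol (the last token).
        query_string = ' '.join(
            [request.META['QUERY_STRING']] +
            request.META['SERVER_PROTOCOL'].split(' ')[:-1]
        )
        args = query_string.split('&')
        result = {}
        for arg in args:
            # Assumes every argument has the form key=value.
            key, value = arg.split('=', 1)
            result[key] = value
        return result
    # Nothing to repair: use the dictionary django already built.
    return request.GET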
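A quick way to try the idea without running a server is to feed it a stand-in request. This is only a sketch: FakeRequest is a hypothetical class that carries just the META and GET attributes, and manual_get is a condensed copy of the function above so that the snippet runs on its own.

```python
class FakeRequest(object):
    """Hypothetical stand-in for a django request (META and GET only)."""

    def __init__(self, meta, get):
        self.META = meta
        self.GET = get


def manual_get(request):
    """Condensed copy of _manual_GET, for this standalone check."""
    protocol = request.META['SERVER_PROTOCOL']
    if ' ' not in protocol:
        return request.GET
    # Everything but the last token of the protocol belongs to the query.
    pieces = [request.META['QUERY_STRING']] + protocol.split(' ')[:-1]
    query_string = ' '.join(pieces)
    return dict(arg.split('=', 1) for arg in query_string.split('&'))


broken = FakeRequest(
    {'QUERY_STRING': 'name=Marc',
     'SERVER_PROTOCOL': 'Garcia&country=Catalonia HTTP/1.1'},
    {'name': 'Marc'})
print(manual_get(broken))  # {'name': 'Marc Garcia', 'country': 'Catalonia'}
```

A request whose SERVER_PROTOCOL contains no space goes through untouched and simply returns its own GET dictionary.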


  1. request.META['SERVER_PROTOCOL'] will still be wrong. What you can do to solve this problem is make this function return a tuple of objects: the first one is the request.GET and the second is the request.META['SERVER_PROTOCOL'].

    Another good thing to do with this code is put it inside a process_view() middleware function, so you will not have to call this function on every request.

  2. You're right, and it would probably be easy to create a middleware that just modifies the META and GET attributes of the request to the "correct" ones, so you can forget about the issue just by installing it.

    Actually what I posted is what I needed, just a patch for one view that returns the GET values (I really don't care about the protocol value in this case).

    Thanks for your comment.

  3. In my opinion you'd better not mess with these things. It's against the HTTP protocol if the client doesn't escape the URL properly. This has nothing to do with human failure. The HTTP protocol is not designed to be used by humans manually.

  4. Spammer bots came to my site and produced the same error. Now I know why - they are using various url-fetching libraries without escaping.

  5. I don't get it. Wouldn't you want something that is violating the specs to fail rather than just disguising the problem?

  6. Ok, it looks like I don't care about respecting the protocol, and that's not the case at all. I really hate being flexible with the parts of reality that are wrong, but sadly I can't change them.

    Let me use an example that will sound familiar to all of you, and that is similar to my case...

    You know that there is a standard for HTML and CSS, as we have for HTTP. And when you develop a website, you spend more time creating bad html and css than following the standards, just because of a browser that you hate and can't do anything to change (ie6).

    So my case is exactly the same. My server has clients that don't follow the standards (because they are not escaping urls). So I have two options: give good results even when this happens, or save and return wrong data for the sake of being strict with the standards. And the fact is that I'm not going to accept wrong results just because of following the standards.

    I hope you can get my point of view, and the moral of the post now. :)
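The middleware idea raised in the comments could be sketched roughly like this. It is only an illustration under some assumptions: UnescapedUrlMiddleware is a made-up name, the class follows the old-style middleware interface with a process_view() method, and it repairs only the two META values. Whether request.GET picks up the repaired query string depends on whether django has already parsed it, which this sketch does not try to handle.

```python
class UnescapedUrlMiddleware(object):
    """Hypothetical middleware that repairs META for unescaped urls."""

    def process_view(self, request, view_func, view_args, view_kwargs):
        protocol = request.META['SERVER_PROTOCOL']
        if ' ' in protocol:
            pieces = protocol.split(' ')
            # The last token is the real protocol; everything before it
            # is the tail of the query string.
            request.META['SERVER_PROTOCOL'] = pieces[-1]
            request.META['QUERY_STRING'] = ' '.join(
                [request.META['QUERY_STRING']] + pieces[:-1])
        # Returning None lets the normal view run.
        return None


class FakeRequest(object):
    """Hypothetical stand-in for a django request (META only)."""

    def __init__(self, meta):
        self.META = meta


request = FakeRequest({
    'QUERY_STRING': 'name=Marc',
    'SERVER_PROTOCOL': 'Garcia&country=Catalonia HTTP/1.1',
})
UnescapedUrlMiddleware().process_view(request, None, (), {})
print(request.META['QUERY_STRING'])     # name=Marc Garcia&country=Catalonia
print(request.META['SERVER_PROTOCOL'])  # HTTP/1.1
```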