How to Validate a URL

There's an old joke among programmers, particularly those who have had to use regexes more often than they're comfortable with:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

It's a seductive trap: regexes are good at processing strings, and are more powerful than your usual string-processing utilities, so it seems logical to use them for advanced string parsing. But regular expressions are not meant to do arbitrary string parsing. Regular expressions are meant to parse regular languages. Furthermore, regular expressions are notoriously hard to read, resulting in what appears to be a string of random characters sneezed out all over your screen. For example, consider the following, which is used to validate a URL:


// Requires: using System.Text.RegularExpressions;
Regex regex = new Regex(
  @"^((((H|h)(T|t)|(F|f))(T|t)(P|p)((S|s)?))\://)?(www.|[a-zA-Z0-9].)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\:[0-9]{1,5})*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*$"
);

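For all the detail in this regex, it makes a few crucial mistakes. Putting on my SQA hat, here are a few test cases to prove it. First, a surprise: the protocol group is optional and the unescaped "." in [a-zA-Z0-9]. matches any character, so the first two characters of any host can stand in for "www." and these three are accepted:

• http://sites.google.com (no "www" prefix)
• www.google.com (no protocol)
• http://google.com (no www)

That same unescaped "." also waves through junk such as http://a!b.com. Meanwhile, these perfectly legitimate URLs are rejected:

• https://192.168.0.1 (the host must end in a dot plus 2 to 6 letters, so IP addresses fail)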
• ftp://user:[email protected] (no basic auth credentials allowed)
• news://www.example.com (only http, https, ftp, and ftps allowed)
• http://www.test.com?pageid=123&testid=1524 (no query string unless a "/" precedes it)
• http://www.⌘.ws/ (non-ASCII URLs not allowed)
• http://www.foo.com/blah_blah_(wikipedia) (no parentheses in URLs)

Of course, sometimes the cure is worse than the disease. Lesson learned: don't use regex for URL validation.
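So what might the alternative look like? One option in .NET, sketched here under a few assumptions, is to hand the string to the framework's own System.Uri parser and then check the scheme against a list you control. The UrlValidator class, the IsValidUrl method, and the particular scheme allow-list are all made up for illustration; none of them come from the original code.

```csharp
using System;
using System.Collections.Generic;

public static class UrlValidator
{
    // Hypothetical allow-list mirroring what the regex seemed to want.
    // Adjust it to whatever schemes your application should accept.
    private static readonly HashSet<string> AllowedSchemes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "http", "https", "ftp", "ftps"
        };

    public static bool IsValidUrl(string candidate)
    {
        // Let the framework parse the string instead of a hand-rolled regex.
        // UriKind.Absolute means a scheme such as "http" is required.
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri))
        {
            return false;
        }

        // Parsing alone accepts any scheme (file, mailto, news, ...),
        // so the allow-list check is what narrows it down.
        return AllowedSchemes.Contains(uri.Scheme);
    }
}
```

With this sketch, http://www.test.com?pageid=123&testid=1524 and http://www.foo.com/blah_blah_(wikipedia) are accepted, while news://www.example.com is turned away by the allow-list rather than by a parsing accident. A bare www.google.com is rejected because it has no scheme; whether to prepend one before validating is a decision for the caller, not the validator.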
