2011
01.05

This is just a quick post to share a regular expression I had to come up with to validate a URL in a Flex app. The code below is for Flex, but it should only need a few minor changes for another language. The double backslash before the ? appears to be required in Flex (a single backslash does not work); read more about that in this older post. The expression also uses what would be capturing brackets in other languages. I could have used non-capturing brackets, but that would have made this already complicated example even more difficult to read.

linkValidator = new RegExpValidator();
linkValidator.expression = "(http(s)?:\/\/)?(([a-z]+[a-z0-9\-]*[.])?([a-z0-9]+[a-z0-9\-]*[.])+[a-z]{2,3}|localhost)(\/[a-z0-9_-]+[a-z0-9_ -]*)*\/?(\\?[a-z0-9_-]+=[a-z0-9 ',.-]*(&[a-z0-9_-]+=[a-z0-9 ',.-]*)*)?(#[a-z0-9/_-]*)?$";
linkValidator.noMatchError = resourceManager.getString("lang", "invalidURL");
linkValidator.flags = "i";
linkValidator.source = linkTextArea;
linkValidator.property = "text";
linkValidator.trigger = linkTextArea;
linkValidator.triggerEvent = Event.CHANGE;
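
If you want to try the expression outside of a Flex validator, here is a minimal sketch in TypeScript (my assumption, chosen because its regular expression syntax is close to ActionScript's). The helper name isValidUrl and the sample strings are made up for illustration. The leading ^ is an addition that is not in the expression above: without it, a search-style match would also accept a string such as !!!example.com, because only the end of the string is anchored.

```typescript
// The post's expression as a TypeScript regex literal (TypeScript is an assumption here).
// In a literal the ? needs only a single backslash. The leading ^ is an addition
// that the original expression does not have.
const urlPattern: RegExp =
  /^(http(s)?:\/\/)?(([a-z]+[a-z0-9\-]*[.])?([a-z0-9]+[a-z0-9\-]*[.])+[a-z]{2,3}|localhost)(\/[a-z0-9_-]+[a-z0-9_ -]*)*\/?(\?[a-z0-9_-]+=[a-z0-9 ',.-]*(&[a-z0-9_-]+=[a-z0-9 ',.-]*)*)?(#[a-z0-9\/_-]*)?$/i;

// Hypothetical helper: true when the whole string matches the expression.
function isValidUrl(candidate: string): boolean {
  return urlPattern.test(candidate);
}

// Made-up sample inputs, printed with their validation result.
const urlSamples: string[] = [
  "http://www.example.com",
  "localhost/a/b/?x=1&y=2#top",
  "sub.domain.info",
  "!!!example.com",
];
for (const sample of urlSamples) {
  console.log(sample, isValidUrl(sample));
}
```

The first two samples pass and the last two fail: .info is longer than the two or three letters allowed for the top level domain, and the added ^ rejects the leading junk.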

I’ll break it down into its individual sections with a brief explanation of each.

//protocol and subdomain
(http(s)?:\/\/)?(([a-z]+[a-z0-9\-]*[.])?

The first part covers the protocol (http:// or https://). I am only dealing with web (HTTP) URLs here, and the protocol is optional in my app, hence the ? at the end of the first group. The rest is an optional subdomain, which should start with one or more letters followed by zero or more letters/numbers/hyphens and a dot. This first subdomain and its dot are also optional. So far this would match: [empty string] http:// https://www. https://ww2. etc.
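
To see what this portion consumes on its own, here is a small sketch in TypeScript (an assumption, the post's code is Flex). The helper name matchedPrefix is hypothetical. The extra opening bracket from the full expression is left out so that the piece compiles by itself, and a ^ is added so the match starts at the beginning of the input.

```typescript
// Protocol and first-subdomain portion only (TypeScript is an assumption here).
// The outer bracket that stays open in the full expression is dropped so this compiles alone.
const prefixPattern: RegExp = /^(http(s)?:\/\/)?([a-z]+[a-z0-9\-]*[.])?/i;

// Hypothetical helper: returns the leading text that this portion would consume.
function matchedPrefix(input: string): string {
  const m = prefixPattern.exec(input);
  return m ? m[0] : "";
}

console.log(JSON.stringify(matchedPrefix("https://www.example.com")));
console.log(JSON.stringify(matchedPrefix("https://ww2")));
```

Because every group here is optional, the pattern always matches something, even if that is the empty string. Note that https://ww2 without a trailing dot only consumes https://, since the subdomain group insists on the dot.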

//server hostname
([a-z0-9]+[a-z0-9\-]*[.])+[a-z]{2,3}|localhost)

This next part covers the rest of the web host. The first grouping (first enclosing brackets) specifies the start of the hostname or a further subdomain, which must start with a letter or number, followed by zero or more letters/numbers/hyphens and then a dot (a dot inside a character set is how to represent a literal dot in Flex; you might be able to just use \.). This grouping can be repeated many times, but should then be followed by two or three letters for the top level domain. Alternatively the hostname localhost can be used instead. The extra closing bracket matches the additional opening one after the protocol. This section should match: www.example.com localhost example.com example.co.uk co.uk etc.
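
The hostname rules can be checked in isolation with a sketch like the following, written in TypeScript as an assumption (the post's code is Flex). The names hostPattern and relaxedHostPattern are hypothetical, and the ^ and $ anchors are added so each test string has to match as a whole. The relaxed variant swaps {2,3} for {2,}, which is one possible loosening of the top level domain rule along the lines suggested in the comments on this post.

```typescript
// Hostname portion only, anchored at both ends (TypeScript is an assumption here).
const hostPattern: RegExp = /^(([a-z0-9]+[a-z0-9\-]*[.])+[a-z]{2,3}|localhost)$/i;

// Assumed variant: any top level domain of two or more letters.
const relaxedHostPattern: RegExp = /^(([a-z0-9]+[a-z0-9\-]*[.])+[a-z]{2,}|localhost)$/i;

// Made-up sample hosts, printed with the strict and relaxed results.
const hostSamples: string[] = ["www.example.com", "localhost", "co.uk", "example", "sub.domain.info"];
for (const host of hostSamples) {
  console.log(host, hostPattern.test(host), relaxedHostPattern.test(host));
}
```

Only the last sample differs between the two: sub.domain.info fails the strict pattern and passes the relaxed one. A bare word such as example fails both, since at least one dot separated label is required unless the host is localhost.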

//web path
(\/[a-z0-9_-]+[a-z0-9_ -]*)*\/?

This next part is the optional path (the directory from the web root). Each segment starts with a forward slash followed by one or more letters, numbers, underscores or hyphens, and can then contain any number of letters, numbers, underscores, spaces or hyphens, so a segment cannot start with a space (you might need to backslash escape your hyphen in a different language). The trailing forward slash is also optional, as is the entire path. This part should match: [empty string] / /directory /a/b/ etc.
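
Here is a quick sketch of the path rules on their own, in TypeScript as an assumption (the post's code is Flex). The name pathPattern and the sample paths are made up, and the ^ and $ anchors are added so that each sample must match in full.

```typescript
// Path portion only, anchored at both ends (TypeScript is an assumption here).
const pathPattern: RegExp = /^(\/[a-z0-9_-]+[a-z0-9_ -]*)*\/?$/i;

// Made-up sample paths, printed with their result.
const pathSamples: string[] = ["", "/", "/directory", "/a/b/", "/my page", "/ page", "/index.html"];
for (const path of pathSamples) {
  console.log(JSON.stringify(path), pathPattern.test(path));
}
```

The first five samples pass. The segment that starts with a space fails, and so does /index.html, because the character sets for the path do not include a dot. If your URLs contain file names with extensions you would need to add one.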

//query string
(\\?[a-z0-9_-]+=[a-z0-9 ',.-]*(&[a-z0-9_-]+=[a-z0-9 ',.-]*)*)?

This part is the optional query string part of the URL. It starts with a ? (which may require only a single backslash in a different language), followed by the first parameter name, made up of one or more letters/numbers/underscores/hyphens, then the equals sign, followed by an optional parameter value made up of letters/numbers/spaces/apostrophes/commas/dots/hyphens. This parameter=value part of the query string can be repeated several times after that, but each extra parameter should be preceded by an ampersand (you would normally use just the & for this, but Flex requires the encoded version). This section could match: [empty string] ?a= ?a=bc ?a=b&c=d&e=f etc.
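
The query string rules can be tried out with a sketch like this one, in TypeScript as an assumption (the post's code is Flex). The name queryPattern and the samples are hypothetical. Since this is a regex literal rather than a string, the ? takes a single backslash and the ampersand is written plainly. The ^ and $ anchors are added for testing the piece alone.

```typescript
// Query string portion only, anchored at both ends (TypeScript is an assumption here).
const queryPattern: RegExp =
  /^(\?[a-z0-9_-]+=[a-z0-9 ',.-]*(&[a-z0-9_-]+=[a-z0-9 ',.-]*)*)?$/i;

// Made-up sample query strings, printed with their result.
const querySamples: string[] = ["", "?a=", "?a=bc", "?a=b&c=d&e=f", "?q=it's, ok", "?a", "?=b", "?a=b&"];
for (const query of querySamples) {
  console.log(JSON.stringify(query), queryPattern.test(query));
}
```

The last three samples fail: a parameter needs an equals sign, a name of at least one character, and every ampersand must be followed by another name=value pair.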

//fragment
(#[a-z0-9/_-]*)?$

Finally, the last part of the expression is the optional URL fragment (the part with a #). In my case I allowed zero or more letters/numbers/forward slashes/underscores/hyphens (Flex does not require escaping the forward slash when it is included in a character set). The dollar sign then specifies that there should not be anything else after this. This could match: [empty string] # #value #a/b/c etc.
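
The fragment rule is the simplest piece to check by itself. Here is a sketch in TypeScript, as an assumption (the post's code is Flex), with a hypothetical name fragmentPattern. The forward slash is escaped inside the character set only because this is a regex literal delimited by slashes, and the ^ is added so the whole sample must match.

```typescript
// Fragment portion only, anchored at both ends (TypeScript is an assumption here).
const fragmentPattern: RegExp = /^(#[a-z0-9\/_-]*)?$/i;

// Made-up sample fragments, printed with their result.
const fragmentSamples: string[] = ["", "#", "#value", "#a/b/c", "#a.b", "value"];
for (const fragment of fragmentSamples) {
  console.log(JSON.stringify(fragment), fragmentPattern.test(fragment));
}
```

The last two samples fail: a dot is not in the allowed set, and a fragment has to start with the #.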

I hope this is useful to those struggling to create their own URL regular expression matchers. Flex devs, remember to double escape the ? for the query string part of the URL.
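
A likely reason for the double backslash is that the expression is written inside a string literal, so one backslash is used up by the string before the regular expression engine ever sees the pattern. The sketch below shows that effect in TypeScript. It is an assumption that ActionScript strings treat the backslash the same way, and the helper name compiles and the shortened patterns are made up for illustration.

```typescript
// Hypothetical helper: reports whether a pattern string compiles as a regular expression.
function compiles(source: string): boolean {
  try {
    new RegExp(source, "i");
    return true;
  } catch (_err) {
    return false;
  }
}

// Shortened query string patterns written as string literals (TypeScript is an assumption here).
const doubledEscape: string = "(\\?[a-z0-9_-]+=[a-z0-9]*)?"; // engine sees (\?[a-z0-9_-]+=[a-z0-9]*)?
const singleEscape: string = "(\?[a-z0-9_-]+=[a-z0-9]*)?"; // engine sees (?[a-z0-9_-]+=[a-z0-9]*)?

console.log(compiles(doubledEscape), compiles(singleEscape));
```

With the single backslash the string literal collapses \? to a plain ?, the pattern then starts with (? followed by a bracket, and that is not a valid group, so it does not compile. The single backslashes before / and - in the full expression are harmless by the same logic, because those characters mean the same thing with or without the escape in the places they are used.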

3 comments so far

  1. How about .info tld, ie. sub.domain.info ? It’s 4 chars so it won’t match your regexp.

    One should be careful making too many assumptions about what makes up a valid URL. Especially nowadays with IDN and soon custom TLDs.

  2. Yes, you are correct. I should have probably been more relaxed about the top level domain and just used any number of characters or {2,} which should match 2 or more. I might update the post with this, thanks.

  3. Thank you…It is very useful…..