How should the URL auto-linking filter work?

by Tim Hunt -

Consider the following three example URLs in context:

  1. See this wikipedia page: http://en.wikipedia.org/wiki/Slash_(punctuation).
  2. Here is a picture of a smile: http://example.com/render/emoticon.php?s=smile.
  3. See your favourite news source (e.g. www.bbc.co.uk).

I typed those as plain text, so you can see how Moodle's filter_urltolink handles them. Currently it gets 1. and 2. right, but 3. wrong.

I filed MDL-22390 ages ago because I am always doing things like 3., and I had never thought of examples 1. and 2.

I am pleased to say that this filter has unit tests, including those tricky cases, so I am now aware of them.

Clearly we cannot correctly handle all three examples (at least not without serious artificial intelligence). So, I have come here to ask what we think would be the better behaviour. Comments please.

In reply to Tim Hunt

Re: How should the URL auto-linking filter work?

by Hubert Chathi -

As of right now, all three links are incorrect, and #3's problem has nothing to do with the bracket.  Here's what I see:

My guess is that #2 is a less-important case to worry about.  Maybe it should ignore closing brackets, unless the URL already has an opening bracket?  Then it would handle #1 and #3 correctly.  If you really wanted to, I guess you could try to match the number of opening and closing brackets to handle "(e.g. http://en.wikipedia.org/wiki/Slash_(punctuation))." correctly, but of course, then you can't use a regexp (at least, not without some extra processing).
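To make the bracket-matching idea concrete, here is a minimal sketch in Python. It is not Moodle's filter_urltolink code (which is PHP), and every name in it (URL_CANDIDATE, TRAILING_PUNCTUATION, trim_candidate, find_urls) is hypothetical. It assumes a greedy regexp grabs everything up to the next whitespace, and the "extra processing" then drops trailing sentence punctuation and any closing bracket that has no matching opening bracket inside the candidate.

```python
import re
from typing import List

# Hypothetical names throughout; this is an illustration, not Moodle's filter.
# Greedy first pass: take everything up to whitespace, quotes or angle brackets.
URL_CANDIDATE = re.compile(r'(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)

# Assumed set of characters that usually end a sentence rather than a URL.
TRAILING_PUNCTUATION = ".,;:!?"


def trim_candidate(candidate: str) -> str:
    """Drop trailing characters that probably belong to the sentence."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: this bracket closes something
            # that was opened outside the URL.
            url = url[:-1]
        else:
            break
    return url


def find_urls(text: str) -> List[str]:
    """Return the trimmed URL candidates found in a piece of plain text."""
    return [trim_candidate(m.group(0)) for m in URL_CANDIDATE.finditer(text)]
```

With this sketch, find_urls keeps the bracket in Slash_(punctuation) and drops the one after www.bbc.co.uk. A URL that genuinely ends in an unmatched closing bracket would still be cut short, which is the trade-off described above.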

For all the problems that auto-linking causes, I'm surprised that there isn't a "best practices" somewhere.  (Or at least, there isn't one that I'm aware of.)

In reply to Tim Hunt

Re: How should the URL auto-linking filter work?

by Gareth J Barnard -

Dear Tim,

In continuation from the dev chat I consider that:

http://www.w3.org/Addressing/URL/url-spec.txt

Needs to be supported in terms of allowable characters.  Given that, and taking the ')' character as the example, there is a Catch-22 in automated URL detection based upon characters alone.  Therefore there does need to be an element of additional intelligent logic.

So, on the presumption that the URL the user enters is valid and reachable, the longest possible URL should first be determined and then tested for existence.  If it exists, then the filter calculation is valid.  If it does not exist, then reduce the number of filtered characters by one and repeat the check.  If that is still not valid, then flag it up for the non-artificial intelligent human to state the intended destination.

Cheers,

Gareth

In reply to Gareth J Barnard

Re: How should the URL auto-linking filter work?

by Hubert Chathi -

If you check that a URL is reachable, you have to be able to handle pages that the user is able to reach, but the server is not able to, for whatever reason.  e.g. the page is password-protected, or the server is behind a restrictive firewall.

You'd probably also want to limit the characters that you would try chopping off -- probably just to punctuation characters.  e.g. in the URL http://example.com/foo/bar)., it is questionable whether the comma, period, and closing parenthesis belong to the URL or not, but the rest of the URL is not in doubt.  This way, you don't hammer some other server with 100 requests when someone types in an incorrect URL.
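As a rough illustration of that limit, here is a Python sketch (again, not Moodle code; QUESTIONABLE, candidates and resolve are hypothetical names). It yields the longest candidate first and then only the variants with trailing punctuation removed, so the number of checks is bounded by the number of trailing punctuation characters. The reachability check is passed in as a function, so nothing here touches the network, and the caveat about pages the server cannot reach still applies to whatever checker is supplied.

```python
from typing import Callable, Iterator, Optional

# Hypothetical set of characters whose membership in the URL is in doubt.
QUESTIONABLE = ".,;:!?)"


def candidates(raw: str) -> Iterator[str]:
    """Yield the longest candidate first, then trim trailing punctuation only."""
    url = raw
    yield url
    while url and url[-1] in QUESTIONABLE:
        url = url[:-1]
        if url:
            yield url


def resolve(raw: str, is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the longest candidate the checker accepts.

    Returns None when no candidate is accepted, so that a human can be
    asked what the intended destination was.
    """
    for url in candidates(raw):
        if is_reachable(url):
            return url
    return None
```

For the example above this produces at most four candidates, however long the non-questionable part of the URL is.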

In reply to Hubert Chathi

Re: How should the URL auto-linking filter work?

by Gareth J Barnard -

So, therefore, implement the 'request for comments' updated version and when the filter muffs it up allow the user to correct it.

A Google search gives:

http://eureka.ykyuen.info/2012/01/19/php-convert-url-into-clickable-link-with-urllinker/#more-8656 -> https://bitbucket.org/kwi/urllinker/src

https://code.google.com/p/php-rfc-3986/

I have no idea if they are better than what is currently in place, but they might shed light on the current issues.

In reply to Gareth J Barnard

Re: How should the URL auto-linking filter work?

by Hubert Chathi -

From a quick glance at the code, urllinker has a set of characters that it doesn't consider to be part of the URL if they are at the end of the URL.  So with Tim's example #1, it would skip the closing parenthesis (incorrectly) and the period (correctly).

As far as I can tell, php-rfc-3986 only checks whether a string is a URL, and doesn't try to guess what the user's intent was.

In reply to Hubert Chathi

Re: How should the URL auto-linking filter work?

by Tim Hunt -

Please bear in mind that this is a Moodle text filter. It runs tens of times on every single page. It must be very, very fast.
