New Online Website Link Checker Engine

We recently replaced the engine behind our Website Link Checker with a much more capable one. This post describes the features of the new engine. We believe that few broken link checkers, if any, can compete with it. Before this update, our Website Link Checker tool could check only two basic types of web page links:

  • <a> href values,
  • <meta> content url values with <meta> http-equiv="refresh".

When we talk about website links, we usually think of <a> href values. However, from a webmaster’s point of view, it is important that all URLs in a website’s code are correct. For example, it is important to know whether all of a website’s images (<img> src) actually exist. There are also many other ways to link to resources from web pages: Java applets, Flash objects, the new HTML5 media tags, and so on. The more types of links your broken link checker recognizes, the better. On the other hand, several obsolete and poorly documented tags, which modern web browsers usually do not support, can be omitted without losing any real functionality. With this in mind, we decided to implement a link checker engine that fully covers the HTML4, HTML5, and CSS3 standards.

Link Groups

Internally, the new Website Link Checker’s engine distinguishes between three groups of links.

First Group

The first group contains all simple link specifications, i.e. links that are fully specified by the value of a single attribute of a single tag. The group is defined by the following list of tags and attributes:

  • <a> href,
  • <link> href,
  • <script> src,
  • <area> href,
  • <img> src,
  • <img> longdesc,
  • <del> cite,
  • <ins> cite,
  • <audio> src,
  • <video> src,
  • <video> poster,
  • <source> src,
  • <embed> src,
  • <embed> pluginspage,
  • <track> src,
  • <input> src,
  • <body> background,
  • <frame> src,
  • <frame> longdesc,
  • <iframe> src,
  • <iframe> longdesc,
  • <blockquote> cite,
  • <q> cite.
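
As an illustration of how links of this kind could be collected, here is a minimal sketch built on Python’s standard html.parser module. It is not the engine’s actual code: the names SIMPLE_LINK_ATTRS, SimpleLinkCollector, and collect_simple_links are hypothetical, and the table contains only a subset of the pairs listed above.

```python
from html.parser import HTMLParser

# Hypothetical lookup table: a subset of the (tag, attribute) pairs listed above.
SIMPLE_LINK_ATTRS = {
    ("a", "href"), ("link", "href"), ("script", "src"), ("area", "href"),
    ("img", "src"), ("img", "longdesc"), ("video", "src"), ("video", "poster"),
    ("source", "src"), ("iframe", "src"), ("blockquote", "cite"), ("q", "cite"),
}


class SimpleLinkCollector(HTMLParser):
    """Collects (tag, attribute, url, line, column) for single-attribute links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the current tag starts (1-based line, 0-based column).
        line, column = self.getpos()
        for name, value in attrs:
            if value and (tag, name) in SIMPLE_LINK_ATTRS:
                self.links.append((tag, name, value, line, column))


def collect_simple_links(html):
    """Return all simple links found in an HTML string."""
    collector = SimpleLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for link in collect_simple_links('<a href="/x">x</a><img src="a.png">'):
        print(link)
```

Because every link in this group is fully described by one tag and one attribute, a lookup table is enough; no per-tag logic is needed.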

Second Group

The second group contains links within CSS code. We analyze CSS using a regular expression that looks for URLs. We search for CSS code in three types of locations:

  • the style attribute of any tag,
  • the <style> tag,
  • and any file with the declared content type “text/css”.
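
The exact regular expression the engine uses is not given here. As a rough sketch of the idea, the following Python pattern extracts the arguments of CSS url(...) tokens, with or without quotes. The names CSS_URL_RE and extract_css_urls are hypothetical, and the pattern deliberately ignores other forms such as @import "file.css" without url().

```python
import re

# Assumed pattern: url(...) with an optional pair of single or double quotes.
CSS_URL_RE = re.compile(
    r"""url\(\s*(?:"([^"]*)"|'([^']*)'|([^'"()\s]+))\s*\)""",
    re.IGNORECASE,
)


def extract_css_urls(css):
    """Return the URLs found in url(...) tokens of a CSS string, in order."""
    urls = []
    for match in CSS_URL_RE.finditer(css):
        # Exactly one of the three groups is set, depending on the quoting style.
        url = match.group(1) or match.group(2) or match.group(3)
        if url:
            urls.append(url.strip())
    return urls


if __name__ == "__main__":
    sample = 'body { background: url("img/bg.png"); } a { cursor: url(hand.cur), auto; }'
    print(extract_css_urls(sample))  # ['img/bg.png', 'hand.cur']
```

The same function can be applied to all three locations, since a style attribute, a <style> tag, and a text/css file all contain plain CSS text.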

Third Group

The last group contains links that require specialized code for their extraction. Our engine handles the following kinds of links within this group:

  • <base> tag – This tag changes the meaning of all relative links within a web page.
  • <!--[if ...]> – The conditional comment is an Internet Explorer-specific construct. It contains HTML code that Internet Explorer interprets, so, unlike other comments, it should be analyzed for links.
  • <meta> content url values with <meta> http-equiv="refresh" – The link is a part of the content attribute’s value, which is a list of values separated using semicolon.
  • <param> value values with <param> name="movie".
  • <object> code, classid, codebase, data, and archive values – A messy and complicated tag whose attributes are poorly defined and handled differently by each web browser, so various hacks are commonly used to achieve the required functionality in all major browsers. This is why its analysis is similarly messy.
  • <applet> code, codebase, and archive values – Almost the same horror as with the <object> tag.
  • <form> action values – It is important to respect the value of the <form> method attribute, because requesting the form’s target page with GET instead of POST, or vice versa, can lead to an incorrect assessment of the link’s validity.
  • <input> formaction values – As in the previous case, the correct HTTP method must be used for verification. The method can be found in the <input> tag’s parent <form> tag.
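
Three of the cases above are easy to illustrate in a few lines of Python. The sketch below shows one possible way to pull the URL out of a refresh value, to resolve a relative link when a <base> tag is present, and to choose the HTTP method for a form target. All function names are hypothetical and the logic is simplified; it is not the engine’s actual implementation.

```python
from urllib.parse import urljoin


def parse_meta_refresh(content):
    """Return the URL from a refresh value such as '5; url=/next', or None."""
    # Split only at the first semicolon, so semicolons inside the URL survive.
    _delay, sep, rest = content.partition(";")
    if not sep:
        return None
    name, eq, value = rest.strip().partition("=")
    if not eq or name.strip().lower() != "url":
        return None
    return value.strip().strip("'\"") or None


def resolve_link(page_url, base_href, link):
    """Resolve a link against the <base> href if there is one, else the page URL."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)


def form_request_method(method_attr):
    """Map a <form> method attribute to the HTTP method used for checking."""
    # HTML forms default to GET when the attribute is missing or unrecognized.
    return "POST" if (method_attr or "").strip().lower() == "post" else "GET"


if __name__ == "__main__":
    print(parse_meta_refresh("5; url=/next.html"))  # /next.html
    print(resolve_link("http://example.com/a/b.html", "/static/", "img/x.png"))
    # http://example.com/static/img/x.png
    print(form_request_method(None))  # GET
```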

Link Position in Dynamic Pages

Website Link Checker is able to find, check, and report all of the link types mentioned above. It can also tell you the exact position of each link within the code of an analyzed web page. This feature is quite tricky. Websites can contain hundreds or even thousands of web pages, and it would be impractical to store all of them locally, since the link checker’s report is usually used only once, immediately after the job finishes. Instead, when a user clicks the HTML icon to see the positions of the links within the code of a selected web page, Website Link Checker downloads a fresh copy of that page, and that page only, from the target server. With dynamic web pages, however, the newly downloaded content may differ from the content that was analyzed. This is why Website Link Checker performs a heuristic search: it takes the links it found during the link checking process and tries to locate them within the new content. Thus, unlike many simpler link checkers, it can correctly display the positions of links even within dynamic web pages.
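
The heuristic itself is not described in this post. To give an idea of what such a search might look like, here is one possible approach sketched in Python: remember a short window of text around each link during analysis, then pick the occurrence in the new content whose surroundings are most similar. The functions context_around, relocate_link, and offset_to_line_col, as well as the window size, are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def context_around(content, offset, length, window=40):
    """Return the text surrounding a link of the given length at offset."""
    return content[max(0, offset - window): offset + length + window]


def relocate_link(new_content, raw_url, old_context, window=40):
    """Return the offset of the occurrence of raw_url in new_content whose
    surroundings best match old_context, or None if the URL is gone."""
    best_pos, best_score = None, -1.0
    start = new_content.find(raw_url)
    while start != -1:
        candidate = context_around(new_content, start, len(raw_url), window)
        score = SequenceMatcher(None, old_context, candidate).ratio()
        if score > best_score:
            best_pos, best_score = start, score
        start = new_content.find(raw_url, start + 1)
    return best_pos


def offset_to_line_col(content, offset):
    """Convert a character offset to a 1-based (line, column) pair."""
    line = content.count("\n", 0, offset) + 1
    column = offset - (content.rfind("\n", 0, offset) + 1) + 1
    return line, column


if __name__ == "__main__":
    url = "/about.html"
    old = '<p><a href="/about.html">x</a></p>\n<a href="/about.html">y</a>'
    new = "<div>banner</div>\n" + old  # the dynamic page gained a line on top
    old_offset = old.rfind(url)        # position recorded during analysis
    remembered = context_around(old, old_offset, len(url))
    position = relocate_link(new, url, remembered)
    print(offset_to_line_col(new, position))  # (3, 10)
```

Comparing surrounding text rather than raw offsets means that content inserted or removed elsewhere on the page does not throw the search off, at the cost of extra work when a URL occurs many times.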