Information about Link Rot

Link rot is the process by which links on a website gradually become irrelevant or broken as time goes on, because websites that they link to disappear, change their content or redirect to new locations.

The phrase also describes the effects of failing to update web pages so that they become out-of-date, containing information that is old and useless, and that clutters up search engine results. This process most frequently occurs in personal web pages and is prevalent in free web hosts such as GeoCities, where there is no financial incentive to fix link rot (most of these sites have not been updated for years on end).

Prevalence

The 404 "not found" response is familiar to even the occasional Web user. A number of studies have examined the prevalence of link rot on the Web, in academic literature, and in digital libraries. Fetterly et al. (2003) found that about 0.5% of web pages disappeared each week. McCown et al. (2005) found that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication, and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003; Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.
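To get a feel for what a weekly loss rate means over longer periods, the sketch below compounds the 0.5% figure reported by Fetterly et al. It assumes a constant, independent weekly loss rate, which is a simplification and not something the study claims. The function names are illustrative only.

```python
import math


def surviving_fraction(weekly_loss: float, weeks: int) -> float:
    """Fraction of pages still present after `weeks`, assuming a constant weekly loss rate."""
    return (1.0 - weekly_loss) ** weeks


def half_life_weeks(weekly_loss: float) -> float:
    """Weeks until half of the pages are gone, under the same constant-rate assumption."""
    return math.log(0.5) / math.log(1.0 - weekly_loss)


if __name__ == "__main__":
    rate = 0.005  # 0.5% per week, the figure quoted above
    print(f"after one year: {surviving_fraction(rate, 52):.1%} of pages remain")
    print(f"half-life: {half_life_weeks(rate):.0f} weeks")
```

Under this assumption roughly 77% of pages would remain after one year, and half would be gone after roughly 138 weeks.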

News sites contribute to the link rot problem by commonly keeping only recent news articles online where they are freely accessible at their original URLs, then removing them or moving them to a paid subscription area. This causes a heavy loss of supporting links in sites discussing newsworthy events and using news sites as references.

Discovering

Detecting link rot for a given URL is difficult using automated methods. If a URL is accessed and returns an HTTP 200 (OK) response, it may be considered accessible, but the contents of the page may have changed and may no longer be relevant. Some web servers also return a soft 404: a page served with a 200 (OK) response instead of a 404, even though its content indicates that the URL is no longer accessible. Bar-Yossef et al. (2004) developed a heuristic for automatically discovering soft 404s.
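One simple way to look for soft 404s is to ask the same server for a page that almost certainly does not exist and compare its answer with the page being checked. The sketch below illustrates that idea only; it is not the heuristic published by Bar-Yossef et al. The `fetch` callable, the `classify` and `probe_url` names, and the 0.9 similarity threshold are all assumptions made for this example.

```python
import difflib
import uuid
from typing import Callable, Tuple
from urllib.parse import urlsplit, urlunsplit

# Hypothetical fetcher: takes a URL, returns (HTTP status code, body text).
Fetch = Callable[[str], Tuple[int, str]]


def probe_url(url: str) -> str:
    """Build a URL on the same host whose random path almost certainly does not exist."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/" + uuid.uuid4().hex, "", ""))


def classify(url: str, fetch: Fetch, threshold: float = 0.9) -> str:
    """Return 'broken', 'soft-404', or 'ok' for a URL."""
    status, body = fetch(url)
    if status >= 400:
        return "broken"
    probe_status, probe_body = fetch(probe_url(url))
    if probe_status >= 400:
        return "ok"  # the server reports missing pages honestly
    # The server answered without an error for a page that should not exist.
    # If the real page looks almost the same, it is probably an error page too.
    similarity = difflib.SequenceMatcher(None, body, probe_body).ratio()
    return "soft-404" if similarity >= threshold else "ok"
```

A result of "ok" only means the server returned something distinct from its error page. It does not show that the content is still relevant to the citing page.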

Combating

The WordPress blog publishing system implements a guard against link rot: if an author renames a link, the old link automatically redirects to the new location.[1]
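The general idea of redirecting a renamed link can be sketched as a small registry that remembers every old name. This is an illustration of the technique, not WordPress's actual implementation. The `SlugRegistry` class and its methods are hypothetical.

```python
from typing import Dict, Optional, Tuple


class SlugRegistry:
    """Hypothetical store that remembers old slugs when a page is renamed."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}  # current slug -> content
        self._moved: Dict[str, str] = {}  # old slug -> newer slug

    def publish(self, slug: str, content: str) -> None:
        self._pages[slug] = content

    def rename(self, old: str, new: str) -> None:
        """Move a page to a new slug and keep a redirect from the old one."""
        self._pages[new] = self._pages.pop(old)
        self._moved[old] = new
        self._moved.pop(new, None)  # avoid a loop if a page is renamed back

    def resolve(self, slug: str) -> Tuple[int, Optional[str]]:
        """Return (200, content), (301, current slug), or (404, None)."""
        if slug in self._pages:
            return 200, self._pages[slug]
        seen = set()
        target = slug
        # Follow the chain of renames until it ends or repeats.
        while target in self._moved and target not in seen:
            seen.add(target)
            target = self._moved[target]
        if target in self._pages:
            return 301, target
        return 404, None
```

Because the registry follows the whole chain of renames, a link to the very first name still resolves after several later renames.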

Web archiving

To combat link rot, web archivists are actively engaged in collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content.

Individuals may also use a number of tools that allow them to archive web resources that may go missing in the future:
  • WebCite, a tool specifically for scholarly authors, journal editors and publishers to permanently archive "on-demand" and retrieve cited Internet references (Eysenbach and Trudel, 2005).
  • StayBoyStay, an on-demand archiving service that can archive any number of web pages. The new URI includes a hash and a date that prove when the archive was taken and that no tampering has occurred.
  • Archive-It, a subscription service that allows institutions to build, manage and search their own web archive.
  • hanzo:web is a personal web archiving service created by Hanzo Archives that can archive a single web resource, a cluster of web resources, or an entire website, as a one-off collection, scheduled/repeated collection, an RSS/Atom feed collection or collect on-demand via Hanzo's open API.
  • Internet Archive (The Internet Archive Wayback Machine) is free to use and automatically takes periodic snapshots of pages that can then be accessed for free and without registration many years later simply by typing in the URL, which is helpful when dealing with link rot.
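As an illustration of looking up an archived copy by URL, the sketch below builds a Wayback Machine address for a page as it stood around a given date. The `web.archive.org/web/<timestamp>/<url>` pattern is how such addresses are commonly written, but it is not described in the text above and should be treated as an assumption. The `wayback_url` helper is hypothetical.

```python
from datetime import datetime


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine address for `url` near the moment `when`.

    Assumes the 14-digit YYYYMMDDhhmmss timestamp form. The archive is
    expected to serve the snapshot closest to that time, if one exists.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    print(wayback_url("http://example.com/page", datetime(2007, 10, 5)))
```

An author citing a page can record such an address next to the original URL, so that readers have a fallback if the original disappears.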

Webmasters

Webmasters have developed a number of best practices for combating link rot:
  • Avoiding unmanaged hyperlink collections
  • Avoiding links to pages deep in a website ("deep linking")
  • Using hyperlink checking software or a Content Management System (CMS) that automatically checks links
  • Using permalinks
  • Using redirection mechanisms (e.g. "301: Moved Permanently") to automatically refer browsers and crawlers to the new location of a URL
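The link-checking practice in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not a substitute for a dedicated link checker or a CMS feature. The names `extract_links`, `http_status` and `broken_links` are made up for this example.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class _LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> List[str]:
    collector = _LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def http_status(url: str) -> Optional[int]:
    """Status of a HEAD request (redirects are followed), or None if unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except OSError:  # DNS failures, refused connections, timeouts
        return None


def broken_links(
    html: str,
    base_url: str,
    get_status: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, Optional[int]]:
    """Map each link that fails (status >= 400 or unreachable) to its status."""
    report: Dict[str, Optional[int]] = {}
    for link in extract_links(html, base_url):
        status = get_status(link)
        if status is None or status >= 400:
            report[link] = status
    return report
```

Because `urlopen` follows redirects, a page that has moved and answers with "301: Moved Permanently" is reported as healthy as long as the new location responds. This is the effect the redirection practice is meant to achieve. Some servers reject HEAD requests, so a real checker may need to fall back to GET.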

Authors citing URLs

A number of studies have shown how widespread link rot is in academic literature (see Prevalence above). Authors of scholarly publications have also developed best practices for combating link rot in their work.

References

1. Rønn-Jensen, Jesper (2007-10-05). Software Eliminates User Errors And Linkrot. Justaddwater.dk. Retrieved on 2007-10-05.

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.