Web harvesting is a term used by the National Library to describe the selecting, copying and archiving of websites found on the internet. The collection of New Zealand websites is covered by Legal Deposit legislation (National Library of New Zealand Act 2003, Part 4).
General information for Publishers
Most web harvesting is undertaken on a selective basis by the Alexander Turnbull Library. These websites form the New Zealand Web Archive, which is part of the Alexander Turnbull Library's published collections.
Special harvests of the .NZ domain have been undertaken by the National Library every couple of years since 2008. These provide snapshots of the New Zealand Internet. The domain harvests are not currently available to the public.
Nominate a site
If you know of a New Zealand or Pacific website you feel should be in the collection, do let us know. If you’re the copyright owner of a Pacific Island website or a New Zealander publishing a website overseas, you are most welcome to nominate your own site. This will assist us in the permissions process, as overseas websites are not covered by New Zealand Legal Deposit legislation.
Technical information for Webmasters
The Library uses the Web Curator Tool to acquire copies of the publicly available pages on a website. If the Library takes a copy of your website you will see this identifier in your website access logs:
Mozilla/5.0 compatible; heritrix/1.14.1 +https://natlib.govt.nz/publishers-and-authors/web-harvesting
If our crawling is having an impact on your site, please contact us immediately.
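If you would like to check whether the Library's crawler has visited your site, you can search your access logs for the identifier above. The following Python sketch shows one way to do this. It assumes your server writes the Apache/Nginx "combined" log format. The function names and the sample log line are made up for illustration, so adjust the parsing if your logs are laid out differently.

```python
import re
from collections import Counter

# Substring taken from the identifier quoted above; matched case-insensitively.
CRAWLER_TOKEN = "heritrix"

# Assumed log layout: "combined" format, where the request line, referrer and
# user agent are the double-quoted fields, in that order.
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def crawler_requests(log_lines):
    """Yield (path, user_agent) for each line whose user agent names the crawler."""
    for line in log_lines:
        fields = QUOTED_FIELD.findall(line)
        if len(fields) < 3:
            continue  # not in the assumed format; skip rather than guess
        request, user_agent = fields[0], fields[-1]
        if CRAWLER_TOKEN in user_agent.lower():
            parts = request.split()
            path = parts[1] if len(parts) >= 2 else request
            yield path, user_agent


def summarise(log_lines):
    """Count crawler requests per requested path."""
    return Counter(path for path, _ in crawler_requests(log_lines))


if __name__ == "__main__":
    # Hypothetical log line for demonstration only.
    sample = [
        '203.0.113.5 - - [10/Oct/2020:20:15:01 +1300] "GET /about HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0 compatible; heritrix/1.14.1 '
        '+https://natlib.govt.nz/publishers-and-authors/web-harvesting"',
    ]
    for path, count in summarise(sample).items():
        print(path, count)
```

To run it against a real log, pass an open file object to `summarise` in place of the sample list.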
Most of our harvests are scheduled for evenings or weekends, with a delay between page requests so that we cause minimal disruption while web pages are copied. We have found it necessary to ignore the robots.txt protocol in order to obtain a copy that retains the look of the original website. We monitor the Web Curator Tool logs while the harvest is in progress to check for any crawler traps and avoid them if necessary. If we have technical problems downloading your website we may contact you.
Design tips for making your website preservable
Websites are designed for online use. Few website owners consider how accessible their websites will be once they are archived in a digital repository. Here are some tips to enable your website to be more easily preserved.
1. Web Standards and Accessibility Guidelines
Follow web standards and accessibility guidelines. These standards are helpful not only for online users, but also for the web crawlers that institutions like the National Library of New Zealand use when copying websites from the Internet.
2. Site Map
Include a site map that provides a standard link to all website content (including pages, images, documents, videos). The URLs should be static and explicit so that the National Library web crawler can follow the links.
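One common machine-readable form of site map is an XML file that follows the sitemaps.org protocol. This page does not say which form the Library prefers, so treat the following Python sketch as an illustration only. The function name and the example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as a string) listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serialising.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical content: list pages, images, documents and videos alike.
    print(build_sitemap([
        "https://example.org/about.html",
        "https://example.org/images/logo.png",
        "https://example.org/reports/annual-report.pdf",
    ]))
```

Listing plain, static URLs like these gives a crawler an explicit path to content it might not reach by following navigation menus alone.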
3. Avoid proprietary formats for important content
Open standards like HTML and CSS are widely used, but content in a proprietary format may become unreadable if the product that supports it is discontinued, making it much more difficult to preserve long term. The National Library crawler can’t handle Flash, so if you use Flash, offer a text-only version too so that your website can be harvested and preserved.
4. Maintain stable URLs and redirect when necessary
When updating a website, be sure to provide redirects for relocated documents. In other words, avoid link rot! If you update your website and use a new URL, ensure the old URL redirects to it.
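As a rough illustration of this tip, the Python sketch below keeps a table of retired paths and answers requests for them with a permanent (301) redirect. It also includes a small check that every redirect still points at a live page. All paths and function names are hypothetical, and most web servers and content management systems offer their own way to configure redirects.

```python
# Hypothetical mapping from retired paths to their current locations.
REDIRECTS = {
    "/news/2019/annual-report.html": "/publications/annual-report-2019",
    "/contact.php": "/contact",
}


def resolve(path, existing_paths):
    """Return (status, location) for a requested path.

    200 if the page exists, 301 with the new location if it has moved,
    404 if it is unknown.
    """
    if path in existing_paths:
        return 200, path
    target = REDIRECTS.get(path)
    if target is not None:
        return 301, target
    return 404, None


def broken_redirects(existing_paths):
    """List old paths whose redirect target no longer exists."""
    return sorted(old for old, new in REDIRECTS.items() if new not in existing_paths)


if __name__ == "__main__":
    live = {"/contact", "/publications/annual-report-2019"}
    print(resolve("/contact.php", live))
    print(broken_redirects(live))
```

Running a check like `broken_redirects` after each site update is one way to catch redirects that have gone stale.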
5. Allow browsing of collections, not just searching
If your website includes a searchable collection of items, make sure it’s also browsable, for example through an A-Z list or by category. The search function doesn’t operate in an archived site, but the browse capability usually does.
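A browsable A-Z list can be generated from the same data that feeds your search. The Python sketch below groups items by the first letter of their title and renders them as plain links that a crawler can follow. The item titles, URLs and function names are hypothetical.

```python
from collections import defaultdict
from html import escape


def build_az_index(items):
    """Group (title, url) pairs under the first letter of each title.

    Titles that do not start with a letter are filed under "#".
    """
    index = defaultdict(list)
    for title, url in items:
        first = title.strip()[:1].upper()
        key = first if first.isalpha() else "#"
        index[key].append((title, url))
    return {
        letter: sorted(entries, key=lambda entry: entry[0].lower())
        for letter, entries in sorted(index.items())
    }


def render_az_index(index):
    """Render the index as static HTML headings and link lists."""
    lines = []
    for letter, entries in index.items():
        lines.append(f"<h2>{escape(letter)}</h2>")
        lines.append("<ul>")
        for title, url in entries:
            lines.append(f'  <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>')
        lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    items = [("Kiwi", "/items/kiwi"), ("kea", "/items/kea"), ("Albatross", "/items/albatross")]
    print(render_az_index(build_az_index(items)))
```

Because the output is ordinary HTML with static links, it should keep working in an archived copy where the search box does not.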
6. Consider using a Creative Commons license
The Library is permitted to copy and archive New Zealand websites according to Legal Deposit legislation, but a Creative Commons license lets users know what they are allowed to download or reuse from your website. A CC license also gives us permission to archive sites that may not be in scope for Legal Deposit, but are still in scope for the Library’s collections.
7. Adding dates to your web pages
It’s very helpful for the user to be able to see when a webpage was created or last updated. We include the date the website was harvested in our archival browser, but that date may be different from the date the website was created or updated.
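One way to show a date that both people and software can read is the HTML `time` element, which pairs a visible date with a machine-readable one. The Python sketch below builds such a line. The function names and the label text are hypothetical, and your own templates may already have a place for this.

```python
from datetime import date
from html import escape

# Month names are spelled out here so the output does not depend on locale settings.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def format_date(d):
    """Return a date in day-month-year form, for example 15 June 2021."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


def date_line(label, d):
    """Return an HTML paragraph showing a labelled date.

    The datetime attribute carries the ISO 8601 form for software; the
    element text carries the same date for human readers.
    """
    return (
        f'<p>{escape(label)}: '
        f'<time datetime="{d.isoformat()}">{format_date(d)}</time></p>'
    )


if __name__ == "__main__":
    print(date_line("Last updated", date(2021, 6, 15)))
```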
8. Use archive-friendly platform providers
If preserving your website is important to you, check that your platform provider is archive-friendly. Some companies block crawlers outright, which may prevent us from crawling your site. If you are using crawler-blocking software, consider adding our crawler to your whitelist. Contact Web.Archive@dia.govt.nz for more information.
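How you whitelist a crawler depends on the blocking software you use. As a rough sketch only, the Python below shows the general idea of checking an allow list before a block list. The token lists and the function name are hypothetical. It also assumes that the word "heritrix" is the stable part of the crawler's identifier, which this page does not guarantee.

```python
# Assumption: "heritrix" is the stable part of the crawler's identifier;
# the version number that follows it may differ.
ALLOWED_CRAWLER_TOKENS = ("heritrix",)

# Hypothetical tokens that blocking software might reject by default.
BLOCKED_TOKENS = ("bot", "crawler", "spider", "heritrix")


def should_block(user_agent):
    """Return True if a request should be blocked based on its user agent.

    The allow list is checked first, so an allowed crawler is let through
    even when it also matches a blocked token.
    """
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_CRAWLER_TOKENS):
        return False
    return any(token in ua for token in BLOCKED_TOKENS)


if __name__ == "__main__":
    print(should_block(
        "Mozilla/5.0 compatible; heritrix/1.14.1 "
        "+https://natlib.govt.nz/publishers-and-authors/web-harvesting"
    ))
```

A user agent is reported by the client itself and can be imitated, so a rule like this is a convenience rather than a security control.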
ArchiveReady is a handy tool to use if you want to quickly find out if your website is archive ready. Simply enter the URL of your website and the tool will analyse your site and prepare a brief summary for you.
These tips have been adapted from the following sources.
Guidelines for Preservable Websites, Columbia University Libraries.
Designing preservable websites, redux, Library of Congress.
Five Tips for Designing Preservable Websites, Smithsonian Institution Archives.
Making Your Website Crawler-Friendly, UK Web Archive.