Google Sitemaps - What is it - Why you should be using it

Thursday, April 12th, 2007

A sitemap is a single web page or series of pages containing links to all pages on your site, typically organized in hierarchical fashion, to help search engines locate and index those pages. If you don’t have a sitemap already, you need to create one to ensure all of the pages you want indexed are easily found by search engines.

Google, in looking for ways to reduce spidering bandwidth and improve the freshness and relevance of its SERPs, introduced a new sitemap protocol back in mid-2005. Under this protocol, a sitemap is an XML file in a particular format, generated from URL lists, web server directories, or web server access logs. You post the file on your site and notify Google of its location, and of subsequent changes to it, via a ‘ping’. Google downloads the file and updates its index of your pages with the listed or updated URLs. This ‘push’ mechanism has advantages over waiting for Google to ‘pull’ in the traditional fashion: you can notify Google of new or changed pages right away, and Google can then spider your site faster and more efficiently, based on the URLs supplied in your sitemap file. The protocol also allows for additional information about pages, such as when a page was last updated and how frequently it changes. Bottom line is you get your pages indexed by Google faster, and you notify Google of any additions or changes directly, rather than waiting for Google to find them on its own.

Before we get into the nitty-gritty of generating and uploading your first sitemap, note that there have been some new developments with the sitemaps protocol. It has been adopted by Google, Yahoo, MSN and now Ask.com! What this means is that you can now generate a single sitemap file and send it to all four of these major search engines. Well, three actually. Microsoft doesn’t have a mechanism to accept a sitemap file yet, likely sidetracked with the launch of Vista and Live, but it should have one fairly soon.

In addition, you no longer need to manually notify Google, Yahoo or Ask.com of changes to your sitemap. The sitemap protocol now allows the location of your sitemap file to be included in your robots.txt file. These search engines read your robots.txt file, get the location of your sitemap file and download it automatically. How great is that!?
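As a rough illustration of the robots.txt approach, the sketch below adds a `Sitemap:` line (the directive defined on sitemaps.org) to existing robots.txt content if it is not already there. The function name `add_sitemap_line` and the sample domain are made up for this example; treat it as one possible way to do it, not a required tool.

```python
def add_sitemap_line(robots_txt, sitemap_url):
    """Return robots.txt content that contains a 'Sitemap:' line for sitemap_url.

    Hypothetical helper for illustration; it only works on text and does not
    touch any file on disk.
    """
    for line in robots_txt.splitlines():
        # Split "Sitemap: http://..." at the first colon only.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip() == sitemap_url:
            return robots_txt  # already listed, leave the content untouched
    body = robots_txt.rstrip("\n")
    prefix = body + "\n\n" if body else ""
    return prefix + "Sitemap: " + sitemap_url + "\n"


existing = "User-agent: *\nDisallow: /private/\n"
print(add_sitemap_line(existing, "http://www.yourdomain.tld/sitemap.xml"))
```

The sitemap location should be the full URL, including the http:// part, rather than a relative path.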

So how do you generate a sitemap file that adheres to the defined format accepted by these major search engines? Well, if you have a small site of only one or a few pages, you can create one manually using a standard text editor such as Windows Notepad. The format of the sitemap file and protocol can be found on sitemaps.org. Here is an example sitemap file containing a single URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>

And another containing multiple URLs:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
<lastmod>2004-12-23</lastmod>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
<lastmod>2004-12-23T18:00:15+00:00</lastmod>
<priority>0.3</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
<lastmod>2004-11-23</lastmod>
</url>
</urlset>

As you can see, it’s very easy, with a fixed number of fields and options. Review the sitemap protocol and format, create your sitemap, save it as sitemap.xml on your web site and then submit it to Google. You’ll need to create an account on Google Webmaster Tools to enter your web site address and then upload the initial sitemap. After that, you simply ping Google whenever your sitemap changes by going to http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.yourdomain.tld/sitemap.xml in a web browser. The same goes for Yahoo: create an account on Yahoo Site Maps to upload your first sitemap, then ping with subsequent changes. The Yahoo ping URL is simply http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://www.yourdomain.tld/sitemap.xml. Ask.com doesn’t require that you create an account with them; just use the ping URL http://submissions.ask.com/ping?sitemap=http://www.yourdomain.tld/sitemap.xml. In all cases, replace www.yourdomain.tld with your actual domain and sitemap.xml with the real name of your sitemap file.
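If you would rather build the ping URLs programmatically than type them out, something like the following sketch could work. It uses the three ping endpoints quoted above and percent-encodes the sitemap location, which is the safer form when a URL is passed as a query parameter. The names `PING_ENDPOINTS` and `build_ping_urls` are invented for this example.

```python
from urllib.parse import quote

# Ping endpoints as quoted in the text above; {} is filled with the
# percent-encoded sitemap location.
PING_ENDPOINTS = {
    "google": "http://www.google.com/webmasters/sitemaps/ping?sitemap={}",
    "yahoo": "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap={}",
    "ask": "http://submissions.ask.com/ping?sitemap={}",
}


def build_ping_urls(sitemap_url):
    """Return a dict mapping each search engine to its ping URL (hypothetical helper)."""
    # safe="" makes quote() encode ':' and '/' as well.
    encoded = quote(sitemap_url, safe="")
    return {name: template.format(encoded) for name, template in PING_ENDPOINTS.items()}


for engine, url in build_ping_urls("http://www.yourdomain.tld/sitemap.xml").items():
    print(engine, url)
```

This only builds the URLs; fetching them (in a browser, with wget, or otherwise) is a separate step.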

However, if you have a large site with hundreds to tens of thousands of pages (or more), you’ll want to use an automated sitemap generation script to do this for you. If you use a Content Management System (CMS), blog, bulletin board or forum, chances are this capability is already built in or is available as a plugin or module for that system. Check the website, file repository or forums for your particular content system for details.

If you are not using one of these types of content management systems, there are also free and commercial stand-alone products that can do the hard work for you. A free Python script with sample configs is available on SourceForge that anyone can use. This is the one I use. It supports the aforementioned three methods of populating your sitemap - URL list, website directories, or web server access log.
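To give a feel for what the URL list method involves, here is a minimal sketch that turns a list of URLs into sitemap XML using Python’s standard library. It is not how the SourceForge script works internally (I am not reproducing its code here); the function name `build_sitemap` and the dictionary layout of each entry are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from dicts with 'loc' and optional
    'lastmod', 'changefreq' and 'priority' keys (hypothetical layout)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the fields in the same order as the examples above.
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    # ElementTree escapes characters such as '&' in the text for us.
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


print(build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
     "changefreq": "weekly"},
]))
```

The output is written on a single line rather than indented, which makes no difference to an XML parser. Letting a library do the writing also takes care of escaping, so a URL containing & comes out as &amp; in the file.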

I prefer the access log method: if you have shell access to your host and CRONTAB access, you can automate the generation of your sitemap and the pinging of Google when it changes. Instructions are included in the accompanying example config files. A simple CRON entry to automate daily generation would look something like this:

45 23 * * * /usr/bin/python /usr/local/scripts/sitemap_gen-1.4/sitemap_gen.py --config=/usr/local/scripts/sitemap_gen-1.4/config.xml | mail -s "Generated Sitemap for domain.tld" you@youremail.tld

Be sure to edit the config.xml file first, replacing the various paths and filenames as appropriate. Anything after the pipe (|) is optional; it sends the output of the script to mail so that you receive a notification, which is a good idea in order to catch any problems. One other note: you’ll want to time the generation of your sitemap to coincide with the time your access logs are rotated. My logs rotate at midnight, so I schedule this to run about 15 minutes before, so that it captures the most data.
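For the curious, the idea behind the access log method can be sketched roughly as follows: read the log, keep the pages that were successfully requested, and use those as your URL list. This is only an illustration and not the SourceForge script’s actual logic. It assumes log lines in the Common Log Format, and the names `LOG_LINE` and `urls_from_access_log` are made up for the example.

```python
import re

# Picks the request and status out of a Common Log Format line such as:
# 1.2.3.4 - - [12/Apr/2007:23:40:01 +0000] "GET /page.html HTTP/1.1" 200 5120
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def urls_from_access_log(lines, base_url):
    """Return unique URLs for successful GET requests, in first-seen order."""
    seen = set()
    urls = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a line in the assumed format
        if match["method"] != "GET" or match["status"] != "200":
            continue  # skip errors, redirects and non-GET requests
        url = base_url.rstrip("/") + match["path"]
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = [
    '1.2.3.4 - - [12/Apr/2007:23:40:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '1.2.3.4 - - [12/Apr/2007:23:40:02 +0000] "GET /missing.html HTTP/1.1" 404 312',
    '1.2.3.4 - - [12/Apr/2007:23:40:05 +0000] "GET /about.html HTTP/1.1" 200 2048',
]
print(urls_from_access_log(sample, "http://www.yourdomain.tld"))
```

It also shows why timing matters: a log that has just been rotated is nearly empty, so running shortly before rotation gives the script the fullest set of requests to work from.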

You can then ping the search sites either manually or, for those that support it, automatically via wget in a simple shell script:

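#!/bin/sh
# Ping each search engine with the URL-encoded location of the sitemap.
# -O /dev/null discards the response body instead of saving it to a file.

/usr/bin/wget -v -O /dev/null "http://www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.yourdomain.tld%2Fsitemap.xml"
/usr/bin/wget -v -O /dev/null "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http%3A%2F%2Fwww.yourdomain.tld%2Fsitemap.xml"
/usr/bin/wget -v -O /dev/null "http://submissions.ask.com/ping?sitemap=http%3A%2F%2Fwww.yourdomain.tld%2Fsitemap.xml"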

Name the file google_ping or similar, chmod it to 700 or 755 and chown it as appropriate (if your web logs are not world-readable, you may need to run it as the web server user [apache|www|httpd|etc]). Then run your new google_ping script via another CRON entry, perhaps 15-30 minutes after sitemap generation, like so:

15 0 * * * /usr/local/scripts/google_ping 2>&1 | mail -s "Pinged All Sitemaps" you@yourdomain.tld

Hopefully this will all become unnecessary once the new robots.txt auto-discovery is supported by all the major search engines. Until then, the above should work for you just fine. At minimum, you can still generate and submit manually (or via your CMS/plugin, if you have one).

So if you don’t have a sitemap file, I recommend you create one immediately, include its location in your robots.txt file, and submit it to these four (three) search engine sites. You should get spidered more quickly, your indexed pages should increase, and subsequently so should your traffic.