In this post, I will explain what a sitemap is and how it is used. Then, I will explain the default behavior in a website powered by Hugo. Finally, I will detail why it is interesting to override this default behavior and how to do that.
❯ What is a sitemap?
The sitemap is an XML file, generally located at the root of a website, that describes the pages this website contains. Its main purpose is to help search engines (Bing, DuckDuckGo, Google, …) understand your website by listing which pages exist, along with hints about when to fetch them again: the last modification date of each page, or a fetch frequency such as daily, weekly, monthly, … depending on how often you update the content of your pages.
You can check the sitemap file of most websites by visiting their base URL followed by /sitemap.xml. For example, for my blog, the sitemap is accessible at the following URL: https://blog.laromierre.com/sitemap.xml. The structure looks like this:
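As a minimal sketch, a sitemap that follows the sitemaps.org protocol has roughly this shape. The URLs, dates, and values below are placeholders on example.org, not real entries:

```xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> block per page; <loc> is required, the other tags are optional hints. -->
  <url>
    <loc>https://example.org/</loc>
    <lastmod>2024-03-20T10:00:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.org/posts/post1/</loc>
    <lastmod>2024-03-18T08:30:00+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```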
Larger websites split their sitemap into several files, listed in a top-level sitemap that acts as an index. For example, Backmarket’s sitemap, accessible at https://backmarket.com/sitemap.xml, looks like this:
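The general shape of such an index, as defined by the sitemaps.org protocol, can be sketched as follows. The file names on example.org are placeholders chosen for illustration, not the actual entries of any real website:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <sitemap> entry points to another sitemap file instead of a page. -->
  <sitemap>
    <loc>https://www.example.org/sitemap_general.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.org/sitemap_products.xml</loc>
  </sitemap>
</sitemapindex>
```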
Each of these links leads to another XML file describing the pages corresponding to this sitemap. In the same example, if you visit https://www.backmarket.com/sitemap_general.xml, you land on a new XML file containing a nice collection of 689 URLs!
❯ How does it work in Hugo websites?
Well, it’s pretty straightforward: when you generate your website with the hugo command, the sitemap is generated along with the rest of the website. After running this command, if you take a look at the public/ folder, you will see that a file named sitemap.xml has been created!
You can define a general behavior for your website by adding the following sitemap-related configuration to your config file, or adjust it for specific pages by adding it to their frontmatter:
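A minimal sketch of that configuration, assuming a TOML config file. The values are examples only, not recommendations:

```toml
# config.toml: site-wide sitemap defaults
[sitemap]
  changeFreq = "weekly"  # how often pages are expected to change
  priority = 0.5         # relative importance of pages within the site
```

In a page with TOML frontmatter, the same `[sitemap]` table goes between the `+++` delimiters.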
There are currently two possible configuration keys. The first one, changeFreq, lets you specify how often the page content is updated. Acceptable values are always, hourly, daily, weekly, monthly, yearly and never. By default, this value is not specified in the sitemap file. The other one, priority, indicates the page’s priority relative to the other pages of the website, as a value between 0.0 and 1.0.
For more information about this configuration and the possibilities, I invite you to read the corresponding documentation on the gohugo website.
You can also override the default configuration by specifying new values in the frontmatter of the concerned files. For example, imagine that, overall, you do not update the content of your website very often, so you have set the change frequency to monthly in the general configuration. But you refresh the content of one specific file every day and want search engines to fetch this page daily. You can achieve this as follows.
In your config.toml:
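A sketch of the site-wide setting for this scenario:

```toml
# config.toml: default change frequency for every page of the site
[sitemap]
  changeFreq = "monthly"
```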
In the frontmatter of the concerned file:
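A sketch of the per-page override, assuming the page uses TOML frontmatter. The title is a hypothetical placeholder:

```toml
+++
# Hypothetical page that is refreshed every day
title = "Daily report"

# Overrides the site-wide changeFreq for this page only
[sitemap]
  changeFreq = "daily"
+++
```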
The generated sitemap.xml file looks like this:
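As a trimmed sketch of the expected result (placeholder URLs on example.org; a real Hugo sitemap also carries other tags such as lastmod), one entry inherits the site-wide value and the other carries the override:

```xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- A page that inherits the site-wide setting from config.toml -->
  <url>
    <loc>https://example.org/posts/post1/</loc>
    <changefreq>monthly</changefreq>
  </url>
  <!-- The page whose frontmatter overrides the change frequency -->
  <url>
    <loc>https://example.org/posts/post2/</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>
```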
❯ How to exclude URLs from the sitemap
The first thing you will probably notice if you look at your sitemap.xml is that an XML block has been generated for every entry on your website. You may even discover pages you never tried to access before seeing them listed in this file. For instance, below is a typical structure for a Hugo project:
Suppose you have a website with two pages, two posts, and two categories, and each post uses two different tags (yes, lots of “two” in this example). How many URLs are added by default in your sitemap? Let’s count them!
- 1 for the homepage (baseurl)
- 2 for the posts (baseurl/posts/post1 and baseurl/posts/post2)
- 2 for the pages (baseurl/pages/page1 and baseurl/pages/page2)
These are roughly the places where you have explicitly added content, and typically the pages you expect search engines to suggest to potential visitors. But actually, all the following links are also added to the sitemap file:
- 2 for the categories (baseurl/categories/category1 and baseurl/categories/category2); direct access to a category may make sense, to view the list of articles that belong to it
- 4 for the tags (baseurl/tags/tag1, baseurl/tags/tag2, baseurl/tags/tag3 and baseurl/tags/tag4)
- 1 for the posts page (baseurl/posts)
- 1 for the pages page (baseurl/pages)
- 1 for the categories page (baseurl/categories)
- 1 for the tags page (baseurl/tags)
So, in this example, on top of the 5 pages we expect to be suggested to potential visitors, we have 10 pages that exist only for internal navigation on the website … and those are pages that, from my point of view, should be excluded from the sitemap file. In reality, this generally also concerns some other specific pages of the website: for instance, on my blog, I have a search page, an archives page, a page about me, and a couple of pages for the terms and conditions. I also want to exclude these pages from my sitemap.
I hope I have convinced you of the value of fine-tuning your sitemap file and choosing which paths are listed in it (if not, no need to keep reading). So, how do you exclude paths from the sitemap file?
Well, it is more or less easy, depending on the version of Hugo you are using.
❯ For Hugo ≤ v0.124.1
Up to this version, the only available configurations are those listed above (changeFreq and priority). To exclude pages from the sitemap generation, we have to override the built-in mechanism with a custom layout. To do so, let’s first create a file named layouts/_default/sitemap.xml. Then, copy into it the content of the corresponding file from the GitHub project gohugo:
On the fourth line, we now add a condition to select only the pages that are not excluded (let’s name it disable, since this is the name chosen for this feature in future Hugo versions). Thus, replace the fourth line (highlighted) with the following:
Finally, add the following configuration to the frontmatter of every file you want to exclude from sitemap.xml:
If you want to exclude a path for which you don’t have any markdown file (for example, categories, tags, …), create a file named _index.md in the corresponding content folder, in which you add the sitemap configuration as above.
Note that for tag and category pages, you may have to add a “title” key to their frontmatter. Otherwise, depending on the theme you are using, you may get an empty label where they are displayed.
For example, with a tag named aTag, create a file content/tags/a-tag/_index.md containing the title and the sitemap configuration.
As another example, if you want to exclude /posts/, create a file content/posts/_index.md with the following content:
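A minimal sketch of such a file, assuming TOML frontmatter and assuming the exclusion flag is written as a disable key under sitemap (the name used above). The key has to match whatever the condition in your custom layout tests, and the title is a placeholder:

```toml
+++
# "Posts" is a placeholder; some themes display the title as the section label.
title = "Posts"

# Assumption: the custom sitemap layout skips pages carrying this flag.
[sitemap]
  disable = true
+++
```

The file for a tag or a category has the same shape; only the title changes.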
I advise you to check the content of the generated sitemap file before publishing your changes, to verify that they have been applied and match your expectations!
❯ For Hugo > v0.124.1
At the time of writing, the latest Hugo version is v0.124.1. So I will have to verify that what I describe here works as expected once the next version is available.
Actually, the solution described in the previous section has been added in this commit, which updated the sitemap.xml template file.
Thus, you no longer need to add a custom sitemap.xml layout to your website’s layouts folder. Simply adding the following configuration to the frontmatter of the files you want to exclude from the sitemap file has the same effect:
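A sketch of that frontmatter, assuming the page uses TOML. The title is a hypothetical placeholder, and the disable key under sitemap is the one Hugo documents for versions after v0.124.1:

```toml
+++
# Placeholder title for illustration
title = "About me"

# Excludes this page from the generated sitemap.xml
[sitemap]
  disable = true
+++
```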
❯ Conclusion
I explained here what a sitemap is and how it works, in particular for websites using Hugo, and which configuration options Hugo offers. I then focused on how to exclude paths from the sitemap, depending on the Hugo version used. I hope this guide will be useful!