Google-friendly sitemaps for multilingual Wagtail sites

The recently launched Girls Not Brides project, which the B Team and I completed last month, is a multilingual site, currently with English, French, and Spanish versions of the content.

As per the majority of my Wagtail builds, the sitemap for the site was implemented using the standard Wagtail Sitemap generator, with a little customisation on some page types that we didn't want to be visible in the index.

After submitting the initial sitemap.xml to Google Search Console, we realised that the out-of-the-box sitemap format wasn't correct for Google's exacting standards.

What's the problem?

The default Wagtail sitemap format outputs an XML <loc> node for each page in the site (excluding any that are purposefully hidden or private), and as the Girls Not Brides site has an actual page per language, each of these pages was included in the output.

According to Google, that's a no-no. The only <loc> nodes in the XML should be for the user's current language, and there should be related <xhtml:link> nodes for the related pages in the other languages (including the current language).

Here's the example from Google's documentation:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/page.html</loc>
    <xhtml:link 
               rel="alternate"
               hreflang="de"
               href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="de-ch"
               href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="en"
               href="http://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>http://www.example.com/deutsch/page.html</loc>
    <xhtml:link 
               rel="alternate"
               hreflang="de"
               href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="de-ch"
               href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="en"
               href="http://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>http://www.example.com/schweiz-deutsch/page.html</loc>
    <xhtml:link 
               rel="alternate"
               hreflang="de"
               href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="de-ch"
               href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link 
               rel="alternate"
               hreflang="en"
               href="http://www.example.com/english/page.html"/>
  </url>
</urlset>
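
To make the shape of that example concrete, here's a minimal sketch in plain Python (standard library only, nothing Wagtail-specific) that builds the same structure from a dictionary of translations. The build_urlset function and the translations dictionary are names made up for illustration; the point is simply that every <url> repeats the full set of alternates, itself included.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Register prefixes so the output uses a default namespace and "xhtml:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_urlset(translations):
    """Build a <urlset> with one <url> per translation.

    `translations` is a hypothetical dict of hreflang code -> absolute URL.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for location in translations.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = location
        # Every <url> lists all alternates, including the page itself.
        for hreflang, href in translations.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": hreflang, "href": href},
            )
    return urlset


translations = {
    "de": "http://www.example.com/deutsch/page.html",
    "de-ch": "http://www.example.com/schweiz-deutsch/page.html",
    "en": "http://www.example.com/english/page.html",
}
sitemap_xml = ET.tostring(build_urlset(translations), encoding="unicode")
print(sitemap_xml)
```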

Inheritance to the rescue

As all of our page types in the site descend from a common ancestor (an enhanced Wagtail Page), it was pretty simple to customise the output of the sitemap.xml to fit Google's required format.

First, we needed to modify the return value of the native Wagtail Page get_sitemap_urls method:

from wagtail.models import Page
from translation.utils import get_translated_sitemap_urls

class ExtendedPage(Page):
    class Meta:
        abstract = True

    def get_sitemap_urls(self, request):
        page = self.localized
        default_data = [{
            'location': page.full_url,
            'lastmod': (page.last_published_at or page.latest_revision_created_at),
        }]
        return get_translated_sitemap_urls(page, request, default_data)

To keep things nice and modular, this was added to our abstract ExtendedPage base class, which all of our page types inherit from, and the translation work is done in a utility function:

# Amend these to the domains your translated site uses
TRANSLATED_DOMAINS = {
    'en': 'www.girlsnotbrides.org',
    'fr': 'www.fillespasepouses.org',
    'es': 'www.girlsnotbrides.es',
}

def get_translatable_page_siblings(instance, only_live=True, include_self=False):

    translations = instance.get_translations(inclusive=include_self)

    if only_live:
        translations = translations.live()

    return translations

def get_translated_sitemap_urls(page, request, default_data):

    domains = TRANSLATED_DOMAINS
    host = request.get_host()
    domain = None
    lang = None

    for k, v in domains.items():
        if v == host:
            domain = v
            lang = k

    if domain and lang:

        # return empty array if this page isn't the current language
        if page.locale.language_code != lang:
            return []

        # get default sitemap data for page, return if no data
        # (the None default avoids StopIteration on an empty list)
        data = next(iter(default_data), None)
        if not data:
            return []

        # get self and siblings, add to data and return
        siblings = get_translatable_page_siblings(page, only_live=True, include_self=True).select_related('locale')
        data['alternates'] = siblings

        return [data]

    # return an empty array if none of the above
    return []

We're using translated domains for Girls Not Brides, but the principle would be the same if using path prefixes such as /en/, /es/, /fr/ etc.
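
As a rough illustration of the path-prefix variant, the language lookup could key on the first path segment rather than the host. This is only a sketch: LANGUAGE_PREFIXES and language_from_path are hypothetical names, not part of Wagtail or of our project code, and it assumes the sitemap is served under each prefix (for example /fr/sitemap.xml).

```python
# Hypothetical set of language prefixes; amend to the locales your site uses.
LANGUAGE_PREFIXES = {"en", "fr", "es"}


def language_from_path(path):
    """Return the language code from the first path segment, or None if unknown."""
    segments = [segment for segment in path.split("/") if segment]
    if segments and segments[0] in LANGUAGE_PREFIXES:
        return segments[0]
    return None


print(language_from_path("/fr/sitemap.xml"))  # fr
print(language_from_path("/sitemap.xml"))     # None
```

The result would then stand in for the lang value that the domain loop works out from request.get_host().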

It's important to pass in default_data here, which mirrors what the super().get_sitemap_urls(request) call returns. We don't want to expose pages that shouldn't be in the sitemap, so if this is empty we should also return an empty list, as per the documentation.

The rest of the function is pretty simple: once we've checked that the page in question is in the current language, we get the translated page siblings (including the page itself) and add them to the entry as an extra alternates list.

If this page isn't the current language, we return an empty list to exclude it from the sitemap.
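
To see that decision in isolation, here's a small sketch that mimics it with plain Python objects instead of Wagtail models. StubPage and sitemap_entries are hypothetical stand-ins, and the page URLs are made up; only the domain mapping comes from the code above.

```python
from dataclasses import dataclass, field

TRANSLATED_DOMAINS = {
    "en": "www.girlsnotbrides.org",
    "fr": "www.fillespasepouses.org",
    "es": "www.girlsnotbrides.es",
}


@dataclass
class StubPage:
    """Hypothetical stand-in for a Wagtail page; not a real model."""
    language_code: str
    full_url: str
    translations: list = field(default_factory=list)


def sitemap_entries(page, host):
    """Emit an entry only when the page's language matches the requested host."""
    lang = next(
        (code for code, domain in TRANSLATED_DOMAINS.items() if domain == host),
        None,
    )
    if lang is None or page.language_code != lang:
        return []
    # The page itself is listed alongside its translations.
    return [{"location": page.full_url, "alternates": [page, *page.translations]}]


# Made-up example pages that are translations of each other.
en = StubPage("en", "https://www.girlsnotbrides.org/about/")
fr = StubPage("fr", "https://www.fillespasepouses.org/a-propos/")
en.translations = [fr]
fr.translations = [en]

print(len(sitemap_entries(en, "www.girlsnotbrides.org")))  # 1
print(sitemap_entries(fr, "www.girlsnotbrides.org"))       # []
```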

Formatting the output

So far so good, but we still need to match the output as defined by the Google documentation.

This again is pretty simple: we just override the output XML template.

Full disclosure tangent

It took me a little longer than I'd expected to do this, as it's not at all clear from the Wagtail or Django documentation what the path of this template is.

After a little trial and error (as my good friend Rob Lowe once said, "I've built my whole career on trial and error"), I determined that the path is simply the root of any included templates folder, e.g. templates/sitemap.xml.
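
For reference, a settings fragment along these lines is what makes a project-level templates folder count as "included". It's a sketch under assumptions: the BASE_DIR value is a hypothetical path, and your project may already define both settings differently.

```python
from pathlib import Path

# Hypothetical project root; real projects usually derive this from __file__.
BASE_DIR = Path("/srv/myproject")

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # Folders in DIRS are searched before app template folders, so
        # <BASE_DIR>/templates/sitemap.xml overrides the default template.
        "DIRS": [BASE_DIR / "templates"],
        "APP_DIRS": True,
    }
]
```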

Now all we need to do is modify this template to output our extra alternates list:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
{% spaceless %}
    {% for url in urlset %}
        <url>
            <loc>{{ url.location }}</loc>
            {% if url.lastmod %}
                <lastmod>{{ url.lastmod|date:'Y-m-d' }}</lastmod>
            {% endif %}
            {% for item in url.alternates %}
                <xhtml:link rel="alternate" hreflang="{{ item.locale.language_code }}" href="{{ item.full_url }}"/>
            {% endfor %}
        </url>
    {% endfor %}
{% endspaceless %}
</urlset>

You can view the results by visiting the Girls Not Brides sitemap.
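
If you'd like to sanity-check your own output before resubmitting it, a short script like the one below can confirm that each <url> lists itself among its alternates. It's a sketch only: missing_self_references is a hypothetical helper and the sample XML is made up, with the second entry deliberately missing its self-reference.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def missing_self_references(sitemap_xml):
    """Return the <loc> values whose alternates do not include the page itself."""
    root = ET.fromstring(sitemap_xml.strip())
    missing = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        hrefs = {link.get("href") for link in url.findall("xhtml:link", NS)}
        if loc not in hrefs:
            missing.append(loc)
    return missing


# Made-up sample data for illustration.
sample = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/page/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/page/"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page/"/>
  </url>
  <url>
    <loc>https://example.com/en/other/</loc>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/autre/"/>
  </url>
</urlset>
"""

print(missing_self_references(sample))  # ['https://example.com/en/other/']
```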

Give me the code!

Here's a Wagtail 3.x compatible GitHub gist containing all the required code and implementation notes.