The recently launched Girls Not Brides project, which the B Team and I completed last month, is a multilingual site, currently with English, French, and Spanish versions of the content.
As with most of my Wagtail builds, the sitemap for the site was implemented using the standard Wagtail sitemap generator, with a little customisation for some page types that we didn't want to appear in the index.
After submitting the initial sitemap.xml to Google Search Console, we realised that the out-of-the-box sitemap format wasn't correct for Google's exacting standards.
What's the problem?
The default Wagtail sitemap format outputs an XML <loc> node for each page in the site (excluding any that are purposefully hidden or private), and as the Girls Not Brides site has an actual page per language, each of these pages was included in the output.
According to Google, that's a no-no. The only <loc> nodes in the XML should be for the user's current language, and there should be related <xhtml:link> nodes for the related pages in the other languages (including the current language).
Here's the example from Google's documentation:
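<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/page.html</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>http://www.example.com/deutsch/page.html</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>http://www.example.com/schweiz-deutsch/page.html</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/page.html"/>
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/page.html"/>
  </url>
</urlset>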
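To make the target format concrete, here is a minimal standalone sketch that builds one such entry with Python's standard library. It is not part of the Wagtail implementation; the helper name build_urlset is made up for illustration, and the URLs are the ones from Google's example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Register prefixes so the serialised XML uses <urlset> and <xhtml:link>.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_urlset(entries):
    """Build a <urlset> from (location, {hreflang: href}) pairs.

    Hypothetical helper, for illustration only.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for location, alternates in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = location
        # One <xhtml:link> per language version, including the page itself.
        for hreflang, href in alternates.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": hreflang, "href": href},
            )
    return urlset


alternates = {
    "de": "http://www.example.com/deutsch/page.html",
    "de-ch": "http://www.example.com/schweiz-deutsch/page.html",
    "en": "http://www.example.com/english/page.html",
}
urlset = build_urlset([("http://www.example.com/english/page.html", alternates)])
xml_text = ET.tostring(urlset, encoding="unicode")
print(xml_text)
```

Each entry carries a single <loc> plus one <xhtml:link> per language version, which is the shape the rest of this post works towards.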
Inheritance to the rescue
As all of our page types in the site descend from a common ancestor (an enhanced Wagtail Page), it was pretty simple to customise the output of the sitemap.xml to fit Google's required format.
First, we needed to modify the return value of the native Wagtail Page get_sitemap_urls method:
from wagtail.models import Page

from translation.utils import get_translated_sitemap_urls


class ExtendedPage(Page):
    class Meta:
        abstract = True

    def get_sitemap_urls(self, request):
        page = self.localized
        default_data = [{
            'location': page.full_url,
            'lastmod': (page.last_published_at or page.latest_revision_created_at),
        }]
        return get_translated_sitemap_urls(page, request, default_data)
To keep things nice and modular, this was added to our abstract ExtendedPage base class, which is used for all of our page types, and the translation work is done in a utility function:
# Amend these to the domains your translated site uses
TRANSLATED_DOMAINS = {
    'en': 'www.girlsnotbrides.org',
    'fr': 'www.fillespasepouses.org',
    'es': 'www.girlsnotbrides.es',
}


def get_translatable_page_siblings(instance, only_live=True, include_self=False):
    translations = instance.get_translations(inclusive=include_self)
    if only_live:
        translations = translations.live()
    return translations


def get_translated_sitemap_urls(page, request, default_data):
    domains = TRANSLATED_DOMAINS
    host = request.get_host()
    domain = None
    lang = None

    # work out which language the requested domain serves
    for k, v in domains.items():
        if v == host:
            domain = v
            lang = k

    if domain and lang:
        # return empty array if this page isn't the current language
        if page.locale.language_code != lang:
            return []

        # get default sitemap data for page, return if no data
        # (the None default avoids a StopIteration on an empty list)
        data = next(iter(default_data), None)
        if not data:
            return []

        # get self and siblings, add to data and return
        siblings = get_translatable_page_siblings(page, only_live=True, include_self=True).select_related('locale')
        data['alternates'] = siblings
        return [data]

    # return an empty array if none of the above
    return []
We're using translated domains for Girls Not Brides, but the principle would be the same if using path prefixes such as /en/, /es/, /fr/ etc.
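As a rough illustration of the path-prefix variant, the language could be resolved from the request path instead of the host. This is only a sketch under assumptions: language_from_path and SUPPORTED_LANGUAGES are hypothetical names, and it presumes the sitemap is requested under a language prefix such as /fr/sitemap.xml, which depends on how your URLs are configured.

```python
# Assumed to match the locales configured for the site.
SUPPORTED_LANGUAGES = ("en", "fr", "es")


def language_from_path(path, default="en"):
    """Return the language code found in the first path segment.

    Falls back to `default` when the path has no recognised prefix.
    Hypothetical helper, not part of Wagtail or Django.
    """
    first_segment = path.strip("/").split("/", 1)[0]
    if first_segment in SUPPORTED_LANGUAGES:
        return first_segment
    return default


print(language_from_path("/fr/sitemap.xml"))
print(language_from_path("/sitemap.xml"))
```

The returned code would then take the place of the lang value that the domain lookup produces.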
It's important to pass in the default_data from the super().get_sitemap_urls(request) call here - we don't want to expose pages that shouldn't be in the sitemap, so if this is empty we should also return an empty list as per the documentation.
The rest of the function is pretty simple - once we've checked that the page in question is in the current language, we get the translated page siblings (including the page itself), and add them to the data as an extra alternates list.
If this page isn't the current language, we return an empty list to exclude it from the sitemap.
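The filtering decision can be tried out without a Wagtail project. The sketch below mirrors the logic of the utility function using plain Python stand-ins; FakePage, language_for_host, sitemap_entries and the page URLs are all hypothetical, invented purely for this illustration.

```python
from dataclasses import dataclass

# Same mapping as in the utility module above.
TRANSLATED_DOMAINS = {
    "en": "www.girlsnotbrides.org",
    "fr": "www.fillespasepouses.org",
    "es": "www.girlsnotbrides.es",
}


@dataclass
class FakePage:
    """Hypothetical stand-in for a Wagtail page: a language and a URL."""
    language_code: str
    full_url: str


def language_for_host(host):
    """Return the language served on `host`, or None for an unknown domain."""
    for lang, domain in TRANSLATED_DOMAINS.items():
        if domain == host:
            return lang
    return None


def sitemap_entries(page, translations, host):
    """Mirror the decision made in get_translated_sitemap_urls."""
    lang = language_for_host(host)
    if lang is None or page.language_code != lang:
        # Unknown domain, or the page belongs in another language's sitemap.
        return []
    return [{"location": page.full_url, "alternates": [page, *translations]}]


# Made-up example pages.
en = FakePage("en", "https://www.girlsnotbrides.org/about/")
fr = FakePage("fr", "https://www.fillespasepouses.org/a-propos/")

english_sitemap = sitemap_entries(en, [fr], "www.girlsnotbrides.org")
french_page_on_english_host = sitemap_entries(fr, [en], "www.girlsnotbrides.org")
print(english_sitemap)
print(french_page_on_english_host)
```

On the English domain the English page yields one entry listing both language versions, while the French page yields nothing and so only appears in the French domain's sitemap.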
Formatting the output
So far so good, but we still need to match the output as defined by the Google documentation.
This again is pretty simple: we just override the output XML template.
Full disclosure tangent
It took me a little longer than I'd expected to do this, as neither the Wagtail nor the Django documentation makes it clear what the path of this template is.
After a little trial and error (as my good friend Rob Lowe once said, "I've built my whole career on trial and error"), I determined that the path is simply the root of any included templates folder, e.g. templates/sitemap.xml.
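For reference, one common way to have a project-level templates folder picked up is via the DIRS option in the Django settings. This is a sketch that assumes a layout of <project root>/templates/sitemap.xml; your settings module will differ, and BASE_DIR here is only a placeholder.

```python
from pathlib import Path

# Placeholder: Django projects usually derive BASE_DIR from the settings
# file's own location rather than the working directory.
BASE_DIR = Path.cwd()

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # Folders listed here are searched before app template folders,
        # so templates/sitemap.xml is found ahead of any bundled default.
        "DIRS": [BASE_DIR / "templates"],
        "APP_DIRS": True,
        "OPTIONS": {"context_processors": []},  # add your usual processors
    }
]
```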
Now all we need to do is modify this template to output our extra alternates list:
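<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
{% spaceless %}
{% for url in urlset %}
  <url>
    <loc>{{ url.location }}</loc>
    {% if url.lastmod %}
    <lastmod>{{ url.lastmod|date:'Y-m-d' }}</lastmod>
    {% endif %}
    {# Assumed attribute values: each alternate is a translated page with its locale selected #}
    {% for item in url.alternates %}
    <xhtml:link rel="alternate" hreflang="{{ item.locale.language_code }}" href="{{ item.full_url }}"/>
    {% endfor %}
  </url>
{% endfor %}
{% endspaceless %}
</urlset>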
You can view the results by visiting the Girls Not Brides sitemap.
Give me the code!
Here's a Wagtail 3.x compatible GitHub gist containing all the required code and implementation notes.