2023-05-25

Specifying the Output Location of Sitemap in next-sitemap

Introduction

By default, next-sitemap generates sitemap files directly in the public/ directory, which means the sitemap is served from the site root at https://<your website name>/sitemap.xml.
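
For reference, a minimal configuration with no outDir specified looks something like the sketch below (example.com is a placeholder siteUrl); the generated files land directly in public/:

next-sitemap.config.js
/** @type {import('next-sitemap').IConfig} */
module.exports = {
  // siteUrl is required; with no outDir, next-sitemap writes sitemap.xml
  // (plus sitemap-0.xml and, if enabled, robots.txt) into ./public
  siteUrl: 'https://example.com',
  generateRobotsTxt: true,
};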

Sitemap Privacy

A sitemap can serve as a treasure map for web scrapers, since it provides a comprehensive list of the links within a website. Once a scraper finds the sitemap, it can reach almost every page of the site.

Ideally, the sitemap's location should be known only to search engines such as Google. Once Google recognizes your sitemap, your site also gains visibility on other search engines such as Yahoo! Japan and Bing, which improves its SEO. There is no need to reveal the sitemap's location to scrapers. Left at the default, the sitemap at /sitemap.xml is an open invitation for endless crawling by web scrapers, which is why it is worth changing the default storage location so the sitemap is not so easily discovered.

Changing the Sitemap Directory

The next-sitemap library lets you change where the sitemap is saved by specifying outDir in the next-sitemap.config.js file, giving you the flexibility to store the sitemap in a directory of your choosing.

To illustrate, you can specify outDir in next-sitemap.config.js as shown below. This directs the output of sitemap.xml and robots.txt to public/my-dir.

next-sitemap.config.js
 /** @type {import('next-sitemap').IConfig} */
 module.exports = {
   siteUrl: 'https://io.traffine.com/',
   generateRobotsTxt: true,
   sitemapSize: 7000,
+  outDir: './public/my-dir'
 };
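
With this configuration, a production build should leave the generated files under public/my-dir, roughly like this (next-sitemap writes sitemap.xml as an index file and sitemap-0.xml with the actual page URLs):

public/my-dir/
├── robots.txt
├── sitemap.xml      (sitemap index pointing to sitemap-0.xml)
└── sitemap-0.xml    (the actual list of page URLs)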

Incorrect Paths in robots.txt and sitemap.xml

The issue is that the generated robots.txt and sitemap.xml do not point to the directory specified in outDir. Let's look at the generated entries:

public/my-dir/robots.txt
# *
User-agent: *
Allow: /

# Host
Host: https://io.traffine.com

# Sitemaps
Sitemap: https://io.traffine.com/sitemap.xml # Here is the problem
public/my-dir/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://io.traffine.com/sitemap-0.xml</loc></sitemap>  <!-- Here is the problem -->
</sitemapindex>

As you can see, both files point to URLs as if the sitemap had been generated directly in public/.

Instead, both files should look like this:

robots.txt
  # *
  User-agent: *
  Allow: /

  # Host
  Host: https://io.traffine.com

  # Sitemaps
- Sitemap: https://io.traffine.com/sitemap.xml # Here is the problem
+ Sitemap: https://io.traffine.com/my-dir/sitemap.xml
sitemap.xml
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
- <sitemap><loc>https://io.traffine.com/sitemap-0.xml</loc></sitemap>  <!-- Here is the problem -->
+ <sitemap><loc>https://io.traffine.com/my-dir/sitemap-0.xml</loc></sitemap>
  </sitemapindex>

Solution to the Problem

To correct the paths in robots.txt and sitemap.xml, we can create a small JavaScript script that rewrites them after the sitemap is generated.

First, prepare a JavaScript file named sitemap-replace.js. The script reads the generated robots.txt and sitemap.xml, replaces the incorrect paths with the correct ones, and writes the corrected contents back to the files:

sitemap-replace.js
const replaceSitemap = async (fileName) => {
  const fs = require('fs/promises')
  const appRoot = require('app-root-path') // resolves the project root directory
  const subDirectory = 'my-dir' // Change this to your outDir subdirectory
  const filePath = `${appRoot}/public/${subDirectory}/${fileName}`

  // Read the generated file and prefix the sitemap URLs with the subdirectory
  const original = await fs.readFile(filePath, 'utf8')
  const replacedData = original.replace(
    /https:\/\/io\.traffine\.com\/sitemap/g, // Change this to your siteUrl
    `https://io.traffine.com/${subDirectory}/sitemap` // Change this to your siteUrl
  )

  await fs.writeFile(filePath, replacedData, 'utf8')
}

// Rewrite both generated files
;(async () => {
  await replaceSitemap('robots.txt')
  await replaceSitemap('sitemap.xml')
})()
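
Note that the script relies on the app-root-path package to resolve the project root. If it is not already part of your project, it can be added as a dev dependency:

npm install --save-dev app-root-path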

Next, adjust the postbuild command in package.json to run this script after the next-sitemap command. This ensures that the script runs every time you build your project, keeping your sitemap paths correctly pointed to your specified directory.

package.json
 {
 ...

   "build": "next build",
-  "postbuild": "next-sitemap --config next-sitemap.config.js"
+  "postbuild": "next-sitemap --config next-sitemap.config.js && node sitemap-replace.js"

 ...
 }

After executing this script, the contents of the robots.txt and sitemap.xml files are correctly updated to point to the new directory. The updated files will look like this:

public/my-dir/robots.txt
# *
User-agent: *
Allow: /

# Host
Host: https://io.traffine.com

# Sitemaps
Sitemap: https://io.traffine.com/my-dir/sitemap.xml
public/my-dir/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://io.traffine.com/my-dir/sitemap-0.xml</loc></sitemap>
</sitemapindex>
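
After deploying, you can check that the files are served from the new location, for example:

curl https://io.traffine.com/my-dir/robots.txt
curl https://io.traffine.com/my-dir/sitemap.xml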

With this solution, you can keep your sitemap's location private while ensuring it remains correctly accessible to search engines.

References

https://github.com/iamvishnusankar/next-sitemap
