Introduction
By default, next-sitemap generates sitemap files directly in the public/ directory, so the sitemap ends up at the standard location https://<your website name>/sitemap.xml.
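For reference, a minimal next-sitemap.config.js might look like the sketch below (the domain is a placeholder, not from this article); with no outDir set, the generated files land in ./public:

/** @type {import('next-sitemap').IConfig} */
module.exports = {
  siteUrl: 'https://example.com', // placeholder; use your own domain
  generateRobotsTxt: true,
  // No outDir is set, so sitemap.xml and robots.txt are written
  // to ./public and served from the site root
}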
Sitemap Privacy
A sitemap can serve as a treasure map for web scrapers, since it provides a comprehensive list of links within a website. Once a scraper finds the sitemap, it can reach almost every page of the site.
Ideally, the sitemap's location should be known only to search engines such as Google. Once Google recognizes your sitemap, visibility on other search engines such as Yahoo! Japan and Bing improves as well, which helps your site's SEO. There is no need to reveal the location to scrapers: left at the default, the sitemap at /sitemap.xml is an open invitation to endless crawling. This is why it is worth changing the default sitemap location, so that it is not trivially discoverable.
Changing the Sitemap Directory
With the next-sitemap library, you can change where the sitemap is saved by specifying outDir in the next-sitemap.config.js file. This gives you the flexibility to store the sitemap at a location of your choosing.
For example, specifying outDir as shown below directs the output of sitemap.xml and robots.txt to public/my-dir.
/** @type {import('next-sitemap').IConfig} */
module.exports = {
  siteUrl: 'https://io.traffine.com/',
  generateRobotsTxt: true,
  sitemapSize: 7000,
+ outDir: './public/my-dir'
};
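After the next build, public/my-dir should contain robots.txt, sitemap.xml (the sitemap index), and chunk files such as sitemap-0.xml; the sitemapSize option splits large sitemaps into chunks of at most 7,000 URLs each.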
Incorrect Paths in robots.txt and sitemap.xml
An issue users run into is that robots.txt and sitemap.xml do not point to the directory specified in outDir. Here are the generated entries:
# *
User-agent: *
Allow: /
# Host
Host: https://io.traffine.com
# Sitemaps
Sitemap: https://io.traffine.com/sitemap.xml # Here is the problem
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://io.traffine.com/sitemap-0.xml</loc></sitemap> <!-- Here is the problem -->
</sitemapindex>
As you can see, both files point to paths as if the output had been generated in /public.
Both files should instead look like this:
# *
User-agent: *
Allow: /
# Host
Host: https://io.traffine.com
# Sitemaps
- Sitemap: https://io.traffine.com/sitemap.xml # Here is the problem
+ Sitemap: https://io.traffine.com/my-dir/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
- <sitemap><loc>https://io.traffine.com/sitemap-0.xml</loc></sitemap> <!-- Here is the problem -->
+ <sitemap><loc>https://io.traffine.com/my-dir/sitemap-0.xml</loc></sitemap>
</sitemapindex>
Solution to the Problem
To correct the paths in robots.txt and sitemap.xml, we can create a small script that post-processes the generated files.
First, prepare a JavaScript file named sitemap-replace.js. This script reads the generated robots.txt and sitemap.xml files, replaces the incorrect paths with the correct ones, and writes the corrected content back:
const fs = require('fs/promises')
const appRoot = require('app-root-path')

const replaceSitemap = async (fileName) => {
  const subDirectory = 'my-dir' // Change this to match your outDir
  const filePath = `${appRoot}/public/${subDirectory}/${fileName}`
  const original = await fs.readFile(filePath, 'utf8')
  // Rewrite every sitemap URL so that it points into the subdirectory
  const replacedData = original.replace(
    /https:\/\/io\.traffine\.com\/sitemap/g, // Change this to your domain
    `https://io.traffine.com/${subDirectory}/sitemap` // Change this too
  )
  await fs.writeFile(filePath, replacedData, 'utf8')
}

;(async () => {
  await replaceSitemap('robots.txt')
  await replaceSitemap('sitemap.xml')
})()
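If you would rather not hardcode the domain and subdirectory in two places, one possible variant derives both values from next-sitemap.config.js itself. This is a sketch: the escapeRegExp helper and the derivation logic below are our own additions, not part of next-sitemap.

const fs = require('fs/promises')
const appRoot = require('app-root-path')
// Load the same config that next-sitemap uses
const config = require('./next-sitemap.config.js')

const siteUrl = config.siteUrl.replace(/\/$/, '') // 'https://io.traffine.com'
const subDirectory = config.outDir.split('/').pop() // 'my-dir'

// Escape regex metacharacters so the URL can be used in a RegExp
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

const replaceSitemap = async (fileName) => {
  const filePath = `${appRoot}/public/${subDirectory}/${fileName}`
  const original = await fs.readFile(filePath, 'utf8')
  const replacedData = original.replace(
    new RegExp(`${escapeRegExp(siteUrl)}/sitemap`, 'g'),
    `${siteUrl}/${subDirectory}/sitemap`
  )
  await fs.writeFile(filePath, replacedData, 'utf8')
}

;(async () => {
  await replaceSitemap('robots.txt')
  await replaceSitemap('sitemap.xml')
})()

With this variant, changing siteUrl or outDir in the config is enough; the script picks up the new values automatically.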
Next, adjust the postbuild command in package.json to run this script after the next-sitemap command. This way the script runs on every build, keeping the sitemap paths pointed at your chosen directory.
{
  ...
  "build": "next build",
- "postbuild": "next-sitemap --config next-sitemap.config.js"
+ "postbuild": "next-sitemap --config next-sitemap.config.js && node sitemap-replace.js"
  ...
}
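Because postbuild runs automatically after next build, a plain npm run build now generates the sitemap and rewrites the paths in a single pass; you can also run node sitemap-replace.js by hand if you need to re-apply the fix.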
After the build, the contents of robots.txt and sitemap.xml correctly point to the new directory. The updated files look like this:
# *
User-agent: *
Allow: /
# Host
Host: https://io.traffine.com
# Sitemaps
Sitemap: https://io.traffine.com/my-dir/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://io.traffine.com/my-dir/sitemap-0.xml</loc></sitemap>
</sitemapindex>
With this solution, you can keep your sitemap's location private while ensuring it remains correctly accessible to search engines.
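Since crawlers that only check the default /sitemap.xml path will no longer find the sitemap, remember to tell the search engines where it now lives, for example by submitting https://io.traffine.com/my-dir/sitemap.xml in Google Search Console.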