Generate an XML sitemap for a GitHub Pages site using GitHub Actions

Home Page: https://actions.cicirello.org/generate-sitemap/

License: MIT License

Dockerfile 0.39% Python 99.61%
actions sitemap-generator workflows github-pages github-actions sitemap xml lastmod robots robots-exclusion-protocol

generate-sitemap

cicirello/generate-sitemap - Generate XML sitemaps for static websites in GitHub Actions

Check out all of our GitHub Actions: https://actions.cicirello.org/

About

The generate-sitemap GitHub action generates a sitemap for a website hosted on GitHub Pages, and has the following features:

  • Support for both xml and txt sitemaps (you choose using one of the action's inputs).
  • When generating an xml sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date (however, we recommend if possible committing newly created files first).
  • Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types (defaults include both html and pdf files in the sitemap).
  • Now also supports including URLs for a user specified list of additional file extensions in the sitemap.
  • Checks content of html files for <meta name="robots" content="noindex"> directives, excluding any that do from the sitemap.
  • Parses a robots.txt, if present at the root of the website, excluding any URLs from the sitemap that match Disallow: rules for User-agent: *.
  • Enables specifying a list of directories and/or specific files to exclude from the sitemap.
  • Sorts the sitemap entries in a consistent order, such that the URLs are first sorted by depth in the directory structure (i.e., pages at the website root appear first, etc), and then pages at the same depth are sorted alphabetically.
  • For files named index.html, it assumes that the preferred URL for the page ends with the enclosing directory, leaving out the index.html. For example, instead of https://WEBSITE/PATH/index.html, the sitemap will contain https://WEBSITE/PATH/ in such a case.
  • Provides option to exclude .html extension from URLs listed in sitemap.
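
The ordering described in the feature list above (depth first, then alphabetical) can be sketched as a sort key. This is a minimal illustration, not the action's actual code; the function name `sitemap_sort_key` and the way depth is counted (number of `/` characters after the scheme) are assumptions.

```python
def sitemap_sort_key(url):
    """Sort key: directory depth first, then alphabetical order."""
    # Drop the scheme so that the "//" in "https://" is not counted as depth.
    path = url.split("://", 1)[-1]
    # Assumption: depth is approximated by the number of "/" separators.
    return (path.count("/"), path)


def sort_sitemap_urls(urls):
    """Return the URLs with pages nearest the site root first."""
    return sorted(urls, key=sitemap_sort_key)
```

With this key, `https://example.com/z.html` sorts before `https://example.com/docs/a.html` because it is closer to the root, even though `z` comes later alphabetically.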

The generate-sitemap GitHub action is designed to be used in combination with other GitHub Actions. For example, it does not commit and push the generated sitemap. See the Examples section for workflows that combine it with other actions.

The generate-sitemap action is for GitHub Pages sites, such that the repository contains the html, etc of the site itself, regardless of whether or not the html was generated by a static site generator or written by hand. For example, I use it for multiple Java project documentation sites, where most of the site is generated by javadoc. I also use it with my personal website, which is generated with a custom static site generator. As long as the repository for the GitHub Pages site contains the site as served (e.g., html files, pdf files, etc), the generate-sitemap action is applicable.

The generate-sitemap action is not for GitHub Pages Jekyll sites (unless you generate the site locally and push the html output instead of the markdown, but why would you do that?). In the case of a GitHub Pages Jekyll site, the repository contains markdown, and not the html that is generated from the markdown. The generate-sitemap action does not support that use-case. If you are looking to generate a sitemap for a Jekyll website, there is a Jekyll plugin for that.

Requirements

This action relies on actions/checkout with fetch-depth: 0. Setting the fetch-depth to 0 for the checkout action ensures that the generate-sitemap action has access to the full commit history, which is used for generating the <lastmod> tags in the sitemap.xml file. If you instead use the default fetch depth when applying the checkout action, the <lastmod> tags will be incorrect. So be sure to include the following as a step in your workflow:

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0 
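
To see why the full history matters, here is a rough sketch of how a last-commit date can be looked up with `git log`, with the fallback to the current date for files that have not been committed yet. It is an illustration under assumptions, not the action's actual implementation: the function name `last_commit_date` and the injectable `run` parameter (there only so the function can be exercised without a repository) are hypothetical. In a shallow clone, `git log` only sees the commits that were fetched, so the date it reports for a file may not be the date of that file's real last change.

```python
import subprocess
from datetime import datetime, timezone


def last_commit_date(path, run=subprocess.run):
    """Return the ISO 8601 date of the last commit that touched path.

    Falls back to the current UTC time when git reports no commit,
    for example when the file was created during this workflow run.
    """
    result = run(
        # %cI is the committer date in strict ISO 8601 format.
        ["git", "log", "-1", "--format=%cI", path],
        capture_output=True,
        text=True,
        check=False,
    )
    stamp = result.stdout.strip()
    if stamp:
        return stamp
    return datetime.now(timezone.utc).isoformat(timespec="seconds")
```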

Inputs

path-to-root

The path to the root of the website relative to the root of the repository. Default . is appropriate in most cases, such as whenever the root of your Pages site is the root of the repository itself. If you are using this for a GitHub Pages site in the docs directory, such as for a documentation website, then just pass docs for this input.

base-url-path

This is the URL of your website. You must specify this for your sitemap to be meaningful. It defaults to https://web.address.of.your.nifty.website/ for demonstration purposes.

include-html

This flag determines whether html files are included in your sitemap (files with an extension of either .html or .htm). Default: true.

include-pdf

This flag determines whether pdf files are included in your sitemap. Default: true.

additional-extensions

If you want to include URLs to other document types, you can use the additional-extensions input to specify a list (separated by spaces) of file extensions. For example, Google (and other search engines) index a variety of other file types, including docx, doc, source code for various common programming languages, etc. Here is an example:

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        additional-extensions: doc docx ppt pptx
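
The interaction between include-html, include-pdf, and additional-extensions can be pictured as building one set of accepted extensions. The sketch below is an assumption about the general idea rather than the action's code; the helper names are hypothetical, and treating extensions case-insensitively is also an assumption.

```python
import os


def included_extensions(include_html=True, include_pdf=True, additional=""):
    """Build the set of file extensions that qualify for the sitemap."""
    extensions = set()
    if include_html:
        extensions.update({"html", "htm"})
    if include_pdf:
        extensions.add("pdf")
    # The additional-extensions input is a whitespace-separated list.
    extensions.update(ext.lower().lstrip(".") for ext in additional.split())
    return extensions


def is_included(filename, extensions):
    """Check whether a file's extension is in the accepted set."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext in extensions
```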

exclude-paths

The action will automatically exclude any files or directories based on a robots.txt file, if present. But if you have additional directories or individual files that you wish to exclude from the sitemap that are not otherwise blocked, you can use the exclude-paths input to specify a list of them, separated by any whitespace characters. For example, if you wish to exclude the directory /exclude-these as well as the individual file /nositemap.html, you can use the following:

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        exclude-paths: /exclude-these /nositemap.html

If you have many such cases to exclude, your workflow may be easier to read if you use a YAML multi-line string, with the following:

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        exclude-paths: >
          /exclude-these 
          /nositemap.html
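
A YAML folded string (`>`) joins its lines with spaces, so both forms above reach the action as one whitespace-separated string. A minimal sketch of how such a list could be matched against paths follows; the function name `is_excluded` and the prefix-matching rule are assumptions for illustration, not the action's actual logic.

```python
def is_excluded(url_path, exclude_paths):
    """Return True if url_path is an excluded file or lies in an excluded directory.

    exclude_paths is the raw input string, separated by any whitespace.
    """
    for raw in exclude_paths.split():
        prefix = raw.rstrip("/")
        # Match the exact file, or anything underneath the directory.
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return True
    return False
```

Comparing against `prefix + "/"` keeps a directory such as /exclude-these from also matching an unrelated sibling like /exclude-these-too.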

sitemap-format

Use this to specify the sitemap format. Default: xml. The sitemap.xml generated by the default will contain lastmod dates that are generated using the last commit dates of each file. Setting this input to anything other than xml will generate a plain text sitemap.txt simply listing the urls.
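
The difference between the two formats can be sketched as follows. This is a simplified illustration, not the action's output code: the function name `render_sitemap` and the exact layout of the XML are assumptions. The sketch escapes special characters such as & in URLs, which XML requires.

```python
from xml.sax.saxutils import escape


def render_sitemap(entries, fmt="xml"):
    """Render (url, lastmod) pairs as an XML or plain text sitemap."""
    if fmt != "xml":
        # Plain text sitemap: one URL per line, no lastmod information.
        return "\n".join(url for url, _ in entries) + "\n"
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, lastmod in entries:
        lines.append("<url>")
        # escape() replaces &, < and > with XML entities.
        lines.append(f"<loc>{escape(url)}</loc>")
        lines.append(f"<lastmod>{lastmod}</lastmod>")
        lines.append("</url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```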

drop-html-extension

The drop-html-extension input provides the option to exclude the .html extension from URLs listed in the sitemap. The default is drop-html-extension: false, which keeps the .html extension for pages whose filename has it. GitHub Pages automatically serves the corresponding html file if a URL has no file extension. For example, if a user of your site browses to https://WEBSITE/PATH/filename (with no extension), GitHub Pages automatically serves https://WEBSITE/PATH/filename.html if it exists. If you prefer to exclude the .html extension from the URLs in your sitemap, pass drop-html-extension: true to the action in your workflow. Note that you should also ensure that any canonical links that you list within the html files correspond to your choice here.
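
The URL rules described so far (index.html trimmed to its directory, and the optional removal of .html) can be sketched in one small function. This is an illustration under assumptions, not the action's code; the name `url_for` and its parameters are hypothetical.

```python
def url_for(file_path, base_url, drop_html_extension=False):
    """Map a file path relative to the site root to its sitemap URL."""
    # Normalize a leading "./" or "/" so the path is relative to the root.
    if file_path.startswith("./"):
        path = file_path[2:]
    else:
        path = file_path.lstrip("/")
    if path == "index.html" or path.endswith("/index.html"):
        # Prefer the enclosing directory over an explicit index.html.
        path = path[: -len("index.html")]
    elif drop_html_extension and path.endswith(".html"):
        path = path[: -len(".html")]
    return base_url.rstrip("/") + "/" + path
```

For example, `url_for("PATH/filename.html", "https://WEBSITE/", drop_html_extension=True)` yields `https://WEBSITE/PATH/filename`, while files with other extensions, such as pdf files, are left unchanged.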

date-only

The date-only input controls whether XML sitemaps include the full date and time in lastmod, or only the date. The default is date-only: false, which includes the full date and time in the lastmod fields. If you only want the date in the lastmod, then use date-only: true.
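
As a rough sketch of the two lastmod styles, assuming the timestamp is available as an ISO 8601 string with a numeric UTC offset (the function name `format_lastmod` is hypothetical and this is not the action's code):

```python
from datetime import datetime


def format_lastmod(iso_timestamp, date_only=False):
    """Format a lastmod value as a full date and time, or as the date alone."""
    stamp = datetime.fromisoformat(iso_timestamp)
    if date_only:
        # Date formatted as yyyy-mm-dd, dropping the time and offset.
        return stamp.date().isoformat()
    return stamp.isoformat()
```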

Outputs

sitemap-path

The generated sitemap is placed in the root of the website. This output is the path to the generated sitemap file relative to the root of the repository. If you didn't use the path-to-root input, then this output should simply be the name of the sitemap file (sitemap.xml or sitemap.txt).

url-count

This output provides the number of URLs in the sitemap.

excluded-count

This output provides the number of URLs excluded from the sitemap due to either <meta name="robots" content="noindex"> within html files, or due to exclusion from directives in a robots.txt file.
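
The robots.txt side of this exclusion can be approximated with Python's standard library. The sketch below uses `urllib.robotparser`, which is not necessarily how the action itself parses robots.txt; the function name `blocked_by_robots` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, url_path):
    """Return True if a Disallow rule for User-agent: * covers url_path."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", url_path)
```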

Examples

Basic Action Syntax

You can run the action with a step in your workflow like this:

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/

In the above example, the major release version was used, which ensures that you'll be using the latest patch level release, including any bug fixes, etc. If you prefer, you can also use a specific version such as with:

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@vX.Y.Z  # replace vX.Y.Z with a specific release tag
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/

Example 1: Minimal Example

In this example workflow, we use all of the default inputs except for the base-url-path input. The result will be a sitemap.xml file in the root of the repository. After completion, it then simply echoes the outputs.

name: Generate xml sitemap

on:
  push:
    branches: [ main ]

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0 

    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/

    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"

Example 2: Webpage for API Docs

This example workflow illustrates how you might use this to generate a sitemap for a Pages site in the docs directory of the repository. It also demonstrates excluding pdf files, and configuring a plain text sitemap.

name: Generate API sitemap

on:
  push:
    branches: [ main ]

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0 

    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
        path-to-root: docs
        include-pdf: false
        sitemap-format: txt

    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"

Example 3: Including Additional Indexable File Types

In this example workflow, we add various additional types to the sitemap using the additional-extensions input. Note that this also includes html files and pdf files, since the workflow is using the default values for include-html and include-pdf, which both default to true.

name: Generate xml sitemap

on:
  push:
    branches: [ main ]

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0 

    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
        additional-extensions: doc docx ppt pptx xls xlsx

    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"

Example 4: Combining With Other Actions

Presumably you want to do something with your sitemap once it is generated. In this example workflow, we combine it with the action peter-evans/create-pull-request. First, the cicirello/generate-sitemap action generates the sitemap. Then the peter-evans/create-pull-request action checks for changes and, if the sitemap changed, creates a pull request.

name: Generate xml sitemap

on:
  push:
    branches: [ main ]

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0 

    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/

    - name: Create Pull Request
      uses: peter-evans/create-pull-request@v3
      with:
        title: "Automated sitemap update"
        body: > 
          Sitemap updated by the [generate-sitemap](https://github.com/cicirello/generate-sitemap) 
          GitHub action. Automated pull-request generated by the 
          [create-pull-request](https://github.com/peter-evans/create-pull-request) GitHub action.

Real Examples From Projects Using the Action

Personal Website

This first real example is from the personal website of the developer. One of the workflows, sitemap-generation.yml, is strictly for generating the sitemap. It runs on pushes of either *.html or *.pdf files to the staging branch of this repository. After generating the sitemap, it uses peter-evans/create-pull-request to generate a pull request. You can also replace that step with a commit and push instead. You can find the resulting sitemap here: sitemap.xml.

Documentation Website for a Java Library

This next example is for the documentation website of the Chips-n-Salsa library. The docs.yml workflow runs on pushes and pull requests that involve *.java files. It uses Maven to run javadoc (e.g., with mvn javadoc:javadoc). It then copies the generated javadoc documentation to the docs directory, from which the API website is served. This is followed by another GitHub Action, cicirello/javadoc-cleanup, which makes a few edits to the javadoc generated website to improve mobile browsing.

Next, it commits any changes (without pushing yet) produced by javadoc and/or javadoc-cleanup. After performing those commits, it now runs the generate-sitemap action to generate the sitemap. It does this after committing the site changes so that the lastmod dates will be accurate. Finally, it uses peter-evans/create-pull-request to generate a pull request. You can also replace that step with a commit and push instead.

You can find the resulting sitemap here: sitemap.xml.

Support the Project

You can support the project in a number of ways:

  • Starring: If you find the generate-sitemap action useful, consider starring the repository.
  • Sharing with Others: Consider sharing it with others who you feel might find it useful.
  • Reporting Issues: If you find a bug or have a suggestion for a new feature, please report it via the Issue tracker.
  • Contributing Code: If there is an open issue that you think you can help with, submit a pull request.
  • Sponsoring: You can also consider becoming a sponsor.

License

The scripts and documentation for this GitHub action are released under the MIT License.

generate-sitemap's People

Contributors

cicirello, dependabot[bot], travisbrace, xbftw

generate-sitemap's Issues

Enable including urls of filetypes other than html and pdf in the sitemap

Is your feature request related to a problem? Please describe.
Google and other search engines index more than just content in html and pdf. If it can be indexed by a search engine, it may be desirable to include in the sitemap.

Describe the solution you'd like
An input to the action that can be used to specify the file extensions to include in the sitemap.

BUG: Fails to exclude html from sitemap if content="noindex" before name="robots"

Describe the bug
Fails to exclude html from sitemap if content="noindex" before name="robots".

To Reproduce
Steps to reproduce the behavior:

  1. Create an html file with <meta content="noindex" name="robots"> rather than <meta name="robots" content="noindex">.
  2. Run the action.
  3. Observe that it fails to exclude that html file from the sitemap.

Expected behavior
Order of name and content in a meta tag shouldn't matter.
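
One way to make the check independent of attribute order is to parse the tag instead of matching it with a fixed pattern. The sketch below uses the standard library's `html.parser`; it is an illustration of the idea, not the fix that was applied in the action, and the class and function names are hypothetical. Because `HTMLParser` lowercases tag and attribute names, and the values are lowercased here, it also tolerates uppercase markup.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots meta tag with a noindex directive was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; their order does not matter here.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            if "noindex" in attributes.get("content", "").lower():
                self.noindex = True


def has_noindex(html):
    """Return True if the html contains a robots noindex meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```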

Replace the usage of GitHub Action's deprecated set-output command

Describe the bug
GitHub Actions has deprecated the set-output workflow command, which we are currently using for workflow outputs of the action. See https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/. That same link indicates the replacement.

To Reproduce
Steps to reproduce the behavior:

  1. Run the action.
  2. Inspect the workflow run.
  3. Notice the deprecation warning.

Expected behavior
No deprecation warning.
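
The replacement mechanism is to append name=value lines to the file whose path GitHub Actions provides in the GITHUB_OUTPUT environment variable. A minimal sketch for single-line values follows; the helper name `set_output` and its optional `output_file` parameter (used here only to allow local testing) are hypothetical, and this is not necessarily the code adopted in the action.

```python
import os


def set_output(name, value, output_file=None):
    """Append a workflow output in the name=value form used by GITHUB_OUTPUT."""
    # On a runner, GITHUB_OUTPUT holds the path of the outputs file.
    path = output_file or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")
```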

generate-sitemap doesn't produce any output

As the title said, and I don't know why. I would be grateful if somebody could help me.
Here is my yml file:


name: CI

# Controls when the action will run. 
on:
  # Triggers the workflow on push or pull request events but only for the main branch
  push:
    branches: [ main , christmas]
  pull_request:
    branches: [ main ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0 

      - name: generate-sitemap
      # You may pin to the exact commit or the version.
      # uses: cicirello/generate-sitemap@2777ea0908c8365591fe1f1960ef898ca326511b
        uses: cicirello/generate-sitemap@vX.Y.Z  # replace vX.Y.Z with a specific release tag
        with:
          base-url-path: https://doktormugg.se/ # default is https://web.address.of.your.nifty.website/

Feature Req: [non UTF8 character error message console log]

Is your feature request related to a problem? Please describe.
If the sitemap generator finds a page with a non-UTF-8 encoded character, it throws an error and exits.

Describe the solution you'd like
Not sure what that would look like in Python, but it would be a nice feature if, before the script exits with the non-UTF-8 character error, it logged which page the error was found on.

Describe alternatives you've considered
For anyone having a similar issue --

I cloned my project locally and grepped through the project to find the character with this command here

grep -raxv --include=*.html '.*' ./
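
For anyone who prefers Python to grep, a rough equivalent is sketched below. It is a standalone helper, not part of the action, and the function name `find_non_utf8` is hypothetical. It reports each file that fails to decode as UTF-8 along with the byte offset of the first invalid byte.

```python
from pathlib import Path


def find_non_utf8(root, pattern="*.html"):
    """List (path, byte_offset) for files under root that are not valid UTF-8."""
    problems = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as error:
            # error.start is the offset of the first undecodable byte.
            problems.append((str(path), error.start))
    return problems
```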

Additional context
Thanks for the awesome github action it's been much appreciated.

Option for date-only in lastmods

Is your feature request related to a problem? Please describe.
Some may prefer only dates rather than full W3C Datetime.

Describe the solution you'd like
An input named date-only that defaults to false to avoid surprising existing users. When the user sets it to true, use only the date, formatted as yyyy-mm-dd, dropping the time.

Describe alternatives you've considered
This is related to a feature request in #57, where a user-configurable date format was requested. However, the sitemap protocol only allows two options: full W3C Datetime, which is what the action currently does, or date only. Fully user-configurable date formats are too error prone, since users could specify formats not supported by the sitemap protocol.

Additional context
See #57.

Drop index.shtml from URLs

Summary

Just like index.html is currently dropped from URLs, index.shtml should also be dropped. Originally proposed as part of #50, but split off here as separate issue.

BUG: need to handle "&" differently

Describe the bug
If there is an & in the URL, according to ChatGPT it needs to be escaped as &amp; in the XML.

To Reproduce
For example the url
<loc>https://planetrenox.pages.dev/gv/terms&policies</loc>
is wrong

Expected behavior
<loc>https://planetrenox.pages.dev/gv/terms&amp;policies</loc>

Additional context
I might be wrong about the fix but on github, it flags the file as red so it seems like it's def wrong with the current.

Bug in regex used to detect robots noindex directive in page header

Summary

The current regular expression used to detect whether there is a meta tag in the page header with a robots noindex directive (e.g., to exclude such pages from the sitemap) has a potential bug. \s* is used in a couple of places to account for sequences of whitespace characters, but the pattern is written as an ordinary string literal, so Python flags \s as an invalid escape sequence. This was revealed when upgrading to Python 3.12, which gives a warning. Earlier versions of Python did not visibly warn, and the behavior appears to be correct, because Python leaves an unrecognized escape sequence such as \s in the string unchanged, so the regular expression processor still receives it. We should fix this nonetheless, either by escaping the \ or by using a raw string.
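
A raw string is the usual way to write such a pattern, since the backslash then reaches the regular expression processor without being interpreted as a string escape. The pattern below is a hypothetical fragment for illustration, not the action's actual regular expression.

```python
import re

# Raw string literal: r"\s" is exactly a backslash followed by "s",
# with no invalid escape sequence warning from the Python compiler.
ROBOTS_NAME = re.compile(r'name\s*=\s*"robots"', re.IGNORECASE)


def mentions_robots(tag_text):
    """Return True if the text contains name="robots", allowing extra whitespace."""
    return ROBOTS_NAME.search(tag_text) is not None
```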

Feature Req: Exclude folder support

Common pages are being added to sitemap
I have a folder called common. The common folder includes html pages that are included in many places in my website. The sitemap generator includes these files in the sitemap.xml, but I don't want them to be included.

Solution
Add a config input param that takes the path of a folder. The script will then ignore that folder and not include any resource under it.

Check .shtml files for noindex directives

Summary

One of the features of the action is to exclude from the sitemap any html files that contain a noindex directive in a meta tag in the head. But it currently doesn't do this for .shtml files.

BUG: [Doesn't exclude "noindex" files from sitemap]

Describe the bug
The action doesn't seem to exclude html pages with the <meta name="robots" content="noindex"> meta tag or set as "Disallow" in the robots.txt file. In my case this was my "404.html" file.

To Reproduce
Steps to reproduce the behavior:

  1. Make sure you have a file (or files) that are set as "Disallow" in the robots.txt file or contain the <meta name="robots" content="noindex"> meta tag in the page's html file.
  2. Commit and push the repository.
  3. Once the commit has finished, check the sitemap.xml file. You will see files included in the sitemap that should have been excluded.

Expected behavior
Should exclude any pages from the sitemap that contain the <meta name="robots" content="noindex"> meta tag or are set as "Disallow" in the robots.txt file (such as my 404.html).

Screenshots
N/A

Relevant System Info:

  • OS: Linux Mint Debian Edition 4 (LMDE4)
  • Cinnamon Version: 5.0.7

Additional context

Regarding React App

Can I generate a sitemap during the production build using this and submit to my hosting service using GitHub actions?

Option to drop .html from urls

GitHub Pages will automatically serve the corresponding html file if a url without .html is given. For example, a url like https://some.website.domain/dir/filename automatically results in GitHub Pages serving https://some.website.domain/dir/filename.html. Add an input to give user control over whether the .html is included in urls in their sitemap. It should default to false to avoid surprising existing users. Documentation should also suggest that they set their canonical links within the html files to coincide with their sitemap.

BUG: Uppercased meta tags with robots noindex directives not detected

Describe the bug
If a robots noindex directive in a meta tag uses upper case for any combination of meta tag, robots, noindex, the action fails to exclude it from the sitemap.

To Reproduce
Steps to reproduce the behavior:

  1. Create an html file with any of meta, robots, or noindex in uppercase in the meta robots directive.
  2. Run the action.
  3. Observe that the html file above is still listed in the sitemap.

Expected behavior
Such an html file should be excluded.

Feature Req: Generate the lastmod tag with the current date

Is your feature request related to a problem? Please describe.
It would be nice if the lastmod tag defaulted to the current date when no previous version of the page is found, since Google Search Console doesn't seem to like empty lastmod tags.

Additional context
Currently, the lastmod tags are generated this way:
image
