What is sitemap.xml, and Why a Pentester Should Care
Everyone seems to be familiar with robots.txt - the contents of that file are normally used to tell search engines which branches of your site NOT to index, or in some misguided cases folks think it can be used to "secure" pages for some reason. But what about sitemap.xml - what's that for?
The sitemap.xml file is basically a set of "signposts" telling search engines which pages on your site are most important to index. On a more complex site, you'll often see a sitemap_index.xml file that in turn points to a number of other xml files, each covering one aspect or area of the site. Either way, these are normally generated by a cron job, or at least as part of the deployment workflow, so that they get updated daily or whenever one of your devs pushes something new to prod.
So what do these look like? Let's look at the SANS.ORG file. Note that it's created via a script, so as noted above it should be updated automagically - but look at the date in the metadata, it's almost 2 years old! (this is NOT unusual)
curl -k https://www.sans.org/sitemap.xml
<?xml version="1.0" encoding="utf-8"?>
<!--Generated by Screaming Frog SEO Spider 14.2-->
<!-- last update Feb. 22, 2021 -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.sans.org/press</loc>
<lastmod>2021-02-22</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://www.sans.org/event/soc-training-2021</loc>
<lastmod>2021-02-22</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
....
Looking at the file for sans.edu, we see a number of entries that don't have lastmod, changefreq or priority values - just the URLs (at least at the head of the file anyway). This is more typical of what I usually see:
curl -k https://www.sans.edu/sitemap.xml
<?xml version="1.0"?>
<urlset>
<url>
<loc>https://www.sans.edu/</loc>
</url>
<url>
<loc>https://www.sans.edu/federal-education-benefits/</loc>
</url>
<url>
<loc>https://www.sans.edu/about/</loc>
</url>
<url>
<loc>https://www.sans.edu/graduate-certificates/</loc>
</url>
<url>
<loc>https://www.sans.edu/bacs-degree/</loc>
</url>
<url>
<loc>https://www.sans.edu/bacs-degree/2/</loc>
</url>
...
</urlset>
So why should I care? For me, it's a quick way to find pages or branches of the site that might have been removed from navigation, but might still be on disk.
How bad can this get? You'll find yourself on a pentest saying -- hmmm - what can I do with a "payments" branch of the site that was removed 2-3-5 years ago, but is still on disk with a signpost pointing to it? Or worse yet, if this is a repeat of a test you did a year ago, you might find that the sitemap.xml file points you to all the problems from the last test (ie - the pages as they were before last year's vulnerabilities got remediated), just waiting for you to exploit again!
So - these files can easily reach hundreds or thousands of links; how can we hunt through them to find any hidden treasures? First, spider the site in your favourite tool (I use Burp) and export the list of URLs from there to a text file.
Alternatively, this Powershell one-liner will scrape your site from the root:
$site = 'https://www.giac.org'
$sitelinks = (Invoke-WebRequest -Uri $site).Links.Href | Sort | Get-Unique
(note that this will grab links that point off site also)
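If you want to trim that crawl down to just in-scope links, something like this rough sketch works - it assumes the $site and $sitelinks variables from above, and treats relative references as in-scope:
# rough sketch: keep only links on the target host (or relative links, which stay in scope)
$targethost = ([uri]$site).Host -replace '^www\.',''        # e.g. giac.org
$inscope = $sitelinks | Where-Object {
    ($_ -notmatch '^[a-z][a-z0-9+.-]*:') -or
    ($_ -match "^https?://(www\.)?$([regex]::Escape($targethost))(/|$)")
}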
Next, grab the URLs in the sitemap.xml file:
# grab the file
$sitemap = invoke-webrequest -uri ($site + '/sitemap.xml')
# parse out just the links - the values inside the "loc" tags of each "url" entry
$sitemaplinks = ([xml] $sitemap.Content).urlset.url.loc
Let's look for one keyword, the new "GRTP" certification, which is in both lists:
PS C:\work> $sitelinks | Select-String grtp
https://giac.org/certifications/red-team-professional-grtp/
PS C:\work> $sitemaplinks | Select-String grtp
https://www.giac.org/mlp/grtp-beta-faqs/
https://www.giac.org/certifications/red-team-professional-grtp/
This is where we see the PowerShell "crawl the site" approach fall apart a little - the links in the GIAC site don't all consistently contain "www", but the sitemap entries all do. The crawled links also include several relative references, so there are a fair number of links that don't even include https://. It's still a valid export of the crawled site, but it's not in the same fully-specified format as the sitemap.xml file (which is 100% consistent and perfect, since those links were generated by a script and not a person).
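One way to paper over that (a rough sketch only - I went a different route below) is to boil both lists down to bare paths, so the scheme and host differences wash out before comparing:
# rough sketch: reduce each link to just its path, so that
# "https://giac.org/x", "https://www.giac.org/x" and "/x" all compare as equal
function Get-UrlPath ($link) {
    if ($link -match '^https?://[^/]+(?<path>/.*)?$') {
        if ($Matches['path']) { $Matches['path'] } else { '/' }
    }
    else { $link }    # already a relative reference, leave it alone
}
$sitepaths    = $sitelinks    | ForEach-Object { Get-UrlPath $_ } | sort | Get-Unique
$sitemappaths = $sitemaplinks | ForEach-Object { Get-UrlPath $_ } | sort | Get-Unique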
OK, let's export the list from Burp instead of using the PowerShell list, then pull that list into PowerShell. Right-click the site root in Burp and select "Copy URLs in this host". (Don't select "Copy links in this host" - that will collect all of the links to other sites as well.) Paste that into a file, then pull that file into a list:
$sitelinks = get-content c:\work\giac-exported-from-burp.txt
Now, how to find the difference between the sitemap list and the site itself? That's just another one-liner:
$diff = $sitemaplinks | ?{$sitelinks -notcontains $_}
(output not shown, but read on)
Ouch - at this point we see that otherwise-identical items in each list may or may not terminate with a "/", which makes them compare as different. Let's normalize both lists by taking off any trailing "/" characters:
# normalize the sitemap list - strip any trailing "/" from each link
$smlist = @()
foreach ($sml in $sitemaplinks) {
    $l = $sml.length
    if ($sml.substring($l-1,1) -eq "/") { $sml = $sml.substring(0,$l-1) }
    $smlist += $sml
}
# do the same for the crawled / exported site list
$slist = @()
foreach ($sl in $sitelinks) {
    $l = $sl.length
    if ($sl.substring($l-1,1) -eq "/") { $sl = $sl.substring(0,$l-1) }
    $slist += $sl
}
NOW we can diff the lists!
$diff = $smlist | ?{$slist -notcontains $_}
$diff
https://www.giac.org/retired-certifications
https://www.giac.org/frames
https://www.giac.org/maintenance
https://www.giac.org/mlp
https://www.giac.org/coronavirus-response
https://www.giac.org/justification-letters
https://www.giac.org/get-certified/in-person
https://www.giac.org/about/mission
https://www.giac.org/become-a-gse
https://www.giac.org/mlp/grtp-beta-faqs
https://www.giac.org/mlp/firstcertificationsc
https://www.giac.org/mlp/firstcertificationtt
https://www.giac.org/mlp/rsac-2023
https://www.giac.org/mlp/rsac-2022
https://www.giac.org/mlp/trust-me
https://www.giac.org/workforce-development/government/dodd-8140
https://www.giac.org/blog/top-5-reasons-to-earn-giac-certifications-keep-them-active
https://www.giac.org/blog/giac-introduces-industry-first-cloud-forensic-responder-gcfr-certification
https://www.giac.org/blog/explore-the-value-of-giac-cloud-security-and-penetration-testing-certifications
https://www.giac.org/blog/why-workforce-frameworks-certifications-matter-cybersecurity
https://www.giac.org/blog/launch-your-offensive-operations-career-with-giac-certifications
https://www.giac.org/blog/the-value-of-certifications
https://www.giac.org/blog/why-certify-with-giac
https://www.giac.org/blog/explore-giac-certifications-in-offensive-operations
https://www.giac.org/blog/giac-launches-new-cloud-penetration-tester-certification
https://www.giac.org/blog/best-practices-for-giac-exam-prep
https://www.giac.org/blog/mitigating-cyber-risk-through-cloud-security
https://www.giac.org/blog/cloud-skilled-cyber-professionals-needed-to-secure-organizations-globally
https://www.giac.org/blog/gpcs-announcement
https://www.giac.org/blog/build-a-security-operations-career-with-giacs-new-cyber-security-certification
https://www.giac.org/podcasts/trust-me-im-certified/giacs-new-podcast-hosted-by-jason-nickola-gse-and-sans-instructor
https://www.giac.org/podcasts/trust-me-im-certified/exploring-imposter-syndrome-through-experience-education-and-gatekeeping-with-lesley-carhart
https://www.giac.org/podcasts/trust-me-im-certified/practicing-confidence-mental-agility-and-vulnerability-while-building-your-cybersecurity-career-with-chris-cochran
This is the final list of pages that are in the sitemap.xml file but are NOT in the navigable website. When you check them individually, most of them still return a page! As discussed, this is almost always the case - when the sitemap isn't being regenerated automatically, you can think of sitemap.xml as a local Wayback Machine!
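You don't have to check them one at a time either - a rough sketch like the one below (it assumes the $diff list from above, and uses HEAD requests to keep things quick) will tell you which of those orphaned links still answer:
# rough sketch: request each orphaned link and report its HTTP status
# (some servers dislike HEAD - switch back to the default GET if you get odd results)
foreach ($u in $diff) {
    try {
        $r = Invoke-WebRequest -Uri $u -Method Head -UseBasicParsing -ErrorAction Stop
        "{0}  {1}" -f $r.StatusCode, $u
    }
    catch {
        "ERR  {0}" -f $u
    }
}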
But wait, that's not all!!
Often the sitemap file will be a sitemap_index.xml file, which in turn will point to additional files. Let's take a look at that parent file:
# first, collect the parent sitemap_index.xml - it only contains links to the component xml files:
$site = "https://companyname.com"
$sitemap_parent = invoke-webrequest -uri ($site + '/sitemap_index.xml')
$xml_parent = ([xml]$sitemap_parent.Content).sitemapindex.sitemap.loc
$xml_parent
https://companyname.com/post-sitemap.xml
https://companyname.com/page-sitemap.xml
https://companyname.com/attachment-sitemap.xml
https://companyname.com/career-sitemap.xml
https://companyname.com/service-sitemap.xml
https://companyname.com/resource-sitemap.xml
https://companyname.com/category-sitemap.xml
https://companyname.com/post_tag-sitemap.xml
https://companyname.com/type-sitemap.xml
https://companyname.com/author-sitemap.xml
Next, we parse each of the component xml files that the parent sitemap_index.xml links to, and build one mondo "sitemaplinks" variable from all of them:
$temp = @()
foreach ($x in $xml_parent) {
    $s = invoke-webrequest -uri $x
    # collect just the "loc" values from each component sitemap file
    $temp += ([xml] $s.Content).urlset.url.loc
}
$sitemaplinks = $temp | sort | Get-Unique
This list can now be plugged in at the "normalize" step in the original code above.
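Since you won't always know up front whether a target publishes sitemap.xml, sitemap_index.xml or both, a quick probe like this rough sketch can tell you which path to take (the script in my github below does this checking for you):
# rough sketch: check which sitemap flavour(s) the target actually serves
foreach ($f in 'sitemap.xml','sitemap_index.xml') {
    try {
        $null = Invoke-WebRequest -Uri ($site + '/' + $f) -UseBasicParsing -ErrorAction Stop
        "$f is present"
    }
    catch { "$f not found (or not readable)" }
}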
If you've found something good in a sitemap file on a pentest or assessment, please share using our comment form!
Want to learn more about sitemaps? The definitive source of course is Google (searching and indexing is kinda their thing, and this was their idea):
https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview
As always, you'll find this code in my GitHub (a bit later today):
https://github.com/robvandenbrink/sitemap/tree/main
(yes, it includes error checking, and it accounts for the possible presence of either or both of the sitemap.xml and sitemap_index.xml files)
===============
Rob VandenBrink
rob@coherentsecurity.com