Site Architecture & Crawlability
You'll learn:
- Design a crawlable site structure
- Use robots.txt and XML sitemaps effectively
Crawlability: Can Google Find Your Pages?
Before a page can rank, Google must discover it. Crawlers navigate your site through links. A confusing architecture means important pages might never get indexed.
Flat vs. Deep Site Architecture
Aim for a "flat" architecture where important pages are just a few clicks from the homepage:
- Good (flat): Home → Category → Product (2 clicks)
- Bad (deep): Home → 2024 → Blog → Category → Post → Page (5+ clicks)
Google allots each site a "crawl budget"—the time and resources Googlebot will spend crawling it. Deep or tangled architectures waste this budget on unimportant pages, so key pages may be crawled late or not at all.
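One way to audit depth is to treat your internal links as a graph and measure each page's minimum click distance from the homepage with a breadth-first search. The sketch below assumes a hypothetical link graph expressed as a dict; on a real site you would build this map from a crawl of your own pages.

```python
from collections import deque

def click_depths(links, start="home"):
    """BFS over an internal-link graph, returning each page's
    minimum number of clicks from the start page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical flat structure: every product is 2 clicks from home.
site = {
    "home": ["category-a", "category-b"],
    "category-a": ["product-1", "product-2"],
    "category-b": ["product-3"],
}
print(click_depths(site))
```

Any page whose depth comes back above 3 or so is a candidate for better internal linking.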
Internal Linking Best Practices
- Use descriptive anchor text: "SEO guide" not "click here"
- Link to related content: keep users on your site longer
- Add HTML sitemaps: linked from footer for users and crawlers
- Use breadcrumbs: show users (and Google) your site hierarchy
- Fix broken links: regularly check for 404 errors
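The anchor-text rule above is easy to check mechanically. This sketch uses Python's stdlib `html.parser` to collect links and flag generic, non-descriptive anchors; the `GENERIC` phrase list is an illustrative assumption, not an exhaustive one.

```python
from html.parser import HTMLParser

GENERIC = {"click here", "read more", "learn more", "here"}

class AnchorAudit(HTMLParser):
    """Collects anchor text per link and flags generic anchors."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.flagged = []  # (href, text) pairs with weak anchor text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if text in GENERIC:
                self.flagged.append((self._href, text))
            self._href = None

audit = AnchorAudit()
audit.feed('<a href="/seo-guide">SEO guide</a> <a href="/x">click here</a>')
print(audit.flagged)  # [('/x', 'click here')]
```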
Robots.txt: Controlling Crawler Access
The robots.txt file tells crawlers which parts of your site to avoid. It's useful for preventing crawlers from accessing admin areas, staging sites, or resource-intensive pages.
```
# robots.txt
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
```

Warning: robots.txt is public. Don't use it to hide sensitive information—use authentication instead. Also, Google may still index pages disallowed in robots.txt if other sites link to them.
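You can verify your rules behave as intended with Python's stdlib `urllib.robotparser`. The sketch below parses the `Disallow` rules from above (note that Python's parser applies rules in file order, so an allow-all line is omitted here) and checks specific URLs:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/products/widget"))  # True
print(rp.can_fetch("*", "https://example.com/admin/login"))      # False
```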
XML Sitemaps: Your Site's Table of Contents
An XML sitemap lists all important pages on your site. Submit it to Google Search Console to help Google discover and prioritize your content.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2026-01-24</lastmod>
    <priority>1.0</priority>
  </url>
</urlset>
```
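Rather than writing this XML by hand, you can generate it from a list of URLs. A minimal sketch using Python's stdlib `xml.etree.ElementTree`, with hypothetical page data:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of (url, lastmod) pairs -> sitemap XML string."""
    ET.register_namespace("", NS)  # serialize without a namespace prefix
    urlset = ET.Element(f"{{{NS}}}urlset")
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, f"{{{NS}}}url")
        ET.SubElement(entry, f"{{{NS}}}loc").text = url
        ET.SubElement(entry, f"{{{NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([("https://example.com/page-1", "2026-01-24")])
print(xml)
```

In practice you would feed this from your CMS or database and regenerate the file whenever pages change, then resubmit it in Google Search Console.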