Search engines are answer machines. They exist to discover, understand, and organize the internet’s content to offer the most relevant results to the questions searchers are asking. For your content to appear in search results, it must first be visible to search engines. Search engines run several processes to discover your website, and one of them is indexing. It is arguably the most vital piece of the SEO puzzle: if your site can’t be found, there’s no way you’ll ever show up in the SERPs (Search Engine Results Pages).
First, how do search engines work?
Search engines work through three primary functions:
- Crawling: Scour the internet for content, looking over the code/content for each URL they find.
- Indexing: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to show up for relevant queries.
- Ranking: Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered from most relevant to least relevant.
Search engines process and store the information they find in an index: a massive database of all the content they’ve discovered and deem good enough to serve up to searchers.
How do search engines interpret and store your pages?
Once you’ve ensured your site has been crawled, the next order of business is to make sure it can be indexed. That’s right — just because your site can be discovered and crawled by a search engine doesn’t necessarily mean that it will be stored in their index. In the previous article on crawling, we discussed how search engines discover your web pages. The index is where your discovered pages are stored.
After a crawler finds a page, the search engine renders it like a browser. In the process of doing so, the search engine analyzes that page’s contents. All of that information is stored in its index.
Read on to learn about how indexing works and how you can make sure your site makes it into this all-important database.
Can I see how a Googlebot crawler sees my pages?
Yes, the cached version of your page will reflect a snapshot of the last time Googlebot crawled it.
Google crawls and caches web pages at different frequencies. More established, well-known sites that publish frequently, like https://www.nytimes.com, will be crawled more often than less-famous websites.
Are pages ever removed from indexing?
Yes, pages can be removed from the index! Some of the main reasons why a URL might be removed include the URL:
- Is returning a “not found” error (4XX) or server error (5XX) – This could be accidental (the page was moved and a 301 redirect was not set up) or intentional (the page was deleted and 404ed to get it removed from the index)
- Had a noindex meta tag added. This tag can be added by site owners to instruct the search engine to omit the page from its index.
- Has been manually penalized for violating the search engine’s Webmaster Guidelines and, as a result, was removed from the index.
- Has been blocked from crawling with the addition of a password required before visitors can access the page.
If you believe that a page on your website that was previously in Google’s index is no longer showing up, you can use the URL Inspection tool in Google Search Console to learn the status of the page. The tool (which replaced the older “Fetch as Google” feature) also offers a “Request Indexing” option to submit individual URLs to the index.
(Bonus: the URL Inspection tool’s live test also shows a rendered screenshot of the page, allowing you to see if there are any issues with how Google is interpreting it.)
Tell search engines how to index your site
1- Robots meta directives
Meta directives (or “meta tags”) are instructions you can give to search engines regarding how you want your web page to be treated.
You can tell search engine crawlers things like “Do not index this page in search results” or “Don’t pass any link equity to any on-page links”. These instructions are executed via Robots Meta Tags in the <head> of your HTML pages (most commonly used) or via the X-Robots-Tag in the HTTP header.
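Both delivery mechanisms carry the same directives. As an illustrative sketch (the page title and directive values here are placeholders), a robots meta tag sits in the page’s HTML:

```html
<head>
  <!-- Instructs all crawlers: keep this page out of the index
       and do not pass link equity through its links -->
  <meta name="robots" content="noindex, nofollow">
  <title>Example page</title>
</head>
```

The equivalent instruction can also be sent for any file type as an HTTP response header: `X-Robots-Tag: noindex, nofollow`.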
2- Robots meta tag
The robots meta tag belongs within the <head> of the HTML of your webpage. It can exclude all or specific search engines. The following are the most common meta directives, along with the situations in which you might apply them.
Index/noindex tells the engines whether the page should be kept in a search engine’s index for retrieval. If you opt to use “noindex,” you’re communicating to crawlers that you want the page excluded from search results. By default, search engines assume they can index all pages, so using the “index” value is unnecessary.
- When you might use: You might opt to mark a page as “noindex” if you’re trying to trim thin pages from Google’s index of your site (ex: user-generated profile pages) but you still want them accessible to visitors.
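For example, a thin user-generated profile page (the snippet below is illustrative) would stay reachable by visitors while being excluded from search results:

```html
<head>
  <!-- Page remains accessible to visitors, but is omitted from the index -->
  <meta name="robots" content="noindex">
</head>
```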
Follow/nofollow tells search engines whether links on the page should be followed or nofollowed. “Follow” results in bots following the links on your page and passing link equity through to those URLs. Or, if you elect to employ “nofollow,” the search engines will not follow or pass any link equity through to the links on the page. By default, all pages are assumed to have the “follow” attribute.
- When you might use: nofollow is often used together with noindex when you’re trying to prevent a page from being indexed and prevent the crawler from following links on the page.
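Combining the two directives looks like this (an illustrative snippet):

```html
<head>
  <!-- Exclude the page from the index AND stop crawlers from
       following or passing equity through its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```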
Noarchive is used to restrict search engines from saving a cached copy of the page. By default, the engines will maintain visible copies of all pages they have indexed, accessible to searchers through the cached link in the search results.
- When you might use: If you run an e-commerce site and your prices change regularly, you might consider the noarchive tag to prevent searchers from seeing outdated pricing.
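For instance, a product page with frequently changing prices (illustrative snippet) could remain indexed while opting out of the cache:

```html
<head>
  <!-- Page can still rank, but searchers won't be shown a cached copy -->
  <meta name="robots" content="noarchive">
</head>
```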
The X-Robots-Tag is used within the HTTP header of your URL. It provides more flexibility and functionality than meta tags when you want to block search engines at scale, because you can use regular expressions, block non-HTML files, and apply sitewide noindex tags.
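As a sketch of the “at scale” case, an Apache server with mod_headers enabled could apply noindex to every PDF on the site, something a meta tag cannot do since PDFs have no HTML <head> (the file pattern here is illustrative):

```apache
# Send an X-Robots-Tag header on every response whose filename ends in .pdf
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

An nginx equivalent would use `add_header X-Robots-Tag "noindex, nofollow";` inside a matching location block.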
Understanding the different ways you can influence crawling and indexing will help you avoid the common pitfalls that can prevent your important pages from getting found.