Tag: Indexing

  • What to do if your Development Site is Indexed by Google

    We take a look at what to do if your development/staging/testing site has been indexed in Google, including expert advice direct from a Google employee.

    Oops! Our Dev site has been indexed in Google

    First of all, don’t beat yourself up: a development site getting accidentally indexed in Google has happened at least once to every agency, developer, and in-house team on the planet.

    Perhaps you found out about it via an angry email from your client or their SEO consultant, or maybe you discovered it yourself whilst checking for indexed pages.

    How did the dev site get indexed?

    If Google can reach your development site unimpeded and finds no instruction telling it not to crawl or index the available URLs, there is a high chance it will add those pages to its index for users to find.

    This can happen reasonably quickly: even if your dev site was only accessible for a few days or weeks, that could be enough time for Google to index the entire site.

    So if your development site does get indexed in Google, what can you do about it, and are there any urgent solutions for situations where, for example, the client or management is upset?

    How to quickly remove a dev site from the Google index

    As recommended by Google’s John Mu, if you find your staging site has been indexed and there is an urgent requirement to remove it, the quickest way to remove content from Google is to use the official ‘Removals Tool’ found in Search Console.

    I’d do a site-removal request in search console – if the site is verified, it’ll be hidden in search within less than a day. After that, you have time to figure out what to do for the long run.

    John Mu via TechSEO subreddit
    Official video: Removals in Search Console – Daniel Waisberg

    To use the Removals tool you will first need to verify the specific domain you want to remove in Search Console, if it is not already verified.

    John goes on to offer some footnotes and warnings regarding use of the tool:

    • If you make a mistake and need to cancel a removal request, this process should be fast.
    • Remember that removals apply to both www and non-www, and both http/https.
    • Using the tool properly should clear the URL from Google for around 6 months.

    After temporarily removing URLs from Google, it is sensible to then work towards a permanent removal.

    How to permanently remove a development or staging site from Google

    Process chart: how to remove your staging site from Google

    The most effective way to ask Google to stop indexing a page is either to use a noindex directive, or to ensure the resource returns a 410/404 HTTP response to indicate it is no longer available.

    Google have stated in the past that a noindex tag and 404/410 should work at the same speed.

    If Google returns to a resource following a temporary removal request and finds a 404/410 or noindex tag they will cancel the removal request as it is no longer needed.

    You could also set up authentication which would result in Google being unable to access a resource (eg with a 401 HTTP response).
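
    The signals described above (a 404/410 status, a noindex tag or header, or a 401 from authentication) can be checked on your own responses. The sketch below is a minimal illustration only: removal_signal and its labels are hypothetical names, it is not part of any Google tooling, and it only looks for the literal noindex value in the X-Robots-Tag header or a robots meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            self.directives.append((attr_map.get("content") or "").lower())


def removal_signal(status, headers, html=""):
    """Return a label for the de-indexing signal a response sends, or None.

    Hypothetical helper: the labels are illustrative, not an official taxonomy.
    """
    if status in (404, 410):
        return "gone"
    if status == 401:
        return "auth-required"
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return "noindex-header"
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    if any("noindex" in directive for directive in parser.directives):
        return "noindex-meta"
    return None


if __name__ == "__main__":
    print(removal_signal(410, {}))
    print(removal_signal(200, {"X-Robots-Tag": "noindex"}))
    print(removal_signal(200, {}, '<meta name="robots" content="noindex">'))
```

    A result of None for a staging URL means the response carries none of these signals, so the page would remain eligible for indexing once any temporary removal expires.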

    Using a robots.txt block is not a good solution if your site has already been indexed. It can take a long time to have any impact and is not a direct instruction to remove content from the index, so Google can ignore it and leave the page indexed if they wish.

    If your site is already indexed in Google, using a robots.txt rule to prevent Google crawling the site will also prevent them from seeing a noindex tag/header if you add one to a page.
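
    To illustrate why a robots.txt block and a noindex tag work against each other, here is a minimal sketch using Python's standard urllib.robotparser module. It is Python's own parser, not Google's, so its matching rules may differ in detail; the staging.example.com URL and the can_crawl helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A blanket disallow rule, as is often used on staging sites.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# Hypothetical staging URL used only for this example.
PAGE = "https://staging.example.com/about/"

if __name__ == "__main__":
    # If crawling is disallowed, a crawler that honours robots.txt never
    # downloads the page, so a noindex tag or header on it is never read.
    print("crawlable with block:", can_crawl(BLOCK_ALL, PAGE))
    print("crawlable without rules:", can_crawl("", PAGE))
```

    With the blanket disallow rule the page is reported as not crawlable, which is exactly the situation in which a noindex tag added to that page would go unseen.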

    How to remove a development site from Google’s cache

    Using the Removals Tool in Search Console will by default remove the URLs entirely, including the cache.

    The tool also gives you the option to clear only the cached URL. This clears the snippet shown in search results until the resource is recrawled, at which point a new snippet will be shown.

    How to prevent a development site from getting indexed by Google

    To prevent your dev site getting indexed in Google there are a variety of methods you can use:

    Methods to block a staging site from appearing in Google
    • Authentication (password, IP address, CMS/plugin based, etc)
    • Noindex tag or header
    • Robots.txt disallow rule (least recommended option)

    Google’s John Mu recommends the use of server side authentication as the best method:

    My recommendation is always to use server-side authentication for staging / dev sites, since it’s obvious when it’s blocked, and obvious when it’s forgotten. Robots.txt and robots meta tags are easy to accidentally deploy to your live site.

    Note: Robots.txt is not a good option because it can be ignored by Google and other search engines.
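
    As a rough illustration of server-side authentication, the sketch below checks an HTTP Basic Authorization header and returns 401 when it is missing or wrong. In practice this is usually configured in the web server or hosting platform rather than written by hand; the function names and the staging / change-me credentials are placeholders invented for the example.

```python
import base64
import hmac


def is_authorized(auth_header, username, password):
    """Return True if an HTTP Basic Authorization header matches the credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        encoded = auth_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    expected = f"{username}:{password}"
    # Constant-time comparison avoids leaking how much of the value matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def response_status(auth_header, username="staging", password="change-me"):
    """Return the status a protected staging page would send: 200 or 401."""
    return 200 if is_authorized(auth_header, username, password) else 401


if __name__ == "__main__":
    token = base64.b64encode(b"staging:change-me").decode("ascii")
    print(response_status("Basic " + token))  # request with valid credentials
    print(response_status(None))              # crawler with no credentials
```

    A crawler sends no credentials, so every URL answers with 401 and there is no page content for it to index. If the same protection were deployed to the live site by mistake, visitors would be locked out immediately, so the error is hard to miss.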

    How to stop Google indexing a WordPress development site

    You can use any of the standard methods to stop Google indexing a WordPress staging site – eg password protection, noindex or blocking Google from crawling the site with robots.txt.

    The easiest method, if you have access to the WordPress admin dashboard, is to enable the ‘Discourage search engines’ option via Settings > Reading. This should add a noindex tag to all your pages.

  • Does Google Index Text Content in CSS Pseudo Elements?

    Traditionally, when Google (or other search engines) look for text-based content to index, they expect to find this content directly in the HTML of the webpage that is served to them.

    This changed somewhat with the rise of sites using JavaScript to serve anywhere from small pieces of content to entire websites.

    Google was forced then to invest resources attempting to render and index JavaScript based content as effectively as possible.

    CSS Pseudo Elements

    But what about text content that is sourced purely from CSS? It is possible to add content to a page using CSS pseudo elements such as ::before and ::after combined with the CSS content property.

    See a simple example below:

    <p>99 bottles of beer on the wall, 99 bottles of beer.</p>

    p::after {content: ' Take one down and pass it around, 98 bottles of beer on the wall.';}

    Will display as:

    99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.
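
    To make it concrete that the second sentence exists only in the stylesheet, the sketch below extracts the text nodes from the example markup using Python's standard html.parser module. It does not model how Google processes pages; it only shows what a reader of the HTML alone would find. The TextExtractor class and visible_html_text function are illustrative names, and the CSS is inlined in a style element purely to keep the example self-contained.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes from HTML, skipping <style> and <script> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside style/script elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_html_text(html):
    """Return the text found in the HTML itself, ignoring CSS-generated content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = """
<style>
p::after {content: ' Take one down and pass it around, 98 bottles of beer on the wall.';}
</style>
<p>99 bottles of beer on the wall, 99 bottles of beer.</p>
"""

if __name__ == "__main__":
    print(visible_html_text(PAGE))
```

    Only the sentence inside the paragraph tag is returned. The generated sentence appears once a browser applies the CSS, but it is never part of the document's text nodes.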

    Year after year as CSS gets more advanced and other features are introduced, such as the ability to do mathematical calculations or count elements using only CSS, the likelihood of devs and designers adopting these features becomes higher.

    But will Google be able to render and index this content? Will the text found in the CSS appear and be searchable in Google?

    Is using CSS for text best practice?

    Before we start it is important to note that in the large majority of situations using CSS pseudo elements and the ‘content’ property (instead of HTML) to display any significant amount of text based content on a website is absolutely not best practice for various reasons, including:

    1. The text is not selectable by users, meaning it can’t be highlighted or copied/pasted.
    2. The text may be ignored by screen readers and other assistive technologies, making the content inaccessible and a failure against accessibility guidelines.

    F87: Failure of Success Criterion 1.3.1 due to inserting non-decorative content by using :before and :after pseudo-elements and the ‘content’ property in CSS

    W3.org

    CSS pseudo elements should generally speaking only be used for decorative elements that are non-essential to the consumption of the content on the page.

    SEO Poll

    Before writing this article, I was not able to find any other SEO-focused articles on this topic, so I thought it could be interesting to dig in and do some research.

    I asked the SEO community what they thought in a Twitter poll, with the following results:

    Taking out users that just wanted to see the results, there is a fairly even split between the three choices with ‘No’ and ‘I don’t know’ getting an equal number of votes (12), and ‘Yes’ trailing behind by just a few votes (9).

    Test

    To test this, I created a page that contained zero standard HTML-based content and added text content using CSS pseudo elements attached to heading, paragraph, div and link tags, sourced from an external CSS file.

    You can also view the code and resulting page on CodePen here.

    To give the URL a little boost to help it get indexed more quickly (or indeed at all) I linked to it temporarily from the footer of the site.

    Rendering

    To test I also ran the page through the Fetch tool in Search Console and the Mobile Friendly testing tool.

    Both showed that Google were able to fully render the CSS content as it appeared to normal users on the page.

    Results

    Eventually (slightly to my surprise) the page did get indexed in Google despite the complete lack of content.

    However, after checking the resulting listing in Google and searching for strings of text from the page, it became clear that no actual content had been indexed.

    So we can confirm from this test that: NO – although Google can render it, CSS based content will not currently be indexed in Google.

    If you include text content on your site using CSS pseudo elements and the CSS ‘content’ property it is currently not possible for Google to index the text content.

    Update (14/7/2021)

    The fantastic Jess Peck alerted me to a previous test she conducted on the same subject, which you can view here, and to another post/experiment from Mathias Bynens that doesn’t use any HTML at all.