Can I use two HTTP result codes on a page?
Q: (01:22) All right, so the first question I have on my list here is it’s theoretically possible to have two different HTTP result codes on a page, but what will Google do with those two codes? Will Google even see them? And if yes, what will Google do? For example, a 503 plus a 302.
- (01:41) So I wasn’t aware of this. But, of course, with the HTTP result codes, you can include lots of different things. Google will look at the first HTTP result code and essentially process that. And you can theoretically still have two or more HTTP result codes there if they are redirects leading to some final page. So, for example, you could have a redirect from one page to another page. That’s one result code. And then, on that other page, you could serve a different result code. A 301 redirect to a 404 page is an example of that, which happens every now and then. And from our point of view, in those chain situations where we can follow the redirect to get a final result, we will essentially just focus on that final result. If that final result has content, then that’s something we might be able to use for canonicalization. If that final result is an error page, then it’s an error page. And that’s fine for us too.
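As an illustration of that chain behaviour, here is a minimal Python sketch (using the requests library; the URLs are placeholders) that follows a redirect the way a client would and prints the intermediate and final status codes:

```python
# Minimal sketch: follow a redirect chain and inspect every hop.
# The URL is a placeholder, not a real endpoint.
import requests

resp = requests.get("https://example.com/old-page", allow_redirects=True, timeout=10)

# resp.history holds the intermediate responses (e.g. the 301 hop),
# resp itself is the final response in the chain.
for hop in resp.history:
    print(hop.status_code, hop.url)   # e.g. 301 https://example.com/old-page
print(resp.status_code, resp.url)     # e.g. 404 https://example.com/new-page
```

Whatever that last hop returns, whether a 200 with content or a 404, is the result that ends up counting.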
Does using a CDN improve rankings if my site is already fast in my main country?
Q: (02:50) Does putting a website behind a CDN improve ranking? We get the majority of our traffic from a specific country. We hosted our website on a server located in that country. Do you suggest putting our entire website behind a CDN to improve page speed for users globally, or is that not required in our case?
- (03:12) So obviously, you can do a lot of these things. I don’t think it would have a big effect on Google at all with regards to SEO. The only effect where I could imagine that something might happen is what users end up seeing. And as you mentioned, if the majority of your users are already seeing a very fast website because your server is located there, then you’re kind of doing the right thing. But of course, if users in other locations are seeing a very slow result because perhaps the connection to your country is not that great, then that’s something where you might have some opportunities to improve. And you could see that as an opportunity in the sense that, of course, if your website is really slow for other users, then they’re less likely to keep coming back to your website, because it’s really annoying to get there. Whereas, if your website is pretty fast for other users, then at least they have an opportunity to see a reasonably fast website, which could be your website. So from that point of view, if there’s something that you can do to improve things globally for your website, I think that’s a good idea. I don’t think it’s critical. It’s not something that matters in terms of SEO in the sense that Google has to see it very quickly as well, or anything like that. But it is something that you can do to grow your website past just your current country. Maybe one thing I should clarify: if Google’s crawling is really, really slow, then, of course, that can affect how much we can crawl and index from the website. So that could be an aspect to look into. In the majority of websites that I’ve looked at, I haven’t really seen this being a problem for any website that isn’t millions and millions of pages large. So from that point of view, you can double-check how fast Google is crawling in Search Console, in the Crawl Stats report. And if that looks reasonable, even if it’s not super fast, then I wouldn’t really worry about that.
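If you want a rough feel for the difference a CDN makes, a quick and very unscientific check is to time responses to the origin host versus the CDN hostname. The sketch below assumes Python with the requests library and uses placeholder hostnames; it only measures from wherever the script runs, so field data from real users is still the better signal:

```python
# Rough latency comparison between origin and CDN (placeholder hostnames).
# This only measures from the machine running the script, not from real users.
import time
import requests

for url in ("https://origin.example.com/", "https://cdn.example.com/"):
    start = time.perf_counter()
    requests.get(url, timeout=10)
    print(f"{url}: {time.perf_counter() - start:.2f}s")
```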
Should I disallow API requests to reduce crawling?
Q: (05:20) Our site is a live stream shopping platform. Our site currently spends about 20% of the crawl budget on the API subdomain and another 20% on image thumbnails of videos. Neither of these subdomains has content which is part of our SEO strategy. Should we disallow these subdomains from crawling, or how are the API endpoints discovered or used?
- (05:49) So maybe the last question there first. In many cases, API endpoints end up being used by JavaScript on your website, and we will render your pages. And if that JavaScript accesses an API that is on your website, then we’ll try to load the content from that API and use it for the rendering of the page. And depending on how your API is set up and how your JavaScript is set up, it might be that it’s hard for us to cache those API results, which means that maybe we crawl a lot of these API requests to try to get a rendered version of your pages so that we can use those for indexing. So that’s usually the place where this is discovered. And that’s something you can help with by making sure that the API results can be cached well, for example, by not injecting any timestamps into the URLs when you’re calling the API from JavaScript, all of those things. If you don’t care about the content that’s returned from these API endpoints, then, of course, you can block this whole subdomain from being crawled with the robots.txt file. And that will essentially block all of those API requests from happening. So that’s something where you first of all need to figure out whether these API results are actually part of the primary content, or important, critical content, that you want to have indexed by Google.
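One detail to keep in mind if you go the blocking route: robots.txt is per hostname, so the file that blocks the API subdomain has to be served on that subdomain. A minimal sketch of that, assuming a Python/Flask app behind api.example.com (a placeholder name):

```python
# Minimal sketch: serve a robots.txt on the API subdomain itself
# (robots.txt is per hostname, so this is what blocks api.example.com).
from flask import Flask, Response

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Disallow all crawling of this host for all user agents.
    return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")
```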
Should I use rel=nofollow on internal links?
Q: (08:05) Is it appropriate to use a nofollow attribute on internal links to avoid unnecessary crawler requests to URLs which we don’t wish to be crawled or indexed?
- (08:18) So obviously, you can do this. It’s something where I think, for the most part, it makes very little sense to use nofollow on internal links. But if that’s something that you want to do, go for it. In most cases, I would try to do something like using rel=canonical to point at the URLs that you do want to have indexed, or using robots.txt for things that you really don’t want to have crawled. So try to figure out: is it more of a subtle thing, where you have something that you prefer to have indexed, and then use rel=canonical for that? Or is it something where you say, actually, when Googlebot accesses these URLs, it causes problems for my server. It causes a large load. It makes everything really slow. It’s expensive, or what have you. And for those cases, I would just disallow the crawling of those URLs. And try to keep it on a basic level there. And with the rel=canonical, obviously, we’ll first have to crawl that page to see the rel=canonical. But over time, we will focus on the canonical that you’ve defined. And we’ll use that one primarily for crawling and indexing.
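For illustration, here is a small sketch of those two options, assuming a Python/Flask app and placeholder paths: a rel=canonical for the "I’d prefer the other URL indexed" case, and a robots.txt disallow for URLs that are genuinely expensive to crawl:

```python
# Sketch of the two approaches mentioned above (placeholder paths).
from flask import Flask, Response

app = Flask(__name__)

@app.route("/products/<slug>/print")
def printable(slug):
    # A printable duplicate of the product page; the rel=canonical points
    # at the version we'd prefer to have indexed. Google still has to crawl
    # this page once to see the rel=canonical.
    return f"""<html><head>
      <link rel="canonical" href="https://www.example.com/products/{slug}">
    </head><body>Printable version of the product page.</body></html>"""

@app.route("/robots.txt")
def robots():
    # For URL patterns that are expensive to serve, block crawling outright.
    return Response("User-agent: *\nDisallow: /internal-search/\n",
                    mimetype="text/plain")
```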
Why don’t site:-query result counts match Search Console counts?
Q: (09:35) Why don’t the search results of a site: query, which returns a giant number of results, match what Search Console and the index data show for the same domain?
- (09:55) Yeah, so this is a question that comes up every now and then. I think we’ve done a video on it separately as well, so I would check that out. I think we’ve talked about this for a long time already. Essentially, what happens there is that there are slightly different optimisations that we do for site: queries, in the sense that we just want to give you a number as quickly as possible. And that can be a very rough approximation. And that’s something where, when you do a site: query, that’s usually not something that the average user does. So we’ll try to give you a result as quickly as possible. And sometimes, that can be off. If you want a more exact number of the URLs that are actually indexed for your website, I would definitely use Search Console. That’s really the place where we give you the numbers as directly as possible, as clearly as possible. And those numbers will also fluctuate a little bit over time. They can fluctuate depending on the data centre sometimes. They go up and down a little bit as we crawl new things, and we have to figure out which ones we keep, all of those things. But overall, the number in Search Console, in the indexing report, is really the number of URLs that we have indexed for your website. I would not use the approximate result count in the search results for any diagnostic purposes. It’s really meant as a very, very rough approximation.
What’s the difference between JavaScript and HTTP redirects?
Q: (11:25) OK, now a question about redirects again: the differences between JavaScript redirects versus 301 HTTP status code redirects, and which one I would suggest for short links.
- (11:43) So, in general, when it comes to redirects, if there’s a server-side redirect where you can give us a result code as quickly as possible, that is strongly preferred. The reason it is strongly preferred is just that it can be processed immediately. So for any request that goes to one of those URLs on your server, we’ll see that it redirects. We will see the link to the new location. We can follow that right away. Whereas, if you use JavaScript to generate a redirect, then we first have to render the JavaScript and see what the JavaScript does. And then we’ll see, oh, there’s actually a redirect here. And then we’ll go off and follow that. So if at all possible, I would recommend using a server-side redirect for any kind of redirect that you’re doing on your website. If you can’t do a server-side redirect, then sometimes you have to make do. And a JavaScript redirect will also get processed. It just takes a little bit longer. The meta refresh type redirect is another option that you can use. It also takes a little bit longer because we have to figure that out on the page. But server-side redirects are great. And there are different server-side redirect types: 301 and 302, and also 307 and 308. Essentially, the differences there are whether it’s a permanent redirect or a temporary redirect. A permanent redirect tells us that we should focus on the destination page. A temporary redirect tells us we should focus on the current page that is redirecting, and kind of keep going back to that one. And the difference between 301 and 302 on the one hand, and 307 and 308 on the other, is more of a technical difference with regards to the different request types. So if you enter a URL in your browser, then you do what’s called a GET request for that URL, whereas if you submit a form or use specific types of API requests, then that can be a POST request. The 301 and 302 type redirects are generally handled by turning the request into a GET at the new location, whereas 307 and 308 preserve the request method, so a POST stays a POST. So if you have an API on your website that uses POST requests, or if you have forms where you suspect someone might be submitting something to a URL that you’re redirecting, then you would use the other types. But for the most part, it’s usually 301 or 302.
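To make those status codes concrete, here is a minimal sketch of server-side redirects, assuming a Python/Flask app and placeholder paths:

```python
# Minimal sketch of server-side redirects with different status codes.
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/old-page")
def moved_permanently():
    # 301: permanent redirect; signals that the destination is the page to focus on.
    return redirect("https://www.example.com/new-page", code=301)

@app.route("/campaign")
def moved_temporarily():
    # 302: temporary redirect; the original URL remains the one to come back to.
    return redirect("https://www.example.com/summer-sale", code=302)

@app.route("/api/submit", methods=["POST"])
def api_moved():
    # 308 (permanent) and 307 (temporary) also preserve the request method,
    # so a POST to this endpoint stays a POST at the destination.
    return redirect("https://api.example.com/v2/submit", code=308)
```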
Should I keep old, obsolete content on my site, or remove it?
Q: (14:25) I have a website for games. After a certain time, a game might shut down. Should we delete non-existing games or keep them in an archive? What’s the best option so that we don’t get any penalty? We want to keep information about the game, like videos, screenshots, et cetera.
- (14:42) So essentially, this is totally up to you. It’s something where you can remove the content of old things if you want to. You can move them to an archive section. You can make those old pages noindex so that people can still go there when they’re visiting your website. There are lots of different variations there. The main thing that you will probably want to do, if you want to keep that content, is to move it into an archive section, as you mentioned. The idea behind an archive section is that it tends to be less directly visible within your website. That means it’s easy for users and for us to recognise that this is the primary content, like the current games or current content that you have, and over here is an archive section where you can go in and dig for the old things. And the effect there is that it’s a lot easier for us to focus on your current live content and to recognise that this archive section, which is kind of separated out, is more something that we can go off and index, but it’s not really what you want to be found for. So that’s the main thing I would focus on there. And then whether or not you make the archive content noindex after a certain time, or for other reasons, that’s totally up to you.
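If you do decide to noindex the archived pages, one way to do it is a robots meta tag on those pages. A minimal sketch, assuming a Python/Flask app and a placeholder /archive/ path:

```python
# Minimal sketch: archived game pages stay reachable for visitors,
# but carry a noindex robots meta tag.
from flask import Flask

app = Flask(__name__)

@app.route("/archive/<game>")
def archived_game(game):
    return f"""<html><head>
      <meta name="robots" content="noindex">
      <title>{game} (archived)</title>
    </head><body>Videos, screenshots and history of {game}.</body></html>"""
```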
Is there a way to force site links to show up?
Q: (16:02) Is there any strategy by which desired pages can appear as a site link in Google Search results?
- (16:08) So site links are the additional results that are sometimes shown below a search result, where it’s usually just a one-line link to a different part of the website. And there is no meta tag or structured data that you can use to force a site link to be shown. It’s more that our systems try to figure out what is actually related or relevant for users when they’re looking at this one web page. And for that, our recommendation is essentially to have a good website structure, to have clear internal links so that it’s easy for us to recognise which pages are related to those pages, and to have clear titles that we can use and show as a site link. And with that, there’s no guarantee that any of this will be shown like that. But it helps us to figure out what is related. And if we do think it makes sense to show a site link, then it’ll be a lot easier for us to actually choose one based on that information.
Our site embeds PDFs with iframes. Should we OCR the text?
Q: (17:12) More technical one here. Our website uses iframes and a script to embed PDF files onto our pages and our website. Is there any advantage to taking the OCR text of the PDF and pasting it somewhere into the document’s HTML for SEO purposes, or will Google simply parse the PDF contents with the same weight and relevance to index the content?
- (17:40) Yeah. So I’m just momentarily thrown off because it sounds like you want to take the text of the PDF and just kind of hide it in the HTML for SEO purposes. And that’s something I would definitely not recommend doing. If you want to have the content indexable, then make it visible on the page. So that’s the first thing I would say there. With regards to PDFs, we do try to take the text out of the PDFs and index that for the PDFs themselves. From a practical point of view, what happens with a PDF is that, as one of the first steps, we convert it into an HTML page, and we try to index that like an HTML page. So essentially, what you’re doing is kind of framing an indirectly generated HTML page. And when it comes to iframes, we can take that content into account for indexing within the primary page. But it can also happen that we index the PDF separately anyway. So from that point of view, it’s really hard to say exactly what will happen. I would turn the question around and frame it as: what do you want to have happen? If you want your normal web pages to be indexed with the content of the PDF file, then make it so that that content is immediately visible on the HTML page. So instead of embedding the PDF as the primary piece of content, make the HTML content the primary piece and link to the PDF file. And then there is the question of whether you want those PDFs indexed separately or not. Sometimes you do want to have PDFs indexed separately. And if you do want to have them indexed separately, then linking to them is great. If you don’t want to have them indexed separately, then blocking them with robots.txt is also fine. You can also use the noindex X-Robots-Tag HTTP header. It’s a little bit more complicated because you have to serve that as a header for the PDF files, if you want to have those PDF files available in the iframe but not actually indexed.
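For the X-Robots-Tag option, here is a minimal sketch, assuming a Python/Flask app serving PDFs from a local folder (placeholder paths): the PDF can still be fetched and shown in the iframe, but carries a noindex header so it isn’t indexed separately:

```python
# Minimal sketch: serve PDFs with an X-Robots-Tag noindex header so they can
# be embedded and downloaded, but are not indexed as separate documents.
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/docs/<path:filename>")
def pdf(filename):
    # Serve the PDF, then attach the noindex header to the response.
    response = send_from_directory("pdfs", filename, mimetype="application/pdf")
    response.headers["X-Robots-Tag"] = "noindex"
    return response
```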
We want to hide links to other sites to avoid passing link juice. Is that OK?
Q: (20:02) We want to mask links to external websites to prevent the passing of our link juice. We think the PRG approach is a possible solution. What do you think? Is that solution overkill, or is there a simpler solution out there?
- (20:17) So the PRG pattern is a complicated way of essentially making a POST request to the server, which then redirects somewhere else to the external content, so that Google will never find that link. From my point of view, this is super overkill. There’s absolutely no reason to do this unless there’s really a technical reason that you absolutely need to block the crawling of those URLs. I would either just link to those pages normally or use rel=nofollow on those links. There’s absolutely no reason to go through this weird POST redirect pattern there. It just causes a lot of server overhead, and it makes it really hard to cache that request and take users to the right place. So I would just use a nofollow on those links if you don’t want to have them followed. The other thing is, of course, that just blocking all of your external links rarely makes any sense. Instead, I would make sure that you’re taking part in the web as it is, which means that you link to other sites naturally and they link to you naturally, taking part in the normal web and not trying to keep Googlebot locked into your specific website. Because I don’t think that really makes any sense.
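If the concern is really just about particular outbound links, adding rel="nofollow" is far simpler than a PRG setup. A small sketch, assuming Python with BeautifulSoup and placeholder domains, of rewriting outbound links in a chunk of HTML:

```python
# Sketch: add rel="nofollow" to outbound links instead of hiding them
# behind a POST-redirect-GET pattern. Domains are placeholders.
from bs4 import BeautifulSoup

html = '<p>See <a href="https://other-site.example/page">this resource</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    if not a["href"].startswith("https://www.example.com"):
        a["rel"] = ["nofollow"]          # mark external links as nofollow

print(soup)
# <p>See <a href="https://other-site.example/page" rel="nofollow">this resource</a>.</p>
```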
Does it matter which server platform we use, for SEO?
Q: (21:47) For Google, does it matter if our website is powered by WordPress, WooCommerce, Shopify, or any other service? A lot of marketing agencies suggest using specific platforms because it helps with SEO. Is that true?
- (22:02) That’s absolutely not true. So there is absolutely nothing in our systems, at least as far as I’m aware, that would give any kind of preferential treatment to any specific platform. And with pretty much all of these platforms, you can structure your pages and structure your website however you want. And with that, we will look at the website as we find it there. We will look at the content that you present, the way the content is presented, and the way things are linked internally. And we will process that like any HTML page. As far as I know, our systems don’t even react to the underlying structure of the back end of your website and do anything special with that. So from that point of view, it might be that certain agencies have a lot of experience with one of these platforms, and they can help you to make really good websites with that platform, which is perfectly legitimate and could be a good reason to say I will go with this platform or not. But it’s not the case that any particular platform has an inherent advantage when it comes to SEO. You can, with pretty much all of these platforms, make reasonable websites. They can all appear well in search as well.
Does Google crawl URLs in structured data markup?
Q: (23:24) Does Google crawl URLs located in structured data markup, or does Google just store the data?
- (23:31) So, for the most part, when we look at HTML pages, if we see something that looks like a link, we might go off and try that URL out as well. That’s something where, if we find a URL in JavaScript, we can try to pick that up and try to use it. If we find a link in a text file on a site, we can try to crawl that and use it. But it’s not really a normal link. So it’s something where I would recommend, if you want Google to go off and crawl that URL, making sure that there’s a natural HTML link to that URL, with clear anchor text as well, so that you give some information about the destination page. If you don’t want Google to crawl that specific URL, then maybe block it with robots.txt, or, on that page, use a rel=canonical pointing to your preferred version, anything like that. So those are the directions I would go there. I would not blindly assume that just because it’s in structured data, it will not be found, nor would I blindly assume that just because it’s in structured data, it will be found. It might be found. It might not be found. I would instead focus on what you want to have happen there. If you want to have it seen as a link, then make it a link. If you don’t want to have it crawled or indexed, then block crawling or indexing. That’s all totally up to you.
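For illustration, here is a small sketch (placeholder URLs, Python’s standard json module) showing the same URL as a JSON-LD structured data property and as a normal HTML link with anchor text; if you want it treated as a link, the second form is the one to rely on:

```python
# Sketch: the same URL in structured data versus as a plain HTML link.
import json

item = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example widget",
    "url": "https://www.example.com/products/widget",
}

json_ld = f'<script type="application/ld+json">{json.dumps(item)}</script>'
html_link = '<a href="https://www.example.com/products/widget">Example widget</a>'

print(json_ld)    # structured data: may or may not be treated as a link
print(html_link)  # a normal HTML link with anchor text: a clear crawl signal
```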