Much is written about SEO, and much that is written refers to ‘the Google algorithm’. Unfortunately, and for reasons that we’ll explain later on in this post, what is written is not always as clear as it could be. It often seems like you can replace references to Google’s algorithm with “magic” and have the article in question read just as well. This post is aimed at a broad audience, with little or no prior knowledge, to explain roughly how Google’s search algorithm works – and why it’s often necessary to be a little vague when we talk about it.
Wait, What Is An Algorithm?
An algorithm can informally be thought of as another word for a procedure or process. The way we add two numbers together is an algorithm, long division is an algorithm, and a way to find the fastest route between two points on a map is an algorithm.
The question of what is and is not an algorithm is surprisingly deep and interesting, but all we are seeking to answer here is the question “How does Google return search results?” Diving into the philosophical and mathematical debates over what exactly counts as an algorithm probably wouldn’t be worth it for this article – and I wouldn’t know where to start anyway, being neither a philosopher nor a mathematician.
The algorithm Google uses incorporates many smaller algorithms and what are known as heuristics, which is a fancy word for ‘good guesses that mostly work’.
It is this algorithm that ultimately determines the results you see when you use Google’s search engine.
The core of Google’s search engine, the factor they are always attempting to optimise for, is utility. How useful is this page to the searcher? With that question in mind, many of the specific factors governing Google’s ranking algorithms can be worked out or discovered – but why would you do that?
Search engines are hugely important to most modern businesses. Using search engines such as Google is one of the most common ways consumers find new products and information online. That’s one reason why it’s important to have a good overview-level understanding of how Google works, even if you don’t know and never intend to learn the specific details.
The (Lack Of) Competition
Another reason is that within the search arena, Google is overwhelmingly dominant. Competitors such as Bing, DuckDuckGo, Blekko and Ask Jeeves are not usually worth considering as significant sources of traffic. Most other search engines use either Google’s technology or Bing’s technology, and don’t do much more than overlay their own branding and promote their own products. Others are really useful, but quite niche. That means that, at time of writing, it’s more beneficial for marketing managers or business owners to understand the algorithm Google uses than the algorithms used by much of its competition.
Why An Overview Of Google’s Algorithm Is Useful
Try searching for information on the actual workings of Google’s algorithm, though, and you’ll get so many specific details and items to action that you’ll probably end up with no clearer an action plan. Worse, you can end up ‘knowing’ specific information that is obsolete or even harmful by the year’s end. This is because Google is constantly adjusting its algorithm, both in response to malicious attempts at manipulating the Search Engine Results Pages (SERPs – industry jargon for search results) and in order to improve its results for all users. Your honest attempt at optimising your site today could be a clear sign of a dishonest spam site tomorrow.
What is important is a high level understanding of the ideas that drive Google’s algorithm, and what is arguably even more important is realising that the fundamental ideas behind Google’s algorithms are completely understandable even if you don’t know the first thing about computers.
How It All Began
The basics of a search engine are simple enough that you can find implementations even in casual computer games and on throwaway category pages on modern websites. Check the words that were entered into the search box – if they turn up in one of the pages, return that page. If you want to get super super fancy, return them in order from the page with the most mentions to the page with the least mentions. Naturally, if you’re searching other people’s pages with this technique there’s a big incentive to try to cheat the system.
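That naive approach can be sketched in a few lines of Python. The pages and query below are made-up examples, purely to illustrate the “count the mentions, rank by count” idea:

```python
# A naive keyword search: rank pages by how many times the query terms appear.
pages = {
    "pets.html": "cats and dogs and more cats",
    "cars.html": "fast cars and faster cars",
    "mixed.html": "cats in cars",
}

def naive_search(query, pages):
    terms = query.lower().split()
    scores = {}
    for name, text in pages.items():
        words = text.lower().split()
        # Count every occurrence of every query term on the page.
        count = sum(words.count(term) for term in terms)
        if count > 0:
            scores[name] = count
    # Return matching pages, from the most mentions to the least.
    return sorted(scores, key=scores.get, reverse=True)

print(naive_search("cats", pages))  # ['pets.html', 'mixed.html']
```

It’s easy to see how this gets gamed: stuff a page with a thousand repetitions of a keyword and it wins. That is exactly the cheating incentive described above.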
Matching patterns is not hard – scaling to meet demand is hard, sure, but that’s a nice problem to have. It also doesn’t return very good results for most users. Anyone who used AltaVista as a student can probably remember that using a search engine back then was a damned art. You’d refine and refine your search, trying to second-guess the creator of the perfect query-answering page you were sure existed (it rarely did), knowing that if a search result had been compromised in order to fill your computer with unwanted toolbars and magic smoke, there was little you could do.
A significant problem for early search engines was judging the quality of a page as well as its relevance. This was hard, and it was the popularity of Google, as powered by their PageRank algorithm, that made it seem like we should both expect and demand high quality results from a search engine. It is still a difficult challenge, but it seemed at the time as though even estimating a page’s quality would be impossible.
PageRank has traditionally been the centrepiece of Google’s search engine. Loosely, it attempts to measure the quality of your website by counting citations, in the form of links from elsewhere on the web. Each link is given more or less weight depending on how many links point to the linking page in turn, and so on.
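The idea of links passing on weight can be illustrated with a toy implementation. The link graph, damping factor and iteration count below are illustrative assumptions for the sketch, not Google’s actual parameters:

```python
# A toy PageRank: each page shares its score among the pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the share of rank flowing in from every page linking to p.
            incoming = sum(
                rank[q] / len(links[q]) for q in pages if p in links[q]
            )
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

# A tiny made-up web: both B and C link to A, so A ends up ranked highest.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # A
```

Notice that a link from the well-linked page B is worth more to A than the link from the unloved page C – that is the “weighted citations” idea in miniature.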
Even as Google were implementing PageRank, other people were reaching similar conclusions and ideas – similar projects, such as HyperSearch and the project that would eventually become Baidu, influenced early Googlers. Google’s achievement was that they were able to implement an extremely successful and scalable algorithm at a time when these ideas were generally considered to be operating on or beyond the fringes.
The thinking at the time was that it should be harder to manipulate the search engine using PageRank, and that the links to a particular website would accurately reflect the popularity of that site. This was definitely true at the time, and the basic assumption survived a lot of abuse for a long time, but there were issues with the system that were eventually exploited.
The Day Directories Died
Directories (such as Yahoo!) were the core of the early web, useful guides that newbies could easily use to find helpful and essential websites. As such, they formed an authoritative guide to the very best of the web, and it was hard to imagine how the web would even be usable without them.
As the web grew in size and ambition, motivations changed. Firstly, in the heady days before the dotcom bubble, being on the front page of an important directory could result in serious traffic to a valuable startup. Secondly, word was increasingly spreading about Google’s fancy search engine. I can still remember the day our IT teacher held a special class to announce that it was time to stop using AltaVista, with the soft-spoken gravitas of a vet bearing bad news about Mr. Scruffles. It was an important phenomenon, stretching far beyond the world of webmasters. This meant that ranking well in a search engine suddenly became important, and links from directories were the most reliable way to achieve that goal.
Webmasters were now paying big money to get into directories. Directories could now guarantee both traffic and a prominent place in a search engine – a bargain! Then, suddenly, it seemed as though the prominent place in the search engine was the better deal, as directories struggled to attract traffic. Finally, directories emerged that specifically targeted search engines, without seeking to be a useful resource at all.
The only people submitting to these directories were the SEO-savvy. They were no longer centres for information – they were split roughly equally between advertising platforms and, on the lower end, ‘link farms’, where you submitted a line or two of text in exchange for an unwarranted higher position in the SERPs. It was a horrendous waste of time, but if you didn’t want your competitors to supplant you in the increasingly important search engines, you had to play along.
This was the problem with a naïve PageRank-based search algorithm – as more and more people joined the web, and more and more commercial interests became involved, it became possible for anyone with sufficient resources and motive to game the links.
Directories (and similar article farms with names like 1toparticle4u.co.tk and gettopsearchhoorayarticle.com) were only the most blatant examples of manipulation of search metrics. Google needed to make its core product more sophisticated, and quickly.
Quality Signals And You
PageRank itself is a form of quality signal – a way of measuring the utility of the stuff on the page by how good it is, not ‘just’ by its relevance to the query. This is really, really difficult to do, not least because as soon as a quality signal is discovered, it can be manipulated. It’s hard enough to measure subjective qualities with numbers and algorithms; when people know exactly what you’re measuring, it becomes close to impossible.
Even in some of its earliest forms, Google’s search engine could not be reduced simply to PageRank. A number of other quality signals were taken into account from the very start, including quality signals that even many SEOs think are recent inventions. From the famous paper, “Simple experiments indicate PageRank can be personalized by increasing the weight of a user’s home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself.”
Anything about your site that Google can see can be used to evaluate your website. Therefore we can assume, even without reading Google’s public announcements, that anything that can be interpreted as a quality signal can be used to rank your site.
What are some obvious quality signals that are used by Google? Well, typos, for one. Spelling and grammatical errors can see your site marked as significantly lower in quality, and rank poorly as a result. Broken pages are another – not just broken in terms of functionality (4xx or 5xx errors, blank pages or pages with garbled text or error messages), but also in terms of strange or illogical HTML structure, or being incredibly slow to load. Even the structure of the page layout (how it actually looks to the user) can affect ranking!
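One of those signals – strange or illogical HTML structure – is easy enough to check for yourself. The sketch below is an illustrative heuristic of my own, not anything Google has published: it simply flags mismatched or unclosed tags using Python’s standard-library parser.

```python
# A rough check for one quality signal: obviously malformed HTML structure.
# (An illustrative heuristic, not anything Google has published.)
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never closed

class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors += 1  # closing tag doesn't match the last opened tag

def html_looks_broken(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    # Unclosed tags left on the stack also count as structural errors.
    return checker.errors + len(checker.open_tags) > 0

print(html_looks_broken("<div><p>fine</p></div>"))  # False
print(html_looks_broken("<div><p>oops</div>"))      # True
```

Real browsers (and presumably Google) are far more forgiving than this, but the point stands: structural health is machine-checkable, so it’s a cheap signal to collect.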
Then again, there are subtler quality signals. The age of a page can be a sign of a high quality academic site that’s useful to everyone and never needs to be changed or updated, but it can also mark an old, abandoned eCommerce site that’s barely kept ticking over. Google’s attempts to measure the importance of age have included measures of both freshness and maturity – so context is clearly important to how Google interprets some quality signals; it’s not just a case of making a number go up.
Finally, there are some fairly contentious quality signals. One such quality signal that Google is currently pushing is factual accuracy. Seems fairly non-contentious, for sure, but the fear is that a site’s quality and accuracy could end up being judged entirely against the company’s Knowledge Graph, which is infamous for being easy for pranksters to manipulate. The Knowledge Graph is primarily based on Wikidata (previously, Freebase) and Wikipedia, and as such is roughly as reliable as its sources. What happens, though, when a fringe or unpopular view gets targeted?
I might not believe in crystal therapy, but should a talented mathematician’s site be punished in the SERPs because they happen to have some cranky views on science? How about conspiracy theories? If David Icke breaks a true story about government corruption, and it happens to be on a site full of false facts about lizard people, I don’t want that buried in my search results! The relevance of the true or false facts to the ranking content needs to be carefully considered before rolling out such an update – fortunately, the signs so far are that Google is capable of doing this with a fair amount of sensitivity.
There is a good reason for including subtle and downright contentious quality signals alongside the more straightforward indicators: the most straightforward indicators are exactly the ones that can be manipulated by people seeking to deceive Google – giving off signals of good quality without actually putting the work in to improve their websites and their content.
So, Google’s algorithm is based on pretty much everything they know about your site – potentially including whois details and that blog post about becoming enlightened in India you posted twenty years ago.
It’s constantly tested and revised, but there are a couple of constant, over-arching factors you can aim for.
One is clear, on-site marks of quality. This includes features like a good vocabulary, well-structured markup, accurate facts, original research and adherence to web development and design best practices.
Another is clear, on-site marks of relevance. This includes featuring the exact keyphrase on the landing page, but also featuring a relevant and appropriate lexicon across the entire site that reinforces, subtly and unsubtly, the holistic meaning of the site, and the meaning of the page in context. The perceived meaning of your site also involves using properly meaningful markup (‘semantic HTML’, JSON-LD and microdata), and using appropriate internal linking.
Beyond the work you can do yourself on your own site, the impact you can have personally becomes more limited, and results become less predictable. This is off-site optimisation, and it is the source of many SEO headaches.
Google is constantly revising its estimation of off-site indicators of site quality and relevance. This isn’t just a case of reducing their relative importance, either. It has been common in SEO’s history for ranking factors to go from unquestionable signs of quality to damning marks of perfidy and guile. Entire websites can drop to the second page, or fall out of the SERPs completely. The impact of such dramatic drops in ranking is extremely painful for businesses that have invested in the losing SEO strategies. So why does Google do it?
Advanced Search Engine Manipulation
Almost all off-site signals of quality are, to a greater or lesser extent, capable of being manipulated in a way that Google did not intend. This sort of manipulation forms a spectrum of SEO hats, from sparkling white (just drawing attention to your good work, albeit in a way that adds no real value) to deepest black (hacking and hijacking unprotected websites to spam links back to your own site). We just call them white hat and black hat because hackers call themselves white or black hat, again depending on their sense of professional ethics, and many early SEO professionals came from a programming background.
So manipulating the signals is possible, if unethical, and this is why Google occasionally changes particular positive signals to negative ones. Directory links were positive, as we mentioned earlier, but then they were abused – suddenly, having a significant number of links from directories meant you were suspected of abusing the system. Article links were also positive, but they too were abused. Soon after, guest blogging and infographic links went the same way.
Many of Google’s decisions about what to rank highly are based on understanding that some amount of manipulation exists, and choosing quality signals that are harder to manipulate. For example, links from government websites and universities (which are difficult to get by bribing or befriending the site owner) are preferred over links from .tk websites (it’s free to register a .tk domain name, making it easy to spam with them). In general, the harder a particular quality signal is to fake, the more likely it is to assist your site.
Google is by no means infallible – this fantastic post by Bartosz Góralewicz showcases an instance where Google was deceived by fake signals pretending that a particular site was low quality. However, you can expect Google to take steps to mitigate or reduce instances of manipulation wherever it finds them. Advantages gained by manipulating or faking quality signals cannot be expected to last indefinitely.
For some time, to catch general cases of manipulation rather than having to respond to specific instances, Google has been looking at the profiles of websites, rather than looking at statistics in isolation. That means that if you get a lot of links (for example), but they all arrive at once, or they’re overwhelmingly from a particular quality of website, or they just look unusual for your industry, they’ll count for less – and may count against you.
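One way to picture “profiles rather than isolated statistics” is a burst check: links that all arrive in a short window look less natural than links spread over time. The function, window and threshold below are entirely made-up illustrations of the idea, not anything Google has disclosed:

```python
# An illustrative (entirely made-up) profile check: links that all arrive
# in a short burst look less natural than links spread over time.
from datetime import date

def links_look_bursty(link_dates, window_days=7, threshold=0.8):
    """Flag a link profile where most links arrived within one short window."""
    if not link_dates:
        return False
    sorted_dates = sorted(link_dates)
    for i, start in enumerate(sorted_dates):
        # Count the links landing within `window_days` of this one.
        in_window = sum(
            1 for d in sorted_dates[i:] if (d - start).days <= window_days
        )
        if in_window / len(sorted_dates) >= threshold:
            return True
    return False

organic = [date(2015, m, 1) for m in range(1, 11)]  # spread over ten months
spammy = [date(2015, 6, d) for d in range(1, 11)]   # all within ten days

print(links_look_bursty(organic))  # False
print(links_look_bursty(spammy))   # True
```

A real system would weigh many such profile features together; the point is simply that the shape of the data over time, not the raw count, is what gets judged.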
In other words, to generate organic traffic, build your website organically. Don’t rush, don’t spam, and instead commit to contributing something of real value.
Should I Be Worried About Negative SEO?
Probably not. Negative SEO is possible, as Góralewicz showed, but it is uncommon, and you can mitigate most attacks.
Most anecdotes about negative SEO attacks surface just after algorithm updates. This suggests that most (but not all!) ‘negative SEO’ is actually just sites being badly affected by Google updating its algorithm.
From our own experience, we come across hundreds – thousands – of high-ranking sites that are horribly vulnerable to negative SEO attacks. It would just take one dedicated attack from a determined malefactor to temporarily annihilate their rankings. In the case of niche sites or sites that otherwise have low traffic despite high rankings, it could be permanent. The fact that these sites invariably persist, thrive and prosper despite having structural vulnerabilities to negative SEO that even a naïve white hat SEO like myself can spot suggests that negative SEO is very rare indeed.
For the effort and cost involved, a particularly skilled black hat SEO might as well just compromise your site directly. If it’s weak to negative SEO attacks, it probably has security issues as well, and almost all negative SEO attacks are just as illegal – in for a penny, in for a pound. Alternatively, they could use their time and effort to boost their own site, which is much easier than trying to hurt every single one of their competitors. Negative SEO does occur, but it’s very unlikely that it will happen until your business is large and important enough that it can muster some effective counter-measures.
Man Vs Machine
Machine learning is a field of computer science that involves teaching computers to ‘reason’ about data themselves, to a very limited extent. It is closely linked to Artificial Intelligence, and can enable computers to recognise abstract concepts such as birds or musical style.
Machine learning has made startling strides in recent years due to successes in the field of neural networks. Having a big, impressive machine learning blog post or landmark project has also become a mark of belonging to the tech big leagues, making at least paying lip service ubiquitous in certain circles. For these reasons, it is frequently speculated that machine learning feeds into Google’s algorithm in some way.
While modern machine learning is impressive, it’s important not to overstate its capabilities. For instance, while this neural net has been taught to play Breakout exceptionally well, it did not outperform humans when playing some other computer games.
From my perspective, whether or not Google actually uses advanced machine learning in its web search is beside the point. For quite a while, Google has been known to use mechanical-turk-style human ‘quality raters’ to evaluate the quality of its search results, and to refine its search accordingly. So it doesn’t matter, in some sense, whether it’s man or machine judging how well Google is working; what matters is that there is constant, intelligent feedback being funnelled back to Google about how its algorithm works in practice.
On a long enough timescale, the best quality and most appropriate web pages will float to the top.
Single Shot Search Results
Increasingly, Google is becoming personalised. It personalises its ad content to a fantastic degree, and we have no reason to believe that it doesn’t do the same for search. Some personal anecdotes illustrating what Google reads:
By using the documentation for the Racket programming language regularly, I slowly trained Google to stop returning results about tennis rackets when I searched for, say, “Racket string manipulation”.
When someone started calling everybody, including me, “clownshoes” over social media because they found the phrase funny, I got increasing numbers of Google adverts for Oddball’s extra large men’s shoes.
When I received a lot of spam emails from a particular company, they began showing up in my search. When I blocked them, they stopped.
This is something you simply cannot completely control, and it is not typically worth worrying about. You can encourage useful associations by interacting with fans and fostering community around your website and product, but beyond that your product, website, content and customer service will speak for themselves.
Recurring problems with Google do surface from time to time. It seems to put too much trust in its own properties sometimes (e.g. Freebase, and its own social media sites), and black hat SEO techniques have repeatedly succeeded in propelling bizarre, awful, unusable sites to the front page of Google.
It’s worth keeping this in mind. Black hat SEO techniques do work, and so-simple-they-couldn’t-work lies to Google do work (as long as you use their own platforms to lie to them). The difference is that white hat SEO techniques work long-term, add value to your users and customers, and integrate with and strengthen your entire marketing campaign.
For the price of a black hat SEO campaign that could tank your business next year, you could (and should) just opt for PPC, which is less exciting but usually much more profitable.
Taking A Balanced Approach
Google’s search product is huge, complex, and frequently changing. It’s a mish-mash of clever algorithms and guesstimations that works pretty well, taken as a whole. This is something that directly affects SEO, which is becoming an increasingly holistic task. Your website needs to be healthy from host to HTML, in addition to the effective content marketing that is the backbone of modern SEO.
This is why SEOs are often hand-wavey and mysterious about the algorithm. The algorithm is actually known and, except for its innermost workings, well understood. However, an understanding of those innermost workings will only rarely have an impact on your site’s performance in the SERPs. It is better to gain insight into the abstract qualities Google is optimising its SERPs towards, and to optimise your own site towards the same things, than to chase any particular metric. So we don’t usually need to point to particular patents suggesting that Google is putting added weight on location for search, because Google will outright tell us that they’re trying to deliver the best local results.
The big takeaway for the reader who’s never thought about Google’s algorithm before is that no single activity, taken in isolation, will drastically improve your site’s ranking. There are, by now, far in excess of two hundred separate ranking factors taken into account. You need to assume that the effects of massively over-optimising one element of your site, whether that’s keyphrases or links, will be dampened, negated or (potentially) even reversed by other aspects of the algorithm. To rise through the SERPs, everything about your site needs to be optimised, including conforming to best practices and dry standards as well as looking pretty to smartphone users.
It’s tough, but it’s good that optimising organic search is starting to happen, well, organically.