It’s Penguin-Hunting Season: How to Be the Predator and Not the Prey

Posted by russvirante

Penguin changed everything. Most search engine optimizers like myself, especially those who operate in the gray areas of optimization, had long grown comfortable using “ratios” and “percentages” as simple litmus tests to protect ourselves against the wrath of Google. I can’t tell you how many times I joined a discussion about, or was simply asked, what our current “anchor text ratio” was. Many of you probably remember having the same types of discussions back in the keyword-stuffing days.

We now know unequivocally that Google has used and continues to use statistical tools far more advanced than simply looking at where an individual ranking factor sits on a dial. (We certainly have more than enough Remove’em users to prove that.) My understanding of Penguin and its content-focused predecessor Panda is that Google now employs machine-learning techniques across large data sets to uncover patterns of over-optimization that aren’t easily discerned by the human eye or the crude algorithms of the past. It is with this understanding that I and my company, Virante, Inc., undertook the Open Penguin Data project, and ultimately formed our Penguin Vulnerability Score.

The Open Penguin Data Project

Matt Cutts occasionally gives us a heads-up about future updates, and in the spring of 2013 we were informed that within a few weeks Penguin 2.0 would roll out. I remember exactly when the idea hit me. I was reading “How is Big Data Different from Previous Data” by Bryan Eisenberg, and it occurred to me that the kind of stuff we were doing at Remove’em to detect bad links just didn’t pass muster next to the sophistication of the “big data” analysis Google was using at the time. So Virante went to work. We started monitoring a huge number of keywords, so that when Penguin 2.0 hit we could catch winners and losers. In the end, we used data from three different awesome providers: Authority Labs (for the initial data set), Stat Search Analytics (for cross-validation) and SerpMetrics (for determining that we weren’t just picking up manual penalties). We identified around 600 losing URL/keyword pairs and matched them with their competitors who did not lose rankings.

We then opened the data up to the community at the Open Penguin Data project website and asked members of the community to contribute their ideas for factors that might influence the Penguin algorithm. You can go there right now and download the latest data set, although at present I know there is a bug in the mozRank and mozTrust columns that needs to be fixed. We have identified over 70 factors that may influence Penguin and are still building upon them, with the latest variable update being October 14th. Unfortunately, only certain variables can be added now as fresh data won’t be relevant. The data behind the factors came from a large number of sources beginning with Moz of course, and including Majestic SEO, Ahrefs, Grep/Words, and

We then began to analyze the data in a number of ways. The first was through standard correlation coefficients to help determine direction of influence (assuming there was any influence at all). It is important that I deal with the issue of correlation vs. causation here, because I am sure one of you will bring it up.

Correlation vs. causation

The purpose of the Open Penguin Data Project was not and is not to determine which factors cause a Penguin penalty. Rather, we want to determine which factors predict a Penguin penalty so that we can build a reasonable model of vulnerability. Once we know a website’s vulnerability to Penguin, we can start applying different techniques to lower that vulnerability that fall closer to the realm of causal factors.

For example, we will talk about the difference between mozTrust and mozRank as being a fairly good predictor of Penguin. No one in their right mind believes that Google consumes Moz’s data to determine whom to penalize. However, once we know that a site is likely to be penalized (because we know the mozTrust and mozRank differential), we can start to apply tactics that will likely counter Penguin, such as using the disavow tool or removing spammy links. We aren’t talking about causation; we are talking about prediction.

The analysis of the risk factors

We then began analyzing the data using a couple of methods. First, we used standard mean Spearman correlations to give us an idea of the lay of the land. This allowed us to also build a crude regression model that actually works quite well without much tweaking. This model essentially comes from adding up the correlation coefficients for each of the factors. Obviously, more sophisticated modeling is better than this, but to build a crude overview, this works quite nicely and can be done on the fly. The real magic happens, though, when we apply the same sorts of machine-learning techniques to the data set that Google uses in building models like Penguin.
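A minimal sketch of that crude additive model, under stated assumptions: each factor is stored as a 0/1 flag, a single Spearman coefficient is computed per factor over the whole training set (the project's "mean" Spearman is not reproduced here), and the function names, factor names, and toy rows are all made up for illustration.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either side is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def crude_weights(rows, labels):
    """One coefficient per factor: its correlation with the 'hit by Penguin' label."""
    return {f: spearman([r[f] for r in rows], labels) for f in rows[0]}

def crude_score(row, weights):
    """Add up the coefficients of every factor the site trips."""
    return sum(weights[f] * row[f] for f in weights)

# Toy training set (invented): 1 = lost rankings, 0 = did not.
rows = [
    {"exact_match_anchor": 1, "has_gov_link": 0},
    {"exact_match_anchor": 1, "has_gov_link": 0},
    {"exact_match_anchor": 0, "has_gov_link": 1},
    {"exact_match_anchor": 0, "has_gov_link": 1},
    {"exact_match_anchor": 1, "has_gov_link": 1},
    {"exact_match_anchor": 0, "has_gov_link": 0},
]
labels = [1, 1, 0, 0, 0, 1]
weights = crude_weights(rows, labels)
```

A site's crude score is then `crude_score(site_flags, weights)`; higher means more of the positively correlated factors are tripped.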

Let me be clear, I do not presume to know what statistical techniques Google used to build their model. However, there are certain types of techniques that are regularly used to answer these types of multivariate classification problems and I chose to use them. In particular, I chose to use a gradient boosting algorithm. You can read up on the methodology or the specific implementation we used via scikit-learn, but I’ll save you the headache and tell you what you need to know.
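To make the idea concrete, here is a toy gradient boosting classifier written from scratch. It is not the scikit-learn implementation used in the project (scikit-learn provides one as `GradientBoostingClassifier`), and it is certainly not Google's model. It is only a sketch that assumes binary labels, one-split "stumps" as the weak learners, and invented toy data in which no single flag decides the outcome but two out of three do.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Pick the single feature/threshold split that best fits the residuals."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_value = sum(left) / len(left)
            right_value = sum(right) / len(right)
            error = (sum((r - left_value) ** 2 for r in left)
                     + sum((r - right_value) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    return None if best is None else best[1:]

def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the log-loss gradient (label minus predicted probability).

    Both classes must be present in y, otherwise the base log-odds is undefined.
    """
    p = sum(y) / len(y)
    base = math.log(p / (1 - p))
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        f, t, lv, rv = stump
        scores = [s + learning_rate * (lv if row[f] <= t else rv)
                  for row, s in zip(X, scores)]
    return base, stumps, learning_rate

def predict_proba(model, row):
    """Predicted probability that a site with these factor values is hit."""
    base, stumps, learning_rate = model
    score = base + sum(learning_rate * (lv if row[f] <= t else rv)
                       for f, t, lv, rv in stumps)
    return sigmoid(score)

# Toy data (invented): a site is "hit" when at least two of three weak flags are set.
X_toy = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y_toy = [1 if a + b + c >= 2 else 0 for a, b, c in X_toy]
model = fit_gradient_boosting(X_toy, y_toy)
```

The point of the toy data is that each flag alone is only a weak hint, yet the boosted combination separates hit from non-hit sites.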

Most of us think about statistical analysis as putting some variables in Excel and making a nice graph with a linear regression that shows an upward or downward trend. You can see this below. Unfortunately, this grossly over-simplifies complex problems and often produces a crude result where everything above the line is considered different from everything below the line, when clearly it is not. As you see in the example graph below, there are plenty of penalized sites that get missed by falling below the line and completely decent sites that are above the line that get hit.

Classification systems work differently. We aren’t necessarily concerned with higher or lower numbers, we are concerned with patterns that might predict something. In this case, we know sites that were hit by Penguin, so now we use a whole bunch of factors and see how the patterns between them might accurately predict them. We don’t need to draw an arbitrary line, we can individually analyze the points using machine learning, as you see in the example graph below.

The hard part is that machine learning tells us a lot about prediction, but not a lot about how we came to that prediction. That is where some extra work comes into play. With the Open Penguin Data project, we grouped some of the factors by common characteristics and measured the effectiveness of their predictions in isolation from the other factors. For example, we grouped trust metrics together and anchor text metrics together. We then grouped them in combinations as well. This then gave us a model we could use to determine not only increased Penguin vulnerability, but also what factors contributed to that vulnerability and to what degree.
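One simple way to picture the grouping step is sketched below. It scores sites using only the factors in one group and checks how well that group predicts on its own. The group names, factor names, weights, and toy rows are hypothetical, and a plain weighted sum stands in for the real model.

```python
# Hypothetical per-factor weights and groupings, for illustration only.
WEIGHTS = {
    "exact_match_anchor": 0.1,
    "phrase_match_anchor": 0.1,
    "rank_exceeds_trust": 0.2,
    "has_gov_link": -0.2,
}
GROUPS = {
    "anchor_text": ["exact_match_anchor", "phrase_match_anchor"],
    "trust": ["rank_exceeds_trust", "has_gov_link"],
}

def group_score(row, weights, factors):
    """Score a site using only the factors that belong to one group."""
    return sum(weights[f] * row[f] for f in factors)

def isolated_accuracy(rows, labels, weights, factors, threshold=0.0):
    """Share of sites classified correctly when only one group is consulted."""
    predictions = [1 if group_score(r, weights, factors) > threshold else 0
                   for r in rows]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def contributions(row, weights, groups):
    """How much each group adds to one site's overall score."""
    return {name: group_score(row, weights, factors)
            for name, factors in groups.items()}

# Toy sites (invented): 1 = hit, 0 = not hit.
sites = [
    {"exact_match_anchor": 1, "phrase_match_anchor": 1, "rank_exceeds_trust": 1, "has_gov_link": 0},
    {"exact_match_anchor": 1, "phrase_match_anchor": 0, "rank_exceeds_trust": 0, "has_gov_link": 1},
    {"exact_match_anchor": 0, "phrase_match_anchor": 0, "rank_exceeds_trust": 1, "has_gov_link": 0},
    {"exact_match_anchor": 0, "phrase_match_anchor": 0, "rank_exceeds_trust": 0, "has_gov_link": 1},
]
hit = [1, 0, 1, 0]

anchor_accuracy = isolated_accuracy(sites, hit, WEIGHTS, GROUPS["anchor_text"])
trust_accuracy = isolated_accuracy(sites, hit, WEIGHTS, GROUPS["trust"])
first_site = contributions(sites[0], WEIGHTS, GROUPS)
```

Comparing the per-group accuracies shows which group carries the prediction, and `contributions` shows where one particular site's score comes from.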

So, let’s talk through some of them here.

Anchor text

By now, everyone and their paid search guy knows that manipulated commercial anchor text is a risk factor for both algorithmic and manual penalties. So, of course, we looked at this closely from the start. We actually broke down the anchor text into three subcategories: exact-match anchor text (meaning the anchor text is exactly the keyword for which you would like to rank), phrase-match anchor text (meaning the keyword for which you would like to rank occurs somewhere within the anchor text) and commercial anchor text (the anchor text has a high CPC value).
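The three subcategories can be sketched as a small classifier. The function name, the word-level matching rule, the `cpc_by_phrase` lookup, and the 5.00 CPC cutoff are assumptions made for illustration; the article does not say how matching was done or what counted as a "high" CPC.

```python
def tokens(text):
    """Lowercase and split on whitespace so matching works on whole words."""
    return text.lower().split()

def contains_phrase(anchor_tokens, keyword_tokens):
    """True when the keyword appears as a contiguous run of words in the anchor."""
    n = len(keyword_tokens)
    return n > 0 and any(
        anchor_tokens[i:i + n] == keyword_tokens
        for i in range(len(anchor_tokens) - n + 1)
    )

def classify_anchor(anchor, keyword, cpc_by_phrase, high_cpc=5.0):
    """Return the set of categories an anchor falls into.

    cpc_by_phrase is a hypothetical lookup of normalized anchor text to
    cost-per-click; high_cpc is an arbitrary cutoff chosen for the example.
    """
    anchor_tokens, keyword_tokens = tokens(anchor), tokens(keyword)
    labels = set()
    if anchor_tokens == keyword_tokens and keyword_tokens:
        labels.add("exact")
    if contains_phrase(anchor_tokens, keyword_tokens):
        labels.add("phrase")
    if cpc_by_phrase.get(" ".join(anchor_tokens), 0.0) >= high_cpc:
        labels.add("commercial")
    return labels
```

Note that under this definition an exact-match anchor is also a phrase match, so the two groups of metrics overlap.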

Exact-match anchor text

We broke exact-match anchor text down into a couple of metrics:

  1. The most common anchor to the page is exact match
  2. The highest mozRank passed anchor to the page is exact match
  3. There is at least one exact match anchor to the page
  4. The most common anchor to the domain is exact match
  5. The highest mozRank passed anchor to the domain is exact match
  6. There is at least one exact match anchor to the domain
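The three page-level metrics could be computed roughly as follows; the domain-level versions would run the same function over every link to the domain. This is a sketch with assumptions: links arrive as hypothetical `(anchor_text, mozrank_passed)` pairs, and "highest mozRank passed anchor" is read as the anchor whose links pass the most mozRank in total, which is one of several possible readings.

```python
from collections import Counter, defaultdict

def normalize(text):
    """Lowercase and collapse whitespace before comparing anchors."""
    return " ".join(text.lower().split())

def exact_match_metrics(links, keyword):
    """links: iterable of (anchor_text, mozrank_passed) pairs pointing at one page."""
    keyword = normalize(keyword)
    counts = Counter()
    rank_passed = defaultdict(float)
    for anchor, mozrank_passed in links:
        anchor = normalize(anchor)
        counts[anchor] += 1
        rank_passed[anchor] += mozrank_passed
    if not counts:
        return {
            "most_common_is_exact": False,
            "highest_mozrank_is_exact": False,
            "has_exact_match": False,
        }
    most_common_anchor = counts.most_common(1)[0][0]
    strongest_anchor = max(rank_passed, key=rank_passed.get)
    return {
        "most_common_is_exact": most_common_anchor == keyword,
        "highest_mozrank_is_exact": strongest_anchor == keyword,
        "has_exact_match": keyword in counts,
    }

# Invented example: two keyword links, one strong branded link, one generic link.
example_links = [("SEO", 0.5), ("SEO", 0.4), ("Virante", 2.0), ("click here", 0.1)]
example_metrics = exact_match_metrics(example_links, "seo")
```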

Across the board, every single metric related to anchor text provided some positive predictive power except for highest mozRank passed anchor to the domain. Importantly, no single factor had a particularly strong mean Spearman correlation coefficient. For example, the highest was that the domain merely had a single link with the exact match anchor text (.11 correlation coefficient). This is a very weak signal, but our analysis looks to find patterns in these weak signals, so we are not necessarily hindered because each measurement is not sufficiently predictive.

For the biggest victims of Penguin, we often see that exact-match anchor text is the second- or third-largest predictor. For example, the webmaster in the example below could lower their predicted vulnerability score by 50% simply by addressing exact-match anchor text links. This particular webmaster’s link profile tripped most of the anchor text signals we measure.

Now let me say it one more time: I am not saying that Google is using anchor text to determine who to penalize, rather that it is a strong predictor. Prediction is not causation. However, we can say that the groupings of exact-match anchor text metrics allow us to detect Penguin vulnerability quite well.

Phrase-match anchor text

We broke down phrase-match anchor text in the exact same fashion. This was one of the more surprising features we noticed. In many cases, phrase-match anchor text metrics appeared to be more predictive than exact-match anchor text. Many SEOs, myself included, have long depended on what we call “brand blend” to protect against over-optimization penalties. Instead of just building links for the keyword “SEO”, we might build links for “Virante SEO” or “SEO by Virante”. This may have insulated us against manual anchor text over-optimization penalties, but it does not appear to be the case with Penguin.

In the example I mentioned above, the webmaster hit nearly every exact-match anchor text metric. They also hit every phrase-match metric. The combination of these factors increased their predicted likelihood of being impacted by Penguin by a full 100%.

Shoving your high-value keywords inside other phrases doesn’t guarantee you any protection. Now, there are a lot of potential takeaways from this. It could be an artifact of merely doubling the exact match influence (i.e. if you score high on exact match, you will also score high on phrase match). We do see some of this occurring, but it doesn’t appear to explain all of the additional predictive power. It could be that they are targeting other related keywords and thereby increase their exposure to other parts of the Penguin algorithm. All we know, though, is that the predictive power of the model increases greatly when we take into account phrase-match anchor text. Nothing more, nothing less.

Commercial anchor text

This is my favorite measure of all, as it shows how Google can use one of its most powerful ancillary data sets, bid prices for keywords, to detect manipulation of the link graph. We built 4 metrics around commercial anchor text.

  1. The page has a high-value anchor in a single link
  2. The majority of the anchors are valuable
  3. The majority of links are very high-value anchors
  4. Has a high CPC site-wide

Having high-value anchors and having very high-value anchors were both strong predictors of Penguin vulnerability. In keeping with the example we have been using so far, you can see that removing commercial anchor text would have a profound impact on our prediction as to whether or not the site will be impacted by Penguin.

If you’ve been paying close attention, you may have noticed that a lot of these are related. Having exact-match and phrase-match anchor text likely means you have highly commercial anchors. All of these metrics are related to one another and it is their combined weak signals that make it easier to detect Penguin vulnerability.

Link sources

The next issue we tried to target was the quality of link sources. The most obvious step was trying to detect commonly spammed link sources: directories, forums, guestbooks, press releases, articles, and comments. Using a set of footprints to identify these types of links and spidering all of the backlinks of the training set, we were able to build a few metrics identifying sites that either simply had these types of links or had a preponderance of these types of links.
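Footprint matching of this kind can be sketched with a handful of regular expressions run over the text of each linking page. The patterns below are invented examples, not the footprints used in the project, and the 50% cutoff for a "preponderance" is an assumption.

```python
import re
from collections import Counter

# Hypothetical footprints; a real set would be far larger and better tested.
FOOTPRINTS = {
    "directory": re.compile(r"submit (a |your )?(site|link|url)|add (a |your )?(site|url)", re.I),
    "forum": re.compile(r"powered by (phpbb|vbulletin)|viewtopic\.php", re.I),
    "guestbook": re.compile(r"sign (our|the|my) guestbook", re.I),
    "press_release": re.compile(r"for immediate release|press release", re.I),
    "article": re.compile(r"article (directory|source)|submit (an |your )?articles?", re.I),
    "comment": re.compile(r"leave a (reply|comment)|post a comment", re.I),
}

def classify_source(page_text):
    """Return the sorted names of every footprint found in a linking page."""
    return sorted(name for name, pattern in FOOTPRINTS.items()
                  if pattern.search(page_text))

def source_metrics(pages, preponderance=0.5):
    """For each source type: does the site have any such link, and do they dominate?"""
    counts = Counter()
    for text in pages:
        counts.update(classify_source(text))
    total = len(pages)
    return {
        name: {
            "has_any": counts[name] > 0,
            "preponderance": total > 0 and counts[name] / total > preponderance,
        }
        for name in FOOTPRINTS
    }

# Invented linking pages for one site.
example_pages = ["Powered by vBulletin", "Powered by phpBB", "About us"]
example_sources = source_metrics(example_pages)
```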

First, it was interesting that every type of link was positively correlated with a Penguin penalty, but only very weakly. You can’t just look at a bunch of article directory submissions and assume that is the cause of a Penguin penalty. However, the combination (that is, a site that relies on four or five of these types of techniques for nearly all of its PageRank) would appear to carry a greater risk.

At this point, I want to stop and draw attention to something: Each of these groupings of factors appears to have some good predictive value, but none of them comes even close to explaining the whole vulnerability. Fixing your exact-match anchor text links, or phrase-match links, or commercial anchor links, or poor link sources by themselves will not insulate you from detection. It is the combination of these factors that appears to increase the vulnerability to Penguin. Most sites that we see hit by Penguin have vulnerability scores that are 250%+, although in Penguin 2.1 we saw them as low as 150%. To get to these levels you have to trip a wide variety of factors, but you don’t have to be egregiously violating any one single SEO tactic.


Site-wide links

This was one of the most disappointing features we used. I was certain, as were many, that site-wide links would be the nail in the coffin. Clearly site-wide links are the culprit behind the Penguin penalty, right? Well, the data just doesn’t bear that out.

Site-wides are just too common. The best sites on the web enjoy tons of site-wide links, often in the form of blogrolls. In fact, high site-wide rates correlate negatively with Penguin penalties. Certainly this doesn’t mean you should run out and try to get a bunch of site-wide links, but it does raise the question: Are site-wides really all that bad?

Here is where we find the real difference: anchor text. Commercial anchor text site-wides positively correlate with Penguin penalties. While we cannot say they cause them, there is definitely a predictive leap between just any old site-wide link and a site-wide link with specific, commercially valuable anchor text.

This also helps illustrate another issue we SEOs often run into: anecdotal evidence. It is really easy to look at a link profile, see that site-wide, and immediately assume it is the culprit. It is then seemingly reinforced when we scratch the surface with too simple an analysis like looking at the preponderance of that feature among sites that are penalized. It can and does often lead us down the wrong path.

Trust, trust, trust

Of all the eye-opening, mind-blowing discoveries revealed by the Open Penguin Data project, this one was the biggest. At minimum, we all need to tip our hats to the folks at Moz and Majestic for providing us with great link statistics. Two of the strongest metrics we found in helping predict Penguin vulnerability were MozRank greater than MozTrust (Moz) and Domain Citation Flow over Domain Trust Flow (Majestic).

Both Moz and Majestic give us statistics that mimic to a certain degree the raw flow of PageRank (MozRank and Citation Flow) and an alternative often referred to as TrustRank (MozTrust and Trust Flow). They are essentially the same thing, except that trust metrics start with a trusted set of URLs like .govs and .edus and give extra value to sites that get links from these trusted sources. These metrics by themselves, while useful in other endeavors, don’t really give us much information about Penguin.

However, if we flag URLs and domains where the trust metrics are lower than the raw link metrics, we score some of the highest correlations of all factors tested. Even cruder metrics like whether or not the domain has a single .gov link help predict Penguin vulnerability. While it would be insane to conclude that Google has a subscription to Moz and Majestic and uses them to build its Penguin algorithm, this appears to be true: In the aggregate, cheap, low-quality links are a Penguin risk factor.
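The flags themselves are simple comparisons. The sketch below assumes the four metric values and the .gov check have already been fetched for a URL or domain; the function and key names are made up, and no Moz or Majestic API calls are shown.

```python
def trust_flags(moz_rank, moz_trust, citation_flow, trust_flow, has_gov_link):
    """Flag profiles whose raw link metrics outrun their trust metrics."""
    return {
        "mozrank_exceeds_moztrust": moz_rank > moz_trust,
        "citation_flow_exceeds_trust_flow": citation_flow > trust_flow,
        "no_gov_link": not has_gov_link,
    }

# Invented values for two hypothetical domains.
risky = trust_flags(moz_rank=5.2, moz_trust=4.1, citation_flow=40,
                    trust_flow=12, has_gov_link=False)
healthy = trust_flags(moz_rank=4.8, moz_trust=5.5, citation_flow=30,
                      trust_flow=35, has_gov_link=True)
```

Each flag is only a weak signal on its own; in the model it is one more 0/1 factor alongside the anchor text and link source metrics.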

What we should learn

There are some really amazing takeaways that we can build from this kind of analysis. For those of you who are not yet seasoned professionals, they should change your understanding of Penguin and Google’s algorithm. So let’s dive in…

Penguin isn’t spam detection, it’s you detection

Try this fact on for size. If you hit every anchor text trigger in the Open Penguin data set, our predictive model actually DROPS in effectiveness. At first glance this seems counter-intuitive. Certainly Google should catch these extreme spammers. The reality is, though, that cruder algorithms generally clear out this type of search spam. If you have done any traditional off-site SEO in the last three years, it has probably created additional Penguin vulnerability. The Penguin update is targeted at catching patterns of optimization that aren’t so easily detected. The most egregious offenders are more likely to be caught by other algorithms than by Penguin. So when the next Penguin update comes out and you hear people complain about how some spam site wasn’t affected, you can be confident that this isn’t a flaw in Penguin, but rather a deliberate choice on Google’s part to create separate algorithms to target different types of over-optimization.

The rise of the link assassin

It was Ian Curl, a former Virante employee and now head of Link Assassins, who first pointed out to me the clear future of SEO: pruning the link graph. Google has essentially given us the tools via Google Webmaster Tools (GWT) to both view our links and disavow them. A new class of link removal and disavow professionals has grown over the last year: SEOs who can spot a toxic link and guide you through the process of not just cleaning up a penalty but proactively managing your link profile to avoid penalties in the first place. These “link assassins” will play a vital role in the future of SEO in just the same way that one would expect a professional gardener to prune back excessive growth.

The demise of cheap, scalable white-hat link building

Let me be clear: If it works, Google wants to stop it. We have already heard Matt Cutts fire shots across the bow of lily-white link building techniques like guest posting. Right now, the only hold-out I see left is broken link building, which is only scalable under certain circumstances. Google is doing its best to identify the exact same footprints you use to link-build and to add them to its own link pattern detection. It isn’t an easy task, which is why Penguin only rolls out every few months, but it appears to be one to which Google is committed.

The growth of integrated SEO

There is no way around it. If you are interested in long-term, effective, white-hat SEO, you are going to have to build integrated campaigns largely focused around content marketing that include multiple forms of advertising. There is a great write-up on this by Chris Boggs over at Internet Marketing Ninjas on Integrating Content Marketing into Traditional Advertising Campaigns. As Google continues to get better at detecting unnatural patterns, it will be harder and harder to get away with simply turning one dial at a time.

Next steps

The average webmaster or SEO needs to really step back and take an honest accounting of their current SEO footprint. I don’t mean to be fear-mongering; only a fraction of a percent of all websites will ever get hit by Penguin. 75% of adult males who smoke a pack a day will never get lung cancer, but that doesn’t mean you should keep on smoking because the odds are in your favor. Likewise, while the odds are greatly in your favor that Penguin will never strike your site, there is no reason not to take simple precautions to determine whether your tactics are putting your site at risk.
