Duplicate content due to scrapers
Sunday, July 6, 2008
Posted by Sven Naumann, Search Quality Team
Original post: Duplicate content due to scrapers, published Monday, June 9, 2008, 3:40 AM
Since duplicate content is a hot topic among webmasters, we thought it might be a good time to address common questions we get asked regularly at conferences and in the [Google Webmaster Help Group](https://support.google.com/webmasters/go/community).
Before diving in, I'd like to briefly touch on a concern webmasters often voice: in most cases, a webmaster has no influence on third parties that scrape and redistribute content without consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites is not, in itself, a violation of our [webmaster guidelines](/search/docs/essentials). It simply means Google has to take an additional step to determine the original source of the content. This is something Google is quite good at; in most cases the original source is correctly identified, with no negative effects for the site that published the original content.
Generally, we differentiate between two major scenarios for duplicate content issues:

- Within-your-domain duplicate content: identical content that (often unintentionally) appears in more than one place on your own site.
- Cross-domain duplicate content: identical content from your site that (again, often unintentionally) appears on other sites.
For the first scenario, you can take matters into your own hands to keep Google from indexing duplicate content on your site. Check out Adam Lasnik's post [Deftly dealing with duplicate content](/search/blog/2006/12/deftly-dealing-with-duplicate-content) and Vanessa Fox's [Duplicate content summit at SMX Advanced](/search/blog/2007/06/duplicate-content-summit-at-smx), both of which offer good tips on resolving duplicate content issues within your site. Here's one additional tip to help avoid content on your site being indexed as duplicate: include the preferred version of your URLs in your Sitemap file. When we encounter different pages with the same content, this helps us serve the version you actually want users to see. You can find more information on within-site duplicate content in the [Help Center article](/search/docs/advanced/guidelines/duplicate-content) discussing this topic.
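To make the Sitemap tip concrete, here's a minimal sketch in Python, assuming a hypothetical example.com site, that generates a Sitemap listing only the preferred version of each URL:

```python
# Minimal sketch: build a Sitemap that lists only the preferred
# version of each URL, so crawlers can see which duplicate you
# want indexed. The domain and paths below are hypothetical.
import xml.etree.ElementTree as ET

# Preferred URLs only: for example, list "/widgets" rather than
# the session-ID or print-view variants serving the same content.
preferred_urls = [
    "https://www.example.com/",
    "https://www.example.com/widgets",
    "https://www.example.com/widgets/blue",
]

# urlset root with the standard sitemaps.org namespace
urlset = ET.Element(
    "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
)
for url in preferred_urls:
    entry = ET.SubElement(urlset, "url")
    ET.SubElement(entry, "loc").text = url

ET.ElementTree(urlset).write(
    "sitemap.xml", encoding="utf-8", xml_declaration=True
)
```

The design point is simply that the file enumerates one preferred URL per piece of content, rather than every variant that happens to serve the same page.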
In the second scenario, someone may have scraped content from your site and published it elsewhere, often to monetize it. It's also common for web proxies to index parts of sites that have been accessed through the proxy. When we encounter the same content on different sites, we look at a variety of signals to determine which site is the original, and this usually works very well. This means that if you notice someone scraping your content, you generally shouldn't worry about negative effects on your site's presence in Google search.
If you syndicate your content but want to make sure your site is still identified as the original source, ask your syndication partners to include a link back to your original content. You can find additional tips on handling syndicated content in Vanessa Fox's recent post, [Ranking as the original source for content you syndicate](https://www.vanessafoxnude.com/2008/05/14/ranking-as-the-original-source-for-content-you-syndicate/).
Some webmasters have asked what could cause scraped content to rank higher than the original. That should be a rare case, but if you do find yourself in this situation, be sure to:
- Check whether your content is still accessible to our crawlers; you may have unintentionally blocked access to parts of it in your robots.txt file (a quick way to verify this is sketched after this list).
- Look at your Sitemap file to see whether you changed anything for the particular content that has been scraped.
- Check whether your site complies with the webmaster guidelines.
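For the first check in the list above, here's a small sketch using Python's standard-library robots.txt parser; the site, path, and user agent are placeholders for your own:

```python
# Verify that a given page is not accidentally blocked for a
# crawler by the site's robots.txt. URLs here are hypothetical.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

page = "https://www.example.com/widgets/blue"
if parser.can_fetch("Googlebot", page):
    print(f"OK: Googlebot may crawl {page}")
else:
    print(f"Blocked: {page} is disallowed by robots.txt")
```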
Finally, I'd like to point out that in the vast majority of cases, duplicate content does not negatively affect your site's presence in Google search; it simply gets filtered out. If you follow the tips above, you'll also have more precise control over what we crawl and which versions appear in the index. Only when there are signals pointing to deliberate, malicious intent might duplicate content be considered a violation of the webmaster guidelines.
If you'd like to discuss this topic further, please visit our [Webmaster Help Group](https://support.google.com/webmasters/go/community).
A German version of this post is also available: Duplicate Content aufgrund von Scraper-Sites.