MediaWiki talk:Robots.txt

Sitemap

Can we please remove

# Following is temporary/experimental BMcN 2013-06-18.
#
Sitemap: https://en.wikinews.org/wiki/Special:NewsFeed
Sitemap: http://en.wikinews.org/wiki/Special:NewsFeed
Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed
Sitemap: http://en.wikinews.org/w/index.php?title=Special:NewsFeed
Sitemap: /wiki/Special:NewsFeed
Sitemap: /wiki/Special%3ANewsFeed
Sitemap: /w/index.php?title=Special:NewsFeed
#
# Fair bit of 'belt and braces' up there...
# 


and replace it all with

Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=sitemap&categories=Published&notcategories=No%20publish%7CArchived%7CAutoArchived%7Cdisputed&namespace=0&count=30&hourcount=120&ordermethod=categoryadd&stablepages=only

What this would change: currently the defaults for GNSM are in Category:Published and the latest sighted version. This would also require the article not be in categories No publish, Archived, AutoArchived, or disputed. It would also only publish articles up to 5 days old, and sort them by when they were added to the Published category.
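
As a rough illustration of what that query string asks for, here is a minimal Python sketch that decodes the proposed URL and restates its filters. The function name `describe` and the result keys are hypothetical, and reading `count` as a cap on the number of entries is an assumption; only the category filters and the 5-day window come from the comment above.

```python
from urllib.parse import urlsplit, parse_qs

# The proposed Sitemap URL, copied from the suggestion above.
PROPOSED = (
    "https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=sitemap"
    "&categories=Published"
    "&notcategories=No%20publish%7CArchived%7CAutoArchived%7Cdisputed"
    "&namespace=0&count=30&hourcount=120&ordermethod=categoryadd&stablepages=only"
)


def describe(url):
    """Decode the query string and summarise the filters it requests."""
    # parse_qs percent-decodes values (%20 -> space, %7C -> '|').
    params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
    return {
        "required": params["categories"].split("|"),
        "excluded": params["notcategories"].split("|"),
        "max_items": int(params["count"]),  # assumed meaning of count
        "max_age_days": int(params["hourcount"]) / 24,  # 120 hours = 5 days
        "order": params["ordermethod"],
    }


if __name__ == "__main__":
    for key, value in describe(PROPOSED).items():
        print(key, value)
```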

(You could also follow the belt 'n braces approach used originally, and just make sure the query string is [properly] added to each of those urls.) - Amgine | t 23:44, 28 January 2014 (UTC)

Done --Pi zero (talk) 00:18, 29 January 2014 (UTC)
  • Amgine, did you establish if multiple Sitemap: lines are valid?
As it stands, I'd amend the proposed URL as follows:
&feed=sitemap&categories=Published&notcategories=No%20publish|Archived|AutoArchived|disputed&namespace=0&count=1000&hourcount=120&ordermethod=categoryadd&stablepages=only
Assuming multiple Sitemap: lines are permitted, I'd suggest these as a couple of possible options:
  1. &feed=sitemap&categories=Published|Featured%20article&notcategories=No%20publish|disputed&namespace=0&count=500&ordermethod=categoryadd&stablepages=only (FAs)
  2. &feed=sitemap&categories=Published|Politics%20and%20conflicts&notcategories=No%20publish|disputed&namespace=0&count=100&ordermethod=categoryadd&stablepages=only (Full politics and conflicts category)
I've really not kept up on 'White-Hat' SEO; but, what we're looking at here is getting as much as possible well indexed by GNews. I'd be generally interested in someone fully spidering enWN and looking at how the published PageRank algorithm rates the site overall. Are we making best use of internal links, interwiki links and so on? Or, are there simple changes we can make to better present the site structure to robots? --Brian McNeil / talk 12:11, 29 January 2014 (UTC)
Those changes to the existing URL seem reasonable to me, and I've deployed them. I think someone was saying last night that Google hangs on to stuff after it drops off the list anyway, not that that's any reason to stint on the list. --Pi zero (talk) 12:45, 29 January 2014 (UTC)
In reverse order:
  • Yes, Google does hang on to things after they drop off the list, but not as *news*.
  • Mmm, auditing the en.WN SEO is more work than I can give to it. I will try to learn more about how to improve MW site ranking by legitimate means, but beyond keywording (which may be automatable on a per-page basis) and adding automated RSS feeds in various places (you can do this by, for example, adding various GNSM RSS/Atom feeds to places like feedster) I don't have good ideas for automated processes.
  • Multiple sitemap lines are valid. They can be used to aggregate more than one domain, btw. I agree the count should be 1000, but do we really want to be pushing stale news to Google? They do check the publish date and de-prioritize old news and, I expect, de-prioritize sites which try to game the system with old content, like most blogs. Featured articles is good, but the politics content will already be found in the first sitemap (at least, I expect 10% of en.WN content falls into Politics and Conflicts.)
- Amgine | t 16:13, 29 January 2014 (UTC)
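
For anyone wanting to check the multiple-lines point locally, here is a minimal sketch using Python's standard `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). It only shows that this particular parser collects every `Sitemap:` line; it says nothing about how Google's crawler behaves. The sample file contents and the helper name `list_sitemaps` are illustrative assumptions, with the two URLs taken from the original experimental block.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt fragment with two Sitemap lines.
ROBOTS_TXT = """\
# Sample fragment for testing only.
Sitemap: https://en.wikinews.org/wiki/Special:NewsFeed
Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed
"""


def list_sitemaps(text):
    """Return every Sitemap URL the stdlib parser finds in the given text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in list_sitemaps(ROBOTS_TXT):
        print(url)
```

Note that this parser percent-decodes the values it reads, so URLs containing %7C or %20 will come back decoded rather than byte-for-byte as written.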
Estimating by the page count in Category:Politics and conflicts, likely it's more like 30 or 35 percent. --Pi zero (talk) 16:21, 29 January 2014 (UTC)
Yah, which means the top 300-350 P&C articles will be found by that first sitemap. By the way, Pi zero, I think Brian meant to have 3 sitemap lines looking like this:
Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=sitemap&categories=Published&notcategories=No%20publish%7CArchived%7CAutoArchived%7Cdisputed&namespace=0&count=1000&ordermethod=categoryadd&stablepages=only
Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=sitemap&categories=Published%7CFeatured%20article&notcategories=No%20publish%7Cdisputed&namespace=0&count=500&ordermethod=categoryadd&stablepages=only
Sitemap: https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=sitemap&categories=Published%7CPolitics%20and%20conflicts&notcategories=No%20publish%7Cdisputed&namespace=0&count=100&ordermethod=categoryadd&stablepages=only
- Amgine | t 16:54, 29 January 2014 (UTC)
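
To avoid hand-encoding the pipes and spaces, the three lines above could be generated rather than typed. This is a minimal Python sketch, not anything GNSM or MediaWiki provides: the helper name `sitemap_line` is hypothetical, and the parameter order is simply copied from the lines above.

```python
from urllib.parse import urlencode, quote

BASE = "https://en.wikinews.org/w/index.php"


def sitemap_line(categories, notcategories, count):
    """Build one robots.txt Sitemap line with a percent-encoded query string."""
    query = {
        "title": "Special:NewsFeed",
        "feed": "sitemap",
        "categories": "|".join(categories),
        "notcategories": "|".join(notcategories),
        "namespace": 0,
        "count": count,
        "ordermethod": "categoryadd",
        "stablepages": "only",
    }
    # quote_via=quote gives %20 for spaces (not '+'); safe=":" keeps the
    # colon in Special:NewsFeed unescaped; '|' becomes %7C.
    return "Sitemap: " + BASE + "?" + urlencode(query, safe=":", quote_via=quote)


if __name__ == "__main__":
    print(sitemap_line(["Published"],
                       ["No publish", "Archived", "AutoArchived", "disputed"], 1000))
    print(sitemap_line(["Published", "Featured article"],
                       ["No publish", "disputed"], 500))
    print(sitemap_line(["Published", "Politics and conflicts"],
                       ["No publish", "disputed"], 100))
```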