User:Bawolff bot


Bawolff bot (Running OK!)



Emergency kill button: Stop my wild rampage

Summary

This bot is owned by user:Bawolff.

If it has any problems, block it if necessary and please leave a note at user:Bawolff. (This should never happen, though, because I should always be around when it does stuff; if I am around, try to find me on IRC as well.) The bot is currently running.


This bot is run from the toolserver. Thank you, toolserver folks, for letting me use your servers.

Current Caveats

  • Statistics can be distorted rather easily
  • A hit does not necessarily equal a page view. It could be Googlebot indexing us, someone repeatedly hitting refresh, or JavaScript fetching pages without the user ever seeing them (although most JS uses the API, so it is not counted); it could be many things. See the example after this list.
  • Redirects are counted as separate pages.
  • It may lie about the time period it used to generate statistics. (If it gets a 404 on the stats file, it uses the file from the previous hour, but still reports using the latest hour in some places.)
  • Counts interwiki and interlanguage links as hits (say you link to Chinese Wikinews from English Wikipedia; that request gets routed through us and is counted as a hit). This applies to both interlanguage and interproject links.
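
For context, the raw data the scripts below consume is Domas's hourly pagecounts dump, where each line gives a project code, a page title, the number of requests that hour, and the bytes transferred; "en.n" is the English Wikinews project code the scripts grep for. A hypothetical excerpt (file name and figures invented purely for illustration):

# Illustrative only: peek at the English Wikinews lines of one hourly dump.
# Fields: project, page title, requests during the hour, bytes transferred.
# A redirect and its target show up as two separate lines (hence the caveat above).
zcat pagecounts-20100913-210000.gz | grep '^en\.n ' | head -n 3
# en.n Main_Page 412 9307812
# en.n Some_article 96 2211840
# en.n Some_article_redirect 17 391744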

New source code

  • This will also filter out anything not in category:Published
  • If you adapt it, you will have to change some things, such as the email in the user-agent, the cd command at the beginning of each file, and the username/password in post.dat

doPopArticle.sh

#!/bin/bash --

#Note: this script + save.sh and update.sh are under the GPL version 2
# as published by the free software foundation.


cd /home/bawolff/pop

temp=`mktemp -p . tmp.XXXXXXXXXXXXXXXX`

./update.sh > "$temp"

./save.sh 'Template:Popular_articles' "$temp"

rm -f "$temp"

save.sh

#!/bin/bash --

cd /home/bawolff/pop

# first param article name, second file containing article text.
postDat=`cat post.dat`

cookies=`mktemp -p . tmp.XXXXXXXXXXXXXXXXX `

site='http://en.wikinews.org/w/api.php'

token=`wget --post-data $postDat --save-cookies "$cookies" --keep-session-cookies --header 'User-agent: Wikinews popular article bot - bawolff+wnbot@somewhere.invalid' -q --header 'From: bawolff@somewhere.invalid' "$site" -O - |egrep '^\s*token:'|cut -d : -f 2|tr -d ' '`

#echo $token

#echo "`cat post.dat`&token=$token&"

res=`wget -q --post-data "$postDat&lgtoken=$token&" --save-cookies "$cookies" --load-cookies "$cookies" --keep-session-cookies --header 'User-agent: Wikinews popular article bot - bawolff+wnbot@somewhere.invalid' --header 'From: bawolff+wnbot@somewhere.invalid' "$site" -O - | egrep '^\s*result: Success$' ` 

#echo result
#echo d"$res"d
if [ -z "$res" ]
then
echo Error logging in 1>&2
exit 1
fi

editToken=`wget "${site}?action=query&prop=info&titles=${1}&intoken=edit&format=yaml" -q --save-cookies "$cookies" --load-cookies "$cookies" --keep-session-cookies --header 'User-agent: Wikinews popular article bot - bawolff+wnbot@somewhere.com' --header 'From: bawolff+wnbot@somewhere.com' -O - |egrep '^\s*edittoken:'   | sed  's/^\s*edittoken:\s\([a-f0-9]*\)../\1%2B%5C/g'  `

#echo $editToken

temp=`mktemp -p . tmp.XXXXXXXXXXXXXXXXXX `

echo -n "action=edit&format=yaml&title=${1}&token=${editToken}&summary=Updating%20popular%20artcle%20list&bot&minor&assert=user&text=" > $temp

tr \\n \\v < $2 |sed -e 's/%/%25/g' -e 's/\v/%0A/g' -e 's/ /%20/g' -e 's/\+/%2B/g' -e 's/&/%26/g' >> $temp

#cat $temp

wget -q  --post-file "$temp" --save-cookies "$cookies" --load-cookies "$cookies" --keep-session-cookies --header 'User-agent: Wikinews popular article bot - bawolff+wnbot@somewhere.com' --header 'From: bawolff+wnbot@somewhere.com' "$site" -O /dev/null


rm -f $temp
rm -f $cookies
#page=sed

update.sh

#!/bin/bash --
 
#downloads statistics
#figures out what is relevant to wikinews
#new wikimarkup for stats page to standard out

cd /home/bawolff/pop
 
check_if_there () {
#takes a date format string that equals url of stats, and a relative date/time.
#checks http status code
#returns 0 for 200, 1 for 404, and exits shell script for anything else
 
isThere=$(HEAD -S -H 'User-agent: Wikinews stats bot. Contact [[user:Bawolff]]' -H 'From: bawolff+wnb@somewhere.com' $(date -d "$2" -u  "$1") | head -n 1 |sed 's/.*--> \([0-9][0-9][0-9]\).*/\1/')
 
if [ "$isThere" = 200 ]
then
        return 0 # it is there, success!
elif [ "$isThere" = 404 ]
then
        return 1 # not there, try the next one
elif [ "$isThere" = 301 ]
then
	return 1
else
        exit 1 # strange status code, bail out
fi
}

 
get_and_make() {
#takes a date format string that equals url of stats, and a relative date/time
#gets the corresponding file, greps lines relevant to en.wikinews
#cuts out Main Page and other namespaces, and gives [hits in hour, pagetitle]. sorts, takes the top 45, keeps only published articles, and outputs the top 15
#wikifies (including hour for map in [[Template:Popular articles/top]]) and outputs to stdout
#to increase number of results you have to change head filter, AND change number of closing }}s

export LC_ALL=C #for sorting

sortedArticleList=`mktemp -p . tmp.XXXXXXXXXXXXXXX || echo "articleList-$$.tmp"`
pubAPIRes=`mktemp -p . tmp.XXXXXXXXXXXXXXXXXX || echo "pubAPIRes-$$.tmp"`
pubList=`mktemp -p . tmp.XXXXXXXXXXXXXXXXXXXXXXXXX || echo "pubList-$$.tmp"`
filteredArticleList=`mktemp -p . tmp.XXXXXXXXXXXXXXXXXXX || echo "filteredList-$$.tmp"`

#get popular articles. remove obvious non-articles, take the 45 most popular, then re-sort alphabetically (needed for join)
wget `date -d "$2" -u  "$1"`  -q --header\='User-agent: Wikinews stats bot. Contact [[user:Bawolff]]' --header\='From: me@somewhere.com' -O - \
|zgrep 'en\.n' |awk '-F ' '{if ($2 !~ /(^Main_Page)|(^Talk:)|(^User:)|(^User_talk:)|(^Wikinews:)|(^Wikinews_talk:)|(^Category:)|(^Category_talk:)|(^File:)|(^File_talk:)|(^Special:)|(^en:)|(^Http:)/) print $3, $2}' \
|sed 's/%27/'\'/g \
|sort -g -r  \
|head -n 45 \
|sort -k 2 > "$sortedArticleList"

#list of the $sortedArticleList that are in category pub
wget 'http://en.wikinews.org/w/api.php?action=query&prop=categories&clcategories=Category:Published&format=xml&cllimit=max&titles='"`cut -d ' ' -f 2 $sortedArticleList |tr '\n' '|'`" -O "$pubAPIRes" -q --header\='User-agent: Wikinews stats bot. Contact [[user:Bawolff]]' --header\='From: somewhere@replacewithyouremail.com'

#turn into a nice newline-separated list.
echo 'cat api/query/normalized/n[@to=/api/query/pages/page[categories]/@title]/@from' | xmllint $pubAPIRes --shell --noent|sed -n -e 's/&quot;/"/' -e 's/^ from\=\"\(.*\)\"$/\1/p' | sort > "$pubList"

#remove non-published from $sortedArticleList
join -o 1.1\ 1.2 -1 2 "$sortedArticleList" "$pubList" \
|sort -g -r > $filteredArticleList

rm -f "$sortedArticleList" "$pubList" "$pubAPIRes"

#take the space-separated values and turn them into wiki syntax

head -n 15 < "$filteredArticleList" |awk 'BEGIN { HOURSTART = "'$(date -u -d '1 hour ago' +%H)'" ;HOUREND = "'$(date -u +%H)'"; print "<noinclude>{{/top|" HOURSTART "}}</noinclude>"} {print "{{#ifexpr: {{{top|40}}} > "NR-1"|# [[:" gensub(/_/, " ", "g", $2) "]] {{#if:{{{nohits|}}}||&nbsp;<small>('\'\'\''" $1 "'\'\'\'' hits last hour)</small>}}"} END {print "}} }} }} }} }} }} }} }} }} }} }} }} }} }} }} \n<noinclude>\nThese statistics are generated from [http://dammit.lt/wikistats/ Wikistats]. They are based on number of visits to each page over the last hour. These statistics include all visits, both by people and by automated computer programs. Although these are probably reasonably accurate, they are easy to distort. Please note that sometimes these statistics are updated on an irregular basis. This page was generated at ~~~~~ for the time period " HOURSTART ":00–" HOUREND ":00 UTC.</noinclude>"}'
 
rm -f "$filteredArticleList"

}
 

# try each of these until we get one
 
if check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' now
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' now

elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' now # sometimes files are a minute late
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' now
 
elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' '1 hour ago'

elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' '1 hour ago'
 
else # none of them worked :(
exit 2
fi
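
A note on the check_if_there and get_and_make calls above: the first argument is a strftime format string that date(1) expands into the URL of the dump for the requested hour. For example (the output shown is for an arbitrary time, purely illustrative):

# date(1) turns the format string plus a relative time into the URL of an hourly dump.
date -u -d '1 hour ago' '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz'
# -> http://dammit.lt/wikistats/pagecounts-20100913-200000.gz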

post.dat

action=login&lgname=Bawolff_bot&lgpassword=*****&format=yaml&
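
For reference, save.sh posts this data twice: the first reply (because of format=yaml) carries a login token, which is appended as lgtoken to complete the login on the second request. A stripped-down sketch of that round trip, with error handling trimmed and an arbitrary cookie file name (this only illustrates what save.sh already does, it is not extra functionality):

# Minimal sketch of the two-step login performed by save.sh.
site='http://en.wikinews.org/w/api.php'
postDat=$(cat post.dat)
# First request: the reply contains an indented "token:" line (result NeedToken).
token=$(wget -q --post-data "$postDat" --save-cookies login.cookies \
        --keep-session-cookies "$site" -O - \
        | egrep '^\s*token:' | cut -d : -f 2 | tr -d ' ')
# Second request: same data plus lgtoken; a "result: Success" line means we are logged in.
wget -q --post-data "${postDat}&lgtoken=${token}&" --load-cookies login.cookies \
     --save-cookies login.cookies --keep-session-cookies "$site" -O - \
     | egrep '^\s*result: Success$' || echo 'Error logging in' 1>&2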

What it does (old)

  • Takes Domas's statistics [1]. Gets what's relevant to Wikinews. Takes the top 20 (this could easily be changed to anything) along with total hits. Makes a wiki page out of it. Uses the pywikipedia sandbox bot to reset template:Popular articles based on the output.
  • Source code: (if you want to run it, change the email to something appropriate and make sure the directory structure matches (you might want to change some relative paths to absolute ones); this expects a pywikipedia install in a directory called pywikipedia, plus lwp-request (from Perl's libwww-perl) and wget to be installed)
wn_stats/update.py (main script called to do stuff)
#!/bin/sh --

#cd /some path to/wn_stats/

#false || echo false
if ! ./make_stats.sh >cur_stats.wiki 
then
cd ../pywikipedia/
python bot_error.py
echo wikinews statistics error 1>&2
exit 1
fi



cd ../pywikipedia/
python update_newpop.py && python bot_ok.py #file is hardcoded
wn_stats/make_stats.sh
#!/bin/bash --

#downloads statistics
#figures out what is relevant to wikinews
#new wikimarkup for stats page to standard out

cd /home/bawolff/src/wn_stats/

check_if_there () {
#takes a date format string that equals url of stats. check http status code
# returns 0 for 200, 1 for 404, and exits shell script for anything else

isThere=$(HEAD -H 'From: Bawolff+wnbots@**somewhere**.invalid' $(date -d "$2" -u  "$1") | head -n 1 |cut -d ' ' -f 1)

#echo $isThere is status

if [ "$isThere" = 200 ]
then
        #echo it is there 
        return 0

        elif [ "$isThere" = 404 ]
        then
                #echo not there `date -d "$2" -u "$1"`
                return 1
else
        exit 1
fi
}

get_and_make() {

#wget `date -d "$2" -u  "$1"`  -q -O -
#cat pagecounts-20080713-170000.gz 
#to increase the count you have to change the head filter, AND change the {{#ifexpr}}s
wget `date -d "$2" -u  "$1"`  -q --header\='From: bawolff+wnbots@**somewhere**.invalid' -O -|zgrep 'en\.n'|awk '-F ' '{print $3, $2}'|sort -g -r|head -n 20|awk 'BEGIN { TIMESTART = "'$(date -u -d '1 hour ago' +%H)':00" ;TIMEEND = "'$(date -u +%H)':00"; print "<noinclude>\n== Most Popular Last Hour ==\n</noinclude>{{#ifexpr: {{{count|40}}} > 0|{{#ifexpr: {{{count|41}}} > 1|{{#ifexpr: {{{count|41}}} > 2|{{#ifexpr: {{{count|41}}} > 3|{{#ifexpr: {{{count|41}}} > 4|{{#ifexpr: {{{count|41}}} > 5|{{#ifexpr: {{{count|41}}} > 6|{{#ifexpr: {{{count|41}}} > 7|{{#ifexpr: {{{count|41}}} > 8|{{#ifexpr: {{{count|41}}} > 9|{{#ifexpr: {{{count|41}}} > 10|{{#ifexpr: {{{count|41}}} > 11|{{#ifexpr: {{{count|41}}} > 12|{{#ifexpr: {{{count|41}}} > 13|{{#ifexpr: {{{count|41}}} > 14|{{#ifexpr: {{{count|41}}} > 15|{{#ifexpr: {{{count|41}}} > 16|{{#ifexpr: {{{count|41}}} > 17|{{#ifexpr: {{{count|41}}} > 18|{{#ifexpr: {{{count|41}}} > 19|"} {print "# [[:" gensub(/_/, " ", "g", $2) "]] {{#if:{{{nohits|}}}| |<small>('\'\'\''" $1 "'\'\'\'' hits last hour)</small>}}| }}"} END {print "* Total hits last hour: '\'\'\'$(./make_proj_stats.sh)\'\'\''<noinclude>\n\nThese statistics are generated by Domas'\''s [http://dammit.lt/wikistats/ Wikistats]<sup>['$( date -d "$2" -u "$1")']</sup>. They are based on number of visits to each page over the last hour. These statistics include all visits, both people and automated computer programs ones. Although these are probably reasonably accurate, they are easy to distort. Please note sometimes these statistics are updated on an irregular basis. This page was generated on ~~~~~ over the time period of " TIMESTART "-" TIMEEND " UTC.\n\n'\'\''For an extended list, see a [http://wikistics.falsikon.de/latest-daily/wikinews/en/ daily] or [http://wikistics.falsikon.de/latest/wikinews/en/ monthly] summary.'\'\''</noinclude>"}'

}

#at pagecounts/pagecounts-20080701-000000 |grep '^en\.n'|awk '-F ' '{print $3, $2}'|sort -g



if check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' now
then
         get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' now

elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' now
        then
        get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' now

elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0000.gz' '1 hour ago'
elif check_if_there '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/pagecounts-%Y%m%d-%H0001.gz' '1 hour ago'

else
exit 2
fi
wn_stats/make_proj_stats.sh
#!/bin/bash --

#helper script. not called directly
#downloads statistics
#figures out what is relevant to wikinews
#output number of total hits to wikinews

check_if_there () {
#takes a date format string that equals url of stats. check http status code
# returns 0 for 200, 1 for 404, and exits shell script for anything else

isThere=$(HEAD -H 'From: bawolff+wnbots@**SOMEWHERE**' $(date -d "$2" -u  "$1") | head -n 1 |cut -d ' ' -f 1)

#echo $isThere is status

if [ "$isThere" = 200 ]
then
        #echo it is there 
        return 0

        elif [ "$isThere" = 404 ]
        then
                #echo not there `date -d "$2" -u "$1"`
                return 1
else
        exit 1
fi
}

get_and_make() {

wget `date -d "$2" -u  "$1"` --header\='From: bawolff+wnbots@**SOMEWHERE**' -q -O -|grep 'en\.n'|cut -d ' ' -f 3

}


if check_if_there '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0000' now
then
         get_and_make '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0000' now

elif check_if_there '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0001' now
        then
        get_and_make '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0001' now

elif check_if_there '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0000' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0000' '1 hour ago'
elif check_if_there '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0001' '1 hour ago'
then
get_and_make '+http://dammit.lt/wikistats/projectcounts-%Y%m%d-%H0001' '1 hour ago'

else
exit 2
fi
pywikipedia/update_newpop.py
# -*- coding: utf-8 -*-
"""
This is a modified version of pywikipedia's sandbox bot that updates a page from a hardcoded file.
I know this is very ugly, but I really don't know Python.

This bot cleans a sandbox by replacing the current contents with predefined
text.

This script understands the following command-line arguments:

    -hours:#       Use this parameter to make the script repeat itself
                   after # hours. Hours can be defined as a decimal. 0.001
                   hours is one second.

"""
#
# (C) Leogregianin, 2006
# (C) Wikipedian, 2006-2007
# (C) Andre Engels, 2007
# (C) Siebrand Mazeland, 2007
#
# Distributed under the terms of the MIT license.
#
__version__ = '$Id: clean_sandbox.py 4402 2007-10-03 14:24:58Z leogregianin $'
#

import wikipedia
import time

f = open('../wn_stats/cur_stats.wiki', 'r')

content = {
    'en': unicode(f.read(), 'UTF-8'),
    }

msg = {
    'en': u'Robot: Updating Popular article list (over hour)',
    }

sandboxTitle = {
    'en': u'template:Popular_articles',
    }

class SandboxBot:
    def __init__(self, hours, no_repeat):
        self.hours = hours
        self.no_repeat = no_repeat

    def run(self):
        mySite = wikipedia.getSite()
        while True:
            now = time.strftime("%d %b %Y %H:%M:%S (UTC)", time.gmtime())
            localSandboxTitle = wikipedia.translate(mySite, sandboxTitle)
            sandboxPage = wikipedia.Page(mySite, localSandboxTitle)
            try:
                text = sandboxPage.get()
                translatedContent = wikipedia.translate(mySite, content)
                if text.strip() == translatedContent.strip():
                    wikipedia.output(u'No change!.')
                else:
                    translatedMsg = wikipedia.translate(mySite, msg)
                    sandboxPage.put(translatedContent, translatedMsg)
            except wikipedia.EditConflict:
                wikipedia.output(u'*** Loading again because of edit conflict.\n')
            if self.no_repeat:
                wikipedia.output(u'\nDone.')
                wikipedia.stopme()
                return
            else:
                wikipedia.output(u'\nSleeping %s hours, now %s' % (self.hours, now))
                time.sleep(self.hours * 60 * 60)

def main():
    hours = 1
    no_repeat = True
    for arg in wikipedia.handleArgs():
        if arg.startswith('-hours:'):
            hours = float(arg[7:])
            no_repeat = False
        else:
            wikipedia.showHelp('clean_sandbox')
            wikipedia.stopme()
            return

    bot = SandboxBot(hours, no_repeat)
    bot.run()

if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()

These are set to run 11 minutes past the hour, every hour my computer is on.
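
For illustration, a crontab entry along these lines would give that schedule (the path is assumed from the scripts above, not something stated here):

# Hypothetical crontab line: run the old update script at 11 minutes past every hour.
11 * * * * cd /home/bawolff/src/wn_stats/ && ./update.py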

What it used to do

This is outdated, and no longer true.

Recent Popular articles

  • Takes a list of the last 42 published articles (~3-5 days), and intersects it with a list of the top 200 articles with the most hits between now and the start of the month (see the sketch after this list).
  • The length of the list of published articles can be changed by admins by modifying the count on the DPL at User:Bawolff bot/recentPub
    • If you wish to change the count, feel free, but leave me a note at user talk:bawolff
    • If you want to change anything else about the DPL, please consult with me first to hopefully avoid breaking the bot
    • If you want to change the top 200 popular article list length, contact me. (It can be any number up to 5001.)
  • The resulting list is usually ~9 articles, ordered by popularity. It can be found at {{Popular articles/recent}}
    • If you wish to change the formatting of that template, please tell me, as the bot will override your changes.
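
A minimal sketch of that intersection, assuming two plain-text files that are hypothetical (not actual files the bot used): popular.txt holding the top-200 titles in popularity order and recent.txt holding the recently published titles, one per line:

# Keep only the lines of popular.txt (already in popularity order) that also
# appear, as whole lines, in recent.txt.
grep -Fx -f recent.txt popular.txt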

Popular articles (over month)

  • Takes a list of the 15 most popular articles (technically 16, including the Main Page, which is discarded); that is, the articles with the most hits between the beginning of the month and now.
    • The length of this list can be changed to any value up to 5001. Leave me a note.
  • The list can be found at {{Popular articles}}
    • If you wish to change the formatting of that template, please tell me, as the bot will override your changes.
    • The list may be misleading, as it counts hits from the beginning of the month to now, and stories are usually popular for periods of less than a month.

Misc. info

  • Uses Leon's Wikicharts tool. This samples 1 out of every 50 downloads of a Wikinews page (not including downloads of images, CSS, or JavaScript, but including image description pages; basically any wiki page and any special page viewed directly) and reports it to Wikicharts. Wikicharts then compiles a list of the pages with the most hits. (Since most people only care about articles, that is all this bot counts; see the arithmetic sketch at the end of this list.)
  • This bot is quite possibly misleading, especially the recent popular list, as it limits the article list to the most recent 42, which are often very low on the popularity list (often #1 is really the #30th most popular). This is because it takes time to accumulate hits. In other words, USE STATS WITH CAUTION.
    • In addition, the statistics tool this depends on also has a big warning label, so use with double caution.
    • Users without javascript are not counted.
  • The Wikicharts tool can be viewed on the toolserver
  • This bot runs sporadically. (Popular articles over the month: if it runs, it will be every 2 hours on the hour. Recent popular articles: every 2 hours at 10 minutes past the hour.) The bot will do nothing if there is nothing to do (i.e. the list is the same as when it was last checked). Note: it may go for periods without checking when my computer is off.
  • This bot is a combination of a shell script and a modified version of the pywikipediabot sandbox cleaner bot.
  • Any questions or comments, please ask at user talk:Bawolff
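
As an arithmetic footnote to the 1-in-50 sampling mentioned above: a raw Wikicharts figure has to be scaled up to estimate actual views (the number below is invented purely for illustration):

# Hypothetical: Wikicharts reports 37 sampled hits for an article; with
# 1-in-50 sampling that suggests roughly 37 * 50 = 1850 actual views.
sampled_hits=37
echo "$(( sampled_hits * 50 )) estimated views"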