Wikinews:Bots/Requests/InternetArchiveBot
- The following discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
- Closed as successful. [24Cr][talk] 07:08, 30 October 2021 (UTC)[reply]
- Operator: Cyberpower678 (talk · contribs) and Harej (talk · contribs)
- Bot name: InternetArchiveBot (Talk • contribs • bot status log • actions • block log • other log)
- Programming language: PHP
- Already used on: Operates on dozens of additional Wikimedia wikis
- Task: InternetArchiveBot identifies dead links and adds links to archived versions where available. Per request on Phabricator. Harej (talk) 22:32, 20 January 2021 (UTC)[reply]
Comments
Task 1
- Comment @Harej: Please note, I would like to have all links (not just the dead links) archived. There are roughly 22k articles in the CAT:Published category. The bot just needs to archive the links; it should NOT edit the articles. After all the links in that category are archived, IABot needs to monitor the same category for any new additions. Can you please confirm IABot can do that, Harej? Thanks.
•–• 06:31, 21 January 2021 (UTC)[reply]
- Why should the bot not edit the articles? Providing working alternatives to non-functioning links is the entire point. Harej (talk) 17:17, 21 January 2021 (UTC)[reply]
- @Harej: We use {{source}} for adding the sources as well as the external links. That template has parameters to mark if the link is broken or to provide an archived link. However, there are articles which do not use that template. We do not remove/replace sources after the article is published. However, while cleaning up, if a source has a broken URL, admins generally find the archived version and add it. Giving a bot access to edit and review archived pages is scary -- it could do irreversible damage. We wish to use the bot to save the source before the link rots -- once it is saved, one could add the archived version. Moreover, if an archive already exists, there is a sapient decision to make about which version of the archive to use. There are articles which are not sighted (maybe because a file/template used on that page has been modified) -- if the bot were to add archived links to such an article and sight it, such important changes would disappear unnoticed forever. That is why the bot should not be directly editing the articles, merely archiving all the hyperlinks.
•–• 17:45, 21 January 2021 (UTC)[reply]
- InternetArchiveBot operates on 50 Wikimedia wikis without causing "irreversible damage" to them. In the event the bot malfunctions on a given wiki, mistakes can be undone and the bot can be stopped by any logged in user through the bot interface. The bot is very sophisticated, can be configured to parse templates as well as fix plain links, and automatically picks archives closest to when the link was added. The Wayback Machine has been automatically saving outbound links on Wikimedia projects for years. The benefit of proactively, and automatically, fixing broken links vastly outweighs the risks. Linkrot is endemic to wikis at such a scale that automated solutions are required. Harej (talk) 17:59, 21 January 2021 (UTC)[reply]
- With respect, acagastya is correct: there is danger of irreversible damage. There's history here, between Wikinews and the Foundation, involving a WON'T FIX closure of a Bugzilla request (from before my time, but I learned about it from bawolff). 'Nuff said on that point, I hope. --Pi zero (talk) 18:18, 21 January 2021 (UTC)[reply]
- I don't know what the Wikimedia Foundation has to do with this; InternetArchiveBot is a service of the Internet Archive. The bot is operated in many diverse contexts and can be highly customized for a given use case, and I am more than happy to work with the community on that. If you do not want InternetArchiveBot to fix broken links, which is its primary function, then I am not sure what you are requesting at all. If you want the Wayback Machine to preserve outgoing links, it is already doing that. Harej (talk) 18:22, 21 January 2021 (UTC)[reply]
- What you may be interested in is not automated operation, but the option to scan a page and add archives with the element of human review. If so the Analyze a Page feature may be useful for you. (Make sure you have "English Wikinews" as your selected project.) With this you can enter a page title, have it retrieve archive links for all links (not just dead ones), and make the edits conveniently while giving you the ability to review. This may be a more workable option. Harej (talk) 02:11, 22 January 2021 (UTC)[reply]
@Harej: I don't think it works the way I expected. Could you please try it with "US Republicans query Linux Foundation about open-source security"? I think it requires <ref></ref> and does not detect URLs.
•–• 05:54, 22 January 2021 (UTC)[reply]
- Acagastya I was not able to run the bot on that page as it is locked. Harej (talk) 18:43, 25 January 2021 (UTC)[reply]
- Alternatively, @Harej:, (this is a less likely scenario), if there were a list of source URLs to be archived, could the tool take care of those? I could write a script to extract all URLs, if there exists a way to automate the archival of those URLs.
•–• 11:13, 22 January 2021 (UTC)[reply]
- There are 82k sources, in a 6MB file <https://0x0.st/-iR0.txt>, which need to be archived -- is there some way I could use this list, run it through the bot/tool, and archive all the links?
•–• 15:33, 22 January 2021 (UTC)[reply]
- User:Acagastya the API documentation is available at meta:InternetArchiveBot/API. Harej (talk) 20:22, 22 January 2021 (UTC)[reply]
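For illustration only, a minimal sketch (in Python) of how such a list could be pushed through the Wayback Machine's public Save Page Now endpoint, independent of IABot; the filename and the pause between requests are assumptions, not part of any agreed workflow:

import time
import requests

# Read one source URL per line from the exported list (hypothetical filename).
with open("sources.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Ask the Wayback Machine to capture the page via Save Page Now.
    resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print(resp.status_code, url)
    time.sleep(5)  # pause between requests to avoid hammering the service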
- Thanks, @Harej:. Also, there is some context sensitive information I need to discuss, re IABot. Will you be available on IRC?
103.48.105.246 (talk) 20:36, 22 January 2021 (UTC)[reply]
- Sure, I am in #IABot on Freenode as "harej". Harej (talk) 20:55, 22 January 2021 (UTC)[reply]
- I'm confused as to what the opposition is to this bot. It has been useful on countless wikis and is not likely to cause any damage. Even if it did, it could be easily reverted. It is useful for dead links. --IWI (talk) 14:01, 25 January 2021 (UTC)[reply]
- Actually, as noted above some damage that can be done is irreversible. --Pi zero (talk) 14:50, 25 January 2021 (UTC)[reply]
- Can you explain how irreversible damage happens on a wiki when all versions of a page are stored and can be restored at any time? I don't think this was ever clearly explained. Harej (talk) 17:54, 25 January 2021 (UTC)[reply]
- It's not readily fixable; as I remarked before, the Bugzilla request was closed as WON'T FIX. Given which, I'm not really seeing any cause to dwell on the details (unless you know of someone considering attempting a rewrite of the DPL extension, in which case please let me know and perhaps I'll contact them). --Pi zero (talk) 20:29, 25 January 2021 (UTC)[reply]
We needed IABot to take care of two different tasks. Upon careful discussion, it was agreed to split the tasks: the first being taking a snapshot of all the sources in the archives (currently in progress, overseen by user:acagastya); and the other being taking snapshots of all the sources for new articles. Since the first task is in progress, let's focus the discussion on just the latter task.
•–• 20:34, 25 January 2021 (UTC)[reply]
Task 2
- Comment @Harej: All the sources used in a mainspace article are expected to be listed in the url parameter of {{source}}. {{source}} also accepts an archiveurl parameter. Could you configure IABot in such a way that it archives the sources of mainspace articles which are NOT yet archived, reading from url and updating archiveurl? @Pi zero: does this sound safe? I hope I am not overlooking something; please let me know if I am.
•–• 20:41, 25 January 2021 (UTC)[reply]
- I have configured the bot to recognize the "archiveurl" parameter. Harej (talk) 20:50, 25 January 2021 (UTC)[reply]
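For illustration, such an edit would presumably only fill in the extra parameter, turning something like {{source|url=https://example.com/story|...}} into {{source|url=https://example.com/story|archiveurl=https://web.archive.org/web/20210125000000/https://example.com/story|...}} (placeholder URLs and timestamp; other parameters left unchanged).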
- Comment Seems to me one ought to wait until publication-plus-24-hours before archiving sources. Other than that, I have no strong objection; the worst technical harm that could possibly be done is quite limited. --Pi zero (talk) 21:05, 25 January 2021 (UTC)[reply]
- Question @Harej: Does IABot have specific triggers (like running when a new page is created)? Or does it run at a set time interval? Re what @Pi zero: has added, Harej, I think IABot can just check if(wikitext.categories.includes('Published')) { run(); }. Will that work? Additionally, can you control which snapshot instance will be added to the archive URL? The snapshot with the timestamp closest after Category:Published was added would be ideal.
•–• 07:17, 26 January 2021 (UTC)[reply]
- (Yeah, published will likely do, which would admittedly be much easier.) --Pi zero (talk) 17:45, 26 January 2021 (UTC)[reply]
- Acagastya, IABot does not have triggers. However, the bot will use the archive corresponding to the stated access time in the citation, or the closest one to when the URL was added. Harej (talk) 17:20, 1 February 2021 (UTC)[reply]
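For illustration, a minimal sketch (in Python) of the kind of check and lookup described above, assuming the public MediaWiki API on English Wikinews and the Wayback Machine Availability API; the page limit, example URL, and timestamp are placeholders, and this is not how IABot itself is implemented:

import requests

API = "https://en.wikinews.org/w/api.php"

# List pages currently in Category:Published via the MediaWiki API.
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Published",
    "cmlimit": "10",
    "format": "json",
}
members = requests.get(API, params=params, timeout=30).json()["query"]["categorymembers"]
print([page["title"] for page in members])

# Ask the Wayback Machine for the existing snapshot closest to a given
# time (YYYYMMDDhhmmss), e.g. shortly after publication.
def closest_snapshot(url, timestamp):
    data = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    ).json()
    return data.get("archived_snapshots", {}).get("closest", {}).get("url")

print(closest_snapshot("https://example.com/story", "20210126000000"))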
- Question @Harej: Would IABot be able to fill in all three parameters on {{source}} for pages with dead links? Namely, also brokenURL and archivedescription when links are being recovered? —chaetodipus (talk · contribs) 04:46, 29 July 2021 (UTC)[reply]
- Question @Cyberpower678, Harej: I see the Phabricator note says this is stalled. Could you briefly outline what needs to be done to unstall? --Green Giant (talk) 03:38, 7 August 2021 (UTC)[reply]
- @Cyberpower678, Harej: It’s been a fortnight with no response. Please could you provide an update? If not, shall we close this request? [24Cr][talk] 12:54, 21 August 2021 (UTC)[reply]
- @Cromium: I think that is because no one here is voting -- we need to vote to pass or fail it before they can proceed.
•–• 12:59, 21 August 2021 (UTC)[reply]
- @Acagastya: Is that why it is listed as stalled? If so, we can vote but I was hoping to see a test run of 10-20 edits first. However, if the same task is being done on another wiki, I guess we can move to approval. [24Cr][talk] 13:10, 21 August 2021 (UTC)[reply]
- I will test the bot for 20 edits. After around 20 edits I will stop the bot, and let you assess. Harej (talk) 22:08, 13 September 2021 (UTC)[reply]
- This may be delayed as the bot is currently down for maintenance. Harej (talk) 23:44, 13 September 2021 (UTC)[reply]
We have been trying to do test edits with the bot and have not been succeeding. Basically the bot goes through AllPages in alphabetical order, and all the pages it has come across are protected. The bot will work for “single page” runs via the Management Interface (it makes edits on your user account’s behalf, so it will work on the pages you can normally edit) and also for multi-page runs on unprotected pages. If you want the bot to make background edits, it will need to be promoted to admin. Harej (talk) 19:08, 27 September 2021 (UTC)[reply]
- @Harej: I’ve promoted the bot to admin for a month to help the test run. Please advise if anything else is needed. [24Cr][talk] 12:53, 2 October 2021 (UTC)[reply]
- @Harej, I saw the bot made a single edit so far (diff). If the bot is rescuing a dead link in sources, can it also add "brokenURL = true" so that it displays the archived link in the article? —chaetodipus (talk · contribs) 04:39, 20 October 2021 (UTC)[reply]
- chaetodipus, it will add that parameter on future edits. Harej (talk) 18:30, 20 October 2021 (UTC)[reply]
Votes
Support given the limitations of the bot (the way it was designed, to serve a specific purpose), I think it achieves a part of the task, and I am okay with the compromise.
•–• 17:34, 1 February 2021 (UTC)[reply]
- And thinking about it, if the bot does not edit semi-protected pages, we won't royally screw up. :D
•–• 17:36, 1 February 2021 (UTC)[reply]
Support if it gets the process moved to implementation. [24Cr][talk] 00:01, 27 August 2021 (UTC)[reply]
Support I think this would definitely be useful in recovering the many dead sources in our archives. —chaetodipus (talk · contribs) 05:27, 21 October 2021 (UTC)[reply]
Support I am familiar with IAB and find it extremely useful on other projects. I am surprised to learn in viewing this request that it isn't already approved. I think including this is a no-brainer. --TheSandDoctor (talk) 21:29, 22 October 2021 (UTC)[reply]
- The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made on the appropriate discussion page, such as the current discussion page. No further edits should be made to this discussion.