00:13:37 Vokunal edited Deathwatch (+0): https://wiki.archiveteam.org/?diff=51121&oldid=51120
00:13:38 JustAnotherArchivist changed the user rights of User:Vokunal
02:49:32 With URL-needing projects like #down-the-tube, when the tracker says there are 0 to do, does that mean that the system literally has no more urls to go off of? Or that it's just not willing to allocate any right now?
02:50:05 when the youtube tracker says there are 0 to do, it means there are no more urls in the youtube queue, yeah
02:50:31 the youtube project is not trying to archive all of youtube (that would be infeasible), it has to be actually important videos
02:50:44 if it reaches 0, great, we have more capacity for the other projects
02:51:24 Alright, that's what I wanted/needed to know. Thanks
03:00:59 On a separate curiosity, I've been wondering from a previous conversation if it'd be possible (and if possible, whether it should be done) to get all the failed imgur outlinks from the logs of AB projects and run those through the imgur warrior.
03:02:41 yes, you "just" need to download all the AB logs from IA, parse them, upload the lists and submit to #imgone
03:03:05 and maybe make a service for that, since other projects will want some processing too
03:04:05 pabs: are warcs public for imgur? for many projects they aren't :(
03:05:00 sounded like Pedrosso was talking about warcs for AB not #imgone?
03:05:07 I was, I was
03:05:13 * pabs not sure about imgur warcs tho
03:05:46 btw the AB warcs are linked from https://archive.fart.website/archivebot/viewer/
03:05:47 ah
03:06:08 Also, pabs, what exactly do you mean by making a service for that?
03:06:30 nicolas17: they're both public
03:07:06 Pedrosso: as in a server with some code that does this all day long, and lets people add processing and flows. ie if AB finds a wiki, it should go to #wikibot
03:07:29 so the service would parse the warcs and connect that link
03:09:12 That sounds like a good idea. Tho I personally don't have enough knowledge or experience here to begin to think about executing that
03:11:04 There is a tool for WARC extraction, although that would have slightly different results than log parsing.
03:11:22 s/extraction/scraping/ I guess, extracting links that appear in WARCs.
03:11:23 Sry bout the disconnect/reconnect, if it shows
03:12:24 I think this was less about scraping the HTML in WARCs and more about sending the 429ed imgur requests from AB to #imgone
03:12:35 ^
03:12:37 Yeah, they're not equivalent.
03:12:50 WARC scraping would produce more results but also requires munching more data.
03:13:03 but really, both could be useful. indeed, tons more data for scraping though
03:13:34 The former I suppose would be more specific to what I originally asked, the latter would be far more general and fit with the service idea
03:13:36 could do scraping only for the AB jobs without offsite links
03:14:47 anyway. it's good to start simple though and work up from there, so manually do this, then hackily automate parts, then betterise the automation, then package it into a service
03:15:58 it's a nice thought, but it would duplicate some of the logic for cross-project dispatch we do already and i'm not sure what the best strategy for eventually rationalizing that would be
03:16:10 s/dispatch/backfeed/
03:17:34 are there any docs for that? I hadn't heard of any cross-project dispatch yet
03:17:57 #// dispatches to Telegram and (soon?) Imgur.
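A minimal sketch of the log-parsing idea raised at 03:00:59, assuming the ArchiveBot job logs have already been downloaded locally and are plain text with the HTTP status and URL on the same line; the real log format (plain text vs. JSON lines, and how 429s are recorded) would need to be checked first, so treat the parsing here as a placeholder:

    #!/usr/bin/env python3
    # Collect imgur URLs that hit a 429 in downloaded ArchiveBot logs.
    # Assumption: each log line contains the status code and the URL as text;
    # swap in json.loads() per line if the logs turn out to be JSON instead.
    import re
    import sys

    IMGUR_RE = re.compile(r'https?://(?:[a-z]+\.)?imgur\.com/\S+')

    found = set()
    for path in sys.argv[1:]:
        with open(path, errors='replace') as f:
            for line in f:
                if '429' not in line:
                    continue
                found.update(IMGUR_RE.findall(line))

    for url in sorted(found):
        print(url)

The resulting list could then go to transfer.archivete.am and be offered in #imgone, as suggested above.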
03:17:58 pabs: #// already sends telegram links to #telegrab
03:18:03 how that works behind the scenes, I don't know
03:18:12 ah, interesting...
03:18:22 there's loads but it's all done haphazardly inline https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1688
03:18:29 No Imgur yet. arkiver, here's a reminder. ;-)
03:18:32 oh ew
03:18:44 I expected something server side rather than the worker for one project submitting into another
03:21:49 the logical thing might be to have a central url clearinghouse that identified all specially-handled urls and forwarded them to the appropriate projects (and either sent the rest to #// or, possibly configurably, dropped them as might be more appropriate for archivebot)
03:22:28 yes
03:25:42 in practice all new projects send outlinks to #// anyway, so (if eg telegram links to mediafire or whatever) they do get to the appropriate projects eventually
03:29:50 So a mediafire outlink from the AB will be sent to #// where it'll be sent to #mediaonfire?
03:30:34 Only DPoS projects send things to #//. AB does not.
03:30:51 Ah, I see I see
03:32:46 right. and bundling that queueing with archival makes it not compose well with archivebot, plus it's a needless round-trip, plus it requires the code to actually opt in (when looking for an example i was surprised to find that apparently pastebin doesn't queue outlinks at all)
03:39:45 plus changes require #// worker updates to take effect (minor considering how most people run it, but still)
03:48:41 idk, i can think of some cases in which you really do need the original discovery context and not just the url (nitter/mastodon instances, blogs at custom domains). but i think all we actually do at present is url-pattern-based
03:54:09 s/discovery context/page structure/ (i can't actually think of any examples where you need the discovery context)
05:31:15 here we go, here we go again… https://x.com/dexerto/status/1722958208807891046?s=12
05:31:16 nitter: https://nitter.net/dexerto/status/1722958208807891046
05:34:58 https://i.kym-cdn.com/entries/icons/original/000/029/223/cover2.jpg
05:52:56 Tech234a edited List of websites excluded from the Wayback Machine/Partial exclusions (+52, Add early Apple Store): https://wiki.archiveteam.org/?diff=51122&oldid=50493
05:56:57 Petchea edited Tumblr (+107, /* History */): https://wiki.archiveteam.org/?diff=51123&oldid=51113
06:07:04 https://abcnews.go.com/Technology/wireStory/jezebel-sharp-edged-feminist-website-shutting-after-16-104768751 I don't see it on deathwatch or mentioned here
06:07:39 Indeed, but it's running through AB already.
06:08:39 Didn't realise it was part of G/O. Another one for the list, I guess.
06:14:03 JustAnotherArchivist edited Deathwatch (+184, /* 2023 */ Add Jezebel): https://wiki.archiveteam.org/?diff=51124&oldid=51121
14:51:14 pabs: poor TheTechRobo he may get the hug of death of HN :D
15:43:38 wtf send help https://lounge.thetechrobo.ca/uploads/f2d379beb39b7321/IMG_2421.jpeg
16:15:02 that's pretty moderate so far
16:17:59 1.7k now
16:40:17 TheTechRobo: congrats on getting on front page :)
16:40:21 very nice tool as well!
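Going back to the central URL clearinghouse idea from 03:21:49: a toy sketch of what purely pattern-based routing could look like. The route table, channel names, and dispatch function are illustrative only; no such service exists yet, and the real patterns currently live inline in places like urls.lua as linked above.

    # Toy routing table: URL pattern -> project channel, everything else to #//.
    import re

    ROUTES = [
        (re.compile(r'https?://(?:[a-z0-9]+\.)?imgur\.com/'), '#imgone'),
        (re.compile(r'https?://(?:t|telegram)\.me/'), '#telegrab'),
        (re.compile(r'https?://(?:www\.)?mediafire\.com/'), '#mediaonfire'),
    ]

    def dispatch(url, default='#//'):
        """Return the channel a discovered URL should be forwarded to."""
        for pattern, channel in ROUTES:
            if pattern.search(url):
                return channel
        return default

    print(dispatch('https://i.imgur.com/abc123.jpg'))  # -> #imgone
    print(dispatch('https://example.com/page'))        # -> #//

Whether the fallthrough should go to #// or be dropped would depend on the source, as discussed above for ArchiveBot.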
16:40:28 JAA: whoops
16:40:31 thanks for the reminder
16:46:08 arkiver: :D
17:40:21 https://transfer.archivete.am/jxHWG/static.spore.com-ids-2016-fix.txt.zst - Fixed line 538397 and broken sorting
17:40:32 Pedrosso pokechu22 ^
18:00:29 JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51125&oldid=51117
20:19:22 error: Hello there
20:22:18 howdy
20:35:38 Also I think you should change your nickname (with /nick new_nickname) or you'll get pinged every time someone uses the "error" word
20:37:06 fair lol
20:37:22 changed
22:07:39 Does anyone know if Fextralife (https://fextralife.com) has ever been grabbed? Specifically curious about their wikis, which seem like a goldmine
22:07:44 (Wiki page already created for those interested)
22:22:30 I don't see anything on https://archive.org/search?query=originalurl%3A%28%2Afextralife%2A%29 however, idk if there's possibly another way of searching for it
22:23:07 If it's really a goldmine of wikis, maybe move this to #wikiteam?
22:24:34 Disregard that last statement, as after all it's the entire website
22:27:24 tomodachi94: I believe #archivebot automatically moves wikis to #wikibot when it discovers them, so I'd suggest you repeat this in #archivebot so that an admin can submit it
22:28:01 do ask them if it does move them automatically as I don't know
22:29:46 archivebot is a self-contained system and doesn't submit anything to any other tooling
22:30:57 Strange, when I had asked AB to archive a website with a wiki in it, it sent it there. Perhaps I misinterpreted it
22:32:21 does anyone know if blogger/blogspot is in the warrior?
22:32:31 or if there's even an initiative to archive it?
22:33:45 Pedrosso: The originalurl search only works for wikis specifically dumped by WikiTeam tooling. Basically nobody else sets that metadata field. Certainly not AB.
22:34:04 And no, AB does not submit anything elsewhere. That was done manually.
22:34:16 tomodachi94: doesn't look like it https://archive.fart.website/archivebot/viewer/?q=fextralife
22:34:58 mossssss: It isn't yet, but we're aware of the situation. Unfortunately, it doesn't seem to be possible to enumerate the blogs or similar.
22:35:40 Looks like that (Google inactive accounts etc.) was never added to Deathwatch though.
22:37:40 oh no!!! that's so frustrating that there's no way to do it
22:37:48 Very frustrating indeed.
22:38:04 i'm stressed because i know there's so much stuff on there that is totally going to be lost
22:38:08 :(
22:38:28 google is really on a 'HOW much are we storing???' kick lately
22:38:49 Could you specify?
22:39:10 If there's anything you particularly care about, feel free to ask in #archivebot about archiving it. Blogger blogs work fairly well (except for some pagination mess and the 'dynamic view' script hells).
22:42:30 JustAnotherArchivist edited Deathwatch (+635, /* 2023 */ Add Google's inactive accounts purge): https://wiki.archiveteam.org/?diff=51126&oldid=51124
22:44:04 this is perhaps a bit backwards but is there a way to do it through individual bloggers' profiles? probably half of profiles aren't visible but the ones that are usually have 1-3 blogs on them
22:45:42 The profile IDs are much too large to be bruteforced, and IIRC there's quite a bit of rate limiting on the profile pages.
22:46:18 yeah - that makes sense. it's just the only half-plausible solution i can come up with lol
22:51:05 JAA: I've grabbed the wiki links from tomodachi94's suggested website. https://transfer.archivete.am/ErUPC/wikilinks.txt
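For reference, the originalurl search from 22:22:30 can also be run with the internetarchive pip package instead of the web UI; as JAA points out above, only WikiTeam-style dumps set that metadata field, so an empty result only means no wiki dump exists, not that the site was never archived. A small sketch:

    # Query IA's search API for items whose originalurl metadata matches fextralife.
    from internetarchive import search_items

    for result in search_items('originalurl:(*fextralife*)',
                               fields=['identifier', 'originalurl']):
        print(result['identifier'], result.get('originalurl', ''))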
22:54:43 I mentioned the Blogger thing multiple times 2-3 months ago...
22:55:25 Yes, it was discussed extensively in May.
22:55:46 But since we have no way of discovering blogs, really...
22:56:30 Not even Blogger ID numeration?
22:56:35 *ID number
22:56:39 Even if it's rate limited?
23:00:57 a bit, yeah (https://wiki.archiveteam.org/index.php/Blogger#Strategy, https://hackint.logs.kiska.pw/archiveteam-bs/20230910#c378934)
23:02:29 on a related note, anyone know whether blog names or user ids can be extracted from blogspot image cdn urls? parts don't look entirely random, but i'm not sure
23:02:45 Yeah, it's one of the reasons why I became slightly to somewhat more inactive in ArchiveBot s:
23:11:01 There seems to be an implicit feeling that Blogger may be deemed less important than YouTube or other stuff
23:12:22 Even though it uses less space than something video related
23:15:44 i actually had no idea about this--must have missed the discussion
23:20:12 Barto, arkiver, TheTechRobo: oh, didn't think it would reach the front page :)
23:21:46 thuban: Yeah, that's why this should've been on Deathwatch from the start. :-|
23:22:36 pabs: muahaha
23:22:40 congratz
23:22:51 congratz indeed
23:23:50 pabs raking in that HN karma :p
23:25:26 re blogger, a while back I found you can scrape front pages for profile links, scrape front page links from profiles, and you get a probably ever-expanding list
23:25:53 I'm not sure even adding it to Deathwatch when it was announced would help
23:26:15 I don't imagine it'd be complete but quite extensive
23:26:23 it would be nice to try
23:26:28 It would
23:27:17 Ryz: It does help. It wasn't really on my radar anymore until some people brought it up again a couple days ago (on Reddit and via email).
23:28:19 * JAA summons the arkiver.
23:29:01 https://mkx9delh5a.execute-api.ca-central-1.amazonaws.com/uploads/c7743b41c33e6600/arkiver.png
23:29:02 it is time.
23:29:06 arkiver
23:29:59 my hacky script for blogger/blogspot enumeration: https://transfer.archivete.am/RAiXa/archive-blogspot.sh
23:30:16 (note the captchas you get really hamper the process)
23:30:41 anyone here work at google? :p
23:30:59 and my list of blogs I wanted to AB: https://transfer.archivete.am/XWpXt/blogspot.com-blogs.txt
23:31:25 ooh, never mind pabs: that (the script) does look like it'd be complete
23:31:35 I have so many Blogspot websites to process too
23:32:02 sorry for the traffic bump TheTechRobo :)
23:32:23 same - i may just send them in the other channel if i need to
23:32:29 * pabs reached out to a Google person he knows
23:32:44 (not in the right dept tho)
23:33:02 🤞
23:33:11 fireonlive, you can tell if they say (opinions my own)
23:33:18 haha
23:33:21 true true
23:33:32 pabs: All good! :-)
23:34:12 ah! https://news.ycombinator.com/item?id=38228481 :)
23:40:41 PaulWise edited Blogger (+294, add second strategy): https://wiki.archiveteam.org/?diff=51127&oldid=47348
23:42:42 PaulWise edited Blogger (+116, add list of blogs found with the second strategy): https://wiki.archiveteam.org/?diff=51128&oldid=51127
23:45:01 here's a list of 144 blogs extracted from my irc logs (excluding #archivebot but not other archiveteam channels): https://transfer.archivete.am/2sEI9/blogspot_blogs_from_irc_logs.txt
23:46:27 (some of these are from topics of channels i scanned during the freenode implosion--i had totally forgotten about that)
23:48:34 does this mean we might be able to do it?? (I would be SO relieved lol - even some is better than none)
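A rough Python sketch of the crawl strategy pabs describes at 23:25:26 (blog front pages -> profile links -> more blogs), as a companion to the shell script linked above. The URL patterns are assumptions about what typically appears on Blogger pages, and the captcha/rate-limit handling that actually dominates the problem is left out:

    # Breadth-first crawl: seed blogs -> profile links on their front pages ->
    # blogspot links on those profiles -> more blogs, and so on.
    import re
    import time
    from collections import deque

    import requests

    PROFILE_RE = re.compile(r'https?://www\.blogger\.com/profile/\d+')
    BLOG_RE = re.compile(r'https?://[a-z0-9-]+\.blogspot\.com')

    def crawl(seed_blogs, max_pages=100):
        seen_blogs = set(seed_blogs)
        seen_profiles = set()
        queue = deque(seed_blogs)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=30).text
            except requests.RequestException:
                continue
            fetched += 1
            time.sleep(1)  # token politeness; the real limits (and captchas) are far harsher
            for profile in PROFILE_RE.findall(html):
                if profile not in seen_profiles:
                    seen_profiles.add(profile)
                    queue.append(profile)
            for blog in BLOG_RE.findall(html):
                if blog not in seen_blogs:
                    seen_blogs.add(blog)
                    queue.append(blog)
        return seen_blogs

    # Seed with any known blog list, e.g. the transfer.archivete.am lists above.
    for blog in sorted(crawl(['https://googleblog.blogspot.com'])):
        print(blog)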
23:49:43 Tomodachi94 created Fextralife (+458, Create page): https://wiki.archiveteam.org/?title=Fextralife
23:49:44 Tomodachi94 uploaded File:Fextralife banner.png: https://wiki.archiveteam.org/?title=File%3AFextralife%20banner.png
23:50:04 One potential concern is that many blogs will not be at risk, and I guess we don't have a good way of identifying which ones are.
23:51:18 yeah - i know it's any google account that hasn't been touched in 2 years - but that doesn't necessarily mean that the blogs are representative of the accounts
23:52:37 Any blog with a post in the past 2 years would *probably* be fine, but scheduled posts are a thing, so it's not reliable.
23:53:41 omg i totally forgot about that...
23:55:31 not sure why it keeps disconnecting me lol so annoying
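On the "which blogs are actually at risk" question from 23:50:04: one crude heuristic would be to check the newest entry in each blog's Atom feed and flag blogs with nothing posted in the last two years. The feed path below is the standard Blogger posts feed as far as I know, but, per the scheduled-posts caveat above, this can only ever be a rough filter, not a reliable one:

    # Flag blogs whose most recent post is older than ~2 years.
    from datetime import datetime, timedelta, timezone
    import xml.etree.ElementTree as ET

    import requests

    ATOM = '{http://www.w3.org/2005/Atom}'

    def last_post_date(blog_url):
        resp = requests.get(blog_url.rstrip('/') + '/feeds/posts/default',
                            params={'max-results': 1}, timeout=30)
        resp.raise_for_status()
        entry = ET.fromstring(resp.content).find(ATOM + 'entry')
        if entry is None:
            return None  # blog has no posts at all
        published = entry.find(ATOM + 'published').text
        return datetime.fromisoformat(published.replace('Z', '+00:00'))

    def probably_at_risk(blog_url, years=2):
        last = last_post_date(blog_url)
        cutoff = datetime.now(timezone.utc) - timedelta(days=365 * years)
        return last is None or last < cutoff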