00:13:37 Vokunal edited Deathwatch (+0): https://wiki.archiveteam.org/?diff=51121&oldid=51120
00:13:38 JustAnotherArchivist changed the user rights of User:Vokunal
02:49:32 With URL-needing projects like #down-the-tube, when the tracker says there are 0 to do, does that mean that the system literally has no more urls to go off of? Or that it's just not willing to allocate any right now?
02:50:05 when the youtube tracker says there are 0 to do, it means there are no more urls in the youtube queue, yeah
02:50:31 the youtube project is not trying to archive all of youtube (that would be infeasible), it has to be actually important videos
02:50:44 if it reaches 0, great, we have more capacity for the other projects
02:51:24 Alright, that's what I wanted/needed to know. Thanks
03:00:59 On a separate curiosity, I've been wondering from a previous conversation if it'd be possible (and if possible, whether it should be done) to get all the failed imgur outlinks from the logs of AB projects and run those through the imgur warrior.
03:02:41 yes, you "just" need to download all the AB logs from IA, parse them, upload the lists and submit to #imgone
03:03:05 and maybe make a service for that, since other projects will want some processing too
03:04:05 pabs: are warcs public for imgur? for many projects they aren't :(
03:05:00 sounded like Pedrosso was talking about warcs for AB not #imgone?
03:05:07 I was, I was
03:05:13 * pabs not sure about imgur warcs tho
03:05:46 btw the AB warcs are linked from https://archive.fart.website/archivebot/viewer/
03:05:47 ah
03:06:08 Also, pabs, what exactly do you mean by making a service for that?
03:06:30 nicolas17: they're both public
03:07:06 Pedrosso: as in a server with some code that does this all day long, and lets people add processing and flows. ie if AB finds a wiki, it should go to #wikibot
03:07:29 so the service would parse the warcs and connect that link
03:09:12 That sounds like a good idea. Tho I personally don't have enough knowledge or experience here to begin to think about executing that
03:11:04 There is a tool for WARC extraction, although that would have slightly different results than log parsing.
03:11:22 s/extraction/scraping/ I guess, extracting links that appear in WARCs.
03:11:23 Sry bout the disconnect/reconnect, if it shows
03:12:24 I think this was less about scraping the HTML in WARCs and more about sending the 429ed imgur requests from AB to #imgone
03:12:35 ^
03:12:37 Yeah, they're not equivalent.
03:12:50 WARC scraping would produce more results but also requires munching more data.
03:13:03 but really, both could be useful. indeed, tons more data for scraping though
03:13:34 The former I suppose would be more specific to what I originally asked, the latter would be far more general and fit with the service idea
03:13:36 could do scraping only for the AB jobs without offsite links
03:14:47 anyway. it's good to start simple though and work up from there, so manually do this, then hackily automate parts, then betterise the automation, then package it into a service
03:15:58 it's a nice thought, but it would duplicate some of the logic for cross-project dispatch we do already and i'm not sure what the best strategy for eventually rationalizing that would be
03:16:10 s/dispatch/backfeed/
03:17:34 are there any docs for that? I hadn't heard of any cross-project dispatch yet
03:17:57 #// dispatches to Telegram and (soon?) Imgur.
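A minimal sketch of the log-parsing idea raised at 03:00:59, assuming the ArchiveBot job logs have already been downloaded locally and are plain text with the HTTP status and URL on the same line; the real log format (plain text vs. JSON lines, and how 429s are recorded) would need to be checked first, so treat the parsing here as a placeholder:

    #!/usr/bin/env python3
    # Collect imgur URLs that hit a 429 in downloaded ArchiveBot logs.
    # Assumption: each log line contains the status code and the URL as text;
    # swap in json.loads() per line if the logs turn out to be JSON instead.
    import re
    import sys

    IMGUR_RE = re.compile(r'https?://(?:[a-z]+\.)?imgur\.com/\S+')

    found = set()
    for path in sys.argv[1:]:
        with open(path, errors='replace') as f:
            for line in f:
                if '429' not in line:
                    continue
                found.update(IMGUR_RE.findall(line))

    for url in sorted(found):
        print(url)

The resulting list could then go to transfer.archivete.am and be offered in #imgone, as suggested above.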
03:17:58 pabs: #// already sends telegram links to #telegrab
03:18:03 how that works behind the scenes, I don't know
03:18:12 ah, interesting...
03:18:22 there's loads but it's all done haphazardly inline https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1688
03:18:29 No Imgur yet. arkiver, here's a reminder. ;-)
03:18:32 oh ew
03:18:44 I expected something server side rather than the worker for one project submitting into another
03:21:49 the logical thing might be to have a central url clearinghouse that identified all specially-handled urls and forwarded them to the appropriate projects (and either sent the rest to #// or, possibly configurably, dropped them as might be more appropriate for archivebot)
03:22:28 yes
03:25:42 in practice all new projects send outlinks to #// anyway, so (if eg telegram links to mediafire or whatever) they do get to the appropriate projects eventually
03:29:50 So a mediafire outlink from the AB will be sent to #// where it'll be sent to #mediaonfire?
03:30:34 Only DPoS projects send things to #//. AB does not.
03:30:51 Ah, I see I see
03:32:46 right. and bundling that queueing with archival makes it not compose well with archivebot, plus it's a needless round-trip, plus it requires the code to actually opt in (when looking for an example i was surprised to find that apparently pastebin doesn't queue outlinks at all)
03:39:45 plus changes require #// worker updates to take effect (minor considering how most people run it, but still)
03:48:41 idk, i can think of some cases in which you really do need the original discovery context and not just the url (nitter/mastodon instances, blogs at custom domains). but i think all we actually do at present is url-pattern-based
03:54:09 s/discovery context/page structure/ (i can't actually think of any examples where you need the discovery context)
05:31:15 here we go, here we go again… https://x.com/dexerto/status/1722958208807891046?s=12
05:31:16 nitter: https://nitter.net/dexerto/status/1722958208807891046
05:34:58 https://i.kym-cdn.com/entries/icons/original/000/029/223/cover2.jpg
05:52:56 Tech234a edited List of websites excluded from the Wayback Machine/Partial exclusions (+52, Add early Apple Store): https://wiki.archiveteam.org/?diff=51122&oldid=50493
05:56:57 Petchea edited Tumblr (+107, /* History */): https://wiki.archiveteam.org/?diff=51123&oldid=51113
06:07:04 https://abcnews.go.com/Technology/wireStory/jezebel-sharp-edged-feminist-website-shutting-after-16-104768751 I don't see it on deathwatch or mentioned here
06:07:39 Indeed, but it's running through AB already.
06:08:39 Didn't realise it was part of G/O. Another one for the list, I guess.
06:14:03 JustAnotherArchivist edited Deathwatch (+184, /* 2023 */ Add Jezebel): https://wiki.archiveteam.org/?diff=51124&oldid=51121
14:51:14 pabs: poor TheTechRobo he may get the hug of death of HN :D
15:43:38 wtf send help https://lounge.thetechrobo.ca/uploads/f2d379beb39b7321/IMG_2421.jpeg
16:15:02 that's pretty moderate so far
16:17:59 1.7k now
16:40:17 TheTechRobo: congrats on getting on front page :)
16:40:21 very nice tool as well!
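Going back to the central URL clearinghouse idea from 03:21:49: a toy sketch of what purely pattern-based routing could look like. The route table, channel names, and dispatch function are illustrative only; no such service exists yet, and the real patterns currently live inline in places like urls.lua as linked above.

    # Toy routing table: URL pattern -> project channel, everything else to #//.
    import re

    ROUTES = [
        (re.compile(r'https?://(?:[a-z0-9]+\.)?imgur\.com/'), '#imgone'),
        (re.compile(r'https?://(?:t|telegram)\.me/'), '#telegrab'),
        (re.compile(r'https?://(?:www\.)?mediafire\.com/'), '#mediaonfire'),
    ]

    def dispatch(url, default='#//'):
        """Return the channel a discovered URL should be forwarded to."""
        for pattern, channel in ROUTES:
            if pattern.search(url):
                return channel
        return default

    print(dispatch('https://i.imgur.com/abc123.jpg'))  # -> #imgone
    print(dispatch('https://example.com/page'))        # -> #//

Whether the fallthrough should go to #// or be dropped would depend on the source, as discussed above for ArchiveBot.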
16:40:28 JAA: whoops
16:40:31 thanks for the reminder
16:46:08 arkiver: :D
17:40:21 https://transfer.archivete.am/jxHWG/static.spore.com-ids-2016-fix.txt.zst - Fixed line 538397 and broken sorting
17:40:32 Pedrosso pokechu22 ^
18:00:29 JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51125&oldid=51117
20:19:22 error: Hello there
20:22:18 howdy
20:35:38 Also I think you should change your nickname (with /nick new_nickname) or you'll get pinged every time someone uses the "error" word
20:37:06 fair lol
20:37:22 changed
22:07:39 Does anyone know if Fextralife (https://fextralife.com) has ever been grabbed? Specifically curious about their wikis, which seem like a goldmine
22:07:44 (Wiki page already created for those interested)
22:22:30 I don't see anything on https://archive.org/search?query=originalurl%3A%28%2Afextralife%2A%29 however, idk if there's possibly another way of searching for it
22:23:07 If it's really a goldmine of wikis, maybe move this to #wikiteam?
22:24:34 Disregard that last statement, as after all it's the entire website
22:27:24 tomodachi94: I believe #archivebot automatically moves wikis to #wikibot when it discovers them, so I'd suggest you repeat this in #archivebot so that an admin can submit it
22:28:01 do ask them if it does move them automatically as I don't know
22:29:46 archivebot is a self-contained system and doesn't submit anything to any other tooling
22:30:57 Strange, when I had asked AB to archive a website with a wiki in it, it sent it there. Perhaps I misinterpreted it
22:32:21 does anyone know if blogger/blogspot is in the warrior?
22:32:31 or if there's even an initiative to archive it?
22:33:45 Pedrosso: The originalurl search only works for wikis specifically dumped by WikiTeam tooling. Basically nobody else sets that metadata field. Certainly not AB.
22:34:04 And no, AB does not submit anything elsewhere. That was done manually.
22:34:16 tomodachi94: doesn't look like it https://archive.fart.website/archivebot/viewer/?q=fextralife
22:34:58 mossssss: It isn't yet, but we're aware of the situation. Unfortunately, it doesn't seem to be possible to enumerate the blogs or similar.
22:35:40 Looks like that (Google inactive accounts etc.) was never added to Deathwatch though.
22:37:40 oh no!!! that's so frustrating that there's no way to do it
22:37:48 Very frustrating indeed.
22:38:04 i'm stressed because i know there's so much stuff on there that is totally going to be lost
22:38:08 :(
22:38:28 google is really on a 'HOW much are we storing???' kick lately
22:38:49 Could you specify?
22:39:10 If there's anything you particularly care about, feel free to ask in #archivebot about archiving it. Blogger blogs work fairly well (except for some pagination mess and the 'dynamic view' script hells).
22:42:30 JustAnotherArchivist edited Deathwatch (+635, /* 2023 */ Add Google's inactive accounts purge): https://wiki.archiveteam.org/?diff=51126&oldid=51124
22:44:04 this is perhaps a bit backwards but is there a way to do it through individual bloggers' profiles? probably half of profiles aren't visible but the ones that are usually have 1-3 blogs on them
22:45:42 The profile IDs are much too large to be bruteforced, and IIRC there's quite a bit of rate limiting on the profile pages.
22:46:18 yeah - that makes sense. it's just the only half-plausible solution i can come up with lol
22:51:05 JAA: I've grabbed the wiki links from tomodachi94's suggested website. https://transfer.archivete.am/ErUPC/wikilinks.txt
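For reference, the originalurl search from 22:22:30 can also be run with the internetarchive pip package instead of the web UI; as JAA points out above, only WikiTeam-style dumps set that metadata field, so an empty result only means no wiki dump exists, not that the site was never archived. A small sketch:

    # Query IA's search API for items whose originalurl metadata matches fextralife.
    from internetarchive import search_items

    for result in search_items('originalurl:(*fextralife*)',
                               fields=['identifier', 'originalurl']):
        print(result['identifier'], result.get('originalurl', ''))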
22:54:43 I mentioned the Blogger thing multiple times 2-3 months ago...
22:55:25 Yes, it was discussed extensively in May.
22:55:46 But since we have no way of discovering blogs, really...
22:56:30 Not even Blogger ID numeration?
22:56:35 *ID number
22:56:39 Even if it's rate limited?
23:00:57 a bit, yeah (https://wiki.archiveteam.org/index.php/Blogger#Strategy, https://hackint.logs.kiska.pw/archiveteam-bs/20230910#c378934)
23:02:29 on a related note, anyone know whether blog names or user ids can be extracted from blogspot image cdn urls? parts don't look entirely random, but i'm not sure
23:02:45 Yeah, it's one of the reasons why I became slightly to somewhat more inactive in ArchiveBot s:
23:11:01 There seems to be an implicit feeling that Blogger may be deemed less important than YouTube or other stuff
23:12:22 Even though it uses less space than something video related
23:15:44 i actually had no idea about this--must have missed the discussion
23:20:12 Barto, arkiver, TheTechRobo: oh, didn't think it would reach the front page :)
23:21:46 thuban: Yeah, that's why this should've been on Deathwatch from the start. :-|
23:22:36 pabs: muahaha
23:22:40 congratz
23:22:51 congratz indeed
23:23:50 pabs raking in that HN karma :p
23:25:26 re blogger, a while back I found you can scrape front pages for profile links, scrape front page links from profiles, and you get a probably ever-expanding list
23:25:53 I'm not sure even adding it to Deathwatch when it was announced would help
23:26:15 I don't imagine it'd be complete but quite extensive
23:26:23 it would be nice to try
23:26:28 It would
23:27:17 Ryz: It does help. It wasn't really on my radar anymore until some people brought it up again a couple days ago (on Reddit and via email).
23:28:19 * JAA summons the arkiver.
23:29:01 https://mkx9delh5a.execute-api.ca-central-1.amazonaws.com/uploads/c7743b41c33e6600/arkiver.png
23:29:02 it is time.
23:29:06 arkiver
23:29:59 my hacky script for blogger/blogspot enumeration: https://transfer.archivete.am/RAiXa/archive-blogspot.sh
23:30:16 (note the captchas you get really hamper the process)
23:30:41 anyone here work at google? :p
23:30:59 and my list of blogs I wanted to AB: https://transfer.archivete.am/XWpXt/blogspot.com-blogs.txt
23:31:25 ooh, never mind pabs: that (the script) does look like it'd be complete
23:31:35 I have so many Blogspot websites to process too
23:32:02 sorry for the traffic bump TheTechRobo :)
23:32:23 same - i may just send them in the other channel if i need to
23:32:29 * pabs reached out to a Google person he knows
23:32:44 (not in the right dept tho)
23:33:02 🤞
23:33:11 fireonlive, you can tell if they say (opinions my own)
23:33:18 haha
23:33:21 true true
23:33:32 pabs: All good! :-)
23:34:12 ah! https://news.ycombinator.com/item?id=38228481 :)
23:40:41 PaulWise edited Blogger (+294, add second strategy): https://wiki.archiveteam.org/?diff=51127&oldid=47348
23:42:42 PaulWise edited Blogger (+116, add list of blogs found with the second strategy): https://wiki.archiveteam.org/?diff=51128&oldid=51127
23:45:01 here's a list of 144 blogs extracted from my irc logs (excluding #archivebot but not other archiveteam channels): https://transfer.archivete.am/2sEI9/blogspot_blogs_from_irc_logs.txt
23:46:27 (some of these are from topics of channels i scanned during the freenode implosion--i had totally forgotten about that)
23:48:34 does this mean we might be able to do it?? (I would be SO relieved lol - even some is better than none)
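A rough Python sketch of the crawl strategy pabs describes at 23:25:26 (blog front pages -> profile links -> more blogs), as a companion to the shell script linked above. The URL patterns are assumptions about what typically appears on Blogger pages, and the captcha/rate-limit handling that actually dominates the problem is left out:

    # Breadth-first crawl: seed blogs -> profile links on their front pages ->
    # blogspot links on those profiles -> more blogs, and so on.
    import re
    import time
    from collections import deque

    import requests

    PROFILE_RE = re.compile(r'https?://www\.blogger\.com/profile/\d+')
    BLOG_RE = re.compile(r'https?://[a-z0-9-]+\.blogspot\.com')

    def crawl(seed_blogs, max_pages=100):
        seen_blogs = set(seed_blogs)
        seen_profiles = set()
        queue = deque(seed_blogs)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=30).text
            except requests.RequestException:
                continue
            fetched += 1
            time.sleep(1)  # token politeness; the real limits (and captchas) are far harsher
            for profile in PROFILE_RE.findall(html):
                if profile not in seen_profiles:
                    seen_profiles.add(profile)
                    queue.append(profile)
            for blog in BLOG_RE.findall(html):
                if blog not in seen_blogs:
                    seen_blogs.add(blog)
                    queue.append(blog)
        return seen_blogs

    # Seed with any known blog list, e.g. the transfer.archivete.am lists above.
    for blog in sorted(crawl(['https://googleblog.blogspot.com'])):
        print(blog)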
23:49:43 Tomodachi94 created Fextralife (+458, Create page): https://wiki.archiveteam.org/?title=Fextralife
23:49:44 Tomodachi94 uploaded File:Fextralife banner.png: https://wiki.archiveteam.org/?title=File%3AFextralife%20banner.png
23:50:04 One potential concern is that many blogs will not be at risk, and I guess we don't have a good way of identifying which ones are.
23:51:18 yeah - i know it's any google account that hasn't been touched in 2 years - but that doesn't necessarily mean that the blogs are representative of the accounts
23:52:37 Any blog with a post in the past 2 years would *probably* be fine, but scheduled posts are a thing, so it's not reliable.
23:53:41 omg i totally forgot about that...
23:55:31 not sure why it keeps disconnecting me lol so annoying
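On the "which blogs are actually at risk" question from 23:50:04: one crude heuristic would be to check the newest entry in each blog's Atom feed and flag blogs with nothing posted in the last two years. The feed path below is the standard Blogger posts feed as far as I know, but, per the scheduled-posts caveat above, this can only ever be a rough filter, not a reliable one:

    # Flag blogs whose most recent post is older than ~2 years.
    from datetime import datetime, timedelta, timezone
    import xml.etree.ElementTree as ET

    import requests

    ATOM = '{http://www.w3.org/2005/Atom}'

    def last_post_date(blog_url):
        resp = requests.get(blog_url.rstrip('/') + '/feeds/posts/default',
                            params={'max-results': 1}, timeout=30)
        resp.raise_for_status()
        entry = ET.fromstring(resp.content).find(ATOM + 'entry')
        if entry is None:
            return None  # blog has no posts at all
        published = entry.find(ATOM + 'published').text
        return datetime.fromisoformat(published.replace('Z', '+00:00'))

    def probably_at_risk(blog_url, years=2):
        last = last_post_date(blog_url)
        cutoff = datetime.now(timezone.utc) - timedelta(days=365 * years)
        return last is None or last < cutoff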