00:58:19 <h2ibot> Ka edited List of websites excluded from the Wayback Machine (-26, as of today ezboard appears to be available -…): https://wiki.archiveteam.org/?diff=50700&oldid=50542
00:58:20 <h2ibot> Ka edited Twitter (-60, /* Vital Signs */): https://wiki.archiveteam.org/?diff=50701&oldid=50593
01:00:19 <h2ibot> JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=50702&oldid=50700
01:03:19 <h2ibot> DigitalDragon edited ArchiveTeam Domains (-277, remove dead domains): https://wiki.archiveteam.org/?diff=50703&oldid=47046
01:33:11 <flashfire42|m> Ukraine defence minister sacked just came across on the news we can archive that
01:33:35 <fireonlive> https://www.reuters.com/world/europe/ukraines-zelenskiy-moves-replace-wartime-defence-minister-2023-09-03/
01:33:56 <JAA> Yeah, saw it earlier, but not sure what there is to archive really.
01:34:56 <JAA> Maybe the MoD website.
01:39:28 <JAA> https://www.mil.gov.ua/ is using Buttflare in an aggressive enough configuration that AB can't grab it.
01:46:21 <fireonlive> :(
03:04:42 <h2ibot> FireonLive edited Talk:Main Page (+17, looking at wanted templates): https://wiki.archiveteam.org/?diff=50704&oldid=48686
03:06:42 <h2ibot> FireonLive edited NewsGrabber (+12, fixup infobox): https://wiki.archiveteam.org/?diff=50705&oldid=50579
03:07:42 <h2ibot> FireonLive edited NewsGrabber (+24, it's a.... DPoS): https://wiki.archiveteam.org/?diff=50706&oldid=50705
03:20:52 <TheTechRobo> fireonlive: How dare you be so rude to Template:Special case and Template:On hiatus?
03:21:36 <fireonlive> they deserved it! 💢🥊
03:25:47 <nicolas17> fireonlive: https://cdn.discordapp.com/attachments/286612533757083648/1148094078022602944/m2-res_640p.mp4
03:29:28 <fireonlive> aww :3
04:12:23 <fireonlive> so english wikipedia is really "that bad" eh
04:12:37 <fireonlive> (outside of content)
05:58:14 <h2ibot> Yts98 edited ZOWA (+45, Update information, datetimeify): https://wiki.archiveteam.org/?diff=50707&oldid=50639
06:06:21 <manu|m> not sure if you’ve seen it already, but the Telegram project (ArchiveTeam’s choice atm) seems to have no outstanding TODOs
12:54:30 <plcp> on the orange FAI pages topic (scheduled to disappear tomorrow) we took upon ourselves (me & few friends) to dump as much as possible of them
12:54:35 <plcp> for now, we have ~5k warcs (one per page/website/subdomain) taking a bit more than a hundred gigabyte
12:55:21 <plcp> I hope that we can grab a couple thousand more til the end (that'll be ~10-15% of the sites, completely archived, mostly the larger ones)
12:55:33 <plcp> warcs produced using wget's warc support (we've done something like « for i in $(cat pages.txt); do wget -r --warcfile=$i "$i" ... » +some other flags, ratelimits, etc)
12:56:05 <plcp> (we've tried to reach orange to ask for more time, but I don't have that much hope on that request)
12:56:35 <plcp> now, I'm wondering, what are we going to do with all these warcs
12:57:34 <plcp> should we merge everything into some megawarcs & upload it ourselves to the IA under our own names?
12:58:22 <plcp> or share it here and do something something
12:58:47 <plcp> (it's my first time doing this kind of thing, tbh I'm not sure of anything)
12:59:10 <plcp> cc pokechu22 maybe :o)
12:59:58 <imer> probably don’t need to megawarc them, yes to uploading them to IA
13:00:55 <imer> they wont be indexed into the WBM since the crawl is untrusted
13:01:14 <imer> (i dont really have an idea, so take this with a grain of salt and/or wait for me to be corrected)
13:02:18 <imer> if not indexed into IA a crawl per domain/user is probably easier to work with as well
13:02:31 <imer> indexed into WBM*
13:03:25 <plcp> yeah I don't expect them to get indexed (at best I'll setup later some pywb somewhere to expose them)
13:06:59 <plcp> we also had to use a proxy to apply some rewrite rules, as some older pages had only dead links (rewriting perso.wanado.fr/<domain> into <domain>.pagesperso-orange.fr "ressurected" some sites, for example)
13:07:13 <plcp> thus, strictly speaking, it's not a carbon copy of the pages
13:55:10 <thuban> speaking of those orange.fr pages, deadline is tomorrow and our archivebot jobs are certain not to finish. i know it's technically possible to extract the remaining urls from the ab jobs and put them into #// (with a pattern-based rate limit to avoid ddosing); any chance of an admin actually doing so?
13:55:13 <thuban> the ab jobs are rate-limited by ip bans, so dumping to #// would allow us to get more done even if the anti-ddos rate limit had to be pretty tight
13:59:49 <thuban> (the relevant domains, for convenience, are orange.fr, monsite-orange.fr, pagesperso-orange.fr, pagespro-orange.fr, and woopic.com)
14:06:50 <imer> #// doesn't really do recursive discovery
14:07:13 <thuban> true but irrelevant
14:07:33 <imer> Ah right, so the list is already complete?
14:07:52 <thuban> no, but there are millions of urls in the queue we won't get to
14:08:11 <imer> I see, yeah that might make sense
14:17:43 <nstrom|m> we'd also need a target w/ some space on it, atm everything that's still running is at a crawl because optane9 is blocked most of the time
14:18:26 <nstrom|m> but it's not a bad idea. I'd throw some boxes at it if it was up and running agani
14:18:39 <thuban> damn, i thought we had some buffer
14:19:23 <imer> Is it struggling currently? Unless we're pushing a lot of data through it we should have a lot of temp space & ingest capacity left
14:19:39 <thuban> either way, i will reiterate my previous offer of space if it would be helpful
14:19:45 <imer> Rew.by might need a ping if its stuck or something
14:20:41 <imer> "A lot" should be a few gbit/s if i remember right
14:30:41 <imer> rewby: ^ optane9 seems to be having issues (seeing -1 on my end)
14:36:51 <nstrom|m> it's been letting things through in little chunks every so often so not completely stuck, but def struggling
14:39:23 <imer> New tech for offloading so some teething issues are expected, hopefully ots an easy fix
15:39:51 <vokunal|m> Yesterday the limits were taken off of #down-the-tube. If that's still the case, it could have a lot to do with why uploads are frozen
15:47:19 <arkiver> thuban: any rate limits on orange
15:47:19 <arkiver> ?
15:49:16 <thuban> arkiver: 1 request/second appears to be the maximum safe rate for a single ip
15:49:21 <thuban> no idea what it can sustain in total
15:49:25 <arkiver> but do they fall over at high rate?
15:49:29 <arkiver> ah okey
15:49:39 <arkiver> i was hoping this could be done with AB fully but sounds like that's not the case
15:49:50 <thuban> mais, non
15:51:05 <arkiver> which ones have you not done with AB yet?
15:51:14 <arkiver> thuban: ^
15:54:11 <thuban> i'm not 100% sure how the lists we discovered were sliced up; pokechu22 put the actual jobs in
15:54:13 <thuban> but i _believe_ that everything we found went into ab, so pulling the 'remaining' urls of the four current jobs should cover everything we can
17:32:31 <thuban> hm, correction: i think everything got queued for monsite-orange.fr and pagesperso-orange.fr, but not pagespro-orange.fr
17:34:01 <thuban> here is a list of 953 pagespro-orange.fr sites _not_ in the 'priority' job (scrubbed, suitable for ab): https://transfer.archivete.am/l2Sws/orangefr_pagespro_scrubbed.txt.zst
17:35:40 <thuban> do we have enough archivebot pipelines to add this one? if so i would appreciate someone (pokechu22?) running it
17:35:59 <thuban> (i expect most sites to not work, but a few will)
17:38:27 <pokechu22> thuban: I already did a job for the entirety of pagespro that I confirmed worked (by a local crawl) but I can try to put that in too
17:39:01 <pokechu22> I need to eat first though
17:39:04 <thuban> oh, ok! sorry, must have not seen that since it already finished
17:39:23 <thuban> no problem then
17:41:40 <pokechu22> It's worth saving the list of stuff that doesn't work in any case, which I don't think we have done (my local crawl has a warc, but it'd be good to do it via AB too)
17:50:51 <thuban> pokechu22: on that note, can you explain how you generated your lists? (arkiver asked earlier what we've done already, and while i _think_ everything in my 'full' list made it into your seed_urls lists, i wasn't sure)
17:52:54 <pokechu22> The first "full" lists of mine are from the wayback CDX server for various domains (e.g. acounting for perso.wanadoo.fr and perso.orange.fr being on pagesperso-orange.fr now), along with a bit of bing/google search (but that's limited to only a few entries). I mixed stuff into the later lists though
17:54:13 <pokechu22> I *think* the priority list is just your list as-is, while the pagesperso-orange.fr_seed_urls.txt and monsite-orange.fr_seed_urls_v2.txt are my lists as-is, and pagesperso-orange.fr_monsite-orange.fr_seed_urls_2_no_coverage.txt is whatever was in your full list that's not in one of my lists. So there's some overlap between the priority job and the other jobs but everything
17:54:16 <pokechu22> should be represented.
17:56:16 <thuban> ok, cool
18:00:42 <thuban> (we might want to do the list of redirects, for discoverability--there should be enough time)
18:10:26 <ShadowJonathan> (Got told to move here) Heya, I'm planning to mirror-archive a novel/obscure website that may be shutting down soon (currently planning to use WarcMiddleware), with the intention to ask to be included in the archiveteam collection, so that it may show up on the wayback machine.
18:10:43 <ShadowJonathan> Is there anything important that I should know before I do this? Anything regarding the quality/origin of this WARC, or do yall recommend to use a different tool to mirror a complete website?
18:11:20 <thuban> ShadowJonathan: as a rule the internet archive will not whitelist third-party warcs for the wayback machine; i recommend you request the site in #archivebot and let us do it for you.
18:12:41 <ShadowJonathan> Understandable, that's the answer I was after, I'll take a look there then
18:13:49 <nicolas17> yeah, we can give suggestions on how you can best produce a quality WARC, but it's *not* going to appear on the wayback machine either way
18:14:47 <ShadowJonathan> Entirely understandable, I see that the domain I'm after is already in archivebot, but its from 2019, I'll take a peek at the documentation before I'll poke the channel with my question
18:34:56 <imer> We (as in people with AB access) can do a re-ar chive if theres new content/reason for it
18:40:32 <pokechu22> thuban: looks like essentially all of those weren't on the list of URLs I originally tested, so it is new data
18:41:19 <thuban> pokechu22: aha, thanks for running it
18:53:07 <JAA> ShadowJonathan: I've never heard of WarcMiddleware before, but based on a quick glance at the code, it does not appear to be good software.
18:53:33 <JAA> It 'converts a Scrapy request to a WarcRequestRecord' here: https://github.com/odie5533/WarcMiddleware/blob/bc63a1caa48a542df4fa0e877ede362c64ddcd25/warcmiddleware.py#L32-L61
18:53:41 <JAA> That obviously won't preserve the data as sent by the server.
18:54:22 <JAA> So, yet another person writing WARCs that doesn't understand the purpose of WARCs...
18:54:24 <ShadowJonathan> Hmmm...
18:54:32 <ShadowJonathan> It's listed on the wiki though
18:54:46 <JAA> Yeah, that list needs an overhaul.
18:54:53 <ShadowJonathan> And looked to be the first one to mirror an entire website
18:55:10 <ShadowJonathan> But tbh if yall don't accept third party WARCs, there's a number of resources that need to be updated
18:55:40 <ShadowJonathan> One gist I found seems to suggest to just poke one of yall here to move it into the collection, which seemed very very trusting, but yeah, ofc the policy has changed inbetween then and now
18:56:02 <ShadowJonathan> https://gist.github.com/Asparagirl/6206247
19:00:40 <thuban> i suspect "If you're uploading a WARC that should be included in the ArchiveTeam collection" meant 'if you are a member of archiveteam uploading part of an archiveteam project (and it is 2014 and we are still doing things this way)'
19:01:01 <thuban> but yes, very misleading in present context
19:08:05 <fireonlive> grab-site and warcprox are 'blessed' by JAA i believe
19:08:17 <fireonlive> well, seem not bad
19:08:18 <fireonlive> :p
19:11:01 <JAA> wpull (and by extension grab-site) isn't perfect but doesn't have grave errors at least.
19:11:07 <fireonlive> :)
19:11:22 <JAA> warcprox isn't blessed by me, but because it comes from IA, it's assumed good until proven otherwise.
19:11:35 <fireonlive> ye, that's better wording sorry
19:11:46 <JAA> wget-at is also good. (wget is not.)
19:11:49 <fireonlive> i don't think there's anything else to add to 'the list'
19:12:02 <fireonlive> ah right archiveteam-flavoured wget
19:12:13 <JAA> qwarc also writes WARCs according to the spec to the best of my knowledge and capability.
19:12:31 <JAA> Everything else is best presumed terrible and unusable until proven otherwise.
19:12:44 <fireonlive> :)
19:13:00 * fireonlive taps the follow the spec sign
19:14:11 <JAA> :-)
19:16:29 <JAA> I'll update the tools list.
19:17:17 <fireonlive> was just looking at that and wondering if we need a 'recommended' column or something like that
19:17:19 <fireonlive> lol
19:17:23 <fireonlive> thanks
19:54:26 <h2ibot> JustAnotherArchivist edited The WARC Ecosystem (+1536, /* Tools */ Add recommendation column): https://wiki.archiveteam.org/?diff=50708&oldid=50444
19:54:37 <JAA> Now the list looks pretty sad.
19:55:22 <fireonlive> lots of red :(
20:03:27 <h2ibot> JustAnotherArchivist edited The WARC Ecosystem (+758, /* Tools */ Add wget-at and qwarc): https://wiki.archiveteam.org/?diff=50709&oldid=50708
20:04:28 <h2ibot> Rexma edited Deathwatch (+56, /* 2023 */ its still up and i checked some…): https://wiki.archiveteam.org/?diff=50710&oldid=50698
20:06:29 <h2ibot> FireonLive edited The WARC Ecosystem (-8, make table fit better on smaller screens): https://wiki.archiveteam.org/?diff=50711&oldid=50709
20:06:50 <JAA> :-)
20:07:06 <fireonlive> (the first really long tests link flew the recommended off screen for me)
20:07:08 <fireonlive> :)
20:07:24 <JAA> Yeah, same, actually, didn't check what was causing it though.
20:07:49 <fireonlive> ahh =]
20:07:53 <fireonlive> gotta love tables haha
20:18:06 <vokunal|m> I was confused when it said the wiki page was edited to fit better on smaller screens. It didn't fit in mine before, and mine's 32 inch. Then I remembered I keep the wiki in 150% zoom
21:02:58 <flashfire42> heh time to scrape the betting channels and weird bullshit because we are nearly out of telegram items
21:08:42 <nicolas17> are we uploading stuff to IA yet or are we still filling up temporary storage?
21:09:04 <Barto> flashfire42: call Gooshka, he'll figure out a way to queue telegram stuffs :-)
21:09:24 <flashfire42> Gooshka is working on it. so am I. and nicolas17 I dont honestly know the answer to that
21:09:28 <Barto> :-)
21:09:38 <flashfire42> that_lurker is as well
21:10:05 <nicolas17> it's not a big deal if we go idle, we don't *have* to keep workers busy... is that "weird bullshit" useful to archive? :P
21:10:35 <flashfire42> *wiggles hand*
21:10:35 <flashfire42> sorta
21:20:08 <nicolas17> instead of "oh no workers are idle, time to throw whatever garbage we find into the queue to keep them busy", we should be saying "oh finally workers are idle, now the targets can finally catch up with their uploads" :P
21:21:01 <flashfire42> are we doing uploads?
21:21:16 <nicolas17> I don't know what's the status, that's why I was asking
21:22:08 <flashfire42> If I can get confirmation we are doing uploads again I am quite more than happy to start letting it play catchup
21:22:30 <flashfire42> But the peoples demand work
21:22:42 <nicolas17> isn't it even worse if we're *not* doing uploads?
21:23:19 <JAA> Yes
21:23:46 <flashfire42> fair call if this is a veiled request to stop queueing I can stop
21:24:26 <flashfire42> or at least stop mass queueing
21:24:46 <nicolas17> I'm not saying "don't add stuff", I'm not even saying "the stuff you're adding is worthless crap" (I don't know if it is), just "*if* it's worthless crap then don't add it just to keep things busy"
21:25:07 <nicolas17> whether we have capacity for it or not, is not for me to say :)
21:25:36 <DigitalDragons> I don't see any new items on the archiveteam IA account so it would seem uploads are not happening
21:26:22 <nicolas17> I'm once again wishing for a graph of total available space on targets
21:27:01 <flashfire42> admitedly some of my queueing is just busy work.
22:49:15 <fireonlive> re: Geoff: https://www.geoffchappell.com/ at the least
22:50:21 <fireonlive> https://www.linkedin.com/in/geoffchappellsoftwareanalyst as well
22:59:20 <JAA> The website's been run through AB earlier.
22:59:26 <fireonlive> ah awesome :)
22:59:33 <fireonlive> sorry i should check fart more
22:59:46 <JAA> LinkedIn is horrible and generally not archiveable.
22:59:49 <JAA> HTTP 999
23:00:33 <fireonlive> ugh yeah, not too surprising
23:00:40 <TheTechRobo> my favourite status code, 999 Fuck Yourself
23:02:54 <DigitalDragons> 9xx "asshole errors" group
23:28:05 <h2ibot> Yts98 edited Collecting items randomly (+1153, unify algebraic notation, do some programming…): https://wiki.archiveteam.org/?diff=50712&oldid=21529