00:58:19 Ka edited List of websites excluded from the Wayback Machine (-26, as of today ezboard appears to be available -…): https://wiki.archiveteam.org/?diff=50700&oldid=50542
00:58:20 Ka edited Twitter (-60, /* Vital Signs */): https://wiki.archiveteam.org/?diff=50701&oldid=50593
01:00:19 JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=50702&oldid=50700
01:03:19 DigitalDragon edited ArchiveTeam Domains (-277, remove dead domains): https://wiki.archiveteam.org/?diff=50703&oldid=47046
01:33:11 "Ukraine defence minister sacked" just came across on the news; we can archive that
01:33:35 https://www.reuters.com/world/europe/ukraines-zelenskiy-moves-replace-wartime-defence-minister-2023-09-03/
01:33:56 Yeah, saw it earlier, but not sure what there is to archive really.
01:34:56 Maybe the MoD website.
01:39:28 https://www.mil.gov.ua/ is using Buttflare in an aggressive enough configuration that AB can't grab it.
01:46:21 :(
03:04:42 FireonLive edited Talk:Main Page (+17, looking at wanted templates): https://wiki.archiveteam.org/?diff=50704&oldid=48686
03:06:42 FireonLive edited NewsGrabber (+12, fixup infobox): https://wiki.archiveteam.org/?diff=50705&oldid=50579
03:07:42 FireonLive edited NewsGrabber (+24, it's a.... DPoS): https://wiki.archiveteam.org/?diff=50706&oldid=50705
03:20:52 fireonlive: How dare you be so rude to Template:Special case and Template:On hiatus?
03:21:36 they deserved it! 💢🥊
03:25:47 fireonlive: https://cdn.discordapp.com/attachments/286612533757083648/1148094078022602944/m2-res_640p.mp4
03:29:28 aww :3
04:12:23 so english wikipedia is really "that bad" eh
04:12:37 (outside of content)
05:58:14 Yts98 edited ZOWA (+45, Update information, datetimeify): https://wiki.archiveteam.org/?diff=50707&oldid=50639
06:06:21 not sure if you’ve seen it already, but the Telegram project (ArchiveTeam’s choice atm) seems to have no outstanding TODOs
12:54:30 on the orange FAI (ISP) pages topic (scheduled to disappear tomorrow), we took it upon ourselves (me & a few friends) to dump as much of them as possible
12:54:35 for now, we have ~5k warcs (one per page/website/subdomain) taking up a bit more than a hundred gigabytes
12:55:21 I hope that we can grab a couple thousand more 'til the end (that'll be ~10-15% of the sites, completely archived, mostly the larger ones)
12:55:33 warcs produced using wget's warc support (we've done something like « for i in $(cat pages.txt); do wget -r --warc-file=$i "$i" ... » + some other flags, rate limits, etc)
12:56:05 (we've tried to reach orange to ask for more time, but I don't have much hope for that request)
12:56:35 now, I'm wondering, what are we going to do with all these warcs
12:57:34 should we merge everything into some megawarcs & upload it ourselves to the IA under our own names?
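A minimal sketch of the per-site wget loop described at 12:55:33; the politeness flags, the rate limit, and the assumption that pages.txt holds one hostname per line are illustrative guesses rather than the exact command used.

    # Sketch only: flags and limits are assumptions, not the original command.
    while read -r site; do          # pages.txt assumed to hold one hostname per line
        wget --recursive --level=inf --page-requisites --no-parent \
             --wait=1 --random-wait --limit-rate=500k \
             --warc-file="$site" "http://$site/"
    done < pages.txt

(Note the caveat later in the log, at 19:11:46, that plain wget's WARC output is not considered great; wget-at or grab-site are the recommended tools.)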
12:58:22 or share it here and do something something
12:58:47 (it's my first time doing this kind of thing, tbh I'm not sure of anything)
12:59:10 cc pokechu22 maybe :o)
12:59:58 probably don’t need to megawarc them, yes to uploading them to IA
13:00:55 they won't be indexed into the WBM since the crawl is untrusted
13:01:14 (i don't really have an idea, so take this with a grain of salt and/or wait for me to be corrected)
13:02:18 if not indexed into IA a crawl per domain/user is probably easier to work with as well
13:02:31 indexed into WBM*
13:03:25 yeah I don't expect them to get indexed (at best I'll set up some pywb somewhere later to expose them)
13:06:59 we also had to use a proxy to apply some rewrite rules, as some older pages had only dead links (rewriting perso.wanadoo.fr/ into .pagesperso-orange.fr "resurrected" some sites, for example)
13:07:13 thus, strictly speaking, it's not a carbon copy of the pages
13:55:10 speaking of those orange.fr pages, the deadline is tomorrow and our archivebot jobs are certain not to finish. i know it's technically possible to extract the remaining urls from the ab jobs and put them into #// (with a pattern-based rate limit to avoid ddosing); any chance of an admin actually doing so?
13:55:13 the ab jobs are rate-limited by ip bans, so dumping to #// would allow us to get more done even if the anti-ddos rate limit had to be pretty tight
13:59:49 (the relevant domains, for convenience, are orange.fr, monsite-orange.fr, pagesperso-orange.fr, pagespro-orange.fr, and woopic.com)
14:06:50 #// doesn't really do recursive discovery
14:07:13 true but irrelevant
14:07:33 Ah right, so the list is already complete?
14:07:52 no, but there are millions of urls in the queue we won't get to
14:08:11 I see, yeah that might make sense
14:17:43 we'd also need a target w/ some space on it; atm everything that's still running is at a crawl because optane9 is blocked most of the time
14:18:26 but it's not a bad idea. I'd throw some boxes at it if it was up and running again
14:18:39 damn, i thought we had some buffer
14:19:23 Is it struggling currently? Unless we're pushing a lot of data through it we should have a lot of temp space & ingest capacity left
14:19:39 either way, i will reiterate my previous offer of space if it would be helpful
14:19:45 Rewby might need a ping if it's stuck or something
14:20:41 "A lot" should be a few gbit/s if i remember right
14:30:41 rewby: ^ optane9 seems to be having issues (seeing -1 on my end)
14:36:51 it's been letting things through in little chunks every so often, so not completely stuck, but def struggling
14:39:23 New tech for offloading, so some teething issues are expected; hopefully it's an easy fix
15:39:51 Yesterday the limits were taken off of #down-the-tube. If that's still the case, it could have a lot to do with why uploads are frozen
15:47:19 thuban: any rate limits on orange
15:47:19 ?
15:49:16 arkiver: 1 request/second appears to be the maximum safe rate for a single ip
15:49:21 no idea what it can sustain in total
15:49:25 but do they fall over at high rate?
15:49:29 ah okey
15:49:39 i was hoping this could be done with AB fully but sounds like that's not the case
15:49:50 alas, no
15:51:05 which ones have you not done with AB yet?
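On the host rewrite mentioned back at 13:06:59: the actual crawl applied the rule inside a proxy, so the sed-over-a-link-list version below is only an illustrative assumption of the mapping (old perso.wanadoo.fr/<name> pages now living at <name>.pagesperso-orange.fr); the file names are hypothetical.

    # Illustrative assumption only; the real setup did this in a rewriting proxy.
    sed -E 's#https?://perso\.wanadoo\.fr/([^/"?]+)#http://\1.pagesperso-orange.fr#g' \
        extracted_links.txt > rewritten_links.txt   # hypothetical file names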
15:51:14 thuban: ^
15:54:11 i'm not 100% sure how the lists we discovered were sliced up; pokechu22 put the actual jobs in
15:54:13 but i _believe_ that everything we found went into ab, so pulling the 'remaining' urls of the four current jobs should cover everything we can
17:32:31 hm, correction: i think everything got queued for monsite-orange.fr and pagesperso-orange.fr, but not pagespro-orange.fr
17:34:01 here is a list of 953 pagespro-orange.fr sites _not_ in the 'priority' job (scrubbed, suitable for ab): https://transfer.archivete.am/l2Sws/orangefr_pagespro_scrubbed.txt.zst
17:35:40 do we have enough archivebot pipelines to add this one? if so i would appreciate someone (pokechu22?) running it
17:35:59 (i expect most sites not to work, but a few will)
17:38:27 thuban: I already did a job for the entirety of pagespro that I confirmed worked (by a local crawl), but I can try to put that in too
17:39:01 I need to eat first though
17:39:04 oh, ok! sorry, must not have seen that since it already finished
17:39:23 no problem then
17:41:40 It's worth saving the list of stuff that doesn't work in any case, which I don't think we have done (my local crawl has a warc, but it'd be good to do it via AB too)
17:50:51 pokechu22: on that note, can you explain how you generated your lists? (arkiver asked earlier what we've done already, and while i _think_ everything in my 'full' list made it into your seed_urls lists, i wasn't sure)
17:52:54 The first "full" lists of mine are from the wayback CDX server for various domains (e.g. accounting for perso.wanadoo.fr and perso.orange.fr being on pagesperso-orange.fr now), along with a bit of bing/google search (but that's limited to only a few entries). I mixed stuff into the later lists though
17:54:13 I *think* the priority list is just your list as-is, while the pagesperso-orange.fr_seed_urls.txt and monsite-orange.fr_seed_urls_v2.txt are my lists as-is, and pagesperso-orange.fr_monsite-orange.fr_seed_urls_2_no_coverage.txt is whatever was in your full list that's not in one of my lists. So there's some overlap between the priority job and the other jobs but everything
17:54:16 should be represented.
17:56:16 ok, cool
18:00:42 (we might want to do the list of redirects, for discoverability -- there should be enough time)
18:10:26 (Got told to move here) Heya, I'm planning to mirror-archive a novel/obscure website that may be shutting down soon (currently planning to use WarcMiddleware), with the intention of asking for it to be included in the archiveteam collection, so that it may show up on the wayback machine.
18:10:43 Is there anything important that I should know before I do this? Anything regarding the quality/origin of this WARC, or do y'all recommend using a different tool to mirror a complete website?
18:11:20 ShadowJonathan: as a rule the internet archive will not whitelist third-party warcs for the wayback machine; i recommend you request the site in #archivebot and let us do it for you.
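A sketch of the kind of Wayback CDX query behind the seed lists described at 17:52:54; the exact filters pokechu22 used are not known, so the parameters and the output handling here are assumptions.

    # Pull every archived pagesperso-orange.fr URL known to the Wayback Machine,
    # then reduce to unique scheme://host/ roots for seeding. The parameters are
    # an assumption, not the actual query used for the seed lists.
    curl -s 'https://web.archive.org/cdx/search/cdx?url=pagesperso-orange.fr&matchType=domain&fl=original&collapse=urlkey' \
        | awk -F/ '{print $1 "//" $3 "/"}' | sort -u > pagesperso-orange.fr_hosts.txt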
18:12:41 Understandable, that's the answer I was after; I'll take a look there then
18:13:49 yeah, we can give suggestions on how you can best produce a quality WARC, but it's *not* going to appear on the wayback machine either way
18:14:47 Entirely understandable. I see that the domain I'm after is already in archivebot, but it's from 2019; I'll take a peek at the documentation before I poke the channel with my question
18:34:56 We (as in people with AB access) can do a re-archive if there's new content/reason for it
18:40:32 thuban: looks like essentially all of those weren't on the list of URLs I originally tested, so it is new data
18:41:19 pokechu22: aha, thanks for running it
18:53:07 ShadowJonathan: I've never heard of WarcMiddleware before, but based on a quick glance at the code, it does not appear to be good software.
18:53:33 It 'converts a Scrapy request to a WarcRequestRecord' here: https://github.com/odie5533/WarcMiddleware/blob/bc63a1caa48a542df4fa0e877ede362c64ddcd25/warcmiddleware.py#L32-L61
18:53:41 That obviously won't preserve the data as sent by the server.
18:54:22 So, yet another person writing WARCs who doesn't understand the purpose of WARCs...
18:54:24 Hmmm...
18:54:32 It's listed on the wiki though
18:54:46 Yeah, that list needs an overhaul.
18:54:53 And it looked to be the first one to mirror an entire website
18:55:10 But tbh if y'all don't accept third-party WARCs, there's a number of resources that need to be updated
18:55:40 One gist I found seems to suggest just poking one of y'all here to move it into the collection, which seemed very very trusting, but yeah, ofc the policy has changed in between then and now
18:56:02 https://gist.github.com/Asparagirl/6206247
19:00:40 i suspect "If you're uploading a WARC that should be included in the ArchiveTeam collection" meant 'if you are a member of archiveteam uploading part of an archiveteam project (and it is 2014 and we are still doing things this way)'
19:01:01 but yes, very misleading in the present context
19:08:05 grab-site and warcprox are 'blessed' by JAA i believe
19:08:17 well, seem not bad
19:08:18 :p
19:11:01 wpull (and by extension grab-site) isn't perfect but doesn't have grave errors at least.
19:11:07 :)
19:11:22 warcprox isn't blessed by me, but because it comes from IA, it's assumed good until proven otherwise.
19:11:35 ye, that's better wording, sorry
19:11:46 wget-at is also good. (wget is not.)
19:11:49 i don't think there's anything else to add to 'the list'
19:12:02 ah right, archiveteam-flavoured wget
19:12:13 qwarc also writes WARCs according to the spec, to the best of my knowledge and capability.
19:12:31 Everything else is best presumed terrible and unusable until proven otherwise.
19:12:44 :)
19:13:00 * fireonlive taps the "follow the spec" sign
19:14:11 :-)
19:16:29 I'll update the tools list.
19:17:17 was just looking at that and wondering if we need a 'recommended' column or something like that
19:17:19 lol
19:17:23 thanks
19:54:26 JustAnotherArchivist edited The WARC Ecosystem (+1536, /* Tools */ Add recommendation column): https://wiki.archiveteam.org/?diff=50708&oldid=50444
19:54:37 Now the list looks pretty sad.
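For anyone in ShadowJonathan's position, a hedged sketch of producing a WARC with grab-site, the tool recommended above (19:08-19:16); the URL and option values are placeholders, and installation is assumed to follow the grab-site README.

    # Sketch only: crawl a single site into a WARC with grab-site (assumes the
    # tool is already installed per its README; URL and options are placeholders).
    grab-site 'https://example.com/' --concurrency=2 --no-offsite-links
    # Output lands in a timestamped directory containing the .warc.gz and logs.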
19:55:22 lots of red :(
20:03:27 JustAnotherArchivist edited The WARC Ecosystem (+758, /* Tools */ Add wget-at and qwarc): https://wiki.archiveteam.org/?diff=50709&oldid=50708
20:04:28 Rexma edited Deathwatch (+56, /* 2023 */ its still up and i checked some…): https://wiki.archiveteam.org/?diff=50710&oldid=50698
20:06:29 FireonLive edited The WARC Ecosystem (-8, make table fit better on smaller screens): https://wiki.archiveteam.org/?diff=50711&oldid=50709
20:06:50 :-)
20:07:06 (the first really long tests link pushed the recommended column off screen for me)
20:07:08 :)
20:07:24 Yeah, same, actually, didn't check what was causing it though.
20:07:49 ahh =]
20:07:53 gotta love tables haha
20:18:06 I was confused when it said the wiki page was edited to fit better on smaller screens. It didn't fit in mine before, and mine's 32 inches. Then I remembered I keep the wiki at 150% zoom
21:02:58 heh, time to scrape the betting channels and weird bullshit because we are nearly out of telegram items
21:08:42 are we uploading stuff to IA yet or are we still filling up temporary storage?
21:09:04 flashfire42: call Gooshka, he'll figure out a way to queue telegram stuffs :-)
21:09:24 Gooshka is working on it. so am I. and nicolas17, I don't honestly know the answer to that
21:09:28 :-)
21:09:38 that_lurker is as well
21:10:05 it's not a big deal if we go idle, we don't *have* to keep workers busy... is that "weird bullshit" useful to archive? :P
21:10:35 *wiggles hand*
21:10:35 sorta
21:20:08 instead of "oh no, workers are idle, time to throw whatever garbage we find into the queue to keep them busy", we should be saying "oh, finally the workers are idle, now the targets can catch up with their uploads" :P
21:21:01 are we doing uploads?
21:21:16 I don't know what the status is, that's why I was asking
21:22:08 If I can get confirmation we are doing uploads again, I am more than happy to start letting it play catch-up
21:22:30 But the people demand work
21:22:42 isn't it even worse if we're *not* doing uploads?
21:23:19 Yes
21:23:46 fair call; if this is a veiled request to stop queueing, I can stop
21:24:26 or at least stop mass queueing
21:24:46 I'm not saying "don't add stuff", I'm not even saying "the stuff you're adding is worthless crap" (I don't know if it is), just "*if* it's worthless crap then don't add it just to keep things busy"
21:25:07 whether we have capacity for it or not is not for me to say :)
21:25:36 I don't see any new items on the archiveteam IA account, so it would seem uploads are not happening
21:26:22 I'm once again wishing for a graph of total available space on targets
21:27:01 admittedly some of my queueing is just busy work.
22:49:15 re: Geoff: https://www.geoffchappell.com/ at the least
22:50:21 https://www.linkedin.com/in/geoffchappellsoftwareanalyst as well
22:59:20 The website's been run through AB earlier.
22:59:26 ah awesome :)
22:59:33 sorry, i should check fart more
22:59:46 LinkedIn is horrible and generally not archivable.
22:59:49 HTTP 999
23:00:33 ugh yeah, not too surprising
23:00:40 my favourite status code, 999 Fuck Yourself
23:02:54 9xx "asshole errors" group
23:28:05 Yts98 edited Collecting items randomly (+1153, unify algebraic notation, do some programming…): https://wiki.archiveteam.org/?diff=50712&oldid=21529