00:00:25 arkiver, pokechu22: here's my list of 156864 raw orange.fr urls: https://transfer.archivete.am/bE5jI/orangefr_raw.txt.zst
00:01:23 Will look at this shortly, thanks
00:01:50 here's my list of 159650 'cleaned' urls (where i cleaned up whitespace, handled transformations like monsite.orange.fr/ -> .monsite-orange.fr, and otherwise took my best guess at anything malformed): https://transfer.archivete.am/SB82D/orangefr_scrubbed.txt.zst
00:02:49 and here's a list of 61667 'bad' urls (which is just the raw list minus the cleaned list): https://transfer.archivete.am/vCBTZ/orangefr_badraw.txt.zst
00:05:34 (the cleaned list is longer than the raw list because i (a) generated if i only had /path.ext, to avoid no-parent issues, and (b) generated multiple guesses for some malformed urls where i had only the username)
00:06:36 And this is based on scraping a list that they provide, right? So most of the pages should exist?
00:07:06 yes; no
00:07:25 unfortunately a lot of the pages in the directory are down
00:07:57 I'm a bit worried because two of my !a < list jobs for monsite-orange.fr both seem to have resulted in the site banning it (possibly because of too many requests to nonexistent pages, but maybe just because it was running too fast) which is annoying...
00:08:19 the api had 'accessible' and 'status' parameters; i am not sure what the distinction is and chose the values that gave me the largest list
00:08:22 oof :/
00:09:29 i can change those params and get you a shorter list to prioritize, if that would help
00:09:32 An additional annoyance is that each page that doesn't exist redirects twice (https://yachtlink.pagesperso-orange.fr/ -> https://r.orange.fr/r/Oerreur_404 -> https://e.orange.fr/error404.html)
00:09:43 ye
00:09:58 Sure, that'd be helpful as it'd be pretty easy to run that list first and then run the remaining stuff not on that list
00:11:08 ok, will do. probably take a few hours
00:11:17 Alright
00:15:09 list is going to be about 1/3 the size of the big one
00:31:36 thuban: how exactly did you make the badraw list? http://acf.luis.pagesperso-orange.fr/ is valid for instance (it just doesn't work with https)
00:35:09 literally just raw minus scrubbed. that site had a trailing slash in the raw list ("acf.luis.pagesperso-orange.fr/"); i removed those if they were directly on the domain (for deduping purposes)
00:36:49 Oh, not links that seemed like complete junk
00:38:24 yeah, the idea was mostly to have the originals for discoverability (esp for the changed domains)
01:21:20 thuban: some (as in several thousand?) of the ones you have aren't in my list at all, which means there's no archive.org coverage. Unfortunately my organization is a mess and I now have 2GB of lists of URLs so it'll be a bit before I can actually run stuff though... and make sure I'm actually looking at all of this correctly :|
01:29:52 that's ok, take your time! the priority list will probably be done in another 30-45 minutes, if that helps
01:30:58 my VPS has 659GB unused bandwidth for the rest of the month
01:59:02 DogsRNice: trouble with factorio? or just proactive?
02:00:01 no idea i just noticed someone was doing the factorio sites and didnt do the forums
02:04:18 ah ok
02:25:12 I skipped the forums because they're somewhat large - it'd make sense to do them later but I'd rather not start a multi-day proactive thing just yet
02:25:44 If we want to do one it's fine but eh
02:26:21 arkiver, pokechu22: here are my 'priority' lists (scraped with accessible=true and status=active; sites should all be online). these lists are a strict subset of those previously posted
02:26:44 48298 raw urls: https://transfer.archivete.am/eabTo/orangefr_online_raw.txt.zst
02:27:12 49007 cleaned urls: https://transfer.archivete.am/7QXLi/orangefr_online_scrubbed.txt.zst
02:30:38 the 'bad' urls all either had trailing slashes or were of the old *.(orange|wanadoo).fr format with quasi-redirects. trailing slashes are transparent for our purposes, so instead of the entire 'bad' list here are just the redirects
02:31:01 7440 redirect urls: https://transfer.archivete.am/LUa27/orangefr_online_redirect.txt.zst
02:32:03 I'm going to run this with entries like 08.pagesperso-orange.fr/odp/index.htm stripped out (leaving only 08.pagesperso-orange.fr) for now since having both is the kind of situation that can lead to really weird no-parent behavior
02:32:27 hmm, ok
02:32:28 AB also needs either http:// or https:// before each URL; I'll add http to ones with multiple dots and https to ones without
02:33:09 ah, i never remember that. do you want me to do that / any other processing?
02:33:26 I can handle it - I've already built some jank regexes for it :)
02:33:46 ok!
02:34:25 first prefix everything with http:// and then replace ^http://([^/\.]+\.[^/\.]+-orange\.fr)$ with https://\1
04:43:40 I don't think the orange stuff is going to finish on time - running at more than 1 page/second seemed to result in blocks, and after going through about 4.5K seed URLs of 45K URLs we're already at ~125K queued or a day and a half. So at that rate it'd be 15 days to finish, which we don't have. And that's just for this smaller list. Any ideas about how to handle that?
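(Editorial sketch, not part of the original log.) The scheme-prefixing rule from 02:34:25 and the back-of-envelope estimate from 04:43:40 can be sketched as below. The regex is the one quoted in the log. The function names `add_scheme` and `projected_days` are hypothetical, and the estimate assumes the queue keeps growing linearly with the number of seeds processed, which is a simplification rather than anything the log confirms.

```python
import re

# Regex quoted in the log: a bare host with exactly one label in front of
# "<something>-orange.fr" and no path gets https; everything else stays http.
_SINGLE_LABEL_HOST = re.compile(r"^http://([^/\.]+\.[^/\.]+-orange\.fr)$")


def add_scheme(entry: str) -> str:
    """Prefix a scheme-less list entry (hypothetical helper name)."""
    url = "http://" + entry.strip()
    return _SINGLE_LABEL_HOST.sub(r"https://\1", url)


def projected_days(seeds_done: int, seeds_total: int,
                   queued: int, pages_per_second: float) -> float:
    """Rough total crawl time in days.

    Assumption: the queue scales linearly with the seeds processed so far.
    """
    projected_pages = queued * (seeds_total / seeds_done)
    return projected_pages / pages_per_second / 86_400


if __name__ == "__main__":
    for entry in ("08.pagesperso-orange.fr", "acf.luis.pagesperso-orange.fr"):
        print(add_scheme(entry))
    # Figures from the log: ~4.5K of 45K seeds done, ~125K queued, 1 page/s.
    print(f"{projected_days(4500, 45000, 125000, 1.0):.1f} days")
```

With the figures from the log this comes out at roughly two weeks, in line with the "15 days" quoted above.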
04:58:54 i guess i would suggest either seeing if you can reduce the delay (i know it's different infra, but i was able to do all my scraping with 0.5s delay and didn't get banned) or trying to parallelize the load across multiple pipelines
05:00:25 If .5s is fine I can do that - it was originally .25-.375 at con=1
05:00:42 I'm not sure how long they ban for though which makes me nervous about experimenting
05:02:24 as i said, different infra (and it involved a token which i just yoinked from the browser), so can't be sure based just on that. could you try testing with a sacrificial ip, like a home connection?
05:03:07 I guess I could - though I don't have quite the same infra either
05:03:14 i mean on their end
05:03:44 i.e., the directory api being different from the actual page servers
05:04:21 What host is the directory API on?
05:04:47 api.annuaire-pp.orange.fr
05:06:29 ah, yeah, might have different rate-limiting then :|
05:09:11 multiple pipelines is probably easiest/safest, but idk what wrangling them is like
05:09:40 (alas, this is really a job for #Y...)
05:11:49 Theoretically I could just run e.g. all of the pagespro-orange.fr jobs on one pipeline, pagesperso-orange.fr on a second, and monsite-orange.fr on a third (that's trivial by just using different lists), and that's what I originally planned on doing, but it's not easy to do that for in-progress jobs
05:12:41 I'm going to try running pagespro-orange.fr locally since there's no job for that yet (beyond the ones you have in your list)
05:29:21 The other thing that would help is if we could just skip the 2-step redirect chain, but there's no way to apply ignores onto redirect targets so it's going to redownload https://r.orange.fr/r/Oerreur_404 and https://e.orange.fr/error404.html every time it hits a 404 :|
07:32:45 pokechu22: If I can help, I will!
11:23:01 pokechu22: wowturkey still down
16:37:36 So unfortunately, 500-500 delay results in a ban. Happened to me on my residential connection overnight and happened to one of the jobs (not the priority one) I changed yesterday too. I guess the 1-second delay is the only safe one :|
16:48:27 I did, however, build a list of stuff under pagespro-orange.fr that's valid
17:03:55 09:58:59 AM -+rss- Fig Has Joined AWS: https://fig.io/blog/post/fig-joins-aws https://news.ycombinator.com/item?id=37296401
17:42:21 So, what channel do we use for ZOWA?
17:43:06 The ideas from yesterday: zowch z-oww-a nowa zowwa zowaah zowie (plus one that shall not be named)
19:16:06 ooh! ooh! the shall not be named one!
19:17:47 in absence of that, zowch
19:19:52 +1 zowch
19:21:46 FireonLive edited Current Projects (+121, add ZOWA): https://wiki.archiveteam.org/?diff=50608&oldid=50551
19:24:08 one day i'll go through and make 300,000 edits with the https://www.mediawiki.org/wiki/Help:Magic_words#formatdate thing
19:24:21 too bad there doesn't seem to be one for time
19:25:05 hmmm
19:26:09 yeah sadly {{#formatdate:2023-09-29T03:00Z}} doesn't appear to work
19:28:47 FireonLive edited Current Projects (+16, use formatdate for ZOWA, more to come): https://wiki.archiveteam.org/?diff=50609&oldid=50608
19:31:33 i found {{#time}} but what the fuck is this: 2023-09-29UTC03:000
19:32:10 i'll look more into it later :p
19:32:49 mediawiki is really something
19:47:48 #time doesn't seem to account for user preferences.
19:49:50 Yts98 edited ZOWA (+24, Update project status): https://wiki.archiveteam.org/?diff=50610&oldid=50195
19:54:25 ah, darn
19:54:39 thanks yts98 :)
20:01:01 Perhaps we should just have a simple template to render datetimes in a consistent manner. {{datetime|2023-08-28|22:00|CEST|+2}} → {{#formatdate:2023-08-28}} 22:00 CEST (UTC+2) or similar
20:01:39 i'd be up for something that's consistent
20:01:57 The last two parameters could be optional, and the default would be UTC.
20:02:23 people wildly get confused with named timezones though so perhaps we could leave that out
20:02:38 EST vs EDT, even big streamers scheduling things
20:03:05 'hey you know it's DT over there now.. so is happening at 7 or 8?'
20:03:15 seems to come up a lot lol
20:04:36 'ET'
20:04:42 (ノಥ益ಥ)ノ彡┻━┻
20:04:53 too bad we can't just link them all to something like (js-ridden) https://www.timeanddate.com/worldclock/converter.html?iso=20230831T030000&p1=1440
20:04:55 :P
20:05:06 'type where you are and see what it is'
20:05:30 https://mkx9delh5a.execute-api.ca-central-1.amazonaws.com/uploads/e5654758afc913ec/image.png (i added Ottawa in this example)
20:07:39 the frowny faces are because it's mainly used for figuring out when to meet i guess
20:08:42 JAA: can we pls kill DST everywhere tks
20:08:45 T_T
20:09:00 inb4 perma-dst everywhere because i guess that sounds nicer to politicians
20:16:54 Yes please
20:19:40 as long as it's gone i'll accept it
20:19:49 :D
20:20:04 (the DST vs ST 'final time' debate)
20:20:30 Same, I don't even care anymore which one is chosen, just get rid of the stupid transition twice per year.
20:21:33 for sure
20:32:09 pokechu22: that sucks. multiple pipelines, then? i know you can't really do that with the jobs already in progress, but i don't think duplicating some of the work would hurt
20:32:13 (i also don't see any reason it needs to be done by domain--seems better to just split evenly)
20:34:35 Yeah, there's no real reason to split by domain, other than how I was building up my own lists originally. If it were an !a < list job for example.com/foo example.com/bar example.org/baz example.org/quux it would make sense to split example.com and example.org into two jobs to fully avoid !a < list issues, but we've already got multiple subdomains and multiple domains doesn't
20:34:38 make much of a difference
20:35:08 Unfortunately there are only 6 different sets of pipelines with distinct IPs, of which 3 are banned and 2 currently have jobs running on them
20:35:21 oof
20:35:51 the remaining one is also basically always full since it effectively only has 4 slots at the moment and they're usually filled with long-running jobs :|
20:36:41 Hopefully the bans don't last too long and we can get the other ones back into use
20:36:46 :I
20:36:48 yeah
20:38:05 at least we'll definitely get through all the front pages from the priority list (and probably their assets as well)
20:40:11 Yeah
20:53:17 +1 zowch
21:03:48 what's ZOWA
21:04:03 https://wiki.archiveteam.org/index.php/ZOWA
21:04:41 oh yikes, video... any idea of size?
21:06:50 #zowch for ZOWA
21:07:28 anyone updating channel on wiki?
21:07:38 Does archiveteam accept donations? if so, I hope they all go to the guy responsible for coming up with the channel names
21:07:42 he's got a hard jo
21:07:43 b
21:07:43 is the telegram thing still going nuts?
21:08:07 Like is the redoing everything thing still active or is it back to normal?
21:09:19 so many OWASP channels
21:11:13 appledash: https://wiki.archiveteam.org/index.php/Donate
21:12:22 wtf, the fact that someone who has only donated $40 is top 15 is a travesty
21:12:28 Remind me to contribute when I get paid
21:13:55 Can someone fill me in on the owasp drama? Maybe in -ot
21:14:37 I have no fucking idea I just jumped on the bandwagon
21:14:50 appledash: It has only been in use and publicised since a couple months ago during the Imgur project, although the page has existed for years.
21:15:00 Ahhh
21:16:30 Switchnode edited ZOWA (+5, add irc channel): https://wiki.archiveteam.org/?diff=50611&oldid=50610
21:16:53 I queued one more job for orange.fr URLs that aren't found on archive.org at all, though whether or not the pipeline slot will free up remains to be seen
21:33:11 JustAnotherArchivist edited ZOWA (+56, Reference for shutdown): https://wiki.archiveteam.org/?diff=50612&oldid=50611
21:44:40 rewby: how are the targets and IA doing? do you have a giant backlog in temporary storage again?
21:50:16 nicolas17: I have about 31.2TiB in temp storage. And another 200 or so TiB left on it.
21:50:30 Targets are fine at the moment
21:50:47 It's just that all active projects managed to hit bugs all at once as far as I can tell
21:51:32 Based on what I've read (and I'm not an authority here): shreddit is paused due to some concern around image capture maybe not working right
21:51:39 deadcat is just mostly done
21:51:57 oh, I thought shreddit was still paused to give capacity to gfycat/xuite
21:51:59 (and waiting for an update for the last few items)
21:52:04 xuite is just slow
21:52:14 (something something asia is a pain to get data in and out of)
21:52:29 If you have ipv6, I think xuite could use your help
21:52:49 telegram was provided offload capacity but I don't know if it's being used yet
21:53:10 telegram seems to have 0 in todo
21:53:25 Actually, tg is slowly returning stuff
21:53:28 So looks to be working
21:53:42 Uh... what else... urls is still paused
21:54:06 I think a bunch of stuff in tg was stashed away, maybe it needs to be brought back, but idk status, I wasn't even in the channel the last few days
21:54:09 Although that's been hooked up to offload too in case arkiver wants to have a go at it (although probably not at full speed to conserve space)
21:54:31 And yeah... that's about it?
21:55:24 shreddit was paused while i.reddit.com's new javascript/etc fuckery is checked to ensure the data we save is good
21:55:29 AIUI
21:55:48 if there's "free" capacity we can slightly open the faucet on imgur (:
21:56:03 imgur is slowly deleting images off of the CDN now, per BigBrain
21:56:11 302s are rising from the canary list
21:56:19 Ah
21:56:23 I'll add it to offload I guess
21:56:28 And then it's up to arkiver and JAA to turn that on and off
21:56:34 :) thanks
21:56:50 Mind you, I've only got like a quarter of a PiB of space
21:56:58 And that has to last us until the IA comes back
21:57:12 are you not uploading anything to IA right now?
21:57:16 Not yet
21:57:18 Code's not ready for it
21:57:51 It's nice to see nearly 200M items in queue and realize for once it's only like ~75GiB
21:58:13 vokunal|m: lol, in what project?
21:58:22 xuite if i had to guess
21:58:58 Imgur. Though is it probably the item size avg bugged after being offline so long?
21:59:10 nicolas17: Getting code ready for uploading to IA is a lower prio than actually capturing data atm
21:59:11 telegram is still running (so items submitted to the bot are still processed), but its backlog was stashed and since other projects are paused it's not receiving items from outlinks (which were the majority of its volume)
22:00:38 ah
22:00:44 vokunal|m: that math doesn't look right :P
22:01:02 item size is 367 KB
22:02:40 arkiver is the deduplication still turned off for telegram?
22:03:49 rewby: imgur has a lot of 'redo' that will probably have low success rate, so we can also regulate speed that way
22:03:59 move some stuff from redo to todo to slow down, ask me to add a bruteforced list to speed up :P
22:04:09 73TB? I think i divided instead of multiplied
22:05:03 vokunal|m: yes that's the right multiplication, but note a lot of those 200M are retries and will fail
22:12:19 FireonLive edited Current Projects (+27, add IRC channel for ZOWA): https://wiki.archiveteam.org/?diff=50613&oldid=50609
22:22:00 flashfire42: yes, i'll turn that on shortly again
22:23:47 Probably a good idea
22:28:48 https://wiki.archiveteam.org/index.php/Template:@
22:28:51 interesting template
22:29:20 (it's an image!)
22:29:31 oh, for emails
22:30:31 (well one email :3)
22:31:04 I wonder if we will ever find out the reason behind the ingestion issues
22:32:07 And are we slowly pushing from the offload storage or is it just sitting quietly?
22:33:52 not uploading to IA from offload atm, code needs to be written (rewby mentioned it above)
22:34:35 My plan is to spend some time later this week getting uploading going
22:35:25 FireonLive edited Template:IRC-Hackint (+22, +deleteme in favour of Template:IRC): https://wiki.archiveteam.org/?diff=50614&oldid=41452
22:35:29 i have no idea what i went to wiki.archiveteam.org for initially, but it ended in that
22:41:26 FireonLive edited YouTube (-2, #youtubearchive → on haitus): https://wiki.archiveteam.org/?diff=50615&oldid=50569
22:41:39 it wasn't that either
22:41:43 oh well :D
23:05:48 front pages of 'online' orange.fr sites are done :D
23:07:25 ~8 days' worth of requests remaining in queue, so front page assets at least should just finish before shutdown
23:09:00 awesome
23:09:08 ^_^
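(Editorial sketch, not part of the original log.) The Imgur backlog arithmetic discussed between 21:57:51 and 22:05:03 (about 200M queued items at an average item size of 367 KB) can be checked as below. The helper name `backlog_bytes` is hypothetical, and treating 1 KB as 1000 bytes is an assumption; the log does not say which unit convention the tracker uses.

```python
def backlog_bytes(item_count: int, avg_item_bytes: float) -> float:
    """Total backlog size: item count times average item size (hypothetical helper)."""
    return item_count * avg_item_bytes


# Figures from the log; 1 KB = 1000 bytes is an assumption.
ITEMS_IN_QUEUE = 200_000_000
AVG_ITEM_BYTES = 367 * 1000

total = backlog_bytes(ITEMS_IN_QUEUE, AVG_ITEM_BYTES)

if __name__ == "__main__":
    print(f"{total / 1e12:.1f} TB (decimal)")
    print(f"{total / 2**40:.1f} TiB (binary)")
```

As noted in the log, many of the queued items are retries that are expected to fail, so this is an upper bound rather than a forecast.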