-
thuban
arkiver, pokechu22: here's my list of 156864 raw orange.fr urls:
transfer.archivete.am/bE5jI/orangefr_raw.txt.zst
-
pokechu22
Will look at this shortly, thanks
-
thuban
here's my list of 159650 'cleaned' urls (where i cleaned up whitespace, handled transformations like monsite.orange.fr/<slug> -> <slug>.monsite-orange.fr, and otherwise took my best guess at anything malformed):
transfer.archivete.am/SB82D/orangefr_scrubbed.txt.zst
-
thuban
and here's a list of 61667 'bad' urls (which is just the raw list minus the cleaned list):
transfer.archivete.am/vCBTZ/orangefr_badraw.txt.zst
-
thuban
(the cleaned list is longer than the raw list because i (a) generated <site> if i only had <site>/path.ext, to avoid no-parent issues, and (b) generated multiple guesses for some malformed urls where i had only the username)
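A minimal sketch of the cleanup described above, assuming the entries carry no scheme (as in the posted lists). The name clean_url is hypothetical and the actual script is not shown in this log; the guessing done for malformed, username-only entries is left out.

```python
def clean_url(raw):
    """Return the cleaned candidate URLs for one raw list entry."""
    url = raw.strip()  # clean up surrounding whitespace
    host, _, path = url.partition("/")
    host = host.lower()
    # Old-style monsite.orange.fr/<slug> -> <slug>.monsite-orange.fr
    if host == "monsite.orange.fr" and path:
        slug, _, path = path.partition("/")
        host = f"{slug.lower()}.monsite-orange.fr"
    if not host:
        return []
    # Always emit the bare site, so <site>/path.ext also yields <site>
    # (this is why the cleaned list can be longer than the raw one).
    results = [host]
    if path:
        results.append(f"{host}/{path}")
    return results


print(clean_url("  monsite.orange.fr/foo "))
print(clean_url("08.pagesperso-orange.fr/odp/index.htm"))
```

A trailing slash directly on the domain leaves an empty path, so it is dropped and the entry dedupes against the bare site.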
-
pokechu22
And this is based on scraping a list that they provide, right? So most of the pages should exist?
-
thuban
yes; no
-
thuban
unfortunately a lot of the pages in the directory are down
-
pokechu22
I'm a bit worried because two of my !a < list jobs for monsite-orange.fr both seem to have resulted in the site banning it (possibly because of too many requests to nonexistent pages, but maybe just because it was running too fast) which is annoying...
-
thuban
the api had 'accessible' and 'status' parameters; i am not sure what the distinction is and chose the values that gave me the largest list
-
thuban
oof :/
-
thuban
i can change those params and get you a shorter list to prioritize, if that would help
-
pokechu22
An additional annoyance is that each page that doesn't exist redirects twice (yachtlink.pagesperso-orange.fr -> r.orange.fr/r/Oerreur_404 -> e.orange.fr/error404.html)
-
thuban
ye
-
pokechu22
Sure, that'd be helpful as it'd be pretty easy to run that list first and then run the remaining stuff not on that list
-
thuban
ok, will do. probably take a few hours
-
pokechu22
Alright
-
thuban
list is going to be about 1/3 the size of the big one
-
pokechu22
thuban: how exactly did you make the badraw list?
acf.luis.pagesperso-orange.fr is valid for instance (it just doesn't work with https)
-
thuban
literally just raw minus scrubbed. that site had a trailing slash in the raw list ("acf.luis.pagesperso-orange.fr/"); i removed those if they were directly on the domain (for deduping purposes)
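A small sketch of "raw minus scrubbed" as a line-level set difference, showing why a valid site can land in the 'bad' list. The function name bad_raw and the second hostname are hypothetical.

```python
def bad_raw(raw_lines, scrubbed_lines):
    """Return raw entries that do not appear verbatim in the scrubbed list."""
    scrubbed = set(scrubbed_lines)
    return [line for line in raw_lines if line not in scrubbed]


raw = ["acf.luis.pagesperso-orange.fr/", "example.pagesperso-orange.fr"]
scrubbed = ["acf.luis.pagesperso-orange.fr", "example.pagesperso-orange.fr"]

# The first entry differs only by its trailing slash, so it counts as 'bad'
# even though the site itself is valid.
print(bad_raw(raw, scrubbed))
```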
-
pokechu22
Oh, not links that seemed like complete junk
-
thuban
yeah, the idea was mostly to have the originals for discoverability (esp for the changed domains)
-
pokechu22
thuban: some (as in several thousand?) of the ones you have aren't in my list at all, which means there's no archive.org coverage. Unfortunately my organization is a mess and I now have 2GB of lists of URLs so it'll be a bit before I can actually run stuff though... and make sure I'm actually looking at all of this correctly :|
-
thuban
that's ok, take your time! the priority list will probably be done in another 30-45 minutes, if that helps
-
nicolas17
my VPS has 659GB unused bandwidth for the rest of the month
-
fireonlive
DogsRNice: trouble with factorio? or just proactive?
-
DogsRNice
no idea i just noticed someone was doing the factorio sites and didnt do the forums
-
fireonlive
ah ok
-
pokechu22
I skipped the forums because they're somewhat large - it'd make sense to do them later but I'd rather not start a multi-day proactive thing just yet
-
pokechu22
If we want to do one it's fine but eh
-
thuban
arkiver, pokechu22: here are my 'priority' lists (scraped with accessible=true and status=active; sites should all be online). these lists are a strict subset of those previously posted
-
thuban
the 'bad' urls all either had trailing slashes or were of the old *.(orange|wanadoo).fr format with quasi-redirects. trailing slashes are transparent for our purposes, so instead of the entire 'bad' list here are just the redirects
-
pokechu22
I'm going to run this with entries like 08.pagesperso-orange.fr/odp/index.htm stripped out (leaving only 08.pagesperso-orange.fr) for now since having both is the kind of situation that can lead to really weird no-parent behavior
-
thuban
hmm, ok
-
pokechu22
AB also needs either http:// or https:// before each URL; I'll add http to ones with multiple dots and https to ones without
-
thuban
ah, i never remember that. do you want me to do that / any other processing?
-
pokechu22
I can handle it - I've already built some jank regexes for it :)
-
thuban
ok!
-
pokechu22
first prefix everything with http:// and then replace ^([^/\.]+\.[^/\.]+-orange\.fr)$ with https://\1
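One possible reading of that rule as a sketch. It tests the pattern against the bare entry before any scheme is added, since the pattern as written would not match a line that already starts with http://. The name add_scheme is hypothetical.

```python
import re

# Exactly one label in front of a "<name>-orange.fr" domain, and no path.
SINGLE_LABEL = re.compile(r"^([^/.]+\.[^/.]+-orange\.fr)$")


def add_scheme(url):
    """Use https:// for single-label sites with no path, http:// otherwise."""
    if SINGLE_LABEL.match(url):
        return "https://" + url
    return "http://" + url


for entry in (
    "yachtlink.pagesperso-orange.fr",
    "acf.luis.pagesperso-orange.fr",
    "08.pagesperso-orange.fr/odp/index.htm",
):
    print(add_scheme(entry))
```

Entries with extra dots (like acf.luis) or with a path fall through to http://.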
-
pokechu22
I don't think the orange stuff is going to finish on time - running at more than 1 page/second seemed to result in blocks, and after going through about 4.5K seed URLs of 45K URLs we're already at ~125K queued or a day and a half. So at that rate it'd be 15 days to finish, which we don't have. And that's just for this smaller list. Any ideas about how to handle that?
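The estimate can be reproduced with a few lines, assuming the queue keeps growing linearly with the number of seeds processed and the crawl stays at 1 page/second. The name estimate_days is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_days(seeds_done, seeds_total, queued_pages, pages_per_second):
    """Extrapolate total crawl time from the queue built up so far."""
    pages_per_seed = queued_pages / seeds_done
    total_pages = pages_per_seed * seeds_total
    return total_pages / pages_per_second / SECONDS_PER_DAY


# ~125K queued pages at 1 page/second is about a day and a half.
print(round(125_000 / 1.0 / SECONDS_PER_DAY, 2))  # 1.45

# Scaling from 4.5K seeds to all 45K gives roughly 15 days.
print(round(estimate_days(4_500, 45_000, 125_000, 1.0), 1))  # 14.5
```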
-
thuban
i guess i would suggest either seeing if you can reduce the delay (i know it's different infra, but i was able to do all my scraping with 0.5s delay and didn't get banned) or trying to parallelize the load across multiple pipelines
-
pokechu22
If .5s is fine I can do that - it was originally .25-.375 at con=1
-
pokechu22
I'm not sure how long they ban for though which makes me nervous about experimenting
-
thuban
as i said, different infra (and it involved a token which i just yoinked from the browser), so can't be sure based just on that. could you try testing with a sacrificial ip, like a home connection?
-
pokechu22
I guess I could - though I don't have quite the same infra either
-
thuban
i mean on their end
-
thuban
i.e., the directory api being different from the actual page servers
-
pokechu22
What host is the directory API on?
-
thuban
api.annuaire-pp.orange.fr
-
pokechu22
ah, yeah, might have different rate-limiting then :|
-
thuban
multiple pipelines is probably easiest/safest, but idk what wrangling them is like
-
thuban
(alas, this is really a job for #Y...)
-
pokechu22
Theoretically I could just run e.g. all of the pagespro-orange.fr jobs on one pipeline, pagesperso-orange.fr on a second, and monsite-orange.fr on a third (that's trivial by just using different lists), and that's what I originally planned on doing, but it's not easy to do that for in-progress jobs
-
pokechu22
I'm going to try running pagespro-orange.fr locally since there's no job for that yet (beyond the ones you have in your list)
-
pokechu22
The other thing that would help is if we could just skip the 2-step redirect chain, but there's no way to apply ignores onto redirect targets so it's going to redownload r.orange.fr/r/Oerreur_404 and e.orange.fr/error404.html every time it hits a 404 :|
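For checking a list outside the crawler, the two-step chain can at least be recognized after the fact. A sketch, assuming the chain of visited URLs has already been recorded by some other tool; is_missing_page is a hypothetical helper, not an ArchiveBot feature.

```python
# The two error hops named above, without scheme.
ERROR_HOPS = ("r.orange.fr/r/Oerreur_404", "e.orange.fr/error404.html")


def strip_scheme(url):
    """Drop a leading http:// or https:// if present."""
    return url.split("://", 1)[-1]


def is_missing_page(redirect_chain):
    """True if the visited URLs end in the two-step 404 redirect."""
    tail = tuple(strip_scheme(u) for u in redirect_chain[-2:])
    return tail == ERROR_HOPS


chain = [
    "http://yachtlink.pagesperso-orange.fr",
    "https://r.orange.fr/r/Oerreur_404",
    "https://e.orange.fr/error404.html",
]
print(is_missing_page(chain))
```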
-
AntoninDelFabbro|m
pokechu22: If I can help, I will!
-
erkinalp
pokechu22: wowturkey still down
-
pokechu22
So unfortunately, a 500-500 delay results in a ban. Happened to me on my residential connection overnight and happened to one of the jobs (not the priority one) I changed yesterday too. I guess the 1-second delay is the only safe one :|
-
pokechu22
I did, however, build a list of stuff under pagespro-orange.fr that's valid
-
JAA
So, what channel do we use for ZOWA?
-
JAA
The ideas from yesterday: zowch z-oww-a nowa zowwa zowaah zowie (plus one that shall not be named)
-
fireonlive
ooh! ooh! the shall not be named one!
-
fireonlive
in absence of that, zowch
-
DigitalDragons
+1 zowch
-
h2ibot
FireonLive edited Current Projects (+121, add ZOWA):
wiki.archiveteam.org/?diff=50608&oldid=50551
-
fireonlive
one day i'll go through and make 300,000 edits with the mediawiki.org/wiki/Help:Magic_words#formatdate thing
-
fireonlive
too bad there doesn't seem to be one for time
-
fireonlive
hmmm
-
fireonlive
yeah sadly {{#formatdate:2023-09-29T03:00Z}} doesn't appear to work
-
h2ibot
FireonLive edited Current Projects (+16, use formatdate for ZOWA, more to come):
wiki.archiveteam.org/?diff=50609&oldid=50608
-
fireonlive
i found {{#time}} but what the fuck is this: 2023-09-29UTC03:000
-
fireonlive
i'll look more into it later :p
-
fireonlive
mediawiki is really something
-
JAA
#time doesn't seem to account for user preferences.
-
h2ibot
Yts98 edited ZOWA (+24, Update project status):
wiki.archiveteam.org/?diff=50610&oldid=50195
-
fireonlive
ah, darn
-
fireonlive
thanks yts98 :)
-
JAA
Perhaps we should just have a simple template to render datetimes in a consistent manner. {{datetime|2023-08-28|22:00|CEST|+2}} → {{#formatdate:2023-08-28}} 22:00 CEST (UTC+2) or similar
-
fireonlive
i'd be up for something that's consistent
-
JAA
The last two parameters could be optional, and the default would be UTC.
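The proposed rendering, sketched in Python rather than wikitext to show the optional parameters and the UTC default. render_datetime is a hypothetical stand-in for the template; the date is passed through unchanged here, whereas the wiki version would hand it to #formatdate.

```python
def render_datetime(date, time, tz_name="UTC", utc_offset=None):
    """Render a datetime as '<date> <time> <tz> (UTC<offset>)'."""
    text = f"{date} {time} {tz_name}"
    # The offset is only shown when a non-default zone supplies one.
    if utc_offset is not None:
        text += f" (UTC{utc_offset})"
    return text


print(render_datetime("2023-08-28", "22:00", "CEST", "+2"))
print(render_datetime("2023-09-29", "03:00"))
```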
-
fireonlive
people wildly get confused with named timezones though so perhaps we could leave that out
-
fireonlive
EST vs EDT, even big streamers scheduling things
-
fireonlive
'hey you know it's DT over there now.. so is it happening at 7 or 8?'
-
fireonlive
seems to come up a lot lol
-
JAA
'ET'
-
JAA
(ノಥ益ಥ)ノ彡┻━┻
-
fireonlive
-
fireonlive
:P
-
fireonlive
'type where you are and see what it is'
-
fireonlive
-
fireonlive
the frowny faces are because it's mainly used for figuring out when to meet i guess
-
fireonlive
JAA: can we pls kill DST everywhere tks
-
fireonlive
T_T
-
fireonlive
inb4 perma-dst everywhere because i guess that sounds nicer to politicians
-
JAA
Yes please
-
fireonlive
as long as it's gone i'll accept it
-
fireonlive
:D
-
fireonlive
(the DST vs ST 'final time' debate)
-
JAA
Same, I don't even care anymore which one is chosen, just get rid of the stupid transition twice per year.
-
fireonlive
for sure
-
thuban
pokechu22: that sucks. multiple pipelines, then? i know you can't really do that with the jobs already in progress, but i don't think duplicating some of the work would hurt
-
thuban
(i also don't see any reason it needs to be done by domain--seems better to just split evenly)
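An even split that ignores domains can be as simple as dealing the list out round-robin. A sketch; split_evenly and the site names are hypothetical.

```python
def split_evenly(urls, n_parts):
    """Deal URLs out round-robin so each part gets a near-equal share."""
    return [urls[i::n_parts] for i in range(n_parts)]


urls = [
    "a.pagesperso-orange.fr",
    "b.pagespro-orange.fr",
    "c.monsite-orange.fr",
    "d.pagesperso-orange.fr",
    "e.pagesperso-orange.fr",
]
for part in split_evenly(urls, 2):
    print(part)
```

Part sizes differ by at most one, and each part still mixes all three domains.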
-
pokechu22
Yeah, there's no real reason to split by domain, other than how I was building up my own lists originally. If it were an !a < list job for example.com/foo example.com/bar example.org/baz example.org/quux it would make sense to split example.com and example.org into two jobs to fully avoid !a < list issues, but we've already got multiple subdomains and multiple domains doesn't
-
pokechu22
make much of a difference
-
pokechu22
Unfortunately there are only 6 different sets of pipelines with distinct IPs, of which 3 are banned and 2 currently have jobs running on them
-
thuban
oof
-
pokechu22
the remaining one is also basically always full since it effectively only has 4 slots at the moment and they're usually filled with long-running jobs :|
-
pokechu22
Hopefully the bans don't last too long and we can get the other ones back into use
-
thuban
:I
-
thuban
yeah
-
thuban
at least we'll definitely get through all the front pages from the priority list (and probably their assets as well)
-
pokechu22
Yeah
-
vokunal|m
+1 zowch
-
nicolas17
what's ZOWA
-
JAA
-
nicolas17
oh yikes, video... any idea of size?
-
JAA
#zowch for ZOWA
-
nicolas17
anyone updating channel on wiki?
-
appledash
Does archiveteam accept donations? if so, I hope they all go to the guy responsible for coming up with the channel names
-
appledash
he's got a hard jo
-
appledash
b
-
flashfire42
is the telegram thing still going nuts?
-
flashfire42
Like is the redoing everything thing still active or is it back to normal?
-
fireonlive
so many OWASP channels
-
JAA
-
appledash
wtf, the fact that someone who has only donated $40 is top 15 is a travesty
-
appledash
Remind me to contribute when I get paid
-
nstrom|m
Can someone fill me in on the owasp drama? Maybe in -ot
-
flashfire42
I have no fucking idea I just jumped on the bandwagon
-
JAA
appledash: It has only been in use and publicised since a couple months ago during the Imgur project, although the page has existed for years.
-
appledash
Ahhh
-
h2ibot
Switchnode edited ZOWA (+5, add irc channel):
wiki.archiveteam.org/?diff=50611&oldid=50610
-
pokechu22
I queued one more job for orange.fr URLs that aren't found on archive.org at all, though whether or not the pipeline slot will free up remains to be seen
-
h2ibot
JustAnotherArchivist edited ZOWA (+56, Reference for shutdown):
wiki.archiveteam.org/?diff=50612&oldid=50611
-
nicolas17
rewby: how are the targets and IA doing? do you have a giant backlog in temporary storage again?
-
rewby
nicolas17: I have about 31.2TiB in temp storage. And another 200 or so TiB left on it.
-
rewby
Targets are fine at the moment
-
rewby
It's just that all active projects managed to hit bugs all at once as far as I can tell
-
rewby
Based on what I've read (and I'm not an authority here): shreddit is paused due to some concern around image capture maybe not working right
-
rewby
deadcat is just mostly done
-
nicolas17
oh, I thought shreddit was still paused to give capacity to gfycat/xuite
-
rewby
(and waiting for an update for the last few items)
-
rewby
xuite is just slow
-
rewby
(something something asia is a pain to get data in and out of)
-
rewby
If you have ipv6, I think xuite could use your help
-
rewby
telegram was provided offload capacity but I don't know if it's being used yet
-
nicolas17
telegram seems to have 0 in todo
-
rewby
Actually, tg is slowly returning stuff
-
rewby
So looks to be working
-
rewby
Uh... what else... urls is still paused
-
nicolas17
I think a bunch of stuff in tg was stashed away, maybe it needs to be brought back, but idk status, I wasn't even in the channel the last few days
-
rewby
Although that's been hooked up to offload too in case arkiver wants to have a go at it (although probably not at full speed to conserve space)
-
rewby
And yeah... that's about it?
-
fireonlive
shreddit was paused while i.reddit.com's new javascript/etc fuckery is checked to ensure the data we save is good
-
fireonlive
AIUI
-
nicolas17
if there's "free" capacity we can slightly open the faucet on imgur (:
-
fireonlive
imgur is slowly deleting images off of the CDN now, per BigBrain
-
fireonlive
302s are rising from the canary list
-
rewby
Ah
-
rewby
I'll add it to offload I guess
-
rewby
And then it's up to arkiver and JAA to turn that on and off
-
fireonlive
:) thanks
-
rewby
Mind you, I've only got like a quarter of a PiB of space
-
rewby
And that has to last us until the IA comes back
-
nicolas17
are you not uploading anything to IA right now?
-
rewby
Not yet
-
rewby
Code's not ready for it
-
vokunal|m
It's nice to see nearly 200M items in queue and realize for once it's only like ~75GiB
-
nicolas17
vokunal|m: lol, in what project?
-
fireonlive
xuite if i had to guess
-
vokunal|m
Imgur. Though is the item size avg probably bugged after being offline so long?
-
rewby
nicolas17: Getting code ready for uploading to IA is a lower prio than actually capturing data atm
-
thuban
telegram is still running (so items submitted to the bot are still processed), but its backlog was stashed and since other projects are paused it's not receiving items from outlinks (which were the majority of its volume)
-
fireonlive
ah
-
nicolas17
vokunal|m: that math doesn't look right :P
-
nicolas17
item size is 367 KB
-
flashfire42
arkiver is the deduplication still turned off for telegram?
-
nicolas17
rewby: imgur has a lot of 'redo' that will probably have low success rate, so we can also regulate speed that way
-
nicolas17
move some stuff from redo to todo to slow down, ask me to add a bruteforced list to speed up :P
-
vokunal|m
73TB? I think i divided instead of multiplied
-
nicolas17
vokunal|m: yes that's the right multiplication, but note a lot of those 200M are retries and will fail
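The multiplication, spelled out as a quick check. It assumes decimal units (1 KB = 1000 bytes) and takes the ~200M item count and the 367 KB average at face value.

```python
items = 200_000_000          # ~200M items in queue
avg_item_bytes = 367 * 1000  # 367 KB average item size

total_bytes = items * avg_item_bytes
print(total_bytes / 1000**4)  # 73.4 (TB)

# Dividing instead of multiplying is off by many orders of magnitude,
# which is how a figure like ~75GiB can come out.
```

As noted above, many of those items are retries expected to fail, so the real volume should be lower.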
-
h2ibot
FireonLive edited Current Projects (+27, add IRC channel for ZOWA):
wiki.archiveteam.org/?diff=50613&oldid=50609
-
arkiver
flashfire42: yes, i'll turn that on shortly again
-
flashfire42|m
Probably a good idea
-
fireonlive
-
fireonlive
interesting template
-
fireonlive
(it's an image!)
-
fireonlive
oh, for emails
-
fireonlive
(well one email :3)
-
flashfire42|m
I wonder if we will ever find out the reason behind the ingestion issues
-
flashfire42|m
And are we slowly pushing from the offload storage or is it just sitting quietly?
-
fireonlive
not uploading to IA from offload atm, code needs to be written (rewby mentioned it above)
-
rewby
My plan is to spend some time later this week getting uploading going
-
h2ibot
FireonLive edited Template:IRC-Hackint (+22, +deleteme in favour of Template:IRC):
wiki.archiveteam.org/?diff=50614&oldid=41452
-
fireonlive
i have no idea what i went to wiki.archiveteam.org for initially, but it ended in that
-
h2ibot
FireonLive edited YouTube (-2, #youtubearchive → on haitus):
wiki.archiveteam.org/?diff=50615&oldid=50569
-
fireonlive
it wasn't that either
-
fireonlive
oh well :D
-
thuban
front pages of 'online' orange.fr sites are done :D
-
thuban
~8 days' worth of requests remaining in queue, so front page assets at least should just finish before shutdown
-
fireonlive
awesome
-
fireonlive
^_^