-
h2ibot
Ka edited List of websites excluded from the Wayback Machine (-26, as of today ezboard appears to be available -…):
wiki.archiveteam.org/?diff=50700&oldid=50542
-
h2ibot
Ka edited Twitter (-60, /* Vital Signs */):
wiki.archiveteam.org/?diff=50701&oldid=50593
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=50702&oldid=50700
-
h2ibot
DigitalDragon edited ArchiveTeam Domains (-277, remove dead domains):
wiki.archiveteam.org/?diff=50703&oldid=47046
-
flashfire42|m
Ukraine defence minister sacked, just came across on the news; we can archive that
-
fireonlive
-
JAA
Yeah, saw it earlier, but not sure what there is to archive really.
-
JAA
Maybe the MoD website.
-
JAA
mil.gov.ua is using Buttflare in an aggressive enough configuration that AB can't grab it.
-
fireonlive
:(
-
h2ibot
FireonLive edited Talk:Main Page (+17, looking at wanted templates):
wiki.archiveteam.org/?diff=50704&oldid=48686
-
h2ibot
FireonLive edited NewsGrabber (+12, fixup infobox):
wiki.archiveteam.org/?diff=50705&oldid=50579
-
h2ibot
FireonLive edited NewsGrabber (+24, it's a.... DPoS):
wiki.archiveteam.org/?diff=50706&oldid=50705
-
TheTechRobo
fireonlive: How dare you be so rude to Template:Special case and Template:On hiatus?
-
fireonlive
they deserved it! 💢🥊
-
nicolas17
-
fireonlive
aww :3
-
fireonlive
so english wikipedia is really "that bad" eh
-
fireonlive
(outside of content)
-
h2ibot
Yts98 edited ZOWA (+45, Update information, datetimeify):
wiki.archiveteam.org/?diff=50707&oldid=50639
-
manu|m
not sure if you’ve seen it already, but the Telegram project (ArchiveTeam’s choice atm) seems to have no outstanding TODOs
-
plcp
on the topic of the Orange FAI (French ISP) pages (scheduled to disappear tomorrow): we (me & a few friends) took it upon ourselves to dump as many of them as possible
-
plcp
for now, we have ~5k warcs (one per page/website/subdomain) taking a bit more than a hundred gigabytes
-
plcp
I hope that we can grab a couple thousand more until the end (that'll be ~10-15% of the sites, completely archived, mostly the larger ones)
-
plcp
warcs produced using wget's warc support (we've done something like « for i in $(cat pages.txt); do wget -r --warc-file=$i "$i" ... » + some other flags, rate limits, etc.)
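As a runnable sketch of that per-site loop: the function below only builds the wget command (so it can be inspected without fetching anything); the flags are GNU wget's documented WARC options, while `pages.txt` and the 1-second wait are assumptions for illustration.

```shell
# Build (but do not run) a wget invocation that writes one WARC per site.
# --warc-file names the output WARC; --warc-cdx also emits a CDX index.
warc_cmd() {
  site=$1
  printf 'wget --recursive --level=inf --wait=1 --warc-file=%s --warc-cdx http://%s/\n' \
    "$site" "$site"
}

# To print the command for every site listed in pages.txt:
# while read -r site; do warc_cmd "$site"; done < pages.txt
warc_cmd example.pagesperso-orange.fr
```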
-
plcp
(we've tried to reach Orange to ask for more time, but I don't have much hope for that request)
-
plcp
now, I'm wondering, what are we going to do with all these warcs
-
plcp
should we merge everything into some megawarcs & upload it ourselves to the IA under our own names?
-
plcp
or share it here and do something something
-
plcp
(it's my first time doing this kind of thing, tbh I'm not sure of anything)
-
plcp
cc pokechu22 maybe :o)
-
imer
probably don’t need to megawarc them, yes to uploading them to IA
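Uploads like that are usually done with the Internet Archive's `ia` command-line tool; a sketch that only prints the command rather than running it (the item identifier "orange-fr-pagesperso-warcs" is hypothetical, invented here for illustration):

```shell
# Build (not run) an `ia` CLI upload command for one per-site WARC.
# The item identifier below is a made-up example, not a real IA item.
upload_cmd() {
  warc=$1
  printf 'ia upload orange-fr-pagesperso-warcs %s --metadata=mediatype:web\n' "$warc"
}
upload_cmd mysite.warc.gz
```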
-
imer
they won't be indexed into the WBM since the crawl is untrusted
-
imer
(i don't really have an idea, so take this with a grain of salt and/or wait for me to be corrected)
-
imer
if not indexed into the WBM, a crawl per domain/user is probably easier to work with as well
-
plcp
yeah I don't expect them to get indexed (at best I'll set up some pywb somewhere later to expose them)
-
plcp
we also had to use a proxy to apply some rewrite rules, as some older pages had only dead links (rewriting perso.wanadoo.fr/<domain> into <domain>.pagesperso-orange.fr "resurrected" some sites, for example)
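That rewrite can be expressed as a one-line sed rule; the sketch below is an assumption about the mapping described here, not the actual proxy configuration that was used:

```shell
# Map old perso.wanadoo.fr/<name>/... URLs onto the newer
# <name>.pagesperso-orange.fr host, as described above.
rewrite() {
  printf '%s\n' "$1" |
    sed -E 's|^https?://perso\.wanadoo\.fr/([^/]+)|http://\1.pagesperso-orange.fr|'
}
rewrite 'http://perso.wanadoo.fr/mysite/index.html'
# prints: http://mysite.pagesperso-orange.fr/index.html
```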
-
plcp
thus, strictly speaking, it's not a carbon copy of the pages
-
thuban
speaking of those orange.fr pages, deadline is tomorrow and our archivebot jobs are certain not to finish. i know it's technically possible to extract the remaining urls from the ab jobs and put them into #// (with a pattern-based rate limit to avoid ddosing); any chance of an admin actually doing so?
-
thuban
the ab jobs are rate-limited by ip bans, so dumping to #// would allow us to get more done even if the anti-ddos rate limit had to be pretty tight
-
thuban
(the relevant domains, for convenience, are orange.fr, monsite-orange.fr, pagesperso-orange.fr, pagespro-orange.fr, and woopic.com)
-
imer
#// doesn't really do recursive discovery
-
thuban
true but irrelevant
-
imer
Ah right, so the list is already complete?
-
thuban
no, but there are millions of urls in the queue we won't get to
-
imer
I see, yeah that might make sense
-
nstrom|m
we'd also need a target w/ some space on it, atm everything that's still running is at a crawl because optane9 is blocked most of the time
-
nstrom|m
but it's not a bad idea. I'd throw some boxes at it if it was up and running again
-
thuban
damn, i thought we had some buffer
-
imer
Is it struggling currently? Unless we're pushing a lot of data through it we should have a lot of temp space & ingest capacity left
-
thuban
either way, i will reiterate my previous offer of space if it would be helpful
-
imer
rewby might need a ping if it's stuck or something
-
imer
"A lot" should be a few gbit/s if i remember right
-
imer
rewby: ^ optane9 seems to be having issues (seeing -1 on my end)
-
nstrom|m
it's been letting things through in little chunks every so often so not completely stuck, but def struggling
-
imer
New tech for offloading, so some teething issues are expected; hopefully it's an easy fix
-
vokunal|m
Yesterday the limits were taken off of #down-the-tube. If that's still the case, it could have a lot to do with why uploads are frozen
-
arkiver
thuban: any rate limits on orange?
-
thuban
arkiver: 1 request/second appears to be the maximum safe rate for a single ip
-
thuban
no idea what it can sustain in total
-
arkiver
but do they fall over at high rate?
-
arkiver
ah okay
-
arkiver
i was hoping this could be done with AB fully but sounds like that's not the case
-
thuban
well, no
-
arkiver
which ones have you not done with AB yet?
-
arkiver
thuban: ^
-
thuban
i'm not 100% sure how the lists we discovered were sliced up; pokechu22 put the actual jobs in
-
thuban
but i _believe_ that everything we found went into ab, so pulling the 'remaining' urls of the four current jobs should cover everything we can
-
thuban
hm, correction: i think everything got queued for monsite-orange.fr and pagesperso-orange.fr, but not pagespro-orange.fr
-
thuban
here is a list of 953 pagespro-orange.fr sites _not_ in the 'priority' job (scrubbed, suitable for ab):
transfer.archivete.am/l2Sws/orangefr_pagespro_scrubbed.txt.zst
-
thuban
do we have enough archivebot pipelines to add this one? if so i would appreciate someone (pokechu22?) running it
-
thuban
(i expect most sites to not work, but a few will)
-
pokechu22
thuban: I already did a job for the entirety of pagespro that I confirmed worked (by a local crawl) but I can try to put that in too
-
pokechu22
I need to eat first though
-
thuban
oh, ok! sorry, must not have seen that since it already finished
-
thuban
no problem then
-
pokechu22
It's worth saving the list of stuff that doesn't work in any case, which I don't think we have done (my local crawl has a warc, but it'd be good to do it via AB too)
-
thuban
pokechu22: on that note, can you explain how you generated your lists? (arkiver asked earlier what we've done already, and while i _think_ everything in my 'full' list made it into your seed_urls lists, i wasn't sure)
-
pokechu22
The first "full" lists of mine are from the wayback CDX server for various domains (e.g. accounting for perso.wanadoo.fr and perso.orange.fr being on pagesperso-orange.fr now), along with a bit of bing/google search (but that's limited to only a few entries). I mixed stuff into the later lists though
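Those lookups follow the documented Wayback CDX Server API; a sketch that just builds the query URL (the parameters are real API parameters, the domain is an example):

```shell
# Build a Wayback CDX API query listing distinct captured URLs under a
# domain: collapse=urlkey deduplicates captures, fl=original keeps only
# the URL column.
cdx_query() {
  printf 'https://web.archive.org/cdx/search/cdx?url=%s/*&collapse=urlkey&fl=original\n' "$1"
}
cdx_query perso.wanadoo.fr
```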
-
pokechu22
I *think* the priority list is just your list as-is, while the pagesperso-orange.fr_seed_urls.txt and monsite-orange.fr_seed_urls_v2.txt are my lists as-is, and pagesperso-orange.fr_monsite-orange.fr_seed_urls_2_no_coverage.txt is whatever was in your full list that's not in one of my lists. So there's some overlap between the priority job and the other jobs, but everything should be represented.
-
thuban
ok, cool
-
thuban
(we might want to do the list of redirects, for discoverability--there should be enough time)
-
ShadowJonathan
(Got told to move here) Heya, I'm planning to mirror-archive a novel/obscure website that may be shutting down soon (currently planning to use WarcMiddleware), with the intention to ask to be included in the archiveteam collection, so that it may show up on the wayback machine.
-
ShadowJonathan
Is there anything important that I should know before I do this? Anything regarding the quality/origin of this WARC, or do yall recommend to use a different tool to mirror a complete website?
-
thuban
ShadowJonathan: as a rule the internet archive will not whitelist third-party warcs for the wayback machine; i recommend you request the site in #archivebot and let us do it for you.
-
ShadowJonathan
Understandable, that's the answer I was after, I'll take a look there then
-
nicolas17
yeah, we can give suggestions on how you can best produce a quality WARC, but it's *not* going to appear on the wayback machine either way
-
ShadowJonathan
Entirely understandable. I see that the domain I'm after is already in archivebot, but it's from 2019; I'll take a peek at the documentation before I poke the channel with my question
-
imer
We (as in people with AB access) can do a re-archive if there's new content/reason for it
-
pokechu22
thuban: looks like essentially all of those weren't on the list of URLs I originally tested, so it is new data
-
thuban
pokechu22: aha, thanks for running it
-
JAA
ShadowJonathan: I've never heard of WarcMiddleware before, but based on a quick glance at the code, it does not appear to be good software.
-
JAA
-
JAA
That obviously won't preserve the data as sent by the server.
-
JAA
So, yet another person writing WARCs that doesn't understand the purpose of WARCs...
-
ShadowJonathan
Hmmm...
-
ShadowJonathan
It's listed on the wiki though
-
JAA
Yeah, that list needs an overhaul.
-
ShadowJonathan
And looked to be the first one to mirror an entire website
-
ShadowJonathan
But tbh if yall don't accept third party WARCs, there's a number of resources that need to be updated
-
ShadowJonathan
One gist I found seems to suggest to just poke one of yall here to move it into the collection, which seemed very very trusting, but yeah, ofc the policy has changed between then and now
-
ShadowJonathan
-
thuban
i suspect "If you're uploading a WARC that should be included in the ArchiveTeam collection" meant 'if you are a member of archiveteam uploading part of an archiveteam project (and it is 2014 and we are still doing things this way)'
-
thuban
but yes, very misleading in present context
-
fireonlive
grab-site and warcprox are 'blessed' by JAA i believe
-
fireonlive
well, seem not bad
-
fireonlive
:p
-
JAA
wpull (and by extension grab-site) isn't perfect but doesn't have grave errors at least.
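For anyone following along, a typical grab-site invocation looks like the sketch below (flags as documented in grab-site's README; the function only prints the command, and the URL is an example):

```shell
# Build (not run) a grab-site command: modest concurrency, a randomized
# per-request delay in milliseconds, and no off-site recursion.
grabsite_cmd() {
  printf 'grab-site %s --concurrency=2 --delay 500-1500 --no-offsite-links\n' "$1"
}
grabsite_cmd https://example.com/
```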
-
fireonlive
:)
-
JAA
warcprox isn't blessed by me, but because it comes from IA, it's assumed good until proven otherwise.
-
fireonlive
ye, that's better wording sorry
-
JAA
wget-at is also good. (wget is not.)
-
fireonlive
i don't think there's anything else to add to 'the list'
-
fireonlive
ah right archiveteam-flavoured wget
-
JAA
qwarc also writes WARCs according to the spec to the best of my knowledge and capability.
-
JAA
Everything else is best presumed terrible and unusable until proven otherwise.
-
fireonlive
:)
-
» fireonlive taps the "follow the spec" sign
-
JAA
:-)
-
JAA
I'll update the tools list.
-
fireonlive
was just looking at that and wondering if we need a 'recommended' column or something like that
-
fireonlive
lol
-
fireonlive
thanks
-
h2ibot
JustAnotherArchivist edited The WARC Ecosystem (+1536, /* Tools */ Add recommendation column):
wiki.archiveteam.org/?diff=50708&oldid=50444
-
JAA
Now the list looks pretty sad.
-
fireonlive
lots of red :(
-
h2ibot
JustAnotherArchivist edited The WARC Ecosystem (+758, /* Tools */ Add wget-at and qwarc):
wiki.archiveteam.org/?diff=50709&oldid=50708
-
h2ibot
Rexma edited Deathwatch (+56, /* 2023 */ its still up and i checked some…):
wiki.archiveteam.org/?diff=50710&oldid=50698
-
h2ibot
FireonLive edited The WARC Ecosystem (-8, make table fit better on smaller screens):
wiki.archiveteam.org/?diff=50711&oldid=50709
-
JAA
:-)
-
fireonlive
(the first really long tests link pushed the recommended column off screen for me)
-
fireonlive
:)
-
JAA
Yeah, same, actually, didn't check what was causing it though.
-
fireonlive
ahh =]
-
fireonlive
gotta love tables haha
-
vokunal|m
I was confused when it said the wiki page was edited to fit better on smaller screens. It didn't fit on mine before, and mine's 32 inches. Then I remembered I keep the wiki at 150% zoom
-
flashfire42
heh time to scrape the betting channels and weird bullshit because we are nearly out of telegram items
-
nicolas17
are we uploading stuff to IA yet or are we still filling up temporary storage?
-
Barto
flashfire42: call Gooshka, he'll figure out a way to queue telegram stuffs :-)
-
flashfire42
Gooshka is working on it, and so am I. nicolas17: I honestly don't know the answer to that
-
Barto
:-)
-
flashfire42
that_lurker is as well
-
nicolas17
it's not a big deal if we go idle, we don't *have* to keep workers busy... is that "weird bullshit" useful to archive? :P
-
flashfire42
*wiggles hand*
-
flashfire42
sorta
-
nicolas17
instead of "oh no workers are idle, time to throw whatever garbage we find into the queue to keep them busy", we should be saying "oh finally workers are idle, now the targets can finally catch up with their uploads" :P
-
flashfire42
are we doing uploads?
-
nicolas17
I don't know what the status is, that's why I was asking
-
flashfire42
If I can get confirmation we're doing uploads again, I'm more than happy to start letting it play catch-up
-
flashfire42
But the people demand work
-
nicolas17
isn't it even worse if we're *not* doing uploads?
-
JAA
Yes
-
flashfire42
fair call; if this is a veiled request to stop queueing, I can stop
-
flashfire42
or at least stop mass queueing
-
nicolas17
I'm not saying "don't add stuff", I'm not even saying "the stuff you're adding is worthless crap" (I don't know if it is), just "*if* it's worthless crap then don't add it just to keep things busy"
-
nicolas17
whether we have capacity for it or not, is not for me to say :)
-
DigitalDragons
I don't see any new items on the archiveteam IA account so it would seem uploads are not happening
-
nicolas17
I'm once again wishing for a graph of total available space on targets
-
flashfire42
admittedly some of my queueing is just busy work.
-
fireonlive
re: Geoff:
geoffchappell.com at the least
-
fireonlive
-
JAA
The website's been run through AB earlier.
-
fireonlive
ah awesome :)
-
fireonlive
sorry i should check fart more
-
JAA
LinkedIn is horrible and generally not archiveable.
-
JAA
HTTP 999
-
fireonlive
ugh yeah, not too surprising
-
TheTechRobo
my favourite status code, 999 Fuck Yourself
-
DigitalDragons
9xx "asshole errors" group
-
h2ibot
Yts98 edited Collecting items randomly (+1153, unify algebraic notation, do some programming…):
wiki.archiveteam.org/?diff=50712&oldid=21529