-
h2ibot
Ka edited List of websites excluded from the Wayback Machine (-26, as of today ezboard appears to be available -…):
wiki.archiveteam.org/?diff=50700&oldid=50542
-
h2ibot
Ka edited Twitter (-60, /* Vital Signs */):
wiki.archiveteam.org/?diff=50701&oldid=50593
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=50702&oldid=50700
-
h2ibot
DigitalDragon edited ArchiveTeam Domains (-277, remove dead domains):
wiki.archiveteam.org/?diff=50703&oldid=47046
-
flashfire42|m
Ukraine defence minister sacked, just came across on the news; we can archive that
-
fireonlive
-
JAA
Yeah, saw it earlier, but not sure what there is to archive really.
-
JAA
Maybe the MoD website.
-
JAA
mil.gov.ua is using Buttflare in an aggressive enough configuration that AB can't grab it.
-
fireonlive
:(
-
h2ibot
FireonLive edited Talk:Main Page (+17, looking at wanted templates):
wiki.archiveteam.org/?diff=50704&oldid=48686
-
h2ibot
FireonLive edited NewsGrabber (+12, fixup infobox):
wiki.archiveteam.org/?diff=50705&oldid=50579
-
h2ibot
FireonLive edited NewsGrabber (+24, it's a.... DPoS):
wiki.archiveteam.org/?diff=50706&oldid=50705
-
TheTechRobo
fireonlive: How dare you be so rude to Template:Special case and Template:On hiatus?
-
fireonlive
they deserved it! 💢🥊
-
nicolas17
-
fireonlive
aww :3
-
fireonlive
so english wikipedia is really "that bad" eh
-
fireonlive
(outside of content)
-
h2ibot
Yts98 edited ZOWA (+45, Update information, datetimeify):
wiki.archiveteam.org/?diff=50707&oldid=50639
-
manu|m
not sure if you’ve seen it already, but the Telegram project (ArchiveTeam’s choice atm) seems to have no outstanding TODOs
-
plcp
on the topic of the Orange FAI (French ISP) pages (scheduled to disappear tomorrow): we (me & a few friends) took it upon ourselves to dump as many of them as possible
-
plcp
for now, we have ~5k warcs (one per page/website/subdomain) taking a bit more than a hundred gigabytes
-
plcp
I hope that we can grab a couple thousand more until the end (that'll be ~10-15% of the sites, completely archived, mostly the larger ones)
-
plcp
warcs produced using wget's warc support (we've done something like « for i in $(cat pages.txt); do wget -r --warc-file=$i "$i" ... » + some other flags, rate limits, etc.)
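As a runnable sketch of that per-site loop: the function below only builds the wget command (so it can be inspected without fetching anything); the flags are GNU wget's documented WARC options, while `pages.txt` and the 1-second wait are assumptions for illustration.

```shell
# Build (but do not run) a wget invocation that writes one WARC per site.
# --warc-file names the output WARC; --warc-cdx also emits a CDX index.
warc_cmd() {
  site=$1
  printf 'wget --recursive --level=inf --wait=1 --warc-file=%s --warc-cdx http://%s/\n' \
    "$site" "$site"
}

# To print the command for every site listed in pages.txt:
# while read -r site; do warc_cmd "$site"; done < pages.txt
warc_cmd example.pagesperso-orange.fr
```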
-
plcp
(we've tried to reach Orange to ask for more time, but I don't have much hope for that request)
-
plcp
now, I'm wondering, what are we going to do with all these warcs
-
plcp
should we merge everything into some megawarcs & upload it ourselves to the IA under our own names?
-
plcp
or share it here and do something something
-
plcp
(it's my first time doing this kind of thing, tbh I'm not sure of anything)
-
plcp
cc pokechu22 maybe :o)
-
imer
probably don’t need to megawarc them, yes to uploading them to IA
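Uploads like that are usually done with the Internet Archive's `ia` command-line tool; a sketch that only prints the command rather than running it (the item identifier "orange-fr-pagesperso-warcs" is hypothetical, invented here for illustration):

```shell
# Build (not run) an `ia` CLI upload command for one per-site WARC.
# The item identifier below is a made-up example, not a real IA item.
upload_cmd() {
  warc=$1
  printf 'ia upload orange-fr-pagesperso-warcs %s --metadata=mediatype:web\n' "$warc"
}
upload_cmd mysite.warc.gz
```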
-
imer
they won't be indexed into the WBM since the crawl is untrusted
-
imer
(i don't really have an idea, so take this with a grain of salt and/or wait for me to be corrected)
-
imer
if not indexed into the WBM, a crawl per domain/user is probably easier to work with as well
-
plcp
yeah I don't expect them to get indexed (at best I'll set up some pywb somewhere later to expose them)
-
plcp
we also had to use a proxy to apply some rewrite rules, as some older pages had only dead links (rewriting perso.wanadoo.fr/<domain> into <domain>.pagesperso-orange.fr "resurrected" some sites, for example)
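That rewrite can be expressed as a one-line sed rule; the sketch below is an assumption about the mapping described here, not the actual proxy configuration that was used:

```shell
# Map old perso.wanadoo.fr/<name>/... URLs onto the newer
# <name>.pagesperso-orange.fr host, as described above.
rewrite() {
  printf '%s\n' "$1" |
    sed -E 's|^https?://perso\.wanadoo\.fr/([^/]+)|http://\1.pagesperso-orange.fr|'
}
rewrite 'http://perso.wanadoo.fr/mysite/index.html'
# prints: http://mysite.pagesperso-orange.fr/index.html
```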
-
plcp
thus, strictly speaking, it's not a carbon copy of the pages
-
thuban
speaking of those orange.fr pages, deadline is tomorrow and our archivebot jobs are certain not to finish. i know it's technically possible to extract the remaining urls from the ab jobs and put them into #// (with a pattern-based rate limit to avoid ddosing); any chance of an admin actually doing so?
-
thuban
the ab jobs are rate-limited by ip bans, so dumping to #// would allow us to get more done even if the anti-ddos rate limit had to be pretty tight
-
thuban
(the relevant domains, for convenience, are orange.fr, monsite-orange.fr, pagesperso-orange.fr, pagespro-orange.fr, and woopic.com)
-
imer
#// doesn't really do recursive discovery
-
thuban
true but irrelevant
-
imer
Ah right, so the list is already complete?
-
thuban
no, but there are millions of urls in the queue we won't get to
-
imer
I see, yeah that might make sense
-
nstrom|m
we'd also need a target w/ some space on it, atm everything that's still running is at a crawl because optane9 is blocked most of the time
-
nstrom|m
but it's not a bad idea. I'd throw some boxes at it if it was up and running again
-
thuban
damn, i thought we had some buffer
-
imer
Is it struggling currently? Unless we're pushing a lot of data through it we should have a lot of temp space & ingest capacity left
-
thuban
either way, i will reiterate my previous offer of space if it would be helpful
-
imer
rewby might need a ping if it's stuck or something
-
imer
"A lot" should be a few gbit/s if i remember right
-
imer
rewby: ^ optane9 seems to be having issues (seeing -1 on my end)
-
nstrom|m
it's been letting things through in little chunks every so often so not completely stuck, but def struggling
-
imer
New tech for offloading, so some teething issues are expected; hopefully it's an easy fix
-
vokunal|m
Yesterday the limits were taken off of #down-the-tube. If that's still the case, it could have a lot to do with why uploads are frozen
-
arkiver
thuban: any rate limits on orange?
-
thuban
arkiver: 1 request/second appears to be the maximum safe rate for a single ip
-
thuban
no idea what it can sustain in total
-
arkiver
but do they fall over at high rate?
-
arkiver
ah okay
-
arkiver
i was hoping this could be done with AB fully but sounds like that's not the case
-
thuban
well, no
-
arkiver
which ones have you not done with AB yet?
-
arkiver
thuban: ^
-
thuban
i'm not 100% sure how the lists we discovered were sliced up; pokechu22 put the actual jobs in
-
thuban
but i _believe_ that everything we found went into ab, so pulling the 'remaining' urls of the four current jobs should cover everything we can
-
thuban
hm, correction: i think everything got queued for monsite-orange.fr and pagesperso-orange.fr, but not pagespro-orange.fr
-
thuban
here is a list of 953 pagespro-orange.fr sites _not_ in the 'priority' job (scrubbed, suitable for ab):
transfer.archivete.am/l2Sws/orangefr_pagespro_scrubbed.txt.zst
-
thuban
do we have enough archivebot pipelines to add this one? if so i would appreciate someone (pokechu22?) running it
-
thuban
(i expect most sites to not work, but a few will)
-
pokechu22
thuban: I already did a job for the entirety of pagespro that I confirmed worked (by a local crawl) but I can try to put that in too
-
pokechu22
I need to eat first though
-
thuban
oh, ok! sorry, must not have seen that since it already finished
-
thuban
no problem then
-
pokechu22
It's worth saving the list of stuff that doesn't work in any case, which I don't think we have done (my local crawl has a warc, but it'd be good to do it via AB too)
-
thuban
pokechu22: on that note, can you explain how you generated your lists? (arkiver asked earlier what we've done already, and while i _think_ everything in my 'full' list made it into your seed_urls lists, i wasn't sure)
-
pokechu22
The first "full" lists of mine are from the wayback CDX server for various domains (e.g. accounting for perso.wanadoo.fr and perso.orange.fr being on pagesperso-orange.fr now), along with a bit of bing/google search (but that's limited to only a few entries). I mixed stuff into the later lists though
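Those lookups follow the documented Wayback CDX Server API; a sketch that just builds the query URL (the parameters are real API parameters, the domain is an example):

```shell
# Build a Wayback CDX API query listing distinct captured URLs under a
# domain: collapse=urlkey deduplicates captures, fl=original keeps only
# the URL column.
cdx_query() {
  printf 'https://web.archive.org/cdx/search/cdx?url=%s/*&collapse=urlkey&fl=original\n' "$1"
}
cdx_query perso.wanadoo.fr
```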
-
pokechu22
I *think* the priority list is just your list as-is, while the pagesperso-orange.fr_seed_urls.txt and monsite-orange.fr_seed_urls_v2.txt are my lists as-is, and pagesperso-orange.fr_monsite-orange.fr_seed_urls_2_no_coverage.txt is whatever was in your full list that's not in one of my lists. So there's some overlap between the priority job and the other jobs, but everything should be represented.
-
thuban
ok, cool
-
thuban
(we might want to do the list of redirects, for discoverability--there should be enough time)
-
ShadowJonathan
(Got told to move here) Heya, I'm planning to mirror-archive a novel/obscure website that may be shutting down soon (currently planning to use WarcMiddleware), with the intention to ask to be included in the archiveteam collection, so that it may show up on the wayback machine.
-
ShadowJonathan
Is there anything important that I should know before I do this? Anything regarding the quality/origin of this WARC, or do yall recommend to use a different tool to mirror a complete website?
-
thuban
ShadowJonathan: as a rule the internet archive will not whitelist third-party warcs for the wayback machine; i recommend you request the site in #archivebot and let us do it for you.
-
ShadowJonathan
Understandable, that's the answer I was after, I'll take a look there then
-
nicolas17
yeah, we can give suggestions on how you can best produce a quality WARC, but it's *not* going to appear on the wayback machine either way
-
ShadowJonathan
Entirely understandable. I see that the domain I'm after is already in archivebot, but it's from 2019; I'll take a peek at the documentation before I poke the channel with my question
-
imer
We (as in people with AB access) can do a re-archive if there's new content/reason for it
-
pokechu22
thuban: looks like essentially all of those weren't on the list of URLs I originally tested, so it is new data
-
thuban
pokechu22: aha, thanks for running it
-
JAA
ShadowJonathan: I've never heard of WarcMiddleware before, but based on a quick glance at the code, it does not appear to be good software.
-
JAA
-
JAA
That obviously won't preserve the data as sent by the server.
-
JAA
So, yet another person writing WARCs that doesn't understand the purpose of WARCs...
-
ShadowJonathan
Hmmm...
-
ShadowJonathan
It's listed on the wiki though
-
JAA
Yeah, that list needs an overhaul.
-
ShadowJonathan
And looked to be the first one to mirror an entire website
-
ShadowJonathan
But tbh if yall don't accept third party WARCs, there's a number of resources that need to be updated
-
ShadowJonathan
One gist I found seems to suggest to just poke one of yall here to move it into the collection, which seemed very very trusting, but yeah, ofc the policy has changed between then and now
-
ShadowJonathan
-
thuban
i suspect "If you're uploading a WARC that should be included in the ArchiveTeam collection" meant 'if you are a member of archiveteam uploading part of an archiveteam project (and it is 2014 and we are still doing things this way)'
-
thuban
but yes, very misleading in present context
-
fireonlive
grab-site and warcprox are 'blessed' by JAA i believe
-
fireonlive
well, seem not bad
-
fireonlive
:p
-
JAA
wpull (and by extension grab-site) isn't perfect but doesn't have grave errors at least.
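For anyone following along, a typical grab-site invocation looks like the sketch below (flags as documented in grab-site's README; the function only prints the command, and the URL is an example):

```shell
# Build (not run) a grab-site command: modest concurrency, a randomized
# per-request delay in milliseconds, and no off-site recursion.
grabsite_cmd() {
  printf 'grab-site %s --concurrency=2 --delay 500-1500 --no-offsite-links\n' "$1"
}
grabsite_cmd https://example.com/
```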
-
fireonlive
:)
-
JAA
warcprox isn't blessed by me, but because it comes from IA, it's assumed good until proven otherwise.
-
fireonlive
ye, that's better wording sorry
-
JAA
wget-at is also good. (wget is not.)
-
fireonlive
i don't think there's anything else to add to 'the list'
-
fireonlive
ah right archiveteam-flavoured wget
-
JAA
qwarc also writes WARCs according to the spec to the best of my knowledge and capability.
-
JAA
Everything else is best presumed terrible and unusable until proven otherwise.
-
fireonlive
:)
-
» fireonlive taps the "follow the spec" sign
-
JAA
:-)
-
JAA
I'll update the tools list.
-
fireonlive
was just looking at that and wondering if we need a 'recommended' column or something like that
-
fireonlive
lol
-
fireonlive
thanks
-
h2ibot
JustAnotherArchivist edited The WARC Ecosystem (+1536, /* Tools */ Add recommendation column):
wiki.archiveteam.org/?diff=50708&oldid=50444
-
JAA
Now the list looks pretty sad.
-
fireonlive
lots of red :(
-
h2ibot
JustAnotherArchivist edited The WARC Ecosystem (+758, /* Tools */ Add wget-at and qwarc):
wiki.archiveteam.org/?diff=50709&oldid=50708
-
h2ibot
Rexma edited Deathwatch (+56, /* 2023 */ its still up and i checked some…):
wiki.archiveteam.org/?diff=50710&oldid=50698
-
h2ibot
FireonLive edited The WARC Ecosystem (-8, make table fit better on smaller screens):
wiki.archiveteam.org/?diff=50711&oldid=50709
-
JAA
:-)
-
fireonlive
(the first really long tests link pushed the recommended column off screen for me)
-
fireonlive
:)
-
JAA
Yeah, same, actually, didn't check what was causing it though.
-
fireonlive
ahh =]
-
fireonlive
gotta love tables haha
-
vokunal|m
I was confused when it said the wiki page was edited to fit better on smaller screens. It didn't fit on mine before, and mine's 32 inches. Then I remembered I keep the wiki at 150% zoom
-
flashfire42
heh time to scrape the betting channels and weird bullshit because we are nearly out of telegram items
-
nicolas17
are we uploading stuff to IA yet or are we still filling up temporary storage?
-
Barto
flashfire42: call Gooshka, he'll figure out a way to queue telegram stuffs :-)
-
flashfire42
Gooshka is working on it, and so am I. nicolas17: I honestly don't know the answer to that
-
Barto
:-)
-
flashfire42
that_lurker is as well
-
nicolas17
it's not a big deal if we go idle, we don't *have* to keep workers busy... is that "weird bullshit" useful to archive? :P
-
flashfire42
*wiggles hand*
-
flashfire42
sorta
-
nicolas17
instead of "oh no workers are idle, time to throw whatever garbage we find into the queue to keep them busy", we should be saying "oh finally workers are idle, now the targets can finally catch up with their uploads" :P
-
flashfire42
are we doing uploads?
-
nicolas17
I don't know what the status is, that's why I was asking
-
flashfire42
If I can get confirmation we're doing uploads again, I'm more than happy to start letting it play catch-up
-
flashfire42
But the people demand work
-
nicolas17
isn't it even worse if we're *not* doing uploads?
-
JAA
Yes
-
flashfire42
fair call; if this is a veiled request to stop queueing, I can stop
-
flashfire42
or at least stop mass queueing
-
nicolas17
I'm not saying "don't add stuff", I'm not even saying "the stuff you're adding is worthless crap" (I don't know if it is), just "*if* it's worthless crap then don't add it just to keep things busy"
-
nicolas17
whether we have capacity for it or not, is not for me to say :)
-
DigitalDragons
I don't see any new items on the archiveteam IA account so it would seem uploads are not happening
-
nicolas17
I'm once again wishing for a graph of total available space on targets
-
flashfire42
admittedly some of my queueing is just busy work.
-
fireonlive
re: Geoff:
geoffchappell.com at the least
-
fireonlive
-
JAA
The website's been run through AB earlier.
-
fireonlive
ah awesome :)
-
fireonlive
sorry i should check fart more
-
JAA
LinkedIn is horrible and generally not archiveable.
-
JAA
HTTP 999
-
fireonlive
ugh yeah, not too surprising
-
TheTechRobo
my favourite status code, 999 Fuck Yourself
-
DigitalDragons
9xx "asshole errors" group
-
h2ibot
Yts98 edited Collecting items randomly (+1153, unify algebraic notation, do some programming…):
wiki.archiveteam.org/?diff=50712&oldid=21529