-
pabs
AntoninDelFabbro|m: which website, what are you trying to get? I tend to use different things for different purposes. for eg: googler/ddgr for site: search engine queries. curl/wget for downloads. pup for HTML parsing/querying. jq for JSON querying
-
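A minimal sketch of how those tools slot together (the URLs, query, and JSON shape are placeholders, and flags may need adjusting):
ddgr --np 'site:example.org annuaire'                              # search-engine site: query from the terminal
curl -sL 'https://example.org/page.html' | pup 'a attr{href}'      # parse the HTML and pull out every link
curl -s 'https://example.org/data.json' | jq -r '.items[].url'     # query a (hypothetically shaped) JSON response for URLs
-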
h2ibot
Systwi uploaded
File:Duck Hunt (World)-0--twitter-5.png (Mr. Peepers holding the Twitter bird,…):
wiki.archiveteam.org/?title=File%3A…Hunt%20%28World%29-0--twitter-5.png
-
h2ibot
Systwi edited Twitter (+177, /* Vital Signs */ Added meme and serious caption.):
wiki.archiveteam.org/?diff=50593&oldid=50591
-
fireonlive
systwi: 😃
-
h2ibot
Systwi edited Site exploration (+398, /* Twitter */ Mentioned Nitter and Twitter's…):
wiki.archiveteam.org/?diff=50594&oldid=50492
-
systwi
fireonlive: :-D
-
erkinalp
wowturkey is down; not known whether it's temporary and will be restored again, or permanent as in finally closed
-
erkinalp
let's leave the bot running in hope it returns once more
-
erkinalp
the announced date was august 31
-
pabs
-
nicolas17
pabs: I doubt it because most data there was ephemeral in the first place, eg. there's projects that do a daily build and only the last 5 binaries are kept
-
pabs
and what about the phabricator?
-
erkinalp
phabricator is a bugtracker, that's significant
-
erkinalp
tickets may have good things
-
nicolas17
we'll probably turn it into static pages somehow
-
nicolas17
I'm not sure how easy it is to archive, I think there's like, JS-backed "load more comments" stuff?
-
» pabs recommends an AB job, then download the static files :)
-
pabs
I did a phabricator recently, apart from the large amount of ignores I think it worked ok
-
erkinalp
we'd have to do the same with missing wowturkey viewtopic pages with p=### links
-
erkinalp
corresponding t=####&start=### ones have already been crawled
-
nicolas17
at one point we considered moving issues from phabricator to gitlab and it was messy because tickets can have multiple tags/projects that they belong to, while gitlab issues belong to *one* project
-
nicolas17
so we would need to check case by case and make a list of "if a ticket has tag X and tag Y, put it in repo Y"
-
pabs
AB job then static seems better
-
nicolas17
well yeah, this was *early* in the gitlab move when a lot of tickets would still be active
-
nicolas17
by now I guess a lot was closed, or stopped mattering, or was still active and someone moved it manually
-
flashfire42
Ok should I focus on webs or orange today? both have close cut off dates. Or do I say the hell with the both of them and continue with the aussie ISPs that have technically passed their shutdown date and are still up?
-
erkinalp
"September 1: wowTURKEY[IA•Wcite•.today•MemWeb], a large Turkish photo sharing forum[23]" september 1 → august 27
-
erkinalp
don't kill the archivebot crawler tho
-
erkinalp
it still crawls previously failed external links
-
erkinalp
if we had started this crawl one day before, we would have the full archive today...
-
flashfire42
Alas that is the joys of web archival
-
flashfire42
things are lost every day my friend
-
flashfire42
and it sucks. it does. but we do what we can
-
erkinalp
87% is better than nothing
-
erkinalp
(known item count is ~9.4M, we got 8,338,246)
-
pabs
-
erkinalp
300ms is too short for wowturkey, the server's own delay was about 500ms even when it was up
-
pokechu22
flashfire42: I should probably do an !a < list job on orange - webs might be better to focus on. On the other hand webs has the stupid calendars that make things a mess :|
-
flashfire42
orange has about 4 different subdomains and some of them don't even resolve for me but do for others. Webs is not a set-and-forget thing, which is really what I am aiming for, because the calendars are so fucking broken
-
pokechu22
Can I get a list of those different subdomains? (Lists of individual sites would be useful too but I have some ideas of how to get those once I know the starting points)
-
flashfire42
pagesperso-orange.fr
-
flashfire42
monsite-orange.fr
-
flashfire42
those are the main 2
-
flashfire42
images are hosted on cdn.woopic.com
-
DigitalDragons
-
DigitalDragons
(ignore the .txt at the end of everything)
-
AntoninDelFabbro|m
pabs: I want to download
annuaire-pp.orange.fr/accueil, but thanks for your help! :D
-
pabs
a good option for that is to open it in your browser, open dev tools, click on all the things on the site, then save all the requests as a .har, and then AB all the URLs output by this shell oneliner:
-
pabs
for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done | sort -u
-
pabs
ah, better open dev tools before loading the page, woops
-
pabs
there are some browser based crawler things on the wiki somewhere too
-
pabs
but they may not work if you need to interact with the site
-
AntoninDelFabbro|m
Gold! I just woke up, but I'm impatient to try this asap! Thank you!
-
erkinalp
qyxojzh|m: wowturkey definitively down
-
erkinalp
JAA: wowturkey definitively down, as of 0400UTC today
-
thuban
AntoninDelFabbro|m: that's a good way to capture data you can click through manually, but if the amount of navigation required is very large, i personally prefer to write a short script.
-
thuban
i have done so for annuaire-pp.orange.fr (and in the process, i believe, discovered more results than are shown in the browser) and will dump results tomorrow
-
AntoninDelFabbro|m
Awesome! Haha, well you saved me a lot of time, thanks ;)
-
thuban
AntoninDelFabbro|m: you're welcome!
-
thuban
also, uh, can someone remind me what the status is on orange isp hosting in general? are we still just dumping stuff in archivebot? because there are tens of thousands of these
-
thuban
(i was going to add 'and some of them require javascript', but based on my spot-checking they're all in the weird 'put everything in the html, but don't actually display it until the js loads' idiom, so i think archivebot would actually be fine in that respect)
-
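A quick way to repeat that spot-check, assuming some text that's visible on the rendered page makes a good marker (the URL and marker string are placeholders):
curl -s 'http://example.monsite-orange.fr/index.html' | grep -c 'text visible on the rendered page'
# a non-zero count means the content is already in the raw HTML, so ArchiveBot will capture it
-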
erkinalp
qyxojzh|m: JAA: arkiver: one of wowturkey's former mods is about to ask the owner to buy and resurrect wowturkey.com
-
erkinalp
we might get a last chance revive
-
flashfire42
thuban um everything into archivebot unless you design some scripts, because it's about a week away from going bye-bye, and we have like 3 or 4 warrior projects on the go and fuck all ingestion to IA right now
-
h2ibot
Bzc6p edited Demotivalo.net (+37, /* Sister sites */ Update stati):
wiki.archiveteam.org/?diff=50595&oldid=47826
-
erkinalp
wowturkey is down as of now
-
pabs
this site has 6TB of FLACs for bluegrass music:
bluegrassarchive.com (frameset for
gdarchive.net/Public/Bluegrass/contents.htm)
-
pabs
would be nice to grab eventually, but seems a bit big for AB, especially with the current IA upload limits
-
h2ibot
PaulWise edited Bugzilla (+1073, more from BZ site…):
wiki.archiveteam.org/?diff=50596&oldid=50588
-
h2ibot
PaulWise edited Bugzilla (-93, remove accidentally added done ones):
wiki.archiveteam.org/?diff=50597&oldid=50596
-
h2ibot
PaulWise edited Deathwatch (+295, Eclipse Wiki shutdown):
wiki.archiveteam.org/?diff=50598&oldid=50584
-
h2ibot
PaulWise edited Bugzilla (+0, Eclipse Bugzilla shutdown, AB in progress:…):
wiki.archiveteam.org/?diff=50599&oldid=50597
-
h2ibot
-
arkiver
pabs: it's maybe fine to put in AB when the current problems at IA are fixed
-
pabs
ok, wasn't sure if AB could handle that volume either
-
pabs
thanks
-
arkiver
well JAA is the expert on that
-
h2ibot
-
JAA
erkinalp: Ugh. Yeah, let's hope it's resurrected.
-
JAA
pabs, arkiver: AB doesn't care much about data size as long as there aren't huge files in it. The other limiting factor is number of URLs, but until you go over 100M, that's not usually a problem either.
-
erkinalp
JAA: seems no hope
-
erkinalp
why are archivebot downloads so slow?
-
JAA
Do you mean downloads of ArchiveBot data from the Internet Archive?
-
erkinalp
no, archivebot data downloads from archive.fart.website
-
erkinalp
i'm currently getting ISDN download speeds
-
erkinalp
it can't be due to my link speeds either
-
erkinalp
i happen to have 70mbps down, 10mbps up
-
JAA
The AB viewer is just an index of the data on IA.
-
JAA
The links go to IA.
-
JAA
And yeah, downloads from IA are notoriously slow, especially if you aren't near the Bay Area.
-
erkinalp
it isn't that slow normally
-
erkinalp
it was usually a few mbps
-
erkinalp
i could get good DSL download speeds
-
erkinalp
not dialup or isdn speeds
-
JAA
It varies depending on IA load and which server the data is on.
-
erkinalp
IA downloads are now down to dialup speeds
-
JAA
Not surprising. IA is pretty busy recently, and it slows various things to a crawl.
-
JAA
Even IA-internal things are slow. One particular item I was monitoring took over 6 days to move 43 GB around internally.
-
fireonlive
ooof
-
JAA
(Move it from S3 to item server, checksums, and mirror to backup server.)
-
arkiver
i'm planning on dusting off wikis-grab for the upcoming deletions of wikis
-
arkiver
though those are also largely covered already by wikiteam dumps i believe
-
arkiver
the wikis-grab would be more of a general method for archiving wikis
-
arkiver
project is also coming for ZOWA
-
arkiver
any idea for a channel for zowa.app ?
-
arkiver
flashfire42: do you know if we have the orange ISP hosting stuff fully covered with AB?
-
pokechu22
arkiver: have you seen #wikibot?
-
pokechu22
ah, you're already in that channel
-
arkiver
pokechu22: yeah
-
arkiver
i think it's good to have both dumps from there and from a project creating WARCs
-
pokechu22
I haven't seen wikis-grab before - does it try to do everything wikiteam does, or is it mostly focused on saving the current revision of every page?
-
pokechu22
Yeah, WARCs are good
-
arkiver
current revision
-
arkiver
may be good to give the wikiteam dumps higher priority than the WARCs, since they're more complete
-
arkiver
but after that we should attempt to create WARCs as well
-
pokechu22
Saving the current revision and maybe all pages on the history tab (but not the revisions themselves - just the history list for attribution) probably is enough for WARCs
-
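A minimal sketch of what that URL list could look like for one MediaWiki wiki (the base URL, titles file, and history page size are assumptions; titles are assumed to be URL-encoded already):
WIKI='https://wiki.example.org/index.php'
while read -r title; do
  printf '%s?title=%s\n' "$WIKI" "$title"                            # current revision of the page
  printf '%s?title=%s&action=history&limit=500\n' "$WIKI" "$title"   # history list, for attribution
done < titles.txt
-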
arkiver
yeah
-
JAA
Love it!
-
arkiver
it's largely for URL preservation, so it's in the wayback machine and easily browsable
-
arkiver
the dumps can be used to restore a wiki (right?), but for browsing, WARCs are better
-
arkiver
JAA: :)
-
JAA
Yep, that's accurate.
-
JAA
The dumps are entirely unusable for the average person.
-
arkiver
and of course outlinks for #// :)
-
nstrom|m
<arkiver> "any idea for a channel for zowa..." <- zowch
-
flashfire42|m
Sorry arkiver, it's 4am here and you are lucky I snap awake for random reasons. Orange is far from complete in archivebot. I've been launching as many jobs as I can manage but it's like fighting a fire with a kid's water bucket. I'll get a sampling but not all of it
-
flashfire42|m
I will continue to throw in as much as possible during the next week but we won't get it all, I can say that with certainty. Not unless we get a stay of execution for another month or 2
-
flashfire42|m
Hopefully that info is helpful; it is time for me to head back to sleep for another 2 hours.
-
pokechu22
I'll try to do an !a < list job for it too
-
pokechu22
The deadline for webs is sooner though
-
erkinalp
JAA: unless they have a WARC reader,
-
JAA
z-oww-a
-
imer
#nowa
-
imer
although I like the oww one better :D
-
JAA
Or perhaps some play on the content. What are some sounds you'd absolutely not want to hear in an ASMR video?
-
nstrom|m
zowaah, zowie 🤷♂️
-
imer
#zo🍽️
-
JAA
My terminal is sad about that last one.
-
imer
Yeah, lets not :)
-
arkiver
:)
-
thuban
arkiver, re orange isp hosting: AntoninDelFabbro|m posted a link to a page listing sites, and i have been enumerating them using its api
-
thuban
if my suspicion that supplying 0 as the category id retrieves all categories is correct, i expect to be able to enumerate 159832 sites (some fraction of which will be duplicates or inaccessible due to various oddnesses)
-
thuban
i don't think it's realistic to do this 'manually', but maybe some `!a <` jobs? individual sites are quite small as a rule
-
AntoninDelFabbro|m
I'm really impressed and thankful
-
fireonlive
i for one vote for emoji channel ;)
-
fireonlive
:3
-
erkinalp
JAA: wowturkey definitively dead, we can update the deadwatch now (death date: 2023-08-27,0400Z)
-
erkinalp
s/deadwatch/deathwatch/
-
fireonlive
oh it is confirmed by owner?
-
erkinalp
owner not responding to any correspondence
-
erkinalp
no hope of coming back up again
-
erkinalp
the AB job has a few external links (~650 or so) pending
-
fireonlive
ah :(
-
erkinalp
to skip wowturkey.com without impacting the remaining ~650 external resources, i'd propose to temporarily map wowturkey.com to 0.0.0.0 ;[
-
JAA
:-(
-
erkinalp
on the bot's end i mean
-
JAA
The AB job is paused, and the offsite URLs aren't in danger, so we can let it sit until the true deadline just in case it comes back.
-
erkinalp
oh, i thought it was looping over and over
-
erkinalp
good that it's paused
-
erkinalp
if it doesn't come back up by 23 september, then it's DaaD
-
erkinalp
(23 september is when the hosting expires, exactly 22 years from the website's start)
-
erkinalp
and the shutdown was exactly 20 years and 1 day from the first turkish language post
-
erkinalp
wowturkey initially consisted of english threads
-
erkinalp
promoting turkey to outsiders
-
pokechu22
-
pokechu22
(this also contains urls from monsite.orange.fr and monsite.wanadoo.fr, both of which give a page redirecting (but not a 3xx redirect) to monsite-orange.fr)
-
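A rough way to see how that non-3xx redirect works (the URL is a placeholder, and whether it's a meta refresh or JavaScript is an assumption to check):
curl -s 'http://monsite.orange.fr/somesite' | grep -Eio '<meta[^>]*refresh[^>]*>|location\.(href|replace)[^;]*'
-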
thuban
pokechu22: how was that list collected?
-
JAA
erkinalp: Well, to be precise, 'paused' here just means a very slow request rate (one request every five minutes in this case), not actually paused.
-
JAA
Also, not sure where you got that 650 number from.
-
pokechu22
-
pokechu22
mixed in a list from #webroasting a while back
-
JAA
There are about 8.3k offsite URLs in the remaining queue.
-
pokechu22
SrainUser's
transfer.archivete.am/Y5Qsp/orange_isp_hosting_urls.txt which I think was scraped from the list the site gives but I'm not 100% sure
-
pokechu22
and, yes, there's a fair bit of garbage on my list - easier to let it be attempted and fail than to try to filter it out
-
arkiver
thuban: if you have a list of sites, please do post them!
-
thuban
arkiver: still processing, will do
-
pokechu22
I can deduplicate my list against anything you find and start a second job for whatever's missing
-
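A minimal sketch of that deduplication, assuming both lists are plain one-URL-per-line files (the file names are placeholders):
sort -u my_orange_list.txt > a.sorted
sort -u thubans_orange_list.txt > b.sorted
comm -13 a.sorted b.sorted > missing_from_my_list.txt   # URLs only present in thuban's list
-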
pokechu22
It looks like there's a pagespro-orange.fr in addition to a pagesperso-orange.fr incidentally
-
thuban
yep
-
arkiver
thuban: thank you
-
arkiver
and in the meantime - all those queuing AB jobs for orange, please keep doing that
-
pokechu22
I'm currently doing an !a < list AB job for it - this is easier since it's one job for thousands of sites, but it's a bit buggy in that if the sites link to each other, it might not recurse properly. Still, it seems like the most practical way to do this
-
erkinalp
JAA: thanks for the number
-
arkiver
erkinalp: we got a pretty serious chunk of it i believe
-
erkinalp
89% of items saved
-
erkinalp
maybe more
-
erkinalp
(the website had 9.35M posts, according to their own stats)
-
arkiver
that's good!
-
fireonlive
:D
-
arkiver
not sure the percentage is correct, but we got more than half i think
-
erkinalp
after scraping and reconstruction, i might actually get more posts
-
thuban
pokechu22: agreed re practicality (and i don't think sites linking to each other will be a problem--at worst, it won't get pages that are only linked from other sites and not from their own site's homepage)
-
thuban
-
thuban
i see that you have a job running for monsite-orange.fr_seed_urls.txt; are you going to start another for orange_isp_hosting_urls.txt? (or has that already been done?)
-
pokechu22
I'm working on building my own list for the orange one based on orange_isp_hosting_urls.txt but I'm not going to run orange_isp_hosting_urls.txt directly
-
thuban
ok, cool
-
JAA
erkinalp: Those stats are not right. You'd have to analyse the WARC data to tell how much we covered. The number of URLs retrieved is not really correlated to that in a meaningful way.
-
JAA
We fetched offsite URLs, we fetched rating.php, and so on.
-
JAA
A coarser estimate would be possible by analysing just the log file and retrieving how many topic IDs appear there, but later pages could be missing, so that's still only a rough estimate.
-
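A rough sketch of that coarser estimate, assuming the job log records the fetched URLs and topics use phpBB-style viewtopic.php?t= IDs (the log filename is a placeholder):
zcat wowturkey_job.log.gz | grep -Eo 'viewtopic\.php\?t=[0-9]+' | sort -u | wc -l   # distinct topic IDs seen in the log
-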
erkinalp
JAA: yeah that was what i was referring to by "after .. reconstruction, i might actually get more posts"
-
JAA
Or less. We don't know how far it got through the forum pagination.
-
erkinalp
thankfully wowturkey's viewtopic.php page size is fixed at 10, and ttum.php page size is fixed at 100
-
erkinalp
and both were configured in a manner to link to the most recent pages of each respective topic
-
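A minimal sketch of expanding one topic into its start= pages under those assumptions (the topic ID, post count, and /forum/ path are placeholders; 10 posts per page as stated above):
t=123456; posts=87
for ((start=0; start<posts; start+=10)); do
  echo "http://wowturkey.com/forum/viewtopic.php?t=${t}&start=${start}"
done
-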
szczot3k
Hi, how can I help with the efforts? Is a v6-space-holder useful at all?
-
rewby
IIRC there's at least one project currently active that uses a ton of v6 ips
-
rewby
I forget which one, imer wrote the code for it
-
rewby
Or rather
-
rewby
imer wrote the deployment code that makes it do lots of v6
-
szczot3k
Well, I can technically use a whole /39, so ready to help out
-
imer
#deadcat but we're bandwidth limited there anyways (and they seem to rate limit per single ip or something like that)
-
DigitalDragons
#deadcat is the one
-
imer
target bandwidth limited that is
-
rewby
Also, I could see if I can hook up my spare /32 to some workers at some point
-
rewby
I'm just busy with targets
-
imer
szczot3k: here's the aforementioned code/script as well:
gist.github.com/imerr/614e534218a6b93be1a40b088dee885a
-
DigitalDragons
i heard #sweet supports ipv6 too but I don't know about their ratelimiting
-
imer
there is none unless you go way too fast and then they will (it seems) manually block your ip
-
DigitalDragons
hah
-
flashfire42
Ok got back through scrollback and it seems the consensus is to ignore webs for the moment and focus on orange? cc arkiver
-
DigitalDragons
also, glad to hear about wikis-grab!
-
DigitalDragons
i have some wikibot #// extraction almost ready but unsure about filtering
-
thuban
ok, orange.fr enumeration finished and spot-checks suggest that i got all the categories
-
thuban
processing the results now
-
thuban
malformed urls won't break archivebot, right? there are a few fun ones in here, like `usftennis2.monsite-orange.fr/index.html#="'><h1>abcd</h1>${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>` and `monsite.orange.fr@la-canaliere`
-
pokechu22
Right
-
pokechu22
the first one would just be treated as usftennis2.monsite-orange.fr/index.html because of the #
-
pokechu22
the second one would probably be treated as trying to log in as user monsite.orange.fr on site
la-canaliere, which obviously won't work, but will fail in an acceptable way
-
h2ibot
FireonLive edited Deathwatch (+294, move wowTURKEY to dead (we should use that…):
wiki.archiveteam.org/?diff=50602&oldid=50598
-
pokechu22
the main thing that breaks archivebot is FTP - there are a few other things that can cause problems but they aren't easy to control for
-
h2ibot
FireonLive edited Deathwatch (+2, fix url for 2028-Russia going to example.com…):
wiki.archiveteam.org/?diff=50603&oldid=50602
-
fireonlive
(i was like example.com?!)
-
JAA
That mistake is so common.
-
JAA
I wish there was a way to make edits throw an error when a template isn't used correctly.
-
JAA
Probably possible with an extension or something ridiculous like that.
-
pokechu22
You could probably use an editfilter
-
pokechu22
er, for that one, probably the right thing to do is make it generate a big red message of anger instead of silently using example.com
-
JAA
We do have
wiki.archiveteam.org/index.php/Category:Pages_with_broken_URLs for all uses of Template:URL where the URL is empty.
-
JAA
I just remembered that I added that at one point.
-
pokechu22
-
h2ibot
Pokechu22 edited Template:Url (+138, add visible warning about broken URLs):
wiki.archiveteam.org/?diff=50604&oldid=49244
-
h2ibot
Pokechu22 edited Reddit (-1, fix incorrect {{URL}} usage):
wiki.archiveteam.org/?diff=50605&oldid=49987
-
h2ibot
Pokechu22 edited Talk:Twitter (+2, fix incorrect {{URL}} usage):
wiki.archiveteam.org/?diff=50606&oldid=49771
-
h2ibot
Pokechu22 created Category:Pages with broken URLs (+210, Created page with "Pages that use…):
wiki.archiveteam.org/?title=Category%3APages%20with%20broken%20URLs
-
JAA
Good idea, thanks.
-
JAA
Should be good enough.
-
fireonlive
awesome ^_^