#archiveteam-bs

01:23

thuban

on that note, what about world of tanks?
05:48

h2ibot

Manu edited Mailman/2 (+50, /* calypso.tux.org/pipermail lost */): wiki.archiveteam.org/?diff=52173&oldid=52170
07:27

tourist

thuban: should I repost those messages here?
07:28

thuban

tourist: for the benefit of log-readers, sure
07:29

tourist

[reposting from #archiveteam]
07:29

tourist

Hi, just want to discuss before editing Deathwatch because it's a bit vague:
07:29

tourist

booru.org is a site which allowed people to host their own tag-based 'booru' imageboards. Some are basically archives themselves for fandoms or special interests.
07:29

tourist

There are about 3000 boorus hosted, eighty have over 10,000 images, ten have over 100,000 images and two exceptional boorus have 1.5 and 1.7 million images respectively.
07:29

tourist

I propose it be placed on deathwatch due to this post a couple of weeks ago from the site admin: forum.booru.org/viewtopic.php?t=14193
07:30

tourist

>The project is closed and winding down, resources for search functionality etc. at peak times will get tapped out.
07:30

tourist

Does it seem like this site would require a dedicated project or should it be added to the Deathwatch page as normal, with 'Unknown' date.
07:30

tourist

[/end repost]
07:31

thuban

tourist: in theory sites with troubling vital signs but no clear shutdown announcement should go on 'fire drill' rather than deathwatch, but that page is a bit of a mess and i've been meaning to clean it up for some time, so i think deathwatch is ok for now
07:32

thuban

and we do add sites to deathwatch even if they get their own wiki pages/dedicated projects (although my guess is that it won't be necessary in this case)
07:33

tourist

Alright, I'll add it to the list now. Thanks :)
07:34

thuban

tourist: you're welcome! do you know whether there's a way to get a list of all the boorus, and/or whether booru creation/activity has been disabled?
07:36

tourist

Booru creation is closed. Boorus are still active.
07:38

tourist

List of boorus can be found at booru.org/top but you can only grab up to 200 per page.
07:39

thuban

that's fine if it's a complete list; let's see
07:39

thuban

yep, looks like it
07:39

thuban

thanks!
09:16

arkiver

should we do something for opensubtitles.org ? they have been restricting access greatly lately
09:16

arkiver

Ryz: no updates, and no reply from them
09:17

arkiver

perhaps at this point best is just gathering lists of abload.de URLs and pushing them through AB if it's not an extreme number of URLs
09:18

arkiver

huh is opensubtitles.org completely behind login now?
09:18

arkiver

what... looks like it, same for others?
09:18

arkiver

:(
09:18

thuban

arkiver: no, not for me
09:19

arkiver

thuban: do you have a example of a subtitle URL?
09:19

arkiver

that is not behind a login for you
09:19

thuban

sure: opensubtitles.org/en/subtitleserve/sub/10010159 / opensubtitles.com/en/subtitles/legacy/10010159/download ('beta')
09:20

arkiver

sends me to a login form
09:20

arkiver

let me VPN this
09:21

arkiver

thuban: hmm from a different location i get no login scen
09:21

arkiver

screen*
09:22

arkiver

i feel like opensubtitles.org is becoming more shitty fast though
09:22

thuban

arkiver: X-Forwarded-For trick work?
09:25

arkiver

thuban: no i think
09:43

steering

arkiver: yes, very much becoming more shitty fast
09:44

steering

i think at one point (some months ago) it also tried to make me login but doesn't do it now
09:44

arkiver

i think we'll launch a project for them
09:44

arkiver

better archive it before its too late
09:45

arkiver

now they apparently have a forced login for some IPs
09:45

arkiver

---
09:45

eggdrop

[karma] '-' now has -2 karma!
09:45

arkiver

WHAT
09:46

arkiver

---
09:46

eggdrop

[karma] '-' now has -3 karma!
09:46

arkiver

- --
09:46

eggdrop

[karma] '-' now has -4 karma!
09:46

arkiver

what magic is this
09:47

steering

sourcery
09:47

thuban

strip trailing '--', trim remainder of message
09:47

steering

-- -
09:47

steering

no pre-decrement
09:47

arkiver

anyway
09:47

arkiver

so something i came across
09:48

arkiver

this one is apparently going away in June bedfordregiment.org.uk - clearly some simple site made by probably a single enthusiastic person
09:48

arkiver

(i put it in AB)
09:48

arkiver

they have a list of sources/links to similar sites at bedfordregiment.org.uk/links.html
09:49

arkiver

near the bottom of the page they have
09:49

arkiver

> A Northamptonshire family history site worth knowing, which carries a wide array of [...]
09:49

arkiver

with a link to familyhistorynorthants.co.uk , which is a blog about gambling.
09:50

arkiver

but looking the front page up in the wayback machine, one finds a beautiful simple little site rich with information... gone and taken over by some gambling/scam business recently
09:51

arkiver

i wonder if we can find these sites easily somehow and get them all archived, it's sad to see how some of these end up. i bet many of these are maintained by old enthusiastic people, who may pass away in the coming years, after which their sites go down and tons of information gets lost
09:52

thuban

arkiver: i was just thinking along the same lines (i love sites like this and run them trough ab whenever i come across them)
09:52

arkiver

thuban: yeah!
09:53

thuban

marginalia.nu is not a bad source for these; i think the index is available somewhere
09:53

arkiver

we should get them all
09:54

arkiver

perhaps we can also contact several of these sites and let them know that we can archive these types of sites
09:54

arkiver

perhaps they could spread the word, and people behind these sites could submit lists of sites like these that they know about
09:55

arkiver

there's maybe forums with enthusiasts around these kind of subjects?
09:55

arkiver

thuban: did you ever contact marginalia? maybe we should contact them?
09:56

thuban

arkiver: downloads.marginalia.nu/exports ! i think 'domains' is what we would want?
09:57

thuban

or 'urls', depending on how we handle it
09:57

thuban

^ *through
09:58

arkiver

thuban: love it, yeah! i don't know much about marginalia, do they only collect these type of little home made sites?
10:00

thuban

i don't know that much more, but there's a fair amount of writing about the project(s) and philosophy on the site
10:02

thuban

see also marginalia.nu/marginalia-search/about/#similar-projects
10:03

arkiver

thuban: i love it
10:03

arkiver

just looking at search.marginalia.nu
10:03

arkiver

i need to #Y up and running really
10:03

arkiver

so we can get all these domains
10:04

h2ibot

Manu edited Mailman/2 (+46, /* datacast.hu/mailman/listinfo saved */): wiki.archiveteam.org/?diff=52174&oldid=52173
10:05

arkiver

bedfordregiment.org.uk is in the list!
10:05

arkiver

familyhistorynorthants.co.uk too
10:06

arkiver

yeah we need to get this archive, amazing!
10:06

c3manu

arkiver: are you looking for a crawled index of individual pages, or seed URLs?
10:06

arkiver

c3manu: any
10:06

c3manu

github.com/MarginaliaSearch/PublicData/tree/master/sets
10:06

c3manu

this is where people can submit urls :)
10:06

c3manu

..or what people submitted
10:07

arkiver

lovely!
10:07

arkiver

yeah we should get that too
10:07

c3manu

feel free to extend the wiki page ;)
10:07

c3manu

wiki.archiveteam.org/index.php/Marginalia_Search
10:08

c3manu

i also think webrings would be good for indices. in the indie corners of the internet those are getting popular again
10:09

c3manu

just look at this: webring.xxiivv.com
10:09

thuban

in theory yes; in practice it might be difficult to identify webring to/from links (since they can be formatted arbitrarily)
10:09

thuban

ah, a central index :)
10:09

c3manu

yeah, that's definitely not going to be fun ^^
10:12

arkiver

perhaps it's more something for marginalia to find these sites through those ^ and list them online?
10:12

arkiver

i will send marginalia.nu an email about this awesomeness
10:13

arkiver

do we have a pipeline on AB that can handle a 180 GB file?
10:13

arkiver

i want to throw downloads.marginalia.nu into it
10:13

thuban

^^ sounds good, i'm not sure whether the index is really curated or the search engine is doing the heavy lifting
10:13

c3manu

i approve re awesomeness email :)
10:13

arkiver

thuban: i guess they do some checks on the website front page to see if it is "old style" and include it only if it is
10:21

kiska

I feel like this is going to be a problem server8.kiska.pw/uploads/0bc070b5366d602c/image.png
10:21

kiska

Limited to 10 per day...
10:48

arkiver

kiska: yeah it would be a very long term effort
11:15

nyany

arkiver: depends on how it's done
11:15

nyany

if it's ip based, sure we're metaphorically screwed
11:16

nyany

but if it's SESSION based... (JAA's favorite)
11:16

nyany

i.e. store session with 24h expiry as cookie object, thus enabling easy bypassing if one were to simply ignore cookies
11:22

that_lurker

arkiver: Space wise the new pipelines like firepipe should fit 100+ gig files easily
11:27

arkiver

that_lurker: thanks! i put it on firepipe-f
11:28

AK

Interesting
11:28

AK

Never heard of marginalia before now
11:41

kiska

nyany: looks to be IP based
11:43

nstrom|m

does look like www.opensubtitles.org supports ipv6 though
12:00

ThreeHM

That 10/day limit only appears on their new "beta" site for me, I can still download as much as I want through the regular/older one
12:01

ThreeHM

Yet another reason to archive it before they change that I guess
12:13

JaffaCakes118

My friend has a file analysis site and would like all his current reports archived, I have a list of almost 400k links and was wondering if someone could start the archivebot archive of them please - transfer.archivete.am/ffsKw/neiki%20analytics%20links.txt
12:13

eggdrop

inline (for browser viewing): transfer.archivete.am/inline/ffsKw/neiki%20analytics%20links.txt
12:14

JaffaCakes118

sites currently running cloudflare with the "essentially off" setting enabled, but I can get them to disable cloudflare completely if needed, but don't think it will be needed
12:18

thuban

JaffaCakes118: your friend's reports depend on javascript-initiated requests; archivebot will be useless
12:20

thuban

i suppose it might work if we generated the corresponding api url for every page
12:20

JaffaCakes118

thuban the links are able to be archived perfectly through the save now page
12:20

JaffaCakes118

is it not the same for archivebot?
12:20

thuban

save page now is not archivebot
12:20

thuban

no
12:20

katia

save page now runs a browser, archivebot doesn't
12:21

JaffaCakes118

ah ok
12:21

JaffaCakes118

is there any way we can still archive it? My friend of course will be willing to make changes
12:21

katia

but seems we'd just need api.neiki.dev/analyze/reports?sha256=...
12:22

JaffaCakes118

yeah he said save the api instead
12:22

JaffaCakes118

and it will return the data of it
12:22

katia

well, alongside
12:22

JaffaCakes118

I will get a list of links now for the api.neiki.dev
12:23

thuban

no need
12:23

JaffaCakes118

oh ok
13:58

nyany

wickerz: I'm sure you saw my little post in ab but that site should be all set for you, it's on the bot per c3manu
13:59

nyany

looks like a success too
14:00

wickerz

ty nyany
14:16

Neiki

Hey, i want to create an sitemap for my website to display all links that could be archived every month f.ex, what should i do for that?
14:16

Neiki

just a big txt file with all links?
14:40

nstrom|m

there's an xml standard for sitemaps, see sitemaps.org/protocol.html
16:08

h2ibot

Manu edited Mailman/2 (-15, /* ffmpeg.org/mailman/listinfo saved by…): wiki.archiveteam.org/?diff=52175&oldid=52174
16:16

h2ibot

Manu edited Mailman/2 (-2, /* listas.softwarelivre.org saved */): wiki.archiveteam.org/?diff=52176&oldid=52175
16:19

h2ibot

Manu edited Mailman/2 (+25, /* no archives for…): wiki.archiveteam.org/?diff=52177&oldid=52176
17:17

h2ibot

Manu edited Mailman/2 (+37, /* list.ehu.eus saved */): wiki.archiveteam.org/?diff=52178&oldid=52177
18:03

qwertyasdfuiopghjkl

arkiver: Speaking of Marginalia Search, looks like it can also save the pages it crawls as WARC (although this is not enabled by default due to storage space reasons, and from the article it sounds like it might currently be done incorrectly): marginalia.nu/log/94_warc_warc marginalia.nu/release-notes/v2024-01-0
18:03

qwertyasdfuiopghjkl

MarginaliaSearch/MarginaliaSearch #62
18:04

qwertyasdfuiopghjkl

JAA: ^ Possibly some more warc software to mention on wiki.archiveteam.org/index.php/The_WARC_Ecosystem ?
18:15

nyany

FYI: bbc.com/news/world-middle-east-68961753
18:23

TheTechRobo

qwertyasdfuiopghjkl: Are they faking headers?
18:23

TheTechRobo

> In this case jwarc was a bit of an awkward fit, not a fault of the library, just a minor incompatibility with the level of operations, where much of Marginalia’s crawling works at a higher abstraction level and access to http protocol details isn’t always very easy, meaning some of the headers and handshakes is re-constructed after the fact.
18:25

TheTechRobo

The pull request says:
18:25

TheTechRobo

> A caveat is that it's not possible to fully record every aspect of the crawl due to incompatibilities of design and operation between the crawler and the expectations by the designers of the warc format, but a record of crawling is constructed after the fact. It may be possible to reconcile the two in the future, but this is outside of the scope
18:25

TheTechRobo

of this change.
18:26

TheTechRobo

They also seem to be having trouble with the size of WARC files. Maybe we can point out the warc.zst format?
18:28

qwertyasdfuiopghjkl

TheTechRobo: I don't know anything about the actual coding stuff, but the article seemed to implied some stuff is saved incorrectly. JAA would probably be a better person to ask about that.
18:29

qwertyasdfuiopghjkl

*seemed to imply
18:33

qwertyasdfuiopghjkl

(if the inaccuracies can be fixed, maybe it could eventually be a good new source of data for the WBM on small web stuff?)
20:04

Notrealname1234

Is there a channel for Facebook?
20:14

JAA

arkiver: We have archived downloads.marginalia.nu a couple times recently.
20:17

JAA

qwertyasdfuiopghjkl: I feel like I've heard about Marginalia's WARCs when it was first announced, and yeah, it sounds like they're faking it, so not good WARCs.
20:18

JAA

Cc arkiver
20:31

wickerz

Could someone AB wega-vinduer.dk they are soon to be declared bankrupt
20:43

Notrealname1234

wickerz: pokechu22 did it
20:49

fireonlive

ugh bad WARCs
21:21

fireonlive

now we just need proper warcs in archivebox...
21:23

h2ibot

Xaft edited List of websites excluded from the Wayback Machine (+37): wiki.archiveteam.org/?diff=52179&oldid=52122
21:24

h2ibot

MrScottyPieey edited Sploder (+106, The site shut down.): wiki.archiveteam.org/?diff=52180&oldid=51300
21:24

h2ibot

MrScottyPieey edited Template:Internet history (-15): wiki.archiveteam.org/?diff=52181&oldid=46691
21:24

h2ibot

MrScottyPieey edited Me at the zoo (+17, Wikipedia only allows UTC time zones.): wiki.archiveteam.org/?diff=52182&oldid=50561
21:24

h2ibot

MrScottyPieey edited YouTube (+33): wiki.archiveteam.org/?diff=52183&oldid=52147
21:24

h2ibot

MrScottyPieey created 2021 (+20, Created page with "{{Internet history}}"): wiki.archiveteam.org/?title=2021
21:24

h2ibot

MrScottyPieey uploaded File:YouTube screenshot 2024 May 4 2024 (cropped).png: wiki.archiveteam.org/?title=File%3A…0May%204%202024%20%28cropped%29.png
21:24

h2ibot

BooruUser edited Deathwatch (+310, added booru.org): wiki.archiveteam.org/?diff=52186&oldid=52136
21:27

fireonlive

uh
21:27

fireonlive

2021?
21:34

h2ibot

JustAnotherArchivist edited Sploder (+164, Restore shutdown announcement verbatim; add…): wiki.archiveteam.org/?diff=52187&oldid=52180
21:34

JAA

Yeah, not a big fan of those {{Internet history}} pages.
21:41

h2ibot

JustAnotherArchivist edited Me at the zoo (+9, Then it's a good thing we aren't Wikipedia and…): wiki.archiveteam.org/?diff=52188&oldid=52182
21:43

h2ibot

JustAnotherArchivist edited Me at the zoo (+105, Update views count, add comments count): wiki.archiveteam.org/?diff=52189&oldid=52188
21:57

fireonlive

JAA: same here, it's part of the number of pages i feel are quite out of scope for us
21:58

fireonlive

/better managed elsewhere
22:00

h2ibot

JAABot edited List of websites excluded from the Wayback Machine (+0): wiki.archiveteam.org/?diff=52190&oldid=52179
22:02

Notrealname1234

Because of Google's policy, can someone see if you can scrape this, Google doesn't allow scrapers tho google.com/search?q=site%3A*.drv.tw
23:06

fireonlive

we got some coverage(?) for subscene: techworm.net/2024/05/subscene-shutdown.html
23:07

Notrealname1234

90 GB reddit archived, is it possible to get it on WBM?

14 days ago

« a day earlier

a day later »

today »