-
thuban
on that note, what about world of tanks?
-
tourist
thuban: should I repost those messages here?
-
thuban
tourist: for the benefit of log-readers, sure
-
tourist
[reposting from #archiveteam]
-
tourist
Hi, just want to discuss before editing Deathwatch because it's a bit vague:
-
tourist
booru.org is a site which allowed people to host their own tag-based 'booru' imageboards. Some are basically archives themselves for fandoms or special interests.
-
tourist
There are about 3000 boorus hosted, eighty have over 10,000 images, ten have over 100,000 images and two exceptional boorus have 1.5 and 1.7 million images respectively.
-
tourist
I propose it be placed on deathwatch due to this post a couple of weeks ago from the site admin:
forum.booru.org/viewtopic.php?t=14193
-
tourist
>The project is closed and winding down, resources for search functionality etc. at peak times will get tapped out.
-
tourist
Does it seem like this site would require a dedicated project, or should it be added to the Deathwatch page as normal, with an 'Unknown' date?
-
tourist
[/end repost]
-
thuban
tourist: in theory sites with troubling vital signs but no clear shutdown announcement should go on 'fire drill' rather than deathwatch, but that page is a bit of a mess and i've been meaning to clean it up for some time, so i think deathwatch is ok for now
-
thuban
and we do add sites to deathwatch even if they get their own wiki pages/dedicated projects (although my guess is that it won't be necessary in this case)
-
tourist
Alright, I'll add it to the list now. Thanks :)
-
thuban
tourist: you're welcome! do you know whether there's a way to get a list of all the boorus, and/or whether booru creation/activity has been disabled?
-
tourist
Booru creation is closed. Boorus are still active.
-
tourist
List of boorus can be found at
booru.org/top but you can only grab up to 200 per page.
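[For log-readers: a minimal sketch of walking that listing with the stdlib. The `pid` offset parameter and the link pattern are guesses and would need checking against the real page source.]

```python
import re
import urllib.request

# Pull the subdomains of hosted boorus out of one page of the /top listing.
def extract_booru_names(html):
    return sorted(set(re.findall(r"https?://([a-z0-9-]+)\.booru\.org", html)))

# Walk the listing 200 entries at a time. The offset parameter ("pid") is
# an assumption borrowed from other booru software; check the site's real
# pagination links before relying on it.
def fetch_all_boorus(max_pages=20):
    names = set()
    for page in range(max_pages):
        url = f"https://booru.org/top?pid={page * 200}"  # assumed parameter
        with urllib.request.urlopen(url) as resp:
            found = extract_booru_names(
                resp.read().decode("utf-8", errors="replace"))
        if not found:
            break  # ran past the last page
        names.update(found)
    return sorted(names)
```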
-
thuban
that's fine if it's a complete list; let's see
-
thuban
yep, looks like it
-
thuban
thanks!
-
arkiver
should we do something for opensubtitles.org ? they have been restricting access greatly lately
-
arkiver
Ryz: no updates, and no reply from them
-
arkiver
perhaps at this point the best option is just gathering lists of abload.de URLs and pushing them through AB, if it's not an extreme number of URLs
-
arkiver
huh is opensubtitles.org completely behind login now?
-
arkiver
what... looks like it, same for others?
-
arkiver
:(
-
thuban
arkiver: no, not for me
-
arkiver
thuban: do you have an example of a subtitle URL?
-
arkiver
that is not behind a login for you
-
thuban
-
arkiver
sends me to a login form
-
arkiver
let me VPN this
-
arkiver
thuban: hmm from a different location i get no login screen
-
arkiver
i feel like opensubtitles.org is becoming more shitty fast though
-
thuban
arkiver: X-Forwarded-For trick work?
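[For log-readers: the "trick" is sending a spoofed X-Forwarded-For header in case the server trusts it for IP-based decisions. A sketch; it only helps against misconfigured origins, and 203.0.113.7 is a documentation-range placeholder address.]

```python
import urllib.request

# Build a request carrying a spoofed X-Forwarded-For header. This only
# changes anything when the origin (or a proxy in front of it) trusts the
# header from arbitrary clients -- most don't, so treat any difference in
# the response as luck rather than a reliable technique.
def xff_request(url, spoofed_ip="203.0.113.7"):
    return urllib.request.Request(url, headers={
        "X-Forwarded-For": spoofed_ip,
        "User-Agent": "Mozilla/5.0",
    })

# usage (does a network fetch):
#   urllib.request.urlopen(xff_request("https://www.opensubtitles.org/"))
```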
-
arkiver
thuban: no, i don't think so
-
steering
arkiver: yes, very much becoming more shitty fast
-
steering
i think at one point (some months ago) it also tried to make me log in, but it doesn't do it now
-
arkiver
i think we'll launch a project for them
-
arkiver
better to archive it before it's too late
-
arkiver
now they apparently have a forced login for some IPs
-
arkiver
---
-
eggdrop
[karma] '-' now has -2 karma!
-
arkiver
WHAT
-
arkiver
---
-
eggdrop
[karma] '-' now has -3 karma!
-
arkiver
- --
-
eggdrop
[karma] '-' now has -4 karma!
-
arkiver
what magic is this
-
steering
sourcery
-
thuban
strip trailing '--', trim remainder of message
-
steering
-- -
-
steering
no pre-decrement
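[For log-readers: the bot's source isn't visible here, but a model matching thuban's description reproduces the exchange above.]

```python
# Guessed karma parsing: if the message ends in "++" or "--", strip that
# suffix and trim whitespace; whatever remains is the karma subject. So
# "---" and "- --" both decrement karma for "-". There is no prefix
# ("pre-decrement") form, which is why "-- -" does nothing.
def parse_karma(message):
    msg = message.rstrip()
    for suffix, delta in (("++", 1), ("--", -1)):
        if msg.endswith(suffix):
            subject = msg[:-2].strip()
            if subject:
                return subject, delta
    return None
```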
-
arkiver
anyway
-
arkiver
so something i came across
-
arkiver
this one is apparently going away in June
bedfordregiment.org.uk - clearly a simple site, probably made by a single enthusiastic person
-
arkiver
(i put it in AB)
-
arkiver
they have a list of sources/links to similar sites at
bedfordregiment.org.uk/links.html
-
arkiver
near the bottom of the page they have
-
arkiver
> A Northamptonshire family history site worth knowing, which carries a wide array of [...]
-
arkiver
with a link to
familyhistorynorthants.co.uk, which is now a blog about gambling.
-
arkiver
but looking the front page up in the wayback machine, one finds a beautiful simple little site rich with information... gone and taken over by some gambling/scam business recently
-
arkiver
i wonder if we can find these sites easily somehow and get them all archived; it's sad to see how some of these end up. i bet many of them are maintained by enthusiastic older people, who may pass away in the coming years, after which their sites go down and tons of information gets lost
-
thuban
arkiver: i was just thinking along the same lines (i love sites like this and run them through ab whenever i come across them)
-
arkiver
thuban: yeah!
-
thuban
marginalia.nu is not a bad source for these; i think the index is available somewhere
-
arkiver
we should get them all
-
arkiver
perhaps we can also contact several of these sites and let them know that we can archive these types of sites
-
arkiver
perhaps they could spread the word, and people behind these sites could submit lists of sites like these that they know about
-
arkiver
maybe there are forums with enthusiasts around these kinds of subjects?
-
arkiver
thuban: did you ever contact marginalia? maybe we should contact them?
-
thuban
arkiver:
downloads.marginalia.nu/exports ! i think 'domains' is what we would want?
-
thuban
or 'urls', depending on how we handle it
-
arkiver
thuban: love it, yeah! i don't know much about marginalia, do they only collect these types of little home-made sites?
-
thuban
i don't know that much more, but there's a fair amount of writing about the project(s) and philosophy on the site
-
thuban
-
arkiver
thuban: i love it
-
arkiver
just looking at search.marginalia.nu
-
arkiver
i need to get #Y up and running, really
-
arkiver
so we can get all these domains
-
arkiver
-
arkiver
-
arkiver
yeah we need to get this archive, amazing!
-
c3manu
arkiver: are you looking for a crawled index of individual pages, or seed URLs?
-
arkiver
c3manu: any
-
c3manu
-
c3manu
this is where people can submit urls :)
-
c3manu
..or what people submitted
-
arkiver
lovely!
-
arkiver
yeah we should get that too
-
c3manu
feel free to extend the wiki page ;)
-
c3manu
-
c3manu
i also think webrings would be good for indices. in the indie corners of the internet those are getting popular again
-
c3manu
just look at this:
webring.xxiivv.com
-
thuban
in theory yes; in practice it might be difficult to identify webring to/from links (since they can be formatted arbitrarily)
-
thuban
ah, a central index :)
-
c3manu
yeah, that's definitely not going to be fun ^^
-
arkiver
perhaps it's more something for marginalia to find these sites through those ^ and list them online?
-
arkiver
i will send marginalia.nu an email about this awesomeness
-
arkiver
do we have a pipeline on AB that can handle a 180 GB file?
-
arkiver
i want to throw
downloads.marginalia.nu into it
-
thuban
^^ sounds good, i'm not sure whether the index is really curated or the search engine is doing the heavy lifting
-
c3manu
i approve re awesomeness email :)
-
arkiver
thuban: i guess they do some checks on the website front page to see if it is "old style" and include it only if it is
-
kiska
-
kiska
Limited to 10 per day...
-
arkiver
kiska: yeah it would be a very long term effort
-
nyany
arkiver: depends on how it's done
-
nyany
if it's ip based, sure we're metaphorically screwed
-
nyany
but if it's SESSION based... (JAA's favorite)
-
nyany
i.e. store session with 24h expiry as cookie object, thus enabling easy bypassing if one were to simply ignore cookies
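[For log-readers: a toy model of the point being made. The 10-download limit matches the number mentioned earlier; everything else is illustrative.]

```python
# Toy model: a server tracks a per-session download quota keyed by a
# cookie. A client that keeps the cookie hits the limit; a client that
# simply never returns the cookie gets a fresh session (and a fresh
# quota) on every request.
class QuotaServer:
    def __init__(self, limit=10):
        self.limit = limit
        self.sessions = {}   # session id -> downloads used
        self.next_id = 0

    def request(self, cookie=None):
        if cookie is None:                  # no cookie sent: new session
            cookie = self.next_id
            self.next_id += 1
            self.sessions[cookie] = 0
        if self.sessions[cookie] >= self.limit:
            return cookie, "quota exceeded"
        self.sessions[cookie] += 1
        return cookie, "ok"

# Obedient client: stores the cookie, so the counter accumulates.
server = QuotaServer()
cookie, results = None, []
for _ in range(12):
    cookie, status = server.request(cookie)
    results.append(status)

# Cookie-ignoring client: every request starts a brand-new session.
server2 = QuotaServer()
bypass = [server2.request(None)[1] for _ in range(12)]
```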
-
that_lurker
arkiver: Space-wise, the new pipelines like firepipe should fit 100+ gig files easily
-
arkiver
that_lurker: thanks! i put it on firepipe-f
-
AK
Interesting
-
AK
Never heard of marginalia before now
-
kiska
nyany: looks to be IP based
-
nstrom|m
does look like www.opensubtitles.org supports ipv6 though
-
ThreeHM
That 10/day limit only appears on their new "beta" site for me, I can still download as much as I want through the regular/older one
-
ThreeHM
Yet another reason to archive it before they change that I guess
-
JaffaCakes118
My friend has a file analysis site and would like all his current reports archived. I have a list of almost 400k links and was wondering if someone could start an archivebot job for them, please -
transfer.archivete.am/ffsKw/neiki%20analytics%20links.txt
-
JaffaCakes118
site's currently running cloudflare with the "essentially off" setting enabled. I can get him to disable cloudflare completely if needed, but I don't think it will be necessary
-
thuban
JaffaCakes118: your friend's reports depend on javascript-initiated requests; archivebot will be useless
-
thuban
i suppose it might work if we generated the corresponding api url for every page
-
JaffaCakes118
thuban: the links can be archived perfectly through save page now
-
JaffaCakes118
is it not the same for archivebot?
-
thuban
save page now is not archivebot
-
thuban
no
-
katia
save page now runs a browser, archivebot doesn't
-
JaffaCakes118
ah ok
-
JaffaCakes118
is there any way we can still archive it? My friend of course will be willing to make changes
-
katia
-
JaffaCakes118
yeah he said save the api instead
-
JaffaCakes118
and it will return the data of it
-
katia
well, alongside
-
JaffaCakes118
I will get a list of links now for the api.neiki.dev
-
thuban
no need
-
JaffaCakes118
oh ok
-
nyany
wickerz: I'm sure you saw my little post in ab but that site should be all set for you, it's on the bot per c3manu
-
nyany
looks like a success too
-
wickerz
ty nyany
-
Neiki
Hey, i want to create a sitemap for my website to list all links that could be archived every month, for example. what should i do for that?
-
Neiki
just a big txt file with all links?
-
nstrom|m
there's an xml standard for sitemaps, see
sitemaps.org/protocol.html
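[For log-readers: the protocol linked above boils down to an XML file like the following. `<loc>` is required; `<lastmod>` is optional but would let archivists spot new reports. A file may hold up to 50,000 URLs before a sitemap index is needed. The URL here is a placeholder.]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/report/1234</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <!-- one <url> entry per page, up to 50,000 per file -->
</urlset>
```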
-
h2ibot
Manu edited Mailman/2 (+25, /* no archives for…):
wiki.archiveteam.org/?diff=52177&oldid=52176
-
qwertyasdfuiopghjkl
arkiver: Speaking of Marginalia Search, looks like it can also save the pages it crawls as WARC (although this is not enabled by default due to storage space reasons, and from the article it sounds like it might currently be done incorrectly):
marginalia.nu/log/94_warc_warc marginalia.nu/release-notes/v2024-01-0
-
qwertyasdfuiopghjkl
-
qwertyasdfuiopghjkl
JAA: ^ Possibly some more warc software to mention on
wiki.archiveteam.org/index.php/The_WARC_Ecosystem ?
-
nyany
-
TheTechRobo
qwertyasdfuiopghjkl: Are they faking headers?
-
TheTechRobo
> In this case jwarc was a bit of an awkward fit, not a fault of the library, just a minor incompatibility with the level of operations, where much of Marginalia’s crawling works at a higher abstraction level and access to http protocol details isn’t always very easy, meaning some of the headers and handshakes is re-constructed after the fact.
-
TheTechRobo
The pull request says:
-
TheTechRobo
> A caveat is that it's not possible to fully record every aspect of the crawl due to incompatibilities of design and operation between the crawler and the expectations by the designers of the warc format, but a record of crawling is constructed after the fact. It may be possible to reconcile the two in the future, but this is outside of the scope
-
TheTechRobo
of this change.
-
TheTechRobo
They also seem to be having trouble with the size of WARC files. Maybe we can point out the warc.zst format?
-
qwertyasdfuiopghjkl
TheTechRobo: I don't know anything about the actual coding stuff, but the article seemed to imply that some stuff is saved incorrectly. JAA would probably be a better person to ask about that.
-
qwertyasdfuiopghjkl
(if the inaccuracies can be fixed, maybe it could eventually be a good new source of data for the WBM on small web stuff?)
-
Notrealname1234
Is there a channel for Facebook?
-
JAA
arkiver: We have archived
downloads.marginalia.nu a couple times recently.
-
JAA
qwertyasdfuiopghjkl: I feel like I've heard about Marginalia's WARCs when it was first announced, and yeah, it sounds like they're faking it, so not good WARCs.
-
JAA
Cc arkiver
-
wickerz
Could someone AB
wega-vinduer.dk? they are soon to be declared bankrupt
-
Notrealname1234
wickerz: pokechu22 did it
-
fireonlive
ugh bad WARCs
-
fireonlive
now we just need proper warcs in archivebox...
-
h2ibot
Xaft edited List of websites excluded from the Wayback Machine (+37):
wiki.archiveteam.org/?diff=52179&oldid=52122
-
h2ibot
MrScottyPieey edited Sploder (+106, The site shut down.):
wiki.archiveteam.org/?diff=52180&oldid=51300
-
h2ibot
MrScottyPieey edited Template:Internet history (-15):
wiki.archiveteam.org/?diff=52181&oldid=46691
-
h2ibot
MrScottyPieey edited Me at the zoo (+17, Wikipedia only allows UTC time zones.):
wiki.archiveteam.org/?diff=52182&oldid=50561
-
h2ibot
MrScottyPieey created 2021 (+20, Created page with "{{Internet history}}"):
wiki.archiveteam.org/?title=2021
-
h2ibot
BooruUser edited Deathwatch (+310, added booru.org):
wiki.archiveteam.org/?diff=52186&oldid=52136
-
fireonlive
uh
-
fireonlive
2021?
-
h2ibot
JustAnotherArchivist edited Sploder (+164, Restore shutdown announcement verbatim; add…):
wiki.archiveteam.org/?diff=52187&oldid=52180
-
JAA
Yeah, not a big fan of those {{Internet history}} pages.
-
h2ibot
JustAnotherArchivist edited Me at the zoo (+9, Then it's a good thing we aren't Wikipedia and…):
wiki.archiveteam.org/?diff=52188&oldid=52182
-
h2ibot
JustAnotherArchivist edited Me at the zoo (+105, Update views count, add comments count):
wiki.archiveteam.org/?diff=52189&oldid=52188
-
fireonlive
JAA: same here, it's one of a number of pages i feel are quite out of scope for us
-
fireonlive
/better managed elsewhere
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=52190&oldid=52179
-
Notrealname1234
Can someone see if you can scrape this? Google doesn't allow scrapers though:
google.com/search?q=site%3A*.drv.tw
-
fireonlive
-
Notrealname1234
90 GB of reddit archived, is it possible to get it into the WBM?