-
thuban
ok, i'm currently scraping all the page images, and i'll feed them back into archivebot once i've got the list
-
thuban
(it'll take a little while, i'm being gentle)
-
h2ibot
JustAnotherArchivist edited Deathwatch (+159, /* 2023 */ Add OneHallyu):
wiki.archiveteam.org/?diff=51153&oldid=51152
-
vokunal|m
It looks to be about 1.1M posts
-
vokunal|m
nvm
-
thuban
Peroniko: did you want any advice about converting to pdf and/or uploading to ia?
-
thuban
the former is easy, but the latter will probably require some manual work due to the lack or inconsistent formatting of metadata
-
vokunal|m
~11.7M posts on OneHallyu
-
JAA
Yeah, 11.8 million is what the homepage says.
-
vokunal|m
someday i'll learn. First time i helped with a forum, I didn't see any number; the second time, i counted it manually. Fool me twice, I'd better learn the third time
-
JAA
:-)
-
Peroniko
thuban: I have uploaded some things to IA, but if there is a guide for better metadata it would be helpful.
-
thuban
Peroniko: the general metadata documentation is here:
archive.org/developers/metadata-schema/index.html
-
JAA
OneHallyu is running through AB now. We'll see how that goes.
-
JAA
Buttflare is involved.
-
kpcyrd
from -ot: how do I archive videos hosted on sharepoint? it's going to be deleted in a few days:
kth-my.sharepoint.com/:v:/g/persona…5PsB-5r8MVtnp5NECOtKN-YGsA?e=8s0kaj
-
nicolas17
it's probably some temporary signed URL that will change on every load and can't be archived in a way that lets the original link work in WBM
-
nicolas17
oof that seems to be DASH even
-
thuban
kpcyrd: i don't think there's anything reliably plug-and-play for sharepoint
-
nicolas17
transcoded on the fly from an original .mp4 that seems impossible to access
-
thuban
you could try
github.com/kylon/Sharedown or some of the workarounds suggested in sharepoint-related issues at
github.com/snobu/destreamer
-
thuban
(or punt and screen-record it)
-
nulldata
There's an open PR for yt-dlp to add SharePoint.
yt-dlp/yt-dlp #6531
-
nicolas17
I *could* grab the DASH but it sucks that I can't access the original .mp4 :/
-
kpcyrd
I'm trying to get it into the wayback machine specifically:
-
kpcyrd
20Xuan%2Emp4&ga=1
-
nicolas17
that's not going to work
-
kpcyrd
rip
-
nicolas17
there are timestamped, signed URLs that change every time you load the page
-
thuban
ia item not adequate?
-
nicolas17
ffmpeg doesn't do parallel requests (in fact I'm not sure if it does proper HTTP keepalive) so this DASH remux is taking me forever
-
nicolas17
oh great I got some 503 Service Unavailable too
-
nulldata
GitLab is now requiring new users to verify using a phone number or credit card, or the account will be deleted. So far it only seems to apply to new accounts, but something to keep an eye on if they expand it to existing accounts.
lemmy.world/post/8297909
-
that_lurker
bird.makeup would be a nice alternative way to grab twitter(x) stuff. They create a mastodon account where all the tweets are posted.
-
that_lurker
also means there is no rate limiting
-
that_lurker
other than on their end of course
-
rktk
stupid sexy nickserv
-
rktk
Does anyone know of a full dump or half dump of OpenSubtitles?
-
rktk
this is a real slap in the face
-
h2ibot
MasterX244 edited List of websites excluded from the Wayback Machine (+28):
wiki.archiveteam.org/?diff=51155&oldid=51036
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=51156&oldid=51155
-
arkiver
opensubtitles closing themselves?
-
murb
oh talking of which
bleed-clothing.com/de/info # "Wir sind insolvent."
-
JAA
arkiver: 'Only' the API, as I understand it?
-
Megame
Hans5958, murb I threw them in AB
-
murb
ta
-
fireonlive
though those prices are.. hm.
-
JAA
XML-RPC... Ok, yeah, I agree that needs to die already.
-
fireonlive
blog.opensubtitles.com/opensubtitle…i-embrace-the-20-black-friday-treat posted earlier says "This decision, initially disclosed in a forum post, will primarily affect non-VIP users, while VIP members will continue to enjoy access to the API." so i guess they're keeping it around for VIP people for a bit longer at least?
-
fireonlive
and yeah it does haha
-
anarchat
so uh
-
anarchat
apparently, i have a blogspot blog, two actually... and i learned this because google/blogger.com wrote me to tell me i haven't logged in since 2007 and so they will delete my shit... i wonder if we need to do something about this
-
anarchat
my two blogs are totally irrelevant and empty, but there might be others facing destruction out there
-
nicolas17
anarchat: it has been discussed before; how do we find "all blogs"?
-
anarchat
i have no idea
-
fireonlive
#frogger :)
-
nicolas17
it wasn't easy to get because if I just fed the DASH manifest to ffmpeg, one or two segments would randomly give a "503 service unavailable" and ffmpeg doesn't retry
-
nicolas17
so I got a gap in the video
-
nicolas17
I had to download all segments and rewrite the manifest to use those local files
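The workaround described here (fetch every segment separately, retrying on transient server errors, then point the manifest at the local copies) could be sketched roughly as follows. This is only an illustration, not the script nicolas17 actually used: the names `fetch_with_retry` and `download_segments`, the retry count, and the backoff are all assumptions, and rewriting the DASH manifest itself is left out.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Fetch one URL and return the response body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def fetch_with_retry(url, fetch=fetch_bytes, attempts=5, delay=2.0):
    """Call fetch(url), retrying on HTTP 5xx errors such as 503.

    Hypothetical helper: attempts and delay are illustrative defaults.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as error:
            # Client errors (4xx) will not fix themselves; give up at once.
            # Also give up when the last allowed attempt has failed.
            if error.code < 500 or attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise ValueError("attempts must be at least 1")


def download_segments(urls, dest_dir, fetch=fetch_bytes, delay=2.0):
    """Save every segment to dest_dir and return the local paths in order."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        path = dest / f"segment_{index:05d}.bin"
        path.write_bytes(fetch_with_retry(url, fetch=fetch, delay=delay))
        paths.append(path)
    return paths
```

With all segments on disk, a single failed request no longer leaves a gap in the remuxed video, since nothing is handed to ffmpeg until every download has succeeded.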
-
nicolas17
take it and figure out what to do with it; archive.org item or whatever :P
-
Pedrosso
transfer.archivete.am/J2GVQ/sporeforums1.txt does this list of largely unarchived spore forums have any URLs that the bot wouldn't be able to archive properly? In either case could the viable URLs be fed to AB?
-
JAA
The ones that aren't entire domains could be problematic recursion-wise.
-
Pedrosso
Problematic in what way?
-
JAA
Other than that, not sure.
-
JAA
Not recursing properly. If you !a
example.org/foo/ and it has a link to /bar, that won't be followed.
-
Pedrosso
not even with offsite links allowed?
-
JAA
No, because they're not offsite.
-
JAA
Offsite = different host
-
Pedrosso
I getcha, had hoped it was just offsite (named after different host) = outside recursion
-
JAA
For example, it wouldn't recurse anywhere useful from
mobygames.com/forum/game/36030/spore/ because those URLs aren't in .../spore/.
-
JAA
In that case, !a
mobygames.com/forum/game/36030/spore would work though.
-
JAA
But yeah, each of those needs to be looked at individually.
-
JAA
And some might simply not be possible.
-
Pedrosso
what does the / at the end do?
-
JAA
It's a path segment delimiter. For the purpose of AB, the last slash in the path part of the URL determines where it'll recurse onsite.
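The rule as described in this exchange (offsite means a different host; onsite recursion stays under the seed URL's path up to its last slash) can be illustrated with a small sketch. This is not ArchiveBot's actual code, only a model of the rule as stated above; the function names `scope_prefix` and `classify` are made up for the example.

```python
from urllib.parse import urljoin, urlsplit


def scope_prefix(seed_url: str) -> str:
    """Return the seed URL's path up to and including its last slash."""
    path = urlsplit(seed_url).path or "/"
    return path[: path.rfind("/") + 1]


def classify(seed_url: str, link: str) -> str:
    """Classify a link found on a page, relative to the seed URL."""
    seed = urlsplit(seed_url)
    # Resolve relative links such as "/bar" against the seed URL.
    target = urlsplit(urljoin(seed_url, link))
    if target.hostname != seed.hostname:
        return "offsite"
    if (target.path or "/").startswith(scope_prefix(seed_url)):
        return "in scope"
    return "onsite, out of scope"
```

Under this model, `classify("https://example.org/foo/", "/bar")` gives "onsite, out of scope" because the prefix is `/foo/`, while the same link with the seed `https://example.org/foo` is "in scope" because the prefix is just `/`.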
-
Pedrosso
I see
-
Pedrosso
Should I send a transfer.archivete.am link with just the full-domain ones then?
-
JAA
No need, this one is fine.
-
Pedrosso
Alright. Thanks a lot then
-
JAA
I wonder whether
blog.seamonkey-project.org/2023/11/…4/migrating-off-archive-mozilla-org only applies to SeaMonkey or also to other projects or even the entire archive.mozilla.org.
-
JAA
(It's already running through AB courtesy of arkiver.)
-
JAA
Cc pabs ^
-
JAA
I'm currently listing all of archive.mozilla.org. It's ... large.
-
JAA
I'll have a size estimate later.
-
arkiver
JAA: maybe it's too large for ArchiveBot, i wonder how large it is. hope we can archive it entirely
-
JAA
I'm already up to over 1.2 million *directories* after only processing 17k.
-
JAA
So yeah...
-
JAA
To rephrase it a bit clearer: I've processed 17k directories and discovered over 1.2 million directories from those. I'm recursing through the dir tree, obviously.
-
JAA
And those numbers are now at 32k done, 2.1M discovered.
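The processed versus discovered counts come from walking the directory tree with a work queue: each listing that is processed adds its subdirectories to the set of discovered ones. A minimal sketch of that bookkeeping, assuming a caller-supplied `list_subdirs` function (hypothetical; the real listing here was done with qwarc against the live server):

```python
from collections import deque


def walk_listing(root, list_subdirs, max_dirs=None):
    """Breadth-first walk over a directory tree.

    list_subdirs(directory) must return the subdirectories of one directory.
    Returns (processed, discovered) counts, stopping early after max_dirs
    directories have been processed if max_dirs is given.
    """
    queue = deque([root])
    discovered = {root}
    processed = 0
    while queue and (max_dirs is None or processed < max_dirs):
        current = queue.popleft()
        processed += 1
        for child in list_subdirs(current):
            # Skip directories that are already queued or done.
            if child not in discovered:
                discovered.add(child)
                queue.append(child)
    return processed, len(discovered)
```

The `discovered` set only prevents listing the same path twice. It would not stop a loop that keeps producing new, ever longer paths, so a depth or path-length cap would be needed for that case.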
-
JAA
It'll be a while...
-
Pedrosso
How long will it stay up? Assuming it has any sort of shutdown date
-
JAA
See link above
-
JAA
Beware of
archive.mozilla.org/pub/firefox/tinderbox-builds , those subdirs are *massive*. Like, 100 MB dir listings massive.
-
JAA
There's also at least one which doesn't finish loading within a minute.
-
Pedrosso
does AB ignore something if it doesn't load within a minute?
-
project10
mod_autoindex like 😰
-
JAA
It's complicated.
-
JAA
AB expects the HTTP headers within 20 seconds and the complete response within 6 hours, but slow processing of parallel requests (such as link extraction or compressing for WARC) can break the retrieval.
-
JAA
I bet most of the dirs in there were not listed correctly on the first attempt by AB.
-
JAA
The 1 minute timeout is the default in qwarc, which I'm using for listing this more efficiently.
-
nicolas17
100MB *listings*?
-
JAA
Running into a problem, will need to restart the listing.
-
Pedrosso
I bid you good luck with this, lookin' forward to seeing just how big the listing file will be.
-
JAA
nicolas17: Yes, autoland-linux64 is that one, it contains 195k entries.
-
JAA
autoland-macosx64-debug times out on the server side after a bit over a minute with a 502.
-
JAA
Listing restarted, going more faster now.
-
JAA
(I hope there are no loops via symlinks.)
-
JAA
Oh, this time autoland-linux64 repeatedly timed out as well, yay.
-
JAA
I think I'm running into SQLite lock contention at this point. But processing 7-9k dirs per minute isn't bad.
-
Pedrosso
As for what I sent of spore forums, here are a few archive-related comments about the domains of the few that weren't directly in the domain
transfer.archivete.am/mawW8/sporeforums%20addendum.txt
-
Pedrosso
An addendum to that addendum;
gamefaqs.gamespot.com has an archive but
gamefaqs.gamespot.com/boards/926714-spore/72994456 (posted before the archive) is missing
(gamefaqs.gamespot.com/boards/926714-spore has 1 archive from ArchiveTeam though)
-
nicolas17
my modem rebooted... maybe because of telegrab at high concurrency /o\
-
JAA
I don't think we ever fully archived GameFAQs. I believe there were unsuccessful/incomplete attempts only.
-
nicolas17
(reddit is much more prone to doing that)
-
nicolas17
does wget-at use keepalive?
-
Pedrosso
Ah, I see.
-
JAA
Now doing over 10k dirs per minute. Brrrrr
-
JAA
Still going to take at least 3 hours to get through the remaining queue. lol
-
JAA
So yes, it is marginally too big for AB. :-P
-
Pedrosso
What alternatives are there then?
-
JAA
It does depend a bit on how many files there are and how large they are.
-
JAA
DPoS would be an option.
-
JAA
Or maybe it can be done with AB with a few !ao < jobs rather than one big recursive one.
-
JAA
The listings I've retrieved so far are already over 1 GiB of WARC, i.e. after compression.
-
Pedrosso
o_o
-
Pedrosso
how many would "a few" be?
-
JAA
And arkiver spoke: 'let there be an icon!', and there was an icon.
-
Pedrosso
That is how it be
-
fireonlive
and it was glorious
-
Flashfire42
And the administrators of those websites said "Did anybody hear that? Must have been the wind"
-
Flashfire42
Whenever a new project is about to start I always imagine some kind of eldritch abomination machine just slowly whirring to life. Eyes of red blink into existence and start a march towards their target
-
JAA
Pedrosso: 'A few' would be more than 'a couple' but not 'many'. :-P I don't know, it depends on the output of the listing.
-
arkiver
yep :)
-
arkiver
JAA: what is your opinion on already writing a WARC-TLS-Cipher-Suite field before it's standardised?
-
arkiver
(related to that issue on the warc specs github repo)
-
arkiver
or actually
-
arkiver
WARC-Cipher-Suite (the value starting with TLS_ already makes it clear it's for TLS)
-
fireonlive
(thank you for not calling it SSL)
-
arkiver
i'm glad i made your day fireonlive :)
-
fireonlive
:D
-
Pedrosso
JAA: oh, well it's nice that there are such convenient solutions
-
pabs
JAA: I expect archive.mozilla.org has a lot of stuff that isn't that useful to archive, like millions of test results :)
-
JAA
arkiver: Fine with me, it's not a violation of the spec to write fields that aren't specified. Might be worth leaving a comment about the intent on
iipc/warc-specifications #86 though and seeing if anyone else has concerns about that.
-
JAA
pabs: Yeah, I'm sure there are more and less useful parts to it.
-
JAA
Have you possibly seen another announcement from Mozilla themselves about it?
-
pabs
not yet, but I did just wake up :)
-
JAA
Ah yes, time zones. :-)
-
pabs
maybe it is only ex-Mozilla projects moving?
-
thuban
repeating some requests related to old.dlib.me here, since they got lost in #archivebot:
-
thuban
transfer.archivete.am/ej3GO/www.old.dlib.me-item-pdfs - a small number of items available as pdf rather than through the document viewer
-
thuban
transfer.archivete.am/y7EDo/www.old.dlib.me-item-info-byname - item info pages, as linked from the library index (extracted from post xhr--not that we can duplicate that, but it's what external links are likely to be). media items like photos and videos are included in page assets
-
thuban
transfer.archivete.am/157ht6/www.old.dlib.me-item-info-byid - item info pages, by document id (this is the only way to see metadata for some items, mostly newspapers)
-
thuban
i believe that's everything that will actually work
-
JAA
I'll run them shortly.