-
Ivan226
-
pokechu22
Ivan226: I did those earlier today in #archivebot
-
Ivan226
ah got it
-
pabs
tomodachi94: in #archivebot, `socialbot: snscrape twitter-profile foo` works a reasonable amount of time. it can gather 3200 recent tweets. the twitter-user option doesn't work at the moment, but there is a fix in snscrape git that isn't released yet
-
nicolas17
apparently archive.org is being hammered by thousands of AWS instances "downloading the OCR text from our materials"
-
JAA
Doranwen: The Dropbox link recovered, and I'm pulling a copy.
-
nicolas17
IA S3 stats show a massive drop in uploads about 8 hours ago... did our targets finally catch up or what? :P
-
fireonlive
oh does IA store stuff in AWS?
-
JAA
Of course not, but there's an S3-ish interface.
-
fireonlive
ah! i see :)
-
fireonlive
i thought that rather odd haha
-
tomodachi94
pabs: thanks!
-
tomodachi94
-
pabs
archivebot can't do individual posts, but we can snag the user
-
pabs
I just did this: socialbot: snscrape twitter-profile TheOrderofSith
-
tomodachi94
Ah okay, much appreciated.
-
Doranwen
JAA: Thanks! Glad it did :)
-
fireonlive
can grab-site be resumed in-place from another server? would like to move off this (higher priced) one at some point
-
fireonlive
'oh i'll only have this server like 4 days... "up 20 days"' where did the time gooo
-
arkiver
last two days have been chaotic for me. i may have missed important messages in channels, if you think I missed something please ping again
-
h2ibot
Tomodachi94 created Prnt.sc (+439, Create page):
wiki.archiveteam.org/?title=Prnt.sc
-
h2ibot
Manu edited Coronavirus (+159, Add some German case, vaccination (+more) data…):
wiki.archiveteam.org/?diff=49850&oldid=48266
-
imer
i've now gone through my youtube archive and picked out all videos that aren't on yt anymore (either private or missing completely) - 4.9k videos/1.3tb
-
imer
mainly looking for some guidance if/how I should submit those to IA? (the yt-dl command used mostly matches the one on the wiki except for the info-json which I didnt add at the time)
-
pabs
imer: join #down-the-tube
-
pabs
oh, you said no longer on YT, oops
-
imer
pabs: well, its in there as well now haha, probably more appropriate either way :)
-
imer
thanks
-
arkiver
JAA: do you know if we got the hl2dm.net forum?
-
arkiver
closing may 31
-
arkiver
according to deathwatch
-
pabs
-
arkiver
pabs: perfect, thank you
-
tomodachi94
Alright here's a second dump of 14452 URLs related to that group, sourced from a dump of their Discord:
transfer.archivete.am/KlnAM/urls.txt
-
tomodachi94
It's mostly images.
-
h2ibot
Entartet edited List of websites excluded from the Wayback Machine (+30, Added patrickcollison.com.):
wiki.archiveteam.org/?diff=49852&oldid=49833
-
manu|m
If I want to run a Warrior on a dedicated machine at home (headlessly via Docker), what would be reasonable specs for it?
-
JAA
manu|m: There's no general answer as it depends entirely on the project. Some projects are CPU-intensive (e.g. sitemap parsing on URLs), some projects require significant disk space (anything with videos, e.g. YouTube), some require a lot of RAM due to recursion (e.g. Enjin I think)... If you want to run multiple projects at once, consider using the project images rather than the warrior.
-
manu|m
so different projects/pipelines will pick the warriors they use based on their specs? I’d just like to make use of my internet connection when I don’t need it for myself
-
JAA
No, either the warrior runs a specific selected project, or it runs the default project that we set on the tracker side, which is the same for all warriors set to 'ArchiveTeam's choice'.
-
manu|m
oh okay, thanks
-
manu|m
i'll check out the project pages then
-
JAA
If you want a 'set it up once and forget about it' thing, you'll want the warrior set to AT's choice.
-
JAA
But a dedicated machine for that is a bit overkill.
-
manu|m
i’m not getting a second tower or a server rack for that, i just thought it might be a good idea to have it running on a machine that draws a bit less power than my desktop setup
-
manu|m
another question: once or twice a year (when there isn’t a pandemic going on) i’m attending Chaos events where there’s 4-7 days of practically unlimited bandwidth available, where it’s possible to colocate machines. would it be useful to bring a warrior (or more) there to crunch through larger projects, or would that be counter-productive?
-
nicolas17
depends on the project, sometimes the website being archived has per-IP limits so having more bandwidth doesn't actually help
-
HiccupJul
Is there a way to make archivebot login-walled content? I want to backup redump.org using archivebot, but some of the content is walled behind the requirement to submit a few discs to the site. It's not exactly public but it's not exactly private either.
-
HiccupJul
*archivebot archive login-walled content
-
pokechu22
Archivebot can't do that, no :/ (I think there might be other tools that can (e.g. grab-site installed locally) but I haven't worked with those)
-
pokechu22
... hmm, and redump.org isn't loading for me at all, that's not a good sign :|
-
HiccupJul
Yeah it goes down occasionally, the admin is unresponsive, and the admin is against public backups.
-
HiccupJul
so I think it'd be a good thing to have it in the wayback machine
-
HiccupJul
is there any tool that supports login-walled sites, that can be used to ingest data into the wayback machine?
-
pokechu22
There have been some backups in the past, but without being logged in
-
HiccupJul
yeah, its just that misses a lot of data
-
pokechu22
Looks like the last full run of redump.org was on 2022-05-10, and forum.redump.org was last run on 2023-04-29, but both of those wouldn't be logged in
-
pokechu22
For main redump.org, it'd be missing data for a few more recent systems, and also revision history, right? While forum.redump.org would be missing basically everything to my understanding
-
HiccupJul
i.e. it misses modern systems, change history, dump submission sub-forums
-
HiccupJul
hah, we said almost the same thing
-
HiccupJul
so yeah, we are on the same page
-
HiccupJul
all that stuff is pretty important to the continued operation of the site, imo
-
pokechu22
My understanding is that grab-site produces warcs, and those *can* be ingested into web.archive.org but won't necessarily be by default
-
HiccupJul
i assume wayback doesn't ingest stuff made by random people
-
HiccupJul
only things archived by archive team or IA services, or stuff from companies like alexa
-
pokechu22
My understanding is that yeah, that's roughly the case. Most outsider stuff ends up in
archive.org/details/warczone
-
pokechu22
One other aspect to consider is that if you do save login-walled content, every single page will show you being logged in
-
JAA
Data behind logins doesn't make it into the Wayback Machine in general.
-
JAA
That's a relatively hard rule with few exceptions.
-
HiccupJul
i guess i should look into using something like archivebot, and then just hosting the static pages on free hosting so they can be browsed, in addition to the WARCs
-
JAA
(And the exceptions are of historical nature, e.g. our SPUF project in 2017.)
-
JAA
That sounds reasonable. grab-site is basically like AB but local, and you can give it cookies.
-
HiccupJul
someone did run grab-site, but i think it was a pretty long process
-
HiccupJul
and hard to get working
-
HiccupJul
i should check if the output of that was okay, then maybe i can just set that up to run every week and upload to IA (and github pages/neocities for a browsable version)
-
JAA
Shouldn't be hard to get working unless there's annoying 'DDoS protection' stuff in the way or extensive use of JS, but it certainly won't be fast, yeah.
-
HiccupJul
seems like that grab-site run was incomplete
-
HiccupJul
so the issues with it apparently weren't resolved
-
JAA
Note that making the WARCs publicly accessible might allow others to hijack your account.
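A rough sketch of why that matters: WARCs record request and response headers verbatim, so session cookies sent during a logged-in crawl end up in the file. The snippet below counts credential-bearing header lines in WARC bytes. It assumes the data is already decompressed, and the function name and header list are made up for illustration, not part of grab-site or wpull.

```python
import re

# Header names that commonly carry session credentials (illustrative list, not exhaustive).
SENSITIVE = re.compile(
    rb"^(Cookie|Set-Cookie|Authorization):",
    re.IGNORECASE | re.MULTILINE,
)


def count_sensitive_headers(data: bytes) -> dict[str, int]:
    """Count credential-bearing header lines in uncompressed WARC bytes.

    Hypothetical helper: it only matches at the start of a line, so it can
    also hit body text that happens to look like a header.
    """
    counts: dict[str, int] = {}
    for match in SENSITIVE.finditer(data):
        name = match.group(1).decode("ascii").lower()
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Any non-zero count means the file should be treated like a password until the account behind it is disposable or the session has been invalidated.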
-
HiccupJul
i think i'll make a dummy account for it
-
HiccupJul
actually i believe i know someone who has one like that already
-
pokechu22
redump.org doesn't support https; I doubt it has proper DDoS protection :P
-
HiccupJul
but yeah i don't think i want to do it with my account, if only to avoid my name being plastered over it
-
pokechu22
I assume you'd be grabbing the submission history subforums but not the dumpers subforum, then?
-
HiccupJul
the account could probably get dumper access
-
HiccupJul
i have plenty of low-priority discs (e.g. already verified ps2 shovelware) that i can use
-
HiccupJul
actually, i can probably ask a moderator just to promote an account even without any disc submissions
-
HiccupJul
this command was used for the grab-site attempt:
bpa.st/6EQWS
-
HiccupJul
-
HiccupJul
took 33 hours, not too bad
-
HiccupJul
ah, this bug prevented forum attachments being saved:
ArchiveTeam/wpull #291
-
JAA
HTTP/0.9? Eww.
-
JAA
Technically, that can't go into WARC either.
-
HiccupJul
technically?
-
JAA
The spec only permits HTTP/1.1, strictly speaking.
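For illustration only: an HTTP/0.9 response has no status line and no headers, just the body, so a crude way to spot one is to check whether the raw bytes start with `HTTP/`. The function name is hypothetical and this is not how wpull handles it.

```python
def http_version_of_response(raw: bytes) -> str:
    """Return the version token from the status line.

    If the response does not begin with a status line, assume it is an
    HTTP/0.9-style body-only response (a heuristic, not a full parser).
    """
    # Take the first line, tolerating both CRLF and bare LF endings.
    first_line = raw.split(b"\r\n", 1)[0].split(b"\n", 1)[0]
    if first_line.startswith(b"HTTP/"):
        # The status line looks like "HTTP/1.1 200 OK"; keep the first token.
        return first_line.split(b" ", 1)[0].decode("ascii", errors="replace")
    return "HTTP/0.9"
```

A response like that has no header block at all, so there is nothing to record apart from the payload itself.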
-
HiccupJul
hm, as long as it works, i guess it'd be fine
-
flashfire42
HiccupJul I see you got IRC working. Yes redump sucks in terms of tech
-
HiccupJul
i think hackint may have been down for a bit, or maybe some system clock fluke that caused a certificate error
-
JAA
hackint hasn't been down in a good while, but the webchat thingy was broken for about a week recently.