-
fireonlive
smol change: twitter2nitter/transferinliner/and karma system now ignore lines starting with !; so it won't go off if you're using a bot command (thanks project10); also 'known-bots' (h2ibot, botifico, and Aramaki) are skipped from them
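The filter described above could look roughly like this. This is a minimal sketch, not the bot's actual code: the function name `should_process` and the case-insensitive nick comparison are assumptions; only the `!` prefix rule and the three bot names come from the message.

```python
# Hypothetical sketch of the "ignore bot commands and known bots" rule.
# Bot names are taken from the message; lowercased here because the
# comparison below is case-insensitive (an assumption, not confirmed).
KNOWN_BOTS = {"h2ibot", "botifico", "aramaki"}


def should_process(nick: str, line: str) -> bool:
    """Return True if a relay/karma feature should react to this line."""
    # Skip anything said by a known bot.
    if nick.casefold() in KNOWN_BOTS:
        return False
    # Skip bot commands, i.e. lines starting with "!".
    if line.startswith("!"):
        return False
    return True
```

With this, `should_process("Sanqui", "!status")` is rejected as a bot command, while an ordinary message from the same nick passes through.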
-
JAA
Sanqui: Just a brief update, all of those linked Webzdarma jobs are 4.5 TiB, so it'll take a while to download them all, even at the 60 MB/s I'm getting from right next to IA.
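As a rough back-of-the-envelope check of "a while": assuming the 60 MB/s is sustained, that MB means 10^6 bytes, and ignoring any per-request overhead, 4.5 TiB works out to a bit under a day. The helper name `transfer_hours` is made up for this sketch.

```python
TIB = 2**40   # bytes in one tebibyte
MB = 10**6    # bytes in one (decimal) megabyte; an assumption about the unit


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


hours = transfer_hours(4.5 * TIB, 60 * MB)
print(f"{hours:.1f} hours")
```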
-
h2ibot
Arctic Circle System edited Alive... OR ARE THEY (+383, /* Endangered */ Added Kirby's Rainbow Resort):
wiki.archiveteam.org/?diff=51229&oldid=51031
-
Sanqui
Thanks JAA. Problem is the ones that had offsite, sadly not enough foresight there. In the long term we will be making and keeping our own copies
-
imer
Sanqui: would you like me to run that through the Common Crawl CDX? I have that lying around and from a quick spot check there are some matching links in there
-
Sanqui
imer: yes please, ^https?://(www.)?(uloz.to|ulozto.cz|ulozto.sk|ulozto.net|zachowajto.pl)
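A sketch of how a URL list might be filtered with that pattern. Note one deliberate difference: the dots are escaped here so that `.` only matches a literal dot, which is stricter than the pattern as typed. The function name `matching_urls` and the sample URLs are made up for illustration; how the CDX data is actually read is not covered.

```python
import re

# Same alternatives as in the message, with literal dots escaped.
PATTERN = re.compile(
    r"^https?://(www\.)?(uloz\.to|ulozto\.cz|ulozto\.sk|ulozto\.net|zachowajto\.pl)"
)


def matching_urls(urls):
    """Return only the URLs whose host starts with one of the target domains."""
    return [u for u in urls if PATTERN.match(u)]


# Made-up sample input, not real crawl data.
sample = [
    "https://uloz.to/file/abc",
    "http://www.ulozto.cz/x",
    "https://example.com/uloz.to",
]
print(matching_urls(sample))
```

Because the pattern has no boundary after the domain, a host such as `uloz.to.example.com` would also match; whether that matters depends on the data.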
-
kiska
I could try the fdns data set I have
-
imer
Sanqui: ack, will be a few days to run through it all
-
Sanqui
imer: deadline is tomorrow, so probably no need then
-
Sanqui
thanks though
-
Sanqui
maybe if it's possible to run on a subset of .cz sites
-
Sanqui
(and .sk)
-
Sanqui
it would make sense
-
imer
oh. oops
-
imer
i'll toss you over the partial results then as I get them
-
JAA
Sanqui: Sometime in the future, all AB jobs' databases should be kept, and then this wouldn't be an issue. wpull still extracts all links when running with --no-offsite-links, it just then ignores them silently, so they only appear in the DB.
-
Vokun
Can these sorts of links be put into AB? This person passed away and if possible i'd like to have these pages saved. Also, can AB grab a youtube channel? Just the pages, not videos. I already put it into downthetube
-
pokechu22
Vokun: I don't think any of those work properly in AB, no; all of those sites have strict rate-limiting and are JS-based, and AB will only get 429s
-
Vokun
rip
-
fireonlive
youtube can go to #down-the-tube as long as it's in scope
wiki.archiveteam.org/index.php/YouTube#Scope (someone dying is)
-
Vokun
I put it in. Thanks
-
fireonlive
:)
-
h2ibot
Pokechu22 edited DokuWiki (+472, mention taskrunner):
wiki.archiveteam.org/?diff=51230&oldid=51010
-
Pedrosso
the archiveteam wikipage on bluesky is very short, has anything been done about that?
-
polduran
hello everyone. I might have something for the archivebot if anyone has time to put it in the queue:
summoners-inn.de is the biggest and probably one of the oldest German League of Legends news websites, with articles going back to 2013. Today, they announced the end of Summoner's Inn after their parent company Freaks4U lost their partnership
-
polduran
to host the official German League of Legends broadcast.
-
pokechu22
polduran: I've queued it, not sure how well it'll run though as they don't seem to have a sitemap
-
pokechu22
I also queued
freaks4u.de
-
polduran
let's hope for the best^^ thank you. and yeah, good idea ^-^" maybe also the german LoL-league?
primeleague.gg not sure if there is anything interesting on there, or whether and how the situation also affects it, but the website is hosted and copyrighted by freaks4u
-
pokechu22
Alright
-
polduran
thanks again and have a nice day :D
-
sdomi
continuing on the discussion from #//; imer: what would be the best way to handle this JS mess?
-
sdomi
I can probably write a scraper that'll generate a list of URLs from these downloaders; there isn't much metadata to be saved anyways, so IMO saving just the ZIPs is a good starting point
-
sdomi
imer: hey, also, can you verify if the downloader3.html still works? I.. think I crashed it
-
masterX244
did you check with devtools how the EULA acceptance is handled?
-
sdomi
checked from two IPs and several browsers, no dice
-
sdomi
masterX244: on some of them there's no EULA at all
-
masterX244
with some luck that can be faked with some headers/constant request stuff
-
sdomi
so I'm focusing on that right now
-
masterX244
had a site once that had an ad-intercept on the first download under a session, fooled that by "wasting" it with a url-parametered URL before the real crawl started
-
masterX244
2 "wasted" requests in the WARC but better than a lost one. POST sucks for archivebot though
-
sdomi
masterX244: no, no; i'm not getting any responses anymore
-
sdomi
oh, it's back now
-
sdomi
so what I did was.. I tried a wildcard instead of the version number, just to check what would happen
-
masterX244
ahh, poking around for shortcuts
-
sdomi
and it seems that it crashed their entire API for a solid minute
-
sdomi
so. uh. we need to be careful around this one XD
-
masterX244
cockroach-infested area :(, that sucks
-
sdomi
btw, how does WARC work? I know that I can run a mitm proxy for myself, but how would I go about handing it over to IA? what are the steps/precautions/who do I need to talk to...? :p
-
nicolas17
"you don't"
-
nicolas17
you can upload WARC files to archive.org, but they won't be used by web.archive.org, because there's no way to know if they actually match the website you mirrored or if you messed with the content (accidentally or intentionally)
-
sdomi
yes, that I know
-
sdomi
i was more asking about... what steps do I take to actually get the content preserved with y'alls help?
-
fireonlive
a project/mini-project proposal let’s say :3
-
sdomi
figured out how the EULA stuff works! it's a static JS function that takes params from the current URL
-
sdomi
so this is very much possible to automate
-
sdomi
function in question:
pastebin.com/9bsxLDLu
-
imer
sdomi: sorry, stepped away for a bit, I have not the slightest idea how to do this - although I am probably not the person to ask haha
-
sdomi
imer: writing a scraper as we speak :p
-
imer
nice
-
Webuser533
could you help me find an archive of this video
youtube.com/watch?v=V3gbrP2U10A ?
-
that_lurker
#youtubearchive would be a fitting channel for that question
-
Webuser533
alright thank you !
-
sdomi
turns out that most docs URLs are completely dead already, or point to generic sites that have likely been archived for ages. i'm downloading real "data" locally right now, gonna upload as an item onto IA later ^-^