-
nulldata
!tell BenFranske Looks like there's at least some on IA but none recent.
archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
-
eggdrop
[tell] ok, I'll tell BenFranske when they join next
-
fireonlive
at least nulldata is useful
-
nulldata
Huh?
-
eggdrop
[tell] BenFranske: [2024-05-02T00:12:40Z] <nulldata> Looks like there's at least some on IA but none recent.
archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
-
BenFranske
nulldata Yes, I was working on uploading everything from there and had about 7500 episodes uploaded of the 24500 episodes that they have published. Most of them have been pulled and the rest are probably going to get pulled soon I think. My account also got locked at IA just a bit ago. See my currently still available set at
-
BenFranske
archive.org/details/@benfranske (currently 870 items rather than the 7500+ that were there earlier today)
-
hook54321
does anyone know if
dumps.wikimedia.org/other/shorturls is dumped anywhere on a regular basis? it looks like they used to be dumped to IA but haven't been for a few years
archive.org/details/shorturls-20200907
-
no-n0rth
Hey folks! I was looking at the Blingee archive; I'm looking for a file that has some of the stamp SWFs - would anyone here be familiar with the project?
-
pokechu22
Hmm, I don't know too much about blingee, but based on the information on
wiki.archiveteam.org/index.php/Blingee someone here would probably be able to find it. If you have a URL then it'd be on web.archive.org; if you have something else I think that page has enough information on how to figure out the URL?
-
no-n0rth
Thanks for the link! I followed that to the internet archive backups, but the files are huge lol and so far it seems most of them just have comments and gifs. I suspect the AES key was rotated, but I might try running the scraper tomorrow if I don't find a cdx that has swf files
-
lea
say I want to archive a site that's behind a login wall. I could probably write a scraper for it. can I somehow upload the results to the web archive?
-
lea
site in question:
usdb.animux.de (hosts synced song texts for karaoke apps)
-
katia
lea, i think stuff that is behind a login never goes to the wayback machine
-
katia
but you/anyone can upload it to IA
-
katia
-
lea
katia: is there documentation on the preferred format for uploads?
-
lea
or should I just dump a zip file with all the current data of the site? what about new content? the site is still alive
-
katia
you probably want to design your scraper to be incremental then
-
lea
yes
-
lea
since these are individual files, I guess I could just upload tens of thousands of individual files to the archive?
-
katia
maybe better for #internetarchive
-
katia
IA unpacks some .tar and maybe other formats; packing/compressing it might make more sense than uploading single files
-
thuban
lea: the best format for archival purposes is warc (you can upload warcs to the internet archive like any other item even though they don't go into the wayback machine).
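(As an aside, one way to do that upload from a script is the `internetarchive` Python library; a minimal sketch, with the item identifier, file name, and metadata purely illustrative, and assuming credentials are already set up with `ia configure`:)

```python
from internetarchive import upload

# Identifier, file name, and metadata below are illustrative only.
upload(
    "example-site-warc-2024",
    files=["example-site-2024.warc.gz"],
    metadata={
        "title": "Example site crawl, 2024",
        "mediatype": "web",
    },
)
```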
-
thuban
i suggest using
github.com/ArchiveTeam/grab-site, which outputs warc and which you can configure to use your login cookies
-
lea
the page needs a JS-initiated HTTP POST to give out the data. I can also initiate it without JS. does the tool support a use case like that?
-
lea
thanks for the pointer btw
-
thuban
lea: yes, you can use --wpull-args with wpull's --post options (see
wpull.readthedocs.io/en/master/options.html) to send POST requests. that said, depending on the details this may become very inconvenient
-
thuban
(since wpull uses the same post data for _all_ requests, worst-case, you may need to scrape the site once, process the output to determine what urls and post data you need for the txt downloads, and invoke grab-site on each individually in a loop. you can combine the results with eg warcat:
github.com/chfoo/warcat)
-
thuban
(might still be quicker than writing your own scraper)
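A rough sketch of that per-URL loop in Python, assuming a tab-separated urls.txt (URL, then POST body) produced by the first pass, and that grab-site/wpull accept the --wpull-args and --post-data flags discussed above; the flag names are worth verifying against the current docs:

```python
import subprocess

# Each line of urls.txt: <url>\t<post-data>, generated by processing the
# output of an initial crawl of the site.
with open("urls.txt") as f:
    for line in f:
        url, post_data = line.rstrip("\n").split("\t", 1)
        # One grab-site job per URL, since the POST body applies to every
        # request within a job.
        subprocess.run(
            ["grab-site", url, f"--wpull-args=--post-data={post_data}"],
            check=True,
        )
```

The resulting WARCs could then be combined with warcat as suggested above.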
-
Miori
-
joepie91|m
well shit
-
katia
buttflare :|
-
katia
well not on subscene.com, just on forum?
-
katia
started an archivebot job for subscene.com
-
Miori
-
katia
nice
-
gaz
hey peeps, i'm looking for some advice or tips: i want to download absolutely everything associated with a few domains from the wayback machine (all subdomains, images, js, css, etc etc etc). my initial investigations put what i want to grab at like 30 million urls, and would take like 6 months on one machine. i'm hoping you guys have info that
-
gaz
could help :)
-
that_lurker
gaz: Easiest might be to try and search for the domain in
archive.fart.website/archivebot/viewer/?q=utu.fi and download the associated .warc.gz file
-
that_lurker
-
» that_lurker wonder how one can send the wrong link twice
-
gaz
ok i'll have a look
-
gaz
lol
-
JAA
That will only work if it's an ArchiveBot crawl, of course. You wouldn't get snapshots from other sources etc.
-
JAA
But yeah, if there is such a crawl, it's probably a good start.
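If there is no suitable ArchiveBot crawl, one fallback is enumerating captures with the Wayback CDX API and fetching each snapshot directly; a minimal sketch with an illustrative domain, and without the pagination, rate limiting, or retries a 30-million-URL job would actually need:

```python
import requests

# List captures for a domain (and all its subdomains) via the Wayback CDX API.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",       # include all subdomains
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "digest",        # skip identical re-captures
    },
    timeout=120,
)
rows = resp.json()[1:]  # first row is the field header

for timestamp, original in rows:
    # The 'id_' modifier returns the capture without Wayback's rewriting.
    print(f"https://web.archive.org/web/{timestamp}id_/{original}")
```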
-
that_lurker
yeah. forgot to mention that too.... The sudden summer heat in Finland is getting to me :-P
-
Vokun
Ah yes. With a high of just above freezing, i'd be sweating too
-
Vokun
Actually, sorry. It's too hot where I live too
-
that_lurker
These are the first days when it's starting to go over 10 C here during the day. Nights are still around 5 C and now days are somewhere close to 20 C
-
Larsenv
-
Vokun
It goes from about 12-28 here from night to day. I run a fan at night from the window while I sleep cause it doesn't cool down till really late at night, so when I wake up I'm frigid.
-
» that_lurker watches for the looming gaze of JAA as the conversation has gone offtopic and wonders whether to continue or not :P
-
JAA
:-)
-
Ryz
Heya folks, does anyone wanna help me extract subdomains of
htmlplanet.com? I found loads of them through
subdomain.center and might've found 900 of 'em and I'm planning to run 'em all in AB (can't be HTTPS curiously, it's HTTP only!)
-
Ryz
I'm...I'm trying to recall if there's an IRC channel dedicated to this <#>;
-
Ryz
I was initially going to say #webroasting - but that's specifically for ISP hosting websites
-
that_lurker
Ryz: Quick scan found 610. Most likely the same ones you already got though
transfer.archivete.am/inline/Gd4Ub/htmlplanetsubdomains.txt
-
Webuser536
If this is the chat to be talking about this, is there a way of properly using wget to download files from the Wayback Machine?
-
Ryz
that_lurker, this is from WBM CDX I assume? oo;
-
that_lurker
Got those by doing a scan with Sublist3r
-
Ryz
Hello Webuser536, please go to #internetarchive for a better chance of your question being answered
-
Ryz
that_lurker, go for a WBM CDX please if you can, there might be more subdomains there
-
that_lurker
-
Ryz
Hmm, there has to be more... :C
-
that_lurker
You could maybe do some hardcore bruteforcing, but that would take a while
-
pokechu22
-
that_lurker
oh that found a lot
-
pokechu22
yeah, and you can copy the thing at the bottom and put it into the resumeKey parameter to get more
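A rough sketch of that resumeKey loop for pulling subdomains out of the CDX API, assuming the plain-text output format where the resume key follows a blank line at the end of each page; worth double-checking against the CDX API docs:

```python
import requests
from urllib.parse import urlparse

CDX = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "htmlplanet.com",
    "matchType": "domain",        # all subdomains
    "fl": "original",
    "collapse": "urlkey",
    "showResumeKey": "true",
    "limit": "10000",
}

hosts = set()
while True:
    lines = requests.get(CDX, params=params, timeout=120).text.splitlines()
    if "" in lines:
        # A blank line separates the results from the resume key.
        cut = lines.index("")
        results, resume_key = lines[:cut], lines[cut + 1]
    else:
        results, resume_key = lines, None
    hosts.update(urlparse(u).hostname for u in results if u)
    if not resume_key:
        break
    params["resumeKey"] = resume_key

print("\n".join(sorted(h for h in hosts if h)))
```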
-
JAA
60% of the time, it works every time!
-
fireonlive
*JAA CDX api flashback horror stories*
-
Notrealname1234
"JAA" CDX api!
-
JAA
Also, little-things/ia-cdx-search is a thing. :-)
-
JAA
I guess it might work fine in this case.
-
JAA
The resumeKey-based pagination, I mean.
-
that_lurker|m
I looked at little-things/ia-cdx-search today and totally forgot about it when I needed it :-)
-
that_lurker|m
JAA thanks for making those amazing scripts available
-
Notrealname1234
Wonderful scripts
-
JAA
:-)
-
that_lurker|m
JAA++
-
eggdrop
[karma] 'JAA' now has 37 karma!
-
Guest77
Hello! What is the best way to handle '.warc' files? I have tested the 'grab-site' program a bit, but I am clueless on how to treat the .warc file as an 'extractable' file. I would like to see and select which files to extract, as one usually does with .zip and other compressed files. zless shows the raw data but it is not the best way
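One common approach is the warcio Python library, which iterates over WARC records so the captured responses can be listed and selectively written out; a small sketch, assuming `pip install warcio`, with the file name and target URL purely illustrative:

```python
from warcio.archiveiterator import ArchiveIterator

wanted = "http://example.com/page.html"      # illustrative URL to extract

with open("crawl.warc.gz", "rb") as stream:  # illustrative WARC file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        print(url)                            # list every captured URL
        if url == wanted:
            with open("page.html", "wb") as out:
                out.write(record.content_stream().read())
```

warcat (mentioned earlier) also has an extract command that unpacks a WARC's contents to files on disk, which may be closer to the zip-like workflow you describe.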