-
nulldata
!tell BenFranske Looks like there's at least some on IA but none recent.
archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
-
eggdrop
[tell] ok, I'll tell BenFranske when they join next
-
fireonlive
at least nulldata is useful
-
nulldata
Huh?
-
eggdrop
[tell] BenFranske: [2024-05-02T00:12:40Z] <nulldata> Looks like there's at least some on IA but none recent.
archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
-
BenFranske
nulldata Yes, I was working on uploading everything from there and had about 7500 episodes uploaded of the 24500 episodes that they have published. Most of them have been pulled and the rest are probably going to get pulled soon I think. My account also got locked at IA just a bit ago. See my currently still available set at
-
BenFranske
archive.org/details/@benfranske (currently 870 items rather than the 7500+ that were there earlier today)
-
hook54321
does anyone know if
dumps.wikimedia.org/other/shorturls is dumped anywhere on a regular basis? it looks like they used to be dumped to IA but haven't been for a few years
archive.org/details/shorturls-20200907
-
no-n0rth
Hey folks! I was looking at the Blingee archive; I'm looking for a file that has some of the stamp SWFs - would anyone here be familiar with the project?
-
pokechu22
Hmm, I don't know too much about blingee, but based on the information on
wiki.archiveteam.org/index.php/Blingee someone here would probably be able to find it. If you have a URL then it'd be on web.archive.org; if you have something else I think that page has enough information on how to figure out the URL?
-
no-n0rth
Thanks for the link! I followed that to the internet archive backups, but the files are huge lol and so far it seems most of them just have comments and gifs. I suspect the AES key was rotated, but I might try running the scraper tomorrow if I don't find a cdx that has swf files
-
lea
say I want to archive a site that's behind a login wall. I could probably write a scraper for it. can I somehow upload the results to the web archive?
-
lea
site in question:
usdb.animux.de (hosts synced song texts for karaoke apps)
-
katia
lea, i think stuff that is behind a login never goes to the wayback machine
-
katia
but you/anyone can upload it to IA
-
katia
-
lea
katia: is there documentation on the preferred format for uploads?
-
lea
or should I just dump a zip file with all the current data of the site? what about new content? the site is still alive
-
katia
you probably want to design your scraper to be incremental then
-
lea
yes
-
lea
since these are individual files, I guess I could just upload tens of thousands of individual files to the archive?
-
katia
maybe better for #internetarchive
-
katia
IA unpacks some .tar and maybe other formats; packing/compressing it might make more sense than uploading single files
-
thuban
lea: the best format for archival purposes is warc (you can upload warcs to the internet archive like any other item even though they don't go into the wayback machine).
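(As an aside, one way to do that upload from a script is the `internetarchive` Python library; a minimal sketch, with the item identifier, file name, and metadata purely illustrative, and assuming credentials are already set up with `ia configure`:)

```python
from internetarchive import upload

# Identifier, file name, and metadata below are illustrative only.
upload(
    "example-site-warc-2024",
    files=["example-site-2024.warc.gz"],
    metadata={
        "title": "Example site crawl, 2024",
        "mediatype": "web",
    },
)
```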
-
thuban
i suggest using
github.com/ArchiveTeam/grab-site, which outputs warc and which you can configure to use your login cookies
-
lea
the page needs a JS-initiated HTTP POST to give out the data. I can also initiate it without JS. does the tool support a use case like that?
-
lea
thanks for the pointer btw
-
thuban
lea: yes, you can use --wpull-args with wpull's --post options (see
wpull.readthedocs.io/en/master/options.html) to send POST requests. that said, depending on the details this may become very inconvenient
-
thuban
(since wpull uses the same post data for _all_ requests, worst-case, you may need to scrape the site once, process the output to determine what urls and post data you need for the txt downloads, and invoke grab-site on each individually in a loop. you can combine the results with eg warcat:
github.com/chfoo/warcat)
-
thuban
(might still be quicker than writing your own scraper)
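A rough sketch of that per-URL loop in Python, assuming a tab-separated urls.txt (URL, then POST body) produced by the first pass, and that grab-site/wpull accept the --wpull-args and --post-data flags discussed above; the flag names are worth verifying against the current docs:

```python
import subprocess

# Each line of urls.txt: <url>\t<post-data>, generated by processing the
# output of an initial crawl of the site.
with open("urls.txt") as f:
    for line in f:
        url, post_data = line.rstrip("\n").split("\t", 1)
        # One grab-site job per URL, since the POST body applies to every
        # request within a job.
        subprocess.run(
            ["grab-site", url, f"--wpull-args=--post-data={post_data}"],
            check=True,
        )
```

The resulting WARCs could then be combined with warcat as suggested above.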
-
Miori
-
joepie91|m
well shit
-
katia
buttflare :|
-
katia
well not on subscene.com, just on forum?
-
katia
started an archivebot job for subscene.com
-
Miori
-
katia
nice
-
gaz
hey peeps, i'm looking for some advice or tips: i want to download absolutely everything associated with a few domains from the wayback machine (all subdomains, images, js, css, etc etc etc). my initial investigations put what i want to grab at like 30 million urls, and would take like 6 months on one machine. i'm hoping you guys have info that
-
gaz
could help :)
-
that_lurker
gaz: Easiest might be to try and search for the domain in
archive.fart.website/archivebot/viewer/?q=utu.fi and download the associated .warc.gz file
-
that_lurker
-
» that_lurker wonder how one can send the wrong link twice
-
gaz
ok i'll have a look
-
gaz
lol
-
JAA
That will only work if it's an ArchiveBot crawl, of course. You wouldn't get snapshots from other sources etc.
-
JAA
But yeah, if there is such a crawl, it's probably a good start.
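If there is no suitable ArchiveBot crawl, one fallback is enumerating captures with the Wayback CDX API and fetching each snapshot directly; a minimal sketch with an illustrative domain, and without the pagination, rate limiting, or retries a 30-million-URL job would actually need:

```python
import requests

# List captures for a domain (and all its subdomains) via the Wayback CDX API.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",       # include all subdomains
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "digest",        # skip identical re-captures
    },
    timeout=120,
)
rows = resp.json()[1:]  # first row is the field header

for timestamp, original in rows:
    # The 'id_' modifier returns the capture without Wayback's rewriting.
    print(f"https://web.archive.org/web/{timestamp}id_/{original}")
```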
-
that_lurker
yeah. forgot to mention that too.... The sudden summer heat in Finland is getting to me :-P
-
Vokun
Ah yes. With a high of just above freezing, i'd be sweating too
-
Vokun
Actually, sorry. It's too hot where I live too
-
that_lurker
These are the first days when it's starting to go over 10 C here during the day. Nights are still around 5 C and now days are somewhere close to 20 C
-
Larsenv
-
Vokun
It goes from about 12-28 here from night to day. I run a fan at night from the window while I sleep cause it doesn't cool down till really late at night, so when I wake up I'm frigid.
-
» that_lurker watches for the looming gaze of JAA as the conversation has gone offtopic and wonders whether to continue or not :P
-
JAA
:-)
-
Ryz
Heya folks, does anyone wanna help me extract subdomains of
htmlplanet.com? I found loads of them through
subdomain.center and might've found 900 of 'em and I'm planning to run 'em all in AB (can't be HTTPS curiously, it's HTTP only!)
-
Ryz
I'm...I'm trying to recall if there's an IRC channel dedicated to this <#>;
-
Ryz
I was initially going to say #webroasting - but that's specifically for ISP hosting websites
-
that_lurker
Ryz: Quick scan found 610. Most likely the same ones you already got though
transfer.archivete.am/inline/Gd4Ub/htmlplanetsubdomains.txt
-
Webuser536
If this is the chat to be talking about this, is there a way of properly using wget to download files from the Wayback Machine?
-
Ryz
that_lurker, this is from WBM CDX I assume? oo;
-
that_lurker
Got those by doing a scan with Sublist3r
-
Ryz
Hello Webuser536, please go to #internetarchive for a better chance of your question being answered
-
Ryz
that_lurker, go for a WBM CDX please if you can, there might be more subdomains there
-
that_lurker
-
Ryz
Hmm, there has to be more... :C
-
that_lurker
You could maybe do some hardcore bruteforcing, but that would take a while
-
pokechu22
-
that_lurker
oh that found a lot
-
pokechu22
yeah, and you can copy the thing at the bottom and put it into the resumeKey parameter to get more
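A rough sketch of that resumeKey loop for pulling subdomains out of the CDX API, assuming the plain-text output format where the resume key follows a blank line at the end of each page; worth double-checking against the CDX API docs:

```python
import requests
from urllib.parse import urlparse

CDX = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "htmlplanet.com",
    "matchType": "domain",        # all subdomains
    "fl": "original",
    "collapse": "urlkey",
    "showResumeKey": "true",
    "limit": "10000",
}

hosts = set()
while True:
    lines = requests.get(CDX, params=params, timeout=120).text.splitlines()
    if "" in lines:
        # A blank line separates the results from the resume key.
        cut = lines.index("")
        results, resume_key = lines[:cut], lines[cut + 1]
    else:
        results, resume_key = lines, None
    hosts.update(urlparse(u).hostname for u in results if u)
    if not resume_key:
        break
    params["resumeKey"] = resume_key

print("\n".join(sorted(h for h in hosts if h)))
```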
-
JAA
60% of the time, it works every time!
-
fireonlive
*JAA CDX api flashback horror stories*
-
Notrealname1234
"JAA" CDX api!
-
JAA
Also, little-things/ia-cdx-search is a thing. :-)
-
JAA
I guess it might work fine in this case.
-
JAA
The resumeKey-based pagination, I mean.
-
that_lurker|m
I looked at little-things/ia-cdx-search today and totally forgot about it when I needed it :-)
-
that_lurker|m
JAA thanks for making those amazing scripts available
-
Notrealname1234
Wonderful scripts
-
JAA
:-)
-
that_lurker|m
JAA++
-
eggdrop
[karma] 'JAA' now has 37 karma!
-
Guest77
Hello! What is the best way to handle '.warc' files? I have tested the 'grab-site' program a bit, but I am clueless on how to treat the .warc file as an 'extractable' file. I would like to see and select which files to extract, as one usually does with .zip and other compressed files. zless shows the raw data but it is not the best way
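One common approach is the warcio Python library, which iterates over WARC records so the captured responses can be listed and selectively written out; a small sketch, assuming `pip install warcio`, with the file name and target URL purely illustrative:

```python
from warcio.archiveiterator import ArchiveIterator

wanted = "http://example.com/page.html"      # illustrative URL to extract

with open("crawl.warc.gz", "rb") as stream:  # illustrative WARC file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        print(url)                            # list every captured URL
        if url == wanted:
            with open("page.html", "wb") as out:
                out.write(record.content_stream().read())
```

warcat (mentioned earlier) also has an extract command that unpacks a WARC's contents to files on disk, which may be closer to the zip-like workflow you describe.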