-
nicolas17
we need evidence that it's good, not "I can't think why it would be bad"
-
JAA
Yep, ideally a comprehensive test suite we can also run continuously in the future on building.
-
JAA
But no such test suite for WARCs exists in general, and it's a lot of work.
-
appledash
as someone who tried to write a crawler that outputs WARC... I decided WARC is just garbage and I wrote my own format
-
fireonlive
🤨
-
fireonlive
tell us more about this format of yours
-
JAA
WARC isn't great, but it's the least terrible format out there. ARC is far worse.
-
JAA
The WARC spec has a number of issues though, and implementing it is tricky to get right.
-
appledash
My format is just a folder full of timestamped uncompressed HTTP request and response payloads, with folders named based on the request URL/path
-
appledash
It gets the job done for what I need
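
A layout like the one described above could be sketched roughly as follows. This is only an illustration, not appledash's actual code: the function names `exchange_dir` and `save_exchange`, the `.request`/`.response` suffixes, and the nanosecond timestamp file names are all assumptions made for the example.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional, Tuple
from urllib.parse import quote, urlsplit


def exchange_dir(root: Path, url: str) -> Path:
    """Map a URL to a folder: host first, then one folder per path segment."""
    parts = urlsplit(url)
    # Percent-encode every segment so it is safe as a single folder name.
    # "." and ".." are dropped because quote() leaves them untouched.
    segments = [quote(s, safe="") for s in parts.path.split("/") if s not in ("", ".", "..")]
    if parts.query:
        # Hypothetical choice: keep the query string as one extra folder level.
        segments.append(quote("?" + parts.query, safe=""))
    return root.joinpath(quote(parts.netloc, safe=""), *segments)


def save_exchange(root: Path, url: str, raw_request: bytes, raw_response: bytes,
                  timestamp: Optional[int] = None) -> Tuple[Path, Path]:
    """Write the raw bytes sent and received, uncompressed, under a timestamped name."""
    ts = time.time_ns() if timestamp is None else timestamp
    folder = exchange_dir(root, url)
    folder.mkdir(parents=True, exist_ok=True)
    request_file = folder / f"{ts}.request"
    response_file = folder / f"{ts}.response"
    request_file.write_bytes(raw_request)
    response_file.write_bytes(raw_response)
    return request_file, response_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        req, resp = save_exchange(
            Path(tmp), "http://example.com/img/1.jpg",
            b"GET /img/1.jpg HTTP/1.1\r\nHost: example.com\r\n\r\n",
            b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n",
        )
        print(resp.relative_to(tmp))
```

Because every retrieval gets its own timestamped pair of files, fetching the same URL twice does not overwrite anything, but nothing here records the server IP or deduplicates identical responses.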
-
» fireonlive blunks
-
fireonlive
what do you use it for mainly?
-
appledash
Web scraping, saving contents of web sites as I go that I might want to process later... The main idea is, say, I have an image web site I want to download. I write a script which saves all the images in a directory, but I also output that raw data in case I later find out there was some vital information in every HTML page that I forgot to download (say a description of the image, or the author, or something)
-
appledash
then I can go back and process those files and match them up with the other data I downloaded to augment it
-
fireonlive
ahh
-
JAA
Personal mirrors are fair game for anything. Use wget with link rewriting for all I care. :-P
-
JAA
For proper archival, that'd be missing some metadata. It also doesn't scale, and repeated retrievals of the same URL get fun.
-
appledash
What'd it be missing?
-
nicolas17
hm
-
nicolas17
JAA: I just thought of a tool that would be handy to have
-
nicolas17
dedup a WARC after the fact
-
JAA
HTTP headers, IP, transfer encoding (although that one's debatable) come to mind.
-
JAA
nicolas17: Yes, that was a key design part of the thing I've been working on.
-
appledash
well the request/response data include http headers :p
-
appledash
everything after the TCP socket
-
JAA
Ah, ok. 'Payload' means something specific in HTTP. :-)
-
nicolas17
afaik qwarc does deduplication between different URLs in one archival task, but if I rerun it next month, it won't deduplicate files that didn't change vs previous archival
-
JAA
Actually, RFC 9110 deprecated the word, I guess. But it was the body without encoding prior to that.
-
nicolas17
archivebot doesn't dedup anything I think?
-
appledash
It's also moderately annoying that every tool that generates warc files seems to be absurdly complicated for no reason
-
appledash
I have been a software engineer and sysadmin for 12 years and I still feel like I need a PhD to understand most of these
-
JAA
nicolas17: Both correct. In fact, qwarc only dedupes within a single process. When you spread a single archival across multiple processes or restart the process to fix the memory 'leak' (fragmentation), that also leads to duplication.
-
nicolas17
and I think wget dedups across time, but only if they have the same URL
-
JAA
appledash: warcio's interface is reasonable, but unfortunately warcio itself sucks. warcprox would allow you to use whatever HTTP client you want via MITM proxying, which is neat.
-
JAA
nicolas17: Correct, and you can also write and load CDXs. wget-at supports URL-agnostic dedupe.
-
appledash
I remember having some issue with warcprox
-
appledash
that was my first try I think
-
JAA
I'm not terribly surprised. I have no experience with it myself.
-
nicolas17
yeah so it would be nice to have a tool that can run afterwards to replace warc records with dedup pointers
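
The core of such an after-the-fact pass can be sketched without any WARC library. This is a minimal illustration under stated assumptions, not how qwarc, wget-at, or JAA's tool work: records are modelled as a hypothetical `Record` dataclass, and the output is a list of plain dicts whose keys loosely mirror the WARC revisit fields (`WARC-Refers-To`, `WARC-Refers-To-Target-URI`, `WARC-Refers-To-Date`). Real WARC payload digests are usually written in Base32; hex is used here only for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for one response record read from a WARC."""
    record_id: str
    url: str
    date: str
    payload: bytes


def dedupe(records: List[Record]) -> List[dict]:
    """Keep the first record per payload digest; turn later ones into pointers."""
    first_seen: Dict[str, Record] = {}
    output = []
    for rec in records:
        digest = "sha1:" + hashlib.sha1(rec.payload).hexdigest()
        original = first_seen.get(digest)
        if original is None:
            # First time this payload appears: keep the full record.
            first_seen[digest] = rec
            output.append({"type": "response", "id": rec.record_id, "url": rec.url,
                           "date": rec.date, "digest": digest, "payload": rec.payload})
        else:
            # Same payload seen before, possibly under a different URL:
            # store only a pointer back to the first copy.
            output.append({"type": "revisit", "id": rec.record_id, "url": rec.url,
                           "date": rec.date, "digest": digest,
                           "refers_to": original.record_id,
                           "refers_to_target_uri": original.url,
                           "refers_to_date": original.date})
    return output
```

Keying on the digest alone makes this URL-agnostic; keying on `(url, digest)` instead would give the same-URL-only behaviour mentioned for wget.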
-
JAA
Yeah, soon™. :-P
-
nicolas17
do you know anything about the HAR format?
-
nicolas17
browser dev tools can export requests to HAR and it *might* be complete enough to be convertible to WARC but I'm not sure yet
-
JAA
It isn't.
-
JAA
It doesn't preserve the headers verbatim, and it doesn't preserve transfer encoding.
-
TheTechRobo
appledash: do you know what issue you were having?
-
fireonlive
i was using archivebox then JAA went and rained on my parade
-
fireonlive
😢
-
fireonlive
⛈️
-
TheTechRobo
pywarc when
-
fireonlive
(rightly)
-
appledash
I do not remember :(
-
appledash
It was a while ago
-
h2ibot
PaulWise edited Mailman2 (+806, add more mailman2 instances, corpit.ru done):
wiki.archiveteam.org/?diff=50587&oldid=50487
-
h2ibot
PaulWise edited Bugzilla (+53, add more bugzilla instances):
wiki.archiveteam.org/?diff=50588&oldid=50488
-
h2ibot
PaulWise edited Mailman2 (+67, started some jobs, one instance already gone):
wiki.archiveteam.org/?diff=50589&oldid=50587
-
erkinalp
about 5.5 days and wowturkey archival still going full blast
-
erkinalp
in wowturkey archivals, you might have seen "DNS resolution failed: [Errno -2] Name or service not known
reklam_link.com/d/news/433509.jpg"
-
erkinalp
those are the links wowturkey censors
-
erkinalp
they replace censored hostnames by reklam_link
-
khaosfox
Okay, hi. On the chance that I am right here: I have around 20 or 40TB of archived YouTube channels I'd like to put up on IA. However, the videos are sorted in subfolders by playlist name and I'd like to keep it that way when uploading. I know the web uploader supports folder creation, but I want to use the CLI on a headless server, and I can't find any way to do this with the ia cli utility. And putting hundreds of video files all in one root is extremely stupid. Is there any way to achieve this outcome?
-
qyxojzh|m
No way to cd your way through it?
-
qyxojzh|m
(Never used the IA CLI, sorry)
-
khaosfox
I specifically need a cli solution. On the chance that I overlooked something or there is another tool or script I'm asking.
-
qyxojzh|m
<qyxojzh|m> "No way to cd your way through..." <- Is this not possible?
-
qyxojzh|m
i.e. navigate to or create different directories and then upload to those
-
khaosfox
I see no way to do this with the ia cli utility.
-
qyxojzh|m
How odd
-
khaosfox
It lets me specify an identifier and that's it
-
qyxojzh|m
Annoying ngl
-
qyxojzh|m
Would be handy to “directorize” the archive or at least allow uploading directorized archives to it
-
khaosfox
Yes, but I hope anyone here has any good idea how to maybe handle this
-
TheTechRobo
you can do directories
-
TheTechRobo
say you have a folder named `a`, then you put a file in it
-
TheTechRobo
you can do `ia upload <IDENTIFIER> a` and it will upload all files and subdirectories in `a` to the item
-
TheTechRobo
be sure you're not using a trailing slash, or it will upload everything to the root!
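
The rule as described here can be modelled in a few lines. This does not call `ia` or the `internetarchive` library and is only a sketch of the naming behaviour reported above; the function `planned_names` and the demo tree are hypothetical.

```python
import os
import tempfile
from typing import List, Tuple


def planned_names(path_arg: str) -> List[str]:
    """Names the files under path_arg would get in the item, per the rule above."""
    keep_top_dir = not path_arg.endswith("/")
    root = path_arg.rstrip("/")
    # Without a trailing slash, names are relative to the parent, so the
    # top-level folder itself is kept; with one, they are relative to the folder.
    base = os.path.dirname(root) if keep_top_dir else root
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            names.append(os.path.relpath(full, base or ".").replace(os.sep, "/"))
    return sorted(names)


def demo() -> Tuple[List[str], List[str]]:
    """Build a/playlist1/video.mp4 in a temp dir and list both variants."""
    with tempfile.TemporaryDirectory() as tmp:
        top = os.path.join(tmp, "a")
        os.makedirs(os.path.join(top, "playlist1"))
        open(os.path.join(top, "playlist1", "video.mp4"), "w").close()
        return planned_names(top), planned_names(top + "/")
```

In this model, passing `a` yields `a/playlist1/video.mp4`, while passing `a/` yields `playlist1/video.mp4`; subfolders are preserved either way, only the top-level folder is dropped.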
-
qyxojzh|m
Perfect!
-
qyxojzh|m
TheTechRobo: How?
-
qyxojzh|m
Oh so
-
qyxojzh|m
`a/b` uploads folder `b` and therefore its contents
-
qyxojzh|m
`a/b/` uploads contents of `b` without the folder
-
TheTechRobo
yes
-
TheTechRobo
don't ask me why
-
qyxojzh|m
TheTechRobo: Nah it makes sense tbh
-
qyxojzh|m
`a/b` = target `b`
-
qyxojzh|m
`a/b/` = target `b/*` (but not `b`)
-
HP_Archivist
-
kaz
403 on that from uk
-
HP_Archivist
kaz: Works on my end
-
kaz
are you in the uk
-
HP_Archivist
No, US
-
kaz
ok then
-
TheTechRobo
Works here from Canada
-
HP_Archivist
WBM excludes the site though, for some reason
-
TheTechRobo
I've seen them exclude certain patterns
-
qyxojzh|m
EU legislation issues, methinks
-
qyxojzh|m
Try VPNing?
-
erkinalp
if you see anything saying "reklam_link" in wowturkey archives, those are censored links
-
erkinalp
wowturkey censors links to certain sites by replacing their hostname by "reklam_link"
-
qyxojzh|m
Reklam = ad
-
qyxojzh|m
from French réclame
-
qyxojzh|m
is that right?
-
erkinalp
yes
-
erkinalp
ad_link :)
-
qyxojzh|m
So yeah, makes sense
-
erkinalp
status, t.me and a few more are amongst the censored ones
-
qyxojzh|m
Makes sense tbh
-
qyxojzh|m
At the same time it opens up some questionable stuff
-
erkinalp
hmm, what if i open a website called reklam_link.com *
-
erkinalp
i'd make tons of ad revenue tbh
-
erkinalp
and it's not only me who thought of doing this
-
qyxojzh|m
Might be taken, doğru mu? (= right?)
-
erkinalp
no
-
nstrom|m
Underscore isn't valid in a domain name
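
For reference, the usual hostname rule (letters, digits, and hyphens only, no leading or trailing hyphen, 1 to 63 characters per label, at most 253 overall) can be checked like this. It is a minimal sketch; `is_valid_hostname` is a name made up for the example, and it deliberately ignores internationalized names. Underscores do occur in some DNS record names such as `_dmarc`, but not in hostnames.

```python
import re

# One label: letters, digits, hyphens; must not start or end with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Check a name against the letters-digits-hyphen hostname rule."""
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot (fully qualified form) is allowed
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("reklam_link.com", "reklam-link.com"):
        print(candidate, is_valid_hostname(candidate))
```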
-
erkinalp
no such domain registered
-
qyxojzh|m
Ah so the underscore is the key
-
erkinalp
ah
-
qyxojzh|m
No way to register it either
-
erkinalp
hmm
-
qyxojzh|m
Ne yazık (= what a pity)
-
erkinalp
register reklam-link.com and rewrite all reklam_link.com to reklam-link.com client side
-
erkinalp
:joy:
-
qyxojzh|m
maybe MITM /j
-
JAA
transfer is dead due to an incident at Scaleway.
-
fireonlive
JAA: hoping it’s not a SBG2 :/
-
HCross
it's a "our blade chassis is dead" I think
-
fireonlive
ahh
-
qyxojzh|m
Currently working out if I may invite my darling Aroy, she made an archival tool I think would be greatly useful here
-
fireonlive
i read that as tracker at first and was much more concerned
-
fireonlive
😅
-
HP_Archivist
-
HP_Archivist
WBM doesn't like this link
-
khaosfox
Okay, that works. Thanks!
-
HP_Archivist
-
HP_Archivist
-
HP_Archivist
-
HP_Archivist
qwertyasdfuiopghjkl: Mind taking a look at this? ^
-
qwertyasdfuiopghjkl
-
JAA
fireonlive: A lot of stuff depends on transfer since that's where the zstd dicts are stored, so eventually it would still stall everything.
-
fireonlive
indeed
-
JAA
HP_Archivist: Yeah, the i.redd.it URL is it, but if you just access that directly, you won't get the image. They started doing that bullshit quite recently, like in the last few months.
-
HP_Archivist
Thanks qwertyasdfuiopghjkl - It still redirects in the browser. And JAA, yeah, I've never had a problem capturing Reddit images from posts before now. What nonsense.
-
fireonlive
last i checked curl on i.redd.it got the full image but what a pain
-
HP_Archivist
-
HP_Archivist
What's odd is that when I crawled this actual post last night in SPN, it captured the page but not the image (which is kinda the point of the crawl)
-
HP_Archivist
Archive.is captured the page and image just fine though
-
qwertyasdfuiopghjkl
Maybe you can try saving a (different) page that embeds it as an image, but idk if that would work
-
khaosfox
quit
-
khaosfox
sorry wrong terminal window
-
Rynav
Hi, would it be against the TOS, or perhaps the law, to download and host some or maybe all picosong files?
-
Rynav
Currently working on an app that allows users to filter through picosong entries, get details, preview, and download the file. But downloading and previewing from archive.org itself is extremely slow and sometimes doesn't work at all.
-
Rynav
Thinking of downloading some files and hosting them on my server
-
JAA
Rynav: Obviously, virtually all of it is copyrighted content. Whether the artists/copyright holders will care is not a question we can answer.
-
Rynav
JAA: Well yeah, you are right, I wonder why I hadn't figured it out. Thank you!!
-
AntoninDelFabbro|m
I've tried to get a list of URLs for Orange website with wget, but (oh surprise) I got a 403 from Google and failed on
annuaire-pp.orange.fr
-
AntoninDelFabbro|m
Is there a repository where I can already paste some URLs?
-
pokechu22
If you've got a file you want to share you can upload it to
transfer.archivete.am
-
imer
well, you cant, since transfer is currently offline, but that would be the usual place
-
AntoninDelFabbro|m
😆 alright, nice thank you!
-
fireonlive
i’d say bpa.st but the spam filters are a “big oof”. you can use paste.debian.net in the meantime though if you’d like to dump and run
-
fireonlive
but if you’re around a bit i’d just wait for the transfer
-
JAA
transfer is back.
-
AntoninDelFabbro|m
Nice! Okay, quick question: I have to get URLs from a website (that uses JS…). Which tool would you point me to? wget?
-
erkinalp
wowturkey archival still going strong
-
appledash
Hmm, I have an FTP server which seems to be telling me to connect to a LAN IP address whenever I initiate a transfer from it. What'd be the best way to transfer data from it? I'm going to make the assumption that if I just connect to the FTP server's WAN address instead of the LAN address it gives me, it'll work. But is there any way to tell wget to ignore the address the FTP server tells me and use a given one?
-
appledash
The control connection works fine, it just fails to open the data connection
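
The idea of ignoring the advertised address can be sketched with Python's standard library rather than wget. `ftplib.parse227` and `ipaddress` are real stdlib APIs; the function `data_endpoint`, the sample reply, and the addresses are made up for illustration, and this only decides where to connect, it does not open the data connection or handle proxying.

```python
import ipaddress
from ftplib import parse227
from typing import Tuple


def data_endpoint(pasv_reply: str, control_host: str) -> Tuple[str, int]:
    """Pick the address for the passive-mode data connection.

    If the server advertises a private (LAN) address in its 227 reply,
    fall back to the host the control connection already reached.
    """
    advertised_host, port = parse227(pasv_reply)
    if ipaddress.ip_address(advertised_host).is_private:
        return control_host, port
    return advertised_host, port


if __name__ == "__main__":
    reply = "227 Entering Passive Mode (192,168,1,10,195,80)"
    print(data_endpoint(reply, "203.0.113.5"))
```

This only helps if the server's WAN address actually forwards the passive port range to the FTP server, which is the assumption stated above.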
-
pokechu22
Maybe active mode would work, where the FTP server opens a connection to your machine? (That's the older mode so it should be fairly well supported)
-
appledash
I was thinking about that, but there's a catch to it... The FTP server is Russian, and something between me (Canada) and the FTP server is blocking my connection, so I have to proxy through a Russian VPS
-
appledash
I would need to forward the active mode through the VPS as well I guess
-
pokechu22
Ah, then yeah, you'd need to do something special to trick that :|
-
h2ibot
Vokunal edited Frequently Asked Questions (+0):
wiki.archiveteam.org/?diff=50590&oldid=50586
-
h2ibot
Cooljeanius edited Twitter (+56, /* External links */ add relevant GitHub repo):
wiki.archiveteam.org/?diff=50591&oldid=50555