-
OrIdow6
So URLs involved in fetching gallery pages in Wysp are tied to window size, as I have spent quite a bit of time figuring out
-
OrIdow6
So playback will not work properly on those, but it should still be fine on individual images
-
spirit
someone archivebot geomaticblog.net please, thanks :) no date given at geomaticblog.net/2023/07/06/retiring-geomaticblog.net but should be a quick job so no issue
-
Barto
spirit: it's in
-
spirit
cheers!
-
Barto
spirit: github.com/jsanz/geomaticblog and jorgesanz.net is also being taken care of :)
-
spirit
Barto: <3
-
JustZac2
Hello?
-
JustZac2
I have no idea what I'm doing, I wanted to find a deleted or privated YouTube video
-
pokechu22
I'm not an expert at that, but does the video work on web.archive.org?
-
JustZac2
havent tried it yet lemme check
-
JustZac2
Yeah its not there
-
JAA
What's the video URL?
-
JustZac2
Wait a min
-
JustZac2
-
JustZac2
this is the one
-
JAA
#youtubearchive has a copy. You can ask there and wait patiently until someone has time to pull it from storage.
-
JustZac2
Thanks. how long will that take?
-
JustZac2
any estimates?
-
pokechu22
I'm pretty sure it's a manual process so it could take a few hours to a day depending on who's available (but I'm not part of that project so I don't know the details)
-
JustZac2
Oh Thanks man
-
JAA
Yep, something like that.
-
qq44|m
hello
-
qq44|m
anyone know how to get http/s proxy working with grab-site or wpull?
-
qq44|m
i have an http proxy working, but having trouble with https proxy
-
qq44|m
wpull keeps sending a bad request to the proxy. i think its a cert error, but am not sure how to debug it
-
qq44|m
anyone try this before and know how to get it working?
-
JAA
I know that HTTPS proxying is pretty broken in wpull 2.x. Not overly familiar with the changes in ludios_wpull (which is what grab-site uses), so can't comment on whether it applies there as well.
-
qq44|m
ahh that sucks, thought it must have been on wpulls end, tried a few different proxies and none worked with https
-
JAA
I think you should get a relatively clear error though, not 400 or similar.
-
JAA
'CONNECT is intentionally not supported' should appear somewhere.
-
qq44|m
I get: code 400, message Bad request version
-
JAA
Oh wait, you're trying to use an external proxy, not wpull's proxy.
-
JAA
Nevermind then.
-
qq44|m
yeah trying to use an external proxy
-
qq44|m
i have a pretty specific crawl im trying to do, and need to modify the logic of wpull to do it. thought the easy way would be to run it through a proxy and have the proxy handle that logic
-
JAA
Hmm, what kind of logic?
-
qq44|m
Im trying to download all pages from a site that were published from a specific date range
-
qq44|m
my first idea was to use igsets and see if the date is available in the url path
-
qq44|m
its not though unfortunately, so I have to crawl a few index pages, and use those pages to find pages between date ranges
-
VickoSaviour
hey guys, can someone please archive progaming.ba because it is shutting down and i want it to be archived.
-
Barto
spirit: saved
-
spirit
yaaay
-
Barto
!a progaming.ba --igset blogs,badvideos -e 'for VickoSaviour'
-
Barto
wrong place lol
-
Barto
now it's at the right place :-)
-
Barto
VickoSaviour: looks like it's login walled, cant do much
-
VickoSaviour
oh fk
-
VickoSaviour
so how is it bad?
-
Barto
no account, no data
-
VickoSaviour
welp. shit.
-
Barto
:(
-
VickoSaviour
and also who tf uses login wall...
-
thuban
VickoSaviour: if you have an account, you can save it yourself by giving your login cookies to github.com/ArchiveTeam/grab-site (or another spidering program)
-
VickoSaviour
OH YES
-
VickoSaviour
i have a acc already
-
thuban
that can't go in the wayback machine, but it's better than nothing, and you could upload it to the internet archive if you want
-
spirit
-
masterX244
qq44: sometimes a crude selfwritten program for enumerating and then crawling a url list without recursion works for pages like that
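
A minimal sketch of that enumerate-then-fetch idea, assuming the index pages mark each entry with a `<time datetime="...">` element followed by its link. That layout, the sample HTML, `DatedLinkParser`, `urls_in_range` and the example.org URLs are all hypothetical; the real site would need its own parsing.

```python
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urljoin


class DatedLinkParser(HTMLParser):
    """Collect (date, url) pairs from an assumed index page layout."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pending_date = None  # date of the entry currently being read
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and attrs.get("datetime"):
            # Keep only the YYYY-MM-DD part of the timestamp.
            try:
                self.pending_date = date.fromisoformat(attrs["datetime"][:10])
            except ValueError:
                self.pending_date = None
        elif tag == "a" and self.pending_date and attrs.get("href"):
            # Pair the first link after a <time> element with that date.
            url = urljoin(self.base_url, attrs["href"])
            self.entries.append((self.pending_date, url))
            self.pending_date = None


def urls_in_range(index_html, base_url, start, end):
    """Return the URLs whose entry date falls within [start, end]."""
    parser = DatedLinkParser(base_url)
    parser.feed(index_html)
    return [url for day, url in parser.entries if start <= day <= end]


# Made-up index page standing in for a fetched one.
SAMPLE_INDEX = """
<ul>
  <li><time datetime="2023-06-30">30 Jun</time> <a href="/post/1">One</a></li>
  <li><time datetime="2023-07-06T10:00:00">6 Jul</time> <a href="/post/2">Two</a></li>
  <li><time datetime="2023-08-01">1 Aug</time> <a href="/post/3">Three</a></li>
</ul>
"""

selected = urls_in_range(
    SAMPLE_INDEX,
    "https://example.org/archive/",
    date(2023, 7, 1),
    date(2023, 7, 31),
)
print("\n".join(selected))
```

The resulting list can be written to a text file and handed to a crawler as a plain URL list with recursion off (for example via wpull's `--input-file`), so the crawler itself needs no date logic.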
-
spirit
fixed, i think
-
thuban
update: progaming.ba is gutted already, nothing to do but contact the admins :(
-
TheTechRobo
VickoSaviour: If you do upload it to archive.org, remember that the cookies you pass it will be stored inside the WARC
-
fireonlive
also any personal data will be saved in the WARC as well such as your username if it’s returned in the pages
-
thuban
true, but moot
-
fireonlive
:)
-
fireonlive
just a note I guess to those grab siteing ao3 or something ig
-
JAA
qq44|m: I'm not sure how a proxy would help you there, unless you mangle data there (in which case I sure hope you aren't producing WARCs). I'd do it with a wpull plugin.
-
JAA
Or well, I'd really do it with my own stuff (qwarc) instead, but that's not really user-friendly especially since there's zero documentation.
-
qq44|m
JAA: I want to preserve a specific directory structure in the WARC. The proxy in this case would take the URL, do some fetches to find the relevant URL within the date range, and return that page to wpull
-
qq44|m
unless im misunderstanding how proxies work
-
qq44|m
also in some cases I want the proxy to modify the contents of the page that its sending back to wpull
-
JAA
As long as you don't write that to WARC, that's fine. WARC is supposed to be an exact reproduction of what the target server sent.
-
qq44|m
i do want to write it to the warc, but i know im misusing warc in this case
-
qq44|m
i save a lot of documentation, and in those cases I mostly care about having a usable copy of the documentation as opposed to a faithful copy of the web pages
-
JAA
Then I'd recommend at least adding a custom WARC header explaining that in detail. Not sure what I'd call that header, but probably something with an X- prefix.
-
JAA
--warc-header on wpull
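
A rough sketch of what such an invocation could look like, built in Python so the quoting is explicit. The header name `X-Content-Modified`, its wording, the WARC file name and the URL are placeholders, not an established convention; only `--warc-file` and `--warc-header` are wpull options.

```python
import shlex

# Hypothetical header: the name and text are illustrative only.
note = "X-Content-Modified: pages rewritten by a local proxy; not a faithful capture"

command = [
    "wpull",
    "https://example.org/docs/",  # placeholder URL
    "--warc-file", "docs",        # writes docs.warc.gz
    "--warc-header", note,        # extra "Name: value" field for the warcinfo record
]

# Print a copy-pasteable shell command with the header safely quoted.
print(shlex.join(command))
```

The header text is free-form, so it can say exactly which pages were altered and how.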
-
VickoSaviour
what's the progress on reddit.com website? is the content earlier than January of 2021 saved?
-
JAA
100 seconds, longer than some other people.
-
fireonlive
we should have a leaderboard at some point, JAA
-
JAA
I'd rather spend my time saving shit. :-)
-
masterX244
before the shredders reach the data
-
fireonlive
:)
-
fireonlive
it's like a conveyer belt we're trapped on, constantly running, with a meat shredder screaming at the end
-
myself
We should have a bot that makes someone pass a "welcome to IRC" quiz before voicing them...
-
fireonlive
i have seen such a long time ago lol
-
fireonlive
read the rules at <link> and enter the password hidden in the rules
-
fireonlive
but it didn't help a lot
-
fireonlive
s/rules/faq/
-
fireonlive
it was a game of find password asap, ask question already answered :3
-
fireonlive
'you do understand you might have to wait minutes or hours for this right' 'yes yes get out of my way i want to type'
-
fireonlive
:D
-
JAA
I mean, we generally want people to be able to reach us with as few barriers as possible in general.
-
fireonlive
that too
-
fireonlive
sometimes there are gems that come in, sometimes you get me :3
-
nulldata
-
nulldata
"InfluxDB Cloud shuts down in Belgium; some weren't notified before data deletion"
-
nulldata
Oof
-
qq44|m
JAA: I tried https proxy with grab-site, but get the same error as wpull 2.0.3. Do you know of any other archiving tools similar to wpull or grab-site that works with https proxies?
-
JAA
qq44|m: No idea, I don't use proxies for archival precisely because of the potential for data corruption.
-
qq44|m
wget works with proxy, do you know if there is a way to download page requisites with wget?
-
qq44|m
when ive used it in the past it only downloaded files from the first party domain, no third party files
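
For reference, wget's `--page-requisites` only follows requisites onto other hosts when `--span-hosts` is also given. A minimal sketch of such a call, assuming a proxy at `127.0.0.1:8080` and a placeholder URL (both made up here):

```python
import os
import shlex

proxy = "http://127.0.0.1:8080"  # placeholder proxy address

# wget reads its proxy settings from these environment variables.
env = dict(os.environ, http_proxy=proxy, https_proxy=proxy)

command = [
    "wget",
    "--page-requisites",  # fetch images, CSS and scripts the page needs
    "--span-hosts",       # allow requisites hosted on third-party domains
    "--warc-file=page",   # also record the traffic to page.warc.gz
    "https://example.org/docs/page.html",  # placeholder URL
]

print(shlex.join(command))
# To actually run it:
# import subprocess; subprocess.run(command, env=env, check=True)
```

Without recursion enabled, `--span-hosts` only affects the requisites of the requested page, so the crawl does not wander off onto other sites.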
-
tech234a
Came across radar.cloudflare.com/domains which has a top 1 million domains list sourced from users of Cloudflare's 1.1.1.1 DNS. Lots of other interesting information on that site including a list of known bots.
-
fireonlive
in csv format, too!
-
fireonlive
=]