-
JAA
→ #telegrab I guess
-
arkiver
thuban: anything you feed on #telegrab will be queued
-
arkiver
the majority of the telegram project came in over #//
-
thuban
arkiver: i see, thanks! are !p items also being queued?
-
arkiver
thuban: new posts that were not discovered yet at the time I stashed away the item lists yes
-
arkiver
also please queue anything youtube and prigozjin related to #down-the-tube - it will become playable from the Wayback Machine from there
-
thuban
i have to get going in a bit, so if someone else wants to look for youtube coverage i for one would be grateful
-
h2ibot
-
HP_Archivist
-
HP_Archivist
Gonna post this link here, maybe someone can find a way to successfully capture an entry like the one above ^ JAA mentioned the site is... not very easily navigated
-
HP_Archivist
WBM and Archive.is both capture the url, but each capture redirects to the site's main page.
-
pokechu22
When I load that once it goes to the main page, and a second time gives me a redirect to justice.fultoncountyga.gov/pajailma…h=/PAJailManager/JailingDetail.aspx saying "Only documents that have been redacted are available via public access. If an expected document does not appear, ensure that it has been redacted."
-
pokechu22
My guess is that it needs a cookie to work - save page now outlinks from another page might do it (but you'd need to find a page that doesn't require the same cookie)
-
HP_Archivist
Hmm
-
HP_Archivist
Well, that's why I asked JAA if simply crawling the whole site altogether would somehow inadvertently capture those specific ID entry pages
-
HP_Archivist
But that would be through AB, not through SPN
-
JAA
justice.fultoncountyga.gov/PAJailManager starts off with a JS-populated <form> and a link to a search form. Not sure how it could discover anything from there.
-
JAA
Er, no <form> on the homepage, just script hell.
-
fireonlive
ew :/
-
HP_Archivist
Somehow we should find a way to archive it. But a screenshot or right click, save as the html to a zip then to IA is not exactly ideal...
-
HP_Archivist
Trump is expected to be booked tomorrow and his entry will likely be online at some point tomorrow. FYI in case someone finds a way to capture the pages sans the 'script hell'
-
HP_Archivist
Oh and pokechu22: You were right about a cookie - I just clicked the link I sent a few minutes ago and it returned that same error message. Oops.
-
pokechu22
An !ao < list thing would work for setting cookies actually
-
pokechu22
-
HP_Archivist
pokechu22: Giuliani's booking jail ID info w/associated charges and fines
-
HP_Archivist
Since it's session specific you'll have to perform the search on your own. But, if you click jail records and then enter 'Giuliani' and 'Rudy' for Last and First names respectively, it should bring you to the booking page for him
-
HP_Archivist
Pretty historical. Tomorrow, even more so for Trump. But if !ao < list would work, then maybe we can try it. Idk how that would circumvent a redirect though
-
fireonlive
can someone please show governments how to configure https please
-
fireonlive
:|
-
pokechu22
ugh, I don't think !ao < list will work here - it needs the POST on justice.fultoncountyga.gov/pajailmanager/JailingSearch.aspx?ID=400 too
-
HP_Archivist
pokechu22: Yup, I figured that since SPN is more/less the same thing and, again, just redirects to the main page. Archive.is has the same behavior. Idk. It should be saved, but beyond my expertise
-
HP_Archivist
A thought: Maybe once Trump's booking goes live on the site we can throw the whole site in AB for the hell of it and see what happens?
-
pokechu22
To clarify: one redirect is caused by the lack of the cookie, but the second is by the lack of the POST. AB can work around the cookie one but not the POST one since AB doesn't do POSTs.
-
HP_Archivist
=/
-
HP_Archivist
Like I said, beyond my know-how. I at least wanted to bring it to the attention of everyone else here.
-
HP_Archivist
Alternatively, I've seen screenshots of the charges against Giuliani trending on Twitter, or X, or whatever-the-fuck. Always possible someone screenshots the .gov site entry, posts it to some other site, and we could capture that page into WBM easily.
-
HP_Archivist
Long way to get there, but it would be "captured", heh
-
fireonlive
sooooo many people are like ha-ha, i've screenshotted your deleted bad tweets sucker.. and i'm just sitting there with a tear in my eye like what about the wayback machine et al.
-
fireonlive
screenshots aren't proof :'(
-
fireonlive
oh sorry this isn't -ot
-
HP_Archivist
All good, fireonlive. Yeah, obviously, screenshots are not archival. But in this case, I don't see a way around? Like I said, I just wanted to talk about it here so others knew. Maybe someone will do something clever, heh
-
fireonlive
ye i think best case in this case
-
fireonlive
it's probably some contracted-out-lowest-bidder crap
-
HP_Archivist
I mentioned saving the hmtl locally and zipping it, then offloading to its own IA item. That would be another way, I guess
-
HP_Archivist
html*
-
qwertyasdfuiopghjkl
HP_Archivist: I haven't tried it, but according to docs.google.com/document/d/1Nsv52Mv…PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit, if you use the SPN2 API you can manually specify the cookies it should use when saving a page. Maybe that would work?
-
DigitalDragons
warcprox + manually browsing the site in chrome?
-
nicolas17
anyone awake to change the default warrior project? it's currently set to telegram, and we already went through all the queue there
-
flashfire42
Do I need to dig up more telegram links in the mean time?
-
fireonlive
feed the beat, flashfire42 :D
-
fireonlive
if you're up for it
-
fireonlive
beast
-
nicolas17
flashfire42: maybe, but being set to default project, you're unlikely to have enough to keep everyone busy anyway :P
-
fireonlive
true
-
nicolas17
JAA: ^ can you change default project?
-
h2ibot
-
h2ibot
-
DigitalDragons
maybe it would be better to mark as "Special case"?
-
DigitalDragons
cc Exorcism
-
Exorcism
DigitalDragons: but the website is ded sooo 😭
-
DigitalDragons
(most of?) the news outlets are still up though
-
DigitalDragons
hmm
-
erkinalp
-
erkinalp
-
erkinalp
after the WARC is ready, we could feed our WARC to such a script
-
JAA
nicolas17: Switched to xuite.
-
erkinalp
almost 5 days and wowturkey archival still going full blast
-
JAA
erkinalp: Sure, such things can be done, but that's not usually something we do. We just archive things, and then people can use it as they like.
-
h2ibot
JustAnotherArchivist edited NewsGrabber (+5):
wiki.archiveteam.org/?diff=50579&oldid=50578
-
h2ibot
-
h2ibot
Znak edited UC Berkeley Course Captures (-1, /* Download the videos */ Fix typo "yyt-dlp".):
wiki.archiveteam.org/?diff=50581&oldid=50570
-
h2ibot
-
invalidCards
imer: Whoops, sorry '=D thanks for the reply though
-
imer
no worries
-
fireonlive
-
fireonlive
bets on Trump's indictment
-
fireonlive
not sure if gambling is an archive target but there's one of them
-
HP_Archivist
qwertyasdfuiopghjkl: RE: No-cookies when using SPN. Wanna give that a try?
-
pokechu22
I'll give it a try
-
HP_Archivist
pokechu22: Okay cool, thanks
-
pokechu22
Doesn't look like it worked :|
-
pokechu22
but there aren't many details on how capture_cookie is supposed to work - I might have done it wrong
-
pokechu22
(I URL-encoded the whole cookie... which probably wasn't right, especially since there are multiple cookies...)
-
pokechu22
no, what I'm doing is correct; these two commands correctly captured spriters-resource.com and models-resource.com's front pages with changed settings (enabling NSFW posts for the first and enabling NSFW posts and using text mode for the second - both cookies were recognized for the second); I've censored my own IA login cookies in these:
-
pokechu22
curl -X POST -H "Accept: application/json" -d'url=spriters-resource.com/&force_get=1&…enshot=1&capture_cookie=nsfw%3Dshow' -H 'Cookie: logged-in-sig=<snip>; logged-in-user=<snip>' web.archive.org/save
-
pokechu22
curl -X POST -H "Accept: application/json" -d'url=models-resource.com/&force_get=1&ca…ie=viewmode%3Dtext%3B%20nsfw%3Dshow' -H 'Cookie: logged-in-sig=<snip>; logged-in-user=<snip>' web.archive.org/save
-
pokechu22
It might be timing sensitive too - I'll give it another try later
-
HP_Archivist
pokechu22: Hm, alright. Would a local instance of wget work on something like this or more/less the same thing?
-
pokechu22
A local instance of wget probably would work if you can specify the relevant cookies
-
pokechu22
Firefox's browser tools have a "copy as curl" function that could be a starting point (though you'd need to change it to wget parameters)
-
HP_Archivist
If that would work, it's more/less just saving the html locally (essentially). Still not archived into WBM...
-
pokechu22
Theoretically you can write a WARC with the right version of wget (or wpull), but it still wouldn't end up in WBM
-
HP_Archivist
Yeah, true. But not much different from right click, save as > zip > IA item
-
qwertyasdfuiopghjkl
-
HP_Archivist
qwertyasdfuiopghjkl: Works on my end, nice! How'd you get it?
-
pokechu22
It might have been a second attempt of mine, though I wasn't sure if that went through or not (the jobid seemed to be the same)
-
qwertyasdfuiopghjkl
from the timestamp I'd guess it's mine, unless you did it at around the same time
-
pokechu22
I did one at about the same time, but I told it to save a screenshot and there doesn't seem to be mine so it's probably yours
-
pokechu22
Anything special you did?
-
qwertyasdfuiopghjkl
I went to the page via the method mentioned above, opened the F12 menu, reloaded the page, opened the raw request headers of the request, copied the stuff after "Cookie: " from there into a text editor, find-and-replaced "=" to "%3D" and "+" to "%2B", and used that in the command after &capture_cookie=
-
qwertyasdfuiopghjkl
Example of the command I used: curl -X POST -H "Accept: application/json" -H "Authorization: LOW <redacted>:<redacted>" -d 'url=justice.fultoncountyga.gov/PAJailMa…x?JailingID=1472670&capture_cookie=<cookie>' web.archive.org/save
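[editor's sketch] The find-and-replace step described above could be scripted roughly as follows. This is a minimal sketch, not anything posted in the channel: `encode_capture_cookie` and `build_form_body` are hypothetical helper names, the cookie values are made up, and only "=" and "+" are rewritten because those are the only two replacements reported to have worked.

```python
def encode_capture_cookie(cookie_header: str) -> str:
    """Apply the two replacements described in the log: "=" and "+".

    Hypothetical helper, not part of any SPN client. ";" and spaces are
    left untouched, matching the attempt that succeeded.
    """
    return cookie_header.replace("=", "%3D").replace("+", "%2B")


def build_form_body(url: str, cookie_header: str) -> str:
    """Assemble the string that would be passed to curl's -d option."""
    return f"url={url}&capture_cookie={encode_capture_cookie(cookie_header)}"


# Made-up target and cookie values, purely for illustration.
example = build_form_body("example.org/page", "session=ab+cd; theme=dark")
print(example)
```

A cookie value containing "&" or "%" would presumably need further escaping; that case was not tested in the discussion and is not handled here.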
-
pokechu22
Hmm, I used Notepad++'s MIME tools plugin to do URL encoding, which also changed / to %2F and space to %20 and ; to %3B - I guess it must not have undone some of those
-
qwertyasdfuiopghjkl
I didn't need to url-encode ";", maybe that was the issue with your try
-
pokechu22
Hmm, actually, URL-encoding semicolon and space worked for models-resource.com, so maybe it's the slash that did it or something?
-
pokechu22
wait, no, it *didn't* URL-encode pluses - that's probably the cause. And maybe they got treated as spaces instead of pluses in that case?
-
qwertyasdfuiopghjkl
I did a bit of trial and error with saving pages of ip.wtf to figure out how the &capture_cookie= worked before trying the actual page. "+" was replaced with " ".
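[editor's sketch] The "+" turning into " " matches how form-encoded request bodies are normally decoded. The snippet below shows that behaviour with Python's standard `urllib.parse.parse_qs`; that the SPN backend decodes the body the same way is an assumption, though it is consistent with the ip.wtf observation. The cookie name and value are invented for the example.

```python
from urllib.parse import parse_qs

# A literal "+" in a form-encoded body is decoded as a space ...
raw = parse_qs("capture_cookie=token%3Dab+cd")
# ... while "%2B" survives decoding as a real plus sign.
escaped = parse_qs("capture_cookie=token%3Dab%2Bcd")

print(repr(raw["capture_cookie"][0]))
print(repr(escaped["capture_cookie"][0]))
```

This would explain why a cookie whose value contains "+" has to be sent with "%2B" for the capture to see the original value.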
-
HP_Archivist
I also tried screenshot, too, probably around the same time, heh. Not sure how it worked though?
-
qwertyasdfuiopghjkl
-
HP_Archivist
Ah, thanks. Will look now
-
HP_Archivist
Hmm - I don't know how to do any of that. But I guess this is a good guide for future issues like this. Trump's page should be up by end of today. Wanna make sure we capture that, especially.
-
HP_Archivist
Thank you, qwertyasdfuiopghjkl
-
qwertyasdfuiopghjkl
HP_Archivist: If you ping me when that one goes up i'll try to get it (if I'm not asleep). Are there any others that are already up that should be saved or was it just the one I did?
-
HP_Archivist
qwertyasdfuiopghjkl: The other parties involved in Trump's circle who were also booked yesterday, too. Their names are escaping me atm. And okay, thank you
-
HP_Archivist
-
qwertyasdfuiopghjkl
Thanks
-
HP_Archivist
No problem. Should be able to find each one by just entering First and Last name
-
HP_Archivist
Some might not have been booked yet
-
HP_Archivist
-
qwertyasdfuiopghjkl
-
qwertyasdfuiopghjkl
-
HP_Archivist
Both load on my end, nice
-
HP_Archivist
Gonna be AFK for a while. Will be on later.
-
qwertyasdfuiopghjkl
-
fireonlive
thanks qwertyasdfuiopghjkl :)
-
qwertyasdfuiopghjkl
-
thuban
i'm glad that it's usable in this situation, but i find it kind of odd that spn will let you get captures with arbitrary cookies into the wbm
-
thuban
(and that this is apparently explicitly intended to support logged-in views, judging by target_username and target_password! i would sure like to know how those are implemented)
-
fireonlive
i wonder if that's http auth
-
fireonlive
er basic auth
-
fireonlive
-
JAA
But you could just put that in the URL instead.
-
fireonlive
oh right.
-
fireonlive
curious indeed
-
thuban
right, and the docs specifically refer to "the target page's login forms", which sounds to me like they're talking about page content
-
fireonlive
ooh that's very interesting
-
fireonlive
thanks for bringing that up thuban, would be neat to know indeed
-
Darken
Is there any way to increase the concurrent items to more than 6? 6 is just too low for me (archive warrior)
-
thuban
Darken: 6 is the maximum for the warrior
-
Darken
I am aware, but is there a way to go past this amount
-
Darken
and if not why?
-
thuban
if you want to do more, you can run additional warriors, or run project containers (which go up to 20) instead: wiki.archiveteam.org/index.php/Runn…g_Archive_Team_Projects_with_Docker
-
Darken
thanks
-
Darken
what is the image address for the xuite project?
-
thuban
atdr.meo.ws/archiveteam/xuite-grab (you can get the link from the README in the source repo, linked in the infobox on the project wiki page)
-
thuban
n.b.: with xuite it's ok to go as high as you want, but when starting a new project check to see whether there's a recommended concurrency--one of the reasons the warrior is limited to 6 is that some sites implement ip bans
-
h2ibot
Switchnode edited ArchiveTeam Warrior (-28, /* How can I run tons of Warriors easily? */…):
wiki.archiveteam.org/?diff=50583&oldid=50455
-
mgrandi
Hey, I dunno if anyone has done this, but Mac GUI had a "blog post" where they said they took down all of their downloads: web.archive.org/web/20230721053511/https://macgui.com/downloads
-
mgrandi
However I randomly checked today and they are back up: macgui.com/downloads/?cat_id=53, has anyone run ArchiveBot on these files?
-
thuban
viewer says yes, most recently in mid-july
-
thuban
but it looks like that was during the 'downtime', and the previous one was from like 2015, so another go-around seems good
-
thuban
maybe just of /downloads ?