-
nicolas17
JAA: is it possible to set a cookie/header in archivebot?
-
JAA
nicolas17: No
-
nicolas17
dammit
-
nicolas17
<a id="logo" class="logo" href="./index.php?sid=34673d68f5d91e04536838e3b7a446a6">
-
nicolas17
if there's no session ID cookie in the request, the response adds a sid= to all links
-
JAA
Yeah, we can add a dummy request to foro.argenteam.net/?archiveteam or similar at the start to make it set the cookie before fetching the pages of interest.
-
nicolas17
oh it does preserve received cookies between requests?
-
JAA
Yes,
-
JAA
s/,//
-
JAA
Setting a value manually probably wouldn't work anyway for the session ID.
-
nicolas17
yeah it might be per IP?
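
A minimal sketch of the priming trick discussed above, assuming nothing about ArchiveBot's internals: a client that keeps received cookies makes one throwaway request first, so later pages come back without `sid=` in their links. The `FakeForum` handler is a hypothetical local stand-in that only mimics the behaviour nicolas17 describes; the cookie name `sid` and the value `abc123` are made up for illustration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeForum(BaseHTTPRequestHandler):
    """Local stand-in: adds sid= to links unless the request carries a sid cookie."""

    def do_GET(self):
        has_cookie = "sid=" in self.headers.get("Cookie", "")
        link = "./index.php" if has_cookie else "./index.php?sid=abc123"
        body = f'<a id="logo" href="{link}">home</a>'.encode()
        self.send_response(200)
        if not has_cookie:
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_priming(base_url):
    """Hit a dummy URL first so the jar holds a session cookie, then fetch the page."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any system proxy for the local demo
        urllib.request.HTTPCookieProcessor(jar),
    )
    primed = opener.open(base_url + "/?archiveteam").read().decode()
    page = opener.open(base_url + "/index.php").read().decode()
    return primed, page


def run_demo():
    """Start the stand-in server on a free port and run both requests against it."""
    server = HTTPServer(("127.0.0.1", 0), FakeForum)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_with_priming(f"http://127.0.0.1:{server.server_port}")
    finally:
        server.shutdown()
        server.server_close()
```

In this sketch only the first (dummy) response contains `sid=` links; the second request sends the cookie back, so its links are clean.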
-
nicolas17
lmk if you do the !ao< so I add the job ID to the wiki
-
JAA
Already submitted.
-
h2ibot
Nicolas17v2 edited ARGENTeaM (+61, Add forum AB job):
wiki.archiveteam.org/?diff=51285&oldid=51284
-
fireonlive
v2 on the wiki eh
-
fireonlive
oh nvm
-
JAA
What was your request rate for the scraping?
-
fireonlive
User:... tells all
-
nicolas17
JAA: I used 10 parallel wgets, I just timed it and it did 22 req/s
-
nicolas17
I see the queue is barely going down, probably collecting lots of page requisites :P
-
JAA
Not bad. :-)
-
pabs
Barto Nulo|m - #gitgud is mostly about JAA saving git bundles to archive.org (and I think also web issues/PRs to the WBM), but I also lurk there and am doing SWH SCN requests for GitHub users, the SWH SCN API is open to everyone else though, but with rate limits
-
h2ibot
FireonLive edited Telegram (-803, remove notable channels section.…):
wiki.archiveteam.org/?diff=51286&oldid=51176
-
that_lurker
fireonlive: Welp that's better than what I did :P
-
fireonlive
did you make an edit? xP
-
that_lurker
Yep. I just wanted to clarify the section somewhat more, but just deleting it was also on my mind :P
-
fireonlive
ah :3
-
nicolas17
looks like there's lots of argenteam.net/resources/covers/*.jpg images on forum posts, which means they will be duplicated with the previous job that grabbed the webpages
-
nicolas17
hmph, damn dupes
-
JAA
nicolas17: And it's done. That was faster than expected.
-
nicolas17
JAA: it looks like there's a similarly-sized number of URLs corresponding to forum threads in login-required forums :/
-
nicolas17
idk what's up with that
-
nicolas17
why make some forums available anonymously, and others require login but *any* login is fine?
-
Nulo|m
<pabs> "Barto Nulo - #gitgud is mostly..." <- i have no clue what SWH SCN means
-
nicolas17
"doing a Save Code Now request in Software Heritage"
-
pabs
yep
-
Nulo|m
okay, i believe you already picked them from #gitgud no? otherwise pastebin.com/bDAqGRsn / pastebin.com/6vk81xnE
-
pabs
correct
-
pabs
and also the gitlabs JAA mentioned on #codearchiver
-
Nulo|m
i'm doing my own overengineered archiving effort for all datasets in argentinian gov CKANs and related
-
Nulo|m
there's more, these are the CKAN ones
-
nicolas17
yes so maybe we should add them :)
-
Nulo|m
it's ~150 GB uncompressed and ~85 GB gzipped
-
JAA
Oh dear
-
JAA
I've thrown a couple of those into ArchiveBot, but...
-
Nulo|m
i am downloading them myself and i will distribute them later (and should probably upload to archive.org) so i'm not personally concerned
-
JAA
I'm assuming you're just downloading plain files though, not as WARCs.
-
JAA
So no integration into the Wayback Machine.
-
Nulo|m
yes indeed
-
Nulo|m
i made my own fancy javascript frontend to navigate it
-
JAA
But yeah, at least there'll be a copy. :-)
-
JAA
datos.gob.ar is either horribly overloaded or hosted on a potato.
-
nicolas17
for all I know CKAN might involve POST requests :P
-
nicolas17
(I didn't check)
-
JAA
It *seems* to work reasonably well in AB, apart from the expected filter faceting.
-
Nulo|m
i don't think so,
-
Nulo|m
my method is that most of these ones have a /data.json path and those have downloadURLs, it's all GETs. the API is also all GETs (for the ones that don't have the data.json plugin)
-
JAA
I was just about to ask what you're fetching exactly. Sounds good then.
-
Nulo|m
the data.json (that has most metadata) (or i scrape the api and generate one with less data from that) and all the downloadURLs
-
JAA
AB doesn't seem to find /data.json. Not that I'm terribly surprised.
-
JAA
But there probably shouldn't be a difference to what's listed on the web pages.
-
JAA
Just more efficient to fetch everything that way.
-
Nulo|m
yep
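
A rough sketch of the data.json approach described above. Nulo's actual code is not shown in the log, so this assumes the catalog uses the common `dataset` → `distribution` → `downloadURL` layout; the function name and the sample catalog are hypothetical.

```python
import json


def download_urls(catalog):
    """Collect every downloadURL from a parsed data.json catalog."""
    urls = []
    for dataset in catalog.get("dataset", []):
        for dist in dataset.get("distribution", []):
            url = dist.get("downloadURL")
            if url:  # some distributions may only carry a landing page
                urls.append(url)
    return urls


# Made-up sample in the assumed layout; not taken from any real portal.
SAMPLE = json.loads("""
{
  "dataset": [
    {"title": "example-a", "distribution": [
      {"downloadURL": "https://example.org/a.csv"},
      {"accessURL": "https://example.org/a-landing"}
    ]},
    {"title": "example-b", "distribution": [
      {"downloadURL": "https://example.org/b.pdf"}
    ]}
  ]
}
""")

print(download_urls(SAMPLE))
```

Since everything is a plain GET, the resulting list could be handed to any downloader.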
-
Nulo|m
i'll be uploading my new dump with 147GB as i added a few big ones (they didn't have the data.json plugin so i had to write the glue code), does dropping a torrent here make sense?
-
Nulo|m
i'm inexperienced with archive.org, can i just.. upload ~90GB? or do i need permission
-
Nulo|m
is youtube.com/@AudiovisualTelam getting archived? i don't see it in the wiki page
-
nicolas17
is it multiple files?
-
nicolas17
uploading 90GB as a single file has a big risk of being interrupted
-
Nulo|m
they are multiple files of sizes varying from very small to a few GB
-
nicolas17
uploading 90GB as a single item with multiple files is probably a bad idea anyway, you should probably do one per server or something?
-
Nulo|m
well, i've been taking the nuclear approach and automating downloading and dumping all together because they are so many. but yes, i can upload them separately, can that be scripted?
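
One way the "one item per server" idea could be scripted, as a sketch only. It assumes the dump has one top-level directory per server and uses the third-party `internetarchive` Python package (which needs installing and configured credentials). The identifier `prefix` and both function names are placeholders, not anything agreed in the channel.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_server(paths):
    """Map the first path component (assumed: one directory per server) to its files."""
    groups = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        if len(parts) >= 2:  # skip stray files that sit outside any server directory
            groups[parts[0]].append(p)
    return dict(groups)


def upload_groups(groups, prefix="example-dump-"):
    """Upload each group as its own item; `prefix` is a hypothetical placeholder."""
    from internetarchive import upload  # third-party, imported lazily

    for server, files in groups.items():
        identifier = prefix + server.strip("_").replace(".", "-")
        upload(identifier, files=files, metadata={"mediatype": "data"})


groups = group_by_server([
    "datos.example_/a.csv",
    "datos.example_/sub/b.pdf",
    "otro.example_/c.json",
    "stray.txt",
])
print(sorted(groups))
```

Only `group_by_server` runs here; `upload_groups` is left uncalled because it would perform real uploads.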
-
JAA
(Same thing, to avoid confusion)
-
h2ibot
JacksonChen666 edited URLTeam (+367, 2137.pl URL shortener):
wiki.archiveteam.org/?diff=51287&oldid=51193
-
h2ibot
Flama12333 edited BetaArchive (+97, same of ftp access):
wiki.archiveteam.org/?diff=51288&oldid=38443
-
h2ibot
Taka edited Deathwatch (+181, /* 2024 */ Added Wezzy):
wiki.archiveteam.org/?diff=51289&oldid=51267
-
h2ibot
Usernam edited Deathwatch (+128, /* 2024 */):
wiki.archiveteam.org/?diff=51290&oldid=51289
-
h2ibot
Inti83 edited Argentina (+455, /* Memory and Human Rights */ add some new sites):
wiki.archiveteam.org/?diff=51291&oldid=51278
-
h2ibot
Rad XXXack edited Blip.tv (-5, searchable index url fixed):
wiki.archiveteam.org/?diff=51292&oldid=47802
-
JAA
If anyone understands what that BetaArchive edit is trying to say exactly, please edit.
-
h2ibot
JustAnotherArchivist changed the user rights of User:Usernam
-
h2ibot
JustAnotherArchivist changed the user rights of User:JacksonChen666
-
pokechu22
JAA: I'm guessing they're saying something like the FTP interface was shut down and replaced with something else with the same data on April 23, but I'm not sure
-
JAA
Yeah, also my guess, but couldn't find anything on the website to confirm it.
-
project10
could I trouble you for wiki edit rights on user:Project10 while you're in there, JAA?
-
h2ibot
JustAnotherArchivist edited BetaArchive (+149, Rephrase FTP shutdown and add reference; update…):
wiki.archiveteam.org/?diff=51293&oldid=51288
-
h2ibot
JustAnotherArchivist changed the user rights of User:Project10
-
h2ibot
JustAnotherArchivist changed the user rights of User:That lurker
-
project10
thanks :)
-
Nulo|m
off to a bad start re: uploading my dumps to IA: error uploading /datos.csjn.gov.ar_/[..]/creditos-y-ejecucion-2019-mensual-saf-335---febrero-2020.pdf.gz: Uploaded content is unacceptable. - error checking pdf file
-
Nulo|m
i have all files gzipped (irrespective of type) to save space in the torrent and in the server
-
nicolas17
yeah at least for PDFs you probably shouldn't...
-
nicolas17
IA will let you search text in PDFs, even if they are image PDFs (eg. scanned) by doing OCR
-
nicolas17
if gzipped that won't work
-
Vokun
If a post on the wiki is quoting something, should a date in the quote still be swapped to the new date format, or should it be left alone?
-
Nulo|m
<nicolas17> "if gzipped that won't work" <- why doesn't it just disable that functionality :(
-
nicolas17
oh I don't know *why* it doesn't let you upload
-
h2ibot
Vokunal edited Kephost.com (+31, /* Closure */):
wiki.archiveteam.org/?diff=51294&oldid=47946
-
Nulo|m
i'll try with the uncompressed copy i guess
-
nicolas17
I'm saying even if it let you upload .pdf.gz, it seems like a good idea to not gzip PDFs
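
For a dump that is already gzipped, a small sketch of undoing that for PDFs before uploading. It assumes files end in `.pdf.gz` as in the error message above; `gunzip_tree` is a made-up helper name, and it writes the decompressed copy next to the original without deleting anything.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(root, suffix=".pdf.gz"):
    """Decompress every file under root ending in suffix, next to the original."""
    restored = []
    for gz_path in sorted(Path(root).rglob("*" + suffix)):
        out_path = gz_path.with_suffix("")  # drops only the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        restored.append(out_path)
    return restored
```

Calling `gunzip_tree("some/dump/dir")` returns the list of restored `.pdf` paths.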
-
Nulo|m
yeah it's not great, it just was a lazy decision i made
-
Nulo|m
thank you all! going to sleep
-
h2ibot
Vokunal edited Ghostbin (+52, updated to new date system):
wiki.archiveteam.org/?diff=51295&oldid=46969
-
h2ibot
Vokunal edited Early projects (+26, updated to new date format):
wiki.archiveteam.org/?diff=51296&oldid=27674
-
h2ibot
Vokunal edited Nin.com (+5, /* remix.nin.com */ new date format):
wiki.archiveteam.org/?diff=51297&oldid=29051
-
h2ibot
Vokunal edited Nin.com (+12, /* phm.nin.com */ new date format):
wiki.archiveteam.org/?diff=51298&oldid=51297
-
h2ibot
Vokunal edited Elections/2019 Swiss federal election/Candidates/Luzern (+552, /* Nationalrat */ New date format):
wiki.archiveteam.org/?diff=51299&oldid=43129
-
h2ibot
Vokunal edited Sploder (+35, new url format):
wiki.archiveteam.org/?diff=51300&oldid=48899
-
h2ibot
Vokunal edited Google Business Sitebuilder (+9, New date format):
wiki.archiveteam.org/?diff=51301&oldid=47446
-
h2ibot
Vokunal edited Google Business Sitebuilder (+23, new date format):
wiki.archiveteam.org/?diff=51302&oldid=51301
-
h2ibot
Vokunal edited CVE References (+11, date format):
wiki.archiveteam.org/?diff=51303&oldid=49862
-
h2ibot
Vokunal edited Magazines and journals (+26, date format):
wiki.archiveteam.org/?diff=51304&oldid=48762
-
h2ibot
Vokunal edited MediaFire (+16, date format):
wiki.archiveteam.org/?diff=51305&oldid=50263
-
h2ibot
Vokunal edited Gna!/code and downloads (+13, date format):
wiki.archiveteam.org/?diff=51306&oldid=29362
-
h2ibot
Vokunal edited Gna!/code and downloads (+34, /* Sizes by project */ date format):
wiki.archiveteam.org/?diff=51307&oldid=51306
-
h2ibot
Vokunal edited Bitbucket (+61, date format):
wiki.archiveteam.org/?diff=51312&oldid=47800
-
h2ibot
Vokunal edited Xanga (+8, /* Xanga 2.0 */ date format):
wiki.archiveteam.org/?diff=51313&oldid=47740
-
h2ibot
Vokunal edited TalkTalk (+78, /* Personal Webspace Closure */ date format):
wiki.archiveteam.org/?diff=51315&oldid=47544
-
h2ibot
Vokunal edited 2019 Swiss women's strike (+11, date format):
wiki.archiveteam.org/?diff=51316&oldid=43131
-
h2ibot
Vokunal edited Mininova (+13, date format):
wiki.archiveteam.org/?diff=51318&oldid=41223
-
h2ibot
Vokunal edited Google Video (+102, date format):
wiki.archiveteam.org/?diff=51319&oldid=38184
-
h2ibot
Vokunal edited Mastodon/Outdated instances/2019-03-24 (+26, date format):
wiki.archiveteam.org/?diff=51320&oldid=35987
-
h2ibot
Vokunal edited Turkey Media Crackdown (+13, /* Endangered Newspapers: */ date format):
wiki.archiveteam.org/?diff=51321&oldid=30123
-
h2ibot
Vokunal edited DNS History (+121, date format):
wiki.archiveteam.org/?diff=51322&oldid=47777
-
thuban
Vokun: imo, if the date isn't otherwise given in the article, do use the datetime template but wrap it in square brackets; if it is, leave the quotation as-is
-
missaustraliana
can someone approve my edit to Deathwatch? I added Studio 10 to the axe
-
missaustraliana
axe list*
-
nulldata
d2iq.com, an Enterprise Kubernetes Management Platform, is shutting down/selling assets.
theinformation.com/articles/a16z-ba…d-150m-sale-to-microsoft-shuts-down
-
Pedrosso
If you have an archive of a site, made with grab-site or a more site-specific tool, it's been made clear that it can't be trusted enough to go into the Wayback Machine, but is it still expected / good to upload it as an item?
-
thuban
Pedrosso: yes!
-
Pedrosso
Awesome. Are there any specific metadata suggestions in the case of grab-site?
-
thuban
i don't upload warcs much, so i don't have any, but someone else might
-
hhhb
Did anyone notice that twitter changed its media section? If you are on there and logged in, the media section now displays a grid of images, one for each tweet. If a tweet has 2+ images, it only shows the first one. Extracting URLs will only get that image URL since the 2nd and later URLs aren't in the HTML until you click on the image, and press
-
hhhb
left/right to view additional images
-
hhhb
If you want to save tweets, note that saves can fail silently: the WBM itself does not emit any errors, and the playback URL shows that the page itself loaded, but the tweet content does not load and a "something went wrong" message appears instead. You'll want to save a nitter front end instead.
-
hhhb
if a tweet fails to load on nitter, it will show an error page, which I would assume returns a 4XX or 5XX status, so an error appears when saving the page without the need to check the playback URL
-
h2ibot
Vokunal edited ISP Hosting (+860, Updated to new date format):
wiki.archiveteam.org/?diff=51325&oldid=50970
-
fireonlive
Vokun++
-
eggdrop
[karma] 'Vokun' now has 1 karma!
-
h2ibot
Vokunal edited Portalgraphics.net (+32, updated date format):
wiki.archiveteam.org/?diff=51327&oldid=30338
-
h2ibot
Vokunal edited WikiApiary (+59, date format):
wiki.archiveteam.org/?diff=51328&oldid=49585
-
h2ibot
Vokunal edited Twaud.io (+10, /* Announcement */ date format):
wiki.archiveteam.org/?diff=51329&oldid=28598
-
h2ibot
Vokunal edited Xeno-canto (+13, date format):
wiki.archiveteam.org/?diff=51330&oldid=29439
-
h2ibot
Vokunal edited Bayimg (+10, /* Going offline */ date format):
wiki.archiveteam.org/?diff=51331&oldid=47798
-
h2ibot
Vokunal edited ESPN Forums (+10, date format):
wiki.archiveteam.org/?diff=51332&oldid=28887
-
h2ibot
Vokunal edited Instacast (+20, date format):
wiki.archiveteam.org/?diff=51333&oldid=47933
-
h2ibot
Vokunal edited Moegirlpedia (+37, date format):
wiki.archiveteam.org/?diff=51334&oldid=49240
-
h2ibot
Vokunal edited Moegirlpedia (+1, spelling):
wiki.archiveteam.org/?diff=51335&oldid=51334
-
h2ibot
Vokunal edited List of websites excluded from the Wayback Machine/Former exclusions (+8, date format):
wiki.archiveteam.org/?diff=51336&oldid=51211
-
thuban
at some point i've got to go through the 'endangered' category, because i know for a fact that some of them are in fact dead
-
h2ibot
Vokunal edited URLTeam (+117, /* Non-warrior projects */ date format):
wiki.archiveteam.org/?diff=51337&oldid=51287
-
thuban
i want to revamp fire drill with my proposed defcon system too, but... should get work done first...
-
JAA
Vokun: EDT is UTC-4, not -5. (EST is -5.)
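
A quick worked check of the offsets JAA mentions, using fixed-offset zones from the standard library. The timestamp is an arbitrary example and `to_utc` is just an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern Daylight Time
EST = timezone(timedelta(hours=-5), "EST")  # Eastern Standard Time


def to_utc(naive, tz):
    """Attach a fixed-offset zone to a naive timestamp and convert to UTC."""
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


stamp = datetime(2023, 7, 1, 12, 0)  # arbitrary example time
print(to_utc(stamp, EDT).isoformat())  # 2023-07-01T16:00:00+00:00
print(to_utc(stamp, EST).isoformat())  # 2023-07-01T17:00:00+00:00
```

Treating an EDT timestamp as UTC-5 would therefore put the converted time an hour late.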
-
missaustraliana
Still waiting for someone to approve my edit. Not sure if there's an ID type thing or reference number i need to provide
-
h2ibot
Amphitryon edited URLTeam/Dead (+0, /* Dead or Broken */ Update last known…):
wiki.archiveteam.org/?diff=51338&oldid=51192
-
h2ibot
Missaustraliana edited Deathwatch (+160, Add Studio 10):
wiki.archiveteam.org/?diff=51339&oldid=51290
-
h2ibot
HHHB edited Talk:Twitter (+1683, /* Twitter have changed its media section, it…):
wiki.archiveteam.org/?diff=51341&oldid=50606
-
h2ibot
JustAnotherArchivist changed the user rights of User:Amphitryon
-
h2ibot
JustAnotherArchivist changed the user rights of User:Missaustraliana
-
h2ibot
JustAnotherArchivist changed the user rights of User:HHHB
-
h2ibot
JustAnotherArchivist changed the user rights of User:Magmaus3
-
that_lurker
Seems like JAA got tired of approving all the rapid edits :3
-
JAA
The mod system was never intended as a barrier to submission but as a spam filter without a registration lock. So... yeah :-)
-
JAA
I wish the mod interface had a button to directly make someone automoderated.