-
eggdrop
-
Inti83
Thanks XD was just reading logs
-
Inti83
Thanks - we did see that cont.ar and cine.ar use Cloudflare. Someone here said they may have a local contact at the Argentina Cabase IXP
-
Inti83
We are doing all we can with grab-site - thanks for the testing tool, helps!
-
Inti83
Also, we were wondering if there is anything that can help so that grab-site doesn't start from the beginning when it fails. Is there some flag to resume from where it left off? We couldn't find anything in the docs
-
JAA
That is a *long*-standing wishlist entry:
ArchiveTeam/grab-site #58
-
JAA
So, no.
-
pabs
nicolas17: #gitgud and archive.softwareheritage.org for GitHub repos :) (and #codearchiver for other git repos)
-
h2ibot
Inti83 edited Argentina (-4, /* Guidelines for Adding Websites */):
wiki.archiveteam.org/?diff=51252&oldid=51242
-
Naruyoko
twitter.com/kogekidogso This account is now inactive. It looks like it was saved with ArchiveBot when I brought this up last week, but the associated querie.me account might not be.
-
eggdrop
-
Naruyoko
(And peing)
-
JAA
Yes, it was run through ArchiveBot, but that was only a very superficial crawl. I'll rerun it as soon as there's space.
-
Naruyoko
Thank you
-
JAA
Querie.me looks scripty.
-
JAA
I can't seem to get to any account page there...? Or is it just a matter of following the links in their tweets?
-
Naruyoko
You can get a list of their answers as "recent answers" here:
querie.me/user/r1OYTzyfrTY0Fn4ZIBXI4nEJPs63/recent
-
Naruyoko
However, they use an infinite scrolling thing
-
JAA
Yeah, that page is entirely useless without JavaScript, so archiving it is going to be difficult.
-
nicolas17
hm
-
Naruyoko
I'm now seeing if I can load the list by holding down arrow
-
JAA
I'm trying to do some curl magic.
-
nicolas17
JAA: suppose I write code to archive a querie.me page, by parsing the JS crap if needed to figure out what URLs to recurse into
-
JAA
They only load 5 answers per request by default, but you can do far more.
-
JAA
1000 is slow but works. :-)
-
nicolas17
we're not mass-archiving the entire site so this is not a DPoS project, just for one-off pages
-
nicolas17
how should I write that code? would a wget-at lua script be appropriate anyway?
-
JAA
No extra URLs need to be fetched for the individual answers, it seems.
-
JAA
So I'll just do the user page API crap and then throw the answer page URLs into AB.
-
Naruyoko
I see, loading 1000 at once is much more efficient than me scrolling down endlessly
-
nicolas17
ah hm I guess it could be a script in any technology, that produces a URL list for archivebot
-
JAA
There are over 2000 answers, so yes. :-)
-
JAA
The very technologically advanced extraction:
-
JAA
`function querie { pp="$1"; curl "querie.me/api/qas?kind=recent&count…serId=r1OYTzyfrTY0Fn4ZIBXI4nEJPs63${pp}" -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/120.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: querie.me/user/r1OYTzyfrTY0Fn4ZIBXI4nEJPs63/recent' --compressed -s | tee
-
JAA
"/tmp/querie-json-${pp}" | jq -r '.[] | .id' >"/tmp/querie-ids-${pp}"; ls -al /tmp/querie-{json,ids}-"${pp}"; pp="&startAfterId=$(tail -n 1 "/tmp/querie-ids-${pp}")"; printf '%q\n' "$pp"; }`
-
JAA
No loop because I wasn't sure how it'd behave at the end.
-
JAA
(It returns an empty array then.)
-
Naruyoko
It looks like you'll get an empty array as the response at the end, from observing another user
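
The cursor-style pagination described above (request a page, take the last `id`, pass it as `startAfterId`, stop on an empty array) can be sketched as a loop. This is a minimal sketch, not JAA's actual tooling: `iter_answers`, `fake_fetch`, and `FAKE_IDS` are hypothetical names, and the real HTTP request is replaced by a stand-in so the elided query string is not guessed at.

```python
from typing import Callable, Iterator, List, Optional


def iter_answers(fetch_page: Callable[[Optional[str]], List[dict]]) -> Iterator[dict]:
    """Yield every answer, following the cursor until an empty page comes back."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        if not page:  # an empty array marks the end of the list
            return
        yield from page
        # Same value the one-liner extracts with `tail -n 1` on the ID file.
        cursor = page[-1]["id"]


# Hypothetical stand-in for the API, for demonstration only.
FAKE_IDS = ["a", "b", "c", "d", "e"]


def fake_fetch(start_after_id: Optional[str], count: int = 2) -> List[dict]:
    start = 0 if start_after_id is None else FAKE_IDS.index(start_after_id) + 1
    return [{"id": i} for i in FAKE_IDS[start:start + count]]


all_ids = [item["id"] for item in iter_answers(fake_fetch)]
print(all_ids)
```

In real use, `fetch_page` would wrap the HTTP call and append `&startAfterId=<id>` only when the cursor is not `None`.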
-
arkiver
there must be some prize we can award JAA for that one-liner
-
JAA
arkiver: This isn't even my final form.
-
arkiver
oh no
-
JAA
This is the one-liner I'm most proud of so far: `try-except` in a single line in Python:
web.archive.org/web/20230311201616/https://bpa.st/657RU
-
Naruyoko
Meanwhile, peing.net simply has a page number (3/page)
-
arkiver
i think we now have a bot for queuing for all long term channels except #shreddit (which by default gets everything)
-
arkiver
JAA: well... congrats i guess :P
-
arkiver
Naruyoko: what is up with peing.net?
-
Naruyoko
The person has an account there too
-
Naruyoko
I don't know how well the individual pages save, since it's excluded
-
JAA
arkiver: I *think* you can rewrite any Python code as a single line. Pattern matching and exception groups are hard, but it should be possible. I have this idea of writing a tool to do the conversion. Maybe when I'm retired or something. :-P
-
arkiver
JAA: conversion to one-liners?
-
arkiver
or from
-
JAA
To
-
arkiver
oh no
-
JAA
:-)
-
arkiver
thisisfine
-
project10
no god, please, no...
-
JAA
We'll be able to ship our pipeline.py as a single line! Imagine the savings from not having to store all those LF characters!
-
arkiver
79 chars please!
-
JAA
The 80s called, they want their monitor back. :-P
-
arkiver
i feel an old discussion coming up
-
JAA
:-)
-
» arkiver and JAA have fundamental differences when it comes to Python line length
-
JAA
Nah, it's obviously just for fun to prove that it's possible. The Python grammar very much requires separate lines in several places. `try-except` is one of them.
-
JAA
Hence that complicated one-liner to still do it.
-
arkiver
maybe we'll make JAA into one line
-
arkiver
one dimensional JAA
-
arkiver
no more 3 dimensions
-
arkiver
just for fun :)
-
JAA
Like your Python code is basically one-dimensional because it has no width? :-)
-
arkiver
like my Python code is basically one-dimensional because it has no width, exactly!
-
JAA
No depth to it either I guess. :-P
-
arkiver
yes
-
arkiver
nice clean python code without personality
-
JAA
Here's the API data from Querie for that account as JSONL, because I had it anyway:
transfer.archivete.am/whGJs/querie.…YTzyfrTY0Fn4ZIBXI4nEJPs63.jsonl.zst
-
JAA
Produced by concatenating the querie-json-* files in the right order + `jq -c '.[]'`
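
For readers without `jq`, the same array-to-JSONL flattening can be sketched in Python. This is an illustrative equivalent of `jq -c '.[]'` applied to each page in order, not the command that was actually run; `arrays_to_jsonl` and the sample `pages` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def arrays_to_jsonl(pages: Iterable[str]) -> Iterator[str]:
    """Turn JSON array documents into one compact JSON line per element."""
    for text in pages:
        for item in json.loads(text):
            # Compact separators mimic jq's -c output.
            yield json.dumps(item, separators=(",", ":"), ensure_ascii=False)


# Hypothetical page contents, in the order they were fetched.
pages = ['[{"id": "a"}, {"id": "b"}]', '[{"id": "c"}]', '[]']
lines = list(arrays_to_jsonl(pages))
print("\n".join(lines))
```

The order of `pages` matters: passing the files in fetch order keeps the answers in the order the API returned them.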
-
JAA
(Yes, this could be done better, and I would if I had to do this more than once.)
-
JAA
The job for those 2003 answers is running now.
-
Naruyoko
-
eggdrop
-
JAA
Naruyoko: Thanks, done.
-
qwertyasdfuiopghjkl
-
arkiver
thanks qwertyasdfuiopghjkl , i reported it to someone who may be able to fix it
-
project10
perhaps we could add it to #nodeping?
-
h2ibot
-
Pedrosso
fz.se supposedly existed since 1996, might it be a good idea to have a proactive grab?
-
TheTechRobo
digitize.archiveteam.org is also (still) down
-
JAA
digitize.archiveteam.org is permanently down, and its contents were integrated into the main wiki years ago, from what I've been told.
-
e853962747e3759
Hi, can anyone please teach me how to add a webpage to the wayback machine crawls? At least the main website every 6 hours. If the linked pages from it could be auto-archived also, even better. Maybe a check for each page if anything significant has changed instead of making too many duplicates
-
that_lurker
why do you want it be crawled so often
-
nicolas17
yesterday savepagenow had a global backlog of like 18 hours lol
-
JAA
e853962747e3759: Were you here a couple days ago already?
-
that_lurker
god dammit you spooked them
-
that_lurker
:P
-
JAA
lol
-
that_lurker
I was so ready to give them my "instructions.jpg" :P
-
fireonlive
;d
-
fireonlive
webchat needs like a 'btw if you leave this in a background tab you'll disconnect'
-
fireonlive
gone are the times of tabs just having fun 24/7
-
e853962747e3759
Hi, can anyone please teach me how to add a webpage to the wayback machine crawls? At least the main website every 6 hours. If the linked pages from it could be auto-archived also, even better. Maybe a check for each page if anything significant has changed instead of making too many duplicates
-
JAA
e853962747e3759: Were you here a couple days ago already?
-
e853962747e3759
so is this some sort of faux intellectual elitist thing? I am not worthy of the knowledge to be able to archive a website?
-
nicolas17
just trying not to answer the same question a dozen times to people who will ignore them and ask again
-
Pedrosso
faux intellectual elitist...+
-
Pedrosso
?*'
-
Pedrosso
What's that even supposed to mean?
-
e853962747e3759
the last comment I saw here started with "yesterday save page now..." Can someone please copy paste the explanation if there was one
-
e853962747e3759
just 2 comments, no explanation
-
immibis
it means "i'm a troll" as if the cat-on-keyboard nickname didn't give it away
-
Pedrosso
I don't understand how but, I'll take your word for it
-
nicolas17
-
Pedrosso
they left
-
Pedrosso
oh
-
Pedrosso
Yeah probably
-
Pedrosso
I thought you sent a link to the explanation they were asking for
-
nicolas17
I don't think we have anything that can poll every 6 hours
-
Pedrosso
They may be 'trolling' but it's a good idea to at least grab it once, right?
-
Pedrosso
"It is an important and significant news aggregator"
-
JAA
Adding it to #// makes the most sense.
-
Pedrosso
does #// grab entire sites? I thought those were just an equivalent to !ao
-
fireonlive
it has a news sources thing iirc
-
nicolas17
JAA: oh btw, how should I handle new/updated support.apple.com articles?
-
JAA
It does not, but their question was to grab the homepage regularly plus links on it. Which is exactly what #// already does for lots of news sites and other things.
-
JAA
nicolas17: AB !ao?
-
Pedrosso
Ahhh
-
nicolas17
I don't need something to grab 8000 pages periodically because I'm already doing it, but I can give a list of the changes I did find
-
JAA
If you already grab them, grab them as WARCs and upload that? That also creates a direct record that the unchanged pages did indeed not change.
-
JAA
Rather than changes being missed, for example.
-
nicolas17
hrmmm grabbing them as WARC would need significant changes :P I have a git repo of file content alone atm
-
fireonlive
ah i see
-
nicolas17
oh right I'm even mangling the data I store (apparently <link rel="alternate"> tags linking to other languages get regularly shuffled so I strip them out to get readable diffs)
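
The normalisation step mentioned here can be sketched as follows. This is only an illustration of the idea, not nicolas17's actual code: `strip_alternate_links` and the sample HTML are hypothetical, and a regex is used for brevity where a real HTML parser would be more robust.

```python
import re

# Matches a whole <link ...> tag whose rel attribute starts with "alternate".
# Note: this would also strip rel="alternate stylesheet" links.
ALTERNATE_LINK = re.compile(
    r"""<link\b(?=[^>]*\brel\s*=\s*["']?alternate\b)[^>]*>\s*""",
    re.IGNORECASE,
)


def strip_alternate_links(html: str) -> str:
    """Remove <link rel="alternate"> tags so their ordering cannot affect diffs."""
    return ALTERNATE_LINK.sub("", html)


# Two hypothetical fetches of the same page with the language links shuffled.
fetch_a = ('<head><link rel="alternate" hreflang="es" href="/es">'
           '<link rel="alternate" hreflang="fr" href="/fr">'
           '<title>T</title><link rel="stylesheet" href="s.css"></head>')
fetch_b = ('<head><link rel="alternate" hreflang="fr" href="/fr">'
           '<link rel="alternate" hreflang="es" href="/es">'
           '<title>T</title><link rel="stylesheet" href="s.css"></head>')

print(strip_alternate_links(fetch_a) == strip_alternate_links(fetch_b))
```

This is also why such a repository is not a faithful record of the responses: the stored text no longer matches what the server sent.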
-
e853962747e3759
Is this a joke? I thought the internet archive organization is a normal organization that works with volunteers and archivists to archive the internet
-
JAA
You disconnected...
-
JAA
Also, we are not the Internet Archive.
-
JAA
And yes, we do try to work with everyone, but if you keep disconnecting, it's hard to communicate.
-
nicolas17
lol
-
JAA
Case in point...
-
nicolas17
hard to answer your questions if you keep disconnecting
-
e7269535e6632
How do i prevent it from disconnecting? I am also having significant problems with the chat box here. Did I miss any comments or explanations?
-
JAA
Keep the webchat tab in the foreground or move it to a separate window.
-
JAA
-
nicolas17
JAA: "grab them as WARCs and upload that?" that won't appear on WBM will it?
-
JAA
nicolas17: We can make that happen.
-
nicolas17
qwarc might work for this...
-
that_lurker
e7269535e6632: In case you did not yet disconnect: why do you want the site to be archived so often, and what is the site? Also, connecting through IRC would be better if you are having trouble with the webirc client
-
that_lurker
(╯°□°)╯︵ ┻━┻
-
JAA
lol
-
fireonlive
christ lol
-
Pedrosso
!tell e7269535e6632 "can anyone please teach me how to add a webpage to the wayback machine crawls?" To archive a site on your own, which is what it sounds like you're asking for, use
github.com/ArchiveTeam/grab-site and upload to IA through an item. It won't show up on the wayback machine but it will be saved which is really the point.
-
Pedrosso
To have it in the wayback machine it'd have to be queried to AT's ArchiveBot, one of their other projects. I'm not sure what other ways there are.
-
eggdrop
[tell] ok, I'll tell e7269535e6632 when they join next
-
nicolas17
inb4 they join with a different nickname next time
-
Pedrosso
Haha
-
fireonlive
!tell e7269535e6632 <Pedrosso> To have it in the wayback machine it'd have to be queried to AT's ArchiveBot, one of their other projects. I'm not sure what other ways there are.
-
eggdrop
[tell] ok, I'll tell e7269535e6632 when they join next
-
fireonlive
(cut in two)
-
Pedrosso
Thank
-
fireonlive
:)
-
Pedrosso
I bet they may, but they didn't last time
-
Pedrosso
(about joining with a different nick)
-
fireonlive
they may have been a7427a63 from the 2nd but unknown
-
that_lurker
They could also hopefully be reading the logs and see that, if they are in fact having issues with the webchat
-
nicolas17
was e853962747e3759 earlier today
-
that_lurker
and most likely <a7427a63 as well
-
JAA
Maybe they at least got the log link and are reading there. If so, hi. :-)
-
fireonlive
supppppppp
-
nicolas17
argenteam.net this website provides crowdsourced subtitles in Spanish, mainly for uhh questionably-obtained movies
-
nicolas17
you can find subtitles by the torrent infohash so you know it syncs properly with the exact video you have
-
nicolas17
it's shutting down at the end of the year
-
nicolas17
they said they will soon publish a torrent with all 100k subtitles they have done
-
fireonlive
oh nice
-
nicolas17
there's also a forum with 127475 threads but it seems to be login-walled, so that could be complicated to archive
-
nicolas17
-
immibis
i would think that subtitles work no matter how you got a copy of a movie, so there's no need to call them questionably-obtained
-
nicolas17
immibis: well, the website actually has magnet: links to the video the subtitles were made for >.>
-
nicolas17
so I'm sure many people use it primarily as a torrent search index too
-
Pedrosso
Would having someone sign up for a throwaway account and provide the cookie for archival be something that'd work? For the login-walls I mean
-
nicolas17
Pedrosso: I signed up and I can see all forums normally
-
nicolas17
I'm just not sure if that can be used for archival
-
Pedrosso
Can the archivebot use custom cookies?
-
JAA
No
-
nicolas17
and well, my username shows up on every page :P
-
JAA
Things archived with accounts also can't go into the WBM generally speaking.
-
Pedrosso
I see
-
Pedrosso
Then how are such things generally saved?
-
JAA
I've done some with wpull and cookies. The WARCs are somewhere, either in IA just for download and local playback or still sitting in my pile of stuff to upload.
-
Pedrosso
-
Pedrosso
"generally speaking" so it was just a special case?
-
nicolas17
forums (viewforum.php?f=) 4, 11, 35, 46, 55, 64 are publicly accessible
-
fireonlive
furaffinity.net/user/smaugit < people have nice things to say about the account haha
-
Pedrosso
yep, they sure do
-
nicolas17
forums 1, 4, 11, 14, 27, 35, 46, 55, 63, 64, 66, 67, 73 are accessible on a brand new account
-
nicolas17
of the remaining IDs, when logged in some return "forum doesn't exist" and others return "you're not authorized to see this forum" (probably private stuff for trusted translators, moderators, etc)
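
As a quick cross-check of the two ID lists above, the forums that only become visible after logging in are the set difference. The variable names are illustrative; the numbers are the ones quoted in the chat.

```python
# Forum IDs reachable without logging in.
public = {4, 11, 35, 46, 55, 64}
# Forum IDs reachable with a brand new account.
new_account = {1, 4, 11, 14, 27, 35, 46, 55, 63, 64, 66, 67, 73}

# Forums that would be missed by a crawl without cookies.
login_only = sorted(new_account - public)
print(login_only)
```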
-
h2ibot
FireonLive edited Issuu (+115, move to partially saved to now, can be changed…):
wiki.archiveteam.org/?diff=51254&oldid=50096
-
Pedrosso
(Also, since the last one was in 2015, another proactive grab of furaffinity might be warranted, maybe?)
-
JAA
Pedrosso: I can only think of one or two cases where such archives went into the WBM. For SPUF, Valve people gave us an account to continue archiving past the shutdown deadline, allowing us to cover everything. And I think there was another one that I can't remember right now.
-
nicolas17
I could make a more anonymous account :P
-
nicolas17
but anyway
-
nicolas17
there's a few public forums
-
nicolas17
and there's the main site to deal with
-
nicolas17
oh fun, the pages are not deterministic
-
nicolas17
"Codec info = AVC Baseline⊙L2 | V_MPEG4/ISO/AVC"
-
nicolas17
gets turned into <a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="56143725333a3f3833161a627864"> and the cfemail field changes across requests
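
The changing `data-cfemail` value is consistent with Cloudflare's email obfuscation, which, as far as is publicly known, XORs every byte of the text with a key stored in the first byte of the hex string. A different key on each request would then give a different string for the same text. A minimal decoding sketch under that assumption (`decode_cfemail` is a hypothetical helper name):

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a data-cfemail hex string: first byte is the XOR key, the rest is payload."""
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode("utf-8")


# The value quoted in the chat above.
print(decode_cfemail("56143725333a3f3833161a627864"))  # → Baseline@L4.2
```

Decoding each copy before comparing would make two fetches of the same page comparable again, though the stored responses themselves still differ.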
-
nicolas17
anyway I'm doing a simplistic wget of all movie IDs now
-
nicolas17
because they have a <meta name="og:url"> with the canonical URL
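
Pulling the canonical URL out of each fetched page can be sketched with the standard library's HTML parser. This is an assumption-laden illustration: `OgUrlParser`, `canonical_url`, and the sample markup are hypothetical, and the URL is assumed to sit in the tag's `content` attribute, which the chat does not state.

```python
from html.parser import HTMLParser
from typing import Optional


class OgUrlParser(HTMLParser):
    """Remember the first <meta name="og:url"> (or property="og:url") value seen."""

    def __init__(self) -> None:
        super().__init__()
        self.og_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_url is not None:
            return
        attributes = dict(attrs)
        if "og:url" in (attributes.get("name"), attributes.get("property")):
            self.og_url = attributes.get("content")


def canonical_url(html: str) -> Optional[str]:
    parser = OgUrlParser()
    parser.feed(html)
    return parser.og_url


# Hypothetical page fetched by numeric ID.
sample = ('<html><head><meta name="og:url" '
          'content="https://example.org/movie/123/some-title"></head></html>')
print(canonical_url(sample))
```

The resulting URLs could then be collected into a list for a later, proper crawl.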
-
fireonlive
-+rss- YouTuber who intentionally crashed plane is sentenced to 6 months in prison:
twitter.com/bnonews/status/1731748816250974335 news.ycombinator.com/item?id=38523704
-
eggdrop
-
nicolas17
should take me 30 minutes to get all IDs