-
h2ibot
JustAnotherArchivist changed the user rights of User:Vokunal
-
Pedrosso
With URL-needing projects like #down-the-tube, when the tracker says there are 0 to do, does that mean that the system literally has no more urls to go off of? Or that it's just not willing to allocate any right now?
-
nicolas17
when the youtube tracker says there are 0 to do, it means there are no more urls in the youtube queue, yeah
-
nicolas17
the youtube project is not trying to archive all of youtube (that would be infeasible), it has to be actually important videos
-
nicolas17
if it reaches 0, great, we have more capacity for the other projects
-
Pedrosso
Alright, that's what I wanted/needed to know. Thanks
-
Pedrosso
On a separate curiosity, I've been wondering from a previous conversation if it'd be possible (and, if possible, whether it should be done) to get all the failed imgur outlinks from the logs of AB projects and run those through the imgur warrior.
-
pabs
yes, you "just" need to download all the AB logs from IA, parse them, upload the lists and submit to #imgone
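(The parsing step pabs describes could look roughly like this. The log-line shape below is an assumption for illustration, not ArchiveBot's actual format, so the regex would need adjusting against real logs.)

```python
import re

# Hypothetical ArchiveBot-style log line: a status code somewhere before
# the URL. The real format may differ; adjust the pattern against actual
# downloaded logs.
LOG_LINE = re.compile(r'\b(\d{3})\b.*?(https?://(?:i\.)?imgur\.com/\S+)')

def failed_imgur_urls(lines):
    """Yield imgur URLs from log lines whose recorded status is 429."""
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group(1) == '429':
            yield m.group(2)

sample = [
    '12:00:01 429 https://i.imgur.com/abc1234.jpg',
    '12:00:02 200 https://i.imgur.com/def5678.jpg',
]
print(list(failed_imgur_urls(sample)))
```

The resulting lists could then be uploaded and queued into #imgone as described.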
-
pabs
and maybe make a service for that, since other projects will want some processing too
-
nicolas17
pabs: are warcs public for imgur? for many projects they aren't :(
-
pabs
sounded like Pedrosso was talking about warcs for AB not #imgone?
-
Pedrosso
I was, I was
-
» pabs not sure about imgur warcs tho
-
nicolas17
ah
-
Pedrosso
Also, pabs, what exactly do you mean by making a service for that?
-
thuban
nicolas17: they're both public
-
pabs
Pedrosso: as in a server with some code that does this all day long, and lets people add processing and flows. ie if AB finds a wiki, it should go to #wikibot
-
pabs
so the service would parse the warcs and connect that link
-
Pedrosso
That sounds like a good idea. Though I personally don't have enough knowledge or experience here to begin to think about executing that
-
JAA
There is a tool for WARC extraction, although that would have slightly different results than log parsing.
-
JAA
s/extraction/scraping/ I guess, extracting links that appear in WARCs.
-
Pedrosso29
Sry bout the disconnect/reconnect, if it shows
-
pabs
I think this was less about scraping the HTML in WARCs and more about sending the 429ed imgur requests from AB to #imgone
-
Pedrosso29
^
-
JAA
Yeah, they're not equivalent.
-
JAA
WARC scraping would produce more results but also requires munching more data.
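(The scraping half of that could be sketched as below. Only the link-extraction step is shown, stdlib-only; actually iterating response records out of a WARC would normally use a WARC-parsing library such as warcio, and the URL pattern here is a simplified assumption about imgur link shapes.)

```python
import re

# Simplified imgur URL shapes (page and direct-image links). A scraping
# pass would apply this to every decoded response body pulled from the
# WARC; that WARC-reading half is omitted to keep the sketch
# dependency-free.
IMGUR_RE = re.compile(
    rb'https?://(?:www\.|i\.|m\.)?imgur\.com/[A-Za-z0-9/._-]+'
)

def scrape_imgur_links(body: bytes) -> set:
    """Return the distinct imgur URLs found in one response body."""
    return set(IMGUR_RE.findall(body))

html = (b'<a href="https://imgur.com/a/abc123">album</a> '
        b'<img src="https://i.imgur.com/xyz789.jpg">')
print(sorted(scrape_imgur_links(html)))
```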
-
pabs
but really, both could be useful. indeed, tons more data for scraping though
-
Pedrosso29
The former I suppose would be more specific to what I originally asked, the latter would be far more general and fit with the service idea
-
pabs
could do scraping only for the AB jobs without offsite links
-
pabs
anyway. it's good to start simple though and work up from there, so manually do this, then hackily automate parts, then betterise the automation, then package it into a service
-
thuban
it's a nice thought, but it would duplicate some of the logic for cross-project dispatch we do already and i'm not sure what the best strategy for eventually rationalizing that would be
-
thuban
s/dispatch/backfeed/
-
pabs
are there any docs for that? I hadn't heard of any cross-project dispatch yet
-
JAA
#// dispatches to Telegram and (soon?) Imgur.
-
nicolas17
pabs: #// already sends telegram links to #telegrab
-
nicolas17
how that works behind the scenes, I don't know
-
pabs
ah, interesting...
-
JAA
No Imgur yet. arkiver, here's a reminder. ;-)
-
nicolas17
oh ew
-
nicolas17
I expected something server side rather than the worker for one project submitting into another
-
thuban
the logical thing might be to have a central url clearinghouse that identified all specially-handled urls and forwarded them to the appropriate projects (and either sent the rest to #// or, possibly configurably, dropped them as might be more appropriate for archivebot)
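(A minimal sketch of that pattern-based clearinghouse idea. The channel names come from this conversation; the routing table and matching logic are illustrative assumptions, not an existing implementation.)

```python
import re

# Hypothetical routing table: URL patterns mapped to the project that
# specially handles them. Real deployment would need many more entries.
ROUTES = [
    (re.compile(r'https?://(?:[a-z]+\.)?imgur\.com/'), '#imgone'),
    (re.compile(r'https?://(?:t|telegram)\.me/'), '#telegrab'),
    (re.compile(r'https?://(?:www\.)?mediafire\.com/'), '#mediaonfire'),
]

def dispatch(url, default='#//'):
    """Route a discovered URL to its special-handling project, else fall
    through to the default (pass default=None to drop unmatched URLs, as
    might be more appropriate for ArchiveBot)."""
    for pattern, channel in ROUTES:
        if pattern.search(url):
            return channel
    return default

print(dispatch('https://i.imgur.com/abc.jpg'))         # prints: #imgone
print(dispatch('https://example.com/', default=None))  # prints: None
```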
-
pabs
yes
-
thuban
in practice all new projects send outlinks to #// anyway, so (if eg telegram links to mediafire or whatever) they do get to the appropriate projects eventually
-
Pedrosso
So a mediafire outlink from the AB will be sent to #// where it'll be sent to #mediaonfire?
-
JAA
Only DPoS projects send things to #//. AB does not.
-
Pedrosso
Ah, I see I see
-
thuban
right. and bundling that queueing with archival makes it not compose well with archivebot, plus it's a needless round-trip, plus it requires the code to actually opt in (when looking for an example i was surprised to find that apparently pastebin doesn't queue outlinks at all)
-
thuban
plus changes require #// worker updates to take effect (minor considering how most people run it, but still)
-
thuban
idk, i can think of some cases in which you really do need the original discovery context and not just the url (nitter/mastodon instances, blogs at custom domains). but i think all we actually do at present is url-pattern-based
-
thuban
s/discovery context/page structure/ (i can't actually think of any examples where you need the discovery context)
-
h2ibot
Tech234a edited List of websites excluded from the Wayback Machine/Partial exclusions (+52, Add early Apple Store):
wiki.archiveteam.org/?diff=51122&oldid=50493
-
h2ibot
Petchea edited Tumblr (+107, /* History */):
wiki.archiveteam.org/?diff=51123&oldid=51113
-
JAA
Indeed, but it's running through AB already.
-
JAA
Didn't realise it was part of G/O. Another one for the list, I guess.
-
h2ibot
JustAnotherArchivist edited Deathwatch (+184, /* 2023 */ Add Jezebel):
wiki.archiveteam.org/?diff=51124&oldid=51121
-
Barto
pabs: poor TheTechRobo he may get the hug of death of HN :D
-
Barto
that's pretty moderate so far
-
TheTechRobo
1.7k now
-
arkiver
TheTechRobo: congrats on getting on front page :)
-
arkiver
very nice tool as well!
-
arkiver
JAA: whoops
-
arkiver
thanks for the reminder
-
TheTechRobo
arkiver: :D
-
ScenarioPlanet
Pedrosso pokechu22 ^
-
ScenarioPlanet
error: Hello there
-
error
howdy
-
ScenarioPlanet
Also I think you should change your nickname (with /nick new_nickname) or you'll get pinged every time someone uses the "error" word
-
error
fair lol
-
redlattice
changed
-
tomodachi94
Does anyone know if Fextralife (fextralife.com) has been grabbed ever? Specifically curious about their wikis, which seem like a goldmine
-
tomodachi94
(Wiki page already created for those interested)
-
Pedrosso
I don't see anything on archive.org/search?query=originalurl%3A%28%2Afextralife%2A%29 however, idk if there's possibly another way of searching for it
-
Pedrosso
If it's really a goldmine of wikis, maybe move this to #wikiteam?
-
Pedrosso
Disregard that last statement, as after all it's the entire website
-
Pedrosso
tomodachi94: I believe #archivebot automatically moves wikis to #wikibot when it discovers them, so I'd suggest you repeat this in #archivebot so that an admin can submit it
-
Pedrosso
do ask them if it does move them automatically as I don't know
-
thuban
archivebot is a self-contained system and doesn't submit anything to any other tooling
-
Pedrosso
Strange, when I had asked AB to archive a website with a wiki in it, it sent it there. Perhaps I misinterpreted it
-
mossssss
does anyone know if blogger/blogspot is in the warrior?
-
mossssss
or if there's even an initiative to archive it?
-
JAA
Pedrosso: The originalurl search only works for wikis specifically dumped by WikiTeam tooling. Basically nobody else sets that metadata field. Certainly not AB.
-
JAA
And no, AB does not submit anything elsewhere. That was done manually.
-
JAA
mossssss: It isn't yet, but we're aware of the situation. Unfortunately, it doesn't seem to be possible to enumerate the blogs or similar.
-
JAA
Looks like that (Google inactive accounts etc.) was never added to Deathwatch though.
-
mossssss
oh no!!! thats so frustrating that there's no way to do it
-
Pedrosso
Very frustrating indeed.
-
mossssss
i'm stressed because i know there's so much stuff on there that is totally going to be lost
-
fireonlive
:(
-
fireonlive
google is really on a 'HOW much are we storing???' kick lately
-
Pedrosso
Could you specify?
-
JAA
If there's anything you particularly care about, feel free to ask in #archivebot about archiving it. Blogger blogs work fairly well (except for some pagination mess and the 'dynamic view' script hells).
-
h2ibot
JustAnotherArchivist edited Deathwatch (+635, /* 2023 */ Add Google's inactive accounts purge):
wiki.archiveteam.org/?diff=51126&oldid=51124
-
mossssss
this is perhaps a bit backwards, but is there a way to do it through individual Blogger profiles? probably half of profiles aren't visible, but the ones that are usually have 1-3 blogs on them
-
JAA
The profile IDs are much too large to be bruteforced, and IIRC there's quite a bit of rate limiting on the profile pages.
-
mossssss
yeah - that makes sense. its just the only half-plausible solution i can come up with lol
-
Pedrosso
JAA: I've grabbed the wiki links from tomodachi94's suggested website.
transfer.archivete.am/ErUPC/wikilinks.txt
-
Ryz
I mentioned the Blogger thing multiple times 2-3 months ago...
-
JAA
Yes, it was discussed extensively in May.
-
JAA
But since we have no way of discovering blogs, really...
-
Ryz
Not even Blogger ID numeration?
-
Ryz
*ID number
-
Ryz
Even if it's rate limited?
-
thuban
on a related note, anyone know whether blog names or user ids can be extracted from blogspot image cdn urls? parts don't look entirely random, but i'm not sure
-
Ryz
Yeah, it's one of the reasons why I became slightly to somewhat more inactive in ArchiveBot
-
Ryz
There seems to be an implicit feeling that Blogger may be deemed less important than YouTube or other stuff
-
Ryz
Even though it uses less space than something video related
-
thuban
i actually had no idea about this--must have missed the discussion
-
pabs
Barto, arkiver, TheTechRobo: oh, didn't think it would reach the front page :)
-
JAA
thuban: Yeah, that's why this should've been on Deathwatch from the start. :-|
-
Barto
pabs: muahaha
-
Barto
congratz
-
Pedrosso
congratz indeed
-
fireonlive
pabs raking in that HN karma :p
-
pabs
re blogger, a while back I found you can scrape front pages for profile links, scrape front page links from profiles, and you get a probably ever-expanding list
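(That mutual blog/profile expansion could be sketched like this. The fetch function is pluggable precisely because of the rate limits and captchas mentioned later; the URL patterns and the toy pages are illustrative assumptions.)

```python
import re

BLOG_RE = re.compile(r'https?://[a-z0-9-]+\.blogspot\.com')
PROFILE_RE = re.compile(r'https?://www\.blogger\.com/profile/\d+')

def enumerate_blogs(seed_blogs, fetch, max_pages=1000):
    """Alternate between blog front pages (for profile links) and
    profiles (for blog links). `fetch(url) -> str` is injected so
    throttling/captcha handling can live outside this loop."""
    blogs, profiles = set(seed_blogs), set()
    frontier = list(seed_blogs)
    while frontier and max_pages:
        max_pages -= 1
        page = fetch(frontier.pop())
        for url in PROFILE_RE.findall(page):
            if url not in profiles:
                profiles.add(url)
                frontier.append(url)
        for url in BLOG_RE.findall(page):
            if url not in blogs:
                blogs.add(url)
                frontier.append(url)
    return blogs

# Toy fetcher standing in for real HTTP requests:
fake_pages = {
    'https://alice.blogspot.com': 'see https://www.blogger.com/profile/123',
    'https://www.blogger.com/profile/123':
        'blogs: https://alice.blogspot.com https://bob.blogspot.com',
    'https://bob.blogspot.com': 'no links here',
}
found = enumerate_blogs(['https://alice.blogspot.com'],
                        lambda u: fake_pages.get(u, ''))
print(sorted(found))
```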
-
Ryz
I'm not even sure even adding it on Deathwatch when it was announced would help
-
Pedrosso
I don't imagine it'd be complete but quite extensive
-
mossssss
it would be nice to try
-
Pedrosso
It would
-
JAA
Ryz: It does help. It wasn't really on my radar anymore until some people brought it up again a couple days ago (on Reddit and via email).
-
» JAA summons the arkiver.
-
fireonlive
it is time.
-
fireonlive
arkiver
-
pabs
my hacky script for blogger/blogspot enumeration:
transfer.archivete.am/RAiXa/archive-blogspot.sh
-
pabs
(note the captchas you get really hamper the process)
-
fireonlive
anyone here work at google? :p
-
Pedrosso
ooh, never mind pabs: that (the script) does look like it'd be complete
-
Ryz
I have so many Blogspot websites to process too
-
pabs
sorry for the traffic bump TheTechRobo :)
-
mossssss
same - i may just send them in the other channel if i need to
-
» pabs reached out to a Google person he knows
-
pabs
(not in the right dept tho)
-
fireonlive
🤞
-
katia
fireonlive, you can tell if they say (opinions my own)
-
fireonlive
haha
-
fireonlive
true true
-
TheTechRobo
pabs: All good! :-)
-
h2ibot
PaulWise edited Blogger (+294, add second strategy):
wiki.archiveteam.org/?diff=51127&oldid=47348
-
h2ibot
PaulWise edited Blogger (+116, add list of blogs found with the second strategy):
wiki.archiveteam.org/?diff=51128&oldid=51127
-
thuban
here's a list of 144 blogs extracted from my irc logs (excluding #archivebot but not other archiveteam channels):
transfer.archivete.am/2sEI9/blogspot_blogs_from_irc_logs.txt
-
thuban
(some of these are from topics of channels i scanned during the freenode implosion--i had totally forgotten about that)
-
mossssss
does this mean we might be able to do it?? (I would be SO relieved lol - even some is better than none)
-
h2ibot
Tomodachi94 created Fextralife (+458, Create page):
wiki.archiveteam.org/?title=Fextralife
-
JAA
One potential concern is that many blogs will not be at risk, and I guess we don't have a good way of identifying which ones are.
-
mossssss
yeah - i know it's any google account that hasn't been touched in 2 years - but that doesn't necessarily mean that the blogs are representative of the accounts
-
JAA
Any blog with a post in the past 2 years would *probably* be fine, but scheduled posts are a thing, so it's not reliable.
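(That heuristic reduces to a simple cutoff check, sketched below; as noted, scheduled posts can make an abandoned account look active, so it can only ever be a rough filter.)

```python
from datetime import datetime, timedelta, timezone

def probably_at_risk(last_post, now, window_days=730):
    """Flag a blog whose newest post is older than roughly 2 years.
    Only a heuristic: scheduled posts mean a recent post does not prove
    the account itself is active."""
    return now - last_post > timedelta(days=window_days)

now = datetime(2023, 11, 21, tzinfo=timezone.utc)
print(probably_at_risk(datetime(2020, 5, 1, tzinfo=timezone.utc), now))  # True
print(probably_at_risk(datetime(2023, 1, 1, tzinfo=timezone.utc), now))  # False
```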
-
mossssss
omg i totally forgot about that...
-
mossssss90
not sure why it keeps disconnecting me lol so annoying