-
h2ibot
JustAnotherArchivist changed the user rights of User:Vokunal
-
Pedrosso
With URL-needing projects like #down-the-tube, when the tracker says there are 0 to do, does that mean that the system literally has no more urls to go off of? Or that it's just not willing to allocate any right now?
-
nicolas17
when the youtube tracker says there are 0 to do, it means there are no more urls in the youtube queue, yeah
-
nicolas17
the youtube project is not trying to archive all of youtube (that would be infeasible), it has to be actually important videos
-
nicolas17
if it reaches 0, great, we have more capacity for the other projects
-
Pedrosso
Alright, that's what I wanted/needed to know. Thanks
-
Pedrosso
On a separate curiosity, I've been wondering from a previous conversation if it'd be possible (and, if possible, whether it should be done) to get all the failed imgur outlinks from the logs of AB projects and run those through the imgur warrior.
-
pabs
yes, you "just" need to download all the AB logs from IA, parse them, upload the lists and submit to #imgone
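(The parsing step pabs describes could look roughly like this. The log-line shape below is an assumption for illustration, not ArchiveBot's actual format, so the regex would need adjusting against real logs.)

```python
import re

# Hypothetical ArchiveBot-style log line: a status code somewhere before
# the URL. The real format may differ; adjust the pattern against actual
# downloaded logs.
LOG_LINE = re.compile(r'\b(\d{3})\b.*?(https?://(?:i\.)?imgur\.com/\S+)')

def failed_imgur_urls(lines):
    """Yield imgur URLs from log lines whose recorded status is 429."""
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group(1) == '429':
            yield m.group(2)

sample = [
    '12:00:01 429 https://i.imgur.com/abc1234.jpg',
    '12:00:02 200 https://i.imgur.com/def5678.jpg',
]
print(list(failed_imgur_urls(sample)))
```

The resulting lists could then be uploaded and queued into #imgone as described.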
-
pabs
and maybe make a service for that, since other projects will want some processing too
-
nicolas17
pabs: are warcs public for imgur? for many projects they aren't :(
-
pabs
sounded like Pedrosso was talking about warcs for AB not #imgone?
-
Pedrosso
I was, I was
-
» pabs not sure about imgur warcs tho
-
nicolas17
ah
-
Pedrosso
Also, pabs, what exactly do you mean by making a service for that?
-
thuban
nicolas17: they're both public
-
pabs
Pedrosso: as in a server with some code that does this all day long, and lets people add processing and flows. ie if AB finds a wiki, it should go to #wikibot
-
pabs
so the service would parse the warcs and connect that link
-
Pedrosso
That sounds like a good idea. Though I personally don't have enough knowledge or experience here to begin to think about executing that
-
JAA
There is a tool for WARC extraction, although that would have slightly different results than log parsing.
-
JAA
s/extraction/scraping/ I guess, extracting links that appear in WARCs.
-
Pedrosso29
Sry bout the disconnect/reconnect, if it shows
-
pabs
I think this was less about scraping the HTML in WARCs and more about sending the 429ed imgur requests from AB to #imgone
-
Pedrosso29
^
-
JAA
Yeah, they're not equivalent.
-
JAA
WARC scraping would produce more results but also requires munching more data.
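(The scraping half of that could be sketched as below. Only the link-extraction step is shown, stdlib-only; actually iterating response records out of a WARC would normally use a WARC-parsing library such as warcio, and the URL pattern here is a simplified assumption about imgur link shapes.)

```python
import re

# Simplified imgur URL shapes (page and direct-image links). A scraping
# pass would apply this to every decoded response body pulled from the
# WARC; that WARC-reading half is omitted to keep the sketch
# dependency-free.
IMGUR_RE = re.compile(
    rb'https?://(?:www\.|i\.|m\.)?imgur\.com/[A-Za-z0-9/._-]+'
)

def scrape_imgur_links(body: bytes) -> set:
    """Return the distinct imgur URLs found in one response body."""
    return set(IMGUR_RE.findall(body))

html = (b'<a href="https://imgur.com/a/abc123">album</a> '
        b'<img src="https://i.imgur.com/xyz789.jpg">')
print(sorted(scrape_imgur_links(html)))
```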
-
pabs
but really, both could be useful. indeed, tons more data for scraping though
-
Pedrosso29
The former I suppose would be more specific to what I originally asked, the latter would be far more general and fit with the service idea
-
pabs
could do scraping only for the AB jobs without offsite links
-
pabs
anyway. it's good to start simple though and work up from there, so manually do this, then hackily automate parts, then betterise the automation, then package it into a service
-
thuban
it's a nice thought, but it would duplicate some of the logic for cross-project dispatch we do already and i'm not sure what the best strategy for eventually rationalizing that would be
-
thuban
s/dispatch/backfeed/
-
pabs
are there any docs for that? I hadn't heard of any cross-project dispatch yet
-
JAA
#// dispatches to Telegram and (soon?) Imgur.
-
nicolas17
pabs: #// already sends telegram links to #telegrab
-
nicolas17
how that works behind the scenes, I don't know
-
pabs
ah, interesting...
-
JAA
No Imgur yet. arkiver, here's a reminder. ;-)
-
nicolas17
oh ew
-
nicolas17
I expected something server side rather than the worker for one project submitting into another
-
thuban
the logical thing might be to have a central url clearinghouse that identified all specially-handled urls and forwarded them to the appropriate projects (and either sent the rest to #// or, possibly configurably, dropped them as might be more appropriate for archivebot)
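(A minimal sketch of that pattern-based clearinghouse idea. The channel names come from this conversation; the routing table and matching logic are illustrative assumptions, not an existing implementation.)

```python
import re

# Hypothetical routing table: URL patterns mapped to the project that
# specially handles them. Real deployment would need many more entries.
ROUTES = [
    (re.compile(r'https?://(?:[a-z]+\.)?imgur\.com/'), '#imgone'),
    (re.compile(r'https?://(?:t|telegram)\.me/'), '#telegrab'),
    (re.compile(r'https?://(?:www\.)?mediafire\.com/'), '#mediaonfire'),
]

def dispatch(url, default='#//'):
    """Route a discovered URL to its special-handling project, else fall
    through to the default (pass default=None to drop unmatched URLs, as
    might be more appropriate for ArchiveBot)."""
    for pattern, channel in ROUTES:
        if pattern.search(url):
            return channel
    return default

print(dispatch('https://i.imgur.com/abc.jpg'))         # prints: #imgone
print(dispatch('https://example.com/', default=None))  # prints: None
```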
-
pabs
yes
-
thuban
in practice all new projects send outlinks to #// anyway, so (if eg telegram links to mediafire or whatever) they do get to the appropriate projects eventually
-
Pedrosso
So a mediafire outlink from the AB will be sent to #// where it'll be sent to #mediaonfire?
-
JAA
Only DPoS projects send things to #//. AB does not.
-
Pedrosso
Ah, I see I see
-
thuban
right. and bundling that queueing with archival makes it not compose well with archivebot, plus it's a needless round-trip, plus it requires the code to actually opt in (when looking for an example i was surprised to find that apparently pastebin doesn't queue outlinks at all)
-
thuban
plus changes require #// worker updates to take effect (minor considering how most people run it, but still)
-
thuban
idk, i can think of some cases in which you really do need the original discovery context and not just the url (nitter/mastodon instances, blogs at custom domains). but i think all we actually do at present is url-pattern-based
-
thuban
s/discovery context/page structure/ (i can't actually think of any examples where you need the discovery context)
-
h2ibot
Tech234a edited List of websites excluded from the Wayback Machine/Partial exclusions (+52, Add early Apple Store):
wiki.archiveteam.org/?diff=51122&oldid=50493
-
h2ibot
Petchea edited Tumblr (+107, /* History */):
wiki.archiveteam.org/?diff=51123&oldid=51113
-
JAA
Indeed, but it's running through AB already.
-
JAA
Didn't realise it was part of G/O. Another one for the list, I guess.
-
h2ibot
JustAnotherArchivist edited Deathwatch (+184, /* 2023 */ Add Jezebel):
wiki.archiveteam.org/?diff=51124&oldid=51121
-
Barto
pabs: poor TheTechRobo he may get the hug of death of HN :D
-
Barto
that's pretty moderate so far
-
TheTechRobo
1.7k now
-
arkiver
TheTechRobo: congrats on getting on front page :)
-
arkiver
very nice tool as well!
-
arkiver
JAA: whoops
-
arkiver
thanks for the reminder
-
TheTechRobo
arkiver: :D
-
ScenarioPlanet
Pedrosso pokechu22 ^
-
ScenarioPlanet
error: Hello there
-
error
howdy
-
ScenarioPlanet
Also I think you should change your nickname (with /nick new_nickname) or you'll get pinged every time someone uses the "error" word
-
error
fair lol
-
redlattice
changed
-
tomodachi94
Does anyone know if Fextralife (fextralife.com) has been grabbed ever? Specifically curious about their wikis, which seem like a goldmine
-
tomodachi94
(Wiki page already created for those interested)
-
Pedrosso
I don't see anything on archive.org/search?query=originalurl%3A%28%2Afextralife%2A%29 however, idk if there's possibly another way of searching for it
-
Pedrosso
If it's really a goldmine of wikis, maybe move this to #wikiteam?
-
Pedrosso
Disregard that last statement, as after all it's the entire website
-
Pedrosso
tomodachi94: I believe #archivebot automatically moves wikis to #wikibot when it discovers them, so I'd suggest you repeat this in #archivebot so that an admin can submit it
-
Pedrosso
do ask them if it does move them automatically as I don't know
-
thuban
archivebot is a self-contained system and doesn't submit anything to any other tooling
-
Pedrosso
Strange, when I had asked AB to archive a website with a wiki in it, it sent it there. Perhaps I misinterpreted it
-
mossssss
does anyone know if blogger/blogspot is in the warrior?
-
mossssss
or if there's even an initiative to archive it?
-
JAA
Pedrosso: The originalurl search only works for wikis specifically dumped by WikiTeam tooling. Basically nobody else sets that metadata field. Certainly not AB.
-
JAA
And no, AB does not submit anything elsewhere. That was done manually.
-
JAA
mossssss: It isn't yet, but we're aware of the situation. Unfortunately, it doesn't seem to be possible to enumerate the blogs or similar.
-
JAA
Looks like that (Google inactive accounts etc.) was never added to Deathwatch though.
-
mossssss
oh no!!! thats so frustrating that there's no way to do it
-
Pedrosso
Very frustrating indeed.
-
mossssss
i'm stressed because i know there's so much stuff on there that is totally going to be lost
-
fireonlive
:(
-
fireonlive
google is really on a 'HOW much are we storing???' kick lately
-
Pedrosso
Could you specify?
-
JAA
If there's anything you particularly care about, feel free to ask in #archivebot about archiving it. Blogger blogs work fairly well (except for some pagination mess and the 'dynamic view' script hells).
-
h2ibot
JustAnotherArchivist edited Deathwatch (+635, /* 2023 */ Add Google's inactive accounts purge):
wiki.archiveteam.org/?diff=51126&oldid=51124
-
mossssss
this is perhaps a bit backwards, but is there a way to do it through individual Blogger profiles? probably half of profiles aren't visible, but the ones that are usually have 1-3 blogs on them
-
JAA
The profile IDs are much too large to be bruteforced, and IIRC there's quite a bit of rate limiting on the profile pages.
-
mossssss
yeah - that makes sense. its just the only half-plausible solution i can come up with lol
-
Pedrosso
JAA: I've grabbed the wiki links from tomodachi94's suggested website.
transfer.archivete.am/ErUPC/wikilinks.txt
-
Ryz
I mentioned the Blogger thing multiple times 2-3 months ago...
-
JAA
Yes, it was discussed extensively in May.
-
JAA
But since we have no way of discovering blogs, really...
-
Ryz
Not even Blogger ID numeration?
-
Ryz
*ID number
-
Ryz
Even if it's rate limited?
-
thuban
on a related note, anyone know whether blog names or user ids can be extracted from blogspot image cdn urls? parts don't look entirely random, but i'm not sure
-
Ryz
Yeah, it's one of the reasons why I became slightly to somewhat more inactive in ArchiveBot
-
Ryz
There seems to be an implicit feeling that Blogger may be deemed less important than YouTube or other stuff
-
Ryz
Even though it uses less space than something video related
-
thuban
i actually had no idea about this--must have missed the discussion
-
pabs
Barto, arkiver, TheTechRobo: oh, didn't think it would reach the front page :)
-
JAA
thuban: Yeah, that's why this should've been on Deathwatch from the start. :-|
-
Barto
pabs: muahaha
-
Barto
congratz
-
Pedrosso
congratz indeed
-
fireonlive
pabs raking in that HN karma :p
-
pabs
re blogger, a while back I found you can scrape front pages for profile links, scrape front page links from profiles, and you get a probably ever-expanding list
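(That mutual blog/profile expansion could be sketched like this. The fetch function is pluggable precisely because of the rate limits and captchas mentioned later; the URL patterns and the toy pages are illustrative assumptions.)

```python
import re

BLOG_RE = re.compile(r'https?://[a-z0-9-]+\.blogspot\.com')
PROFILE_RE = re.compile(r'https?://www\.blogger\.com/profile/\d+')

def enumerate_blogs(seed_blogs, fetch, max_pages=1000):
    """Alternate between blog front pages (for profile links) and
    profiles (for blog links). `fetch(url) -> str` is injected so
    throttling/captcha handling can live outside this loop."""
    blogs, profiles = set(seed_blogs), set()
    frontier = list(seed_blogs)
    while frontier and max_pages:
        max_pages -= 1
        page = fetch(frontier.pop())
        for url in PROFILE_RE.findall(page):
            if url not in profiles:
                profiles.add(url)
                frontier.append(url)
        for url in BLOG_RE.findall(page):
            if url not in blogs:
                blogs.add(url)
                frontier.append(url)
    return blogs

# Toy fetcher standing in for real HTTP requests:
fake_pages = {
    'https://alice.blogspot.com': 'see https://www.blogger.com/profile/123',
    'https://www.blogger.com/profile/123':
        'blogs: https://alice.blogspot.com https://bob.blogspot.com',
    'https://bob.blogspot.com': 'no links here',
}
found = enumerate_blogs(['https://alice.blogspot.com'],
                        lambda u: fake_pages.get(u, ''))
print(sorted(found))
```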
-
Ryz
I'm not even sure even adding it on Deathwatch when it was announced would help
-
Pedrosso
I don't imagine it'd be complete but quite extensive
-
mossssss
it would be nice to try
-
Pedrosso
It would
-
JAA
Ryz: It does help. It wasn't really on my radar anymore until some people brought it up again a couple days ago (on Reddit and via email).
-
» JAA summons the arkiver.
-
fireonlive
it is time.
-
fireonlive
arkiver
-
pabs
my hacky script for blogger/blogspot enumeration:
transfer.archivete.am/RAiXa/archive-blogspot.sh
-
pabs
(note the captchas you get really hamper the process)
-
fireonlive
anyone here work at google? :p
-
Pedrosso
ooh, never mind pabs: that (the script) does look like it'd be complete
-
Ryz
I have so many Blogspot websites to process too
-
pabs
sorry for the traffic bump TheTechRobo :)
-
mossssss
same - i may just send them in the other channel if i need to
-
» pabs reached out to a Google person he knows
-
pabs
(not in the right dept tho)
-
fireonlive
🤞
-
katia
fireonlive, you can tell if they say (opinions my own)
-
fireonlive
haha
-
fireonlive
true true
-
TheTechRobo
pabs: All good! :-)
-
h2ibot
PaulWise edited Blogger (+294, add second strategy):
wiki.archiveteam.org/?diff=51127&oldid=47348
-
h2ibot
PaulWise edited Blogger (+116, add list of blogs found with the second strategy):
wiki.archiveteam.org/?diff=51128&oldid=51127
-
thuban
here's a list of 144 blogs extracted from my irc logs (excluding #archivebot but not other archiveteam channels):
transfer.archivete.am/2sEI9/blogspot_blogs_from_irc_logs.txt
-
thuban
(some of these are from topics of channels i scanned during the freenode implosion--i had totally forgotten about that)
-
mossssss
does this mean we might be able to do it?? (I would be SO relieved lol - even some is better than none)
-
h2ibot
Tomodachi94 created Fextralife (+458, Create page):
wiki.archiveteam.org/?title=Fextralife
-
JAA
One potential concern is that many blogs will not be at risk, and I guess we don't have a good way of identifying which ones are.
-
mossssss
yeah - i know it's any google account that hasn't been touched in 2 years - but that doesn't necessarily mean that the blogs are representative of the accounts
-
JAA
Any blog with a post in the past 2 years would *probably* be fine, but scheduled posts are a thing, so it's not reliable.
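(That heuristic reduces to a simple cutoff check, sketched below; as noted, scheduled posts can make an abandoned account look active, so it can only ever be a rough filter.)

```python
from datetime import datetime, timedelta, timezone

def probably_at_risk(last_post, now, window_days=730):
    """Flag a blog whose newest post is older than roughly 2 years.
    Only a heuristic: scheduled posts mean a recent post does not prove
    the account itself is active."""
    return now - last_post > timedelta(days=window_days)

now = datetime(2023, 11, 21, tzinfo=timezone.utc)
print(probably_at_risk(datetime(2020, 5, 1, tzinfo=timezone.utc), now))  # True
print(probably_at_risk(datetime(2023, 1, 1, tzinfo=timezone.utc), now))  # False
```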
-
mossssss
omg i totally forgot about that...
-
mossssss90
not sure why it keeps disconnecting me lol so annoying