-
JAA
Everything accessible on the Knowledge Adventure CDN and present as of my initial listing on 2023-06-14 or the relisting about 6 hours ago should now be archived.
-
JAA
betamax, nicolas17: ^
-
h2ibot
FireonLive edited Current Projects (-10, move Tiki to recently finished):
wiki.archiveteam.org/?diff=50050&oldid=50046
-
fireonlive
Visa to Acquire Pismo for US$ 1 billion in cash:
pismo.io/blog/visa-to-acquire-pismo
-
fireonlive
"Pismo will retain our founders and current management team. The transaction is subject to regulatory approvals and other customary closing conditions and is expected to close by the end of 2023.", website probably not super in danger i guess
-
flashfire42|m
Is there any way to monitor the offload of the targets? I think someone was saying a few were getting full or close to it
-
imer
flashfire42|m: nope, ideally targets run at near-full anyways to apply backpressure - if they were empty that just means IA can accept more data and we're archiving too slow ;)
-
flashfire42|m
Heh I mean yeah but there are some projects currently paused because we were grabbing too much data for IA to keep up
-
imer
yeah. not quite sure what the status there is. someone else would have to chime in on what is going to happen there, if anything
-
imer
could be a matter of waiting it out until things slow down naturally or there might be improvements on the IA/AT side so things can go faster
-
imer
it's a lot of data though, so none of it is easy, I can imagine
-
masterx244|m
IA is a common bottleneck, the S3 upload "loading bays" in particular are the limiting factor pretty often. AT can pull data out faster than it can be ingested there
-
betamax
JAA: that's amazing, thanks so much!
-
betamax
Would you be able to share your relisting from a day or so ago? My friend is working with others to reverse engineer the server for the game and having the full file listing would be very helpful
-
h2ibot
OrIdow6 edited Egloos (+649, Account of the grab):
wiki.archiveteam.org/?diff=50051&oldid=50043
-
OrIdow6
No reply from Wysp.ws
-
Hans5958
Are there archives of the leaderboards for past projects?
-
Chris5010
If you know the project name, you can use that in the normal tracker URL:
tracker.archiveteam.org/[projectName]. For example, the project for Enjin is done, but the leaderboard is still accessible:
tracker.archiveteam.org/enjin
-
h2ibot
Yts98 edited LINE BLOG (+139, Add link to data):
wiki.archiveteam.org/?diff=50052&oldid=49955
-
Hans5958
Where is the repo for (at least the front end of) tracker.archiveteam.org?
-
pokechu22
-
Hans5958
Really? I'd like to contribute some code, but it looks "dead"
-
h2ibot
Manu edited Deathwatch (+261, Stitcher will shut down end of August):
wiki.archiveteam.org/?diff=50053&oldid=50047
-
h2ibot
Noxian edited Tumblr (+0, /* See also */ latest version of TumblThree):
wiki.archiveteam.org/?diff=50054&oldid=49141
-
h2ibot
Hans5958 edited Egloos (-12, Little bit of rewording):
wiki.archiveteam.org/?diff=50055&oldid=50051
-
arkiver
egloos, tiki, and lineblog project are done!
-
arkiver
tracker front page is becoming less busy :P
-
yts98
arkiver: great! now I want to propose a warrior project for Xuite :p
github.com/yts98/xuite-grab
-
fireonlive
i read that as xtube which is both incorrect and also long gone (and already done) :c
-
threedeeitguy
tiki was fun. my first top 10 finish :D
-
fireonlive
haha yeah first where i was near the top :p
-
h2ibot
Yts98 edited Current Projects (+0, Move LINE BLOG to recently finished):
wiki.archiveteam.org/?diff=50059&oldid=50050
-
rktk
Just wanted to throw this out as a forum to archive:
memoriesoffear.jcink.net
-
rktk
They did a number of translated games, notably Toilet in Wonderland (which Vinny Vinesauce played on stream)
-
fireonlive
Hans5958: looks like that's the one yeah
-
rktk
-
h2ibot
Yts98 edited LINE BLOG (+1, Finish the project):
wiki.archiveteam.org/?diff=50060&oldid=50052
-
fireonlive
i imagine everyone is quite busy with a lot of other things (including things outside of archiveteam) so it's not as high priority as other stuff
-
fireonlive
yts98: :D
-
rktk
fireonlive, do you mean that forum I linked sorry, or replying to someone else
-
fireonlive
rktk: oh sorry, replying to Hans5958
-
rktk
If there is a recommended way of scraping a forum like that, I have no issue doing it myself
-
rktk
ah ok fireonlive :)
-
fireonlive
:)
-
fireonlive
regarding the
tracker.archiveteam.org codebase
-
pokechu22
rktk: Probably archivebot, but it's fairly full currently. That one should be pretty easy to run though since it's small
-
rktk
pokechu22, could I run an archivebot myself locally?
-
rktk
or should I just do a wget mirror
-
arkiver
yts98: why JSObj?
-
pokechu22
ArchiveBot isn't designed to be run locally,
github.com/ArchiveTeam/grab-site is the more usable equivalent
-
pokechu22
There's also a forum-dl project or something like that that might be usable
-
yts98
arkiver: to deal with JS objects embedded in the HTML.
-
pokechu22
wget's also fine, but wouldn't end up on web.archive.org (though anything a random person does probably wouldn't end up there)
-
pokechu22
Looks like they also have mediafire links so those will need to be put into #mediaonfire
-
yts98
I found that simply replacing single quotes with double quotes may still cause errors
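The pitfall yts98 describes can be shown in a few lines (Python here for illustration; the project's grab scripts are Lua). The sample object is invented, not taken from the real site:

```python
import ast
import json

# Hypothetical JS object literal embedded in HTML, like those on Xuite pages.
js_obj = "{'title': \"Jack's album\", 'count': 3}"

# Naive fix: swap every single quote for a double quote.
naive = js_obj.replace("'", '"')
# The apostrophe inside "Jack's" also got swapped, so the result is invalid JSON.
try:
    json.loads(naive)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# When the literal happens to be Python-compatible, ast.literal_eval can parse
# it directly -- though real JS objects (true/false/null, unquoted keys) would
# still need a proper parser.
parsed = ast.literal_eval(js_obj)
```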
-
arkiver
yts98: on the item types, can you please make them a bit more descriptive?
-
pokechu22
Looks like there's actually a lot of forums under jcink.net, so that's something to check later
-
arkiver
yts98: looks pretty good!
-
rktk
pokechu22, yeah this is just a random personal grab. and i could save to warc, mainly just as a means of throwing it on archive as an object, rather than web archive
-
rktk
pokechu22, yeah definitely something worth looking at
-
yts98
I chose very short item type names because the wiki said "Because the Tracker uses Redis as its database, memory usage is a concern."
-
arkiver
let's make a channel for xuite! i'm not sure if this word has a meaning, but perhaps we can make a play on words in its language
-
arkiver
yts98: ah. well lists are mostly offloaded, so not a huge concern now
-
yts98
arkiver: watch this video.
-
yts98
-
threedeeitguy
There's a small website that I wish to regularly save a few pages from (usually 1-2 pages a day). The prompt to save the page would be an email notification from said site. I already have extracting the link sorted. Is there an API equivalent of
web.archive.org/save ? Saving the page is fairly time critical as once items are sold the page is
-
threedeeitguy
updated and information is removed.
-
pokechu22
rktk: I've started an archivebot job anyways, shouldn't take too long
-
yts98
Xuite's slogan is "My Xuite, So Sweet~"
-
rktk
hurray! pokechu22
-
arkiver
yts98: i see some stuff there like TODOs on handling malformed JSON responses
-
rktk
someone should save digitalfaq before all the scam evidence is wiped away
-
pokechu22
threedeeitguy: Pretty sure web.archive.org/save can be treated as an API endpoint, I remember seeing some docs on that, one sec
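A minimal sketch of calling that endpoint programmatically. The request shape (POST to /save with a `LOW access:secret` Authorization header and `Accept: application/json`) follows IA's Save Page Now 2 docs, but check the current docs before relying on it; the URL and keys below are placeholders. The request is only built, not sent:

```python
from urllib import parse, request

def build_spn_request(url, access_key, secret_key):
    """Build (but don't send) a Save Page Now 2 API request.

    Assumptions: the SPN2 endpoint is https://web.archive.org/save and the
    keys are S3-style keys from archive.org account settings.
    """
    data = parse.urlencode({"url": url}).encode("utf-8")
    req = request.Request("https://web.archive.org/save", data=data, method="POST")
    req.add_header("Accept", "application/json")       # ask for a JSON job receipt
    req.add_header("Authorization", f"LOW {access_key}:{secret_key}")
    return req

# Hypothetical page to snapshot before its listing is updated:
req = build_spn_request("https://example.com/listing/123", "MYKEY", "MYSECRET")
print(req.full_url, req.get_method())
```

Sending it would then be `urllib.request.urlopen(req)`, ideally with retry handling since SPN rate-limits aggressively.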
-
pokechu22
digitalfaq?
-
arkiver
those malformed responses should be caught in write_to_warc, then not be written to WARC, and either be marked for retrying, or the item should be aborted. or, in rare cases, not written to WARC and the item allowed to continue as usual, if this is an 'error' that is fine
-
yts98
arkiver: their API sometimes mixes cp950 with utf8
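One way to cope with that, sketched in Python (the project's Lua code would differ): try UTF-8 first and fall back to cp950 (Big5). If both encodings appear inside a single payload, chunk-level handling would be needed instead.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode API bytes that may be UTF-8 or cp950 (Big5).

    Sketch only: tries strict UTF-8, then falls back to cp950 with
    replacement for undecodable bytes.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp950", errors="replace")

# "中文" encoded as Big5 rather than UTF-8:
big5_bytes = b"\xa4\xa4\xa4\xe5"
print(decode_lenient(big5_bytes))  # 中文
```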
-
pokechu22
-
arkiver
right, i see. so the error is on our side, not on theirs?
-
arkiver
yts98: ^
-
rktk
pokechu22, digitalfaq.com
-
pokechu22
What's the deal with scam evidence?
-
pokechu22
Looks like it was previously saved August 2022:
archive.fart.website/archivebot/viewer/job/4ialw
-
pokechu22
err, no, those are small enough that saving it probably failed
-
yts98
arkiver: yes. the error is caused in JSON.lua.
-
threedeeitguy
pokechu22 thanks, I'll take a look. It may not be suitable anyway. I just tried a page and it's far from clean:
web.archive.org/web/20230629161553/…-big-boy-4-8-8-4-stock-code-11379/#
-
arkiver
yts98: i see there is still a chance of 'bad data' getting into the WARC, for example I see a check on json["ok"] in get_urls. at this point the data is already in the WARC, which it should not be if there is an indication of an error
-
JAA
betamax: Yeah, everything will be on IA once the upload finishes.
-
arkiver
so this json["ok"] check should be in write_to_warc, and then again the item either retried or aborted (or accepted in rare cases) if the error is there
-
arkiver
there may be other checks in get_urls that should move to write_to_warc
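The policy arkiver outlines could look roughly like this (Python sketch; the real callbacks are Lua in the grab script, and the `reason` values here are invented for illustration):

```python
import json

RETRY, ABORT, WRITE = "retry", "abort-item", "write"

def classify_response(body: str) -> str:
    """Decide in write_to_warc what to do with a response.

    Sketch of the policy from the chat: malformed JSON is retried,
    expected errors (password-protected article, service not activated)
    are written to the WARC, unexpected failures abort the item.
    The 'reason' field is hypothetical, not Xuite's real API shape.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return RETRY  # malformed JSON: don't write it, fetch again
    if data.get("ok") is False:
        if data.get("reason") in {"password", "not-activated"}:
            return WRITE  # expected 'error', fine to preserve
        return ABORT      # unexpected failure: drop the item
    return WRITE

print(classify_response("not json"))  # retry
```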
-
yts98
arkiver: json["ok"] being false is not rare. It happens when an article is password-protected, or a user did not activate one of the blog, album, or vlog services.
-
arkiver
alright good
-
yts98
and then I saw thousands of usernames discovered, but the API responds with "no such user" for them.
-
yts98
their username search API even returns illegal usernames, possibly manually altered by the moderator to deactivate some accounts
-
arkiver
interesting
-
arkiver
so
-
arkiver
on images
-
arkiver
photo.xuite.net, and such
-
arkiver
can different items lead to the same images? can they be duplicated between items? i see they are now generally always accepted for immediate archiving
-
yts98
I queue some image URLs found in API responses of user items, but some of these images belong to an album, so the current script will grab them twice or more.
-
arkiver
are the URLs for a single image unique?
-
arkiver
as in, is it always 3.example.com/image.png, or can there also be 2.example.com/image.png, 3.example.com/image?format=png, etc.?
-
arkiver
I see the TODO about false positives. yes, this may produce false positives. but archiving is usually done with the thought of "better to discover too much than too little". so if we are sure everything will be discovered with very strict rules, then that is fine
-
yts98
for photo.xuite.net, the image URLs are unique;
-
yts98
when images are embedded in blog articles, the service possibly generates another URL that accepts outlinks
-
arkiver
but it is often good to keep the rules somewhat relaxed, allowing for the possibility of false positives. we can eliminate those false positives if we find them. and that way perhaps extract/archive more than we initially were under the impression was actually there
-
arkiver
yts98: "another URL that accepts outlinks" - for an image? what do you mean?
-
arkiver
yts98: on the video URLs and load balancing. can video URLs to the same video be found in different items? as in, can there be duplicates? (same as what i asked for the photos)
-
arkiver
if a certain video will _only_ be discovered from a single item, then good! and then let's get whatever load balancers they use, Wget-AT will prevent writing duplicate data, while still preserving the URLs.
-
arkiver
there will only be duplicate data downloaded on the side of the Warrior, but this extra data will be deduplicated away when written to the WARC. if xuite can handle it, then it's good to get this duplicate data.
-
arkiver
because this is not only about purely data preservation, but also about URL preservation. we want to try and cover the entire range of possible URLs, so that those can be found through the Wayback Machine.
-
arkiver
so. let's say we have 1.example.com/image.png and 2.example.com/image.png both pointing to the same image. we download them _in the same Wget-AT session_, then they will be deduplicated, while both their URLs are preserved (yes, data will be downloaded twice)
-
arkiver
if we have separate items for those two URLs to the same image, then it is likely that those separate items end up in different Wget-AT sessions, and are not deduplicated, which wastes bytes
-
arkiver
if we're talking about 1 TB or so of duplicated data, that is not a big problem. but if it turns into 10 TB or 100 TB of duplicated data, that is a problem
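A toy model of the deduplication being described: within one session, the first copy of a payload is stored in full, and later identical payloads (under any URL) become small revisit records pointing at the first. This is a simplification of what Wget-AT actually writes, for illustration only:

```python
import hashlib

def write_records(responses):
    """Model WARC payload-digest deduplication within one session.

    responses: list of (url, payload_bytes). The first payload with a given
    digest becomes a 'response' record; identical later payloads become
    'revisit' records referencing the first URL, so both URLs are preserved
    while the bytes are stored once.
    """
    seen = {}      # sha1 payload digest -> URL of the first stored copy
    records = []
    for url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            records.append(("revisit", url, seen[digest]))
        else:
            seen[digest] = url
            records.append(("response", url, None))
    return records

recs = write_records([
    ("http://1.example.com/image.png", b"PNGDATA"),
    ("http://2.example.com/image.png", b"PNGDATA"),  # same bytes, new URL
])
print(recs)
```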
-
arkiver
yts98: i see you store data in _data.txt, what is the use of this? we're actually not really using data.txt anymore. in the past data.txt was used to discover items, but nowadays we use backfeed for that.
-
arkiver
there is nothing on the targets currently that will do anything with the _data.txt file.
-
yts98
I don't remember in which article I saw image URL formats other than 1.share.photo.xuite.net.
-
yts98
Separating images into new items is a reasonable approach. Let's handle them like cdn-obs in lineblog.
-
yts98
Video URLs may also be checked in user items. But they may expire if we backfeed them as items.
-
yts98
I thought WARC revisits could only be used on the same URL. So a WARC revisit applies to different URLs when the response body is identical?
-
arkiver
yes, on the response body being identical
-
arkiver
i see on expiring video URLs. are the video URLs you get through a user item actually used for playback? or are they "just there" in some data blob, while actually only the video URL on the post page is used for playback?
-
arkiver
on FlashVars rules - those are not known yet?
-
arkiver
yts98: well overall looks pretty good, i'll be further checking this later!
-
yts98
the purpose of data.txt is to inspect the metadata not included in item names, including blog_id and every <embed>.
-
yts98
I've discovered 5 types of FlashVars rules
wiki.archiveteam.org/index.php/Xuite#Flash-based_creations , but I'm not sure if I missed more.
-
yts98
arkiver: thanks for taking a look! I learned very much about archiving practices :)
-
arkiver
good to hear :)
-
arkiver
alright i'm not sure yet about data.txt, will be having a better look later!
-
arkiver
(i only actually looked at the code - not the site yet)
-
yts98
a possible alternative to data.txt is to create a dummy backfeed that does not actually backfeed the items into the project.
-
arkiver
that sounds better yes
-
arkiver
but i'm not sure if we actually need it, need to do some experiments as well
-
arkiver
if there is something unexpected, can the item simply be aborted?
-
arkiver
i see for example that when an a: item is queued, it is always written to the data.txt as well, that is not needed i think?
-
fireonlive
gettyimages acquired unsplash earlier in 2021:
unsplash.com/blog/unsplash-getty and looks like they’re jumping on the “oh fuck AI is going to ruin us” bandwagon way too late
twitter.com/sindresorhus/status/1674390882399801345
-
fireonlive
not sure what he means by “removed their free non-API endpoint” though
-
arkiver
yts98: i see very explicit extraction of certain URLs, also from the HTML, line 1096 for example. i think this is already handled by the 'general' URL extraction happening at line 1966? if not, that might be a better place
-
JAA
Next AT project: archive everything that has a free API.
-
arkiver
this is again coming from the point of "better extract too much than too little" - if we only allow extraction of very specific URLs in very specific places, there is a great risk of missing something.
-
arkiver
hmm
-
arkiver
or, is this being extracted specifically here to have a certain referer be different than the current URLs we're working on?
-
arkiver
in which case it would be good. later it'd be picked up in the 'general' extraction code, but not queued since it was queued before
-
arkiver
current URL*
-
fireonlive
JAA: yeeeeah :|
-
fireonlive
🙃 🔫
-
fireonlive
they said AI/ML would destroy the internet
-
fireonlive
i just didn't think it would be in this way
-
pokechu22
tinaja.com looks kinda big so I'm not going to put it into archivebot until we have a little bit more space
-
arkiver
let's see
-
arkiver
interesting site
-
that_lurker
seems to have a lot of pdf's so might be big
-
arkiver
pokechu22: shall we put it in archivebot anyway?
-
vokunal|m
I was about to ask what you look for to determine whether it looks big or not. At first glance I figured it looks like it's from the 90s, so small
-
pokechu22
Currently all the AB pipelines are full because hel3/hel4 are low on disk space because of the general upload backlog to my understanding
-
pokechu22
Probably we could still queue it though
-
that_lurker
actually those pdf's are not that big so might be something like 50 - 60 gigs at tops
-
that_lurker
could be good to queue it as you can just pause it in the event that there is no space right?
-
pokechu22
Alright, queued it
-
pokechu22
It'll auto-pause when there's no space (< 5 GB I think)
-
that_lurker
LUL it was already started apparently :P
-
arkiver
pokechu22: general upload backlog to where?
-
arkiver
is IA the bottleneck?
-
pokechu22
I think so?
-
pokechu22
JAA talked more about it I think
-
pokechu22
main thing is that if you look at
archivebot.com/pipelines most machines are full
-
arkiver
we need an "ArchiveBot talk" channel
-
JAA
arkiver: #down-the-tube and AB used the same rsync target. The former clogged it.
-
arkiver
ah
-
arkiver
JAA: how about that archivebot talk channel?
-
JAA
That comes up every few months or so. It'd be mostly a dead channel probably.
-
arkiver
i usually miss messages someone posts to me in #archivebot
-
arkiver
oh well
-
arkiver
warning to all ^ if I need to really notice the message, don't write to me in #archivebot
-
JAA
Make your client log highlights into a separate window. :-)
-
pokechu22
Relevant messages are at 03:47:37 UTC on June 29
-
vokunal|m
This is what I've been using to check. Is this known as a good way to see if they're clogged?
monitor.archive.org/weathermap/weathermap.html
-
pokechu22
I don't think the rsync targets would be on there as they're archiveteam infrastructure, but I'm not 100% sure of that
-
vokunal|m
the switchtc0-200paul has been in the red for around 30+ hours
-
that_lurker
JAA: That is the one thing from znc I would like to have in thelounge
-
arkiver
JAA: that would be something i need to figure out, and i'm not doing that now
-
fireonlive
did someone say that archive.org had an issue with (or intentionally?) limited inbound speed?
-
arkiver
vokunal|m: no, there can be many reasons
-
fireonlive
that was oof a while ago though
-
pokechu22
Oh, it was also mentioned that
yarus.ru was shutting down shortly per
yarus.ru/post/1989728469 - there's an AB job for it, but there's basically no chance it'll finish completely :|
-
pokechu22
ugh, it looks like that site's also JS-based so AB's not going to get anything useful :| (and I think I pushed it too hard and am now getting 403s :|)
-
that_lurker
no wonder google translate did not work on it :P
-
vokunal|m
Yeah I was wondering why it wasn't working
-
that_lurker
Oh and just found out The Lounge has a recent mentions feature
-
that_lurker
thats convenient
-
fireonlive
indeed! the @ symbol
-
arkiver
pokechu22: checking
-
arkiver
pokechu22: are you planning to pull tinaja.com through AB later?
-
pokechu22
It turns out it was already running in AB since yesterday
-
arkiver
oof just seeing yarus in my browser with that loading screen... oof oof
-
arkiver
what
-
arkiver
June 30?
-
arkiver
not again
-
pokechu22
Several hours ago it was 18 hours
-
pokechu22
frankly I think it's not possible to get it done
-
pokechu22
It does have a complete sitemap though
-
arkiver
they posted the message you linked today?
-
arkiver
for a shutdown tomorrow?
-
pokechu22
nyuuzyou: ^
-
pokechu22
It seems like that's the case though
-
arkiver
rewby: are you around?
-
arkiver
i'm not sure if we can get a project up in time
-
arkiver
but we might need a target for a shutdown tomorrow... announced today :(
-
rewby|backup
I'll get you a target if you get a tracker proj and vars in... 30 mins
-
arkiver
woah sequential post IDs?
-
arkiver
i like it
-
pokechu22
"У вас будет время сохранить весь свой контент" - "You will have time to save all your content." yeah, sure...
-
pokechu22
Sequential IDs and a full sitemap as far as I can tell
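Sequential IDs make item generation trivial: chunk the ID space into ranges and hand each range to the tracker as one item. A sketch, with an item-name format made up for illustration (gaps in the ID space just cost cheap 404s, so nothing needs to be skipped):

```python
def id_range_items(first_id, last_id, chunk=50):
    """Yield tracker item names covering a sequential post-ID space.

    Each item covers up to `chunk` IDs. The 'post-range:' prefix is a
    hypothetical naming convention, not the real project's item format.
    """
    for start in range(first_id, last_id + 1, chunk):
        end = min(start + chunk - 1, last_id)
        yield f"post-range:{start}-{end}"

items = list(id_range_items(1, 120, chunk=50))
print(items)  # ['post-range:1-50', 'post-range:51-100', 'post-range:101-120']
```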
-
pokechu22
but on the other hand, javascript
-
JAA
<dr_evil_air_quotes.gif>
-
arkiver
i'm always skeptical about sitemaps
-
arkiver
rewby|backup: alright
-
imer
they seem to have a rate limit (on api. at least), returns a standard nginx 403
-
imer
and now that's changed to another 403 page
-
arkiver
imer: proper status code?
-
imer
yep
-
imer
403
-
arkiver
good
-
imer
-
imer
i've censored my ip with XXX
-
pokechu22
Archivebot is still getting 403s a while after con=6, d=0 (that wasn't using the API and in fact wasn't even trying to retrieve stuff from the API, though)
-
fireonlive
ok everyone gather around for a picture
-
fireonlive
an api actually used a proper http status code
-
imer
the block doesn't seem to be shared across domains, but obviously the site won't work
-
fireonlive
we need to remember this moment
-
arkiver
interesting
-
imer
i'll keep checking if I get unblocked
-
arkiver
IDs sequential with a huge sudden gap
-
imer
-
imer
no ipv6 (why do I even bother checking this)
-
fireonlive
one day you'll be rewarded
-
fireonlive
it's like finding a rare coin
-
fireonlive
-
fireonlive
lol
-
imer
do we have a channel name yet? i'll throw into the hat #norus if not
-
fireonlive
nop
-
imer
words i can arrange sentence to
-
fireonlive
mine was #yaaaaaaaaasus but that's kinda gay
-
fireonlive
:p
-
arkiver
imer: see what i wrote earlier ;)
-
fireonlive
also not punny enough
-
arkiver
#norus it is
-
fireonlive
arkiver: you were in the tiki channel
-
fireonlive
:D
-
arkiver
HEY EVERYONE! JAA is not in #norus , let's party there. no one tell JAA please!!
-
h2ibot
JustAnotherArchivist created ЯRUS (+194, Created page with "{{Infobox project | URL =…):
wiki.archiveteam.org/?title=%D0%AFRUS
-
h2ibot
JustAnotherArchivist created Yarus.ru (+19, Redirected page to [[ЯRUS]]):
wiki.archiveteam.org/?title=Yarus.ru
-
h2ibot
Pcr edited List of websites excluded from the Wayback Machine (+26, Add TH3D):
wiki.archiveteam.org/?diff=50063&oldid=49985
-
fireonlive
:D
-
thuban
arkiver: wrt noise in #archivebot, if you use weechat, there are some filters at
wiki.archiveteam.org/index.php/User:Switchnode