-
JAA
Everything accessible on the Knowledge Adventure CDN and present as of my initial listing on 2023-06-14 or the relisting about 6 hours ago should now be archived.
-
JAA
betamax, nicolas17: ^
-
h2ibot
FireonLive edited Current Projects (-10, move Tiki to recently finished):
wiki.archiveteam.org/?diff=50050&oldid=50046
-
fireonlive
Visa to Acquire Pismo for US$ 1 billion in cash:
pismo.io/blog/visa-to-acquire-pismo
-
fireonlive
"Pismo will retain our founders and current management team. The transaction is subject to regulatory approvals and other customary closing conditions and is expected to close by the end of 2023.", website probably not super in danger i guess
-
flashfire42|m
Is there any way to monitor the offload of the targets? I think someone was saying a few were getting full or close to it
-
imer
flashfire42|m: nope, ideally targets run at near-full anyways to apply backpressure - if they were empty that just means IA can accept more data and we're archiving too slow ;)
-
flashfire42|m
Heh I mean yeah but there are some projects currently paused because we were grabbing too much data for IA to keep up
-
imer
yeah. not quite sure what the status there is. someone else would have to chime in on what is going to happen there, if anything
-
imer
could be a matter of waiting it out until things slow down naturally or there might be improvements on the IA/AT side so things can go faster
-
imer
it's a lot of data though, so none of it is easy, I can imagine
-
masterx244|m
IA is a common bottleneck, the S3 upload "loading bays" in particular are the limiting factor pretty often. AT can pull data out faster than it can be ingested there
-
betamax
JAA: that's amazing, thanks so much!
-
betamax
Would you be able to share your relisting from a day or so ago? My friend is working with others to reverse engineer the server for the game and having the full file listing would be very helpful
-
h2ibot
OrIdow6 edited Egloos (+649, Account of the grab):
wiki.archiveteam.org/?diff=50051&oldid=50043
-
OrIdow6
No reply from Wysp.ws
-
Hans5958
Are there archives of the leaderboards for past projects?
-
Chris5010
If you know the project name, you can use that in the normal tracker URL:
tracker.archiveteam.org/[projectName]. For example, the project for Enjin is done, but the leaderboard is still accessible:
tracker.archiveteam.org/enjin
-
h2ibot
Yts98 edited LINE BLOG (+139, Add link to data):
wiki.archiveteam.org/?diff=50052&oldid=49955
-
Hans5958
Where is the repo for (at least the front end of) tracker.archiveteam.org?
-
pokechu22
-
Hans5958
Really? I'd like to contribute some code, but it looks "dead"
-
h2ibot
Manu edited Deathwatch (+261, Stitcher will shut down end of August):
wiki.archiveteam.org/?diff=50053&oldid=50047
-
h2ibot
Noxian edited Tumblr (+0, /* See also */ latest version of TumblThree):
wiki.archiveteam.org/?diff=50054&oldid=49141
-
h2ibot
Hans5958 edited Egloos (-12, Little bit of rewording):
wiki.archiveteam.org/?diff=50055&oldid=50051
-
arkiver
egloos, tiki, and lineblog project are done!
-
arkiver
tracker front page is becoming less busy :P
-
yts98
arkiver: great! now I want to propose a warrior project for Xuite :p
github.com/yts98/xuite-grab
-
fireonlive
i read that as xtube which is both incorrect and also long gone (and already done) :c
-
threedeeitguy
tiki was fun. my first top 10 finish :D
-
fireonlive
haha yeah first where i was near the top :p
-
h2ibot
Yts98 edited Current Projects (+0, Move LINE BLOG to recently finished):
wiki.archiveteam.org/?diff=50059&oldid=50050
-
rktk
Just wanted to throw this out as a forum to archive:
memoriesoffear.jcink.net
-
rktk
They did a number of translated games, notably Toilet in Wonderland (which Vinny Vinesauce played on stream)
-
fireonlive
Hans5958: looks like that's the one yeah
-
rktk
-
h2ibot
Yts98 edited LINE BLOG (+1, Finish the project):
wiki.archiveteam.org/?diff=50060&oldid=50052
-
fireonlive
i imagine everyone is quite busy with a lot of other things (including things outside of archiveteam) so it's not as high priority as other stuff
-
fireonlive
yts98: :D
-
rktk
fireonlive, do you mean that forum I linked sorry, or replying to someone else
-
fireonlive
rktk: oh sorry, replying to Hans5958
-
rktk
If there is a recommended way of scraping a forum like that, I have no issue doing it myself
-
rktk
ah ok fireonlive :)
-
fireonlive
:)
-
fireonlive
regarding the
tracker.archiveteam.org codebase
-
pokechu22
rktk: Probably archivebot, but it's fairly full currently. That one should be pretty easy to run though since it's small
-
rktk
pokechu22, could I run an archivebot myself locally?
-
rktk
or should I just do a wget mirror
-
arkiver
yts98: why JSObj?
-
pokechu22
ArchiveBot isn't designed to be run locally,
github.com/ArchiveTeam/grab-site is the more usable equivalent
-
pokechu22
There's also a forum-dl project or something like that that might be usable
-
yts98
arkiver: to deal with JS objects embedded in the HTML.
-
pokechu22
wget's also fine, but wouldn't end up on web.archive.org (though anything a random person does probably wouldn't end up there)
-
pokechu22
Looks like they also have mediafire links so those will need to be put into #mediaonfire
-
yts98
I found that simply replacing single quotes with double quotes may still cause errors
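The pitfall yts98 describes can be shown in a few lines (Python here for illustration; the project's grab scripts are Lua). The sample object is invented, not taken from the real site:

```python
import ast
import json

# Hypothetical JS object literal embedded in HTML, like those on Xuite pages.
js_obj = "{'title': \"Jack's album\", 'count': 3}"

# Naive fix: swap every single quote for a double quote.
naive = js_obj.replace("'", '"')
# The apostrophe inside "Jack's" also got swapped, so the result is invalid JSON.
try:
    json.loads(naive)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# When the literal happens to be Python-compatible, ast.literal_eval can parse
# it directly -- though real JS objects (true/false/null, unquoted keys) would
# still need a proper parser.
parsed = ast.literal_eval(js_obj)
```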
-
arkiver
yts98: on the item types, can you please make them a bit more descriptive?
-
pokechu22
Looks like there's actually a lot of forums under jcink.net, so that's something to check later
-
arkiver
yts98: looks pretty good!
-
rktk
pokechu22, yeah this is just a random personal grab. and i could save to warc, mainly just as a means of throwing it on archive as an object, rather than web archive
-
rktk
pokechu22, yeah definitely something worth looking at
-
yts98
I chose very short item type names because the wiki said "Because the Tracker uses Redis as its database, memory usage is a concern."
-
arkiver
let's make a channel for xuite! i'm not sure if this word has a meaning, but perhaps we can make a play on words in its language
-
arkiver
yts98: ah. well lists are mostly offloaded, so not a huge concern now
-
yts98
arkiver: watch this video.
-
yts98
-
threedeeitguy
There's a small website that I wish to regularly save a few pages from (usually 1-2 pages a day). The prompt to save the page would be an email notification from said site. I already have extracting the link sorted. Is there an API equivalent of
web.archive.org/save ? Saving the page is fairly time critical as once items are sold the page is
-
threedeeitguy
updated and information is removed.
-
pokechu22
rktk: I've started an archivebot job anyways, shouldn't take too long
-
yts98
Xuite's slogan is "My Xuite, So Sweet~"
-
rktk
hurray! pokechu22
-
arkiver
yts98: i see some stuff there like TODOs on handling malformed JSON responses
-
rktk
someone should save digitalfaq before all the scam evidence is wiped away
-
pokechu22
threedeeitguy: Pretty sure web.archive.org/save can be treated as an API endpoint, I remember seeing some docs on that, one sec
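A minimal sketch of calling that endpoint programmatically. The request shape (POST to /save with a `LOW access:secret` Authorization header and `Accept: application/json`) follows IA's Save Page Now 2 docs, but check the current docs before relying on it; the URL and keys below are placeholders. The request is only built, not sent:

```python
from urllib import parse, request

def build_spn_request(url, access_key, secret_key):
    """Build (but don't send) a Save Page Now 2 API request.

    Assumptions: the SPN2 endpoint is https://web.archive.org/save and the
    keys are S3-style keys from archive.org account settings.
    """
    data = parse.urlencode({"url": url}).encode("utf-8")
    req = request.Request("https://web.archive.org/save", data=data, method="POST")
    req.add_header("Accept", "application/json")       # ask for a JSON job receipt
    req.add_header("Authorization", f"LOW {access_key}:{secret_key}")
    return req

# Hypothetical page to snapshot before its listing is updated:
req = build_spn_request("https://example.com/listing/123", "MYKEY", "MYSECRET")
print(req.full_url, req.get_method())
```

Sending it would then be `urllib.request.urlopen(req)`, ideally with retry handling since SPN rate-limits aggressively.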
-
pokechu22
digitalfaq?
-
arkiver
those malformed responses should be caught in write_to_warc, then not be written to WARC, and either be marked for retrying, or the item should be aborted. or, in rare cases, not written to WARC and the item allowed to continue as usual, if this is an 'error' that is fine
-
yts98
arkiver: their API sometimes mixes cp950 with utf8
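One way to cope with that, sketched in Python (the project's Lua code would differ): try UTF-8 first and fall back to cp950 (Big5). If both encodings appear inside a single payload, chunk-level handling would be needed instead.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode API bytes that may be UTF-8 or cp950 (Big5).

    Sketch only: tries strict UTF-8, then falls back to cp950 with
    replacement for undecodable bytes.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp950", errors="replace")

# "中文" encoded as Big5 rather than UTF-8:
big5_bytes = b"\xa4\xa4\xa4\xe5"
print(decode_lenient(big5_bytes))  # 中文
```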
-
pokechu22
-
arkiver
right, i see. so the error is on our side, not on theirs?
-
arkiver
yts98: ^
-
rktk
pokechu22, digitalfaq.com
-
pokechu22
What's the deal with scam evidence?
-
pokechu22
Looks like it was previously saved August 2022:
archive.fart.website/archivebot/viewer/job/4ialw
-
pokechu22
err, no, those are small enough that saving it probably failed
-
yts98
arkiver: yes. the error is caused in JSON.lua.
-
threedeeitguy
pokechu22 thanks, I'll take a look. It may not be suitable anyway. I just tried a page and it's far from clean:
web.archive.org/web/20230629161553/…-big-boy-4-8-8-4-stock-code-11379/#
-
arkiver
yts98: i see there is still a chance of 'bad data' getting into the WARC, for example I see a check on json["ok"] in get_urls. at this point the data is already in the WARC, which it should not be if there is an indication of an error
-
JAA
betamax: Yeah, everything will be on IA once the upload finishes.
-
arkiver
so this json["ok"] check should be in write_to_warc, and then again the item either retried or aborted (or accepted in rare cases) if the error is there
-
arkiver
there may be other checks in get_urls that should move to write_to_warc
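The policy arkiver outlines could look roughly like this (Python sketch; the real callbacks are Lua in the grab script, and the `reason` values here are invented for illustration):

```python
import json

RETRY, ABORT, WRITE = "retry", "abort-item", "write"

def classify_response(body: str) -> str:
    """Decide in write_to_warc what to do with a response.

    Sketch of the policy from the chat: malformed JSON is retried,
    expected errors (password-protected article, service not activated)
    are written to the WARC, unexpected failures abort the item.
    The 'reason' field is hypothetical, not Xuite's real API shape.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return RETRY  # malformed JSON: don't write it, fetch again
    if data.get("ok") is False:
        if data.get("reason") in {"password", "not-activated"}:
            return WRITE  # expected 'error', fine to preserve
        return ABORT      # unexpected failure: drop the item
    return WRITE

print(classify_response("not json"))  # retry
```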
-
yts98
arkiver: json["ok"] being false is not rare. It happens when an article is password-protected, or a user did not activate one of the blog, album, or vlog services.
-
arkiver
alright good
-
yts98
and then I saw thousands of usernames discovered, but the API responds with "no such user" for them.
-
yts98
their username search API even returns illegal usernames, possibly manually altered by the moderator to deactivate some accounts
-
arkiver
interesting
-
arkiver
so
-
arkiver
on images
-
arkiver
photo.xuite.net, and such
-
arkiver
can different items lead to the same images? can they be duplicated between items? i see they are now generally always accepted for immediate archiving
-
yts98
I queue some image URLs found in API responses of user items, but some of these images belong to an album, so the current script will grab them twice or more.
-
arkiver
are the URLs for a single image unique?
-
arkiver
as in, is it always 3.example.com/image.png, or can there also be 2.example.com/image.png, 3.example.com/image?format=png, etc.?
-
arkiver
I see the TODO about false positives. yes, this may produce false positives. but archiving is usually done with the thought of "better to discover too much than too little". so if we are sure everything will be discovered with very strict rules, then that is fine
-
yts98
for photo.xuite.net, the image URLs are unique;
-
yts98
when images are embedded in blog articles, the service possibly generates another URL that accepts outlinks
-
arkiver
but it is often good to keep the rules somewhat relaxed, allowing for the possibility of false positives. we can eliminate those false positives if we find them. and that way perhaps extract/archive more than we initially were under the impression was actually there
-
arkiver
yts98: "another URL that accepts outlinks" - for an image? what do you mean?
-
arkiver
yts98: on the video URLs and load balancing. can video URLs to the same video be found in different items? as in, can there be duplicates? (same as what i asked for the photos)
-
arkiver
if a certain video will _only_ be discovered from a single item, then good! and then let's get whatever load balancers they use, Wget-AT will prevent writing duplicate data, while still preserving the URLs.
-
arkiver
there will only be duplicate data downloaded on the side of the Warrior, but this extra data will be deduplicated away when written to the WARC. if xuite can handle it, then it's good to get this duplicate data.
-
arkiver
because this is not only about purely data preservation, but also about URL preservation. we want to try and cover the entire range of possible URLs, so that those can be found through the Wayback Machine.
-
arkiver
so. let's say we have 1.example.com/image.png and 2.example.com/image.png both pointing to the same image. we download them _in the same Wget-AT session_, then they will be deduplicated, while both their URLs are preserved (yes, data will be downloaded twice)
-
arkiver
if we have separate items for those two URLs to the same image, then it is likely that those separate items end up in different Wget-AT sessions, and are not deduplicated, which wastes bytes
-
arkiver
if we're talking about 1 TB or so of duplicated data, that is not a big problem. but if it turns into 10 TB or 100 TB of duplicated data, that is a problem
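A toy model of the deduplication being described: within one session, the first copy of a payload is stored in full, and later identical payloads (under any URL) become small revisit records pointing at the first. This is a simplification of what Wget-AT actually writes, for illustration only:

```python
import hashlib

def write_records(responses):
    """Model WARC payload-digest deduplication within one session.

    responses: list of (url, payload_bytes). The first payload with a given
    digest becomes a 'response' record; identical later payloads become
    'revisit' records referencing the first URL, so both URLs are preserved
    while the bytes are stored once.
    """
    seen = {}      # sha1 payload digest -> URL of the first stored copy
    records = []
    for url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            records.append(("revisit", url, seen[digest]))
        else:
            seen[digest] = url
            records.append(("response", url, None))
    return records

recs = write_records([
    ("http://1.example.com/image.png", b"PNGDATA"),
    ("http://2.example.com/image.png", b"PNGDATA"),  # same bytes, new URL
])
print(recs)
```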
-
arkiver
yts98: i see you store data in _data.txt, what is the use of this? we're actually not really using data.txt anymore. in the past data.txt was used to discover items, but nowadays we use backfeed for that.
-
arkiver
there is nothing on the targets currently that will do anything with the _data.txt file.
-
yts98
I don't remember in which article I saw image URL formats other than 1.share.photo.xuite.net.
-
yts98
Separating images into new items is a reasonable approach. Let's handle them like cdn-obs in lineblog.
-
yts98
Video URLs may also be checked in user items. But they may expire if we backfeed them as items.
-
yts98
I thought WARC revisits could only be used on the same URL. So a WARC revisit applies to different URLs when the response body is identical?
-
arkiver
yes, on the response body being identical
-
arkiver
i see on expiring video URLs. are the video URLs you get through a user item actually used for playback? or are they "just there" in some data blob, while actually only the video URL on the post page is used for playback?
-
arkiver
on FlashVars rules - those are not known yet?
-
arkiver
yts98: well overall looks pretty good, i'll be further checking this later!
-
yts98
the purpose of data.txt is to inspect the metadata not included in item names, including blog_id and every <embed>.
-
yts98
I've discovered 5 types of FlashVars rules
wiki.archiveteam.org/index.php/Xuite#Flash-based_creations , but I'm not sure if I missed more.
-
yts98
arkiver: thanks for taking a look! I learned very much about archiving practices :)
-
arkiver
good to hear :)
-
arkiver
alright i'm not sure yet about data.txt, will be having a better look later!
-
arkiver
(i only actually looked at the code - not the site yet)
-
yts98
a possible alternative to data.txt is to create a dummy backfeed that does not actually backfeed the items into the project.
-
arkiver
that sounds better yes
-
arkiver
but i'm not sure if we actually need it, need to do some experiments as well
-
arkiver
if there is something unexpected, can the item simply be aborted?
-
arkiver
i see for example that when an a: item is queued, it is always written to the data.txt as well, that is not needed i think?
-
fireonlive
gettyimages acquired unsplash earlier in 2021:
unsplash.com/blog/unsplash-getty and looks like they’re jumping on the “oh fuck AI is going to ruin us” bandwagon way too late
twitter.com/sindresorhus/status/1674390882399801345
-
fireonlive
not sure what he means by “removed their free non-API endpoint” though
-
arkiver
yts98: i see very explicit extraction of certain URLs, also from the HTML, line 1096 for example. i think this is already handled by the 'general' URL extraction happening at line 1966? if not, that might be a better place
-
JAA
Next AT project: archive everything that has a free API.
-
arkiver
this is again coming from the point of "better extract too much than too little" - if we only allow extraction of very specific URLs in very specific places, there is a great risk of missing something.
-
arkiver
hmm
-
arkiver
or, is this being extracted specifically here to have a certain referer be different than the current URLs we're working on?
-
arkiver
in which case it would be good. later it'd be picked up in the 'general' extraction code, but not queued since it was queued before
-
arkiver
current URL*
-
fireonlive
JAA: yeeeeah :|
-
fireonlive
🙃 🔫
-
fireonlive
they said AI/ML would destroy the internet
-
fireonlive
i just didn't think it would be in this way
-
pokechu22
tinaja.com looks kinda big so I'm not going to put it into archivebot until we have a little bit more space
-
arkiver
let's see
-
arkiver
interesting site
-
that_lurker
seems to have a lot of pdf's so might be big
-
arkiver
pokechu22: shall we put it in archivebot anyway?
-
vokunal|m
I was about to ask what you look for to determine whether it looks big or not. At first glance I figured it looks like it's from the 90s, so small
-
pokechu22
Currently all the AB pipelines are full because hel3/hel4 are low on disk space because of the general upload backlog to my understanding
-
pokechu22
Probably we could still queue it though
-
that_lurker
actually those pdf's are not that big so might be something like 50 - 60 gigs at tops
-
that_lurker
could be good to queue it as you can just pause it in the event that there is no space right?
-
pokechu22
Alright, queued it
-
pokechu22
It'll auto-pause when there's no space (< 5 GB I think)
-
that_lurker
LUL it was already started apparently :P
-
arkiver
pokechu22: general upload backlog to where?
-
arkiver
is IA the bottleneck?
-
pokechu22
I think so?
-
pokechu22
JAA talked more about it I think
-
pokechu22
main thing is that if you look at
archivebot.com/pipelines most machines are full
-
arkiver
we need an "ArchiveBot talk" channel
-
JAA
arkiver: #down-the-tube and AB used the same rsync target. The former clogged it.
-
arkiver
ah
-
arkiver
JAA: how about that archivebot talk channel?
-
JAA
That comes up every few months or so. It'd be mostly a dead channel probably.
-
arkiver
i usually miss messages someone posts to me in #archivebot
-
arkiver
oh well
-
arkiver
warning to all ^ if I need to really notice the message, don't write to me in #archivebot
-
JAA
Make your client log highlights into a separate window. :-)
-
pokechu22
Relevant messages are at 03:47:37 UTC on June 29
-
vokunal|m
This is what I've been using to check. Is this known as a good way to see if they're clogged?
monitor.archive.org/weathermap/weathermap.html
-
pokechu22
I don't think the rsync targets would be on there as they're archiveteam infrastructure, but I'm not 100% sure of that
-
vokunal|m
the switchtc0-200paul has been in the red for around 30+ hours
-
that_lurker
JAA: That is the one thing from znc I would like to have in thelounge
-
arkiver
JAA: that would be something i need to figure out, and i'm not doing that now
-
fireonlive
did someone say that archive.org had an issue with (or intentionally?) limited inbound speed?
-
arkiver
vokunal|m: no, there can be many reasons
-
fireonlive
that was oof a while ago though
-
pokechu22
Oh, it was also mentioned that
yarus.ru was shutting down shortly per
yarus.ru/post/1989728469 - there's an AB job for it, but there's basically no chance it'll finish completely :|
-
pokechu22
ugh, it looks like that site's also JS-based so AB's not going to get anything useful :| (and I think I pushed it too hard and am now getting 403s :|)
-
that_lurker
no wonder google translate did not work on it :P
-
vokunal|m
Yeah I was wondering why it wasn't working
-
that_lurker
Oh and just found out The Lounge has a recent mentions feature
-
that_lurker
thats convenient
-
fireonlive
indeed! the @ symbol
-
arkiver
pokechu22: checking
-
arkiver
pokechu22: are you planning to pull tinaja.com through AB later?
-
pokechu22
It turns out it was already running in AB since yesterday
-
arkiver
oof just seeing yarus in my browser with that loading screen... oof oof
-
arkiver
what
-
arkiver
June 30?
-
arkiver
not again
-
pokechu22
Several hours ago it was 18 hours
-
pokechu22
frankly I think it's not possible to get it done
-
pokechu22
It does have a complete sitemap though
-
arkiver
they posted the message you linked today?
-
arkiver
for a shutdown tomorrow?
-
pokechu22
nyuuzyou: ^
-
pokechu22
It seems like that's the case though
-
arkiver
rewby: are you around?
-
arkiver
i'm not sure if we can get a project up in time
-
arkiver
but we might need a target for a shutdown tomorrow... announced today :(
-
rewby|backup
I'll get you a target if you get a tracker proj and vars in... 30 mins
-
arkiver
woah sequential post IDs?
-
arkiver
i like it
-
pokechu22
"У вас будет время сохранить весь свой контент" - "You will have time to save all your content." yeah, sure...
-
pokechu22
Sequential IDs and a full sitemap as far as I can tell
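Sequential IDs make item generation trivial: chunk the ID space into ranges and hand each range to the tracker as one item. A sketch, with an item-name format made up for illustration (gaps in the ID space just cost cheap 404s, so nothing needs to be skipped):

```python
def id_range_items(first_id, last_id, chunk=50):
    """Yield tracker item names covering a sequential post-ID space.

    Each item covers up to `chunk` IDs. The 'post-range:' prefix is a
    hypothetical naming convention, not the real project's item format.
    """
    for start in range(first_id, last_id + 1, chunk):
        end = min(start + chunk - 1, last_id)
        yield f"post-range:{start}-{end}"

items = list(id_range_items(1, 120, chunk=50))
print(items)  # ['post-range:1-50', 'post-range:51-100', 'post-range:101-120']
```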
-
pokechu22
but on the other hand, javascript
-
JAA
<dr_evil_air_quotes.gif>
-
arkiver
i'm always skeptical about sitemaps
-
arkiver
rewby|backup: alright
-
imer
they seem to have a rate limit (on api. at least), returns a standard nginx 403
-
imer
and now that's changed to another 403 page
-
arkiver
imer: proper status code?
-
imer
yep
-
imer
403
-
arkiver
good
-
imer
-
imer
i've censored my ip with XXX
-
pokechu22
Archivebot is still getting 403s a while after con=6, d=0 (that wasn't using the API and in fact wasn't even trying to retrieve stuff from the API, though)
-
fireonlive
ok everyone gather around for a picture
-
fireonlive
an api actually used a proper http status code
-
imer
the block doesn't seem to be shared across domains, but obviously the site won't work
-
fireonlive
we need to remember this moment
-
arkiver
interesting
-
imer
i'll keep checking if I get unblocked
-
arkiver
IDs sequential with a huge sudden gap
-
imer
-
imer
no ipv6 (why do I even bother checking this)
-
fireonlive
one day you'll be rewarded
-
fireonlive
it's like finding a rare coin
-
fireonlive
-
fireonlive
lol
-
imer
do we have a channel name yet? i'll throw into the hat #norus if not
-
fireonlive
nop
-
imer
words i can arrange sentence to
-
fireonlive
mine was #yaaaaaaaaasus but that's kinda gay
-
fireonlive
:p
-
arkiver
imer: see what i wrote earlier ;)
-
fireonlive
also not punny enough
-
arkiver
#norus it is
-
fireonlive
arkiver: you were in the tiki channel
-
fireonlive
:D
-
arkiver
HEY EVERYONE! JAA is not in #norus , let's party there. no one tell JAA please!!
-
h2ibot
JustAnotherArchivist created ЯRUS (+194, Created page with "{{Infobox project | URL =…):
wiki.archiveteam.org/?title=%D0%AFRUS
-
h2ibot
JustAnotherArchivist created Yarus.ru (+19, Redirected page to [[ЯRUS]]):
wiki.archiveteam.org/?title=Yarus.ru
-
h2ibot
Pcr edited List of websites excluded from the Wayback Machine (+26, Add TH3D):
wiki.archiveteam.org/?diff=50063&oldid=49985
-
fireonlive
:D
-
thuban
arkiver: wrt noise in #archivebot, if you use weechat, there are some filters at
wiki.archiveteam.org/index.php/User:Switchnode