-
OrIdow6
Alright, before I read the backlog, I have been gone for ~3 days
-
OrIdow6
Maybe it was less, I can't remember
-
OrIdow6
Anyhow, aimix-z is thankfully still up (and hopefully will remain so). I have a Japanese proxy that can access it, so hopefully I can use that
-
OrIdow6
Sorry for sort of vanishing suddenly, and so close to a deadline, hopefully I'll have these things running
-
OrIdow6
gazorpazorp: I recall themadpro looking into something subtitle-related a while ago (or it could have been tech 234 a, I got them mixed up at first)
-
OrIdow6
By the way, if anyone knows of a Bintray user that has a small (2-4) number of packages, with a small number of repos, with a small number of versions, that would be nice
-
OrIdow6
Because multiple versions tend to make it blow up into thousands of requests, which makes it hard to test
-
Jake
OrIdow6: I have one with a single package, single repo, and one version: bintray.com/lightshed
This guy also has 9 repos and 4 packages: bintray.com/jaycroaker
-
themadpro
gazorpazorp and OrIdow6: It's us, alright
-
themadpro
We have been trying to build a subtitle/caption alliance for the past year or so, ever since YouTube removed closed captions, and most work so far has been limited to YouTube.
-
themadpro
We could consider adding this to the backlog, but we have got quite a lot of things ahead of us already.
-
themadpro
Notably, Jopik is working on publishing a bunch of credits he had gathered from Metadata scrapes over the years.
-
themadpro
We're active on Discord, but I might as well grab the channel for it on IRC #scc
-
gazorpazorp
Thanks for answering, themadpro. :)
-
Kaz
so _that's_ why he wanted to know
-
Kaz
smh
-
EggplantN
kekw
-
EggplantN
oh
-
EggplantN
kekw man died recently
-
EggplantN
did we archive him
-
Vukky
reddit archive probably contains at least one meme of him
-
thuban
afaict he did not have a twitter
-
spirit
SketchTheCow: could you update the description of archive.org/details/atomicgamer?tab=about with these two snippets: pastebin.com/raw/BNs915Ya ? thanks!
-
etnguyen03
just curious, is atdash.meo.ws no longer public? (trying to see where my workers are)
-
EggplantN
For now, no it is not
-
EggplantN
people have abused it slightly recently
-
EggplantN
I.e. viewing 7 days with 5s refresh, and it’s been causing issues
-
etnguyen03
okay cool
-
Jake
Epic Games acquired ArtStation, a portfolio, digital assets marketplace, kinda website. They say no changes to branding, and lower fees for their marketplace.
magazine.artstation.com/2021/04/art…on-is-joining-the-epic-games-family
-
atphoenix
"no changes" famous last words
-
Ryz
Jake, atphoenix, launched some archives on ArtStation;
-
Ryz
I archived individual people's ArtStation accounts during the Activision Blizzard mess earlier this year~
-
JAA
EggplantN: Re findmypast.com, login required, so not possible with AB.
-
JAA
Also, 'During a Free Access period or Free Weekend, you may access a maximum of 200 Records per 24hr period (Free Access Limit).' per the T&C.
-
Jake
Ryz: Awesome, thank you very much! :)
-
Jake
(and yeah, I think ArtStation is quite large, may not be great for AB.)
-
Ryz
Jake, could maybe have it as a surface grab? When companies get acquired, I usually archive their whole websites plus related subdomains and other websites~
-
Jake
Ryz: Yeah, not sure what the best approach here is! A blogpost in 2018 said they had 3.4m monthly users. (magazine.artstation.com/2018/03/artstation-marketplace-alpha)
-
masterX244
could be a case for the warrior
-
masterX244
(i think we need to do an outlink crawl (including i.stack.imgur.com) on the stackexchange dumps, too; another valuable source of relevant links there)
-
masterX244
might be worth the time to hack a tool together that extracts the URLs from the dump and then diffs them against the last run to find new outlinks to insert into the URLs project
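A rough sketch of such a tool, assuming the dump's Posts.xml consists of `<row ... Body="..."/>` elements. The `Body` attribute name, the loose URL regex, and the helper names `extract_urls` and `new_outlinks` are assumptions for illustration, not an existing ArchiveTeam tool:

```python
import io
import re
import xml.etree.ElementTree as ET

# Deliberately loose URL pattern (an assumption); it stops at whitespace,
# quotes, and angle brackets, which is enough for href/src attributes.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(xml_source, attr="Body"):
    """Stream over <row> elements and collect URLs found in one attribute."""
    urls = set()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            urls.update(URL_RE.findall(elem.get(attr, "")))
            elem.clear()  # drop the row's data so large dumps stay manageable
    return urls


def new_outlinks(current, previous):
    """Return URLs present in this run but not in the previous one."""
    return sorted(current - previous)


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<posts>
  <row Id="1" Body="&lt;a href=&quot;https://example.com/a&quot;&gt;a&lt;/a&gt;" />
  <row Id="2" Body="&lt;img src=&quot;https://i.stack.imgur.com/abc.png&quot;&gt;" />
</posts>"""

previous_run = {"https://example.com/a"}
current_run = extract_urls(io.BytesIO(SAMPLE))
print(new_outlinks(current_run, previous_run))
```

For a real run, `previous_run` would be loaded from the URL list saved by the last run, and `extract_urls` would be pointed at the dump file path instead of the in-memory sample.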
-
hook54321
JAA: I'm adding a section on connecting via mobile to what you wrote on wiki.archiveteam.org/index.php/Archiveteam:IRC, feel free to change or move it if you think there's a better place.
-
JAA
hook54321: Sounds good!
-
Jake
masterX244: I believe someone does have a tool to extract certain outlinks from large groups of WARCs. (I believe rewby?)
-
masterX244
stackexchange is an XML dump (already dumped regularly to archive.org)
-
rewby
I was about to say, SE is xml
-
rewby
One I'm working with for a uni project actually
-
rewby
Painful file to work with
-
masterX244
jake: should be able to quickly hack together a tool for that job
-
Jake
Ah sorry, didn't realize it was XML.
-
rewby
Yeah, it's quite interesting
-
masterX244
did a really ugly crawler recently for the Trackmania exchange to get all track and replay pages
-
rewby
Basically every detail of the stackexchange platform (and all sites under it) is archived
-
masterX244
result file is running through my grab-site instance atm and uploaded to archive.org every 50GB
-
masterX244
pulling down those XMLs now to find the quickest way to process the files
-
masterX244
(a smaller one to my computer and a full dump over to my server; shit internet, so main processing is done on the server to avoid that bottleneck)
-
OrIdow6
So Aimix-Z is still blocking me
-
OrIdow6
I have my suspicions for how they're doing it, not completely sure though
-
OrIdow6
Could just be that they're blocking anyone who quickly accesses it
-
OrIdow6
Something to work on later: headless browser warriors, maybe using Selenium or whatever
-
masterX244
preventive crawl or immediate danger atm?
-
OrIdow6
*Webdriver
-
OrIdow6
Jake: Thanks
-
Jake
No problem.
-
OrIdow6
masterX244: See Deathwatch, technically should have already gone down
-
EggplantN
OrIdow6 can we provide any infra to assist you?
-
EggplantN
as in Dual E5v3 with a /23?
-
EggplantN
or more?
-
HCross
Needs to be Japanese
-
nyany
oof
-
nyany
that's a wallet burner
-
EggplantN
HCross vultr?
-
HCross
Could do, but you wouldn’t be able to bring your own IP with a decent geolocation
-
EggplantN
ah shit they're fucked geo arent they
-
nyany
probably
-
nyany
Linode is the same way
-
EggplantN
linode is in JP?
-
nyany
Yeah
-
nyany
When I geolocate the IPs it usually comes up as like Atlanta
-
Kaz
AK: bought
-
thuban
i registered a fresh alternatehistory.com account to use with grab-site, but new accounts require admin approval before you can see the forums that are getting deleted :<
-
thuban
if my shit doesn't get confirmed before i get grab-site set up i might just use my personal account and sit on the results
-
AK
Enjoy it Kaz, well worth it imo
-
EggplantN
what has Kaz bought
-
Kaz
staycation
-
gazorpazorp
Is there someone who does PR for ArchiveTeam? I've been reading articles on Yahoo! Answers shutting down and not one of them mentioned ArchiveTeam or anything related to archiving. Some way to coordinate in contacting writers to edit their page would be a nice thing to set up
-
Ajay
There are many that do mention AT
-
gazorpazorp
Or when an article talks about censorship of reddit or whatever we archive - a good reminder would be the Wayback Machine and ways to add to it (via ArchiveTeam)
-
EggplantN
yes there is a PR person gazorpazorp
-
EggplantN
his name is Jason Scott
-
EggplantN
aka TextFiles/SketchTheCo_W
-
masterX244
Jake and rewby: XML parser rigged. Waiting for the full XMLs to arrive at my server.
-
Jake
Nice
-
gazorpazorp
That's great, @EggplantN and Ajay. Thanks
-
thuban
ok, question about grab-site:
-
thuban
what's the most correct way to get threads only from specific forums? (threads are under generic 'threads' urls, not per-forum.)
-
masterX244
Enumerate the URLs of all forum thread list pages that you want to get the threads from, then add the forum index URL as an ignore (ignores don't ignore starting URLs) so it doesn't go into other subforums
-
thuban
oh cool, thanks
-
thuban
i figured i'd be doing that enumeration, but i wasn't sure whether i'd end up in other fora via miscellaneous ui...
-
masterX244
or blacklist thread urls, too if you also enumerated all of them, that way both main escape paths are blocked
-
thuban
the other option i considered would be to enumerate the threads of interest and just use no-parent (since thread pages are children)
-
thuban
but to enumerate the threads, i'd have to get them from the thread list pages somehow, and it felt silly to do that manually if i could figure out a way to get grab-site to do it for me
-
masterX244
last time that i needed to do that i hacked together a quick and dirty C# program
-
thuban
grab-site definitely doesn't have a whitelist mode, right? (ignore everything _except_ /threads/ urls?)
-
masterX244
regex allows a "match anything except" pattern, but direct links to other threads still allow escaping that way
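A minimal sketch of that "anything except" idea using a negative lookahead, assuming the ignore pattern is applied as a Python regular expression to the full URL. The host `forum.example.com` and the `/forum/threads/` layout are placeholders, not the real site's structure:

```python
import re

# Hypothetical layout: threads live under /forum/threads/, and everything
# else under /forum/ (subforum lists, member pages, ...) should be skipped.
IGNORE_EVERYTHING_BUT_THREADS = re.compile(
    r"^https?://forum\.example\.com/forum/(?!threads/)"
)


def is_ignored(url):
    """True if the URL is on the forum but not under /forum/threads/."""
    return IGNORE_EVERYTHING_BUT_THREADS.search(url) is not None


for url in [
    "https://forum.example.com/forum/threads/some-thread.123/",
    "https://forum.example.com/forum/forums/other-subforum.7/",
    "https://forum.example.com/forum/members/someone.42/",
]:
    print(url, "ignored" if is_ignored(url) else "kept")
```

A thread page that links directly to a thread in a different subforum still passes this pattern, which is the escape path mentioned above.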
-
thuban
ah yeah
-
thuban
though i'm not sure that's as much of a concern
-
OrIdow6
EggplantN HCross: Thanks for the offer. Right now I'm sort of busy. It's possible that I'll be able to bypass the geographic thing with Accept-Language, as that seemed to extend the time before I got banned from a Japanese IP
-
OrIdow6
Well, I or anyone else
-
OrIdow6
Anyhow, at present it's in limbo, where it should have been shut down but hasn't
-
OrIdow6
Well, as of a few hours ago
-
thuban
does grab-site --1 (no recursion) disable offsite links?
-
thuban
ugh, wait
-
thuban
i don't want --1, i want no-parent (like the default archivebot behavior). is that the default for grab-site too?
-
JAA
Yes, --no-parent is the default.
-
thuban
ty JAA :)
-
thuban
unfortunately when trying `grab-site --input-file ~/misc/at/ah-urls.txt --igsets=forums --wpull-args=--load-cookies=/tmp/alternatehistory.com_cookies.txt`, i get the following errors:
-
thuban
"sqlalchemy.exc.InvalidRequestError: Could not evaluate current criteria in Python: "Cannot evaluate Select". Specify 'fetch' or False for the synchronize_session execution option.", followed by "CRITICAL Sorry, Wpull unexpectedly crashed."
-
JAA
Which SQLAlchemy version?
-
thuban
1.4.12
-
JAA
Try a 1.3.x version instead. At least standard wpull broke in a number of ways with 1.4.
-
JAA
Actually yeah, exactly that error:
ArchiveTeam/wpull #463
-
JAA
(grab-site uses a fork, but it's close enough in this respect I believe.)
-
thuban
ok, trying again with 1.3.24...
-
thuban
and it's working :) thanks!
-
thuban
my one concern is that i'm currently seeing only urls from the input file, not page prerequisites or subsequent pages of threads; are those all queued at the end?
-
JAA
Yes, wpull does breadth-first recursion.
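A toy illustration of why breadth-first order fetches every input URL before anything discovered on those pages. This is a generic queue-based sketch with a made-up link graph, not wpull's actual code:

```python
from collections import deque


def crawl_order(seeds, links):
    """Return the order in which a breadth-first crawl would visit URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue:
        url = queue.popleft()  # oldest queued URL first (FIFO)
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)  # discovered URLs go to the back
    return order


# Made-up link graph: two input threads, each with a second page.
links = {
    "thread-1": ["thread-1/page-2", "style.css"],
    "thread-2": ["thread-2/page-2"],
}
print(crawl_order(["thread-1", "thread-2"], links))
```

Both input threads come out first, followed by the page prerequisites and later pages found on them.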
-
thuban
ok, good to know.
-
JAA
OrIdow6: You working on Bintray?
-
OrIdow6
JAA: The last day or so, no, though there is a semi-working grab script
-
OrIdow6
In the sense that it
-
OrIdow6
gets the essential data but not the interface stuff
-
JAA
I see.
-
JAA
I'll try to get some discovery done.
-
OrIdow6
I did some already, let me find it
-
OrIdow6
Was simple, I just searched for alphanumerical strings on the user search - could not figure out how not to do approximate matching
-
JAA
Yeah, that was more or less what I had in mind as well.
-
JAA
Sadly, pagination breaks at 10k.
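One way to work around a 10k pagination cap is to lengthen any query that hits it. This is only a sketch: `count_results` is a hypothetical stand-in for asking the user search how many hits a query has, and it assumes prefix-like matching, which may not hold given the approximate matching mentioned above:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase
CAP = 10_000  # pagination limit mentioned in the discussion


def plan_queries(count_results, prefix="", max_len=3):
    """List queries to run, extending any query that reaches the cap."""
    plan = []
    for ch in ALPHABET:
        query = prefix + ch
        hits = count_results(query)
        if hits == 0:
            continue  # nothing to fetch for this query
        if hits >= CAP and len(query) < max_len:
            # Too many results to page through; try longer queries instead.
            plan.extend(plan_queries(count_results, query, max_len))
        else:
            plan.append(query)
    return plan


# Made-up hit counts for demonstration only.
fake_counts = {"a": 25_000, "ab": 12, "ac": 300, "b": 40}
print(plan_queries(lambda q: fake_counts.get(q, 0)))
```

Queries that still reach the cap at `max_len` are kept in the plan as-is, so their results would be truncated.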
-
OrIdow6
Yeah
-
JAA
What queries did you run?
-
OrIdow6
Um
-
OrIdow6
0a-zo apparently, not sure how that's being sorted
-
OrIdow6
Judging from the size of stdout
-
JAA
Huh, didn't run p-z on the second character? I'm seeing results on those.
-
OrIdow6
I think I stopped it once it plateaued
-
JAA
Ah