-
hook54321
really not a fan of DRM, but i can see why they sent the takedown request with the surrounding legal stuff going on right now
-
flashfire42
I mean there are easier ways to get pirated books than crack DRM from IA
-
fireonlive
👀
-
fireonlive
hook54321: yeah for sure. hope it helps them
-
pcr
It does probably mean they are going to resist any attempt to perform a (maybe illegal) archival if they announce they are going to need to delete stuff on a day in the future.
-
thuban
flashfire42: depends on the book tbqh
-
h2ibot
Yts98 edited 半次元 (+2524, Explain alternate image CDN endpoints):
wiki.archiveteam.org/?diff=50208&oldid=50196
-
h2ibot
PaulWise edited Bugzilla (+30, ghostscript bugzilla):
wiki.archiveteam.org/?diff=50210&oldid=50178
-
h2ibot
PaulWise edited Bugzilla (+53, IRC channel topics idea):
wiki.archiveteam.org/?diff=50211&oldid=50210
-
h2ibot
PaulWise edited Bugzilla (+160, security issue lists idea):
wiki.archiveteam.org/?diff=50212&oldid=50211
-
h2ibot
PaulWise edited Mailman2 (+213, add IRC and sectrackers as sources of mailman2…):
wiki.archiveteam.org/?diff=50214&oldid=50180
-
h2ibot
PaulWise edited Bugzilla (+792, add URLs from Debian sectracker):
wiki.archiveteam.org/?diff=50215&oldid=50213
-
pokechu22
JAA: I have a large set of URLs related to germandocsinrussia.org and historyrussia.org book scans, probably on the order of 10 million across all sites and all unsaved zoom levels. They're incremental IDs with gaps (e.g.
wwii.germandocsinrussia.org/pages/24/zooms/8,
wwii.germandocsinrussia.org/pages/1505900/zooms/8 - I haven't figured out the exact maximum
-
pokechu22
yet) where zoom ranges from 3 to 7 or 8 (0-2 are not used directly, but instead e.g.
wwii.germandocsinrussia.org/system/…ac35af4e318616146ad4.jpg?1538539960 or x_small or xx_small, and archivebot will have already captured them so we don't need to worry about the random-looking component). I assume qwarc is the best
-
pokechu22
tool for that, as giving archivebot an !ao < list job with 10 million entries will result in sadness?
-
pokechu22
If so, what kind of information do you need to do a qwarc job?
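
As a rough illustration of the URL space described above, here is a minimal sketch that enumerates the `/pages/<id>/zooms/<zoom>` URLs for one host. The function name `zoom_urls` and the default zoom range of 3 to 8 are assumptions for the example. Some pages reportedly only go up to zoom 7, and gaps in the ID range will produce error responses.

```python
from typing import Iterator


def zoom_urls(host: str, first_id: int, last_id: int,
              zooms: range = range(3, 9)) -> Iterator[str]:
    """Yield one URL per (page ID, zoom level) pair, IDs inclusive.

    Hypothetical helper: the URL pattern is taken from the examples in
    the chat; the ID bounds must be supplied by the caller.
    """
    for page_id in range(first_id, last_id + 1):
        for zoom in zooms:
            yield f"https://{host}/pages/{page_id}/zooms/{zoom}"


if __name__ == "__main__":
    # Print the candidate URLs for the first two page IDs only.
    for url in zoom_urls("tsamo.germandocsinrussia.org", 1, 2):
        print(url)
```

Because it is a generator, the full list never has to sit in memory; the output can be streamed straight into a file.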
-
pokechu22
For the AB jobs that have finished, I've determined that the highest valid IDs are
tsamo.germandocsinrussia.org/pages/48045/zooms/8 and
rgaspi-458-9.germandocsinrussia.org/pages/77762/zooms/8 (and that there are 40593 and 70703 actual valid images in that region respectively, with invalid ones in that range giving 500s and ones outside that range giving 404s).
-
pokechu22
It seems like zoom 8 gives errors on some URLs (e.g.
rgaspi-458-9.germandocsinrussia.org/pages/8/zooms/8) for which zoom 7 does work.
-
pokechu22
ah, scratch that about 500s, seems to depend on the site as
wwii.germandocsinrussia.org/pages/163000/zooms/8 and
wwii.germandocsinrussia.org/pages/165000/zooms/8 are 200 but
wwii.germandocsinrussia.org/pages/164000/zooms/8 is 404 instead of 500. I'll just wait for AB to finish to get a maximum valid ID instead of trying to do a binary search
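
To show why a plain binary search is fragile here, consider this sketch of a gap-tolerant variant. It is not what was done in the channel (the plan was to wait for AB). The callable `is_valid` is a hypothetical stand-in for "this ID returns 200", and `window` is an assumed upper limit on how long a run of missing IDs can be. If a real gap is longer than `window`, the result will be too low.

```python
from typing import Callable


def find_max_valid_id(is_valid: Callable[[int], bool],
                      upper_bound: int, window: int = 50) -> int:
    """Return the highest valid ID <= upper_bound, or 0 if none is found.

    Assumes every run of invalid IDs below the maximum is shorter than
    `window`; that assumption is what keeps the search monotonic.
    """
    def any_valid_from(start: int) -> bool:
        # True if at least one ID in [start, start + window) is valid.
        stop = min(start + window, upper_bound + 1)
        return any(is_valid(i) for i in range(start, stop))

    lo, hi = 0, upper_bound + 1  # lo: treated as "in range", hi: "past the end"
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if any_valid_from(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe can cost up to `window` requests, so this trades extra requests for robustness against gaps.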
-
JAA
pokechu22: Yeah, loading 10M into AB would be slow. The list input importing in wpull is a bit awkward. It'd probably take a few hours. That's the only sad part though. It's certainly better otherwise because it allows for easy monitoring, request rate adjustment, etc., which isn't really the case with qwarc.
-
JAA
And could do it in smaller chunks of course rather than one huge list.
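
A minimal sketch of the chunking idea, assuming the URLs are available as any iterable. The helper name `chunked` and the chunk size are illustrative choices, not anything ArchiveBot prescribes.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Split five placeholder URLs into chunks of two.
    urls = [f"https://example.org/pages/{i}/zooms/8" for i in range(5)]
    for number, chunk in enumerate(chunked(urls, 2)):
        print(number, chunk)
```

Each chunk could then be written to its own list file and submitted as a separate job.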
-
JAA
It's possible of course with qwarc, just doesn't sound like a great fit unless the site is going down soon and can handle several dozen requests per second.
-
pokechu22
Alright, I might try it for the smaller ones at least
-
pokechu22
I'm not aware of any rate-limiting - I'll try tsamo.germandocsinrussia.org at an aggressive rate with AB to see what happens maybe
-
JAA
Well, qwarc is about 1 or 2 orders of magnitude faster than AB...
-
JAA
(Without trying hard, that is.)
-
JAA
Although AB is able to reach something like 20 req/s quite comfortably for images.
-
pokechu22
Probably we'd be limited by the ping time to Russia if anything
-
JAA
Right
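
For a sense of scale, a back-of-the-envelope sketch using the figures mentioned above (about 10 million URLs, roughly 20 req/s for AB). The latency model is a deliberate simplification that assumes one request per connection per round trip, and the 0.15 s round-trip time is a made-up placeholder, not a measurement.

```python
def crawl_days(url_count: int, requests_per_second: float) -> float:
    """Days needed to fetch url_count URLs at a steady request rate."""
    return url_count / requests_per_second / 86_400


def latency_bound_rate(concurrency: int, rtt_seconds: float) -> float:
    """Upper bound on req/s if each connection completes one request per RTT."""
    return concurrency / rtt_seconds


if __name__ == "__main__":
    print(f"{crawl_days(10_000_000, 20):.1f} days at 20 req/s")
    print(f"{crawl_days(10_000_000, 200):.2f} days at 200 req/s")
    # Hypothetical RTT of 0.15 s with 3 parallel connections.
    print(f"{latency_bound_rate(3, 0.15):.0f} req/s latency bound")
```

At 20 req/s, 10 million URLs take a little under six days; a tenfold higher rate brings that down to roughly 14 hours.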
-
nicolas17
JAA:
archive.org/details/csdnsdplist this has a bunch of "screensavers" used on Apple Store demo devices, but it also has the original URLs they were downloaded from, would it be worth putting them in archivebot or something so they're on WBM?
-
JAA
(TIL 'H.264 IA' for derived videos.)
-
JAA
nicolas17: Maybe, yeah. I wouldn't be opposed to it.
-
fireonlive
i wonder why the first one is 'sideways'
-
fireonlive
hm they all seem sideways
-
fireonlive
the few i checked yday were good though :D
-
nicolas17
Side data:
-
nicolas17
displaymatrix: rotation of -90.00 degrees
-
fireonlive
ahh
-
nicolas17
which the web player doesn't understand ig
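
A small sketch of how that rotation could be read programmatically, assuming ffprobe's JSON output (for example from `ffprobe -v error -show_streams -of json FILE`) exposes a `rotation` field inside each stream's `side_data_list`. The function name is hypothetical, and the field layout may differ between FFmpeg versions.

```python
import json
from typing import Optional


def display_rotation(ffprobe_json: str) -> Optional[float]:
    """Return the first display-matrix rotation found, or None if absent."""
    data = json.loads(ffprobe_json)
    for stream in data.get("streams", []):
        for side_data in stream.get("side_data_list", []):
            if "rotation" in side_data:
                return float(side_data["rotation"])
    return None


if __name__ == "__main__":
    # Hand-written sample resembling ffprobe output, not real tool output.
    sample = ('{"streams": [{"codec_type": "video", "side_data_list": '
              '[{"side_data_type": "Display Matrix", "rotation": -90}]}]}')
    print(display_rotation(sample))
```

A player that ignores this side data will show the frames as stored, which would explain the sideways playback.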
-
fireonlive
makes sense :)
-
nicolas17
also it seems many of these are h265 and HDR
-
pokechu22
ugh, looks like there's also a map view for some pages that's higher resolution, e.g.
tsamo.germandocsinrussia.org/pages/44716/map - indicated on view-source:https://tsamo.germandocsinrussia.org/ru/nodes/246-delo-234-karta-polozheniya-frantsuzskih-angliyskih-i-belgiyskih-voysk-na-zapadnom-fronte-na-04-05-1918g-m-1-750-000 by map_ids = [44716]; in the JS. Pretty sure
-
pokechu22
the only way to find those is to download the full warcs :|
-
pokechu22
(you can plug in any page ID, but most will try to load missing images, and I think there's only a few maps so trying to save them for everything would be a waste of resources)
-
nicolas17
JAA: transfer.archivete.am is down
-
nicolas17
Caddy returns Bad Gateway
-
JAA
nicolas17: Yes, we have monitoring for that in #nodeping.
-
nicolas17
ok