-
lennier1
Instagram used to be archivable through socialbot. But it's been quite a while since it's worked.
-
lennier1
There are programs to mass download Instagram photos, but they need an account.
-
JAA
PhantomJS in wpull/AB was horribly broken many years ago already, so yeah, I removed it from AB. Instagram used to be scrapable with socialbot. Later, we had some AB pipelines specifically for individual Instagram pages (mostly profiles), but they ended up getting banned as well.
-
Ryz
-
mind_combatant
so, for what it's worth, ironically, the wiki page for URLTeam ( wiki.archiveteam.org/index.php?title=URLTeam ) has at least two of its references (numbers 3 and 4) linking to dead pages that i had to use the wayback machine to actually see. there are probably various others throughout the wiki; probably worth re-linking to a working wayback snapshot or some other archived copy, either instead of or in addition to the dead link.
-
mind_combatant
reference number 2 doesn't even seem to have a working copy on the wayback machine, so that's cool.
-
mind_combatant
oh, wait, never mind, 2 does still exist, it's just that bit.ly's blog redirected me to an address that doesn't exist (and never existed) before i could grab the original url and put it into the wayback machine.
-
systwi_
Ryz: MEGA file preservation can, currently, only be done manually.
-
JAA
Ryz: There's no specific tooling for MEGA so far. Yet another thing I've been meaning to look into for a while. So apart from some browser-based thing (Brozzler or another MITM-proxied headless browser with the required scripting), the best thing we can do is download it and throw it into an IA item.
-
JAA
Ninja'd...
-
h2ibot
JustAnotherArchivist moved Gitlab to GitLab (Capitalisation fix):
wiki.archiveteam.org/?title=GitLab
-
h2ibot
JustAnotherArchivist edited GitLab (-17, Capitalisation fix):
wiki.archiveteam.org/?diff=48787&oldid=48786
-
Jake
I might just be an idiot right now, but is there not a way to use curl to glob numerical ranges together sequentially? Currently have "curl "example.com/[7600-8000]/[7600-8000].jpg"", looking to get the same number in each one, but it doesn't seem to work like that. Output: "example.com/7600/7608.jpg"
-
JAA
You mean you want /7600/7600.jpg, /7601/7601.jpg, etc. for a total of 401 URLs?
-
Jake
Yup.
-
Jake
Sorry, I probably didn't explain it very well.
-
nimaje
I don't think there is a way for that, but libera/#curl probably knows more
-
JAA
Yeah, I don't think so either. I'd probably do it with `seq|awk` or similar.
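A minimal sketch of the `seq|awk` idea, for reference. It only prints the 401 paired URLs and does not fetch anything. The `https://` scheme is an assumption, `example.com` is the placeholder host from the question, and `gen_urls` is a hypothetical helper name.

```shell
# Print one URL per number, using the same number for the directory and the file.
# Assumptions: https:// scheme, example.com placeholder host from the question.
gen_urls() {
  seq 7600 8000 | awk '{ printf "https://example.com/%d/%d.jpg\n", $1, $1 }'
}

# Show the first few generated URLs.
gen_urls | head -n 3
```

The list can then be fed to whatever does the downloading, so the pairing logic stays outside curl's own globbing.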
-
Jake
yeah. :( Thanks!
-
h2ibot
Themadprogramer edited Discourse (+48, Added Hugo Community):
wiki.archiveteam.org/?diff=48788&oldid=48774
-
h2ibot
ThreeHeadedMonkey edited Deathwatch (+312, Added MapKnitter and SpectralWorkbench):
wiki.archiveteam.org/?diff=48789&oldid=48783
-
h2ibot
KevinArchivesThings edited WikiTeam (+160, Added WARC search for editthis.info wikis):
wiki.archiveteam.org/?diff=48790&oldid=48703
-
adamus1red
Jake: Wouldn't a bash for loop do the trick?
-
adamus1red
for i in $(seq 7600 8000);
-
JAA
That would create a new (process and) connection for each request, which slows things down significantly.
-
Jake
^
-
Jake
It's what I was doing before, but it is _extremely_ slow :(
-
JAA
Not sure if it could perhaps be done with shell pattern expansion, but then you might run into argument list length limits.
-
adamus1red
Jake: use the loop to generate the list of URLs then xargs to run multiple requests per curl process and run multiple instances?
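One way this suggestion might look, as a hedged sketch: generate the list, then let xargs hand batches of URLs to each curl process. The batch size of 100, the 4 parallel processes, the `https://` scheme, and the helper names `gen_urls` and `batches` are all assumptions for illustration. The block only does a dry run with `echo`; the real invocation is left as a comment.

```shell
# Generate the paired URLs (assumed https:// scheme, placeholder host).
gen_urls() {
  seq 7600 8000 | awk '{ printf "https://example.com/%d/%d.jpg\n", $1, $1 }'
}

# Dry run: print the curl command lines xargs would build, 100 URLs per process.
batches() {
  gen_urls | xargs -n 100 echo curl --remote-name-all
}

# Count how many curl processes would be started.
batches | wc -l

# Real run (not executed here): up to 4 curl processes at once.
# -P is supported by GNU and BSD xargs but is not required by POSIX.
# gen_urls | xargs -n 100 -P 4 curl --remote-name-all
```

Each curl process can reuse its connection across the URLs in its batch, which is what the one-request-per-process loop loses.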
-
Jake
I think that's the current best plan!
-
JAA
`seq|awk` is probably a few orders of magnitude faster than a shell loop, but for a few hundred numbers, that won't matter (or the shell loop might even be faster due to the lack of subprocesses).
-
thuban
Jake: curl can take a list of 'configurations' with -K
-
Jake
Yup, I don't think we can get the URLs to be generated in curl though?
-
thuban
this is kind of a pain in the ass, because you need to prepend 'url=' to everything and duplicate (most of) any other configuration you would do, but it does mean you can do seq|awk|curl and let curl's native connection reuse (and parallelization) handle the whole batch
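A rough sketch of the seq|awk|curl pipeline described here. It generates a curl config with one `url` entry per number plus a repeated `output` entry, which is the per-URL duplication mentioned above. The `https://` scheme, the output file naming, and the helper name `gen_curl_config` are assumptions. The fetch itself is left as a comment so that the block only prints the config.

```shell
# Emit a curl config: a url line and a matching output line for every number.
# Assumptions: https:// scheme, example.com placeholder host, <number>.jpg output names.
gen_curl_config() {
  seq 7600 8000 | awk '{
    printf "url = \"https://example.com/%d/%d.jpg\"\n", $1, $1
    printf "output = \"%d.jpg\"\n", $1
  }'
}

# Inspect the first two entries.
gen_curl_config | head -n 4

# Real run (not executed here): read the config from stdin in a single curl process.
# --parallel needs curl 7.66.0 or newer; drop it to fetch sequentially over reused connections.
# gen_curl_config | curl --parallel -K -
```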
-
thuban
not _in curl_, afaik, no
-
Jake
ah. I see