-
JAA
-
thuban
F
-
mgrandi
how much of it did we get?
-
JAA
Don't think we grabbed anything from that host except for my /archive/ grab and jodizzle's attempt to collect more videos from article pages.
-
JAA
On another note, I found that a web search for site:img.appledaily.com.tw brings up a bunch of interesting-looking things. PDFs, DOCs, 'classified' (i.e. ads) JS flipping book thingies, etc. Would be good to look into that at some point. Probably not in danger though since it's the Taiwanese Apple Daily.
-
achivarin
-
achivarin
Also, how do we piece the ts and m3u8 segments back together? Do we have the accompanying metadata?
-
jodizzle
I have some metadata from the scraping process that could be used to piece them back together, yes. I should probably organize that. However, if you were willing to analyze them in bulk, you could also do it from the m3u8 files themselves. (Ideally, the m3u8s and ts files would all be ordered in the uploaded text files, but my scraping process screwed up the order a little bit.)
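A minimal sketch of the "do it from the m3u8 files themselves" route, assuming each M3U8 is a plain HLS media playlist whose non-comment lines are segment URIs listed in playback order. The function name `segment_urls` and the example host are made up for illustration and are not from the log.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text, playlist_url):
    """Return the segment URLs of an HLS media playlist, in playback order."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are tags or comments, not URIs.
        if not line or line.startswith("#"):
            continue
        # Segment URIs may be relative to the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical playlist content, for illustration only.
example = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg-0.ts
#EXTINF:10.0,
seg-1.ts
#EXT-X-ENDLIST
"""

for url in segment_urls(example, "https://video.example/hls/clip/index.m3u8"):
    print(url)
```

The resulting order is what matters for reassembly: the ts files would be joined in the order the playlist lists them, regardless of the order they ended up in the uploaded text files. A master playlist (one that points at other playlists rather than at segments) would need one more level of the same parsing.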
-
jodizzle
To answer your earlier question achivarin, I got the mp4s by just using a pretty lazy regex that appeared to work well.
-
jodizzle
JAA: I've uploaded the last of my m3u8s (I think). Later, I'll dedupe against your list and process any additional ones.
-
jodizzle
I did go ahead and diff the mp4s I collected against your list, and found 74959 unique to mine. I made an AB job for them.
-
JAA
Nice! It's weird that /archive/ is so incomplete.
-
JAA
Or perhaps I screwed up my extraction. It was very crude.
-
JAA
jodizzle: We could go through the AB job WARCs to perhaps find more videos as well. Bit of a pain though since that's 550 GiB (so far)... :-|
-
JAA
rewby: Want to do that?
-
dav3
hi, i have about 2TB of hk-appledaily data from 2014-2021 scraped using the /archive/ pages, mostly videos, images and html. the data is on 2 1TB servers, what is the best way of transferring the data?
-
JAA
Hi dav3. What data format? WARCs? Plain files? Something else?
-
dav3
plain files, mp4, html, jpg/gif
-
JAA
Not entirely sure. arkiver, any advice?
-
JAA
dav3: Do you still have the URLs for the MP4s? We also grabbed about 1.5 TB of those last night, but would be nice to compare.
-
dav3
sure, i can generate a list of video urls
-
dav3
-
JAA
-
arkiver
getting reports that thestandnews.com and beta.thestandnews.com may be next
-
arkiver
at least thestandnews.com has sitemaps
-
JAA
jodizzle: ^ Also some M3U8 in dav3's list.
-
arkiver
did we already get these, and if not, can we get these?
-
arkiver
(ping rewby as well on thestandnews)
-
rewby
I already sent them
-
rewby
Check the logs
-
arkiver
yes, just saw
-
JAA
Looks like we had a job for a bunch of *.thestandnews.com stuff in Aug 2019.
-
arkiver
EggplantN: have all lists from rewby been queued?
-
JAA
s/a job/AB jobs/
-
EggplantN
uh
-
EggplantN
yes
-
EggplantN
they had
-
arkiver
the lists from rewby for thestandnews hongkongfp and polymerkhk
-
arkiver
alright, good
-
rewby
I didn't do beta yet, I'll run that after I've had a shower
-
arkiver
thanks rewby
-
arkiver
EggplantN: have those archives for the lists from rewby already been uploaded to IA, or are they stashed somewhere?
-
EggplantN
they were through #//
-
EggplantN
so they should be uploaded
-
JAA
dav3: Hmm, line 569 in the 2014-2018 file is malformed.
-
JAA
-
dav3
oops not sure what happened there. i will make a new file
-
JAA
Looks like you found 16781 videos that aren't in my list from /archive/.
-
JAA
Comparing with the other lists we have now.
-
JAA
Filtering out all the video lists mentioned on the wiki brings it to 11661 videos.
-
JAA
-
JAA
That's dav3's list from the ZIP above minus transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst minus jodizzle's nine lists on the wiki page.
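The "list minus list minus lists" filtering described here can be sketched as a set difference. This assumes one URL per line and exact string matching; the helper name `subtract_lists` and the sample URLs are hypothetical, and the actual comparison may well have been done with shell tools instead.

```python
def subtract_lists(candidate, *known):
    """Return URLs from `candidate` found in none of the `known` lists, keeping order."""
    seen = set()
    for lines in known:
        seen.update(line.strip() for line in lines)
    result = []
    for line in candidate:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)  # also drops duplicates inside the candidate list
            result.append(url)
    return result


# Hypothetical sample data, for illustration only.
new_list = ["https://v.example/a.mp4\n", "https://v.example/b.mp4\n", "https://v.example/c.m3u8\n"]
archive_list = ["https://v.example/a.mp4\n"]
wiki_lists = [["https://v.example/b.mp4\n"], []]

print(subtract_lists(new_list, archive_list, *wiki_lists))
```

Any iterable of lines works as input, so open file objects can be passed directly once the `.zst` lists have been decompressed.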
-
JAA
jodizzle: ^ More M3U8 for you. :-)
-
JAA
Throwing the `grep '\.mp4$'` of that into ArchiveBot now.
-
arkiver
EggplantN: do you happen to have an easy list of all of them?
-
EggplantN
not on hand
-
EggplantN
you can always try and queue again if you wanna check
-
arkiver
mostly thinking about embedded images now
-
EggplantN
dont need to use urls.js as the ones i did all had valid URLs
-
arkiver
i remember one appledaily had images which could not be extracted without custom code
-
arkiver
yeah
-
arkiver
so for next sites, if we want the embedded image, let's make sure the domain is in the github.com/ArchiveTeam/urls-grab/bl…aster/extract-outlinks-patterns.txt list
-
arkiver
the HTML alone is already very important of course
-
JAA
Hmm, I just remembered something... We can probably still grab Apple Daily images. Checking now.
-
JAA
Most article images were using a resizing server thingy, but the original image URL is in that resize URL and on another server and still up for now.
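The log does not show what the resize URLs actually looked like, so the following is only a sketch under two assumed layouts: the original image URL is passed either as a query parameter value or as a (possibly percent-encoded) suffix of the path. The function name `embedded_url` and all hosts are hypothetical.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def embedded_url(resize_url):
    """Return the URL embedded in a resizer URL, or None if none is found."""
    parts = urlsplit(resize_url)
    # Assumed layout 1: original URL given as a query parameter value.
    for _name, value in parse_qsl(parts.query):
        if value.startswith(("http://", "https://")):
            return value
    # Assumed layout 2: original URL embedded in the path, possibly percent-encoded.
    path = unquote(parts.path)
    for scheme in ("https://", "http://"):
        index = path.find(scheme)
        if index != -1:
            return path[index:]
    return None


# Hypothetical resize URLs, for illustration only.
print(embedded_url("https://resizer.example/thumb?src=https%3A%2F%2Fimg.example%2Fa%2Fb.jpg&w=640"))
print(embedded_url("https://resizer.example/640x360/https%3A%2F%2Fimg.example%2Fa%2Fb.jpg"))
```

Running something like this over the resize URLs found in already-archived HTML would give a list of original image URLs to queue while the image server is still up.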
-
JAA
-
dav3
-
rewby
arkiver: beta.thestandnews.com just redirects sitemaps to www.'s sitemap.
-
dav3
i downloaded ~533,000 images. i will output a url and file list for those..
-
JAA
-
JAA
Huh, that unearthed another 36 MP4s that aren't elsewhere. Nice. :-)
-
JAA
(No extra M3U8)
-
jodizzle
Ugh, so many m3u8s to unpack…
-
JAA
Yeah, and it's getting messy to see what we have and haven't covered despite the wiki page. :-|
-
hook54321
-
hook54321
woops
-
dav3
-
JAA
-
JAA
I'll process this later if no one beats me to it.
-
dav3
-
JAA
Cheers
-
JAA
-
AK
JAA: for the two above, any cleanup needing doing or can I just throw all the urls into ab? (I can manage getting the urls out and into a list)
-
JAA
AK: Not sure yet if AB or #// is better for these.
-
JAA
Depends on how many more there are. I'll get the ones from the AB jobs as well.
-
JAA
But these are the full images, so no processing needed in that sense.
-
AK
Alright, I'll turn them both into one list, then upload it here and it can go somewhere else
-
JAA
Upload as hk.appledaily.com-dav3-image-urls.zst please. :-)
-
EggplantN
Are they the same domain per URL? If so turn on page reqs
-
JAA
EggplantN: They're just direct URLs for images. The pages they were on aren't available anymore.
-
EggplantN
Ah okie. Sure, #// them; that box can take up to 11Gbit peaks inbound
-
AK
Now I've gotta work out how I zst them lmao
-
JAA
Nice thing about AB is that it produces a single grouped dataset.
-
JAA
But yeah, will get the ones from the two AB job DBs later and then decide.
-
AK
-
AK
Fuck knows if I managed that correctly
-
JAA
Thanks
-
AK
I'm not sure I did
-
AK
lol
-
AK
Actually I think I did
-
arkiver
zstd filename
-
arkiver
did you do that?
-
arkiver
then it should be fine
-
arkiver
:P
-
JAA
I usually increase the level a bit. If I feel patient, I go for `--ultra -22`.
-
JAA
But yeah, that'll do.
-
JAA
And will still beat `gzip -9` in virtually all cases.
-
AK
Don't get too mad, but I used 7zip gui lol
-
AK
Figured 11 was better than the default of 1
-
JAA
lol
-
AK
-
AK
Powershell for the extraction, then used a nice gui for the compressing back
-
AK
Can you tell I'm a windows admin ;)
-
JAA
Yes, my condolences.
-
JAA
I'd probably have done `awk -F, '{print $3}' input | zstd -8 -o out.zst`.
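For readers without awk at hand, a rough Python equivalent of the `awk -F, '{print $3}'` step is sketched below. It assumes, as the one-liner does, that the URL sits in the third comma-separated column and that no field contains a quoted comma; the function name `third_field` and the sample rows are made up. Compression is left out here, and the output would still be piped to `zstd` as in the original command.

```python
def third_field(lines):
    """Yield the third comma-separated field of each line (empty string if absent)."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Like awk, a line with fewer than three fields yields an empty value.
        yield fields[2] if len(fields) > 2 else ""


# Hypothetical rows, for illustration only.
rows = [
    "1,a.jpg,https://img.example/a.jpg\n",
    "2,b.jpg,https://img.example/b.jpg\n",
    "short,row\n",
]

for value in third_field(rows):
    print(value)
```

If the file used quoted fields containing commas, plain splitting (in awk or here) would pick the wrong column, and Python's `csv` module would be the safer choice.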
-
arkiver
damn
-
JAA
But I won't kinkshame. :-)
-
arkiver
if it works it works :P
-
arkiver
AK will upgrade to Linux at some points :)
-
arkiver
point*
-
HCross
_stares intently at arkiver_
-
AK
arkiver I use linux loads, how else do you download your Windows 10 iso?
-
thuban
achivarin: yep, running
-
thuban
-
AK
Can do
-
thuban
thanks!
-
thuban
achivarin: i am not sure whether youtube.com/user/eatravel corresponds to one of the playboard channel urls you linked or to something else. youtube says the page is not available; can you explain?
-
AK
BT Community Webkit closed on the 24th May 2021.
-
AK
-
AK
Did anyone know about BT Community Webkit?
-
JAA
Yeah, it was known.
-
AK
Damn, ahh well
-
EggplantN
AK it was done
-
EggplantN
see archiveteam_inbox
-
EggplantN
HCross did it
-
AK
Awesome
-
EggplantN