-
nyuuzyou
Just finished parsing the forum woman.ru; processed the whole dataset with regex and extracted a list of links to external sites. I think they will be useful here -
transfer.archivete.am/LWPGL/womanru.txt.zst
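A rough sketch of the kind of regex extraction described above (the pattern, the `internal_host` filter, and the function name are illustrative assumptions, not nyuuzyou's actual script):

```python
import re

# Hypothetical example: pull external links out of raw forum text with a
# regex, skipping links back to the forum's own host. The pattern and the
# filtering rule are assumptions for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_external_links(text, internal_host="woman.ru"):
    """Return unique external URLs found in `text`, in first-seen order."""
    seen = set()
    links = []
    for url in URL_RE.findall(text):
        if internal_host in url:
            continue  # skip links pointing back at the forum itself
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```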
-
datechnoman
-
h2ibot
datechnoman: Registering YROzuz94 for '!a transfer.archivete.am/LWPGL/womanru.txt'
-
h2ibot
datechnoman: Deduplicating and queuing 167708 items. (YROzuz94)
-
h2ibot
datechnoman: Deduplicated and queued 167708 items. (YROzuz94)
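The "deduplicating and queuing" step the bot reports could look roughly like the sketch below; the real tracker backend isn't shown anywhere in this log, so the function and its arguments are purely illustrative:

```python
# Minimal order-preserving dedup-and-queue sketch (illustrative only; not
# the actual tracker code). `already_queued` stands in for whatever state
# the tracker keeps about items it has seen before.
def dedupe_and_queue(items, already_queued=None):
    """Drop duplicates and already-queued items, keep first-seen order."""
    queued = set(already_queued or ())
    queue = []
    for item in items:
        if item not in queued:
            queued.add(item)
            queue.append(item)
    return queue
```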
-
datechnoman
^^^ nyuuzyou - I've uploaded your script into the tracker which will be processed when the target is back online :) ^^^
-
datechnoman
Not script, urls sorry
-
nyuuzyou
thanks
-
datechnoman
No worries. Thanks for sharing
-
datechnoman
arkiver JAA - It's only early days (nearly 24 hours of being down) but worth asking if either of you have a contact method for rewby|backup?
-
datechnoman
I cannot breathe properly when #// isn't chewing urls </3
-
arkiver
datechnoman: ouch i see
-
arkiver
let's give them a bit more time, i hope this will be fixed soon
-
datechnoman
Ack no worries. I need to get another hobby >.<
-
datechnoman
Will keep chewing away urls either way :)
-
arkiver
this is a great hobby to have though :) we're saving very important data
-
arkiver
as i have come to find out personally when occasionally checking for more recent stuff in the Wayback Machine
-
datechnoman
55 million hits this week alone on #// items
-
datechnoman
The stats speak for themselves
-
datechnoman
235 million in the last month
-
arkiver
years from now this will likely simply be the most valuable web collection
-
datechnoman
CommonCrawl is trying to gain on us but we have got them beat >;)
-
datechnoman
It will be invaluable. My wife still doesn't get what or why I do this but I really enjoy it and have learnt so much :)
-
arkiver
hah :) i check those stats regularly as well
-
arkiver
datechnoman: that is very nice, especially on the learning part too
-
arkiver
\me pretty much got into programming because of Archive Team
-
» arkiver *
-
arkiver
:P sad
-
» arkiver pretty much got into programming because of Archive Team
-
arkiver
did a little before too, but not serious
-
Vokun
A decent portion of the stuff being added now is from CommonCrawl anyways no? It's just going to be in a much more accessible format
-
datechnoman
They do not extract / ingest .pdf files etc whereas we do
-
datechnoman
They are much more focused on their grabs
-
datechnoman
They want/like to sample websites
-
datechnoman
Which is what we do but we are much more widespread
-
datechnoman
And we grab certain outlinks etc
-
datechnoman
We are very much the same in certain aspects but they don't crawl all the different outlinks from say Blogger, Telegram etc
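The outlink-grabbing described above might be sketched like this with the standard-library HTML parser; the host list and function names are made-up examples, not the project's actual follow rules:

```python
from html.parser import HTMLParser

# Illustrative host allow-list; the real project's outlink rules are not
# shown in this log.
FOLLOW_HOSTS = ("blogger.com", "t.me")

class OutlinkParser(HTMLParser):
    """Collect <a href> targets that point at hosts we want to follow."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and any(h in value for h in FOLLOW_HOSTS):
                self.outlinks.append(value)

def extract_outlinks(html):
    parser = OutlinkParser()
    parser.feed(html)
    return parser.outlinks
```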
-
h2ibot
datechnoman: Deduplicating and queuing 24997237 items. (PBLGmrlF)
-
h2ibot
datechnoman: Deduplicating and queuing 3416579 items. (BOQ8dvzU)
-
h2ibot
datechnoman: Deduplicated and queued 3416579 items. (BOQ8dvzU)
-
h2ibot
datechnoman: Deduplicated and queued 24997237 items. (PBLGmrlF)
-
arkiver
nyuuzyou: thank you!!
-
h2ibot
arkiver: Deduplicating and queuing 212690 items. (bOrp62lm)
-
h2ibot
arkiver: Deduplicated and queued 212690 items. (bOrp62lm)