-
datechnoman
ok, we are back to ripping along :)
-
fireonlive
:D
-
fireonlive
nsweet
-
fireonlive
s/n//
-
fireonlive
rip T_T
-
» fireonlive hovers over datechnoman's card with scissors
-
datechnoman
Shouldn't I be doing that for you haha?
-
fireonlive
probably x3
-
fireonlive
one of them is metal tho will need some tin snips
-
datechnoman
Haha indescribable. Makes it sound like you've got some hectic credit limit x3
-
datechnoman
Indestructible***
-
datechnoman
Stupid autocorrect
-
fireonlive
xP not toooo high
-
datechnoman
Sounds like I'll be seeing you on the #// leaderboard then :P
-
arkiver
running very well!
-
arkiver
big thank you to everyone for all this effort into it :)
-
datechnoman
Thanks for fixing up the filtering :)
-
datechnoman
Otherwise we are just wasting cash and resources! arkiver
-
arkiver
it's running very smooth now :)
-
arkiver
which includes the latest update on getting outlinks from various sites
-
arkiver
so looks like we're over the peak of the bulk of initial data after that update :)
-
arkiver
datechnoman: well indeed :P
-
datechnoman
Yes! I think we hit the peak and now it's a matter of "catching up" with all the newly discovered stuff and the outlinks from those etc
-
arkiver
yep!
-
datechnoman
Been doing more and more spot testing and the data we are getting is def unique
-
arkiver
100 years from now the stuff we archive today will be incredibly valuable
-
datechnoman
Thinking it might be worthwhile getting a medical/researche url-source together. I know we have the new domains outlink thing you added in but more for checking like we do for the news sources daily to get the latest research/pdf's etc
-
datechnoman
research**
-
datechnoman
I am very surprised that a lot of the sites aren't in WBM or have very, very limited coverage
-
arkiver
datechnoman: those URLs would fit in the current urls-sources repo
-
arkiver
but there are other 'arxiv like' sites as well for other research areas i believe
-
arkiver
there is also biorxiv and medrxiv
-
audrooku|m
1.59M urls collected from 25k public onetab share pages
transfer.archivete.am/YgRD5/25k_onetab_urls.zst
-
audrooku|m
> arkiver: 100 years from now the stuff we archive today will be incredibly valuable
-
audrooku|m
agreed
-
datechnoman
Cheers arkiver will look to adding to it in the near future
-
datechnoman
Expand on our reach
-
datechnoman
Quality over quantity for content
-
thuban
fwiw, penetration of biorxiv and medrxiv is much lower in those fields than arxiv is in its
-
thuban
pubmed offers rss feeds derived from search results, so that's cool
-
thuban
i played around with it a little bit and a 'general' feed would be too big a firehose to drink from (the max limit is ~1000 entries, there are more than 1000 new entries a day, and items are only datestamped, not timestamped, so you couldn't catch them all by checking more frequently)
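thuban's arithmetic can be sketched as a toy model (assuming the feed simply shows the newest `cap` entries, which is an assumption about its behaviour): any items that scroll past the cap between two polls are lost for good, because date-only stamps leave no "since <time>" query to recover them.

```python
def missed_items(arrivals_between_polls, cap=1000):
    """Toy model of a capped feed: for each polling interval, any items
    beyond `cap` have already scrolled out of the feed by the next poll.
    With date-only (not time) stamps there is no way to ask the server
    for 'entries since <time>' to recover them."""
    return sum(max(0, n - cap) for n in arrivals_between_polls)

# e.g. three polling intervals; 1500 items arrived during the second one
missed = missed_items([400, 1500, 900])  # 500 items unrecoverable
```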
-
arkiver
thuban: 'too big firehose'? as in too many releases?
-
thuban
right
-
arkiver
i think we can just get them all
-
arkiver
from those feeds
-
thuban
no, they won't all show up
-
arkiver
ah right, datestamped, and only some selection of 1000 of them is shown
-
arkiver
i see
-
thuban
_however_, there's a complete list of journals (ftp.ncbi.nih.gov/pubmed/J_Medline.txt), and this isn't documented, but if you search by the nlm journal id there's just an rss feed for each, no need to even depend on their generation from queries :)
-
thuban
we would probably want at least one degree of outlinks on these (full text is not shown in the pubmed entry even if full text is available)
-
arkiver
we always queue back any .pdf URLs we find
-
arkiver
and we always extract any URLs from any PDF we archive
-
arkiver
so i think that should cover it?
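The "queue back .pdf URLs, extract URLs from PDFs" loop arkiver describes could look roughly like the regex sketch below. This is an illustration, not the project's actual extractor, and it only sees uncompressed byte ranges (real PDF content streams are usually Flate-compressed and would need decompressing first); link annotations, however, often carry a plain `/URI (...)` entry.

```python
import re

# /URI (...) entries from PDF link-annotation dictionaries
URI_RE = re.compile(rb'/URI\s*\(([^)]+)\)')
# plain http(s) URLs appearing literally in the byte stream
PLAIN_RE = re.compile(rb'https?://[^\s<>()"]+')

def extract_urls(pdf_bytes):
    """Collect candidate URLs from raw PDF bytes (uncompressed parts only)."""
    urls = set()
    for m in URI_RE.finditer(pdf_bytes):
        urls.add(m.group(1).decode('latin-1'))
    for m in PLAIN_RE.finditer(pdf_bytes):
        urls.add(m.group(0).decode('latin-1'))
    return sorted(urls)
```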
-
thuban
they're generally not linked as pdfs
-
thuban
(sometimes the journal is open-access but pubmed doesn't know this, eg pubmed.ncbi.nlm.nih.gov/38631827 (try the doi). if we're following outlinks anyway i don't think this matters for our purposes)
-
arkiver
(and i see now i could remove some duplicates from that list)
-
arkiver
so it should get outlinks including the doi.org one
-
thuban
ok, cool!
-
arkiver
the doi.org one then goes to the web page that has the links to the .pdf URL, which we then queue as well
-
arkiver
and queue all URLs from PDFs again, so eventual reach can be pretty far
-
thuban
should be trivial to generate the journal feed urls from that text file
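Generating the per-journal feed URLs from J_Medline.txt might look like the sketch below. The file is a series of records of `Key: value` lines separated by dashed lines; `FEED_TEMPLATE` here is a placeholder, since (as thuban notes) the per-journal feed scheme is undocumented and the real URL pattern would need to be confirmed by hand.

```python
# Hypothetical feed URL pattern -- confirm against the actual service.
FEED_TEMPLATE = "https://pubmed.ncbi.nlm.nih.gov/rss/journals/{nlm_id}/"

def journal_feeds(text):
    """Pull every NlmId out of J_Medline.txt-style records and build a
    candidate feed URL for each."""
    feeds = []
    for line in text.splitlines():
        if line.startswith("NlmId:"):
            nlm_id = line.split(":", 1)[1].strip()
            if nlm_id:
                feeds.append(FEED_TEMPLATE.format(nlm_id=nlm_id))
    return feeds
```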
-
arkiver
yep
-
arkiver
and they can be periodically queued and have new links extracted from them, i can put on a rate limit to make sure we don't overload stuff
-
arkiver
(the reason we don't use rate limits for specific patterns more often is that the way in which rate limiting is implemented can cause stalls and wasted resources, so it's used sparingly only, and in cases in which we'll likely not run into problems)
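The pattern-specific rate limit arkiver mentions could be sketched as a token bucket (a generic technique, not necessarily how the tracker actually implements it). The failure mode he describes falls out naturally: a worker that finds the bucket empty has to stall and wait.

```python
import time

class TokenBucket:
    """Simple token bucket: requests drain tokens, which refill at a
    fixed rate up to a burst capacity. An empty bucket means the caller
    must wait -- the 'stalls and wasted resources' cost of rate limits."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```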
-
thuban
there is a wrinkle in that some journals are open-access after a delay, so it would be nice to be able to re-queue after however many months. is there a good way to do that?
-
arkiver
yes, but i'm not sure if that is the right thing to do, or if this is now becoming more of a standalone project
-
arkiver
we have for example the monthly queued domains, we can treat these journal URLs in the same way
-
thuban
i defer to your expertise
-
thuban
here's the list of pmc journals and thier associated delays, if that helps: ncbi.nlm.nih.gov/pmc/journals/?filt…r=t1&titles=current&search=journals
-
thuban
*their
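The per-journal open-access delay thuban raises could be handled with simple month arithmetic: given a publication date and the journal's embargo in months, compute when the URL should go back on the queue. A hypothetical helper, not how the tracker schedules things:

```python
import calendar
from datetime import date

def add_months(d, months):
    """Add whole months to a date, clamping the day to the target
    month's length (e.g. Jan 31 + 1 month -> Feb 29 in a leap year)."""
    m = d.month - 1 + months
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def requeue_date(published, delay_months):
    """Earliest date the article is expected to be open access."""
    return add_months(published, delay_months)
```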
-
arkiver
interesting, i didn't know they had the predefined delays
-
arkiver
well, some of them
-
arkiver
i've also added the lists of scientific organisations and non-profit organisations sites to have outlinks extracted from
-
arkiver
education organizations from wikidata added as well
-
arkiver
while the educational organizations list is in urls-sources, we're not queueing them yet
-
arkiver
they are also in urls-grab though, which is used now during archiving
-
arkiver
when the queue is down, i'll update the lists we periodically queue
-
arkiver
if anyone else has ideas for what we should queue from wikidata, let me know!
-
arkiver
there is a chance things will grow again
-
arkiver
the queue that is
-
arkiver
we will see, i'll move items to secondary again then until we're looking better
-
datechnoman
Sounds great. Proper quality data. Love it!
-
arkiver
yeah!
-
Cheesy
-
kiska
Dear god... please put this in a paste or something similar
-
Cheesy
-
kiska
Oh my....
-
Cheesy
-
kiska
Put your list on paste.kiska.pw and then tell us what the link is
-
arkiver
kiska: i would rather promote transfer.archivete.am
-
JAA
^
-
imer
arkiver/JAA: www.viralcovert.com looks a bit weird (and looks like we're killing the site potentially)
transfer.archivete.am/BKKez/www.viralcovert.com.log
-
JAA
Looks like that's been going on for a couple days now.
-
JAA
I wonder where these come from. Doesn't seem to be in the sitemaps.
-
JAA
The items are custom:comment=special%2dinterest%2dfrom%2dmain&random=202404&url=..., so not sure about that either.
-
» JAA no touchy.
-
datechnoman
Getting soooooooo many dead URLs
-
datechnoman
Thought we might have pushed through them overnight but still going. Hopefully not much longer!