-
datechnoman
ok, we are back to ripping along :)
-
fireonlive
:D
-
fireonlive
nsweet
-
fireonlive
s/n//
-
fireonlive
rip T_T
-
» fireonlive hovers over datechnoman's card with scissors
-
datechnoman
Shouldn't I be doing that for you haha?
-
fireonlive
probably x3
-
fireonlive
one of them is metal tho will need some tin snips
-
datechnoman
Haha indescribable. Makes it sound like you've got some hectic credit limit x3
-
datechnoman
Indestructible***
-
datechnoman
Stupid autocorrect
-
fireonlive
xP not toooo high
-
datechnoman
Sounds like I'll be seeing you on the #// leaderboard then :P
-
arkiver
running very well!
-
arkiver
big thank you to everyone for all this effort into it :)
-
datechnoman
Thanks for fixing up the filtering :)
-
datechnoman
Otherwise we are just wasting cash and resources! arkiver
-
arkiver
it's running very smooth now :)
-
arkiver
which includes the latest update on getting outlinks from various sites
-
arkiver
so looks like we're over the peak of the bulk of initial data after that update :)
-
arkiver
datechnoman: well indeed :P
-
datechnoman
Yes! I think we hit the peak and now it's a matter of "catching up" with all the newly discovered stuff and the outlinks from those etc
-
arkiver
yep!
-
datechnoman
Been doing more and more spot testing and the data we are getting is def unique
-
arkiver
100 years from now the stuff we archive today will be incredibly valuable
-
datechnoman
Thinking it might be worthwhile getting a medical/researche url-source together. I know we have the new domains outlink thing you added in but more for checking like we do for the news sources daily to get the latest research/pdf's etc
-
datechnoman
research**
-
datechnoman
I am very surprised that a lot of the sites aren't in WBM or have very, very limited coverage
-
arkiver
datechnoman: those URLs would fit in the current urls-sources repo
-
arkiver
but there are other 'arxiv like' sites as well for other research areas i believe
-
arkiver
there is also biorxiv and medrxiv
-
audrooku|m
1.59M urls collected from 25k public onetab share pages
transfer.archivete.am/YgRD5/25k_onetab_urls.zst
-
audrooku|m
> arkiver: 100 years from now the stuff we archive today will be incredibly valuable
-
audrooku|m
agreed
-
datechnoman
Cheers arkiver will look to adding to it in the near future
-
datechnoman
Expand on our reach
-
datechnoman
Quality over quantity for content
-
thuban
fwiw, penetration of biorxiv and medrxiv is much lower in those fields than arxiv is in its
-
thuban
pubmed offers rss feeds derived from search results, so that's cool
-
thuban
i played around with it a little bit and a 'general' feed would be too big a firehose to drink from (the max limit is ~1000 entries, there are more than 1000 new entries a day, and items are only datestamped, not timestamped, so you couldn't catch them all by checking more frequently)
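thuban's arithmetic can be sketched as a toy model (assuming the feed simply shows the newest `cap` entries, which is an assumption about its behaviour): any items that scroll past the cap between two polls are lost for good, because date-only stamps leave no "since <time>" query to recover them.

```python
def missed_items(arrivals_between_polls, cap=1000):
    """Toy model of a capped feed: for each polling interval, any items
    beyond `cap` have already scrolled out of the feed by the next poll.
    With date-only (not time) stamps there is no way to ask the server
    for 'entries since <time>' to recover them."""
    return sum(max(0, n - cap) for n in arrivals_between_polls)

# e.g. three polling intervals; 1500 items arrived during the second one
missed = missed_items([400, 1500, 900])  # 500 items unrecoverable
```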
-
arkiver
thuban: 'too big firehose'? as in too many releases?
-
thuban
right
-
arkiver
i think we can just get them all
-
arkiver
from those feeds
-
thuban
no, they won't all show up
-
arkiver
ah right, datestamped, and only some selection of 1000 of them is shown
-
arkiver
i see
-
thuban
_however_, there's a complete list of journals (ftp.ncbi.nih.gov/pubmed/J_Medline.txt), and this isn't documented, but if you search by the nlm journal id there's just an rss feed for each, no need to even depend on their generation from queries :)
-
thuban
we would probably want at least one degree of outlinks on these (full text is not shown in the pubmed entry even if full text is available)
-
arkiver
we always queue back any .pdf URLs we find
-
arkiver
and we always extract any URLs from any PDF we archive
-
arkiver
so i think that should cover it?
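The "queue back .pdf URLs, extract URLs from PDFs" loop arkiver describes could look roughly like the regex sketch below. This is an illustration, not the project's actual extractor, and it only sees uncompressed byte ranges (real PDF content streams are usually Flate-compressed and would need decompressing first); link annotations, however, often carry a plain `/URI (...)` entry.

```python
import re

# /URI (...) entries from PDF link-annotation dictionaries
URI_RE = re.compile(rb'/URI\s*\(([^)]+)\)')
# plain http(s) URLs appearing literally in the byte stream
PLAIN_RE = re.compile(rb'https?://[^\s<>()"]+')

def extract_urls(pdf_bytes):
    """Collect candidate URLs from raw PDF bytes (uncompressed parts only)."""
    urls = set()
    for m in URI_RE.finditer(pdf_bytes):
        urls.add(m.group(1).decode('latin-1'))
    for m in PLAIN_RE.finditer(pdf_bytes):
        urls.add(m.group(0).decode('latin-1'))
    return sorted(urls)
```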
-
thuban
they're generally not linked as pdfs
-
thuban
(sometimes the journal is open-access but pubmed doesn't know this, eg pubmed.ncbi.nlm.nih.gov/38631827 (try the doi). if we're following outlinks anyway i don't think this matters for our purposes)
-
arkiver
(and i see now i could remove some duplicates from that list)
-
arkiver
so it should get outlinks including the doi.org one
-
thuban
ok, cool!
-
arkiver
the doi.org one then goes to the web page that has the links to the .pdf URL, which we then queue as well
-
arkiver
and queue all URLs from PDFs again, so eventual reach can be pretty far
-
thuban
should be trivial to generate the journal feed urls from that text file
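Generating the per-journal feed URLs from J_Medline.txt might look like the sketch below. The file is a series of records of `Key: value` lines separated by dashed lines; `FEED_TEMPLATE` here is a placeholder, since (as thuban notes) the per-journal feed scheme is undocumented and the real URL pattern would need to be confirmed by hand.

```python
# Hypothetical feed URL pattern -- confirm against the actual service.
FEED_TEMPLATE = "https://pubmed.ncbi.nlm.nih.gov/rss/journals/{nlm_id}/"

def journal_feeds(text):
    """Pull every NlmId out of J_Medline.txt-style records and build a
    candidate feed URL for each."""
    feeds = []
    for line in text.splitlines():
        if line.startswith("NlmId:"):
            nlm_id = line.split(":", 1)[1].strip()
            if nlm_id:
                feeds.append(FEED_TEMPLATE.format(nlm_id=nlm_id))
    return feeds
```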
-
arkiver
yep
-
arkiver
and they can be periodically queued and have new links extracted from them, i can put on a rate limit to make sure we don't overload stuff
-
arkiver
(the reason we don't use rate limits for specific patterns more often is that the way in which rate limiting is implemented can cause stalls and wasted resources, so it's used sparingly only, and in cases in which we'll likely not run into problems)
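The pattern-specific rate limit arkiver mentions could be sketched as a token bucket (a generic technique, not necessarily how the tracker actually implements it). The failure mode he describes falls out naturally: a worker that finds the bucket empty has to stall and wait.

```python
import time

class TokenBucket:
    """Simple token bucket: requests drain tokens, which refill at a
    fixed rate up to a burst capacity. An empty bucket means the caller
    must wait -- the 'stalls and wasted resources' cost of rate limits."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```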
-
thuban
there is a wrinkle in that some journals are open-access after a delay, so it would be nice to be able to re-queue after however many months. is there a good way to do that?
-
arkiver
yes, but i'm not sure if that is the right thing to do, or if this is now becoming more of a standalone project
-
arkiver
we have for example the monthly queued domains, we can treat these journal URLs in the same way
-
thuban
i defer to your expertise
-
thuban
here's the list of pmc journals and thier associated delays, if that helps: ncbi.nlm.nih.gov/pmc/journals/?filt…r=t1&titles=current&search=journals
-
thuban
*their
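The per-journal open-access delay thuban raises could be handled with simple month arithmetic: given a publication date and the journal's embargo in months, compute when the URL should go back on the queue. A hypothetical helper, not how the tracker schedules things:

```python
import calendar
from datetime import date

def add_months(d, months):
    """Add whole months to a date, clamping the day to the target
    month's length (e.g. Jan 31 + 1 month -> Feb 29 in a leap year)."""
    m = d.month - 1 + months
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def requeue_date(published, delay_months):
    """Earliest date the article is expected to be open access."""
    return add_months(published, delay_months)
```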
-
arkiver
interesting, i didn't know they had the predefined delays
-
arkiver
well, some of them
-
arkiver
i've also added the lists of scientific organisations and non-profit organisations sites to have outlinks extracted from
-
arkiver
education organizations from wikidata added as well
-
arkiver
while the educational organizations list is in urls-sources, we're not queueing them yet
-
arkiver
they are also in urls-grab though, which is used now during archiving
-
arkiver
when the queue is down, i'll update the lists we periodically queue
-
arkiver
if anyone else has ideas for what we should queue from wikidata, let me know!
-
arkiver
there is a chance things will grow again
-
arkiver
the queue that is
-
arkiver
we will see, i'll move items to secondary again then until we're looking better
-
datechnoman
Sounds great. Proper quality data. Love it!
-
arkiver
yeah!
-
Cheesy
-
kiska
Dear god... please put this in a paste or something similar
-
Cheesy
-
kiska
Oh my....
-
Cheesy
-
kiska
Put your list on paste.kiska.pw and then tell us what the link is
-
arkiver
kiska: i would rather promote transfer.archivete.am
-
JAA
^
-
imer
arkiver/JAA: www.viralcovert.com looks a bit weird (and looks like we're killing the site potentially)
transfer.archivete.am/BKKez/www.viralcovert.com.log
-
JAA
Looks like that's been going on for a couple days now.
-
JAA
I wonder where these come from. Doesn't seem to be in the sitemaps.
-
JAA
The items are custom:comment=special%2dinterest%2dfrom%2dmain&random=202404&url=..., so not sure about that either.
-
» JAA no touchy.
-
datechnoman
Getting soooooooo many dead URLs
-
datechnoman
Thought we might have pushed through them overnight but still going. Hopefully not much longer!