-
Terbium
I see a bunch of free and paid APIs for M&A feeds
-
fireonlive
hmm
-
fireonlive
if there’s a good rss feed i could hook it up
-
Terbium
There's an RSS feed
-
fireonlive
this seems to be the url for the feed:
seekingalpha.com/tag/m-a.xml
-
fireonlive
i’m sitting in a vehicle on my phone so hard to tell for sure haha
-
Terbium
fireonlive: yep that's the one, it has the stock ticker symbols like the other FMP feed
-
fireonlive
ah awesome :)
-
Terbium
which makes finding companies a lot easier
-
Terbium
it also showed failed or cancelled M&As
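For anyone wiring this feed up later: a minimal sketch of pulling titles and links out of an RSS 2.0 document with just the standard library. The feed URL comes from the discussion above; the item fields are standard RSS 2.0, but treat the whole thing as an untested assumption about what the feed actually serves.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Feed URL mentioned above; may change or require a User-Agent header.
FEED_URL = "https://seekingalpha.com/tag/m-a.xml"

def parse_rss_items(xml_text):
    """Return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items

def fetch_feed(url=FEED_URL):
    """Fetch the live feed and parse it (network access required)."""
    with urllib.request.urlopen(url) as resp:
        return parse_rss_items(resp.read())
```

Since the titles apparently carry stock ticker symbols like `(NASDAQ:ABCD)`, matching companies can be done with a simple regex over the parsed titles.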
-
fireonlive
i’ll toss it up in #m&a if that suits everyone when i’m back at a more proper computer later; just out and about with a friend who’s visiting for the first time in a while
-
qwertyasdfuiopghjkl
thewrap.com/gannett-drops-ap-associated-press-usa-today "Gannett, publisher of USA Today and hundreds of local newspapers, will stop using the Associated Press’ content starting next week, [...] will eliminate AP dispatches, photos and video as of March 25, according to an internal memo"
-
qwertyasdfuiopghjkl
Not sure if this means removal of existing content or just discontinuing new content
-
qwertyasdfuiopghjkl
apnews.com/article/gannett-associat…ct-97405e4715c9a25d21477b992028db2a "Shortly after, AP said it had been informed by McClatchy that it would also drop the service."
nytimes.com/2024/03/19/business/med…-mcclatchy-ap-associated-press.html "McClatchy [...] told its editors this week that it would stop
-
qwertyasdfuiopghjkl
using some A.P. services next month." "[McClatchy] said that The A.P.’s feed would end on March 29 and that no A.P. content could be published after March 31." apparently there's also another one
-
fireonlive
#m&a is now set up, we should see if it works within the hour :3
-
fireonlive
Terbium++
-
eggdrop
[karma] 'Terbium' now has 2 karma!
-
newbie007
is it possible to upload locally archived websites to internet archive such that they are searchable using wayback machine?
-
pabs
that isn't possible
-
arkiver
RIP original redis
-
ikkoup
Hi,
-
ikkoup
Would you be interested in archiving the biggest (and only) Arabic archive of literary magazines? Its owner died last week and it's at risk of going down at any time.
-
ikkoup
the site also has a sitemap (
archive.alsharekh.org/sitemap.xml) which would help ramp things up!
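A sitemap like that is easy to enumerate programmatically. Here is a minimal sketch of extracting every `<loc>` URL from a sitemap or sitemap-index document with the standard library; it assumes the file follows the usual sitemaps.org 0.9 namespace, which hasn't been verified against this particular site.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return every <loc> URL from a sitemap or sitemap index.

    Works for both <urlset> (page lists) and <sitemapindex>
    (lists of child sitemaps), since both nest URLs in <loc>.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

If the top-level file is an index, you would recurse: fetch each child sitemap it lists and run the same extraction on it.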
-
pokechu22
Hmm, the stats are 2 million pages, 326,446 articles, 52,234 writers, 273 magazines, 15,857 issues. It looks like images are directly embedded (view-source:https://archive.alsharekh.org/Articles/293/20679/470610 has <img _ngcontent-sc1 class="slide_image" src="MagazinePages\Magazine_JPG\Al_Shariqa\Al_Shariqa_2017\Issue_3\014.jpg"
-
pokechu22
data-normal="MagazinePages\Magazine_JPG\Al_Shariqa\Al_Shariqa_2017\Issue_3\014.jpg" data-full="MagazinePages\Magazine_JPG\Al_Shariqa\Al_Shariqa_2017\Issue_3\014.jpg"> + <base href="/">) and archivebot extracts those correctly, and the server doesn't mind the backslashes not being replaced by the browser with forward slashes
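The backslash detail above is worth capturing: a lenient crawler has to honor the `<base href="/">` and can optionally normalize `\` to `/` the way browsers do (the server reportedly accepts either). A small sketch of that resolution, using only `urllib.parse`:

```python
from urllib.parse import urljoin

def resolve_asset(page_url, base_href, src):
    """Resolve a (possibly backslash-separated) asset path.

    Mirrors what a browser does: apply <base href>, then resolve the
    relative src against it. Backslashes are swapped for forward
    slashes first, as browsers do for URLs.
    """
    normalized = src.replace("\\", "/")
    return urljoin(urljoin(page_url, base_href), normalized)
```

With the example above, the `014.jpg` src on the article page resolves to a `MagazinePages/Magazine_JPG/...` URL directly under the site root.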
-
ikkoup
Yes, it uses "flipbuilder.com" (PDF Page Flipper) to make the reading pages.
-
ikkoup
Don't know if you encountered that before. sorry for my weak language.
-
pokechu22
I think archivebot will work here - 2 million URLs is a bit large, but we've done bigger. Do you know if it's at risk of shutting down in a few weeks, or if it'll probably be up for months?
-
pokechu22
hmm,
archive.alsharekh.org/contents/293/20679 requires a bunch of API requests to e.g.
archiveapi.alsharekh.org/Search/IssueIndex?IID=20679 actually; archivebot probably won't follow those
-
ikkoup
Hmm, not sure.
-
ikkoup
The owner was a pioneer of the Arabic language in the early days of computing, and he (and his company at the time) added Arabic support to almost every OS/software of the era.
-
ikkoup
The company isn't very active these days and he stepped down from it. I guess it'd be up for a few months considering his finances and tech background?
-
pokechu22
... though
archive.alsharekh.org/sitemap10.xml links to articles, so it *would* find all of the articles, but the table of contents would not work unless we did that separately (which would not be *too* hard)
-
ikkoup
Not sure if it's possible, but can you ignore the API requests?
-
ikkoup
It's for info about individual articles which is not as important as the whole issue/chapter/magazine (
archive.alsharekh.org/MagazinePages/MagazineBook/~xxx)
-
ikkoup
The important stuff is at the above url structure, the API acts like an index for the issue (article 1 is at page 3, article 2 is at page 6 etc)
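If the API really is just an index (article N starts at page M), reconstructing it offline is straightforward once the mapping is captured. A hypothetical sketch, with invented field names since the real `IssueIndex` response shape was not inspected here:

```python
def article_page_ranges(index_entries, total_pages):
    """Turn an issue index into (article, first_page, last_page) triples.

    `index_entries` is assumed to look like
    [{"article": ..., "start_page": n}, ...]; these field names are
    hypothetical, standing in for whatever the real API returns.
    Each article runs until the page before the next article starts.
    """
    entries = sorted(index_entries, key=lambda e: e["start_page"])
    ranges = []
    for i, entry in enumerate(entries):
        if i + 1 < len(entries):
            last = entries[i + 1]["start_page"] - 1
        else:
            last = total_pages
        ranges.append((entry["article"], entry["start_page"], last))
    return ranges
```

That mapping is small compared to the page images, so even if archivebot skips the API, the index could be captured separately later.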
-
pokechu22
Hmm,
archive.alsharekh.org/MagazinePages…l_Maarefa_2020/Issue_681/index.html doesn't have any URLs archivebot would find in it... archivebot won't work well with that flipbook
-
pokechu22
it looks like
archive.alsharekh.org/MagazinePages…sue_681/mobile/javascript/config.js has bookConfig.totalPageCount=337 and bookConfig.CreatedTime ="201204132846"
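That `totalPageCount` is enough to enumerate candidate page-image URLs for an issue without rendering the flipbook at all. A sketch, assuming the zero-padded three-digit filenames seen in the `014.jpg` example earlier (the exact directory layout per issue is not confirmed):

```python
def flipbook_page_urls(issue_base, total_pages):
    """Enumerate likely page-image URLs for one issue.

    `issue_base` is the issue's image directory URL and `total_pages`
    comes from bookConfig.totalPageCount in config.js. The zero-padded
    three-digit naming (001.jpg, 014.jpg, ...) is an assumption based
    on the one example URL observed above.
    """
    return [f"{issue_base}/{page:03d}.jpg" for page in range(1, total_pages + 1)]
```

A list like this could be fed to archivebot (or any fetcher) directly, sidestepping the JavaScript-only flipbook navigation.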
-
ikkoup
If you check the dev inspector (Ctrl+Shift+I) you can see that the flipbook is just a bunch of images and js.
-
ikkoup
I guess it's not possible after all eh?
-
pokechu22
It would be possible, but it would require additional work to make the flipbooks function
-
pokechu22
archive.alsharekh.org/Articles/293/20679/470610 links the images directly though so that would work. Do all magazines have both flipbooks and those /Articles/ pages?
-
pokechu22
archive.alsharekh.org/Articles/293/20679/470610 has a blue "تصفح العدد" ("Browse the Issue") button that opens
archive.alsharekh.org/MagazinePages…/Al_Shariqa_2017/Issue_3/index.html so it seems like flipbooks do exist for everything... but I can't see where that link comes from
-
ikkoup
The whole thing is basically a giant flip book :(
-
ikkoup
And I'm not very sure about the articles pages, but they exist for most of the archive (unindexed issues have no articles, only a flipbook)
-
pokechu22
I'll start it in archivebot just to get *something*, and hopefully a solution for the flipbooks can be found afterwards
-
pokechu22
Thanks for letting us know about the site, we probably wouldn't have found it otherwise :)
-
pokechu22
I assume the rest of alsharekh.org should also be saved?
-
arkiver
thank you ikkoup!
-
arkiver
yeah it might be interesting to save everything on that site
-
arkiver
at least into WARCs, perhaps separate items on IA as well
-
ikkoup
Not really, alsharekh.org is a landing page for other services run by the same guy.
-
ikkoup
a Lexicon, a Dictionary (acquired by the Saudi government), Tashkeel (a corrector for Arabic vowel marks/diacritics) and a spell checker. I guess they can't be saved.
-
ikkoup
I also tried to set up grab-site (
github.com/ArchiveTeam/grab-site) on a VPS to help crawl the archive, but had some trouble with Python 3.8 not being supported.
-
Terbium
ikkoup: I would recommend using a container or Python version manager for grab-site in that case to drop back down to Python 3.7
-
pokechu22
That said, archivebot isn't a distributed project - running grab-site locally would mean you grab the entire site yourself, and additionally archivebot grabs the entire site by itself. It won't make things run faster.
-
ikkoup
Ah, I thought it was something like the ArchiveTeam Warrior.
-
ikkoup
I wanted to run grab-site since it has some advanced crawling/scraping capabilities for forums like vBulletin and SMF which are not found in the other crawling/scraping tools I looked up.
-
arkiver
i realise i don't know much about storj
-
arkiver
is it just private storage, or can files be made available from elsewhere - page requisites and such?
-
kiska
I think you can use storj as S3
-
kiska
Which I guess means you could have some site assets on storj being served
-
kiska
Or something like that
-
arkiver
right
-
kpcyrd
is there a channel for archiving #web3?
-
arkiver
archiving web3?
-
arkiver
so like... archiving blockchains?
-
FireFly
I thought part of the point was that it's kind of implicitly so already due to its distributed nature
-
arkiver
that's not archiving
-
FireFly
..fair
-
kpcyrd
the question was tongue in cheek, I probably should've made that more obvious :)
-
h2ibot
Censuro edited Talk:URLTeam (+983, /* Shouldn't archive.today be considered a URL…):
wiki.archiveteam.org/?diff=51913&oldid=26103
-
h2ibot
Popthebop edited Talk:Deathwatch (+423, /* the Tom Lehrer website containing original…):
wiki.archiveteam.org/?diff=51914&oldid=51350
-
h2ibot
Popthebop edited Talk:Tumblr (+1278, /* Current state of tumblr | IMPORTANT */ new…):
wiki.archiveteam.org/?diff=51915&oldid=45705
-
h2ibot
Sepro edited List of websites excluded from the Wayback Machine (+24, Add loom.com):
wiki.archiveteam.org/?diff=51916&oldid=51896
-
h2ibot
Flama12333 edited Deathwatch (+167, added realtek ftp sadly):
wiki.archiveteam.org/?diff=51917&oldid=51901
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=51918&oldid=51916
-
h2ibot
JacksonChen666 edited Deathwatch (+3, fix citation errors):
wiki.archiveteam.org/?diff=51919&oldid=51917
-
michaelblob
how are people doing log agg? looking into grafana loki but getting piss poor performance generating graphs
-
michaelblob
also eyeing influxdb but not sure how/where that fits in
-
Barto
work uses an ELK stack
-
nstrom|m
Just using dozzle on individual servers, no agg
-
pabs
arkiver, kpcyrd: I wonder if Web3 is as distributed as advertised? relatedly NFTs certainly aren't, lots of them apparently just load stuff off HTTP
-
nicolas17
lmk when there's anything of value worth archiving, too
-
AK
I did ELK, but then it was approaching hundreds of GB of logs per day, now I just use dozzle everywhere 🤷‍♂️ At work we use Azure stuff and grafana if we need graphs
-
AK
dozzle does everything I need for almost all my personal stuff:
logs.hel1.aktheknight.co.uk
-
icedice
JAA if you haven't gotten The PokéCommunity completely archived by now, you might want to put it high up on the priority list. A Pokémon fan game website was just shut down by DMCA:
twitter.com/RelicCastleCom/status/1770901435867361351
-
icedice
The PokéCommunity probably has the largest Pokémon fan game community out there, and they had four games C&D'd a while ago, so the ninja lawyers are well aware that they exist
-
Terbium
why they gotta do my PokeCommunity like that....
-
nulldata
Terbium - because Nintendo loathes its fans.
-
Terbium
Also, they really should have hosted the site in a DMCA-ignored location. After so many DMCAs over the decades, it seems like this lesson is never learned