#archiveteam-bs

02:03

anarcat

location.services.mozilla.com to be sunset, announced on March 13th
02:03

nicolas17

rip
02:26

JAA

Fuck software patents.
02:31

fireonlive

++
03:38

HP_Archivist

pokechu22: I know you looked at libraw.org and adjusted, it finished earlier. (I had stepped away for a few hours..) Did you abort early or just adjust the crawl?
05:19

pokechu22

HP_Archivist: I got rid of something that *should* have been junk (pagination URLs where there was a second, different, ignored, pagination param), but I'm not 100% sure if it was complete or not. Those URLs did make up most of the queue. I'll double-check.
05:22

HP_Archivist

pokechu22: Ah alright, yeah I appreciate it. I recently learned that library is the basis for a variety of other software that processes camera RAW files. And the LibRaw project is itself based on *another* project that ceased in 2018, which I also want to make sure gets crawled properly
05:22

HP_Archivist

dechifro.org/dcraw
05:51

pokechu22

HP_Archivist: I can confirm that it successfully requested all 125 pages from libraw.org/comments/recent?page=1 to libraw.org/comments/recent?page=125. The ignore I added was for stuff like libraw.org/comments/recent?destinat…=comments/recent%3Fpage%3D10&page=1 to
05:51

pokechu22

libraw.org/comments/recent?destinat…omments/recent%3Fpage%3D10&page=125 which are the exact same as the actual page list, but with a second page number in the middle that does nothing (so instead of 125 requests for 125 pages, it'd be 15625 requests... which is just silly). The site's complete.
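
A minimal sketch of the kind of ignore described above, not the actual pattern that was added to the job. It assumes the junk URLs are exactly those under /comments/recent that carry some parameter other than `page` whose decoded value embeds another `page=`. The log truncates the real name of that extra parameter, so `PLACEHOLDER_PARAM` is a hypothetical stand-in and the check deliberately does not depend on the name.

```python
from urllib.parse import urlsplit, parse_qsl

LISTING_PATH = "/comments/recent"
# Hypothetical stand-in: the real parameter name is truncated in the log.
PLACEHOLDER_PARAM = "extra"


def is_redundant_pagination(url: str) -> bool:
    """True if the URL is a listing page with a second, embedded page number."""
    parts = urlsplit(url)
    if parts.path != LISTING_PATH:
        return False
    # parse_qsl percent-decodes values, so "%3Fpage%3D10" becomes "?page=10".
    params = parse_qsl(parts.query, keep_blank_values=True)
    return any(name != "page" and "page=" in value for name, value in params)


pages = range(1, 126)

# The 125 real listing pages.
real = [f"https://libraw.org{LISTING_PATH}?page={p}" for p in pages]

# Every combination of embedded page number and actual page number.
junk = [
    f"https://libraw.org{LISTING_PATH}"
    f"?{PLACEHOLDER_PARAM}=comments/recent%3Fpage%3D{inner}&page={p}"
    for inner in pages
    for p in pages
]

kept = [u for u in real + junk if not is_redundant_pagination(u)]

print(len(junk))  # 125 * 125 combinations
print(len(kept))  # only the real listing pages survive the ignore
```

In a real crawl the same idea would normally be written as an ignore regex on the queued URL; the function form here just makes the 125 versus 15625 arithmetic easy to check.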
05:53

HP_Archivist

Hm, alright. Thank you for checking!
06:38

h2ibot

Petchea created Piapro (+1301, Created page with "{{Infobox project | title =…): wiki.archiveteam.org/?title=Piapro
06:48

h2ibot

Petchea edited Piapro (+899): wiki.archiveteam.org/?diff=51888&oldid=51887
06:51

h2ibot

Petchea edited Piapro (+52): wiki.archiveteam.org/?diff=51889&oldid=51888
06:52

h2ibot

Petchea edited Piapro (-28): wiki.archiveteam.org/?diff=51890&oldid=51889
06:54

h2ibot

Petchea edited Piapro (+283): wiki.archiveteam.org/?diff=51891&oldid=51890
07:04

h2ibot

Petchea edited Piapro (+132, not just music): wiki.archiveteam.org/?diff=51892&oldid=51891
12:23

imer

"On April 10th, 2024 the cell data downloads will be deleted and will no longer be available. " DELETED? (re mozilla location services)
12:52

PredatorIWD

imer: Ran the downloads page location.services.mozilla.com/downloads through IA with Save outlinks on and it actually got First archive on most links but someone should still check it since that can miss crawling some links from the page.
12:53

imer

it might not grab the larger downloads properly I think?
12:53

PredatorIWD

Is there any surefire way to save a page like this on IA other than the basic web.archive.org/save UI?
12:54

imer

archivebot, I'm sure someone will run it through
12:54

PredatorIWD

imer: I manually entered the 2 big downloads it missed as well, might have missed some smaller ones also
12:54

Barto

i've thrown location.services.mozilla.com into archivebot
12:55

imer

thanks!
20:46

eightthree

are there projects for repeatedly backing up job posting sites (and maybe marketplaces/classifieds like gumtree) since those tend to be deleted fairly quickly, not sure if archive.org is anything close to thorough at keeping copies...
20:47

JAA

If you have a good list of such sites/pages, we could throw them into #//'s thing.
22:59

eightthree

JAA: they are often geographically limited, how much do I need to parse these and come up with a deduplicated, sorted list, also carving out subsections of host/domains that are job specific (i.e. the format of urls for jobs in hackernews, linkedin, etc?): en.wikipedia.org/wiki/List_of_employment_websites
22:59

eightthree

en.wikipedia.org/wiki/.jobs
22:59

eightthree

github.com/lukasz-madon/awesome-remote-job
22:59

eightthree

github.com/hugo53/awesome-RemoteWork
22:59

eightthree

github.com/zenika-open-source/awesome-remote-work
22:59

eightthree

github.com/engineerapart/TheRemoteFreelancer
22:59

eightthree

github.com/remoteintech/remote-jobs (seems like a list of employer websites, not frequently posted and deleted content like jobs)
22:59

eightthree

github.com/lukasz-madon/awesome-rem…adme-ov-file#job-boards-aggregators
22:59

eightthree

github.com/lukasz-madon/awesome-rem…e-job?tab=readme-ov-file#job-boards
23:00

JAA

Ideally, you'd compile a list and create a PR against github.com/ArchiveTeam/urls-sources .
23:03

eightthree

ok, this won't be done anytime in the next week by me, but if someone wants to start PRing I welcome it
23:04

eightthree

I was actually surprised this hasn't been done yet by archiveteam, I was hoping to find out where the archives are! (currently job searching and in a bit of a hurry to find income)
23:29

nicolas17

eightthree: why archive them? historical data analysis?
23:31

nicolas17

for actual job search you'd only care about recent posts...
23:49

eightthree

nicolas17: so there's urgent searching, and then there's "how long to wait" and how infrequently a "dream job" or "rare desirable item for sale" shows up on job sites/classifieds/marketplaces. I was noticing a lot of "this post disappeared" type messages, getting confused and annoyed, and feeling like this was worth publicly archiving...
23:50

nicolas17

(what's a "dream job", I don't understand)
23:51

nicolas17

(half joking)
