00:45:01 More fun with TechnologyGuide: if the URL contains 'nonexistant', the server resets the connection: http://forum.notebookreview.com/nonexistant/
00:45:54 notebookreview is closing? :/
00:46:08 Yep
00:46:40 that sucks
00:46:49 My qwarc archive of the thread pages just finished a few minutes ago. Should be complete apart from the countless broken shit like the above.
00:47:17 what even.... 'temp' and 'nonexistant'
00:47:43 http://forum.notebookreview.com/threads/asus-v6j-everest-benchmarks.43048/ returns an empty page.
00:47:57 It's kind of hilarious just how broken these forums are.
00:50:31 http://forum.notebookreview.com/nessus/
00:50:36 ¯\_(ツ)_/¯
00:52:12 they have some terribly broken WAF paired with a 10-year-old corrupted forum database?
05:14:23 TechnologyGuide forum archive post counts: 1841025 from Brighthand (1880674 on homepage), 53604 from DigitalCameraReview (58092), 4180267 from NotebookReview (no global stats, about 9.35M from adding up the subforum numbers, but those don't include everything), 508834 from TabletPCReview (521529)
05:15:05 The NotebookReview discrepancy seems pretty bad, but no idea where it comes from. I didn't see any systematic problems in the data.
05:17:39 Adding up the 'Messages' numbers for the subforums gives 9347229 there. This doesn't include http://forum.notebookreview.com/forums/nbr-marketplace.18/ (which is shown as a link on the homepage instead of a forum entry).
06:53:25 looking for someone who helped to archive soup.io
07:09:48 stormy: Unless you're looking for someone to describe their personal experiences on the project, it's best just to ask your question
07:13:22 fair enough. so if I understood correctly, the archived data gets uploaded directly to archive.org and nothing else. what I'm trying to find out is: did the crawler get around the "Content Warning" pages, and, once on archive.org, how do I get around the content warning pages?
07:20:02 Do you have an example?
07:20:43 sure: https://web.archive.org/web/20191201125421/http://einefragevonstil.soup.io/
07:21:27 I have enough experience in web crawling to know how it can work on the crawling side, but not with archive.org...
07:33:54 So I don't see anything to indicate that the project got those
07:34:21 where can I find a list of soup.io hosts that the project did cover?
07:35:30 I do not think any such list exists
07:35:50 Well, I expect it was deleted since that project was 2 years ago
07:36:18 Also, that site was "covered" by the project, I just don't see anything getting around the content warning
07:37:20 If you do want a list I can give you some quick info on how to generate it
07:37:55 that'd be great, thanks
07:41:24 stormy: https://web.archive.org/web/*/http://example.com/* will list all URLs
07:41:29 Use https://archive.org/services/docs/api/internetarchive/cli.html to download all the .os.cdx.gz files in collection:archiveteam:soup.io, then decompress them, extract the URLs, and then the domains
07:42:13 It would be better to tell us what you're trying to do, though, since asking for what the project covered is fairly specific
07:44:07 thanks, will try both of these later. what I'm trying to do is get an offline mirror of some soups that I remember. anything that still exists. I was hoping that it would exist in other forms than on archive.org, but I guess I'll have to make do with that.
07:44:30 gotta run for now, back in an hour.
08:51:03 OrIdow6: are you sure that collection:archiveteam:soup.io is the correct identifier? It gives me a query error: https://archive.org/advancedsearch.php?q=collection%3Aarchiveteam%3Asoup.io&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes
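The wildcard listing suggested at 07:41:24 can also be done programmatically through the Wayback Machine's CDX search API. Below is a minimal Python sketch, using the soup host from 07:20:43 as the example; note that it lists every capture the Wayback Machine holds for that host, not only the ones uploaded by the soup.io project, and the field/parameter choices are just one reasonable selection.

import requests

CDX_ENDPOINT = 'https://web.archive.org/cdx/search/cdx'

def list_captured_urls(host):
    # matchType=host returns captures for every path under the hostname;
    # collapse=urlkey keeps one row per distinct URL.
    params = {
        'url': host,
        'matchType': 'host',
        'output': 'json',
        'fl': 'original,timestamp',
        'collapse': 'urlkey',
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
    # With output=json, the first row is the field-name header; the rest are captures.
    return [original for original, timestamp in rows[1:]]

if __name__ == '__main__':
    for url in list_captured_urls('einefragevonstil.soup.io'):
        print(url)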
08:58:23 Usually you just search for a file from the project in the UI and you can find the collection that way
08:58:54 Also, if you're worried the data only lives on the IA, you can just list the items using the CLI (and collection name) and just download them all
08:59:33 Usually you can get most of it to replay by loading the warcs into pywb3. Sometimes you need to find some additional warcs with static site data like javascript and css files.
08:59:38 The IA has them somewhere usually
09:01:56 I'm still having trouble getting an item list from a url prefix, either with the CLI or the web
09:02:31 You're right, I said it wrong, that should be archiveteam_soupio, not archiveteam:soupio
09:09:12 thanks, that's getting me somewhere
09:52:32 this company is allegedly shutting down https://cyberninjas.com/ https://edition.cnn.com/2022/01/07/politics/cyber-ninjas-shutting-down-arizona/index.html
10:27:25 Anyone know how to request specific IDs with the API at https://www.thiswebsitewillselfdestruct.com/api/get_letter
10:27:52 Currently all IDs are randomized and random messages are displayed
10:51:27 ee
15:29:27 Hey JAA. If you don't mind me asking, how goes the archiving effort for the TechnologyGuide sites?
15:36:10 Thanks JAA!
15:36:12 https://archive.org/details/technologyguide_forums_20220125
18:04:18 Apparently Roblox is banning all users with YT or any other reference to off-site platforms, be ready to see some 404s
18:04:26 *some
18:07:36 https://www.youtube.com/watch?v=9DBb6_aVS4M
20:00:44 JustAnotherArchivist edited TechnologyGuide (+53): https://wiki.archiveteam.org/?diff=48220&oldid=48214
20:00:45 Gridkr edited Coronavirus/Affected companies (+401): https://wiki.archiveteam.org/?diff=48221&oldid=45063
20:00:48 anyone archive all of the CIA World Factbook? https://www.cia.gov/the-world-factbook/
20:41:22 Making sure a copy is grabbed now, duce1337
20:43:05 ok
21:33:31 OrIdow6: I'm currently very busy with my day job, so I haven't had the time to spend on the Ukrainian archive
21:34:19 But I've got some more lists of URLs that I need to clean before they can be added to the wiki
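Circling back to the soup.io discussion above: tying the .os.cdx.gz approach from 07:41:29 together with the corrected collection identifier from 09:02:31, here is a rough sketch using the internetarchive Python library instead of the ia CLI. It assumes the usual ' CDX N b a m s k r M S V g' index layout with the original URL in the third column; check the header line of the downloaded files if the layout differs.

import gzip
import os
from urllib.parse import urlsplit

import internetarchive as ia

DESTDIR = 'soupio_cdx'  # local directory for the downloaded index files

def download_cdx_files():
    # Enumerate the collection's items and fetch only the .os.cdx.gz index files.
    for result in ia.search_items('collection:archiveteam_soupio'):
        ia.download(result['identifier'], glob_pattern='*.os.cdx.gz',
                    destdir=DESTDIR, ignore_existing=True)

def covered_hosts():
    # Assumption: the original URL is the third space-separated field of each CDX line.
    hosts = set()
    for root, _dirs, files in os.walk(DESTDIR):
        for name in files:
            if not name.endswith('.os.cdx.gz'):
                continue
            with gzip.open(os.path.join(root, name), 'rt', errors='replace') as f:
                for line in f:
                    if line.startswith(' CDX'):  # header line
                        continue
                    fields = line.split(' ')
                    if len(fields) > 2:
                        hosts.add(urlsplit(fields[2]).hostname)
    return hosts

if __name__ == '__main__':
    download_cdx_files()
    for host in sorted(h for h in covered_hosts() if h):
        print(host)

The same search_items()/download() calls without the glob pattern would pull down the full WARCs, which can then be loaded into pywb for local replay as described at 08:59:33.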