00:00:58 Ajay: hi! would you be able to set up a similar submission dashboard for so-net u-page+ as you did for mediafire?
00:01:16 we can spread it through twitter to allow users to submit websites
00:01:35 it's So-net U-Page+; we might want to put some japanese text on there
00:04:49 arkiver: Some Japanese people on a message board have done that already
00:05:31 http://geolog.mydns.jp/so-net/
00:05:31 OrIdow6: where?
00:05:49 do we have the output of that?
00:05:53 The text file is downloadable at one of those links
00:05:54 and was it tweeted around?
00:05:55 Yes
00:06:01 I don't know about Twitter
00:06:06 It was spread through something
00:06:19 Don't know if Google Translate told me
00:06:33 1 month ago, or whenever I read the thread
00:06:39 we should put it on our twitter
00:06:46 i want us to get more active on our twitter
00:09:57 Actually that could be good in general for announcing project launches
00:10:36 who actually has access
00:28:14 looks like it's already covered by that other site, but yea I can set that site up for any future projects if we need/want
00:30:55 I agree with announcing project launches on twitter
01:02:50 EggplantN your colo in kansas city ks/mo?
01:02:57 yes
01:03:16 only about 6 hrs south of me
01:03:29 worrying
01:03:38 lol
01:06:17 on a separate note, not sure if i am considered a regular, but I'd be happy to get a permanent target at hetzner if i would be allowed
01:07:16 Do we wanna grab the NRA as they are restructuring to a not for profit organisation and filing for bankruptcy?
01:08:38 Same actually. I have an AX-51-NVME that I would be more than happy to repurpose from workers to target if needed.
01:10:13 For now we're not too bad overall, we're coping and growing. Sadly one of our new targets isn't fully up to scratch and we're working to improve it
01:10:18 flashfire42: That's probably not a bad idea. From the brief glance I made, it looks like they are already a non-profit, but they want to close that and re-start in Texas.
01:10:30 Supposedly to avoid being sued in NYC
01:14:40 k, keep it in mind, open ended offer
01:15:18 i will do, thank you!
01:21:50 All I hear is an excuse for you to put more new hardware in a colo :D
01:23:58 :D
01:26:14 >_> Craigle
01:26:16 dont shame me
01:28:52 No shame, I'm all about it
01:29:00 >_>
03:36:36 WebBBS (Version 4.33; June 8, 2000)
03:36:36 http://web.archive.org/web/20000817194941/http://www.awsd.com/scripts/webbbs/
03:36:50 oh you already found that
03:37:15 well or found something close to it
03:37:31 Yeah, there's a full version history.
03:37:47 http://www.awsd.com/download/webbbs/history.txt
04:05:34 So in the interest of time, I am running a quick thing on CrowdMap to get report URLs
04:06:14 Which are, as far as I can tell, the actual data being captured - the rest of the site more or less does nothing but display reports
04:06:29 (With exceptions)
04:06:33 The AB job didn't get them
04:09:07 So I think these can be run via AB
04:09:14 Actually looking to be about 400k urls here
04:19:15 Yeah, looked like simple non-scripty HTML I think.
04:20:03 I couldn't find an example with comments though.
04:20:04 That part of the site, yeah
04:20:49 Well, nothing with many comments for potential pagination.
04:21:08 Anyway, since this might go down any second now, let's split it up into a couple jobs and run?
04:21:38 Yeah, this is very barebones in the interest of time
04:21:57 'Best Effort' SLA on this one. :-)
04:23:17 https://transfer.notkiska.pw/Ccm2G/crowdmap_reports_list_1.txt - about 66k URLs, mostly reports
04:23:30 Actually, http://archive.fart.website/archivebot/viewer/job/6hn3f did grab some reports. No idea how complete it is though.
04:25:16 Started
04:25:47 Well, it must be incomplete somewhere, because it grabbed fewer URLs than there are so far in this list I made
04:25:50 Thanks
04:27:06 Oof
04:27:21 Their server returned empty HTTP 200 responses for quite a few sites.
04:28:38 And it's doing that again.
04:29:12 I noticed that it did that when I typoed endpoint names
04:29:19 How can you tell when they're empty?
04:29:40 I'm looking at the WARCs.
04:29:50 And can reproduce it with curl.
04:29:59 Oh, I thought you meant the running job
04:30:14 Yes, I'm looking at that job's WARC.
04:30:32 Oh
04:30:57 It should be returning non-empty responses for the reports
04:31:21 so http://awsd.com/scripts/webbbs/ says "PLEASE GO TO TETRABB.COM FOR THE NEWEST VERSION OF THE WEBBBS FORUM". http://tetrabb.com/ says "Domain for sale". ooops.
04:31:36 Because it should have gotten the cookie beforehand
04:32:07 At least, that's what I experienced
04:32:46 Well, not necessarily.
04:32:50 Concurrency etc.
04:33:07 But yeah, it's the cookie stuff.
04:34:24 The cookie seems to be valid for an hour.
04:35:26 So if we shuffle the list and add a fake request at the beginning for /?archiveteam or similar, it should be fine.
04:36:56 Why shuffle the list?
04:37:28 Because it might not get to the end of the list within an hour.
04:37:33 Each request extends the cookie by an hour.
04:37:50 But by the time it gets to the bottom, those cookies might've expired already.
04:39:59 I've launched another recursive job with those requests.
04:42:08 By the way, excellent example of the -ot discussion earlier: the server sends a 'Refresh' header, which isn't standardised in HTTP headers, but browsers behave as if it was a refresh meta tag.
04:43:36 Anyway, we can leave the job with your list as is if you want, but we'll probably miss a handful of URLs.
04:45:23 Nope, the new job is still getting empty 200s even though the cookie is being sent. WTF is this shit?
04:45:57 Well, occasionally at least.
04:46:21 Better than the previous attempt anyway.
04:55:00 Ok, https://transfer.notkiska.pw/Q9qoc/crowdmap_reports_list_2.txt is the new one
04:55:27 Every domain name should be prefixed with http://domain.crowdwhatever.com/?archiveteam now
04:55:30 That has the same issue.
04:55:47 Unless run at 1 concurrency, but that's not reasonable with the deadline already over.
04:56:10 Oh
04:56:24 It would fetch multiple URLs at once, and every time it begins processing a new domain, it'll miss content on a few URLs.
04:57:10 In theory, you could put the /?archiveteam trick URL 'a bit further up', but it's hard to predict how many buffer URLs you need.
04:57:10 If it had more than one dummy request, would that work?
04:57:21 Oh, you got there before me
04:57:53 Sort of
04:58:18 Here's what I'd do: shuffle the list, then take the unique domains in the order they first appear in the file, then insert the trick URLs in that order at the beginning.
04:58:42 The shuffling also has the side effect that we'd get a random sample of the content if the site shuts down while we're still grabbing it.
04:59:04 As opposed to a strongly biased one.
04:59:43 Would it work to have a padding section in between the cookie-requests and the proper content?
05:00:10 That's a definite maybe.
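For reference, a minimal sketch of the list preparation JAA describes above: shuffle the report URLs, take each domain in the order it first appears after shuffling, and prepend one /?archiveteam trick request per domain so the cookie is set before the real fetches. The file names are placeholders, not the actual transfer.notkiska.pw lists.

```python
# Sketch of the list preparation described above (file names are assumptions).
import random
from urllib.parse import urlsplit

with open('crowdmap_reports_list.txt') as f:  # hypothetical input list
    urls = [line.strip() for line in f if line.strip()]

# Shuffle so a mid-grab shutdown leaves a random sample rather than a biased one.
random.shuffle(urls)

# One trick URL per domain, in the order domains first appear in the shuffled list,
# so each domain's cookie is fetched before its report URLs come up.
trick_urls = []
seen = set()
for url in urls:
    host = urlsplit(url).netloc
    if host not in seen:
        seen.add(host)
        trick_urls.append(f'http://{host}/?archiveteam')

with open('crowdmap_reports_list_improved.txt', 'w') as f:  # hypothetical output list
    f.write('\n'.join(trick_urls + urls) + '\n')
```

The padding idea discussed next would just mean appending a few extra throwaway /?foo requests to trick_urls before the real URLs start.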
05:00:12 :-P
05:00:40 Well, it's only a few extra requests, I'll throw it in there in case it does help
05:01:07 Make it /?disable /?your /?stupid /?cookie /?bullshit :-)
05:07:47 Ok, https://transfer.notkiska.pw/RLMJo/crowdmap_reports_list_2_improved.txt has the changes
05:12:15 Looks good, thanks.
05:12:21 Want to do the same thing for the first list?
05:14:06 Ok
05:15:04 https://transfer.notkiska.pw/hTYJ1/crowdmap_reports_list_1_improved.txt
05:28:31 Queued that to a full pipeline. Oops
05:30:54 But the list 2 job seems to be running fine now. :-)
05:45:13 Good
05:45:22 https://transfer.notkiska.pw/JsCaO/crowdmap_reports_list_3_improved.txt - nearly halfway done now
05:46:29 Though I'm just going in the order of the list scrape, which isn't random and doesn't seem to be completely alphabetical, either
06:10:36 Heya OrIdow6, I'll be taking over from JAA, tossing your work into AB
06:25:39 Ryz: Ok; it seems to have slowed down recently (think it's hit a few big sites), so it may be some more time until I have another
06:25:53 Check #archivebot
06:26:14 Reading logs
09:15:44 Bugtraq: BugTraq Shutdown - https://seclists.org/bugtraq/2021/Jan/0
09:16:05 Known
09:16:09 Thx
14:25:38 Hey all. In short, I'm looking for something that connects some 'web extraction/scraping' logic to WARC parsing. I can code in go (and python), but wanted to make sure I'm not overlooking anything. Basically I would like to convert a forum scrape from WARC to database records (post, user, etc.)
14:25:39 Hi avoozl. My go-to library for WARC parsing is warcio. You can use it to iterate over a WARC and then do whatever you need with the HTTP body.
14:26:02 That's Python. No idea if there's any decent Go libraries.
14:26:29 There are some reasonable libraries, but most of them stop at the content level. So parsing the HTTP response and converting it into the right character set will take some additional effort
14:27:23 warcio does parse HTTP responses.
14:27:39 You'll want the content_stream() of each record.
14:28:34 Thanks, I'll take a look at how that is implemented. The go library I was using just gives me the raw content stream, but doesn't do any handling of content encodings
14:29:56 Seems like I'll need to add quite a few parts to this go library, but that's fair. thanks
14:32:37 JAA: browsing through the warcio source, I don't think I can see it actually parsing/using the response header such as 'Content-Type: text/html; charset=UTF-8' ... Not sure how it currently selects the encoding
14:34:55 Hmm, I thought it did.
14:35:19 But yeah, looks like you're right.
14:35:25 don't want to make you sound stupid, but if you're parsing a single forum it's probably enough to just hardcode the relevant encoding
14:35:57 (if you're parsing many forums, I wanna talk about your project over dinner)
14:36:11 :-)
14:36:16 Yeah, agreed.
14:36:32 Sanqui: I'm parsing quite a few different forums, but usually everything using the same 'base' is ok. I'm currently in the process of expanding the scope a bit, and this bit me
14:37:08 right. well, remember that charset parsing is non-trivial anyway, and even browsers do quite a bit of guesswork
14:37:23 I wonder if Requests has a nice way to handle this.
14:37:37 Sanqui: I feared so.. I'm currently browsing through some go/net/http/response code, and they don't really have a great way of handling this either.. I'll check some other sources
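A minimal sketch of the warcio approach suggested above: iterate over a WARC and read the raw body of each HTTP response record via content_stream(). ArchiveIterator and content_stream() are real warcio APIs; the file name and the HTML-only filter are assumptions.

```python
# Sketch: iterate a WARC with warcio and hand the raw body to a forum parser.
from warcio.archiveiterator import ArchiveIterator

with open('forum.warc.gz', 'rb') as stream:  # hypothetical WARC file
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        content_type = record.http_headers.get_header('Content-Type', '')
        if 'html' not in content_type:
            continue  # skip images and other non-HTML fluff
        body = record.content_stream().read()  # raw, still-undecoded bytes
        # ... pass `body` (plus content_type for charset handling) to the extractor ...
        print(url, len(body))
```

Note that the body comes back as bytes; decoding it into text is exactly the charset problem discussed next.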
14:38:16 This should be useful: https://github.com/psf/requests/blob/c2b307dbefe21177af03f9feb37181a89a799fcc/requests/utils.py#L486
14:38:22 the standard workflow is probably 1. check the first 512 bytes for UTF-16 BOM or a
Sanqui: https://github.com/psf/requests/blob/4f6c0187150af09d085c03096504934eb91c7a9e/requests/models.py#L839 handles some of the magic on the python side
14:38:36 JAA: yes :)
14:39:03 I would definitely prioritize what the HTML document says over the HTTP header
14:39:17 Sanqui: RFC 2616 disagrees, but I guess reality is harsh :)
14:39:44 indeed
14:39:53 Yeah, you can't really implement an HTTP client based on the specs if you want to be compatible with shitty servers.
14:40:12 We just had that discussion in -ot last night. :-)
14:40:16 Haha, ok :)
14:40:23 browsers have gotten surprisingly good at this -- I've been browsing 2000s czech websites and I forgot half of them have byzantine encodings that firefox just autodetects
14:40:36 'good'
14:40:41 I'll see how far I can get with just the basics. If I hit any encoding snags in reality I'll come back to bug you :)
14:41:05 I'll stick around on the channel, sounds like some interesting discussions went on here :)
14:41:17 Browsers should simply refuse to display pages that don't specify the correct encoding per spec. Oh well, let's not have that discussion again. :-P
14:41:40 You may be interested in #archiveteam-ot and #archiveteam-dev as well.
14:41:48 Basically, I built a prototype of a scraper a while ago that can take a config file that determines which parts to extract (xpath/css matching) for certain url types, and then pushes it all into neo4j (not ideal, but easy to set up)
14:42:09 JAA: remember XHTML?
14:42:20 we've tried the whole "refuse to display non-standard pages" thing
14:42:28 Now I found the lovely trove of warc files on archive.org, and I'm rethinking part of my approach to just read an entire forum
14:43:01 Sanqui: Yes, I used to develop all my websites with XHTML. But the trainwreck had long left the station by that point.
14:43:45 avoozl: a database of web forums from archive.org is one of my dream projects
14:43:50 * avoozl has some flashbacks to structured web and OWL
14:44:04 See also Transfer-Encoding vs Content-Encoding, which *nobody* seems to use correctly.
14:44:26 Sanqui: I'm trying to keep it self-contained, I have experimented with blevesearch and dgraph before, but it is hard to work at scale. neo4j seems like a nice middle ground, but it will require a fairly beefy setup
14:44:40 old forums are a goldmine, a treasure trove of information, and as they drop out they're no longer searchable by google
14:44:47 even if we work to archive them
14:45:07 of course you could just dump everything into elastic and try the 'search' approach. But I like analytics, so I want things a bit more organized and referenced
14:45:37 absolutely, as a first step it'd be great to even just have metadata -- there's these fora, they had this many posts and users, click here to browse them in wayback
14:45:54 a graph of posts over time so you can say "prime time was 2007"
14:45:54 etc.
14:45:57 I've also had an idea for a project in this direction before. A standardised format for any sort of online discussion, extensible with platform-specific information as needed.
14:46:30 And then parsers that extract things accordingly from forums, social media, mailing lists, and whatnot.
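Tying back to the charset discussion above, a hedged sketch of the "prioritize what the HTML document says over the HTTP header" approach, with the header and a hardcoded default as fallbacks. The regexes are deliberately simplistic; real-world detection (BOM checks, chardet-style guessing) is considerably messier, as noted in the chat.

```python
# Sketch: pick an encoding for a response body, preferring the HTML meta
# declaration over the Content-Type header (regexes are simplified).
import re

def detect_encoding(body: bytes, content_type: str) -> str:
    # Look for <meta charset="..."> or charset=... inside an http-equiv meta
    # within the first couple of KB of the document.
    m = re.search(rb'charset=["\']?([\w-]+)', body[:2048], re.I)
    if m:
        return m.group(1).decode('ascii', 'replace')
    # Fall back to the HTTP header, e.g. 'text/html; charset=UTF-8'.
    m = re.search(r'charset=([\w-]+)', content_type, re.I)
    if m:
        return m.group(1)
    return 'iso-8859-1'  # RFC 2616 default; many forums are really utf-8 or cp1250/1251

def decode_body(body: bytes, content_type: str) -> str:
    return body.decode(detect_encoding(body, content_type), errors='replace')
```

For a single known forum, hardcoding the encoding (as Sanqui suggests) is simpler and avoids all of this.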
14:46:34 Sanqui: do you have any specific 'small' forum from archiveteam that you can recommend to try first? I'm currently just picking something at random, but I'd rather start with something that is pretty textual and not too large (say <100GB)
14:47:18 JAA: yeah, I've been thinking along those lines as well, but it is always difficult to scale these things properly, especially once 'time' becomes part of the storage structure (this user existed at this time, but later it disappeared, or changed alias, etc.)
14:47:36 Yeah
14:47:51 Try https://archive.org/details/forums.region.leagueoflegends.com_202003 perhaps.
14:47:56 JAA: everyone seems to want something different out of it.. for me I would like two things: browsing a forum with search, and running analytical queries (spark/graphql/python) on subsets of the data (and on the topology)
14:47:57 invisionfree and zetaboards used to have tons and archive.org archived a lot of them
14:48:01 and they're pretty standard
14:48:28 I'm currently archiving this major estonian forum active since 2002 https://foorum.hinnavaatlus.ee/
14:48:35 the LoL one looks good, I'll have a quick look at that
14:48:35 78756 users, 4860928 posts
14:49:13 Ages ago when geocities was archived I started playing with that, but I'm glad things have gotten a bit easier these days. that was a tricky set to get anything out of
14:49:30 avoozl: Big advantage is that those archives only contain the relevant HTML pages, no images, videos, outlinks, or other fluff.
14:49:47 So that should make for a nice small test bed.
14:50:05 JAA: yeah, that makes it perfect. I started at something from the archiveteam eu domains and that was 99% non-html
14:50:22 Then once that works, try an ArchiveBot crawl of a small forum I suppose.
14:51:20 in 2007 I archived a czech forum about pet birds that seems to be offline now
14:51:21 operenci.cz https://archive.fart.website/archivebot/viewer/job/29l18
14:51:35 sorry 2017*
14:51:58 I'll have a go at LoL first, see when I find some time to work on this
14:52:18 also speaking of google, I've come to realize just how irrelevant it's gotten when it comes to finding quality information
14:52:33 increasingly I'm finding better information just by searching reddit, hacker news, a discord server relevant to the topic, or heck, literally doing a fulltext search in my telegram chats
14:52:52 times are changing
14:54:45 What I like most is that you can easily store the entire history of reddit on your desktop machine, unless you want media
14:54:55 but yeah, accessibility of archived data is one thing we (archive team) are not that great at... there's always so much going on that by the time a project is done, we immediately move onto the next thing that's in danger
14:55:37 which is fine, the data is saved and we keep focus, BUT I would be delighted to see more projects making use of the archives, analyzing, enabling ease of access
14:55:53 so, thumbs up from me
14:56:22 ++
14:56:30 It is also difficult to find the right 'size' of software project. 'accessing data' could easily spiral out of control into some planet-sized IPFS key-value store with auto-indexing and distributed version control... which of course will never be finished, and the user experience will be backlogged into the better half of this century
14:56:54 it's absolutely true
14:57:14 I would just like something fairly simple, see if it sticks. Then if anyone wants to move it further, lovely.
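For context, a hypothetical sketch of the config-driven extraction avoozl describes earlier: URL patterns mapped to CSS selectors, yielding flat post records that could be loaded into neo4j, SQLite, or anything else. The patterns and selectors are made up for illustration and do not reflect the actual League of Legends forum markup.

```python
# Hypothetical sketch of config-driven forum extraction (selectors are placeholders).
import re
from bs4 import BeautifulSoup

CONFIG = {
    r'/forums/.+/topics/\d+': {   # hypothetical topic-page URL pattern
        'post':   'div.post',
        'author': 'span.author',
        'body':   'div.post-body',
        'date':   'time',
    },
}

def extract(url: str, html: str):
    """Yield one record per post for any URL matching a configured pattern."""
    for pattern, sel in CONFIG.items():
        if not re.search(pattern, url):
            continue
        soup = BeautifulSoup(html, 'html.parser')
        for post in soup.select(sel['post']):
            def text(css):
                node = post.select_one(css)
                return node.get_text(strip=True) if node else None
            yield {
                'url':    url,
                'author': text(sel['author']),
                'body':   text(sel['body']),
                'date':   text(sel['date']),
            }
```

Combined with the WARC iteration and charset sketches above, this is roughly the WARC-to-database pipeline being discussed.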
14:57:20 a narrow, detail-oriented focus is often better than casting a net that's too wide
14:58:22 Also, some shiny-object syndrome exists... I tend also to think 'oh maybe I could run openface-id neural nets to pick up all the faces in the parler data'.. and then of course, you COULD do that, but it feels worse than just working at the core of things
14:58:42 #adhd
15:00:25 YES
15:00:44 This is why my AT indexer is still not running, basically. :-P
15:00:50 a bizarre forum I'm archiving right now :p https://turfwarsapp.com/forum/43/topic/4824171/
15:01:06 for a geolocation-based mobile game
15:10:36 they used to play ingress here, that must be a ton of data too
15:11:25 motivated by a friend, I archived a few websites pertaining to geolocation based games because they've been having a hard time with the pandemic
15:11:28 ingress is probably safe
16:57:15 JAA: is there any trick to getting files from archive.org faster? I typically see curl/wget drop to like 250KB/sec after a while and it takes forever for most downloads
16:57:53 or is that normal speed
16:59:00 average speed on larger files seems to be around 400KB/sec.. 1,91G 397KB/s in 94m 3s
17:02:34 -purplebot- A Million Ways to Die on the Web edited by KamafaDelgato (+14) just now -- https://www.archiveteam.org/?diff=46185&oldid=44414
17:03:34 -purplebot- Template:IRC edited by Justcool393 (-12, Default to hackint for ArchiveTeam …) just now -- https://www.archiveteam.org/?diff=46186&oldid=41611
17:04:34 -purplebot- This Is My Jam edited by Flashfire42 (+104) just now -- https://www.archiveteam.org/?diff=46188&oldid=31812
17:04:34 -purplebot- FileTrip edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46189&oldid=35949
17:05:34 -purplebot- Wildscreen Arkive edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46190&oldid=34427
17:05:51 avoozl: Nope, that's normal speed. :-/
17:06:22 Ok, I'll just throw it all into the queue
17:14:16 Re: comments in #archiveteam .... I suppose some think that bugtraq is comparatively boring...
17:33:34 -purplebot- Coronavirus edited by Wessel1512 (+26, /* Archives and dedicated sites …) just now -- https://www.archiveteam.org/?diff=46193&oldid=46100
17:34:34 -purplebot- Template:IRC edited by JustAnotherArchivist (+12, Reverted edits by [[Special:Contributions/Justcool393|Justcool393]] …) 22 minutes ago -- https://www.archiveteam.org/?diff=46191&oldid=46186
17:45:34 -purplebot- This Is My Jam edited by Sanqui (+8, use job template for job id) 20 minutes ago -- https://www.archiveteam.org/?diff=46192&oldid=46188
17:48:49 On the Template:IRC edit reversal: that breaks pretty much every single IRC channel mention on the wiki. We need a mass edit at the same time as changing the default network in the template. There's been a bit of discussion in here about how to best do that and possibly also tackle the issue of dead channels at the same time (e.g. '#archiveteam-bs (on hackint), formerly #foobar (on EFnet)'). Until
17:48:55 that happens, the IRC template should stay as it is, even though it's messy. (Cc justcool393)
17:55:22 Is the https://git.savannah.gnu.org/git/gnulib.git repo down?
17:56:51 We're not GNU, but it's up now.
17:57:07 (And yeah, I could reproduce it being down at first.)
19:36:05 I have created a pull request to update the Dockerfile and add Wget-AT as a common dependency in the container itself, so that all the projects can use it: https://github.com/ArchiveTeam/warrior-dockerfile/pull/44
20:39:41 DaxServer — I think the warrior hasn’t been used in a while, but I would be interested to see how a dockerized version works, with all the C&C coordination that is required, etc....
20:42:14 Speaking of Warrior-type activities, does anyone know what the status is of a warrior-style archive project for the community.fantasyflightgames.com site?
20:42:20 I know there was the ArchiveBot pipeline 218c8179a369ceb37a999add83e36442 but that’s just a single source that is probably getting throttled or temporarily banned frequently, and I don’t know if they’ve even made a single full complete run yet.
20:44:29 brad: he left sadly, but I do believe it would've been nice for pre-parler for people to have it available
20:45:07 Yeah. ;(
20:45:08 we have lots of new helpers now and if it works, pending approval and review from the devs, it would be ideal till trackerv2 & warriorv4 are ready
20:45:32 either way this is probably best for #archiveteam-dev
20:48:29 Thanks! I’ll head over there....
20:57:42 Oh, I do have another question — are the WARCs created by ArchiveBot and other ArchiveTeam projects available anywhere for download? Some of the folks on the FFG SWRPG Discord are setting up a new community-owned forum site, and would also like to have their own searchable archive of the FFG community site, and I know the WARCs are key to doing that.
21:01:09 brad: https://archive.org/details/archivebot
21:03:30 https://archive.fart.website/archivebot/viewer/job/5l4qk
21:36:46 Thanks!
21:48:59 brad: The main job for community.fantasyflightgames.com finished, so the bulk of the site should be archived. However, there were a large number of 403s, which are now running in a separate job (erulqjgzn97r2xiab2yqe1qqv).
21:50:39 This isn't a perfect solution, because the way it's set up, AB won't recurse on the URLs in the second job (it's an '!ao <', not an '!a').
21:50:48 Yeah, I wasn’t able to find the original job on the trackers, so I assumed it had finished or shut down. I did find the one to sweep through and pick up the 403s for something like 180k links? Wow, that’s a lot of 403s....
21:51:18 It is 180k, but it's not all from community.fantasyflightgames.com. It's from other domains as well (including some that may 403 naturally)
21:52:13 It makes total sense that you would do multiple runs. It hadn’t occurred to me that the best way to pick up the 403s was to do a non-recursive list of specific URLs to try, however. But that is kinda clever.
21:52:13 If I have time, I might look into other ways to retrieve any missing pages.
21:52:48 Much appreciated!
21:53:19 Keep in mind that it's not really the best way, because again, there might be URLs on those 403-ing pages that the original job never got to.
21:53:29 Hopefully not too many, though.
21:56:42 Right, so ideally you’d want to do multiple recursive runs, plus non-recursive runs with specific URLs.
21:56:54 And I’m happy to help with that in any way I can.