00:00:58 Ajay: hi! would you be able to set up a similar submission dashboard for so-net u-page+ as you did for mediafire?
00:01:16 we can spread it through twitter to allow users to submit websites
00:01:35 it's So-net U-Page+; we might want to put some japanese text on there
00:04:49 arkiver: Some Japanese people on a message board have done that already
00:05:31 http://geolog.mydns.jp/so-net/
00:05:31 OrIdow6: where?
00:05:49 do we have the output of that?
00:05:53 The text file is downloadable at one of those links
00:05:54 and was it tweeted around?
00:05:55 Yes
00:06:01 I don't know about Twitter
00:06:06 It was spread through something
00:06:19 Don't know if Google Translate told me
00:06:33 1 month ago, or whenever I read the thread
00:06:39 we should put it on our twitter
00:06:46 i want us to get more active on our twitter
00:09:57 Actually that could be good in general for announcing project launches
00:10:36 who actually has access
00:28:14 looks like it's already covered by that other site, but yea I can set that site up for any future projects if we need/want
00:30:55 I agree with announcing project launches on twitter
01:02:50 EggplantN your colo in kansas city ks/mo?
01:02:57 yes
01:03:16 only about 6 hrs south of me
01:03:29 worrying
01:03:38 lol
01:06:17 on a separate note, not sure if i am considered a regular, but I'd be happy to get a permanent target at hetzner if i would be allowed
01:07:16 Do we wanna grab the NRA as they are restructuring to a not for profit organisation and filing for bankruptcy?
01:08:38 Same actually. I have an AX-51-NVME that I would be more than happy to repurpose from workers to target if needed.
01:10:13 For now we're not too bad overall, we're coping and growing. Sadly one of our new targets isn't fully up to scratch and we're working to improve it
01:10:18 flashfire42: That's probably not a bad idea. From the brief glance I made, it looks like they are already a non-profit, but they want to close that and re-start in Texas.
01:10:30 Supposedly to avoid being sued in NYC
01:14:40 k, keep it in mind, open ended offer
01:15:18 i will do, thank you!
01:21:50 All I hear is an excuse for you to put more new hardware in a colo :D
01:23:58 :D
01:26:14 >_> Craigle
01:26:16 dont shame me
01:28:52 No shame, I'm all about it
01:29:00 >_>
03:36:36 WebBBS (Version 4.33; June 8, 2000)
03:36:36 http://web.archive.org/web/20000817194941/http://www.awsd.com/scripts/webbbs/
03:36:50 oh you already found that
03:37:15 well or found something close to it
03:37:31 Yeah, there's a full version history.
03:37:47 http://www.awsd.com/download/webbbs/history.txt
04:05:34 So in the interest of time, I am running a quick thing on CrowdMap to get report URLs
04:06:14 Which are, as far as I can tell, the actual data being captured - the rest of the site more or less does nothing but display reports
04:06:29 (With exceptions)
04:06:33 The AB job didn't get them
04:09:07 So I think these can be run via AB
04:09:14 Actually looking to be about 400k urls here
04:19:15 Yeah, looked like simple non-scripty HTML I think.
04:20:03 I couldn't find an example with comments though.
04:20:04 That part of the site, yeah
04:20:49 Well, nothing with many comments for potential pagination.
04:21:08 Anyway, since this might go down any second now, let's split it up into a couple jobs and run?
04:21:38 Yeah, this is very barebones in the interest of time
04:21:57 'Best Effort' SLA on this one. :-)
04:23:17 https://transfer.notkiska.pw/Ccm2G/crowdmap_reports_list_1.txt - about 66k URLs, mostly reports
04:23:30 Actually, http://archive.fart.website/archivebot/viewer/job/6hn3f did grab some reports. No idea how complete it is though.
04:25:16 Started
04:25:47 Well, it must be incomplete somewhere, because it grabbed fewer URLs than there are so far in this list I made
04:25:50 Thanks
04:27:06 Oof
04:27:21 Their server returned empty HTTP 200 responses for quite a few sites.
04:28:38 And it's doing that again.
04:29:12 I noticed that it did that when I typoed endpoint names
04:29:19 How can you tell when they're empty?
04:29:40 I'm looking at the WARCs.
04:29:50 And can reproduce it with curl.
04:29:59 Oh, I thought you meant the running job
04:30:14 Yes, I'm looking at that job's WARC.
04:30:32 Oh
04:30:57 It should be returning non-empty responses for the reports
04:31:21 so http://awsd.com/scripts/webbbs/ says "PLEASE GO TO TETRABB.COM FOR THE NEWEST VERSION OF THE WEBBBS FORUM". http://tetrabb.com/ says "Domain for sale". ooops.
04:31:36 Because it should have gotten the cookie beforehand
04:32:07 At least, that's what I experienced
04:32:46 Well, not necessarily.
04:32:50 Concurrency etc.
04:33:07 But yeah, it's the cookie stuff.
04:34:24 The cookie seems to be valid for an hour.
04:35:26 So if we shuffle the list and add a fake request at the beginning for /?archiveteam or similar, it should be fine.
04:36:56 Why shuffle the list?
04:37:28 Because it might not get to the end of the list within an hour.
04:37:33 Each request extends the cookie by an hour.
04:37:50 But by the time it gets to the bottom, those cookies might've expired already.
04:39:59 I've launched another recursive job with those requests.
04:42:08 By the way, excellent example of the -ot discussion earlier: the server sends a 'Refresh' header, which isn't standardised in HTTP headers, but browsers behave as if it was a refresh meta tag.
04:43:36 Anyway, we can leave the job with your list as is if you want, but we'll probably miss a handful of URLs.
04:45:23 Nope, the new job is still getting empty 200s even though the cookie is being sent. WTF is this shit?
04:45:57 Well, occasionally at least.
04:46:21 Better than the previous attempt anyway.
04:55:00 Ok, https://transfer.notkiska.pw/Q9qoc/crowdmap_reports_list_2.txt is the new one
04:55:27 Every domain name should be prefixed with http://domain.crowdwhatever.com/?archiveteam now
04:55:30 That has the same issue.
04:55:47 Unless run at 1 concurrency, but that's not reasonable with the deadline already over.
04:56:10 Oh
04:56:24 It would fetch multiple URLs at once, and every time it begins processing a new domain, it'll miss content on a few URLs.
04:57:10 In theory, you could put the /?archiveteam trick URL 'a bit further up', but it's hard to predict how many buffer URLs you need.
04:57:10 If it had more than one dummy request, would that work?
04:57:21 Oh, you got there before me
04:57:53 Sort of
04:58:18 Here's what I'd do: shuffle the list, then take the unique domains in the order they first appear in the file, then insert the trick URLs in that order at the beginning.
04:58:42 The shuffling also has the side effect that we'd get a random sample of the content if the site shuts down while we're still grabbing it.
04:59:04 As opposed to a strongly biased one.
04:59:43 Would it work to have a padding section in between the cookie-requests and the proper content?
05:00:10 That's a definite maybe.
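For reference, a minimal sketch of the list preparation JAA describes above: shuffle the report URLs, take each domain in the order it first appears after shuffling, and prepend one /?archiveteam trick request per domain so the cookie is set before the real fetches. The file names are placeholders, not the actual transfer.notkiska.pw lists.

```python
# Sketch of the list preparation described above (file names are assumptions).
import random
from urllib.parse import urlsplit

with open('crowdmap_reports_list.txt') as f:  # hypothetical input list
    urls = [line.strip() for line in f if line.strip()]

# Shuffle so a mid-grab shutdown leaves a random sample rather than a biased one.
random.shuffle(urls)

# One trick URL per domain, in the order domains first appear in the shuffled list,
# so each domain's cookie is fetched before its report URLs come up.
trick_urls = []
seen = set()
for url in urls:
    host = urlsplit(url).netloc
    if host not in seen:
        seen.add(host)
        trick_urls.append(f'http://{host}/?archiveteam')

with open('crowdmap_reports_list_improved.txt', 'w') as f:  # hypothetical output list
    f.write('\n'.join(trick_urls + urls) + '\n')
```

The padding idea discussed next would just mean appending a few extra throwaway /?foo requests to trick_urls before the real URLs start.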
05:00:12 :-P
05:00:40 Well, it's only a few extra requests, I'll throw it in there in case it does help
05:01:07 Make it /?disable /?your /?stupid /?cookie /?bullshit :-)
05:07:47 Ok, https://transfer.notkiska.pw/RLMJo/crowdmap_reports_list_2_improved.txt has the changes
05:12:15 Looks good, thanks.
05:12:21 Want to do the same thing for the first list?
05:14:06 Ok
05:15:04 https://transfer.notkiska.pw/hTYJ1/crowdmap_reports_list_1_improved.txt
05:28:31 Queued that to a full pipeline. Oops
05:30:54 But the list 2 job seems to be running fine now. :-)
05:45:13 Good
05:45:22 https://transfer.notkiska.pw/JsCaO/crowdmap_reports_list_3_improved.txt - nearly halfway done now
05:46:29 Though I'm just going in the order of the list scrape, which isn't random and doesn't seem to be completely alphabetical, either
06:10:36 Heya OrIdow6, I'll be taking over from JAA, tossing your work into AB
06:25:39 Ryz: Ok; it seems to have slowed down recently (think it's hit a few big sites), so it may be some more time until I have another
06:25:53 Check #archivebot
06:26:14 Reading logs
09:15:44 Bugtraq: BugTraq Shutdown - https://seclists.org/bugtraq/2021/Jan/0
09:16:05 Known
09:16:09 Thx
14:25:38 Hey all. In short, I'm looking for something that connects some 'web extraction/scraping' logic to WARC parsing. I can code in go (and python), but wanted to make sure I'm not overlooking anything. Basically I would like to convert a forum scrape from WARC to database records (post, user, etc.)
14:25:39 Hi avoozl. My go-to library for WARC parsing is warcio. You can use it to iterate over a WARC and then do whatever you need with the HTTP body.
14:26:02 That's Python. No idea if there's any decent Go libraries.
14:26:29 There are some reasonable libraries, but most of them stop at the content level. So parsing the HTTP response and converting it into the right character set will take some additional effort
14:27:23 warcio does parse HTTP responses.
14:27:39 You'll want the content_stream() of each record.
14:28:34 Thanks, I'll take a look at how that is implemented. The go library I was using just gives me the raw content stream, but doesn't do any handling of content encodings
14:29:56 Seems like I'll need to add quite a few parts to this go library, but that's fair. thanks
14:32:37 JAA: browsing through the warcio source, I don't think I can see it actually parsing/using the response header such as 'Content-Type: text/html; charset=UTF-8' ... Not sure how it currently selects the encoding
14:34:55 Hmm, I thought it did.
14:35:19 But yeah, looks like you're right.
14:35:25 don't want to make you sound stupid, but if you're parsing a single forum it's probably enough to just hardcode the relevant encoding
14:35:57 (if you're parsing many forums, I wanna talk about your project over dinner)
14:36:11 :-)
14:36:16 Yeah, agreed.
14:36:32 Sanqui: I'm parsing quite a few different forums, but usually everything using the same 'base' is ok. I'm currently in the process of expanding the scope a bit, and this bit me
14:37:08 right. well, remember that charset parsing is non-trivial anyway, and even browsers do quite a bit of guesswork
14:37:23 I wonder if Requests has a nice way to handle this.
14:37:37 Sanqui: I feared so.. I'm currently browsing through some go/net/http/response code, and they don't really have a great way of handling this either.. I'll check some other sources
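A minimal sketch of the warcio approach suggested above: iterate over a WARC and read the raw body of each HTTP response record via content_stream(). ArchiveIterator and content_stream() are real warcio APIs; the file name and the HTML-only filter are assumptions.

```python
# Sketch: iterate a WARC with warcio and hand the raw body to a forum parser.
from warcio.archiveiterator import ArchiveIterator

with open('forum.warc.gz', 'rb') as stream:  # hypothetical WARC file
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        content_type = record.http_headers.get_header('Content-Type', '')
        if 'html' not in content_type:
            continue  # skip images and other non-HTML fluff
        body = record.content_stream().read()  # raw, still-undecoded bytes
        # ... pass `body` (plus content_type for charset handling) to the extractor ...
        print(url, len(body))
```

Note that the body comes back as bytes; decoding it into text is exactly the charset problem discussed next.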
14:38:16 This should be useful: https://github.com/psf/requests/blob/c2b307dbefe21177af03f9feb37181a89a799fcc/requests/utils.py#L486
14:38:22 the standard workflow is probably 1. check the first 512 bytes for UTF-16 BOM or a
Sanqui: https://github.com/psf/requests/blob/4f6c0187150af09d085c03096504934eb91c7a9e/requests/models.py#L839 handles some of the magic on the python side
14:38:36 JAA: yes :)
14:39:03 I would definitely prioritize what the HTML document says over the HTTP header
14:39:17 Sanqui: RFC 2616 disagrees, but I guess reality is harsh :)
14:39:44 indeed
14:39:53 Yeah, you can't really implement an HTTP client based on the specs if you want to be compatible with shitty servers.
14:40:12 We just had that discussion in -ot last night. :-)
14:40:16 Haha, ok :)
14:40:23 browsers have gotten surprisingly good at this -- I've been browsing 2000s czech websites and I forgot half of them have byzantine encodings that firefox just autodetects
14:40:36 'good'
14:40:41 I'll see how far I can get with just the basics. If I hit any encoding snags in reality I'll come back to bug you :)
14:41:05 I'll stick around on the channel, sounds like some interesting discussions went on here :)
14:41:17 Browsers should simply refuse to display pages that don't specify the correct encoding per spec. Oh well, let's not have that discussion again. :-P
14:41:40 You may be interested in #archiveteam-ot and #archiveteam-dev as well.
14:41:48 Basically, I built a prototype of a scraper a while ago that can take a config file that determines which parts to extract (xpath/css matching) for certain url types, and then pushes it all into neo4j (not ideal, but easy to set up)
14:42:09 JAA: remember XHTML?
14:42:20 we've tried the whole "refuse to display non-standard pages" thing
14:42:28 Now I found the lovely trove of warc files on archive.org, and I'm rethinking part of my approach to just read an entire forum
14:43:01 Sanqui: Yes, I used to develop all my websites with XHTML. But the trainwreck had long left the station by that point.
14:43:45 avoozl: a database of web forums from archive.org is one of my dream projects
14:43:50 * avoozl has some flashbacks to structured web and OWL
14:44:04 See also Transfer-Encoding vs Content-Encoding, which *nobody* seems to use correctly.
14:44:26 Sanqui: I'm trying to keep it self-contained, I have experimented with blevesearch and dgraph before, but it is hard to work at scale. neo4j seems like a nice middle ground, but it will require a fairly beefy setup
14:44:40 old forums are a goldmine, a treasure trove of information, and as they drop out they're no longer searchable by google
14:44:47 even if we work to archive them
14:45:07 of course you could just dump everything into elastic and try the 'search' approach. But I like analytics, so I want things a bit more organized and referenced
14:45:37 absolutely, as a first step it'd be great to even just have metadata -- there's these fora, they had this many posts and users, click here to browse them in wayback
14:45:54 a graph of posts over time so you can say "prime time was 2007"
14:45:54 etc.
14:45:57 I've also had an idea for a project in this direction before. A standardised format for any sort of online discussion, extensible with platform-specific information as needed.
14:46:30 And then parsers that extract things accordingly from forums, social media, mailing lists, and whatnot.
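Tying back to the charset discussion above, a hedged sketch of the "prioritize what the HTML document says over the HTTP header" approach, with the header and a hardcoded default as fallbacks. The regexes are deliberately simplistic; real-world detection (BOM checks, chardet-style guessing) is considerably messier, as noted in the chat.

```python
# Sketch: pick an encoding for a response body, preferring the HTML meta
# declaration over the Content-Type header (regexes are simplified).
import re

def detect_encoding(body: bytes, content_type: str) -> str:
    # Look for <meta charset="..."> or charset=... inside an http-equiv meta
    # within the first couple of KB of the document.
    m = re.search(rb'charset=["\']?([\w-]+)', body[:2048], re.I)
    if m:
        return m.group(1).decode('ascii', 'replace')
    # Fall back to the HTTP header, e.g. 'text/html; charset=UTF-8'.
    m = re.search(r'charset=([\w-]+)', content_type, re.I)
    if m:
        return m.group(1)
    return 'iso-8859-1'  # RFC 2616 default; many forums are really utf-8 or cp1250/1251

def decode_body(body: bytes, content_type: str) -> str:
    return body.decode(detect_encoding(body, content_type), errors='replace')
```

For a single known forum, hardcoding the encoding (as Sanqui suggests) is simpler and avoids all of this.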
14:46:34 Sanqui: do you have any specific 'small' forum from archiveteam that you can recommend to try first? I'm currently just picking something at random, but I'd rather start with something that is pretty textual and not too large (say <100GB)
14:47:18 JAA: yeah, I've been thinking along those lines as well, but it is always difficult to scale these things properly, especially once 'time' becomes part of the storage structure (this user existed at this time, but later it disappeared, or changed alias, etc.)
14:47:36 Yeah
14:47:51 Try https://archive.org/details/forums.region.leagueoflegends.com_202003 perhaps.
14:47:56 JAA: everyone seems to want something different out of it.. for me I would like two things: browsing a forum with search, and running analytical queries (spark/graphql/python) on subsets of the data (and on the topology)
14:47:57 invisionfree and zetaboards used to have tons and archive.org archived a lot of them
14:48:01 and they're pretty standard
14:48:28 I'm currently archiving this major estonian forum active since 2002 https://foorum.hinnavaatlus.ee/
14:48:35 the LoL one looks good, I'll have a quick look at that
14:48:35 78756 users, 4860928 posts
14:49:13 Ages ago when geocities was archived I started playing with that, but I'm glad things have gotten a bit easier these days. that was a tricky set to get anything out of
14:49:30 avoozl: Big advantage is that those archives only contain the relevant HTML pages, no images, videos, outlinks, or other fluff.
14:49:47 So that should make for a nice small test bed.
14:50:05 JAA: yeah, that makes it perfect. I started at something from the archiveteam eu domains and that was 99% non-html
14:50:22 Then once that works, try an ArchiveBot crawl of a small forum I suppose.
14:51:20 in 2007 I archived a czech forum about pet birds that seems to be offline now
14:51:21 operenci.cz https://archive.fart.website/archivebot/viewer/job/29l18
14:51:35 sorry 2017*
14:51:58 I'll have a go at LoL first, see when I find some time to work on this
14:52:18 also speaking of google, I've come to realize just how irrelevant it's gotten when it comes to finding quality information
14:52:33 increasingly I'm finding better information just by searching reddit, hacker news, a discord server relevant to the topic, or heck, literally doing a fulltext search in my telegram chats
14:52:52 times are changing
14:54:45 What I like most is that you can easily store the entire history of reddit on your desktop machine, unless you want media
14:54:55 but yeah, accessibility of archived data is one thing we (archive team) are not that great at... there's always so much going on that by the time a project is done, we immediately move onto the next thing that's in danger
14:55:37 which is fine, the data is saved and we keep focus, BUT I would be delighted to see more projects making use of the archives, analyzing, enabling ease of access
14:55:53 so, thumbs up from me
14:56:22 ++
14:56:30 It is also difficult to find the right 'size' of software project. 'accessing data' could easily spiral out of control into some planet-sized IPFS key-value store with auto-indexing and distributed version control... which of course will never be finished, and the user experience will be backlogged into the better half of this century
14:56:54 it's absolutely true
14:57:14 I would just like something fairly simple, see if it sticks. Then if anyone wants to move it further, lovely.
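For context, a hypothetical sketch of the config-driven extraction avoozl describes earlier: URL patterns mapped to CSS selectors, yielding flat post records that could be loaded into neo4j, SQLite, or anything else. The patterns and selectors are made up for illustration and do not reflect the actual League of Legends forum markup.

```python
# Hypothetical sketch of config-driven forum extraction (selectors are placeholders).
import re
from bs4 import BeautifulSoup

CONFIG = {
    r'/forums/.+/topics/\d+': {   # hypothetical topic-page URL pattern
        'post':   'div.post',
        'author': 'span.author',
        'body':   'div.post-body',
        'date':   'time',
    },
}

def extract(url: str, html: str):
    """Yield one record per post for any URL matching a configured pattern."""
    for pattern, sel in CONFIG.items():
        if not re.search(pattern, url):
            continue
        soup = BeautifulSoup(html, 'html.parser')
        for post in soup.select(sel['post']):
            def text(css):
                node = post.select_one(css)
                return node.get_text(strip=True) if node else None
            yield {
                'url':    url,
                'author': text(sel['author']),
                'body':   text(sel['body']),
                'date':   text(sel['date']),
            }
```

Combined with the WARC iteration and charset sketches above, this is roughly the WARC-to-database pipeline being discussed.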
14:57:20 a narrow, detail-oriented focus is often better than casting a net that's too wide
14:58:22 Also, some shiny-object syndrome exists... I tend also to think 'oh maybe I could run openface-id neural nets to pick up all the faces in the parler data'.. and then of course, you COULD do that, but it feels worse than just working at the core of things
14:58:42 #adhd
15:00:25 YES
15:00:44 This is why my AT indexer is still not running, basically. :-P
15:00:50 a bizarre forum I'm archiving right now :p https://turfwarsapp.com/forum/43/topic/4824171/
15:01:06 for a geolocation-based mobile game
15:10:36 they used to play ingress here, that must be a ton of data too
15:11:25 motivated by a friend, I archived a few websites pertaining to geolocation based games because they've been having a hard time with the pandemic
15:11:28 ingress is probably safe
16:57:15 JAA: is there any trick to getting files from archive.org faster? I typically see curl/wget drop to like 250KB/sec after a while and it takes forever for most downloads
16:57:53 or is that normal speed
16:59:00 average speed on larger files seems to be around 400KB/sec.. 1,91G 397KB/s in 94m 3s
17:02:34 -purplebot- A Million Ways to Die on the Web edited by KamafaDelgato (+14) just now -- https://www.archiveteam.org/?diff=46185&oldid=44414
17:03:34 -purplebot- Template:IRC edited by Justcool393 (-12, Default to hackint for ArchiveTeam …) just now -- https://www.archiveteam.org/?diff=46186&oldid=41611
17:04:34 -purplebot- This Is My Jam edited by Flashfire42 (+104) just now -- https://www.archiveteam.org/?diff=46188&oldid=31812
17:04:34 -purplebot- FileTrip edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46189&oldid=35949
17:05:34 -purplebot- Wildscreen Arkive edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46190&oldid=34427
17:05:51 avoozl: Nope, that's normal speed. :-/
17:06:22 Ok, I'll just throw it all into the queue
17:14:16 Re: comments in #archiveteam .... I suppose some think that bugtraq is comparatively boring...
17:33:34 -purplebot- Coronavirus edited by Wessel1512 (+26, /* Archives and dedicated sites …) just now -- https://www.archiveteam.org/?diff=46193&oldid=46100
17:34:34 -purplebot- Template:IRC edited by JustAnotherArchivist (+12, Reverted edits by [[Special:Contributions/Justcool393|Justcool393]] …) 22 minutes ago -- https://www.archiveteam.org/?diff=46191&oldid=46186
17:45:34 -purplebot- This Is My Jam edited by Sanqui (+8, use job template for job id) 20 minutes ago -- https://www.archiveteam.org/?diff=46192&oldid=46188
17:48:49 On the Template:IRC edit reversal: that breaks pretty much every single IRC channel mention on the wiki. We need a mass edit at the same time as changing the default network in the template. There's been a bit of discussion in here about how to best do that and possibly also tackle the issue of dead channels at the same time (e.g. '#archiveteam-bs (on hackint), formerly #foobar (on EFnet)'). Until
17:48:55 that happens, the IRC template should stay as it is, even though it's messy. (Cc justcool393)
17:55:22 Is the https://git.savannah.gnu.org/git/gnulib.git repo down?
17:56:51 We're not GNU, but it's up now.
17:57:07 (And yeah, I could reproduce it being down at first.)
19:36:05 I have created a pull request to update the Dockerfile and add Wget-AT as a common dependency in the container itself, so that all the projects can use it: https://github.com/ArchiveTeam/warrior-dockerfile/pull/44
20:39:41 DaxServer — I think the warrior hasn’t been used in a while, but I would be interested to see how a dockerized version works, with all the C&C coordination that is required, etc....
20:42:14 Speaking of Warrior-type activities, does anyone know what the status is of a warrior-style archive project for the community.fantasyflightgames.com site?
20:42:20 I know there was the ArchiveBot pipeline 218c8179a369ceb37a999add83e36442 but that’s just a single source that is probably getting throttled or temporarily banned frequently, and I don’t know if they’ve even made a single full complete run yet.
20:44:29 brad: he left sadly, but I do believe it would've been nice for pre-parler for people to have it available
20:45:07 Yeah. ;(
20:45:08 we have lots of new helpers now and if it works, pending approval and review from the devs, it would be ideal till trackerv2 & warriorv4 are ready
20:45:32 either way this is probably best for #archiveteam-dev
20:48:29 Thanks! I’ll head over there....
20:57:42 Oh, I do have another question — are the WARCs created by ArchiveBot and other ArchiveTeam projects available anywhere for download? Some of the folks on the FFG SWRPG Discord are setting up a new community-owned forum site, and would also like to have their own searchable archive of the FFG community site, and I know the WARCs are key to doing that.
21:01:09 brad: https://archive.org/details/archivebot
21:03:30 https://archive.fart.website/archivebot/viewer/job/5l4qk
21:36:46 Thanks!
21:48:59 brad: The main job for community.fantasyflightgames.com finished, so the bulk of the site should be archived. However, there were a large number of 403s, which are now running in a separate job (erulqjgzn97r2xiab2yqe1qqv).
21:50:39 This isn't a perfect solution, because the way it's set up, AB won't recurse on the URLs in the second job (it's an '!ao <', not an '!a').
21:50:48 Yeah, I wasn’t able to find the original job on the trackers, so I assumed it had finished or shut down. I did find the one to sweep through and pick up the 403s for something like 180k links? Wow, that’s a lot of 403s....
21:51:18 It is 180k, but it's not all from community.fantasyflightgames.com. It's from other domains as well (including some that may 403 naturally)
21:52:13 It makes total sense that you would do multiple runs. It hadn’t occurred to me that the best way to pick up the 403s was to do a non-recursive list of specific URLs to try, however. But that is kinda clever.
21:52:13 If I have time, I might look into other ways to retrieve any missing pages.
21:52:48 Much appreciated!
21:53:19 Keep in mind that it's not really the best way, because again, there might be URLs on those 403-ing pages that the original job never got to.
21:53:29 Hopefully not too many, though.
21:56:42 Right, so ideally you’d want to do multiple recursive runs, plus non-recursive runs with specific URLs.
21:56:54 And I’m happy to help with that in any way I can.