-
project10
-
fireonlive
oof
-
anarcat
some epic crawl, journalmetro.com (d9mh44xbsx92ie1iwf88mk2pn) - i didn't expect it to be so big (and still growing)
-
anarcat
things seem to be running smoothly though, and might finish in time to keep that thing in IA before the damn thing falls apart like everything else
-
JAA
(immature giggles from the back row)
-
JAA
They didn't say anything about how long the site would stay up, did they?
-
anarcat
i haven't followed closely
-
Ryz
Heya folks, besides the default Warrior project selection, any other Warrior projects that might need attention?
-
flashfire42
atm telegram and reddit are the 2 with items but they are right now clogged by targets. if you wanna try your luck at Zowa then we could test if it's a ban or if the items that it's trying to push out are indeed bad at this point Ryz
-
nicolas17
yeah everything seems stalled atm
-
Ryz
Was pondering on Imgur but hmm o.o;
-
Ryz
Wouldn't mind running more of the bruteforcer if it needs attention
-
nicolas17
I have 135 million IDs from the bruteforcer that I still didn't submit into the queue and probably never will
-
nicolas17
imgur got too large
-
Ryz
Oof, too much data? :c
-
nicolas17
we archived 654TB
-
nicolas17
<JAA> The problem is the data size. We already went well past the initial estimate we gave IA.
-
nicolas17
<nicolas17> we're at 650TiB
-
nicolas17
<JAA> Yes, which is more than double what we told IA.
-
Ryz
...Oo;
-
Ryz
Aaaaah <#>;
-
nicolas17
<JAA> I feel like the best option going forward that we have is keeping this running continuously MediaFire-style so that we can queue lists of images collected from other crawls
-
nicolas17
<JAA> But I don't see archiving all of Imgur happening anytime soon. Well, not until they're shutting down or doing a severe policy change like deleting images after X days or whatever.
-
Ryz
Hmm, would it be best to just run the bruteforcer on the remote chance that Imgur may actually shut down or the severe policy change? Collect more of the stuff
-
JAA
→ #imgone
-
flashfire42
ok I have to ask. What the fuck is actually connecting to the rsync servers if nobody is actually seeming to connect. If we are all complaining what the fuck is the clog?
-
nicolas17
flashfire42: if you get -1 it means disks are full
-
nicolas17
and the server is set to maximum 0 concurrent connections and nobody is connecting
-
flashfire42
So the bottleneck is moving that data to temp storage? or did we already fill that
-
JAA
Yes, that is the bottleneck. No, it isn't full, but its capacity is reduced compared to at the beginning.
-
nicolas17
I have seen continuous rsync errors for hours so it looks stuck full rather than "slow to free up"
-
nicolas17
unless it's so slow that the hysteresis is making it look stuck
-
nicolas17
JAA: should we pause (or greatly rate-limit) projects while targets are full?
-
nicolas17
especially telegram where people would get reclaims of items that took too long *because* they're stuck uploading
-
flashfire42
poor optane9 rewby
-
JAA
So someone mentioned archiving Doomworld yesterday. Since it's Invision and I only had to replace three lines in my Canucks forums script, I gave it a quick try. Turns out that site is very broken. Quite a lot of topics return 500s:
doomworld.com/forum/topic/721-x
-
JAA
There hasn't been any official announcement in five years, and the sole admin I could see is rarely active. So it could use an archival.
-
that_lurker
Cisco appears to have bought Splunk
-
that_lurker
-
fireonlive
that_lurker: im not sure they could make it any more expensive but i’m sure they’re going to try
-
that_lurker
"Somebody: Splunk has exorbitant prices and locked-in enterprise customers!
-
that_lurker
Cisco: Oh these guys are just like us. Better buy them up. We know this business."
-
that_lurker
That and many more fun takes are on the HN
news.ycombinator.com/item?id=37596497
-
fireonlive
:3
-
that_lurker
Would maybe be a good idea to grab the splunk documentation site
docs.splunk.com/Documentation
-
rktk
fanforum.com is anything from this site archived? there seems to be a LOT of older content there
-
rktk
-
rktk
at least, not as a warc
-
rktk
im sure it's in web.archive
-
pabs
-
rktk
possibly worthy as a new project?
-
rktk
I notice the site loads verrrrry slow. takes a while depending on how old the INDEX of a subforum is
-
rktk
talking minutes, not seconds
-
pabs
seems like it would be impossible to archive - would kill the site?
-
JAA
That IA search is only really useful for wiki dumps.
-
pabs
oh, the front page eventually loaded
-
JAA
Most other items don't have an 'originalurl' metadata field.
-
rktk
pabs exactly what I'm talking about
-
rktk
JAA ah only for wiki sites
-
rktk
didn't know that was a metadata entry
-
rktk
pabs just trying to suss out thoughts on this but yeah it seems the site is basically in some kind of maintenance mode or like, skeleton life support...
-
rktk
and yet people still actively post
-
JAA
Oh wow, that's huge.
-
pabs
Threads: 495,274 | Posts: 107,495,875 | Members: 406,871 | Currently Active Users: 2969 (36 members and 2933 guests)
-
pabs
-
pabs
vBulletin
-
pabs
probably too big for ArchiveBot?
-
JAA
Topic IDs are around 63 million, so enumerating that is out of the question.
-
just1602
github.com/EFForg/apkeep <= I thought this could be helpful for the Archive Team if people want to download APKs to archive them
-
anarcat
holy crap
-
JAA
Neat, thanks!
-
JAA
pabs: Yeah, with all the extra links everywhere, probably too big. And also, the slow responses would run into the timeout all the time I bet.
-
pabs
2933 guests, hmm I wonder if they are getting hit hard by spidering
-
JAA
-
JAA
Just that individual subforum I think, but yeah.
-
JAA
> Fan Forum requires that we average at least 12 posts per day, with lower numbers than that leading to warnings and then possible closure of the board.
-
fireonlive
huh.
-
DigitalDragons
interesting
-
nicolas17
that_lurker: I saw several people wondering if the $28B Cisco paid Splunk was to acquire them or just renewing their license for the year