-
FireFly
haha
-
nulldata
So no Tom & Jerry gay porn was captured?
-
fireonlive
uwu
-
fireonlive
now i'm curious...
-
fireonlive
brb
-
fireonlive
..huh
-
fireonlive
ok
-
katia
i ran it overnight from 2 machines outside CCC and woke up to 16TB of data most of it porn
-
katia
my plan right now is to take the Stuff That Is Interesting To Me and delete the rest
-
fireonlive
forum.tailscale.com/t/tailscale-forum-announcement/5768 < "I’m here to share some important news regarding the Tailscale forum. After almost three years, we have made the decision to sunset this platform. Starting on July 15, 2023 the forum will go into read-only mode: Posts will continue to be available to read, but users will not be
-
fireonlive
able to start new threads or reply to existing ones. While our forum has served as a valuable platform for discussions and knowledge sharing, we believe that concentrating our resources on our managed support channels will enable us to better meet the needs of our community."
-
fireonlive
dunno if we covered this
-
pabs
fart says no
-
pabs
fireonlive: will you AB?
-
fireonlive
ah thanks pabs, sure
-
fireonlive
hmm 11k posts
-
fireonlive
might ignore individual post urls
-
fireonlive
(but keep topic ones)
-
flashfire42
Nah, if it's good to go with no rate limits we can probably keep individual posts
-
fireonlive
actually that's not too much
-
flashfire42
assuming you mean 11k posts not 11k threads
-
fireonlive
asterisk was 300k
-
fireonlive
yeah 1.9k topics, 11.0k posts
-
pabs
11.0k is pretty small
-
fireonlive
started akane3b8pelvuc53swdsdwjqr
-
fireonlive
no-offsite due to con/delay discourse needed but we can run those later
-
fireonlive
needed->needs
-
fireonlive
keeping indiv. urls too :)
-
nicolas17
assuming argenteam actually shuts down Jan 1st, I was going to do another crawl
-
nicolas17
turns out they removed the subtitles already D:
-
nicolas17
looks like some argenteam subtitle zips are actually misnamed rars
-
fireonlive
ofc
-
fireonlive
what's a file extension for anyways?
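(Editor's aside: nicolas17's misnamed-rars find is the classic case for sniffing by magic bytes rather than trusting the extension. A minimal sketch, not taken from any tool mentioned here; function name and behavior are illustrative:)

```python
def sniff_archive(path: str) -> str:
    """Identify an archive by its leading magic bytes, ignoring the extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    if head.startswith(b"Rar!\x1a\x07"):
        # Covers both RAR4 (Rar!\x1a\x07\x00) and RAR5 (Rar!\x1a\x07\x01\x00)
        return "rar"
    return "unknown"
```

A `.zip` that is "really" a RAR comes back as `"rar"` regardless of its name.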
-
nicolas17
I considered feeding argenteam API responses into archivebot too but it's too late now
-
nicolas17
the API responses no longer have the subtitle URLs either
-
fireonlive
:/
-
nicolas17
webpages work tho
-
nicolas17
because I archivebot'd all webpages and subtitles
-
nicolas17
and I do have API responses saved, but not as pristine WARCs, rather as concatenated ndjson
-
fireonlive
"Arquivo.pt launched a new version, called Francisco, on the 19th of January 2022. The SavePageNow function stands out, allowing anyone to save a Web page to be preserved by Arquivo.pt. It is only necessary to enter a page’s address and browse through its contents. Arquivo.pt SavePageNow was inspired on the Internet Archive Save Page Now and
-
fireonlive
implemented using webrecorder pywb."
-
fireonlive
(oof)
-
fireonlive
(but also TIL)
-
fireonlive
cc JAA/arkiver
-
fireonlive
they use /wayback/ in their urls too xP
-
fireonlive
(ofc in 2022 yellow is some crypto bullshit now)
-
GhostOverflow256
What'd be my "first" step if I've spotted a potentially important website that's possibly going to go down within the next month?
-
GhostOverflow256
I'd be happy to develop the tools myself, I've checked and it doesn't have a sitemap.xml, but it runs mediawiki, anybody got pointers for enumerating pages?
-
thuban
GhostOverflow256: the first step is to drop the link; we already have specialized tools (#wikiteam/#wikibot) for archiving wikis
-
GhostOverflow256
@thuban I've just downloaded the xml backup of all pages, but the page is `ddosecrets.com/wiki/Distributed_Denial_of_Secrets`. I saw that the last uploaded backup onto the Internet Archive was 3 years ago, so should I upload it?
-
thuban
you can if you want, but wikibot handles that as well
-
GhostOverflow256
According to wikibot's GitHub: "Note: this bot is NOT currently set up to be installed by other people (hardcoded Discord channels, google cloud bucket names, etc), though it is on my todo list. Dragons be here (not just me)!"
-
GhostOverflow256
So that kinda discouraged me from trying to use it lol
-
thuban
you don't need to run your own! just submit a job to ours
-
thuban
(might require permissions; idk as i don't normally handle wikis. if it does someone will do it for you shortly)
-
GhostOverflow256
one way to find out i guess
-
pabs
does it need a new dump?
-
pabs
and I guess we should save the HTML to web.archive.org too
-
pabs
guess we will save it again since there were some more changes
-
pabs
and subdomains too
-
pabs
hmm, data.ddosecrets.com is quite huge...
-
JAA
I archived data.ddosecrets.com before. Agree it'd be nice to rearchive, but it's huge indeed.
-
pabs
JAA: seems there are lots of 2023 updates on it
-
» pabs threw it in with -c 1
-
pabs
JAA: oh, it has 32.1 TB of Parler videos, that is maybe a bit much for AB?
-
JAA
Yes, it's far too much for AB.
-
» pabs aborted for now
-
JAA
Also was at the time I archived it.
-
JAA
I also wonder whether the Parler data is extracted from our project.
-
JAA
Duplicating that would be a bit silly.
-
pabs
"it now has over 190 datasets and 60 terabytes of data!"
-
kpcyrd
if it was already archived, I'd assume only new data gets stored and everything else is deduplicated?
-
kpcyrd
there's still the traffic archivebot needs to download and reupload ofc
-
katia
would AB need 32TB storage for it to work or does it upload stuff to IA as it goes?
-
pabs
the latter IIRC
-
JAA
kpcyrd: We don't currently have software that can dedupe against previous WARCs. (wget's CDX reading is broken IIRC.) It's an important factor in the design of the new WARC library I've been working on (slowly).
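(Editor's aside: as a rough sketch of what deduping against a previous crawl's CDX index involves, assuming the common 11-field "CDX N b a m s k r M S V g" layout, where the sixth field is the payload digest. This is illustrative only, not the design of any tool or library mentioned here:)

```python
def load_digests(cdx_lines):
    """Map payload digest -> (url, timestamp) of the earliest prior capture."""
    seen = {}
    for line in cdx_lines:
        if line.startswith(" CDX"):  # header line describing the field layout
            continue
        fields = line.split(" ")
        # Fields: urlkey, timestamp, original URL, MIME, status, digest, ...
        timestamp, url, digest = fields[1], fields[2], fields[5]
        seen.setdefault(digest, (url, timestamp))
    return seen

def should_revisit(digest, seen):
    """True if this payload was already captured; a writer would then emit a
    WARC revisit record pointing at the prior capture instead of the body."""
    return digest in seen
```

The hard part in practice is not this lookup but doing it correctly and at scale inside the fetch pipeline, which is what the "new WARC library" remark is about.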
-
JAA
katia: It uploads as it goes, but it'd take a very long time and couldn't dedupe.
-
JAA
I don't remember how I grabbed it at the time, but I believe there were parallel processes.
-
katia
could archivebot save hashes of current running stuff somewhere to dedupe without having the file locally?
-
h2ibot
Entartet created Arhivach (+1785, Created page with "{{Infobox project | URL =…):
wiki.archiveteam.org/?title=Arhivach
-
h2ibot
Entartet created M.Dvach (+893, Created page with "{{Infobox project| | URL =…):
wiki.archiveteam.org/?title=M.Dvach
-
JAA
katia: It could be implemented in theory. It's not practical at scale though.
-
Terbium
Currently the way I dedupe is load all CDX entries in a PostgreSQL database and add a hook in wpull to query hashes against it at crawl time
-
Terbium
It's not perfect, but works at smaller scales
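(Editor's aside: the shape of Terbium's crawl-time hook is roughly the following. SQLite stands in for PostgreSQL here so the sketch is self-contained; the table and column names are invented, not Terbium's actual schema:)

```python
import sqlite3

def open_index(path=":memory:"):
    """Central digest index; in the real setup this would be PostgreSQL."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS cdx (digest TEXT PRIMARY KEY, url TEXT, ts TEXT)"
    )
    return db

def record_capture(db, digest, url, ts):
    """Register a fetched payload; first capture of a digest wins."""
    db.execute("INSERT OR IGNORE INTO cdx VALUES (?, ?, ?)", (digest, url, ts))

def prior_capture(db, digest):
    """Hook point queried at fetch time: a hit means the crawler can write a
    revisit record referencing (url, ts) instead of storing the body again."""
    return db.execute(
        "SELECT url, ts FROM cdx WHERE digest = ?", (digest,)
    ).fetchone()  # None if this content is new
```

With many concurrent crawlers the `PRIMARY KEY` constraint is what arbitrates which crawler's capture becomes the canonical one.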
-
JAA
Terbium: Uh, wpull doesn't have a hook for that, does it?
-
Terbium
It's a custom written hook I added to my non-mainline wpull fork to work with the rest of my ingest pipeline. It's not part of wpull or wpull_ludios natively
-
JAA
Right
-
JAA
Did you also modify the name so it's obvious the WARCs were produced with the fork?
-
JAA
The URLTable does in fact support revisits, but it's not implemented I think. Plus it's all broken on non-SQLite anyway.
-
Terbium
It has a bumped version number, but no name change. The WARCs I generate are unlikely to ever be redistributed
-
JAA
What version numbers are you using?
-
Terbium
Yep URLTable has basic local dedupe support, but it didn't suit my needs as I need to dedupe against a large number of concurrent crawlers on a central DB
-
JAA
Makes sense. I was also tinkering with that sort of thing before.
-
Terbium
I'm using version 4.0.0
-
fireonlive
postgres 😍
-
JAA
So we'll need to skip 4.x upstream as well, ok...
-
Terbium
I'm going to bring ludios_wpull up to 5
-
Terbium
Found regressions in 3.11 that broke some of the socket code, so will likely end up pushing it to 3.12, and jump from v3 to v5
-
Terbium
wpull standard is 2.0.3, ludios_wpull is at 3.0.9 currently
-
JAA
Please change the name it writes to the warcinfo record on that bump.
-
Terbium
huh, looks like ludios_wpull retains the "Wpull" name in warcinfo. I'll change the name to ludios_wpull
-
JAA
Yeah, it does, and it's annoyed me ever since that patch. :-)
-
JAA
Ideally, grab-site would also appear in there when it's invoked that way, but that may be trickier.
-
Terbium
hmm, that would be a pain, i guess i can use importlib to extract the grab-site version
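(Editor's aside: the importlib idea could look roughly like this, using the stdlib `importlib.metadata` module; the distribution names and the warcinfo-style string format are illustrative, not what grab-site or ludios_wpull actually emit:)

```python
from importlib import metadata

def dist_version(package: str):
    """Return the installed distribution's version, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def software_string():
    """Build a warcinfo-style software field listing whichever of the
    (hypothetical) tool names below are actually installed."""
    parts = []
    for name in ("grab-site", "ludios_wpull"):
        version = dist_version(name)
        if version:
            parts.append(f"{name}/{version}")
    return " ".join(parts) or "unknown"
```

`metadata.version()` reads the installed distribution's metadata, so it reports whatever version is actually running, with no hardcoded string to forget to bump.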
-
Terbium
i guess we're going into -dev territory lol
-
JAA
Yeah :-)