-
h2ibot
Flashfire42 edited List of websites excluded from the Wayback Machine (+25):
wiki.archiveteam.org/?diff=51396&oldid=51390
-
h2ibot
OrIdow6 edited Google Drive (+1594, Make some of my research useful for future…):
wiki.archiveteam.org/?diff=51397&oldid=50420
-
fireonlive
OrIdow6++
-
eggdrop
[karma] 'OrIdow6' now has 1 karma!
-
fireonlive
sites do privately appear in folders at least but hm
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=51398&oldid=51396
-
JAA
ETA for OneHallyu is 4 days, 3 hours. Probably not going to finish in time.
-
h2ibot
OrIdow6 edited Google Drive (+86, New discoveries involving Sites in Drive):
wiki.archiveteam.org/?diff=51399&oldid=51397
-
» fireonlive waits
-
fireonlive
cron pls
-
h2ibot
FireonLive edited Current Projects (+95, attempt to clean up/make easier to read the…):
wiki.archiveteam.org/?diff=51400&oldid=51243
-
fireonlive
cc JAA/arkiver
-
arkiver
looks good fireonlive
-
fireonlive
=]
-
arkiver
i'm not sure if we still need the ukraine/russian sites project there, it hasn't been running for a long time
-
fireonlive
ah good point
-
fireonlive
Long-term, perpetual projects?
-
fireonlive
did we have a word for 'basically forever'
-
fireonlive
an internal word that is
-
fireonlive
"occurring repeatedly; so frequent as to seem endless and uninterrupted."
-
fireonlive
that works
-
arkiver
i'd just say long term
-
arkiver
can't promise we'll keep them running forever
-
JAA
I've used 'continuous' before, but doesn't really say much.
-
JAA
Yeah, 'long term' is good.
-
fireonlive
ah ok
-
JAA
Nothing will keep running forever. The heat death of the universe will consume it all.
-
fireonlive
yep :)
-
arkiver
so we might as well fade out now?
-
fireonlive
50/50 on leaving a blank section, but will leave an empty medium for now, to show that it 'can' exist
-
fireonlive
arkiver: that's my dream
-
arkiver
ouch
-
JAA
'(none currently)'?
-
fireonlive
ah that works
-
JAA
Rather than just an empty section, which may look weird.
-
JAA
I plan on hanging out on this channel until the heat death. :-)
-
fireonlive
:)
-
fireonlive
have we had a scripts only project in the past N years
-
fireonlive
*removes commented out section*
-
fireonlive
arkiver: "2019-202? coronavirus outbreak: Documenting and preserving data, events, and impacts of the virus on society. IRC Channel #coronarchive (on hackint)" < would you call this one not running as well?
-
arkiver
yes
-
fireonlive
kk
-
h2ibot
FireonLive edited Current Projects (-167, remove coronavirus):
wiki.archiveteam.org/?diff=51402&oldid=51401
-
fireonlive
wow i cured the world i guess
-
fireonlive
:p
-
TheTechRobo
Photobucket did the purge *long* ago, right?
-
TheTechRobo
Should that be removed from upcoming?
-
TheTechRobo
also I feel like some of these hiatuses will never be unhiatused (is that a word?)
-
TheTechRobo
ex. Audit 2014
-
TheTechRobo
finally,
wiki.archiveteam.org/index.php/NewsGrabber has been largely replaced with #//, right? should the wiki page be updated with that info?
-
fireonlive
there was a line on the audit 2014 hiatus bullet that it would be done in 2016 that i removed a few months ago
-
TheTechRobo
oh, it does say under project status
-
fireonlive
re: NewsGrabber it does say "Archiving status Project superseded by URLs"
-
fireonlive
ye
-
fireonlive
lemme just...
-
h2ibot
TheTechRobo edited NewsGrabber (+51, replaced with #//):
wiki.archiveteam.org/?diff=51403&oldid=50757
-
fireonlive
hey you
-
fireonlive
conflicting my edit
-
fireonlive
.
-
fireonlive
oops forgot a message lol
-
h2ibot
TheTechRobo edited NewsGrabber (+1, Add a period):
wiki.archiveteam.org/?diff=51405&oldid=51404
-
h2ibot
TheTechRobo edited URLs (+222, Add urls-sources):
wiki.archiveteam.org/?diff=51406&oldid=50427
-
h2ibot
FireonLive edited Current Projects (+0, alphabetize "on hiatus"):
wiki.archiveteam.org/?diff=51407&oldid=51402
-
arkiver
oh yeah newsgrabber was kind of our predecessor to #//
-
TheTechRobo
> yt-dlp can be used to download article URLs, making it possible to preserve news in video-form just as well as news in text-form.
-
TheTechRobo
I don't think we have that in URLs, do we? I suppose the storage would get unwieldy
-
TheTechRobo
Might be nice for high-value stuff, though
-
fireonlive
archivebot used to use youtube-dl (before the fork) but not any longer
-
TheTechRobo
Yeah
-
TheTechRobo
That integration was always jank IIRC though
-
arkiver
we don't use yt-dlp in any project, except for the bot in #down-the-tube to discover videos of a channel for queuing
-
arkiver
err
-
TheTechRobo
arkiver: I thought yt-dlp was replaced in the bot?
-
arkiver
any Warrior project i should say
-
fireonlive
arkiver: is the bot on git :p
-
arkiver
TheTechRobo: only partially replaced
-
arkiver
fireonlive: no, it has keys that i didn't separate out yet
-
arkiver
but yes i should get it on git
-
fireonlive
ahh np
-
» TheTechRobo asked about that before :P
-
fireonlive
no rushy
-
JAA
Yes please :-)
-
TheTechRobo
arkiver: should Photobucket be removed from upcoming/proposed? or is it still planned?
-
arkiver
just have to free up some time for that
-
TheTechRobo
Can we have the tracker next?
-
arkiver
TheTechRobo: i don't think it's planned at the moment
-
fireonlive
good luck for tracker :P
-
arkiver
TheTechRobo: the current tracker is so very duct-taped together (with sensitive stuff spread across it), that it will likely not be released publicly any time soon
-
TheTechRobo
I've been asking ever since I touched Seesaw. Universal-tracker is, despite the name, not very universal
-
fireonlive
arkiver: oh, one more q: are the IDs it generates stored in a database or something somewhere alongside the explanation provided/project/etc?
-
arkiver
i believe the old tracker on github should still somewhat work?
-
fireonlive
or is it mainly for irc logs?
-
TheTechRobo
arkiver: Somewhat is right.
-
TheTechRobo
fireonlive: I also have the same question about `-e`
-
arkiver
fireonlive: the bot for queuing you mean? they are currently only in the logs
-
TheTechRobo
arkiver: No backfeed, slow, no offloader, etc
-
arkiver
together with the explanation, only in the logs
-
fireonlive
ye indeed
-
fireonlive
ah ok :)
-
arkiver
TheTechRobo: yeah
-
fireonlive
eventually ™
-
fireonlive
:D
-
arkiver
i guess :/
-
fireonlive
tracker is more understandable
-
fireonlive
so i don't hold that one against y'all lol
-
arkiver
:)
-
fireonlive
:)
-
TheTechRobo
The lack of an offloader was the main reason I never archived very much of Strawpoll. Whenever tracker was running, even idle, ~4GB of RAM usage because everything was in memory
-
arkiver
i could have setup a project for that, if i was aware
-
TheTechRobo
Maybe a project for 2024. Building Universal-tracker 3 :P
-
arkiver
set up*
-
TheTechRobo
arkiver: No shutdown notice, I just felt like archiving it
-
arkiver
ah okey
-
arkiver
it's still online?
-
TheTechRobo
No
-
fireonlive
i'm sure someone will set something big on fire in 2024
-
arkiver
went offline without shutdown notice?
-
fireonlive
well a lot of somethings
-
TheTechRobo
arkiver: No idea
-
TheTechRobo
I thought "y'know maybe I should continue archiving strawpoll" and it was ded
-
arkiver
fireonlive: maybe, i expected more to burn down with higher interest rates. maybe that will come next year still as the rates stay somewhat high and companies need to refinance
-
arkiver
TheTechRobo: sad :/
-
TheTechRobo
Yeah
-
fireonlive
updated 2023-02-09, closed 2022-08
-
TheTechRobo
I did get a bunch of polls, but nowhere near everything :/
-
arkiver
are they on IA?
-
TheTechRobo
arkiver: i think? this was when I was very new to ATY
-
TheTechRobo
*AT
-
fireonlive
their twitter kinda died lol
-
h2ibot
TheTechRobo edited Strawpoll.me (+2, Update info):
wiki.archiveteam.org/?diff=51408&oldid=49804
-
fireonlive
apparently they were having technical issues? and i guess didn't want to spend resources on fixing it
-
TheTechRobo
fireonlive: lol
-
fireonlive
i vaguely remember others saying 'not to use the .me' version as well
-
fireonlive
ooh, dark IA items with poll data :3
-
TheTechRobo
fireonlive: Yeah, not sure what's up with that
-
arkiver
TheTechRobo: how did you archive it?
-
arkiver
meaning how was the WARC created
-
arkiver
thanks
-
arkiver
TheTechRobo: i've moved the strawpoll item to archiveteam-fire , it will soon be in the Wayback Machine
-
fireonlive
i have my own collection?
-
fireonlive
:D
-
fireonlive
also sweet news :)
-
arkiver
hah i guess so :)
-
fireonlive
:3
-
TheTechRobo
arkiver: Holy shit lmao
-
arkiver
TheTechRobo: ?
-
TheTechRobo
arkiver: My shitty code made it into the WBM! :P
-
arkiver
well as long as the records are fine, it should be good :)
-
TheTechRobo
I don't even think Wget-AT *lets* you write invalid records :P
-
TheTechRobo
Well, I guess you could override DNS
-
arkiver
yeah :)
-
arkiver
for DNS yes i guess
-
TheTechRobo
But you could do that anyway
-
TheTechRobo
Wget-AT is amazing
-
TheTechRobo
Wget-AT++
-
eggdrop
[karma] 'Wget-AT' now has 2 karma!
-
arkiver
thanks :)
-
arkiver
many improvements coming up!
-
fireonlive
=]
-
» arkiver is preparing a response to the recent responses from the TLS working group on our proposed mime types and URIs for SSL/TLS
-
fireonlive
good luck with those IETF types
-
TheTechRobo
I'd also suggest adding some sort of unit testing
-
fireonlive
i'm sure it's on mind
-
TheTechRobo
Yeah
-
arkiver
thanks...
-
fireonlive
🆕 !tell now supports hostmasks (nick!user@host) e.g. !tell *!*@balls.example hello
-
fireonlive
(with wildcards)
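
A minimal sketch of how a wildcard hostmask such as `*!*@balls.example` could be matched against a full `nick!user@host` string. This is an illustration only: how the bot behind `!tell` actually matches masks is not stated in the log, and the helper names `mask_to_regex` and `mask_matches` are hypothetical. It assumes `*` matches any run of characters, `?` matches a single character, and comparison is plain case-insensitive (real IRC case mapping is slightly different).

```python
import re


def mask_to_regex(mask: str) -> re.Pattern:
    """Translate an IRC-style wildcard mask into a compiled regex.

    Only '*' and '?' are treated as wildcards; every other character,
    including '[' and ']' (legal in nicks), is matched literally.
    """
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Simplification: plain case-insensitive matching, not IRC case mapping.
    return re.compile("".join(parts), re.IGNORECASE)


def mask_matches(mask: str, hostmask: str) -> bool:
    """Return True if the whole nick!user@host string matches the mask."""
    return mask_to_regex(mask).fullmatch(hostmask) is not None
```

Translating the mask by hand rather than using `fnmatch` keeps square brackets literal, which matters because they can appear in nicknames.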
-
Ryz
Mmm, welp, from a random checking of links to ignore on some ArchiveBot jobs, sadly
forum.mobilelegends.com shut down earlier this year, on April 30
-
Ryz
...I don't think we have many people in the mobile game area of things who feel strongly about it :c
-
SketchCow
Hi Jason,
-
SketchCow
Sorry to bother you, but Joe Baugher has died. He wrote up so much about
-
SketchCow
aviation throughout the years and his articles are invaluable. Would you
-
SketchCow
mind asking the Archive team to archive his home page one last time?
-
SketchCow
All the best,
-
SketchCow
Chris
-
c3manu
SketchCow: I’m not jason, but i think this is something we can do :)
-
c3manu
oh wait. you're jason >.<
-
c3manu
it’s queued :)
-
Nulo|m
is there something easy to quickly (multi connection) download a list of urls (without following links) into a warc?
-
c3manu
Nulo|m: i’m not as experienced as other users here (who might have better answers for you), but you should just be able to use wget for that
-
Nulo|m
i guess just have to make a script to run many wget right?
-
Nulo|m
also i can't find a flag to not download files into a file when i'm already downloading them into a warc in wget
-
c3manu
Nulo|m: well, it has to download them, but there's the --delete-after flag which gets rid of them once they're in the warc
-
Nulo|m
thanks!
-
c3manu
wget has a --background mode, but i would assume they then cannot write into the same warc file
-
c3manu
if you have multiple running i mean
-
c3manu
there's also wpull, a wget fork, which archivebot also uses to download things. that one supports concurrency, but depending on your python version it might be a little fiddly to set up:
github.com/ArchiveTeam/wpull
-
c3manu
correction: it's not a fork, just another tool. my bad
-
c3manu
what do you need the warc for, if i may ask?
-
Nulo|m
i'm downloading product pages to then scrape them offline
-
c3manu
why the warc then, and not just the pages themselves?
-
Nulo|m
because if i need to pull more info later that i wasn't scraping before, i can still just pull from the warc
-
Nulo|m
also my scraper is kind of hacky so if it's bad i can just re-run it on the WARCs
-
c3manu
i see, that makes sense.
-
Nulo|m
also i should be able to run the scraper on WARCs from archive.org or other sources :)
-
c3manu
ok. apart from wpull i am running out of ideas. hopefully someone else can give you better answers when they're back :)
-
Nulo|m
no, thank you!
-
Nulo|m
i think i'll make a script based on wget though
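
A rough sketch of the kind of wget-based script discussed above, under stated assumptions: the URL list is split into chunks, and each chunk is handed to its own `wget` process with its own WARC prefix, since (as noted earlier) parallel processes presumably cannot share one WARC file. The function names, the default `crawl` prefix, and the worker count are hypothetical; the wget options used (`--input-file`, `--warc-file`, `--delete-after`, `--no-verbose`) are standard GNU wget flags. Links are not followed because no recursion option is passed.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_round_robin(urls, n):
    """Split urls into at most n non-empty chunks, dealing them out in turn."""
    chunks = [urls[i::n] for i in range(n)]
    return [chunk for chunk in chunks if chunk]


def wget_command(url_file, warc_prefix):
    """Build one wget invocation that fetches a URL list into its own WARC."""
    return [
        "wget",
        "--input-file=" + str(url_file),  # one URL per line
        "--warc-file=" + warc_prefix,     # wget appends the WARC extension
        "--delete-after",                 # drop the plain downloaded copies
        "--no-verbose",
    ]


def fetch_all(urls, workers=4, out_prefix="crawl"):
    """Run one wget process per chunk in parallel; return their exit codes."""
    with tempfile.TemporaryDirectory() as tmp:
        jobs = []
        for i, chunk in enumerate(split_round_robin(urls, workers)):
            url_file = Path(tmp) / f"urls-{i}.txt"
            url_file.write_text("\n".join(chunk) + "\n")
            jobs.append(wget_command(url_file, f"{out_prefix}-{i:02d}"))
        # Threads only wait on the child processes, so a thread pool suffices.
        with ThreadPoolExecutor(max_workers=len(jobs) or 1) as pool:
            return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, jobs))
```

Calling `fetch_all(urls, workers=8)` would leave one WARC per worker in the current directory, named after the prefix and the chunk number; a scraper can then read them one after another.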
-
fireonlive
there’s wget-at to :)
-
fireonlive
too
-
Nulo|m
yah but wget works fine for me and i believe wget-at doesn't have multi-connection, just improved warc stuff?
-
c3manu
ah, that would probably be the fork then that i confused wpull with earlier
-
fireonlive
improved warc stuff sounds pretty paramount :3
-
Nulo|m
hehe but the warcs generated by gnu wget work fine with warcio.js which is what i'm using so 👍️
-
JAA
I'm averaging 5k OneHallyu topics per hour now. They went read-only at 2023-12-20T11:23Z or so (date of the last post by an admin). If they shut it down at the same time of day, I expect to have covered about 81% of the topics.
-
nicolas17_bot
more parallelism/IPs unlikely to help?
-
JAA
Their potato is too slow.
-
JAA
6 second average response time.
-
JAA
Let's see what happens if I throw more at it...
-
» Barto observes an explosion in the horizon
-
nicolas17_bot
also try less, if there's resource contention on the server it could have weird effects
-
JAA
Can't easily go to less, but yeah, I might if this makes it worse.
-
nicolas17_bot
("half the threads, 2 second response time" would be a net win, though unlikely)
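
The arithmetic behind that remark can be sketched with Little's law: if every worker issues one request at a time, back to back, throughput is roughly concurrency divided by average response time. The worker count of 12 below is purely hypothetical (the real number is not given in the log); only the 6-second and 2-second response times come from the conversation.

```python
def throughput(concurrency: int, response_time_s: float) -> float:
    """Approximate requests per second for workers that each keep exactly
    one request in flight (Little's law: L = lambda * W)."""
    return concurrency / response_time_s


# Hypothetical worker count, for illustration only.
workers = 12

current = throughput(workers, 6.0)       # 12 workers at 6 s per response
halved = throughput(workers // 2, 2.0)   # 6 workers at 2 s per response

print(f"current: {current} req/s, halved: {halved} req/s")
```

Under these assumptions, halving the workers is a net win whenever the response time falls to less than half of its previous value, whatever the starting worker count.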
-
JAA
Average response time now: 8404 ms ._.
-
fireonlive
x_x
-
JAA
Throughput still went up a bit though.
-
nicolas17
hm how's your network-layer latency to their server?
-
JAA
They hide behind Buttflare, so no idea.
-
nicolas17
oh :|
-
nicolas17
that latency is also irrelevant if they're in CF
-
JAA
Depends on what their backend looks like, but the point is rather that I can't measure it anyway.
-
nicolas17
if there wasn't CF, doing the crawl from somewhere closer could help
-
JAA
Possibly, although it can usually be balanced by higher concurrency.
-
JAA
I'm back down to the same throughput from before I increased the concurrency.
-
tech234a
“Bluesky makes web view public, login no longer required to read posts”
news.ycombinator.com/item?id=38739130
-
fireonlive
nicolas17: feel free to use #fire-spam for testing
-
fireonlive
everyone got to witness the bee movie so what's a bit more :p