-
JAA
betamax: So. Many. Broken. URLs. :-|
-
TheTechRobo
Hello
-
TheTechRobo
I think that this is the correct channel
-
TheTechRobo
Any chance that the wiki.archiveteam.org/index.php/Dev pages will be updated?
-
TheTechRobo
Most if not all pages were last updated in 2015
-
TheTechRobo
and they use python 2
-
TheTechRobo
would it still work with python 3?
-
JAA
Yep, that section is pretty outdated.
-
JAA
Yes, seesaw works fine with Python 3.
-
TheTechRobo
Sounds good, thanks
-
JAA
betamax: Queueing has begun. 1437 sites after lots of cleanup and dedupe and whatnot.
-
flashfire42
Ok, start grabbing anything related to Israel and Gaza; they are using white phosphorus
-
mgrandi
Do we have a coordinated effort for getting google doc urls? Apparently they are gonna start wiping those if they are inactive in 3 weeks
-
OrIdow6
Link?
-
OrIdow6
First I've heard of this
-
OrIdow6
That I remember
-
OrIdow6
Oh, apparently a link was posted (though I don't see any discussion) November 15
-
OrIdow6
"After June 1:
-
OrIdow6
If you're inactive in one or more of these services for two years (24 months), Google may delete the content in the product(s) in which you're inactive.
-
OrIdow6
Similarly, if you're over your storage limit for two years, Google may delete your content across Gmail, Drive and Photos."
-
HCross
how would we do it? Export to PDF?
-
Sanqui
the HTML view is probably better?
-
Sanqui
maybe even possible with #//?
-
EggplantN
#// gets single urls
-
avoozl
OrIdow6: inactive in "one or more".. that sounds like they could already wipe things if
-
avoozl
.. I'm inactive on a single service even
-
OrIdow6
One thing to worry about is ability to discover them
-
OrIdow6
Even something like "append this to the path" will make it hard to play them back in practice
-
OrIdow6
avoozl: it only "delete[s] the content in the product(s) in which you're inactive"
-
avoozl
Makes sense. I need to get into the habit of cycling through my google accounts every once in a while.. I just tended to create a new account for any android device I had
-
mgrandi
The AT wiki has some notes on the gdoc url formats, probably best to convert to a variety of formats since they are so small :shrug:
-
etnguyen03
-
mgrandi
Already got their twitter
-
JAA
Put it on the pile. ;-)
-
betamax
JAA: many thanks for queuing the party / candidate sites in AB
-
betamax
I don't want to overload things - should I wait before loading some more of the twitter scrapes into AB?
-
JAA
betamax: Fine to start two more I'd say. The ones that are primarily or entirely twitter.com URLs will run much faster than the ones full of external URLs.
-
betamax
Yeah, I think the next two lists are entirely t.co shortlinks but after that it's mostly just tweets
-
JAA
I'd guess the last one might be www.* stuff.
-
betamax
The last one goes from "t.co..." to "t.co", so I guess there isn't any www stuff (that doesn't have an "http" or "https" prefix)
-
JAA
Uh
-
JAA
That sounds highly unlikely. Unless you removed it or ignored it on sorting, I guess.
-
betamax
Just checked, and there aren't any starting with "www". One sec while I look at the original scrape output.
-
JAA
I mean, random example from the job that just started: t.co/4P985nEYWA -> yorkshireparty.org.uk
-
JAA
And that was the first one I tried. So yeah, there are definitely lots of www URLs in it.
-
JAA
HTTP v HTTPS though
-
betamax
Ah, sorry. I thought you meant URLs starting with "www" (ie: no http or https)
-
JAA
Oh, no, there shouldn't be any protocol-less URLs unless something went very wrong.
-
betamax
There are around 700,000 non-twitter URLs with "//www." in the scrape.
-
JAA
Yeah, that makes more sense. :-)
-
betamax
There's around 16 protocol-less "t.co" links, but in the scheme of things that's nothing
-
JAA
Hmm, that's odd though.
-
JAA
I'd love to know where those come from. Sounds like a Twitter or snscrape bug.
-
Ryz
!ig 8d5te0kc64qttw6s9vsg27hza ^https?://assets\.squarespace\.com/universal/scripts-compressed/
-
Ryz
Oops
-
betamax
JAA: it looks to be a bug in twitter
-
betamax
(when run with "snscrape --format {url} {tcooutlinksss} {outlinksss} twitter-user <username>")
-
JAA
Thanks!
-
JAA
Time to add another workaround for Twitter weirdness I guess.
-
betamax
Here's the full list (of 17 results) if you want more examples for testing: tardis.ed.ac.uk/~andrewferguson/uk_…ctions_2021_betamax/twitter_bug.txt
-
JAA
Perfect! :-)
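
Note: the check discussed above (how many lines in a scrape are t.co shortlinks, tweets, external URLs, or protocol-less) could be sketched roughly as follows. This is only an illustration, not what betamax or JAA actually ran. The function names `classify_url` and `summarise` are hypothetical, and the input is assumed to be a plain list with one URL per line.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify_url(url: str) -> str:
    """Return a coarse label for one scraped URL (hypothetical helper)."""
    url = url.strip()
    if not url:
        return "empty"
    parts = urlsplit(url)
    # Anything without an explicit http/https scheme counts as protocol-less,
    # e.g. "t.co/4P985nEYWA".
    if parts.scheme not in ("http", "https"):
        return "protocol-less"
    host = (parts.hostname or "").lower()
    if host == "t.co":
        return "t.co"
    if host == "twitter.com" or host.endswith(".twitter.com"):
        return "twitter"
    return "external"


def summarise(lines):
    """Count how many URLs fall into each label."""
    return Counter(classify_url(line) for line in lines)
```

Running `summarise` over the lines of a scrape file gives a per-category count, which makes oddities like a handful of protocol-less t.co links easy to spot before queueing.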