Feed readers which don't take "no" for an answer

236 points by kencausey a year ago

strogonoff 10 months ago

A friend of mine co-runs a semi-popular semi-niche news site (for more than a decade now), and complains that traffic recently rose because of bots masquerading as humans.

How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): it shows bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (which is now less than 50%) as they started to fall from the front page (feeding the vicious circle). Presumably, Google does not know or care it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.

Or, perhaps, Google spots the same anomalies that my friend (an old school sysadmin who pays attention to logs) did, such as the increase of traffic along with never-before-seen popularity among iPhone users (who are so tech savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I’m not going to list all telltale signs as the crowd here is too hyped on LLMs (which is our going theory so far, it is very timely), but my friend hopes Google learns them quickly.

These newcomers usually fake UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do sign themselves as bots in UA), ignore robots.txt and load many pages very quickly.

I would assume bot traffic increase would apply to feeds, since they are of as much use for LLM training purposes.

My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website with actual original content (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection because of the domination of these creepy-crawlies.

Edit: Google already downranked them, not threatened to downrank. Also, traffic rose but did not skyrocket, but relative amount of bot traffic skyrocketed. (Presumably without downranking the traffic would actually skyrocket.)

afandian 10 months ago

Are you saying that Google down-ranked them in search engine rankings for user behaviour in AdWords? Isn't that an abuse of monopoly? It still surprises me a little bit.
- malfist 10 months ago
  
  Who's going to call them on it if it is?
- EdwardDiego 10 months ago
  
  Yeah, but then who is going to stop them acting monopolistic?
  New administration is going to be monopoly friendly.
  I was honestly pleased that Gaetz was nominated for AG solely because he's big on antitrust. Or has been.
  - mapt 10 months ago
    
    Any sentiment expressed by the party which has dedicated itself to unrestricted corporate rights in this direction is an insincere attempt to pander to a current culture war front they are fighting that week; In this case, likely something along the lines of 'Twitter censored Trump's hydroxychloroquine post - we MUST PUNISH THEM AND REIN IN BIG TECH [for not contributing to the fascist project]'.
    EDIT: Direct quote - "The internet's hall monitors out in Silicon Valley, they think they can suppress us, discourage us. Maybe if you're just a little less patriotic. Maybe if you just conform to their way of thinking a little more, then you'll be allowed to participate in the digital world,"
    This isn't an attempt to ensure freedom from monopoly, this is an attempt to enforce partisan control of the message, weaponizing the idea of free speech using force.
    I can assert that the 'common public square' idea central to freedom of speech is disappearing, and that this is a bad thing, but that's not what this man has been arguing or why this man has chosen this issue.
  - johnnyanmac 10 months ago
    
    if you believe their words (and I can't blame anyone who doesn't) apparently they want to lighten regulations on everything except big tech. So there may be a chance all those Google/Amazon cases will keep going on into the Trump administration.
    
    llamaimperative 10 months ago
    
    To be clear this isn't because they have a problem with monopoly businesses abusing consumers. It's because big tech exercised their First Amendment rights in ways he found undesirable.
    https://www.bbc.com/news/world-us-canada-57754435
    Note that he's still talking about breaking up tech companies but not... X? (Surely that will resume once he and Elon have a falling out)
m3047 10 months ago

It's not that hard to dominate bots. I do it for fun, I do it for profit. Block datacenters. Run bot motels. Poison them. Lie to them. Make them have really really bad luck. Change the cost equation so that it costs them more than it costs you.
You're thinking of it wrong, the seeds of the thinking error are here: "I wonder how soon it becomes actually infeasible to operate a website with actual original content".
Bots want original content, no? So what's the problem with giving it to them? But that's the issue, isn't it? Clearly, contextually, what you should be saying is "I wonder how soon it becomes actually infeasible to operate a website for actual organic users" or something like that. But phrased that way, I'm not sure a CDN helps (I'm not sure they don't suffer false positives which interfere with organic traffic when they intermediate, more security theater because hangings and executions look good, look at the numbers of enemy dead).
Take measures that any damn fool (or at least your desired audience) can recognize.
Reading for comprehension, I think Rachel understands this.
- throaway89 10 months ago
  
  what is a bot motel and how do you run one?
  - m3047 10 months ago
    
    Easy way is to implement e.g. a 4xx handler which serves content with links which generate further 4xx errors and rewrite the status code to something like 200 when sent to the requester. Load the garbage pages up with... garbage.
    
    m3047 10 months ago
    
    Since this is getting upvoted, I will put forth a suggestion I've made to the people who've paid me to help with this sort of subterfuge: turn your 404 handler into search. Then a human who goes there has a way out. But absolutely, load it up with garbage and broken links.
    
    throaway89 10 months ago
    
    Thanks, and you can make money with this? Sorry I'm a total noob in this area.
    
    shadowgovt 10 months ago
    
    Not really... You cost the bots money.
    Many are trying to index the web for whatever reason. By feeding them a Library of Babel, you can clog up their storage with noise.
    
    m3047 10 months ago
    
    Once in a while people pay you to do something you enjoy doing, like making people cry and wish they had a job flipping burgers instead. But I do it on my own systems for fun, honestly.
  - yesco 10 months ago
    
    The idea is that bots are inflexible to deviations from accepted norms and can't actually "see" rendered browser content. So if your generic 404 and 403 error pages return a 200 status instead, with invisible links to other non-accessible pages, the bots will follow the links but real users will not. That traps the bots in a kind of isolated labyrinth of recursive links (the URLs should be slightly different though). It's basically how a lobster trap works if you want a visual metaphor.
    The important part here is to do this chaotically. The worst sites to scrape are buggy ones. You are, in essence, deliberately following bad practices in a way real users wouldn't notice but would still influence bots.
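For anyone who wants to see the shape of this, here is a minimal Python sketch of the bot motel idea from this subthread: unknown paths get a 200 instead of a 404, along with hidden links to more nonexistent pages and a search box as the way out for humans. `KNOWN_PAGES`, `trap_links`, `respond`, and the `/archive/` and `/search` paths are all made up for illustration; this is not anyone's actual setup.

```python
import hashlib

# Hypothetical site content; a real server would look pages up on disk.
KNOWN_PAGES = {"/": "<h1>Home</h1>", "/about": "<h1>About</h1>"}


def trap_links(path: str, count: int = 5) -> list[str]:
    """Derive `count` slightly different, nonexistent URLs from the requested path."""
    links = []
    for i in range(count):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"/archive/{digest}.html")
    return links


def respond(path: str) -> tuple[int, str]:
    """Return (status, body). Unknown paths get a 200 trap page instead of a 404."""
    if path in KNOWN_PAGES:
        return 200, KNOWN_PAGES[path]
    # Links are hidden from rendered pages, so only link-following bots take them.
    anchors = "".join(
        f'<a href="{url}" style="display:none">more</a>' for url in trap_links(path)
    )
    # The search form gives a human who lands here a way out.
    body = f'<form action="/search"><input name="q"></form>{anchors}'
    return 200, body
```

Each trap link is itself an unknown path, so following it produces another trap page with a fresh set of links.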
blfr 10 months ago

QQBrowser users from Dallas are more likely to be Chinese using a VPN than bots, I would guess.
- strogonoff 10 months ago
  
  That much is clear, yeah. The VPN they use may not be a service advertised to public and featured in lists, however.
  Some of the new traffic did come directly from Tencent data center IP ranges and reportedly those bots signed themselves in UA. I can’t say whether they respect robots.txt because I am told their ranges were banned along with robots.txt tightening. However, US IP bots that remain unblocked and fake UA naturally ignore robot rules.
  - thaumasiotes 10 months ago
    
    > The VPN they use may not be a service advertised to public and featured in lists, however.
    Well, of course not, since the service is illegal.
- m3047 10 months ago
  
  I'm seeing some address ranges in the US clearly serving what must be VPN traffic from Asia, and I'm also seeing an uptick in TOR traffic looking for feeds as well as WP infra.
BadHumans 10 months ago

At my company we have seen a massive increase in bot traffic since LLMs have become mainstream. Blocking known OpenAI and Anthropic crawlers has decreased traffic somewhat so I agree with your theory.
nicbou 10 months ago

I don’t think it’s a bot thing. Traffic is down for everyone and especially smaller independent websites. This year has been really rough for some websites.
- wkat4242 10 months ago
  
  I think it's also because a lot of sites have started paywalling. So users walk away.
is_true 10 months ago

I too found an extremely unlikely % of iphone users when checking access logs.
wiseowise 10 months ago

> who are so tech savvy that they apparently do not require CSS
Lmao!
m3047 10 months ago
Here's Crime^H^H^H^H^(ahem)Cloudflare requesting assets from one of my servers. I don't use Cloudflare, they have no business doing this.
```
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /favicon.ico HTTP/1.1" 200 302 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /dubai-letters/balkanized-internet.html HTTP/1.1" 200 16370 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"

  # dig -x 104.28.42.8

  ; <<>> DiG 9.12.3-P1 <<>> -x 104.28.42.8
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 35228
  ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 1280
  ; COOKIE: 6b82e88bcaf538fc7ab9d44467685e82becd47ff4492b1be (good)
  ;; QUESTION SECTION:
  ;8.42.28.104.in-addr.arpa.      IN      PTR

  ;; AUTHORITY SECTION:
  28.104.in-addr.arpa.    3600    IN      SOA     cruz.ns.cloudflare.com. dns.cloudflare.com. 2288625504 10000 2400 604800 3600

  ;; Query time: 212 msec
  ;; SERVER: 127.0.0.1#53(127.0.0.1)
  ;; WHEN: Sun Dec 22 10:46:26 PST 2024
  ;; MSG SIZE  rcvd: 176
```
Further osint left as an exercise for the reader.
- Crosseye_Jack 10 months ago
  
  104.28.42.0/25 is one of the IP ranges used by Apple's Private Relay (via Cloudflare)
  https://github.com/hroost/icloud-private-relay-iplist/blob/m...
  (There is also a list of ranges on Apple's site, but I forget where…)
  Edit: found it https://mask-api.icloud.com/egress-ip-ranges.csv
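A quick way to check a logged address against ranges like this is Python's standard `ipaddress` module. The sketch below hardcodes only the single /25 quoted above; `PRIVATE_RELAY_RANGES` and `is_private_relay` are illustrative names, and a real check would load the full list from Apple's CSV.

```python
import ipaddress

# Only the one range quoted in this thread; the full list is in Apple's CSV.
PRIVATE_RELAY_RANGES = [ipaddress.ip_network("104.28.42.0/25")]


def is_private_relay(ip: str) -> bool:
    """True if `ip` falls inside any of the known relay egress ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_RELAY_RANGES)
```

With that one range loaded, `is_private_relay("104.28.42.8")` is true, since a /25 starting at .0 covers .0 through .127.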
- shadowgovt 10 months ago
  
  What is the issue with this request?
  - m3047 10 months ago
    
    > What is the issue with this request?
    I didn't realize this was an Apple thing, but that's fine. It changes the color of the horse and the name of the river, but the same road leads to the same destination.
    1) There is a notion that Cloudflare is a content distribution network. The risk profile for a content distribution network is different from a VPN service. Now I know it's a VPN service (or is it?). Changes it from "seems weird and inappropriate" to "do I care about people relying on this? no, probably not". Cloudflare can't be arsed to provide reverse DNS for something which is clearly not part of their CDN, or is it?
    1.5) Is it layer 2 or application? Cloudflare runs a CDN. Correct me if I'm wrong, but the CDN is a reverse proxy is it not? Is Cloudflare caching my website's content? Can they observe it? (It's surprisingly hard to find a solid explanation, but they talk about "proxies" and "decrypts the name of the website you requested" and none of that adds clarity, it makes it sound more like believe what we want you to believe.)
    2) I don't block incoming SYNs from Cloudflare (yet) the way I do with Amazon, and this traffic per se isn't going to trip any mitigations here. But not all of the traffic is as benign (and it's impressive that they're so technically savvy they don't need the CSS as noted elsewhere). Presumably those exit points are shared by multiple customers. Did I mention I block all incoming SYNs from Amazon?
    
    Crosseye_Jack 10 months ago
    
    > and it's impressive that they're so technically savvy they don't need the CSS as noted elsewhere
    With the logs you provided, they appear to be coming from within iMessage.
    So when someone posts a link in iMessage it will fetch the favicon(s) and the html in order to generate a “preview” of the page with the title of the page and use one of the favicons. It doesn’t need to fetch any css files to do this.
    Not saying bad actors don’t fetch css either, but the lack of it being fetched doesn’t mean that it’s a bad actor.
    As for why CF don’t reverse DNS their IPs stating it’s iCloud private relay, well CF are not Apple's only 3rd party egress provider (Akamai are also one that springs to mind). So if the number of providers can change at any time, the best source of information about valid egress providers is from Apple themselves.
    But Apple do also publish these changes to geo-location databases for you to query, for example: https://www.ip2location.com/demo/104.28.42.8 lists it as iCloud Private Relay.
    As for “are CloudFlare caching my site when run through private relay?”, not 100% sure, I’ll have to check my own logs and cba’ed right now, but I don’t think so (it’s been a while since I ran tests on it to see how it behaved, so I can't be 100% sure right this minute).
    But I think it would be silly of them if they did, as they may not be aware of what to cache and for whom. Let’s say they cached /profile without knowing what the server is using to determine who the logged in user is, they may false cache-hit and leak data from a previous request. When they act as your site's CDN you explicitly tell them what to cache on, but when acting as a relay (either for Apple or their own WARP product) for a site they are not a CDN for, they are missing this info. Sure, they could guess, but why risk being wrong?
    
    m3047 10 months ago
    
    Thanks for the explanation.

Apreche 10 months ago

Feed readers should be sending the If-Modified-Since header and web sites should properly recognize it and send the 304 Not Modified response. This isn’t new tech.
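As a rough illustration of the client side, here is a Python sketch of a conditional fetch using only the standard library. It assumes the reader has stored the `Last-Modified` and `ETag` values from its previous successful fetch; `conditional_headers` and `fetch_feed` are hypothetical helper names, not part of any particular feed reader.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None, etag=None):
    """Build the validator headers from whatever the last 200 response gave us."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_feed(url, last_modified=None, etag=None):
    """Return (body, last_modified, etag); body is None when the server answers 304."""
    req = urllib.request.Request(url, headers=conditional_headers(last_modified, etag))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read(), resp.headers.get("Last-Modified"), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; nothing changed, keep the old validators.
            return None, last_modified, etag
        raise
```

The first fetch passes no validators and gets the full feed; every later fetch passes back what the server sent and costs only a few hundred bytes when nothing has changed.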

graemep 10 months ago

That is exactly what the article says.
- smallerize 10 months ago
  
  The article implies this but doesn't actually say it. It's nice to have the extra detail.
  - rtpg 10 months ago
    
    Rachel has been writing about feed readers a lot on their blog the past year (https://rachelbythebay.com/w/ shows 21 results for "feed reader"), so this is part of a whole narrative.
    If you're interested in it, I highly recommend just reading from start to end. It's all quite interesting (including building out a whole test service for feed readers to get scored on their behavior)
    
    soapdog 10 months ago
    
    I am writing a feed reader and I saved their posts about it to use as my implementation guidelines, got them printed and stuff.
  - avg_dev 10 months ago
    
    While it might be nice if the article spelled out the header, I do believe that there is more than implication present.
    > 00:04:51 GET /w/atom.xml, unconditional.
    > Fulfilled with 200, 502 KB.
    > [...]
    > A 20 minute retry rate with unconditional requests is wasteful. [...]
    And If-Modified-Since makes a request conditional. https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditiona...
    
    JadeNB 10 months ago
    
    You left out a further explicit mention of conditional requests:
    > Advised (via Retry-After header) to come back in one day since they are unwilling or unable to do conditional requests.
    But I think it's still unarguable that the post doesn't explicitly mention If-Modified-Since, which it's not obliged to do, but the mention of it here could be helpful to someone. So why fuss?
    
    chipsa 10 months ago
    
    There’s also the option of “If-None-Match”, to make the request conditional. The point is being conditional, not how.
- shkkmo 10 months ago
  
  The people who already know that a "conditional request" means a request with an If-Modified-Since header aren't the ones who need to learn this information.
dartos 10 months ago

If only people knew of the standards
righthand 10 months ago

Yeah but my LLM won’t generate that code.

Havoc 10 months ago

Blocked for 2 hits in 20 minutes on a light protocol like rss?

That seems hilariously aggressive to me, but her server her rules I guess.

II2II 10 months ago

If your feed reader is refreshing every 20 minutes for a blog that is updated daily, nearly 99% of the data sent is identical. It looks like Rachel's blog is updated (roughly) weekly, so that jumps to 99.8%. It's not the least efficient thing in the world of computers, but it is definitely incurring unnecessary costs.
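The percentages above can be checked with a few lines of Python. This is a rough sketch that assumes every poll downloads the full feed and that exactly one poll per update interval sees new content; `redundant_fraction` is a made-up helper name.

```python
def redundant_fraction(poll_minutes: float, update_interval_days: float) -> float:
    """Fraction of polls that return content the reader already has."""
    polls = update_interval_days * 24 * 60 / poll_minutes
    return (polls - 1) / polls


daily = redundant_fraction(20, 1)   # 72 polls per update, 71 of them redundant
weekly = redundant_fraction(20, 7)  # 504 polls per update, 503 of them redundant
```

That gives 71/72 (about 98.6%) for a daily blog and 503/504 (about 99.8%) for a weekly one.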
- elashri 10 months ago
  
  I opened the xml file she provides in the blog and it seems very long but okay. Then I decided it is a good blog to subscribe to, so I went and tried to add it to my self-hosted FreshRSS instance (same IP obviously) and I couldn't because I got blocked/rate limited. So yes it is aggressive for different reasons.
  - e3bc54b2 10 months ago
    
    This post (before reading your comment) actually made me look into my own freshrss setup.
    NixOS defaults to refresh frequency of every 5 minutes[0] (0_0).
    I had noticed some blogs blackholing me before, but never quite made the connection.
    So now it is configured to fetch every 12 hours. I believe that is fair.
    [0] https://github.com/NixOS/nixpkgs/blob/d70bd19e0a38ad4790d391...
  - KomoD 10 months ago
    
    Same, I made 3 requests in total and got blocked.
  - lilyball 10 months ago
    
    I know she's mentioned this particular problem before on her blog, I don't remember where to find it offhand now but my vague recollection is that because browsers have largely removed the ability to directly view RSS feeds she doesn't consider this a significant issue anymore.
    Why did you view the XML file directly?
    
    elashri 10 months ago
    
    > Why did you view the XML file directly?
    There are many reasons why I personally do this.
    1- Check that the link actually loads and works!
    2- See how much content there is and whether it contains the last n items or the whole feed history by default
    3- To see if the feed gives summary or full content of posts
    4- Just for curiosity, like in this case: I wanted to see what this feed is that prompted a blog post that reached the HN front page.
    It is usually a superposition state of those reasons. But this is why it is an aggressive limit. I know it's her server, her rules, but this wasn't a pleasant experience for me as an end user. I was just sharing my experience.
    
    kelnos 10 months ago
    
    I feel like a reasonable way to deal with this situation might be to look at the user agent: if two requests come from the same IP but different user agents, then it's likely that it's either actually two completely different people (behind a NAT), or this situation the GP described.
    That's certainly a bit more effort to implement, though, and the author might not think it's worth the time.
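A minimal sketch of that idea in Python, keyed on the (IP, user agent) pair rather than IP alone. `FeedRateLimiter` and its parameters are hypothetical and say nothing about how the blog's server actually works; timestamps are passed in explicitly to keep the example deterministic.

```python
class FeedRateLimiter:
    """Allow one feed fetch per (IP, user agent) pair per interval."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last_allowed = {}  # (ip, user_agent) -> time of last allowed request

    def allow(self, ip: str, user_agent: str, now: float) -> bool:
        key = (ip, user_agent)
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon for this client; caller would send a 429
        self._last_allowed[key] = now
        return True
```

With this keying, a browser and a feed reader behind the same address are counted separately, so viewing the XML once in a browser does not use up the feed reader's allowance. The trade-off is that a client can dodge the limit by rotating its user agent.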
  - radicality 10 months ago
    
    Weird. Those should have had different user-agents, and I would guess it cannot be purely based on IP.
  - mubou 10 months ago
    
    Yeah, that's insane. Pretty much telling me not to subscribe to your blog at that point. Like sites that have an rss feed yet put Cloudflare protection in front of it...
    The correct thing to do here is put a caching layer in front so that every feed reader isn't simultaneously hitting the origin for the same content. IP banning is the wrong approach. (Even if it's only a temporary block, that's going to cause my reader to show an error and is entirely unnecessary.)
- wakawaka28 10 months ago
  
  It should be a timeboxed block if anything. Most RSS users are actual readers and expecting them to spend lots of time figuring out why clicking "refresh" twice on their RSS app got them blocked is totally unreasonable. I've got my feeds set up to refresh every hour. Considering the small number of people still using RSS and how lightweight it is, it's not bad enough to freak out over. At some point all Rachel's complaining and investigating will be more work than her simply interacting directly with the makers of the various readers that cause the most traffic.
sccxy 10 months ago

Her rss feed is last 100 posts with full content.
So it means 30 months of blog posts content in single request.
Sending 0.5MB in single rss request is more crime than those 2 hits in 20 minutes.
- horsawlarway 10 months ago
  
  I generally agree here.
  There are a lot of very valid use cases where defaulting to deny for an entire 24 hour cycle after a single request is incredibly frustrating for your downstream users (shared IP at my university means I will never get a non-429 response... And God help me if I'm testing new RSS readers...)
  It's her server, so do as you please, I guess. But it's a hilariously hostile response compared to just returning less data.
  - mrweasel 10 months ago
    
    > But it's a hilariously hostile response compared to just returning less data.
    So provide a poor service to everyone, because some people don't know how to behave. That seems like an even worse response.
    
    sccxy 10 months ago
    
    Send only one year's recent posts and you've reduced bandwidth by 50%.
    
    wakawaka28 10 months ago
    
    People don't want to have to customize refresh rates on a per-feed basis. Perhaps the RSS or Atom standards need to support importing the recommended refresh rate automatically.
    
    ncallaway 10 months ago
    
    They don't need to change the refresh rate, though. They need to make conditional requests with an etag or a last-modified date, so the server can respond with a 304 not modified if no changes have been made.
    No standards need to be updated. The client software needs to be a better HTTP citizen.
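For the server side of the same exchange, here is a simplified Python sketch of the "should I answer 304?" decision. It assumes the server knows the feed's current ETag and Last-Modified date; `should_send_304` is an illustrative name. Following the HTTP precedence rule, If-None-Match is checked first and If-Modified-Since is only consulted when no ETag validator was sent. Splitting the header on commas is a shortcut that would mishandle ETags containing commas.

```python
from email.utils import parsedate_to_datetime


def _opaque(tag: str) -> str:
    """Strip the weak-validator prefix so weak and strong tags compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def should_send_304(request_headers: dict, current_etag: str, last_modified: str) -> bool:
    """Decide whether a GET for the feed can be answered with 304 Not Modified."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        candidates = [_opaque(t) for t in inm.split(",")]
        return "*" in candidates or _opaque(current_etag) in candidates
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims)
        except (TypeError, ValueError):
            return False  # unparseable date: treat the request as unconditional
    return False  # no validators sent: send the full feed
```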
    
    wakawaka28 10 months ago
    
    What about people who reside in the same place who have multiple RSS aggregators that scrape the same RSS? Her analysis will not handle that I think. At some point she is going to have to talk to the engineers that made it if she wants something done. Or she could take it upon herself to fix the software (at least the ones that are open-source). If she's just sharing the investigation then it's fine. But if the goal is to get the problems fixed, whining to us is probably the least efficient way to do it. She is knowledgeable enough to fix probably half of the RSS readers that she is complaining about and definitely knowledgeable enough to engage with all of them about fixing their code.
- aidenn0 10 months ago
  
  If there were a widely supported standard for pagination in RSS, then it would make sense to limit the number of posts. As there isn't, sending 500kB seems eminently reasonable, and RSS readers that send conditional requests are fine.
  - snthd 10 months ago
    
    "Pagination in feeds like ATOM and RSS?" - https://stackoverflow.com/questions/1301392/pagination-in-fe...
    Sounds like something that could be scored in the rss reader tests.
- EdwardDiego 10 months ago
  
  Did you actually write 500KB as 0.5MB to make it sound BIGGER?
  Clever.
- wakawaka28 10 months ago
  
  Yes that's right. Most blogs that are popular enough to have this problem send you the last 10 post titles and links or something. THAT is why people refresh every hour, so they don't miss out.
  - ParetoOptimal 10 months ago
    
    I hate RSS feeds that don't include full content.
  - EdwardDiego 10 months ago
    
    If only there were some kind of HTTP headers that could help them stop doing a GET every hour!
    Gosh darn, if only I could say "Hey, please only send me the data if it's been modified since I last requested it an hour ago" somehow.
    
    wakawaka28 10 months ago
    
    Sure, but whining to the broader public about it before talking to the engineers who made the offending software seems like a bad idea. The public doesn't really care and will keep using their preferred readers. In all my years on the Internet (a lot) I have never seen anyone complain about the volume of RSS traffic they got. If Rachel enjoys sharing her experience of investigating this issue, that's fine. But if she is sharing it with the expectation that her readers will randomly go fix other people's software, that's becoming unreasonable.
- BonoboIO 10 months ago
  
  Complains about traffic, sends 0.5mb of everything.
  That’s my kind of humor.
- xyzsparetimexyz 10 months ago
  
  sigh feed readers set the If-Modified-Since header so that the feed is only resent when there are new items.
cesarb 10 months ago

> Blocked for 2 hits in 20 minutes on a light protocol like rss?
I might be getting old, but 500KB in a single response doesn't feel "light" to me.
- sccxy 10 months ago
  
  Yes, this is a very poorly designed RSS feed.
  500KB is horrible for RSS.
  - Symbiote 10 months ago
    
    It's reasonable to have whole articles in RSS, if you aren't trying to show ads or similar.
    
    sccxy 10 months ago
    
    Whole articles are reasonable.
    100 articles are not reasonable.
    100 articles where most of them are 1+ year old is madness.
    RSS is not an archive of the entire website.
    
    Sweepi 10 months ago
    
    Well, it's not the entire website, and I find the "last 100 articles" rule way better than "last 3" or "last 90 days" (which sometimes is 0 or 1).
    The host is fine with sending 0.5 MiB once (the client should be as well, from both a bandwidth and storage point of view).
    The host is not fine with sending 0.5 MiB every 20 minutes, which could be easily avoided if the client would use the mentioned "If-Modified-Since header".
    
    xyzsparetimexyz 10 months ago
    
    Thanks for being someone who actually knows this stuff among all the Dunning Kruger replies
    
    sangnoir 10 months ago
    
    > RSS is not an archive of the entire website
    Whole-article feeds end up becoming exactly that - a local archive of a blog.
garfij 10 months ago

I believe if you read carefully, it's not blocked, it's rate limited to once daily, with very clear remediation steps included in the response.
- that_guy_iain 10 months ago
  
  If you understand what rate limiting is, you block them for a period of time. Let's stop being pedantic here.
  72 requests per day is nothing and acting like it's mayhem is a bit silly. And for a lot of people would result in them getting possible news slower. Sure OP won't publish that often but their rate limiting is an edge case and should be treated as such. If they're blocked until the next day and nothing gets updated then the only person harmed is OP for being overly bothered by their HTTP logs.
  Sure it's their server and they can do whatever they want. But all this does is hurts the people trying to reach their blog.
  - HomeDeLaPot 10 months ago
    
    72 requests per day _per user with a naive feed reader_. This is a small personal blog with no ads that OP is self-hosting on her own hardware, so blocking all this junk traffic is probably saving her money. Plus she's calling attention to how feed readers can be improved!
    
    m3047 10 months ago
    
    My reason for smacking stuff down is that I don't want to see it in my logs. That simple.
    
    that_guy_iain 10 months ago
    
    Even if they had 1000 feed readers which would be a massive amount for a blog, if you can't scale that cheaply, that's on you.
    As I pointed out, her blog and rate limiting are an extreme edge case, it would be silly for anyone to put effort into changing their feed reader for a single small blog. It's bad product management.
    
    tecleandor 10 months ago
    
    Of course she can. It's static. She doesn't want to, and I understand. She's sending her clients a standard signal that says "I think you have already read this, at least ask me first when this last changed".
    
    that_guy_iain 10 months ago
    
    If you choose to run a poorly implemented rss feed and not scale it cheaply you lose any sympathy from me.
    
    EdwardDiego 10 months ago
    
    So long as you know what If-Modified-Since is, and use it, you can have all or none of the sympathy you want.
  - throw0101b 10 months ago
    
    > 72 requests per day is nothing and acting like it's mayhem is a bit silly.
    72 requests per day per IP over how many IPs? When you start multiplying numbers together they can get big.
  - quest88 10 months ago
    
    I invite you to run your own popular blog on your own hardware and pay for the costs. It sounds like you don't know what the true costs are.
    
    donatj 10 months ago
    
    I do run a popular blog, and a $5 a month Digital Ocean droplet handles millions of requests per month without breaking a sweat.
    
    devjab 10 months ago
    
    If every user is collecting 36mb a day like in the story here, your droplet wouldn’t even be capable of serving 500 users a month without hitting your bandwidth limit. With their current rates, your one million requests would cost you around 10 million USD.
    
    donatj 10 months ago
    
    30 * 500 * 36mb = 540gb and I have 1tb a month on my apparently $6 droplet
    Correction - from my billing page it's $4.50 a month, from the resize page it is $6 so I'm guessing I am grandfathered in to some older pricing
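The bandwidth figures being thrown around in this subthread are easy to reproduce. The sketch below assumes a 500 KB feed, unconditional polling every 20 minutes, decimal units (1 MB = 1000 KB), and a 30-day month; the helper names are made up.

```python
FEED_KB = 500  # approximate size of the full feed, per the article


def daily_mb_per_reader(poll_minutes: float, feed_kb: float = FEED_KB) -> float:
    """Megabytes per day one reader pulls if every poll downloads the whole feed."""
    polls_per_day = 24 * 60 / poll_minutes
    return polls_per_day * feed_kb / 1000


def monthly_gb(readers: int, daily_mb: float, days: int = 30) -> float:
    """Gigabytes per month for `readers` clients each pulling `daily_mb` per day."""
    return readers * days * daily_mb / 1000
```

Under those assumptions one naive reader pulls 36 MB a day, and 500 of them come to 540 GB a month, which does fit inside a 1 TB allowance but is almost entirely repeated data.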
    
    tecleandor 10 months ago
    
    That's a ridiculously big quantity of data to serve a seldom-updated blog just because the client doesn't want to (or doesn't know how to, or hasn't thought to) implement an easy and old HTTP feature.
    Imagine the petabytes of data transfer saved across the internet if a couple of RSS clients added that feature.
    
    sccxy 10 months ago
    
    If OP enabled gzip then this 36mb would be 13mb.
    If OP reduced 30 months of posts in rss to 12 months then this 13mb would be 5mb a day.
    Using Cloudflare free plan and this static content is cached without any problem.
    
    notpushkin 10 months ago
    
    Yeah, but also... if RSS readers behaved correctly, it would be 512 kb. (170 kb with gzip, if she didn't enable it like you imply – I'm too lazy to check, but I assumed it was on.)
    I think making clients behave correctly is much more sustainable solution, although we could do better than doing so at the cost of the end users.
    
    int_19h 10 months ago
    
    This entire thread is a vivid illustration of why software is so shitty in general these days.
    
    Twirrim 10 months ago
    
    OP has never said that this is about financial aspects of things.
    
    Joker_vD 10 months ago
    
    Yes, it's about enforcing their preference on how others should interact with OP's published site feed, on principle. Which is always an uphill battle.
    
    Twirrim 10 months ago
    
    It's about enforcing that people follow standards. Which is still an uphill battle, but at least it's based in something sane. Their work on this has resulted in improvements to a whole slew of popular feed readers that should make life easier for a chunk of the internet, not just OP's own site.
    
    omgtehlion 10 months ago
    
    I serve 30tb/month for $30/mo on my own colocated hw
    
    that_guy_iain 10 months ago
    
    Sounds like you don't know how to scale for cheap.
    And since I've run integrations that connected over 500 companies, I know what a rogue client actually looks like, and at 72 requests per day I wouldn't even notice.
    
    EdwardDiego 10 months ago
    
    Good on you champ.
    
    sccxy 10 months ago
    
    More like a skill issue or just a decision to make your life more difficult.
    It is free and easy to scale this kind of text based blog.
    
    EdwardDiego 10 months ago
    
    > More like a skill issue
    Hey, I think you mistook HN for Reddit.
mrweasel 10 months ago

But it's not a "light" protocol when you're serving 36MB per day when 500KB would suffice. RSS/Atom is lightweight if clients play by the rules. This could also have been a news website; imagine how much traffic would be dedicated to pointless transfers of unchanged data. Traffic isn't free.
A similar problem arises from the increase in AI scraper activity. Talking to other SREs, the problem seems pretty widespread. AI companies will just hoover up data, but revisit so frequently and aggressively that it's starting to affect the transit feeds for popular websites. Frequently user-agents aren't set to something unique, or are deliberately hidden, and traffic originates from AWS, making it hard to target individual bad actors. Fair enough that you're scraping websites, that's part of the game when you're online, but when your industry starts to affect transit feeds, then we need to talk compensation.
yladiz 10 months ago

That’s a bit disingenuous. 429s aren’t “blocking”, they’re telling the requester that they’ve made too many requests and to try again later (with a value in the header). I assume the author configured this because they know how often the site is typically going to change. That the web server eventually stops responding if the client ignores requests isn’t that surprising, but I doubt that was configured directly either.
- Havoc 10 months ago
  
  Semantics. 429 is an error code. Rate limiting...blocking...too many requests...ignoring...call it whatever you like but it amounts to the same, namely the server isn't serving the requested content.
- luckylion 10 months ago
  
  > 429s aren’t “blocking”
  Like how "unlimited traffic, but will slow down to 1bps if you use more than 100gb in a month" is technically "unlimited traffic".
  But for all intents and purposes, it's limited. And 429s are blocking. They include a hint towards the reason why you are blocked and when the block might expire (retry-after doesn't promise that you'll be successful if you wait), but besides that, what's the difference compared to 403?
  - yladiz 10 months ago
    
    I would disagree. Blocking typically implies permanence (without more action by the blockee), and since 429 isn’t usually a permanent error code I wouldn’t call it blocking. Same applies with 403, it’s only permanent if the requester doesn’t authorize correctly.
- that_guy_iain 10 months ago
  
  I would say it's disingenuous to claim sending HTTP status and body that is not expected for a period of time is not blocking them for that period of time. You can be pedantic and claim "but they can still access the server" but in reality that client is blocked for a period of time.
  - kstrauser 10 months ago
    
    In that case, I should be irate that the AWS API blocks me many times per day. Run `aws cli service some-paginated-thing` and see how many retries you get during normal, routine operation.
    But I’m not, because they’re not blocking me. They’re asking my client to slow down. Neither AWS nor Rachel’s blog owes me unlimited requests per unit time, and neither has “blocked” me when I violate their policies.
    
    that_guy_iain 10 months ago
    
    They literally do block you for a period of time until you are out of the rate limit. That is how rate limits work. That's why you don't get to access the resource you requested, because their system literally blocked you from doing so.
    See when you're trying to be pedantic and all about semantics, you should make sure you've crossed your Ts and dotted your Is.
    > Block – AWS WAF blocks the request and applies any custom blocking behavior that you've defined.
    from https://docs.aws.amazon.com/waf/latest/developerguide/waf-ru...
    And my favourite
    > Rate limiting blocks users, bots, or applications that are over-using or abusing a web property. Rate limiting can stop certain kinds of bot attacks.
    From CloudFlare's explainer https://www.cloudflare.com/learning/bots/what-is-rate-limiti...
    Every documentation on rate limit will include the word block. Because that's what you do, you allow access for a specific amount of requests and then block those that go over.

jannes 10 months ago

The HTTP protocol is a lost art. These days people don't even look at the status code and expect some mumbo jumbo JSON payload explaining the error.

klntsky 10 months ago

I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific. They are effectively a part of every API automatically without considerations whether they are needed.
People often implement error handling using constructs like regexp matching on status codes, while with domain-specified errors it would be obvious what exactly is the range of possible errors.
Moreover, when people do implement domain errors, they just have to write more code to handle two nested levels of branching.
- throw0101b 10 months ago
  > I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific.
  Perhaps put the app-specific part in the body of the reply. In the RFC they give a human specific reply to (presumably) be displayed in the browser:
      HTTP/1.1 429 Too Many Requests
      Content-Type: text/html
      Retry-After: 3600

      <html>
        <head>
          <title>Too Many Requests</title>
        </head>
        <body>
          <h1>Too Many Requests</h1>
          <p>I only allow 50 requests per hour to this Web site per logged in user. Try again soon.</p>
        </body>
      </html>
  * https://datatracker.ietf.org/doc/html/rfc6585#section-4
  * https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
  But if the URL is specific to an API, you can document that you will/may give further debugging details (in text, JSON, XML, whatever).
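On the client side, honoring that `Retry-After` header is not much work. A minimal sketch, assuming the header value has already been pulled out of the response; the function name `retry_after_seconds` is made up for illustration. The header may carry either a number of seconds or an HTTP-date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Return how many seconds to wait, or None if the header is unusable.

    Retry-After is either a non-negative integer number of seconds or an
    HTTP-date; both forms are handled here.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return int(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "you may retry now".
    return max(0, int((when - now).total_seconds()))
```

A reader would then skip polling that feed until the returned delay has passed, and fall back to its own default interval when the function returns `None`.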
- marcosdumay 10 months ago
  
  > because they are intended to be consumed by apps, but are not app-specific
  Well, good luck designing any standard app-independent protocol that works and doesn't do that.
  And yes, you must handle two nested levels of branching. That's how it works.
  The only improvement possible to make it clearer is having codes for API specific errors... what 400 and 500 aren't exactly. But then, that doesn't gain you much.
- est 10 months ago
  
  > error handling using constructs like regexp matching on status codes
  Oh the horror. I would assume the practice is encouraged by "RESTful" people?
KomoD 10 months ago

That's because a lot of people refuse to use status codes properly, like just using 200 everywhere.
- kstrauser 10 months ago
  
  A colleague who should’ve known better argued that a 404 response to an API call was confusing because we were, in fact, successfully returning a response to the client. We had a long talk about that afterward.
  - Joker_vD 10 months ago
    
    No, it is pretty confusing: the difference between 404 from hitting an endpoint that the server doesn't serve (because you forgot to expose this endpoint, oops!) and a 404 that means "we've successfully performed the search in our DB for the business entity you've requested and guarantee you that it does not exist" is rather difficult to tell programmatically.
    
    wiml 10 months ago
    
    If the URL identifies a resource (REST-style) and that database entry doesn't exist, then yes, 404 is a less confusing response. If the URL identifies an API endpoint (RPC-style) then, sure, tunnel the error inside a "I successfully failed to handle that request" response if you like.
    
    reshlo 10 months ago
    
    All URLs used when interacting with an API obviously identify API endpoints. There is no such thing as a URL which is part of an API but which is not an API endpoint.
    There is a difference between /api/entity/123 and /api/search with a payload of 123, though.
    
    tbrownaw 10 months ago
    
    422 unprocessable content (webdav)
    The request couldn't be processed due to semantic errors... perhaps such as not being mapped to a handler :->
    I suppose from the right point of view that could also be likened to a reverse proxy not being able to send the request on (502 bad gateway), but sane people would probably find that even more confusing.
    There were also attempts to use 204 no content for "I successfully confirmed that what you asked for doesn't exist", but I think I managed to shoot those down.
    
    yjftsjthsd-h 10 months ago
    
    I'm open to arguing about which error to return in each case, but surely we can agree that neither of those warrants a 200?
    
    echoangle 10 months ago
    
    Why not? I wouldn’t say „I performed the search and there’s 0 results“ is an error condition. It’s just the result of a search, and everything went fine.
    
    yjftsjthsd-h 10 months ago
    
    Hm, maybe? I guess it depends on what we mean by search; if myapp.com/search?someproduct finds that there are 0 matches then yeah that's probably a 200, but if myapp.com/products/123456 fails because no product has id 123456 then that's a textbook 404.
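A minimal sketch of that distinction, with a hypothetical in-memory catalogue standing in for the database. The function names and the `PRODUCTS` dict are invented for illustration; this shows only the resource-style reading argued for here, not the only defensible one.

```python
from http import HTTPStatus

PRODUCTS = {"123": "widget"}  # hypothetical in-memory catalogue

def lookup_product(product_id):
    """GET /products/<id>: the URL names one resource, so a miss is a 404."""
    if product_id in PRODUCTS:
        return HTTPStatus.OK, {"id": product_id, "name": PRODUCTS[product_id]}
    return HTTPStatus.NOT_FOUND, {"error": "no such product"}

def search_products(term):
    """GET /search?q=<term>: the search ran fine, so zero matches is still a 200."""
    matches = [pid for pid, name in PRODUCTS.items() if term in name]
    return HTTPStatus.OK, {"results": matches}
```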
    
    wruza 10 months ago
    
    It’s both nonsense, cause what you see here is a double conversion from an arbitrary problem domain into http domain and back again. Using a specific http code together with an app-domain code could make sense iff you wanted an intermediate host (a proxy etc) to perform some additional operation based on that status. Otherwise http status doesn’t speak the call language and can be just OK. (400/500 should still be handled by a client).
    Back-and-forth conversion is a very poor idea. It works for what was “internet resources” initially (basically files and folders), but later people stretched that on application data models and that creates constant issues because people naturally can’t understand the mapping, cause there’s none. This is not a good idea. Talk to http hosts with http and talk to your client with a language you designed specifically for talking to it. 200 vs non-200 is http level and orthogonal to in-service statuses.
    
    Joker_vD 10 months ago
    
    No, the latter is absolutely a 200 because of separation of concerns and layering.
    The HTTP server, when it detects a "URI handler not found" condition, builds a 404 HTTP response and sends it as a normal payload through the underlying connection instead of turning it into a TLS error packet or an RST packet on TCP level (that's the TCP's standard response for "port handler process not found", after all) or something silly like that, and that is absolutely fine, because the application-level (HTTP) error messages should be transmitted by the transport level (TLS/TCP) just as normal messages would.
    The same reasoning holds just the same when we consider the usage of HTTP as a transport-level protocol for some higher-level RPC exchange. Yes, HTTP has some assortment of error codes that superficially look like they can be reused to serve as the upper-layer errors as well but that's a red herring.
AznHisoka 10 months ago

I don't look at the code because it's wrong sometimes. Some pages return a 200 yet display an error in the page.
- DaSHacka 10 months ago
  
  Nothing more annoying than a 200 response when the server 'successfully' serves a 404 page
  - CodesInChaos 10 months ago
    
    Returning a 3xx redirect to a generic error page is even worse than 200.

shepherdjerred 10 months ago

I like Rachel's writing, but I don't understand this recent crusade against RSS readers. Sure, they should work properly and optimizations can be made to reduce bandwidth and processing power.

But... why not throw a CDN in front of your site and focus your energy somewhere else? I guess every problem has to be solved by someone, but this just seems like a very strange hill to die on.

EdwardDiego 10 months ago

Because she's old school sysadmin mate, likes running her own stuff her own way, fair enough.
And she posts on it lots because she has a bunch of RSS clients pointed at her writing, because she's rather popular.
And she'd rather people writing this stuff just learn HTTP properly, at least out of professionalism, if not courtesy.
Hey, you might not, I might not, but we all choose our hills to die on.
My personal hill is "It's lollies and biscuits, not candy and cookies".
rollcat 10 months ago

> why not throw a CDN in front of your site [...]
Because this is how the open web dies - one website at a time. It's already near-dead on the client side - web browsers are not really "user" agents, but agents of oligopolist corporations, that have a stake in abusing you[1].
It's been attempted before with WAP[2], then AMP. But effectively, we're almost there.
[1]: https://www.5snb.club/posts/2023/do-not-stab/
[2]: https://news.ycombinator.com/item?id=42479172
est 10 months ago

> But... why not throw a CDN in front of your site and focus your energy somewhere else?
Yes it's been invented before, known as Feedburner, which was acquired & abandoned by Google.

generationP 10 months ago

Rejecting every unconditional GET after the first? That sounds a bit excessive. What if the reader crashed after the first and lost the data?

brookst 10 months ago

It’s a RSS feed. In that case, wait until the specified time and try again and any missed article will appear then. If it is constantly crashing so articles never get loaded, fix that.
- aleph_minus_one 10 months ago
  
  > If it is constantly crashing so articles never get loaded, fix that.
  This often requires doing lots of tests against the endpoint, which the server prohibits.
  - brookst 10 months ago
    
    Wait, so the argument is that this strict server policy is bad because it makes life hard for lazy RSS reader devs who insist on testing in production against servers they don’t own?
  - im3w1l 10 months ago
    
    If you are an rss-reader dev then you can set up a caching layer of your own.
    
    aleph_minus_one 10 months ago
    
    > If you are an rss-reader dev then you can set up a caching layer of your own.
    But are RSS reader devs willing to jump through such hoops?
    I would claim that writing a (simple) RSS reader (using a programming language that provides suitable libraries) is something that would be rather easy for me, but setting up a caching layer would (because I have less knowledge about the latter topic) take a lot more research from my side concerning how to do it.
    
    ncallaway 10 months ago
    
    > but setting up a caching layer would (because I have less knowledge about the latter topic) take a lot more research from my side concerning how to do it.
    If I was doing local development on an RSS reader, I'd just download any atom.xml file that seemed relevant, then serve it locally using php -S or some other local HTTP file-system server.
    That way I can hit that file a million times without bothering any remote server. Plus, then you've made your local development environment stable even in the face of internet outages.
    And if you automate the local HTTP server setup a little bit, then you can even run local integration tests.
    I dunno, it seems like if "write an RSS feed reader" is an easy problem, I'd expect "serve a file from my hard drive over HTTP on localhost" should also be an easy problem.
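For the integration-test variant, a rough sketch using only the Python standard library (the helper name `serve_directory` is made up; `php -S` or any other static file server would do the same job). Since Python 3.7 the stock `SimpleHTTPRequestHandler` also answers `If-Modified-Since`, so a reader's conditional-request path gets exercised locally too.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def serve_directory(directory, port=0):
    """Serve `directory` on localhost from a background thread.

    Port 0 asks the OS for any free port; the one actually chosen is
    available afterwards as server.server_address[1].
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Usage would be something like `server = serve_directory("fixtures")`, pointing the reader at `http://127.0.0.1:<port>/atom.xml`, and calling `server.shutdown()` when the test is done. The `fixtures` directory name is just an example.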
    
    im3w1l 10 months ago
    
    Sure, I have done such a thing myself and it was very simple. Let's say you do http_get(rss_address). Create a function http_cached_get, that looks for a recent cached response, and if none exists delegates to http_get and saves the response. In python this is like 10 lines.
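Roughly what that looks like, as a sketch rather than the exact code described above. The cache directory, the one-hour freshness window, and the `fetch` parameter (there so the network call can be swapped out in tests) are all assumptions of mine.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path(".http_cache")   # hypothetical location
MAX_AGE = 3600                    # seconds; an arbitrary choice

def http_get(url):
    """Plain uncached GET returning the response body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def http_cached_get(url, fetch=http_get, cache_dir=CACHE_DIR, max_age=MAX_AGE):
    """Return a recent cached body for `url`, fetching only when none exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

During development every repeated run then hits the local file instead of someone else's server, at least until the cached copy ages out.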
    
    shepherdjerred 10 months ago
    
    That's a bit insane
XCabbage 10 months ago

For that matter, what if it's a different device (or entire different human being) on the same IP address?

bombcar 10 months ago

At some point instead of 429 it should return a feed with this post as always newest.

cpeterso 10 months ago

That’s a great point: the client software isn’t listening to the server, so the server software should break the loop by escalating to the human reader. The message response should probably be even more direct with a call to action about their feed reader (naming it, if possible) causing server problems.
0xDEAFBEAD 10 months ago

Or a feed with only this post

PaulHoule 10 months ago

This is why RSS is for the birds.

My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr which absolves me of the responsibility of being on the other side of Rachel's problem.

With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.

When a blog gets posted Superfeedr hits an AWS lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is Superfeedr costs 10 cents a feed per month, which is a good deal for an active feed such as comments from Hacker News or articles from The Guardian but is not affordable for subscribing to 2000+ indy blogs, which YOShInOn could handle just fine.

I might yet write my own RSS head end, but there is something to say for protocols like ActivityPub and AT Protocol.

rakoo 10 months ago

That's why websub (formerly pubsubhubbub) was created and should be the proper solution, not a proprietary middleware
- PaulHoule 10 months ago
  
  Superfeedr is pubsubhubbub.
  - rakoo 10 months ago
    
    AWS lambdas and SQS are not
    
    PaulHoule 10 months ago
    
    But it is just a way to answer a webhook.
    Sure I could get DNS to point to my ADSL connection and set something up in my router so that my home computer can answer the webhook but then I can never turn my computer off. On top of that I have about one power outage a month.
    It would also be non-proprietary to spend $5k on a server and $300 a month on colo costs, but with AWS I can use what would be 1 cent of resources if I was optimizing that colo (with $10k of labor) and pay 10 cents for it (could really be spending up to $50 on a non-optimized colo, which is what I might have if I don't need to handle a billion webhooks a month). If I want to switch to Azure or some other service that could answer a webhook the labor involved is minuscule.
    
    rakoo 10 months ago
    
    Surely there's a step between self-hosting on your home computer and colocating a server you buy
    Surely a $15 per year VM would be enough to receive a few thousand POSTs per day (https://tinykvm.com/), or if that's not enough, paying less than $5 per month for a VPS is still cheap
    Surely transposing a binary hosted on vendor A to vendor B is as minuscule in involved labor as configuring a webhook
    I'm not trying to diss on your own reader which is an awesome thing to do, I'm just sad that resorting to a fully proprietary architecture is an automatism when considering a protocol that can't be more open than RSS
    
    PaulHoule 10 months ago
    
    I see those $5/mo or $15/yr VMs as pretty expensive. In the penny-pinching mindset you lowball your RAM which is fine on a good day but it runs out when it gets a load spike. So you need a monitoring system of some kind, backup, and so on, ...
    w/ AWS I get all kinds of charts or alarms free or very cheap. The lambda is about 20 lines of Python code, I was able to complete part of the project that I was uninterested in at the time very quickly. Later on it took maybe 2-3 hours to make a UI that would let me add, view and remove feeds from my reader's UI.
    YOShInOn's recommendation engine and UI were a research project however that was high-risk and might not have worked so it wouldn't have made sense to develop a better head end.
    A better head end is on the agenda today but that system has a lot of other problems such as a dangerously large database that needs to be pruned, O(N) algorithms that were fine for the first year, other problems on the tail end. And it competes with other systems.
    My experience is I can set something like this up in AWS and just not think about it for years.

wheybags 10 months ago

Rss is pretty light. Even if you say it's too much to be re-sending, you could remove the content from the rss feed (so they need to click through to read it), which would shrink the feed size massively. Alternatively, remove old posts. Or do both.

Hopefully you don't have some expensive code generating the feed on the fly, so processing overhead is negligible. But if it's not, cache the result and reset the cache every time you post.

Surely this is easier than spending the effort and emotional bandwidth to care about this issue?

I might be wrong here, but this feels more emotionally driven ("someone is wrong on the internet") than practical.

gavinsyancey 10 months ago

As a user of the RSS feed, please don't remove content from it so I have to click through. This makes it much less useful and more annoying to use.
- wheybags 10 months ago
  
  I always click through regardless, because the rss text is probably missing formatting and images. I'll never be sure I'm getting a proper copy of the article unless I click through anyway.

RA2lover 10 months ago

nilslindemann 10 months ago

I am stupid, why not just return an HTML document explaining the issue, when there is such an incorrect second request in 20 minutes, then blocking that IP for 24 hours? The feed reader software author has to react, otherwise its users will complain to him, no?

ruszki 10 months ago

That’s what 429 return is for, which is mentioned in the article.
- Too 10 months ago
  
  Most readers presumably keep showing the last valid response when they get an error, so that the user doesn’t notice. Returning a fake OK response, explaining that the reader is dumb, will bring it to the user's attention. Not that I would advocate for this solution, except for desperate moments.
ImPostingOnHN 10 months ago

It might be clever to return an rss feed containing 1 item: the html document you mention.
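Something along these lines, as a hedged sketch: the function name, parameters, and wording of the notice are invented for illustration, and nothing here is taken from what Rachel's server actually does. It builds a minimal RSS 2.0 document whose only item is the notice.

```python
import xml.etree.ElementTree as ET

def notice_feed(site_title, site_url, message, notice_url):
    """Build a minimal RSS 2.0 document whose only item is a notice."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = (
        "Feed temporarily replaced by a notice"
    )
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = "Your feed reader is polling too often"
    ET.SubElement(item, "link").text = notice_url
    ET.SubElement(item, "description").text = message
    # ElementTree escapes text content, so the message may contain < and &.
    return ET.tostring(rss, encoding="unicode")
```

The server would send this with a 200 only to clients that have already ignored several 429s, so well-behaved readers never see it.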

donatj 10 months ago

On the flip side, what percent of RSS feed generators actually support conditional requests? I've written many over the last twenty years and I can tell you plainly, none of the ones I wrote have.

I never even considered the option or necessity. It's easy and cheap just to send everything.

I guess static generators with an Apache-style web server probably do, but I can't imagine any dynamic generators bother to try to save the small handful of bytes.

aendruk 10 months ago

For another perspective, I can offer the data point that the one dynamic feed generator I’ve written supports both If-Modified-Since and If-None-Match, and that I considered that to be an obvious requirement from the beginning.
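For anyone wondering how much work that is in a dynamic generator, here is a rough sketch and not the code from the generator mentioned above. It assumes the request headers arrive as a plain dict with canonical header names, that `last_modified` is a timezone-aware UTC datetime, and it splits `If-None-Match` naively on commas; `conditional_response` is a made-up name.

```python
from email.utils import format_datetime, parsedate_to_datetime

def _strip_weak(tag):
    """Drop a leading W/ so tags can be compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag

def conditional_response(request_headers, etag, last_modified, body):
    """Return (status, headers, body): 304 if the client's copy is current.

    If-None-Match takes precedence over If-Modified-Since when both are sent.
    """
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [t.strip() for t in if_none_match.split(",")]
        if "*" in candidates or _strip_weak(etag) in map(_strip_weak, candidates):
            return 304, headers, b""
        return 200, headers, body
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header
        if since is not None and since.tzinfo is not None:
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""
    return 200, headers, body
```

The ETag can be as simple as a quoted hash of the rendered feed, computed once per publish rather than per request.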

ruuda 10 months ago

I have a blog where I post a few posts per year. [1] /feed.xml is served with an Expires header of 24 hours. I wrote a tool that allows me to query the webserver logs using SQLite [2]. Over the past 90 days, these are the top 10 requesters grouped by ip address (remote_addr column redacted here):

    requests_per_day  user_agent
    283               Reeder/5050001 CFNetwork/1568.300.101 Darwin/24.2.0
    274               CommaFeed/4.4.0 (https://github.com/Athou/commafeed)
    127               Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
    52                NetNewsWire (RSS Reader; https://netnewswire.com/)
    47                Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
    47                Refeed Reader/v1 (+https://www.refeed.dev/)
    46                Selfoss/2.18 (SimplePie/1.5.1; +https://selfoss.aditu.de)
    41                Reeder/5040601 CFNetwork/1568.100.1.1.1 Darwin/24.0.0
    39                Tiny Tiny RSS/23.04 (Unsupported) (https://tt-rss.org/)
    34                FreshRSS/1.24.3 (Linux; https://freshrss.org)

Reeder is loading the feed every 5 minutes, and in the vast majority of cases it’s getting a 301 response because it tries to access the http version that redirects to https. At least it has state and it gets 304 Not Modified in the remaining cases.

If I order by body bytes served rather than number of requests (and group by remote_addr again), these are the worst consumers:

    body_megabytes_per_year  user_agent
    149.75943975             Refeed Reader/v1 (+https://www.refeed.dev/)
    95.90771025              Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
    75.00080025              rss-parser
    73.023702                Tiny Tiny RSS/24.09-0163884ef (Unsupported) (https://tt-rss.org/)
    38.402385                Tiny Tiny RSS/24.11-42ebdb02 (https://tt-rss.org/)
    37.984539                Selfoss/2.20-cf74581 (+https://selfoss.aditu.de)
    30.3982965               NetNewsWire (RSS Reader; https://netnewswire.com/)
    28.18013325              Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
    26.330142                Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
    24.838461                Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36

The top consumer, Refeed, is responsible for about 2.25% of all egress of my webserver. (Counting only body bytes, not http overhead.)
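For anyone who wants to run the same kind of report, a sketch of the grouping query using Python's built-in `sqlite3`. The table name `logs` and the columns `request_uri` and `body_bytes_sent` are assumptions modeled on common web server log fields, not the actual schema of the tool linked below; only `remote_addr` and the user agent come from the description above.

```python
import sqlite3

def top_requesters(conn, days, limit=10):
    """Top feed clients by requests per day, grouped by address and user agent."""
    return conn.execute(
        """
        SELECT remote_addr,
               user_agent,
               COUNT(*) * 1.0 / ? AS requests_per_day,
               SUM(body_bytes_sent) AS body_bytes
        FROM logs
        WHERE request_uri = '/feed.xml'
        GROUP BY remote_addr, user_agent
        ORDER BY requests_per_day DESC
        LIMIT ?
        """,
        (days, limit),
    ).fetchall()
```

Swapping the `ORDER BY` to `body_bytes DESC` gives the second ranking, which tends to surface the clients that never send conditional requests.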

[1]: https://ruudvanasseldonk.com/writing
[2]: https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...

6510 10 months ago

I ban the feed for 24 hours if it doesn't work.

I also designed 2 new formats that no one (including myself) has ever implemented.

https://go-here.nl/ess-and-nno

enjoy

internet2000 10 months ago

Does anyone know if FreshRSS behaves properly here?

reocha 10 months ago

Earlier article with some info on freshrss: https://rachelbythebay.com/w/2024/10/25/fs/

Forge36 10 months ago

I couldn't find the tester. Thankfully the client I use was tested... and it behaves poorly. Thankfully Emacs has a client I can switch to!

mixmastamyk 10 months ago

I have a few feeds configured into Thunderbird but wasn’t reading them very often, so I “disabled” them to load manually. Despite this it tries to contact the sites often and, when not able to (firewall) goes into a frenzy of trying to contact them. All this despite being disabled.

Disappointing combined with the various update sites it tries to contact every startup, which is completely unnecessary as well. Couple of times a week should be the maximum rate.

aaron695 10 months ago

[dead]

PittleyDunkin 10 months ago

[flagged]

mdp2021 10 months ago

> to waste data
Useless traffic.
RSS has a structural problem: you download a "window" of data (the feeds between some timestamp in day A and some other in day B), which may or may not contain new data. You will easily lose some and receive a lot of duplicates.
It would have been better to be able to first check if new data is available. (And even better to only download the new ones, and all of them - "everything after YY-MM-DD hh:mm:ss")
- graemep 10 months ago
  
  > It would have been better to be able to first check if new data is available.
  Or, as the article says, to actually check when you are able. All she is asking for is that readers make conditional requests, at reasonable intervals, and respect 429s.
- masklinn 10 months ago
  
  > It would have been better to be able to first check if new data is available.
  That... is what conditional requests do...
  - mdp2021 10 months ago
    
    Which means: do not place extra mechanisms in the RSS when you can do the same with HTTP.
    But then there is the problem of RSS clients that may not work properly (may not use that trick), without the user knowing it, and that of RSS servers that do not work correctly with "If-Modified-Since" (as noted in a nearby post).
    Edit: but if the burden were placed on RSS instead, we could have had the trick of "I already have ...#10050, #10051 and #10052: just send me from #10053 on" - the feeds XML that is updated recently will contain more items than just the new ones. Similarly for the gaps: "the XML would contain from #10050 on, but there had been a surge of publications and now I am missing #10045 to #10049..."
    
    notpushkin 10 months ago
    
    > but if the burden were placed on RSS instead
    That would be nice to have, but it's less realistic.
    Some clients already speak proper HTTP, and others can too, with little modification. For modified RSS like this, you have to make a standard first, then push both servers and clients to use it.
    
    masklinn 10 months ago
    
    It also requires a "smart" server, whereas you can serve your atom or rss feed statically with a standard HTTP server handling caching and conditional requests essentially free.
    
    notpushkin 10 months ago
    
    Yeah, although dumb servers could just return last N posts as usual. It's one more case to handle on the client, though. (The alternative is breaking compatibility with older feeds, which is way worse IMO.)
- paulryanrogers 10 months ago
  
  Such as via HEAD and Etag?
  - masklinn 10 months ago
    
    HEAD is counter-productive
    Just GET with if-none-match and if-modified-since (based on the etag and last-modified you got in the previous response), and the server will return a 304 not modified with no content if nothing has changed, and the content otherwise.
    With a HEAD you'd get the same result except now you'd need to ignore the cache headers from the HEAD response in order to fetch the content in a second request.
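In Python's standard library that comes out roughly as below. This is a sketch: `conditional_get` and its return shape are my own invention, and the optional `opener` parameter only exists so a custom opener can be passed in. Note that `urllib` reports a 304 as an `HTTPError`, so it is caught and turned into a normal result.

```python
import urllib.error
import urllib.request

def conditional_get(url, etag=None, last_modified=None, opener=None):
    """Fetch `url`, sending the validators saved from the previous response.

    Returns (status, body, etag, last_modified). On 304 the body is None
    and the caller keeps using its stored copy.
    """
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    open_url = (opener or urllib.request.build_opener()).open
    try:
        with open_url(request) as response:
            return (
                response.status,
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already have.
            return 304, None, etag, last_modified
        raise
```

The reader stores the returned `etag` and `last_modified` next to the cached feed and passes them back on the next poll; the first poll simply omits them.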
    
    OutOfHere 10 months ago
    
    You speak of this confidently, but there are sites that return a not modified response when in truth they have modified the feed. It doesn't happen often, but I have seen it happen with more than one site. It is why I take the not modified response with a grain of salt.
Dilettante_ 10 months ago

It's wasted operations on the user's end, the server's and the network infrastructure. None of those are free.
guerrilla 10 months ago

> What does it mean to waste data, something notoriously free to copy?
She didn't say waste data. It's a waste of many resources though, including energy and bandwidth, even processing power. Depends on what level of abstraction you want to look at it from, but it's definitely a waste of something.
It is possible to waste data though, but only by deleting it. It takes energy to collect and store data. This isn't relevant to the case though.
- nuancebydefault 10 months ago
  
  "Waste data" is not to be taken literally. If I say you wasted my time by writing nonsense, would you then reply that time is time and hence can be neither created nor consumed?
Tyr42 10 months ago

You do pay for your link, right?
- rzzzt 10 months ago
  
  What am I actually paying for? Is it "the entire speedometer", i.e. 24/7 100% utilization of the advertised upload/download capability of the link? Why not?
  - masklinn 10 months ago
    
    > Why not?
    Because that costs way more than you're paying for your connection. The business model is predicated upon oversubscription of the ISP's network because near enough nobody does that.
klysm 10 months ago

Network bandwidth

euroderf 10 months ago

which => that

thaumasiotes 10 months ago

They're exactly equivalent. What are you hoping to correct?
- euroderf 10 months ago
  
  So I guess I get to be the grumpy grammarian. Whee for me.
  I'm making an observation. I try to spot language change. And "which" has been popping up in stupid places, at the expense of "that".
  There is a real difference between "which" and "that", whether Joe Sixpack gives a damn or not. And nowadays a lot of people seem to be inappropriately overusing "which", and it seems to be driven by UK speakers.
  Here's a sample explanation: https://www.grammarly.com/blog/grammar/which-vs-that
  I should be making a collection of the most OTT examples.
  - thaumasiotes 10 months ago
    
    > Here's a sample explanation: https://www.grammarly.com/blog/grammar/which-vs-that
    That might be a sample of what you're thinking, but it isn't a correct explanation of anything. It's just some random mythmaking, which is par for the course from "grammar advice" websites.
    Which and that are not distinguished in the manner that page wishes they were. That cannot be used (in the modern language) to introduce a nonrestrictive relative clause. Which can be used to introduce a restrictive or nonrestrictive relative clause.
    > There is a real difference between "which" and "that", whether Joe Sixpack gives a damn or not.
    There are real differences, but there are no differences as to the usage you flagged. The bigger difference is that which, being a pronoun, is part of a gender distinction (with who) between persons and nonpersons, whereas that, not being a pronoun, has no such distinction.
    You might note, hopefully, that if feed readers were people, "feed readers who don't take 'no' for an answer" would be completely standard grammar. The same is also obviously true of "feed readers which don't take 'no' for an answer" when the feed readers aren't people.
    CGEL [Cambridge Grammar of the English Language; both authors are from the UK] doesn't even bother to mention this myth, but the first example of a relative clause that it does give is He'll be glad to take the toys which you don't want. (Chapter 12 §2.1, example 1.i; page 1034)
    If you want to play a grammarian on the internet, wouldn't it be better to know some grammar first?
    > I try to spot language change.
    > And nowadays a lot of people seem to be inappropriately overusing "which"
    You're doing a remarkably terrible job; this change took place seven hundred years ago. Here's The Merriam-Webster Dictionary of English Usage:
    > According to McKnight 1928 that was prevalent in early Middle English, which began to be used as a relative pronoun in the 14th century, and who and whom in the 15th.
    > [...] By the early 17th century, which and that were being used pretty much interchangeably. Evans 1957 quotes this passage from the Authorized (King James) Version (1611) of the Bible:
    >> Render therefore unto Caesar the things which are Caesar's; and unto God the things that are God's.
    > During the later 17th century, Evans tells us, that fell into disuse, at least in literary English. It went into such an eclipse that its reappearance in the early 18th century was noticed and satirized by Joseph Addison in The Spectator (30 May 1711) in a piece entitled "Humble Petition of Who and Which against the upstart Jack Sprat That."
    (entry for that [1]; page 894 of the 1993 printing.)
- euroderf 10 months ago
  
  They're obviously not.
  It's a Britticism (AFAICT) making inroads.
  - sangnoir 10 months ago
    
    Oh no, not the British influencing the English language! I can't be arsed* about which vs. that when "on accident" has become semi-accepted (as the opposite of "on purpose"). Yuck.
  - thaumasiotes 10 months ago
    
    What do you hope to accomplish by making random false statements?
    
    euroderf 10 months ago
    
    Wot
  - EdwardDiego 10 months ago
    
    Britticisms? In English? The hell you say!
    
    euroderf 10 months ago
    
    Well, we know they cannot spell.
  - wetpaws 10 months ago
    
    [dead]

kelsey98765431 10 months ago

if you have to 429 people for an rss feed the problem is you

quectophoton 10 months ago

I think it's an acceptable response. Not only is there no SLA, but people are free to not provide a service to misbehaving user agents. It's like rejecting connections from Tor.
If anything, a 429 is a nice heads up. It could have been worse; she could have redirected those requests to a separate URL with an... unpleasant content, like a certain domain that redirects to I-don't-know-what whenever they detect the Referer header is from HN.
- redleader55 10 months ago
  
  As interesting as that site is, and as much as I sympathise with the author's plight, that site's behavior is so anti-me that I'm going to ignore it whenever/wherever it pops up. I'm not trolling the author, I'm not calling them names or anything, I was just interested in the technical stuff. I wish them good luck.
  - sangnoir 10 months ago
    
    I think that's the whole point - that author won't begrudge you for not visiting their site. They detest HN ideologically, so losing HNer traffic won't ruin their day.
  - EdwardDiego 10 months ago
    
    Rachel isn't serving adverts on her blog. She's not trying to sell you on her consulting business.
    So my question to you is... ...why is it so anti-you, and why should it be different?
    
    redleader55 10 months ago
    
    I don't mean Rachel. Please read my parent a few times - that's what I was answering to. You might not know what my parent is referring to, but it's a different blog.
noident 10 months ago

There's a particular type of person that scours their HTTP logs and makes up rules that block 90% of feed readers using the default poll interval. If I stick your RSS feed into Miniflux and I get 429'd, I just stop reading your blog. Learn2cache. I'm talking to you, Cheapskate's Guide.
- Kudos 10 months ago
  
  This site would not 429 current Miniflux, since it makes conditional requests. She has a previous post outlining the cache-respecting behaviour of many common feed readers.
  - Kwpolska 10 months ago
    
    It could 429 it for conditional requests as well:
    > Unconditional requests: at most once per 24 hour period.
    > Conditional requests: at most once per 60 minute period.
    (Source: calling `curl hxxps://rachelbythebay[.]com/w/atom.xml` twice)
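For illustration only, the two quoted limits could be enforced along these lines. This is a guess at the shape of such a policy, not the site's actual implementation: the class name, the per-client key, and the choice to keep a single "last allowed" timestamp per client are all assumptions.

```python
import time

UNCONDITIONAL_INTERVAL = 24 * 60 * 60  # seconds; "at most once per 24 hour period"
CONDITIONAL_INTERVAL = 60 * 60  # seconds; "at most once per 60 minute period"


class FeedRateLimiter:
    """Hypothetical limiter: one 'last allowed' timestamp per client."""

    def __init__(self):
        self._last_allowed = {}  # client key -> time of the last allowed request

    def check(self, client, headers, now=None):
        """Return True to serve the request, False to answer 429."""
        now = time.time() if now is None else now
        names = {name.lower() for name in headers}
        # A request counts as conditional if it carries either validator header.
        conditional = "if-none-match" in names or "if-modified-since" in names
        interval = CONDITIONAL_INTERVAL if conditional else UNCONDITIONAL_INTERVAL
        last = self._last_allowed.get(client)
        if last is not None and now - last < interval:
            return False  # too soon for this kind of request
        self._last_allowed[client] = now
        return True
```

Under this sketch a reader polling hourly with validators is always served, while one polling hourly without them gets a 429 on every poll until 24 hours have passed since its last served request.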
- ncallaway 10 months ago
  
  > Learn2cache
  I find this a particularly amusing comment, given that the main complaint is about feed readers not sending conditional requests.
  Learn2cache indeed, feed reader authors.
dxdm 10 months ago

Nobody owes these people and their feed readers a 200 whenever they want one.
silvestrov 10 months ago

not when the client sends unconditional requests, i.e. requests missing the If-Modified-Since and If-None-Match headers.
All feed readers/clients should cache responses when sending multiple requests the same day.
ramses0 10 months ago

If you don't stop at red lights, the problem is other people. /s