People don't want to have to customize refresh rates on a per-feed basis. Perhaps the RSS or Atom standards need to support importing the recommended refresh rate automatically.
Yes that's right. Most blogs that are popular enough to have this problem send you the last 10 post titles and links or something. THAT is why people refresh every hour, so they don't miss out.
If you understand what rate limiting is, you block them for a period of time. Let's stop being pedantic here.
72 requests per day is nothing and acting like it's mayhem is a bit silly. And for a lot of people would result in them getting possible news slower. Sure OP won't publish that often but their rate limiting is an edge case and should be treated as such. If they're blocked until the next day and nothing gets updated then the only person harmed is OP for being overly bothered by their HTTP logs.
Sure it's their server and they can do whatever they want. But all this does is hurt the people trying to reach their blog.
72 requests per day _per user with a naive feed reader_. This is a small personal blog with no ads that OP is self-hosting on her own hardware, so blocking all this junk traffic is probably saving her money. Plus she's calling attention to how feed readers can be improved!
Even if they had 1000 feed readers which would be a massive amount for a blog, if you can't scale that cheaply, that's on you.
As I pointed out, her blog and rate limiting are an extreme edge case, it would be silly for anyone to put effort into changing their feed reader for a single small blog. It's bad product management.
Of course she can. It's static. She doesn't want to, and I understand. She's signaling to clients, with a standard response, "I think you have already read this; at least ask me first when it last changed".
If every user is collecting 36mb a day like in the story here, your droplet wouldn’t even be capable of serving 500 users a month without hitting your bandwidth limit. With their current rates, your one million requests would cost you around 10 million USD.
That's a ridiculously large quantity of data to serve for a seldom-updated blog just because the client doesn't want to (or doesn't know how to, or doesn't think to) implement an easy and old HTTP method.
Imagine the petabytes of data transferred through the internet saved if a couple RSS clients added that method.
Yes, it's about enforcing their preference on how others should interact with OP's published site feed, on principle. Which is always an uphill battle.
Sounds like you don't know how to scale for cheap.
And since I've run integrations that connected over 500 companies, I know what a rogue client actually looks like, and at 72 requests per day I wouldn't even notice.
But it's not a "light" protocol when you're serving 36MB per day, when 500KB would suffice. RSS/Atom is light weight, if clients play by the rules. This could also have been a news website, imagine how much traffic would be dedicated to pointless transfers of unchanged data. Traffic isn't free.
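For anyone wondering where the 36MB comes from, here is the arithmetic as a rough sketch. The 502 KB size and the 20 minute interval are the figures quoted from the post; the function name is made up for illustration, and the conditional case assumes at most one changed fetch per day and treats a 304 as roughly zero bytes of body.

```python
def daily_transfer_kb(feed_kb, poll_minutes, conditional=False):
    """KB pulled per day by one reader polling at a fixed interval."""
    polls = 24 * 60 // poll_minutes  # 72 polls per day at 20 minutes
    if conditional:
        # Assumption: the feed changes at most once a day, and every other
        # poll is answered with a bodyless 304.
        return feed_kb
    return feed_kb * polls


print(daily_transfer_kb(502, 20) / 1000)             # MB per day, unconditional
print(daily_transfer_kb(502, 20, conditional=True))  # KB per day, conditional
```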
A similar problem arises from the increase in AI scraper activities. Talking to other SREs, the problem seems pretty widespread. AI companies will just hoover up data, but revisit so frequently and aggressively that it's starting to affect the transit feeds for popular websites. Frequently user-agents wouldn't be set to something unique, or are deliberately hidden, and traffic originates from AWS, making it hard to target individual bad actors. Fair enough that you're scraping websites, that's part of the game when you're online, but when your industry starts to affect transit feeds, then we need to talk compensation.
That’s a bit disingenuous. 429s aren’t “blocking”, they’re telling the requester that they’ve made too many requests and to try again later (with a value in the header). I assume the author configured this because they know how often the site is going to change typically. That the web server eventually stops responding if the client ignores requests isn’t that surprising, but I doubt it was configured directly too.
Semantics. 429 is an error code. Rate limiting...blocking...too many requests...ignoring...call it whatever you like but it amounts to the same, namely the server isn't serving the requested content.
Like how "unlimited traffic, but will slow down to 1bps if you use more than 100gb in a month" is technically "unlimited traffic".
But for all intents and purposes, it's limited. And 429 are blocking. They include a hint towards the reason why you are blocked and when the block might expire (retry-after doesn't promise that you'll be successful if you wait), but besides that, what's the difference compared to 403?
I would disagree. Blocking typically implies permanence (without more action by the blockee), and since 429 isn’t usually a permanent error code I wouldn’t call it blocking. Same applies with 403, it’s only permanent if the requester doesn’t authorize correctly.
I would say it's disingenuous to claim sending HTTP status and body that is not expected for a period of time is not blocking them for that period of time. You can be pedantic and claim "but they can still access the server" but in reality that client is blocked for a period of time.
In that case, I should be irate that the AWS API blocks me many times per day. Run `aws cli service some-paginated-thing` and see how many retries you get during normal, routine operation.
But I’m not, because they’re not blocking me. They’re asking my client to slow down. Neither AWS nor Rachel’s blog owes me unlimited requests per unit time, and neither has “blocked” me when I violate their policies.
They literally do block you for a period of time until you are out of the rate limit. That is how rate limits work. That's why you don't get to access the resource you requested, because their system literally blocked you from doing so.
See when you're trying to be pedantic and all about semantics, you should make sure you've crossed your Ts and dotted your Is.
> Block – AWS WAF blocks the request and applies any custom blocking behavior that you've defined.
> Rate limiting blocks users, bots, or applications that are over-using or abusing a web property. Rate limiting can stop certain kinds of bot attacks.
Every documentation on rate limit will include the word block. Because that's what you do, you allow access for a specific amount of requests and then block those that go over.
I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific. They are effectively a part of every API automatically without considerations whether they are needed.
People often implement error handling using constructs like regexp matching on status codes, while with domain-specified errors it would be obvious what exactly is the range of possible errors.
Moreover, when people do implement domain errors, they just have to write more code to handle two nested levels of branching.
> I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific.
Perhaps put the app-specific part in the body of the reply. In the RFC they give a human specific reply to (presumably) be displayed in the browser:
HTTP/1.1 429 Too Many Requests
Content-Type: text/html
Retry-After: 3600
<html>
<head>
<title>Too Many Requests</title>
</head>
<body>
<h1>Too Many Requests</h1>
<p>I only allow 50 requests per hour to this Web site per
logged in user. Try again soon.</p>
</body>
</html>
> because they are intended to be consumed by apps, but are not app-specific
Well, good luck designing any standard app-independent protocol that works and doesn't do that.
And yes, you must handle two nested levels of branching. That's how it works.
The only improvement possible to make it clearer is having codes for API specific errors... what 400 and 500 aren't exactly. But then, that doesn't gain you much.
A colleague who should’ve known better argued that a 404 response to an API call was confusing because we were, in fact, successfully returning a response to the client. We had a long talk about that afterward.
No, it is pretty confusing: the difference between 404 from hitting an endpoint that the server doesn't serve (because you forgot to expose this endpoint, oops!) and a 404 that means "we've successfully performed the search in our DB for the business entity you've requested and guarantee you that it does not exist" is rather difficult to tell programmatically.
It’s an RSS feed. In that case, wait until the specified time and try again and any missed article will appear then. If it is constantly crashing so articles never get loaded, fix that.
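"Wait until the specified time" mostly comes down to reading Retry-After, which can be either a number of seconds or an HTTP date. A minimal sketch of that parsing, assuming the reader already has the raw header value in hand; `retry_after_seconds` is a hypothetical helper name, not something from any particular feed reader.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Seconds to wait for a Retry-After value, or None if it cannot be parsed."""
    value = value.strip()
    if value.isdigit():
        # delta-seconds form, e.g. "3600"
        return int(value)
    try:
        # HTTP-date form, e.g. "Wed, 01 Jan 2025 01:00:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without a usable zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0, int((when - now).total_seconds()))
```

A reader would store "do not poll before now + N seconds" next to the feed and skip it until then, rather than sleeping.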
> If you are an rss-reader dev then you can set up a caching layer of your own.
But are RSS reader devs willing to jump through such hoops?
I would claim that writing a (simple) RSS reader (using a programming language that provides suitable libraries) is something that would be rather easy for me, but setting up a caching layer would (because I have less knowledge about the latter topic) take a lot more research from my side concerning how to do it.
Sure, I have done such a thing myself and it was very simple. Let's say you do http_get(rss_address). Create a function http_cached_get, that looks for a recent cached response, and if none exists delegates to http_get and saves the response. In python this is like 10 lines.
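Roughly what that looks like, as a sketch rather than anyone's actual reader code. `http_get` and `http_cached_get` are the names from the comment above; the in-memory dict, the one hour `max_age` default and the injectable `fetch`/`now` parameters are assumptions added to keep it small and testable.

```python
import time
import urllib.request

_cache = {}  # url -> (fetched_at, body); in-memory only, lost on restart


def http_get(url):
    """Plain fetch; stands in for whatever the reader already uses."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def http_cached_get(url, max_age=3600, fetch=http_get, now=time.time):
    """Return the cached body if it is younger than max_age seconds, else refetch."""
    hit = _cache.get(url)
    if hit is not None and now() - hit[0] < max_age:
        return hit[1]
    body = fetch(url)
    _cache[url] = (now(), body)
    return body
```

This only stops the reader from refetching too often on its own schedule; it does not replace conditional requests, since every expired entry still triggers a full download.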
My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr which absolves me of the responsibility of being on the other side of Rachel's problem.
With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.
When a blog gets posted Superfeedr hits an AWS lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is Superfeedr costs 10 cents a feed per month which is a good deal for an active feed such as comments from Hacker News or articles from The Guardian but is not affordable for subscribing to 2000+ indie blogs which YOShInOn could handle just fine.
I might yet write my own RSS head end, but there is something to say for protocols like ActivityPub and AT Protocol.
RSS is pretty light. Even if you say it's too much to be re-sending, you could remove the content from the RSS feed (so they need to click through to read it), which would shrink the feed size massively. Alternatively, remove old posts. Or do both.
Hopefully you don't have some expensive code generating the feed on the fly, so processing overhead is negligible. But if it's not, cache the result and reset the cache every time you post.
Surely this is easier than spending the effort and emotional bandwidth to care about this issue?
I might be wrong here, but this feels more emotionally driven ("someone is wrong on the internet") than practical.
I always click through regardless, because the rss text is probably missing formatting and images. I'll never be sure I'm getting a proper copy of the article unless I click through anyway.
I am stupid, why not just return an HTML document explaining the issue, when there is such an incorrect second request in 20 minutes, then blocking that IP for 24 hours? The feed reader software author has to react, otherwise its users will complain to him, no?
On the flip side, what percent of RSS feed generators actually support conditional requests? I've written many over the last twenty years and I can tell you plainly, none of the ones I wrote have.
I never even considered the option or necessity. It's easy and cheap just to send everything.
I guess static generators with an Apache-style web server probably do, but I can't imagine any dynamic generators bother to try to save the small handful of bytes.
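For a dynamic generator the server side is not much code either. A simplified sketch, assuming the generator already has the rendered feed bytes and a last-changed timestamp; `respond` and `make_validators` are hypothetical names, header lookup is case-sensitive here, and the ETag comparison ignores weak (`W/`) validators.

```python
import hashlib
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def make_validators(body, mtime):
    """Derive an ETag from the content and a Last-Modified date from mtime."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return etag, formatdate(mtime, usegmt=True)


def respond(request_headers, body, mtime):
    """Return (status, headers, body) for a feed request."""
    etag, last_modified = make_validators(body, mtime)
    headers = {"ETag": etag, "Last-Modified": last_modified}
    inm = request_headers.get("If-None-Match")
    ims = request_headers.get("If-Modified-Since")
    not_modified = False
    if inm is not None:
        # If-None-Match wins over If-Modified-Since when both are sent.
        candidates = [tag.strip() for tag in inm.split(",")]
        not_modified = "*" in candidates or etag in candidates
    elif ims is not None:
        try:
            since = parsedate_to_datetime(ims)
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one second resolution, so compare whole seconds.
            not_modified = int(mtime) <= since.timestamp()
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: just send the feed
    if not_modified:
        return 304, headers, b""
    return 200, headers, body
```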
I have a blog where I post a few posts per year. [1] /feed.xml is served with an Expires header of 24 hours. I wrote a tool that allows me to query the webserver logs using SQLite [2]. Over the past 90 days, these are the top 10 requesters grouped by ip address (remote_addr column redacted here):
Reeder is loading the feed every 5 minutes, and in the vast majority of cases it’s getting a 301 response because it tries to access the http version that redirects to https. At least it has state and it gets 304 Not Modified in the remaining cases.
If I order by body bytes served rather than number of requests (and group by remote_addr again), these are the worst consumers:
I have a few feeds configured into Thunderbird but wasn’t reading them very often, so I “disabled” them to load manually. Despite this it tries to contact the sites often and, when not able to (firewall) goes into a frenzy of trying to contact them. All this despite being disabled.
Disappointing combined with the various update sites it tries to contact every startup, which is completely unnecessary as well. Couple of times a week should be the maximum rate.
RSS has a structural problem: you download a "window" of data (the feeds between some timestamp in day A and some other in day B), which may or may not contain new data. You will easily lose some and receive a lot of duplicates.
It would have been better to be able to first check if new data is available. (And even better to only download the new ones, and all of them - "everything after YY-MM-DD hh:mm:ss")
> It would have been better to be able to first check if new data is available.
Or, as the article says, to actually check when you are able. All she is asking for is that readers make conditional requests, at reasonable intervals, and respect 429s.
Which means: do not place extra mechanisms in the RSS when you can do the same with HTTP.
But then there is the problem of RSS clients that may not work properly (may not use that trick), without the user knowing it, and that of RSS servers that do not work correctly with "If-Modified-Since" (as noted in a nearby post).
Edit: but if the burden were placed on RSS instead, we could have had the trick of "I already have ...#10050, #10051 and #10052: just send me from #10053 on" - the feeds XML that is updated recently will contain more items than just the new ones. Similarly for the gaps: "the XML would contain from #10050 on, but there had been a surge of publications and now I am missing #10045 to #10049..."
Just GET with if-none-match and if-modified-since (based on the etag and last-modified you got in the previous response), and the server will return a 304 not modified with no content if nothing has changed, and the content otherwise.
With a HEAD you'd get the same result except now you'd need to ignore the cache headers from the HEAD response in order to fetch the content in a second request.
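A sketch of the GET variant with Python's standard library, assuming the reader stored the ETag and Last-Modified values from its previous response. `fetch_feed` and `conditional_headers` are hypothetical names; note that urllib reports a 304 by raising HTTPError, so it is caught and turned into "nothing new".

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved after the last fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server says 304."""
    req = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing changed: keep the old validators for the next poll.
            return None, etag, last_modified
        raise
```

The first poll passes no validators and gets the full feed; every later poll passes back whatever the previous call returned.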
You speak of this confidently, but there are sites that return a not modified response when in truth they have modified the feed. It doesn't happen often, but I have seen it happen with more than one site. It is why I take the not modified response with a grain of salt.
> What does it mean to waste data, something notoriously free to copy?
She didn't say waste data. It's a waste of many resources though, including energy and bandwidth, even processing power. Depends on what level of abstraction you want to look at it from, but it's definitely a waste of something.
It is possible to waste data though, but only by deleting it. It takes energy to collect and store data. This isn't relevant to the case though.
Waste data is not to be taken literally. If I say you wasted my time by writing nonsense, would you then reply that time is time and hence can be neither created nor consumed?
What am I actually paying for? Is it "the entire speedometer" ie. 24/7 100% utilization of the advertised upload/download capability of the link? Why not?
Because that costs way more than you're paying for your connection. The business model is predicated upon oversubscription of the ISP's network because near enough nobody does that.
I think it's an acceptable response. Not only is there no SLA, but people are free to not provide a service to misbehaving user agents. It's like rejecting connections from Tor.
If anything, a 429 is a nice heads up. It could have been worse; she could have redirected those requests to a separate URL with an... unpleasant content, like a certain domain that redirects to I-don't-know-what whenever they detect the Referer header is from HN.
As interesting as that site is, and as much as I sympathise with the author's plight, that site's behavior is so anti-me that I'm going to ignore it whenever/wherever it pops up. I'm not trolling the author, I'm not calling them names or anything, I was just interested in the technical stuff. I wish them good luck.
There's a particular type of person that scours their HTTP logs and makes up rules that block 90% of feed readers using the default poll interval. If I stick your RSS feed into Miniflux and I get 429'd, I just stop reading your blog. Learn2cache. I'm talking to you, Cheapskate's Guide.
This site would not 429 current Miniflux, since it makes conditional requests. She has a previous post outlining cache respecting behaviour of many common feed readers.
A friend of mine co-runs a semi-popular semi-niche news site (for more than a decade now), and complains that traffic recently rose because of bots masquerading as humans.
How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): it shows bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (which is now less than 50%) as they started to fall from the front page (feeding the vicious circle). Presumably, Google does not know or care it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.
Or, perhaps, Google spots the same anomalies that my friend (an old school sysadmin who pays attention to logs) did, such as the increase of traffic along with never-before-seen popularity among iPhone users (who are so tech savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I’m not going to list all telltale signs as the crowd here is too hyped on LLMs (which is our going theory so far, it is very timely), but my friend hopes Google learns them quickly.
These newcomers usually fake UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do sign themselves as bots in UA), ignore robots.txt and load many pages very quickly.
I would assume bot traffic increase would apply to feeds, since they are of as much use for LLM training purposes.
My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website with actual original content (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection because of the domination of these creepy-crawlies.
Edit: Google already downranked them, not threatened to downrank. Also, traffic rose but did not skyrocket, but relative amount of bot traffic skyrocketed. (Presumably without downranking the traffic would actually skyrocket.)
Are you saying that Google down-ranked them in search engine rankings for user behaviour in AdWords? Isn't that an abuse of monopoly? It still surprises me a little bit.
It's not that hard to dominate bots. I do it for fun, I do it for profit. Block datacenters. Run bot motels. Poison them. Lie to them. Make them have really really bad luck. Change the cost equation so that it costs them more than it costs you.
You're thinking of it wrong, the seeds of the thinking error are here: "I wonder how soon it becomes actually infeasible to operate a website with actual original content".
Bots want original content, no? So what's the problem with giving it to them? But that's the issue, isn't it? Clearly, contextually, what you should be saying is "I wonder how soon it becomes actually infeasible to operate a website for actual organic users" or something like that. But phrased that way, I'm not sure a CDN helps (I'm not sure they don't suffer false positives which interfere with organic traffic when they intermediate, more security theater because hangings and executions look good, look at the numbers of enemy dead).
Take measures that any damn fool (or at least your desired audience) can recognize.
Reading for comprehension, I think Rachel understands this.
what is a bot motel and how do you run one?
Easy way is to implement e.g. a 4xx handler which serves content with links which generate further 4xx errors and rewrite the status code to something like 200 when sent to the requester. Load the garbage pages up with... garbage.
Thanks, and you can make money with this? Sorry I'm a total noob in this area.
The idea is that bots are inflexible to deviations from accepted norms and can't actually "see" rendered browser content. So your generic 404 and 403 error pages return a 200 status instead, with invisible links to other non-accessible pages. The bots will follow the links but real users will not, trapping the bots in a kind of isolated labyrinth of recursive links (the urls should be slightly different though). It's basically how a lobster trap works if you want a visual metaphor.
The important part here is to do this chaotically. The worst sites to scrape are buggy ones. You are, in essence, deliberately following bad practices in a way real users wouldn't notice but would still influence bots.
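To make the mechanics concrete, here is a toy sketch of the idea using Python's built-in http.server. Everything in it is an assumption for illustration: `REAL_PAGES`, `trap_page` and `MotelHandler` are made-up names, the links are hidden with inline CSS, and the junk is derived deterministically from the path, so it is tidier than the "chaotic" version described above.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical real content; anything not listed here would normally be a 404.
REAL_PAGES = {"/": b"<html><body><h1>Home</h1></body></html>"}


def trap_page(path, links=3):
    """Build a junk page whose hidden links point at more nonexistent paths."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    # Each link target is a different slice of the hash, so URLs keep changing.
    chunks = [seed[i * 8:(i + 1) * 8] for i in range(links)]
    anchors = "".join(
        f'<a href="/{chunk}" style="display:none">more</a>' for chunk in chunks
    )
    return f"<html><body><p>{seed}</p>{anchors}</body></html>".encode()


class MotelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = REAL_PAGES.get(self.path)
        if body is None:
            # What would have been a 404 becomes a 200 with more bait.
            body = trap_page(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the toy server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), MotelHandler).serve_forever()
```

Calling `serve()` starts it locally. A real deployment would also want to keep well-behaved crawlers out of the trap, for example by disallowing the junk paths in robots.txt.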
QQBrowser users from Dallas are more likely to be Chinese using a VPN than bots, I would guess.
I'm seeing some address ranges in the US clearly serving what must be VPN traffic from Asia, and I'm also seeing an uptick in Tor traffic looking for feeds as well as WP infra.
That much is clear, yeah. The VPN they use may not be a service advertised to public and featured in lists, however.
Some of the new traffic did come directly from Tencent data center IP ranges and reportedly those bots signed themselves in UA. I can’t say whether they respect robots.txt because I am told their ranges were banned along with robots.txt tightening. However, US IP bots that remain unblocked and fake UA naturally ignore robot rules.
> The VPN they use may not be a service advertised to public and featured in lists, however.
Well, of course not, since the service is illegal.
At my company we have seen a massive increase in bot traffic since LLMs have become mainstream. Blocking known OpenAI and Anthropic crawlers has decreased traffic somewhat so I agree with your theory.
Feed readers should be sending the If-Modified-Since header and web sites should properly recognize it and send the 304 Not Modified response. This isn’t new tech.
That is exactly what the article says.
The article implies this but doesn't actually say it. It's nice to have the extra detail.
While it might be nice if the article spelled out the header, I do believe that there is more than implication present.
> 00:04:51 GET /w/atom.xml, unconditional.
> Fulfilled with 200, 502 KB.
> [...]
> A 20 minute retry rate with unconditional requests is wasteful. [...]
And If-Modified-Since makes a request conditional. https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditiona...
You left out a further explicit mention of conditional requests:
> Advised (via Retry-After header) to come back in one day since they are unwilling or unable to do conditional requests.
But I think it's still unarguable that the post doesn't explicitly mention If-Modified-Since, which it's not obliged to do, but the mention of it here could be helpful to someone. So why fuss?
If only people knew of the standards.
Blocked for 2 hits in 20 minutes on a light protocol like rss?
That seems hilariously aggressive to me, but her server her rules I guess.
If your feed reader is refreshing every 20 minutes for a blog that is updated daily, nearly 99% of the data sent is identical. It looks like Rachel's blog is updated (roughly) weekly, so that jumps to 99.8%. It's not the least efficient thing in the world of computers, but it is definitely incurring unnecessary costs.
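Those percentages can be checked quickly. A small sketch, assuming exactly one new post per update period and counting every other poll in that period as redundant; the function name is made up.

```python
def redundant_fraction(poll_minutes, update_days):
    """Share of polls that return nothing new, given one update per period."""
    polls = update_days * 24 * 60 / poll_minutes
    return (polls - 1) / polls


print(f"{redundant_fraction(20, 1):.1%}")  # daily updates: 98.6%
print(f"{redundant_fraction(20, 7):.1%}")  # weekly updates: 99.8%
```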
I opened the xml file she provides in the blog and it seems very long but okay. Then I decided it is a good blog to subscribe so I went and tried to add to my freshrss selfhosted instance (same ip obviously) and I couldn't because I got blocked/rate limited. So yes it is aggressive for different reasons.
Same, I made 3 requests in total and got blocked.
Weird. Those should have had different user-agents, and I would guess it cannot be purely based on IP.
Yeah, that's insane. Pretty much telling me not to subscribe to your blog at that point. Like sites that have an rss feed yet put Cloudflare protection in front of it...
The correct thing to do here is put a caching layer in front so that every feed reader isn't simultaneously hitting the origin for the same content. IP banning is the wrong approach. (Even if it's only a temporary block, that's going to cause my reader to show an error and is entirely unnecessary.)
It should be a timeboxed block if anything. Most RSS users are actual readers and expecting them to spend lots of time figuring out why clicking "refresh" twice on their RSS app got them blocked is totally unreasonable. I've got my feeds set up to refresh every hour. Considering the small number of people still using RSS and how lightweight it is, it's not bad enough to freak out over. At some point all Rachel's complaining and investigating will be more work than her simply interacting directly with the makers of the various readers that cause the most traffic.
Her rss feed is last 100 posts with full content.
So it means 30 months of blog posts content in single request.
Sending 0.5MB in single rss request is more crime than those 2 hits in 20 minutes.
I generally agree here.
There are a lot of very valid use cases where defaulting to deny for an entire 24 hour cycle after a single request is incredibly frustrating for your downstream users (shared IP at my university means I will never get a non-429 response... And God help me if I'm testing new RSS readers...)
It's her server, so do as you please, I guess. But it's a hilariously hostile response compared to just returning less data.
> But it's a hilariously hostile response compared to just returning less data.
So provide a poor service to everyone, because some people don't know how to behave. That seems like an even worse response.
Send only one year's recent posts and you've reduced bandwidth by 50%.
People don't want to have to customize refresh rates on a per-feed basis. Perhaps the RSS or Atom standards need to support importing the recommended refresh rate automatically.
Yes that's right. Most blogs that are popular enough to have this problem send you the last 10 post titles and links or something. THAT is why people refresh every hour, so they don't miss out.
> Blocked for 2 hits in 20 minutes on a light protocol like rss?
I might be getting old, but 500KB in a single response doesn't feel "light" to me.
Yes, this is a very poorly designed RSS feed.
500KB is horrible for RSS.
It's reasonable to have whole articles in RSS, if you aren't trying to show ads or similar.
Whole articles are reasonable.
100 articles are not reasonable.
100 articles where most of them are 1+ year old is madness.
RSS is not an archive of the entire website.
I believe if you read carefully, it's not blocked, it's rate limited to once daily, with very clear remediation steps included in the response.
If you understand what rate limiting is, you block them for a period of time. Let's stop being pedantic here.
72 requests per day is nothing and acting like it's mayhem is a bit silly. And for a lot of people would result in them getting possible news slower. Sure OP won't publish that often but their rate limiting is an edge case and should be treated as such. If they're blocked until the next day and nothing gets updated then the only person harmed is OP for being overly bothered by their HTTP logs.
Sure, it's their server and they can do whatever they want. But all this does is hurt the people trying to reach their blog.
72 requests per day _per user with a naive feed reader_. This is a small personal blog with no ads that OP is self-hosting on her own hardware, so blocking all this junk traffic is probably saving her money. Plus she's calling attention to how feed readers can be improved!
My reason for smacking stuff down is that I don't want to see it in my logs. That simple.
Even if they had 1000 feed readers which would be a massive amount for a blog, if you can't scale that cheaply, that's on you.
As I pointed out, her blog and rate limiting are an extreme edge case, it would be silly for anyone to put effort into changing their feed reader for a single small blog. It's bad product management.
Of course she can. It's static. She doesn't want to, and I understand. She's sending her clients a standard signal that says "I think you have already read this, at least ask me first when this last changed".
> 72 requests per day is nothing and acting like it's mayhem is a bit silly.
72 requests per day per IP over how many IPs? When you start multiplying numbers together they can get big.
I invite you to run your own popular blog on your own hardware and pay for the costs. It sounds like you don't know what the true costs are.
I do run a popular blog, and a $5 a month Digital Ocean droplet handles millions of requests per month without breaking a sweat.
If every user is collecting 36mb a day like in the story here, your droplet wouldn’t even be capable of serving 500 users a month without hitting your bandwidth limit. With their current rates, your one million requests would cost you around 10 million USD.
30 * 500 * 36mb = 540gb and I have 1tb a month on my apparently $6 droplet
Correction - from my billing page it's $4.50 a month, from the resize page it is $6 so I'm guessing I am grandfathered in to some older pricing
That's a ridiculously big quantity of data to serve for a seldom updated blog just because the client doesn't want (or know how, or think) to implement an easy and old HTTP method.
Imagine the petabytes of data transferred through the internet saved if a couple RSS clients added that method.
If OP enabled gzip then this 36mb would be 13mb.
If OP reduced 30 months of posts in rss to 12 months then this 13mb would be 5mb a day.
Using Cloudflare free plan and this static content is cached without any problem.
OP has never said that this is about financial aspects of things.
Yes, it's about enforcing their preference on how others should interact with OP's published site feed, on principle. Which is always an uphill battle.
More like a skill issue or just decision to make your life more difficult.
It is free and easy to scale this kind of text based blog.
Sounds like you don't know how to scale for cheap.
And since I've run integrations that connected over 500 companies, I know what a rogue client actually looks like, and at 72 requests per day I wouldn't even notice.
But it's not a "light" protocol when you're serving 36MB per day, when 500KB would suffice. RSS/Atom is light weight, if clients play by the rules. This could also have been a news website, imagine how much traffic would be dedicated to pointless transfers of unchanged data. Traffic isn't free.
A similar problem arises from the increase in AI scraper activities. Talking to other SREs, the problem seems pretty widespread. AI companies will just hoover up data, but revisit so frequently and aggressively that it's starting to affect the transit feeds for popular websites. Frequently user-agents aren't set to something unique, or are deliberately hidden, and traffic originates from AWS, making it hard to target individual bad actors. Fair enough that you're scraping websites, that's part of the game when you're online, but when your industry starts to affect transit feeds, then we need to talk compensation.
That’s a bit disingenuous. 429s aren’t “blocking”, they’re telling the requester that they’ve made too many requests and to try again later (with a value in the header). I assume the author configured this because they know how often the site is going to change typically. That the web server eventually stops responding if the client ignores requests isn’t that surprising, but I doubt it was configured directly too.
Semantics. 429 is an error code. Rate limiting...blocking...too many requests...ignoring...call it whatever you like but it amounts to the same thing, namely that the server isn't serving the requested content.
> 429s aren’t “blocking”
Like how "unlimited traffic, but will slow down to 1bps if you use more than 100gb in a month" is technically "unlimited traffic".
But for all intents and purposes, it's limited. And 429s are blocking. They include a hint towards the reason why you are blocked and when the block might expire (retry-after doesn't promise that you'll be successful if you wait), but besides that, what's the difference compared to 403?
I would disagree. Blocking typically implies permanence (without more action by the blockee), and since 429 isn’t usually a permanent error code I wouldn’t call it blocking. Same applies with 403, it’s only permanent if the requester doesn’t authorize correctly.
I would say it's disingenuous to claim sending HTTP status and body that is not expected for a period of time is not blocking them for that period of time. You can be pedantic and claim "but they can still access the server" but in reality that client is blocked for a period of time.
In that case, I should be irate that the AWS API blocks me many times per day. Run `aws cli service some-paginated-thing` and see how many retries you get during normal, routine operation.
But I’m not, because they’re not blocking me. They’re asking my client to slow down. Neither AWS nor Rachel’s blog owes me unlimited requests per unit time, and neither has “blocked” me when I violate their policies.
They literally do block you for a period of time until you are out of the rate limit. That is how rate limits work. That's why you don't get to access the resource you requested, because their system literally blocked you from doing so.
See when you're trying to be pedantic and all about semantics, you should make sure you've crossed your Ts and dotted your Is.
> Block – AWS WAF blocks the request and applies any custom blocking behavior that you've defined.
from https://docs.aws.amazon.com/waf/latest/developerguide/waf-ru...
And my favourite
> Rate limiting blocks users, bots, or applications that are over-using or abusing a web property. Rate limiting can stop certain kinds of bot attacks.
From CloudFlare's explainer https://www.cloudflare.com/learning/bots/what-is-rate-limiti...
Every documentation on rate limit will include the word block. Because that's what you do, you allow access for a specific amount of requests and then block those that go over.
The HTTP protocol is a lost art. These days people don't even look at the status code and expect some mumbo jumbo JSON payload explaining the error.
I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific. They are effectively a part of every API automatically without considerations whether they are needed.
People often implement error handling using constructs like regexp matching on status codes, while with domain-specified errors it would be obvious what exactly is the range of possible errors.
Moreover, when people do implement domain errors, they just have to write more code to handle two nested levels of branching.
> I would argue that HTTP statuses are a bad design decision, because they are intended to be consumed by apps, but are not app-specific.
Perhaps put the app-specific part in the body of the reply. In the RFC they give a human specific reply to (presumably) be displayed in the browser:
* https://datatracker.ietf.org/doc/html/rfc6585#section-4
* https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
But if the URL is specific to an API, you can document that you will/may give further debugging details (in text, JSON, XML, whatever).
> because they are intended to be consumed by apps, but are not app-specific
Well, good luck designing any standard app-independent protocol that works and doesn't do that.
And yes, you must handle two nested levels of branching. That's how it works.
The only improvement possible to make it clearer is having codes for API specific errors... what 400 and 500 aren't exactly. But then, that doesn't gain you much.
That's because a lot of people refuse to use status codes properly, like just using 200 everywhere.
A colleague who should’ve known better argued that a 404 response to an API call was confusing because we were, in fact, successfully returning a response to the client. We had a long talk about that afterward.
No, it is pretty confusing: the difference between 404 from hitting an endpoint that the server doesn't serve (because you forgot to expose this endpoint, oops!) and a 404 that means "we've successfully performed the search in our DB for the business entity you've requested and guarantee you that it does not exist" is rather difficult to tell programmatically.
I'm open to arguing about which error to return in each case, but surely we can agree that neither of those warrant a 200?
I don't look at the code because it's wrong sometimes. Some pages return a 200 yet display an error in the page.
Nothing more annoying than a 200 response when the server 'successfully' serves a 404 page
Rejecting every unconditional GET after the first? That sounds a bit excessive. What if the reader crashed after the first and lost the data?
It’s a RSS feed. In that case, wait until the specified time and try again and any missed article will appear then. If it is constantly crashing so articles never get loaded, fix that.
> If it is constantly crashing so articles never get loaded, fix that.
This often requires doing lots of tests against the endpoint, which the server prohibits.
If you are an rss-reader dev then you can set up a caching layer of your own.
> If you are an rss-reader dev then you can set up a caching layer of your own.
But are RSS reader devs willing to jump through such hoops?
I would claim that writing a (simple) RSS reader (using a programming language that provides suitable libraries) is something that would be rather easy for me, but setting up a caching layer would (because I have less knowledge about the latter topic) take a lot more research from my side concerning how to do it.
Sure, I have done such a thing myself and it was very simple. Let's say you do http_get(rss_address). Create a function http_cached_get, that looks for a recent cached response, and if none exists delegates to http_get and saves the response. In python this is like 10 lines.
For that matter, what if it's a different device (or entire different human being) on the same IP address?
At some point instead of 429 it should return a feed with this post as always newest.
Related: https://news.ycombinator.com/item?id=42470035
This is why RSS is for the birds.
My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr which absolves me of the responsibility of being on the other side of Rachel's problem.
With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.
When a blog gets posted Superfeedr hits an AWS lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is Superfeedr costs 10 cents a feed per month which is a good deal for an active feed such as comments from Hacker News or article from The Guardian but is not affordable for subscribing to 2000+ indy blogs which YOShInOn could handle just fine.
I might yet write my own RSS head end, but there is something to say for protocols like ActivityPub and AT Protocol.
Rss is pretty light. Even if you say it's too much to be re-sending, you could remove the content from the rss feed (so they need to click through to read it), which would shrink the feed size massively. Alternatively, remove old posts. Or do both.
Hopefully you don't have some expensive code generating the feed on the fly, so processing overhead is negligible. But if it's not, cache the result and reset the cache every time you post.
Surely this is easier than spending the effort and emotional bandwidth to care about this issue?
I might be wrong here, but this feels more emotionally driven ("someone is wrong on the internet") than practical.
As a user of the RSS feed, please don't remove content from it so I have to click through. This makes it much less useful and more annoying to use.
I always click through regardless, because the rss text is probably missing formatting and images. I'll never be sure I'm getting a proper copy of the article unless I click through anyway.
I am stupid, why not just return an HTML document explaining the issue, when there is such an incorrect second request in 20 minutes, then blocking that IP for 24 hours? The feed reader software author has to react, otherwise its users will complain to him, no?
That’s what 429 return is for, which is mentioned in the article.
It might be clever to return an rss feed containing 1 item: the html document you mention.
On the flip side, what percent of RSS feed generators actually support conditional requests? I've written many over the last twenty years and I can tell you plainly, none of the ones I wrote have.
I never even considered the option or necessity. It's easy and cheap just to send everything.
I guess static generators with an Apache-style web server probably do, but I can't imagine any dynamic generators bother to try to save the small handful of bytes.
Does anyone know if FreshRSS behaves properly here?
Earlier article with some info on freshrss: https://rachelbythebay.com/w/2024/10/25/fs/
I couldn't find the tester. Thankfully the client I use was tested... and it behaves poorly. Thankfully Emacs has a client I can switch to!
I have a blog where I post a few posts per year. [1] /feed.xml is served with an Expires header of 24 hours. I wrote a tool that allows me to query the webserver logs using SQLite [2]. Over the past 90 days, these are the top 10 requesters grouped by ip address (remote_addr column redacted here):
Reeder is loading the feed every 5 minutes, and in the vast majority of cases it’s getting a 301 response because it tries to access the http version that redirects to https. At least it has state and it gets 304 Not Modified in the remaining cases.
If I order by body bytes served rather than number of requests (and group by remote_addr again), these are the worst consumers:
The top consumer, Refeed, is responsible for about 2.25% of all egress of my webserver. (Counting only body bytes, not http overhead.)
[1]: https://ruudvanasseldonk.com/writing
[2]: https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...
I ban the feed for 24 hours if it doesn't work.
I also design 2 new formats that no one (including myself) has ever implemented.
https://go-here.nl/ess-and-nno
enjoy
I have a few feeds configured into Thunderbird but wasn’t reading them very often, so I “disabled” them to load manually. Despite this it tries to contact the sites often and, when not able to (firewall) goes into a frenzy of trying to contact them. All this despite being disabled.
Disappointing combined with the various update sites it tries to contact every startup, which is completely unnecessary as well. Couple of times a week should be the maximum rate.
[flagged]
> to waste data
Useless traffic.
RSS has a structural problem: you download a "window" of data (the feeds between some timestamp in day A and some other in day B), which may or may not contain new data. You will easily lose some and receive a lot of duplicates.
It would have been better to be able to first check if new data is available. (And even better to only download the new ones, and all of them - "everything after YY-MM-DD hh:mm:ss")
> It would have been better to be able to first check if new data is available.
Or, as the article says, to actually check when you are able. All she is asking for is that readers make conditional requests, at reasonable intervals, and respect 429s.
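Respecting a 429 mostly comes down to reading its Retry-After header, which carries either a number of seconds or an HTTP date. A rough sketch of turning that header into a wait time follows; the function name `retry_after_seconds` and the `now` parameter are made up for illustration, and it makes no claim about how any particular reader does it.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into seconds to wait."""
    value = value.strip()
    if value.isdigit():
        # Delay form, e.g. "86400" for one day.
        return float(value)
    # Date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        # Treat a date without a usable zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (when - now).total_seconds())
```

A reader would store the current time plus this delay per feed and skip polling until then, instead of retrying on its normal schedule.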
> It would have been better to be able to first check if new data is available.
That... is what conditional requests do...
Which means: do not place extra mechanisms in the RSS when you can do the same with HTTP.
But then there is the problem of RSS clients that may not work properly (may not use that trick), without the user knowing it, and that of RSS servers that do not work correctly with "If-Modified-Since" (as noted in a nearby post).
Edit: but if the burden were placed on RSS instead, we could have had the trick of "I already have ...#10050, #10051 and #10052: just send me from #10053 on" - the feeds XML that is updated recently will contain more items than just the new ones. Similarly for the gaps: "the XML would contain from #10050 on, but there had been a surge of publications and now I am missing #10045 to #10049..."
Such as via HEAD and Etag?
HEAD is counter-productive
Just GET with if-none-match and if-modified-since (based on the etag and last-modified you got in the previous response), and the server will return a 304 not modified with no content if nothing has changed, and the content otherwise.
With a HEAD you'd get the same result except now you'd need to ignore the cache headers from the HEAD response in order to fetch the content in a second request.
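For reference, a minimal sketch of that conditional GET using only Python's standard library. The helper names `conditional_headers` and `fetch_feed` are hypothetical, and the caller is assumed to store the returned ETag and Last-Modified values between polls. Note that `urllib` reports a 304 by raising `HTTPError`, so it is caught here.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional, Tuple


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build validator headers from the values saved after the previous fetch."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> Tuple[Optional[bytes], Optional[str], Optional[str]]:
    """Return (body, etag, last_modified); body is None when nothing changed."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the validators we already have.
            return None, etag, last_modified
        raise
```

The first poll has no validators and is therefore unconditional; every later poll passes back what the previous call returned.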
You speak of this confidently, but there are sites that return a not modified response when in truth they have modified the feed. It doesn't happen often, but I have seen it happen with more than one site. It is why I take the not modified response with a grain of salt.
It's wasted operations on the user's end, the server's and the network infrastructure. None of those are free.
> What does it mean to waste data, something notoriously free to copy?
She didn't say waste data. It's a waste of many resources though, including energy and bandwidth, even processing power. Depends on what level of abstraction you want to look at it from, but it's definitely a waste of something.
It is possible to waste data though, but only by deleting it. It takes energy to collect and store data. This isn't relevant to the case though.
"Waste data" is not to be taken literally. If I say you wasted my time by writing nonsense, would you then reply that time is time and hence can be neither created nor consumed?
You do pay for your link, right?
What am I actually paying for? Is it "the entire speedometer" ie. 24/7 100% utilization of the advertised upload/download capability of the link? Why not?
> Why not?
Because that costs way more than you're paying for your connection. The business model is predicated upon oversubscription of the ISP's network because near enough nobody does that.
Network bandwidth
which => that
They're exactly equivalent. What are you hoping to correct?
They're obviously not.
It's a Briticism (AFAICT) making inroads.
if you have to 429 people for an rss feed the problem is you
I think it's an acceptable response. Not only there's no SLA, but people are free to not provide a service to misbehaving user agents. It's like rejecting connections from Tor.
If anything, a 429 is a nice heads up. It could have been worse; she could have redirected those requests to a separate URL with an... unpleasant content, like a certain domain that redirects to I-don't-know-what whenever they detect the Referer header is from HN.
As interesting as that site is, and as much as I sympathise with the author's plight, that site's behavior is so anti-me that I'm going to ignore it whenever/wherever it pops up. I'm not trolling the author, I'm not calling them names or anything, I was just interested in the technical stuff. I wish them good luck.
There's a particular type of person that scours their HTTP logs and makes up rules that block 90% of feed readers using the default poll interval. If I stick your RSS feed into Miniflux and I get 429'd, I just stop reading your blog. Learn2cache. I'm talking to you, Cheapskate's Guide.
This site would not 429 current Miniflux, since it makes conditional requests. She has a previous post outlining cache respecting behaviour of many common feed readers.
It could 429 it for conditional requests as well:
> Unconditional requests: at most once per 24 hour period.
> Conditional requests: at most once per 60 minute period.
(Source: calling `curl hxxps://rachelbythebay[.]com/w/atom.xml` twice)
Nobody owes these people and their feed readers a 200 whenever they want one.
not when the client sends unconditional requests i.e. missing If-Modified-Since and If-None-Match headers.
All feed readers/clients should cache responses when sending multiple requests the same day.
If you don't stop at red lights, the problem is other people. /s