Learnings: Identifying and getting rid of unwanted traffic

smallwheels

Well-known member
I've recently spent some time getting rid of unwanted traffic on my forums and thought the learnings might have value for someone else, so I am writing them up. This is not intended as a tutorial or even advice - it is just a couple of findings that you may find useful or not. Also, many roads lead to Rome, depending on your situation, needs and abilities. So take it with a grain of salt. ;)

Important boundary conditions for my actions: My forum is pretty small (currently ~2,000 registered users), runs on shared hosting (which limits my possibilities in terms of configuration), I do not use Cloudflare (and do not want to) and my user base comes to 99.8% from German-speaking countries in Central Europe.

So first: What do I consider "unwanted traffic"? Probably the same as most of us: Spammers, bot registrations, (automated) content scrapers, cracking attempts, but also bot traffic from SEO companies, AI training bots and most other bots that are neither search engines nor serve some other legitimate, fair-use purpose.

In fact, wanting to exclude AI training bots from the forum is where it started. I installed the excellent "Known Bots" add-on by @Sim a long time ago (all add-ons I used are linked at the bottom of this post). It identifies all kinds of bots by the user agent they submit and provides a list of the latest 100 bots that visited your forum. This way you get (with some manual work) a good overview of what's floating around. For many of them a helpful link is provided that explains which purpose the bot serves - it takes manual clicking around and reading, so there's time involved, but it is a good start.

So what I did (and still do) is to go through the list about once a week, identify the bots I consider unwanted and add them to the Disallow list in my robots.txt (a short sketch of such entries follows further below). This way I got rid of a good bunch of unwanted bots - but this approach is not enough:

First, it fully depends on the bot being cooperative. If the bot does not follow or respect the robots.txt it will fail - and it turned out that while many bots do respect the robots.txt, a relevant part does not, namely the nastier SEO bots but also some bots like "facebook external hit".
Second, there are some bots where you can't tell what they do as they are built on generic libraries and submit those as the user agent. These show up as e.g. "okhttp", "Python Requests Library", "Go-http-client", "python-httpx", etc. So you have no idea what those do, no idea if they come from one author/source or many different ones, whether they are from a single private hobbyist (and so probably OK) or a malicious scraper - but you can assume that they will probably not respect the robots.txt. I added them anyway - it does no harm (though, as it turned out, it has no effect either). We'll deal with them later...
Third - and this is a big loophole - an "evil" bot will probably not identify itself using the user agent "evil-bot". The more cleverly made ones will, on the contrary, try to hide behind a common user agent, either one of the browsers that human users use or a legitimate and well-accepted (or even wanted) bot such as the Google indexing bot. Again, we'll talk later about how to deal with that blind spot. For the moment it is sufficient to know: The Known Bots add-on is very helpful but, due to its nature, limited at some points.
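To illustrate, a minimal sketch of what such robots.txt entries can look like - the bot names here are just examples of crawlers that commonly show up; use the exact user agent names that Known Bots reports for your forum:

Code:
# one block per unwanted bot - names are examples, take the exact ones from the Known Bots list
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: GPTBot
Disallow: /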

In fact, there is a new "standard" trying to establish itself: Analogous to robots.txt there is the possibility of creating an "ai.txt" where you can simply tell AI bots that you do not want them. However: It is neither well known nor well established and I do have my doubts that AI bots will follow it (and according to my logs it seems to be checked rarely, if at all). I created one anyway - it costs nothing, is quickly done and does no harm:

Code:
# $myforum content is made available for your personal, non-commercial
# use subject to our Terms of Service here:
# https://$myforum/help/terms/
# Use of any device, tool, or process designed to data mine or scrape the content
# using automated means is prohibited without prior written permission.
# Prohibited uses include but are not limited to:
# (1) text and data mining activities under Art. 4 of the EU Directive on Copyright in
# the Digital Single Market;
# (2) the development of any software, machine learning, artificial intelligence (AI),
# and/or large language models (LLMs);
# (3) creating or providing archived or cached data sets containing our content to others; and/or
# (4) any commercial purposes.

User-Agent: *
Disallow: /
Disallow: *

For getting rid of spam registrations I bought and installed the "Registration Spaminator" and "Login Spaminator" add-ons by @Ozzy47 - and immediately all hell broke loose. I did not suffer from many successful spam registrations until then, but from time to time one came through, plus the occasions when I had to deal with registrations caught by the mechanisms built into XF and awaiting manual action grew just frequent enough to annoy me. With Ozzy's add-ons the manual work is gone completely and not a single spam registration has come through. However: Now I see the full amount of unsuccessful attempts that was hidden until now. At the moment, after about 3.5 months, the Registration Spaminator shows ~26,000 unsuccessful registration attempts by bots and the Login Spaminator ~12,000 unsuccessful bot login attempts.
There is absolutely no need to do anything (as the add-ons do their job reliably), but as the logs also provide the IP addresses those bots use and offer an easy way to look up the IPs, I was curious and dived into it a bit. A couple of things turned out:

• Many of the attempts came repeatedly from the same IP addresses
• most of them belonged to hosting providers in Russia
• apart from that there was a limited but recognizable number of VPN endpoints in various countries and occasionally some from countries like China, India, Indonesia, Taiwan, and - more rarely - from the US, UAE, Egypt and Ukraine

This is where I started my counter-measures. My forum users are almost exclusively from German-speaking countries, plus a few from other European countries (including the UK) and a handful from the US. So I installed the add-on Geoblock Registrations, again by @Sim, and excluded a small bunch of countries from registering, in case a spammer gets around Ozzy's add-ons. Just as a second line of defense, and due to the nature of my user base no collateral damage is to be expected.

On top of that I added some of the more notorious IPs to a freshly created deny list in my .htaccess (a sketch follows below). This is a bit of a dangerous game for two reasons:

• IPs often get reassigned relatively quickly, so one might block legitimate traffic after a short while. Plus, obviously, many spammers will just switch the IP they are using if they are locked out.
• the .htaccess is read and evaluated on every single request to the forum. If it is too big or too complex this may increase load times for legitimate users, which obviously is undesirable.

However, at the level of traffic on my forum it should not be an issue, and the size of my .htaccess is still manageable.
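To illustrate the principle, a minimal sketch of such a deny list in .htaccess, assuming Apache 2.4 syntax (the IPs and ranges below are documentation placeholders, not real offenders):

Code:
# deny list - single IPs and whole network blocks (placeholder addresses)
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
    Require not ip 2001:db8::/32
</RequireAll>

On older Apache versions the classic "Order Allow,Deny" / "Deny from" syntax achieves the same.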

Coming back to the unwanted bots: I did the same with the unwanted bots from the Known Bots list that did not follow the robots.txt (by the way: many claimed on their websites that they would respect the robots.txt but in fact did not - various SEO companies notoriously, but also e.g. the ChatGPT training bot, or a bot using its user agent). A very special case is Meta's "facebook external hit" bot: It claims to respect the robots.txt most of the time, but sometimes not. It claims to have the purpose of creating previews for content from your forum shared on Facebook - but at the same time to be used for "security checks" and other purposes - so it does not even rule out being used for training Facebook's AI model. You need to read carefully to get this - this thing does indeed seem to be a trojan horse. Other companies have different bots for different purposes.
To get the IPs the bots are using I had to grep them from the web.log of the server, using the user agent as the filter criterion (as the Known Bots add-on does not provide the IPs). Some of the serious businesses that use bots do publish the IPs or IP ranges their bots use on their webpages. This comes in handy as this way you can block them out completely, while with the approach via the web.log you only get the IPs one by one, which is a bit annoying. BTW: A lot of the companies that send bots to your forum do not like bots on their own websites - they use Cloudflare's "I am a human" check. Double standards, I'd say - and a clear hint that you do not want THEIR bots on YOUR website either.
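A sketch of such a grep job (the log path, the user agent and the IP are examples, and a combined-style log with the client IP in the first field is assumed):

Bash:
# list the source IPs behind a given user agent, busiest first
grep -F 'MJ12bot' web.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# then feed a single IP into whois to find the network block it belongs to
whois 203.0.113.45 | grep -iE 'inetnum|NetRange|CIDR'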

An interesting thing in that respect were the "generic library based bots" I mentioned earlier. It turned out that they were quite massive in their attempts to scrape forum content and that many of them were using IPs from the big cloud providers. A lot of attempts came from AWS, mainly China-based, but also Microsoft, Digital Ocean and even Oracle. Plus, again, loads from Russian providers. And they were mostly using multiple IPs from the same network blocks of the ISP/hoster in question.

So again I took advantage of being a small forum, fed the IPs in question into whois and in most cases simply blocked the whole network block the IP belonged to via .htaccess. Again a bit of a dangerous game, especially with providers like AWS, as many companies are using their services - so there is a danger of collateral damage, but so be it.

Coincidentally, I innocently installed the "custom 404 page" add-on by @Siropu. The intention was - you may guess it - to create a custom 404 page, but it turned out to be another source of knowledge: It provides a log of calls that ended in a 404 and, much to my surprise, many of those were very awkward URLs. A lot of them obvious cracking attempts, calling pages or files under e.g. /wp-admin/, /wp-install/, /wp-content/, /wp-plugins/ or /wp-includes/ that don't exist on my server, as I do not run WordPress and never have on this domain. The same goes for some other URLs like "superadmin.php" and many more - all of them nonexistent (else they would not get a 404 and not be listed in the log of the add-on). Grepping the web.log for those patterns opened a new can of worms. An enormous amount of calls from various IPs. Again, as before, many of them located at the cloud providers and many in Russia. But this time even a net block of Kaspersky was part of the game - and also an IP that belongs to Cloudflare. Network blocks from Microsoft were pretty prominent as a source of these requests as well, as were a couple of providers who claim to have their offices on the Seychelles - for decades already a pretty easy and obvious sign of a rogue hosting company.
As, for whatever reason, the log of the 404 add-on only rarely delivers an IP address (but sometimes does), a grep job is unavoidable here. In fact it turned out that many of those hid behind generic user agents and thus are not listed by "Known Bots".
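For reference, a sketch of the kind of grep job meant here, with a whois lookup per IP to see who owns the network (log format assumed as above, the URL pattern is an example):

Bash:
# IPs probing for WordPress paths that cannot exist here, plus their network owner
# (sort -u keeps it to one whois lookup per IP)
grep -E '"(GET|POST) /wp-' web.log | awk '{print $1}' | sort -u |
while read -r ip; do
    printf '%-16s %s\n' "$ip" "$(whois "$ip" | grep -iE '^(OrgName|netname|org-name):' | head -1)"
done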

In fact those calls do no real harm - what they target is not there, so nothing can happen. But they fill my logs with crap and bring unwanted load to the server for no reason. Indeed they do the latter, as they often send out hundreds of requests per minute, and as this goes on for a while it adds up. So I went the same route as before and blocked either the IP or the whole network block, depending on the output of whois, the geographic location, the number of IPs involved and the frequency of the occurrences. For the two reasons above plus a third one: If nasty stuff repeatedly comes from a certain area of the net, chances are that more, and different, nasty stuff will come from that area in the future (and maybe already does, but I did not filter for it in the logs, so I don't know about it). Locking them out completely is the simplest way of dealing with it.

To simplify things I will probably create a rule in .htaccess that simply blocks any call to those WP-directories, so I don't have to fiddle around with single IPs.
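A sketch of what such a rule could look like (Apache, using mod_alias; the path list is an example and assumes there is no WordPress anywhere on the host):

Code:
# nothing WordPress-like exists here - answer all such probes with 403 instead of 404
RedirectMatch 403 ^/(wp-admin|wp-content|wp-includes|wp-login\.php|xmlrpc\.php)

This keeps the 404 log clean without having to chase the single IPs behind the probes.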

These measures collectively brought down unwanted traffic and behavior massively - at least for the moment. This will need constant readjustment of the robots.txt and the .htaccess, but in general it seems to be a success. How demanding the adjustments will be remains to be seen. Basically I followed a bunch of patterns:

• user agent
• behaviour
• geographical location
• kind of IP address (dial-up, hoster, VPN, cloud provider, mobile)

In some cases all four patterns hit, in some fewer than that. And it turned out that many of the IPs caught by one of the Spaminator add-ons also scanned for WordPress weaknesses or grabbed content. BTW: While IPv6 addresses did turn up from time to time, the vast majority were IPv4.
I have no proof, but the impression that most of those requests probably came from just a handful of different persons (or rather criminal organizations), using changing IP addresses over time. The patterns during registration and the URLs they crawled automatically were too similar. And it seems that most of them have their source in Russia in the end - a country notorious for decades as a common and pretty safe home for all kinds of internet crime. There are other sources as well, but in comparison these are few.

There are still some loose ends. E.g. in the log of the custom 404 page add-on I see some strange behaviour: a series of calls to existing URLs of threads, but with a random pattern at the end that then gets rotated through various endings, like

Threadurl/sh1xw6es60qi
Threadurl/sh1xw6es60qi.php
Threadurl/sh1xw6es60qi.jsp
Threadurl/sh1xw6es60qi.html

I see this for various URLs (threads, tags, media) with various random parts. No idea what the intention behind this may be. Also I see calls for things in /.well-known/* like

$forum-URL/.well-known/old/
$forum-URL/.well-known/pki-validation/autoload_classmap.php
$forum-URL/.well-known/init
$forum-URL/.well-known/.well-known/owlmailer.php
$forum-URL/.well-known/pki-validation/file.php
$forum-URL/.well-known/pki-validation/sx.php
$forum-URL/.well-known/old/pki-validation/xmrlpc.php

and many, many more. While most of them seem very fishy, at least some of them seem genuine as they come e.g. directly from Google and in this case seem to refer to something Android-related. Plus there is legitimate use for calls to .well-known, e.g. in case your forum acts as an SSO provider (which mine does not).

It was an interesting dive into a rabbit hole - it probably did not solve many relevant real-world problems but produced a bit of learning about what's going on unseen. One learning on top of that is that, compared to my time as a professional admin decades ago, things have become more miserable. Back then AI did not exist, nor did cloud providers, or bots to a relevant degree apart from the indexing bots of search engines, or publicly usable VPNs. Evil players would (when stupid) use their own IP or - the more clever ones - misconfigured open proxies, hacked servers, rogue providers or private computers behind dial-ups that were turned into zombies via malware. One could approach abuse desks easily; they were appropriately skilled and took action quickly and effectively. Today things have become way more complex and everybody is using AWS and the like, the goodies as well as the baddies. Getting in touch with one of those hosters has become useless; staff is incompetent and/or uninterested and hides behind a wall of inappropriate contact forms. So today I do not even bother to contact an abuse desk anymore and simply lock out to the best of my abilities and possibilities.
It is also a sad truth that the big cloud companies earn part of their revenue serving the dirty part of the web that does harm to others. It may not be their intention and - due to the at-scale size and highly automated business model of the cloud providers - it is hard to avoid, but AWS, Microsoft, Oracle, Digital Ocean and obviously even Cloudflare are not the white hats that they claim to be. They earn dirty money, they profit from the illegal and fraudulent behaviour of their customers and clearly could do a lot more to avoid it. But this would negatively influence their profits...

What could be quite nice would be the possibility to include a continuously updated third-party blacklist of known rogue or spam IPs in the .htaccess (much in the way it is, or was, common for mail servers). However: I did not dive into this, it seems not to be common or straightforward and could possibly lead to performance issues. I would still be interested in trying it out.

The add ons I used:





 
An update after three weeks with this approach. In short: It works pretty well; however, it needs permanent tinkering (especially in the beginning) and clearly has its flaws. It also takes a considerable time effort plus some (not very elaborate) skills and includes the risk of accidentally making the forum unreachable if you make an error.

Longer version:

Basic principles​


1. identify unwanted traffic
2. classify the traffic
3. treat the traffic with effective measures from friendly to rigid, avoiding collateral damage

In my case (as my options are limited) this means almost exclusively using robots.txt (friendly approach) and .htaccess (rigid to very rigid approach).

Bot-Registration attempts​


This was the easiest and probably most successful part - however without real-life effect: Bot registration attempts had already been reliably identified and successfully blocked by the Spaminator plugins by @Ozzy47. There were between dozens and hundreds per day. As the Spaminator plugins offer a very comfortable IP lookup it was easy to identify the sources. I blocked the IPs via .htaccess, either the single IP or the whole network, depending on personal judgement. The network-block approach was mostly used on hosters and data centers from Russia, rarely but sometimes from other regions. The simple reason: Russian IPs were pretty notorious and I do not get legitimate traffic from there.
Result: Logged spam registration and login attempts dropped to about 1-5 per day. What is left now comes for the most part via VPN endpoints and TOR exit nodes in various countries around the world. As these typically do not offer decent connection speed, there are also typically only one or two attempts per session (as opposed to before, when there were sometimes hundreds of tries). These are now blocked via .htaccess on a per-case basis. Judging from pattern matching on the usernames used, it seems to be only about three or four different players.
However: No real threat anyway, just a little cleaning out the traffic.

Annoying but harmless bots like SEO bots and others that do not deliver value to my forum​


Those were for the most part identified via the Known Bots plugin by @Sim, looked up via their webpage or Google and added to the robots.txt file. In case they still popped up later (so did not respect the robots.txt) I did a grep for the user agent in question in the webserver logs and blocked the IPs via .htaccess.
It turned out that many do bravely follow robots.txt but many others don't, no matter what they claim on their website. Also, while some of the more honest ones publish their source IPs on their webpages, the list is sometimes not comprehensive. But many are pretty easy to lock out as they use IP ranges that belong to their companies and can thus be easily identified. This works pretty well and reliably and is pretty safe.

There are some that use a wider range of vastly different IPs on AWS, but with a little grepping around in the logs they are also locked out safely and reliably - it just takes a little longer.

Crackers and Vulnerability Scanners​


Those are an annoyance, mainly because they spoil the logs with noise, producing thousands of 404s per day and this way making proper 404 management basically impossible. Most of them seem to go through a list of URL patterns and there are three ways to identify them:
• some show up in the known bots list with generic user agents like "python-requests" etc.
• grepping for typical URL patterns they target (e.g. grepping for " /wp-" in the web.log brings them to the surface)
• using the 404 plugin by @Siropu they show up in an easily accessible way, but in the beginning there are way too many to handle them effectively. At that stage it is way easier to do a series of more complex grep jobs on the web.log to identify them. Once you have reduced the (massive) amount considerably, the 404 add-on will be your friend.

It turned out that only part of those were identified by the Known Bots plugin and also that there are three main cohorts:
• some come from random IPs, sometimes (rarely) even dial-ups. These are probably script kiddies.
• some come from rogue providers or open/hacked relays
• a lot come from cloud providers, mainly Microsoft, but also AWS/EC2 or bigger providers like OVH or some smaller ones

While the first two cohorts and parts of the third cohort are easy to block, the ones using Microsoft and AWS/EC2 are a bit more problematic - which is solely the responsibility of Microsoft and Amazon: Both make it easy to rent out massive amounts of IPs quickly and switch them frequently. Both do a very bad job of segmenting their network blocks into smaller chunks or defining their purpose in the whois databases, let alone publishing a relation between an IP and the person using it. Both have utterly complex and useless abuse systems. So the evil players are able to hide successfully behind those providers and to use a huge number of frequently switching IPs.

Blocking single IPs is not effective and way too time-consuming as there are hundreds of them, changing frequently. Blocking whole ranges or bigger network chunks is effective but creates the danger of collateral damage, blocking legitimate (or even wanted) traffic. Microsoft is even worse than Amazon here as within their huge network blocks in whois sometimes (legitimate) traffic from Bing indexing seems to come from the same network ranges as these malicious players.

We'll see the same patterns when it comes to Scrapers in the next section.

However - I decided that in my case I am willing to accept possible collateral damage as I do not see relevant legitimate traffic for my forum coming from AWS or Microsoft (or any data center / hosting provider for that matter) - time may prove me wrong - and Bing as a search engine has barely any relevance in traffic (despite my forum being ranked quite well there). Also, I do not block Bing but only some filthy IP ranges owned by Microsoft that seem to be partly used by Bing as well (there are enough other ranges Bing uses left, and Microsoft does not publish IPs or IP ranges for Bing traffic). If those cloud providers prefer to burn their IP ranges until they are blocked - so be it. So I started to block huge net blocks from Amazon and Microsoft whenever malicious traffic came from there.

The result was pretty amazing overall: After one or two weeks of adding IPs and ranges to .htaccess this kind of traffic as well as the 404s dropped massively, and now regular 404 handling is suddenly possible.

Scrapers​


Another big annoyance are attempts to download forum content at scale, for whatever reason. Due to the very special-interest nature of my forum I can hardly imagine anyone trying to build a clone website or something; most attempts are probably for feeding AI models these days, at least that's what I assume.
The bigger AI companies use bots that identify themselves, and different bots for different purposes. Those seem to play fair in the meantime - e.g. Perplexity follows the robots.txt as far as I can judge, and so do OpenAI and ChatGPT (I have, however, blocked the published IPs of the OpenAI training bot via .htaccess - better safe than sorry). A lot of the smaller ones ignore robots.txt, even if they identify their bots via their user agents. And there are loads of scrapers that don't identify themselves (and are not detected by Known Bots). Those try to pass as normal browsers and switch their user agent frequently and randomly.

Partly, those can be detected with the same measures as the crackers (if they use generic libraries like "python-requests" as their user agent), but often enough not. As a subset of the scrapers focuses on downloading pictures, those can be identified relatively easily on my forum, as viewing full resolution requires you to be logged in. If there are a lot of requests for a lot of full-resolution pictures within seconds, coming from the same IP or from a range of IPs but following the same pattern, the probability is high that this is a bot.
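A sketch of how such a burst can be spotted in the log (the image path is an assumption - adjust it to whatever your full-resolution URLs look like; log format as above):

Bash:
# requests for full-size images per IP and minute - dozens within the same minute
# from one IP are almost certainly a bot
grep ' /attachments/' web.log | awk '{print $1, substr($4, 2, 17)}' | sort | uniq -c | sort -rn | head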

Unfortunately, with scrapers I face the same issue with cloud providers as with the crackers, just worse: The traffic is way more difficult to identify in the logs, at the same time there is way more of it, and the usage of AWS in particular is way more common. Notably, the number of IPs from the namespace "*.ap-southeast-1.compute.amazonaws.com" is dramatically high: hundreds, if not more, different IPs from a vast number of ranges. Other areas of AWS are used as well, but southeast is by far dominant in my logs. Apart from that I discovered a huge number of IPs belonging to Alibaba Cloud, following the same pattern. There were so many that I assume it must be a bigger player. They were switching user agents constantly; example user agents are

"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3067.80 Safari/537.36"
"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.2437.184 Safari/537.36"
"Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.2937.165 Safari/537.36"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.2711.78 Safari/537.36"
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3343.57 Safari/537.36"

So not very obvious, and only subtle changes. IPs from Microsoft were (in contrast to the vulnerability scans, where they were dominant among the cloud providers) not very prominent here over the last weeks. Mostly AWS/EC2, Alibaba, a few sets from Google Cloud plus some smaller or obviously rogue offshore providers.


Results​


Effectiveness​


The measures proved to be pretty effective (at least short term), bringing down unwanted traffic considerably (or rather: dramatically) within a very short amount of time in all areas (bot registration attempts, cracking attempts, annoying bots and - at least partly - scrapers). The problem is that there is probably a huge amount of dark matter, especially with scrapers (and possibly other unwanted traffic that remains unidentified).
However, getting there was a lot of time-intensive manual work - after the initial setup maybe an hour per day. This is constantly decreasing, but some daily manual effort remains necessary and it will stay that way.

Handling of 404s is now possible, which clearly is a benefit for the forum. It would be even better if I did not continue to get 404s where there should be 401s. This affects e.g. tags or full-resolution pictures when clicked by a person who is not logged in. In my current understanding this is a shortcoming of XF. I filed a bug for that three weeks ago but have not gotten any reaction or response from XF's makers:


However:
• The number of identified bots that visit my forum unblocked has gone down to about 40-60 per day, according to "Known Bots". Most of them are desired or tolerated; the unwanted ones that show up are locked out quickly via my manual workflow by adding them to robots.txt and, if they don't follow it, by blocking their source IPs in .htaccess
• spam registration attempts have come down to a minimum
• the web.err.log has grown massively, which indicates that there is a massive effect - many if not most bots seem to be stupid and simply keep returning despite being locked out the hard way.
• 404s have shrunk massively to a degree where I can deal with them in a useful manner

Sustainability​


There seems to be a sustainable effect regarding bot registration attempts and unwanted corporate bots (like SEO bots etc.). The first seem to rely heavily on Russian providers when it comes to at-scale attempts, plus a couple of hosters, networks and IPs in other countries that are quickly identified. The second are "normal businesses" that seem to use a mostly static, dedicated infrastructure that does not change too frequently. It can thus be identified and locked out relatively easily.

Crackers and Scrapers are another issue - they rely to a good part on huge cloud providers, use a huge amount of different IPs and change them frequently. This needs constant adjustment which means: constant manual work (in the sense of daily work).

Collateral damage​


In the special situation of my forum I don't have many issues blocking even huge network ranges if they are geographically not in my target group or are server farms anyway. However, even for my forum there are four issues:

• blocking VPN and TOR exit points is philosophically not my favorite approach. However, there is not really an alternative, so I do it on a per-case basis even though it might affect some normal legitimate users. Those do, however, have alternatives: switching their exit node or not using a VPN.

• blocking huge IP ranges from cloud providers will probably sooner or later cause collateral damage by unintentionally locking out legitimate traffic like search engines or even embeds from another website that is cloud-hosted.

• one of my users reported that he cannot visit the forum any more via his company's VPN, as this uses AWS as its foundation and seems to reside within an IP range that I blocked. I wasn't aware that companies use AWS as VPN endpoints and it sounds a bit weird (and expensive) to me, but that's reality. The more infrastructure is based on cloud providers, the bigger the risk of collateral damage becomes. This gets worse as the file becomes more complex over time; with the blocking of whole networks it is demanding to find the rule that blocks a certain IP, so even fixing errors becomes problematic.

• a couple of times I made a syntax error/typo when editing the .htaccess, which made the forum unreachable until fixed. As I check the webpage after every edit, it was only a couple of seconds each time until it was fixed - still not optimal.
So this is a dangerous game and it gets worse the more IPs I block. At some point the approach via .htaccess won't scale any more and will slow down my page, but that hasn't happened yet and I am probably far away from that point.

Shortcomings and room for improvement​


The main shortcomings are the need for manual work (time effort plus potential for failure), the dark matter of unidentified bad actors and the risk of collateral damage. Clearly, IP assignments are pretty dynamic and can change quickly. A static approach like my current one does not and cannot keep up with that. One thing is to keep the block list up to date, the other is to clean out IPs that are no longer an issue. The latter is realistically not possible in a manual way.

So I could start to automate things by scripting. What I do manually at the moment could be automated relatively easily through simple scripting - nothing that a little shell script using mainly grep, sed and awk couldn't do. Maybe I will do that (as it will save time and avoid errors), but this would still neither solve the issue of cleaning up nor the topic of dark matter. XF as well as Known Bots recognize only an unknown fraction of the bots that visit the forum. Much to my astonishment, even the Yandex bot, which visits my forum regularly and identifies itself properly, is not recognized. Obviously that's even more true for bots that actively try to hide themselves.

So what would be good is an automated, adaptive approach - and in a perfect world one that sets measures before a bad actor visits the forum. Basically we are talking about something like an RBL here.

I remembered "Project Honeypot" from the olden days and looked around a bit for something in that direction. Much to my surprise I learned that Cloudflare as a company seems initially to have been a derivative of Project Honeypot. However - for my taste Cloudflare has become way too dominant and powerful a player in today's internet and I would like to avoid it if possible.

Project Honeypot still exists today and - even better - there is an API and loads of implementations for various platforms to make use of it. Unfortunately not for XF, but possibly this would not be too hard to achieve.

There are also loads of other honeypots around - so the data is there, one just has to make use of it. One pretty promising project that I found on GitHub is the "ultimate bad bot blocker", which exists in various flavors (e.g. for Apache and Nginx):



A stripped-down version can even be used via .htaccess and robots.txt on a shared hosting account like mine, where I do not have full access to the web server configuration.

There are a bunch of approaches of all kinds to the topic of bots that can be found on GitHub; however, their quality is often somewhat unclear.

Clearly, automation would help; even a simple firewall would be way better than having to rely on .htaccess (due to better logging alone). But the good news is:

It seems very doable to create an XF add on for that with relatively little effort as one can build upon what's there already. I don't have the skills, but maybe a dev has fun going down that path.

I'd assume I am not the only one interested in the topic, judging e.g. from this recent request by @Garfield™.
 
Personally, I use my Geoblock Registration addon to block entire countries from being able to register. You could also block them entirely from the Cloudflare level, but that could have implications for your overall traffic levels.


My site ZooChat is very international in its audience, so I'm far less aggressive with my registration blocks than I am on PropertyChat - where the audience is primarily Australian based - there are only a dozen or so countries I allow registrations from on PropertyChat.

I do block entire ASNs (an identifier for the owner of a block of IP addresses) at the Cloudflare level when I detect a likely spambot registration from an identified datacentre (not VPN). But I also keep a record of exactly when I blocked it because occasionally blocking an ASN will also block genuine users or good bots who happen to use a VPN or proxy that utilises that datacentre ASN. For example, I blocked an ASN and then found that my site monitoring tool StatusCake was reporting my site was unreachable - so had to unblock that ASN.

I also use Cloudflare to manage problematic bots - the paid Pro account has much more fine-grained control over bad bots, and is far more effective than anything you might try to implement yourself.

Given my experience identifying and collecting bot data via KnownBots, I don't bother blocking them myself (unless there is a particularly bad bot that is not already being blocked - but again, I'll block it at the Cloudflare level).

Cloudflare does all of the heavy lifting for me and really does make a huge difference to my ability to manage my sites.
 
As you wish to stay clear of CloudFlare, take a look at @DragonByte Tech 's Security addon. It includes Bad Behaviour (which blocks a load of bad actors) and IIRC it also includes Project Honeypot blacklist API.

CleanTalk has a pretty good blacklist as well and can block bad registrations as well as bad users:
 
Personally, I use my Geoblock Registration addon to block entire countries from being able to register.
That's what I do as well (using your add on), but only for a very small set of countries.

I do block entire ASNs (an identifier for the owner of a block of IP addresses) at the Cloudflare level when I detect a likely spambot registration from an identified datacentre (not VPN).
So you are even more brutal than I am. :D
But I also keep a record of exactly when I blocked it because occasionally blocking an ASN will also block genuine users or good bots who happen to use a VPN or proxy that utilises that datacentre ASN. For example, I blocked an ASN and then found that my site monitoring tool StatusCake was reporting my site was unreachable - so had to unblock that ASN.
That is a very good idea. Unfortunately this raises the effort involved. I have already thought about using GitHub or the like for versioning of .htaccess and robots.txt - maybe this would be the simplest way to avoid additional effort.
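A minimal sketch of what that could look like on the shell (the path is a placeholder):

Bash:
# one-time setup: track only .htaccess and robots.txt in a small git repository
cd /path/to/webroot
git init
printf '*\n!.htaccess\n!robots.txt\n!.gitignore\n' > .gitignore
git add .htaccess robots.txt .gitignore
git commit -m 'baseline of block rules'

# after every change: a dated record of what was blocked and when
git add .htaccess robots.txt && git commit -m 'block new scraper ranges'

If the repository lives in the webroot, the .git directory should not be served over HTTP (deny it in .htaccess, or keep the repository outside the webroot and copy the files in).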

Cloudflare does all of the heavy lifting for me and really does make a huge difference to my ability to manage my sites.
Clearly, Cloudflare is a very comfortable solution, works properly as far as I can judge, and automatically does way more than what I (can) do manually, in a much more sophisticated way and without the need for manual work. Granted, and obvious: They do it professionally, for a living, exclusively, with a huge team and endless resources, and have been doing it for years. I do it alone, with no resources, very limited time and a limited skill set for learning. Still, there are some flaws that are possibly less important or unimportant for many, but that lead me to rather avoid Cloudflare if possible.

• it is a black box, inside which magic happens. Over time, people rely on it, trust it and unlearn how things work and fit together - so they become dependent.

• sometimes, random things happen. Like, for example, this:


No matter whether one likes it or not: If you rely on Cloudflare you have to live with it.

• Personally, I find the Cloudflare "xyz has to check the security of your internet connection" page that you sometimes get when visiting a Cloudflare-"protected" website annoying, misleading and hilarious (but that's just me)

But the biggest and ultimate reason is: I don't like monopolies. We've all seen (and constantly see) the negative effects of that: e.g. Google having a de-facto monopoly on internet search, Facebook and now Meta having huge power in "social media", today together with companies like X and partly Reddit - causing all sorts of negative effects. Amazon (and also Microsoft, Google and others) "owning" wide parts of the cloud infrastructure. Etc. etc.

Whenever a single actor becomes too powerful it is only a question of time until he starts to act badly in one way or another and to act mainly for his own good and profit, doing harm to all those who depend on him.

In my eyes Cloudflare has become such a monopoly, so it is probably only a matter of time until they start to act badly. Thus I prefer not to be part of that game if possible.

At the small size of my forum it should be possible to run it without becoming dependent on Cloudflare. Plus, learning and digging around is fun (at least to a degree).
 
I have already thought about using GitHub or the like for versioning of .htaccess and robots.txt - maybe this would be the simplest way to avoid additional effort.

That's generally a very good idea - absolutely no reason you couldn't keep your server config in git for change tracking - can really help with identifying issues with your config.

Just be mindful about storing credentials or other sensitive items in something that could potentially be accessed by a 3rd party.
 
Blocking whole ranges or bigger network chunks is effective but creates the danger of collateral damage, blocking legitimate (or even wanted) traffic. Microsoft is even worse than Amazon here as within their huge network blocks in whois sometimes (legitimate) traffic from Bing indexing seems to come from the same network ranges as these malicious players.

I do block entire ASNs (an identifier for the owner of a block of IP addresses)
To give an example from actual logs: There was an obvious scan for vulnerable files coming from a certain IP, mostly targeting WordPress, but not only.

Bash:
$ grep 52.169.144.138 web.log | head
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.alf.php HTTP/1.1" 301 241 "-" "-" 50 439
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.alf.php HTTP/1.1" 403 199 "-" "-" 350 4139
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/ HTTP/1.1" 301 238 "-" "-" 48 433
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/ HTTP/1.1" 403 199 "-" "-" 77 373
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/mariju.php HTTP/1.1" 301 248 "-" "-" 58 453
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/mariju.php HTTP/1.1" 403 199 "-" "-" 87 373
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/wp-login.php HTTP/1.1" 301 250 "-" "-" 60 457
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.tmb/wp-login.php HTTP/1.1" 403 199 "-" "-" 89 373
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.well-known/ HTTP/1.1" 301 245 "-" "-" 55 447
52.169.144.138 - - [27/Apr/2025:23:06:09 +0200] "GET /.well-known/ HTTP/1.1" 403 199 "-" "-" 84 373

number of requests in this run from this IP in total:

Bash:
grep 52.169.144.138 web.log | wc -l
3681
2,209 of them were targeting WordPress URLs.

These were the last occurrences:

Bash:
52.169.144.138 - - [27/Apr/2025:23:08:33 +0200] "GET /wp2.php HTTP/1.1" 301 240 "-" "-" 50 437
52.169.144.138 - - [27/Apr/2025:23:08:33 +0200] "GET /wp2.php HTTP/1.1" 403 199 "-" "-" 79 373
52.169.144.138 - - [27/Apr/2025:23:08:33 +0200] "GET /wp_class_datalib.php HTTP/1.1" 301 253 "-" "-" 63 463
52.169.144.138 - - [27/Apr/2025:23:08:33 +0200] "GET /wp_class_datalib.php HTTP/1.1" 403 199 "-" "-" 92 373
52.169.144.138 - - [27/Apr/2025:23:08:33 +0200] "GET /wp_wrong_datlib.php HTTP/1.1" 301 252 "-" "-" 62 461

So 3,681 requests within less than 2:30 minutes. Not too friendly. Feeding the IP into whois shows: It belongs to Microsoft. The netblock it resides in is not exactly small:

Bash:
$ whois 52.169.144.138
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object

refer:        whois.arin.net

inetnum:      52.0.0.0 - 52.255.255.255
organisation: Administered by ARIN
status:       LEGACY

whois:        whois.arin.net

changed:      1991-12
source:       IANA

# whois.arin.net

NetRange:       52.145.0.0 - 52.191.255.255
CIDR:           52.148.0.0/14, 52.145.0.0/16, 52.160.0.0/11, 52.152.0.0/13, 52.146.0.0/15
NetName:        MSFT
NetHandle:      NET-52-145-0-0-1
Parent:         NET52 (NET-52-0-0-0-0)
NetType:        Direct Allocation
OriginAS:       
Organization:   Microsoft Corporation (MSFT)
RegDate:        2015-11-24
Updated:        2021-12-14
Ref:            https://rdap.arin.net/registry/ip/52.145.0.0



OrgName:        Microsoft Corporation
OrgId:          MSFT
Address:        One Microsoft Way
City:           Redmond
StateProv:      WA
PostalCode:     98052
Country:        US
RegDate:        1998-07-10
Updated:        2024-03-18

It seems pretty silly to block a net of that size because of one single IP that misbehaved - but you have no indication of useful separators, and you know that there has been and will be permanent firing from this and other Microsoft netblocks from a vast number of different IPs. And as Microsoft does not give any hints at all: Welcome to .htaccess. Stupid, but effective.
 
So 3,681 requests within less than 2:30 minutes. Not too friendly. Feeding the IP into whois shows: It belongs to Microsoft. The netblock it resides in is not exactly small:

That would likely be an Azure server that someone is using to probe other services.

I would absolutely block the entire Datacenter ASN (and I do!) - but if I then discovered that other legitimate services are using that same ASN, it could be changed to a single IP address block. The thing is that blocking individual IPs is generally pointless because malicious actors will eventually get discovered and blocked by the provider - so the actors will simply spin up a new server on a new IP somewhere and be right back at it.

The key thing is that genuine users don't generally connect via a Datacenter - not unless they have spun up their own VPN or proxy service - although occasionally a legit VPN service will also use a hosted Datacenter rather than their own ASN.

Blocking things via .htaccess is perfectly fine - but won't help with a DDoS kind of attack because it is your server that has to deal with all of the traffic - I think you'd be better to implement an external firewall between your server and the internet to help mitigate some of these issues.

That's the reason I ended up on Cloudflare in the first place. One of my servers got DDoSed and my VPS host told me that I either need to front end my site with Cloudflare (or similar) to reduce the load on their infrastructure, or find a new provider.

Here's a chart of Cloudflare events that were prevented from accessing my server due to my ASN blocks in the past 24 hours:

1745874838864.webp

Here's a list of the top ASNs that were blocked by my firewall settings - the vast majority of this blocked traffic did indeed come from that Microsoft Datacenter ASN:

1745874944559.webp

... and here's a list of the top 15 IP addresses blocked from that single ASN:

1745875096288.webp

... I'm not even sure how many IPs were blocked in total - it could be substantially more than 15 - all blocked from a single rule, rather than needing to be blocked individually.

I'm not saying what you are doing it wrong - just that there is another way to approach these things which might be more efficient.
 
I will also say that Cloudflare does also allow you to add "allow" rules to poke holes through my otherwise broad ASN blocks.

I had a situation where staff at an accounting firm were regularly visiting one of my forums (related to property investment, so relevant to their work!), but the staff were all using hosted virtual PCs. They moved their hosting to a new provider which happened to use one of the ASNs I blocked from accessing my site and so none of the staff were able to access my site because all access came from that Datacenter ASN, not from their actual ISP.

Once I was able to establish that they had a permanently assigned IP address that was unlikely to change - it was a simple matter of me adding a new rule to explicitly allow traffic from that IP address while retaining the ASN block for the rest of the IPs owned by that Datacenter.
 
Same ASN over the last 72 hours. I have put challenges on the ASN, so that real users can get through, and poked further holes in it by whitelisting IPs, user agents, etc., while also blocking users that hit specific URLs, do dodgy stuff or are clearly bad actors, as well as AI scrapers. It shows how a multilayered approach can work.

Security-Events--Cloudflare.webp
 
The key thing is that genuine users don't generally connect via a Datacenter - not unless they have spun up their own VPN or proxy service - although occasionally a legit VPN service will also use a hosted Datacenter rather than their own ASN.

Once I was able to establish that they had a permanently assigned IP address that was unlikely to change - it was a simple matter of me adding a new rule to explicitly allow traffic from that IP address while retaining the ASN block for the rest of the IPs owned by that Datacenter.

That's what I started doing as well - it is possible with .htaccess.

Blocking things via .htaccess is perfectly fine - but won't help with a DDoS kind of attack because it is your server that has to deal with all of the traffic - I think you'd be better to implement an external firewall between your server and the internet to help mitigate some of these issues.
True. But here I am limited in terms of options by being on shared hosting. I would need a firewall-as-a-service approach, and realistically that would be Cloudflare. Luckily, I have not suffered from DDoS attacks yet. Whatever load came from bad actors could be handled by the host without issues until now - it was bad scrapers and scanners, but no deliberate DDoS. In case such an attack happens you are correct: Cloudflare would possibly be the best and easiest option.
 
Possibly also worth noting within the topic of this thread: There are companies that are probably technically legal but operate in a very grey area at best and offer bots, APIs and infrastructure for scraping. E.g. I stumbled over this company, probably based in Eastern Europe (judging from the names on the webpage), that offers scraping APIs and teaches you how to fake your user agent (no direct link, to not boost their ranking through backlinks):

hasdata . com/blog/user-agents-for-web-scraping

Digging through my weblogs I also found a company in the outskirts of rural northern Germany, run by a 22-year-old, that offers (or claims to offer) a worldwide network of thousands of hosts you can rent for scraping. Looking a little closer at the history, offerings and legal records of the company, it seemed more than fishy.
 
Finished reading all of this and thanks for the contributions here. I'll just add that we block tons of various ASNs. Some are blocked at our local level and specific ranges are blocked in XenForo so people know that they need to use a different connection to see our stuff.

One point of interest that I've not been able to figure out is this:
Code:
GET /threads/great-streaming-in-2025.233658/.zip,.txt,.pdf,.png,.jpg,.jpeg,.jpe,.gif,.xlsx,.jp2,.mp4,.mov HTTP/1.1",
These pop up in individual web server (NGINX) error logs, as there's a rule in place to capture and stop these types of GET queries ("access forbidden by rule, client: 38.xxx.xxx.xxx"). A bunch of these are going after legit threads and coming from what appear to be legit home ISPs and mobile carriers, from various parts of the world.

Any idea what they are trying to do here with that extension spam?
 
I've no idea. Looks to me like a weird request format anyway - but I am no expert in that. What user-agent is submitted with these requests?
Code:
Mozilla/5.0 (Linux; Android 12; Pixel 3a Build/SP2A.220505.008; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/135.0.7049.111 Mobile Safari/537.36"
The user agent varies, this just happens to be the latest one. I'm just curious what the goal of this sort of query is and what causes it to show up.
The sample above comes from a guest viewer, not a registered account.
 
Thanks for all the responses! Interesting to see that blocking whole networks and ASNs seems to be a valid and even common approach!

The measures proved to be pretty effective (at least short term), bringing down unwanted traffic considerably (or rather: dramatically) within a very short amount of time in all areas (bot registration attempts, cracking attempts, annoying bots and - at least partly - scrapers). The problem is that there is probably a huge amount of dark matter, especially with scrapers (and possibly other unwanted traffic that remains unidentified).
Another couple of days later, and after adding some more networks to the blocking list in .htaccess, the situation has calmed down very much further. There are now three things left to do:

• check the logs (Known Bots and web.log) for new candidates to block, following the existing patterns. There are regularly some fresh ones, but there are way fewer now, so the broad majority seems to be locked out successfully - for the moment at least.
• discover errors/overblocking and fix these. This affects mainly Bing and the Bingbot, but also the VPN user coming via AWS that I mentioned before. Fixed, and monitoring for further errors.

While this is still not automated, I've set up a bunch of routine grep jobs that (hopefully) provide me with reliable information, so it is now pretty fast and easy to do further granular fine-tuning of both the ruleset in .htaccess and the grep jobs (which will eventually evolve into automated scripts over time and maybe, further down the line, be run as cron jobs).
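For illustration, a rough sketch of what such a routine job can look like once the single greps are bundled into a script (log path and patterns are just examples, not my actual jobs):

Bash:
#!/bin/sh
# daily overview of suspicious traffic - meant to be run manually or later via cron
LOG=web.log

echo '== generic-library user agents =='
grep -Ei 'python-requests|python-httpx|go-http-client|okhttp' "$LOG" |
awk '{print $1}' | sort | uniq -c | sort -rn | head

echo '== already blocked but still knocking (403s per IP) =='
awk '$9 == 403 {print $1}' "$LOG" | sort | uniq -c | sort -rn | head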
As these things have calmed down there is now time left to dive into the biggie:

• Uncovering "dark matter".

This is clearly not easy, but there are some quick wins to be made. The information I have is basically the user agent, the source IP, the target URL(s) and the patterns that show up (like frequency, intensity, repetitions, source IP ranges, URL ranges, time when requests show up etc.) - so basically behavior.

A real behavior analysis is way beyond what I am willing (and able) to do, given my time budget, knowledge and toolset. This is clearly where professional or commercial solutions like Cloudflare have an advantage. But quick wins can be achieved pretty easily.

One basic assumption: Basically no normal user would request robots.txt or sitemap.xml. Whenever those URLs are called it will probably be a script or bot. So scanning for those in the web.log, while filtering out the valid bots and the already blocked ones, will potentially uncover scrapers that hid until now because they don't declare themselves as bots via their user agent (or go undetected by Known Bots if they do).
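A sketch of such a scan (the exclusion list of tolerated bots is an example and needs to be extended; adjust the sitemap pattern to your setup, log format assumed as before):

Bash:
# who fetches robots.txt or the sitemap without being a known, wanted bot?
grep -E '"GET /(robots\.txt|sitemap[^ ]*\.xml)' web.log |
grep -Ev 'Googlebot|bingbot|YandexBot' |
awk '{print $1}' | sort | uniq -c | sort -rn | head -20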

This way I discovered a (small) number of scrapers that were using "normal" user agents, verified by 1) checking that the IP belonged to a server range, not a dial-up etc., and 2) checking what they did in terms of calls/requests. Further down the line, by checking for IPs that belonged to the same netblocks and for others that used the same (unique) user agents, I could extend the list a little more.

However: While you get an additional set of patterns to watch out for this way (combining new patterns with existing ones), it is still more than primitive in terms of analysis and can only scratch the surface. Tiny steps, but steps.

In terms of bots there are two that stand out a bit:

HeadlessChrome as a user agent is, as one can assume, obviously not used by humans. Most requests with this user agent came from either Microsoft or Amazon IPs (while some also came from other hosters, mostly in the US). While it seems safe to block the generic hosters and the Amazon ones (again EC2/AWS), Microsoft is a little more complicated: Part of the calls come from Azure (or are not declared in any way), so they can be blocked. Others come from hosts that resolve to names within msnbot.*.search.msn.com, so are basically from Bing (which one would typically not want to block). So once more Microsoft makes life miserable and complicated - in so many areas this seems to have been part of their DNA for the whole time of their existence. :rolleyes:
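A quick way to check whether such a hit really belongs to Bing is a reverse lookup plus forward confirmation (a sketch; the IP is a placeholder, and host works as well if dig is not available):

Bash:
IP=203.0.113.45
NAME=$(dig +short -x "$IP")       # genuine Bing hosts resolve to *.search.msn.com
echo "$IP -> $NAME"
dig +short "$NAME"                # must point back to the same IP, otherwise the name is forged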

The other bot that is pretty strange and annoying is FacebookExternalHit. It is blocked via my robots.txt but still shows up massively in my logs. According to Meta its "main purpose" is to crawl webpages that have been shared on Facebook or via the share button (which I don't have).


It would, as they say, sometimes ignore the robots.txt in case of "security or integrity checks", and what other purposes it may serve apart from the "main purpose" is not declared.

What is strange about this bot is that I see it visiting occasionally, reading the robots.txt, coming from a Facebook IP address, and then doing nothing more (from this IP address or any other that belongs to Facebook).

But I see hundreds of visits coming from dial-up addresses within my usual audience. Typically one or two requests embedded in a series of requests from the same IP that uses a different user agent. At first glance they seem to come mainly or even exclusively from iOS devices. That seems a little weird - as if Facebook installed a bot on the machines of its users or used them as proxy servers. Also, I cannot imagine that content of my forum is shared on Facebook hundreds of times per day. The logs suggest that it may at least partly rather be clicks on a URL of my forum that has been shared on Facebook - but again, the URLs and the amount seem somewhat weird for that as well. Also, if someone clicks on a link, the request may carry Facebook as the referrer, but not a bot user agent, and it would clearly not sit in the middle of a series of requests from the same IP. Does anyone have an explanation of what's going on here? My best hypothesis is that Facebook is potentially somewhat spying on users that use the Facebook app on mobile devices. I am not a Facebook member (never have been and never will be), so I cannot even look up or search on Facebook whether the requested URLs have been shared there, as Facebook has its doors pretty much locked for non-members.

So far for today.

PS: What brightened my day was that within all this filth one VPN provider that was used for bot registration attempts stood out with honesty, given its company name :D:

Bildschirmfoto 2025-05-01 um 12.54.45.webp
 
You may want to use @Xon's sign up and abuse and standard library addons
Why? I don't have an issue with spam or spammers signing up successfully on my forum. This kind of annoyance is handled 100% successfully by the Spaminator series by @Ozzy47 on my forums, as I laid out earlier. So what exactly would Xon's add-on be good for within the topic of this thread?

The initial add on description says:

Signup abuse detection and blocking - Provides a toolkit to reduce signup spam

That is not what this thread is about. It is about blocking visits to the forum, not signups, so about measures that take effect earlier, target something completely different and reduce traffic / keep the logs clean.

I did (and still do) consider it for the detection of multi-accounts for quite a while already; however, this is not my primary concern at the moment, the add-on has known peculiarities (e.g. in regard to GDPR as well as on shared hosting) as laid out in the add-on FAQ, and multi-accounts are a completely different topic than the one of this thread anyway.
 