Bot Management using robots.txt in XFcloud

CTS · Apr 10, 2024

Up front: I am using XFcloud for my instance, so I do not have server access or .htaccess.

I know my ability to manage bots is limited, so my question revolves around the editing of robots.txt from within ACP.

I wish to "ask" Bytespider to stop crawling my site.

I would like to use this code:

Code:

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Now, if I were to append this to the bottom of the existing robots.txt, would it conflict with the default robots.txt on an XFcloud instance?

Does anybody have experience with safe ways to add to or modify robots.txt on XFcloud?

philmckrackon · Apr 11, 2024

I can't speak to modifying robots.txt on XFcloud, but Bytespider ignores that file.
Can you add to the .htaccess file in your XF root directory?

Code:

BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
Order Deny,Allow
Deny from env=bad_bot
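
For readers unfamiliar with these directives: `BrowserMatchNoCase` sets the `bad_bot` variable when the User-Agent header matches the given pattern, ignoring case, and `Deny from env=bad_bot` refuses those requests. The matching step can be roughly sketched in Python as below. `BAD_BOT_PATTERNS` and `is_bad_bot` are names made up for this illustration; they are not part of Apache or XenForo.

```python
import re

# Patterns copied from the .htaccess rules above. Apache treats them as
# case-insensitive regular expressions matched against the User-Agent header.
BAD_BOT_PATTERNS = ("Bytedance", "Bytespider", "Baiduspider")

# One combined, case-insensitive regex (hypothetical helper, for illustration).
_BAD_BOT_RE = re.compile("|".join(BAD_BOT_PATTERNS), re.IGNORECASE)


def is_bad_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string matches any blocked pattern."""
    return _BAD_BOT_RE.search(user_agent) is not None


if __name__ == "__main__":
    print(is_bad_bot("Mozilla/5.0 (compatible; Bytespider)"))
    print(is_bad_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

A filter like this only catches crawlers that identify themselves; a request sent with a generic browser User-Agent would not match.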

Post in thread 'Known Bots'

Apr 2, 2024

I got pounded by Bytespider and Baiduspider yesterday with over 3,000 of them at once. This is what I put in my .htaccess file. There may be better ways of blocking them but it worked.

Code:

BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
Order Deny,Allow
Deny from env=bad_bot

CTS · Apr 11, 2024

Unfortunately, that is one of the very, very few drawbacks of the cloud: no .htaccess access.

CTS · Apr 11, 2024

philmckrackon said:
but Bytespider ignores that file.

Didn't know that but not unexpected.

I just don't need Bytedance (parent co) spiders on my site.

philmckrackon · Apr 11, 2024

CTS said:
Unfortunately, that is one of the very, very few drawbacks of the cloud: no .htaccess access.

What about contacting XFcloud and having them edit the .htaccess file?

CTS · Apr 11, 2024

philmckrackon said:
What about contacting XFcloud and having them edit the .htaccess file?

Of course I started by contacting them, but understandably, that kind of access is not permitted on this type of installation.

avalanch · Apr 12, 2024

In the PAGE_CONTAINER template, add the lines for the bots you want to exclude and modify as needed.

Code:

<meta name="robots" content="noindex, nofollow, noarchive, noodp, nosnippet, notranslate, noimageindex">
<meta name="googlebot" content="noindex, nofollow">
<meta name="googlebot-news" content="nosnippet">
<meta name="googlebot-video" content="noindex">
<meta name="googlebot-image" content="noindex">
<meta name="bingbot" content="noindex, nofollow">
<meta name="bingpreview" content="noindex, nofollow">
<meta name="msnbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
<meta name="teoma" content="noindex, nofollow">
<meta name="Yandex" content="noindex, nofollow">
<meta name="baidu" content="noindex, nofollow">
<meta name="Yeti" content="noindex, nofollow">
<meta name="ia_archiver" content="noindex, nofollow">
<meta name="facebook" content="noindex, nofollow">
<meta name="twitter" content="noindex, nofollow">
<meta name="rogerbot" content="noindex, nofollow">
<meta name="LinkedInBot" content="noindex, nofollow">
<meta name="embedly" content="noindex, nofollow">
<meta name="slackbot" content="noindex, nofollow">
<meta name="W3C_Validator" content="noindex, nofollow">
<meta name="redditbot" content="noindex, nofollow">
<meta name="discordbot" content="noindex, nofollow">
<meta name="applebot" content="noindex, nofollow">
<meta name="pinterest" content="noindex, nofollow">
<meta name="smtbot" content="noindex, nofollow">
<meta name="googlewebmaster" content="noindex, nofollow">
<meta name="twitterbot" content="noindex, nofollow">
<meta name="tumblr" content="noindex, nofollow">
<meta name="flipboard" content="noindex, nofollow">
<meta name="qualaroo" content="noindex, nofollow">
<meta name="opensearch" content="noindex, nofollow">
<meta name="sogou" content="noindex, nofollow">
<meta name="exabot" content="noindex, nofollow">
<meta name="duckduckbot" content="noindex, nofollow">
<meta name="taptu" content="noindex, nofollow">
<meta name="outbrain" content="noindex, nofollow">
<meta name="Bytespider" content="noindex, nofollow">
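
Two things are worth keeping in mind with the list above: a tag named `robots` is addressed to every crawler, and the `googlebot` and `bingbot` lines ask those search engines to drop the page from their index. A forum that only wants to turn away specific crawlers probably wants a much shorter list. As a rough sketch, the tags could be generated from a short list of names like this. `robots_meta_tags` is a hypothetical helper, not a XenForo function, and whether a given crawler honors a tag addressed to it is up to that crawler.

```python
from html import escape


def robots_meta_tags(bots, directives=("noindex", "nofollow")):
    """Build one <meta> tag per crawler name in `bots`.

    `bots` holds crawler names as they should appear in the name attribute;
    `directives` are joined into the content attribute.
    """
    content = escape(", ".join(directives), quote=True)
    return [
        f'<meta name="{escape(bot, quote=True)}" content="{content}">'
        for bot in bots
    ]


if __name__ == "__main__":
    # Only the crawlers this thread is about, leaving search engines alone.
    for tag in robots_meta_tags(["Bytespider", "baidu"]):
        print(tag)
```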

Chris D · Apr 12, 2024

CTS said:
Up front: I am using XFcloud for my instance, so I do not have server access or .htaccess.

I know my ability to manage bots is limited, so my question revolves around the editing of robots.txt from within ACP.

I wish to "ask" Bytespider to stop crawling my site.

I would like to use this code:

Code:

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Now, if I were to append this to the bottom of the existing robots.txt, would it conflict with the default robots.txt on an XFcloud instance?

Does anybody have experience with safe ways to add to or modify robots.txt on XFcloud?

Just to clarify this, you are not limited in editing your robots.txt at all. It is just done through an option in the admin CP, both for convenience and to work around having no direct access to the server. Anything you want to put in robots.txt is fine and will work exactly the same way as editing the file directly.

CTS said:
Unfortunately, that is one of the very, very few drawbacks of the cloud: no .htaccess access.

To be fair, if we used Apache, we'd probably have a UI to enable you to edit the .htaccess. But the bigger problem is we don't use Apache, we use Nginx, so the presence of a .htaccess file does nothing as that is basically exclusively for Apache.

It is concerning to me that Bytedance/spider are ignoring robots.txt. We may look at a more robust solution for this that we can implement centrally for all customers.

Kirby · Apr 12, 2024

Chris D said:
It is concerning to me that Bytedance/spider are ignoring robots.txt. We may look at a more robust solution for this that we can implement centrally for all customers.

Bytedance / Bytespider doesn't even always use its own user agent; it also uses generic ones (like Chrome, etc.).

CTS · Apr 12, 2024

So, in the interim, will this be suitable if it is added to the default (cloud) robots.txt?

Code:

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Until better solutions are implemented, I wish to make sure I do not hinder any of the other desired bots either.

tnx

Chris D · Apr 12, 2024

This is the default robots.txt:

Code:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

It's sufficient for most cases. If you want to add Bytespider it changes to:

Code:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}
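
One way to sanity-check a file like this before saving it in the admin CP is to run it through Python's standard `urllib.robotparser`. The sketch below shows how a parser that follows the rules reads the file; it says nothing about whether a given crawler actually obeys it. The `allowed` helper and the example paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The modified default file from the post above, with the
# {sitemap_url} placeholder left as-is.
ROBOTS_TXT = """\
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the rules above permit `user_agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    for agent, path in [
        ("Bytespider", "/threads/example.1/"),
        ("Googlebot", "/threads/example.1/"),
        ("Googlebot", "/account/"),
    ]:
        print(agent, path, allowed(agent, path))
```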

CTS · Apr 12, 2024

@Chris D Thank you.

CTS · Apr 17, 2024

@Chris D

I have been watching over time since adding your suggestion to robots.txt, and it "appears" Bytespider may be complying. Their traffic has slowed down to a crawl (pun intended), so fingers crossed.

FoP · Thursday at 6:26 PM

Chris D said:

This is the default robots.txt:

Code:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

It's sufficient for most cases. If you want to add Bytespider it changes to:

Code:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

I added Bytespider and Bytedance to the robots.txt file yesterday after Bytespider started showing up on our forum. They ignore the file. At this moment we have 50 robots, 26 of which are Bytespider, so they have multiplied. Banning IP addresses also doesn’t work. We too are on Cloud hosting.

avalanch said:

In the PAGE_CONTAINER template, add the lines for the bots you want to exclude and modify as needed.

Code:

<meta name="robots" content="noindex, nofollow, noarchive, noodp, nosnippet, notranslate, noimageindex">
<meta name="googlebot" content="noindex, nofollow">
<meta name="googlebot-news" content="nosnippet">
<meta name="googlebot-video" content="noindex">
<meta name="googlebot-image" content="noindex">
<meta name="bingbot" content="noindex, nofollow">
<meta name="bingpreview" content="noindex, nofollow">
<meta name="msnbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
<meta name="teoma" content="noindex, nofollow">
<meta name="Yandex" content="noindex, nofollow">
<meta name="baidu" content="noindex, nofollow">
<meta name="Yeti" content="noindex, nofollow">
<meta name="ia_archiver" content="noindex, nofollow">
<meta name="facebook" content="noindex, nofollow">
<meta name="twitter" content="noindex, nofollow">
<meta name="rogerbot" content="noindex, nofollow">
<meta name="LinkedInBot" content="noindex, nofollow">
<meta name="embedly" content="noindex, nofollow">
<meta name="slackbot" content="noindex, nofollow">
<meta name="W3C_Validator" content="noindex, nofollow">
<meta name="redditbot" content="noindex, nofollow">
<meta name="discordbot" content="noindex, nofollow">
<meta name="applebot" content="noindex, nofollow">
<meta name="pinterest" content="noindex, nofollow">
<meta name="smtbot" content="noindex, nofollow">
<meta name="googlewebmaster" content="noindex, nofollow">
<meta name="twitterbot" content="noindex, nofollow">
<meta name="tumblr" content="noindex, nofollow">
<meta name="flipboard" content="noindex, nofollow">
<meta name="qualaroo" content="noindex, nofollow">
<meta name="opensearch" content="noindex, nofollow">
<meta name="sogou" content="noindex, nofollow">
<meta name="exabot" content="noindex, nofollow">
<meta name="duckduckbot" content="noindex, nofollow">
<meta name="taptu" content="noindex, nofollow">
<meta name="outbrain" content="noindex, nofollow">
<meta name="Bytespider" content="noindex, nofollow">

Thank you. Will try this.

philmckrackon · Thursday at 6:46 PM

FoP said:
I added Bytespider and Bytedance to the robots.txt file yesterday after Bytespider started showing up on our forum. They ignore the file. At this moment we have 50 robots, 26 of which are Bytespider, so they have multiplied. Banning IP addresses also doesn’t work. We too are on Cloud hosting.

Thank you. Will try this.

IMO, it is best to block via .htaccess.
Edit: Just noticed you're on XF cloud and do not have access to the file.

Code:

BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
Order Deny,Allow
Deny from env=bad_bot

Post in thread 'Known Bots'

Apr 2, 2024

I got pounded by Bytespider and Baiduspider yesterday with over 3,000 of them at once. This is what I put in my .htaccess file. There may be better ways of blocking them but it worked.

Code:

BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
Order Deny,Allow
Deny from env=bad_bot

Mike S · Thursday at 7:15 PM

FoP said:
I added Bytespider and Bytedance to the robots.txt file yesterday after Bytespider started showing up on our forum. They ignore the file. At this moment we have 50 robots, 26 of which are Bytespider, so they have multiplied. Banning IP addresses also doesn’t work. We too are on Cloud hosting.

Try it this way:

Code:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}
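
To a parser that follows the rules, giving Bytespider its own group should behave the same as adding it to the shared group. The sketch below compares the two layouts with Python's standard `urllib.robotparser`, using shortened versions of the files above (fewer bots and paths, purely for brevity). The `decisions` helper is made up for illustration, and this cannot show whether Bytespider itself treats the two layouts differently.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the layout with Bytespider in the shared group.
GROUPED = """\
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Allow: /
"""

# Abbreviated version of the layout with Bytespider in its own group.
SEPARATE = """\
User-agent: MJ12bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Allow: /
"""


def decisions(robots_txt, agents, paths):
    """Map each (agent, path) pair to the parser's allow/deny decision."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


if __name__ == "__main__":
    agents = ["Bytespider", "MJ12bot", "Googlebot"]
    paths = ["/", "/account/", "/threads/example.1/"]  # hypothetical paths
    same = decisions(GROUPED, agents, paths) == decisions(SEPARATE, agents, paths)
    print("layouts equivalent for these checks:", same)
```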

FoP · Thursday at 7:25 PM

philmckrackon said:
Edit: Just noticed you're on XF cloud and do not have access to the file.

Nope.

Mike S said:
Try it this way:

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: Bytespider
Disallow: /

Okay, will modify the file accordingly. Thank you.

FoP · Thursday at 8:07 PM

Now we have about 150 Facebook External Hit bots, all viewing unknown pages and showing the warning triangle. I think that means they are trying to view a page they aren't allowed access to?

Why does Facebook crawl our site? It's not a search engine. Is it also collecting training data for AI?

CTS · Jun 7, 2024

It took about 2 to 3 weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They do not visit anymore (so far), and that has held since my last post in this thread.

FoP · Jun 7, 2024

CTS said:
It took about 2 to 3 weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They do not visit anymore (so far), and that has held since my last post in this thread.

That’s encouraging.
