Bot Management using robots.txt in XFcloud

CTS

Up front: I am using XFcloud for my instance, so I do not have server access or .htaccess.

I know my ability to manage bots is limited, so my question revolves around the editing of robots.txt from within ACP.

I wish to "ask" Bytespider to stop indexing my site.

I would like to use this code:

Code:
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Now, if I were to append this to the bottom of the existing robots.txt, would it conflict with the default robots.txt that the XFcloud instance ships with?

Does anybody have experience with safe ways to add to or modify robots.txt on XFcloud?
 
I can't speak for using XFcloud and modifying robots.txt, but Bytespider ignores that file.
Can you add to the .htaccess file in your XF root directory?
Code:
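# Flag any request whose User-Agent matches one of these names (case-insensitive)
BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
# Apache 2.2-style access control; Apache 2.4 needs mod_access_compat for these two lines
Order Deny,Allow
Deny from env=bad_bot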
 
In the Page_Container template, add this and modify it as needed.

Code:
<meta name="robots" content="noindex, nofollow, noarchive, noodp, nosnippet, notranslate, noimageindex">
<meta name="googlebot" content="noindex, nofollow">
<meta name="googlebot-news" content="nosnippet">
<meta name="googlebot-video" content="noindex">
<meta name="googlebot-image" content="noindex">
<meta name="bingbot" content="noindex, nofollow">
<meta name="bingpreview" content="noindex, nofollow">
<meta name="msnbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
<meta name="teoma" content="noindex, nofollow">
<meta name="Yandex" content="noindex, nofollow">
<meta name="baidu" content="noindex, nofollow">
<meta name="Yeti" content="noindex, nofollow">
<meta name="ia_archiver" content="noindex, nofollow">
<meta name="facebook" content="noindex, nofollow">
<meta name="twitter" content="noindex, nofollow">
<meta name="rogerbot" content="noindex, nofollow">
<meta name="LinkedInBot" content="noindex, nofollow">
<meta name="embedly" content="noindex, nofollow">
<meta name="slackbot" content="noindex, nofollow">
<meta name="W3C_Validator" content="noindex, nofollow">
<meta name="redditbot" content="noindex, nofollow">
<meta name="discordbot" content="noindex, nofollow">
<meta name="applebot" content="noindex, nofollow">
<meta name="pinterest" content="noindex, nofollow">
<meta name="smtbot" content="noindex, nofollow">
<meta name="googlewebmaster" content="noindex, nofollow">
<meta name="twitterbot" content="noindex, nofollow">
<meta name="tumblr" content="noindex, nofollow">
<meta name="flipboard" content="noindex, nofollow">
<meta name="qualaroo" content="noindex, nofollow">
<meta name="opensearch" content="noindex, nofollow">
<meta name="sogou" content="noindex, nofollow">
<meta name="exabot" content="noindex, nofollow">
<meta name="duckduckbot" content="noindex, nofollow">
<meta name="taptu" content="noindex, nofollow">
<meta name="outbrain" content="noindex, nofollow">
<meta name="Bytespider" content="noindex, nofollow">
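If the list needs regular editing, one option is to generate the tags from a plain list of names instead of maintaining them by hand. This is only a minimal sketch: `BOT_NAMES` and `robots_meta_tags` are hypothetical names, and the output would still have to be pasted into the template manually. It deliberately leaves out the generic `robots` tag, since that one applies to every crawler that honors meta tags, including the search engines you probably want to keep.

```python
from html import escape

# Hypothetical, editable list of crawler tokens to target.
BOT_NAMES = ["Bytespider", "baidu", "sogou", "slackbot"]


def robots_meta_tags(names, directives="noindex, nofollow"):
    """Return one <meta> tag per unique name, keeping first-seen order."""
    seen = set()
    tags = []
    for name in names:
        key = name.lower()
        if key in seen:
            continue  # skip names that were already emitted
        seen.add(key)
        tags.append(
            f'<meta name="{escape(name, quote=True)}" '
            f'content="{escape(directives, quote=True)}">'
        )
    return tags


if __name__ == "__main__":
    print("\n".join(robots_meta_tags(BOT_NAMES)))
```

Like robots.txt, these tags are only a request, so a crawler that ignores one is likely to ignore the other.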
 
Just to clarify this, you are not limited in editing your robots.txt at all. It is just done through an option in the admin CP, both for convenience and to work around the lack of direct access to the server. Anything you want to put in robots.txt is fine and will work exactly the same way as editing the file directly.

Unfortunately that is one of the very very few drawbacks in the cloud. No htaccess access.
To be fair, if we used Apache, we'd probably have a UI to enable you to edit the .htaccess. But the bigger problem is we don't use Apache, we use Nginx, so the presence of a .htaccess file does nothing as that is basically exclusively for Apache.

It is concerning to me that Bytedance/spider are ignoring robots.txt. We may look at a more robust solution for this that we can implement centrally for all customers.
 
Bytedance / Bytespider doesn't even always use its own user agent; it also uses generic ones (like Chrome, etc.).
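To illustrate why that matters: a user-agent rule, such as the BrowserMatchNoCase patterns posted earlier in this thread, only ever sees the header string the client chooses to send. The sketch below mirrors that case-insensitive matching in Python. The two header values are illustrative assumptions, not lines taken from real logs, and `is_flagged` is a hypothetical helper name.

```python
import re

# Patterns copied from the .htaccess snippet in this thread.
BAD_BOT_PATTERNS = ["Bytedance", "Bytespider", "Baiduspider"]
BAD_BOT_RE = re.compile(
    "|".join(re.escape(p) for p in BAD_BOT_PATTERNS), re.IGNORECASE
)


def is_flagged(user_agent: str) -> bool:
    """True if the User-Agent header contains one of the patterns."""
    return BAD_BOT_RE.search(user_agent) is not None


# Illustrative header values (assumptions for this sketch).
declared = "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
generic = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(is_flagged(declared))  # True
print(is_flagged(generic))   # False
```

A request sent with the generic string passes any name-based filter, whether that filter lives in .htaccess, Nginx, or robots.txt.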
 
So, in the interim, will this be suitable if it is added to the default (cloud) robots.txt?

Code:
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Until better solutions are implemented, I wish to make sure I do not hinder any of the other desired bots either.

tnx
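One way to check a change like this before saving it in the ACP is to run the combined text through a parser locally. Below is a minimal sketch using Python's standard `urllib.robotparser`. The `EXISTING` rules are a shortened, hypothetical stand-in for whatever the file already contains, and `Googlebot` is just an example of a crawler to keep. The result only shows how this one parser treats a second `User-agent: *` group (it keeps the first one it sees); other crawlers may combine the groups instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stand-in for the rules already present in the file.
EXISTING = """\
User-agent: *
Disallow: /account/
Disallow: /admin.php
Allow: /
"""

# The snippet from the question, appended at the bottom.
APPENDED = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse((EXISTING + "\n" + APPENDED).splitlines())

print(parser.can_fetch("Bytespider", "/"))         # False
print(parser.can_fetch("Googlebot", "/threads/"))  # True
print(parser.can_fetch("Googlebot", "/account/"))  # False
```

Adding only the Bytespider group, without a second `User-agent: *` group, avoids depending on how a given crawler resolves the duplicate.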
 
This is the default robots.txt:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot 
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

It's sufficient for most cases. If you want to add Bytespider it changes to:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot 
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}
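For anyone who wants to sanity-check the modified file before pasting it into the admin CP, here is a minimal sketch using Python's standard `urllib.robotparser`. The `Sitemap` line is left out because it does not affect access rules, and `Googlebot` plus the example thread path are assumptions chosen for illustration. It shows what a compliant parser concludes from the rules, not what Bytespider will actually do.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Bytespider is in the fully blocked group.
print(parser.can_fetch("Bytespider", "/threads/example.1/"))  # False
# Other crawlers fall through to the "*" group.
print(parser.can_fetch("Googlebot", "/threads/example.1/"))   # True
print(parser.can_fetch("Googlebot", "/account/"))             # False
```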
 
@Chris D

I have been watching over time since adding your suggestion to robots.txt, and it "appears" Bytespider may be complying. Their traffic has slowed down to a crawl (pun intended), so fingers crossed.
 
I added Bytespider and Bytedance to the robots.txt file yesterday after Bytespider started showing up on our forum. They ignore the file. At this moment we have 50 robots and 26 of them are Bytespider, so they have multiplied. Banning IP addresses also doesn’t work. We too are on Cloud hosting.

in Page_Container template, modify this as needed.

Thank you. Will try this.
 
IMO, it is best to block by .htaccess.
Edit: Just noticed you're on XF cloud and do not have access to the file. :(
Code:
BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
Order Deny,Allow
Deny from env=bad_bot

 

Try it this way:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}
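Whether Bytespider gets its own group or shares one with the other blocked bots should not matter to a parser that follows the usual rules. A minimal sketch to compare the two layouts with Python's standard `urllib.robotparser` follows. The rule sets are shortened, `decisions` is a hypothetical helper name, and `Googlebot` is only an example. This says nothing about how Bytespider itself reads the file.

```python
from urllib.robotparser import RobotFileParser

BLOCKED_TOGETHER = """\
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Allow: /
"""

BLOCKED_SEPARATELY = """\
User-agent: MJ12bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Allow: /
"""

AGENTS = ["Bytespider", "MJ12bot", "Googlebot"]
PATHS = ["/", "/account/", "/threads/example.1/"]


def decisions(robots_txt, agents, paths):
    """Map every (agent, path) pair to the parser's allow/deny decision."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


together = decisions(BLOCKED_TOGETHER, AGENTS, PATHS)
separately = decisions(BLOCKED_SEPARATELY, AGENTS, PATHS)
print(together == separately)  # True
```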
 
Now we have about 150 Facebook External Hit bots, all viewing unknown pages and showing the warning triangle. I think that means they are trying to view a page they aren't allowed access to?

Why does Facebook crawl our site? It's not a search engine. Is this also data collection for AI training?
 
It took about 2 to 3 weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They have not visited since (so far), and that covers the whole period since my last post in this thread.
 