Search Engine crawlers

bahua
Administrator
Posts: 10940
Joined: Thu Jan 23, 2003 7:39 pm
Location: Out of Town

Search Engine crawlers

Post by bahua »

Our new VPS host bills its customers based on usage, in four categories: storage, memory, CPU, and network. Our old host charged a flat $28 a month for the tier of VPS we rented, and was a very capable host in terms of site performance. I would argue that performance is much better now (though you may disagree from where you sit; if you do, I definitely want to hear from you via PM or replies to this post). I have been watching some tools on the server, and I've seen that the database is basically always running hot. I am starting to see why:

Code:

$ wc -l access_kcrag.log
3579405 access_kcrag.log
That's nearly 3.6 million hits since Monday night. When I watch the access log live, I see lots and lots of search bots, so I decided to see how many we're dealing with:

Code:

$ for i in bing google semrush baidu duckduckgo amazon; do echo -n "$i: "; grep -iw $i access_kcrag.log | wc -l; done
bing: 36077
google: 4306
semrush: 4417
baidu: 17
duckduckgo: 121
amazon: 9
36k hits for bing, and under 10k for all other identified engines put together, since Monday night. There's another scraper called ClaudeBot that alone accounts for almost half a million hits. Over 1.4 million hits come from requests for the site without https.
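
A fixed list of names only counts the bots you already know to look for, which is how something like ClaudeBot can go unnoticed. As a rough sketch, the log could instead be ranked by whatever user agents actually appear in it. This assumes the access log is in Apache's "combined" format, where the user agent is the last double-quoted field on each line; `top_agents` is a made-up helper name, not something from the server.

```shell
# Sketch: rank user agents by number of hits.
# Assumption: Apache "combined" log format, so splitting each line on the
# double-quote character puts the user agent in field 6.
# Lines whose request or referer contain escaped quotes will be miscounted.
top_agents() {
    # $1 = path to the access log, $2 = how many rows to show (default 10)
    awk -F'"' '{ print $6 }' "$1" \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${2:-10}"
}
```

Something like `top_agents access_kcrag.log 20` would then list the 20 busiest user agents, each prefixed with its hit count.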

868k requests are searches from guests. I just disabled guest search while composing this post. That should hopefully provide at least some relief to the database, but we'll see.

This is going to take some deeper investigation and some planning to address, but it appears the forum is a major target for web crawlers, which account for the vast majority of the site's traffic. I originally was writing this post to get your input on whether we want to actively block these bots, but the situation appears to be much more involved than that.

Any input anyone has would be appreciated.

Re: Search Engine crawlers

Post by bahua »

On some further investigation, I've found that remote IP addresses starting with 47.75 account for 2.5M of these hits. The addresses belong to Alibaba, and originate in Hong Kong. I think I will block them.
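
For anyone wanting to find a range like this in their own logs, here is a rough sketch of grouping hits by the first two octets of the client address. It assumes the client IP is the first whitespace-separated field on each line (as in Apache's common and combined formats) and only looks at IPv4 addresses; `top_prefixes` is a hypothetical helper name.

```shell
# Sketch: count hits per two-octet IPv4 prefix (for example "47.75").
# Assumption: the client address is the first field of every log line.
# Non-IPv4 entries (IPv6 addresses, hostnames) are skipped.
top_prefixes() {
    # $1 = path to the access log, $2 = how many rows to show (default 10)
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
             split($1, octet, ".")
             print octet[1] "." octet[2]
         }' "$1" \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${2:-10}"
}
```

Running `top_prefixes access_kcrag.log` would print the ten busiest prefixes with their hit counts, which makes a single dominant range easy to spot.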

Re: Search Engine crawlers

Post by bahua »

This is fascinating. The requests from these Alibaba IPs are using the search function to list all of the posts of a given user, and then dive into those posts. I don't think these Alibaba crawlers are populating a search engine. Based on the depth and frequency of their hits, I think they're feeding data to an AI. I dropped a deny directive for this IP block into kcrag's vhost config, and restarted Apache.

I'm tailing the access log now, and the requests from these IPs are still pouring in, at about 1000 requests per minute, but now they're all getting 403 "Forbidden" errors. I'm watching htop in another screen, and the database has dramatically calmed down. I will continue to monitor this, but I think with the rise of AIs, this kind of event will only get more common.
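
One way to confirm a block like this without watching the log scroll by is to tally the response codes returned to the blocked range. The sketch below assumes Apache's common or combined log format with a normal three-part request line, so that the status code is the ninth whitespace-separated field; `status_for_prefix` is a hypothetical helper name.

```shell
# Sketch: tally HTTP status codes for requests from one address prefix.
# Assumptions: client address in field 1 and status code in field 9, which
# holds for well-formed lines in Apache common/combined format. Malformed
# request lines can shift the fields and will be miscounted.
status_for_prefix() {
    # $1 = dotted prefix such as 47.75, $2 = path to the access log
    # A trailing dot is appended so that 47.75 does not also match 47.750.x.x.
    awk -v prefix="$1." 'index($1, prefix) == 1 { print $9 }' "$2" \
        | sort \
        | uniq -c \
        | sort -rn
}
```

After the block is in place, `status_for_prefix 47.75 access_kcrag.log` should show 403 overtaking 200 for that range, though lines logged before the change will still be counted unless the log is trimmed or rotated first.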

Please let me know if you see anything like this or run into similar trouble.
grovester
Oak Tower
Posts: 4577
Joined: Thu Mar 13, 2008 7:30 pm
Location: KC Metro

Re: Search Engine crawlers

Post by grovester »

So there are AI versions of all of us based on our posts? That's terrifying!

Re: Search Engine crawlers

Post by bahua »

grovester wrote: Thu May 02, 2024 6:56 pm So there are AI versions of all of us based on our posts? That's terrifying!
Yeah, but don't worry. They won't emerge until we die.

Re: Search Engine crawlers

Post by bahua »

Look at that! CPU and network use dropped off sharply!

Image
KCKev
Valencia Place
Posts: 1574
Joined: Mon Apr 10, 2006 7:23 pm
Location: Tucson Arizona

Re: Search Engine crawlers

Post by KCKev »

Man, you got a lot going on. I've noticed that when I'm about to log off I check who's online (out of habit), and it's usually 4 registered users online:
bing [bot], google [bot], KCKev, semrush [bot].

Bots are registered? Not guests?
If you're not on the EDGE, you're taking up TOO MUCH ROOM!

Re: Search Engine crawlers

Post by bahua »

No, they're not registered. They're detected by the board software.
brooksidebadgers
Parking Garage
Posts: 12
Joined: Wed May 17, 2023 4:23 pm
Location: Brookside USA

Re: Search Engine crawlers

Post by brooksidebadgers »

Fascinating, and thanks for sharing. I eat this kind of stuff up.
bspecht
Western Auto Lofts
Posts: 533
Joined: Tue Jun 16, 2015 4:31 pm
Location: DC

Re: Search Engine crawlers

Post by bspecht »

Cloudflare offers solid bot protection with their free plan, likely adequate for a site like this. https://developers.cloudflare.com/bots/ ... rted/free/
im2kull
Bryant Building
Posts: 3962
Joined: Tue May 24, 2005 4:33 pm
Location: KCMO

Re: Search Engine crawlers

Post by im2kull »

bahua wrote: Thu May 02, 2024 6:03 pm This is fascinating. The requests from these Alibaba IPs are using the search function to list all of the posts of a given user, and then dive into those posts. I don't think these Alibaba crawlers are populating a search engine. Based on the depth and frequency of their hits, I think they're feeding data to an AI. I dropped a deny directive for this IP block into kcrag's vhost config, and restarted Apache.

I'm tailing the access log now, and the requests from these IPs are still pouring in, at about 1000 requests per minute, but now they're all getting 403 "Forbidden" errors. I'm watching htop in another screen, and the database has dramatically calmed down. I will continue to monitor this, but I think with the rise of AIs, this kind of event will only get more common.

Please let me know if you see anything like this or run into similar trouble.
This is super interesting. Good interpretation of what you found.

Re: Search Engine crawlers

Post by bspecht »

AI, as it currently exists, wouldn't be constantly crawling a site. An entity would do a one-off crawl of the information (though likely repeated at some interval) and parse it into a dataset, which an LLM would then be trained on (the training itself doesn't involve any operations that would ping a website). That said, phpBB has a rather basic data structure to grab and process, and it's in demand given its uses as historical information.

Re: Search Engine crawlers

Post by bahua »

bspecht wrote: Fri May 03, 2024 4:16 pm Cloudflare offers solid bot protection with their free plan, likely adequate for a site like this. https://developers.cloudflare.com/bots/ ... rted/free/
Suspicious that the Bandwidth Alliance, with whom Cloudflare coordinates to identify "good" bots and "bad" bots, is cosponsored by Alibaba, who is responsible for 2.5 million of the bot hits we've gotten this week. I think I'll stick with my own methods for now.
Anthony_Hugo98
Valencia Place
Posts: 1993
Joined: Fri Mar 22, 2019 10:50 pm
Location: Overland Park, KS

Re: Search Engine crawlers

Post by Anthony_Hugo98 »

bahua wrote: Fri May 03, 2024 6:11 pm
bspecht wrote: Fri May 03, 2024 4:16 pm Cloudflare offers solid bot protection with their free plan, likely adequate for a site like this. https://developers.cloudflare.com/bots/ ... rted/free/
Suspicious that the Bandwidth Alliance, with whom Cloudflare coordinates to identify "good" bots and "bad" bots, is cosponsored by Alibaba, who is responsible for 2.5 million of the bot hits we've gotten this week. I think I'll stick with my own methods for now.
A create-the-problem-and-sell-the-“cure” setup, I’d bet.

Re: Search Engine crawlers

Post by bspecht »

Comical. Cloudflare created and has dominated its sector because it works incredibly well at reasonable cost.

I'll refrain from further suggestions. Conspiracy theories are more fun!

Re: Search Engine crawlers

Post by bahua »

I use Cloudflare extensively. I'm aware of its value, but I'm also aware it isn't infallible. I'm not using their services in this case, as the host we're using has geographic edge POPs already, and the performance boost from this is obvious, at least from where I'm sitting.

But companies often have to make sacrifices to get access to the Chinese market, and the association I've observed between Cloudflare and Alibaba might be an example. I don't know if it is, and I really have no way of knowing if it is, but I do know that bots from one of Cloudflare's business partners hit us hard, and continue to do so. Blocking their /16 (everything starting with 47.75) with an authz module directive is a simple way to reduce the load, and it has proven extremely effective.

And on an administrative note, I think we could all do without your snark. Please keep it civil here.
mean
Administrator
Posts: 11240
Joined: Wed Feb 05, 2003 9:00 am
Location: Historic Northeast

Re: Search Engine crawlers

Post by mean »

Looking for Chinese dissidents or some shit, if I had to guess. If they're scraping for an LLM, it's to let AI figure out who is talking bad about the CCP or something, not build a new model.

Obviously that's wild speculation, but if you don't think that's at least plausible, then I'm afraid the joke's on you.