Our new VPS host bills its customers based on usage, in four categories: storage, memory, CPU, and network. Our old host charged a flat $28 a month for the tier of VPS we rented, and was a very capable host in terms of site performance. I would argue that performance is much better now, though you may disagree from where you sit. (If you do, I definitely want to hear from you via PM or replies to this post.) I have been watching some tools on the server, and I've seen that the database is basically always running hot. I am starting to see why:
Code:
$ wc -l access_kcrag.log
3579405 access_kcrag.log
That's nearly 3.6 million hits since Monday night. When I watch the access log live, I see lots and lots of search bots, so I decided to see how many we're dealing with:
Code:
$ for i in bing google semrush baidu duckduckgo amazon; do printf '%s: ' "$i"; grep -ciw "$i" access_kcrag.log; done
bing: 36077
google: 4306
semrush: 4417
baidu: 17
duckduckgo: 121
amazon: 9
That's 36k hits for Bing, and under 10k for all other identified engines put together, since Monday night. Another scraper, ClaudeBot, alone accounts for almost half a million hits. Over 1.4 million hits come from requests for the site over plain http rather than https.
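Rather than guessing bot names one at a time, it may be easier to let the log tell us who the heaviest visitors are. Here is a rough sketch, assuming the log is in the common "combined" format (where the User-Agent is the last quoted field on each line). The helper name top_agents is just something I made up for illustration:

```shell
# Print the most frequent User-Agent strings in an access log.
# Assumption: combined log format, so splitting each line on double
# quotes puts the User-Agent in field 6.
# top_agents is a hypothetical helper name, not an existing tool.
top_agents() {
    log="$1"
    limit="${2:-20}"   # how many rows to show; defaults to 20
    awk -F'"' '{ print $6 }' "$log" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Running top_agents access_kcrag.log 20 would list the 20 busiest user agents with their hit counts. Lines whose request contains a literal double quote will be split incorrectly, so treat the numbers as approximate.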
Guests account for 868k search requests. I disabled guest search while composing this post. That should hopefully provide at least some relief to the database, but we'll see.
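To check whether that change actually helps, one option is to count search requests in the log before and after. This is only a sketch: it assumes the combined log format again, count_path is a made-up helper name, and "search.php" is just an example pattern that may not match this forum's actual search URL. It also counts all matching requests, since the access log alone cannot tell guests from logged-in members.

```shell
# Count requests whose request line contains a given path fragment.
# Assumption: combined log format, so the request line ("GET /x HTTP/1.1")
# is field 2 when each line is split on double quotes.
# count_path is a hypothetical helper name.
count_path() {
    awk -F'"' -v pat="$2" 'index($2, pat) { n++ } END { print n + 0 }' "$1"
}
```

For example, count_path access_kcrag.log search.php would print the number of requests that mention search.php in the request line.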
This is going to take some deeper investigation and some planning to address, but it appears the forum is a major target for web crawlers, which account for the vast majority of the site's traffic. I was originally writing this post to get your input on whether we want to actively block these bots, but the situation appears to be much more involved than that.
Any input anyone has would be appreciated.