I have had over 6,000 hits on my site today, which was a bit of a surprise as I only have 30 or so members! (~2,000 yesterday and <200 the day before.)
They are largely all from the same IP. How can I tell if it's the Google spider, as Google's site says the IP address changes from time to time?
Pretty much all of the hits also appear to be calling the course 'recent activity' page from the main page. Hmmmm.
Anyone else having the same issue?
I've been getting up to around 100 hits a day (66.249.x.x IPs), all of them on news posts on my site's front page. It doesn't seem to be impacting my server performance. Based on an earlier reply to a similar post I made a month or so ago, this is probably nothing more than a little advertisement for my site.
I'll install such a robots.txt tomorrow and post it here too, as advice.
Can't we do something about this in the source code? I checked the box that told me my site was closed to Google, but apparently this doesn't work (any more)?
Spider hits are common on any site that is on the 'net. They are a double-edged sword: they help people find your site, but they also use up your bandwidth. I have found that Microsoft's spiders do a lot more damage than Google's.
If you want to tell the spiders to go away, you can create a file called 'robots.txt' in the root directory of your site. In that file you can list instructions for spiders. Here is an example of a robots.txt file. Note that this is NOT my file; I just found it on another site. The syntax shows two things: the first is the 'User-agent' line and the second is the 'Disallow:' directive. A lone forward slash tells the spider not to enter the site at all. If you just want to keep it out of your forums, you can disallow the forum directory instead.
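To make that concrete, here is roughly what each form looks like (the forum path below is just an example; use whatever directory your forums actually live under). To block the whole site:

```
User-agent: *
Disallow: /
```

And to keep spiders out of just the forums:

```
User-agent: *
Disallow: /mod/forum/
```

Bear in mind the bots only re-fetch robots.txt periodically, so it can take a few days before the hits actually stop.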
More information is here:
Example is attached.
Surely 6,000 hits in one day isn't common? It's not as if I'm running the BBC website!
It's still at it - hitting the same page (course 1) every minute or so.
Has it decreased on your site also?
What surprised me was that this Google server was going through all the user profiles despite the 'open to Google' setting being set to "no". This is presumably possible because the front page is accessible without requiring a login.
How does the local Google search interact with Moodle? I've often thought that an external search engine would be a useful addition to Moodle in campus settings.
I had mega 'visits', and I mean 'mega visits', logged as a 'guest user' on 13th-14th August,
accessing an 'open to guests' course on my front page.
Ray Lawrence figured out the 'culprit' for me from the IP; yep, a Google bot, and he suggested I click on the 'not open to Google' button in the config.
Stopped it!! It happened on my site once before, strangely, a 'huge number of guest logins', and that made me wonder why at the time!! All I can think of is that there must be some common 'interest factor' for the bot on the front page of our sites, obviously 'Moodle', but maybe there is another common factor too??
Or am I being too 'human' in my cognition here? Any thoughts, guys??
Curiouser and curiouser! said Alice!! Or was that the Queen of Hearts??
Geez I'm bad, I grew up in Lewis Carroll's Oxford and can't remember who said that!!
You English Lit guys, can you help me out??
What doesn't seem particularly useful to me is all the people coming to my site through the search engines because they are looking up my users' names and finding their profiles indexed by Google. Recent-activity pages and user post/profile pages don't seem too useful either. Google accounts for 10 percent of my bandwidth this month, most of it spent on non-existent or irrelevant pages.
For this particular issue, note that Google doesn't really like visiting sites with large numbers of parameters after the '?'. In this case it appears to be cutting off the last one and just trying what's left anyway.
The Googlebots are actually quite smart, and do seem to learn about your site in order to spider it better; e.g. they know which pages are updated regularly and spider them more often to get the 'fresh' content (the recent-updates page is therefore an ideal candidate for blocking with robots.txt). I'm guessing the bot would stop visiting these pages if it received a 404 or another machine-readable error, but it doesn't get one, and it isn't (yet) smart enough to read the actual page and realise that it is an error.
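For example, a rule along these lines should keep the bots off the recent-activity script (the path is from a standard Moodle install; adjust it if your site lives in a subdirectory):

```
User-agent: *
Disallow: /course/recent.php
```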
I had a read through the Google guidelines for webmasters and the only thing that jumped out at me that Moodle might be doing to confuse these spiders is the following:
- Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behaviour, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
I'd guess that if the Googlebot doesn't accept cookies then the session ID will be written into the URL. Those with multiple visits from the Googlebot might want to check whether the same page is being visited by bots with different session IDs.
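A quick way to check is to pull the session parameter out of your web server's access log. This is only a sketch under a couple of assumptions: an Apache-style log (the sample lines below are invented for illustration), and the session ID appearing in the URL as PHPSESSID, which may be named differently on your setup:

```shell
# Fabricated sample log lines (Apache combined format) for demonstration.
cat > /tmp/sample_access.log <<'EOF'
66.249.66.1 - - [14/Aug/2005:10:00:01 +0000] "GET /course/view.php?id=1&PHPSESSID=abc123 HTTP/1.0" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [14/Aug/2005:10:05:09 +0000] "GET /course/view.php?id=1&PHPSESSID=def456 HTTP/1.0" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [14/Aug/2005:10:09:44 +0000] "GET /mod/forum/index.php?id=1 HTTP/1.0" 200 2048 "-" "Googlebot/2.1"
EOF

# Count the distinct session IDs handed to the bot; more than one for
# the same page suggests Google is seeing duplicate URLs for one page.
distinct=$(grep 'Googlebot' /tmp/sample_access.log \
  | grep -o 'PHPSESSID=[^ &"]*' \
  | sort -u \
  | wc -l)
echo "distinct session IDs seen by Googlebot: $distinct"
```

If the count comes out above one for a single page, that would fit the sessionID-in-URL theory.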
You might want to try submitting your site to the Google Sitemaps service.
This is basically a document that tells Google which parts of your site are static and which pages need to be regularly re-indexed. Most of the documentation addresses people who want Google to search their site more often, but I'd guess that you could equally well use it to stop Google pointlessly re-indexing pages that haven't changed, or pages which may have changed but are only pointers to the real content, such as recent activity.
It might be a nice project to have Moodle automatically generate these according to some admin options. In the meantime, they also accept RSS feeds. Maybe Martin could point them at the Moodle.org RSS feeds and we could see how that works out?
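As a rough sketch of what auto-generation might look like, here's a small script that builds a minimal sitemap document. The URLs and change frequencies are made-up examples, not Moodle's real structure, and the function name is hypothetical:

```python
# Hypothetical sketch: generate a minimal Google Sitemap for a few pages.
from xml.sax.saxutils import escape

def make_sitemap(pages):
    """pages: list of (url, changefreq) tuples -> sitemap XML string."""
    entries = "\n".join(
        "  <url><loc>{}</loc><changefreq>{}</changefreq></url>".format(
            escape(url), freq)
        for url, freq in pages
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n'
        + entries + "\n</urlset>\n"
    )

# Example: a course page that changes weekly, a forum index that changes daily.
sitemap = make_sitemap([
    ("http://example.com/course/view.php?id=1", "weekly"),
    ("http://example.com/mod/forum/index.php?id=1", "daily"),
])
print(sitemap)
```

In a real Moodle integration, the page list and change frequencies would presumably come from the admin options mentioned above rather than being hard-coded.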