Google is Stalking My Site

by N Hansen
Number of replies: 18
The bandwidth that Google has used spidering my site this month is through the roof. But its behavior today has gotten even more ridiculous. I've been watching my logs and it kept visiting the recent activity page over and over and over about a dozen times when there was no other intervening activity on the site. Then at 1:30, it visited a forum thread, at 1:31 someone posted another message on the thread, and at 1:32 the Google spider came back to the same thread again! Has anyone else experienced this?
In reply to N Hansen

Re: Google is Stalking My Site

by Darren Smith
Hello.

I have had over 6,000 hits on my site today, which was a bit of a surprise as I only have 30 or so members! (~2,000 yesterday and <200 the day before)

They are all largely from the same IP. How can I tell if it's the google spider bot as the google site says the IP address changes from time to time?
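
One way to see which addresses are responsible is to count requests per IP in the access log. This is only a rough sketch: it assumes an Apache-style log where the client IP is the first field on each line, and the sample lines (addresses, dates, paths) are invented for illustration.

```python
from collections import Counter


def hits_per_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Invented sample lines in Apache common log format (not real log data).
sample = [
    '66.249.66.1 - - [10/Aug/2005:01:30:00 +0000] "GET /course/recent.php?id=1 HTTP/1.1" 200 512',
    '66.249.66.1 - - [10/Aug/2005:01:31:00 +0000] "GET /course/recent.php?id=1 HTTP/1.1" 200 512',
    '192.0.2.10 - - [10/Aug/2005:01:31:30 +0000] "GET /index.php HTTP/1.1" 200 2048',
]

# List addresses from busiest to quietest.
for ip, hits in hits_per_ip(sample).most_common():
    print(ip, hits)
```

On a real server you would pass in the open log file instead of the sample list, since iterating over a file yields its lines.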
In reply to Darren Smith

Re: Google is Stalking My Site

by N Hansen
The Google bots that hit my site have an ip that starts with 66.249.
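
Since the addresses can change, a prefix check is only a first guess. A sturdier test is a reverse DNS lookup followed by a forward lookup to confirm it. The sketch below assumes the 66.249. prefix mentioned in this thread and assumes genuine crawler hostnames end in googlebot.com or google.com; the function names and sample hostnames are made up for illustration.

```python
import ipaddress
import socket

# Assumption from this thread: crawler addresses start with 66.249.
ASSUMED_CRAWLER_NET = ipaddress.ip_network("66.249.0.0/16")

# Assumption: genuine crawler hostnames end with one of these suffixes.
ASSUMED_SUFFIXES = (".googlebot.com", ".google.com")


def in_assumed_range(ip):
    """Quick prefix check; cheap, but the range is only an assumption."""
    return ipaddress.ip_address(ip) in ASSUMED_CRAWLER_NET


def looks_like_googlebot(ip, reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        host = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not host.endswith(ASSUMED_SUFFIXES):
        return False
    try:
        # The hostname must resolve back to the same address.
        return forward(host) == ip
    except OSError:
        return False
```

The resolver functions are passed in as parameters so the check can be tried out without network access; called with the defaults, it does real DNS lookups.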
In reply to N Hansen

Re: Google is Stalking My Site

by Darren Smith
Snap!

Pretty much all of the hits also appear to be calling course recent on the main page. Hmmmm.

Anyone else having the same issue?
In reply to Darren Smith

Re: Google is Stalking My Site

by Steve Hyndman

I've been getting up to around 100 hits a day (66.249 ips)...all of them on news posts on my site frontpage. It doesn't seem to be impacting my server performance. Based on an earlier reply to a similar post I made a month or so ago, this is probably resulting in nothing more than a little advertisement for my site.

Steve

In reply to Darren Smith

Re: Google is Stalking My Site

by Pieterjan Heyse
This is funny. Following this thread, I checked my logs... over 10K lines from Google in one day. 10K hits! This is insane.

I'll install such a robots.txt tomorrow and post it here too, as advice.

Can't we do something about this in the source code? I checked the box that told me my site was closed to Google, but apparently this doesn't work (anymore)?
In reply to N Hansen

Re: Google is Stalking My Site

by Bill Burgos
Hi,

Spider hits are common on any site that is on the 'net. It is a double-edged sword in that it helps people find your site as well as take up your bandwidth. I have found that Microsoft spiders do a lot more damage than Google.

If you want to tell the spiders to go away, you can create a file called 'robots.txt' in the root directory of your site. In that file, you can list instructions to spiders. Here is an example of a robots.txt file. Note that this is NOT my file; I just found it on another site. The syntax shows two things. The first is the 'User-agent:' line, which names the spider, and the second is the 'Disallow:' directive. A forward slash on its own after 'Disallow:' tells the spider not to enter the site at all. If you just want it to stay out of your forums, you can use this path instead:

/mod/forum/

More information is here:

http://www.searchengineworld.com/robots/robots_tutorial.htm

Example is attached.
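
For anyone who cannot open the attachment, here is a rough sketch of what such rules look like and how to check them before uploading. It is not the attached file: the rules are a minimal example built from the /mod/forum/ path above, and the check uses Python's standard urllib.robotparser module. The URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (not the attached file): keep every spider
# out of the forums but leave the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mod/forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if the rules above let this agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/mod/forum/discuss.php?d=1"))
print(allowed("http://example.com/course/view.php?id=1"))
```

Changing the Disallow line to a single "/" would shut spiders out of the whole site. Keep in mind that robots.txt is only a request; well-behaved spiders honour it, but nothing forces them to.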


Average of ratings: Useful (1)
In reply to Bill Burgos

Re: Google is Stalking My Site

by Darren Smith
Thanks for the response.

Surely 6,000 hits in one day isn't common? It's not as if I am running the bbc website!

It's still at it - hitting the same page (course 1) every minute or so.
In reply to Bill Burgos

Re: Google is Stalking My Site

by N Hansen
Bill, I know all that. I'm just wondering why they have gotten so active, and in a rather illogical way: hitting the same page over and over in rapid succession makes no sense.
In reply to N Hansen

Re: Google is Stalking My Site

by N Hansen
For example, it's logging an awful lot of hits trying to go to nonexistent pages like this:

http://www.glyphdoctors.com/mod/forum/user.php?id=5&user=444&mode=posts
In reply to N Hansen

Re: Google is Stalking My Site

by Darren Smith
My site seems to have calmed down a lot since 9am GMT with only a handful of hits from the spider since.

Has it decreased on your site also?
In reply to Darren Smith

Re: Google is Stalking My Site

by Ian Semey
Our local IT-department recently bought a local google search server for making google search available on our university website, and this also created a lot of connections, going through the list of teachers and courses again and again. We created a robots.txt file, and blocked the ip number of this local google server in the apache configuration file, so now the live access log has calmed down again.

What surprised me was that this Google server was going through all the user profiles despite the opentogoogle setting being set to "no". This is possibly because the front page is accessible without requiring a login.

In reply to Ian Semey

Re: Google is Stalking My Site

by David Scotson

How does the local Google search interact with Moodle? I've often thought that an external search engine would be a useful addition to Moodle in campus settings.

In reply to N Hansen

Re: Google is Stalking My Site Me too !

by Joyce Smith
Hi Nicole
I had mega 'visits', and I mean 'mega visits', logged as a 'guest user' on 13th-14th August,
accessing an 'open to guests' course on my front page.
Ray Lawrence figured out the 'culprit' for me from the IP: yep, a Google bot. He suggested I click the 'not open to Google' button in the config.
Stopped it!! It happened on my site once before, a strange 'huge number of guest logins' that made me wonder why at the time!! All I can think of is that there must be a common 'interest factor for the bot' on the front page of our sites, obviously 'Moodle', but maybe there is another common factor also??
Or am I being too 'human' in my cognition here? Any thoughts, guys??

Curiouser and Curiouser ! said Alice !! thoughtful or, was that the Queen of hearts ??
Geez I'm bad , I grew up in Lewis Carroll's Oxford !! and can't remember who said that !!
You English Lit guys , can you help me out ??



In reply to Joyce Smith

Re: Google is Stalking My Site Me too !

by N Hansen
I actually want Google to visit my site. One of my reasons for setting up free and open discussion forums on my site was to be able to draw traffic through Google searches on more obscure topics that are discussed there. I've gotten four visits this month from people looking for "mummified chicken" and I'm only number 6 in the Google results on that term. In fact, I'd like to see the site open to more search engines, but I simply don't know how to enable this (I've submitted this as bug 3278).

What doesn't seem particularly useful to me is all the people coming to my site through the search engines because they are looking up my users' names and finding their profiles indexed by Google. Recent activity pages or user post profile pages don't seem too useful either. Google accounts for 10 percent of my bandwidth this month, most of it to nonexistent or irrelevant pages.


In reply to N Hansen

Re: Google is Stalking My Site

by David Scotson

For this particular issue, note that Google doesn't really like visiting sites with large numbers of parameters after the '?'. In this case it appears to be cutting off the last one and just trying what's left anyway.

The googlebots are actually quite smart and do seem to learn about your site to spider it better, e.g. they know which pages are updated regularly and spider them more often to get the 'fresh' content (the recent updates page is therefore an ideal candidate for spider blocking with robots.txt). So I'm guessing it would stop visiting these pages if it received a 404 or other machine-readable error page. But it doesn't receive one, and the bot isn't (yet) smart enough to read the actual page and realise that it is an error.

I had a read through the Google guidelines for webmasters and the only thing that jumped out at me that Moodle might be doing to confuse these spiders is the following:

  • Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behaviour, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.

I'd guess that if the googlebot doesn't accept cookies then the session ID will be written into the URL. Those with multiple visits from the googlebot might want to check whether the same page is being visited by bots with different session IDs.
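
A rough way to run that check on the URLs from your log is to strip the session parameter and see which pages turn up under more than one session ID. This is a sketch only: the parameter name MoodleSession is a guess and may differ on your site, and the example URLs are invented.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; check your own logs for the real one.
SESSION_PARAM = "MoodleSession"


def strip_session(url, param=SESSION_PARAM):
    """Return the URL with the session parameter removed from its query."""
    parts = urlsplit(url)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))


def pages_with_many_sessions(urls):
    """Map each page to its URL variants, keeping pages seen under 2+ variants."""
    groups = defaultdict(set)
    for url in urls:
        groups[strip_session(url)].add(url)
    return {page: variants for page, variants in groups.items()
            if len(variants) > 1}


# Invented example URLs, as they might appear in an access log.
requested = [
    "/mod/forum/discuss.php?d=7&MoodleSession=abc",
    "/mod/forum/discuss.php?d=7&MoodleSession=xyz",
    "/course/view.php?id=1",
]
print(pages_with_many_sessions(requested))
```

If the result is non-empty, the spider is probably treating each session ID as a separate page.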

In reply to N Hansen

Re: Google is Stalking My Site

by David Scotson

You might want to try submitting your site to the Google Sitemaps service.

This is basically a document that tells Google which parts of your site are static and which parts need to be revisited regularly. Most of the documentation addresses people who want Google to search their site more often, but I'd guess that you could use it equally well to stop Google from pointlessly re-indexing pages that haven't changed, or that may have changed but are only pointers to the real content, such as recent activity.

It might be a nice project to have Moodle automatically generate these according to some admin options. In the meantime, they also accept RSS feeds. Maybe Martin could point them at the Moodle.org RSS feeds and we could see how that works out?
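
As a very rough idea of what generating such a file could involve, here is a sketch using Python's standard XML library. It assumes the sitemaps.org 0.9 format with its loc and changefreq elements; the page list and the example.com URLs are placeholders, and nothing here is existing Moodle code.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 format (an assumption about which
# sitemap format the search engine expects).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs and return it as text."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages: forum threads change often, the front page less so.
pages = [
    ("http://example.com/mod/forum/discuss.php?d=1", "daily"),
    ("http://example.com/index.php", "monthly"),
]
print(build_sitemap(pages))
```

The changefreq value is a hint to the crawler rather than a command, so this may reduce pointless revisits but is unlikely to stop them outright.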