Global Search: Developing Mnet harvesting

Posted by Valery Fremaux
Number of replies: 2

Hi fellows,

I'm starting the implementation of the Mnet-enabled Global Search. Here are some ideas I'm about to put in:

- Mnet harvesting can be separately disabled using an additional block_search configuration switch.

- In order to protect remote hosts against massive load coming from heavy search activity, an additional configuration parameter will limit the number of records fetched through each network call (see the sketch after this list).
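
Something along these lines could wire the two switches together; the setting names here (enable_mnet_harvest, mnet_max_fetch) are only placeholders, not final:

    <?php
    // Hypothetical sketch: two block_search settings gating mnet harvesting.
    // The setting names below are placeholders, not the final ones.
    require_once(dirname(__FILE__) . '/../../config.php');

    $harvestenabled = get_config('block_search', 'enable_mnet_harvest'); // on/off switch
    $maxfetch       = (int) get_config('block_search', 'mnet_max_fetch'); // per-call record cap

    $requestedcount = 100; // example: how many records this search run would like

    if (!empty($harvestenabled)) {
        // Never ask a remote host for more records than the configured cap,
        // so heavy local search activity cannot flood a peer.
        $limit = min($requestedcount, $maxfetch);
        // ... issue the mnet query asking for at most $limit records ...
    }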

Mnet search will add a new mnet service called "mnet_search".

Publishing that service allows a host to accept remote query calls and give back a hitset.

Subscribing to this service allows a host to launch queries through the network.
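
To make the publish side concrete, the "mnet_search" service could expose a function roughly like the following; the name and signature are my working assumption, not a committed API:

    /**
     * Hypothetical XML-RPC callable published by the "mnet_search" service.
     * A subscribed peer sends a query string plus a record cap, and gets a
     * hitset back as an array of plain records.
     */
    function mnet_search_query($query, $maxrecords = 50) {
        $hits = array();
        // ... run the local index search for $query here ...
        // Honour the caller's cap so a remote peer can never pull the
        // whole result set over the network.
        return array_slice($hits, 0, (int) $maxrecords);
    }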

The first version of the mnet search capability will use synchronous calls from the local querier, which is not that performant. I am thinking about using parallel Ajax queries to fetch remote results and aggregate them client side. Those asynchronous calls will anyway go through a local wrapper that must subscribe to the mnet_search service; this will be the next step. A new config switch will allow choosing how the harvesting is processed.

Most probably, synchronous calls will aggregate hits into the local hitset before results caching, so navigating using the search cache WILL NOT trigger network calls again.
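
A sketch of that aggregation pass, with local_search(), mnet_query_peer() and cache_hitset() standing in for the real helpers:

    // Hypothetical synchronous harvesting pass; all helper names are
    // placeholders for whatever the real implementation provides.
    $hits = local_search($query);                     // local results first
    foreach ($peers as $peer) {
        // One blocking network call per peer (the non-performant part).
        $remote = mnet_query_peer($peer, $query, $maxfetch);
        $hits   = array_merge($hits, $remote);        // aggregate before caching
    }
    cache_hitset($query, $hits);                      // later result pages read
                                                      // from here, never the network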

OPEN QUESTION: Should aggregated results be presented "by host", or should they be ordered by relevance?
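
Either presentation is a cheap post-processing step once the hitset is aggregated. A minimal sketch, assuming each hit carries hypothetical score and hostname fields:

    // Option A: flatten everything and order by relevance score.
    usort($hits, function ($a, $b) {
        return $b->score <=> $a->score;   // highest score first
    });

    // Option B: group hits by originating host, keeping each host's
    // own relevance order inside its group.
    $byhost = array();
    foreach ($hits as $hit) {
        $byhost[$hit->hostname][] = $hit;
    }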

More news as development moves ahead.

Cheers.

In reply to Valery Fremaux

Re: Global Search: Developing Mnet harvesting

Posted by Martín Langhoff

Valery,

great to see this in action. My strong recommendation is to avoid distributed searches - grab index updates via mnet or via plain http and run the searches locally.

For example, if you have a network of 100 nodes and you try to run a distributed search...

  • it will very likely time out unless the nodes are on your LAN
  • any node that is down or slow will hold up everything
  • the remote queries are serialised unless you do a ton of manual "threading" using the curl libs directly (a sketch of this follows the list)
  • any node that has lots of users searching imposes a big tax on the other nodes
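
For reference, that manual "threading" looks roughly like this with PHP's curl_multi API (the node URLs are placeholders); it fires all requests concurrently instead of one after another:

    // Query every node in parallel with curl_multi instead of serialising.
    $urls = array('http://node1.example/search', 'http://node2.example/search');

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);  // a dead node costs at most 5 seconds
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every one has finished or timed out.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);               // wait for activity instead of spinning
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);   // empty string if the node timed out
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // ... merge $body into the aggregated hitset ...
    }
    curl_multi_close($mh);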

In the same scenario, but fetching indexes and running the search locally:

  • fetch indexes on cron, once or twice a day (a sketch of this follows the list)
  • node connectivity is not important - if a node is slow it will just slow down a cron run by a few seconds, and if it's unreliable, it will be tried again in 5 minutes
  • your searches are always fast
  • if you have a ton of users searching, they tax your system, not anybody else's
  • if you do need the most-up-to-date-info-all-the-time-on-every-node you can broadcast updates across the network
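
The cron pass itself could be as simple as the following outline; every function name here is a placeholder, and the dump URL is invented:

    // Hypothetical cron task: pull each peer's index dump over plain http
    // and fold it into the local index, so searches never touch the network.
    function block_search_cron() {
        $peers = get_peer_list();                                  // placeholder
        foreach ($peers as $peer) {
            $dump = @file_get_contents($peer . '/search/index-dump.zip');
            if ($dump === false) {
                // Unreliable node: skip it, the next cron run retries.
                continue;
            }
            merge_into_local_index($dump);                         // placeholder
        }
    }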

I've done quite a bit of work on distributed search protocols (z3950). They don't scale well :-) I am also a happy user of many "distribute the index and search locally" schemes (apt, yum, etc.), and they work wonders.

In reply to Martín Langhoff

Re: Global Search: Developing Mnet harvesting

Posted by Valery Fremaux

Thanks Martín for those wise considerations. I suspected distributed search had some real and important drawbacks regarding performance and cross-load effects between nodes.

Asynchronously fetching remote indexes may be a reachable goal. I will study that approach right now, and try to learn more about Lucene to make that broadcast possible.

I will update this thread with results of my investigations.

Cheers.