Search

Can the 1.7 version be ported to 2.0?

It was left out to streamline the backend. It would be technically feasible to build this as an extension, should you wish to take up the challenge. Failing that, Publish Filtering’s “contains” matching should give you what you want.

I never thought searching across multiple sections in the backend was that useful. I think it’s more appropriate to view the section you want to search, and search from there.

If you want to discuss further, I suggest starting a new thread. This original thread is discussing implementing a search in the front end (using Data Sources), but what you are requesting is a backend search. Best for a new thread entirely.

No I am requesting a front-end search facility.

Sorry for the confusion.

Ok, with the help of Dru we have managed to get Section Search functionality going. It wasn’t that difficult really, and it was based on the findings here and this:

regexp:{$url-query}*

My problem now is that I need to search 2 sections. I don’t think it was possible in the last version of Symphony (1.7), so I’m guessing it’s going to be tough to pull off.

A bit of direction needed, really: am I looking at creating a custom Dynamic XML Data Source and searching that?

I’m not entirely happy with how searching is handled at the moment either. I’ve talked with Alistair about creating an extension that allows full-site, multiple-section searching; this was my conclusion:

We need an internal search engine that can index entries as they are created, updated and deleted. It would be fairly easy to do this, there are delegates for handling exactly these events.
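
For illustration, here is a minimal sketch of how an extension might subscribe to those delegates. The class name and the exact delegate names (EntryPostCreate, EntryPostEdit, Delete) are assumptions, so check them against the delegates the core actually fires:

class Extension_SearchIndex extends Extension {

    // Keep the index in sync with the entry lifecycle.
    // The delegate names and pages below are assumptions, not confirmed API.
    public function getSubscribedDelegates() {
        return array(
            array(
                'page'     => '/publish/new/',
                'delegate' => 'EntryPostCreate',
                'callback' => 'indexEntry'
            ),
            array(
                'page'     => '/publish/edit/',
                'delegate' => 'EntryPostEdit',
                'callback' => 'indexEntry'
            ),
            array(
                'page'     => '/publish/',
                'delegate' => 'Delete',
                'callback' => 'removeFromIndex'
            )
        );
    }
}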

Interface

The search engine would have an interface that lets you choose which sections, and which data in those sections (via XPath expressions) you want to store in your search index.

Indexing

  1. Listen for ‘create’ or ‘update’ delegates
  2. Create XML representation of entry/ies
  3. Store text in table, along with entry and section IDs
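
To make steps 2 and 3 concrete, a hedged sketch of the indexing callback. buildEntryXML() is a hypothetical helper, the table layout is just one possibility, and Symphony::Database() stands in for whatever database access the core provides:

// Hypothetical callback on the extension class sketched above
public function indexEntry($context) {
    $entry = $context['entry'];

    // 2. Create an XML representation of the entry, mirroring what a
    //    Data Source would output for it
    $xml = simplexml_load_string($this->buildEntryXML($entry));

    // Flatten every text node into one searchable string
    $text = implode(' ', array_map('strval', $xml->xpath('//text()')));

    // 3. Store the text in a table, along with the entry and section IDs.
    //    REPLACE keeps create and update in one code path, assuming a
    //    unique key on entry_id
    Symphony::Database()->query(sprintf(
        "REPLACE INTO `tbl_search_index` (`entry_id`, `section_id`, `data`)
         VALUES (%d, %d, '%s')",
        $entry->get('id'),
        $entry->get('section_id'),
        Symphony::Database()->cleanValue($text)
    ));
}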

Searching

Create a custom event which takes one or more section IDs and a query, then uses MySQL Boolean searching to populate a parameter with a comma separated list of matched entry IDs. Then you could pass this parameter to a datasource for each section you’re searching in to get the original entries.
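
A hedged sketch of the search side (method and table names are made up; MATCH ... AGAINST ... IN BOOLEAN MODE is standard MySQL, though it needs a FULLTEXT index, which currently means a MyISAM table):

// Hypothetical event logic: match the query against the index, restricted
// to the requested sections, and return a comma separated list of entry
// IDs ready to be used as an output parameter
public function search(array $sectionIds, $query) {
    $rows = Symphony::Database()->fetch(sprintf(
        "SELECT DISTINCT `entry_id`
         FROM `tbl_search_index`
         WHERE `section_id` IN (%s)
         AND MATCH(`data`) AGAINST ('%s' IN BOOLEAN MODE)",
        implode(',', array_map('intval', $sectionIds)),
        Symphony::Database()->cleanValue($query)
    ));

    $ids = array();
    foreach ($rows as $row) {
        $ids[] = $row['entry_id'];
    }

    // e.g. "102, 7, 92, 1, 43", which each section's Data Source can then
    // filter System ID by
    return implode(', ', $ids);
}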

Alistair pointed out that MySQL’s text-searching capabilities are not brilliant, so later on the indexing should be updated to a custom, more powerful algorithm. But it would be acceptable for the short term.

Anyhow, all we need now is someone to implement it :P

This looks encouraging. I’d offer to help, but I’d only be good for fetching the coffee or something…

I make a wicked Cappuccino. Been doing so off and on most of my life.

I will not end up in a coffee house.

But yeah I would love to see something like this also. Would go down well in my planned Ensemble.

Yeah, unfortunately it seems progress is the enemy of progress. The more extensions I write, the less time I have to maintain them or write more.

So, I’d encourage anyone who’s interested to do it, the concept is actually really simple.

The more extensions I write, the less time I have to maintain them or write more.

I hear you. We’ve had a similar discussion at Airlock about implementing search like this and came up with pretty much the same solution. Caching a pre-rendered entry is the only way to go — it’s the building of the JOINs at runtime that kills large implementations.

It will take time, but I’m happy to start looking into this.

I like the idea of the search being a basic middle-man. It doesn’t attempt to do anything more advanced than cache plain text from entries, and provide a custom Data Source that supports paging, which outputs a list of System IDs.

Passing an ‘articles’ handle to the search would return entries only from Articles; providing ‘comments, articles, images’ would search multiple sections, and ‘*’ would search all sections.

So long as there are Data Sources attached to the page to accept the search output parameter (returning entries from their respective section) it would be possible to implement true multi-section searching, both front-end and maybe even back-end too.

How do we deal with multiple sections when their content is used very differently across a Symphony implementation? Symphony is so flexible that not all content is simply accessed at www.website.com/articles/id. Surely this means that anyone who goes beyond this standard structure has to create a customized results page to support URL structures like domain.com/page/category-name/brand-name/product-id/?language=lang in combination with domain.com/page/ and other possibilities.

How difficult would it be to also include related section data in the search results? I am using many section links to mimic a relational database design. E.g. I have a field “color” where I select the color “red” through a section link. “red” is an entry in another section, with the Dutch translation “rood”. I want to be able to search for “rood” and get back all entry IDs that have “red” specified as their color.

surely this will mean that anyone who goes beyond this standard structure has to create a customized result page to enable usage of URL structures

Absolutely. Symphony isn’t a “website in a box” and nor is the search — it would still require developer input to use it to satisfy a requirement. As Rowan says, the intention is to limit the search to specific sections, so should you want it only to return results for Products and Product Categories, you could specify only these two sections.

The “search results” themselves would be nothing more than a list of entry System IDs in an output parameter and some very bare XML. Imagine I am searching two sections: Articles and Comments.

I create my search form:

<form method="get">
    <input type="text" name="q"/>
    <input type="hidden" name="sections" value="articles,comments"/>
    <input type="submit" value="Search"/>
</form>

When the “search” Data Source is attached to the page, it would look through both Articles and Comments finding matching entries. The output of this Data Source would be in two parts:

Output parameters

$ds-search: 102, 7, 92, 1, 43

XML

<search>
    <pagination total-entries="15" current-page="1" total-pages="3" per-page="5" />
    <entries>
        <entry id="102" section="articles"/>
        <entry id="7" section="articles"/>
        <entry id="92" section="comments"/>
        <entry id="1" section="comments"/>
        <entry id="43" section="articles"/>
    </entries>
</search>

And that is all the search extension needs to provide.

So in this example I would create “Search Articles” and “Search Comments” data sources filtering System ID by {$ds-search} and attach them to my page. To write out the search results to the page, I would use something like:

<!-- root level match -->
<xsl:template match="data">
    <ul>
        <!-- loop through each search result -->
        <xsl:apply-templates select="search//entry" mode="search-results"/>
    </ul>
    <!-- pagination utility could be used here -->
</xsl:template>

<xsl:template match="entry" mode="search-results">
    <li>
        <!-- select the appropriate entry content from other Data Sources -->
        <xsl:choose>
            <xsl:when test="@section='articles'">
                <xsl:apply-templates select="/data/search-articles/entry[@id=current()/@id]" mode="article-result"/>
            </xsl:when>
            <xsl:when test="@section='comments'">
                <xsl:apply-templates select="/data/search-comments/entry[@id=current()/@id]" mode="comment-result"/>
            </xsl:when>
        </xsl:choose>
    </li>
</xsl:template>

<!-- render an article in the search results -->
<xsl:template match="entry" mode="article-result">
    <a href="/articles/{@id}/">
        <xsl:value-of select="title"/>
    </a>
</xsl:template>

<!-- render a comment in the search results -->
<xsl:template match="entry" mode="comment-result">
    <a href="/articles/{article/item/@id}/#comment-{@id}">
        <xsl:value-of select="author"/>
    </a>
</xsl:template>

This solves the text-based search nicely. In your example you are leaning towards a parametric search for products/entities; therefore, you’d be best suited to advanced searches using specific URL Parameters.

By means of an example (note: this is not a Symphony site!). For parametric searches (based on product attributes), a very scoped search was used (the Bulb Finder drop-downs):

http://www.lightbulbs-direct.com/bulbfinder/?w=3-5&v=Any&c=Bayonet&f=Any

However, there was also a requirement to search across Products, Categories, Pages and Articles (FAQs) by keyword:

http://www.lightbulbs-direct.com/search/?s=halogen

In the first example, a parametric search using specific Data Source filters and URL Parameters would be most appropriate. But a free-text keyword search through Products, Pages, Categories, Comments and anything else you might need would require something similar to what I suggest above.

I don’t like the idea of the search DS returning the actual entry as a normal DS would; it should only return information relevant to the actual search:

  1. Pagination information, total matches, etc
  2. The ID of the matched entry and what section it belongs in
  3. Information about what search terms were used.

The reason to do this? Writing a DS that returns entire entries and paginates them would not only be hard to do, but would also risk degrading performance significantly, as it would need to return the full entry XML – which you probably don’t need.

I’m not saying it shouldn’t be done, only that we can have a fully working search sooner without it.

Anyhow, I’m on my iPhone and it’s a little hard to write with, so I’ll continue putting my thoughts here in the morning!

As an aside, I wonder whether this would have any influence over restructuring the way entry XML is formed. If the plain text and rendered XML for an entire entry is stored in this “search cache” table, I wonder whether it could also be used to build all Data Source XML outputs.

Rather than performing all of the joins to build the XML at runtime, the Data Source filters would execute, returning only a list of matching IDs. The entry XML could be retrieved from the cache, parsed using simplexml to return only the fields in the output for that Data Source.

This would mean queries for a Data Source return entry IDs only, and do not need to rebuild entry XML at runtime.
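
A tiny sketch of the idea using PHP’s SimpleXML (the cache column and the field list are hypothetical):

// Given a pre-rendered <entry> XML string from the hypothetical cache
// table, keep only the fields this Data Source is configured to output
function filterCachedEntry($cachedXml, array $includedFields) {
    $entry  = simplexml_load_string($cachedXml);
    $output = new SimpleXMLElement('<entry id="' . (int) $entry['id'] . '"/>');

    foreach ($includedFields as $field) {
        foreach ($entry->xpath($field) as $node) {
            // A full implementation would deep-copy child nodes and
            // attributes; copying text content keeps the sketch short
            $output->addChild($node->getName(), (string) $node);
        }
    }

    return $output->asXML();
}

// Usage: the filters have already reduced the query to matching IDs, so
// the entry XML comes from the cache rather than from runtime JOINs
echo filterCachedEntry($row['xml'], array('title', 'date'));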

I’m sure this has been suggested in the past by you or Alistair. Would it work?

Thinking about this some more… it’s a pretty wild change to the way Symphony works, so I’m certainly not suggesting this for Symphony 2. But it may have legs for future versions (if the database retains its current 4NF structure).

If you also store the built XML for an entry, the problem is not with updating it when content changes (which is a trivial task), but updating all entries in a section when the structure of that section changes.

Imagine you have a section with a million entries and you rename a field. Every one of the million entries would need to be purged and rebuilt there and then!

Rowan’s idea of storing the rendered text of an entry gets around this problem.

How could we handle private data? It would be prudent to provide an interface for users to choose the fields that should be indexed for searching, since some field content should remain private (a comment author’s e-mail address, for example). One example springs to mind: if I have an Article with a Title, Body and Image, and upload an image named “nick.jpg”, I wouldn’t necessarily want this entry to be found for the keyword “nick”. Exclusion or inclusion of fields is an important consideration.

This creates the problem of what happens when these options are changed in the future. Say I have 10,000 articles indexed, including their Title and Content, but I also want to add Author Name. Changing this option would require the index for this section to be updated — 10,000 entries rebuilt, parsed and written back to the search index.

How could this be handled gracefully?

So the idea here is to search the underlying data rather than the rendered, post-XSL pages. This is cool, but it fulfills a different need than, say, Google Custom Search (GCSE).

@nickdunn Before I used Symphony, I had rolled my own XSLT CMS, and since my data models were usually pretty simple, I stored all my content in a single flat XML file (which I hear is bad). I have to say it was actually pretty snappy, and was faster than Symphony for small (< 30-page) sites.

I’ve wondered about pushing data sources to flat-file caches upon section editing. Depending on how it’s handled, you could get versioning out of the deal as well, and cut a lot of querying out of a page render. Editing the data is much more performance-heavy, and the hit from tons of un-requested data in the XML could be prohibitive, but it would be nice to have the option.

@ashooner I don’t think it fulfils a different need than GCSE; really, all it does is let people search the relevant data on a site. Websites often have what I can best describe as ‘marketing actions’; for a search to be valuable to your users, you don’t want those to appear in the search results.

Also, using GCSE doesn’t let you weight data differently. With this extension each piece of indexed data can be weighted differently, helping the most relevant results appear first.
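
As a hedged sketch of how that weighting might translate to MySQL (using the table structure proposed below; relevance scoring via MATCH in the SELECT list is standard MySQL behaviour):

// Hypothetical query: each indexed expression carries a weight, so a
// match in a heavier field (a Title, say) pushes its entry up the results
$query = 'rood'; // the user's search string

$sql = sprintf(
    "SELECT `entry_id`, `section_id`,
            SUM(`weight` * (MATCH(`data`) AGAINST ('%1\$s'))) AS `relevance`
     FROM `tbl_search_indexes`
     WHERE MATCH(`data`) AGAINST ('%1\$s' IN BOOLEAN MODE)
     GROUP BY `entry_id`, `section_id`
     ORDER BY `relevance` DESC",
    Symphony::Database()->cleanValue($query)
);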

I see the database structure being something like this (sketched as SQL after the two lists):

Expressions table

  • id
  • section_id
  • expression (xpath expression)
  • weight (simple number)

Indexes table

  • id
  • section_id
  • expression_id
  • entry_id
  • weight
  • data (normalised)
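
A rough translation of those two tables into SQL, wrapped in the install() method a Symphony extension would define (column types are assumptions, and this refines the simpler single-table sketch earlier in the thread; MyISAM because FULLTEXT indexes require it at present):

// Hypothetical install() for the extension, creating both tables
public function install() {
    Symphony::Database()->query(
        "CREATE TABLE `tbl_search_expressions` (
            `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
            `section_id` INT UNSIGNED NOT NULL,
            `expression` VARCHAR(255) NOT NULL,
            `weight` INT UNSIGNED NOT NULL DEFAULT 1,
            PRIMARY KEY (`id`),
            KEY `section_id` (`section_id`)
        ) ENGINE=MyISAM"
    );

    return Symphony::Database()->query(
        "CREATE TABLE `tbl_search_indexes` (
            `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
            `section_id` INT UNSIGNED NOT NULL,
            `expression_id` INT UNSIGNED NOT NULL,
            `entry_id` INT UNSIGNED NOT NULL,
            `weight` INT UNSIGNED NOT NULL DEFAULT 1,
            `data` TEXT NOT NULL,
            PRIMARY KEY (`id`),
            KEY `entry_id` (`entry_id`),
            FULLTEXT KEY `data` (`data`)
        ) ENGINE=MyISAM"
    );
}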

And as for recalculating the indexes: it’s not that much work to implement; the only downside is the amount of time it’ll take to execute – but we will need to do it. It shouldn’t be automated, however.

Also, both of you have touched on the idea of storing all data in Symphony as XML, we have looked at this idea before, and it would probably work better than what we currently have. However, it would mean we’d be starting from scratch – not just Symphony developers, but everyone who uses it too.

It might still make an interesting experiment when I find some free time, though.

@buzzomatic

Don’t get me wrong, I think it would be better for finding the underlying data on a site, but it would still behave in a fundamentally different way than GCSE, which leads the user to an existing page resource (i.e. a data view pre-defined by the site’s organization) rather than a more focused view of the sought-after data.

Depending on the context, it might be tricky to recreate the request/state in order to send the user to the view I’ve intended the data to be seen in.

My only concern is that search users generally expect a search result to include a link to the same page/resource that they could have navigated to themselves. This approach would not always provide that.

You’re thinking firmly about a global “Search this site” type of search which, more often than not, would be best suited to a Google search.

My only concern is that search users generally expect a search result to include a link to the same page/resource that they could have navigated to themselves. This approach would not always provide that.

If you had a blog then you could scope the search to return results for the Articles and Comments sections only. This way, you can easily map the result to a URI. Would you not limit the searchable fields only to those sections that are represented by permalinks/URIs?

I can see this type of search being extremely useful because of its versatility:

  • a blog with Articles and Comments
  • a content-heavy site where each page is stored in Markdown formatted textareas linked to Pages
  • an e-commerce site where the search looks only in product name, description and attributes (think how Amazon’s search works: it returns Products, not pages, reviews, comments, etc.)
  • a real-estate site for address, postcode-based searches
  • a “User Search” on a social network to just search name, username, email and profile description and no other pages or content types

Another really good example is this site search (Yahoo! BOSS extension): http://battlefront.co.uk/search/?q=coffee

The requirement was to return only Campaigns, Videos and Handbook Pages that matched the criteria. We used Yahoo! BOSS to spider the site, and on the results page we needed to parse the indexed URL and label it accordingly. Thankfully the URL structure was logical (Campaigns under /campaigns/, Videos under /video/ and so on).

Having an internal search would remove this dependency and would mean we could label individual types of content in the results based on their section — Campaigns, Images (matching their captions), Videos, Twitter posts, Blog Posts, Comments etc.

In the Battlefront example, a Campaign page comprises campaign information, images, videos and comments. Yahoo doesn’t distinguish between these content types and would just return a matching page. The user would then need to scan the entire page to find, say, the matching comment.

If the search did look through Comments, Videos, Images, Blog posts (etc.) individually, these are all section-linked back to a Campaigner, so we could rebuild the URL to the campaigner page with ease.

Having an internal search would remove this dependency and would mean we could label individual types of content in the results based on their section — Campaigns, Images (matching their captions), Videos, Twitter posts, Blog Posts, Comments etc.

IMO, this shows how future-ready Symphony is overall: It’s easy to provide granular semantic search results rather than just the rote content of a particular page view.

This has a lot of potential: having XSL templates that are geared more toward the semantics of content than the page view, so a search result gives a more focused but equally well-presented view of your content. Then we start to get away from the ‘web-page’ paradigm altogether, which I think is the way things are going. The site acts more as an agent for the user, presenting the content in ways the site developer might not have originally conceived.
