Documentation

Query syntax

Individual words: e.g. antarctica. With two words, e.g. antarctica book, the search looks for both words but not as a phrase. With three or more words, e.g. book about antarctica, it looks for at least two of the words, so in this example the results could include pages containing "book" and "about" but not "antarctica".
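
The individual-word rule can be illustrated with a small sketch. This is only a model of the rule as described above, assuming simple whitespace tokenisation and case-insensitive matching; the function name matches is hypothetical, and this is not how the search engine itself is implemented.

```python
def matches(query: str, page_text: str) -> bool:
    """Return True if page_text satisfies the individual-word rule.

    Illustrative model only: words are split on whitespace and compared
    case-insensitively, which is an assumption made for this sketch.
    """
    query_words = set(query.lower().split())
    page_words = set(page_text.lower().split())
    # One or two query words: every word must be present.
    # Three or more query words: at least two must be present.
    required = min(len(query_words), 2)
    found = len(query_words & page_words)
    return found >= required


# A page with "book" and "about" but not "antarctica" still matches
# the three-word query, as in the example above.
print(matches("book about antarctica", "a book about penguins"))
```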

Phrase search: enclose the phrase in double quotes to search for it exactly, e.g. "book about antarctica"

Boolean search: use AND, &&, NOT, !, OR, ||, + and -, with ( and ) to group queries. For example, for pages which contain the keywords antarctica and book use antarctica AND book, and for pages with antarctica and book but not movie use antarctica AND book !movie

Wildcard search: use * for multiple characters and ? for a single character, e.g. *arctic*

Filters: name:value. For example, for all the pages on the michael-lewis.com domain which contain the word antarctica use domain:michael-lewis.com AND antarctica, or for all the article type pages on the michael-lewis.com domain use domain:michael-lewis.com AND type:article. See the Fields section below for the full list of field names.

Other searches: for fuzzy searches, proximity searches, range searches, boosts, etc., see The Standard Query Parser and The Extended DisMax Query Parser in the Apache Solr Reference Guide.
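
The boolean and filter syntax above can also be composed programmatically. The following is a minimal sketch: the helper name build_query and its parameters are hypothetical, and it does no escaping of special characters, so it is only suitable for simple keywords and values.

```python
def build_query(terms, filters=None, exclude=None):
    """Compose a query string from keywords, name:value filters and excluded words.

    Hypothetical helper for illustration; no escaping of special
    characters is attempted.
    """
    # Filters come first, written as name:value, followed by the keywords.
    parts = [f"{name}:{value}" for name, value in (filters or {}).items()]
    parts.extend(terms)
    query = " AND ".join(parts)
    # Excluded words use the ! prefix, as in "antarctica AND book !movie".
    negations = " ".join(f"!{word}" for word in (exclude or []))
    return " ".join(piece for piece in (query, negations) if piece)


print(build_query(["antarctica"], filters={"domain": "michael-lewis.com"}))
# domain:michael-lewis.com AND antarctica
print(build_query(["antarctica", "book"], exclude=["movie"]))
# antarctica AND book !movie
```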

Fields

Name                Notes
url                 Uniform Resource Locator, i.e. address of the web page.
domain              The domain to which the page belongs.
is_home             Boolean value, i.e. true or false. If true, indicates that the page is the home page for the domain.
title               Extracted from the title tag.
author              Extracted from meta name="author".
description         Extracted from meta name="description" or meta property="og:description".
tags                Multivalued. Extracted from meta name="keywords" or meta property="article:tag".
body                Text extracted from the body tag.
page_type           Extracted from meta property="og:type" or the data-post-type attribute of the article tag.
page_last_modified  Extracted from the Last-Modified HTTP header.
published_date      Extracted from meta property="article:published_time" or meta name="dc.date.issued".
date_domain_added   Date and time the domain was first added to the system for indexing. Only present on pages where is_home=true.
owner_verified      Boolean value, i.e. true or false. If true, indicates that the page is from a site which has been verified by the owner.
contains_adverts    Boolean value, i.e. true or false. If true, indicates that adverts have been detected on the page.
language            Extracted from the lang attribute of the html tag.
indexed_inlinks     Multivalued. Pages which link to this page (from other domains within the search index, i.e. not from this domain or domains which aren't indexed).
indexed_outlinks    Multivalued. Pages to which this page links (to other domains within the search index, i.e. not to this domain or domains which aren't indexed).

Indexing

The indexing process begins by loading the home page and following links to other pages within the same domain. If present, settings in robots.txt are obeyed, and links can be extracted from a sitemap.xml.

Each domain can be configured to exclude certain pages from indexing, for example those with a particular path or type.

This might be useful to, for example, filter out micro blog entries.

The indexing process continues in a depth-first manner until there are no further pages to index or until the indexing page limit for the domain is reached.
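
The traversal described above can be sketched as follows. This is a simplified illustration rather than the actual indexer: crawl, get_links and is_excluded are hypothetical names, links come from a callable instead of real HTTP requests, robots.txt and sitemap.xml handling are omitted, and the choice not to follow links from excluded pages is an assumption.

```python
from urllib.parse import urlparse


def crawl(home_url, get_links, page_limit, is_excluded=lambda url: False):
    """Depth-first traversal of a single domain, stopping at page_limit pages.

    get_links(url) is a hypothetical callable returning the links on a page.
    is_excluded(url) models the per-domain exclusion settings.
    """
    domain = urlparse(home_url).netloc
    indexed = []        # pages in the order they were indexed
    seen = set()
    stack = [home_url]  # a LIFO stack gives depth-first order
    while stack and len(indexed) < page_limit:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        if is_excluded(url):
            continue  # assumption: links on excluded pages are not followed
        indexed.append(url)
        # Push in reverse so the first link on the page is visited next.
        for link in reversed(get_links(url)):
            if urlparse(link).netloc == domain and link not in seen:
                stack.append(link)
    return indexed


# A tiny in-memory site used instead of real HTTP requests.
site = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://elsewhere.example/x",  # other domain, never followed
    ],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/"],
    "https://example.com/a/1": [],
    "https://example.com/b": ["https://example.com/a/1"],
}


def get_links(url):
    return site.get(url, [])


print(crawl("https://example.com/", get_links, page_limit=10))
print(crawl("https://example.com/", get_links, page_limit=2))
```

An exclusion rule based on path could be passed as, for example, is_excluded=lambda url: urlparse(url).path.startswith("/a").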

Relevancy tuning

The following fields are used to determine how results are ranked: title, tags, description, url, author, body, contains_adverts, owner_verified and indexed_inlinks_count. There is further discussion of the relevancy tuning in some of the blog posts, and the plan is to make the source available for complete transparency.

API

OpenAPI Spec: https://searchmysite.net/api/

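In summary, queries take the form /api/v1/search/<domain>?q=*; the parameters (such as q, page and resultsperpage, which are echoed in the params object of the response) are described in the OpenAPI spec.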

And results are returned in the following format:


{
	"params": {
		"q": "*",
		"page": 1,
		"resultsperpage": 10
	},
	"totalresults": 40,
	"results": [
		{
			"id": "https://...",
			"url": "https://...",
			"title": "...",
			...
		}
	]
}
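
A client for this endpoint might look like the sketch below. It is written under some assumptions: the host is taken from the OpenAPI spec link above, page and resultsperpage are assumed to be accepted as request parameters because they appear in the params object of the response, and the function names (build_search_url, summarise, search) are hypothetical. The OpenAPI spec remains the authoritative reference.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the API is served from the same host as the OpenAPI spec.
BASE_URL = "https://searchmysite.net"


def build_search_url(domain, q="*", page=1, resultsperpage=10):
    """Build a URL of the form /api/v1/search/<domain>?q=..."""
    path = f"/api/v1/search/{quote(domain, safe='')}"
    query = urlencode({"q": q, "page": page, "resultsperpage": resultsperpage})
    return f"{BASE_URL}{path}?{query}"


def summarise(response_text):
    """Return (totalresults, [(url, title), ...]) from a JSON response body."""
    data = json.loads(response_text)
    rows = [(r.get("url"), r.get("title")) for r in data.get("results", [])]
    return data.get("totalresults"), rows


def search(domain, **params):
    """Call the API over HTTP and summarise the response."""
    with urlopen(build_search_url(domain, **params)) as response:
        return summarise(response.read().decode("utf-8"))


print(build_search_url("michael-lewis.com", q="antarctica"))
```

Calling search("michael-lewis.com", q="antarctica") would perform the actual request; it is left out of the example so that the sketch runs without network access.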