Documentation

Frequently asked questions

What types of site can be submitted?

See the Submission Guidelines section in the Terms of Use for information on what types of site can be submitted.

How do I add a site that is not already listed?

Use the Add Site link at the top. There are two main options: Quick Add and Verified Add. Quick Add lets anyone submit any site simply by entering its home page, although the site must pass an approval process before it is indexed. The Verified Add options can only be used by the site owner, who needs to verify that they own the site.

How do I verify ownership of a site?

The easiest way to prove ownership is to use IndieAuth. If you don't have that set up, you can still submit your site using a Domain Control Validation (DCV) process similar to those you may have used for other services, i.e. you place a specific piece of content at the domain's root or in a TXT record. The Add Site link will guide you through the process.

How do I verify ownership of an already listed site to access the verified owner benefits?

Use the Add Site link at the top and choose one of the Verified Add options described above.

How do I check on the status of a site I submitted via Quick Add?

Because there is no user tracking on Quick Add (see Privacy Policy), it isn't possible to notify you directly of changes to your submission. However, you can resubmit the site and read the resulting message, which says whether the site is now being indexed, is still pending review, or has been rejected (along with the rejection reason).

How frequently are sites reindexed?

At the moment, sites added via Quick Add are reindexed weekly, and sites added via Verified Add are reindexed twice weekly (verified owners can also trigger reindexes on demand). The Quick Add page has a summary of the differences.

Query syntax

Individual words: e.g. antarctica. If there are two words, e.g. antarctica book, it will search for both of them but not as a phrase. If there are three or more words, e.g. book about antarctica, it will search for a minimum of two of the words, so in this example the results could include pages with "book" and "about" but not "antarctica".
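
The word-matching rule above can be illustrated with a small sketch. It assumes simple lowercase, whitespace-based tokenisation, and the function names are hypothetical; the search engine's real text analysis is likely to be more involved.

```python
def required_matches(query_word_count: int) -> int:
    """Number of distinct query words a page must contain, per the rule above."""
    # One or two words: all must be present. Three or more: at least two.
    return min(query_word_count, 2)


def matches_individual_words(query: str, page_text: str) -> bool:
    """Return True if the page contains enough of the query words (not as a phrase)."""
    query_words = set(query.lower().split())
    page_words = set(page_text.lower().split())
    found = len(query_words & page_words)
    return found >= required_matches(len(query_words))


# "book" and "about" are present, "antarctica" is not: two of three is enough.
print(matches_individual_words("book about antarctica", "a book about penguins"))  # True
# With only two query words, both are required.
print(matches_individual_words("antarctica book", "a book about penguins"))  # False
```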

Phrase search: enclose the phrase in double quotes to search for the exact phrase, e.g. "book about antarctica".

Boolean search: use AND, &&, NOT, !, OR, ||, + and -, with ( and ) to group queries. For example, for pages which contain both of the keywords antarctica and book use antarctica AND book, or for pages with antarctica and book but not movie use antarctica AND book !movie.

Wildcard search: * for multiple characters, and ? for single characters, e.g. *arctic*

Filters: name:value, e.g. for all the pages on the michael-lewis.com domain which contain the word antarctica use domain:michael-lewis.com AND antarctica, or for all the article type pages on the michael-lewis.com domain use domain:michael-lewis.com AND type:article. See below for the full list of field names.

Other searches: for fuzzy searches, proximity searches, range searches, boosts, etc., see The Standard Query Parser and The Extended DisMax Query Parser.
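
As a rough illustration of how the syntax above fits together, the following sketch composes query strings from small helper functions. The helper names are hypothetical and not part of the service, and no escaping of special characters inside values is attempted.

```python
def phrase(text: str) -> str:
    """Wrap text in double quotes for an exact phrase search."""
    return f'"{text}"'


def field_filter(name: str, value: str) -> str:
    """Build a name:value filter clause."""
    return f"{name}:{value}"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


# All pages on one domain which contain the word antarctica.
print(all_of(field_filter("domain", "michael-lewis.com"), "antarctica"))
# domain:michael-lewis.com AND antarctica

# An exact phrase restricted to one domain.
print(all_of(field_filter("domain", "michael-lewis.com"), phrase("book about antarctica")))
# domain:michael-lewis.com AND "book about antarctica"
```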

Fields

Name                    Notes
url                     Uniform Resource Locator, i.e. address of the web page.
domain                  The domain to which the page belongs.
is_home                 Boolean value, i.e. true or false. If true, indicates that the page is the home page for the domain.
title                   Extracted from the title tag.
author                  Extracted from meta name="author".
description             Extracted from meta name="description" or meta property="og:description".
tags                    Multivalued. Extracted from meta name="keywords" or meta property="article:tag".
body                    Text extracted from the body tag. This is likely to include text from navigation, headers and footers.
content                 Text extracted from the main tag, or article tag, or body tag, with text from any nav, header and/or footer tags removed.
page_type               Extracted from meta property="og:type" or article data-post-type=.
page_last_modified      Extracted from the Last-Modified HTTP header.
published_date          Extracted from meta property="article:published_time" or meta name="dc.date.issued".
date_domain_added       Date and time the domain was first added to the system for indexing. Only present on pages where is_home=true.
owner_verified          Boolean value, i.e. true or false. If true, indicates that the page is from a site which has been verified by the owner.
contains_adverts        Boolean value, i.e. true or false. If true, indicates that adverts have been detected on the page.
language                Extracted from html lang=.
language_primary        Language family, derived from the language attribute, e.g. if language=en-GB then language_primary=en.
indexed_inlinks         Multivalued. Pages which link to this page (from other domains within the search index, i.e. not from this domain or domains which aren't indexed).
indexed_outlinks        Multivalued. Pages to which this page links (to other domains within the search index, i.e. not to this domain or domains which aren't indexed).
indexed_inlink_domains  Multivalued. Unique domains in indexed_inlinks.
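
To make a few of the field definitions above more concrete, here is a minimal sketch of how title, author, description, language and language_primary could be pulled from a page using Python's standard html.parser module. The class and function names are hypothetical, this is not the indexer's actual code, and the choice that the first description found wins is an assumption.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects a handful of the fields listed above from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            language = attributes["lang"]
            self.fields["language"] = language
            # language_primary is the part before the first hyphen, e.g. en-GB -> en.
            self.fields["language_primary"] = language.split("-")[0]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attributes.get("content")
            if content is None:
                return
            if attributes.get("name") == "author":
                self.fields["author"] = content
            elif (attributes.get("name") == "description"
                  or attributes.get("property") == "og:description"):
                # Assumption: keep the first description encountered.
                self.fields.setdefault("description", content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def extract_fields(html: str) -> dict[str, str]:
    """Parse an HTML string and return the fields that were found."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


sample = (
    '<html lang="en-GB"><head><title>Antarctica</title>'
    '<meta name="author" content="A. Writer">'
    '<meta property="og:description" content="A book about Antarctica">'
    "</head><body></body></html>"
)
print(extract_fields(sample)["language_primary"])  # en
```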

Indexing

The indexing process begins by loading the home page and following links to other pages within the same domain. If present, settings in robots.txt are obeyed, and links can be extracted from a sitemap.xml.
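
Obeying robots.txt can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the ExampleBot user agent name below are all made up for illustration; the user agent the indexer actually sends is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs which the robots.txt rules permit for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = ["https://example.com/", "https://example.com/private/notes.html"]
print(allowed_urls(ROBOTS_TXT, "ExampleBot", urls))
# ['https://example.com/']
```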

Each domain can be configured to exclude certain pages from indexing, based on:

  • path: i.e. URLs containing a certain string.
  • type: i.e. values from the page_type field described above.

This might be useful, for example, to filter out microblog entries which have a particular path or type.
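
The two exclusion rules could be applied along the following lines. This is a sketch only: the ExclusionConfig structure, its attribute names and the example values are hypothetical and do not describe how the configuration is actually stored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExclusionConfig:
    """Hypothetical per-domain exclusion settings."""
    paths: list[str] = field(default_factory=list)       # strings to look for in the URL
    page_types: list[str] = field(default_factory=list)  # page_type values to skip


def is_excluded(url: str, page_type: Optional[str], config: ExclusionConfig) -> bool:
    """Return True if the page should be left out of the index."""
    # path rule: the URL contains one of the configured strings.
    if any(fragment in url for fragment in config.paths):
        return True
    # type rule: the page_type field matches one of the configured values.
    return page_type is not None and page_type in config.page_types


config = ExclusionConfig(paths=["/notes/"], page_types=["note"])
print(is_excluded("https://example.com/notes/123", None, config))        # True
print(is_excluded("https://example.com/blog/post", "note", config))      # True
print(is_excluded("https://example.com/blog/post", "article", config))   # False
```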

The indexing process continues in a depth-first manner until there are no further pages to index or until the indexing page limit for the domain is reached.
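
A depth-first traversal with a page limit can be sketched as below. To keep the example self-contained it walks an in-memory map of links rather than fetching pages, and it assumes the links have already been restricted to the same domain; the function name and sample paths are hypothetical.

```python
def crawl_depth_first(home: str, links: dict[str, list[str]], page_limit: int) -> list[str]:
    """Return pages in the order they would be indexed, stopping at page_limit."""
    indexed: list[str] = []
    seen: set[str] = set()
    stack = [home]
    while stack and len(indexed) < page_limit:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        indexed.append(url)
        # Push links in reverse so the first link on the page is followed first.
        for target in reversed(links.get(url, [])):
            if target not in seen:
                stack.append(target)
    return indexed


site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/"],
    "/b": [],
}
print(crawl_depth_first("/", site, page_limit=10))  # ['/', '/a', '/a/1', '/b']
print(crawl_depth_first("/", site, page_limit=3))   # ['/', '/a', '/a/1']
```

Using an explicit stack rather than recursion avoids hitting Python's recursion limit on deeply nested sites.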

Relevancy tuning

The following fields are used to determine how results are ranked: title, tags, description, url, author, body, contains_adverts, owner_verified and indexed_inlinks_count. There is further discussion of the relevancy tuning in some of the blog posts, and the plan is to make the source available for complete transparency.

API

OpenAPI Spec: https://searchmysite.net/api/

In summary, queries take the form /api/v1/search/<domain>?q=*, where the parameters are:

  • <domain>: the domain being searched (mandatory)
  • q: query string (mandatory)
  • page: the page number which multi-page results should start from (optional, default 1)
  • resultsperpage: the number of results per page (optional, default 10)
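
A request URL with these parameters could be assembled as in the sketch below. It assumes the API is served from the same host as the OpenAPI spec, and the function name is hypothetical; check the spec for the authoritative details.

```python
from urllib.parse import quote, urlencode

# Assumption: the API lives on the same host as the OpenAPI spec.
BASE_URL = "https://searchmysite.net"


def build_search_url(domain: str, query: str, page: int = 1, results_per_page: int = 10) -> str:
    """Build a search URL for one domain, URL-encoding the query parameters."""
    params = urlencode({"q": query, "page": page, "resultsperpage": results_per_page})
    return f"{BASE_URL}/api/v1/search/{quote(domain, safe='')}?{params}"


print(build_search_url("michael-lewis.com", "antarctica AND book"))
# https://searchmysite.net/api/v1/search/michael-lewis.com?q=antarctica+AND+book&page=1&resultsperpage=10
```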

And results are returned in the following format:


{
	"params": {
		"q": "*",
		"page": 1,
		"resultsperpage": 10
	},
	"totalresults": 40,
	"results": [
		{
			"id": "https://...",
			"url": "https://...",
			"title": "...",
			...
		}
	]
}
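
A response in this shape could be consumed as in the following sketch, which works out how many pages of results exist and collects the result URLs. The sample body uses placeholder values, only the keys shown above are relied upon, and the function name is hypothetical.

```python
import json
import math


def summarise_response(body: str) -> dict:
    """Extract the page count and result URLs from a search response body."""
    data = json.loads(body)
    per_page = data["params"]["resultsperpage"]
    total = data["totalresults"]
    return {
        # Round up so that a partly filled last page is counted.
        "total_pages": math.ceil(total / per_page) if per_page else 0,
        "urls": [result["url"] for result in data["results"]],
    }


# Placeholder response for illustration, not real output from the API.
sample_body = """
{
    "params": {"q": "*", "page": 1, "resultsperpage": 10},
    "totalresults": 40,
    "results": [
        {"id": "https://example.com/page", "url": "https://example.com/page", "title": "Example"}
    ]
}
"""
summary = summarise_response(sample_body)
print(summary["total_pages"])  # 4
print(summary["urls"])         # ['https://example.com/page']
```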