Documentation

Contents

What types of site can be submitted to the public search?

Users are encouraged to submit any personal and independent websites which they believe will improve the public search, not just their own sites. See the Submission Guidelines section in the Terms of Use for further details, including a definition of personal and independent websites.

How do I add a site to the public search?

Use the Add Site link at the top. Anyone can submit any site for a Basic listing, although it has to be approved by a moderator before it is indexed. Note that only site owners can submit a Free Trial or Full listing, because the submission process requires verifying ownership of the site.

How do I check on the status of a Basic listing I submitted?

Given there is no user tracking for a Basic listing (see the Privacy Policy), it isn't possible to notify you directly of changes to your submission. However, you can resubmit the site and check the message that is returned, which will tell you whether it is now being indexed, still pending review, or rejected (along with the rejection reason).

Adding a site to the search as a service

What benefits do I get from a Full listing?

The Full listing provides access to the search as a service features, such as more frequent indexing, indexing of a higher number of pages, access to Manage Site to configure indexing and trigger on-demand reindexing, and the API. See Add Site for further details.

Does my site have to appear in the public search if I want to use the search as a service?

No. When listing your site one of the questions is "Include in public search", to which you can select No. This allows you to use the search as a service even if your site isn't a personal or independent website. Note that, as per the Terms of Use, a moderator may exclude a Full listing from the public search.

Can I change a Basic listing to a Full listing?

Yes. Just resubmit the site, and select Full instead of Basic for the Listing tier.

How do I verify ownership of my site?

The easiest way to prove ownership is to use IndieAuth, but if you don't have that set up you can still submit your site using a process similar to one you may have used for other services, i.e. placing a specific piece of content at the root of your domain or in a DNS TXT record. The Add Site link will guide you through the process.

Do I have to complete a Full listing in one session?

No. For a Full listing, if you are unable to complete the process in one session, you can resubmit at a later time to pick up where you left off.

Using the search

What is the query syntax?

Individual words: e.g. antarctica. If there are two words, e.g. antarctica book, it will search for pages containing both words, though not necessarily as a phrase. If there are three or more words, e.g. book about antarctica, it will search for pages containing at least two of the words, so in this example results could include pages with "book" and "about" but not "antarctica".

Phrase search: enclose phrase in double quotes to search for the exact phrase, e.g. "book about antarctica"

Boolean search: use AND, &&, NOT, !, OR, ||, + and -, with ( and ) to group queries, e.g. for pages which contain the keywords antarctica and book use antarctica AND book, or for pages with antarctica and book but not movie use antarctica AND book !movie

Wildcard search: * for multiple characters, and ? for single characters, e.g. *arctic*

Filters: name:value, e.g. for all the pages on the michael-lewis.com domain which contain the word antarctica use domain:michael-lewis.com AND antarctica, or for all the article type pages on the michael-lewis.com domain use domain:michael-lewis.com AND page_type:article. See below for full list of field names.

Other searches: for fuzzy searches, proximity searches, range searches, boosting, etc., see The Standard Query Parser and The Extended DisMax Query Parser.

What fields are available?

id: URL of the web page, before following any redirects. Will be unique.
url: URL of the web page, after following any redirects. Will be the same as id if there are no redirects, and might not be unique if there are redirects.
domain: The domain to which the page belongs.
is_home: Boolean value, i.e. true or false. If true, indicates that the page is the home page for the domain.
title: Extracted from the title tag.
author: Extracted from meta name="author".
description: Extracted from meta name="description" or meta property="og:description".
tags: Multivalued. Extracted from meta name="keywords" or meta property="article:tag".
content: Text extracted from the main tag, or article tag, or body tag, with text from any nav, header and/or footer tags removed.
page_type: Extracted from meta property="og:type" or article data-post-type=.
page_last_modified: Extracted from the Last-Modified HTTP header.
published_date: Extracted from meta property="article:published_time" or meta name="dc.date.issued" or meta itemprop="datePublished".
date_domain_added: Date and time the domain was first added to the system for indexing. Only present on pages where is_home=true.
owner_verified: Boolean value, i.e. true or false. If true, indicates that the page is from a site which has been verified by the owner.
contains_adverts: Boolean value, i.e. true or false. If true, indicates that adverts have been detected on the page.
language: Extracted from html lang=.
language_primary: Language family, derived from the language attribute, e.g. if language=en-GB then language_primary=en.
indexed_inlinks: Multivalued. Pages which link to this page (from other domains within the search index, i.e. not from this domain or domains which aren't indexed).
indexed_outlinks: Multivalued. Pages to which this page links (to other domains within the search index, i.e. not to this domain or domains which aren't indexed).
indexed_inlink_domains: Multivalued. Unique domains in indexed_inlinks.

Managing my site

If you have a Full listing (or a Free Trial) you will have the API enabled. The API can be used in a server-side search page. Alternatively, you can use the API client-side, e.g. as per the very basic example at Adding a simple search page to my personal website with searchmysite.net.

If you only have a Basic listing, or don't want to use the API, a simple alternative is to have a form which takes a query and a hidden domain parameter containing the value of the domain to which you want to restrict results, e.g. (for michael-lewis.com):


<form action="https://searchmysite.net/search/">
  <input type="search" name="q">
  <input type="hidden" name="domain" value="michael-lewis.com">
  <input type="submit" value="Search">
</form>

What is the specification for the API?

In summary, queries take the form /api/v1/search/<domain>?q=*, where parameters are:

  • <domain>: the domain being searched (mandatory)
  • q: query string (mandatory)
  • page: the page number from which multi-page results should start (optional, default 1)
  • resultsperpage: the number of results per page (optional, default 10)
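
For example, the following request (using michael-lewis.com purely as an illustrative domain) would return the first page of up to 10 results for the word antarctica:

https://searchmysite.net/api/v1/search/michael-lewis.com?q=antarctica&page=1&resultsperpage=10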

Results are returned in the following format, with all fields optional apart from id and url:


{
  "params": {
    "q": "*",
    "page": 1,
    "resultsperpage": 10,
  }
  "totalresults": 40,
  "results": [
    {
      "id": "https://server/path",
      "url": "https://server/path",
      "title": "Page title",
      "author": "Author",
      "description": "Page description",
      "tags": ["tag1", "tag2"],
      "page_type": "Page type, e.g. article",
      "page_last_modified": "2020-07-17T00:00:00+00:00",
      "published_date": "2020-07-17T00:00:00+00:00",
      "language": "en",
      "indexed_inlinks": ["inlink1", "inlink2"],
      "indexed_outlinks": ["outlink1", "outlink2"],
      "fragment": ["text before the search ", "query", " and text after"]
    }
  ]
}
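
As an illustration, the API could be called from a short script along the following lines (a minimal sketch rather than an official client; the domain and query are example values, and error handling is omitted):


import json
import urllib.parse
import urllib.request

# Example values only: substitute your own verified domain and query
domain = "michael-lewis.com"
params = urllib.parse.urlencode({"q": "antarctica", "page": 1, "resultsperpage": 10})
url = "https://searchmysite.net/api/v1/search/{}?{}".format(domain, params)

# Fetch and parse the JSON response described above
with urllib.request.urlopen(url) as response:
    data = json.load(response)

print("Total results:", data.get("totalresults", 0))
for result in data.get("results", []):
    print(result["url"], "-", result.get("title", ""))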

How does the indexing process work?

The indexing process first checks the site's robots.txt and will obey any rules there. If robots.txt allows, it then loads the home page and the web feed (if a web feed was configured or discovered during a previous index), looking for links, which it follows breadth-first until there are no further pages to index, the indexing page limit for the domain is reached, or the timeout is reached.

If you want to exclude certain pages from indexing, you would normally do this via robots.txt. If you have a Full listing you can also configure your listing to exclude content based on:

  • path: i.e. URLs containing a certain string.
  • type: i.e. values from the page_type field described above.

This can be useful, for example, to filter out microblog entries which have a particular path or type.
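
For example, to keep the indexer away from everything under a particular path, a robots.txt entry along the following lines could be used (a sketch; the /micro/ path is purely illustrative, and as written the rule applies to all crawlers):


User-agent: *
Disallow: /micro/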

How frequently are sites reindexed?

See Add Site for the latest information on indexing frequency. If you have a Full listing you can log on to Manage Site to see when your next reindex is due, and can of course trigger a reindex on demand.

How does the relevancy tuning work?

The following fields are used to determine how results are ranked: title, description, author, tags, url, content, indexed_inlink_domains_count, contains_adverts and owner_verified. There is further discussion of the relevancy tuning in some Blog posts, and of course the Source code is available for complete transparency.

Support

How do I raise a support query?

Use the Contact link to get in touch. If you think you have found a bug, you could also raise it via https://github.com/searchmysite/searchmysite.net/issues.

What are the support hours for the search as a service?

As per About searchmysite.net, this is an evenings-and-weekends side project, so support is only available outside normal business hours. I'm also periodically away on holiday. I believe this is reflected in the low cost. Note, however, that the service has been running reliably since July 2020.