
Searching Confluence Data

Introduction

Using Algolia to search through your Confluence space gives you powerful relevance tools and full control over how results are displayed. It’s also helpful when you want to centralize all your online resources (Google Drive, Dropbox, Salesforce…) under a single search experience.

The self-hosted and cloud versions of Confluence have distinct APIs; in this tutorial, we’ll focus on Confluence Cloud.

This tutorial guides you through indexing pages from Confluence Cloud with Node.js, following these steps:

  • Getting your Confluence credentials
  • Fetching and indexing documents
  • Looping over the pagination
  • Setting up incremental updates

Prerequisites

Familiar with Node.js

This tutorial assumes you are familiar with Node.js, how it works, and how to create and run Node.js scripts. If you want to learn more before going further, we recommend you read the following resources:

You also need to have Node.js installed in your environment.

Have a Confluence and Algolia account

For this tutorial, we assume that you:

  • have a Confluence Cloud account with administration access (you’ll need it to invite a user and manage permissions)
  • have an Algolia account

Install dependencies

You need to connect to your Algolia account. For that, you can use our Algolia Search library. We’ll also install request-promise-native to avoid using callbacks, and striptags to strip HTML tags from the page content.

Let’s add these dependencies to your project by running the following command in your terminal:

npm install algoliasearch request-promise-native striptags

Fetching data

Getting your Confluence credentials

Confluence provides two ways to authenticate: JSON Web Tokens and Basic Auth. For the sake of simplicity, we will go for Basic Auth. For more information on JWT authentication, you can read Authentication for Apps on the Confluence Cloud documentation.

On the Confluence Cloud admin panel, go to Users › Invite user, and add the new user’s email address. We recommend using a service account email because you’ll have to store the credentials, plus you’ll be able to tweak the space and group permissions regardless of your employees’ access levels. See Invite, edit, and remove users on the Confluence Cloud documentation to know more.

On the first login, you’ll be asked to create a password for the email you just invited. The email and password of this account are the credentials for Basic Auth.

Building the query

We’ll first create a helpers.js file and add a reusable function for querying resources.

const rp = require('request-promise-native');

const CONFLUENCE_HOST = 'https://yourdomain.atlassian.net/wiki';
const CONFLUENCE_USERNAME = 'user@example.com';
const CONFLUENCE_PASSWORD = 'user_password';

module.exports = {
  confluenceGet(uri) {
    return rp({
      url: CONFLUENCE_HOST + uri,
      // GET parameters
      qs: {
        limit: 20, // number of items per page
        orderBy: 'history.lastUpdated', // sort them by last updated
        expand: [
          // fields to retrieve
          'history.lastUpdated',
          'ancestors.page',
          'descendants.page',
          'body.view',
          'space'
        ].join(',')
      },
      headers: {
        // auth headers
        Authorization: `Basic ${Buffer.from(
          `${CONFLUENCE_USERNAME}:${CONFLUENCE_PASSWORD}`
        ).toString('base64')}`
      },
      json: true
    });
  }
};

In a new index.js file, which will be our main script, we can make our first API call:

const { confluenceGet } = require('./helpers.js');

const run = () => confluenceGet('/rest/api/content');

run();
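
Nothing is displayed yet: the promise resolves silently. If you want to confirm that your credentials and the query work, you can temporarily log a couple of fields from the response before going further (a quick sketch to remove afterwards):

// index.js — temporary sanity check
const run = () =>
  confluenceGet('/rest/api/content').then(({ results, _links }) => {
    console.log(`Fetched ${results.length} documents`);
    console.log('Next page:', _links.next || 'none');
  });

run();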

Preparing the records

Before sending content to Algolia, we want to make sure we only include the attributes we need. We also need to keep our data as flat as possible for performance reasons. We’ll therefore create a parseDocuments function that takes an array of documents and returns properly formatted records.

First, we need to create two internal functions: buildURL to convert a Confluence URI into a full URL, and parseContent to clean up the HTML content.

// helpers.js

const rp = require('request-promise-native');
const striptags = require('striptags');

const buildURL = uri =>
  uri ? CONFLUENCE_HOST + uri.replace(/^\/wiki/, '') : false;

const parseContent = html =>
  html
    ? striptags(html)
        .replace(/(\r\n?)+/g, ' ')
        .replace(/\s+/g, ' ')
    : '';
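
To get a sense of what these helpers produce, here are two illustrative calls (the URL and HTML below are made-up examples):

// turns a Confluence webui URI into an absolute URL
buildURL('/wiki/spaces/DOCS/pages/123456/Getting+Started');
// -> 'https://yourdomain.atlassian.net/wiki/spaces/DOCS/pages/123456/Getting+Started'

// strips tags and collapses whitespace
parseContent('<p>Hello <strong>world</strong></p>\r\n<p>Second paragraph</p>');
// -> 'Hello world Second paragraph'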

Then, we can add our parseDocuments function:

module.exports = {
  confluenceGet(uri) {
    // ...
  },
  parseDocuments(documents) {
    return documents.map(doc => ({
      objectID: doc.id,
      name: doc.title,
      url: buildURL(doc._links.webui),
      space: doc.space.name,
      spaceMeta: {
        id: doc.space.id,
        key: doc.space.key,
        url: buildURL(doc.space._links.webui)
      },
      lastUpdatedAt: doc.history.lastUpdated.when,
      lastUpdatedBy: doc.history.lastUpdated.by.displayName,
      lastUpdatedByPicture: buildURL(
        doc.history.lastUpdated.by.profilePicture.path.replace(
          /(\?[^\?]*)?$/,
          '?s=200'
        )
      ),
      createdAt: doc.history.createdDate,
      createdBy: doc.history.createdBy.displayName,
      createdByPicture: buildURL(
        doc.history.createdBy.profilePicture.path.replace(
          /(\?[^\?]*)?$/,
          '?s=200'
        )
      ),
      path: doc.ancestors.map(({ title }) => title).join(' > '),
      level: doc.ancestors.length,
      ancestors: doc.ancestors.map(({ id, title, _links }) => ({
        id,
        name: title,
        url: buildURL(_links.webui)
      })),
      children: doc.descendants
        ? doc.descendants.page.results.map(({ id, title, _links }) => ({
            id,
            name: title,
            url: buildURL(_links.webui)
          }))
        : [],
      content: parseContent(doc.body.view.value)
    }));
  }
};

Let’s break down what’s happening here:

  • We deliberately set an objectID, so Algolia doesn’t generate one for us. This will allow us to avoid creating duplicates every time we run the script.
  • We parse the profile picture URLs (lastUpdatedByPicture and createdByPicture) to remove unnecessary parameters, and append s=200 to set the desired size of the output image (in pixels).
  • We set a path attribute to make ancestors searchable, so subpages also show up when you search for a parent document.
  • We set a level attribute that represents the depth of the document in the wiki tree. It can be useful in the tie-breaking strategy.
  • We keep track of ancestors and children for presentation purposes, in case you need to display clickable breadcrumbs and dependencies.
  • We strip HTML and line breaks from our content to avoid noise during search.
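
Put together, a record sent to Algolia looks something like this (the values are purely illustrative):

{
  objectID: '123456',
  name: 'Getting Started',
  url: 'https://yourdomain.atlassian.net/wiki/spaces/DOCS/pages/123456/Getting+Started',
  space: 'Documentation',
  spaceMeta: { id: 98304, key: 'DOCS', url: 'https://yourdomain.atlassian.net/wiki/spaces/DOCS' },
  lastUpdatedAt: '2019-04-29T14:03:12.000Z',
  lastUpdatedBy: 'Jane Doe',
  createdAt: '2019-01-15T09:21:45.000Z',
  createdBy: 'Jane Doe',
  // picture attributes omitted for brevity
  path: 'Documentation > Guides',
  level: 2,
  ancestors: [{ id: '111', name: 'Documentation', url: '…' }, { id: '222', name: 'Guides', url: '…' }],
  children: [{ id: '333', name: 'Installation', url: '…' }],
  content: 'Welcome to the guide…'
}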

Indexing content

Setting up your index

In your Algolia Dashboard, create a new confluence index and add an API key with all the “records” permissions checked.
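
Beyond creating the index and its API key, you can also push index settings programmatically once the index object is initialized in your script (see the next section). The settings below are only a suggested starting point, to adapt to your own attributes and relevance needs:

// a suggested starting point — adjust to your own relevance needs
index.setSettings({
  searchableAttributes: ['name', 'path', 'content'], // search in titles and ancestors before the body
  customRanking: ['asc(level)'], // as a tie-breaker, favor pages higher in the wiki tree
  attributesToSnippet: ['content:30'] // return a short content snippet with each hit
});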

Sending to Algolia

The Confluence API paginates results. To get them all, you need to keep on looping as long as there is a next link available in the response.

A Confluence space can get quite crowded. For performance reasons, it’s better to upload documents to Algolia in several batches instead of sending one big payload.

// index.js

const algoliasearch = require('algoliasearch');
const { confluenceGet, parseDocuments } = require('./helpers.js');

const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey');
const index = client.initIndex('confluence');

const run = () => {
  const saveObjects = (link = '/rest/api/content') =>
    confluenceGet(link).then(({ results, _links }) =>
      // index the current page, then follow the next link if there is one
      index.saveObjects(parseDocuments(results)).then(() => {
        if (_links.next) return saveObjects(_links.next);
      })
    );
  return saveObjects();
};

run();

Dealing with large documents

As we index the content of each document and not only its metadata, we may exceed the record size limit. At Algolia, we recommend that you keep records small in order to get more relevant and faster search. To prevent this, we can chunk the data and use the distinct feature. This means we’re going to save multiple records with the same metadata, each carrying a chunk of the content, but we’ll set our search results to only show one document at a time.

We need to update parseDocuments to split a document into several records, based on content length.

// helpers.js

module.exports = {
  confluenceGet(uri) {
    // ...
  },
  parseDocuments(documents) {
    return documents
      .map(doc => {
        const record = {
          // ... all the attributes from the previous version
          content: null // initialize with null value instead
        };
        let content = parseContent(doc.body.view.value);
        const chunks = [];
        let part = 0;
        while (content.length) {
          // extract the first 600 characters (without splitting words)
          const chunk = content.replace(/^(.{600}[^\s]*).*/, '$1');
          // remove the chunk from the original content
          content = content.substring(chunk.length);
          // copy the record and give each chunk its own objectID,
          // so chunks of the same document don't overwrite each other
          chunks.push(
            Object.assign({}, record, {
              objectID: `${doc.id}-${part++}`,
              content: chunk
            })
          );
        }
        return chunks;
      })
      // flatten the per-document arrays into a single array of records
      .reduce((acc, records) => acc.concat(records), []);
  }
};

We also need to set the name attribute as attributeForDistinct in our Algolia index. You can do this either programmatically or from your Dashboard.

index.setSettings({ attributeForDistinct: 'name' });

When you build your search, you’ll need to set the distinct parameter to true in the search query, to make sure you only get one hit for each chunked document.
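
For example, with the JavaScript API client, a search query could look like this (the query string is only an illustration):

// only one hit per document name is returned thanks to distinct
index
  .search('installation guide', { distinct: true })
  .then(({ hits }) => console.log(hits));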

Incremental sync

Incrementally synchronizing your data means only indexing pages that were updated within a certain period of time (for example, the last 10 minutes) and running the script at a fixed interval. When your Confluence space has a large number of pages, this prevents you from hitting the API rate limits and keeps indexing fast.

// index.js

const run = (from = 0) => {
  const saveObjects = (link = '/rest/api/content') =>
    confluenceGet(link).then(({ results, _links }) =>
      index.saveObjects(parseDocuments(results)).then(() => {
        // timestamp of the last document on the current page
        const lastUpdatedAt = new Date(
          results[results.length - 1].history.lastUpdated.when
        ).getTime();
        // keep following the pagination while documents are newer than `from`
        if (_links.next && lastUpdatedAt >= from)
          return saveObjects(_links.next);
      })
    );
  return saveObjects();
};

const args = process.argv.slice(2);
const from = args.length
  ? new Date(args[0]).getTime()
  : Date.now() - 10 * 60 * 1000;

run(from);

From now on, you only need to schedule a job that runs node index.js every 10 minutes. If you need to run it from a specific date, run the script with the date as an ISO 8601 string as an argument (for example, node index.js 2000-01-01T00:00:00+00:00).

You can also run node index.js 0 to perform a full sync of your space.
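
For example, on a Unix-like system, you could schedule the incremental sync with a cron entry similar to this one (the project path is a placeholder to adapt):

# run the incremental sync every 10 minutes
*/10 * * * * cd /path/to/your/project && node index.js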
