Searching a static web site with Azure AI Search and semantic search - part 1

Learn how to fully automate an Azure AI Search instance that pulls data from an Azure Blob Storage account and indexes it

Azure AI Search (previously Azure Cognitive Search) is a powerful search engine that provides full-text search, amongst other features.

In this blog post we'll go through:

  • Setting up Azure AI Search and related resources with Terraform

In the next blog post we'll go through:

  • Extracting static pages and processing them
  • Uploading the static pages to a storage account
  • Indexing the static pages in Azure AI Search
  • Searching the pages

Getting started

Create a git repository to work from:

mkdir azure-ai-search-terraform
cd azure-ai-search-terraform
git init

Setting up the repository

Now we need to set up the Terraform providers. Create a file called providers.tf:

terraform {
  #  backend "azurerm" {} # uncomment to store the state file in Azure
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
    }
    restapi = {
      source  = "Mastercard/restapi"
    }
  }
}

provider "azurerm" {
  features {}
}

providers.tf

The AzureRM provider will be used to create most of the resources. The Mastercard/restapi provider is used to configure Azure AI Search, as there is currently no built-in provider that can manage Azure AI Search resources after the service itself has been created (see this issue from 2021).

After creating the providers file, let's initialise Terraform:

terraform init

You should see output that looks like:

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

The resources

First we'll create a resource group:

resource "azurerm_resource_group" "this" {
  location = "UK South"
  name     = "example"
}

resource-group.tf

A storage account that we will use to store the data in part 2:

resource "azurerm_storage_account" "default" {
  name                            = "yourstorageaccountname" # replace me
  location                        = azurerm_resource_group.this.location
  resource_group_name             = azurerm_resource_group.this.name
  account_tier                    = "Standard"
  account_replication_type        = "LRS"
  allow_nested_items_to_be_public = false

  # Required so the indexer can detect deleted files
  blob_properties {
    delete_retention_policy {
      days = 7
    }
    container_delete_retention_policy {
      days = 7
    }
  }
}

resource "azurerm_storage_container" "example" {
  name                 = "example"
  storage_account_name = azurerm_storage_account.default.name
}

data "azurerm_client_config" "current" {}

# permission for logged in user to upload to the storage account
resource "azurerm_role_assignment" "storage_blob_data_contributor_user" {
  principal_id = data.azurerm_client_config.current.object_id
  scope        = azurerm_storage_account.default.id

  role_definition_name = "Storage Blob Data Contributor"
}

storage.tf
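
As an aside, once everything has been applied, the Storage Blob Data Contributor role lets the logged in user upload blobs using Azure AD authentication (uploading the actual pages is covered in part 2). A minimal sketch with the Azure CLI; the file name is just an example, and role assignments can take a minute or two to propagate:

az storage blob upload \
  --account-name yourstorageaccountname \
  --container-name example \
  --name test.html \
  --file test.html \
  --auth-mode login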

Then a search service:

resource "azurerm_search_service" "this" {
  name                = "your-search-service" # replace-me
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  sku                 = "basic"
  replica_count       = 1
  partition_count     = 1

  # Allow key auth to be used for the restapi provider
  local_authentication_enabled = true
  # Enable RBAC auth for the most secure authentication for the application
  authentication_failure_mode  = "http403"
  semantic_search_sku          = "standard"

  identity {
    type = "SystemAssigned"
  }
}

# permission for search service to access the storage account
resource "azurerm_role_assignment" "storage_blob_data_reader_search_service" {
  principal_id = azurerm_search_service.this.identity[0].principal_id
  scope        = azurerm_storage_account.default.id

  role_definition_name = "Storage Blob Data Reader"
}

search.tf

Now that we've created the search service, we can configure the restapi provider that we'll need in a minute. Update the providers.tf file, adding this to the bottom:

provider "restapi" {
  uri                  = "https://${azurerm_search_service.this.name}.search.windows.net"
  write_returns_object = true

  headers = {
    "api-key"      = azurerm_search_service.this.primary_key,
    "Content-Type" = "application/json"
  }
}

providers.tf

To store the data we'll need an index.

We'll pull in a few fields, most of which are generated automatically by the indexer:

  • metadata_storage_name - the name of the file
  • metadata_storage_path - the full storage path, including the storage account name
  • metadata_storage_content_md5 - used to detect whether a file has changed between indexer runs
  • content - the text content that has been extracted from the source file
  • title - a metadata attribute that will be set on the file in part 2

locals {
  index_json = {
    name = "example"
    fields = [
      {
        name       = "id"
        type       = "Edm.String"
        searchable = false
        filterable = false
        sortable   = false
        key        = true
        facetable  = false
      },
      {
        name       = "metadata_storage_last_modified"
        type       = "Edm.DateTimeOffset"
        searchable = false
        filterable = true
        sortable   = false
        facetable  = false
      },
      {
        name       = "title"
        type       = "Edm.String"
        searchable = true
        filterable = true
        sortable   = true
        facetable  = false
      },
      {
        name       = "metadata_storage_name"
        type       = "Edm.String"
        searchable = true
        filterable = true
        sortable   = true
        facetable  = false
      },
      {
        name       = "metadata_storage_path"
        type       = "Edm.String"
        searchable = true
        filterable = true
        sortable   = true
        facetable  = false
      },
      {
        name       = "metadata_storage_content_md5"
        type       = "Edm.String"
        filterable = true
        sortable   = true
        facetable  = false
      },
      {
        name       = "content"
        type       = "Edm.String"
        searchable = true
        filterable = false
        sortable   = false
        facetable  = false
      },
    ],
    semantic = {
      configurations = [
        {
          name : "example",
          prioritizedFields = {
            titleField = {
              fieldName = "title"
            },
            prioritizedContentFields = [
              {
                fieldName = "content"
              }
            ],
            prioritizedKeywordsFields = []
          }
        }
      ]
    },

  }
}


# See: https://learn.microsoft.com/en-us/rest/api/searchservice/preview-api/create-or-update-index
resource "restapi_object" "index" {
  path         = "/indexes"
  query_string = "api-version=2023-10-01-Preview"
  data         = jsonencode(local.index_json)
  id_attribute = "name" # The ID field on the response
}

index.tf

We've configured a semantic search configuration, which provides a secondary ranking pass that uses language understanding to re-evaluate the result set.
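
Once the index has been created you can try a semantic query against it. A minimal sketch using curl, assuming the service name from earlier and an admin (or query) key; the query text is just an example, and there won't be any documents to match until part 2:

curl -X POST "https://your-search-service.search.windows.net/indexes/example/docs/search?api-version=2023-10-01-Preview" \
  -H "Content-Type: application/json" \
  -H "api-key: <admin-or-query-key>" \
  -d '{
    "search": "how do I set up terraform providers",
    "queryType": "semantic",
    "semanticConfiguration": "example"
  }'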

Now that we've defined an index, we need a datasource that an indexer can pull from:

locals {
  datasource_json = {
    name : "example",
    description : "Example datasource",
    type : "azureblob",
    credentials : {
      connectionString : "ResourceId=${azurerm_storage_account.default.id};"
    },
    container : {
      name : "example",
    },
    dataDeletionDetectionPolicy : {
      "@odata.type" : "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy",
    },
  }
}

# https://learn.microsoft.com/en-us/rest/api/searchservice/create-data-source
resource "restapi_object" "storage_account_datasource" {
  path         = "/datasources"
  query_string = "api-version=2023-10-01-Preview"
  data         = jsonencode(local.datasource_json)
  id_attribute = "name" # The ID field on the response
  depends_on = [
    azurerm_search_service.this,
    azurerm_role_assignment.storage_blob_data_reader_search_service
  ]
}

datasource.tf

Notice that there's a data deletion detection policy based on native blob soft delete, which we configured earlier via the storage account's retention settings. If files are deleted from the container, they will be removed from the index as well.
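
For example, once everything is deployed and some files have been indexed, soft deleting a blob (which is what a normal delete becomes with the retention policy above) will remove the matching document from the index on the next indexer run. A quick sketch with the Azure CLI; the blob name is just an example:

az storage blob delete \
  --account-name yourstorageaccountname \
  --container-name example \
  --name test.html \
  --auth-mode login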

Also note that no credentials are required to access the storage account; that's because the search service's system-assigned managed identity accesses it using Azure RBAC with the 'Storage Blob Data Reader' role.

Finally we will create an indexer that will extract the files from the storage account:

locals {
  indexer_json = {
    name : "example",
    dataSourceName : "example",
    targetIndexName : "example",
    parameters : {
      configuration : {
        indexedFileNameExtensions : ".html",
        imageAction : "none"
      }
    }
  }
}

// https://learn.microsoft.com/en-us/rest/api/searchservice/preview-api/create-or-update-indexer
resource "restapi_object" "indexer" {
  path         = "/indexers"
  query_string = "api-version=2023-10-01-Preview"
  data         = jsonencode(local.indexer_json)
  id_attribute = "name" # The ID field on the response
  depends_on   = [azurerm_search_service.this, restapi_object.index, restapi_object.storage_account_datasource]
}

indexer.tf

You will see that there's no schedule configured. That's because this example is for a static site: we'll trigger the indexer using the API in part 2 as part of the static site deployment, so there's no need to schedule it.
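
With all of the Terraform files in place, plan and apply the configuration to create everything:

terraform plan
terraform apply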

You should be able to run the indexer now and get a successful run; it will just import 0 documents, as nothing has been uploaded yet.

Click the Run button in the top left of the indexer view in the Azure portal, and if everything has worked you should see a successful run:

[Image: successful run of the indexer against the storage account]
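
If you'd rather not use the portal, the indexer can also be triggered and checked over the REST API, which is how we'll run it in part 2. A rough sketch with curl, assuming the service name from earlier and an admin key:

# trigger a run
curl -X POST "https://your-search-service.search.windows.net/indexers/example/run?api-version=2023-10-01-Preview" \
  -H "api-key: <admin-key>"

# check the result of the last run
curl "https://your-search-service.search.windows.net/indexers/example/status?api-version=2023-10-01-Preview" \
  -H "api-key: <admin-key>"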

For a complete reference of all the code used in this blog please see:

GitHub - timja/blog-ai-search-part-1