
Tuesday, October 13, 2020

Adjusting Featured Snippet Answers by Context

How Are Featured Snippet Answers Decided Upon?

I recently wrote about Featured Snippet Answer Scores Ranking Signals. In that post, I described how Google was likely using query dependent and query independent ranking signals to create answer scores for queries that were looking like they wanted answers.

One of the inventors of that patent from that post was Steven Baker. I looked at other patents that he had written, and noticed that one of those was about context as part of query independent ranking signals for answers.

Remembering that patent about question-answering and context, I felt it was worth reviewing that patent and writing about it.

This patent is about processing question queries that want textual answers and how those answers may be decided upon.

It is a complicated patent, and at one point the description behind it gets a bit murky; I note where that happens below, and I think the other details provide a lot of insight into how Google scores featured snippet answers. There is an additional related patent that I will follow up on after this post, and I will link to it from here as well.

This patent starts by telling us that a search system can identify resources in response to queries submitted by users and provide information about the resources in a manner that is useful to the users.

How Context Scoring Adjustments for Featured Snippet Answers Works

Users of search systems are often searching for an answer to a specific question, rather than a listing of resources, like in this drawing from the patent, showing featured snippet answers:

featured snippet answers

For example, users may want to know what the weather is in a particular location, a current quote for a stock, the capital of a state, etc.

When queries that are in the form of a question are received, some search engines may perform specialized search operations in response to the question format of the query.

For example, some search engines may provide information responsive to such queries in the form of an “answer,” such as information provided in the form of a “one box” to a question, which is often a featured snippet answer.

Some question queries are better served by explanatory answers, which are also referred to as “long answers” or “answer passages.”

For example, for the question query [why is the sky blue], an answer explaining light as waves is helpful.

featured snippet answers - why is the sky blue

Such answer passages can be selected from resources that include text, such as paragraphs, that are relevant to the question and the answer.

Sections of the text are scored, and the section with the best score is selected as an answer.

In general, the patent tells us about one aspect of what it covers in the following process:

  • Receiving a query that is a question query seeking an answer response
  • Receiving candidate answer passages, each passage made of text selected from a text section subordinate to a heading on a resource, with a corresponding answer score
  • Determining, for each candidate answer passage, a hierarchy of headings on its page, with two or more heading levels hierarchically arranged in parent-child relationships, where each heading level has one or more headings, a subheading of a respective heading is a child heading of that parent heading, and the heading hierarchy includes a root level corresponding to a root heading
  • Determining a heading vector describing the path in the heading hierarchy from the root heading to the heading the candidate answer passage is subordinate to; determining a context score based, at least in part, on that heading vector; and adjusting the answer score of the candidate answer passage, at least in part, by the context score to form an adjusted answer score
  • Selecting an answer passage from the candidate answer passages based on the adjusted answer scores

Advantages of the process in the patent

  1. Long query answers can be selected, based partially on context signals indicating answers relevant to a question
  2. The context signals may be, in part, query-independent (i.e., scored independently of their relatedness to terms of the query)
  3. This part of the scoring process considers the context of the document (“resource”) in which the answer text is located, accounting for relevancy signals that may not otherwise be accounted for during query-dependent scoring
  4. Following this approach, long answers that are more likely to satisfy a searcher’s informational need are more likely to appear as answers

This patent can be found at:

Context scoring adjustments for answer passages
Inventors: Nitin Gupta, Srinivasan Venkatachary, Lingkun Chu, and Steven D. Baker
US Patent: 9,959,315
Granted: May 1, 2018
Appl. No.: 14/169,960
Filed: January 31, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for context scoring adjustments for candidate answer passages.

In one aspect, a method includes scoring candidate answer passages. For each candidate answer passage, the system determines a heading vector that describes a path in the heading hierarchy from the root heading to the respective heading to which the candidate answer passage is subordinate; determines a context score based, at least in part, on the heading vector; and adjusts answer score of the candidate answer passage at least in part by the context score to form an adjusted answer score.

The system then selects an answer passage from the candidate answer passages based on the adjusted answer scores.

Using Context Scores to Adjust Answer Scores for Featured Snippets

A drawing from the patent shows different hierarchical headings that may be used to determine the context of answer passages that may be used to adjust answer scores for featured snippets:

Hierarchical headings for featured snippets

I discuss these headings and their hierarchy below. Note that the headings include the page title as a heading (About the Moon), as well as the headings within heading elements on the page. Those headings give the answers context.

This context scoring process starts with receiving candidate answer passages and a score for each of the passages.

Those candidate answer passages and their respective scores are provided to a search engine that receives a query determined to be a question.

Each of those candidate answer passages is text selected from a text section under a particular heading from a specific resource (page) that has a certain answer score.

For each resource where a candidate answer passage has been selected, a context scoring process determines a heading hierarchy in the resource.

A heading is text or other data corresponding to a particular passage in the resource.

As an example, a heading can be text summarizing the section of text that immediately follows it (the heading describes what the text that follows it, or is contained within it, is about).

Headings may be indicated, for example, by specific formatting data, such as heading elements using HTML.

A heading could also be anchor text for an internal link (within the same page) that links to an anchor and corresponding text at some other position on the page.

A heading hierarchy could have two or more heading levels that are hierarchically arranged in parent-child relationships.

The first level, or the root heading, could be the title of the resource.

Each of the heading levels may have one or more headings, and a subheading of a respective heading is a child heading and the respective heading is a parent heading in the parent-child relationship.

For each candidate passage, a context scoring process may determine a context score based, at least in part, on the relationship between the root heading and the respective heading to which the candidate answer passage is subordinate.

To determine the context score, the context scoring process determines a heading vector that describes a path in the heading hierarchy from the root heading to the respective heading.

The context score could be based, at least in part, on the heading vector.

The context scoring process can then adjust the answer score of the candidate answer passage at least in part by the context score to form an adjusted answer score.

The context scoring process can then select an answer passage from the candidate answer passages based on adjusted answer scores.
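
To make that flow a little more concrete, here is a minimal sketch in Python of the adjust-and-select step, assuming the context score has already been computed for each candidate. The dictionary fields, the simple multiplicative adjustment, and the example numbers are my own assumptions for illustration, not the patent's actual formula.

```python
# Minimal sketch: scale each candidate's answer score by its context score,
# then pick the candidate with the highest adjusted score. Illustrative only.

def select_answer_passage(candidates):
    """candidates: list of dicts with 'text', 'answer_score', and 'context_score'."""
    for candidate in candidates:
        candidate["adjusted_score"] = candidate["answer_score"] * candidate["context_score"]
    # The passage with the highest adjusted score becomes the featured snippet answer.
    return max(candidates, key=lambda c: c["adjusted_score"])

candidates = [
    {"text": "Passage about the orbital period", "answer_score": 0.62, "context_score": 1.0},
    {"text": "Passage about the orbital distance", "answer_score": 0.58, "context_score": 1.2},
]
print(select_answer_passage(candidates)["text"])
```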

This flowchart from the patent shows the context scoring adjustment process:

context scoring adjustment flowchart

Identifying Question Queries And Answer Passages

I’ve written before about understanding the context of answer passages. The patent tells us more about question queries and answer passages that is worth going over in detail.

Some queries are in the form of a question or an implicit question.

For example, the query [distance of the earth from the moon] is in the form of an implicit question “What is the distance of the earth from the moon?”

An implicit question - the distance from the earth to the moon

Likewise, a question may be specific, as in the query [How far away is the moon].

The search engine includes a query question processor that uses processes to determine whether a query is a question query (implicit or specific) and, if it is, whether there are answers that are responsive to the question.

The query question processor can use several different algorithms to determine whether a query is a question and whether there are particular answers responsive to the question.

For example, to determine question queries and answers, it may use:

  • Language models
  • Machine learned processes
  • Knowledge graphs
  • Grammars
  • Combinations of those

The query question processor may choose candidate answer passages in addition to or instead of answer facts. For example, for the query [how far away is the moon], an answer fact is 238,900 miles. And the search engine may just show that factual information since that is the average distance of the Earth from the moon.

But the query question processor may instead (or in addition) identify passages that are very relevant to the question query.

These passages are called candidate answer passages.

The answer passages are scored, and one passage is selected based on these scores and provided in response to the query.

An answer passage may be scored, and that score may be adjusted based on a context, which is the point behind this patent.

Often Google will identify several candidate answer passages that could be used as featured snippet answers.

Google may look at the information on the pages where those answers come from to better understand the context of the answers, such as the title of the page and the headings over the content the answer was found within.

Contextual Scoring Adjustments for Featured Snippet Answers

The query question processor sends to a context scoring processor some candidate answer passages, information about the resource each answer passage was taken from, and a score for each of those candidate featured snippet answers.

The scores of the candidate answer passages could be based on the following considerations:

  • Matching a query term to the text of the candidate answer passage
  • Matching answer terms to the text of the candidate answer passages
  • The quality of the underlying resource from which the candidate answer passage was selected

I recently wrote about featured snippet answer scores, and how a combination of query dependent and query independent scoring signals might be used to generate answer scores for answer passages.

The patent tells us that the query question processor may also take into account other factors when scoring candidate answer passages.

Candidate answer passages can be selected from the text of a particular section of the resource. And the query question processor could choose more than one candidate answer passage from a text section.

We are given the following examples of different answer passages from the same page:

(These example answer passages are referred to in a few places in the remainder of the post.)

  • (1) It takes about 27 days (27 days, 7 hours, 43 minutes, and 11.6 seconds) for the Moon to orbit the Earth at its orbital distance
  • (2) Why is the distance changing? The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles
  • (3) The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles

Each of those answers could be a good one for Google to use. We are told that:

More than three candidate answers can be selected from the resource, and more than one resource can be processed for candidate answers.

How would Google choose between those three possible answers?

Google might decide based on the number of sentences and on selecting up to a maximum number of characters.

The patent tells us this about choosing between those answers:

Each candidate answer has a corresponding score. For this example, assume that candidate answer passage (2) has the highest score, followed by candidate answer passage (3), and then by candidate answer passage (1). Thus, without the context scoring processor, candidate answer passage (2) would have been provided in the answer box of FIG. 2. However, the context scoring processor takes into account the context of the answer passages and adjusts the scores provided by the query question processor.

So, we see that what might be chosen based on featured snippet answer scores could be adjusted based on the context of that answer from the page that it appears on.

Contextually Scoring Featured Snippet Answers

This process begins with a query determined to be a question query seeking an answer response.

This process next receives candidate answer passages, each candidate answer passage chosen from the text of a resource.

Each of the candidate answer passages is text chosen from a text section that is subordinate to a respective heading (under a heading) in the resource, and each has a corresponding answer score.

For example, the query question processor provides the candidate answer passages, and their corresponding scores, to the context scoring processor.

A Heading Hierarchy to Determine Context

This process then determines a heading hierarchy from the resource.

The heading hierarchy would have two or more heading levels hierarchically arranged in parent-child relationships (such as a page title and an HTML heading element).

Each heading level has one or more headings.

A subheading of a respective heading is a child heading (an (h2) heading might be a subheading of a (title)) in the parent-child relationship and the respective heading is a parent heading in the relationship.

The heading hierarchy includes a root level corresponding to a root heading.

The context scoring processor can process heading tags in a DOM tree to determine a heading hierarchy.
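
As a rough illustration of that idea, the sketch below builds a parent-child heading hierarchy from a flat list of headings pulled from a page, using the headings from the moon example. Treating the page title as level 0 and representing nodes as dictionaries are assumptions made for the example, not a description of the patent's actual DOM processing.

```python
# Build a heading hierarchy from (level, text) pairs, where level 0 is the page
# title (the root), 1 is an H1, 2 is an H2, and so on. Illustrative sketch only.

def build_heading_hierarchy(headings):
    root = {"level": 0, "text": headings[0][1], "children": [], "parent": None}
    stack = [root]
    for level, text in headings[1:]:
        node = {"level": level, "text": text, "children": [], "parent": None}
        # Pop until the top of the stack is a shallower heading; that is the parent.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        node["parent"] = stack[-1]
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

moon_page = [
    (0, "About The Moon"),
    (1, "The Moon's Orbit"),
    (2, "How long does it take for the Moon to orbit Earth?"),
    (2, "The distance from the Earth to the Moon"),
    (1, "The Moon"),
    (2, "Age of the Moon"),
    (2, "Life on the Moon"),
]
hierarchy = build_heading_hierarchy(moon_page)
print([child["text"] for child in hierarchy["children"]])  # the two H1 sections
```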

hierarchical headings for featured snippets

For example, concerning the drawing about the distance to the moon just above, the heading hierarchy for the resource may be:

The ROOT Heading (title) is: About The Moon (310)

The main heading (H1) on the page

H1: The Moon’s Orbit (330)

A secondary heading (h2) on the page:

H2: How long does it take for the Moon to orbit Earth? (334)

Another secondary heading (h2) on the page is:

H2: The distance from the Earth to the Moon (338)

Another Main heading (h1) on the page

H1: The Moon (360)

Another secondary Heading (h2) on the page:

H2: Age of the Moon (364)

Another secondary heading (h2) on the page:

H2: Life on the Moon (368)

Here is how the patent describes this heading hierarchy:

In this heading hierarchy, The title is the root heading at the root level; headings 330 and 360 are child headings of the heading, and are at a first level below the root level; headings 334 and 338 are child headings of the heading 330, and are at a second level that is one level below the first level, and two levels below the root level; and headings 364 and 368 are child headings of the heading 360, and are at a second level that is one level below the first level, and two levels below the root level.

The process from the patent determines a context score based, at least in part, on the relationship between the root heading and the respective heading to which the candidate answer passage is subordinate.

This score may be based on a heading vector.

The patent says that the process, for each of the candidate answer passages, determines a heading vector that describes a path in the heading hierarchy from the root heading to the respective heading.

The heading vector would include the text of the headings for the candidate answer passage.

For the example candidate answer passages (1)-(3) above about how long it takes the moon to orbit the Earth, the respectively corresponding heading vectors V1, V2 and V3 are:

  • V1=<[Root: About The Moon], [H1: The Moon's Orbit], [H2: How long does it take for the Moon to orbit the Earth?]>
  • V2=<[Root: About The Moon], [H1: The Moon's Orbit], [H2: The distance from the Earth to the Moon]>
  • V3=<[Root: About The Moon], [H1: The Moon's Orbit], [H2: The distance from the Earth to the Moon]>

We are also told that because candidate answer passages (2) and (3) are selected from the same text section 340, their respective heading vectors V2 and V3 are the same (they are both in the content under the same (H2) heading.)
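
Continuing that example, a heading vector can be read off by walking from the heading a passage sits under back up to the root and reversing the path. This small helper reuses the node structure from the hierarchy sketch earlier in the post and is, again, just an assumption about how such a path might be represented.

```python
# Walk parent links from the heading a passage is subordinate to up to the root,
# then reverse to get the root-to-heading path (the heading vector). Illustrative.

def heading_vector(heading_node):
    path = []
    while heading_node is not None:
        path.append(heading_node["text"])
        heading_node = heading_node["parent"]
    return list(reversed(path))

# For a passage under "The distance from the Earth to the Moon", this returns:
# ['About The Moon', "The Moon's Orbit", 'The distance from the Earth to the Moon']
```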

The process of adjusting a score, for each answer passage, uses a context score based, at least in part, on the heading vector (410).

That context score can be a single score used to scale the candidate answer passage score or can be a series of discrete scores/boosts that can be used to adjust the score of the candidate answer passage.

Where things Get Murky in This Patent

There do seem to be several related patents involving featured snippet answers, and this one which targets learning more about answers from their context based on where they fit in a heading hierarchy makes sense.

But, I’m confused by how the patent tells us that one answer based on the context would be adjusted over another one.

The first issue I have is that the answers they are comparing in the same contextual area have some overlap. Here those two are:

  • (2) Why is the distance changing? The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles
  • (3) The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles

Note that the second answer and the third answer both include the same line: “Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles.” I find myself a little surprised that the second answer includes a couple of sentences that aren’t in the third answer, and skips a couple of lines from the third answer, and then includes the last sentence, which answers the question.

Since they both appear in the same heading and subheading section of the page they are from, it is difficult to imagine that there is a different adjustment based on context. But, the patent tells us differently:

The candidate answer passage with the highest adjusted answer score (based on context from the headings) is selected as the answer passage.

Recall that in the example above, the candidate answer passage (2) had the highest score, followed by candidate answer passage (3), and then by candidate answer passage (1).

However, after adjustments, candidate answer passage (3) has the highest score, followed by candidate answer passage (2), and then candidate answer passage (1).

Accordingly, candidate answer passage (3) is selected and provided as the answer passage of FIG. 2.

Boosting Scores Based on Passage Coverage Ratio

A query question processor may limit the candidate answers to a maximum length.

The context scoring processor determines a coverage ratio, which is a measure of how much of the text section it was selected from a candidate answer passage covers.

The patent describes alternative question answers:

Alternatively, the text block may include text sections subordinate to respective headings that include a first heading for which the text section from which the candidate answer passage was selected is subordinate, and sibling headings that have an immediate parent heading in common with the first heading. For example, for the candidate answer passage, the text block may include all the text in the portion 380 of the hierarchy; or may include only the text of the sections, of some other portion of text within the portion of the hierarchy. A similar block may be used for the portion of the hierarchy for candidate answer passages selected from that portion.

A small coverage ratio may indicate a candidate answer passage is incomplete. A high coverage ratio may indicate the candidate answer passage captures more of the content of the text passage from which it was selected. A candidate answer passage may receive a context adjustment, depending on this coverage ratio.

A passage coverage ratio is the ratio of the total number of characters in the candidate answer passage to the total number of characters in the passage from which it was selected.

The passage coverage ratio could also be the ratio of the total number of sentences (or words) in the candidate answer passage to the total number of sentences (or words) in the passage from which it was selected.

We are told that other ratios can also be used.

From the three example candidate answer passages about the distance to the moon above (1)-(3) above, passage (1) has the highest ratio, passage (2) has the second-highest, and passage (3) has the lowest.

This process determines whether the coverage ratio is less than a threshold value. That threshold value can be, for example, 0.3, 0.35 or 0.4, or some other fraction. In our “distance to the moon” example, each coverage passage ratio meets or exceeds the threshold value.

If the coverage ratio is less than a threshold value, then the process would select a first answer boost factor. The first answer boost factor might be proportional to the coverage ratio according to a first relation, or may be a fixed value, or may be a non-boosting value (e.g., 1.0).

But if the coverage ratio is not less than the threshold value, the process may select a second answer boost factor. The second answer boost factor may be proportional to the coverage ratio according to a second relation, or may be a fixed value, or may be a value greater than the non-boosting value (e.g., 1.1).
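
Here is a small sketch of how a coverage-ratio boost along those lines might be selected. The threshold of 0.4 and the example boost values of 1.0 and 1.1 come from the patent's examples; the character-based ratio and the function names are my assumptions.

```python
# Passage coverage ratio and a threshold-based boost factor. The threshold and
# the 1.0 / 1.1 boost values follow the patent's examples; the rest is assumed.

def passage_coverage_ratio(candidate_text, source_text):
    # Character-based ratio; the patent notes sentences or words could be used instead.
    return len(candidate_text) / max(len(source_text), 1)

def coverage_boost(ratio, threshold=0.4):
    if ratio < threshold:
        return 1.0  # looks incomplete: no boost
    return 1.1      # covers most of its section: modest boost

source = ("The moon's distance from Earth varies because the moon travels in a "
          "slightly elliptical orbit. Thus, the moon's distance from the Earth "
          "varies from 225,700 miles to 252,000 miles.")
candidate = source[-100:]
print(coverage_boost(passage_coverage_ratio(candidate, source)))
```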

Scoring Based on Other Features

The context scoring process can also check for the presence of features in addition to those described above.

Three example features for contextually scoring an answer passage are distinctive text, a preceding question, and a list format.

Distinctive text

Distinctive text is text that stands out because it is formatted differently from other text, for example through bolding.

A Preceding Question

A preceding question is a question in the text that precedes the candidate answer passage.

The search engine may process various amounts of text to detect the question.

In some cases, only the passage from which the candidate answer passage was extracted is checked.

In others, a text window that can include header text and other text from other sections may be checked.

A boost score that is inversely proportional to the text distance from a question to the candidate answer passage is calculated, and the check is terminated at the occurrence of a first question.

That text distance may be measured in characters, words, or sentences, or by some other metric.

If the question is anchor text for a section of text and there is intervening text, such as in the case of a navigation list, then the question is determined to only precede the text passage to which it links, not precede intervening text.

In the drawing above about the moon, there are two questions in the resource: “How long does it take for the Moon to orbit Earth?” and “Why is the distance changing?”

The first question–“How long does it take for the Moon to orbit Earth?”– precedes the first candidate answer passage by a text distance of zero sentences, and it precedes the second candidate answer passage by a text distance of five sentences.

And the second question–“Why is the distance changing?”– precedes the third candidate answer by zero sentences.

If a preceding question is detected, then the process selects a question boost factor.

This boost factor may depend on the text distance, on whether the question appears in a text passage subordinate to a header or is itself a header, and, if the question is a header, on whether the candidate answer passage is subordinate to that header.

Considering these factors, the third candidate answer passage receives the highest boost factor, the first candidate answer receives the second-highest boost factor, and the second candidate answer receives the smallest boost factor.
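
A rough sketch of a question-proximity boost of that kind is below. Measuring distance in sentences and using a 1 / (1 + distance) form are assumptions I am using only to illustrate "inversely proportional"; the patent does not give a formula.

```python
# Boost a candidate answer passage that closely follows a question in the text.
# The inverse form and the sentence-based distance are illustrative assumptions.

def question_proximity_boost(sentences_from_question_to_passage):
    # Zero sentences between the question and the passage gives the largest boost;
    # the boost shrinks as the question gets farther away.
    return 1.0 + 1.0 / (1.0 + sentences_from_question_to_passage)

print(question_proximity_boost(0))  # question immediately precedes the passage
print(question_proximity_boost(5))  # question is five sentences away
```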

If a preceding question is not detected, or after the question boost factor is selected, the process then checks for the presence of a list.

The Presence of a List

A list usually indicates a series of instructive or informative steps. List detection may be conditioned on the question query being a step modal query.

A step modal query is a query where a list-based answer is likely to be a good answer. Examples of step modal queries are queries like:

  • [How to . . . ]
  • [How do I . . . ]
  • [How to install a door knob]
  • [How do I change a tire]

The context scoring process may detect lists formed with:

  • HTML tags
  • Micro formats
  • Semantic meaning
  • Consecutive headings at the same level with the same or similar phrases (e.g., Step 1, Step 2; or First; Second; Third; etc.)

The context scoring process may also score a list for quality.

It would look at signals such as where the list sits on the page and how link-heavy it is. For example, a list in the center of a page that does not include multiple links to other pages (links which are indicative of reference lists), and whose HREF link text does not occupy a large portion of the list's text, will be scored as higher quality than a list at the side of a page that does include multiple links to other pages and/or whose HREF link text does occupy a large portion of the list's text.

If a list is detected, then the process selects a list boost factor.

That list boost factor may be fixed or may be proportional to the quality score of the list.

If a list is not detected, or after the list boost factor is selected, the process ends.

In some implementations, the list boost factor may also be dependent on other feature scores.

If other features, such as coverage ratio, distinctive text, etc., have relatively high scores, then the list boost factor may be increased.

The patent tells us that this is because “the combination of these scores in the presence of a list is a strong signal of a high-quality answer passage.”
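
The sketch below shows one way a list boost along these lines might be computed, with the boost scaled up when the other context features are also strong. The quality heuristic, the weights, and the 0.7 cutoff are all assumptions for illustration.

```python
# Illustrative list boost: proportional to a list quality score, and increased
# when other context feature scores (coverage, distinctive text, question) are high.

def list_boost(list_quality, other_feature_scores, base=1.0, weight=0.2):
    if list_quality <= 0:
        return 1.0                  # no usable list detected: no boost
    boost = base + weight * list_quality
    if all(score > 0.7 for score in other_feature_scores):
        boost *= 1.1                # strong combined signals: strengthen the boost
    return boost

print(list_boost(list_quality=0.9, other_feature_scores=[0.8, 0.75]))
```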

Adjustment of Featured Snippet Answers Scores

Answer scores for candidate answer passages are adjusted by scoring components based on heading vectors, passage coverage ratio, and other features described above.

The scoring process can select the largest boost value from those determined above or can select a combination of the boost values.

Once the answer scores are adjusted, the candidate answer passage with the highest adjusted answer score is selected as the featured snippet answer and is displayed to a searcher.
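
Putting those pieces together, one plausible reading of "select the largest boost value or a combination of the boost values" looks something like the sketch below. The field names and both combination strategies are assumptions; the patent only says the boosts adjust the answer score before the highest-scoring passage is selected.

```python
# Combine the individual context boosts into one adjustment, either by taking
# the single largest boost or by multiplying them together, then rescore and select.
from math import prod

def adjust_and_select(candidates, combine="max"):
    """candidates: dicts with 'answer_score', 'coverage_boost', 'question_boost', 'list_boost'."""
    for c in candidates:
        boosts = [c["coverage_boost"], c["question_boost"], c["list_boost"]]
        c["context_boost"] = max(boosts) if combine == "max" else prod(boosts)
        c["adjusted_score"] = c["answer_score"] * c["context_boost"]
    return max(candidates, key=lambda c: c["adjusted_score"])
```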

More to Come

I will be reviewing the first patent in this series about candidate answer scores because it has some additional elements that haven’t been covered in this post or in the post about query dependent/independent ranking signals for answer scores. If you have been paying attention to how Google answers queries that appear to be seeking answers, you have likely seen those answers improving in many cases. Some answers have been really bad, though. It will be nice to have as complete an idea as we can of how Google decides what might be a good answer to a query, based on the information available to it on the Web.






Tuesday, October 6, 2020

Bursty Fresh and Local Featured Snippet Answers at Google

Featured Snippet Answers Based on Context

Last month, in the post Featured Snippet Answer Scores Ranking Signals, I wrote about the answer passages Google decides to show in response to queries that ask questions. In that post, I wrote about an updated patent which made it clear that passages that might be shown in response to a query are given answer scores based on both query dependent and query independent signals.

A query dependent signal reflects the relevance of a term in the query to some aspect of a candidate featured snippet answer. A query independent signal doesn’t rely on the terms in a query and their relevance to terms in an answer passage, but instead looks at other aspects of an answer, such as whether it is written in complete sentences.

At the end of September, Danny Sullivan, Public Liaison for Search at Google, posted on the Google Keyword Blog about some recent queries performed on Google that contained questions about smoke related to the wildfires in California. One frequent query in the area was, “why is the sky orange?” The blog post told us how Google might use contextual information about location and freshness of content in featured snippet answers.

You may notice that the location of searchers is not expressly identified in the query, much like a search for different business types, such as restaurants or places to shop. The article about these queries is in the post at:

Why is the sky orange? How Google gave people the right info

Danny tells us about how Google might respond to these queries:

Well, language understanding is at the core of Search, but it’s not just about the words. Critical context, like time and place, also helps us understand what you’re really looking for. This is particularly true for featured snippets, a feature in Search that highlights pages that our systems determine are likely a great match for your search. We’ve made improvements to better understand when fresh or local information — or both — is key to delivering relevant results to your search.

So this is pointing out that Google has worked on improving answers for questions asking about fresh or local information (or both). The snippet from the post refers to critical context; how well Google understands the context of a question is essential to how helpful it can be in answering it.

Google tells us that “Our freshness indicators identified a rush of new content was being produced on this topic that was both locally relevant and different from the more evergreen content that existed.”

Since Google is actively engaged in indexing content on the web, it can notice bursty behavior around different topics, and where that content comes from. That reminds me of a post I wrote back in 2008 called How Search Query Burstiness Could Increase Page Rankings. Google can tell what people are searching for, and where they are searching from, by keeping an eye on its log files, and it can tell what people are creating content about when it indexes new and updated webpages.

I liked this statement from the Google post, too:

Put simply, instead of surfacing general information on what causes a sunset, when people searched for “why is the sky orange” during this time period, our systems automatically pulled in current, location-based information to help people find the timely results they were searching for.

Danny also points out a query that sometimes surfaces from searchers in places such as New York City, or Boston: “Why is it Hazy?” to show that Google can use local context in those areas to provide relevant results for people searching from there.

We are told that this Google blog post provided information about a couple of queries specific to certain locations, but Google receives billions of queries a day, and they provide fresh and relevant results to all of those queries when they receive them.

Understanding the context of questions that people ask on different topics and from different places can help people receive answers to what they want to learn more about. The Google Blog post from Danny is worth reading and thinking about if you haven’t seen it.






Thursday, September 24, 2020

Featured Snippet Answer Scores Ranking Signals

Calculating Featured Snippet Answer Scores

An update this week to a patent tells us how Google may score featured snippet answers.

When a search engine ranks search results in response to a query, it may use a combination of query dependent and query independent ranking signals to determine those rankings.

A query dependent signal may depend on a term in a query, and how relevant a search result may be for that query term. A query independent signal would depend on something other than the terms in a query, such as the quality and quantity of links pointing to a result.

Answers to questions in queries may be ranked based on a combination of query dependent and query independent signals, which could determine a featured snippet answer score. An updated patent about textual answer passages tells us how those may be combined to generate featured snippet answer scores and to choose among answers to questions that appear in queries.

A year and a half ago, I wrote about answers to featured snippets in the post Does Google Use Schema to Write Answer Passages for Featured Snippets?. The patent that post was about was Candidate answer passages, which was originally filed on August 12, 2015, and was granted as a continuation patent on January 15, 2019.

That patent was a continuation of an original patent about answer passages, and it updated the original by telling us that Google would look for textual answers to questions that had structured data near them containing related facts. This could have been something like a data table or possibly even schema markup. This meant that Google could provide a text-based answer to a question and include many related facts for that answer.

Another continuation version of the first patent was granted just this week. It provides more information and a different approach to ranking answers for featured snippets, and it is worth comparing the claims in these two versions to see how they differ.

The new version of the featured snippet answer scores patent is at:

Scoring candidate answer passages
Inventors: Steven D. Baker, Srinivasan Venkatachary, Robert Andrew Brennan, Per Bjornsson, Yi Liu, Hadar Shemtov, Massimiliano Ciaramita, and Ioannis Tsochantaridis
Assignee: Google LLC
US Patent: 10,783,156
Granted: September 22, 2020
Filed: February 22, 2018

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for scoring candidate answer passages. In one aspect, a method includes receiving a query determined to be a question query that seeks an answer response and data identifying resources determined to be responsive to the query; for a subset of the resources: receiving candidate answer passages; determining, for each candidate answer passage, a query term match score that is a measure of similarity of the query terms to the candidate answer passage; determining, for each candidate answer passage, an answer term match score that is a measure of similarity of answer terms to the candidate answer passage; determining, for each candidate answer passage, a query dependent score based on the query term match score and the answer term match score; and generating an answer score that is a based on the query dependent score.

featured snippet answer scores

Candidate Answer Passages Claims Updated

There are changes to the patent that require more analysis of potential answers, based on both query dependent and query independent scores for potential answers to questions. The patent description does provide details about query dependent and query independent scores. The first claim from the first patent covers query dependent scores for answers, but not query independent scores the way the newest version does. The rest of its claims provide more details about both query dependent and query independent scores, but the newer version seems to make both types of scores more important.

The first claim from the 2015 version of the Scoring Answer Passages patent tells us:

1. A method performed by data processing apparatus, the method comprising: receiving a query determined to be a question query that seeks an answer response and data identifying resources determined to be responsive to the query and ordered according to a ranking, the query having query terms; for each resource in a top-ranked subset of the resources: receiving candidate answer passages, each candidate answer passage selected from passage units from content of the resource and being eligible to be provided as an answer passage with search results that identify the resources determined to be responsive to the query and being separate and distinct from the search results; determining, for each candidate answer passage, a query term match score that is a measure of similarity of the query terms to the candidate answer passage; determining, for each candidate answer passage, an answer term match score that is a measure of similarity of answer terms to the candidate answer passage; determining, for each candidate answer passage, a query dependent score based on the query term match score and the answer term match score; and generating an answer score that is a measure of answer quality for the answer response for the candidate answer passage based on the query dependent score.

The remainder of the claims tell us about both query dependent and query independent scores for answers, but the claims from the newer version of the patent appear to place as much importance on the query independent scores as on the query dependent scores. That convinced me that I should revisit this patent in a post and describe how Google may calculate answer scores based on query dependent and query independent scores.

The first claims in the new patent tell us:

1. A method performed by data processing apparatus, the method comprising: receiving a query determined to be a question query that seeks an answer response and data identifying resources determined to be responsive to the query and ordered according to a ranking, the query having query terms; for each resource in a top-ranked subset of the resources: receiving candidate answer passages, each candidate answer passage selected from passage units from content of the resource and being eligible to be provided as an answer passage with search results that identify the resources determined to be responsive to the query and being separate and distinct from the search results; determining, for each candidate answer passage, a query dependent score that is proportional to a number of instances of matches of query terms to terms of the candidate answer passage; determining, for each candidate answer passage, a query independent score for the candidate answer passage, wherein the query independent score is independent of the query and query dependent score and based on features of the candidate answer passage; and generating an answer score that is a measure of answer quality for the answer response for the candidate answer passage based on the query dependent score and the query independent score.

As it says in this new claim, the answer score has gone from being “a measure of answer quality for the answer response for the candidate answer passage based on the query dependent score” (from the first patent) to “a measure of answer quality for the answer response for the candidate answer passage based on the query dependent score and the query independent score” (from this newer version of the patent.)

This drawing, which appears in both versions of the patent, shows the query dependent and query independent scores both playing an important role in calculating featured snippet answer scores:

query dependent & query independent answers combine

Query Dependent and Query Independent Scores for Featured Snippet Answer Scores

Both versions of the patent tell us how a query dependent score and a query independent score for an answer might be calculated. The first version of the patent only told us in its claims that an answer score used the query dependent score; this newer version tells us that both the query dependent and the query independent scores are combined to calculate an answer score (to decide which answer is the best choice for a query).

Before the patent discusses how query dependent and query independent signals might be used to create an answer score, it does tell us this about the answer score:

The answer passage scorer receives candidate answer passages from the answer passage generator and scores each passage by combining scoring signals that predict how likely the passage is to answer the question.

In some implementations, the answer passage scorer includes a query dependent scorer and a query independent scorer that respectively generate a query dependent score and a query independent score. In some implementations, the query dependent scorer generates the query dependent score based on an answer term match score and a query term match score.
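
Read literally, the quoted passage suggests something like the sketch below: a query dependent score built from a query term match score and an answer term match score, combined with a query independent score into the final answer score. The additive and multiplicative combinations shown here are my assumptions; the patent does not publish its actual formula.

```python
# Illustrative combination of scoring signals into a featured snippet answer score.

def query_dependent_score(query_term_match_score, answer_term_match_score):
    # Per the quoted description, the query dependent score is based on both
    # match scores; simple addition is an assumption made for the sketch.
    return query_term_match_score + answer_term_match_score

def answer_score(query_dependent, query_independent):
    # The newer claims base the answer score on both components; multiplying
    # them together is an illustrative assumption.
    return query_dependent * query_independent

print(answer_score(query_dependent_score(0.4, 0.3), 0.8))
```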

Query Dependent Scoring for Featured Snippet Answer Scores

Query Dependent Scoring of answer passages is based on answer term features.

An answer term match score is a measure of similarity of answer terms to terms in a candidate answer passage.

The answer-seeking queries do not describe what a searcher is looking for since the answer is unknown to the searcher at the time of a search.

The query dependent scorer begins by finding a set of likely answer terms and compares the set of likely answer terms to a candidate answer passage to generate an answer term match score. The set of likely answer terms is likely taken from the top N ranked results returned for a query.

The process creates a list of terms from terms that are included in the top-ranked subset of results for a query. The patent tells us that each result is parsed and each term is included in a term vector. Stop words may be omitted from the term vector.

For each term in the list of terms, a term weight may be generated. The term weight for each term may be based on the number of results in the top-ranked subset in which the term occurs, multiplied by an inverse document frequency (IDF) value for the term. The IDF value may be derived from a large corpus of documents and provided to the query dependent scorer. Or the IDF may be derived from the top N documents in the returned results. The patent tells us that other appropriate term weighting techniques can also be used.

The scoring process, for each term of the candidate answer passage, determines the number of times the term occurs in the candidate answer passage. So, if the term “apogee” occurs two times in a candidate answer passage, the term value for “apogee” for that candidate answer passage is 2. However, if the same term occurs three times in a different candidate answer passage, then the term value for “apogee” for the different candidate answer passage is 3.

The scoring process, for each term of the candidate answer passage, multiplies its term weight by the number of times the term occurs in the answer passage. So, assume the term weight for “apogee” is 0.04. For the first candidate answer passage, the value based on “apogee” is 0.08 (0.04 × 2); for the second candidate answer passage, the value based on “apogee” is 0.12 (0.04 × 3).
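
Following that worked example, here is a small sketch of the answer term match score as described: each likely answer term is weighted by the number of top results containing it multiplied by an IDF value, and the passage score sums weight times occurrences. The data structures and function names are assumptions.

```python
# Answer term match score: weight each likely answer term by (number of top
# results containing it) * IDF, then sum weight * occurrences in the passage.
import math
from collections import Counter

def term_weights(top_result_terms, corpus_size, doc_frequencies):
    """top_result_terms: list of term lists, one per top-ranked result."""
    weights = {}
    for term in {t for terms in top_result_terms for t in terms}:
        results_with_term = sum(1 for terms in top_result_terms if term in terms)
        idf = math.log(corpus_size / (1 + doc_frequencies.get(term, 1)))
        weights[term] = results_with_term * idf
    return weights

def answer_term_match_score(passage_terms, weights):
    counts = Counter(passage_terms)
    return sum(weights.get(term, 0.0) * count for term, count in counts.items())

# With a weight of 0.04 for "apogee", two occurrences contribute 0.08 to the total.
print(answer_term_match_score(["apogee", "orbit", "apogee"], {"apogee": 0.04, "orbit": 0.02}))
```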

Other answer term features can also be used to determine an answer term score. For example, the query dependent scorer may determine an entity type for an answer response to the question query. The entity type may be determined by identifying terms that identify entities, such as persons, places, or things, and selecting the terms with the highest term scores. The entity type may also be identified from the query (e.g., for the query [who is the fastest man], the entity type for an answer is “man”). For each candidate answer passage, the query dependent scorer then identifies entities described in the candidate answer passage. If the entities do not include a match to the identified entity type, the answer term match score for the candidate answer passage is reduced.

Assume the following candidate answer passage is provided for scoring in response to the query [who is the fastest man]: Olympic sprinters have often set world records for sprinting events during the Olympics. The most popular sprinting event is the 100-meter dash.

The query dependent scorer will identify several entities (Olympics, sprinters, etc.), but none of them are of the type “man.” The term “sprinter” is gender-neutral. Accordingly, the answer term score will be reduced. The score may be a binary score, e.g., 1 for the presence of a term of the entity type and 0 for its absence; alternatively, it may be a likelihood score measuring how likely it is that a term of the correct type is in the candidate answer passage. Any appropriate scoring technique can be used to generate the score.

Query Independent Scoring for Featured Snippet Answer Scores

Answer passages are also scored according to query independent features.

Candidate answer passages may be generated from the top N ranked resources identified for a search in response to a query. N may be the same number as the number of search results returned on the first page of search results.

The scoring process can use a passage unit position score. This could be based on the location, within the result it comes from, of the text the candidate answer passage was selected from. A higher location results in a higher score.

The scoring process may use a language model score. The language model score generates a score based on candidate answer passages conforming to a language model.

One type of language model is based on sentence and grammar structures. This could mean that candidate answer passages with partial sentences may have lower scores than candidate answer passages with complete sentences. The patent also tells us that if structured content is included in the candidate answer passage, the structured content is not subject to language model scoring. For instance, a row from a table may have a very low language model score but may be very informative.

Another language model that may be used considers whether text from a candidate answer passage appears similar to answer text in general.

A query independent scorer accesses a language model of historical answer passages, where the historical answer passages are answer passages that have been served for all queries. Answer passages that have been served generally have a similar n-gram structure, since answer passages tend to include explanatory and declarative statements. A query independent score could use a trigram model to compare the trigrams of the candidate answer passage to the trigrams of the historical answer passages. A higher-quality candidate answer passage will typically have more trigram matches to the historical answer passages than a lower-quality candidate answer passage.
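
As a rough sketch of that idea, the snippet below scores a candidate by the fraction of its trigrams that also appear in trigrams collected from historical answer passages. A simple overlap measure standing in for a real language model, and the whitespace tokenization, are assumptions.

```python
# Compare a candidate passage's trigrams against trigrams from historical answer
# passages; more overlap suggests more "answer-like" text. Illustrative only.

def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def answer_likeness(candidate, historical_passages):
    historical = set()
    for passage in historical_passages:
        historical |= trigrams(passage)
    candidate_trigrams = trigrams(candidate)
    if not candidate_trigrams:
        return 0.0
    return len(candidate_trigrams & historical) / len(candidate_trigrams)

history = ["The moon is approximately 238,900 miles from the Earth."]
print(answer_likeness("The moon is approximately 240,000 miles away.", history))
```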

Another step involves a section boundary score. A candidate answer passage could be penalized if it includes text that crosses formatting boundaries, such as paragraphs and section breaks.

The scoring process determines an interrogative score. The query independent scorer searches the candidate answer passage for interrogative terms. A potential answer passage that includes a question or question term, e.g., “How far away is the moon from the Earth?” is generally not as helpful to a searcher looking for an answer as a candidate answer passage that only includes declarative statements, e.g., “The moon is approximately 238,900 miles from the Earth.”

The scoring process also determines discourse boundary term position scores. A discourse boundary term is one that introduces a statement or idea that is contrary to, or a modification of, a statement or idea that has just been made. For example, “conversely,” “however,” “on the other hand,” and so on.

A candidate answer passage beginning with such a term receives a relatively low discourse boundary term position score, which lowers the answer score.

A candidate answer passage that includes but does not begin with such a term receives a higher discourse boundary term position score than it would if it began with the term.

A candidate answer passage that does not include such a term receives a high discourse boundary term position score.
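
A minimal sketch of that three-way distinction follows; the particular score values and the short list of discourse boundary terms are hypothetical.

```python
# Discourse boundary term position score: a passage that begins with a term like
# "however" scores lowest, one that merely contains such a term scores higher,
# and one with none scores highest. The numeric values are assumptions.

DISCOURSE_TERMS = ("however", "conversely", "on the other hand")

def discourse_position_score(passage):
    text = passage.lower().strip()
    if text.startswith(DISCOURSE_TERMS):
        return 0.2
    if any(term in text for term in DISCOURSE_TERMS):
        return 0.6
    return 1.0

print(discourse_position_score("However, the moon is slowly drifting away from Earth."))
print(discourse_position_score("The moon is approximately 238,900 miles from the Earth."))
```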

The scoring process determines result scores for the results from which the candidate answer passages were created. These could include a ranking score, a reputation score, and a site quality score. The higher these scores are, the higher the answer score will be.

A ranking score is based on the ranking score of the result from which the candidate answer passage was created. It can be the search score of the result for the query and will be applied to all candidate answer passages from that result.

A reputation score of the result indicates the trustworthiness and/or likelihood that that subject matter of the resource serves the query well.

A site quality score indicates a measure of the quality of a web site that hosts the result from which the candidate answer passage was created.

Component query independent scores described above may be combined in several ways to determine the query independent score. They could be summed; multiplied together; or combined in other ways.
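
Here is an illustrative combination of those component scores; which components are included and whether they are summed or multiplied are assumptions drawn from the description above.

```python
# Combine query independent component scores. The patent says they may be summed,
# multiplied together, or combined in other ways; both options are sketched here.

def query_independent_score(components, method="sum"):
    """components: dict mapping a component name to its score."""
    values = list(components.values())
    if method == "sum":
        return sum(values)
    combined = 1.0
    for value in values:  # multiplicative combination
        combined *= value
    return combined

components = {"position": 0.8, "language_model": 0.7, "interrogative": 1.0, "site_quality": 0.9}
print(query_independent_score(components))
print(query_independent_score(components, method="product"))
```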






Monday, September 21, 2020

Where will we go with Personalized Knowledge Graphs?

I’ve written a couple of posts about patents at Google on personalized knowledge graphs (a topic worth thinking about seriously).

When Google introduced the “Knowledge Graph” in 2012, they told us about just one knowledge graph. But it appears that they didn’t intend the idea of a knowledge graph to be a singular one; there is more than one knowledge graph.

Google later came out with a patent that told us about how each query might return a set of results that a new knowledge graph could be created from to answer the original query. Those mini-knowledge graphs could end up being combined into a larger knowledge graph. I wrote about that patent (filed in 2017) in this post:

Answering Questions Using Knowledge Graphs.

Another patent I wrote about was one on User-Specific Knowledge Graphs: User-Specific Knowledge Graphs to Support Queries and Predictions. This patent was filed in November of 2013. It describes personalized knowledge graphs based upon information taken from your search history, from pages you have browsed, and from documents such as emails and social networking posts that you have made and received. The patent tells us that these personalized knowledge graphs could be joined together to lead to a universal knowledge graph (combining non-user-specific knowledge graphs and user-specific knowledge graphs).

I also wrote about how Google might create personalized entity repositories for people to carry around with them on their mobile devices: A Personalized Entity Repository in the Knowledge Graph. What makes this interesting is that a knowledge base of information is contained on your computing device, such as a mobile phone or a tablet. That means an answer doesn’t have to come from a server somewhere; it can come from a knowledge graph built on that personalized entity repository, which is assembled through a machine learning approach based on your search history and the documents (emails, documents, social network posts) that you access.

A Google whitepaper created for the International Conference on Theory of Information Retrieval (ICTIR) 2019, October 2-5, 2019, Personal Knowledge Graphs: A Research Agenda by Krisztian Balog and Tom Kenter, captures a lot of the ideas behind the User-Specific Knowledge Graph patent (originally filed in 2013).

The abstract tells us:

Knowledge graphs, organizing structured information about entities, and their attributes and relationships, are ubiquitous today. Entities, in this context, are usually taken to be anyone or anything considered to be globally important. This, however, rules out many entities people interact with on a daily basis.

In this position paper, we present the concept of personal knowledge graphs: resources of structured information about entities personally related to its user, including the ones that might not be globally important. We discuss key aspects that separate them from general knowledge graphs, identify the main challenges involved in constructing and using them, and define a research agenda.

The paper tells us about the purposes behind knowledge graphs:

Obvious use cases include enabling rich knowledge panels and direct answers in search result pages, powering smart assistants, supporting data exploration and visualization (tables and graphs), and facilitating media monitoring and reputation management

These are important and essential aspects of how search engines such as Google are working these days. What makes this paper interesting is that it tells us about knowledge graphs that do these things that are personalized to work with individuals. As the authors tell us:

In this position paper, we present the concept of a personal knowledge graph (PKG)—a resource of structured information about entities personally related to its user, their attributes, and the relations between them.

This paper is a good look at the direction that knowledge graphs are evolving towards, and is worth spending time with to see where they might go. This could very much be true when it comes to something such as personal assistants, which you may use to help with personal errands, such as making a restaurant reservation or booking a flight, or for entertainment at home, such as movies, music, or news.

The paper suggests some research that might be done on personalized knowledge graphs in the future and presents a number of ideas on how to bring these concepts into actual use.

Krisztian Balog had been a visiting scholar at Google for over a year, and was a computer science professor, when he wrote the above paper. He has an open-access book on the Springer website (at no charge) on Entity-Oriented Search, which is highly recommended. It captures a lot of what I have seen from Google on entities really well.







Monday, July 13, 2020

Entity Seeking Queries and Semantic Dependency Trees


Queries for some searches may be looking for one or more entities.

Someone may ask something like, “What is the hotel that looks like a sail.” That query may be looking for an entity that is the building, the Burj Al Arab Jumeirah.

Those entities, which answer the question in the query, may be identified using semantic dependency trees (example below).

Other queries may not look for answers about specific entities, such as “What is the weather today?” The answer to that query might be something like “The weather will be between 60-70 degrees Fahrenheit, and sunny today.”

Actions May Accompany Queries that Seek Entities

Google was granted a patent about answering entity seeking queries.

The process under the patent may perform particular actions for queries that seek one or more entities.

One action the system may perform involves:

  • Identifying one or more types of entities that a query is seeking
  • Determining whether the query is seeking one specific entity or potentially multiple entities

For example, the process may determine that a query of “What is the hotel that looks like a sail” is looking for a single entity that is a hotel.

In another example, the system may determine that a query “What restaurants nearby serve omelets” seeks potentially multiple entities that are restaurants.

An additional or alternative action the system may perform may include finding a most relevant entity or entities of the identified one or more types, and presenting what is identified to the user if sufficiently relevant to the query. For example, the system may identify that the Burj Al Arab Jumeirah is an entity that is a hotel and is sufficiently relevant to the terms “looks like a sail,” and, in response, audibly output synthesized speech of “Burj Al Arab Jumeirah.”

Additional Dialog about a Query to Concatenate an Entity Seeking Query

Yet another addition or alternative action may include initiating a dialog with the user for more details about the entities that are sought.

For example, the system may determine that a query is seeking a restaurant, and that there are two entities that are restaurants and are very relevant to the terms in the query. In response, it may ask the searcher “Can you give me more details?”, concatenate the additional input from the user to the original query, and re-execute the concatenated query.

Identifying SubQueries of Entity Seeking Queries

Another additional or alternative action may include identifying subqueries of a query which are entity-seeking, and using the above actions to answer the subquery, and then replacing the subqueries by their answers in the original query to obtain a partially resolved query which can be executed.

For example, the system may receive a query of “Call the hotel that looks like a sail,” determine that “the hotel that looks like a sail” is a subquery that seeks an entity, determine that an answer to the subquery is “Burj Al Arab Jumeirah,” replace “the hotel that looks like a sail” in the query with “the Burj Al Arab Jumeirah” to obtain a partially resolved query of “Call the Burj Al Arab Jumeirah,” and then execute the partially resolved query.

Looking at Previous Queries

Another additional or alternative action may include identifying that a user is seeking entities and adapting how the system resolves queries accordingly.

For example, the system may determine that sixty percent of the previous five queries that a user searched for in the past two minutes sought entities and, in response, determine that a next query that a user provides is more likely an entity seeking query, and process the query accordingly.
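Here is a minimal Python sketch of that kind of recent-query heuristic. The threshold, the labels, and the data structure are assumptions for illustration; the patent does not spell out how the query history would be represented.

```python
# Hypothetical sketch: if a large share of the user's recent queries sought
# entities, bias the interpretation of the next query toward entity seeking.

from dataclasses import dataclass
from typing import List

@dataclass
class PastQuery:
    text: str
    sought_entity: bool  # label assigned when the query was resolved

def likely_entity_seeking(history: List[PastQuery], threshold: float = 0.6) -> bool:
    """Return True if enough recent queries sought entities."""
    if not history:
        return False
    share = sum(q.sought_entity for q in history) / len(history)
    return share >= threshold

recent = [
    PastQuery("what hotel looks like a sail", True),
    PastQuery("restaurants nearby that serve omelets", True),
    PastQuery("what is the weather today", False),
    PastQuery("call the chinese restaurant on piccadilly street 15", True),
    PastQuery("play some relaxing music", False),
]
print(likely_entity_seeking(recent))  # 3 of 5 = 0.6 -> True
```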

An Advantage From Following this Process

An advantage may be more quickly resolving queries in a manner that satisfies a searcher.

For example, the system may be able to immediately provide an actual answer of “The Burj Al Arab Jumeirah” for the query “What hotel looks like a sail” where another system may instead provide a response of “no results found” or provide a response that is a search result listing for the query.

Entity Seeking Queries and Semantic Dependency Trees

Entity Seeking Queries
Another advantage may be that the process can more efficiently identify an entity sought by a query. For example, it may determine that an entity seeking query is looking for an entity of the type “hotel” and, in response, limit a search to only entities that are hotels, instead of searching across multiple entities, including entities that are not hotels.

Entities in Semantic Dependency Trees

Semantic Dependency Tree

This is an interesting approach to entity seeking queries. Determining an entity type that may correspond to an entity sought by a query, based on a term represented by the root of a dependency tree, includes:

  • Determining that the term represented by the root of the dependency tree represents a type of entity

Determining an entity type that corresponds to an entity sought by the query, based on a term represented by the root of the dependency tree, may also include:

  • Identifying a node in the tree that represents a term that represents a type of entity and includes a direct child that represents a term that indicates an action to perform
  • In response to determining that the root represents a term that represents a type of entity and includes a direct child that represents a term that indicates an action, identifying the root

In some implementations, identifying a particular entity based on both the entity type and the relevance of the entity to the terms in the query includes (a minimal sketch of this check follows the list):

  • Determining a relevance threshold based on the entity type
  • Determining that a relevance score of the particular entity, based on the query, satisfies the relevance threshold
  • In response to determining that the relevance score of the particular entity satisfies the relevance threshold, identifying the particular entity
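To make that check concrete, here is a hedged sketch of a type-dependent relevance threshold. The per-type threshold values and the helper names are invented for illustration and are not taken from the patent.

```python
# A hedged sketch of the threshold check described above; the thresholds
# and scores are invented for illustration only.

ENTITY_TYPE_THRESHOLDS = {
    "hotel": 0.8,        # a single, specific entity should be a confident match
    "restaurant": 0.5,   # "restaurants nearby" can tolerate looser matches
}

def passes_threshold(entity_type: str, relevance_score: float,
                     default_threshold: float = 0.7) -> bool:
    """Check whether an entity's relevance score clears the type-specific bar."""
    threshold = ENTITY_TYPE_THRESHOLDS.get(entity_type, default_threshold)
    return relevance_score >= threshold

print(passes_threshold("hotel", 0.92))       # True
print(passes_threshold("restaurant", 0.45))  # False
```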

This patent on Entity Seeking Queries can be found at:

Answering Entity-Seeking Queries
Inventors: Mugurel Ionut Andreica, Tatsiana Sakhar, Behshad Behzadi, Marcin M. Nowak-Przygodzki, and Adrian-Marius Dumitran
US Patent Application: 20190370326
Published: December 5, 2019
Filed: May 29, 2018

Abstract

In some implementations, a query that includes a sequence of terms is obtained, the query is mapped, based on the sequence of the terms, to a dependency tree that represents dependencies among the terms in the query, an entity type that corresponds to an entity sought by the query is determined based on a term represented by a root of the dependency tree, a particular entity is identified based on both the entity type and relevance of the entity to the terms in the query, and a response to the query is provided based on the particular entity that is identified.

Mapping a Query to a Semantic Dependency Tree

A process that handles entity seeking queries

This process includes:

  • A query mapper that maps a query including a sequence of terms to a semantic dependency tree
  • An entity type identifier that may determine an entity type based on the semantic dependency tree
  • An entity identifier that may receive the query, the entity type that is determined, and data from various data stores, and identify an entity
  • A subquery resolver that may partially resolve the query based on the entity that is identified
  • A query responder that may provide a response to the query

An Example Semantic Dependency Tree

This is how a semantic dependency tree may be constructed (a small sketch in code follows the list):

  1. A semantic dependency tree for a query may be a graph that includes nodes
  2. Each node represents one or more terms in a query
  3. Directed edges originating from a first node and ending at a second node may indicate that the one or more terms represented by the first node are modified by the one or more terms represented by the second node
  4. A node at which an edge ends may be considered a child of a node from which the edge originates
  5. A root of a semantic dependency tree may be a node representing one or more terms that do not modify other terms in a query and are modified by other terms in the query
  6. A semantic dependency tree may only include a single root
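A minimal sketch of such a tree, assuming a simple node class where edges run from a modified term to its modifiers; the class name, the parse of the example query, and the helper function are illustrative assumptions rather than anything specified in the patent.

```python
# Toy semantic dependency tree: nodes represent terms, children are the
# modifiers of a node's terms, and the root is the node that modifies nothing.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    terms: str                                            # one or more query terms
    children: List["Node"] = field(default_factory=list)  # modifiers of those terms

    def add_child(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

# "What is the hotel that looks like a sail"
root = Node("hotel")                      # the term the rest of the query modifies
root.add_child(Node("the"))
looks = root.add_child(Node("looks like"))
looks.add_child(Node("a sail"))

def root_terms(tree: Node) -> str:
    """The root's terms are what an entity type identifier would inspect first."""
    return tree.terms

print(root_terms(root))  # -> "hotel"
```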

An Entity Type Identifier

An entity type identifier may determine an entity type that corresponds to an entity sought by the query based on a term represented by a root of the semantic dependency tree.

For example, the entity type identifier may determine an entity type of “Chinese restaurant” that corresponds to an entity sought by the query “Call the Chinese restaurant on Piccadilly Street 15” based on the term “Chinese restaurant” represented by the root of the semantic dependency tree.

In another example, the entity type identifier may determine an entity type of “song” for the query “play the theme song from the Titanic” based on the term “play” represented by the root of the semantic dependency tree for the query not representing an entity type and determining that the root has a child that represents the terms “the theme song” which does represent an entity type of “song.”

Entities from a Location History of a Searcher

The entity identifier may extract, from a searcher’s mobile location history, all of the entities that have a type identified by the entity type identifier, such as hotels, restaurants, universities, etc. It may also extract features associated with each such entity, such as the time intervals when the user visited or was near the entity, or how often each entity was visited or the user was near it.

Entities from a Past Interaction History of a Searcher

In addition to that location history, the entity identifier may extract all of the entities with a type identified by the entity type identifier that the user showed interest in during past interactions, such as:

  • Movies that the user watched
  • Songs that the user listened to
  • Restaurants that the user looked up and showed interest in or booked
  • Hotels that the user booked
  • Etc.

Confidence in Relevance for Entity Seeking Queries

The patent also tells us that the entity identifier may obtain a relevance score for each entity that reflects a confidence that the entity is the one sought by the query.

The relevance score may be determined based on one or more of the features extracted from the data stores that led to the set of entities being identified, the additional features extracted for each entity in the set of entities, and the features extracted from the query.
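As a rough illustration of how such features might be combined into a relevance score, here is a toy weighted-sum sketch. The feature names, weights, and clamping are assumptions; the patent does not give a scoring formula.

```python
# Speculative sketch of combining extracted per-entity features into a
# relevance score. Feature names and weights are invented for illustration.

from typing import Dict

def relevance_score(features: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-entity features, clamped to [0, 1]."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(0.0, min(1.0, score))

weights = {
    "term_match": 0.5,        # overlap between query terms and entity description
    "visit_frequency": 0.3,   # how often the user visited or was near the entity
    "recent_interest": 0.2,   # recent bookings, lookups, plays, etc.
}

burj = {"term_match": 0.9, "visit_frequency": 0.1, "recent_interest": 0.0}
print(round(relevance_score(burj, weights), 2))  # -> 0.48
```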







Thursday, July 2, 2020

How Google Might Rank Image Search Results

Changes to How Google Might Rank Image Search Results

We are seeing more references to machine learning in how Google is ranking pages and other documents in search results.

That seems to be a direction that will leave behind what we know as traditional, or old school, ranking signals.

It’s still worth considering some of those older ranking signals because they may play a role in how things are ranked.

As I was going through a new patent application from Google on ranking image search results, I decided that it was worth including what I used to look at when trying to rank images.

Images can rank highly in image search, and they can also help pages that they appear upon rank higher in organic web results, because they can help make a page more relevant for the query terms that page may be optimized for.

Here are signals that I would include when I rank image search results:

  • Use meaningful images that reflect what the page those images appear on is about – make them relevant to that query
  • Use a file name for your image that is relevant to what the image is about (I like to separate words in file names for images with hyphens, too)
  • Use alt text for your alt attribute that describes the image well and uses text that is relevant to the query terms that the page is optimized for (and avoid keyword stuffing)
  • Use a caption that is helpful to viewers and relevant to what the page is about, and to the query term that the page is optimized for
  • Use a title and associated text on the page the image appears upon that is relevant for what the page is about, and what the image shows
  • Use a decent sized image at a decent resolution that isn’t mistaken for a thumbnail

Those are signals that I would consider when I rank image search results and include images on a page to help that page rank as well.

A patent application that was published this week tells us about how machine learning might be used in ranking image search results. It doesn’t itemize features that might help an image in those rankings, such as alt text, captions, or file names, but it does refer to “features” that likely include those as well as other signals. It makes sense to start looking at these patents that cover machine learning approaches to ranking because they may end up becoming more common.

Machine Learning Models to Rank Image Search Results

Giving Google a chance to try out different approaches, we are told that the machine learning model can be any of many different types of machine learning models.

The machine learning model can be a:

  • Deep machine learning model (e.g., a neural network that includes multiple layers of non-linear operations.)
  • Different type of machine learning model (e.g., a generalized linear model, a random forest, a decision tree model, and so on.)

We are told more about this machine learning model. It is “used to accurately generate relevance scores for image-landing page pairs in the index database.”

We are told about an image search system, which includes a training engine.

The training engine trains the machine learning model on training data generated using image-landing page pairs that are already associated with ground truth or known values of the relevance score.

The patent shows an example of the machine learning model generating a relevance score for a particular image search result from an image, landing page, and query features. In this image, a searcher submits an image search query. The system generates image query features based on the user-submitted image search query.

Rank Image Search Results includes Image Query Features

That system also learns about landing page features for the landing page that has been identified by the particular image search result as well as image features for the image identified by that image search result.

The image search system would then provide the query features, the landing page features, and the image features as input to the machine learning model.

Google may rank image search results based on various factors

Those may rely on separate signals from:

  1. Features of the image
  2. Features of the landing page
  3. A combination of those separate signals following a fixed weighting scheme that is the same for each received search query

This patent describes how it would rank image search results in this manner:

  1. Obtaining many candidate image search results for the image search query
  2. Each candidate image search result identifies a respective image and a respective landing page for the respective image
  3. For each of the candidate image search results, processing:
    • Features of the image search query
    • Features of the respective image identified by the candidate image search result
    • Features of the respective landing page identified by the candidate image search result
    using an image search result ranking machine learning model that has been trained to generate a relevance score that measures a relevance of the candidate image search result to the image search query
  4. Ranking the candidate image search results based on the relevance scores generated by the image search result ranking machine learning model
  5. Generating an image search results presentation that displays the candidate image search results ordered according to the ranking
  6. Providing the image search results for presentation by a user device

Advantages to Using a Machine Learning Model to Rank Image Search Results

If Google can rank image-landing page pairs based on relevance scores generated by a machine learning model, it can improve the relevance of the image search results returned in response to the image search query.

This differs from conventional methods of ranking resources because the machine learning model receives a single input that includes features of the image search query, the landing page, and the image identified by a given image search result to predict the relevance of the image search result to the received query.

This process allows the machine learning model to be more dynamic and give more weight to landing page features or image features in a query-specific manner, improving the quality of the image search results that are returned to the user.

By using a machine learning model, the image search engine does not apply the same fixed weighting scheme for landing page features and image features for each received query. Instead, it combines the landing page and image features in a query-dependent manner.
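The contrast can be sketched in a few lines of toy code. The fixed weights, the "visual_intent" query feature, and the stand-in function for the trained model are all invented for illustration; they are not from the patent.

```python
# A fixed weighting scheme applies the same weights for every query, while a
# learned model (stubbed here as a simple function of the query features)
# can shift the balance between image and landing page features per query.

from typing import Dict

def fixed_combination(image_score: float, page_score: float) -> float:
    return 0.5 * image_score + 0.5 * page_score   # same weights for all queries

def query_dependent_combination(query_features: Dict[str, float],
                                image_score: float, page_score: float) -> float:
    # Toy stand-in for a trained model: visually oriented queries lean on
    # image features, informational queries lean on the landing page.
    image_weight = 0.3 + 0.6 * query_features.get("visual_intent", 0.0)
    return image_weight * image_score + (1.0 - image_weight) * page_score

print(fixed_combination(0.9, 0.4))                                    # 0.65
print(query_dependent_combination({"visual_intent": 1.0}, 0.9, 0.4))  # 0.85
print(query_dependent_combination({"visual_intent": 0.0}, 0.9, 0.4))  # 0.55
```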

The patent also tells us that a trained machine learning model can easily and optimally adjust weights assigned to various features based on changes to the initial signal distribution or additional features.

In a conventional image search, we are told that significant engineering effort is required to adjust the weights of a traditional manually tuned model based on changes to the initial signal distribution.

But under this patented process, adjusting the weights of a trained machine learning model based on changes to the signal distribution is significantly easier, thus improving the ease of maintenance of the image search engine.

Also, if a new feature is added, the manually tuned functions adjust the function on the new feature independently against an objective (i.e., a loss function) while holding existing feature functions constant.

A trained machine learning model, on the other hand, can automatically adjust feature weights if a new feature is added: it can include the new feature and rebalance all of its existing weights appropriately to optimize for the final objective.

Thus, the accuracy, efficiency, and maintenance of the image search engine can be improved.

The Rank Image Search Results patent application can be found at:

Ranking Image Search Results Using Machine Learning Models
US Patent Application Number: 16263398
File Date: January 31, 2019
Publication Number: US20200201915
Publication Date: June 25, 2020
Applicants: Google LLC
Inventors: Manas Ashok Pathak, Sundeep Tirumalareddy, Wenyuan Yin, Suddha Kalyan Basu, Shubhang Verma, Sushrut Karanjkar, and Thomas Richard Strohmann

Abstract

Methods, systems, and apparatus including computer programs encoded on a computer storage medium, for ranking image search results using machine learning models. In one aspect, a method includes receiving an image search query from a user device; obtaining a plurality of candidate image search results; for each of the candidate image search results: processing (i) features of the image search query and (ii) features of the respective image identified by the candidate image search result using an image search result ranking machine learning model to generate a relevance score that measures a relevance of the candidate image search result to the image search query; ranking the candidate image search results based on the relevance scores; generating an image search results presentation; and providing the image search results for presentation by a user device.

The Indexing Engine

The search engine may include an indexing engine and a ranking engine.

The indexing engine indexes image-landing page pairs, and adds the indexed image-landing page pairs to an index database.

That is, the index database includes data identifying images and, for each image, a corresponding landing page.

The index database also associates the image-landing page pairs with:

  • Features of the image search query
  • Features of the images, i.e., features that characterize the images
  • Features of the landing pages, i.e., features that characterize the landing page

Optionally, the index database also associates the indexed image-landing page pairs in the collections of image-landing pairs with values of image search engine ranking signals for the indexed image-landing page pairs.

Each image search engine ranking signal is used by the ranking engine in ranking the image-landing page pair in response to a received search query.

The ranking engine generates respective ranking scores for image-landing page pairs indexed in the index database based on the values of image search engine ranking signals for the image-landing page pair, e.g., signals accessed from the index database or computed at query time, and ranks the image-landing page pair based on the respective ranking scores. The ranking score for a given image-landing page pair reflects the relevance of the image-landing page pair to the received search query, the quality of the given image-landing page pair, or both.

The image search engine can use a machine learning model to rank image-landing page pairs in response to received search queries.

The machine learning model is configured to receive an input that includes:

(i) features of the image search query,
(ii) features of an image, and
(iii) features of the landing page of the image

and to generate a relevance score that measures the relevance of the candidate image search result to the image search query.

Once the machine learning model generates the relevance score for the image-landing page pair, the ranking engine can then use the relevance score to generate ranking scores for the image-landing page pair in response to the received search query.
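Here is a minimal, self-contained sketch of a model with that input and output shape: a single concatenated vector of query, image, and landing page features goes in, and one relevance score comes out. The dimensions, architecture, and randomly initialized weights are assumptions standing in for a trained model.

```python
# Toy relevance-scoring network: concatenated query/image/page features in,
# a single relevance score in (0, 1) out. Not Google's model.

import numpy as np

rng = np.random.default_rng(0)

QUERY_DIM, IMAGE_DIM, PAGE_DIM, HIDDEN = 8, 16, 8, 32

# Randomly initialized weights stand in for a trained model.
W1 = rng.normal(size=(QUERY_DIM + IMAGE_DIM + PAGE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=HIDDEN)
b2 = 0.0

def relevance_score(query_feats, image_feats, page_feats) -> float:
    """One forward pass: concatenate features, apply a hidden layer, squash to (0, 1)."""
    x = np.concatenate([query_feats, image_feats, page_feats])
    h = np.maximum(0.0, x @ W1 + b1)            # ReLU hidden layer
    logit = h @ W2 + b2
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid output

score = relevance_score(rng.normal(size=QUERY_DIM),
                        rng.normal(size=IMAGE_DIM),
                        rng.normal(size=PAGE_DIM))
print(round(score, 3))
```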

The Ranking Engine behind the Process to Rank Image Search Results

In some implementations, the ranking engine generates an initial ranking score for each of multiple image-landing page pairs using the signals in the index database.

The ranking engine can then select a certain number of the highest-scoring image-landing page pairs for processing by the machine learning model.

The ranking engine can then rank the candidate image-landing page pairs based on relevance scores from the machine learning model, or use those relevance scores as additional signals to adjust the initial ranking scores for the candidate image-landing page pairs.

The machine learning model would receive a single input that includes features of the image search query, the landing page, and the image to predict the relevance (i.e., a relevance score) of the particular image search result to the user’s image query.

We are told that this allows the machine learning model to give more weight to landing page features, image features, or image search query features in a query-specific manner, which can improve the quality of the image search results returned to the user.

Features That May Be Used from Images and Landing Pages to Rank Image Search Results

The first step is to receive the image search query.

Once that happens, the image search system may identify initial image-landing page pairs that satisfy the image search query.

It would do that from pairs that are indexed in a search engine index database, using signals measuring the quality of the pairs, the relevance of the pairs to the search query, or both.

For those pairs, the search system identifies:

  • Features of the image search query
  • Features of the image
  • Features of the landing page

Features Extracted From the Image

These features can include vectors that represent the content of the image.

Vectors to represent the image may be derived by processing the image through an embedding neural network.

Or those vectors may be generated through other image processing techniques for feature extraction. Examples of feature extraction techniques can include edge, corner, ridge, and blob detection. Feature vectors can also include vectors generated using shape extraction techniques (e.g., thresholding, template matching, and so on). Instead of, or in addition to, the feature vectors, when the machine learning model is a neural network, the features can include the pixel data of the image.
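As one concrete example of a classical technique named above, here is a small edge-detection sketch that turns an image into a tiny feature vector. It uses a basic Sobel filter; a production system would more likely rely on an embedding neural network, and the toy image is invented for illustration.

```python
# Edge-detection features: mean and max gradient magnitude from a Sobel filter.

import numpy as np

def sobel_edge_features(image: np.ndarray) -> np.ndarray:
    """Return a tiny feature vector: mean and max gradient magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return np.array([magnitude.mean(), magnitude.max()])

toy_image = np.zeros((8, 8))
toy_image[:, 4:] = 1.0                  # a vertical edge down the middle
print(sobel_edge_features(toy_image))   # strong response at the edge
```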

Features Extracted From the Landing Page

These aren’t the kinds of features that I have usually thought about when optimizing images. These features can include:

  • The date the page was first crawled or updated
  • Data characterizing the author of the landing page
  • The language of the landing page
  • Features of the domain that the landing page belongs to
  • Keywords representing the content of the landing page
  • Features of the links to the image and landing page such as the anchor text or source page for the links
  • Features that describe the context of the image in the landing page
  • So on

Features Extracted From The Landing Page That Describe The Context of the Image in the Landing Page

The patent interestingly separated these features out:

  • Data characterizing the location of the image within the landing page
  • Prominence of the image on the landing page
  • Textual descriptions of the image on the landing page
  • Etc.

More Details on the Context of the Image on the Landing Page

The patent points out some alternative ways that the location of the image within the Landing Page might be found:

  • Using pixel-based geometric location in horizontal and vertical dimensions
  • User-device based length (e.g., in inches) in horizontal and vertical dimensions
  • An HTML/XML DOM-based XPATH-like identifier
  • A CSS-based selector
  • Etc.

The prominence of the image on the landing page can be measured using the relative size of the image as displayed on a generic device and a specific user device.

The textual descriptions of the image on the landing page can include alt-text labels for the image, text surrounding the image, and so on.
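One way such a prominence signal could be computed, purely as an assumption, is the fraction of a device viewport that the rendered image occupies:

```python
# Speculative prominence signal: the image's rendered size relative to a
# viewport. The 'generic device' viewport dimensions are invented.

def relative_size(image_w: int, image_h: int,
                  viewport_w: int = 1280, viewport_h: int = 800) -> float:
    """Fraction of the viewport area the image occupies (capped at 1.0)."""
    return min(1.0, (image_w * image_h) / (viewport_w * viewport_h))

print(relative_size(640, 400))   # 0.25 of a 1280x800 viewport
print(relative_size(120, 90))    # a thumbnail-sized image scores much lower
```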

Features Extracted from the Image Search Query

The features from the image search query can include:

  • Language of the search query
  • Some or all of the terms in the search query
  • Time that the search query was submitted
  • Location from which the search query was submitted
  • Data characterizing the user device from which the query was received
  • So on

How the Features from the Query, the Image, and the Landing Page Work Together

  • The features may be represented categorically or discretely
  • Additional relevant features can be created through pre-existing features (Relationships may be created between one or more features through a combination of addition, multiplication, or other mathematical operations.)
  • For each image-landing page pair, the system processes the features using an image search result ranking machine learning model to generate a relevance score output
  • The relevance score measures a relevance of the candidate image search result to the image search query (i.e., the relevance score of the candidate image search result measures a likelihood that a user submitting the search query would click on or otherwise interact with the search result; a higher relevance score indicates that the user submitting the search query would find the candidate image search result more relevant and click on it)
  • The relevance score of the candidate image search result can be a prediction of a score generated by a human rater to measure the quality of the result for the image search query

Adjusting Initial Ranking Scores

The system may adjust initial ranking scores for the image search results based on the relevance scores (see the sketch after this list) to:

  • Promote search results having higher relevance scores
  • Demote search results having lower relevance scores
  • Or both
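A small sketch of that adjustment, assuming a simple linear blend between the initial ranking score and the model’s relevance score; the blend factor and the data are illustrative only.

```python
# Using the model's relevance score as an additional signal to adjust an
# initial ranking score, then re-ranking on the adjusted scores.

from typing import List, Tuple

def adjust_scores(results: List[Tuple[str, float, float]],
                  blend: float = 0.5) -> List[Tuple[str, float]]:
    """results: (result_id, initial_score, relevance_score) -> re-ranked list."""
    adjusted = [(rid, (1 - blend) * initial + blend * relevance)
                for rid, initial, relevance in results]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

candidates = [
    ("image_a", 0.80, 0.40),   # strong initial signals, weak predicted relevance
    ("image_b", 0.60, 0.90),   # weaker initial signals, high predicted relevance
]
print(adjust_scores(candidates))  # image_b is promoted above image_a
```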

Training a Ranking Machine Learning Model to Rank Image Search Results

The system receives a set of training image search queries and, for each training image search query, training image search results for the query that are each associated with a ground truth relevance score.

A ground truth relevance score is the relevance score that should be generated for the image search result by the machine learning model (i.e., when the relevance scores measure a likelihood that a user would select a search result in response to a given search query, each ground truth relevance score can identify whether a user submitting the given search query selected the image search result or a proportion of times that users submitting the given search query select the image search result.)

The patent provides another example of how ground-truth relevance scores might be generated:

When the relevance scores generated by the model are a prediction of a score assigned to an image search result by a human, the ground truth relevance scores are actual scores assigned to the search results by human raters.
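Here is a hedged sketch of deriving click-based ground truth scores as described: the proportion of impressions of a query and result pair that led to a click. The log format and the data are invented for illustration.

```python
# Ground-truth relevance from click logs: clicks / impressions per
# (query, result) pair.

from collections import defaultdict
from typing import Dict, List, Tuple

def ground_truth_scores(log: List[Tuple[str, str, bool]]) -> Dict[Tuple[str, str], float]:
    """log rows: (query, result_id, clicked) -> click-through proportion per pair."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, result_id, was_clicked in log:
        shown[(query, result_id)] += 1
        clicked[(query, result_id)] += int(was_clicked)
    return {pair: clicked[pair] / shown[pair] for pair in shown}

log = [
    ("sail hotel", "img1", True),
    ("sail hotel", "img1", False),
    ("sail hotel", "img2", False),
]
print(ground_truth_scores(log))  # {('sail hotel', 'img1'): 0.5, ('sail hotel', 'img2'): 0.0}
```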

For each of the training image search queries, the system may generate features for each associated image-landing page pair.

For each of those pairs, the system may identify:

(i) features of the image search query
(ii) features of the image and
(iii) features of the landing page.

We are told that extracting, generating, and selecting features may take place before training or using the machine learning model. Examples of features are the ones I listed above related to the images, landing pages, and queries.

The ranking engine trains the machine learning model by processing, for each training image search query:

  • Features of the image search query
  • Features of the respective image identified by the candidate image search result
  • Features of the respective landing page identified by the candidate image search result
  • The respective ground truth relevance score that measures the relevance of the candidate image search result to the image search query

The patent provides some specific implementation processes that might differ based upon the machine learning system used.
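To make the training step concrete, here is a toy training loop: logistic regression over synthetic concatenated features with binary relevance labels. It is only a stand-in for whatever model and objective Google would actually use; the data and dimensions are synthetic.

```python
# Toy training sketch: fit a logistic-regression "relevance model" to
# synthetic (features, ground-truth label) pairs by gradient descent.

import numpy as np

rng = np.random.default_rng(1)
n, dim = 200, 12                       # 200 training pairs, 12 concatenated features
X = rng.normal(size=(n, dim))          # query + image + landing page features
true_w = rng.normal(size=dim)
y = (1 / (1 + np.exp(-(X @ true_w))) > 0.5).astype(float)   # ground-truth relevance labels

w = np.zeros(dim)
lr = 0.1
for _ in range(500):                   # simple gradient descent on log loss
    preds = 1 / (1 + np.exp(-(X @ w)))
    w -= lr * X.T @ (preds - y) / n

accuracy = np.mean((1 / (1 + np.exp(-(X @ w))) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```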

Take Aways to Rank Image Search Results

I’ve provided some information about what kinds of features Google may have used in the past in ranking image search results.

Under a machine learning approach, Google may be paying more attention to features from an image query, features from images, and features from the landing page those images are found upon. The patent lists many of those features, and if you spend time comparing the older features with the ones under the machine learning model approach, you can see that there is overlap, but the machine learning approach covers considerably more options.





