Optimizing global content in Internet search
© Mark The Globe/University of Vienna
Table of Contents

Document information
Profile of the authors
Chapter 1: Search engines and keywords
1.1 Introduction
1.2 Keywords
1.3 Multicultural and multilingual factors
1.4 The Keyword Effectiveness Index
Chapter 2: Managing keywords for Global SEO
2.1 Introduction
2.2 Incorporating keywords into websites
2.3 Distinguishing keywords from terminology in the CAT environment
2.4 A standard format for keyword data
2.5 Sample markup for KEI-enabled terminology entries
Chapter 3: SEO and the translation process
3.1 Introduction
3.2 Skills of the translator
3.3 Information needed by the translator
3.4 Translation tools and resources
3.5 The translation process
3.6 Integrating SEO keywords into the translation process
3.6.1 TM contains keywords in one language
3.6.2 TM contains no keywords
3.6.3 TM contains keywords in both languages
3.7 Identifying keywords on a web page
3.8 Conclusion
Chapter 4: Implications of SEO requirements for CAT tools
Chapter 5: Conclusions and future work
Bibliography
Document information

This document presents a proposal for managing search engine keywords and their translation in a multilingual context, for purposes of website optimization. It is submitted to the University of Vienna as a report reflecting preliminary research conducted by the authors.

Authors:
• Kara Warburton, Termologic
• Barbara Inge Karsch, BIK Terminology

Date: August 20, 2012
Profile of the authors

Kara Warburton holds a BA in Translation, an MA in Terminology, a B.Ed., and is currently completing a PhD in commercial terminology management. For 15 years she was the chief terminologist for IBM, where she spearheaded a global terminology management program to support both the authoring and translation divisions. In this capacity she led the development and implementation of a multilingual terminology database and a term extraction process. She also developed a process for leveraging multilingual terminology to enhance search on the company's Intranet. She has also taught university courses in terminology management and, since 2006, she has offered consultancy services and delivered training workshops through Termologic (www.termologic.com). She is the International Chair of ISO TC 37, which is responsible for terminology standards.

Barbara Inge Karsch holds a BA and an MA in Translation and Interpretation. For 14 years, she worked as terminologist for English and German at J.D. Edwards and Microsoft. During that time she designed and implemented two large-scale terminology management systems, and trained or mentored hundreds of translators, international project managers, content publishers and terminologists. In 2010, she started BIK Terminology, her terminology consulting and training business. She teaches at New York University and the Lessius University College and is a US delegate to ISO TC 37. She maintains a blog on terminology issues at www.bikterminology.com.
Chapter 1: Search engines and keywords

1.1 Introduction

Search engine optimization (SEO) is the process of increasing traffic to a website by improving the site's visibility, or "rank," in search engines. A search engine (SE) is an application, such as Google or Yahoo, that people use to find information on the Internet. The higher and more frequently a website appears on search results pages, the more visitors it will receive. A specific goal is to make the website appear on the first page of the search results (among the top ten returned results), since the majority of users will not check subsequent pages. SEO forms part of global online marketing strategies and therefore has to adopt a multilingual and multicultural approach in order to realize the greatest success in diverse target markets.

Google is the most popular SE in many parts of the world. Still, even in these areas other SEs are in use, and in some countries, like China, Japan, and the Czech Republic, they have a larger market share. SEO therefore has to target different SEs in certain markets.

Each SE has its own methods for ranking websites. More than 200 different factors are reportedly used in Google's ranking calculations [1]. These methods are not disclosed, largely to minimize their abuse by website developers. However, the major ranking factors have been determined through independent research [2].
1.2 Keywords

While the precise methods used by SEs to rank websites are unknown, the search "keyword" is undeniably used by all SEs to rank websites. A search keyword is the word or words entered into the search field of the SE by a user looking for information on the Internet. Having a match between the keywords entered by the user and the words on a website will help to raise that website's ranking by the SE. The frequency and location of the keywords on the website may also affect the ranking.

There are no standards for keywords. People use different keywords to search for the same thing, and they use the same keywords to search for different things. These behaviours are a consequence of the properties of synonymy and homonymy that are inherent in all languages.

Synonymy occurs when different words (or "terms") have the same meaning, such as cell phone and mobile phone. An SE user might choose either of these terms, among others, when searching for a website to purchase a phone online. Some terms are close in meaning but not exactly the same; they could be considered as "near synonyms," such as cell phone and smart phone. A person may want the advanced features of a smart phone but use the more familiar keyword cell phone, not realizing that these two terms refer to different things. Some users may not even be aware of the better search term, in this example smart phone. Furthermore, there are spelling variants, such as cellphone (without the space), abbreviated forms such as cell, and variants such as cellular. One could come up with dozens of terms, abbreviations, and variants used by people to search for this single concept.

[1] http://en.wikipedia.org/wiki/Search_engine_optimization
[2] For example: http://www.seomoz.org/article/search-ranking-factors
Homonymy occurs when one word has multiple meanings. The word cell might informally be used to refer to a cell phone, but it has other meanings in science, medicine, and engineering. There are jail cells, battery cells, cancer cells, cell microprocessors, and cell bikes, to name a few. A cell is also a form of music, a television series, a scientific journal, a type of outlawed organization, and a novel written by Stephen King. Furthermore, when capitalized, CELL is an acronym that means many things, such as Computer Enhanced Language Learning and Continuing Education Lending Library. (Many SEs do not differentiate between uppercase and lowercase letters.) Clearly, using only the word cell on a website or in a search query without further clarification and precision would cause the SE to suggest websites that are irrelevant to the user. Homonymy is common; for the 2000 most-polysemous terms in English, the typical verb has more than eight common senses and the typical noun has more than five [3].

The term cell is an example of a keyword that is too "general" because its meaning is ambiguous. Another type of ambiguous keyword is one that has a generic-specific relationship with other keywords. For instance, a person may use the keyword bone disease when searching for information about osteoporosis. The results will contain web pages about other types of bone disease such as rickets and arthritis. On the other hand, using precise keywords can also lead to undesirable search results. For instance, a person may use the term global warming to learn about climate change, but the latter is a broader concept, and therefore some important websites that would interest the user may be missing from the search results. A "narrow" search keyword produces fewer search results than a "broad" search keyword. Depending on the search results, substituting a slightly broader or narrower keyword can improve the results by expanding or restricting the search respectively.
1.3 Multicultural and multilingual factors

The aforementioned challenges (synonymy, homonymy, and keywords that are too narrow or too broad) are further complicated by cultural factors. Patterns of language use can vary from one region to another, even within the same language. For instance, the term handset to refer to cell phones is common among English speakers in certain regions, yet non-existent in others. At a certain point in time, the term smart phone may have become well established in some markets but not in others. SEO needs to consider the terminology that is predominant in the targeted market, and also adapt to changes in language use.

Considering these issues from the perspective of different languages adds yet another dimension. Acronyms are frequently not transferable between languages. For instance, the International Organization for Standardization (ISO) is named the Organisation internationale de normalisation in French. It would be logical to assume that OIN is the French equivalent of ISO. But this acronym is not used in French for the organization; the English acronym ISO has been adopted in French. It turns out that in France, the acronym OIN is widely used to mean opération d'intérêt national, a government-regulated urbanization program [4]. That may be the reason why ISO chose not to adopt OIN as its French acronym. Such language-specific "terminological conflicts" affect the formation and acceptability of variants and other synonyms.

The previous example demonstrates that when choosing keywords for translated web pages, verifying equivalence of meaning between search keywords in the source language and the target language is critical, in addition to establishing how a given keyword is ranked by search engines.

[3] http://en.wikipedia.org/wiki/Concept_search
[4] http://fr.wikipedia.org/wiki/Opération_d'intérêt_national
A synset is a set of synonyms (including spelling variants and abbreviations). Due to various cultural, social, and linguistic factors, synsets in different languages for the same concept will comprise different numbers of members. For example, the term USB drive has many synonyms in English (USB flash drive, memory stick, thumb drive, pen drive, and more), but some of these terms may not have equivalents in other languages. Some languages may have more terms for this concept than English. Acronyms don't even exist in some languages. Since synsets are a fundamental framework for establishing SEO keywords, and synsets are language-specific, it is clear that the process for establishing SEO keywords must be language-specific as well and NOT be influenced by any one particular language such as English.

While it is necessary to establish equivalence of meaning between keywords in different languages, which involves the skills of a translator, keywords in one language can NOT simply be "translated" into another language. Marketing factors and search engine algorithms must be taken into consideration. The most accurate and precise "translation" chosen by a translator for a given term may produce significantly fewer search hits than the source term, due to conditions in the target market. Choosing a keyword in the target language that is based solely on translation considerations (equivalency of meaning) may therefore result in dramatically different levels of search effectiveness in the two markets.
1.4 The Keyword Effectiveness Index

While the aforementioned linguistic factors need to be taken into account, two non-linguistic parameters are even more important for SEO: search volume and competition. Search volume refers to how frequently the keyword is used by people when searching for information about a certain topic. Competition refers to the number of exact results (hits) that exist for that keyword because of its presence on different web pages. The goal is to identify and implement keywords that are frequently used as search keywords by users of an SE while at the same time being less frequently used by competitor websites. The more searches, the higher the expected traffic. The more competition, the more difficult it is to achieve a high ranking among the retrieved hits, and the more difficult it is for users to select the web page that they are interested in among the multitude of search results.

A metric, the Keyword Effectiveness Index (KEI), uses these two parameters to measure the effectiveness of keywords in online searches:

KEI = (Searches^2) / Competition

The higher the KEI value, the more effective the keyword is for search results. A value greater than 10 indicates a very good keyword, a value greater than 1 is average, and a value below 1 is poor.

Search keywords are thus determined by selecting, from a list of translation candidates, those terms that have the highest KEI values as determined in the specific conditions of the target market, the search behaviour of consumers in that market, and the search engines prevalent in that market.
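The KEI formula and its rating bands can be sketched in a few lines of Python. This is only an illustration: the function names, the candidate keywords, and all search and competition figures are hypothetical, and the treatment of a value of exactly 1 (which the text leaves open) is an assumption.

```python
def kei(searches, competition):
    """Keyword Effectiveness Index: searches squared divided by competition."""
    if competition <= 0:
        raise ValueError("competition must be a positive number of hits")
    return searches ** 2 / competition


def rate(value):
    """Map a KEI value to the bands described in the text."""
    if value > 10:
        return "very good"
    if value > 1:
        return "average"
    # Assumption: a value of exactly 1 is grouped with "poor".
    return "poor"


# Hypothetical (searches, competition) figures for one target market.
candidates = {
    "cell phone": (12000, 9000000),
    "mobile phone": (8000, 2500000),
    "handset": (900, 1200000),
}

# Rank the translation candidates from highest to lowest KEI.
ranked = sorted(candidates, key=lambda k: kei(*candidates[k]), reverse=True)
for keyword in ranked:
    value = kei(*candidates[keyword])
    print(f"{keyword}: KEI = {value:.2f} ({rate(value)})")
```

With these invented figures, the candidate with the most searches is not the one with the highest KEI, because its competition is much higher.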
Chapter 2: Managing keywords for Global SEO

2.1 Introduction

Managing keywords as part of SEO involves identifying the most effective keywords for a website and enabling those keywords on the website in the locations most likely to be used by search engines in their ranking algorithms. It can be beneficial to use several synonymous keywords that have a high KEI on the same web page.

Organizations that provide translated versions of their websites need to incorporate suitable keywords into the translated web pages in different languages. As explained earlier, keywords for each language are selected based on both translation considerations (equivalency of meaning) and marketing factors (the KEI). Thus, translators need to be provided with keywords that have an acceptably high KEI, and they need to be trained in the importance of preserving these keywords so that they do not substitute them with other possible translations. Keywords also need to be stored for reuse on other web pages that deal with the same topic.

In translation environments that use a CAT tool, the existing terminology lookup function can be leveraged to provide translators with keywords during the translation process. A keyword database can be developed and activated in this terminology function as a "special" kind of terminology.
2.2 Incorporating keywords into websites

Keywords can be incorporated into a website either during the creation of the site's content or afterwards. However, SEO is more often undertaken after it is realized that the site has not attracted sufficient visitors or generated the desired revenues, i.e. after the site has been created.

If keywords are added to a site during its development, then the keywords for the source market are already in place when the site is localized for additional geographical markets. This means that the site contains a carefully selected terminology, and possibly a higher terminological density than would normally occur in a text that has not been optimized for SEO. As explained earlier, this terminology cannot be simply translated. An SEO-optimized set of keyword equivalents in the target language needs to be developed before the site is translated. These keywords need to be made available to translators in a format that they can easily use. The keywords need to be archived, for example in a specially designed terminology database, so that they can be used again. Translators will have to be trained in how to implement keywords in the target text, since SEO is not something that translators are typically familiar with.

Computer-assisted translation (CAT) tools are widely used in today's translation industry. An understanding of how SEO keywords can be integrated into a translation production environment that leverages CAT tools is therefore needed. However, different CAT tools are based on different architectures and technologies. Some use translation memory (TM), a traditional approach with a long history. Some have developed other innovative approaches as an alternative to TM, such as dynamically generated bitext alignments and text databases. Some are comprehensive production solutions that include integrated workflows and project management features which are appealing to large translation companies, while others offer lighter feature sets that focus on translating text and are popular with small companies and freelancers. Architectures vary from client/server to standalone desktop products, and from self-contained computer applications to Web services. Given this range of conditions, different technical implementations for SEO optimization during translation will likely be developed by the different CAT tool suppliers. Determining some baseline requirements that any SEO optimization tool used in a translation process should support can help to drive consistency and uniformity in SEO for the translation and localization industry.

Keywords need to be placed at certain locations in the HTML content:
• Title element
• Headings (H1, H2, H3)
• Italics, bold, underlined content
• IMG ALT descriptions
• As part of the normal text
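As an illustration of the locations listed above, the following minimal sketch uses Python's standard html.parser module to report where a keyword occurs on a page. The class name, the function name, the location labels, and the sample page are all hypothetical, and the simple substring matching is an assumption; a production tool would need proper tokenization for each language.

```python
from html.parser import HTMLParser


class KeywordLocator(HTMLParser):
    """Collect page text by location so keyword placement can be checked."""

    # Title, headings, and emphasis tags from the list above.
    TRACKED = {"title", "h1", "h2", "h3", "b", "strong", "i", "em", "u"}

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tracked tags that are currently open
        self.text = {}        # location label -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt") or ""
            self.text.setdefault("img alt", []).append(alt)
        elif tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag of this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Text outside any tracked tag counts as normal body text.
        for loc in set(self.open_tags) or {"body text"}:
            self.text.setdefault(loc, []).append(data)


def keyword_locations(html, keyword):
    """Return the sorted location labels whose text contains the keyword."""
    parser = KeywordLocator()
    parser.feed(html)
    parser.close()
    needle = keyword.lower()
    return sorted(loc for loc, chunks in parser.text.items()
                  if needle in " ".join(chunks).lower())


page = """<html><head><title>Buy a mobile phone online</title></head>
<body><h1>Mobile phone deals</h1>
<p>Compare every <b>smart phone</b> we stock.</p>
<img src="p.png" alt="mobile phone on a desk"></body></html>"""

print(keyword_locations(page, "mobile phone"))
print(keyword_locations(page, "smart phone"))
```

A check of this kind could be run on both the source page and the translated page to confirm that each keyword occupies the same locations in both.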
Authors and translators work with different content development tools. Authors deal with different file formats and editors (XML, MS Word, FrameMaker, etc.), and sometimes with controlled authoring software such as Acrocheck. Translators often use computer-assisted translation tools, such as SDL Trados and MultiTrans. Any software that is developed for the purpose of adding keywords to web pages will therefore need to either integrate with these different tools, or be offered as a standalone tool that can work in any of these technical environments. Because it is not uncommon for organizations that have a global or multiregional presence to use multiple content and translation service providers, the likelihood that different tools need to be supported is high. To optimize the leverageability and interoperability of the keyword data, and thus preserve the investment incurred to determine keywords, it is recommended to store keywords in a format that is not locked into a specific tool, i.e. in a format that can be used by multiple tools.
2.3 Distinguishing keywords from terminology in the CAT environment

At the moment when the translator is adding keywords to a translated web page, only keywords, and not other types of terms, should be shown to the translator in the CAT environment. Otherwise, the translator may select terms that have a low KEI. To enable this feature, there are two options:
1. Store the keywords in a separate, dedicated termbase.
2. Store the keywords in the same termbase that is used for regular translation purposes, but mark the keywords with a special flag (data category) to distinguish them from ordinary terms.

The option to be selected depends on the auto-lookup terminology function available in the CAT tool. Option 2 has the advantage of storing all terminology and keywords in the same termbase, thus reducing duplication and redundancy between termbases. However, option 2 can only be used if the auto-lookup function and manual searches for translation candidates can be restricted to the keywords at the moment when the translator wants to implement keywords on a web page. It needs to be possible for the translator to easily switch back and forth between keyword search and term search. Option 2 also requires the ability to filter keywords from ordinary terms for other purposes such as export and display.

Storing the keywords in a separate, dedicated termbase (option 1) can lead to redundancy and duplication between terms and keywords, which will be stored in separate repositories even though they represent the same concept. However, this method must be used if option 2 is unavailable. In this case, the translator will need to enable the keyword termbase, and disable other termbases, in the CAT tool when inserting keywords into the translated web page.

We recommend completely disabling all the non-keyword terminology entries when inserting keywords into the translation. Otherwise, there is a danger that the translator may choose a translation whose KEI has not yet been determined.

If a required keyword is missing in the target language, the translator should request one from the SEO service provider. In doing so, the translator may provide the SEO service provider with suggestions from existing termbases. Newly provided keywords should be inserted into the keyword termbase for future use.
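The switch between keyword lookup and ordinary term lookup in a combined termbase (option 2) can be sketched as a simple filter. This is a minimal illustration, not a description of any CAT tool's behaviour: the record structure, the field names, the German sample terms, and the KEI values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermRecord:
    term: str
    language: str
    is_keyword: bool             # the keyword flag from option 2
    kei: Optional[float] = None  # None when no KEI has been determined


def lookup_candidates(termbase, language, keywords_only):
    """Return lookup candidates for one language.

    With keywords_only=True, ordinary terms and entries without a KEI
    are hidden, and the remaining keywords are ordered by descending KEI.
    """
    hits = [r for r in termbase if r.language == language]
    if keywords_only:
        hits = [r for r in hits if r.is_keyword and r.kei is not None]
        hits.sort(key=lambda r: r.kei, reverse=True)
    return [r.term for r in hits]


# Hypothetical combined termbase holding both keywords and ordinary terms.
termbase = [
    TermRecord("cell phone", "en", True, 16.0),
    TermRecord("Mobiltelefon", "de", True, 14.2),
    TermRecord("Handy", "de", True, 31.5),
    TermRecord("Funktelefon", "de", False),
]

print(lookup_candidates(termbase, "de", keywords_only=True))
print(lookup_candidates(termbase, "de", keywords_only=False))
```

In keyword mode the ordinary term is never offered, which reflects the recommendation to hide non-keyword entries while keywords are being inserted.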
2.4 A standard format for keyword data

Keywords share features with terms, or terminology, and therefore certain methods from the field of terminology management may be appropriate and effective when deciding on a keyword management strategy. In particular, both keywords and terms:
• Denote concepts
• Are usually short – one or several words
• Are primarily nouns
• May exist in sets of more than one (synonyms and variants)

A standard for XML markup for terminology data has already been developed by the International Organization for Standardization: TermBase eXchange (TBX). This standard is suitable for encoding keywords in XML. It also has the advantage of already being supported by many CAT and controlled authoring software tools.

TBX needs to be customized in order to be suitable for representing keywords. In particular, many of the types of data (called "data categories") offered in the default TBX are not needed for keyword management and can be ignored. For this purpose, we can use TBX-Basic as a model. TBX-Basic is a simpler version of TBX, adopted as a standard for the localization industry, which uses a reduced set of data categories (25 versus over 125). For keyword management, several data categories, such as the KEI, need to be added.

The following table lists the primary data categories that are required for search keywords, along with their standard TBX XML representation, if available, and data types.

Table 1. Data categories and XML markup for terminological entries containing keywords

Data categories for keywords | TBX representation | Data type
Language/locale | | Picklist. ISO 639 language codes as in BCP 47.
Keyword | | Text
Keyword flag | | Picklist: fullForm, acronym, abbreviation, shortForm, variant, phrase, keyword
Part of speech | | Picklist: noun, verb, adjective, adverb, properNoun, other
Subject field | | Text or picklist (preferred)
Project identifier | | Picklist
Company identifier | | Picklist
Definition | | Text
KEI-related data categories: | |
  - KEI | | Numeric
  - Search volume | | Numeric
  - Competition | | Numeric
  - Ranking | | Numeric
  - Date when KEI was calculated | | Date in ISO 8601 format: YYYY-MM-DD
  - Note | | Text
Market | | Text or picklist (preferred). If the values required correspond to countries, use the ISO 3166 country codes. If the values correspond to locales, use the codes from IETF RFC 4646 or its successor, as identified in IETF BCP 47.
Translation comment | | Note
Additional data categories can be adopted, as desired. TBX-Basic should be used as the model. For instance, there are data categories available to record administrative information such as the name of the person who created or updated the entry or a specific field in the entry, and the date of the change. These types of information are often automatically inserted into the entry by the terminology management system. If definitions are included, it is also recommended to provide a field for the source of the definition. Refer to TBX-Basic for information about these and other data categories.
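To make the data categories and data types of Table 1 more concrete, here is a minimal sketch of a keyword record with basic validation. The class name, the field names, the choice of which fields are required, and the sample values are assumptions made for illustration; they are not part of TBX or TBX-Basic.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Picklist values taken from Table 1.
KEYWORD_FLAGS = {"fullForm", "acronym", "abbreviation", "shortForm",
                 "variant", "phrase", "keyword"}
PARTS_OF_SPEECH = {"noun", "verb", "adjective", "adverb", "properNoun", "other"}


@dataclass
class KeywordRecord:
    language: str                 # BCP 47 tag, e.g. "de-AT"
    keyword: str
    part_of_speech: str
    kei: float
    search_volume: float
    competition: float
    ranking: float
    kei_date: str                 # ISO 8601 calendar date, YYYY-MM-DD
    keyword_flag: Optional[str] = None
    subject_field: Optional[str] = None
    market: Optional[str] = None  # e.g. an ISO 3166 country code
    definition: Optional[str] = None
    note: Optional[str] = None

    def __post_init__(self):
        if self.part_of_speech not in PARTS_OF_SPEECH:
            raise ValueError(f"unknown part of speech: {self.part_of_speech}")
        if self.keyword_flag is not None and self.keyword_flag not in KEYWORD_FLAGS:
            raise ValueError(f"unknown keyword flag: {self.keyword_flag}")
        # Raises ValueError if the string is not a valid ISO calendar date.
        date.fromisoformat(self.kei_date)


# Hypothetical record for one German keyword in the Austrian market.
record = KeywordRecord(
    language="de-AT", keyword="Handy", part_of_speech="noun",
    kei=31.5, search_volume=5000, competition=793651, ranking=3,
    kei_date="2012-08-20", keyword_flag="keyword", market="AT",
)
print(record)
```

Validating picklist values and dates when a record is created keeps malformed entries out of the keyword database before they are exported to any tool.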
The company identifier is useful if an organization such as a translation service provider manages keywords for several companies. The keyword flag distinguishes terms from keywords in combined termbases, which is further explained below.

Not all these data categories need to be documented for each keyword. For instance, most if not all keywords will not require definitions or subject fields in order to be understood by translators and other content developers. We suggest that the following data categories should be mandatory: language/locale, keyword, part of speech, and all those in the KEI group except the note. As will be explained later, the keyword flag is also mandatory in combined termbases.

As is done for terms in TBX, the information about a keyword is recorded in an "entry" (<termEntry>). If more than one keyword denotes a given concept and is used for a given web page (variants and synonyms), ALL such keywords MUST be recorded in the same entry. Entries contain keywords in any number of languages; all keywords for all languages for a given concept are recorded in the same entry.

As for the new KEI-related data categories, we propose two approaches for grouping them together. The first approach uses the <adminGrp> element. Unfortunately, this grouping element does not allow the <date> as a child element, which we would need to record the date that the KEI was calculated. In TBX, dates are only available in <transacGrp> because they are associated with changes to the entry, such as when the entry or a field in an entry is created or updated. Associating a date with external information such as a KEI value has no known precedent in termbases. To address this problem, we introduce a new type attribute value on the element for recording the date: ... This introduced type attribute value is recorded in the XCS file.

Using the <adminGrp> element has one disadvantage. It portrays the KEI information as "administrative" in nature, rather than "descriptive". While this has no bearing on the machine readability and processability of the data, it could be contested from a theoretical viewpoint.

Approach 2 addresses that concern. It uses the <termNoteGrp> element to group the KEI data categories. The KEI is recorded in a <termNote>, similar to other data categories that "describe" the term, such as the part of speech and the term type. This grouping element allows a <transacGrp> in which we can record the date associated with the KEI. Unfortunately, only one <termNote> is allowed in the group; therefore, the additional KEI data categories such as competition must be recorded in elements with user-specific type values. These introduced type attribute values are recorded in the XCS file. In order to support <termNoteGrp>, the entry markup must use <ntig> instead of the simpler <tig> (<ntig> allows the deeper nesting structure required); therefore, this approach adopts a slightly more complex entry structure than the former.

Both approaches are TBX-compliant because all the newly introduced data categories are specified in the customizable XCS file and do not affect the TBX core-structure DTD.

TBX arranges data categories at one of three hierarchical levels of an entry: concept level (termEntry), language level (langSet), and term level (tig or ntig). The following sample entries demonstrate the two options described above for keywords in TBX-compliant markup. New markup is introduced for the data categories required for the KEI group, in the form of type attribute values.
2.5 Sample markup for KEI-enabled terminology entries
Sample markup for entries containing KEI information
[XML listing not recoverable from the source; the only surviving element content is the transaction value "modification".]
Each keyword is documented in a <tig> or <ntig> group. All keywords for a given language/market combination are contained in the <langSet> group for that language/market. All keywords for all languages for a given concept are documented in the same <termEntry>. For information about how to set up a complete XML document instance in TBX-compliant format, including required front matter and back matter, refer to the TBX standard, ISO 30042:2008.

The previous XML markup model provides translators with sufficient information to select and prioritize keywords. All keywords in the same entry are assumed to have the same meaning; however, descriptive information provided by the definition and subject field is available to confirm semantic equivalency when desired. The KEI group of data categories provides an objective measure of the relative importance of keywords so that they can be ranked; keywords with the highest KEI can be inserted into the most important locations of a web page, such as the title and headings.
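The entry structure described above (one entry per concept, one language group per language/market, one term group per keyword) can be sketched with Python's standard xml.etree.ElementTree module. This is a hypothetical illustration, not the authors' sample markup: the nesting follows the general shape of the second approach from section 2.4 as we read the TBX core structure, and the type attribute values "keywordEffectivenessIndex", "searchVolume", and "competition", the entry id, and all figures are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespace with the reserved "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def keyword_ntig(keyword, pos, kei, searches, competition, kei_date):
    """Build one <ntig> holding a keyword and a grouped set of KEI data."""
    ntig = ET.Element("ntig")
    term_grp = ET.SubElement(ntig, "termGrp")
    ET.SubElement(term_grp, "term").text = keyword
    ET.SubElement(term_grp, "termNote", type="partOfSpeech").text = pos

    # Hypothetical type values: assumptions made for this sketch only.
    kei_grp = ET.SubElement(term_grp, "termNoteGrp")
    ET.SubElement(kei_grp, "termNote",
                  type="keywordEffectivenessIndex").text = str(kei)
    ET.SubElement(kei_grp, "admin", type="searchVolume").text = str(searches)
    ET.SubElement(kei_grp, "admin", type="competition").text = str(competition)

    # The date of the KEI calculation travels in a transaction group.
    transac_grp = ET.SubElement(kei_grp, "transacGrp")
    ET.SubElement(transac_grp, "transac",
                  type="transactionType").text = "modification"
    ET.SubElement(transac_grp, "date").text = kei_date
    return ntig


# One entry per concept; one langSet per language; one ntig per keyword.
entry = ET.Element("termEntry", id="c1")
lang_set = ET.SubElement(entry, "langSet", {XML_LANG: "en"})
lang_set.append(keyword_ntig("mobile phone", "noun", 25.6,
                             8000, 2500000, "2012-08-20"))
lang_set.append(keyword_ntig("cell phone", "noun", 16.0,
                             12000, 9000000, "2012-08-20"))

print(ET.tostring(entry, encoding="unicode"))
```

A real implementation would also need to declare any introduced type values in the XCS file and validate the result against the TBX core structure, which this sketch does not attempt.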
Chapter 3: SEO and the translation process

3.1 Introduction

For a given translation project, the appropriate skills, processes and tools have to be applied. If one of these areas is weak, the project will experience problems, such as time delays and reduced translation quality. The following strategies can help to address these and other challenges:
• Ensure that the project has the appropriate people resources, e.g. translator, SEO keyword specialist and editor
• Provide easy-to-use SEO keyword and terminology databases
• Ensure that the SEO keyword and terminology databases contain the highest quality content
• Provide tools that support the workflow and supply the functionality that allows an individual translator or a team to perform the translation and SEO keyword integration process flawlessly

In the following sections, we will drill deeper into the requirements of skills, resources, processes and tools.
3.2 Skills of the translator

To be successful, translators must possess a variety of skills. In 2001, Budin [5] listed eight different, yet interlinked sets of skills that translators need to varying degrees during their work. The following six types are the most relevant for translation projects that require keyword integration:
• Linguistic skills
• Terminology skills
• Application of translation methodologies
• Subject matter knowledge
• Information management skills
• Computer skills

When translating a website, translators apply all of these skills at different times and to varying degrees in order to arrive at the best possible translation of a text segment. If a translator is asked to select keywords from a database, another skill requirement is added to an already highly complex convergence of skills. The final goal of the process is a translated text that must not only transfer information to the reader but also drive traffic to the site.

Thus the translator chosen to undertake a translation project that requires SEO should possess advanced skills in analyzing terminology, concepts, and search keywords. He or she should understand the goals of SEO and be familiar with search engines and the search behaviour of internet users in the target market. The translator must also be experienced in using terminology databases for automatic lookup in the CAT tool. One must not assume that all translators who use the CAT tool actually have this particular experience, as many users of CAT tools only use the translation memory (TM).

[5] Wissensmanagement in der Translation, 2001
3.3 Information needed by the translator

To succeed, the translator must:
• Understand the overall goal of a translation project with SEO keyword integration
• Understand the process flow (who does what at what time)
• Know where and when to place keywords in the text
• Understand the difference between terminology and keywords
• Understand the different data categories of an entry in the database
3.4 Translation tools and resources

Translating means rendering content that exists in a source language into a target language and in documented form. The basic process itself has been the same for centuries. What has changed, however, are the tools and resources used as well as the skills required to perform the process. While translators use a variety of media and/or tools, anything but computer-assisted translation (CAT) tools can be ruled out as the main translation medium for this process. CAT environments support translators via resources such as translation memories (TMs) and termbases (TBs); they also supply functionality, for instance, automated spell checkers or search-and-replace of words or terms. These functions play a role in the process of integrating SEO keywords into the translation process.

A TM is a collection of source language sentences and their corresponding translations in the target language. In a CAT tool, a TM is automatically created in a database as the translator is translating a given text. Each time he or she translates a sentence, the original (source language) sentence is stored in the TM database along with the translation of that sentence created by the translator. Over time, the TM grows into a large collection of sentence “pairs.” These pairs of source language and target language sentences can be reused to translate future projects whenever they contain the same or similar sentences.

Companies and organizations that translate their content can amass large TMs that they leverage to translate new content more efficiently and at less cost than translating without any TM. TMs are viewed as a significant commercial asset and they are therefore carefully managed like other commercial assets to ensure their continued value. TMs are reused to translate additional texts from the same project as well as for future versions of the same project or for related projects.
It is therefore important that if a translated text is changed in a text editor outside of the CAT tool, these changes are also added to the TM, so that the TM is maintained in the best possible quality and continues to reflect the final version of the translation.

As discussed in sections 2.3 and 2.4, a terminology database (TB) is an essential component of CAT tools. It must not be confused with a TM. Translated terms and translated sentences are stored in separate databases; they contain different types of information and are used in different ways and for different purposes. A TB contains information about concepts and the terms that designate these concepts as well as a wide range of other information such as grammatical properties and usage indications. A TM contains full sentences (often called “segments”) of text as well as some administrative information such as the date the sentence was translated and the name of the translator. Standards such as TMX and TBX exist for translation memories and terminology databases, respectively.

Tools and resources must not be the drivers of a process. Rather, they should support the process and lead to the desired outputs. It is necessary to analyze inputs and outputs as well as the process steps in order to decide on the correct setup of resources, i.e. TMs and TBs.
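The difference between the two resources can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the class and field names are hypothetical, the `is_keyword` flag is one possible way to mark keywords in a combined database, and the sample sentence pair, date and translator name are invented. Neither TMX nor TBX prescribes these structures.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TMSegment:
    """One translation unit: a sentence pair plus administrative data."""
    source: str
    target: str
    translated_on: date
    translator: str


@dataclass
class Term:
    text: str
    is_keyword: bool = False  # hypothetical flag separating keywords from terms
    usage_note: str = ""


@dataclass
class TBEntry:
    """One concept with the terms that designate it, grouped by language."""
    definition: str
    terms: dict = field(default_factory=dict)  # language code -> list of Term


# Invented sample data for illustration.
segment = TMSegment(
    source="Insert the USB flash drive.",
    target="Stecken Sie den USB-Stick ein.",
    translated_on=date(2010, 1, 15),
    translator="A. Translator",
)
entry = TBEntry(
    definition="Data storage device with flash memory and a USB interface",
    terms={
        "en": [Term("USB flash drive"), Term("memory stick", is_keyword=True)],
        "de": [Term("USB-Stick")],
    },
)
english_keywords = [t.text for t in entry.terms["en"] if t.is_keyword]
print(english_keywords)
```

The TM record is organized around a sentence pair, whereas the TB record is organized around a concept; this is why the two cannot substitute for each other.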
3.5 The translation process

Translating a document or a set of documents mandates that any terminology or other significant expressions are consistently used. Often, a given term or expression can be translated in various ways, each such variation being semantically correct. Likewise, even in the source language a given concept can be expressed in different ways. However, such variations are generally not desired, as they can make the overall content more difficult to understand by introducing unwanted ambiguity and requiring readers to mentally “process” a more complex vocabulary. Even more important from a commercial perspective, inconsistent terminology in the source language raises translation costs by reducing the number of times that a matching sentence will be found in the TM.

That is not to say that abbreviations and short forms may not be used along with their full forms. But a goal in both authoring and translation is to avoid using linguistically unrelated synonyms. For example, in a text about storage media the term USB flash drive might be designated as the standard term to use for a data storage device that includes flash memory with an integrated Universal Serial Bus (USB) interface [6]. It is acceptable to use the abbreviation USB for Universal Serial Bus after the full form was mentioned on first occurrence. At this point, the reader already has the necessary knowledge to understand the abbreviated form. It could be confusing, though, to use synonyms or quasi-synonyms, such as key chain, jump drive or memory stick, to refer to a USB flash drive. Simply speaking, such synonyms add no value to the text but rather reduce clarity and concision.

Besides the fact that information transfer is reduced, even just small spelling variations or minor changes in word formation (such as the presence or absence of hyphens or spaces in compound terms such as checkbox, check-box and check box) lead to fuzzy matches in the TM and thus an increase in translation cost.
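The effect of such spelling variants on TM matching can be demonstrated with a short Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a similarity measure; this is an assumption made for illustration, since commercial CAT tools use their own matching algorithms and thresholds. The sample sentences are invented.

```python
from difflib import SequenceMatcher


def match_score(new_segment: str, stored_segment: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 (1.0 = exact match)."""
    return SequenceMatcher(None, new_segment, stored_segment).ratio()


# Segment already stored in the TM (invented example).
stored = "Select the check box to enable logging."

# The same sentence with three spellings of the compound term.
variants = [
    "Select the check box to enable logging.",
    "Select the checkbox to enable logging.",
    "Select the check-box to enable logging.",
]
for text in variants:
    score = match_score(text, stored)
    kind = "exact" if score == 1.0 else "fuzzy"
    print(f"{kind:5} {score:.2f}  {text}")
```

Only the first variant reproduces the stored segment character for character; the other two differ by a single character yet no longer count as exact matches, which is the cost driver described above.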
3.6 Integrating SEO keywords into the translation process

Translating a website is a unique type of translation process, as it can involve marketing considerations. The introduction of keywords for search engine optimization is a deliberate effort to integrate synonyms of those terms that represent the main ideas of the site, which runs counter to the best practice of terminology consistency just described. Keyword integration therefore requires skills on the part of the translator to balance these two seemingly contradictory requirements: terminology consistency for translation legibility and terminology variation for SEO. The translator needs to be able to judge when and where to add keywords while minimizing the negative impact of terminology variation on the quality of the translation and of its associated TM.

Keyword integration can occur before, during or after the translation stage, depending on the inputs and the desired outputs. The goal of each stage of the process is to minimize the steps, complexity, and the time spent by the translators or other project participants.

[6] http://en.wikipedia.org/wiki/USB_flash_drive
The terminology database as a project resource was discussed in sections 2.3 and 2.4. Process steps, complexity and estimated time required depend on the desired output. We will therefore look at the output first.

The output is an HTML website with integrated keywords in the target language. An additional output is a translation memory. There are several options for what this TM could look like:
1. Translation memory with integrated source and target keywords
2. Translation memory without keywords in either language
3. Translation memory without source keywords, but with integrated target keywords
4. Translation memory with integrated source keywords, but without target keywords

In the following sections, the different options will be discussed followed by a more thorough analysis of the preferred option.
3.6.1 TM contains keywords in one language

This section describes options 3 and 4, i.e. where keywords are added to the TM in one language but not the other.

It is a best practice to keep source and target language segments in a TM semantically equivalent. If changes are made to a source segment, equivalent changes should be made in the respective target segment. A clean parallel structure of the two languages optimizes reuse of the memory. Options 3 and 4, where keywords would be present in one language but not in the other, would therefore not be desirable, as these memories would be less effective for follow-up projects.

Broadly speaking, there is no semantic difference between the SL and the TL segments when one contains SEO keywords and the other does not. Adding keywords to a web page does not change its overall meaning or message. However, the following problems might arise when a website is updated in the source language through the addition of keywords, and a translation of that website becomes necessary.

Each SL segment that was changed because of the addition of keywords will no longer have an exact match in the TM, but rather, a fuzzy match at best. When an SL text is changed, the number of exact matches in the TM decreases, resulting in increased translation effort and cost. In short, adding keywords to SL text that has already been stored in a TM reduces the value of the TM for the next related translation project.

The fuzzy match that results when an SL segment has keywords but the TL segment does not could be difficult for a translator to interpret and to analyze. The “terminology” in the two segments would differ, the SL reflecting more terminological variation, and the translator may not understand why, since the meaning with or without this terminological variation would typically be the same.
If he or she checks the termbase, the SEO keywords for the SL may or may not be present there; the termbase may therefore not offer an explanation for the differences in terminology between the two languages and the increased terminological variation in the SL. Without a clear understanding of the importance of the way that the SL has been rendered, the translator may adopt a strategy that is inappropriate for SEO, such as ignoring the terminological variation in the SL, assuming that it has no purpose, in favour of terminological consistency in the TL, which is his or her normal guiding principle.

In a combined termbase that includes both terms and keywords, although keywords and terms are set up as synonyms in the database, some of them may actually be quasi-synonyms (i.e. they have slight differences in meaning). This is likely to occur more frequently when keywords are introduced to a termbase because of their marketing orientation. The ability of a translator to judge or interpret these slight differences in meaning will lead to particular choices of keywords for the target text. With every change to the website and change of keywords chosen in the source text, a potential meaning gap between terms in one language and keywords in the other will occur. Thus, if SL and TL segments increasingly diverge linguistically, they will also eventually diverge semantically. The TM may become less and less reliable for translating similar content.

Options 3 and 4 are not recommended, as they further complicate the process, put additional accuracy needs on the termbase and reduce the usability of the output TM. Option 1 or 2 is preferred.
3.6.2 TM contains no keywords

In Option 2, the input source text would not contain the source language keywords; furthermore, the target language keywords would not be introduced during the translation of the source text in the CAT tool because they would “contaminate” the translation memory by making the target text “different” from the source text. Rather, they must be added to the translation by using a text editor before the text is finalized and published. This option might be the best choice for projects with the following criteria:
• The TM is used to translate content other than the website.
• The content of the website is not highly volatile and updates to the website are not expected frequently.
• Translators with SEO keyword integration skills are not available for the project.

There are several advantages and disadvantages to this method:

Disadvantages:
• Adding SEO keywords in both languages requires an additional step beyond the normal authoring and translation process, and requires the use of a separate application (text editor).
• SEO keywords and terms are applied in two different process steps. Therefore, they must be managed either in two separate terminology databases, or in one database. In either case, functionality must be provided that allows the translator to “switch on” only the terminology and the keyword expert to switch on only the keywords.
• SEO keywords must be added every time the text is retranslated, as they are not part of the TM.

Advantages:
• The SEO keywords are added in a separate step that can be carried out by a specifically trained person, possibly even someone besides the translator, which may be advantageous in some organizational environments.
• The person adding the target keywords only needs to focus on the goal of adding keywords, which makes the task less complex, and he or she can more reliably keep track of what was added and where.
• If SEO keywords need a review, this step can be carried out by the person responsible for adding the keywords to a retranslated text.
• The TM is not contaminated with synonyms, which preserves its value for other translation projects. Both SL and TL remain perfectly aligned and semantically equivalent.
• This method can be used for CAT tools where searches cannot be restricted to either the keywords or the terms, as discussed in sections 2.3 and 2.4.
[Workflow diagram. Initial cycle: begin of website translation; set up of project in CAT tool; translation; proofreading/editing; export of text into text editor; integration of SEO target-language keywords; translated website with target keywords available online. Next translation cycle: the same steps, with a review of SEO keywords. Resources added to and updated along the way: translation memory, terminology, SEO keywords. The frequency and location of keywords used are tracked.]
Figure 1. Workflow for Option 2: SEO keywords are integrated after the translation step

Description of the workflow

At the beginning of the translation phase, a terminology database and an SEO keyword database (or a database that contains both types of data, distinctly marked), a TM (which may be empty or already contain segments) and the CAT tool are available. The project set up in the CAT tool contains the database(s) as well as the TM. The translator fills the TM and uses the terminology from the database(s) while progressing through the translation.

Once the text is completely translated, as in any translation project, an editor/proofreader verifies the translation in the CAT tool. The final translation is then exported and opened in a text editor, and the keywords are added to strategic locations on the web page either by the translator or by a specifically trained person. This person selects the keywords from the database(s).

When the source-language website is updated, the next translation cycle can occur with the same or potentially updated database(s) as well as the filled TM. The translator would derive the benefits of a clean and very usable TM. The keyword expert might need to do a review of the SEO keywords before integrating them. The remaining steps are the same as in the workflow of the initial translation.
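Because the keywords in this workflow are added outside the CAT tool, the record of which keyword was placed where has to be kept separately. The following Python sketch shows one way such a record could be produced; it is a minimal illustration, and the function name, the location names and the sample page text are all hypothetical.

```python
from collections import defaultdict


def track_keywords(page_parts, keywords):
    """Count occurrences of each keyword per page location.

    `page_parts` maps a location name (e.g. "title") to its text.
    Matching is a simple case-insensitive substring count.
    """
    usage = defaultdict(dict)
    for location, text in page_parts.items():
        lowered = text.lower()
        for kw in keywords:
            count = lowered.count(kw.lower())
            if count:
                usage[kw][location] = count
    return dict(usage)


# Invented page content for illustration.
page = {
    "title": "USB flash drive shop",
    "h1": "Buy a memory stick online",
    "body": "Every USB flash drive ships free. A memory stick makes a good gift.",
}
log = track_keywords(page, ["USB flash drive", "memory stick", "jump drive"])
print(log)
```

A log of this kind gives the keyword expert a starting point for the review step in the next translation cycle, since the keywords themselves are not stored in the TM.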
3.6.3 TM contains keywords in both languages

This section describes option 1, where the source language text contains the SEO keywords and the outputs are the target language text and a translation memory, both with integrated keywords. This option might be the best choice for projects with the following criteria:
• The TM is used exclusively for the translation of the HTML website(s) in question.
• The translator is skilled in integrating keywords.
• The website is frequently updated.

This option has the following advantages and disadvantages:

Advantages:
• The entire process takes place within the CAT environment.
• There are fewer steps in the process than in Option 2, because the text does not need to be exported into a text editor where keywords are then added.

Disadvantages:
• The translator must also be trained in the use and application of SEO keywords.
• The translator needs to review even the translations that correspond to exact matches from the TM to ensure that any SEO keywords that they contain continue to be relevant and optimal for the target language and market.
[Workflow diagram. Initial cycle: begin of website translation; set up of project in CAT tool; translation and integration of keywords; proofreading/editing; translated website with target keywords available online. Next translation cycle: review of SEO keywords; set up of project in CAT tool; translation and integration of keywords; review of matches from the TM that contain keywords; proofreading/editing. Resources added to and updated along the way: translation memory with keywords, terminology, SEO keywords. The frequency and location of keywords used are tracked.]

Figure 2. Workflow for Option 1: SEO keywords are integrated during the translation step
Description of the workflow

At the beginning of the translation phase, a terminology database and an SEO keyword database (or a database that contains both types of data, distinctly marked), a TM (which may be empty or already contain segments) and the CAT tool are available. The project set up in the CAT tool contains the database(s) as well as the TM. The translator fills the TM and uses the terminology from the database(s) while progressing through the translation.

The translator selects the target-language keywords from the keywords supplied in the database(s), making sure that the selections are keywords and not other types of terms. The translator should have a choice of integrating the keywords either during the translation process or afterwards. In either case, the keywords automatically become part of the TM.

Proofreading and editing is the last step before publication of the website. The person performing this task will have to be able to distinguish keywords from terms in the translation, and ensure that changes, if any, are deliberate and not accidental.

When the source-language website is updated, the retranslation can occur with the same or potentially updated database(s) as well as the filled TM. The keywords in the TL might have to be reviewed for their continued applicability and market relevance, and any changes should be incorporated into the database. The translator would derive the benefits of a TM that already contains keywords. The remaining steps are the same as in the workflow of the initial translation.
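The review of TM matches that contain keywords could be supported by a simple filter, sketched below in Python. This is only an illustration under stated assumptions: the function name and sample segments are hypothetical, keywords are matched as plain case-insensitive substrings, and no existing CAT tool is claimed to work this way.

```python
def segments_needing_review(tm_matches, target_keywords):
    """Return TM target segments that contain at least one target keyword."""
    keywords = [kw.lower() for kw in target_keywords]
    return [
        target
        for target in tm_matches
        if any(kw in target.lower() for kw in keywords)
    ]


# Invented exact matches retrieved from the TM for a retranslation.
exact_matches = [
    "Stecken Sie den USB-Stick ein.",
    "Klicken Sie auf Weiter.",
    "Ein Speicherstick ist ein praktisches Geschenk.",
]
to_review = segments_needing_review(exact_matches, ["USB-Stick", "Speicherstick"])
print(to_review)
```

Flagging only the segments that carry keywords lets the translator accept the remaining exact matches as usual and concentrate the market-relevance check where it is needed.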
3.7 Identifying keywords on a web page

An additional consideration is the challenge that a translator, editor or even an SEO expert might face if the keywords are fully integrated into the website. They may not recognize the keywords as special information carriers. To avoid this problem, keywords should be explicitly identified in the web page.

One solution for this is to enclose keywords in unique markup tags, inline in the text. However, this requires that a unique markup tag for this purpose is available, and that it can be applied in a standard and consistent way across all web content world-wide. It may not be possible to meet this requirement, as it depends on the HTML markup language, which can vary slightly in different production environments. Furthermore, browsers may interpret the tag differently. (This aspect requires further research if pursued.)

Another option is to list all the keywords in the metadata section of the HTML page. Even though this section is not used by the search engines, it can be useful to explicitly identify the keywords for both the SL and the TL text for authors, translators, and keyword managers/SEO experts. The metatags are stable and supported by all browsers. Then the translator can:
• explicitly identify all TL keywords in the metadata section of the TL web page, which would be helpful for keyword identification and management for specific web pages
• use his or her own judgment and decide how to render the SL keywords in the TL in the most strategic places.

Using metadata tags in the HTML header section to record keywords for a web page requires that translators be able to edit the content of these tags in the CAT tool.
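Reading the keywords back from the metadata section is straightforward, as the following Python sketch suggests. It assumes that the keywords are recorded in a `<meta name="keywords">` element as a comma-separated list, which is one common convention and not a requirement stated in this document; the class name and the sample page are hypothetical. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the content of <meta name="keywords"> elements."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split the comma-separated list and drop empty items.
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


# Invented target-language page for illustration.
html_page = """<html><head>
<title>USB-Sticks online kaufen</title>
<meta name="keywords" content="USB-Stick, Speicherstick">
</head><body><p>...</p></body></html>"""

parser = MetaKeywordParser()
parser.feed(html_page)
print(parser.keywords)
```

A list extracted this way can be compared with the keyword entries in the database to confirm that the keywords recorded for a page are still the ones selected for that language and market.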
3.8 Conclusion

The translation workflows presented in options 1 and 2 are equally good candidates from a TM perspective. In both cases, the SL and TL segments remain aligned. Option 1 seems simpler from a process perspective than option 2, but it requires more skills from the translator.

The advantage of using one database vs. two for storing the keywords and terms was discussed in Chapter 2. Using one combined database is preferred as it reduces duplication of work and data redundancy. However, this approach requires features in the CAT tool to visibly distinguish terms from keywords in the terminology lookup window used by translators.
Chapter 4: Implications of SEO requirements for CAT tools

The previous sections discussed requirements for skills, resources and process. This section lists some suggested tool requirements to facilitate keyword integration during the translation process. The assumption is that a computer-assisted translation tool will be used to carry out the translation.

The CAT tool should allow the use of a combined term and keyword database, or separate term and keyword databases. As shown in the workflows, translators:
• Translate new segments
• Review and update existing TL segments
• Integrate terms from the database(s)
• Integrate keywords from the database(s)

The following functionality will assist the translators in this process:
• Search and replace
• Spellchecking
• Automatic matching of terms and keywords in source segments to available terms and keywords in the database(s)
• The ability to enable or disable the automatic terminology lookup function for either the keywords, the other terms, or both
• A visual indicator in the terminology autolookup window that shows whether the translation in the window is a term or a keyword
• Easy integration of suggested target terms or target keywords into the translation segment
• A way to identify keywords in the source text and in the target text
• Write access to the metatags in the HTML header, if the current proposal for using metatags to record the keywords that are added to a web page is adopted.
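Three of the requirements above (automatic matching, the ability to enable or disable lookup per data type, and an indicator of whether a hit is a term or a keyword) can be sketched together in Python. This is a minimal illustration, assuming a combined database held as a list of records; the function name, the record fields and the sample data are hypothetical and do not describe any existing CAT tool.

```python
def lookup(segment, termbase, show_terms=True, show_keywords=True):
    """Return (source, target, kind) hits for entries found in the segment.

    `termbase` is a list of dicts with the hypothetical fields
    "source", "target" and "kind" ("term" or "keyword").
    """
    enabled = set()
    if show_terms:
        enabled.add("term")
    if show_keywords:
        enabled.add("keyword")
    lowered = segment.lower()
    return [
        (e["source"], e["target"], e["kind"])
        for e in termbase
        if e["kind"] in enabled and e["source"].lower() in lowered
    ]


# Invented combined database for illustration.
termbase = [
    {"source": "USB flash drive", "target": "USB-Stick", "kind": "term"},
    {"source": "memory stick", "target": "Speicherstick", "kind": "keyword"},
]
segment = "Buy a USB flash drive or memory stick today."

for source, target, kind in lookup(segment, termbase):
    print(f"[{kind}] {source} -> {target}")
```

Calling `lookup(segment, termbase, show_keywords=False)` would correspond to the translator's view in Option 2, and `show_terms=False` to the keyword expert's view; the `kind` value returned with each hit is what a lookup window could render as the visual indicator.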
Chapter 5: Conclusions and future work

In this document, we proposed ways to incorporate SEO keywords into the translation process. Specifically, a model was suggested for managing SEO keywords in a database that adheres to ISO standards and other best practices in terminology management. This model includes an entry structure and a set of data categories, which can be used for the combined management of terminology and SEO keywords to minimize data redundancy and duplication of effort and to maximize the use of the managed data.

To implement an actual combined termbase based on the proposed model, a TBX-compliant database schema will need to be developed and tested. The whole issue of how SEO keywords are viewed in the field of terminology management needs further consideration. Conventional terminology management precepts may need to be challenged. Certainly, the new data categories proposed for keywords will need to be submitted to the ISO TC37 Data Category Registry and will need to be considered in a future version of TBX.

We suggested a workflow for projects where a team of skilled professionals performs the translation and subsequent keyword integration tasks. We recommend this option when a translator with the necessary skills to combine the translation of a text with the integration of the SEO keywords is unavailable. A second workflow was proposed for scenarios where a translator who possesses these skills is available.

Although we identified some high-level tool requirements, further work should be done in this area. Future research is required to design user interface elements that enable the flawless connection of a translation environment with the combined database.

Another aspect that is of considerable importance in pure terminology work is change management. Keeping the database up-to-date with two different types of data per conceptual entry presents unique challenges that would benefit from further investigation.

Training materials need to be developed for using termbases that contain both terms and keywords in a CAT tool.

We suggest that a pilot project be undertaken to test the proposed workflows and keyword data model. Such a pilot project would make it possible to empirically measure the efforts involved and the impacts on the translation memories. It would validate the data model for keywords and terms. The pilot project might lead to improvements to the workflows, strategies to minimize the impacts on the TM and refinements to the data model.

If SEO keywords are to be incorporated into a website during the authoring stage, a workflow needs to be defined that works well in this context, as well as appropriate tools and data inputs. For instance, it may be possible to adapt the terminology checking functions in controlled authoring software for SEO keywords. A future research project for global SEO keywords could focus on integrating keywords into the authoring stage of website development.
Bibliography

Budin, Gerhard. (2002). Wissensmanagement in der Translation. In Joanna Best and Sylvia Kalina (Eds.), Übersetzen und Dolmetschen. Eine Orientierungshilfe. Tübingen/Basel: A. Francke Verlag.
Karsch, Barbara Inge. (2006). Terminology workflow in the localization process. In Keiran J. Dunne (Ed.), Perspectives on Localization (XIII ed.). Amsterdam: John Benjamins Publishing Company. Handbook of Terminology.
ISO 30042 (2008): TermBase eXchange. International Organization for Standardization.
ISO TC37 Data Category Registry. Available from: www.isocat.org