
2017 IEEE Third International Conference on Big Data Computing Service and Applications

A Quality Evaluation Approach to Search Engines of Shopping Platforms

Chen Hao

Tao Chuanqi

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Keywords—quality factors; quality indicators; assurance; quality evaluation; search engines

II. RELATED WORK

A growing number of quality problems lead to costly errors and testing effort in enterprises and businesses. According to IDC [1], the Big Data technology market will "grow at a 27% compound annual growth rate (CAGR) to $32.4 billion through 2017". Quality assurance and validation for big data systems have already been discussed in a preliminary way [2]. A search engine is a typical big data system, and its quality directly affects the economic interest of the service provider [3]. The causes of quality problems and ways to solve them are already a research topic in related fields [4]-[6]. The quality of search engines is much more difficult to assure because of the non-oracle problem: there is no predictable output for a search engine. In addition, how well a search engine satisfies the needs of users, which is an important part of search engine quality, is hard to quantify.

Several quality evaluation methods have been proposed. Zhou et al. propose using metamorphic testing to evaluate quality [7] and provide five metamorphic relations that should be satisfied; however, the work lacks detail. Shehata proposed a simulation model to evaluate quality [8]. Goel showed how to use page-level keywords to evaluate the degree of correlation between the results and the input keyword [9]. There are also approaches for evaluating image search engines [10], [11]. However, the existing approaches lack a systematic standard for estimating the quality of search engines, and there is no method adapted to shopping platform search engines. Shopping platform search engines differ from traditional search systems because they combine commodity recommendation, ranking strategies, personalized search results [12], [13], classification-based retrieval, and so on. Their design and construction are much more specialized than those of traditional applications [14], [15].

I. INTRODUCTION

People widely use shopping platforms in their daily lives. On an online shopping platform, only a small part of the commodities is shown on the home page of the website. Because of the limited space on the home page, most commodity information in the product pages stays hidden in the database. Thus, users need to enter keywords into the search engine of the shopping platform to get the relevant commodity information they want. The search engine of a shopping platform is therefore of great importance, and it directly affects the income of the shopping platform business. Users always expect their commodity search results to be reasonable and correlated with their needs. For example, when we search “Mechanical Keyboard”, both “Mechanical Keyboard” and “Mechanical feeling Keyboard” will be in the results, but a mechanical keyboard is totally different from a mechanical feeling keyboard. Therefore, evaluating the quality of the search engines of shopping platforms has become an important task for test engineers and QA. Currently, there are few quality factors or computable quality indicators with which to test and evaluate the search engines of shopping platforms. In this paper, we provide a set of reference quality factors and quality calculation indicators. In addition, experiments and comparisons have been conducted to validate the effectiveness of the reference quality factors and quality calculation indicators. The reference quality factors and indicators we propose in this paper can encourage shopping platforms to improve their search engines as well as polish the user experience.

The paper is organized as follows. The next section is related work. Section III presents quality factors for search engines of shopping platforms. Section IV proposes the quality indicators. Section V shows our case studies on several practical commodity search engines. Conclusions are summarized in the last section.

978-1-5090-6318-5/17 $31.00 © 2017 IEEE
DOI 10.1109/BigDataService.2017.45

Computer Engineering Department, San Jose State University, USA

Abstract—Search engines are among the most widely used big data applications. However, search systems vary across platforms and fields, and there are few approaches to evaluate their quality. A search engine for an online shopping system combines text search and classification-based retrieval. Its quality is more difficult to validate and evaluate since there are no definite quality standards or testing methods. This paper provides quality factors and related indicators based on a study of the search systems of online shopping platforms. Experiments and comparisons on different shopping platforms are also performed to show the quality problems of search engines in shopping platforms. The study results indicate the feasibility and effectiveness of the proposed quality evaluation approach.


Jerry Gao

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China


operation of searching a keyword into account, we consider behaviors that may affect the search results, such as exchanging the order of the keywords, the length of the keyword, the number of keywords, entering a useless symbol by mistake, etc. We propose the following 13 quality evaluation indicators for a commodity search system. The details of the proposed indicators and their metrics are presented as follows.

III. QUALITY FACTORS FOR COMMODITY SEARCH ENGINES

According to the ISO/IEC 9126 software quality evaluation and measurement model, quality metrics can be divided into three parts: external metrics, internal metrics, and quality-in-use metrics. For the quality of search engines, we evaluate not only the correctness of the search functions but also the practicability of the system and user satisfaction with it. That means we need to evaluate the system from the users' point of view. Combining the quality metrics of the ISO model with the characteristics of commodity search systems and the users' point of view, we adapt the classic model and propose a model for commodity search engines, shown in Fig. 1.

Indicators (1) to (3) correspond to the relevance factor, (4) and (5) to stability, (6) to (9) to completeness, and (10) to (13) to intelligent processing.

1) Indicators of single keyword search: global match degree. This is the relevance degree between the search results and the keyword, and it reflects whether the results satisfy the search needs of users. Take the single keyword as A, the search result set of A as FR, and each record in FR as p. Extract the title of p and divide it into substrings that have the same length as A. The maximum Levenshtein ratio between these substrings and A is the match degree between record p and keyword A. For example, assume A is ‘abcde’ and the title of record p is ‘abcdabc’. Dividing the title into substrings as long as the keyword ‘abcde’ gives 3 substrings: ‘abcda’, ‘bcdab’, and ‘cdabc’. We then calculate the Levenshtein ratio between each substring and A and take the maximum ratio as the match degree between ‘abcdabc’ and ‘abcde’.

2) Indicators of multiple keyword search: global match degree. This is similar to indicator 1, but with multiple keywords. Here we compute the maximum Levenshtein ratio for each keyword and take the average value to represent the whole.

3) Indicators of multiple keyword search: keyword match degree. Indicator 2 focuses on the global match degree of the keywords: for a single record in the results and the corresponding input keywords, it computes the average match degree between each single keyword and the record. Indicator 3 instead calculates the proportion of highly matched keywords among all keywords. We calculate the maximum Levenshtein ratio between the substrings and each single keyword B in the multiple keywords A as the match degree between record p and keyword B. If the match degree is greater than the threshold, we consider p highly relevant to B. We then count how many sub-keywords in A are highly relevant to p, which gives the relevant times between p and A, and divide the relevant times by the number of sub-keywords in A to get the relevant degree between p and A.

4) Indicators of single keyword search: stability.
From set theory, a set A is equal to itself, which means that when the database is fixed, two searches for A return the same results. Here we use the Jaccard intersection of the results of the two searches to compute the indicator. Send the same keyword to the search engine twice within a short time. Take the first search result as FR1 and the second as FR2. Calculate the Jaccard

Fig. 1. Quality factors for search engines

The proposed quality factors are explained as follows:
• Stability: Apart from updates to the database, the results should be the same within a given short time. That means that if the database stays stable, the search results should not change either.
• Completeness: Completeness means the completeness of the results: a record should not be absent if it satisfies the input keyword. Although search engines have the non-oracle problem, there are still some rules that the results should obey. For instance, if different users search equivalent keywords, the results should also be equal.
• Relevance: Relevance evaluates the function of the search engine by the degree of relevance between the results and the input keywords.
• Intelligent processing: The intelligent processing ability of the search engine system, such as handling obvious errors, dealing with useless input and common mistakes, or recognizing partial repetition in keywords. In big data-based applications, this feature becomes more intelligent with the support of new technologies like deep learning.

IV. QUALITY INDICATORS FOR COMMODITY SEARCH ENGINES

Through the analysis of the commodity search system, we find that the quality of a search engine is mainly reflected in whether the commodity search results are reasonable. Taking the actual


intersection of FR1 and FR2 as the result for the current keyword or keywords.

5) Indicators of multiple keyword search: stability. This is similar to indicator 4.

6) Indicators for multiple keyword search: the influence on results of the derangement of keywords. From set theory, the set (A AND B) is equal to the set (B AND A), which means that when the database is fixed, the search results of (A AND B) and (B AND A) are the same. Here we use the Jaccard intersection of the results of (A AND B) and (B AND A) to compute the indicator.

7) Indicators for single keyword search: the influence on results of the repetition of the keyword. From set theory, the set (A) is equal to the set (A AND A), which means that when the database is fixed, the search results of (A) and (A AND A) are the same.

8) Indicators for multiple keyword search: the influence on results of the repetition of a keyword. From set theory, the set (A AND B) is equal to the set (A AND B AND B), which means that when the database is fixed, the search results of (A AND B) and (A AND B AND B) are the same. In other words, if we randomly repeat one keyword of multiple keywords, the results should not change.

9) Indicators for single keyword search: the influence on results of the combination of keyword and titles. Suppose p is a result of keyword A, and we use the title of each result to represent it. When we search (p AND A), record p should still be in the new results.

10) Indicators for single keyword search: the influence on results of the transformation between Simplified and Traditional keywords. Simplified Chinese and Traditional Chinese are both widely used in China, so a good search engine should be able to process both intelligently, because the search needs of users stay the same when the Simplified and Traditional forms of a keyword correspond. Send the Simplified Chinese keyword to the search engine and then the corresponding Traditional Chinese form. Take the first search result as FR1 and the second as FR2. Calculate the Jaccard intersection of FR1 and FR2 as the result for the current keyword.

11) Indicators for single keyword search: the influence on results of the addition of a useless symbol. When useless symbols like ‘,’, ‘?’, or ‘#’ are present in the keyword, the search engine should be able to recognize and handle them. For keyword A, take the result of searching A as FR1. Append a random useless symbol to A and take the new keyword as B. Take the result of searching B as FR2. Calculate the Jaccard intersection of FR1 and FR2 as the result. We predefine some useless symbols for the search, such as “,”, “#”, “@”, and “...”, among others.

12) Indicators for multiple keyword search: the influence on results of the adhesion of keywords. When entering multiple keywords, a user may forget to enter the blank that splits them, so the system should be able to handle this. For multiple keywords A in the correct form, the single keywords in A are separated by blanks. For keywords A, take the result of searching A as FR1. Remove the blanks in A and take the new keywords as B. Take the result of searching B as FR2, and calculate the Jaccard intersection of FR1 and FR2 as the result.

13) Indicators for single keyword search: the influence on results of the removal of a single character in a long keyword. When typing the name of a commodity, a user may forget to enter a character. For keyword A, take the result of searching A as FR1. Remove a random character from A and take the new keyword as B, take the result of searching B as FR2, and calculate the Jaccard intersection of FR1 and FR2 as the result. Here A needs to be a long keyword, that is, one with many characters, because removing a random character from a short keyword may cause ambiguity and so change the search intention.

V. STUDIES & ANALYSIS

A. Experimental Design
Based on the proposed four quality factors and 13 quality indicators, a simple experiment and verification is carried out.
1) Experimental Subjects: The experiment takes six large online shopping platforms in China as its subjects: TaoBao, JD, SuNing, Dang-Dang, Gome, and FeiNiu.
2) Experiment Keywords: Our experiments use keywords selected from the Top-20w Search Keywords List provided by TaoBao Direct Bus. To avoid data redundancy, at most 100 records of each search are taken as the effective result set. The experiment for each quality indicator uses no fewer than 1000 keywords.

B. Experimental Results
By searching thousands of keywords and calculating the results of the 4 quality factors and 13 quality indicators for the 6 shopping platforms, we reach a series of conclusions on preliminary quality evaluation. The experiment results are shown in Fig. 2.
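Indicators 4 to 13 all compare the result set of an original query with that of a follow-up query. The sketch below illustrates this under stated assumptions: the "Jaccard intersection" is read as the Jaccard index |FR1 ∩ FR2| / |FR1 ∪ FR2|, results are identified by record id, and `toy_search`, `CATALOG`, and the helper names are hypothetical stand-ins, not the experiment program used in this study.

```python
import random

# Symbols the text lists as useless; the full predefined set is not given.
USELESS_SYMBOLS = [",", "#", "@", "..."]


def jaccard(fr1, fr2):
    """Jaccard index of two result lists; two empty lists count as identical."""
    a, b = set(fr1), set(fr2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def swap_order(keywords):
    """Indicator 6: reverse the keyword order, e.g. (A, B) -> (B, A)."""
    return list(reversed(keywords))


def repeat_keyword(keywords, rng):
    """Indicators 7 and 8: append a copy of one randomly chosen keyword."""
    return keywords + [rng.choice(keywords)]


def add_useless_symbol(keywords, rng):
    """Indicator 11: append a random useless symbol to the last keyword."""
    return keywords[:-1] + [keywords[-1] + rng.choice(USELESS_SYMBOLS)]


def glue_keywords(keywords):
    """Indicator 12: drop the blanks that separate the keywords."""
    return ["".join(keywords)]


def drop_character(keywords, rng):
    """Indicator 13: remove one random character from a single long keyword."""
    word = keywords[0]
    i = rng.randrange(len(word))
    return [word[:i] + word[i + 1:]]


def indicator_value(search, keywords, follow_up):
    """Compare the results of the original query and the follow-up query."""
    fr1 = search(" ".join(keywords))
    fr2 = search(" ".join(follow_up))
    return jaccard(fr1, fr2)


# Hypothetical in-memory catalog standing in for a shopping platform.
CATALOG = {
    1: "mechanical keyboard red switch",
    2: "mechanical feeling keyboard",
    3: "computer desk",
}


def toy_search(query):
    """Return ids of titles that contain every blank-separated keyword (AND)."""
    tokens = query.split()
    return [pid for pid, title in CATALOG.items()
            if all(token in title for token in tokens)]


rng = random.Random(0)
query = ["mechanical", "keyboard"]
print(indicator_value(toy_search, query, swap_order(query)))           # 1.0
print(indicator_value(toy_search, query, repeat_keyword(query, rng)))  # 1.0
print(indicator_value(toy_search, query, glue_keywords(query)))        # 0.0
```

The toy engine tolerates reordering and repetition but not glued keywords, which is the kind of difference indicator 12 is meant to expose.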

Fig. 2. Experiment result figure

The X axis represents the number of the quality indicator, while the Y axis represents the experiment result for each quality indicator. All results lie in the interval [0, 1], and the results of the six shopping platforms are marked in different


colors. From the figure we can see that no shopping platform is perfect; each has different advantages and disadvantages. For the same indicator, the performance of different platforms varies. For some indicators, such as 4 and 5, all platforms show similar capability, while for others, such as 7, 8, 9, 10, and 11, the platforms differ greatly.
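The protocol from the experimental design (at most 100 records per search, one value per indicator and platform) can be sketched for indicator 4. This is a minimal sketch under assumptions: the per-keyword values are combined by a plain average, which the text implies but does not spell out, and `FlakySearch` is a hypothetical engine that mixes one changing recommended item into otherwise stable results.

```python
from statistics import mean

MAX_RECORDS = 100  # at most 100 records per search form the effective result set


def jaccard(fr1, fr2):
    """Jaccard index of two result lists; two empty lists count as identical."""
    a, b = set(fr1), set(fr2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(search, keywords, max_records=MAX_RECORDS):
    """Indicator 4 averaged over keywords: search each keyword twice in a row."""
    values = []
    for keyword in keywords:
        fr1 = search(keyword)[:max_records]
        fr2 = search(keyword)[:max_records]
        values.append(jaccard(fr1, fr2))
    return mean(values)


def stable_search(keyword):
    """Hypothetical engine that always returns the same three hits."""
    return [f"{keyword}-{i}" for i in range(3)]


class FlakySearch:
    """Hypothetical engine: three stable hits plus one new recommendation per call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, keyword):
        self.calls += 1
        hits = [f"{keyword}-{i}" for i in range(3)]
        return [f"recommended-{self.calls}"] + hits


print(stability(stable_search, ["desk", "keyboard"]))  # 1.0
print(stability(FlakySearch(), ["desk", "keyboard"]))  # 0.6
```

In the second case the two searches share three of five distinct records, so a single changing recommendation is enough to pull the indicator well below 1.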

segmentation on the new keywords, which are constructed from the title and the old keyword, so this is related to the quality problems of indicators 1-2; the problems and their causes overlap.
Indicator 10: 1) The platform cannot recognize and handle simplified and traditional keywords. 2) Some platforms force the conversion of simplified Chinese characters into traditional ones before searching with the keyword, which avoids the problem. 3) Some commodities have both simplified and traditional parts in their titles. 4) The simplified and traditional forms of some keywords are the same.
Indicator 11: 1) The platform cannot recognize and handle useless symbols. 2) Some commodities have useless symbols in their titles. 3) Some useless symbols are ambiguous or have a special use in the search strategy of the platform. 4) Useless symbols trigger bugs in the search engine.
Indicator 12: 1) The platform cannot recognize or handle the adhesion of keywords. 2) Some commodities have adhered keywords in their titles. 3) Ambiguity occurs when keywords adhere. 4) The word segmentation strategy.
Indicator 13: 1) The platform cannot recognize and handle the absence of part of a keyword. 2) Some commodities have titles in which part of the keyword is absent. 3) Ambiguity occurs when part of a keyword is absent.
All 13 indicators reveal quality problems, to a greater or lesser degree, on the 6 shopping platforms. There may be other reasons for the quality problems, which we do not discuss further here.

C. Experiment Results Analysis
The results show that, when searching in a shopping platform, the results are influenced by each operation on the keywords and by their selection and conversion. Next, we combine the results with each quality indicator to give possible reasons for these quality problems. There are some common reasons for the quality problems or quality defects, such as:
1. The function that shows and sorts the search results is unstable.
2. In the time gap between constructing a new keyword and sending it to the search engine, an update of commodities is caught by the experiment program.
3. Unstable network environment and network speed.
4. The platform may set up shielding mechanisms against large numbers of access requests to reduce the pressure on its servers.
5. The platform does not welcome other programs acquiring data from it.
6. Coding bugs in the experiment program.
7. Inadequate input keywords, due to the large amount of time needed to conduct the experiment.
There may be other reasons for the quality problems, explained as follows for specific quality indicators:
Indicators 1-3: 1) The platform may conduct Chinese word segmentation on the single keyword or on each single keyword in multiple keywords. It then uses the word segmentation result to search, like the mechanism of fuzzy search, or it simply conducts a fuzzy search. 2) When multiple keywords are used, the platform may assign a different priority to each single keyword, so the search results depend on the priority, which is opaque to users.
Indicators 4-5: 1) The results contain recommended goods: they are not only search results but are mixed with goods offered to users by the recommendation algorithm. Because the recommendation algorithm seldom recommends the same goods to a user twice, the results are inconsistent but similar. 2) A problem or bug in the search algorithm.
Indicator 6: The priority of each single keyword. The search algorithm may produce results that depend on the priority, or the commodity display algorithm may take the priority into account.
Indicators 7-8: 1) The platform cannot recognize the repeated part of the keyword or keywords, let alone handle it. 2) Some commodities do have repetition in their titles, and this has already been observed by users.
Indicator 9: 1) The title is too long, so it is illegal for the search engine. 2) The platform conducts Chinese word

D. Experiment Results Comparison
We can see from the results that all 6 platforms have quality problems on some or all quality indicators. Next, we compare the 6 platforms visually.
1) Correlation: This quality factor consists of 3 parts, indicators 1-3; the comparison is shown in Fig. 3.

Fig. 3. Radar map for the 6 platforms on quality indicators 1-3

From the figure, we can see that for this quality factor, TaoBao performs best overall and FeiNiu worst, while the others perform at an average level.
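The match degree behind indicators 1 to 3 can be sketched as follows. This is a minimal sketch, not the authors' implementation: the Levenshtein ratio is assumed to be 1 minus the edit distance divided by the longer string length (the text does not define it), a title shorter than the keyword is compared as a whole, and the threshold default of 0.6 is a hypothetical value.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def ratio(a, b):
    """Assumed Levenshtein ratio: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - levenshtein(a, b)) / longest


def windows(title, size):
    """All substrings of the title that are as long as the keyword."""
    if len(title) <= size:
        return [title]
    return [title[i:i + size] for i in range(len(title) - size + 1)]


def match_degree(keyword, title):
    """Indicator 1 for one record: best ratio over all title windows."""
    return max(ratio(keyword, window) for window in windows(title, len(keyword)))


def global_match(keywords, title):
    """Indicator 2 for one record: average match degree of the keywords."""
    return sum(match_degree(k, title) for k in keywords) / len(keywords)


def keyword_match(keywords, title, threshold=0.6):
    """Indicator 3 for one record: share of keywords above the threshold."""
    hits = sum(1 for k in keywords if match_degree(k, title) > threshold)
    return hits / len(keywords)


print(windows("abcdabc", 5))             # ['abcda', 'bcdab', 'cdabc']
print(match_degree("abcde", "abcdabc"))  # 0.8
title = "mechanical feeling keyboard"
print(global_match(["mechanical", "keyboard"], title))  # 1.0
print(keyword_match(["mechanical", "desk"], title))     # 0.5
```

The first two lines of output reproduce the worked example from indicator 1: the window ‘abcda’ differs from ‘abcde’ by one substitution, which gives the best ratio of the three windows.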


2) Stability: This quality factor consists of 2 parts, indicators 4-5; the comparison is shown in Fig. 4.

greater than or equal to the result of searching keyword A+B. Here we can take ‘Computer Desk’ as A and ‘Computer Desk Computer Desk’ as A+B. Obviously, a metamorphic relation violation occurred: A has fewer results than A+B. On further study, we found that when we search ‘Computer Desk’, the result directory that the system shows is ‘all the results>furniture>desk>computer desk>computer desk’, while when we search ‘Computer Desk Computer Desk’, the result directory is ‘all the results>computer desk computer desk’. We can see that the result FR1 has been automatically filtered, showing only the results in the 3rd-level class ‘furniture>desk>computer desk>’, although the search engine found the whole result. The result FR2 has not been automatically filtered, so it shows the whole result. After we canceled the automatic filtering of FR1, the true result FR1 returned 2129 records, about equal to FR2. So the reasons may be as follows: 1) SuNing sets an automatic filtering strategy for showing results, which depends on the keyword and the results; some results found by the search engine do not satisfy the display strategy, so they are dropped. 2) In the database of SuNing, some commodities are not correctly classified. For example, 2129 records in total are relevant to the keyword ‘computer desk’, but only 872 records have the correct class (label) ‘furniture>desk>computer desk>’; the other records are not in the class.

4) Intelligent Processing: This quality factor consists of 4 parts, indicators 10-13; the comparison is shown in Fig. 6.


From the figure, we can see that for this quality factor, TaoBao and SuNing perform better overall and JD worse, while the others perform at an average level. We did a follow-up experiment and observation and found that the problem indeed exists and is not a bug in the experiment program. We manually conducted searches on JD within a short time, and the results did differ greatly, which seemed improbable for a big shopping platform. Further, we studied the JD search website and found some possible reasons. On other platforms, such as TaoBao, the returned search result usually consists of 2 parts: the first 1 or 2 fixed items are recommended results, which may not be highly relevant to the search keyword, and the rest are the search results corresponding to the keyword. That means a recommendation strategy is applied in the engine when we search and look at the results. On JD, although there are also recommended results in the search results, the positions of the recommended goods are neither fixed nor stable from one search to the next. On TaoBao, however, they are relatively fixed, so the unstable algorithm in JD influences the experiment program.

3) Completeness: This quality factor consists of 4 parts, indicators 6-9; the comparison is shown in Fig. 5.


From the figure we can see that for this quality factor, each platform has its advantages and disadvantages. Indicator 11 shows the influence on results of the addition of a useless symbol. With symbols like “*”, “@”, and “]”, a good result is returned, while when we add the symbol ‘……’, the result becomes empty. The platform may either handle useless symbols in some way or ignore them; however, it still has some bugs. Indicator 13 is the influence on results of the removal of a single character in a long keyword. We find the reason why TaoBao performed poorly: TaoBao is a C2C (Customer to Customer) platform, so there are a lot of sellers, and there may be serious irregularity and redundancy in naming and classification. For example, when we search ‘adidas’ and ‘adida’, there will be a lot of results, but the results of


For indicators 7 and 8, SuNing is obviously worse than the others. Consider the searches ‘Computer Desk’ and ‘Computer Desk Computer Desk’; note that in Chinese ‘Computer Desk’ is a single word, not two. The first keyword, ‘Computer Desk’, returns result FR1, while the follow-up keyword returns FR2. There are 872 records in FR1, while there are 2128 records in FR2. However, according to a common metamorphic relation, the result of searching keyword A must be


VII. REFERENCES

‘adidas’ and ‘adida’ have great similarities and also great differences. What is more, TaoBao suffers a lot from fake and copycat commodities, which aggravates the problem for indicator 13, while the other platforms have fewer fake and copycat commodities and stricter naming rules and checking systems.


E. Experiment Conclusion
Through the experiment and comparison of 6 platforms, 4 quality factors, and 13 quality indicators, we draw a conclusion about the quality advantages and disadvantages of these platforms, as Table I shows. Taking the quality indicators of each quality factor as a whole, for each quality factor we choose the two best-performing platforms as better and the two worst-performing platforms as worse.

TABLE I. EXPERIMENT RESULTS

Quality factor            better               worse
Correlation               TaoBao, Dang-Dang    SuNing, FeiNiu
Stability                 TaoBao, SuNing       JD, FeiNiu
Completeness              Gome, JD             SuNing, TaoBao
Intelligent Processing    FeiNiu, JD           TaoBao, SuNing
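The selection rule behind Table I can be sketched as follows. This is only an illustration under assumptions: the text says each factor's indicators are taken "as a whole" without giving the aggregation, so a plain mean is assumed here, and the platform names `P1` to `P6` and their scores are hypothetical values, not experiment results.

```python
from statistics import mean

# Indicator numbers that belong to each quality factor, as listed in the text.
FACTOR_INDICATORS = {
    "Correlation": [1, 2, 3],
    "Stability": [4, 5],
    "Completeness": [6, 7, 8, 9],
    "Intelligent Processing": [10, 11, 12, 13],
}


def factor_score(indicator_values, factor):
    """Assumed aggregation: mean of the factor's indicator values."""
    return mean([indicator_values[i] for i in FACTOR_INDICATORS[factor]])


def better_and_worse(scores_by_platform, factor, k=2):
    """Return the k best and the k worst platforms for one quality factor."""
    ranked = sorted(
        scores_by_platform,
        key=lambda platform: factor_score(scores_by_platform[platform], factor),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]


# Hypothetical indicator values (indicator number -> score in [0, 1]).
scores = {
    "P1": {4: 0.90, 5: 0.80},
    "P2": {4: 0.50, 5: 0.60},
    "P3": {4: 0.95, 5: 0.90},
    "P4": {4: 0.70, 5: 0.70},
    "P5": {4: 0.30, 5: 0.40},
    "P6": {4: 0.80, 5: 0.70},
}

better, worse = better_and_worse(scores, "Stability")
print(better)  # ['P3', 'P1']
print(worse)   # ['P2', 'P5']
```

A different aggregation, such as a median or a weighted mean, could change which platforms land in each column, so the choice should be stated alongside any table produced this way.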

F. Study Limitations

The experiment still has deficiencies. The experiment data contain only 1000 keywords, which may affect the accuracy of the experiment results. If we conduct the experiment with more keywords, the results and conclusions will be more accurate. The network speed during the experiment may also affect the results. What is more, there are maximum and minimum values in the experiment results that may affect the average value. Further work should study whether these values affect the correctness of the experiment. Additional experiments can be conducted to validate the quality problems and to try to give possible solutions. In this experiment, we tested 6 Chinese shopping platforms; English shopping platforms could also be included, and more quality factors and quality calculation indicators should be considered and validated.

VI. CONCLUSIONS

This paper proposes 4 quality factors and 13 quality indicators to evaluate the quality of a typical big data system, the search engine. The experiments show that quality problems indeed exist. The quality factors and quality indicators can be used to evaluate the search functions and improve the search service of each platform. Moreover, the quality factors and quality indicators in this paper can also be extended to evaluate other big data systems.


[1] M. R. Wigan, R. Clarke. Big Data's Big Unintended Consequences[J]. Computer, 2013, 46(6):46-53.
[2] J. Gao, C. Xie, C. Q. Tao. Big Data Validation and Quality Assurance -- Issues, Challenges, and Needs[C]// IEEE International Symposium on Service-Oriented System Engineering. 2016:433-441.
[3] I. Lianos, E. Motchenkova. Market dominance and search quality in the search engine market[J]. Journal of Competition Law & Economics, 2013, 9(2):419-455.
[4] D. Hawking, N. Craswell, P. Bailey, et al. Measuring Search Engine Quality[J]. Information Retrieval, 2001, 4(1):33-59.
[5] R. M. Losee, L. A. H. Paris. Measuring search-engine quality and query difficulty: Ranking with target and freestyle[J]. Journal of the American Society for Information Science, 1999, 50(10):882-889.
[6] E. Van Couvering. Is Relevance Relevant? Market, Science, and War: Discourses of Search Engine Quality[J]. Journal of Computer-Mediated Communication, 2007, 12(3):866-887.
[7] Z. Q. Zhou, S. Xiang, T. Y. Chen. Metamorphic Testing for Software Quality Assessment: A Study of Search Engines[J]. IEEE Transactions on Software Engineering, 2015:1-1.
[8] T. Phelan, A. Patel, S. Ó Ciardhuáin. Simulation Based Approach to Evaluate a Distributed Search Engine[C]// IADIS International Conference WWW/Internet 2003, ICWI 2003, Algarve, Portugal, November. 2003:347-354.
[9] S. Goel, S. Yadav. Search engine evaluation based on page level keywords[C]// Advance Computing Conference. 2013:870-876.
[10] F. Lazarinis, E. N. Efthimiadis. Measuring search engine quality in image queries in 10 non-English languages: an exploratory study[C]// Proceeding of the ACM Workshop on Improving Non English Web Searching, iNEWS 2008, Napa Valley, California, USA, October. 2008:89-92.
[11] X. Tian, Y. Lu, L. Yang. Query Difficulty Prediction for Web Image Search[J]. IEEE Transactions on Multimedia, 2012, 14(4):951-962.
[12] N. Wang. Design and Implementation of a Crawling System in Shopping Search Engine[C]// Computer Science and Engineering, 2009. WCSE '09. Second International Workshop on. IEEE, 2009:212-216.
[13] T. J. Moore, W. C. Kelley, A. R. Wilson. Online Shopping Search Engine for Vehicle Parts: US, US20090019008[P]. 2009.
[14] S. Park, K. Cho, K. Choi. Information Seeking Behavior of Shopping Site Users: A Log Analysis of Popshoes, a Korean Shopping Search Engine[J]. Journal of the Korean Society for Information Management, 2015, 32(4):289-305.
[15] Li Shi-Wei, X. D. Qian. Model of personalized online shopping search engine[J]. Application Research of Computers, 2010.