Maximizing Commerce and Marketing Strategies through Micro-Blogging
Janée N. Burkhalter, Saint Joseph’s University, USA
Natalie T. Wood, Saint Joseph’s University, USA & Edith Cowan University, Australia
A volume in the Advances in Marketing, Customer Relationship Management, and E-Services (AMCRMES) Book Series
Managing Director: Lindsay Johnston
Managing Editor: Austin DeMarco
Director of Intellectual Property & Contracts: Jan Travers
Acquisitions Editor: Kayla Wolfe
Production Editor: Christina Henning
Development Editor: Caitlyn Martin
Typesetter: Tucker Knerr
Cover Design: Jason Mull
Published in the United States of America by
Business Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data
Maximizing commerce and marketing strategies through micro-blogging / Janee N. Burkhalter and Natalie T. Wood, editors.
pages cm
Includes bibliographical references and index.
Summary: “This book examines the various methods and benefits of using micro-blogs within a business context, bringing together the best tools and tactics necessary to properly incorporate this approach”-- Provided by publisher.
ISBN 978-1-4666-8408-9 (hardcover : alk. paper) -- ISBN 978-1-4666-8409-6 (ebook)
1. Marketing--Management. 2. Blogs. 3. Business planning. I. Burkhalter, Janee N., 1979- editor. II. Wood, Natalie T., 1970- editor.
HF5415.13.M36759 2015
658.8’72--dc23
2015008254

This book is published in the IGI Global book series Advances in Marketing, Customer Relationship Management, and E-Services (AMCRMES) (ISSN: 2327-5502; eISSN: 2327-5529)

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.

For electronic access to this publication, please contact: [email protected].
Chapter 12
Twitter Data Acquisition and Analysis: Methodology and Best Practice
Stephen Dann
Australian National University, Australia
ABSTRACT
Social media data collection is often treated as tacit knowledge with the collation of tweets reduced to a single sentence without explanation as to means, mechanisms or relative merit of the approach. This chapter describes methods and techniques for the capture of Twitter timeline data, inclusive of first person and third party methods for data capture from personal accounts, public accounts, and keyword searches. The chapter takes a practical approach to acquiring Twitter data with a focus on individual timelines, and small to medium scale search sets. The emphasis is on being able to obtain, examine, and convert Twitter data into knowledge quickly, and with limited requirement for technical skills. This type of data collection assumes no prior programming knowledge. The chapter explains how to retrieve Twitter data from three sources: personally controlled timelines, third party timelines and ongoing search results. Finally, the chapter describes preliminary analysis that can be performed to ascertain content creation patterns, without recourse to analysis of individual tweets.
DOI: 10.4018/978-1-4666-8408-9.ch012
Copyright © 2015, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION
Collecting Twitter timeline data is a balancing act between the commercial needs of Twitter, the research question driving the data collection, and the assumption of Twitter as a public source of secondary data. The advantage of Twitter over Facebook as a social media pulse is the default public nature of Twitter, versus the default “walled-garden” private nature of Facebook. Twitter accounts default to posting content to the public timeline, and can be viewed from a public website, which creates a source of secondary published data. As individual tweets can be attributed directly by URL and citation, it is possible to view an account timeline as a sequence of 140-character standalone publications to be viewed, captured and analyzed as sequential issues in a publicly published volume. Twitter generates an extraordinary amount of data as a platform, and it can be intimidating to consider that this platform can generate in excess of 400 million new tweets per day (Kumar et al. 2014). That said, for studies focused on individual account behaviors or small groups of accounts, there are data collection mechanisms designed to collect small to medium scale data sets. This chapter examines the steps required to obtain timeline data from public Twitter accounts, including a longitudinal search, through a range of capture techniques. It outlines how analysis using just a timestamp and minimal Twitter data can be applied to determine if differences emerge from when, where and what was used to publish to the Twitter account.
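As a minimal sketch of what a timestamp-only analysis might look like, the following Python fragment tallies tweets by weekday and by hour of day. The timestamp layout, the function name `posting_pattern`, and the sample values are assumptions made for illustration; they are not taken from the chapter's data, and a real export may need a different format string.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust to match the actual export being analyzed.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Fixed English names so the result does not depend on the system locale.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def posting_pattern(timestamps):
    """Count tweets per weekday and per hour of day from timestamp strings."""
    by_weekday = Counter()
    by_hour = Counter()
    for raw in timestamps:
        moment = datetime.strptime(raw, TIMESTAMP_FORMAT)
        by_weekday[WEEKDAY_NAMES[moment.weekday()]] += 1
        by_hour[moment.hour] += 1
    return by_weekday, by_hour


# Hypothetical sample values, for illustration only.
sample = ["2014-03-03 09:15:00", "2014-03-03 09:45:00", "2014-03-04 21:05:00"]
weekdays, hours = posting_pattern(sample)
print(weekdays.most_common())
print(sorted(hours.items()))
```

Counting by weekday and hour needs nothing beyond the timestamp itself, which is why it suits the sparser capture formats as well as the verbose ones.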
BACKGROUND
The research field of Twitter operates at three levels of abstraction for the purpose of data collection: tweet, timeline and pulse. Tweet level is the individually identified message that can be directly accessed by URL. Tweet analysis can take place at the individual level, or within the context of a series of tweets. Timeline is a sequence of tweets from a single account collated directly from Twitter. Time series style tweet analysis, used to detect patterns in Twitter use over time and in response to specific external events, relies on the timeline. Finally, pulse level data is where Twitter is used to track user sentiment around a keyword, topic or idea across a range of unconnected accounts. Pulse level data is frequently considered the domain of big data using automated analysis and macro-scale capture of millions of tweets. However, this chapter outlines search based tweet capture on specific #hashtags and key terms to provide topic analysis of large yet manageable bodies of pulse data. Search based data capture also allows for the tracking of brand mentions within the broader Twitterverse or the observation of Twitter engagement with competitor accounts via @mention tracking.
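To illustrate how #hashtags and @mentions might be pulled out of captured tweet text for pulse-style tracking, the sketch below uses two simple regular expressions. The patterns, the function name `tally_markers`, and the sample tweets (including the account name) are hypothetical; the patterns are a simplification and do not reproduce Twitter's own entity parsing rules.

```python
import re
from collections import Counter

# Simplified patterns (an assumption, not Twitter's official entity rules).
# The lookbehind skips markers embedded in words, such as e-mail addresses.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w+)")


def tally_markers(tweets):
    """Count hashtags and mentions across tweet texts, ignoring letter case."""
    hashtags = Counter()
    mentions = Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(name.lower() for name in MENTION.findall(text))
    return hashtags, mentions


# Hypothetical tweets, for illustration only.
tweets = [
    "Loving the new #Coffee blend from @ExampleBrand",
    "@examplebrand when is the #coffee sale? #deals",
    "Contact me at someone@example.com",
]
hashtags, mentions = tally_markers(tweets)
print(hashtags.most_common())
print(mentions.most_common())
```

Lowercasing before counting treats `#Coffee` and `#coffee` as the same topic, which is usually what a brand-mention tally wants.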
CAPTURING TWITTER DATA: TIMELINE
This chapter features an extended examination of four tweet capture mechanisms to articulate the approaches for collecting data for both industry and academia. Academically, an established method with an explanation of the detail of the acquisition of tweets can be sourced, referenced, and used as a basis to acknowledge variations on collection method. For practitioners, data collection best practice can be used to inform the internal metrics of the organization, or form the basis for a customized protocol to acquire competitor information, or analyze the company’s performance. The four methods outlined involve internal Twitter account archives, externally mediated timeline capture through Kwitty, keyword search via Hootsuite, and web capture using NCapture as part of NVivo analysis. Each method is discussed in terms of the variables captured in the data, and the steps needed to perform the capture.

All methods should be considered equal in their value to a content classification process. Selection and use of a method should be determined by its relative value to an individual project. It may be that Kwitty’s minimalistic four item data set is more valuable for capturing a personally controlled timeline as it will benchmark against subsequent external timeline captures. Alternatively, NCapture’s depth could suit a project requiring greater nuance and pre-prepared coding than is present in the Twitter Archive data. Table 1 outlines a brief comparison of the four methods.
Table 1. Comparisons of the four collection methods

| | Internal Timeline Archive | Kwitty | Hootsuite | NCapture |
| Cost | Free | Free | Subscription fees | Set up costs |
| File Format | CSV, JSON | XML | CSV | NCVX |
| Data | Verbose | Sparse | Verbose | Verbose |
| Data Points | 10 | 4 | 13 | 18 |
| Timelines | Personal | Third party | Search results | Third party / Search |
| Style | Manual | Manual | Automated | Manual |
No method presented here is more or less endorsed, nor is any particular proprietary method endorsed by payment or another arrangement. As data collection involving Twitter is an evolving process, the methods presented in this chapter should be considered illustrative rather than declarative.
Method 1: Internal Timeline Archive Data
The first method captures Twitter data directly from within the Twitter service via a request for a Twitter archive. As this capture comes from within Twitter, it is relatively straightforward and will be used as a comparative benchmark for the other data sets. As the owner of a Twitter account, an individual may request the download of their history within the preference section of their account. The limit of this approach is that account ownership is required for the download request. Third party accounts cannot be accessed directly through this method. Data requested directly from Twitter offers one of the richer veins of information available, as the system is drawing from the master data source. The downloadable file consists of an interactive interface using an HTML index page, a JSON data set, and a CSV file which contains all tweets. The CSV file has ten columns of data for each tweet that describe the text of the tweet, and its relative position in the overall Twitter universe in terms of relationships with other users and other tweets. The CSV file is of most value to the method outlined in this chapter. Table 2 presents the summary overview of the Twitter data set header information contained in the CSV file. Twitter Archive data contains one distinct element not present in other captures: the expanded URL category, which converts the in-house Twitter (t.co) URL shortener back to its original link. The expanded URL function does not address any other shortened URL (e.g. bit.ly), and only covers the period after the service commenced operation. The archival function provides sufficient data to form the basis for internal analysis where the purpose of the project is to examine the individual, personally controlled Twitter timeline.
Table 2. Twitter archive data variables

| Variable ID | Descriptor |
| tweet_id | ID marker for each of the timeline author’s original tweets |
| in_reply_to_status_id | ID marker for a tweet from another timeline that led to the author’s response |
| in_reply_to_user_id | The author of the tweet that resulted in the reply tweet |
| Timestamp | Timestamp based on the timeline author’s declared location |
| Source | The platform or framework used to create and send the tweet |
| Text | Actual body of the tweet |
| retweeted_status_id | The tweet_id of a retweet taken from its original timeline |
| retweeted_status_user_id | The identity of the author being retweeted |
| retweeted_status_timestamp | Date stamp for the retweeted tweet |
| expanded_urls | Where the URL has been shortened by Twitter using the t.co address, the full URL is published here. This cell does not report the URL for non-t.co addresses |

Value of the Personal Timeline Dataset
Academic use of a personal timeline is relatively limited in contrast to the value for practitioners. As academics, personal reflections are often limited in their use. It could be possible that the individual timeline could aid in the development of a reflective narrative of teaching practice, or provide additional data during an ethnographic observation project. However, for the most part, with the notable exclusion of the original Dann (2010) paper, there are few opportunities for applying a personal timeline to academic study. In direct contrast, self-analysis has several practical purposes for academics and business. For both parties, understanding the vocabulary of the Twitter account, and the tweeting patterns, may be invaluable for determining what to do next with the account. The ability to store, retain and revisit a Twitter account’s performance over the full history of the account is a valuable piece of data to integrate into organizational history and metrics. Additional analysis of the text content can reveal adherence to, or deviations from, branding strategy and communication policy. As each tweet is date stamped, it can also be tied to external performance metrics such as sales generated from tweeting a discount code or traffic patterns based on tweets to the company’s website. Finally, access to the history of the account’s communications can uncover relationship marketing information in the form of most common conversations and most frequent conversational partners across the organization’s use of the account.
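The kind of self-analysis described here can be sketched in a few lines of Python. The example below reads an archive-style CSV using the column names listed in Table 2 and tallies the publishing clients, the most frequent reply partners, and the number of retweets. The function name `summarize_archive`, the lower-cased header handling, and every sample row are assumptions for illustration; the layout of a real archive file should be checked before relying on it.

```python
import csv
import io
from collections import Counter


def summarize_archive(handle):
    """Tally clients, reply partners, and retweets from an archive-style CSV."""
    clients = Counter()
    partners = Counter()
    retweets = 0
    total = 0
    for raw_row in csv.DictReader(handle):
        # Lower-case the headers, as the file and Table 2 may differ in case.
        row = {key.strip().lower(): (value or "").strip()
               for key, value in raw_row.items() if key is not None}
        total += 1
        clients[row.get("source", "")] += 1
        if row.get("in_reply_to_user_id"):
            partners[row["in_reply_to_user_id"]] += 1
        if row.get("retweeted_status_id"):
            retweets += 1
    return {"total": total, "clients": clients,
            "partners": partners, "retweets": retweets}


# Hypothetical rows using the Table 2 column names, for illustration only.
SAMPLE = "\n".join([
    "tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,"
    "retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,"
    "expanded_urls",
    "1,,,2014-03-03 09:15:00,web,Hello world,,,,",
    "2,90,555,2014-03-03 09:45:00,mobile,@friend thanks,,,,",
    "3,,,2014-03-04 21:05:00,web,RT @other: news,77,888,2014-03-04 20:00:00,",
    "4,91,555,2014-03-05 10:00:00,web,@friend agreed,,,,",
])

summary = summarize_archive(io.StringIO(SAMPLE))
print(summary["total"], summary["retweets"])
print(summary["clients"].most_common())
print(summary["partners"].most_common(3))
```

To run the same tally on a downloaded archive, an open file handle (for example `open(path, newline="", encoding="utf-8")`) can be passed in place of the in-memory sample.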
Method 2: External Timeline Capture
Kwitty is a Google Chrome browser plugin used to access and archive the public Twitter timelines of externally owned, third-party accounts (Heemsbergen and Lindgren 2014). As secondary data, Twitter account timelines have been available for capture and download through an evolving range of mechanisms over time; Dann (2010) applied the Twitter2PDF software until API changes retired the software package. Setting up Kwitty is relatively straightforward for users with administration rights to make changes to their machines. The software is freely available from the Chrome store (https://code.google.com/p/kwitty/). The Kwitty data collection process is relatively robust, with the application having a minor limitation of periodically struggling to capture very short timelines (fewer than ten tweets). Data capture takes three steps: a timeline search identifies the account names, the timeline is identified for export (Figure 1), and the exported data is copied to Excel (or equivalent). Kwitty also requires ownership of a Twitter account to access the Twitter API. Kwitty does not use a dedicated search bar. The tweet box is converted to a search function after the letter f is typed, which means that the researcher does run the risk of posting their search query to their timeline. Although Kwitty can also perform timeline searches using the operator “s”, those searches are not yet exportable for capture. Kwitty will open a second tab to display the timeline where an export option becomes available (Figure 2).

One of the benefits of using Kwitty is the “history state” of each account in the timeline panel at the point of data export. This overview outlines the date of account formation, tweets since the account was created including the tweets per day calculation, the account’s following/follower status and the account’s self-description. The history state data must be captured manually at the point of data collection. A limit of the Kwitty timeline count is the lack of automation of the history state capture: users wishing to track follower/following counts over time would need to schedule manual periodic updates to their data sets. Kwitty requires accurate capitalization of the account name for the export option to work. With incorrect capitalization the user tab will open successfully, but the Export function will report an error that “usertab does not exist”. Checking and correcting the capitalization will usually resolve the error for exporting the data, which will open in a separate browser tab. Kwitty uses a more restricted four part data set which is sufficient for external account benchmark and comparison, but lacks the nuance of a direct timeline download from Twitter. However, as Kwitty can obtain third-party timelines, this limitation is a tradeoff for access to the data. Table 3 outlines the data variables captured through Kwitty.

Figure 1. Account history state (Sourced from www.twitter.com/drstephendann)

Figure 2. Charting patterns of use
Table 3. Kwitty archive data variables

| Variable ID | Description | Twitter Archive |
| ID | Unique identifier of the tweet | tweet_id |
| Tweet | Text content of the tweet | Text |
| Date | Timestamp of the data, in US format, set to the local time of the tweet | Timestamp |
| Via | Client used to publish a tweet | Source |
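Because Table 3 maps each Kwitty field to its Twitter Archive counterpart, a Kwitty capture can be relabeled so that both data sets share one vocabulary for benchmarking. The sketch below is one possible way to do this in Python. The function name `to_archive_row`, both date format strings, the lower-cased target names, and the sample record are assumptions for illustration; the exact layout of a Kwitty export is not specified here and should be verified against a real file.

```python
from datetime import datetime

# Field mapping taken from Table 3 (Kwitty field -> Twitter Archive field).
# Target names are lower-cased here as a working convention (an assumption).
KWITTY_TO_ARCHIVE = {
    "ID": "tweet_id",
    "Tweet": "text",
    "Date": "timestamp",
    "Via": "source",
}

# Assumed layouts: a US-style month/day/year input and an ISO-like output.
KWITTY_DATE_FORMAT = "%m/%d/%Y %H:%M"
ARCHIVE_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_archive_row(kwitty_row):
    """Rename Kwitty fields to archive names and normalize the date layout."""
    row = {KWITTY_TO_ARCHIVE[key]: value
           for key, value in kwitty_row.items() if key in KWITTY_TO_ARCHIVE}
    parsed = datetime.strptime(row["timestamp"], KWITTY_DATE_FORMAT)
    row["timestamp"] = parsed.strftime(ARCHIVE_DATE_FORMAT)
    return row


# Hypothetical Kwitty record, for illustration only.
example = {"ID": "123", "Tweet": "Hello", "Date": "03/04/2014 21:05", "Via": "web"}
converted = to_archive_row(example)
print(converted)
```

Only the four shared fields survive the conversion, so any comparison between a Kwitty capture and an archive download is limited to identifier, text, timestamp and client.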
As an external call to the Twitter API, the Kwitty software can only access the 3200 tweets preceding the date of the search, a limitation common to external captures. For contrast and benchmarking, this can provide a level playing field for sampling timeline history (tweets n = 3199) for long-term accounts. This can be a limiting factor when examining accounts with a high volume of recent activity (n = 3200, 100) that publish very few tweets.

Brandjacking: Attempts by an organization to redirect attention to their products or brands based on the mention of a rival’s brand name.
Broadcaster: Twitter accounts with a very high follower to following ratio by design as they cultivate an audience without attempting to follow other users.
Content Analysis: Collective term for a range of ways to analyze text, music, audio, video and other formats.
Crosstabs: A tabular display of results that displays the frequency distribution of select variables within the dataset.
Evangelistic: Twitter accounts that are the active audiences who follow many more accounts than follow them, and who engage with the accounts that they follow even when the engagement is unidirectional.
Gisting: A fast and preliminary analysis of a large body of text to look for trends, patterns and research ideas of interest.
Longitudinal Analysis: Repeated research investigations over time around the same or similar research questions.
Pulse Analysis: An analysis of tweets around a common thematic cluster of a hashtag, keyword, or specific event.
Tag Cloud: A visualization of the commonly used words within a larger body of text.
Timeline Analysis: A time series style analysis of patterns and trends in a single Twitter account’s history.
Visible Figures: Types of Twitter users who have acquired a followership without intentionally developing a curated audience.
ENDNOTE
1 http://www.qsrinternational.com/products_nvivo_add-ons.aspx#ncapture