Paper PH10-2008
Public Health Tobacco Surveillance with SAS®
Mark Tabladillo, Ph.D., MarkTab Consulting, Atlanta, GA
Associate Faculty, University of Phoenix
ABSTRACT
The Centers for Disease Control and Prevention (CDC) uses SAS® as a strategic business intelligence and predictive analysis framework. Users across this Federal agency apply the latest public health science practice to develop peer-reviewed publications and guidelines serving the American public. This presentation focuses on how one specific scientific group within the CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) integrated SAS, and particularly SAS/AF®, into ongoing tobacco surveillance. The development methodology demonstrates how SAS has been strategically used to support public health reporting and legislative policy both in the United States and around the world.
BACKGROUND
To assist states and countries in developing and maintaining their comprehensive tobacco prevention and control programs, the World Health Organization (WHO) and Centers for Disease Control and Prevention’s (CDC’s) Office on Smoking and Health (OSH) developed the Youth Tobacco Surveillance System (YTSS). The YTSS includes two independent surveys, one for countries (the Global Youth Tobacco Survey) and one for American states (the Youth Tobacco Survey) (Centers for Disease Control, 2008h; SAS Institute Inc., 2008). A SAS/AF® application was developed to manage and process these surveys, as well as similar surveillance projects (including the Global School Personnel Survey and the Global Health Professions Student Survey).
PROBLEM STATEMENT
The Office on Smoking and Health (OSH) is a division within CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP). In 1999, OSH decided to use SAS as an automation tool to help process survey data for the newly developed Youth Tobacco Survey (YTS) and the Global Youth Tobacco Survey (GYTS). Both surveys consisted of a core questionnaire which survey teams administer to randomly selected school classes. The sampling design called for a two-stage sampling methodology of selecting schools and then classes at random. All students in selected classes then become part of the survey sample. Because of the complex sample design, the reporting also requires complex sample analysis, a feature not in SAS version 6.12 (which was the latest version in 1999). Therefore, the software produces reports using results from SAS-callable SUDAAN (RTI International, 2008). The public health team chose the SAS/AF framework to improve and automate the existing SAS Macro coding, and to provide a consistent way of developing estimates based on comparable complex sample analysis.
Previous papers have described the microevolution of this software (Tabladillo, 2002, 2003a, 2004). This presentation instead takes a systems approach to describe how the SAS/AF application developer responded to a public health need. The first section describes the five qualitative phases (common to all SAS application development) and outlines how this specific application automated certain steps within these common phases. The second section describes the technology quantitatively, providing a categorical census of details inside the application. The third and final section provides evidence of public health results.
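The two-stage design in the problem statement (schools first, then classes, with every student in a selected class surveyed) can be illustrated with a small sketch. The Python below is not taken from the application, which is written in SAS/AF. It assumes simple random selection without replacement at both stages, and the sampling frame, the function name two_stage_sample, and the school and class identifiers are all hypothetical. It shows how a student's base weight follows from the two selection probabilities.

```python
import random

def two_stage_sample(frame, n_schools, n_classes, seed=0):
    """Select schools, then classes within each selected school.

    frame: dict mapping school id -> list of class ids (hypothetical layout).
    Returns a list of (school, class, base_weight) tuples.
    Assumes simple random sampling without replacement at both stages.
    """
    rng = random.Random(seed)
    schools = rng.sample(sorted(frame), n_schools)
    p_school = n_schools / len(frame)
    selected = []
    for school in schools:
        classes = frame[school]
        k = min(n_classes, len(classes))
        p_class = k / len(classes)
        for cls in rng.sample(classes, k):
            # Every student in a selected class is surveyed, so a student's
            # selection probability is p_school * p_class.
            selected.append((school, cls, 1.0 / (p_school * p_class)))
    return selected

# Hypothetical frame: four schools with different numbers of classes.
frame = {
    "S1": ["S1-A", "S1-B", "S1-C", "S1-D"],
    "S2": ["S2-A", "S2-B"],
    "S3": ["S3-A", "S3-B", "S3-C"],
    "S4": ["S4-A", "S4-B", "S4-C", "S4-D", "S4-E", "S4-F"],
}
sample = two_stage_sample(frame, n_schools=2, n_classes=2)
for school, cls, weight in sample:
    print(school, cls, round(weight, 2))
```

Because selection probabilities differ from school to school, the resulting weights differ as well, which is one reason the reporting needs complex sample analysis rather than simple unweighted summaries.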
QUALITATIVE PHASES
The best way to understand the output is as a solution and response to the problem statement. The SAS solution was developed in response to a public health need, and its success is demonstrated by the choice of system analysts in several groups within OSH to continue using the application, even though each analyst has sophisticated SUDAAN and SAS programming skills (typically including SAS Macro). OSH has never mandated using the application, but analysts freely choose to continue using it because it allows them to present the best and most consistent results for ongoing surveillance. Every analyst since 1999 has provided unique and creative input into the functionality and design of the software application, and all the analysts know that the group has collectively contributed to the scientific rigor and technological development.
The following five qualitative phases, common to all SAS applications development, provide a framework to understand how this specific application has provided measurable value since 1999:
• Phase One: Data Integration
• Phase Two: Business Intelligence
• Phase Three: Predictive Analytics
• Phase Four: Dissemination
• Phase Five: Continuous Improvement
The following sections outline details of each of these five qualitative phases.
PHASE ONE – DATA INTEGRATION
SAS provides an efficient framework for extracting, transforming, and loading data (ETL). In this case, contractors scan the survey data from paper reports into an electronic text file. Individual survey samples typically have 1,500 participants. The SAS/AF software first checks the actual scanned survey results and compares these data with the expected school participation results (stored in a separate dataset). Output reporting allows the survey analyst to determine erroneous classifications. Once the actual and expected results are reconciled, the ETL process can continue.
The core ETL process includes:
• checking for out-of-range values (invalid responses for questions);
• checking for inconsistencies among variables (and perhaps editing certain nondemographic values to missing); and
• creating new derivative reporting variables (either individual questions reduced to a binary response, or combinations of variables reduced to a binary response).
The ETL process ends with a report stating the total amount of missing data per question, an output which frequently helps uncover potential problems in the original questionnaire.
The SAS/AF application then has a separate weighting program, which applies specific complex sample weights (based on the survey design) and then applies post-stratification adjustments (typically to adjust for demographic indicators such as gender, age, and race or ethnicity). After this weighting program is run, the final output is a weighted and cleaned dataset, which can then be returned to states and countries. The data are also made available for states in the STATE System (Centers for Disease Control, 2008f) and for countries through the Global Youth Tobacco website (Centers for Disease Control, 2008b).
PHASE TWO – BUSINESS INTELLIGENCE
The SAS/AF application then provides two types of reports for states and countries. One report uses PROC TABULATE and ODS (Output Delivery System) to produce summary tables of each survey question. Though the core survey has standard questions, states and countries typically work with the CDC to modify or add other questions of specific public health interest. This customization feature requires the SAS/AF application to use a separate dataset representing the survey questions and responses. Survey analysts choose Excel as a preferred method for customizing these questionnaire files. The SAS/AF application outputs expected values (percentages), confidence intervals (requiring complex sample estimation), and sample size for the overall sample and by demographic breakdown (such as gender, age, grade, and race or ethnicity).
The second report focuses on the derivative binary reporting variables. These binary variables are preferred for scientific reporting since they report the number of respondents who say “Yes”: for example, the number of students who say they have smoked a cigarette in the past 30 days. Using PROC TABULATE, ODS, and results from SUDAAN, these reports have percentages, confidence intervals, and sample sizes. Survey analysts typically choose text files for output, then deliver the final results in Microsoft Word. However, the SAS/AF application also allows for PDF or RTF output.
In both reports, the SAS/AF application retains the summarized report data, as well as other demographic combinations not reported, in a summary data warehouse. This data warehouse serves as a summary reporting source for producing internal documents across states or countries, and years. The SAS/AF application reporting feature uses reporting templates (datasets) which indicate what variables to report. The output can be either string or numeric variables, sent either to Excel or Access.
Typically, government reports are black-and-white, and usually involve tables (instead of figures). Since all reports must be individually inspected by several layers of people, Excel or Access allows document authors to produce appropriately formatted output and customized analysis text. The summary data allows for relatively quick output compared with reanalyzing source weighted data.
PHASE THREE – PREDICTIVE ANALYTICS
The tobacco surveys require using complex sample analysis, a methodology which uses survey design for confidence interval estimation. Since version 8, SAS has natively included complex sample procedures (SAS Institute, 2008), but for consistency’s sake, SUDAAN continues to be the preferred choice in this surveillance application. CDC findings for states (Centers for Disease Control, 2008e, 2008g) and countries (Centers for Disease Control, 2008d) show that preferred public health reporting is typically limited to demographic surveillance estimates. The SAS/AF system does not provide any automated predictive analytics, but instead allows for detailed output from the data warehouse for reporting purposes. A system analyst can interactively use the weighted datasets and the SAS-callable SUDAAN to test and develop predictive models.
Beyond this basic level, individual states and countries might choose to perform advanced predictive analytics. The SUDAAN software (for example) allows for many advanced modeling techniques using complex sample data (RTI International, 2008). Other complex sample analysis software includes CDC’s free Epi Info (Centers for Disease Control, 2008a), SPSS with the Complex Samples optional add-on (SPSS Inc., 2008) and Stata (StataCorp LP, 2008). The CDC website lists some countries that have attempted more predictive modeling (Centers for Disease Control, 2008c). Though other choices are available, the core technology in the SAS/AF application and the software licenses available point towards using the SUDAAN confidence interval estimates to arrive at variance estimates consistent with the automated reporting.
PHASE FOUR – DISSEMINATION
In the United States, public health departments may use the reports within the agency, for a department of education or for legislative use. Similarly, countries may report their results to a ministry of health, a ministry of education, national legislators, and the media (including magazines, radio, and television). In this surveillance system, the Centers for Disease Control acts as a scientific enabler, and depending on the situation may provide technical support for surveillance design, consistent data processing across all participants, and technical support for dissemination.
PHASE FIVE – CONTINUOUS IMPROVEMENT
The surveillance system implies repeated involvement, and the CDC website provides data for many states and countries which have repeated a youth tobacco survey. Often, the same team is involved in successive survey events, and that consistency provides an opportunity to build on the lessons of previous years. Both states and countries have opportunities to meet in national or regional conferences to share lessons and results from surveillance efforts.
QUANTITATIVE DESCRIPTION
This section describes the quantitative characteristics of the SAS/AF application. Included in this section are the following subsections:
• SAS Technologies
• Application Line Count
• Application Metadata
• Application Deployment
SAS TECHNOLOGIES
The SAS/AF application developer responded to a need, and the Centers for Disease Control has a license which includes many of the SAS technologies available across the entire agency, the Department of Health and Human Services. This broad license includes almost all the commonly licensed products, notably missing the Miner products, the Business Intelligence suite, and the industry-specific customized solutions. Also, the SUDAAN license includes all the functionality of SAS-callable SUDAAN. The operating system is Windows (and over the years included a mix of Windows 95, Windows 2000, Windows XP, Windows 2000 Server, and Windows 2003 Server).
Table 1 provides details on Data Integration (Phase One) and Business Intelligence (Phase Two), which form the heart of the automated SAS/AF application. The three categories on the left are SAS, SAS/AF, and SUDAAN. The SAS category includes functions in the Base SAS license. The descriptions match terms commonly used by SAS programmers in their resumes, and often asked for by managers hiring SAS programmers. The SAS/AF category includes features in that product, and the categories match the SCL categories from the SAS documentation. The SUDAAN category shows the two PROCs automated from that software license.
Some choices reflect developer preference. Instead of using PROC SQL, this application uses the DATA step for all data creation and merging purposes. Also, instead of using PROC REPORT, this application uses PROC TABULATE. In these (and other) choices, each technology has its advantages and disadvantages. The choice of technologies resulted from many interactive discussions and functional requests over the years.
Category   Description
SAS        DATA STEP
SAS        DATA STEP APPEND
SAS        DATA STEP MERGE
SAS        MACRO
SAS        ODS PDF
SAS        ODS RTF
SAS        ODS TEXT
SAS        PROC COMPARE
SAS        PROC CONTENTS
SAS        PROC DATASETS
SAS        PROC EXPORT
SAS        PROC FORMAT
SAS        PROC FREQ
SAS        PROC MEANS
SAS        PROC PRINT
SAS        PROC PRINTTO
SAS        PROC SORT
SAS        PROC TABULATE
SAS        PROC TRANSPOSE
SAS        STATEMENT: FILENAME
SAS        STATEMENT: FOOTNOTE
SAS        STATEMENT: LIBNAME
SAS        STATEMENT: OPTIONS
SAS        STATEMENT: TITLE
SAS        PERL REGULAR EXPRESSIONS
SAS        WIN32 API
SAS/AF     FRAME
SAS/AF     OBJECT-ORIENTED CLASSES
SAS/AF     SCL: LIST
SAS/AF     SCL: MESSAGE
SAS/AF     SCL: SAS SYSTEM OPTION
SAS/AF     SCL: SAS TABLE
SAS/AF     SCL: SUBMIT BLOCK
SAS/AF     SCL: VARIABLE
SAS/AF     SCL: WINDOW
SUDAAN     PROC CROSSTAB
SUDAAN     PROC DESCRIPT
[The Data Integration and Business Intelligence columns mark each applicable technology with an X; the row positions of those marks are not recoverable in this copy.]
Table 1. SAS Technologies by Qualitative Phases
LINE COUNT
Number of lines is a traditional way to measure software. In many legacy applications, programmers would not use blank lines within code, but in this SAS/AF application, both comments and blank lines were used in the code to enhance future code maintenance. Also, unlike traditional code, both SAS Macro code and SAS/AF code typically use repetitive blocks of code called multiple times but coded only once. In this application, the SAS/AF code submits customized SAS code based on system analyst selections on the screen and through customized datasets. In 1999, the set of hand-coded programs (based on SAS Macro) typically had 3,000 lines of code. The following table provides a line count summary for the SAS/AF application in four categories. Category one is the frame, and this SAS/AF application includes one frame with tabbed interfaces. The single-frame design has proved to be a development advantage, greatly simplifying application maintenance. There are no pop-up windows, and status messages are instead sent to the status bar or to the SAS log. Frames follow the Singleton design pattern, and the line count represents marshalling information among the user, the application metadata, and the application’s functions. Category two is Program, and represents specific application programs related to the qualitative phases (each within either data integration or business intelligence). This category includes PROC statements but also SCL statements for data quality (Tabladillo, 2003b). The third category is Operational, and represents classes which maintain program state and system analyst choices when navigating from tab to tab. The fourth category is Datasets, and represents classes that individually monitor important SAS datasets (including survey data, survey metadata, and application metadata). These classes typically involve multiple simultaneous objects from the same class (Tabladillo, 2003c). Overall, in 2008, the SAS/AF application has over 70,000 lines of code. 
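The point that a block of code is written once but submitted many times with different selections can be illustrated outside SAS/AF. The Python sketch below is not the application's SCL. The template text, the function name build_report_code, and the dataset and variable names are hypothetical, chosen only to show how analyst selections might be substituted into one stored block to generate several customized steps.

```python
from string import Template

# Hypothetical template: one stored block, reused for every breakdown.
REPORT_TEMPLATE = Template(
    "proc tabulate data=$dataset;\n"
    "  class $by_var;\n"
    "  var $analysis_var;\n"
    "  weight $weight_var;\n"
    "  table $by_var, $analysis_var*mean;\n"
    "run;\n"
)

def build_report_code(selections):
    """Return one generated step per requested breakdown variable."""
    steps = []
    for by_var in selections["by_vars"]:
        steps.append(REPORT_TEMPLATE.substitute(
            dataset=selections["dataset"],
            by_var=by_var,
            analysis_var=selections["analysis_var"],
            weight_var=selections["weight_var"],
        ))
    return "\n".join(steps)

# Hypothetical analyst selections, as might come from a screen or dataset.
selections = {
    "dataset": "work.survey",
    "by_vars": ["gender", "grade"],
    "analysis_var": "current_smoker",
    "weight_var": "finalwt",
}
generated = build_report_code(selections)
print(generated)
```

Adding a third breakdown variable to the selections lengthens the generated code but not the stored template, which is why a line count of the application understates the amount of code it can submit.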
Over time, the application has been expanded or consolidated for efficiency (Tabladillo, 2003a).
Category       Lines    Percentage
Frame          7,030           9.8
Program       31,819          44.3
Operational   12,751          17.7
Dataset       20,304          28.2
TOTAL         71,904         100.0
Table 2. Line Count by Category
APPLICATION METADATA
In addition to the standard SCL code (stored in a single SAS catalog), the SAS/AF application includes other metadata. First, there is a text file inside the SAS catalog which has mappings for Windows 32 API calls. This code allows for natively calling Windows, and has been discussed in a previous paper (Tabladillo, 2008). Second, there are SAS datasets which store the names of American states and territories, or countries (including World Health Organization regional designations). This authoritative list includes unique two-letter abbreviations (FIPS codes for the American dataset) used for metadata within datasets, on report headings, and in the names of output files and directories. Third, a separate file includes the standard Windows errors, and is used in the unintended but anticipated need to interpret a Windows error code (Tabladillo, 2008). Finally, the system produces SAS datasets with rows which represent individual surveys. These “master” datasets dynamically populate the SAS/AF application interface with choices for survey analysis.
APPLICATION DEPLOYMENT
In 1999, the SAS/AF application was developed in version 6.12 on Windows 95, and deployed on desktop installations. A few years later, CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) decided to make SAS deployment available on Windows 2000 Server through technology provided by Citrix. Users would then log on to a central (and typically faster) server and datasets would be commonly shared on a secured network drive area. Since then, the model has expanded to all of CDC, and now runs on Windows 2003 Server. The choice of network deployment of both software and data (on network drives) has streamlined application deployment, data access, application and data security, and backups.
RESULTS
The Youth Tobacco Surveys have been administered in over 35 states and 140 international sites. The Centers for Disease Control relies on partnerships in the United States with state public health departments and internationally with the World Health Organization (WHO) and the Canadian Public Health Association (CPHA). As stated before, SAS system analysts continue to choose to use this application, though the breakdown of specific SAS technology shows that the internal tools are commonly available to any programmer skilled with SAS Macro. Instead of writing perhaps several thousand lines of code, these analysts prefer the ease of this customized application for producing consistent, scientific-quality output for public health reporting. The following figure shows (in blue) the number of countries that the Centers for Disease Control has subsequently been able to serve using this SAS/AF application.
Figure 1. Countries served with the SAS/AF Application
CONCLUSION
This presentation started with an introduction and summary problem statement. The problem statement describes the public health need, and the subsequent sections show how a specific SAS/AF application was collaboratively developed to respond to that need. The first section describes the five qualitative phases (common to all SAS application development) and outlines how this specific application automated certain steps within these common phases. The second section describes the technology quantitatively, providing a categorical census of details inside the application. The third and final section provides evidence of public health results. The details show that although the solution was based on the SAS System, it also included technologies from SUDAAN, Microsoft Windows, and Citrix. The framework used to describe this application could be used to describe any application, whether SAS/AF or not, and whether based on the SAS System or not.
REFERENCES
Centers for Disease Control. (2008a). Epi Info. Retrieved April 15, 2008, from http://www.cdc.gov/epiinfo/
Centers for Disease Control. (2008b). GYTS Data Sets. Retrieved April 15, 2008, from http://apps.nccd.cdc.gov/GYTSDataSets/
Centers for Disease Control. (2008c). GYTS: Data Results by Country and Year. Retrieved April 15, 2008, from http://www.cdc.gov/tobacco/global/GYTS/results.htm
Centers for Disease Control. (2008d). Morbidity and Mortality Weekly Reports (MMWRs): Global Data. Retrieved April 15, 2008, from http://www.cdc.gov/tobacco/data_statistics/MMWR/by_topic/global.htm
Centers for Disease Control. (2008e). Morbidity and Mortality Weekly Reports (MMWRs): Youth Data. Retrieved April 15, 2008, from http://www.cdc.gov/tobacco/data_statistics/MMWR/by_topic/youth.htm
Centers for Disease Control. (2008f). STATE System. Retrieved April 15, 2008, from http://apps.nccd.cdc.gov/statesystem/
Centers for Disease Control. (2008g). Youth and Tobacco Use: Current Estimates. Retrieved April 15, 2008, from http://www.cdc.gov/tobacco/data_statistics/Factsheets/youth_tobacco.htm
Centers for Disease Control. (2008h). Youth Tobacco Survey (YTS). Retrieved April 15, 2008, from http://www.cdc.gov/tobacco/data_statistics/surveys/YTS/index.htm
RTI International. (2008). Retrieved April 15, 2008, from http://www.rti.org/SUDAAN/
SAS Institute. (2008). Introduction to Survey Sampling and Analysis Procedures. Retrieved April 15, 2008, from http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/introsamp_index.htm
SAS Institute Inc. (2008). Sample 25276: Reading SAS Data in .Net Using ADO and ADO.NET. Knowledge Base/Samples & SAS Notes. Retrieved February 1, 2008, from http://support.sas.com/kb/25/276.html
SPSS Inc. (2008). SPSS Complex Samples. Retrieved April 15, 2008, from http://www.spss.com/complex_samples/
StataCorp LP. (2008). Stata statistical software for professionals. Retrieved April 15, 2008, from http://www.stata.com
Tabladillo, M. (2002). Developing a Control Methodology for Customized Data Management and Processing. Proceedings of the Southeast SAS Users' Group Conference 2002. Retrieved April 15, 2008, from http://analytics.ncsu.edu/sesug/2002/DM09.pdf
Tabladillo, M. (2003a). Application Refactoring with Design Patterns. Proceedings of the Twenty-Eighth Annual SAS Users Group International Conference. Retrieved April 15, 2008, from http://www2.sas.com/proceedings/sugi28/031-28.pdf
Tabladillo, M. (2003b). Checking Datasets before Submitting Code. Proceedings of the Southeast SAS Users' Group Conference 2003. Retrieved April 15, 2008, from http://analytics.ncsu.edu/sesug/2003/DM04-Tabladillo.pdf
Tabladillo, M. (2003c). The Dataset Attribute Family of Classes. Proceedings of the Southeast SAS Users' Group Conference 2003. Retrieved April 15, 2008, from http://analytics.ncsu.edu/sesug/2003/AD05-Tabladillo.pdf
Tabladillo, M. (2004). How to Implement the One-Time Methodology. Proceedings of the Twenty-Ninth Annual SAS Users Group International Conference. Retrieved April 15, 2008, from http://www2.sas.com/proceedings/sugi29/028-29.pdf
Tabladillo, M. (2008). Return of the Codes: SAS’®, Windows’®, and Yours. Proceedings of the Second Annual SAS Global Forum. Retrieved April 15, 2008, from http://www2.sas.com/proceedings/forum2008/004-2008.pdf
TRADEMARK CITATION
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Mark Tabladillo
Web: http://www.marktab.com