SUGI 31
Data Warehousing, Management and Quality
Paper 104-31
THE DESIGN AND USE OF METADATA: PART FINE ART, PART BLACK ART
Frank DiIorio, CodeCrafters, Inc., Chapel Hill NC
Jeff Abolafia, Rho, Inc., Durham NC
Introduction
The complexity of even small pharmaceutical projects can be daunting. Consider the deliverables: patient
profiles, listings, domain and analysis data sets, Define files, tables, and figures. Even in a single study,
these routinely total hundreds of files. For NDA submissions, these are but a single piece of a larger
“puzzle.”
Consider as well the documentation and human resources pushing the study through its life cycle. Project
managers need to monitor the completion status of the files. Statisticians and analysts have to identify data
requirements and lay out “dummy” displays. Programmers have to write the programs to create the data and
reports using specifications that are often, to be kind, “fluid.”
Creation of high-quality output requires coordination of effort and clear and immediate communication of
results. Rho has migrated much of the requisite project management and data and display specifications to
carefully designed and utilized metadata. By moving items that describe data sets and displays from
documents and low-level programs into data sets, we have realized significant gains in productivity and
quality of output.
This paper describes the current use of metadata at Rho. It:
o Discusses the motivation for using metadata
o Describes the metadata architecture
o Identifies tools that access the tables
o Presents examples, comparing metadata-driven and non-metadata-driven programs
The paper is largely conceptual and nearly code-free. While we emphasize application development in the
pharmaceutical industry, we feel the underlying concepts regarding metadata design and implementation are
valid across industries.
The Need for Metadata
Let’s begin with an overview of a typical project’s programming requirements (Figure 1, below). This
is, essentially, a map of the “before metadata” landscape. Among the notable characteristics are:
Many Files, Many Formats. Specifications for dataset creation, statistical analysis, data displays, variable
derivation, and other items are typically held in a variety of formats. Word files, Excel spreadsheets, and
other formats are, indeed, convenient for the statistician or project manager who is used to and comfortable
with these tools. None, however, have the structure and security of a database – audit trails, controlled
views, and other features that are second nature to database designers are lacking or poorly implemented
here. The variety of tools also makes it somewhat difficult for an analyst or programmer to easily move
from one document format to another. This is, in other words, a “whole” that is not characterized by “parts”
that work well together.
Specifications Are Not Data. Word and similar applications have the advantage of being familiar to most
people, and they can produce attractive output. What they cannot serve as, however, is a data source:
they cannot be programmatically manipulated. This makes extraction of table specifications,
variable characteristics, and the like nearly impossible. This has a significant impact on work flow, as we
will see in the next section.
Duplication of Effort. Resources such as data, format, and macro libraries must be allocated at the
beginning of a program. One solution is coding the necessary macro variable definitions, LIBNAMEs,
options, and FILENAMEs in every program. When a change is required (e.g., different data source,
modified option setting), it must be applied to every program. The potential for incomplete or inconsistent
modification is significant.
The situation is somewhat improved if a standard AUTOEXEC file is used by all project programs. Changes
can be made to a single location and they will automatically be picked up by any program using it. Still, the
AUTOEXEC approach has shortcomings. It cannot, for example, ensure standardized library naming from
study to study. The file can, if left to the design whims of a project programmer, include files that in turn
include other files – an over-architected approach that makes debugging and modification time-consuming.

Figure 1: Organization, Pre-Metadata
[Diagram: specifications held as Word and Excel documents (the SAP, data specifications, and TFL specifications) drive dataset-creation programs and utility macros. These produce domain and analysis data, which in turn feed reports, TFLs, patient profiles, CDISC Define files, SDTM/CDISC data, and the NDA deliverables (analysis and domain .XPT files).]
Change. Anyone involved for even the briefest time on an NDA submission or similar study has been
exposed to an environment characterized by rapidly changing requirements. To describe but a few:
o The definition of analysis variables may require refinement or correction.
o Variables need to be renamed, relabeled, recoded, or dropped from datasets.
o Specification of use of this data in patient profiles, listings, tables, and figures may also change.
o Change to a display may even require a modification to the data being displayed (i.e., the change is pushed “upstream”).
Change in and of itself is normal, and should be welcomed as a sign that people at various stages of the
process – statisticians, programmers, medical writers, et al. – are examining project artifacts closely. What
can be problematic is, first, ensuring that the change is communicated to the programmers and second, that the
changes are correctly made in all of the affected programs.
Repetition. Another characteristic of the status quo environment is, arguably, the most exasperating from
the standpoint of the statistical programmer. Display specifications are often characterized by a high
degree of repetition and a small degree of variation. Twenty displays may vary only by the study
population subset or categories they represent: one display may process the entire population, another
may categorize by age group, and another may subset to those who left the study prematurely. In all cases, the
underlying program is the same but the data that feeds into it, along with title text, varies.
Consider This Scenario
Consider the scenario presented in Figure 2, below. Ten tables are identical in layout except for the
underlying table population. The program that produces one of the tables is, essentially, identical to the
other nine table programs. The only difference is the selection of the population and the title text. Having
10 clones of the same program is inefficient for initial program development, and exposes the programmer
to the same set of pitfalls described in the “Change” section, above. If, for example, the text or order of
footnotes is altered, the change must be applied to all 10 programs.

Figure 2: Footnote Text Change, Pre-Metadata
[Diagram: a TFL specification document in Word drives programs TBL110.SAS through TBL128.SAS, which produce tbl110.rtf through tbl128.rtf. Each program repeats the same setup and hard-coded header and footer text, differing only in its WHERE clause and table number:]

   libname clin 'path';
   libname library 'path';
   libname derived 'path';
   libname template 'path';
   options sasautos=('xxx', 'yyy');
   ods search path itemStores;

   data report;
      set derived.ae;
      where <population for table 110>;

   title1 'mPharm NDA' -r "&sysdate.";
   title2 -c "Table 110";
   title3 -c "subpopulation";
   footnote1 'changed text';
   footnote2 'more changed text';
   footnote3 'Program: TBL_T110.sas';
This scenario has many of the negative elements described earlier: the specifications are in a Word
document and thus not readable from a program; the change has to be made by the analyst, then
communicated to the programmer; and the programmer must make the change in multiple locations.
At best, this will make for late nights at the office. At worst, the changes will be ineffectively communicated
and/or not made in every program, thus raising the possibility of incorrect output being shipped to the
client. It’s worth noting that while it may be possible to automatically validate the body of the table, the
header and footer areas are usually manually inspected.
What would be preferable is a way to automatically propagate the changes to all 10 programs. That is, we
want to abstract the programs, making them, in effect, a form or template that receives specifications such
as data subsetting, title text, and the like. The program creating the display becomes highly data-driven.
We’ll see an improved, metadata-driven version of this program shortly (Figure 7, below, if you want to
cut to the chase).
What To Do?
What would be desirable, given the preceding discussion, is non-redundant storage of project artifacts in
file formats that are programmatically accessible. If data, TFL, and other specifications are stored as data,
they can be easily manipulated. Further, if they are stored in the same format across studies and projects,
the uniformity can be exploited and a corporate-level library of access tools can be developed.
Standardization and tools provide the application programmer the means to rapidly develop output,
knowing that it is timely and accurate. What’s needed, in short, is a well-designed collection of metadata
tables.
Metadata Overview
Based on the problems identified above, we moved items such as directory structures, variable definitions,
and TFL components out of programs and Word documents into machine-readable metadata. In this section,
we describe the metadata groupings and their contents. This is followed by some observations on
the evolution of the data, which leads us into the wide-open realm of the metadata macro/server.
The strategic positioning of the metadata tables is shown in Figure 3, below (compare this with Figure 1,
above, and note that traditional specifications have been replaced by metadata).
Figure 3: Metadata Usage
[Diagram: the Word/Excel specifications of Figure 1 are replaced by metadata tables – Structure, Project, Data (Dataset, Variable, SDTM Dataset), Display (GlbDisp, StdDisp, Footnotes), and misc. Programs and utility macros read the metadata to produce domain and analysis data, reports, TFLs, patient profiles, CDISC Define files, SDTM/CDISC data, and the NDA .XPT deliverables.]
Groupings
The tables fall into a few general categories. Note that the classification is for discussion purposes only –
there is no reason why, for example, an application could not draw on Structure, Project, and Display
metadata tables. There are currently five groupings of metadata:
o Structure — Tables describe the organization and usage of directories for a project and its component studies.
o Project — Contains protocol name and description for the project and each of its component studies.
o Data — Describes individual datasets and variables.
o Display — For each TFL, contains fields for display names, titles, subpopulations, footnotes, and links to display-creation program files.
o Miscellaneous — Stores items that are not project-specific, such as option settings, global macro variables, and administrative items.
Content
The content of the major metadata groups and tables is described below.
Structure
The table that is central to nearly all metadata usage is STRUCTURE. This table describes the location of
each directory in a study. It captures the location, possibly wildcarding directories for multiple studies,
usage (data library, macro autocall, format library), and other options.
This table serves as the basis for all LIBNAME statements, autocall and format search paths, and ODS
template library allocations. The settings are described only once (in the table). If a change needs to be
made – for example, a new macro path order or concatenated data libraries – the change is made here, and
is automatically reflected in programs that use it. Just how the programs transform, say, a Microsoft
Access table into SAS option statements and LIBNAMEs is discussed in “Metadata Tools,” below.
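As a preview of what those tools generate, a minimal sketch of the LIBNAME half of the job might look like the following. The META libref and the LIBREF, PATH, and USAGE fields are illustrative assumptions, not the production schema:

   /* Sketch: generate LIBNAME statements and the autocall path    */
   /* from STRUCTURE metadata. Table and field names are assumed.  */
   libname meta 'metadata-location';   /* the one hard-coded path  */

   data _null_;
      set meta.structure;
      where usage = 'data';
      call execute(cats('libname ', libref, ' "', path, '";'));
   run;

   proc sql noprint;                   /* macro autocall locations */
      select quote(strip(path))
         into :autopaths separated by ' '
         from meta.structure
         where usage = 'autocall';
   quit;
   options sasautos=(&autopaths. sasautos);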
Project
This level of metadata contains high-level, descriptive information about a project and its component
studies. The table contains one observation per study. Fields include the study title, protocol number,
type of study, phase, number of treatments, and the location of key study documents (SAP, manuscripts).
Data
We create two datasets for each study and data type (source / “raw” and analysis / derived). DATASETS
holds dataset-level metadata, one record per dataset in the study. It contains fields that:
o Describe the contents and structure of the dataset
o Discuss how the dataset was created (essentially background material useful for programmers)
o List variables that uniquely identify an observation in the dataset
o Filter a dataset, identifying whether it should be included in particular types of output. A dataset may, for example, be part of an ISS but not an ISE.
There are also text fields to enter general comments about the dataset.
The VARIABLES table is the companion to the DATASETS metadata. For every dataset in the DATASETS
metadata, the VARIABLES table contains one observation per variable, describing:
o Descriptors such as type, length, format, and label;
o For “raw” data, the source page number in the Case Report Form;
o For derived data, a narrative of how the variable is created;
o Controlled terms or codes;
o The desired variable order when writing datasets for FDA submission;
o A date-time stamp identifying when the most recent change to the observation was made;
o Filter variables that facilitate selection of variables for different types of output. The ability to filter at
the dataset-variable level is a great asset, and is discussed in the “Examples” section, below.
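To make one use of this table concrete, here is a sketch of how a metadata macro might assemble an ATTRIB statement from VARIABLES. The META libref and the DATASET, NAME, TYPE, LENGTH, FORMAT, and LABEL field names are assumptions for illustration:

   /* Sketch: build an ATTRIB statement for one dataset from the    */
   /* VARIABLES metadata, then apply it while creating the dataset. */
   proc sql noprint;
      select catx(' ', name,
                  cats('length=', ifc(type='char', '$', ''), length),
                  ifc(not missing(format), cats('format=', format), ''),
                  cats('label="', label, '"'))
         into :attribs separated by ' '
         from meta.variables
         where dataset = 'AE';
   quit;

   data derived.ae;
      attrib &attribs.;     /* attributes supplied by the metadata */
      set raw.ae;
   run;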
Displays
Standard output for most studies includes a series of tabulations, graphic displays, and listings of individual
observations, collectively referred to as TFLs. A study can require hundreds of TFLs. Likewise, it is the
norm for these displays to be based on a much smaller number of unique layouts. Twenty tables could be
laid out similarly, for example, varying only by the population used in each table – treated patients, age
greater than 65, and so on.
The DISPLAY metadata group describes key features of each TFL. The DISPLAY table contains:
o Display number
o Title lines
o Footnote codes (see next paragraph)
o Datasets used by the table
o Display type (Table, Figure, Listing)
o Filtering information, expressed both as descriptive text to use in titles and as a syntactically valid SAS program statement
o A list of variables needed to create the display
The FOOTNOTES table complements DISPLAY. It contains a field with a short footnote code and a longer
text field containing the actual footnote text. The linkage of the DISPLAY and FOOTNOTES tables
emphatically demonstrates the power of metadata-driven processes. Recall the example at the end of the
previous section: a footnote’s text had to be manually changed in multiple programs. Using metadata, the
task is vastly easier and the output more reliable: a single change is made to text in the FOOTNOTES table,
then the affected programs are rerun without requiring modification. Just how the footnote table’s content
is passed to the table program is discussed in the “Tools” section, below.
Other tables in this metadata grouping describe general features such as titles and footnotes common to all
TFLs, font name, and point size. Figure 4 (below) illustrates the contribution of the different display
metadata tables to a summary table.
Comments
These Are “Living Documents.” The metadata architecture outlined in this section is not static, but
simply shows the system at a particular point in its evolution. As new needs arise (e.g., CDISC’s ADaM
and SDTM), new tables can be easily fit into the overall design. Also, new uses for existing metadata may
require additional fields. It’s important to realize that these and other scenarios are as desirable as they are
inevitable. Such change is, in effect, an acknowledgement of the power of metadata-driven programming.
If You “Live By the Metadata,” You Can Also “Die By the Metadata.” Continuing with the footnote
example – if the wrong footnote text is entered, or if the DISPLAY metadata incorrectly identifies footnote
codes, all that has been accomplished is more elegant production of bad results. A set of quality control
tools for the metadata was not part of the original design, but the need for them quickly became apparent.
The Choice of File Format Is Important. Early versions of the tables were held in SAS datasets and
Excel spreadsheets. These had a number of drawbacks: there were performance issues with the SAS/SHARE
server; the SAS and Excel “out of the box” user interfaces for editing data were awkward; multiple users
entering data into Excel sheets was not possible; and the SAS Version 8 tools for reading Excel sheets
(DDE, PROC IMPORT, access descriptors) were quirkier than we would have liked.
This suggested the need for a more robust database solution with a friendly and easily modified user
interface. Microsoft Access 2003 is now used for entry and storage of all metadata. Now, rather than type
a Word document, users enter dataset, TFL, and other specifications using Access forms. Our bridge to
this data is the SAS Version 9 ACCESS engine. It has proven to be extremely stable, and has eliminated
performance, entry, and data loss issues.

Figure 4: DISPLAY Metadata Usage
[Diagram: a sample summary table annotated to show each area's metadata source – shared header and footer material from the “Standard” table, display-specific titles from “Display”, and footnote text from “Footnotes”.]
It’s Mostly a Manual Process. Some metadata tables can be at least partially populated from existing
internal and external sources. Consider some of these:
o Standards organizations — CDISC, for example, publishes Excel files that can be viewed as the
“Gold Standard” for SDTM domain files.
o Existing Systems — Internal data management systems are usually table-driven.
o Client Systems — Clients usually supply documentation that describes data.
o SAS Dictionary Tables — SAS metadata contains significant content about “as built” data. Contrast
this with the home-grown metadata, which describes data “as we’d like it to be built.”
These and other sources can be used to seed the initial versions of metadata tables. The coding required to
transform the source into “Version 1” metadata is well worth the effort: metadata development time is
reduced, and analyst resources are freed up.
The Underlying Idea Is Familiar. Metadata is hardly new. It predates SAS software, and has been part
of SAS for years, in the form of Dictionary Tables. Once you understand the Tables’ scope and content,
it’s hard not to get excited about the range of possibilities for their use. The metadata that we have
described here is conceptually similar to the Dictionary Tables in its breadth of application. Together,
these two forms of descriptive data enable creation of robust, extensible applications that would be
difficult, if not impossible, to write in their absence. Just how to make best use of the metadata is
discussed in the next section.
Multiple Tables Are the Norm. Even if an application needs information from what appears to be a
single data source, accessing the information often requires references to multiple tables. Suppose, for
example, an application needs a list of all datasets and variables in a study that are to be exported to the
client’s database. The list would need to be filtered not only using the VARIABLES table, but also the
DATASETS table. That is, the list would consist of eligible variables from eligible datasets. These and
other routine tasks are not particularly difficult to code, but they do suggest the need for an application
layer between the metadata and programs that use it.
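In SQL terms, the “eligible variables from eligible datasets” list might be built with a join such as this sketch (the META libref and the EXPORT filter-field names are illustrative assumptions):

   /* Sketch: variables eligible for export = flagged variables */
   /* within flagged datasets.                                  */
   proc sql;
      create table exportList as
      select v.dataset, v.name
         from meta.variables as v
              inner join
              meta.datasets as d on v.dataset = d.dataset
         where d.export = 'Y' and v.export = 'Y'
         order by v.dataset, v.name;
   quit;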
Providing Easy Access. One doesn’t need a particularly active imagination to envision uses for the metadata.
It is rich in content, fully describes processes in need of automation, and is complex.
Effective use of the metadata requires tools that simplify table access. These are discussed next.
Metadata Tools
The metadata is the engine driving applications throughout the project life cycle. Consider for a moment
how useful an engine is without, say, a car wrapped around it. It might emit a gratifying, throaty roar, but
would ultimately be useless because it couldn’t take you anywhere.
Our experience with metadata design quickly showed that you would not gain efficiency – indeed, you
would become less productive – without applications making access more or less transparent. This section
describes some of the tools Rho has developed toward this end. As with the design of the tables
themselves, the tools were developed on a somewhat ad hoc, “heat of the moment” basis.
Consider the Need
To underscore the need for access tools, consider the programming requirements needed for accessing
metadata that describes a tabular display. The display type and number must be located in the DISPLAY
metadata. We must also gather and correctly sequence the footnotes from the FOOTNOTES table. Finally,
we retrieve standard headers and footers from the GLOBAL table. Once all the pieces are identified, they
need to be presented to the table-writing program in an agreed-upon format (macro variables, datasets,
etc.). To be thorough, we should add checks to ensure data quality: were all the footnote codes specified in
DISPLAY actually present in FOOTNOTES? Do we have complete title text? And so on.

Figure 5: Add Metadata Macros
[Diagram: a layer of metadata macros now sits between the metadata tables (Project, Data, Display, Structure, misc) and the programs. Programs, server macros, and utility macros work through that layer to produce datasets, reports, TFLs, patient profiles, Define files, SDTM/CDISC data, and the NDA .XPT deliverables.]
We could, of course, write code in each table program to perform these actions. More likely, we would
want a tool that would do the work for us, reading the required tables, and creating a set of macro variables
that would make the metadata readily accessible. The process to create macro variables for Table 10.1
should be as simple as:
%getSpecs(type=table, id=10.1)
The macro would perform all the activities described above, and would produce diagnostics that would
quickly give the table programmer an indication of success or failure. The SAS Log would contain
messages showing what was created by %getSpecs, along with what was cautionary or problematic.
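A stripped-down sketch of such a macro follows. The metadata libref, field names, and messages are illustrative; the production version performs far more gathering and checking:

   %macro getSpecs(type=, id=);
      /* Load display-level specifications into macro variables */
      proc sql noprint;
         select title1, title2, fnotes
            into :title1, :title2, :fnoteCodes
            from meta.display
            where type = "&type." and id = "&id.";
      quit;
      %if &sqlobs. = 0 %then
         %put ERROR: [getSpecs] No DISPLAY record for &type. &id.;
      %else
         %put NOTE: [getSpecs] Specifications loaded for &type. &id.;
      /* ... similar steps would gather the FOOTNOTES and GLOBAL */
      /* rows and verify that every footnote code is present ... */
   %mend getSpecs;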
Clearly, a library of tools for TFL generation, LIBNAME assignments, option setting, and the like
comprises the application “body” which surrounds the metadata “engine” described at the beginning of this
section. Clearly, too, we want to allow programmers who want to read the metadata directly to do so. The
Big Picture that has emerged over time (Figure 5, above) is one that has metadata at its core and that is
amenable to different levels of end-user programming effort.
Tool Descriptions
While there is nothing to prevent a programmer from directly utilizing the metadata, it’s far more likely
that the metadata will be effectively used if a “server” layer is interposed between the data and the
application. Table 1, below, shows some of the tools currently being used by developers at Rho. Table
entries correspond to SAS macros that access metadata. In some cases, they are small and tightly-focused,
and are used by other applications (e.g., the ATTRIB statement generator). Other macros are longer,
reference other metadata and utility macros, and are standalone applications (the “define” file generator).

Table 1: Metadata Access Tools, By Deliverable and Metadata Group
(rows are metadata groups; the tools serve the Data, TFL, and NDA deliverables)

Structure — Standardized program startup (used by every deliverable).

Project — Create compound description.

Data — Create ATTRIB and RENAME statements; create dataset and variable lists; verify that the datasets and variables needed for a display are available; document dataset characteristics; generate data-creation specification documents; export data per FDA requirements; create “define” files per FDA requirements; assign variable labels; verify that “define” file links exist in referenced PDFs; check data-metadata consistency (type, length, etc.); verify that data and “define” contents match what is expected, given the metadata.

Display — Combine metadata sources for a given display; create TITLE and FOOTNOTE statements; generate code lists from external sources; create DATA step code fragments for filtering data; display derived-variable ancestry, showing circular or invalid references; build HTML displaying the Log, output links, and other information for each TFL; verify the accuracy and completeness of the metadata.
Examples
Footnotes Revisited
Recall the footnote scenario discussed earlier and illustrated in Figure 2 (above): the Word document
containing the footnote text changed, and each of the 10 programs using the footnote had to be updated to
reflect the new text. Figures 6 and 7, below, demonstrate a metadata-driven
approach to the problem.
Figure 6 presents a greatly simplified version of the tables. The DISPLAY table field FNOTES holds a set
of footnote codes. The first row uses footnote codes F1 and F3. The metadata macro that processes Table
110 will create footnote statements containing the formatted text from the FOOTNOTES table.
In this scenario, using metadata and metadata macros, the change is automatically picked up by the display
programs and no modification to the programs is required. Another important advantage of the metadata
macro is that it brings problematic conditions to the programmer’s attention (here, ID 212’s reference to
non-existent footnote code P2).
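Given tables shaped like those in Figure 6, below, the code-to-text resolution – including the warning for the bad code – might be sketched as follows. Table and field names follow the figure; everything else is illustrative, not Rho's production macro:

   %macro getFnotes(id=);
      /* Turn a display's footnote codes into FOOTNOTE statements, */
      /* flagging any code with no match in FOOTNOTES.             */
      %local codes n i code text;
      proc sql noprint;
         select fnotes into :codes
            from meta.display
            where id = "&id.";
      quit;
      %let n = %sysfunc(countw(%superq(codes), %str(,)));
      %do i = 1 %to &n.;
         %let code = %sysfunc(strip(%scan(%superq(codes), &i., %str(,))));
         %let text = ;
         proc sql noprint;
            select fnotes into :text
               from meta.footnotes
               where upcase(fcode) = "%upcase(&code.)";
         quit;
         %if %length(&text.) %then %do;
            footnote&i. "&text.";
         %end;
         %else %put WARNING: [getFnotes] Code &code. not in FOOTNOTES.;
      %end;
   %mend getFnotes;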
Figure 6: DISPLAY-FOOTNOTES Interaction

   DISPLAY                       FOOTNOTES
   id    type   fnotes           fcode   fnotes
   110   tbl    F1, F3           F1      text for footnote F1
   111   tbl    F1, F3           F3      text for footnote F3
   114   tbl    F1, F3, P3       P1      text for footnote P1
   212   tbl    F8, P2           P3      text for footnote P3

The revised, metadata-aware programs are shown in Figure 7, below. The hard-coded footnote text
we saw in Figure 2 has disappeared. In its place is a single reference to macro variable FOOTNOTES,
created by macro %TFL.

Figure 7: Footnote Text Change, Using Metadata
[Diagram: programs TBL110.SAS through TBL128.SAS call the metadata server/API macros, which draw on the Project, Dataset, Variable, Display, Footnotes, Structure, and misc metadata to produce tbl110.rtf through tbl128.rtf. Each program is reduced to:]

   %inc 'setup.sas' / nosource2;
   %setup(prj=lPharm\drug001);
   %TFL(type=table, id=110);

   data report;
      set derived.ae;
      where &subpop.;
   &titles.
   &footnotes.

[TBL111.SAS is identical except for the call %TFL(type=table, id=111).]
Remove User-Written Formats
Let’s use metadata for another “program makeover.” Some US regulatory agencies such as the FDA
accept SAS datasets as deliverables, but prohibit references to user-written formats.
Figure 8 (below) shows a non-metadata approach. A specification document identifies the datasets
and variables to create, along with characteristics such as data type, length, label, and format. The
programmer could manually create a list of variables with user formats, then remove the format references
using PROC DATASETS. This coding strategy is included at the end of each dataset-creation program.
Figure 9 (below) shows a compact, single-program solution utilizing metadata that comes from SAS
(Dictionary tables) and is home-grown. We read Dictionary table FORMATS to create a list of native SAS
formats, then use metadata tables DATASET and VARIABLES to identify variables with formats not in the
native format list. Once these variables are known for each dataset, it’s a simple matter to create the
requisite PROC DATASETS statements.
Figure 8: Remove User-Written Formats, Pre-Metadata
[Diagram: a Word specification document (dataset specs for AE, Demog, ConMed, PE) drives four dataset-creation programs – AE.SAS, DEMOG.SAS, CONMED.SAS, and PE.SAS – producing ae.sd2, demog.sd2, conmed.sd2, and pe.sd2. Each program ends with the same hand-maintained cleanup step, e.g.:]

   libname raw 'path';
   * ... create dataset AE ... ;
   proc datasets library=raw nolist;
      modify AE;
      format vars_with_user_formats;
   quit;
Figure 9: Remove User-Written Formats, Using Metadata
[Diagram: a single program, removeFmt.sas, combines the SAS dictionary tables with the project metadata (the Dataset and Variable tables, supported by the metadata and utility macros) to clean ae.sd2, demog.sd2, conmed.sd2, and pe.sd2. In outline:]

   libname raw 'path';
   * read the FORMATS dictionary table to build a list of
     SAS System formats;
   * read the VARIABLES metadata - generate, by dataset, a list
     of variables with formats not in the SAS System list
     generated above;
   proc datasets library=raw nolist;
      * for each dataset in the list from VARIABLES: ;
      modify dataset;
         format variable_list_from_dataset;
   quit;
Readers familiar with Dictionary Tables could rightly say that the entire task could be done without any home-grown metadata (the COLUMNS dictionary table contains format information). The main reason we did
not take this approach is that the DATASET and VARIABLES tables provide an added level of quality
control – we can identify variables with user formats and can verify, via one or more filter variables,
whether the variable should be written to the dataset. The filtering information is home-grown value-added
that is not in the SAS metadata. The program in Figure 9 demonstrates how the two forms of metadata are
complementary, rather than independent.
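A runnable approximation of Figure 9’s outline appears below. For the sketch, it assumes that native formats can be identified in DICTIONARY.FORMATS by the absence of a catalog LIBNAME, and that the VARIABLES metadata carries DATASET, NAME, FORMAT, and an EXPORT filter field; all of these names are illustrative:

   proc sql noprint;
      /* 1. Native SAS formats: rows not tied to a format catalog */
      create table nativeFmts as
      select distinct fmtname
         from dictionary.formats
         where fmttype = 'F' and libname is missing;

      /* 2. Exportable variables whose format is not native */
      create table userFmtVars as
      select dataset, name
         from meta.variables
         where export = 'Y' and not missing(format)
           and scan(format, 1, '.') not in
               (select fmtname from nativeFmts)
         order by dataset;
   quit;

   /* 3. Generate the PROC DATASETS code with CALL EXECUTE */
   data _null_;
      set userFmtVars end=done;
      by dataset;
      if _n_ = 1 then call execute('proc datasets library=raw nolist;');
      if first.dataset then call execute(cats('modify ', dataset, '; format'));
      call execute(' ' || strip(name));
      if last.dataset then call execute(';');
      if done then call execute('quit;');
   run;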
Reorder Variables in a Dataset
Rather than using Viewtable or a similar tool to arrange variables while browsing, clients often request that
variables be stored in a specific order. In the absence of metadata, the programmer would need to review a
specification document, then, for each dataset, manually enter the variables in a RETAIN or SQL SELECT
statement. This harkens back to the problems identified at the beginning of this paper: the manual process
is tedious, error-prone, not easily validated, and time-consuming, particularly if the specification changes
more than once. A representation of the pre-metadata solution is shown in Figure 10 (below).
A much cleaner and more compact approach is shown nearly in its entirety in Figure 11 (below).
Recall our earlier comment about metadata continually changing: we added a sequence field (SEQ) to the
variable-level metadata to meet the client’s requirements. Had it not been a request that other clients were
likely to make, we would have opted not to alter the metadata.
With the sequencing information available in the metadata, program REORDER.SAS becomes
straightforward. It is notable for its use of a utility macro, %DISTINCT, to produce the list that drives it.
%DISTINCT reads dataset MDATA and produces macro variables LEVELS and NLEVELS,
containing the names and count of unique values of variable DSET. The macro specifics are less important
than the metadata macro’s effective use of general-purpose utilities: the macro becomes smaller (a single
macro call instead of a dozen or so lines of code) and more reliable (by using validated tools such as
%DISTINCT).
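%DISTINCT itself is small. A minimal version – parameter and output names as described above, with the production version adding parameter checks and Log messages – could be:

   %macro distinct(data=, var=);
      /* Create &LEVELS (unique values of &VAR, space-delimited) */
      /* and &NLEVELS (their count).                             */
      %global levels nLevels;
      proc sql noprint;
         select distinct &var.
            into :levels separated by ' '
            from &data.;
      quit;
      %let nLevels = &sqlobs.;
   %mend distinct;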
Figure 10: Reorder Variables, Pre-Metadata
[Diagram: a Word specification document (dataset specs for AE, Demog, ConMed) drives per-dataset programs – AE.SAS, DEMOG.SAS, CONMED.SAS – rewriting ae.sd2, demog.sd2, and conmed.sd2. Each program carries a hand-entered variable list, e.g.:]

   libname raw 'path';
   data raw.ae;
      retain variable_list;
      set raw.ae;
   run;
Figure 11: Reorder Variables, Using Metadata
[Diagram: a single program, reorder.sas, uses the metadata macros and the variable-level metadata (here, dataset MDATA with fields DSET, VAR, and SEQ) to rewrite ae.sd2, demog.sd2, and conmed.sd2 in the specified order:]

   libname raw 'path';

   %macro reorder;
      %distinct(data=mdata, var=dset)
      %do i = 1 %to &nLevels.;
         %let dsn = %scan(&levels., &i.);
         proc sql noprint;
            select var
               into :vList separated by ' '
               from mdata
               where dset = "&dsn."
               order by seq;
         quit;
         data raw.&dsn.;
            retain &vList.;
            set raw.&dsn.;
         run;
      %end;
   %mend reorder;
   %reorder
Comments on Tools and Usage
Two general points about metadata usage and tool building have been implied throughout the preceding
discussion. We explicitly mention them here to highlight their importance.
Build with Existing Tools
Utility Library. Metadata macros, like any other tools, use a library of general-purpose macros. The
importance of reliable, comprehensive macros cannot be overemphasized. Likewise, the range of tools is
noteworthy.
Some of the macros are purely diagnostic, used only by the tool builder. These include: writing a brief
listing of one or more datasets’ contents to the SAS Log; writing macro variable values to the SAS Log in a
clearly understandable order (contrast with %put _global_;); displaying settings for a user-specified list
of system options; and identifying datasets and macro variables that were created during execution of a
macro (an aid in preventing unwanted artifacts being produced by a macro).
Other utility macros are intended to be used as part of a larger application: uppercase a list of macro
variables; verify one or more variables exist in a dataset; return the number of observations in a dataset;
count the number of tokens in a macro variable; quote the tokens in a macro variable; and so on.
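To give their flavor: some of these utilities are tiny. The sketch below counts the tokens in a macro variable (names and behavior are illustrative, not the production macro):

   %macro tokenCount(text=, result=nTokens);
      /* Count the tokens in &TEXT; return the count in &RESULT */
      %global &result.;
      %let &result. = %sysfunc(countw(%superq(text)));
   %mend tokenCount;

   %tokenCount(text=a b c)
   %put NOTE: &nTokens. tokens found.;   /* NOTE: 3 tokens found. */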
There is, of course, a modest cost to using the utility macros: the flow of program execution gets a bit more
complex, since the macro is called from a metadata or similar high-level tool; the globally available macros
perform parameter checks and other activities that may be redundant or unnecessary, given the needs of the
calling program, thus unnecessarily consuming CPU or other resources.
These drawbacks are almost always negligible and are greatly outweighed by the huge advantage they
offer when constructing applications. Without an observation-counting utility, one would have to code the
following to create macro variable NOBS:
proc sql noprint;
   select nobs into :nobs
      from dictionary.tables
      where libname = "MASTER" and memname = "DEMOG"
            and memtype = "DATA";
quit;
Admittedly, this is not rocket science, and can be easily coded. It is far easier to refer to an autocall macro,
especially if dataset counts have to be taken multiple times in a program:
%obsCount(data=master.demog, count=nobs)
obsCount not only returns the item of interest – macro variable NOBS – but also performs checks that
would not normally be performed while writing in-line code. It also writes messages to the SAS Log that
apprise the user of the macro’s success or failure. It is, in short, a means to easily develop robust, concise
programs.
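A bare-bones %obsCount might look like the sketch below; the message text and the -1 failure convention are illustrative, and the production macro checks considerably more:

   %macro obsCount(data=, count=nobs);
      /* Put the observation count of &DATA into &COUNT */
      %global &count.;
      %let &count. = -1;                     /* -1 signals failure */
      %if %sysfunc(exist(&data.)) %then %do;
         %local dsid rc;
         %let dsid = %sysfunc(open(&data.));
         %let &count. = %sysfunc(attrn(&dsid., nlobs));
         %let rc = %sysfunc(close(&dsid.));
         %put NOTE: [obsCount] &data.: &&&count. observations.;
      %end;
      %else %put ERROR: [obsCount] &data. does not exist.;
   %mend obsCount;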
SAS Dictionary Tables. Home-grown metadata is conceptually similar to putting the SAS Dictionary
Tables to work. It bears mentioning here that some of the most powerful, elegant, and extensible macros in
the entire Rho system are those that use both the SAS metadata (the Dictionary Tables) and the site-specific, home-grown metadata tables.
Usage Levels Vary
It’s unproductive and even counterproductive to force all users to use the full-blown system as described
above. The system as currently architected allows different levels of access and usage. A few of the
possibilities are illustrated in Figure 12 (below). The Figure contains a simplified and abstracted view of
the system. The numbered items represent some possible scenarios of metadata usage.
Figure 12: Metadata/Macro Usage Varies By Task
[Diagram: a simplified view of the system – the metadata tables (Project, Data, Display, Structure, misc) feed the metadata macros, which in turn feed the programs and deliverables. Numbered paths 1 through 5 trace the scenarios described below.]

1. Programs creating deliverables only require STRUCTURE metadata, presumably for allocating
libraries and autocall and format search paths. The metadata macro layer is needed only for
program initialization. The STRUCTURE metadata and %SETUP metadata macro effectively
replace autoexec or hard-coded program startup tasks.
2. The locations of PROJECT, DATA, and DISPLAY metadata are identified by STRUCTURE
metadata. The programs creating deliverables make use of the program initialization macros and
other macros that manipulate and simplify access to the metadata. This is usually the most
effective use of the system because it uses a set of tools (the macros) to provide programmers and
end-users with access to the metadata. It isn’t necessary to see the metadata. One only has to
understand the inputs to and outputs from the “black boxes.”
3. A metadata macro processes one or more metadata tables and creates output based solely on the
display or reformatting of the metadata. The client deliverable could be, for example, dataset
documentation based on merged and formatted DATASETS and VARIABLES tables.
4. No client deliverables are produced in this scenario. Macro inputs and outputs are contained
within the metadata system. A metadata macro could, for example, read CRF or “raw” data and
make a first pass at Domain-level DATASETS and VARIABLES metadata. Another application in
this scenario could be metadata validation, where the macro identifies inconsistencies and
potential problems that were not identified during the manual creation of the metadata.
5. This path through the system uses a global, non project-related table and so does not require
access to the Structure data. Since it does not use the project-specific data around which the
majority of our activity revolves, it is quite literally the “path less traveled.” Macros here display
system-wide settings and perform other, administrative tasks.
Conclusion
By now, the reader should begin to understand the paper’s title. The design of the metadata and tools is
part “fine art.” One has to possess technical expertise to develop the tables and write the programs that
present the metadata to the programmers. The process is also part “black art.” Given that we were often
dealing with entirely new programming and workflow issues, we frequently relied on the intuition that
comes from being “seasoned” professionals.
Key Reasons to Develop Metadata
Uniformity. The fully realized use of metadata, particularly the STRUCTURE table, fosters uniformity
within project-related studies and across projects. When corporate-wide directory structures,
LIBNAMEs, and the like are similar, tools developed for one project can be applied to other projects with
little or no modification.
Cost Reduction. Many of the examples in this paper highlighted the benefits of a single metadata entry
point. If, for example, an analyst modifies a footnote value in metadata, the cost of the change is borne
only once. Compare this to a metadata-free environment where the analyst enters the value in a Word
document, followed by programmers who have to replicate the new text in multiple programs.
Workflow Change. A driving force behind metadata is the abstraction of programs, moving hard-coded
items out of programs and into data stores accessible by SAS. If the metadata interface is friendly
enough, and if tools are in place to trap inconsistencies in the metadata, print it, and the like, then some of
the entry can be offloaded to other, non-technical staff. This frees up analyst time and reduces time to
delivery without sacrificing quality.
Quality Improvements. Specifications stored as data can be checked programmatically. Likewise,
output generated using metadata can be validated with metadata tools. Effective use of metadata enables
faster production of validated deliverables, and is an ideal match for the times when the content and nature
of a deliverable change and the output must be reprogrammed and revalidated.
Metadata Is Inherently Multi-Use. A metadata source table can easily be used for multiple types of
tasks throughout the project life cycle. The variable-level table can, for example, be used for:
o Creating attribute statements when a dataset is being built (e.g., use format, length, type fields in a
metadata macro)
o Building keep and drop lists when exporting a dataset (use filter fields)
o Performing quality checks on data (compare as-built dataset characteristics to those expected by the
metadata)
o Documenting data (create publication-quality documents listing variable attributes and derivation)
Happier Programmers. Repetitive hard-coding of title and footnote text is an unchallenging, error-prone task, especially when a deadline is looming. The metadata and the tools to access it allow
programmers to focus on program functionality rather than what are, in essence, clerical issues.
Metadata Is the Way of the Future. Even if you don’t buy into the idea of metadata, other people do.
For example, the FDA currently accepts Define files, product labels, and safety reports as XML files.
This format will eventually be mandatory. This pharmaceutical example is hardly unique. Given the
inevitability of creating these data-rich, programmatically accessible XML files, it will simply not make
sense to use Word-based specifications and then create XML.
Key Lessons Learned
Design Metadata and Tools. Metadata is, indeed, at the heart of the process changes we have
implemented. Without tools that make its use transparent or, at minimum, “pretty simple,” the tables’
impact would have been blunted. Tools should be in place for end-user access and quality control.
Pay Attention to the User Interface. Even if the idea of metadata is appealing and its potential impact
is huge, it will gain only grudging acceptance if it is difficult to enter. Any metadata that must be
manually entered should have a friendly, intuitive interface that guides entry, and preempts or, at the very
least, catches errors and inconsistencies.
Document Everything. Documentation should describe the contents of the metadata tables and the
macros that provide access to the tables. This improves ease of use, which in turn creates an atmosphere
that gives more people “the metadata religion.”
Be Flexible – Expect Change. Remember that just as organizations evolve, so does the content of the
metadata. The change can be driven internally, by your regulatory environment, or by client needs.
Whatever the source of change, welcome it as an opportunity.
Extensions
Create a Metadata Hierarchy. Some items (display font, point size, margins, system option settings,
and the like) are identical from project to project. A metadata hierarchy for these items could be defined,
storing them at corporate, project, and study levels. The lower levels in the hierarchy could accept or
reset values set at a higher level.
Change Notification. While the approach described here percolates a changed field from metadata to
applications (e.g., the footnote example), it begs the question “just what is the mechanism to trigger the
actual re-creation of the affected project pieces?” Using our footnote example, how would we know to
rerun the 10 table programs once the footnote text was changed? The metadata could be examined at set
intervals or on demand, detecting changes and notifying analysts and programmers of processes that need
to be rerun.
Metadata Catalog. Much of the dataset and display-level metadata is duplicated within project studies
and between projects. The user interface for manually-entered metadata could be enhanced, allowing
population of variable or display data to be done via a pick list, drag and drop, or similar access to a
corporate-wide repository. Seen from this perspective, a collection of standardized metadata from
multiple projects becomes a valuable corporate-wide knowledge base.
Contact
Your comments and questions are welcomed and valued. Contact the authors at:
[email protected]
[email protected]
Acknowledgements
Russ Helms and Ron Helms of Rho, Inc. provided the impetus for metadata use at Rho. They have also
created an environment at Rho that encourages experimentation, rewards success, and is sympathetic
when ideas don’t pan out.
April Sansom provided her typically thorough copyediting services, and is probably wondering why,
with all those years of Catholic school education, the authors punctuate in a way that can only be
described as whimsical.
Jack Shostak, Duke Clinical Research Institute, offered several insightful comments on an early version
of this paper.
SAS and all other SAS Institute product or service names are registered trademarks or trademarks of SAS
Institute in the United States and other countries. ® indicates trademark registration. Other brand and
product names are trademarks of their respective companies.
References
Dictionary table and macro design papers can be found at www.CodeCraftersInc.com.