**StudySAS Blog: Mastering Clinical Data Management with SAS** 2024-09-09 14:59:00

Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

When working with large datasets, it’s common to add new columns from another table to an existing table using SQL. However, many programmers run into PROC SQL’s recursive-referencing warning when the table being created is also a source in the same query. This post discusses best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid the pitfall.

1. The Common Approach and Its Pitfall

Here’s a simplified example of a common approach to adding columns via a LEFT JOIN:

PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;

While this approach might seem straightforward, it triggers the warning “CREATE TABLE statement recursively references the target table”, and the log notes a possible data-integrity problem. This happens because main_table is referenced as both the source and the target table of the same query. Furthermore, with large datasets, holding the original table and its replacement at the same time can strain server space.

2. Best Practice 1: Use a Temporary Table

A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:

PROC SQL;
   CREATE TABLE work.temp_table AS 
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

PROC SQL;
   DROP TABLE work.main_table;
   CREATE TABLE work.main_table AS 
   SELECT * FROM work.temp_table;
QUIT;

PROC SQL;
   DROP TABLE work.temp_table;
QUIT;

This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table will be dropped after the operation.
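
Note that the second and third PROC SQL steps above copy every row a second time. If that extra copy is a concern, a metadata-only swap achieves the same result; here is a minimal sketch using PROC DATASETS (covered later on this blog), which deletes the original and renames the temporary table in place:

PROC DATASETS LIBRARY=work NOLIST;
   DELETE main_table;
   CHANGE temp_table=main_table;
RUN; QUIT;

Because CHANGE updates only metadata, no rows are rewritten.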

3. Best Practice 2: Use ALTER TABLE for Adding Columns

If you simply need to add new columns (leaving the existing ones untouched), you can use the ALTER TABLE statement to define them and an UPDATE statement to populate them:

PROC SQL;
   ALTER TABLE WORK.main_table
   ADD newcol1 NUM, 
       newcol2 NUM;

   UPDATE WORK.main_table main
   SET newcol1 = (SELECT a.newcol1 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id),
       newcol2 = (SELECT a.newcol2 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id);
QUIT;

This approach avoids creating a new table altogether and instead modifies the structure of the existing table in place. Keep in mind that the correlated subqueries in the SET clause are evaluated for each row of main_table, so this method can be slow on very large tables.

4. Best Practice 3: Consider the DATA Step MERGE for Large Datasets

For very large datasets, a DATA step MERGE can sometimes be more efficient than PROC SQL. The MERGE statement allows you to combine datasets based on a common key variable, as shown below:

PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;

DATA WORK.main_table;
   MERGE WORK.main_table (IN=in1)
         WORK.addl_data (IN=in2);
   BY id;
   IF in1; /* Keep only records from main_table */
RUN;

While some might find the MERGE approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.

Conclusion

The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE method can be more efficient in certain situations, while the DATA step MERGE is a reliable fallback for large datasets.

By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.

**StudySAS Blog: Mastering Clinical Data Management with SAS** 2024-09-07 21:29:00

Mastering the SDTM Review Process: Comprehensive Insights with Real-World Examples

The process of ensuring compliance with Study Data Tabulation Model (SDTM) standards can be challenging due to the diverse requirements and guidelines that span across multiple sources. These include the SDTM Implementation Guide (SDTMIG), the domain-specific assumptions sections, and the FDA Study Data Technical Conformance Guide. While automated tools like Pinnacle 21 play a critical role in detecting many issues, they have limitations. This article provides an in-depth guide to conducting a thorough SDTM review, enhanced by real-world examples that highlight commonly observed pitfalls and solutions.

1. Understanding the Complexity of SDTM Review

One of the first challenges in SDTM review is recognizing that SDTM requirements are spread across different guidelines and manuals. Each source offers a unique perspective on compliance:

  • SDTMIG domain specifications: Provide detailed variable-level specifications.
  • SDTMIG domain assumptions: Offer clarifications for how variables should be populated.
  • FDA Study Data Technical Conformance Guide: Adds regulatory requirements for submitting SDTM data to health authorities.

Real-World Example: Misinterpreting Domain Assumptions

In a multi-site oncology trial, a programmer misunderstood the domain assumptions for the “Events” domains (such as AE – Adverse Events). The SDTMIG advises that adverse events should be reported based on their actual date of occurrence, but the programmer initially used the visit date, leading to incorrect representation of events.

2. Leveraging Pinnacle 21: What It Catches and What It Misses

Pinnacle 21 is a powerful tool for validating SDTM datasets, but it has limitations:

  • What it catches: Missing mandatory variables, incorrect metadata, and value-level issues (non-conformant values).
  • What it misses: study-specific variables that should have been left unpopulated, and domain-specific assumptions that must be reviewed manually.

Real-World Example: Inapplicable Variables Passing Pinnacle 21

In a dermatology study, the variable ARM (Treatment Arm) was populated for all subjects, including those in an observational cohort. Since observational subjects did not receive a treatment, this variable should have been blank. Pinnacle 21 didn’t flag this, but a manual review revealed the issue.
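
Checks of this kind are easy to script once the gap is known. Here is a hedged sketch of such a manual review query, assuming a hypothetical COHORT variable that flags observational subjects:

proc sql;
    /* Subjects in the observational cohort should not have ARM populated */
    select usubjid, arm
        from sdtm.dm
        where cohort = 'OBSERVATIONAL'   /* hypothetical flag variable */
          and not missing(arm);
quit;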

3. Key Findings in the Review Process

3.1 General Findings

  • Incorrect Population of Date Variables: Properly populating start and end dates (--STDTC, --ENDTC) is challenging.
  • Missing SUPPQUAL Links: Incomplete or incorrect links between parent domains and SUPPQUAL can lead to misinterpretation.

Real-World Example: Incorrect Dates in a Global Trial

In a global cardiology trial, visit start dates were populated inconsistently because of time zone differences between sites in the U.S. and Europe. A manual review of the date variables identified these inconsistencies so they could be corrected.
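
Much of this comes down to deriving --STDTC and --ENDTC in valid ISO 8601 form at the correct precision. A minimal sketch, assuming hypothetical raw numeric date and time variables (visdt, vistm) in a hypothetical raw.sv dataset:

data sv_dates;
    set raw.sv;                                  /* assumed raw visit data */
    length svstdtc $19;
    if not missing(visdt) and not missing(vistm) then
        svstdtc = put(dhms(visdt, 0, 0, vistm), is8601dt.);
    else if not missing(visdt) then
        svstdtc = put(visdt, is8601da.);         /* date-only precision is valid ISO 8601 */
run;

Deriving the value from the collected date rather than the scheduled visit date also avoids the misinterpretation described in Section 1.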

3.2 Domain-Specific Findings

  • Incorrect Usage of Age Units (AGEU): Misuse of AGEU in pediatric studies can lead to incorrect data representation.
  • Inconsistent Use of Controlled Terminology: Discrepancies in controlled terminology like MedDRA or WHO Drug Dictionary can cause significant issues.

Real-World Example: Incorrect AGEU in a Pediatric Study

In a pediatric vaccine trial, the AGEU variable was incorrectly populated with “YEARS” for infants under one year old, when it should have been “MONTHS.” This was not flagged by Pinnacle 21 but was discovered during manual review.
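
A defensive derivation prevents this class of error. A hedged sketch, assuming hypothetical numeric birth and reference dates (brthdt, rfstdt):

data dm_age;
    set dm_raw;                                      /* assumed source data */
    months = intck('month', brthdt, rfstdt, 'c');    /* continuous month count */
    if months < 12 then do;
        age  = months;
        ageu = 'MONTHS';
    end;
    else do;
        age  = int(months / 12);
        ageu = 'YEARS';
    end;
    drop months;
run;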

4. Optimizing the SDTM Review Process

To conduct an effective SDTM review, follow these steps:

  • Review SDTM Specifications Early: Identify potential issues before SDTM datasets are created.
  • Analyze Pinnacle 21 Reports Critically: Don’t rely solely on automated checks—investigate warnings and study-specific variables manually.
  • Manual Domain Review: Ensure assumptions are met and variables are used correctly in specific domains.

5. Conclusion: Building a Holistic SDTM Review Process

By combining early manual review, critical analysis of automated checks, and a detailed review of domain-specific assumptions, programmers can significantly enhance the accuracy and compliance of SDTM datasets. The real-world examples provided highlight how even small errors can lead to significant downstream problems. A holistic SDTM review process not only saves time but also ensures higher data quality and compliance during regulatory submission.

**StudySAS Blog: Mastering Clinical Data Management with SAS** 2024-09-07 14:32:00

Unleashing the Power of PROC DATASETS in SAS

The PROC DATASETS procedure is a versatile and efficient tool within SAS for managing datasets. Often described as the “Swiss Army Knife” of SAS procedures, it allows users to perform a variety of tasks such as renaming, deleting, modifying attributes, appending datasets, and much more, all while consuming fewer system resources compared to traditional data steps. In this article, we’ll explore key use cases, functionality, and examples of PROC DATASETS, illustrating why it should be part of every SAS programmer’s toolkit.

1. Why Use PROC DATASETS?

Unlike procedures like PROC APPEND, PROC CONTENTS, and PROC COPY, which each focus on a specific task, PROC DATASETS integrates the functionality of all of them and more. Using PROC DATASETS avoids the need for multiple procedures and saves both time and system resources, since most of its operations update only the dataset metadata instead of reading and rewriting the entire dataset.

2. Basic Syntax of PROC DATASETS

The basic structure of PROC DATASETS is as follows:

PROC DATASETS LIBRARY=libref;
    statement(s);
RUN; QUIT;

Here, you specify the library containing the datasets you want to manage. Statements such as CHANGE, DELETE, APPEND, MODIFY, and RENAME then follow within the procedure.

3. Use Case 1: Renaming Datasets and Variables

Renaming datasets and variables is a simple yet powerful capability of PROC DATASETS. Here’s an example of how you can rename a dataset:

PROC DATASETS LIBRARY=mylib;
    CHANGE old_data=new_data;
RUN; QUIT;

To rename a variable within a dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    RENAME old_var=new_var;
RUN; QUIT;

4. Use Case 2: Appending Datasets

The APPEND statement is a highly efficient alternative to concatenating with SET in a DATA step, because it reads only the dataset being appended (the DATA= dataset) instead of reading both datasets. If the variable lists of the two datasets differ, add the FORCE option to allow the append to proceed.

PROC DATASETS LIBRARY=mylib;
    APPEND BASE=master_data DATA=new_data;
RUN; QUIT;

5. Use Case 3: Deleting Datasets

Deleting datasets or members within a library is simple with PROC DATASETS. You can delete individual datasets or use the KILL option to remove all members of a library:

PROC DATASETS LIBRARY=mylib;
    DELETE dataset_name;
RUN; QUIT;
PROC DATASETS LIBRARY=mylib KILL; /* KILL deletes every member of the library immediately, without confirmation */
RUN; QUIT;

6. Use Case 4: Modifying Attributes

You can modify variable attributes such as labels, formats, and informats without rewriting the entire dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    LABEL var_name='New Label';
    FORMAT var_name 8.2;
RUN; QUIT;

7. Advanced Operations with PROC DATASETS

7.1. Working with Audit Trails

You can use PROC DATASETS to manage audit trails, which track changes made to datasets. For instance, the following code creates an audit trail for a dataset:

PROC DATASETS LIBRARY=mylib;
    AUDIT dataset_name;
    INITIATE;
RUN; QUIT;
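
Once initiated, the audit file records subsequent changes to the dataset. One way to inspect it (a small sketch; TYPE=AUDIT is the dataset option that reads the audit file, which includes audit variables such as _ATOPCODE_):

PROC PRINT DATA=mylib.dataset_name(TYPE=AUDIT);
RUN;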

7.2. Managing Indexes

Indexes help retrieve subsets of data efficiently. You can create or delete indexes with PROC DATASETS:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    INDEX CREATE var_name;
RUN; QUIT;
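
As mentioned above, indexes can be deleted as well; the INDEX DELETE statement names the index to drop (for a simple index, this is the variable name):

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    INDEX DELETE var_name;
RUN; QUIT;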

7.3. Cascading File Renaming with the AGE Command

Another useful feature is the AGE statement, which performs a cascading rename of a set of files: the first member takes the second member’s name, the second takes the third’s, and so on, while the last member in the list is deleted:

PROC DATASETS LIBRARY=mylib;
    AGE file1-file5; /* file1 becomes file2, ..., file4 becomes file5; the old file5 is deleted */
RUN; QUIT;

8. Conclusion

PROC DATASETS is an indispensable tool for SAS programmers. Its efficiency and versatility allow you to manage datasets with ease, from renaming and deleting to appending and modifying attributes. By leveraging its full potential, you can streamline your SAS workflows and significantly reduce processing times.

**StudySAS Blog: Mastering Clinical Data Management with SAS** 2024-09-06 23:39:00

Advanced SDTM Programming Techniques for SAS Programmers

As an experienced SAS programmer working with the Study Data Tabulation Model (SDTM), it’s crucial to stay updated with the latest programming techniques. Whether you’re tasked with building SDTM domains from scratch or optimizing existing code, there are several advanced concepts that can improve your workflows and output. In this post, we’ll explore some techniques that can help you overcome common SDTM challenges and boost efficiency in handling clinical trial data.

1. Efficient Handling of Large Datasets

When dealing with large clinical datasets, speed and efficiency are key. One method to optimize SDTM domain generation is to reduce the data footprint by eliminating unnecessary variables and duplicative observations. Consider the following approaches:

Removing Duplicate Observations

Duplicate records can slow down the processing of datasets and cause inaccuracies in reporting. To remove duplicates, you can use PROC SQL, a DATA step, or PROC SORT. Here’s a quick example using PROC SORT:

proc sort data=mydata nodupkey;
    by usubjid visitnum;
run;

This example ensures that only unique records for each usubjid and visitnum combination are retained, eliminating potential redundancy in your dataset.
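
If you also want to review what was removed, PROC SORT’s DUPOUT= option writes the discarded duplicates to a separate dataset:

proc sort data=mydata nodupkey dupout=dup_records;
    by usubjid visitnum;
run;

Inspecting dup_records before deleting it is a cheap audit step when cleaning clinical data.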

2. Mastering Macro Variables for Flexibility

Utilizing macro variables efficiently can significantly improve your code’s flexibility, especially when working across multiple SDTM domains. Macro variables allow you to automate repetitive tasks and reduce the risk of human error. Here’s an example using macro variables to generate domain-specific reports:

%macro create_sdtm_report(domain);
    data &domain._report;
        set sdtm.&domain.;
        /* Apply domain-specific transformations */
    run;
%mend;

%create_sdtm_report(DM);
%create_sdtm_report(LB);

In this case, the macro dynamically generates SDTM reports for any domain by passing the domain name as a parameter, minimizing manual interventions.

3. Managing Demographics with the DC Domain

The Demographics as Collected (DC) domain often presents unique challenges, particularly when distinguishing it from the standard Demographics (DM) domain. While DM represents standardized data, DC focuses on the raw, collected demographic details. Here’s an approach to manage these domains efficiently:

data dc_domain;
    set raw_data;
    /* Capture specific collected demographic data */
    where not missing(collected_age) and not missing(collected_gender);
run;

In this case, the code filters out any missing collected data to ensure the DC domain contains only records with complete demographic information.

4. Debugging SDTM Code with PUTLOG

Efficient debugging is crucial, especially when dealing with complex SDTM transformations. The PUTLOG statement in SAS is a simple yet powerful tool for tracking errors and debugging data issues.

data check;
    set sdtm.dm;
    if missing(usubjid) then putlog "ERROR: Missing USUBJID at obs " _n_;
run;

In this example, the PUTLOG statement flags records where the USUBJID variable is missing, making it easier to spot and address data quality issues during SDTM creation.

5. Advanced Array Processing for Repeated Measures

In certain domains like Vital Signs (VS) or Lab (LB), handling repeated measures for patients across multiple visits is common. Using arrays can help streamline this process. Here’s a basic example of using an array to process repeated lab measurements:

data lab_repeated;
    set sdtm.lb;
    /* hypothetical transposed numeric result variables, for illustration */
    array lb_vals{3} lbstresn1-lbstresn3;
    do i = 1 to dim(lb_vals);
        if missing(lb_vals{i}) then
            putlog "WARNING: Missing lab result at obs " _n_;
    end;
    drop i;
run;

This code uses an array to loop through repeated lab test results, ensuring that missing values are flagged for review. Such array-based techniques are essential when processing large, multidimensional datasets in SDTM programming.

6. Best Practices for CDISC Validation and Compliance

To ensure SDTM datasets are CDISC compliant, it’s vital to validate them with tools like Pinnacle 21 (formerly OpenCDISC). These tools check compliance against SDTM standards, flagging inconsistencies and conformance issues.

Make sure to incorporate validation steps into your workflows regularly. This can be done by running the validation tool after each major dataset creation and by including clear annotations in your programming code to ensure traceability for audits.
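
Because automated validators usually run late in the workflow, a lightweight in-SAS pre-check can surface obvious controlled-terminology violations earlier. A minimal sketch (the AGEU term list below follows CDISC controlled terminology; substitute your study’s codelist):

proc sql;
    title 'AGEU values outside the expected codelist';
    select usubjid, ageu
        from sdtm.dm
        where not missing(ageu)
          and ageu not in ('YEARS', 'MONTHS', 'WEEKS', 'DAYS', 'HOURS');
quit;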

Conclusion

Advanced SDTM programming requires a mix of technical expertise and strategic thinking. Whether you are optimizing large datasets, automating repetitive tasks with macros, or ensuring CDISC compliance, staying updated with these advanced techniques will enhance your efficiency and ensure high-quality deliverables in clinical trials. Remember, SDTM programming is not just about writing code—it’s about delivering accurate, compliant, and reliable data for critical decision-making.

For more tips and tutorials, check out other PharmaSUG resources or SAS support!