AL MA TE RI SSIS Solution Architecture TE D Imagine that this is the first day of a new internal or client project. You will have responsibility on the data and processing layer of the solution, which involves processing data — a lot of data — from several sources, and then either integrating systems, or maybe consolidating data for reporting. Maybe your background is a developer, or a database administrator (DBA), or a data designer, and you know SSIS fairly well.
Chapter 1: SSIS Solution Architecture This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges.
Chapter 1: SSIS Solution Architecture Figure 1-1 The problem with Figure 1-1 is that this mess didn’t happen overnight! It grew into this twisted unorganized process because of poor planning, coordination, and execution. However, be aware that, a lot of the time, a corporation’s politics may lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don’t talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.
Chapter 1: SSIS Solution Architecture ❑ Unknown and uncontrollable data processing — The operations that process data are not centralized, and, in many cases, unknown because of department applications that are created without coordination with other applications. Processes run at uncontrolled times, and may impact systems within the processes even during peak times, which affects work efficiency. ❑ Fragile enterprise changes — Changes to applications are difficult and costly.
Chapter 1: SSIS Solution Architecture Figure 1-2 The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, thus making the development, troubleshooting, and support difficult to manage. The “Design” section later in this chapter proposes a better approach for SSIS packages called the modular package approach.
Chapter 1: SSIS Solution Architecture Figure 1-3 shows the command-line output of an example SSIS package execution.
Chapter 1: SSIS Solution Architecture If you were to consider spending time trying to work through this output when trying to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment. If you have just turned on the out-of-the-box SSIS logging and are capturing results to output to a table, it still may not be enough.
Chapter 1: SSIS Solution Architecture The bottom line is that you will most likely have a disk I/O bottleneck in your data-processing operations, and you’ll need to plan and manage for that to meet your service level agreements (SLAs) and performance requirements.
Chapter 1: SSIS Solution Architecture Choosing the Right Tool This book is about applying SSIS. You are probably reading it because you assume that SSIS is the right tool for the job. That’s probably the case. However, be sure to consider what you are doing, and ensure that using SSIS is in line with what you are doing.
Chapter 1: SSIS Solution Architecture Be Careful About Tool Selection In some client environments, an ETL tool may be chosen without consideration for the availability of industry skills, support, or even the learning curve. Even though the tool could perform “magic,” it usually doesn’t come with a pocket magician — just the magic of emptying your corporate wallet. In many cases, thousands of dollars have been spent on an ETL tool that takes too long to master, implement, and support.
Chapter 1: SSIS Solution Architecture The tendency when developing a new integration or ETL system is to get it done as quickly as possible. What often happens is that the overall architecture is not integrated well into an organization’s environment. Maybe some time is saved (and that is even questionable), but in the end, more time and money will be wasted. A solution architecture should have several key data-processing objectives.
Chapter 1: SSIS Solution Architecture Customer Relationship Management Master Data Services Data Integration Staging Area Integration Services Cycle Through Remote Inventory Systems Point of Sale Bi-Directional Integration with Core Corporate Systems ERP/ Accounting External FTP File Handling Integration Data Store Human Resources Demographics Competitive Data Figure 1-5 In this diagram, data from different systems is integrated.
Chapter 1: SSIS Solution Architecture reporting of information to show trending and data summaries, the ETL part of a data warehouse is important and critical. Processing ETL for data warehousing involves extracting data from source systems or files, performing transformation logic on the data (to correlate, cleanse, and consolidate), and then loading a data warehouse environment (for reporting and analysis). Figure 1-6 shows common data-processing architecture for a data warehouse ETL system.
Chapter 1: SSIS Solution Architecture Project Planning and Team Development This is not a project methodology book, but you should give some thought to your solution approach. Whether your overall objective is system integration or warehouse ETL, you should give consideration to using an agile development methodology. An agile methodology is an iterative approach to development. You add features of the solution through smaller development cycles, and refine requirements as the solution progresses.
Chapter 1: SSIS Solution Architecture ❑ Estimate your data volumes in one of the initial development cycles so that you can purchase the right hardware. ❑ Get your server storage ironed out upfront. Be sure to set expectations with the storage group or vendor early on in the process. ❑ Plan out your package storage and deployment plan in one of the initial phases. (Chapter 3 provides an in-depth discussion of this topic).
Chapter 1: SSIS Solution Architecture Table 1-1 Item Description Table or filename This names the file or table and any ongoing naming conventions (such as name variations if different systems are involved, or if files have timestamps included). Source and definition Describes the source system where the data originates, and general data that the file contains.
Chapter 1: SSIS Solution Architecture Table 1-3 provides some details for entity tracking. Table 1-3 Item Description Table name This is the destination table name, schema, and database that the table is used in. Table description Describes the use of the table in the overall entity-relationship diagram (ERD), and what general records and grain are included in it. Keys and grain Lists the primary key and any candidate keys in the table, and the data grain of the table.
Chapter 1: SSIS Solution Architecture Just as a review, this discussion only addresses the tracking of data elements, and is supplementary to the overall solution documentation. You may have other related data documentation, or you may choose to include additional items in your documentation (such as partitioning strategy of the destination table, or other pertinent things about the source data availability or data processing).
Chapter 1: SSIS Solution Architecture Figure 1-7 In this example, a couple of data flows and a few other tasks support the processing. In all, ten tasks are in the control flow, which is a very manageable group. Master Packages The way to still keep your complicated order of data processing (or precedence) is to coordinate the execution of the modular packages through a master package.
Chapter 1: SSIS Solution Architecture the Execute Package Task to tie together the modular child packages so that they execute in the right order. Logging and auditing can be included to help facilitate an overall execution auditing and administrative support. Figure 1-8 shows an example parent package.
Chapter 1: SSIS Solution Architecture Each Execute Package Task ties to a package either stored in the file system, or in the msdb database package store. In many solutions, you will need to execute a set of packages at different times and with different precedence. The master package allows this, and helps implement a rollback and checkpoint system. Chapter 2 provides more coverage of this topic when discussing the building of a package framework.
Chapter 1: SSIS Solution Architecture Development and Test Servers For mission-critical ETL processes, you must have a test environment that mirrors your production hardware. It will be costly, but consider the cost if you deploy a change that you couldn’t test and your server goes down. If your budget is restricted, and your ETL process is not mission-critical to your business, then your test environment can be a scaled-down version of your production servers or a virtual server.
Chapter 1: SSIS Solution Architecture Disk Volumes and Configuration Following are some general principles to follow as you try to estimate your disk volumes: ❑ Limit the use of internal server storage, and especially don’t put your databases on the boot/ system drive (C:). ❑ Go with smaller, faster drives, and more of them, rather than bigger and slower drives (except for backups and archive). You can get a lot more throughput for the drives because you can stripe more drives.
Chapter 1: SSIS Solution Architecture Package Execution Location When you are planning out your SSIS solution architecture, you must consider your package execution strategy. Your objective is to leverage the servers and network bandwidth that can handle the impact load from package execution, but without impacting resources that need primary performance. When it comes to where a package should be executed, there is no absolute answer.
Chapter 1: SSIS Solution Architecture For the Execute SQL Task and Bulk Insert Task, the SQL code or BCP command is executed on the machine that the connection specifies. This is different from the Data Flow Task, which runs on the machine where the package is executed. Package Execution and the Data Flow For your packages that have data flows (which is probably most of your packages), you should understand what happens to the data based on where you execute that package (with the embedded data flow).
Chapter 1: SSIS Solution Architecture The source server will both provide the extracted data and handle the data flow transformation logic, and the destination server will require any data load overhead (such as disk I/O for files, or database inserts or index reorganization). Following are some of the benefits of this approach: ❑ There is decreased impact on the destination server, where potential users are querying.
Chapter 1: SSIS Solution Architecture Following are some of the benefits of executing a package on a destination server: ❑ Limited impact on your source server if it is running other critical tasks ❑ Potential performance benefits for data inserts, especially because the SQL Destination component can now be used ❑ Licensing consolidation if your destination server is also running SQL Server 2008 One drawback of this approach is that it has a heavy impact on the destination server, which may affect use
Chapter 1: SSIS Solution Architecture Extraction Impact Load Impact Source files/data So u r ce Da ta Destination files/data Execution Impact Pa h n atio th ta Da Pat in est D Execution Location Figure 1-13 In this case, the impact on both the source and destination machines is reduced because the SSIS server would handle the data flow transformation logic.
Chapter 1: SSIS Solution Architecture The chapters in this book flow in the natural progression that you may go through when designing your SSIS solution. The next couple chapters provide you with the underlying SSIS support structure for your solution — the storage, deployment, and management framework. Next, as you delve into the data, you will be dealing with source data, whether in files or extracted from a relational database management system (RDBMS) and often requiring a data-cleansing process.
Chapter 1: SSIS Solution Architecture different extraction requirements (such as optimization, incremental or targeted extraction, change data capture, or data staging), and also addresses common challenges from the various source types, 32-bit or 64-bit platform. If someone ever tells you that his or her source data is perfect, he or she is lying.
Chapter 1: SSIS Solution Architecture Advanced ETL: Scripting, High Availability, and Performance The final three chapters of this book deal with the advanced topics of scripting, high availability, and scaling your package. Chapter 10, for example, dives into how to build a scale-out SSIS solution for highavailability solutions. One of the most powerful tools within SSIS is the scripting engine. Often, the out-of-the box tasks and transformations cannot handle a unique situation effectively.