BaseSpace User Guide
Supporting the NextSeq, MiSeq, and HiSeq Sequencing Systems

FOR RESEARCH USE ONLY

Introduction 3
How Do I Start 8
BaseSpace User Interface 13
How To Use BaseSpace 25
Workflow Reference 55
Data Reference 61
Technical Assistance

ILLUMINA PROPRIETARY
Part # 15050652 Rev. A
January 2014

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).

FOR RESEARCH USE ONLY

© 2011-2014 Illumina, Inc. All rights reserved.
Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iScan, iSelect, MiSeq, MiSeqDx, Nextera, NextSeq, NuPCR, SeqMonitor, Solexa, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks of Illumina, Inc. in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Introduction

BaseSpace is a genomics analysis platform that is directly integrated into the NextSeq, MiSeq, and HiSeq sequencing platforms. When setting up runs on your sequencing instrument, you have the option to send the run to BaseSpace. This sends the base call (*.bcl) files and associated files to your dedicated space in the cloud. On the HiSeq, you can also choose Run Monitoring Only, which sends only the files needed for remote monitoring of the run to BaseSpace.

NOTE: This user guide supports data analysis for the NextSeq, MiSeq, and HiSeq sequencing systems, and contains information about the Prep tab, which is used to set up a NextSeq sequencing run. This user guide is specific to BaseSpace running in the cloud, and is not intended for the on-premise implementation, BaseSpace OnSite.

The instrument seamlessly pushes the data to BaseSpace for automatic analysis and storage, with the option of retaining data for local analysis and hosting. There is no need for a manual and time-consuming data-transfer step: the data is already in the cloud, for you and your collaborators to access anywhere, anytime. BaseSpace can automatically run analysis jobs using the Illumina MiSeq workflow apps. BaseSpace also allows you to use third-party apps to analyze your data.
In addition, BaseSpace provides a mechanism to share data with others and easily scale storage and computing needs. For more information about BaseSpace, see the BaseSpace Data Sheet.

Workflow Model

Prep Run on NextSeq

BaseSpace enables you to prep runs for NextSeq sequencing. The prep workflow in BaseSpace consists of four steps:
• Biological Samples: Define the samples that are going to be sequenced.
• Libraries: Define the libraries, which consist of biological samples that are prepped and contain adapters. Each library usually derives from a single biological sample, though biological samples can be used in multiple libraries.
• Pools: Group libraries into pools that share analysis parameters. Pools can consist of one or multiple libraries.
• Planned Runs: Define the run parameters for a pool, then send the planned run to the NextSeq. You can now start the run from the instrument.

Figure 1 Prep Workflow

Data Processing

Processing a flow cell on a sequencing instrument produces a variety of files, collectively referred to as a run. A run contains log files, instrument health data, run metrics, the sample sheet, and base call information (*.bcl files) that is demultiplexed in BaseSpace to create the samples used in secondary analysis.

Samples are analyzed automatically using the Illumina workflow apps as specified in the sample sheet, or by launching custom BaseSpace apps. BaseSpace apps, many of which are written by third-party vendors, are processing software and routines that interact with BaseSpace data through the API. User-level authentication and in-flight data encryption are enforced for every app that requests access to BaseSpace data.

AppResults store the files that are generated by Illumina workflow apps or BaseSpace apps. For example, when a resequencing app executes alignment and variant calling, an AppResult is created for each sample.
An AppResult generally contains BAM and VCF files, but it can also contain other file types. AppResults can also be used as inputs to apps. An app session is created each time an app is launched, to record the launch. Finally, projects are simple containers that store samples and AppResults.

Figure 2 BaseSpace Data Model

BaseSpace Security Model

Data security is a key concern in making the decision to move to cloud-based genomic storage and analysis. Illumina's BaseSpace is hosted on Amazon Web Services (AWS) and combines Amazon’s comprehensive and well-tested approach to platform security with Illumina’s own security testing and procedures, which include reviews and tests by independent security professionals. This provides a cloud genomics solution that meets or exceeds the security provided by many institutional IT infrastructures.

Amazon Web Services

Illumina has chosen to work with AWS as the leader in cloud-based infrastructure; AWS hosts customer-facing services and critical operations for both private industry and U.S. government departments including Treasury, DOE, and State. Amazon’s own security processes and standards are publicly available for review. AWS standards and accreditation include:
• SOC 1/SSAE 16/ISAE 3402 (auditing)
• FISMA moderate (U.S. Federal Government; for reference, the NIH’s own data centers are rated FISMA moderate)
• PCI DSS Level 1 (electronic payments)
• ISO 27001 (international security standard)
• FIPS 140-2 (encryption)

Additionally, AWS data centers are protected by security staff and controlled access procedures. Staff with system access undergo background checks, and all hardware is located behind firewalls that are configured by default to block all traffic. Operating system security patches are automatically applied to AWS servers, including BaseSpace servers.
AWS actively monitors its firewalls to check for vulnerabilities, a service beyond the resources of most institutions. BaseSpace encrypts all data, something else that is rarely done in the institutional IT setting.

BaseSpace Data Stream Software

Illumina sequencing instruments have on-board control and workflow software. This includes a robust data-streaming component that acts as a software broker with the BaseSpace API, allowing individual base call (*.bcl) files to be sent over an encrypted connection, verified, and assembled into samples for analysis in real time as the sequencing run is conducted. Real-time monitoring of data generated by one sequencing instrument or a federation of instruments is possible through the BaseSpace interface.

The instrument control software does not allow publicly addressable inbound communications. All communication is made through standard HTTPS requests initiated by the user at the instrument. Each data-upload transaction is linked to an authenticated user account.

BaseSpace Apps

There are two different types of apps in BaseSpace:
• Sample sheet driven workflow apps (for MiSeq only): these Illumina workflow apps are specified in the sample sheet and are started automatically by BaseSpace. They primarily perform secondary analysis or simple file manipulations, and consist of the following:
  - Resequencing
  - Amplicon Analysis
  - Library QC
  - SmallRNA
  - Metagenomics
  - De Novo Assembly
  - Generate FASTQ
  See the Workflow Reference on page 55 for descriptions of the workflow apps.
• Custom BaseSpace apps: these apps need to be actively launched to analyze data. In general, these apps perform tertiary analysis, visualization, or annotation of data. Most of them are created by third-party vendors, although Illumina has created a few as well. There may be additional costs associated with running a BaseSpace app from a third-party vendor.
  BaseSpace apps may require the AppResults from an Illumina workflow app as input.

BaseSpace API

Sequencing data has traditionally been stored in non-centralized locations, which offer little uniformity of data management across locations. BaseSpace transforms data management by creating an environment with large stores of sequencing data, which can be easily accessed and analyzed online with a store of applications.

Third-party vendors can develop their own apps for BaseSpace. BaseSpace offers the following benefits as a development platform:
• Easy access to data: sequencing data is automatically uploaded from the instrument to BaseSpace.
• Consistent data retrieval: with a few lines of code, access data with the BaseSpace API and SDKs.
• Write once, execute often: once an app is written and published, all users can launch it.
• Flexible billing: apps can bill customers as little or as much as they wish.
• Highly scalable: data storage and analysis scale because BaseSpace is built on Amazon Web Services (AWS).
• Flexible app hosting: apps can be hosted on any website, desktop or mobile application, or inside BaseSpace as a Native App.
• Easy sharing: users can easily share their data and results.

Developers can create apps using the BaseSpace Application Programming Interface (API) or the Software Development Kits (SDKs) available for Java, R, Ruby, and Python. Both approaches offer safe and easy access to BaseSpace data for apps to analyze, visualize, monitor, and so on.

Apps access data using the BaseSpace RESTful API. The API can be accessed via simple HTTPS requests using any programming language and is organized to allow you to get to the data you need quickly.

The SDKs are available for several programming languages and make it even easier for developers to write applications or to integrate existing ones. The SDKs work by exposing what the API has to offer natively, without the developer needing to worry about building their own HTTPS requests.
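As a rough illustration of what "simple HTTPS requests" can look like, the Python sketch below builds (but does not send) an authenticated GET request. The base URL, the `v1pre3` version segment, the `users/current/projects` resource path, the `Limit` parameter, and the `x-access-token` header are all assumptions made for this example and are not taken from this guide; the BaseSpace API documentation is the authority on the actual endpoints and authentication scheme.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Assumed base URL, for illustration only; check the BaseSpace API documentation.
API_BASE = "https://api.basespace.illumina.com/v1pre3/"


def build_request(resource, access_token, **params):
    """Return an unsent GET request for a resource path.

    `resource` is a relative path such as 'users/current/projects'
    (a hypothetical example path).
    """
    url = urljoin(API_BASE, resource.lstrip("/"))
    if params:
        # Sort the parameters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    # Assumption: the access token travels in a request header.
    return Request(url, headers={"x-access-token": access_token}, method="GET")


if __name__ == "__main__":
    req = build_request("users/current/projects", "MY_TOKEN", Limit=10)
    print(req.get_method(), req.full_url)
    # Sending it would be urllib.request.urlopen(req) and needs a valid token.
```

Keeping request construction separate from sending makes the sketch easy to inspect without network access or credentials; an SDK would hide both steps behind its own objects.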
The SDKs thereby allow rapid development and integration with BaseSpace data, and provide a simple mechanism for discovering what the API has to offer. For more information, see the BaseSpace API documentation.

How Do I Start

You can reach BaseSpace from https://basespace.illumina.com/home/index. This section discusses the different ways to get started with BaseSpace. Use your MyIllumina account to log on; the first time you visit you will be asked to accept the BaseSpace agreement. After that, you are ready to run BaseSpace.
• Exploring BaseSpace on page 8
• Not Uploading Yet on page 8
• Using BaseSpace with MiSeq on page 9
• Using BaseSpace with HiSeq on page 10
• Using BaseSpace with NextSeq on page 11
• Getting Shared Data on page 12

Exploring BaseSpace

If you just want to explore BaseSpace, go to https://basespace.illumina.com/home/index and click the Get Started link. You will need to register to set up a new account. Fill out the form, and indicate whether you want access to product and support resources. A confirmation email is sent to the email address you entered. Open that email, and click the confirmation link.

Now you are ready to start testing BaseSpace with the test data we uploaded for you. For information about how to run certain tasks, see How To Use BaseSpace on page 25.

Once you feel comfortable with BaseSpace, and you have a BaseSpace-equipped HiSeq or MiSeq, you can start uploading data and running analyses from your sequencing instrument. Raw data from the run is also stored on the instrument, or in the location of the output folder that you specified in Run Options.

Not Uploading Yet

If you have a sequencing instrument but you are not uploading data yet, you can start by exploring BaseSpace. Go to basespace.illumina.com and log on; there are two ways to do that:
• Use your MyIllumina account to log on; the first time you visit you will be asked to accept the BaseSpace agreement.
  After that, you are ready to run BaseSpace.
• Explore BaseSpace by taking a test drive; you must register to set up a new account. See Exploring BaseSpace on page 8.

Now you are ready to start testing BaseSpace with the test data we uploaded for you. For information about how to run tasks, see How To Use BaseSpace on page 25.

Once you feel comfortable with BaseSpace, you can start uploading data and running analyses from your sequencing instrument. Raw data from the run is also stored on the instrument, or in the location of the output folder that you specified in Run Options.

Alternatively, you can elect to upload only health data to BaseSpace. Health data helps Illumina improve the sequencing instruments and BaseSpace; for more information, see Health Runs on page 87.

See the MiSeq System User Guide, NextSeq System User Guide, or HiSeq User Guide for instructions for setting up your sequencing instrument.

Using BaseSpace with MiSeq

BaseSpace is Illumina's analysis cloud environment. Using BaseSpace to store and analyze your run data provides the following benefits:
• Eliminates the need for onsite storage and computing
• Enables web-based data management and analysis
• Provides tools for global collaboration and sharing

This section discusses the different ways to get started with BaseSpace when uploading data and running analysis from the MiSeq.

You can reach BaseSpace by going to basespace.illumina.com. Use your MyIllumina account to log on; the first time you visit you will be asked to accept the BaseSpace agreement. After that, you are ready to run BaseSpace.

When you set up the run on the MiSeq, select the option to log in to BaseSpace. If you have a problem with the data upload between MiSeq and BaseSpace, see MiSeq Connection on page 9.

NOTE: Raw data from the run is also stored on the instrument, or in the location of the output folder that you specified in Run Options.
BaseSpace automatically disconnects from the MiSeq at the end of the run or as soon as all primary analysis files have finished uploading. If the internet connection is interrupted, analysis files continue uploading once the connection is restored, starting from the point at which the interruption occurred.

As soon as the last base call file is uploaded to BaseSpace, secondary analysis of your data begins. BaseSpace supports the same analysis workflows as on-instrument analysis using MiSeq Reporter. For information about how to run tasks, see How To Use BaseSpace on page 25.

MiSeq Connection

If the MiSeq data is not uploaded to BaseSpace, check the following things.
1. Make sure you have an internet connection from the MiSeq.
2. When setting up runs on the MiSeq, you have the option to log in to BaseSpace, and use BaseSpace for storage and analysis. Make sure that option is checked.

When you begin your sequencing run on the MiSeq, the BaseSpace icon changes to indicate that the MiSeq is connected to BaseSpace and data files are being transferred to your secure location.

Figure 3 Connected to BaseSpace Icon on the MiSeq

For more information, see the MiSeq System User Guide.

Once your sequencing instrument is uploading data, go to Using BaseSpace with MiSeq on page 9, Using BaseSpace with NextSeq on page 11, or Using BaseSpace with HiSeq on page 10 to get started with the analysis of your run.

Using BaseSpace with HiSeq

BaseSpace Connectivity: The HiSeq features an option to send instrument health and sequencing data to BaseSpace in real time to streamline both instrument quality control and analysis. Real-time monitoring of runs enables fast troubleshooting. BaseSpace facilitates collaboration by enabling you to share results instantly with anyone anywhere in the world.
Free alignment and variant calling and the soon-to-be-launched BaseSpace app store provide many easy-to-use workflows that tailor analysis for diverse biological applications.

BaseSpace is Illumina's analysis cloud environment. Using BaseSpace to store and analyze your run data provides the following benefits:
• Eliminates the need for onsite storage and computing
• Enables web-based data management and analysis
• Provides tools for global collaboration and sharing

This section discusses the different ways to get started with BaseSpace when uploading data and running analysis from the HiSeq.

You can reach BaseSpace by going to basespace.illumina.com. Use your MyIllumina account to log on; the first time you visit you will be asked to accept the BaseSpace agreement. After that, you are ready to run BaseSpace.

When you set up the run on the HiSeq, select the option to log in to BaseSpace. If you have a problem with the data upload between HiSeq and BaseSpace, see HiSeq Connection on page 10.

NOTE: Raw data from the run is also stored on the instrument, or in the location of the output folder that you specified in the Storage screen.

BaseSpace automatically disconnects from the HiSeq at the end of the run or as soon as all primary analysis files have finished uploading. If the internet connection is interrupted, analysis files continue uploading once the connection is restored, starting from the point at which the interruption occurred.

As soon as the last base call file is uploaded to BaseSpace, secondary analysis of your data begins. For information about how to run tasks, see How To Use BaseSpace on page 25.

HiSeq Connection

If the HiSeq data is not uploaded to BaseSpace, check the following things.
1. Make sure you have a stable internet connection of at least 10 Mbps from the HiSeq.
2. The Storage screen during run configuration on the HiSeq enables you to define where your run data will be output and stored.
   Select the options below:
   • Connect to BaseSpace: When you select this option you will be prompted to enter your MyIllumina account information. Zip BCL files (below) will be selected by default. Illumina recommends that you also save files locally. To save files locally, select Save to an output folder and enter a path, usually to a local network folder.
   • Storage and analysis: This option enables the HiSeq to send run data as well as system health information to BaseSpace.

   Figure 4 Storage Screen

3. If BaseSpace is not available, open Windows Services and start or restart Illumina BaseSpace Broker:
   a. Click the Windows Start button.
   b. Right-click Computer, and select Manage.
   c. On the left, under Services and Applications, choose Services.
   d. Scroll down the list to find Illumina BaseSpace Broker.
   e. Right-click Illumina BaseSpace Broker and do one of the following:
      - Click Start if this option is not greyed out.
      - If the Start option is greyed out, click Restart.
      The service will start, or will close and then restart.
   f. Close the Computer Management window.

NOTE: To use BaseSpace, you must load a sample sheet at the start of your run. For more information, see the HiSeq User Guide.

When you begin your sequencing run on the HiSeq, the BaseSpace icon changes to indicate that the HiSeq is connected to BaseSpace and data files are being transferred to your secure location.

Using BaseSpace with NextSeq

BaseSpace is Illumina's analysis cloud environment. BaseSpace facilitates your experiments on the NextSeq system in two different ways:
• BaseSpace helps to organize your samples and experiments, and preps runs for NextSeq.
• BaseSpace stores and analyzes your run data, providing the following benefits:
  - Eliminates the need for onsite storage and computing
  - Enables web-based data management and analysis
  - Provides tools for global collaboration and sharing

If your NextSeq sequencing system and BaseSpace do not connect properly, check the following:
• Make sure you have a stable internet connection of at least 10 Mbps from the NextSeq.
• From the Manage Instrument screen, select System Configuration to access a series of screens that configure the connection to BaseSpace.
• Log in to BaseSpace when setting up the run on the NextSeq sequencing system.

Getting Shared Data

If you receive a link to shared data in BaseSpace, click the link. Use your MyIllumina account to log on; the first time you visit you will be asked to accept the BaseSpace agreement. After that, you are ready to run BaseSpace. You may need to set up a new account. Fill out the form, and indicate whether you want access to product and support resources.

Alternatively, log on to your BaseSpace account. If someone shared data with you, you should see a notification stating so. The shared data will show up in your project list.

Now you can use the BaseSpace tools to look at and download the data. For information about how to run tasks, see How To Use BaseSpace on page 25.

NOTE: The owner of the data can disable the sharing feature at any time.

BaseSpace User Interface

The BaseSpace user interface (UI) has six tabs that allow you to access and use your data. In addition, there are a number of common interface elements that enable general tasks. This section describes the various aspects of the BaseSpace UI.
• Common Elements on page 13
• Dashboard Tab on page 15
• Prep Tab on page 17
• Runs Tab on page 20
• Projects Tab on page 21
• Apps Tab on page 24
• Public Data Tab on page 24

Common Elements

There are a number of common UI elements that are shared between all BaseSpace pages:
• Toolbar
• Contact us button
• Bottom links

These elements enable general tasks, and are explained below.

Toolbar

The BaseSpace toolbar elements are listed in the table below.

Element          Description
Dashboard Tab    See Dashboard Tab on page 15.
Runs Tab         See Runs Tab on page 20.
Projects Tab     See Projects Tab on page 21.
Prep Tab         See Prep Tab on page 17. This tab is used to set up a NextSeq run.
Apps Tab         See Apps Tab on page 24.
Public Data Tab  See Public Data Tab on page 24.
Support Page     The BaseSpace Support page provides access to the BaseSpace Knowledge Base, User Guide, and Illumina Technical Support.
Search           The Search box allows you to find runs, projects, or samples. For more information, see Search for Runs, Projects, and Samples on page 53.
Account          The Account dropdown list provides access to:
                 • iCredits. See Access Your Wallet on page 51.
                 • MyAccount. See MyAccount on page 15.
                 • MyIllumina Dashboard.
                 • FAQ: leads to a number of frequently asked questions and Illumina-provided answers.
                 • Terms: leads to the User Agreement.
                 • Blog: leads to the blog. Check this out for the latest news, developments, and updates.
                 • Sign out.

Contact Us Button

The Contact us button opens a new screen that allows you to:
• Browse the knowledge base
• Provide feedback for Illumina, suggest ideas to the user community, or browse, read, and vote for other people's ideas
• Contact Support

Figure 5 Knowledge base, feedback, and contact screen

Bottom Links

The bottom links provide access to more information:
• Help: online help.
• FAQ: leads to a number of frequently asked questions and Illumina-provided answers.
• Developers: leads to the developers portal, set up to help you create custom apps.
• Terms: leads to the User Agreement.
• Blog: leads to the blog (blog.basespace.illumina.com). Check this out for the latest news, updates, and developments, and subscribe to updates.

MyAccount

MyAccount provides access to the Settings, Wallet, Purchase History, Transfer History, and Genomes pages.

Settings
On the Settings page you can edit your notification settings, edit your profile, or update your profile picture.

Wallet
The Wallet page allows you to manage iCredits and credit cards. See Access Your Wallet on page 51 for more information.

Purchase History
The Purchase History page contains detailed information about purchases, adjustments, and balance for your account. See View Purchase History on page 53 for more information.

Transfer History
The Transfer History page allows you to review projects or runs that have been transferred. See Transfer Ownership on page 51 for more information.

Genomes
The Genomes page lists the genomes that are associated with your BaseSpace account.

Dashboard Tab

After login, the first tab you see is the dashboard. The dashboard provides access to notifications, your latest runs, projects, and app results. The dashboard is always accessible in BaseSpace from the top ribbon selector.

NOTE: If a run or project is not showing in BaseSpace, your data may not have been sent to BaseSpace. You need to set the BaseSpace option on your sequencing instrument; see the instrument's user guide.

Notifications

Notifications are displayed here, most recent first.
There are multiple types of notifications:
• Runs
  - Run in progress
  - Run completed
  - Run error
• Collaborators
  - Collaborator joined a project/run of which you're a member
  - Collaborator invited you to a project/run
  - (optionally) collaborator may have included a personal message
  - Collaborator recommended an App
  - Collaborator accepted your offer to transfer ownership
  - Collaborator offered to transfer ownership to you
• Analyses by you
  - Analysis in progress
  - Analysis completed
  - Analysis error
• Analyses by collaborators
  - Analysis in progress
  - Analysis completed
  - Analysis error
• Uploads, additions, or deletions to/from a project of which you're a member
  - By you
  - By a collaborator
• Messages from Illumina
  - New Demo data set
  - Announcement of a new feature

Runs Pane

The bottom left pane of the BaseSpace dashboard shows the three most recent runs; the list is updated automatically. Clicking on the Runs pane opens the Runs tab. Clicking on a run opens the Runs tab with the run loaded. For more information, see Runs Tab on page 20.

Projects Pane

The bottom middle pane of the BaseSpace dashboard shows the three most recent projects. The folder icon indicates the sharing status of the project: if it displays several people, the project is shared. Clicking on the Projects pane opens the Projects tab. Clicking on a project opens the Projects tab with the project loaded. For more information, see Projects Tab on page 21.

App Results Pane

The bottom right pane of the BaseSpace dashboard shows the most recent app results. Clicking on an app result shows, in the Projects tab, the charts relevant to the app that was used. For more information, see App Results Page on page 23.

Prep Tab

The Prep tab enables you to set up a sequencing run on the NextSeq sequencing system. This tab is currently only available for NextSeq sequencing systems. Other sequencing instruments use a sample sheet to provide sample information to BaseSpace.
The Prep tab sets up a run in four steps:
• Biological Samples: Contains information about the samples that are going to be sequenced. See Biological Samples on page 17.
• Libraries: Consists of biological samples that are prepped and contain adapters. Each library usually derives from a single biological sample, though biological samples can be used in multiple libraries. See Libraries on page 18.
• Pools: Consists of groups of libraries that share analysis parameters. Pools can consist of one or multiple libraries. See Pools on page 18.
• Planned Runs: Contains pools that run with the same analysis parameters, on the same machine, at the same time. Planned runs can consist of one or multiple pools. See Planned Runs on page 19.

Biological Samples

When you click on the Biological Samples tab you see the Biological Samples list, which shows all available samples you have created on your account.

Figure 6 Biological Samples List

If you want information about the samples, you can do the following:
• Sort the list by clicking on the column headers.
• Click on a sample to go to the sample page.

This page provides the following actions to prepare your analysis:
• Create a new sample.
• Import new samples.
• Select a sample and edit its properties.
• Select one or more samples and continue with Prep Libraries.

NOTE: You can select multiple samples by using one of the following methods:
• Select multiple checkboxes.
• Click anywhere on a sample row while holding the Ctrl key to add to a selection.
• Click anywhere on a sample row while holding the Shift key to select all samples in between.
• Click the checkbox next to Plate/Tube ID to select all samples on the current page.

The box next to the Biological Samples header keeps track of the total number of samples, and how many are selected. Click X next to the selection count to clear the current selection.
For more information about these actions, see Create New Biological Samples on page 36, Import Biological Samples on page 37, and Use Existing Biological Samples on page 38.

Libraries

When you click on the Libraries tab you see the Libraries list, which shows all available plates or tubes with libraries you have created on your account. You can sort the list by clicking on the column headers, or click on a plate to see its properties and associated libraries.

Figure 7 Libraries List

This page provides the following actions to prepare your analysis:
• Click on a plate, then click the Edit button to edit its properties or libraries.
• Select one or more plates or tubes and move to Pool Libraries.

NOTE: If you want to select multiple libraries:
• Select multiple checkboxes.
• Click anywhere on a library row while holding the Ctrl key to add to a selection.
• Click anywhere on a library row while holding the Shift key to select all libraries in between.
• Click the checkbox next to Plate/Tube ID to select all samples on the current page.

The box next to the Libraries header keeps track of the total number of libraries, and how many are selected. Click X next to the selection count to clear the current selection.

For more information about these actions, see Prep Libraries on page 38.

Pools

When you click on the Pools tab you see the Pools list, which shows all available pools of libraries you have created on your account. You can sort the list by clicking on the column headers, or click on a pool to see its properties and associated libraries.

Figure 8 Pools List

This page provides the following actions to prepare your analysis:
• Click on a pool, then click the Edit button to edit the notes.
• Select a pool and move to Plan Run.

NOTE: You can also merge pools the following way:
• Click Save & Continue Later. This takes you to the Pools list, with the recently created plate at the top of the list.
• Select the checkboxes in the Pools list.
• Click the Merge Pools button in the top navbar.
The box next to the Pools header keeps track of the total number of pools, and how many are selected.

For more information about these actions, see Pool Libraries on page 40.

Planned Runs

When you click the Planned Runs tab, you see the Planned Runs list, which shows all planned runs you have created on your account.

Figure 9 Planned Runs List

You can sort the list by clicking the column headers, or click a run to see or edit its properties. For more information about these actions, see Plan Runs on page 41.

The runs can have the following states:
} Ready to Sequence: The run can be started from the NextSeq sequencing system.
} Planning: The run does not show up on the NextSeq sequencing system because it is still in the planning stage.

NOTE
If you want to select multiple runs:
• Select multiple checkboxes.
• Click anywhere on a planned run row while holding the Ctrl key to add to a selection.
• Click anywhere on a planned run row while holding the Shift key to select all runs in between.
• Click the checkbox next to Experiment Name to select all planned runs on the current page.
The box next to the Planned Runs header keeps track of the total number of runs, and how many are selected. Click X next to the selection count to clear the current selection.
When sequencing on a run starts, the run is removed automatically from the Planned Runs list.

Runs Tab

The Runs button leads to the runs list, which allows you to sort your runs based on experiment name, state, workflow, created date, machine, and owner. The following run states are possible (blue boxes indicate final states):

To look at a run in detail, click its name to view its metrics. For more information, see Run Overview Page on page 20.
When you click the gear wheel next to the run name, you see options for sharing, transferring, and downloading a run. For more information, see Share Data on page 45, Transfer Ownership on page 51, or Download Files on page 43.

Run Overview Page

The Run Overview page provides 5 panes:
} The Run Details pane gives a summary of the run with links to view files and to the download and share options. For more information, see Share Data on page 45, View Files and Results on page 25, or Download Files on page 43.
} The Samples pane gives a list of all the app results in the run, the associated projects, and the number of samples in that analysis. This pane provides access to the following pages:
• Samples list, see Run Samples List on page 21
• Sample Details page, see Sample Overview Page on page 23
• App Results page, see App Results Page on page 23
• Project Overview page, see Project Overview Page on page 22

In addition, there is a Side Navigation ribbon, which provides easy navigation in the Run Details area. It contains links to the Overview, Run Samples List, Charts, Run Summary, Indexing QC, Sample Sheet or Run Settings, and Files pages.

Run Samples List

The samples list allows you to sort the samples in your run based on sample ID, app, date created, and project. If you want to look at a sample, app result, or project in detail, click the links to get to the following pages:
} Sample Overview Page on page 23.
} App Results Page on page 23.
} Project Overview Page on page 22.

In addition, there is a Side Navigation ribbon, which provides easy navigation in the Run Details area. It contains links to the Overview, Run Samples List, Charts, Run Summary, Indexing QC, Sample Sheet or Run Settings, and Files pages.

Projects Tab

The Projects button opens a list of your projects. You can sort the list by name, last update, or owner. Clicking a project provides access to the app results and samples within that project.
You generate a new project by clicking the New Project button at the top of the list. When you hover over a project that you own, you see the Settings wheel.

The Settings wheel provides the following options for sharing a project and editing the project details:
} Edit project: edit the name and description of the project. See also Edit Project Details on page 49.
} Share: manage sharing a project with a particular collaborator. See also Share a Project Using the Email Option on page 46.
} Get link: forward the sharing link to any number of collaborators. See also Share a Project with Get Link on page 45.
} Transfer ownership: hand control of data over to a collaborator or customer. See also Transfer Ownership on page 51.

NOTE
Runs and projects have separate permissions. If you share a project, you do not share the runs contained within the project.

In addition to the Run Details and Samples panes, the Run Overview page provides the following panes:
} The Charts pane displays an intensity by cycle chart. Clicking the header takes you to the Charts page, which contains five charts with run metrics. See Charts on page 69.
} The Run Summary pane displays tables with basic data quality metrics. Clicking the header takes you to the Run Summary page. See Run Summary on page 66.
} The Indexing QC pane lists count information for indices used in the run. Clicking the header takes you to the Indexing QC page. See Indexing QC on page 68.

Project Overview Page

The Project Overview page provides access to three panes with information about the project:
} The About tab gives you summary information about the project: owner, shared status, date created, and collaborators.
} The Analyses tab gives a list of all the App Sessions in the project, which can be sorted based on analysis name, last modified, date created, status, or application used to generate the analysis. Clicking an analysis links to the app results for that sample; see App Results Page on page 23 for more information.
} The Samples tab gives a list of all the samples in the project. Clicking a sample links to the page for that sample; see Sample Overview Page on page 23 for more information. Selecting samples allows you to launch them in an app, copy them to a different project, or combine them with another result.

NOTE
You can access these panes through the left navigation bar.

Project Toolbar

The Project Toolbar provides the following actions:
} Launch app: run custom apps on your sample. Clicking the app name leads to a page with more information about launching that app, including access permissions. See also Analyze Samples Further on page 31. Note that running custom apps may incur a charge.
} Share project: manage sharing a project with a particular collaborator. See also Share a Project Using the Email Option on page 46.
} Get link: forward the sharing link to any number of collaborators. See also Share a Project with Get Link on page 45.
} Edit project: edit the name and description of the project. See also Edit Project Details on page 49.
} Transfer owner: hand control of data over to a collaborator or customer. See also Transfer Ownership on page 51.

If you have selected samples in the Samples pane, you can perform additional actions:
} Copy to...: copy samples from this project to another. See also Copy Samples on page 50.
} Combine: combine samples. See also Combine Samples on page 50.

NOTE
The app session states are defined as follows:

State              Description
Running            The app is processing or uploading data.
Complete           Processing and file upload have finished and the data is now available to use.
Aborted            This AppResult or Sample has been aborted and may not be resumed.
Needs Attention    Processing cannot continue without user intervention.

App Results Page

The App Results page provides details about the results for that app session.
There is a general information pane to the left, and up to four graphs:
} Low Percentage Graph on page 61
} High Percentage Graph on page 62
} Clusters Graph on page 63
} Mismatch Graph on page 65

For more information about the various apps, see the topics below:
} Custom/PCR Amplicon on page 55
} Resequencing on page 55
} Library QC on page 56
} Small RNA Analysis on page 56
} Metagenomics Analysis on page 57
} De Novo Assembly Samples Page on page 57
} Generate FASTQ on page 58
} Isaac on page 58

Sample Overview Page

The Sample Overview page provides 2 panes:
} The Sample Details pane gives a summary of the run with links to launch a custom BaseSpace app on your sample. Clicking the app name leads to a page with more information about that app, including access permissions. Note that running custom apps may incur a charge.
} The Files pane gives a list of files associated with that sample. You can either look at all FASTQ files, or look at files specific to an app session. See also View Files and Results on page 25. You can also download selected files; see Download Files with the BaseSpace Downloader on page 44.

Options that are not available for the particular analysis or sample are grayed out.

Apps Tab

The Apps button leads to the Apps page, which provides an overview of the custom BaseSpace apps that you can run.
} Clicking the app name leads to a page with more information about that app, including a link to the developer and their app support contact details.
} Clicking the Launch button leads you through the launch pages, which allow you to set up the app session. Depending on the app, you may need to specify the project, sample, or output folder used by the app, as well as accept access permissions. Note that running custom apps may incur a charge.
} You can search for apps using the Search Apps box, or filter by app category on the right.
Public Data Tab

The Public Data page provides an overview of the publicly available data sets that you can use. Clicking a data set provides more information for that data, and allows you to import the run or project. You can search for data sets using the Search Public Data box, or filter by the research areas and categories listed on the right.

How To Use BaseSpace

The following topics describe how to run different functions in BaseSpace.
View Files and Results on page 25
Analyze Samples Further on page 31
Prepare a NextSeq Run on page 36
Download Files on page 43
Share Data on page 45
Project and Sample Management on page 49
Purchasing on page 51
Search for Runs, Projects, and Samples on page 53

View Files and Results

The following topics describe how to view files and results in BaseSpace.

View Files from a Run

What is it
BaseSpace gives you an option to view your run files or download them individually.

When to use it
Use this if you want to view files such as BCL files or images. You can also download these files locally.

Why to use it
Use this if you want to view files such as BCL files or images. You can also download these files locally.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 From the Run Overview Page, select the Files icon from the sidebar.
4 Select the desired file to view.

View Indexing QC Page

What is it
The Indexing QC page lists count information for indices used in the run. Note that the Indexing QC page is only available if the run is an index run. For more information, see Indexing QC on page 68.

When to use it
Use this when you want to access indexing QC results.

Why to use it
You may see unexpected results for a sample with a particular index, and need to troubleshoot what happened. You may also use it to confirm that all indexed samples were represented properly.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 There are two methods to go to the Indexing QC page:
• From the Run Overview page, click the Indexing QC link.
• From the Run Overview page, click the Indexing QC icon in the sidebar navigation menu.

You can select the displayed lane through the dropdown list.

The first table provides an overall summary of the indexing performance for that lane, including:

Total Reads                The total number of reads for this lane.
PF Reads                   The total number of passing filter reads for this lane.
% Reads Identified (PF)    The total fraction of passing filter reads assigned to an index.
CV                         The coefficient of variation for the number of counts across all indices.
Min                        The lowest representation for any index.
Max                        The highest representation for any index.

Further information is provided regarding the frequency of individual indices in both table and graph form. The table contains several columns, including:

Index Number               A unique number assigned to each index by BaseSpace for display purposes.
Sample ID                  The sample ID assigned to an index in the sample sheet.
Project                    The project assigned to an index in the sample sheet.
Index 1 (I7)               The sequence for the first index read.
Index 2 (I5)               The sequence for the second index read.
% Reads Identified (PF)    The number of reads (only includes Passing Filter reads) mapped to this index.

This information is also displayed in graphical form. In the graphical display, indices are ordered according to the unique Index Number assigned by BaseSpace.

View Run Charts

What is it
The Charts page displays charts with run metrics.

When to use it
Use this when you want to view charts such as Flow Cell, Data By Cycle, Data By Lane, Qscore Distribution, and Qscore Heatmap. For more information, see Charts on page 69.

Why to use it
Use this if you want access to these various charts.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 There are two methods to go to the Charts page:
• From the Run Overview page, click the Charts link.
• From the Run Overview page, click the Charts icon in the sidebar navigation menu.

View Run Samples List

What is it
The Run Samples List contains a list of all the samples in the run.

When to use it
} Use this when you want to see a list of all the samples in the run.
} Use this when you want to navigate to details regarding a specific sample.

Why to use it
Use this if you want a quick way to view all the samples in a run. Use this when you want to see more detail regarding your samples, such as genome name, sample, or FASTQ files.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 There are two methods to go to the Run Samples List:
• From the Run Overview page, click the Samples link.
• From the Run Overview page, click the Samples icon in the sidebar navigation menu.
You can now click a sample to see the sample overview; for more information, see Sample Overview Page on page 23.

View Run Summary

What is it
The Run Summary page has the overall statistics about the run.

When to use it
Use this when you want to view information about the run such as percent alignment, cycles, densities, and so on.

Why to use it
Use this if you want a quick breakdown of the statistics for a particular run.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 There are two methods to go to the Run Summary:
• From the Run Overview page, select the Run Summary button.
• From the Run Overview page, select the Run Summary icon in the left sidebar.

The following metrics are displayed in the top table, split out by read and total:

Level                             The level or read of the run.
Cycles                            The number of cycles in the level.
Yield Total                       The number of bases sequenced. This is updated as the run progresses.
Projected Total Yield             The projected number of bases expected to be sequenced at the end of the run.
Yield Perfect                     The number of bases in reads that align perfectly, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available.
Yield <=3 errors                  The number of bases in reads that align with 3 errors or less, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available, and will show a zero value.
Aligned                           The percentage of the sample that aligned to the PhiX genome. This is determined for each level or read independently.
% Perfect [Num Usable Cycles]     The percentage of bases in reads that align perfectly, as determined by a spiked-in PhiX control sample, at the cycle indicated in the brackets. If no PhiX control sample is run in the lane, this chart will display 0% but will still show the number of cycles used.
% <=3 errors [Num Usable Cycles]  The percentage of bases in reads that align with 3 errors or less, as determined by a spiked-in PhiX control sample, at the indicated cycle. If no PhiX control sample is run in the lane, this chart will display 0% but will still show the number of cycles used.
Error Rate                        The calculated error rate of the reads that aligned to PhiX.
Intensity Cycle 1                 The average of the A channel intensity measured at the first cycle averaged over filtered clusters.
% Intensity Cycle 20              The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle: 100% x (Intensity at cycle 20) / (Intensity at cycle 1).
%Q>=30                            The percentage of bases with a quality score of 30 or higher. This chart is generated after the 25th cycle, and the values represent the current cycle.

The following metrics are available in the Read tables, split out by lane:

Tiles                 The number of tiles per lane.
Density               The density of clusters (in thousands per mm²) detected by image analysis, +/- one standard deviation.
Clusters PF           The percentage of clusters passing filtering, +/- one standard deviation.
Phas./Prephas.        The value used by RTA for the percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead (prephasing) of the current cycle within a read.
Reads                 The number of clusters (in millions).
Reads PF              The number of clusters (in millions) passing filtering.
%Q>=30                The percentage of bases with a quality score of 30 or higher. This chart is generated after the 25th cycle, and the values represent the current cycle.
Yield                 The number of bases sequenced which passed filter.
Cycles Err Rated      The number of cycles that have been error rated with respect to PhiX starting at cycle 1.
Aligned               The percentage that aligned to the PhiX genome.
Error Rate            The calculated error rate, as determined by the PhiX alignment. Subsequent columns display the error rate for cycles 1–35, 1–75, and 1–100.
Intensity Cycle 1     The average of the A channel intensity measured at the first cycle averaged over filtered clusters.
%Intensity Cycle 20   The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle: 100% x (Intensity at cycle 20) / (Intensity at cycle 1).

View Sample Sheet from a Run

What is it
This option allows you to view the sample sheet that is tied to this run.

When to use it
Use this when you want to view the associated sample sheet for this run.

Why to use it
You want to check whether the sample sheet was set up properly.

How to use it
1 Click the Runs icon.
2 Click the desired run.
3 From the Run Overview Page, select the Sample Sheet icon from the sidebar.

View the Project Sample List

What is it
The Project Sample List contains the list of samples in a project.

When to use it
} Use this when you want to see a list of all the samples in the project.
} Use this when you want to navigate to details regarding a specific sample.
Why to use it
This is an easy way to get to the details page of a sample.

How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click the Samples link in the sidebar navigation menu.
You can now click a sample to see the sample overview; for more information, see Sample Overview Page on page 23.

View the Analyses List

What is it
The Analyses List contains a list of app sessions in a project.

When to use it
Use this when you want to navigate to details regarding a specific app session.

Why to use it
This is an easy way to get to the details of a particular app session.

How to use it
1 Click the Projects icon.
2 Click the desired project.
You can now click an Analysis to see the results; for more information, see App Results Page on page 23.

Analyze Samples Further

The following topics describe how to further analyze samples in BaseSpace, starting with FASTQ files (HiSeq and MiSeq) or the results from sample-sheet driven workflows (MiSeq).

Launch the IGV App

What is it
The Integrative Genomics Viewer (IGV) of the Broad Institute is a fully featured genome browser that allows you to visualize your sequence data in great detail. Illumina has modified IGV to display alignment and variant data from BaseSpace (BAM and VCF files).

When to use it
IGV enables you to perform variant analysis after launching Resequencing or Amplicon workflows in BaseSpace. IGV is run on a project, which is the highest level directory and contains one or more AppResults. IGV retains all of its native functions, including loading data from your local computer.

Why to use it
To visualize your sequence data in greater detail.

How to use it
NOTE
The Java run-time environment has to be installed on your computer for IGV to work properly. Download Java here: java.com/en/.

Run the IGV App the following way:
1 Click the Projects icon.
2 Click the desired project.
3 Click the Launch Apps button and select the IGV application from the dropdown list.
4 Select the Accept button.
5 Depending on your browser, you are asked to open or save the .jnlp file.
• For Internet Explorer, click the Open button.
• For Chrome, click the Keep button and then click the file to open it.
• For Firefox, select the Open with Java(TM) Web Start Launcher (default) option.
The IGV App opens on your desktop with the requested project loaded.

BaseSpace Data in IGV

The BaseSpace file browser shows data in BaseSpace that is available for viewing in IGV. The directory structure follows the way data is organized in BaseSpace. A project is the highest level directory and it contains one or more AppResults. If an AppResult was the result of analyzing a single sample, then the sample name is appended to the AppResult name. Each AppResult contains zero or more files. Only alignment (BAM) and variant (VCF) files are shown in the file browser.

Double-click a BAM or VCF file to load it as an IGV track. Load VCF files before BAM files, because read tracks can take up an entire IGV screen, which requires scrolling to see variants.

Additional Reference Genomes

IGV contains a number of installed reference genomes:
} Homo sapiens: Human hg19
} Mus musculus: Mouse mm9
} Saccharomyces cerevisiae: S. cerevisiae (sacCer2)
} Arabidopsis thaliana: A. thaliana (TAIR10)

In addition, you can download the following reference genomes from Illumina:
} PhiX: ftp://igenome:[email protected]/PhiX/Illumina/RTA/PhiX_Illumina_RTA.tar.gz
} Staphylococcus aureus (strain NCTC 8325): ftp://igenome:[email protected]/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Staphylococcus_aureus_NCTC_8325_NCBI_2006-02-13.tar.gz

Launch Third-Party Apps

What is it
A method to launch apps built by third-party vendors. In general, these apps perform tertiary analysis, visualization, or annotation of data.
When to use it
When you want to run a third-party app on your samples, because it provides an additional level of analysis that is not provided by Illumina workflow apps. Running third-party apps may incur a charge.

How to use it
1 Navigate to the project, sample, or app result that you want to run the app on.
2 Click the Launch Apps button and select the desired third-party application from the dropdown list.
3 Read the End User License Agreement and permissions, and click Accept if you agree to them.
The third-party app then guides you through the start-up process.

NOTE
Third-party apps are generated by third-party vendors. For support, contact that vendor.

One more reference genome for IGV is available for download from Illumina:
} E. coli (strain DH10B): ftp://igenome:[email protected]/Escherichia_coli_K_12_DH10B/NCBI/2008-03-17/Escherichia_coli_K_12_DH10B_NCBI_2008-03-17.tar.gz
For E. coli, rename the first line in genome.fa from >chr to >ecoli.

Launch the Isaac App

What is it
An app that runs the Isaac Aligner and Isaac Variant Caller on your sample. For more information, see Isaac on page 58.

When to use it
When you want to run a fast aligner, and call SNPs and small indels. Isaac should only be used for whole human genome analysis. Do not use it for other species or targeted sequencing.

How to use it
1 Click the Projects icon.
2 Click the desired project.
3 From the Samples list, select the sample to run the Isaac workflow on.
4 Click the Launch Apps button and select the Isaac workflow from the dropdown list.
5 From the main Isaac Workflow page, click the Start button.
The Isaac App now runs on your requested project.

Run Sample Sheet Driven Workflow Apps

Sample sheet driven workflow apps start automatically, based on the workflow that is specified in the sample sheet. You can resubmit the sample sheet and re-queue the run with new analysis parameters one time.
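To see which workflow a sample sheet would trigger, you can read the workflow entry from the sheet before submitting it. The sketch below is a minimal illustration that assumes the sample sheet is a comma-separated file with bracketed section names such as [Header] and a Workflow key in that section, as sample sheets written by Illumina Experiment Manager commonly are. This guide does not describe the file layout, so treat the section name, the key name, and the helper name read_header as assumptions.

```python
import csv
import io
from typing import Dict


def read_header(sample_sheet_text: str) -> Dict[str, str]:
    """Collect key/value pairs from the [Header] section (assumed layout)."""
    header: Dict[str, str] = {}
    in_header = False
    for row in csv.reader(io.StringIO(sample_sheet_text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        first = row[0].strip()
        if first.startswith("["):
            # A bracketed name starts a new section.
            in_header = first.lower() == "[header]"
            continue
        if in_header:
            header[first] = row[1].strip() if len(row) > 1 else ""
    return header


# Hypothetical, shortened sample sheet used only for this example.
EXAMPLE = """[Header]
Workflow,Resequencing

[Data]
Sample_ID,Sample_Name
S1,demo
"""

workflow = read_header(EXAMPLE).get("Workflow")
```

Because a sample sheet can be resubmitted only one time, a quick local check like this can help confirm the intended workflow before you re-queue a run.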
Fix Sample Sheet / Re-Run Workflow

What is it
The Fix Sample Sheet page lets you correct errors in your sample sheet, or set up a new analysis to re-queue.

When to use it
} To fix errors in the sample sheet.
} To change analysis parameters.
} To change indexing details.

Why to use it
} Errors in the sample sheet may prevent BaseSpace from processing a run. Fixing them allows BaseSpace to finish the analysis.
} The first analysis may have been sub-optimal. You can resubmit the sample sheet and re-queue the run with new analysis parameters one time.
} The index settings for samples may have been wrong and need to be corrected.

NOTE
You can only submit a corrected sample sheet and re-queue the run one time.

How to use it
1 You can reach the Fix Sample Sheet page two ways:
• A run may have a Needs Attention state. Open the run, and click the Fix Sample Sheet link.
• Go to a run, select the More drop-down list, and then select Fix Sample Sheet.
The Fix Sample Sheet page opens. If BaseSpace has detected an error, it displays the issue above the black sample sheet editor.
2 Depending on the complexity of the change, you have two options:
• Easy fix: edit the sample sheet in the sample sheet editor. BaseSpace keeps validating the sample sheet as you edit; any remaining issues are displayed above the sample sheet editor.
• More complex change: use Illumina Experiment Manager (IEM).
a If you have not installed IEM yet, click the Illumina Experiment Manager (IEM) link, and install IEM.
b Open IEM.
c Import the original sample sheet from your system in IEM and edit it, or generate a new sample sheet. See the Illumina Experiment Manager User Guide for instructions.
d Copy and paste the sample sheet into the Sample Sheet Editor in BaseSpace. BaseSpace validates the sample sheet; any issues are displayed above the sample sheet editor.
3 Once you are done editing and the sample sheet is valid, click the Queue Analysis button, and BaseSpace starts analyzing the run using the new sample sheet. You can only resubmit a sample sheet and re-queue the run one time.

NOTE
If your edits result in an invalid sample sheet, the Queue Analysis button is not available. You can return to the original using the Load Original button.

Common Sample Sheet Fixes

If a sample sheet is invalid, it could be because the genome path is not set up correctly. This situation would be indicated by the Genome Path Unknown Genome warning (as in the example above). The paths of the standard BaseSpace genomes should conform to the following relative paths:

Arabidopsis_thaliana\NCBI\build9.1\Sequence\WholeGenomeFASTA
Bos_taurus\Ensembl\UMD3.1\Sequence\WholeGenomeFASTA
Escherichia_coli_K_12_DH10B\NCBI\2008-03-17\Sequence\WholeGenomeFASTA
Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA
Mus_musculus\UCSC\mm9\Sequence\WholeGenomeFASTA
PhiX\Illumina\RTA\Sequence\WholeGenomeFASTA
Rattus_norvegicus\UCSC\rn4\Sequence\WholeGenomeFASTA
Saccharomyces_cerevisiae\UCSC\sacCer2\Sequence\WholeGenomeFASTA
Staphylococcus_aureus_NCTC_8325\NCBI\2006-02-13\Sequence\WholeGenomeFASTA

Prepare a NextSeq Run

What is it
You can prepare NextSeq runs through the BaseSpace Prep tab, which organizes samples, libraries, pools, and runs in a single environment.

When to use it
Use this if you want to prepare a sequencing run on a NextSeq instrument, and have the data stream seamlessly to BaseSpace. Do not use it to prepare sequencing runs for other instruments. If you do have a NextSeq sequencing system but don't want to use BaseSpace, you can also start a run straight on the instrument.

Why to use it
Preparing a run in the Prep tab moves the data and analysis seamlessly to BaseSpace.
Using the Prep tab means BaseSpace is your single-stop solution for sequencing management, storage, and analysis.

How to use it
1 Log in to BaseSpace. If it is your first time logging in, accept the user agreement.
2 Click the Prep icon.
3 Set up a NextSeq run on the Prep Tab in four consecutive steps:
a Biological Samples: Contains information about the samples that are going to be sequenced. You can create new samples, import samples, or use existing samples; for instructions, see one of the following topics:
• Create New Biological Samples on page 36
• Import Biological Samples on page 37
• Use Existing Biological Samples on page 38
b Libraries: Consists of biological samples that are prepped and contain adapters. Each library usually derives from a single biological sample, though biological samples can be used in multiple libraries. See Libraries on page 18.
c Pools: Consists of groups of libraries that share analysis parameters. Pools can consist of one or multiple libraries. See Pools on page 18.
d Planned Runs: Contains pools that run with the same analysis parameters, on the same machine, at the same time. Planned runs can consist of one or multiple pools. See Planned Runs on page 19.

Create New Biological Samples

If you want to create a new biological sample, do the following:
1 Click the Prep icon.
2 Click Biological Samples.
3 Click the + Create button.
4 Fill out the required fields Sample ID, Name, and Nucleic Acid type.
NOTE
Sample ID and sample name can consist only of alphanumeric characters, dashes, or underscores. Sample ID should be unique and short; sample name can be more descriptive to provide a human-readable identifier.
5 Optional: Fill out the Organism (species) field.
6 Optional: Fill out the Project fields. You can also generate a new project. A project is optional, but if you don't specify it here, you will have to set it later, because the output data needs to be stored to the project.
7 When finished, do one of the following:
• If you only want to select the newly created sample, click the Next: Prep Libraries button. Continue with Prep Libraries on page 38.
• If you want to select multiple samples, click the Save & Continue Later button. This takes you back to the Biological Samples list, with the recently created sample at the top of the list. Continue with Use Existing Biological Samples on page 38.

NOTE
To create several new samples at once, use the import function; see Import Biological Samples on page 37.

Import Biological Samples

If you want to import new biological samples, do the following:
1 Click the Prep icon.
2 Click Biological Samples.
3 Click the Import button.
4 If you have not generated an import file yet, click the template link and fill out the samples. Be aware of the following when filling out the template:
• User Sample ID and sample name can consist only of alphanumeric characters, dashes, or underscores. Sample ID should be unique and short; sample name can be more descriptive to provide a human-readable identifier.
• The Organism (species) field is optional.
• The Project field is optional at this step, but if you don't specify it here, you will have to set it later, because the output data needs to be stored to the project.
• Fill out the Nucleic Acid column with DNA or RNA.

Figure 10 Import Sample Template

5 Click the Choose File button.
6 Browse to the import file and click Open.
7 Click Import.
8 When finished, do one of the following:
• If you only want to select the newly created samples, click the Next: Prep Libraries button. Continue with Prep Libraries on page 38.
• If you want to select multiple samples, click the Save & Continue Later button. This takes you back to the Biological Samples list, with the recently created sample at the top of the list. Continue with Use Existing Biological Samples on page 38.
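Before importing, it can help to check the rows of your import file against the template rules above: IDs and names limited to alphanumeric characters, dashes, and underscores, unique Sample IDs, and a Nucleic Acid value of DNA or RNA. The sketch below is one way to do that in Python. The column keys SampleID, Name, and NucleicAcid are hypothetical stand-ins, because the actual template column headers (Figure 10) are not reproduced in this text, and BaseSpace performs its own validation on import.

```python
import re
from typing import Dict, List

# Alphanumeric characters, dash, or underscore only.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")
NUCLEIC_ACIDS = {"DNA", "RNA"}


def check_samples(rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        sample_id = row.get("SampleID", "")  # hypothetical column key
        name = row.get("Name", "")  # hypothetical column key
        if not ALLOWED.fullmatch(sample_id):
            problems.append(f"row {number}: invalid Sample ID {sample_id!r}")
        elif sample_id in seen_ids:
            problems.append(f"row {number}: duplicate Sample ID {sample_id!r}")
        seen_ids.add(sample_id)
        if not ALLOWED.fullmatch(name):
            problems.append(f"row {number}: invalid sample name {name!r}")
        if row.get("NucleicAcid") not in NUCLEIC_ACIDS:  # hypothetical column key
            problems.append(f"row {number}: Nucleic Acid must be DNA or RNA")
    return problems


rows = [
    {"SampleID": "S1", "Name": "liver_rep1", "NucleicAcid": "DNA"},
    {"SampleID": "S1", "Name": "liver rep2", "NucleicAcid": "cDNA"},
]
issues = check_samples(rows)  # row 2 is flagged three times, row 1 not at all
```

The Organism and Project columns are left out of the check because both are optional at this step.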
Use Existing Biological Samples
The Biological Samples list shows all available samples you have created on your account.
1 To select existing samples, do one of the following in the Biological Samples list:
  • Select the checkboxes.
  • Click the sample. If you want to select multiple samples, hold the Ctrl key.
  • Select all samples by selecting the checkbox next to the SampleID header.
2 Click the Prep Libraries button in the top navbar.

Prep Libraries
On the Prep Libraries page, you assign indices to biological samples, based on the indices available in the library prep type chosen. Every used well or tube contains a separate library. Best practice is to set up the libraries in BaseSpace first, export a file of your library settings, and use that file to pipet the biological samples into the proper wells or tubes.
1 Select the library prep type. BaseSpace now automatically assigns indices to wells or tubes, depending on the format of the library prep type.
Figure 11 Tube Set Up for Single Index Library Preparation Kit
Figure 12 Plate Set Up for Dual Index Library Preparation Kit
2 Enter the plate ID. The ID needs to be unique.
3 Click the Auto Prep button to fill the plate or tubes automatically with all samples listed.
NOTE
You can also manually drag the samples to wells or tubes:
1. Select one or more samples. To multiselect, hold Shift. To multiselect on Firefox or Internet Explorer 9, you need to click the well twice.
2. Drag the selected samples to a position.
3. Check whether the indices have been assigned to the proper samples. Hovering over a position reveals the sample that is assigned to that position. You can drag samples from position to position.
4 Save a file of your library settings by clicking the Download CSV button. Use this file in the lab to indicate which biological samples get pipetted into specific wells.
5 When finished, do one of the following:
  • If you want to select the new plate or tubes, click the Pool Libraries button. Continue with Pool Libraries on page 40.
  • If you want to select multiple library preps or plates, do the following:
    a Click the Save & Continue Later button. This takes you to the Libraries list, with the recently created setup at the top of the list.
    b Select the checkboxes in the Libraries list.
    c Click the Pool Libraries button in the top navbar.
NOTE
If you do not want to use index sequencing, you still need to assign your biological sample to an index. Only when you set up your sequencing run do you specify that the index is not sequenced.
NOTE
If one of your samples is not assigned to a project, you cannot continue. Select the sample, click the Set Project button, and assign it to a project. You can also generate a new project.

Nextera Rapid Capture Considerations
If you are performing Nextera Rapid Capture, do the following:
• Choose Nextera Enrichment as library prep.
• Put biological samples belonging to the same enrichment on a row next to each other.
• Change the index in the dropdown menu to the left of the rows to the proper index, probably the same indices for the different rows (enrichments).
• Name your plate in a way that makes clear that multiple enrichments are on the plate, or add a note to that effect in the Note field.

Pool Libraries
The Pool Libraries page allows you to pool samples and sequence them in the same run, using the same analysis parameters.
1 Fill out the first pool ID. The pool ID needs to be unique.
2 If needed, create additional pools on the right by clicking the + Add Pool button and filling out the pool IDs.
  • Colors of the wells correspond to the colors of the pools.
  • You can hover over the wells to see the library IDs.
3 Drag and drop individual samples from their well on the plate to a pool. You can multiselect by holding Shift. To multiselect on Firefox or Internet Explorer 9, you need to click the well twice.
4 If you want to pool libraries from multiple plates, use the Plate dropdown menu to specify the plate.
NOTE
You can also merge pools the following way:
• Click the Save & Continue Later button. This takes you to the Pools list, with the recently created pool at the top of the list.
• Select the checkboxes in the Pools list.
• Click the Merge Pools button in the top navbar.
5 Click the Plan Run button.

Nextera Rapid Capture Considerations
If you are performing Nextera Rapid Capture, make sure to assign only samples from the same enrichment to one pool, and note this in the pool name.

Plan Runs
On the Planned Runs page, you can set up the parameters for the sequencing run on your NextSeq instrument.
1 Enter a name for your planned run.
2 Optional: Enter the reagent barcode you plan to use. This links a reagent kit to this run.
3 Select the rehyb checkbox if you are performing a rehybridization.
4 Verify the Review Indexes section for the indexing strategy. By default, the indexing is set according to the index/library prep type chosen previously. If you choose to override this default indexing scheme, you are required to select the Index type (Single, Dual, or No Index) and to enter the number of index cycles accordingly. If you have selected multiple libraries, you cannot specify No Index. BaseSpace automatically checks whether the chosen indices have enough diversity; if not, it warns you that you should change your index strategy.
5 Verify the pool included in the planned run.
6 Fill out the Enter Cycles section:
  • Single- vs. paired-end
  • Number of cycles per read
7 Once your settings are complete, choose one of these two options to continue:
  • Click the Sequence button. This opens the Planned Runs list and sets the state of the recently planned run to Ready to Sequence.
  • Click the Save & Continue Later button. This opens the Planned Runs list and sets the state of the recently planned run to Planning.
NOTE
A planned run must be in the Ready to Sequence state in order for it to show up in the Planned Runs list in the control software on the instrument.
8 If you want to change a planned run to the Ready to Sequence state, select the planned run from the list and click the Sequence arrow link in the top navbar on the Planned Runs list page.
Your run now shows up in the Planned Runs list in the control software on your NextSeq sequencing system. Complete the run from your sequencing instrument. A sample sheet is not required. BaseSpace automatically generates FASTQ files once the sequencing run is complete; for more information, see Generate FASTQ on page 58.

Download Files
The following topics describe how to download files in BaseSpace. For more information about file types, see BaseSpace Files on page 75.

Download File Package from a Run
What is it
BaseSpace allows you to download data as a pre-defined package (for MiSeq runs), one-by-one, or as your own selection. This topic describes how to download packages. For a selection of files, see Download Files with the BaseSpace Downloader on page 44; for individual FASTQ files, see Download FASTQ Files with the File Browser on page 44.
The packages available depend on your workflow; packages that are grayed out are not available for download. There are four types of data packages:
• Variant Data, containing VCF files with variant calls.
• Aligned Data, containing BAM files with aligned reads.
• Unaligned Data, containing FASTQ files with unaligned reads.
• SAV Data, containing files describing the setup of the run and InterOp files.
For more information about file types, see BaseSpace Files on page 75.
When to use it
• Use this when you want to download a packaged (zipped) file for Variant, Aligned, Unaligned, or SAV data.
• Don’t use this if you only want individual files.
Why to use it
Use this if you want to download Variant, Aligned, Unaligned, or SAV data in a neatly packaged file instead of downloading the files one-by-one.
How to use it
1 Click the Runs icon.
2 Click the desired run.
3 Click the Download button.
4 Select the desired data option.

Download FASTQ Files with the File Browser
What is it
BaseSpace allows you to download data as a pre-defined package, one-by-one, or as your own selection. This topic describes how to download individual FASTQ files with the file browser. For packages, see Download File Package from a Run on page 43; for a selection of files, see Download Files with the BaseSpace Downloader on page 44.
When to use it
Use this option when you want to download FASTQ files per sample.
Why to use it
If you only want to download the FASTQ files of a sample, it saves you time, because you are not downloading all the other files.
How to use it
1 Click the Runs icon or Projects icon.
2 Click the desired run or project.
3 Click the desired sample in the Samples pane.
4 In the Files pane, select the FASTQ Files section.
5 Click the file you want to download.
6 Click the Download button.
BaseSpace now downloads the files to the desired location. For more information about FASTQ files, see FASTQ Files on page 84.

Download Files with the BaseSpace Downloader
What is it
BaseSpace allows you to download data either as a package or individually. This topic describes how to download multiple files with the downloader; for packages, see Download File Package from a Run on page 43.
When to use it
Use this option when you want to download multiple files per sample, but do not want to download a pre-defined package.
Why to use it
If you only want to download a number of files of a sample, it saves you time, because you are not downloading all the other files.
How to use it
1 Click the Runs icon or Projects icon.
2 Click the desired run or project.
3 Click the desired sample in the Samples pane.
4 In the Files pane, select the checkboxes for the desired files.
5 Click the Download Selected button.
The BaseSpace Downloader guides you through the download process and starts the download of the files to the desired location.

Share Data
Data in BaseSpace can be shared with collaborators in two ways. You can share data at a run or project level, either via an email invitation or through a hyperlink. With the email invitation option, only the accounts with the specified email address can view the shared data. Sharing via a hyperlink allows anyone with access to the hyperlink to view the shared data, as long as the hyperlink is still active.
Sharing is for read-only access. If you want a collaborator to have write access, see Transfer Ownership on page 51.
NOTE
Runs and projects have separate permissions. If you share a run, the project associated with that run is not shared automatically, meaning samples and app results will not be accessible to collaborators of the run.
The following topics describe how to share.

Share a Project with Get Link
What is it
Sharing using the Get Link option allows you to share a project or a run with any collaborator who has access to the link. The hyperlink can be turned on or off with the Activate and Deactivate options. Be aware that anyone can access the project or run when the link is activated. Furthermore, anyone who previously accepted the link still has access to the project or run even after the link is deactivated.
NOTE
If you want more control, use the email share option, where you can specify who can view the project (Share a Project Using the Email Option on page 46).
When to use it
• Use this when you don’t want to assign the project to a specific person.
• This share link can be forwarded to many other collaborators while the link is still active.
• Don’t use this option if you want to confine the list of who has access to this project.
Why to use it
Use this if you want an easy way to share a link without the hassle of adding specific people by email and setting permissions.
How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click the Get Link button.
4 Click the Activate button.
5 Copy the URL to share with collaborators.
The link is active until the Deactivate option is selected. To deactivate a sharing link, follow a similar path:
1 Navigate to the shared item.
2 Click the Get Link button.
3 Click the Deactivate button.

Share a Project Using the Email Option
What is it
Sharing using the Share option allows you to share a project or run with a specified collaborator via an email link. The specified collaborator receives an email with a link to the project or run, and only that person can view the corresponding data.
NOTE
The Email option allows greater control over who can view your data than the Get Link option, which gives anyone with the link access to your data while the link is activated. For more information, see Share a Project with Get Link on page 45.
When to use it
• Use this option if you want to easily share your project with collaborators.
• Use this option if you want to be able to control who has access to the project.
Why to use it
Use this option if you want to be able to control who has access to the project.
How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click Share Project.
4 In the Share Settings dialog box, enter the collaborator's email address and click the Invite button.
NOTE
The invitation email address must match the email address of your collaborator's BaseSpace login, or else your collaborator will not be able to view the project.
5 Click Save Settings.

Share a Run with Get Link
What is it
Sharing using the Get Link option allows you to share a run with any collaborator who has access to the link. The hyperlink can be turned on or off with the Activate and Deactivate options. Be aware that anyone can access the run when the link is activated. Furthermore, anyone who previously accepted the link still has access to the run even after the link is deactivated. Sharing runs with the Get Link option is very similar to sharing projects with the Get Link option.
NOTE
If you want more control, use the email share option, where you can specify who can view the run (Share a Run Using the Email Option on page 48).
When to use it
• Use this when you don’t want to assign the run to a specific person.
• This share link can be forwarded to many other collaborators while the link is still active.
• Don’t use this option if you want to confine the list of who has access to this run.
Why to use it
Use this if you want an easy way to share a link without having to specify people's email addresses and set permissions.
How to use it
1 Click the Runs icon.
2 Click the desired run.
3 Click the More button and select the Get Link option.
4 Click the Activate button.
5 Copy the URL to share with collaborators.
The link is active until the Deactivate option is selected. To deactivate a sharing link, follow a similar path:
1 Navigate to the run.
2 Click the Get Link button.
3 Click the Deactivate button.
NOTE
Runs and projects have separate permissions. If you share a run, the project associated with that run is not shared automatically, meaning samples and app results will not be accessible to collaborators of the run.
Share a Run Using the Email Option
What is it
Sharing using the Share option allows you to share a project or run with a specified collaborator via an email link. The specified collaborator receives an email with a link to the project or run, and only that person can view the corresponding data.
NOTE
The Email option allows greater control over who can view your data than the Get Link option, which gives anyone with the link access to your data while the link is activated. See Share a Run with Get Link on page 47 for more information.
When to use it
• Use this option if you want to easily share your run with collaborators.
• Use this option if you want to be able to control who has access to the run.
Why to use it
Use this option if you want to be able to control who has access to the run.
How to use it
1 Click the Runs icon.
2 Click the desired run.
3 Click the Share button.
4 In the Share Settings dialog box, enter the collaborator's email address and click the Invite button.
NOTE
The invitation email address must match the email address of your collaborator's BaseSpace login, or else your collaborator will not be able to view the run.
5 Click the Save Settings button.
NOTE
Runs and projects have separate permissions. If you share a run, the project associated with that run is not shared automatically, meaning samples and app results will not be accessible to collaborators of the run.

Project and Sample Management
The following topics describe how to manage projects and samples in BaseSpace.

Edit Project Details
What is it
A way to edit project details.
When to use it
Use this when you want to change details of the project, such as the project name or description.
Why to use it
Use this if you need to edit the project name or description.
How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click the Edit Project button.
4 Make changes in the Edit Project dialog box.
5 Click the Save button.

Set Up a New Project
What is it
A method to set up a new project.
When to use it
• When you want to analyze a sample in the context of two different projects
• When you want to transfer ownership of samples to a collaborator, but still keep a copy yourself
• When you want to split a project into multiple projects
How to use it
1 Click the Projects icon.
2 Click the New Project link in the top left corner.
3 Enter a new name and description.
4 Click the Create button.
To copy samples into the new project, see Copy Samples on page 50.

Combine Samples
What is it
A method to combine (merge) samples.
When to use it
When you want to merge the data from two different sequencing runs on the same sample.
How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click the Samples link from the sidebar navigation menu.
4 Select the checkboxes of the samples you want to combine.
5 Click the Combine button.
6 Click the Combine button in the pop-up screen.

Copy Samples
What is it
A method to copy samples from one project to another.
When to use it
• When you want to analyze a sample in the context of two different projects
• When you want to transfer ownership of a sample to a collaborator, but still keep a copy yourself
• When you have assigned a sample to the wrong project
How to use it
1 Click the Projects icon.
2 Click the desired project.
3 Click the Samples link from the sidebar navigation menu.
4 Select the checkboxes of the samples you want to copy.
5 Click the Copy button.
6 Select the new project in the dropdown list.
7 Click the Copy button.

Transfer Ownership
What is it
A method to hand control of data over to a collaborator or customer.
When to use it
• If you want to give control of your data to a collaborator
• If you sequenced samples for a customer, for example, if you are a core lab or service provider
How to use it
1 Select the project or run you want to transfer:
  • Project:
    a Click the Projects icon.
    b Click the desired project.
    c Click the Transfer Owner button.
  • Run:
    a Click the Runs icon.
    b Click the desired run.
    c Click the More button, and then select the Transfer Ownership option.
2 Enter the new owner's email and an optional message in the Transfer Ownership dialog box.
3 Click Continue.
BaseSpace sends the new owner an email asking to accept the ownership of the run or project. The ownership transfer of the project or run completes once the new owner accepts. At this point, you have no control over that run or project. You are also not able to see that run or project, unless the new owner shares it with you; see Share Data on page 45 for more information.

Purchasing
In order to buy app sessions from third-party vendors, you need to have iCredits. This chapter describes how to purchase iCredits and manage your wallet and purchases.

Access Your Wallet
What is it
The wallet contains your iCredits and credit card information.
When to use it
Use it to update credit cards or add iCredits.
Why to use it
You need iCredits if you want to purchase app sessions. The wallet allows you to manage your iCredits.
How to use it
1 Go to your account.
2 Click the Wallet button.
When you are on the Wallet screen, you can add iCredits or credit cards.

Adding iCredits
What is it
iCredits allow you to purchase app sessions.
When to use it
Use Adding iCredits when you are running low on iCredits and you want to buy app sessions.
Why to use it
Third-party apps provide functionality that may be exactly what your analysis needs, but they need to be purchased with iCredits.
How to use it
You can either add iCredits directly or create a purchase order.
• Add iCredits directly:
1 Go to the Wallet screen.
2 Click the Add More button.
3 Enter the amount and select the desired credit card from the dropdown list.
4 Click the Continue button.
5 Click the Purchase button. A message appears stating how many credits have been added.
6 Click the OK button.
• Create a purchase order:
1 Go to the Wallet screen.
2 Click the Create a Quote link.
3 Enter the amount and the desired account.
4 Click the Create Quote button.
The purchase order appears, and once it is processed, Illumina credits the account with the iCredits. You can generate a paper copy using the Print button.

Adding Credit Card
What is it
You can use a credit card to purchase iCredits, but you need to add it to BaseSpace first.
When to use it
When you want to buy iCredits with a new credit card.
Why to use it
Third-party apps provide functionality that may be exactly what your analysis needs, but they need to be purchased with iCredits.
How to use it
1 Go to the Wallet screen.
2 Click the Add Credit Card button.
3 Fill in the credit card information and click Submit.

View Purchase History
What is it
The Purchase History page contains detailed information about purchases, adjustments, and the balance for your account.
When to use it
When you want to review your purchases.
Why to use it
You may want to track where you spend your iCredits, or to see whether a refund has been processed.
How to use it
1 Go to your account.
2 Click the Purchase History button.
Now you can review your purchases, filter on type of transaction, and sort by order number, vendor, date, or total iCredits used.

Search for Runs, Projects, and Samples
What is it
The Search box allows you to find runs, projects, and samples.
When to use it
When you want to do a quick search for a run, project, or sample.
Why to use it
Use this if there are a lot of runs and you want a quick way to find one.
How to use it
1 Type the run, project, or sample name in the search field and press Enter or click the magnifying glass icon.
2 Select the desired run, project, or sample in Search Results.
You can also filter the search results by these categories using the dropdown list at the left of the Search Results page.

Workflow Reference
This section describes the Illumina workflow apps listed below.
• Resequencing on page 55
• Custom/PCR Amplicon on page 55
• Library QC on page 56
• Small RNA Analysis on page 56
• Metagenomics Analysis on page 57
• De Novo Assembly Samples Page on page 57
• Generate FASTQ on page 58
• Isaac on page 58

Resequencing
The sample sheet-driven Resequencing app compares the DNA sequence in the samples against a reference genome and identifies any variants (SNPs or indels) relative to the reference sequence.
The main output files generated by the Resequencing workflow are .bam files (containing the alignment results) and .vcf files (containing the variant calls).
The Resequencing workflow can only be used to analyze MiSeq sequencing results.
The Resequencing App Results Page provides four graphs, described below:
• Low Percentage Graph on page 61
• High Percentage Graph on page 62
• Clusters Graph on page 63
• Mismatch Graph on page 65
The Resequencing Sample Details Page provides five panes, described below:
• Samples Table on page 87
• Coverage Graph on page 89
• Q-Score Graph on page 89
• Variant Score Graph on page 89
• Variants Table on page 89
The graphs and variants table display data for the chromosome that is selected in the dropdown list.

Custom/PCR Amplicon
The Custom/PCR Amplicon workflow evaluates short regions of amplified DNA (amplicons) for variants. The focused sequencing of amplicons enables high-coverage sequencing of particular regions across a large number of samples.
The main output files generated by the Custom/PCR Amplicon workflow are .bam files (containing the aligned reads) and .vcf files (containing the variant calls).
The Custom/PCR Amplicon workflow supports multiple manifests (containing the probe regions) and consensus sequence reporting for multi-manifest runs.
The Custom/PCR Amplicon workflow can only be used to analyze MiSeq sequencing results.
The Custom/PCR Amplicon App Results Page provides four graphs, described below:
• Low Percentage Graph on page 61
• High Percentage Graph on page 62
• Clusters Graph on page 63
• Mismatch Graph on page 65
The PCR Amplicon Sample Details Page provides six panes, described below:
• Samples Table on page 87
• Amplicons Table on page 89
• Coverage Graph on page 89
• Q-Score Graph on page 89
• Variant Score Graph on page 89
• Variants Table on page 89
The graphs and variants table display data for the amplicon that is selected in the Amplicons Table.

Library QC
The Library QC workflow is intended for evaluating the abundance, fragment length, and sample quality of libraries. The analysis performed in the Library QC workflow is very similar to the Resequencing workflow. The Library QC workflow does not perform variant calling; instead, it provides a report of the characteristics of each sample.
The Library QC workflow can only be used to analyze MiSeq sequencing results.
The Library QC App Results Page provides four graphs, described below:
• Low Percentage Graph on page 61
• High Percentage Graph on page 62
• Clusters Graph on page 63
• Mismatch Graph on page 65
The Library QC Sample Details Page provides four panes, described below:
• Samples Table on page 87
• Coverage Graph on page 89
• Q-Score Graph on page 89
• Sample QC Table on page 91
The graphs display data for the chromosome that is selected in the dropdown list.

Small RNA Analysis
The Small RNA workflow measures the abundance of various types of short RNA sequences, particularly miRNA. It is suitable for identifying and quantifying miRNA expression and for comparing abundance across samples.
The Small RNA workflow can only be used to analyze MiSeq sequencing results.
The Small RNA Analysis App Results Page provides access to two graphs, described below:
• Clusters Graph on page 63
• Trimmed Lengths on page 66
The Small RNA Sample Details Page provides three panes, described below:
• Small RNA Samples Table on page 88
• Small RNA Pie Chart on page 90
• Small RNA Graph on page 90

Metagenomics Analysis
The Metagenomics workflow enables the analysis of 16S ribosomal RNA, a component of the 30S subunit of prokaryotic ribosomes. The 16S ribosomal sequences from an environmental sample can be analyzed to determine which organisms are present.
In MiSeq Reporter, a naïve Bayesian classifier (based on Wang et al., Appl Environ Microbiol (2007) Aug;73(16):5261-7) has been implemented that has been optimized for Illumina paired-end reads. Our 16S rRNA data store is populated by sequences in the May 2011 release of the GreenGenes 16S rRNA database.
The main output of this workflow is a classification of reads at several taxonomic levels (kingdom, phylum, class, order, family, genus).
The Metagenomics workflow can only be used to analyze MiSeq sequencing results.
The Metagenomics App Results Page provides one graph, described below:
• Clusters Graph on page 63
The Metagenomics Sample Details Page provides two panes, described below:
• Samples Table on page 87
• Metagenomics Pie Chart on page 90

De Novo Assembly Samples Page
The Assembly workflow enables de novo assembly of a draft genome directly from the sequencing reads. Because assembly relies upon significant coverage of the genome, this workflow is best suited for the assembly of small genomes (up to 5 to 10 MB). The assembly process is performed by the Velvet software (Velvet: algorithms for de novo short read assembly using de Bruijn graphs (2008) D.R. Zerbino and E. Birney. Genome Research 18:821–829).
The Assembly workflow can only be used to analyze MiSeq sequencing results.
The De Novo Assembly App Results Page provides access to three graphs, described below:
• Low Percentage Graph on page 61
• High Percentage Graph on page 62
• Clusters Graph on page 63
The De Novo Assembly Sample Details Page provides two panes, described below:
• De Novo Assembly Samples Table on page 88
• Samples Graph on page 91

Generate FASTQ
The Generate FASTQ app does not perform any analysis, but generates FASTQ files for download and shows basic summary data. The Generate FASTQ app can be used with all sequencing instruments that are supported by BaseSpace. For more information, see FASTQ Files on page 84.
Generate FASTQ is also used to analyze RNA-Seq samples from MiSeq.

Isaac
Alignment and variant calling in the Isaac app are performed with the Isaac Alignment Software and the Isaac Variant Caller.
The Isaac workflow generates output that consists of the realigned and duplicate-marked reads in BAM file format, variants in VCF file format, an additional Genome VCF (gVCF) file that has an entry for every base in the reference and differentiates reference calls from no calls, and a summary of the run quality.
The Isaac app is intended for use with HiSeq sequencing runs.
See Isaac App Results Page on page 91.

Isaac Aligner
The Isaac aligner¹ aligns DNA sequencing data, single or paired end, with read lengths and low error rates using the following steps:
• Candidate mapping positions: Identifies the complete set of relevant candidate mapping positions using a 32-mer seed-based search.
• Mapping selection: Selects the best mapping among all candidates.
• Alignment score: Determines alignment scores for the selected candidates based on a Bayesian model.
• Alignment output: Generates final output in a sorted duplicate-marked BAM file and summary file.
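The seed-based candidate search named in the first step can be pictured with a small sketch. This is a simplified illustration in Python, not the Isaac implementation. It assumes exact seed matches on the forward strand only, and it resolves ambiguous single-seed hits by counting how many seeds support each start position, which is a stand-in chosen for this example. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_seed_index(reference, k=32):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_positions(read, index, k=32):
    """Return candidate start positions of the read on the reference."""
    # Single-seed search: look up the first k bases of the read.
    first_hits = index.get(read[:k], [])
    if len(first_hits) == 1:
        return list(first_hits)
    # Multi-seed search for reads that were not mapped unambiguously:
    # look up non-overlapping seeds and count support per implied read start.
    support = Counter()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start >= 0:
                support[start] += 1
    if not support:
        return []
    best = max(support.values())
    return sorted(start for start, count in support.items() if count == best)
```

With a toy reference and k=4, a read whose first seed occurs twice is resolved by its second seed. With real data the default of 32 matches the seed length given above.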
[1] Come Raczy, Roman Petrovski, Christopher T. Saunders, Ilya Chorny, Semyon Kruglyak, Elliott H. Margulies, Han-Yu Chuang, Morten Källberg, Swathi A. Kumar, Arnold Liao, Kristina M. Little, Michael P. Strömberg and Stephen W. Tanner (2013) Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041-3. bioinformatics.oxfordjournals.org/content/early/2013/06/04/bioinformatics.btt314

Candidate Mapping
To align reads, the Isaac aligner first identifies a small but complete set of relevant candidate mapping positions. The Isaac aligner begins with a seed-based search using 32-mers as seeds. The initial single-seed search is followed by a multi-seed search for only those reads that were not mapped unambiguously with a single seed.

Mapping Selection
Following a seed-based search, the Isaac aligner selects the best mapping among all the candidates. For paired-end data sets, all mappings where only one end is aligned (called orphan mappings) trigger a local search to find additional mapping candidates. These candidates (called shadow mappings) are defined by the expected minimum and maximum insert size. After optional trimming of low quality 3' ends and adapter sequences, the possible mapping positions of each fragment are compared. This comparison takes into account paired-end information (when available), possible gaps using a banded Smith-Waterman gap aligner, and possible shadows. The selection is based on the Smith-Waterman score and on the log-probability of each mapping.

Alignment Scores
The alignment scores of each read pair are based on a Bayesian model, where the probability of each mapping is inferred from the base qualities and the positions of the mismatches. The final mapping quality is the alignment score, truncated to 60 if above 60, and possibly corrected for known ambiguities in the reference, as flagged in the seeds. Following alignment, reads are sorted. Further analysis is performed to identify duplicates and optionally to realign indels.

Alignment Output
After sorting the reads, the Isaac aligner generates compressed binary alignment output files, called BAM (*.bam) files, using the following process:
• Marking duplicates: Detection of duplicates is based on the location and observed length of each fragment. The Isaac aligner identifies and marks duplicates even when they appear on oversized fragments or chimeric fragments. Note that optical duplicates are already filtered out during RTA processing.
• Realigning indels: The Isaac aligner tracks previously detected indels, over a window large enough for the current read length, and applies the known indels to all reads with mismatches.
• Generating BAM files: The first step in BAM file generation is creation of the BAM record, which contains all required information except the name of the read. To generate the read names, the Isaac aligner reads data from the base call (BCL) files that were written during primary analysis on the sequencer. Data is then compressed into blocks of 64 KB or less to create the BAM file.

Isaac Variant Caller
The Isaac Variant Caller (the algorithm is also referred to as Starling [2]) identifies single nucleotide polymorphisms (SNPs) and small indels using the following steps:
• Read filtering: Filters out reads failing quality checks.
• Indel calling: Identifies a set of possible indel candidates and realigns all reads overlapping the candidates using a multiple sequence aligner.
• SNP calling: Computes the probability of each possible genotype given the aligned read data and a prior distribution of variation in the genome.
• Indel genotypes: Calls indel genotypes and assigns probabilities.
• Variant call output: Generates output in a compressed genome variant call (gVCF) file. See gVCF Files on page 78 for details.

Indel Candidates
Input reads are filtered by removing any of the following:
• Reads that failed primary analysis quality checks.
• Reads marked as PCR duplicates.
• Paired-end reads not marked as a proper pair.
• Reads with a mapping quality less than 20.

Indel Calling
The variant caller proceeds with candidate indel discovery and generates alternate read alignments based on the candidate indels. As part of the realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNP caller.

SNP Calling
The variant caller runs a series of filters on the set of filtered and realigned reads for SNP calling without affecting indel calls. First, any contiguous trailing sequence of N base calls is trimmed from the ends of reads. Using a mismatch density filter, reads having an unexpectedly high number of disagreements with the reference are masked, as follows:
• The variant caller treats each insertion or deletion as a single mismatch.
• Base calls with more than two mismatches to the reference sequence within 20 bases of the call are ignored.
• If the call occurs within the first or last 20 bases of a read, the mismatch limit is applied to a 41-base window at the corresponding end of the read.
• The mismatch limit is applied to the entire read when the read length is 41 or shorter.

Indel Genotypes
All bases marked by the mismatch density filter and any N base calls that remain after the end-trimming step are filtered out by the variant caller. These filtered base calls are not used for site-genotyping but appear in the filtered base call counts in the variant caller output for each site. All remaining base calls are used for site-genotyping.
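As an illustration only, the mismatch density filter described under SNP Calling can be sketched in Python as follows. The sketch assumes that each read has already been reduced to a per-position list of mismatch flags (with each insertion or deletion counted as a single mismatch) and that the window count includes the base call itself. The function name `mismatch_density_mask` and its parameters are hypothetical and are not part of the Isaac Variant Caller.

```python
from itertools import accumulate


def mismatch_density_mask(is_mismatch, flank=20, max_mismatches=2):
    """Return one flag per read position; True means the base call is masked.

    is_mismatch: list of booleans, one per read position (assumed input).
    flank: number of bases on each side of the call (window = 2 * flank + 1).
    max_mismatches: a call is masked when its window holds more than this.
    """
    n = len(is_mismatch)
    window = 2 * flank + 1  # 41 bases with the default flank of 20
    # prefix[i] holds the number of mismatches in positions [0, i).
    prefix = [0] + list(accumulate(int(m) for m in is_mismatch))

    masked = []
    for i in range(n):
        if n <= window:
            # Short read: the limit applies to the entire read.
            start, end = 0, n
        else:
            # Centre the window on the call, then clamp it so that calls
            # near either end use the 41-base window at that end.
            start = min(max(i - flank, 0), n - window)
            end = start + window
        count = prefix[end] - prefix[start]
        masked.append(count > max_mismatches)
    return masked
```

For a 100-base read with mismatches at positions 10, 15, and 30, this sketch masks every call whose window contains all three mismatches and leaves the remaining calls available for site-genotyping.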
To account for the possibility of error dependencies, the genotyping method heuristically adjusts the joint error probability that is calculated from multiple observations of the same allele on each strand of the genome. This method treats the highest quality base call from each allele and strand as an independent observation and leaves the associated base call quality scores unmodified. However, quality scores for subsequent base calls for each allele and strand are adjusted to increase the joint error probability of the given allele above the error expected from independent base call observations.

Variant Call Output
After the site and indel genotyping methods are complete, the variant caller applies a final set of heuristic filters to produce the final set of non-filtered calls in the output. The output in the genome variant call (gVCF) file captures the genotype at each position and the probability that the consensus call differs from reference, which is expressed as a Phred-scaled quality score.

Data Reference
This section provides the data references, and describes the files, charts, graphs, and tables listed below.
• Workflow Graphs on page 61
• Run Summary on page 66
• Indexing QC on page 68
• Charts on page 69
• BaseSpace Files on page 75
• Sample Details Page Components on page 87
• Isaac App Results Page on page 91

Workflow Graphs
The workflow graphs provide metrics that allow you to judge the success of the sequencing run for that sample. The following topics provide information about these charts.

Low Percentage Graph
What is it?
The Low Percentage Graph represents statistics of the run that are generally near zero in an ideal run. These are a subset of all metrics of the sequencing run itself.
When to use it.
Use the Low Percentage Graph to judge sequencing metrics for a sample. This graph should also be used when troubleshooting unexpected results.
When not to use it.
This graph is not a good predictor of yields or quality of final results.
How to use it
Metric: Description
Phasing 1: The percentage of molecules in a cluster that fall behind the current cycle within Read 1.
Phasing 2: The percentage of molecules in a cluster that fall behind the current cycle within Read 2.
PrePhasing 1: The percentage of molecules in a cluster that run ahead of the current cycle within Read 1.
PrePhasing 2: The percentage of molecules in a cluster that run ahead of the current cycle within Read 2.
Mismatch 1: The average percentage of mismatches for Read 1 over all cycles.
Mismatch 2: The average percentage of mismatches for Read 2 over all cycles.
You can expand a chart by clicking on the expand button.

High Percentage Graph
What is it?
The High Percentage Graph represents run statistics that are generally near 100% in an ideal run. These are metrics of the sequencing run or the analysis step.
When to use it.
Use the High Percentage Graph to judge sequencing metrics for a sample. This graph should also be used when troubleshooting unexpected results.
When not to use it.
Do not use the High Percentage Graph to look at tertiary analysis metrics.
How to use it
Metric: Description
I20/I1 1: The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 1.
I20/I1 2: The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 2.
Align 1: The percentage of clusters that aligned to the reference in Read 1.
Align 2: The percentage of clusters that aligned to the reference in Read 2.
PE Orientation: The percentage of paired-end alignments with the expected orientation.
PE Resynthesis: The ratio of first cycle intensities for Read 1 to first cycle intensities for Read 2.
PF: The percentage of clusters passing filters.
You can expand a chart by clicking on the expand button.

Clusters Graph
What is it?
The Clusters graph provides information about the number of clusters that are detected during sequencing, split out by the following groups:
• Total
• Passing filter
• Unaligned
• Unindexed
• Duplicates
When to use it.
Use the Clusters Graph to judge clustering success and relative cluster density between lanes (on HiSeq), and as a snapshot of the overall run. The graph can also assist with identifying overclustering issues.
When not to use it.
Do not use the Clusters Graph to look at tertiary analysis metrics.
How to use it
A cluster represents a clonal spot on the flow cell that contains the amplified DNA strands that will be sequenced.
x-axis: Description
Raw: The total number of clusters detected in the run.
PF: The total number of clusters passing filter in the run.
Unaligned: The total number of clusters passing filter that did not align to the reference genome, if applicable. Clusters that are unindexed are not included in the unaligned count.
Unindexed: The total number of clusters passing filter that were not associated with any index sequence in the run.
Duplicate: The total number of clusters for a paired-end sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read.
You can expand a chart by clicking on the expand button.

Mismatch Graph
What is it?
The Mismatch Graph plots the mismatches between a sequence read and a reference genome after alignment.
When to use it.
Use the Mismatch Graph to judge the quality of the sequencing run. Poor sequencing runs usually lead to high numbers of mismatches.
When not to use it.
• When you are using a reference genome that may have a lot of errors or low confidence stretches.
• When the sample and the reference are quite different.
• In de novo applications.
• In Methyl-seq applications.
How to use it
Mismatch refers to any mismatch between a sequence read and a reference genome after alignment.
• Cycle: Plots the % mismatches for all clusters in a run versus cycle.
Note that mismatches can be due to two main reasons:
• Sequencing errors (non-specific, random)
• Differences between your sample and the reference genome
Keep both causes in mind when interpreting the mismatch rates.
You can expand a chart by clicking on the expand button.

Trimmed Lengths
Y Axis: Clusters
X Axis: Trimmed Lengths
Description: Histogram of reads by the length at which they were trimmed because they reached the adapter.

Run Summary
What is it?
The Run Summary page displays tables with basic data quality metrics summarized per lane and per read. All the statistics are given as means and standard deviations over the tiles used in the lane.
When to use it.
Use the Run Summary page when looking at basic data quality metrics for a run from primary analysis.
When not to use it.
The tables do not contain information about samples or projects. The tables also do not contain app-generated information (secondary or tertiary analysis).
How to use it.
The following metrics are displayed in the top table, split out by read and total:
Level: The level or read of the run.
Cycles: The number of cycles in the level.
Yield Total: The number of bases sequenced. This value is updated as the run progresses.
Projected Total Yield: The projected number of bases expected to be sequenced at the end of the run.
Yield Perfect: The number of bases in reads that align perfectly, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available.
Yield <=3 errors: The number of bases in reads that align with 3 errors or less, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available, and will show a zero value.
Aligned: The percentage of the sample that aligned to the PhiX genome. This is determined for each level or read independently.
% Perfect [Num Usable Cycles]: The percentage of bases in reads that align perfectly, as determined by a spiked-in PhiX control sample, at the cycle indicated in the brackets. If no PhiX control sample is run in the lane, this chart will display 0% but will still show the number of cycles used.
% <=3 errors [Num Usable Cycles]: The percentage of bases in reads that align with 3 errors or less, as determined by a spiked-in PhiX control sample, at the indicated cycle. If no PhiX control sample is run in the lane, this chart will display 0% but will still show the number of cycles used.
Error Rate: The calculated error rate of the reads that aligned to PhiX.
Intensity Cycle 1: The average of the A channel intensity measured at the first cycle averaged over filtered clusters.
% Intensity Cycle 20: The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle. 100% x (Intensity at cycle 20)/(Intensity at cycle 1).
%Q>=30: The percentage of bases with a quality score of 30 or higher. This chart is generated after the 25th cycle, and the values represent the current cycle.

The following metrics are available in the Read tables, split out by lane:
Tiles: The number of tiles per lane.
Density: The density of clusters (in thousands per mm²) detected by image analysis, +/- one standard deviation.
Clusters PF: The percentage of clusters passing filtering, +/- one standard deviation.
Phas./Prephas.: The value used by RTA for the percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead (prephasing) the current cycle within a read.
Reads: The number of clusters (in millions).
Reads PF: The number of clusters (in millions) passing filtering.
%Q>=30: The percentage of bases with a quality score of 30 or higher. This chart is generated after the 25th cycle, and the values represent the current cycle.
Yield: The number of bases sequenced which passed filter.
Cycles Err Rated: The number of cycles that have been error rated with respect to PhiX starting at cycle 1.
Aligned: The percentage that aligned to the PhiX genome.
Error Rate: The calculated error rate, as determined by the PhiX alignment. Subsequent columns display the error rate for cycles 1–35, 1–75, and 1–100.
Intensity Cycle 1: The average of the A channel intensity measured at the first cycle averaged over filtered clusters.
%Intensity Cycle 20: The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle. 100% x (Intensity at cycle 20)/(Intensity at cycle 1).

Indexing QC
What is it?
The Indexing QC page lists count information for indices used in the run as designated in the sample sheet. Note that the Indexing QC page is only available if the run is an index run.
When to use it.
Look at this page when you want to see indexing information for a lane after the index read has completed.
When not to use it.
This page only provides indexing information. Do not use it for runs that were not indexed, or to look at other primary, secondary, or tertiary analysis metrics. The page shows a quick estimation that may vary slightly from the final output.
How to use it.
You can select the displayed lane through the dropdown list. The first table provides an overall summary of the indexing performance for that lane, including:
Total Reads: The total number of reads for this lane.
PF Reads: The total number of passing filter reads for this lane.
% Reads Identified (PF): The total fraction of passing filter reads assigned to an index.
CV: The coefficient of variation for the number of counts across all indices.
Min: The lowest representation for any index.
Max: The highest representation for any index.
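To make the summary columns concrete, the following Python sketch shows one way such values could be derived from per-index read counts. It is not the BaseSpace implementation: the function name `indexing_summary` is hypothetical, the coefficient of variation is assumed to use the population standard deviation, and Min and Max are assumed to be expressed as a percentage of passing filter reads.

```python
from statistics import mean, pstdev


def indexing_summary(pf_reads, counts_by_index):
    """Summarize indexing performance for one lane.

    pf_reads: total number of passing filter reads in the lane.
    counts_by_index: dict mapping an index sequence to the number of
        passing filter reads assigned to it (assumed input format).
    """
    counts = list(counts_by_index.values())
    return {
        # Share of passing filter reads assigned to any index.
        "pct_reads_identified": 100.0 * sum(counts) / pf_reads,
        # Coefficient of variation: standard deviation divided by mean
        # (population standard deviation is an assumption here).
        "cv": pstdev(counts) / mean(counts),
        # Lowest and highest representation of any single index,
        # as a percentage of passing filter reads (assumed unit).
        "min_pct": 100.0 * min(counts) / pf_reads,
        "max_pct": 100.0 * max(counts) / pf_reads,
    }
```

A low CV indicates that reads are spread evenly across indices, while a large gap between the minimum and maximum points to under- or over-represented samples.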
Further information is provided regarding the frequency of individual indices in both table and graph form. The table contains several columns, including:
Index Number: A unique number assigned to each index by BaseSpace for display purposes.
Sample ID: The sample ID assigned to an index in the sample sheet.
Project: The project assigned to an index in the sample sheet.
Index 1 (I7): The sequence for the first index read.
Index 2 (I5): The sequence for the second index read.
% Reads Identified (PF): The number of reads (only includes Passing Filter reads) mapped to this index.
This information is also displayed in graphical form. In the graphical display, indices are ordered according to the unique Index Number assigned by BaseSpace.

Charts
The Charts page displays five charts with run metrics. You can expand a chart by clicking on the expand button. The following topics provide information about these charts.

Flow Cell Chart
What is it?
The Flow Cell Chart displays color-coded graphical quality metrics per tile for the entire flow cell.
When to use it.
Use the Flow Cell Chart to judge local differences per cycle, per lane, or per read in sequencing metrics on a flow cell. It is also an easy way to see the %Q30 metric, which is an excellent single metric to judge a run.
When not to use it.
Do not use the Flow Cell Chart to look at secondary or tertiary analysis metrics.
How to use it.
The Flow Cell Chart has the following features:
• You can select the displayed metric, surface, cycle, and base through the dropdown lists.
• The color bar to the right of the chart indicates the values that the colors represent.
• The chart is displayed with tailored scaling by default.
• Tiles that have not been measured or are not monitored are gray.
You can monitor the following quality metrics with this chart:
Intensity: This chart displays the intensity by color and cycle of the 90th percentile of the data for each tile.
FWHM: The average full width of clusters at half maximum (in pixels). Used to display focus quality.
% Base: The percentage of clusters for which the selected base (A, C, T, or G) has been called.
%Q>20, %Q>30: The percentage of bases with a quality score of >20 or >30, respectively. These charts are generated after the 25th cycle, and the values represent the current scored cycle.
Median Q-Score: The median Q-Score for each tile over all bases for the current cycle. These charts are generated after the 25th cycle. This plot is best used to examine the Q-scores of your run as it progresses. Bear in mind that the %Q30 plot can give an oversimplified view due to its reliance on a single threshold.
Density: The density of clusters for each tile (in thousands per mm²).
Density PF: The density of clusters passing filter for each tile (in thousands per mm²).
Clusters: The number of clusters for each tile (in millions).
Clusters PF: The number of clusters passing filter for each tile (in millions).
Error Rate: The calculated error rate, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available.
% Phasing, % Prephasing: The estimated percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead (prephasing) the current cycle within a read.
% Aligned: The percentage of reads from clusters in each tile that aligned to the PhiX genome.
Perfect Reads: The percentage of reads that align perfectly, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is all gray.
Corrected Intensity: The intensity corrected for cross-talk between the color channels by the matrix estimation and phasing and prephasing.
Called Intensity: The intensity for the called base.
Signal to Noise: The signal to noise ratio is calculated as mean called intensity divided by standard deviation of non-called intensities.

Data By Cycle Plot
What is it?
The Data By Cycle Plot displays the progression of quality metrics during a run as a line graph.
When to use it.
Use the Data By Cycle Plot to judge the progression of quality metrics during a run on a cycle by cycle basis.
When not to use it.
Do not use the Data By Cycle Plot to look at secondary or tertiary analysis metrics, or aggregate analysis for a whole lane regardless of cycle.
How to use it.
The Data By Cycle Plot displays plots that allow you to follow the progression of quality metrics during a run. These plots have the following features:
• You can select the displayed metric and base through the dropdown lists.
• The symbol in the top right hand corner toggles the plot between pane view and full screen view.
You can monitor the following quality metrics with this plot:
Intensity: This chart displays the intensity by color and cycle of the 90th percentile of the data for each tile.
FWHM: The average full width of clusters at half maximum (in pixels). Used to display focus quality.
Note the variable scales used on these different parameters.
% Base: The percentage of clusters for which the selected base (A, C, T, or G) has been called.
%Q>20, %Q>30: The percentage of bases with a quality score of >20 or >30, respectively. These charts are generated after the 25th cycle, and the values represent the current scored cycle.
Median Q-Score: The median Q-Score for each tile over all bases for the current cycle. These charts are generated after the 25th cycle. This plot is best used to examine the Q-scores of your run as it progresses. Bear in mind that the %Q30 plot can give an oversimplified view due to its reliance on a single threshold.
Error Rate: The calculated error rate, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is not available.
Perfect Reads: The percentage of reads that align perfectly, as determined by a spiked-in PhiX control sample. If no PhiX control sample is run in the lane, this chart is all gray.
Corrected Intensity: The intensity corrected for cross-talk between the color channels by the matrix estimation and phasing and prephasing.
Called Intensity: The intensity for the called base.
Signal to Noise: The signal to noise ratio is calculated as mean called intensity divided by standard deviation of non-called intensities.
You can expand a chart by clicking on the expand button.

QScore Distribution
What is it?
The QScore Distribution Plot displays a bar graph that allows you to view the number of bases by quality score. The quality score is cumulative for current cycle and previous cycles, and only bases from reads that pass the quality filter are included.
When to use it.
Use the QScore Distribution Plot to judge the QScore distribution for a run, which is an excellent indicator for run performance.
When not to use it.
Do not use the QScore Distribution Plot to look at secondary or tertiary analysis metrics, or metrics other than quality scores.
How to use it.
The QScore Distribution pane displays plots that allow you to view the number of reads by quality score. The quality score is cumulative for current cycle and previous cycles, and only reads that pass the quality filter are included. These plots have the following features:
• You can select the displayed read and cycle through the dropdown lists.
• The symbol in the top right hand corner toggles the plot between pane view and full screen view.
Note that the QScore is based on the Phred scale.
The following list displays Q-scores and the corresponding chance that the base call is wrong:
• Q10: 10% chance of wrong base call
• Q20: 1% chance of wrong base call
• Q30: 0.1% chance of wrong base call
• Q40: 0.01% chance of wrong base call
You can slide the threshold (set at >=Q30 by default) to examine the proportion of bases at or above any particular Q-score. Note that when Q-score binning is used, this plot reflects the subset of Q-scores used.

Data by Lane Plot
What is it?
The Data by Lane Plot displays plots that allow you to view quality metrics per lane.
When to use it.
Use the Data By Lane Plot to judge the difference in quality metrics between lanes.
When not to use it.
Do not use the Data By Lane Plot to look at secondary or tertiary analysis metrics.
How to use it.
The Data by Lane plots have the following features:
• You can select the displayed metric through the dropdown lists.
• The symbol in the top right hand corner toggles the plot between pane view and full screen view.
The plots share a number of characteristics.
• The plots show the distribution of mean values for a given parameter across all tiles in a given lane.
• The red line indicates the median tile value for the parameter displayed.
• Blue boxes are for raw clusters, green boxes for clusters passing filter.
• The box outlines the interquartile range (the middle 50% of the data) for the tiles analyzed for the data point.
• The error bars delineate the minimum and maximum without outliers.
• The outliers are the values that are more than 1.5 times the interquartile range below the 25th percentile, or more than 1.5 times the interquartile range above the 75th percentile. Outliers are indicated as dots.
You can monitor the following quality metrics with this plot:
• The density of clusters for each tile (in thousands per mm²).
• The number of clusters for each tile (in millions).
• The estimated percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead (prephasing) the current cycle within a read.
• The percentage of reads from clusters in each tile that aligned to the PhiX genome.
You can expand a chart by clicking on the expand button.

QScore Heatmap
What is it?
A heatmap of the Q-scores.
When to use it.
Use the QScore Heatmap for a quick overview of the Q-scores over the cycles.
When not to use it.
Do not use the QScore Heatmap to look at secondary or tertiary analysis metrics, or metrics other than quality scores.
How to use it.
The QScore Heatmap displays plots that allow you to view the QScore by cycle. These plots have the following features:
• The color bars to the right of each chart indicate the values that the colors represent. The charts are displayed with tailored scaling; the scale is always 0 to 100% of maximum value.
• The symbol in the top right hand corner toggles the plot between pane view and full screen view.
You can expand a chart by clicking on the expand button.

BaseSpace Files
BaseSpace uses and produces a variety of files. See the topics in this section for details.

Sample Sheet
What is it?
The sample sheet is a comma-delimited file (SampleSheet.csv) that stores the information needed to set up and analyze a sequencing experiment. The file includes a list of samples and their index sequences, as well as the workflow to be employed by BaseSpace.
When to use it.
Every run in BaseSpace needs to have an associated sample sheet in order to define projects and samples, assign indices, and run sample sheet driven workflow apps.
When not to use it.
Not applicable; you will always employ a sample sheet with BaseSpace.
How to use it
The following table is for reference purposes only.
For details about creating or modifying a sample sheet, see the MiSeq Reporter User Guide, MiSeq Sample Sheet Quick Reference Guide, or HiSeq User Guide. You can create a sample sheet using the Illumina Experiment Manager Software.

Table 1 Sample Sheet Fields
Row: Description
Investigator Name: (Optional) The name of the investigator.
Project Name: (Optional) A descriptive name of the run.
Experiment Name: (Optional) A descriptive name of the experiment.
Date: The date the sequencing run was performed.
Workflow: The analysis workflow for the run.
Manifests: This section is only used by the Amplicon workflow and is the name of the file (provided by Illumina or created by IEM) used in the Amplicon workflow. It is required for the Amplicon workflow and ignored by other workflows. The file specifies the alignments to a reference and the targeted reference regions used in the Amplicon workflow.
Site Reports: This section is optional and used by only the Resequencing and Custom Amplicon workflows. Each line below the SiteReports section header is the name of a SiteReport Input File. This file designates positions on a given chromosome to report the genotype found at that position.
Data:
• Contaminants – The path to the folder containing FASTA files of contaminants (used only for SmallRNA)
• GenomePath – The reference genome folder containing the FASTA files to be used in the alignment step
• Index – Represents the sequence string of a sample's first index. Valid characters in this string are A, C, G, T, and N. 'N' matches any base.
• Index2 – Represents the sequence string of this sample's second index. Valid characters in this string are A, C, G, T, and N. 'N' matches any base.
• MiRNA – The path to the folder containing FASTA files of mature miRNAs (used only for SmallRNA)
• RNA – The path to the folder containing FASTA files of small RNAs (used only for SmallRNA)
• SampleID – A string identifier for the sample.
This is usually a bar code but can have any value. Use letters and numbers only; some special characters can be detrimental for file creation.
• Manifest – The manifest file letter as designated by the manifest field.
• Name – A string identifier for the sample. This is used in the reporting web page.

BAM Files
What is it?
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mb) produced by different sequencing platforms. SAM is a human-readable text format. The Binary Alignment/Map (BAM) format keeps exactly the same information as SAM, but in a compressed, binary format that is only machine-readable.
When to use it.
BAM files allow you to see alignments. Use them for direct interpretation or as a starting point for tertiary analysis with downstream analysis tools that are compatible with BAM. BAM files are suitable for viewing with an external viewer such as IGV or the UCSC genome browser.
When not to use it.
Do not use BAM files with tools that are not compatible with the BAM format, or with applications that cannot handle large files, as BAM files can get big, depending on the application and data.
How to use it
If you use an app in BaseSpace that uses BAM files as input, the app will locate the file when launched. If using BAM files in other tools, download the file to use it in the external tool.
Detailed Description
Go to http://samtools.sourceforge.net/SAM1.pdf to see the exact SAM specification.

gVCF Files
What is it?
This application also produces the genome Variant Call Format file (gVCF). gVCF was developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project.
These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted sequencing, gVCF is also an appropriate choice to represent and compress the results.

gVCF is a text file format, stored as a gzip-compressed file (*.genome.vcf.gz). Further compression is achieved by joining contiguous non-variant regions with similar properties into single "block" VCF records. To maximize the utility of gVCF, especially for high-stringency applications, the properties of the compressed blocks are conservative: block properties such as depth and genotype quality reflect the minimum of any site in the block. The gVCF file can be indexed (creating a .tbi file) and used with existing VCF tools such as tabix and IGV, making it convenient both for direct interpretation and as a starting point for tertiary analysis. For more information, see https://sites.google.com/site/gvcftools/home/about-gvcf.

When to use it.
Use it for direct interpretation or as a starting point for tertiary analysis with downstream tools that are compatible with gVCF, such as tabix and IGV.

When not to use it.
Do not use it with tools that are not compatible with the gVCF format.

How to use it
Apps that use gVCF files locate the file when they are launched and directed to the sample. To use gVCF files in other tools, download the file first.

Detailed Description
The following conventions are used in the variant caller gVCF files.

Samples per File
There is only one sample per gVCF file.

Non-Variant Blocks Using END Key
Contiguous non-variant segments of the genome can be represented as single records in gVCF. These records use the standard 'END' INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field to reduce file size. The following is a simplified segment of a gVCF file, describing a segment of non-variant calls (starting with an A) on chromosome 1 from position 51845 to 51862.

    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238
    chr1 51845 . A . . PASS END=51862

Any field provided for a block of sites, such as read depth (using the DP key), shows the minimum value observed among all sites encompassed by the block. Each sample value shown for the block is restricted to a range where the maximum value is within 30% or 3 of the minimum; that is, for sample value range [x,y], y <= x+max(3,x*0.3). This range restriction applies to each of the sample values printed in the final block record.

Indel Regions
Sites that are "filled in" inside of deletions have additional changes:
All deletions:
• Sites inside of any deletion are marked with the deletion's filters, in addition to any filters that have already been applied to the site.
• Sites inside of deletions cannot have a genotype or alternate allele quality score higher than the corresponding value from the enclosing indel.
Heterozygous deletions:
• Sites inside of heterozygous deletions are altered to have haploid genotype entries (e.g. "0" instead of "0/0", "1" instead of "1/1").
• Heterozygous SNV calls inside of heterozygous deletions are marked with the "SiteConflict" filter and their genotype is unchanged.
Homozygous deletions:
• Homozygous reference and no-call sites inside of homozygous deletions have genotype ".".
• Sites inside of homozygous deletions that have a non-reference genotype are marked with a "SiteConflict" filter, and their genotype is unchanged.
• Site and genotype quality are set to ".".

The above modifications reflect the notion that the site confidence is bound by the enclosing indel confidence. On occasion, the variant caller produces multiple overlapping indel calls that cannot be resolved into two haplotypes. If this occurs, all indels and sites in the region of the overlap are marked with the "IndelConflict" filter (see below).

Genotype Quality for Variant and Non-variant Sites
The gVCF file uses an adapted version of genotype quality for variant and non-variant site filtration. This value is associated with the key GQX. The GQX value is intended to represent the minimum of {Phred genotype quality assuming the site is variant, Phred genotype quality assuming the site is non-variant}. The reason for using this is to allow a single value to be used as the primary quality filter for both variant and non-variant sites. Filtering on this value corresponds to a conservative assumption appropriate for applications where reference genotype calls must be determined at the same stringency as variant genotypes, i.e.:
• An assertion that a site is homozygous reference at GQX >= 30 is made assuming the site is variant.
• An assertion that a site is a non-reference genotype at GQX >= 30 is made assuming the site is non-variant.

Section Descriptions
The gVCF file contains the following sections:
• Meta-information lines start with ## and contain meta-data, config information, and define the values that the INFO, FILTER and FORMAT fields can have.
• The header line starts with # and names the fields that the data lines use. These are #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, followed by one or more sample columns.
• Data lines contain information about one or more positions in the genome.
Note that if you extract the variant lines from a gVCF file, you produce a conventional variant VCF file.
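The note about extracting variant lines can be illustrated with a minimal Python sketch. It assumes that a variant record is any tab-separated data line whose ALT column is not the missing value "."; that rule, and the names extract_variant_lines and extract_from_gvcf, are assumptions made for illustration and are not part of BaseSpace or gvcftools.

```python
import gzip
from typing import Iterable, Iterator, List


def extract_variant_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and variant records from gVCF text lines.

    Assumption: a record is a variant when its ALT column (the fifth
    tab-separated field) is not the missing value ".".
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information (##) and header (#) lines are kept as-is.
            yield line
            continue
        fields = line.split("\t")
        if len(fields) > 4 and fields[4] != ".":
            yield line


def extract_from_gvcf(path: str) -> List[str]:
    """Read a gzip-compressed gVCF (*.genome.vcf.gz) and keep variant lines."""
    with gzip.open(path, "rt") as handle:
        return list(extract_variant_lines(handle))


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA19238",
        "chr1\t51845\t.\tA\t.\t.\tPASS\tEND=51862\tGT:GQX\t0/0:30",
        "chr1\t51863\t.\tT\tC\t50\tPASS\t.\tGT:GQX\t0/1:50",
    ]
    for kept in extract_variant_lines(example):
        print(kept)
```

In this sketch the non-variant block record (ALT of ".") is dropped, while the header lines and the T-to-C record are kept.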
Field Descriptions
The fixed fields #CHROM, POS, ID, REF, ALT, QUAL are defined in the VCF 4.1 standard provided by the 1000 Genomes Project, while the fields FILTER, INFO, FORMAT, and sample are described in the meta-information. Descriptions are provided below.
• CHROM: Chromosome: an identifier from the reference genome or an angle-bracketed ID String ("<ID>") pointing to a contig.
• POS: Position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. There can be multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig.
• ID: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant, it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used.
• REF: Reference base(s): A,C,G,T,N; there can be multiple bases. The value in the POS field refers to the position of the first base in the string. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings include the base before the event (which is reflected in the POS field), unless the event occurs at position 1 on the contig, in which case they include the base after the event. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>"), then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
• ALT: Comma-separated list of alternate non-reference alleles called on at least one of the samples. Options are:
    • Base strings made up of the bases A,C,G,T,N
    • Angle-bracketed ID String ("<ID>")
    • Breakend replacement string as described in the section on breakends.
  If there are no alternative alleles, then the missing value should be used.
• QUAL: Phred-scaled quality score for the assertion made in ALT, i.e. -10 log10 P(call in ALT is wrong). If ALT is "." (no variant), this is -10 log10 P(variant); if ALT is not ".", this is -10 log10 P(no variant). High QUAL scores indicate high-confidence calls. Although integer Phred scores are traditional, this field is permitted to be a floating point value to enable higher resolution for low-confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
• FILTER: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. gVCF files use the following values:
    • PASS: Position has passed all filters.
    • IndelConflict: Locus is in region with conflicting indel calls.
    • SiteConflict: Site genotype conflicts with proximal indel call. This is typically a heterozygous SNV call made inside of a heterozygous deletion.
    • LowGQX: Locus GQX (minimum of {Genotype quality assuming variant position, Genotype quality assuming non-variant position}) is less than 30 or not present.
    • HighDPFRatio: The fraction of basecalls filtered out at a site is greater than 0.3.
    • HighSNVSB: SNV strand bias value (SNVSB) exceeds 10. High strand bias indicates a potential high false-positive rate for SNVs.
    • HighSNVHPOL: SNV contextual homopolymer length (SNVHPOL) exceeds 6.
    • HighREFREP: Indel contains an allele that occurs in a homopolymer or dinucleotide tract with a reference repeat greater than 8.
    • HighDepth: Locus depth is greater than 3x the mean chromosome depth.
• INFO: Additional information. INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. gVCF files use the following values:
    • END: End position of the region described in this record.
    • BLOCKAVG_min30p3a: Non-variant site block. All sites in a block are constrained to be non-variant, have the same filter value, and have all sample values in range [x,y], y <= max(x+3,(x*1.3)). All printed site block sample values are the minimum observed in the region spanned by the block.
    • SNVSB: SNV site strand bias.
    • SNVHPOL: SNV contextual homopolymer length.
    • CIGAR: CIGAR alignment for each alternate indel allele.
    • RU: Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs are not reported if longer than 20 bases.
    • REFREP: Number of times RU is repeated in reference.
    • IDREP: Number of times RU is repeated in indel allele.
• FORMAT: Format of the sample field. FORMAT specifies the data types and order of the subfields. gVCF files use the following values:
    • GT: Genotype.
    • GQ: Genotype Quality.
    • GQX: Minimum of {Genotype quality assuming variant position, Genotype quality assuming non-variant position}.
    • DP: Filtered basecall depth used for site genotyping.
    • DPF: Basecalls filtered from input before site genotyping.
    • AD: Allelic depths for the ref and alt alleles in the order listed. For indels, this value only includes reads that confidently support each allele (posterior probability 0.999 or higher that the read contains the indicated allele vs all other intersecting indel alleles).
    • DPI: Read depth associated with indel, taken from the site preceding the indel.
• SAMPLE: Sample fields as defined by the header.

VCF Files

What is it?
VCF is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and then data lines. Each data line contains information about a single variant.

When to use it.
Use it for direct interpretation or as a starting point for tertiary analysis with downstream tools that are compatible with VCF, such as IGV or the UCSC genome browser.

When not to use it.
Do not use it with tools that are not compatible with the VCF format.

NOTE
Windows recognizes .vcf files as Outlook contact files. Do not open VCF files in Outlook.

How to use it
If you use an app in BaseSpace that takes VCF files as input, the app locates the file when launched. To use VCF files in other tools, download the file first.

Detailed Description
The file naming convention for VCF files is as follows: SampleName_S#.vcf (where # is the sample number determined by ordering in the sample sheet). The header of the VCF file describes the tags used in the remainder of the file and has the column header:

    ##fileformat=VCFv4.1
    ##fileDate=20120317
    ##source=SequenceAnalysisReport.vshost.exe
    ##reference=
    ##phasing=none
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=TI,Number=.,Type=String,Description="Transcript ID">
    ##INFO=<ID=GI,Number=.,Type=String,Description="Gene ID">
    ##INFO=<ID=CD,Number=0,Type=Flag,Description="Coding Region">
    ##FILTER=<ID=q20,Description="Quality below 20">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE

A sample line of the VCF file is shown below. The data that is used to populate each column is also described:

    chr22 16285888 rs76548004 T C 17 d15;q20 DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17

Setting – Description
• ALT – The allele(s) that differ from the reference read. For example, an insertion of a single T could be represented by reference A and alternate AT.
• CHROM – The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file (generally karyotype order).
• FILTER – If all filters are passed, 'PASS' is written. The possible filters are as follows:
    • q20 – The variant score is less than 20. (Configurable using the VariantFilterQualityCutoff setting in the config file)
    • r8 – For an indel, the number of repeats in the reference (of a 1- or 2-base repeat) is greater than 8. (Configurable using the IndelRepeatFilterCutoff setting in the config file)
• FORMAT – The format column lists fields (separated by colons), for example, "GT:GQ". The list of fields provided depends on the variant caller used. The available fields are as follows:
    • AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls.
    • GQ – Genotype quality.
    • GT – Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, 2 corresponds to the second entry in the ALT column, etc. The '/' indicates that there is no phasing information.
    • NL – Noise level; an estimate of base calling noise at this position.
    • SB – Strand bias at this position. Larger negative values indicate more bias; values near zero indicate little strand bias.
    • VF – Variant frequency. The percentage of reads supporting the alternate allele.
• ID – The rs number for the SNP obtained from dbSNP. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, the missing value ('.') is used.
• INFO – These are the possible entries in the INFO column:
    • AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls.
    • CD – A flag indicating that the SNP occurs within the coding region of at least one refGene entry.
    • DP – The depth (number of base calls aligned to this position).
    • GI – A comma-separated list of gene IDs read from refGene.
    • NL – Noise level; an estimate of base calling noise at this position.
    • TI – A comma-separated list of transcript IDs read from refGene.
    • SB – Strand bias at this position.
    • VF – Variant frequency. The number of reads supporting the alternate allele.
• POS – The 1-based position of this variant in the reference chromosome. The convention for .vcf files is that, for SNPs, this is the reference base with the variant; for insertions or deletions, this is the reference base immediately before the variant. Variants are ordered by position.
• QUAL – A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls should have a 0.1% error rate. Note that many variant callers assign quality scores (based on their statistical models) that are high relative to the error rate observed in practice.
• REF – The reference genotype. For example, a deletion of a single T could be represented by reference TT and alternate T.
• SAMPLE – The sample column gives the values specified in the FORMAT column. One MAXGT sample column is provided for the normal genotyping (assuming the reference). For reference, a second column is provided for genotyping assuming the site is polymorphic. See the Starling documentation for more details.

NOTE
Variant files for Isaac also contain off-target variant calls, with filter.

FASTQ Files

What is it?
BaseSpace converts *.bcl files into FASTQ files, which contain base call and quality information for all reads passing filtering.

When to use it.
FASTQ files can be used as sequence input for alignment and other secondary analysis software.

When not to use it.
Do not use it with tools that are not compatible with the FASTQ format.

How to use it
BaseSpace automatically generates FASTQ files in sample sheet-driven workflow apps.
Other apps that perform alignment and variant calling also automatically use FASTQ files. A detailed description of the FASTQ format is provided below.

Naming
FASTQ files are named with the sample name and the sample number, which is a numeric assignment based on the order in which the sample is listed in the sample sheet. For example:

    Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz

• samplename – The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID, which is a required field in the sample sheet and must be unique.
• S1 – The sample number based on the order that samples are listed in the sample sheet, starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.
  NOTE
  Reads that cannot be assigned to any sample are written to a FASTQ file for sample number 0, and excluded from downstream analysis.
• L001 – The lane number.
• R1 – The read. In this example, R1 means Read 1. For a paired-end run, there is at least one file with R2 in the file name for Read 2.
• 001 – The last segment is always 001.

Compression
FASTQ files are saved compressed in the GNU zip format (an open source file compression program), indicated by the .gz file extension.
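The naming convention above can be sketched as a small Python parser. This is an illustrative sketch only: the function name parse_fastq_name and the returned dictionary keys are hypothetical, and the pattern covers just the elements described in this section (sample name, S number, lane, read, and final segment).

```python
import re
from pathlib import PureWindowsPath
from typing import Dict, Union

# <samplename>_S<n>_L<lane>_R<read>_<segment>.fastq.gz, as described above.
# The sample name is matched greedily so that it may itself contain underscores.
_FASTQ_NAME = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane>\d+)_R(?P<read>\d+)_(?P<segment>\d+)\.fastq\.gz$"
)


def parse_fastq_name(path: str) -> Dict[str, Union[str, int]]:
    """Split a FASTQ file name into its naming elements."""
    # PureWindowsPath accepts both backslash and forward-slash separators.
    name = PureWindowsPath(path).name
    match = _FASTQ_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized FASTQ file name: {name}")
    return {
        "sample_name": match["sample_name"],
        "sample_number": int(match["sample_number"]),  # 0 = unassigned reads
        "lane": int(match["lane"]),
        "read": int(match["read"]),
        "segment": match["segment"],
    }


if __name__ == "__main__":
    print(parse_fastq_name(
        r"Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz"
    ))
```

A sample number of 0 in the result corresponds to the file of reads that could not be assigned to any sample.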
Format
Each entry in a FASTQ file consists of four lines:
• Sequence identifier
• Sequence
• Quality score identifier line (consisting only of a +)
• Quality score

Each sequence identifier, the line that precedes the sequence and describes it, is in the following format:

    @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x_pos>:<y_pos> <read>:<is filtered>:<control number>:<sample number>

The following table describes the elements:

Element            Requirements                                        Description
@                  @                                                   Each sequence identifier line starts with @
<instrument>       Characters allowed: a–z, A–Z, 0–9 and underscore    Instrument ID
<run number>       Numerical                                           Run number on instrument
<flowcell ID>      Characters allowed: a–z, A–Z, 0–9
<lane>             Numerical                                           Lane number
<tile>             Numerical                                           Tile number
<x_pos>            Numerical                                           X coordinate of cluster
<y_pos>            Numerical                                           Y coordinate of cluster
<read>             Numerical                                           Read number: 1 for a single read or Read 1 of a paired-end run, 2 for Read 2 of a paired-end run
<is filtered>      Y or N                                              Y if the read is filtered (did not pass), N otherwise
<control number>   Numerical                                           0 when none of the control bits are on, otherwise it is an even number
<sample number>    Numerical                                           Sample number from the sample sheet

An example of a valid entry is as follows; note the space preceding the read number element:

    @SIM:1:FCX:1:15:6329:1045 1:N:0:2
    TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
    +
    <>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

Control Values
If the read is not identified as a control, then the tenth column (<control number>) is zero. If the read is identified as a control, the number is greater than zero, and the value specifies what type of control it is. The value is the decimal representation of a bit-wise encoding scheme. In that scheme, bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on. The bits are used as follows:
• Bit 0: always empty (0)
• Bit 1: was the read identified as a control?
• Bit 2: was the match ambiguous?
• Bit 3: did the read match the phiX tag?
• Bit 4: did the read align to match the phiX tag?
• Bit 5: did the read match the control index sequence?
• Bits 6, 7: reserved for future use
• Bits 8–15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)

Quality Scores
A quality score (or Q-score) expresses an error probability. In particular, it serves as a convenient and compact way to communicate very small error probabilities. Given an assertion, A, the quality score, Q(A), expresses the probability that A is not true, P(~A), according to the relationship:

    Q(A) = -10 log10(P(~A))

where P(~A) is the estimated probability of an assertion A being wrong. The relationship between the quality score and error probability is demonstrated with the following table:

Quality score, Q(A)    Error probability, P(~A)
10                     0.1
20                     0.01
30                     0.001

Quality Scores Encoding
In FASTQ files, quality scores are encoded into a compact form, which uses only 1 byte per quality value. In this encoding, the quality score is represented as the character with an ASCII code equal to its value + 33. The following table demonstrates the relationship between the encoding character, its ASCII code, and the quality score represented.

Table 2 ASCII Characters Encoding Q-scores 0–40

Symbol  ASCII Code  Q-Score    Symbol  ASCII Code  Q-Score    Symbol  ASCII Code  Q-Score
!       33          0          /       47          14         =       61          28
"       34          1          0       48          15         >       62          29
#       35          2          1       49          16         ?       63          30
$       36          3          2       50          17         @       64          31
%       37          4          3       51          18         A       65          32
&       38          5          4       52          19         B       66          33
'       39          6          5       53          20         C       67          34
(       40          7          6       54          21         D       68          35
)       41          8          7       55          22         E       69          36
*       42          9          8       56          23         F       70          37
+       43          10         9       57          24         G       71          38
,       44          11         :       58          25         H       72          39
-       45          12         ;       59          26         I       73          40
.       46          13         <       60          27

Health Runs

What is it?
A user can choose whether or not to send anonymous system health information to Illumina. Health runs help Illumina diagnose issues and improve our products.
The information consists of interop files and log files, and is not tied to any user account. This option is on by default.

Sample Details Page Components
The Sample Details Page shows metrics for a sample that are generated by the app that ran the analysis. Different panes are displayed on this page depending on the app; for descriptions, see the topics below.

Samples Table
The samples table contains general analysis information for the sample. Depending on the workflow, the following metrics can be shown:
Column – Description
• Sample Name – The sample name from the sample sheet.
• Sample ID – The sample ID from the sample sheet. Sample ID must always be a unique value.
• Genome – The name of the reference genome.
• Chr – The reference target or chromosome name.
• Cluster PF – The number of clusters passing filter for the sample that aligned to the reference genome.
• Mismatch – The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
• No Call – The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
• Coverage – Median coverage (number of bases aligned to a given reference position) averaged over all positions.
• Het SNPs – The number of heterozygous SNPs detected for the sample.
• Hom SNPs – The number of homozygous SNPs detected for the sample.
• Insertions – The number of insertions detected for the sample.
• Deletions – The number of deletions detected for the sample.

The workflow apps Small RNA and De Novo Assembly have custom samples tables, which are described here:
• Small RNA Samples Table
• De Novo Assembly Samples Table

Small RNA Samples Table
Column – Description
• Sample Name – The sample name from the sample sheet.
• Cluster Raw – The number of raw clusters detected for the sample.
• Cluster PF – The number of clusters passing filter for the sample.
• Cluster Align Contam – The number of clusters that match records in the Contaminants database.
• Cluster Align miRNA – The number of clusters that exactly match records in the Mature miRNA database.
• Cluster Align RNA – The number of clusters that match records in the RNA database.
• Cluster Align Genome – The number of clusters that match records in the genomic database.
• Cluster Unaligned – The number of clusters that did not align against any reference database.

De Novo Assembly Samples Table
Column – Description
• Sample Name – The sample name from the sample sheet.
• Num Contigs – The number of contigs assembled for this sample.
• Mean Contig Length – The average contig length for this sample.
• Median Contig Length – The median contig length for this sample.
• Min Contig Length – The minimum contig length for this sample.
• Max Contig Length – The maximum contig length for this sample.
• Base Count – The total length of the resulting assembly.
• N50 – The length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.

Amplicons Table
Column – Description
• # – An ordinal identification number in the table.
• Amplicons – The amplicon name.
• Location – The position at which the variant was found.
• Variants # – The number of variants for this amplicon.

Coverage Graph
• Y Axis – Coverage
• X Axis – Position
• Description – The green curve is the number of aligned reads that cover each position in the reference. The red curve is the number of aligned reads that have a miscall at this position in the reference. SNPs and other variants show up as spikes in the red curve.

Q-Score Graph
• Y Axis – QScore
• X Axis – Position
• Description – The average quality score of bases at the given position of the reference.

Variant Score Graph
• Y Axis – Score
• X Axis – Position
• Description – Graphically depicts quality score and the position of SNPs and indels.
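As an aside on the N50 metric defined in the De Novo Assembly Samples Table above, the definition can be written as a short Python sketch. The function name n50 is chosen here for illustration; the code follows the wording of the definition and makes no claim about how BaseSpace computes the value internally.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches at least 50% of the
    total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length
    return lengths[-1]  # only reachable if every length is negative


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

For contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs sum to 11, which is at least half of the total, so the N50 in this sketch is 5.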
Variants Table
The variants table shows variants for your sample per chromosome or amplicon.
Column – Description
• # – An ordinal identification number in the table.
• Location – The position at which the variant was found.
• Score – The quality score for this variant.
• Type – The variant type, which can be either SNP or indel.
• Call – A string representing how the base or bases changed at this location in the reference.
• dbSNP – The dbSNP name of the variant, if applicable.
• RefGene – The gene according to RefGene in which this variant appears.
• Frequency – The fraction of reads for the sample that includes the variant. For example, if the reference base is A, and sample 1 has 60 A reads and 40 T reads, then the SNP has a variant frequency of 0.4.
• Depth – The number of reads for a sample covering a particular position. The GATK variant caller subsamples data in regions of high coverage.
• Filter – The criteria for a filtered variant.

Small RNA Pie Chart
The Small RNA pie chart provides a visualization of clusters identified as mature miRNA, other forms of RNA, genomic sequence, or contaminants.

Figure 13 Small RNA Pie Chart

Common categories for the Small RNA pie chart are as follows:
• Unaligned – clusters that did not align against any reference
• Genome – clusters that aligned to the reference genome
• miRNA – clusters that aligned to the mature miRNA database
Hits to the mature miRNA database are counted only if the cluster aligned to the correct strand and position for the mature miRNA. The remaining category names in the Small RNA pie chart are taken from the FASTA file names in the databases. For example, if the RNA database contains a file named rRNA.fa, then matches to this file are reported as the category rRNA.

Small RNA Graph
The Small RNA graph provides a plot of the common mature miRNA sequences for a sample and their abundances.
The most common miRNA sequences for the selected sample (up to ten records) are shown in proportion to the number of clusters matched.

Metagenomics Pie Chart
The Metagenomics pie chart provides a visualization of how many clusters from each sample were assigned to a category in each taxonomic level. Click another row in the taxonomy table to change the pie chart to that sample or taxonomic level.

Samples Graph
Contigs are arranged end-to-end along the X axis and the reference chromosomes are arranged bottom-to-top along the Y axis. Each pixel of the plot is colored according to how many short sequences of the corresponding contig have a match in the corresponding portion of the reference genome. An identical assembly results in a diagonal line. A vertical gap in the plot might indicate a portion of the reference that is absent in the assembly, such as a plasmid, which is found in some bacteria populations.
• Y Axis – Reference
• X Axis – Assembly Position
• Description – A syntenic plot of assembled contigs compared to a reference. A reference genome must be specified in the sample sheet.

Sample QC Table
Column – Description
• Sample Name – The sample name from the sample sheet.
• Clusters Count – The number of clusters sequenced for this sample.
• Clusters Percentage – The percentage of the total cluster number matching the index for this sample.
• Pass Filter – The percentage of clusters passing filter for this sample.
• Alignment R1/R2 – The percentage of clusters successfully aligned in Read 1/Read 2.
• Length Median – The median fragment length for the sample.
• Length Min – The low percentile of fragment lengths for this sample as they correspond to three standard deviations from the median.
• Length Max – The high percentile of fragment lengths for this sample as they correspond to three standard deviations from the median.
• Mismatch R1/R2 – The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
• Estimated Diversity – An estimate of the total library diversity derived from the observed diversity and the number of apparent PCR duplicates. This calculation is available for paired-end runs unless PCR duplicate flagging was disabled in the sample sheet.
• Observed Diversity – Number of distinct aligned positions. Reads with the same aligned positions are assumed to be PCR duplicates. PCR duplicates are defined as sequences with identical Read 1 and Read 2 start sites.

Isaac App Results Page
The Isaac App Results Page consists of three panes, which are described in the topics below.

Isaac Alignment Statistics
Isaac Alignment Statistics display alignment information for the sample.
Column – Description
• Number of Reads – The number of reads sequenced for this sample.
• Coverage – Median coverage (number of bases aligned to a given reference position) averaged over all positions.
• Fragment Length Median – The median fragment length for the sample.
• Fragment Length Standard Deviation – The standard deviation of the fragment length for the sample.
• Aligned % – The total count of PF clusters aligning for the sample (Read 1/Read 2).
• Mismatch – The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).

Isaac Variants Statistics
The variants pane shows three tables of variant statistics, for Single Nucleotide Variants (SNVs), insertions, and deletions. The rows contain the following information.
Column – Description
• Total number – Total number of the specific variant.
• Het/Hom Ratio – The ratio between heterozygote and homozygote variants.
• % in dbSNP 131 – The percentage of the specific variants found in dbSNP 131.
• Transitions / Transversions – The ratio of transitions (A-G or C-T changes) to transversions (other changes).

Isaac Coverage Graph
• Y Axis – # Reference Bases
• X Axis – Read Depth
• Description – The coverage graph displays the number of bases that are covered at each read depth.
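Two of the Isaac summaries above reduce to simple counting, sketched below in Python under stated assumptions. The function names (is_transition, ti_tv_ratio, depth_histogram) and the input shapes (a list of (ref, alt) base pairs, and a list of per-position depths) are hypothetical and chosen for illustration; they do not describe how the Isaac app computes its statistics.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

_PURINES = {"A", "G"}
_PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T changes; any other substitution is a transversion."""
    pair = {ref.upper(), alt.upper()}
    return pair == _PURINES or pair == _PYRIMIDINES


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Ratio of transitions to transversions, or None if there are no transversions."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


def depth_histogram(depths: Iterable[int]) -> Dict[int, int]:
    """Map each read depth to the number of reference bases covered at that depth."""
    return dict(Counter(depths))


if __name__ == "__main__":
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("T", "C"), ("A", "C")]))
    print(depth_histogram([0, 1, 1, 2, 2, 2]))
```

The histogram corresponds to the axes of the coverage graph: read depth on the X axis and the number of reference bases at that depth on the Y axis.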
Technical Assistance
For technical assistance, contact Illumina Technical Support.

Table 3 Illumina General Contact Information
• Illumina Website – www.illumina.com
• Email – [email protected]

Table 4 Illumina Customer Support Telephone Numbers

Region            Contact Number      Region            Contact Number
North America     1.800.809.4566      Italy             800.874909
Austria           0800.296575         Netherlands       0800.0223859
Belgium           0800.81102          Norway            800.16836
Denmark           80882346            Spain             900.812168
Finland           0800.918363         Sweden            020790181
France            0800.911850         Switzerland       0800.563118
Germany           0800.180.8994       United Kingdom    0800.917.0041
Ireland           1.800.812949        Other countries   +44.1799.534000

Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at www.illumina.com/msds.

Product Documentation
Product documentation in PDF is available for download from the Illumina website. Go to www.illumina.com/support, select a product, then click Documentation & Literature.

Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com