582: Introduction to Data Science Page 1 of 12

UNIVERSITY OF WISCONSIN-MILWAUKEE
School of Information Studies
INFOST (582) – Introduction to Data Science
Section 201 and 202 - Online
Spring 2016
SYLLABUS

Instructor: Margaret Kipp
E-mail: [email protected] (best contact method)
Fax: 414-229-6699
Office: NWQB 2574
Office Hours: TBA

CATALOG DESCRIPTION
Introduces basic concepts, background, theoretical, practical and technological aspects of data science. 3 credits

GENERAL DESCRIPTION
This course provides an introduction to data science. Data science has developed as a set of methods for analysing massive data sets to extract useful knowledge. A data scientist is a person who has the skills and knowledge to perform these analyses. This course will cover topics necessary to develop data-science solutions to problems, including data collection, data cleaning and integration, data analysis, and data presentation.

PREREQUISITES
Junior Standing. For 500 and 600 level courses it is recommended that an undergraduate student first consult with the appropriate instructor and/or advisor concerning the applicability of this specific course.
Basic computer facility and technology literacy as listed in the SOIS policy are required: http://www4.uwm.edu/sois/programs/graduate/mlis/complitreq.cfm
Optional: Some programming experience or basic statistical knowledge (measures of central tendency) would be an asset for later in the course.

OBJECTIVES/OUTCOMES
Upon completion of the course, students will be able to:
1. effectively develop researchable questions; (Paper or project)
2. identify data sources, collect, clean and merge data; (Selecting a data set, Cleaning data)
3. manipulate structured or unstructured data sources; (Querying a Database, Creating Metadata)
4. identify and apply appropriate statistical methods for analysing data; (Analysing Data, Project)
5. critically evaluate tools for working with data; (Project, Cleaning Data)
6. address multilingual and multicultural issues in data creation and analysis; (Creating Metadata, Querying a Database, Readings and Discussions)
7. identify emerging trends and stay current with issues in data science. (Readings and Discussions)

ALA COMPETENCIES (for MLIS students)
1. The systems of cataloguing, metadata, indexing, and classification standards and methods used to organize recorded knowledge and information.
2. Information, communication, assistive, and related technologies as they affect the resources, service delivery, and uses of libraries and other information agencies.
3. The application of information, communication, assistive, and related technology and tools consistent with professional ethics and prevailing service norms and applications.
4. The principles and techniques necessary to identify and analyse emerging technologies and innovations in order to recognize and implement relevant technological improvements.

METHOD
Lecture/Discussion/Readings/Examples/Exercises – to achieve a satisfactory understanding of the course material and to fulfil requirements of the assignments, students are expected to attend the lectures, read and comment on the readings, participate in discussions and in-class exercises, and explore examples and tutorials.

TIME COMMITMENT
This course requires a weekly time commitment. General university guidelines indicate that a 3 credit course requires a minimum 144-hour time commitment over the course of a term. This time commitment represents a minimum of 9-10 hours of work per week per course. For an onsite class, 3 of these hours represent onsite instruction in a classroom; in an online class this time would be spent on independent reading, discussions and in-class exercises.
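
As a rough check on the weekly figure quoted in the time commitment guidelines, and assuming a 15-week term (an assumption based on the 15 weekly classes listed in the course outline, not a figure stated in the university guidelines), the minimum works out as:

```latex
\[
\frac{144\ \text{hours}}{15\ \text{weeks}} = 9.6\ \text{hours per week}
\]
```

This falls within the stated range of 9-10 hours per week.
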
Each week you may be required to read notes and readings from the reading list associated with that class, participate in discussions, write summaries of readings, complete in-class exercises, explore examples, or complete assignments and projects. It is your responsibility to plan your time in order to complete all activities based on the schedule outlined in this syllabus.

ACCOMMODATIONS
If you need accommodations due to illness, disabilities, scheduling conflicts with religious observances, or other life events (e.g. military service), contact the instructor as soon as possible, preferably by the third week of class as per university policy. Official documentation may be required depending on the nature of the considerations requested, per university policy (http://www4.uwm.edu/secu/docs/faculty/1895R3_Uniform_abus_Policy.pdf).

TEXTBOOK AND READINGS
Shron, Max. 2014. Thinking with Data: How to Turn Information into Insight. O'Reilly Media. ISBN: 978-1449362935 (Available in Paperback, Kindle, EPUB, MOBI, etc.) [Required]
Readings are listed in the course outline for each class. Readings should be completed before the class.
Other course materials, including this syllabus, are available through D2L (http://d2l.uwm.edu/).
Changes may be made to the readings as the term progresses. These are generally marked with TBD. Changes will be announced in D2L ahead of the classes for which changes will occur.

COURSE OUTLINE
Each entry lists the class number, date and topics, followed by the readings (complete before class).

Class 1 (Jan 27): Data Science and Big Data, Becoming a Data Scientist
Readings
◦ Shron. 2014. Thinking with Data, Preface, Chapter 1 (18p)
◦ Loukides. 2010. What is Data Science? (12p) http://www.cloudera.com/content/dam/cloudera/Resources/PDF/What_is_Data_Science_OReilly.pdf
◦ Zhu & Xiong. 2015. Defining Data Science. (8p) http://arxiv.org/abs/1501.05039 [cs.DB]
◦ Becoming a Data Scientist, 8 Jul 2013, by Swami Chandrasekaran (1p) http://nirvacana.com/thoughts/becoming-a-data-scientist/
◦ Miller. 2013. Data Science: The Numbers of Our Lives. New York Times, April 11, 2013 (4p) http://nyti.ms/10QarGu
◦ Dumbill. 2012. What is big data?: An introduction to the big data landscape. O'Reilly.com. (9p) http://radar.oreilly.com/2012/01/what-is-big-data.html

Class 2 (Feb 3): Developing Data Based Questions
Readings
◦ Shron. 2014. Thinking with Data, Chapters 2-4 (50p)

Class 3 (Feb 10): Choosing data sets/sources of data and methods of collecting data
Readings
◦ Shron. 2014. Thinking with Data, Chapters 5, 6 (25p)
◦ Mattmann. 2013. Computing: A vision for data science. Nature 493, p.473–475. doi:10.1038/493473a (UWM Library Full Text)
◦ Marx. 2013. Biology: The big challenges of big data. Nature 498, p.255–260. doi:10.1038/498255a (UWM Library Full Text)
◦ Doctorow. 2008. News Feature: Big data: Welcome to the petacentre. Nature 455, 16-21. http://www.nature.com/news/2008/080903/full/455016a.html
◦ Wallis, et al. 2013. If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLOS One. DOI: 10.1371/journal.pone.0067332 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0067332
Datasets
◦ Open Data Handbook http://opendatahandbook.org/guide/en/what-is-open-data/
◦ Open Data Datasets. KDNuggets. http://www.kdnuggets.com/datasets/index.html
◦ Marr. 2015. The Free 'Big Data' Sources Everyone Should Know. DataScienceCentral.com. http://www.datasciencecentral.com/profiles/blogs/thefree-big-data-sources-everyone-should-know

Class 4 (Feb 17): Privacy, Ethics and Data
Readings
◦ O'Leary. 2015. "Big Data and Privacy: Emerging Issues," in Intelligent Systems, IEEE 30(6): 92-96. (UWM Library Full Text)
◦ Perera, et al. 2015. "Big Data Privacy in the Internet of Things Era," in IT Professional 17(3): 32-39 (UWM Library Full Text)
◦ Daries, et al. 2014. "Privacy, Anonymity, and Big Data in the Social Sciences." Communications of the ACM 57(9): 56-63. (D2L)
◦ Shilton. 2012. Participatory personal data: An emerging research challenge for the information sciences. Journal of the American Society for Information Science and Technology 63(10): 1905-1915. (UWM Library Full Text)

Class 5 (Feb 24): Metadata and the Semantic Web
Readings
◦ Elings and Waibel. 2007. Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums. First Monday 12(3). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1628/1543
◦ Gilliland. 2008. "Setting the Stage" In Introduction to Metadata. http://www.getty.edu/research/publications/electronic_publications/intrometadata/setting.html
◦ Robu et al. 2006. An introduction to the Semantic Web for health sciences librarians. JMLA 94(2): 198-205. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435839/
◦ Introducing Linked Data and the Semantic Web. LinkedDataTools.com. http://www.linkeddatatools.com/semantic-web-basics
Tutorials
◦ XML Basic (first 10 pages) http://www.w3schools.com/xml/default.asp
◦ JSON Tutorial http://www.w3schools.com/json/
◦ Introduction to RDF (first 3 pages) http://www.w3schools.com/rdf/rdf_intro.asp

Class 6 (Mar 2): Databases and other data stores
Readings
◦ Kimani. Introduction to Databases. Techopedia. https://www.techopedia.com/6/28832/enterprise/databases/introduction-to-databases
◦ Pokorny. 2015. Database technologies in the world of big data. In Proceedings of the 16th International Conference on Computer Systems and Technologies (CompSysTech '15), Boris Rachev and Angel Smrikarov (Eds.). ACM, New York, NY, USA, 1-12. (D2L)
◦ Sadalage. 2014. NoSQL Databases: An Overview. ThoughtWorks.com. http://www.thoughtworks.com/insights/blog/nosqldatabases-overview
Tutorials
◦ Interactive SQL Tutorial http://sqlzoo.net/
◦ Lynda.com (http://www4.uwm.edu/sois/resources/it/lynda/)
  ▪ Relational Database Fundamentals
  ▪ MySQL Essential Training

Class 7 (Mar 9): Working with structured and semi-structured data: Databases and Metadata
Readings
◦ OAI for Beginners - the Open Archives Forum online tutorial (Sections 1, 3) http://www.oaforum.org/tutorial/
◦ Jackson et al. Dublin Core Metadata Harvested Through OAI-PMH. Journal of Library Metadata 8:1 (2008) 5-21. http://hdl.handle.net/2142/9091
Tutorials
◦ Using Open Refine http://openrefine.org/documentation.html
◦ Interactive SQL Tutorial http://sqlzoo.net/
◦ SPARQL https://code.google.com/p/tdwgrdf/wiki/Beginners6SPARQL
◦ Export Data From Database to CSV File https://support.spatialkey.com/export-data-fromdatabase-to-csv-file/

Class 8 (Mar 16): Spring Break, No Class
No Readings

Class 9 (Mar 23): Unstructured Data and Log Files, Working with unstructured data
Readings
◦ Nicholas, et al. 2003. Micro-mining and segmented log file analysis: a method for enriching the data yield from Internet log files. Journal of Information Science, 29(5), pp. 391–404. (UWM Library Full Text)
◦ Huntington, et al. 2006. Obtaining subject data from log files using deep log analysis: case study OhioLINK. Journal of Information Science 32(4), 299-308. (UWM Library Full Text)
◦ Blumberg, et al. 2003. The problem with unstructured data. DM REVIEW - soquelgroup.com. http://soquelgroup.com/Articles/dmreview_0203_problem.pdf
◦ Polnaszek, et al. 2016. Overcoming the Challenges of Unstructured Data in Multisite, Electronic Medical Record-based Abstraction. Medical Care (publish ahead of print). (UWM Library Full Text)
◦ Veeranjaneyulu, et al. 2014. Approaches for Managing and Analyzing Unstructured Data. International Journal on Computer Science and Engineering (IJCSE) Vol 6(1). http://www.enggjournals.com/ijcse/doc/IJCSE14-06-01020.pdf
◦ Log File Analysis: The Ultimate Guide http://builtvisible.com/log-file-analysis/

Class 10 (Mar 30): Cleaning up data and integrating data sets
Readings
◦ Loshin. 2015. Integrating Data from Multiple Sources http://community.embarcadero.com/index.php/blogs/entry/integrating-data-from-multiple-sources-by-david-loshin
◦ Cody, et al. (n.d.) Data Cleaning 101 http://www.ats.ucla.edu/stat/sas/library/nesug99/ss123.pdf
◦ Rahm. (n.d.) Data Cleaning: Problems and Current Approaches, University of Leipzig, Germany. http://lips.informatik.uni-leipzig.de/files/2000-45.pdf
◦ Top Ten Ways to Clean Your Data. Microsoft.com. https://support.office.com/en-us/article/Top-ten-ways-toclean-your-data-2844b620-677c-47a7-ac3ec2e157d1db19
◦ Using a spreadsheet to clean up a dataset. 2013. http://schoolofdata.org/handbook/recipes/cleaning-datawith-spreadsheets/
◦ Data Journalism Handbook http://datajournalismhandbook.org/1.0/en/understanding_data_2.html

Class 11 (April 6): Describing data
Readings
◦ Foreman. 2015. Data Smart. Wiley. Chapters TBD (UWM Library Full Text)
◦ Sonnad. 2002. Describing data: statistical and graphical methods. Radiology. Dec; 225(3): 622-8. http://pubs.rsna.org/doi/pdf/10.1148/radiol.2253012154
◦ Online Statistics Education: An Interactive Multimedia Course of Study, Chapters 3, 4 http://onlinestatbook.com/2/index.html

Class 12 (April 13): Analysing Data
Readings
◦ Foreman. 2015. Data Smart. Wiley. Chapters TBD (UWM Library Full Text)
◦ Online Statistics Education: An Interactive Multimedia Course of Study, Chapters 11, 12, 14 http://onlinestatbook.com/2/index.html

Class 13 (April 20): Visualizing Data
Readings
◦ Foreman. 2015. Data Smart. Wiley. Chapters TBD (UWM Library Full Text)
◦ NIST/SEMATECH e-Handbook of Statistical Methods, Chapter 1: Exploratory Data Analysis http://www.itl.nist.gov/div898/handbook/
◦ Kandel et al. Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization 0(0) 1–18.
  http://research.microsoft.com/enus/um/people/nath/docs/datawrangling_ivj2011.pdf

Class 14 (April 27): Work on Projects
No Readings

Class 15 (May 4): Wrapup and Current Events
Readings: TBD

ASSIGNMENTS
Assignment (points: Graduate / Undergraduate†; Associated Classes; Deadline‡)

Selecting a dataset (Graduate 5 / Undergraduate 5; Associated Classes 1-3; Deadline: Class 4)
Identify a question that interests you. Identify a data set on this topic that could be used to answer the question. Explain the kinds of information available in the dataset and how the data is structured. (400 words)

Metadata (Graduate 5 / Undergraduate 5; Associated Classes 5; Deadline: Class 6)
Select 2 objects and create a metadata record for each using a metadata schema of your choice. Your records should contain enough information to fully describe the object. It is recommended that you use Dublin Core or Schema.org encoded in XML, RDF or JSON.

Database (Graduate 5 / Undergraduate 5; Associated Classes 6-7; Deadline: Class 7)
Create a simple database in Access or MySQL with at least three joined tables. Populate the tables with enough data to provide useful results for your queries. Create two SQL queries that extract useful data from at least two tables of the database.

Short Paper (Graduate 20 / Undergraduate n/a; Associated Classes All; Deadline: Proposal: Class 3, Paper: Class 10)
Write a short paper on a data science related topic. (800-1000 words)

Cleaning Data (Graduate 5 / Undergraduate 5; Associated Classes 7-10; Deadline: Class 11)
Identify problems in a data set which might interfere with analysis of the data (e.g. typos, structure problems, poor adherence to standards). Describe the problems and write policy statements or suggested solutions for solving these problems. (400 words)

Analysing Data (Graduate 10 / Undergraduate 10; Associated Classes 1-12; Deadline: Class 13)
Develop a basic analysis of a data set using the tools discussed in class. Describe your findings in report format. Use graphs, charts or other tools to present your findings as required. (400U/500G words)
Note: It is recommended that you submit a rough draft/early draft of your project, though you may choose to do a separate data analysis.

Project (Graduate 30 / Undergraduate 50; Associated Classes All; Deadline: Proposal: Class 8, Project: Last class)
Select a topic and gather a small test set of data, clean it, analyse it and present the results in a report. Your report may be written, oral or multimedia based. (800U/1200G words or equivalent; charts, tables, etc. do not count towards the word limit)
Note: Graduate students are expected to provide additional critical analysis and reflection of the data including potentially locating and citing appropriate supporting materials from published sources.

Participation (see below) (Graduate 20 / Undergraduate 20; Associated Classes All; Deadline: Last class)

† Different requirements for graduate and undergraduate levels will be specified in the directions for each assignment where appropriate.
‡ Class numbers are listed in the Course Outline Table. Each class has an associated Class Number (#), Date, Topic, Readings and may have In-class Exercises, Discussions or Tutorials. The assignment table is keyed to the course outline's class numbers. To determine the exact date an assignment is due, go to the appropriate class number in the course outline table or use the D2L calendar.

* There is no final exam in this course. *

Working with Classmates
All assignments except the short paper and participation may be completed in pairs or trios. Assignments completed in pairs/trios must identify all work partners by full name at the top of the assignment. You must each submit the same assignment to the dropbox. If you simply assisted each other but did not do the whole assignment together, you must also note this at the top of the assignment. Unacknowledged borrowing is seen as plagiarism, so be sure to document your teamwork to avoid this.

Formatting Guidelines for Assignments
Assignments should be written using Arial or another Sans-Serif style font. Do not use red for emphasis or to highlight your answers to questions. Remove all extraneous information before submission (e.g. assignment instructions or tips). Use whatever citation format you prefer, but do not use footnotes.
If you are not using a common format such as MLA or APA you should include information about which style guide you are using in the assignment.
Paper submissions will not be accepted. All assignments must be typed on a computer and submitted electronically. Handwritten submissions will not be accepted, even if scanned and submitted electronically.
Assignments may not be submitted in Pages, Microsoft Works, or Microsoft Project as I cannot open these formats. You should save these as a PDF instead. Other common file formats should be acceptable, including Open Office formats. If you are using an unusual format you can always check with me before submission to ensure I can open it.

Due Dates and Assignment Submission
All assignments and projects should be submitted through D2L to the appropriate dropbox before midnight (Central Time) on the recommended due date.
Students should strive to submit assignments by the recommended due date, but may have until the assignment's Final Deadline to submit. Points for late assignments will be reduced 10% per day late after the Final Deadline. The dropbox will remain open for the submission of late assignments until the late penalty reaches 100%.
Participation items should be submitted to the appropriate discussion group (see the participation section below) before the discussion group closes. Discussion groups will be open for 1 week before and 1 week after the date of the associated class.
Emailed submissions will only be accepted as a backup to a D2L submission (or in case of D2L errors).
Everything must be submitted by the Last Class (this includes all assignments, papers, projects, and participation).
All project and assignment deadlines are in the syllabus. For discussion deadlines check the discussion groups or the D2L calendar. The D2L calendar also contains all project and assignment deadlines.
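
The late-penalty rule described in this section can be sketched as a small calculation. This is only an illustration under stated assumptions: the function name `late_score` is hypothetical, lateness is counted in whole days after the Final Deadline, and the 10% reduction is applied to the points the assignment would otherwise have earned. The syllabus does not specify how partial days are treated or which base the percentage applies to.

```python
def late_score(earned_points: float, days_late: int) -> float:
    """Return the points kept after the late penalty.

    Assumptions (not stated in the syllabus): days_late counts whole days
    after the Final Deadline, and the 10%-per-day reduction applies to the
    points the work would otherwise have earned.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # The penalty grows by 10 percentage points per day and is capped at 100%,
    # the point at which the dropbox closes.
    penalty_pct = min(10 * days_late, 100)
    return earned_points * (100 - penalty_pct) / 100


if __name__ == "__main__":
    # A 20-point submission, on time and 3, 10 and 12 days late.
    for days in (0, 3, 10, 12):
        print(days, late_score(20, days))
```

Under these assumptions a submission loses all of its points once it is 10 days past the Final Deadline, which matches the statement that the dropbox stays open until the penalty reaches 100%.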
It is your responsibility to keep track of deadlines using the tools provided or by creating your own calendar of deadlines.
Items submitted early will not be evaluated until their Final Deadline (or Recommended Due Date). Students are encouraged to complete all Associated Classes listed under Assignments before submitting the assignments, since the material in these classes constitutes preparation for the assignments. Submission well before the recommended due date is not encouraged.

Extensions
Students must contact the instructor before each Final Deadline listed under Assignments for any extensions. Extension requests made prior to the Final Deadline do not require any documentation as long as they are not longer than a week. Simply provide a date/time by which you will submit the assignment.
After the deadline the penalties listed under Due Dates will be enforced. Material submitted late after an extension will also be subject to these penalties. Plan your time accordingly.

Extra Credit or Other Special Considerations
Per university policies (see http://www4.uwm.edu/secu/policies/saap/upload/S29.htm) extra credit assignments and other special consideration are not possible. Students should make use of the extensions policy outlined above or provide appropriate documentation of special circumstances as outlined elsewhere in the syllabus.

Participation
Students are expected to participate in discussion and in-class exercises as a demonstration of their ability to articulate key concepts. Discussion will include individual and group components. Participation is mandatory and constitutes 20 of the 100 points available for this class.
Participation will consist of all of the following:
• Completion of the Syllabus Quiz
  ◦ The syllabus quiz must be completed in the first 2 weeks of class. Points will automatically be entered in D2L.
• Individual Summaries of Readings
  ◦ Post 3 summaries of the weekly readings to the appropriate weekly discussion group based on the class associated with each reading.
  ◦ You must post 3 summaries in total, but you may choose the classes for which you wish to contribute the summaries.
  ◦ Sign up for 3 sets of readings on the signup sheet posted in the news section of D2L.
  ◦ Responses need not exceed 300 words.
  ◦ Summaries posted before the date of the class earn a half bonus point each. Be sure to mark this on your course completion checklist to ensure you receive the bonus.
• Participation in Weekly Discussions
  ◦ Participation in the in-class exercises and discussions included each week in the weekly discussion group. Points will be allocated based on your participation level (i.e. frequently, infrequently, no participation).
  ◦ Generally frequent participation requires that you participate at least once a week in most weeks.
  ◦ Responses need not exceed 300 words.
• Contributed Article
  ◦ Contribution of a new article, video, cartoon, etc. relevant to the class and a short summary (approximately 100 words) explaining its relevance to class. This should be posted to the appropriate weekly discussion group based on the topic. You may choose which week you wish to contribute this item.
  ◦ A signup sheet will be posted in the news section of D2L.
• Responses to Others
  ◦ Reading and/or responding to weekly reading summaries and other information posted to the weekly discussion groups by classmates. Points will be allocated based on your reading level (i.e. many, few, nothing read) and/or your responses to others.
• Submission of the Course Checklist to the participation dropbox
  ◦ The completed checklist, with all required course elements listed, must be submitted to the dropbox before the last class. You should complete as much of the checklist as possible. Use the checklist throughout the term to ensure you are on track to complete all course requirements.

Code of Conduct/Expectations for this Class
This is a professional programme and professional, courteous behaviour is expected of all participants. It is expected that class members will show consideration for all other members of the class and contribute in a constructive manner which is conducive to a good learning environment. Class members should consider the relevance and appropriateness of their contributions before contributing to the class. Violations of these expectations will result in reduced participation points or other sanctions depending on severity.

Plagiarism and Referencing
It is expected that you will consult and cite the research and professional literature where merited and not rely solely on encyclopaedias, newspapers or unpublished online sources. Papers where the majority of sources are blogs and Wikipedia (or similar sites) will not be accepted.
Use a common style manual for citations (e.g. APA, MLA, Chicago, etc.). Ideally you would choose a citation style guide you have used before, or one you are using in another class.
Plagiarism is the unacknowledged borrowing of ideas or material from someone else's work. It is considered an academic offence and can be considered grounds for failure in a course or expulsion from the programme. Cite all references and provide credit for all other materials. This applies to all material including images, sounds or videos. A citation (in the format of your choice) with a functioning URL (if relevant) is the minimum required for a reference.
(http://guides.library.uwm.edu/content.php?pid=235714&sid=1949820#6509804)
You may not resubmit assignments already submitted in other courses or in a previous instance of this course, nor may you submit other people's work as your own. Plagiarism will be dealt with on a case by case basis but will result in a lowered mark on the assignment, failure on the assignment or failure in the course depending on severity and the number of plagiarized items submitted. Points lost through plagiarism may not be replaced by bonus points on other assignments.

GRADING SCALE
96-100    A   Superior work
91-95     A-
87-90     B+
84-86     B   Satisfactory, but undistinguished work
80-83     B-
77-79     C+
74-76     C   Work is below standard
70-73     C-
67-69     D+
64-66     D   Unsatisfactory work
60-63     D-
Below 60  F

GRADE REQUIREMENT FOR A CORE COURSE
If you are pursuing an MSIST degree, you need to earn at least a B (does not include B-) in this course.

UWM AND SOIS ACADEMIC POLICIES
The following link will take you to UWM pages/links which contain university policies affecting all UWM students.
http://www.uwm.edu/Dept/SecU/SyllabusLinks.pdf
The following link will take you to pages/links which contain SOIS policies affecting all SOIS students.
http://www4.uwm.edu/sois/resources/formpol/policies.cfm
Undergraduates may also find the Panther Planner and Undergraduate Student Handbook useful (http://www4.uwm.edu/dos/student-handbook.cfm).
For graduate students, there are additional guidelines from the Graduate School (http://uwm.edu/graduateschool/).

This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States Licence except where other rights exist. Any commercial use of this work requires a separate licence.