Final Project

Project Details

The goal of the final project is to explore an unruly set of biology data and find a story to tell. You may use the strain-level gene data (DNA), the gene expression data (RNA), and/or the sample annotations (meta-data). With whatever data you choose, find an interesting question to answer that challenges:

  • your analytic skills (using Python to get, manipulate, analyze and visualize)

  • your storytelling skills (how you use the above to tell a coherent story with data)

BOTH are equally valued.

Here are some tips / thoughts about specifically what I need to see:

  1. The data are not pre-processed or perfectly packaged, so you will need to perform some (but not necessarily all) of the following: merging, joining, creating new columns, selecting columns, and changing data types. These operations are likely best done using the Pandas library.

  2. To tailor your data to your question/story, you will need to decide criteria for including/excluding data - both rows (e.g. genes or samples) and columns (e.g. samples, strains or meta-data fields). It is impossible to give minimum or maximum requirements on data size. However, too little data (fewer than 500 instances) leaves little chance of finding useful patterns. Too few attributes (fewer than 5 columns) means that you either need to merge in complementary data or create your own additional columns. Conversely, too much data (in rows or columns) means that you will almost certainly have to START by reducing the size of your data for realistic processing on Colab, even in a high-mem runtime.

  3. You will need to create visualizations. But don’t just graph everything because you can. Think about which plots best support your story.

  4. I would like some specific analytics, meaning discussion combined with analytic approaches: which features are important for a specific problem, what trends exist in the data, and how do you explain them? I’m not looking for you to be right; I’m looking for your story to be well-motivated by data.

  5. I cannot tell you how long your notebook should be. HOWEVER, consider how much of your grade this notebook is worth. It should contain paragraphs of text that could serve as sections of a paper (introduction, methods, results, discussion) in addition to the plots and tables.

  6. I am looking to learn. Part of your grade will be in how well YOU take ME on a journey through the data; honestly, by the end you are likely to have found something I never would have. Telling that story is a combination of code (show me results, do not comment out the good stuff) and the write-up. This IS part essay, part presentation, part coding exercise.
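
As a rough illustration of the kinds of Pandas operations mentioned in tips 1 and 2, here is a minimal sketch. The tables, column names and values below are made up for the example and are NOT the course data; the real files will be shaped differently, so treat this as a starting pattern rather than a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the course data (names and values are invented).
expression = pd.DataFrame({
    "gene": ["geneA", "geneB", "geneC"],
    "s1": ["10.5", "0.0", "3.2"],   # numbers stored as text, as often happens on load
    "s2": ["8.1", "1.5", "2.9"],
    "s3": ["0.4", "7.7", "3.0"],
})
annotations = pd.DataFrame({
    "sample": ["s1", "s2", "s3"],
    "strain": ["X", "X", "Y"],
    "condition": ["control", "treated", "treated"],
})

# Reshape from wide (one column per sample) to long (one row per gene/sample pair).
long_expr = expression.melt(id_vars="gene", var_name="sample", value_name="expression")

# Change the type of a column: text -> numeric.
long_expr["expression"] = pd.to_numeric(long_expr["expression"])

# Merge the measurements with the sample annotations (meta-data).
merged = long_expr.merge(annotations, on="sample", how="inner")

# Create a new column derived from an existing one.
merged["log_expression"] = np.log2(merged["expression"] + 1)

# Apply an inclusion criterion on rows and select a subset of columns.
treated = merged.loc[
    merged["condition"] == "treated",
    ["gene", "sample", "strain", "log_expression"],
]

# Sanity-check the size against the rough guidance in tip 2.
n_rows, n_cols = treated.shape
enough_rows = n_rows >= 500
enough_cols = n_cols >= 5
print(f"{n_rows} rows, {n_cols} columns; enough rows: {enough_rows}, enough columns: {enough_cols}")
```

With a toy table this small, both size checks fail, which is the point of running them early: they tell you whether to merge in more data or to relax your filters before you invest in plots.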

Draft Rubric

Analytical / Mathematical

  • Exceeding expectations: Broad use of Pandas such as merging, joining, creating new columns, selecting columns, changing types of data.
    Arriving at expectations: Some use (1-2 instances) of Pandas such as selecting data or summarizing columns.
    Not meeting expectations: No data manipulation using Pandas and/or data sets are left as loaded.

  • Exceeding expectations: Visualizations are chosen carefully in order to convey multiple levels of information.
    Arriving at expectations: Some visualizations.
    Not meeting expectations: Visualizations don’t contribute to the story in clear ways.

  • Exceeding expectations: Data selection is thoughtful and considers rows and columns or multiple data types (DNA, RNA, meta-data).
    Arriving at expectations: One type of data is used, but it is “large” enough to support conclusions.
    Not meeting expectations: Data is left unfiltered or is too restricted to support conclusions.

Communication / Storytelling

  • Exceeding expectations: Writing builds a convincing story by explaining analysis results, providing interpretation and strong evidence that the data is understood.
    Arriving at expectations: There’s a story but it is missing pieces: results are partially explained, insufficient evidence that data was understood.
    Not meeting expectations: No discernible story; demonstrates a lack of engagement with the data.

  • Exceeding expectations: Analysis asks and answers a clearly motivated question.
    Arriving at expectations: A question is posed but no clear answer or larger implications are conveyed.
    Not meeting expectations: No question is posed.

Reflective

  • Exceeding expectations: Student thoughtfully identified specific strengths and areas for improvement. A complete picture of the student’s work process with a plan for greater success in the future.
    Arriving at expectations: Student identified strengths or areas for improvement but lacked specificity or sufficient development in either area. Only a partial picture of the student’s process.
    Not meeting expectations: Reflection lacked considerations and areas of improvement, did not incorporate or address suggestions from peer review.