Final Project
Project Details
The goal of the final project is to explore an unruly set of biology data and find a story to tell. You may use the strain-level gene data (DNA), the gene expression data (RNA) and/or the sample annotations (meta-data). With whatever data you choose, find an interesting question to answer that challenges:
your analytic skills (using Python to get, manipulate, analyze and visualize data)
your storytelling skills (how you use the above to tell a coherent story with data)
BOTH are equally valued.
Here are some tips / thoughts about specifically what I need to see:
The data are not pre-processed or perfectly packaged, so you will need to perform some (but not necessarily all) of the following: merging, joining, creating new columns, selecting columns, changing types of data. These operations are likely best done using the Pandas library.
To tailor your data to your question/story, you will need to decide criteria for including/excluding data, both rows (e.g. genes or samples) and columns (e.g. samples, strains or meta-data fields). It is impossible to give minimum or maximum requirements on data size. However, too little data (fewer than 500 instances) leaves little chance of finding useful patterns. Too few attributes (fewer than 5 columns) means that you either need to merge in complementary data or create your own additional columns. Conversely, too much data (in rows or columns) means that you will almost certainly have to START by reducing the size of your data for realistic processing on Colab, even in a high-mem runtime.
You will need to create visualizations. But don’t just graph everything because you can; think about which plots best support your story.
I would like to see specific analytics, meaning discussion combined with analytic approaches: which features are important for a specific problem, what trends exist in the data, and how do you explain them? I’m not looking for you to be right; I’m looking for your story to be well motivated by data.
I cannot tell you how long your notebook should be. HOWEVER, consider how much of your grade this notebook is worth. It should contain paragraphs of text that could serve as sections of a paper (introduction, methods, results, discussion) in addition to the plots and tables.
I am looking to learn. Part of your grade will be in how well YOU take ME on a journey through the data; honestly, by the end you are likely to have found something I never would have. Telling that story is a combination of code (show me results, do not comment out the good stuff) and the write-up. This IS part essay, part presentation, part coding exercise.
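The data-wrangling steps named in the tips above (reshaping, changing types, merging, creating new columns, filtering rows, selecting columns) could look something like the sketch below. Everything in it is hypothetical: the toy tables `expression` and `annotations`, their column names, and the filtering rule are assumptions made for illustration, not the course data or a required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the course data; the real files,
# column names and sizes will differ.
expression = pd.DataFrame({
    "gene": ["g1", "g2", "g3"],
    "S1": ["5.0", "0.0", "2.5"],  # stored as strings on purpose
    "S2": ["7.0", "0.0", "3.5"],
    "S3": ["1.0", "0.0", "9.0"],
})
annotations = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "tissue": ["root", "root", "leaf"],
    "batch": [1, 1, 2],
})

# Reshape from wide (one column per sample) to long (one row per gene/sample).
long_expr = expression.melt(id_vars="gene", var_name="sample", value_name="expr")

# Change the data type: the expression values were loaded as strings.
long_expr["expr"] = pd.to_numeric(long_expr["expr"])

# Merge in the sample annotations; validate guards against duplicated samples.
merged = long_expr.merge(annotations, on="sample", how="left", validate="many_to_one")

# Create a new column: log2-transformed expression with a pseudocount of 1.
merged["log_expr"] = np.log2(merged["expr"] + 1)

# Exclude rows: drop genes that are never expressed in any sample.
is_expressed = merged.groupby("gene")["expr"].transform("max") > 0
filtered = merged[is_expressed]

# Select only the columns needed for the story.
filtered = filtered[["gene", "sample", "tissue", "log_expr"]]

# Summarize: mean log expression per gene and tissue.
summary = filtered.groupby(["gene", "tissue"])["log_expr"].mean().unstack()
print(summary)
```

Whatever inclusion rule you pick, state it and justify it in the text; the "never expressed" filter here is only one possible criterion.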
Rubric
- maximum points: 12
- 6 for analysis
- 4 for storytelling
- 2 for reflection
| Category | Exceeding expectations (2 pts) | Arriving at expectations (1.5 pts) | Not meeting expectations (0.5 pts) |
|---|---|---|---|
| Analytical / Mathematical | Broad use of Pandas, such as merging, joining, creating new columns, selecting columns, changing types of data. | Some use (1-2 instances) of Pandas, such as selecting data or summarizing columns. | No data manipulation using Pandas and/or data sets are left as loaded. |
| | Visualizations are chosen carefully in order to convey multiple levels of information. | Some visualizations | Visualizations don’t contribute to the story in clear ways. |
| | Data selection is thoughtful and considers rows and columns or multiple data types (DNA, RNA, meta-data). | One type of data is used, but it is “large” enough to support conclusions. | Data is left unfiltered or is too restricted to support conclusions. |
| Communication / Storytelling | Writing builds a convincing story by explaining analysis results, providing interpretation and strong evidence that the data is understood. | There’s a story but it is missing pieces: results are partially explained, insufficient evidence that data was understood. | No discernible story; demonstrates a lack of engagement with the data. |
| | Analysis asks and answers a clearly motivated question. | A question is posed but no clear answer or larger implications are conveyed. | No question is posed. |
| Reflective | Student thoughtfully identified specific strengths and areas for improvement. A complete picture of the student’s work process with a plan for greater success in the future. | Student identified strengths or areas for improvement but lacked specificity or sufficient development in either area. Only a partial picture of the student’s process. | Reflection lacked considerations and areas of improvement, did not incorporate or address suggestions from peer review. |
Presentation Rubric
- maximum points: 12
- 6 for analysis
- 4 for storytelling
- 2 for reflection
| Category | Exceeding expectations (2 pts) | Arriving at expectations (1.5 pts) | Not meeting expectations (0.5 pts) |
|---|---|---|---|
| Analytical / Mathematical | Convincing analysis accepting or rejecting a null hypothesis with statistical / quantitative evidence. | Use of quantitative descriptions but no clear answer to the central question. | Data does not support the conclusions, or no quantitative results are presented, or no testable hypothesis is posed. |
| | Visualizations are selected from results carefully and are highly informative. | Some visualizations are presented but they are not clear highlights of the project. | Visualizations are uninformative, misleading or don’t contribute to the story in clear ways. |
| | Data selection is clearly justified (DNA, RNA, meta-data). | Data selection is stated but not defended. | Data selection is unclear. |
| Communication / Storytelling | Presentation has a clear and convincing arc presented in time. | There’s a story but it lacks a resolution or does not fit the time frame. | Presentation is not focused on a story. |
| | There is a clear central question which is answered. | A question can be gleaned but is not central. | No question is made explicit. |
| Reflective | Student responds to questions thoughtfully, providing new information when needed. | Student responds to questions to the best of their ability but lacks sufficient breadth to engage in discussion. | Responses to questions are unclear or do not address the questions asked. |
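The presentation rubric asks for statistical or quantitative evidence for accepting or rejecting a null hypothesis. As one possible illustration, here is a minimal sketch of a two-sided permutation test for a difference in group means. The helper `permutation_test`, the group names and the numbers are all hypothetical; a permutation test is just one option, and the right test depends on your question and data.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration; returns (observed_difference, p_value).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up log-expression values for one gene in two hypothetical tissues.
root = [6.1, 5.8, 6.4, 6.0, 5.9, 6.3]
leaf = [4.9, 5.2, 5.0, 4.7, 5.1, 5.3]

observed, p_value = permutation_test(root, leaf)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```

If you run a test like this on many genes, remember that the p-values need a multiple-testing correction before you draw conclusions from them.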