Class 13 – Correlations Continued – p-hacking

Here is a data set we recently published: Gnanaprakasam-mBio-2017-2j44e56. This is the figure I made for the paper. I did field work with a colleague in Britain, who then did Illumina sequencing to look at species diversity in Bangladesh sediment. He did 16S rRNA analysis of the DNA and found the percent abundance of each species. I then analyzed the water and sediment to get the arsenic content and the amount of an Fe(III) mineral called hematite. My colleague wants to know which bacterial species he should investigate further. Now we want to look at more species and see if they correlate with the mineral hematite or the aqueous arsenic. He is asking if you can give him a list of 4 and then 6 species to investigate further.

Your mission, if you choose…, is to p-hack this data. Tell me which species might be important and which we should investigate further. You do not need to do both arsenic and hematite; choose one and run with it. As a hint, hematite has fewer data points but better correlations.
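Why is this called p-hacking? When you test hundreds of species columns, some of them will show p < 0.05 purely by chance. Here is a minimal sketch of that effect using made-up random numbers, not the real data. The sample count of 12 and the 268 species columns are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_samples = 12    # assumed number of sediment samples (illustration only)
n_species = 268   # assumed number of species columns (illustration only)

# Stand-in for hematite or arsenic, and pure-noise "species abundances".
x = rng.normal(size=n_samples)
noise = rng.normal(size=(n_species, n_samples))

# One regression per fake species; none of them has a real relationship to x.
p_values = np.array([linregress(x, y).pvalue for y in noise])
n_significant = int((p_values < 0.05).sum())

print(f"{n_significant} of {n_species} random columns have p < 0.05")
```

With a 0.05 cutoff you should expect roughly 5% of unrelated columns to look significant, which is why the species you find here are candidates to investigate further, not conclusions.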

  • Choose whether you will correlate with aqueous arsenic or hematite.
  • Plot 1. Make just one plot of either arsenic or hematite versus one microbial species. To make this plot, set y=df.columns[##], where ## is a number of your choosing between 10 and 277 (the number of columns). Indexing df.columns with a number gives you the column name. After you make the plot, report in the figure caption whether the correlation is significant and say 1-2 sentences about the bacterial species. You can just google the final parts of the bacterial name; usually one of the top results will explain the main properties of the bacteria.
  • Next, p-hack the data. Search my data to see if any correlations are significant and make one PDF file of the results. It might be big and cumbersome but could be helpful. It will have hundreds of pages! Choose to p-hack with arsenic or hematite.
  • Then you should look at the plots and results.
  • Save the r2 values, sort them, and make two more figures:
    • One with 4 graphs on one figure, for the species with the 4 highest r2 values.
    • One with 6 graphs on one figure, for the species with the 6 highest r2 values.
  • All graphs should have a text box from linregress that shows the slope, intercept, r2, and p-value. Remember, r2 is just the r-value squared.
  • The data is in Excel this time! Here are links: link, or here: Species-Arsenic-Data-2lq2nkt.
  • The column names are long and ugly from the sequencing. Fix this for your plots of 4 and 6 (see below)! This gets you 5%. These long names are what are automatically downloaded from NIH/NCBI.
  • The second 5% is if you make a seaborn correlation matrix for at least 5 columns, including arsenic, hematite, and 3 bacteria.
  • Make sure to label your figures and also give me the answer to the question above.
  • Helpful notebooks for the final 10%
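The search, sort, and rename steps above can be sketched roughly as follows. This is one possible approach, not the only one. The small DataFrame is made up so the example runs on its own; the real data would come from the Excel file. The column names, and the ";" separator and "g__" prefix in them, are hypothetical, so check what the real headers look like before splitting them.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Made-up stand-in for the spreadsheet (the real one would be read with pd.read_excel).
rng = np.random.default_rng(1)
hematite = np.linspace(0.5, 6.0, 12)
df = pd.DataFrame({
    "Arsenic": rng.normal(100, 30, 12),
    "Hematite": hematite,
    # Hypothetical long names; the real headers may use a different separator.
    "k__Bacteria;p__Proteobacteria;g__Geobacter": 2.0 * hematite + rng.normal(0, 0.5, 12),
    "k__Bacteria;p__Firmicutes;g__Bacillus": rng.normal(5, 1, 12),
    "k__Bacteria;p__Chloroflexi;g__Anaerolinea": -1.0 * hematite + rng.normal(0, 2.0, 12),
})
df.loc[3, "Hematite"] = np.nan   # mimic hematite having fewer data points

x_col = "Hematite"
species_cols = df.columns[2:]    # in the assignment this would start near column 10

# Run linregress on every species and store r2 in a dictionary keyed by column name.
r2_by_species = {}
for col in species_cols:
    pair = df[[x_col, col]].dropna()        # drop rows where either value is missing
    fit = linregress(pair[x_col], pair[col])
    r2_by_species[col] = fit.rvalue ** 2    # r2 is just the r-value squared

# Sort the column names from highest to lowest r2 and keep the top few.
ranked = sorted(r2_by_species, key=r2_by_species.get, reverse=True)
top = ranked[:2]

# Dictionary from long name to short name: keep only the last piece of the name.
short_name = {col: col.split(";")[-1].replace("g__", "") for col in species_cols}

for col in top:
    print(f"{short_name[col]}: r2 = {r2_by_species[col]:.2f}")

# Matrix of r-values for arsenic, hematite, and the species columns.
corr = df[["Arsenic", "Hematite", *species_cols]].corr()
```

Because the short names live in a dictionary, the same code works for 3 columns or 300. The `corr` table can be handed to seaborn, for example `sns.heatmap(corr, annot=True)`, to draw the correlation matrix.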

YOU WILL NEED TO HAND IN

  • Your Notebook as usual.
  • PDF file of all the results.
  • Plot of 4 parameters!
    • I am not showing the answers yet!
  • Plot of 6 parameters!
    • I am not showing the answers yet!
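For the PDF of all the results, one possible approach is matplotlib's PdfPages, which saves one figure per page. The sketch below uses made-up data and a hypothetical helper called plot_fit that adds the linregress text box. The file name and the temporary folder are placeholders, so swap in whatever path you like.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from scipy.stats import linregress


def plot_fit(ax, x, y, xlabel, ylabel):
    """Scatter x vs y, draw the fitted line, and add a linregress text box."""
    fit = linregress(x, y)
    ax.scatter(x, y)
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, fit.intercept + fit.slope * xs, color="red")
    text = (f"slope = {fit.slope:.3g}\nintercept = {fit.intercept:.3g}\n"
            f"r2 = {fit.rvalue ** 2:.3f}\np = {fit.pvalue:.3g}")
    ax.text(0.05, 0.95, text, transform=ax.transAxes, va="top",
            bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fit


# Made-up data standing in for the spreadsheet.
rng = np.random.default_rng(2)
df = pd.DataFrame({"Hematite": np.linspace(0.5, 6.0, 12)})
for i in range(5):
    df[f"species_{i}"] = rng.normal(size=12) + 0.3 * i * df["Hematite"]

# Placeholder output location; any writable path works.
pdf_path = os.path.join(tempfile.gettempdir(), "all_species_vs_hematite.pdf")

with PdfPages(pdf_path) as pdf:
    for col in df.columns[1:]:
        fig, ax = plt.subplots()
        plot_fit(ax, df["Hematite"], df[col], "Hematite", col)
        pdf.savefig(fig)   # adds one page to the PDF
        plt.close(fig)     # close each figure so hundreds do not pile up in memory

n_pages = len(df.columns) - 1
print(f"Wrote {n_pages} pages to {pdf_path}")
```

The same plot_fit helper could be reused for the plots of 4 and 6 by passing it each panel's axes instead of a fresh figure.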

 

DUE Wednesday March 6, 2024


			

18 thoughts on “Class 13 – Correlations Continued – p-hacking”

  1. I found it difficult to replace the long bacteria captions within the graphs in homework 11. I think it could be helpful to include more hints in the notebook on this, or even walk through an example of how one would do it.

  2. This was a really tough assignment because it exposes your basic problem-solving assumptions. When we were supposed to shorten the names of the bacteria in the homework, I couldn’t figure out how to do it the proper way, so I just manually renamed the columns that I was using. Then in class, Brian pointed out to me that it’s fine to do that when there are only 10 or 20 columns, but what if you have hundreds? Thousands? It was an “aha” moment for me because I realized I wasn’t just supposed to answer this one homework question in any way possible—I was supposed to be learning a specific skill that I could apply later on, too. I am glad I got to have that moment of self-reflection because of this!

  3. I found this class very useful and helpful in dealing with large datasets. I struggled with the spacing and formatting when making multiple graphs next to each other. A suggestion I have would be to use data sets with more manageable names, as the very long bacteria names were confusing in the homework when I was trying to figure out how to change to shorter names on my own. Maybe cover changing names in class rather than leaving it to us, because even with the Github notebook I couldn’t figure it out and had to wait until the next class.

  4. I think that this was a super complicated notebook to work through, albeit very informative. There are some other shortcuts in Python to get the r2 values. I believe dataframe.corr() will give you the r-values, which you can then square. Overall, though, I think I would find it more helpful if there were little check-ins in the class packet reminding us what each piece of code is actually doing to the data we are working with.

  5. This class was extremely useful as a way for us to apply the knowledge learned in the class before to tackle a large assignment. Learning how to create a bunch of graphs and look through them for correlations is a skill that can be applied to a limitless amount of different areas, so this was definitely good to learn. Learning how to sort was also very handy. The assignment we had to do in this class followed very well from the knowledge learned in the class before, so it was definitely doable with time. Additional guidance on figuring how to format the graphs, along with understanding how formatting works for multiple graph plots, would have been helpful.

  6. This class was one of the first ones where I thought WHOA when I realized what a few lines of code could do. It was super helpful for data visualization and sorting, but I wish that we had had a bit more of a structured introduction to what the data was actually showing before we plunged into it in order to understand what we were looking at a bit more.

  7. This class was a nice addition to the previous notebook on correlations. A little more guidance on multiple graphs – especially if the numbers change from an even number to an odd, for instance – and dictionaries would have been great.
    Overall, one of the most useful classes in the semester, I thought.

  8. This notebook was super useful and it was great to learn how to format multiple graphs so you can look at them in one picture. I had a difficult time with the homework in trying to change the names of the bacteria; it would maybe be helpful to do something like that in class before the homework.

  9. This notebook was especially involved but ultimately very helpful! The use of dictionaries, for loops and sorting to get the job done left me feeling really excited about my abilities to code. The written notes in this notebook were particularly helpful, as it could be easy to get lost in error messages. I played with graph formatting at the end and was able to present the data very professionally.

  10. It was very interesting to be able to apply what we have learned to a real data set. I think this was the first class where I realized that you can create a simple code and use a for loop to apply it to an enormous amount of data. Creating a dictionary of the names and cycling through them in the for loop was an important lesson (and I used it on my final project). I would say that the hardest part of this class was organizing the graphs. I think maybe dedicating some time to really go in depth over what “fig” is, and how to use subplots, would be helpful for the future.

  11. This was the first time working with a set of data, and I found this notebook helpful not only at the time but also for referring back to when trying to work with a data set for my final project. There were a lot of different ways to complete the task, which was a bit overwhelming, as it was hard to choose where to start. Looking back at my own notebook for this class, I see a few things that I still don’t firmly understand, such as making different numbers of graphs, but I was forced to try again to understand them when doing my final project.

  12. This lesson was especially tricky but very useful.
    I learned the following skills:
    1. How to sort the results in the PDF file from the highest r-squared value down, using dictionaries.
    2. How to plot multiple graphs in a for loop. To plot 4 plots in a 2×2 matrix we used this handy code: ax=axtemp[i%2][i//2] (in Python 3 the integer division // is needed, since i/2 gives a float that cannot be used as an index).
    3. How to shorten column headers with very long names using dictionaries.
    As a suggestion, it would be good to discuss how to plot multiple graphs on different matrices (ex: 2×3).

  13. I really liked this lesson because I was able to see the power that dictionaries can have when used correctly. I used multiple dictionaries and that made my code cleaner. I would recommend that when using a dictionary you write one going from A->B and a second one going from B->A so you can go back and forth between them. This will be especially helpful when working with data that has really long names.

  14. I really enjoyed this lesson because it was the first time that we really took a set of data and had to analyze any relationships within it. What I found most challenging was figuring out how to space out my graphs for the top four and top six. The y axis labels were always being cut into by the previous graph. It would have been helpful to have some more hints on how to arrange the graphs.

    • I agree that arranging the multiple graphs is tough. I will try to think about how to teach that better. I have added some more notes but I know I need more.
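On arranging the graphs, here is one possible way to lay out any number of panels, sketched with made-up data. The names n_plots and n_cols are just placeholders for this example. The idea is to turn the loop counter into a row and a column with divmod, hide any leftover panel, and let tight_layout space things out, which usually keeps the y-axis labels from being cut into by the neighboring graph.

```python
import numpy as np
import matplotlib.pyplot as plt

n_plots = 5                       # try 4, 5, or 6
n_cols = 3
n_rows = -(-n_plots // n_cols)    # ceiling division: 5 plots in 3 columns needs 2 rows

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 15)        # made-up x values standing in for hematite

# squeeze=False keeps axes two-dimensional even for a single row or column.
fig, axes = plt.subplots(n_rows, n_cols,
                         figsize=(4 * n_cols, 3.5 * n_rows), squeeze=False)

for i in range(n_rows * n_cols):
    row, col = divmod(i, n_cols)  # same as i // n_cols and i % n_cols
    ax = axes[row, col]
    if i < n_plots:
        ax.scatter(x, rng.normal(size=x.size))
        ax.set_xlabel("Hematite")
        ax.set_ylabel(f"species {i} (% abundance)")
    else:
        ax.set_visible(False)     # hide the unused panel when the grid is not full

fig.tight_layout()                # adjust spacing so labels do not overlap panels
```

Changing n_plots to 4 and n_cols to 2 gives the 2×2 figure, and 6 with 3 columns gives the 2×3 one, without touching the loop.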
