Class 13 – Correlations Continued – p-hacking

Here is a data set we recently published: Gnanaprakasam-mBio-2017-2j44e56. This is the figure I made for the paper. I did field work with a colleague in Britain, who then used Illumina sequencing to look at species diversity in Bangladesh sediment. He did 16S rRNA analysis of the DNA and found the percent abundance of each species. I then analyzed the water and sediment to get the arsenic content and the amount of an Fe(III) mineral called hematite. My colleague wants to know which bacterial species he should investigate further, so we now want to look at more species and see whether they correlate with the mineral hematite or with aqueous arsenic. He is asking if you can give him a list of 4, and then 6, species to investigate further.

Your mission, if you choose... is to p-hack this data. Tell me which species might be important and which we should investigate further. You do not need to do both arsenic and hematite; choose one and run with it. As a hint, hematite has fewer data points but better correlations.
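Since p-hacking means running many tests and keeping the ones that look good, it can help to see how often pure noise clears the p < 0.05 bar. The sketch below does not use the class data. It assumes 20 samples (a made-up number) and 268 candidate columns (an assumption based on column indices 10 through 277), fills them with random numbers, and counts how many regressions come out "significant" anyway.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
n_samples = 20                   # hypothetical number of sediment samples
n_species = 268                  # assumed: columns 10 through 277

# A made-up "mineral" measurement and species abundances that are pure noise.
mineral = rng.normal(size=n_samples)
species = rng.normal(size=(n_species, n_samples))

# Regress every noise "species" on the noise "mineral" and keep the p-values.
p_values = np.array([linregress(mineral, row).pvalue for row in species])

n_significant = int((p_values < 0.05).sum())
print(f"{n_significant} of {n_species} unrelated columns have p < 0.05")
print(f"smallest p-value: {p_values.min():.4f}")
```

With a 0.05 cutoff, roughly 5% of unrelated columns are expected to pass by chance, so a low p-value found by searching hundreds of columns is a lead to follow up on rather than proof of a real link.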

  • Choose whether you will correlate with aqueous arsenic or hematite.
  • Plot 1. Make just one plot of either arsenic or hematite versus one microbial species. To make this plot, set y=df.columns[##], where ## is a number of your choosing between 10 and 277 (the number of columns). Indexing df.columns with a number gives you the column name, so you can use df[y] for the data and print(y) to see the name. After you make the plot, report in the figure caption whether the correlation is significant and say 1-2 sentences about the bacterial species. You can just google the final parts of the bacterial name, and usually one of the top results will explain the main properties of the bacteria.
  • Next, p-hack the data. Search my data to see if any correlations are significant and put the results in one PDF file. This might be big and cumbersome but could be helpful. It will have hundreds of pages! Choose to p-hack with arsenic or hematite.
  • Then you should look at the plots and results.
  • Save the r2 values, sort them, and make two more figures:
    • One with 4 graphs on one figure, for the species with the 4 highest r2 values.
    • One with 6 graphs on one figure, for the species with the 6 highest r2 values.
  • All graphs should have a text box with the linregress results showing the slope, intercept, r2, and p-value. Remember, r2 is just the r-value squared.
  • The data is in Excel this time! Here are links: link, or here: Species-Arsenic-Data-2lq2nkt.
  • The column names are long and ugly from the sequencing; these long names are what are automatically downloaded from NIH/NCBI. Fix this for your plots of 4 and 6 (see below)! This gets you 5%.
  • The second 5% is for making a seaborn correlation matrix for at least 5 columns, including arsenic, hematite, and 3 bacteria.
  • Make sure to label your figures and also give me the answer to the question above.
  • Helpful notebooks for the final 10%
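The search-and-sort steps above might be sketched as follows. This is only an illustration on a synthetic table, not the class spreadsheet: the column names (Hematite, species_0, species_linked), the helper scan_to_pdf, and the output file name are all hypothetical. It drops rows with missing values before each fit, writes one page per species into a single PDF with matplotlib's PdfPages, and returns a table sorted by r2.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic stand-in for the spreadsheet. Every column name is hypothetical.
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({"Hematite": rng.uniform(0, 10, n)})
df.loc[rng.choice(n, 8, replace=False), "Hematite"] = np.nan  # fewer data points
for i in range(10):
    df[f"species_{i}"] = rng.uniform(0, 5, n)
# One species built on purpose to track hematite.
df["species_linked"] = 0.5 * df["Hematite"].fillna(0) + rng.normal(0, 0.2, n)


def scan_to_pdf(df, x_col, y_cols, pdf_path):
    """Fit x_col against each column in y_cols, one PDF page per fit."""
    rows = []
    with PdfPages(pdf_path) as pdf:
        for y_col in y_cols:
            pair = df[[x_col, y_col]].dropna()  # keep rows where both exist
            if len(pair) < 3 or pair[x_col].nunique() < 2:
                continue  # too little data for a meaningful line
            fit = linregress(pair[x_col], pair[y_col])
            r2 = fit.rvalue ** 2
            rows.append({"species": y_col, "n": len(pair), "slope": fit.slope,
                         "intercept": fit.intercept, "r2": r2, "p": fit.pvalue})
            fig, ax = plt.subplots()
            ax.scatter(pair[x_col], pair[y_col])
            ax.plot(pair[x_col], fit.intercept + fit.slope * pair[x_col])
            ax.set_xlabel(x_col)
            ax.set_ylabel(y_col)
            ax.set_title(f"r2 = {r2:.3f}, p = {fit.pvalue:.3g}")
            pdf.savefig(fig)
            plt.close(fig)  # free memory; hundreds of open figures add up
    table = pd.DataFrame(rows).sort_values("r2", ascending=False)
    return table.reset_index(drop=True)


species_cols = [c for c in df.columns if c != "Hematite"]
pdf_path = os.path.join(tempfile.gettempdir(), "phack_demo.pdf")
results = scan_to_pdf(df, "Hematite", species_cols, pdf_path)
print(results.head(6))
```

With the real data, species_cols would presumably come from something like df.columns[10:], and results.head(4) or results.head(6) would give the candidates for the two summary figures. Closing each figure after saving it matters once the loop runs over hundreds of columns.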

YOU WILL NEED TO HAND IN

  • Your Notebook as usual.
  • PDF file of all the results.
  • Plot of 4 parameters!
    • I am not showing the answers yet!
  • Plot of 6 parameters!
    • I am not showing the answers yet!
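One possible layout for the multi-panel figures is sketched below, again on made-up data. The helper names short_name and panel_figure are hypothetical, and the real column-name format is not shown here, so splitting on ";" and keeping the last piece is only an assumption to adapt after looking at print(df.columns). Each panel gets a text box with the slope, intercept, r2, and p-value from linregress.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Made-up data with long, semicolon-separated column names (assumed format).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 25)
df = pd.DataFrame({"Arsenic": x})
long_names = [f"Bacteria;Phylum{i};Class{i};Genus{i}" for i in range(4)]
for i, name in enumerate(long_names):
    df[name] = (i + 1) * 0.3 * x + rng.normal(0, 1, 25)


def short_name(col, sep=";"):
    """Keep only the last piece of a long taxonomy-style column name."""
    return col.split(sep)[-1].strip()


def panel_figure(df, x_col, y_cols, ncols=2):
    """Draw one scatter plot with a fitted line and stats box per column."""
    nrows = -(-len(y_cols) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows),
                             squeeze=False)
    panels = axes.ravel()
    for ax, y_col in zip(panels, y_cols):
        pair = df[[x_col, y_col]].dropna()
        fit = linregress(pair[x_col], pair[y_col])
        xs = np.linspace(pair[x_col].min(), pair[x_col].max(), 50)
        ax.scatter(pair[x_col], pair[y_col])
        ax.plot(xs, fit.intercept + fit.slope * xs)
        label = (f"slope = {fit.slope:.3g}\n"
                 f"intercept = {fit.intercept:.3g}\n"
                 f"$r^2$ = {fit.rvalue ** 2:.3f}\n"
                 f"p = {fit.pvalue:.3g}")
        ax.text(0.05, 0.95, label, transform=ax.transAxes, va="top",
                bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))
        ax.set_xlabel(x_col)
        ax.set_ylabel(short_name(y_col))
    for ax in panels[len(y_cols):]:
        ax.set_visible(False)  # hide any unused grid cells
    fig.tight_layout()
    return fig


fig = panel_figure(df, "Arsenic", long_names)
```

For the 6-panel version, passing six column names with ncols=3 gives a 2 by 3 grid, and fig.savefig with a file name of your choice writes the figure to disk.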

 

DUE Before Spring Break