Class 12 – Correlations in Pandas

We are doing this notebook over two classes. Today we are going to look for correlations (pdf=pandas-correlations) in the arsenic data and make some more nice subplots, and through this we will get more practice with pandas. Then on Wednesday I will show you some tricks with dataframes to make your life easier and make the plots nicer. This sets us up for your homework, which uses a different data set on arsenic and bacteria and is due in a week, so you only have one homework assignment on this topic. We are going to go through the arsenic data set, find the best correlations, and then plot them as either 4 or 6 plots. See the figures below! Then for homework you will take the other data, plot all the correlations, and then plot the best 4 and best 6. In the plots below I added letters to the panels so we can better reference them in the figure caption and make them look publication quality.
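If you want a preview of where we are headed, here is a minimal sketch of that workflow (the file name arsenic_wells.csv and the exact column names are placeholders I made up, not the real data set): loop over the columns, compute r-squared against arsenic with scipy's linregress, sort the results, and plot the best four on a 2x2 grid with letter labels.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Placeholder file and column names -- swap in the real arsenic data set
well_data = pd.read_csv('arsenic_wells.csv')

# r-squared of every numeric column against arsenic ('As')
rsq = {}
for col in well_data.select_dtypes('number').columns:
    if col == 'As':
        continue
    temp = well_data[['As', col]].dropna()         # keep only rows where both values exist
    slope, intercept, r, p, stderr = linregress(temp[col], temp['As'])
    rsq[col] = r ** 2

# Turn the dictionary into a DataFrame, sort it, and keep the best four correlations
best = pd.DataFrame.from_dict(rsq, orient='index', columns=['rsq'])
best = best.sort_values('rsq', ascending=False).head(4)

# Plot the best four on a 2x2 grid, with letter labels for the figure caption
fig, axtemp = plt.subplots(2, 2, figsize=(8, 6))
for i, col in enumerate(best.index):
    ax = axtemp[i % 2][i // 2]                     # row = i mod 2, column = i floor-divided by 2
    ax.scatter(well_data[col], well_data['As'])
    ax.set_xlabel(col)
    ax.set_ylabel('As')
    ax.set_ylim([0, well_data['As'].max() * 1.1])  # the *1.1 leaves 10% headroom above the largest value
    ax.set_title('(' + 'abcd'[i] + ')', loc='left')
plt.tight_layout()
plt.show()
```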

What I am teaching you is called p-hacking, and it is not always looked upon favorably. With programs such as Python and R you can search your data for correlations even if they make no scientific sense. I am sure you have heard of some weird result in the news; that is bad science, and it is p-hacking. But it can be a good start to looking at your data. To understand the controversy, look at the links on this website.

 

Video

Video Quiz 4

ARSENIC BACKGROUND

Arsenic Paper

1 Fendorf, S., Michael, H. A. & van Geen, A. Spatial and Temporal Variations of Groundwater Arsenic in South and Southeast Asia. Science 328, 1123-1127, doi:10.1126/science.1172974 (2010).   fendorf-science-09

PODCAST IS HERE!

Also, here is my PowerPoint on arsenic to give us all some background: Bangladesh-python. In case you want more arsenic information, this article has the latest numbers and predictions.

 

Answers

As-corr14 As-corr6

20 thoughts on “Class 12 – Correlations in Pandas”

  1. This was one of the most useful and one of my favorite notebooks! I found myself looking back on this notebook a lot for my final project and sometimes for other classes.

  2. This was a very helpful class. It was definitely harder and more complex than what we had been doing at the beginning of class, but I was also able to see how what we had done earlier built into what we are doing here (if statements, plotting…). I thought the lecture did a good job going through these things step by step while also leaving enough for me to figure out that I actually was learning and understanding what was happening. The graph formatting and correlations are very useful tools to know how to do.

  3. Lab 12 was one I’ve returned to many times when working on other assignments for this course. The graphs made in this lesson brought a real sense of high-level work; it felt professional to make them myself. In both the notebook and the homework it took me a while to get my maps to print in the matrix of graphs (where there were 9 graphs 3×3). More guidance on this portion of the notebook may be helpful to future students facing similar troubles.

  4. This was a very useful notebook. I think it would have helped if we had discussed p-values and r^2 values more before we worked on plotting them. I think it was really helpful learning how to put more than one graph together because it makes graphs look more professional. Creating a dictionary was a very good learning opportunity. It helped when we needed to rename the bacteria, and also helped me realize how useful this could be for future assignments. Learning about dictionaries also helped me with excel sheets, and with understanding different and efficient ways to store and manipulate data.

  5. I found this particular notebook very useful and returned to it many times throughout the semester. I especially liked the way you walked us through the process, including common errors we may get (and which I did get with different data later in the semester…). Demonstrating how to troubleshoot these potential errors is often the most useful to me. The one recommendation I have is maybe include/show some examples of “bad” p-hacking, or times when this method should not be used.

  6. This was an extremely helpful assignment. My one complaint is that renaming was difficult because some of the hematites had the same end to their scientific names. More clarity should be provided on renaming the hematites if they have the same last word of their name.

  7. This was hands down the most useful class in the whole semester for me, and I can see myself coming back to these concepts a lot. As Professor Brian kept stressing the importance of professional-looking plots, I felt it most strongly here, where with a few blocks of code you can slice through your data in every way imaginable.

    Some areas of improvement that come to mind are the emphasis on the statistical concepts. As a few people already mentioned above, it would help if we could spend some time learning how r^2 values work and ways to prove correlations within the data sets. I can understand how too much statistics might be beyond the scope of this class. Another area of interest could be explaining more about the actual workings of Pandas so that in the future we can modify what we learn to fit our needs.

  8. I agree with many of the comments above. This was a very useful notebook. I would have liked to discuss more about the value of finding correlations, and that the p-value is not the only factor when trying to determine a correlation. I also liked how we put more than one graph together. The explanation of the 2×2 grid was really helpful.
    I also liked when we created dictionaries to rename the items; it was really helpful, as we saw with the bacteria homework that they can have very long names and we need a way to change that.
    I would have liked to discuss more about when r^2 or r is needed for a correlation. For my final project I saw people in the literature using the r value rather than r^2.

  9. I found myself looking back on this notebook a lot for my final project and also for work I did in other classes. This class has a lot of important material, but also a lot of little things that weren’t really explained. As others have mentioned, I never really understood why we used ax=axtemp[i%2][i/2] to plot 4 graphs on a two-by-two grid – after some googling I understood (something about how the axes are a 2×2 array and need to be flattened?), but I think it would be useful to add a sentence in the notes about why it’s necessary rather than simply what it does. Also, as someone else mentioned, when we used ax.set_ylim([0,data[elem].max()*1.1]), the *1.1 wasn’t explained so I don’t know what it does. All in all, I think the classes on pandas were extremely useful and that pandas’ use and abilities should be emphasized so students can more fully understand what it can do and when it should be used.

  10. Understanding the significance of the R^2 value, the underlying calculations behind the statistical modules, and getting multiple plots in the same figure took me a long time to figure out. I absolutely gained a lot from this process and am more confident plotting data.

  11. This class/notebook was very helpful in understanding correlations and how to manipulate/sort pandas dataframes. The only part that was unclear was when we used ax=axtemp[i%2][i/2] to plot the 4 graphs. I see why this line of code is necessary but am confused about how it works exactly. How does it create the 2 by 2 grid? Why do we use ‘axtemp’?

  12. I agree with some of the earlier comments that for a senior thesis or final project, this is one of the most useful exercises and homework assignments. However, I did feel like a lot of what I did for the homework was mostly just copying the chunks of code from the class and not fully understanding each small facet like we did in the beginning days. A few things I did really like:
    -the explanation of the 2×2 grid made a lot of sense, great spot to explain
    -walking through adding the rsq dictionary with for-loops, turning it into a data frame, then sorting. It was super confusing at first but I just re-read through it now and in retrospect it makes a lot of sense and you walk us through it as clearly as you can.
    Things that confused me:
    -Setting my x-limits and y-limits was often hard, and I’m not completely sure why; they often got messed up on my pdf. I saw you did ax.set_ylim([0,data[elem].max()*1.1]) & ax.set_xlim([0,data[xval].max()*1.1]), but the *1.1 part was never really explained. I just went with it, but didn’t totally understand why it was necessary.
    -when we sort the dataframe according to highest r^2 values for the bacteria data sometimes the listed columns or item numbers did not correspond to where the graphs were located on the pdf, so it was hard to find the names to rename them. We never really figured out why this was happening…

  13. I anticipate using this correlations material on my final project on water quality. Also, it was valuable to learn how to display multiple scatter plots so the reader can compare relationships. It was also at this point that I was able to start understanding how Python complements GIS. Next question: how does R compare with Python?

  14. I still don’t understand mod mathematics. A lot of that part was guess and check for me.

    I also was wondering whether there was a way to put three graphs together using different x and y variables for each graph. I wanted to do something like this for my final project but was unable to figure it out and eventually made three separate graphs. I suppose it wouldn’t be possible to do something like this because we’re using a for loop that graphs ys against a common x. I think it might be possible using a second list and for loop.

  15. The most challenging part here for me was actually the nested subplots: I kept getting an error about only being able to call arrays and not being able to call indices for axtemp, which didn’t make much sense to me. It’s an issue I ran into on my final project as well, and there I just set squeeze=False to get around the array issue, but it left me with lots of random white space. Any other ways to approach this?

  16. Looking back at the ipython notebook, I realize that I still don’t understand why we had to do this ax=axtemp[i%2][i/2] to plot 4 graphs on a two-by-two grid. I tried googling it but could not find any information.

  17. Given my senior thesis, this was arguably the most immediately useful class. While working on my thesis analysis, there were so many instances prior to this class where I found myself stumped, and this class basically solved or simplified all those problems. I’m actually a bit confused about the syntax of setting temp=well_data[[‘As’,i]]: why are double brackets required? The function ‘.dropna()’ is one that I think I’ll most likely be using for most future data analyses, and the ‘linregress’ refresher was great. I liked how the lecture built on the past code and walked us through step-by-step. My favorite part though is the fact that we can re-use the final code to pretty much fit any dataset by simply changing ‘data’ and ‘xval.’
    On another note, I’m not sure why, but I keep having to re-run setting the r-squared dictionary and sorting it or the output gives me an error. I thought this would be resolved by the ‘inplace=True,’ but I seem to still run into the same error. Anyone have thoughts on this?

    • The use of double brackets is interesting. Usually to reference a column you just put one name in there, for example well_data['As']. But you can get more than one column at a time. We wanted an x and a y column, so we add two columns; this time I am using i to represent y. Since it is two columns you need to ask for a list, so it becomes well_data[['As',i]]. If you wanted many elements you could do something like well_data[['As','Fe','Mn','S']], and so on. A short sketch of this (and of the i%2 / i//2 grid indexing several of you asked about) is below.
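To tie together the questions that keep coming up in the comments, here is a minimal sketch (the tiny well_data frame below is made-up illustration data, not the class data set). It shows what single versus double brackets return, what .dropna() removes, and why ax=axtemp[i%2][i//2] picks one panel of a 2×2 grid; note the integer division //, since i/2 returns a float in Python 3 and cannot be used as an index.

```python
import pandas as pd

# Made-up stand-in for the well data
well_data = pd.DataFrame({'As': [10, 20, None, 40],
                          'Fe': [1.0, 2.0, 3.0, 4.0]})

# Single brackets return one column (a Series); double brackets take a
# list of column names and return a DataFrame with just those columns.
one_col  = well_data['As']                    # pandas Series
two_cols = well_data[['As', 'Fe']].dropna()   # x and y together, rows containing NaN dropped
print(two_cols)

# The grid indexing: plt.subplots(2, 2) returns a 2x2 array of axes, so
# axtemp[row][column] picks one panel.  For i = 0, 1, 2, 3:
for i in range(4):
    print(i, '-> row', i % 2, ', column', i // 2)  # i // 2 is integer division
```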

Leave a Reply