Today we are going to try to automate pandas (https://github.com/bmaillou/BigDataPython/tree/master/11-morePandas). Let's see how we do. Our first goal is to make a plot of every parameter versus depth. The target output is called filetest.pdf on GitHub; click "View raw" to get the file. These are the subplots we are also going to try to make. Let's see how far we get! Yours should look much better!

VIDEO

Homework Due: Next Class

(Monday, March 3, 2025)

Make a pdf file that contains a plot of As versus every other parameter. Make sure it is labeled nicely and in one pdf file. Arsenic should be on the x-axis and the other parameters should be on the y-axis.
1. Each parameter should be on one page.
2. Make sure your axes go in the correct direction.
3. Don’t leave in leftover code that messes you up.
4. You will need to hand in your pdf file of results.
5. You do not need to color by arsenic. It can be all one color!! Since you are plotting against arsenic the colors become redundant. But choose a good symbol and color.
6. The column “Drink” is an object and not a float. It plots okay when on an x-axis but not a y-axis. If your code crashes you need to skip this column.
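One way to put these steps together is sketched below. It is only a rough outline, not the packet's solution: the small data frame is made up so the code runs on its own, the column names other than As and Drink are assumptions, and the output file name is hypothetical. It uses PdfPages from matplotlib to write one page per parameter and skips any column that is not a float. Flipping the Depth axis is my guess at what "the correct direction" means.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from matplotlib.backends.backend_pdf import PdfPages

# Made-up stand-in for the class data set (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "As": rng.uniform(0, 300, 20),
    "Depth": rng.uniform(5, 100, 20),
    "Fe": rng.uniform(0, 10, 20),
    "Drink": ["Y", "N"] * 10,  # object column, skipped below
})

out_path = Path("as_vs_parameters.pdf")  # hypothetical file name
pages_written = 0

with PdfPages(out_path) as pdf:
    for col in df:
        # Skip arsenic itself and anything that is not a float (e.g. Drink).
        if col == "As" or df[col].dtype != float:
            continue
        fig, ax = plt.subplots()
        ax.scatter(df["As"], df[col], marker="o", color="tab:blue")
        ax.set_xlabel("As")
        ax.set_ylabel(col)
        ax.set_title("{} versus As".format(col))
        if col == "Depth":
            ax.invert_yaxis()  # assumption: depth should increase downward
        pdf.savefig(fig)  # each savefig call adds one page to the PDF
        plt.close(fig)    # close the figure so leftovers do not pile up
        pages_written += 1
```

Closing each figure after saving it keeps old plots from leaking into the next page, which is the usual way leftover code "messes you up" in a loop like this.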
Make a plot with subplots that is As versus at least 3 other parameters each on their own graph but all on one page. You can decide if it is better to share an x or share a y axis. But Arsenic is the independent variable.
1. Save this graph to a jpeg and turn that in also.
2. For this, your graphs should be square. They should be labeled with letters, as in the packet.
If a graph is a weird shape or scrunched that is not a good thing. You need the data to look nice. I use fig.set_size_inches(10,15) where the 10 and 15 are the paper size. You can play with the numbers to get a good looking figure.
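Here is a minimal sketch of the subplot figure, again with made-up data. The parameter names (Depth, Fe, Mn), the 15 by 5 inch size, the choice to share the x-axis, and the file name are all assumptions you should adjust for your own data. The call ax.set_box_aspect(1) is one way to force square panels; it needs a reasonably recent matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the class data set (column names are assumptions).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "As": rng.uniform(0, 300, 20),
    "Depth": rng.uniform(5, 100, 20),
    "Fe": rng.uniform(0, 10, 20),
    "Mn": rng.uniform(0, 5, 20),
})
params = ["Depth", "Fe", "Mn"]

# One row of panels sharing the arsenic x-axis.
fig, axes = plt.subplots(1, len(params), sharex=True)
fig.set_size_inches(15, 5)  # width, height; play with these numbers

for ax, col, letter in zip(axes, params, "abc"):
    ax.scatter(df["As"], df[col], marker="s", color="tab:green")
    ax.set_xlabel("As")
    ax.set_ylabel(col)
    ax.set_box_aspect(1)  # square panel regardless of the data ranges
    # Panel letter placed in axes coordinates (0 to 1) so it stays in the corner.
    ax.text(0.05, 0.92, "({})".format(letter), transform=ax.transAxes)

fig.tight_layout()               # keep axis labels from overlapping
fig.savefig("as_subplots.jpg")   # hypothetical file name
```

Sharing x makes sense here because every panel has arsenic on the x-axis; the y-axes are left separate since each parameter has its own range.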
When you make a pdf file or any file for class make sure to turn the file in on courseworks.

HINTS

The Drink column is going to cause scatter to crash, so you will need to avoid it. There are multiple methods. For example, you could use an if statement and say

if col != 'Drink':
    pass  # then do stuff

A more robust method might be to check that the parameter is a float.

if df[col].dtype == float:
    pass  # then do stuff

AN EXAMPLE

for col in df:
    if df[col].dtype == float:
        print('{} is a float'.format(col))
    else:
        print('{} is NOT a float'.format(col))
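One thing to watch with the float check: a column of whole numbers is read in as an integer, so dtype == float will skip it even though it plots fine. The sketch below compares the float check with is_numeric_dtype from pandas.api.types, which accepts both integers and floats. The tiny data frame is made up for illustration and its values are not from the class file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data: As is float, Depth is integer, Drink is text.
df = pd.DataFrame({
    "As": [12.0, 150.5, 48.2],
    "Depth": [30, 45, 60],
    "Drink": ["Y", "N", "Y"],
})

# The check from the hint: only float columns pass.
float_only = [col for col in df if df[col].dtype == float]

# A broader check: any numeric column (integer or float) passes.
plottable = [col for col in df if is_numeric_dtype(df[col])]

print(float_only)  # ['As']
print(plottable)   # ['As', 'Depth']
```

If all of your parameters are floats, the two checks give the same answer and either one will keep Drink out of your plots.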

14 thoughts on “Class 11 – More Pandas”

I think we made a huge leap in our understanding of Python’s and Pandas’ capability with this lesson. Automating correlations is a great way to practice for loops while figuring out how to slice and substitute properly, as well as learn how to most efficiently articulate the regression code. All of those skills come up again and again as I discovered in my final project. This lesson also introduced the way to cull a new data frame from an existing one – reflecting on the relevance given our final poster work, I think it would be helpful to specify when it’s best to just make a new data frame object by setting a new variable name to a subset of the data versus just create a new df. What is the nature of the data type, in each case, and what are the attendant limitations when using each?

This class really transformed the way I look at this course and helped me see the applicability of Python to real life. I think the concepts we learnt in class today, are those that I will be taking back at the end of the class and using in my work. Really wishing I knew these cool tricks in high school for science lab reports! My only concern was that sometimes new vocabulary showed up in the packet like “sharey” which caught me off guard but a quick google search does the trick 🙂

I liked how this notebook made us reflect on previous notebooks. This made me practice for loops again. I think a better explanation on the steps to take in order to make a file with all the parameters would help in trying to understand what is supposed to be done. I had difficulty in understanding where the code should be placed and if there were additional steps to take besides the one mentioned because the output was not clearly shown on the packet. I understand that the output cannot be really shown because it would take a lot of paper but maybe describing the output would help.

This was a useful assignment, and I have actually made similar plots using similar code in outside of class research. It was a little tricky figuring out how to use the for loop to graph each parameter on a separate page, but overall, not too bad.

I thought the pandas lesson was great, and I really appreciated how it was split into two parts so that we could really get a good grasp of the foundational python analysis. The pandas packet went step by step and was very detailed and helpful for understanding a new concept. I found the homework to be really great as a combination of useful basic enough to do at the beginning of the learning process with pandas.

As stated by others, the introduction of dictionaries would have helped with this lesson but besides that, I found that this class and the p-hacking class were the most interesting and helpful of the semester.

Honestly, these Pandas classes were perhaps the most practical learning parts of python and of reading in data. I agree with others, on learning dictionaries, and I think learning groupby would also be helpful here.

I wish I better understood what dictionaries are and how they could be used in different contexts. I first learned about dictionaries through code academy I think — and code academy showed different ways to manipulate dictionaries to many different things. I think they are quite a useful tool and wish we had a whole class dedicated to them (like for loops and lists and linspace).

Brian Mailloux says:

October 30, 2017 at 4:28 pm

I agree about the dictionary. But as we use Pandas more we don’t need them that much. So I made a choice to focus on other areas instead. I will think about this more for next time!

I found this class and homework really useful, especially as it relates to the data analysis I will be conducting in my environmental science classes. It was a great tool to be able to quickly plot certain things and instead of having them in different files, to have them all in one pdf so it is easily accessible. The classwork correlated with the homework which was a great way to reinforce the skills.

The first part of this assignment was unclear to me. I ended up making one graph that has As on the x axis and all of the other parameters on the y axis. This didn’t really make sense to me because the graph seemed useless (it was so crowded that you couldn’t really see anything). I did like learning how to save plots as pdfs and I think that it will be a very useful skill to know in the future!

I wanted to make a general point about code academy but didn’t know where to put it. I feel like Code Academy lacks support in many of the lessons. The hints in the lower left corner were sometimes helpful but especially with the bigger project pieces I had to scour the internet to find more substantive help. For a lot of the smaller pieces and for understanding what you can do in python, it was very helpful, but it still lacks enough features to make it a great learning tool.

This class was important because it was the first time we really combined for loops and pandas slicing to read and graph the data. I found myself constantly referring to this section for help with later assignments. I would also agree that dictionaries should be taught earlier. In a way, I had to re-think what I was doing once I later incorporated dictionaries.

It would have been nice to learn how to do dictionaries for this class section instead of later when we are doing correlations. Overall, it was really useful to learn how to make pdf files and various panda functions.

Big Data with Python

Barnard | Department of Environmental Science | Professor Brian J. Mailloux

Class 11 – More Pandas
