Class 10 – Start Pandas

Today we are going to start using Pandas more. Today’s files are here: https://github.com/bmaillou/BigDataPython/tree/master/10-StartPandas

We are using part of a data set that looks at arsenic concentrations in Bangladesh.

VIDEO

MAKE SURE TO TAKE THE QUIZ!

Homework

Due next class (Wednesday, February 26, 2025).

  1. Take the data from class today and plot depth on the y-axis and arsenic on the x-axis. I would use ax.scatter(). When you plot something versus depth, you always want depth inverted so that 0 is at the top and the maximum is at the bottom. Use ax. and some Google searching to find out how to do it; if you stop and think about setting ranges, you can also swap your start and stop values. The next step is to put the x-axis ticks on top and to set the label position to the top. Then set the x limits (xlim) so the plot starts at 0. Note that xlim can also go from high to low. If something takes a list, you add a [] inside the (). Luckily, there is more than one way to alter each part of this plot. When searching for answers, I would google something like “matplotlib switch axes direction”. I usually add pandas or matplotlib to the search, and it should show you options.

[Figure: arsenic-depth]
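As a rough sketch of the axis handling described in item 1 (an illustration, not the official solution): the data below is made up, and the column names Depth and Arsenic are assumptions, so check the real column names in the class file before adapting it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the class data; the column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Depth": rng.uniform(5, 300, 200),
    "Arsenic": rng.exponential(scale=60, size=200),
})

fig, ax = plt.subplots()
ax.scatter(df["Arsenic"], df["Depth"])

# One way to invert depth: give the y limits from high to low,
# so 0 ends up at the top of the plot.
ax.set_ylim(df["Depth"].max() * 1.05, 0)

# Start the x-axis at 0 and leave the upper limit as matplotlib chose it.
ax.set_xlim(left=0)

# Move the x-axis ticks and the x-axis label to the top.
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")
ax.set_xlabel("Arsenic")
ax.set_ylabel("Depth")
```

Calling ax.invert_yaxis() instead of passing reversed limits to set_ylim is another way to get the same flipped axis.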

 

  2. Now the second graph is fancier. Take the first graph and color the points as shown below. Do this by calling scatter three times, selecting only the data you want to plot each time you call scatter. Think about the boolean calls we did in class. I am coloring the points by arsenic concentration. Get the first graph nice and this second graph becomes easier. Do not copy my colors. They are good colors for representing the points, but choose three colors and three symbols of your own and see how they work. As a hint, I would first do the less than 10 and then the greater than 50. Then you can do the 10-50, which needs an “and” of some kind. You can use xkcd colors. You use them by saying ‘xkcd:color’. You can also just google color palettes, use the hex designation, or use the standard matplotlib colors.
[Figure: arsenic-depth-color]
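The three-scatter approach with boolean selections could look something like the sketch below. It is only one possible version: the data is made up, the column names Depth and Arsenic are assumptions, and putting exactly 10 and exactly 50 in the middle group is my own choice, since the assignment text does not say where the endpoints belong.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the class data; the column names are assumptions.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Depth": rng.uniform(5, 300, 200),
    "Arsenic": rng.exponential(scale=60, size=200),
})

# Boolean masks: one True/False value per row.
low = df["Arsenic"] < 10
high = df["Arsenic"] > 50
# The middle range needs an "and"; with pandas that is the & operator,
# and each comparison must be wrapped in parentheses.
mid = (df["Arsenic"] >= 10) & (df["Arsenic"] <= 50)

fig, ax = plt.subplots()

# Call scatter once per group, selecting only the matching rows each time.
# The colors show the three naming styles: xkcd name, hex code, matplotlib name.
ax.scatter(df.loc[low, "Arsenic"], df.loc[low, "Depth"],
           c="xkcd:teal", marker="o", label="< 10")
ax.scatter(df.loc[mid, "Arsenic"], df.loc[mid, "Depth"],
           c="#d95f02", marker="s", label="10-50")
ax.scatter(df.loc[high, "Arsenic"], df.loc[high, "Depth"],
           c="purple", marker="^", label="> 50")

ax.invert_yaxis()                      # depth increases downward
ax.set_xlim(left=0)
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")
ax.set_xlabel("Arsenic")
ax.set_ylabel("Depth")
ax.legend(title="Arsenic")
```

Because the three masks cover every row exactly once, each point is drawn in exactly one color.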

 

18 thoughts on “Class 10 – Start Pandas”

  1. This package felt like a turning point in the semester from 2 points of view. First, using pandas felt more powerful compared to what we were using in the previous packages, and it was great to see what new graphs we were now able to make. Second, I felt that the class overall became more intense and fast-paced once we started using pandas: working through the packages started taking longer, and the HW tasks started feeling harder, but still manageable.

  2. I really think this class was a good one, especially as a ‘start’ to pandas, though I was a little confused looking back over it. We obviously use pd.read_csv to import our data, but does any of the other stuff we do for the homework utilize pandas? I expected to use .loc or .iloc specifically as you mentioned in the packet that these are the first great trick of pandas. I thought boolean stuff, which we used, could be done outside of pandas, and slicing arrays and stuff too. Is the way we did that here (dot notation, for example), pandas-specific? Perhaps having a comparison table or examples of how one might do something without pandas vs. with would be helpful in clarifying my confusion, or in exemplifying why exactly we are using pandas in the first place.

  3. I really appreciated the way this class introduced us to multiple forms of handling, viewing, and analyzing big data (without viewing the entire file all at once!). There was a lot of specific notation in here to memorize, but I felt like it was all relevant and important to know and I’m glad we spent 2 classes on this topic. Specifically, I liked how you reviewed the purposes behind all the libraries and packages we always import in the beginning (it’s easy to just forget about/take those for granted), and included the process of graphing, color coding, and adding legends to the big data we’d previously manipulated and analyzed.

  4. As mentioned by others, the Pandas lesson was completely essential in giving me the tools to work with our datasets in a more efficient way, and ultimately really helped when I approached our final project. For me, however, this was the most difficult homework assignment. The functions didn’t feel very intuitive, and I feel like because I lacked extensive background knowledge on coding, simply knowing functions but not having a deep understanding of backend information made it very difficult to debug and troubleshoot when I ran into difficulties working on my homework. I do think it may have helped to slowly introduce Pandas and maybe start incorporating it earlier on, but overall this was a very useful notebook.

  5. Pandas was the most useful tool I acquired in this class. This is not surprising since many classes are dedicated to this library. The first workbook was a great introduction to Pandas, with lots of guidance. It is much different than other tools learned so far, such as NumPy, and it was worth delving deeper into it. If you ever have a dataset to analyze in the form of a CSV file or an Excel file, I highly recommend using Pandas.

  6. I had a hard time understanding at first why Pandas was going to be such an important tool in analyzing data, but this class provided a great introduction to the basics of it and proved extremely helpful. I found myself going back to this notebook while I was working on my final project, which goes to show how valuable the information was.

  7. I definitely found pandas very confusing when we started using it in class. But as we went through the notebook, I was more impressed by how much statistical analysis could be done with relatively simple code.
    I found the lesson on indexing and slicing incredibly useful as well.

  8. This was a great introduction to Pandas; I can’t imagine doing the final project and getting through the semester without being able to use the package. The explanation of different ways to index a pandas dataset was especially crucial; I kept coming back to this notebook for all the future homework exercises and the final project. If this lecture could have been moved any earlier into the semester, and components of it (like the indexing) could have been repeated in subsequent notebooks, I think that would have helped improve my overall understanding and memory of Pandas’ cool functionalities!

  9. This was a great class. It allowed us to use previous skills we had learned in plotting and building legends to create a very visually impressive graph. I found it helpful that you asked us to first build the graph without color-coding it. That made building the colored one less intimidating and more approachable.

  10. This assignment was useful as it taught me the following skills:
    1. How to invert an axis by reversing the range
    2. How to put the x-axis ticks and x-label on top of the plot
    3. How to set the xlim efficiently by generating an array
    4. How to delineate data points by color using boolean expressions

  11. Pandas is a critical tool for python and data analysis, so this class was crucial to my understanding of those topics. I think the in-class packet presented material really well; the occasional messages to take my time and slow down were appreciated while learning all of the quirks of pandas.

    I thought the focus on notation in this lesson was spot on, because there is a lot to learn with respect to pandas notation (in terms of indexing, columns, and masking).

    Additionally, the introduction to data visualization in the packet and the homework was really useful; I found that this class provided a solid foundation that paved the way for the introduction of more advanced topics with pandas.

    I think everything was presented super well, and I can really only come up with a small small way the course might be improved in this lesson for future classes – perhaps go into a little more detail on how dataframe .plot(), .boxplot() functions interact with plt.subplots() and other matplotlib functions and when we should use one over the other. I remember being a bit confused with this notion at first (but picked up on things quickly after).

  12. Class 11 (Old Class 10) was definitely important because Pandas is so powerful and really facilitates the data analysis work that many of us are interested in doing outside of class. It was not too much at once, though. I appreciated how we were allowed a couple of classes to really get introduced to and then dive into Pandas. It was well timed, too. If the introduction to Pandas (and especially if the introduction to correlations in Pandas) was pushed back any later in the semester, I wouldn’t have been able to utilize it for analyzing my thesis data.

  13. I think this lecture is really important, especially since it really set the foundation on how to pick and manipulate big data. I was a little confused at first with the different notations and it took me quite a while before understanding what they really are. I wonder whether it is necessary to go through all the different functions? I feel that it would be more beneficial to stick with one notation (my suggestion would be the dot notation because we ended up using it a lot and it’s just a lot more natural to look for the column heading instead of the location), and expand on it in more detail instead of giving more than one notation but only covering a little bit of each.

  14. I am glad you started the notebook by reviewing libraries and packages we learnt in previous classes. We use these all the time, but it’s good to remind ourselves of the basic definition and function of each. It may be useful to make a similar summary list of all the libraries and packages covered in the course so that future students can use the list for their final project.
    Dot notation has been really useful, though oddly it did not work in one of the later classes (probably the atmospheric CO2 concentration class).

  15. I really enjoyed this class as it was the first real foray into plotting larger amounts of data and creating the graphs really gave me a better way to visualize the information. Also, learning about the best practices in terms of plotting and legends, etc. was really helpful. I think code-academy is really helpful, as was the code.org in giving me extra (but ungraded) practice and forcing me to confront the issues and problem solve when I hit a road block.

  16. Class 10 was definitely one of the most important python lectures – learning to create plots based on certain parameters (like whether or not people drink well water) has been really useful and will definitely be useful in future data analysis projects. I think this lesson balanced guided instructions with meta-cognition very well. The boolean lesson key was especially helpful.

  17. This was a fun assignment, which again required me to think outside the box to understand how to color code based on ranges. A skill like this will come in handy in all data that I work with beyond this class. For data sets, I always have to delineate between different ranges. Being able to represent that with python will be enormously beneficial.
