Class 19 Groupby and Categorical Data

Class 19 (Maybe).   Groupby and Categorical Data

I numbered this class 19 so that it lands in the correct spot on the website. This is the kind of data I have the most trouble wrapping my mind around. It comes up when you have a lot of categorical data describing each point, so for each data point you know many other details, and on top of that we have a time series of points. We are going to use the tree data from around campus that Professor Terryanne Maenza-Gmelch and Professor Rodriguez have been collecting since 2015. Here is a map I made of the data. Click on a point and you will see the graphs. We are going to make all of those plots! You could make the same map yourself, and we can do that in the future. We are just short on time.

BIG WARNING: Some people with a newer Python install (and so a newer pandas) get an error when doing the first mean on the groupby. You need to update it to this

df.groupby('old tree number').mean(numeric_only=True)

I will update the packet and GitHub soon
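To see what the keyword does, here is a minimal sketch on a made-up table. Only the 'old tree number' column name comes from the line above. The other column names and all of the values are assumptions for illustration, not the real campus tree data.

```python
import pandas as pd

# Made-up stand-in for the tree data: one text column, two numeric columns.
df = pd.DataFrame({
    "old tree number": [1, 1, 2, 2],
    "species": ["oak", "oak", "elm", "elm"],   # hypothetical column
    "dbh_cm": [30.0, 32.0, 18.0, 19.0],        # hypothetical column
    "year": [2015, 2016, 2015, 2016],
})

# numeric_only=True tells pandas to skip text columns such as 'species'
# instead of trying (and failing) to average them.
tree_means = df.groupby("old tree number").mean(numeric_only=True)
print(tree_means)
```

The result has one row per tree and keeps only the numeric columns, which is usually what you want before plotting.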

 

Files we need

Homework Due TBA

  • Hand in a Jupyter notebook where you calculate the percentages of species. Put your answer in the notebook.
  • Hand in the notebook and pdf of all the growth fits. Your name needs to be on each graph, either in the title, an axis label, or the data box. The graphs are mad nice!
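As a warm-up for the percentages, here is a small sketch of one way to turn a categorical column into percentages. The column name "species" and the ten made-up trees are assumptions. Your real column name may be different.

```python
import pandas as pd

# Hypothetical species column; the real data has many more trees.
trees = pd.DataFrame({"species": ["oak", "oak", "elm", "maple", "oak",
                                  "elm", "oak", "maple", "oak", "elm"]})

# value_counts(normalize=True) returns fractions that sum to 1,
# so multiplying by 100 gives percentages.
species_pct = trees["species"].value_counts(normalize=True) * 100
print(species_pct.round(1))
```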

Class 23 – Final Poster Presentations

Past Poster Session!

We are going to end the class by having poster presentations.  We are doing this to prepare you for scientific conferences and your senior thesis.  It is really fun and inspiring to talk to your colleagues about their research. This will take place on the 4th floor of Altschul Hall during the last class (Photo above from a few years ago).

  • You are going to create a poster using a data set, and the poster will describe the data set and your analyses. I will be posting datasets on this page that can be analyzed. But if there is an interesting dataset from your work/thesis/life you would like to analyze, you can use it. Just make sure we talk before you begin so we can confirm it is appropriate. Re-analyzing a lab report or data you have already analyzed is not appropriate.
  • Each person needs to make their own poster. Multiple people can work on the same data set and talk about the analysis as they do it in Python. But each person needs to create an individual poster with their own hypothesis and thought process.
  • We will do the poster session on the final class day during class time.  Deadlines will be strictly enforced for the Poster session!
  • We will set aside class time to work on the posters.

Due Dates

  • April 3 – Topic due
    • This will be one sentence on courseworks describing what you hope to study and where you think you can find the data.
  • April 10 – Hypothesis and dataset due
    • What is your initial hypothesis/question? Do you have access to the data? What is your first step?
  • Sunday, April 28, 12:00 pm (noon) – Posters due. Hard deadline. Each hour late counts as one late pass day.
    • You will post a powerpoint and pdf of your poster on courseworks.
  • Monday, April 29, 10:10 – Poster presentations, 4th floor Altschul. Please come ten minutes early so we can start on time. Bagels will be served.

Overall Goals and Details

Making a poster from real data is really hard. You are going to hit roadblocks you cannot imagine. Your goal is to ask a question and develop a hypothesis to test using data. I do not know how far you can take this. Some datasets are so messy that getting them read in is a triumph. Sometimes you need to merge two datasets. Other times you need to really delve into the data to even understand what questions are feasible. It is going to be an amazing journey. Each time you learn something about your data it should open 100 more doors!

Data Sets

  • You can use your senior thesis data or data from a summer internship. But you need to do a new analysis and can’t repeat something you have already done.
  • Covid and College student well being.  I have a dataset.
  • Something that interests you.
  • Tree Growth Around the Barnard Campus
  • Bird Diversity at Black Rock Forest
  • Brooklyn Lead dataset
  • We have a dataset of ~800 deep wells from Bangladesh.
  • Brian’s Time Series Data From Bangladesh. We have two questions we want to answer. The first is whether arsenic is changing over time. For this we can just focus on the B wells for now. The second is about chloride to bromide ratios. These are indicators of fecal contamination, and we want to know how they vary with depth, location, and time. The data is on courseworks along with the well information.
  • New York City Tree Census
  • CitiBike (This is hard because the datasets get too large)
  • More ideas to come.

 

Elements of a Poster

  • We will do a modified version of what is posted on the Senior Seminar Website.  Here is my example poster. LastnameF_poster-21.  Your Poster MUST BE THIS SIZE!  YOU MUST USE THIS TEMPLATE!
  • You will have a
    • Title
    • Abstract
    • Introduction
    • Goals or hypotheses (maybe)
    • Methods
    • Results
    • Discussion
    • Conclusion
    • References
  • Abstract-This is a brief overview.  Since we are on small paper we will keep it to less than 100 words
  • Introduction-Give a brief background of your problem and the issues. This can be bullets or a paragraph.
  • Goals or hypotheses -I like one to three bullet points that succinctly say what you are trying to accomplish.  This can then be tied to the conclusions.
  • Methods-You should describe what you did.
  • Results-We want some awesome graphs!!!!!!
  • Discussion-What do the results mean? Why are they important?  What is the bigger context?
  • Conclusions-What are the main take home points?  You can always tie back to your Goals and hypotheses.
  • References-if you used any include them.
  • The best way to learn about what should go into each section is examples.  I will put up some posters around the classroom.
  • TALKING ABOUT YOUR POSTER-One of the most important parts of a poster is talking about it to your audience. Can you explain it in 1-3 minutes and then answer questions? Basically, can you explain your work and sell it?

Poster Printing

  • I have made an example 42″ x 21″ poster.  LastnameF_poster-21
  • You can use this powerpoint as a template and start from there.  Fill in your data and your text and change it as needed.  Add your own departmental logos!
  • Do not add background color or background picture as those are hard to print.  For example do not make the poster black.  Or do not make the poster into a big picture of New York City.  These posters look great but we don’t have enough toner or the time to print them.
  • Please do add lots of color to your graphs and pictures to help the poster.
  • When I print the poster I will need a pdf file.  So when you hand in the poster you will turn in a pdf file.  To do this go to save as in powerpoint and choose a pdf.

Poster Session:

  • The poster session is really fun. Anytime you can talk to your colleagues about data and results, just enjoy it. Offer ideas, be supportive, learn something.
  • We will break up the class into 3 groups. One group will present their posters while the other two groups go around and talk to the presenters. We will repeat this three times, 20 minutes each. For example, for 20 minutes group 1 will present while groups 2 and 3 walk around looking at the posters and asking questions.
  • You will provide feedback to the presenters. For four posters you will present a “glow” and a “grow”. Think about what you would like someone to tell you to help you improve.

 

Goals and Grading Rubric

  1. I will ask each person to give feedback on how they felt they did on the final project.
  2. I want each person to really try to get into the data.
    1. Did you try to understand it?
    2. Were you able to get it read in and manipulated?
    3. Were you able to make plots of the data?
    4. Did you hit a wall but then work out how to get around it?
  3. Were you able to make an iPython notebook that explained what you did and that analyzed the data?
  4. Is the iPython notebook self-explanatory?
  5. Your final iPython notebook and data
    1. Was it well commented with comments and markdown?
    2. Was it easy to follow and understand?
    3. Did it work?
    4. Were the figures “pretty”?
  6. Were you able to use what you learned this semester on a real world problem and data set?
  7. I am going to make a poster grading rubric based on the elements above.  Every student is going to have to comment on and give feedback on 4 other posters.

Poster Session

  1. The poster session is going to be held on the time allotted for finals.
  2. Here is the poster session Rubric we will use to critique posters.  PosterSessionRubric

 

Students and Their Projects

  1. Student 1
  2. Student 2

Groupby

  • Many data sets have subgroups in them.  This happens as our data sets get bigger.  You will want to compare different groups within a data set.  This could be by many things but examples include different wells, different sampling locations, different days, different people.  The list is endless.
  • This is similar to the idea of Pivot Tables.  https://en.wikipedia.org/wiki/Pivot_table
  • Pulling out parts of data sets can be tedious with if statements.
  • Luckily Pandas makes it easy. The function you need is groupby. Here is an example online.
  • I also made a short notebook to show you how it works. Here is the notebook and here is an excel file, but you will have to change the file name or get the data on GitHub.
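To make the idea concrete, here is a minimal groupby sketch using wells as the subgroups. The well names, column names, and numbers are all made up for illustration. They are not from the notebook or the excel file.

```python
import pandas as pd

# Made-up measurements from three wells (names and values are illustrative).
df = pd.DataFrame({
    "well": ["A", "A", "B", "B", "C", "C"],
    "depth_m": [10, 20, 10, 20, 10, 20],
    "arsenic_ugL": [50.0, 150.0, 5.0, 15.0, 200.0, 300.0],
})

# One line replaces a chain of if statements: split by well, then summarize.
summary = df.groupby("well")["arsenic_ugL"].agg(["mean", "max", "count"])
print(summary)

# You can also loop over the groups, for example to plot each well separately.
for name, group in df.groupby("well"):
    print(name, len(group), "samples")
```

The same pattern works for any categorical column: sampling location, day, person, and so on.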

 

Classes 21 and 22 – NetCDF

This is our second-to-last assignment.

We are going to expand on our maps and NetCDF to make an animated GIF.

These are all really helpful skills to have going forward.

Mapping is a critical part of earth sciences. A lot of people will use Google Maps and ArcGIS for their mapping. These are great tools. What I am also learning is that Python has a lot of tools within it that make mapping very easy. Plus you can make professional looking maps. Plus if you have multiple parameters in your pandas dataframe it is easy to loop and plot them all. This is a new mapping segment, as I changed what I used to do to update it and help us make publishable maps. I am still organizing, so here is a list of files we will be using.

Files.

NetCDF is now becoming a common file type in the earth and atmospheric sciences.  After doing today’s workbook your homework is to make something similar to the animated gif below.  I put a ton of hints in the notebook and it really brings together everything you have learned.  Be patient.  This is a really big data set.  In addition, we are just using numpy arrays for this.  No need to use pandas!

Homework

Due See Courseworks

1. Make the animated Gif below. But you need your name on it somewhere so I know it is yours and not copied from someone else. Also, make sure to write a brief description of what you observe in your Gif. Do you see any seasonal patterns? A note on resolution: I was losing resolution on my writing, and to get it to work I needed to make the figure full size. I also learned how to get better resolution on the words. I did this: fig.savefig(filename, dpi=50, bbox_inches='tight', facecolor='white'). When reading stackoverflow someone said putting a background on a png improves the letter quality, and it really does help.
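Here is a rough sketch of how the savefig call can sit inside a loop that writes one frame per time step. The random array is only a stand-in for the NetCDF data, the colormap and color limits are my assumptions, and the frames go to a temporary folder so nothing in your working directory is overwritten. Turning the frames into a Gif is left to the notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw without a window; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Fake (time, lat, lon) array standing in for the NetCDF temperatures.
rng = np.random.default_rng(0)
temps = rng.normal(15, 5, size=(3, 18, 36))

out_dir = tempfile.mkdtemp()   # frames go in a temporary folder
frame_files = []
for i in range(temps.shape[0]):
    fig, ax = plt.subplots(figsize=(6, 3))
    image = ax.pcolormesh(temps[i], cmap="coolwarm", vmin=0, vmax=30)
    fig.colorbar(image, ax=ax, label="Temperature (C)")
    ax.set_title("Frame {} - Your Name".format(i))
    filename = os.path.join(out_dir, "frame_{:03d}.png".format(i))
    # facecolor='white' gives the text a solid background in the png
    fig.savefig(filename, dpi=50, bbox_inches="tight", facecolor="white")
    plt.close(fig)             # free memory before the next frame
    frame_files.append(filename)
```

Fixing vmin and vmax keeps the color scale the same in every frame, which matters once the frames are played back to back.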

 

IGNORE BELOW THIS

Overview

This is our last assignment……..

We had to skip mapping but I will show you a little and this is a really fun project.  …. One of the things I wanted you to learn was how easy it was to install packages. We learned this with netCDF.

We need to install Basemap today.

conda install -c anaconda basemap

Then in our notebook we need to add these lines at the top maybe.

# On Macs
import os
os.environ['PROJ_LIB'] = '/anaconda3/share/proj/share/proj'

First try without these lines. Then if you are on a Mac, try them. If you are on a PC and it doesn’t work we need to look at your screen.

NetCDF is now becoming a common file type in the earth and atmospheric sciences.  After doing today’s workbook your homework is to make something similar to the animated gif below.  I put a ton of hints in the notebook and it really brings together everything you have learned.  Be patient.  This is a really big data set.  In addition, we are just using numpy arrays for this.  No need to use pandas!

Homework

Due See Courseworks

  1.  Make a map of surface air temperatures on the day you were born and mark where you were born.

2. Make the animated Gif below. But you need your name on it somewhere so I know it is yours and not copied from someone else. Also, make sure to write a brief description of what you observe in your Gif. Do you see any seasonal patterns? Also, I am not sure why I lost resolution on my writing. I am working on it…

 

3. For the final 10% bonus, you need to make a second animated gif. I have two examples for you below. Make ONE of them! I think the sea ice one might be more straightforward. Remember the data is from our notebook. To get to the data go here to NOAA and then scroll up and hit ‘select see list’ and then ‘see list’ again.

  • skip this one.  We can make the second animated Gif that compares 2014 and 2015 to see if we see the impact of El Nino.  Mine is below.  If you want you can choose two other years to compare.  It is not much harder………  But it takes some thinking and really slowed down my computer.  Give it a shot!  If you make it describe the difference you see.
  • Make an animated gif of the sea ice extent for one year with a polar projection.  From basemap you know how to make a polar projection.  Then it becomes very similar to what you have already done.  You just need to download the sea ice data from the same website.  The units are from 0-1 and are in fraction of sea ice.  So a 1 means 100% of the sea has sea ice.  I changed my colors this year in the example from class and I think I like it much better.

 

 

 

SKIP THIS ONE

elnino

Classes 18 and 20 – Mapping

Overview

Mapping is a critical part of earth sciences. A lot of people will use Google Maps and ArcGIS for their mapping. These are great tools. What I am also learning is that Python has a lot of tools within it that make mapping very easy. Plus you can make professional looking maps. Plus if you have multiple parameters in your pandas dataframe it is easy to loop and plot them all. This is a new mapping segment, as I changed what I used to do to update it and help us make publishable maps. I am still organizing, so here is a list of files we will be using.

Files.

Mapping Homework.

Due- Monday 8, 2024

  • Map 1. Make a worldwide map with 5 places you like to visit marked with points and labeled. To do this, make an Excel sheet, then read in the Excel sheet and map it. Choose a projection and type of map you like. You need to read your data from a file.
  • Map 2. The temperature on the day you were born, with a star showing where you were born and a title that correctly converts the time units to the date.
  • Bonus 10%. Make a map in folium. Here is the Folium pdf that is in the mapping folder (Map3-Bonus-Folium-contextilly). Take one of your 5 places from Map 1. Find 5 places in that location you want to visit. Using an Excel file for your data, make a folium map showing those 5 new places you are going to visit in that location. Make sure to name each place in the popup/circle on the folium map. For example, if you wanted to visit the Twin Cities in Minnesota you could make a map with the Mall of America, St Anthony Falls Lab, etc.
  • For the homework you need to turn in each excel file and the html file for folium.
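For the Map 2 title, here is a small sketch of converting a NetCDF-style time value into a calendar date using only the standard library. I am assuming the units string is "hours since 1800-01-01 00:00:00" and I picked the time value so the answer is easy to check. Read the real units and value from the time variable in your own file. The netCDF4 package also has a num2date function that does this conversion for you.

```python
from datetime import datetime, timedelta

# Assumed units string; read the real one from the file's time variable.
units = "hours since 1800-01-01 00:00:00"
time_value = 1753152.0   # made-up value chosen for illustration

# Split the units string into the step size and the reference date.
step, _, reference = units.partition(" since ")
start = datetime.strptime(reference, "%Y-%m-%d %H:%M:%S")
date = start + timedelta(**{step: time_value})

title = date.strftime("Surface air temperature on %B %d, %Y")
print(title)
```

Here 1753152 hours is 73048 days, which lands exactly on January 1, 2000, so you can check the arithmetic by hand.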

Classes 16 and 17 – CO2 Time Series

Overview

(Class numbers may be slightly off. Don’t let that impact you. We are going in numerical order even if a number gets skipped. Changing headers messes up the website.)

Pandas is known for its time series capability, where you make the index the time. We are going to do this with CO2 data. First we will analyze the data from the La Jolla Pier. Here is the website that the data comes from. I made a CO2 folder in GitHub. I will walk you through how to do time series analysis with Pandas (pdf=pandastimeseries). Work through how to do this. Take your time. Then you will do your own analysis from Mauna Loa, building on what you learned. This is our work for the week.

Try this link http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/flask_co2/daily/daily_flask_co2_ljo.csv
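Before touching the real file, it can help to see the "index is the time" idea on a tiny synthetic record. Everything in this sketch is made up (the dates, the trend, the seasonal cycle, and the column name "CO2"). It only shows what a time index lets you do.

```python
import numpy as np
import pandas as pd

# Synthetic weekly CO2-like record: upward trend plus a seasonal cycle.
dates = pd.date_range("2010-01-03", "2014-12-28", freq="7D")
years_elapsed = np.asarray((dates - dates[0]).days) / 365.25
co2 = 390 + 2.0 * years_elapsed + 3.0 * np.sin(2 * np.pi * years_elapsed)

# Making the dates the index is what unlocks the time series tools.
df = pd.DataFrame({"CO2": co2}, index=dates)

# Slice by a date string and group by calendar year straight from the index.
df_2012 = df.loc["2012"]
annual_mean = df.groupby(df.index.year)["CO2"].mean()
print(annual_mean)
```

With the real data the only extra work is parsing the date column when you read the csv, and that is covered in the packet.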

VIDEO

Homework.

Notebook 13.

1. Get the daily flask CO2 data from the Keeling curve data from Mauna Loa. USE THE DAILY FLASK! So that is Flask CO2 Data->daily_flask_co2_mlo.csv . http://scrippsco2.ucsd.edu/data/atmospheric_co2/mlo.html. THINGS SHOULD WORK BETTER THIS YEAR! If we use the daily flask data like we did in the class example, things should work. When using this data you need to clean it up by only using rows where the flag is 0. We used this in our code: df_scripps=df_scripps[df_scripps.Flags==0]

2. Predict the annual  MAX CO2 out to 2050.  Present the graph showing this data all properly labeled.  You can show the equations for each line that you use for the predictions.  Make it look good!  This is similar to what we did in class so a good warm up.  I am NOT showing my answer below.  In class we did mean.  Now we are doing max.

3. This is the harder one. When looking at CO2 data we see yearly patterns. It peaks in the late spring and decreases in the summer. Here is a worldwide visualization that was released in the fall of 2014. We know this is caused by growth during northern hemisphere spring and summer. I want to see the monthly patterns by themselves. So to do this we need to subtract out the yearly mean from our samples. This is easy to do but took me a while to figure out. You can do it one of two ways. Choose one and go for it.

a. Pandas has a nice function called rolling. It used to be called rolling_mean but that was deprecated and it is now called rolling. Since we have weekly data, a window size of 52 weeks gives a year-long average around each point. It is not the average of that calendar year but a rolling average with a 52 week window. Do not specify the window in weeks, though. It is buggy. Use days, with a window of 365D. You cannot use a window of years or months because that length changes during leap years. Here are three good links: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html and https://stackoverflow.com/questions/43556344/pandas-monthly-rolling-operation and https://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-windows To be entertaining and distracting here is a video of a rolling panda (a slightly different google search). So if you compute the rolling mean in a new column you get a nice smooth curve and could use that to subtract out the mean to look at the monthly differences.

b. If you really want to subtract out the calendar year differences it is slightly different. This uses the January 1-December 31 mean. I found two ways to do it. Here is a stackoverflow description of one. But I also found a slightly different method. If you make a new column that is the year of each sample, you can find the mean for each year this way: mlo.groupby('year').CO2.mean() It is similar to how we resampled for the trend analysis. Now you need to put the yearly data back into the larger dataframe. You can do this using join. Look at the favorite green checked answer here. https://stackoverflow.com/questions/12200693/python-pandas-how-to-assign-groupby-operation-results-back-to-columns-in-parent

c.  I think you could also use resample but I have not figured it out yet.  So I wouldn’t try it yet.
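To show the shape of methods a and b, here is a sketch on synthetic weekly data (not the Mauna Loa file). The column names are my own. For method b I swapped in groupby with transform instead of the join described above, because it puts the yearly mean back on every row in one step. Treat that as one possible route, not the required one. Also note that rolling("365D") looks back over the previous 365 days, so it is a trailing average, not one centered on each point.

```python
import numpy as np
import pandas as pd

# Synthetic weekly record with a trend and a seasonal cycle (not real data).
dates = pd.date_range("2000-01-02", "2009-12-27", freq="7D")
t = np.asarray((dates - dates[0]).days) / 365.25
mlo = pd.DataFrame({"CO2": 370 + 2.0 * t + 3.0 * np.sin(2 * np.pi * t)},
                   index=dates)

# Method a: time-based rolling mean over the previous 365 days.
mlo["rolling_mean"] = mlo["CO2"].rolling("365D").mean()
mlo["diff_rolling"] = mlo["CO2"] - mlo["rolling_mean"]

# Method b: calendar-year mean, broadcast back to every row with transform.
mlo["year"] = mlo.index.year
mlo["year_mean"] = mlo.groupby("year")["CO2"].transform("mean")
mlo["diff_year"] = mlo["CO2"] - mlo["year_mean"]

# Monthly view of the seasonal signal, ready for a boxplot by month.
monthly = mlo.groupby(mlo.index.month)["diff_year"].mean()
print(monthly.round(2))
```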

Now you should be able to make 3 plots very quickly from the new column of data you could create.  I am going to paste in the three graphs you need to make.  I did not label my graphs so don’t follow my lead.  Make yours look much better.  I shouldn’t have to tell you that by now but I am reminding you.  Add a figure caption to each figure.   Good luck making them.

4.  Final 10%.  Can you make a simple box model of CO2 concentrations?  Take a look at this notebook(pdf=MLO-Bonus-BoxModel) and see if you can do it.  I have too many hints for you in the notebook.  My graph that is not labeled well is below!

1.  The difference between every sampling point and the mean value for the year to see the seasonal differences.

2.  Same as #1 but for the years 2000 to present.  I am showing the answers for rolling and groupby so you can compare.

3.  A boxplot by month.

co2-monthly-diff

 

4.  Bonus graph of boxmodel results.  You will need to submit two graphs!  This is just the first!

CO2-box-model

Classes 14 and 15 – Plot Core Data

Core Data Analysis-2 Class assignment

Notebook/Homework 12

Due Wednesday March 20, 2024

This is a true story. A few years ago I was working with a graduate student from Lamont/Columbia. We were helping analyze soil for lead at a farmers’ market in Williamsburg. She wanted to expand her work and look at lead in backyards in Brooklyn. We flyered and used Facebook to find people willing to let us sample their backyard. We went into people’s backyards and collected soil cores. We had no idea what we were going to find. We only had data from studies published in other parts of the city. We told the graduate student to present what she found and what she thought. For this homework you are that graduate student! You are going to present on one of the cores from the study. I will email everyone the core to use.

This week we are going to develop a hypothesis to test on core data.  When you come in for class we are going to sit down and read and discuss the Chillrud paper.  Read it so if you have questions we can discuss them.  The more you ask the better you will be.

Files That may be helpful

  1. All data and notebooks are in the core-data folder on GitHub.
  2. Read the Chillrud Paper (Chillrud-es9807892)
  3. Listen to the Podcast.  Here is the itunes Link.  Choose Chillrud_Lead
  4. Read the Earth Institute Blog post about Sampling in Brooklyn.
  5. notebook (pdf=plotcoredata-Brooklyn) for background
  6. The paper that Franziska Landes (pdf=Landes-1-s2.0-S0048969723040305-main-2) published and the data from the paper (excel=Landes_STOTEN or on github).
  7. Data from Table 2 in Chillrud
  8. Notebook with the simple answer (pdf=plotcoredata_Landes_answer-Brooklyn-Inventory)for adding axes

You are going to read a paper (Chillrud-es9807892) by Chillrud et al., 1999 titled Twentieth Century Atmospheric Metal fluxes into Central Park Lake, New York City. They have some interesting conclusions. This was all we had before we started to collect cores (there are a few other studies but this is the best for understanding processes). As part of a new study we have collected soil samples and cores from Northern Brooklyn. You will analyze one core from the Landes paper. The cores are highly contaminated. We want to determine the source of that lead. Your goal is to develop a hypothesis and test it. (One plot of Pb versus depth will not get you far and at best will get you a 50 for the assignment.) Plus you will need some writing to explain your thinking. So use Markdown to explain the process and to create figure captions.

Remember that analyzing and thinking about data is hard and takes time. You need to think of a strategy for analyzing the data and what you want to accomplish. What is your hypothesis? What are you asking? You know how to do a lot of different graphs and analyses, so figure out what you want to apply. You can make plots versus depth. You can correlate parameters. You can do math to calculate numbers. You can look at a lot of data in many different ways. Use your knowledge to test your hypotheses. Remember you can iterate on your question/hypothesis as you analyze the data. It is a two-way street. Make some plots, then go back and re-read. Think some more. Hone your hypothesis. Look at all the graphs you have learned to make. Think about what you can do with the data. Chillrud didn’t have coarse and fine grain data. Can that help you think about sources?

Class 13 – Correlations Continued – p-hacking

Here is a data set we just recently published (Gnanaprakasam-mBio-2017-2j44e56). This is the figure I made for the paper. What happened was I did field work with a colleague in Britain who then did Illumina sequencing to look at species diversity in Bangladesh sediment. He did 16S rRNA analysis of the DNA and found percent species abundance. I then did analysis of the water and sediment to get the arsenic content and the amount of an Fe(III) mineral called hematite. My colleague wants to know what bacterial species he should further investigate. Now we want to look at more species and see if there are correlations with the mineral hematite or the aqueous arsenic. He is asking if you can give him a list of 4 and then 6 species to investigate further.

Your mission, if you choose, is to p-hack this data. Tell me what species might be important and what we should investigate further. You do not need to do both arsenic and hematite; choose one and run with it. For a hint, hematite has fewer data points but better correlations.

  • Choose if you will correlate with Aqueous arsenic or hematite.
  • Plot 1.  This is to make just one plot of either arsenic or hematite versus one microbial species.   To make this plot set your y=df.columns[##].    Where ## is a number of your choosing between 10 and 277 (the number of columns). This function will give you the column name if you give it a number.  Then after you make the plot in the figure caption report if the correlation is significant and say 1-2 sentences about the bacterial species.  You can just google the final parts of the bacterial name and usually one of the top choices will explain the main properties of the bacteria.
  • Next p-hack the data.  Search my data to see if any correlations are significant and make one pdf file.  This might be big and cumbersome but could be helpful.  It will have 100’s of pages!  Choose to p-hack with arsenic or hematite.
  • Then you should look at the plots and results.
  • Save the r2 values, sort them and make two more figures.
  • One with 4 graphs on one figure for species with the 4 highest r2 values
  • one with 6 graphs on one figure with the 6 highest r2 values.
  • All graphs should have a text box from linregress that shows the slope, intercept, r2 and p-value. Remember r2 is just the r-value squared.
  • The data is in Excel this time! Here is a link, or here: Species-Arsenic-Data-2lq2nkt.
  • The column names are long and ugly from the sequencing. Fix this for your plots of 4 and 6 (see below)! This gets you 5%. These long names are what are automatically downloaded from NIH/NCBI.
  • The second 5% is if you make a seaborn correlation matrix for at least 5 columns including arsenic, hematite, and 3 bacteria.
  • Make sure to label your figures and also give me the answer to the question above.
  • Helpful notebooks for the final 10%
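The search-and-rank step in the list above can be sketched like this. The data here is random on purpose, with one relationship planted in a column I called species_7, so every name and number is an assumption and nothing here comes from the real Excel file. It uses linregress from scipy.stats, the same function named above.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic stand-in: one chemistry column plus 20 "species" columns.
rng = np.random.default_rng(42)
n = 30
df = pd.DataFrame({"Arsenic": rng.uniform(0, 300, n)})
for i in range(20):
    df["species_{}".format(i)] = rng.normal(0, 1, n)
# Plant one real relationship so there is something to find.
df["species_7"] = 0.01 * df["Arsenic"] + rng.normal(0, 0.2, n)

results = []
for col in df.columns[1:]:          # skip the Arsenic column itself
    fit = linregress(df["Arsenic"], df[col])
    results.append({"species": col, "slope": fit.slope,
                    "intercept": fit.intercept,
                    "r2": fit.rvalue ** 2, "p": fit.pvalue})

# Sort by r2 and keep the best four for the 4-panel figure.
ranked = pd.DataFrame(results).sort_values("r2", ascending=False)
top4 = ranked.head(4)
print(top4)
```

Keep the p-hacking warning in mind when you read the table. With roughly 267 species columns you should expect about a dozen to come in under p = 0.05 by chance alone, so a small p-value is a reason to look closer, not a conclusion.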

YOU WILL NEED TO HAND IN

  • Your Notebook as usual.
  • PDF file of all the results.
  • Plot of 4 parameters!
    • I am not showing the answers yet!
  • Plot of 6 parameters!
    • I am not showing the answers yet!

 

DUE Wednesday March 6, 2024


			

Class 12 – Correlations in Pandas

We are doing this notebook over two classes. Today we are going to look for correlations (pdf=pandas-correlations) in the arsenic data and make some more nice subplots, and through this we are going to get more practice with pandas. Then on Wednesday I am going to show you some tricks with dataframes to make your life easier and make the plots nicer. This will set us up for your homework, which is a different data set on arsenic and bacteria and will be due in a week. So you only have 1 homework assignment on this topic. We are going to go through the arsenic data set, find the best correlations, and then plot them as either 4 or 6 plots. See the figures below! Then for homework you are going to take the other data and plot all the correlations and then the best 4 and best 6 plots. As part of the plots below I added letters to the plots so we can better reference them in the figure caption and make them look publication quality.
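One quick way to find the best correlations is the corr method built into pandas. This is a sketch on synthetic data. The column names and the planted As-Fe relationship are assumptions for illustration, not the class data set.

```python
import numpy as np
import pandas as pd

# Synthetic well chemistry (column names and values are illustrative).
rng = np.random.default_rng(1)
n = 50
arsenic = rng.uniform(0, 300, n)
df = pd.DataFrame({
    "As": arsenic,
    "Fe": 0.02 * arsenic + rng.normal(0, 1.0, n),   # related to As
    "Mn": rng.uniform(0, 2, n),                      # unrelated
    "Depth": rng.uniform(5, 100, n),                 # unrelated
})

# Pearson r of every column against As, then square it to get r2.
r2 = df.corr()["As"].drop("As") ** 2
best = r2.sort_values(ascending=False)
print(best)
```

The sorted index gives the order in which to pick columns for the 4-panel and 6-panel figures.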

What I am teaching you is called p-hacking and it is not always looked upon favorably. With the advent of programs such as Python and R you can search your data for correlations even if they make no scientific sense. I am sure you have heard some weird result in the news. That is bad science and is p-hacking. But it can be a good start to looking at your data. To understand the controversy look at the links on this website.

 

Video

Video Quiz 4

ARSENIC BACKGROUND

Arsenic Paper

1 Fendorf, S., Michael, H. A. & van Geen, A. Spatial and Temporal Variations of Groundwater Arsenic in South and Southeast Asia. Science 328, 1123-1127, doi:10.1126/science.1172974 (2010).   fendorf-science-09

PODCAST IS HERE!

Also, here is my powerpoint on Arsenic to give us all some background. Bangladesh-python.   In case you want more arsenic information this article has the latest numbers and predictions.

 

Answers

As-corr14 As-corr6

 

 

 

Class 11 – More Pandas

Today we are going to try and automate pandas (pdf=pandas_well_data_part2). Let's see how we do. Our first goal is to make a plot of every parameter versus depth. This is what my file looks like. Click on view raw to get the file. These are the subplots we are also going to try and make. Let's see how far we get! Yours should look much better!
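Here is a rough sketch of the "every parameter versus depth" loop with lettered subplots. The dataframe is synthetic and the column names (Depth, As, Fe, Mn) are assumptions, so swap in the columns from the real file. The figure is saved to a temporary folder to keep the example tidy.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw without a window; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic well data; the real file has more columns.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Depth": np.linspace(5, 100, 20),
    "As": rng.uniform(0, 300, 20),
    "Fe": rng.uniform(0, 10, 20),
    "Mn": rng.uniform(0, 2, 20),
})

params = [col for col in df.columns if col != "Depth"]
fig, axes = plt.subplots(1, len(params), sharey=True)
fig.set_size_inches(10, 5)
for ax, col, letter in zip(axes, params, "abcdef"):
    ax.scatter(df[col], df["Depth"])
    ax.set_xlabel(col)
    # panel letter in the upper-left corner, in axes coordinates
    ax.text(0.05, 0.95, "({})".format(letter), transform=ax.transAxes,
            va="top")
axes[0].set_ylabel("Depth")
axes[0].invert_yaxis()   # depth increases downward; the y axis is shared

out_file = os.path.join(tempfile.mkdtemp(), "depth_profiles.png")
fig.savefig(out_file, dpi=100, bbox_inches="tight")
plt.close(fig)
```

Because the list of parameters is built from df.columns, adding a column to the file adds a panel without touching the plotting code.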

VIDEO

 

Homework Due: Next Class

(Wednesday February 28, 2024)

  1. Make a pdf file that contains a plot of As versus every other parameter.  Make sure it is labeled nicely and in one pdf file.   Arsenic should be on the x-axis and the other parameters should be on the y-axis.
    1. Each parameter should be on one page.
    2. Make sure your axes go in the correct direction.
    3. Don’t leave in leftover code that messes you up.
    4. You will need to hand in your pdf file of results.
    5. You do not need to color by arsenic.  It can be all one color!! Since you are plotting against arsenic the colors become redundant.  But choose a good symbol and color.
    6. The column “Drink” is an object and not a float.  It plots okay when on an x-axis but not a y-axis.  If your code crashes you need to skip this column.
  2. Make a plot with subplots that is As versus at least 3 other parameters each on their own graph but all on one page.  You can decide if it is better to share an x or share a y axis.  But Arsenic is the independent variable.
    1. Save this graph to a jpeg and turn that in also.
    2. For this your graphs should be square.  They should have letters denoting them like in the packet.
  3. If a graph is a weird shape or scrunched that is not a good thing.  You need the data to look nice. I use fig.set_size_inches(10,15) where the 10 and 15 are the paper size. You can play with the numbers to get a good looking figure.
  4. When you make a pdf file or any file for class make sure to turn the file in on courseworks.
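The one-plot-per-page pdf can be built with PdfPages from matplotlib. This is only a sketch on a made-up dataframe. The column names, the marker and color choice, and the decision to flip the Depth axis are my assumptions. It skips the text column using is_numeric_dtype from pandas, which is one more way to dodge the Drink problem.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw without a window; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
from pandas.api.types import is_numeric_dtype

# Synthetic data with one text column, like Drink in the real file.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "As": rng.uniform(0, 300, 15),
    "Fe": rng.uniform(0, 10, 15),
    "Depth": np.linspace(5, 100, 15),
    "Drink": ["Y", "N", "Y"] * 5,      # text column that must be skipped
})

pdf_file = os.path.join(tempfile.mkdtemp(), "As_vs_all.pdf")
pages_written = 0
with PdfPages(pdf_file) as pdf:
    for col in df.columns:
        # skip arsenic itself and anything that is not a number
        if col == "As" or not is_numeric_dtype(df[col]):
            continue
        fig, ax = plt.subplots()
        fig.set_size_inches(8, 8)          # square page
        ax.scatter(df["As"], df[col], marker="o", color="tab:blue")
        ax.set_xlabel("As")
        ax.set_ylabel(col)
        if col == "Depth":
            ax.invert_yaxis()              # deeper samples plot lower
        ax.set_title("{} versus As".format(col))
        pdf.savefig(fig)                   # one page per parameter
        plt.close(fig)
        pages_written += 1
```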

 

HINTS

The Drink column is going to cause scatter to crash, so you will need to avoid it. There are multiple methods. For example, you could use an if statement and say

if col != 'Drink':
    pass  # then do stuff
A more robust method might be to check that the parameter is a float.
if df[col].dtype == float:
    pass  # then do stuff
AN EXAMPLE
for col in df:
    if df[col].dtype == float:
        print('{} is a float'.format(col))
    else:
        print('{} is NOT a float'.format(col))