Since bamboolib is developed by data scientists, it was designed to make common data wrangling and exploration tasks fast and easy. Here’s an overview of how to wrangle, visualize & explore data with bamboolib.
If you want to follow this guide, there are three ways you can set yourself up:
If you want to install bamboolib on your computer, please follow our installation instructions. If you prefer a step-by-step walkthrough, watch the video below.
Click on this link and you are good to go🚀. That's it! 🙂
If you want to use bamboolib within Kaggle, use our example Notebook on Kaggle and follow the video below.
The titanic dataset comes pre-installed with bamboolib, so we will use it to show you what you can do with bamboolib.
After you have set everything up, start Jupyter Notebook or JupyterLab and create a new .ipynb file. Then enter the following code in the first cell and run it.
import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df
You should see the following output.
Whenever you display a pandas.DataFrame (just as we did by typing df above), you will see the typical static pandas output augmented by a bamboolib button. This button is your main entry point to bamboolib.
Click on the green button to open bamboolib.
The interface consists of three components:
Global controls: allow you to edit, undo and redo transformations and to export code.
DataFrame actions: your main entry point for carrying out transformations and for exploring your data.
An Interactive Data View including the dimensions of the data.
That's enough of the interface for now. Let's wrangle the titanic dataset!
With data transformations, you can clean and prepare your data set. bamboolib offers all typical data transformations such as filtering, sorting, selecting/dropping columns, groupby, joins, and many more.
Let's start with two common operations: filtering and aggregating.
First, we filter the data so that only passengers between the ages of 18 and 60 remain. We will create that filter using the keyboard, as it is the fastest way of doing so. For the filter, we will use the "Filter" transformation.
See the video below for how to do a filter using the keyboard.
Want to know more about the keyboard shortcuts? Check out our Keyboard tutorial.
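For orientation, the filter described above corresponds roughly to a boolean mask in plain pandas. The snippet below is a minimal sketch, not the code bamboolib exports: the small DataFrame is made-up sample data with Titanic-style column names, and treating both age bounds as inclusive is an assumption.

```python
import pandas as pd

# Made-up sample rows with Titanic-style columns (not the real dataset).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Age": [17, 18, 35, 60, 61, None],
})

# Keep passengers aged 18 to 60. Both bounds are inclusive here, which is
# an assumption about how "between" is meant.
# Rows with a missing Age compare as False and are therefore dropped.
adults = df.loc[(df["Age"] >= 18) & (df["Age"] <= 60)]

print(adults)
```

If missing ages should be kept, the mask would need an extra `| df["Age"].isna()` term.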
Watching the movie "Titanic", we saw that people from the higher passenger classes (Pclass 1 and 2) as well as women had a higher chance of getting on rescue boats and therefore were more likely to survive. We want to test this hypothesis using our data set. Additionally, we want to see how Age is roughly distributed for each passenger class and sex.
Let's compute the sum and fraction (i.e. mean) of Survived as well as the min, max, and median Age for each Passenger class and Sex. Note that Survived is a binary variable (0 or 1), so the fraction of survived passengers is equal to the mean of Survived.
We get the summary statistics using the "Groupby and aggregate" transformation. Again, try to only use your keyboard to find and create the transformation.
Feel free to fiddle around yourself a bit. The video below shows you how you could do it. Also check out our Keyboard tutorial if you want to learn more about the keyboard shortcuts.
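In plain pandas, the summary described above can be sketched with a groupby and named aggregations. This is an illustration under assumptions: the data is a made-up sample with Titanic-style column names, and the output column names (survived_sum, age_median, and so on) are chosen here for readability and may differ from what bamboolib generates.

```python
import pandas as pd

# Made-up sample rows with Titanic-style columns (not the real dataset).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 1],
    "Age":      [29, 41, 50, 22, 19, 33],
})

# One row per (Pclass, Sex) group. Because Survived is 0 or 1, its sum is
# the number of survivors and its mean is the fraction of survivors.
summary = (
    df.groupby(["Pclass", "Sex"])
      .agg(
          survived_sum=("Survived", "sum"),
          survived_mean=("Survived", "mean"),
          age_min=("Age", "min"),
          age_max=("Age", "max"),
          age_median=("Age", "median"),
      )
      .reset_index()
)

print(summary)
```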
"A picture is worth a thousand words". With bamboolib, you can quickly create and export Plotly Express plots using the "Create plot" functionality. And again, we support you with full keyboard control here.
We looked at the chance of survival by Passenger Class and Sex using the groupby transformation above. Now, let's try to express that as a plot.
There are of course many ways to do that. We choose to show a stacked histogram of Sex, adding Survived as color. We create a subplot for each Pclass (done by so-called "faceting"). To show the probability, we normalize the bars to lie between 0 and 100 (%).
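A plot like the one described above can be sketched directly with Plotly Express. The snippet is a hedged illustration rather than the exact code bamboolib exports: the data is a made-up sample, the helper name survival_figure is hypothetical, and mapping Survived to "yes"/"no" labels is a choice made here to get readable legend entries. The crosstab at the end computes the same percentages as a table.

```python
import pandas as pd

# Made-up sample rows with Titanic-style columns (not the real dataset).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})
# Readable labels for the color legend (a choice made for this sketch).
df["Survived"] = df["Survived"].map({0: "no", 1: "yes"})


def survival_figure(data):
    """Stacked histogram of Sex, colored by Survived, one subplot per Pclass."""
    # Imported here so the table below also works without plotly installed.
    import plotly.express as px

    return px.histogram(
        data,
        x="Sex",
        color="Survived",
        facet_col="Pclass",   # one subplot per passenger class
        barnorm="percent",    # scale each stacked bar to 0..100 %
    )


# The same information as a table: percentage per (Pclass, Sex) group.
shares = pd.crosstab(
    [df["Pclass"], df["Sex"]], df["Survived"], normalize="index"
) * 100

print(shares)
```

In a notebook, calling `survival_figure(df).show()` renders the figure.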
In many situations, it makes sense to have the full flexibility of creating your own plots, but sometimes you just want to get insights fast. If you want to explore your data quickly, then the "Explore DataFrame" tool is the way to go.
With the "Explore DataFrame" functionality, you can do the following:
Get a glimpse of the data, including validation metrics such as the number of missing values in each column
Get a univariate summary of each column
Look at bivariate plots that adjust to the data type at hand
Identify predictors for a given target
See correlations between all columns
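Some of the checks listed above have simple counterparts in plain pandas. The sketch below is only a rough analogue, not what "Explore DataFrame" runs internally: it uses a made-up sample with Titanic-style column names, and it restricts correlations to numeric columns.

```python
import pandas as pd

# Made-up sample rows with Titanic-style columns (not the real dataset).
df = pd.DataFrame({
    "Age":      [22.0, None, 35.0, 54.0],
    "Fare":     [7.25, 71.28, 8.05, 51.86],
    "Sex":      ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

# Number of missing values in each column.
missing = df.isna().sum()

# Univariate summary of every column (numeric and non-numeric).
summary = df.describe(include="all")

# Pairwise correlations between the numeric columns only.
correlations = df.select_dtypes("number").corr()

print(missing)
print(summary)
print(correlations)
```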
Let's go through the most important "Explore DataFrame" features using the titanic data set as an example. Feel free to stop the video and try the steps yourself!