Get up and running with bamboolib from scratch
bamboolib is developed by data scientists and designed to make common data wrangling and exploration tasks fast, easy, and fun. Here’s an overview of how to wrangle, visualize, and explore data with bamboolib.
If you want to follow this guide, there are three ways you can set yourself up:
Due to changes in Kaggle's technical infrastructure, bamboolib cannot support it at the moment. We're in contact with the Kaggle team to fix this!
If you want to use bamboolib within Kaggle, use our example Notebook on Kaggle and follow the video below.
Use bamboolib on kaggle
Let's work with the titanic dataset
The titanic dataset comes pre-installed with bamboolib, so we will use it to show you what you can do with bamboolib.
After you have set everything up, start Jupyter Notebook or JupyterLab and create a new .ipynb file. Then, enter the following code in the first cell and run it.
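import bamboolib as bam
import pandas as pd
# Load the titanic dataset that ships with bamboolib
df = pd.read_csv(bam.titanic_csv)
# Display the DataFrame so that the bamboolib button appears
df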
You should see the following output.
[Image: pandas display with bamboolib button]
Whenever you display a pandas.DataFrame (just as we did by typing df above), you will see the typical static pandas output augmented by a bamboolib button. This button is your main entry point to bamboolib.
Click on the green button to open bamboolib.
The interface consists of three components:
Global controls: allow you to edit, undo and redo transformations and to export code.
DataFrame actions: your main entry point for carrying out transformations and for exploring your data.
An Interactive Data View including the dimensions of the data.
That's enough of the interface for now. Let's wrangle the titanic dataset!
With data transformations, you can clean and prepare your data set. bamboolib offers all typical data transformations such as filtering, sorting, selecting/dropping columns, groupby, joins, and many more.
All transformations that are available in bamboolib are listed in the "Search transformations" search field when you click on it. This makes "Search transformations" a great way to check what's possible with bamboolib. If you are missing any transformations, please reach out. We love feedback!
Let's start with two common operations: filtering and aggregating.
First, we filter for all passengers aged between 18 and 60. We will create the filter using the keyboard, as that is the fastest way of doing so. For this, we will use the "Filter" transformation.
bamboolib is keyboard first! That means that you can use the keyboard to search transformations and to navigate through them. If you become friends with the keyboard shortcuts, they will make you blazingly fast!
But of course, feel free to start with the mouse if you feel more comfortable that way. Check out our keyboard control tutorial or reference for more details.
See the video below for how to do a filter using the keyboard.
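For reference, the same filter can be written directly in pandas. This is only a minimal sketch, not necessarily the code that bamboolib exports: the sample rows are made up for illustration (they are not the real titanic data), and we assume that both age bounds are meant to be included.

```python
import pandas as pd

# Hypothetical sample rows with titanic-style columns (illustration only).
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [38.0, 22.0, 14.0, None, 61.0, 60.0],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Keep passengers aged 18 to 60, assuming both bounds are inclusive.
# Rows with a missing Age compare as False and are therefore dropped.
adults = df.loc[(df["Age"] >= 18) & (df["Age"] <= 60)]
print(adults)
```

Note that passengers without a recorded Age do not pass the filter, which may or may not be what you want for your analysis.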
Watching the movie "Titanic", we saw that people from the higher passenger classes (Pclass 1 and 2) as well as women had a higher chance of getting on rescue boats and therefore were more likely to survive. We want to test this hypothesis using our data set. Additionally, we want to see how Age is roughly distributed for each passenger class and sex.
Let's compute the sum and fraction (i.e. mean) of Survived as well as the min, max, and median of Age for each passenger class (Pclass) and Sex. Note that Survived is a boolean variable encoded as 0 and 1, so the fraction of survived passengers is equal to the mean of Survived.
We get the summary statistics using the "Groupby and aggregate" transformation. Again, try to only use your keyboard to find and create the transformation.
Feel free to fiddle around a bit yourself. The video below shows you how you could do it. Also check out our keyboard tutorial if you want to learn more about the keyboard shortcuts.
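If you prefer to see the aggregation as code, here is a rough pandas sketch of the same summary. The sample rows and the output column names (Survived_sum, Age_median, and so on) are our own assumptions for illustration; the code that bamboolib exports may look different.

```python
import pandas as pd

# Hypothetical sample rows with titanic-style columns (illustration only).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [38.0, 24.0, 54.0, 22.0, 30.0, 26.0],
    "Survived": [1, 1, 0, 0, 1, 1],
})

# One row per (Pclass, Sex) group with the requested statistics.
# The mean of the 0/1 column Survived is the fraction of survivors.
summary = (
    df.groupby(["Pclass", "Sex"])
    .agg(
        Survived_sum=("Survived", "sum"),
        Survived_mean=("Survived", "mean"),
        Age_min=("Age", "min"),
        Age_max=("Age", "max"),
        Age_median=("Age", "median"),
    )
    .reset_index()
)
print(summary)
```

In this toy data, one of the two men in Pclass 3 survived, so Survived_sum is 1 and Survived_mean is 0.5 for that group.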
"A picture is worth a thousand words". With bamboolib, you can quickly create and export Plotly Express plots using the "Create plot" functionality. And again, we support you with full keyboard control here.
We looked at the chance of survival by Passenger Class and Sex using the groupby transformation above. Now, let's try to express that as a plot.
There are, of course, many ways to do that. We choose to show a stacked histogram of Sex, adding Survived as color. We create a subplot for each Pclass (done by so-called "faceting"). To show the probability, we normalize the bars to lie between 0 and 100%.
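To make the normalization concrete, the percentages that such normalized bars display can be cross-checked with plain pandas. This is a sketch on made-up sample rows (not the real titanic data), and it computes only the numbers behind the plot, not the plot itself.

```python
import pandas as pd

# Hypothetical sample rows with titanic-style columns (illustration only).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Share of each Survived value within every (Pclass, Sex) bar, in percent.
# normalize="index" scales each row so that it sums to 1.
shares = pd.crosstab(
    [df["Pclass"], df["Sex"]], df["Survived"], normalize="index"
) * 100
print(shares)
```

Each row corresponds to one bar in one facet, and its values sum to 100. With Plotly Express installed, a comparable figure could probably be drawn with a call along the lines of px.histogram(df, x="Sex", color="Survived", facet_col="Pclass", barnorm="percent"), although the code that bamboolib exports may differ.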
In many situations, it makes sense to have the full flexibility of creating your own plots, but sometimes you just want to get insights fast. If you want to explore your data quickly, the "Explore DataFrame" tool is the way to go.
With the "Explore DataFrame" functionality, you can do the following:
Get a glimpse of the data, including some validation metrics such as the number of missing values in each column
Get a univariate summary of each column
Look at bivariate plots that adjust to the data type at hand
Identify predictors for a given target
See correlations between all columns
Let's go through the most important "Explore DataFrame" features using the titanic data set as an example. Feel free to stop the video and try the steps yourself!
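Some of the summaries listed above have rough counterparts in plain pandas, which can be handy for double-checking what you see in the interface. The sketch below uses made-up sample rows (not the real titanic data) and is not how "Explore DataFrame" is implemented; it only shows comparable numbers.

```python
import pandas as pd

# Hypothetical sample rows with titanic-style columns (illustration only).
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1],
    "Sex": ["female", "male", "female", "male", "male"],
    "Age": [38.0, None, 14.0, 22.0, 61.0],
    "Survived": [1, 0, 1, 0, 0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Univariate summary of every column, numeric and non-numeric.
univariate = df.describe(include="all")

# Pairwise Pearson correlations between the numeric columns only.
correlations = df.select_dtypes(include="number").corr()

print(missing_per_column)
print(univariate)
print(correlations)
```

Here, Age has one missing value, and the correlation matrix covers only Pclass, Age, and Survived because Sex is not numeric.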