Week 3
Throughout this week I continued labeling bees in the larger swarm to use as training data. I also started the week with a one-on-one meeting with Danielle to discuss my goals for this summer in more depth. I then spent the majority of my time modifying the lab’s preliminary neural network. Since the starting architecture is fairly simple, there were many changes I could try, so I spent the rest of the week making small modifications and observing how each one affected the output. Below I summarize the changes I made and the key observations from each. Note that, at this point, the network has only a single cube of bees to train on and no validation data, so the model is easily at risk of overfitting. Additionally, all models were trained for 1000 epochs.
Modification 0, starting structure:
- With no changes, the network had a loss of 3.182e-06 for the training data and 9.292 for test data (the same cube of bees but cropped in a different area)
- Wall time: 30min 38s
Modification 1:
- Starting from the baseline network, I halved the number of channels at every step
- This change increased the loss from 3.182e-06 to 0.002 given the training data
- Given test data, the loss decreased from 9.292 to 3.835
- Training concluded faster (wall time: 13min 9s)
Modification 2:
- Starting from the baseline network, I doubled the number of channels at every step
- This change decreased the loss from 3.182e-06 to 8.884e-07 given the training data
- Given test data, the loss decreased from 9.292 to 7.548
- Training took significantly longer (wall time: 1h 22min 10s)
At this point, I concluded that the higher number of channels probably improved the network’s ability to learn but also caused it to overfit the limited training data. Since I only had one labeled cube to work with at the moment, I didn’t have additional training or validation data I could use to prevent overfitting.
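To make these channel experiments concrete, here is a minimal sketch of the kind of setup I mean. This is not the lab’s actual architecture (whose details I’m not reproducing here); it just assumes a simple stack of 3D convolutions whose per-layer channel counts are scaled by a single width multiplier, so Modification 1 corresponds to width=0.5 and Modification 2 to width=2.0 relative to the baseline.

```python
# Minimal sketch, not the lab's actual network: a simple 3D conv stack
# whose channel counts are scaled by one "width" multiplier.
import torch.nn as nn

class SimpleBeeNet3D(nn.Module):
    def __init__(self, base_channels=(16, 32, 64), width=1.0):
        super().__init__()
        chans = [max(1, int(c * width)) for c in base_channels]
        layers, in_ch = [], 1                      # single-channel volume assumed
        for out_ch in chans:
            layers += [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.Conv3d(in_ch, 1, kernel_size=1))  # per-voxel output score
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

baseline = SimpleBeeNet3D(width=1.0)   # Modification 0
halved   = SimpleBeeNet3D(width=0.5)   # Modification 1: channels halved
doubled  = SimpleBeeNet3D(width=2.0)   # Modification 2: channels doubled
```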
Later, I was provided with a labeled cube that another student had made and was able to evaluate all three pre-trained models on that cube to see how the three modifications performed when tested on somewhat new data.
Testing Modification 0 With Different Test Labels:
- Given the new test data, the loss decreased from 9.292 to 6.498
Testing Modification 1 With Different Test Labels:
- Given the new test data, the loss increased from 3.835 to 5.394
Testing Modification 2 With Different Test Labels:
- Given the new test data, the loss decreased from 7.548 to 6.049
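The evaluation itself follows a simple pattern: load a saved model and compute its loss on the newly labeled cube. The sketch below is hypothetical, with the MSE loss, tensor shapes, and file name all being assumptions rather than the lab’s actual setup, and `SimpleBeeNet3D` being the illustrative network from the earlier sketch.

```python
# Hypothetical evaluation sketch: compute a trained model's loss on a
# newly labeled cube. MSE loss, shapes, and file names are assumptions.
import torch
import torch.nn as nn

def evaluate(model, cube, labels, loss_fn=nn.MSELoss()):
    """Return the loss of `model` on one labeled cube (no gradient updates)."""
    model.eval()
    with torch.no_grad():
        pred = model(cube.unsqueeze(0))            # add a batch dimension
        return loss_fn(pred, labels.unsqueeze(0)).item()

# e.g., for each saved modification:
# model = SimpleBeeNet3D(width=1.0)
# model.load_state_dict(torch.load("modification0.pt"))
# print(evaluate(model, new_cube, new_labels))
```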
Since I now had “new” data, I decided to retrain the models using both cubes as training data, in the hope of mitigating the overfitting problem I had observed. After modifying the code to support batched training, I trained models with a batch size of 2 for 1000 epochs for the same three channel configurations. Even though there are now technically two cubes of training data, they are the same cube, just labeled differently, so I am concerned that this may simply confuse the model.
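A rough sketch of what I mean by batched training over the two labelings is below. The dataset, optimizer, and loss function are illustrative assumptions rather than the lab’s exact code, `SimpleBeeNet3D` is the toy network from the earlier sketch, and the random tensors are placeholders for the two labeled versions of the cube.

```python
# Sketch of batched training over two labelings of the same cube.
# Shapes, optimizer, and loss are assumptions; random tensors are
# placeholders for the real labeled data.
import torch
from torch.utils.data import TensorDataset, DataLoader

D = 64                                         # placeholder cube size
my_cube,    my_labels    = torch.randn(1, D, D, D), torch.randn(1, D, D, D)
other_cube, other_labels = torch.randn(1, D, D, D), torch.randn(1, D, D, D)

cubes  = torch.stack([my_cube, other_cube])    # shape [2, 1, D, D, D]
labels = torch.stack([my_labels, other_labels])
loader = DataLoader(TensorDataset(cubes, labels), batch_size=2, shuffle=True)

model     = SimpleBeeNet3D(width=1.0)          # toy network from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn   = torch.nn.MSELoss()

for epoch in range(1000):
    for batch_cubes, batch_labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_cubes), batch_labels)
        loss.backward()
        optimizer.step()
```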
Modification 0 With Two Sets of Training Labels:
- Wall time: 23min 27s
- Unlike the other runs, in which the training loss leveled out at around 500 epochs, this model’s loss was still fluctuating wildly at the end of 1000 epochs
- The final loss after 1000 epochs was 0.053 given the training data
- Given the same cube, cropped in a different location, the test loss was 1.312
Modification 1 With Two Sets of Training Labels:
- Wall time: 5min 56s
- The loss was still decreasing slightly at 1000 epochs, but it was nearly level and not fluctuating like the previous run
- The final loss after 1000 epochs was 0.057 given the training data
- Given the same cube, cropped in a different location, the test loss was 1.051
Modification 2 With Two Sets of Training Labels:
- Wall time: 2h 8min 43s
- The final loss after 1000 epochs was 0.053 given the training data
- Given the same cube, cropped in a different location, the test loss was 1.177, the best so far for this modification
Based on these three tests, it appears that training with more data, even if it is just the same starting cube with two different sets of labels, helps prevent overfitting. So, I modified my code to allow training on overlapping crops of the cube. Using just my own labels, I created 27 overlapping crops of the cube to train on (with a batch size of 3). I trained only the Modification 1 architecture this way, since training took significantly longer to complete (wall time: 1h 57min). This model seemed to have a harder time learning, but after 1000 epochs it reached a training loss of 0.071 (not the lowest loss overall). Given a previously unseen crop of the same cube, it had a test loss of 0.148, which is better than training on a single crop with two different sets of labels.
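For reference, this is roughly how I think about generating the 27 overlapping crops: slide a fixed-size window to three evenly spaced start positions along each axis, giving 3 × 3 × 3 = 27 sub-cubes. The crop size and volume shape below are placeholders, not the actual dimensions of the labeled cube.

```python
# Sketch: cut 27 overlapping crops from one cube by sliding a fixed-size
# window to 3 evenly spaced start positions along each axis (3^3 = 27).
# Crop size and volume shape are placeholders.
import torch

def overlapping_crops(volume, crop=64, steps=3):
    """Return a list of overlapping sub-cubes covering the volume."""
    def starts(dim):
        return torch.linspace(0, dim - crop, steps).long().tolist()
    d, h, w = volume.shape[-3:]
    return [volume[..., z:z + crop, y:y + crop, x:x + crop]
            for z in starts(d) for y in starts(h) for x in starts(w)]

volume = torch.randn(1, 128, 128, 128)   # placeholder for the labeled cube
crops = overlapping_crops(volume)
print(len(crops))                        # 27
```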
Since this model seemed to perform best overall at this point, I decided to also test it using the labels made by the other student. Given the same cube (still an unseen crop) but these different labels, the test loss increased to 0.290, which is still better than what the other models achieved.
At this point, I’ve observed that:
1) The higher number of channels seems to improve learning, but limited training data easily leads to overfitting.
2) The higher number of channels also dramatically increases training time.
3) I can synthetically create more training data by cropping my single cube in different places, which significantly improves test loss. However, this also drastically increases training time, so I won’t be able to train a model like this with larger channel counts until I can use a GPU.
4) I can also increase the amount of training data by using the same cube with two different sets of labels. Though this doesn’t seem intuitive to me, it still prevents overfitting and therefore decreases the test loss.
Some things that I think would be interesting to look into next:
- So far, I’ve only edited the channel counts, so adding another layer to the network and experimenting with the kernel size and step size are all things I’d like to explore
- Since I can now synthetically create more training data, setting some of it aside as validation data to further guard against overfitting could be interesting
- Other methods of preventing overfitting, such as dropout, could also be added
- Other forms of data augmentation (e.g., rotating and mirroring the cube) could help diversify the training data further; a sketch of this idea follows this list
- Allowing larger sets of test data to be evaluated at once would give a better metric of how the models are performing
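As a concrete example of the rotation/mirroring idea in the list above, here is a small sketch of how a cube and its labels could be flipped and rotated together to produce extra training pairs. The specific transforms and placeholder shapes are illustrative, not part of the lab’s current pipeline.

```python
# Sketch: simple geometric augmentation that applies the same flip or
# 90-degree rotation to a cube and its labels. Shapes are placeholders.
import torch

def augment(cube, labels):
    """Yield mirrored and 90-degree-rotated copies of a cube and its labels."""
    yield cube, labels                                   # original pair
    for axis in (-3, -2, -1):                            # mirror each spatial axis
        yield torch.flip(cube, dims=[axis]), torch.flip(labels, dims=[axis])
    for k in (1, 2, 3):                                  # 90/180/270-degree rotations
        yield (torch.rot90(cube, k, dims=(-2, -1)),
               torch.rot90(labels, k, dims=(-2, -1)))

cube   = torch.randn(1, 64, 64, 64)   # placeholder volume
labels = torch.randn(1, 64, 64, 64)   # placeholder labels
pairs  = list(augment(cube, labels))  # 7 training pairs from one cube
```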
At the end of this week, I had completed labeling about 100 bees in the small swarm.