Friday, March 25, 2022

The Most Unique Snowflake – Cloudera Blog

Okay, I admit, the title is a little bit clickbaity, but it does hold some truth! I spent the holidays up in the mountains, and if you live in the northern hemisphere like me, you know that means I spent the holidays either celebrating or cursing the snow. When I was a kid, during this time of year we would always do an art project making snowflakes. We'd bust out the scissors, glue, paper, string, and glitter, and go to work. At some point, the teacher would inevitably pull out the big guns and blow our minds with the fact that every snowflake in the entire world for all of time is different and unique (people just love to oversell unimpressive snowflake features).

Now that I'm a grown, mature adult who has everything figured out (pause for laughter), I've started to wonder about the uniqueness of snowflakes. We say they're all unique, but some must be more unique than others. Is there a way that we could quantify the uniqueness of snowflakes and thus find the most unique snowflake?

Surely with modern ML technology, a task like this should not only be possible, but dare I say, trivial? It probably sounds like a novel idea to combine snowflakes with ML, but it's about time someone did. At Cloudera, we provide our customers with an extensive library of prebuilt data science projects (complete with out-of-the-box models and apps) called Applied ML Prototypes (AMPs) to help them move the starting point of their project closer to the finish line.

One of my favorite things about AMPs is that they're totally open source, meaning anyone can use any part of them to do whatever they want. Yes, they're complete ML solutions that are ready to deploy with a single click in Cloudera Machine Learning (CML), but they can also be repurposed for use in other projects. AMPs are developed by ML research engineers at Cloudera's Fast Forward Labs, and as a result they're a great source for ML best practices and code snippets. It's yet another tool in the data scientist's toolbox that can be used to make their life easier and help deliver projects faster.

Launch the AMP

In this blog we'll dig into how the Deep Learning for Image Analysis AMP can be reused to find snowflakes that are less similar to one another. If you are a Cloudera customer and have access to CML or Cloudera Data Science Workbench (CDSW), you can start out by deploying the Deep Learning for Image Analysis AMP from the "AMPs" tab.

If you don't have access to CDSW or CML, the AMP GitHub repo has a README with instructions for getting up and running in any environment.

Data Acquisition

Once you have the AMP up and running, we can get started from there. For the most part, we can reuse parts of the existing code. However, because we're only interested in comparing snowflakes, we need to bring our own dataset consisting solely of snowflakes, and a lot of them.

It turns out that there aren't very many publicly available datasets of snowflake images. This wasn't a huge surprise, as taking pictures of individual snowflakes would be a manually intensive process with a relatively minimal return. However, I did find one good dataset from Eastern Indiana University that we'll use in this tutorial.

You could go through and download each image from the website individually or use some other utility, but I opted to put together a quick notebook to download and store the images in the project directory. You'll need to place it in the /notebooks subdirectory and run it. The code parses out all of the image URLs from the linked web pages that contain pictures of snowflakes and downloads the images. It will create a new subdirectory called snowflakes in /notebooks/images, and the script will populate this new folder with the snowflake images.
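If you'd rather see the general shape of such a download script, here is a minimal sketch using only the Python standard library. It is not the notebook mentioned above: the names ImageLinkParser, extract_image_urls, and download_images are hypothetical, the list of file extensions is a guess, and the target directory is an assumption you should adjust to match your project.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Assumption: the snowflake pictures use one of these common extensions.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects <img src> and <a href> targets that look like image files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.lower().endswith(IMAGE_EXTENSIONS):
                # Resolve relative links against the page they came from.
                url = urljoin(self.base_url, value)
                if url not in self.image_urls:
                    self.image_urls.append(url)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in an HTML string, in page order."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


def download_images(urls, target_dir="./images/snowflakes"):
    """Download each URL into target_dir, creating the folder if needed."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        urlretrieve(url, filename)
```

You would fetch each gallery page, pass its HTML to extract_image_urls, and hand the resulting list to download_images.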

Like any good data scientist, we should take some time to explore the dataset. You'll notice that these images have a consistent format. They have very little color variation and a relatively constant background. A perfect playground for computer vision models.

Repurposing the AMP

Now that we have our data, and it seems to be well suited for image analysis, let's take a moment to restate our goal. We want to quantify the uniqueness of an individual snowflake. According to its description, Deep Learning for Image Analysis is an AMP that "demonstrates how to build a scalable semantic search solution on a dataset of images." Traditionally, semantic search is an NLP technique used to extract the contextual meaning of a search term, instead of just matching keywords. This AMP is unique in that it extends that concept to images instead of text to find images that are similar to one another.

The goal of this AMP is largely focused on educating users on how deep learning and semantic search work. In the AMP there is a notebook located in /notebooks that is titled Semantic Image Search Tutorial. It offers a practical implementation guide for two of the main techniques underlying the overall solution: feature extraction and semantic similarity search. This notebook will be the foundation for our snowflake analysis. Go ahead and open it and run the entire notebook (because it takes a little while), and then we'll take a look at what it contains.

The notebook is broken down into three main sections:

  1. A conceptual overview of semantic image search
  2. An explanation of extracting features with CNNs and demonstration code
  3. An explanation of similarity search with Facebook's AI Similarity Search (FAISS) and demonstration code

Notebook Section 1

The first section contains background information on how the end-to-end process of semantic search works. There is no executable code in this section, so there is nothing for us to run or change, but if time permits and the topics are new to you, you should take the time to read it.

Notebook Section 2

Section 2 is where we'll start to make our changes. In the first cell with executable code, we need to set the variable ICONIC_PATH equal to our new snowflake folder, so change

ICONIC_PATH = "../app/frontend/build/assets/semsearch/datasets/iconic200/"

to

ICONIC_PATH = "./images/snowflakes"

Now run this cell and the next one. You should see an image of a snowflake displayed where before there was an image of a car. The notebook will now use only our snowflake images to perform semantic search.

From here, we can actually run the rest of the cells in section 2 and leave the code as is up until section 3, Similarity Search with FAISS. If you have time though, I'd highly recommend reading the rest of the section to gain an understanding of what is happening. A pre-trained neural network is loaded, feature maps are saved at each layer of the neural network, and the feature maps are visualized for comparison.

Notebook Section 3

Section 3 is where we'll make most of our changes. Usually with semantic search, you are trying to find things that are very similar to one another, but for our use case we're interested in the opposite: we want to find the snowflakes in this dataset that are the least similar to the others, aka the most unique.

The intro to this section in the notebook does a great job of explaining how FAISS works. In summary, FAISS is a library that allows us to store the feature vectors in a highly optimized database, and then query that database with other feature vectors to retrieve the vector (or vectors) that are most similar. If you want to dig deeper into FAISS, you should read this post from Facebook's engineering site.
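To make the idea of "querying an index" concrete, here is a minimal pure-Python sketch of what an exhaustive (flat) L2 search does conceptually. The function names squared_l2 and flat_l2_search are hypothetical and exist only for illustration; FAISS itself is far more optimized. As far as I know, the flat L2 index in FAISS reports squared Euclidean distances, so the sketch does the same.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored_vectors, query, k):
    """Compare the query against every stored vector and return the k closest.

    Returns (distances, indices), both ordered from most to least similar.
    """
    scored = sorted(
        (squared_l2(vec, query), i) for i, vec in enumerate(stored_vectors)
    )
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]
```

Note that if the query vector is itself stored in the index, it comes back first with a distance of zero, which is why the snowflake analysis has to look past the first result.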

One of the lessons that the original notebook focuses on is how the features output from the last convolutional layer are a much more abstract and generalized representation of what features the model deems important, especially when compared to the output of the first convolutional layer. In the spirit of KISS (keep it simple, stupid), we'll apply this lesson to our analysis and only focus on the feature index of the last convolutional layer, b5c3, in order to find our most unique snowflake.

The code in the first 3 executable cells needs to be slightly altered. We still want to extract the features of each image and then create a FAISS index for the set of features, but we'll only do this for the features from convolutional layer b5c3.

# Cell 1
# np and preprocess_input are assumed to be imported earlier in the notebook
def get_feature_maps(model, image_holder):
    # Convert to an array and preprocess to scale pixel values for VGG
    images = np.asarray(image_holder)
    images = preprocess_input(images)
    # Get feature maps
    feature_maps = model.predict(images)
    # Reshape to flatten feature tensor into feature vectors (one row per image)
    feature_vector = feature_maps.reshape(feature_maps.shape[0], -1)
    return feature_vector


# Cell 2

all_b5c3_features = get_feature_maps(b5c3_model, iconic_imgs)


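# Cell 3
import faiss

feature_dim = all_b5c3_features.shape[1]
b5c3_index = faiss.IndexFlatL2(feature_dim)
# Add the feature vectors to the index so that it can be searched
b5c3_index.add(all_b5c3_features)
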
Here is where we'll start deviating significantly from the source material. In the original notebook, the author created a function that allows users to select a specific image from each index; the function returns the most similar images from each index and displays those images. We're going to use parts of that code in order to achieve our new goal, finding the most unique snowflake, but for the purposes of this tutorial you can delete the rest of the cells and we'll go through what to add in their place.

First off, we'll create a function that uses the index to retrieve the second most similar feature vector to the one that was selected (because the most similar would be the same image). There also happen to be a couple of duplicate images in the dataset, so if the second most similar feature vector is also an exact match, we'll use the third most similar.


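def get_most_similar(index, query_vec):
    # Ask for the 2 nearest vectors; the nearest is the query image itself
    distances, indices = index.search(query_vec, 2)
    if distances[0][1] > 0:
        return distances[0][1], indices[0][1]
    else:
        # The second result is an exact duplicate, so fall back to the third
        distances, indices = index.search(query_vec, 3)
        return distances[0][2], indices[0][2]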
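The function above handles a single duplicate. If your dataset might contain several copies of the same image, one possible generalization is to request more neighbors and walk the results until the first one with a non-zero distance. The helper below is a hypothetical sketch of that idea in plain Python, not part of the original notebook; it assumes the results are already sorted from most to least similar, the way a search result row is.

```python
def first_distinct_neighbor(distances, indices, self_index):
    """Return (distance, index) of the closest neighbor that is neither the
    query itself nor an exact duplicate, or None if there is no such neighbor.

    distances and indices are one row of search results, sorted by
    ascending distance.
    """
    for dist, idx in zip(distances, indices):
        if idx != self_index and dist > 0:
            return dist, idx
    return None
```

With a FAISS index you would pass it something like the first row of each array returned by a search with a larger k, and increase k if it returns None.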


From there it's just a matter of iterating through each feature vector, searching for the most similar image that isn't the exact same image, and storing the results in a list:

distance_list = []
for x in range(b5c3_index.ntotal):
    # Slice with x:x+1 to keep the 2D shape that the search expects
    dist, indic = get_most_similar(b5c3_index, all_b5c3_features[x:x+1])
    distance_list.append([x, dist, indic])

Now we'll import pandas and convert the list to a dataframe. This gives us a dataframe containing a row for every feature vector in the original FAISS index, with the index of the feature vector, the index of the feature vector that is most similar to it, and the L2 distance between the two feature vectors. We're curious about the snowflakes that are most distant from their most similar snowflake, so we should end this cell by sorting the dataframe in descending order by the L2 distance.

import pandas as pd

df = pd.DataFrame(distance_list, columns = ['index', 'L2', 'similar_index'])

df = df.sort_values('L2', ascending=False)

Let's take a look at the results by printing out the dataframe, as well as displaying the L2 values in a box-and-whisker plot.



Amazing stuff. Not only did we find the indexes of the snowflakes that are the least similar to their most similar snowflake, but we have a handful of outliers made evident in the box-and-whisker plot, one of which stands alone.

To finish things up, we should see what these super unique snowflakes actually look like, so let's display the top 3 most unique snowflakes in a column on the left, along with their most similar snowflake counterparts in the column on the right.
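If you'd like to list those outliers rather than read them off the plot, a common convention is to flag any value more than 1.5 times the interquartile range above the third quartile. Here is a minimal sketch using only the standard library. The function name upper_outliers is hypothetical, and the quartile method is an assumption, so the cutoff may differ slightly from the whiskers your plotting library draws.

```python
from statistics import quantiles


def upper_outliers(values, whisker=1.5):
    """Return (position, value) pairs lying above Q3 + whisker * IQR."""
    # "inclusive" treats the data as the whole population when interpolating
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + whisker * (q3 - q1)
    return [(i, v) for i, v in enumerate(values) if v > upper_fence]
```

You could pass it the L2 column of the dataframe converted to a plain list; the positions it returns refer to that list, not to the snowflake index column.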

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
i = 0
for row in df.head(3).itertuples():
    # column 1

    ax[i][0].set_title('Unique Rank: %s' % (i+1), fontsize=12, loc='center')
    ax[i][0].text(0.5, -0.1, 'index = %s' % row.index, size=11, ha='center', transform=ax[i][0].transAxes)
    # column 2

    ax[i][1].set_title('L2 Distance: %s' % (row.L2), fontsize=12, loc='center')
    ax[i][1].text(0.5, -0.1, 'index = %s' % row.similar_index, size=11, ha='center', transform=ax[i][1].transAxes)
    i += 1

fig.subplots_adjust(wspace=-.56, hspace=.5)


This is why ML techniques are so great. No one would ever look at that first snowflake and think, that's one super unique snowflake, but according to our analysis it is by far the most dissimilar to its next most similar snowflake.


Now, there are a multitude of tools that you could have used and ML methodologies that you could have leveraged to find a unique snowflake, including one of those overhyped ones. The nice thing about using Cloudera's Applied ML Prototypes is that we were able to leverage an existing, fully built, and functional solution, and alter it for our own purposes, resulting in a significantly faster time to insight than had we started from scratch. That, ladies and gentlemen, is what AMPs are all about!

For your convenience, I've made the final resulting notebook available on GitHub here. If you're interested in finishing projects faster (better question: who isn't?) you should also take the time to check out what code in the other AMPs could be used for your current projects. Just select the AMP you're interested in and you'll see a link to view the source code on GitHub. After all, who wouldn't be interested in, legally, starting a race closer to the finish line? Take a test drive to try AMPs for yourself.


