OpenAI recently released the second version of their embedding models. They have consolidated their previous set of five models into one, reduced the cost of using it, and improved performance. The previous version of the OpenAI embeddings was shown to fare poorly compared to smaller models (for example, see this thorough critique by Nils Reimers).

Although the dimensionality of the new embeddings has been reduced to 1536, this is still much larger than that of many of the smaller transformer models used in the above critique. Embeddings are usually the starting point for some downstream task, such as clustering, so their size matters: the bigger they are, the more memory and processing power they require. Also, the improvements in sentence similarity in version 2 do not seem large enough to bridge the gap in performance between version 1 and the smaller models.
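To make the memory point concrete, here is a rough back-of-the-envelope sketch. It assumes dense float32 vectors (4 bytes per value) and a hypothetical corpus of one million texts; neither figure comes from the critique, and 384 is used as the example of a smaller embedding size.

```python
def embedding_bytes(n_items, dims, bytes_per_value=4):
    """Bytes needed to hold n_items dense vectors (float32 assumed by default)."""
    return n_items * dims * bytes_per_value


n_items = 1_000_000  # hypothetical corpus size, for illustration only

large = embedding_bytes(n_items, 1536)  # OpenAI version 2 dimensionality
small = embedding_bytes(n_items, 384)   # example of a smaller model's dimensionality

print(f"1536 dims: {large / 1e9:.2f} GB")
print(f" 384 dims: {small / 1e9:.2f} GB")
print(f"ratio: {large // small}x")
```

Storage scales linearly with the number of dimensions, and so does the cost of each dot product used in a similarity search or clustering step.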

I thought it would be interesting to apply the smallest of the above small models to a real-world application: grouping companies together via descriptions of their business. The model is all-MiniLM-L6-v2 and the business descriptions were scraped from finviz for companies in the S&P500. The model consists of 22M parameters (vs 175B for GPT-3) and produces embeddings with 384 dimensions. The business descriptions have a median length of 200 words (90th percentile 260). This matters because the model is designed to be used on short paragraphs and truncates text at 256 tokens, so the end of the longer descriptions may be cut off.
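The length statistics above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the code used for the post: the function names are made up for illustration, and whitespace-separated words are only a proxy for the model's tokens (a WordPiece tokenizer usually produces at least as many tokens as there are words, so the real share of truncated texts is likely higher than this estimate).

```python
import statistics


def word_count(text):
    """Number of whitespace-separated words in a text."""
    return len(text.split())


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1..100)."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceil(pct * n / 100)
    return sorted_values[max(rank, 1) - 1]


def length_summary(descriptions, limit=256):
    """Median, 90th percentile and share of texts longer than `limit` words."""
    counts = sorted(word_count(d) for d in descriptions)
    return {
        "median": statistics.median(counts),
        "p90": percentile(counts, 90),
        "share_over_limit": sum(c > limit for c in counts) / len(counts),
    }
```

Calling `length_summary(descriptions)` on the list of scraped descriptions gives a quick view of how much text is at risk of being dropped by the 256-token limit.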

The encoding is fast (43 seconds for 489 companies) and, given the low dimensionality, needs little memory. The embeddings can be found here.
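For readers who want to try the encoding step, here is a hedged sketch using the sentence-transformers package. The helper names (`load_model`, `encode_descriptions`) are hypothetical, and normalising the embeddings is my own choice rather than something stated in the post; timings will depend on hardware.

```python
import time

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def load_model(name=MODEL_NAME):
    """Load the model; imported lazily so the package is only needed when called."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def encode_descriptions(descriptions, model):
    """Encode texts with `model.encode` and return (embeddings, elapsed seconds)."""
    start = time.perf_counter()
    # normalize_embeddings=True scales each vector to unit length (an assumption
    # of this sketch), so a plain dot product equals cosine similarity.
    embeddings = model.encode(list(descriptions), normalize_embeddings=True)
    elapsed = time.perf_counter() - start
    return embeddings, elapsed
```

Usage would look like `embeddings, seconds = encode_descriptions(descriptions, load_model())`, where `descriptions` is the list of business descriptions; the result should have one 384-dimensional row per company.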

As well as doing well on standard tests, the embeddings produce highly intuitive results in a real-world application. Below are the clusters produced by applying affinity propagation to the cosine similarity of the embeddings. As you can see, the clusters make economic sense:

Cluster 1: apple(AAPL), alphabet(GOOG), alphabet(GOOGL), amazon(AMZN), arista(ANET)

Cluster 2: johnson(JNJ), procter(PG), estee(EL), kimberly(KMB), ecolab(ECL), international(IFF), church(CHD), clorox(CLX), cooper(COO), align(ALGN), dentsply(XRAY)

Cluster 3: jpmorgan(JPM), bank(BAC), wells(WFC), morgan(MS), goldman(GS), citigroup(C), s&p(SPGI), pnc(PNC), us(USB), intercontinental(ICE), capital(COF), bank(BK), ameriprise(AMP), fifth(FITB), raymond(RJF), first(FRC), huntington(HBAN), regions(RF), m&t(MTB), citizens(CFG), northern(NTRS), keycorp(KEY), svb(SIVB), cboe(CBOE), nasdaq(NDAQ), comerica(CMA), signature(SBNY), zions(ZION)

Cluster 4: united(UPS), norfolk(NSC), fedex(FDX), old(ODFL), otis(OTIS), expeditors(EXPD), c.(CHRW)

Cluster 5: broadcom(AVGO), t-mobile(TMUS), verizon(VZ), qualcomm(QCOM), at&t(T), crown(CCI), charter(CHTR), motorola(MSI), te(TEL), sba(SBAC), lumen(LUMN)

Cluster 6: medtronic(MDT), intuitive(ISRG), stryker(SYK), boston(BSX), edwards(EW), resmed(RMD), baxter(BAX), teleflex(TFX)

Cluster 7: unitedhealth(UNH), philip(PM), cvs(CVS), cigna(CI), centene(CNC), cintas(CTAS), molina(MOH), omnicom(OMC), davita(DVA)

Cluster 8: applied(AMAT), lam(LRCX), kla(KLAC), dupont(DD), corning(GLW), western(WDC), mohawk(MHK)

Cluster 9: texas(TXN), analog(ADI), cadence(CDNS), nxp(NXPI), on(ON), teledyne(TDY), monolithic(MPWR), qorvo(QRVO)

Cluster 10: microsoft(MSFT), oracle(ORCL), accenture(ACN), adobe(ADBE), salesforce(CRM), servicenow(NOW), roper(ROP), cognizant(CTSH), cdw(CDW), epam(EPAM), factset(FDS), netapp(NTAP), ptc(PTC), akamai(AKAM), tyler(TYL), f5(FFIV), fortinet(FTNT), dxc(DXC)

Cluster 11: chevron(CVX), marathon(MPC), sherwin(SHW), valero(VLO), phillips(PSX), kinder(KMI), marathon(MRO), targa(TRGP)

Cluster 12: hca(HCA), humana(HUM), universal(UHS)

Cluster 13: illinois(ITW), 3m(MMM), amphenol(APH), autodesk(ADSK), nucor(NUE), mettler(MTD), keysight(KEYS), ametek(AME), rockwell(ROK), fastenal(FAST), fortive(FTV), ansys(ANSS), dover(DOV), idex(IEX), howmet(HWM), avery(AVY), teradyne(TER), nordson(NDSN), stanley(SWK), snap-on(SNA)

Cluster 14: honeywell(HON), raytheon(RTX), lockheed(LMT), boeing(BA), general(GE), northrop(NOC), general(GD), l3harris(LHX), parker(PH), transdigm(TDG), garmin(GRMN), textron(TXT), leidos(LDOS), huntington(HII)

Cluster 15: waste(WM), republic(RSG)

Cluster 16: linde(LIN), altria(MO), air(APD), cf(CF), skyworks(SWKS), sealed(SEE)

Cluster 17: nvidia(NVDA), tesla(TSLA), general(GM), ford(F), dexcom(DXCM), paccar(PCAR), aptiv(APTV), hunt(JBHT), westinghouse(WAB), lkq(LKQ), carmax(KMX)

Cluster 18: exxon(XOM), conocophillips(COP), eog(EOG), occidental(OXY), freeport-mcmoran(FCX), pioneer(PXD), archer-daniels-midland(ADM), devon(DVN), hess(HES), williams(WMB), newmont(NEM), diamondback(FANG), coterra(CTRA), eqt(EQT), apa(APA)

Cluster 19: coca(KO), pepsico(PEP), keurig(KDP), monster(MNST), constellation(STZ), brown(BF-B), molson(TAP)

Cluster 20: prologis(PLD), public(PSA), extra(EXR), iron(IRM), seagate(STX)

Cluster 21: berkshire(BRK-B), mondelez(MDLZ), colgate(CL), dollar(DG), general(GIS), hershey(HSY), sysco(SYY), w.w.(GWW), hormel(HRL), kellogg(K), tyson(TSN), mccormick(MKC), conagra(CAG), j(SJM), campbell(CPB), lamb(LW), whirlpool(WHR), newell(NWL)

Cluster 22: chubb(CB), metlife(MET), american(AIG), travelers(TRV), prudential(PRU), principal(PFG), globe(GL), lincoln(LNC), assurant(AIZ)

Cluster 23: eli(LLY), abbvie(ABBV), pfizer(PFE), bristol(BMY), amgen(AMGN), gilead(GILD), regeneron(REGN), vertex(VRTX), moderna(MRNA), corteva(CTVA), biogen(BIIB), zimmer(ZBH), incyte(INCY), organon(OGN)

Cluster 24: advanced(AMD), intel(INTC), micron(MU), synopsys(SNPS), microchip(MCHP)

Cluster 25: mcdonalds(MCD), chipotle(CMG), yum(YUM), darden(DRI), dominos(DPZ)

Cluster 26: walmart(WMT), home(HD), costco(COST), lowes(LOW), tjx(TJX), target(TGT), kraft(KHC), kroger(KR), walgreens(WBA), dollar(DLTR), tractor(TSCO), ulta(ULTA)

Cluster 27: merck(MRK), zoetis(ZTS), mckesson(MCK), iqvia(IQV), amerisourcebergen(ABC), cardinal(CAH), west(WST), viatris(VTRS), henry(HSIC), catalent(CTLT)

Cluster 28: deere(DE), caterpillar(CAT), schlumberger(SLB), halliburton(HAL), baker(BKR), united(URI)

Cluster 29: mastercard(MA), international(IBM), intuit(INTU), fiserv(FISV), fidelity(FIS), digital(DLR), copart(CPRT), global(GPN), discover(DFS), equifax(EFX), synchrony(SYF), fleetcor(FLT), trimble(TRMB), henry(JKHY)

Cluster 30: schwab(SCHW), blackrock(BLK), truist(TFC), moodys(MCO), msci(MSCI), realty(O), verisk(VRSK), state(STT), broadridge(BR), marketaxess(MKTX), invesco(IVZ)

Cluster 31: paypal(PYPL), equinix(EQIX), electronic(EA), ebay(EBAY), etsy(ETSY)

Cluster 32: horton(DHI), price(TROW), lennar(LEN), weyerhaeuser(WY), rollins(ROL), nvr(NVR), pultegroup(PHM), bath(BBWI)

Cluster 33: progressive(PGR), aon(AON), aflac(AFL), arthur(AJG), carrier(CARR), allstate(ALL), hartford(HIG), arch(ACGL), berkley(WRB), cincinnati(CINF), brown(BRO), everest(RE)

Cluster 34: marsh(MMC), cme(CME), costar(CSGP), gartner(IT), cbre(CBRE), jacobs(J), expedia(EXPE), regency(REG), robert(RHI)

Cluster 35: cisco(CSCO), hp(HPQ), hewlett(HPE), zebra(ZBRA), juniper(JNPR)

Cluster 36: automatic(ADP), paychex(PAYX), paycom(PAYC), ceridian(CDAY)

Cluster 37: nextera(NEE), emerson(EMR), american(AEP), johnson(JCI), enphase(ENPH), exelon(EXC), constellation(CEG), entergy(ETR), aes(AES), solaredge(SEDG), interpublic(IPG), generac(GNRC)

Cluster 38: comcast(CMCSA), netflix(NFLX), fox(FOX), fox(FOXA), dish(DISH)

Cluster 39: duke(DUK), southern(SO), sempra(SRE), dominion(D), pg&e(PCG), consolidated(ED), public(PEG), edison(EIX), ameren(AEE), dte(DTE), centerpoint(CNP), cms(CMS), atmos(ATO), nisource(NI), pinnacle(PNW)

Cluster 40: walt(DIS), booking(BKNG), las(LVS), mgm(MGM), caesars(CZR), wynn(WYNN)

Cluster 41: verisign(VRSN), royal(RCL), carnival(CCL), norwegian(NCLH)

Cluster 42: union(UNP), csx(CSX), xcel(XEL), wec(WEC), oneok(OKE), eversource(ES), firstenergy(FE), ppl(PPL), quanta(PWR), loews(L), alliant(LNT), evergy(EVRG), nrg(NRG), allegion(ALLE)

Cluster 43: marriott(MAR), hilton(HLT), vici(VICI), ventas(VTR), host(HST), camden(CPT)

Cluster 44: amcor(AMCR), ball(BALL), international(IP), packaging(PKG), westrock(WRK)

Cluster 45: thermo(TMO), danaher(DHR), abbott(ABT), becton(BDX), agilent(A), idexx(IDXX), illumina(ILMN), laboratory(LH), waters(WAT), hologic(HOLX), quest(DGX), steris(STE), perkinelmer(PKI), fmc(FMC), bio-rad(BIO), charles(CRL), bio-techne(TECH)

Cluster 46: dow(DOW), albemarle(ALB), ppg(PPG), lyondellbasell(LYB), vulcan(VMC), martin(MLM), eastman(EMN), celanese(CE)

Cluster 47: news(NWS), news(NWSA)

Cluster 48: american(AMT), simon(SPG), equity(EQR), mid(MAA), essex(ESS), healthpeak(PEAK), kimco(KIM), franklin(BEN), udr(UDR), boston(BXP), federal(FRT), vornado(VNO)

Cluster 49: american(AXP), southwest(LUV), delta(DAL), united(UAL), american(AAL), alaska(ALK)

Cluster 50: eaton(ETN), o(ORLY), autozone(AZO), cummins(CMI), genuine(GPC), borgwarner(BWA), advance(AAP)

Cluster 51: activision(ATVI), best(BBY), live(LYV), v(VFC), take(TTWO), hasbro(HAS)

Cluster 52: nike(NKE), ross(ROST), match(MTCH), tapestry(TPR), ralph(RL)

Cluster 53: trane(TT), american(AWK), ingersoll(IR), xylem(XYL), mosaic(MOS), pool(POOL), masco(MAS), smith(AOS), pentair(PNR)
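
The clustering step that produced the groups above (affinity propagation on the cosine similarity of the embeddings) can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code behind the results: it uses scikit-learn's AffinityPropagation with default settings and a fixed random seed, the function names are hypothetical, and the toy vectors stand in for the real 384-dimensional embeddings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    return X @ X.T


def cluster_names(names, embeddings, random_state=0):
    """Group names by affinity propagation on a precomputed cosine similarity."""
    sim = cosine_similarity_matrix(embeddings)
    ap = AffinityPropagation(affinity="precomputed", random_state=random_state)
    labels = ap.fit_predict(sim)
    clusters = {}
    for name, label in zip(names, labels):
        clusters.setdefault(int(label), []).append(name)
    return clusters


# Toy data: two well-separated groups of made-up 3-dimensional vectors.
toy_names = ["a1", "a2", "a3", "b1", "b2", "b3"]
toy_vectors = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.0, 0.05],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.95, 0.05],
]
toy_clusters = cluster_names(toy_names, toy_vectors)
for label, members in sorted(toy_clusters.items()):
    print(f"Cluster {label + 1}: {', '.join(members)}")
```

One attraction of affinity propagation for this task is that the number of clusters does not have to be chosen in advance; it emerges from the similarities and the `preference` setting, which scikit-learn defaults to the median similarity.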