Good things come in small packages - using small transformer models to embed company descriptions
OpenAI recently released the second version of their GPT-3 embedding models. They have consolidated the previous set of five models into one, reduced the cost of using it and improved performance. The previous version of the OpenAI embeddings was shown to fare poorly compared with smaller models (for example, see this thorough critique by Nils Reimers).
Although the dimensionality of the new embeddings has been reduced to 1536, this is still much larger than that of many of the smaller transformer models covered in the above critique. Embeddings are usually the starting point for some downstream task, such as clustering, so their size matters: the bigger they are, the more memory and processing power they require. Also, the improvements in sentence similarity in version 2 do not seem large enough to bridge the gap in performance between version 1 and the smaller models.
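To put rough numbers on the size argument, here is a back-of-the-envelope sketch. It assumes the embeddings are stored as 32-bit floats (the storage format is not stated above), and both the `embedding_bytes` helper and the one-million-text corpus are purely illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as 32-bit floats


def embedding_bytes(n_texts: int, n_dims: int) -> int:
    """Bytes needed to hold n_texts dense embeddings of n_dims values each."""
    return n_texts * n_dims * BYTES_PER_FLOAT32


if __name__ == "__main__":
    n_texts = 1_000_000  # illustrative corpus size, not a figure from this post
    for label, dims in [("1536 dimensions", 1536), ("384 dimensions", 384)]:
        gib = embedding_bytes(n_texts, dims) / 2**30
        print(f"{label}: {gib:.2f} GiB for {n_texts:,} texts")
```

Under these assumptions a million 1536-dimensional embeddings take about 5.7 GiB against about 1.4 GiB at 384 dimensions. The cost of each dot product used in a similarity search grows by the same factor of four.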
I thought it would be interesting to apply the smallest of the above small models to a real-world application: grouping companies together via descriptions of their business. The model is all-MiniLM-L6-v2 and the business descriptions were scraped from finviz for companies in the S&P 500. The model consists of 22M parameters (vs 175B for GPT-3) and produces embeddings with 384 dimensions. The business descriptions have a median length of 200 words (90th percentile 260). This matters because the model is designed to be used on short paragraphs and truncates text at 256 tokens, so the longest descriptions are cut off before the end.
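A quick way to gauge how many descriptions are at risk of truncation is to compare word counts with the token limit. The sketch below is an approximation: it counts whitespace-separated words, and a subword tokenizer usually emits at least one token per word, so the reported share is best read as a lower bound. The names `length_summary` and `TOKEN_LIMIT` are illustrative and not taken from any code behind this post.

```python
import statistics

TOKEN_LIMIT = 256  # truncation length quoted above for all-MiniLM-L6-v2


def length_summary(descriptions, limit=TOKEN_LIMIT):
    """Summarise description lengths in whitespace-separated words.

    Word counts are a cheap proxy for token counts, so the share over the
    limit is likely an underestimate of what the model actually truncates.
    Needs at least two descriptions (a requirement of statistics.quantiles).
    """
    counts = [len(text.split()) for text in descriptions]
    # quantiles(n=10) returns the nine decile cut points; the last one is
    # the 90th percentile.
    p90 = statistics.quantiles(counts, n=10)[-1]
    over = sum(count > limit for count in counts)
    return {
        "median_words": statistics.median(counts),
        "p90_words": p90,
        "share_over_limit": over / len(counts),
    }
```

Counting real tokens would require running the model's own tokenizer over each description, which is the more accurate check if the library is already installed.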
The encoding is fast (43 seconds for 489 companies) and, given the low dimensionality, needs little memory. The embeddings can be found here.
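For readers who want to reproduce the encoding step, a minimal sketch using the sentence-transformers library might look like the following. The code behind this post is not shown, so the function names, the `{ticker: description}` input format and the choice to normalise the vectors to unit length are all assumptions.

```python
def load_model(name="all-MiniLM-L6-v2"):
    """Load the model via sentence-transformers (downloads weights on first use)."""
    # Imported lazily so the helper below can be used and tested without
    # the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_descriptions(descriptions, model):
    """Encode a {ticker: description} mapping and return (tickers, embeddings).

    `model` is any object exposing SentenceTransformer's `encode` method.
    Row i of the returned embeddings corresponds to tickers[i].
    """
    tickers = list(descriptions)
    texts = [descriptions[ticker] for ticker in tickers]
    # normalize_embeddings=True asks for unit-length vectors, so the dot
    # product of two rows equals their cosine similarity.
    embeddings = model.encode(texts, normalize_embeddings=True)
    return tickers, embeddings
```

Calling `embed_descriptions(descriptions, load_model())` on the scraped descriptions should return one 384-dimensional row per company. Text beyond the model's maximum sequence length is truncated by the library, not rejected.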
As well as performing well in standard tests, the embeddings produce intuitive results in a real-world application. At the bottom of this post are the clusters formed by applying affinity propagation to the cosine similarity of the embeddings. As you can see, they produce clusters that make economic sense. There are one or two oddities though. For example, Philip Morris (PM) is in cluster 7, which looks like quite a healthy cluster. Most of the other members, like UnitedHealth and Molina, are in healthcare or medical supplies. Here is PM’s description:
Philip Morris International Inc. operates as a tobacco company working to delivers a smoke-free future and evolving portfolio for the long-term to include products outside of the tobacco and nicotine sector. The company’s product portfolio primarily consists of cigarettes and smoke-free products, including heat-not-burn, vapor, and oral nicotine products that are sold in markets outside the United States. The company offers its smoke-free products under the HEETS, HEETS Creations, HEETS Dimensions, HEETS Marlboro, HEETS FROM MARLBORO, Marlboro Dimensions, Marlboro HeatSticks, Parliament HeatSticks, and TEREA brands, as well as the KT&G-licensed brands, Fiit, and Miix. It also sells its products under the Marlboro, Parliament, Bond Street, Chesterfield, L&M, Lark, and Philip Morris brands. In addition, the company owns various cigarette brands, such as Dji Sam Soe, Sampoerna A, and Sampoerna U in Indonesia; and Fortune and Jackpot in the Philippines. The company sells its smoke-free products in 71 markets. Philip Morris International Inc. was incorporated in 1987 and is headquartered in New York, New York.
All that talk of a smoke-free future makes PM seem pretty healthy! Further improvements or refinements could come from using longer texts (which would then require larger models). For example, US-listed companies are required to describe their business activities in Item 1 of Form 10-K (for reference, the paragraph above has 164 words, versus 3871 words in the 10-K business description). However, the aim of this post wasn’t to devise an optimal clustering strategy but rather to highlight that, as with many things, in machine learning bigger isn’t always better!
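The clustering step itself (affinity propagation on the cosine similarity of the embeddings) can be sketched as follows. This is an illustration and not the code that produced the clusters below: it assumes scikit-learn's `AffinityPropagation` with a precomputed similarity matrix and otherwise default settings, and the helper names are hypothetical.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for a list of equal-length, non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            for v, norm_v in zip(vectors, norms)
        ]
        for u, norm_u in zip(vectors, norms)
    ]


def cluster_by_affinity_propagation(vectors, random_state=0):
    """Return one integer cluster label per input vector."""
    # Lazy imports: only this function needs NumPy and scikit-learn.
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    similarities = np.array(cosine_similarity_matrix(vectors))
    # affinity="precomputed" tells scikit-learn to treat the input as a
    # similarity matrix; by default it would derive similarities from
    # negative squared Euclidean distances instead.
    model = AffinityPropagation(affinity="precomputed", random_state=random_state)
    return model.fit_predict(similarities)
```

Affinity propagation does not take the number of clusters as an input. In scikit-learn the count is influenced by the `preference` parameter, which defaults to the median of the input similarities, so different settings would give a different number of groups.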
Clusters
Cluster 1: apple(AAPL), alphabet(GOOG), alphabet(GOOGL), amazon(AMZN), arista(ANET)
Cluster 2: johnson(JNJ), procter(PG), estee(EL), kimberly(KMB), ecolab(ECL), international(IFF), church(CHD), clorox(CLX), cooper(COO), align(ALGN), dentsply(XRAY)
Cluster 3: jpmorgan(JPM), bank(BAC), wells(WFC), morgan(MS), goldman(GS), citigroup(C), s&p(SPGI), pnc(PNC), us(USB), intercontinental(ICE), capital(COF), bank(BK), ameriprise(AMP), fifth(FITB), raymond(RJF), first(FRC), huntington(HBAN), regions(RF), m&t(MTB), citizens(CFG), northern(NTRS), keycorp(KEY), svb(SIVB), cboe(CBOE), nasdaq(NDAQ), comerica(CMA), signature(SBNY), zions(ZION)
Cluster 4: united(UPS), norfolk(NSC), fedex(FDX), old(ODFL), otis(OTIS), expeditors(EXPD), c.(CHRW)
Cluster 5: broadcom(AVGO), t-mobile(TMUS), verizon(VZ), qualcomm(QCOM), at&t(T), crown(CCI), charter(CHTR), motorola(MSI), te(TEL), sba(SBAC), lumen(LUMN)
Cluster 6: medtronic(MDT), intuitive(ISRG), stryker(SYK), boston(BSX), edwards(EW), resmed(RMD), baxter(BAX), teleflex(TFX)
Cluster 7: unitedhealth(UNH), philip(PM), cvs(CVS), cigna(CI), centene(CNC), cintas(CTAS), molina(MOH), omnicom(OMC), davita(DVA)
Cluster 8: applied(AMAT), lam(LRCX), kla(KLAC), dupont(DD), corning(GLW), western(WDC), mohawk(MHK)
Cluster 9: texas(TXN), analog(ADI), cadence(CDNS), nxp(NXPI), on(ON), teledyne(TDY), monolithic(MPWR), qorvo(QRVO)
Cluster 10: microsoft(MSFT), oracle(ORCL), accenture(ACN), adobe(ADBE), salesforce(CRM), servicenow(NOW), roper(ROP), cognizant(CTSH), cdw(CDW), epam(EPAM), factset(FDS), netapp(NTAP), ptc(PTC), akamai(AKAM), tyler(TYL), f5(FFIV), fortinet(FTNT), dxc(DXC)
Cluster 11: chevron(CVX), marathon(MPC), sherwin(SHW), valero(VLO), phillips(PSX), kinder(KMI), marathon(MRO), targa(TRGP)
Cluster 12: hca(HCA), humana(HUM), universal(UHS)
Cluster 13: illinois(ITW), 3m(MMM), amphenol(APH), autodesk(ADSK), nucor(NUE), mettler(MTD), keysight(KEYS), ametek(AME), rockwell(ROK), fastenal(FAST), fortive(FTV), ansys(ANSS), dover(DOV), idex(IEX), howmet(HWM), avery(AVY), teradyne(TER), nordson(NDSN), stanley(SWK), snap-on(SNA)
Cluster 14: honeywell(HON), raytheon(RTX), lockheed(LMT), boeing(BA), general(GE), northrop(NOC), general(GD), l3harris(LHX), parker(PH), transdigm(TDG), garmin(GRMN), textron(TXT), leidos(LDOS), huntington(HII)
Cluster 15: waste(WM), republic(RSG)
Cluster 16: linde(LIN), altria(MO), air(APD), cf(CF), skyworks(SWKS), sealed(SEE)
Cluster 17: nvidia(NVDA), tesla(TSLA), general(GM), ford(F), dexcom(DXCM), paccar(PCAR), aptiv(APTV), hunt(JBHT), westinghouse(WAB), lkq(LKQ), carmax(KMX)
Cluster 18: exxon(XOM), conocophillips(COP), eog(EOG), occidental(OXY), freeport-mcmoran(FCX), pioneer(PXD), archer-daniels-midland(ADM), devon(DVN), hess(HES), williams(WMB), newmont(NEM), diamondback(FANG), coterra(CTRA), eqt(EQT), apa(APA)
Cluster 19: coca(KO), pepsico(PEP), keurig(KDP), monster(MNST), constellation(STZ), brown(BF-B), molson(TAP)
Cluster 20: prologis(PLD), public(PSA), extra(EXR), iron(IRM), seagate(STX)
Cluster 21: berkshire(BRK-B), mondelez(MDLZ), colgate(CL), dollar(DG), general(GIS), hershey(HSY), sysco(SYY), w.w.(GWW), hormel(HRL), kellogg(K), tyson(TSN), mccormick(MKC), conagra(CAG), j(SJM), campbell(CPB), lamb(LW), whirlpool(WHR), newell(NWL)
Cluster 22: chubb(CB), metlife(MET), american(AIG), travelers(TRV), prudential(PRU), principal(PFG), globe(GL), lincoln(LNC), assurant(AIZ)
Cluster 23: eli(LLY), abbvie(ABBV), pfizer(PFE), bristol(BMY), amgen(AMGN), gilead(GILD), regeneron(REGN), vertex(VRTX), moderna(MRNA), corteva(CTVA), biogen(BIIB), zimmer(ZBH), incyte(INCY), organon(OGN)
Cluster 24: advanced(AMD), intel(INTC), micron(MU), synopsys(SNPS), microchip(MCHP)
Cluster 25: mcdonalds(MCD), chipotle(CMG), yum(YUM), darden(DRI), dominos(DPZ)
Cluster 26: walmart(WMT), home(HD), costco(COST), lowes(LOW), tjx(TJX), target(TGT), kraft(KHC), kroger(KR), walgreens(WBA), dollar(DLTR), tractor(TSCO), ulta(ULTA)
Cluster 27: merck(MRK), zoetis(ZTS), mckesson(MCK), iqvia(IQV), amerisourcebergen(ABC), cardinal(CAH), west(WST), viatris(VTRS), henry(HSIC), catalent(CTLT)
Cluster 28: deere(DE), caterpillar(CAT), schlumberger(SLB), halliburton(HAL), baker(BKR), united(URI)
Cluster 29: mastercard(MA), international(IBM), intuit(INTU), fiserv(FISV), fidelity(FIS), digital(DLR), copart(CPRT), global(GPN), discover(DFS), equifax(EFX), synchrony(SYF), fleetcor(FLT), trimble(TRMB), henry(JKHY)
Cluster 30: schwab(SCHW), blackrock(BLK), truist(TFC), moodys(MCO), msci(MSCI), realty(O), verisk(VRSK), state(STT), broadridge(BR), marketaxess(MKTX), invesco(IVZ)
Cluster 31: paypal(PYPL), equinix(EQIX), electronic(EA), ebay(EBAY), etsy(ETSY)
Cluster 32: horton(DHI), price(TROW), lennar(LEN), weyerhaeuser(WY), rollins(ROL), nvr(NVR), pultegroup(PHM), bath(BBWI)
Cluster 33: progressive(PGR), aon(AON), aflac(AFL), arthur(AJG), carrier(CARR), allstate(ALL), hartford(HIG), arch(ACGL), berkley(WRB), cincinnati(CINF), brown(BRO), everest(RE)
Cluster 34: marsh(MMC), cme(CME), costar(CSGP), gartner(IT), cbre(CBRE), jacobs(J), expedia(EXPE), regency(REG), robert(RHI)
Cluster 35: cisco(CSCO), hp(HPQ), hewlett(HPE), zebra(ZBRA), juniper(JNPR)
Cluster 36: automatic(ADP), paychex(PAYX), paycom(PAYC), ceridian(CDAY)
Cluster 37: nextera(NEE), emerson(EMR), american(AEP), johnson(JCI), enphase(ENPH), exelon(EXC), constellation(CEG), entergy(ETR), aes(AES), solaredge(SEDG), interpublic(IPG), generac(GNRC)
Cluster 38: comcast(CMCSA), netflix(NFLX), fox(FOX), fox(FOXA), dish(DISH)
Cluster 39: duke(DUK), southern(SO), sempra(SRE), dominion(D), pg&e(PCG), consolidated(ED), public(PEG), edison(EIX), ameren(AEE), dte(DTE), centerpoint(CNP), cms(CMS), atmos(ATO), nisource(NI), pinnacle(PNW)
Cluster 40: walt(DIS), booking(BKNG), las(LVS), mgm(MGM), caesars(CZR), wynn(WYNN)
Cluster 41: verisign(VRSN), royal(RCL), carnival(CCL), norwegian(NCLH)
Cluster 42: union(UNP), csx(CSX), xcel(XEL), wec(WEC), oneok(OKE), eversource(ES), firstenergy(FE), ppl(PPL), quanta(PWR), loews(L), alliant(LNT), evergy(EVRG), nrg(NRG), allegion(ALLE)
Cluster 43: marriott(MAR), hilton(HLT), vici(VICI), ventas(VTR), host(HST), camden(CPT)
Cluster 44: amcor(AMCR), ball(BALL), international(IP), packaging(PKG), westrock(WRK)
Cluster 45: thermo(TMO), danaher(DHR), abbott(ABT), becton(BDX), agilent(A), idexx(IDXX), illumina(ILMN), laboratory(LH), waters(WAT), hologic(HOLX), quest(DGX), steris(STE), perkinelmer(PKI), fmc(FMC), bio-rad(BIO), charles(CRL), bio-techne(TECH)
Cluster 46: dow(DOW), albemarle(ALB), ppg(PPG), lyondellbasell(LYB), vulcan(VMC), martin(MLM), eastman(EMN), celanese(CE)
Cluster 47: news(NWS), news(NWSA)
Cluster 48: american(AMT), simon(SPG), equity(EQR), mid(MAA), essex(ESS), healthpeak(PEAK), kimco(KIM), franklin(BEN), udr(UDR), boston(BXP), federal(FRT), vornado(VNO)
Cluster 49: american(AXP), southwest(LUV), delta(DAL), united(UAL), american(AAL), alaska(ALK)
Cluster 50: eaton(ETN), o(ORLY), autozone(AZO), cummins(CMI), genuine(GPC), borgwarner(BWA), advance(AAP)
Cluster 51: activision(ATVI), best(BBY), live(LYV), v(VFC), take(TTWO), hasbro(HAS)
Cluster 52: nike(NKE), ross(ROST), match(MTCH), tapestry(TPR), ralph(RL)
Cluster 53: trane(TT), american(AWK), ingersoll(IR), xylem(XYL), mosaic(MOS), pool(POOL), masco(MAS), smith(AOS), pentair(PNR)