2023-04-13 06:32:53 +00:00

Compare four vector search databases for searching over images

Databases to test

  • Marqo: a tensor search database
  • Milvus: a vector search database maintained by the Linux Foundation
  • Qdrant: an open source vector search database for images
  • Weaviate: an open source but heavily commercial vector database. It is multipurpose but focused on LLM use cases

Data to Load

	- [DOTA-v1.0](https://captain-whu.github.io/DOTA/dataset.html)

Notes

	- Milvus relies on the Towhee library for the actual ML models
	- Towhee requires the `libgl1-mesa-glx` driver package to be installed in your OS
	- Marqo was the easiest to set up, and its documentation includes a multimodal text-to-image example
	- Milvus and Towhee are closely interconnected, and their documentation references each other
	- Towhee is very powerful, but it lacks a good API reference for the Python SDK
	- Towhee didn't have a multimodal example, yet it was fairly easy to set up once you know which model you want to use. It is quite flexible and extensible
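
Regarding the `libgl1-mesa-glx` note above, a small check like the following can tell you whether a libGL shared library is already discoverable before installing anything. This is only a sketch: it assumes the missing dependency surfaces as an absent libGL library, which is the usual symptom on Debian/Ubuntu but was not verified for every setup, and the function name `libgl_available` is made up for illustration.

```python
import ctypes.util


def libgl_available() -> bool:
    """Return True if a libGL shared library can be located on this system."""
    # find_library returns a library name (for example "libGL.so.1")
    # or None when nothing is found.
    return ctypes.util.find_library("GL") is not None


if __name__ == "__main__":
    if libgl_available():
        print("libGL found")
    else:
        # Assumption: on Debian/Ubuntu the package named in the notes provides it.
        print("libGL not found; try installing libgl1-mesa-glx")
```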

Motivation

	- the main motivation for this project is to understand which vector database will be the easiest to maintain in an application whose dataset doesn't change frequently and is very homogeneous
	- the application is for internal use, so it should neither rely on infrastructure that is complex to set up nor require a lot of implementation work from the developers
	- the use case is to let internal developers query an image dataset to find elements that validate a hypothesis, better understand the image dataset they're working with, and generalize a bugfix from a single issue with a specific data point to a broader subset
	- the selected dataset contains aerial images of urban areas because it is quite homogeneous: there won't be multiple scenes or objects to interact with, which represents the kind of scenario we are interested in applying the databases to
	- we'll also use the embedding technologies featured in each database's documentation in order to assess the quality of the ecosystem surrounding it, because that is also an indication of which use cases the maintainers have in mind when they plan the feature roadmap as the database matures
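
To make the use case concrete, here is a minimal, database-free sketch of the query the databases above are expected to answer: starting from the embedding of one problematic image, find the most similar images in the dataset. Everything in it is an assumption for illustration. The file names and three-dimensional vectors are made up, and a brute-force cosine similarity scan stands in for whatever index and embedding model each database actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=3):
    """Return the k (image_id, score) pairs closest to the query vector."""
    scored = [(image_id, cosine_similarity(query, vec))
              for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from an image model.
index = {
    "tile_001.png": [0.9, 0.1, 0.0],
    "tile_002.png": [0.8, 0.2, 0.1],
    "tile_003.png": [0.0, 0.1, 0.9],
}

# Start from the data point that exposed an issue and look for its neighbors.
query = index["tile_001.png"]
results = most_similar(query, index, k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

A vector database replaces the linear scan with an approximate nearest-neighbor index and handles storage, but the shape of the query (one vector in, ranked neighbors out) stays the same.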