GitHub - solrqueue/solr-kafka: Solr nodes pull updates directly from a queue instead of receiving via push

Goal

Instead of pushing updates to Solr, and forwarding them on to the replicas, send the updates in Kafka, and have each replica (including the "leader") pull the updates from the queue.

This is currently a proof-of-concept plugin for SolrCloud 7.3+ that shows the feasibility of the idea, and to encourage deeper upstream support.

Benefits

All replicas independently tail the log for updates. No coordination needed in Solr.
Automatic update batching on the consumer side for efficiency.
Configurable QoS for updates vs queries by sizing the consumer pool. Currently they compete for the same resources
Leadership isn't important, any replica can process an update by simply sending it to the queue. There is no need to elect a leader. Ideally:
- No LIR
- No PeerSync
- No Overseer leadership state processing
- No "ClusterState says we are the leader, but locally we don't think so"

The project doesn't currently prevent change the leader election, but it does avoid leadership checking during update processing.

Requirements and Constraints

Solr 7.3+ (requires scheduled trigger)
Collections use the compositeId router, no implicit sharding
Does not support SPLITSHARD (yet)
Atomic updates, if failing due to version constraints, will fail asynchronously in the consumer, not synchronously fail (409) in the http /update request. Therefore the client making the update request will not be informed.

Supported Features

update / atomic update
delete by id / query / query with route
bypass kafka on a per-update basis (for phasing in/out)
configurable replication factor for created topics

Document flow

Demo

# downloads a local copy of:
#  solr 7.5, zookeeper 3.5.3, kafka-2.0
# builds the plugin, installs with deps under solr server WEB-INF
# alters the solrconfig to include the UpdateProcessor
# starts a 1-node zk cluster
# starts a 2-node kafka cluster under /kafka zk chroot
# starts a 2-node solr cluster

make start

# installs the scheduled overseer trigger.  this
# watches for added/removed collections and creates
# the corresponding kafka topics with a partition per shard
make solr-trigger

Adding the plugin manually

Step 0: Build It, add libs to your Solr install

make build
cp target/solr_kafka-7.5.0.jar \
  /path/to/kafka-clients-2.0.0.jar \ # kafka client
  /path/to/lz4-java-1.4.1.jar \      # kafka client dep
  /path/to/snappy-java-1.1.7.1.jar \ # kafka client dep
  $SOLR_HOME/server/solr-webapp/webapp/WEB-INF/lib

Step 1: Add the UpdateProcessor to solrconfig.xml

Add this line right above solr.DistributedUpdateProcessorFactory:

<processor class="com.github.solrqueue.solrkafka.KafkaUpdateProcessorFactory">

  <!-- where to find your kafka servers -->
  <str name="bootstrap.servers">localhost:9093,localhost:9094</str>

  <!-- name of field on each solr doc to store kafka offset, must be a docValues plong -->
  <str name="field">_offset_</str>

</processor>

Step 2: Add the KafkaTopicSync trigger

You don't want to have producers dynamically creating kafka topics, because you can't specify the correct number of partitions (shards) by default when a topic is dynamically created. It is best if there is just one entity in your cluster whose job is to sync the Solr's clusterstate with Kafka's topic + partition. The natural place for this is the overseer. The scheduled trigger can run can run the sync check every N seconds. N determines how quickly a topic is writable after collection creation.

curl -s "localhost:8983/solr/admin/autoscaling" \
  -HContent-Type:application/json \
  -d@trigger.json

Step 3: Fine tuning

You'll probably want to use a separate kafka cluster for this, since the automated trigger will be in charge of the topics and partitions in the cluster.

Broker Config

You should set auto.create.topics.enable=false so that producers reject updates before the topic as been properly created.

Trigger Config

You should configure replication.factor in your trigger.json file to set your desired kafka replication factor (not related to number of Solr replicas).

KafkaUpdateProcessorFactory Config

Set producer.linger.ms to something greater than 0 ms (default) to send updates to kafka in batch. Note that this slows the /update HTTP response to wait for other requests.
Configure timeouts like producer.request.timeout.ms or producer.max.block.ms to define how Solr's update handler responds when it cannot send updates to kafka. By default it responds in 30 seconds with a 500 if it cannot reach kafka.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
build		build
docs		docs
src		src
.gitignore		.gitignore
.maintainers_guide.md		.maintainers_guide.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pom.xml		pom.xml
trigger.json		trigger.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Goal

Requirements and Constraints

Supported Features

Document flow

Demo

Adding the plugin manually

Step 0: Build It, add libs to your Solr install

Step 1: Add the UpdateProcessor to solrconfig.xml

Step 2: Add the KafkaTopicSync trigger

Step 3: Fine tuning

About

Releases

Packages

Languages

License

solrqueue/solr-kafka

Folders and files

Latest commit

History

Repository files navigation

Goal

Requirements and Constraints

Supported Features

Document flow

Demo

Adding the plugin manually

Step 0: Build It, add libs to your Solr install

Step 1: Add the UpdateProcessor to solrconfig.xml

Step 2: Add the KafkaTopicSync trigger

Step 3: Fine tuning

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages