samza

Mirror of Apache Samza


Statistics on samza

Number of watchers on GitHub: 406
Number of open issues: 18
Main language: Scala
Open pull requests: 67+
Closed pull requests: 189+
Last commit: 4 months ago
Repo created: over 3 years ago
Repo last updated: 5 months ago
Size: 18.8 MB
Organization / Author: apache
Contributors: 46

What is Samza?

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

Samza's key features include:

  • Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based "process message" API comparable to MapReduce.
  • Managed state: Samza manages snapshotting and restoration of a stream processor's state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).
  • Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.
  • Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
  • Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
  • Processor isolation: Samza works with Apache YARN, which supports Hadoop's security model, and resource isolation through Linux CGroups.

Check out Hello Samza to try Samza. Read the Background page to learn more about Samza.

Building Samza

To build Samza from a git checkout, run:

./gradlew clean build

To build Samza from a source release, you must first download the Gradle wrapper script (gradlew) used above. This bootstrapping step requires Gradle to be installed on the build machine. Gradle is available through most package managers or directly from its website. To bootstrap the wrapper, run:

gradle -b bootstrap.gradle

After the bootstrap script has completed, the regular gradlew commands described in this document are available.

Scala and YARN

Samza builds with Scala 2.10 or 2.11, and with YARN 2.6.1 by default. Use the -PscalaVersion switch to select the Scala version:

./gradlew -PscalaVersion=2.11 clean build
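
To check a change against both supported Scala versions, the build command above can be wrapped in a loop. The following is a minimal sketch and not part of Samza: the function name build_for_scala_versions and the GRADLEW variable are assumptions introduced here for illustration.

```shell
# GRADLEW is an assumption of this sketch: it defaults to the wrapper in the
# current directory and can be overridden to point at another command.
GRADLEW=${GRADLEW:-./gradlew}

# build_for_scala_versions is a hypothetical helper, not a Samza script.
# It runs a clean build once per Scala version given as an argument and
# stops at the first build that fails.
build_for_scala_versions() {
  for version in "$@"; do
    "$GRADLEW" "-PscalaVersion=$version" clean build || return 1
  done
}
```

Run from the root of the checkout, `build_for_scala_versions 2.10 2.11` would build once with each version.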

Testing Samza

To run all tests:

./gradlew clean test

To run a single test:

./gradlew clean :samza-test:test -Dtest.single=TestStatefulTask

To run key-value performance tests:

./gradlew samza-shell:kvPerformanceTest -PconfigPath=file://$PWD/samza-test/src/main/config/perf/kv-perf.properties

To run all integration tests:

./bin/integration-tests.sh <dir>

To run checkstyle on the Java code:

./gradlew checkstyleMain checkstyleTest

Job Management

To run a job (defined in a properties file):

./gradlew samza-shell:runJob -PconfigPath=file:///path/to/job/config.properties
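
The -PconfigPath value is a file:// URI; the key-value performance test command above builds one from $PWD. A small helper can do the same for any path. This is only a sketch: config_uri is a hypothetical name, not a tool shipped with Samza.

```shell
# config_uri is a hypothetical helper (not shipped with Samza).
# It prints a file:// URI for a path; relative paths are resolved against $PWD.
# Special characters (for example spaces) are not percent-encoded.
config_uri() {
  case $1 in
    /*) printf 'file://%s\n' "$1" ;;
    *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;
  esac
}
```

With this in place, a job could be started with `./gradlew samza-shell:runJob -PconfigPath="$(config_uri path/to/job/config.properties)"`.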

To inspect a job's latest checkpoint:

./gradlew samza-shell:checkpointTool -PconfigPath=file:///path/to/job/config.properties

To modify a job's checkpoint (assumes that the job is not currently running), give it a file with the new offset for each partition, in the format systems.<system>.streams.<topic>.partitions.<partition>=<offset>:

./gradlew samza-shell:checkpointTool -PconfigPath=file:///path/to/job/config.properties \
    -PnewOffsets=file:///path/to/new/offsets.properties
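
The offsets file described above holds one line per partition. As a sketch, those lines could be generated with a small shell function. The name offset_lines is hypothetical, and the system name kafka, the topic my-topic, and the offsets are placeholder values, not taken from a real job.

```shell
# offset_lines is a hypothetical helper (not part of Samza).
# Usage: offset_lines <system> <topic> <partition>=<offset> ...
# It prints one line per partition in the format
# systems.<system>.streams.<topic>.partitions.<partition>=<offset>
offset_lines() {
  system=$1
  topic=$2
  shift 2
  for pair in "$@"; do
    partition=${pair%%=*}   # text before the first '='
    offset=${pair#*=}       # text after the first '='
    printf 'systems.%s.streams.%s.partitions.%s=%s\n' \
      "$system" "$topic" "$partition" "$offset"
  done
}

# Placeholder values for illustration only. Prints:
#   systems.kafka.streams.my-topic.partitions.0=100
#   systems.kafka.streams.my-topic.partitions.1=250
offset_lines kafka my-topic 0=100 1=250
```

Redirecting the output to a file gives something that can be passed to the checkpoint tool through -PnewOffsets.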

Developers

To get Eclipse projects, run:

./gradlew eclipse

For IntelliJ, run:

./gradlew idea

Contribution

To start contributing to Samza, please read Rules and Contributor Corner. Note that the Samza git repository does not support pull requests.

Apache Software Foundation

Apache Samza is a top level project of the Apache Software Foundation.

samza open pull requests
  • upTime should be field not method
  • Inability to create configured producer should be logged as error
  • Use Optional class from guava instead of elasticsearch
  • Add REST endpoint under /ws/v1/samza at tracking url
  • SAMZA-41 changes with doc updates
  • fix the config yarn.queue not take effect bug
  • delete it sorry
  • [doc] remove samza-serializers maven dependency
  • Fixed typo in run-in-multi-node-yarn.md
  • Typo fix in web-ui-rest-api.md
  • fix the bug containsValue method
  • Removed setting log4j-console.xml as default
  • SAMZA-1033: Remove import-control from checkstyle
  • Samza on Kafka 0.10
  • Implement initial session window operator API.
  • SAMZA-1055: Disable broken tests in SamzaRest
  • SAMZA-1054: Refactor Operator APIs
  • SAMZA-1047: testEndOfStreamWithOutOfOrderProcess is flaky
  • SAMZA-1077: Catch throwables in SamzaContainer
  • SAMZA-1074: Cannot build hello-samza on various CDH version
  • Specification of various Window and Trigger APIs in Samza
  • SAMZA-1206: Fix TestJMXServer.
  • SAMZA-871: Heart-beat mechanism between JobCoordinator and all running containers
  • SAMZA-1228 : StreamProcessor should stop JmxServer
  • SAMZA-1155: Validate users configure window.ms when using the fluent API
  • Disable flaky tests.
  • SAMZA-1258. Integration tests. Happy Path.
  • SAMZA-1251 - Remove DebounceTimer dependency from ZkLeaderElector & ZkController
  • SAMZA-1196: Fix TestJmxReporter
  • SAMZA-1232: Log configuration value in RunLoopFactory
  • Profiling State Performance
  • SAMZA-1105 upgrade Kafka client version to 0.10.2.0
  • SAMZA-1071: Implement aliases for input topics
  • SAMZA-859: Create a simple join example in hello-samza tutorail (docs)
  • Merge pull request #1 from apache/master
  • Extract hdfs docs into its own section
  • SAMZA-1234: Documentation for 0.13.0 release
  • DO NOT REVIEW YET. changed processor ids for different partiicipants
  • SAMZA-1305. SAMZA-1306. Unit test - zk unavailable.
  • SAMZA-1296 : Stand alone integration tests.
  • SAMZA-1221, SAMZA-1101: Internal cleanup for High-Level API implementation.
  • Disabled flaky test for TestExponentialSleepStrategy testThreadInterruptInOperationSleep
  • SAMZA-1185: Async commit documentation.
  • SAMZA-1555: Move creation of checkpoint and changelog streams to the Job Coordinators
  • SAMZA-1562: TaskStorageManager should delete any local store it canno…
  • SAMZA-1552: Host affinity improvements - Improve matching of hosts to allocated resources
  • SAMZA-1542 refactor config classes in samza-core java to hide non-necessary interfaces
  • SAMZA-1541 migrate config classes in samza-yarn to use composition over inherence
  • SAMZA-1508: JobRunner should not return success until the job is healthy
  • SAMZA-1478; Delete unneeded data from intermediate Kafka topic on offset commit
  • SAMZA-1459: Upgrade to gradle 4.2
  • SAMZA-1372: Change Latch Interface to Lock Interface for Samza Standalone with ZK
  • SAMZA-1293: Enable partition expansion of input streams (SEP-5)
  • Support source types where the last part of the source is not the streamName
  • SAMZA-1572: Add fixed retries on failure in KafkaCheckpointManager.
  • SAMZA-1498: Support arbitrary system clock timer in operators
  • SAMZA-1568: Handle ZkInterruptedException in zkclient.close.
  • Misc. minor cleanup.
  • Initial implementation of remote table provider
  • SAMZA-1596: Staging directory name has to be formatted in config
  • Added rate limiter interface and embedded implementation
  • Add stream-table join support for samza sql
  • SAMZA-1460: StreamAppender does not explicitly create logging topic
  • SAMZA-1611 : BootstrappingChooser should use systemAdmin offsetComparator API to compare the offsets
  • SAMZA-1608 : Add hidden config to enable explicit stream creation in StreamAppender due to bug.
  • SAMZA-1584: Improve SamzaContainer shutdown sequence in StreamProcessor.
  • SD-1599: Improve the efficiency of the AsynRunLoop when some partitio…
samza questions on Stack Overflow
  • Samza build with gradle failing
  • Best practice Apache Kafka and Apache Samza
  • Does Samza work with ResourceManager in HA?
  • Apache Samza local storage - OrientDB / Neo4J graph instead of KV store
  • How to write my own job in samza
  • Samza/Flink/Spark: processing completion ("done" flag)
  • Does Samza create partitions automatically when sending messages?
  • Samza/Kafka Failed to Update Metadata
  • How to deploy & run Samza job on HDFS?
  • Apache Samza aggregation rules for missing expected events in rolling time-period
  • How to deploy Hello Samza app on ubuntu 15.04?
  • How can you create a partition on a Kafka topic using Samza?
  • How to implement something similar to Storm DRPC in Samza?
  • How to query Samza KeyValueStore by key prefix?
  • Where do Apache Samza and Apache Storm differ in their use cases?
  • Apache Samza does not run
  • Testing Samza with RocksDB application with SBT
  • Getting NullPointerException when using a S3 job file with Samza