spark

Mirror of Apache Spark

Rated 5.0 out of 5 (2 ratings)


Statistics on spark

Number of watchers on GitHub: 16437
Number of open issues: 462
Main language: Scala
Open pull requests: 743+
Closed pull requests: 1453+
Last commit: 8 months ago
Repo created: over 4 years ago
Repo last updated: 8 months ago
Size: 263 MB
Organization / Author: apache
Contributors: 370

What people are saying about spark
It's blazing fast.
Modern Big data framework

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
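
For a quick taste of the API, here is a minimal word count using the Scala RDD API (a sketch only; it assumes a SparkContext named sc, as provided by the interactive shells described below):

val counts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)                 // sum the counts per distinct word
counts.take(5).foreach(println)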

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the -T option with Maven; see Parallel builds in Maven 3. More detailed documentation is available from the project site, at Building Spark.
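
For instance, the following should build with four threads (a sketch; adjust the thread count to your machine):

build/mvn -T 4 -DskipTests clean package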

For general development tips, including info on developing Spark using an IDE, see Useful Developer Tools.

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()
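
As a follow-up sketch, counting only the even numbers in the same range should return 500:

scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()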

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
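
As a further sketch, SparkPi takes the number of partitions as an optional parameter, so the following should run the Pi example locally on four threads with 100 partitions:

MASTER=local[4] ./bin/run-example SparkPi 100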

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.
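
As a sketch (the module and suite names here are illustrative), individual suites can typically be run through sbt's testOnly task:

build/sbt "core/testOnly *DAGSchedulerSuite"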

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at Specifying the Hadoop Version for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
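
As an illustration (profile names and version numbers vary by release; treat these as placeholders), a build against a specific Hadoop version looks like:

build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package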

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
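
As a brief sketch, most settings can be placed in conf/spark-defaults.conf (the values below are illustrative, not recommendations):

# conf/spark-defaults.conf
spark.master            spark://host:7077
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer

The same properties can also be passed per application, e.g. ./bin/spark-shell --conf spark.executor.memory=4g.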

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.

spark open pull requests
  • [SPARK-13367] [Streaming] Refactor KinesusUtils to specify more KCL options
  • [SPARK-13366] Support Cartesian join for Datasets
  • [SPARK-13242] [SQL] Fall back to interpreting complex `when` expressions
  • SPARK-9926: Parallelize partition logic in UnionRDD.
  • [SPARK-13328][Core]: Poor read performance for broadcast variables with dynamic resource allocation
  • [SPARK-10759][ML] update cross validator with include_example
  • [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…
  • [SPARK-13359][ML] ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6
  • [SPARK-13361][SQL] Add benchmark codes for Encoder#compress() in CompressionSchemeBenchmark
  • [SPARK-13249][SQL] Add Filter checking nullability of keys for inner join
  • [SPARK-13333] [SQL] Added Rand and Randn Functions Generating Deterministic Results
  • [SPARK-13358][SQL] Retrieve grep path when do benchmark
  • [SPARK-13220][Core]deprecate yarn-client and yarn-cluster mode
  • [SPARK-13356][Streaming]WebUI missing input informations when recovering from dirver failure
  • [SPARK-13355] [MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply
  • [SPARK-13351] [SQL] fix column pruning on Expand
  • [SPARK-13283][SQL] Escape column names based on JdbcDialect
  • [SPARK-12154] Upgrade to Jersey 2
  • [SPARK-13242] [SQL] Generate one method per `when` clause
  • [SPARK-13327][SPARKR] Added parameter validations for colnames<-
  • [MINOR][MLLIB] Public visibility for eval metric's dataframe constructor
  • [SPARK-13340][ML] PolynomialExpansion and Normalizer should validate input type
  • [SPARK-13339] [DOCS] Clarify commutative / associative operator requirements for reduce, fold
  • [SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
  • [SPARK-10969] [Streaming] [Kinesis] Allow specifying separate credentials for Kinesis and DynamoDB
  • [SPARK-13334] [ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent
  • [WIP][SPARK-13332][SQL] Decimal datatype support for SQL pow
  • [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to executor
  • [SPARK-13329] [SQL] considering output for statistics of logical plan
  • [SPARK-13325][SQL] Create a 64-bit hashcode expression [POC]
  • [SPARK-13722] [SQL] No Push Down for Non-deterministics Predicates through Generate
  • improve the doc for "spark.memory.offHeap.size"
  • [SPARK-13715] [MLLIB] Remove last usages of jblas in tests
  • [SPARK-13717][Core] Let RandomSampler can sample with Java iterator
  • [SPARK-13655] [STREAMING] [TESTS] Fix WithAggregationKinesisBackedBlockRDDSuite
  • [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 [WIP]
  • [SPARK-13714][GraphX] Another ConnectedComponents based on Max-Degree Propagation
  • [SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions
  • [SPARK-13712] [ML] Add OneVsOne to ML
  • [SPARK-13600] [MLlib] [WIP] Incorrect number of buckets in QuantileDiscretizer
  • [SPARK-13034] Add export/import for all estimators and transformers(w…
  • [SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.
  • [SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource.
  • [SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm
  • [SPARK-13706] [ML] Add Python Example for Train Validation Split
  • [SPARK-13566] Avoid deadlock between BlockManager and Executor Thread
  • [SPARK-13396] Stop using our internal deprecated .metrics on Exceptio…
  • [SPARK-10380] Confusing examples in pyspark SQL docs
  • [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
  • [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate
  • [SPARK-13629] [ML] Add binary toggle Param to CountVectorizer
  • [SPARK-13696][WIP] Remove BlockStore class & simplify interfaces of mem. & disk stores
  • [SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory after spills
  • [SPARK-13694][SQL] QueryPlan.expressions should always include all expressions
  • [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects
  • [SPARK-13689] [SQL] Move helper things in CatalystQl to new utils object
  • SPARK-13688: Add spark.dynamicAllocation.overrideNumInstances.
  • [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
  • FileInputDStream should read old files when newFilesOnly is set to false
  • [SPARK-13663] Upgrade Snappy Java to 1.1.2.1
  • [SPARK-13842][PYSPARK] pyspark.sql.types.StructType accessor enhancements
  • [SPARK-14477][BUILD] Allow custom mirrors for downloading artifacts in build/mvn
  • [SPARK-14475] Propagate user-defined context from driver to executors
  • [SPARK-14474][SQL]Move FileSource offset log into checkpointLocation
  • [WIP, DO-NOT-MERGE][SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming
  • [SPARK-14470] Allow for overriding both httpclient and httpcore versions
  • [SPARK-14467][SQL] Interleave CPU and IO better in FileScanRDD.
  • [SPARK-14465][BUILD] Checkstyle should check all Java files
  • [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom
  • [SPARK-14437][Core]Use the address that NettyBlockTransferService listens to create BlockManagerId
  • [SPARK-14459] [SQL] Detect relation partitioning and adjust the logical plan
  • [SPARK-14373] [PySpark] PySpark RandomForestClassifier, Regressor support export/import
  • [SPARK-14455][Streaming] Fix NPE in allocatedExecutors when calling in receiver-less scenario
  • [SPARK-14454] Better exception handling while marking tasks as failed
  • [SPARK-13687][PYTHON] Cleanup PySpark parallelize temporary files
  • [SPARK-14451][SQL] Move encoder definition into Aggregator interface
  • [SPARK-13783] [ML] Model export/import for spark.ml: GBTs
  • [SPARK-14357] [CORE] Properly handle the root cause being a commit denied exception
  • [SPARK-14103][SQL] Parse unescaped quotes in CSV data source.
  • [SPARK-14448] Improvements to ColumnVector
  • [WIP][SPARK-14447] Experiments: AggregateHashMap for aggregates with long keys
  • [SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS
  • [SPARK-14132][SPARK-14133][SQL] Alter table partition DDLs
  • [WIP][SPARK-14408][CORE] Changed RDD.treeAggregate to use fold instead of reduce
  • [SPARK-14440][PySpark] Remove pipeline specific reader and writer
  • [SPARK-14435][BUILD] Shade Kryo in our custom Hive 1.2.1 fork
  • SPARK-14421 : Upgrade Kinesis Client Library (KCL) to 1.6.2, fixes support for de-aggregation.
  • [SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns
  • [SPARK-14427][SQL] Support persisting partitioned data source relations in Hive compatible format
  • [SPARK-14423][YARN] Avoid same name files added to distributed cache again
  • [SPARK-15183][Streaming] Adding outputMode to structure Streaming Experimental Api
  • [SPARK-15182] [ML] Copy MLlib doc to ML: ml.feature.tf, idf
  • [SPARK-15180][SQL] Support subexpression elimination in Fliter
  • [Docs] Added Scaladoc for countApprox and countByValueApprox parameters
  • [SPARK-15122][SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out
  • [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments
  • [SPARK-15112][SQL] Allows query plan schema and encoder schema of a Dataset have different column order
  • [SPARK-15176][Core] Add maxShares setting to Pools
  • [SPARK-15173][SQL] DataFrameWriter.insertInto should work with datasource table stored in hive
  • [SPARK-15172][ML] Improve LogisticRegression warning message
  • [SPARK-14476][SQL][WIP] Improve the physical plan visualization by adding meta info like table name and file path for data source.
  • [SPARK-15085][Streaming][Kafka] Rename streaming-kafka artifact
  • [SPARK-15171][SQL][WIP]Deprecate registerTempTable and add dataset.createTempView
  • [SPARK-15074][Shuffle] Cache shuffle index file to speedup shuffle fetch
  • [SPARK-15168][PySpark][ML] Add missing params to MultilayerPerceptronClassifier
  • [SPARK-15167][SQL] Expose catalog implementation in SparkSession
  • [SPARK-15166][SQL] Move some hive-specific code from SparkSession
  • [DOC][MINOR] Fixed minor errors in feature.ml user guide doc
  • [SPARK-15165][SQL] Codegen can break because toCommentSafeString is not actually safe
  • [SPARK-15162][SPARK-15164][PySpark][DOCS][ML] update some pydocs
  • [SPARK-15092][PySpark][ML] Add toDebugString to DecisionTreeModel
  • [SPARK-15080][CORE] Break copyAndReset into copy and reset
  • [SPARK-15160][SQL] support data source table in InMemoryCatalog
  • [SPARK-14127][SQL] Makes 'DESC [EXTENDED|FORMATTED] <table>' support data source tables
  • [Spark-15155][Mesos] Optionally ignore default role resources
  • [SPARK-14261][SQL] Memory leak in Spark Thrift Server
  • [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes error when label is numeric type
  • [SPARK-15150][EXAMPLE][DOC] Add python example for LDA
  • [SPARK-15094][SPARK-14803][SQL] Add ObjectProject for EliminateSerialization
  • [SPARK-15149][EXAMPLE] include python example for kmeans
  • [SPARK-15247][SQL] Set the default number of partitions for reading parquet schemas
  • [SPARK-15350][mllib]add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
  • [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession
  • [SPARK-15342][SQL][PySpark] PySpark test for non ascii column name does not actually test with unicode column name
  • [SPARK-15346] [MLlib] Reduce duplicate computation in picking initial points
  • [SPARK-15341] [Doc] [ML] Add documentation for "model.write" to clarify "summary" was not saved
  • [SPARK-15340][SQL]Limit the size of the map used to cache JobConfs to void OOM
  • [SPARK-15339] [ML] ML 2.0 QA: Scala APIs and code audit for regression
  • [SPARK-15337] [SPARK-15338] [SQL] Unable to Make Run-time Changes on Hive-related Conf
  • [SPARK-15334][SQL] HiveClient facade not compatible with Hive 0.12
  • [SPARK-14603] [SQL] [FOLLOWUP] Verification of Metadata Operations by Session Catalog
  • [MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.
  • [SPARK-15333] [DOCS] Reorganize building-spark.md; rationalize vs wiki
  • [Core] Remove unnecessary calculation of stage's parents
  • [SPARK-15331] [SQL] Disallow All the Unsupported CLI Commands
  • [SPARK-15330] [SQL] Implement Reset Command
  • [SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI when creating external non-hive compatible table
  • [SPARK-15328][MLLIB][ML] Word2Vec import for original binary format
  • [SPARK-15324] [SQL] Add the takeSample function to the Dataset
  • [SPARK-12492] Using spark-sql commond to run query, write the event of SparkListenerJobStart
  • Branch 1.4
  • [SPARK-15322][mllib][core][sql]update deprecate accumulator usage into accumulatorV2 in spark project
  • [SPARK-15320] [SQL] Spark-SQL Cli Ignores Parameter hive.metastore.warehouse.dir
  • [SPARK-15318][ML][Example]:spark.ml Collaborative Filtering example does not work in spark-shell
  • [SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions
  • [SPARK-15321] Fix bug where Array[Timestamp] cannot be encoded/decoded correctly
  • [SPARK-13850] Force the sorter to Spill when number of elements in th…
  • [SPARK-15316][PySpark][ML] Add linkPredictionCol to GeneralizedLinearRegression
  • [SPARK-15315][SQL] Adding error check to the CSV datasource writer for unsupported complex data types.
  • [SPARK-15323] Fix reading of partitioned format=text datasets
  • [SPARK-15663] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
  • [SPARK-15670][Java API][Spark Core]label_accumulator_deprecate_in_java_spark_context
  • [SPARK-15668] [ML] ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
  • [SPARK-15587] [ML] ML 2.0 QA: Scala APIs audit for ml.feature
  • [SPARK-15667][SQL]Throw exception if columns number of outputs mismatch the inputs
  • [SPARK-15664] [MLlib] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
  • [SPARK-15665] [CORE] spark-submit --kill and --status are not working
  • [SPARK-15662][SQL] Add since annotation for classes in sql.catalog
  • [SPARK-15659][SQL] Ensure FileSystem is gotten from path
  • [SPARK-15660][CORE] RDD and Dataset should show the consistent values for variance/stdev.
  • [SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType
  • [SPARK-15657][SQL] RowEncoder should validate the data type of input object
  • [SPARK-15655] [SQL] Fix Wrong Partition Column Order when Fetching Partitioned Tables
  • [SPARK-15620][SQL] Fix transformed dataset attributes revolve failure
  • [SPARK-14507] [SQL] EXTERNAL keyword in a CTAS statement is not allowed
  • [SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLib changes
  • [SPARK-14615][ML][FOLLOWUP] Fix Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms
  • [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
  • [SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown
  • [SPARK-15646] [SQL] When spark.sql.hive.convertCTAS is true, the conversion rule needs to respect TEXTFILE/SEQUENCEFILE format and the user-defined location
  • [CORE][MINOR][DOC] Removing incorrect scaladoc
  • [CORE][DOC][MINOR] typos + links
  • [SPARK-5581][Core] When writing sorted map output file, avoid open / …
  • [SPARK-15608][ml][doc] add_isotonic_regression_doc
  • [SPARK-15644] [MLlib] [SQL] Replace SQLContext with SparkSession in MLlib
  • [SPARK-12431][GraphX] Add local checkpointing to GraphX.
  • [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib migration guide from 1.6 to 2.0
  • [Minor][ML] Enable java save/load tests for ML
  • [SPARK-15100][ML][Doc] doc update for stopwords and binarizer
  • [SPARK-13638][SQL] Add quoteAll option to CSV DataFrameWriter
  • [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
  • [SPARK-15963][CORE] Catch `TaskKilledException` correctly in Executor.TaskRunner
  • [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR
  • [SPARK-15518][Core][Follow-up] Rename LocalSchedulerBackendEndpoint -> LocalSchedulerBackend
  • [SPARK-15888] [SQL] fix Python UDF with aggregate
  • [SPARK-15962][SQL] Introduce additonal implementation with a dense format for UnsafeArrayData
  • [SPARK-15959] [SQL] Add the support of hive.metastore.warehouse.dir back
  • [SPARK-15824][SQL] Execute WITH .... INSERT ... statements immediately
  • [SPARK 15926] Improve readability of DAGScheduler stage creation methods
  • [SPARK-15956] [SQL] When unwrapping ORC avoid pattern matching at runtime
  • [SPARK-15957] [ML] RFormula supports forcing to index label
  • [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions.
  • [WIP][SPARK-15953][SQL][STREAMING] Renamed ContinuousQuery to StreamingQuery
  • [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None
  • [SPARK-15951] Change Executors Page to use datatables to support sorting columns and searching
  • [SPARK-15934] [SQL] Return binary mode in ThriftServer
  • [SPARK-15885] [Web UI] Provide links to executor logs from stage details page in UI
  • [SPARK-15950][SQL] Eliminate unreachable code at projection for complex types
  • [SPARK-15942][REPL] Unblock `:reset` command in REPL.
  • [SPARK-15672][R][DOC] R programming guide update
  • [SPARK-15937] [yarn] Improving the logic to wait for an initialised Spark Context
  • [SPARK-15939][ML][PySpark] Clarify ml.linalg usage
  • [SPARK-15938]Adding "support" property to MLlib Association Rule
  • [SPARK-15868] [Web UI] Executors table in Executors tab should sort Executor IDs in numerical order
  • [SPARK-15613] [SQL] Fix incorrect days to millis conversion
  • [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type
  • [SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions
  • [SPARK-15784][ML][WIP]:Add Power Iteration Clustering to spark.ml
  • [SPARK-15922][MLLIB] `toIndexedRowMatrix` should consider the case `cols < offset+colsPerBlock`
  • [SPARK-13015][MLlib][DOC] Replace example code in mllib-data-types.md using include_example
  • [SPARK-15954][SQL] Disable loading test tables in Python tests
  • [SPARK-16285][SQL] Implement sentences SQL functions
  • [SPARK-16335][SQL] Structured streaming should fail if source directory does not exist
  • [SPARK-16331] [SQL] Reduce code generation time
  • [SPARK-16281][SQL] Implement parse_url SQL function
  • [SPARK-12177][Streaming][Kafka] limit api surface area
  • [SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and 'fromML' conversion methods to PySpark linalg
  • [SPARK-16144][SPARKR] update R API doc for mllib
  • [SPARK-16318][SQL] Implement various xpath functions
  • [SPARK-16287][SQL][WIP] Implement str_to_map SQL function
  • [SPARK-16311][SQL] Improve metadata refresh
  • [SPARK-16101][SQL][WIP] Refactoring CSV data source to be consistent with JSON data source
  • Upgrade to Avro 1.8.1
  • [SPARK-16310][SPARKR] R na.string-like default for csv source
  • [SPARK-16021] Fill freed memory in test to help catch correctness bugs
  • [SPARK-16304] LinkageError should not crash Spark executor
  • [SPARK-16307] [ML] Add test to verify the predicted variances of a DT on toy data
  • [SPARK-16198] [MLlib] [ML] Change access level of prediction functions to public
  • [SPARK-SPARK-16302] [SQL] Set the right number of partitions for reading data from a local collection.
  • [SPARK-16288][SQL] Implement inline table generating function
  • [SPARK-16299][SPARKR] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory.
  • [SPARK-16296][SQL] add null check for key when create map data in encoder
  • [SPARK-16182] [CORE] Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails
  • [SPARK-16284][SQL] Implement reflect SQL function
  • [SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions
  • [SPARK-16095][YARN] Yarn cluster mode should report correct state to SparkLauncher
  • [SPARK-16269][SQL] Support null handling for vectorized hashmap during hash aggregate
  • [SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest)
  • [SPARK-16114] [SQL] structured streaming event time window example
  • Ensure broadcasted variables are destroyed even in case of exception
  • [SPARK-16283][SQL] Implement `percentile_approx` SQL function
  • [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
  • [SPARK-16639][SQL] The query with having condition that contains grouping by column should work
  • [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in Last
  • [SPARK-16646][SQL] LEAST and GREATEST doesn't accept numeric arguments with different data types
  • [GIT] add pydev & Rstudio project file to gitignore list
  • [SPARK-14131][SQL[STREAMING] Improved fix for avoiding potential deadlocks in HDFSMetadataLog
  • [SPARK-16658][GRAPHX] Add EdgePartition.withVertexAttributes
  • [SPARK-16657] [SQL] Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
  • [SPARK-16656] [SQL] Try to make CreateTableAsSelectSuite more stable
  • [SPARK-16651][PYSPARK][DOC] Make `withColumnRenamed/drop` description more consistent with Scala API
  • [SPARK-16650] Improve documentation of spark.task.maxFailures
  • [SPARK-16653][ML][Optimizer] update ANN convergence tolerance param default to 1e-6
  • [SPARK-16649][SQL] Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
  • [SPARK-16633] [SPARK-16642] Fixes three issues related to lead and lag functions
  • [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties
  • [SPARK-16628][SQL] Don't convert Orc Metastore tables to datasource tables if metastore schema does not match schema stored in the files
  • [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/Windows...
  • [SPARK-16216][SQL] Write Timestamp and Date in ISO 8601 formatted string by default for CSV and JSON
  • [SPARK-16640][SQL] Add codegen for Elt function
  • [SPARK-16637] Unified containerizer
  • [SPARK-9140] [ML] Replace TimeTracker by MultiStopwatch
  • [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics
  • [SPARK-15703] [Scheduler][Core][WebUI] Make ListenerBus event queue size configurable
  • [SPARK-16526][SQL] Benchmarking Performance for Fast HashMap Implementations and Set Knobs
  • [PySpark] add picklable SparseMatrix in pyspark.ml.common
  • [SPARK-11976][SPARKR] Support "." character in DataFrame column name
  • [SPARK-16626][PYTHON][MLLIB] Code duplication after SPARK-14906
  • [SPARK-14974][SQL]delete temporary folder after insert hive table
  • Branch 2.0
  • [SPARK-16409] [SQL] regexp_extract with optional groups causes NPE
  • [#SPARK-16911] Fix the links in the programming guide
  • [SPARK-16909][Spark Core] - Streaming for postgreSQL JDBC driver
  • [SPARK-16906][SQL] Adds auxiliary info like input class and input schema in TypedAggregateExpression
  • [SPARK-] SQL DDL: MSCK REPAIR TABLE
  • [SPARK-16904] [SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry [WIP]
  • [SPARK-16901] Hive settings in hive-site.xml may be overridden by Hive's default values
  • [SPARK-16772] [Python] [Docs] Fix API doc references to UDFRegistration + Update "important classes"
  • [SPARK-16898][SQL] Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
  • [SPARK-16887] Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
  • [SPARK-16886] [EXAMPLES][SQL] structured streaming network word count examples …
  • [MINOR][SparkR] R API documentation for "coltypes" is confusing
  • [SPARK-16826][SQL] Switch to java.net.URI for parse_url()
  • [SPARK-16796][Web UI] Mask spark.authenticate.secret on Spark environ…
  • [WIP][SPARK-16844][SQL] Generate code for sort based aggregation
  • [SPARK-16870][docs]Summary:add "spark.sql.broadcastTimeout" into docs/sql-programming-gu…
  • [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`
  • [SPARK-16495] [MLlib]Add ADMM optimizer in mllib package
  • [SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests
  • [SPARK-14387][SQL] Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
  • [SPARK-16700] [PYSPARK] [SQL] create DataFrame from dict/Row with schema
  • [SPARK-16671][core][sql] Consolidate code to do variable substitution.
  • [SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of Accumulator V2
  • [minor] Update AccumulatorV2 doc to not mention "+=".
  • [SPARK-16856] [WEBUI] [CORE] Link application summary page and detail page to the master page
  • [Follow-up] [ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests.
  • [Minor] [ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark
  • [SPARK-16849][SQL] Improve subquery execution by deduplicating the subqueries with the same results
  • [SPARK-16848][SQL] Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
  • [SPARK-17086][ML] Fix an issue in QuantileDiscretizer
  • [SPARK-17180] [SQL] Fix View Resolution Order in ALTER VIEW AS SELECT
  • [SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV
  • [SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf
  • [SparkR][Minor] Fix Cache Folder Path in Windows
  • [SPARK-17177][SQL] Make grouping columns accessible from `RelationalGroupedDataset`
  • [SPARK-6832][SPARKR][WIP]Handle partial reads in SparkR
  • [SPARK-17176][WEB UI]set default task sort column to "Status"
  • [SPARK-17090][MINOR][ML]Add expert param support to SharedParamsCodeGen
  • [SPARK-17171][WEB UI] DAG will list all partitions in the graph
  • [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reformat, fix deprecation in test
  • [SPARK-16508][SPARKR] doc updates and more CRAN check fixes
  • [SPARK-17170] [SQL] InMemoryTableScanExec driver-side partition pruning
  • [SPARK-16320] [DOC] Document G1 heap region's effect on spark 2.0 vs 1.6
  • [SPARK-17159] [streaming]: optimise check for new files in FileInputDStream
  • [SPARK-17167] [SQL] Issue Exceptions when Analyze Table on In-Memory Cataloged Tables
  • [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
  • [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
  • [SPARK-17162] Range does not support SQL generation
  • [SQL][WIP][Test] Supports object-based aggregation function which can store arbitrary objects in aggregation buffer.
  • [SPARK-13286] [SQL] add the next expression of SQLException as cause
  • SPARK-12868: Allow Add jar to add jars from hdfs/s3n urls.
  • [SPARK-17154][SQL] Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
  • [SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade
  • paged jdbcRDD for like mysql limit start,pageSize
  • [WIP] [SPARK-17072] [SQL] support table-level statistics generation and storing into/loading from metastore
  • [SPARK-16822] [DOC] [Support latex in scaladoc with MathJax]
  • [SPARK-16533][CORE] resolve deadlocking in driver when executors die
  • [SPARK-17144] [SQL] Removal of useless CreateHiveTableAsSelectLogicalPlan
  • [SPARK-17427][SQL] function SIZE should return -1 when parameter is null
  • [SPARK-17426][SQL] Refactor `TreeNode.toJSON` to avoid OOM when converting unknown fields to JSON
  • [MINOR][SQL] Fixing the typo in unit test
  • [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table
  • [WIP] [SPARK-17421] Don't use -XX:MaxPermSize option when Java version >= 8
  • [SPARK-17396][core] Share the task support between UnionRDD instances.
  • [SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0]
  • [SPARK-17418] Remove Kinesis artifacts from Spark release scripts
  • [SPARK-17317][SparkR] Add SparkR vignette
  • [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs
  • [Test Only][not ready for review][SPARK-6235][CORE]Address various 2G limits
  • [SPARK-17306] [SQL] QuantileSummaries doesn't compress
  • Correct fetchsize property name in docs
  • [Trivial][ML] Remove unnecessary `new` before case class construction
  • [SPARK-17410] [SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
  • [SPARK-17406][WEB UI] limit timeline executor events
  • [SPARK-17369][SQL][2.0] MetastoreRelation toJSON throws AssertException due to missing otherCopyArgs
  • [SPARK-16992][PYSPARK] Reenable Pylint
  • [SPARK-17402][SQL] separate the management of temp views and metastore tables/views in SessionCatalog
  • [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final for bug fixes
  • [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
  • [SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf
  • [SPARK-4502][SQL]Support parquet nested struct pruning and add relevant test
  • [SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2
  • [SPARK-17390][ML][MLLib] Optimize MultivariantOnlineSummerizer by making the summarized target configurable
  • [SPARK-17057] [ML] ProbabilisticClassifierModels' prediction more reasonable with multi zero thresholds
  • [SPARK-17388][SQL] Support for inferring type date/timestamp/decimal for partition column
  • [SPARK-17386] Set default trigger interval to 1/10 second
  • [SPARK-17383][GRAPHX] Improvement LabelPropagaton, and reduce label shake and disconnection of communities
  • [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation.
  • [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates
  • [SPARK-17864][SQL] Mark data type APIs as stable (not DeveloperApi)
  • [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModificationException issue in BlockStatusesAccumulator
  • [SPARK-17338][SQL][follow-up] add global temp view
  • [SPARK-17860][SQL] SHOW COLUMN's database conflict check should respect case sensitivity configuration
  • [SPARK-17850][Core]HadoopRDD should not catch EOFException
  • [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns
  • [SPARK-17855][CORE] Remove query string from jar url
  • [SPARK-17851][SQL][MINOR][TESTS] Update invalid test sql in `ColumnPruningSuite`
  • [SPARK-17849] [SQL] Fix NPE problem when using grouping sets
  • [SPARK-14501][ML] spark.ml API for FPGrowth
  • [SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit
  • [SPARK-17847] [ML] Copy GaussianMixture implementation from mllib to ml
  • Updated master url
  • [SPARK-17843][Web UI] Indicate event logs pending for processing on history server UI
  • [Spark-14761][SQL] Reject invalid join methods when join columns are not specified in PySpark DataFrame join.
  • [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer
  • [SPARK-17841][STREAMING][KAFKA] drain commitQueue
  • [Spark-17745][ml][PySpark] update NB python api - add weight col parameter
  • [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
  • Branch 2.0
  • [SPARK-17835][ML][MLlib] Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
  • [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice
  • [SPARK-11272] [Web UI] Add support for downloading event logs from HistoryServer UI
  • [SPARK-17819][SQL] Support default database in connection URIs for Spark Thrift Server
  • [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.
  • [SPARK-17834][SQL]Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
  • [SPARK-14804][Spark][Graphx] Fix checkpointing of VertexRDD/EdgeRDD
  • [SPARK-17749][ML] One pass solver for Weighted Least Squares with ElasticNet
  • [SPARK-17817][PySpark] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
  • [SPARK-18111] [SQL] Wrong ApproximatePercentile answer when multiple records have the minimum value
  • [SPARK-18106][SQL] ANALYZE TABLE should raise a ParseException for invalid option
  • [Spark-Core]add defensive check for zipWithIndex
  • [SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest regression and classification
  • [SPARK-18000] [SQL] Aggregation function for computing endpoints for histograms
  • [SPARK-18109][ML] Add instrumentation to GMM
  • Branch 1.6
  • [SPARK-18103] [SQL] Rename *FileCatalog to *FileProvider
  • [SPARK-18087] [SQL] [WIP] Optimize insert to not require REPAIR TABLE
  • [SPARK-18105] fix buffer overflow in LZ4
  • [Spark-Core][Test][Minor]: Fix the wrong comment in test
  • [SPARK-18104][DOC]Don't build KafkaSource doc
  • [SQL][DOC] updating doc for JSON source to link to jsonlines.org
  • [SPARK-17471][ML] Add compressed method to ML matrices
  • [SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives
  • SPARK-17829 [SQL] Stable format for offset log
  • [SPARK-18094][SQL][TESTS] Move group analytics test cases from `SQLQuerySuite` into a query file test.
  • [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work rega…
  • [SPARK-18092][ML] Fix column prediction type error
  • [SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares.
  • [SPARK-18091] [SQL] Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit
  • [SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests
  • [SPARK-16827][Shuffle] add disk spill bytes to UnsafeExternalSorter
  • [SPARK-18079] [SQL] CollectLimitExec.executeToIterator should perform per-partition limits
  • [SPARK-18078] Add option for customize zipPartition task preferred locations
  • [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US
  • [SPARK-17838][SparkR] Check named arguments for options and use formatted R friendly message from JVM exception message
  • [SPARK-16137][SPARKR] randomForest for R
  • [WIP] [SPARK-18067] [SQL] SortMergeJoin adds shuffle if join predicates have non partitioned columns
  • [SPARK-18066] [CORE] [TESTS] Add Pool usage policies test coverage for FIFO & FAIR Schedulers
  • [SPARK-18397][HIVE]cannot create table by using the hive default fileformat
  • [SPARK-18396]"Duration" column makes search result confused, maybe we should make it unsearchable
  • [SPARK-18395][SQL] Evaluate common subexpression like lazy variable with a function approach
  • [SPARK-18391]
  • [SPARK-17059][SQL] Allow FileFormat to specify partition pruning strategy via splits
  • [SPARK-18353][CORE] spark.rpc.askTimeout defalut value is not 120s
  • [SPARK-18385][ML] Make the transformer's natively in ml framework to avoid extra conversion
  • [SPARK-18375][SPARK-18383][BUILD][CORE]Upgrade netty to 4.0.42.Final
  • [SPARK-18379][SQL] Make the parallelism of parallelPartitionDiscovery configurable.
  • [WIP][SPARK-18187][SS] CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch
  • [SPARK-18187][STREAMING] CompactibleFileStreamLog should not use "compactInterval" direcly with user setting.
  • [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes
  • [SPARK-18377][SQL] warehouse path should be a static conf
  • [Minor][PySpark] Improve error message when running PySpark with different minor versions
  • [SPARK-13534][WIP][PySpark] Using Apache Arrow to increase performance of DataFrame.toPandas
  • [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work with Spark jobs
  • [SPARK-18372][SQL].Staging directory fail to be removed
  • [SPARK-14914][CORE]:Fix Resource not closed after using, for unit tests and example
  • [SPARK-18366][PYSPARK] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
  • [DOCS][SPARK-18365] Documentation is Switched on Sample Methods
  • [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE for Datasource tables
  • [SPARK-18362][SQL] Use TextFileFormat in JsonFileFormat and CSVFileFormat
  • [SPARK-18360][SQL] warehouse path config should work for data source tables
  • [SPARK-18361] [PySpark] Expose RDD localCheckpoint in PySpark
  • [SPARK-18268][ML][MLLib] ALS fail with better message if ratings is empty rdd
  • [SPARK-18345][STRUCTURED STREAMING] Structured Streaming quick examples fails with default configuration
  • [SPARK-18298][Web UI]change gmt time to local zone time in HistoryServer in applists
  • [SPARK-18337] Complete mode memory sinks should be able to recover from checkpoints
  • [SPARK-18334] MinHash should use binary hash distance
  • [SPARK-18262][BUILD][SQL][WIP] JSON.org license is now CatX
  • [SPARK-18427][DOC] Update docs of mllib.KMeans
  • [SPARK-17116][Pyspark] Allow parameters to be {string,value} dict at runtime
  • [SPARK-18425][Structured Streaming][Tests] Test `CompactibleFileStreamLog` directly
  • [YARN][DOC] Update Yarn configuration doc
  • [SPARK-18413][SQL] Control the number of JDBC connections by repartition with `numPartition` JDBCOption
  • [SPARK-18423][Streaming] ReceiverTracker should close checkpoint dir when stopped even if it was not started
  • [SPARK-18422][CORE] Fix wholeTextFiles test to pass on Windows in JavaAPISuite
  • [SPARK-18420][SPARK][BUILD]Fix the compile errors caused by checkstyle
  • [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
  • [SPARK-18382][WEBUI] "run at null:-1" in UI when no file/line info in call site info
  • [SPARK-18294][CORE] Implement commit protocol to support `mapred` package's committer
  • [SPARK-18416][Structured Streaming] Fixed temp file leak in state store
  • [SPARK-18417][YARN] Define 'spark.yarn.am.port' in yarn config object
  • [SPARK-18300][SQL] Do not apply foldable propagation with expand as a child.
  • [SPARK-16808][Core] History Server main page does not honor APPLICATION_WEB_PROXY_BASE
  • [SPARK-18415] [SQL] Weird Plan Output when CTE used in RunnableCommand
  • Spark-18187 [SQL] CompactibleFileStreamLog should not use "compactInterval" direcly with user setting.
  • [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data
  • [SPARK-18411] [SQL] Add Argument Types and Test Cases for String Functions [WIP]
  • [SPARK-18410][STREAMING] Add structured kafka example
  • [SPARK-9487] Use the same num. worker threads in Java/Scala unit tests
  • [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
  • [SPARK-18398][SQL] Fix nullabilities of MapObjects and optimize not to check null if lambda is not nullable.
  • [SPARK-18537][Web UI]Add a REST api to spark streaming
  • [WIP][SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
  • [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog`
  • [SPARK-18567][SQL][WIP] Simplify CreateDataSourceTableAsSelectCommand
  • [SPARK-18566][SQL] remove OverwriteOptions
  • [SPARK-18555][SQL][WIP]DataFrameNaFunctions.fill miss up original values in long integers
  • [SPARK-18560][CORE][STREAMING] Receiver data can not be deserialized properly.
  • [SPARK-17843][WEB UI] Indicate event logs pending for processing on h…
  • [SPARK-18559] [SQL] Fix HLL++ with small relative error
  • [SPARK-18515][SQL] AlterTableDropPartitions fails for non-string columns
  • [SPARK-18553][CORE][branch-2.0] Fix leak of TaskSetManager following executor loss
  • [SPARK-18551] [Web UI] [Core] [WIP] Add functionality to delete event logs from the History Server UI
  • [SPARK-18544] [SQL] Append with df.saveAsTable writes data to wrong location
  • [SPARK-18546][core] Fix merging shuffle spills when using encryption.
  • [SPARK-18547][core] Propagate I/O encryption key when executors register.
  • [SPARK-18528][SQL] Fix a bug to initialise an iterator of aggregation buffer
  • [SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type
  • [SPARK-18436][SQL] isin causing SQL syntax error with JDBC
  • [SPARK-18403][SQL] Fix unsafe data false sharing issue in ObjectHashAggregateExec
  • [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using DataFrameReader JDBC APIs
  • [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
  • [SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI
  • [SPARK-18134][SQL] Comparable MapTypes [POC]
  • [SPARK-18413][SQL][FOLLOW-UP] Use `numPartitions` instead of `maxConnections`
  • [SPARK-18356] [ML] Improve MLKmeans Performance
  • [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
  • [SPARK-18523][PySpark]Make SparkContext.stop more reliable
  • [SPARK-18521] Add `NoRedundantStringInterpolator` Scala rule
  • [SPARK-17932][SQL] Failed to run SQL "show table extended like table_name" in Spark2.0.0
  • [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel
  • [SPARK-19146][Core]Drop more elements when stageData.taskData.size > retainedTasks
  • [SPARK-18959] the new leader will lost the statistics of the driver's resource on the worker When the leader master has changed.
  • [SPARK-19110][MLLIB][FollowUP]: Add a unit test
  • [SPARK-19142][SparkR]:spark.kmeans should take seed, initSteps, and tol as parameters
  • [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly
  • [SPARK-19139][core] New auth mechanism for transport library.
  • [SPARK-19140][SS]Allow update mode for non-aggregation streaming queries
  • [BACKPORT][SPARK-18952] Regex strings not properly escaped in codegen for aggregations
  • [SPARK-18243][SQL] Port Hive writing to use FileFormat interface
  • [SPARK-19133][ML] ML GLR family and link could be uppercase.
  • [SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working
  • [SPARK-19128] [SQL] Refresh Cache after Set Location
  • [SPARK-12757][CORE] lower "block locks were not released" log to info level
  • [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter
  • [SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supported
  • [SPARK-19130][SPARKR] Support setting literal value as column implicitly
  • [SPARK-19042] spark executor can't download the jars when uber jar's http url contains any query strings
  • [spark-18806] [core] the processors DriverWrapper and CoarseGrainedExecutorBackend should be exit when worker exit
  • [SPARK-18113] canCommit should return same when called by same attempt multi times.
  • Branch 2.1
  • [SPARK-19117][TESTS] Skip the tests using script transformation on Windows
  • [SPARK-19120] [SPARK-19121] Refresh Metadata Cache After Loading Hive Tables
  • [SPARK-17204][CORE] Fix replicated off heap storage
  • SPARK-9487][Tests] Begin the work of robustifying unit tests , start with ContextCleanerSuite
  • [SPARK-19118] [SQL] Percentile support for frequency distribution table
  • [SPARK-16101][SQL] Refactoring CSV write path to be consistent with JSON data source
  • SPARK-16920: Add a stress test for evaluateEachIteration for 2000 trees
  • [SPARK-17975][MLLIB] Fix EMLDAOptimizer failing with ClassCastException
  • [SPARK-19113][SS][Tests]Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization
  • [SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog
  • [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group
  • [SPARK-19565] Improve DAGScheduler tests.
  • [SPARK-19563][SQL] advoid unnecessary sort in FileFormatWriter
  • [SPARK-19561][Python] cast TimestampType.toInternal output to long
  • [SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON
  • [SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method
  • [SPARK-19560] Improve DAGScheduler tests.
  • [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
  • when colum is use alias ,the order by result is wrong
  • [SPARK-17668][SQL] Use Expressions for conversions to/from user types in UDFs
  • [SPARK-19552] [BUILD] Upgrade Netty version to 4.1.8 final
  • [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
  • [SPARK-19544][SQL] Improve error message when some column types are compatible and others are not in set operations
  • [SPARK-19542][SS]Delete the temp checkpoint if a query is stopped without errors
  • [SPARK-19541][SQL] High Availability support for ThriftServer
  • [SPARK-19539][SQL] Block duplicate temp table during creation
  • [WIP] [SPARK-19538] Explicitly tell the DAGScheduler when a TaskSet is complete
  • [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support
  • [SPARK-19496][SQL]to_date udf to return null when input date is invalid
  • [SPARK-19115] [SQL] Supporting Create External Table Like Location
  • [SPARK-16929] Improve performance when check speculatable tasks.
  • [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
  • [SPARK-19530][SQL] Use guava weigher for code cache eviction
  • [SPARK-19527][Core] Approximate Size of Intersection of Bloom Filters
  • [SPARK-19520][streaming] Do not encrypt data written to the WAL.
  • [SPARK-17714][Core][test-maven][test-hadoop2.6]Avoid using ExecutorClassLoader to load Netty generated classes
  • [SPARK-19517][SS] KafkaSource fails to initialize partition offsets
  • [SPARK-19516][DOC] update public doc to use SparkSession instead of SparkContext
  • [SPARK-13931] Stage can hang if an executor fails while speculated tasks are running
  • [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV
  • [SPARK-20607][CORE]Add new unit tests to ShuffleSuite
  • [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
  • [SPARK-20605][Core][Yarn][Mesos] Deprecate not used AM and executor port configuration
  • [SPARK-20456][Docs] Add examples for functions collection for pyspark
  • [SPARK-20604][ML] Allow imputer to handle numeric types
  • [SPARK-20603][SS][Test]Set default number of topic partitions to 1 to reduce the load
  • [SPARK-20602] [ML]Adding LBFGS as optimizer for LinearSVC
  • Remove excess quotes in Windows executable
  • [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
  • [SPARK-20594][SQL]The staging directory should be appended with ".hive-staging" to avoid being deleted if we set hive.exec.stagingdir under the table directory without start with "."
  • [SPARK-19660][SQL] Replace the deprecated property name fs.default.name to fs.defaultFS that newly introduced
  • [INFRA] Close stale PRs
  • [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
  • Branch 2.2
  • [SPARK-20546][Deploy] spark-class gets syntax error in posix mode
  • [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator
  • [SPARK-20586] [SQL] Add deterministic and distinctLike to ScalaUDF and JavaUDF [WIP]
  • [SPARK-20590] Map default input data source formats to inlined classes
  • [SPARK-20562][MESOS] Support for declining offers under Unavailability.
  • [SPARK-20587][ML] Improve performance of ML ALS recommendForAll
  • [SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases
  • [Streaming] groupByKey should also disable map side combine in streaming.
  • [SPARK-20577][DOC][CORE]Add REST API Documentation in Cluster Mode.
  • [SPARK-20557] [SQL] Improve the error message for unsupported JDBC types.
  • [SPARK-7481] [build] Add spark-hadoop-cloud module to pull in object store access.
  • [SPARK-20557][SQL] Support for db column type TIMESTAMP WITH TIME ZONE
  • [SPARK-18777][PYTHON][SQL] Return UDF from udf.register
  • [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types
  • [SPARK-20550][SPARKR] R wrapper for Dataset.alias
  • [SPARK-20454] [GraphX] Two Improvements of ShortestPaths in GraphX
  • [Web UI] Remove no need loop in JobProgressListener
  • [SPARK-20365][YARN] Remove LocalSchem when add path to ClassPath.
  • [SPARK-20906][SparkR]:Constrained Logistic Regression for SparkR
  • [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
  • [SPARK-20891][SQL] Reduce duplicate code typedaggregators.scala
  • [SPARK-20900][YARN] Catch IllegalArgumentException thrown by new Path in ApplicationMaster
  • [SPARK-20903] [ML] Word2Vec Skip-Gram + Negative Sampling
  • [SPARK-20899][PySpark] PySpark supports stringIndexerOrderType in RFormula
  • [SPARK-20897][SQL] cached self-join should not fail
  • [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark
  • SPARK-20199 : Provided featureSubsetStrategy to GBTClassifier
  • [SPARK-20892][SparkR] Add SQL trunc function to SparkR
  • [SPARK-20889][SparkR] Grouped documentation for DATETIME column methods
  • [SPARK-20890][SQL] Added min and max typed aggregation functions
  • [SPARK-20886][CORE] HadoopMapReduceCommitProtocol to fail meaningfully if FileOutputCommitter.getWorkPath==null
  • [SPARK-20884] Spark' masters will be both standby due to the bug of curator
  • [SPARK-20883][SPARK-20376][SS] Refactored StateStore APIs and added conf to choose implementation
  • [SPARK-20754][SQL] Support TRUNC (number)
  • [SPARK-20881] [SQL] Use Hive's stats in metastore when cbo is disabled
  • [SPARK-20877][SPARKR][WIP] add timestamps to test runs
  • [SPARK-20876][SQL]If the input parameter is float type for ceil or floor,the result is not we expected
  • [SPARK-16944][Mesos] Improve data locality when launching new executors when dynamic allocation is enabled
  • [SPARK-20774][SQL] Cancel all jobs when QueryExection throws.
  • [SPARK-20640][CORE]Make rpc timeout and retry for shuffle registration configurable.
  • [SPARK-20854][SQL] Extend hint syntax to support expressions
  • [SPARK-19900][core]Remove driver when relaunching.
  • [SPARK-20863] Add metrics/instrumentation to LiveListenerBus
  • [Spark-20771][SQL] Make weekofyear more intuitive
  • [SPARK-20841][SQL] Support column aliases for catalog tables
  • [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - Class Splitting
  • [SPARK-23223][SQL] Make stacking dataset transforms more performant
  • [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
  • [SPARK-23084][PYTHON]Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
  • [SPARK-23209][core] Allow credential manager to work when Hive not available.
  • [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite
  • [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory in data source v2
  • [SPARK-23217][ML] Add cosine distance measure to ClusteringEvaluator
  • [SPARK-23218][SQL] simplify ColumnVector.getArray
  • [SPARK-23214][SQL] cached data should not carry extra hint info
  • [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
  • [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immutable logical plans.
  • [WIP][SPARK-23202][SQL] Break down DataSourceV2Writer.commit into two phase
  • [SPARK-21396][SQL] Fixes MatchError when UDTs are passed through Hive Thriftserver
  • [SPARK-23097][SQL][SS] Migrate text socket source to V2
  • [SPARK-11222][Build][Python] Python document style checker added
  • [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client
  • [SPARK-23199][SQL]improved Removes repetition from group expressions in Aggregate
  • [SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2
  • Improved block merging logic for partitions
  • Changing JDBC relation to better process quotes
  • [SPARK-23196] Unify continuous and microbatch V2 sinks
  • [SPARK-23166][ML] Add maxDF Parameter to CountVectorizer
  • [Spark-22886][ML][TESTS] ML test for structured streaming: ml.recomme…
  • [SPARK-23188][SQL] Make vectorized columar reader batch size configurable
  • [SPARK-23186][SQL] Initialize DriverManager first before loading JDBC Drivers
  • [SPARK-23185][SQL] Make the configuration "spark.default.parallelism" can be changed on each SQL session to decrease empty files
  • [SPARK-23014][SS] Fully remove V1 memory sink.
  • [SPARK-23179][SQL] Support option to throw exception if overflow occurs
  • [SPARK-20129][Core] JavaSparkContext should use SparkContext.getOrCreate
  • [SPARK-23172][SQL] Expand the ReorderJoin rule to handle Project nodes
  • [SPARK-23379][SQL] remove redundant metastore access
  • [SPARK-23378][SQL] move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
  • [SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap may fail
  • [SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer
  • [SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or dateutil.
  • [SPARK-23364][SQL]'desc table' command in spark-sql add column head display
  • [SPARK-23367][Build] Include python document style checking
  • [SPARK-23366] Improve hot reading path in ReadAheadInputStream
  • [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
  • [SPARK-23285][K8S] Add a config property for specifying physical executor cores
  • [SPARK-23099][SS] Migrate foreach sink to DataSourceV2
  • SPARK-18844[MLLIB] Add more binary classification metrics to BinaryClassificationMetrics
  • [SPARK-23316][SQL] AnalysisException after max iteration reached for IN query
  • [SPARK-20659][Core] Removing sc.getExecutorStorageStatus and making StorageStatus private
  • [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames' in Scala's StructType
  • [SPARK-23357][CORE] 'SHOW TABLE EXTENDED LIKE pattern=STRING' should add a 'Partitioned' display similar to Hive; when the partition is empty, also show an empty partition field []
  • [SPARK-23356][SQL] Pushes Project to both sides of Union when expression is non-deterministic
  • [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2
  • [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps in Arrow codepath to deal with dst
  • [SPARK-23341][SQL] define some standard options for data source v2
  • [SPARK-23353][CORE] Allow ExecutorMetricsUpdate events to be logged t…
  • [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs
  • [SPARK-23349][SQL] ShuffleExchangeExec: Duplicate and redundant type determination for ShuffleManager Object
  • [SPARK-23350][SS]Bug fix for exception handling when stopping continu…
  • [SPARK-23271][SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe
  • [SPARK-23355][SQL] convertMetastore should not ignore table properties
  • [SPARK-22977][SQL] fix web UI SQL tab for CTAS
  • [SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans (see the sketch after this list)
  • [SPARK-23240][python] Don't let python site customizations interfere with communication between PythonWorkerFactory and daemon.py
  • [SPARK-22119][FOLLOWUP][ML] Use spherical KMeans with cosine distance
  • [SPARK-23405] Add constraints
  • [SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies
  • [SPARK-23510][SQL] Support Hive 2.2 and Hive 2.3 metastore
  • [SPARK-23508][CORE] Use TimeStampedHashMap for BlockManagerId in case blockManagerIdCache…
  • [SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document
  • [SPARK-23499][MESOS] Support for priority queues in Mesos scheduler
  • [SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed by the order of input partitions
  • [SPARK-23501][UI] Refactor AllStagesPage in order to avoid redundant code
  • [SPARK-23475][UI][BACKPORT-2.3] Also show skipped stages
  • [SPARK-23488][python] Add missing catalog methods to python API
  • [SPARK-23361][yarn] Allow AM to restart after initial tokens expire.
  • [SPARK-23462][SQL] improve missing field error message in `StructType`
  • [SPARK-23303][SQL] improve the explain result for data source v2 relations
  • [SPARK-23408][SS] Synchronize successive AddDataMemory actions in StreamTest.
  • [SPARK-23472] Add defaultJavaOptions for drivers and executors.
  • [SPARK-23464][MESOS] Fix mesos cluster scheduler options double-escaping
  • [SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend
  • [SPARK-23288][SS] Fix output metrics with parquet sink
  • [SPARK-23417][python] Fix the build instructions supplied by exception messages in python streaming tests
  • [SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection
  • [SPARK-23415][SQL][TEST] Make behavior of BufferHolderSparkSubmitSuite correct and stable
  • [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization and task partition calculation in DAGScheduler.submitMissingTasks should keep the same RDD checkpoint status
  • [SPARK-23455][ML] Default Params in ML should be saved separately in metadata
  • [SPARK-3159] Added subtree pruning in the translation from LearningNode to Node; added unit tests for tree redundancy and adapted existing ones that were affected
  • [SPARK-23451][ML] Deprecate KMeans.computeCost
  • [SPARK-23449][K8S] Preserve extraJavaOptions ordering
  • [SPARK-23445] ColumnStat refactoring
  • [SPARK-23491][SS] Remove explicit job cancellation from ContinuousExecution reconfiguring
  • [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes
  • [SPARK-23329][SQL] Fix documentation of trigonometric functions
  • [SPARK-23651] Add a check for host name
  • [SPARK-23627][SQL] Provide isEmpty in Dataset
  • [SPARK-23635][YARN] AM env variable should not overwrite same name env variable set through spark.executorEnv.
  • [SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments
  • [SPARK-23583][SQL] Invoke should support interpreted execution
  • [SPARK-23649][SQL] Prevent crashes on schema inferring of CSV containing wrong UTF-8 chars
  • [SPARK-23486] Cache the function name from the catalog for lookupFunctions
  • [SPARK-23644][CORE][UI] Use absolute path for REST call in SHS
  • [WIP][SPARK-23643] Shrink the buffer in hashSeed to the size of the seed parameter
  • Branch 2.1
  • [SPARK-23618][K8s][BUILD] Initialize BUILD_ARGS in docker-image-tool.sh
  • AccumulatorV2 subclass isZero scaladoc fix
  • [SPARK-23647][PYTHON][SQL] Adds more types for hint in pyspark
  • Documenting months_between direction
  • [SPARK-14681][ML] Provide label/impurity stats for spark.ml decision tree nodes
  • [SPARK-23640][CORE] Fix hadoop config may override spark config
  • [SPARK-23639][SQL] Obtain token before init metastore client in SparkSQL CLI
  • [SPARK-23637][YARN] YARN might allocate more resources if the same executor is killed multiple times.
  • [MINOR] [SQL] [TEST] Create table using `dataSourceName` in `HadoopFsRelationTest`
  • [SPARK-23598][SQL] Make methods in BufferedRowIterator public to avoid runtime error for a large query
  • [SPARK-23584][SQL] NewInstance should support interpreted execution
  • [SPARK-23615][ML][PYSPARK] Add maxDF Parameter to Python CountVectorizer
  • [SPARK-23549][SQL] Cast to timestamp when comparing timestamp with date
  • [SPARK-23587][SQL] Add interpreted execution for MapObjects expression
  • [SPARK-23626][CORE] DAGScheduler blocked due to JobSubmitted event
  • [SPARK-23623] [SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer
  • [SPARK-23523] [SQL] [BACKPORT-2.3] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery
  • [SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs
  • Added description of checkpointInterval parameter
  • [SPARK-23595][SQL] ValidateExternalType should support interpreted execution
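
Several of the PySpark changes above are easier to see in code. Below is a minimal sketch of the window-boundary constants referenced in SPARK-23084; it assumes a PySpark shell (Spark 2.3 or later) where a SparkSession named `spark` already exists, and the sample data is invented for illustration:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Frame every row's window from the start of its partition up to itself.
w = (Window.partitionBy("key")
           .orderBy("value")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df.withColumn("running_total", F.sum("value").over(w)).show()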
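
In the same spirit, a hedged sketch of the maxDF parameter added to CountVectorizer (SPARK-23166 for Scala, SPARK-23615 for the Python API); the data and column names are invented, and the parameter is assumed to be available as in Spark 2.4:

from pyspark.ml.feature import CountVectorizer

docs = spark.createDataFrame([(["a", "b", "c"],), (["a", "b"],), (["a"],)], ["words"])

# maxDF ignores terms that appear in more than the given number (or
# fraction) of documents; here "a" occurs in all three rows and is dropped.
cv = CountVectorizer(inputCol="words", outputCol="features", maxDF=2.0)
model = cv.fit(docs)
model.transform(docs).show(truncate=False)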
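
Finally, a sketch of the cosine distance measure threaded through clustering (SPARK-22119, SPARK-23217, SPARK-23344); again the vectors are invented and the distanceMeasure parameter is assumed to be available as in Spark 2.4:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame(
    [(Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([1.0, 0.9]),),
     (Vectors.dense([-1.0, 1.0]),),
     (Vectors.dense([-1.0, 0.9]),)], ["features"])

# Cluster by angle between vectors rather than Euclidean distance.
model = KMeans(k=2, seed=1, distanceMeasure="cosine").fit(data)
predictions = model.transform(data)

# Evaluate the silhouette score with the same cosine measure.
evaluator = ClusteringEvaluator(distanceMeasure="cosine")
print(evaluator.evaluate(predictions))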
spark questions on Stack Overflow
  • Apache Spark union method giving inexplicable result
  • Efficiently manipulating subsets of RDD's keys in spark
  • Error using MemSQL Spark Connector
  • Apache Spark and MongoDb Date issue
  • Why won't this Spark sample code load in spark-shell?
  • spark join: parenthesis issue
  • Is spark standalone scheduler or Yarn scheduler better for a Cloudera 5.4 hadoop cluster?
  • Apache Spark runtime exception "Unable to load native-hadoop library for your platform" despite not using or referencing Hadoop at all
  • Write single CSV file using spark-csv
  • Apache Spark Word Count on PDF file
  • Confusion about Spark streaming "added jobs"
  • Error with type written avro file in spark
  • Spark - Need to access separate index file to interpret data
  • Spark Configuration: memory/instance/cores
  • Spark performance for Scala vs Python
  • StreamsSQL for Stream Processing Engines like Spark or Storm
  • spark application java.lang.OutOfMemoryError: Direct buffer memory
  • Spark: Why do I have to explicitly tell what to cache?
  • Spark mllib shuffling the data
  • Query mongodb using spark
  • Multiple Aggregate operations on the same column of a spark dataframe (see the sketch after this list)
  • Accessing HBase tables through Spark
  • YARN REST API - Spark job submission
  • How can I read or use a NetCDF file using Spark or Scala code
  • How to send a file to database using spark streaming
  • Error starting spark-sql-thriftserver
  • Apache Spark: How do I convert a Spark DataFrame to a RDD with type RDD[(Type1,Type2, ...)]?
  • How to get the file name from DStream of Spark StreamingContext?
  • Storing orc format through Spark in java
  • Spark ML indexer cannot resolve DataFrame column name with dots?
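
Some of these questions recur often; for instance, multiple aggregate operations on the same column boil down to passing several aggregate expressions to one agg() call. A minimal sketch, assuming an existing SparkSession named `spark` and invented data:

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 5), ("b", 2)], ["key", "value"])

# Each aggregate expression can target the same input column.
df.groupBy("key").agg(
    F.min("value").alias("min_value"),
    F.max("value").alias("max_value"),
    F.avg("value").alias("avg_value"),
).show()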