  1. Aug 12, 2017
    • Sean Owen's avatar
      [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12 · b0bdfce9
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This is trivial, but bugged me. We should download software over HTTPS.
      And we can use RAT 0.12 while at it to pick up bug fixes.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18927 from srowen/Rat012.
      b0bdfce9
  2. Aug 11, 2017
    • Stavros Kontopoulos's avatar
      [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alone cluster mode · da8c59bd
      Stavros Kontopoulos authored
      Fixes the --packages flag for standalone cluster mode. Adds to the driver classpath the jars that are resolved via Ivy, along with any other jars passed in `spark.jars`. Jars not resolved by Ivy are downloaded explicitly to a temporary folder on the driver node. Similar code already exists in SparkSubmit, so part of it was refactored for reuse in the DriverWrapper class, which is responsible for launching the driver in standalone cluster mode.
      
      Note: In standalone mode `spark.jars` contains the user jar so it can be fetched later on the executor side.
      
      Tested manually by submitting a driver in cluster mode within a standalone cluster and checking that dependencies were resolved on the driver side.
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      
      Closes #18630 from skonto/fix_packages_stand_alone_cluster.
      da8c59bd
    • Tejas Patil's avatar
      [SPARK-19122][SQL] Unnecessary shuffle+sort added if join predicates ordering... · 7f16c691
      Tejas Patil authored
      [SPARK-19122][SQL] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
      
      ## What changes were proposed in this pull request?
      
      Jira : https://issues.apache.org/jira/browse/SPARK-19122
      
      `leftKeys` and `rightKeys` in `SortMergeJoinExec` are altered based on the ordering of the join keys in the child's `outputPartitioning`. This is done every time `requiredChildDistribution` is invoked during query planning.
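      
      For context, a minimal, hypothetical sketch of the scenario in the JIRA title: both tables are bucketed and sorted by (i, j), but the join predicates are listed in a different order, which previously caused an unnecessary shuffle+sort. Table and column names are illustrative only.
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[*]").appName("spark-19122-sketch").getOrCreate()
      import spark.implicits._
      
      // Both sides bucketed and sorted by (i, j)
      Seq((1, 1, "a"), (2, 2, "b")).toDF("i", "j", "v")
        .write.bucketBy(8, "i", "j").sortBy("i", "j").saveAsTable("t1")
      Seq((1, 1, "x"), (2, 2, "y")).toDF("i", "j", "v")
        .write.bucketBy(8, "i", "j").sortBy("i", "j").saveAsTable("t2")
      
      // Join keys listed as (j, i) rather than the bucketing/sort order (i, j);
      // after this change the plan should not add an extra shuffle+sort.
      spark.table("t1").join(spark.table("t2"), Seq("j", "i")).explain()
      ```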
      
      ## How was this patch tested?
      
      - Added new test case
      - Existing tests
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #16985 from tejasapatil/SPARK-19122_join_order_shuffle.
      7f16c691
    • Tejas Patil's avatar
      [SPARK-21595] Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray · 94439997
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported excessive spilling to disk because the default spill threshold for `ExternalAppendOnlyUnsafeRowArray` is quite small for the WINDOW operator. The old behaviour of the WINDOW operator (pre https://github.com/apache/spark/pull/16909) was to hold data in an array for the first 4096 records, after which it switched to `UnsafeExternalSorter` and started spilling to disk once `spark.shuffle.spill.numElementsForceSpillThreshold` was reached (or earlier if memory was scarce due to excessive consumers).
      
      Currently, both the switch from the in-memory array to `UnsafeExternalSorter` and the point at which `UnsafeExternalSorter` spills to disk are controlled by a single threshold for `ExternalAppendOnlyUnsafeRowArray`. This PR separates the two so that they can be tuned with more granular control.
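      
      A rough sketch of what the separation means for tuning; the exact configuration keys below are assumptions for illustration and are not confirmed by this commit message:
      
      ```scala
      // Assumed/illustrative keys: one knob for how many rows are buffered in the
      // in-memory array before switching to UnsafeExternalSorter, and a separate
      // knob for how many rows UnsafeExternalSorter buffers before spilling to disk.
      spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", "4096")
      spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "2147483647")
      ```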
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #18843 from tejasapatil/SPARK-21595.
      94439997
    • LucaCanali's avatar
      [SPARK-21519][SQL] Add an option to the JDBC data source to initialize the target DB environment · 0377338b
      LucaCanali authored
      Add an option to the JDBC data source to initialize the environment of the remote database session
      
      ## What changes were proposed in this pull request?
      
      This proposes an option for the JDBC data source, tentatively called `sessionInitStatement`, to implement the session-initialization functionality present, for example, in the Sqoop connector for Oracle (see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements). After each database session to the remote DB is opened, and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block in the case of Oracle).
      
      See also https://issues.apache.org/jira/browse/SPARK-21519
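      
      A minimal usage sketch of the proposed option (the connection details and the session statement are illustrative, not taken from this patch):
      
      ```scala
      // The statement is run once per remote DB session, before any data is read.
      val df = spark.read
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
        .option("dbtable", "myschema.mytable")
        .option("sessionInitStatement", "ALTER SESSION SET TIME_ZONE = 'UTC'")
        .load()
      ```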
      
      ## How was this patch tested?
      
      Manually tested using Spark SQL data source and Oracle JDBC
      
      Author: LucaCanali <luca.canali@cern.ch>
      
      Closes #18724 from LucaCanali/JDBC_datasource_sessionInitStatement.
      0377338b
    • Kent Yao's avatar
      [SPARK-21675][WEBUI] Add a navigation bar at the bottom of the Details for Stage Page · 2387f1e3
      Kent Yao authored
      ## What changes were proposed in this pull request?
      
      1. In the Spark Web UI, the Details for Stage page doesn't have a navigation bar at the bottom. When we scroll down to the bottom, it would be better to have a navigation bar right there so we can jump to wherever we want.
      2. Executor ID is not equivalent to Host; it may be better to separate them, so that tasks can then be grouped by Host.
      
      ## How was this patch tested?
      manually test
      ![wx20170809-165606](https://user-images.githubusercontent.com/8326978/29114161-f82b4920-7d25-11e7-8d0c-0c036b008a78.png)
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #18893 from yaooqinn/SPARK-21675.
      2387f1e3
  3. Aug 10, 2017
    • Reynold Xin's avatar
      [SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog · 584c7f14
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption.
      
      ## How was this patch tested?
      Removed the test case.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18912 from rxin/remove-getTableOption.
      584c7f14
    • Peng Meng's avatar
      [SPARK-21638][ML] Fix RF/GBT Warning message error · ca695585
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      When training an RF model, there are many warning messages like this:
      
      > WARN  RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.
      
      This warning message is unnecessary and its numbers are not accurate.
      
      Actually, this warning is shown whenever not all of the nodes can be split in one iteration. In most cases not all nodes can be split in a single iteration, so the warning ends up being printed on nearly every iteration.
      
      ## How was this patch tested?
      The existing UT
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18868 from mpjlu/fixRFwarning.
      ca695585
    • Adrian Ionescu's avatar
      [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs · 95ad960c
      Adrian Ionescu authored
      ## What changes were proposed in this pull request?
      
      This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job and partially reimplements SPARK-20703 in terms of it.
      
      The interface basically consists of 3 traits:
      - `WriteTaskStats`: just a tag for classes that represent statistics collected during a `WriteTask`
        The only constraint it adds is that the class should be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`.
      - `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on tuples that are processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance.
      - `WriteJobStatsTracker`: a trait for classes that act as containers of `Serializable` state that's necessary for instantiating `WriteTaskStatsTracker` on executors and finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver.
      
      A potential future use of this interface is, for example, CBO stats maintenance during `INSERT INTO table ...` operations.
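      
      A rough sketch of how the three traits described above fit together; the method names and signatures here are illustrative assumptions rather than the exact internal API:
      
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      
      // Tag for per-task statistics; must be Serializable so instances can be
      // collected on the driver from all executors at the end of the write job.
      trait WriteTaskStats extends Serializable
      
      // Computes statistics from the rows processed by a single write task.
      trait WriteTaskStatsTracker {
        def newRow(row: InternalRow): Unit       // observe one tuple as it is written
        def getFinalStats(): WriteTaskStats      // produce the per-task statistics
      }
      
      // Serializable container that creates per-task trackers on executors and
      // processes the gathered per-task stats back on the driver.
      trait WriteJobStatsTracker extends Serializable {
        def newTaskInstance(): WriteTaskStatsTracker
        def processStats(stats: Seq[WriteTaskStats]): Unit
      }
      ```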
      
      ## How was this patch tested?
      Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.
      
      Author: Adrian Ionescu <adrian@databricks.com>
      
      Closes #18884 from adrian-ionescu/write-stats-tracker-api.
      95ad960c
  4. Aug 09, 2017
    • bravo-zhang's avatar
      [SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None · 84454d7d
      bravo-zhang authored
      ## What changes were proposed in this pull request?
      
      Currently `df.na.replace("*", Map[String, String]("NULL" -> null))` will throw an exception.
      This PR enables passing null/None as a value in the replacement map of DataFrame.replace().
      Note that the replacement map keys and values should still be of the same type, while the values can mix null/None with that type.
      This PR enables, for example, the following operations:
      - `df.na.replace("*", Map[String, String]("NULL" -> null))` (Scala)
      - `df.na.replace("*", Map[Any, Any](60 -> null, 70 -> 80))` (Scala)
      - `df.na.replace('Alice', None)` (Python)
      - `df.na.replace([10, 20])` (Python; replacing with None is the default)
      One use case: replace all empty strings with null/None because they were incorrectly generated, and then drop all null/None data:
      - `df.na.replace("*", Map("" -> null)).na.drop()` (Scala)
      - `df.replace(u'', None).dropna()` (Python)
      
      ## How was this patch tested?
      
      Scala unit test.
      Python doctest and unit test.
      
      Author: bravo-zhang <mzhang1230@gmail.com>
      
      Closes #18820 from bravo-zhang/spark-14932.
      84454d7d
    • peay's avatar
      [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator · c06f3f5a
      peay authored
      ## What changes were proposed in this pull request?
      
      This modification increases the timeout for `serveIterator` (which is not dynamically configurable). This fixes timeout issues in PySpark when using `collect` and similar functions, in cases where Python may take more than a couple of seconds to connect.
      
      See https://issues.apache.org/jira/browse/SPARK-21551
      
      ## How was this patch tested?
      
      Ran the tests.
      
      cc rxin
      
      Author: peay <peay@protonmail.com>
      
      Closes #18752 from peay/spark-21551.
      c06f3f5a
    • Jose Torres's avatar
      [SPARK-21587][SS] Added filter pushdown through watermarks. · 0fb73253
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.)
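      
      An illustrative sketch (using the built-in `rate` streaming source, which provides `timestamp` and `value` columns) of a filter that qualifies for pushdown: it is deterministic and does not reference the watermarked column.
      
      ```scala
      import org.apache.spark.sql.functions.col
      
      val events = spark.readStream
        .format("rate")
        .load()
        .withWatermark("timestamp", "10 minutes")
        .filter(col("value") % 2 === 0)   // deterministic and does not reference `timestamp`,
                                          // so it can be pushed below EventTimeWatermark
      ```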
      
      ## How was this patch tested?
      unit tests
      
      Author: Jose Torres <joseph-torres@databricks.com>
      
      Closes #18790 from joseph-torres/SPARK-21587.
      0fb73253
    • gatorsmile's avatar
      [SPARK-21504][SQL] Add spark version info into table metadata · 2d799d08
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR adds the Spark version info to the table metadata. The value is assigned when the table is created. It helps users find out which version of Spark was used to create the table.
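      
      A hedged sketch of how the stored version might be inspected; the exact metadata row name ("Created By") is an assumption for illustration, not confirmed by this commit message:
      
      ```scala
      spark.sql("CREATE TABLE t (id INT) USING parquet")
      
      // Look for the version entry in the table's formatted description.
      spark.sql("DESCRIBE TABLE FORMATTED t")
        .filter(org.apache.spark.sql.functions.col("col_name") === "Created By")  // assumed row name
        .show(truncate = false)
      ```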
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18709 from gatorsmile/addVersion.
      2d799d08
    • Takeshi Yamamuro's avatar
      [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) · b78cf13b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR updates `lz4-java` to the latest release (v1.4.0) and removes the custom `LZ4BlockInputStream`. We currently use a custom `LZ4BlockInputStream` to read concatenated byte streams in shuffle, but this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105), so we can update to the latest release and remove the custom `LZ4BlockInputStream`.
      
      Major diffs between the latest release and the v1.3.0 currently in master are as follows (https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6d4693f56253fcddfad7b441bb8d917b182efa2d):
      - fixed NPE in XXHashFactory similarly
      - Don't place resources in default package to support shading
      - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
      - Try to load lz4-java from java.library.path, then fallback to bundled
      - Add ppc64le binary
      - Add s390x JNI binding
      - Add basic LZ4 Frame v1.5.0 support
      - enable aarch64 support for lz4-java
      - Allow unsafeInstance() for ppc64le architecture
      - Add unsafeInstance support for AArch64
      - Support 64-bit JNI build on Solaris
      - Avoid over-allocating a buffer
      - Allow EndMark to be incompressible for LZ4FrameInputStream.
      - Concat byte stream
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18883 from maropu/SPARK-21276.
      b78cf13b
    • vinodkc's avatar
      [SPARK-21665][CORE] Need to close resources after use · 83fe3b5e
      vinodkc authored
      ## What changes were proposed in this pull request?
      Resources in core (SparkSubmitArguments.scala), launcher (AbstractCommandBuilder.java), and resource-managers/YARN (Client.scala) are now released/closed after use.
      
      ## How was this patch tested?
      No new test cases added; existing unit tests pass.
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #18880 from vinodkc/br_fixresouceleak.
      83fe3b5e
    • 10087686's avatar
      [SPARK-21663][TESTS] test("remote fetch below max RPC message size") should... · 6426adff
      10087686 authored
      [SPARK-21663][TESTS] test("remote fetch below max RPC message size") should call masterTracker.stop() in MapOutputTrackerSuite
      
      Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>
      
      ## What changes were proposed in this pull request?
      After the unit tests end, masterTracker.stop() should be called to free resources.
      
      ## How was this patch tested?
      Ran the unit tests.
      
      Author: 10087686 <wang.jiaochun@zte.com.cn>
      
      Closes #18867 from wangjiaochun/mapout.
      6426adff
    • WeichenXu's avatar
      [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search · b35660dd
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search:
      https://github.com/scalanlp/breeze/pull/651
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18797 from WeichenXu123/update-breeze.
      b35660dd
    • Anderson Osagie's avatar
      [SPARK-21176][WEB UI] Use a single ProxyServlet to proxy all workers and applications · ae8a2b14
      Anderson Osagie authored
      ## What changes were proposed in this pull request?
      
      Currently, each application and each worker creates their own proxy servlet. Each proxy servlet is backed by its own HTTP client and a relatively large number of selector threads. This is excessive but was fixed (to an extent) by https://github.com/apache/spark/pull/18437.
      
      However, a single HTTP client (backed by a single selector thread) should be enough to handle all proxy requests. This PR creates a single proxy servlet no matter how many applications and workers there are.
      
      ## How was this patch tested?
      The unit tests for rewriting proxied locations and headers were updated. I then spun up a 100-node cluster to ensure that proxying worked correctly.
      
      jiangxb1987 Please let me know if there's anything else I can do to help push this through. Thanks!
      
      Author: Anderson Osagie <osagie@gmail.com>
      
      Closes #18499 from aosagie/fix/minimize-proxy-threads.
      ae8a2b14
    • pgandhi's avatar
      [SPARK-21503][UI] Spark UI shows incorrect task status for a killed Executor Process · f016f5c8
      pgandhi authored
      The Executors tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command.
      This patch adds the ExecutorLostFailure case, which was previously missing; without it, the default case was executed and the task was marked as completed. The new case covers all scenarios where the executor's connection to the Spark driver is lost, e.g. because the executor process was killed or the network connection dropped.
      
      ## How was this patch tested?
      Manually Tested the fix by observing the UI change before and after.
      Before:
      <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png">
      After:
      <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png">
      
      Author: pgandhi <pgandhi@yahoo-inc.com>
      Author: pgandhi999 <parthkgandhi9@gmail.com>
      
      Closes #18707 from pgandhi999/master.
      f016f5c8
    • Xingbo Jiang's avatar
      [SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API should allow literal boundary · 031910b0
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      The Window rangeBetween() API should allow literal boundaries, which means the window range frame can be computed over double/date/timestamp values.
      
      An example of the use case:
      ```
      SELECT
      	val_timestamp,
      	cate,
      	avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING)
      FROM testData
      ```
      
      This PR refactors the Window `rangeBetween` and `rowsBetween` APIs; legacy user code should still be valid.
      
      ## How was this patch tested?
      
      Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18814 from jiangxb1987/literal-boundary.
      031910b0
  5. Aug 08, 2017
  6. Aug 07, 2017
    • Yanbo Liang's avatar
      [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation. · f763d846
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```.
      
      ## How was this patch tested?
      ```
      from pyspark.ml.regression import GeneralizedLinearRegression
      dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
      glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
      model = glr.fit(dataset)
      model.summary
      ```
      Before this PR:
      ![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png)
      After this PR:
      ![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png)
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18870 from yanboliang/spark-19270.
      f763d846
    • Ajay Saini's avatar
      [SPARK-21542][ML][PYTHON] Python persistence helper functions · fdcee028
      Ajay Saini authored
      ## What changes were proposed in this pull request?
      
      Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.
      
      ## How was this patch tested?
      
      Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure they are the same. This test was done both in the Python REPL and in the unit tests.
      
      Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.
      
      Author: Ajay Saini <ajays725@gmail.com>
      
      Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
      fdcee028
    • gatorsmile's avatar
      [SPARK-21648][SQL] Fix confusing assert failure in JDBC source when parallel... · baf5cac0
      gatorsmile authored
      [SPARK-21648][SQL] Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided.
      
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE mytesttable1
      USING org.apache.spark.sql.jdbc
        OPTIONS (
        url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
        dbtable 'mytesttable1',
        paritionColumn 'state_id',
        lowerBound '0',
        upperBound '52',
        numPartitions '53',
        fetchSize '10000'
      )
      ```
      
      The option name `paritionColumn` above is misspelled. That means users did not actually provide a value for `partitionColumn`. In such a case, users hit a confusing error:
      
      ```
      AssertionError: assertion failed
      java.lang.AssertionError: assertion failed
      	at scala.Predef$.assert(Predef.scala:156)
      	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
      	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
      ```
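      
      For comparison, a sketch with the partitioning options spelled correctly (connection values are illustrative), so the parallel-fetch parameters are actually picked up:
      
      ```scala
      val df = spark.read
        .format("jdbc")
        .option("url", "jdbc:mysql://dbhost:3306/mydb")
        .option("dbtable", "mytesttable1")
        .option("partitionColumn", "state_id")   // note: not "paritionColumn"
        .option("lowerBound", "0")
        .option("upperBound", "52")
        .option("numPartitions", "53")
        .load()
      ```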
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18864 from gatorsmile/jdbcPartCol.
      baf5cac0
    • Jose Torres's avatar
      [SPARK-21565][SS] Propagate metadata in attribute replacement. · cce25b36
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes.
      
      ## How was this patch tested?
      new unit test, which was verified to fail before the fix
      
      Author: Jose Torres <joseph-torres@databricks.com>
      
      Closes #18840 from joseph-torres/SPARK-21565.
      cce25b36
    • Mac's avatar
      [SPARK][DOCS] Added note on meaning of position to substring function · 4f7ec3a3
      Mac authored
      ## What changes were proposed in this pull request?
      
      Enhanced some existing documentation
      
      Author: Mac <maclockard@gmail.com>
      
      Closes #18710 from maclockard/maclockard-patch-1.
      4f7ec3a3
    • Xiao Li's avatar
      [SPARK-21647][SQL] Fix SortMergeJoin when using CROSS · bbfd6b5d
      Xiao Li authored
      ### What changes were proposed in this pull request?
      author: BoleynSu
      closes https://github.com/apache/spark/pull/18836
      
      ```Scala
      val df = Seq((1, 1)).toDF("i", "j")
      df.createOrReplaceTempView("T")
      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
        sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
          "cross join T t2 where t2.i = t1.i").explain(true)
      }
      ```
      The above code could cause the following exception:
      ```
      SortMergeJoinExec should not take Cross as the JoinType
      java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType
      	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
      ```
      
      Our SortMergeJoinExec supports CROSS join, so we should not hit such an exception. This PR fixes the issue.
      
      ### How was this patch tested?
      Modified the two existing test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      Author: Boleyn Su <boleyn.su@gmail.com>
      
      Closes #18863 from gatorsmile/pr-18836.
      bbfd6b5d
    • zhoukang's avatar
      [SPARK-21544][DEPLOY][TEST-MAVEN] Tests jar of some module should not upload twice · 8b69b17f
      zhoukang authored
      ## What changes were proposed in this pull request?
      
      **For the modules below:**
      common/network-common
      streaming
      sql/core
      sql/catalyst
      **the tests.jar will be installed or deployed twice, like:**
      `[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
      [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
      [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged`
      **The reason is as follows:**
      `[DEBUG]   (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
      [DEBUG]   (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark
      -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
      -mdh2.1.0.1-SNAPSHOT]`
      
      When executing 'mvn deploy' to Nexus during a release, this will fail since artifacts in the release Nexus cannot be overridden.
      
      ## How was this patch tested?
      Execute 'mvn clean install -Pyarn -Phadoop-2.6 -Phadoop-provided -DskipTests'
      
      Author: zhoukang <zhoukang199191@gmail.com>
      
      Closes #18745 from caneGuy/zhoukang/fix-installtwice.
      8b69b17f
    • Peng Meng's avatar
      [SPARK-21623][ML] fix RF doc · 1426eea8
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      The comments for parentStats in RF are wrong.
      parentStats is not only used for the first iteration; it is used across all iterations for unordered features.
      
      ## How was this patch tested?
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18832 from mpjlu/fixRFDoc.
      1426eea8
    • Stavros Kontopoulos's avatar
      [SPARK-13041][MESOS] Adds sandbox uri to spark dispatcher ui · 663f30d1
      Stavros Kontopoulos authored
      ## What changes were proposed in this pull request?
      Adds a sandbox link per driver in the dispatcher UI, with minimal changes, after a bug was fixed here:
      https://issues.apache.org/jira/browse/MESOS-4992
      The sandbox uri has the following format:
      http://<proxy_uri>/#/slaves/<agent-id>/frameworks/<scheduler-id>/executors/<driver-id>/browse
      
      For DC/OS the proxy URI is <dc/os uri>/mesos. For the DC/OS deployment scenario, and to make things easier, this introduces a new config property named `spark.mesos.proxy.baseURL`, which should be passed to the dispatcher with --conf when it is launched. If no such configuration is detected, no sandbox URI is shown, and there is an empty column with a header (this can be changed so that nothing is shown).
      
      Within DC/OS the base URL should eventually become a property of the dispatcher package, to be added in the future here:
      https://github.com/mesosphere/universe/blob/9e7c909c3b8680eeb0494f2a58d5746e3bab18c1/repo/packages/S/spark/26/config.json
      It is not easy to detect that URI across different environments, so the user should pass it explicitly.
      
      ## How was this patch tested?
      Tested with the mesos test suite here: https://github.com/typesafehub/mesos-spark-integration-tests.
      The attached image shows the UI modification where the sandbox header is added.
      ![image](https://user-images.githubusercontent.com/7945591/27831630-2a3b447e-60d4-11e7-87bb-d057efd4efa7.png)
      
      Tested the uri redirection the way it was suggested here:
      https://issues.apache.org/jira/browse/MESOS-4992
      
      Built mesos 1.4 from the master branch and started the mesos dispatcher with the command:
      
      `./sbin/start-mesos-dispatcher.sh --conf spark.mesos.proxy.baseURL=http://localhost:5050 -m mesos://127.0.0.1:5050`
      
      Run a spark example:
      
      `./bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master mesos://10.10.1.79:7078   --deploy-mode cluster   --executor-memory 2G   --total-executor-cores 2     http://<path>/spark-examples_2.11-2.1.1.jar  10`
      
      Sandbox uri is shown at the bottom of the page:
      
      ![image](https://user-images.githubusercontent.com/7945591/28599237-89d0a8c8-71b1-11e7-8f94-41ad117ceead.png)
      
      Redirection works as expected:
      ![image](https://user-images.githubusercontent.com/7945591/28599247-a5d65248-71b1-11e7-8b5e-a0ac2a79fa23.png)
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      
      Closes #18528 from skonto/adds_the_sandbox_uri.
      663f30d1
    • Xianyang Liu's avatar
      [SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called · 534a063f
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called.
      When `revertPartialWritesAndClose` is called, we decrease the number of written records in `ShuffleWriteMetrics`. However, it is currently decreased all the way to zero, which is wrong; we should only subtract the records written after the last `commitAndGet` call.
      
      ## How was this patch tested?
      Modified existing test.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #18830 from ConeyLiu/DiskBlockObjectWriter.
      534a063f
  7. Aug 06, 2017
    • Sean Owen's avatar
      [MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency from Hive module · 39e044e3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove duplicate test-jar:test spark-sql dependency from Hive module; move test-jar dependencies together logically. This generates a big warning at the start of the Maven build otherwise.
      
      ## How was this patch tested?
      
      Existing build. No functional changes here.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18858 from srowen/DupeSqlTestDep.
      39e044e3
    • BartekH's avatar
      Add "full_outer" name to join types · 438c3815
      BartekH authored
      I have discovered that the "full_outer" join type name works in Spark 2.0, but it is not listed in the exception message. Please verify.
      
      ## What changes were proposed in this pull request?
      
      ## How was this patch tested?
      
      
      Author: BartekH <bartekhamielec@gmail.com>
      
      Closes #17985 from BartekH/patch-1.
      438c3815
    • actuaryzhang's avatar
      [SPARK-21622][ML][SPARKR] Support offset in SparkR GLM · 55aa4da2
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Support offset in SparkR GLM #16699
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18831 from actuaryzhang/sparkROffset.
      55aa4da2