  1. Feb 06, 2015
    • [HOTFIX] Fix the maven build after adding sqlContext to spark-shell · 57961567
      Michael Armbrust authored
      Follow up to #4387 to fix the build break.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4443 from marmbrus/fixMaven and squashes the following commits:
      
      1eeba7d [Michael Armbrust] try again
      7f5fb15 [Michael Armbrust] [HOTFIX] Fix the maven build after adding sqlContext to spark-shell
    • [SPARK-5600] [core] Clean up FsHistoryProvider test, fix app sort order. · 5687bab8
      Marcelo Vanzin authored
      Clean up some test setup code to remove duplicate instantiation of the
      provider. Also make sure unfinished apps are sorted correctly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4370 from vanzin/SPARK-5600 and squashes the following commits:
      
      0d048d5 [Marcelo Vanzin] Cleanup test code a bit.
      2585119 [Marcelo Vanzin] Review feedback.
      8b97544 [Marcelo Vanzin] Merge branch 'master' into SPARK-5600
      be979e9 [Marcelo Vanzin] Merge branch 'master' into SPARK-5600
      298371c [Marcelo Vanzin] [SPARK-5600] [core] Clean up FsHistoryProvider test, fix app sort order.
    • SPARK-5613: Catch the ApplicationNotFoundException exception to avoid thread... · ca66159a
      Kashish Jain authored
      SPARK-5613: Catch the ApplicationNotFoundException exception to avoid thread from getting killed on yarn restart.
      
      [SPARK-5613] Added a catch block to catch the ApplicationNotFoundException. Without this catch block, the thread gets killed when this exception occurs. The exception occurs when YARN restarts and tries to find the application id of a Spark job that was interrupted because YARN was stopped.
      See the stacktrace in the bug for more details.
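
      The shape of the fix, sketched with hypothetical surrounding code (the exception class is Hadoop's org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException; `safeReport` and its arguments are illustrative, not the patched method):
      ```scala
      import org.apache.hadoop.yarn.api.records.ApplicationId
      import org.apache.hadoop.yarn.client.api.YarnClient
      import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException

      // Poll the application report, but survive a YARN restart instead of
      // letting the exception kill the monitoring thread.
      def safeReport(client: YarnClient, appId: ApplicationId): Unit =
        try {
          println(client.getApplicationReport(appId).getYarnApplicationState)
        } catch {
          case _: ApplicationNotFoundException =>
            System.err.println(s"Application $appId not found; YARN may have restarted.")
        }
      ```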
      
      Author: Kashish Jain <kashish.jain@guavus.com>
      
      Closes #4392 from kasjain/branch-1.2 and squashes the following commits:
      
      4831000 [Kashish Jain] SPARK-5613: Catch the ApplicationNotFoundException exception to avoid thread from getting killed on yarn restart.
    • SPARK-5633 pyspark saveAsTextFile support for compression codec · b3872e00
      Vladimir Vladimirov authored
      See https://issues.apache.org/jira/browse/SPARK-5633 for details
      
      Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
      
      Closes #4403 from smartkiwi/master and squashes the following commits:
      
      94c014e [Vladimir Vladimirov] SPARK-5633 pyspark saveAsTextFile support for compression codec
    • [HOTFIX][MLLIB] fix a compilation error with java 6 · 65181b75
      Xiangrui Meng authored
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4442 from mengxr/java6-fix and squashes the following commits:
      
      2098500 [Xiangrui Meng] fix a compilation error with java 6
    • [SPARK-4983] Insert waiting time before tagging EC2 instances · 0f3a3607
      GenTang authored
      The boto API doesn't support tagging EC2 instances in the same call that launches them.
      We add a five-second wait so EC2 has enough time to propagate the information so that
      the tagging can succeed.
      
      Author: GenTang <gen.tang86@gmail.com>
      Author: Gen TANG <gen.tang86@gmail.com>
      
      Closes #3986 from GenTang/spark-4983 and squashes the following commits:
      
      13e257d [Gen TANG] modification of comments
      47f06755 [GenTang] print the information
      ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time
      3179737 [GenTang] Revert "handling exceptions about adding tags to ec2"
      6a8b53b [GenTang] Revert "the improvement of exception handling"
      13e97a6 [GenTang] Revert "typo"
      63fd360 [GenTang] typo
      692fc2b [GenTang] the improvement of exception handling
      6adcf6d [GenTang] handling exceptions about adding tags to ec2
    • [SPARK-5586][Spark Shell][SQL] Make `sqlContext` available in spark shell · 3d3ecd77
      OopsOutOfMemory authored
      The result looks like this:
      ```
      15/02/05 13:41:22 INFO SparkILoop: Created spark context..
      Spark context available as sc.
      15/02/05 13:41:22 INFO SparkILoop: Created sql context..
      SQLContext available as sqlContext.
      
      scala> sq
      sql          sqlContext   sqlParser    sqrt
      ```
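
      For example, a quick check in the shell (a sketch using the new bindings):
      ```scala
      // Both the predefined sqlContext and the auto-imported sql() now work:
      sqlContext.sql("SELECT 1 AS one").collect()
      ```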
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4387 from OopsOutOfMemory/sqlContextInShell and squashes the following commits:
      
      c7f5203 [OopsOutOfMemory] auto-import sql() function
      e160697 [OopsOutOfMemory] Merge branch 'sqlContextInShell' of https://github.com/OopsOutOfMemory/spark into sqlContextInShell
      37c0a16 [OopsOutOfMemory] auto detect hive support
      a9c59d9 [OopsOutOfMemory] rename and reduce range of imports
      6b9e309 [OopsOutOfMemory] Merge branch 'master' into sqlContextInShell
      cae652f [OopsOutOfMemory] make sqlContext available in spark shell
    • [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of... · 4793c840
      Wenchen Fan authored
      [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields
      
      When the `GetField` chain (`a.b.c.d...`) is interrupted by `GetItem`, as in `a.b[0].c.d...`, the check of ambiguous reference to fields is broken.
      The reason is that for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But for the two outer `GetField`s, we have to resolve them in `Analyzer`, or do it lazily in `GetField`: check the data type of the child, search for the needed field, and so on, which is similar to what we have done in `LogicalPlan#resolve`.
      So in this PR the fix just copies the same logic from `LogicalPlan#resolve` into `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField`, as I explained in https://github.com/apache/spark/pull/2405.
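
      For illustration, a self-contained sketch of the nesting described above (toy stand-ins, not the actual Catalyst expression classes):
      ```scala
      // Simplified stand-ins for the Catalyst expressions discussed above.
      sealed trait Expr
      case class Unresolved(name: String) extends Expr
      case class GetItem(child: Expr, ordinal: Int) extends Expr
      case class GetField(child: Expr, field: String) extends Expr

      // a.b[0].c.d: the two outer GetFields wrap the GetItem, so resolving
      // "a.b" at the bottom is not enough; the outer fields need their own pass.
      val parsed: Expr = GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")
      ```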
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #4068 from cloud-fan/simple and squashes the following commits:
      
      a6857b5 [Wenchen Fan] fix import order
      8411c40 [Wenchen Fan] use UnresolvedGetField
    • [SQL][Minor] Remove cache keyword in SqlParser · bc363560
      wangfei authored
      Since the cache keyword is already defined in `SparkSQLParser`, and `SqlParser` in Catalyst is a more general parser that should not cover keywords related to the underlying compute engine, remove the cache keyword from `SqlParser`.
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4393 from scwf/remove-cache-keyword and squashes the following commits:
      
      10ade16 [wangfei] remove cache keyword in sql parser
    • [SQL][HiveConsole][DOC] HiveConsole `correct hiveconsole imports` · b62c3524
      OopsOutOfMemory authored
      Sorry, PR #4330 had some mistakes.
      
      I have corrected them, so the hive console imports work correctly now.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4389 from OopsOutOfMemory/doc and squashes the following commits:
      
      843eed9 [OopsOutOfMemory] correct hiveconsole imports
    • [SPARK-5595][SPARK-5603][SQL] Add a rule to do PreInsert type casting and... · 3eccf29c
      Yin Huai authored
      [SPARK-5595][SPARK-5603][SQL] Add a rule to do PreInsert type casting and field renaming and invalidating in memory cache after INSERT
      
      This PR adds a rule to the Analyzer that adds pre-insert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with this change, we always invalidate our in-memory data cache after inserting into a BaseRelation.
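
      As an illustration of the statements the rule targets (table and column names hypothetical; run where a `sqlContext` is available):
      ```scala
      // With the new rule, the SELECT output is cast and renamed to match the
      // target table's schema before the insert runs.
      sqlContext.sql("INSERT OVERWRITE TABLE target SELECT srcCol FROM source")
      ```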
      
      cc marmbrus liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4373 from yhuai/insertFollowUp and squashes the following commits:
      
      08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp
      316542e [Yin Huai] Doc update.
      c9ccfeb [Yin Huai] Revert a unnecessary change.
      84aecc4 [Yin Huai] Address comments.
      1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master'
      c18da34 [Yin Huai] Invalidate cache after insert.
      727f21a [Yin Huai] Preinsert casting and renaming.
    • [SPARK-5324][SQL] Results of describe can't be queried · 0b7eb3f3
      OopsOutOfMemory authored
      Make the code below work:
      ```
      sql("DESCRIBE test").registerTempTable("describeTest")
      sql("SELECT * FROM describeTest").collect()
      ```
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
      
      Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits:
      
      6fee13d [OopsOutOfMemory] up-to-date
      e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
      3ba1058 [OopsOutOfMemory] change to default argument
      aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
      68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
      354ad71 [OopsOutOfMemory] query describe command
      d541a35 [OopsOutOfMemory] refine test suite
      e1da481 [OopsOutOfMemory] refine test suite
      a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
      0015f82 [OopsOutOfMemory] code style
      dd0aaef [OopsOutOfMemory] code style
      c7d606d [OopsOutOfMemory] rename test suite
      75f2342 [OopsOutOfMemory] refine code and test suite
      f942c9b [OopsOutOfMemory] initial
      11559ae [OopsOutOfMemory] code style
      c5fdecf [OopsOutOfMemory] code style
      aeaea5f [OopsOutOfMemory] rename test suite
      ac2c3bb [OopsOutOfMemory] refine code and test suite
      544573e [OopsOutOfMemory] initial
    • [SPARK-5619][SQL] Support 'show roles' in HiveContext · a958d609
      q00251598 authored
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #4397 from watermen/SPARK-5619 and squashes the following commits:
      
      f819b6c [q00251598] Support show roles in HiveContext.
    • [SPARK-5640] Synchronize ScalaReflection where necessary · 500dc2b4
      Tobias Schlatter authored
      Author: Tobias Schlatter <tobias@meisch.ch>
      
      Closes #4431 from gzm0/sync-scala-refl and squashes the following commits:
      
      c5da21e [Tobias Schlatter] [SPARK-5640] Synchronize ScalaReflection where necessary
    • [SPARK-5650][SQL] Support optional 'FROM' clause · d4338161
      Liang-Chi Hsieh authored
      In Hive, the 'FROM' clause is optional. This PR adds support for that.
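
      For example (run in the shell, where `sqlContext` is predefined):
      ```scala
      // With this change, a bare projection without a FROM clause parses and runs.
      sqlContext.sql("SELECT 1 + 1").collect()
      ```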
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4426 from viirya/optional_from and squashes the following commits:
      
      fe81f31 [Liang-Chi Hsieh] Support optional 'FROM' clause.
    • [SPARK-5628] Add version option to spark-ec2 · 70e5b030
      Nicholas Chammas authored
      Every proper command line tool should include a `--version` option or something similar.
      
      This PR adds this to `spark-ec2` using the standard functionality provided by `optparse`.
      
      One thing we don't do here is follow the Python convention of setting `__version__`, since it seems awkward given how `spark-ec2` is laid out.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4414 from nchammas/spark-ec2-show-version and squashes the following commits:
      
      914cab5 [Nicholas Chammas] add version info
    • [SPARK-2945][YARN][Doc]add doc for spark.executor.instances · d34f79c8
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-2945
      
      spark.executor.instances works. As this JIRA recommended, we should add docs for this common config.
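
      For reference, a minimal sketch of the setting in use (the value 4 is an arbitrary example; as noted above, it is not compatible with dynamic allocation):
      ```scala
      import org.apache.spark.SparkConf

      // Request a fixed number of executors on YARN; master and deploy mode
      // are supplied by spark-submit.
      val conf = new SparkConf()
        .setAppName("example")
        .set("spark.executor.instances", "4")
      ```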
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #4350 from WangTaoTheTonic/SPARK-2945 and squashes the following commits:
      
      4c3913a [WangTaoTheTonic] not compatible with dynamic allocation
      5fa9c46 [WangTaoTheTonic] add doc for spark.executor.instances
    • [SPARK-4361][Doc] Add more docs for Hadoop Configuration · af2a2a26
      zsxwing authored
      I'm trying to point out that reusing a Configuration in these APIs is dangerous. Any better ideas?
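
      A sketch of the guidance (my illustration, not code from this patch): create a fresh Configuration per use instead of sharing one mutable instance across threads:
      ```scala
      import org.apache.hadoop.conf.Configuration

      // Configuration is mutable and not thread-safe, so give each caller its own.
      def freshHadoopConf(): Configuration = new Configuration()

      val confA = freshHadoopConf()
      val confB = freshHadoopConf()
      confA.set("my.key", "a") // does not leak into confB
      ```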
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:
      
      fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration
    • [HOTFIX] Fix test build break in ExecutorAllocationManagerSuite. · fb6c0cba
      Josh Rosen authored
      This was caused because #3486 added a new field to ExecutorInfo and #4369
      added new tests that created ExecutorInfos.  These patches were merged in
      quick succession and were never tested together, hence the compilation error.
    • [SPARK-5652][Mllib] Use broadcasted weights in LogisticRegressionModel · 80f3bcb5
      Liang-Chi Hsieh authored
      `LogisticRegressionModel`'s `predictPoint` should directly use broadcasted weights. This pr also fixes the compilation errors of two unit test suite: `JavaLogisticRegressionSuite ` and `JavaLinearRegressionSuite`.
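
      A minimal sketch of the pattern (helper and names hypothetical, not the MLlib code itself): broadcast the weights once and reference the broadcast value inside the closure:
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD

      // Ship the weights to each executor once via a broadcast variable
      // instead of serializing them into every task closure.
      def predictAll(sc: SparkContext, weights: Array[Double],
                     data: RDD[Array[Double]]): RDD[Double] = {
        val bcWeights = sc.broadcast(weights)
        data.map { features =>
          val margin = features.zip(bcWeights.value).map { case (x, w) => x * w }.sum
          if (margin > 0.0) 1.0 else 0.0
        }
      }
      ```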
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4429 from viirya/use_bcvalue and squashes the following commits:
      
      5a797e5 [Liang-Chi Hsieh] Use broadcasted weights. Fix compilation error.
    • [SPARK-5555] Enable UISeleniumSuite tests · 0d74bd7f
      Josh Rosen authored
      This patch enables UISeleniumSuite, a set of tests for the Spark application web UI.  These tests were previously disabled because they were slow, but I think we now have sufficient test time budget that the benefit of enabling them outweighs the time costs.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4334 from JoshRosen/enable-uiseleniumsuite and squashes the following commits:
      
      4ab9477 [Josh Rosen] Use BeforeAndAfterAll to cleanup WebDriver
      71efc72 [Josh Rosen] Update broken UISeleniumSuite tests; use random port #.
      a5ab595 [Josh Rosen] Enable UISeleniumSuite tests.
    • SPARK-2450 Adds executor log links to Web UI · 32e964c4
      Kostas Sakellis authored
      Adds links to stderr/stdout in the executor tab of the webUI for:
      1) Standalone
      2) Yarn client
      3) Yarn cluster
      
      This tries to add the log URL support in a general way, to make it easy to add support for all the
      cluster managers. This is done by using environment variables to pass the log URLs to the executor. The
      SPARK_LOG_URL_ prefix is used, so additional logs besides stderr/stdout can also be added.
      
      To propagate this information to the UI we use the onExecutorAdded spark listener event.
      
      Although this commit doesn't add log URLs when running on a Mesos cluster, it should be possible to add them using the same mechanism.
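
      As a sketch of how a consumer might pick the URLs up on the driver (`logUrlMap` is the field added to ExecutorInfo in #3486; registration shown in the comment):
      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

      // Print each newly added executor's log URLs as the event arrives.
      // Register on the driver with: sc.addSparkListener(new LogUrlListener())
      class LogUrlListener extends SparkListener {
        override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
          event.executorInfo.logUrlMap.foreach { case (logName, url) =>
            println(s"executor ${event.executorId}: $logName -> $url")
          }
        }
      }
      ```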
      
      Author: Kostas Sakellis <kostas@cloudera.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3486 from ksakellis/kostas-spark-2450 and squashes the following commits:
      
      d190936 [Josh Rosen] Fix a few minor style / formatting nits. Reset listener after each test Don't null listener out at end of main().
      8673fe1 [Kostas Sakellis] CR feedback. Hide the log column if there are no logs available
      5bf6952 [Kostas Sakellis] [SPARK-2450] [CORE] Adds exeuctor log links to Web UI
    • [SPARK-5618][Spark Core][Minor] Optimise utility code. · 4cdb26c1
      Makoto Fukuhara authored
      Author: Makoto Fukuhara <fukuo33@gmail.com>
      
      Closes #4396 from fukuo33/fix-unnecessary-regex and squashes the following commits:
      
      cd07fd6 [Makoto Fukuhara] fix unnecessary regex.
    • [SPARK-5593][Core]Replace BlockManagerListener with ExecutorListener in ExecutorAllocationListener · 6072fcc1
      lianhuiwang authored
      More precisely, in ExecutorAllocationListener we need to replace onBlockManagerAdded and onBlockManagerRemoved with onExecutorAdded and onExecutorRemoved, because the executor events express the intended meaning more accurately. For example, in SPARK-5529 the BlockManager had been removed but the executor still existed.
       andrewor14 sryza
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      
      Closes #4369 from lianhuiwang/SPARK-5593 and squashes the following commits:
      
      333367c [lianhuiwang] Replace BlockManagerListener with ExecutorListener in ExecutorAllocationListener
    • [SPARK-4877] Allow user first classes to extend classes in the parent. · 9792bec5
      Stephen Haberman authored
      Previously, the classloader isolation was almost too good, such
      that if a child class needed to load/reference a class that was
      only available in the parent, it could not do so.
      
      This adds tests for that case: the user-first Fake2 class extends
      the only-in-parent Fake3 class.
      
      It also sneaks in a fix where only the first stage seemed to work,
      and on subsequent stages, a LinkageError happened because classes
      from the user-first classpath were getting defined twice.
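
      A generic sketch of the child-first loading pattern involved (my illustration, not Spark's actual loader): try the child's URLs first, then fall back to the parent so only-in-parent classes like Fake3 still resolve:
      ```scala
      import java.net.{URL, URLClassLoader}

      class ChildFirstClassLoader(urls: Array[URL], realParent: ClassLoader)
        // Pass null as the parent so super.loadClass searches only our URLs
        // (plus bootstrap classes).
        extends URLClassLoader(urls, null) {

        override def loadClass(name: String, resolve: Boolean): Class[_] =
          try super.loadClass(name, resolve)                            // child first
          catch {
            case _: ClassNotFoundException => realParent.loadClass(name) // then parent
          }
      }
      ```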
      
      Author: Stephen Haberman <stephen@exigencecorp.com>
      
      Closes #3725 from stephenh/4877_user_first_parent_inheritance and squashes the following commits:
      
      dabcd35 [Stephen Haberman] [SPARK-4877] Respect userClassPathFirst for the driver code too.
      3d0fa7c [Stephen Haberman] [SPARK-4877] Allow user first classes to extend classes in the parent.
    • [SPARK-5396] Syntax error in spark scripts on windows. · c01b9852
      Masayoshi TSUZUKI authored
      Fixed a syntax error in spark-submit2.cmd; the Windows command prompt doesn't have a "defined" operator.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #4428 from tsudukim/feature/SPARK-5396 and squashes the following commits:
      
      ec18465 [Masayoshi TSUZUKI] [SPARK-5396] Syntax error in spark scripts on windows.
    • [SPARK-5636] Ramp up faster in dynamic allocation · fe3740c4
      Andrew Or authored
      A recent patch #4051 made the initial number default to 0. With this change, any Spark application using dynamic allocation's default settings will ramp up very slowly. Since we never request more executors than needed to saturate the pending tasks, it is safe to ramp up quickly. The current default of 60 may be too slow.
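
      For context, the settings that bound dynamic allocation's ramp-up (a sketch; values are arbitrary examples):
      ```scala
      import org.apache.spark.SparkConf

      // Dynamic allocation grows the executor count between these bounds as
      // tasks back up; this patch makes the ramp-up between them faster.
      val conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "0")
        .set("spark.dynamicAllocation.maxExecutors", "50")
      ```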
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4409 from andrewor14/dynamic-allocation-interval and squashes the following commits:
      
      d3cc485 [Andrew Or] Lower request interval
    • SPARK-4337. [YARN] Add ability to cancel pending requests · 1a88f20d
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4141 from sryza/sandy-spark-4337 and squashes the following commits:
      
      a98bd20 [Sandy Ryza] Andrew's comments
      cdaab7f [Sandy Ryza] SPARK-4337. Add ability to cancel pending requests to YARN
    • [SPARK-5653][YARN] In ApplicationMaster rename isDriver to isClusterMode · cc6e5311
      lianhuiwang authored
      In ApplicationMaster, rename isDriver to isClusterMode: Client already uses isClusterMode, so ApplicationMaster should be consistent with it. Also, isClusterMode is easier to understand.
      andrewor14 sryza
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      
      Closes #4430 from lianhuiwang/am-isDriver-rename and squashes the following commits:
      
      f9f3ed0 [lianhuiwang] rename isDriver to isClusterMode
    • [SPARK-5013] [MLlib] Added documentation and sample data file for GaussianMixture · 9ad56ad2
      Travis Galoppo authored
      Simple description and code samples (and sample data) for GaussianMixture
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      
      Closes #4401 from tgaloppo/spark-5013 and squashes the following commits:
      
      c9ff9a5 [Travis Galoppo] Fixed link in mllib-clustering.md Added Gaussian mixture and power iteration as available clustering techniques in mllib-guide
      2368690 [Travis Galoppo] Minor fixes
      3eb41fa [Travis Galoppo] [SPARK-5013] Added documentation and sample data file for GaussianMixture
    • [SPARK-5416] init Executor.threadPool before ExecutorSource · 37d35ab5
      Ryan Williams authored
      Some ExecutorSource metrics can NPE by attempting to reference the
      threadpool otherwise.
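
      A minimal illustration of the ordering constraint (names hypothetical): Scala vals initialize top to bottom, so constructing the pool first means the metric can never observe null:
      ```scala
      import java.util.concurrent.{Executors, ThreadPoolExecutor}

      class ExecutorSketch {
        // Initialized first, so the gauge below can never read null.
        val threadPool: ThreadPoolExecutor =
          Executors.newFixedThreadPool(4).asInstanceOf[ThreadPoolExecutor]

        // Declared after threadPool; if the order were flipped, calling this
        // during construction-time registration would throw an NPE.
        val activeTasks: () => Int = () => threadPool.getActiveCount
      }
      ```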
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #4212 from ryan-williams/threadpool and squashes the following commits:
      
      236f2ad [Ryan Williams] init Executor.threadPool before ExecutorSource
    • [Build] Set all Debian package permissions to 755 · cf6778e8
      Nicholas Chammas authored
      755 means the owner can read, write, and execute, and everyone else can just read and execute. I think that's what we want here since without execute permissions others cannot open directories.
      
      Inspired by [this comment on a separate PR](https://github.com/apache/spark/pull/3297#issuecomment-63286730).
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4277 from nchammas/patch-1 and squashes the following commits:
      
      da77fb0 [Nicholas Chammas] [Build] Set all Debian package permissions to 755
    • Update ec2-scripts.md · f827ef4d
      Miguel Peralvo authored
      Change spark-version from 1.1.0 to 1.2.0 in the example for spark-ec2/Launch Cluster.
      
      Author: Miguel Peralvo <miguel.peralvo@gmail.com>
      
      Closes #4300 from MiguelPeralvo/patch-1 and squashes the following commits:
      
      38adf0b [Miguel Peralvo] Update ec2-scripts.md
      1850869 [Miguel Peralvo] Update ec2-scripts.md
    • [SPARK-5470][Core]use defaultClassLoader to load classes in KryoSerializer · ed3aac79
      lianhuiwang authored
      Currently, KryoSerializer loads the classes listed in classesToRegister at the time of its initialization. When we set spark.kryo.classesToRegister=class1, it throws SparkException("Failed to load class to register with Kryo"),
      because during KryoSerializer's initialization the classLoader cannot see the classes in the user's jars.
      We need to use the Serializer's defaultClassLoader in newKryo(), because the executor resets the Serializer's defaultClassLoader after the Serializer's initialization.
      Thanks to zzcclp for reporting this to me.
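
      The triggering configuration looks like this (the class name is a placeholder for a class that exists only in the application jar); with the fix, it is resolved with the Serializer's defaultClassLoader, which can see that jar on executors:
      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.classesToRegister", "com.example.MyClass")
      ```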
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      
      Closes #4258 from lianhuiwang/SPARK-5470 and squashes the following commits:
      
      73b719f [lianhuiwang] do the splitting and filtering during initialization
      64cf306 [lianhuiwang] use defaultClassLoader to load classes of classesToRegister in KryoSerializer
    • [SPARK-5582] [history] Ignore empty log directories. · 85692897
      Marcelo Vanzin authored
      Empty log directories are not useful at the moment, but if one ends
      up showing in the log root, it breaks the code that checks for log
      directories.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4352 from vanzin/SPARK-5582 and squashes the following commits:
      
      1a6a3d4 [Marcelo Vanzin] [SPARK-5582] Fix exception when looking at empty directories.
    • [SPARK-5157][YARN] Configure more JVM options properly when we use ConcMarkSweepGC for AM. · 24dbc50b
      Kousuke Saruta authored
      When we set `SPARK_USE_CONC_INCR_GC`, ConcurrentMarkSweepGC is used on the AM.
      If ConcurrentMarkSweepGC is set for the JVM, the following JVM options are set automatically and implicitly:
      
      * MaxTenuringThreshold=0
      * SurvivorRatio=1024
      
      Those are not appropriate values for most cases.
      See also http://www.oracle.com/technetwork/java/tuning-139912.html
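
      For users who want to control these flags directly, AM JVM options can also be passed explicitly (a hedged sketch: the flag values are arbitrary examples, and this commit's actual change is to set saner options inside the YARN client):
      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.yarn.am.extraJavaOptions",
             "-XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8")
      ```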
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3956 from sarutak/SPARK-5157 and squashes the following commits:
      
      c15da4e [Kousuke Saruta] Set more JVM options for AM when enabling CMS
    • [Minor] Remove permission for execution from spark-shell.cmd · f6ba813a
      Kousuke Saruta authored
      The .cmd files in bin do not have the execute permission set, except for spark-shell.cmd.
      Let's unify that.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3983 from sarutak/fix-mode-of-cmd and squashes the following commits:
      
      9d6eedc [Kousuke Saruta] Removed permission for execution from spark-shell.cmd
    • [SPARK-5380][GraphX] Solve an ArrayIndexOutOfBoundsException when build graph... · 575d2df3
      Leolh authored
      [SPARK-5380][GraphX]  Solve an ArrayIndexOutOfBoundsException when build graph with a file format error
      
      When I build a graph from a file with a format error, an ArrayIndexOutOfBoundsException is thrown.
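
      The shape of the defensive fix, as a standalone sketch (not the exact GraphLoader code): check the token count before indexing into the split line:
      ```scala
      // Parse one edge line; report malformed lines instead of throwing an
      // ArrayIndexOutOfBoundsException on parts(1). (Non-numeric tokens would
      // still need their own handling.)
      def parseEdge(line: String): Option[(Long, Long)] = {
        val parts = line.split("\\s+")
        if (parts.length >= 2) Some((parts(0).toLong, parts(1).toLong))
        else { System.err.println(s"Invalid line: $line"); None }
      }
      ```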
      
      Author: Leolh <leosandylh@gmail.com>
      
      Closes #4176 from Leolh/patch-1 and squashes the following commits:
      
      94f6d22 [Leolh] Update GraphLoader.scala
      23767f1 [Leolh] [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong
    • [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs · dc0c4490
      Joseph K. Bradley authored
      This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
      
      **UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
      * new output columns: rawPrediction, probabilities
        * The “score” column is now called “rawPrediction”
      * Classifiers now provide numClasses
      * Params.get and .set are now protected instead of private[ml].
      * ParamMap now has a size method.
      * new classes: LinearRegression, LinearRegressionModel
      * LogisticRegression now has an intercept.
      
      ### Sketch of APIs (most of which are private[spark] for now)
      
      Abstract classes for learning algorithms (+ corresponding Model abstractions):
      * Classifier (+ ClassificationModel)
      * ProbabilisticClassifier (+ ProbabilisticClassificationModel)
      * Regressor (+ RegressionModel)
      * Predictor (+ PredictionModel)
      * *For all of these*:
       * There is no strongly typed training-time API.
       * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
      
      Concrete classes: learning algorithms
      * LinearRegression
      * LogisticRegression (updated to use new abstract classes)
       * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)
      
      Other updates:
      * params.scala: Changed Params.set/get to be protected instead of private[ml]
       * This was needed for the example of defining a class from outside of the MLlib namespace.
      * VectorUDT: Will later change from private[spark] to public.
       * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
       * Also, added equals() method.
      * SPARK-4942 : ML Transformers should allow output cols to be turned on,off
       * Update validateAndTransformSchema
       * Update transform
      * (Updated examples, test suites according to other changes)
      
      New examples:
      * DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
       * Added Java version too
      
      Test Suites:
      * LinearRegressionSuite
      * LogisticRegressionSuite
      * + Java versions of above suites
      
      CC: mengxr  etrain  shivaram
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:
      
      405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
      fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
      8316d5e [Joseph K. Bradley] fixes after rebasing on master
      fc62406 [Joseph K. Bradley] fixed test suites after last commit
      bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
      9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
      f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
      216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
      f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
      343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
      82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
      0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
      c3c8da5 [Joseph K. Bradley] small cleanup
      934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
      1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
      4e2f711 [Joseph K. Bradley] rat fix
      bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
      8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
      1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
      adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
      58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
      57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
      e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
      54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
      0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
      601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
      d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
      52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
      d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
      bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
    • [SPARK-5604][MLLIB] remove checkpointDir from trees · 6b88825a
      Xiangrui Meng authored
      This is the second part of SPARK-5604, which removes checkpointDir from tree strategies. Note that this is a breaking change. I will mention it in the migration guide.
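
      After the removal, the checkpoint directory is configured on the SparkContext instead (a sketch, run in the shell; the path is an arbitrary example):
      ```scala
      // Set once per application; tree ensembles read it from the context.
      sc.setCheckpointDir("hdfs:///tmp/checkpoints")
      ```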
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4407 from mengxr/SPARK-5604-1 and squashes the following commits:
      
      13a276d [Xiangrui Meng] remove checkpointDir from trees